Lecture-2: Introduction To Data Science
https://www.kdnuggets.com/2016/03/data-science-process.html
Data science is not machine learning
https://www.datasciencecourse.org/slides/intro.pdf
Data Collection Methods
https://www.jotform.com/data-collection-methods/
Data Gathering: Structured Data
Structured data is typically stored in traditional relational
databases and refers to data that has a defined length and format.
Examples of structured data:
• Sensor data: Examples include smart meters, medical
devices, and Global Positioning System (GPS) data.
• Financial data: Many financial systems are now
programmatic; they operate based on predefined rules that
automate processes.
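As a minimal sketch of what "a defined length and format" means in practice, the snippet below stores a few smart-meter readings in an in-memory relational table using Python's built-in sqlite3 module. The table name, column names, and sample readings are hypothetical and chosen only for illustration; they are not taken from the lecture.

```python
import sqlite3


def build_meter_db():
    """Create an in-memory database holding hypothetical smart-meter readings."""
    conn = sqlite3.connect(":memory:")
    # Every row must match this fixed schema: that is what makes the data structured.
    conn.execute(
        """
        CREATE TABLE meter_reading (
            meter_id TEXT NOT NULL,
            read_at  TEXT NOT NULL,  -- ISO 8601 timestamp
            kwh      REAL NOT NULL
        )
        """
    )
    # Illustrative values only, not real measurements.
    rows = [
        ("M-001", "2024-01-01T00:00:00", 1.2),
        ("M-001", "2024-01-01T01:00:00", 0.9),
        ("M-002", "2024-01-01T00:00:00", 2.5),
    ]
    conn.executemany("INSERT INTO meter_reading VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def total_kwh_per_meter(conn):
    """Return (meter_id, total kWh) pairs, sorted by meter id."""
    cur = conn.execute(
        "SELECT meter_id, SUM(kwh) FROM meter_reading "
        "GROUP BY meter_id ORDER BY meter_id"
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = build_meter_db()
    for meter_id, total in total_kwh_per_meter(conn):
        print(meter_id, round(total, 2))
```

Because the schema is fixed, a query such as the per-meter total can be written without inspecting individual records first.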
https://www.ibm.com/downloads/cas/GB8ZMQZ3
Data Gathering: Unstructured Data
Although unstructured data has some implicit structure, it doesn’t
follow a specified format.
Unstructured data could be anything: media, imaging,
audio, sensor data, text data, and much more.
Examples of unstructured data
• Social media data: This data is generated from the social
media platforms, such as YouTube, Facebook, Twitter,
LinkedIn, and Flickr.
• Mobile data: This includes text messages, notes, calendar
inputs, pictures, videos, and data entered into third-party
mobile applications.
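The "implicit structure" of unstructured data can be illustrated with a short sketch: a free-text social media post has no fixed fields, yet hashtags and mentions can still be pulled out with pattern matching. The function name and the sample post below are hypothetical, and the regular expressions are a simplification of how real platforms define tags.

```python
import re


def extract_tags(post):
    """Pull hashtags and @-mentions out of a free-text post."""
    # \w+ matches letters, digits, and underscores; real platforms use richer rules.
    hashtags = re.findall(r"#(\w+)", post)
    mentions = re.findall(r"@(\w+)", post)
    return {"hashtags": hashtags, "mentions": mentions}


if __name__ == "__main__":
    # Made-up post used only for illustration.
    sample = "Loving the new #DataScience course with @alice! #python"
    print(extract_tags(sample))
```

Unlike a database row, nothing guarantees that a given post contains any tags at all, so downstream code has to cope with empty results.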
Data Gathering: Real Life Data Sets
• Data Sets:
– https://archive.ics.uci.edu/ml/index.php
– https://www.kaggle.com/data
Kaggle.com: Red Wine Data Set
Red Wine Data Description
The input features are as follows:
1. Fixed acidity - most acids involved with wine are fixed or
nonvolatile (do not evaporate readily);
2. Volatile acidity - the amount of acetic acid in wine, which at
too high a level can lead to an unpleasant, vinegar taste;
3. Citric acid - found in small quantities, citric acid can add
‘freshness’ and flavor to wines;
4. Residual sugar - the amount of sugar remaining after
fermentation stops; it’s rare to find wines with less than 1
gram/liter, and wines with greater than 45 grams/liter are
considered sweet;
5. Chlorides - the amount of salt in the wine;
6. Free sulfur dioxide - the free form of SO2 exists in equilibrium between
molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial
growth and the oxidation of wine;
7. Total sulfur dioxide - amount of free and bound forms of SO2; in low
concentrations, SO2 is mostly undetectable in wine, but at free SO2
concentrations over 50 ppm, SO2 becomes evident in the nose and taste
of wine;
8. Density - the density of wine is close to that of water, depending on the
percent alcohol and sugar content;
9. pH - describes how acidic or basic a wine is on a scale from 0 (very
acidic) to 14 (very basic); most wines are between 3-4 on the pH scale;
10. Sulphates - a wine additive which can contribute to sulfur dioxide gas
(SO2) levels, which acts as an antimicrobial and antioxidant;
11. Alcohol - the percent alcohol content of the wine
The output feature is:
• Quality - output variable (based on sensory data, score between 0
and 10).
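The split between the eleven input features and the quality output can be sketched with the standard csv module. This is a minimal sketch under stated assumptions: the column names are assumed to match the feature names listed above, the file is assumed to be comma-separated, and the three rows are illustrative values embedded in the code, not records quoted from the actual data set.

```python
import csv
import io

# Assumed header and illustrative rows; a real file would be opened from disk instead.
SAMPLE_CSV = """fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
7.4,0.70,0.00,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5
7.8,0.88,0.00,2.6,0.098,25,67,0.9968,3.20,0.68,9.8,5
11.2,0.28,0.56,1.9,0.075,17,60,0.9980,3.16,0.58,9.8,6
"""


def load_wine(text):
    """Parse CSV text into (features, quality) lists."""
    reader = csv.DictReader(io.StringIO(text))
    features, quality = [], []
    for row in reader:
        # The output variable is an integer score; remove it from the row first.
        quality.append(int(row.pop("quality")))
        # Everything left is a numeric input feature.
        features.append({name: float(value) for name, value in row.items()})
    return features, quality


if __name__ == "__main__":
    X, y = load_wine(SAMPLE_CSV)
    print(len(X), "rows,", len(X[0]), "features per row")
    print("quality scores:", y)
```

Keeping the output variable separate from the inputs at load time makes it harder to feed the quality score into a model as a feature by accident.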
Data Set: Iris
http://www.lac.inpe.br/~rafael.santos/Docs/CAP394/WholeStoryIris.html
Data Pre-processing: Data Cleaning
A few common activities:
• Handling missing values
• Identifying and deleting columns that contain a single value
• Identifying and deleting columns that have very few distinct values
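These cleaning activities can be sketched in plain Python on a tiny, made-up table. The helper names, the choice of mean imputation for missing values, and the 0.6 threshold are all assumptions made for illustration; in practice the same steps are usually done with a data-frame library, and the right threshold depends on the data set.

```python
from statistics import mean


def fill_missing_with_mean(rows, column):
    """Replace None in one numeric column with the mean of the observed values."""
    present = [r[column] for r in rows if r[column] is not None]
    fill = mean(present)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


def drop_low_variety_columns(rows, min_unique_ratio=0.0):
    """Drop single-value columns and columns with too few distinct values.

    A column is kept only if it has at least two distinct values and its
    ratio of distinct values to rows is at least min_unique_ratio.
    """
    n_rows = len(rows)
    keep = []
    for col in rows[0]:
        n_unique = len({r[col] for r in rows})
        if n_unique >= 2 and n_unique / n_rows >= min_unique_ratio:
            keep.append(col)
    return [{c: r[c] for c in keep} for r in rows]


def demo_clean():
    # Hypothetical rows: "country" has one value, "flag" has very few values.
    rows = [
        {"id": 1, "ph": 3.5, "country": "PT", "flag": 0},
        {"id": 2, "ph": None, "country": "PT", "flag": 0},
        {"id": 3, "ph": 3.1, "country": "PT", "flag": 0},
        {"id": 4, "ph": 3.3, "country": "PT", "flag": 1},
    ]
    rows = fill_missing_with_mean(rows, "ph")
    return drop_low_variety_columns(rows, min_unique_ratio=0.6)


if __name__ == "__main__":
    for row in demo_clean():
        print(row)
```

In this sketch the missing pH is imputed first so that None is not counted as a distinct value when the columns are screened.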
https://machinelearningmastery.com/basic-data-cleaning-for-machine-learning/
Python
IPython and Jupyter
Matplotlib
This library provides capabilities for a flexible range of data
visualizations in Python.
Scikit-Learn
This library provides efficient and clean Python implementations
of the most important and established machine learning
algorithms.
Python: Jupyter Notebook