Lecture-2: Introduction To Data Science


Data science
https://www.kdnuggets.com/2016/03/data-science-process.html
Data science is not machine learning
• Machine learning involves computation and statistics, but has
not (traditionally) been very concerned about answering
scientific questions.
• Machine learning has a heavy focus on fancy algorithms...
• ...but sometimes the best way to solve a problem is just
by visualizing the data, for instance.
https://www.datasciencecourse.org/slides/intro.pdf
Data Collection Methods
Here are the top six data collection methods:

• Interviews
• Questionnaires and surveys
• Observations
• Documents and records
• Focus groups
• Oral histories
https://www.jotform.com/data-collection-methods/
Data Gathering: Structured Data
Structured data is typically stored in traditional relational
databases and refers to data that has a defined length and format.
Examples of structured data:
• Sensor data: Examples include smart meters, medical
devices, and Global Positioning System (GPS) data.
• Financial data: Many financial systems are now
programmatic; they operate based on predefined rules that
automate processes.
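As a minimal sketch of what "a defined length and format" means in practice, the snippet below stores smart-meter readings in an in-memory SQLite table using Python's built-in sqlite3 module. The table name, column names, and readings are all made up for illustration.

```python
import sqlite3

# In-memory relational database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# The schema fixes the name and type of every field in advance.
conn.execute(
    """
    CREATE TABLE meter_readings (
        meter_id TEXT NOT NULL,
        read_at  TEXT NOT NULL,  -- ISO 8601 timestamp
        kwh      REAL NOT NULL
    )
    """
)

# Hypothetical readings, invented for this example.
rows = [
    ("M-001", "2024-01-01T00:00:00", 1.5),
    ("M-001", "2024-01-01T01:00:00", 0.5),
    ("M-002", "2024-01-01T00:00:00", 2.25),
]
conn.executemany("INSERT INTO meter_readings VALUES (?, ?, ?)", rows)

# Because every row has the same shape, aggregate queries are straightforward.
totals = conn.execute(
    "SELECT meter_id, SUM(kwh) FROM meter_readings "
    "GROUP BY meter_id ORDER BY meter_id"
).fetchall()
print(totals)
```

Every row must match the declared columns, which is what makes structured data easy to query and aggregate.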
https://www.ibm.com/downloads/cas/GB8ZMQZ3
Data Gathering: Unstructured Data
Although unstructured data has some implicit structure, it doesn’t
follow a specified format.
Unstructured data could be anything: media, imaging,
audio, sensor data, text data, and much more.
Examples of unstructured data:
• Social media data: This data is generated from the social
media platforms, such as YouTube, Facebook, Twitter,
LinkedIn, and Flickr.
• Mobile data: This includes text messages, notes, calendar
inputs, pictures, videos, and data entered into third-party
mobile applications.
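The "implicit structure" of unstructured data usually has to be extracted rather than read off a schema. The sketch below, using only Python's standard re module, pulls hashtags and mentions out of a free-form message. The message text and the two patterns are illustrative assumptions, not a general-purpose parser.

```python
import re

# A made-up social media post: free text with no predefined fields.
post = "Loving the new dataset from @kaggle! #datascience #wine Tasting notes coming soon."

# Simple patterns: '#' or '@' followed by one or more word characters.
hashtags = re.findall(r"#(\w+)", post)
mentions = re.findall(r"@(\w+)", post)

print(hashtags)  # ['datascience', 'wine']
print(mentions)  # ['kaggle']
```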
Data Gathering: Real Life Data Sets
• Data Sets:
– https://archive.ics.uci.edu/ml/index.php
– https://www.kaggle.com/data
Kaggle.com: Red Wine Data Set
Red Wine Data Description
The input features are as follows:
1. Fixed acidity - most acids involved with wine are fixed or
nonvolatile (do not evaporate readily);
2. Volatile acidity - the amount of acetic acid in wine, which at
too high a level can lead to an unpleasant, vinegar taste;
3. Citric acid - found in small quantities, citric acid can add
‘freshness’ and flavor to wines;
4. Residual sugar - the amount of sugar remaining after
fermentation stops; it’s rare to find wines with less than 1
gram/liter, and wines with greater than 45 grams/liter are
considered sweet;
5. Chlorides - the amount of salt in the wine;
6. Free sulfur dioxide - the free form of SO2 exists in equilibrium between
molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial
growth and the oxidation of wine;
7. Total sulfur dioxide - the amount of free and bound forms of SO2; in low
concentrations, SO2 is mostly undetectable in wine, but at free SO2
concentrations over 50 ppm, SO2 becomes evident in the nose and taste
of wine;
8. Density - the density of wine is close to that of water, depending on the
percent alcohol and sugar content;
9. pH - describes how acidic or basic a wine is on a scale from 0 (very
acidic) to 14 (very basic); most wines are between 3 and 4 on the pH scale;
10. Sulphates - a wine additive which can contribute to sulfur dioxide gas
(SO2) levels, which acts as an antimicrobial and antioxidant;
11. Alcohol - the percent alcohol content of the wine.
The output feature is:
• Quality - output variable (based on sensory data, score between 0
and 10).
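The split between the 11 input features and the output feature can be sketched with pandas as follows. The column names are assumed to match the header of the Kaggle CSV, and the three rows are invented stand-ins so the snippet runs without downloading anything; with the real file, the path to the CSV would replace the io.StringIO object.

```python
import io

import pandas as pd

# Stand-in for the downloaded CSV file; the three rows are invented for illustration.
csv_text = """fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
7.0,0.60,0.05,2.0,0.08,12.0,35.0,0.9970,3.45,0.55,9.5,5
8.0,0.40,0.30,2.5,0.07,20.0,60.0,0.9965,3.30,0.65,10.5,6
6.5,0.80,0.00,1.8,0.09,10.0,30.0,0.9975,3.50,0.50,9.0,4
"""

wine = pd.read_csv(io.StringIO(csv_text))

# Separate the 11 input features from the output variable.
X = wine.drop(columns="quality")
y = wine["quality"]

print(X.shape)     # (rows, 11)
print(y.tolist())  # quality scores
```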
Data Set: Iris
• The Iris data set contains four features (length and width of
sepals and petals) for 50 samples from each of three species
of Iris (Iris setosa, Iris virginica and Iris versicolor).
• A model is used to classify the species.
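A minimal sketch of such a classifier, using the copy of the Iris data that ships with scikit-learn. The slides do not name a particular model; the k-nearest-neighbors classifier, the 70/30 split, and the random seed below are assumptions chosen only to keep the example short.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target  # 150 samples x 4 features, 3 species

# Hold out 30% of the samples to estimate how well the model generalizes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Assumed model choice: classify each flower by its 5 nearest neighbors.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```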
http://www.lac.inpe.br/~rafael.santos/Docs/CAP394/WholeStoryIris.html
Data Pre-processing: Data Cleaning
A few common activities:
• Handling missing values
• Identifying and deleting columns that contain a single value
• Identifying and deleting columns that have very few unique values
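The three activities above can be sketched with pandas as follows. The toy table, the median fill, and the 0.4 threshold are assumptions made for illustration; real data sets need their own choices.

```python
import numpy as np
import pandas as pd

# Toy table, invented for illustration.
df = pd.DataFrame(
    {
        "age": [25.0, np.nan, 31.0, 40.0, 29.0, 35.0],
        "country": ["US", "US", "US", "US", "US", "US"],  # single value
        "flag": [0, 0, 0, 0, 0, 1],                       # very few unique values
        "income": [40.0, 52.0, 61.0, 75.0, 48.0, 66.0],
    }
)

# 1. Handle missing values: fill numeric gaps with the column median.
df["age"] = df["age"].fillna(df["age"].median())

# 2. Drop columns that contain a single value.
single_valued = [col for col in df.columns if df[col].nunique() == 1]
df = df.drop(columns=single_valued)

# 3. Drop columns whose share of unique values falls below a threshold.
threshold = 0.4  # hypothetical cut-off; tune per data set
unique_ratio = df.nunique() / len(df)
few_valued = unique_ratio[unique_ratio < threshold].index.tolist()
df = df.drop(columns=few_valued)

print(df.columns.tolist())  # ['age', 'income']
```

Low-cardinality columns are often legitimate categorical variables, so the third step deserves a manual check before anything is deleted.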
https://machinelearningmastery.com/basic-data-cleaning-for-machine-learning/
Python
IPython and Jupyter
These packages provide the computational environment in which
many Python-using data scientists work.
NumPy
This library provides the ndarray object for efficient storage and
manipulation of dense data arrays in Python.
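A short sketch of the ndarray object in use; the array values are invented for illustration.

```python
import numpy as np

# A dense 2-D array: 3 samples x 2 features.
a = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

print(a.shape)         # (3, 2)
print(a.dtype)         # float64
print(a.mean(axis=0))  # column means
print(a * 2)           # element-wise arithmetic, no explicit loop
```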
Pandas
This library provides the DataFrame object for efficient storage
and manipulation of labeled/columnar data in Python.
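A short sketch of the DataFrame object in use; the column names and measurements are illustrative assumptions.

```python
import pandas as pd

# Labeled, columnar data: each column has a name and its own dtype.
df = pd.DataFrame(
    {
        "species": ["setosa", "versicolor", "virginica"],
        "petal_length_cm": [1.4, 4.5, 6.0],  # illustrative values
    }
)

# Select rows by a condition on a named column.
long_petals = df[df["petal_length_cm"] > 4.0]
print(long_petals["species"].tolist())  # ['versicolor', 'virginica']
```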
Matplotlib
This library provides capabilities for a flexible range of data
visualizations in Python.
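A minimal plotting sketch, assuming a script run without a display (hence the Agg backend). The measurements and the output file name scatter.png are made up for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt

# Invented measurements, for illustration only.
x = [1.4, 1.5, 4.5, 4.7, 5.9, 6.1]
y = [0.2, 0.3, 1.5, 1.4, 2.1, 2.3]

fig, ax = plt.subplots()
ax.scatter(x, y)
ax.set_xlabel("petal length (cm)")
ax.set_ylabel("petal width (cm)")
ax.set_title("A simple scatter plot")
fig.savefig("scatter.png")
```

In a Jupyter notebook the backend line and the savefig call are usually unnecessary, because the figure is rendered inline.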
Scikit-Learn
This library provides efficient and clean Python implementations
of the most important and established machine learning
algorithms.
Python: Jupyter Notebook