
Data Science


Introduction to Data Science

Data Science Life Cycle


Types of Data
Data is broadly classified into four categories, illustrated in the short sketch after this list:

• Nominal data
• Ordinal data
• Discrete data
• Continuous data
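For illustration, here is a minimal pandas sketch (the columns and values are made up) showing how the four categories typically appear in a data frame:

import pandas as pd

# Hypothetical example data illustrating the four categories
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune"],            # nominal: labels with no order
    "satisfaction": ["low", "high", "medium"],    # ordinal: labels with an order
    "num_visits": [3, 7, 2],                      # discrete: countable integers
    "height_cm": [171.5, 160.2, 182.0],           # continuous: measured values
})

# An ordered categorical dtype makes the ordinal nature explicit
df["satisfaction"] = pd.Categorical(
    df["satisfaction"], categories=["low", "medium", "high"], ordered=True)
print(df.dtypes)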
Data Collection
1. The process of gathering and analyzing accurate data from various sources to find answers to research
problems, identify trends and probabilities, and evaluate possible outcomes is known as data collection.

2. Data collection is the process of collecting and evaluating information or data from multiple sources to find
answers to research problems, answer questions, evaluate outcomes, and forecast trends and probabilities.

3. Accurate data collection is necessary to make informed business decisions, ensure quality assurance, and
maintain research integrity.

4. During data collection, the researchers must identify the data types, the sources of data, and what methods
are being used.
Before an analyst begins collecting data, they must first answer three questions:

What’s the goal or purpose of this research?

What kinds of data are they planning on gathering?

What methods and procedures will be used to collect, store, and process the information?

Additionally, we can break up data into qualitative and quantitative types. Qualitative data covers descriptions such as
color, size, quality, and appearance. Quantitative data, unsurprisingly, deals with numbers, such as statistics, poll numbers,
percentages, etc.
Data Collection Methods
1. Primary Data Collection:

Primary data collection involves the collection of original data directly from the source or
through direct interaction with the respondents. This method allows researchers to
obtain firsthand information specifically tailored to their research objectives. There are
various techniques for primary data collection, including:

a. Surveys and Questionnaires: Researchers design structured questionnaires or surveys to collect data from
individuals or groups. These can be conducted through face-to-face interviews, telephone calls, mail, or online
platforms.

b. Interviews: Interviews involve direct interaction between the researcher and the
respondent. They can be conducted in person, over the phone, or through video
conferencing. Interviews can be structured (with predefined questions), semi-structured
(allowing flexibility), or unstructured (more conversational).
c. Observations: Researchers observe and record behaviors, actions, or events in their natural
setting. This method is useful for gathering data on human behavior, interactions, or
phenomena without direct intervention.

d. Experiments: Experimental studies involve the manipulation of variables to observe their impact on the
outcome. Researchers control the conditions and collect data to draw conclusions about cause-and-effect
relationships.

e. Focus Groups: Focus groups bring together a small group of individuals who discuss
specific topics in a moderated setting. This method helps in understanding opinions,
perceptions, and experiences shared by the participants.
2. Secondary Data Collection:

Secondary data collection involves using existing data collected by someone else for a purpose different from the
original intent. Researchers analyze and interpret this data to extract relevant information. Secondary data can be
obtained from various sources, including:

a. Published Sources: Researchers refer to books, academic journals, magazines, newspapers, government reports,
and other published materials that contain relevant data.

b. Online Databases: Numerous online databases provide access to a wide range of secondary data, such as research
articles, statistical information, economic data, and social surveys.
c. Government and Institutional Records: Government agencies, research institutions, and organizations often
maintain databases or records that can be used for research purposes.

d. Publicly Available Data: Data shared by individuals, organizations, or communities on public platforms, websites, or
social media can be accessed and utilized for research.

e. Past Research Studies: Previous research studies and their findings can serve as valuable secondary data sources.
Researchers can review and analyze the data to gain insights or build upon existing knowledge.
What are Common Challenges in Data Collection?
There are some prevalent challenges faced while collecting data. Let us explore a few of them to understand
them better and learn how to avoid them.

Data Quality Issues


The main threat to the broad and successful application of machine learning is poor data quality. Data
quality must be your top priority if you want to make technologies like machine learning work for you. The
following are some of the most prevalent data quality problems and how to address them.

Inconsistent Data
When working with various data sources, it is likely that the same information will show discrepancies
between sources. The differences could be in formats, units, or occasionally spellings. Inconsistent data can
also be introduced during company mergers or system migrations. If inconsistencies are not continually
resolved, they tend to accumulate and reduce the value of the data. Organizations that focus heavily on data
consistency do so because they only want reliable data to support their analytics.
Data Downtime
Data is the driving force behind the decisions and operations of data-driven businesses. However, there may be brief
periods when their data is unreliable or not prepared. Customer complaints and subpar analytical outcomes are only
two ways that this data unavailability can have a significant impact on businesses. A data engineer spends about 80%
of their time updating, maintaining, and guaranteeing the integrity of the data pipeline.

Schema modifications and migration problems are just two examples of the causes of data downtime.

Ambiguous Data
Even with thorough oversight, some errors can still occur in massive databases or data lakes. For data streaming at
high speed, the issue becomes even more pressing. Spelling mistakes can go unnoticed, formatting difficulties can
occur, and column headings might be misleading. This unclear data can cause a number of problems for reporting
and analytics.
Duplicate Data
Streaming data, local databases, and cloud data lakes are just a few of the sources of data that modern enterprises must
contend with. They might also have application and system silos. These sources are likely to duplicate and overlap each
other quite a bit. For instance, duplicate contact information has a substantial impact on customer experience. If certain
prospects are ignored while others are engaged repeatedly, marketing campaigns suffer.

Inaccurate Data
For highly regulated industries like healthcare, data accuracy is crucial. The experience of COVID-19 has made it
more important than ever to improve data quality for current and future pandemics. Inaccurate information does not
give you a true picture of the situation and cannot be used to plan the best course of action. Personalized
customer experiences and marketing strategies underperform if your customer data is inaccurate.

Hidden Data
The majority of businesses use only a portion of their data, with the remainder sometimes being lost in data silos or
discarded in data graveyards. For instance, the customer service team might not receive client data from sales, missing
an opportunity to build more precise and comprehensive customer profiles. Hidden data causes organizations to miss
out on opportunities to develop novel products, enhance services, and streamline processes.
What are the Key Steps in the Data Collection Process?

Decide What Data You Want to Gather


The first thing that we need to do is decide what information we want to gather. We must choose the subjects the
data will cover, the sources we will use to gather it, and the quantity of information that we would require. For
instance, we may choose to gather information on the categories of products that an average e-commerce website
visitor between the ages of 30 and 45 most frequently searches for.

Establish a Deadline for Data Collection


The process of creating a strategy for data collection can now begin. We should set a deadline for our data collection at
the outset of our planning phase. Some forms of data we might want to collect continuously; for instance, we might want
to set up a technique for tracking transactional data and website visitor statistics over the long term. However, if we are
tracking data for a particular campaign, we will track it over a defined time frame. In these situations, we will have a
schedule for when we will begin and finish gathering data.
Select a Data Collection Approach
We will select the data collection technique that will serve as the foundation of our data gathering plan at this stage. We
must take into account the type of information we wish to gather, the time frame over which we will collect it, and any
other factors we have decided on in order to choose the best gathering strategy.

Gather Information
Once our plan is complete, we can put our data collection plan into action and begin gathering data.

Examine the Information and Apply Your Findings


It's time to examine our data and organize our findings after we have gathered all of our information. The analysis stage is
essential because it transforms unprocessed data into insightful knowledge that can be applied to improve our marketing
plans, products, and business decisions.
Data Preprocessing in Python
A typical preprocessing workflow covers the following steps:

1. Drop columns that aren't useful.
2. Handle missing values.
3. Handle duplicate values.
4. Handle categorical features.
5. Convert the data frame to NumPy.
6. Split the dataset into train and test data.
7. Feature scaling.

Step 1: Import the necessary libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
Step 2: Load the dataset

df = pd.read_csv('diabetes.csv')
print(df.head())        # preview the first five rows

df.isnull()             # True wherever a value is missing

df.describe()           # summary statistics for the numeric columns
Handling categorical features
We can take care of categorical features by converting them to integers. There are 2 common ways to do so.

Label Encoding
One Hot Encoding
With Label Encoding, we convert the categorical values into numerical labels.

With One Hot Encoding, we create a new column for each unique categorical value; the value in that column is 1 if
that category appears in the row, else it is 0.

We can use pandas' built-in function get_dummies to convert categorical values in a dataframe into one-hot vectors.
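As a minimal sketch (using a made-up 'Gender' column), get_dummies turns one categorical column into one 0/1 column per category:

import pandas as pd

# Hypothetical data frame with a single categorical column
df = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})

# One new column per unique category, holding 1 where that category occurs
dummies = pd.get_dummies(df, columns=["Gender"])
print(dummies)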
Convert the data frame to NumPy
Now that we’ve converted all the data to integers, it’s time to prepare the data for machine learning models.

X= Input / independent variables /Predictor(s)

y = Output/Dependent variable / outcome variable

Now we convert our dataframe from pandas to NumPy as follows ('outcome_column' is a placeholder for the dataset's dependent column):

X = df.drop('outcome_column', axis=1).values   # inputs: every column except the outcome
y = df['outcome_column'].values                # output: the dependent/outcome column
Split the dataset into train and test data

Now that we're ready with X and y, let's split the data set: we'll allocate 70 percent for training and 30 percent for testing
using scikit-learn's model_selection module.
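A minimal sketch of that 70/30 split (the arrays here are stand-ins for the X and y prepared above):

import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays; in these notes X and y come from the diabetes dataframe
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# 70 percent of the rows go to training, 30 percent to testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(X_train.shape, X_test.shape)   # (7, 2) (3, 2)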

Feature Scaling
This is the final step of data preprocessing. Feature scaling puts all our data in the same range and on the same scale.
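For example, here is a minimal sketch of scaling with scikit-learn's StandardScaler (the numbers are made up); MinMaxScaler can be used the same way:

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_test = np.array([[1.5, 250.0]])

scaler = StandardScaler()
# Fit on the training data only, then apply the same transformation to the test data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_train_scaled)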
To work on the data, you can either load the CSV in Excel or in Pandas.

df = pd.read_csv('train.csv')

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
There are 891 total rows, but Age shows only 714 non-null values (which means we are missing some data), Embarked is
missing two rows, and Cabin is missing a lot as well. Object data types are non-numeric, so we have to find a way to
encode them to numerical values.
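For example, a small sketch (shown on a copy so the frame used below is untouched) of turning the Sex column into integer codes:

# Map the Sex object column to integer codes on a copy of the frame
encoded = df.copy()
encoded['Sex'] = encoded['Sex'].map({'male': 0, 'female': 1})
print(encoded['Sex'].head())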

Next, we drop some of the columns that won't contribute much to our machine learning model. We'll start with Name,
Ticket, and Cabin.

cols = ['Name', 'Ticket', 'Cabin']

# Drop the selected columns along the column axis
df = df.drop(cols, axis=1)

>>> df.info()

PassengerId 891 non-null int64


Survived 891 non-null int64
Pclass 891 non-null int64
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Fare 891 non-null float64
Embarked 889 non-null object
Next, we drop all rows in the data that have missing values (NaNs).

>>> df = df.dropna()

>>> df.info()
Int64Index: 712 entries, 0 to 890
Data columns (total 9 columns):
PassengerId 712 non-null int64
Survived 712 non-null int64
Pclass 712 non-null int64
Sex 712 non-null object
Age 712 non-null float64
SibSp 712 non-null int64
Parch 712 non-null int64
Fare 712 non-null float64
Embarked 712 non-null object
The Problem with Dropping Rows

After dropping rows with missing values, we find the data set is reduced to 712 rows from 891, which means we are
wasting data. Machine learning models need data to train and perform well. So, let’s preserve the data and make
use of it as much as we can.
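One way to preserve those rows, sketched here as an alternative to dropna() (not part of the original flow above), is to fill the missing values instead of dropping them:

# Re-load the data and impute instead of dropping rows
df = pd.read_csv('train.csv')
df = df.drop(['Name', 'Ticket', 'Cabin'], axis=1)
df['Age'] = df['Age'].fillna(df['Age'].median())                  # numeric column: median
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])  # categorical column: mode
print(df.info())   # all 891 rows are retained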

Creating Dummy Variables


import pandas as pd
df = pd.read_csv(r'C:\Users\Adypu\Documents\Downloads\sample.csv')
print(df.info())     # column types and non-null counts

print(df.head())     # preview the first five rows

df.isnull()          # True wherever a value is missing

df.describe()        # summary statistics for the numeric columns
df                   # display the frame

from sklearn import preprocessing

import numpy as np
from sklearn.impute import SimpleImputer

# Replace missing values in the numeric columns with the column mean
imputer = SimpleImputer(strategy='mean', missing_values=np.nan)
imputer = imputer.fit(df[['Age', 'Salary']])
df[['Age', 'Salary']] = imputer.transform(df[['Age', 'Salary']])
df
https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html
# Feature matrix: every column except the last
x = df.iloc[:, :-1].values
x

# Target vector: the last column, assumed to be the dependent variable
y = df.iloc[:, -1].values
y
from sklearn.preprocessing import LabelEncoder
LE = LabelEncoder()

# Encode the categorical values in column 0 of x as integer labels
x[:, 0] = LE.fit_transform(x[:, 0])
x

# Encode the target labels as integers
y = LE.fit_transform(y)

y
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# One-hot encode column 0 and pass the remaining columns through unchanged
transform = ColumnTransformer([("hello", OneHotEncoder(), [0])], remainder="passthrough")
x = transform.fit_transform(x)

x
from sklearn.model_selection import train_test_split

# 80 percent of the rows for training, 20 percent for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
y_test

x_test

x_train

y_train
from sklearn.preprocessing import StandardScaler
SC = StandardScaler()

# Scale the numeric columns (positions 3 and 4 after one-hot encoding)
x_train[:, 3:5] = SC.fit_transform(x_train[:, 3:5])
x_test[:, 3:5] = SC.transform(x_test[:, 3:5])   # reuse the scaler fitted on the training data
x_train
Data Cleaning
Python Implementation for Data Cleaning

import pandas as pd
import numpy as np

# Load the dataset


df = pd.read_csv('sample.csv')
df.head()
Handling Missing Data

1. isnull()
2. notnull()
3. dropna()
4. fillna()
5. replace()

1. isnull() returns True for every NaN value and False for every non-null value (see the sketch after this list).

2. notnull() returns True for every non-null value and False for every null value.

3. dropna() drops rows with at least one null value: the data frame is read and all rows containing any null
values are dropped.
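Here is a minimal sketch of the first three methods on a tiny made-up frame:

import pandas as pd
import numpy as np

df = pd.DataFrame({"Name": ["Amir", "R.J.", None],
                   "Salary": [50000, np.nan, 42000]})

print(df.isnull())    # True where a value is missing
print(df.notnull())   # True where a value is present
print(df.dropna())    # keeps only the rows with no missing values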
Handling Missing Values:

Replacing NaN values with a static value using fillna():


# importing pandas module
import pandas as pd

# making data frame from csv file


df = pd.read_csv(“hello.csv")

# replacing NaN values in the Salary column with "No salary"

df["Salary"].fillna("No salary", inplace = True)

df
Replacing a single value with replace()

df.replace(to_replace="Boston", value="Warrior")

Replacing Two Values with a Single Value

df.replace(to_replace=["Boston", "Texas"],value="Warrior")

Replacing NaN With an Integer Value

df.replace(to_replace = np.nan, value =99)

Replacing With Multiple Values

df1 = df.replace(['Boston', 'Amir', 'R.J.'], ['Warriors', 'Johnson', 'Thomas'])


Data Discretization
Data discretization is the process of converting continuous data attribute values into a finite set of intervals
with minimal loss of information, and associating each interval with a specific data value or conceptual label.
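A minimal sketch using pandas' cut (the values and labels are made up) to discretize a continuous attribute into labeled intervals:

import pandas as pd

# A continuous attribute (ages) discretized into three labeled intervals
ages = pd.Series([12, 25, 37, 48, 63, 71])
age_group = pd.cut(ages, bins=[0, 30, 60, 100],
                   labels=["young", "middle-aged", "senior"])
print(age_group)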
Why is it needed?
1. Improves the quality of discovered knowledge.
2. Makes the data easier to maintain.
3. Many data mining (DM) algorithms can only deal with discrete attributes, so discretized data is required.
4. Reduces the running time of various data mining tasks such as association rule discovery,
classification, and prediction.
5. Prepares data for further analysis, e.g., classification.
6. Discretization is considered a data reduction mechanism because it reduces data from a large
domain of numeric values to a small set of categorical values.
Exploratory Data Analysis
Exploratory Data Analysis (EDA) refers to the method of studying and exploring data sets to discover
patterns, locate outliers, and identify relationships between variables. EDA is normally carried out as a
preliminary step before undertaking more formal statistical analyses or modeling.
1. Data Cleaning: EDA involves examining the data for errors, missing values, and inconsistencies. It
includes techniques such as data imputation, handling missing data, and identifying and removing outliers.
2. Descriptive Statistics: EDA uses descriptive statistics to understand the central tendency, variability, and
distribution of variables. Measures like mean, median, mode, standard deviation, range, and percentiles
are commonly used.
3. Data Visualization: EDA employs visual techniques to represent the data graphically.
Visualizations such as histograms, box plots, scatter plots, line plots, heatmaps, and bar charts
help in identifying patterns, trends, and relationships within the data.
4. Feature Engineering: EDA allows for the exploration of variables and their transformations to
create new features or derive meaningful insights. Feature engineering can involve scaling,
normalization, binning, encoding categorical variables, and creating interaction or derived variables.
5. Missing Data Analysis: Missing data is a common issue in datasets, and it may impact the
reliability and validity of the analysis. Missing data analysis includes identifying missing values,
understanding the patterns of missingness, and using suitable techniques to handle missing data.
Techniques such as missing data patterns, imputation strategies, and sensitivity analysis are employed.
6. Outlier Analysis: Outliers are data points that drastically deviate from the general pattern of the
data. Outlier analysis includes identifying and understanding the presence of outliers, their potential causes,
and their impact on the analysis. Techniques such as box plots, scatter plots, z-scores, and clustering
algorithms are used for outlier analysis. A brief sketch combining several of these EDA steps follows.
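Here is that sketch, assuming the diabetes.csv file used earlier is available:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('diabetes.csv')

print(df.isnull().sum())   # data cleaning: count of missing values per column
print(df.describe())       # descriptive statistics for the numeric columns

# Data visualization: histograms for every numeric column, and box plots to spot outliers
df.hist(figsize=(10, 8))
df.boxplot(figsize=(10, 4))
plt.show()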