DATA SCIENCE 6th Sem
Data science is a field that involves using statistical and computational techniques to extract
insights and knowledge from data. It encompasses a wide range of tasks, including data
cleaning and preparation, data visualization, statistical modeling, machine learning, and more.
Data scientists use these techniques to discover patterns and trends in data, make
predictions, and support decision-making. They may work with a variety of data types,
including structured data (such as numbers and dates in a spreadsheet) and unstructured data
(such as text, images, or audio). Data science is used in a wide range of industries, including
finance, healthcare, retail, and more.
2. Medicine
The medical industry uses big data and analytics extensively to improve health in a variety
of ways. For instance, wearable trackers provide important information to physicians, who
can use the data to give better care to their patients. Wearable trackers also show whether
the patient is taking his/her medication and following the right treatment plan.
3. Banking
The banking industry is generally not seen as one that uses technology heavily.
However, this is slowly changing, as bankers increasingly use technology to drive their
decision-making.
4. Construction
It is no surprise that construction companies are beginning to embrace data science and
analytics in a big way. Construction companies track everything from the average time needed
to complete tasks to materials-based expenses and everything in between. Big data is now
being used in a big way in the construction industry to drive better decision-making.
5. Transportation
There is always a need for people to reach their destinations on time and data science and
analytics can be used by transportation providers, both public and private, to increase the
chances of successful journeys. For instance, Transport for London uses statistical data to
map customer journeys, manage unexpected circumstances, and provide people with
personalized transport details.
The five main components (tokens) of the Python language are:
1. The character set
2. Data types
3. Constants
4. Variables
5. Keywords
The character set includes:
1. Letters: A – Z or a – z
2. Digits: 0, 1, ..., 9
4. White spaces: blank space, horizontal tab, carriage return, new line, and form feed.
• The power of a programming language depends, among other things, on the range of
different types of data it can handle.
type() function: this function returns the data type of any object or variable.
Example:
y = 15
print(type(y))   # <class 'int'>
Constants: Constants are fixed values that remain unchanged during the execution of a program and are used in assignment statements.
Variables: Variables are data items whose values may vary during the execution of the program.
Note: Python has no command for declaring a variable.
A variable can have a short name (like x and y) or a more descriptive name (age, surname, total_volume).
Rules for Python variables:
• A variable name must start with a letter or the underscore character
• A variable name cannot start with a number
• A variable name can only contain alpha-numeric characters and underscores (A-z, 0-9, and _)
• Variable names are case-sensitive (ram, Ram, and RAM are three different variables)
Keywords
Keywords are words that have been assigned specific meanings in the context of Python language programs. Keywords should not be used as variable names, to avoid problems. There are 35 keywords in the Python programming language, for example: and, continue, for, lambda, try, async, elif, if, or, yield.
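The full keyword list can be inspected from Python's standard keyword module, for example:
# Print all reserved keywords of the running Python interpreter
import keyword
print(len(keyword.kwlist))   # 35 in recent Python 3 versions (the exact count can vary by version)
print(keyword.kwlist)        # ['False', 'None', 'True', 'and', 'as', ...]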
Data Analysis Process consists of the following phases that are iterative in nature −
Data Collection
Data Collection is the process of gathering information on targeted variables identified as data
requirements. The emphasis is on ensuring accurate and honest collection of data. Data
Collection ensures that data gathered is accurate such that the related decisions are valid.
Data Collection provides both a baseline to measure and a target to improve. Data is collected
from various sources, ranging from organizational databases to information in web pages.
The data thus obtained may not be structured and may contain irrelevant information. Hence,
the collected data must be subjected to Data Processing and Data Cleaning.
Data Processing
The data that is collected must be processed or organized for analysis. This includes structuring the data
as required for the relevant Analysis Tools. For example, the data might have to be placed into rows and
columns in a table within a Spreadsheet or Statistical Application. A Data Model might have to be created.
Data Cleaning
The processed and organized data may be incomplete, contain duplicates, or contain errors. Data Cleaning is the process of preventing and correcting these errors. There are several types of Data Cleaning that depend on the type of data. For example, while cleaning financial data, certain totals might be compared against reliable published numbers or defined thresholds. Likewise, quantitative methods can be used for outlier detection; the outliers would subsequently be excluded from the analysis.
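As a small illustration (a sketch using the pandas library; the column names and values are hypothetical), duplicate rows and missing values can be handled like this:
import pandas as pd

# Hypothetical data with a duplicate row and a missing value
df = pd.DataFrame({"amount": [100, 100, None, 250], "region": ["N", "N", "S", "E"]})
df = df.drop_duplicates()                                  # remove duplicate records
df["amount"] = df["amount"].fillna(df["amount"].mean())    # impute missing values with the mean
print(df)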
Data Analysis
Data that is processed, organized, and cleaned is ready for analysis. Various data analysis techniques are available to understand, interpret, and derive conclusions based on the requirements. Data Visualization may also be used to examine the data in graphical format, to obtain additional insight regarding the messages within the data. Statistical data models such as Correlation and Regression Analysis can be used to identify the relations among the data variables. These models, which are descriptive of the data, are helpful in simplifying the analysis and communicating the results. The process might require additional Data Cleaning or additional Data Collection, and hence these activities are iterative in nature.
The data analysts can choose data visualization techniques, such as tables and charts, which
help in communicating the message clearly and efficiently to the users. The analysis tools
provide facility to highlight the required information with color codes and formatting in tables
and charts.
Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) refers to the method of studying and exploring data sets to understand their main characteristics, discover patterns, locate outliers, and identify relationships between variables. EDA is normally carried out as a preliminary step before undertaking more formal statistical analyses or modeling.
1. Data Cleaning: EDA involves examining the data for errors, missing values, and inconsistencies. It includes techniques such as data imputation, handling missing data, and identifying and removing outliers.
2. Descriptive Statistics: EDA uses descriptive statistics to understand the central tendency, variability, and distribution of variables. Measures like mean, median, mode, standard deviation, range, and percentiles are commonly used.
3. Data Visualization: EDA employs visual techniques to represent the data graphically. Visualizations such as histograms, box plots, scatter plots, line plots, heatmaps, and bar charts assist in identifying patterns, trends, and relationships within the data.
4. Feature Engineering: EDA allows for the exploration of various variables and their transformations to create new features or derive meaningful insights. Feature engineering can involve scaling, normalization, binning, encoding categorical variables, and creating interaction or derived variables.
5. Correlation and Relationships: EDA helps discover relationships and dependencies between variables. Techniques such as correlation analysis, scatter plots, and cross-tabulations offer insights into the strength and direction of relationships between variables.
6. Data Segmentation: EDA can involve dividing the data into meaningful segments based on certain criteria or characteristics. This segmentation helps gain insights into specific subgroups within the data and can lead to more focused analysis.
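A minimal EDA sketch using pandas (the file name and its columns are assumptions made only for illustration):
import pandas as pd

df = pd.read_csv("data.csv")                      # hypothetical dataset
print(df.info())                                  # data types and missing values
print(df.describe())                              # central tendency, spread, percentiles
print(df.select_dtypes(include="number").corr())  # pairwise correlations between numeric variables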
Types of EDA
Depending on the number of columns we are analyzing we can divide EDA into two types.
EDA, or Exploratory Data Analysis, refers to the process of examining and analyzing data sets to uncover patterns, identify relationships, and gain insights. There are various types of EDA techniques that can be employed depending on the nature of the data and the goals of the analysis. Here are some common types of EDA:
1. Univariate Analysis: This type of analysis focuses on examining individual variables in the data set. It involves summarizing and visualizing a single variable at a time to understand its distribution, central tendency, spread, and other relevant statistics. Techniques like histograms, box plots, bar charts, and summary statistics are commonly used in univariate analysis.
2. Bivariate Analysis: Bivariate analysis involves exploring the relationship between two variables. It helps find associations, correlations, and dependencies between pairs of variables. Scatter plots, line plots, correlation matrices, and cross-tabulation are commonly used techniques in bivariate analysis.
4. Time Series Analysis: This type of analysis is mainly applied to data sets that have a temporal component. Time series analysis involves examining and modeling patterns, trends, and seasonality in the data over time. Techniques like line plots, autocorrelation analysis, moving averages, and ARIMA (AutoRegressive Integrated Moving Average) models are commonly used in time series analysis.
5. Missing Data Analysis: Missing data is a common issue in datasets, and it may impact the reliability of the analysis.
7. Data Visualization: Data visualization is a critical part of EDA that involves creating visual representations of the data to facilitate understanding and exploration. Various visualization techniques, such as charts and plots, help reveal patterns and trends in the data.
What is Quantitative Analysis?
Quantitative analysis is the process of collecting and evaluating measurable and verifiable data such as
revenues, market share, and wages in order to understand the behavior and performance of a business.
In the past, business owners and company directors relied heavily on their experience and instinct when
making decisions. However, with data technology, quantitative analysis is now considered a better
approach to making informed decisions.
A quantitative analyst’s main task is to present a given hypothetical situation in terms of numerical
values. Quantitative analysis helps in evaluating performance, assessing financial instruments, and
making predictions. It encompasses three main techniques of measuring data: regression analysis,
linear programming, and data mining.
Quantitative Techniques
1. Regression Analysis
Regression analysis is a common technique that is not only employed by business owners but
also by statisticians and economists. It involves using statistical equations to predict or
estimate the impact of one variable on another. For instance, regression analysis can
determine how interest rates affect consumers’ behavior regarding asset investment. One
other core application of regression analysis is establishing the effect of education and work
experience on employees’ annual earnings.
In the business sector, owners can use regression analysis to determine the impact of
advertising expenses on business profits. Using this approach, a business owner can
establish a positive or negative correlation between two variables.
2. Linear Programming
Linear programming is an optimization technique used to find the best possible outcome, such as maximum profit or minimum cost, subject to a set of linear constraints; businesses use it to decide how to allocate limited resources.
3. Data Mining
Data mining is a combination of computer programming skills and statistical methods. The
popularity of data mining continues to grow in parallel with the increase in the quantity and
size of available data sets. Data mining techniques are used to evaluate very large sets of data
to find patterns or correlations concealed within them.
What is Statistics?
• Statistics is a visual and mathematical portrayal of information. Data science is fundamentally about making calculations with data.
• We make decisions based on that data using mathematical conditions known as models.
• Numerous fields, including data science, machine learning, business intelligence, computer
science, and many others have become increasingly dependent on statistics.
• Descriptive statistics:
Provides ways to summarize data by turning unprocessed observations into understandable
data that is simple to share.
• Inferential Statistics:
With the help of inferential statistics, it is possible to analyze experiments with small samples of
data and draw conclusions about the entire population (entire domain).
Statistical analysis
Statistical analysis is the process of collecting and analyzing data in order to discern patterns
and trends. It is a method for removing bias from evaluating data by employing numerical
analysis. This technique is useful for collecting the interpretations of research, developing
statistical models, and planning surveys and studies.
Statistical analysis is a scientific tool in AI and ML that helps collect and analyze large
amounts of data to identify common patterns and trends to convert them into meaningful
information. In simple words, statistical analysis is a data analysis tool that helps draw
meaningful conclusions from raw and unstructured data.
The conclusions are drawn using statistical analysis facilitating decision-making and helping
businesses make future predictions on the basis of past trends. It can be defined as a science
of collecting and analyzing data to identify trends and patterns and presenting them.
Statistical analysis involves working with numbers and is used by businesses and other
institutions to make use of data to derive meaningful information.
• Descriptive Analysis
Descriptive statistical analysis summarizes the collected data using measures such as the mean, median, mode, and standard deviation, and presents it in tables, charts, and graphs.
• Inferential Analysis
The inferential statistical analysis focuses on drawing meaningful conclusions on the basis of
the data analyzed. It studies the relationship between different variables or makes predictions
for the whole population.
• Predictive Analysis
Predictive statistical analysis is a type of statistical analysis that analyzes data to derive past
trends and predict future events on the basis of them. It uses machine
learning algorithms, data mining, data modelling, and artificial intelligence to conduct the
statistical analysis of data.
• Prescriptive Analysis
The prescriptive analysis conducts the analysis of data and prescribes the best course of
action based on the results. It is a type of statistical analysis that helps you make an informed
decision.
• Exploratory Analysis
Exploratory analysis is similar to inferential analysis, but the difference is that it involves exploring unknown data associations. It analyzes the potential relationships within the data.
• Causal Analysis
The causal statistical analysis focuses on determining the cause and effect relationship
between different variables within the raw data. In simple words, it determines why something
happens and its effect on other variables. This methodology can be used by businesses to
determine the reason for failure.
Major categories of Statistics or its Types
Statistics simply means numerical data, and is a field of math that generally deals with the collection, tabulation, and interpretation of numerical data. It is a form of mathematical analysis that uses different quantitative models to produce a set of experimental data or studies of real life. It is an area of applied mathematics concerned with data collection, analysis, interpretation, and presentation. Statistics deals with how data can be used to solve complex problems. Some people consider statistics to be a distinct mathematical science rather than a branch of mathematics.
Statistics makes work easy and simple and provides a clear and clean picture of work you do on a
regular basis.
• Population –
It is a collection or set of individuals, objects, or events whose properties are to be analyzed.
• Sample –
It is the subset of a population.
Types of Statistics :
1. Descriptive Statistics :
Descriptive statistics uses data that provides a description of the population, either through numerical calculations or graphs or tables. It provides a graphical summary of data and is simply used for summarizing data. There are two categories, as described below.
• (a) Measure of Central Tendency –
• (i) Mean :
It is a measure of the average of all values in a sample set.
For example, the mean of 4, 6, and 8 is (4 + 6 + 8) / 3 = 6.
• (ii) Median :
It is a measure of the central value of a sample set. The data set is ordered from lowest to highest value and then the exact middle is found.
For example, the median of 3, 5, and 9 is 5.
• (iii) Mode :
It is the value that occurs most frequently in a sample set. The value repeated most of the time in the data set is the mode.
For example, the mode of 2, 4, 4, and 7 is 4.
• (b) Measure of Variability –
Measure of Variability is also known as measure of dispersion and is used to describe variability in a sample or population. In statistics, there are three common measures of variability, as shown below:
• (i) Range :
It is a measure of how spread apart the values in a sample set or data set are. Range = maximum value − minimum value.
• (ii) Variance :
It describes how much a random variable differs from its expected value and is computed as the average of the squared deviations from the mean:
Variance = Σ(xi − x̄)² / n
In this formula, n represents the total number of data points, x̄ represents the mean of the data points, and xi represents the individual data points.
• (iii) Dispersion :
It is a measure of the spread of a set of data from its mean.
2. Inferential Statistics :
Inferential statistics makes inferences and predictions about a population based on a sample of data taken from that population. It generalizes a large dataset and applies probabilities to draw a conclusion. It is used for explaining the meaning of descriptive statistics and to analyze, interpret results, and draw conclusions. Inferential statistics is mainly related to and associated with hypothesis testing, whose main target is to reject the null hypothesis.
Hypothesis testing is a type of inferential procedure that uses sample data to evaluate and assess the credibility of a hypothesis about a population. Inferential statistics is generally used to determine how strong the relationship is within the sample. But it is very difficult to obtain a population list and draw a random sample.
Inferential statistics can be carried out with the help of various steps, for example:
6. Collect and gather a sample of children from the population and run the study.
7. Then, perform statistical tests to clarify whether the obtained characteristics of the sample are sufficiently different from what would be expected under the null hypothesis, so that the null hypothesis can be rejected.
Common inferential techniques include:
• Confidence Interval
• T-test or ANOVA
• Pearson Correlation
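For example, a two-sample t-test can be run with SciPy (the sample values below are made up):
from scipy import stats

sample_a = [12, 14, 15, 16, 18, 19]
sample_b = [10, 11, 13, 13, 14, 15]
t_stat, p_value = stats.ttest_ind(sample_a, sample_b)
print(t_stat, p_value)   # reject the null hypothesis if p_value is below the chosen significance level (e.g., 0.05)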
What is Population?
In statistics, population is the entire set of items from which you draw data for a statistical study. It
can be a group of individuals, a set of items, etc. It makes up the data pool for a study.
Generally, population refers to the people who live in a particular area at a specific time. But in
statistics, population refers to data on your study of interest. It can be a group of individuals,
objects, events, organizations, etc. You use populations to draw conclusions.
An example of a population would be the entire student body at a school. It would contain all the
students who study in that school at the time of data collection. Depending on the problem
statement, data from each of these students is collected. An example is the students who speak
Hindi among the students of a school.
For the above situation, it is easy to collect data. The population is small and willing to provide data
and can be contacted. The data collected will be complete and reliable.
If you had to collect the same data from a larger population, say the entire country of India, it would
be impossible to draw reliable conclusions because of geographical and accessibility constraints,
not to mention time and resource constraints. A lot of data would be missing or might be unreliable.
Furthermore, due to accessibility issues, marginalized tribes or villages might not provide data at all,
making the data biased towards certain regions or groups.
What is a Sample?
A sample is defined as a smaller and more manageable representation of a larger group. A subset of
a larger population that contains characteristics of that population. A sample is used in statistical
testing when the population size is too large for all members or observations to be included in the
test.
The sample is an unbiased subset of the population that best represents the whole data.
To overcome the restraints of a population, you can sometimes collect data from a subset of your
population and then consider it as the general norm. You collect the subset information from the
groups who have taken part in the study, making the data reliable. The results obtained for different
groups who took part in the study can be extrapolated to generalize for the population.
The process of collecting data from a small subsection of the population and then using it to
generalize over the entire set is called Sampling.
• The population may be hypothetical and unlimited in size. Take the example of a study that documents the results of a new medical procedure. It is unknown how the procedure will affect people across the globe, so a test group is used to find out how people react to it.
• A good sample should satisfy all the different variations present in the population as well as a well-defined selection criterion.
Say you are looking for a job in the IT sector, so you search online for IT jobs. The first search result
would be for jobs all around the world. But you want to work in India, so you search for IT jobs in
India. This would be your population. It would be impossible to go through and apply for all positions
in the listing. So you consider the top 30 jobs you are qualified for and satisfied with and apply for
those. This is your sample.
Measures of Central Tendency in Statistics
Central tendencies in statistics are numerical values that are used to represent the mid-value or central value of a large collection of numerical data. These obtained numerical values are called central or average values in statistics. A central or average value of any statistical data or series is the value of that variable that is representative of the entire data or its associated frequency distribution. Such a value is of great significance because it depicts the nature or characteristics of the entire data, which is otherwise very difficult to observe.
Mean
Mean in general terms is used for the arithmetic mean of the data, but other than the arithmetic mean
there are geometric mean and harmonic mean as well that are calculated using different formulas.
Here in this article, we will discuss the arithmetic mean.
Mean for Ungrouped Data
Arithmetic mean (x̄) is defined as the sum of the individual observations (xi) divided by the total number of observations N. In other words, the mean is given by the sum of all observations divided by the total number of observations:
x̄ = Σxi / N
Example: If there are 5 observations, which are 27, 11, 17, 19, and 21, then the mean (x̄) is given by
x̄ = (27 + 11 + 17 + 19 + 21) ÷ 5 = 95 ÷ 5 = 19
Median
The Median of any distribution is that value that divides the distribution into two equal parts such
that the number of observations above it is equal to the number of observations below it. Thus, the
median is called the central value of any given data either grouped or ungrouped.
To calculate the median, the observations must be arranged in ascending or descending order. If the total number of observations is N, then there are two cases:
Case 1: N is odd – the median is the value of the ((N + 1)/2)th observation.
Case 2: N is even – the median is the average of the (N/2)th and ((N/2) + 1)th observations.
Mode
The mode is the value of the observation that has the maximum frequency corresponding to it. In other words, it is the observation that occurs the maximum number of times in a dataset.
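These three measures can be computed directly with Python's built-in statistics module, for example:
import statistics

data = [27, 11, 17, 19, 21, 19]
print(statistics.mean(data))     # arithmetic mean: 19.0
print(statistics.median(data))   # middle value of the sorted data: 19.0
print(statistics.mode(data))     # most frequent value: 19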
Measures of Dispersion
Measures of dispersion capture the variation between different values of the data.
We can understand the measure of dispersion by studying the following example: suppose we have 10 students in a class and the marks scored by them in a Mathematics test are 12, 14, 18, 9, 11, 7, 9, 16, 19, and 20 out of 20. Then the average value scored by the students in the class is
Mean = 135/10 = 13.5
Mean Deviation = {|12-13.5| + |14-13.5| + |18-13.5| + |9-13.5| + |11-13.5| + |7-13.5| + |9-13.5| + |16-13.5| + |19-13.5| + |20-13.5|}/10 = 39/10 = 3.9
These measures of dispersion can be further divided into various categories; within each category, the parameters are expressed in the same unit.
These measures of dispersion are measured and expressed in the units of data themselves. For
example – Meters, Dollars, Kg, etc. Some absolute measures of dispersion are:
Range: Range is defined as the difference between the largest and the smallest value in the
distribution.
Mean Deviation: Mean deviation is the arithmetic mean of the difference between the values and their
mean.
Standard Deviation: Standard Deviation is the square root of the arithmetic average of the square of
the deviations measured from the mean.
Variance: Variance is defined as the average of the square deviation from the mean of the given data
set.
Quartile Deviation: Quartile deviation is defined as half of the difference between the third quartile
and the first quartile in a given data set.
Interquartile Range: The difference between the upper (Q3) and lower (Q1) quartiles is called the Interquartile Range. The formula for the Interquartile Range is Q3 – Q1.
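These measures can be computed with NumPy; the sketch below reuses the marks from the earlier example:
import numpy as np

marks = np.array([12, 14, 18, 9, 11, 7, 9, 16, 19, 20])
print(marks.max() - marks.min())              # range
print(np.mean(np.abs(marks - marks.mean())))  # mean deviation about the mean
print(marks.var())                            # variance (population)
print(marks.std())                            # standard deviation
q1, q3 = np.percentile(marks, [25, 75])
print(q3 - q1, (q3 - q1) / 2)                 # interquartile range and quartile deviation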
Skewness
Skewness is an important statistical technique that helps to determine the asymmetrical behavior of a frequency distribution, or more precisely, the lack of symmetry of the tails, both left and right, of the frequency curve. A distribution or dataset is symmetric if it looks the same to the left and right of the center point.
Types of Skewness
2. Asymmetric Skewness: An asymmetrical or skewed distribution is one in which the spread of the frequencies is different on the two sides of the center point, or the frequency curve is more stretched towards one side; the values of Mean, Median, and Mode fall at different points.
• Positive Skewness: In this case, the concentration of frequencies is more towards the higher values of the variable, i.e. the right tail is longer than the left tail.
• Negative Skewness: In this case, the concentration of frequencies is more towards the lower values of the variable, i.e. the left tail is longer than the right tail.
Kurtosis:
It is also a characteristic of the frequency distribution. It gives an idea about the shape of a
frequency distribution. Basically, the measure of kurtosis is the extent to which a frequency
distribution is peaked in comparison with a normal curve. It is the degree of peakedness of a
distribution.
Types of Kurtosis
1. Leptokurtic: A leptokurtic curve has a higher peak than the normal distribution. In this curve, there is too much concentration of items near the central value.
2. Mesokurtic: A mesokurtic curve has a peak similar to that of the normal curve. In this curve, there is an even distribution of items around the central value.
3. Platykurtic: A platykurtic curve has a lower peak than the normal curve. In this curve, there is less concentration of items around the central value.
Difference between Skewness and Kurtosis:
1. Skewness indicates the shape and size of variation on either side of the central value; kurtosis indicates the frequencies of the distribution at the central value.
2. The measure of skewness tells us about the magnitude and direction of the asymmetry of a distribution; kurtosis indicates the concentration of items at the central part of a distribution.
3. Skewness indicates how far the distribution differs from the normal distribution; kurtosis studies the divergence of the given distribution from the normal distribution.
4. The measure of skewness studies the extent to which deviations cluster above or below the average; kurtosis indicates the concentration of items.
2. Regression :
Regression analysis is used to predict the value of the dependent variable based on the known value of the independent variable, assuming an average mathematical relationship between two or more variables.
Difference between correlation and regression:
• In correlation, there is no difference between the two variables; in regression, the two variables serve different purposes.
• In correlation, both variables are mutually dependent; in regression, one variable is independent, while the other is dependent.
Machine learning derives insightful information from large volumes of data by leveraging algorithms
to identify patterns and learn in an iterative process. ML algorithms use computation methods to
learn directly from data instead of relying on any predetermined equation that may serve as a model.
1. Supervised learning
This type of ML involves supervision, where machines are trained on labeled datasets and enabled to predict outputs based on the provided training. The labeled dataset specifies that some input and output parameters are already mapped. Hence, the machine is trained with the input and corresponding output. A device is made to predict the outcome using the test dataset in subsequent phases.
2. Unsupervised learning
Unsupervised learning refers to a learning technique that is devoid of supervision. Here, the machine is trained using an unlabeled dataset and is enabled to predict the output without any supervision. An unsupervised learning algorithm aims to group the unsorted dataset based on the input's similarities, differences, and patterns.
3. Semi-supervised learning
4. Reinforcement learning
1. Linear Regression
Linear regression is one of the most popular and simple machine learning algorithms used for predictive analysis. Here, predictive analysis means predicting something, and linear regression makes predictions for continuous numbers such as salary, age, etc. It shows the linear relationship between the dependent and independent variables, and shows how the dependent variable (y) changes according to the independent variable (x).
It tries to fit the best line between the dependent and independent variables, and this best fit line is known as the regression line. The equation for the regression line is:
y = a0 + a1*x
Here, y = dependent variable, x = independent variable, a0 = intercept of the line, and a1 = slope (linear regression coefficient).
2. Logistic Regression
Logistic regression is similar to linear regression except in how it is used: linear regression is used to solve regression problems and predict continuous values, whereas logistic regression is used to solve classification problems and predict discrete values.
A decision tree is a supervised learning algorithm that is mainly used to solve classification problems but can also be used for solving regression problems. It can work with both categorical variables and continuous variables. It shows a tree-like structure that includes nodes and branches, and starts with the root node that expands into further branches until the leaf nodes. The internal nodes are used to represent the features of the dataset, branches show the decision rules, and leaf nodes represent the outcome of the problem.
A support vector machine or SVM is a supervised learning algorithm that can also be used for
classification and regression problems. However, it is primarily used for classification problems. The
goal of SVM is to create a hyperplane or decision boundary that can segregate datasets into different
classes.
The data points that help to define the hyperplane are known as support vectors, and hence it is
named as support vector machine algorithm.
The Naïve Bayes classifier is a supervised learning algorithm which is used to make predictions based on the probability of the object. The algorithm is named Naïve Bayes as it is based on Bayes' theorem and follows the naïve assumption that the variables are independent of each other.
Bayes' theorem is based on conditional probability; it gives the likelihood that event A will happen given that event B has already happened. The equation for Bayes' theorem is given as:
P(A|B) = P(B|A) * P(A) / P(B)
K-Nearest Neighbour is a supervised learning algorithm that can be used for both classification and regression problems. This algorithm works by assuming similarities between the new data point and the available data points. Based on these similarities, the new data points are put in the most similar categories. It is also known as the lazy learner algorithm, as it stores all the available datasets and classifies each new case with the help of its K neighbours. The new case is assigned to the nearest class with the most similarities, and a distance function measures the distance between the data points.
Linear Regression
Linear regression is one of the easiest and most popular Machine Learning algorithms. It
is a statistical method that is used for predictive analysis. Linear regression makes
predictions for continuous/real or numeric variables such as sales, salary, age, product
price, etc.
The linear regression algorithm shows a linear relationship between a dependent (y) and one or more independent (x) variables, hence it is called linear regression. Since linear regression shows a linear relationship, it finds how the value of the dependent variable changes according to the value of the independent variable.
The linear regression model provides a sloped straight line representing the relationship between the variables.
Mathematically, we can represent a linear regression as:
y = a0 + a1x + ε
Here, y = dependent variable (target variable), x = independent variable (predictor variable), a0 = intercept of the line, a1 = linear regression coefficient (slope), and ε = random error.
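A minimal sketch of fitting a simple linear regression with scikit-learn (the data values are made up for illustration):
import numpy as np
from sklearn.linear_model import LinearRegression

# x = years of experience (independent), y = salary in thousands (dependent), hypothetical values
x = np.array([[1], [2], [3], [4], [5]])
y = np.array([30, 35, 41, 44, 50])

model = LinearRegression().fit(x, y)
print(model.intercept_, model.coef_)   # a0 (intercept) and a1 (slope)
print(model.predict([[6]]))            # predicted salary for 6 years of experience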
Let's say we have two hypotheses for a task, h(x) and h'(x). How would we know which one is better? From a high-level perspective, we might take the following steps:
1. Measure the performance of both hypotheses on the same test data (for example, their accuracy).
2. Determine whether there is any statistical significance between the two results. If there is, select the better performing hypothesis. If not, we cannot say with any statistical certainty that either h(x) or h'(x) is better.
When we have a classification task, we will consider the accuracy of our model by its ability to assign an instance to its correct class. Consider this on a binary level. We have two classes, 1 and 0. We would classify a correct prediction, therefore, as being when the model classifies a class 1 instance as class 1, or a class 0 instance as class 0. Assuming our 1 class is the 'Positive class' and the 0 class is the 'Negative class', we can build a table that outlines all the possibilities our model might produce:
• Predicted 1, Actual 1 → True Positive
• Predicted 0, Actual 0 → True Negative
• Predicted 1, Actual 0 → False Positive
• Predicted 0, Actual 1 → False Negative
We also have names for these classifications. Our True Positive and True Negative are our
correct classifications, as we can see in both cases, the actual class and the predicted
class are the same. The other two classes, in which the model predicts incorrectly, can be
explained as follows:
• False Positive — when the model predicts 1, but the actual class is 0, also known
as Type I Error
• False Negative — when the model predicts 0, but the actual class is 1, also known
as Type II Error
When we take a series of instances and populate the above table with frequencies of how
often we observe each classification, we have produced what is known as
a confusion matrix. This is a good method to begin evaluating a hypothesis that goes a
little bit further than a simple accuracy rate. With this confusion matrix, we can define the
accuracy rate and we can also define a few other metrics to see how well our model is
performing. We use the shortened abbreviations False Positive (FP), False Negative (FN),
True Positive (TP) and True Negative (TN).
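A short sketch showing how the confusion matrix and these metrics can be computed with scikit-learn (the actual and predicted label lists are made up):
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

actual    = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()
print(tp, tn, fp, fn)                      # TP, TN, FP, FN counts
print(accuracy_score(actual, predicted))   # (TP + TN) / total
print(precision_score(actual, predicted))  # TP / (TP + FP)
print(recall_score(actual, predicted))     # TP / (TP + FN)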
Support Vector Machine or SVM
Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems. However,
primarily, it is used for Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point in
the correct category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These
extreme cases are called support vectors, and hence the algorithm is termed Support
Vector Machine. The two different categories are classified using a decision boundary or
hyperplane.
SVM algorithm can be used for Face detection, image classification, text
categorization, etc.
Types of SVM
SVM can be of two types:
o Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means that if a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
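A minimal SVM sketch with scikit-learn (the tiny dataset is made up; kernel='linear' gives a Linear SVM, while kernel='rbf' would give a non-linear one):
from sklearn.svm import SVC

# Two features per sample, two classes (hypothetical points)
X = [[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]]
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear").fit(X, y)
print(clf.support_vectors_)   # the extreme points that define the hyperplane
print(clf.predict([[4, 4]]))  # class of a new data point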
Decision Tree
>In a decision tree, there are two types of nodes: the Decision Node and the Leaf Node. Decision nodes are used to make any decision and have multiple branches, whereas leaf nodes are the output of those decisions and do not contain any further branches.
>The decisions or the test are performed on the basis of features of the given dataset.
>It is a graphical representation for getting all the possible solutions to a problem/decision
based on given conditions.
>It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
>In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.
>A decision tree simply asks a question, and based on the answer (Yes/No), it further splits the tree into subtrees.
o It may have an overfitting issue, which can be resolved using the Random Forest
algorithm.
o For more class labels, the computational complexity of the decision tree may increase.
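A short sketch of training a decision tree with scikit-learn (which implements the CART algorithm mentioned above; the iris dataset is used only as an illustration):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3)   # limiting depth helps reduce overfitting
tree.fit(X_train, y_train)
print(tree.score(X_test, y_test))            # accuracy on unseen data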
Random Forest
Random Forest is a popular machine learning algorithm that belongs to the supervised
learning technique. It can be used for both Classification and Regression problems in ML.
It is based on the concept of ensemble learning, which is a process of combining multiple
classifiers to solve a complex problem and to improve the performance of the model.
As the name suggests, "Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of predictions, predicts the final output.
The greater number of trees in the forest leads to higher accuracy and prevents the
problem of overfitting.
There are mainly four sectors where Random Forest is mostly used:
1. Banking: Banking sector mostly uses this algorithm for the identification of loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the disease can
be identified.
3. Land Use: We can identify the areas of similar land use by this algorithm.
o It enhances the accuracy of the model and prevents the overfitting issue.
o Although random forest can be used for both classification and regression tasks, it is not as well suited to regression tasks.
Naïve Bayes
o Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes
theorem and used for solving classification problems.
o Naïve Bayes Classifier is one of the simple and most effective Classification algorithms
which helps in building the fast machine learning models that can make quick
predictions.
o Some popular examples of Naïve Bayes Algorithm are spam filtration, Sentimental
analysis, and classifying articles.
o Naïve Bayes is one of the fast and easy ML algorithms to predict a class of datasets.
o Naive Bayes assumes that all features are independent or unrelated, so it cannot learn
the relationship between features.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine
the probability of a hypothesis with prior knowledge. It depends on the conditional
probability.
Where,
P(B|A) is Likelihood probability: Probability of the evidence given that the probability of a
hypothesis is true.---------------------------------P(A) is Prior Probability: Probability of hypothesis
before observing the evidence.
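A minimal Naïve Bayes sketch using scikit-learn's Gaussian variant (the choice of dataset is only for illustration):
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
model = GaussianNB().fit(X, y)
print(model.predict(X[:3]))        # predicted classes for the first three samples
print(model.predict_proba(X[:3]))  # posterior probabilities P(class | features)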
Since 2016, automated feature engineering is also used in different machine learning
software that helps in automatically extracting features from raw data. Feature engineering
in ML contains mainly four processes: Feature Creation, Transformations, Feature
Extraction, and Feature Selection.
1. Feature Creation: Feature creation is finding the most useful variables to be used in a predictive model. The process is subjective, and it requires human creativity and intervention. The new features are created by combining existing features using addition, subtraction, and ratios, and these new features have great flexibility.
4. Feature Selection: While developing the machine learning model, only a few variables in the dataset are useful for building the model, and the rest of the features are either redundant or irrelevant. If we input the dataset with all these redundant and irrelevant features, it may negatively impact and reduce the overall performance and accuracy of the model. Hence it is very important to identify and select the most appropriate features from the data and remove the irrelevant or less important features, which is done with the help of feature selection in machine learning. "Feature selection is a way of selecting the subset of the most relevant features from the original feature set by removing the redundant, irrelevant, or noisy features."
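One common way to do this in practice is univariate feature selection with scikit-learn's SelectKBest, sketched below (the choice of dataset and of k are arbitrary):
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=f_classif, k=2)   # keep the 2 most relevant features
X_new = selector.fit_transform(X, y)
print(X.shape, "->", X_new.shape)                   # (150, 4) -> (150, 2)
print(selector.get_support())                       # mask of the selected features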
Principal Component Analysis
Principal Component Analysis is an unsupervised learning algorithm that is used for the
dimensionality reduction in machine learning. It is a statistical process that converts the
observations of correlated features into a set of linearly uncorrelated features with the help
of orthogonal transformation. These new transformed features are called the Principal
Components. It is one of the popular tools that is used for exploratory data analysis and
predictive modeling. It is a technique to draw strong patterns from the given dataset by
reducing the variances.
PCA generally tries to find the lower-dimensional surface onto which to project the high-dimensional data.
PCA works by considering the variance of each attribute, because an attribute with high variance shows a good split between the classes, and hence it reduces the dimensionality. Some real-world applications of PCA are image processing, movie recommendation systems, and optimizing the power allocation in various communication channels. It is a feature extraction technique, so it keeps the important variables and drops the least important variables.
o Correlation: It signifies that how strongly two variables are related to each other. Such
as if one changes, the other variable also gets changed. The correlation value ranges
from -1 to +1. Here, -1 occurs if variables are inversely proportional to each other, and
+1 indicates that variables are directly proportional to each other.
o Orthogonal: It defines that variables are not correlated to each other, and hence the
correlation between the pair of variables is zero.
o Covariance Matrix: A matrix containing the covariance between the pair of variables is
called the Covariance Matrix.
As described above, the transformed new features, or the output of PCA, are the Principal Components. The number of these PCs is either equal to or less than the number of original features present in the dataset. Some properties of these principal components are given below:
o The principal component must be the linear combination of the original features.
o These components are orthogonal, i.e., the correlation between a pair of variables is
zero.
o By reducing the dimensions of the features, the space required to store the dataset
also gets reduced.
o Reduced dimensions of features of the dataset help in visualizing the data quickly.
There are also some disadvantages of applying dimensionality reduction, which are given below:
o Some amount of information (variance) may be lost when the number of dimensions is reduced.
o The principal components are linear combinations of the original features and are therefore harder to interpret.
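A minimal PCA sketch with scikit-learn (the dataset and the number of components are arbitrary choices for illustration):
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)             # keep the two directions with the largest variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                # (150, 2)
print(pca.explained_variance_ratio_)  # share of the total variance captured by each component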
Anaconda installs the latest Python 2 or 3 version in an isolated, activated environment, so any other installed Python version doesn't cause issues for your projects. It is also beginner-friendly: you don't need any prior coding/programming knowledge to get started.
Installation Process
Anaconda Navigator
Anaconda Navigator contains lots of stuff inside it. So let’s understand which stuff we need for
our next data science project.
1. Jupyter Notebook
Jupyter Notebook is a web-based, interactive competing notebook environment. You can edit
and run human-readable docs while describing the data analysis. The Jupyter Notebook is an
open-source web application that allows you to create and share documents that contain live
code, equations, visualizations, and narrative text. Uses include data cleaning and
transformation, numerical simulation, statistical modeling, data visualization, machine
learning, and much more.
2. JupyterLab
It’s an extensible environment for interactive and reproducible computing, based on the
Jupyter Notebook and its architecture. JupyterLab enables you to work with documents and activities such as Jupyter notebooks, text editors, terminals, and custom components in a flexible, integrated, and extensible manner.
3. Spyder
One of the most important and powerful Python IDEs is Spyder. Spyder is another good open-source and cross-platform IDE written in Python. It is also called the Scientific Python Development IDE, and it is the most lightweight IDE for Python. It is mainly used by data scientists and integrates with Matplotlib, SciPy, NumPy, Pandas, Cython, IPython, SymPy, and other open-source software.
4. RStudio
When it comes to the data science world, Python and R are the two programming languages that most often come to mind. RStudio is an integrated development environment (IDE) for the R programming language. It provides literate programming tools, which basically allow the use of R scripts, outputs, text, and images in reports, Word documents, and even HTML files.
Python Data Types for Data Science
Data types refer to the categorization or classification of data components. It stands for the
kind of value that defines the possible operations on a given piece of data.
In other words, Data types are a specific class of data item that can be identified by the values
it can accept, the programming language that can be used to create it, or the actions that can
be carried out on it.
The main standard data types in Python are given below:
1. Numeric − int, float, complex
2. Dictionary − dict
3. Boolean − bool
4. Set − set
5. Sequence Type − list, tuple, range
6. String − str
Python's numeric data types are used to represent data that has a numeric value. There are three numeric types: an integer belonging to the int class, a floating-point number belonging to the float class, and a complex number belonging to the complex class.
Integer − Positive and negative whole numbers without fractions or decimals. Integers belong to the int class, and there is no restriction on the length of integer numbers in Python.
Float − A real number with a floating-point representation, indicated by a decimal point. An e or E may be added after the number to designate scientific notation.
Complex Number − The complex class serves as the representation for complex numbers. As an example, 4+5j is described as (real part) + (imaginary part)j.
Note − To identify the type of data, use the type() method.
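A short example of the three numeric classes and the type() check:
a = 42                             # int: whole number, positive or negative, unlimited length
b = 3.14e2                         # float: real number; e/E denotes scientific notation (314.0)
c = 4 + 5j                         # complex: (real part) + (imaginary part)j
print(type(a), type(b), type(c))   # <class 'int'> <class 'float'> <class 'complex'>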
Python Dictionary
A dictionary in Python is an unordered collection of data values used to store data values like a map. Unlike other data types that hold only a single value as an element, dictionaries hold key-value pairs. Key-value pairs are used to make the dictionary more efficient. When writing a dictionary, each key is separated from its value by a colon, and the key-value pairs are separated by commas.
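Example:
student = {"name": "Ram", "age": 21, "marks": 78.5}   # keys and values separated by colons, pairs by commas
print(student["name"])        # access a value by its key
student["age"] = 22           # update an existing key
student["city"] = "Delhi"     # add a new key-value pair
print(student)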
Python Boolean
Data with the predetermined values True or False. Objects equal to False are falsy (false), while objects equal to True are truthy (true). However, it is also possible to evaluate and categorize non-Boolean objects in a Boolean context. The bool class is used to represent it.
In Python, a set is a non-duplicate collection of data types that may be iterated through and
changed. A set may have a variety of components, but the placement of the parts is not fixed.
Unordered objects are grouped together as a set. There cannot be any duplicates of any set
element, and it must be immutable (cannot be changed).
Due to the set's unordered nature, indexing will be useless. As a result, the slicing operator []
is useless.
Creation of set
Sets can be built with the built-in set() method from an iterable object or a sequence, or by wrapping the elements in curly brackets and separating them with commas. The elements in a set don't have to be of the same type; a set can contain a variety of mixed data type values.
Example:
# Create a set from a list using the set() function
s = set([1, 2, 3, 4, 5])
print(s)   # Output: {1, 2, 3, 4, 5}
# Create a set using curly braces
s = {1, 2, 3, 4, 5}
print(s)   # Output: {1, 2, 3, 4, 5}
Python Sequence
A sequence in Python is an ordered grouping of similar or different data types. Sequences enable the ordered and efficient storage of multiple values. In Python, there are various sequence types. They are given below − list, tuple, range.
List Data Type
A list can be formed by putting all the elements in square brackets, with the elements separated by commas. Elements can be of any data type (even another list), and they can be traversed using an iterator or accessed using an index.
Example:
# Create a list using square brackets
l = [1, 2, 3, 4, 5]
print(l)      # Output: [1, 2, 3, 4, 5]
# Access an item in the list using its index
print(l[1])   # Output: 2
Tuple Data Type
Tuples are similar to lists, but they can’t be modified once they are created. Tuples are
commonly used to store data that should not be modified, such as configuration
settings or data that is read from a database.
Python Range
The range data type represents an immutable sequence of numbers. It is similar to a list, but it is more memory-efficient and faster to iterate over.
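Example showing a tuple and a range:
t = (10, 20, 30)     # tuple: ordered and immutable
print(t[0])          # 10
r = range(1, 6)      # range: immutable sequence of the numbers 1 to 5
print(list(r))       # [1, 2, 3, 4, 5]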
Python String
A string is a sequence of Unicode characters: a grouping of one or more characters enclosed in single, double, or triple quotation marks. There is no separate character data type in Python; instead, a character is simply a string of length 1. The class str is used to represent strings.
Strings can be used for a variety of actions, including concatenation, slicing, and
repetition.
Python Operators
Operators are used to perform operations on variables and values.
In the example below, we use the + operator to add together two values:
print(10 + 5)
Arithmetic operators are used with numeric values to perform common mathematical operations:
Operator | Name | Example
+ | Addition | x + y
- | Subtraction | x - y
* | Multiplication | x * y
/ | Division | x / y
% | Modulus | x % y
** | Exponentiation | x ** y
// | Floor division | x // y
Assignment operators are used to assign values to variables:
Operator | Example | Same As
= | x = 5 | x = 5
+= | x += 3 | x = x + 3
-= | x -= 3 | x = x - 3
*= | x *= 3 | x = x * 3
/= | x /= 3 | x = x / 3
%= | x %= 3 | x = x % 3
//= | x //= 3 | x = x // 3
**= | x **= 3 | x = x ** 3
|= | x |= 3 | x = x | 3
Comparison operators are used to compare two values:
Operator | Name | Example
== | Equal | x == y
!= | Not equal | x != y
Logical operators are used to combine conditional statements:
Operator | Description | Example
and | Returns True if both statements are true | x < 5 and x < 10
not | Reverses the result, returns False if the result is true | not(x < 5 and x < 10)
Identity operators are used to compare objects, not to check if they are equal, but if they are actually the same object, with the same memory location:
Operator | Description | Example
is not | Returns True if both variables are not the same object | x is not y
Membership operators are used to test if a sequence is present in an object:
Operator | Description | Example
in | Returns True if a sequence with the specified value is present in the object | x in y
not in | Returns True if a sequence with the specified value is not present in the object | x not in y
Bitwise operators are used to compare (binary) numbers:
Operator | Name | Description | Example
<< | Zero fill left shift | Shift left by pushing zeros in from the right and let the leftmost bits fall off | x << 2
>> | Signed right shift | Shift right by pushing copies of the leftmost bit in from the left, and let the rightmost bits fall off | x >> 2
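A few of these operators in action:
x, y = 5, 3
print(x + y, x ** y, x // y, x % y)         # 8 125 1 2
print(x > 2 and y < 10)                     # True
print(x is y, x is not y)                   # False True
print(3 in [1, 2, 3], 7 not in [1, 2, 3])   # True True
print(x << 2, x >> 1)                       # 20 2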
What is NumPy
NumPy stands for Numerical Python; it is a Python package for the computation and processing of single-dimensional and multidimensional array elements.
Travis Oliphant created the NumPy package in 2005 by combining the features of the ancestor module Numeric with those of another module, Numarray.
NumPy provides a convenient and efficient way to handle the vast amount of data. NumPy
is also very convenient with Matrix multiplication and data reshaping. NumPy is fast which
makes it reasonable to work with a large set of data.
There are the following advantages of using NumPy for data analysis.
5. NumPy provides the in-built functions for linear algebra and random number
generation.
Nowadays, NumPy in combination with SciPy and Matplotlib is used as a replacement for MATLAB, as Python is a more complete and easier programming language than MATLAB.
Prerequisite
Before learning Python Numpy, you must have the basic knowledge of Python concepts.
Numpy function –
Quite understandably, NumPy contains a large number of various mathematical
operations. NumPy provides standard trigonometric functions, functions for arithmetic
operations, handling complex numbers, etc.
Trigonometric Functions
NumPy has standard trigonometric functions which return trigonometric ratios for a given
angle in radians.
Example:
import numpy as np
a = np.array([0, 30, 45, 60, 90])
print('Sine of different angles:')
# Convert to radians by multiplying by pi/180
print(np.sin(a * np.pi / 180))
print('Cosine values for angles in array:')
print(np.cos(a * np.pi / 180))
print('Tangent values for given angles:')
print(np.tan(a * np.pi / 180))
The arcsin, arccos, and arctan functions return the trigonometric inverse of sin, cos, and tan of the given angle. The result of these functions can be verified with the numpy.degrees() function by converting radians to degrees.
numpy.around()
This function returns the value rounded to the desired precision. The function takes the following parameters:
numpy.around(a, decimals)
Where,
1. a – Input data
2. decimals – The number of decimals to round to. Default is 0. If negative, the integer is rounded to the position to the left of the decimal point.
Example:
import numpy as np
a = np.array([1.0, 5.55, 123, 0.567, 25.532])
print('Original array:')
print(a)
print('After rounding:')
print(np.around(a))
print(np.around(a, decimals=1))
print(np.around(a, decimals=-1))
It produces the following output:
Original array: [ 1. 5.55 123. 0.567 25.532]
After rounding: [ 1. 6. 123. 1. 26. ]
[ 1. 5.6 123. 0.6 25.5]
[ 0. 10. 120. 0. 30. ]
numpy.floor()
This function returns the largest integer not greater than the input parameter. The floor of the scalar x is the largest integer i such that i <= x. Note that flooring rounds towards negative infinity, so for example np.floor(-0.5) gives -1.0.
What is SciPy
SciPy is an open-source scientific library for Python that is distributed under a BSD license. It is used to solve complex scientific and mathematical problems. It is built on top of the NumPy extension, which means that if we import SciPy, there is no need to import NumPy. SciPy is pronounced as 'Sigh Pie', and it depends on NumPy, including appropriate and fast N-dimensional array manipulation.
It provides many user-friendly and effective numerical functions for numerical
integration and optimization.
The SciPy library supports integration, gradient optimization, special functions,
ordinary differential equation solvers, parallel programming tools, and many more.
We can say that SciPy implementation exists in every complex numerical
computation.
The scipy is a data-processing and system-prototyping environment as similar to
MATLAB. It is easy to use and provides great flexibility to scientists and
engineers.
History
Python was expanded in the 1990s to include an array type for numerical computing called Numeric. This Numeric package was replaced by NumPy (a blend of Numeric and NumArray) in 2006. There was a growing number of extension modules, and developers were interested in creating a complete environment for scientific and technical computing. Travis Oliphant, Eric Jones, and Pearu Peterson merged code they had written and called the new package SciPy. The newly created package provided a standard collection of common numerical operations on top of NumPy.
Why use SciPy?
SciPy contains significant mathematical algorithms that make it easy to develop sophisticated and dedicated applications. Being an open-source library, it has a large community across the world contributing to the development of its additional modules, and it is very beneficial for scientific applications and data scientists.
Numpy vs. SciPy
NumPy and SciPy are both used for mathematical and numerical analysis. NumPy is suitable for basic operations such as sorting, indexing, and many more, because it contains the array data, whereas SciPy consists of all the numerical code.
NumPy contains many functions that are used for linear algebra, Fourier transforms, etc., whereas the SciPy library contains a full-featured version of the linear algebra module as well as many other numerical algorithms.
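A small sketch of SciPy in use, numerically integrating x squared from 0 to 3 (the exact answer is 9):
from scipy import integrate

result, error = integrate.quad(lambda x: x ** 2, 0, 3)
print(result)   # approximately 9.0
print(error)    # estimate of the numerical error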
Data operation
Python handles data of various formats mainly through the two libraries, Pandas and
Numpy. We have already seen the important features of these two libraries in the previous
chapters. In this chapter we will see some basic examples from each of the libraries on
how to operate on data.
numpy.array
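A basic example of creating and operating on a NumPy array:
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])   # 2-dimensional array from a nested list
print(a.shape)            # (2, 3)
print(a.dtype)            # int64 (platform dependent)
print(a * 2)              # element-wise multiplication
print(a.reshape(3, 2))    # data reshaping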
Pandas Series
Series is a one-dimensional labeled array capable of holding data of any type (integer,
string, float, python objects, etc.). The axis labels are collectively called index. A pandas
Series can be created using the following constructor −pandas.Series( data, index, dtype,
copy)
Example: Here we create a Series from a NumPy array.
# Import the pandas library, aliasing it as pd
import pandas as pd
import numpy as np
data = np.array(['a', 'b', 'c', 'd'])
s = pd.Series(data)
print(s)
Pandas DataFrame
A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns. A pandas DataFrame can be created using the following constructor − pandas.DataFrame(data, index, columns, dtype, copy)
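Example of creating a DataFrame from a dictionary of columns:
import pandas as pd

data = {"Name": ["Tom", "Jack", "Steve"], "Age": [28, 34, 29]}
df = pd.DataFrame(data)
print(df)
#     Name  Age
# 0    Tom   28
# 1   Jack   34
# 2  Steve   29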
A panel is a 3D container of data. The term Panel data is derived from econometrics and is partially responsible for the name pandas − pan(el)-da(ta)-s. (Note that the Panel class has been removed from recent versions of pandas; three-dimensional data is now usually handled with a MultiIndex DataFrame or the xarray library.)
Data Visualization using Matplotlib
Data Visualization using Matplotlib is the process of presenting data in the form of graphs
or charts. It helps to understand large and complex amounts of data very easily. It allows
the decision-makers to make decisions very efficiently and also allows them in identifying
new trends and patterns very easily. It is also used in high-level data analysis for Machine
Learning and Exploratory Data Analysis (EDA). Data visualization can be done with
various tools like Tableau, Power BI, Python.
In this article, we will discuss how to visualize data with the help of the Matplotlib library of
Python.
Matplotlib
Matplotlib is a low-level library of Python which is used for data visualization. It is easy to
use and emulates MATLAB like graphs and visualization. This library is built on the top of
NumPy arrays and consist of several plots like line chart, bar chart, histogram, etc. It
provides a lot of flexibility but at the cost of writing more code.
Pyplot
Pyplot is a module of the Matplotlib library that provides a MATLAB-like interface of functions (such as plot(), title(), and show()) for building figures. After this brief look at Matplotlib and pyplot, let's see how to create a simple plot.
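A minimal example (the data values are arbitrary):
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 1, 5, 3]
plt.plot(x, y)   # draw a line chart
plt.show()       # display the figure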
Adding Title
The title() method in the matplotlib module is used to specify the title of the visualization depicted, and it displays the title using various attributes.
Syntax: matplotlib.pyplot.title(label)
In layman's terms, the X label and the Y label are the titles given to the X-axis and Y-axis respectively. These can be added to the graph by using the xlabel() and ylabel() methods.
Syntax: matplotlib.pyplot.xlabel(xlabel) and matplotlib.pyplot.ylabel(ylabel)
You might have seen that Matplotlib automatically sets the values and the markers(points)
of the X and Y axis, however, it is possible to set the limit and markers
manually. xlim() and ylim() functions are used to set the limits of the X-axis and Y-axis
respectively. Similarly, xticks() and yticks() functions are used to set tick labels.
Adding Legends
A legend is an area describing the elements of the graph. In simple terms, it reflects the
data displayed in the graph’s Y-axis. It generally appears as the box containing a small
sample of each color on the graph and a small description of what this data means.
The attribute bbox_to_anchor=(x, y) of legend() function is used to specify the coordinates
of the legend, and the attribute ncol represents the number of columns that the legend
has. Its default value is 1.
Syntax: matplotlib.pyplot.legend(["label1", "label2"], bbox_to_anchor=(x, y), ncol=1)
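Putting the title, axis labels, limits, ticks, and legend together in one sketch (the data values are arbitrary):
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
plt.plot(x, [2, 4, 1, 5, 3], label="Series A")
plt.plot(x, [1, 3, 2, 4, 5], label="Series B")
plt.title("Sample Plot")                     # title of the visualization
plt.xlabel("X values")                       # X-axis label
plt.ylabel("Y values")                       # Y-axis label
plt.xlim(0, 6)                               # manual X-axis limits
plt.xticks([1, 2, 3, 4, 5])                  # manual tick positions
plt.legend(bbox_to_anchor=(1, 1), ncol=1)    # legend placed by coordinates, one column
plt.show()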
Figure class
Consider the figure class as the overall window or page on which everything is drawn. It is
a top-level container that contains one or more axes. A figure can be created using
the figure() method.
Syntax: matplotlib.pyplot.figure(figsize=(width, height), dpi=None, facecolor=None, edgecolor=None)
Multiple Plots
We have learned about the basic components of a graph that can be added so that it can
convey more information. One method can be by calling the plot function again and again
with a different set of values as shown in the above example. Now let’s see how to plot
multiple graphs using some functions and also how to plot subplots.
The add_axes() method is used to add axes to the figure. This is a method of the figure class.
Syntax: fig.add_axes([left, bottom, width, height])
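A short sketch of using a Figure with add_axes() (the rectangle values are fractions of the figure size):
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(5, 4))
ax = fig.add_axes([0.1, 0.1, 0.8, 0.8])   # [left, bottom, width, height] as figure fractions
ax.plot([1, 2, 3, 4], [10, 20, 25, 30])
ax.set_title("Plot drawn on explicitly added axes")
plt.show()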