DATA SCIENCE 6th Sem
Data science is a field that involves using statistical and computational techniques to extract
insights and knowledge from data. It encompasses a wide range of tasks, including data
cleaning and preparation, data visualization, statistical modeling, machine learning, and more.
Data scientists use these techniques to discover patterns and trends in data, make
predictions, and support decision-making. They may work with a variety of data types,
including structured data (such as numbers and dates in a spreadsheet) and unstructured data
(such as text, images, or audio). Data science is used in a wide range of industries, including
finance, healthcare, retail, and more.
2. Medicine
The medical industry uses big data and analytics extensively to improve health in a variety
of ways. For instance, wearable trackers provide important information to physicians, who
can use the data to give better care to their patients. Wearable trackers also show whether
the patient is taking his/her medication and following the right treatment plan.
3. Banking
The banking industry is generally not seen as one that uses technology heavily.
However, this is slowly changing, as bankers increasingly use technology to drive their
decision-making.
4. Construction
It is no surprise that construction companies are beginning to embrace data science and
analytics in a big way. Construction companies track everything from the average time needed
to complete tasks to materials-based expenses and everything in between. Big data is now
being used in a big way in the construction industry to drive better decision-making.
5. Transportation
There is always a need for people to reach their destinations on time and data science and
analytics can be used by transportation providers, both public and private, to increase the
chances of successful journeys. For instance, Transport for London uses statistical data to
map customer journeys, manage unexpected circumstances, and provide people with
personalized transport details.
The five main components (tokens) of the Python language are:
1. The character set
2. Data types
3. Constants
4. Variables
5. Keywords
The character set includes:
1. Letters: A – Z or a – z
2. Digits: 0, 1, ..., 9
4. White spaces: blank space, horizontal tab, carriage return, new line, and form feed.
• The power of a programming language depends, among other things, on the range of
different types of data it can handle.
type() function: this function returns the data type of any object or variable.
Example:
y = 15
print(type(y))   # <class 'int'>
Constants: Constants are fixed values that remain unchanged during the execution of a program and are used in assignment statements.
Variables: Variables are data items whose values may vary during the execution of the program.
Note: Python has no command for declaring a variable.
A variable can have a short name (like x and y) or a more descriptive name (age, surname, total_volume).
Rules for Python variables:
• A variable name must start with a letter or the underscore character
• A variable name cannot start with a number
• A variable name can only contain alpha-numeric characters and underscores (A-z, 0-9, and _)
• Variable names are case-sensitive (ram, Ram, and RAM are three different variables)
Keywords
Keywords are words that have been assigned specific meanings in the context of Python language programs. Keywords should not be used as variable names, to avoid problems. There are 35 keywords in the Python programming language, for example: and, continue, for, lambda, try, async, elif, if, or, yield.
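The full keyword list can be inspected from Python's standard keyword module, for example:
# Print all reserved keywords of the running Python interpreter
import keyword
print(len(keyword.kwlist))   # 35 in recent Python 3 versions (the exact count can vary by version)
print(keyword.kwlist)        # ['False', 'None', 'True', 'and', 'as', ...]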
Data Analysis Process consists of the following phases that are iterative in nature −
Data Collection
Data Collection is the process of gathering information on targeted variables identified as data
requirements. The emphasis is on ensuring accurate and honest collection of data. Data
Collection ensures that data gathered is accurate such that the related decisions are valid.
Data Collection provides both a baseline to measure and a target to improve. Data is collected
from various sources, ranging from organizational databases to information in web pages.
The data thus obtained may not be structured and may contain irrelevant information. Hence,
the collected data must be subjected to Data Processing and Data Cleaning.
Data Processing
The data that is collected must be processed or organized for analysis. This includes structuring the data
as required for the relevant Analysis Tools. For example, the data might have to be placed into rows and
columns in a table within a Spreadsheet or Statistical Application. A Data Model might have to be created.
Data Cleaning
The processed and organized data may be incomplete, contain duplicates, or contain errors. Data Cleaning is the process of preventing and correcting these errors. There are several types of Data Cleaning that depend on the type of data. For example, while cleaning financial data, certain totals might be compared against reliable published numbers or defined thresholds. Likewise, quantitative methods can be used for outlier detection; the outliers would subsequently be excluded from the analysis.
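As a small illustration (a sketch using the pandas library; the column names and values are hypothetical), duplicate rows and missing values can be handled like this:
import pandas as pd

# Hypothetical data with a duplicate row and a missing value
df = pd.DataFrame({"amount": [100, 100, None, 250], "region": ["N", "N", "S", "E"]})
df = df.drop_duplicates()                                  # remove duplicate records
df["amount"] = df["amount"].fillna(df["amount"].mean())    # impute missing values with the mean
print(df)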
Data Analysis
Data that is processed, organized, and cleaned is ready for analysis. Various data analysis techniques are available to understand, interpret, and derive conclusions based on the requirements. Data Visualization may also be used to examine the data in graphical format, to obtain additional insight regarding the messages within the data. Statistical data models such as Correlation and Regression Analysis can be used to identify the relations among the data variables. These models, which are descriptive of the data, are helpful in simplifying the analysis and communicating the results. The process might require additional Data Cleaning or additional Data Collection, and hence these activities are iterative in nature.
The data analysts can choose data visualization techniques, such as tables and charts, which
help in communicating the message clearly and efficiently to the users. The analysis tools
provide facility to highlight the required information with color codes and formatting in tables
and charts.
Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) refers to the method of studying and exploring data sets to understand their main characteristics, discover patterns, locate outliers, and identify relationships between variables. EDA is normally carried out as a preliminary step before undertaking more formal statistical analyses or modeling.
1. Data Cleaning: EDA involves examining the data for errors, missing values, and inconsistencies. It includes techniques such as data imputation, handling missing data, and identifying and removing outliers.
2. Descriptive Statistics: EDA uses descriptive statistics to understand the central tendency, variability, and distribution of variables. Measures like mean, median, mode, standard deviation, range, and percentiles are commonly used.
3. Data Visualization: EDA employs visual techniques to represent the data graphically. Visualizations such as histograms, box plots, scatter plots, line plots, heatmaps, and bar charts assist in identifying patterns, trends, and relationships within the data.
4. Feature Engineering: EDA allows for the exploration of various variables and their transformations to create new features or derive meaningful insights. Feature engineering can involve scaling, normalization, binning, encoding categorical variables, and creating interaction or derived variables.
5. Correlation and Relationships: EDA helps discover relationships and dependencies between variables. Techniques such as correlation analysis, scatter plots, and cross-tabulations offer insights into the strength and direction of relationships between variables.
6. Data Segmentation: EDA can involve dividing the data into meaningful segments based on certain criteria or characteristics. This segmentation helps gain insights into specific subgroups within the data and can lead to more focused analysis.
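A minimal EDA sketch using pandas (the file name and its columns are assumptions made only for illustration):
import pandas as pd

df = pd.read_csv("data.csv")                      # hypothetical dataset
print(df.info())                                  # data types and missing values
print(df.describe())                              # central tendency, spread, percentiles
print(df.select_dtypes(include="number").corr())  # pairwise correlations between numeric variables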
Types of EDA
Depending on the number of columns we are analyzing we can divide EDA into two types.
EDA, or Exploratory Data Analysis, refers to the process of examining and analyzing data sets to uncover patterns, identify relationships, and gain insights. There are various types of EDA techniques that can be employed depending on the nature of the data and the goals of the analysis. Here are some common types of EDA:
1. Univariate Analysis: This type of analysis focuses on examining individual variables in the data set. It involves summarizing and visualizing a single variable at a time to understand its distribution, central tendency, spread, and other relevant statistics. Techniques like histograms, box plots, bar charts, and summary statistics are commonly used in univariate analysis.
2. Bivariate Analysis: Bivariate analysis involves exploring the relationship between two variables. It helps find associations, correlations, and dependencies between pairs of variables. Scatter plots, line plots, correlation matrices, and cross-tabulation are commonly used techniques in bivariate analysis.
4. Time Series Analysis: This type of analysis is mainly applied to data sets that have a temporal component. Time series analysis involves examining and modeling patterns, trends, and seasonality in the data over time. Techniques like line plots, autocorrelation analysis, moving averages, and ARIMA (AutoRegressive Integrated Moving Average) models are commonly used in time series analysis.
5. Missing Data Analysis: Missing data is a common issue in datasets, and it may impact the reliability of the analysis.
7. Data Visualization: Data visualization is a critical part of EDA that involves creating visual representations of the data to facilitate understanding and exploration. Various visualization techniques, such as charts and plots, help reveal patterns and trends in the data.
What is Quantitative Analysis?
Quantitative analysis is the process of collecting and evaluating measurable and verifiable data such as
revenues, market share, and wages in order to understand the behavior and performance of a business.
In the past, business owners and company directors relied heavily on their experience and instinct when
making decisions. However, with data technology, quantitative analysis is now considered a better
approach to making informed decisions.
A quantitative analyst’s main task is to present a given hypothetical situation in terms of numerical
values. Quantitative analysis helps in evaluating performance, assessing financial instruments, and
making predictions. It encompasses three main techniques of measuring data: regression analysis,
linear programming, and data mining.
Quantitative Techniques
1. Regression Analysis
Regression analysis is a common technique that is not only employed by business owners but
also by statisticians and economists. It involves using statistical equations to predict or
estimate the impact of one variable on another. For instance, regression analysis can
determine how interest rates affect consumers’ behavior regarding asset investment. One
other core application of regression analysis is establishing the effect of education and work
experience on employees’ annual earnings.
In the business sector, owners can use regression analysis to determine the impact of
advertising expenses on business profits. Using this approach, a business owner can
establish a positive or negative correlation between two variables.
2. Linear Programming
Linear programming is an optimization technique used to find the best possible outcome, such as maximum profit or minimum cost, subject to a set of linear constraints; businesses use it to decide how to allocate limited resources.
3. Data Mining
Data mining is a combination of computer programming skills and statistical methods. The
popularity of data mining continues to grow in parallel with the increase in the quantity and
size of available data sets. Data mining techniques are used to evaluate very large sets of data
to find patterns or correlations concealed within them.
What is Statistics?
• Statistics is a visual and mathematical portrayal of information. Data science is fundamentally about making calculations with data.
• We make decisions based on that data using mathematical conditions known as models.
• Numerous fields, including data science, machine learning, business intelligence, computer
science, and many others have become increasingly dependent on statistics.
• Descriptive statistics:
Provides ways to summarize data by turning unprocessed observations into understandable
data that is simple to share.
• Inferential Statistics:
With the help of inferential statistics, it is possible to analyze experiments with small samples of
data and draw conclusions about the entire population (entire domain).
Statistical analysis
Statistical analysis is the process of collecting and analyzing data in order to discern patterns
and trends. It is a method for removing bias from evaluating data by employing numerical
analysis. This technique is useful for collecting the interpretations of research, developing
statistical models, and planning surveys and studies.
Statistical analysis is a scientific tool in AI and ML that helps collect and analyze large
amounts of data to identify common patterns and trends to convert them into meaningful
information. In simple words, statistical analysis is a data analysis tool that helps draw
meaningful conclusions from raw and unstructured data.
The conclusions are drawn using statistical analysis facilitating decision-making and helping
businesses make future predictions on the basis of past trends. It can be defined as a science
of collecting and analyzing data to identify trends and patterns and presenting them.
Statistical analysis involves working with numbers and is used by businesses and other
institutions to make use of data to derive meaningful information.
• Descriptive Analysis
Descriptive statistical analysis summarizes the collected data using measures such as the mean, median, mode, and standard deviation, and presents it in tables, charts, and graphs.
• Inferential Analysis
The inferential statistical analysis focuses on drawing meaningful conclusions on the basis of
the data analyzed. It studies the relationship between different variables or makes predictions
for the whole population.
• Predictive Analysis
Predictive statistical analysis is a type of statistical analysis that analyzes data to derive past
trends and predict future events on the basis of them. It uses machine
learning algorithms, data mining, data modelling, and artificial intelligence to conduct the
statistical analysis of data.
• Prescriptive Analysis
The prescriptive analysis conducts the analysis of data and prescribes the best course of
action based on the results. It is a type of statistical analysis that helps you make an informed
decision.
• Exploratory Analysis
Exploratory analysis is similar to inferential analysis, but the difference is that it involves exploring unknown data associations. It analyzes the potential relationships within the data.
• Causal Analysis
The causal statistical analysis focuses on determining the cause and effect relationship
between different variables within the raw data. In simple words, it determines why something
happens and its effect on other variables. This methodology can be used by businesses to
determine the reason for failure.
Major categories of Statistics or its Types
Statistics simply means numerical data, and is a field of math that generally deals with the collection, tabulation, and interpretation of numerical data. It is a form of mathematical analysis that uses different quantitative models to produce a set of experimental data or studies of real life. It is an area of applied mathematics concerned with data collection, analysis, interpretation, and presentation. Statistics deals with how data can be used to solve complex problems. Some people consider statistics to be a distinct mathematical science rather than a branch of mathematics.
Statistics makes work easy and simple and provides a clear and clean picture of work you do on a
regular basis.
• Population –
It is a collection or set of individuals, objects, or events whose properties are to be analyzed.
• Sample –
It is the subset of a population.
Types of Statistics :
1. Descriptive Statistics :
Descriptive statistics uses data that provides a description of the population, either through numerical calculations or graphs or tables. It provides a graphical summary of data and is simply used for summarizing data. There are two categories, as described below.
• (a) Measure of Central Tendency –
• (i) Mean :
It is a measure of the average of all values in a sample set.
For example, the mean of 4, 6, and 8 is (4 + 6 + 8) / 3 = 6.
• (ii) Median :
It is a measure of the central value of a sample set. The data set is ordered from lowest to highest value and then the exact middle is found.
For example, the median of 3, 5, and 9 is 5.
• (iii) Mode :
It is the value that occurs most frequently in a sample set. The value repeated most of the time in the data set is the mode.
For example, the mode of 2, 4, 4, and 7 is 4.
• (b) Measure of Variability –
Measure of Variability is also known as measure of dispersion and is used to describe variability in a sample or population. In statistics, there are three common measures of variability, as shown below:
• (i) Range :
It is a measure of how spread apart the values in a sample set or data set are. Range = maximum value − minimum value.
• (ii) Variance :
It describes how much a random variable differs from its expected value and is computed as the average of the squared deviations from the mean:
Variance = Σ(xi − x̄)² / n
In this formula, n represents the total number of data points, x̄ represents the mean of the data points, and xi represents the individual data points.
• (iii) Dispersion :
It is a measure of the spread of a set of data from its mean.
2. Inferential Statistics :
Inferential statistics makes inferences and predictions about a population based on a sample of data taken from that population. It generalizes a large dataset and applies probabilities to draw a conclusion. It is used for explaining the meaning of descriptive statistics and to analyze, interpret results, and draw conclusions. Inferential statistics is mainly related to and associated with hypothesis testing, whose main target is to reject the null hypothesis.
Hypothesis testing is a type of inferential procedure that uses sample data to evaluate and assess the credibility of a hypothesis about a population. Inferential statistics is generally used to determine how strong the relationship is within the sample. But it is very difficult to obtain a population list and draw a random sample.
Inferential statistics can be carried out with the help of various steps, for example:
6. Collect and gather a sample of children from the population and run the study.
7. Then, perform statistical tests to clarify whether the obtained characteristics of the sample are sufficiently different from what would be expected under the null hypothesis, so that the null hypothesis can be rejected.
Common inferential techniques include:
• Confidence Interval
• T-test or ANOVA
• Pearson Correlation
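For example, a two-sample t-test can be run with SciPy (the sample values below are made up):
from scipy import stats

sample_a = [12, 14, 15, 16, 18, 19]
sample_b = [10, 11, 13, 13, 14, 15]
t_stat, p_value = stats.ttest_ind(sample_a, sample_b)
print(t_stat, p_value)   # reject the null hypothesis if p_value is below the chosen significance level (e.g., 0.05)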
What is Population?
In statistics, population is the entire set of items from which you draw data for a statistical study. It
can be a group of individuals, a set of items, etc. It makes up the data pool for a study.
Generally, population refers to the people who live in a particular area at a specific time. But in
statistics, population refers to data on your study of interest. It can be a group of individuals,
objects, events, organizations, etc. You use populations to draw conclusions.
An example of a population would be the entire student body at a school. It would contain all the
students who study in that school at the time of data collection. Depending on the problem
statement, data from each of these students is collected. An example is the students who speak
Hindi among the students of a school.
For the above situation, it is easy to collect data. The population is small and willing to provide data
and can be contacted. The data collected will be complete and reliable.
If you had to collect the same data from a larger population, say the entire country of India, it would
be impossible to draw reliable conclusions because of geographical and accessibility constraints,
not to mention time and resource constraints. A lot of data would be missing or might be unreliable.
Furthermore, due to accessibility issues, marginalized tribes or villages might not provide data at all,
making the data biased towards certain regions or groups.
What is a Sample?
A sample is defined as a smaller and more manageable representation of a larger group. A subset of
a larger population that contains characteristics of that population. A sample is used in statistical
testing when the population size is too large for all members or observations to be included in the
test.
The sample is an unbiased subset of the population that best represents the whole data.
To overcome the restraints of a population, you can sometimes collect data from a subset of your
population and then consider it as the general norm. You collect the subset information from the
groups who have taken part in the study, making the data reliable. The results obtained for different
groups who took part in the study can be extrapolated to generalize for the population.
The process of collecting data from a small subsection of the population and then using it to
generalize over the entire set is called Sampling.
• The population may be hypothetical and unlimited in size. Take the example of a study that documents the results of a new medical procedure. It is unknown how the procedure will affect people across the globe, so a test group is used to find out how people react to it.
• A good sample should satisfy all the different variations present in the population as well as a well-defined selection criterion.
Say you are looking for a job in the IT sector, so you search online for IT jobs. The first search result
would be for jobs all around the world. But you want to work in India, so you search for IT jobs in
India. This would be your population. It would be impossible to go through and apply for all positions
in the listing. So you consider the top 30 jobs you are qualified for and satisfied with and apply for
those. This is your sample.
Measures of Central Tendency in Statistics
Central tendencies in statistics are numerical values that are used to represent the mid-value or central value of a large collection of numerical data. These obtained numerical values are called central or average values in statistics. A central or average value of any statistical data or series is the value of that variable that is representative of the entire data or its associated frequency distribution. Such a value is of great significance because it depicts the nature or characteristics of the entire data, which is otherwise very difficult to observe.
Mean
Mean in general terms is used for the arithmetic mean of the data, but other than the arithmetic mean
there are geometric mean and harmonic mean as well that are calculated using different formulas.
Here in this article, we will discuss the arithmetic mean.
Mean for Ungrouped Data
Arithmetic mean (x̄) is defined as the sum of the individual observations (xi) divided by the total number of observations N. In other words, the mean is given by the sum of all observations divided by the total number of observations:
x̄ = Σxi / N
Example: If there are 5 observations, which are 27, 11, 17, 19, and 21, then the mean (x̄) is given by
x̄ = (27 + 11 + 17 + 19 + 21) ÷ 5 = 95 ÷ 5 = 19
Median
The Median of any distribution is that value that divides the distribution into two equal parts such
that the number of observations above it is equal to the number of observations below it. Thus, the
median is called the central value of any given data either grouped or ungrouped.
To calculate the median, the observations must be arranged in ascending or descending order. If the total number of observations is N, then there are two cases:
Case 1: N is odd – the median is the value of the ((N + 1)/2)th observation.
Case 2: N is even – the median is the average of the (N/2)th and ((N/2) + 1)th observations.
Mode
The mode is the value of the observation that has the maximum frequency corresponding to it. In other words, it is the observation that occurs the maximum number of times in a dataset.
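These three measures can be computed directly with Python's built-in statistics module, for example:
import statistics

data = [27, 11, 17, 19, 21, 19]
print(statistics.mean(data))     # arithmetic mean: 19.0
print(statistics.median(data))   # middle value of the sorted data: 19.0
print(statistics.mode(data))     # most frequent value: 19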
Measures of Dispersion
Measures of dispersion capture the variation between different values of the data.
We can understand the measure of dispersion by studying the following example: suppose we have 10 students in a class and the marks scored by them in a Mathematics test are 12, 14, 18, 9, 11, 7, 9, 16, 19, and 20 out of 20. Then the average value scored by the students in the class is
Mean = 135/10 = 13.5
Mean Deviation = {|12-13.5| + |14-13.5| + |18-13.5| + |9-13.5| + |11-13.5| + |7-13.5| + |9-13.5| + |16-13.5| + |19-13.5| + |20-13.5|}/10 = 39/10 = 3.9
These measures of dispersion can be further divided into various categories; within each category, the parameters are expressed in the same unit.
These measures of dispersion are measured and expressed in the units of data themselves. For
example – Meters, Dollars, Kg, etc. Some absolute measures of dispersion are:
Range: Range is defined as the difference between the largest and the smallest value in the
distribution.
Mean Deviation: Mean deviation is the arithmetic mean of the difference between the values and their
mean.
Standard Deviation: Standard Deviation is the square root of the arithmetic average of the square of
the deviations measured from the mean.
Variance: Variance is defined as the average of the square deviation from the mean of the given data
set.
Quartile Deviation: Quartile deviation is defined as half of the difference between the third quartile
and the first quartile in a given data set.
Interquartile Range: The difference between the upper (Q3) and lower (Q1) quartiles is called the Interquartile Range. The formula for the Interquartile Range is Q3 – Q1.
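These measures can be computed with NumPy; the sketch below reuses the marks from the earlier example:
import numpy as np

marks = np.array([12, 14, 18, 9, 11, 7, 9, 16, 19, 20])
print(marks.max() - marks.min())              # range
print(np.mean(np.abs(marks - marks.mean())))  # mean deviation about the mean
print(marks.var())                            # variance (population)
print(marks.std())                            # standard deviation
q1, q3 = np.percentile(marks, [25, 75])
print(q3 - q1, (q3 - q1) / 2)                 # interquartile range and quartile deviation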
Skewness
Skewness is an important statistical technique that helps to determine the asymmetrical behavior of a frequency distribution, or more precisely, the lack of symmetry of the tails, both left and right, of the frequency curve. A distribution or dataset is symmetric if it looks the same to the left and right of the center point.
Types of Skewness
2. Asymmetric Skewness: An asymmetrical or skewed distribution is one in which the spread of the frequencies is different on the two sides of the center point, or the frequency curve is more stretched towards one side; the values of Mean, Median, and Mode fall at different points.
• Positive Skewness: In this case, the concentration of frequencies is more towards the higher values of the variable, i.e. the right tail is longer than the left tail.
• Negative Skewness: In this case, the concentration of frequencies is more towards the lower values of the variable, i.e. the left tail is longer than the right tail.
Kurtosis:
It is also a characteristic of the frequency distribution. It gives an idea about the shape of a
frequency distribution. Basically, the measure of kurtosis is the extent to which a frequency
distribution is peaked in comparison with a normal curve. It is the degree of peakedness of a
distribution.
Types of Kurtosis
1. Leptokurtic: A leptokurtic curve has a higher peak than the normal distribution. In this curve, there is too much concentration of items near the central value.
2. Mesokurtic: A mesokurtic curve has a peak similar to that of the normal curve. In this curve, there is an even distribution of items around the central value.
3. Platykurtic: A platykurtic curve has a lower peak than the normal curve. In this curve, there is less concentration of items around the central value.
Difference between Skewness and Kurtosis:
1. Skewness indicates the shape and size of variation on either side of the central value; kurtosis indicates the frequencies of the distribution at the central value.
2. The measure of skewness tells us about the magnitude and direction of the asymmetry of a distribution; kurtosis indicates the concentration of items at the central part of a distribution.
3. Skewness indicates how far the distribution differs from the normal distribution; kurtosis studies the divergence of the given distribution from the normal distribution.
4. The measure of skewness studies the extent to which deviations cluster above or below the average; kurtosis indicates the concentration of items.
2. Regression :
Regression analysis is used to predict the value of the dependent variable based on the known value of the independent variable, assuming an average mathematical relationship between two or more variables.
Difference between correlation and regression:
• In correlation, there is no difference between the two variables; in regression, the two variables serve different purposes.
• In correlation, both variables are mutually dependent; in regression, one variable is independent, while the other is dependent.
Machine learning derives insightful information from large volumes of data by leveraging algorithms
to identify patterns and learn in an iterative process. ML algorithms use computation methods to
learn directly from data instead of relying on any predetermined equation that may serve as a model.
1. Supervised learning
This type of ML involves supervision, where machines are trained on labeled datasets and enabled to predict outputs based on the provided training. The labeled dataset specifies that some input and output parameters are already mapped. Hence, the machine is trained with the input and corresponding output. A device is made to predict the outcome using the test dataset in subsequent phases.
2. Unsupervised learning
Unsupervised learning refers to a learning technique that is devoid of supervision. Here, the machine is trained using an unlabeled dataset and is enabled to predict the output without any supervision. An unsupervised learning algorithm aims to group the unsorted dataset based on the input's similarities, differences, and patterns.
3. Semi-supervised learning
4. Reinforcement learning
1. Linear Regression
Linear regression is one of the most popular and simple machine learning algorithms used for predictive analysis. Here, predictive analysis means predicting something, and linear regression makes predictions for continuous numbers such as salary, age, etc. It shows the linear relationship between the dependent and independent variables, and shows how the dependent variable (y) changes according to the independent variable (x).
It tries to fit the best line between the dependent and independent variables, and this best fit line is known as the regression line. The equation for the regression line is:
y = a0 + a1*x
Here, y = dependent variable, x = independent variable, a0 = intercept of the line, and a1 = slope (linear regression coefficient).
2. Logistic Regression
Logistic regression is similar to linear regression except in how it is used: linear regression is used to solve regression problems and predict continuous values, whereas logistic regression is used to solve classification problems and predict discrete values.
A decision tree is a supervised learning algorithm that is mainly used to solve classification problems but can also be used for solving regression problems. It can work with both categorical variables and continuous variables. It shows a tree-like structure that includes nodes and branches, and starts with the root node that expands into further branches until the leaf nodes. The internal nodes are used to represent the features of the dataset, branches show the decision rules, and leaf nodes represent the outcome of the problem.
A support vector machine or SVM is a supervised learning algorithm that can also be used for
classification and regression problems. However, it is primarily used for classification problems. The
goal of SVM is to create a hyperplane or decision boundary that can segregate datasets into different
classes.
The data points that help to define the hyperplane are known as support vectors, and hence it is
named as support vector machine algorithm.
The Naïve Bayes classifier is a supervised learning algorithm which is used to make predictions based on the probability of the object. The algorithm is named Naïve Bayes as it is based on Bayes' theorem and follows the naïve assumption that the variables are independent of each other.
Bayes' theorem is based on conditional probability; it gives the likelihood that event A will happen given that event B has already happened. The equation for Bayes' theorem is given as:
P(A|B) = P(B|A) * P(A) / P(B)
K-Nearest Neighbour is a supervised learning algorithm that can be used for both classification and regression problems. This algorithm works by assuming similarities between the new data point and the available data points. Based on these similarities, the new data points are put in the most similar categories. It is also known as the lazy learner algorithm, as it stores all the available datasets and classifies each new case with the help of its K neighbours. The new case is assigned to the nearest class with the most similarities, and a distance function measures the distance between the data points.
Linear Regression
Linear regression is one of the easiest and most popular Machine Learning algorithms. It
is a statistical method that is used for predictive analysis. Linear regression makes
predictions for continuous/real or numeric variables such as sales, salary, age, product
price, etc.
The linear regression algorithm shows a linear relationship between a dependent (y) and one or more independent (x) variables, hence it is called linear regression. Since linear regression shows a linear relationship, it finds how the value of the dependent variable changes according to the value of the independent variable.
The linear regression model provides a sloped straight line representing the relationship between the variables.
Mathematically, we can represent a linear regression as:
y = a0 + a1x + ε
Here, y = dependent variable (target variable), x = independent variable (predictor variable), a0 = intercept of the line, a1 = linear regression coefficient (slope), and ε = random error.
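A minimal sketch of fitting a simple linear regression with scikit-learn (the data values are made up for illustration):
import numpy as np
from sklearn.linear_model import LinearRegression

# x = years of experience (independent), y = salary in thousands (dependent), hypothetical values
x = np.array([[1], [2], [3], [4], [5]])
y = np.array([30, 35, 41, 44, 50])

model = LinearRegression().fit(x, y)
print(model.intercept_, model.coef_)   # a0 (intercept) and a1 (slope)
print(model.predict([[6]]))            # predicted salary for 6 years of experience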
Let's say we have two hypotheses for a task, h(x) and h'(x). How would we know which one is better? From a high-level perspective, we might take the following steps:
1. Measure the performance of both hypotheses on the same test data (for example, their accuracy).
2. Determine whether there is any statistical significance between the two results. If there is, select the better performing hypothesis. If not, we cannot say with any statistical certainty that either h(x) or h'(x) is better.
When we have a classification task, we will consider the accuracy of our model by its ability to assign an instance to its correct class. Consider this on a binary level. We have two classes, 1 and 0. We would classify a correct prediction, therefore, as being when the model classifies a class 1 instance as class 1, or a class 0 instance as class 0. Assuming our 1 class is the 'Positive class' and the 0 class is the 'Negative class', we can build a table that outlines all the possibilities our model might produce:
• Predicted 1, Actual 1 → True Positive
• Predicted 0, Actual 0 → True Negative
• Predicted 1, Actual 0 → False Positive
• Predicted 0, Actual 1 → False Negative
We also have names for these classifications. Our True Positive and True Negative are our
correct classifications, as we can see in both cases, the actual class and the predicted
class are the same. The other two classes, in which the model predicts incorrectly, can be
explained as follows:
• False Positive — when the model predicts 1, but the actual class is 0, also known
as Type I Error
• False Negative — when the model predicts 0, but the actual class is 1, also known
as Type II Error
When we take a series of instances and populate the above table with frequencies of how
often we observe each classification, we have produced what is known as
a confusion matrix. This is a good method to begin evaluating a hypothesis that goes a
little bit further than a simple accuracy rate. With this confusion matrix, we can define the
accuracy rate and we can also define a few other metrics to see how well our model is
performing. We use the shortened abbreviations False Positive (FP), False Negative (FN),
True Positive (TP) and True Negative (TN).
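A short sketch showing how the confusion matrix and these metrics can be computed with scikit-learn (the actual and predicted label lists are made up):
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

actual    = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()
print(tp, tn, fp, fn)                      # TP, TN, FP, FN counts
print(accuracy_score(actual, predicted))   # (TP + TN) / total
print(precision_score(actual, predicted))  # TP / (TP + FP)
print(recall_score(actual, predicted))     # TP / (TP + FN)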
Support Vector Machine or SVM
Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems. However,
primarily, it is used for Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point in
the correct category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These
extreme cases are called support vectors, and hence the algorithm is termed Support
Vector Machine. The two different categories are classified using a decision boundary or
hyperplane.
SVM algorithm can be used for Face detection, image classification, text
categorization, etc.
Types of SVM
SVM can be of two types:
o Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means that if a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
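A minimal SVM sketch with scikit-learn (the tiny dataset is made up; kernel='linear' gives a Linear SVM, while kernel='rbf' would give a non-linear one):
from sklearn.svm import SVC

# Two features per sample, two classes (hypothetical points)
X = [[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]]
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear").fit(X, y)
print(clf.support_vectors_)   # the extreme points that define the hyperplane
print(clf.predict([[4, 4]]))  # class of a new data point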
Decision Tree
>In a decision tree, there are two types of nodes: the Decision Node and the Leaf Node. Decision nodes are used to make any decision and have multiple branches, whereas leaf nodes are the output of those decisions and do not contain any further branches.
>The decisions or the test are performed on the basis of features of the given dataset.
>It is a graphical representation for getting all the possible solutions to a problem/decision
based on given conditions.
>It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
>In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.
>A decision tree simply asks a question, and based on the answer (Yes/No), it further splits the tree into subtrees.
o It may have an overfitting issue, which can be resolved using the Random Forest
algorithm.
o For more class labels, the computational complexity of the decision tree may increase.
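A short sketch of training a decision tree with scikit-learn (which implements the CART algorithm mentioned above; the iris dataset is used only as an illustration):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3)   # limiting depth helps reduce overfitting
tree.fit(X_train, y_train)
print(tree.score(X_test, y_test))            # accuracy on unseen data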
Random Forest
Random Forest is a popular machine learning algorithm that belongs to the supervised
learning technique. It can be used for both Classification and Regression problems in ML.
It is based on the concept of ensemble learning, which is a process of combining multiple
classifiers to solve a complex problem and to improve the performance of the model.
As the name suggests, "Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of predictions, predicts the final output.
The greater number of trees in the forest leads to higher accuracy and prevents the
problem of overfitting.
There are mainly four sectors where Random Forest is mostly used:
1. Banking: Banking sector mostly uses this algorithm for the identification of loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the disease can
be identified.
3. Land Use: We can identify the areas of similar land use by this algorithm.
o It enhances the accuracy of the model and prevents the overfitting issue.
o Although random forest can be used for both classification and regression tasks, it is not as well suited to regression tasks.
Naïve Bayes
o Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes
theorem and used for solving classification problems.
o Naïve Bayes Classifier is one of the simple and most effective Classification algorithms
which helps in building the fast machine learning models that can make quick
predictions.
o Some popular examples of Naïve Bayes Algorithm are spam filtration, Sentimental
analysis, and classifying articles.
o Naïve Bayes is one of the fast and easy ML algorithms to predict a class of datasets.
o Naive Bayes assumes that all features are independent or unrelated, so it cannot learn
the relationship between features.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine
the probability of a hypothesis with prior knowledge. It depends on the conditional
probability.
Where,
P(B|A) is Likelihood probability: Probability of the evidence given that the probability of a
hypothesis is true.---------------------------------P(A) is Prior Probability: Probability of hypothesis
before observing the evidence.
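A minimal Naïve Bayes sketch using scikit-learn's Gaussian variant (the choice of dataset is only for illustration):
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
model = GaussianNB().fit(X, y)
print(model.predict(X[:3]))        # predicted classes for the first three samples
print(model.predict_proba(X[:3]))  # posterior probabilities P(class | features)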
Since 2016, automated feature engineering is also used in different machine learning
software that helps in automatically extracting features from raw data. Feature engineering
in ML contains mainly four processes: Feature Creation, Transformations, Feature
Extraction, and Feature Selection.
1. Feature Creation: Feature creation is finding the most useful variables to be used in a predictive model. The process is subjective, and it requires human creativity and intervention. The new features are created by combining existing features using addition, subtraction, and ratios, and these new features have great flexibility.
4. Feature Selection: While developing the machine learning model, only a few variables in the dataset are useful for building the model, and the rest of the features are either redundant or irrelevant. If we input the dataset with all these redundant and irrelevant features, it may negatively impact and reduce the overall performance and accuracy of the model. Hence it is very important to identify and select the most appropriate features from the data and remove the irrelevant or less important features, which is done with the help of feature selection in machine learning. "Feature selection is a way of selecting the subset of the most relevant features from the original feature set by removing the redundant, irrelevant, or noisy features."
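One common way to do this in practice is univariate feature selection with scikit-learn's SelectKBest, sketched below (the choice of dataset and of k are arbitrary):
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=f_classif, k=2)   # keep the 2 most relevant features
X_new = selector.fit_transform(X, y)
print(X.shape, "->", X_new.shape)                   # (150, 4) -> (150, 2)
print(selector.get_support())                       # mask of the selected features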
Principal Component Analysis
Principal Component Analysis is an unsupervised learning algorithm that is used for the
dimensionality reduction in machine learning. It is a statistical process that converts the
observations of correlated features into a set of linearly uncorrelated features with the help
of orthogonal transformation. These new transformed features are called the Principal
Components. It is one of the popular tools that is used for exploratory data analysis and
predictive modeling. It is a technique to draw strong patterns from the given dataset by
reducing the variances.
PCA generally tries to find the lower-dimensional surface onto which to project the high-dimensional data.
PCA works by considering the variance of each attribute, because an attribute with high variance shows a good split between the classes, and hence it reduces the dimensionality. Some real-world applications of PCA are image processing, movie recommendation systems, and optimizing the power allocation in various communication channels. It is a feature extraction technique, so it keeps the important variables and drops the least important variables.
o Correlation: It signifies that how strongly two variables are related to each other. Such
as if one changes, the other variable also gets changed. The correlation value ranges
from -1 to +1. Here, -1 occurs if variables are inversely proportional to each other, and
+1 indicates that variables are directly proportional to each other.
o Orthogonal: It defines that variables are not correlated to each other, and hence the
correlation between the pair of variables is zero.
o Covariance Matrix: A matrix containing the covariance between the pair of variables is
called the Covariance Matrix.
As described above, the transformed new features, or the output of PCA, are the Principal Components. The number of these PCs is either equal to or less than the number of original features present in the dataset. Some properties of these principal components are given below:
o The principal component must be the linear combination of the original features.
o These components are orthogonal, i.e., the correlation between a pair of variables is
zero.
o By reducing the dimensions of the features, the space required to store the dataset
also gets reduced.
o Reduced dimensions of features of the dataset help in visualizing the data quickly.
There are also some disadvantages of applying dimensionality reduction, which are given below:
o Some amount of information (variance) may be lost when the number of dimensions is reduced.
o The principal components are linear combinations of the original features and are therefore harder to interpret.
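A minimal PCA sketch with scikit-learn (the dataset and the number of components are arbitrary choices for illustration):
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)             # keep the two directions with the largest variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                # (150, 2)
print(pca.explained_variance_ratio_)  # share of the total variance captured by each component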
Anaconda installs the latest Python 2 or 3 version in an isolated, activated environment, so any other installed Python version doesn't cause issues for your projects. It is also beginner-friendly: you don't need any prior coding/programming knowledge to get started.
Installation Process
Anaconda Navigator
Anaconda Navigator contains lots of stuff inside it. So let’s understand which stuff we need for
our next data science project.
1. Jupyter Notebook
Jupyter Notebook is a web-based, interactive competing notebook environment. You can edit
and run human-readable docs while describing the data analysis. The Jupyter Notebook is an
open-source web application that allows you to create and share documents that contain live
code, equations, visualizations, and narrative text. Uses include data cleaning and
transformation, numerical simulation, statistical modeling, data visualization, machine
learning, and much more.
2. JupyterLab
It’s an extensible environment for interactive and reproducible computing, based on the
Jupyter Notebook and its architecture. JupyterLab enables you to work with documents and activities such as Jupyter notebooks, text editors, terminals, and custom components in a flexible, integrated, and extensible manner.
3. Spyder
One of the most important and powerful Python IDEs is Spyder. Spyder is another good open-source and cross-platform IDE written in Python. It is also called the Scientific Python Development IDE, and it is the most lightweight IDE for Python. It is mainly used by data scientists and integrates with Matplotlib, SciPy, NumPy, Pandas, Cython, IPython, SymPy, and other open-source software.
4. RStudio
When it comes to the data science world, Python and R are the two programming languages that most often come to mind. RStudio is an integrated development environment (IDE) for the R programming language. It provides literate programming tools, which basically allow the use of R scripts, outputs, text, and images in reports, Word documents, and even HTML files.
Python Data Types for Data Science
Data types refer to the categorization or classification of data components. It stands for the
kind of value that defines the possible operations on a given piece of data.
In other words, Data types are a specific class of data item that can be identified by the values
it can accept, the programming language that can be used to create it, or the actions that can
be carried out on it.
The main standard data types in Python are given below:
1. Numeric − int, float, complex
2. Dictionary − dict
3. Boolean − bool
4. Set − set
5. Sequence Type − list, tuple, range
6. String − str
Python's numeric data types are used to represent data that has a numeric value. There are three numeric types: an integer belonging to the int class, a floating-point number belonging to the float class, and a complex number belonging to the complex class.
Integer − Positive and negative whole numbers without fractions or decimals. Integers belong to the int class, and there is no restriction on the length of integer numbers in Python.
Float − A real number with a floating-point representation, indicated by a decimal point. An e or E may be added after the number to designate scientific notation.
Complex Number − The complex class serves as the representation for complex numbers. As an example, 4+5j is described as (real part) + (imaginary part)j.
Note − To identify the type of data, use the type() method.
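A short example of the three numeric classes and the type() check:
a = 42                             # int: whole number, positive or negative, unlimited length
b = 3.14e2                         # float: real number; e/E denotes scientific notation (314.0)
c = 4 + 5j                         # complex: (real part) + (imaginary part)j
print(type(a), type(b), type(c))   # <class 'int'> <class 'float'> <class 'complex'>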
Python Dictionary
A dictionary in Python is an unordered collection of data values used to store data values like a map. Unlike other data types that hold only a single value as an element, dictionaries hold key-value pairs. Key-value pairs are used to make the dictionary more efficient. When writing a dictionary, each key is separated from its value by a colon, and the key-value pairs are separated by commas.
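Example:
student = {"name": "Ram", "age": 21, "marks": 78.5}   # keys and values separated by colons, pairs by commas
print(student["name"])        # access a value by its key
student["age"] = 22           # update an existing key
student["city"] = "Delhi"     # add a new key-value pair
print(student)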
Python Boolean
Data with the predetermined values True or False. Objects equal to False are falsy (false), while objects equal to True are truthy (true). However, it is also possible to evaluate and categorize non-Boolean objects in a Boolean context. The bool class is used to represent it.
In Python, a set is a non-duplicate collection of data types that may be iterated through and
changed. A set may have a variety of components, but the placement of the parts is not fixed.
Unordered objects are grouped together as a set. There cannot be any duplicates of any set
element, and it must be immutable (cannot be changed).
Due to the set's unordered nature, indexing will be useless. As a result, the slicing operator []
is useless.
Creation of set
Sets can be built with the built-in set() method from an iterable object or a sequence, or by wrapping the elements in curly brackets and separating them with commas. The elements in a set don't have to be of the same type; a set can contain a variety of mixed data type values.
Example:
# Create a set from a list using the set() function
s = set([1, 2, 3, 4, 5])
print(s)   # Output: {1, 2, 3, 4, 5}
# Create a set using curly braces
s = {1, 2, 3, 4, 5}
print(s)   # Output: {1, 2, 3, 4, 5}
Python Sequence
A sequence in Python is an ordered grouping of similar or different data types. Sequences enable the ordered and efficient storage of multiple values. In Python, there are various sequence types. They are given below − list, tuple, range.
List Data Type
A list can be formed by putting all the elements in square brackets, with the elements separated by commas. Elements can be of any data type (even another list), and they can be traversed using an iterator or accessed using an index.
Example:
# Create a list using square brackets
l = [1, 2, 3, 4, 5]
print(l)      # Output: [1, 2, 3, 4, 5]
# Access an item in the list using its index
print(l[1])   # Output: 2
Tuple Data Type
Tuples are similar to lists, but they can’t be modified once they are created. Tuples are
commonly used to store data that should not be modified, such as configuration
settings or data that is read from a database.
Python Range
The range data type represents an immutable sequence of numbers. It is similar to a list, but it is more memory-efficient and faster to iterate over.
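Example showing a tuple and a range:
t = (10, 20, 30)     # tuple: ordered and immutable
print(t[0])          # 10
r = range(1, 6)      # range: immutable sequence of the numbers 1 to 5
print(list(r))       # [1, 2, 3, 4, 5]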
Python String
A string is a sequence of Unicode characters: a grouping of one or more characters enclosed in single, double, or triple quotation marks. There is no separate character data type in Python; instead, a character is simply a string of length 1. The class str is used to represent strings.
Strings can be used for a variety of actions, including concatenation, slicing, and
repetition.
Python Operators
Operators are used to perform operations on variables and values.
In the example below, we use the + operator to add together two values:
print(10 + 5)
Arithmetic operators are used with numeric values to perform common mathematical operations:
Operator | Name | Example
+ | Addition | x + y
- | Subtraction | x - y
* | Multiplication | x * y
/ | Division | x / y
% | Modulus | x % y
** | Exponentiation | x ** y
// | Floor division | x // y
Assignment operators are used to assign values to variables:
Operator | Example | Same As
= | x = 5 | x = 5
+= | x += 3 | x = x + 3
-= | x -= 3 | x = x - 3
*= | x *= 3 | x = x * 3
/= | x /= 3 | x = x / 3
%= | x %= 3 | x = x % 3
//= | x //= 3 | x = x // 3
**= | x **= 3 | x = x ** 3
|= | x |= 3 | x = x | 3
Comparison operators are used to compare two values:
Operator | Name | Example
== | Equal | x == y
!= | Not equal | x != y
Logical operators are used to combine conditional statements:
Operator | Description | Example
and | Returns True if both statements are true | x < 5 and x < 10
not | Reverses the result, returns False if the result is true | not(x < 5 and x < 10)
Identity operators are used to compare objects, not to check if they are equal, but if they are actually the same object, with the same memory location:
Operator | Description | Example
is not | Returns True if both variables are not the same object | x is not y
Membership operators are used to test if a sequence is present in an object:
Operator | Description | Example
in | Returns True if a sequence with the specified value is present in the object | x in y
not in | Returns True if a sequence with the specified value is not present in the object | x not in y
Bitwise operators are used to compare (binary) numbers:
Operator | Name | Description | Example
<< | Zero fill left shift | Shift left by pushing zeros in from the right and let the leftmost bits fall off | x << 2
>> | Signed right shift | Shift right by pushing copies of the leftmost bit in from the left, and let the rightmost bits fall off | x >> 2
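A few of these operators in action:
x, y = 5, 3
print(x + y, x ** y, x // y, x % y)         # 8 125 1 2
print(x > 2 and y < 10)                     # True
print(x is y, x is not y)                   # False True
print(3 in [1, 2, 3], 7 not in [1, 2, 3])   # True True
print(x << 2, x >> 1)                       # 20 2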
What is NumPy
NumPy stands for Numerical Python; it is a Python package for the computation and processing of single-dimensional and multidimensional array elements.
Travis Oliphant created the NumPy package in 2005 by combining the features of the ancestor module Numeric with those of another module, Numarray.
NumPy provides a convenient and efficient way to handle the vast amount of data. NumPy
is also very convenient with Matrix multiplication and data reshaping. NumPy is fast which
makes it reasonable to work with a large set of data.
There are the following advantages of using NumPy for data analysis.
5. NumPy provides the in-built functions for linear algebra and random number
generation.
Nowadays, NumPy in combination with SciPy and Matplotlib is used as a replacement for MATLAB, as Python is a more complete and easier programming language than MATLAB.
Prerequisite
Before learning Python Numpy, you must have the basic knowledge of Python concepts.
Numpy function –
Quite understandably, NumPy contains a large number of various mathematical
operations. NumPy provides standard trigonometric functions, functions for arithmetic
operations, handling complex numbers, etc.
Trigonometric Functions
NumPy has standard trigonometric functions which return trigonometric ratios for a given
angle in radians.
Example:
import numpy as np
a = np.array([0, 30, 45, 60, 90])
print('Sine of different angles:')
# Convert to radians by multiplying by pi/180
print(np.sin(a * np.pi / 180))
print('Cosine values for angles in array:')
print(np.cos(a * np.pi / 180))
print('Tangent values for given angles:')
print(np.tan(a * np.pi / 180))
The arcsin, arccos, and arctan functions return the trigonometric inverse of sin, cos, and tan of the given angle. The result of these functions can be verified with the numpy.degrees() function by converting radians to degrees.
numpy.around()
This function returns the value rounded to the desired precision. The function takes the following parameters:
numpy.around(a, decimals)
Where,
1. a – Input data
2. decimals – The number of decimals to round to. Default is 0. If negative, the integer is rounded to the position to the left of the decimal point.
Example:
import numpy as np
a = np.array([1.0, 5.55, 123, 0.567, 25.532])
print('Original array:')
print(a)
print('After rounding:')
print(np.around(a))
print(np.around(a, decimals=1))
print(np.around(a, decimals=-1))
It produces the following output:
Original array: [ 1. 5.55 123. 0.567 25.532]
After rounding: [ 1. 6. 123. 1. 26. ]
[ 1. 5.6 123. 0.6 25.5]
[ 0. 10. 120. 0. 30. ]
numpy.floor()
This function returns the largest integer not greater than the input parameter. The floor of the scalar x is the largest integer i such that i <= x. Note that flooring rounds towards negative infinity, so for example np.floor(-0.5) gives -1.0.
What is SciPy
SciPy is an open-source scientific library for Python that is distributed under a BSD license. It is used to solve complex scientific and mathematical problems. It is built on top of the NumPy extension, which means that if we import SciPy, there is no need to import NumPy. SciPy is pronounced as 'Sigh Pie', and it depends on NumPy, including appropriate and fast N-dimensional array manipulation.
It provides many user-friendly and effective numerical functions for numerical
integration and optimization.
The SciPy library supports integration, gradient optimization, special functions,
ordinary differential equation solvers, parallel programming tools, and many more.
We can say that SciPy implementation exists in every complex numerical
computation.
The scipy is a data-processing and system-prototyping environment as similar to
MATLAB. It is easy to use and provides great flexibility to scientists and
engineers.
History
Python was expanded in the 1990s to include an array type for numerical computing called Numeric. This Numeric package was replaced by NumPy (a blend of Numeric and NumArray) in 2006. There was a growing number of extension modules, and developers were interested in creating a complete environment for scientific and technical computing. Travis Oliphant, Eric Jones, and Pearu Peterson merged code they had written and called the new package SciPy. The newly created package provided a standard collection of common numerical operations on top of NumPy.
Why use SciPy?
SciPy contains significant mathematical algorithms that make it easy to develop sophisticated and dedicated applications. Being an open-source library, it has a large community across the world contributing to the development of its additional modules, and it is very beneficial for scientific applications and data scientists.
Numpy vs. SciPy
NumPy and SciPy are both used for mathematical and numerical analysis. NumPy is suitable for basic operations such as sorting, indexing, and many more, because it contains the array data, whereas SciPy consists of all the numerical code.
NumPy contains many functions that are used for linear algebra, Fourier transforms, etc., whereas the SciPy library contains a full-featured version of the linear algebra module as well as many other numerical algorithms.
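A small sketch of SciPy in use, numerically integrating x squared from 0 to 3 (the exact answer is 9):
from scipy import integrate

result, error = integrate.quad(lambda x: x ** 2, 0, 3)
print(result)   # approximately 9.0
print(error)    # estimate of the numerical error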
Data operation
Python handles data of various formats mainly through the two libraries, Pandas and
Numpy. We have already seen the important features of these two libraries in the previous
chapters. In this chapter we will see some basic examples from each of the libraries on
how to operate on data.
numpy.array
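A basic example of creating and operating on a NumPy array:
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])   # 2-dimensional array from a nested list
print(a.shape)            # (2, 3)
print(a.dtype)            # int64 (platform dependent)
print(a * 2)              # element-wise multiplication
print(a.reshape(3, 2))    # data reshaping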
Pandas Series
Series is a one-dimensional labeled array capable of holding data of any type (integer,
string, float, python objects, etc.). The axis labels are collectively called index. A pandas
Series can be created using the following constructor −pandas.Series( data, index, dtype,
copy)
Example: Here we create a Series from a NumPy array.
# Import the pandas library, aliasing it as pd
import pandas as pd
import numpy as np
data = np.array(['a', 'b', 'c', 'd'])
s = pd.Series(data)
print(s)
Pandas DataFrame
A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns. A pandas DataFrame can be created using the following constructor − pandas.DataFrame(data, index, columns, dtype, copy)
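Example of creating a DataFrame from a dictionary of columns:
import pandas as pd

data = {"Name": ["Tom", "Jack", "Steve"], "Age": [28, 34, 29]}
df = pd.DataFrame(data)
print(df)
#     Name  Age
# 0    Tom   28
# 1   Jack   34
# 2  Steve   29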
A panel is a 3D container of data. The term Panel data is derived from econometrics and is partially responsible for the name pandas − pan(el)-da(ta)-s. (Note that the Panel class has been removed from recent versions of pandas; three-dimensional data is now usually handled with a MultiIndex DataFrame or the xarray library.)
Data Visualization using Matplotlib
Data Visualization using Matplotlib is the process of presenting data in the form of graphs
or charts. It helps to understand large and complex amounts of data very easily. It allows
the decision-makers to make decisions very efficiently and also allows them in identifying
new trends and patterns very easily. It is also used in high-level data analysis for Machine
Learning and Exploratory Data Analysis (EDA). Data visualization can be done with
various tools like Tableau, Power BI, Python.
In this article, we will discuss how to visualize data with the help of the Matplotlib library of
Python.
Matplotlib
Matplotlib is a low-level library of Python which is used for data visualization. It is easy to
use and emulates MATLAB like graphs and visualization. This library is built on the top of
NumPy arrays and consist of several plots like line chart, bar chart, histogram, etc. It
provides a lot of flexibility but at the cost of writing more code.
Pyplot
Pyplot is a module of the Matplotlib library that provides a MATLAB-like interface of functions (such as plot(), title(), and show()) for building figures. After this brief look at Matplotlib and pyplot, let's see how to create a simple plot.
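A minimal example (the data values are arbitrary):
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 1, 5, 3]
plt.plot(x, y)   # draw a line chart
plt.show()       # display the figure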
Adding Title
The title() method in the matplotlib module is used to specify the title of the visualization depicted, and it displays the title using various attributes.
Syntax: matplotlib.pyplot.title(label)
In layman's terms, the X label and the Y label are the titles given to the X-axis and Y-axis respectively. These can be added to the graph by using the xlabel() and ylabel() methods.
Syntax: matplotlib.pyplot.xlabel(xlabel) and matplotlib.pyplot.ylabel(ylabel)
You might have seen that Matplotlib automatically sets the values and the markers(points)
of the X and Y axis, however, it is possible to set the limit and markers
manually. xlim() and ylim() functions are used to set the limits of the X-axis and Y-axis
respectively. Similarly, xticks() and yticks() functions are used to set tick labels.
Adding Legends
A legend is an area describing the elements of the graph. In simple terms, it reflects the
data displayed in the graph’s Y-axis. It generally appears as the box containing a small
sample of each color on the graph and a small description of what this data means.
The attribute bbox_to_anchor=(x, y) of legend() function is used to specify the coordinates
of the legend, and the attribute ncol represents the number of columns that the legend
has. Its default value is 1.
Syntax: matplotlib.pyplot.legend(["label1", "label2"], bbox_to_anchor=(x, y), ncol=1)
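Putting the title, axis labels, limits, ticks, and legend together in one sketch (the data values are arbitrary):
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
plt.plot(x, [2, 4, 1, 5, 3], label="Series A")
plt.plot(x, [1, 3, 2, 4, 5], label="Series B")
plt.title("Sample Plot")                     # title of the visualization
plt.xlabel("X values")                       # X-axis label
plt.ylabel("Y values")                       # Y-axis label
plt.xlim(0, 6)                               # manual X-axis limits
plt.xticks([1, 2, 3, 4, 5])                  # manual tick positions
plt.legend(bbox_to_anchor=(1, 1), ncol=1)    # legend placed by coordinates, one column
plt.show()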
Figure class
Consider the figure class as the overall window or page on which everything is drawn. It is
a top-level container that contains one or more axes. A figure can be created using
the figure() method.
Syntax: matplotlib.pyplot.figure(figsize=(width, height), dpi=None, facecolor=None, edgecolor=None)
Multiple Plots
We have learned about the basic components of a graph that can be added so that it can
convey more information. One method can be by calling the plot function again and again
with a different set of values as shown in the above example. Now let’s see how to plot
multiple graphs using some functions and also how to plot subplots.
The add_axes() method is used to add axes to the figure. This is a method of the figure class.
Syntax: fig.add_axes([left, bottom, width, height])
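A short sketch of using a Figure with add_axes() (the rectangle values are fractions of the figure size):
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(5, 4))
ax = fig.add_axes([0.1, 0.1, 0.8, 0.8])   # [left, bottom, width, height] as figure fractions
ax.plot([1, 2, 3, 4], [10, 20, 25, 30])
ax.set_title("Plot drawn on explicitly added axes")
plt.show()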