Data Science
Data:
Data is a collection of information gathered through observation, measurement, research or analysis.
It may consist of facts, numbers, names, figures or even descriptions of things.
Data scientists mine this data and use it to analyse the world around us.
Information:
Information is data that has been processed, organized, or structured in a way that makes it
meaningful, valuable and useful.
It gives knowledge, understanding and insights that can be used for decision-making,
problem-solving, communication and various other purposes.
Categories of Data
Data can be categorized into two main parts:
Structured Data: This type of data is organized into a specific format, making it easy to
search, analyse and process. Structured data is typically found in relational databases and includes
information like numbers, dates and categories.
Unstructured Data: Unstructured data does not conform to a specific structure or format. It
may include text documents, images, videos, and other data that is not easily
organized or analysed without additional processing.
Types of Data
Generally, data can be classified into two types, categorical and numerical (items 1 and 2), and
measured on one of four scales (items 3 to 6):
1. Categorical Data: Categorical data takes its values from a defined set of categories, for
example:
Marital Status
Political Party
Eye colour
2. Numerical Data: Numerical data can further be classified into two categories:
Discrete Data: Discrete data takes distinct, countable numerical values,
for example Number of Children, Defects per Hour etc.
Continuous Data: Continuous data can take any numerical value within a range,
for example Weight, Voltage etc.
3. Nominal Scale: A nominal scale classifies data into several distinct categories in which no
ranking is implied. For example: Gender, Marital Status.
4. Ordinal Scale: An ordinal scale classifies data into distinct categories in which ranking is
implied.
5. Interval Scale: An interval scale is an ordered scale in which the difference
between measurements is a meaningful quantity but the measurements don’t have a true
zero point. For example:
Years
6. Ratio Scale: A ratio scale is an ordered scale in which the difference between the
measurements is a meaningful quantity and the measurements have a true zero
point. Hence, we can perform all arithmetic operations, including ratios, on ratio scale data. For example: Weight,
Age, Salary etc.
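As a small illustration of why the scale matters, the sketch below uses made-up values. Ratios are meaningful for weights (ratio scale) but only differences are meaningful for calendar years (interval scale). The ordered categories "small", "medium" and "large" are a hypothetical ordinal example that does not come from the notes above.

```python
# Ratio scale: weight has a true zero, so ratios are meaningful.
weight_a_kg = 40
weight_b_kg = 80
weight_ratio = weight_b_kg / weight_a_kg    # "b is twice as heavy as a"

# Interval scale: calendar years have no true zero, so only differences are meaningful.
year_a = 1000
year_b = 2000
year_gap = year_b - year_a                  # the two years are 1000 years apart

# Ordinal scale (hypothetical categories): values can be ranked,
# but the distance between ranks is not defined.
sizes = ["small", "medium", "large"]
rank = {label: position for position, label in enumerate(sizes)}
is_larger = rank["large"] > rank["small"]

print(weight_ratio, year_gap, is_larger)
```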
Data Science:
Data science is the study of data to extract meaningful insights for business.
It is a multidisciplinary approach that combines principles and practices from the fields of
mathematics, statistics, artificial intelligence, and computer engineering to analyze large amounts
of data.
This analysis helps data scientists to ask and answer questions like what happened, why it
happened, what will happen, and what can be done with the result.
1. Descriptive analysis
Descriptive analysis examines data to gain insights into what happened or what is happening in
the data environment. It is characterized by data visualizations such as pie charts, bar charts,
line graphs, tables, or generated narratives. For example, a flight booking service may record
data like the number of tickets booked each day. Descriptive analysis will reveal booking spikes,
booking slumps, and high-performing months for this service.
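A minimal sketch of descriptive analysis for the flight booking example, assuming bookings have already been totalled per month. The numbers and the 25% threshold used to flag a spike are made up for illustration.

```python
from statistics import mean

# Hypothetical number of tickets booked per month (made-up values).
bookings = {"Jan": 120, "Feb": 95, "Mar": 180, "Apr": 60, "May": 210, "Jun": 150}

average = mean(bookings.values())
best_month = max(bookings, key=bookings.get)     # high-performing month
worst_month = min(bookings, key=bookings.get)    # booking slump

# Assumption: a "spike" is any month more than 25% above the average.
spikes = [month for month, count in bookings.items() if count > 1.25 * average]

print(f"average={average:.1f}, best={best_month}, slump={worst_month}, spikes={spikes}")
```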
2. Diagnostic analysis
3. Predictive analysis
Predictive analysis uses historical data to make accurate forecasts about data patterns that may
occur in the future. It is characterized by techniques such as machine learning, forecasting,
pattern matching, and predictive modeling. In each of these techniques, computers are trained to
reverse engineer causality connections in the data. For example, the flight service team might
use data science to predict flight booking patterns for the coming year at the start of each year.
The computer program or algorithm may look at past data and predict booking spikes for certain
destinations in May. Having anticipated their customers’ future travel requirements, the company
could start targeted advertising for those cities from February.
4. Prescriptive analysis
Prescriptive analytics takes predictive data to the next level. It not only predicts what is likely to
happen but also suggests an optimum response to that outcome. It can analyze the potential
implications of different choices and recommend the best course of action. It uses graph
analysis, simulation, complex event processing, neural networks, and recommendation engines
from machine learning.
Classification
Classification is the sorting of data into specific groups or categories. Computers are trained to
identify and sort data. Known data sets are used to build decision algorithms in a computer that
quickly processes and categorizes the data.
Regression
Regression is the method of finding a relationship between two seemingly unrelated data points.
The connection is usually modelled around a mathematical formula and represented as a graph
or curve. When the value of one data point is known, regression is used to predict the other
data point.
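One way to make this concrete is a straight-line fit by ordinary least squares. This is a sketch under the assumption that the relationship is linear; the function name fit_line and the data points are illustrative only.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the common 1/n factors cancel out)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data lying exactly on y = 2x + 1
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
predicted = slope * 5 + intercept    # predict y for the known value x = 5
print(slope, intercept, predicted)
```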
Clustering
Clustering is the method of grouping closely related data together to look for patterns and
anomalies. Clustering is different from sorting because the data cannot be accurately classified
into fixed categories. Hence the data is grouped into most likely relationships. New patterns and
relationships can be discovered with clustering. For example:
Group customers with similar purchase behaviour for improved customer service.
Group network traffic to identify daily usage patterns and identify a network attack
faster.
Cluster articles into multiple different news categories and use this information to find
fake news content.
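The notes do not prescribe a particular clustering algorithm. As an illustration, the sketch below uses k-means on one-dimensional data, with hypothetical customer spend values and hand-picked starting centers.

```python
def kmeans_1d(values, centers, iterations=10):
    """Tiny 1-D k-means: returns (centers, clusters) after a fixed number of passes."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            # Assign each value to the index of its nearest center
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster (keep it if the cluster is empty)
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

purchases = [5, 7, 6, 52, 48, 50]    # hypothetical monthly spend per customer
centers, clusters = kmeans_1d(purchases, centers=[0, 100])
print(centers, clusters)
```

With these values the low spenders and the high spenders end up in separate groups, even though no categories were defined in advance.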
The basic principle behind data science techniques
While the details vary, the underlying principles behind these techniques are:
Teach a machine how to sort data based on a known data set. For example, sample
keywords are given to the computer with their sort value. “Happy” is positive, while
“Hate” is negative.
Give unknown data to the machine and allow the device to sort the dataset
independently.
Allow for result inaccuracies and handle the probability factor of the result.
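The three steps above can be sketched with a toy keyword classifier. "happy" and "hate" come from the text, while the other keywords, the function name classify and the vote-share confidence are assumptions made for illustration.

```python
from collections import Counter

# Step 1: a known data set, keywords with their sort value
training = {"happy": "positive", "love": "positive", "hate": "negative", "awful": "negative"}

def classify(sentence):
    """Return (label, confidence); confidence is the share of keyword votes for the label."""
    votes = Counter(training[w] for w in sentence.lower().split() if w in training)
    if not votes:
        return "unknown", 0.0
    label, count = votes.most_common(1)[0]
    return label, count / sum(votes.values())

# Step 2: give unknown data to the machine
clear_case = classify("I am happy and I love this")
mixed_case = classify("I hate delays but love the food")
no_keywords = classify("Nothing to see")

# Step 3: the confidence value lets the caller handle uncertain results
print(clear_case, mixed_case, no_keywords)
```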
1. Artificial intelligence: Machine learning models and related software are used for
predictive and prescriptive analysis.
2. Cloud computing: Cloud technologies have given data scientists the flexibility and
processing power required for advanced data analytics.
3. Internet of things: IoT refers to various devices that can automatically connect to the
internet. These devices collect data for data science initiatives. They generate massive
data which can be used for data mining and data extraction.
4. Quantum computing: Quantum computers can perform complex calculations at high
speed. Skilled data scientists use them for building complex quantitative algorithms.
What is the difference between data science and data analytics?
While the terms may be used interchangeably, data analytics is a subset of data science. Data
science is an umbrella term for all aspects of data processing—from the collection to modelling to
insights. On the other hand, data analytics is mainly concerned with statistics, mathematics, and
statistical analysis. It focuses on only data analysis, while data science is related to the bigger
picture around organizational data. In most workplaces, data scientists and data analysts work
together towards common business goals. A data analyst may spend more time on routine
analysis, providing regular reports. A data scientist may design the way data is stored,
manipulated, and analyzed. Simply put, a data analyst makes sense out of existing data, whereas
a data scientist creates new methods and tools to process data for use by analysts.
Machine learning is the science of training machines to analyze and learn from data the way
humans do. It is one of the methods used in data science projects to gain automated insights
from data. Machine learning engineers specialize in computing, algorithms, and coding skills
specific to machine learning methods. Data scientists might use machine learning methods as a
tool or work closely with other machine learning engineers to process data.
For example, the tuple (0, 1) indicates that the data scientist with id 0 (Hero) and the data scientist
with id 1 (Dunn) are friends. The network is illustrated in Figure 1-1.
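A minimal sketch of how such friendship tuples might be stored and turned into a lookup table. Ids 0 (Hero) and 1 (Dunn) come from the text; the third user and the second pair are made up so that the structure is visible.

```python
# Illustrative data: user 2 and the pair (0, 2) are hypothetical.
users = [
    {"id": 0, "name": "Hero"},
    {"id": 1, "name": "Dunn"},
    {"id": 2, "name": "Sue"},
]
friendship_pairs = [(0, 1), (0, 2)]

# Build an adjacency mapping: user id -> list of friend ids
friendships = {user["id"]: [] for user in users}
for i, j in friendship_pairs:
    friendships[i].append(j)    # j is a friend of i
    friendships[j].append(i)    # and i is a friend of j

names = {user["id"]: user["name"] for user in users}
for user_id, friend_ids in friendships.items():
    print(names[user_id], "->", [names[f] for f in friend_ids])
```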
Zen Of Python:
Beautiful is better than ugly.
Readability counts.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Python:
Python is an easy to learn, powerful programming language.
It has efficient high-level data structures and a simple but effective approach to object-oriented
programming.
Python’s elegant syntax and dynamic typing, together with its interpreted nature, make it an ideal
language for scripting and rapid application development in many areas on most platforms.
Python has built-in mathematical libraries and functions, making it easier to solve mathematical
problems and to perform data analysis.
Python Libraries
Python has libraries with large collections of mathematical functions and analytical tools.
Pandas - This library is used for structured data operations, such as importing CSV files, creating
dataframes, and preparing data.
NumPy - This is a mathematical library. It has a powerful N-dimensional array object, linear
algebra, Fourier transform, etc.
This will let you know that the virtual environment is currently
active.
Installing Dependencies in Virtual Environment Python
In the example below, the virtual environment is active (its name
appears in parentheses at the start of the prompt).
Now you can install dependencies related to the project in this
virtual environment.
For example, if you are using Django 1.9 for a project, you can
install it like you install other packages.
(virtualenv_name)$ pip install Django==1.9
The Django 1.9 package will be placed in the virtualenv_name folder
and will be isolated from the rest of the system.
Deactivate Python Virtual Environment
Once you are done with the work, you can deactivate the virtual
environment by the following command:
(virtualenv_name)$ deactivate
Anaconda
Anaconda is an open source distribution that bundles tools such as Jupyter
and Spyder, which are used for large-scale data processing, data
analytics and heavy scientific computing. Anaconda works with the R and
Python programming languages. Package versions are managed
by the package management system conda.
Installing Anaconda:
Head over to anaconda.com and install the latest version of
Anaconda. Make sure to download the “Python 3.7 Version” for
the appropriate architecture.
Let’s go through the steps of creating a virtual environment using the conda interface:
Check that conda is installed by printing its version:
conda -V
Type conda search "^python$" to see the list of available python versions.
Now create the environment with the following command. Replace envname with the name you want
to give to your virtual environment and replace x.x with the python version you want to use.
conda create -n envname python=x.x
To see the list of all the available environments use command conda info -e
To activate the virtual environment, enter the given command and replace your given
environment name with envname
conda activate envname
When a conda environment is activated, it modifies the PATH and shell variables to point specifically to
the isolated Python setup you created.
Type the following command to install additional packages into the environment, replacing
envname with the name of your environment and package with the name of the package.
conda install -n envname package
To come out of the particular environment type the following command. The settings of the
environment will remain as it is.
conda deactivate
If you no longer require a virtual environment, delete it using the following command,
replacing envname with your environment name.
conda remove -n envname --all
The Python interpreter and the extensive standard library are freely available in source or binary
form for all major platforms from the Python web site, https://www.python.org/, and may be freely
distributed.
The same site also contains distributions of and pointers to many free third-party Python modules,
programs and tools, and additional documentation.
The Python interpreter is easily extended with new functions and data types implemented in C or C++.
Python can connect to database systems. It can also read and modify files.
Python can be used to handle big data and perform complex mathematics.
Python can be used for rapid prototyping, or for production-ready software development.
Why Python?
Python works on different platforms (Windows, Mac, Linux, Raspberry
Pi, etc).
Python has a simple syntax similar to the English language.
Python has syntax that allows developers to write programs with fewer
lines than some other programming languages.
Python runs on an interpreter system, meaning that code can be
executed as soon as it is written. This means that prototyping can be
very quick.
Python can be treated in a procedural way, an object-oriented way or a
functional way.
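To illustrate the last point, here is one small task (summing the squares of a list) written in each of the three styles. The class name SquareSummer and the data are made up for this sketch.

```python
from functools import reduce

numbers = [1, 2, 3, 4]

# Procedural style: an explicit loop and a running total
total = 0
for n in numbers:
    total += n * n

# Object-oriented style: data and behaviour bundled in a class
class SquareSummer:
    def __init__(self, values):
        self.values = values

    def result(self):
        return sum(v * v for v in self.values)

# Functional style: compose map and reduce without mutating any state
functional_total = reduce(lambda acc, sq: acc + sq, map(lambda v: v * v, numbers), 0)

print(total, SquareSummer(numbers).result(), functional_total)
```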
Python Indentation
Indentation refers to the spaces at the beginning of a code line.
if 5 > 2:
    print("Five is greater than two!")
Python Variables
In Python, a variable is created when you assign a value to it:
x = 5
y = "Hello, World!"
Comments
Python has commenting capability for the purpose of in-code documentation.
Comments start with a #, and Python will treat the rest of the line as a
comment:
#This is a comment.
print("Hello, World!")
Functions:
A function is a rule for taking zero or more inputs and returning a corresponding output.
def double(x):
    """This is where you put an optional docstring that explains what the function
    does. For example, this function multiplies its input by 2."""
    return x * 2
Python functions are first-class, which means that we can assign them to variables and pass them
into functions just like any other arguments:
def apply_to_one(f):
    return f(1)

x = apply_to_one(double)   # passes the function double as an argument; x equals 2
Python Function with Parameters
If you have experience in C/C++ or Java then you must be
thinking about the return type of the function and data type of
arguments. You can annotate these in Python as well (specifically in
Python 3.5 and above), although the interpreter does not enforce the annotations.
Python Function Syntax with Parameters
def function_name(parameter: data_type) -> return_type:
    """Docstring"""
    # body of the function
    return expression
The following example uses arguments and parameters that you
will learn later in this article so you can come back to it again if
not understood.
Python3
def add(num1: int, num2: int) -> int:
    """Add two numbers"""
    num3 = num1 + num2
    return num3

# Driver code
num1, num2 = 5, 15
ans = add(num1, num2)
print(f"The addition of {num1} and {num2} results {ans}.")
Output:
The addition of 5 and 15 results 20.
Note: The following examples are defined without type annotations; try to
convert them to the annotated syntax for practice.
Python3
# some more functions
def is_prime(n):
    if n in [2, 3]:
        return True
    if (n == 1) or (n % 2 == 0):
        return False
    r = 3
    while r * r <= n:
        if n % r == 0:
            return False
        r += 2
    return True

print(is_prime(78), is_prime(79))
Output:
False True
Python Function Arguments
Arguments are the values passed inside the parenthesis of the
function. A function can have any number of arguments
separated by a comma.
In this example, we will create a simple function in Python to
check whether the number passed as an argument to the
function is even or odd.
Python3
# A simple Python function to check
# whether x is even or odd
def evenOdd(x):
    if (x % 2 == 0):
        print("even")
    else:
        print("odd")

# Driver code to call the function
evenOdd(2)
evenOdd(3)
Output:
even
odd
Types of Python Function Arguments
Python supports various types of arguments that can be passed
at the time of the function call. We have the following
function argument types in Python:
Default argument
Keyword arguments (named arguments)
Positional arguments
Arbitrary arguments (variable-length arguments *args
and **kwargs)
Let’s discuss each type in detail.
Default Arguments
A default argument is a parameter that assumes a default value
if a value is not provided in the function call for that argument.
The following example illustrates Default arguments to write
functions in Python.
Python3
# Python program to demonstrate
# default arguments
def myFun(x, y=50):
    print("x:", x)
    print("y:", y)

# Driver code (only one argument is passed, so y takes its default value)
myFun(10)
Output:
x: 10
y: 50
Like C++ default arguments, any number of arguments in a
function can have a default value. But once we have a default
argument, all the arguments to its right must also have default
values.
Keyword Arguments
The idea is to allow the caller to specify the argument name with
values so that the caller does not need to remember the order of
parameters.
Python3
# Python program to demonstrate Keyword Arguments
def student(firstname, lastname):
    print(firstname, lastname)

# Keyword arguments
student(firstname='Geeks', lastname='Practice')
student(lastname='Practice', firstname='Geeks')
Output:
Geeks Practice
Geeks Practice
Positional Arguments
We used the Position argument during the function call so that
the first argument (or value) is assigned to name and the second
argument (or value) is assigned to age. By changing the position,
or if you forget the order of the positions, the values can be used
in the wrong places, as shown in the Case-2 example below,
where 27 is assigned to the name and Suraj is assigned to the
age.
Python3
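def nameAge(name, age):
    print("Hi, I am", name)
    print("My age is", age)

nameAge("Suraj", 27)   # Case-1: arguments in the intended order
nameAge(27, "Suraj")   # Case-2: arguments swapped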
Output:
Case-1:
Hi, I am Suraj
My age is 27
Case-2:
Hi, I am 27
My age is Suraj
Arbitrary Keyword Arguments
In Python Arbitrary Keyword Arguments, *args, and **kwargs can
pass a variable number of arguments to a function using special
symbols. There are two special symbols:
*args in Python (Non-Keyword Arguments)
**kwargs in Python (Keyword Arguments)
Example 1: Variable length non-keywords argument
Python3
# Python program to illustrate
# *args for variable number of arguments
def myFun(*argv):
    for arg in argv:
        print(arg)

myFun('Hello', 'Welcome', 'to', 'GeeksforGeeks')
Output:
Hello
Welcome
to
GeeksforGeeks
Example 2: Variable length keyword arguments
Python3
# Python program to illustrate
# **kwargs for variable number of keyword arguments
def myFun(**kwargs):
    for key, value in kwargs.items():
        print("%s == %s" % (key, value))

# Driver code
myFun(first='Geeks', mid='for', last='Geeks')
Output:
first == Geeks
mid == for
last == Geeks
Docstring
The first string after the function is called the Document string
or Docstring in short. This is used to describe the functionality of
the function. The use of docstring in functions is optional but it is
considered a good practice.
The below syntax can be used to print out the docstring of a
function.
Syntax: print(function_name.__doc__)
Example: Adding Docstring to the function
Python3
# A simple Python function to check
# whether x is even or odd
def evenOdd(x):
    """Function to check if the number is even or odd"""
    if (x % 2 == 0):
        print("even")
    else:
        print("odd")

# Driver code to call the function
print(evenOdd.__doc__)
Output:
Function to check if the number is even or odd
Python Function within Functions
A function that is defined inside another function is known as
the inner function or nested function. Nested functions can
access variables of the enclosing scope. Inner functions are used
so that they can be protected from everything happening outside
the function.
Python3
# Python program to
# demonstrate accessing of
# variables of nested functions
def f1():
    s = 'I love GeeksforGeeks'

    def f2():
        print(s)

    f2()

# Driver's code
f1()
Output:
I love GeeksforGeeks
Anonymous Functions in Python
In Python, an anonymous function means that a function is
without a name. As we already know the def keyword is used to
define the normal functions and the lambda keyword is used to
create anonymous functions.
Python3
# Python code to illustrate the cube of a number
# using lambda function
def cube(x): return x*x*x

# The same function written as an anonymous (lambda) function
cube_v2 = lambda x: x*x*x

print(cube(7))
print(cube_v2(7))
Output:
343
343
Recursive Functions in Python
Recursion in Python refers to when a function calls itself. There
are many instances when you have to build a recursive function
to solve Mathematical and Recursive Problems.
Using a recursive function should be done with caution, as a
recursive function can become like a non-terminating loop. It is
better to check your exit statement while creating a recursive
function.
Python3
def factorial(n):
    if n == 0:
        return 1
    else:
        return n * factorial(n - 1)

print(factorial(4))
Output:
24
Here we have created a recursive function to calculate the
factorial of the number. The base case (exit condition) for this
function is n equal to 0.
Return Statement in Python Function
The function return statement is used to exit from a function and
go back to the function caller and return the specified value or
data item to the caller. The syntax for the return statement
is:
return [expression_list]
The return statement can consist of a variable, an expression, or
a constant which is returned at the end of the function
execution. If none of the above is present with the return
statement a None object is returned.
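The last point can be checked directly. In this small sketch (function names are illustrative), one function falls off the end without reaching a return statement and another uses a bare return; both hand None back to the caller.

```python
def describe(n):
    """Return a label for positive numbers; otherwise fall through."""
    if n > 0:
        return "positive"
    # No return statement is reached here, so Python returns None.

def bare_return():
    return    # a bare return also yields None

positive_result = describe(5)
fallthrough_result = describe(-1)
bare_result = bare_return()

print(positive_result, fallthrough_result, bare_result)
```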
Example: Python Function Return Statement
Python3
def square_value(num):
    """This function returns the square
    value of the entered number"""
    return num**2

print(square_value(2))
print(square_value(-4))
Output:
4
16
Pass by Reference and Pass by Value
One important thing to note is that in Python every variable name is
a reference. When we pass a variable to a function, a
new reference to the object is created. Parameter passing in
Python is the same as reference passing in Java.
Python3
# Here x is a new reference to same list lst
def myFun(x):
    x[0] = 20

lst = [10, 11, 12, 13, 14, 15]
myFun(lst)
print(lst)   # lst is modified by the function call
Output:
[20, 11, 12, 13, 14, 15]
When we pass a reference and change the received reference to
something else, the connection between the passed and
received parameters is broken. For example, consider the below
program as follows:
Python3
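def myFun(x):
    x = [20, 30, 40]   # example values: rebinds the local name x, the caller's list is untouched

lst = [10, 11, 12, 13, 14, 15]
myFun(lst)
print(lst)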
Output:
[10, 11, 12, 13, 14, 15]
Another example demonstrates that the reference link is broken
if we assign a new value (inside the function).
Python3
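def myFun(x):
    x = 20   # example value: assigning to x inside the function does not affect the caller

x = 10
myFun(x)
print(x)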
Output:
10
Exercise: Try to guess the output of the following code.
Python3
def swap(x, y):
    temp = x
    x = y
    y = temp

# Driver code
x = 2
y = 3
swap(x, y)
print(x)
print(y)
Output:
2
3
Function    Description
ascii()     Returns a readable version of an object. Replaces non-ASCII characters with escape characters
delattr()   Deletes the specified attribute (property or method) from the specified object
hasattr()   Returns True if the specified object has the specified attribute (property/method)
map()       Returns the specified iterator with the specified function applied to each item
max()       Returns the largest item in an iterable
range()     Returns a sequence of numbers, starting from 0 and incrementing by 1 (by default)
repr()      Returns a readable version of an object
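A short sketch exercising the built-in functions listed above. The Point class and the sample values are hypothetical and exist only to give the functions something to work on.

```python
class Point:
    """Tiny hypothetical class used to demonstrate the attribute helpers."""
    def __init__(self):
        self.x = 1

p = Point()
had_x_before = hasattr(p, "x")       # the attribute exists
delattr(p, "x")                      # same effect as: del p.x
has_x_after = hasattr(p, "x")        # the attribute was removed

escaped = ascii("caf\u00e9")         # the non-ASCII character is replaced by an escape
quoted = repr("hi")                  # the string including its quotes
upper = list(map(str.upper, ["a", "b"]))   # map() is lazy, so wrap it in list()
largest = max([3, 9, 4])
first_three = list(range(3))         # starts at 0 and increments by 1

print(had_x_before, has_x_after, escaped, quoted, upper, largest, first_three)
```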
The primary goal of data visualization is to make data more accessible and easier to interpret,
allowing users to identify patterns, trends, and outliers quickly. This is particularly important in the
context of big data, where the sheer volume of information can be overwhelming without
effective visualization techniques.
Numerical Data
Categorical Data
Let’s understand the visualization of data via a diagram with all its categories.
It would not be that easy to get this information so fast from a data table. This is just one
demonstration of the usefulness of data visualization. Let’s see some more reasons why visualization
of data is so important.
Various types of visualizations cater to diverse data sets and analytical goals.
1. Bar Charts: Ideal for comparing categorical data or displaying frequencies, bar charts offer a
clear visual representation of values.
2. Line Charts: Perfect for illustrating trends over time, line charts connect data points to reveal
patterns and fluctuations.
3. Pie Charts: Efficient for displaying parts of a whole, pie charts offer a simple way to
understand proportions and percentages.
4. Scatter Plots: Showcase relationships between two variables, identifying patterns and
outliers through scattered data points.
5. Histograms: Depict the distribution of a continuous variable, providing insights into the
underlying data patterns.
6. Heatmaps: Visualize complex data sets through color-coding, emphasizing variations and
correlations in a matrix.
7. Box Plots: Unveil statistical summaries such as median, quartiles, and outliers, aiding in data
distribution analysis.
8. Area Charts: Similar to line charts but with the area under the line filled, these charts
accentuate cumulative data patterns.
9. Bubble Charts: Enhance scatter plots by introducing a third dimension through varying
bubble sizes, revealing additional insights.
10. Treemaps: Efficiently represent hierarchical data structures, breaking down categories into
nested rectangles.
11. Violin Plots: Violin plots combine aspects of box plots and kernel density plots, providing a
detailed representation of the distribution of data.
12. Word Clouds: Word clouds are visual representations of text data where words are sized
based on their frequency.
13. 3D Surface Plots: 3D surface plots visualize three-dimensional data, illustrating how a
response variable changes in relation to two predictor variables.
14. Network Graphs: Network graphs represent relationships between entities using nodes and
edges. They are useful for visualizing connections in complex systems, such as social
networks, transportation networks, or organizational structures.
1. Tableau
2. Looker
3. Zoho Analytics
4. Sisense
6. Qlik Sense
7. Domo
8. Microsoft Power BI
9. Klipfolio
What is Matplotlib?
Matplotlib is a low-level graph plotting library in Python that serves as a visualization utility.
Matplotlib is mostly written in Python; a few segments are written in C, Objective-C and JavaScript for
platform compatibility.
Installation of Matplotlib
If you have Python and PIP already installed on a system, then installation of Matplotlib is very easy.
Install it using this command:
pip install matplotlib
If this command fails, then use a Python distribution that already has Matplotlib installed, like
Anaconda, Spyder etc.
Import Matplotlib
Once Matplotlib is installed, import it in your applications by adding the import module statement:
import matplotlib
The version string is stored under the __version__ attribute:
import matplotlib
print(matplotlib.__version__)
Pyplot
Most of the Matplotlib utilities lie under the pyplot submodule, and are usually imported under
the plt alias:
import matplotlib.pyplot as plt
xpoints, ypoints = [1, 8], [3, 10]   # example data points
plt.plot(xpoints, ypoints)
plt.show()
Plotting Without Line
To plot only the markers, you can use the shortcut string notation parameter 'o', which means 'rings'.
Example
Draw two points in the diagram, one at position (1, 3) and one at position (8, 10):
import matplotlib.pyplot as plt
plt.plot([1, 8], [3, 10], 'o')
plt.show()
Markers
You can use the keyword argument marker to emphasize each point with a specified marker, for example marker='o'.
5. Human Resources: Human Resources departments leverage data visualization to
o K-Nearest Neighbour is one of the simplest
Machine Learning algorithms based on Supervised Learning.
o K-NN algorithm assumes similarity between the new case/data
and the available cases and puts the new case into the category that is
most similar to the available categories.
o K-NN algorithm stores all the available data and classifies a new
data point based on the similarity. This means that when new data
appears, it can be easily classified into a well-suited category by
using the K-NN algorithm.
o K-NN algorithm can be used for Regression as well as for
Classification but mostly it is used for the Classification problems.
o K-NN is a non-parametric algorithm, which means it does not
make any assumption on underlying data.
o It is also called a lazy learner algorithm because it does not learn
from the training set immediately instead it stores the dataset and
at the time of classification, it performs an action on the dataset.
o KNN algorithm at the training phase just stores the dataset and
when it gets new data, it classifies that data into the category
that is most similar to the new data.
o Example: Suppose we have an image of a creature that looks
similar to both a cat and a dog, and we want to know whether it is a cat or a dog.
For this identification, we can use the KNN algorithm, as it works
on a similarity measure. Our KNN model will find the features
of the new image that are similar to the cat and dog images and, based on the
most similar features, put it in either the cat or the dog category.
Why do we need a K-NN Algorithm?
Suppose there are two categories, i.e., Category A and Category B, and
we have a new data point x1. In which of these categories does this data point
lie? To solve this type of problem, we need a K-NN algorithm. With
the help of K-NN, we can easily identify the category or class of a
particular data point. Consider the below diagram:
Suppose we have a new data point and we need to put it in the required
category. Consider the below image:
o As we can see the 3 nearest neighbors are from category A, hence
this new data point must belong to category A.
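The idea can be sketched in a few lines of plain Python. This is a minimal illustration, assuming Euclidean distance as the similarity measure and a simple majority vote among the k nearest points; the function name knn_classify and the coordinates are made up.

```python
from collections import Counter
from math import dist

def knn_classify(training, new_point, k=3):
    """Classify new_point by majority vote among its k nearest training points.

    training is a list of ((x, y), label) pairs.
    """
    # Sort the stored points by Euclidean distance to the new point
    by_distance = sorted(training, key=lambda item: dist(item[0], new_point))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up 2-D points for Category A and Category B
training = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
    ((6.0, 6.0), "B"), ((6.5, 7.0), "B"), ((7.0, 6.5), "B"),
]

x1 = (2.0, 2.0)    # the new data point
category = knn_classify(training, x1, k=3)
print(category)
```

Note that there is no training step: the function only stores and searches the data at classification time, which is why K-NN is called a lazy learner.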
HR operations. Workforce demographics and diversity metrics are visually represented,
supporting inclusive practices within organizations. Additionally, analytics for recruitment and
retention strategies are enhanced through visual insights, contributing to more effective talent ma