Government College of Engineering and Research

Department of Computer Engineering, Avasari

Data Science and Big Data Analytics
TE 2019 Course

Prof. K. B. Sadafale
Assistant Professor
• Teaching Scheme :-

• Lectures: 4 Hrs/ Week

• Examination Scheme:-

• Mid-Sem (TH) : 30 Marks

• End-Sem (TH): 70 Marks


• Prerequisites Courses:

• Discrete Mathematics

• Database Management Systems


• Companion Course:

• Data Science and Big Data Analytics Laboratory (310256)
Course Objectives
• To understand the need of Data Science and Big Data
• To understand computational statistics in Data
Science
• To study and understand the different technologies
used for Big Data processing
• To understand and apply data modeling strategies
• To learn Data Analytics using Python programming
• To be conversant with advances in analytics
Course Outcomes
• Learners should be able to

• CO1: Analyze needs and challenges for Data Science and Big Data Analytics
• CO2: Apply statistics for Big Data Analytics
• CO3: Apply the lifecycle of Big Data analytics to real world
problems
• CO4: Implement Big Data Analytics using Python
programming
• CO5: Implement data visualization using visualization tools in
Python programming
• CO6: Design and implement Big Databases using the Hadoop
ecosystem
Syllabus
• Unit I: Introduction to Data Science and Big Data

• UNIT II: Statistical Inference


• UNIT III: Big Data Analytics Life Cycle

• UNIT IV: Predictive Big Data Analytics with Python

• UNIT V: Big Data Analytics and Model Evaluation

• UNIT VI: Data Visualization and Hadoop


Unit I
Introduction to Data Science and Big Data
•Basics and need of Data Science and Big Data
•Applications of Data Science, Data explosion
•5 V’s of Big Data, Relationship between Data Science and
Information Science
•Business intelligence versus Data Science
•Data Science Life Cycle, Data: Data Types, Data
Collection.
•Need of Data wrangling, Methods: Data Cleaning, Data
Integration, Data Reduction, Data Transformation, Data
Discretization.
UNIT II
Statistical Inference

• Need of statistics in Data Science and Big Data


Analytics
• Measures of Central Tendency: Mean, Median,
Mode, Mid-range.
• Measures of Dispersion: Range, Variance, Mean
Deviation, Standard Deviation.
• Bayes theorem, Basics and need of hypothesis and
hypothesis testing
• Pearson Correlation, Sample Hypothesis testing,
Chi-Square Tests, t-test.
UNIT III:
Big Data Analytics Life Cycle
• Introduction to Big Data
• Sources of Big Data
• Data Analytic Lifecycle:
• Introduction,
• Phase 1: Discovery,
• Phase 2: Data Preparation,
• Phase 3: Model Planning,
• Phase 4: Model Building,
• Phase 5: Communicate Results,
• Phase 6: Operationalize.
UNIT IV:
Predictive Big Data Analytics with Python
• Introduction, Essential Python Libraries,
• Basic examples. Data Pre-processing: Removing
Duplicates,
• Transformation of Data using function or mapping,
replacing values, Handling Missing Data.
• Analytics Types: Predictive, Descriptive and Prescriptive.
• Association Rules: Apriori Algorithm, FP growth.
• Regression: Linear Regression, Logistic Regression.
• Classification: Naïve Bayes, Decision Trees.
• Introduction to Scikit-learn, Installations, Dataset, matplotlib, filling missing values, Regression and Classification using Scikit-learn.
UNIT V:
Big Data Analytics and Model Evaluation
• Clustering Algorithms: K-Means, Hierarchical Clustering,
Time-series analysis.
• Introduction to Text Analysis: Text-preprocessing, Bag of
words, TF-IDF and topics.
• Need and Introduction to social network analysis,
Introduction to business analysis.
• Model Evaluation and Selection: Metrics for Evaluating Classifier Performance, Holdout Method and Random Subsampling, Parameter Tuning and Optimization, Result Interpretation, Clustering and Time-series analysis using Scikit-learn, sklearn.metrics, Confusion matrix, AUC-ROC Curves, Elbow plot.
UNIT VI:
Data Visualization and Hadoop
• Introduction to Data Visualization,
• Challenges to Big data visualization,
• Types of data visualization,
• Data Visualization Techniques,
• Visualizing Big Data,
• Tools used in Data Visualization,
• Hadoop ecosystem, Map Reduce, Pig, Hive,
• Analytical techniques used in Big data visualization.
• Data Visualization using Python: Line plot, Scatter plot, Histogram, Density plot, Box plot.
• Text Books

• 1. David Dietrich, Barry Hiller, “Data Science and Big Data Analytics”, EMC Education Services, Wiley publication, 2012, ISBN 0-07-120413-X

• 2. Jiawei Han, Micheline Kamber, and Jian Pei, “Data Mining: Concepts and Techniques”, Elsevier Publishers, Third Edition, ISBN: 9780123814791, 9780123814807
Unit I

Introduction to Data Science and Big Data


What is Data science
• Data science is the deep study of large amounts of data,
which involves extracting meaningful insights from raw,
structured, and unstructured data that is processed using the
scientific method, different technologies, and algorithms.
• It is a multidisciplinary field that uses tools and techniques to
manipulate the data so that you can find something new and
meaningful.
• Data science uses the most powerful hardware, programming
systems, and most efficient algorithms to solve the data
related problems.
• In data science, we take a large amount of data, analyze it, and then make decisions.
• Data science is the domain of study that deals with vast
volumes of data using modern tools and techniques to find
unseen patterns, derive meaningful information, and make
business decisions.
• Data science uses complex machine learning algorithms to
build predictive models.
• The data used for analysis can come from many different sources and be presented in various formats.
• Now that you know what data science is, let’s see why data
science is essential to today’s IT landscape.
Process of Data science
• In short, we can say that data science is all about:

• Asking the correct questions and analyzing the raw data.
• Modeling the data using various complex and efficient algorithms.
• Visualizing the data to get a better perspective.
• Understanding the data to make better decisions and finding the final result.
Example 1
• Suppose we want to travel from station A to station B by car.

• Now, we need to take some decisions, such as which route will be the best to reach the location faster, on which route there will be no traffic jam, and which will be cost-effective.

• All these decision factors will act as input data, and we will get an appropriate answer from these decisions, so this analysis of data is called data analysis, which is a part of data science.
Example 2
• A company manufactures two types of candy:

• Orange candy and mango candy.

• Suppose many people like mango candy; after analyzing the sales data, the company will increase the manufacturing of mango candy.
Need for Data Science
• Some years ago, data was less and mostly available in a structured form, which could be easily stored in Excel sheets and processed using BI tools.

• But in today's world, data is becoming so vast that approximately 2.5 quintillion bytes of data are generated every day, which has led to a data explosion.

• It was estimated by researchers that by 2020, 1.7 MB of data would be created every single second by every person on earth.

• Every company requires data to work, grow, and improve its business.
• Facebook generates 4 petabytes of data per day.

• Now, handling such a huge amount of data is a challenging task for every organization.

• So, to handle, process, and analyze all of this data, we require complex, powerful, and efficient algorithms and technology, and that technology came into existence as data science.
Following are some main reasons for using
Data science technology
• With the help of data science technology, we can convert
the large amount of raw and unstructured data into
meaningful insights.
• Data science technology is being adopted by various companies, whether big brands or startups.
• Google, Amazon, Netflix, etc, which handle the huge
amount of data, are using data science algorithms for
better customer experience.
• Data science is working for automating transportation
such as creating a self-driving car, which is the future of
transportation.
• Data science can help in different predictions such as surveys, elections, flight ticket confirmation, etc.
Data Scientist
• Data scientists are the experts who can use various
statistical tools and machine learning algorithms to
understand and analyze the data.
• The main role of data scientists is to organize the raw
data.
• Skill required: To become a data scientist, one should
have technical language skills such as R, SAS, SQL,
Python, Hive, Pig, Apache spark, MATLAB.
• Data scientists must have an understanding of
Statistics, Mathematics, visualization, and
communication skills.
• Knowledge of ML, data mining, and analytics is also required.
Applications of Data Science
1) Image recognition and speech recognition:
•Data science is currently used for image and speech recognition.
•When you upload an image on Facebook, you start getting suggestions to tag your friends.
•This automatic tagging suggestion uses an image recognition algorithm, which is part of data science.
•When you say something to "Ok Google", Siri, Cortana, etc., and these devices respond as per voice control, this is possible because of speech recognition algorithms.
2) Gaming world:
•In the gaming world, the use of Machine learning algorithms is
increasing day by day.
•EA Sports, Sony, and Nintendo are widely using data science to enhance the user experience.
3) Internet search:
•When we want to search for something on the internet,
then we use different types of search engines such as
Google, Yahoo, Bing, Ask, etc.
•All these search engines use data science technology to make the search experience better, and you can get search results within a fraction of a second.

4) Transport:
•Transport industries are also using data science technology to create self-driving cars.
•With self-driving cars, it will be easy to reduce the number
of road accidents.
5) Healthcare:
•In the healthcare sector, data science is providing lots of benefits.
•Data science is being used for tumor detection, drug discovery,
medical image analysis, virtual medical bots, etc.
6) Recommendation systems:
•Most of the companies, such as Amazon, Netflix, Google Play, etc., are
using data science technology for making a better user experience
with personalized recommendations.
•Such as, when you search for something on Amazon, and you started
getting suggestions for similar products, so this is because of data
science technology.
7) Risk detection:
•Finance industries have always had issues of fraud and risk of losses, but with the help of data science these can be reduced.
•Most finance companies are looking for data scientists to help avoid risk and losses while increasing customer satisfaction.
Big Data Overview
• Data is created constantly, and at an ever-increasing rate.
• Mobile phones, social media, imaging technologies to determine a medical diagnosis: all these and more create new data that must be stored somewhere for some purpose.
• Devices and sensors automatically generate diagnostic
information that needs to be stored and processed in real
time.
• Merely keeping up with this huge influx of data is difficult, but substantially more challenging is analyzing vast amounts of it, especially when it does not conform to traditional notions of data structure, to identify meaningful patterns and extract useful information.
• These challenges of the data deluge present the opportunity
to transform business, government, science, and everyday
life.
• Three attributes stand out as defining Big Data
characteristics:
• • Huge volume of data:
• Rather than thousands or millions of rows, Big Data can
be billions of rows and millions of columns.
• • Complexity of data types and structures:
• Big Data reflects the variety of new data sources,
formats, and structures, including digital traces being left
on the web and other digital repositories for subsequent
analysis.
• • Speed of new data creation and growth:
• Big Data can describe high velocity data, with rapid data
ingestion and near real time analysis.
What is Big Data?
Big Data Definition
• No single standard definition…

• “Big Data” is data whose scale, diversity, and


complexity require new architecture,
techniques, algorithms, and analytics to
manage it and extract value and hidden
knowledge from it…
“Big Data is the frontier of a firm's
ability to store, process, and access
(SPA) all the data it needs to operate
effectively, make decisions, reduce
risks, and serve customers.”
-- Forrester
“Big Data in general is defined as high
volume, velocity and variety information
assets that demand cost-effective,
innovative forms of information
processing for enhanced insight and
decision making.”
-- Gartner
▪ “Big data is data that exceeds the processing capacity of conventional database systems.
The data is too big, moves too fast, or doesn't fit the structures of your database architectures.
To gain value from this data, you must choose an alternative way to process it.”
-- O’Reilly
“Big data is the data characterized
by 3 attributes: volume, variety and
velocity.”

-- IBM
“Big data is the data characterized by 4
key attributes: volume, variety, velocity
and
value.”

-- Oracle
• Big Data is a collection of large datasets that cannot be
processed using traditional computing techniques.
• It is not a single technique or a tool, rather it involves many
areas of business and technology.
• Data that is very large in size is called Big Data.
• Normally we work on data of size in MB (Word docs, Excel sheets) or at most GB (movies, code), but data in petabytes is called Big Data.
• Big Data refers to the large amounts of data which pour in from various data sources and have different formats.

• Big Data is data whose scale, distribution, diversity,


and/or timeliness require the use of new technical
architectures and analytics to enable insights that
unlock new sources of business value.
What Comes Under Big Data?
• Big data involves the data produced by different devices and
applications. Given below are some of the fields that come under the
umbrella of Big Data.
• Black Box Data: It is a component of helicopters, airplanes, jets, etc. It captures the voices of the flight crew, recordings of microphones and earphones, and the performance information of the aircraft.
• Social Media Data: Social media such as Facebook and Twitter hold
information and the views posted by millions of people across the globe.
• Stock Exchange Data: The stock exchange data holds information about the ‘buy’ and ‘sell’ decisions made by customers on shares of different companies.
• Power Grid Data: The power grid data holds information consumed
by a particular node with respect to a base station.
• Transport Data: Transport data includes model, capacity, distance
and availability of a vehicle.
• Search Engine Data: Search engines retrieve lots of data from
different databases.
Big Data Challenges
• The major challenges associated with big data are as
follows:

• Capturing data
• Storage
• Searching
• Sharing
• Transfer
• Analysis
• Presentation
Types of Big data
• The three different formats of big data are:

• Structured: Organised data format with a fixed schema. Ex: RDBMS

• Semi-Structured: Partially organised data which does not have a fixed format. Ex: XML

• Unstructured: Unorganised data with an unknown schema. Ex: Audio, video files etc.
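A minimal sketch in Python of how these three formats are typically handled; the file names (data.csv, data.xml, song.mp3) are hypothetical and assume pandas is installed:

import pandas as pd                      # structured, tabular data
import xml.etree.ElementTree as ET       # semi-structured data

# Structured: fixed schema (rows and columns), loads directly into a table.
df = pd.read_csv("data.csv")             # e.g. an RDBMS export
print(df.dtypes)

# Semi-structured: tags give partial structure, but no fixed schema.
root = ET.parse("data.xml").getroot()
for record in root:
    print({child.tag: child.text for child in record})

# Unstructured: raw bytes (audio, video, images) with no schema;
# meaning must be extracted with specialized algorithms.
with open("song.mp3", "rb") as f:
    raw_bytes = f.read()
print(len(raw_bytes), "bytes of unstructured data")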
Sources of Big Data
• Social networking sites: Facebook, Google, Instagram, etc. generate huge amounts of data on a day-to-day basis as they have billions of users worldwide.
• E-commerce sites: Sites like Amazon and Flipkart generate huge amounts of logs from which users' buying trends can be traced.
• Weather stations: All the weather stations and satellites give very large amounts of data, which are stored and processed to forecast the weather.
• Telecom companies: Telecom giants like Airtel and Vodafone study user trends and accordingly publish their plans, and for this they store the data of their millions of users.
• Share market: Stock exchanges across the world generate huge amounts of data through their daily transactions.
Applications of Big Data
Financial and banking sector
• The financial and banking sectors use big data technology
extensively.
• Big data analytics help banks understand customer behaviour on the basis of investment patterns, shopping trends, motivation to invest, and inputs obtained from personal or financial backgrounds.

• Healthcare
• Big data has started making a massive difference in the healthcare sector, helping medical professionals and healthcare personnel with predictive analytics.
• It can provide personalized healthcare to individual patients as well.
Telecommunication and media
• Telecommunications and the multimedia sector are the main
users of Big Data.
• Zettabytes of data are generated every day, and handling data at this scale requires big data technologies.
E-commerce
• E-commerce is also an application of Big Data.
• Maintaining relationships with customers is essential for the e-commerce industry.
• E-commerce websites use Big Data to market to retail customers, manage transactions, and implement better and more innovative strategies to improve their businesses.
• Amazon: Amazon is a huge e-commerce website dealing with lots of traffic daily. But when there is a pre-announced sale on Amazon, traffic increases rapidly, which may crash the website.
• So, to handle this type of traffic and data, it uses Big Data.
• Big Data helps in organizing and analyzing the data for future use.
Social Media
• Social Media is the largest data generator.
• Statistics have shown that around 500+ terabytes of fresh data are generated from social media daily, particularly on Facebook.
• The data mainly contains videos, photos, message exchanges, etc. A single activity on a social media site generates a lot of data, which is stored and processed when required.
• Since the data stored is in terabytes (TB), it takes a lot of time to process. Big Data technologies are the solution to this problem.
Big Data Characteristics

• Big Data contains a large amount of data that cannot be processed by traditional data storage or processing units.
• It is used by many multinational companies to process the data and run the business of many organizations.
• The data flow would exceed 150 exabytes per day before replication.
5 V's of Big Data
• Volume
• Veracity
• Variety
• Value
• Velocity
Volume
• Data at rest (Terabyte and Exabyte)

• The name Big Data itself is related to an enormous size.


• Big Data is a vast 'volumes' of data generated from many
sources daily, such as business processes, machines, social
media platforms, networks, human interactions, and many
more.
• Facebook alone can generate approximately a billion messages, 4.5 billion "Like" button clicks, and more than 350 million new posts each day.
• Big data technologies can handle large amounts of data.
Variety
• Data in any forms like structured, unstructured, and
semi-structured .
• Big Data can be structured, unstructured, and
semi-structured that are being collected from
different sources.
• In the past, data was collected only from databases and spreadsheets, but these days data comes in an array of forms such as PDFs, emails, audio, social media posts, photos, videos, etc.
Veracity
• Veracity means how much the data is reliable.

• It has many ways to filter or translate the data.

• Veracity is the process of being able to handle and manage

data efficiently.

• Big Data is also essential in business development.

• For example, Facebook posts with hashtags.


Value

• Value is an essential characteristic of big data.

• It is not the data that we process or store.

• It is valuable and reliable data that we store,


process, and also analyze.
Velocity
• Data in motion
• Velocity plays an important role compared to others.
• Velocity refers to the speed at which data is created in real time.
• It contains the linking of incoming data sets speeds,
rate of change, and activity bursts.
• The primary aspect of Big Data is to provide
demanding data rapidly.
• Big data velocity deals with the speed at which data flows from sources like application logs, business processes, networks, social media sites, sensors, mobile devices, etc.
Difference Between Data Science and Big Data

• Data Science is an area of study; it processes big data. Big Data is a technique to collect, maintain and process large amounts of information.
• Data Science is about the collection, processing, analysis and utilization of data in various operations. Big Data is about extracting vital and valuable information from huge amounts of data.
• Data Science is a field of study just like Computer Science, Applied Statistics or Applied Mathematics. Big Data is a technique of tracking and discovering trends in complex data sets.
• Data Science is mainly used for scientific purposes. Big Data is mainly used for business purposes and customer satisfaction.
• Data Science broadly focuses on the science of the data. Big Data is more involved with the processes of handling voluminous data.
What is Data Explosion
• A large scale of data is generated and stored in computer systems; this is called data explosion.
• The world is now used to saving everything, without exception, in electronic form.
• Processing power, RAM speeds and hard-disk sizes have expanded to a level that has changed our viewpoint towards data and its storage.
• Can you imagine having only 256 or 512 MB of RAM in your PC now?
• If we understand the idea of a byte, we can imagine how data has grown over time and how storage systems handle it.

• We know that 1 byte is equivalent to 8 bits, and these 8 bits can represent a character or symbol.

• A document with a huge number of bytes will contain a huge number of characters, symbols, spaces, etc.
• Similarly, a megabyte (MB) is a million bytes of information, a gigabyte (GB) is a billion bytes, and a terabyte (TB) is a trillion bytes.
• We use these terms while managing data and storage in our everyday activities.
• But it doesn't end here.
• Next comes the petabyte, which is a quadrillion bytes, or a million gigabytes.
• The ones after that are the exabyte, zettabyte, and yottabyte.
• A yottabyte is basically a trillion terabytes of information.
• There are considerably higher numbers.
• According to a 2009 report by the global information firm IDC, the aggregate amount of information in the world was 800 EB.
• It was expected to rise to 44 ZB by the end of 2020, i.e., 44 trillion gigabytes.
• The report also stated that 11 ZB of this information would be stored in the cloud.
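A quick arithmetic sketch of these storage units in Python (decimal units, powers of 1000, matching the definitions above):

# Decimal storage units as powers of 1000, matching the definitions above.
units = ["byte", "kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte", "yottabyte"]

for power, name in enumerate(units):
    print(f"1 {name} = 10^{3 * power} bytes = {1000 ** power:,} bytes")

# Example: the ~44 ZB estimated for 2020 expressed in gigabytes.
print(44 * 1000 ** 7 // 1000 ** 3, "gigabytes in 44 zettabytes")  # 44 trillion GB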
• The rapid or exponential increase in the amount of data that is generated and stored in computing systems, reaching a level where data management becomes difficult, is called “Data Explosion”.

• The key drivers of data growth are the following:
Increase in storage capacities.
Cheaper storage.
Increase in data processing capabilities by modern
computing devices.
Data generated and made available by different
sectors.
Data Processing
• What Is Data Processing?
• Data in its raw form is not useful to any organization.
• Data processing is the method of collecting raw data
and translating it into usable information.
• It is usually performed in a step-by-step process by a
team of data scientists in an organization.
• The raw data is collected, filtered, sorted, processed,
analyzed, stored, and then presented in a readable
format.
• Data processing is essential for organizations to
create better business strategies and increase their
competitive edge.
• By converting the data into readable formats like
graphs, charts, and documents, employees
throughout the organization can understand and use
the data.
Data Processing Cycle
• The data processing cycle consists of a series of steps where
raw data (input) is fed into a system to produce actionable
insights (output).
• Each step is taken in a specific order, but the entire process is
repeated in a cyclic manner.
• There are six main steps in the data processing
cycle:
• Step 1: Collection
• The collection of raw data is the first step of the data
processing cycle.
• The type of raw data collected has a huge impact on the
output produced.
• Raw data should be gathered from defined and accurate
sources so that the subsequent findings are valid and usable.
• Step 2: Preparation
• Data preparation or data cleaning is the process of sorting
and filtering the raw data to remove unnecessary and
inaccurate data.
• Raw data is checked for errors, duplication, miscalculations or
missing data, and transformed into a suitable form for further
analysis and processing.
• Step 3: Input
• In this step, the raw data is converted into machine readable
form and fed into the processing unit.
• This can be in the form of data entry through a keyboard,
scanner or any other input source.

• Step 4: Data Processing


• In this step, the raw data is subjected to various data
processing methods using machine learning and artificial
intelligence algorithms to generate a desirable output.
• This step may vary slightly from process to process depending
on the source of data being processed (data lakes, online
databases, connected devices, etc.) and the intended use of
the output.
• Step 5: Output
• The data is finally transmitted and displayed to the user in a
readable form like graphs, tables, vector files, audio, video,
documents, etc.
• This output can be stored and further processed in the next
data processing cycle.

• Step 6: Storage
• The last step of the data processing cycle is storage, where
data and metadata are stored for further use.
• This allows for quick access and retrieval of information
whenever needed, and also allows it to be used as input in
the next data processing cycle directly.
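A minimal end-to-end sketch of the cycle described above, written in Python with pandas; the file names and column names (sales_raw.csv, region, amount) are hypothetical, purely for illustration:

import pandas as pd

# Steps 1-3: Collection and input - read raw data into a machine-readable table.
raw = pd.read_csv("sales_raw.csv")            # hypothetical input file

# Step 2: Preparation - drop duplicates and rows with missing values.
clean = raw.drop_duplicates().dropna()

# Step 4: Processing - aggregate sales per region (hypothetical columns).
summary = clean.groupby("region")["amount"].sum()

# Step 5: Output - present the result in a readable form.
print(summary)

# Step 6: Storage - persist the output for the next cycle.
summary.to_csv("sales_summary.csv")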
Types of Data Processing
• Batch Processing: Data is collected and processed in batches. Used for large amounts of data. Eg: payroll system
• Real-time Processing: Data is processed within seconds when the input is given. Used for small amounts of data. Eg: withdrawing money from an ATM
• Online Processing: Data is automatically fed into the CPU as soon as it becomes available. Used for continuous processing of data. Eg: barcode scanning
• Multiprocessing: Data is broken down into frames and processed using two or more CPUs within a single computer system. Also known as parallel processing. Eg: weather forecasting
• Time-sharing: Allocates computer resources and data in time slots to several users simultaneously.
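A small illustration of the multiprocessing (parallel processing) idea using Python's standard library; the work function here is a made-up placeholder, not part of the syllabus:

from concurrent.futures import ProcessPoolExecutor

def process_frame(frame):
    # Placeholder for the real work done on one chunk ("frame") of data.
    return sum(x * x for x in frame)

if __name__ == "__main__":
    # Break the data into frames and process them on multiple CPUs in parallel.
    frames = [range(0, 1000), range(1000, 2000), range(2000, 3000)]
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(process_frame, frames))
    print(results)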
Examples of Data Processing
• Data processing occurs in our daily lives whether we may be aware of it or
not. Here are some real-life examples of data processing:
• A stock trading software that converts millions of
stock data into a simple graph
• An e-commerce company uses the search history of
customers to recommend similar products
• A digital marketing company uses demographic data
of people to strategize location-specific campaigns
• A self-driving car uses real-time data from sensors to
detect if there are pedestrians and other cars on the
road
Relationship between data science and
information science
• Data science is the discovery of knowledge or
actionable information in data.
• Information science is the design of practices for storing and retrieving information.
• Data science and information science are distinct but complementary disciplines.
• Data science is heavy on computer science and
mathematics.
• Information science is more concerned with areas such as library science, cognitive science and communications.
• Data science is used in business functions such as strategy
formation, decision making and operational processes.
• It touches on practices such as artificial intelligence, analytics,
predictive analytics and algorithm design.
• The discovery of knowledge and actionable information in
data.
• Data science is an interdisciplinary field about scientific
methods, processes, and systems to extract knowledge or
insights from data in various forms, either structured or
unstructured.
Business intelligence versus Data science
• Data Science:
• Data science is basically a field in which information and
knowledge are extracted from the data by using various
scientific methods, algorithms, and processes.
• It can thus be defined as a combination of various
mathematical tools, algorithms, statistics, and machine
learning techniques which are thus used to find the hidden
patterns and insights from the data which help in the
decision-making process.
• Data science deals with both structured as well as
unstructured data.
• It is related to both data mining and big data.
• Data science involves studying the historic trends and thus
using its conclusions to redefine present trends and also
predict future trends and Technologies.
• Business Intelligence:
• Business intelligence(BI) is a set of technologies, applications, and
processes that are used by enterprises for business data analysis.
• It is used for the conversion of raw data into meaningful
information which is thus used for business decision-making and
profitable actions.
• It deals with the analysis of structured and sometimes unstructured
data which paves the way for new and profitable business
opportunities.
• It supports decision-making based on facts rather than
assumption-based decision-making.
• Thus it has a direct impact on the business decisions of an
enterprise.
• Business intelligence tools enhance the chances of an enterprise to
enter a new market as well as help in studying the impact of
marketing efforts.
Factor-wise comparison of Data Science and Business Intelligence:

• Concept: Data Science is a field that uses mathematics, statistics and various other tools to discover hidden patterns in the data. Business Intelligence is basically a set of technologies, applications and processes used by enterprises for business data analysis.
• Focus: Data Science focuses on the future. Business Intelligence focuses on the past and present.
• Data: Data Science deals with both structured as well as unstructured data. Business Intelligence mainly deals only with structured data.
• Flexibility: Data Science is much more flexible, as data sources can be added as per requirement. Business Intelligence is less flexible, as data sources need to be pre-planned.
• Method: Data Science makes use of the scientific method. Business Intelligence makes use of the analytic method.
• Complexity: Data Science has a higher complexity in comparison to Business Intelligence, which is much simpler.
• Expertise: The expert in Data Science is the data scientist. The expert in Business Intelligence is the business user.
• Questions: Data Science deals with the questions of what will happen and what if. Business Intelligence deals with the question of what happened.
• Storage: In Data Science, the data to be used is disseminated in real-time clusters. In Business Intelligence, a data warehouse is utilized to hold data.
• Integration of data: The ELT (Extract-Load-Transform) process is generally used to integrate data for data science applications. The ETL (Extract-Transform-Load) process is generally used to integrate data for business intelligence applications.
• Tools: Data Science uses Python, R, Hadoop/Spark, TensorFlow. Business Intelligence uses SAS, MS Excel, SAS BI, Sisense, MicroStrategy.
• Usage: Companies can harness their potential by anticipating future scenarios using data science in order to reduce risk and increase income. Business Intelligence helps in performing root cause analysis on a failure or understanding the current status.
Business Intelligence
• Business Intelligence is a process of collecting, integrating,
analyzing and presenting the data.
• With Business Intelligence, executives and managers can have a
better understanding of decision-making.
• This process is carried out through software services and tools.
• Using Business Intelligence, organizations are able to make several strategic and operational business decisions.
• BI tools are used for analysis and creation of reports.
• They are also used for producing graphs, dashboards, summaries,
and charts to help the business executives to make better
decisions.
• Business Intelligence makes use of the data that is stored in the
form of business warehouses.
• It also supports real-time data that is generated from the services.
• Business Intelligence is used for strategic decision making.
• Observe, Anticipate and Plan (OAP)
• Business Process Reengineering (BPR)
• Some of the important uses of Business Intelligence are:

 Measuring performance and quantifying the progress towards reaching the business goal.
 Performing quantitative analysis through predictive analytics and modelling.
 Visualizing data, storing data in data warehouses, and further processing it in OLAP.
 Using knowledge management programs to develop effective strategies, gain insights about learning management, and raise compliance issues.
Data Science
• In general, is about finding patterns within data.
• It is a multi-disciplinary field, meaning that data science is a
combination of several disciplines.
• Three most important fields are – Mathematics, Statistics and
Programming form the backbone of data science.
• Other than this, data scientists need to have domain knowledge in
order to find out patterns in the data.
Data Science Lifecycle
• The Data Science Lifecycle revolves around the use of machine learning and different analytical strategies to produce insights and predictions from information in order to achieve a business objective.
• The complete method includes a number of steps like data
cleaning, preparation, modelling, model evaluation, etc.
• It is a lengthy procedure and may additionally take quite a
few months to complete.
• So, it is essential to have a generic structure to follow for each and every problem at hand.
• The globally recognized structure for solving any analytical problem is referred to as the Cross Industry Standard Process for Data Mining, or CRISP-DM, framework.
• 1. Business Understanding

• Business Understanding plays a very important role in the success of any project, as the entire life cycle revolves around the business goal.

• In order to acquire the correct data, we should be able to understand the business.

• Asking questions about the dataset and having a proper business objective will help in making the data acquisition process much easier.
• 2. Data Understanding
• After business understanding, the next step is Data
understanding.
• This step involves the collection of all the available data.
• If you are working on a real-time project in your company, then you need to work closely with the business team, as they are aware of what data is present, what data could be used for the business problem, and other information.
• If you are trying to build your own Data Science / Machine Learning project, then you can find free datasets on many websites.
• This step involves describing the data, their structure, their
data type and many other information.
• Explore the data using graphical plots. Basically, extracting
any information that you can get about the data by just
exploring the data.
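A small sketch of this exploration step using pandas and matplotlib; the dataset file (customers.csv) and the 'age' column are assumptions for illustration:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("customers.csv")    # hypothetical dataset

# Describe the data: structure, data types, summary statistics.
print(df.shape)
print(df.dtypes)
print(df.describe())

# Explore the data using graphical plots, e.g. a histogram of one column.
df["age"].hist(bins=20)              # assumes an 'age' column exists
plt.xlabel("age")
plt.show()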
• 3. Data Preparation
• After the Data Understanding step, the next step that comes in the life
cycle steps is Data Preparation.
• This step is also known as Data Cleaning or Data Wrangling.
• It includes steps like:
selecting the relevant data,
integrating the data by merging the data sets,
cleaning the data,
handling missing values by either removing them or imputing them with relevant data,
treating erroneous data by removing it,
checking for outliers and handling them,
constructing new data and deriving new features from existing ones using feature engineering,
formatting the data into the desired structure and removing unwanted columns and features (see the sketch after this list).
● Data preparation is the most time consuming as it takes up to 70%-90% of
the overall project time, yet it’s the most important step in the entire life
cycle.
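The preparation steps listed above can be sketched with pandas as follows; this is purely illustrative, and the dataset and column names (orders_raw.csv, quantity, price, customer_id) are assumptions:

import pandas as pd

df = pd.read_csv("orders_raw.csv")          # hypothetical raw dataset

# Cleaning: remove duplicates and obviously erroneous rows.
df = df.drop_duplicates()
df = df[df["quantity"] > 0]                 # treat negative quantities as errors

# Handling missing values: impute numeric gaps with the median.
df["price"] = df["price"].fillna(df["price"].median())

# Outliers: cap extreme prices at the 99th percentile.
df["price"] = df["price"].clip(upper=df["price"].quantile(0.99))

# Feature engineering: derive a new feature from existing ones.
df["revenue"] = df["quantity"] * df["price"]

# Formatting: keep only the columns needed for modeling.
df = df[["customer_id", "quantity", "price", "revenue"]]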
• 4. Data Modeling
• Data modeling is considered as the heart of data analysis.
• A model takes the prepared data from the previous step (Data
Preparation) as input and provides the desired output.
• This step includes choosing the appropriate type of model, whether the problem is a classification problem, a regression problem or a clustering problem.
• After choosing the model family, we select a model from amongst the various algorithms available.
• We need to tune the hyperparameters of each model to achieve the desired performance.
• In the end we need to evaluate the model by measuring
the accuracy and relevance.
• We also need to make sure there is a correct balance between
performance and generalizability, which means the model created
should not be biased and should be a generalized model.
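A minimal modeling sketch with scikit-learn for a classification problem; a built-in sample dataset stands in for the prepared data, and the chosen algorithm and depth are just examples:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# A built-in sample dataset stands in for the prepared data from the previous step.
X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Choose a model and tune a hyperparameter (tree depth) to balance
# performance and generalizability (avoid an overly complex, overfitted model).
model = DecisionTreeClassifier(max_depth=3)
model.fit(X_train, y_train)

# Evaluate the model by measuring accuracy on unseen data.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))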
• 5. Model Deployment
• The model after rigorous evaluation is finally deployed in the desired
format and channel.
• This is the final step in the data science life cycle.
• Each step in the data science life cycle should be worked upon carefully.
• If any step is executed improperly, it will consequently affect the next step
and the entire effort will go in vain.
• For example, if data is not collected properly, you’ll lose information and
you will not be able to build a perfect model.
• If data is not cleaned properly, the model will not work properly.
• If the model is not evaluated properly, it will fail to give correct output in the real world.
• Right from Business understanding to model deployment, each step
should be given proper attention, time and effort.
• All the above steps make a complete Data Science project but it is an
iterative process and various steps are repeated until we are able to fine
tune the methodology for a specific business case.
• Python and R are the most widely used languages for Data Science.
Data Wrangling
• Data wrangling, often referred to as data cleaning, data cleansing, data remediation, or data munging, is the first important step in understanding and operationalizing data insights.
• Some examples of data wrangling include:
• Merging multiple data sources into a single dataset for
analysis
• Identifying gaps in data (for example, empty cells in a
spreadsheet) and either filling or deleting them
• Deleting data that’s either unnecessary or irrelevant to the
project you’re working on
• Identifying extreme outliers in data and either explaining the
discrepancies or removing them so that analysis can take
place
Need of Data Wrangling
• Any analyses a business performs will ultimately be
constrained by the data that informs them.
• If data is incomplete, unreliable, or faulty, then analyses will
be too—diminishing the value of any insights gleaned.
• Data wrangling seeks to remove that risk by ensuring data is
in a reliable state before it’s analyzed and leveraged.
• This makes it a critical part of the analytical process.
• It’s important to note that data wrangling can be
time-consuming and taxing on resources, particularly when
done manually.
• This is why many organizations institute policies and best
practices that help employees streamline the data cleanup
process—for example, requiring that data include certain
information or be in a specific format before it’s uploaded to
a database.
• The example below explains its importance:
• A book-selling website wants to show the top-selling books of different domains according to user preference.
• For example, a new user searches for motivational books; the site then wants to show those motivational books which sell the most or have a high rating, etc.
• But on their website, there is plenty of raw data from different users.
• Here the concept of Data Munging or Data Wrangling is used.
• As we know, data is not wrangled by the system.
• This process is done by data scientists.
• So, the data scientist will wrangle the data in such a way that they can sort out the motivational books that are sold the most, have high ratings, or are bought together with other books, etc.
• On the basis of that, the new user will make a choice.
Steps to Perform Data Wrangling
• The six-step process for data wrangling, which includes
everything required to make raw data usable.
• Step 1: Data Discovery
• Step 2: Data Structuring
• Step 3: Data Cleaning
• Step 4: Data Enriching
• Step 5: Data Validating
• Step 6: Data Publishing
Step 1: Data Discovery
• The first step in the Data Wrangling process is Discovery.
• This is an all-encompassing term for understanding or getting
familiar with your data.
• You must take a look at the data you have and think about
how you would like it organized to make it easier to consume
and analyze.
• So, you begin with an Unruly Crowd of Data collected from
multiple sources in a wide range of formats.
• At this stage, the goal is to compile the Disparate, Siloed data
sources and configure each of them so they can be
understood and examined to find patterns and trends in the
data.
Step 2: Data Structuring
• When raw data is collected, it’s in a wide range of formats and sizes.
• It has no definite structure, which means that it lacks an existing model
and is completely disorganized.
• It needs to be restructured to fit in with the Analytical
Model deployed by your business, and giving it a structure allows for
better analysis.
• Unstructured data is often text-heavy and contains things such
as Dates, Numbers, ID codes, etc.
• At this stage of the Data Wrangling process, the dataset needs to
be parsed.
• This is a process whereby relevant information is extracted from fresh
data.
• For example, if you are dealing with code scraped from a website, you might parse the HTML, pull out what you need, and discard the rest.
• This will result in a more user-friendly spreadsheet that contains useful
data with columns, classes, headings, and so on
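As an illustration of this parsing step, here is a short sketch using the third-party BeautifulSoup library (not part of the syllabus, used here only as an example; the HTML snippet is made up):

from bs4 import BeautifulSoup   # pip install beautifulsoup4

html = """
<table>
  <tr><th>Book</th><th>Rating</th></tr>
  <tr><td>Deep Work</td><td>4.6</td></tr>
  <tr><td>Atomic Habits</td><td>4.8</td></tr>
</table>
"""

# Parse the raw HTML, pull out only the table cells, and discard the rest.
soup = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup.find_all("tr")[1:]:              # skip the header row
    cells = [td.get_text() for td in tr.find_all("td")]
    rows.append({"book": cells[0], "rating": float(cells[1])})

print(rows)   # structured, spreadsheet-like records ready for analysis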
Step 3: Data Cleaning
• Most people use the words Data Wrangling and Data Cleaning
interchangeably.
• However, these are two very different processes.
• Although a complex process in itself, Cleaning is just a single
aspect of the overall Data Wrangling process.
• For the most part, raw data comes with a lot of errors that
have to be cleaned before the data can move on to the next
stage.
• Data Cleaning involves Tackling Outliers, Making Corrections,
Deleting Bad Data completely, etc.
• This is done by applying algorithms to tidy up and sanitize the
dataset.
• Cleaning the data does the following:

• It removes outliers from your dataset that could potentially skew your results when analyzing the data.

• It changes any null values and standardizes the data format to improve quality and consistency.

• It identifies duplicate values, standardizes systems of measurement, fixes structural errors and typos, and validates the data to make it easier to handle.

• You can automate these algorithmic tasks using a variety of tools such as Python and R, as in the sketch below.
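A brief illustration of these cleaning operations in pandas; the column names and the outlier rule are assumptions made for the example:

import pandas as pd

df = pd.read_csv("reviews_raw.csv")          # hypothetical raw dataset

# Remove duplicate values.
df = df.drop_duplicates()

# Change null values: fill missing ratings with the column median.
df["rating"] = df["rating"].fillna(df["rating"].median())

# Fix typos and standardize the format of a text column.
df["country"] = df["country"].str.strip().str.title().replace({"Usa": "USA"})

# Remove outliers using a simple z-score rule (|z| > 3).
z = (df["rating"] - df["rating"].mean()) / df["rating"].std()
df = df[z.abs() <= 3]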
Step 4: Data Enriching
• At this stage of the Data Wrangling process, you’ve become
familiar with, and have a deep understanding of the data at
hand.
• Combining your raw data with additional data from other sources, such as internal systems and third-party providers, will help you accumulate even more data points to improve the accuracy of your analysis.
• Alternatively, your goal might be to simply fill in gaps in the
data.
• For instance, combining two databases of customer
information where one contains customer addresses, and the
other one doesn’t.
• Enriching the data is an optional step that you only need to
take if your current data doesn’t meet your requirements.
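The customer-address example above might look like this with pandas; the tables, keys and values are hypothetical:

import pandas as pd

# Two hypothetical customer databases: one has addresses, the other does not.
profiles = pd.DataFrame({"customer_id": [1, 2, 3],
                         "name": ["Asha", "Ravi", "Meena"]})
addresses = pd.DataFrame({"customer_id": [1, 3],
                          "address": ["Pune", "Avasari"]})

# Enrich the profiles by filling the address gap from the second source.
enriched = profiles.merge(addresses, on="customer_id", how="left")
print(enriched)   # customer 2 keeps NaN where no address is available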
Step 5: Data Validating
• Validating the data is an activity that surfaces any issues in the quality of your data so they can be addressed with the appropriate transformations.
• The rules of data validation require repetitive programming
processes that help to verify the following:
• Quality
• Consistency
• Accuracy
• Security
• Authenticity
• This is done by checking things such as whether the fields in the
datasets are accurate, and if attributes are normally distributed.
• Preprogrammed scripts are used to compare the data’s attributes
with defined rules.
• This process may need to be repeated several times since you are
likely to find errors
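A small sketch of such preprogrammed validation rules in pandas; the rules and column names here are assumptions for illustration:

import pandas as pd

df = pd.read_csv("customers_clean.csv")      # hypothetical wrangled dataset

# Compare the data's attributes with defined rules.
rules = {
    "rating is between 1 and 5": df["rating"].between(1, 5).all(),
    "customer_id is unique":     df["customer_id"].is_unique,
    "email looks valid":         df["email"].str.contains("@", na=False).all(),
}

for rule, passed in rules.items():
    print(("PASS" if passed else "FAIL"), "-", rule)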
Step 6: Data Publishing
• By this time, all the steps are completed and the data is ready
for analytics.
• All that’s left is to publish the newly Wrangled Data in a place
where it can be easily accessed and used by you and other
stakeholders.
• You can deposit the data into a new architecture or database.
• As long as you completed the other processes correctly, the
final output of your efforts will be high-quality data that you
use to gain insights, create business reports, and more.
• You might even further process the data to create larger and
more complex data structures such as Data Warehouses.
Data
• What is Data
• Data is a collection of information that is translated
for some purpose.
• If data is not formatted in a specific way, it is not valuable to humans.
• Data can be available in different forms such as text, numbers, characters, symbols, etc.
• Data is a raw fact.
Data Types
• Primary Data Definition
• Primary data is the data that is collected for the first time
through personal experiences or evidence, particularly for
research.
• It is also described as raw data or first-hand information.
• The mode of assembling the information is costly, as the
analysis is done by an agency or an external organization, and
needs human resources and investment.
• The investigator supervises and controls the data collection
process directly.
• The data is mostly collected through observations, physical
testing, mailed questionnaires, surveys, personal interviews,
telephonic interviews, case studies, and focus groups, etc.
• Secondary Data Definition

• Secondary data is second-hand data that has already been collected and recorded by other researchers for their own purposes, and not for the current research problem.
• It is accessible in the form of data collected from different
sources such as government publications, censuses, internal
records of the organization, books, journal articles, websites
and reports, etc.
• This method of gathering data is affordable, readily available,
and saves cost and time.
• However, the one disadvantage is that the information
assembled is for some other purpose and may not meet the
present research purpose or may not be accurate.
Comparison of Primary and Secondary Data:
• Meaning: Primary data refers to first-hand data gathered by the researcher himself. Secondary data means data collected by someone else earlier.
• Data: Primary - real-time data. Secondary - past data.
• Process: Primary - very involved. Secondary - quick and easy.
• Source: Primary - surveys, observations, experiments, questionnaires, personal interviews, etc. Secondary - government publications, websites, books, journal articles, internal records, etc.
• Cost effectiveness: Primary - expensive. Secondary - economical.
• Collection time: Primary - long. Secondary - short.
• Specific: Primary - always specific to the researcher's needs. Secondary - may or may not be specific to the researcher's needs.
• Available in: Primary - crude form. Secondary - refined form.
• Accuracy and reliability: Primary - more. Secondary - relatively less.
• Methods of Collection
• Primary Data Primary data is collected through the
following tools:
• Questionnaire: A questionnaire is designed as per the
purpose of the research topics and objectives to be filled
by the sample population which gets further analysed for
suitable results.
• Personal Interview: The team conducts personal
interviews with every sample based on the objectives of
the research.
• Survey: The team conducts on-field surveys to assess the
behaviour of the sample population in accordance with
the research objectives.
• Experiments: The team conducts experiments or
randomised controlled experiments to assess the results.
• Secondary Data Secondary data can be accessed from the
following sources:

• Journals: The journals published every year have reliable sources of


secondary data which gets verified by a group of scholars and hence, can
be used in research.
• Government Databases: The government collects and records data over a period of time, which can be used for analysis.
• The financial and economic data can be found from the RBI and finance
ministries database.
• One can find years of historical data on the databases many of which are
available to the public.
• UN Databases: The United Nations collects primary data from their field
works and interventions which are available to the general public for use.
• Given the reputed name and reliable verification systems, this database
can be another good option to take secondary data from
• Databases of Analytical Companies: Analytical companies like Bloomberg,
Statista have their own databases and analysis which are open to the
users and can be cited as a verifiable source in the research.
Data collection methods

Primary Data Collection

Secondary Data Collection


Primary Data Collection

• Survey Method

• Observation Method

• Experimental method
• Survey Method

❖ Interview

❖ Telephone Interview

❖ Mail survey
• Observation Method

• Structured observation
• Unstructured observation
• Live observation
• Record observation
• Direct observation
• Indirect observation
• Human observation
• Mechanical observation
• Experimental Method

• Under this method , a cause and effect relationship is


established

• The independent variables are manipulated to measure the


effect of such manipulation on dependent variables
Secondary Data Collection
• Unlike primary data collection, there are no specific collection
methods.
• Instead, since the information has already been collected, the
researcher consults various data sources, such as:
• Financial Statements
• Sales Reports
• Retailer/Distributor/Deal Feedback
• Customer Personal Information (e.g., name, address, age,
contact info)
• Business Journals
• Government Records (e.g., census, tax records, Social Security
info)
• Trade/Business Magazines
• The internet
Data → Processing → Information
End of Unit-I
