CO-PO Big Data Analytics
Vision - Institute
To be a leading institution of women empowerment producing internationally accepted professionals with psychological strength, emotional balance and ethical values.
Mission - Institute
M1: To empower women engineers through innovative teaching-learning practices.
M2: To encourage higher education and research with well-equipped laboratories.
M3: To promote entrepreneurship through creativity and innovation.
M4: To promote environmental sustainability and inculcate ethical, emotional and social values.
COURSE OUTCOMES
CO DESCRIPTION
C411.1 Understand the fundamentals of Big Data analytics. (K2)
C411.2 Learn real-time analytics platforms. (K3)
C411.3 Understand the components of the Hadoop system. (K3)
C411.4 Introduce the programming tools Pig and Hive in the Hadoop ecosystem. (K4)
C411.5 Create chart-based pictorial presentations of data. (K2)
CO-PO-PSO Mapping
COs      PO1  PO2  PO3  PO4  PO5  PO6  PO7  PO8  PO9  PO10  PO11  PO12  PSO1  PSO2
C411.1    3    2    3    -    -    -    -    -    -    -     -     -     2     2
C411.2    3    3    2    -    3    -    -    -    2    -     -     -     2     2
C411.3    3    3    3    2    -    -    -    -    -    -     -     1     2     2
C411.4    3    2    2    3    3    -    -    -    2    -     -     2     2     2
C411.5    3    2    2    3    3    -    -    -    2    -     -     2     2     3
AVG       3    2.4  2.4  2.7  3    -    -    -    2    -     -     1.7   2     2.2
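Note: Each AVG value is the mean of the mapped (non-dash) entries in its column; for example, PO4 = (2 + 3 + 3) / 3 ≈ 2.7 and PO12 = (1 + 2 + 2) / 3 ≈ 1.7.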
SYLLABUS
Course Name: BIG DATA ANALYTICS Course Code: C411
Year/Sem: IV B.Tech II Sem    Regulation: R19
Admitted Batch: 2019-20    Academic Year: 2022-23
Course Objectives:
To optimize business decisions and create competitive advantage with Big Data analytics
To learn to analyze big data using intelligent techniques
To introduce the programming tools Pig and Hive in the Hadoop ecosystem
SYLLABUS
UNIT I
Introduction: Introduction to big data: Introduction to Big Data Platform, Challenges of
Conventional Systems, Intelligent data analysis, Nature of Data, Analytic Processes and Tools,
Analysis vs Reporting.
UNIT II
Stream Processing: Mining data streams: Introduction to Streams Concepts, Stream Data Model
and Architecture, Stream Computing, Sampling Data in a Stream, Filtering Streams, Counting
Distinct Elements in a Stream, Estimating Moments, Counting Ones in a Window, Decaying
Window, Real time Analytics Platform (RTAP) Applications, Case Studies - Real Time Sentiment
Analysis - Stock Market Predictions.
UNIT III
Introduction to Hadoop: Hadoop: History of Hadoop, the Hadoop Distributed File System,
Components of Hadoop, Analysing the Data with Hadoop, Scaling Out, Hadoop Streaming,
Design of HDFS, Java interfaces to HDFS Basics, Developing a Map Reduce Application, How
Map Reduce Works, Anatomy of a Map Reduce Job run, Failures, Job Scheduling, Shuffle and
Sort, Task execution, Map Reduce Types and Formats, Map Reduce Features, Hadoop
environment.
UNIT IV
Frameworks and Applications: Frameworks: Applications on Big Data Using Pig and Hive, Data
processing operators in Pig, Hive services, HiveQL, Querying Data in Hive, fundamentals of
HBase and ZooKeeper.
UNIT V
Predictive Analytics and Visualizations: Predictive Analytics, Simple linear regression, Multiple
linear regression, Interpretation of regression coefficients, Visualizations, Visual data analysis
techniques, interaction techniques, Systems and applications.
Text Books:
1) Tom White, “Hadoop: The Definitive Guide”, Fourth Edition, O’Reilly Media, 2015.
2) Chris Eaton, Dirk deRoos, Tom Deutsch, George Lapis, Paul Zikopoulos, “Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data”, McGraw-Hill Publishing, 2012.
3) Anand Rajaraman and Jeffrey David Ullman, “Mining of Massive Datasets”, Cambridge University Press, 2012.
Reference Books:
1) Bill Franks, “Taming the Big Data Tidal Wave: Finding Opportunities in Huge Data Streams with Advanced Analytics”, John Wiley & Sons, 2012.
2) Paul Zikopoulos, Dirk deRoos, Krishnan Parasuraman, Thomas Deutsch, James Giles, David Corrigan, “Harness the Power of Big Data: The IBM Big Data Platform”, Tata McGraw-Hill Publications, 2012.
3) Arshdeep Bahga and Vijay Madisetti, “Big Data Science & Analytics: A Hands-On Approach”, VPT, 2016.
4) Bart Baesens, “Analytics in a Big Data World: The Essential Guide to Data Science and its Applications (Wiley Big Data Series)”, John Wiley & Sons, 2014.
COURSE OUTCOMES
CO No   CO-BASED QUESTION
C411.1  Can you explain what kind of data is called Big Data?
C411.2  Can you describe a Big Data real-time analytics platform?
C411.3  Can you explain the components of the Hadoop system?
C411.4  Can you use the programming tools Pig and Hive in the Hadoop ecosystem?
C411.5  Can you construct a chart-based pictorial presentation of data?
UNIT – I:
Introduction: Introduction to big data: Introduction to Big Data Platform, Challenges of Conventional
Systems, Intelligent data analysis, Nature of Data, Analytic Processes and Tools, Analysis vs Reporting.
Objective: Illustrate big data challenges in different domains including social media, transportation, finance and medicine.
UNIT – II:
Stream Processing: Mining data streams: Introduction to Streams Concepts, Stream Data Model and
Architecture, Stream Computing, Sampling Data in a Stream, Filtering Streams, Counting Distinct Elements
in a Stream, Estimating Moments, Counting Ones in a Window, Decaying Window, Real time Analytics
Platform (RTAP) Applications, Case Studies - Real Time Sentiment Analysis - Stock Market Predictions.
UNIT – III:
Introduction to Hadoop: Hadoop: History of Hadoop, the Hadoop Distributed File System, Components of
Hadoop, Analysing the Data with Hadoop, Scaling Out, Hadoop Streaming, Design of HDFS, Java interfaces
to HDFS Basics, Developing a Map Reduce Application, How Map Reduce Works, Anatomy of a Map
Reduce Job run, Failures, Job Scheduling, Shuffle and Sort, Task execution, Map Reduce Types and
Formats, Map Reduce Features, Hadoop environment.
Objective: Design and develop Hadoop applications.
[Lesson-plan table (S. No / Topics / Text Book: Page No / Teaching Aids) lost in extraction; surviving entry: 31. Failures, T1: 361]
UNIT – IV:
Frameworks and Applications: Frameworks: Applications on Big Data Using Pig and Hive, Data processing operators
in Pig, Hive services, HiveQL, Querying Data in Hive, fundamentals of HBase and ZooKeeper.
Objective: Identify the characteristics of datasets and compare the trivial data and big data for various
applications
[Lesson-plan table (S. No / Topics / Text Book: Page No / Teaching Aids) lost in extraction; surviving entries: 38. Frameworks, T2: 358-359; ZooKeeper, 46, T1: 604-630]
UNIT – V
Predictive Analytics and Visualizations: Predictive Analytics, Simple linear regression, Multiple linear
regression, Interpretation of regression coefficients, Visualizations, Visual data analysis techniques,
interaction techniques, Systems and applications.
Objective: Explore the various search methods and visualization techniques
[Lesson-plan table (S. No / Topics / Text Book: Page No / Teaching Aids) lost in extraction; surviving entry: 55. visualization techniques, T2: 602-650]
Content beyond syllabus covered (if any):
Software Links:
1. Hadoop: http://hadoop.apache.org/
2. Hive: https://cwiki.apache.org/confluence/display/Hive/Home
3. Pig Latin: http://pig.apache.org/docs/r0.7.0/tutorial.html
4. Big Data Analytics - YouTube
Break Timings: 10:50 - 11:10 AM & 03:20 - 03:30 PM    Lunch: 12:50 - 01:40 PM
OTHERS: Remedial Classes/Revision Classes/Projects/Activities (Co-curricular & Extra-curricular)
Course Name                    Section A                Section B                Section C
Management and Organization    Mrs. B. Lavanya          Mrs. K. Santosh Kumar    Mrs. B. Lavanya
Non Conventional Energy Res    Mrs. T. Sugana           Mrs. M. Satyavathi       Mrs. A. Venkata Lakshmi
Big Data Analytics             Mrs. K. Santoshi Rupa    Mrs. Ch. Suneetha        Mrs. Ch. Suneetha
# Project- II
Coordinator Head of the Department Principal
Introduction: A case study is an empirical inquiry that investigates a contemporary phenomenon within its
real-life context, especially when the boundaries between the phenomenon and context are not clearly
evident.
Case study scenario: Describe an approach for analysis of the stock market to understand its volatile nature
and predict its behavior to make profits by investing in it.
Steps Involved (a minimal code sketch follows these steps):
Step 1: Introduction to Stock Market Prediction
Step 2: Problem Statement for Stock Market Prediction
Step 3: The Long Short-Term Memory (LSTM) Model
Program Implementation
Importing the Libraries
Step 4: Visualising the Stock Market Prediction Data
Check for null values by printing the DataFrame shape
Step 5: Setting the Target Variable and Selecting the Features
Scaling
Step 6: Creating a Training Set and a Test Set for Stock Market Prediction
Step 7: Data Processing for LSTM
Step 8: Building the LSTM Model for Stock Market Prediction
Step 9: Training the Stock Market Prediction Model
Step 10: Conclusion
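The following is a minimal Python sketch of Steps 4-9 under stated assumptions: a CSV file (hypothetically named stock_data.csv) with a Close column, a 60-day look-back window, and a single-layer LSTM. The dataset, window size and network shape actually used by the student groups may differ.

    # Minimal sketch of Steps 4-9 of the case study.
    # Assumptions: a file "stock_data.csv" with a "Close" column; the
    # window size and layer sizes below are illustrative, not prescribed.
    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dense

    df = pd.read_csv("stock_data.csv")
    print(df.shape, df.isnull().sum())            # Step 4: shape and null check

    close = df["Close"].values.reshape(-1, 1)     # Step 5: target variable
    scaler = MinMaxScaler()                       # Step 5: scale to [0, 1]
    close = scaler.fit_transform(close)

    window = 60                                   # look-back window (assumed)
    X = np.array([close[i - window:i, 0] for i in range(window, len(close))])
    y = close[window:, 0]
    X = X.reshape(X.shape[0], X.shape[1], 1)      # Step 7: (samples, timesteps, features)

    split = int(0.8 * len(X))                     # Step 6: train/test split
    X_train, X_test = X[:split], X[split:]
    y_train, y_test = y[:split], y[split:]

    model = Sequential([                          # Step 8: build the LSTM model
        LSTM(50, input_shape=(window, 1)),
        Dense(1),
    ])
    model.compile(optimizer="adam", loss="mean_squared_error")
    model.fit(X_train, y_train, epochs=10, batch_size=32)   # Step 9: train

    # Predictions can be un-scaled for comparison with actual prices.
    predicted = scaler.inverse_transform(model.predict(X_test))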
No of groups: 15
Team size: 4 or 5
Time Allotted: one week
Problem Statement
The case study is titled Stock Market Prediction. With the advent of technological marvels like global digitization, the
prediction of the stock market has entered a technologically advanced era. With the ceaseless increase in market
capitalization, stock trading has become a center of investment for many financial investors. Many analysts and
researchers have developed tools and techniques that predict stock price movements and help investors make proper
decisions. Advanced trading models enable researchers to predict the market using non-traditional textual data
from social platforms. The application of advanced machine learning approaches such as text data analytics and
ensemble methods has greatly increased prediction accuracies. Meanwhile, the analysis and prediction of stock
markets continue to be one of the most challenging research areas due to dynamic, erratic, and chaotic data. This study
explains the systematics of machine learning-based approaches for stock market prediction based on the deployment of
a generic framework. The study should help emerging researchers understand the basics and advancements
of this emerging area, and thus carry on further stock market prediction research.
End-Users:
• Fundamental analyst: Collects and maintains the required information, and supports the developer.
• Developer: Implements the code and defines the user interface.
• Results: Provides and meets the requirements of the predicted results.
Solution: Diagrams
Figure 1: Activity in classroom
Report: Documentation for the case study was submitted by the students with the final solution obtained from their groups. Groups 6 and 12 were selected as the best case study reports among the groups and are documented here.
Activity Outcome to PO Mappings
Post Implications:
UNIT-1
2. Which of the following is not one of the four V’s of Big Data?
Answer: d) Value
3. What is the process of transforming structured and unstructured data into a format that can
be easily analyzed?
4. Which of the following is a tool used for processing and analyzing Big Data?
Answer: a) Hadoop
5. What is the process of examining large and varied data sets to uncover hidden patterns,
unknown correlations, market trends, customer preferences, and other useful information?
6. Which of the following is not a common challenge associated with Big Data?
7. Which of the following is a technique used to extract meaningful insights from data sets
that are too large or complex to be processed by traditional data processing tools?
8. What is the process of storing and managing data in a way that allows for efficient retrieval
and analysis?
9. Which of the following is a common programming language used for Big Data processing?
10. Which of the following is a popular NoSQL database used for Big Data processing?
Answer: d) MongoDB
11. What is the process of combining data from multiple sources into a single, unified view?
12. What is the term used for the ability of a system to handle increasing amounts of data and
traffic without compromising performance?
13. What is the process of cleaning and transforming data before it is used for analysis?
14. Which of the following is not a common type of data in Big Data analysis?
15. Which of the following is a method for analyzing data in which the data is split into
smaller subsets and processed in parallel across multiple servers or nodes?
Answer: c) MapReduce
17. Which of the following is a popular programming language used for data analysis and
machine learning?
Answer: c) Python
18. Which of the following is not a common data storage technology used for Big Data
processing?
Answer: c) MySQL
19. What is the process of automatically categorizing or grouping data based on its
characteristics or attributes?
20. Which of the following is not a common data visualization tool used for Big Data
analysis?
UNIT-2
1. Which of the following is a technique used for identifying patterns in data by training a
model on a dataset and using it to make predictions on new data?
3. Which of the following is a type of machine learning algorithm in which the input data is
labeled and the model is trained to make predictions on new, unlabeled data?
4. Which of the following is a type of machine learning algorithm in which the input data is
not labeled and the model is trained to find patterns or structure in the data?
a) Decision Trees b) Random Forests c) Neural Networks d) All of the above are common
machine learning models
7. Which of the following is a measure of how well a machine learning model is able to make
predictions on new data?
9. Which of the following is not a common use case for Big Data analytics?
d) Inventory Management
10. Which of the following is a technique for predicting a continuous target variable?
Answer: b) Regression
11. Which of the following is a technique for grouping similar data points together?
Answer: Clustering
13. Which of the following is a measure of the relationship between two variables?
Answer: a) Correlation
15. Which of the following is a measure of how much a dependent variable changes when an
independent variable changes?
Answer: c) Slope
16. Which of the following is not a common method for selecting the best features for a
machine learning model?
17. Which of the following is a measure of how much a model’s predictions vary for different
input values?
Answer: b) Variance
18. Which of the following is not a common machine learning algorithm for classification?
Answer: c) Deduplication
20. Which of the following is a technique for reducing the dimensionality of a dataset?
UNIT-3
A. Hadoop File System B. Hadoop Field System C. Hadoop File Search D. Hadoop Field Search
Ans : A
Ans : B
A. Fault detection and recovery B. Huge datasets C. Hardware at data D. All of the above
Ans : D
4. ________ Name Node is used when the Primary Name Node goes down.
Ans : C
5. The minimum amount of data that HDFS can read or write is called a _____________.
Ans : C
Ans : B
Ans : A
A. It is suitable for the distributed storage and processing. B. Streaming access to file system
data. C. HDFS provides file permissions and authentication. D. All of the above
Ans : D
Ans : C
10. During start up, the ___________ loads the file system state from the fsimage and the
edits log file.
Ans : B
C. It can only process structured data. D. It can only process data stored in a Hadoop
Distributed File System (HDFS).
A. They consist of only one map task and one reduce task. B. They can have multiple map
tasks and reduce tasks. C. They can only have one reduce task. D. They can only have one
map task.
Answer: B. They can have multiple map tasks and reduce tasks.
13. What is the purpose of the map function in MapReduce?
A. To convert input data into key-value pairs B. To sort the input data C. To combine the
input data D. To summarize the input data
A. To sort the input data B. To combine the input data C. To summarize the input data
15. Which of the following is true about the shuffle phase in MapReduce?
A. It sorts the output of the map phase. B. It sorts the output of the reduce phase.
C. It combines the output of the map phase. D. It combines the output of the reduce phase.
16. Which of the following is true about the combiner function in MapReduce?
A. It is the same as the reduce function. B. It is run after the reduce function.
C. It is run after the map function. D. It is run before the reduce function.
17. Which of the following is true about the partitioner function in MapReduce?
A. It is used to sort the output of the map phase. B. It is used to group the output of the map
phase by key. C. It is used to divide the output of the map phase into partitions.
Answer: C. It is used to divide the output of the map phase into partitions.
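To make the partitioner answer above concrete, here is a small illustrative Python sketch (Hadoop's actual default is the Java HashPartitioner; this only mirrors the idea of dividing map output into partitions by hashing keys):

    # Sketch of hash partitioning: map-output pairs are divided among
    # reducers by hashing the key, so all values for a key meet at one reducer.
    import zlib

    map_output = [("apple", 1), ("bat", 1), ("apple", 1), ("cat", 1)]
    num_reducers = 2

    partitions = {i: [] for i in range(num_reducers)}
    for key, value in map_output:
        # crc32 gives a stable hash across runs (illustrative choice)
        partitions[zlib.crc32(key.encode()) % num_reducers].append((key, value))

    for i, pairs in partitions.items():
        print(f"partition {i}: {pairs}")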
UNIT-4
1. Which of the following Pig Latin statements is used to filter data based on a specified
condition?
Answer: d) FILTER
d) Better security
a) Groups data based on a specified key b) Sorts data in ascending order c) Filters data based
on a specified condition d) Performs a cross-product of two datasets
c) Joins two datasets based on a common key d) Performs a cross-product of two datasets
c) Filters data based on a specified condition d) Limits the number of records returned
a) They can be executed only on a single node b) They must be written in Java
c) They can be run on a cluster of nodes d) They require a web interface to execute
12. What is the name of the component in Apache Pig that translates Pig Latin scripts into
MapReduce jobs?
13. Which of the following statements is true about Pig Latin UDFs (User-Defined
Functions)?
a) They can only be written in Java b) They can be written in multiple programming
languages c) They are not allowed in Pig Latin scripts d) They are pre-built functions
provided by Pig
15. Which of the following statements is true about Apache Pig Latin schemas?
a) They cannot be defined by the user b) They must be defined using JSON
c) Provides a detailed explanation of the execution plan for a Pig Latin script
d) Performs a cross-product of two datasets
Answer: c) Provides a detailed explanation of the execution plan for a Pig Latin script
17. Which of the following statements is true about Pig Latin LOAD statements?
a) They are not required for reading data into Pig b) They are used to write data to a file
c) They must be written in Java d) They specify the location and format of the input data
Answer: d) They specify the location and format of the input data
19. Which of the following Pig Latin statements is used to group data based on a specified
key?
Answer: a) GROUP BY
20. Which of the following Pig Latin statements is used to sort data in ascending order?
Answer: ORDER BY (in Pig Latin, sorting is done with ORDER ... BY; SORT BY is a HiveQL clause)
UNIT-5
a) Data from a YouTube interview b) Data from the textbook c) All of the above
a) Ranks customers and locations based on probability b) Ranks customers and locations based on profitability c) Distinguishes the products and services that drive revenues
a) Can perform arithmetic operations on rows and columns b) Potentially columns are of different types c) Labeled axes (rows and columns) d) All of the above
Answer: matplotlib.pyplot
b) Four: The Professional Data Analyst, The IT users, The head of the company, The Business Users c) One - The Business Users
Answer: Four: The Professional Data Analyst, The IT users, The head of the company, The Business Users
a) Data mart b) Data warehouse c) Both Data Warehouse and Data mart d) Database
10. In regards to separated value files such as .csv and .tsv, what is the delimiter?
a) Any character such as the comma (,) or tab (\t) that is used to separate the row data
c) Any character such as the comma (,) or tab (\t) that is used to separate the column data
Answer: c) Any character such as the comma (,) or tab (\t) that is used to separate the column data
13. Which of the following methods should be employed in the code to display a plot?
Answer: show()
Answer: barh()
a) BI has a direct impact on an organization’s strategic, tactical and operational business
decisions b) BI converts raw data into meaningful information
c) BI tools perform data analysis and create reports, summaries, dashboards, maps, graphs,
and charts d) All of the above
19. Which of the following creates an object which maps data to a dictionary?
Answer: DictReader()
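A short Python sketch tying together the csv.DictReader, barh() and show() answers above; the file name sales.csv and its region/revenue columns are invented for illustration:

    # csv.DictReader maps each row of the file to a dictionary keyed by the
    # header row; barh() draws a horizontal bar chart and show() displays it.
    # "sales.csv" and its "region"/"revenue" columns are illustrative.
    import csv
    import matplotlib.pyplot as plt

    with open("sales.csv", newline="") as f:
        rows = list(csv.DictReader(f))

    regions = [row["region"] for row in rows]
    revenue = [float(row["revenue"]) for row in rows]

    plt.barh(regions, revenue)   # horizontal bar chart
    plt.xlabel("Revenue")
    plt.show()                   # display the plot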
MapReduce and HDFS are the two major components of Hadoop that make it so powerful and efficient
to use. MapReduce is a programming model used for efficient parallel processing over large data-sets in
a distributed manner. The data is first split and then combined to produce the final result. Libraries for
MapReduce have been written in many programming languages, each with different optimizations.
The purpose of MapReduce in Hadoop is to map each job and then reduce it to equivalent tasks, which
lowers overhead on the cluster network and reduces the processing power required. A
MapReduce task is mainly divided into two phases: the Map phase and the Reduce phase.
MapReduce Architecture:
1. Client: The MapReduce client is the one who brings the job to MapReduce for processing. There
can be multiple clients that continuously send jobs for processing to the Hadoop MapReduce
Master.
2. Job: The MapReduce job is the actual work the client wants to perform, composed of many
smaller tasks that the client wants to process or execute.
3. Hadoop MapReduce Master: It divides the particular job into subsequent job-parts.
4. Job-Parts: The tasks or sub-jobs obtained after dividing the main job. The results of all the job-
parts are combined to produce the final output.
5. Input Data: The data set that is fed to MapReduce for processing.
6. Output Data: The final result obtained after processing.
In MapReduce, we have a client. The client submits a job of a particular size to the Hadoop
MapReduce Master. The MapReduce Master then divides this job into further equivalent job-parts.
These job-parts are then made available to the Map and Reduce tasks. The Map and Reduce tasks
contain the program required by the use-case that the particular company is solving; the
developer writes the logic to fulfill the industry's requirement. The input data is fed to the
Map task, and the Map generates intermediate key-value pairs as its output.
The output of the Map, i.e. these key-value pairs, is then fed to the Reducer, and the final output is stored on
HDFS. Any number of Map and Reduce tasks can be made available for processing the data as
required. The Map and Reduce algorithms are written in an optimized way so that time and space
complexity are minimal.
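Since the syllabus covers Hadoop Streaming, here is a minimal word-count sketch of the two phases as separate Python scripts (the file names mapper.py and reducer.py are conventional choices, not fixed by Hadoop); Streaming pipes input lines to the mapper's stdin and feeds the reducer its key-sorted output:

    #!/usr/bin/env python3
    # mapper.py - Map task: read lines from stdin, emit "word<TAB>1" pairs.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

    #!/usr/bin/env python3
    # reducer.py - Reduce task: input arrives sorted by key (shuffle and sort),
    # so counts for the same word are adjacent and can be summed in one pass.
    import sys

    current, count = None, 0
    for line in sys.stdin:
        line = line.rstrip("\n")
        if not line:
            continue
        word, value = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

The pair can be tested locally without a cluster: cat input.txt | python3 mapper.py | sort | python3 reducer.py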
How MapReduce Works?
The MapReduce algorithm contains two important tasks, namely Map and Reduce.
The Map task takes a set of data and converts it into another set of data, where individual elements are
broken down into tuples (key-value pairs).
The Reduce task takes the output from the Map as an input and combines those data tuples (key-value
pairs) into a smaller set of tuples.
The reduce task is always performed after the map job.
Let us now take a close look at each of the phases and try to understand their significance.
Input Phase − Here we have a Record Reader that translates each record in an input file and sends the
parsed data to the mapper in the form of key-value pairs.
Map − Map is a user-defined function, which takes a series of key-value pairs and processes each one
of them to generate zero or more key-value pairs.
Intermediate Keys − The key-value pairs generated by the mapper are known as intermediate keys.
Combiner − A combiner is a type of local Reducer that groups similar data from the map phase into
identifiable sets. It takes the intermediate keys from the mapper as input and applies a user-defined
code to aggregate the values in a small scope of one mapper. It is not a part of the main MapReduce
algorithm; it is optional.
Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. It downloads the grouped
key-value pairs onto the local machine, where the Reducer is running. The individual key-value pairs
are sorted by key into a larger data list. The data list groups the equivalent keys together so that their
values can be iterated easily in the Reducer task.
Reducer − The Reducer takes the grouped key-value paired data as input and runs a Reducer function
on each one of them. Here, the data can be aggregated, filtered, and combined in a number of ways,
and it requires a wide range of processing. Once the execution is over, it gives zero or more key-value
pairs to the final step.
Output Phase − In the output phase, we have an output formatter that translates the final key-value
pairs from the Reducer function and writes them onto a file using a record writer.
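To see the phases end to end, here is a small self-contained Python simulation of map, combine, shuffle-and-sort, and reduce on a toy word-count input (an illustration of the flow above, not Hadoop's internal code):

    # Toy in-memory walk-through of the phases: two input splits (one per
    # mapper), a local combiner, then shuffle/sort and a final reduce.
    from collections import defaultdict
    from itertools import groupby
    from operator import itemgetter

    splits = [["big data", "big ideas"], ["data data"]]      # two mappers

    def map_fn(record):
        # Map: each record yields zero or more (key, value) pairs.
        return [(word, 1) for word in record.split()]

    def combine(pairs):
        # Combiner: optional local aggregation within one mapper's output.
        pairs = sorted(pairs, key=itemgetter(0))
        return [(k, sum(v for _, v in g)) for k, g in groupby(pairs, key=itemgetter(0))]

    map_outputs = [combine([p for rec in split for p in map_fn(rec)])
                   for split in splits]

    # Shuffle and sort: group values for the same key across all mappers.
    groups = defaultdict(list)
    for output in map_outputs:
        for key, value in output:
            groups[key].append(value)

    # Reduce: aggregate each key's list of values into the final output.
    result = {key: sum(values) for key, values in sorted(groups.items())}
    print(result)   # {'big': 2, 'data': 3, 'ideas': 1}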
Let us try to understand the two tasks Map & Reduce with the help of a small diagram.
MapReduce Example
Let us take a real-world example to comprehend the power of MapReduce. Twitter receives around 500
million tweets per day, which is nearly 3000 tweets per second. The following illustration shows how
Twitter manages its tweets with the help of MapReduce.
As shown in the illustration, the MapReduce algorithm performs the following actions –
Tokenize − Tokenizes the tweets into maps of tokens and writes them as key-value pairs.
Filter − Filters unwanted words from the maps of tokens and writes the filtered maps as key-value
pairs.
Count − Generates a token counter per word.
Aggregate Counters − Prepares an aggregate of similar counter values into small manageable units.
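A compact Python sketch of this four-step pipeline on a couple of sample tweets (the tweets and the stop-word filter list are invented for illustration):

    # Tokenize -> Filter -> Count -> Aggregate, mirroring the four actions above.
    from collections import Counter

    tweets = ["big data is big", "hadoop makes big data easy"]   # sample input
    stop_words = {"is", "makes"}                                 # assumed filter list

    tokens = [w for t in tweets for w in t.lower().split()]      # Tokenize
    filtered = [w for w in tokens if w not in stop_words]        # Filter
    counts = Counter(filtered)                                   # Count per word
    print(counts.most_common())  # Aggregate: counters in manageable units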