CO-PO Big Data Analytics

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

Vision-Institute
To be a leading institution of women empowerment producing internationally accepted
professionals with psychological strength, emotional balance and ethical values.

Mission-Institute
M1: To empower women engineers through innovative teaching-learning practices.
M2: To encourage higher education and research with well-equipped laboratories.
M3: To promote entrepreneurship through creativity and innovation.
M4: To promote environmental sustainability and inculcate ethical, emotional and social
values.
Vision-Computer Science and Engineering


To evolve into a center of excellence and to empower women in emerging areas of Computer
Science and Engineering with human values
Mission-Computer Science and Engineering
M1: To train students to analyze, design, develop and test software applications
M2: To impart technical expertise in sustaining the needs of the IT industry
M3: To foster research activities and entrepreneurial skills in emerging technologies
M4: To inculcate lifelong learning skills in line with technological advancement and social
consciousness

Program Specific Outcomes


PSO 1: Graduates exhibit knowledge of basic sciences and skills in engineering
specializations like information security, cloud computing, networking, software engineering
and data analytics.
PSO 2: Graduates can adapt to evolving technologies for the design and development of
full-stack applications in diversified fields with optimal programming skills.
Program Educational Objectives
PEO 1: Graduates are able to lead a diverse range of careers in IT sectors and initiate
entrepreneurship in software development.
PEO 2: Graduates are able to excel in higher studies and research in emerging areas of
Computer Science and Engineering.
PEO 3: Graduates are able to pursue continuous learning by adapting to technological trends
to help society with ethical values.
Program Outcomes
Engineering Graduates will be able to:
1. Engineering knowledge: Apply the knowledge of mathematics, science, engineering
fundamentals, and an engineering specialization to the solution of complex engineering
problems.
2. Problem analysis: Identify, formulate, review research literature, and analyze complex
engineering problems reaching substantiated conclusions using first principles of
mathematics, natural sciences, and engineering sciences.
3. Design/development of solutions: Design solutions for complex engineering problems
and design system components or processes that meet the specified needs with appropriate
consideration for the public health and safety, and the cultural, societal, and environmental
considerations.
4. Conduct investigations of complex problems: Use research-based knowledge and
research methods including design of experiments, analysis and interpretation of data, and
synthesis of the information to provide valid conclusions.
5. Modern tool usage: Create, select, and apply appropriate techniques, resources, and
modern engineering and IT tools including prediction and modeling to complex engineering
activities with an understanding of the limitations.
6. The engineer and society: Apply reasoning informed by the contextual knowledge to
assess societal, health, safety, legal and cultural issues and the consequent responsibilities
relevant to the professional engineering practice.
7. Environment and sustainability: Understand the impact of the professional engineering
solutions in societal and environmental contexts, and demonstrate the knowledge of, and
need for sustainable development.
8. Ethics: Apply ethical principles and commit to professional ethics and responsibilities and
norms of the engineering practice.
9. Individual and team work: Function effectively as an individual, and as a member or
leader in diverse teams, and in multidisciplinary settings.
10. Communication: Communicate effectively on complex engineering activities with the
engineering community and with society at large, such as, being able to comprehend and
write effective reports and design documentation, make effective presentations, and give and
receive clear instructions.
11. Project management and finance: Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one’s own work, as a member
and leader in a team, to manage projects and in multidisciplinary environments.
12. Life-long learning: Recognize the need for, and have the preparation and ability to
engage in independent and life-long learning in the broadest context of technological change.
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
CO-PO-PSO MAPPING

Course Name: BIG DATA ANALYTICS   Course Code: C411

Year/Sem: IV B.Tech - II Sem   Regulation: R19
Admitted Batch: 2019-20   Academic Year: 2022-23
Course Coordinator: CH. SUNEETHA

COURSE OUTCOMES

CO      DESCRIPTION
C411.1  Understand the fundamentals of Big Data analytics. (K2)
C411.2  Learn real-time analytics platforms. (K3)
C411.3  Understand the components of the Hadoop system. (K3)
C411.4  Use the programming tools Pig & Hive in the Hadoop ecosystem. (K4)
C411.5  Create chart-based pictorial presentations of data. (K2)

PROGRAM SPECIFIC OUTCOMES


PSO 1: Graduates exhibit knowledge of basic sciences and skills in engineering specializations like
information security, cloud computing, networking, software engineering and data analytics.
PSO 2: Graduates can adapt to evolving technologies for the design and development of full-stack
applications in diversified fields with optimal programming skills.

COs      PO1  PO2  PO3  PO4  PO5  PO6  PO7  PO8  PO9  PO10  PO11  PO12  PSO1  PSO2
C411.1    3    2    3    -    -    -    -    -    -    -     -     -     2     2
C411.2    3    3    2    -    3    -    -    -    2    -     -     -     2     2
C411.3    3    3    3    2    -    -    -    -    -    -     -     1     2     2
C411.4    3    2    2    3    3    -    -    -    2    -     -     2     2     2
C411.5    3    2    2    3    3    -    -    -    2    -     -     2     2     3
AVG       3   2.4  2.4  2.7   3    -    -    -    2    -     -    1.7    2    2.2

COURSE COORDINATOR HEAD OF THE DEPARTMENT


DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

SYLLABUS
Course Name: BIG DATA ANALYTICS   Course Code: C411
Year/Sem: IV B.Tech II Sem   Regulation: R19
Admitted Batch: 2019-20   Academic Year: 2022-23
Course Objectives:
• To optimize business decisions and create competitive advantage with Big Data analytics
• To learn to analyze big data using intelligent techniques
• To introduce the programming tools Pig & Hive in the Hadoop ecosystem

SYLLABUS
UNIT I
Introduction: Introduction to big data: Introduction to Big Data Platform, Challenges of
Conventional Systems, Intelligent data analysis, Nature of Data, Analytic Processes and Tools,
Analysis vs Reporting.
UNIT II
Stream Processing: Mining data streams: Introduction to Streams Concepts, Stream Data Model
and Architecture, Stream Computing, Sampling Data in a Stream, Filtering Streams, Counting
Distinct Elements in a Stream, Estimating Moments, Counting Oneness in a Window, Decaying
Window, Real time Analytics Platform (RTAP) Applications, Case Studies - Real Time Sentiment
Analysis - Stock Market Predictions.
UNIT III
Introduction to Hadoop: Hadoop: History of Hadoop, the Hadoop Distributed File System,
Components of Hadoop, Analysing the Data with Hadoop, Scaling Out, Hadoop Streaming,
Design of HDFS, Java interfaces to HDFS Basics, Developing a Map Reduce Application, How
Map Reduce Works, Anatomy of a Map Reduce Job run, Failures, Job Scheduling, Shuffle and
Sort, Task execution, Map Reduce Types and Formats, Map Reduce Features, Hadoop
environment.
UNIT IV
Frameworks and Applications: Frameworks: Applications on Big Data Using Pig and Hive, Data
processing operators in Pig, Hive services, HiveQL, Querying Data in Hive, fundamentals of
HBase and ZooKeeper.
UNIT V
Predictive Analytics and Visualizations: Predictive Analytics, Simple linear regression, Multiple
linear regression, Interpretation of regression coefficients, Visualizations, Visual data analysis
techniques, interaction techniques, Systems and application.
Text Books:
1) Tom White, “Hadoop: The Definitive Guide”, Fourth Edition, O’Reilly Media, 2015.
2) Chris Eaton, Dirk deRoos, Tom Deutsch, George Lapis, Paul Zikopoulos,
“Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming
Data”, McGraw-Hill Publishing, 2012.
3) Anand Rajaraman and Jeffrey David Ullman, “Mining of Massive Datasets”,
Cambridge University Press, 2012.
Reference Books:
1) Bill Franks, “Taming the Big Data Tidal Wave: Finding Opportunities in Huge Data
Streams with Advanced Analytics”, John Wiley & Sons, 2012.
2) Paul Zikopoulos, Dirk deRoos, Krishnan Parasuraman, Thomas Deutsch, James Giles,
David Corrigan, “Harness the Power of Big Data: The IBM Big Data Platform”, Tata
McGraw-Hill Publications, 2012.
3) Arshdeep Bahga and Vijay Madisetti, “Big Data Science & Analytics: A Hands-On
Approach”, VPT, 2016.
4) Bart Baesens, “Analytics in a Big Data World: The Essential Guide to Data Science
and its Applications (Wiley Big Data Series)”, John Wiley & Sons, 2014.

COURSE COORDINATOR HEAD OF THE DEPARTMENT


DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
COURSE END SURVEY QUESTIONS

Course Name: BIG DATA ANALYTICS   Course Code: C411

Year/Sem: IV B.Tech - II Sem   Regulation: R19
Admitted Batch: 2019-20   Academic Year: 2022-23
Course Coordinator: CH. SUNEETHA

COURSE OUTCOMES

C411.1 Are you able to understand the fundamentals of Big Data?
C411.2 Are you able to describe a real-time analytics platform?
C411.3 Are you able to apply the components of the Hadoop system?
C411.4 Are you able to understand the programming tools Pig & Hive in the Hadoop ecosystem?
C411.5 Are you able to create chart-based pictorial presentations of data?

CO No.  CO-BASED QUESTION
C411.1 Can you explain what kind of data is called big data?
C411.2 Can you describe a Big Data real-time platform?
C411.3 Can you explain the components of the Hadoop system?
C411.4 Can you use the programming tools Pig & Hive in the Hadoop ecosystem?
C411.5 Can you construct chart-based pictorial presentations of data?

COURSE COORDINATOR HEAD OF THE DEPARTMENT


DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
LECTURE PLAN

Course Name: BIG DATA ANALYTICS   Course Code: C411

Year/Sem: IV B.Tech - II Sem   Regulation: R19
Admitted Batch: 2019-20   Academic Year: 2022-23
Course Coordinator: CH. SUNEETHA
Number of Lectures per Week: 06

UNIT – I:
Introduction: Introduction to big data: Introduction to Big Data Platform, Challenges of Conventional
Systems, Intelligent data analysis, Nature of Data, Analytic Processes and Tools, Analysis vs Reporting.
Objective: Illustrate big data challenges in different domains including social media, transportation, finance
and medicine

S. No   Topic                                  Text Book: Page No
1       Introduction to big data               T1: 50-51
2       Introduction to Big Data Platform      T1: 135-149
3       Challenges of Conventional Systems     T1: 150-155
4       Intelligent data analysis              T1: 156-164
5       Nature of Data                         T1: 1-8
6       Analytic Processes and Tools           T1: 8-21
7       Analytic Tools                         T1: 58-62
8       Analysis vs Reporting                  T1: 66-78

Teaching aids: ICT & Black Board

Content beyond syllabus covered (if any):

UNIT – II:
Stream Processing: Mining data streams: Introduction to Streams Concepts, Stream Data Model and
Architecture, Stream Computing, Sampling Data in a Stream, Filtering Streams, Counting Distinct Elements
in a Stream, Estimating Moments, Counting Oneness in a Window, Decaying Window, Real time Analytics
Platform (RTAP) Applications, Case Studies - Real Time Sentiment Analysis - Stock Market Predictions.

Objective: Use various techniques for mining data streams


S. No   Topic                                              Text Book: Page No
9       Stream Processing                                  T1: 84-87
10      Mining data streams                                T1: 88-92
11      Introduction to Streams Concepts                   T1: 93-99
12      Stream Data Model and Architecture                 T1: 99-110
13      Stream Computing, Sampling Data in a Stream        T1: 11-17
14      Filtering Streams                                  T1: 47-49
15      Counting Distinct Elements in a Stream             T1: 41-47
16      Estimating Moments                                 T1: 56-58
17      Counting Oneness in a Window                       T1: 25-27
18      Decaying Window                                    T1: 36-37
19      Real time Analytics Platform (RTAP) Applications   T1: 456-457

Teaching aids: ICT & Black Board

Content beyond syllabus covered (if any):

UNIT – III:
Introduction to Hadoop: Hadoop: History of Hadoop, the Hadoop Distributed File System, Components of
Hadoop, Analysing the Data with Hadoop, Scaling Out, Hadoop Streaming, Design of HDFS, Java interfaces
to HDFS Basics, Developing a Map Reduce Application, How Map Reduce Works, Anatomy of a Map
Reduce Job run, Failures, Job Scheduling, Shuffle and Sort, Task execution, Map Reduce Types and
Formats, Map Reduce Features, Hadoop environment.
Objective: Design and develop applications using Hadoop
S. No   Topic                                  Text Book: Page No
20      Introduction to Hadoop                 T1: 327
21      History of Hadoop                      T1: 328-329
22      The Hadoop Distributed File System     T1: 330-335
23      Components of Hadoop                   T1: 336-343
24      Analysing the Data with Hadoop         T1: 344-346
25      Scaling Out, Hadoop Streaming          T1: 347
26      Design of HDFS                         T1: 348
27      Java interfaces to HDFS Basics         T1: 350
28      Developing a Map Reduce Application    T1: 358
29      How Map Reduce Works                   T1: 359
30      Anatomy of a Map Reduce Job run        T1: 360
31      Failures                               T1: 361
32      Job Scheduling                         T1: 362
33      Shuffle and Sort                       T1: 363-367
34      Task execution                         T1: 369-377
35      Map Reduce Types and Formats           T1: 379-382
36      Map Reduce Features                    T1: 384-389
37      Hadoop environment                     T1: 392-397

Teaching aids: ICT & Black Board

Content beyond syllabus covered (if any):

UNIT – IV:
Frameworks and Applications: Frameworks: Applications on Big Data Using Pig and Hive, Data processing operators
in Pig, Hive services, HiveQL, Querying Data in Hive, fundamentals of HBase and ZooKeeper.
Objective: Identify the characteristics of datasets and compare trivial data and big data for various
applications
S. No   Topic                                  Text Book: Page No
38      Frameworks                             T2: 358-359
39      Applications on Big Data Using Pig     T2: 362-379
40      Applications on Big Data Using Hive    T2: 380
41      Data processing operators in Pig       T2: 380-382
42      Hive services                          T2: 384-388
43      HiveQL                                 T2: 391-401
44      Querying Data in Hive                  T2: 422-488
45      Fundamentals of HBase                  T1: 588-601
46      ZooKeeper                              T1: 604-630

Teaching aids: ICT & Black Board
UNIT – V
Predictive Analytics and Visualizations: Predictive Analytics, Simple linear regression, Multiple linear
regression, Interpretation of regression coefficients, Visualizations, Visual data analysis techniques,
interaction techniques, Systems and application
Objective: Explore the various search methods and visualization techniques
S. No   Topic                                        Text Book: Page No
47      Predictive Analytics                         T2: 525-527
48      Visualizations                               T2: 528-529
49      Simple linear regression                     T2: 529-530
50      Multiple linear regression                   T2: 531-533
51      Interpretation of regression coefficients    T2: 534-543
52      Visualizations                               T2: 544-546
53      Visual data analysis techniques              T2: 547-549
54      Interaction techniques                       T2: 500-505
55      Visualization techniques                     T2: 602-650

Teaching aids: ICT & Black Board
Content beyond syllabus covered (if any):

* Session duration: 50 mins

TEXT BOOKS:
1. Tom White, “Hadoop: The Definitive Guide”, Fourth Edition, O’Reilly Media, 2015.
2. Chris Eaton, Dirk deRoos, Tom Deutsch, George Lapis, Paul Zikopoulos, “Understanding Big Data: Analytics for
Enterprise Class Hadoop and Streaming Data”, McGraw-Hill Publishing, 2012.
3. Anand Rajaraman and Jeffrey David Ullman, “Mining of Massive Datasets”, Cambridge University Press, 2012.

REFERENCES:
1. Bill Franks, “Taming the Big Data Tidal Wave: Finding Opportunities in Huge Data Streams with Advanced
Analytics”, John Wiley & Sons, 2012.
2. Paul Zikopoulos, Dirk deRoos, Krishnan Parasuraman, Thomas Deutsch, James Giles, David Corrigan, “Harness the
Power of Big Data: The IBM Big Data Platform”, Tata McGraw-Hill Publications, 2012.
3. Arshdeep Bahga and Vijay Madisetti, “Big Data Science & Analytics: A Hands-On Approach”, VPT, 2016.
4. Bart Baesens, “Analytics in a Big Data World: The Essential Guide to Data Science and its Applications (Wiley
Big Data Series)”, John Wiley & Sons, 2014.

Software Links:
1. Hadoop: http://hadoop.apache.org/
2. Hive: https://cwiki.apache.org/confluence/display/Hive/Home
3. Pig Latin: http://pig.apache.org/docs/r0.7.0/tutorial.html
4. Big Data Analytics - YouTube

COURSE COORDINATOR HEAD OF THE DEPARTMENT


DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
Academic Year: 2022-23   IV B.Tech II Semester   w.e.f. 05.12.2022

Section A   Class Coordinator: Mrs. K. Santoshi Rupa


9:10-10:00 10:00-10:50 11:10-12:00 12:00-12:50 1:40-2:30 2:30-3:20 3:30-4:20
MON ERS BDA MOB
TUE ERS BDA MOB
WED MOB MOB BDA CODING SESSION
THU BDA ERS MOB PROJECT
FRI BDA ERS BDA
SAT MOB ERS ERS Library Sports

Section B   Class Coordinator: Mrs. Ch. Sunitha


9:10-10:00 10:00-10:50 11:10-12:00 12:00-12:50 1:40-2:30 2:30-3:20 3:30-4:20
MON BDA ERS ERS
TUE ERS BDA MOB
WED BDA MOB MOB CODING SESSION
THU MOB BDA MOB PROJECT
FRI ERS BDA ERS
SAT ERS BDA MOB Library Sports

Section C   Class Coordinator: Mrs. B. Sailaja


9:10-10:00 10:00-10:50 11:10-12:00 12:00-12:50 1:40-2:30 2:30-3:20 3:30-4:20
MON MOB ERS BDA
TUE BDA MOB ERS
WED ERS BDA ERS CODING SESSION
THU MOB ERS BDA PROJECT
FRI MOB MOB BDA
SAT BDA ERS MOB Library Sports

Break Timings: 10:50 - 11:10 AM & 03:20- 03:30 PM Lunch: 12:50- 01:40PM
OTHERS: Remedial Classes/Revision Classes/Projects/Activities (Co-curricular & Extra-curricular)
Course Name Section A Section B Section C
Management and Organization Mrs.B.Lavanya Mrs.K.Santosh Kumar Mrs.B.Lavanya
Non Conventional Energy Res Mrs.T.Sugana Mrs.M.Satyavathi Mrs.A.Venkata lakshmi
Big Data Analytics Mrs.K.Santoshi Rupa Mrs. Ch.Suneetha Mrs. Ch.Suneetha
# Project- II
Coordinator Head of the Department Principal

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


COURSE TIME TABLE

Academic Year: 2022-23   IV B.Tech II Semester   w.e.f. 05.12.2022

SECTION -A 9:10-10:00 10:00-10:50 11:10-12:00


MON BDA
TUE BDA
WED MOB BDA
THU BDA
FRI BDA ERS BDA
SAT ERS
SECTION-B 9:10-10:00 10:00-10:50 11:10-12:00
MON BDA
TUE BDA
WED BDA
THU BDA
FRI BDA
SAT BDA
SECTION-C 9:10-10:00 10:00-10:50 11:10-12:00
MON BDA
TUE BDA
WED BDA
THU BDA
FRI BDA
SAT BDA

Break Timings: 10:50 - 11:10 AM & 03:20- 03:30 PM Lunch: 12:50- 01:40PM

Coordinator Head of the Department Principal


DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

Course Name: Big Data Analytics   Course Code: C411

Year/Sem/Sec: IV B.Tech II Sem - B   Regulation: R19
Admitted Batch: 2019   Academic Year: 2022-23
Faculty: Mrs. Ch. Suneetha   No. of Students: 63
Teaching Methodology: Case Study   Topic: Stock Market Prediction
No. of Students Present: 60   No. of Students Absent: 03

Introduction: A case study is an empirical inquiry that investigates a contemporary phenomenon within its
real-life context, especially when the boundaries between the phenomenon and the context are not clearly
evident.

Case study scenario: Describe an approach for analysing the stock market to understand its volatile nature
and predict its behaviour so that profits can be made by investing in it.

Steps Involved:
Step 1: Introduction to Stock Market Prediction
Step 2: Problem statement for Stock Market Prediction
Step 3: The Long Short-Term Memory (LSTM) model; program implementation and importing the libraries
Step 4: Visualising the Stock Market Prediction data; check for null values by printing the DataFrame shape
Step 5: Setting the target variable, selecting the features, and scaling
Step 6: Creating a training set and a test set for Stock Market Prediction
Step 7: Data processing for the LSTM
Step 8: Building the LSTM model for Stock Market Prediction
Step 9: Training the Stock Market Prediction model
Step 10: Conclusion
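
The steps above can be sketched in a few dozen lines of Python. The following is a minimal, illustrative version of Steps 3-9, assuming TensorFlow/Keras and scikit-learn are installed; the synthetic random-walk series and all variable names are placeholders standing in for real closing-price data.

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dense

    # Stand-in for real closing prices: a 500-point random walk.
    rng = np.random.default_rng(42)
    prices = 100 + np.cumsum(rng.normal(0, 1, 500))

    # Step 5: scale the target series to [0, 1].
    scaler = MinMaxScaler()
    scaled = scaler.fit_transform(prices.reshape(-1, 1))

    # Step 7: window the series -- each sample is 60 past values, the label is the next value.
    window = 60
    X = np.array([scaled[i - window:i, 0] for i in range(window, len(scaled))])
    y = scaled[window:, 0]
    X = X.reshape(-1, window, 1)  # (samples, timesteps, features)

    # Step 6: chronological train/test split (no shuffling for time series).
    split = int(0.8 * len(X))
    X_train, X_test, y_train, y_test = X[:split], X[split:], y[:split], y[split:]

    # Step 8: build a small LSTM model.
    model = Sequential([LSTM(50, input_shape=(window, 1)), Dense(1)])
    model.compile(optimizer="adam", loss="mse")

    # Step 9: train, predict, and undo the scaling.
    model.fit(X_train, y_train, epochs=5, batch_size=32, verbose=0)
    predicted = scaler.inverse_transform(model.predict(X_test))
    print("First five predicted prices:", predicted[:5].ravel())

In the classroom activity, the random walk would be replaced by a file of historical closing prices, and the null-value check of Step 4 would be performed on the loaded DataFrame before scaling.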
No. of groups: 15
Team size: 4 or 5
Time allotted: one week
Problem: Problem Statement
The case study is titled Stock Market Prediction. With the advent of technological marvels like global digitization,
the prediction of the stock market has entered a technologically advanced era. With the ceaseless increase in market
capitalization, stock trading has become a center of investment for many financial investors. Many analysts and
researchers have developed tools and techniques that predict stock price movements and help investors make proper
decisions. Advanced trading models enable researchers to predict the market using non-traditional textual data
from social platforms. The application of advanced machine learning approaches such as text data analytics and
ensemble methods has greatly increased prediction accuracy. Meanwhile, the analysis and prediction of stock
markets continue to be one of the most challenging research areas due to dynamic, erratic, and chaotic data. This study
explains the systematics of machine learning-based approaches for stock market prediction based on the deployment of
a generic framework. The study should help emerging researchers understand the basics and advancements of this
emerging area and carry stock market prediction further.
End-Users:

• Fundamental analyst: collects and maintains the required information and supports the developer.
• Developer: implements the code and defines the user interface.
• Results: the predicted results are provided and checked against the requirements.
Solution: Diagrams

Figure 1: Activity in classroom
Figure 2: Activity in classroom
Figure 3: Activity in classroom


ASSESSMENT TABLE
Group-1

Sl. No.   Regd. No.   Team Score (10M)
1 19NM1A05B1
2 19NM1A0591
3 19NM1A05A6
4 20NM5A0510 9
Group-2
5 19NM1A0576
6 19NM1A0570
7 19NM1A0573
8 19NM1A0587
9 20NM5A0511 10
Group-3
10 19NM1A0586
11 19NM1A05A3
12 19NM1A0567
13 20NM5A0507 8
14 19NM1A0564  
Group-4
15 19NM1A0574
16 19NM1A0580
17 19NM1A05A2
18 19NM1A0582
19 19NM1A05A0 9
Group-5
20 19NM1A0565
21 19NM1A05C1
22 19NM1A05B3
23 19NM1A0584 9
Group-6
24 19NM1A05B0 10
25 19NM1A0590
26 20NM5A0508
27 19NM1A0596
Group-7
28 19NM1A05A4
29 19NM1A05C0
30 19NM1A0599
31 19NM1A05B2 10
Group-8  
32 19NM1A0571
33 19NM1A05A1
34 19NM1A05B5
35 20NM5A0506 9
Group-9
36 19NM1A0593
37 19NM1A05A9
38 19NM1A0583
39 19NM1A05C3 9
Group-10
40 19NM1A05B8
41 19NM1A0579
42 19NM1A05B9
43 19NM1A05C4 9
Group-11
44 19NM1A0568
45 19NM1A0578
46 19NM1A0592
47 19NM1A0589 10
Group-12
48 19NM1A0572
49 19NM1A0597
50 19NM1A0585
51 19NM1A05B4 10
Group-13
52 19NM1A0595
53 19NM1A0594
54 19NM1A05C5
55 19NM1A05A7 8
Group-14
56 19NM1A0569
57 19NM1A05B6
58 20NM5A0509
59 19NM1A0581 9
Group-15
60 19NM1A05B7
61 19NM1A0588
62 19NM1A05A5
63 19NM1A05A8 10
Table 1: Assessment Summary sheet

Report: Documentation for the case study was submitted by the students, with the final solutions collected
from all groups. Groups 6 and 12 were selected as producing the best case study reports, and their work is
documented here.
Activity Outcome to PO Mappings

Activity Outcome                                                          Mapping to POs and PSOs
To learn the language for specifying, visualizing, constructing, and     PO1, PO2, PO3, PO4, PSO2
documenting the artifacts of software systems.
Improvement of student learning rate due to collaborative learning.      PO9, PO10, PO12

Post Implications:

• Improvement in student learning due to the collaborative teaching-learning methodology.
• Students from academically weaker sections will also benefit from this case study methodology.

Subject Faculty Module Coordinator Head of the Department


DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
UNIT WISE IMPORTANT QUESTIONS

Course Name: BIG DATA ANALYTICS   Course Code: C411

Year/Sem: IV B.Tech - II Sem   Regulation: R19
Admitted Batch: 2019-20   Academic Year: 2022-23
Course Coordinator: CH. SUNEETHA

UNIT-1

1. Explain the Big Data Platform.

2. Explain the V's in Big Data.

3. Write about the challenges of conventional systems.

4. Discuss clearly Analysis vs Reporting.

UNIT-2

1. Draw the architecture of data streaming and explain it.

2. Explain sampling in Big Data with an example.

3. Explain filtering streams.

4. What is RTAP? Explain it.

UNIT-3

1. What is a distributed file system?

2. Explain the Hadoop architecture.

3. What is Hadoop streaming?

4. Explain how MapReduce works.

5. Explain MapReduce types and formats.


UNIT-4

1. Explain predictive analytics in Big Data.

2. Describe the processing model of Pig.

3. Explain the concept of Hive.

4. How does HiveQL work?

5. Explain HBase and ZooKeeper.

UNIT-5

1. Explain the concept of regression.

2. Describe visualization in Big Data.

3. Explain the interpretation of regression coefficients.

4. Explain visual data analysis techniques.

COORDINATOR HEAD OF THE DEPARTMENT


DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
UNIT WISE MULTIPLE CHOICE QUESTIONS

Course Name: BIG DATA ANALYTICS   Course Code: C411

Year/Sem: IV B.Tech - II Sem   Regulation: R19
Admitted Batch: 2019-20   Academic Year: 2022-23
Course Coordinator: CH. SUNEETHA
1. Which of the following is a popular open-source platform used for real-time data
processing and analytics?

a) Apache Kafka b) Apache Hadoop c) Apache Spark d) Apache Storm

Answer: d) Apache Storm

2. Which of the following is not one of the four V’s of Big Data?

a) Velocity b) Volume c) Variety d) Value

Answer: d) Value

3. What is the process of transforming structured and unstructured data into a format that can
be easily analyzed?

a) Data Mining b) Data Warehousing c) Data Integration d) Data Processing

Answer: c) Data Integration

4. Which of the following is a tool used for processing and analyzing Big Data?

a) Hadoop b) MySQL c) PostgreSQL d) Oracle

Answer: a) Hadoop

5. What is the process of examining large and varied data sets to uncover hidden patterns,
unknown correlations, market trends, customer preferences, and other useful information?

a) Data Mining b) Data Warehousing c) Data Integration d) Data Processing


Answer: a) Data Mining

6. Which of the following is not a common challenge associated with Big Data?

a) Data Quality b) Data Integration c) Data Privacy d) Data Duplication

Answer: d) Data Duplication

7. Which of the following is a technique used to extract meaningful insights from data sets
that are too large or complex to be processed by traditional data processing tools?

a) Business Intelligence b) Machine Learning c) Artificial Intelligence d) Data Science

Answer: b) Machine Learning

8. What is the process of storing and managing data in a way that allows for efficient retrieval
and analysis?

a) Data Warehousing b) Data Mining c) Data Integration d) Data Processing

Answer: a) Data Warehousing

9. Which of the following is a common programming language used for Big Data processing?

a) C++ b) Java c) Python d) All of the above

Answer: d) All of the above

10. Which of the following is a popular NoSQL database used for Big Data processing?

a) MySQL b) PostgreSQL c) Oracle d) MongoDB

Answer: d) MongoDB

11. What is the process of combining data from multiple sources into a single, unified view?

a) Data Mining b) Data Warehousing c) Data Integration d) Data Processing

Answer: c) Data Integration

12. What is the term used for the ability of a system to handle increasing amounts of data and
traffic without compromising performance?

a) Scalability b) Reliability c) Availability d) Security


Answer: a) Scalability

13. What is the process of cleaning and transforming data before it is used for analysis?

a) Data Mining b) Data Warehousing c) Data Integration d) Data Preprocessing

Answer: d) Data Preprocessing

14. Which of the following is not a common type of data in Big Data analysis?

a) Structured Data b) Semi-Structured Data c) Unstructured Data d) Simple Data

Answer: d) Simple Data

15. Which of the following is a method for analyzing data in which the data is split into
smaller subsets and processed in parallel across multiple servers or nodes?

a) Batch Processing b) Stream Processing c) MapReduce d) Hive

Answer: c) MapReduce

16. What is the process of analyzing data in real-time as it is generated?

a) Batch Processing b) Stream Processing c) MapReduce d) Hive

Answer: b) Stream Processing

17. Which of the following is a popular programming language used for data analysis and
machine learning?

a) C++ b) Java c) Python d) All of the above

Answer: c) Python

18. Which of the following is not a common data storage technology used for Big Data
processing?

a) Hadoop Distributed File System (HDFS) b) Cassandra c) MySQL d) Amazon S3

Answer: c) MySQL

19. What is the process of automatically categorizing or grouping data based on its
characteristics or attributes?

a) Clustering b) Classification c) Regression d) Anomaly Detection


Answer: a) Clustering

20. Which of the following is not a common data visualization tool used for Big Data
analysis?

a) Tableau b) QlikView c) Microsoft Excel d) D3.js

Answer: c) Microsoft Excel

UNIT-2

1. Which of the following is a technique used for identifying patterns in data by training a
model on a dataset and using it to make predictions on new data?

a) Data Mining b) Machine Learning c) Natural Language Processing d) Text Analytics

Answer: b) Machine Learning

2. Which of the following is not a common type of machine learning algorithm?

a) Supervised Learning b) Unsupervised Learning c) Reinforcement Learning d) Decision Learning

Answer: d) Decision Learning

3. Which of the following is a type of machine learning algorithm in which the input data is
labeled and the model is trained to make predictions on new, unlabeled data?

a) Supervised Learning b) Unsupervised Learning c) Reinforcement Learning d) All of the above

Answer: a) Supervised Learning

4. Which of the following is a type of machine learning algorithm in which the input data is
not labeled and the model is trained to find patterns or structure in the data?

a) Supervised Learning b) Unsupervised Learning c) Reinforcement Learning d) All of the above

Answer: b) Unsupervised Learning


5. Which of the following is a type of machine learning algorithm in which the model learns
through trial and error by receiving feedback on its performance?

a) Supervised Learning b) Unsupervised Learning c) Reinforcement Learning d) All of the above

Answer: c) Reinforcement Learning

6. Which of the following is not a common machine learning model?

a) Decision Trees b) Random Forests c) Neural Networks d) All of the above are common
machine learning models

Answer: d) All of the above are common machine learning models

7. Which of the following is a measure of how well a machine learning model is able to make
predictions on new data?

a) Accuracy b) Precision c) Recall d) All of the above

Answer: d) All of the above

8. Which of the following is a technique for reducing the dimensionality of data by


identifying the most important features?

a) Principal Component Analysis (PCA) b) Singular Value Decomposition (SVD)

c) Independent Component Analysis (ICA) d) All of the above

Answer: a) Principal Component Analysis (PCA)

9. Which of the following is not a common use case for Big Data analytics?

a) Fraud Detection b) Customer Segmentation c) Social Media Analysis

d) Inventory Management

Answer: d) Inventory Management

10. Which of the following is a technique for predicting a continuous target variable?

a) Classification b) Regression c) Clustering d) Dimensionality Reduction

Answer: b) Regression

11. Which of the following is a technique for grouping similar data points together?

a) Classification b) Regression c) Clustering d) Dimensionality Reduction


Answer: c) Clustering

12. Which of the following is not a common data preprocessing technique?

a) Normalization b) One-Hot Encoding c) Dimensionality Reduction d) Regression

Answer: d) Regression

13. Which of the following is a measure of the relationship between two variables?

a) Correlation b) Covariance c) Standard Deviation d) Mean

Answer: a) Correlation

14. Which of the following is not a common type of correlation coefficient?

a) Pearson’s Correlation Coefficient b) Spearman’s Rank Correlation Coefficient

c) Kendall’s Tau Correlation Coefficient d) Mahalanobis Correlation Coefficient

Answer: d) Mahalanobis Correlation Coefficient

15. Which of the following is a measure of how much a dependent variable changes when an
independent variable changes?

a) Covariance b) Correlation c) Slope d) Intercept

Answer: c) Slope
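
As a quick illustration of Q15 (and of the Unit V topic "interpretation of regression coefficients"), the following sketch fits a simple linear regression with NumPy; the numbers are invented for the example.

    import numpy as np

    x = np.array([1, 2, 3, 4, 5], dtype=float)
    y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])  # roughly y = 2x

    slope, intercept = np.polyfit(x, y, 1)  # degree-1 least-squares fit
    print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
    # A slope of about 2 means y changes by about 2 units per unit change in x.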

16. Which of the following is not a common method for selecting the best features for a
machine learning model?

a) Filter Methods b) Wrapper Methods c) Embedded Methods d) Extrapolation Methods

Answer: d) Extrapolation Methods

17. Which of the following is a measure of how much a model’s predictions vary for different
input values?

a) Bias b) Variance c) Precision d) Recall

Answer: b) Variance

18. Which of the following is not a common machine learning algorithm for classification?

a) Logistic Regression b) Decision Trees c) K-Nearest Neighbors d) Linear Regression

Answer: d) Linear Regression


19. Which of the following is a technique for reducing the size of a dataset by removing
duplicate data points?

a) Clustering b) Sampling c) Deduplication d) Normalization

Answer: c) Deduplication

20. Which of the following is a technique for reducing the dimensionality of a dataset?

a) Clustering b) Sampling c) Deduplication d) PCA

Answer: d) PCA (Principal Component Analysis)
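
A small sketch of Q20's answer, assuming scikit-learn is available; the random 4-feature data set is purely illustrative.

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    data = rng.normal(size=(100, 4))  # 100 samples, 4 features

    pca = PCA(n_components=2)  # keep the 2 directions of largest variance
    reduced = pca.fit_transform(data)
    print(reduced.shape)                   # (100, 2)
    print(pca.explained_variance_ratio_)   # variance captured by each component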

UNIT-3

1. What is the full form of HDFS?

A. Hadoop Distributed File System B. Hadoop Field System C. Hadoop File Search D. Hadoop Field Search

Ans : A

2. HDFS works in a __________ fashion.

A. worker-master fashion B. master-slave fashion C. master-worker fashion D. slave-master fashion

Ans : B

3. Which of the following are the Goals of HDFS?

A. Fault detection and recovery B. Huge datasets C. Hardware at data D. All of the above

Ans : D

4. ________ Name Node is used when the Primary Name Node goes down.

A. Rack B. Data C. Secondary D. Both A and B

Ans : C

5. The minimum amount of data that HDFS can read or write is called a _____________.

A. Data node B. Name node C. Block D. None of the above

Ans : C

6. The default block size is ______.


A. 32MB B. 64MB C. 128MB D. 16MB

Ans : B

7. For every node (Commodity hardware/System) in a cluster, there will be a _________.

A. Data node B. Name node C. Block D. None of the above

Ans : A

8. Which of the following is not Features Of HDFS?

A. It is suitable for distributed storage and processing. B. Streaming access to file system
data. C. HDFS provides file permissions and authentication.

D. Hadoop does not provide a command interface to interact with HDFS.

Ans : D

9. HDFS is implemented in _____________ language.

A. Perl B. Python C. Java D. C

Ans : C

10. During start up, the ___________ loads the file system state from the fsimage and the
edits log file.

A. Datanode B. Namenode C. Block D. ActionNode

Ans : B

11. Which of the following is true about MapReduce?

A. It is a batch processing system. B. It is a real-time processing system.

C. It can only process structured data. D. It can only process data stored in a Hadoop
Distributed File System (HDFS).

Answer: A. It is a batch processing system.

12. Which of the following is true about MapReduce jobs?

A. They consist of only one map task and one reduce task. B. They can have multiple map
tasks and reduce tasks. C. They can only have one reduce task. D. They can only have one
map task.

Answer: B. They can have multiple map tasks and reduce tasks.
13. What is the purpose of the map function in MapReduce?

A. To convert input data into key-value pairs B. To sort the input data C. To combine the
input data D. To summarize the input data

Answer: A. To convert input data into key-value pairs.

14. What is the purpose of the reduce function in MapReduce?

A. To sort the input data B. To combine the input data C. To summarize the input data

D. To convert input data into key-value pairs

Answer: C. To summarize the input data

15. Which of the following is true about the shuffle phase in MapReduce?

A. It sorts the output of the map phase. B. It sorts the output of the reduce phase.

C. It combines the output of the map phase. D. It combines the output of the reduce phase.

Answer: A. It sorts the output of the map phase.

16. Which of the following is true about the combiner function in MapReduce?

A. It is the same as the reduce function. B. It is run after the reduce function.

C. It is run after the map function. D. It is run before the reduce function.

Answer: D. It is run before the reduce function.

17. Which of the following is true about the partitioner function in MapReduce?

A. It is used to sort the output of the map phase. B. It is used to group the output of the map
phase by key. C. It is used to divide the output of the map phase into partitions.

D. It is used to combine the output of the map phase.

Answer: C. It is used to divide the output of the map phase into partitions.

18. Which of the following is a disadvantage of using MapReduce?

A. It can only process small amounts of data. B. It requires specialized hardware.

C. It has a high latency. D. It is difficult to use.

Answer: C. It has a high latency.

19. Which of the following is an advantage of using MapReduce?


A. It requires specialized hardware. B. It can only process small amounts of data.

C. It can process data in parallel. D. It is difficult to use.

Answer: C. It can process data in parallel.

20. Which of the following is true about Hadoop?

A. It is a distributed data processing framework. B. It is a real-time processing system.

C. It can only process structured data. D. It is a database management system.

Answer: A. It is a distributed data processing framework.

UNIT-4

1. Which of the following Pig Latin statements is used to filter data based on a specified
condition?

a) GROUP BY b) SORT BY c) LIMIT d) FILTER

Answer: d) FILTER

2. What language is used in Apache Pig?

a) Python b) Java c) Perl d) Pig Latin

Answer: d) Pig Latin

3. Which of the following statements is true about Apache Pig?

a) It is an alternative to Hadoop b) It can only process structured data

c) It supports multiple programming languages d) It is not scalable

Answer: c) It supports multiple programming languages

4. What is the main advantage of using Apache Pig?

a) Faster data processing b) Easier programming c) Reduced data storage requirements

d) Better security

Answer: b) Easier programming


5. What is the function of the Pig Latin statement “GROUP”?

a) Groups data based on a specified key b) Sorts data in ascending order

c) Joins two datasets d) Performs a cross-product of two datasets

Answer: a) Groups data based on a specified key

6. What is the function of the Pig Latin statement “FILTER”?

a) Groups data based on a specified key b) Sorts data in ascending order c) Filters data based
on a specified condition d) Performs a cross-product of two datasets

Answer: c) Filters data based on a specified condition

7. What is the function of the Pig Latin statement “FOREACH”?

a) Groups data based on a specified key b) Sorts data in ascending order

c) Applies a transformation to each record d) Performs a cross-product of two datasets

Answer: c) Applies a transformation to each record .

8. What is the function of the Pig Latin statement “JOIN”?

a) Groups data based on a specified key b) Sorts data in ascending order

c) Joins two datasets based on a common key d) Performs a cross-product of two datasets

Answer: c) Joins two datasets based on a common key

9. What is the function of the Pig Latin statement “ORDER”?

a) Groups data based on a specified key b) Sorts data in ascending order

c) Filters data based on a specified condition d) Performs a cross-product of two datasets

Answer: b) Sorts data in ascending order

10. What is the function of the Pig Latin statement “LIMIT”?

a) Groups data based on a specified key b) Sorts data in ascending order

c) Filters data based on a specified condition d) Limits the number of records returned

Answer: d) Limits the number of records returned


11. Which of the following statements is true about Pig Latin scripts?

a) They can be executed only on a single node b) They must be written in Java

c) They can be run on a cluster of nodes d) They require a web interface to execute

Answer: c) They can be run on a cluster of nodes

12. What is the name of the component in Apache Pig that translates Pig Latin scripts into
MapReduce jobs?

a) Pig Compiler b) Pig Executor c) Pig Runner d) Pig Transformer

Answer: a) Pig Compiler

13. Which of the following statements is true about Pig Latin UDFs (User-Defined
Functions)?

a) They can only be written in Java b) They can be written in multiple programming
languages c) They are not allowed in Pig Latin scripts d) They are pre-built functions
provided by Pig

Answer: b) They can be written in multiple programming languages

14. What is the function of the Pig Latin statement “DESCRIBE”?

a) Groups data based on a specified key b) Sorts data in ascending order

c) Provides metadata about a dataset d) Performs a cross-product of two datasets

Answer: c) Provides metadata about a dataset

15. Which of the following statements is true about Apache Pig Latin schemas?

a) They cannot be defined by the user b) They must be defined using JSON

c) They are optional d) They must be defined for all datasets

Answer: c) They are optional

16. What is the function of the Pig Latin statement “EXPLAIN”?

a) Groups data based on a specified key b) Sorts data in ascending order

c) Provides a detailed explanation of the execution plan for a Pig Latin script
d) Performs a cross-product of two datasets

Answer: c) Provides a detailed explanation of the execution plan for a Pig Latin script

17. Which of the following statements is true about Pig Latin LOAD statements?

a) They are not required for reading data into Pig b) They are used to write data to a file

c) They must be written in Java d) They specify the location and format of the input data

Answer: d) They specify the location and format of the input data

18. What is the function of the Pig Latin statement “STORE”?

a) Groups data based on a specified key b) Sorts data in ascending order

c) Writes data to a file d) Performs a cross-product of two datasets

Answer: c) Writes data to a file

19. Which of the following Pig Latin statements is used to group data based on a specified
key?

a) GROUP BY b) SORT BY c) LIMIT d) FOREACH

Answer: a) GROUP BY

20. Which of the following Pig Latin statements is used to sort data in ascending order?

a) GROUP BY b) SORT BY c) LIMIT d) FOREACH

Answer: b) SORT BY

UNIT-5

1.Which of these are considered secondary data?

a) Data from a YouTube interview b) Data from the textbook c) All of the above
d) None of the above

Answer: All of the above

2. The most popular data visualization library in python is _____

a) matinfolib b) matplotlib c) pip d) matpiplib


Answer: matplotlib

3. BI can catalyze a business’s success in terms of ________

a) Ranks customers and locations based on probability b) Ranks customers and locations
based on profitability c) Distinguishes the products and services that drive revenues
d) All of the mentioned

Answer: All of the mentioned

4. Which of these are considered primary data?

a) Doing an environment perception survey b) Counting pedestrians by the street
c) Asking a friend to fill out a survey d) All of the above

Answer: d) All of the above

5. Which of the following is correct Features of DataFrame?

a) Can perform arithmetic operations on rows and columns b) Potentially columns are of
different types c) Labeled axes (rows and columns) d) All of the above

Answer: All of the above

6. Which Python Package is used for 2D graphics?

a) matplotlib.pip b) matplotlib.pyplot c) matplotlib.numpy d) matplotlib.plt

Answer: matplotlib.pyplot

7. How many types of BI users are there?

a) Two: the head of the company, the business users
b) Four: the professional data analyst, the IT users, the head of the company, the business users
c) One: the business users
d) Three: the IT users, the head of the company, the business users

Answer: b) Four: the professional data analyst, the IT users, the head of the company, the business users

8. ________________ in business intelligence allows huge data and reports to be read in a
single graphical interface.

a) Reports b) Warehouse c) OLAP d) Dashboard


Answer: Dashboard

9. Often, Where do the BI applications gather data from?

a) Data mart b) Data warehouse c) Both Data Warehouse and Data mart d) Database

Answer: Both Data Warehouse and Datamart

10. In regards to separated value files such as .csv and .tsv, what is the delimiter?

a) Any character such as the comma (,) or tab (\t) that is used to separate the row data
b) Anywhere the comma (,) character is used in the file
c) Any character such as the comma (,) or tab (\t) that is used to separate the column data
d) Delimiters are not used in separated value files

Answer: c) Any character such as the comma (,) or tab (\t) that is used to separate the column
data

11. The process of studying data is called _________

a) Data Collection b) Data Analysis c) Data Visualization d) All of the above

Answer: Data Analysis

12. ________ is drawn with respect to the histogram created.

a) Frequency polygon b) Box plot c) Bar plot d) None of the above

Answer: Frequency polygon

13. Which of the following methods should be employed in the code to display a plot?

a) show() b) display() c) execute() d) plot()

Answer: show()

14. The ________________ function is used to create a horizontal bar chart.

a) barh() b) bar() c) barchart() d) None of the above

Answer: barh()

15. A ___________ plot is also described as a five-number summary plot.

a) Frequency polygon b) Box plot c) Histogram d) Scatter plot


Answer: box plot
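
A minimal sketch tying together Q13-Q15, assuming matplotlib is installed; the labels and values are invented for the example.

    import matplotlib.pyplot as plt

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

    # Q14: horizontal bar chart via barh().
    ax1.barh(["Pig", "Hive", "HDFS"], [30, 45, 25])
    ax1.set_title("Horizontal bar chart")

    # Q15: box plot -- the five-number summary plot.
    ax2.boxplot([2, 4, 4, 5, 7, 9, 11])
    ax2.set_title("Box plot")

    plt.show()  # Q13: show() renders the figure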

16. Which of the following statements is true about Business Intelligence?

a) BI has a direct impact on an organization's strategic, tactical and operational business
decisions b) BI converts raw data into meaningful information
c) BI tools perform data analysis and create reports, summaries, dashboards, maps, graphs,
and charts d) All of the above

Answer: d) All of the above

17. What does a typical BI environment comprise of?

a) Data mart b) Data warehouse c) OLAP tools d) All of these

Answer: All of these

18. KPI stands for?

a) Key Performance Identifier b) Key Performance Indicators c) Key Processes Identifier
d) Key Processes Indicator

Answer: b) Key Performance Indicators

19. Which of the following creates an object which maps data to a dictionary?

a) tuplereader() b) reader() c) DictReader() d) listreader()

Answer: c) DictReader()
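
Q19 in action, as a small sketch: csv.DictReader maps each row of a CSV file to a dictionary keyed by the header row (the in-memory CSV here is illustrative).

    import csv
    import io

    data = io.StringIO("name,score\nasha,9\nbindu,10\n")
    for row in csv.DictReader(data):
        print(row["name"], row["score"])
    # asha 9
    # bindu 10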

20. Which of the following are direct benefits of Business Intelligence?

a)Delivers data mining functionality b)Decision making c)Artificial intelligence

d)All of the above

Answer: Decision making

COORDINATOR HEAD OF THE DEPARTMENT


DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
TUTORIAL TOPIC

Course Name: BIG DATA ANALYTICS   Course Code: C411

Year/Sem: IV B.Tech - II Sem   Regulation: R19
Admitted Batch: 2019-20   Academic Year: 2022-23
Course Coordinator: CH. SUNEETHA

What is MapReduce and How Does it Work?


Traditional enterprise systems normally have a centralized server to store and process data. This traditional
model is not suitable for processing huge volumes of data, which cannot be accommodated by standard
database servers. Moreover, the centralized system creates too much of a bottleneck while processing
multiple files simultaneously.

MapReduce and HDFS are the two major components of Hadoop that make it so powerful and efficient to
use. MapReduce is a programming model for processing large data sets in parallel in a distributed manner:
the data is first split, processed, and then combined to produce the final result. MapReduce libraries have
been written in many programming languages, with various optimizations. In Hadoop, MapReduce maps
each job and then reduces it to equivalent tasks, lowering the overhead on the cluster network and the
processing power required. A MapReduce task is mainly divided into two phases: the Map phase and the
Reduce phase.
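
The two phases can be imitated in a few lines of plain Python. This toy word count (a sketch, not Hadoop itself) shows the Map phase emitting key-value pairs, a shuffle-and-sort grouping them by key, and the Reduce phase summing each group.

    from itertools import groupby
    from operator import itemgetter

    lines = ["big data big ideas", "big data tools"]

    # Map phase: emit a (word, 1) pair for every word.
    mapped = [(word, 1) for line in lines for word in line.split()]

    # Shuffle and sort: order the pairs so equal keys sit together.
    mapped.sort(key=itemgetter(0))

    # Reduce phase: sum the counts for each distinct word.
    for word, pairs in groupby(mapped, key=itemgetter(0)):
        print(word, sum(count for _, count in pairs))
    # big 3 / data 2 / ideas 1 / tools 1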

MapReduce Architecture:

Components of MapReduce Architecture:

1. Client: The MapReduce client is the one who brings the job to MapReduce for processing. There
can be multiple clients that continuously send jobs for processing to the Hadoop MapReduce Master.
2. Job: The MapReduce job is the actual work the client wants done, composed of many smaller tasks
that the client wants to process or execute.
3. Hadoop MapReduce Master: It divides the job into subsequent job-parts.
4. Job-Parts: The tasks or sub-jobs obtained after dividing the main job. The results of all the job-parts
are combined to produce the final output.
5. Input Data: The data set that is fed to MapReduce for processing.
6. Output Data: The final result obtained after processing.
In MapReduce we have a client. The client submits a job of a particular size to the Hadoop MapReduce
Master. The MapReduce Master divides this job into further equivalent job-parts, which are then made
available for the Map and Reduce tasks. The Map and Reduce tasks contain the program required by the
use-case that the particular company is solving; the developer writes the logic to fulfill the industry's
requirement. The input data is fed to the Map task, and the Map generates intermediate key-value pairs
as its output. The output of the Map, i.e. these key-value pairs, is then fed to the Reducer, and the final
output is stored on HDFS. There can be any number of Map and Reduce tasks made available for
processing the data, as per the requirement. The Map and Reduce logic is written in an optimized way
so that time and space complexity are kept to a minimum.
How MapReduce Works
The MapReduce algorithm contains two important tasks, namely Map and Reduce.
• The Map task takes a set of data and converts it into another set of data, where individual elements are
broken down into tuples (key-value pairs).
• The Reduce task takes the output from the Map as an input and combines those data tuples (key-value
pairs) into a smaller set of tuples.
The Reduce task is always performed after the Map task.
Let us now take a close look at each of the phases and try to understand their significance.

• Input Phase − Here we have a Record Reader that translates each record in an input file and sends the
parsed data to the mapper in the form of key-value pairs.
• Map − Map is a user-defined function which takes a series of key-value pairs and processes each one
of them to generate zero or more key-value pairs.
• Intermediate Keys − The key-value pairs generated by the mapper are known as intermediate keys.
• Combiner − A combiner is a type of local Reducer that groups similar data from the Map phase into
identifiable sets. It takes the intermediate keys from the mapper as input and applies user-defined
code to aggregate the values in the small scope of one mapper. It is not a part of the main MapReduce
algorithm; it is optional.
• Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. It downloads the grouped
key-value pairs onto the local machine where the Reducer is running. The individual key-value pairs
are sorted by key into a larger data list. The data list groups the equivalent keys together so that their
values can be iterated easily in the Reducer task.
• Reducer − The Reducer takes the grouped key-value paired data as input and runs a Reducer function
on each one of them. Here the data can be aggregated, filtered, and combined in a number of ways,
supporting a wide range of processing. Once the execution is over, it gives zero or more key-value
pairs to the final step.
• Output Phase − In the output phase, we have an output formatter that translates the final key-value
pairs from the Reducer function and writes them onto a file using a record writer.
Let us try to understand the two tasks, Map and Reduce, with the help of a small example.

MapReduce Example
Let us take a real-world example to comprehend the power of MapReduce. Twitter receives around 500
million tweets per day, which is nearly 6,000 tweets per second. The following steps show how Twitter
manages its tweets with the help of MapReduce.
The MapReduce algorithm performs the following actions −

• Tokenize − Tokenizes the tweets into maps of tokens and writes them as key-value pairs.
• Filter − Filters unwanted words from the maps of tokens and writes the filtered maps as key-value
pairs.
• Count − Generates a token counter per word.
• Aggregate Counters − Prepares an aggregate of similar counter values into small manageable units.
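
Since Unit III also covers Hadoop Streaming, here is a hedged sketch of the same word count written as two standalone Python scripts; the file names mapper.py and reducer.py are placeholders chosen for this example.

    # mapper.py -- reads raw lines from stdin and emits "word<TAB>1" pairs.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    # reducer.py -- input arrives sorted by key (Hadoop's shuffle, or `sort`
    # locally), so the counts for each word can be summed in a single pass.
    import sys

    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

The pipeline can be tested locally with `cat input.txt | python3 mapper.py | sort | python3 reducer.py`, which mimics the Map, Shuffle and Sort, and Reduce steps described above; on a cluster the two scripts would be passed to the hadoop-streaming jar through its -mapper and -reducer options.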
