Chapter 1
Introduction to Big Data and Hadoop
Big data is data that exceeds the storage capacity or processing power of conventional systems.
It is:
1. High-volume, high-velocity and high-variety data
2. that demands cost-effective, innovative ways of data processing
3. for enhanced insight and decision making.
Types of Data:
1. Structured data has a pre-defined schema or format. E.g. RDBMS tables.
2. Semi-structured data may have a pre-defined schema, but it is often ignored. E.g. XML, JSON.
3. Unstructured data has no pre-defined format. E.g. text files, images.
Problems:
1. Storing and managing massive amounts of data.
2. Analysing this data to predict customer behaviour, trends and outcomes in a cost-effective manner.
Drawbacks:
1. RDBMS => supports gigabytes of data and only structured data, and its processing speed is low.
2. DFS => programming is complex for massive data sets; getting the data from the storage device to the processor is a bottleneck (assume you want to transfer 100 GB of data to the processor at a transfer rate of 100 MB/s: it takes approximately 17 minutes); partial failures of the system are difficult to handle; and scaling up or scaling out is not easy, since adding or removing a node in an existing distributed system is complex.
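As a rough check of the transfer time quoted above: 100 GB ≈ 100 × 1,024 MB = 102,400 MB, and 102,400 MB ÷ 100 MB/s = 1,024 seconds ≈ 17 minutes.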
Limitations:
1. The data can be accessed only in sequential order.
2. Data can be written once, but existing records cannot be updated or deleted, and new records cannot be inserted.
Hadoop was developed by Doug Cutting and Mike Cafarella, based on the Google File System. Hadoop was named after a toy elephant belonging to Doug's son. It was released as an Apache open-source project. It is a framework that uses a cluster of commodity hardware with a streaming access pattern to store and analyse large data sets. It is built for reliable, scalable and distributed computing: it allows the distributed processing of large data sets across clusters of computers using simple programming models, and it is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the Hadoop software library itself is designed to detect and handle failures at the application layer.
Streaming access pattern – write once, read any number of times, but do not try to change the contents of the file.
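A minimal sketch of this write-once/read-many pattern using Hadoop's Java FileSystem API (the HDFS address and file path below are illustrative assumptions):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnceReadMany {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed HDFS address; on a real cluster this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/notes.txt");  // illustrative path

        // Write once: create the file and stream bytes into it.
        try (FSDataOutputStream out = fs.create(file)) {
            out.write("hadoop stores files as write-once blocks\n"
                    .getBytes(StandardCharsets.UTF_8));
        }

        // Read any number of times: open the file and stream its contents back.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
        // There is no call to edit bytes in the middle of an existing HDFS file;
        // files are written once and then read.
    }
}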
History:
2002 – Doug Cutting and Mike Cafarella started working on Nutch.
2003 – Google published papers describing GFS and MapReduce.
2004 – Doug Cutting added DFS and MapReduce support to Nutch.
Jan 2006 – Doug joined Yahoo.
Feb 2006 – The Hadoop project was started separately to support HDFS and MapReduce, and Yahoo adopted the Hadoop project.
Apr 2006 – Sort benchmark (1 GB/node) run on 188 nodes in 47.9 hours.
Jan 2008 – Hadoop was made its own top-level project at Apache.
Apr 2008 – Won the 1 TB sort benchmark in 3.5 minutes on 900 nodes.
Nov 2008 – Google reported that its MapReduce implementation sorted one terabyte in 68 seconds.
Apr 2009 – Yahoo! used Hadoop to sort one terabyte in 62 seconds.
2014 – A team at Databricks used a 207-node Spark cluster to sort 100 terabytes of data in 1,406 seconds.
Features of Hadoop:
1. Distributes the data across nodes.
2. Brings the computation to the data, which avoids network overhead.
3. Replicates data for high availability and reliability.
4. Scalable.
5. Fault tolerant.
Installation Modes:
1. Standalone Mode: Hadoop is configured to run in a non-distributed way, as a single Java process, and no Hadoop daemons run in the background. HDFS is not used; only the local file system is used for storing files. It is best when you want to test a program for bugs with a small input. It is also known as Local Job Runner mode.
2. Pseudo-Distributed Mode: Runs on a single machine, but all the daemons run as separate processes, so a multi-node cluster can be emulated on one machine (see the configuration sketch after this list). The daemons run in different JVM instances, but on a single machine. HDFS is used instead of the local file system.
3. Fully-Distributed Mode: The code runs on an actual Hadoop cluster and can be run against a large Hadoop cluster.
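A hedged sketch of the setting that distinguishes standalone from pseudo-distributed mode, expressed with Hadoop's Java Configuration API. In practice these values normally live in core-site.xml and hdfs-site.xml; the localhost NameNode address and the replication factor of 1 are the usual single-machine assumptions, not fixed requirements.

import org.apache.hadoop.conf.Configuration;

public class ModeConfig {
    public static Configuration standalone() {
        Configuration conf = new Configuration();
        // Standalone / Local Job Runner: the default file system is the local disk.
        conf.set("fs.defaultFS", "file:///");
        return conf;
    }

    public static Configuration pseudoDistributed() {
        Configuration conf = new Configuration();
        // Pseudo-distributed: point the default file system at a local HDFS NameNode.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        // Single machine, so one copy of each block is enough.
        conf.set("dfs.replication", "1");
        return conf;
    }
}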
Hadoop Components => Hadoop Core Components and Hadoop Ecosystem Components
a. HDFS is Hadoop's storage component. It stores data as blocks distributed across the DataNodes and replicated for reliability.
b. MapReduce is Hadoop's data processing component. It distributes tasks across multiple nodes, and each node processes the data stored locally on that node. It has two phases, map and reduce; another important phase, shuffle and sort, runs after the map phase and before the reduce phase (see the word-count sketch below).
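To illustrate the map and reduce phases described above, here is a minimal word-count sketch in Hadoop's Java MapReduce API. The class names are illustrative; input and output paths are taken from the command line.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: runs locally on each node and emits (word, 1) for every word in its split.
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: after shuffle and sort, all counts for a given word arrive together.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}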
A Hadoop cluster is a group of machines working together to store and process data. It follows a master-slave architecture; usually it contains one master and many slave nodes.
The master contains:
a. NameNode – stores metadata about the data held on the DataNodes, such as the number of blocks and which DataNode stores each block.
b. JobTracker – the master process that creates and runs jobs. It runs on the NameNode machine and allocates tasks to the TaskTrackers, which run on the DataNodes.
The slaves contain:
a. DataNode – stores the actual data.
b. TaskTracker – runs the tasks and reports the status of each task to the JobTracker.
Hadoop Daemons:
A daemon is a process that runs in the background. Hadoop has five daemons:
1. NameNode
2. Secondary NameNode
3. DataNode
4. JobTracker
5. TaskTracker
Sqoop (SQL-to-Hadoop) is implemented in Java and is used to transfer data between HDFS and an RDBMS.
Flume is a distributed, reliable and available service for efficiently collecting, aggregating and moving large amounts of log data from servers into HDFS. It has a simple and flexible architecture based on streaming data flows. Flume channels data between sources and sinks, and its data harvesting can be either scheduled or event-driven. Possible sources for Flume include Avro (a remote procedure call and data serialization framework), files and system logs. Possible sinks (which remove an event from the channel and put it into an external repository, or forward it to the Flume source of the next Flume agent in the flow) include HDFS and HBase.
[Diagram: a Flume agent takes events from a source (e.g. a web server), buffers them in a channel, and a sink delivers them to HDFS.]
Hive is an SQL-based tool used to query and manage large datasets residing in distributed storage. It provides a mechanism to query the data using an SQL-like language called HiveQL. It was developed by Facebook and is 100% open source. It converts HiveQL queries into MapReduce jobs and submits these jobs to the Hadoop cluster.
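A small sketch of running a HiveQL query from Java over JDBC, to show Hive being used like a SQL database while it turns the query into MapReduce jobs behind the scenes. The HiveServer2 address, user name and the page_views table are illustrative assumptions, and the Hive JDBC driver must be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Assumed HiveServer2 address and database; adjust for the actual cluster.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection con = DriverManager.getConnection(url, "hive", "");
             Statement stmt = con.createStatement()) {
            // A HiveQL query over a hypothetical page_views table.
            ResultSet rs = stmt.executeQuery(
                "SELECT page, COUNT(*) AS hits FROM page_views GROUP BY page");
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}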
Impala is a high-performance SQL engine for vast amounts of data. Its query language is similar to HiveQL. It is 10-50 times faster than Hive, Pig or MapReduce. It was developed by Cloudera and is 100% open source. The data is stored in HDFS, and Impala does not use MapReduce.
Pig is a platform for data analysis and processing on Hadoop. It offers an alternative to writing MapReduce programs directly. It was developed by Yahoo and is 100% open source. It provides a high-level scripting language called Pig Latin and is designed to help with ETL tasks.
[Diagram: Hive, Pig and Impala sit above MapReduce and HDFS in the Hadoop stack.]
Oozie is a workflow engine. It runs on a server, typically outside the cluster. Oozie workflows are composed of Pig jobs, Hive jobs, Sqoop jobs and MapReduce jobs; they can also include non-MapReduce actions such as sending emails. It is used for scheduling jobs.