
Big Data Lecture Presentation


Do you remember?

What is the meaning of Big Data?


Big data refers to data collections that are too massive or complicated
for typical data-processing application software to handle.
Problems it raises:
 Disk and memory space
 Processing speed
 Hardware faults
 Network capacity
 Optimizing resource use
Distributed Computing
 A distributed computer system consists of several
interconnected nodes.
 Nodes can be physical or virtual machines.
 This group of nodes provides services and applications
to the client as if it were a single machine.
 This is also called a cluster.
Main benefits of distributed computing

 Performance: supports intensive workloads by spreading tasks across nodes
 Scalability: new nodes can be added to increase capacity
 Fault tolerance: resilience in case of hardware failures
Distributed Computing Architecture
Hadoop for distributed data processing
 Hadoop is a framework for running jobs on clusters of
computers that provides a good abstraction of the underlying
hardware and software.
Hadoop Features
 Hadoop is an open-source project of the Apache Software
Foundation.
 The project was created to facilitate computations involving
massive amounts of data.
 Its core components are implemented in Java; the project was initially
released in 2006.
 The last stable version is 3.3.1, released in June 2021.
 Hadoop was originally inspired by Google's MapReduce and the proprietary
GFS (Google File System).
Hadoop Features
 Hadoop’s features addressing the challenges of Big Data:
 Scalability
 Fault tolerance
 High availability
 Distributed cache/data locality
 Cost-effectiveness as it does not need high-end hardware
 Provides a good abstraction of the underlying hardware
 Easy to learn
 Data can be queried through SQL-like endpoints (Hive, Cassandra)
Important Features in Hadoop
 fault tolerance: the ability to withstand hardware or network
failures (also: resilience)
 high availability: this refers to the system minimizing
downtimes by eliminating single points of failure
 data locality: tasks are run on the node where data are
located, in order to reduce the cost of moving data around
Hadoop Ecosystem
 The core of Hadoop consists of:
 Hadoop common, the core libraries
 HDFS, the Hadoop Distributed File System
 MapReduce (see the word-count sketch below)
 YARN (Yet Another Resource Negotiator), the resource manager
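
To make this concrete, here is a minimal sketch of a MapReduce job (the
classic word count) written against Hadoop's Java MapReduce API. The class
name and the input/output paths passed on the command line are illustrative,
not part of the lecture.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in its input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, such a job would be submitted with something like:
hadoop jar wordcount.jar WordCount <input dir> <output dir>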
Hadoop Ecosystem
1. Data Integration
Sqoop vs Flume
Hadoop ETL (extract, transform, and load) tools:

Apache Sqoop: a tool used to transfer large amounts of data between
Apache Hadoop and structured data stores such as relational databases,
data warehouses, and NoSQL data stores.

Apache Flume: a distributed, reliable, and available service for
collecting, aggregating, and moving large amounts of log data. It has a
simple and flexible architecture for streaming data flows (unstructured
data). Flume supports a wide range of data sources, including web servers,
messaging systems, log files, and social media feeds.
2. Hadoop Server Roles
HDFS (Hadoop Distributed File System)
 A distributed file system that provides high-throughput access to
application data.
 HDFS uses a master/slave architecture in which one device (the master),
termed the NameNode, controls one or more other devices (the slaves),
termed DataNodes.
 It breaks data/files into small blocks (128 MB each), stores them on
DataNodes, and replicates each block on other nodes to accomplish fault
tolerance.
 The NameNode keeps track of the blocks written to the DataNodes.
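
To make the client/NameNode/DataNode interaction concrete, here is a small
sketch using HDFS's Java FileSystem API. The NameNode URI
(hdfs://namenode:9000) and the file path are assumptions for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical NameNode address; adjust to your cluster.
    conf.set("fs.defaultFS", "hdfs://namenode:9000");
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/data/example.txt");

    // Writing a file: the client asks the NameNode where to place the
    // blocks, then streams the data to the chosen DataNodes.
    try (FSDataOutputStream out = fs.create(file)) {
      out.writeUTF("hello HDFS");
    }

    // The NameNode's metadata records block size and replication factor.
    FileStatus status = fs.getFileStatus(file);
    System.out.println("block size:  " + status.getBlockSize());   // 128 MB by default
    System.out.println("replication: " + status.getReplication()); // 3 by default

    fs.close();
  }
}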
HDFS (Hadoop Distributed File System)

A typical Hadoop cluster installation consists of:
- NameNode
- Secondary NameNode
- Multiple DataNodes
HDFS (Hadoop Distributed File System)
- NameNode
The NameNode is the main point of access of a Hadoop cluster. It is
responsible for the bookkeeping of the data partitioned across the
DataNodes, manages the whole filesystem metadata, and performs load
balancing.

- Secondary (standby) NameNode
Keeps track of changes in the NameNode by performing regular snapshots,
thus allowing quick startup, to guarantee high availability (the NameNode
is a single point of failure).

- DataNode
This is where the data are saved and the computations take place (data
nodes should actually be called "data and worker nodes").
HDFS - Data Redundancy (for data protection)
 Data redundancy: a method to recover lost data using redundant data.
 In order to prevent data loss and/or task termination due to hardware
failures, HDFS uses either:
 Replication (creating multiple copies, usually 3, of the data): the
simplest method for coding data, by making n copies of it.
 n-fold replication guarantees the availability of data for at most
n - 1 failures; with the default n = 3, the storage overhead is 200%
(equivalent to a storage efficiency of 33%).
 Erasure coding: data are fragmented into smaller components, and
supplementary elements referred to as parity are computed and distributed
across multiple storage nodes. This redundancy allows data fragments that
have been lost or corrupted to be reconstructed.
Replication vs. Erasure Coding
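A quick numeric comparison (assuming the Reed-Solomon (6,3) scheme, one of
the erasure-coding policies HDFS supports; check your cluster's actual
policy):
 3-fold replication: each block is stored 3 times, giving 200% storage
overhead and 33% storage efficiency, and tolerating the loss of 2 copies.
 RS(6,3) erasure coding: 6 data fragments plus 3 parity fragments, giving
50% storage overhead and 67% storage efficiency, and tolerating the loss
of any 3 fragments.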
HDFS architecture:
internal data representation
 HDFS supports working with very large files.
 Internally, data are split into blocks. One reason for
splitting data into blocks is that all block objects then
have the same size.
 The block size in HDFS can be configured at installation
time; the default is 128 MB.
 Note: Hadoop sees data as a bunch of records, and it
processes multiple files the same way it processes a
single file.
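
As a sketch of how these defaults can be overridden on the client side:
dfs.blocksize and dfs.replication are the standard HDFS configuration
keys, while the 256 MB value and the NameNode URI are purely illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeConfig {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:9000"); // hypothetical cluster
    // Files created through this client will use 256 MB blocks and
    // 3 replicas instead of the cluster-wide defaults.
    conf.setLong("dfs.blocksize", 256L * 1024 * 1024);
    conf.setInt("dfs.replication", 3);
    try (FileSystem fs = FileSystem.get(conf)) {
      System.out.println("block size seen by this client: "
          + fs.getDefaultBlockSize(new Path("/")));
    }
  }
}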
Assignment:
 Suppose a file of 100 TB is required to be stored in HDFS,
and this file is divided into blocks of 10 TB each.
 Draw how these data can be stored, and determine the
required number of DataNodes, if each one can store 30 TB.
 Note: you have to ensure replication. Also, you have 2 racks
for the DataNodes.
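
One way to check the arithmetic (a sketch, assuming the usual 3-fold
replication): 100 TB / 10 TB per block = 10 blocks; with 3 copies of each
block, 30 block replicas = 300 TB must be stored in total; at 30 TB per
DataNode, that requires at least 300 / 30 = 10 DataNodes, with the replicas
of each block spread over both racks so that the failure of a whole rack
never destroys all copies of a block.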
