Big data refers to data collections that are too massive or complicated for typical data-processing application software to handle.

Problems
- Disk and memory space
- Processing speed
- Hardware faults
- Network capacity and optimizing resource use

Distributed Computing
A distributed computer system consists of several interconnected nodes. Nodes can be physical or virtual machines. This group of nodes provides services and applications to the client as if it were a single machine; such a group is also called a cluster.

Main benefits of distributed computing:
- Performance: supports intensive workloads by spreading tasks across nodes
- Scalability: new nodes can be added to increase capacity
- Fault tolerance: resilience in case of hardware failures
Distributed Computing Architecture

Hadoop for distributed data processing
Hadoop is a framework for running jobs on clusters of computers that provides a good abstraction of the underlying hardware and software.

Hadoop Features
Hadoop is an open-source project of the Apache Software Foundation. The project was created to facilitate computations involving massive amounts of data.
- Its core components are implemented in Java; it was initially released in 2006.
- The last stable version is 3.3.1, from June 2021.
- It was originally inspired by Google's MapReduce and the proprietary GFS (Google File System).

Hadoop's features addressing the challenges of Big Data:
- Scalability
- Fault tolerance
- High availability
- Distributed cache / data locality
- Cost-effectiveness, as it does not need high-end hardware
- Provides a good abstraction of the underlying hardware
- Easy to learn
- Data can be queried through SQL-like endpoints (Hive, Cassandra); see the Hive JDBC sketch after the Sqoop/Flume comparison below

Important Features in Hadoop
- Fault tolerance: the ability to withstand hardware or network failures (also: resilience)
- High availability: the system minimizes downtime by eliminating single points of failure
- Data locality: tasks are run on the node where the data are located, in order to reduce the cost of moving data around

Hadoop Ecosystem
The core of Hadoop consists of:
- Hadoop Common, the core libraries
- HDFS, the Hadoop Distributed File System
- MapReduce
- YARN (Yet Another Resource Negotiator), the resource manager

Hadoop Ecosystem - 1. Data Integration: Sqoop vs. Flume
Hadoop ETL (extract, transform, and load) tools:
- Apache Sqoop: a tool used to transfer large amounts of data between Apache Hadoop and structured data stores such as relational databases, data warehouses, and NoSQL data stores.
- Apache Flume: a distributed, reliable, and available service for collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture for streaming data flows (unstructured data). Flume supports a wide range of data sources, including web servers, messaging systems, log files, and social media feeds.
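The feature list above mentions querying Hadoop data through SQL-like endpoints such as Hive. As a minimal sketch only: the snippet below uses the standard JDBC API with the HiveServer2 driver; the host name, port, database, user, and the table web_logs are placeholders chosen for illustration, not values from this course.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Placeholder connection string: adjust host, port, and database.
        String url = "jdbc:hive2://hive-server:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "user", "");
             Statement stmt = conn.createStatement();
             // 'web_logs' is a hypothetical table used only for illustration.
             ResultSet rs = stmt.executeQuery(
                     "SELECT status, COUNT(*) AS hits FROM web_logs GROUP BY status")) {
            while (rs.next()) {
                System.out.println(rs.getString("status") + " -> " + rs.getLong("hits"));
            }
        }
    }
}
```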
2. Hadoop Server Roles

HDFS (Hadoop Distributed File System)
A distributed file system that provides high-throughput access to application data. HDFS uses a master/slave architecture in which one device (the master), termed the NameNode, controls one or more other devices (the slaves), termed DataNodes. It breaks data/files into small blocks (128 MB each) and stores them on DataNodes, and each block is replicated on other nodes to achieve fault tolerance. The NameNode keeps track of the blocks written to the DataNodes.
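To make the master/slave picture concrete, here is a minimal sketch assuming the Hadoop Java client API (org.apache.hadoop.fs.FileSystem) and a placeholder NameNode address; the replication factor and block size are set explicitly only to mirror the values described above (in practice they come from the cluster configuration).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address, normally taken from the cluster's config files.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        // Shown for illustration: 3 replicas, 128 MB blocks (the defaults discussed above).
        conf.set("dfs.replication", "3");
        conf.set("dfs.blocksize", "134217728");

        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/data/example.txt"), true)) {
            // The client writes a stream of bytes; HDFS splits it into blocks,
            // places them on DataNodes, and the NameNode records their locations.
            out.writeBytes("hello hdfs\n");
        }
    }
}
```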
HDFS (Hadoop Distributed File System)
A typical Hadoop cluster installation consists of:
- NameNode
- Secondary NameNode
- Multiple DataNodes

HDFS (Hadoop Distributed File System) - NameNode
The NameNode is the main point of access of a Hadoop cluster. It is responsible for the bookkeeping of the data partitioned across the DataNodes, manages the whole filesystem metadata, and performs load balancing.
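Since the NameNode holds the block-to-DataNode mapping, a client can ask it where each block of a file lives. The sketch below assumes the same Hadoop Java client API and placeholder paths as above; it only queries metadata from the NameNode and does not read the data itself.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder address

        try (FileSystem fs = FileSystem.get(conf)) {
            FileStatus status = fs.getFileStatus(new Path("/data/example.txt"));
            // The NameNode answers this metadata query; the blocks themselves stay on the DataNodes.
            BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset=" + block.getOffset()
                        + " length=" + block.getLength()
                        + " hosts=" + String.join(",", block.getHosts()));
            }
        }
    }
}
```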
HDFS (Hadoop Distributed File System) - Secondary NameNode and DataNode
- Secondary (standby) NameNode: keeps track of changes in the NameNode, performing regular snapshots, thus allowing quick startup to guarantee high availability (the NameNode is a single point of failure).
- DataNode: here is where the data are saved and the computations take place (data nodes should actually be called "data and worker nodes").

HDFS - Data Redundancy (for data protection)
Data redundancy: a method to recover lost data using redundant data. In order to prevent data loss and/or task termination due to hardware failures, HDFS uses either:
- Replication (creating multiple copies, usually 3, of the data): the simplest method for coding data, by making n copies of it. n-fold replication guarantees the availability of data for at most n - 1 failures and has a storage overhead of 200% (equivalent to a storage efficiency of 33%).
- Erasure coding: the data are fragmented into smaller components, and supplementary elements, referred to as parity, are created and distributed across multiple storage nodes. This redundancy allows the system to reconstruct data fragments that have been lost or corrupted.

Replication vs. Erasure Coding
(The storage cost of the two schemes is compared in the sketch at the end of this section.)

HDFS architecture: internal data representation
HDFS supports working with very large files. Internally, data are split into blocks. One of the reasons for splitting data into blocks is that in this way block objects all have the same size. The block size in HDFS can be configured at installation time and is 128 MB by default.
Note: Hadoop sees data as a bunch of records, and it processes multiple files the same way it does a single file.

Assignment: You have a file of 100 TB that must be stored in HDFS, and this file is divided into blocks of 10 TB each. Draw how you can store these data, and determine the required number of DataNodes if each one can store 30 TB. Note: you have to ensure replication. You also have 2 racks for the DataNodes.
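As a rough comparison of the two redundancy schemes discussed above, the LaTeX fragment below summarizes their storage cost. The Reed-Solomon RS(6,3) figures are given only as a commonly cited erasure-coding layout for HDFS and are an assumption, not a value from this course.

```latex
% Storage cost of n-fold replication vs. Reed-Solomon erasure coding RS(k,m)
% (k data fragments plus m parity fragments).
\[
  \text{replication overhead} = (n-1)\times 100\%,\qquad
  \text{storage efficiency} = \tfrac{1}{n}
\]
\[
  \text{erasure-coding overhead} = \tfrac{m}{k}\times 100\%,\qquad
  \text{storage efficiency} = \tfrac{k}{k+m}
\]
% Examples: 3-fold replication -> 200% overhead, 33% efficiency;
% RS(6,3) -> 50% overhead, 67% efficiency, while tolerating the loss of
% any 3 of the 9 fragments.
```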