BDA Notes
Hadoop is a framework that uses distributed storage and parallel processing to store and manage big
data. It is the software most used by data analysts to handle big data, and its market size continues to
grow. There are three components of Hadoop:
Hadoop HDFS - Hadoop Distributed File System (HDFS) is the storage unit.
Hadoop MapReduce - Hadoop MapReduce is the processing unit.
Hadoop YARN - Yet Another Resource Negotiator (YARN) is a resource management unit.
Features of Hadoop
Apache Hadoop is a popular and powerful big data tool. It provides a highly reliable storage
layer (HDFS), a batch processing engine (MapReduce), and a resource management layer
(YARN). The important features of Hadoop are discussed below.
2.1. Open Source
Apache Hadoop is an open source project. It means its code can be modified according to
business requirements.
2.2. Distributed Processing
As data is stored in a distributed manner in HDFS across the cluster, data is processed in parallel
on a cluster of nodes.
2.3. Fault Tolerance
This is one of the most important features of Hadoop. By default, three replicas of each block are
stored across the cluster, and this replication factor can be changed as per the requirement. So if
any node goes down, the data on that node can easily be recovered from the other nodes that hold
its replicas. Failed nodes or tasks are recovered automatically by the framework. This is how
Hadoop is fault tolerant.
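As an illustration of how the replication factor is set, the sketch below shows the dfs.replication
property in hdfs-site.xml (the value shown is simply the default, not a recommendation for any
particular cluster):

    <!-- hdfs-site.xml (sketch): dfs.replication controls how many copies of each block are kept -->
    <property>
      <name>dfs.replication</name>
      <value>3</value>  <!-- the Hadoop default; can be raised or lowered per requirement -->
    </property>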
2.4. Reliability
Because data is replicated across the cluster, it is stored reliably on the cluster of machines despite
machine failures. Even if a machine goes down, its data remains safely stored elsewhere in the
cluster due to this characteristic of Hadoop.
2.5. High Availability
Data remains highly available and accessible despite hardware failure because multiple copies of
the data are kept. If a machine or a piece of hardware crashes, the data can be accessed through
another path.
2.6. Scalability
Hadoop is highly scalable: new hardware can easily be added to the cluster. Hadoop also provides
horizontal scalability, which means new nodes can be added on the fly without any downtime.
2.7. Economic
Apache Hadoop is not very expensive, as it runs on a cluster of commodity hardware; no
specialized machines are needed. Hadoop also provides large cost savings because it is very easy
to add more nodes on the fly. So if requirements increase, nodes can be added without any
downtime and without much pre-planning.
2.8. Easy to use
The client does not need to deal with distributed computing; the framework takes care of
everything. This makes Hadoop easy to use.
2.9. Data Locality
This is one of the unique features of Hadoop that lets it handle Big Data easily. Hadoop works on
the data locality principle, which states that computation should be moved to the data instead of
data to the computation. When a client submits a MapReduce job, the algorithm is moved to the
data in the cluster rather than bringing the data to the location where the job was submitted and
processing it there.
Hadoop Assumptions
Hadoop is written with large clusters of computers in mind and is built around the following
assumptions:
Hardware may fail (since commodity hardware is used).
Processing will be run in batches. Thus there is an emphasis on high throughput as opposed
to low latency.
Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to
terabytes in size.
HDFS:
HDFS (Hadoop Distributed File System) is the storage layer of Hadoop. It is mainly
designed to run on commodity hardware devices (inexpensive devices), following a
distributed file system design. HDFS is designed to favor storing data in large blocks
rather than many small blocks.
HDFS provides fault tolerance and high availability to the storage layer and to the
other devices present in the Hadoop cluster. The data storage nodes in HDFS are:
NameNode(Master)
DataNode(Slave)
NameNode: The NameNode works as the master in a Hadoop cluster and guides the
DataNodes (slaves). The NameNode mainly stores the metadata, i.e., the data about the
data. Metadata includes the transaction logs that keep track of user activity in the Hadoop
cluster.
Metadata also includes file names, sizes, and the location information (block numbers,
block IDs) of the DataNodes, which the NameNode stores to find the closest DataNode for
faster communication. The NameNode instructs the DataNodes to perform operations such
as delete, create, and replicate.
DataNode: DataNodes work as slaves and are mainly used for storing the data in a
Hadoop cluster. The number of DataNodes can range from 1 to 500 or even more; the
more DataNodes there are, the more data the cluster can store. It is therefore advisable
for DataNodes to have high storage capacity so they can hold a large number of file blocks.
High Level Architecture Of Hadoop
File Block in HDFS: Data in HDFS is always stored in terms of blocks. A single file is
divided into multiple blocks of 128MB by default, and this block size can also be changed
manually.
Let's understand this breaking of a file into blocks with an example. Suppose you upload a
400MB file to HDFS. The file is divided into blocks of 128MB + 128MB + 128MB + 16MB
= 400MB, meaning 4 blocks are created, each of 128MB except the last one. Hadoop neither
knows nor cares what data is stored in these blocks, so a record may be split across block
boundaries and a block may end with a partial record. In the Linux file system, a file block is
about 4KB, which is far smaller than the default block size in the Hadoop file system. Since
Hadoop is mainly configured for storing very large data sets, up to petabytes, this large block
size is part of what makes the Hadoop file system different from other file systems and lets it
scale; block sizes of 128MB to 256MB are commonly used in Hadoop today.
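To make the block arithmetic above concrete, here is a small Python sketch (purely illustrative,
not part of Hadoop) that splits a hypothetical 400 MB file into 128 MB blocks:

    # Illustrative only: mimic how HDFS would cut a file into fixed-size blocks.
    BLOCK_SIZE_MB = 128  # default HDFS block size

    def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
        blocks = []
        remaining = file_size_mb
        while remaining > 0:
            blocks.append(min(block_size_mb, remaining))  # the last block may be smaller
            remaining -= block_size_mb
        return blocks

    print(split_into_blocks(400))  # [128, 128, 128, 16] -> 4 blocks, matching the example above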
HDFS Architecture
Features of HDFS
It is suitable for distributed storage and processing.
Hadoop provides a command interface to interact with HDFS.
The built-in servers of the NameNode and DataNodes help users easily check the status of the
cluster.
Streaming access to file system data.
HDFS provides file permissions and authentication.
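As a usage sketch of the command interface mentioned above, a few common HDFS shell
commands are shown below (the file name sales.csv and the path /user/data are hypothetical):

    hdfs dfs -mkdir -p /user/data            # create a directory in HDFS
    hdfs dfs -put sales.csv /user/data       # copy a local file into HDFS
    hdfs dfs -ls /user/data                  # list the files stored in that directory
    hdfs dfs -cat /user/data/sales.csv       # read the file contents back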
MapReduce:
A software framework and programming model called MapReduce is used to process enormous
volumes of data. Map and Reduce are the two stages of the MapReduce program’s operation.
Vast volumes of data are generated in today’s data-driven market due to algorithms and
applications constantly gathering information about individuals, businesses, systems, and
organizations.
The MapReduce program runs in three phases: the map phase, the shuffle phase, and the reduce
phase.
1. The map stage
The task of the map or mapper stage is to process the input data. In most cases, the input data is
stored in the Hadoop Distributed File System (HDFS) as a file or directory. The input file is passed
to the mapper function line by line, and the mapper processes the data and produces several small
chunks of intermediate data.
2. The reduce stage (including shuffle and reduce)
The shuffle and reduce stages are combined to create the reduce stage. Processing the data that
arrives from the mapper is the reducer’s responsibility. Following processing, it generates a fresh
set of outputs that will be kept in the HDFS.
Hadoop assigns the Map and Reduce tasks to the proper cluster computers during a MapReduce
job. The framework controls every aspect of data-passing, including assigning tasks, confirming
their completion, and transferring data across nodes within a cluster. Most computing is done on
nodes with data stored locally on drives, which lowers network traffic. After the assigned tasks
are finished, the cluster gathers and reduces the data to create the necessary results, then delivers
it back to the Hadoop server.
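The following minimal Python sketch imitates the three phases (map, shuffle, reduce) on an
in-memory word-count example. It is a conceptual stand-in rather than the actual Hadoop API,
and all names and input lines in it are illustrative:

    from collections import defaultdict

    def map_phase(line):
        # Mapper: emit (word, 1) pairs for every word in one input line.
        return [(word.lower(), 1) for word in line.split()]

    def shuffle_phase(mapped_pairs):
        # Shuffle: group all values by key, as the framework does between map and reduce.
        groups = defaultdict(list)
        for key, value in mapped_pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(groups):
        # Reducer: aggregate the grouped values into the final output.
        return {key: sum(values) for key, values in groups.items()}

    lines = ["Hadoop stores big data", "Hadoop processes big data in parallel"]
    mapped = [pair for line in lines for pair in map_phase(line)]
    result = reduce_phase(shuffle_phase(mapped))
    print(result)  # e.g. {'hadoop': 2, 'big': 2, 'data': 2, ...}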
Key Features of MapReduce
1. Highly scalable
Apache Hadoop MapReduce is a framework with excellent scalability. This is because of its
capacity to distribute and store large amounts of data across numerous servers, all of which can
run simultaneously and are reasonably priced.
By adding servers to the cluster, we can easily grow the amount of storage and computing
power. We can improve the capacity of individual nodes or add any number of nodes (horizontal
scalability) to attain high computing power. Thanks to Hadoop MapReduce programming,
organizations can run applications across massive sets of nodes involving potentially thousands
of terabytes of data.
2. Versatile
Businesses can use MapReduce programming to access new data sources. It makes it possible for
companies to work with many forms of data. Enterprises can access both organized and
unstructured data with this method and acquire valuable insights from the various data sources.
Since Hadoop is an open-source project, its source code is freely accessible for review,
alterations, and analyses. This enables businesses to alter the code to meet their specific needs.
The MapReduce framework supports data from sources including email, social media, and
clickstreams in different languages.
3. Secure
The MapReduce programming model uses the HBase and HDFS security approaches, and only
authenticated users are permitted to view and manipulate the data. HDFS uses a replication
technique in Hadoop 2 to provide fault tolerance. Depending on the replication factor, it makes a
clone of each block on the various machines. One can therefore access data from the other
devices that house a replica of the same data if any machine in the cluster goes down. Erasure
coding has taken the place of this replication technique in Hadoop 3. Erasure coding delivers the
same level of fault tolerance with less storage space; the storage overhead with erasure coding is
less than 50%.
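As a rough worked comparison: with the default 3-way replication, each block is stored three
times, which is a 200% storage overhead, whereas a Reed-Solomon RS(6,3) erasure-coding
policy (commonly used in Hadoop 3) stores 6 data blocks plus 3 parity blocks, an overhead of
3/6 = 50%, while still tolerating the loss of any 3 of the 9 blocks.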
4. Affordability
With the help of the MapReduce programming framework and Hadoop’s scalable design, big
data volumes may be stored and processed very affordably. Such a system is particularly cost-
effective and highly scalable, making it ideal for business models that must store data that is
constantly expanding to meet the demands of the present.
In terms of scalability, processing data with older, conventional relational database management
systems was not as simple as it is with the Hadoop system. With those systems, companies had to
downsize the data and classify it based on assumptions about which data could be relevant to the
organization, deleting the raw data in the process. The MapReduce programming model in the
Hadoop scale-out architecture helps in this situation.
5. Fast-paced
MapReduce uses the Hadoop Distributed File System, a distributed storage method that acts as a
mapping system for locating data in a cluster. The data processing technologies, such as
MapReduce programming, are typically placed on the same servers where the data resides, which
allows quicker data processing.
Thanks to Hadoop's distributed data storage, users can process data in a distributed manner
across a cluster of nodes. As a result, the Hadoop architecture can process data exceptionally
quickly. Hadoop MapReduce can process large volumes of unstructured or semi-structured data
in a shorter time.
6. Based on a simple programming model
Hadoop MapReduce is built on a straightforward programming model, which is one of the
technology's most noteworthy features. This enables programmers to create MapReduce
applications that handle tasks quickly and effectively. The MapReduce programming model is
written in Java, a very popular and easy-to-learn programming language.
Java programming is simple to learn, and anyone can create a data processing model that works
for their company. Hadoop is straightforward to use because clients don't need to worry about
distributing the computation; the framework itself handles the processing.
7. Parallel processing-compatible
The parallel processing involved in MapReduce programming is one of its key components. In
this programming paradigm, tasks are divided so that independent activities can be executed
simultaneously. Because of this parallel processing, multiple processors can carry out the divided
tasks at the same time, which makes each job easier to handle and makes the program as a whole
run faster.
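As a small local illustration of this idea (not Hadoop itself), the Python sketch below splits a
word-counting job into independent chunks and processes them with several worker processes
at once; the chunks and the pool size are made up for the example:

    from multiprocessing import Pool

    def count_words(chunk):
        # Each worker handles one independent chunk of lines.
        return sum(len(line.split()) for line in chunk)

    if __name__ == "__main__":
        chunks = [["big data is big"], ["hadoop processes data"], ["in parallel across nodes"]]
        with Pool(processes=3) as pool:
            partial_counts = pool.map(count_words, chunks)  # the divided tasks run concurrently
        print(sum(partial_counts))                          # combine the partial results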
8. Reliable
Whenever a collection of information is sent to a single node, the same set of data is also
replicated to other nodes in the cluster. Therefore, even if one node fails, backup copies are
always available on other nodes and can be retrieved whenever necessary. This ensures high data
availability.
The framework offers a way to guarantee data trustworthiness through the use of Block Scanner,
Volume Scanner, Disk Checker, and Directory Scanner modules. Your data is safely saved in the
cluster and is accessible from another machine that has a copy of the data if your device fails or
the data becomes corrupt.
9. Highly available
Hadoop’s fault tolerance feature ensures that even if one of the DataNodes fails, the user may
still access the data from other DataNodes that have copies of it. Moreover, a high-availability
Hadoop cluster comprises an active NameNode and one or more passive NameNodes running on
hot standby. The active NameNode serves client requests, while a passive node is a backup that
applies the changes recorded in the active NameNode's edit logs to its own namespace.
Analytics Architecture
Analytics architecture refers to the infrastructure and systems that are used to support the
collection, storage, and analysis of data. There are several key components that are typically
included in an analytics architecture:
1. Data collection: This refers to the process of gathering data from various sources, such as
sensors, devices, social media, websites, and more.
2. Transformation: Once the data has been collected, it should be cleaned and transformed
before it is stored.
3. Data storage: This refers to the systems and technologies used to store and manage data,
such as databases, data lakes, and data warehouses.
4. Analytics: This refers to the tools and techniques used to analyze and interpret data, such as
statistical analysis, machine learning, and visualization.
Together, these components enable organizations to collect, store, and analyze data in order to
make informed decisions and drive business outcomes.
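A toy end-to-end sketch of these four components in Python is given below, using an in-memory
SQLite database; the event data, table name, and query are invented purely for illustration:

    import sqlite3
    from statistics import mean

    # 1. Data collection: pretend these rows arrived from sensors or web logs.
    raw_events = [("2024-01-01", "23.5"), ("2024-01-01", "24.1"), ("2024-01-02", None)]

    # 2. Transformation: clean the collected data (drop missing values, cast types).
    clean_events = [(day, float(value)) for day, value in raw_events if value is not None]

    # 3. Data storage: persist the cleaned rows in a database.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (day TEXT, value REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", clean_events)

    # 4. Analytics: a simple aggregate over the stored data to support a decision.
    values = [row[0] for row in conn.execute("SELECT value FROM readings")]
    print("average reading:", mean(values))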
The analytics architecture is the framework that enables organizations to collect, store, process,
analyze, and visualize data in order to support data-driven decision-making and drive business
value.
Benefits:
There are several ways in which you can use analytics architecture to benefit your organization:
1. Support data-driven decision-making: Analytics architecture can be used to collect, store,
and analyze data from a variety of sources, such as transactions, social media, web analytics,
and sensor data. This can help you make more informed decisions by providing you with
insights and patterns that you may not have been able to detect otherwise.
2. Improve efficiency and effectiveness: By using analytics architecture to automate tasks
such as data integration and data preparation, you can reduce the time and resources required
to analyze data, and focus on more value-added activities.
3. Enhance customer experiences: Analytics architecture can be used to gather and analyze
customer data, such as demographics, preferences, and behaviors, to better understand and
meet the needs of your customers. This can help you improve customer satisfaction and
loyalty.
4. Optimize business processes: Analytics architecture can be used to analyze data from
business processes, such as supply chain management, to identify bottlenecks, inefficiencies,
and opportunities for improvement. This can help you optimize your processes and increase
efficiency.
5. Identify new opportunities: Analytics architecture can help you discover new opportunities,
such as identifying untapped markets or finding ways to improve product or service
offerings.
Analytics architecture can help you make better use of data to drive business value and improve
your organization’s performance.
Applications of Analytics Architecture
Analytics architecture can be applied in a variety of contexts and industries to support data-
driven decision-making and drive business value. Here are a few examples of how analytics
architecture can be used:
1. Financial services: Analytics architecture can be used to analyze data from financial
transactions, customer data, and market data to identify patterns and trends, detect fraud, and
optimize risk management.
2. Healthcare: Analytics architecture can be used to analyze data from electronic health
records, patient data, and clinical trial data to improve patient outcomes, reduce costs, and
support research.
3. Retail: Analytics architecture can be used to analyze data from customer transactions, web
analytics, and social media to improve customer experiences, optimize pricing and inventory,
and identify new opportunities.
4. Manufacturing: Analytics architecture can be used to analyze data from production
processes, supply chain management, and quality control to optimize operations, reduce
waste, and improve efficiency.
5. Government: Analytics architecture can be used to analyze data from a variety of sources,
such as census data, tax data, and social media data, to support policy-making, improve
public services, and promote transparency.
Analytics architecture can be applied in a wide range of contexts and industries to support data-
driven decision-making and drive business value.
Limitations of Analytics Architecture
There are several limitations to consider when designing and implementing an analytical
architecture:
1. Complexity: Analytical architectures can be complex and require a high level of technical
expertise to design and maintain.
2. Data quality: The quality of the data used in the analytical system can significantly impact
the accuracy and usefulness of the results.
3. Data security: Ensuring the security and privacy of the data used in the analytical system is
critical, especially when working with sensitive or personal information.
4. Scalability: As the volume and complexity of the data increase, the analytical system may
need to be scaled to handle the increased load. This can be a challenging and costly task.
5. Integration: Integrating the various components of the analytical system can be a challenge,
especially when working with a diverse set of data sources and technologies.
6. Cost: Building and maintaining an analytical system can be expensive, due to the cost of
hardware, software, and personnel.
7. Data governance: Ensuring that the data used in the analytical system is properly governed
and compliant with relevant laws and regulations can be a complex and time-consuming task.
8. Performance: The performance of the analytical system can be impacted by factors such as
the volume and complexity of the data, the quality of the hardware and software used, and
the efficiency of the algorithms and processes employed.
There are many tools that can be used in analytics architecture, depending on the specific needs
and goals of the organization. Some common tools that are used in analytics architectures
include:
Databases: Databases are used to store and manage structured data, such as customer
information, transactional data, and more. Examples include relational databases like
MySQL and NoSQL databases like MongoDB.
Data lakes: Data lakes are large, centralized repositories that store structured and
unstructured data at scale. Data lakes are often used for big data analytics and machine
learning.
Data warehouses: Data warehouses are specialized databases that are designed for fast
querying and analysis of data. They are often used to store large amounts of historical data
for business intelligence and reporting, and are typically loaded using ETL tools.
Business intelligence (BI) tools: BI tools are used to analyze and visualize data in order to
gain insights and make informed decisions. Examples include Tableau and Power BI.
Machine learning platforms: Machine learning platforms provide tools and frameworks for
building and deploying machine learning models. Examples include TensorFlow and scikit-
learn.
Statistical analysis tools: Statistical analysis tools are used to perform statistical analysis
and modeling of data. Examples include R and SAS.
Big Data Characteristics
Big Data refers to volumes of data too large to be processed by traditional data storage and
processing units. It is used by many multinational companies to process the data and run the
business of many organizations. The data flow would exceed 150 exabytes per day before
replication.
There are five V's of Big Data that describe its characteristics:
Volume
The name Big Data itself refers to an enormous size. Big Data involves vast volumes of data
generated daily from many sources, such as business processes, machines, social media
platforms, networks, human interactions, and many more.
For example, Facebook generates approximately a billion messages, records about 4.5 billion
clicks of the "Like" button, and receives more than 350 million new posts each day. Big data
technologies can handle such large amounts of data.
Variety
Big Data can be structured, unstructured, or semi-structured, collected from different sources.
In the past, data was collected only from databases and spreadsheets, but these days data
comes in an array of forms: PDFs, emails, audio, social media posts, photos, videos, etc.
The data is categorized as below:
Structured data: Structured data follows a defined schema with all the required columns and is
in tabular form. Structured data is stored in a relational database management system.
Unstructured data: Unstructured files such as log files, audio files, and image files are
included in unstructured data. Some organizations have a lot of data available, but they do not
know how to derive value from it since the data is raw.
Quasi-structured data: This is textual data with inconsistent formats that can be structured
with some effort, time, and suitable tools.
Example: Web server logs, i.e., a log file created and maintained by a server that contains a
list of activities.
Veracity
Veracity means how reliable the data is. Because data arrives with varying quality, there must
be ways to filter or translate it; veracity is about being able to handle and manage such data
efficiently, which is essential in business development.
Value
Value is an essential characteristic of big data. It refers not simply to the data that we process
or store, but to the valuable and reliable data that we store, process, and analyze.
Velocity
Velocity plays an important role compared to the other characteristics. Velocity refers to the
speed at which data is created in real time. It covers the speed of incoming data streams, their
rate of change, and bursts of activity. A primary aspect of Big Data is providing the demanded
data rapidly.
Big data velocity deals with the speed at which data flows in from sources like application
logs, business processes, networks, social media sites, sensors, mobile devices, etc.
Structured Data
Structured data can be crudely defined as the data that resides in a fixed field within a
record.
It is the type of data most familiar in our everyday lives, for example birthdays and addresses.
A certain schema binds it, so all the data has the same set of properties. Structured data is
also called relational data. It is split into multiple tables to enhance the integrity of the data
by creating a single record to depict an entity. Relationships are enforced by the application
of table constraints.
The business value of structured data lies within how well an organization can utilize its
existing systems and processes for analysis purposes.
Sources of structured data
A Structured Query Language (SQL) is needed to bring the data together. Structured data is
easy to enter, query, and analyze, since all of the data follows the same format. However, forcing a
consistent structure also means that any alteration of the structure is difficult, as each record has to be
updated to adhere to the new structure. Examples of structured data include numbers, dates,
strings, etc. The business data of an e-commerce website can be considered to be structured
data.
Name    Class   Section   Roll No   Grade
Geek1   11      A         1         A
Geek2   11      A         2         B
Geek3   11      A         3         A
Semi-Structured Data
Semi-structured data is not bound by any rigid schema for data storage and handling. The
data is not in a relational format and is not neatly organized into rows and columns as in a
spreadsheet. However, there are some features, such as key-value pairs, that help in
discerning the different entities from each other.
Since semi-structured data doesn’t need a structured query language, it is commonly
called NoSQL data.
A data serialization language is used to exchange semi-structured data across systems that
may even have varied underlying infrastructure.
Semi-structured content is often used to store metadata about a business process but it can
also include files containing machine instructions for computer programs.
This type of information typically comes from external sources such as social media
platforms or other web-based data feeds .
Semi-structured data is often created in plain text, so different text-editing tools can be used
to draw valuable insights. Due to the simple format, data serialization readers can be
implemented on hardware with limited processing resources and bandwidth.
Data Serialization Languages
Software developers use serialization languages to write in-memory data to files and to store,
transmit, and parse it. The sender and the receiver do not need to know anything about each
other's systems; as long as the same serialization language is used, both systems can understand
the data comfortably. The predominantly used serialization languages include the following.
1. XML– XML stands for eXtensible Markup Language. It is a text-based markup language
designed to store and transport data. XML parsers can be found in almost all popular
development platforms. It is human and machine-readable. XML has definite standards for
schema, transformation, and display. It is self-descriptive. Below is an example of a
programmer's details in XML.
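A minimal illustrative snippet (the names and values are made up) could look like this:

    <programmer>
      <name>Jane Doe</name>               <!-- hypothetical details -->
      <language>Python</language>
      <experience unit="years">5</experience>
    </programmer>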
2. JSON– JSON (JavaScript Object Notation) is a lightweight open-standard file format for
data interchange. JSON is easy to use and uses human/machine-readable text to store and
transmit data objects.
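For comparison, the same made-up programmer record can be serialized and parsed as JSON
with Python's standard json module; this is only an illustrative sketch:

    import json

    # The same hypothetical programmer record, this time as a Python dictionary.
    programmer = {"name": "Jane Doe", "language": "Python", "experience_years": 5}

    text = json.dumps(programmer)   # serialize to a JSON string for storage or transmission
    print(text)                     # {"name": "Jane Doe", "language": "Python", "experience_years": 5}

    parsed = json.loads(text)       # the receiving system parses it back into objects
    print(parsed["name"])           # Jane Doe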
Unstructured Data
Unstructured data is the kind of data that doesn’t adhere to any definite schema or set of
rules. Its arrangement is unplanned and haphazard.
Photos, videos, text documents, and log files can be generally considered unstructured data.
Even though the metadata accompanying an image or a video may be semi-structured, the
actual data being dealt with is unstructured.
Additionally, Unstructured data is also known as “dark data” because it cannot be analyzed
without the proper software tools.
Differences between Business Intelligence vs Big Data:
Tools: Tools commonly used in Business Intelligence include Tableau and Sisense, while tools
commonly used in Big Data include Hadoop and Presto.
Characteristics/Properties: Business Intelligence is characterized by extracting information
from existing data and presenting it through dashboards and reports that highlight trends, cost
savings, revenues, etc., while Big Data can be described by its characteristic V's (volume,
variety, velocity, veracity, and value).
Challenges of Big Data
Storage
With vast amounts of data generated daily, the greatest challenge is storage (especially when the
data is in different formats) within legacy systems. Unstructured data cannot be stored in
traditional databases.
Processing
Processing big data refers to the reading, transforming, extraction, and formatting of useful
information from raw information. The input and output of information in unified formats
continue to present difficulties.
Security
Keeping these large and varied data stores secure from breaches and misuse is another ongoing
challenge.
Finding and Fixing Data Quality Issues
Many organizations deal with challenges related to poor data quality, but solutions are available.
Approaches to fixing data problems include:
Repairing the original data source to resolve any data inaccuracies.
Using highly accurate methods of determining who someone is.
Scaling Big Data Systems
Database sharding, memory caching, moving to the cloud and separating read-only and write-
active databases are all effective scaling methods. While each one of those approaches is
fantastic on its own, combining them will lead you to the next level.
Evaluating and Selecting Big Data Technologies
Companies are spending millions on new big data technologies, and the market for such tools is
expanding rapidly. In recent years, the IT industry has caught on to the potential of big data and
analytics. The trending technologies include the following:
Hadoop Ecosystem
Apache Spark
NoSQL Databases
R Software
Predictive Analytics
Prescriptive Analytics
Big Data Environments
In a big data environment, data is constantly being ingested from a variety of sources, making it
more dynamic than a data warehouse. The people in charge of the big data environment can
quickly lose track of where each data collection came from and what it contains.
Real-Time Insights
The term "real-time analytics" describes the practice of performing analyses on data as a system
is collecting it. Decisions may be made more efficiently and with more accurate information
thanks to real-time analytics tools, which use logic and mathematics to deliver insights on this
data quickly.
Data Validation
Before using data in a business process, its integrity, accuracy, and structure must be validated.
The output of a data validation procedure can be used for further analysis, BI, or even to train a
machine learning model.
Healthcare Challenges
Electronic health records (EHRs), genomic sequencing, medical research, wearables, and
medical imaging are just a few examples of the many sources of health-related big data.
Beyond these, organizations face several broader challenges in big data analytics:
1. Choosing the right technology: The pool of big data technologies keeps expanding, and new
companies and technologies are developed every day. A big challenge for companies is to
find out which technology works best for them without introducing new problems and
potential risks.
2. The Big Data talent gap: While Big Data is growing, very few experts are available. This is
because Big Data is a complex field, and people who understand its complexity and intricate
nature are few and far between.
3. Getting data into the big data platform: Data is increasing every single day. This
means that companies have to tackle a limitless amount of data on a regular basis. The
scale and variety of data available today can overwhelm any data practitioner, which is
why it is important to make data accessibility simple and convenient for brand managers
and owners.
4. Need for synchronization across data sources: As data sets become more diverse, they
must be incorporated into an analytical platform. If this is ignored, it can create gaps and
lead to wrong insights. It is also important that the correct departments have access to the
insights gained from big data analytics; making these insights available to the right people
in an effective manner remains a major challenge.