Big Data and Analytics: The key concepts and practical applications of big data analytics (English Edition)
About this ebook
This book will help you understand the different types of analysis: descriptive, predictive, and prescriptive. It introduces NoSQL databases and their benefits over SQL. The book centers on Hadoop, explaining its features, versions, and main components: HDFS (storage) and MapReduce (processing). It explores MapReduce and YARN for efficient data processing and offers insights into MongoDB and Hive, popular tools in the big data landscape.
Big Data and Analytics - Dr. Jugnesh Kumar
Chapter 1
Introduction to Big Data
Introduction
The amount of data produced by humanity is increasing exponentially because of the rapid development of technology, the proliferation of devices, and the widespread use of social networking sites. To put things in perspective, humankind produced 5 billion gigabytes of data from the beginning of recorded history up to 2003; stored on physical discs, that data could cover an entire football pitch.
Amazingly, however, the same amount of data was generated every ten minutes in 2013, up from every two days in 2011, and the rate has continued to climb. Even though this vast amount of information holds many valuable insights and the potential to be helpful when processed, it is frequently underutilized and ignored. The enormous volume of data being produced at an unprecedented rate worldwide is called big data. It can include both structured and unstructured data. Businesses heavily rely on data in today's knowledge-based economy to fuel their success. So, it becomes crucial and enormously rewarding to make sense of this data, identify patterns, and expose hidden connections within this vast ocean of information. The urgent need is to turn big data into easily usable, actionable business intelligence for enterprises. Businesses of all sizes, locations, market shares, and customer segments can develop successful strategies by accessing and analyzing high-quality data. This is where Hadoop, the go-to platform for processing enormous volumes of data, comes into play.
Structure
In this chapter, we will discuss the following topics:
Diverse facets of big data
Digital data and its types
Characteristics of big data
Types of big data
Evolution of big data
Applications and challenges of big data
3Vs of big data
Non-definitional traits of big data
Big data work flow management
Business intelligence versus big data
Data science process steps
Foundations for big data systems and programming
Distributed filesystems
Data warehouse and Hadoop environment
Coexistence
Diverse facets of big data
Alternatively, we can define big data as a collection of datasets so large that they cannot be processed efficiently using traditional computing techniques. It has developed into a broad discipline that includes various tools, techniques, and frameworks, not just a single technique or tool. The data itself is the enormous volume produced by different devices and applications. The following industries fall under the umbrella of big data, as shown in Table 1.1:
Table 1.1: Involvement of big data in various organizations
Digital data and its types
Digital data can be classified into several types based on its characteristics and format. Some common types are:
Textual data: This type includes written or typed text, such as documents, emails, webpages, and social media posts. Textual data is typically represented as a sequence of characters.
Numeric data: Numeric data consists of numbers and mathematical values. It can be discrete (whole numbers) or continuous (decimal numbers). Examples of numeric data include measurements, financial data, and statistical records.
Image data: Image data represents visual information through pictures or graphical content. It consists of a grid of pixels, where each pixel contains color or grayscale information. Image data is commonly used in photography, digital art, and computer vision applications.
Audio data: Audio data represents sound or audio signals. It can be in the form of speech, music, or other audio recordings. Audio data is typically stored as waveform samples, capturing variations in air pressure over time.
Video data: Video data consists of a sequence of images (frames) presented in rapid succession. It combines image and audio data to represent moving visual content. Video data is commonly used in movies, television, surveillance systems, and video streaming platforms.
Geospatial data: Geospatial data refers to data with geographical or spatial information. It includes coordinates, maps, satellite imagery, and location-based data. Geospatial data is widely used in navigation, urban planning, mapping, and environmental analysis.
Time series data: Time series data captures measurements or observations taken at different points in time. It includes data points recorded at regular intervals, such as stock prices, weather data, sensor readings, and device logs.
Structured data: This type of data follows a predefined format and schema. It is organized in a tabular or relational form, with well-defined rows and columns. Structured data is stored in databases and spreadsheets and can be easily queried and analyzed.
Unstructured data: Unstructured data refers to data that does not have a predefined format or structure. It includes free-form text, multimedia content, social media posts, emails, and documents. Unstructured data requires advanced techniques like machine learning and natural language processing to extract meaningful insights.
Metadata: Metadata provides descriptive information about other types of data. It includes file names, creation dates, author information, data sources, and formats. Metadata helps in organizing, managing, and understanding other data types.
Characteristics of big data
Data can possess several characteristics that impact its management, analysis, and interpretation. Some important features of data include:
Volume: Volume denotes the amount or size of data. It can range from small-scale data sets to massive volumes of data from various sources.
Velocity: Velocity denotes the speed at which data is created, processed, collected, and analyzed. Real-time data requires fast processing capabilities to extract timely insights.
Variety: The diversity of data types and formats is called variety. Text, images, audio, video, and other data types can exist in structured, unstructured, or semi-structured forms.
Semi-structured: The data shows some organization but lacks a strict structure, in contrast to structured data, which is prearranged in a tabular format with a predetermined schema. A certain level of hierarchy or relationship is possible because this kind of data frequently contains elements like tags, keys, or attributes.
Veracity: Veracity refers to the quality and reliability of data. Data may contain errors, inconsistencies, or inaccuracies that must be addressed to ensure data veracity.
Value: Value refers to data's usefulness, relevance, and potential insights. Extracting value from data involves analysis, interpretation, and decision-making based on the obtained insights.
Variability: Variability refers to the dynamic landscape of data. Data can exhibit variations in volume, velocity, and variety over time. Handling data variability requires adaptability and flexibility in data processing.
Types of big data
Big data can be classified into three main types based on the nature of the data and its characteristics. These types are mentioned in Table 1.2:
Table 1.2: Difference between structured and unstructured data based on different criteria
Structured data
Structured data is information that has been organized so that it can be processed, stored, and retrieved in a fixed format. It is typically kept in databases and is readily accessible using simple algorithms. Since the data format is known beforehand, managing structured data is straightforward. Structured data includes information that a business keeps in databases, tables, and spreadsheets. In big data, structured data refers to data that has a predefined format and fits into a well-defined schema or model. It is organized and stored in a tabular, relational format, typically found in traditional databases. Structured data follows a consistent and predefined structure, making it easier to query, analyze, and process using conventional database management systems. Figure 1.1 depicts structured data in different colors:
Figure 1.1: Structured data illustrated in different colors
(Source: https://dryviq.com/unstructured-vs-structured-data-4-key-management-differences/)
Key characteristics of structured data in big data include:
Fixed schema: Structured data has a fixed and predefined schema that defines the structure of the data. The schema determines the kinds of data that can be stored, the relationships among different data elements, and the constraints on data values. This fixed schema enables efficient data storage, indexing, and retrieval.
Organized format: Structured data is organized into tables of rows and columns, where each column represents a specific attribute or data field, and each row represents an individual record or data instance. This tabular format allows for easy organization, storage, and manipulation of data.
Consistent data types: Structured data adheres to consistent data types, such as integers, floats, strings, dates, or Booleans, which ensure uniformity and facilitate data processing and analysis. These predefined data types provide clarity on the nature of the data and enable efficient storage and computation.
Querying and analysis: Structured data can be easily queried, filtered, and analyzed using Structured Query Language (SQL) or similar database query languages. The structured nature of the data enables efficient indexing and optimized query execution, allowing for fast and precise retrieval of desired information.
Relational Database Management Systems (RDBMS): Structured data is commonly stored and managed using RDBMS. It provides robust mechanisms for creating, storing, and manipulating structured data, ensuring data integrity, transaction management, and security.
Examples of structured data in big data include transactional data in e-commerce systems (customer orders, product details, purchase history, and so on), financial data (stock prices, sales reports, and so on), sensor data with fixed attributes (temperature, pressure readings, and so on), and customer data (demographics, contact information, and so on). Structured data is relatively easy to work with because of its organized and predictable nature. However, it is important to note that big data encompasses not only structured data but also semi-structured and unstructured data. Incorporating and integrating structured data with other data types adds complexity to big data analytics and requires advanced techniques to extract meaningful insights from the larger data landscape.
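The fixed-schema, SQL-queryable nature of structured data described above can be sketched with Python's built-in sqlite3 module. The orders table, its columns, and its rows are hypothetical examples, not taken from any real system:

```python
import sqlite3

# In-memory database; the "orders" table and its columns are a
# hypothetical example of a fixed, predefined schema.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id   INTEGER PRIMARY KEY,
        customer   TEXT NOT NULL,
        amount     REAL NOT NULL,
        order_date TEXT NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO orders (customer, amount, order_date) VALUES (?, ?, ?)",
    [("Asha", 120.50, "2023-01-15"),
     ("Ravi", 89.99, "2023-01-16"),
     ("Asha", 45.00, "2023-02-01")],
)

# Because the schema is fixed, aggregation and filtering are one query away.
rows = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(rows)  # [('Asha', 165.5), ('Ravi', 89.99)]
```

The same query pattern carries over directly to the large-scale relational and SQL-on-Hadoop systems discussed later in the book.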
Unstructured data
Data without a predetermined structure is referred to as unstructured data. It exhibits heterogeneity and is typically larger than structured data. The results of a Google search serve as a prime example of unstructured data. It includes various sizes of text, images, videos, webpages, and other data formats. Unstructured data in big data refers to data that lacks a predefined structure or does not fit into a traditional tabular format. It is essentially any form of data that does not conform to a rigid schema or model. Unstructured data is typically more complex, diverse, and challenging to process compared to structured data. Examples of unstructured data include text documents, social media posts, emails, audio recordings, images, videos, webpages, and sensor data.
Key characteristics of unstructured data in big data include:
Lack of predefined structure: Unstructured data does not adhere to a fixed schema or predefined format. It can have varying lengths, formats, and organization. Each piece of unstructured data may contain different types of information or have different data fields, making it challenging to organize and process.
Diverse data types: Unstructured data encompasses various data types, such as text, multimedia, and sensor data. This diversity requires specialized techniques to handle different formats and extract insights from multiple data sources.
Natural language content: Unstructured data often includes natural language content, such as text documents, emails, or social media posts. Extracting meaning from such content calls for techniques like sentiment analysis, text mining, and entity recognition.
Rich media content: Unstructured data also includes media files, such as images, videos, and audio recordings. Analyzing and extracting insights from these media files may involve computer vision techniques, video/image analysis, audio processing, and pattern recognition.
Semi-structured elements: Unstructured data can contain semi-structured elements, which exhibit some level of organization but lack a strict schema. For example, webpages may have HTML tags, XML files may have tags and attributes, or social media posts may have hashtags and mentions. Handling these semi-structured elements requires techniques that can capture the underlying structure while accommodating the variations in data organization.
Large volume: Unstructured data can contribute significantly to the volume of big data. Text documents, social media feeds, and multimedia files can accumulate rapidly, resulting in a massive amount of unstructured data that needs to be processed and analyzed.
Dealing with unstructured data in big data requires advanced technologies and techniques. These include text mining, natural language processing, machine learning, image and video analysis, and deep learning algorithms. By leveraging these methods, organizations can unlock valuable insights hidden within unstructured data and gain a more comprehensive understanding of their business processes, market trends, customer sentiments, and more.
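As a minimal illustration of how structure must be extracted from free-form content, the following Python sketch tokenizes a hypothetical social media post and counts word frequencies. Real pipelines would use full NLP libraries; this only shows the basic idea of deriving structure (counts, hashtags) from text that has none:

```python
import re
from collections import Counter

# A hypothetical social media post: free-form text with no schema.
post = "Loving the new phone! Battery life is great, camera is great too. #newphone"

# Derive structure from the text: tokenize, count, and pull out hashtags.
tokens = re.findall(r"[a-z']+", post.lower())
word_counts = Counter(tokens)
hashtags = re.findall(r"#\w+", post)

print(word_counts.most_common(2))  # [('is', 2), ('great', 2)]
print(hashtags)                    # ['#newphone']
```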
Semi-structured data
As the name suggests, semi-structured data is a combination of structured and unstructured data. It refers to data that is not organized in a formal database but carries tags that identify its various components. An XML document, with its user-defined tags, is a prime example of semi-structured data. Semi-structured data in big data refers to data that has some level of structure but does not conform to a rigid, predefined schema like structured data. It lies between structured and unstructured data, combining elements of both. Semi-structured data possesses some organizational patterns or tags that provide a basic structure, but it allows flexibility in terms of data fields and formats. It is commonly encountered in various domains, including web data, log files, JSON documents, XML files, and NoSQL databases.
Here are the key characteristics of semi-structured data in big data:
Flexible schema: Semi-structured data does not require a predefined, fixed schema like structured data. It allows for variations in data fields and formats, enabling greater flexibility when capturing and storing data. Each record or document can have different attributes or elements, and new attributes can be added over time without disrupting the existing data structure.
Tags or markers: Semi-structured data often includes tags, markers, or metadata that provide some level of organization or structure. These tags provide hints about the data elements and their relationships but do not enforce a strict schema. Examples include XML tags, JSON key-value pairs, or attributes in NoSQL databases.
Hierarchical structure: Semi-structured data can exhibit a hierarchical structure, where data elements are organized in a nested or tree-like fashion. This structure enables capturing complex relationships between data elements and supports efficient querying and navigation through the data.
Limited data integrity: In contrast to structured data, semi-structured data does not enforce strict data integrity constraints. It may contain inconsistencies or incomplete information. Data quality control and validation mechanisms need to be applied during data processing to ensure accuracy and reliability.
Diverse formats: Semi-structured data can be represented in various formats, including XML, JSON, YAML, HTML, or key-value pairs. These formats provide flexibility in representing complex data structures and enable interoperability between different systems and platforms.
Processing challenges: Analyzing and processing semi-structured data requires specialized tools and techniques. Techniques such as XML parsing, JSON parsing, XPath querying, or schema-on-read approaches are commonly employed to extract information, navigate through the data hierarchy, and handle the flexible structure of semi-structured data.
Semi-structured data presents unique challenges and opportunities in big data analytics. It allows for the storage and analysis of diverse and dynamic data types while providing some organizational structure. Leveraging semi-structured data requires data integration techniques, schema discovery, and flexible data processing approaches that can adapt to the evolving nature of the data.
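The flexible schema, hierarchical structure, and limited integrity described above can all be seen in a small JSON example. The user documents below are hypothetical, and Python's standard json module stands in for a full document store:

```python
import json

# Two records in one collection. Note the flexible schema: the second
# record adds a "tags" field and omits "age" entirely.
docs = [
    '{"name": "Asha", "age": 31, "address": {"city": "Delhi"}}',
    '{"name": "Ravi", "tags": ["vip", "beta"], "address": {"city": "Pune"}}',
]

results = []
for raw in docs:
    record = json.loads(raw)
    # Hierarchical navigation: address -> city.
    city = record["address"]["city"]
    # Limited data integrity: missing fields must be handled explicitly.
    age = record.get("age", "unknown")
    results.append((record["name"], city, age))

print(results)  # [('Asha', 'Delhi', 31), ('Ravi', 'Pune', 'unknown')]
```

This schema-on-read style, where structure is interpreted at query time rather than enforced at write time, is exactly what NoSQL document stores such as MongoDB build on.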
Evolution of big data
The history of big data can be traced back to the initial days of computing and the evolution of data storage and processing technologies. Here are some key milestones in the history of big data:
Early data processing (1950s-1970s): In the early days of computing, data processing was limited to structured data stored in databases. Mainframe computers were used to process huge volumes of data, primarily in batch mode.
Relational databases (1970s-1980s): The invention of relational databases introduced a structured and organized approach to data storage and management. Structured Query Language (SQL) was developed as a standard language for interacting with relational databases.
Data warehousing (1980s-1990s): The idea of data warehousing emerged, focusing on collecting and storing large volumes of data from multiple sources for analysis and reporting. Data warehouses allow organizations to consolidate data and perform complex queries.
Internet and Web (1990s): The advent of the Internet and the World Wide Web led to an explosion of digital data. Websites, online transactions, and digital content generated vast amounts of data, including text, images, videos, and user interactions.
Emergence of Hadoop (2000s): Google's papers on the Google File System (GFS) (2003) and MapReduce (2004) inspired the development of Apache Hadoop, an open-source framework for the distributed storage and processing of big data. Hadoop allows organizations to store and process large volumes of data across clusters of commodity hardware.
NoSQL and new database technologies (2000s-2010s): As the volume and variety of data grew, new database technologies emerged to handle unstructured and semi-structured data. NoSQL databases, such as MongoDB and Cassandra, offered scalable and flexible solutions for managing big data.
Cloud computing (2000s-2010s): Cloud computing platforms like Amazon Web Services (AWS) and Microsoft Azure offered scalable and affordable infrastructure for storing and processing big data. Cloud services could
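The MapReduce model mentioned in the Hadoop milestone above can be illustrated without a cluster. The following single-process Python sketch (an illustration of the programming model only, not Hadoop itself) counts words through the three phases the model defines: map emits key-value pairs, shuffle groups them by key, and reduce aggregates each group:

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input line.
    for word in line.lower().split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: aggregate the grouped values for one key.
    return (key, sum(values))

lines = ["big data is big", "data is everywhere"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

In real Hadoop, the map and reduce functions run in parallel on many machines and the shuffle moves data across the network; the logic per phase, however, is the same.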