Big Data and Analytics: The key concepts and practical applications of big data analytics (English Edition)

About this ebook

Big Data and Analytics is an indispensable guide that navigates the complexities of data management and analysis. This comprehensive book covers the core principles, processes, and tools, ensuring readers grasp the essentials before progressing to advanced applications.

It will help you understand the different analysis types like descriptive, predictive, and prescriptive. Learn about NoSQL databases and their benefits over SQL. The book centers on Hadoop, explaining its features, versions, and main components like HDFS (storage) and MapReduce (processing). Explore MapReduce and YARN for efficient data processing. Gain insights into MongoDB and Hive, popular tools in the big data landscape.
Language: English
Release date: March 5, 2024
ISBN: 9789355517050


    Book preview

    Big Data and Analytics - Dr. Jugnesh Kumar

    Chapter 1

    Introduction to Big Data

    Introduction

    The amount of data produced by humanity is increasing exponentially because of the rapid development of technology, the proliferation of devices, and the widespread use of social networking sites. To put things in perspective, humankind produced 5 billion gigabytes of data between the beginning of time and 2003, which could cover an entire football pitch if represented as physical discs.

    Amazingly, however, the same amount of data was generated every ten minutes in 2013, up from every two days in 2011, and it has continued to increase significantly. Even though this vast amount of information contains many valuable insights and the potential to be helpful when processed, it is frequently underutilized and ignored. The enormous volume of data being produced at an unprecedented rate worldwide is called Big Data. It can include both structured and unstructured data. Businesses heavily rely on data in today's knowledge-based economy to fuel their success. So, it becomes crucial and enormously rewarding to make sense of this data, identify patterns, and expose hidden connections within this vast ocean of information. The urgent need is to turn big data into easily usable, actionable business intelligence for enterprises. Businesses of all sizes, locations, market shares, and customer segments can develop successful strategies by accessing and analyzing high-quality data. This is where Hadoop, the go-to platform for processing enormous volumes of data, comes into play.

    Structure

    In this chapter, we will discuss the following topics:

    Diverse facets of big data

    Digital data and its types

    Characteristics of big data

    Types of big data

    Evolution of big data

    Applications and challenges of big data

    3Vs of big data

    Non-definitional traits of big data

    Big data work flow management

    Business intelligence versus big data

    Data science process steps

    Foundations for big data systems and programming

    Distributed filesystems

    Data warehouse and Hadoop environment

    Coexistence

    Diverse facets of big data

    Alternatively, we can define big data as a collection of large datasets that cannot be processed efficiently using traditional computing techniques. It has developed into a broad discipline that includes various tools, techniques, and frameworks, not just a single technique or tool. This data consists of the enormous volumes produced by different devices and applications. The following industries fall under the umbrella of big data, as shown in Table 1.1:

    Table 1.1: Involvement of big data in various organizations

    Digital data and its types

    Digital data can be classified into several types based on its characteristics and format. Some common types of digital data are described below:

    Textual data: This type includes written or typed text, such as documents, emails, webpages, and social media posts. Textual data is typically represented as a sequence of characters.

    Numeric data: Numeric data consists of numbers and mathematical values. It can be discrete (whole numbers) or continuous (decimal numbers). Examples of numeric data include measurements, financial data, and statistical records.

    Image data: Image data represents visual information through pictures or graphical content. It consists of a grid of pixels, where each pixel contains color or grayscale information. Image data is commonly used in photography, digital art, and computer vision applications.

    Audio data: Audio data represents sound or audio signals. It can be in the form of speech, music, or other audio recordings. Audio data is typically stored as waveform samples, capturing variations in air pressure over time.

    Video data: Video data consists of a sequence of images (frames) presented in rapid succession. It combines image and audio data to represent moving visual content. Video data is commonly used in movies, television, surveillance systems, and video streaming platforms.

    Geospatial data: Geospatial data refers to data with geographical or spatial information. It includes coordinates, maps, satellite imagery, and location-based data. Geospatial data is widely used in navigation, urban planning, mapping, and environmental analysis.

    Time series data: Time series data captures measurements or observations taken at different points in time. It includes data points recorded at regular intervals, such as stock prices, weather data, sensor readings, and device logs.

    Structured data: This type of data follows a predefined format and schema. It is organized in a tabular or relational form, with well-defined rows and columns. Structured data is stored in databases and spreadsheets and can be easily queried and analyzed.

    Unstructured data: Unstructured data refers to data that does not have a predefined format or structure. It includes free-form text, multimedia content, social media posts, emails, and documents. Unstructured data requires advanced techniques like machine learning and natural language processing to extract meaningful insights.

    Metadata: Metadata provides descriptive information about other types of data. It includes file names, creation dates, author information, data sources, and formats. Metadata helps in organizing, managing, and understanding other data types.
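
    To make a few of these categories concrete, the short Python sketch below shows how textual, numeric, image, time series, and metadata values might be represented in code. The variable names and sample values are invented for illustration and do not come from the book.

    from datetime import datetime, timedelta

    # Textual data: a simple sequence of characters.
    post = "Big data keeps growing every year."

    # Numeric data: a discrete count and a continuous measurement.
    order_count = 42
    average_temperature = 21.7

    # Image data (greatly simplified): a 2x2 grid of grayscale pixel values (0-255).
    tiny_image = [[0, 128],
                  [128, 255]]

    # Time series data: observations recorded at regular one-hour intervals.
    start = datetime(2024, 1, 1)
    readings = [(start + timedelta(hours=h), 20.0 + 0.5 * h) for h in range(4)]

    # Metadata: descriptive information about another piece of data.
    metadata = {"file_name": "sensor_log.csv",
                "created": "2024-01-01",
                "source": "temperature sensor",
                "format": "CSV"}

    print(post)
    print(order_count, average_temperature)
    print(tiny_image)
    print(readings[0])
    print(metadata["file_name"])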

    Characteristics of big data

    Data can possess several characteristics that impact its management, analysis, and interpretation. Some important features of data include:

    Volume: Volume denotes the amount or size of data. It can range from small-scale data sets to massive volumes of data from various sources.

    Velocity: Velocity denotes the speed at which data is created, processed, collected, and analyzed. Real-time data requires fast processing capabilities to extract timely insights.

    Variety: The diversity of data types and formats is called variety. Text, images, audio, video, and other data types can exist in structured, unstructured, or semi-structured forms.

    Semi-structured: The data shows some organization but lacks a strict structure, in contrast to structured data, which is prearranged in a tabular format with a predetermined schema. A certain level of hierarchy or relationship is possible because this kind of data frequently contains elements like tags, keys, or attributes.

    Veracity: Veracity refers to the quality and reliability of data. Data may contain errors, inconsistencies, or inaccuracies that must be addressed to ensure data veracity.

    Value: Value refers to data's usefulness, relevance, and potential insights. Extracting value from data involves analysis, interpretation, and decision-making based on the obtained insights.

    Variability: Variability refers to the dynamic landscape of data. Data can exhibit variations in volume, velocity, and variety over time. Handling data variability requires adaptability and flexibility in data processing.

    Types of big data

    Big data can be classified into three main types based on the nature of the data and its characteristics. These types are mentioned in Table 1.2:

    Table 1.2: Difference between structured and unstructured data based on different criteria

    Structured data

    Information that has been organized so that it can be processed, stored, and retrieved in a fixed format is structured data. It is typically kept in databases and is readily accessible using simple algorithms. Since the data format is known beforehand, managing structured data is simple. Structured data includes information that is kept by a business in databases such as tables and spreadsheets. Structured data in big data refers to data that has a predefined format and fits into a well-defined schema or model. It is organized and stored in a tabular, relational format, typically found in traditional databases. Structured data follows a consistent and predefined structure, making it easier to query, analyze, and process using conventional database management systems. Figure 1.1 depicts structured data in different colors:

    Figure 1.1: Structured data illustrated in different colors

    (Source: https://dryviq.com/unstructured-vs-structured-data-4-key-management-differences/)

    Key characteristics of structured data in big data include:

    Fixed schema: Structured data has a fixed and predefined schema that defines the structure of the data. The schema determines the kinds of data that can be stored, the relationships among different data elements, and the constraints on data values. This fixed schema enables efficient data storage, indexing, and retrieval.

    Organized format: Structured data is organized into tables with rows and columns, where each column represents a specific attribute or data field, and each row represents an individual record or data instance. This tabular format allows for easy organization, storage, and manipulation of data.

    Consistent data types: Structured data adheres to consistent data types, such as integers, floats, strings, dates, or Booleans, which ensure uniformity and facilitate data processing and analysis. These predefined data types provide clarity on the nature of the data and enable efficient storage and computation.

    Querying and analysis: Structured data can be easily queried, filtered, and analyzed using Structured Query Language (SQL) or similar database query languages. The structured nature of the data enables efficient indexing and optimized query execution, allowing for fast and precise retrieval of desired information.

    Relational Database Management Systems (RDBMS): Structured data is commonly stored and managed using RDBMS. It provides robust mechanisms for creating, storing, and manipulating structured data, ensuring data integrity, transaction management, and security.

    Examples of structured data in big data include transactional data in e-commerce systems (customer orders, product details, purchase history, and so on), financial data (stock prices, sales reports, and so on), sensor data with fixed attributes (temperature, pressure readings, and so on), and customer data (demographics, contact information, and so on). Structured data is relatively easy to work with because of its organized and predictable nature. However, it is important to note that big data encompasses not only structured data but also semi-structured and unstructured data. Incorporating and integrating structured data with other data types adds complexity to big data analytics and requires advanced techniques to extract meaningful insights from the larger data landscape.
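
    As a small, hedged illustration of the querying point above, the following Python sketch uses the standard library's sqlite3 module to store a made-up orders table with a fixed schema and query it with SQL. The table name, columns, and values are assumptions made for the example; an in-memory database is used only to keep the sketch self-contained, whereas real structured data of this kind would live in a full RDBMS.

    import sqlite3

    # Structured data: a fixed schema with consistent data types,
    # organized into rows and columns.
    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE orders (
                        order_id   INTEGER PRIMARY KEY,
                        customer   TEXT NOT NULL,
                        amount     REAL NOT NULL,
                        order_date TEXT NOT NULL)""")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?, ?)",
        [(1, "Asha", 120.50, "2024-01-05"),
         (2, "Ravi", 75.00, "2024-01-06"),
         (3, "Asha", 42.25, "2024-01-07")])

    # Because the structure is known in advance, SQL can filter and
    # aggregate the data efficiently.
    for customer, total in conn.execute(
            "SELECT customer, SUM(amount) FROM orders "
            "GROUP BY customer ORDER BY customer"):
        print(customer, total)
    conn.close()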

    Unstructured data

    Data without a predetermined structure is referred to as unstructured data. It exhibits heterogeneity and is typically larger than structured data. The results of a Google search serve as a prime example of unstructured data. It includes various sizes of text, images, videos, webpages, and other data formats. Unstructured data in big data refers to data that lacks a predefined structure or does not fit into a traditional tabular format. It is essentially any form of data that does not conform to a rigid schema or model. Unstructured data is typically more complex, diverse, and challenging to process compared to structured data. Examples of unstructured data include text documents, social media posts, emails, audio recordings, images, videos, webpages, and sensor data.

    Key characteristics of unstructured data in big data include:

    Lack of predefined structure: Unstructured data does not adhere to a fixed schema or predefined format. It can have varying lengths, formats, and organization. Each piece of unstructured data may contain different types of information or have different data fields, making it challenging to organize and process.

    Diverse data types: Unstructured data encompasses various data types, such as text, multimedia, and sensor data. This diversity requires specialized techniques to handle different formats and extract insights from multiple data sources.

    Natural language content: Unstructured data often includes natural language content, such as text documents, emails, or social media posts. Extracting insights from it typically involves natural language processing techniques such as sentiment analysis, text mining, and entity recognition.

    Rich media content: Unstructured data also includes media files, such as images, videos, and audio recordings. Analyzing and extracting insights from these media files may involve computer vision techniques, video/image analysis, audio processing, and pattern recognition.

    Semi-structured elements: Unstructured data can contain semi-structured elements, which exhibit some level of organization but lack a strict schema. For example, webpages may have HTML tags, XML files may have tags and attributes, or social media posts may have hashtags and mentions. Handling these semi-structured elements requires techniques that can capture the underlying structure while accommodating the variations in data organization.

    Large volume: Unstructured data can contribute significantly to the volume of big data. Text documents, social media feeds, and multimedia files can accumulate rapidly, resulting in a massive amount of unstructured data that needs to be processed and analyzed.

    Dealing with unstructured data in big data requires advanced technologies and techniques. These include text mining, natural language processing, machine learning, image and video analysis, and deep learning algorithms. By leveraging these methods, organizations can unlock valuable insights hidden within unstructured data and gain a more comprehensive understanding of their business processes, market trends, customer sentiments, and more.
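
    The sketch below is not a substitute for real natural language processing libraries, but it hints at the first step of text mining: turning free-form, unstructured text into something that can be counted and compared. The sample posts are invented for the example.

    import re
    from collections import Counter

    # A few made-up social media posts: free-form text with no schema.
    posts = [
        "Loving the new analytics dashboard, so fast!",
        "The dashboard keeps crashing, not happy.",
        "Analytics reports look great, the dashboard is useful.",
    ]

    # Tokenize each post into lowercase words. Because unstructured data has
    # no predefined structure, even this basic step imposes structure on it.
    tokens = [word
              for post in posts
              for word in re.findall(r"[a-z']+", post.lower())]

    # Word frequencies: a very rough stand-in for text mining.
    print(Counter(tokens).most_common(5))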

    Semi-structured data

    As the name suggests, semi-structured data is a combination of structured and unstructured data. It refers to data that is not organized into a specific database but has crucial tags that identify various components of it. A relational Database Management System (DBMS) table definition is a prime example of semi-structured data. Semi-structured data in big data refers to data that has some level of structure but does not conform to a rigid, predefined schema like structured data. It lies between structured and unstructured data, combining elements of both. Semi-structured data possesses some organizational patterns or tags that provide a basic structure, but it allows flexibility in terms of data fields and formats. It is commonly encountered in various domains, including web data, log files, JSON documents, XML files, and NoSQL databases.

    Here are the key characteristics of semi-structured data in big data:

    Flexible schema: Semi-structured data does not require a predefined, fixed schema like structured data. It allows for variations in data fields and formats, enabling greater flexibility when capturing and storing data. Each record or document can have different attributes or elements, and new attributes can be added over time without disrupting the existing data structure.

    Tags or markers: Semi-structured data often includes tags, markers, or metadata that provide some level of organization or structure. These tags provide hints about the data elements and their relationships but do not enforce a strict schema. Examples include XML tags, JSON key-value pairs, or attributes in NoSQL databases.

    Hierarchical structure: Semi-structured data can exhibit a hierarchical structure, where data elements are organized in a nested or tree-like fashion. This structure enables capturing complex relationships between data elements and supports efficient querying and navigation through the data.

    Limited data integrity: In contrast to structured data, semi-structured data does not enforce strict data integrity constraints. It may contain inconsistencies or incomplete information. Data quality control and validation mechanisms need to be applied during data processing to ensure accuracy and reliability.

    Diverse formats: Semi-structured data can be represented in various formats, including XML, JSON, YAML, HTML, or key-value pairs. These formats provide flexibility in representing complex data structures and enable interoperability between different systems and platforms.

    Processing challenges: Analyzing and processing semi-structured data requires specialized tools and techniques. Techniques such as XML parsing, JSON parsing, XPath querying, or schema-on-read approaches are commonly employed to extract information, navigate through the data hierarchy, and handle the flexible structure of semi-structured data.

    Semi-structured data presents unique challenges and opportunities in big data analytics. It allows for the storage and analysis of diverse and dynamic data types while providing some organizational structure. Leveraging semi-structured data requires data integration techniques, schema discovery, and flexible data processing approaches that can adapt to the evolving nature of the data.
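
    As a minimal sketch of the flexible schema and schema-on-read ideas, the Python example below parses two JSON records whose fields differ and supplies defaults for missing attributes at read time. The records and field names are invented for the example.

    import json

    # Two semi-structured records: both valid JSON, but with different fields.
    raw_records = [
        '{"id": 1, "name": "Asha", "tags": ["vip", "newsletter"]}',
        '{"id": 2, "name": "Ravi", "email": "ravi@example.com"}',
    ]

    # Schema-on-read: the structure is interpreted when the data is read,
    # so missing attributes are handled with defaults rather than rejected.
    for raw in raw_records:
        record = json.loads(raw)
        email = record.get("email", "unknown")
        tags = record.get("tags", [])
        print(record["id"], record["name"], email, tags)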

    Evolution of big data

    The history of big data can be traced back to the initial days of computing and the evolution of data storage and processing technologies. Here are some key milestones in the history of big data:

    Early data processing (1950s-1970s): In the early days of computing, data processing was limited to structured data stored in databases. Mainframe computers were used to process huge volumes of data, primarily in batch mode.

    Relational databases (1970s-1980s): The invention of relational databases introduced a structured and organized approach to data storage and management. Structured Query Language (SQL) was developed as a standard language for interacting with relational databases.

    Data warehousing (1980s-1990s): The idea of data warehousing emerged, focusing on collecting and storing large volumes of data from multiple sources for analysis and reporting. Data warehouses allow organizations to consolidate data and perform complex queries.

    Internet and Web (1990s): The beginning of the Internet and the World Wide Web led to an explosion of digital data. Websites, online transactions, and digital content generated vast amounts of data, including text, images, videos, and user interactions.

    Emergence of Hadoop (2000s): In 2004, Google introduced the Google File System (GFS) and MapReduce, which inspired the development of Apache Hadoop, an open-source framework for the distributed storage and processing of big data. Hadoop allows organizations to store and process large volumes of data across clusters of commodity hardware.

    NoSQL and new database technologies (2000s-2010s): As the volume and variety of data grew, new database technologies emerged to handle unstructured and semi-structured data. NoSQL databases, such as MongoDB and Cassandra, offered scalable and flexible solutions for managing big data.

    Cloud computing (2000s-2010s): Platforms for cloud computing like Amazon Web Services (AWS) and Microsoft Azure offered scalable and affordable infrastructure for processing and storing big data. Cloud services could
