Scalable Big Data Architecture: A practitioners guide to choosing relevant Big Data architecture
()
About this ebook
This book highlights the different types of data architecture and illustrates the many possibilities hidden behind the term "Big Data", from the usage of No-SQL databases to the deployment of stream analytics architecture, machine learning, and governance.
Scalable Big Data Architecture covers real-world, concrete industry use cases that leverage complex distributed applications , which involve web applications, RESTful API, and high throughput of large amount of data stored in highly scalable No-SQL data stores such as Couchbase and Elasticsearch. This book demonstrates how data processing can be done at scale from the usage of NoSQL datastores to the combination of Big Data distribution.
When the data processing is too complex and involves different processing topology like long running jobs, stream processing, multiple data sources correlation, and machine learning, it’s often necessary to delegate the load to Hadoop or Spark and use the No-SQLto serve processed data in real time.This book shows you how to choose a relevant combination of big data technologies available within the Hadoop ecosystem. It focuses on processing long jobs, architecture, stream data patterns, log analysis, and real time analytics. Every pattern is illustrated with practical examples, which use the different open sourceprojects such as Logstash, Spark, Kafka, and so on.
Traditional data infrastructures are built for digesting and rendering data synthesis and analytics from large amount of data. This book helps you to understand why you should consider using machine learning algorithms early on in the project, before being overwhelmed by constraints imposed by dealing with the high throughput of Big data.Scalable Big Data Architecture is for developers, data architects, and data scientists looking for a better understanding of how to choose the most relevant pattern for a Big Data project and which tools tointegrate into that pattern.
Related to Scalable Big Data Architecture
Related ebooks
Understanding Azure Data Factory: Operationalizing Big Data and Advanced Analytics Solutions Rating: 0 out of 5 stars0 ratingsGetting Started with Greenplum for Big Data Analytics Rating: 0 out of 5 stars0 ratingsPractical Enterprise Data Lake Insights: Handle Data-Driven Challenges in an Enterprise Big Data Lake Rating: 0 out of 5 stars0 ratingsHDInsight Essentials - Second Edition Rating: 0 out of 5 stars0 ratingsUltimate Data Engineering with Databricks Rating: 0 out of 5 stars0 ratingsLearning D3.js Mapping Rating: 0 out of 5 stars0 ratingsMonetizing Machine Learning: Quickly Turn Python ML Ideas into Web Applications on the Serverless Cloud Rating: 0 out of 5 stars0 ratingsMicroservices for the Enterprise: Designing, Developing, and Deploying Rating: 0 out of 5 stars0 ratingsBayesian Optimization and Data Science Rating: 0 out of 5 stars0 ratingsBeginning Application Lifecycle Management Rating: 0 out of 5 stars0 ratingsDeploying AI in the Enterprise: IT Approaches for Design, DevOps, Governance, Change Management, Blockchain, and Quantum Computing Rating: 0 out of 5 stars0 ratingsMongoDB Recipes: With Data Modeling and Query Building Strategies Rating: 0 out of 5 stars0 ratingsThe Chief Data Officer Management Handbook: Set Up and Run an Organization’s Data Supply Chain Rating: 0 out of 5 stars0 ratingsMySQL 8 Query Performance Tuning: A Systematic Method for Improving Execution Speeds Rating: 0 out of 5 stars0 ratingsPractical API Architecture and Development with Azure and AWS: Design and Implementation of APIs for the Cloud Rating: 0 out of 5 stars0 ratingsPro PowerShell for Amazon Web Services: DevOps for the AWS Cloud Rating: 0 out of 5 stars0 ratingsBuilding REST APIs with Flask: Create Python Web Services with MySQL Rating: 0 out of 5 stars0 ratingsPentaho 3.2 Data Integration Beginner's Guide Rating: 0 out of 5 stars0 ratingsBuilding JavaScript Games: for Phones, Tablets, and Desktop Rating: 0 out of 5 stars0 ratingsEnterprise Architecture at Work: Modelling, Communication and Analysis Rating: 2 out of 5 stars2/5Beginning MLOps with MLFlow: Deploy Models in AWS SageMaker, Google Cloud, and Microsoft Azure Rating: 0 out of 5 stars0 ratingsPractical Data Science with Python 3: Synthesizing Actionable Insights from Data Rating: 0 out of 5 stars0 ratingsData Engineering A Complete Guide - 2020 Edition Rating: 0 out of 5 stars0 ratingsBig Data Engineer A Complete Guide - 2021 Edition Rating: 0 out of 5 stars0 ratingsBeginning Azure IoT Edge Computing: Extending the Cloud to the Intelligent Edge Rating: 0 out of 5 stars0 ratingsLearning Azure DocumentDB Rating: 0 out of 5 stars0 ratingsSoftware Mistakes and Tradeoffs: How to make good programming decisions Rating: 0 out of 5 stars0 ratingsCloud-Based Microservices: Techniques, Challenges, and Solutions Rating: 0 out of 5 stars0 ratings
Databases For You
Oracle DBA Mentor: Succeeding as an Oracle Database Administrator Rating: 0 out of 5 stars0 ratingsBlockchain Basics: A Non-Technical Introduction in 25 Steps Rating: 5 out of 5 stars5/5SQL QuickStart Guide: The Simplified Beginner's Guide to Managing, Analyzing, and Manipulating Data With SQL Rating: 4 out of 5 stars4/5Beginning Microsoft Power BI: A Practical Guide to Self-Service Data Analytics Rating: 0 out of 5 stars0 ratingsAccess 2019 For Dummies Rating: 0 out of 5 stars0 ratingsGrokking Algorithms: An illustrated guide for programmers and other curious people Rating: 4 out of 5 stars4/5COMPUTER SCIENCE FOR ROOKIES Rating: 0 out of 5 stars0 ratingsPractical Data Analysis Rating: 4 out of 5 stars4/5CompTIA DataSys+ Study Guide: Exam DS0-001 Rating: 0 out of 5 stars0 ratingsBehind Every Good Decision: How Anyone Can Use Business Analytics to Turn Data into Profitable Insight Rating: 5 out of 5 stars5/5Python Projects for Everyone Rating: 0 out of 5 stars0 ratingsLearn SQL in 24 Hours Rating: 5 out of 5 stars5/5Access 2016 For Dummies Rating: 0 out of 5 stars0 ratingsGo in Action Rating: 5 out of 5 stars5/5The Analytic Detective: Decipher Your Company’s Data Clues and Become Irreplaceable Rating: 0 out of 5 stars0 ratingsAccess for Beginners: Access Essentials, #1 Rating: 0 out of 5 stars0 ratingsLearn SQL Server Administration in a Month of Lunches Rating: 3 out of 5 stars3/5Learning Oracle 12c: A PL/SQL Approach Rating: 0 out of 5 stars0 ratingsAccess 2010 All-in-One For Dummies Rating: 4 out of 5 stars4/5Learn Git in a Month of Lunches Rating: 0 out of 5 stars0 ratingsAzure SQL Revealed: A Guide to the Cloud for SQL Server Professionals Rating: 0 out of 5 stars0 ratingsA Concise Guide to Object Orientated Programming Rating: 0 out of 5 stars0 ratingsGetting Started with SQL Server 2014 Administration Rating: 0 out of 5 stars0 ratingsPython and SQLite Development Rating: 0 out of 5 stars0 ratingsPractical SQL Rating: 4 out of 5 stars4/5SQL in 30 Pages Rating: 4 out of 5 stars4/5Learning PostgreSQL Rating: 1 out of 5 stars1/5
Reviews for Scalable Big Data Architecture
0 ratings0 reviews
Book preview
Scalable Big Data Architecture - Bahaaldine Azarmi
© Bahaaldine Azarmi 2016
Bahaaldine AzarmiScalable Big Data Architecture10.1007/978-1-4842-1326-1_1
1. The Big (Data) Problem
Bahaaldine Azarmi¹
(1)
Saint Cloud, France
Data management is getting more complex than it has ever been before. Big Data is everywhere, on everyone’s mind, and in many different forms: advertising, social graphs, news feeds, recommendations, marketing, healthcare, security, government, and so on.
In the last three years, thousands of technologies having to do with Big Data acquisition, management, and analytics have emerged; this has given IT teams the hard task of choosing, without having a comprehensive methodology to handle the choice most of the time.
When making such a choice for your own situation, ask yourself the following questions: When should I think about employing Big Data for my IT system? Am I ready to employ it? What should I start with? Should I really go for it despite feeling that Big Data is just a marketing trend?
All these questions are running around in the minds of most Chief Information Officers (CIOs) and Chief Technology Officers (CTOs), and they globally cover the reasons and the ways you are putting your business at stake when you decide to deploy a distributed Big Data architecture.
This chapter aims to help you identity Big Data symptoms—in other words when it becomes apparent that you need to consider adding Big Data to your architecture—but it also guides you through the variety of Big Data technologies to differentiate among them so that you can understand what they are specialized for. Finally, at the end of the chapter, we build the foundation of a typical distributed Big Data architecture based on real life examples.
Identifying Big Data Symptoms
You may choose to start a Big Data project based on different needs: because of the volume of data you handle, because of the variety of data structures your system has, because of scalability issues you are experiencing, or because you want to reduce the cost of data processing. In this section, you’ll see what symptoms can make a team realize they need to start a Big Data project.
Size Matters
The two main areas that get people to start thinking about Big Data are when they start having issues related to data size and volume; although most of the time these issues present true and legitimate reasons to think about Big Data, today, they are not the only reasons to go this route.
There are others symptoms that you should also consider—type of data, for example. How will you manage to increase various types of data when traditional data stores, such as SQL databases, expect you to do the structuring, like creating tables?
This is not feasible without adding a flexible, schemaless technology that handles new data structures as they come. When I talk about types of data, you should imagine unstructured data, graph data, images, videos, voices, and so on.
Yes, it’s good to store unstructured data, but it’s better if you can get something out of it. Another symptom comes out of this premise: Big Data is also about extracting added value information from a high-volume variety of data. When, a couple of years ago, there were more read transactions than write transactions, common caches or databases were enough when paired with weekly ETL (extract, transform, load) processing jobs. Today that’s not the trend any more. Now, you need an architecture that is capable of handling data as it comes through long processing to near real-time processing jobs. The architecture should be distributed and not rely on the rigid high-performance and expensive mainframe; instead, it should be based on a more available, performance driven, and cheaper technology to give it more flexibility.
Now, how do you leverage all this added value data and how are you able to search for it naturally? To answer this question, think again about the traditional data store in which you create indexes on different columns to speed up the search query. Well, what if you want to index all hundred columns because you want to be able to execute complex queries that involve a nondeterministic number of key columns? You don’t want to do this with a basic SQL database; instead, you would rather consider using a NoSQL store for this specific need.
So simply walking down the path of data acquisition, data structuring, data processing, and data visualization in the context of the actual data management trends makes it easy to conclude that size is no longer the main concern.
Typical Business Use Cases
In addition to technical and architecture considerations, you may be facing use cases that are typical Big Data use cases. Some of them are tied to a specific industry; others are not specialized and can be applied to various industries.
These considerations are generally based on analyzing application’s logs, such as web access logs, application server logs, and database logs, but they can also be based on other types of data sources such as social network data.
When you are facing such use cases, you might want to consider a distributed Big Data architecture if you want to be able to scale out as your business grows.
Consumer Behavioral Analytics
Knowing your customer, or what we usually call the 360-degree customer view
might be the most popular Big Data use case. This customer view is usually used on e-commerce websites and starts with an unstructured clickstream—in other words, it is made up of the active and passive website navigation actions that a visitor performs. By counting and analyzing the clicks and impressions on ads or products, you can adapt the visitor’s user experience depending on their behavior, while keeping in mind that the goal is to gain insight in order to optimize the funnel conversion.
Sentiment Analysis
Companies care about how their image and reputation is perceived across social networks; they want to minimize all negative events that might affect their notoriety and leverage positive events. By crawling a large amount of social data in a near-real-time way, they can extract the feelings and sentiments of social communities regarding their brand, and they can identify influential users and contact them in order to change or empower a trend depending on the outcome of their interaction with such users.
CRM Onboarding
You can combine consumer behavioral analytics with sentiment analysis based on data surrounding the visitor’s social activities. Companies want to combine these online data sources with the existing offline data, which is called CRM (customer relationship management) onboarding, in order to get better and more accurate customer segmentation. Thus, companies can leverage this segmentation and build a better targeting system to send profile-customized offers through marketing actions.
Prediction
Learning from data has become the main Big Data trend for the past two years. Prediction-enabled Big Data can be very efficient in multiple industries, such as in the telecommunication industry, where prediction router log analysis is democratized. Every time an issue is likely to occur on a device, the company can predict it and order part to avoid downtime or lost profits.
When combined with the previous use cases, you can use predictive architecture to optimize the product catalog selection and pricing depending on the user’s global behavior.
Understanding the Big Data Project’s Ecosystem
Once you understand that you actually have a Big Data project to implement, the hardest thing is choosing the technologies to use in your architecture. It is not just about picking the most famous Hadoop-related technologies, it’s also about understanding how to classify them in order to build a consistent distributed architecture.
To get an idea of the number of projects in the Big Data galaxy, browse to https://github.com/zenkay/bigdata-ecosystem#projects-1 to see more than 100 classified projects.
Here, you see that you might consider choosing a Hadoop distribution, a distributed file system, a SQL-like processing language, a machine learning language, a scheduler, message-oriented middleware, a NoSQL datastore, data visualization, and so on.
Since this book’s purpose is to describe a scalable way to build a distributed architecture, I don’t dive into all categories of projects; instead, I highlight the ones you are likely to use in a typical Big Data project. You can eventually adapt this architecture and integrate projects depending on your needs. You’ll see concrete examples of using such projects in the dedicated parts.
To make the Hadoop technology presented more relevant, we will work on a distributed architecture that meets the previously described typical use cases, namely these:
Consumer behavioral analytics
Sentiment analysis
CRM onboarding and prediction
Hadoop Distribution
In a Big Data project that involves Hadoop-related ecosystem technologies, you have two choices:
Download the project you need separately and try to create or assemble the technologies in a coherent, resilient, and consistent architecture.
Use one of the most popular Hadoop distributions, which assemble or create the technologies for you.
Although the first option is completely feasible, you might want to choose the second one, because a packaged Hadoop distribution ensures capability between all installed components, ease of installation, configuration-based deployment, monitoring, and support.
Hortonworks and Cloudera are the main actors in this field. There are a couple of differences between the two vendors, but for starting a Big Data package, they are equivalent, as long as you don’t pay attention to the proprietary add-ons.
My goal here is not to present all the components within each distribution but to focus on what each vendor adds to the standard ecosystem. I describe most of the other components in the following pages depending on what we need for our architecture in each situation.
Cloudera CDH
Cloudera adds a set of in-house components to the Hadoop-based components; these components are designed to give you better cluster management and search experiences.
The following is a list of some of these components:
Impala: A real-time, parallelized, SQL-based engine that searches for data in HDFS (Hadoop Distributed File System) and Base. Impala is considered to be the fastest querying engine within the Hadoop distribution vendors market, and it is a direct competitor of Spark from UC Berkeley.
Cloudera Manager: This is Cloudera’s console to manage and deploy Hadoop components within your Hadoop cluster.
Hue: A console that lets the user interact with the data and run scripts for the different Hadoop components