Scalable Big Data Architecture: A practitioner's guide to choosing relevant Big Data architecture

Ebook, 253 pages, 1 hour

By Bahaaldine Azarmi

Rating: 0 out of 5 stars

About this ebook

This book highlights the different types of data architecture and illustrates the many possibilities hidden behind the term "Big Data", from the use of NoSQL databases to the deployment of stream analytics architectures, machine learning, and governance.

Scalable Big Data Architecture covers real-world, concrete industry use cases that leverage complex distributed applications, which involve web applications, RESTful APIs, and high-throughput processing of large amounts of data stored in highly scalable NoSQL data stores such as Couchbase and Elasticsearch. This book demonstrates how data processing can be done at scale, from the use of NoSQL datastores to the combination of Big Data distributions.

When the data processing is too complex and involves different processing topologies, such as long-running jobs, stream processing, correlation of multiple data sources, and machine learning, it’s often necessary to delegate the load to Hadoop or Spark and use the NoSQL datastore to serve processed data in real time.

This book shows you how to choose a relevant combination of big data technologies available within the Hadoop ecosystem. It focuses on processing long jobs, architecture, stream data patterns, log analysis, and real-time analytics. Every pattern is illustrated with practical examples that use open source projects such as Logstash, Spark, and Kafka.

Traditional data infrastructures are built for digesting and rendering data synthesis and analytics from large amounts of data. This book helps you understand why you should consider using machine learning algorithms early in the project, before being overwhelmed by the constraints imposed by the high throughput of Big Data.

Scalable Big Data Architecture is for developers, data architects, and data scientists looking for a better understanding of how to choose the most relevant pattern for a Big Data project and which tools to integrate into that pattern.

Language: English

Publisher: Apress

Release date: Dec 31, 2015

ISBN: 9781484213261

Author

Bahaaldine Azarmi

Skip carousel

Bahaaldine Azarmi, Scalable Big Data Architecture, 10.1007/978-1-4842-1326-1_1

1. The Big (Data) Problem

Bahaaldine Azarmi, Saint Cloud, France

Data management is getting more complex than it has ever been before. Big Data is everywhere, on everyone’s mind, and in many different forms: advertising, social graphs, news feeds, recommendations, marketing, healthcare, security, government, and so on.

In the last three years, thousands of technologies dealing with Big Data acquisition, management, and analytics have emerged. This has left IT teams with the hard task of choosing among them, most of the time without a comprehensive methodology to guide the choice.

When making such a choice for your own situation, ask yourself the following questions: When should I think about employing Big Data for my IT system? Am I ready to employ it? What should I start with? Should I really go for it despite feeling that Big Data is just a marketing trend?

All these questions are running around in the minds of most Chief Information Officers (CIOs) and Chief Technology Officers (CTOs). Together, they cover both the reasons for deploying a distributed Big Data architecture and the ways in which doing so puts your business at stake.

This chapter aims to help you identify Big Data symptoms, in other words, the signs that you need to consider adding Big Data to your architecture. It also guides you through the variety of Big Data technologies and differentiates among them so that you can understand what each is specialized for. Finally, at the end of the chapter, we build the foundation of a typical distributed Big Data architecture based on real-life examples.

Identifying Big Data Symptoms

You may choose to start a Big Data project based on different needs: because of the volume of data you handle, because of the variety of data structures your system has, because of scalability issues you are experiencing, or because you want to reduce the cost of data processing. In this section, you’ll see what symptoms can make a team realize they need to start a Big Data project.

Size Matters

Issues related to data size and volume are the two main reasons people start thinking about Big Data. Although most of the time these issues are true and legitimate reasons to consider Big Data, today they are not the only reasons to go this route.

There are other symptoms that you should also consider, the type of data for example. How will you manage a growing variety of data types when traditional data stores, such as SQL databases, expect you to do the structuring up front, like creating tables?

This is not feasible without adding a flexible, schemaless technology that handles new data structures as they come. When I talk about types of data, you should imagine unstructured data, graph data, images, videos, voices, and so on.

Yes, it’s good to store unstructured data, but it’s better if you can get something out of it. Another symptom comes out of this premise: Big Data is also about extracting added-value information from a high volume and wide variety of data. A couple of years ago, when there were more read transactions than write transactions, common caches or databases paired with weekly ETL (extract, transform, load) processing jobs were enough. That is no longer the trend. Now you need an architecture that can handle data as it arrives, through jobs that range from long-running processing to near-real-time processing. The architecture should be distributed and should not rely on a rigid, high-performance, and expensive mainframe; instead, it should be based on more available, performance-driven, and cheaper technology to give it more flexibility.

Now, how do you leverage all this added-value data, and how can you search it naturally? To answer this question, think again about the traditional data store, in which you create indexes on different columns to speed up search queries. What if you want to index all hundred columns because you want to execute complex queries that involve a number of key columns you cannot know in advance? You don’t want to do this with a basic SQL database; instead, you would rather consider using a NoSQL store for this specific need.
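To make the "index every column" idea concrete, here is a minimal sketch of an inverted index in plain Python, in the spirit of the indexes that search-oriented stores rely on. It is a toy illustration, not the implementation of any particular product; the function names and the sample documents are assumptions made up for this example.

```python
from collections import defaultdict


def build_index(docs):
    """Map every (field, value) pair to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for field, value in doc.items():
            index[(field, value)].add(doc_id)
    return index


def search(index, **criteria):
    """Return the ids of documents that match all field=value criteria."""
    postings = [index.get((field, value), set()) for field, value in criteria.items()]
    return set.intersection(*postings) if postings else set()


# Hypothetical documents: every field is indexed without declaring a schema first.
docs = {
    1: {"country": "FR", "device": "mobile", "plan": "free"},
    2: {"country": "FR", "device": "desktop", "plan": "paid"},
    3: {"country": "US", "device": "mobile", "plan": "paid"},
}
index = build_index(docs)
french_paying = search(index, country="FR", plan="paid")
```

Because every field is indexed the same way, a query can combine any subset of fields without a dedicated index having been planned for that combination.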

So simply walking down the path of data acquisition, data structuring, data processing, and data visualization in the context of current data management trends makes it easy to conclude that size is no longer the main concern.

Typical Business Use Cases

In addition to technical and architecture considerations, you may be facing use cases that are typical Big Data use cases. Some of them are tied to a specific industry; others are not specialized and can be applied to various industries.

These use cases are generally based on analyzing application logs, such as web access logs, application server logs, and database logs, but they can also be based on other types of data sources such as social network data.
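As a rough illustration of what analyzing web access logs starts with, the following sketch parses lines in the widely used Common Log Format and counts responses per HTTP status. The sample lines are invented for this example, and the regular expression is a simplified assumption that ignores many real-world variations.

```python
import re
from collections import Counter

# Simplified pattern for Common Log Format lines (an assumption, not a full parser).
LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)'
)


def parse(line):
    """Return the fields of one access-log line as a dict, or None if it does not match."""
    match = LINE.match(line)
    return match.groupdict() if match else None


# Hypothetical sample lines, including one malformed entry.
sample = [
    '10.0.0.1 - - [10/Oct/2015:13:55:36 +0000] "GET /products/42 HTTP/1.1" 200 2326',
    '10.0.0.2 - - [10/Oct/2015:13:55:37 +0000] "GET /cart HTTP/1.1" 500 512',
    'garbage line',
]
events = [event for event in map(parse, sample) if event is not None]
status_counts = Counter(event["status"] for event in events)
```

In a distributed architecture the same parsing step would typically run inside a log shipper or a stream processor rather than in a single script.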

When you are facing such use cases, you might want to consider a distributed Big Data architecture if you want to be able to scale out as your business grows.

Consumer Behavioral Analytics

Knowing your customer, or what we usually call the 360-degree customer view, might be the most popular Big Data use case. This customer view is usually used on e-commerce websites and starts with an unstructured clickstream, in other words, the active and passive website navigation actions that a visitor performs. By counting and analyzing the clicks and impressions on ads or products, you can adapt the visitor’s user experience to their behavior, keeping in mind that the goal is to gain insight in order to optimize the conversion funnel.
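Counting clicks and impressions can be sketched in a few lines. The example below computes a click-through rate per item from a small in-memory clickstream; the event layout (`item` and `action` keys) and the sample data are assumptions for illustration only.

```python
from collections import Counter


def click_through_rates(events):
    """Return clicks divided by impressions for every item that was displayed."""
    impressions, clicks = Counter(), Counter()
    for event in events:
        if event["action"] == "impression":
            impressions[event["item"]] += 1
        elif event["action"] == "click":
            clicks[event["item"]] += 1
    return {item: clicks[item] / shown for item, shown in impressions.items()}


# Hypothetical clickstream: "shoes" is shown four times and clicked once,
# "hat" is shown twice and clicked once.
clickstream = (
    [{"item": "shoes", "action": "impression"}] * 4
    + [{"item": "shoes", "action": "click"}]
    + [{"item": "hat", "action": "impression"}] * 2
    + [{"item": "hat", "action": "click"}]
)
rates = click_through_rates(clickstream)
```

At scale, the same aggregation would run over a stream of events rather than a list, but the per-item counters stay conceptually the same.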

Sentiment Analysis

Companies care about how their image and reputation are perceived across social networks; they want to minimize the negative events that might affect their reputation and to leverage the positive ones. By crawling a large amount of social data in near real time, they can extract the feelings and sentiments of social communities regarding their brand. They can also identify influential users and contact them in order to change or reinforce a trend, depending on the outcome of their interaction with such users.

CRM Onboarding

You can combine consumer behavioral analytics with sentiment analysis based on data surrounding the visitor’s social activities. Companies want to combine these online data sources with the existing offline data, which is called CRM (customer relationship management) onboarding, in order to get better and more accurate customer segmentation. Thus, companies can leverage this segmentation and build a better targeting system to send profile-customized offers through marketing actions.
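The onboarding step amounts to joining offline CRM records with online profiles on a shared key and then segmenting the merged view. The sketch below assumes a normalized, hashed email address as that key; the field names, the thresholds, and the segment labels are all hypothetical.

```python
import hashlib


def match_key(email):
    """Normalize and hash an email so that both sources can be joined on it."""
    return hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()


def onboard(crm_records, online_profiles):
    """Enrich each offline CRM record with the matching online activity, if any."""
    online_by_key = {match_key(p["email"]): p for p in online_profiles}
    merged = []
    for record in crm_records:
        online = online_by_key.get(match_key(record["email"]), {})
        merged.append({
            "email": record["email"],
            "store_purchases": record["store_purchases"],
            "clicks": online.get("clicks", 0),
            "sentiment": online.get("sentiment", 0.0),
        })
    return merged


def segment(customer):
    """Assign a coarse segment using illustrative thresholds (assumptions)."""
    if customer["store_purchases"] >= 5 and customer["clicks"] >= 10:
        return "loyal-omnichannel"
    if customer["clicks"] >= 10:
        return "online-browser"
    return "offline-only" if customer["store_purchases"] > 0 else "inactive"


# Hypothetical offline and online data sets.
crm = [
    {"email": "Ann@example.com", "store_purchases": 7},
    {"email": "bob@example.com", "store_purchases": 2},
]
online = [{"email": " ann@example.com", "clicks": 25, "sentiment": 0.6}]
customers = onboard(crm, online)
segments = {c["email"]: segment(c) for c in customers}
```

Real onboarding pipelines rely on far more robust identity matching; the point here is only the shape of the join and the segmentation that follows it.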

Prediction

Learning from data has been the main Big Data trend for the past two years. Prediction-enabled Big Data can be very efficient in multiple industries, such as the telecommunication industry, where predictive analysis of router logs has become widespread. Every time an issue is likely to occur on a device, the company can predict it and order parts to avoid downtime or lost profits.

When you combine prediction with the previous use cases, you can use a predictive architecture to optimize product catalog selection and pricing depending on the user’s overall behavior.
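As a deliberately simple stand-in for a real predictive model, the sketch below flags a router as at risk when the average number of errors in its most recent log windows crosses a threshold. The window size, the threshold, and the sample counts are assumptions; a production system would train a model on historical failures instead of using a fixed rule.

```python
def at_risk(error_counts, window=3, threshold=5.0):
    """Flag a device when the mean error count over the last `window` periods reaches `threshold`."""
    if len(error_counts) < window:
        return False  # not enough history to decide
    recent = error_counts[-window:]
    return sum(recent) / window >= threshold


# Hypothetical hourly error counts extracted from router logs.
healthy_router = [0, 1, 0, 2, 1]
degrading_router = [1, 2, 6, 7, 9]
flags = {
    "router-a": at_risk(healthy_router),
    "router-b": at_risk(degrading_router),
}
```

A rule like this is easy to reason about and gives a baseline against which a learned model can later be compared.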

Understanding the Big Data Project’s Ecosystem

Once you understand that you actually have a Big Data project to implement, the hardest thing is choosing the technologies to use in your architecture. It is not just about picking the most famous Hadoop-related technologies; it’s also about understanding how to classify them in order to build a consistent distributed architecture.

To get an idea of the number of projects in the Big Data galaxy, browse to https://github.com/zenkay/bigdata-ecosystem#projects-1 to see more than 100 classified projects.

Here, you see that you might consider choosing a Hadoop distribution, a distributed file system, a SQL-like processing language, a machine learning language, a scheduler, message-oriented middleware, a NoSQL datastore, data visualization, and so on.

Since this book’s purpose is to describe a scalable way to build a distributed architecture, I don’t dive into all categories of projects; instead, I highlight the ones you are likely to use in a typical Big Data project. You can eventually adapt this architecture and integrate projects depending on your needs. You’ll see concrete examples of using such projects in the dedicated parts.

To make the Hadoop technology presented more relevant, we will work on a distributed architecture that meets the previously described typical use cases, namely these:

Consumer behavioral analytics

Sentiment analysis

CRM onboarding and prediction

Hadoop Distribution

In a Big Data project that involves Hadoop-related ecosystem technologies, you have two choices:

Download the project you need separately and try to create or assemble the technologies in a coherent, resilient, and consistent architecture.

Use one of the most popular Hadoop distributions, which assemble or create the technologies for you.

Although the first option is completely feasible, you might want to choose the second one, because a packaged Hadoop distribution ensures compatibility between all installed components, ease of installation, configuration-based deployment, monitoring, and support.

Hortonworks and Cloudera are the main actors in this field. There are a couple of differences between the two vendors, but for getting started with a Big Data project, they are equivalent, as long as you don’t pay attention to the proprietary add-ons.

My goal here is not to present all the components within each distribution but to focus on what each vendor adds to the standard ecosystem. I describe most of the other components in the following pages depending on what we need for our architecture in each situation.

Cloudera CDH

Cloudera adds a set of in-house components to the Hadoop-based components; these components are designed to give you better cluster management and search experiences.

The following is a list of some of these components:

Impala: A real-time, parallelized, SQL-based engine that searches for data in HDFS (Hadoop Distributed File System) and HBase. Impala is considered to be the fastest querying engine within the Hadoop distribution vendors market, and it is a direct competitor of Spark from UC Berkeley.

Cloudera Manager: This is Cloudera’s console to manage and deploy Hadoop components within your Hadoop cluster.

Hue: A console that lets the user interact with the data and run scripts for the different Hadoop components.
