978-81-322-2494-5-1-80 Parte 1
Hrushikesha Mohanty
Prachet Bhuyan
Deepak Chenthati Editors
Big Data
A Primer
Studies in Big Data
Volume 11
Series editor
Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
e-mail: kacprzyk@ibspan.waw.pl
About this Series
The series “Studies in Big Data” (SBD) publishes new developments and advances
in the various areas of Big Data, quickly and with high quality. The intent is to
cover the theory, research, development, and applications of Big Data, as embedded
in the fields of engineering, computer science, physics, economics and the life sciences.
The books of the series address the analysis and understanding of large, complex,
and/or distributed data sets generated from recent digital sources, from
sensors or other physical instruments as well as simulations, crowd-sourcing, social
networks or other internet transactions such as emails or video click streams.
The series contains monographs, lecture notes and edited volumes in Big
Data, spanning the areas of computational intelligence, including neural networks,
evolutionary computation, soft computing, fuzzy systems, artificial
intelligence, data mining, modern statistics, operations research and self-organizing
systems. Of particular value to both the contributors and the readership
are the short publication timeframe and the worldwide distribution, which enable
wide and rapid dissemination of research output.
Editors
Hrushikesha Mohanty
School of Computer and Information Sciences
University of Hyderabad
Hyderabad, India

Deepak Chenthati
Teradata India Private Limited
Hyderabad, India
Prachet Bhuyan
School of Computer Engineering
KIIT University
Bhubaneshwar, Odisha
India
Preface
Collection and processing of big data are topics that have drawn considerable
attention from a variety of people, ranging from researchers to business decision
makers. Developments in infrastructure such as grid and cloud technology have
given a great impetus to big data services. Research in this area is focusing on big
data as a service and infrastructure as a service. The former looks at developing
algorithms for fast data access, processing as well as inferring pieces of information
that remain hidden. To make all this happen, internet-based infrastructure must
provide the backbone structures. It also needs an adaptable architecture that can be
dynamically configured so that fast processing is possible by making use of optimal
computing as well as storage resources. Thus, investigations on big data encompass
many areas of research, including parallel and distributed computing, database
management, software engineering, optimization and artificial intelligence. The
rapid spread of the internet, several governments' decisions to build smart
cities and entrepreneurs' eagerness have invigorated the investigation of big data
with intensity and speed. The efforts made in this book are directed towards the
same purpose.
The goal of this book is to highlight the issues related to research and development
in big data. For this purpose, the chapter authors are drawn from academia as well
as industry. Some of the authors are actively engaged in the development of
products and customized big data applications. A comprehensive view on six key
issues is presented in this book. These issues are big data management, algorithms
for distributed processing and mining patterns, management of security and privacy
of big data, SLA for big data service and, finally, big data analytics encompassing
several useful domains of applications. However, the issues included here are not
completely exhaustive, but the coverage is enough to unfold the research and
development promise the area holds for the future. To this end, the
Introduction provides a survey with several important references, which interested
readers are encouraged to follow.
Intended Audience
This book promises to provide insights to readers having varied interest in big data.
It covers an appreciable spread of issues related to big data, and every chapter
aims to motivate readers to discover the specialities and challenges that lie within.
Of course, we do not claim that each chapter treats its issue exhaustively. But, we
sincerely hope that both conversant and novice readers will find this book equally
interesting.
In addition to introducing the concepts involved, the authors have made attempts to
provide a lead to realization of these concepts. With this aim, they have presented
algorithms, frameworks and illustrations that provide enough hints towards system
realization. To emphasize growing trends in big data applications, the book includes
a chapter which discusses such systems available in the public domain. Thus, we
hope this book is useful for undergraduate students and professionals looking for an
introduction to big data. For graduate students intending to take up research in this
upcoming area, the chapters with advanced information will also be useful.
This book has seven chapters. Chapter “Big Data: An Introduction” provides a
broad review of the issues related to big data. Readers new to this area are
encouraged to read this chapter first before reading other chapters. However, each
chapter is independent and self-complete with respect to the theme it addresses.
Chapter “Big Data Architecture” lays out a universal data architecture for rea-
soning with all forms of data. Fundamental to big data analysis is big data man-
agement. The ability to collect, store and make available for analysis the data in
their native forms is a key enabler for the science of analysing data. This chapter
discusses an iterative strategy for data acquisition, analysis and visualization.
Big data processing is a major challenge: it must deal with voluminous data and
demanding processing times, as well as with distributed storage, since data may be
spread across different locations. Chapter “Big Data Processing
Algorithms” takes up these challenges. After surveying solutions to these problems,
the chapter introduces algorithms spanning random walks, distributed
hash tables, streaming, bulk synchronous processing and MapReduce paradigms.
These algorithms emphasize the usages of techniques, such as bringing application
to data location, peer-to-peer communications and synchronization, for increased
performance of big data applications. In particular, the chapter illustrates the power
of the MapReduce paradigm for big data computation.
Chapter “Big Data Search and Mining” talks of mining the information that big
data implicitly carries within. Often, big data appear with patterns exhibiting the
intrinsic relations they hold. Unearthed patterns could be of use for improving
enterprise performances and strategic customer relationships and marketing.
Towards this end, the chapter introduces techniques for big data search and mining.
It also presents algorithms for social network clustering using the topology dis-
covery technique. Further, some problems such as sentiment detection on pro-
cessing text streams (like tweets) are also discussed.
Security is always a prime concern, and the risk of security lapses in big data is
higher due to its high availability. As these data are collected from different sources,
vulnerability to security attacks increases. Chapter “Security and Privacy of Big
Data” discusses the challenges, possible technologies, initiatives by stakeholders
and emerging trends with respect to security and privacy of big data.
The world today, instrumented by numerous appliances and aided by many
internet-based services, generates very high volumes of data. These data are useful
for decision-making and for furthering the quality of services offered to customers.
For this, data services are provided by big data infrastructure, which receives
requests from users and provides data accordingly. These services are guided by Service Level
Agreement (SLA). Chapter “Big Data Service Agreement” addresses issues of SLA
specification and processing, introduces the need for negotiation in availing data
services, and proposes a framework for SLA processing.
Chapter “Applications of Big Data” introduces applications of big data in dif-
ferent domains including banking and financial services. It sketches scenarios for
the digital marketing space.
Big Data: An Introduction

Hrushikesha Mohanty
Abstract The term big data is now well understood for its well-defined characteristics,
and its usage is looking ever more promising. This chapter, being
an introduction, draws a comprehensive picture of the progress of big data. First, it
defines big data characteristics and then presents the usage of big data in different
domains. The challenges as well as guidelines in processing big data are
outlined. A discussion of the state of the art of hardware and software technologies
required for big data processing is presented, along with a brief discussion of
the tools currently available. Finally, research issues in big data are identified. The
references surveyed for this chapter, introducing different facets of this emergent
area in data science, provide a lead for readers intending to pursue their interests
in this subject.
Keywords: Big data applications · Analytics · Big data processing architecture · Big data technology and tools · Big data research trends
1 Big Data
The term “big data” remains ill-defined if we talk of data volume only. It gives the
impression that data size was always small before, and then we run into the problem
of defining what is small and what is big. How much data can be called big? The
question remains unanswered, or even not properly posed. With relational database
technology, one can handle very large volumes of data, which makes the term “big
data” a misnomer if volume is all we mean.
Days of yesteryear were not as machine-driven as today, and change was not as
frequent as we find now. Once a data repository was defined, it remained in use for
years. Relational database technology was therefore the mainstay for organisational
and corporate use. But emergent data no longer follow a defined structure: a variety
of data arrives in a variety of structures, and accommodating all of it in a single
defined structure is neither possible nor prudent for different usages.

H. Mohanty (✉)
School of Computer and Information Sciences, University of Hyderabad,
Gachhibowli 500046, Hyderabad, India
e-mail: hmcs_hcu@yahoo.com
Our world is now literally swamped with digital gadgets, ranging from a wide
variety of sensors to cell phones; something as simple as a cab carries several sensors
that throw off data on its performance. As soon as a radio cab is hired, it starts
sending messages about the trip. GPS units fitted to cars and other vehicles produce
a large amount of data at every tick of the clock. The scenario on the roads, i.e.
traffic details, is generated at regular intervals to keep an eye on traffic management.
Such scenarios constitute data on traffic commands, vehicles, people movement,
road conditions and much more related information, in forms ranging from visual
and audio to textual. Leaving aside very big cities, even in a medium-sized city with
a few crores of population, the emerging data can be unexpectedly large to handle
for making decisions and portraying current traffic conditions to regular commuters.
The Internet of things (IoT) is the new emerging world today. A smart home is one
where gadgets exchange information among themselves to keep the house in order;
for example, sensors in a refrigerator, on scanning the available amounts of different
commodities, may compose and forward a purchase list to a nearby supermarket of
choice. Smart cities can be made intelligent by processing data of interest collected
at different city points, for example, regulating city traffic at peak time so that
pollution levels at city squares do not cross a marked threshold. Such applications
need processing of the huge data that emerge at each instant of time.
Conducting business today, unlike before, needs intelligent decision making.
Moreover, decision making now demands instant action, as business scenarios
unfold in quick succession. This is due to digital connectivity, which makes
business houses, enterprises and their stakeholders across the globe so closely
connected that a change at one far end is instantly transmitted to the other. So the
business scenario changes in no time. For example, a glut in crude oil supply at a
distributor invites changes in the status of oil transport and availability in the
countries sourcing the crude oil; further, this impacts the economies of these
countries as the output of their industries is affected. An event in one business
domain can quickly generate a cascade of events in other business domains.
Smart decision making in such a situation needs quick collection as well as
processing of the business data that evolve around it.
Internet connectivity has led to a virtual society where a person at the far end of the
globe can be like your next-door neighbour, and the number of people in one's
friend list can outnumber the neighbours one actually has. Social media such as
Twitter, Facebook, Instagram and many similar platforms provide connectivity for
their members for interaction and social exchange. They exchange messages,
pictures, audio files, etc., and talk on issues ranging from politics, education and
research to entertainment. Of course, unfortunately, such media are also used for
subversive activities. Every moment, millions of people on social media exchange
enormous amounts of information, and for usages ranging from business promotion
to security enhancement there is growing interest in extracting information from
this emerging data. A survey on the generation of big data from mobile
applications is presented in [5].
A classification of big data from different perspectives, as presented in [6], is shown
in Fig. 1. The perspectives considered are data sources, content format, data
stores, data staging and data processing. The sources generating data include the
web and the social media on it; sensors reading parameter values that change over
time; the internet of things; machinery that throws off data on changing shop-floor
situations; and transactions carried out in various domains, such as enterprises and
organisations, for governance and commercial purposes. Data staging is about the
preprocessing of data that is required before information extraction. From the data
store perspective, the concern is the way data are stored for fast access. Data
processing covers the systemic approach required to process big data. We will
touch upon these two issues again in Sect. 3.
Having introduced big data, we next look at the usage of these data in different
domains. That gives an idea of why the study of big data is important for both the
business and academic communities.
Business data has been put to use since the 1990s, by means of statistical as well as
data mining techniques. This gave rise to the first generation of Business Intelligence
and Analytics (BI&A). Major IT vendors, including Microsoft, IBM, Oracle and
SAP, have developed BI platforms incorporating most of these data processing and
analytical technologies.
With the advent of Web technology, organisations put their businesses online
through e-commerce platforms such as Flipkart, Amazon and eBay, and are found
through web search engines like Google. These technologies have enabled direct
interaction between customers and business houses. User (IP)-specific information
and interaction details collected by web technologies (through cookies and service
logs) are used to understand customers' needs and identify new business
opportunities. Web intelligence and web analytics underpin Web 2.0-based social
and crowd-sourcing systems.
Social media analytics now provide a unique opportunity for business development.
Interactions among people on social media can be traced, and business
intelligence models built for two-way business transactions, instead of the
traditional one-way business-to-customer transaction [7]. We need scalable
techniques in information mining (e.g. information extraction, topic identification,
opinion mining, question answering, event detection), web mining, social network
analysis and spatial-temporal analysis, and these need to gel well with existing
DBMS-based techniques for BI&A 2.0 systems to emerge. These systems use a
variety of data emanating from different sources, in different forms and at different
intervals. Such a collection of data is known as big data. Data in abundance,
accompanied by analytics, can leverage opportunities and make a high impact in
many domain-specific applications [8]. Selected such domains include
e-governance, e-commerce, healthcare, education, security and many other
applications that require the boons of data science.
Data collected from interactions on social media can be analysed to understand
social dynamics, which can help in delivering governance services to people at the
right time and in the right way, resulting in good governance. Technology-assisted
governance aims to use data services, deploying data analytics for social data
analysis, visualisation, finding events in communities, extracting as well as
forecasting emerging social changes, and increasing the understanding of human
and social processes to promote economic growth and improved health and quality
of life.
E-commerce has benefited greatly from data collected through social media
analytics, using customer opinion, text analysis and sentiment analysis techniques.
Personalised recommender systems are now a possibility, following long-tail
marketing, by making use of data on social relations and the choices people make
[9]. Various data processing analytics based on association rule mining, database
segmentation and clustering, anomaly detection and graph mining techniques are
being used and developed to promote data as a service in e-commerce applications.

In the healthcare domain, big data is poised to make a big impact, resulting in the
personalisation of healthcare [10]. For this objective, healthcare systems plan to
make use of the different data the domain churns out every day in huge quantity.
Two main sources that generate a lot of data are genomic-driven and probe-driven
studies.
The big data service process has a few steps, starting from data acquisition, through
data staging, data analysis and application analytics processing, to visualisation.
Figure 2 presents a framework for big data processing that models, at a high level,
the working of such a system. Sources of data can be internet-based applications
and databases that store organisational data. On acquiring data, a preprocessing
stage called data staging removes unrequired and incomplete data [22]. Then it
transforms the data structure into the form required for analysis. In this process,
data normalisation is most important, so that data redundancy is avoided. The
normalised data are then stored for processing. Users from different domains, from
social computing, bioscience, business and the environment to space science, look
to the gathered data for information. Analytics corresponding to an application are
used for this purpose; these analytics, when invoked, in turn use data analysis
techniques to scoop out the information hiding in big data. Data analysis techniques
include machine learning, soft computing, statistical methods, data mining and
parallel algorithms for fast computation.
Visualisation is an important step in big data processing. Incoming data, information
under processing and final outcomes often need to be visualised to be understood,
because structure often holds information in its folds; this is especially true in
genomics studies.
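The acquisition, staging, analysis and visualisation steps described above can be sketched as a small pipeline. The function names, record fields and aggregation rule below are illustrative assumptions, not part of any particular big data framework:

```python
# A minimal sketch of the big data service pipeline described above:
# acquisition -> staging (cleaning, normalisation) -> analysis -> visualisation.

def acquire(sources):
    """Gather raw records from internet applications and databases."""
    for source in sources:
        yield from source

def stage(records):
    """Data staging: drop incomplete records and normalise the rest."""
    for rec in records:
        if rec.get("value") is None:                  # remove incomplete data
            continue
        yield {"key": rec["key"].strip().lower(),     # normalise to avoid redundancy
               "value": float(rec["value"])}

def analyse(records):
    """A stand-in analytic: aggregate values per key."""
    totals = {}
    for rec in records:
        totals[rec["key"]] = totals.get(rec["key"], 0.0) + rec["value"]
    return totals

def visualise(result):
    """Render results in a form a human can inspect."""
    return "\n".join(f"{k}: {v:.1f}" for k, v in sorted(result.items()))

sources = [[{"key": "Temp", "value": "21.5"}, {"key": "temp", "value": None}],
           [{"key": "temp", "value": "22.5"}]]
print(visualise(analyse(stage(acquire(sources)))))   # temp: 44.0
```

In a real system each stage would be a distributed subsystem; the point is only the one-way flow from raw sources to a human-readable view.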
3.2 Challenges
Big data service is hard because of both hardware and software limitations. We list
some of these limitations that are intuitively felt to be important. Storage has
become a major constraint [23]: the presently usual HDD (hard disk drive) with
random access technology is found restrictive, particularly for the fast input/output
transmission [24] that big data processing demands. Solid-state drives (SSD) [25]
and phase change memory (PCM) [26] are promising emerging technologies, but
are yet to become mainstream.
Other than storage limitations, there can be algorithmic design limitations in
defining data structures amenable to fast access for data management. There is a
need for optimised design and implementation of indexing for fast access to data
[27]. Novel ideas on key-value stores [28] and database file system arrangements
are challenges for big data management [29, 30].
Communication is almost essential to big data service, as both data acquisition and
service delivery are usually carried out over the internet. Big data service requires
large bandwidth for data transmission, and loss of data during transmission is
another concern.
3.3 Guidelines
The challenges big data processing faces look for solutions not only in technology
but also in the process of developing a system. We list these briefly, following the
discussion in [37–39]. Big data processing needs distributed computation, and the
making of such a system depends fairly on the type of application at hand. The
seven recommended principles in the making of big data systems are as follows:
Guideline-1: Choice of a good architecture: big data processing is performed
either in batch mode or in stream mode for real-time processing. For the former,
MapReduce architecture is found effective; for the latter, we need an architecture
that provides fast access with key-value data stores, such as NoSQL, where high
performance and index-based retrieval are possible. For real-time big data systems,
the Lambda Architecture is another example, emphasising the need for
application-based architecture. It proposes three layers, the batch layer, the serving
layer and the speed layer, claiming usefulness in real-time big data processing [38].
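As an illustration of the three Lambda layers, the following sketch keeps a batch view over the immutable master dataset, a speed-layer view over recent events, and a serving-layer query that merges the two. The event data and merge rule are assumptions for illustration, not an API prescribed by the architecture:

```python
# Illustrative sketch of the Lambda Architecture's three layers.

master_data = [("clicks", 5), ("clicks", 7)]   # immutable, append-only event log

# Batch layer: periodically recompute a complete view from the master data
# (in practice a MapReduce job over the whole dataset).
def recompute_batch_view(events):
    view = {}
    for key, n in events:
        view[key] = view.get(key, 0) + n
    return view

batch_view = recompute_batch_view(master_data)

# Speed layer: incremental view over events arriving after the last batch run.
speed_view = {}
def ingest(key, n):
    master_data.append((key, n))               # will reach the batch layer later
    speed_view[key] = speed_view.get(key, 0) + n

# Serving layer: merge both views to answer a query in real time.
def query(key):
    return batch_view.get(key, 0) + speed_view.get(key, 0)

ingest("clicks", 3)
print(query("clicks"))   # 12 from the batch view + 3 from the speed layer = 15
```

The design choice shown is the architecture's central one: the slow, accurate batch view and the fast, approximate speed view are kept separate, and only the serving layer combines them.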
Guideline-2: Availability of analytics: Making data useful depends primarily
on the variety of analytics available to meet the different objectives that different
application domains look for. Analytics range from statistical analysis, in-memory
analysis, machine learning, distributed programming and visualisation to real-time
analysis.
Next, we have a brief discussion of a typical big data system architecture that can
provide big data service. Such a system is a composition of several subsystems.
The framework, presented in Fig. 3, shows an organic link among components
that manage information and provide data services, including business intelligence
applications. The framework is taken from [41]; its copyright rests with Intelligent
Business Strategies.
A big data system is rather an environment inhabited by both conventional and
new database technologies, enabling users not only to access information in a
variety of forms but also to infer knowledge from it. In the literature, a big data
system is even termed a “Big Data Ecosystem”. It is a three-layered ecosystem,
with the bottom layer interfacing to all kinds of data sources that feed the system
with all types of data, i.e. structured and unstructured. This includes active data
sources, e.g. social media, enterprise systems and transactional systems, from
which data of different formats continue to stream. There can also be traditional
database systems, files and documents with archived information forming data
sources for a big data system.
The middle layer is responsible for data management, which includes data staging,
data modelling, data integration, data protection, data privacy and auditing. It can
have virtualisation capability, making data availability resilient in a cloud
environment. The third layer, interfacing with stakeholders, provides facilities for
running applications. Broadly, the facilities include application parallelisation,
information retrieval, intelligent implications and visualisation. Tools and
techniques are key to the success of a big data system, and its usability increases
with the use of appropriate technologies. The big data community is in the process
of developing such technologies, and some of them have caught the imagination of
users. The next section presents a few popular technologies; it does not claim a
thorough review but attempts to cite major technologies that have made an impact
on big data processing.
In this section, we point out technology and tools that have a big impact on big data
service. First we discuss hardware technologies, followed by software technologies.
Later in this section, we brief on some tools made for different purposes in
big data service.
Conventional DRAM storage technology faces problems for storing persistent data
over the long term: disks have moving parts that are vulnerable to malfunction in
the long run, and DRAM chips need constant power irrespective of usage, so it is
not an energy-efficient technology. Non-volatile memory (NVM) technology
promises a solution in future memory designs [42]. There is thinking on the use of
NVM even at the instruction level, so that the operating system can work faster,
and hope that NVM technology will bring a revolution to both data storage and
retrieval. Beyond memory, technology looks to improve processing power to
address the need for fast data processing. A significant solution in that direction is
the Data-Centre-on-Chip (DOC) [43]. It proposes four usage models that
consolidate applications that are homogeneous and cooperating, manage synchrony
on shared resources, and at the same time speed up computation by providing
cache hierarchies. Tang et al. [44] propose a hardware configuration that speeds up
execution of the Java virtual machine (JVM) by accelerating algorithms like
garbage collection. The same idea can be adopted for big data processing, applying
hardware technology to speed up data processing at the bottlenecks usually found
where data are shared by many.
The conventional TCP/IP stack for communication is usually homogeneous and
works well for lossless transmission; round-trip time (RTT) is usually less than
250 μs in the absence of queuing. This technology does not work for big data
communication, whose infrastructure requirements are very different. For the
data-centric communication network problem, an all-optical switching fabric could
be a promising solution. It proposes that computers talk directly to the network,
bypassing the bottleneck of the network interface; the processor-to-memory path
can also be an optical fibre connection. The first parallel optical transceiver capable
of one terabit per second was designed by IBM [45], and Intel is coming out with
optical interconnect cables in later versions of Thunderbolt. The hybrid
electrical/optical switch architecture Helios [46] is also a promising solution.
Virtualisation technology, though it came with mainframe technology [47] and lay
low during the era of inexpensive desktop computing, has come to the forefront
again for big data services in cloud environments. Technologies are coming up for
CPU, memory and I/O virtualisation. For big data analytics, even code
virtualisation (as in the JVM) is intended. This technology helps in run-time
virtualisation, following dynamically typed scripting languages or the use of
just-in-time (JIT) techniques [48].
Next, we take up developments in software technology for big data services. First,
we point out the requirements on software technology in the development of big
data systems, following the discussion in [49].
Phoebus [63], which follows the BSP concept, is increasingly used for analysis of
postings made on social networks such as Facebook and LinkedIn. However,
keeping an entire graph in memory, even in distributed form, will not remain
possible as data generation rises. There is a need for efficient graph partitioning
based on data characteristics and data use.
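The BSP model mentioned above organises a graph computation into supersteps: every active vertex computes, sends messages to its neighbours, and then all wait at a barrier before the next superstep. A toy single-process sketch (propagating the maximum vertex value, a standard BSP teaching example; this is not Phoebus's API):

```python
# Toy BSP computation: each vertex repeatedly adopts the largest value
# it hears from its neighbours, with a global barrier between supersteps.

graph = {1: [2], 2: [1, 3], 3: [2]}     # adjacency lists
values = {1: 3, 2: 6, 3: 2}             # initial vertex values

active = set(graph)
while active:                           # one loop iteration = one superstep
    # Computation + message-sending phase.
    inbox = {v: [] for v in graph}
    for v in active:
        for nbr in graph[v]:
            inbox[nbr].append(values[v])
    # Barrier: all messages are delivered before the next superstep starts.
    active = set()
    for v, msgs in inbox.items():
        best = max(msgs, default=values[v])
        if best > values[v]:
            values[v] = best
            active.add(v)               # only changed vertices stay active

print(values)   # every vertex converges to the maximum value
```

In a distributed engine the vertices would be partitioned across machines, which is exactly why the partitioning strategy discussed above matters: messages crossing partitions are the expensive part.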
Event-driven applications are usually time-driven: basically, they monitor, receive
and process events within a stipulated time period. Some applications consider
each event atomic and act on it individually, but others look for complex events
before taking action, for example a pattern over several events, before making a
decision. In the latter case, events in a stream pass through a pipeline for access.
This remains a bottleneck when processing streaming events with the MapReduce
scheme, though some systems, such as Borealis [64] and IBM's System S [65],
manage to work. Still, in future there is a need for scalable techniques for efficient
stream-data processing (stream data: e.g. clicks on web pages, status changes and
postings on social networks). Development of cloud-centric stream-data processing
has been slow [56]. Resource utilisation is key to processing scalable data using a
cloud-centric system; however, such a system must manage elasticity and fault
tolerance to make event-driven big data applications viable.
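The distinction between atomic events and complex events can be illustrated with a small pattern matcher over a stream. The event names, the three-in-a-row pattern and the window rule are assumptions chosen for illustration:

```python
from collections import deque

# Complex event processing sketch: raise an alert when the pattern
# "login_failed three times in a row from the same user" occurs in a stream.

def detect(stream, window=3):
    recent = {}                       # user -> deque of that user's recent events
    alerts = []
    for user, event in stream:        # atomic events arrive one at a time
        q = recent.setdefault(user, deque(maxlen=window))
        q.append(event)
        # A complex event is a pattern over several atomic events.
        if len(q) == window and all(e == "login_failed" for e in q):
            alerts.append(user)
            q.clear()                 # start looking for the pattern afresh
    return alerts

stream = [("alice", "login_failed"), ("bob", "login_ok"),
          ("alice", "login_failed"), ("alice", "login_failed"),
          ("bob", "login_failed")]
print(detect(stream))   # ['alice']
```

A production engine such as those cited above adds what this sketch omits: out-of-order arrival, time windows, distribution and fault tolerance.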
Data-centric applications require a parallel programming model, and to meet data
dependency requirements among tasks, intercommunication or information sharing
may be allowed. This in turn gives rise to task parallelism. Managing conflicts
among concurrent executions in a big data processing environment is a formidable
task. Nevertheless, the idea of sharing key-value pairs in a memory address space
is used to resolve potential conflicts. Piccolo [66] allows state changes to be stored
as key-value pairs in a distributed shared-memory table for distributed
computation. A transactional approach to distributed processing, TransMR, is
reported in [67]; it uses the MapReduce scheme for transactional computation over
a key-value data store kept in BigTable. Work on data-centric applications is in
progress, with the aim of providing elastic, scalable and fault-tolerant systems.
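Piccolo's idea of resolving conflicting concurrent updates through per-key accumulation functions can be sketched as follows. This is a single-process simplification of the idea, not Piccolo's actual distributed API:

```python
# Sketch of a Piccolo-style shared table: concurrent updates to the same
# key are merged with a user-supplied accumulator instead of locking.

class SharedTable:
    def __init__(self, accumulate):
        self.accumulate = accumulate   # conflict-resolution function
        self.data = {}

    def update(self, key, value):
        if key in self.data:
            # Two workers writing the same key is not treated as a conflict:
            # the accumulator merges the values deterministically.
            self.data[key] = self.accumulate(self.data[key], value)
        else:
            self.data[key] = value

    def get(self, key):
        return self.data.get(key)

table = SharedTable(accumulate=lambda old, new: old + new)
# Updates from different "workers", interleaved in any order, yield the same state
# because addition is commutative and associative.
for key, value in [("pages", 2), ("links", 1), ("pages", 5)]:
    table.update(key, value)
print(table.get("pages"))   # 7
```

The design point is that choosing a commutative, associative accumulator makes the final table state independent of update ordering, which is what lets a distributed implementation avoid fine-grained locking.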
As big data applications evolve, there is a need to integrate models for developing
applications with particular needs. For example, stream processing followed by
batch processing is required for applications that collect clicks and process
analytics in batch mode. A publish-subscribe system can be a tightly coupled
composition of data-centric and event-based systems, where subscriptions are
partitioned by topic and stored at different locations, and publications, as they
occur, are directed to their respective subscriptions. Though Apache YARN [54] is
a system with multiple programming models, much work remains in the
development of systems with multiple programming models.
There have been several tools from university labs and corporate R&D. We
mention some of them here for completeness, though the list is not claimed to be
exhaustive.
Hadoop has emerged as the most suitable platform for data-intensive distributed
applications. Hadoop HDFS is a distributed file system that partitions large files
across multiple machines for high-throughput data access. It is capable of storing
both structured and unstructured heterogeneous data. On this data store, data
scientists run analytics using the Hadoop MapReduce programming
model. The model is specially designed to offer a programming framework for dis-
tributed batch processing of large data sets spread across multiple servers. It
uses the Map function to distribute data and create key–value pairs for use in the
Reduce stage. Multiple copies of a program are created and run at the data
clusters so that transporting data from its location to a computation node is
avoided. The component Hive is a data warehouse system for Hadoop that facilitates
data summarisation, ad hoc queries and the analysis of large datasets stored in
Hadoop-compatible file systems. Hive uses an SQL-like language called HiveQL, and
HiveQL programs are converted into Map/Reduce programs. Hadoop also provides
for NoSQL data through its component HBase, which uses a column-oriented
store as in Google's Bigtable. Hadoop's component Pig is a high-level
data-flow language for expressing Map/Reduce programs that analyse large HDFS-
distributed data sets. Hadoop also hosts a scalable machine learning and data
mining library called Mahout. The Hadoop environment has a component called Oozie
that schedules jobs submitted to Hadoop, and another component, ZooKeeper,
that provides a high-performance coordination service for distributed applications.
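The Map/Reduce flow described above can be sketched in miniature: Map emits key–value pairs, a shuffle groups them by key, and Reduce aggregates each group. This is a simplified in-memory model of the programming idea, not Hadoop's actual API:

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split.
    for word in document.split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does
    # between the Map and Reduce stages.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the list of values for each key.
    return {word: sum(counts) for word, counts in groups.items()}

# Two "splits" stand in for blocks of a large file on HDFS.
splits = ["big data is big", "data is data"]
pairs = [kv for doc in splits for kv in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
# counts == {'big': 2, 'data': 3, 'is': 2}
```

In the real framework, each split's Map runs on the node that stores it, which is what "moving computation to the data" means.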
Other than Hadoop, there are a host of tools available to facilitate big data
applications. First, we mention some other batch-processing tools, viz. Dryad [68],
Jaspersoft BI suite [69], Pentaho business analytics [70], Skytree Server [71],
Tableau [72], Karmasphere studio and analyst [73], Talend Open Studio [72] and
IBM InfoSphere [41, 74]. Let us briefly introduce each of these tools.
Dryad [68] provides a distributed computation programming model that is scal-
able and user-friendly, keeping the distribution of jobs hidden from users. It allocates,
monitors and executes a job at multiple locations. A Dryad application executes
on a graph configuration where vertices represent processors and edges
represent channels. On submission of an application, Dryad's centralised job manager
allocates computation to several processors forming a directed acyclic graph. It
monitors execution and, where possible, can update the graph, providing a resilient
computing framework in case of failure. Thus, Dryad is a self-contained tool for
data-centric applications. Jaspersoft BI suite [69] is an open-source tool for
fast access, retrieval and visualisation of big data. The speciality of the tool is its
capability to interface with a variety of databases, not necessarily relational,
including MongoDB, Cassandra, Redis, Riak and CouchDB. Its columnar
in-memory engine enables it to process and visualise large-scale data. Pentaho
[70] is also an open-source tool to process and visualise data. It provides Web
interfaces for users to collect and store data and to make business decisions by
executing business analytics. It can also handle different types of data stores, not necessarily of
Big Data: An Introduction 17
relational databases. Several popular NoSQL databases are supported by this tool. It
can also handle Hadoop file systems for data processing. Skytree Server [71] is a
next-generation tool providing advanced data analytics, including machine learning
techniques. It has five useful advanced features, namely recommendation systems,
anomaly/outlier identification, predictive analytics, clustering and market segmen-
tation, and similarity search. The tool uses optimised machine learning algorithms
for use with real-time streaming data. It is capable of handling structured and
unstructured data stores. Tableau [72] has three main functionalities: data visu-
alisation, interactive visualisation and browser-based business analytics. The main
modules for these three functionalities are Tableau Desktop, Tableau Public and
Tableau Server, for visualisation, creating interactive visualisations and business
analytics, respectively. Tableau can also work with a Hadoop data store for data
access. It processes data queries using in-memory analytics, and this caching helps to
reduce the latency of a Hadoop cluster. Karmasphere studio and analyst [73] is
another Hadoop platform that provides advanced business analytics. It handles both
structured and unstructured data. A programmer finds an integrated environment in
which to develop and execute Hadoop programs. The environment provides facilities for
iterative analysis, visualisation and reporting. Karmasphere is well designed on the
Hadoop platform to provide an integrated and user-friendly workspace for processing
big data in a collaborative way. Talend Open Studio [72] is yet another Hadoop
platform for big data processing; its speciality is a visual programming
facility with which a developer can drag icons and stitch them together to make an application.
Though it intends to forego the writing of complex Java programs, application
capability is limited by the usability of the icons. However, it provides good
seeding for application development. It can work with HDFS, Pig,
HCatalog, HBase, Sqoop and Hive. IBM InfoSphere [41, 74] is a big data analytics
platform built on the Apache Hadoop system that provides warehousing as well as big data
analytics services. It has features for data compression, MapReduce-based text and
machine learning analytics, storage security and cluster management, connectors to
IBM DB2 and IBM's PureData, job scheduling and workflow management, and
BigIndex, a MapReduce facility that leverages the power of Hadoop to build
indexes for search-based analytic applications.
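Dryad's graph model, with vertices as computations and edges as channels, can be mimicked with a topological execution over a directed acyclic graph. The job graph below is hypothetical and the scheduler is a deliberately simple sketch, not Dryad's actual runtime:

```python
def run_dag(vertices, edges):
    """Execute a DAG job Dryad-style: each vertex is a function whose
    inputs are the outputs of its predecessors (edges act as channels).
    vertices: name -> function taking a list of predecessor outputs.
    edges: list of (src, dst) pairs."""
    preds = {v: [] for v in vertices}
    for src, dst in edges:
        preds[dst].append(src)
    done, results = set(), {}
    # Repeatedly run every vertex whose predecessors have all finished:
    # a naive topological-order scheduler.
    while len(done) < len(vertices):
        for v in vertices:
            if v not in done and all(p in done for p in preds[v]):
                results[v] = vertices[v]([results[p] for p in preds[v]])
                done.add(v)
    return results

# A tiny job: two readers feed a join vertex, which feeds an output vertex.
job = {
    "read_a": lambda ins: [1, 2, 3],
    "read_b": lambda ins: [10, 20, 30],
    "join":   lambda ins: [a + b for a, b in zip(ins[0], ins[1])],
    "out":    lambda ins: sum(ins[0]),
}
wires = [("read_a", "join"), ("read_b", "join"), ("join", "out")]
results = run_dag(job, wires)
```

A real system would dispatch ready vertices to different machines in parallel and reschedule a vertex elsewhere on failure, which is where Dryad's resilience comes from.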
Now we introduce some tools used for processing big data streams. Among
them, IBM InfoSphere Streams, Apache Kafka, SAP HANA, Splunk, Storm,
SQLstream s-Server and S4 are referred to here. IBM InfoSphere Streams is a real-time
stream-processing platform capable of processing data streams of unbounded length. It can
process streams of different structures. It has a library of real-time data analytics and
can also run third-party analytics. It processes streams of data and looks for
emerging patterns. On recognition of a pattern, impact analysis is carried out and
measures fitting the impact are taken. The tool can attend to multiple
streams of data. Scalability is provided by deploying InfoSphere Streams applica-
tions on multicore, multiprocessor hardware clusters optimised for real-time
analytics. It also has a dashboard for visualisation. Apache Kafka [75], developed at
LinkedIn, processes streaming data using in-memory analytics to meet the real-time
constraints of streaming data processing. It combines offline and online
18 H. Mohanty
processing to provide real-time computation and produce ad hoc solutions for these
two kinds of data. Its characteristics include persistent messaging with
O(1) disk structures, high throughput, support for distributed processing and
support for parallel data loading into Hadoop. It follows a distributed implementation of
the publish–subscribe system for message passing. An interesting behaviour of the tool is
its combination of offline and online computation to meet the real-time constraints that
streaming data processing demands. SAP HANA [76] is another tool for real-time
processing of streaming data. Splunk [77], as a real-time data processing tool, dif-
fers from the others in that it processes system-generated data, i.e. data available
from system logs. It uses cloud technology for optimised resource utilisation. It is
used to monitor and analyse online the data that systems produce, and to report on its
dashboard. Storm [78] is a distributed real-time system for processing streaming data. It
has many applications, such as real-time analytics, interactive operating systems,
online machine learning, continuous computation, distributed RPC (Remote
Procedure Call) and ETL (Extract, Transform and Load). Like Hadoop, it
uses clusters to speed up data processing. The difference between the two lies in
topology: Storm builds a different topology for each application, whereas
Hadoop uses the same topology for iterative data analysis. Moreover, Storm can
dynamically change its topology to achieve resilient computation. A topology is
made of two types of nodes, namely spouts and bolts. Spout nodes denote input
streams, while bolt nodes receive and process a stream of data and output a further
stream of data. So, an application can be seen as parallel activities at the different nodes
of a graph representing a snapshot of the application's distributed execution.
A cluster is a collection of one master node and several worker nodes. The master
node and worker nodes implement two daemons, Nimbus and Supervisor, respec-
tively. These two daemons have functions similar to those of the JobTracker and TaskTracker in the
MapReduce framework. Another daemon, called ZooKeeper, plays an
important role in coordinating the system. Together, these three make the Storm system
work as a distributed framework for real-time processing of streaming data.
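The spout/bolt topology described above can be sketched as a pipeline of Python generators: the spout emits a stream, and each bolt consumes one stream and emits another. This is an illustrative model of the concept only; Storm's real API and its parallel, distributed execution differ:

```python
def sentence_spout():
    # Spout: the source node emitting an input stream of tuples.
    for line in ["storm processes streams", "spouts feed bolts"]:
        yield line

def split_bolt(stream):
    # Bolt: receives a stream of sentences, emits a stream of words.
    for sentence in stream:
        for word in sentence.split():
            yield word

def count_bolt(stream):
    # Bolt: receives words, emits running (word, count) tuples.
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])

# Wiring the topology: spout -> split bolt -> count bolt.
topology_output = list(count_bolt(split_bolt(sentence_spout())))
```

In Storm each spout and bolt runs as many parallel tasks across worker nodes, with Nimbus assigning tasks and Supervisors running them, rather than as a single in-process chain.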
SQLstream [79] is yet another real-time streaming data processing system for dis-
covering patterns. It can work efficiently with both structured and unstructured
data. It stores streams in memory and processes them with in-memory analytics,
taking advantage of multicore computing. S4 [80] is a general-purpose platform that provides
scalable, robust, distributed and real-time computation over streaming data. It provides
plug-and-play provisioning, making a system scalable. S4 also employs Apache
ZooKeeper to manage its cluster. It is successfully used in Yahoo's production systems
to process the thousands of queries posted to them.
However, besides batch processing and stream processing, there is a need for
interactive analysis of data by users. Interactive processing needs speed so that
users are not kept waiting for replies to their queries. Apache Drill [81] is a distributed system
capable of scaling up to 10,000 or more servers while processing different types of data.
It can work on nested data and with a variety of query languages, data formats and
data sources. Dremel [82] is another interactive big data analysis tool, proposed
by Google. Both search data stored either in columnar form or on a distributed file
system to respond to user queries.
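The columnar storage that Drill and Dremel exploit can be illustrated briefly: storing a table column-wise lets a query touch only the columns it needs. The table and query here are a toy model; the real systems add nested records, compression and distributed execution:

```python
# Row-oriented layout: each record stored together; a scan must read
# whole rows even when the query needs only two fields.
rows = [
    {"user": "u1", "country": "IN", "clicks": 5},
    {"user": "u2", "country": "US", "clicks": 3},
    {"user": "u3", "country": "IN", "clicks": 7},
]

# Column-oriented layout: each column stored contiguously.
columns = {
    "user":    ["u1", "u2", "u3"],
    "country": ["IN", "US", "IN"],
    "clicks":  [5, 3, 7],
}

# "Total clicks from IN" touches only two of the three columns.
total = sum(c for c, ctry in zip(columns["clicks"], columns["country"])
            if ctry == "IN")
```

For wide tables with billions of rows, reading only the referenced columns (and compressing each column independently) is what makes interactive-speed scans feasible.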
Different big data applications have different requirements. Hence, there are
different types of tools with different features, and for the same type of application
different tools may perform differently. Further, a tool's acceptance depends
on its user-friendliness as a development environment. In choosing a tool for
application development, all these issues are to be taken into consideration.
Looking at all these tools, it is clear that a good tool must not only be fast in
processing and visualisation but must also be able to find the knowledge
hidden in an avalanche of data. Development on both fronts requires research
progress in several areas of computer science. In the next section, we address some of the
research issues big data raises.
5 Research Issues
Success in big data highly depends on high-speed computing and analysis
methodologies. These two objectives involve a natural trade-off between the cost of
computation and the number of patterns found in the data. Here, principles of optimisation
play a role in finding an optimal solution: enough patterns at less cost. We
have many optimisation techniques, namely stochastic optimisation, including
genetic programming, evolutionary programming and particle swarm optimisation.
Big data applications need optimisation algorithms that are not only fast but also
use reduced memory [83, 84]. Data reduction [85] and parallelisation [49, 86] are
further issues to be considered for optimisation.
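As one concrete instance of the stochastic techniques listed above, here is a minimal particle swarm optimisation sketch minimising a simple function. The swarm size, inertia and acceleration coefficients are illustrative textbook choices, not a recommendation for big data workloads:

```python
import random

def pso(f, dim=2, swarm=20, iters=200, w=0.7, c1=1.5, c2=1.5):
    """Minimal particle swarm optimisation of f over [-5, 5]^dim."""
    pos = [[random.uniform(-5, 5) for _ in range(dim)] for _ in range(swarm)]
    vel = [[0.0] * dim for _ in range(swarm)]
    best = [p[:] for p in pos]                 # each particle's best position
    gbest = min(best, key=f)[:]                # swarm-wide best position
    for _ in range(iters):
        for i in range(swarm):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                # Velocity update: inertia plus pulls towards the
                # personal best and the global best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (best[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            if f(pos[i]) < f(best[i]):
                best[i] = pos[i][:]
                if f(best[i]) < f(gbest):
                    gbest = best[i][:]
    return gbest

# Minimise the sphere function, whose optimum is the origin.
sphere = lambda x: sum(v * v for v in x)
solution = pso(sphere)
```

The per-particle updates are independent within an iteration, which is exactly the structure the parallelisation research cited above seeks to exploit at big data scale.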
Statistics, commonly used for data analysis, also has a role in processing big
data. But statistical algorithms have to be extended to meet scale and time
requirements [87]. Parallelisation of statistical algorithms is an area of importance
[88]. Further, statistical computing [89, 90] and statistical learning [91] are two hot
research areas with promising results.
Social networking is currently a massive generator of big data. Many aspects [92] of
social networks need intensive research to obtain their benefits, as understanding the
digital society of the future holds the key to social innovation and engagement. First, the study
of social network structure includes many interesting issues such as link formation
[93] and network evolution [94]. Visualising the digital society, with its dynamic
changes and issue-based associations, is an interesting area of research [95, 96].
Uses of social networks are plentiful [97], including business recommendation [98]
and social behaviour modelling [99].
Machine learning, studied in artificial intelligence, finds knowledge
patterns by analysing data and uses that knowledge in intelligent decision-making;
that is, decision-making in a system is governed by knowledge the system finds by
itself [100]. Likewise, big data, being a huge collection of data, may contain several
patterns to be discovered by analysis. Existing machine learning algorithms, both
supervised and unsupervised, face scale-up problems in processing big data. Current
research aims to improve these algorithms to overcome their limitations.
For example, the artificial neural network (ANN), so successful in machine learning, performs poorly on big
data owing to memory limitations and intractable computation and training [101]. A solution
can be devised by reducing the data while keeping the size of the ANN limited. Another method may
opt for massive parallelisation, as in the MapReduce strategy. Big data analysis based on
deep learning, which detects patterns from regularities in spatio-temporal
dependencies [102, 103], is of research interest. A work on learning with
high-dimensional data is reported in [104]. Deep learning has been found successful
in [105, 106], and it will also be useful in finding patterns in big data. In addition,
visualisation of big data is a big research challenge [107]; it takes up issues in
feature extraction and geometric modelling to significantly reduce the data size
before the actual data rendering. Proper data structures are also an issue for data
visualisation, as discussed in [108].
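The MapReduce-style parallelisation of learning mentioned above can be illustrated on a toy model: each data partition computes a local gradient (the map step), the gradients are averaged (the reduce step), and shared weights are updated. The one-parameter model, learning rate and data are illustrative, not a real training recipe:

```python
def local_gradient(w, partition):
    # Map: mean gradient of squared error for y ~ w*x on one partition.
    g = 0.0
    for x, y in partition:
        g += 2 * (w * x - y) * x
    return g / len(partition)

def train(partitions, w=0.0, lr=0.05, epochs=100):
    for _ in range(epochs):
        # Map step: each partition computes its local gradient
        # (in a cluster, these run in parallel where the data live).
        grads = [local_gradient(w, p) for p in partitions]
        # Reduce step: average the local gradients, then update weights.
        w -= lr * sum(grads) / len(grads)
    return w

# Data generated from y = 3x, split into two partitions.
parts = [[(1, 3), (2, 6)], [(3, 9), (4, 12)]]
w = train(parts)
# w converges towards 3.0
```

Only the small weight vector crosses the network each epoch, while the large data stay put, which is why this pattern eases the memory pressure the text describes.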
Research in the above areas aims at developing efficient analytics for big data analysis.
We also look for efficient and scalable computing paradigms suitable for big data
processing. For example, there is intensive work on variants of the MapReduce
strategy, such as the work reported in [56] on making MapReduce work online. Next,
we refer to emerging computing paradigms from the literature, namely granular com-
puting, cloud computing, bio-inspired computing and quantum computing.
Granular computing [109] predates the arrival of big data.
Essentially, the concept finds granularity in data and computation and applies
different computing algorithms at different granularities of an application. For
example, analysing a country's economy needs an approach different from the
algorithm required for a state's; the difference is something like that between
macro- and microeconomics. Big data can be viewed at different granularities and
used differently for different purposes, such as pattern finding, learning and
forecasting. This suggests the need for granular computing in big data analysis
[110, 111]. Changes in computation can be viewed as a development from
machine-centric to human-centric and then to information- and knowledge-centric; in
that case, information granularity suggests the algorithm for computation.
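The granularity idea above can be shown with a toy example: the same transaction data summarised at country (macro) versus state (micro) granularity. The records and level names are hypothetical:

```python
from collections import defaultdict

# (country, state, amount) sales records.
sales = [
    ("IN", "Odisha", 120), ("IN", "Telangana", 200),
    ("US", "CA", 500), ("US", "NY", 300), ("IN", "Odisha", 80),
]

def aggregate(records, key_index):
    # Granular computing in miniature: the chosen key decides the
    # granularity at which the same data are summarised.
    totals = defaultdict(int)
    for rec in records:
        totals[rec[key_index]] += rec[2]
    return dict(totals)

macro = aggregate(sales, 0)   # country-level (macro) view
micro = aggregate(sales, 1)   # state-level (micro) view
```

A granular system would go further and attach different analytics to each level, e.g. forecasting at the macro view and anomaly detection at the micro view.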
Quantum computing [112] is in a fledgling state, but theoretically it suggests a
framework capable of providing enormous memory space as well as the speed to
manipulate enormous inputs simultaneously. Quantum computing is based on the theory
of quantum physics [113]. A quantum computer works on the concept of the qubit, which codifies states
between zero and one into distinguishable quantum states following the phenomena of
superposition and entanglement. Research in this area [114] will hopefully soon help
realise quantum computers that will be useful for big data analysis.
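The qubit concept can be made concrete with a small numerical sketch: a qubit is a pair of complex amplitudes whose squared magnitudes give measurement probabilities, and n qubits span a state space of dimension 2^n. This is the textbook model, not tied to any quantum programming framework:

```python
import math

# A qubit state |psi> = alpha|0> + beta|1> with |alpha|^2 + |beta|^2 = 1.
alpha = 1 / math.sqrt(2)
beta = 1 / math.sqrt(2)      # equal superposition of |0> and |1>

p0 = abs(alpha) ** 2         # probability of measuring 0
p1 = abs(beta) ** 2          # probability of measuring 1
assert abs(p0 + p1 - 1.0) < 1e-12

def state_space(n_qubits):
    # n qubits describe 2**n amplitudes simultaneously: the source of
    # the enormous state space quantum computing promises.
    return 2 ** n_qubits

dim = state_space(50)        # 50 qubits already address ~10**15 amplitudes
```

The exponential growth of `state_space` is precisely why simulating quantum systems classically is hard, and why a working quantum computer is attractive for big data scale problems.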
Cloud computing [115] has presently caught the imagination of users by providing the
capability of supercomputers, easily over the Internet, through virtualisation and the sharing of
affordable processors. This assures resources and computing power at the required
scale [116]. The computing needs of big data fit well to cloud
computing. As data or computation needs grow, an application can ask for
both computing and storage services from the cloud, whose elasticity matches the
load requirement. Further, billing of cloud computing is fairly simple thanks to
its pay-as-you-use rule. Currently, corporates have shown interest in providing big
data applications on cloud computing environments. This is understandable given the
number of tools available for the purpose. Further research will raise cloud computing
to new heights as big data applications scale up to unprecedented sizes.
Biology has taken centre stage in revolutionising today's world in many
spheres. In particular, for computation there has been an effort to bring human intel-
ligence to machines. Currently, researchers are interested in the phenomena by
which the human brain stores huge amounts of information for years and
retrieves a piece of information as the need arises. The size and speed the human brain is capable
of are baffling, and could be useful for big data analysis if the phenomena were well
understood and replicated on a machine. This quest has given rise to both
bio-inspired hardware [117] and bio-inspired computing [118]. Researchers carry forward
work on three fronts to design biocomputers, namely biochemical computers,
biomechanical computers and bioelectronic computers [119]. In another work
[120], biocomputing is viewed as a cellular network performing activities such as
computation, communication and signal processing. A recent work shows the use of
biocomputing for cost minimisation in provisioning data-intensive services [121].
As a whole, biological science and computing science together project an
exciting emerging area of research, taking up problems from both sides with
great hope of extending the boundary of science. The evolving paradigm is also
expected to help immensely in big data analysis through the possible power of computing
mixed with intelligence.
This section points out some research areas that hold promise for big data
analysis. As big data applications mature, their dimensions will emerge more
clearly, and new research problems will come up in finding solutions. Next,
we conclude this chapter.
6 Conclusion
Big data is now an emerging area in computer science. It has drawn the attention of
academics as well as developers and entrepreneurs. Academics see a challenge in
extracting knowledge by processing huge data of various types. Corporates wish to
develop systems for different applications. Further, making big data applications
available to users on the move is also in demand. These interests from several
quarters are going to define growth in big data research and applications. This
chapter intends to draw a picture on the big canvas of these developments.
After defining big data, the chapter goes on to describe usages of big data in several
domains, including health, education, science and governance. This idea of big
data as a service lays the ground for, and rationalises, the study of big data. Next, the
chapter makes a point on the complexity of big data processing and lists some
guidelines for it. Processing big data needs a system that scales up to
handle a huge data size and is capable of very fast computation. The third section
discusses big data system architecture. Some applications need
batch-processing systems and some need online computation. Real-time processing
of big data is also needed for some usages. Thus, big data architectures vary based
on the nature of the application. In general, though, a big data system design
gives priority to two things: firstly, managing huge heterogeneous data, and secondly,
managing computation in ever less time. Currently, data science
is engaged in finding solutions at the earliest.
Success in big data science largely depends on progress in both hardware and
software technology. The fourth section of this chapter presents a picture of current
developments in both hardware and software technology that are important for big
data processing. The emergence of non-volatile memory (NVM) is expected to address
some of the problems that memory design for big data processing currently faces. For fast
computation, for the time being, data on chip (DOC) could be an effective solution. In
addition, cloud computing has shown a way of utilising resources for big data
processing. Not only technology but also the computing platform has a hand in the success of
big data. In that context, there are several tools that provide effective platforms for
developing big data systems. This section also presents references to known
tools so that interested readers can pursue the details of a tool of their interest.
The fifth section presents research challenges that big data has brought to many
areas of computer science. It discusses several computing paradigms that may
hold the key to success in big data processing, given its ever-demanding need for
high-speed computation. The chapter also spells out the need to develop new variants
of algorithms for knowledge extraction from big data. This is a call to computer
scientists to devise new adaptive strategies for fast computation.
Social networking and big data go almost hand in hand, the former being a
major source of big data generation. Thus, the study of social networking has a
bearing on the advancement of big data usage. This also brings many vital social issues
into the picture. For example, the sharing of data is a very contentious issue:
an online analytic making use of data generated by one's activity in a social
network raises ethical questions about rights and privacy. Many such ethical and
social issues will gradually come to the fore as usage of big data applications rises. This
challenge also invites social scientists to lend a helping hand to the success of big data.
Of course, a question naturally comes to mind about the sustain-
ability of big data processing. Is the future of computing huge distributed systems
spanning the globe? Or is there some data that is comprehensive and generic
enough to describe the world around us, that is, Small Data?
This philosophical question could be contagious to intriguing minds!
Exercise
References
1. Zikopoulos, P.C., Eaton, C., deRoos, D., Deutsch, T., Lapis, G.: Understanding Big Data.
McGraw-Hill, New York (2012)
2. García, A.O., Bourov, S., Hammad, A., Hartmann, V., Jejkal, T., Otte, J.C., Pfeiffer, S.,
Schenker, T., Schmidt, C., Neuberger, P., Stotzka, R., van Wezel, J., Neumair, B., Streit, A.:
Data-intensive analysis for scientific experiments at the large scale data facility. In: IEEE
Symposium on Large Data Analysis and Visualization (LDAV), pp. 125–126 (2011)
3. O’Leary, D.E.: Artificial intelligence and big data. Intell. Syst. IEEE 28, 96–99 (2013)
4. Berman, J.J.: Introduction. In: Principles of Big Data, pp. xix-xxvi. Morgan Kaufmann,
Boston (2013)
5. Chen, M., Mao, S., Liu, Y.: Big data: a survey. Mob. Netw. Appl. 19, 171–209 (2014)
6. Hashem, I.A.T., Yaqoob, I., Anuar, N.B., Mokhtar, S., Gani, A., Ullah, S.: The rise of “Big
Data” on cloud computing: review and open research issues. Inf. Syst. 47, January, 98–115
(2015)
7. Lusch, R.F., Liu, Y., Chen, Y.: The phase transition of markets and organizations: the new
intelligence and entrepreneurial frontier. IEEE Intell. Syst. 25(1), 71–75 (2010)
8. Chen, H., Chiang, R.H.L., Storey, V.C.: Business intelligence and analytics: from big data to
big impact. MIS Quarterly 36(4), 1165–1188 (2012)
9. Adomavicius, G., Tuzhilin, A.: Toward the next generation of recommender systems: a
survey of the state-of-the-art and possible extensions. IEEE Trans. Knowl. Data Eng. 17(6),
734–749 (2005)
10. Chen, H.: Smart health and wellbeing. IEEE Intell. Syst. 26(5), 78–79 (2011)
11. Parida, L., Haiminen, N., Haws, D., Suchodolski, J.: Host trait prediction of metagenomic
data for topology-based visualisation. LNCS 5956, 134–149 (2015)
12. Chen, H.: Dark Web: Exploring and Mining the Dark Side of the Web. Springer, New York
(2012)
13. NSF: Program Solicitation NSF 12-499: Core techniques and technologies for advancing big
data science & engineering (BIGDATA). http://www.nsf.gov/pubs/2012/nsf12499/nsf12499.
htm (2012). Accessed 12th Feb 2015
14. Salton, G.: Automatic Text Processing, Reading. Addison Wesley, MA (1989)
15. Manning, C.D., Schütze, H.: Foundations of Statistical Natural Language Processing.
The MIT Press, Cambridge (1999)
16. Big Data Spectrum, Infosys. http://www.infosys.com/cloud/resource-center/Documents/big-
data-spectrum.pdf
17. Short, E., Bohn, R.E., Baru, C.: How much information? 2010 report on enterprise server
information. UCSD Global Information Industry Center (2011)
18. http://public.web.cern.ch/public/en/LHC/Computing-en.html
19. http://www.youtube.com/yt/press/statistics.html
20. http://agbeat.com/tech-news/how-carriers-gather-track-and-sell-your-private-data/
21. http://www.information-management.com/issues/21_5/big-data-is-scaling-bi-and-analytics-
10021093-1.html
22. Rahm, E., Do, H.H.: Data cleaning: problems and current approaches. IEEE Data Eng. Bull.
23, 3–13 (2000)
23. Agrawal, D., Bernstein, P., Bertino, E., Davidson, S., Dayal, U., Franklin, M., Gehrke, J.,
Haas, L., Han, J., Halevy, A., Jagadish, H.V., Labrinidis, A., Madden, S., Papakonstantinou,
Y., Patel, J., Ramakrishnan, R., Ross, K., Cyrus, S., Suciu, D., Vaithyanathan, S., Widom, J.:
Challenges and opportunities with big data. Cyber Center Technical Reports,
Purdue University (2011)
24. Kasavajhala, V.: Solid state drive vs. hard disk drive price and performance study. In: Dell
PowerVault Tech. Mark (2012)
25. Hutchinson, L.: Solid-state revolution: in-depth on how SSDs really work. Ars Technica
(2012)
26. Pirovano, A., Lacaita, A.L., Benvenuti, A., Pellizzer, F., Hudgens, S., Bez, R.: Scaling
analysis of phase-change memory technology. IEEE Int. Electron Dev. Meeting, 29.6.1–
29.6.4 (2003)
27. Chen, S., Gibbons, P.B., Nath, S.: Rethinking database algorithms for phase change memory.
In: CIDR, pp. 21–31. www.crdrdb.org (2011)
28. Venkataraman, S., Tolia, N., Ranganathan, P., Campbell, R.H.: Consistent and durable data
structures for non-volatile byte-addressable memory. In: Ganger, G.R., Wilkes, J. (eds.)
FAST, pp. 61–75. USENIX (2011)
29. Athanassoulis, M., Ailamaki, A., Chen, S., Gibbons, P., Stoica, R.: Flash in a DBMS: where
and how? IEEE Data Eng. Bull. 33(4), 28–34 (2010)
30. Condit, J., Nightingale, E.B., Frost, C., Ipek, E., Lee, B.C., Burger, D., Coetzee, D.: Better
I/O through byte—addressable, persistent memory. In: Proceedings of the 22nd Symposium
on Operating Systems Principles (22nd SOSP’09), Operating Systems Review (OSR),
pp. 133–146, ACM SIGOPS, Big Sky, MT (2009)
31. Wang, Q., Ren, K., Lou, W., Zhang, Y.: Dependable and secure sensor data storage with
dynamic integrity assurance. In: Proceedings of the IEEE INFOCOM, pp. 954–962 (2009)
32. Oprea, A., Reiter, M.K., Yang, K.: Space efficient block storage integrity. In: Proceeding of
the 12th Annual Network and Distributed System Security Symposium (NDSS 05) (2005)
33. Hashem, I.A.T., Yaqoob, I., Anuar, N.B., Mokhtar, S., Gani, A., Khan, S.U.: The rise of “big
data” on cloud computing: review and open research issues, vol. 47, pp. 98–115 (2015)
34. Wang, Q., Wang, C., Ren, K., Lou, W., Li, J.: Enabling public auditability and data dynamics
for storage security in cloud computing. IEEE Trans. Parallel Distrib. Syst. 22(5), 847–859
(2011)
35. Oehmen, C., Nieplocha, J.: Scalablast: a scalable implementation of blast for high-
performance data-intensive bioinformatics analysis. IEEE Trans. Parallel Distrib. Syst. 17(8),
740–749 (2006)
36. Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., Hung Byers, A.: Big
data: The Next Frontier for Innovation, Competition, and Productivity. McKinsey Global
Institute (2012)
37. Chen, C.L.P., Zhang, C.-Y.: Data-intensive applications, challenges, techniques and
technologies: a survey on big data. Inf. Sci. 275, 314–347 (2014)
38. Marz, N., Warren, J.: Big data: principles and best practices of scalable real-time data
systems. Manning (2012)
39. Garber, L.: Using in-memory analytics to quickly crunch big data. IEEE Comput. Soc.
45(10), 16–18 (2012)
40. Molinari, C.: No one size fits all strategy for big data, Says IBM. http://www.bnamericas.
com/news/technology/no-one-size-fits-all-strategy-for-big-data-says-ibm, October 2012
41. Ferguson, M.: Architecting a big data platform for analytics, Intelligent Business Strategies.
https://www.ndm.net/datawarehouse/pdf/Netezza (2012). Accessed 19th Feb 2015
42. Ranganathan, P., Chang, J.: (Re)designing data-centric data centers. IEEE Micro 32(1),
66–70 (2012)
43. Iyer, R., Illikkal, R., Zhao, L., Makineni, S., Newell, D., Moses, J., Apparao, P.:
Datacenter-on-chip architectures: tera-scale opportunities and challenges. Intel Tech. J. 11(3),
227–238 (2007)
44. Tang, J., Liu, S., Z, G., L, X.-F., Gaudiot, J.-L.: Achieving middleware execution efficiency:
hardware-assisted garbage collection operations. J. Supercomput. 59(3), 1101–1119 (2012)
45. Made in IBM labs: holey optochip first to transfer one trillion bits of information per second
using the power of light, 2012. http://www-03.ibm.com/press/us/en/pressrelease/37095.wss
46. Farrington, N., Porter, G., Radhakrishnan, S., Bazzaz, H.H., Subramanya, V., Fainman, Y.,
Big Data Architecture
Bhashyam Ramesh
Abstract Big data is a broad descriptive term for non-transactional data that are
user generated and machine generated. Data generation evolved from transactional
data, first to interaction data and then to sensor data. Weblogs were the first step in this
evolution. These machine-generated logs of internet activity caused the first growth
of data. Social media pushed data production higher with human interactions.
Automated observations and wearable technologies make up the next phase of big
data. Data volumes have been the primary focus of most big data discussions.
Architecture for big data often focuses on storing large volumes of data. Dollars per
TB (Terabyte) becomes the metric for architecture discussions. We argue this is not
the right focus. Big data is about deriving value. Therefore, analytics should be the
goal behind investments in storing large volumes of data. The metric should be
dollars per analytic performed. There are three functional aspects to big data—data
capture, data R&D, and data product. These three aspects must be placed in a
framework for creating the data architecture. We discuss each of these aspects in
depth in this chapter. The goal of big data is data-driven decision making. Decisions
should not be made with data silos. When context is added to data items, they
become meaningful. When more context is added, more insight is possible.
Deriving insight from data is about reasoning with all data and not just big data. We
show examples of this and argue big data architecture must provide mechanisms to
reason with all data. Big data analytics requires that many different technologies,
including graph analytics, statistical analysis, path analysis, machine learning, and
neural networks, be brought together in an integrated analytics environment. Big
data architecture is an architecture that provides the framework for reasoning with
all forms of data. We end this chapter with such an architecture.
This chapter makes three points: (a) big data analytics is analytics on all data, not
just big data alone; (b) data complexity, not volume, is the primary concern of big
data analytics; (c) the measure of goodness of a big data analytic architecture is
dollars per analytic, not dollars per TB.
B. Ramesh (&)
Teradata Corporation, Dayton, USA
e-mail: vembakkambhashyam@gmail.com
1 Introduction
This chapter gives the architecture for big data. The intent is to derive value from
big data and enable data-driven decision making. The architecture specified here is
for the data-centric portions. It does not cover mechanisms and tools required for
accessing the data. The choice of such mechanisms is based at least in part on the
interfaces necessary to support a wide variety of access tools and methods. They
must support all forms of applications including business intelligence (BI), deep
analytics, visualization, and Web access to name a few.
Big data is a broad term. Wikipedia defines big data as “an all-encompassing
term for any collection of data sets so large and complex that it becomes difficult to
process using traditional data processing applications.” We agree with this defini-
tion of big data.
Benefitting from big data, that is, deriving value from big data, requires processing
all data, big and not so big. In other words, big data processing is not an
isolated exercise. Just as data understanding increases with descriptive metadata,
value of data in general and value of big data in particular increase with context
under which the data is analyzed. This context resides in many places in the
organizations’ data repository. These must be brought under an overarching
architecture in order to fully derive value from big data.
The rest of this chapter is organized as follows:
We discuss data and its key characteristics as foundations for understanding the
data architecture. Sections 2–4 cover these discussions. Section 2 defines big data.
It describes the three different components of big data. Section 3 defines different
characteristics of big data and their evolution. It shows that data growth has been in
spurts and has coincided with major innovations in storage and analytics. Big data
is the next stage in this data growth story. Section 4 specifies a value metric as a
motivation for evaluating the architecture.
We then discuss mechanisms for deriving value from data. Sections 5–7 cover these
discussions. Section 5 specifies three functional components of data-driven decision
making. It explains the phases that data goes through in order to become a product.
Section 6 lays the foundation for the assertion that deriving value from big data is about
reasoning with all data and not just big data. Section 7 focuses on data analytics and the
evolutions in that space. It argues for an integrated analytics framework.
Section 8 gives the overall architectural framework by building on these earlier
sections.
Finally, Sect. 9 puts it all together with a use case.
Gartner defines big data in terms of 3Vs—volume, variety, and velocity [1]. These
three Vs are briefly explained in the following paragraphs. Some add a 4th V for
veracity or data quality [2]. Data quality is a huge problem in practice. Quality of
human-entered data is typically low and even automatically collected data from
mobiles and sensors can be bad: sensors can go bad, sensors may need calibration,
and data may be lost in transmission. In addition, data may be misinterpreted due to
incorrect metadata about the sensor. Data must therefore be checked for correctness
constantly. We do not address the fourth V in this chapter.
The 3Vs can be viewed as three different dimensions of data.
Volume, the most obvious of the 3Vs, refers to the amount of data. In the past,
increase in data volumes was primarily due to increase in transaction volume and
granularity of transaction details. These increases are small in comparison with big
data volumes. Big data volume started growing with user interactions. Weblogs
were the starting point of big data. The size and volume of Weblogs increased with
internet adoption. These machine logs spawned a set of analytics for understanding
user behavior. Yet Weblogs seem small compared to social media. Platforms such as
Facebook, Twitter, and others combined with the increase in capability and types of
internet devices caused a big jump in data production. The constant human desire to
connect and share caused an even bigger jump in data production. Social media
allows users to express and share video, audio, text, and e-mail in volumes with a
large social audience. The corresponding sophistication in mobile technology
makes sharing easier and increases the volumes even further. This is the second
wave of big data. The next big jump in volume will come from automated
observations. All kinds of biometric sensors (heart, motion, temperature, pulse, etc.)
and location sensors (GPS, cellular, etc.), together with the ability to transmit their
readings over cellular, internet, Wi-Fi, and other networks, will drive the next
increase in big data volume. Wearable technologies and the internet of things (IOT)
are the next big thing waiting to happen. These are referred to as observation data.
Sensor data and observations will dwarf the volumes we have seen so far. Figure 1
shows the relative increases in volume as we move from transactions to interactions
to observations.
Data variety is as diverse as the data sources and their formats. Unlike transaction
data which is highly structured in relational form, other data forms are either rela-
tively weakly structured, such as XML and JSON; differently structured, such as
audio and video; or without structure, such as text and scanned documents. Different data
sources such as Weblog, machine log, social media, tweets, e-mails, call center logs,
and sensor data all produce data in different formats. These varieties of data make
analysis more challenging. Combining variety with volume increases the challenges.
Velocity is the speed at which data flows into the system and therefore the speed at
which it must be handled. Data comes from machines and humans. Velocity increases with the
number of sources. Velocity increases with the speed at which data is produced
such as with mobiles and sensors.
Fig. 1 Relative increases in data volume (in exabytes) from transactional data through interactions to observations
We combine the 3Vs and call it the complexity of data. Complexity refers to the
difficulty of analyzing big data and deriving insight and value from it. Deriving
insight from big data is orders of magnitude more complex than deriving insight
from transactional data. Deriving insight from sensor data is more complex than
deriving insight from user-generated data.
Figure 2 shows variety and velocity on one axis versus volume on another axis.
Complexity is the moving front. Complexity increases as the front moves to the
right and top; complexity decreases as the front moves down and left.
One of the purposes of storing and analyzing big data is to understand and derive
value; otherwise, there is no point in storing huge volumes of data. Complexity of
gaining insight is directly related to the complexity of data. Analytics on complex
data is much harder. It is complex in the type of analytic techniques that are needed
and it is complex in the number of techniques that have to be combined for analysis.
We cover these aspects later in this chapter. There is also a chapter dedicated to big
data algorithms in this book.
Figure 3 shows the different types of data that make up the big data space. Each
type of data exhibits different value and different return on investment (ROI).
Each shaded transition in this picture represents a major plateau in data growth.
Each shaded region represents a quantum jump in analytic complexity, data capture,
data storage, and management. The quantum jump is followed by a duration of
incremental growth in such capabilities until the next quantum jump occurs. There
have been many plateaus over the years and many minor
plateaus within each major plateau. Each has coincided with a jump in technology
related to capture, transmission, management, and analytics. Each plateau has
required some new forms of analytics. Each jump in technology has enabled a move
to the next plateau. Each plateau has meant the ability to process much bigger
volumes of data.
There is an ROI threshold for each plateau of data size. This means there is a
balance between the cost associated with storing and managing the data and the
value that can be derived from the data through application of analytics. Storing
data is not useful when the cost exceeds the value that can be derived. ROI
determines whether the data is worth storing and managing. The ROI threshold is
not reached until technology makes storing and analyzing the data cheap enough to
yield a return. This notion is best described using
transactional data as an example. In the past, retail applications used detail data at
the level of store-item-week, and analysis was done at this granularity. Users were
satisfied doing analysis at this summary level of detail when
compared to what they were able to do before. However, users realized more detail
than week-level summary will help, but the technology was unaffordable for them
to store and manage at more detail levels. When technologies made it possible to go
to store-item-day, then to market basket summaries, and then onto market basket
detail, these new data plateaus became the new norm. At each point along the way,
users felt they had reached the correct balance between cost and capability, and that
any further detail and data volume was uneconomical since they did not see any
value in it. However, leading-edge users made the conceptual leap to more detail first.
They were visionaries in knowing how to get value from more detailed data. Now
those plateaus have become commonplace; in fact, they are table stakes in any
analytic environment. It is now relatively inexpensive to store at the smallest SKU
level of detail, practical to do the analysis, and suitable to get value from such
detailed data. There are such examples in every industry. In the communications
industry, the plateau evolved from billing summary data which is few records per
customer per month to storing and managing call detail records which are records of
every call ever made! The new plateau now in the communications industry is
network-level traffic data underlying each call. So there have been such plateaus
before in all varieties of data.
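The ROI-threshold argument above is simple arithmetic: a plateau of detail is worth storing only once technology drives the cost of storing and analyzing it below the value that can be derived. The following sketch illustrates this with entirely hypothetical figures; `plateau_roi` and all prices are illustrative, not data from any deployment.

```python
# ROI threshold for a data plateau: store the data only when the value
# derived from analytics exceeds the cost of storing and analyzing it.
# All figures below are hypothetical illustrations.

def plateau_roi(tb_stored, dollars_per_tb, analytics_run,
                value_per_analytic, cost_per_analytic):
    """Return (net value, worth_storing) for one data plateau."""
    cost = tb_stored * dollars_per_tb + analytics_run * cost_per_analytic
    value = analytics_run * value_per_analytic
    return value - cost, value > cost

# Week-level summaries: small, cheap, modest value per analytic.
print(plateau_roi(5, 200.0, 100, 40.0, 2.0))       # (2800.0, True)

# Market-basket detail at old storage prices: below the ROI threshold.
print(plateau_roi(500, 200.0, 100, 400.0, 10.0))   # (-61000.0, False)

# Same detail once storage is cheap: the plateau becomes economical.
print(plateau_roi(500, 20.0, 100, 400.0, 10.0))    # (29000.0, True)
```

Note that the decision flips on dollars per analytic performed, not dollars per TB alone, which is the chapter's central metric.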
Big data is just another plateau. Currently, we are seeing the leading edge of the
big data plateau.
The capture and storage technologies appropriate for transactions are inappropriate
for Weblogs, social media, and other big data forms. Technologies that can
store large volumes of data at low cost are needed. Leading-edge users have found
ways to derive value from storing interactions such as clickstreams and user nav-
igation of Web sites. They determine the likes and dislikes of customers from these
data and use them to propose new items for review. They analyze social graphs in
order to differentiate between influencers and followers with a goal to tailor their
marketing strategies. Such analyses are now moving from the leading edge to
common place.
The next data plateau is on the horizon. The next jump will be from wearable
technologies, sensors, and IOT. Many users wonder how to derive value, while
leading-edge thinkers are finding innovative ways.
Social networks and the digital stream, which are the current focus of big data, are
of medium complexity in comparison with sensor data. Special scientific
installations such as the supercollider already store sensor data. IOT will make such
volumes commonplace. Mobile phones with biometric and other sensors collect
large amounts of data and transmit them easily. These data already far exceed those
of any social network. Wearable computing and IOT will increase this further. IOT is like
an electronic coating on everyday things, a sensor embedded on everyday items that
makes them seem sentient. IOTs can be linked together to form a network of
sensors that can provide a bigger picture of the surroundings being monitored—a
Borg-like entity without the malevolent intent.
Leading-edge companies and applications are starting to experiment with the
beginning stages of this next plateau. Currently, sensor telemetry is an example of
what they are doing. Insurance companies were considered innovators when they
used credit rating scores with transaction data to determine premiums. Now they are
innovating with big data. They create customized insurance policies based on
individual driving behavior such as how far you drive, how well you drive, when
you drive, what conditions under which you drive, and where you drive. They
monitor every driver action to understand driving behavior. Clearly, a lot of sensors
are involved in collecting and transmitting this data. Typically the driver has a small
device in the car such as under the steering column. This device monitors the
drivers’ pattern of driving—speed, rash cornering, sudden stops, sudden accelera-
tions, time and distance-driven, weather conditions under which the car is being
driven such as rain and snow, light conditions such as day or night, areas where the
car is being driven, the driving speed in different areas—and transmits them in real
time to the insurer or aggregator. The insurer uses this data to create a compre-
hensive picture of the driver, his driving habits, and the conditions under which he
drives, perhaps even his state of mind. This allows insurance premiums to reflect
actual driving habits. The driver also has access to this data and can use it to change
his behavior, with a potential reduction in premium. There are different names for
this form of car insurance such as pay as you drive, pay how you drive, and usage-
based premium. Soon sensor telemetry will move from the innovators to become
the norm in the insurance industry. Currently, adoption is restricted by such things
as state and country regulations, privacy considerations, drivers’ willingness
to be monitored, and other information sharing concerns.
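The pay-how-you-drive scheme described above can be pictured as a stream of telemetry events that the insurer aggregates into a driving-behavior score. This is a toy sketch: the `DrivingEvent` fields and the weights in `risk_score` are illustrative assumptions, not any insurer's actual model.

```python
# Hypothetical telemetry sketch: the in-car device emits events (speed,
# braking, time of day, weather) and the insurer condenses them into a
# simple per-trip score. Field names and weights are assumptions.

from dataclasses import dataclass

@dataclass
class DrivingEvent:
    speed_kmh: float
    hard_brake: bool      # sudden stop detected
    night: bool           # driven after dark
    raining: bool         # adverse weather

def risk_score(events):
    """Crude per-trip risk score: higher means riskier driving."""
    score = 0.0
    for e in events:
        if e.speed_kmh > 120:
            score += 2.0
        if e.hard_brake:
            score += 1.5
        if e.night:
            score += 0.5
        if e.raining:
            score += 0.5
    return score / max(len(events), 1)

trip = [DrivingEvent(130, True, True, False),
        DrivingEvent(80, False, True, False)]
print(risk_score(trip))  # 2.25
```

A real system would of course stream millions of such events per day and combine them with contextual data (road, weather, area), which is exactly the volume and variety pressure discussed above.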
Sensor telemetry is part of a larger movement called telematics. Telematics is the
broad term for machine to machine communication and implies a bidirectional
exchange of data between endpoints. The data is from sensing, measuring, feedback,
and control.
4 Characteristics of Data
This section shows some properties of data that have a bearing on the architecture.
Characteristics of data differ with their type and storage format. Data can be
stored in highly modeled form or in a form that lacks a model. There are strong data
models for storing transactional data such as relational (ER) models and hierar-
chical models. There are weaker models for storing Weblogs, social media, and
document storage. There are storage forms that lack a model such as text, social
media, call center logs, scanned documents, and other free forms of data. The data
model has a bearing on the kinds of analytics that can be performed on them.
1. Value Density
Value density or information per TB is a measure of the extent to which data has
to be processed for getting information. It is different in different data forms.
Transaction data has little extraneous data; even if present, it is removed when the
data is stored to disk. Social media data, on the other hand, is sparse.
Social media data is typically small amounts of interesting information within a
large collection of seemingly repetitive data. Such data may contain the same
thought or words multiple times. It may also contain thoughts and words outside
the focus of discussion. Similar cases occur in other user produced data streams.
Sensors produce data at fixed intervals and therefore may contain redundant
data. The challenge is filtering the interesting information from the seemingly
extraneous. Filtering useful data reduces the amount of data that needs to be
stored. Such filtering occurs upfront in strong data models, leading to higher
information density. In weaker models, data is stored much closer to how it is
produced, and filtering occurs prior to consumption.
The information density or value density increases as data is processed and
structured. Value density is a continuum that rises as data moves from unstructured
forms and gains structure.
Value density of data has a bearing on storage ROI. For instance, low-cost
storage may be appropriate to store low value density data.
2. Analytic Agility
Data organization affects the ability to analyze data. Analytics is possible on all
forms of data including transactional and big data. Analytics on structured
transactional data with strong data models are more efficient for business
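Returning to value density (point 1 above): a toy filter shows how repetitive raw social data can be reduced before storage, raising information per byte. The dedup-by-normalized-text rule in `filter_stream` is a stand-in assumption for real filtering logic, which would be far more sophisticated.

```python
# Toy illustration of value density: a raw social stream carries the same
# thought many times, so filtering before storage raises the information
# per byte stored. Normalised-text dedup stands in for real filtering.

def filter_stream(posts):
    seen, kept = set(), []
    for p in posts:
        key = " ".join(p.lower().split())  # normalise case and whitespace
        if key not in seen:
            seen.add(key)
            kept.append(p)
    return kept

raw = ["Great game!", "great  game!", "Great game!", "Team wins the final"]
filtered = filter_stream(raw)
print(len(filtered) / len(raw))  # 0.5 -> half the bytes, same information
```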
Wikipedia defines a data lake as a “storage repository that holds a vast amount of
raw data in its native format until it is needed.” It further says “when a business
question arises, the data lake can be queried for relevant data, and that smaller set of
data can then be analyzed to help answer the question.”
The above-mentioned definition means that the data lake is the primary repository of
raw data, and raw data is accessed only when the need arises. The accessed data is
analyzed either at the data lake itself or at the other two functional points.
Raw data is produced at ever-increasing rates, and the volume of overall data an
organization receives is increasing. These data must be captured as quickly as
possible and stored as cheaply as possible. None of the data must be lost. The value
density of such data is typically low and the actual value of the data is unknown
when it first arrives. So an infrastructure that captures and stores it at low cost is
needed. It need not provide all the classic ACID properties of a database. Until the
data is needed or queried, it is stored in raw form. It is not processed until there is a
need to integrate it with the rest of the enterprise.
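The schema-on-read behavior this implies can be sketched as follows: raw records are kept untouched in their native format, and structure is imposed only when a business question arrives. The record layout and the `query_slow_pages` question are assumptions made for illustration.

```python
# Schema-on-read sketch: the lake keeps raw lines as-is; parsing happens
# only when a question is asked, and irrelevant records are simply skipped.

import json

RAW = [
    '{"user": "u1", "page": "/home", "ms": 120}',
    'not-json sensor line 42',               # unparseable now, kept anyway
    '{"user": "u2", "page": "/cart", "ms": 340}',
]

def query_slow_pages(raw_lines, threshold_ms):
    """Parse on demand; skip records this question cannot use."""
    hits = []
    for line in raw_lines:
        try:
            rec = json.loads(line)
        except ValueError:
            continue                         # raw data stays; just not relevant yet
        if rec.get("ms", 0) > threshold_ms:
            hits.append(rec["page"])
    return hits

print(query_slow_pages(RAW, 200))  # ['/cart']
```

The design choice is deliberate: no value is destroyed at capture time, because the value of raw data is unknown when it first arrives.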
Raw data must be transformed for it to be useful. Raw data comes in different
forms. The process of raw data transformation and extraction is called data wran-
gling. Wikipedia [4] defines the process of data wrangling as “Data munging or data
wrangling is loosely the process of manually converting or mapping data from one
“raw” form into another format that allows for more convenient consumption of the
data with the help of semi-automated tools. This may include further munging, data
visualization, data aggregation, training a statistical model, as well as many other
potential uses. Data munging as a process typically follows a set of general steps
which begin with extracting the data in a raw form from the data source, “munging”
the raw data using algorithms (e.g. sorting) or parsing the data into predefined data
structures, and finally depositing the resulting content into a data sink for storage
and future use.” It elaborates on the importance of data wrangling for big data as
follows “Given the rapid growth of the internet such techniques (of data wrangling)
will become increasingly important in the organization of the growing amounts of
data available.” Data wrangling is one form of analysis that is performed in the data
lake. This process extracts from multiple raw data sources and transforms them into
structures that are appropriate for loading onto the data product and data R&D. In
general, it cleanses, transforms, summarizes, and organizes the raw data before
loading. This process is performed for non-transactional big data besides transactional
data. It loads the data R&D platform with Weblog, social, machine, sensor, and
other forms of big data. The data lake or data platform increases the value density
by virtue of the cleansing and transformation process. According to eBay, a
Big Data Architecture 41
leading-edge big data user, “cutting the data the right way is key to good science
and one of the biggest tasks in this effort is data cleaning” [5].
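The wrangling steps just described (extract raw records, munge them by parsing into predefined structures, deposit the result into a sink) can be sketched in a few lines of Python. The log format, field names, and cleansing rule below are invented for illustration; real raw data would need far richer rules and the semi-automated tools the definition mentions.

```python
import re

# Hypothetical raw log lines, as they might land in the data lake.
RAW_LOGS = [
    '2015-03-01T10:02:11 user=alice action=view item=1234',
    '2015-03-01T10:02:45 user=alice action=buy  item=1234',
    'CORRUPTED LINE ###',
    '2015-03-01T10:05:02 user=bob   action=view item=9876',
]

# Assumed log layout; a real source would have many formats.
LOG_PATTERN = re.compile(
    r'(?P<ts>\S+)\s+user=(?P<user>\w+)\s+action=(?P<action>\w+)\s+item=(?P<item>\d+)'
)

def wrangle(lines):
    """Extract raw lines, parse them into a predefined structure,
    and deposit the cleansed records into a sink (here, a list)."""
    records = []
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m is None:                    # cleansing: drop unparseable rows
            continue
        rec = m.groupdict()
        rec['item'] = int(rec['item'])   # transformation: type coercion
        records.append(rec)
    return records

structured = wrangle(RAW_LOGS)
```

The corrupted row is silently dropped, a deliberately simplistic stand-in for the cleansing step; a production pipeline would quarantine and audit such rows instead.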
Some of the insights gained here from analysis are stored locally in the data
platform and others are loaded onto the data product or data R&D platforms. In
order to perform analysis, some form of descriptive metadata about the raw data
should also be available in the data platform. Metadata describes the objects in the
various systems and applications from where data is collected, much like a card
index in a library. Metadata is needed for analyzing the raw data.
Hadoop is the platform often associated with the data lake or data platform. Data
is stored in clusters of low-cost storage. Hadoop uses a distributed file system for
storage, the Hadoop distributed file system (HDFS), which stores data redundantly
and is fault-tolerant. The storage format is key-value pairs. Data is processed
in clusters of low-cost nodes using MapReduce which is a highly scalable pro-
cessing paradigm. Hadoop provides horizontal scaling for many kinds of
compute-intensive operations. Therefore, many kinds of analytics are possible in
data platform.
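The MapReduce paradigm can be illustrated with a single-process Python toy. It is not Hadoop itself, but it performs the same three steps a Hadoop job distributes across a cluster: map emits key-value pairs, shuffle groups them by key, and reduce aggregates each group.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework
    would between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the values for each key."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data needs big infrastructure", "data lakes store raw data"]
counts = reduce_phase(shuffle(map_phase(docs)))
```

Horizontal scaling comes from running many mappers and reducers in parallel on separate nodes; this toy keeps everything in one process purely to show the programming model.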
The performance metrics for the data platform are TB per second of data capture
capacity and dollars per TB of data storage capacity.
Insight gained from analytics is productized for strategic and tactical decision
making. An enterprise data warehouse (EDW) is the platform often associated with
data product. Reports, analytical dashboards, and operational dashboards with key
performance indicators are produced at the warehouse. These provide insight into
what happened, why it happened, and what is likely to happen.
Data arrives at the warehouse from the data platform categorized and in a form to
fit the relational model structure. This structure presupposes how the data is going
to be analyzed. This is unlike how data arrives at the data platform.
The data storage structure has a bearing on the insight derivable from data (see
section on analytic agility). Data product provides insight on relational forms of
storage. Insight comes from traversing the data in complex and innovative ways.
The extent to which such traversal is possible is the extent to which new insight can
be gained from the data. Such insights include predictive analytics. For instance, it
can determine whether a customer has a propensity toward fraud or whether a
transaction is in reality a fraudulent transaction. Such analyses require among other
things an understanding of the card holder’s previous transactions, an understanding
of the transactions that typically end up as fraudulent transactions, and the
conditions under which they become fraudulent. All this goes in deciding whether
to approve the current card transaction. Similarly, predicting the success of a
marketing campaign is possible depending on the analysis of previous campaigns
for related products during related times of year. Similar examples exist in other
industries.
42 B. Ramesh
Data R&D provides insight on big data. Insight also means knowing what questions
are worth answering and what patterns are worth discerning. Such questions and
patterns are typically new and hitherto unknown. Knowing such questions and
patterns allows for their application at a later time including in the data product
portion. Therefore, data R&D is what creates intelligence through discovery for the
organization on new subject matters. It also refines and adds to previous insight. For
example, it is a new discovery to determine for the first time that card transactions
can at times be fraudulent. A refinement of this discovery is to recognize the kinds
of behavior that indicate fraud. In other words, having the intelligence to ask the
question “can a transaction be fraudulent?” is a big learning. Determining the patterns
that lead to fraud is another form of learning.
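One hedged illustration of learning fraud patterns from history: estimate, per value of a transaction attribute, the historical fraud rate, and flag new transactions whose attribute values have been risky before. The data, attribute, and threshold below are invented; real fraud models combine many more signals than a single-attribute rate.

```python
from collections import Counter

# Hypothetical labeled history: (merchant_category, is_fraud).
HISTORY = [
    ('electronics', True),  ('electronics', True), ('electronics', False),
    ('groceries',  False),  ('groceries',  False), ('groceries',  False),
    ('travel',     True),   ('travel',     False),
]

def fraud_rates(history):
    """Learn, per category, the fraction of past transactions
    that turned out to be fraudulent."""
    totals, frauds = Counter(), Counter()
    for category, is_fraud in history:
        totals[category] += 1
        if is_fraud:
            frauds[category] += 1
    return {c: frauds[c] / totals[c] for c in totals}

def risky(category, rates, threshold=0.5):
    """Flag a new transaction whose category has a high historical
    fraud rate (threshold is an arbitrary illustrative choice)."""
    return rates.get(category, 0.0) >= threshold

rates = fraud_rates(HISTORY)
```

The first discovery (that transactions can be fraudulent at all) is what motivates collecting the labels; the rate table is the refinement, the learned pattern that can later be applied operationally.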
Data R&D is the primary site where deep analytic occurs on big data. Model
building is a key part of deep analytics. Knowledge or insight is gained by
understanding the structure that exists in the data. This requires analysis of different
forms of data from different sources using multiple techniques. This portion has the
ability to iteratively build models by accessing and combining information from
different data sources using multiple techniques (refer Sect. 7). It also constantly
refines such models as new data is received.
There are multiple techniques and technologies in the R&D space. The ability to
perform analytic on any type of data is a key requirement here.
The metrics for this functional aspect of data are dollars per analytic performed and
dollars per new pattern detected.
Analysis of big data goes through these stages—acquire, prepare, analyze, train,
validate, visualize, and operationalize. The data platform acquires data and prepares
it. The analytic part in R&D recognizes patterns in big data. Many big data analytics
use correlation-based analytic techniques. A correlation observed across a large
volume of data can suggest a cause–effect relationship among the data elements. Analyzing
large volumes of data gives confidence that such patterns have a high probability of
repeating. This is the validate part which ensures that the patterns recognized are
legitimate for all different scenarios. The train part is to automate the pattern rec-
ognition process. The operationalize part is to put structure around the data so they
can be ready for high-volume analytics in the product platform.
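As a hedged illustration of the analyze and validate stages, the sketch below computes a Pearson correlation between two prepared series using only the Python standard library. The series are invented, and a correlation found this way would still need validation across many scenarios before being treated as a dependable pattern.

```python
import math

def pearson(xs, ys):
    """Analyze: Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical prepared data: weekly ad impressions vs. purchases.
impressions = [10, 20, 30, 40, 50]
purchases   = [1, 2, 3, 4, 5]
r = pearson(impressions, purchases)
```

A value of r near 1 or -1 flags a candidate pattern; the validate stage would then check that the relationship holds on other time windows and customer segments before the train and operationalize stages automate it.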
Deriving value from data is not about big data alone. It is about capturing, trans-
forming, and dredging all data for insight and applying the new insight to everyday
analytics.
Big data analysis has been about analyzing social data, Web stream data, and
text from call centers and other such big data sources. Typical approaches build
independent special systems to analyze such data and develop special algorithms
that look at each of these data in isolation. But analyzing these data alone is not
where the real value lies. Building systems that go after each type of data or
analyzing each type of data in isolation is of limited value. An integrated approach
is required for realizing the full potential of data.
Analyzing without a global model or context does not lead to learning. There is
definitely some learning that exists in such analysis however. Social Web data or
textual entries from a call center contain extremely valuable information. It is
interesting to analyze a Web clickstream and learn that customers leave the Web
site before they complete the transaction. It is interesting to learn that certain pages
of a Web site have errors or that certain pages are more frequently visited than
others. It is interesting to learn that a series of clicks in a specific order is predictive
of customer churn. It is interesting to learn how to improve user experiences with a
Web site. It is interesting to learn in depth about the behavior of a Web site. It is
interesting to learn from graph analysis of the social Web the influencers and
followers. There is value in all this. However, this value is far less than the value
that is derived when it is combined with other data and information from other
sources from the rest of the organization. A kind of Metcalfe’s law applies to data
—connecting data sources and analyzing them produces far more value than ana-
lyzing each of the data sources alone. Combining the behavior of a customer on the
Web site with what is already known about the customer from his transactions and
from other interactions provides more insight about the customer. For instance, a
call center conversation shows a trend in how people talk to each other but com-
bining information about the customer on the call and information about the call
center representative on the call (with the customer) adds context about both parties
and shows what kind of representative has what kind of effect on what kind of
customer. This is more valuable insight than how people talk to each other.
Similarly combining product experiences expressed on the Web with product
details, suppliers’ information, and repair history gives a better understanding of the
views expressed in social media. Analyzing social media such as Twitter and
Facebook can reveal an overall sentiment trend. For instance, it can tell how people
feel about a product or company. It can tell about whether people are complaining
about a product. Such information in isolation is interesting but lightweight in
value. Linking the active people on the social media with the customer base of the
company is more valuable. It combines customer behavior with customer trans-
actions. The sentiments become yet another attribute of a known customer and can
be analyzed holistically with other data. With a holistic view, social Web chatter
becomes tailored and more actionable.
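The kind of linking described above, making sentiment just another attribute of a known customer, can be sketched as a simple join. All records, handles, and scores below are invented; in practice the handle-to-customer mapping is itself a hard matching problem.

```python
# Hypothetical customer records from the EDW, keyed by customer id.
customers = {
    101: {'name': 'alice', 'lifetime_value': 9500.0},
    102: {'name': 'bob',   'lifetime_value': 1200.0},
}

# Hypothetical sentiment scores mined from social media, keyed by handle.
sentiment = {'@alice': -0.8, '@carol': 0.4}

# Hypothetical mapping of social handles to customer ids.
handle_to_id = {'@alice': 101, '@bob': 102}

def enrich(customers, sentiment, handle_to_id):
    """Attach sentiment as one more attribute of each known customer,
    leaving the original records untouched."""
    enriched = {cid: dict(rec) for cid, rec in customers.items()}
    for handle, score in sentiment.items():
        cid = handle_to_id.get(handle)
        if cid in enriched:              # link only identifiable customers
            enriched[cid]['sentiment'] = score
    return enriched

view = enrich(customers, sentiment, handle_to_id)
```

The unmatched handle ('@carol') is simply dropped here; combined with lifetime value and transaction history, the negative sentiment on a high-value customer becomes an actionable signal rather than anonymous chatter.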
Adding global context to any isolated analysis increases the value of such
analysis significantly. This is true for transaction analysis. This is true for raw data
transformation and cleansing. This is true for big data analysis. For instance,
analyzing sensor data with context from patient information and drug intake
information makes the sensed data more meaningful.
Therefore, deriving value from big data is not about analyzing only big data
sources. It is about reasoning with all data including transactions, interactions, and
observations. It is myopic to build an environment for analyzing big data that treats
big data sources as individual silos. A total analytic environment that combines all
data is the best way to maximize value from big data. This is one of the key
requirements for our big data architecture—the ability to reason with all data.
The three functional components we discussed earlier—Data lake, data R&D,
and data product—must all be combined for any analysis to be complete. Therefore,
reasoning with all data is possible only if the three aspects of data are integrated.
We call this integrated analytics.
The discovery process puts all these three functional components of data-driven
decision making together through integrated analytics.
Sales listing applications, barter listing applications, or auction item listing
applications are examples of integrated analytics. Such applications determine,
through analysis, that a number of different attributes of an item listed for sale
impact a buyer’s purchasing decision. The goodness metric on each of these
attributes affects the order in which the sale items are listed in a search result. For
instance, if multiple parties list similar products, the search engine would return and
list the results based on the ranking of these attributes. It is advantageous for better
buyer response to have items listed in the beginning of the list rather than lower
down in the list. Therefore, sellers prefer their items in the beginning of the list.
There are three functions that are necessary in this application. One, determining
the list of attributes that users consider important. This means finding out what
attributes impact customer choices. This is a deep analytic. Two, rating the quality
of each attribute for each item being listed. Attributes may be objective such as
price and age. Attributes may be subjective such as product photograph, product
packaging, and product color. Analysis is required to rate the quality of each
attribute. Such analysis is harder for subjective attributes. Subjective items such as
photograph quality and packaging quality should be analyzed on the basis of
background, clarity, sharpness, contrast, color, and perhaps many others. Data
platform performs these kinds of analytic using MapReduce. Three, combining all
these to determine the listing order. The data product considers the rating of each of
the attribute in order to determine the listing order of each item for sale when a
search is performed.
Data R&D is the challenging part. The other two parts are fairly straightforward.
Hadoop clusters meet the requirement of the data platform. Hadoop clusters are
scalable and are the preferred solution for the data platform. They can capture all
forms of data from an array of sources and store them with redundancy and high
availability. They can also do the forms of analytics mentioned earlier using
MapReduce. An EDW meets the requirements of the data product. Parallel RDBMSs can
meet the complexity and performance requirements.
The challenge is the middle part or data R&D. Neither MapReduce alone nor
relational technology alone is appropriate for this phase. RDBMS requires the data
to fit into well-defined data models. Hadoop clusters require superhuman
programmers to program new applications.
Figure 4 shows this challenge.
Data R&D is the focus of a number of innovations and new start-ups.
Technologies such as graph engine, text analytics, natural language processing,
machine learning, neural networks, and new sensor models are being explored for
big data. There are four broad types of efforts currently underway:
1. Bottom-up approaches. These provide relational database-like aspects on top of
Hadoop. They add SQL or SQL-like features to Hadoop and HDFS. Some
leading Hadoop vendors and Hadoop open source groups fall under this category.
[Figure: The solution. An enterprise data warehouse combined with an integrated discovery platform (IDP).]
The discovery platform starts with a mature SQL engine which acts as a dis-
covery hub operating on mixed data—big data, HDFS data, relational data, and
other stores. It has different forms of local store such as a graph store besides SQL
store. It uses the SQL language as a kind of scripting facility to access relational
store and other stores through UDF and proprietary interfaces. The SQL engine
provides the usual SQL platform capabilities of scalability and availability. The
platform can also execute MapReduce functions. The discovery platform may
natively implement big data analytic techniques such as time series analysis and
graph engine. The platform may also natively support other data stores such as
HDFS and graph storage. It may also access stand-alone techniques such as
machine learning to return results that are refined by that technique.
The industry can encourage this form of discovery through standardization
efforts for interaction across platforms for data and function shipping. Currently,
these are mostly ad hoc and extensions are specific to the vendor.
Reasoning with all data is possible only if the three functional aspects of data are
interconnected.
Earlier, we specified three functional aspects of data. These aspects evolve data
from raw form into a product through R&D. Each aspect maps to a platform:
1. Data lake or data platform. Raw data lands here and is stored. This platform has
the capacity and cost profile to capture large volumes of data from varied
sources without any loss or time lag. This is typically a Hadoop cluster.
2. Data product. BI, analytical dashboards, and relational forms of predictive
analytics happen here. This platform supports high reuse and high concurrency
of data access. It supports mixed workloads of varying complexity with large
number of users. This is typically a parallel EDW platform.
3. Data R&D. Deep analytics from big data happens here. This platform deter-
mines new patterns and new insights from varied sets of data through iterative
data mining and application of multiple other big data analytic techniques. The
system combines a number of different technologies for discovering insight.
This is the integrated discovery platform with tentacles that reach into other
stand-alone technology engines besides implementing multiple technologies and
multiple data stores natively.
Data lake is satisfactory for stand-alone applications. For example, a ride-sharing
company can create a stand-alone booking application in the data lake. The booking
application connects passengers to drivers of vehicles for hire. Customers use
mobile apps to book a ride, cancel a ride, or check their reservation status and other
similar operations. Such applications have a high number of users and a high
number of passenger interactions. Such applications can be implemented as a
stand-alone entity in the data lake using Hadoop. The application is scalable in
terms of storage capacity and processing capacity.
An integrated analytic is, however, required if we want to go beyond stand-alone
applications. Such analytic is more valuable. For example, an integrated analytic
that reaches elsewhere for data is required if the ride-sharing company wants to
consider as part of the booking application other information such as vehicle
maintenance records, customer billing records, and driver safety records. These
kinds of information usually come from other places in the organization, chiefly
from the EDW. Integrated analytics require combining data lake and data product.
If in addition the analytic wants to understand driver routes and driving patterns and
to perform graph analytics on driver movements and location of booking data, R&D
becomes a part of this integration with data lake and data product.
This requires an architecture that integrates all three platforms as shown in
Fig. 7.
The focus of this architecture is the data layer and everything else is pushed off
to the side. A complete architecture needs all the access tools, data acquisition tools,
and all visualization tools. It requires multiple mechanisms to transport and load
data in parallel.
[Figure 7: The integrated architecture. A data platform (data lake/data hub) receives images, machine logs, and text; a data warehouse serves business intelligence and predictive, ad hoc analytics for applications and operational systems (SCM, CRM); a discovery platform provides exploratory and discovery analytics, with math and stats capabilities, for business analysts, data scientists, customers, and partners.]
It requires all the client tools that interact with each of these three
platforms. Some of these facets of the architecture are merely shown as small boxes
but are not elaborated or explained in this picture. Each of these is an entire
technology and outside the data architecture portion of this chapter.
Data connections between the three data storage platforms (lines and arrows in
this chart) are more important than the data stores. The data stores are an acknowledgment
that there must be different kinds of platforms with different kinds of capabilities
and different kinds of attributes to build an overall successful analytic solution.
These were elaborated in the previous sections of this chapter. These evolve over
time. The arrows show the need for their interaction. As observed earlier, the
industry is doing very little by way of standardization in these interaction areas. It is
left to individual vendors to evolve their own solution.
Unlike transactional data, the value of big data is often unknown when it initially
arrives or is accepted for storage. It is not possible to know all the patterns the data
exhibits unless we know in advance all the questions that are worth answering from
that data. New questions may come at any time in the future. Therefore, organizing
the data upfront and storing only the organized form is not very useful for such
analytics. Upfront organization loses some of the value that is in the original data,
even though that value may be unknown at the time of storage.
The data platform is the data landing zone. All new forms of raw data arrive at this
platform for storage and processing. The goal is to get the data in, especially new
types of data, as quickly and directly as possible; land it in the cheapest storage; and
have a platform to efficiently filter, preprocess, and organize the landed data for
easier analysis. The primary point of the data platform is to deal with low value
density data and increase the value density.
Hadoop is the platform of choice here. It has a high degree of parallelism and
low cost of horizontal scaling. MapReduce is agile and flexible. It can be pro-
grammed to perform all forms of data transformations. In addition, an increasing
number of tools are available to work on data in Hadoop for improving data quality
and value density. These range from traditional extract–transform–load (ETL) tools
expanding their footprints to new data wrangling tools that are developed specifi-
cally for this space.
An example analytic that was covered earlier was to score and rate subjective
attributes such as photographs and packaging for their quality for items listed for
sale.
Organizing and transforming are forms of analytics that improve value density.
Organizing means, for instance, adding cross-referencing links between store ID and
customer ID. Sessionizing is an example of organizing data such as Web records in
order to infer that all the Web records are part of a particular conversation with a
user. This is a highly resource-intensive and complex operation especially since
volumes are high and session logs interleave many different users.
Sessionization is a common form of organization for customer-related interac-
tions. A customer may interact from multiple touch points. Making a chronological
order of those interactions organizes the data for analysis. Figure 8 shows customer
interactions across multiple channels.
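A minimal sessionization sketch in Python, under the common assumption (not stated in this chapter) of a 30-minute inactivity timeout: events for a user are stitched into chronological sessions, and a new session starts after a long gap.

```python
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=30)   # assumed inactivity cutoff

def sessionize(events):
    """Group (user, timestamp) events into per-user sessions: a new
    session starts when a user is inactive longer than TIMEOUT."""
    sessions = {}   # user -> list of sessions (each a list of timestamps)
    for user, ts in sorted(events, key=lambda e: e[1]):  # chronological
        user_sessions = sessions.setdefault(user, [])
        if user_sessions and ts - user_sessions[-1][-1] <= TIMEOUT:
            user_sessions[-1].append(ts)    # continue current session
        else:
            user_sessions.append([ts])      # start a new session
    return sessions

t = lambda h, m: datetime(2015, 3, 1, h, m)
events = [
    ('alice', t(10, 0)), ('alice', t(10, 10)),   # one session
    ('alice', t(12, 0)),                         # gap > 30 min: new session
    ('bob',   t(10, 5)),
]
sessions = sessionize(events)
```

The real workload is hard for exactly the reasons the text gives: at scale the logs interleave millions of users, so the grouping and sorting shown here must be done in parallel, typically with MapReduce keyed by user.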
Another example of organizing is text analytics where say four encoded attri-
butes from call center interaction text are extracted and forwarded to the customer
record. For example, this customer complained about this problem about this
product on this day could be the four attributes. The general goal is to structure the
data or add a bit of structure to the data for ease of understanding.
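The four-attribute extraction just described can be sketched with a regular expression, assuming (purely for illustration) a rigidly phrased note; real call-center text would require natural language processing rather than a fixed pattern.

```python
import re

# Hypothetical, rigidly phrased call-center note.
NOTE = "2015-03-02: customer 101 complained about overheating on product X200"

# Assumed note layout; the field names mirror the four encoded attributes.
PATTERN = re.compile(
    r'(?P<day>\d{4}-\d{2}-\d{2}): customer (?P<customer>\d+) '
    r'complained about (?P<problem>\w+) on product (?P<product>\w+)'
)

def extract_attributes(note):
    """Structure the text: pull out (customer, problem, product, day)
    so they can be forwarded to the customer record."""
    m = PATTERN.match(note)
    return m.groupdict() if m else None

attrs = extract_attributes(NOTE)
```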
It is important to determine and stitch together the customer interactions in time
order. The customer profile and transaction data are integrated with the sessionized
data. The integration of all this data along with path analysis determines the cus-
tomer interactions that led to the final outcome. Combined with the profitability of
the customer, necessary action can be taken. This example shows multiple tech-
nologies being combined: sessionization, pathing, and relational databases.
ETL is another form of organization that is also performed on the data platform.
ETL applies to both big data and transactional data. ETL on big data includes
operations on raw data such as structured and semi-structured transformations,
sessionization, removing XML tags, and extracting keywords.
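As a hedged sketch of two of the ETL operations named above, removing XML tags and extracting keywords, the following Python fragment uses a naive regular-expression tag stripper and a frequency count over an invented stop-word list; production ETL tools are far more sophisticated.

```python
import re
from collections import Counter

STOP_WORDS = {'the', 'a', 'of', 'and', 'to', 'is'}   # tiny illustrative list

def strip_xml_tags(text):
    """Remove XML/HTML tags, keeping only the textual content."""
    return re.sub(r'<[^>]+>', ' ', text)

def extract_keywords(text, top_n=3):
    """Naive keyword extraction: the most frequent non-stop-words."""
    words = re.findall(r'[a-z]+', text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [w for w, _ in counts.most_common(top_n)]

raw = ('<doc><title>Engine fault</title>'
       '<body>The engine fault caused the engine to stall</body></doc>')
clean = strip_xml_tags(raw)
keywords = extract_keywords(clean)
```

Both functions increase value density in the sense the chapter uses: the output is smaller, cleaner, and closer to a form the warehouse can load.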
Archival is another key operation performed at the data platform. The data
landing zone has different retention periods compared to older architectures. In the
past, the landing zone was a very transient place where structure was added through
ETL processes and then the landed data was destroyed. Only the structured data
was retained for regulatory and other archival purposes. The landing zone in this
architecture is where raw data comes in and value is extracted as needed. The raw
data is archived.
There are privacy, security, and governance issues related to big data in general
and the data platform in particular. Access to raw data cannot be universal. Only the
right people should have access. Also data life cycle management issues about
when to destroy data and the considerations for doing that are important. There are
also questions about how to differentiate between insights derived versus third-party
data and how to protect each of these. Some of these issues are covered in other
places in this book. Some are outside the scope of this chapter. They are mentioned
here to indicate these have to be considered in the overall architecture discussions.
Parallel RDBMSs are the platform of choice for the data product platform. BI,
dashboards, operational analytics, and predictive analytics are all performed here.
Such analytics are based upon relational forms of data storage. Profitability analysis
is an example of the kind of predictive analytics this platform performs. The
profitability of a customer is based on revenues from the customer versus costs
associated with maintaining the relationships with the customer in a specific period.
Future profitability is predicted by associating customer characteristics and trans-
action patterns over a period with other known customers’ and their characteristics.
Profitability is also predicted during customer acquisition. Questions such as
whether the customer will be profitable and is he worth acquiring are answered
here. These are not covered further in this chapter.
As mentioned earlier, big data analytics is a combination of analytics using
MapReduce on flat files and key-value stores, SQL on relational stores, and other
forms of analytics on other forms of storage. Different analytics are combined for
discovery in the discovery or R&D platform. Many different technologies are
involved in this area. Innovation is also constantly changing this space. The
discovery process is different from the discovery or R&D platform. If the problem
is SQL tractable and the data is already in the warehouse, then discovery is done in
the EDW. If a particular algorithm is appropriate on Hadoop and HDFS, then
discovery is done in the Hadoop platform. The discovery process is done in the
discovery platform when big data analysis has to combine SQL,
MapReduce, and other technologies such as graph engines, neural nets, and
machine learning algorithms.
The discovery or R&D platform is one of the key pieces of the architecture for a
learning organization. New insights gained here are used by other components to
drive day-to-day decision making. It has the capabilities and technologies to derive
new insights from big data by combining data from other repositories. It has the
capabilities to apply multiple techniques to refine big data.
Health care is an example of such capabilities. Big data analytics are applied
successfully in different care areas. One example application identifies common
paths and patterns that lead to expensive surgery. Another application reduces
physician offices’ manual efforts for avoiding fraud and wasteful activities. These
applications combine different kinds of analytics. They pool data from all three
platforms. Data sources include physician notes, patient medical records, drug
information, and billing records. Analytic techniques applied include fuzzy
matching, text analytics, OCR processing, relational processing, and path analytics.
In industries such as insurance, sessionization and pathing techniques are combined
to understand the paths and patterns that lead to a policy purchase from a Web
site, or to analyze customer driving behavior for risk analysis and premium pricing.
In the aircraft industry, graph analytics, sessionization, and pathing are combined to
analyze sensor data along with aircraft maintenance records to predict part failures
and improve aircraft safety.
As we saw earlier, interactions are much higher in volume than transactions. It is
possible to predict the value of a customer from his transactions. It is possible to
understand and predict behavior of a customer from his interactions. Predicting
behavior is more valuable insight. Data analytic techniques on different forms of
user interactions make such predictions possible. There are different techniques to
predict behavior.
Clustering is an analytic technique to predict behavior. Collaborative filtering
and affinity analysis techniques fall under this category. The idea is that if a person
A has tastes in one area similar to a person B, he is likely to have similar tastes in
another area. Market basket analysis and recommender systems are examples of this
form of analytics. “People who bought this also bought this other item” and “You
may be interested in viewing this profile” are some examples of recommendations
based on behavior. Combining a customer’s historical market basket with the
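A hedged sketch of the co-occurrence idea behind “people who bought this also bought…”, with invented baskets; real recommenders use far larger data and more refined similarity measures than raw pair counts.

```python
from collections import Counter
from itertools import combinations

# Hypothetical historical market baskets.
BASKETS = [
    {'bread', 'butter', 'jam'},
    {'bread', 'butter'},
    {'bread', 'milk'},
    {'beer', 'chips'},
]

def co_occurrence(baskets):
    """Count how often each unordered pair of items is bought together."""
    pairs = Counter()
    for basket in baskets:
        for a, b in combinations(sorted(basket), 2):
            pairs[(a, b)] += 1
    return pairs

def also_bought(item, baskets, top_n=2):
    """'People who bought this also bought...' ranked by co-occurrence."""
    pairs = co_occurrence(baskets)
    scores = Counter()
    for (a, b), n in pairs.items():
        if a == item:
            scores[b] += n
        elif b == item:
            scores[a] += n
    return [other for other, _ in scores.most_common(top_n)]

recs = also_bought('bread', BASKETS)
```

Collaborative filtering generalizes the same intuition from items to people: if person A's baskets resemble person B's, B's other purchases become candidate recommendations for A.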
This section shows a use-case that implements our data architecture. It is a big data
analytic for preventive aircraft maintenance and for providing part safety warnings
to airlines.
Figure 10 shows integrated analytic with all three platforms. The arrows in this
picture are data flow and are not indicative of process or control flow.
The overall flow is as follows. Sensor data arrives at the data platform. This
platform reaches out to the data product and data R&D platforms. Sessionized data
is sent to the data R&D platform. The R&D platform models the part and reaches
out to the data product and data platforms for information. The product platform
performs BI analytics and sends out warnings. It reaches out to the data R&D and
data platforms for supporting information.
Each system reaches out to the other for data that is needed in order to make
sense of its patterns.
Sensor data are received and analyzed in the data platform (Data lake). However,
if such analysis is done in isolation, it is unwise to act on it. Sensor data
alone is insufficient. The sensor readings must be organized; they must be
sessionized into a time sequence of readings. The discovery platform puts the
readings in the context of a part. The part is modeled in order to understand its
behavior. This requires knowledge of the part and other sensor readings. These
different readings are organized in order to understand part behavior. Some of this is
done where the data lands. Most of it is done in the discovery platform (data R&D).
Path analysis and statistical analysis are some of the analytical capabilities used to
model and identify the pattern of failures of the part.
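A minimal sketch of the statistical side of this part modeling, under invented data and thresholds: flag a part when the mean of its recent sessionized readings drifts too many historical standard deviations from its historical mean. A real part model would combine many signals, maintenance records, and path analysis.

```python
import statistics

def drift_alert(history, recent, sigma=3.0):
    """Flag the part if the mean of recent readings deviates from the
    historical mean by more than `sigma` historical standard deviations.
    The 3-sigma threshold is an arbitrary illustrative choice."""
    mu = statistics.mean(history)
    sd = statistics.pstdev(history)
    recent_mu = statistics.mean(recent)
    return abs(recent_mu - mu) > sigma * sd

# Hypothetical sessionized temperature readings for one part.
history = [70.0, 71.0, 69.0, 70.0, 70.0, 71.0, 69.0, 70.0]
steady  = [70.5, 69.5, 70.0]
failing = [78.0, 79.0, 80.0]
```

An alert from this model would only be one input to the data product, which combines it with maintenance history and manufacturer advisories before issuing a safety warning.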
It is not enough to determine the current state of the part in the aircraft.
Information is also needed about when the part was last maintained, the history of
the part’s behavior on this aircraft, the history of the part’s behavior in other aircraft
(i.e., the global database), the serial number of the part on this aircraft, and a host of
other information. In addition, part advisories from the manufacturer are also
needed from documents and other manufacturer release materials. Natural
language analytics [17], text analytics [18, 19], and SQL are some of the other tech-
niques used for analysis. The end result of this knowledge refinery is the issuance of
a preventive maintenance direction and safety warning for failing parts in order to
keep the plane flying safely.
Figure 10 works like a data refinery. Raw data enters the data platform. Interesting
content is extracted from it and passed on for discovery. Data R&D builds models with enter-
prise data, transaction data, and dimensional data from the operations side. It is
passed to data product which produces the refined output and issues the warnings.
Learning from this refinery is used to enhance the product platform for future
automated and large-scale application of knowledge.
10 Architecture Futures
Questions
References
1. http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-
Data-Volume-Velocity-and-Variety.pdf
2. http://anlenterprises.com/2012/10/30/ibms-4th-v-for-big-data-veracity
3. http://www.infogroup.com/resources/blog/4-major-types-of-analytics
4. http://en.wikipedia.org/wiki/Data_wrangling
5. http://petersposting.blogspot.in/2013/06/how-ebay-uses-data-and-analytics-to-get.html
6. https://hive.apache.org/
7. http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html
8. http://hortonworks.com/labs/stinger/
9. http://hbase.apache.org/
10. http://www.splunk.com/en_us/products/splunk-enterprise.html
11. http://sparqlcity.com/documentation/
12. http://en.wikipedia.org/wiki/Sessionization
13. Liu, B., Wu, L., Dong, Q., Zhou, Y.: Large-scale heterogeneous program retrieval through
frequent pattern discovery and feature correlation analysis. In: IEEE International Congress on
Big Data (BigData Congress), 2014, pp. 780–781, 27 June–2 July 2014
14. Park, S., Lee, W., Moon, I.C.: Efficient extraction of domain specific sentiment lexicon with
active learning. Pattern Recogn. Lett. 56, 38–44 (2015). ISSN:0167-8655, http://dx.doi.org/10.
1016/j.patrec.2015.01.004
15. Chui, C.K., Filbir, F., Mhaskar, H.N.: Representation of functions on big data: graphs and
trees. Appl. Comput. Harmonic Anal. Available online 1 July 2014, ISSN:1063-5203, http://
dx.doi.org/10.1016/j.acha.2014.06.006
16. Nisar, M.U., Fard, A., Miller, J.A.: Techniques for graph analytics on big data. In: BigData
Congress, pp. 255–262 (2013)
17. Ediger, D., Appling, S., Briscoe, E., McColl, R., Poovey, J.: Real-time streaming intelligence:
integrating graph and NLP analytics. In: IEEE High Performance Extreme Computing
Conference (HPEC), pp. 1–6, 9–11 Sept 2014
18. Atasu, K.: Resource-efficient regular expression matching architecture for text analytics. In:
IEEE 25th International Conference on Application-specific Systems, Architectures and
Processors (ASAP), pp. 1–8, 18–20 June 2014
19. Denecke, K., Kowalkiewicz, M.: A service-oriented architecture for text analytics enabled
business applications. In: European Conference on Web Services (ECOWS, 2010), pp. 205–
212. doi:10.1109/ECOWS.2010.27
Big Data Processing Algorithms
VenkataSwamy Martha
Abstract Information has grown large enough that traditional algorithms must be
extended to scale. Since the data cannot fit in the memory of a single machine and
is distributed across machines, the algorithms must also work with the distributed
storage. This chapter introduces some of the algorithms designed to work on such
distributed storage and to scale with massive data. These algorithms, called Big
Data processing algorithms, comprise random walks, distributed hash tables,
streaming, bulk synchronous parallel (BSP), and MapReduce paradigms. Each of
these algorithms is unique in its approach and fits certain problems. The goals of
the algorithms are to reduce network communication in the distributed system,
minimize data movement, bring down synchronization delays, and optimize
computational resources. Processing data where it resides, peer-to-peer network
communication, and dedicated computation and aggregation components for
synchronization are some of the techniques used in these algorithms to achieve
these goals. MapReduce has been widely adopted for Big Data problems. This
chapter demonstrates how MapReduce enables analytics to process massive data
with ease. It also provides example applications and a codebase for readers to get
hands-on with the algorithms.
Keywords Distributed algorithms · Big Data algorithms · MapReduce paradigm ·
Mapper · Reducer · Apache Hadoop · Job tracker · Task tracker · Name node ·
DataNode · YARN · InputReader · OutputWriter · Multi-outputs · Hadoop example
1 Introduction
Information has been growing at an ever-faster rate, and the computing industry
has been finding ways to store such massive data. Managing and processing the
data in such distributed storage is more complex than simply archiving it for future
use, because the data carries very valuable embedded knowledge, and the data
must be processed to mine those insights.
V. Martha (&)
@WalmartLabs, Sunnyvale, CA, USA
e-mail: vmartha@walmartlabs.com
1.1 An Example
As mentioned earlier, information has been growing at a rapid pace. Industry has
been facing several challenges in benefiting from this vast information [1]. Some of
the challenges are listed here.
• Since the information is of giant size, it is very important to identify the useful
information from which to infer knowledge. Finding the right data in the
collection is possible with domain expertise and business-related requirements.
• Big Data is typically archived as a backup, and industry has been struggling to
manage the data properly. The data has to be stored in such a way that it can be
processed with minimal effort.
• Data storage systems for Big Data have not attempted to connect the data points.
Connecting data points from several sources can help identify new avenues
in the information.
• Big Data technology has not kept pace with the evolving data. With fast-growing
Internet connectivity around the world, an almost infinite number of data
sources are available to generate information at large scale. There is an imme-
diate need for scalable systems that can store and process the growing data.
Big Data Processing Algorithms 63
Both multi-core and distributed systems are designed to run computations in par-
allel. Though their objective is the same, there is a clear distinction between multi-core
and distributed computing systems that sets each apart in its own space. In
brief, multi-core computing systems are tightly coupled, with a shared memory
space to enable communication, whereas distributed systems are loosely coupled
and interact over channels such as MPI and sockets. A distributed system is
assembled out of autonomous computers that communicate over a network. On the
other hand, a multi-core system is made of a set of processors that have direct
access to some shared memory. Both multi-core and distributed systems have
advantages and disadvantages, which are discussed extensively in [3]; a
summary of the discussion is tabulated in Table 1.
Given the scalable solutions mandated by Big Data problems, industry is inching
toward distributed systems for Big Data processing. Moreover, Big Data cannot fit
in a single memory shared among processes, which rules out multi-core systems for
Big Data processing.
There have been several attempts to optimize distributed algorithms in a generic
model, and the following discussion deals with these algorithms.
Distributed hash tables (DHTs), in general, are used to perform a lookup service in a
distributed system [5]. Data is distributed among a set of computing/
storage machines, and each machine is responsible for the slice of information
associated with a key. The key-to-data association is determined by a common hash
table. The hash table is pre-distributed to all the machines in the cluster so that
every machine knows where a certain piece of information resides. Looking up a key
yields the identification metadata of the node that holds the data chunk. Adopting the
DHT concept in a computing model, a distributed hash table can be leveraged to find a
set of computation machines to perform a task in a distributed system. Computing
optimization can be achieved by optimizing the hash table/function for uniform
distribution of computing nodes among all possible keys.
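The lookup idea can be sketched with a toy consistent-hashing ring: every machine applies the same hash function, so every machine computes the same owner for a given key without asking a central coordinator. The node names below are invented, and real DHTs such as Chord or Kademlia add routing tables, replication, and virtual nodes on top of this core idea.

```python
import hashlib
from bisect import bisect

class SimpleDHT:
    """Toy consistent-hashing ring for DHT-style lookups (illustration only)."""

    def __init__(self, nodes):
        # Place each node on a hash ring, keyed by the hash of its name.
        self.ring = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(key):
        return int(hashlib.sha1(key.encode()).hexdigest(), 16)

    def lookup(self, key):
        # The node responsible for a key is the first node at or after the
        # key's position on the ring, wrapping around at the end.
        positions = [pos for pos, _ in self.ring]
        idx = bisect(positions, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

dht = SimpleDHT(["node-a", "node-b", "node-c"])
owner = dht.lookup("part-7341")  # deterministic: same answer on every machine
```

Because the ring is derived purely from the node names and the hash function, any machine holding the same node list resolves a key to the same owner, which is exactly the "every machine knows where a certain piece of information resides" property described above.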
The Bulk Synchronous Parallel (BSP) model runs on an abstract computer, called a BSP
computer. The BSP computer is composed of a set of processors connected by a
communication network. Each processor has its own local memory. In BSP, a
processor means a computing device capable of executing a task; it could be
several CPU cores [6]. A BSP algorithm is a sequence of
super steps, and each super step consists of an input phase, a computing phase, and an
output phase. The computing phase processes the data sent from the previous super step
and generates appropriate data as output to be received by the processors in the next
super step. The processors in a super step run on data received from the previous super
step and are asynchronous within the super step. The processors are synchronized
after every super step, and the communication channels are used to synchronize the
processes. Direct communication is possible between each processor and every other
processor, giving absolute authority over distributing the data among processors in a
super step. BSP does not support shared memory or broadcasting. A BSP computer is
denoted by (p, g, l), where ‘p’ is the number of processors, ‘g’ is the communication
throughput ratio, and ‘l’ is the communication latency. The computation cost of a BSP
algorithm over S super steps is W + H·g + S·l, where the local computation volume
W = w_1 + w_2 + … + w_S and the communication volume H = h_1 + h_2 + … + h_S
are summed over the super steps, and S·l is the total synchronization cost. A BSP
algorithm should be designed to optimize the computation cost, communication cost,
and synchronization cost [7]. The architecture diagram is presented in Fig. 1.
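The compute–communicate–synchronize cycle of a super step can be sketched in a few lines. The sequential loop below merely simulates the barrier (real BSP runtimes synchronize physically distributed processors), and the ring-forwarding example is an invented illustration rather than any standard benchmark.

```python
def bsp_run(num_procs, num_supersteps, compute):
    """Simulate a BSP computation: messages sent in one super step are
    only visible to their destinations in the next super step."""
    # inbox[p] holds the messages processor p received in the last super step.
    inbox = [[] for _ in range(num_procs)]
    for step in range(num_supersteps):
        outbox = [[] for _ in range(num_procs)]
        for p in range(num_procs):
            # compute() returns a list of (destination, message) pairs.
            for dest, msg in compute(p, step, inbox[p]):
                outbox[dest].append(msg)
        inbox = outbox  # barrier: all sends become visible together
    return inbox

# Example: each processor forwards an incremented counter to its right
# neighbour on a ring of four processors.
def forward(p, step, msgs):
    value = sum(msgs) if msgs else p
    return [((p + 1) % 4, value + 1)]

final = bsp_run(4, 3, forward)
```

Note that within a super step the per-processor `compute` calls are independent, matching the model's claim that processors are asynchronous between barriers; only the `inbox = outbox` step plays the role of the global synchronization whose cost the `S·l` term accounts for.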
2 MapReduce
Big Data problems typically defy definitive approaches and can call for non-
conventional computing archetypes. Approaches following some kind of out-of-the-box
technique have been discussed in the computer science literature for decades,
and MapReduce is certainly not the first to drive in this direction. MapReduce
succeeded in fusing several non-conventional computing models together to
perform computations on a grand, previously unimaginable scale. The capability of
its design made MapReduce a synonym for Big Data.
The MapReduce paradigm equips a distributed system to take charge of pro-
cessing Big Data with ease. MapReduce, unlike traditional distributed systems,
orchestrates computations on Big Data with the least effort. MapReduce exposes
specific functions for a developer to implement a distributed application and hides
the internal hardware details. Developers can thereby raise their productivity by
focusing on the application without worrying about organizing and synchronizing
the tasks.
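This developer-facing contract can be illustrated with a word-count sketch in plain Python. The `map_fn`/`reduce_fn` pair is what the developer writes; the small driver below is a stand-in for the framework's shuffle-and-sort machinery, not Hadoop's actual API.

```python
from collections import defaultdict

def map_fn(_key, line):
    """Developer-written map: emit (word, 1) for every word in a line."""
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    """Developer-written reduce: aggregate all counts for one word."""
    yield word, sum(counts)

def run_job(records):
    """Toy driver playing the framework's role: map, shuffle by key, reduce."""
    intermediate = defaultdict(list)
    for key, value in records:                 # map phase
        for k, v in map_fn(key, value):
            intermediate[k].append(v)          # shuffle: group values by key
    return dict(kv for k, vs in intermediate.items()   # reduce phase
                for kv in reduce_fn(k, vs))

result = run_job([(0, "big data big insight"), (1, "big value")])
```

The point of the paradigm is that only the two small functions are application code; partitioning the records across machines, grouping by key, and scheduling the reduces are the framework's job, which is exactly the hardware-hiding property described above.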
MapReduce has claimed enormous attention from the research community, much
as Von Neumann’s computing model once did. Von Neumann’s model was described
as ‘a model as a bridging model, a conceptual bridge between the physical
implementation of a machine and the software that is executed on that machine’ [2]
for single-processor machines, long ago. Von Neumann’s model served as one of the
root pillars of computer science for over half a century. Likewise, MapReduce is a
bridge connecting the distributed system’s platform and distributed applications
through a design based on functional patterns of computation. The applications do
not need to know the implementation details of the distributed system, such as
hardware, operating system, and resource controls.
The engineers at Google first coined the terms Map and Reduce for a pair of
exclusive functions that are to be called in a specific order. The main idea of
MapReduce comes from functional programming languages. There are two primary
functions, Map and Reduce. A map function, upon receiving a pair of data elements,
applies its designated instructions. The Reduce function, given a set of data elements,
performs its programmed operations, typically aggregations. These two functions, performed