Studies in Big Data 137
Vinay Rishiwal
Pramod Kumar
Anuradha Tomar
Priyan Malarvizhi Kumar Editors
Towards the Integration of IoT, Cloud and Big Data
Services, Applications and Standards
Studies in Big Data
Volume 137
Series Editor
Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
The series “Studies in Big Data” (SBD) publishes new developments and advances
in the various areas of Big Data, quickly and with high quality. The intent is to
cover the theory, research, development, and applications of Big Data, as embedded
in the fields of engineering, computer science, physics, economics, and the life sciences.
The books of the series address the analysis and understanding of large, complex,
and/or distributed data sets generated from recent digital sources, including
sensors and other physical instruments as well as simulations, crowdsourcing, social
networks, and other internet transactions such as emails or video click streams. The
series contains monographs, lecture notes, and edited volumes in Big
Data spanning the areas of computational intelligence, including neural networks,
evolutionary computation, soft computing, and fuzzy systems, as well as artificial
intelligence, data mining, modern statistics, operations research, and
self-organizing systems. Of particular value to both the contributors and the
readership are the short publication timeframe and the world-wide distribution,
which enable both wide and rapid dissemination of research output.
The books of this series are reviewed in a single-blind peer review process.
Indexed by SCOPUS, EI Compendex, SCIMAGO and zbMATH.
All books published in the series are submitted for consideration in Web of Science.
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Singapore Pte Ltd. 2023
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
The rapid advancement of technology has led to the emergence of the Internet of
Things (IoT), Cloud Computing, and Big Data as transformative forces in various
industries. As these technologies continue to evolve, there is a growing need for
their integration to unlock their full potential and enable the development of inno-
vative services, applications, and standards. The integration of these three domains
presents numerous challenges and opportunities. One of the key challenges is the
efficient and secure management of the massive data generated by IoT devices, as
well as the seamless integration of IoT devices with cloud-based infrastructure. This
requires the development of scalable and robust architectures, protocols, and stan-
dards that enable interoperability, data sharing, and resource allocation across hetero-
geneous systems. Moreover, the integration of IoT, Cloud, and Big Data enables the
creation of innovative services and applications. To achieve successful integration,
the establishment of common standards is crucial.
To summarise, it is the right time to explore the integration of IoT, Cloud, and
Big Data, which holds immense potential to transform industries, enhance services,
and enable data-driven decision-making. However, addressing the challenges related
to data management, interoperability, and security is vital for successful integration.
Moreover, the establishment of standards is crucial to facilitate seamless commu-
nication and collaboration between different systems. By leveraging the combined
power of IoT, Cloud, and Big Data, organizations can unlock new possibilities and
drive digital transformation in the era of interconnected and data-driven ecosystems.
This book consists of eight chapters. The first chapter covers an introduction to Big
Data analysis and its need, the skills required for Big Data analysis, the characteristics
of Big Data analysis, an overview of the Hadoop ecosystem, and some use cases of
Big Data analysis. The aim of the second chapter is to study and compare three of the
most common classification methods, Support Vector Machines, K-Nearest Neighbours,
and Artificial Neural Networks, for heart disease prediction using the ensemble of
standard Cleveland cardiology data. The objective of the third chapter is to reduce the
energy consumption of the ECG machine. The authors of chapter four propose a
system that implements an automatic water supply to farms based upon their crop: it
measures the water level of the soil and helps decide whether to turn the water supply
on or off. Further, chapter five uses deep convolutional network algorithms for leaf
image classification to provide accurate results. The concept of Blockchain is used
in chapter six with the aim of ensuring the security of patients’ medical records.
Chapter seven offers SHA-PSO, a PSO-based meta-heuristic technique that schedules
workloads among Virtual Machines (VMs) to minimize energy. The authors of chapter
eight propose the design of a field monitoring device using IoT in agriculture.
Editors and Contributors
Prof. (Dr.) Pramod Kumar is an accomplished academic leader with over 24 years
of experience in the field. He currently serves as the Dean of Academics at Glocal
University in Saharanpur, UP, where he has been since September 2022. Prior to
this, he held the position of Dean of Computer Science and Engineering at Krishna
Engineering College in Ghaziabad and served as the director of Tula’s Institute in
Dehradun, Uttarakhand. Prof. Pramod Kumar holds a Ph.D. in Computer Science
and Engineering, which he earned in 2011, as well as an M.Tech in CSE from 2006.
He is a Senior Member of IEEE and an Ex-Joint Secretary of the IEEE U.P. section.
Through his research, he has made significant contributions to the fields of Computer
Networks, IoT, and Machine Learning. He is the author or co-author of more than 70
research papers and has edited four books. He has also supervised and co-supervised
several M.Tech. and Ph.D. students.
Contributors
N. Arora (B)
Electronics and Computer Discipline, Indian Institute of Technology, Roorkee, India
e-mail: nitinarora.iitr@gmail.com
A. Singh
Department of Computer Science and Engineering, Graphic Era Deemed to be University,
Dehradun, India
e-mail: anupamsingh.cse@geu.ac.in
V. Shahare
Department of Computer Science and Engineering, Indian Institute of Technology,
Dharwad, India
e-mail: vivek.shahare27@gmail.com
G. Datta
School of Computer Science, University of Petroleum and Energy Studies, Dehradun, India
e-mail: gdatta@ddn.upes.ac.in
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
V. Rishiwal et al. (eds.), Towards the Integration of IoT, Cloud and Big Data,
Studies in Big Data 137, https://doi.org/10.1007/978-981-99-6034-7_1
Big Data is a phrase that relates to a collection of vast and complex data sets that
are challenging to store and analyze using standard data processing methods. Big
data refers to data assets with a large volume, great velocity, and great diversity that
necessitate cost-effective, creative data processing to improve insight and decision-
making [1].
There are several distinctions between small data and big data. These distinctions
include volume, velocity, variety, veracity, value, time variance, and infrastructure
[2]. Table 1 summarizes all the differences.
The capacity to acquire data no longer limits development and creativity. However,
the capacity to organize, analyze, summarise, display, and find information from
acquired data in a timely and scalable manner is critical.
7 Challenges in Big-Data
Big data is enormous in size, and this data can be structured or unstructured. The
many challenges it raises are discussed below [6].
– Volume: Thanks to new data sources that are developing, the volume of data,
particularly machine-generated data, is expanding, as is the rate at which it expands
each year. For example, the world’s data storage capacity was 800,000 petabytes
(PB) in 2000 and is anticipated to reach 35 zettabytes by 2020.
– Variety and the use of many data sets: Unstructured data makes up greater than
80% of today’s data, and most of it is too vast for effective management.
– Velocity: As organizations realize the benefits of analytics, they face a problem:
they want the data sooner, or in other words, they want real-time analytics.
– Veracity: Data quality and data availability.
– Data Discovery: Finding high-quality data from the massive amounts of data
available on the Internet is a significant problem.
– Relevance and Quality: It’s tough to determine data sets’ quality and relevance
to specific requirements.
– Personally Identifiable Information: Much of this data is about people. This
necessitates, in part, efficient industrial processes; in part, efficient government
monitoring; and in part, perhaps even entirely, a serious rethinking of what privacy
truly entails.
– Process Challenges: Finding the appropriate analysis model may take a lot of
time and effort; thus, the ability to cycle quickly and ‘fail fast’ through many
(perhaps throwaway) models is crucial.
– Management Challenges: Sensitive data, such as personal information, is found
in many warehouses. Accessing such data raises legal and ethical problems. As a
result, the data must be secured, access restricted, and audited.
Introduction to Big Data Analytics
There is a great deal of data in today’s environment. Big businesses use these data to
expand their operations [7] in a variety of circumstances, such as those outlined
below:
– Customer Spending Habits and Shopping Patterns: Management teams at large
retail stores keep track of customer spending habits, purchasing behavior, and
customers’ most-loved products. Based on which product is most searched for or
sold, that product’s production/collection rate is fixed. Banking companies utilize
information about their customers’ purchasing habits to offer customers who want
to buy a particular product a discount or cashback using their bank’s credit or debit
card. They send the appropriate offer to the right individual at the right time [8].
– Recommendation: Large retailers provide custom recommendations based on
spending and buying patterns. E-commerce platforms offer product suggestions.
They keep track of the products customers are interested in and propose them
based on that data [9].
– Smart Traffic System: Data on the traffic state of various roads are obtained
using cameras stationed alongside the road and at the city’s entry and exit
points, and GPS devices installed in vehicles. This information is examined, and
the least time-consuming or jam-free routes are recommended. Big data analysis
can thus create an intelligent traffic system in the city. Another advantage is that
fuel usage may be lowered [10].
– Auto Driving Car: Thanks to big data analysis, a car can be driven without human
intervention. Sensors are installed in various places around the vehicle to
gather information on the size of the neighbouring car, barriers, distance from
the camera, and other things. Numerous computations are made based on these
data, including what rotation angle to apply, what speed to employ, when
to halt, etc. These calculations facilitate the automatic performance of activities
[11].
– Media and Entertainment Sector: Companies that offer media and entertain-
ment services, including Spotify, Amazon Prime, and Netflix, analyze subscriber
data. To develop their next business strategy, information is acquired and
assessed about videos, music, and the amount of time users spend on the
website.
– Education Sector: Online education is highly impacted by the usage of Big Data.
An online or offline course provider will market its course online to someone
searching for a YouTube tutorial video on a related topic [12].
– IoT: IoT sensors are installed in equipment by manufacturing companies to collect
operational data. By analyzing this data, it is possible to anticipate how long
a machine will run without issue until it has to be repaired, allowing the firm
to take action before the equipment develops several problems or fails. As a
result, the cost of replacing the entire equipment can be reduced. Big data is
making a significant impact in healthcare [13]. Patient experiences are collected
using a big data platform and used by clinicians to improve treatment. An IoT
gadget can detect a sign of a potentially fatal disease in the human body early, so
that treatment can begin in advance. IoT sensors installed near patients and newborn
infants continuously monitor various health conditions such as heart rate, blood
pressure, etc. When any parameter exceeds the safe limit, an alarm is transmitted
to a doctor, who can take action remotely.
– Energy Sector: Every 15 min, a smart electric meter reads the power used and
sends it to a server, where the data is evaluated and the time of day when the
city’s power load is lowest can be determined. Using this technology, a manu-
facturing company or a household may be advised to run heavy machines
at night, when the power load is lower, resulting in lower electricity bills.
– Secure Air Traffic System: Numerous locations along the flight route have
sensors (propellers). These sensors keep track of environmental variables such
as temperature, humidity, and flying speed. Based on this data analysis, the envi-
ronmental parameters are set up and adjusted while in flight. Studying the flight’s
machine-generated data makes it possible to estimate how long a machine will
perform flawlessly after being replaced/repaired [14].
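The smart-meter use case above reduces to a simple aggregation: group readings by hour of day and pick the hour with the lowest average load. The sketch below illustrates this with made-up readings; the function name and data shape are illustrative, not from any particular metering system.

```python
from collections import defaultdict

def lowest_load_hour(readings):
    """Given (hour_of_day, kilowatts) smart-meter readings, return the
    hour with the lowest average load."""
    totals, counts = defaultdict(float), defaultdict(int)
    for hour, kw in readings:
        totals[hour] += kw
        counts[hour] += 1
    return min(totals, key=lambda h: totals[h] / counts[h])

# Hypothetical readings taken every 15 minutes: (hour_of_day, kW)
readings = [(9, 120.0), (9, 130.0), (14, 180.0), (14, 175.0),
            (2, 40.0), (2, 38.0), (2, 42.0), (2, 41.0)]
print(lowest_load_hour(readings))  # 2: the night-time load is lowest
```

A real deployment would run the same aggregation over months of readings from thousands of meters, but the logic is unchanged.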
We analyze data to improve decision-making and gain a competitive advantage.
Business intelligence (BI) refers to a group of tools that offers quick access to data-
driven insights into an organization’s growth and development. Open-source BI
tools include BIRT, JasperReport, KNIME, etc.
Large amounts of structured and unstructured data are generated and sent rapidly
from various sources. Big data refers to massive, varied amounts of data growing at
a high rate. Big data rests on three fundamental pillars: the volume of data, the
velocity at which it is created, and the variety or scope of the data points. The data
may be structured, semi-structured, or unstructured, and tools such as Hadoop,
Apache Spark, Cassandra, etc., are available to deal with all types of data.
– BI aims to help firms make improved decisions. Business intelligence supports the
delivery of credible information by extracting data directly from the data source.
In contrast, Big Data’s main aim is to capture, process, and analyze structured
and unstructured data to improve consumer results.
– Localization intelligence and what if analysis are some applications of BI. Variety,
Volume, Variability, Veracity, and Velocity, on the other hand, are characteristics
that better explain extensive data.
– Big Data results can handle historical data and data generated in real-time, whereas
Business Intelligence handles only historical data sets.
To depict a simple project, the cycle is iterative. Figure 1 shows the different phases
involved in the analytical lifecycle of Big Data. A gradual approach is needed to
organize the actions and procedures involved with repurposing, collecting, analyzing,
and processing data to address the specific requirements for performing Big Data
analysis.
– The data science team researches and learns about the issue, creating a sense of
context and understanding.
– It researches the data sources necessary for the project and their availability.
– The team builds an initial hypothesis, which is tested later with data.
– After executing the model, the team compares the results against the established
success and failure criteria.
– During this phase, the data science team produces data sets for training, testing,
and production.
– The team builds and executes the models based on the work done during this
phase.
– The team also determines whether its present tools are adequate for running the
models or whether a more stable environment is necessary.
– Open-source software includes R, PL/R, and WEKA.
– After executing the model, the team must assess the findings against the success
and failure criteria.
– The team assesses the best methods for informing various team members
and stakeholders of the results and conclusions while considering justification
warnings and assumptions.
– The business value should be quantified, and a narrative should be established to
summarize and explain findings to stakeholders.
– Problem-solving abilities can go a long way in the age of Big Data. Because of
its unstructured data, Big Data is considered challenging. Someone who enjoys
solving problems is the best candidate for working in Big Data; their ingenuity and
originality will aid them in developing better solutions to the issues they discover.
– SQL serves as a foundation in the Big Data era. SQL is a data-centric programming
language. While working with Big Data buzzwords like NoSQL, knowing SQL
can benefit a programmer in dealing with high-dimensional data sets.
– Utilizing as many big data tools and technologies as possible, including R, SAS,
Scala, Hadoop, Linux, MatLab, SQL, Excel, SPSS, etc., is often preferred. The
demand for professionals with strong programming and statistical knowledge has
surged.
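The point about SQL remaining a foundation can be illustrated with a tiny in-memory example: the aggregate-and-group idiom below is the same one used by SQL-on-big-data engines such as Hive, only at a different scale. The table and column names are invented for the sketch.

```python
import sqlite3

# Minimal SQL illustration: the GROUP BY / SUM idiom shown here on
# SQLite is the same skill that transfers to distributed SQL engines.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 100.0), ("north", 250.0), ("south", 80.0)])

rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales "
    "GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('north', 350.0), ('south', 80.0)]
```

The same query text would run largely unchanged against a much larger, distributed table, which is why SQL fluency pays off even in NoSQL-heavy stacks.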
Things connected and constantly delivering data to a system generate data, which
might be semi-structured, structured, or unstructured. The best examples are your
mobile devices, from which Telecom Operators receive a massive amount of data
from each cellular network and analyze it. Bioinformatics, the Internet-of-Things,
Cyber-Physical Systems, and Social Media are just a few fields that use Big Data
to look at trends and behavior for their purposes. Modern search engines, such as
Google, are based on Big Data and obtain information using information retrieval
techniques and logic. Furthermore, one may argue that the World Wide Web is the
most important realm of Big Data.
Big Data analytics has become a first-class citizen of daily life. It involves a process of
continual discovery using practical analytic tools to find correlations, hidden patterns,
and various other insights from big data. This includes data of any source, struc-
ture, and size. Insights can be discovered more quickly and efficiently, resulting in
immediate business decisions that decide a winner [15].
The rise of big data, which began in the 1990s, prompted the development of
big data analytics. At the advent of the computer age, corporations employed enor-
mous spreadsheets to analyze information and look for trends. New data sources
helped boost the volume of data generated in the late 1990s and early 2000s. Due to
the widespread use of mobile devices and search engines, more data was generated
than any organization could handle. Another factor to consider was speed: the more
data was generated, the faster it needed to be processed. Gartner defined this
phenomenon as the “3Vs” of data in 2005: volume, velocity, and variety. Anyone
willing to sift through the vast amounts of raw and unstructured data could
unlock a trove of unseen facts about business operations, consumer behavior, popu-
lation changes, and natural phenomena. Conventional data warehouses and relational
databases were incapable of completing the task, so innovation was required. There-
fore, Hadoop came into existence. Yahoo engineers created it in 2006 and released
it as an Apache open source project in 2007. Thanks to its distributed processing
framework, big data applications could now run on a clustered platform. Distributed
processing is the critical distinction between traditional and big data analytics.
At first, only big corporations such as Facebook and Google undertook extensive
data analysis. But then, in the 2010s, banks, retailers, healthcare, and manufacturing
organizations saw the value in big data analytics. Initially, big organizations with on-
premises data stores were best suited to gathering and analyzing large data sets.
However, Amazon Web Services (AWS), Microsoft Azure, and many other cloud
platform providers now make it easy for any company to utilize a big data
analytics platform. The option to set up Hadoop clusters in the cloud allowed any
company to start and run just what they needed on-demand, irrespective of its size.
This provides flexibility in the usage of clusters. A big data analytics environment
is a critical component of adaptability, which is required for today’s businesses to
succeed [16].
14.1 HDFS
HDFS creates an abstraction: logically, it is a single unit for storing Big Data, while,
similar to virtualization, the actual data is distributed among numerous nodes. HDFS
has a master–slave architecture. In HDFS, the master node is the Name-node, while
the slave nodes are Data-nodes. The Name-node holds metadata about the data stored
in Data-nodes, such as which data block is saved in which Data-node, how many
replications of each data block are retained, etc. Data-nodes are where the actual
data is kept.
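The Name-node/Data-node split described above can be sketched as a toy model: metadata (the block map and replication count) lives on the Name-node, while block contents live on the Data-nodes. All class, node, and block names here are illustrative, not real HDFS identifiers.

```python
class MiniNameNode:
    """Toy stand-in for an HDFS Name-node: holds only metadata,
    never the block data itself."""

    def __init__(self, datanodes, replication=3):
        self.datanodes = datanodes      # node name -> {block_id: bytes}
        self.replication = replication
        self.block_map = {}             # block_id -> list of node names

    def put_block(self, block_id, data):
        # Place `replication` copies on distinct Data-nodes.
        targets = list(self.datanodes)[: self.replication]
        for node in targets:
            self.datanodes[node][block_id] = data
        self.block_map[block_id] = targets

    def locate(self, block_id):
        # Clients ask the Name-node where a block lives, then read
        # the data directly from a Data-node.
        return self.block_map[block_id]

datanodes = {"dn1": {}, "dn2": {}, "dn3": {}, "dn4": {}}
nn = MiniNameNode(datanodes, replication=3)
nn.put_block("blk_0001", b"sensor readings ...")
print(nn.locate("blk_0001"))  # ['dn1', 'dn2', 'dn3']
```

Real HDFS adds rack-aware placement, heartbeats, and re-replication on Data-node failure, but the division of labor is the one modeled here.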
14.2 YARN
Yet Another Resource Negotiator (YARN) handles all data processing duties, which
mainly consist of allocating resources and scheduling tasks. The Resource
Manager and the Node Manager are the two primary components of YARN. The
Resource Manager plays the role of the controller node: it accepts processing requests
and then forwards them to the corresponding Node Managers, which are responsible
for the actual processing that takes place. Every Data-node has a Node Manager
installed, which is in charge of completing tasks on that Data-node.
14.3 MapReduce
A MapReduce job separates the input data into fragments processed by the map tasks
in parallel. The framework sorts the output of the map tasks before it is given to
the reduce tasks. HDFS stores the job’s input and output data. The frame-
work handles task monitoring, scheduling, and re-execution. The MR framework
and HDFS run on the same nodes; hence the compute and storage nodes are usually
the same. This configuration enables the framework to efficiently schedule jobs on
the data nodes, resulting in high aggregate bandwidth throughout the cluster. A
Resource Manager (master), a Node Manager (slave) on each cluster node, and an
MR AppMaster per application make up the MapReduce framework. A MapReduce
job is composed of four steps: map, shuffle, sort, and reduce.
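The four steps just named can be made concrete with the classic word-count job, written here as plain single-process Python. On a real cluster the framework runs the map calls in parallel on the nodes holding the input splits; this sketch only shows the data flow, not the distribution.

```python
from collections import defaultdict

def map_phase(split):
    # Map: emit a (word, 1) pair for every word in the input split.
    return [(word, 1) for word in split.split()]

def shuffle_sort(pairs):
    # Shuffle: group all values by key; sort: order the keys, as the
    # framework does before handing groups to the reducers.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(groups):
    # Reduce: combine each key's values into a single count.
    return {key: sum(values) for key, values in groups}

splits = ["big data big insight", "data at rest data in motion"]
mapped = [pair for split in splits for pair in map_phase(split)]
counts = reduce_phase(shuffle_sort(mapped))
print(counts)  # 'data' occurs 3 times, 'big' twice, the rest once
```

Each function corresponds to one named step, which is why MapReduce jobs parallelize so naturally: the map calls are independent, and the shuffle is the only cross-node data movement.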
14.4 Spark
Big data analytics is the efficient processing of large amounts of data using techno-
logies. It is mainly used for decision-making, which requires individual intellectual
capabilities and collective knowledge. Businesses usually look to store their
business data history to get meaningful results and new insights to grow the busi-
ness. As a result, extensive data analysis needs technical innovation and data science
expertise. Models for extensive data analysis were investigated and utilized to design
a general conceptual architecture to make things more transparent.
Following are examples of the need for Big Data Analytics:
1 Business decisions: Online retail companies like Amazon look forward to making
decisions based on past ‘Prime Day’ sales and consider the best-selling items to
be repeated for the next sale.
2 Insight into data and business: A company located in multiple locations using
their sales data can get an insight into which location has maximum sales for the
last financial year.
3 Interpretation of outcomes: The data can be estimated in the nearest time range
based on pattern-based analysis.
4 Descriptive: Graphical representation of data can show business behavior.
5 Predictive analytics: Using mathematical and scientific techniques applied to
historical data, future data can be predicted with appropriate variables to a certain
confidence level.
As per industry standards, big data broadly consists of three Vs, which are as
follows:
Volume: The term “volume” refers to the quantity of data, which rapidly increases
daily. Humans, technology, and their interactions on social media create enormous
amounts of data.
Velocity: Velocity refers to the stream of data that arrives continuously from different
social media sites, and the repository is filled with new data at the same rate. It
becomes a challenge to capture this stream of data promptly for further processing.
Variety: There is a variety of data coming from various sources. The repository stores
this data in different file formats: spreadsheets, text files, e-mails, image files, video
files, etc.
Some of the use cases of Big data are as follows:
1. Fraud detection in Financial Organizations: Headlines have recently
reported credit and debit card fraud involving millions of people. Several consumers
discovered fraudulent activity associated with their accounts. With big data and
machine learning, this could have been minimized. Based on machine learning
analysis, banks can learn about a customer’s typical activities and transactions,
and if they notice any suspicious conduct, they can quickly block the customer’s
card or account and notify them. Banks have begun to use Big Data to study
market and consumer behavior, but more work still needs to be done.
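A deliberately simple stand-in for the learned "typical activity" just described is an outlier test: flag any transaction more than three standard deviations from the customer's historical mean. Real fraud systems use far richer features (location, merchant, timing) and trained models; this sketch, with hypothetical amounts, only shows the idea.

```python
import statistics

def suspicious(history, amount, threshold=3.0):
    """Flag `amount` if it deviates from the customer's historical
    spending by more than `threshold` standard deviations."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(amount - mean) > threshold * stdev

history = [42.0, 55.0, 38.0, 61.0, 47.0, 52.0]  # past card spends
print(suspicious(history, 49.0))    # False: in line with past spends
print(suspicious(history, 900.0))   # True: block the card and notify
```

The big-data part of the real system is scale: maintaining such a profile, and far richer ones, per customer across millions of accounts in near real time.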
2. Big data in health care: Healthcare businesses use big data to enhance
profitability and save lives. Healthcare firms, hospitals, and researchers collect
massive volumes of data. However, none of this information is helpful on its
own. When the data is evaluated, it becomes possible to highlight trends and
threats in patterns and construct prediction models. This data can also be used
for classification purposes, for example, COVID-19 data as presented in [19, 20].
3. Big data in the telecom sector: Telecom operators use big data analytics to
gain a more comprehensive perspective of their operations and consumers and
accelerate innovation initiatives.
4. Big data in the Oil and Gas sector: This sector has been using big data to find
new ways to innovate for the last few years. Data sensors have long been used in
the oil and gas industry to track and monitor the performance of wells, gear, and
activities. Oil and gas corporations have used this information to track healthy
activity, develop Earth models to discover new oil sources, and perform other
value-added operations.
5. Log data Analytics in business: Many commercial big data applications rely on
log data as a foundation. Log management and analysis tools existed long before
big data. However, as business activity and transactions rise exponentially,
storing, processing, and presenting log data efficiently and cost-effectively
can become a significant burden. In this context, big data analytics plays a signifi-
cant role, because industries have discovered a synergy between log data search
and big data analytics.
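The core of the log-analytics use case is reducing a stream of log lines to aggregates. The sketch below counts HTTP status codes over a simplified, hypothetical log format; a production pipeline would run the same reduction over files in HDFS or a log-aggregation service rather than an in-memory list.

```python
from collections import Counter

# Simplified, made-up web-server log lines: client, method, path, status.
log_lines = [
    '10.0.0.1 GET /index.html 200',
    '10.0.0.2 GET /missing 404',
    '10.0.0.1 POST /login 200',
    '10.0.0.3 GET /index.html 500',
]

# The status code is the last space-separated field on each line.
status_counts = Counter(line.rsplit(' ', 1)[1] for line in log_lines)
print(status_counts.most_common())  # [('200', 2), ('404', 1), ('500', 1)]
```

Because each line is processed independently, this reduction is exactly the shape of job that MapReduce-style frameworks parallelize well.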
6. Big Data Analytics in Recruitment: In the rush to place applicants as rapidly
as possible in a competitive climate, recruiters frequently believe they lack the
proper tools. Recruiters nowadays use a new technique that mines the internal
database of candidates’ overall profiles: educational background, certifications
completed, job title applied for, skill sets, years of experience, and so forth. This
mined result is then matched and compared with previous candidates’ performance,
salaries, and overall past recruitment experience. The traditional approach of
matching keywords with the job description is no longer efficient in today’s
scenario, where big data analytics has significantly changed the paradigm in
different industry verticals. Figure 3 shows the steps involved in the recruitment
process using Big data analytics.
7. Big Data Analytics in Natural Language Processing (NLP): In NLP, the
computer processes languages before feeding them to the model for training [21].
Various linguistic features are being considered during processing. We find many
important use cases of NLP in different industry verticals. Sentiment analysis of
customers is one of the essential applications of natural language processing used
by several companies. They analyze customers’ sentiment by capturing contin-
uous streaming data, in which customers’ feedback on any particular product is
positive, negative, or neutral. The company subsequently analyzes these textual
sentiment documents to improve its product further. One of the essential use
cases in the banking sector is the chatbot, which largely takes over the customer
service officer’s job. Chatbots process all textual data on a real-time basis,
matching it against an existing, huge NLP database (corpus), and then try to
respond to the user’s query. Another critical use case of NLP is the Machine
Translation (MT) system. Machine translation translates a source language into a
target language: the source is one language, e.g., English, and the target is another
language, e.g., Hindi. We call it a bilingual MT system if it translates from one
language to another.
Neural-based translation is known as Neural Machine Translation (NMT), and NMT
uses the latest NLP models in its language model. Since NMT relies on a deep
neural network, a massive parallel corpus is needed to train the model. Model
performance can be measured with automatic metrics such as BLEU, METEOR, etc.
Researchers have been studying the performance evaluation of MT/NMT systems with
various automatic metrics, and the outcomes computed by different metrics are
compared.
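As a rough illustration of how an automatic metric such as BLEU scores a candidate translation against a reference, here is a deliberately simplified unigram-only variant; real BLEU combines clipped 1- to 4-gram precisions with the brevity penalty shown here.

```python
import math
from collections import Counter

def unigram_bleu(candidate, reference):
    """Simplified BLEU: clipped unigram precision times a brevity
    penalty. Real BLEU geometric-averages 1- to 4-gram precisions."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    # Clipped overlap: each candidate word counts at most as often
    # as it appears in the reference.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    precision = overlap / len(cand)
    # Penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

score = unigram_bleu("the cat sat on the mat", "the cat is on the mat")
print(round(score, 3))  # 0.833 (5 of 6 words match)
```

Established implementations (e.g. in NLTK or sacrebleu) should be preferred in practice; this sketch only shows the shape of the computation.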
8. Blockchains aren't efficient for storing large files: Large files are stored
inefficiently on blockchains [12]. Storing vast volumes of data on a public
blockchain is expensive and time-consuming, and storing data on-chain isn't a
scalable or efficient option for anything other than primary ledger data and
associated hashes. Each transaction may add up to thousands of dollars per
terabyte on the chain, plus a cost each time you wish to access that data. It is
also slow, on the order of minutes per megabyte, which SLAs cannot afford. As a
result, blockchains rely almost entirely on off-chain storage.
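The off-chain pattern mentioned above is commonly implemented by anchoring only a cryptographic digest of the data on-chain. The sketch below shows the hashing side of that idea; the function name is an assumption for illustration.

```python
import hashlib

def anchor_off_chain(data: bytes) -> str:
    """Keep bulk data off-chain; store only its SHA-256 digest
    on-chain. Anyone holding the data can recompute the digest
    and verify it against the on-chain record."""
    return hashlib.sha256(data).hexdigest()

document = b"large scanned contract, megabytes in practice"
on_chain_hash = anchor_off_chain(document)

# Later: verify the off-chain copy has not been tampered with.
assert anchor_off_chain(document) == on_chain_hash
print(on_chain_hash[:16], "...")
```

A 64-hex-character digest costs a fixed, tiny amount to record regardless of how large the underlying file is, which is exactly why this pattern dominates.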
16 N. Arora et al.
The fundamental issue is that most firms can't keep up with the available data and
data sources. Big data has created several challenges in collecting and storing
the right streaming data sources for correct analysis. Much of the big data
technology we use is obsolete, and sometimes even the tools cannot provide a
satisfactory solution. Hence, organizations need to upgrade or replace their
existing systems. Some of the significant challenges in analyzing big data are as
follows:
– Lack of data science skills: There is a substantial skill shortage in the
data scientist community, and minimizing this gap is a considerable challenge.
Educating people on using big data analytics is part of the issue, but many other
technical issues also require addressing, so closing this gap will take time.
– Lack of proper data visualization: Interesting and relevant data is often
disregarded when it is mixed with ordinary or irrelevant findings. In other cases,
team members and even seasoned data scientists fail to present data in a
meaningful and visually appealing manner due to a lack of skill. Consequently,
they may sometimes ignore or miss the most relevant and meaningful data.
– Lack of proper data transformation: Extracting insight or value from data
demands proper transformation. Since data volumes are very large and data formats
are not fixed, proper and correct transformation is a big challenge for data
engineers, who are responsible for converting this data into an analytics-ready
form that the analytics team can use. Data engineers must often depend on
rudimentary, code-heavy technologies during this transformation process, so
transforming data as required can become a significant challenge.
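A minimal sketch of the transformation work described in the list above, turning heterogeneous raw records into one analytics-ready schema; the field names and target schema are illustrative assumptions.

```python
def to_analytics_ready(record):
    """Normalize one raw record (mixed key names, string numbers,
    inconsistent casing) into a fixed analytics-ready schema."""
    return {
        "customer": str(record.get("customer")
                        or record.get("cust_name", "")).strip().title(),
        "amount": float(record.get("amount") or record.get("amt", 0)),
        "currency": str(record.get("currency", "USD")).upper(),
    }

# Two records from sources with different conventions.
raw = [
    {"cust_name": "  alice smith", "amt": "120.50"},
    {"customer": "BOB JONES", "amount": 75, "currency": "eur"},
]
clean = [to_analytics_ready(r) for r in raw]
print(clean[0])  # {'customer': 'Alice Smith', 'amount': 120.5, 'currency': 'USD'}
```

Real pipelines face the same problem at scale, with schema drift and bad rows, which is why this step consumes so much data-engineering effort.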
The study of data quality in big data systems is still in its infancy. Most
research on big data quality acknowledges the relevance of standard dimensions in
measuring big data quality. Some critical quality dimensions of big data are
accessibility, confidentiality, redundancy, volume, etc. Table 2 presents the
most critical quality dimensions of big data and their purpose.
19 Conclusion
Big Data Analytics plays a vital role in today’s world. All businesses carry vast
amounts of data with them, which can be used to uplift their future growth with
the help of Big Data Analytics and its tools. Big Data Analytics helps the company
predict future trends from past data using the Hadoop ecosystem, which eventually
References
1. Lazer, D., Radford, J.: Data ex machina: introduction to big data. Ann. Rev. Sociol. 43, 19–39
(2017)
2. Kitchin, R., Lauriault, T.P.: Small data in the era of big data. GeoJournal 80(4), 463–475 (2015)
3. Fernández, A., del Río, S., Chawla, N.V., Herrera, F.: An insight into imbalanced big data
classification: outcomes and challenges. Complex Intell. Syst. 3(2), 105–120 (2017)
4. Géczy, P.: Big data characteristics. Macro Theme Rev. 3(6), 94–104 (2014)
5. Ansari, S., Mohanlal, R., Poncela, J., Ansari, A., Mohanlal, K.: Importance of big data. In:
Handbook of Research on Trends and Future Directions in Big Data and Web Intelligence,
pp. 1–19. IGI Global (2015)
6. Fan, J., Han, F., Liu, H.: Challenges of big data analysis. Natl. Sci. Rev. 1(2), 293–314 (2014)
7. Al Nuaimi, E., Al Neyadi, H., Mohamed, N., Al-Jaroodi, J.: Applications of big data to smart
cities. J. Internet Serv. Appl. 6(1), 1–15 (2015)
8. Aloysius, J.A., Hoehle, H., Goodarzi, S., Venkatesh, V.: Big data initiatives in retail environ-
ments: linking service process perceptions to shopping outcomes. Ann. Oper. Res. 270(1),
25–51 (2018)
9. Verma, J.P., Patel, B., Patel, A.: Big data analysis: recommendation system with Hadoop
framework. In: 2015 IEEE International Conference on Computational Intelligence &
Communication Technology, pp. 92–97. IEEE (2015)
10. Rizwan, P., Suresh, K., Babu, M.R.: Real-time smart traffic management system for smart
cities by using Internet of things and big data. In: 2016 International Conference on Emerging
Technological Trends (ICETT), pp. 1–7. IEEE (2016)
11. Fathi, F., Abghour, N., Ouzzif, M.: From big data to better behavior in self-driving cars. In:
Proceedings of the 2018 2nd International Conference on Cloud and Big Data Computing,
pp. 42–46 (2018)
12. Daniel, B.K.: Big data in higher education: the big picture. In: Big Data and Learning Analytics
in Higher Education, pp. 19–28. Springer (2017)
13. Mahapatra, S., Singh, A.: Application of IoT-based smart devices in health care using fog
computing. In: Fog Data Analytics for IoT Applications, pp. 263–278. Springer (2020).
14. Singh, A., Mahapatra, S.: Network-based applications of multimedia big data computing in IoT
environment. In: Multimedia Big Data Computing for IoT Applications, pp. 435–452. Springer
(2020)
15. Kannan, S., Karuppusamy, S., Nedunchezhian, A., Venkateshan, P., Wang, P., Bojja, N.,
Kejariwal, A.: Chapter 3 - Big data analytics for social media. In: Buyya, R., Calheiros, R.N.,
Dastjerdi, A.V. (eds.) Big Data, pp. 63–94. Morgan Kaufmann (2016).
16. Tsai, C.W., Lai, C.F., Chao, H.C., Vasilakos, A.V.: Big data analytics: a survey. J. Big Data
2(1), 1–32 (2015)
17. Landset, S., Khoshgoftaar, T.M., Richter, A.N., Hasanin, T.: A survey of open source tools for
machine learning with big data in the Hadoop ecosystem. J. Big Data 2(1), 1–36 (2015)
18. Monteith, J.Y., McGregor, J.D., Ingram, J.E.: Hadoop and its evolving ecosystem. In: 5th
International Workshop on Software Ecosystems (IWSECO 2013), vol. 50, p. 74. Citeseer
(2013)
19. Goyal, L., Arora, N.: Deep transfer learning approach for detection of COVID-19 from chest
X-ray images. Int. J. Comput. Appl. 975, 8887 (2020)
20. Kakde, A., Sharma, D., Arora, N.: Optimal classification of COVID-19: a transfer learning
approach. Int. J. Comput. Appl. 176(20), 25–31 (2020)
21. Datta, G., Joshi, N., Gupta, K.: Empirical analysis of performance of MT systems and its metrics
for English to Bengali: a black box-based approach. In: Intelligent Systems, Technologies and
Applications, pp. 357–371. Springer (2021)
22. Sharma, A., Tiwari, S., Arora, N., Sharma, S.C.: Introduction to blockchain. In: Blockchain
Applications in IoT Ecosystem, pp. 1–14. Springer (2021)
DCD_PREDICT: Using Big Data
on Prediction for Chest Diseases
by Applying Machine Learning
Algorithms
U. Kulkarni (B)
Vidyalankar Institute of Technology Wadala, Mumbai, Maharashtra, India
e-mail: umesh.kulkarni@vit.edu.in
S. Gawade
Pillai College of Engineering, Panvel, India
e-mail: sgawade@mes.ac.in
H. Palivela
Manager-AI, Accenture Solutions, Mumbai, Maharashtra, India
V. Agaskar
Vidyavardhani College of Engineering and Technology, Vasai Road, Vasai-Virar, Maharashtra,
India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 19
V. Rishiwal et al. (eds.), Towards the Integration of IoT, Cloud and Big Data,
Studies in Big Data 137, https://doi.org/10.1007/978-981-99-6034-7_2
20 U. Kulkarni et al.
1 Introduction
1.1 Introduction
Numerous disorders related to the chest affect people. Diseases like asthma, COPD,
pneumonia, tuberculosis, and others have symptoms that reveal their presence.
These symptoms, which can occur in a number of settings while people go about
their daily lives, include shortness of breath, chest discomfort, and throat and
chest coughs, among others. To identify which chest ailment a person is
experiencing, we plan to use these symptoms and how they present in various human
contexts, such as running or waking up. We achieve this through a symptom-based
questionnaire. The purpose of this activity is to help with the initial diagnosis
of chest problems and to help distinguish between various diseases. Our
methodology employs the idea of symptom-based surveys together with weighted
scores for their questions. The initiative is designed to fit seamlessly into the
regular schedule of any nearby doctor's office, nursing home, or hospital,
programming the computer to recognize and predict the illness the patient is
suffering from. Training is carried out using example datasets that include
survey-style questions. Test datasets are available in the UCI repository, the
California Health and Human Services (CHHS) dataset, and data from the esteemed
National Institute of Tuberculosis and Respiratory Diseases.
1.2 Background
Heart disease has a high worldwide mortality rate, and its prediction and
diagnosis have become a difficult task for doctors and hospitals both in India
and abroad. The Heart Disease Prediction System aids in the prediction of heart
disease, specifically cardiovascular conditions such as myocardial infarction. In
this field, data mining and machine learning algorithms are critical. Researchers
are accelerating their work to develop a graphical user interface and machine
learning algorithms that can assist doctors in making decisions regarding the
prediction and diagnosis of heart disease. This project's main output is
predicting a patient's heart disease using machine learning algorithms. A
comparative study is carried out, with performance calculated for each machine
learning procedure.
DCD_PREDICT: Using Big Data on Prediction for Chest Diseases … 21
1.3 Objective
2 Literature Survey
Machine learning is a fast-growing field, and we aim to utilize its potential to
create this artificial intelligence system. Having a vast range of applications,
this system will be used by doctors' patients once all its elements are
implemented. In practice, doctors diagnose disease through a large number of
tests, which require long processing times and skilled interpretation that is
not always available [1]. It is difficult to extract important data in the form
of knowledge; hence, it is crucial to use techniques such as data mining and
machine learning. Further, extracting important data from such medical data
repositories becomes feasible when using methods like classification, clustering,
regression, prediction, etc. [2]. The primary focus of the paper is to apply data
mining classification techniques to predict heart disease in its early stages;
likewise, computer-based prediction makes it easier to foresee heart disease
early [3]. KNN (k-nearest neighbor), ANN (artificial neural network), and SVM
(support vector machine) are among the techniques typically involved, and a
comparative study for our proposed project and for prediction is carried out
using the Cleveland heart disease dataset [4].
2.1 Summary
Early detection and treatment options exist for heart disorders. Using the method
described above, we may determine whether a patient has heart disease based on
their numerous symptoms. In this instance, SVM and random forest classifiers
provide the most accurate predictions. We are unable to anticipate the many types
of heart disorders with any degree of accuracy due to the lack of abundant data,
but we can identify heart conditions with a respectable accuracy of roughly 80 to
85%. When sufficient data is available, it will be possible to design more
accurate methods for disease diagnosis. Three data mining prediction techniques
are used to
construct a model of the prediction system for chest infections. The process
retrieves hidden information from a historical record of chest infections. The
models are built and accessed using the DMX query language and tasks. A test
dataset is used to train and validate the models. Methods like the lift chart and
classification matrix are used to measure how well the models work. All three
models are capable of extracting patterns for the predicted state. Neural
networks and decision trees appear to be the best models for predicting
individuals with chest disease. The objectives are assessed in comparison with
the trained models. Each of the three models has its own benefits concerning
simplicity of model interpretation, availability of exhaustive data, and
precision in answering complex queries. This framework can be improved and
extended further. It may also incorporate additional data mining techniques, for
example, association rules and time series. Continuous data can be used as an
alternative to categorical data. Another topic is mining the colossal amount of
unstructured data present in health care databases using text mining.
3 System Design
A large number of people suffer from chest-related diseases, and many die from
chest conditions. This is often because they are diagnosed long after onset, when
it becomes difficult to solve the problem. In addition, chest diseases are often
misdiagnosed for one another: a patient with asthma may be told he has COPD and
vice versa, leading to the wrong treatment being given to the patient. Therefore,
there is a need to build an easy system to aid doctors in preliminary decision
making, and to empower the patient with a tool that helps him understand his
condition better and take appropriate measures by talking to the correct doctor.
It is mainly focused on Knowledge Discovery in Databases (KDD), the primary
proposal from which mashup candidates are identified by addressing a repository
of open services. In this methodology, there is a personalized development of
software, which can be used to produce new software based on service integration
methods. KDS defines service integration qualification by discovering different
phases of web service specifications.
The process used here intersects the fields of data mashup and service mashup.
This idea of obtaining information from web service offerings is comparable to
the well-established KDD approaches. The representations of data integration and
service mashup are discussed in this work, along with cutting-edge techniques for
the fundamental KDS domains of comparison processing, grouping, filtering, etc.
These questions take yes-or-no inputs, or a spectrum of inputs from 1 to 4
indicating the extent of the symptom. As the patient enters answers to the
questionnaire, an initial weight assigned to each question determines and
calculates the percentage chance of the disease occurring. This weight changes as
we train our machine with more training data for both Western and Indian
conditions; as more training data arrives, the diagnosis of the system becomes
more and more precise. To test the working of the system, there will be extensive
use of the UCI and CHHS datasets.
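The weighted questionnaire scoring just described might be sketched as follows. The 0–4 answer scale, the weights, and the normalization are assumptions for illustration, since the chapter does not give exact formulas.

```python
def disease_probability(answers, weights):
    """Weighted questionnaire score, assuming each answer is on a
    0-4 scale (0 = absent, 4 = severe) and each question carries an
    initially assigned weight that training would later refine."""
    total = sum(w * a for w, a in zip(weights, answers))
    max_total = sum(w * 4 for w in weights)  # all symptoms at maximum
    return total / max_total if max_total else 0.0

# Hypothetical 4-question screen for one chest disease.
weights = [3.0, 2.0, 2.0, 1.0]   # question importance (assumed)
answers = [4, 2, 0, 3]           # one patient's responses
print(round(disease_probability(answers, weights), 2))  # 0.59
```

Training on labeled cases would adjust `weights` so that the score tracks observed diagnoses, which is the refinement the text attributes to more training data.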
Current systems utilize a large amount of medical data taken from tests that
determine the nature of the chest disease. These tests are costly, not scalable,
and require advanced medical professionals. To overcome the problems of existing
systems, in the proposed system users do not need to search for data in various
repositories with special features. Users need only provide the information
required to be collected: they can simply type a combination of queries, and
based on user behavior analysis, the exact data will be predicted. Over the
years, medical researchers have compiled this medical data into symptom-based
surveys which are used to determine the complexities.
3.5 Scope
The objective of the project is to recognize the primary symptoms of chest
diseases, the distinguishing features of these illnesses. Our project uses
symptom-based forms and weighted scores for their entries. The project is planned
to be incorporated into the everyday activities of any nearby doctor, nursing
home, or medical clinic.
Currently, systems utilize a large amount of medical data taken from tests that
determine the nature of the chest disease. These are high-priced, not scalable,
and require advanced medical professionals. To overcome the problems of the
existing system, in the proposed system such data is stored in various
repositories with special features. The user needs to provide only the
information required to be collected. Users can simply type a combination of
queries, and based on the user's behavior analysis, the exact data will be
predicted. However, over time, medical researchers have compiled this medical
data into symptom-based surveys which are used to determine the complexities.
4 Methodology
(a) Support Vector Machine (SVM)
A supervised learning model, the Support Vector Machine (SVM) represents data in
finite-dimensional vector spaces, where each dimension signifies a specific
property of an object, and SVM has been shown to work well for high-dimensional
problems. Due to its computational ability on huge datasets, SVM is often used in
document classification, sentiment analysis, and prediction-based tasks [16].
(b) K-Nearest Neighbors (KNN) [16]
K-Nearest Neighbor (KNN), another supervised learning model, classifies test data
directly from the training samples. In KNN, the majority vote of an item's
closest neighbors decides its class; distance metrics, which can be as simple as
Euclidean distance, are used to predict the class of a new sample. In the working
steps of KNN, k (the number of nearest neighbors) is determined first; the test
data point is then given a class label based on the result of the majority vote
[16].
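The working steps of KNN just described can be sketched in a few lines; the toy dataset and labels are invented for illustration.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest
    training points, using Euclidean distance."""
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D dataset: (features, class label).
train = [((1, 1), "healthy"), ((1, 2), "healthy"),
         ((8, 8), "disease"), ((9, 8), "disease"), ((8, 9), "disease")]
print(knn_predict(train, (2, 1)))  # healthy
print(knn_predict(train, (8, 8)))  # disease
```

With real questionnaire data, each feature vector would hold the weighted symptom scores, and k would be tuned on a validation split.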
(c) Artificial Neural Network (ANN)
The supervised learning technique known as the Artificial Neural Network (ANN)
contains three layers: input, hidden, and output. The connections between the
input units, the hidden units, and the output are determined by the weight
assigned to each input unit; in general, significance increases with increasing
weight. ANN can use both linear and sigmoid transfer (activation) functions.
ANNs may be trained to handle huge volumes of data with few inputs. The most
popular learning algorithm for multilayer feed-forward ANNs is backpropagation.
For ANN, the data records should be split into three sub-datasets for training,
validation, and testing.
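To make the layer structure concrete, here is a single forward pass through a tiny network with sigmoid activations. The weights are arbitrary illustrative values, not trained; backpropagation, mentioned above, would be needed to learn them from data.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, w_hidden, w_out):
    """One forward pass of a tiny 3-layer network
    (input -> hidden -> output) with sigmoid activations."""
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, inputs)))
              for ws in w_hidden]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)))

w_hidden = [[0.5, -0.2], [0.3, 0.8]]  # 2 hidden neurons, 2 inputs each
w_out = [1.0, -1.0]                    # output-layer weights
print(round(forward([1.0, 0.0], w_hidden, w_out), 3))
```

The output always lies in (0, 1), so it can be read as a disease probability once the weights are fitted on the training sub-dataset.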
Symptom-based questionnaires are required for the following chest-related diseases:
• Asthma
• COPD
• Pneumonia
• Tuberculosis
• The dataset required for this purpose can be obtained from the UCI database, the CHHS
database, and datasets obtained from the National Institute of Tuberculosis and
Respiratory Diseases (India).
• An ML training framework like TensorFlow may be used to train the system on
the selected dataset.
• A cloud ML service like Azure ML or Amazon ML may be used to verify and
double-check the training.
The Agile life cycle model was employed. Agile is a methodology drawn from
software engineering that aims to speed up delivery and increase transparency.
Agile is typically a time-boxed, iterative approach to software delivery that
builds software incrementally from the start, as opposed to waiting until the end
to present the project as a whole. Agile methodologies frequently work by
breaking projects into small pieces of user functionality known as user stories,
prioritizing them, and then continuously delivering them in short iterations of
about two weeks.
(i) Probability Generation: Based on the input given by the user, this block
estimates the probability of chest disease and the category in which it falls,
as shown in Fig. 1.
(ii) Graph Calculation: Depending upon the category, a definite path by which
the course of medicine can be worked out is determined, as shown in Fig. 1.