© 2019 IJSRSET | Volume 6 | Issue 3 | Print ISSN: 2395-1990 | Online ISSN : 2394-4099
Themed Section : Engineering and Technology
DOI : 10.32628/IJSRSET196373
A Survey of Machine Learning Algorithms
Prof. Deepak Agrawal, Abhiruchi Dubey
Takshshila Institute of Engineering and Technology, Jabalpur, Madhya Pradesh, India
ABSTRACT
Machine Learning is a booming research area in computer science and many other industries all over the world.
It has gained great success in vast and varied application sectors. This includes social media, economy, finance,
healthcare, agriculture, etc. Several intelligent machine learning techniques were designed and used to provide
big data predictive analytics solutions. A literature survey of different machine learning techniques is provided
in this paper, together with a study of the machine learning algorithms commonly used for big data analytics.
Keywords : ANN, Data Analytics, Machine Learning Algorithms, Technique, Prediction, Model
IJSRSET196373 | Received : 20 May 2019 | Accepted : 22 June 2019 | March-April 2019 [ 6 (3) : 364-369 ]
Prof. Deepak Agrawal et al Int J Sci Res Sci Eng Technol. May-June-2019; 6 (3) : 364-369

I. INTRODUCTION

In this data-rich era it is essential to apply sophisticated analytics techniques to huge, diverse big data sets in order to produce useful knowledge and information. Big data analytics is a budding research area that deals with the collection, storage and analysis of immense data sets to uncover unknown patterns and other key information. It helps us to recognize the data that are integral to future business decisions. Big data analytics is widely applied in domains such as the banking and insurance sector, healthcare, education, social media and the entertainment industry, bioinformatics applications, geospatial applications, agriculture, etc. It is a herculean task to handle big data using conventional data processing applications. Thus, to discover hidden data patterns, trends and associations, intelligent machine learning methods can be adopted. The objective of this paper is to discuss various machine learning algorithms used by data scientists for analyzing and modeling big data.

II. BIG DATA ANALYTICS

The term big data, which describes extremely large data sets, is widely used among researchers all over the world. Traditional relational databases are not capable of handling big data. Enormous quantities of data arrive from several sources such as sensors, transactional applications, the web and social media. The big data phenomenon can be understood clearly through the different V's associated with it: Volume, Velocity, Variety, Veracity and Value.

• Volume: This denotes the huge amount of data produced every second, ranging from terabytes to zettabytes. These big data sets can be maintained using distributed systems.
• Velocity: This term represents the rate at which data is produced and processed to meet the demands.
• Variety: This indicates the diverse range of data that we can use.
• Veracity: This concerns the data quality. That is, it indicates the biases, noise, abnormality, etc. in the data.
• Value: This points to the valuable knowledge revealed from the data.

Data scientists use many well-established analytics techniques. Text analytics, predictive analytics, natural language processing, machine learning, etc. are a few approaches used to uncover hidden insights and to make better and faster decisions on big data sets.

III. MACHINE LEARNING

Machine learning is an interdisciplinary research area which combines ideas from several branches of science, namely artificial intelligence, statistics, information theory, mathematics, etc. The prime focus of machine learning research is on the development of fast and efficient learning algorithms which can make predictions on data. When dealing with data analytics, machine learning is an approach used to create models for prediction. Machine learning tasks are mainly grouped into three categories: supervised, unsupervised and reinforcement learning. Supervised machine learning requires training with labeled data. Each labeled training example consists of an input value and a desired target output value. The supervised learning algorithm analyzes the training data and produces an inferred function, which may be used for mapping new values. In unsupervised machine learning, hidden insights are drawn from unlabelled data sets, for example by cluster analysis. The third category, reinforcement learning, allows a machine to learn its behavior from the feedback received through its interactions with an external environment [3]. From a data processing point of view, both supervised and unsupervised learning techniques are preferred for data analysis, while reinforcement techniques are preferred for decision making problems [7].

Most of the traditional machine learning algorithms were implemented for data sets which could fit completely into memory [15]. As the data keeps getting bigger day by day, many intelligent learning methods are being implemented to provide solutions to several big data predictive analytics problems. A study of several commonly used machine learning techniques for big data analytics is provided in the following section.

IV. LITERATURE SURVEY

J. Qiu et al. presented different machine learning algorithms for big data processing [7]. The first one is representation learning, or feature learning, which deals with learning data representations that make the data analysis process easier. The performance of machine learning algorithms is strongly influenced by the selection of data representation (or features) [16]. This learning scheme plays a crucial role in dimensionality reduction tasks. The important steps under representation learning are feature selection, feature extraction and distance metric learning [14]. Feature selection (variable selection) techniques are used to find those features of the data which are most relevant for use in model construction. Feature extraction techniques transform high dimensional data into a low dimensional space. In distance metric learning, a distance function is constructed to calculate the distance between various points of a data set.

The authors also discussed another popular learning technique, deep learning. Most of the older machine learning approaches follow a shallow-structured learning architecture containing a single layer of nonlinear feature transformations. Examples of such learning techniques are Gaussian mixture models (GMMs), hidden Markov models (HMMs), support vector machines (SVMs), logistic regression, kernel regression, etc. [9]. In contrast to the shallow-structured learning architecture, deep learning
techniques make use of supervised and unsupervised strategies in a deep architecture. Learning systems with a deep architecture are composed of several levels of nonlinear processing stages, in which each lower layer’s output is given as the input of the immediately higher layer. Some examples are deep neural networks, convolutional neural networks, deep belief networks and recurrent neural networks. Because of their high performance, deep learning algorithms are well suited for big data analytics applications.

Scalability is a challenging issue for traditional machine learning algorithms. The traditional schemes cannot process huge data chunks within a stipulated time, as they require all the data to be in the same database. A new field of machine learning called distributed learning has evolved to solve this problem. In this scheme, learning is carried out on data sets distributed among several workstations to scale up the learning process [4]. Examples of distributed machine learning algorithms are decision rules, stacked generalization, meta-learning and distributed boosting. Parallel machine learning is another popular learning scheme, in which the learning process is executed in multi-processor environments or on multi-threaded machines [1].

Transfer learning is another machine learning approach mentioned in their paper. In the conventional machine learning process, both the training data and the test data are commonly taken from the same field. That is, the input feature space and data distribution are identical [8]. But there are certain scenarios in which obtaining training and test data from the same domain is difficult and expensive. The transfer learning technique is used to solve this issue. In this scheme, a high performance learner is created for a target domain by training it on a related source domain. Transfer learning techniques are widely used in many real-world data processing applications.

The authors also discussed another learning scheme called active learning. In some cases the data comes without labels, which becomes a challenge. Manually labeling a large data collection is an expensive and strenuous task, and learning from unlabelled data is very difficult. Active learning addresses this issue by selecting a subset of the most important instances for labeling [17]. Another scheme, kernel-based learning, has been widely used in many engineering applications to design efficient, powerful and high performance nonlinear algorithms [5]. Some of the algorithms capable of operating with kernels are support vector machines (SVM), principal component analysis (PCA), the kernel perceptron, etc.

J. L. Berral-Garcia presented a paper describing the machine learning algorithms frequently used for big data analytics [6]. Several algorithms are used for performing modeling, prediction and clustering tasks. Decision tree algorithms (like CART, Recursive Partition Trees or M5), K-Nearest neighbors algorithms, Bayesian algorithms (using Bayes' theorem), Support vector machines (SVM), Artificial Neural Networks, K-means, DBSCAN algorithms, etc. are presented in this paper. Several execution frameworks, namely Map-Reduce frameworks (Apache Hadoop and Spark), Google’s TensorFlow and Microsoft’s Azure-ML, were also mentioned. Implementations of the previously discussed algorithms are made available to the public through different tools, platforms and libraries such as R-cran, Python Sci-Kit, Weka, MOA, Elastic Search, Kibana, etc.

M. U. Bokhari et al. presented a three layered architecture model for storing and analyzing big data [11]. The three layers are the data gathering layer, the data storing layer and the data analysis & report generation
layer. In order to gather and handle the huge volume of big data coming from high speed sources such as sensors or social media, a cluster of high speed nodes or servers is kept in the data gathering layer. The data storage layer is responsible for storing the big data. The Hadoop Distributed File System (HDFS) can be used for data storage. In the data analysis layer, machine learning techniques such as ANN, Naive Bayes, SVM and Principal Component Analysis are used to extract knowledge from the huge, complex data chunks.

P. Y. Wu et al. provided case studies to show how big data analytics is useful in precision medicine to provide the most appropriate treatment to each patient [12]. Principal Component Analysis, Singular Value Decomposition and tensor-based approaches are useful for feature extraction, while filter-based and wrapper-based methods are helpful for feature selection. All of these are dimensionality reduction techniques. The authors compared different techniques for performing data mining tasks. Logistic regression, Cox regression and local regression techniques are simple to interpret, but are sensitive to outliers. Logistic regression with LASSO regularization reduces the feature space, but overfitting is a problem. Other models such as hidden Markov models, conditional random fields, relational subgroup discovery, episode rule mining, etc. are also useful for performing data mining tasks. The authors also discussed useful platforms for big data analytics. Apache Hadoop, IBM InfoSphere Platform, Apache Spark Streaming, Tableau, QlikView, TIBCO Spotfire and other visual analytics tools are highly impactful platforms for providing big data analytics solutions. Two real world case studies, namely integrative -omic data for the improved understanding of cancer mechanisms and the incorporation of genomic knowledge into the EHR system for improved patient diagnosis and care, were carried out to discuss the usefulness of biomedical big data analytics for precision medicine. Multi-omic TCGA [13] data and EHR data [2] were used to conduct this study.

M. R. Bendre et al. [10] conducted research on the usage of big data in precision agriculture. The authors mentioned that big data provides a broad range of functions to uncover new insights to address several farming problems. The designed model uses the MapReduce technique for big data processing and the linear regression method for data prediction. The data collected from the Krishi Vidyapeeth Rahuri (KVR) station, Ahmednagar, India are used to test the model. The results forecasted using this model are very useful for effective decision making in the agriculture domain.

The following table summarizes the literature survey presented in this paper.

TABLE I. LITERATURE SURVEY SUMMARY

Sl. No. | Author(s) name (year) | Algorithms / Techniques | Summary
1. | J. L. Berral-Garcia (2016) | Decision tree algorithms, K-Nearest neighbor algorithms, Bayesian algorithms, SVM, ANN, K-means, DBSCAN | A survey was done on the various machine learning algorithms for classification, prediction and modeling.
2. | J. Qiu, Q. Wu, G. Ding, Y. Xu and S. Feng (2016) | Gaussian Mixture Models, Hidden Markov Models, SVM, logistic regression, Kernel Regression, Deep neural networks, Deep belief networks, PCA, Kernel Perceptron | A survey was done on the various traditional as well as advanced machine learning algorithms used for big data processing.
3. | M. U. Bokhari, M. Zeyauddin and M. A. Siddiqui (2016) | ANN, SVM, PCA, Naive Bayes | Presented a 3 layered architecture model for storing and analyzing big data. Data storage can be done using the Hadoop Distributed File System (HDFS) and data analysis can be done using techniques like ANN, SVM, Naive Bayes and PCA.
4. | P. Y. Wu, C. W. Cheng, C. D. Kaddi, J. Venugopalan, R. Hoffman and M. D. Wang (2016) | Logistic regression, PCA, HMM, Local regression, Cox regression | Discussed several machine learning algorithms and platforms like Hadoop, IBM InfoSphere, Tableau, QlikView, Spark, etc. for providing big data solutions. Case studies were done using -omic data from TCGA and EHR data to show the usefulness of biomedical big data analytics for precision medicine.
5. | M. R. Bendre, R. C. Thool and V. R. Thool | MapReduce, Linear regression | A model was built using MapReduce and linear regression techniques. A case study was carried out to predict the rainfall and temperature values for the year 2013 using the historical weather data collected from KVR, Ahmednagar. The main objective was to improve the accuracy of rainfall forecasting.

V. CONCLUSION
With the advances in big data technology, it has become difficult to handle complex big data using traditional learning algorithms. Therefore, several advanced, efficient and intelligent learning algorithms are required to handle huge chunks of heterogeneous data. The results obtained through these analytics techniques provide more effective solutions to many real world problems in various domains such as healthcare, agriculture, social media, banking, etc. Various research papers were surveyed to gather information about advanced learning techniques. This paper gives an overall idea of the advanced machine learning algorithms and techniques used to provide solutions to big data analytics problems.
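
As a concrete illustration of one surveyed approach, the combination of MapReduce processing with linear regression described by Bendre et al. [10] can be sketched in a few lines of Python. This is a minimal single-machine sketch and not the authors' implementation: the function names (map_chunk, reduce_stats, fit_line, mapreduce_regression), the way the data is chunked and the toy data are all assumptions made for this example. Each map step summarizes one chunk of (x, y) pairs as five sums, the reduce step adds the partial sums, and the least-squares line is then computed in closed form, so no single worker needs the whole data set in memory.

```python
from functools import reduce


def map_chunk(chunk):
    """Map step: summarize one chunk of (x, y) pairs as sufficient statistics."""
    n = len(chunk)
    sx = sum(x for x, _ in chunk)
    sy = sum(y for _, y in chunk)
    sxx = sum(x * x for x, _ in chunk)
    sxy = sum(x * y for x, y in chunk)
    return (n, sx, sy, sxx, sxy)


def reduce_stats(a, b):
    """Reduce step: combine two partial summaries by element-wise addition."""
    return tuple(u + v for u, v in zip(a, b))


def fit_line(stats):
    """Closed-form simple linear regression: y = slope * x + intercept."""
    n, sx, sy, sxx, sxy = stats
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return slope, intercept


def mapreduce_regression(chunks):
    """Run the map step on every chunk, reduce the results, then fit the line."""
    partials = map(map_chunk, chunks)  # each call could run on a separate worker
    return fit_line(reduce(reduce_stats, partials))


# Hypothetical toy data (not from the paper): points on y = 2x + 1 in three chunks.
data = [(x, 2 * x + 1) for x in range(12)]
chunks = [data[0:4], data[4:8], data[8:12]]
slope, intercept = mapreduce_regression(chunks)
print(slope, intercept)
```

Because the five sums are additive, the chunks can be summarized in any order and on any number of workers before the final fit, which is the property that makes this kind of model a natural match for a MapReduce framework.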
VI. REFERENCES

[1]. “Parallel machine learning toolbox”, retrieved from http://www.research.ibm.com/haifa/projects/verification/ml_toolbox/.
[2]. C. A. Caligtan and P. C. Dykes, “Electronic health records and personal health records”, Semin Oncol Nurs, vol. 27, pp. 218-228, 2018.
[3]. C. M. Bishop, Pattern recognition and machine learning, Springer, New York, 2019.
[4]. D. Peteiro-Barral and B. Guijarro-Berdinas, “A survey of methods for distributed machine learning”, Progress in Artificial Intelligence, Springer, vol. 2, issue 1, pp. 1-11, 2018. DOI: 10.1007/s13748-012-0035-5.
[5]. G. Ding, Q. Wu, Y. D. Yao, J. Wang and Y. Chen, “Kernel-Based Learning for Statistical Signal Processing in Cognitive Radio Networks: Theoretical Foundations, Example Applications, and Future Directions”, IEEE Signal Processing Magazine, vol. 30, issue 4, pp. 126-136, 2018. DOI: 10.1109/MSP.2013.2251071.
[6]. J. L. Berral-Garcia, “A quick view on current techniques and machine learning algorithms for big data analytics”, 18th International Conf. on Transparent Optical Networks, pp. 1-4, 2016. DOI: 10.1109/ICTON.2016.7550517.
[7]. J. Qiu, Q. Wu, G. Ding, Y. Xu and S. Feng, “A survey of machine learning for big data processing”, EURASIP Journal on Advances in Signal Processing, Springer, vol. 2016:67, pp. 1-16, 2016. DOI: 10.1186/s13634-016-0355-x.
[8]. K. Weiss, T. Khoshgoftaar and D. Wang, “A survey of transfer learning”, Journal of Big Data, Springer, vol. 3, issue 9, pp. 1-40, 2016. DOI: 10.1186/s40537-016-0043-6.
[9]. Li Deng, “A tutorial survey of architectures, algorithms and applications for deep learning”, APSIPA Transactions on Signal and Information Processing, vol. 3, pp. 1-29, 2014. DOI: https://doi.org/10.1017/atsip.2017.9.
[10]. M. R. Bendre, R. C. Thool and V. R. Thool, “Big data in precision agriculture: Weather forecasting for future farming”, 1st International Conf. on Next Generation Computing Technologies, pp. 744-750, 2017. DOI: 10.1109/NGCT.2015.7375220.
[11]. M. U. Bokhari, M. Zeyauddin and M. A. Siddiqui, “An effective model for big data analytics”, 3rd International Conference on Computing for Sustainable Global Development, pp. 3980-3982, 2016.
[12]. P. Y. Wu, C. W. Cheng, C. D. Kaddi, J. Venugopalan, R. Hoffman and M. D. Wang, “–Omic and Electronic Health Record Big Data Analytics for Precision Medicine”, IEEE Transactions on Biomedical Engineering, vol. 64, issue 2, pp. 263-273, 2017. DOI: 10.1109/TBME.2016.2573285.
[13]. T. C. G. Atlas. Available: http://cancergenome.nih.gov/.
[14]. W. Tu and S. Sun, “Cross-domain representation-learning framework with combination of class-separate and domain-merge objectives”, Proceedings of the 1st International Workshop on Cross Domain Knowledge Discovery in Web and Social Network Mining, ACM, pp. 18–25, 2012. DOI: 10.1145/2351333.2351336.
[15]. X. W. Chen and X. Lin, “Big Data Deep Learning: Challenges and Perspectives”, in IEEE Access, vol. 2, pp. 514-525, 2014. DOI: 10.1109/ACCESS.2014.2325029.
[16]. Y. Bengio, A. Courville and P. Vincent, “Representation Learning: A Review and New Perspectives”, in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, issue 8, pp. 1798-1828, 2017. DOI: 10.1109/TPAMI.2013.50.
[17]. Y. Fu, B. Li, X. Zhu and C. Zhang, “Active Learning without Knowing Individual Instance Labels: A Pairwise Label Homogeneity Query Approach”, in IEEE Transactions on Knowledge and Data Engineering, vol. 26, issue 4, pp. 808-822, 2017. DOI: 10.1109/TKDE.2013.165.

Cite this article as :
Prof. Deepak Agrawal, Abhiruchi Dubey, "A Survey of Machine Learning Algorithms", International Journal of Scientific Research in Science, Engineering and Technology (IJSRSET), Online ISSN : 2394-4099, Print ISSN : 2395-1990, Volume 6 Issue 3, pp. 364-369, May-June 2019.
Journal URL : http://ijsrset.com/IJSRSET196373

International Journal of Scientific Research in Science, Engineering and Technology ( www.ijsrset.com)