
ISSUES, CHALLENGES, AND SOLUTIONS: BIG DATA MINING

ABSTRACT:
Data has become an essential part of every economy, industry, organization, business function and individual. Big Data is a term used to identify datasets whose size is beyond the ability of typical database software tools to store, manage and analyze. Big Data presents unique computational and statistical challenges, including scalability and storage bottlenecks, noise accumulation, spurious correlations and measurement errors. These challenges are distinctive and require new computational and statistical paradigms. This paper presents a literature review of Big Data mining and its issues and challenges, with emphasis on the distinguishing features of Big Data. It also discusses several techniques for dealing with big data.

KEYWORDS: Big data mining, Security, Hadoop, MapReduce

INTRODUCTION:
Data is a collection of values and variables related in some sense and differing in some other sense. In recent years the sizes of databases have increased rapidly. This has led to a growing interest in the development of tools capable of the automatic extraction of knowledge from data [1]. Data are collected and analyzed to create information suitable for decision making. Hence data provide a rich resource for knowledge discovery and decision support. A database is an organised collection of data so that it can easily be accessed, managed and updated. Data mining is the process of discovering interesting knowledge, such as associations, patterns, changes, anomalies and significant structures, from large amounts of data stored in databases, data warehouses or other data repositories. A widely accepted formal definition of data mining is the following: data mining is the non-trivial extraction of implicit, previously unknown and potentially useful information about data [2]. Data mining uncovers interesting patterns and relationships hidden in a large volume of raw data.

Big Data is a newer term used to identify datasets that are of large size and have greater complexity [3], so we cannot store, manage and analyze them with our current methodologies or data mining software tools. Big data is a heterogeneous collection of both structured and unstructured data. Organizations are particularly concerned with managing unstructured data. Big Data mining is the capability of extracting useful information from these large datasets or streams of data, which was not possible before because of their volume, variety and velocity.
The extracted information is very valuable, and the mined knowledge is a representation of different types of patterns, where each pattern corresponds to knowledge. Data mining is the analysis of data from different perspectives and its summarization into useful information that can be used for business solutions and for predicting future trends. Mining data helps organizations to make knowledge-driven decisions. Data mining (DM), also called Knowledge Discovery in Databases (KDD) or Knowledge Discovery and Data Mining, is the process of automatically searching large volumes of data for patterns such as association rules [4]. It applies many computational techniques from statistics, information retrieval, machine learning and pattern recognition. Data mining extracts only the required patterns from the database in a short time span. Based on the type of pattern to be mined, data mining tasks can be grouped into summarization, classification, clustering, association and trend analysis.

Huge amounts of data are generated every minute. A recent study estimated that every minute Google receives more than 4 million queries, email users send more than 200 million messages, YouTube users upload 72 hours of video, Facebook users share more than 2 million pieces of content, and Twitter users generate 277,000 tweets [5]. With the amount of data growing exponentially, improved analysis is required to extract the information that best matches user interests. Big Data refers to rapidly growing datasets with sizes beyond the ability of traditional database tools to store, manage and analyze. Big Data is a heterogeneous collection of both structured and unstructured data. The increase in storage capacity, the increase in processing power and the availability of data are the main reasons behind the appearance and growth of big data.

Big data also refers to the use of large data sets to support the collection or reporting of information that serves organizations or other recipients in decision making. The data may be enterprise-specific or general, and private or public. Big data are characterized by 3 V's: Volume, Velocity, and Variety [6].
Big Data mining refers to the activity of going through big data sets to search for relevant information. Big Data samples are available in astronomy, atmospheric science, social networking sites, life sciences, medical science, government data, natural disaster and resource management, web logs, mobile phones, sensor networks, scientific research and telecommunications [7]. Two main goals of high-dimensional data analysis are to develop effective methods that can accurately predict future observations and, at the same time, to gain insight into the relationship between the features and the response for scientific purposes. Big Data has applications in many fields such as business, technology, health, smart cities and so on. It allows people to receive better services and better customer experiences, and also to predict and detect illness much more easily than before [8]. The rapid growth of the Internet and mobile technologies has played an important role in the growth of data creation and storage.

Since the amount of data is growing exponentially, improved analysis of large data sets is required to extract the information that best matches user interests. New technologies are required to store unstructured large data sets, and processing techniques such as Hadoop and MapReduce have gained great importance in big data analysis. Hadoop is used to process large volumes of data from different sources quickly. Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It allows applications to run on systems with thousands of nodes holding thousands of terabytes of data. Its distributed file system supports fast data transfer rates among nodes and allows the system to continue operating without interruption in case of a node failure. It runs MapReduce for distributed data processing and works with structured and unstructured data [6].

LITERATURE REVIEW

Puneet Singh Duggal, Sanchita Paul, "Big Data Analysis: Challenges and Solutions", International Conference on Cloud, Big Data and Trust 2013, Nov 13-15, RGPV.
This paper presents various methods for handling the problems of big data analysis through the MapReduce framework over the Hadoop Distributed File System (HDFS). MapReduce techniques implemented for Big Data analysis using HDFS are studied in this paper.
This paper presents a review of different algorithms from 1994-2014 that are important for handling Big Data sets. It gives an overview of the architectures and algorithms used for large data sets. These algorithms define various structures and methods implemented to handle Big Data, and the paper lists different tools that were developed for analyzing them. It also explains the different security issues, applications and trends followed by large data sets [9].
Wei Fan, Albert Bifet, "Mining Big Data: Current Status, and Forecast to the Future", SIGKDD Explorations, Volume 14, Issue 2.
The paper presents a broad overview of the topic of Big Data mining, its current status, controversies, and a forecast of the future. It also covers various interesting and state-of-the-art topics on Big Data mining.
Priya P. Sharma, Chandrakant P. Navdeti, "Securing Big Data Hadoop: A Review of Security Issues, Threats and Solution", IJCSIT, Vol 5(2), 2014, pp. 2126-2131.
This paper discusses big data security at the environment level, along with an examination of the built-in protections. It also presents some of the security issues we are dealing with today and proposes security solutions and commercially available techniques to address them. The paper also covers the security solutions for securing the Hadoop ecosystem.
Richa Gupta, Sunny Gupta, Anuradha Singhal, "Big Data: Overview", IJCTT, Vol 9, Number 5, March 2014.
This paper gives an overview of Big Data, its significance in our lives and some technologies for handling Big Data. The paper also describes how Big Data can be applied to self-organizing websites, which can be extended to the field of advertising in companies.

ISSUES AND CHALLENGES

Big data analysis is the process of applying advanced analytics and visualization techniques to large data sets to uncover hidden patterns and unknown correlations for effective decision making. The analysis of Big Data involves multiple distinct phases, which include data acquisition and recording, information extraction and cleaning, data integration, aggregation and representation, query processing, data modeling and analysis, and interpretation. Each of these phases introduces challenges. Heterogeneity, scale, timeliness, complexity and privacy are among the challenges of Big Data mining.

Heterogeneity and Incompleteness


The challenges of big data analysis derive from its large scale as well as from the presence of mixed data based on different patterns or rules (heterogeneous mixture data) in the collected and stored data. In the case of complicated heterogeneous mixture data, the data contains several patterns and rules, and the properties of those patterns vary greatly. Data can be both structured and unstructured; about 80% of the data generated by organizations is unstructured. Unstructured data is produced rapidly and does not have a particular format. It may exist as email attachments, images, pdf documents, medical records, X-rays, voice mails, graphics, video, audio and so on, and it cannot be stored in row/column form like structured data. Converting this data into a structured format for later analysis is a major challenge in big data mining, so new technologies have to be adopted for dealing with such data.

Incomplete data creates uncertainties during data analysis, and these must be managed during the analysis; doing so correctly is also a challenge. Incomplete data refers to missing data field values for some samples. The missing values can be caused by different circumstances, such as the malfunction of a sensor node, or by systematic policies that intentionally skip some values. While most modern data mining algorithms have built-in solutions for handling missing values (such as ignoring data fields with missing values), data imputation is an established research field which seeks to impute missing values in order to produce improved models (compared to the ones built from the original data). Many imputation methods exist for this purpose; the major approaches are to fill in the most frequently observed value, or to build learning models that predict possible values for each data field based on the observed values of a given instance.
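
To make the imputation idea concrete, the following Java sketch (an illustration only, not taken from any of the surveyed tools) fills the missing entries of a single data field with its most frequently observed value; the example readings are hypothetical.

import java.util.*;

public class ModeImputation {
    // Replace null entries in a column with the most frequently observed value.
    public static List<String> imputeWithMode(List<String> column) {
        Map<String, Integer> counts = new HashMap<>();
        for (String v : column) {
            if (v != null) {
                counts.merge(v, 1, Integer::sum);
            }
        }
        // Find the mode among the observed (non-missing) entries.
        String mode = null;
        int best = -1;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > best) {
                best = e.getValue();
                mode = e.getKey();
            }
        }
        // Fill missing entries with the mode.
        List<String> result = new ArrayList<>();
        for (String v : column) {
            result.add(v == null ? mode : v);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sensor readings with missing (null) values.
        List<String> readings = Arrays.asList("22", "23", null, "22", null);
        System.out.println(imputeWithMode(readings)); // [22, 23, 22, 22, 22]
    }
}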

Scale and complexity


Managing large and rapidly increasing volumes of data is a challenging issue. Traditional software tools are not sufficient for managing these increasing volumes of data. Data analysis, organization, retrieval and modeling are also challenges due to the scalability and complexity of the data that needs to be analysed.

Timeliness
As the size of the data sets to be processed increases, it takes more time to analyze them. In some situations the results of the analysis are required immediately. For instance, if a fraudulent credit card transaction is suspected, it should ideally be flagged before the transaction is completed, preventing the transaction from taking place at all. Obviously, a full analysis of a user's purchase history is unlikely to be feasible in real time. So we need to develop partial results in advance, so that a small amount of incremental computation with new data can be used to arrive at a quick determination; a small sketch of this idea is given after this section.
Given a large data set, it is often necessary to find elements in it that meet a specified criterion. In the course of data analysis, this kind of search is likely to occur repeatedly. Scanning the entire data set to find suitable elements is clearly impractical. In such cases, index structures are created in advance to permit finding qualifying elements quickly. The problem is that each index structure is designed to support only some classes of criteria.
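
As a minimal illustration of the incremental-computation idea (a sketch under an assumed flagging rule and assumed field names, not a production fraud detector), the Java fragment below keeps a running count and sum per customer so that each new transaction can be checked in constant time instead of rescanning the full purchase history.

import java.util.*;

public class IncrementalFraudCheck {
    // Running {count, totalCents} per customer, updated as transactions stream in.
    private final Map<String, long[]> stats = new HashMap<>();

    // Flag a transaction if it is far above the customer's running average.
    public boolean isSuspicious(String customerId, long amountCents) {
        long[] s = stats.computeIfAbsent(customerId, k -> new long[2]);
        // Assumed rule for the example: more than 10x the running average is suspicious.
        boolean suspicious = s[0] > 0 && amountCents > 10 * (s[1] / s[0]);
        // Incrementally update the partial result; no rescan of history is needed.
        s[0] += 1;
        s[1] += amountCents;
        return suspicious;
    }

    public static void main(String[] args) {
        IncrementalFraudCheck check = new IncrementalFraudCheck();
        check.isSuspicious("c1", 2000);
        check.isSuspicious("c1", 2500);
        System.out.println(check.isSuspicious("c1", 90000)); // true: well above average
    }
}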

SECURITY AND PRIVACY CHALLENGES FOR BIG DATA

Big data refers to collections of data sets with sizes outside the ability of commonly used software tools, such as database management tools or traditional data processing applications, to capture, manage, and analyze within an acceptable elapsed time. Big data sizes are constantly increasing, ranging from a few dozen terabytes in 2012 to many petabytes of data in a single data set today.
Big data creates tremendous opportunities for the world economy, both in the field of national security and in areas ranging from marketing and credit risk analysis to medical research and urban planning. The extraordinary benefits of big data are lessened by concerns over privacy and data protection. As big data expands the sources of data it can use, the trustworthiness of each data source needs to be verified, and techniques should be explored to identify maliciously inserted data. Information security is becoming a big data analytics problem in which massive amounts of data will be correlated, analyzed and mined for meaningful patterns. Any security control used for big data must meet the following requirements:
It must not compromise the basic functionality of the cluster.
It should scale in the same manner as the cluster.
It should not compromise essential big data characteristics.
It should address a security threat to big data environments or to data stored within the cluster.
Unauthorized release of information, unauthorized modification of information and denial of resources are the three categories of security violation. The following are some of the security threats:
An unauthorized user may access files and could execute arbitrary code or carry out further attacks.
An unauthorized user may eavesdrop on or sniff data packets being sent to a client.
An unauthorized client may read/write a data block of a file.
An unauthorized client may gain access privileges and may submit a job to a queue, or delete or change the priority of a job.
Security of big data can be enhanced by using the techniques of authentication, authorization, encryption and audit trails. There is always a possibility of security violations through unintended, unauthorized access or inappropriate access by privileged users. The following are some of the methods used for protecting big data; a minimal configuration sketch follows the list:

1. Use authentication methods
2. Use file encryption
3. Implement access controls
4. Use key management
5. Use logging
6. Use secure communication
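
As a hedged sketch of points 1 and 6, the Java fragment below uses the standard Hadoop security API to enable Kerberos authentication and encrypted RPC for a client; the principal name, keytab path and exact configuration values are assumptions chosen for the example and would differ per cluster.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class SecureClientLogin {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Turn on Kerberos authentication and service-level authorization.
        conf.set("hadoop.security.authentication", "kerberos");
        conf.set("hadoop.security.authorization", "true");
        // Request encrypted RPC (secure communication between client and cluster).
        conf.set("hadoop.rpc.protection", "privacy");

        UserGroupInformation.setConfiguration(conf);
        // Principal and keytab path below are placeholders for this example.
        UserGroupInformation.loginUserFromKeytab(
                "analyst@EXAMPLE.COM", "/etc/security/keytabs/analyst.keytab");
        System.out.println("Logged in as: "
                + UserGroupInformation.getCurrentUser().getUserName());
    }
}

Encryption of data at rest, access controls, key management and audit logging (points 2 through 5) are typically configured at the cluster level rather than in client code.
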
TECHNIQUES FOR BIG DATA MINING

Big data has great potential to produce useful information for companies, which can improve the way they manage their problems. Big data analysis is becoming indispensable for automatically discovering the intelligence contained in frequently occurring patterns and hidden rules. These massive data sets are too large and complex for humans to effectively extract useful information from without the aid of computational tools. Emerging technologies such as the Hadoop framework and MapReduce offer new and exciting ways to process and transform big data, defined as complex, unstructured, or large amounts of data, into meaningful knowledge.

Hadoop
Hadoop is a scalable, open-source, fault-tolerant virtual grid operating system architecture for data storage and processing. It runs on commodity hardware and uses HDFS, a fault-tolerant, high-bandwidth clustered storage architecture. It runs MapReduce for distributed data processing and works with structured and unstructured data [11]. For handling the velocity and heterogeneity of data, tools like Hive, Pig and Mahout are used, which are parts of the Hadoop and HDFS framework. Hadoop and HDFS (Hadoop Distributed File System) by Apache are widely used for storing and managing big data.
Hadoop consists of a distributed file system, data storage and analytics platforms, and a layer that handles parallel computation, workflow and configuration administration [6].
HDFS runs across the nodes in a Hadoop cluster and connects the file systems on many input and output data nodes to make them into one big file system. The current Hadoop ecosystem, as shown in Figure 1, consists of the Hadoop kernel, MapReduce, the Hadoop Distributed File System (HDFS) and a number of related components such as Apache Hive, HBase, Oozie, Pig and Zookeeper; these components are explained below (a small HDFS client sketch follows the list):
HDFS: A highly fault-tolerant distributed file system that is responsible for storing data on the cluster.
MapReduce: A powerful parallel programming technique for distributed processing of vast amounts of data on clusters.
HBase: A column-oriented distributed NoSQL database for random read/write access.
Pig: A high-level data flow programming language for analyzing data with Hadoop computations.
Hive: A data warehousing application that provides SQL-like access and a relational model.
Sqoop: A project for transferring/importing data between relational databases and Hadoop.
Oozie: An orchestration and workflow management system for dependent Hadoop jobs.
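
To show how an application sees HDFS as one big file system, the following is a minimal sketch using the standard HDFS Java client API; the NameNode address and file path are placeholder assumptions for the example.

import java.net.URI;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Address of the NameNode; placeholder value for this example.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        Path file = new Path("/user/analyst/sample.txt");
        // Write a small file; HDFS splits and replicates its blocks across data nodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello big data".getBytes(StandardCharsets.UTF_8));
        }
        // Read it back through the same unified file-system view.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}

The same FileSystem interface also works against a local file system, which makes it convenient to test such code before running it on a cluster.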

MapReduce:
MapReduce is a programming model for processing large data sets with a
parallel, distributed algorithm on a cluster. Hadoop MapReduce is a
programming model and software framework for writing applications that
rapidly process vast amounts of data in parallel on large clusters of compute
nodes [11].
MapReduce consists of two functions, map() and reduce(). The mapper performs the tasks of filtering and sorting, and the reducer performs the task of summarizing the results. There may be multiple reducers to parallelize the aggregations [7]. Users can implement their own processing logic by specifying customized map() and reduce() functions. The map() function takes an input key/value pair and produces a list of intermediate key/value pairs. The MapReduce runtime system groups together all intermediate pairs based on the intermediate keys and passes them to the reduce() function for producing the final results. MapReduce is widely used for the analysis of big data, as the word-count sketch below illustrates.
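
To make the map()/reduce() contract concrete, the classic word-count program written against the Hadoop MapReduce Java API is sketched below; input and output paths are supplied on the command line.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // map(): emits an intermediate (word, 1) pair for every token in the input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // reduce(): receives all counts for one word and sums them into the final result.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Reusing the reducer as a combiner pre-aggregates counts on each mapper node, which reduces the volume of intermediate data shuffled across the cluster.
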
Large-scale data processing is a difficult task. Managing hundreds or thousands of processors and handling parallelization and distribution make it even more difficult. MapReduce provides a solution to these issues, since it supports distributed and parallel I/O scheduling. It is fault tolerant, supports scalability, and has built-in processes for status reporting and monitoring of heterogeneous and large datasets, as found in Big Data [11].

CONCLUSION
The amount of data worldwide is growing exponentially due to the explosion of social networking sites, search and retrieval engines, media sharing sites, stock trading sites, news sources and so on. Big Data is becoming a new area for scientific data research and for business applications. Big data analysis is becoming indispensable for automatically discovering the intelligence contained in frequently occurring patterns and hidden rules. Big data analysis helps companies to take better decisions, to predict and identify changes and to identify new opportunities. In this paper we discussed the issues and challenges related to big data mining, as well as Big Data analysis tools like MapReduce over Hadoop and HDFS, which help organizations to better understand their customers and the marketplace and to take better decisions, and which also help researchers and scientists to extract useful knowledge from Big Data. In addition, we introduced some big data mining tools and how to extract significant knowledge from Big Data, which will help research scholars to choose the best mining tool for their work.

REFERENCES

[1] Julie M. David, Kannan Balakrishnan, (2011), Prediction of Key Symptoms of Learning Disabilities in School-Age Children using Rough Sets, Int. J. of Computer and Electrical Engineering, Hong Kong, 3(1), pp. 163-169.
[2] Julie M. David, Kannan Balakrishnan, (2011), Prediction of Learning Disabilities in School-Age Children using SVM and Decision Tree, Int. J. of Computer Science and Information Technology, ISSN 0975-9646, 2(2), pp. 829-835.
[3] Albert Bifet, (2013), Mining Big Data in Real Time, Informatica 37, pp. 15-20.
[4] Richa Gupta, (2014), Journey from Data Mining to Web Mining to Big Data, IJCTT, 10(1), pp. 18-20.
[5] http://www.domo.com/blog/2014/04/data-never-sleeps-2-0/
