Beyond The Hype

Big Data
Big data: a growing torrent
4 Vs of Big Data
Big Data vs. DWH-DM
Challenges of Large Scale Social Network Analysis
Where does it come from?
Big data Technologies
Applications of Big data Analysis
Big data

Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze.

This definition can vary by sector, depending on what kinds of software tools are commonly available and what sizes of datasets are common there.

As technology advances over time, the size of datasets that qualify as Big data will also increase.

With these caveats, Big data will range from a few dozen terabytes to multiple petabytes (thousands of terabytes).
Big data: a growing torrent
$600 to buy a disk drive that can store all of the world's music.

5 billion mobile phones in use in 2010.

30 billion pieces of content shared on Facebook every month.

40% projected growth in global data generated per year vs. 5% growth in global IT spending.

235 terabytes of data collected by the US Library of Congress by April 2011.

15 out of 17 sectors in the US have more data stored per company than the US Library of Congress.
4 Vs of Big Data
Volume -- the amount of data is larger than ever.
Velocity -- data arrives faster than ever, e.g. complex real-time data.
Variety -- data comes in more forms, e.g. unstructured video and voice.
Variability -- data types and formats also differ.


[Figure: Big Data at the center, surrounded by its Vs; most labels were lost in extraction.]

Big Data vs. DWH-DM
Big Data
Multitude of data types
Structured, Semi-structured and Unstructured
Demographic, psychographic, transactional
Call center data, social media data, web log
data, sensor networks etc.
Requires new storage mechanisms, e.g. Hadoop
High dimensionality
Online versions of algorithms
Online services such as eBay, Yahoo, Amazon and
Facebook have transformed/created big data
Big Data vs. DWH-DM
Areas like genomics, astronomy, military surveillance and
RFID technology are also contributing to the explosive
growth of the field.
A jet engine's sensors send terabytes of data every hour,
which can be used to build predictive models for repair
cycles. Understanding when repairs should be done, instead
of doing traditional preventive maintenance at certain set
intervals, could be worth billions of dollars.
The challenge in big data analytics is to dig deeply, quickly
and widely.
DWH-DM
Structured data
Off-line algorithms
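The contrast between "online versions of algorithms" and "off-line algorithms" can be illustrated with a small sketch. It computes a mean and variance two ways: an off-line version that needs the whole dataset in memory, and an online version (Welford's method, chosen here as an example and not named in the slides) that updates its result one value at a time. The sensor readings are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Online (streaming) mean and variance using Welford's method."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        # Each value is seen once and then discarded; memory use stays constant.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Population variance; 0.0 until at least one value has been seen.
        return self.m2 / self.count if self.count else 0.0


def offline_mean_variance(values):
    """Off-line version: requires the full dataset to be available at once."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return mean, variance


readings = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical sensor values
stats = RunningStats()
for r in readings:
    stats.update(r)

offline_mean, offline_var = offline_mean_variance(readings)
```

Both versions agree on the result; the online one simply never needs to hold more than three numbers, which is the property that matters once the data no longer fits on one machine.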
Challenges of Large Scale Social Network Analysis
Social networking sites like Facebook, YouTube, Orkut and
Twitter are among the most popular sites on the internet.
Users of these sites form a social network (SN), which provides
a powerful means of sharing, organizing, and finding content
and contacts.
However, the rate at which SNs are growing poses many
latent challenges in maintaining the stability of their
underlying systems and the members associated with them.
Challenges of Large Scale Social Network Analysis
Social Networks (SNs) are living networks that give birth daily
to data traces which can be up to exabytes in volume.
For example, Facebook produces more than a petabyte of data
per day. Even its logging data exceeds 25 terabytes per day.
Google creates as much information (social blogs and Orkut)
in two days now as we did from the dawn of man through
2003, i.e., one exabyte of data.
Analysts need to analyze this huge volume of SN data in
limited time to support system management activities.
Big data and Big Brother
Perhaps one of the biggest contributors to big data, however,
is social networking.
People themselves have become contributors of information
as they increasingly use services such as Facebook and
LinkedIn to connect with each other.
LinkedIn is a particularly interesting target, given the
professional nature of its audience. By analyzing LinkedIn
network information, we can learn a lot about individuals and
the people that they know.
While it may be difficult to manipulate big data at a grand scale, it is relatively easy, given the right tools and techniques, to analyze small subsets (such as personal networks of contacts) for potentially useful results.

We can do this at a micro-analytic level, where we mine profiles for snippets of information, and at the macro-analytic level, where we look at patterns in the data.

Even when people are not part of your network, a properly filled-out profile reveals their job title, where they worked in the past, and where they were educated.
Where does it come from?
In the global marketplace, businesses, suppliers and customers are
creating and consuming vast amounts of information.
Big Data (cont.)
Gartner predicts that enterprise data in all forms will grow
650% over the next 5 years.
According to IDC, the world's volume of data doubles every
18 months.
This flood of data is variously referred to as information overload,
data deluge and big data.
Big data creates a challenge for business leaders.
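The two growth figures above are stated in different units, so a quick calculation helps compare them. The sketch below converts both to an annual growth rate and a five-year factor. It assumes that "grow 650%" means an increase of 650% (a final size 7.5 times the starting size); if it instead means growing to 650% of the starting size, the numbers would be lower. The function names are illustrative only.

```python
def annual_rate_from_doubling(months_to_double: float) -> float:
    """Annual growth rate implied by doubling every `months_to_double` months."""
    return 2 ** (12 / months_to_double) - 1


def total_factor(annual_rate: float, years: float) -> float:
    """Overall multiplier after compounding `annual_rate` for `years` years."""
    return (1 + annual_rate) ** years


# IDC figure: volume doubles every 18 months.
idc_rate = annual_rate_from_doubling(18)
idc_five_year_factor = total_factor(idc_rate, 5)

# Gartner figure: 650% growth over 5 years (assumed to mean +650%, i.e. 7.5x).
gartner_five_year_factor = 1 + 650 / 100
gartner_rate = gartner_five_year_factor ** (1 / 5) - 1

print(f"IDC:     {idc_rate:.1%} per year, {idc_five_year_factor:.1f}x in 5 years")
print(f"Gartner: {gartner_rate:.1%} per year, {gartner_five_year_factor:.1f}x in 5 years")
```

Under these assumptions the IDC figure works out to roughly 59% per year (about 10x in five years) and the Gartner figure to roughly 50% per year, so the two predictions are of the same order.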
NoSQL Databases
Most of the organizations that have built data platforms have
found it necessary to go beyond the relational database model
to tackle big data, because relational databases become ineffective at this scale.
Managing sharding and replication across a horde of
database servers is difficult and slow.
To store huge datasets effectively, a new breed of databases
has been developed. These databases are called NoSQL databases,
or non-relational databases.
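To make "sharding and replication" concrete, here is a minimal in-memory sketch. It assumes a simple hash-modulo placement and copies each record to the next node in sequence; real NoSQL systems use more elaborate schemes, and `TinyCluster`, `shard_for` and `replicas_for` are hypothetical names, not any product's API. Records are plain dictionaries, so two records need not share the same fields.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Pick a primary shard from a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def replicas_for(key: str, num_shards: int, replication_factor: int) -> list:
    """Primary shard plus the following shards, wrapping around."""
    primary = shard_for(key, num_shards)
    return [(primary + i) % num_shards for i in range(replication_factor)]


class TinyCluster:
    """Toy cluster: each node is just a dict held in this process."""

    def __init__(self, num_shards: int = 4, replication_factor: int = 2):
        if replication_factor > num_shards:
            raise ValueError("replication_factor cannot exceed num_shards")
        self.num_shards = num_shards
        self.replication_factor = replication_factor
        self.nodes = [dict() for _ in range(num_shards)]

    def put(self, key: str, record: dict) -> None:
        # Write the record to every replica responsible for this key.
        for node_id in replicas_for(key, self.num_shards, self.replication_factor):
            self.nodes[node_id][key] = record

    def get(self, key: str) -> dict:
        # Read from the first replica that still holds the key.
        for node_id in replicas_for(key, self.num_shards, self.replication_factor):
            if key in self.nodes[node_id]:
                return self.nodes[node_id][key]
        raise KeyError(key)


cluster = TinyCluster(num_shards=4, replication_factor=2)
cluster.put("user:1", {"name": "Ana", "city": "Lima"})
cluster.put("user:2", {"name": "Bo", "followers": 120})  # different fields are fine
```

Even in this toy form, changing `num_shards` moves most keys to different nodes, which hints at why managing sharding by hand across many database servers is difficult.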
NoSQL Databases
Many of the NoSQL databases are the logical descendants of
Google's BigTable and Amazon's Dynamo.

These are designed to be distributed across many nodes, to provide consistency (often eventual rather than absolute) and to have a very flexible schema.
Popular NoSQL databases
Cassandra
Developed at Facebook, in production use at Twitter,
Rackspace, Reddit, and other large sites.
Cassandra is designed for high performance, reliability,
and automatic replication. It has a very flexible data
model. A new startup, Riptano, provides commercial support.

HBase
Part of the Apache Hadoop project, and modeled on
Google's BigTable.
Suitable for extremely large databases (billions of rows,
millions of columns), distributed across thousands of
nodes. Along with Hadoop, commercial support is
provided by Cloudera.
Prevalence of Big Data
Big data is not limited to big companies like Facebook and
According to a McKinsey Global Institute study in 2011:
Most investment firms in the U.S. with fewer than 1,000
employees have 3.8 petabytes of data stored.
Companies in all sectors have at least 100 terabytes stored.
Big Data And You
Big Data Formats
Big data Technologies
Big data technologies describe a new generation of
technologies and architectures, designed to economically
extract value from very large volumes of a wide variety of
data, by enabling high-velocity capture, discovery, and/or analysis.

The above definition incorporates all types of data (e.g., real-time, analytic) managed by next-generation systems.
Google's MapReduce Approach
The MapReduce approach is basically a divide-and-conquer
strategy for distributing an extremely large problem across
an extremely large computing cluster.
In the map stage, a programming task is divided into a
number of identical subtasks, which are then distributed
across many processors.
The intermediate results are then combined by a single
reduce task.
MapReduce provides a solution to Google's biggest
problem, i.e., creating large searches.
MapReduce has proven to be widely applicable to many large
data problems, ranging from search to machine learning.
The most popular open source implementation of
MapReduce is the Hadoop project.
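The map and reduce stages described above can be sketched with the classic word-count example. This is a single-process illustration of the idea only: it is not Hadoop's or Google's API, and on a real cluster each map call would run on a different machine. The grouping step between map and reduce (often called the shuffle) and the per-key reduce are details of this sketch.

```python
from collections import defaultdict


def map_phase(document: str) -> list:
    """Map: turn one document into (word, 1) pairs. Identical for every document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs) -> dict:
    """Group intermediate values by key so each key can be reduced together."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: list) -> tuple:
    """Reduce: combine all intermediate values for one key."""
    return key, sum(values)


def word_count(documents) -> dict:
    mapped = []
    for doc in documents:  # on a cluster, each iteration would be a separate subtask
        mapped.extend(map_phase(doc))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


documents = ["big data big hype", "data beats hype"]  # made-up input
counts = word_count(documents)
print(counts)
```

Because every map call is independent of the others, adding more machines lets more documents be processed at the same time, which is what makes the approach scale.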
Applications of Big data Analysis
Facebook and LinkedIn use patterns of friendship relationships to suggest other people you may know, or should know, with frightening accuracy.

Amazon saves your searches, correlates what you search for with what other users search for, and uses it to create surprisingly appropriate recommendations.

Medical researchers sift through the health records of thousands of people to try to identify useful correlations between medical treatments and health outcomes.
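As an illustration of using "patterns of friendship relationships" to suggest contacts, the sketch below ranks people who are not yet connected to a user by how many friends they share. This is one simple, commonly taught heuristic, assumed here for illustration; it is not a description of what Facebook or LinkedIn actually run. The names and the graph are invented.

```python
from collections import Counter


def suggest_contacts(graph: dict, person: str, top_n: int = 3) -> list:
    """Rank non-friends of `person` by number of mutual friends."""
    friends = graph.get(person, set())
    mutual_counts = Counter()
    for friend in friends:
        for candidate in graph.get(friend, set()):
            # Skip the person themselves and anyone they already know.
            if candidate != person and candidate not in friends:
                mutual_counts[candidate] += 1
    return mutual_counts.most_common(top_n)


# Hypothetical undirected friendship graph: name -> set of friends.
graph = {
    "ana": {"bo", "cy"},
    "bo": {"ana", "cy", "dee"},
    "cy": {"ana", "bo", "dee", "eli"},
    "dee": {"bo", "cy"},
    "eli": {"cy"},
}

suggestions = suggest_contacts(graph, "ana")
print(suggestions)
```

Here "dee" ranks first for "ana" because they share two friends ("bo" and "cy"), while "eli" shares only one.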
As data volumes grow exponentially, so does the
concern over data preservation, access,
dissemination, and usability. Many agencies have
taken initiatives to research areas such as
automated analysis techniques, data mining,
machine learning, privacy, and database
interoperability, and these will help to identify how
big data can enable science in new ways and at new
