Introduction to Big Data Platform

Big Data is also data, but of a huge size. Big Data is a term used for a collection of data sets so large and complex that they are difficult to store and process using available database management tools or traditional data processing applications. The challenges include capturing, curating, storing, searching, sharing, transferring, analyzing and visualizing this data.
Following are some examples of Big Data:
 New York Stock Exchange:
The New York Stock Exchange generates about one terabyte of new trade data per day.
 Social Media:
Statistics show that 500+ terabytes of new data get ingested into the databases of the social media site Facebook every day. This data is mainly generated through photo and video uploads, message exchanges, posting comments, etc.
 Jet engine:
A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.
Sources of Big Data
This data comes from many sources, such as:
o Social networking sites: Facebook, Google and LinkedIn all generate huge amounts of data on a day-to-day basis, as they have billions of users worldwide.
o E-commerce sites: Sites like Amazon, Flipkart and Alibaba generate huge amounts of logs from which users' buying trends can be traced.
o Weather stations: Weather stations and satellites give very large volumes of data, which are stored and processed to forecast the weather.
o Telecom companies: Telecom giants like Airtel and Vodafone study user trends and publish their plans accordingly, and for this they store the data of their millions of users.
o Share market: Stock exchanges across the world generate huge amounts of data through their daily transactions.
Characteristics of Big Data
Big Data has certain characteristics and hence is defined using the 3Vs, namely:
(i) Volume – The name Big Data itself is related to a size which is enormous. The size of data plays a very crucial role in determining the value that can be derived from it. Whether particular data can actually be considered Big Data or not also depends upon its volume. Hence, 'Volume' is one characteristic which needs to be considered while dealing with Big Data.
(ii) Variety – The next aspect of Big Data is its variety. Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only sources of data considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also being considered in analysis applications. This variety of unstructured data poses certain issues for storing, mining and analyzing data.
(iii) Velocity – The term 'velocity' refers to the speed of generation of data. How fast the data is generated and processed to meet the demands determines the real potential in the data. Big Data velocity deals with the speed at which data flows in from sources like business processes, application logs, networks, social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous.
Other Characteristics
Various individuals and organizations have suggested expanding the original three
Vs, though these proposals have tended to describe challenges rather than qualities
of big data. Some common additions are:
 Veracity: The variety of sources and the complexity of the processing can lead to challenges in evaluating the quality of the data (and, consequently, the quality of the resulting analysis).
 Variability: Variation in the data leads to wide variation in quality. Additional resources may be needed to identify, process, or filter low-quality data to make it more useful.
 Value: The ultimate challenge of big data is delivering value. Sometimes, the systems and processes in place are complex enough that using the data and extracting actual value becomes difficult.
Benefits of Big Data Processing
 Businesses can utilize outside intelligence when making decisions
Access to social data from search engines and sites like Facebook and Twitter is enabling organizations to fine-tune their business strategies.
 Improved customer service
Traditional customer feedback systems are being replaced by new systems designed with Big Data technologies. In these new systems, Big Data and natural language processing technologies are used to read and evaluate consumer responses.
 Early identification of risk to the product/services, if any.
 Better operational efficiency
Big Data technologies can be used for creating a staging area or landing zone for new data before identifying what data should be moved to the data warehouse. In addition, such integration of Big Data technologies and the data warehouse helps an organization offload infrequently accessed data.

CHALLENGES IN CONVENTIONAL SYSTEMS

The sharply increasing data deluge of the big data era brings huge challenges in data acquisition, storage, management and analysis. RDBMSs apply only to structured data, not to semi-structured or unstructured data. Traditional DBMSs cannot handle the huge volume and heterogeneity of big data. The research community has proposed some solutions from different perspectives.
The key challenges are:
1. Data Representation
Many data sets have certain levels of heterogeneity in type, structure,
semantics, organization, granularity and accessibility. Data representation aims
to make data more meaningful for computer analysis and user interpretation.
Efficient data representation shall reflect data structure, class and type, as well
as integrated technology, so as to enable efficient operations on different data
sets.
2. Redundancy Reduction and Data Compression
Redundancy reduction and data compression effectively reduce the indirect cost of the entire system, provided that the potential value of the data is not affected.
3. Data life cycle Management
Compared with relatively slow advances of storage systems, pervasive
sensing and computing are generating data at unprecedented rates and scales.
Data importance principles related to the analytical value should be
developed to decide which data shall be stored and which data shall be
discarded.
4. Analytical Mechanism
The analytical system of big data shall process masses of heterogeneous data
within a limited time.
5. Data Confidentiality
Most big data service providers or owners at present cannot effectively maintain and analyse such huge data sets because of their limited capacity. They must rely on outside professionals or tools to analyse the data, which increases potential security risks.
6. Energy Management
The energy consumption of mainframe computing systems has drawn much
attention from both economic and environmental perspectives. With the
increase of data volume and analytical demands, the processing, storage and
transmission of big data will inevitably consume more and more electric
energy.
7. Expandability and scalability
The analytical system of big data must support present and future data sets.
The analytical algorithms must be able to process expanding and increasingly complex data sets.
8. Co-operation
Analysis of big data is interdisciplinary research, which requires experts in different fields to cooperate in harvesting the potential of big data.
Intelligent Data Analysis
Intelligent Data Analysis (IDA) is an interdisciplinary study concerned with the effective analysis of data. It is used for extracting useful information from large quantities of online data and desirable knowledge or interesting patterns from existing databases. IDA is one of the hot issues in the fields of artificial intelligence and information. Intelligent data analysis reveals implicit, previously unknown and potentially valuable information or knowledge from large amounts of data. It is also a kind of decision support process. Based mainly on artificial intelligence, machine learning, pattern recognition, statistics, database and visualization technology, IDA automatically extracts useful information, necessary knowledge and interesting models from large amounts of online data in order to help decision makers make the right choices.
The process of IDA generally consists of the following three stages:
(1) data preparation
(2) rule finding or data mining
(3) result validation and explanation.
Data preparation involves selecting the required data from the relevant data source
and integrating this into a data set to be used for data mining. Rule finding is
working out rules contained in the data set by means of certain methods or
algorithms. Result validation requires examining these rules, and result explanation
is giving intuitive, reasonable and understandable descriptions using logical
reasoning.
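As an illustration (not part of the original text), a minimal sketch of these three stages is given below, assuming Python with scikit-learn and using its bundled Iris data purely as a stand-in for "online data": the data is prepared and split, rules are found with a shallow decision tree, and the result is validated on held-out data and explained as readable rules.

# A minimal sketch of the three IDA stages, assuming scikit-learn is available
# and using its bundled Iris data purely as a stand-in for "online data".
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# (1) Data preparation: select the required data and split off a validation set.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# (2) Rule finding / data mining: a shallow decision tree yields human-readable rules.
model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)

# (3) Result validation and explanation: check accuracy on held-out data and
#     print the discovered rules in an intuitive, understandable form.
print("validation accuracy:", model.score(X_test, y_test))
print(export_text(model))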
Nature of Data
Big Data can be found in three forms:
1. Structured
2. Unstructured
3. Semi-structured

1. Structured:-
Any data that can be stored, accessed and processed in a fixed format is termed 'structured' data. Over time, computer science has achieved great success in developing techniques for working with such data (where the format is well known in advance) and in deriving value out of it. However, nowadays we are foreseeing issues when the size of such data grows to a huge extent; typical sizes are in the range of multiple zettabytes (10^21 bytes, or one billion terabytes, equal one zettabyte). Data stored in a relational database management system is one example of 'structured' data.

E_ID   E_Name            Gender   Dept      Salary
2365   Rajesh Kulkarni   Male     Finance   650000
3398   Pratibha Joshi    Female   Admin     650000
7465   Shushil Roy       Male     Admin     500000

Fig: An 'Employee' table in a database is an example of Structured Data


2. Unstructured:-
Any data with unknown form or structure is classified as unstructured data. In addition to its huge size, unstructured data poses multiple challenges in terms of processing it to derive value. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc. Nowadays organizations have a wealth of data available to them, but unfortunately they don't know how to derive value from it, since this data is in its raw, unstructured form.

Fig: The output returned by 'Google Search' is an example of unstructured data


3. Semi-structured:-
Semi-structured data can contain both forms of data. We can see semi-structured data as structured in form, but it is actually not defined with, for example, a table definition as in a relational DBMS. An example of semi-structured data is data represented in an XML file.
Examples of Semi-Structured Data
Personal data stored in an XML file:
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
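Because each record carries its own tags, such data can be processed without a predefined table schema. A small illustrative sketch follows, assuming Python's standard library; the <people> wrapper element is added here only so the fragment forms a well-formed document.

# Parsing the semi-structured XML records above with Python's standard library.
# The <people> wrapper is an assumption added to make the fragment well-formed.
import xml.etree.ElementTree as ET

xml_data = """<people>
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
</people>"""

root = ET.fromstring(xml_data)
for rec in root.findall("rec"):
    # Each record is self-describing even though no table schema was defined in advance.
    print(rec.findtext("name"), rec.findtext("sex"), rec.findtext("age"))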

Analytic Processes and Tools


Analytics is the process of examining large data sets containing a variety of data types, i.e. big data, to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information. The analytical findings can lead to more effective marketing, new revenue opportunities, better customer service, improved operational efficiency, competitive advantages over rival organizations and other business benefits. The primary goal of big data analysis is to help companies make more informed business decisions by enabling data scientists, predictive modelers and other analytics professionals to analyze large volumes of transaction data, as well as other forms of data that may be untapped by conventional business intelligence (BI) programs.
That could include web server logs and Internet clickstream data, social media content and social network activity reports, text from customer emails and survey responses, mobile-phone call detail records and machine data captured by sensors connected to the IoT. Some people exclusively associate big data with semi-structured and unstructured data of that sort, but consulting firms like Gartner Inc. also consider structured data to be a valid component of big data analytics applications. Big data can be analyzed with the software tools commonly used as part of advanced analytics disciplines such as
 Predictive analytics
 Data mining
 Text analytics
 Statistical analysis
Mainstream BI software and data visualization tools can also play a role in the analysis process. But semi-structured and unstructured data may not fit well in traditional data warehouses based on relational databases. Data warehouses may also not be able to handle the processing demands posed by sets of big data that need to be updated frequently or even continually.
Many organizations looking to collect, process and analyze big data have turned to a newer class of technologies that includes Hadoop and related tools such as
 YARN
 MapReduce
 Spark
 Hive
 Pig
 NoSQL databases
These technologies form the core of an open source software framework that supports the processing of large and diverse data sets across clustered systems. Big data vendors are pushing the concept of the Hadoop data lake, which serves as the central repository for an organization's incoming streams of raw data.
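As a rough illustration of the MapReduce idea mentioned above, the following plain-Python sketch (a conceptual stand-in, not actual Hadoop code) walks through the map, shuffle and reduce phases of a word count; real Hadoop or Spark jobs distribute the same phases across a cluster.

# Conceptual MapReduce word count in plain Python.
from collections import defaultdict

documents = ["big data needs big storage", "big data needs analysis"]

# Map phase: emit (key, value) pairs from each input record.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group values by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: aggregate the values for each key.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)   # e.g. {'big': 3, 'data': 2, ...}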
Analysis Vs Reporting
Modern Big Data Analytics Tools
Data Analytics is the process of analyzing datasets in order to draw conclusions from the information they contain. It is popular among commercial industries, scientists and researchers for making more informed business decisions and for verifying theories, models and hypotheses.

1. Tableau Public

It is a simple and intuitive tool that offers intriguing insights through data visualization. Despite Tableau Public's million-row limit, its ease of use fares better than most of the other players in the data analytics market. With Tableau's visuals, you can investigate a hypothesis, explore the data, and cross-check your insights.
Uses of Tableau Public
 You can publish interactive data visualizations to the web for free.
 No programming skills are required.
Visualizations published to Tableau Public can be embedded into blogs and web pages and shared through email or social media. The shared content can be made available for download. This makes it one of the best Big Data analytics tools.
Limitations of Tableau Public
 All data is public, and it offers very little scope for restricted access.
 Data size limitation.
 Cannot be connected to R.
 The only ways to read data are via OData sources, Excel or txt files.
2. OpenRefine
Formerly known as Google Refine, this data cleaning software helps you clean up data for analysis. It operates on rows of data that have cells under columns, quite similar to relational database tables.
Uses of OpenRefine
 Cleaning messy data
 Transformation of data
 Parsing data from websites
 Adding data to the dataset by fetching it from web services. For instance, OpenRefine could be used for geocoding addresses to geographic coordinates.
Limitations of OpenRefine
 OpenRefine is unsuitable for large datasets.
 Refine does not work very well with big data.
3. KNIME
KNIME helps you to manipulate, analyze, and model data through visual programming. It is used to integrate various components for data mining and machine learning.
Uses of KNIME
 You don't write blocks of code. Instead, you drag and drop connection points between activities.
 This data analysis tool supports multiple programming languages. In fact, analysis tools like these can be extended to run chemistry data, text mining, Python, and R.
Limitation of KNIME
 Poor data visualization
4. RapidMiner
RapidMiner provides machine learning and data mining procedures, including data visualization, data processing, statistical modeling and predictive analytics. Written in Java, RapidMiner is fast gaining acceptance as a Big Data analytics tool.
Uses of RapidMiner
 It provides an integrated environment for business analytics and predictive analysis.
 Along with commercial and business applications, it is also used for application development.
Limitations of RapidMiner
 RapidMiner has size constraints with respect to the number of rows.
 RapidMiner needs more hardware resources than ODM and SAS.
5. Google Fusion Tables
When it comes to data tools, Google Fusion Tables is a cooler, larger version of Google Spreadsheets: an incredible tool for data analysis, mapping, and large dataset visualization. Google Fusion Tables can also be added to the business analytics tools list, and it is one of the best Big Data analytics tools.

Uses of Google Fusion Tables
 Visualize bigger table data online.
 Filter and summarize across hundreds of thousands of rows.
 Combine tables with other data on the web.
 You can merge two or three tables to generate a single visualization that includes both sets of data.
 You can create a map in minutes!
Limitations of Google Fusion Tables
 Only the first 100,000 rows of data in a table are included in query results or mapped.
 The total size of the data sent in one API call cannot be more than 1MB.
6. NodeXL
NodeXL is visualization and analysis software for relationships and networks. It provides exact calculations. It is a free (not the pro version) and open-source network analysis and visualization tool. NodeXL is one of the best statistical tools for data analysis, and it includes advanced network metrics, access to social media network data importers, and automation.
Uses of NodeXL
This is one of the data analysis tools in Excel that helps in the following areas:
 Data Import
 Graph Visualization
 Graph Analysis
 Data Representation
This software integrates into Microsoft Excel 2007, 2010, 2013, and 2016. It opens as a workbook with a variety of worksheets containing the elements of a graph structure, that is, nodes and edges. It can import various graph formats, such as adjacency matrices, Pajek .net, UCINet .dl, GraphML, and edge lists.
Limitations of NodeXL
 You need to use multiple seeding terms for a particular problem.
 You need to run the data extractions at slightly different times.
7. Wolfram Alpha
It is a computational knowledge engine, or answer engine, founded by Stephen Wolfram.
Uses of Wolfram Alpha
 It is an add-on for Apple's Siri.
 It provides detailed responses to technical searches and solves calculus problems.
 It helps business users with information charts and graphs, and helps in creating topic overviews, commodity information, and high-level pricing history.
Limitations of Wolfram Alpha
 Wolfram Alpha can only deal with publicly known numbers and facts, not with viewpoints.
 It limits the computation time for each query.
8. Google Search Operators
Google search operators are a powerful resource that helps you filter Google results instantly to get the most relevant and useful information.
Uses of Google Search Operators
 Faster filtering of Google search results
 Google's powerful data analysis tool can help discover new information.
9. Excel Solver
The Solver Add-in is a Microsoft Office Excel add-in program that is available when you install Microsoft Excel or Office. It is a linear programming and optimization tool in Excel that allows you to set constraints. It is an advanced optimization tool that helps in quick problem-solving.
Uses of Solver
 The final values found by Solver are a solution to the interrelated constraints and decision variables.
 It uses a variety of methods, from nonlinear optimization and linear programming to evolutionary and genetic algorithms, to find solutions.
Limitations of Solver
 Poor scaling is one of the areas where Excel Solver lacks.
 Poor scaling can affect solution time and quality, and even the intrinsic solvability of your model.
10. Dataiku DSS
Dataiku DSS is a collaborative data science software platform. It helps a team build, prototype, explore, and deliver its own data products more efficiently.
Uses of Dataiku DSS
Dataiku DSS provides an interactive visual interface in which users can point, click, and build, or use languages like SQL.
Limitations of Dataiku DSS
 Limited visualization capabilities
 UI hurdles: reloading of code/datasets
 Inability to easily compile entire code into a single document/notebook
 Still needs to integrate with Spark

Sampling Distributions
A sampling distribution is a probability distribution of a statistic obtained through a large number of samples drawn from a specific population. The sampling distribution of a given population is the distribution of frequencies of a range of different outcomes that could possibly occur for a statistic of that population. A sample is a subset of a population. For example, if many samples are drawn and the average weight is computed for each sample, the distribution of those averages is the sampling distribution of the mean. Not just the mean can be calculated from a sample: other statistics, such as the standard deviation, variance, proportion, and range, can also be calculated from sample data. The standard deviation and variance measure the variability of the sampling distribution. The number of observations in the population, the number of observations in the sample and the procedure used to draw the sample sets determine the variability of a sampling distribution. The standard deviation of a sampling distribution is called the standard error. While the mean of a sampling distribution is equal to the mean of the population, the standard error depends on the standard deviation of the population, the size of the population and the size of the sample.
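A short simulation sketch, assuming Python with NumPy and a hypothetical population of weights, illustrates these points: the mean of the simulated sample means approximates the population mean, and their standard deviation approximates the standard error σ/√n.

# Simulating a sampling distribution of the mean (hypothetical weight data).
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=80, scale=12, size=100_000)   # hypothetical weights

n = 50
sample_means = [rng.choice(population, size=n).mean() for _ in range(5_000)]

print("mean of sample means:", np.mean(sample_means))      # close to the population mean
print("empirical standard error:", np.std(sample_means))   # close to sigma / sqrt(n)
print("theoretical standard error:", population.std() / np.sqrt(n))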
In many cases we would like to learn something about a big population,
without actually inspecting every unit in that population. In that case we would like
to draw a sample that permits us to draw conclusions about a population of interest.
We may for example draw a sample from the population of Dutch men of 18 years
and older to learn something about the joint distribution of height and weight in this
population. Because we cannot draw conclusions about the population from a
sample without error, it is important to know how large these errors may be, and
how often incorrect conclusions may occur. An objective assessment of these errors
is only possible for a probability sample.
For a probability sample, the probability of inclusion in the sample is known
and positive for each unit in the population. Drawing a probability sample of size n
from a population consisting of N units, may be a quite complex random
experiment. The experiment is simplified considerably by subdividing it into n
experiments, consisting of drawing the n consecutive units. In a simple random
sample, the n consecutive units are drawn with equal probabilities from the units
concerned. In random sampling with replacement the sub experiments (drawing of
one unit) are all identical and independent: n times a unit is randomly selected from
the entire population. We will see that this property simplifies the ensuing analysis
considerably. For units in the sample we observe one or more population variables. For probability samples, each draw is a random experiment. Every observation may therefore be viewed as a random variable. The observation of a population variable X from the unit drawn in the ith trial yields a random variable Xi. Observation of the complete sample yields n random variables X1, ..., Xn. Likewise, if we observe for each unit the pair of population variables (X, Y), we obtain pairs of random variables (Xi, Yi) with outcomes (xi, yi). Consider the population of size N = 6 displayed in Table 1.
Unit:  1   2   3   4   5   6
X:     1   1   2   2   2   3

Table 1: A small population

x:        1     2     3
P(X=x):  1/3   1/2   1/6

Table 2: Probability distribution of X1 and X2

A random sample of size n = 2 is drawn with replacement from this population. For each unit drawn we observe the value of X. This yields two random variables X1 and X2, with identical probability distributions as displayed in Table 2. Furthermore, X1 and X2 are independent, so their joint distribution equals the product of their individual distributions, i.e.

p(x1, x2) = p(x1) p(x2).
Usually we are not really interested in the individual outcomes of the sample, but rather in some sample statistic. A statistic is a function of the sample observations X1, ..., Xn, and therefore is itself also a random variable. Some important sample statistics are the sample mean X̄ = (1/n) Σᵢ Xi, the sample variance S² = (1/(n−1)) Σᵢ (Xi − X̄)², and the sample fraction Fr = (1/n) Σᵢ Xi (for a 0-1 variable X), where the sums run over i = 1, ..., n.

(x1, x2)   p(x1, x2)   X̄     S²
(1,1)      1/9         1      0
(1,2)      1/6         1.5    0.5
(1,3)      1/18        2      2
(2,1)      1/6         1.5    0.5
(2,2)      1/4         2      0
(2,3)      1/12        2.5    0.5
(3,1)      1/18        2      2
(3,2)      1/12        2.5    0.5
(3,3)      1/36        3      0

Table 3: Probability distribution of a sample of size n = 2 drawn with replacement from the population

X̄      p(X̄)
1       1/9
1.5     1/3
2       13/36
2.5     1/6
3       1/36

Table 4: Sampling distribution of X̄

S²      p(S²)
0       14/36
0.5     1/2
2       1/9

Table 5: Sampling distribution of S²

The probability distribution of a sample statistic is called its sampling distribution. The sampling distributions of X̄ and S² are calculated easily from Table 3; they are displayed in Tables 4 and 5 respectively. Note that E(X̄) = 11/6 = μ, where μ denotes the population mean, and E(S²) = 17/36 = σ², where σ² denotes the population variance.
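The enumeration above can also be checked programmatically. The following sketch, assuming Python, lists all ordered samples of size n = 2 drawn with replacement from the population of Table 1 and reproduces E(X̄) = 11/6 and E(S²) = 17/36.

# Exact enumeration of the sampling distribution for the small example population.
from fractions import Fraction
from itertools import product

population = [1, 1, 2, 2, 2, 3]
N = len(population)

E_mean = Fraction(0)
E_var = Fraction(0)
for x1, x2 in product(population, repeat=2):      # every ordered pair is equally likely
    p = Fraction(1, N * N)
    xbar = Fraction(x1 + x2, 2)
    # Sample variance with n - 1 = 1 in the denominator.
    s2 = (Fraction(x1) - xbar) ** 2 + (Fraction(x2) - xbar) ** 2
    E_mean += p * xbar
    E_var += p * s2

print(E_mean)   # 11/6, the population mean
print(E_var)    # 17/36, the population variance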
In the above example, we were able to determine the probability distribution of
the sample, and sample statistics, by complete enumeration of all possible samples.
This was feasible only because the sample size and the number of distinct values of X were very small. When the sample is of realistic size, and X takes on many
distinct values, complete enumeration is not possible. Nevertheless, we would like
to be able to infer something about the shape of the sampling distribution of a
sample statistic, from knowledge of the distribution of X. We consider here two
options to make such inferences.
1. The distribution of X has some standard form that allows the mathematical
derivation of the exact sampling distribution.
2. We use a limiting distribution to approximate the sampling distribution of
interest. The limiting distribution may be derived from some characteristics
of the distribution of X.
The exact sampling distribution of a sample statistic is often hard to derive
analytically, even if the population distribution of X is known.
Re-sampling
Re-sampling is a method that consists of drawing repeated samples from the original data sample. Resampling is a nonparametric method of statistical inference; in other words, it does not involve the use of generic distribution tables (for example, normal distribution tables) to compute approximate p-values. Resampling involves the selection of randomized cases, with replacement, from the original data sample in such a manner that each sample drawn has the same number of cases as the original data sample. Due to replacement, the samples drawn by the method of resampling may contain repeated cases. Resampling generates a unique sampling distribution on the basis of the actual data. The method of resampling uses experimental methods, rather than analytical methods, to generate this sampling distribution.
The main techniques are:
1. Bootstrapping and Normal resampling (sampling from a normal distribution).
2. Permutation Resampling (also called Rearrangements or Rerandomization),
3. Cross Validation.
1. Bootstrapping and Normal Resampling
Bootstrapping is a type of resampling where large numbers of smaller samples of the same size are repeatedly drawn, with replacement, from a single original sample. Normal resampling is very similar to bootstrapping, as it is a special case of the normal shift model, one of the assumptions for bootstrapping (Westfall et al., 1993). Both bootstrapping and normal resampling assume that samples are drawn from an actual population (either a real one or a theoretical one). Another similarity is that both techniques use sampling with replacement. Ideally, you would want to draw large, non-repeated samples from a population in order to create a sampling distribution for a statistic. However, limited resources may prevent you from getting the ideal statistic. Resampling means that you can draw small samples over and over again from the same population. As well as saving time and money, the samples can be quite good approximations of population parameters.
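A minimal bootstrap sketch, assuming Python with NumPy and a small hypothetical sample, shows the idea: resample the original sample with replacement many times and use the spread of the bootstrap means as an estimate of the standard error.

# Bootstrap estimate of the standard error of the mean (hypothetical data).
import numpy as np

rng = np.random.default_rng(42)
sample = np.array([4.2, 5.1, 6.3, 5.8, 4.9, 6.0, 5.5, 4.7])   # hypothetical data

boot_means = [rng.choice(sample, size=sample.size, replace=True).mean()
              for _ in range(10_000)]

print("bootstrap estimate of the mean:", np.mean(boot_means))
print("bootstrap standard error:", np.std(boot_means))
# A simple 95% percentile interval from the bootstrap distribution:
print("95% interval:", np.percentile(boot_means, [2.5, 97.5]))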
2. Permutation Resampling
Unlike bootstrapping, permutation resampling doesn't need any "population"; resampling depends only on the assignment of units to treatment groups. The fact that you're dealing with actual samples, instead of populations, is one reason why it's sometimes referred to as the gold standard of bootstrapping techniques (Strawderman and Mehr, 1990). Another important difference is that permutation resampling is a without-replacement sampling technique.
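A hedged sketch of a two-group permutation test, assuming Python with NumPy and hypothetical treatment/control outcomes, illustrates the without-replacement reshuffling of group assignments.

# Two-group permutation test on hypothetical outcomes.
import numpy as np

rng = np.random.default_rng(0)
treatment = np.array([12.1, 13.4, 11.8, 14.0, 12.9])   # hypothetical outcomes
control = np.array([10.5, 11.2, 10.9, 11.8, 10.7])

observed = treatment.mean() - control.mean()
pooled = np.concatenate([treatment, control])

count = 0
n_perm = 10_000
for _ in range(n_perm):
    permuted = rng.permutation(pooled)              # reassign units to groups
    diff = permuted[:treatment.size].mean() - permuted[treatment.size:].mean()
    if diff >= observed:
        count += 1

print("one-sided permutation p-value:", count / n_perm)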
3. Cross Validation
Cross-validation is a way to validate a predictive model. Subsets of the data are
removed to be used as a validating set; the remaining data is used to form a training
set, which is used to predict the validation set.
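For illustration, a small cross-validation sketch, assuming Python with scikit-learn and its bundled diabetes dataset, fits a linear regression and scores it on each held-out fold.

# 5-fold cross-validation of a linear regression model.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)
model = LinearRegression()

# Each score is computed on a fold that was held out of training.
scores = cross_val_score(model, X, y, cv=5)
print("per-fold R^2 scores:", scores)
print("mean cross-validated R^2:", scores.mean())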
Statistical Inference
The relation between sample data and population may be used for reasoning in two directions: from a known population to yet-to-be-observed sample data, and from observed data to a (partially) unknown population. This last direction of reasoning is of an inductive nature and is addressed in statistical inference. It is the
form of reasoning most relevant to data analysis, since one typically has available
one set of sample data from which one intends to draw conclusions about the
unknown population.

Frequentist Inference
According to frequentists, inference procedures should be interpreted and evaluated
in terms of their behavior in hypothetical repetitions under the same conditions. The
sampling distribution of a statistic is of crucial importance. The two basic types of
frequentist inference are estimation and testing.

In estimation one wants to come up with a plausible value or range of plausible values for an unknown population parameter. In testing one wants to decide whether a hypothesis concerning the value of an unknown population parameter should be accepted or rejected in the light of sample data.
Point Estimation
In point estimation one tries to provide an estimate for an unknown population
parameter, denoted by θ, with one number: the point estimate. If G denotes the
estimator of θ, then the estimation error is a random variable G − θ, which should
preferably be close to zero. An important quality measure from a frequentist point
of view is the bias of an estimator
Bθ = Eθ(G − θ) = Eθ(G) − θ

where expectation is taken with respect to repeated samples from the population. If
Eθ(G) = θ, i.e. the expected value of the estimator is equal to the value of the
population parameter, then the estimator G is called unbiased.
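As an illustration of bias, the following simulation sketch, assuming Python with NumPy, compares the variance estimator that divides by n (biased) with the one that divides by n − 1 (approximately unbiased in simulation).

# Estimating the bias of two variance estimators by simulation.
import numpy as np

rng = np.random.default_rng(1)
true_var = 4.0
n = 10

biased, unbiased = [], []
for _ in range(20_000):
    sample = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=n)
    biased.append(sample.var(ddof=0))     # divide by n
    unbiased.append(sample.var(ddof=1))   # divide by n - 1

print("E(G) with ddof=0:", np.mean(biased))     # noticeably below 4.0, so biased
print("E(G) with ddof=1:", np.mean(unbiased))   # close to 4.0, bias roughly zero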
Interval Estimation
An interval estimator for population parameter θ is an interval of type (GL, GU ).
Two important quality measures for interval estimates are:
Eθ(GU − GL),
i.e. the expected width of the interval, and
Pθ(GL < θ < GU ),
i.e. the probability that the interval contains the true parameter value. Clearly there
is a trade-off between these quality measures. If we require a high probability that
the interval contains the true parameter value, the interval itself has to become
wider. It is customary to choose a confidence level (1 − α) and use an interval
estimator such that
Pθ(GL < θ < GU ) ≥ 1 − α
for all possible values of θ. A realisation (gL, gU ) of such an interval estimator is
called a 100(1 − α)% confidence interval.
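A small sketch, assuming Python with SciPy and hypothetical data, computes a realisation (gL, gU) of a 95% confidence interval for a population mean using the t distribution.

# 95% t confidence interval for a population mean (hypothetical data).
import numpy as np
from scipy import stats

sample = np.array([5.1, 4.8, 5.6, 5.0, 4.9, 5.3, 5.2, 4.7])   # hypothetical data
n = sample.size
mean = sample.mean()
sem = stats.sem(sample)            # estimated standard error of the mean

g_l, g_u = stats.t.interval(0.95, n - 1, loc=mean, scale=sem)
print("95% confidence interval:", (g_l, g_u))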
Hypothesis Testing
A test is a statistical procedure to make a choice between two hypotheses
concerning the value of a population parameter θ. One of these, called the null
hypothesis and denoted by H0, gets the “benefit of the doubt”. The two possible
conclusions are to reject or not to reject H0. H0 is only rejected if the sample data
contains strong evidence that it is not true. The null hypothesis is rejected if and only if the realisation g of the test statistic G is in the critical region, denoted by C. In doing so we can make two kinds of errors. Type I error: reject H0 when it is true. Type II error: accept H0 when it is false. Type I errors are considered to be more serious than Type II errors. The test statistic G is usually a point estimator for θ; e.g. if we test a hypothesis concerning the value of the population mean µ, then X̄ is an obvious choice of test statistic.
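For illustration, the following sketch, assuming Python with SciPy and hypothetical data, performs a one-sample t-test of H0: µ = 5.0, with the sample mean X̄ driving the test statistic.

# One-sample t-test of H0: mu = 5.0 (hypothetical data).
import numpy as np
from scipy import stats

sample = np.array([5.4, 5.9, 5.1, 6.2, 5.7, 5.8, 5.5, 6.0])   # hypothetical data
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)

alpha = 0.05   # probability of a Type I error we are willing to accept
print("t statistic:", t_stat, "p-value:", p_value)
print("reject H0" if p_value < alpha else "do not reject H0")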
Prediction Error
A prediction error is the failure of some expected event to occur. When
predictions fail, humans can use metacognitive functions, examining prior
predictions and failures and deciding, for example, whether there
are correlations and trends, such as consistently being unable to foresee outcomes
accurately in particular situations. Applying that type of knowledge can inform
decisions and improve the quality of future predictions. Predictive analytics
software processes new and historical data to forecast activity, behavior and trends.
The programs apply statistical analysis techniques, analytical queries and machine
learning algorithms to data sets to create predictive models that quantify the
likelihood of a particular event happening.
Errors are an inescapable element of predictive analytics that should also be
quantified and presented along with any model, often in the form of a confidence
interval that indicates how accurate its predictions are expected to be. Analysis of
prediction errors from similar or previous models can help determine confidence
intervals.
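As a rough illustration, the following sketch, assuming Python with NumPy and hypothetical actual/predicted values, quantifies prediction errors with an RMSE and an empirical error interval of the kind that could be reported alongside a model.

# Quantifying prediction error from hypothetical actual and predicted values.
import numpy as np

actual = np.array([102.0, 98.5, 110.2, 95.0, 101.3, 99.8])      # hypothetical
predicted = np.array([100.1, 99.0, 107.5, 97.2, 100.0, 101.1])  # hypothetical

errors = actual - predicted
rmse = np.sqrt(np.mean(errors ** 2))
interval = np.percentile(errors, [2.5, 97.5])   # crude empirical error spread

print("RMSE:", rmse)
print("approximate 95% error interval:", interval)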
In artificial intelligence (AI), the analysis of prediction errors can help
guide machine learning (ML), similarly to the way it does for human learning.
In reinforcement learning, for example, an agent might use the goal of minimizing
error feedback as a way to improve. Prediction errors, in that case, might be
assigned a negative value and predicted outcomes a positive value, in which case
the AI would be programmed to attempt to maximize its score. That approach to
ML, sometimes known as error-driven learning, seeks to stimulate learning by
approximating the human drive for mastery.
