Business Analytics Using R - A Practical Approach
Dr. Umesh R. Hodeghatta
Umesha Nayak
Contents at a Glance

Chapter 1: Overview of Business Analytics
Chapter 2: Introduction to R
Chapter 3: R for Data Analysis
Chapter 4: Introduction to Descriptive Analytics
Chapter 5: Business Analytics Process and Data Exploration
Chapter 6: Supervised Machine Learning—Classification
Chapter 7: Unsupervised Machine Learning
Chapter 8: Simple Linear Regression
Chapter 9: Multiple Linear Regression
Chapter 10: Logistic Regression
Chapter 11: Big Data Analysis—Introduction and Future Trends
References
Index
Contents
Chapter 1: Overview of Business Analytics
1.1 Objectives of This Book
1.2 Confusing Terminology
1.3 Drivers for Business Analytics
1.3.1 Growth of Computer Packages and Applications
1.3.2 Feasibility to Consolidate Data from Various Sources
1.3.3 Growth of Infinite Storage and Computing Capability
1.3.4 Easy-to-Use Programming Tools and Platforms
1.3.5 Survival and Growth in the Highly Competitive World
1.3.6 Business Complexity Growing out of Globalization
2.3 Basics of R Programming
2.3.1 Assigning Values
2.3.2 Creating Vectors
4.6 Probability
4.6.1 Probability of Mutually Exclusive Events
4.6.2 Probability of Mutually Independent Events
4.6.3 Probability of Mutually Non-Exclusive Events
4.6.4 Probability Distributions
4.7 Chapter Summary
Chapter 5: Business Analytics Process and Data Exploration
5.1 Business Analytics Life Cycle
5.1.1 Phase 1: Understand the Business Problem
5.1.2 Phase 2: Collect and Integrate the Data
5.1.3 Phase 3: Preprocess the Data
5.1.4 Phase 4: Explore and Visualize the Data
5.1.5 Phase 5: Choose Modeling Techniques and Algorithms
5.1.6 Phase 6: Evaluate the Model
5.1.7 Phase 7: Report to Management and Review
5.1.8 Phase 8: Deploy the Model
5.2 Understanding the Business Problem
5.3 Collecting and Integrating the Data
5.3.1 Sampling
5.3.2 Variable Selection
Chapter 6: Supervised Machine Learning—Classification
6.1 What Is Classification? What Is Prediction?
6.2 Probabilistic Models for Classification
6.2.1 Example
6.2.2 Naïve Bayes Classifier Using R
6.2.3 Advantages and Limitations of the Naïve Bayes Classifier
6.3 Decision Trees
6.3.1 Recursive Partitioning Decision-Tree Algorithm
6.3.2 Information Gain
6.3.3 Example of a Decision Tree
6.3.4 Induction of a Decision Tree
6.3.5 Classification Rules from Tree
6.3.6 Overfitting and Underfitting
6.3.7 Bias and Variance
6.3.8 Avoiding Overfitting Errors and Setting the Size of Tree Growth
7.7 Summary
Chapter 8: Simple Linear Regression
8.1 Introduction
8.2 Correlation
8.2.1 Correlation Coefficient
8.4.5 Conclusion
8.4.6 Predicting the Response Variable
8.4.7 Additional Notes
10.1.5 Multicollinearity
10.1.6 Dispersion
10.1.7 Conclusion for Logistic Regression
References
Index
About the Authors
at Polaris Software Lab, Chennai, prior to his current assignment. He started his journey with
computers in 1981 with ICL mainframes and continued further with minis and PCs. He was
one of the founding members of information systems auditing in the banking industry in
India. He has effectively guided many organizations through successful ISO 9001/ISO 27001/
CMMI and other certifications and process/product improvements and business analytics.
He has co-authored The InfoSec Handbook: An Introduction to Information Security. You
may reach him at aum136@rediffmail.com.
About the Technical Reviewer
CHAPTER 1
Overview of Business Analytics
Today’s world is knowledge based. In the earliest days, knowledge was gathered through
observation. Later, knowledge not only was gathered through observation, but also
confirmed by actually doing and then extended by experimenting further. Knowledge
thus gathered was applied to practical fields and extended by analogy to other fields.
Today, knowledge is gathered and applied by analyzing, or deep-diving into, the data
accumulated through various computer applications, web sites, and more. The advent of
computers complemented the knowledge of statistics, mathematics, and programming.
The enormous storage and extended computing capabilities of the cloud, especially, have
ensured that knowledge can be quickly derived from huge amounts of data and also can
be used for further preventive or productive purposes. This chapter provides you with the
basic knowledge of where and how business analytics is used.
Imagine the following situations:
• You visit a hotel in Switzerland and are welcomed with your
favorite drink and dish; how delighted you are!
• You are offered a stay at a significantly discounted rate at your
favorite hotel when you travel to your favorite destination.
• You are forewarned about the high probability of becoming a
diabetic. You are convinced about the reasoning behind this
warning and take the right steps to avoid it.
• You are forewarned of a probable riot at your planned travel
destination. Based on this warning, you cancel the visit; you later
learn from news reports that a riot did happen at that destination!
• You are forewarned of an incompatibility with the person whom
you consider making your life partner, based on both of your
personal characteristics; you avoid a possible divorce!
• You enter a grocery store and you find that your regular monthly
purchases are already selected and set aside for you. The only
decision you have to make is whether you require all of them or
want to remove some from the list. How happy you are!
• Your preferred airline reserves tickets for you well in advance of your
vacation travels and at a lower rate compared to the market rate.
• You are planning to travel and are forewarned of a possible
cyclone in that place. Based on that warning, you postpone your
visit. Later, you find that the cyclone created havoc, and you
avoided a terrible situation.
We can imagine many similar scenarios that are made possible by analyzing data
about you and your activities that is collected through various means—including your
Google searches, visits to various web sites, your comments on social media sites, your
activities using various computer applications, and more. The use of data analytics in
these scenarios has focused on your individual perspective.
Now, let’s look at scenarios from a business perspective. Imagine these situations:
• You are in the hotel business and are able to provide competitive
yet profitable rates to your prospective customers. At the same
time, you can ensure that your hotel is completely occupied all
the time by providing additional benefits, including discounts
on local travel and local sightseeing offers tied into other local
vendors.
• You are in the taxi business and are able to repeatedly attract
the same customers based on their earlier travel history and
preferences of taxi type and driver.
• You are in the fast-food business and offer discounted rates to
attract customers on slow days. These discounts enable you to
ensure full occupancy on those days also.
• You are in the human resources (HR) department of an
organization and are bogged down by high attrition. But now you
are able to understand the types of people you should focus on
recruiting, based on the characteristics of those who perform well
and who are more loyal and committed to the organization.
• You are in the airline business, and based on data collected by
the engine system, you are warned of a potential engine failure in
the next three months. You proactively take steps to carry out the
necessary corrective actions.
• You are in the business of designing, manufacturing, and selling
medical equipment used by hospitals. You are able to understand
the possibility of equipment failure well before the equipment
actually fails, by carrying out analysis of the errors or warnings
captured in the equipment logs.
All these scenarios are possible by analyzing data that the businesses and others
collect from various sources. There are many such possible scenarios. The application of
data analytics to the field of business is called business analytics.
You have likely observed the following scenarios:
• You’ve been searching for the past few days on Google for
adventurous places to visit. You’ve also tried to find various travel
packages that might be available. You suddenly find that various
other web sites you visit or the searches you make show a specific
advertisement of what you are looking for, and that too at a
discounted rate.
• You’ve been searching for a specific item to purchase on Amazon
(or any other site). Suddenly, on other sites you visit, you find
advertisements related to what you are looking for or find
customized mail landing in your mailbox, offering discounts
along with other items you might be interested in.
• You’ve also seen recommendations that Amazon makes based
on your earlier searches for items, your wish list, or previous
Amazon purchases. Many times you’ve also likely observed
Amazon offering you discounts or promoting products based on
its available data.
All of these possibilities are now a reality because of data analytics specifically used
by businesses. This book takes you through the exciting field of business analytics and
enables you to step into this field as well.
1.2 Confusing Terminology
Many terms are used in discussions of this topic—for example, data analytics, business
analytics, big data analytics, and data science. Most of these are, in a sense, the same.
However, the purpose of the analytics, the extent of the data that’s available for analysis,
and the difficulty of the data analysis may vary from one to the other. Finally, regardless
of the differences in terminology, we need to know how to use the data effectively for our
businesses. These differences in terminology should not get in the way of applying
techniques to the data (especially in analyzing it and using it for various purposes
including understanding it, deriving models from it, and then using these models for
predictive purposes).
In layman’s terms, let’s look at some of this terminology:
• Data analytics is the analysis of data, whether huge or small, in
order to understand it and see how to use the knowledge hidden
within it. An example is the analysis of the data related to various
classes of travelers (as noted previously).
1.4.2 Human Resources
Retention is the biggest problem faced by an HR department in any industry, especially in
the support industry. An HR department can identify which employees have high potential
for retention by processing employee data. Similarly, an HR department can also analyze
which competence (qualification, knowledge, skill, or training) has the most influence on
the organization’s or team’s capability to deliver quality within committed timelines.
1.4.3 Product Design
Product design is not easy and often involves complicated processes. Risks factored in
during product design, subsequent issues faced during manufacturing, and any resultant
issues faced by customers or field staff can be a rich source of data that can help you
understand potential issues with a future design. This analysis may reveal issues with
materials, issues with the processes employed, issues with the design process itself, issues
with the manufacturing, or issues with the handling of the equipment installation or later
servicing. The results of such an analysis can substantially improve the quality of future
designs by any company. Another interesting aspect is that data can help indicate which
design aspects (color, sleekness, finish, weight, size, or material) customers like and
which ones customers do not like.
1.4.4 Service Design
Like products, services are also carefully designed and priced by organizations.
Identifying what is part of the service (and what is not) also depends on product design
and cost factors compared to pricing. The length of warranty, coverage during warranty,
and pricing for various services can also be determined based on data from earlier
experiences and from target market characteristics. Some customer regions may more
easily accept “use and throw” products, whereas other regions may prefer “repair and
use” kinds of products. Hence, the types of services need to be designed according to the
preferences of regions. Again, different service levels (responsiveness) may have different
price tags and may be targeted toward a specific segment of customers (for example, big
corporations, small businesses, or individuals).
2. Study the data and then clean up the data for missing data elements or errors.
3. Check for the outliers in the data and remove them from the data set to reduce their adverse impact on the analysis.
4. Identify the data analysis technique(s) to be used (for example, supervised or unsupervised).
5. Analyze the results and check whether alternative technique(s) can be used to get better results.
6. Validate the results and then interpret the results.
7. Publish the results (learning/model).
8. Use the learning from the results/model arrived at.
9. Keep calibrating the learning/model as you keep using it.
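A minimal R sketch of the cleanup and outlier steps above, using a made-up data frame (the data and the 1.5 × IQR outlier rule are illustrative assumptions, not from the book):

```r
# Stand-in for data read with read.csv(); 'x' is a hypothetical numeric column.
data <- data.frame(x = c(10, 12, 11, 13, 200, NA, 12))

# Clean up missing data elements.
data <- data[!is.na(data$x), , drop = FALSE]

# Flag and remove outliers, here anything beyond 1.5 * IQR of the quartiles.
q <- quantile(data$x, c(0.25, 0.75))
iqr <- q[2] - q[1]
keep <- data$x >= q[1] - 1.5 * iqr & data$x <= q[2] + 1.5 * iqr
data <- data[keep, , drop = FALSE]

# With a clean data set, apply the chosen analysis technique; as a trivial
# stand-in, compute a summary statistic.
mean(data$x)
```

In this sketch the value 200 is dropped as an outlier and the NA row is removed before any analysis, mirroring steps 2 and 3 of the life cycle.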
1.8 Summary
To start with, you went through an introduction to how knowledge has evolved. You
also went through many scenarios in which data analytics helps individuals, as well as
many examples of business analytics helping businesses grow and compete effectively.
You also saw examples of how businesses use business analytics results effectively.
Next, you were taken through the objectives of this book. Primarily, this book is
intended to be a practical guidebook enabling you to acquire necessary skills. This is an
introductory book. You are not going to focus on terminology here but are going to look
into the practical aspects. This book will also show how to use R for business analytics.
Next you went through the definitions of data analytics, business analytics, data
science, and big data analytics. You were provided with these definitions in layman’s
terms in order to remove any possible confusion about the terminology.
Then you explored important drivers for business analytics, including the growth of
computer packages and applications, the feasibility of consolidating data from various
sources, the growth of infinite storage and computing capabilities, the increasing
numbers of easy-to-use programming tools and platforms, the need for companies
to survive and grow among their competition, and business complexity in this era of
globalization. Next you got introduced to the applications of business analytics with
examples in certain fields.
You briefly went through the skills required for a business analyst. In particular, you
understood the importance of the following: understanding the business and business
problems, data analysis techniques and algorithms, computer programming, data
structures and data storage/warehousing techniques, and statistical and mathematical
concepts required for data analytics.
Finally, you also briefly went through the life cycle of the business analytics project.
CHAPTER 2
Introduction to R
This chapter introduces the R tool, environment, workspace, variables, data types, and
fundamental tool-related concepts. This chapter also covers how to install R and RStudio.
After reading this chapter, you'll have enough of a foundation to start R programming
for data analysis.
Table 2-1. (continued)

7. Salstat (open source): Descriptive statistics, inferential statistics, parametric and nonparametric analysis, bar charts, box plots, histograms, and more. URL: www.salstat.com
8. IBM SPSS: Full set of statistical analysis; parametric and nonparametric analysis, classification, regression, clustering analysis; bar charts, histograms, box plots; social media analysis, text analysis, and so forth. URL: www-01.ibm.com/software/analytics/spss/
9. Stata by StataCorp: Descriptive statistics, ARIMA, ANOVA and MANOVA, linear regression, time-series smoothers, generalized linear models (GLMs), cluster analysis. For more details, refer to http://www.stata.com/features/. URL: www.stata.com
10. Statistica: Statistical analysis, graphs, plots, data mining, data visualization, and so forth. For more details, refer to http://www.statsoft.com/Products/STATISTICA-Features. URL: www.statsoft.com
11. SciPy, pronounced "Sigh Pie" (open source): Python library used by scientists and analysts doing scientific and technical computing. SciPy contains modules for optimization, linear algebra, interpolation, digital signal and image processing, and machine-learning techniques. URL: www.scipy.org
12. Weka, or Waikato Environment for Knowledge Analysis (open source): A collection of visualization tools and algorithms for data analysis and predictive modeling, together with graphical user interfaces for easy access to these functions. URL: www.cs.waikato.ac.nz/ml/weka/
13. RapidMiner (open source): Integrated environment for machine learning, data mining, text mining, predictive analytics, and business analytics. URL: https://rapidminer.com/
14. R (open source): Full set of functions to support statistical analysis, histograms, box plots, hypothesis testing, inferential statistics, t-tests, ANOVA, machine learning, clustering, and so forth. URL: www.r-project.org
15. Minitab by Minitab Statistical Software: Descriptive statistical analysis, hypothesis testing, data visualization, t-tests, ANOVA, regression analysis, reliability, and survival analysis. For more details, refer to https://www.minitab.com/en-us/products/minitab/. URL: www.minitab.com
16. Tableau Desktop by Tableau Software: Statistical summaries of your data; experiment with trend analyses, regressions, and correlations; connect directly to your data for live, up-to-date data analysis that taps into the power of your data warehouse. URL: www.tableau.com/products/desktop
17. TIBCO Spotfire: Statistical and full predictive analytics; integration of R, S+, SAS, and MATLAB into Spotfire and custom applications. For more details, refer to http://spotfire.tibco.com/discover-spotfire/what-does-spotfire-do/predictive-analytics/tibco-spotfire-statistics-services-tsss. URL: http://spotfire.tibco.com/
18. SAS by SAS: Advanced statistical and machine-learning functions and much more. URL: www.sas.com
Data analysis tools help in analyzing data and support handling and manipulating
data, statistical analysis, graphical analysis, building various types of models, and
reporting. R is an integrated suite of software packages for data handling, data
manipulation, statistical analysis, graphical analysis, and developing learning models.
This software is an extension of the S software originally developed by Bell Labs. It is open
source, free software, licensed under the GNU General Public License (www.gnu.org/
licenses/gpl-2.0.html) and supported by large research communities spread all over
the world. In the year 2000, R version 1.0.0 was released to the public.
R has the following advantages, and hence it is the most recommended tool for data
scientists today:
• Flexible, easy, and friendly graphical capabilities that can be
displayed on the video display of your computer or stored in
different file formats.
• Data storage facility to store large amounts of data effectively in
the memory for data analysis.
• Large number of free packages available for data analysis.
• Provides all the capabilities of a programming language.
• Supports getting data from a wide variety of sources, including
text files, database management systems, web XML files, and
other repositories.
• Runs on a wide array of platforms, including Windows, Unix, and
macOS.
• Most commercial statistical software platforms cost thousands or
even tens of thousands of dollars. R is free! If you’re a teacher or a
student, the benefits are obvious.
However, most R programs are written for a single, specific piece of data analysis.
2.2 R Installation
R is available for all the major computing platforms, including macOS, Windows, and
Linux. As of November 1, 2016, the latest R version is 3.3.2.
2.2.1 Installing R
Follow these steps to download the binaries:

1. Go to the official R site at www.r-project.org.
2. Click the Download tab.
3. Select the operating system.
4. Read the instructions to install the software. For example, installing on Linux/Unix requires root/admin permissions, and command-line options are different from those on other platforms. On Windows, you just have to click the installer and follow the instructions provided.
5. Pick your geographic area and mirror site to download.
6. Download the installer and run the installer.
7. Follow the instructions by the installer to successfully install the software.
After the installation, click the icon to start R. A window appears, showing the
R console (as shown in Figure 2-1).
2.2.2 Installing RStudio
RStudio provides an integrated development environment (IDE) for R. RStudio is
available in two variants: a desktop version and a server version. RStudio Desktop allows
RStudio to run locally on the desktop. RStudio Server runs on a web server and can be
accessed remotely by using a web browser. RStudio Desktop is available for Microsoft
Windows, macOS, and Linux. For more information on RStudio and its support, refer to
the RStudio web site at www.rstudio.com/products/RStudio/#Desktop.
This section details the installation of RStudio Desktop or RStudio Server for
Windows. The installation procedure is similar for other OSs. For more details and
procedures for installation, please download the RStudio desktop version and follow
the instructions given on the web site: https://www.rstudio.com/products/
RStudio/#Desktop. To install RStudio, follow these steps:
1. Go to the official RStudio web site: www.rstudio.com.
2. Click the Products option.
3. Select the server version or desktop version, depending on your needs.
4. Click the Downloads option and select the appropriate OS. For Windows, select the installer.
5. After downloading the installer, run the installer and follow its instructions.
Please note that R has to be installed first. If R is not installed, at the end of RStudio
installation, an error message will appear and ask you to install R (Figure 2-2).
RStudio has four windows, which allow you to write scripts, view the output, view
the environment and the variables, and view the graphs and plots. The top-left window
allows you to enter the R commands or scripts. R scripts provided in the window can be
executed one at a time or as a file. The code also can be saved as an R script for future
reference. Each R command can be executed by clicking Run at the top-right corner of
this window.
The bottom-left window is the R console, which displays the R output results. Also,
you can enter any R command in this window to check the results. Because it is a console
window, your R commands cannot be stored.
The top-right window lists the environment variable types and global variables.
The bottom-right window shows the generated graphs and plots, and provides help
information. It also has an option to export or save plots to a file.
Figure 2-4 shows a window with a sample R script, its output in the console, a graph
in the graphical window, and environment variable types.
2.3 Basics of R Programming
R is a programming language specifically for statistical analysis, and there is no need
to compile the code. As soon as you hit the Return key after entering the command,
the R script executes and the output is displayed in the console. The output also can be
redirected to files or printers, which we discuss later in this section.
After you are in the R console window, you are prompted with a greater-than symbol
(>), which indicates that R is ready to take your commands. For example, if you type 4 + 3
and hit Return, R displays the results in the console, as shown in Figure 2-5.
Notice that the output has two parts: the first part, [1], is the index of the first
element displayed in that row of the output, and 7 is the result. In R, any
number you enter on the console is interpreted as a vector.
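The console screenshot (Figure 2-5) is not reproduced here; the equivalent typed exchange looks like this:

```r
# Simple arithmetic at the R prompt; the result is a length-one vector.
result <- 4 + 3
result   # prints: [1] 7  ('[1]' is the index of the first element shown)
```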
2.3.1 Assigning Values
The next step is assigning values to variables. In R, you can use the assignment function
or just the <- shortcut:
The direction of the arrow does not really matter. You can use the equals sign (=) as well:
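The book shows these assignments as a console screenshot; the equivalent commands are (variable names are illustrative):

```r
x <- 5           # the usual assignment operator
assign("y", 10)  # the assignment function does the same thing
7 -> z           # the arrow also works pointing to the right
w = 12           # the equals sign works as well
x + y + z + w    # prints: [1] 34
```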
2.3.2 Creating Vectors
In R, everything is represented as a vector. Vectors are one-dimensional arrays that can
hold numeric data, character data, or logical data. To create a vector, you use the function
c(). Here are examples of creating vectors:
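The original examples are screenshots; a sketch with illustrative values, matching the vectors a, b, and c described below:

```r
a <- c(1, 2, 5.3, 6)           # numeric vector
b <- c(TRUE, FALSE, TRUE)      # logical vector
c <- c("one", "two", "three")  # character vector
class(a)   # prints: [1] "numeric"
class(b)   # prints: [1] "logical"
class(c)   # prints: [1] "character"
```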
Here, a is a numeric vector, b is a logical vector, and c is a character vector. Note that
the data stored in a vector can be of only one type.
Alternatively, you can use the vector() function to create a vector of a given type,
such as numeric:
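A small sketch of the vector() function (the length and assigned value are illustrative):

```r
v <- vector(mode = "numeric", length = 5)
v          # prints: [1] 0 0 0 0 0  (numeric vectors start filled with zeros)
v[2] <- 3.5
v          # prints: [1] 0.0 3.5 0.0 0.0 0.0
```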
2.5 Data Structures in R
Data structures in R are known as objects. An object can be a vector, scalar, matrix,
array, data frame, or list. A list is a special type of object that can hold any or all of the other
object types. Figure 2-6 shows the data structure representation in R. Let's take a look at each of
the data structure types now.
2.5.1 Matrices
R uses two terms to refer to two-dimensional data sets: matrix and array. Software
engineers working with application development and images call a two-dimensional data
set an array, whereas mathematicians and statisticians refer to the same two-dimensional
data set as a matrix. A two-dimensional array can function exactly like a matrix. The
terms matrix and array are interchangeable, but the different conventions assumed by
different people may sometimes cause confusion.
In R, matrices are created using the matrix() function. The general format is as follows:
Here, vector contains the elements for the matrix, using nrow and ncol to specify
the row and column dimensions of the matrix, respectively; dimnames contains row and
column labels stored in character vectors, which is optional. The byrow indicates whether
the matrix should be filled in by row (byrow=TRUE) or by column (byrow=FALSE). The
default is by column.
The following example in R shows how to create a matrix by using the matrix()
function:
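The original code figure is not shown; a sketch with illustrative values and labels:

```r
cells  <- c(1, 26, 24, 68)
rnames <- c("R1", "R2")
cnames <- c("C1", "C2")
mymatrix <- matrix(cells, nrow = 2, ncol = 2, byrow = TRUE,
                   dimnames = list(rnames, cnames))
mymatrix
```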
You can access each row, each column, or an element of the matrix as shown by the
following example:
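A sketch of matrix subscripting (the matrix itself is illustrative):

```r
mymatrix <- matrix(1:12, nrow = 3, ncol = 4)
mymatrix[2, ]     # the second row
mymatrix[, 3]     # the third column
mymatrix[2, 3]    # the element in row 2, column 3
# [1] 8
```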
2.5.2 Arrays
Arrays can have more than two dimensions. In R, they’re created by using the array()
function:
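The format line from the figure is missing; the actual signature (the text's "vector" and "dimensions" correspond to the data and dim arguments) is:

```r
args(array)
# function (data = NA, dim = length(data), dimnames = NULL)
```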
Here, vector contains the data for the array, dimensions is a numeric vector giving
the maximal index for each dimension, and dimnames is a list of dimension labels.
The following example shows how to create a three-dimensional (2 × 3 × 2) array:
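The code figure is not reproduced; a reconstruction consistent with the element z[2, 3, 2] = 12 mentioned below (dimension labels are illustrative):

```r
dim1 <- c("A1", "A2")
dim2 <- c("B1", "B2", "B3")
dim3 <- c("C1", "C2")
z <- array(1:12, dim = c(2, 3, 2), dimnames = list(dim1, dim2, dim3))
z
```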
Arrays are extensions of matrices. They can be useful in programming some of the
statistical methods. Identifying elements of an array is similar to what you’ve seen for
matrices. For the previous example, the z[2, 3, 2] element is 12:
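A sketch of the indexing (recreating z so the snippet stands alone):

```r
z <- array(1:12, dim = c(2, 3, 2))
z[2, 3, 2]
# [1] 12
```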
2.5.3 Data Frames
Data frames are important object types and the most common data structure you will use
in R. They are similar to data storage in Microsoft Excel, Stata, SAS, or SPSS.
You can have multiple columns of different types; columns can be integer or
character or logical types. Though a data frame is similar to a matrix or an array, unlike
a matrix or an array you can store multiple data types in data frames. Data frames are a
special type of list, and every element of the list has to have the same length. Each column
can be named, indicating the names of the variables or predictors. Data frames also have
row names, accessed with row.names(), which identify the information in each row.
When you read files by using read.csv(), data frames are automatically created.
A data frame is created with the data.frame() function:
Here, col1, col2, and col3 are column vectors of any type (such as character,
numeric, or logical). Names for each column can be provided with the names()
function. The following example makes this clear:
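The code figure is not shown; a minimal sketch with hypothetical column names and values:

```r
empID   <- c(1, 2, 3, 4)
empName <- c("Rao", "John", "Anita", "Mike")
status  <- c(TRUE, FALSE, TRUE, TRUE)
mydata <- data.frame(empID, empName, status)
names(mydata)
# [1] "empID"   "empName" "status"
```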
You can access the elements of the data set in several ways. You can use the subscript
notation you used before in the matrices or you can specify column names. The following
example demonstrates these approaches:
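The original example is missing; a sketch of the common access patterns (data values are hypothetical):

```r
mydata <- data.frame(empID = 1:4,
                     empName = c("Rao", "John", "Anita", "Mike"))
mydata[1:2]            # columns by subscript, as with matrices
mydata[, "empName"]    # a column by name
mydata$empID           # the $ notation
mydata[2, ]            # the second row
```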
2.5.4 Lists
Lists are the most complex of the R data types. Lists allow you to specify and store any
data type object. A list can contain a combination of vectors, matrices, or even data
frames under one single object name. You create a list by using the list() function:
Here, object1 and object2 are any of the structures seen so far. Optionally, you can
name the objects in a list:
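The format lines from the figures are missing; a sketch of both forms (component names are illustrative):

```r
a <- c(1, 2, 3)
m <- matrix(1:4, nrow = 2)
mylist  <- list(a, m)               # unnamed components
mylist2 <- list(vec = a, mat = m)   # named components
```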
The following example shows how to create a list. In this example, you use the
list() function to create a list with four components: a string, a numeric vector, a
character, and a matrix. You can combine any number of objects and save them as a list.
To access a list component, specify its position or its name. For example,
myFirstList[[2]] and myFirstList[["ID"]] both refer to the same numeric
vector, accessed by position and by name, respectively.
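The original figure is lost; a reconstruction consistent with the description (a string, a numeric vector named ID, a character, and a matrix; the values are hypothetical):

```r
myFirstList <- list(title  = "Employee data",
                    ID     = c(101, 102, 103),
                    grade  = "A",
                    scores = matrix(1:4, nrow = 2))
myFirstList[[2]]        # the numeric vector, by position
myFirstList[["ID"]]     # the same vector, by name
```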
2.5.5 Factors
Factors are special data types used to represent categorical data, which is important
in statistical analysis. A factor is an integer vector in which each integer has a label. For
example, if your data has a variable named sex with the values Male and Female, then
factor() automatically assigns 1 to Female and 2 to Male (levels are sorted alphabetically
by default). With these assignments, it is easy to identify values and perform statistical
analysis. Now suppose we have another variable named Performance, with the values
"Excellent", "Average", and "Poor". This is an example of an ordinal type: you know that
if a specific region is doing excellent sales, its performance is "Excellent". In R,
categorical (and ordinal) variables are called factors, and they play a crucial role in
determining how data is analyzed and presented visually.
Factor objects can be created using the factor() function that stores the categorical
values as a vector of integers in the range [1... j], and automatically assigns an internal
vector of character strings (the original values) mapped to these integers.
For example, assume that you have the vector performance <- c("Excellent",
"Average", "Poor", "Average"):
The statement performance <- factor(performance) stores this vector internally as
(2, 1, 3, 1) and associates it with 1 = Average, 2 = Excellent, 3 = Poor (the assignment
of levels is alphabetical).
Any analysis performed on the vector performance will treat the variable as nominal
and select the statistical methods appropriate for this measurement.
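A sketch of this in R; the internal codes can be checked with as.integer():

```r
performance <- c("Excellent", "Average", "Poor", "Average")
performance <- factor(performance)
levels(performance)       # "Average" "Excellent" "Poor" (alphabetical)
as.integer(performance)   # 2 1 3 1
```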
First, you enter the data as vectors: strID, month, sales, region, and
performance. Then you specify that performance is an ordered factor. Finally, you
combine the data into a data frame. The function str(object) provides information
on an object in R (the data frame, in this case). It clearly shows that performance is an
ordered factor and region is a factor, along with how each is coded internally.
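The worked figure is not reproduced; a sketch of the steps just described, with hypothetical data values:

```r
strID  <- c("S1", "S2", "S3", "S4")
month  <- c("Jan", "Jan", "Feb", "Feb")
sales  <- c(120, 80, 150, 60)
region <- c("North", "South", "North", "South")
performance <- c("Excellent", "Average", "Excellent", "Poor")
performance <- factor(performance, ordered = TRUE,
                      levels = c("Poor", "Average", "Excellent"))
salesdata <- data.frame(strID, month, sales, region, performance,
                        stringsAsFactors = TRUE)  # makes region a factor in R >= 4.0
str(salesdata)
```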
2.6 Summary
In this chapter, you looked at various statistical and data analysis tools available in
the market as well as the strengths of R that make it an attractive option for students,
researchers, statisticians, and data scientists. You walked through the steps of installing
R and RStudio. You also learned about the basics of R, including various data types like
vectors, arrays, matrices, and lists—including one of the most important data structures,
the data frame.
Now that you have R up and running, it’s time to get your data for the analysis. In the
next chapter, you’ll look at performing file manipulations, reading various types of files,
and using functions, loops, plots, and graphs; you’ll also see other aspects of R that are
required to perform effective data analysis. We don’t want to make you wait any longer. In
the next chapter, you’ll jump directly into the basics of analysis using R.
CHAPTER 3
R for Data Analysis
file: The name of the file. The full path of the file should be
specified (use getwd() to check the path). Each line in a file
is translated as each row of a data frame. file can also be a
complete URL.
header: A logical value (TRUE or FALSE) indicating whether
the first row of the file contains the names of variables. Set
this flag to TRUE if the first line contains the variable names.
For read.csv(), it defaults to TRUE.
sep: The separator is by default a comma. There’s no need to
set this flag.
fill: If the file has an unequal length of rows, you can set this
parameter to TRUE so that blank fields are implicitly added.
dec: The character used in the file for decimal points.
The only parameter you have to worry about is file, with a proper path. The
file separator is a comma by default. Set the header parameter appropriately,
depending on whether the first row of the file contains the names of variables.
Figure 3-2 shows a sample text file with each element of a row separated by a comma.
You can read this CSV file into R, as shown in the following example:
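The original screenshot is not reproduced; a small stand-in for the file in Figure 3-2 (the file name and values are hypothetical):

```r
csvfile <- tempfile(fileext = ".csv")
writeLines(c("ID,Age,Score", "1,25,80", "2,32,95"), csvfile)
mydata <- read.csv(csvfile, header = TRUE)
str(mydata)   # a data frame with 2 obs. of 3 integer variables
```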
In this CSV file, the first line contains headers, and the actual values start from the
second line. When you read the file by using read.csv(), keep the header option set to
TRUE (its default) so R understands that the first row contains the names of the variables
and reads the file into a proper data-frame format.
This example demonstrates how read.csv() automatically reads the CSV-formatted
file into a data frame. R also decides the type of the variable based on the records present
in each column. Seven variables are present in this file, and each parameter is an integer
type. To view the data-set table, you can simply use the View() command in R:
Using View() opens the data table in another window, as shown in Figure 3-3.
The read.table() command is similar to read.csv() but is used to read any text
file. The contents of the text file can be separated by a space, comma, semicolon, or colon,
and you do have to specify a separator. The command has other optional parameters. For
more details, type help(read.table). Figure 3-4 shows an example file containing values
separated by a tab.
In this example, the text file is in a tab-separated format. The first line contains the
names of the variables. Also, note that read.table() reads the text file as a data frame,
and the Value is automatically recognized as a factor.
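The code figure is missing; a small stand-in for a tab-separated file like Figure 3-4 (file name and values are hypothetical):

```r
txtfile <- tempfile(fileext = ".txt")
writeLines(c("Name\tValue", "A\t10", "B\t20"), txtfile)
mytab <- read.table(txtfile, header = TRUE, sep = "\t")
str(mytab)   # read.table() returns a data frame
```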
Once the package is installed successfully, import an Excel file into R by executing
the following set of commands:
> library(RODBC)
> myodbc<-odbcConnectExcel("c311Lot1.xls")
> mydataframe<-sqlFetch(myodbc,"LOT")
First, you establish an ODBC connection to the XLS file by specifying the file
name. Then you fetch the table. Here, c311Lot1.xls is an Excel file in XLS format, and LOT
is the name of the worksheet inside the Excel file. myodbc is the ODBC object that opens
the ODBC connection, and mydataframe is the resulting data frame. The entire process is
shown in Figure 3-5. A similar procedure can be used to import data from a Microsoft
Access database.
To read the Excel 2007 XLSX format, you can install the xlsx package by using
install.packages("xlsx") and use the read.xlsx() command as follows:
> library(xlsx)
> myxlsx<- "c311Lot.xlsx"
> myxlsxdata<-read.xlsx(myxlsx,1)
The first argument of read.xlsx() is the Excel file; 1 refers to the first worksheet
in the file, which is read into a data frame. Numerous packages are available for reading
Excel files, including gdata, xlsReadWrite, and XLConnect, but the xlsx package is the
most popular. Even so, it is often simpler to convert the Excel file into a CSV file and
use read.table() or read.csv() instead. The following code provides examples of
using the other packages to read the file:
3.2.1 if-else
The if-else structure is the most common control structure in any programming
language, including R. It allows you to test whether a condition is true or false. For
example, say you are analyzing a data set and have to plot a graph of x vs. y based on a
condition, such as whether the age of a person is greater than 65. In such situations, you
can use an if-else statement. The common structure of if-else is as follows:
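The structure figure is not reproduced; a skeleton (with a placeholder condition so it runs):

```r
condition <- (2 > 1)   # any logical test goes here
if (condition) {
  result <- "condition was TRUE"    # statements run when the test is TRUE
} else {
  result <- "condition was FALSE"   # statements run when the test is FALSE
}
```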
You can have multiple if-else statements. For more information, you can look at
the R help pages. Here is a demonstration of if-else in R:
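The demonstration figure is missing; a sketch using the age-over-65 condition mentioned above:

```r
age <- 70
if (age > 65) {
  category <- "Senior citizen"
} else {
  category <- "Not a senior citizen"
}
category
# [1] "Senior citizen"
```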
3.2.2 for loops
for loops are similar to those in other programming languages. During data analysis,
for loops are mostly used to access elements of an array or list. For example, if you are
accessing a specific element in an array and performing data manipulation, you can use
a for loop. Here is an example of how a for loop is used:
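The example figure is not shown; a minimal sketch that accesses each element by index:

```r
x <- c(10, 20, 30, 40)
total <- 0
for (i in 1:length(x)) {
  total <- total + x[i]   # access each element by its index
}
total
# [1] 100
```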
Sometimes, the seq_along() function is used in conjunction with the for loop.
The seq_along() function generates an integer sequence based on the length of the
object. The following example uses seq_along() to print every element of x:
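The original snippet is missing; a sketch (x is illustrative):

```r
x <- c("a", "b", "c", "d")
for (i in seq_along(x)) {   # seq_along(x) is 1 2 3 4
  print(x[i])
}
```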
3.2.3 while loops
The R while loop has a condition and a body. If the condition is true, control enters
the loop body. The loop continues executing as long as the condition remains true and
exits once the condition fails. Here is an example:
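The example figure is not reproduced; a minimal sketch:

```r
count <- 1
while (count <= 5) {
  print(count)          # prints 1 through 5
  count <- count + 1
}
# the loop exits once count reaches 6
```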
The repeat and next statements are not commonly used in statistical or data analysis,
but they do have their own applications. If you are interested in learning more, you can
look at the R help pages.
3.2.4 Looping Functions
Although for and while loops are useful programming tools, writing them out with
curly brackets and supporting code can sometimes be cumbersome, especially when
dealing with large data sets. Hence, R provides functions that implement loops in a
compact form to make data analysis simpler and more effective. R supports the following
functions, which we'll look at from a data analysis perspective:
apply(): Applies a function over the rows or columns of an
array or matrix and returns the results in an array or vector
lapply(): Loops over a list, applies a function to each
element, and returns the results as a list
sapply(): A user-friendly version of lapply() that returns
a vector, matrix, or array
tapply(): Applies a function over subsets of a vector
defined by a factor
These functions are used to manipulate and slice data from matrices, arrays, lists, or
data frames. These functions traverse an entire data set, either by row or by column, and
avoid loop constructs. For example, these functions can be called to do the following:
• Calculate mean, sum, or any other manipulation on a row or a
column
• Transform or perform subsetting
The apply() function is the simplest in form, and its code can be very short (often
a single line) while performing effective operations. The other, more complex forms
are lapply(), sapply(), vapply(), mapply(), rapply(), and tapply(). Which of these
functions to use depends on the structure of the data and the format of the output you
need for your analysis.
To understand how the looping functions work, you'll look at the cars data set,
which is included with R. This data set, recorded in the 1920s, contains the speed of cars
and the distances taken to stop. The data frame has two numeric variables, speed and
dist, with a total of 50 observations.
3.2.4.1 apply( )
Using the apply() function, you can perform operations on every row or column of a
matrix (or a data frame, which is coerced to a matrix) without having to write loops. The
following example shows how to use apply() to find the average speed and average
distance of the cars:
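The original output figure is missing; the call and its result (which match the lapply() output shown later in this section) are:

```r
avg <- apply(cars, 2, mean)   # MARGIN = 2 applies mean() to each column
avg
# speed  dist 
# 15.40 42.98
```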
The apply() function evaluates a function over the margins of an array and is often
used for operations on rows or columns:
> str(apply)
function (X, MARGIN, FUN, ...)
>
The following example calculates a quantile measure for each row and column:
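The worked figure is not reproduced; a sketch computing quartiles for each column of cars (the probs values are illustrative):

```r
q <- apply(cars, 2, quantile, probs = c(0.25, 0.50, 0.75))
q   # a matrix with one column of quantiles per variable
```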
3.2.4.2 lapply( )
The lapply() function outputs the results as a list. lapply() can be applied to a list, data
frame, or vector. The output is always a list that has the same number of elements as the
object that was passed to lapply():
> str(lapply)
function (X, FUN, ...)
>
>
> str(cars)
'data.frame': 50 obs. of 2 variables:
$ speed: num 4 4 7 7 8 9 10 10 10 11 ...
$ dist : num 2 10 4 22 16 10 18 26 34 17 ...
> lap<-lapply(cars,mean)
> lap
$speed
[1] 15.4
$dist
[1] 42.98
> str(lap)
List of 2
$ speed: num 15.4
$ dist : num 43
>
3.2.4.3 sapply( )
The main difference between sapply() and lapply() is the output result. The result for
sapply() can be a vector or a matrix, whereas the result for lapply() is a list. Depending
on the kind of data analysis you are doing and the result format you need, you can use the
appropriate functions. As we solve many analytics problems in later chapters, you will see
the use of different functions at different times. The following example demonstrates the
use of sapply() for the same cars data set:
> sap<-sapply(cars,mean)
> sap
speed dist
15.40 42.98
> str(sap)
Named num [1:2] 15.4 43
- attr(*, "names")= chr [1:2] "speed" "dist"
3.2.4.4 tapply( )
tapply() is used over subsets of a vector. The function tapply() is similar to other
apply() functions, except it is applied over a subset of a data set:
> str(tapply)
function (X, INDEX, FUN = NULL, ..., simplify = TRUE)
X: A vector
INDEX: A factor or a list of factors (or they are coerced into
factors)
FUN: A function to be applied
To demonstrate tapply(), we use the mtcars data set, which is included with R and
contains fuel consumption and design data for 32 automobiles, extracted from the 1974
Motor Trend US magazine. Data source: Henderson and Velleman, "Building Multiple
Regression Models Interactively," Biometrics 37 (1981): 391–411.
Figure 3-6 shows some of the data from the mtcars data set.
Figure 3-6. Some sample data from the mtcars data set
Let’s say that you need to find out the average gasoline consumption (mpg) for each
cylinder. Using the tapply() function, the task is executed in one line, as follows:
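The one-liner from the figure, with its result:

```r
avg_mpg <- tapply(mtcars$mpg, mtcars$cyl, mean)
avg_mpg
#        4        6        8 
# 26.66364 19.74286 15.10000
```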
Similarly, if you want to find out the average horsepower (hp) for automatic and
manual transmission, simply use tapply() as shown here:
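The figure is missing; a sketch of the call (in mtcars, am = 0 is automatic and am = 1 is manual):

```r
avg_hp <- tapply(mtcars$hp, mtcars$am, mean)
avg_hp   # mean horsepower for automatic (0) and manual (1) cars
```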
As you can see from this example, the family of apply() functions is powerful and
enables you to avoid traditional for and while loops. Depending on the type of data
analysis you want to achieve, you can use the respective functions.
3.2.4.5 cut( )
Sometimes in data analysis, you may have to break up continuous variables to put them
into different bins. You can do this easily by using the cut() function in R. Let’s take the
Orange data set; our task is to group trees based on age. This can be achieved by using
cut() as shown here:
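The figure is not reproduced; a sketch of cut() on the Orange data set (the exact interval labels depend on the data range):

```r
ageGroups <- cut(Orange$age, 4)   # 4 equal-width bins over the age range
table(ageGroups)                  # counts of observations per bin
```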
Next, you create four groups based on the age of the trees. The first parameter is the
age column of the data set, and the second parameter is the number of groups you want
to create. In this example, the Orange observations are grouped into four categories based
on age, and there are five trees between ages 117 and 483:
The intervals are automatically defined by cut() and may not be the ones you
anticipated. However, by using the seq() command to build the breaks, you can specify
the intervals:
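The figure is missing; a sketch with illustrative breaks (Orange ages run from 118 to 1582, so these cover the full range):

```r
ageGroups <- cut(Orange$age, breaks = seq(0, 1600, by = 400))
table(ageGroups)   # counts per user-specified interval
```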
3.2.4.6 split( )
Sometimes in data analysis, you may need to split the data set into groups. This can
be conveniently accomplished by using the split() function, which divides the
data set into groups defined by its argument. The unsplit() function reverses the
results of split().
The general syntax of split() is as follows; x is the vector, and f is the splitting
factor:
> str(split)
function (x, f, drop = FALSE, ...)
In the following example, the Orange data set is split based on age. There are seven
distinct age values, so the data set is split into seven groups:
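The figure is not shown; a sketch of the split:

```r
groups <- split(Orange$circumference, Orange$age)
length(groups)   # 7 groups, one per distinct age
```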
The simplest user-defined function takes no arguments. To make it more interesting,
you can add a function argument. In the body of this function, let's add a for loop to
print as many times as the user wants. The user determines the number of times to loop
by specifying the argument in the function. This is illustrated through the following two
examples:
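The two example figures are lost; a sketch consistent with the description (the no-argument function and its names are hypothetical; myFun() is named in the text):

```r
# A function with no arguments:
sayHello <- function() {
  print("Hello!")
}
sayHello()

# myFun() takes one argument; the for loop runs that many times:
myFun <- function(n) {
  for (i in 1:n) {
    print(paste("Loop iteration", i))
  }
}
myFun(3)   # prints three lines
```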
In this example, myFun() is a function that takes one argument. You can pass the
parameter when calling the function. In this case, as you can see from the output, the
function prints the number of times it has looped based on the number passed to the
function as an argument.
3.4 Summary
Data has to be read effectively into R before any analysis can begin. In this chapter, you
started with examples showing that various file formats from diverse sources can be read
into R (including database files, text files, XML files, and files from other statistical and
analytical tools).
You further explored how various data-file formats can be read into R by using
such simple commands as read.csv() and read.table(). You also looked at examples
of importing data from MS Excel files and from the Web.
You explored through detailed examples the use of control and looping structures
such as if-else, while, and for. You also learned about the compact looping functions
available in R, such as apply(), lapply(), sapply(), and tapply(). These functions can
greatly reduce code complexity and the potential mistakes that come with traditional
looping structures. Additionally, you looked at the cut() and split() functions.
User-defined functions are useful when you need to share code or reuse a function
again and again. You saw a simple and easy way to build user-defined functions in R.
Finally, you looked at packages. You saw how to use the library() function to
determine which packages are already installed in R and how to use the
install.packages() function to install additional packages.
CHAPTER 4
Introduction to descriptive
analytics
Imagine you are traveling and have just reached the bank of a muddy river, but there are
no bridges or boats or anybody to help you cross. Unfortunately, you do not know how
to swim. As you look around in this confused situation, with no help available, you
notice a sign board as shown in Figure 4-1.
The sign says, “The mean depth of the river is 4 feet.” Say this value of mean is
calculated by averaging the depth of the river at each square-foot area of the river. This
leads us to the following question: “What is average or mean?” Average or mean is the
quantity arrived at by summing up the depth at each square foot and dividing this sum by
the number of measurements (i.e., number of square feet measured).
Your height is 6 feet. Does Figure 4-1 provide enough information for you to attempt
to cross the river by walking? If you say yes, I definitely appreciate your guts. I would
not dare to cross, because I do not know whether there is any point where the depth
is more than my height. If there are points with depths of 7 feet, 8 feet, 10 feet, or
12 feet, I will not dare to cross, as I do not know where these points are, and at these
points I am likely to drown.
Figure 4-2. The sign indicating mean, maximum, and minimum depths
Suppose the sign board also says, "Maximum depth is 12ft and minimum depth is
1ft" (see Figure 4-2). I am sure this additional information will scare you, since you now
know there are points where you can drown. The maximum depth is the largest of all the
values measured, occurring at one or more points. Even with this information, you
cannot be sure whether the depth of 12 feet occurs at one point or at multiple points.
The minimum (the lowest of the values observed) sounds encouraging, but again you do
not know whether it occurs at one point or at multiple points.
Figure 4-3. The sign indicating mean, maximum, minimum, and median depths
Suppose, in addition to the above information, the sign board (shown in Figure 4-3)
also says, "Median of the depth is 4.5ft." The median is the middle value when all the
measured depths are arranged in ascending order. This means 50% of the depths
measured are less than this value and 50% are above it. You may still not dare to cross
the river, as 50% of the values are above 4.5 feet and the maximum depth is 12 feet.
Suppose, in addition to the above information, the sign board (shown in Figure 4-4)
also says, "Quartile 3 is 4.75ft." Quartile 3 is the point below which 75% of the measured
values fall when they are arranged in ascending order. This also means that 25% of the
measured values are deeper than this. You may still not be comfortable crossing the
river, as you know the maximum depth is 12 feet and 25% of the points are deeper than
4.75 feet.
Suppose, in addition to the above information, the sign board (shown in Figure 4-5)
also says, "Percentile 90 is 4.9ft and percentile 95 is 5ft," and this is the maximum
information available. You now know that only 5% of the measured points have a depth
of more than 5ft, while your height is 6ft. If you have no means of crossing other than
walking or wading through, you may now be willing to take the risk. You may hope that
the 98th or 99th percentile is still only around 5.5ft, and you may now believe that the
deepest points are rare and that, by having faith in God, you can cross the river safely.
In spite of the above cautious calculations, you may still drown if you reach rare points
with a depth of more than 6 feet (like the maximum-depth point). But with the foregoing
information, you know that your risk is substantially less than it was initially, when you
knew only that the mean depth of the river is 4 feet.
This is the point I wanted to drive home through the river analogy: a single measured
parameter may not describe a situation clearly, and you may require more parameters to
elaborate on it. Each additional parameter calculated may add the clarity required to
make decisions or to understand the phenomenon. Again, a note of caution: there are
many parameters other than the ones discussed here that are of interest in making
decisions or understanding a situation or context.
Later in this chapter, we will discuss how to calculate all these parameters using R.
Before that, we need to understand the meaning of two important terms: "population"
and "sample."
4.1 Descriptive analytics
In simple terms, descriptive analytics is the analysis of data to describe the data in
various ways, enabling users to understand the situation, context, or data clearly.
Statistical parameters such as mean (or average), median, quartiles, maximum,
minimum, range, variance, and standard deviation describe the data clearly. As seen
in the example discussed earlier, one aspect of the data may not provide all the clarity
necessary, but several related parameters together provide better clarity about the data,
situation, or context.
However, when we have to analyze the data, it is difficult to analyze the entire
population especially when the data size is huge. This is because
• It takes substantial time to process the data, and the time taken
to analyze it may be prohibitively high relative to the
requirements of the application. For example, if the entire
transaction data for all purchases by all users had to be
analyzed before recommending a particular product to a user,
the processing time could be so long that you would miss the
opportunity to make the suggestion while the user is still in
session on the Internet.
• It takes substantial processing power (i.e., memory or CPU power)
to hold and process the data; not everyone can deploy such huge
processing capacity.
In simple terms, "sample" means a section or subset of the population selected for
analysis. Examples of samples are the following: 100,000 randomly selected employees
from the entire IT industry; 1,000 randomly selected employees of a company; 1,000,000
randomly selected transactions of an application; 10,000,000 randomly selected Internet
users; or 5,000 randomly selected users from each ecommerce site. A sample can also be
selected using stratification (i.e., based on some rules of interest): for example, all the
employees of the IT industry whose income is above $100,000; all the employees of a
company whose salary is above $50,000; the top 100,000 transactions by amount per
transaction (e.g., a minimum of $1,000 per transaction); or all Internet users who spend
more than two hours per day online.
The benefits of sampling are:
• The data is smaller and focused on the application. Hence, it can
be processed quickly, and the information from the analysis can
be applied quickly.
• The data can be processed easily, with less demand on
computing power.
However, of late, we have greater computing power at hand because of cloud
technologies and the ability to cluster computers. Though such computing power allows
us, in some cases, to use the entire population for analysis, sampling definitely helps
carry out the analysis more easily and quickly in many cases. Sampling has a weakness,
though: if the samples are not selected properly, the analysis results may be wrong. For
example, taking only this month's data to analyze the entire year will not reveal how
things have changed over the months.
4.3.1 Mean
“Mean” is also known as “average” in general terms. This is a very useful parameter. If we
have to summarize any data set quickly, then the common measure used is “mean.” Some
examples of the usage of mean are the following:
• For a business, mean profitability over the last five years may be
one good way to represent the organization’s profitability in order
to judge its success.
• For an organization, mean employee retention over the last five
years may be a good way to represent employee retention in order
to judge the success of its HR (human resources) policies.
• For a country, mean GDP (Gross Domestic Product) over the
last five years may be a good way to represent the health of the
economy.
• For a business, mean growth in sales or revenue over the
last five years may be a good way to represent growth.
• For a business, mean reduction in cost of operations over a period
of the last five years may be a good way to understand operational
efficiency improvement.
Normally, a mean or average figure gives a sense of what the figure is likely to be for
the next year based on the performance for the last number of years. However, there are
limitations of using or relying only on this parameter.
Let us look at a few examples to understand more about using the mean or average:
• Good Luck Co. Pvt. Ltd earned a profit of $1,000,000; $750,000;
$600,000; $500,000; and $500,000 over the last five years. Mean
or average profit over the last five years is calculated as sum(all
the profits over the last 5 years)/No. of years i.e. ($1,000,000 +
$750,000 + $600,000 + $500,000 + $500,000) / 5 = $670,000. This
calculation is depicted in Figures 4-6 and 4-7.
The mean for the above example is calculated using R as given in Figure 4-6.
An alternative, simple way of capturing the data and calculating the mean in R is
shown in Figure 4-7.
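The figures are not reproduced here; a sketch of the calculation they show:

```r
profit <- c(1000000, 750000, 600000, 500000, 500000)
mean(profit)
# [1] 670000
```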
Similarly, for the other examples we can work out the mean or average value if we
know the individual figures for the years.
The problem with the mean or average as a single parameter is as follows:
• Any extremely high or low figure in one or more of the years can
skew the mean, so that the mean may not appropriately
represent the likely figure for the next year. For example, consider
that there was a very high profit in one of the years because a
volatile international economy led to severe devaluation of the
local currency. Profits for five years of a company were,
respectively, €6,000,000, €4,000,000, €4,500,000, €4,750,000, and
€4,250,000. The first-year profit of €6,000,000 was on account of
the steep devaluation of the euro in the international market. If
the effective value of the first-year profit without the devaluation
is €4,000,000, then the actual mean profit is €400,000 higher than
the effective mean profit on account of this one-off increase, as
shown in Figure 4-8.
Figure 4-8. Actual mean profit and effective mean profit example
• Using the mean or average alone will not show the volatility in
the figures over the years, nor does it depict whether the trend
is increasing or decreasing. Let us take an example. Suppose the
revenue of a company over the last five years is, respectively,
$22,000,000, $15,000,000, $32,000,000, $18,000,000, and
$10,000,000. The average revenue of the last five years is
$19,400,000. If you look at the figures, the revenue is quite
volatile: compared to the first year, it decreased significantly
in the second year, jumped up by a huge amount in the third
year, then decreased significantly in the fourth year, and
continued to decrease significantly in the fifth year. The average
or mean figure depicts neither this volatility in revenue nor its
downward trend. Figure 4-9 shows this downside of the mean as
a measure.
4.3.2 Median
The median is the middle value when the values are ordered in either ascending or
descending order. In many circumstances, the median may be more representative than
the mean. It divides the data set into two equal partitions at the middle: 50% of the
values lie below the median and 50% lie above it.
Examples are as follows:
• Age of workers in an organization to know the vitality of the
organization.
• Productivity of the employees in an organization.
• Salaries of the employees in an organization pertaining to a
particular skill set.
Let us consider the ages of 20 workers in an organization as 18, 20, 50, 55, 56, 57, 58,
47, 36, 57, 56, 55, 54, 37, 58, 49, 51, 54, 22, and 57. From a simple examination of these
figures, you can make out that the organization has more older workers than younger
ones, and there may be an issue of knowledge drain in a few years if the organizational
retirement age is 60. Let us also compare the mean and the median for this data set. The
median shows that 50% of the workers are above 54 years of age and are likely to
retire relatively soon (i.e., if we take 60 years as the retirement age, they have only 6 years
to retirement), which may depict the possibility of significant knowledge drain. However,
the average figure of 47.35 suggests a better situation (i.e., about 12.65 years to retirement).
But it is not so if we look at the raw data: 13 of the 20 employees are already at the age
of 50 or older, which is of concern to the organization. Figure 4-10 shows a worked-out
example of median using R.
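The mean-versus-median comparison described above can be reproduced with the worker-age figures given earlier:

```r
# Ages of the 20 workers from the example above
WorkAge <- c(18, 20, 50, 55, 56, 57, 58, 47, 36, 57,
             56, 55, 54, 37, 58, 49, 51, 54, 22, 57)

median(WorkAge)  # 54    -- half the workers are 54 or older
mean(WorkAge)    # 47.35 -- paints a rosier picture
```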
However, if the 10th value had been 54 and the 11th value 55, then the median
would have been (54+55)/2, i.e., 54.5.
Let us take another example, of the productivity of a company. Let the productivity per
day in terms of items produced per worker be 20, 50, 55, 60, 21, 22, 65, 55, 23, 21, 20, 35,
56, 59, 22, 23, 25, 30, 35, 41, 22, 24, 25, 24, and 25, respectively. The median productivity is
25 items per day, which means that 50% of the workers in the organization
produce fewer than 25 items per day and 50% produce more than
25 items per day. The mean productivity is 34.32 items per day because some of the workers
have significantly higher productivity than the median worker, as is evident from
figures such as 65 items per day, 60 items per day, 59 items per day, 56 items per day, and
55 items per day. The analysis from R in
Figure 4-11 clearly shows the difference between mean and median.
If you have to work out the median through hand calculation, you have to arrange the
data points in ascending or descending order and then select the value of the middle
term if there is an odd number of values. If there is an even number of values, you
sum the middle two terms and then divide the sum by 2, as mentioned in the
discussion above.
As you can see from the above discussion, looking at both the mean and the median,
rather than either alone, gives a better idea of the data.
4.3.3 Mode
The mode is the most frequently occurring value in the data set. For example, in our data set
related to the age of workers, 57 occurs the maximum number of times (i.e., three times). Hence,
57 is the mode of the workers’ age data set. The mode shows the pattern of repetition in the data.
There is no built-in function in R to compute the mode. Hence, we have written a
function and computed the mode as shown in Figure 4-12, using the same
data set as earlier (i.e., WorkAge).
In the above function, unique() creates the set of unique numbers in the data set. In
the case of the WorkAge example, the unique numbers are 18, 20, 50, 55, 56, 57, 58, 47, 36,
54, 37, 49, 51, and 22. The match() function maps each value in the data
set to its position in this set of unique numbers. The function tabulate() then counts the
number of times each unique number occurs in the data set, and which.max() returns the
position of the most frequently occurring number in the unique numbers set.
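A minimal version of such a function, built from exactly the calls described above, might look like this (the function name Mode is our own choice for illustration; the book's figure may name it differently):

```r
Mode <- function(x) {
  ux <- unique(x)                   # distinct values in the data
  counts <- tabulate(match(x, ux))  # how often each distinct value occurs
  ux[which.max(counts)]             # value with the highest count
}

WorkAge <- c(18, 20, 50, 55, 56, 57, 58, 47, 36, 57,
             56, 55, 54, 37, 58, 49, 51, 54, 22, 57)
Mode(WorkAge)  # 57
```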
4.3.4 Range
Range is a simple but important statistical parameter. It is the distance between the
end points of the data set arranged in ascending or descending order (i.e., between the
maximum value in the data set and the minimum value in the data set). It provides a
measure of the overall dispersion of the data set.
The R command range(dataset) provides the minimum and maximum values (see
Figure 4-13) on the same data set used earlier (i.e., WorkAge).
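For example, assuming the WorkAge vector from the earlier examples is loaded:

```r
range(WorkAge)        # 18 58 -- minimum and maximum
diff(range(WorkAge))  # 40   -- the distance between the two end points
```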
4.3.5 Quantiles
“Quantiles” divide the data set, arranged in ascending or descending order, into equal
partitions; “percentiles” are the quantiles that divide it into one hundred such parts. The
“median” is the data point dividing the data arranged in ascending or descending order
into two sets with an equal number of elements. Hence, it is also known as the 50th
percentile. “Quartiles,” on the other hand, divide the data set arranged in ascending
order into four sets with an equal number of data elements. The first quartile (also known
as Q1, or the 25th percentile) has 25% of the data elements below it and 75% above it. The
second quartile (also known as Q2, the 50th percentile, or the median) has 50% of the
data elements below it and 50% above it. The third quartile (also known as Q3, or the
75th percentile) has 75% of the data elements below it and 25% above it. “Quantile” is the
generic word, whereas “quartile” refers to a specific quantile; for example, Q1 is the 25th
percentile, and the fourth quartile is nothing but the 100th percentile.
Quantiles, quartiles, and percentiles thus provide information about the data set
beyond what the mean alone can provide.
Let us take the same two data sets as given in the section “Median” and work out
quartiles. Figures 4-14A and 4-14B show the working of the quartiles.
Similarly, you can divide the data set into 20 sets of equal number of data elements
by using the quantile function with probs = seq(0, 1, 0.05), as shown in Figure 4-15.
Figure 4-15. Partitioning the data into a set of 20 sets of equal number of data elements
As you can observe from Figure 4-15, the minimum value of the data set appears
at the 0th percentile and the maximum value at the 100th percentile; typically, one data
element falls between each successive pair of 5-percentile cut points.
As evident from this discussion, quartiles and various quantiles provide additional
information about the data distribution in addition to that information provided by mean
or median (even though median is nothing but second quartile).
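Assuming the WorkAge vector from the earlier examples, the quartiles and the finer partitioning described above can be obtained as follows:

```r
quantile(WorkAge)                           # 0%, 25%, 50%, 75%, 100% (quartiles)
quantile(WorkAge, probs = seq(0, 1, 0.05))  # twenty equal partitions
```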
4.3.6 Standard deviation
The measures mean and median depict the center of the data set or distribution. On the
other hand, standard deviation specifies the spread of the data set or data values.
The standard deviation is manually calculated as follows:
1. First, the mean of the data set or distribution is calculated.
2. Then the distance of each value from the mean is calculated
(this is known as the deviation).
3. Then each deviation is squared.
4. Then the squared deviations are summed up.
5. Then the sum of the squared deviations is divided by the
number of values minus 1, to adjust for the degrees of freedom.
6. Finally, the square root of the result from step 5 is taken; this is
the standard deviation.
The squaring in step 3 is required to capture the real spread of the data, as
otherwise the negative and positive deviations in the data set would compensate for each
other, or cancel out, when summed around the mean.
Let us take the age of the workers example shown in Figure 4-16 to calculate the
standard deviation.
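The manual steps above can be sketched in R and checked against the built-in sd() function (assuming the WorkAge vector from the earlier examples is loaded):

```r
m   <- mean(WorkAge)                # step 1: the mean
dev <- WorkAge - m                  # step 2: deviation of each value
sq  <- dev^2                        # step 3: squared deviations
ssq <- sum(sq)                      # step 4: sum of squared deviations
v   <- ssq / (length(WorkAge) - 1)  # step 5: divide by n - 1
sqrt(v)                             # step 6: square root gives the standard deviation

sd(WorkAge)                         # the built-in function gives the same value
```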
Figure 4-16. Manual calculation of standard deviation using WorkAge data set
Normally, as per the rules of the normal curve (a data set consisting of a large
number of items is generally said to follow a normal distribution or normal curve):
• +/- 1 standard deviation from the mean covers about 68% of the data
• +/- 2 standard deviations covers about 95% of the data
• +/- 3 standard deviations covers about 99.7% of the data
Figure 4-17B. Bell curve showing data coverage within various standard deviations
As you can see from Figure 4-17B, in the case of normally distributed data (where
the number of data points is typically greater than 30; the more, the better), about 68%
of the data falls within +/- one standard deviation from the center of the
distribution (i.e., the mean). Similarly, about 95% (or around 95.2%, as shown in Figure 4-17B)
of the data values fall within +/- two standard deviations from the center, and about 99.7% of the
data values fall within +/- three standard deviations from the center. Such a curve
is typically known as a “bell curve” or “normal distribution curve.” For example, the profit or
loss of all the companies in a country may be approximately normally distributed around the
center value (i.e., the mean of the profit or loss).
The higher the standard deviation, the higher the spread from the mean; it
indicates that the data points vary significantly from each other and shows the
heterogeneity of the data. The lower the standard deviation, the lower the spread from
the mean; it indicates that the data points vary less from each other and shows the
homogeneity of the data.
Standard deviation, taken together with other measures such as the mean, median,
quartiles, and percentiles, gives us substantial information about the data and explains it
more effectively.
4.3.7 Variance
Variance is another way of depicting the spread. In simple terms, it is the square of the
standard deviation, as shown in Figure 4-18; it represents the spread of the squared
deviations from the mean value. We continue to use the WorkAge data set used earlier
in this chapter.
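In R (again assuming the WorkAge vector is loaded):

```r
var(WorkAge)   # variance
sd(WorkAge)^2  # the square of the standard deviation -- the same value
```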
4.3.8 “Summary” command in R
The command summary(dataset) provides the following information on the data
set, covering most of the statistical parameters discussed: the minimum value, first
quartile, median (i.e., the second quartile), mean, third quartile, and maximum value.
This is an easy way of getting the summary information
through a single command (see Figure 4-19, which has a screenshot from R).
Figure 4-19. Finding out major statistical parameters in R using summary() command
After using the summary(dataset) command, you can, if required, use additional
commands, like sd(dataset) and var(dataset), to obtain additional parameters of
interest related to the data.
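For example, on the worker-age data used throughout this chapter:

```r
summary(WorkAge)  # Min., 1st Qu., Median, Mean, 3rd Qu., Max.
sd(WorkAge)       # standard deviation, not included in summary()
var(WorkAge)      # variance, not included in summary()
```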
4.4.1 Plots in R
Plotting in R is very simple. The command plot(dataset) produces a simple
plot of the data set. Taking the earlier example of workers’ ages, the simple command
plot(WorkAge) in R creates a good graphical plot (see Figures 4-20 and 4-21).
As the WorkAge data set carries no information about each employee other than
age, the y-axis is named after the data set it represents, and the x-axis simply shows the
index of each observation in the data set.
With a slight variation in the command (as shown in Figure 4-22), we will create a
better graphical representation of the age of the workers in the WorkAge data set.
Figure 4-22. Getting a better plot using additional parameters in plot() command
In the above command, col = "red" defines the color of the data points;
xlab = " " specifies the label for the x-axis; ylab = " " specifies the label for the
y-axis; and main = " " specifies the title of the entire graph (with the labels or titles
provided within the quotes). Subtitles can be included, if required, using sub = " "
within the plot command. Plotting graphs
in R is as simple as this (see Figure 4-23).
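A sketch of the command with these parameters filled in (the label and title strings here are illustrative; the book's figures may use slightly different wording):

```r
WorkAge <- c(18, 20, 50, 55, 56, 57, 58, 47, 36, 57,
             56, 55, 54, 37, 58, 49, 51, 54, 22, 57)

plot(WorkAge,
     col  = "red",                           # color of the data points
     xlab = "Employee Sequence",             # x-axis label
     ylab = "Age of the Worker in Years",    # y-axis label
     main = "Age of the Workers")            # title of the graph
```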
A simple addition of type = "l" will produce the variant of the graph shown in
Figures 4-24A and 4-24B.
Another small variant, type = "h", will produce the graph in a histogram-like
format: Figure 4-25 is a bar graph in which the bars are drawn as vertical
lines.
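The two variants mentioned above differ only in the type argument (assuming WorkAge is loaded):

```r
plot(WorkAge, type = "l")  # line graph (Figures 4-24A and 4-24B)
plot(WorkAge, type = "h")  # vertical lines, histogram-like (Figure 4-25)
```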
(Figure 4-25 plots the Age of the Worker in Years against the Employee Sequence.)
4.4.2 Histogram
A histogram presents the data grouped over ranges of values. This is very useful
for visualizing and understanding more about the data. The simple command in R,
hist(dataset), produces a basic histogram, as shown in Figures 4-26A and 4-26B.
(Figure 4-26B: histogram of WorkAge, with WorkAge on the x-axis and Frequency on the y-axis.)
In the graph shown in Figure 4-26B, the x-axis shows the age ranges 15+ to 20,
20+ to 25, 25+ to 30, and so on, up to 55+ to 60. As you can see, we have not specified the
ranges over which the values (in this case, age groups) are to be depicted. However, R
has scanned through the data and decided the age groups itself, taking into
consideration the minimum age and the maximum age.
As mentioned earlier, you can use main = " ", xlab = " ", ylab = " ", and
sub = " " to specify the main title, label for the x-axis, label for the y-axis, and
subtitle, respectively.
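For example (the title and label strings are illustrative):

```r
hist(WorkAge)                       # bin ranges chosen automatically by R

hist(WorkAge,
     main = "Histogram of WorkAge", # main title
     xlab = "WorkAge",              # x-axis label
     ylab = "Frequency")            # y-axis label
```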
4.4.3 Bar plot
A bar plot is an easy way to depict the distribution of the data. In Figure 4-27A, we have
provided the bar chart of the WorkAge data set. Each worker is laid out along the x-axis,
and the height of the bar denotes the worker’s age, so the differences in bar length show
the differences between the ages of the workers. The usage of the bar plot and the
corresponding bar plot generated are shown in Figures 4-27A and 4-27B.
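A sketch of the command (the label strings are illustrative):

```r
barplot(WorkAge,
        xlab = "Worker Index",                  # one bar per worker
        ylab = "Age of the Worker in Years")    # bar height = age
```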
4.4.4 Boxplots
The boxplot is a popular way of showing the distribution of the data. Figure 4-28B
shows a boxplot created in R using the WorkAge data set. The command used is
boxplot(dataset), with labels for the axis and a title, as shown in Figure 4-28A.
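A sketch of such a command (the title and label strings are illustrative):

```r
boxplot(WorkAge,
        main = "Boxplot of Worker Age",  # title of the plot
        ylab = "Age of the Worker")      # y-axis label
```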
(Figure 4-28B: boxplot titled "Boxplot of Worker Age," with the y-axis labeled Age of the Worker.)
The data set depicted in Figure 4-29 is a data frame. A data frame is nothing but a
table structure in R, in which each column represents the values of a variable and each row
represents the data of a case or an instance. In this data frame, we have data related to
the name, age, and salary of ten employees. The data of each employee is depicted through
a row, and the features or aspects of the employee are captured in the correspondingly
labeled columns.
As you can see in the figure, the command summary(dataset) can be used here
also to obtain the summary information pertaining to each feature (i.e., the data in each
column).
You can now compute any additional information required (as shown in Figure 4-30).
As seen above, any column of the data set can be accessed using the data set name
followed by $column_name.
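As the actual EmpData values appear only in the book's figures, here is a small hypothetical data frame of the same shape to illustrate the commands (the column names follow the figures; the values are invented):

```r
EmpData <- data.frame(
  EmpName = c("A", "B", "C"),        # hypothetical names
  EmpAge  = c(25, 40, 55),           # hypothetical ages
  EmpSal  = c(25000, 40000, 55000)   # hypothetical salaries
)

summary(EmpData)      # summary of every column (feature)
mean(EmpData$EmpSal)  # a single column accessed with $column_name
```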
4.5.1 Scatter plot
Scatter plots are one of the important kinds of plots in the analysis of data. These plots
depict the relationship between two variables. Scatter plots are normally used to show
“cause and effect” relationships, but a relationship seen in a scatter plot need not
always be a cause-and-effect relationship. Figure 4-31A shows how to create a scatter plot
in R, and Figure 4-31B shows the actual scatter plot generated. The underlying concept of
correlation is explained in detail in Chapter 8.
Figure 4-31B. Scatter plot created in R (using the method specified in Figure 4-31A)
As you can see from this example, there is a direct relationship between employee
age and employee salary: the salary of the employees grows in direct proportion to
their age. This may not be true in a real scenario. Figure 4-31A shows the salary
of the employee increasing in proportion to his or her age, and linearly at that. Such a
relationship is known as a “linear relationship.” Please note that type = "b" along with the
plot(dataset) command creates a graph with both points and lines.
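The scatter plot command has this general shape (using the hypothetical EmpData sketched earlier; the book's actual values differ):

```r
plot(EmpData$EmpAge, EmpData$EmpSal,
     type = "b")   # "b" plots both the points and connecting lines
```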
Let us now create another data frame, named EmpData1, with one additional
feature (also known as a column or field) and with different data in it. In Figure 4-32 you can
see the data and the summary of the data in this data frame. As you can see in Figure 4-32,
we have added one more feature, namely EmpPerGrade, and also changed the values of
salary from the earlier data frame, EmpData.
Figure 4-33. Scatter plot from R showing the changed relationship between two features of
data frame EmpData1
Now, as you can see from Figure 4-33, the relationship between employee age and
the employee salary has changed: as age grows, the increase
in employee salary is not proportional but tapers off. This is normally known as a
“quadratic relationship.”
Figure 4-34. Scatter plot from R showing the changed relationship between two features of
data frame EmpData1
In Figure 4-34, you can see the relationship plotted between employee age and
employee performance grade. Ignore the first data point, as it is for a newly joined
employee who has not yet been graded; hence, his performance grade is recorded
as 0. Otherwise, as you can observe, as age progresses (as per the data above), the
performance has come down. In this case there is an inverse relationship between
employee age and employee performance (i.e., as age progresses, performance
degrades). Again, this is not real data and is given only for illustration.
4.6 Probability
Concepts of probability and related distributions are as important to business analytics as
to the field of pure statistics. Some of the important techniques used in business analytics,
such as Bayesian methods and decision trees, are based on the concepts of probability.
As you are aware, probability in simple terms is the chance of an event happening.
In some cases, we may have some prior information related to the event; in other cases,
the event may be random; that is, we may not have prior knowledge of the outcome. A
popular way to describe probability is with the example of tossing a coin or rolling
a die. A coin has two sides, and when it is tossed, the probability of either the head or
the tail coming up is 1/2, as any toss can produce either a head or a tail. You can
validate this by tossing the coin many times and observing that the
proportion of heads (or of tails) is around 50% (i.e., 1/2). Similarly,
the probability of any one of the six numbers being rolled with a die is 1/6, which can
again be validated by rolling the die many times.
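These empirical checks can be simulated in R (the proportions vary slightly from run to run; a fixed seed is used here only for reproducibility):

```r
set.seed(1)  # for reproducibility

tosses <- sample(c("head", "tail"), 10000, replace = TRUE)
mean(tosses == "head")   # close to 0.5

rolls <- sample(1:6, 10000, replace = TRUE)
mean(rolls == 6)         # close to 1/6
```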
If an event cannot happen, its probability is 0; if an event is sure to happen, its
probability is 1. The probability of an event thus always lies between 0 and 1 and depends
upon the chance of its happening or the uncertainty associated with it.
Any two or more events can happen independently of each other. Similarly, any
two or more events can happen exclusively of each other.
Example 1: Can you travel at the same time to two destinations in opposite
directions? If you travel toward the west, you cannot travel toward the east at the
same time.
Example 2: If we are making a profit in one of the client accounts, we cannot at the
same time be making a loss in the same account.
Examples 1 and 2 are types of events that exclude the happening of a particular event
when the other event happens; they are known as mutually exclusive events.
Example 3: A person “tossing a coin” and “raining” can happen at the same time, but
neither impacts the outcome of the other.
Example 4: A company may make profit and at the same time have legal issues. One
event (of making profit) does not have an impact on the other event (of having legal issues).
Examples 3 and 4 are types of events that do not impact the outcome of each other;
they are known as mutually independent events. These are also the examples of mutually
non-exclusive events as both outcomes can happen at the same time.
4.6.4 Probability distributions
Random variables are important in analysis. Probability distributions depict the distribution
of the values of a random variable. Some important probability distributions are:
• Normal distribution
• Binomial distribution
• Poisson distribution
• Uniform distribution
• Chi-squared distribution
• Exponential distribution
There are many more types of distributions, including the F-distribution, the
hypergeometric distribution, joint and marginal probability distributions, and
conditional distributions. In this chapter we will discuss only the normal, binomial, and
Poisson distributions; we will discuss these and other relevant distributions further in
later chapters.
4.6.4.1 Normal distribution
A huge amount of data is considered to be normally distributed as the distribution is
normally centered around the mean. Normal distribution is observed in real life in many
situations. On account of the bell shape of the distribution, the normal distribution is
also called “bell curve.” The properties of normal distribution typically having 68% of
the values within + / - 1 standard deviation, 95% of the values within + / - 2 standard
deviation, and 99.7% of the values within + / - 3 standard deviation are the ones heavily
used in most of the analytical techniques and so are also the properties of standard
normal curve. The standard normal curve has a mean of 0 and a standard deviation of
1. Z-score, used to normalize the values of the features in a regression, is based on the
concept of standard normal distribution. Normal distribution can be used in case of other
distributions as well as the distribution of the means of random samples is typically a
normal distribution in case of large sample size.
Please note that we are interested in the upper tail as we want to know the
percentage of employees who have received a grade of 4 or 5. The answer here is 25.25%.
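The grade distribution itself appears only in Figure 4-35 (not reproduced here), but the general shape of an upper-tail calculation in R is as follows; the cut-off, mean, and standard deviation below are hypothetical placeholders, not the book's actual values:

```r
# P(grade >= cut-off) for a hypothetical normal distribution of grades
pnorm(3.5, mean = 3, sd = 0.75, lower.tail = FALSE)  # upper-tail probability
```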
4.6.4.2 Binomial distribution
The binomial distribution normally applies where success or failure is measured. In a cricket
match, tossing a coin is an important event at the beginning of the match to decide which
side bats (or fields) first. Tossing a coin and calling “heads” wins you the toss if heads
is the outcome; otherwise, if tails is the outcome, you lose the toss.
As the sales head of an organization, you have submitted responses to fifteen tenders.
There are five contenders in each tender. You may be successful in some, in all, or in
none. As there are five contenders in each tender and each tender can be
won by only one company, the probability of winning a tender is 0.2 (i.e., 1/5). Suppose
you want to win more than four of the fifteen tenders. You can find the probability of
winning four or fewer tenders, employing the binomial distribution, using R (see Figure 4-36).
Please note that the pbinom() function uses the cumulative probability distribution
function for the binomial distribution, giving the chance of winning four or fewer
tenders. As you can see, the probability of winning four or fewer tenders is
83.58%. The probability of winning more than four tenders is therefore (100 – 83.58) = 16.42%.
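The calculation described above amounts to:

```r
pbinom(4, size = 15, prob = 0.2)      # P(win <= 4 tenders), about 0.8358
1 - pbinom(4, size = 15, prob = 0.2)  # P(win > 4 tenders),  about 0.1642
```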
4.6.4.3 Poisson distribution
The Poisson distribution represents the count of independent events happening in a fixed
time interval. The arrival of calls at a call center, the arrival of customers in a banking
hall, and the arrival of passengers at an airport or bus terminus all follow the Poisson
distribution.
Let us take an example of the number of customers arriving at a particular bank’s
specific branch office. Suppose an average of 20 customers are arriving per hour. We can
find out the probability of 26 or more customers arriving at the bank’s branch per hour
using R and Poisson distribution (see Figure 4-37).
Please note that we have used lower.tail = FALSE, as we are interested in the upper
tail: we want the probability of 26 or more customers arriving at the bank’s
branch per hour. The answer here is 11.22%.
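The same calculation in R:

```r
# P(26 or more customers in an hour) when the hourly average is 20
ppois(25, lambda = 20, lower.tail = FALSE)  # about 0.1122
```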
4.7 Chapter summary
• You have looked at how relying on only one descriptive
statistic can be dangerous; to get a holistic view, you may
require multiple statistical measures to understand the
context or situation.
• You also learned about the various statistical parameters of
interest in descriptive analytics (mean, median, quantiles,
quartiles, percentiles, standard deviation, variance, and mode),
how to compute them using R, and how most of these parameters
can be obtained through a single command like
summary(dataset).
• You explored plotting the data in graphical representation using
R. You also understood how graphical representation can provide
more information to the users than can descriptive statistics.
You understood how you can create histograms, bar charts, and
boxplots using R.
• You learned about one of the important data structures of R
(i.e., data frames) and how to get summary data from the data
contained in these data frames.
• You explored how scatter plots can show the relationship between
various features of the data frame and hence enable us to better
understand these relationships graphically and easily.
CHAPTER 5
This chapter covers data exploration, validation, and cleaning required for data analysis.
You’ll learn the purpose of data cleaning, why you need data preparation, how to go about
handling missing values, and some of the data-cleaning techniques used in the industry.
Chapter 5 ■ Business Analytics Process and Data Exploration
(Figure: the business analytics process cycle, comprising Data Pre-processing, Data Exploration & Data Visualization, Modeling Techniques & Algorithm, Model Evaluation, Management Reporting and Review, and Deployment.)
Most organizations have data spread across various databases. Pulling data from
multiple sources is a required part of solving business analytics tasks. Sometimes, data
may be stored in databases for different purposes than the objective you are trying to
solve. Thus, the data has to be prepared to ensure it addresses the business problem prior
to any analytics process. This process is sometimes referred to as data munging or data
wrangling, which is covered in the next section.
5.3.1 Sampling
Many times, unless you have a big data infrastructure, only a sample of the population
is used to build analytical models. A sample is “a smaller collection of units from a
population used to determine truths about that population” (Field, 2005). The sample
should be representative of the population. Choosing a sampling technique depends on
the type of business problem.
For example, you might want to study the annual gross domestic product (GDP)
per capita for several countries over a period of time and the periodic behavior of such
series in connection with business cycles. Monthly housing sales over a period of 6–10
years show cyclic behavior, but over 6–12 months the sales data may show seasonal
behavior. Stock market data over a period of 10–15 years may show a different trend
than over a 100-day period. Similarly, forecasting sales based on previous data over a
period of time, or analyzing Twitter sentiments and trends over a period of time, involves
cyclic data. If the fluctuations are not of a fixed period, they are cyclic; if the changes
occur in a specific period of the calendar, the pattern is seasonal. Time-series data is data
obtained through repeated measurements over a particular time period.
For time-series data, the sample should contain the time period (date, time, or
both) and only a sample of the measurement records for that particular day or time instead
of the complete data collected. For example, consider the Dow Jones volume traded over 18
months. The data is collected every millisecond, so the volume of this data for even a single
day is huge; over the full period, this data can run to terabytes.
The other type of data is not time dependent. It can be continuous or discrete, but
time has no significance in such data sets. For example, you might be looking at the income
or job skills of individuals in a company, the number of credit transactions in a retail store,
or age and gender information. There is no time relationship between any two data records.
Unless you have big data infrastructure, for any analysis you can just take a sample
of records. Use a randomization technique and take steps to ensure that all the members
of a population have an equal chance of being selected. This method is called probability
sampling. There are several variations on this type of sampling:
Random sampling: A sample is picked randomly, and every
member has an equal opportunity to be selected.
Stratified sampling: The population is divided into groups,
and data is selected randomly from a group, or strata.
Systematic sampling: You select members systematically—say,
every tenth member—in that particular time or event.
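A rough sketch of the three techniques in R, drawing samples from a hypothetical population of 100 member IDs (the population, sample sizes, and strata here are invented for illustration):

```r
set.seed(42)
population <- 1:100   # hypothetical member IDs

# Random sampling: every member has an equal chance of selection
sample(population, 10)

# Systematic sampling: every tenth member from a random start
start <- sample(1:10, 1)
population[seq(start, 100, by = 10)]

# Stratified sampling: pick randomly within each group (two strata of 50)
strata <- split(population, rep(1:2, each = 50))
unlist(lapply(strata, function(s) sample(s, 5)))
```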
The details of calculating sample sizes are beyond the scope of this book and are
covered extensively in statistics and other research methodology books. However, to
enhance your understanding, here are simple formulas for calculating a sample size,
where z is the z-value for the desired confidence level and E is the acceptable margin
of error:
If the population standard deviation (sigma) is known, then
n = (z × sigma / E)^2
If the standard deviation is unknown, using an estimated proportion p,
n = p(1 – p) × (z / E)^2
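As an illustration of the second formula with hypothetical figures (95% confidence, so z = 1.96):

```r
z <- 1.96   # z-value for 95% confidence
E <- 0.05   # margin of error (hypothetical)
p <- 0.5    # estimated proportion when unknown (most conservative choice)

n <- p * (1 - p) * (z / E)^2
ceiling(n)  # 385 -- round up to the next whole respondent
```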
5.3.2 Variable Selection
Having more variables in the data set may not always produce the desired results, and
the more predictor variables you have, the more records you need. For example, if you want to
find the relationship between one response Y and a single predictor X, 15 data points
may give you results; but if you have 10 predictor variables, 15 data points are not enough.
Then how much is enough? Statisticians and many researchers have worked on this
and given rough estimates. For example, a procedure by Delmater and Hancock (2001)
indicates that you should have 6 × m × p records for any predictive model, where p is the
number of variables and m is the number of outcome classes. The more records you have,
the better the prediction results. Hence, in big data processing, you can eliminate the
need for sampling and try to process all the available data in order to get better results.
5.4.1 Data Types
Data can be either qualitative or quantitative. Qualitative data is not numerical—for
example, type of car, favorite color, or favorite food. Quantitative data is numeric.
Additionally, quantitative data can be divided into categories of discrete or continuous
data (described in more detail later in this section).
Quantitative data is often referred to as measurable data. This type of data
allows statisticians to perform various arithmetic operations, such as addition and
multiplication, to find parameters of a population, such as mean or variance.
The observations represent counts or measurements, and thus all values are numerical.
Each observation represents a characteristic of the individual data points in a population
or a sample.
Discrete: A variable can take a certain value that is separate
and distinct. Each value is not related to any other value.
Some examples of discrete types of data include the number
of cars per family, the number of times a person drinks
water during a day, or the number of defective products on a
production line.
Continuous: A variable that can take numeric values within
a specific range or interval. Continuous data can take any
possible value that the observations in a set can take. For
example, with temperature readings, each reading can take on
any real number value on a thermometer.
Nominal data: The order of the data is arbitrary, or no order
is associated with the data. For example, race or ethnicity
has the values black, brown, white, Indian, American, and so
forth; no order is associated with the values.
Ordinal data: This data is in a certain defined order. Examples
include Olympic medals—Gold, Silver, and Bronze, and Likert
scale surveys—disagree, agree, strongly agree. With ordinal
data, you cannot state with certainty whether the intervals
between values are equal.
Interval data: This data has meaningful intervals between
measurements, but there is no true zero. Interval data is like
ordinal data, except the intervals between values are equally
split. The most common example is temperature in degrees
Fahrenheit: the difference between 29 and 30 degrees is the
same magnitude as the difference between 58 and 59 degrees.
(Temperature in Kelvin, by contrast, has a true zero and is
ratio data.)
Ratio data: This data has a true zero point, so the ratio of
two values is meaningful. For example, the heights above sea
level of two cities can be expressed as a ratio, as can the
water levels of two reservoirs (one may hold twice as much
as reservoir X).
The concept of scale types later received the mathematical rigor that it lacked at its
inception with the work of mathematical psychologists Theodore Alper (1985, 1987),
Louis Narens (1981a, b), and R. Duncan Luce (1986, 1987, 2001).
Before the analysis, understand the variables you are using and prepare all of them
with the right data type. Many tools support the transformation of variable types.
5.4.2 Data Preparation
After the preliminary data type conversions, the next step is to study the data. You need to
check the values and their association with the data. You also need to find missing values,
null values, empty spaces, and unknown characters so they can be removed from the data
before the analysis. Otherwise, this can impact the accuracy of the model. This section
describes some of the criteria and analysis that can be performed on the data.
3. Take the average of the entries in each bin.
Bin 1 average: 252.5
Bin 2 average: 1,000
Bin 3 average: 2,200
4. Use the average bin values to fill in the missing value for a
particular bin.
Predict the values based on the most probable value: Based on the other attributes in
the data set, you can fill in the most probable value the entry can take, using
statistical techniques such as Bayes' theorem or a decision tree to find that value.
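The bin averages above can be reproduced in a few lines of R; the bin contents here are hypothetical pairs chosen so that each bin yields the average shown:

```r
# Three hypothetical bins whose means match the walk-through above
bins <- list(bin1 = c(200, 305),     # average 252.5
             bin2 = c(900, 1100),    # average 1,000
             bin3 = c(2000, 2400))   # average 2,200

bin_means <- sapply(bins, mean)
bin_means

# A missing entry known to fall in bin 2 gets that bin's average
filled_value <- bin_means["bin2"]
```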
From these str() details, it is clear that the data set does not have variable names:
R has built names from the values in the first row, prefixing each with an X. And almost
all the variable types are defined as factors by default. Further, to view the first few
lines of the data, you can use the head() command:
■■Note str() gives the summary of the object in active memory, and head() enables you
to view the first few (six) lines of data.
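The read-in and inspection steps look like this; the real file name and path depend on where your copy of the data set lives, so a temporary stand-in file is created here to keep the snippet self-contained:

```r
# A tiny stand-in for the Stock Price file (no header row in the raw data)
tmp <- tempfile(fileext = ".csv")
writeLines(c("2016-01-04,25.2,High",
             "2016-01-05,26.1,Low"), tmp)

# stringsAsFactors = TRUE reproduces the factor-by-default behavior
# described in the text (the default in older versions of R)
mydata <- read.csv(tmp, header = FALSE, stringsAsFactors = TRUE)

str(mydata)    # structure: one line per variable, with type and first values
head(mydata)   # the first few (up to six) rows
```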
The next step is to provide a proper variable name to each column. The data set we
are using in this example is the Stock Price data set. The first column is the date, and the
second column to the tenth column hold stock prices for various stocks. The last column
is the rating for the volume of transactions for the day (high, low, or medium).
Looking at this output from str(mydata), all the variables (except day and Stock9)
are of type factor. However, all of them except the Ratings variable actually hold
numeric values. Let's convert the variables to the appropriate data types:
Using the as.numeric() function, the variables Stock1, Stock2, Stock3, and so forth
have been converted to the numeric data type. In this case, we have nested as.numeric()
with as.character(): if you convert factors directly to numeric values, R returns the
internal level codes rather than the actual values. Hence, first convert the value to a
character and then to a numeric. Similar functions are available to convert one data
type to another:
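The level-code pitfall can be demonstrated in a few lines; the values are made up:

```r
f <- factor(c("25.5", "30.2", "28.7"))   # numeric values stored as a factor

as.numeric(f)                 # wrong: returns the internal level codes
as.numeric(as.character(f))   # right: convert to character first, then numeric

# Other conversion helpers work the same way:
as.integer("42")
as.factor(c("High", "Low", "Medium"))
as.Date("2016-01-04")
```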
Finally, let’s identify the missing values, junk characters, and null values and then
clean the data set.
Using the is.na() function, this table indicates that there are 95 missing values (NA)
in the data set. The table() function provides the tabular output by cross-classifying
factors to build a contingency table of the counts at each combination of factor levels:
> table(is.na(mydata))
FALSE TRUE
11425 95
The function complete.cases() returns a logical vector indicating which cases are
complete (TRUE or FALSE). The following example shows how to identify NAs (missing
values) in the data set by using the complete.cases() function:
The following example shows how to fill NAs (missing values) with the mean value
of that particular column. You can adopt any of the other methods discussed in the
previous sections or, depending on the type of data set, write your own function to
perform the same task.
You can repeat the same for other variables by using the same method, or use any
apply() method discussed in earlier sections.
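A minimal sketch of the whole sequence, with a hypothetical two-column data frame:

```r
# Hypothetical data frame with missing values
df <- data.frame(Stock1 = c(25, NA, 30, 35),
                 Stock2 = c(100, 110, NA, 120))

table(is.na(df))       # how many values are missing vs. present
complete.cases(df)     # TRUE for rows with no missing values

# Replace each column's NAs with that column's mean
for (col in names(df)) {
  df[[col]][is.na(df[[col]])] <- mean(df[[col]], na.rm = TRUE)
}
```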
[Figure: operational and manufacturing databases (structured and unstructured) and log data feeding into a data pre-processing engine]
5.5.1 Tables
The easiest and most common tool available for looking at data is a table. Tables contain
rows and columns. The raw data is displayed as rows of observations and columns of
variables. Tables are useful for smaller samples, and it can be difficult to display the
whole data set if you have many records. By presenting the data in tables, you can gain
insight about the data, including the type of data, variable names, and the way the data
is structured. However, it is not possible to identify relationships between variables by
looking at tables. In R, we can view data in a table format by using the View() command,
as shown in Figure 5-3.
5.5.2 Summary Tables
We already discussed statistics and their significance in Chapter 4. Descriptive statistics
provide a common way of understanding data. These statistics can be represented as
summary tables. Summary tables show the number of observations in each column and
the descriptive statistics for each column. The following statistics are commonly used:
Minimum: The minimum value
Maximum: The maximum value
Mean: The average value
Median: The value at the midpoint
Sum: The sum of all observations in the group
Standard deviation: A standardized measure of the deviation
of a variable from the mean
First quartile: The value below which 25 percent of the
observations fall, between the minimum and the median
Third quartile: The value below which 75 percent of the
observations fall, between the median and the maximum
The following output shows the descriptive statistics of our Stock Price data set.
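A summary table like the one described can be produced with summary(); the small data frame here is a stand-in for the Stock Price data set:

```r
df <- data.frame(Stock1  = c(25, 30, 35, 40),
                 Ratings = factor(c("High", "Low", "Low", "Medium")))

summary(df)     # min, quartiles, median, mean, and max for numeric columns;
                # level counts for factor columns
sd(df$Stock1)   # standard deviation is computed separately
```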
5.5.3 Graphs
Graphs represent data visually and provide more details about the data, enabling you to
identify outliers, see the distribution of each column variable, describe the data
statistically, and examine the relationship between two or more variables.
Chapter 4 discussed several data visualization graphs used in exploring data. Some types
of graphs include bar charts, histograms, box plots, and scatter plots. In addition, looking
at the graphs of multiple variables simultaneously can provide more insights into the data.
Univariate analysis analyzes one variable at a time. It is the simplest form of
analyzing data. You analyze a single variable, summarize the data, and find the patterns
in the data. You can use several visualization graphs to perform univariate data analysis,
including bar charts, pie charts, box plots, and histograms. All these plots have already
been discussed in the previous chapter.
A histogram represents the frequency distribution of the data. Histograms are
similar to bar charts but group numbers into ranges. Also, a histogram lets you show the
frequency distribution of continuous data. This helps in analyzing the distribution (for
example, normal or Gaussian), any outliers present in the data, and skewness. Figure 5-4
describes the first variable of the Stock Price data set in a histogram plot.
Figure 5-4. Histogram and density function
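A histogram with an overlaid density curve, as in Figure 5-4, can be sketched as follows; the simulated vector stands in for the Stock1 column:

```r
set.seed(7)
stock1 <- rnorm(200, mean = 40, sd = 8)   # simulated stand-in for Stock1

hist(stock1, prob = TRUE, main = "Stock1", xlab = "Stock1")  # density scale
lines(density(stock1), lwd = 2)   # overlay the estimated density curve
```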
5.5.3.1 Box plots
A box, or box-and-whisker, plot is also a graphical description of data. Box plots,
created by John W. Tukey, show the distribution of a data set based on a five-number
summary: minimum, first quartile, median, third quartile, and maximum. Figure 5-5
explains how to interpret a box plot and its components.
The central rectangle spans the first and third quartile (interquartile range, or IQR).
The line inside the rectangle shows the median. The lines, also called whiskers, that are
above and below the rectangle show the maximum and minimum of the data set.
[Figure 5-5: box plot components, showing whiskers extending 1.5 × IQR beyond the box, the maximum and minimum values, the third quartile (75 percent of the data), the first quartile (25 percent of the data), and outliers beyond the whiskers]
Normal data sets do not have a surprisingly high maximum value or low minimum
value. Outliers generally fall outside the two whisker lines. Tukey provided the
following definitions:
Outliers: more than 3 × IQR above the third quartile or 3 × IQR below
the first quartile
Suspected outliers: more than 1.5 × IQR above the third quartile or 1.5 × IQR
below the first quartile
Notched plots look like box plots with notches, as shown in Figure 5-7. If two boxes’
notches do not overlap, that shows “strong evidence” that their medians differ (Chambers
et al., 1983, p. 62).
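Both plain and notched box plots come from boxplot(); the simulated data frame is a stand-in for the Stock Price data set:

```r
set.seed(7)
stock <- data.frame(price  = c(rnorm(100, 40, 5), rnorm(100, 50, 5)),
                    rating = rep(c("High", "Low"), each = 100))

boxplot(price ~ rating, data = stock)                # standard box plots
boxplot(price ~ rating, data = stock, notch = TRUE)  # notched box plots

fivenum(stock$price)   # the five-number summary behind each box
```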
5.5.3.2 Scatter plots
The most common data visualization tool used for bivariate analysis is the scatter plot.
Scatter plots can be used to identify the relationships between two continuous variables.
Each data point on a scatter plot is a single observation. All the observations can be
plotted on a single chart.
Figure 5-8 shows a scatter plot of the number of employees vs. revenue (in millions
of dollars) of various companies. As you can see, there is a strong relationship between
the two variables that is almost linear. However, you cannot draw any causal implications
without further statistical analysis.
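A scatter plot like Figure 5-8 can be sketched with simulated data; the employee counts and revenue figures here are made up:

```r
set.seed(7)
employees <- seq(100, 5000, by = 100)          # hypothetical company sizes
revenues  <- 20 * employees + rnorm(length(employees), sd = 2000)

plot(employees, revenues, main = "Scatterplot",
     xlab = "Employee", ylab = "revenues")     # one point per company
cor(employees, revenues)   # near 1 for a strong, almost linear relationship
```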
Figure 5-8. A scatter plot of the number of employees vs. revenue (in millions of dollars)
Unfortunately, scatter plots are not always useful for finding patterns or
relationships. If there are too many data points, the scatter plot does not give much
information. For example, Figure 5-9 does not indicate whether any correlation exists
between casual bike users and registered bike users.
Figure 5-9. Registered users vs. casual users
> hou <- read.table("housing.data", header = TRUE, sep = "\t")
In this example, the variables are on the diagonal, from top left to bottom right
(see Figure 5-10). Each variable is plotted against the other variables. For example, the
plot that is to the right of CRM and above ZN represents a plot of ZN on the x axis and
CRM on the y axis. Similarly, the plot of Age vs. CRM, plotting Age on the x axis and CRM
on the y axis, would be the sixth plot down from CRM and the sixth plot to the left of AGE.
In this particular example, we do not see any strong relationships between any two pairs
of variables.
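A scatter plot matrix is one call to pairs(); the built-in iris data set stands in here for the housing data:

```r
# pairs() plots every variable against every other variable; the variable
# names run down the diagonal from top left to bottom right
pairs(iris[, 1:4], main = "Scatter plot matrix")
```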
5.5.4.1 Trellis Plot
Trellis graphics is a framework for data visualization that lets you examine multiple
variable relationships. The following example shows the relationships of variables in the
Boston Housing data set. A trellis graph can be produced by any type of graph component
such as a histogram or a bar chart. A trellis graph is based on partitioning one or more
variables and analyzing that with the others. For categorical variables, a plot is based on
different levels of that variable, whereas for numerical values, the data subset is based on
the intervals of that variable.
In the Boston Housing data set, the value of a house (MEDV) is analyzed against its
age (AGE), conditioned on subsets of tax (TAX). A similar analysis can be performed
for other variables. The problem with a simple scatter plot is overplotting, which
makes it hard to interpret the structures in the data. A trellis plot partitions the
data into different intervals to make the interpretation clearer. In this example,
the trellis splits TAX into different intervals: the TAX interval increases from
panel to panel, and as you can see in Figure 5-11, higher taxes are concentrated in
lower age groups, and vice versa.
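A trellis plot of this kind can be sketched with the lattice package (shipped with R); the built-in iris data and the variable partitioned here are stand-ins for the Boston Housing columns:

```r
library(lattice)   # lattice is distributed with R

d <- iris                        # stand-in for the Boston Housing data
d$bin <- cut(d$Petal.Width, 4)   # partition one variable into 4 intervals

# One scatter plot panel per interval, analogous to MEDV ~ AGE | TAX
p <- xyplot(Sepal.Length ~ Sepal.Width | bin, data = d)
print(p)
```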
Figure 5-11. Trellis graph: MEDV vs. AGE, with one panel per TAX interval ((186,318], (318,449], (449,580], and (580,712])
5.5.4.2 Correlation plot
The correlation between two variables can be calculated and plotted by using a
correlation graph, as shown in Figure 5-12. The pairwise correlations can be plotted
in a correlation matrix plot to describe the correlation of each pair of variables:
Figure 5-12. Correlation plot
In this example, a blue dot represents a positive correlation, and red represents
a negative correlation. The larger the dot, the stronger the correlation. The diagonal
dots (from top left to bottom right) are perfectly positively correlated, because each dot
represents the correlation of each attribute with itself.
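The underlying correlation matrix comes from cor(); drawing the dot plot of Figure 5-12 uses the non-base corrplot package, so that call is guarded in case it is not installed. The simulated stocks data frame is a stand-in:

```r
set.seed(7)
stocks <- as.data.frame(matrix(rnorm(800), ncol = 8))  # stand-in for Stock1..Stock8
names(stocks) <- paste0("Stock", 1:8)

M <- cor(stocks)   # pairwise correlation matrix; the diagonal is 1

# If the corrplot package is installed, draw the Figure 5-12 style dot plot
if (requireNamespace("corrplot", quietly = TRUE)) corrplot::corrplot(M)
```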
5.5.4.3 Density by Class
A density function of each variable can be plotted as a function of the class. Similar to
scatter plot matrices, a density plot can help illustrate the separation by class and show
how closely they overlap each other.
In this example of the Stock Price data set shown in Figure 5-13, some stock prices
overlap very closely and are hard to separate.
[Figure 5-13: density plots by class for the stock variables (Stock3 through Stock8) of the Stock Price data set]
The preceding data visualization methods are some of the most commonly used.
Depending on the business problem and the data you are analyzing, you can select a
method. No single technique is considered the standard.
5.5.5 Data Transformation
After a preliminary analysis of data, sometimes you may realize that the raw data you
have may not provide good results, or doesn’t seem to make any sense. For example,
data may be skewed, data may not be normally distributed, or measurement scales
may be different for different variables. In such cases, data may require transformation.
Common transformation techniques include normalization, data aggregation, and
smoothing. After the transformation, before presenting the analysis results, the inverse
transformation should be applied.
5.5.5.1 Normalization
Certain techniques such as regression assume that the data is normally distributed and
that all the variables should be treated equally. Sometimes the data we collect for various
predictor variables may differ in their measurement units, which may have an impact on
the overall equation. This may cause one variable to have more influence over another
variable. In such cases, all the predictor variable data is normalized to one single scale.
Some common normalization techniques include the following:
Z-score normalization (zero-mean normalization): The new
value is created based on the mean and standard deviations.
The new value A’ for a record value A is normalized by
computing the following:
A′ = (A – mean_A) / SD_A
where mean_A is the mean and SD_A is the standard deviation
of attribute A.
This type of transformation is useful when we do not know the
minimum and maximum value of an attribute or when there
is an outlier dominating the results.
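A sketch of z-score normalization on a made-up attribute, alongside R's built-in scale() function, which performs the same computation:

```r
A <- c(12, 20, 35, 50, 58)

A_z <- (A - mean(A)) / sd(A)   # zero-mean, unit-variance version of A

# scale() performs the same z-score normalization
all.equal(A_z, as.numeric(scale(A)))
```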
Min-max normalization: In this transformation, values are
transformed within the range of values specified. Min-max
normalization performs linear transformations on the original
data set. The formula is as follows:
New value A′ = ((A – Min_A) / (Max_A – Min_A)) × (Range of A′) + Min_A′
Range of A′ = Max_A′ – Min_A′
The min-max transformation maps each value to the new range
[Min_A′, Max_A′]. For example, for the new set of values to be
in the range 0 to 1, the new Max = 1 and the new Min = 0. If
the old value is 50, with a minimum of 12 and a maximum of 58, then
A′ = (50 – 12) / (58 – 12) = 0.83
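The same transformation as a small R function; the vector reuses the 12, 50, and 58 values from the example:

```r
# Min-max normalization to an arbitrary target range [new_min, new_max]
min_max <- function(A, new_min = 0, new_max = 1) {
  (A - min(A)) / (max(A) - min(A)) * (new_max - new_min) + new_min
}

A <- c(12, 50, 58)
min_max(A)   # 50 maps to (50 - 12) / (58 - 12), about 0.83
```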
5.6.1 Descriptive Analytics
Descriptive analytics explains the patterns hidden in data. These patterns could be
market segments, sales numbers by region, groups of products based on reviews,
software bug patterns in a defect database, behavioral patterns in an online gaming
user database, and more. These patterns are purely based on historical data. You can
also group similar observations into clusters; this analysis is called clustering
analysis.
Similarly, association rules or affinity analysis can be used on a transactional
database to find the associations among items purchased in department stores.
This analysis is performed based on past data available in the database, to look for
associations among various items purchased by customers. This helps businesses extend
discounts, launch new products, and manage inventory effectively.
5.6.2 Predictive Analytics
Prediction consists of two methods: classification and regression analysis.
Classification is a basic form of data analysis in which data is classified into classes.
For example, a credit card can be approved or denied, flights at a particular airport are
on time or delayed, and a potential employee will be hired or not. The class prediction is
based on previous behaviors or patterns found in the data. The task of the classification
model is to determine the class of data from a new set of data that was not seen before.
5.6.3 Machine Learning
Machine learning is about making computers learn and perform tasks better based on
past historical data. Learning is always based on observations from the data available.
The emphasis is on making computers build mathematical models based on that learning
and perform tasks automatically without the intervention of humans. The system
cannot always predict with 100 percent accuracy, because the learning is based on past
data that’s available, and there is always a possibility of new data arising that was never
learned earlier by the machine. Machines build models based on iterative learning to
find hidden insights. Because there is always a possibility of new data, this iteration is
important because the machines can independently adapt to new changes. Machine
learning has been around for a long time, but recent developments in computing, storage,
and programming; new complex algorithms; and big data architectures such as Hadoop
have helped it gain momentum. There are two types of machine learning: supervised
machine learning and unsupervised machine learning.
In this example, the classifier model is developed based on the training data set:
a set of documents (data) that have already been categorized and labeled into
different classes, manually and under expert supervision. The classification
algorithm (model) learns from this training data set, which already has class
labels. Once the learning is complete, the model is ready
for the classification of documents (new data) whose labels are unknown. Common
classification supervised-learning algorithms include support vector machines, naïve
Bayes, k-nearest neighbor, and decision trees.
[Figure: unsupervised learning, in which a clustering algorithm partitions adjectives into two subsets: positive (scenic, nice, handsome, fun, comfortable) and negative (slow, terrible, painful, expensive)]
You can solve a business problem by using the available data either via simple
analytics such as data visualization or advanced techniques such as predictive analytics.
The business problem can be solved via supervised machine learning or unsupervised
machine learning. It can be a classification problem or a regression problem. Depending
on the business problem you are trying to solve, different methods are selected and
different algorithms are used to solve the problem. The next chapters discuss various
classification, regression, clustering, and association techniques in detail.
The techniques and algorithms used are based on the nature of the data available.
Table 5-1 summarizes the variable type and important algorithms that can be used to
solve a business problem.
5.7.4 Cross-Validation
To avoid any bias, the data set is partitioned randomly. When you have a limited
amount of data, you can use k-fold cross-validation to achieve an unbiased
performance estimate. In k-fold cross-validation, you divide the data into k folds,
build the model using k – 1 folds, and use the remaining fold for testing. You repeat
this process k times, each time "leaving out" a different fold from training and using
it as the test set. If k equals the sample size, this is referred to as leave-one-out
cross-validation.
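A minimal base-R sketch of the k-fold loop; the model-fitting step is left as a comment, since it depends on the algorithm chosen:

```r
set.seed(7)
n <- 100                                    # hypothetical number of records
k <- 5
folds <- sample(rep(1:k, length.out = n))   # randomly assign each record to a fold

for (i in 1:k) {
  train_idx <- which(folds != i)   # build the model on k - 1 folds
  test_idx  <- which(folds == i)   # test on the held-out fold
  # fit on train_idx, evaluate on test_idx, and accumulate the error here
}
```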
[Figure: model workflow, in which the model is built on the training data, evaluated on the test data, re-evaluated on the validation data, and then used to predict on new data]
Classification and regression are the two types of predictive models, and each has
a different set of criteria for evaluation. Let’s briefly look at the various criteria for each
model. Details are discussed in subsequent chapters.
                          Predicted Class
                          Positive (C0)    Negative (C1)
Actual    Positive (C0)        80               30
Class     Negative (C1)        40               90

From this matrix, the recall (sensitivity) for the positive class is
80/(80 + 30) = 0.73, and the precision is 80/(80 + 40) = 0.67.
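These quantities can be computed directly from a confusion matrix in R, using the 80/30/40/90 counts above:

```r
# Confusion matrix: rows are actual classes, columns are predicted classes
cm <- matrix(c(80, 30,
               40, 90),
             nrow = 2, byrow = TRUE,
             dimnames = list(Actual    = c("C0", "C1"),
                             Predicted = c("C0", "C1")))

accuracy  <- sum(diag(cm)) / sum(cm)   # (80 + 90) / 240
recall    <- cm[1, 1] / sum(cm[1, ])   # 80 / 110: sensitivity for class C0
precision <- cm[1, 1] / sum(cm[, 1])   # 80 / 120: precision for class C0
```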
5.7.5.2 Lift Chart
Lift charts are commonly used for marketing problems. For example, say you want to
determine how effective an online marketing campaign is. You have a set of cases, each
assigned a "probability of responding" to an online ad click. The lift curve helps
determine how effectively the online advertisement campaign can be run by selecting
a relatively small group that captures the maximum number of responders. Lift is a
measure of the effectiveness of the model, calculated as the ratio of the results
obtained with and without the model.
A confusion matrix evaluates the effectiveness of the model over the whole population,
whereas the lift chart evaluates it over a portion of the population.
The graph is constructed with the number of cases on the x axis and the cumulative
true positives on the y axis. True positives are those observations that are classified
correctly. In Table 5-4, the first column shows the cases actually predicted correctly
(true positives), the second column the cumulative count of the true-positive class,
and the third column the cumulative average prediction of the class.
Table 5-4. Sample data for Lift chart
A lift chart measures the effectiveness of a classification model by comparing its
cumulative true positives against those obtained without a model, that is, by selecting
cases at random from the population. For example, in Figure 5-17, if we select 10
cases, the model achieves an accuracy of 65 percent, whereas without the model it
would have been just 45 percent. With the lift chart, you can also compare how well
different models perform for a set of random cases.
Figure 5-17. Lift chart: the cumulative count of the true-positive class and the cumulative average of the class, plotted against the number of cases
The lift will vary with the number of cases we choose. The red line is a reference
line: it provides the benchmark of how many positive cases we would expect to find
without a model.
5.7.5.3 ROC Chart
A receiver operating characteristic (ROC) chart is similar to a lift chart. It is a plot of the
true-positive rate on the y axis and the false-positive rate on the x axis.
ROC graphs are another way of representing the performance of a classifier. In recent
years, ROC has become a common method in the machine-learning community, because
simple classification accuracy is not a good measure of a classifier's performance
(Provost and Fawcett, 1997; Provost et al., 1998). The ROC curve plots the
true-positive rate (sensitivity) against the false-positive rate (1 – specificity),
as shown in Figure 5-18.
For a good classifier model, the true-positive rate should be high and the
false-positive rate low, as shown in Figure 5-18. Instead of just looking at the
confusion matrix, the area under the curve (AUC) of a ROC curve provides a simple
summary of the classifier's performance. In the example, ROC curves are plotted for
three different models, and the AUC for the first line (green) is the highest. For
each classifier, a ROC curve is fitted, and the results are compared across the
classifier models over their entire operating range. An AUC less than 0.5 indicates
that the model is not performing well and needs attention. Normally, the AUC falls
between 0.5 and 1.0. When the two classes are perfectly separated, with no overlap
between their distributions, the area under the ROC curve reaches 1; ideally, that
is the goal for any machine-learning model.
[Figure 5-18: ROC curves for three classifiers, plotting the true-positive rate (sensitivity) against the false-positive rate (1 – specificity); the shaded region under a curve is its AUC]
Research indicates that algorithms that have two classes are most suited for the ROC
approach. A neural network is an example of an appropriate classifier, whereas decision-
tree classifiers are less suited.
5.7.6.1 Root-Mean-Square Error
A regression line predicts the y values for a given x value, and the predictions
scatter around the actual values. The prediction error (called the root-mean-square
error, or RMSE) is given by the following formula:
RMSE = √( Σ (yk – ŷk)² / n )
where the sum runs over the n observations k = 1, …, n, yk is the actual value, and
ŷk is the predicted value.
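The formula translates into a one-line R function; the actual and predicted vectors are made up:

```r
# Root-mean-square error of predictions against actual values
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

y     <- c(3, 5, 7, 9)          # actual values (hypothetical)
y_hat <- c(2.5, 5.5, 6.5, 9.5)  # predictions, each off by 0.5
rmse(y, y_hat)                  # 0.5
```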
5.8.1 Problem Description
First, you specify the problem defined by the business and solved by the model. This is
important, as it connects the management of the organization back to the objective of the
data analysis. In this step, you are revalidating the precise problem intended to be solved.
5.8.7 Issues Handling
Here you present the ideal process for recording the issues observed and the ways they
will be reported, analyzed, and addressed. You also emphasize how this step may lead to
the optimization of the model over a period of time, because recurring issues may
indicate changes to the model's basic assumptions and structure.
5.10 Summary
The chapter focused on the processes involved in business analytics, including
identifying and defining a business problem, data preparation, data collection, data
modeling, evaluating model performance, and reporting to management on the findings.
You learned about various methods involved in data cleaning, including
normalization, transformation of variables, handling missing values, and finding outliers.
You also delved into data exploration, which is the most important process in business
analytics. You saw the various tables, charts, and plots used for these purposes.
Further, you explored supervised machine learning, unsupervised machine learning,
and how to choose different methods based on business requirements. You learned how
to partition the data set into a training set, test set, and validation set and saw why each
set is important. Finally, you learned about various metrics to measure the performance
of different models including predictive models.
CHAPTER 6
Supervised Machine
Learning—Classification
Classification and prediction are two important methods of data analysis used to
find patterns in data. Classification predicts the categorical class (or discrete
values), whereas regression and other models predict continuous-valued functions
(logistic regression, however, also handles categorical classes). For example, a
classification model
may be built to predict the results of a credit-card application approval process (credit
card approved or denied) or to determine the outcome of an insurance claim. Many
classification algorithms have been developed by researchers and machine-learning
experts. Most classification algorithms are memory intensive. Recent research has
developed parallel and distributed processing architectures, such as Hadoop, which
are capable of handling large amounts of data.
This chapter focuses on basic classification techniques. It explains some
classification methods including naïve Bayes, decision trees, and other algorithms. It also
provides examples using R packages available to perform the classification tasks.
Finally, imagine that airport authorities have to decide, based on a set of parameters,
whether a particular flight of a particular airline at a particular gate is on time or delayed.
This decision is based on previous flight details and many other parameters.
Classification is a two-step process. In the first step, a model is constructed
by analyzing the database and the set of attributes that define the class variable. A
classification problem is a supervised machine-learning problem: the training data is
a sample from the database, and the class attribute is already known. In a classification
problem, the class Y, a categorical variable, is determined by a set of input variables
{x1, x2, x3, …}. The variable we would like to predict is typically called the class
variable C, and it may take different values in the set {c1, c2, c3, …}. The observed
or measured variables X1, X2, …, Xn are the attributes, or input variables, also called
explanatory variables. In classification, we want to determine the relationship between
the class variable and the explanatory variables. Typically, models represent the
classification rules or mathematical formulas. Once these rules are created by the
learning model, the model can be used to predict the class of future data for which the
class is unknown.
There are various types of classifier models: those based on a decision boundary, those
based on probability theory, and those based on discriminant analysis. We begin our
discussion with a classifier based on the probabilistic approach. Then we will look at
decision trees and discriminant classifiers.
P(Ci | X1, X2, X3, …, Xp) = P(X1, X2, X3, …, Xp | Ci) P(Ci) /
[ P(X1, X2, …, Xp | C1) P(C1) + P(X1, X2, …, Xp | C2) P(C2) + … + P(X1, X2, …, Xp | Cm) P(Cm) ]
P(Ci) is the prior probability of belonging to class Ci in the absence of any other
information. P(Ci | X1, …, Xp) is the posterior probability of belonging to class Ci given
the observed attribute values. In order to classify a record using Bayes' theorem, you
compute its probability of belonging to each class Ci. You then assign the class with the
highest probability score calculated using the preceding formula.
It would be extremely computationally expensive to compute P(X | Ci) for data sets
with many attributes. Hence, a naïve assumption is made: the predictor attributes are
presumed to be conditionally independent of one another, given the class label of the
sample. Under this assumption, we can simplify the equation:
P(X | Ci) = ∏(k=1 to n) P(Xk | Ci)
6.2.1 Example
Let’s look at one example and see how to predict a class label by using a Bayesian classifier.
Table 6-1 presents a training set of data tuples for a bank credit-card approval process.
The data samples in this training set have the attributes Age, Purchase Frequency,
and Credit Rating. The class label attribute has two distinct classes: Approved or Denied.
Let C1 correspond to the class Approved, and C2 correspond to class Denied. Using the
naïve Bayes classifier, we want to classify an unknown label sample X:
X = (Age >40, Purchase Frequency = Medium, Credit Rating = Excellent)
To classify a record, first compute the chance of a record belonging to each of the
classes by computing P(Ci|X1,X2, … Xp) from the training record. Then classify based on
the class with the highest probability.
In this example, there are two classes. We need to compute P(Xi|Ci)P(Ci). P(Ci) is the
prior probability of each class:
P(Application Approval = Yes) = 6/14 = 0.428
P(Application Approval = No) = 8/14 = 0.571
Let's compute the conditional probabilities P(Xk | Ci) for i = 1, 2:
P(Age > 40 | Approval = Yes) = 2/6 = 0.333
P(Age > 40 | Approval = No) = 2/8 = 0.25
P(Purchase Frequency = Medium | Approval = Yes) = 1/6 = 0.167
P(Purchase Frequency = Medium | Approval = No) = 5/8 = 0.625
P(Credit Rating = Excellent | Approval = Yes) = 2/6 = 0.333
P(Credit Rating = Excellent | Approval = No) = 3/8 = 0.375
Using these probabilities, you can obtain the following:
P(X | Approval = Yes) = 0.333 × 0.167 × 0.333 = 0.0185
P(X | Approval = No) = 0.25 × 0.625 × 0.375 = 0.0586
P(X | Approval = Yes) × P(Approval = Yes) = 0.0185 × 0.428 = 0.0079
P(X | Approval = No) × P(Approval = No) = 0.0586 × 0.571 = 0.0335
The naïve Bayesian classifier therefore predicts Approval = No for the given sample X.
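You can verify this hand calculation in a few lines of R; the sketch below simply recodes the counts from the text:

```r
# Prior probabilities (6 approved, 8 denied, out of 14 records)
p_yes <- 6/14
p_no  <- 8/14

# Class-conditional probabilities for the sample
# X = (Age > 40, Purchase Frequency = Medium, Credit Rating = Excellent)
px_yes <- (2/6) * (1/6) * (2/6)
px_no  <- (2/8) * (5/8) * (3/8)

# Unnormalized posterior scores; the class with the larger score wins
score_yes <- px_yes * p_yes
score_no  <- px_no  * p_no
predicted <- if (score_no > score_yes) "Denied" else "Approved"
```

Running this reproduces the values in the text (score_yes ≈ 0.0079, score_no ≈ 0.0335) and predicts Denied.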
The next step is to build the classifier (naïve Bayes) model by using the mlbench and
e1071 packages:
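A minimal sketch of this step uses e1071's naiveBayes() function; the data frame below is an illustrative stand-in for Table 6-1, not its exact rows:

```r
# Sketch: building a naive Bayes classifier with the e1071 package.
# The `credit` data frame is a hypothetical stand-in for Table 6-1.
library(e1071)

credit <- data.frame(
  Age          = factor(c("<30", "30-40", ">40", ">40", "<30", "30-40")),
  PurchaseFreq = factor(c("High", "Medium", "Medium", "Low", "High", "Low")),
  CreditRating = factor(c("Fair", "Excellent", "Excellent", "Fair", "Fair", "Excellent")),
  Approval     = factor(c("Yes", "No", "No", "Yes", "No", "Yes"))
)

# Fit the model: class-conditional probabilities are estimated from the data
model <- naiveBayes(Approval ~ ., data = credit)

# Classify the unknown sample X
newX <- data.frame(
  Age          = factor(">40",       levels = levels(credit$Age)),
  PurchaseFreq = factor("Medium",    levels = levels(credit$PurchaseFreq)),
  CreditRating = factor("Excellent", levels = levels(credit$CreditRating))
)
predict(model, newX)
```

With the complete Table 6-1 in place of the stand-in rows, the prediction should agree with the hand-worked probabilities in the text.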
For the new sample data X = (Age > 40, Purchase Frequency = Medium, Credit Rating
= Excellent), the naïve Bayes model has predicted Approval = No.
6.3 Decision Trees
A decision tree builds a classification model by using a tree structure. A decision tree
structure consists of a root node, branches, and leaf nodes. Leaf nodes hold the class
label, each branch denotes the outcome of the decision-tree test, and the internal nodes
denote the decision points.
Figure 6-1 demonstrates a decision tree for the loan-approval model. Each internal
node represents a test on an attribute. In this example, the decision node is Purchase
Frequency, which has two branches, High and Low. If Purchase Frequency is High, the
next decision node would be Age, and if Purchase Frequency is Low, the next decision
node is Credit Rating. The leaf node represents the classification decision: Yes or No. This
structure is called a tree structure, and the topmost decision node in a tree is called the
root node. The advantage of using a decision tree is that it can handle both numerical and
categorical data.
[Figure 6-1. Decision tree for the loan-approval model: the root node Purchase Frequency splits into High and Low; the High branch splits on Age (<40, >40) and the Low branch on Credit Rating (fair, excellent), ending in Yes/No leaf nodes]
The decision tree has several benefits. It does not require any domain knowledge,
the steps involved in learning and classification are simple and fast, and the model is
easy to comprehend. This machine-learning algorithm develops a decision tree based on
divide-and-conquer rules. Let x1, x2, and x3 be independent variables; and Y denotes the
dependent variable, which is a categorical variable. The X variables can be continuous,
binary, or ordinal. A decision tree uses a recursive partitioning algorithm to construct the
tree. The first step is selecting one of the variables, xi, as the root node to split. Depending
on the type of the variable and the values, the split can be into two parts or three parts.
Then each part is divided again by choosing the next variable. The splitting continues
until the decision class is reached. Once the root node variable is selected, the top-level
tree is created, and the algorithm proceeds recursively by splitting on each child node.
We want each final leaf to be homogeneous, or as pure as possible; pure means the leaf
contains only one class, though this may not always be achievable. The challenge is
selecting which nodes to split on, and knowing when to stop growing or when to prune
the tree, when there are many candidate variables to split.
6.3.2 Information Gain
In order to select the decision-tree node and attribute to split the tree, we measure the
information provided by that attribute. Such a measure is referred to as a measure of the
goodness of split. The attribute with the highest information gain is chosen as the test
attribute for the node to split. This attribute minimizes the information needed to classify
the samples in the recursive partition nodes. This approach of splitting minimizes the
number of tests needed to classify an object and guarantees that a simple tree is formed.
Many algorithms use entropy to calculate the homogeneity of a sample.
Let N be a set consisting of n data samples, and let the class attribute have m distinct
class labels Ci (for i = 1, 2, 3, …, m).
The Gini impurity index is defined as follows:

G(N) = 1 − ∑(k=1 to m) (pk)²

where pk is the proportion of samples in N that belong to class Ck.
[Plot of the Gini index G(N) against the class proportion, peaking at 0.5]
The second impurity measure is entropy. For a set of n samples having m distinct
classes Ci (for i = 1, …, m), the expected information needed to classify a sample is
represented as follows:

I(n1, n2, n3, …, nm) = −∑(k=1 to m) pk log2(pk)    (1)

where nk is the number of samples in class Ck and pk = nk/n.
Figure 6-3. Impurity
Let A be an attribute with v distinct values {a1, a2, …, av}. Attribute A partitions the
set S into subsets {S1, S2, …, Sv}, where Sj contains the samples of S that take value aj
of A. These subsets correspond to the branches grown from the node that tests A; the
tree develops from a root node selected from among the attributes, based on the
information gain provided by each attribute. Let Sij be the number of samples of class
Ci in subset Sj. The entropy, or expected information, of attribute A is given as follows:

E(A) = ∑(j=1 to v) [ (S1j + S2j + S3j + … + Smj) / |S| ] × I(S1j, S2j, …, Smj)    (2)
The purity of the subset partition depends on the value of entropy. The smaller the
entropy value, the greater the purity. The information gain obtained by branching on
attribute A is calculated as follows:

Gain(A) = I(s1, s2, s3, …, sm) − E(A)    (3)

Gain(A) is the difference between the overall information requirement and the entropy
after splitting on A; it is the expected reduction in entropy caused by knowing the value
of attribute A. The attribute with the highest information gain is chosen as the root node
for the given set S, and branches are created for each of its values as per the sample
partition.
The data samples in this training set have the attributes Age, Purchase Frequency,
and Credit Rating. The Class Label attribute has two distinct classes: Approved or Denied.
Let C1 correspond to the class Approved, and C2 correspond to the class Denied. Using the
decision tree, we want to classify the unknown label sample X:
X = (Age > 40, Purchase Frequency = Medium, Credit Rating = Excellent)
Loan Approval: Yes = 6, No = 8
In this example, the class label attribute, representing Loan Approval, has two
values (namely, Approved or Denied); therefore, there are two distinct classes (m = 2).
C1 represents the class Yes, and C2 corresponds to No. There are six samples of class
Yes and eight samples of class No. To compute the information gain of each attribute,
we use equation 1 to determine the expected information needed to classify a given
sample:
I(Loan Approval) = I(C1, C2) = I(6, 8) = −6/14 log2(6/14) − 8/14 log2(8/14)
= −(0.4285 × −1.2226) − (0.5714 × −0.8074)
= 0.5239 + 0.4613 = 0.9852
Next, compute the entropy of each attribute—Age, Purchase Frequency, and Credit
Rating.
For each attribute, we need to look at the distribution of Yes and No and compute the
information for each distribution. Let's start with the Purchase Frequency attribute. The
first step is to calculate the entropy of each Purchase Frequency category as follows:
For Purchase Frequency = High, C11 = 3 and C21 = 1 (the first subscript indexes the
class, 1 for Yes and 2 for No; the second indexes the attribute category).

I(C11, C21) = I(3, 1) = −3/4 log2(3/4) − 1/4 log2(1/4)
= −(0.75 × −0.415) − (0.25 × −2)
= 0.3113 + 0.5 = 0.8113
For Purchase Frequency = Medium:

I(C12, C22) = I(1, 5) = −1/6 log2(1/6) − 5/6 log2(5/6)
= −(0.1667 × −2.585) − (0.8333 × −0.2630)
= 0.4308 + 0.2192 = 0.6500
For Purchase Frequency = Low:

I(C13, C23) = I(2, 2) = −(0.5 × −1) − (0.5 × −1) = 1
Using equation 2:

E(Purchase Frequency) = 4/14 × I(C11, C21) + 6/14 × I(C12, C22) + 4/14 × I(C13, C23)
= 4/14 × 0.8113 + 6/14 × 0.6500 + 4/14 × 1 = 0.7961
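These calculations are easy to check in R; the helper function I_info below is introduced here for illustration:

```r
# Expected information for a vector of class counts (equation 1)
I_info <- function(counts) {
  p <- counts / sum(counts)
  -sum(p * log2(p))
}

# Overall information requirement for 6 Yes / 8 No
I_total <- I_info(c(6, 8))

# Entropy of Purchase Frequency (equation 2):
# High has (3 Yes, 1 No), Medium (1, 5), Low (2, 2)
E_pf <- (4/14) * I_info(c(3, 1)) +
        (6/14) * I_info(c(1, 5)) +
        (4/14) * I_info(c(2, 2))

# Information gain of Purchase Frequency (equation 3)
gain_pf <- I_total - E_pf
```

This gives I_total ≈ 0.9852, E_pf ≈ 0.7961, and a gain of about 0.1891 for Purchase Frequency.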
In this example, Credit Rating has the highest information gain, so it is used as the
root node, and branches are grown for each of its values. The next tree node is based
on the remaining two attributes, Age and Purchase Frequency. Both have almost the
same information gain, so either can be used as the split node for a branch; we have
taken Age. The rest of the branches are partitioned with the remaining samples, as
shown in Figure 6-4. For Age < 35, the decision is clear, whereas for the other Age
category, the Purchase Frequency attribute has to be examined before making the
loan-approval decision. This involves calculating the information gain for the rest of
the samples and identifying the next split.
[Figure 6-4. The decision tree: the root node Credit Rating branches into OK, Fair, and Excellent; each branch splits on Age (<35, >35) and then on Purchase Frequency (High vs. Low/Medium), ending in Yes/No leaves]
Overfitting and underfitting are two important factors that could impact the
performance of machine-learning models. Overfitting occurs when the model performs
well with training data and poorly with test data. Underfitting occurs when the model is
so simple that it performs poorly with both training and test data.
If we have too many features, the model may fit the training data too well, capturing
the noise in the data, and then perform poorly on test data. Such a model is too closely
fitted to the training data and tends to have high variance. Hence the generalization
error is higher, and we say that the data has been overfit.
When the model does not capture and fit the data, it results in poor performance.
We call this underfitting. Underfitting is the result of a poor model that typically does not
perform well for any data.
One common way to evaluate classification performance is a contingency table
(confusion matrix). As an example, let's say your new model predicts whether investors
will fund your project. The training data gives the results shown in Table 6-3.
Table 6-3. Training Data

                 Actual
                 1     0
Predicted   1    80    20
            0    15    85
The accuracy of the model on the training data is quite high (165 of 200 records, or
82.5 percent, are classified correctly), so the model appears to be a good one. However,
when we test this model against the test data, the results are as shown in Table 6-4, and
the number of errors is much higher.
Table 6-4. Test Data

                 Actual
                 1     0
Predicted   1    30    70
            0    80    20
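The accuracies implied by Tables 6-3 and 6-4 can be computed directly in R:

```r
# Confusion matrices from Tables 6-3 and 6-4 (rows = predicted, cols = actual)
train_cm <- matrix(c(80, 20,
                     15, 85), nrow = 2, byrow = TRUE)
test_cm  <- matrix(c(30, 70,
                     80, 20), nrow = 2, byrow = TRUE)

# Accuracy = correctly classified records / total records
accuracy <- function(cm) sum(diag(cm)) / sum(cm)

train_acc <- accuracy(train_cm)   # 165/200 = 0.825
test_acc  <- accuracy(test_cm)    #  50/200 = 0.25
```

The drop from 82.5 percent training accuracy to 25 percent test accuracy is the signature of overfitting.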
In this case, the model works well on training data but not on test data. This false
confidence, as the model was generated from the training data, will probably cause you to
take on far more risk than you otherwise would and leaves you in a vulnerable situation.
The best way to avoid overfitting is to test the model on data completely outside
the scope of your training data, that is, on unseen data. This gives you confidence
that the model will generalize to representative, real-world production data. In
addition, it is always good practice to revalidate the model periodically to determine
whether it is degrading or needs improvement, and to make sure it is still
accomplishing your business objectives.
[Figure 6-6. Bias-variance: panels comparing predicted values against actual values]
6.4.1 K-Nearest Neighbor
The k-nearest neighbor (K-NN) classifier is based on learning from numeric attributes
in an n-dimensional space. All of the training samples are stored as points in an
n-dimensional pattern space. When a new sample is given, the K-NN classifier searches
the pattern space for the k training samples that are closest to the new sample (its
k nearest neighbors) and labels the class accordingly. The "closeness" is defined in
terms of Euclidean distance, where the Euclidean distance between two points
X = (x1, x2, x3, …, xn) and Y = (y1, y2, y3, …, yn) is defined as follows:

d(X, Y) = √( ∑(i=1 to n) (xi − yi)² )
The idea is to look in the training records for the records that are similar to, or "near,"
the record to be classified, that is, records whose attribute values are close to
X = (x1, x2, x3, …). The unknown sample is then assigned the class that is most
frequent among its k nearest neighbors in the pattern space.
[Figure 6-8. K-NN classification: computing the distance from a test record to the training records]
Figure 6-8 shows a simple example of how K-NN works. To classify a new record, it
finds the nearest matching records and assigns their class. For example, if it walks like a
duck and quacks like a duck, then it's probably a duck.
K-nearest neighbor does not assume any relationship among the predictors (X)
and class (Y). Instead, it draws the conclusion of class based on the similarity measures
between predictors and records in the data set. Though there are many potential
measures, K-NN uses Euclidean distance between the records to find the similarities
to label the class. Please note that the predictor variables should be standardized to a
common scale before computing the Euclidean distances and classifying.
After computing the distances, we need to choose k, the number of neighbors to
consult. A higher value of k reduces the risk of overfitting due to noise in the training
set, but too high a value can blur the class boundaries. Ideally, we balance the value of
k so that the misclassification error is minimized. In practice, values of k between 2
and 10 are tried; for each value, we compute the misclassification error and pick the k
that gives the minimum error.
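A sketch of K-NN in R with the class package, on the built-in iris data; note that the predictors are standardized first, as recommended above:

```r
# Sketch: K-NN classification with the class package.
library(class)

set.seed(42)
idx <- sample(nrow(iris), round(0.7 * nrow(iris)))   # 70% training rows

# Standardize predictors to a common scale before computing distances
X     <- scale(iris[, 1:4])
train <- X[idx, ]
test  <- X[-idx, ]
cl    <- iris$Species[idx]   # known classes of the training records

pred <- knn(train, test, cl, k = 5)          # classify each test record
err  <- mean(pred != iris$Species[-idx])     # misclassification error
```

In practice you would repeat the last two lines for several values of k and keep the one with the smallest error.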
6.4.2 Random Forests
A decision tree is based on a set of true/false decision rules, and the prediction is based
on the tree rules for each terminal node. This is similar to a tree with a set of nodes
(corresponding to true/false questions), each of which has two branches, depending
on the answer to the question. A decision tree for a small set of sample training data
encounters the overfitting problem. In contrast, the Random Forests model is well suited
to handle small sample size problems.
Random Forests creates multiple deep decision trees, trained on different parts of the
same training set, and averages them out. The objective of the random forest is to
overcome the overfitting problem of individual trees. In other words, a random forest
is an ensemble method that builds many decision trees at training time.
A random forest consists of multiple decision trees (the more trees, the better).
Randomness is in selecting the random training subset from the training set. This method
is called bootstrap aggregating or bagging, and this is done to reduce overfitting by
stabilizing predictions. This method is used in many other machine-learning algorithms,
not just in Random Forests.
The other type of randomness is in selecting variables randomly from the set of
variables, which means different trees are based on different sets of variables. For
example, in our preceding example, one tree may use only Purchase Frequency and
Credit Rating, and another tree may use all three variables. But in a forest, all the trees
still influence the overall prediction made by the random forest.
Random Forests have low bias, and by adding more trees, we reduce variance and thus
overfitting. This is one of the advantages of Random Forests and a reason for their
growing popularity. Random Forests models are relatively robust to the set of input
variables and often require little preprocessing of the data. Research has shown that
they are more efficient to build than other models, such as SVMs.
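A sketch with the randomForest package on the built-in iris data; the ntree and mtry values here are illustrative, not prescriptive:

```r
# Sketch: a random forest classifier with the randomForest package.
library(randomForest)

set.seed(7)
rf <- randomForest(Species ~ ., data = iris,
                   ntree = 500,   # number of trees in the forest
                   mtry  = 2)     # variables randomly sampled at each split

# The out-of-bag (OOB) error estimates generalization error
# without needing a separate test set
rf$err.rate[rf$ntree, "OOB"]
```

Each tree is fit on a bootstrap sample (bagging), and mtry controls the per-split variable randomness described above.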
Table 6-5 lists the various types of classifiers and their advantages and disadvantages.
Table 6-5. (continued)

Sl No | Classification Method | Advantages | Disadvantages
2 | Decision tree | Simple rules to understand and easy to comprehend. Does not require any domain knowledge. The steps involved in learning and classification are simple and fast. Requires no transformation of variables or variable selection to split a tree branch. | Building a decision tree can be relatively expensive, and further pruning adds computational time. Requires a large data set to construct a classifier with good performance.
3 | Nearest neighbor | Simple, with no parametric assumptions. Performs well for large training sets. | Takes time to find the nearest neighbors. Reduced performance for a large number of predictors.
4 | Random Forests | Performs well for small and large data sets. Balances bias and variance and provides better performance. More efficient to build than other advanced models, such as nonlinear SVMs. | If a categorical variable has multiple levels, Random Forests are biased toward the variable with more levels.
Once you understand the business problem, the very first step is to read the data
set. Data is stored in CSV format. You will read the data. The next step is to understand
the characteristics of the data to see whether it needs any transformation of variable
types, handling of missing values, and so forth. After this, the data set is partitioned into
a training set and a test set. The training set is used to build the model, and the test set is
used to test the model performance. You’ll use a decision tree to build the classification
model. After the model is successfully developed, the next step is to understand the
performance of the model. If the model meets customer requirements, you can deploy
the model. Otherwise, you can go back and fine-tune the model. Finally, the results are
reported and the model is deployed in the real world.
The data set is in CSV format and stored in the grades.csv file. Load the tree library
and read the file as follows:
Explore the data to understand the characteristics of the data. The R functions are
shown here:
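A minimal sketch of these two steps, assuming grades.csv sits in the working directory with quiz-score columns and a final Grade column:

```r
# Load the tree library and read the grades data set
library(tree)
grades <- read.csv("grades.csv")

# Understand the characteristics of the data
str(grades)       # variable names and types
summary(grades)   # ranges, quartiles, and missing values
head(grades)      # first few records
```

If Grade is read in as character text, convert it with grades$Grade <- factor(grades$Grade) before modeling, since tree() expects a factor response for classification.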
The next step is to partition the data set into a training set and a test set. There are 240
records. Let’s use 70 percent as the training data set and 30 percent as the test data set:
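A minimal sketch of the split; a synthetic stand-in data frame is used here so the snippet runs on its own, whereas the text uses the frame read from grades.csv:

```r
# A stand-in data frame with 240 rows (the real data comes from grades.csv)
set.seed(123)
grades <- data.frame(
  Quiz1 = runif(240, 0, 100),
  Grade = factor(sample(c("A", "B", "C"), 240, replace = TRUE))
)

# 70 percent training (168 records), 30 percent test (72 records)
train_idx <- sample(1:240, size = round(0.7 * 240))
train <- grades[train_idx, ]
test  <- grades[-train_idx, ]
```

set.seed() makes the random split reproducible; omit it if you want a fresh partition each run.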
Build the decision-tree model by using the tree() function. Many functions have been
contributed to R by various communities; in this example, we use the popular tree
package. You are free to try other packages by referring to the appropriate
documentation.
The summary of the model shows that residual deviance is 0.5621, and 13.69 percent
is the misclassification error. Now, plot the tree structure, as shown in Figure 6-9.
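A sketch of this step with the tree package; a synthetic stand-in for the grades data is generated here, so the exact splits and error rates will differ from the text:

```r
# Sketch: fit and plot a classification tree with tree().
library(tree)

# Synthetic stand-in for the 240-record grades data set
set.seed(1)
grades <- data.frame(
  Quiz1 = runif(240, 0, 100), Quiz2 = runif(240, 0, 100),
  Quiz3 = runif(240, 0, 100), Quiz4 = runif(240, 0, 100)
)
grades$Grade <- factor(ifelse(grades$Quiz1 + grades$Quiz3 > 110, "A",
                       ifelse(grades$Quiz2 > 50, "B", "C")))

mytree <- tree(Grade ~ ., data = grades)
summary(mytree)   # residual mean deviance and misclassification rate

plot(mytree)      # draw the tree structure
text(mytree)      # label the splits and leaves
```

summary() is where the residual deviance and misclassification error quoted in the text come from.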
[Figure 6-9. The fitted tree, with splits such as Quiz3 < 25.25 and Quiz1 < 9.25 and grade-label leaves]
Once the model is ready, test the model with the test data set. We already partitioned
the test data set. This test data set was not part of the training set. This will give you an
indication of how well the model is performing and also whether it is overfit or underfit.
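A sketch of the evaluation step, again on a synthetic stand-in for the grades data:

```r
# Sketch: evaluate the fitted tree on the held-out test set.
library(tree)

set.seed(1)
grades <- data.frame(
  Quiz1 = runif(240, 0, 100), Quiz2 = runif(240, 0, 100),
  Quiz3 = runif(240, 0, 100), Quiz4 = runif(240, 0, 100)
)
grades$Grade <- factor(ifelse(grades$Quiz1 + grades$Quiz3 > 110, "A",
                       ifelse(grades$Quiz2 > 50, "B", "C")))
train_idx <- sample(1:240, round(0.7 * 240))
train <- grades[train_idx, ]
test  <- grades[-train_idx, ]

mytree <- tree(Grade ~ ., data = train)
pred   <- predict(mytree, newdata = test, type = "class")

table(Predicted = pred, Actual = test$Grade)   # confusion matrix
mean(pred != test$Grade)                       # misclassification error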
As you can see, the mean misclassification error is 19.71 percent. The model still
seems to perform reasonably well, even on test data it has never seen before. Let's try
to improve the model performance by pruning the tree, again working with the
training set.
[Plot of prune_mytree$dev (deviance, roughly 40 to 90) against prune_mytree$size (tree size, 2 to 14)]
Plotting the deviance against the size of the tree shows that the minimum error is at
size 7. Let's prune the tree at size 7 and recalculate the prediction performance (see
Figure 6-11). You have to repeat all the previous steps.
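A sketch of the pruning step with cv.tree() and prune.misclass(), again on stand-in data; the text prunes at size 7, and the min() guard below only protects against a tree with fewer leaves:

```r
# Sketch: cost-complexity pruning of a classification tree.
library(tree)

set.seed(1)
grades <- data.frame(
  Quiz1 = runif(240, 0, 100), Quiz2 = runif(240, 0, 100),
  Quiz3 = runif(240, 0, 100), Quiz4 = runif(240, 0, 100)
)
grades$Grade <- factor(ifelse(grades$Quiz1 + grades$Quiz3 > 110, "A",
                       ifelse(grades$Quiz2 > 50, "B", "C")))
mytree <- tree(Grade ~ ., data = grades)

set.seed(2)
cv_mytree <- cv.tree(mytree, FUN = prune.misclass)   # CV error by tree size
plot(cv_mytree$size, cv_mytree$dev, type = "b")      # deviance vs. size

# Prune to 7 terminal nodes (or fewer, if the tree is smaller)
best_size    <- min(7, sum(mytree$frame$var == "<leaf>"))
prune_mytree <- prune.misclass(mytree, best = best_size)
summary(prune_mytree)
```

After pruning, rerun the prediction step on the test set to compare misclassification errors, as the text does.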
[Figure 6-11. The pruned tree, with splits such as Quiz3 < 22.25, Quiz1 < 9.25, and Quiz4 < 97.5]
For the pruned tree, the misclassification error is 15.48 percent, and the residual
mean deviance is 0.8708. The pruned model is simpler, but its misclassification error
is a little higher than that of the full tree model. Hence, pruning did not improve the
performance of the model in this case. The next step is to carry out the process called
k-fold cross-validation. The process is as follows:
1. Split the data set into k folds. The suggested value is k = 10.
2. For each fold, build your model on the other k – 1 folds and test the model's effectiveness on the left-out fold.
3. Record the prediction errors.
4. Repeat the steps k times so that each of the k folds serves once as the test set.
5. The average of the errors recorded over the k iterations is called the cross-validation error, and this is the performance metric of the model.
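The steps above can be sketched as a generic fold loop in R, here with the tree classifier on the built-in iris data:

```r
# Sketch: k-fold cross-validation of a classification tree.
library(tree)

k <- 10
set.seed(99)
# Step 1: assign each record to one of k folds
folds <- sample(rep(1:k, length.out = nrow(iris)))

errors <- numeric(k)
for (i in 1:k) {
  # Step 2: fit on k - 1 folds, test on the left-out fold
  train <- iris[folds != i, ]
  test  <- iris[folds == i, ]
  fit   <- tree(Species ~ ., data = train)
  pred  <- predict(fit, newdata = test, type = "class")
  # Step 3: record the prediction error for this fold
  errors[i] <- mean(pred != test$Species)
}

# Step 5: the cross-validation error is the average over the k folds
cv_error <- mean(errors)
```

The loop itself performs step 4, since each fold is left out exactly once.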
6.6 Summary
The chapter explained the fundamental concepts of supervised machine learning and the
differences between the classification model and prediction model. You learned why the
classification model also falls under prediction.
You learned the fundamentals of the probabilistic classification model using naïve
Bayes. You also saw Bayes’ theorem used for classification and for predicting classes via
an example in R.
This chapter described the decision-tree model, how to build the decision tree, how
to select the decision tree root, and how to split the tree. You saw examples of building the
decision tree, pruning the tree, and measuring the performance of the classification model.
You also learned about the bias-variance concept with respect to overfitting and
underfitting.
Finally, you explored how to create a classification model, understand the measure
of performance of the model, and improve the model performance, using R.
CHAPTER 7

Unsupervised Machine Learning
7.1 Clustering - Overview
Cluster analysis is performed on data to identify hidden groups or segments.
The objective of clustering is to enable meaningful analysis in ways that help the
business; clustering can uncover previously undetected relationships in a data set.
For example, in marketing, cluster analysis can be used for market segmentation:
customers are segmented based on demographics and transaction history so that a
marketing strategy can be formulated. Another example is identifying groups who
purchase similar products. Similarly, grouping people into clusters based on lifestyle
and consumer expenditure allows cluster analysis to estimate the potential demand
for products and services and thus helps formulate business and marketing
strategies.
[Figure: a clustering algorithm transforms raw data into clusters of data]
Nielsen (and earlier, Claritas) were pioneers in cluster analysis. Through its
segmentation solution, Nielsen helped customize demographic data to understand
geography based on region, state, zip code, neighborhood, and block. This has helped the
company to come up with effective naming and differentiation of groups such as movers
and shakers, fast-track families, football-watching beer aficionados, and casual, sweet
palate drinkers. (www.nielsen.com/us/en/solutions/segmentation.html).
In a Human Resources (HR) department, cluster analysis can be used to identify
employee skills, performance, and attrition. Furthermore, you can cluster based on
interests, demographics, gender, and salary to help a business to act on HR-related issues
such as relocating, improving performance, or hiring the properly skilled labor force for
forthcoming projects.
In finance, cluster analysis can help create risk-based portfolios based on various
characteristics such as returns, volatility, and P/E ratio. Selecting stocks from different
clusters can create a balanced portfolio based on risks. Similarly, clusters can be created
based on revenues and growth, market capital, products and solutions, and global
presence. These clusters can help a business understand how to position in the market.
Similarly, in software life-cycle management, you can group projects by the
effectiveness of the development process, based on defects and process metrics.
Likewise, you can group newspaper articles on the Web by topics such as sports,
science, or politics.
The purpose of cluster analysis is to segregate data into groups that help you better
understand the overall picture. This idea has been applied in many areas, including
archaeology, astronomy, science, education, medicine, psychology, and sociology. In
biology, scientists have made extensive use of classes and subclasses to organize various
species.
Next, you will look at the methods and techniques involved in cluster analysis as well
as its challenges. You’ll also learn how to perform cluster analysis on a given data set.
7.2 What Is Clustering?
In statistics, cluster analysis is performed on data to gain insights that help you
understand the characteristics and distribution of data. Unlike in classification, clustering
does not rely on predefined class labels or training examples. Conventional clustering is
based on the similarity measures of geometric distance. There are two general methods
of clustering for a data set of n records:
Hierarchical method: There are two types of algorithms. The agglomerative
algorithm begins with n clusters and sequentially merges the most similar clusters
until a single cluster is formed. The divisive algorithm is the opposite: it starts with
a single cluster and then divides it into multiple clusters based on dissimilarities.

Nonhierarchical method: In this method, the number of clusters is specified up
front, and the method assigns records to each cluster. Since this method is simple
and computationally less expensive, it is the preferred method for very large data
sets.
How do we measure closeness or similarities between clusters? Numerous measures
can be used. The following section describes some of the measures that are common to
both types of clustering algorithms.
Eij = √( (xi1 − xj1)² + (xi2 − xj2)² + (xi3 − xj3)² + … + (xip − xjp)² )    (1)
If you assign a weight wm to each variable, then a weighted Euclidean distance can be
calculated as follows:

Eij = √( w1(xi1 − xj1)² + w2(xi2 − xj2)² + … + wp(xip − xjp)² )    (2)
Another well-known measure is the Manhattan (or city block) distance, defined as
follows:

Mij = |xi1 − xj1| + |xi2 − xj2| + |xi3 − xj3| + … + |xip − xjp|    (3)
Both Euclidean distance and Manhattan distance should satisfy the following
mathematical requirements of a distance function:
Eij ≥ 0 and Mij ≥ 0: The distance is a non-negative number.
Eii = 0 and Mii = 0: The distance from an object to itself is 0.
Eij = Eji and Mij = Mji: The distance is a symmetric function.
R(a, b) = ∑(i=1 to n) (xia − ma)(xib − mb) / √( ∑(i=1 to n) (xia − ma)² × ∑(i=1 to n) (xib − mb)² )
Here, xia is the value of variable a for the ith object, xib is the value of variable b for
the ith object, and ma and mb are the respective mean values of a and b.
Having variables with a high positive correlation means that their dissimilarity
coefficient is close to 0, and the variables are very similar. Similarly, a strong negative
correlation is assigned a dissimilarity coefficient close to 1, and the variables are very
dissimilar.
Minkowski distance represents the generalization of Euclidean and Manhattan
distance. It is defined as follows:

Min(i, j) = ( |xi1 − xj1|^q + |xi2 − xj2|^q + |xi3 − xj3|^q + … + |xip − xjp|^q )^(1/q)

It represents the Manhattan distance if q = 1 and the Euclidean distance if q = 2.
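All three distances are available in R through the dist() function; a quick check on the points (0, 0) and (3, 4):

```r
# Two points in 2-D space
pts <- rbind(c(0, 0), c(3, 4))

d_euc <- dist(pts, method = "euclidean")          # sqrt(3^2 + 4^2) = 5
d_man <- dist(pts, method = "manhattan")          # |3| + |4| = 7
d_min <- dist(pts, method = "minkowski", p = 2)   # q = 2 reduces to Euclidean
```

For a data frame, dist() returns the full matrix of pairwise distances between records, which is the usual input to the clustering functions discussed next.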
For two binary variables, the records can be cross-tabulated as follows:

                    Variable 1
                    1        0        Sum
Variable 2    1     x        y        x + y
              0     a        b        a + b

Here, x is the number of records with category 1 for both variables, and b is the
number of records with category 0 for both. Similarly, y and a count the records in
which the two variables disagree. The total number of records is p = a + b + x + y.
A well-known measure of the distance between the two variables is the simple
matching coefficient, defined as follows:

d(i, j) = (a + y) / (a + b + x + y)
For variables of mixed types, the dissimilarity d(i, j) between records i and j is a
weighted average of the per-variable contributions:

d(i, j) = ∑(m=1 to p) wijm × dijm / ∑(m=1 to p) wijm

where dijm is the contribution of variable m and wijm is its weight.
1. If variable m is binary or nominal, dijm = 0 if xim = xjm; otherwise, dijm = 1.
2. If variable m is interval-based, the contribution is the normalized absolute difference:

dijm = |xim − xjm| / ( max(xhm) − min(xhm) )

where the maximum and minimum are taken over all records h for variable m (the corresponding similarity is 1 minus this quantity).
7.2.4.4 Centroid Distance
The centroid is the vector of averages of each variable across all the records in that cluster. For example, in a cluster containing m records, the centroid is

X̄ = ( (1/m) Σ(i=1..m) X1i , … , (1/m) Σ(i=1..m) Xpi )
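In R, the centroid of a cluster is simply the column-wise mean of its records; the data below are made up for illustration:

```r
# Records of one cluster: rows are objects, columns are the p variables.
cluster_C <- matrix(c(2, 4,
                      4, 8,
                      6, 12), ncol = 2, byrow = TRUE)

# The centroid is the per-variable average across all records in the cluster.
centroid <- colMeans(cluster_C)
centroid  # 4 8
```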
7.3 Hierarchical Clustering
Hierarchical clusters have a predetermined ordering from top to bottom. For example,
an organizational chart or all the files and folders on the hard disk in your computer are
organized in a hierarchy.
Hierarchical clustering starts with every object in the data set as a separate cluster. Then, in each iteration, a cluster agglomerates (merges) with the closest cluster, based on distance and similarity criteria, until all the data forms one cluster. The hierarchical agglomerative clustering algorithm is as follows:
1. Start with n clusters; each record in the data set can be a cluster by itself.

2. The two closest observations are merged into a single cluster. The closeness is the similarity measure.

3. Step 2 repeats until a single cluster is formed. At every step, the two clusters with the smallest distance measure are merged, until all records and clusters are combined to form a single cluster. A hierarchy of clusters is created.
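The steps above can be sketched with base R's hclust() function; the data here are simulated, not the book's example:

```r
# Simulated two-variable data; each row starts as its own cluster (step 1).
set.seed(7)
df <- data.frame(x = c(rnorm(5, mean = 0), rnorm(5, mean = 10)),
                 y = c(rnorm(5, mean = 0), rnorm(5, mean = 10)))

d   <- dist(scale(df))                # pairwise Euclidean distances
fit <- hclust(d, method = "average")  # steps 2-3: repeatedly merge the two closest clusters

fit$merge   # which clusters were merged at each of the n - 1 steps
fit$height  # the distance at which each merge happened
```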
7.3.1 Dendrograms
A dendrogram demonstrates how clusters are merged in a hierarchy. It is a tree-like structure that summarizes the process of clustering and shows the hierarchy pictorially. Similar records are joined by lines whose vertical height reflects the distance measure between the two records. Figure 7-5 shows an example of a dendrogram.
7.4 Nonhierarchical Clustering
In nonhierarchical clustering, a desired number of clusters, k, is prespecified, and you assign each case to one of the clusters so as to minimize the dispersion within the clusters. The goal is to divide the sample into the predetermined k clusters
so that the clusters are as homogeneous as possible with respect to the metrics used.
The algorithm intends to partition n objects into k clusters with the nearest mean. The
end result is to produce k different clusters with clear distinctions. This is again an
unsupervised, numerical, and iterative machine-learning method, and each cluster always contains at least one item. The objective of k-means clustering is to minimize the total intracluster variance, or the squared error function:

E = Σ(j=1..k) Σ(xi ∈ Cj) ||xi – mj||^2

where xi is the point in space representing a given object and mj is the mean of cluster Cj. The algorithm works well when the records are well separated from one another.
7.4.1 K-Means Algorithm
The k-means algorithm for clustering is as follows:

1. Select k. It can be 1, 2, 3, or any other number.

2. Select k points at random as cluster centroids.

3. Start assigning objects to their closest cluster, based on the Euclidean distance measurement.

4. Calculate the centroid of all objects in each cluster.

5. Check the distance of each data point to the centroid of its own cluster. If it is closest, leave it as is. If not, move it to the next closest cluster.

6. Repeat the preceding steps until all the data points are covered and no data point moves from one cluster to another (the clusters are stable).
The following example demonstrates the k-means algorithm.
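Since the book's example appears as screenshots, here is a minimal sketch of the same idea using base R's kmeans() function on made-up data:

```r
# Two well-separated groups of points (simulated for illustration).
set.seed(42)
pts <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),   # 10 points near (0, 0)
             matrix(rnorm(20, mean = 5), ncol = 2))   # 10 points near (5, 5)

km <- kmeans(pts, centers = 2, nstart = 20)

km$cluster       # cluster number assigned to each point
km$centers       # the two final centroids
km$tot.withinss  # the total intracluster variance being minimized
```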
As you can see, there are no outliers in our data as suggested by the IQR method.
Please note that the dist() function takes the data set (for scaled data, the scaled data set is used) and method = "euclidean" as arguments. However, euclidean is the default method, so we can omit method = "euclidean" in the preceding code. We would use the method = "manhattan" option for a Manhattan distance calculation. Note that method = "binary" computes the asymmetric binary (Jaccard-style) distance, which considers only the positions where at least one of the two vectors is 1, rather than the Hamming distance.
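A small sketch of the dist() options, with made-up data:

```r
# Two observations with three numeric variables each.
m <- rbind(c(0, 3, 4),
           c(4, 0, 4))

dist(m)                        # Euclidean (the default): sqrt(4^2 + 3^2 + 0^2) = 5
dist(m, method = "manhattan")  # |0-4| + |3-0| + |4-4| = 7

# For 0/1 data, method = "binary" uses only the positions where at least
# one of the two rows is 1 (an asymmetric, Jaccard-style distance).
dist(rbind(c(1, 0, 1), c(1, 1, 0)), method = "binary")  # 2 mismatches / 3
```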
Let’s now try out both the average and centroid methods:
This R code shows that ten criteria from the NbClust package suggest that the best number of clusters is three. We have assumed a maximum of five clusters, as we do not want only one or two data points in each cluster.
The preceding NbClust() command also produces the corresponding plots shown
in Figure 7-6.
Figure 7-6. Plots generated by the NbClust() command from the NbClust package,
depicting the best number of clusters
As you can see, the first plot shows a steep drop in Dindex values from two to three clusters; beyond that, the Dindex values decrease slowly as the number of clusters increases. This suggests three clusters as the best option. The second plot also clearly shows three clusters as the best option, as the second-differences Dindex value is highest in this case.
We can now generate a dendrogram by using the plot() function as follows:
[Cluster dendrogram plotted with plot(): hclust(*, "centroid") on dist_among_observ; the y-axis shows Height, and the leaves are labeled with the rent values of the observations]
This dendrogram also shows three clusters. Now let’s superimpose rectangles on the
plot generated by using the rect.hclust() function:
[Cluster dendrogram with rectangles drawn by rect.hclust(): hclust(*, "centroid") on dist_among_observ; the y-axis shows Height, and the leaves are labeled with the rent values of the observations]
Figure 7-8. Rectangles superimposed on the three clusters to easily differentiate among them
Now, based on the best number of clusters determined in the preceding code, we get
these clusters with the corresponding data as follows:
The cutree() function cuts the observations into the number of clusters based on the cluster_fit we arrived at previously. The numbers are the cluster numbers. As you can
see in the preceding code, the same observations are classified into each cluster when
using both the average and centroid methods. But, this isn’t always the case. We can also
see that there are ten data observations in cluster number 1, eight data observations in
cluster number 2, and two data observations in cluster number 3. The 1st, 2nd, 3rd, 4th, 5th,
7th, 9th, 10th, 16th, and 20th data observations belong to cluster 1. The 6th, 8th, 11th, 12th, 13th,
17th, 18th, and 19th data observations fall in cluster 2. The 14th and 15th data observations
fall in cluster 3.
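The mechanics of cutree() can be sketched as follows (simulated data, not the book's rental data set):

```r
# Three simulated groups of observations.
set.seed(1)
df <- rbind(matrix(rnorm(10, mean = 0),  ncol = 2),
            matrix(rnorm(10, mean = 6),  ncol = 2),
            matrix(rnorm(10, mean = 12), ncol = 2))

fit    <- hclust(dist(df), method = "average")
groups <- cutree(fit, k = 3)  # cluster number for each of the 15 observations

table(groups)                 # how many observations fall in each cluster
```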
Let’s now interpret and validate the clusters by manually visiting our base data:
As you can see, this data suggests that high rent (cluster 2—$1,800 and above) is expected when the distances to the airport and to the university are very short (less than 12 km from the airport and less than 13 km from the university) and the distance to downtown is greater (16 km and more). In contrast, very low rents (cluster 1) show long distances from both the airport and the university but are <= 15 km from downtown. The third cluster shows lower rent when the distances from both the airport and the university are long, along with a greater distance from downtown. From this, we can see that the clustering of the data is appropriate:
We use the aggregate() function to determine the median value of each cluster. The
preceding code clearly supports the analysis we made previously. We use this median as
it makes more sense than the mean because of the rounded values of rent.
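The aggregate() call can be sketched on hypothetical data (the column names below are our own, not the book's):

```r
# Hypothetical clustered rental data (cluster labels as produced by cutree()).
rentals <- data.frame(
  rent    = c(1000, 1200, 1100, 1800, 2000, 1900),
  cluster = c(1, 1, 1, 2, 2, 2)
)

# Median rent per cluster; the median is preferred here because of the rounded rents.
aggregate(rent ~ cluster, data = rentals, FUN = median)
```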
Now let’s group (or cluster) our data by using the partition cluster approach.
In order to find out the optimal clusters or the best number of clusters, we can use
the same NbClust() function, but with the method kmeans. Here you can see how to use
this function in R and the resulting output:
In practice, an nstart value of 20 to 30 works well in most cases. This is simply the number of initial configurations the algorithm tries before reporting the best one.
We will now use the aggregate() function with the median to determine the median
value of each cluster. The resultant output from R is shown here:
7.6 Association Rule
Another important unsupervised machine-learning concept is association-rule analysis,
also called affinity analysis or market-basket analysis (MBA). This type of analysis is often
used to find out “what item goes with what item,” and is predominantly used in the study
of customer transaction databases. Association rules provide a simple analysis indicating that when an event occurs, another event occurs with a certain probability. Discovering
relationships among a huge number of transactional database records can help in better
marketing, inventory management, product promotions, launching new products, and
other business decision processes. Association rules indicate relationships by using
simple if-then rule structures computed from the data that are probabilistic in nature.
The classic example is in retail marketing. If a retail department wants to find out
which items are frequently purchased together, it can use association-rule analysis. This
helps the store manage inventory, offer promotions, and introduce new products. This
market-basket analysis also helps retailers plan ahead for sales and know which items
to promote with a reduced price. For example, this type of analysis can indicate whether
customers who purchase a mobile phone also purchase a screen guard or phone cover,
or whether a customer buys milk and bread together (see Figure 7-9). Then those stores
can promote the phone cover or can offer a new bakery bread at a promotional price for
the purchase of milk. These offers might encourage customers to buy a new product at a
reduced price.
[Figure 7-9. Shopping baskets]
Several algorithms can generate the if-then association rules, but the classic one is
the Apriori algorithm of Agrawal and Srikant (1994). The algorithm is simple. It begins by
generating frequent-item sets with just one item (a one-item set) and then generates a
two-item set with two items frequently purchased together, and then moves on to
three-item sets with three items frequently purchased together, and so on, until all the
frequent-item sets are generated. Once the list of all frequent-item sets is generated, you
can find out how many of those frequent-item sets are in the database. For example,
how many two-item sets, how many three-item sets, and so forth. In general, generating
n-item sets uses the frequent n – 1 item sets and requires a complete run through the
database once. Therefore, the Apriori algorithm is faster, even for a large database with
many unique items. The key idea is to begin generating frequent-item sets with just one
item (a one-item set) and then recursively generate two-item sets, then three-item sets,
and so on, until we have generated frequent-item sets of all sizes.
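The first two passes of this idea can be sketched in base R (our own illustration; the transaction contents are made up):

```r
# Toy transaction database.
transactions <- list(
  c("bread", "jam", "milk"),
  c("bread", "jam"),
  c("milk",  "eggs"),
  c("bread", "jam", "eggs"),
  c("milk")
)
min_support <- 2  # minimum number of transactions an item set must appear in

# Pass 1: count one-item sets and keep the frequent ones.
item_counts <- table(unlist(transactions))
frequent1   <- names(item_counts[item_counts >= min_support])

# Pass 2: two-item candidates are built only from frequent single items,
# then counted with one run through the database.
pairs  <- combn(sort(frequent1), 2, simplify = FALSE)
counts <- sapply(pairs, function(p)
  sum(sapply(transactions, function(t) all(p %in% t))))
frequent2 <- pairs[counts >= min_support]

frequent1
frequent2  # {bread, jam} appears in 3 transactions
```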
7.6.1 Choosing Rules
Once we generate the rules, the goal is to find the rules that indicate a strong association
between the items, and indicate dependencies between the antecedent (previous item)
and the consequent (next item) in the set. Three measures are used: support, confidence,
and lift ratios.
For example, support for the two-item set {bread, jam} in the data set is 5 out of
a total of 10 records, which is (5/10) = 50 percent. You can define the support number
and ignore the other item sets from your analysis. If support is very low, it is not worth
examining.
Confidence(A --> B) is the ratio of the support for A & B (i.e., the antecedent and consequent together) to the support for A. It is expressed as the ratio of the number of transactions that contain A & B together to the number of transactions that contain A:
conf(A → B) = [numTrans(A ∪ B) / |D|] / [numTrans(A) / |D|] = p(A ∩ B) / p(A) = p(B | A)
7.6.1.2 Lift
Though support and confidence are good measures of the strength of an association rule, they can sometimes be deceptive. For example, if the antecedent or the consequent has a high support, we can have a high confidence even though the two are independent. A better measure compares the confidence of the association rule with the confidence expected if the occurrence of the consequent item in a transaction were independent of the occurrence of the antecedent:
lift(A → B) = conf(A → B) / p(B) = [p(A ∩ B) / p(A)] / p(B) = p(A ∩ B) / [p(A) p(B)]
In other words,
Lift(A --> B) = Support(A & B) / [Support(A) x Support(B)]
The following example (Figure 7-10) demonstrates the three measures: support, confidence, and lift. For the following transactions, we calculate the support, confidence, and lift ratios:
Transaction 1: shirt, pant, tie, belt
Transaction 2: shirt, belt, tie, shoe
Transaction 3: socks, tie, shirt, jacket
Transaction 4: pant, tie, belt, blazer
Transaction 5: pant, tie, hat, sweater
Let’s calculate support, confidence and lift for the above example using the
definition. For A --> B,
Support(A & B) = Freq(A & B) / N (where N is the total number of transactions in
database)
Confidence(A -->B) = Support(A & B) / Support(A) = Freq(A & B) / Freq(A)
Lift(A -->B) = Support(A & B) / [Support(A) x Support (B)]
socks --> shirt: support = 1/5 = 0.2; confidence = 1/1 = 1; lift = (1/5) / [(1/5) × (3/5)] = 5/3 = 1.67
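These numbers can be checked in base R directly from the definitions, using the five transactions listed above (a sketch, not the book's code):

```r
transactions <- list(
  c("shirt", "pant", "tie", "belt"),
  c("shirt", "belt", "tie", "shoe"),
  c("socks", "tie", "shirt", "jacket"),
  c("pant", "tie", "belt", "blazer"),
  c("pant", "tie", "hat", "sweater")
)
N <- length(transactions)

# Number of transactions containing all the given items.
freq <- function(items)
  sum(sapply(transactions, function(t) all(items %in% t)))

support_AB <- freq(c("socks", "shirt")) / N               # 1/5 = 0.2
confidence <- freq(c("socks", "shirt")) / freq("socks")   # 1/1 = 1
lift       <- support_AB / ((freq("socks") / N) * (freq("shirt") / N))
lift  # 5/3, approximately 1.67
```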
The Apriori passes can be traced on a small database D of four transactions, with a minimum support of 50 percent (that is, an item set must appear in at least 2 of the 4 transactions):

Database D:

TID   Items
200   1 2 3 4
201   2 3 4
202   2 3 5
203   1 2

One-item sets with support counts: {1}: 2, {2}: 4, {3}: 3, {4}: 2, {5}: 1. The frequent one-item sets are {1}, {2}, {3}, and {4}.

Two-item candidates with support counts: {1 2}: 2, {1 3}: 1, {2 3}: 3, {2 4}: 2, {2 5}: 1, {3 4}: 2, {3 5}: 1. The frequent two-item sets are {1 2}, {2 3}, {2 4}, and {3 4}.

Three-item candidates: {2 3 5}, {2 3 4}, {1 2 3}, {1 3 4}. Only {2 3 4}, with a support count of 2, meets the minimum support, so it is the frequent three-item set.
7.6.3 Interpreting Results
Once you generate the frequent-item sets, it is useful to look at different measures such
as support, confidence, and lift ratios. The support gives you an indication of overall
transactions and how they affect the item sets. If you have only a small number of
transactions with minimum support, the rule may be ignored. The lift ratio provides
the strength of the consequent in a random selection. But the confidence gives the
rate at which a consequent can be found in the database. A low confidence indicates a low consequent rate, which helps in deciding whether promoting the consequent is a worthwhile exercise. The more records, the better the conclusion. Finally, the more distinct the rules
that are considered, the better the interpretation and outcome. We recommend looking
at the rules from a top-down approach that can be reviewed by humans rather than
automating the decision by searching thousands of rules.
7.7 Summary
In this chapter, you saw that clustering is an unsupervised technique used to perform
data analysis. It is also part of exploratory analysis, for understanding data and its
properties. It can be used to identify any outliers in the data. However, primarily it is used
for identifying hidden groups in the data set.
Association rules find interesting associations among large transactional item sets
in the database. You learned how to perform clustering analysis, techniques used for
performing the clustering, and the concepts of association-rule mining.
CHAPTER 8

Simple Linear Regression
8.1 Introduction
Imagine you are a business investor and want to invest in startup ventures which are
likely to be highly profitable. What are all the factors you look for in the company you
want to invest in? Maybe the innovativeness of the products of the startups, maybe the
past success records of the promoters. In this case, we say the profitability of the venture
is dependent on or associated with innovativeness of the products and past success
records of the promoters. The innovativeness of the product itself may be associated with
various aspects or factors like usefulness of the product, competition in the market, and
so on. These factors may be a little difficult to gauge, but the past success record of the
promoters can be easily found from the available market data. If the promoter had started
ten ventures and eight were successful, we can say that 80% is the success rate of the
promoter.
Imagine you want to increase the sales of the products of your organization. You may
want to get more sales personnel, you may have to have a presence in or the capability
to service more territories or markets, you may require more marketing efforts in these
markets, and so on. All these aspects or factors are associated with the quantum of sales
or impact the quantum of sales. Imagine the attrition of the employees at any company
or industry. There are various factors like work environment of the organization,
compensation and benefits structure of the organization, how well known the company
is in the industry or market, and so forth. Work environment may be how conducive the
internal environment is for people to use their thinking, how much guidance or help the
seniors in the organization provide, and the current technological or product landscape
of the organization (e.g., whether or not you are working on the latest technology). It may
be even the overall satisfaction level of the employees. Compensation and benefits structure may include salary structure (that is, how good the salaries are compared to those in other similar organizations or other organizations in the industry), whether there are bonuses or additional incentives for higher performance, whether there are additional perquisites, and so on. To drive home
the point, there may be multiple factors that influence a particular outcome or impact a
particular outcome or that are associated with a particular outcome. Again, each one of
these may in turn be associated with or influenced by other factors. For example, salary
structure may influence the work environment or satisfaction levels of the employees.
Imagine you are a property developer as well as a builder. You are planning to build a huge shopping mall. The prices of various inputs required, like cement, steel, sand, pipes, and so on, vary a lot on a day-to-day basis. If you have to decide on the sale price of the shopping mall or the rent you need to charge for the individual shops, you need to understand the likely cost of building. For this you may have to take into consideration
how over a period of time the costs of these inputs (cement, steel, sand, etc.) have varied
and what factors influence the price of each of these in the market.
You may want to estimate the profitability of the company, arrive at the best possible
cost of manufacturing of a product, estimate the quantum of increase in sales, estimate
the attrition of the company so that you can plan well for recruitment, decide on the likely
cost of the shopping mall you are building, or decide on the rent you need to charge per square foot or square meter. In all these cases you need to understand the association
or relationship of these with the ones that influence, decide, or impact them. The
relationship between two factors is normally explained in statistics through correlation or,
to be precise, coefficient of correlation (i.e., R) or coefficient of determination (i.e., R2).
A regression equation depicts the relationship between a response variable, or dependent variable, and the corresponding independent variables. This means that the value of the dependent variable can be predicted based on the values of the independent variables. When there is a single independent variable, the regression is called simple regression. When there are multiple independent variables, the regression is called multiple regression. Again, regressions can be of two types based on the
relationship between the response variable and the independent variables (i.e., linear
regression or non-linear regression). In the case of linear regression the relationship
between the response variable and the independent variables is explained through a
straight line and in the case of non-linear relationship the relationship between the
response variable and independent variables is non-linear (polynomial like quadratic,
cubic, etc.).
Normally we may find a linear relationship between the price of the house and the
area of the house. We may also see a linear relationship between salary and experience.
However, if we take the relationship between rain and the production of grains, the
production of the grains may increase with moderate to good rain but then decrease if
the rain is more than good rain and becomes extreme rain. In this case, the relationship
between quantum of rain and the food grains production is normally non-linear; initially
food grain production increases and then reduces.
Regression is a supervised method as we know both the exact values of the response
(i.e., dependent) variable and the corresponding values of the independent variables.
This is the basis for the establishment of the model. This basis or model is then used
for predicting the values of the response variable where we know the values of the
independent variable and want to understand the likely value of the response variable.
8.2 Correlation
As described in earlier chapters, correlation explains the relationship between two variables. This may or may not be a cause-and-effect relationship. However, variation in one variable can be explained with the help of the other variable when we know the relationship between
Chapter 8 ■ Simple Linear Regression
two variables over a range of values (i.e., when we know the correlation between two
variables). Typically the relationship between two variables is depicted through a scatter
plot as explained in earlier chapters.
Attrition is related to the employee satisfaction index. This means that “attrition”
is correlated with “employee satisfaction index.” Normally, the lower the employee
satisfaction, the higher the attrition. This means that attrition is inversely correlated with
employee satisfaction. In other words, attrition has a negative correlation with employee
satisfaction or is negatively associated with employee satisfaction.
Normally the profitability of an organization is likely to grow with the sales quantum. This means the higher the sales, the higher the profits. The lower the sales,
the lower the profits. Here, the relationship is that of positive correlation as profitability
increases with the increase in sales quantum and decreases with the decrease in sales
quantum. Here, we can say that the profitability is positively associated with the sales
quantum.
Normally, the lesser the defects in a product or the higher the speed of response
related to issues, the higher will be the customer satisfaction of any company. Here,
customer satisfaction is inversely related to defects in the product or negatively correlated
with the defects in the product. However, the same customer satisfaction is directly
related to or positively correlated with the speed of response.
Correlation explains the extent of change in one of the variables given the unit
change in the value of another variable. Correlation assumes a very significant role in
statistics and hence in the field of business analytics as any business cannot make any
decision without understanding the relationship between various forces acting in favor of
or against it.
Strong association or correlation between two variables enables us to better predict
the value of the response variable from the value of the independent variable. However,
the weak association or low correlation between two variables does not help us to predict
the value of the response variable from the value of the independent variable.
8.2.1 Correlation Coefficient
The correlation coefficient is an important statistical parameter of interest that gives us a numerical indication of the relationship between two variables. It is useful only in the case of a linear association between the variables; it is not useful in the case of non-linear associations between the variables.
It is very easy to compute the correlation coefficient. In order to compute the same
we require the following:
• Average of all the values of the independent variable
• Average of all the values of the dependent variable
• Standard deviation of all the values of the independent variable
• Standard deviation of all the values of the dependent variable
189
Chapter 8 ■ Simple Linear Regression
Once we have the foregoing, we need to convert each value of each variable into
standard units. This is done as follows:
• (Each value minus the average of the variable) / (Standard
Deviation of the variable) (i.e., [variable value – mean(variable)]
/ sd(variable)). For example, if a particular value among the
values of the independent variable is 18 and the mean of this
independent variable is 15 and the standard deviation of this
independent variable is 3, then the value of this independent
variable converted into standard units will be = (18 – 15)/3 = 1.
This is also known as z-score of the variable.
Once we have converted each value of each variable into standard units, the
correlation coefficient (normally depicted as ‘r’ or ‘R’) is calculated as follows:
•	Average of [(independent variable in standard units) x (dependent variable in standard units)] (i.e., mean[(z-score of x) * (z-score of y)])
The correlation coefficient can also be found using the following formula:

R = covariance(independent variable, dependent variable) / [(Standard Deviation of the independent variable) x (Standard Deviation of the dependent variable)]

In the above, covariance = [sum of (the value of each independent variable minus the average of the independent variable values) * (the value of each dependent variable minus the average of the dependent variable values)] divided by [n minus 1].
In the R programming language, the calculation of the correlation coefficient is very simple. The calculation of the correlation coefficient in R is shown in Figure 8-1A, and the corresponding scatter plot is shown in Figure 8-1B:
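Since Figures 8-1A and 8-1B are screenshots, here is a sketch of the same calculation on made-up satisfaction/attrition data. Note that because R's sd() uses n − 1, the "average" of the products of the standard units is taken with n − 1 in the denominator:

```r
satisfaction <- c(10, 8, 6, 4, 2)   # hypothetical employee satisfaction index
attrition    <- c(2, 5, 9, 14, 20)  # hypothetical attrition percentage

# Convert each value into standard units (z-scores).
z_x <- (satisfaction - mean(satisfaction)) / sd(satisfaction)
z_y <- (attrition - mean(attrition)) / sd(attrition)

# "Average" of the products of the standard units, with n - 1 in the
# denominator to match R's sample standard deviation.
r_manual <- sum(z_x * z_y) / (length(z_x) - 1)

r_manual
cor(satisfaction, attrition)  # agrees with r_manual; both are negative here
```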
Figure 8-1B. Scatter plot between Employee Satisfaction Index and Attrition
As can be seen from the scatter plot in Figure 8-1B, even though the relationship is
not linear it is near linear. The same is shown by the correlation coefficient of -0.983. As
you can see, the negative sign indicates the inverse association or negative association
between attrition percentage and employee satisfaction index. The above plot shows that
the deterioration in the employee satisfaction leads to an increased rate of attrition.
Further, the test shown in Figure 8-2 confirms that there is a strong, statistically significant correlation between attrition and the employee satisfaction index:
Please note, the previous data is illustrative only and may not be representative of a
real scenario. It is used for the purpose of illustrating the correlation. Further, in the case
of extreme values (outliers) and associations like non-linear associations, the correlation
coefficient may be very low and may depict no relationship or association. However, there
may be real and good association among the variables.
8.3 Hypothesis Testing
At this point in time it is apt for us to briefly touch upon hypothesis testing. This is one
of the important aspects in statistics. In hypothesis testing we start with an assertion or
claim or status quo about a particular population parameter of one or more populations.
This assertion or claim or status quo is known as “null hypothesis” or H0. An example
of the null hypothesis may be a statement like the following: the population mean of
population 1 is equal to population mean of population 2. There is also another statement
known as the alternate hypothesis, or H1, which is opposite to the null hypothesis. In our
example the alternate hypothesis specifies that there is significant difference between
population mean of population 1 and the population mean of population 2. A level of significance, or Type I error, of normally 0.05 is specified. This is nothing but the probability that the null hypothesis is rejected when it is actually true. It is represented by the symbol α. The smaller the value of α, the smaller the risk of a Type I error.
Then we decide the sample size required to reduce the errors.
We use test statistics to either reject the null hypothesis or not reject the null hypothesis. When we reject the null hypothesis, it means that the alternate hypothesis is true. However, failing to reject the null hypothesis does not mean that the null hypothesis is true; it only shows that we do not have sufficient evidence to reject the null hypothesis. Normally the t-value is the test statistic used.
Then we use the data and arrive at the sample value of the test statistic. We then calculate the p-value on the basis of the test statistic. The p-value is the probability of obtaining a test statistic at least as extreme as the sample value when the null hypothesis is true. We then compare the p-value with the level of significance
(i.e., α). If the p-value is less than the level of significance then the null hypothesis is
rejected. This also means that the alternate hypothesis is accepted. If the p-value is
greater than or equal to the level of significance then we cannot reject the null hypothesis.
The p-value is used (among many other uses in the field of statistics) to validate
the significance of the parameters to the model in the case of regression analysis. If the
p-value of any parameter in the regression model is less than the level of significance
(typically 0.05), then we reject the null hypothesis that there is no significant contribution
of the parameter to the model and we accept the alternate hypothesis that there is
significant contribution of the parameter to the model. If p-value of a parameter is greater
than or equal to the level of significance then we cannot reject the null hypothesis that
there is no significant contribution of the parameter to the model. We include in the final
model only those parameters that have significance to the model.
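As a sketch of the two-population-mean example (simulated data; the book does not provide this code), R's built-in t.test() function computes the test statistic and p-value:

```r
# Simulated samples from two populations with different means.
set.seed(123)
sample1 <- rnorm(30, mean = 50, sd = 5)  # sample from population 1
sample2 <- rnorm(30, mean = 55, sd = 5)  # sample from population 2

# H0: the two population means are equal; H1: they differ.
result <- t.test(sample1, sample2)

result$p.value         # compare this against alpha = 0.05
result$p.value < 0.05  # TRUE means we reject H0 at the 5% level
```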
8.4.1 Assumptions of Regression
There are four assumptions of regression. These need to be fulfilled if we are to rely upon any regression equation. They are
• Linear association between the dependent variable and the
independent variable
• Independence of the errors around the regression line between
the actual and predicted values of the response variable
• Normality of the distribution of errors
• Equal variance of the distribution of the response variable for
each level of the independent variable. This is also known as
homoscedasticity.
Yi = β0 + β1xi
In the foregoing equation, β0 is known as the intercept and β1 is known as the slope of the regression line. The intercept is the value of the response variable when the value of the independent variable (i.e., x) is zero. This depicts the point at which the regression line touches the y-axis. The slope can be calculated easily using the following formula: (R x Standard Deviation of the response variable) / (Standard Deviation of the independent variable).
From the foregoing you can see that when the value of the independent variable
increases by one standard deviation, the value of the response variable increases by R x
one standard deviation of the response variable, where R is the coefficient of correlation.
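This slope formula can be verified numerically against lm() on made-up data:

```r
x <- c(1, 2, 3, 4, 5)            # independent variable (made up)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)  # response variable (made up)

slope_formula <- cor(x, y) * sd(y) / sd(x)  # R x SD(y) / SD(x)
fit <- lm(y ~ x)

slope_formula
coef(fit)["x"]  # matches slope_formula
```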
Figure 8-3. Creating a data frame from a text file (data for the examples)
In Figure 8-3, we have imported a table of data containing 21 records with the Sales_Effort and the Product_Sales from a file named cust1.txt into a data frame named cust_df. Sales_Effort is the number of hours of effort put in by the salesperson during the first two weeks of a month, and Product_Sales is the number of sales closed by the salesperson during the same period. The summary of the data is also shown in the figure.
In this data we can treat Product_Sales as the response variable and Sales_Effort as
the independent variable as the product sales depend upon the sales effort put in place
by the salespersons.
We will now run the simple linear regression to model the relationship between
Product_Sales and Sales_Effort using the lm(response variable ~ independent
variable, data = dataframe name) command of R:
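The command pattern above can be sketched as follows. Because the book's cust1.txt file is not reproduced here, the data frame below is a small hypothetical stand-in with the same column names:

```r
# Hypothetical stand-in for the cust1.txt data (the book's file has 21 records)
cust_df <- data.frame(
  Sales_Effort  = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
  Product_Sales = c(1.0, 2.1, 3.0, 4.0, 5.0, 5.9, 7.0, 8.0, 8.9, 10.0)
)

# Fit the simple linear regression model and inspect its summary
mod_simp_reg <- lm(Product_Sales ~ Sales_Effort, data = cust_df)
summary(mod_simp_reg)
coef(mod_simp_reg)  # beta0 (intercept) and beta1 (slope)
```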
Figure 8-4 provides the command run in R to generate the simple linear regression model as well as the summary of the model. The model arrived at is named mod_simp_reg, and the summary command presents the details of the model.
The initial part shows which element of the data frame is regressed against which other element, and the name of the data frame that contained the data used to arrive at the model.
Residuals depict the difference between the actual value of the response variable and the value of the response variable predicted using the regression equation. The maximum residual is shown as 0.20779. The spread of the residuals is summarized here through the min, max, median, Q1, and Q3 values; in this case the spread is from -0.16988 to +0.20779. As the principle behind the regression line and regression equation is to reduce this error, the expectation is that the median residual should be very near 0. As you can see here, the median value is -0.01818, which is almost equal to 0. The prediction error can be as large as the maximum residual; as this value (i.e., 0.20779) is very small, we can accept these residuals.
The next section specifies the coefficient details. Here β0 is given by the intercept
estimate (i.e., 0.0500779) and β1 is given by Sales_Effort estimate (0.0984054). Hence,
the simple linear regression equation is as follows:
Product_Salesi = 0.0500779 + 0.0984054 Sales_Efforti
The value next to each coefficient estimate is the standard error of the estimate, which specifies the uncertainty of the estimate. Then comes the "t" value, which specifies how large the coefficient estimate is relative to that uncertainty. The next value is the probability that a value of absolute(t) at least this large arises purely by chance. Ideally this "Pr" or probability value, popularly known as the "p-value," should be very small (such as 0.001, 0.005, 0.01, or 0.05) for the relationship between the response variable and the independent variable to be significant. As the p-value of the coefficient of Sales_Effort here is very small (i.e., <2e-16, almost 0), we reject the null hypothesis that there is no significance of the parameter to the model and accept the alternate hypothesis that there is significance of the parameter to the model. Hence, we conclude that there is a significant relationship between the response variable Product_Sales and the independent variable Sales_Effort. The number of asterisks (*) next to the p-value of each parameter specifies the level of significance. Please refer to "Signif. codes" in the model summary as given in Figure 8-4.
The next section shows the overall model quality-related statistics. Among these,
• The degrees of freedom specified here is the number of rows of data minus the number of coefficients. In our case it is 21 - 2 = 19. This is the residual degrees of freedom. Ideally, the number of degrees of freedom should be large compared to the number of coefficients, to avoid overfitting the data to the model. We have dealt with overfitting and underfitting of the data to the model in one of the chapters; let us remember for the time being that overfitting is not good. As a rule of thumb, 30 rows of data for each variable is considered good for the training sample. Further, we cannot use the ordinary least squares method if the number of rows of data is less than the number of independent variables.
• The residual standard error is the square root of the sum of the squared residuals divided by the residual degrees of freedom (in our case 19), as specified in the summary. This is 0.1086, which is very low, as we require.
• The multiple R-squared value shown here is the square of the correlation coefficient (i.e., R); it is also known as the coefficient of determination. The adjusted R-squared value is the R-squared value adjusted for the number of predictors, to guard against overfitting; hence we rely more on the adjusted R-squared value than on the multiple R-squared. The value of adjusted R-squared is 0.9987, which is very high and indicates an excellent relationship between the response variable Product_Sales and the independent variable Sales_Effort.
8.4.4.1 Test of Linearity
In order to test the linearity, we plot the residuals against the corresponding values of the
independent variable. Figure 8-5 depicts this.
Figure 8-5. Residuals vs. Fitted plot for the model lm(Product_Sales ~ Sales_Effort)
For the model to pass the test of linearity we should not have any pattern in the
distribution of the residuals and they should be randomly placed around the 0.0 residual
line. That is, the residuals will be randomly varying around the mean of the value of the
response variable. In our case, as we can see there are no patterns in the distribution of
the residuals. Hence, it passes the condition of linearity.
In the case of Durbin-Watson test the null hypothesis (i.e., H0) is that there is no
autocorrelation and the alternative hypothesis (i.e., H1) is that there is autocorrelation.
If p-value is < 0.05 then we reject the null hypothesis—that is, we conclude that there is
autocorrelation. In the foregoing case the p-value is greater than 0.05 and it means that
there is no evidence to reject the null hypothesis that there is no autocorrelation. Hence,
the test of independence of errors around the regression line passes. Alternatively for this
test you can use durbinWatsonTest() function from library(car).
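As a sketch of that alternative, assuming the car package is installed and using a hypothetical stand-in for the cust1.txt data:

```r
# Hypothetical stand-in data and the fitted model from earlier
cust_df <- data.frame(
  Sales_Effort  = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
  Product_Sales = c(1.0, 2.1, 3.0, 4.0, 5.0, 5.9, 7.0, 8.0, 8.9, 10.0)
)
mod_simp_reg <- lm(Product_Sales ~ Sales_Effort, data = cust_df)

# Durbin-Watson test: H0 = no autocorrelation of the residuals.
# A p-value greater than 0.05 means H0 cannot be rejected.
if (requireNamespace("car", quietly = TRUE)) {
  print(car::durbinWatsonTest(mod_simp_reg))
}
```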
8.4.4.3 Test of Normality
As per this test the residuals should be normally distributed. In order to check on this we
will look at the Normal Q-Q plot (created using the plot(model name) command):
Figure 8-7. Test for assumption of “normality” using normal Q-Q plot
Figure 8-7 shows that the residuals are almost on the straight line in the foregoing
Normal Q-Q plot. This shows that the residuals are normally distributed. Hence, the
normality test of the residuals is passed.
In Figure 8-8 we have given the output of gvlma() function from R on our model.
The Global Stat line clearly shows that the assumptions related to this regression model
are acceptable. Here, we need to check for whether the p-value is greater than 0.05. If
the p-value is greater than 0.05 then we can safely conclude as shown above that the
assumptions are validated. If we have p-value less than 0.05 then we need to revisit the
regression model.
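A minimal sketch of this global check, assuming the gvlma package is installed (the data frame is again a hypothetical stand-in for cust1.txt):

```r
# Hypothetical stand-in data and the fitted simple linear regression model
cust_df <- data.frame(
  Sales_Effort  = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
  Product_Sales = c(1.0, 2.1, 3.0, 4.0, 5.0, 5.9, 7.0, 8.0, 8.9, 10.0)
)
mod_simp_reg <- lm(Product_Sales ~ Sales_Effort, data = cust_df)

# Global validation of the linear model assumptions:
# look for a Global Stat p-value greater than 0.05
if (requireNamespace("gvlma", quietly = TRUE)) {
  summary(gvlma::gvlma(mod_simp_reg))
}
```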
Figure 8-9. Scale-Location plot for the model lm(Product_Sales ~ Sales_Effort)
In Figure 8-9, as the points are spread in a random fashion around the near-horizontal line, we are assured that the assumption of constant variance (homoscedasticity) holds.
Figure 8-10. Plot using crPlots() function to validate the linearity assumption
8.4.5 Conclusion
As seen above, the simple linear regression model fitted using the R function lm(response variable ~ independent variable, data = dataframe name), representing the simple linear regression equation Product_Salesi = 0.0500779 + 0.0984054 Sales_Efforti, is a good model, as it also passes the tests that validate the assumptions of regression. A note of caution here: there are various ways the regression equation may be created and validated.
Figure 8-11. Prediction using the model on the new data set
This is done using the function predict(model name, newdata) where model
name is the name of the model arrived at from the input data and newdata contains the
independent variable data for which the response variable has to be predicted.
A prediction interval, which specifies the range of the distribution of the prediction, can be arrived at as shown in Figure 8-12 by adding the parameter interval = "prediction" to the predict() function. By default, this uses a confidence level of 0.95.
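The two predict() calls can be sketched together as follows; the new Sales_Effort values are hypothetical, as is the stand-in data frame:

```r
# Hypothetical stand-in data and the fitted model
cust_df <- data.frame(
  Sales_Effort  = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
  Product_Sales = c(1.0, 2.1, 3.0, 4.0, 5.0, 5.9, 7.0, 8.0, 8.9, 10.0)
)
mod_simp_reg <- lm(Product_Sales ~ Sales_Effort, data = cust_df)

# New sales-effort values for which product sales are to be predicted
new_data <- data.frame(Sales_Effort = c(55, 85))

# Point predictions
predict(mod_simp_reg, newdata = new_data)

# Prediction interval (defaults to a 0.95 level)
predict(mod_simp_reg, newdata = new_data, interval = "prediction")
```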
8.4.7 Additional Notes
It may be observed from the model fitted above that the intercept is not zero but 0.0500779, whereas ideally there should not be any sales when there is no sales effort. In practice, it may not be so; some walk-in sales are possible through other means such as advertisements, web sites, and so on. Similarly, there cannot be a partial product sale such as 3.1; however, the sales effort put in would have partially moved the salesperson toward the next potential sale. If we are interested in arriving at the model without an intercept (i.e., no product sales when there is no sales effort), we can do so as shown in Figure 8-13 by forcing the intercept to zero.
However, if we are to believe and use this model, we have to validate the fulfillment of the other assumptions of regression for it as well.
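In R, the intercept is forced to zero by adding "- 1" (or, equivalently, "+ 0") to the model formula. A sketch with the hypothetical stand-in data:

```r
# Hypothetical stand-in for cust1.txt
cust_df <- data.frame(
  Sales_Effort  = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
  Product_Sales = c(1.0, 2.1, 3.0, 4.0, 5.0, 5.9, 7.0, 8.0, 8.9, 10.0)
)

# "- 1" in the formula forces the regression line through the origin
mod_no_int <- lm(Product_Sales ~ Sales_Effort - 1, data = cust_df)
summary(mod_no_int)
coef(mod_no_int)  # only the slope remains; there is no intercept term
```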
8.5 Chapter Summary
• You went through some examples as to how the relationship
between various aspects/factors influence or decide or impact
other aspects/factors. Understanding these relationships helps us
not only to understand what can happen to the other associated
factors but also to predict the value of others. You understood
how the regression model or regression equation explains the
relationship between a response variable and the independent
variable(s). You also understood about the linear or non-linear
relationship as well as simple regression and multiple regression.
• You explored the concept of correlation with examples. You
explored the uses of correlation, strong correlation, positive
correlation, negative correlation, and so on. You also understood
how to calculate the correlation coefficient (R).
CHAPTER 9
In Chapter 8, you explored simple linear regression, which depicts the relationship
between the response variable and one predictor. You saw that the expectation is that the
response variable is a continuous variable that is normally distributed. If the response
variable is a discrete variable, you use a different regression method. If the response
variable can take values such as yes/no or multiple discrete variables (for example, views
such as strongly agree, agree, partially agree, and do not agree), you use logistic regression.
You will explore logistic regression in the next chapter. When you have more than one
predictor—say, two predictors or three predictors or n predictors (with n not equal to
1)—the regression between the response variable and the predictors is known as multiple
regression, and the linear relationship between them is expressed as multiple linear
regression or a multiple linear regression equation. In this chapter, you will see examples of
situations in which many factors affect one response, outcome, or dependent variable.
Imagine that you want to construct a building. Your main cost components are
the cost of labor and the cost of materials including cement and steel. Your profitability
is positively impacted if the costs of cement, steel, and other materials decrease while
keeping the cost of labor constant. Instead, if the costs of materials increase, your
profitability is negatively impacted while keeping the cost of labor constant. Your
profitability will further decrease if the cost of labor also increases.
In the market, it is possible for one price to go up or down on its own, or for all the prices to move in the same direction. Suppose the real estate industry is very hot, and there are
lots of takers for the houses, apartments, or business buildings. Then, if there is more
demand for the materials and the supply decreases, the prices of these materials are likely
to increase. If the demand decreases for the houses, apartments, or business buildings,
the prices of these materials are likely to decrease as well (because of the demand being
less than the supply).
Now let’s presume that the selling prices are quite fixed because of the competition,
and hence the profitability is decided and driven primarily by the cost or cost control.
We can now collect data related to the cost of cement, steel, and other materials, as well
as the cost of labor as predictors or independent variables, and profitability (in percent)
as the response variable. Such a relationship can be expressed through a multiple linear
regression model or multiple linear regression equation.
In this example, suppose we find that the relationship of the cost of other materials
(one of the predictors) to the response variable is dependent on the cost of the cement.
Then we say that there is a significant interaction between the cost of other materials
and the cost of the cement. We include the interaction term cost of other materials:cost
of cement in the formula for generating the multiple linear regression model while also
including all the predictors. Thus our multiple linear regression model is built using the
predictors cost of cement, cost of steel, cost of other materials, and the interaction term, cost
of other materials: cost of cement vs. the profitability as the response variable.
Now imagine that you are a Human Resources (HR) manager or head. You know
that the compensation to be paid to an employee depends on her qualifications,
experience, and skill level, as well as the availability of other people with that particular
skill set vs. the demand. In this case, compensation is the response variable, and the
other parameters are the independent variables, or the predictor variables. Typically, the
higher the experience and higher the skill levels, the lower the availability of people as
compared to the demand, and the higher the compensation should be. The skill levels
and the availability of those particular skills in the market may significantly impact the
compensation, whereas the qualifications may not impact compensation as much as the
skill levels and the availability of those particular skills in the market.
In this case, there may be a possible relationship between experience and skill level;
ideally, more experience means a higher skill level. However, a candidate could have a
low skill level in a particular skill while having an overall high level of experience—in
which case, experience might not have a strong relationship with skill level. This feature
of having a high correlation between two or more predictors themselves is known
as multicollinearity and needs to be considered when arriving at the multiple linear
regression model and the multiple linear regression equation.
Understanding the interactions between the predictors as well as multicollinearity is
very important in ensuring that we get a correct and useful multiple regression model.
When we have the model generated, it is necessary to validate it on all four assumptions of
regression:
• Linearity between the response variable and the predictors (also
known as independent variables)
• Independence of residuals
• Normality of the distribution of the residuals
• Homoscedasticity, an assumption of equal variance of the errors
The starting point for building any multiple linear regression model is to get our data
in a data-frame format, as this is the requirement of the lm() function. The expectation
when using the lm() function is that the response variable data is distributed normally.
However, independent variables are not required to be normally distributed. Predictors
can contain factors.
Multiple regression modeling may be used to model the relationship between a
response variable and two or more predictor variables to n number of predictor variables
(say, 100 or more variables). The more features that have a relationship with the response
variable, the more complicated the modeling will be. For example, a person’s health,
if quantified through a health index, might be affected by the quality of that person’s
environment (pollution, stress, relationships, and water quality), the quality of that
person’s lifestyle (smoking, drinking, eating, and sleeping habits), and genetics (history
of the health of the parents). These factors may have to be taken into consideration to
understand the health index of the person.
Chapter 9 ■ Multiple Linear Regression
9.1.1 The Data
To demonstrate multiple linear regression, we have created data with three variables: Annual Salary, Experience in Years, and Skill Level. These are Indian salaries, but for the sake of convenience we have converted them into US dollars and rounded them to thousands. Further, we have not restricted users from assessing and assigning skill levels in decimal points; hence, in the context of this data, even Skill Level is represented as continuous data. This makes sense, as in an organization with hundreds of employees it is not fair to categorize all of them into, say, five buckets; it is better to differentiate them with skill levels such as 4.5, 3.5, 2.5, 1.5, and 0.5. In this data set, all the variables are continuous variables.
Here, we import the data from the CSV file sal1.txt to the data frame sal_data_1:
If you use the head() and tail() functions on the data, you will get an idea of what
the data looks like, as shown here:
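Since the book's sal1.txt file is not reproduced here, the sketch below builds a small hypothetical data frame using the column names that appear later in the chapter (Annu_Salary, Expe_Yrs, Skill_lev) and inspects it with head() and tail():

```r
# Hypothetical stand-in for sal1.txt (the book's data set has 48 records)
sal_data_1 <- data.frame(
  Annu_Salary = c(2010, 3240, 4510, 6740, 8020, 10240, 11510, 13760, 14990, 17010),
  Expe_Yrs    = c(0.5, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0),
  Skill_lev   = c(0.5, 1.5, 1.0, 2.5, 2.0, 3.5, 3.0, 4.5, 4.0, 5.0)
)

head(sal_data_1)  # first six records
tail(sal_data_1)  # last six records
```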
Please note that we have not shown all the data, as the data set has 48 records. In
addition, this data is illustrative only and may not be representative of a real scenario. The
data is collected at a certain point in time.
9.1.2 Correlation
We explained correlation in Chapter 8. Correlation specifies the way that one variable
relates to another variable. This is easily done in R by using the cor() function.
The correlation between these three variables (Annual Salary, Experience in Years,
Skill Level) is shown here:
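With the data in a data frame, the correlation matrix is a single call. The values below will differ from the book's quoted figures, since this data frame is a hypothetical stand-in:

```r
# Hypothetical stand-in for sal_data_1
sal_data_1 <- data.frame(
  Annu_Salary = c(2010, 3240, 4510, 6740, 8020, 10240, 11510, 13760, 14990, 17010),
  Expe_Yrs    = c(0.5, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0),
  Skill_lev   = c(0.5, 1.5, 1.0, 2.5, 2.0, 3.5, 3.0, 4.5, 4.0, 5.0)
)

round(cor(sal_data_1), 4)  # pairwise correlations among all three variables
```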
As you can see, there is a very high correlation of about 0.9888 between Annual
Salary and Experience in Years. Similarly, there is a very high correlation of about 0.9414
between Annual Salary and Skill Level. Also, there is a very high correlation of about
0.8923 between Experience in Years and Skill Level. Each variable is highly correlated to
the other variable.
The relationship between two pairs of variables is generated visually by using the R
command, as shown here:
Here, we use the caret package and the featurePlot() function. The response
variable is plotted as y, and the predictor variables are plotted as x.
The visual realization of the relationship between the variables generated through
this command is shown in Figure 9-1.
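A sketch of the plotting call, guarded in case caret is not installed (base pairs() gives a comparable scatter-plot matrix); the data frame is the hypothetical stand-in used above:

```r
# Hypothetical stand-in for sal_data_1
sal_data_1 <- data.frame(
  Annu_Salary = c(2010, 3240, 4510, 6740, 8020, 10240, 11510, 13760, 14990, 17010),
  Expe_Yrs    = c(0.5, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0),
  Skill_lev   = c(0.5, 1.5, 1.0, 2.5, 2.0, 3.5, 3.0, 4.5, 4.0, 5.0)
)

# Scatter-plot matrix of the predictors (x) against the response (y)
if (requireNamespace("caret", quietly = TRUE)) {
  print(caret::featurePlot(x = sal_data_1[, c("Expe_Yrs", "Skill_lev")],
                           y = sal_data_1$Annu_Salary))
} else {
  pairs(sal_data_1)  # base-R fallback
}
```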
Figure 9-1. Scatter plot matrix of the response variable y (Annual Salary) and the predictors Expe_Yrs and Skill_lev
The model created using the lm() function is shown here along with the summary—
generated using summary(model name):
You can see in this summary of the multiple regression model that both independent variables are significant to the model, as the p-value for each is less than 0.05; the overall model p-value is also less than 0.05. The adjusted R-squared value of 99.48 percent indicates that the model explains 99.48 percent of the variance in the response variable. Finally, the residuals are spread around a median value of -23.8, which is very close to 0 given the scale of the salaries.
You can explore the individual aspects of this model by using specific R commands.
You can use fitted(model name) to understand the values fitted using the model.
You can use residuals(model name) to understand the residuals for each value of
Annual Salary fitted vs. the actual Annual Salary as per the data used. You can use
coefficients(model name) to get the details of the coefficients (which is part of the
summary data of the model shown previously). The following shows the use of these
commands in R:
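A sketch of these three inspection commands on a model fitted to the hypothetical stand-in data:

```r
# Hypothetical stand-in for sal_data_1 and the fitted model
sal_data_1 <- data.frame(
  Annu_Salary = c(2010, 3240, 4510, 6740, 8020, 10240, 11510, 13760, 14990, 17010),
  Expe_Yrs    = c(0.5, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0),
  Skill_lev   = c(0.5, 1.5, 1.0, 2.5, 2.0, 3.5, 3.0, 4.5, 4.0, 5.0)
)
sal_model_1 <- lm(Annu_Salary ~ Expe_Yrs + Skill_lev, data = sal_data_1)

fitted(sal_model_1)        # Annual Salary values predicted for each record
residuals(sal_model_1)     # actual minus fitted, per record
coefficients(sal_model_1)  # intercept plus one coefficient per predictor
```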
Residuals vs. Fitted plot for the model lm(Annu_Salary ~ Expe_Yrs + Skill_lev)
Normal Q-Q plot for the model lm(Annu_Salary ~ Expe_Yrs + Skill_lev)
Scale-Location plot for the model lm(Annu_Salary ~ Expe_Yrs + Skill_lev)
Residuals vs. Leverage plot (with Cook's distance) for the model lm(Annu_Salary ~ Expe_Yrs + Skill_lev)
Here the residuals vs. fitted plot seems to show that the residuals are spread
randomly around the dashed line at 0. If the response variable is linearly related to the
predictor variables, there should be no relationship between the fitted values and the
residuals. The residuals should be randomly distributed. Even though there seems to
be a pattern, in this case we know clearly from the data that there is a linear relationship
between the response variable and the predictors. This is also shown through high
correlation between the response variable and each predictor variable. Hence, we cannot
conclude that the linearity assumption is violated. The linear relationship between
the response variable and predictors can be tested through the crPlots(model name)
function, as shown in Figure 9-6.
Figure 9-6. Component + residual plots for the predictors Expe_Yrs and Skill_lev
As both the graphs show near linearity, we can accept that the model sal_model_1
fulfills the test of linearity.
Further, the scale-location plot shows the points distributed randomly around a
near-horizontal line. Thus, the assumption of constant variance of the errors is fulfilled.
We validate this understanding with ncvTest(model name) from the car library. Here, the
null hypothesis is that there is constant error variance and the alternative hypothesis is
that the error variance changes with the level of fitted values of the response variable or
linear combination of predictors. As p-value is >0.05 we cannot reject the null hypothesis
that there is constant error variance.
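A sketch of this check, assuming the car package is installed and using the hypothetical stand-in data:

```r
# Hypothetical stand-in for sal_data_1 and the fitted model
sal_data_1 <- data.frame(
  Annu_Salary = c(2010, 3240, 4510, 6740, 8020, 10240, 11510, 13760, 14990, 17010),
  Expe_Yrs    = c(0.5, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0),
  Skill_lev   = c(0.5, 1.5, 1.0, 2.5, 2.0, 3.5, 3.0, 4.5, 4.0, 5.0)
)
sal_model_1 <- lm(Annu_Salary ~ Expe_Yrs + Skill_lev, data = sal_data_1)

# Non-constant variance score test: H0 = constant error variance.
# A p-value above 0.05 means H0 cannot be rejected.
if (requireNamespace("car", quietly = TRUE)) {
  print(car::ncvTest(sal_model_1))
}
```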
The Normal Q-Q plot seems to show that the residuals are not normally distributed. However, the visual test may not always be conclusive. We need to ensure normality only if it really matters to our analysis, because in reality data may not always be normal; we need to apply our judgment in such cases. Further, as a rule of thumb appealing to the central limit theorem, we typically do not insist on validating normality for very large data sets. Conversely, if the data set is very small, most statistical tests may not yield proper results. We can, however, validate the normality of the model (that is, of its residuals) through the Shapiro-Wilk normality test by using the shapiro.test(residuals(model name)) command. The resultant output is as follows:
Here, the null hypothesis is that the normality assumption holds good. We reject
the null hypothesis if p-value is <0.05. As the p-value is > 0.05, we cannot reject the null
hypothesis that normality assumption holds good.
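The test itself is one line of base R; a sketch on the hypothetical stand-in model:

```r
# Hypothetical stand-in for sal_data_1 and the fitted model
sal_data_1 <- data.frame(
  Annu_Salary = c(2010, 3240, 4510, 6740, 8020, 10240, 11510, 13760, 14990, 17010),
  Expe_Yrs    = c(0.5, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0),
  Skill_lev   = c(0.5, 1.5, 1.0, 2.5, 2.0, 3.5, 3.0, 4.5, 4.0, 5.0)
)
sal_model_1 <- lm(Annu_Salary ~ Expe_Yrs + Skill_lev, data = sal_data_1)

# Shapiro-Wilk test: H0 = the residuals are normally distributed;
# reject H0 only if the p-value is below 0.05
shapiro.test(residuals(sal_model_1))
```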
Another way to visually confirm the assumption of normality is by using
qqPlot(model name, simulate = TRUE, envelope = 0.95). The R command used is as
follows:
Figure 9-7. Plot of normality generated through qqPlot(model name, simulate = TRUE,
envelope = 0.95)
As you can see, very few points (only two) are outside the confidence interval. Hence,
we can assume that the residuals are normally distributed and that the model has fulfilled
the assumption of normality.
The standardized residuals vs. leverage plot shows that most of the standardized residuals lie around the zero line.
Now, we need to validate the independence of the residuals or the independence of
the errors, or lack of autocorrelation. However, in this case, because we know that each
data entry belongs to a different employee and is taken at a particular point in time, and
that no dependencies exist between the data of one employee and another, we can safely
assume that there is independence of the residuals and independence of errors. From an
academic-interest point of view, however, you can run a Durbin-Watson test to determine
the existence of autocorrelation, or lack of independence of errors. The command in R to
do this and the corresponding output is as follows:
As the p-value is < 0.05, the Durbin-Watson test rejects the null hypothesis that there
is no autocorrelation. Hence, the Durbin-Watson test holds up the alternate hypothesis
that there exists autocorrelation, or the lack of independence of errors. However, as
mentioned previously, we know that the data used does not lack independence. Hence,
we can ignore the Durbin-Watson test.
As you can see, all the tests of the regression model assumptions are successful.
9.1.5 Multicollinearity
Multicollinearity is another problem that can happen with multiple linear regression
methods. Say you have both date of birth and age as predictors. You know that both are
the same in a way, or in other words, that one is highly correlated with the other. If two
predictor variables are highly correlated with each other, there is no point in considering
both of these predictors in a multiple linear regression equation. We usually eliminate
one of these predictors from the multiple linear regression model or equation, because
multicollinearity can adversely impact the model.
Multicollinearity can be determined in R easily by using the vif(model name) function. VIF stands for variance inflation factor. Typically, for this test of multicollinearity to pass, the VIF value should be less than 5 (some practitioners use 10 as the cutoff). Here is the test showing the calculation of VIF:
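A sketch of the VIF calculation, assuming the car package is installed (the data is the hypothetical stand-in, so the values will differ from the book's):

```r
# Hypothetical stand-in for sal_data_1 and the fitted model
sal_data_1 <- data.frame(
  Annu_Salary = c(2010, 3240, 4510, 6740, 8020, 10240, 11510, 13760, 14990, 17010),
  Expe_Yrs    = c(0.5, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0),
  Skill_lev   = c(0.5, 1.5, 1.0, 2.5, 2.0, 3.5, 3.0, 4.5, 4.0, 5.0)
)
sal_model_1 <- lm(Annu_Salary ~ Expe_Yrs + Skill_lev, data = sal_data_1)

# One VIF value per predictor; low values suggest multicollinearity
# is not a serious problem for the model
if (requireNamespace("car", quietly = TRUE)) {
  print(car::vif(sal_model_1))
}
```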
The following shows how to use the model to predict the Annual Salary for the new
data of Experience in Years and Skill Level:
Now, let’s take another set of values for the predictor variables and check what the model
returns Annual Salary. The prediction made by sal_model_1 in this case is shown here:
This is also a significant model with the response variable Annual Salary and the
predictor variable Experience in Years (as we know), as the p-value of the model as well
as the p-value of the predictor are less than 0.05. Further, the model explains about 97.74
percent of the variance in the response variable.
Alternatively, if we remove the Experience in Years predictor variable, we get the
model shown here:
This is also a significant model with the response variable Annual Salary and the
predictor variable Skill Level (as we know), as the p-value of the model as well as the
p-value of the predictor are less than 0.05. Further, the model explains about 88 percent of
the variance in the response variable as shown by the R-squared value.
However, when we have various models available for the same response variable with
same predictor variables or different predictor variables, one of the best ways to select the
most useful model is to choose the model with the lowest AIC value. Here we have made a
comparison of the AIC values of three models: sal_model_1 with both Experience in Years
and Skill Level as predictors, sal_model_2 with only Experience in Years as the predictor
variable, and sal_model_3 with only Skill Level as the predictor variable:
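The comparison can be sketched with base R's AIC() on the three candidate models (fitted here to the hypothetical stand-in data):

```r
# Hypothetical stand-in for sal_data_1 and the three candidate models
sal_data_1 <- data.frame(
  Annu_Salary = c(2010, 3240, 4510, 6740, 8020, 10240, 11510, 13760, 14990, 17010),
  Expe_Yrs    = c(0.5, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0),
  Skill_lev   = c(0.5, 1.5, 1.0, 2.5, 2.0, 3.5, 3.0, 4.5, 4.0, 5.0)
)
sal_model_1 <- lm(Annu_Salary ~ Expe_Yrs + Skill_lev, data = sal_data_1)
sal_model_2 <- lm(Annu_Salary ~ Expe_Yrs,  data = sal_data_1)
sal_model_3 <- lm(Annu_Salary ~ Skill_lev, data = sal_data_1)

AIC(sal_model_1, sal_model_2, sal_model_3)  # the lowest AIC wins
</imports>
```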
If you compare the AIC values as shown, you find that the model with both
Experience in Years and Skill Level as predictors is the best model.
This confirms our understanding as per the discussion in the prior section of this chapter. The downside of this stepwise approach is that, as it drops predictor variables one by one, it does not check the different combinations of the predictor variables.
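Backward stepwise selection is available in base R through step(); a sketch on the hypothetical stand-in model:

```r
# Hypothetical stand-in for sal_data_1 and the full model
sal_data_1 <- data.frame(
  Annu_Salary = c(2010, 3240, 4510, 6740, 8020, 10240, 11510, 13760, 14990, 17010),
  Expe_Yrs    = c(0.5, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0),
  Skill_lev   = c(0.5, 1.5, 1.0, 2.5, 2.0, 3.5, 3.0, 4.5, 4.0, 5.0)
)
sal_model_1 <- lm(Annu_Salary ~ Expe_Yrs + Skill_lev, data = sal_data_1)

# Drop predictors one at a time as long as doing so lowers the AIC
step_model <- step(sal_model_1, direction = "backward", trace = 0)
formula(step_model)  # the formula of the model that survives
```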
We need the leaps package and its regsubsets() function. We can use the adjusted R-squared option in the plot function, as follows, to understand which of the predictors we need to include in our model. We should select the model that has the highest adjusted R-squared value. How to do this in R is shown here:
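A sketch of the all-subsets run, assuming the leaps package is installed and using the hypothetical stand-in data:

```r
# Hypothetical stand-in for sal_data_1
sal_data_1 <- data.frame(
  Annu_Salary = c(2010, 3240, 4510, 6740, 8020, 10240, 11510, 13760, 14990, 17010),
  Expe_Yrs    = c(0.5, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0),
  Skill_lev   = c(0.5, 1.5, 1.0, 2.5, 2.0, 3.5, 3.0, 4.5, 4.0, 5.0)
)

if (requireNamespace("leaps", quietly = TRUE)) {
  subsets <- leaps::regsubsets(Annu_Salary ~ Expe_Yrs + Skill_lev,
                               data = sal_data_1)
  # The row with the highest adjusted R-squared marks the best subset
  plot(subsets, scale = "adjr2")
}
```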
If we use the R-squared value, we get the plot shown in Figure 9-8. We chose the
predictor combinations with the highest R-squared value.
Figure 9-8. Plot of the multiple regression model generated in R using the all subsets
approach scaled with R-squared
As you can see, we select both the predictors Experience in Years and Skill Level,
as this model has the highest R-squared value of 0.99. However, we have seen that the
adjusted R-squared is a value that provides better insight than the R-squared value, as it
adjusts for the degrees of freedom. If we use the adjusted R-squared instead of R-squared,
we get the plot shown in Figure 9-9.
Figure 9-9. Plot of the multiple regression model generated in R using all subsets approach
scaled with adjusted R-squared
In this example too, the best model selected is the one that includes both predictors.
9.1.9 Conclusion
From our data, we got a good multiple linear regression model that we validated for the
success of the assumptions. We also used this model for predictions.
As expected, you will find that the result is the same as that obtained through the
lm() function.
As you can see from the data used for building the model, the predicted value is
almost in tune with the related actual values in our data set.
Please note: we have used glm_sal_model instead of sal_model_1 generated using
the lm() function, as there is no difference between the two.
This is done using the function predict(model name, newdata), where model name
is the name of the model arrived at from the input data, and newdata contains the data of
independent variables for which the response variable has to be predicted.
This model may work in predicting the Annual Salary for values of Experience in Years and Skill Level beyond those in the data set. However, we have to be very cautious in using the model on such an extended data set, as the model has no means of knowing whether it is suitable there. We also do not know whether the extrapolated data follows the linear relationship. It is possible that after a certain number of years, while experience increases, the Skill Level and the corresponding Annual Salary may not go up linearly; the rate of increase in salary may taper off, leading to a gradual flattening of the slope of the relationship.
Let’s use the same data set that we used previously to do this. For this, we have to use
install.packages("caret") and library(caret). The splitting of the data set into two
subsets—training data set and test data set—using R is shown here:
This split of the entire data set into two subsets, Training_Data and Test_Data,
has been done randomly. Now, we have 40 records in Training_Data and 8 records in
Test_Data.
We will now train our Training_Data using the machine-learning concepts to
generate a model. This is depicted as follows:
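A sketch of the training step, using the predictor names seen earlier in the chapter:

```r
# Fit the multiple linear regression model on the training subset only
Trained_Model <- lm(Annu_Salary ~ Expe_Yrs + Skill_lev, data = Training_Data)

# Check the p-values of the coefficients and of the overall model
summary(Trained_Model)
```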
As you can see, the model generated, Trained_Model, is a good fit with p-values of the
coefficients of the predictors being < 0.05 as well as overall model p-value being < 0.05.
We will now use the Trained_Model arrived at as previously to predict the values of
Annual_Salary in respect to Experience in Years and Skill Level from the Test_Data. This
is shown here, in a snapshot from R:
Now, let’s see whether the values of Annual Salary we predicted using the model
generated out of Training_Data and actual values of the Annual Salary in the Test_Data
are in tune with each other. The actual values and the predicted values are shown here:
You can see here that both the Actual Annual Salary from Test_Data and Annual
Salary from the model generated out of Training_Data (pred_annu_salary) match
closely with each other. Hence, the model generated out of Training_Data (Trained_
Model) can be used effectively.
9.5 Cross Validation
Cross-validation techniques such as k-fold cross-validation are used to validate the model generated. They are especially useful when we have limited data and it must be split into a small test set (say, 20 percent of the data) and a relatively bigger training set (say, 80 percent), which can make validation of the model difficult or unreliable: the training and test sets may be split in such a way that neither is very representative of the entire data set.
This problem is addressed by cross-validation methods such as k-fold cross-validation. Here we divide the data set into k folds and use k – 1 folds as training data and the remaining fold as test data, repeating this k times. Each fold contains an almost equal number of data points (depending on the k value) drawn randomly from the entire data set; in our case, with 48 records in total, each fold has 4 or 5 records. No element of one fold is repeated in another fold. For each fold, the model is validated, and the value predicted by the model (Predicted) and by the cross-validation (cvpred) are tabulated as the output, along with the difference between the actual value and the cross-validated prediction as the CV residual. Using the difference between the Predicted values and cvpred, we can arrive at the root mean square error of the fit of the linear regression model. This cross-validation method is also useful with large data sets, as every data point is used for both training and testing by rotation, and no data point appears in more than one fold.
Let us take the example of the multiple linear regression model we used earlier and validate it using k-fold cross-validation. In our example, we will use K = 10, for convenience, so that each time 90 percent of the data is in the training set and the other 10 percent is in the test set. By rotation, data that is in the training set at one point moves to the test set at another; thus all the points are used for model generation as well as for testing. For this cross-validation, we will use library(DAAG) and either the cv.lm(dataset, formula for the model, m = number of folds) or the CVlm(dataset, formula for the model, m = number of folds) function. Here m is nothing but the number of folds (that is, K) we have decided on. The result of running this for our example is provided in the following figures.
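A sketch of this run, assuming the data frame is named salary_data:

```r
library(DAAG)

# m = 10 folds; the output tabulates Predicted, cvpred, and CV residuals
cv_results <- CVlm(data = salary_data,
                   form.lm = Annu_Salary ~ Expe_Yrs + Skill_lev,
                   m = 10)
```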
Figure 9-11. A partial snapshot of output of k-fold cross validation (last fold)
Figure 9-12. Graphical output of k-fold cross validation showing all the ten folds
The values predicted by the multiple linear regression model (the lm() function) and the values predicted by cross-validation are very close to each other. As you can see from the figures above, the points fitted using cross-validation fall almost exactly on the regression model we arrived at.
Figure 9-13. Root mean square error calculated between linear regression model and cross
validation model
If we check the root mean square error between the values predicted by the multiple linear regression model and the cross-validation model, as shown in the figure above, it is negligible (about 60). Hence, k-fold cross-validation validates the multiple linear regression model we arrived at earlier.
Instead of k-fold cross-validation, we can use other methods such as repeated k-fold cross-validation, bootstrap sampling, or leave-one-out cross-validation.
9.6 Summary
In this chapter, you saw examples of multiple linear relationships. When you have a
response variable that is continuous and is normally distributed with multiple predictor
variables, you use multiple linear regression. If a response variable can take discrete
values, you use logistic regression.
You also briefly looked into significant interaction, which occurs when the outcome
is impacted by one of the predictors based on the value of the other predictor. You learned
about multicollinearity, whereby two or more predictors may be highly correlated or may
represent the same aspect in different ways. You saw the impacts of multicollinearity and
how to handle them in the context of the multiple linear regression model or equation.
You then explored the data we took from one of the entities and the correlation
between various variables. You could see a high correlation among all three variables
involved (the response variable as well as the predictor variables). By using this data in R,
you can arrive at the multiple linear regression model and then validate that it is a good
model with significant predictor variables.
You learned various techniques to validate that the assumptions of the regression are
met. Different approaches can lead to different interpretations, so you have to proceed
cautiously in that regard.
In exploring multicollinearity further, you saw that functions such as vif() enable
you to understand the existence of multicollinearity. You briefly looked at handling
multicollinearity in the context of multiple linear regression. Multicollinearity does not
reduce the value of the model for prediction purposes. Through the example of Akaike
information criterion (AIC), you learned to compare various models and that the best one
typically has the lowest AIC value.
You then explored two alternative ways to arrive at the best-fit multiple linear
regression models: stepwise multiple linear regression and all subsets approach to
multiple linear regression. You depicted the model by using the multiple linear regression
equation.
Further, you explored how the glm() function with family gaussian and link = "identity" can provide the same model as that generated through the lm() function, as we require normality in the case of a continuous response variable.
Further, you saw how to predict the value of the response variable by using the values
of the predictor variables.
Finally, you explored how to split the data set into two subsets, training data and
test data. You can use the training data to generate a model for validating the response
variable of test data, by using the predict() function on the model generated using the
training data.
CHAPTER 10
Logistic Regression
In Chapters 8 and 9, we discussed simple linear regression and multiple linear regression,
respectively. In both types of regression, we have a dependent variable or response
variable as a continuous variable that is normally distributed. However, this is not always
the case. Often, the response variable is not normally distributed. It may be a binary
variable following a binomial distribution and taking the values of 0/1 or No/Yes. It
may be a categorical or discrete variable taking multiple values that may follow
distributions other than the normal one.
If the response variable is a discrete variable (which can be nominal, ordinal,
or binary), you use a different regression method, known as logistic regression. If the
response variable takes binary values (such as Yes/No or Sick/Healthy) or multiple
discrete variables (such as Strongly Agree, Agree, Partially Agree, and Do Not Agree),
we can use logistic regression. Logistic regression is still a linear regression. Logistic
regression with the response variable taking only two values (such as Yes/No or Sick/
Healthy) is known as binomial logistic regression. A logistic regression with a response
variable that can take multiple discrete values is known as multinomial logistic regression.
Note that nonlinear regression is outside the scope of this book.
Consider the following examples:
• You want to decide whether to lend credit to a customer. Your
choice may depend on various factors including the customer’s
earlier record of payment regularity with other companies, the
customer’s credibility in the banking and financial sector, the
profitability of the customer’s organization, and the integrity of
the promoters and management.
• You are deciding whether to invest in a project. Again, your choice
depends on various factors, including the risks involved, the likely
profitability, the longevity of the project’s life, and the capability of
the organization in the particular area of the project.
• You are deciding whether to hire a particular candidate for a
senior management post. This choice may depend on various
factors including the candidate’s experience, past record, attitude,
and suitability to the culture of your organization.
Chapter 10 ■ Logistic Regression
We use the maximum likelihood method to generate the logistic regression equation
that predicts the natural logarithm of the odds ratio. From this logistic regression equation,
we determine the predicted odds ratio and the predicted probability of success:
Predicted probability of success = predicted odds ratio / (1 + predicted odds ratio)
Here, we do not use the dependent variable value as it is. Instead, we use the natural
logarithm of the odds ratio.
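The relationship above can be sketched directly in R, using an illustrative log-odds value:

```r
logit <- 1.2               # illustrative predicted log(odds ratio)
odds  <- exp(logit)        # predicted odds ratio
prob  <- odds / (1 + odds) # predicted probability of success
print(prob)
```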
10.1 Logistic Regression
In this section, we demonstrate logistic regression using a data set.
10.1.1 The Data
Let’s start by considering that we have created data with six variables:
• Attrition represents whether the employee has exited the
organization or is still in the organization (Yes for exit, and No for
currently working in the organization).
• Yrs_Exp represents the experience of the employee at this time
(in years).
• Work_Challenging represents whether the work assigned to the
employee is challenging.
• Work_Envir represents whether the work environment is
Excellent or Low.
• Compensation represents whether the compensation is Excellent
or Low.
• Tech_Exper represents whether the employee is technically
expert (Excellent or Low).
The data covers the last six months and pertains only to those employees with two to
five years of experience. The data is extracted from the CSV text file Attri_Data_10.txt
by using the read.csv() command:
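A sketch of the import step; the file is assumed to be in the working directory:

```r
# read.csv() works on the comma-separated text file described in the text
attrition_data <- read.csv("Attri_Data_10.txt")

# Summarize the six variables (counts of Yes/No, Excellent/Low, and so on)
summary(attrition_data)
```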
This code also presents a summary of the data. As you can see, the Attrition field
has 28 Yes values: that means that these employees have exited the organization. The
24 No values represent employees who are still continuing in the organization. You can
also observe from the preceding summary that 28 employees have not been assigned
challenging work (Work_Challenging), and 24 employees have. Furthermore, 28
employees are working in teams where the work environment (Work_Envir) is considered
excellent, whereas 24 are working in teams where the work environment is not that great
(here marked Low). Finally, 21 employees have excellent compensation, at par or above
the market compensation (shown here as Excellent); but 31 have compensation that
is below the market compensation or low compensation (shown here as Low). Out of
the total employees, 44 have excellent technical expertise, whereas 8 others have low
technical expertise. The data set contains 52 records.
Ideally, when the organization is providing challenging work to an employee, the
work environment within the team is excellent, compensation is excellent, and technical
expertise of the employee is low, then the chance for attrition should be low.
Here is a glimpse of the data:
The model created by using the glm() function is shown here, along with the
summary (generated by using summary(model name)):
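A sketch of the model-fitting step, using the variable names defined above (the model name attri_logit_model matches the one used later in the chapter):

```r
attri_logit_model <- glm(Attrition ~ Yrs_Exp + Work_Challenging + Work_Envir +
                           Compensation + Tech_Exper,
                         data = attrition_data,
                         family = binomial(link = "logit"))
summary(attri_logit_model)
```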
Only one value among the categorical variables is shown here. This is because
each variable has two levels, and one level is taken as a reference level by the model.
An example is the categorical variable Work_Challenging, which has two levels: Work_
ChallengingYes and Work_ChallengingNo. Only Work_ChallengingYes is shown in the
model, as Work_ChallengingNo is taken as the reference level.
You can see in the preceding summary of the logistic regression model that except
for Yrs_Exp, all other variables are significant to the model (as each p-value is less than
0.05). Work_ChallengingYes, Work_EnvirLow, Compensation_Low, and Tech_ExperLow are
the significant variables. Yrs_Exp is not a significant variable to the model, as it has a high
p-value. It is quite obvious from even a visual examination of the data that Yrs_Exp will
not be significant to the model, as attrition is observed regardless of the number of years
of experience. Furthermore, you can see that the model has converged in seven Fisher’s
scoring iterations, which is good because ideally we expect the model to converge in less
than eight iterations.
We can now eliminate Yrs_Exp from the logistic regression model and recast the
model. The formula used for recasting the logistic regression model and the summary of
the model are provided here:
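A sketch of the recast model, dropping Yrs_Exp:

```r
attri_logit_model_2 <- glm(Attrition ~ Work_Challenging + Work_Envir +
                             Compensation + Tech_Exper,
                           data = attrition_data,
                           family = binomial(link = "logit"))
summary(attri_logit_model_2)
```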
As you can see, the model parameters are significant because the p-values are less
than 0.05.
The degrees of freedom for the data is calculated as n minus 1 (the number of data
points – 1, or 52 – 1 = 51). The degrees of freedom for the model is n minus 1 minus the
number of coefficients (52 – 1 – 4 = 47). These are shown in the preceding summary of
the model. Deviance is a measure of lack of fit. Null deviance is nothing but the deviance
of the model with only intercept. Residual deviance is the deviance of the model. Both
are part of the preceding summary of the model. The following shows that the residual
deviance reduces with the addition of each coefficient:
Let’s compare both the models—attri_logit_model (with all the predictors) and
attri_logit_model_2 (with only significant predictors) and check how the second model
fares with respect to the first one:
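The comparison can be sketched as:

```r
# Likelihood-ratio test between the reduced and full models
anova(attri_logit_model_2, attri_logit_model, test = "Chisq")
```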
This test is carried out by using the anova() function and checking the chi-square
p-value. In this table, the chi-square p-value is 0.7951. This is not significant. This
suggests that the model attri_logit_model_2 without Yrs_Exp works well compared
to the model attri_logit_model with all the variables. There is not much difference
between the two models. Hence, we can safely use the model without Yrs_Exp (that is,
attri_logit_model_2).
This calculation shows that the model explains 63.65 percent of the deviance. You can
also compute the value of pseudo R-square by using library(pscl) and pR2(model_name).
Another way to verify the model fit is by calculating the p-value with the chi-square method as follows:
p_value <- pchisq(model_deviance_diff, df_data - df_model, lower.tail = FALSE)
Because the p-value is very small, the reduction in deviance cannot be assumed to
be by chance. As the p-value is significant, the model is a good fit.
This may be due to data or a portion of data predicting the response perfectly. This is
known as the issue of separation or quasi-separation.
Here are some general words of caution with respect to the logistic regression model:
• We have a problem when the null deviance is less than the
residual deviance.
• We have a problem when the convergence requires many Fisher’s
scoring iterations.
• We have a problem when the coefficients are large in size with
significantly large standard errors.
In these cases, we may have to revisit the model to look again at each coefficient.
10.1.5 Multicollinearity
We talked about multicollinearity in Chapter 9. Multicollinearity can be detected easily in R by using the vif(model name) function. VIF stands for variance inflation factor. Typically, a rule of thumb for the multicollinearity test to pass is that the VIF value should be less than 5. The following test shows the calculation of VIF:
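A sketch of the VIF check (vif() comes from the car package):

```r
library(car)

# For models with factor predictors, vif() reports generalized VIF values
vif(attri_logit_model_2)
```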
As you can see, our model does not suffer from multicollinearity.
10.1.6 Dispersion
Dispersion (the variance of the dependent variable) above the value of 1 is a potential issue with some regression models, including the logistic regression model (the model summary notes that the dispersion parameter for the binomial family is taken to be 1). This is known as overdispersion. Overdispersion occurs when the observed variance of the dependent variable is bigger than expected from the binomial distribution (that is, 1). This leads to issues with the reliability of the significance tests, as it is likely to adversely impact the standard errors.
Whether a model suffers from the issue of overdispersion can be easily found using
R, as shown here:
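One common quick check, which may differ from the exact commands in the book's screenshot, is the ratio of residual deviance to residual degrees of freedom; values well above 1 suggest overdispersion:

```r
# A ratio close to 1 indicates no overdispersion
deviance(attri_logit_model_2) / df.residual(attri_logit_model_2)
```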
The model generated by us, attri_logit_model_2, does not suffer from the issue of
overdispersion. If a logistic regression model does suffer from overdispersion, you need to
use quasibinomial distribution in the glm() function instead of binomial distribution.
This split of the entire data set into two subsets (Training_Data and Test_Data)
has been done randomly. Now, we have 43 records in Training_Data and 9 records in
Test_Data.
As you can see, the model generated (train_logit_model) has taken 18 Fisher’s
scoring iterations to converge, and Tech_ExperLow has a huge coefficient and standard
error. Hence, this model may not be useful. For this reason, we cannot proceed with
using the training model to predict Attrition with respect to the records of Test_Data.
However, if we could successfully come up with a training model, we could proceed with
our further analysis as follows:
1. Use the model generated out of the training data to predict the response variable for the test data and store the predicted data in the test data set itself.
2. Compare the generated values of the response variable with the actual values of the response variable in the test data set.
3. Generate the confusion matrix to understand the true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).
• True positives are the ones that are actually positives (1) and are also predicted as positives (1).
• True negatives are the ones that are actually negatives (0) and are also predicted as negatives (0).
• False positives are the ones that are predicted as positives (1) but are actually negatives (0).
• False negatives are the ones that are predicted as negatives (0) but are actually positives (1).
4. Check for accuracy, specificity, and sensitivity:
• Accuracy = (TP + TN) / Total Observations
• Specificity = True Negative Rate = TN / (FP + TN)
• Sensitivity = Recall = True Positive Rate = TP / (FN + TP)
In addition, Precision = TP / (FP + TP) and F1 Score = 2TP / (2TP + FP + FN) may be considered.
Higher accuracy, higher sensitivity, and higher specificity are typically expected.
Check whether these values are appropriate to the objective of the prediction in mind.
If the prediction will affect the safety or health of people, we have to ensure the highest
accuracy. In such cases, each predicted value should be determined with caution and
further validated through other means, if required.
We take the value of Attrition as Yes if the probability returned by the prediction is
> 0.5, and we take the same as No if the probability returned by the prediction is not > 0.5.
As you can see in the preceding code, the value is far below 0.5 so we can safely assume
that Attrition = No.
The preceding prediction is determined by using the function predict(model name,
newdata=dataframe_name, type="response"), where model name is the name of the
model arrived at from the input data, newdata contains the data of independent variables
for which the response variable has to be predicted, and type="response" is required to
ensure that the outcome is not logit(y).
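A sketch of this prediction for a single hypothetical employee record (the field values are illustrative):

```r
new_employee <- data.frame(Work_Challenging = "No", Work_Envir = "Low",
                           Compensation = "Low", Tech_Exper = "Excellent")

prob <- predict(attri_logit_model_2, newdata = new_employee, type = "response")
ifelse(prob > 0.5, "Yes", "No")  # Attrition decision at the 0.5 threshold
```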
1. We use the model generated to predict the dependent variable values (logit) using predicted_value <- predict(model, type="response") on the full data set that we used to generate the model.
2. Then we generate a confusion matrix. We can clearly see from the confusion matrix that the model generates very high true positives and true negatives.
3. Now we use the ROCR package and the prediction() function from it as follows: prediction_object <- prediction(predicted_value, dataset$dependent_variable). With this, we now create a prediction_object:
4. Using the performance() function from the ROCR package on the prediction_object, we obtain TPR = TP / (FN + TP) = TP / (All Positives) and FPR = FP / (FP + TN) = FP / (All Negatives), and plot TPR against FPR to obtain a receiver operating characteristic (ROC) curve. This is done using plot(performance(prediction_object, measure = "tpr", x.measure = "fpr")). Alternatively, we can use sensitivity/specificity plots by using plot(performance(prediction_object, measure = "sens", x.measure = "spec")) or precision/recall plots by using plot(performance(prediction_object, measure = "prec", x.measure = "rec")). The first of these R commands, used to generate the ROC curve, is shown here, followed by the curve generated (see Figure 10-1):
Figure 10-1. ROC curve plotting the true positive rate against the false positive rate
This ROC curve clearly shows that the model generates almost no false positives and
generates high true positives. Hence, we can conclude that the model generated is a good
model.
10.4 Regularization
Regularization is a complex subject that we won’t discuss thoroughly here. However, we
provide an introduction to this concept because it is an important aspect of statistics that
you need to understand in the context of statistical models.
Regularization is the method normally used to avoid overfitting. When we keep
adding parameters to our model to increase its accuracy and fit, at some point our
prediction capability using this model decreases. By taking too many parameters, we are
overfitting the model to the data and losing the value of generalization, which could have
made the model more useful in prediction.
Using forward and backward model fitting and subset model fitting, we try to avoid
overfitting and hence make the model more generalized and useful in predicting future
values. This will ensure less bias as well as less variance when relating to the test data.
Regularization is also useful when we have more parameters than the data
observations in our data set and the least squares method cannot help because it
would lead to many models (not a single unique model) that would fit to the same data.
Regularization allows us to find one reasonable solution in such situations.
Shrinkage methods are the most used regularization methods. They add a penalty
term to the regression model to carry out the regularization. We penalize the loss function
by adding a multiple (λ, also known as the shrinkage parameter) of the regularization
norm, such as Lasso or Ridge (also known as the shrinkage penalty), of the linear
regression weights vector. We may use cross-validation to get the best multiple (λ value).
The more complex the model, the greater the penalty. We use either L1 regularizer (Lasso)
or L2 regularizer (Ridge). Regularization shrinks the coefficient estimates to reduce the
variance.
Ridge regression shrinks the estimates of the parameters but not to 0, whereas Lasso regression shrinks the estimates of some parameters to 0. For Ridge, as the value of λ decreases, the flexibility of the fit increases, and with it the variance; small changes in the training data can then lead to large changes in the parameter estimates, an effect that is aggravated as the number of parameters increases. Lasso creates less-complicated models, thus making prediction easier.
Let’s explore the concept of regularization on our data set attrition_data without
Yrs_Exp. We don’t take Yrs_Exp into consideration because we know that it is not
significant.
We use the glmnet() function from the glmnet package to determine the regularized
model. We use the cv.glmnet() function from the glmnet package to determine the
best lambda value. We use alpha=1 for the Lasso and use alpha=0 for the Ridge. We use
family="binomial" and type="class" because our response variable is binary and
we are using the regularization in the context of logistic regression, as required. The
glmnet() function requires the input to be in the form of a matrix and the response
variable to be a numeric vector. This fits a generalized linear model via penalized
maximum likelihood. The regularization path is computed for the Lasso or elasticnet
penalty at a grid of values for the regularization parameter lambda.
The generic format of this function as defined in the glmnet R package is as follows:
As usual, we will not be using all the parameters. We will be using only the absolutely
required parameters in the interest of simplicity. Please explore the glmnet package
guidelines for details of each parameter.
We will first prepare the inputs required. We need the model in the format of a
matrix, as the input for the glmnet() function. We also require the response variable as a
vector:
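A sketch of this preparation, assuming the data frame attrition_data from earlier in the chapter:

```r
library(glmnet)

# Model matrix of predictors, dropping the intercept column
x <- model.matrix(Attrition ~ Work_Challenging + Work_Envir +
                    Compensation + Tech_Exper, data = attrition_data)[, -1]

# Numeric response vector: 1 for Yes (exited), 0 for No
y <- ifelse(attrition_data$Attrition == "Yes", 1, 0)

# Lasso (alpha = 1) logistic fit via penalized maximum likelihood
glmnet_fit <- glmnet(x, y, family = "binomial", alpha = 1)
```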
Explaining the contents of the summary is beyond the scope of this book, but we will
show how the regularization is carried out primarily using the graphs. We use the plot()
function for this purpose. As we are using the binary data and logistic regression, we use
xvar="dev" (where dev stands for deviance) and label = TRUE to identify the parameters
in the plot as inputs to the plot() function:
Figure 10-2. Coefficients of each variable plotted against the deviance explained: two have positive coefficients, and two have negative coefficients
The output of the glmnet_fit using the print() function is shown here:
This primarily shows the degrees of freedom (the number of nonzero coefficients), the percentage of null deviance explained by the model, and the lambda value. As you can see, the lambda value keeps decreasing. As it decreases, the percentage of deviance explained by the model increases, as does the number of nonzero coefficients. Even though we supplied nlambda = 100 to the function (this is the default), only 68 lambda values are shown. This is because the algorithm stops at an optimal point, when it sees no further significant change in the percent deviance explained by the model.
Now we will make the prediction of the class labels at lambda = 0.05. Here type =
"class" refers to the response type:
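A sketch of this prediction, reusing the x matrix prepared earlier (an assumed name):

```r
# Class labels for the first four records at lambda = 0.05
predict(glmnet_fit, newx = x[1:4, ], s = 0.05, type = "class")
```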
As you can see, all four values are predicted accurately, as they match the first four
rows of our data set.
Now we will do the cross-validation of the regularized model by using the cv.glmnet() function from the glmnet package. This function does k-fold cross-validation for glmnet, produces a plot, and returns a minimum value for lambda. It also returns a lambda value at one standard error. By default, this function does a 10-fold cross-validation; we can change the number of folds if required. Here we use type.measure = "class", as we are using binary data and logistic regression; class gives the misclassification error:
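A sketch of the cross-validation step (the seed is an assumption, added for reproducibility):

```r
set.seed(123)  # cross-validation folds are drawn randomly
cv.fit <- cv.glmnet(x, y, family = "binomial",
                    type.measure = "class",  # misclassification error
                    alpha = 1)

cv.fit$lambda.min  # lambda giving the minimum cross-validated error
cv.fit$lambda.1se  # largest lambda within one standard error of the minimum
```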
We now plot the output of cv.glmnet()—that is, cv.fit—by using the plot()
function:
Figure 10-3 shows the cross-validated curve along with the upper and lower values of
the misclassification error against the log(lambda) values. The cross-validated curve here
is depicted by red dots.
Figure 10-3. Cross-validated misclassification error plotted against log(Lambda)
The following shows some of the important output parameters of the cv.fit model,
including lambda.min or lambda.1se:
We can view the coefficient value at the lambda.min value using the coef(cv.fit,
s = "lambda.min") command in R. The output is a sparse matrix with the second levels
shown for each independent factor:
Let’s now see how this regularized model predicts the values by using the predict()
function and the s = "lambda.min" option. We will check this for the first six values of
our data set. The results are shown here:
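A sketch of this check on the first six records:

```r
# Predict class labels at the lambda.min value found by cv.glmnet()
predict(cv.fit, newx = x[1:6, ], s = "lambda.min", type = "class")
```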
All six values are predicted properly by the predict() function. However, please note
that we have not validated the results for our entire data set. The accuracy of the model may
not be 100 percent, as our objective of regularization was to provide a generalized model for
future predictions without worrying about an exact fit (overfit) on the training data.
10.5 Summary
In this chapter, you saw that if the response variable is a categorical or discrete variable
(which can be nominal, ordinal, or binary), you use a different regression method, called
logistic regression. If you have a dependent variable with only two values, such as Yes or
No, you use binomial logistic regression. If the dependent variable takes more than two
categorical values, you use multinomial logistic regression.
You looked at a few examples of logistic regression. The assumptions of linearity, normality, and homoscedasticity that generally apply to regressions do not apply to logistic regression. You used the glm() function with family = binomial(link = "logit") to create a logistic regression model.
You also looked at the underlying statistics and how logit (log odds) of the
dependent variable is used in the logistic regression equation instead of the actual value
of the dependent variable.
You also imported the data set to understand the underlying data. You created the
model and verified the significance of the predictor variables to the model by using the
p-value. One of the variables (Yrs_Exp) was not significant. You reran the model without
this predictor variable and arrived at a model in which all the variables were significant.
You explored how to interpret the coefficients and their impact on the dependent
variable. You learned about deviance as a measure of lack of fit and saw how to verify the
model’s goodness of fit by using the p-value of deviance difference using the chi-square
method.
You need to use caution when interpreting the logistic regression model. You
checked for multicollinearity and overdispersion.
You then split the data set into training and test sets. You tried to come up with
a logistic regression model out of the training data set. Through this process, you
learned that a good model generated from such a training set can be used to predict the
dependent variable. You can use a confusion matrix to check measures such as accuracy,
specificity, and sensitivity.
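A sketch of how such a confusion matrix and its measures can be computed (a glm on the built-in mtcars data stands in for the chapter's model, and the training data is reused here in place of a genuine hold-out test set):

```r
# Sketch: confusion matrix, accuracy, sensitivity, and specificity
# from the predicted probabilities of a binomial glm.
fit  <- glm(am ~ mpg + wt, data = mtcars, family = binomial)
prob <- predict(fit, type = "response")
pred <- ifelse(prob > 0.5, 1, 0)          # classify at a 0.5 cutoff

cm <- table(Predicted = pred, Actual = mtcars$am)
cm
accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["1", "1"] / sum(cm[, "1"])   # true positive rate
specificity <- cm["0", "0"] / sum(cm[, "0"])   # true negative rate
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```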
By using the prediction() and performance() functions from the ROCR package, you
can generate an ROC curve to validate the model, using the same data set as the original.
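A minimal sketch of this (assuming the ROCR package is installed, and reusing a glm fit on mtcars rather than the chapter's model):

```r
# Sketch: ROC curve and AUC with the ROCR package.
library(ROCR)

fit  <- glm(am ~ mpg + wt, data = mtcars, family = binomial)
prob <- predict(fit, type = "response")

pred <- prediction(prob, mtcars$am)                        # predictions vs. labels
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf)                                                 # the ROC curve
performance(pred, measure = "auc")@y.values                # area under the curve
```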
You learned how to predict the value of a new data set by using the logistic regression
model you developed. Then you had a brief introduction to multinomial logistic
regression and the R packages that can be used in this regard.
Finally, you learned about regularization, including why it’s required and how it’s
carried out.
CHAPTER 11
Big Data Analysis—Introduction and Future Trends
Data is power. Data is going to be another dimension of value in any enterprise. Data is
the decision driver going forward. All organizations and institutions have woken up to
the value of data and are trying to collate data from various sources and mine it for its
value. Businesses are trying to understand consumer/market behavior in order to get the
maximum out of each consumer with the minimum effort possible. Fortunately, these
organizations and institutions have been supported by the evolution of technology in the
form of increasing storage power and computing power, as well as the power of the cloud,
to provide infrastructure as well as tools. This has driven the growth of data analytical
fields including descriptive analytics, predictive analytics, machine learning, deep
learning, artificial intelligence, and the Internet of Things. Organizations now collect
data without first assessing its value, letting the data itself reveal what it is worth.
Data is made to learn from itself and thus throw light on patterns that were hitherto
unknown, including some that a logically thinking person might not anticipate or accept
at face value.
This chapter does not delve into the various definitions of big data. Many pundits
have offered many divergent definitions of big data, confusing people more than
clarifying the issue. However, in general terms, big data means a huge amount of data that
cannot be easily understood or analyzed manually or with limited computing power or
limited computer resources; analyzing big data requires the capability to crunch data of a
diverse nature (from structured data to unstructured data), from various sources (such as
social media, structured databases, unstructured databases, and the Internet of Things).
In general, when people refer to big data, they are referring to data with three
characteristics: variety, volume, and velocity, as shown in Figure 11-1.
Figure 11-1. The three characteristics of big data: variety (unstructured data), velocity (streaming data), and volume (terabytes to zettabytes)
Variety refers to the different types of data available on the Web, in various databases,
and elsewhere. This data can be structured or unstructured and can come from
various social media. Volume refers to the size of the data that is available for you to
process; it is big, running to terabytes and petabytes. Velocity refers to how fast you can
process and analyze the data, determine its meaning, and arrive at models that
can help the business.
The following have aided the effective use of big data:
• Huge computing power of clusters of computing machines
extending the processing power and the memory power by
distributing the load over several machines
• Huge storage power distributing the data over various storage
resources
• Significant development of algorithms and packages for machine
learning
• Developments in the fields of artificial intelligence, natural
language processing, and others
• Development of tools for data visualization, data integration, and
data analysis
Chapter 11 ■ Big Data Analysis—Introduction and Future Trends
Hadoop Distributed File System (HDFS) allows data to be distributed and stored
among many computers. Further, it allows the use of the increased processing power and
memory of multiple clustered systems. This has overcome the obstacle of not being able
to store huge amounts of data in a single system and not being able to analyze that data
because of a lack of required processing power and memory. The Hadoop ecosystem
consists of modules that enable us to process the data and perform the analysis.
A user application can submit a job to Hadoop. Once data is loaded onto the
system, it is divided into multiple blocks, typically 64 MB or 128 MB. Then the Hadoop
Job client submits the job to the JobTracker. The JobTracker distributes and schedules
the individual tasks to different machines in a distributed system; many machines are
clustered together to form one entity. The tasks are divided into two phases: Map tasks
are done on small portions of the data where the data is stored, and Reduce tasks combine
the intermediate results to produce the final output. The TaskTrackers on different nodes
execute the tasks as per the MapReduce implementation, and the output of the reduce
phase is stored in files on the file system. The entire process is coordinated by various
smaller tasks and functions.
The full Hadoop ecosystem and framework are shown in Figure 11-3.
Figure 11-3. The Hadoop ecosystem and framework: distributed data processing with MapReduce, HBASE, and related modules
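The map and reduce phases described above can be sketched in plain R. This simulates the logic of a word count on a single machine; real Hadoop distributes the same steps across the nodes of a cluster:

```r
# Sketch of the MapReduce idea: map tasks emit (key, 1) pairs on small
# chunks of data; the reduce phase groups the pairs by key and sums them.
docs <- c("big data is big", "data is power")   # two "blocks" of input

# Map phase: each block independently emits one key per word
map_out <- unlist(lapply(docs, function(d) strsplit(d, " ")[[1]]))

# Shuffle + reduce phase: group by key and sum the counts
reduce_out <- tapply(rep(1, length(map_out)), map_out, sum)
reduce_out   # big:2 data:2 is:2 power:1
```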
In addition to these tools, NoSQL (originally referring to not only SQL) databases
such as Cassandra, ArangoDB, MarkLogic, OrientDB, Apache Giraph, MongoDB, and
Dynamo have supported or complemented the big data ecosystem significantly. These
NoSQL databases can effectively store and analyze huge volumes of multidimensional,
structured or unstructured data. This has provided a significant fillip to the consolidation
and integration of data from diverse sources for analysis.
Currently, Apache Spark is gaining momentum. Apache Spark is a fast, general
engine for big data processing, with built-in modules for streaming, SQL, machine
learning, and graph processing. It provides an easy interface to applications written in
R, Java, Python, or Scala, and ships with a stack of libraries such as MLlib (machine
learning), Spark Streaming, Spark SQL, and GraphX. It can run in stand-alone mode as
well as on Hadoop, and it can access various data sources, from HBase to Cassandra to
HDFS. Many users and organizations have adopted it, making it very popular in a short
period of time. This tool provides significant hope to organizations and users of big data.
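As an illustration of Spark's R interface, here is a sketch using the sparklyr package. This example is not from the book's text, and it assumes sparklyr and a local Spark installation are available:

```r
# Sketch: driving Apache Spark from R via sparklyr (assumed setup:
# install.packages("sparklyr"); sparklyr::spark_install()).
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")   # stand-alone local mode
mtcars_tbl <- copy_to(sc, mtcars)       # ship an R data frame to Spark

# dplyr verbs are translated to Spark SQL behind the scenes
mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE))

# MLlib via sparklyr's ml_* interface
fit <- ml_linear_regression(mtcars_tbl, mpg ~ wt + cyl)
summary(fit)

spark_disconnect(sc)
```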
Tools such as Microsoft Business Intelligence and Tableau provide dashboards
and the visualization of data. These have enabled organizations to learn from the data
and leverage this learning to formulate strategies or improve the way they conduct their
operations or their processes.
The following are some of the advantages of using Hadoop for big data processing:
• Simple parallel architecture on cheap commodity hardware
• Built-in fault tolerance, application-level failure detection, and high availability
• Dynamic cluster modification without interruption
• Easy addition or removal of machines to/from the cluster
• Java-based and platform independent
Microsoft Azure, Amazon, and Cloudera are some of the big providers of cloud
facilities and services for effective big data analysis.
11.2.4 Prescriptive Analytics
Data analysis is no longer focused only on understanding the patterns or value hidden in
the data. The future trend is to prescribe the actions to be taken, based on the past and
on present circumstances, without the need for human intervention. This is going to be
of immense value in fields such as healthcare and aeronautics.
11.2.5 Internet of Things
The Internet of Things is a driving force for the future. It has the capability to bring data
from diverse sources such as home appliances, industrial machines, weather equipment,
and sensors from self-driving vehicles or even people. This has the potential to create
a huge amount of data that can be analyzed and used to provide proactive solutions to
potential future and current problems. This can also lead to significant innovations and
improvements.
11.2.6 Artificial Intelligence
Neural networks can drive artificial intelligence—in particular, making huge data learn
from itself without any human intervention, specific programming, or the need for
specific models. Deep learning is one such area acting as a driver in the field of big
data. It may throw up many whats that we are not aware of; we may not understand
the whys behind some of them, but the whats themselves may be very useful. Hence,
we may move away from the perspective of always looking for cause and effect. The
speed at which the machine-learning field is developing and being used drives significant
emphasis in this area. Further, natural language processing (NLP) and property graphs
(PG) are also likely to drive new application design and development, putting the
capabilities of these technologies in the hands of organizations and users.
11.2.9 Real-Time Analytics
Organizations are hungry to understand the opportunities available to them. They want
to understand in real time what is happening—for example, what a particular person is
purchasing or what a person is planning for—and use the opportunity appropriately to
offer the best possible solutions or discounts or cross-sell related products or services.
Organizations are no longer satisfied with a delayed analysis of data that results in missed
business opportunities because they were not aware of what was happening in real time.
11.2.13 In-Database Analytics
In-database analytics offer increased security and reduced privacy concerns, in part
by addressing governance. Organizations, if required, can do away with intermediate
requirements for data analysis, such as data warehouses. Organizations that are
conscious of governance and security concerns will provide a significant fillip to
in-database analytics. Many vendor organizations have already made their presence felt
in this space.
11.2.14 In-Memory Analytics
In-memory analytics makes it possible to run transactional processing and analytical
processing side by side, in memory. This may be very helpful in fields where
immediate intervention based on the results of analysis is essential. Systems with
Hybrid Transactional/Analytical Processing (HTAP) are already being used by some
organizations. However, using HTAP for its own sake may not be of much use: when
the rate of data change is slow, or when you still need to bring in data from various
diverse systems to carry out effective analysis, it may instead be overkill, leading to
higher costs for the organization.
11.2.17 Healthcare
Big data in healthcare can be used to predict epidemics, prevent diseases, and improve
value-based healthcare and quality of living. An abundance of data is generated from
various modern devices such as smartphones, Fitbit products, and pedometers that
measure, for example, how far you walk in a day or the number of calories you burn.
This data can be used to create diet plans or prescribe medicines. Other unstructured
data—such as medical device logs, doctors' notes, lab results, x-ray reports, and
clinical and lab data—can be analyzed to improve patient care and thus increase
efficiency. Other data that can be generated and processed for big data
analytics includes claims data, electronic health/medical record data (EHR or EMR),
pharmaceutical R&D data, clinical trials data, genomic data, patient behavior and
sentiment data, and medical device data.