Big Data User's Guide For S-PLUS 8: The Knowledge To Act
May 2007
Insightful Corporation
Seattle, Washington
Proprietary Notice
Insightful Corporation owns both this software program and its documentation. Both the program and documentation are copyrighted with all rights reserved by Insightful Corporation.
The correct bibliographical reference for this document is as follows:
Big Data User’s Guide for S-PLUS® 8, Insightful Corporation, Seattle, WA.
Printed in the United States.
ACKNOWLEDGMENTS
S-PLUS would not exist without the pioneering research of the Bell
Labs S team at AT&T (now Lucent Technologies): John Chambers,
Richard A. Becker (now at AT&T Laboratories), Allan R. Wilks (now
at AT&T Laboratories), Duncan Temple Lang, and their colleagues in
the statistics research departments at Lucent: William S. Cleveland,
Trevor Hastie (now at Stanford University), Linda Clark, Anne
Freeny, Eric Grosse, David James, José Pinheiro, Daryl Pregibon, and
Ming Shyu.
Insightful Corporation thanks the following individuals for their
contributions to this and earlier releases of S-PLUS: Douglas M. Bates,
Leo Breiman, Dan Carr, Steve Dubnoff, Don Edwards, Jerome
Friedman, Kevin Goodman, Perry Haaland, David Hardesty, Frank
Harrell, Richard Heiberger, Mia Hubert, Richard Jones, Jennifer
Lasecki, W.Q. Meeker, Adrian Raftery, Brian Ripley, Peter
Rousseeuw, J.D. Spurrier, Anja Struyf, Terry Therneau, Rob
Tibshirani, Katrien Van Driessen, William Venables, and Judy Zeh.
S-PLUS BOOKS
The S-PLUS® documentation includes books to address your focus and knowledge level. Review the following table to help you choose the S-PLUS book that meets your needs. These books are available in PDF format in the following locations:
• In your S-PLUS installation directory (SHOME\help on Windows, SHOME/doc on UNIX/Linux).
• In the S-PLUS Workbench, from the Help ► S-PLUS Manuals menu item.
• In Microsoft Windows®, in the S-PLUS GUI, from the Help ► Online Manuals menu item.
S-PLUS documentation:
• Getting Started Guide: For users who are new to the S language and the S-PLUS GUI, and want an introduction to importing data, producing simple graphs, applying statistical models, and viewing data in Microsoft Excel®.
• User’s Guide: For new S-PLUS users who need to learn how to use S-PLUS, primarily through the GUI.
• S-PLUS Workbench User’s Guide: For users familiar with the S language and S-PLUS who want to use the S-PLUS plug-in, or customization, of the Eclipse Integrated Development Environment (IDE).
• Programmer’s Guide: For users who have used the S language and S-PLUS, and want to know how to write, debug, and program functions from the Commands window.
• Guide to Graphics: For users familiar with the S language and S-PLUS who are looking for information about creating or editing graphics, either from a Commands window or the Windows GUI, or using S-PLUS-supported graphics devices.
• Big Data User’s Guide: For users familiar with the S language and S-PLUS who want to use the Big Data library to import and manipulate very large data sets.
• Guide to Statistics, Vol. 1: For users familiar with the S language and S-PLUS who need a reference for the range of statistical modeling and analysis techniques in S-PLUS. Volume 1 includes information on specifying models in S-PLUS, on probability, on estimation and inference, on regression and smoothing, and on analysis of variance.
• Guide to Statistics, Vol. 2: For users familiar with the S language and S-PLUS who need a reference for the range of statistical modeling and analysis techniques in S-PLUS. Volume 2 includes information on multivariate techniques, time series analysis, survival analysis, resampling techniques, and mathematical computing in S-PLUS.
CONTENTS
S-PLUS Books iv
Chapter 1 Introduction to the Big Data Library 1
Chapter 2 Census Data Example 21
Index 161
INTRODUCTION TO THE BIG DATA LIBRARY
Introduction 2
Working with a Large Data Set 3
Finding a Solution 3
No 64-Bit Solution 5
Size Considerations 7
Summary 7
The Big Data Library Architecture 8
Block-based Computations 8
Data Types 11
Classes 14
Functions 15
Summary 19
INTRODUCTION
In this chapter, we discuss the history of the S language and large data
sets and describe improvements that the Big Data library presents.
This chapter discusses data set size considerations, including when to
use the Big Data library. The chapter also describes in further detail
the Big Data library architecture: its data objects, classes, functions,
and advanced operations.
To use the Big Data library, you must load it as you would any other
library provided with S-PLUS: that is, at the command prompt, type
library(bigdata).
WORKING WITH A LARGE DATA SET
Finding a Solution
S programmers with large data sets have historically dealt with memory limitations in a variety of ways. Some opted to use other applications, and some divided their data into “digestible” batches and then recompiled the results. For S programmers who like the flexibility and elegant syntax of the S language and the support provided to owners of an S-PLUS license, the option to analyze and model large data sets in S has been a long-awaited enhancement.
Out-of-Memory Processing
The Big Data library provides this enhancement by processing large data sets using scalable algorithms and data streaming. Instead of loading the contents of a large data file into memory, S-PLUS creates a special binary cache file of the data on the user’s hard disk, and then operates on that cache file directly.
Scalable Algorithms
Although the large data set is stored on the hard drive, the scalable algorithms of the Big Data library are designed to optimize access to the data, reading from disk a minimum number of times. Many techniques require a single pass through the data, and the data is read from the disk in blocks, not randomly, to minimize disk access times. These scalable algorithms are described in more detail in the section The Big Data Library Architecture on page 8.
Data Streaming
S-PLUS operates on the binary data cache file directly, using “streaming” techniques, where data flows through the application rather than being processed all at once in memory. The cache file is processed on a row-by-row basis, meaning that only a small part of the data is stored in RAM at any one time. It is this out-of-memory data processing technique that enables S-PLUS to process data sets hundreds of megabytes, or even gigabytes, in size without requiring large quantities of RAM.
Data Type
S-PLUS provides the large data frame, an object of class bdFrame. A big data frame object is similar in function to standard S-PLUS data frames, except that its data is stored in a cache file on disk rather than in RAM. The bdFrame object is essentially a reference to that external file: while you can create a bdFrame object that represents an extremely large data set, the bdFrame object itself requires very little RAM.
For more information on bdFrame, see the section Data Frames on
page 11.
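For example, coercing a small in-memory data frame shows how lightweight the reference object is. This is a sketch: the coercion function as.bdFrame is an assumption here, so check the function tables in the Appendix for the authoritative name.

> # Coerce an in-memory data frame to a bdFrame; the data moves to an
> # on-disk cache file, and the bdFrame itself is only a small reference.
> # as.bdFrame is an assumed name; see the Appendix function tables.
> bigFuel <- as.bdFrame(fuel.frame)  # fuel.frame: built-in S-PLUS data set
> class(bigFuel)                     # "bdFrame"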
S-PLUS also provides time date (bdTimeDate), time span (bdTimeSpan),
and series (bdSeries, bdSignalSeries, and bdTimeSeries) support for
large data sets. For more information, see the section Time Date
Creation on page 157 in the Appendix.
Flexibility
The Big Data library provides reading, manipulating, and analyzing capability for large data sets using the familiar S programming language. Because most existing data frame methods work in the same way with bdFrame objects as they do with data.frame objects, the style of programming is familiar to S-PLUS programmers. Much existing code from previous versions of S-PLUS runs without modification.
Balancing Scalability with Performance
While accessing data on disk (rather than in RAM) allows for scalable statistical computing, some compromises are inevitable. The most obvious of these is computation speed. The Big Data library provides scalable algorithms that are designed to minimize disk access, and therefore provide optimal performance with out-of-memory data sets. This makes S-PLUS a reliable workhorse for processing very large amounts of data. When your data is small enough for traditional S-PLUS, however, remember that in-memory processes are faster than out-of-memory processes.
If your data set size is not extremely large, all of the traditional S-PLUS in-memory algorithms remain available, so you need not compromise speed and flexibility for scalability when it is not needed.
No 64-Bit Solution
Are out-of-memory data analysis techniques still necessary in the 64-bit age? While 64-bit operating systems allow access to greater amounts of virtual memory, it is the amount of physical memory installed in the machine that limits how much data can be processed in RAM efficiently.
SIZE CONSIDERATIONS
While the Big Data library imposes no predetermined limit for the
number of rows allowed in a big data object or the number of
elements in a big data vector, your computer’s hard drive must
contain enough space to hold the data set and create the data cache.
Given sufficient disk space, the big data object can be created and
processed by any scalable function.
The speed of most Big Data library operations is proportional to the
number of rows in the data set: if the number of rows doubles, then
the processing time also doubles.
The amount of RAM in a machine imposes a predetermined limit on
the number of columns allowed in a big data object, because column
information is stored in the data set’s metadata. This limit is in the
tens of thousands of columns. If you have a data set with a large
number of columns, remember that some operations (especially
statistical modeling functions) increase at a greater than linear rate as
the number of columns increases. Doubling the number of columns
can have a much greater effect than doubling the processing time.
This is important to remember if processing time is an issue.
Block-based Computations
Data sets that are much larger than the system memory are manipulated by processing one “block” of data at a time. That is, if the data is too large to fit in RAM, then the data is broken into multiple data sets and the function is applied to each of them. As an example, a 1,000,000-row by 10-column data set of double values is 76MB in size, so it could be handled as a single data set on a machine with 256MB RAM. If the data set were 10,000,000 rows by 100 columns, it would be 7.4GB in size and would have to be handled as multiple blocks.
Table 1.1 lists a few of the optional arguments for the function
bd.options that you can use to set limits for caching and for
warnings:
Table 1.1: bd.options block-based computation arguments.
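For illustration, a call to bd.options might look like the following sketch. The argument name max.block.mb is an assumption; see the bd.options help topic in the S-PLUS Language Reference for the authoritative argument list.

> # Raise the block size limit (in MB) used for block-based caching,
> # then query the current setting. The argument name is an assumption.
> bd.options(max.block.mb = 100)
> bd.options("max.block.mb")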
THE BIG DATA LIBRARY ARCHITECTURE
Data Types
S-PLUS provides the following data types, described in more detail below:
Table 1.3: New data types and data names for S-PLUS.
Data Frames
The main object to contain your large data set is the big data frame, an object of class bdFrame. Most methods commonly used for a data.frame are also available for a bdFrame. Big data frame objects are similar to standard S-PLUS data frames, except in the following ways:
• A bdFrame object stores its data on disk, while a data.frame
object stores its data in RAM. As a result, a bdFrame object has
a much smaller memory footprint than a data.frame object.
• A bdFrame object does not have row labels, as a data.frame
object does. While this means that you cannot refer to the
rows of a bdFrame object using character row labels, this
design reduces storage requirements and improves
performance by eliminating the need to maintain unique row
labels.
• A bdFrame object can contain columns of only types double,
character, factor, timeDate, timeSpan or logical. No other
column types (such as matrix objects or user-defined classes)
are allowed. By limiting the allowed column types, S-PLUS
ensures that the binary cache file representing the data is as
compact as possible and can be efficiently accessed.
Note
You can specify the numbers of rows and columns to print using the bd.options function. See
bd.options in the S-PLUS Language Reference for more information.
Vectors
The S-PLUS Big Data library also introduces bdVector and six subclasses, which represent new vector types to support very long vectors. Like a bdFrame object, a big vector object stores its data out of memory, in a cache file on disk, so you can create very long big vector objects without needing a lot of RAM.
You can extract an individual column from a bdFrame object (using the $ operator) to create a big vector object. Alternatively, you can generate a big vector using the functions listed in Table A.3 in the Appendix. You can use standard vector operations, such as selections and mathematical operations, on these data types, and you can use them to create new columns in your data set.
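For example, a new column can be computed from existing bdVector columns. This is a sketch; the data set and column names here are hypothetical.

> # Dividing one bdVector column by another yields a new bdVector,
> # which is stored as a new column of the bdFrame.
> census$popPerHousing <- census$popTotal / census$housingTotal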
Models
When you perform statistical analysis on a large data set with the Big Data library, you can use familiar S-PLUS modeling functions and syntax, but you supply a bdFrame object as the data argument instead of a data frame. This forces the out-of-memory algorithms to be used rather than the traditional in-memory algorithms.
When you apply the modeling function lm to a bdFrame object, it
produces a model object of class bdLm. You can apply the standard
predict, summary, plot, residuals, coef, formula, anova, and fitted
methods to these new model objects.
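For example, fitting and summarizing a model on a bdFrame looks just like its in-memory equivalent. This is a sketch with hypothetical variables.

> # lm dispatches on the bdFrame data argument and returns a bdLm.
> fit <- lm(popTotal ~ housingTotal, data = census)  # census: a bdFrame
> class(fit)    # "bdLm"
> summary(fit)  # the standard summary method applies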
For more information on statistical modeling, see Chapter 2, Census
Data Example.
Series Objects
The standard S-PLUS library contains a series object, with two subclasses: timeSeries and signalSeries. A series object contains:
• A data component that is typically a data frame.
• A positions component that is a timeDate or timeSequence object (timeSeries), or a bdNumeric or numericSeries object (signalSeries).
• A units component that is a character vector with information on the units used in the data columns.
Classes
The Big Data library follows the same object-oriented design as the standard S-PLUS Sv4 design. For a review of object-oriented programming concepts, see Chapter 8, Object-Oriented Programming in S-PLUS in the Programmer’s Guide.
Each object has a class that defines methods that act on the object.
The library is extensible; you can add your own objects and classes,
and you can write your own methods.
The following classes are defined in the Big Data library. For more
information about each of these classes, see their individual help
topics.
Table 1.5: Big Data classes (class names and descriptions).
Functions
In addition to the standard S-PLUS functions that are available to call on large data sets, the Big Data library includes functions specific to big data objects. These functions include the following:
• Big vector generating functions.
• Data exploration and manipulation functions.
• Traditional and Trellis graphics functions.
• Modeling functions.
The functions for these general tasks are listed in the Appendix.
Data Import and Export
Two of the most frequent tasks in S-PLUS are importing and exporting data. The relevant functions are described in Table A.1 in the Appendix. You can perform these tasks from the Commands window, from the Console view in the S-PLUS Workbench, or from the S-PLUS import and export dialog boxes in the S-PLUS GUI. For more information about importing large data sets, see the section Data Import on page 25 in Chapter 2, Census Data Example.
Big Vector Generation
To generate a vector for a large data set, call one of the S-PLUS functions described in Table A.3 in the Appendix. When you set the bigdata flag to TRUE, the standard S-PLUS functions generate a bdVector object of the specified type.
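For example, a generating function might be called as in the following sketch; the use of rnorm here is an assumption, so see Table A.3 for the functions that actually accept the bigdata flag.

> # Generate ten million standard normal values as an
> # out-of-memory vector.
> x <- rnorm(10000000, bigdata = T)
> class(x)  # a bdVector subclass, such as bdNumeric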
Data Exploration Functions
After you import your data into S-PLUS and create the appropriate objects, you can use the functions described in Table A.4 in the Appendix to compare, correlate, crosstabulate, and examine univariate computations.
Data Manipulation Functions
After you import and examine your data in S-PLUS, you can use the data manipulation functions to append, filter, and clean the data. For an overview of these functions, see Table A.5 in the Appendix. For a more in-depth discussion of these functions, see the section Data Manipulation on page 37 in Chapter 2, Census Data Example.
Graph Functions
The Big Data library supports graphing large data sets intelligently, using the following techniques to manage many thousands or millions of data points:
Note
The Windows GUI editable graphics do not support big data objects. To use these graphics,
create a data frame containing either all of the data or a sample of the data.
Modeling Functions
Algorithms for large data sets are available for the following statistical modeling types:
• Linear regression.
• Generalized linear regression.
• Clustering.
• Principal components.
See the section Models on page 12 for more information about the modeling objects.
If the data argument for a modeling function is a big data object, then
S-PLUS calls the corresponding big data modeling function. The
modeling function returns an object with the appropriate class, such
as bdLm.
See Table A.12 in the Appendix for a list of the modeling functions
that return a model object.
See Tables A.10 through A.13 in the Appendix for lists of the
functions available for large data set modeling. See the S-PLUS
Language Reference for more information about these functions.
Formula operators
The Big Data library supports the formula operators +, -, *, :, %in%, and /.
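A sketch of these operators in a model formula; the variable names are hypothetical.

> # Main effects with an interaction (*) and nesting (%in%).
> fit <- lm(income ~ age * gender + county %in% state, data = bigCensus)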
Time Classes
The following classes support time operations in the Big Data library. See the Appendix for more information.
Table 1.6: Time classes.
Time Series Operations
Time series operations are available through the bdTimeSeries class and its related functions. The bdTimeSeries class supports the same methods as the standard S-PLUS library’s timeSeries class. See the S-PLUS Language Reference for more information about these classes.
Time and Date Operations
• When you create a time object using timeSeq, and you set the bigdata argument to TRUE, a bdTimeDate object is created.
• When you create a time object using timeDate or timeCalendar, and any of the arguments are big data objects, a bdTimeDate object is created.
See Table A.14 in the Appendix.
Note
bdTimeDate always assumes times are Greenwich Mean Time (GMT); S-PLUS stores no time zone with the object. You can convert to a time zone with timeZoneConvert, or specify the zone in the bdTimeDate constructor.
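For example, a big time/date sequence can be created and converted as in the following sketch; the argument values shown are hypothetical.

> # Create a big time/date sequence, then convert its zone.
> td <- timeSeq(from = "1/1/2000", to = "12/31/2005",
  by = "days", bigdata = T)
> class(td)  # "bdTimeDate"
> tdLocal <- timeZoneConvert(td, "PST")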
Time Conversion Operations
To convert time and date values, apply the standard S-PLUS time conversion operations to the bdTimeDate object, as listed in Table A.14 in the Appendix.
Matrix Operations
The Big Data library does not contain separate equivalents to matrix and data.frame. Instead, S-PLUS matrix operations are available for bdFrame objects:
• Matrix algebra (+, -, /, *, !, &, |, >, <, ==, !=, <=, >=, %%, %/%)
• Matrix multiplication (%*%)
• Cross-product (crossprod)
In algebraic operations, the operators require the big data objects to have appropriately corresponding dimensions; rows or columns are not automatically replicated.
Basic algebra
You can perform addition, subtraction, multiplication, division, logical (!, &, and |), and comparison (>, <, ==, !=, <=, >=) operations between:
• A scalar and a bdFrame.
• Two bdFrames of the same dimension.
• A bdFrame and a single-row bdFrame with the same number of columns.
• A bdFrame and a single-column bdFrame with the same number of rows.
The library also supports element-wise +, -, *, and /, and matrix multiplication (%*%) for two bdFrames with appropriate dimensions.
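For example, the operations above might be used as in the following sketch; the objects shown are hypothetical.

> # Element-wise algebra between a scalar and a bdFrame, and matrix
> # multiplication between bdFrames with conforming dimensions.
> scaled <- bdf * 100    # scalar * bdFrame, element-wise
> xtx <- crossprod(bdf)  # equivalent to t(bdf) %*% bdf
> pred <- bdf %*% wts    # wts: a single-column bdFrame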
Summary
In this section, we provided an overview of the Big Data library architecture, including the new data types, classes, and functions that support managing large data sets. For more detailed information, and for lists of the functions included in the Big Data library, see the Appendix: Big Data Library Functions.
In the next chapter, we provide examples of working with data sets using the types, classes, and functions described in this chapter.
CENSUS DATA EXAMPLE
Introduction 22
Problem Description 22
Data Description 22
Exploratory Analysis 25
Data Import 25
Data Preparation 27
Tabular Summaries 31
Graphics 32
Data Manipulation 37
Stacking 37
Variable Creation 38
Factors 40
More Graphics 41
Clustering 45
Data Preparation 45
K-Means Clustering 46
Analyzing the Results 47
Modeling Group Membership 53
Building a Model 57
Summarizing the Fit 58
Characterizing the Group 58
INTRODUCTION
Census data provides a rich context for exploratory data analysis and
the application of both unsupervised (e.g., clustering) and supervised
(e.g., regression) statistical learning models. Furthermore, the data sets
(in their unaggregated state) are quite large. The US Census 2000
estimates the total US population at over 281 million people. In its
raw form, the data set (which includes demographic variables such as
age, gender, location, income and education) is huge. For this
example, we focus on a subset of the US Census data that allows us to
demonstrate principles of working with large data on a data set that
we have included in the product.
Problem Description
Census data has many uses. One of interest to the US government and many commercial enterprises is the geographical distribution of subpopulations and their characteristics. In this initial example, we look for distinct geographical groups based on age, gender, and housing information (data that is easy to obtain in a survey), and then characterize them by modeling the group structure as a function of much harder-to-obtain demographics such as income, education, race, and family structure.
Data Description
The data for this example is included with S-PLUS and is part of the US Census 2000 Summary File 3 (SF3). SF3 consists of 813 detailed tables of Census 2000 social, economic, and housing characteristics compiled from a sample of approximately 19 million housing units (about 1 in 6 households) that received the Census 2000 long-form questionnaire. The levels of aggregation for SF3 data are depicted in Figure 2.1.
The data for this example is the summary table aggregated by Zip Code Tabulation Areas (ZCTA5), depicted as the left-most branch of the schematic in Figure 2.1.
The following site provides download access to many additional SF3 summary tables:
http://www.census.gov/Press-Release/www/2002/sumfile3.html
Figure 2.1: US Census 2000 data grouping hierarchy schematic with implied
aggregation levels. The data used in this example comes from the Zip Code Tabulation
Area (ZCTA) depicted at the far left side of the schematic.
The variables included in the census data set are listed in Table 2.1.
They include the zip code, latitude and longitude for each zip code
region, and population counts. Population counts include the total
population for the region and a breakdown of the population by
gender and age group: Counts of males and females for ages 0 - 5, 5 -
10, ..., 80 - 85, and 85 or older.
Table 2.1: Census data set variables (Variable(s), New Variable Name(s), Description).
EXPLORATORY ANALYSIS
Data Import
The data is provided as a comma-separated text file (.csv format). The file is located in the SHOME location (by default, your installation directory) in /samples/bigdata/census/census.csv.
As mentioned on the previous page, you can also download an analysis script named new.census.demo.ssc to execute the commands referenced in this chapter.
Reading big data is identical to what you are familiar with in previous
versions of S-PLUS with one exception: an additional argument to
specify that the data object created is stored as a big data (bd) object.
> census <- importData(paste(getenv("SHOME"),
"/samples/bigdata/census/census.csv", sep=""),
stringsAsFactors=F, bigdata=T)
Figure 2.2: Viewing big data objects is done with the Data Viewer.
The Data View page (Figure 2.2) of the Data Viewer lists all rows
and all variables in a scrollable window plus summary information at
the bottom, including the number of rows, the number of columns,
and a count of the number of different types of variables (for
example, numeric or factor). From the summary information, we see
that census has 33,178 rows.
In addition to the Data View page, the Data Viewer contains tabs
with summary information for numeric, factor, character, and date
variables. These summary tabs provide quick access to minimums,
maximums, means, standard deviations, and missing value counts for
numeric variables and levels, level counts, and missing value counts
for factor variables.
Figure 2.3: The Numeric summary page of the Data Viewer provides quick access
to minimum, maximum, mean, standard deviation, and missing value count for
numeric data.
Data Preparation
Before beginning any data preparation, start by making the names more intuitive using the names assignment expression:
> names(census) <- c("zipcode", "lat", "long", "popTotal",
paste("male", seq(0, 85, by = 5), sep = "."),
paste("female", seq(0, 85, by = 5), sep = "."),
"housingTotal", "own", "rent")
The new names are shown in Table 2.1, along with the original names.
Note
The S-PLUS expression paste("male", seq(0, 85, by = 5), sep = ".") creates a sequence of 18
variable names starting with male.0 and ending with male.85. The call to seq generates a
sequence of integers from 0 to 85 incremented by 5, and the call to paste pastes together the
string “male” with the sequence of integers separated with a period (.).
A standard subscripting expression removes the bad popTotal rows. If your data is very large, however, using subscripting and nested function calls can result in a prohibitively lengthy execution time.
A more efficient “big data” way to remove rows with no population is
to use the bd.filter.rows function available in the Big Data library
in S-PLUS. bd.filter.rows has two required arguments:
1. data: the big data object to be filtered.
2. expr: an expression to evaluate. By default, the expression
must be valid, based on the rules of the row-oriented
Expression Language. For more details on the expression
language, see the help file for ExpressionLanguage.
Note
If you are familiar with the S-PLUS language, the Excel formula language, or another
programming language, you will find the row-oriented Expression Language natural and easy to
use. An expression is a combination of constants, operators, function calls, and references to
columns that returns a single value when evaluated.
For our example, the expression is simply popTotal > 0, which you
pass as a character string to bd.filter.rows. The more efficient way
to filter the rows is:
> census <- bd.filter.rows(census, expr= "popTotal > 0")
For example, the expression age > 40 & gender == "F" selects all rows with females greater than 40 years of age.
Now, remove the cases with bad zip codes by using the regular
expression function, regexpr, to find the row indices of zip codes that
have only numeric characters:
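The filtering code itself is sketched below, based on the notes that follow; the exact call in the original may differ in detail.

> # Keep only rows whose zip code contains nothing but digits.
> # regexpr returns a positive match position, or -1 for no match;
> # row.language=F selects the standard S-PLUS expression language.
> census <- bd.filter.rows(census,
  expr = "regexpr('^[0-9]+$', zipcode) > 0",
  row.language = F)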
Notes
• The call to the regexpr function finds all zip codes that have only integer characters in
them. The regular expression “^[0-9]+$” produces a search for strings that contain only
the characters 0, 1, 2, ..., 9. The ^ character indicates starting at the beginning of
the string, the $ character indicates continuing to the end of the string and the + symbol
implies any number of characters from the set {0, 1, 2,..., 9}.
• The call to bd.filter.rows specified the optional argument, row.language=F. This
argument produces the effect of using the standard S-PLUS expression language, rather
than the row-oriented Expression Language designed for row operations on big data.
Tabular Summaries
Generate the basic tabular summary of variables in the census data set with a call to the summary function, the same as for in-memory data frames. The call to summary is quite fast, even for very large data sets, because the summary information is computed and stored internally at the time the object is created.
> summary(census)
 zipcode            lat              long
 Length:32165       Min.:17964529    Min.:-176636755
 Class :character   Mean:38847016    Mean: -91103295
 Mode  :character   Max.:71299525    Max.: -65292575
 ...
      rent
 Min.:    0.000
 Mean: 1119.391
 Max.:40424.000
To check the class of objects contained in a big data data frame (class
bdFrame), call sapply, which applies a specified function to all the
columns of the bdFrame.
> sapply(census, class)
zipcode lat long popTotal
"bdCharacter" "bdNumeric" "bdNumeric" "bdNumeric"
Generate age distribution tables with the same operations you use for in-memory data. Multiply column means by 100 to convert to a percentage scale, and round the output to one decimal place:
> ageDist <-
colMeans(census[, 5:40] / census[, "popTotal"]) * 100
> round(matrix(ageDist,
nrow = 2,
byrow = T,
dimnames = list(c("Male", "Female"),
seq(0, 85, by=5))), 1)
numeric matrix: 2 rows, 18 columns.
0 5 10 15 20 25 30 35 40 45 50 55
Male 3.2 3.6 3.8 3.8 2.9 2.9 3.2 3.9 4.1 3.8 3.3 2.7
Female 3.0 3.4 3.6 3.4 2.7 2.8 3.2 3.9 4.0 3.7 3.3 2.7
60 65 70 75 80 85
Male 2.3 2.0 1.7 1.3 0.8 0.5
Female 2.3 2.1 2.0 1.7 1.2 1.1
Graphics
You can plot the columns of a bdFrame in the same manner as you do for regular (in-memory) data frames:
> hist(census$popTotal)
This produces a histogram of total population counts for all zip codes. Figure 2.4 displays the result.
Figure 2.4: Histogram of total population counts for all zip codes.
You can get fancier. In fact, in general, the Trellis graphics in S-PLUS
work on big data. For example, the median number of rental units
over all zip codes is 193:
> median(census$rent)
[1] 193
You would expect that, if the number of rental units is high (typical of
cities), the population would likewise be high. We can check this
expectation with a simple Trellis boxplot:
> bwplot(rent > 193 ~ log(popTotal), data = census)
Figure 2.5: Boxplots of the log of popTotal for the number of rental units above and
below the median, showing higher populations in areas with more rental units.
Note
The default scatterplot for big data is a hexbin scatterplot. The color shading of the hexagonal “points” indicates the number of observations in that region of the graph. For the darkest shaded hexagon in the center of the graph, over 800 zip codes are represented, as indicated by the legend on the right side of the graph.
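The command that produced this hexbin plot can be reconstructed from its axis labels as a sketch; the original call may differ.

> # A scatterplot of two bdVectors defaults to a hexbin plot.
> # 0.5 is added to rent to avoid taking log(0).
> plot(log(census$rent + 0.5), log(census$popTotal))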
Figure 2.6: Hexbin scatterplot of log(popTotal) versus log(rent + 0.5). The legend shows hexagon counts ranging from 1 to over 800.
Note
In creating this plot, the example starts with big out-of-memory data (census) and ends
with small in-memory summary data (ageDist) without having to do anything special to
transition between the two. S-PLUS takes care of the data management.
Figure 2.7: Age distribution by gender estimated by US Census 2000.
DATA MANIPULATION
The census data contains raw population counts by gender and age;
however, the counts for different genders and ages are in different
columns. To compare them more easily, stack the columns end to
end and create factors for gender and age. Start with the stacking
operation.
Stacking
The bd.stack function provides the needed stacking operation. Stack all the population counts for males and females for all ages with one call to bd.stack:
> censusStack <- bd.stack(census,
columns = 5:40,
replicate = c(1:4, 41:43),
stack.column.name = "pop",
group.column.name = "sexAge")
The first few rows of the resulting data are listed below. Notice the
values for the sexAge variable are the names of the columns that were
stacked.
> censusStack
** bdFrame: 1150236 rows, 9 columns **
zipcode lat long popTotal housingTotal own rent
1 601 18180103 -66749472 19143 5895 4232 1663
2 602 18363285 -67180247 42042 13520 10903 2617
3 603 18448619 -67134224 55592 19182 12631 6551
4 604 18498987 -67136995 3844 1089 719 370
5 606 18182151 -66958807 6449 2013 1463 550
pop sexAge
1 712 male.0
2 1648 male.0
3 2049 male.0
4 129 male.0
5 259 male.0
... 1150231 more rows ...
Notice that the census data started with a little over 33,000 rows.
Now, after stacking, there are over 1.15 million rows.
Variable Creation
Now create the sex and age factors. There are several ways to do this, but the most computationally efficient way for large data is to use the bd.create.columns function, along with the row-oriented expression language. Before starting, notice that the column names for the stacked columns (male.0, male.5, ..., female.80, female.85) can be separated into male and female groups simply by the number of characters in their names. All male names have seven or fewer characters, and all female names have eight or more characters. Therefore, by checking the number of characters in the string, you can determine whether the value should be “male” or “female”. Here is an example of the row-oriented expression language:
"ifelse(nchar(sexAge) > 7, 'female', 'male')"
Note
The age column in the call to bd.create.columns is stored as a character column so we have
more control when creating an age factor. A discussion of this is included in the next section
Factors.
When S-PLUS creates tables or graphics that use the levels as labels,
the order is as the levels are listed, rather than in numerical order.
To control the order of the levels of a factor, call the bdFactor
function directly and state explicitly the order for the levels. For
example, using the census data:
> censusStack[, "age"] <- bdFactor(censusStack[, "age"],
levels = c("0", "5", "10", "15", "20", "25",
"30", "35", "40", "45", "50", "55",
"60", "65", "70", "75", "80", "85"))
MORE GRAPHICS
The data is now prepared to allow more interesting graphics. For
example, create an age distribution plot conditional on gender (Figure
2.8) with the following call to bwplot, a Trellis graphic function:
> bwplot(age ~ log(popProp + 0.00001) | sex,
data = censusStack)
Note
0.00001 is added to the population proportions to avoid taking the log of zero.
Figure 2.8: Boxplots of logged relative population numbers by age and sex.
Note the span of the boxes for ages 80 and older when there are fewer than the median number of rental units, implying that the population numbers for this group drop dramatically in some areas where there are few rental units.
Figure 2.9: Boxplots of logged relative population numbers by age and rent>193.
Use the original data set census, rather than censusStack, because
census has just one row per zip code.
> census <- bd.create.columns(census,
exprs=c("lat/1.e6", "long/1.e6"),
names=c("lat","long"))
Figure 2.10: Hexbin scatterplot of latitudes and longitudes. Zip codes are denser
where populations are denser, so this plot displays relative population densities.
CLUSTERING
This section applies clustering techniques to the census data to find subpopulations (collections of zip code areas) with similar age distributions. The section Modeling Group Membership develops models that characterize the subgroups we find by clustering.
Data Preparation
The section Tabular Summaries computed the average age distribution across all zip code areas by age and gender, depicted in Figure 2.7.
Next, group zip-code areas by age distribution characteristics, paying
close attention to those that deviate from the national average. For
example, age distributions in areas with military bases, typically
dominated by young adult single males without children, should
stand out from the national average.
Unusual populations are most noticeable if the population
proportions (previously computed as pop/popTotal by age and
gender) are normalized by the national average. One way to
normalize is to divide population proportions in each age and gender
group by the national average for each age and gender group. The
(odds) ratio represents how similar (or dissimilar) a zip-code
population is from the national average. For example, a ratio of 2 for
females 85 years or older indicates that the proportion of women 85
and older is twice that of the national average.
To prepare the population proportions, recall that the national
averages are produced with the colMeans function:
> ageDist <-
colMeans(census[, 5:40] / census[, "popTotal"])
That is, transpose the data matrix, divide by a vector as long as each
column of the transposed matrix, and then transpose the matrix back.
K-Means Clustering
You are now ready to do the clustering. The big data version of k-means clustering is bdCluster. The important arguments are:
• The data (a bdFrame in this example).
• The columns to cluster (if not all columns of the bdFrame are to be included in the clustering operation).
Notes
To match the results presented here, set the random seed to 22 before calling bdCluster. To set
the seed, at the prompt, type set.seed(22).
This example focuses on only the age x gender distributions, so columns is set to just those
columns with population counts.
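A bdCluster call of the following general form runs the clustering. This is a sketch only: the input object name and the argument name for the number of clusters (k) are assumptions, and Figure 2.11 indicates that 40 groups were specified:

> set.seed(22)
> clusterFit <- bdCluster(censusN, columns = names(popProp), k = 40)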
Analyzing the Results
In this section, examine the results of applying k-means clustering to the census data. To get a sense of how big the clusters are and what they look like, start by combining cluster means and counts.
1. To compute cluster means, call bd.aggregate as follows:
> clusterMeans <- bd.aggregate(censusNPred,
columns = names(popProp),
by.columns="PREDICT.membership",
methods="mean")
[Panel headings from Figure 2.11, giving cluster group (k) and group size (N): k=10, N=1569; k=9, N=1394; k=8, N=1277; k=11, N=1260; k=14, N=1107; k=12, N=510; k=13, N=480; k=17, N=414; k=16, N=331; k=15, N=321; k=21, N=183; k=23, N=121; k=22, N=110; k=18, N=67; k=19, N=64; k=20, N=60; k=26, N=59; k=25, N=57.]
Figure 2.11: Age distribution barplots for the first 24 groups resulting from k-means
clustering with 40 groups specified. The horizontal lines in each panel correspond to
20 (the lower one) and 70 years of age. Females are to the left of the vertical and
males are to the right.
> source(paste(getenv("SHOME"),
"/samples/bigdata/census/my.vbar.q", sep=""))
> index16 <- rep(1:16, length = 24)
> par(mfrow=c(4,6))
> for(k in 1:24) {
my.vbar(bd.coerce(clusterMeansCounts), k=k,
plotcols=3:38,
Nreport.col=2,
col=1+index16[k])
}
Note
The bd.block.apply argument FUN is an S-PLUS function called to process a data frame. This function cannot itself perform big data operations; if it does, an error is generated. (This is true for bd.by.group and bd.by.window, as well.)
This function processes a list object, which contains one block of the census bdFrame. SP$in1 corresponds to the data, and SP$in1.pos corresponds to the starting row position of the block of the bdFrame that is passed to the function. The test if(SP$in1.pos == 1) checks whether the first block is being processed. If so, a call to plot is made; otherwise, a call to points is made. The call to bd.block.apply is:
> bd.block.apply(census, FUN = f)
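The function f itself is not reproduced above; a minimal sketch consistent with this description follows. The plotting details (column choice and point character) are assumptions:

> f <- function(SP) {
    if(SP$in1.pos == 1)
      plot(SP$in1$long, SP$in1$lat, pch = ".")
    else
      points(SP$in1$long, SP$in1$lat, pch = ".")
    invisible(NULL)
  }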
To make this next graph, select only those rows that belong to the cluster group of interest, and then coerce the result to a data frame; this demonstrates the simplicity of using both bdFrame and data.frame objects in the same function. Start by keeping only those variables that are useful for displaying the cluster group locations.
> censusNPsub <- bd.filter.columns(censusNPred,
keep = c("lat","long","PREDICT.membership"))
Figure 2.12: Plot of all zip code region centers with cluster group 20 overlaid in
another color. The double histogram in the bottom left corner displays the age
distributions for females to the left and males to the right for cluster group 20. The
horizontal lines in the histogram are at 20 and 70 years of age.
Notes
1. setk is created as a regular data frame using bd.coerce, assuming that once a given cluster group is selected, the data is small enough to process entirely in memory.
2. bd.block.apply is used to plot all the zip code region centers, which requires processing the entire bdFrame.
3. setk contains the latitude and longitude locations for zip code centers for the selected group, pred[k].
4. setk was created to demonstrate the use of both bdFrame objects and data.frame objects in a single function. Placing the cluster group points on the graph could also be accomplished in the function passed to bd.block.apply.
Modeling Group Membership
Building a Model
The cluster group membership variables are binary with “yes” or “no” values, indicating group membership for each zip code area. To get a sense of group membership characteristics, you can create a logistic model for each group of interest using glm, which has been extended to handle bdFrame objects. The syntax is identical to that of glm with regular data frames. The model specification is as follows:
> group18Fit
Call:
bdGlm(formula = group18 ~ ., family = binomial, data
= censusDemogr)
Coefficients:
(Intercept) housingTotal own
-51.49204 0.0002713171 -0.0005471851
Note
The glm function call is the same as for regular in-memory data frames; however, the extended
version of glm in the bigdata library applies appropriate methods to bdFrame data by initiating a
call to bdGlm. The call expression shows the actual call went to bdGlm.
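Based on the Call element printed above, the fit is produced with a call of the form:

> group18Fit <- glm(group18 ~ ., family = binomial,
    data = censusDemogr)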
Summarizing the Fit
You can apply the usual operations (for example, summary, coef, plot) to the resulting fit object. The plots are displayed as hexbin scatterplots because of the volume of data.
> plot(group18Fit)
Figure 2.13: Residuals vs. fitted values resulting from modeling cluster group 18
membership as a function of census demographics.
3
CREATING GRAPHICAL DISPLAYS OF LARGE DATA SETS
Introduction
Overview of Graph Functions
Functions Supporting Graphs
Example Graphs
Plotting Using Hexagonal Binning
Adding Reference Lines
Plotting by Summarizing Data
Creating Graphs with Preprocessing Functions
Unsupported Functions
INTRODUCTION
This chapter includes information on the following:
• An overview of the graph functions available in the Big Data
Library, listed according to whether they take a big data
object directly, or require a preprocessing function to produce
a chart.
• Procedures for creating plots, traditional graphs, and Trellis
graphs.
Note
In Microsoft Windows, editable graphs in the graphical user interface (GUI) do not support big data objects. To use these graphs, create an S-PLUS data.frame containing either all of the data or a sample of the data.
OVERVIEW OF GRAPH FUNCTIONS
Functions Supporting Graphs
This section lists the functions that produce graphs for big data objects. If you are unfamiliar with plotting and graph functions in S-PLUS, review the Guide to Graphics.
Implementing plotting and graph functions to support large data sets requires an intelligent way to handle thousands of data points. To address this need, the graph functions supporting big data are designed in the following categories:
• Functions to plot big data objects without preprocessing,
including:
• Functions to plot big data objects by hexagonal binning.
• Functions to plot big data objects by summarizing data in
a plot-specific manner.
• Functions providing the preprocessing support for plotting big
data objects.
• Functions requiring preprocessing support to plot big data
objects.
The following sections list the functions, organized into these
categories. For an alphabetical list of graph functions supporting big
data objects, see the Appendix.
Using cloud or parallel results in an error message. Instead, sample
or aggregate the data to create a data.frame that can be plotted using
these functions.
Graph Functions Using Hexagonal Binning
The following functions can plot a large data set (that is, can accept a big data object without preprocessing) by plotting large amounts of data using hexagonal binning.
Table 3.1: Functions for plotting big data using hexagonal binning.
Table 3.2: Functions that add reference lines to hexbin plots.
Functions Providing Support to Preprocess Data for Graphing
The following functions are used to preprocess large data sets for graphing:
Table 3.4: Functions used for preprocessing large data sets.
Functions Requiring Preprocessing Support for Graphing
The following functions do not accept a big data object directly to create a graph; rather, they require one of the specified preprocessing functions.
Table 3.5: Functions requiring preprocessors for graphing large data sets.
EXAMPLE GRAPHS
The examples in this chapter require that you have the Big Data
Library loaded. The examples are not large data sets; rather, they are
small data objects that you convert to big data objects to demonstrate
using the Big Data Library graphing functions.
Plotting Using Hexagonal Binning
The functions listed in Table 3.1 support big data objects by using
hexagonal binning. This section shows examples of how to call these
functions for a big data object.
Create a Pairwise Scatter Plot
The pairs function creates a figure that contains a scatter plot for each pair of variables in a bdFrame object.
To create a sample pairwise scatter plot for the fuel.frame bdFrame object, in the Commands window, type the following:
pairs(as.bdFrame(fuel.frame))
Create a Single Plot
The plot function can accept a hexbin object, a single bdVector, two bdVectors, or a bdFrame object. The following example plots a simple hexbin plot using the weight and mileage vectors of the fuel.bd object.
To create a sample single plot, in the Commands window, type the following:
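A call of the following form produces such a plot; this is a sketch in which fuel.bd is assumed to be the big data version of fuel.frame, with the standard column capitalization:

> fuel.bd <- as.bdFrame(fuel.frame)
> plot(fuel.bd$Weight, fuel.bd$Mileage)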
Create a Multi-Panel Scatterplot Matrix
The function splom creates a Trellis graph of a scatterplot matrix. The scatterplot matrix is a good tool for displaying measurements of three or more variables.
To create a sample multi-panel scatterplot matrix, where you create a
hexbin plot of the columns in fuel.bd against each other, in the
Commands window, type the following:
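A call of the following general form creates the scatterplot matrix; this sketch assumes fuel.bd is as.bdFrame(fuel.frame):

> splom(~., data = as.bdFrame(fuel.frame))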
Note
Trellis functions in the Big Data Library require the data argument. You cannot use formulas
that refer to bdVectors that are not in a specified bdFrame.
Notice that the ‘.’ is interpreted as all columns in the data set
specified by data.
Create a Conditioning Plot or Scatter Plot
The function xyplot creates a Trellis graph, which graphs one set of numerical values on a vertical scale against another set of numerical values on a horizontal scale.
To create a sample conditioning plot, in the Commands window,
type the following:
xyplot(data=as.bdFrame(air),
ozone~radiation|temperature,
shingle.args=list(n=4), lmline=T)
The variable on the left of the ~ goes on the vertical (or y) axis, and
the variable on the right goes on the horizontal (or x) axis.
The function xyplot contains the default argument lmline=T to add
the approximate least squares line to a panel quickly. This argument
performs the same action as panel.lmline in standard S-PLUS.
The xyplot plot is displayed as follows:
Adding Reference Lines
You can add a regression line or scatterplot smoother to hexbin plots. The regression line or smoother is a weighted fit, based on the binned values.
Add a Regression Line
When you create a scatterplot from your large data set, and you notice a linear association between the y-axis variable and the x-axis variable, you might want to display a straight line that has been fit to the data. Call lsfit to perform a least squares regression, and then use that regression to plot a regression line.
The following example draws an abline on the chart that plots
fuel.bd weight and mileage data. First, create a hexbin object and
plot it, and then add the abline to the plot.
To add a regression line to a sample plot, in the Commands window,
type the following:
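A sketch of these steps, assuming fuel.bd is the big data version of fuel.frame:

> fuel.bd <- as.bdFrame(fuel.frame)
> hexWM <- hexbin(fuel.bd$Weight, fuel.bd$Mileage)
> plot(hexWM)
> abline(lsfit(fuel.bd$Weight, fuel.bd$Mileage))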
Add a Least Squares Line to an xyplot
To add a reference line to an xyplot, set lmline=T. Alternatively, you can call panel.lmline or panel.loess. See the section Create a Conditioning Plot or Scatter Plot for an example.
Add a qqplot Reference Line
The function qqline fits and plots a line through a normal qqplot.
To add a qqline reference line to a sample qqplot, in the Commands window, type the following:
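A sketch of such a call, assuming fuel.bd is the big data version of fuel.frame and using the mileage column:

> fuel.bd <- as.bdFrame(fuel.frame)
> qqnorm(fuel.bd$Mileage)
> qqline(fuel.bd$Mileage)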
Create a Box Plot
The following example creates a simple box plot from fuel.bd. To create a Trellis box and whisker plot, see the following section.
To create a sample box plot, in the Commands window, type the following:
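A sketch of such a call; the choice of column is an assumption:

> fuel.bd <- as.bdFrame(fuel.frame)
> boxplot(fuel.bd$Fuel)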
Create a Trellis Box and Whisker Plot
The box and whisker plot provides a graphical representation showing the center and spread of a distribution.
To create a sample box and whisker plot in a Trellis graph, in the Commands window, type the following:
bwplot(Type~Fuel, data=(as.bdFrame(fuel.frame)))
Create a Trellis Density Plot
The following example creates a Trellis graph of a density plot, which displays the shape of a distribution. You can use the Trellis density plot for analyzing a one-dimensional data distribution. A density plot displays an estimate of the underlying probability density function for a data set, allowing you to approximate the probability that your data fall in any interval.
To create a sample Trellis density plot, in the Commands window,
type the following:
Create a Simple Histogram
A histogram displays the number of data points that fall in each of a specified number of intervals, giving an indication of the relative density of the data points along the horizontal axis. For this reason, density plots are often superposed with (scaled) histograms.
To create a sample hist chart of a full dataset for a numeric vector, in
the Commands window, type the following:
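A sketch of such a call, following the hist usage shown for the census data earlier; the fuel.bd object and column are assumptions:

> fuel.bd <- as.bdFrame(fuel.frame)
> hist(fuel.bd$Mileage)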
Create a Quantile-Quantile (QQ) Plot for Comparing Multiple Distributions
The functions qq, qqmath, qqnorm, and qqplot create an ordinary x-y plot of 500 evenly-spaced quantiles of data.
The function qq creates a Trellis graph comparing the distributions of two sets of data. Quantiles of one data set are graphed against corresponding quantiles of the other data set.
To create a sample qq plot, in the Commands window, type the
following:
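Consistent with the two-level requirement and the logical Compact discussed below, a call of this general form creates the plot; the numeric variable on the right of the ~ is an assumption:

> fuel.bd <- as.bdFrame(fuel.frame)
> qq(Type == "Compact" ~ Mileage, data = fuel.bd)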
The factor on the left side of the ~ must have exactly two levels
(fuel.bd$Compact has five levels).
The qq plot is displayed as follows:
(Note that in this example, by setting Type to the logical Compact, the labels are set to FALSE and TRUE on the x and y axes, respectively.)
Create a QQ Plot Using a Theoretical or Empirical Distribution
The function qqmath creates a normal probability plot in a Trellis graph; that is, the ordered data are graphed against quantiles of the standard normal distribution.
qqmath can also make probability plots for other distributions. It has an argument distribution, whose input is any function that computes quantiles. The default for distribution is qnorm. If you set distribution = qexp, the result is an exponential probability plot.
Create a Single Vector QQ Plot
The function qqnorm creates a plot using a single bdVector object. The following example creates a plot from the mileage vector of the fuel.bd object.
Create a Two Vector QQ Plot
The function qqplot creates a hexbin plot using two bdVectors. The quantile-quantile plot is a good tool for determining a good approximation to a data set’s distribution. In a qqplot, the ordered data are graphed against quantiles of a known theoretical distribution.
To create a sample two-vector qqplot, in the Commands window, type the following:
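A sketch of such a call; the choice of the two vectors is an assumption:

> fuel.bd <- as.bdFrame(fuel.frame)
> qqplot(fuel.bd$Weight, fuel.bd$Mileage)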
Create a One-Dimensional Scatter Plot
The function stripplot creates a Trellis graph similar to a box plot in layout; however, the individual data points are shown instead of the box plot summary.
To create a sample one-dimensional scatter plot, in the Commands window, type the following:
Creating Graphs with Preprocessing Functions
The functions discussed in this section do not accept a big data object directly to create a graph; rather, they require a preprocessing function such as those listed in the section Functions Providing Support to Preprocess Data for Graphing.
Create a Bar Chart
Calling barchart directly on a large data set produces a large number of bars, which results in an illegible plot.
• If your data contains a small number of cases, convert the data to a standard data.frame before calling barchart.
• If your data contains a large number of cases, first use aggregate, and then use bd.coerce to create the appropriate small data set.
In the following example, sum the yields over sites to get the total yearly yield for each variety.
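A sketch of this approach, using bd.aggregate (in place of aggregate) to compute the sums; the object names, the methods value, and the barley column names are assumptions:

> barley.bd <- as.bdFrame(barley)
> totals <- bd.aggregate(barley.bd, columns = "yield",
    by.columns = c("variety", "year"), methods = "sum")
> barchart(variety ~ yield | year, data = bd.coerce(totals))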
Create a Bar Plot
The following example creates a simple bar plot from fuel.bd, using table to preprocess data.
To create a sample bar plot using table to preprocess the data, in the Commands window, type the following:
Create a Trellis Contour Plot
The contourplot function creates a Trellis graph of a contour plot. For big data sets, contourplot requires a preprocessing function such as loess.
Create a Dot Chart
When you create a dot chart, you can use a grouping variable and group summary, along with other options. The function dotchart can be preprocessed using either table or tapply.
To create a sample dot chart using table to preprocess data, in the Commands window, type the following:
Create a Dot Plot
The function dotplot creates a Trellis graph that displays dots and gridlines to mark the data values in dot plots. The dot plot reduces most data comparisons to straightforward length comparisons on a common scale.
When using dotplot on a big data object, call dotplot after using aggregate to reduce the size of the data.
In the following example, sum the barley yields over sites to get the
total yearly yield for each variety.
To create a sample dot plot, in the Commands window, type the
following:
Create an Image Graph Using hist2d
The following example creates an image graph using hist2d to preprocess data. The function image creates an image, under some graphics devices, of shades of gray or colors that represent a third dimension.
Create a Trellis Level Plot
The levelplot function creates a Trellis graph of a level plot. For big data sets, levelplot requires a preprocessing function such as loess.
A level plot is essentially identical to a contour plot, but it has default options so you can view a particular surface differently. Like contour plots, level plots are representations of three-dimensional data in flat, two-dimensional planes. Instead of using contour lines to indicate heights in the z direction, level plots use colors. The following example produces a level plot of predictions from loess.
To create a sample Trellis level plot using loess to preprocess the
data, in the Commands window, type the following:
Create a persp Graph Using hist2d
The persp function creates a perspective plot given a matrix that represents heights on an evenly spaced grid. For more information about persp, see the section Perspective Plots in the Application Developer’s Guide.
To create a sample persp graph using hist2d to preprocess the data,
in the Commands window, type the following:
Create a Pie Chart
A pie chart shows the share of individual values in a variable, relative to the sum total of all the values. Pie charts display the same information as bar charts and dot plots, but can be more difficult to interpret. This is because the size of a pie wedge is relative to a sum, and does not directly reflect the magnitude of the data value. Because of this, pie charts are most useful when the emphasis is on an individual item’s relation to the whole; in these cases, the sizes of the pie wedges are naturally interpreted as percentages.
Calling pie directly on a big data object can result in a pie with
thousands of wedges; therefore, preprocess the data using table to
reduce the number of wedges.
To create a sample pie chart using table to preprocess the data, in the
Commands window, type the following:
Create a Trellis Pie Chart
The function piechart creates a pie chart in a Trellis graph.
• If your data contains a small number of cases, convert the data to a standard data.frame before calling piechart.
• If your data contains a large number of cases, first use aggregate, and then use bd.coerce to create the appropriate small data set.
To create a sample Trellis pie chart using aggregate to preprocess the
data, in the Commands window, type the following:
Unsupported Functions
Using the functions that add to a plot, such as points and lines, results in an error message.
4
ADVANCED PROGRAMMING INFORMATION
Introduction
Big Data Block Size Issues
Block Size Options
Group or Window Blocks
Big Data String and Factor Issues
String Column Widths
String Widths and importData
String Widths and bd.create.columns
Factor Column Levels
String Truncation and Level Overflow Errors
Storing and Retrieving Large S Objects
Managing Large Amounts of Data
Increasing Efficiency
bd.select.rows
bd.filter.rows
bd.create.columns
INTRODUCTION
As an S-PLUS Big Data library user, you might encounter unexpected
or unusual behavior when you manipulate blocks of data or work
with strings and factors.
This section includes warnings and advice about such behavior, and
provides examples and further information for handling these
unusual situations.
Alternatively, you might need to implement your own big-data
algorithms using out-of-memory techniques.
BIG DATA BLOCK SIZE ISSUES
Block Size Options
When processing big data, the system must decide how much data to read and process in each block. Each block should be as big as possible, because it is more efficient to process a few large blocks rather than many small blocks. However, available memory limits the block size. If space is allocated for a block that is larger than the physical memory on the computer, either the system uses virtual memory to store the block (which slows all operations), or the memory allocation fails.
The size of the blocks used is controlled by two options:
• bd.options("block.size")
The option "block.size" specifies the maximum number of
rows to be processed at a time, when executing big data
operations. The default value is 1e9; however, the actual
number of rows processed is determined by this value,
adjusted downwards to fit within the value specified by the
option "max.block.mb".
• bd.options("max.block.mb")
The option "max.block.mb" places a limit on the maximum
size of the block in megabytes. The default value is 10.
When S-PLUS reads a given bdFrame, it sets the block size initially to
the value passed in "block.size", and then adjusts downward until
the block size is no greater than "max.block.mb". Because the default
for "block.size" is set so high, this effectively ensures that the size of
the block is around the given number of megabytes.
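For example, to inspect or raise the block size cap:

> bd.options("max.block.mb")    # view the current cap, in megabytes
> bd.options(max.block.mb = 20) # allow blocks of roughly twice the default size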
The resulting number of rows in a block depends on the types and
numbers of columns in the data. Given the default "max.block.mb" of
10 megabytes, reading a bdFrame with a single numeric column could
block defined by the group columns or the window size, and it will
generate an error if a data block is larger than
bd.options("max.block.mb").
If you suspect that increasing the block size could help the performance of a particular computation, the best strategy is to measure the performance of the computation with bd.options("max.block.mb") set to the default of 10, and then measure it again with bd.options("max.block.mb") set to 20. If this test shows no significant performance improvement, increasing the block size further probably will not help, and could lead to out-of-memory problems. Using large block sizes can actually lead to worse performance if it causes virtual memory page swapping.
Group or Window Blocks
Note that the “block” size determined by these options and the data is distinct from the “blocks” defined in the functions bd.by.group, bd.by.window, bd.split.by.group, and bd.split.by.window. These functions divide their input data into subsets to process, as determined by the values in certain columns or a moving window. S-PLUS imposes a limit on the size of the data that can be processed in each block by bd.by.group and bd.by.window: if the number of rows in a block is larger than the block size determined by
110
Big Data Block Size Issues
BIG.GROUPS <- data.frame(
    GENDER=rep(c("MALE","FEMALE"), length=1000),
    NUM=rnorm(1000))
bd.options(block.size=5000)
bd.by.group(BIG.GROUPS, by.columns="GENDER",
    FUN=function(df)
        data.frame(GENDER=df$GENDER[1],
            NROW=nrow(df)))
  GENDER NROW
1 FEMALE  500
2   MALE  500
If the block size is set below the size of the groups, this same
operation will generate an error:
bd.options(block.size=10)
bd.by.group(BIG.GROUPS, by.columns="GENDER",
    FUN=function(df)
        data.frame(GENDER=df$GENDER[1],
            NROW=nrow(df)))
Problem in bd.internal.exec.node(engine.class = :
BDLManager$BDLSplusScriptEngineNode (0): Problem in
bd.internal.by.group.script(IM, function(..: can't process
block with 500 rows for group [FEMALE]: can only process 10
rows at a time (check bd.options() values for block.size
and max.block.mb)
Use traceback() to see the call stack
To process groups larger than the block size limit, use bd.split.by.group, which returns a list containing one bdFrame for each group. The following code computes the same group counts by applying nrow to each element of the list:

BIG.GROUPS.LIST <- bd.split.by.group(BIG.GROUPS,
    by.columns="GENDER")
data.frame(GENDER=names(BIG.GROUPS.LIST),
    NROW=sapply(BIG.GROUPS.LIST, nrow, simplify=T),
    row.names=NULL)
  GENDER NROW
1 FEMALE  500
2   MALE  500
Big Data String and Factor Issues
String Column Widths

When a bdFrame character column is initially defined, before any data is stored in it, the maximum number of characters (the string width) that can appear in the column must be specified. This restriction is necessary for rapid access to the cache file. Once the width is specified, an attempt to store a longer string in the column causes the string to be truncated, with a warning. It is therefore important to specify the maximum string width correctly. All of the big data operations attempt to estimate this width, but there are situations where the estimated value is incorrect. In these cases, you can specify the column string width explicitly.
To retrieve the actual column string widths used in a particular
bdFrame, call the function bd.string.column.width.
Unless the column string width is explicitly specified in other ways,
the default string width for newly-created columns is set with the
following option. The default value is 32.
bd.options("string.column.width")
String Widths and importData

When you import a big data object using importData for file types other than ASCII text, S-PLUS determines the maximum number of characters in each string column and uses this value to set the bdFrame column string width.
When you import ASCII text files, S-PLUS measures the maximum
number of characters in each column while scanning the file to
determine the column types. The number of lines scanned is
controlled by the argument scanLines. If this is too small, and the
scan stops before some very long strings, it is possible for the
estimated column width to be too low. For example, the following
code generates a file with steadily-longer strings.
f <- tempfile()
cat("strsize,str\n", file=f)
for(x in 1:30) {
    str <- paste(rep("abcd:", x), collapse="")
    cat(nchar(str), ",", str, "\n", sep="",
        append=T, file=f)
}
Importing this file with the default scanLines value (256) detects that the longest string has 150 characters, and sets the column string width correctly:

dat <- importData(f, type="ASCII", stringsAsFactors=F,
    bigdata=T)
bd.string.column.width(dat)
strsize str
     -1 150
(In the above output, the strsize value of -1 represents the value for
non-character columns.)
If you import this file with the scanLines argument set to scan only the first few lines, the column string width is set too low. In the following call, only the first 10 lines are scanned, so the column string width is set to 45 characters; longer strings are truncated, and a warning is generated:

dat <- importData(f, type="ASCII", stringsAsFactors=F,
    bigdata=T, scanLines=10)
Warning messages:
"ReadTextFileEngineNode (0): output column str has 21
string values truncated because they were longer than the
column string width of 45 characters -- maximum string size
before truncation was 150 characters" in:
bd.internal.exec.node(engine.class = engine.class, ...
You can read this data correctly without scanning the entire file by
explicitly setting bd.options("default.string.column.width")
before the call to importData:
bd.options("default.string.column.width"=200)
dat <- importData(f, type="ASCII", stringsAsFactors=F,
bigdata=T, scanLines=10)
bd.string.column.width(dat)
strsize str
-1 200
This string truncation does not occur when S-PLUS reads long strings
as factors, because there is no limit on factor-level string length.
One more point to remember when you import strings: the low-level
importData and exportData code truncates any strings (either
character strings or factor levels) that have more than 254 characters.
When bigdata=T, importData generates a warning if it encounters such strings.
String Widths and bd.create.columns

You can use one of the following techniques to set string column widths explicitly:

• To set the default width (if it is not determined some other way), use bd.options("string.column.width").

• To override the default column string widths in bd.block.apply, specify the out1.column.string.widths list element when IM$test==T, or when outputting the first non-NULL output block.

• In bd.create.columns, specify the string.column.width argument, as in the following example:
bd.create.columns(as.bdFrame(fuel.frame),
    "Type+Type", "t2", "character",
    string.column.width=6)
Factor Column Levels

Because of the way that bdFrame factor columns are represented, a factor cannot have an unlimited number of levels. The number of levels is restricted to the value of the following option (the default is 500):

bd.options("max.levels")
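For example, if a column legitimately contains more than 500 categories, the limit can be raised before the factor is created (a sketch; the value 2000 is an arbitrary illustration):

```s
# Allow factors with up to 2000 levels instead of the default 500.
bd.options(max.levels=2000)
```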
If you attempt to create a factor with more than this many levels, a warning is generated and the overflowing values are set to NA. For example, creating a factor column with 2000 distinct values produces the following warning:
Warning messages:
"CreateColumnsEngineNode (0): output column f has 1500 NA
values due to categorical level overflow (more than 500
levels) -- you may want to change this column type from
categorical to string" in: bd.internal.exec.node(
engine.class = engine.class, node.props = node.props, ....
summary(dat)
      num              f
 Min.:     1.0    x99:        1
 1st Qu.:  500.8  x98:        1
 Median:  1001.0  x97:        1
 Mean:    1001.0  x96:        1
 3rd Qu.: 1500.0  x95:        1
 Max.:    2000.0  (Other):  495
                  NA's:    1500
Note
Strings are used for identifiers (such as street addresses or social security numbers), while factors
are used when you have a limited number of categories (such as state names or product types)
that are used to group rows for tables, models, or graphs.
To generate an error, rather than a warning, when a string is truncated or when a factor exceeds the maximum number of levels, set the following options:

bd.options("error.on.string.truncation"=T)
bd.options("error.on.level.overflow"=T)
Storing and Retrieving Large S Objects
The Big Data library provides two functions for this purpose:

• bd.pack.object

• bd.unpack.object
Creating a Packed Object with bd.pack.object

In the following example, use the data object fuel.frame to create 1000 linear models. The resulting object takes about 6MB.

In the Commands window, type the following:

# Create the linear models:
many.models <- lapply(1:1000, function(x)
    lm(Fuel ~ Weight + Disp., sample(fuel.frame, size=30)))
object.size(many.models)
[1] 6210981
You can make a smaller object by packing each model. While this exercise takes longer, the resulting object is smaller than 2MB.

In the Commands window, type the following:

many.models.packed <- lapply(many.models,
    bd.pack.object)
object.size(many.models.packed)
[1] 1880041
Restoring a Packed Object with bd.unpack.object

Remember, if you use bd.pack.object, you must unpack the object before you can use it again. The following example code unpacks some of the models within the many.models.packed object and displays them in a plot.

In the Commands window, type the following:

for(x in 1:5)
    plot(
        bd.unpack.object(many.models.packed[[x]]),
        which.plots=3)
Summary

The example above shows a space difference of only a few MB (6MB to 2MB), which is probably not a large enough saving to justify the time spent packing the object. However, if each of the model objects were very large, and the whole list were too large to keep in memory, the packed version would be useful.
INCREASING EFFICIENCY
The Big Data library offers several alternatives to standard S-PLUS functions that provide greater efficiency when you work with a large data set. Key efficiency functions include:

Table D.1: Efficient Big Data library functions.
Note that for the last function in the table above, specifying copy=F creates a new column without copying the old columns.
APPENDIX: BIG DATA LIBRARY FUNCTIONS
Introduction 124
Big Data Library Functions 125
Data Import and Export 125
Object Creation 126
Big Vector Generation 127
Big Data Library Functions 128
Data Frame and Vector Functions 136
Graph Functions 150
Data Modeling 152
Time Date and Series Functions 156
INTRODUCTION
The Big Data library is supported by many standard S-PLUS
functions, such as basic statistical and mathematical functions,
properties functions, densities and quantiles functions, and so on. For
more information about these functions, see their individual help
topics. (To display a function’s help topic, in the Commands window,
type help(functionname).)
The Big Data library also contains functions specific to big data objects. These functions include the following:

• Import and export functions

• Object creation functions

• Big vector generating functions

• Data exploration and manipulation functions

• Traditional and Trellis graphics functions

• Modeling functions

These functions are described further in the following sections.
BIG DATA LIBRARY FUNCTIONS
Data Import and Export

For more information and usage examples, see the functions' individual help topics.

Table A.1: Import and export functions.
Object Creation

The following methods create an object of the specified type. For more information and usage examples, see the functions' individual help topics.

Table A.2: Big Data library object creation functions.
Function
bdCharacter
bdCluster
bdFactor
bdFrame
bdGlm
bdLm
bdLogical
bdNumeric
bdPrincomp
bdSignalSeries
bdTimeDate
bdTimeSeries
bdTimeSpan
Big Vector Generation

For the following methods, set the bigdata argument to TRUE to generate a bdVector. This instruction applies to all functions in this table. For more information and usage examples, see the functions' individual help topics.

Table A.3: Vector generation methods for large data sets.
Method name
rbeta
rbinom
rcauchy
rchisq
rep
rexp
rf
rgamma
rgeom
rhyper
rlnorm
rlogis
rmvnorm
rnbinom
rnorm
rnrange
rpois
rstab
rt
runif
rweibull
rwilcox
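For example, the following sketch uses the bigdata argument described above to generate a large vector as a bdVector rather than as an ordinary in-memory vector (the vector length is an arbitrary illustration):

```s
# One million standard normal values, stored as a big data
# vector (bdNumeric) rather than a regular numeric vector.
x <- rnorm(1000000, bigdata=T)

# The result can then be used with the bdVector methods
# listed in this appendix, for example:
mean(x)
```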
Big Data Library Functions

The Big Data library introduces a set of "bd" functions designed to work efficiently on large data. For the best performance, write code that minimizes the number of passes through the data; the Big Data library functions are designed to make as few passes as possible. For more information and usage examples, see the functions' individual help topics.
Data Exploration Functions

Table A.4: Data exploration functions.
Data Manipulation Functions

Table A.5: Data manipulation functions.

Function name    Description
• natural join
• union
• merge
• between
• subqueries
Programming

Table A.6: Programming functions.
Data Frame and Vector Functions

The following table lists the functions implemented for both data frames (bdFrame) and vectors (bdVector). The cross-hatch (#) indicates that the function is implemented for the corresponding object type. The Comment column provides information about the function, or indicates which bdVector-derived class(es) the function applies to. For more information and usage examples, see the functions' individual help topics.

Table A.7: Functions implemented for bdVector and bdFrame.
- # #
!= # #
$ #
$<- #
[ # #
[[ # #
[[<- # #
[<- # #
abs #
aggregate # #
all # #
all.equal # #
any # #
anyMissing # #
append #
apply #
Arith # #
as.bdCharacter #
as.bdFactor #
as.bdFrame # #
as.bdVector # #
attr # #
attr<- # #
attributes # #
attributes<- # #
by #
casefold #
ceiling #
coerce # #
colIds #
colIds<- #
colMaxs # #
colMeans # #
colMins # #
colRanges # #
colSums # #
colVars # #
concat.two # #
cor # #
cut #
density #
densityplot #
diff # #
digamma #
dim #
floor # #
format # #
formula #
grep #
hist #
hist2d #
histogram #
html.table # #
intersect #
is.all.white #
is.element #
is.finite # #
is.infinite # #
is.na # #
is.nan # #
is.number # #
is.rectangular # #
length # #
mad #
match # #
matrix # #
mean # #
median #
merge # #
na.exclude # #
na.omit # #
ncol #
notSorted #
nrow #
numberMissing # #
Ops # #
pairs #
plot # #
pmatch #
print # #
qq #
qqmath #
qqnorm #
qqplot #
quantile #
range #
rank #
replace #
rev # #
rle #
rowMaxs #
rowMeans #
rowMins #
rowRanges #
rowSums #
rowVars #
runif #
sample # #
scale #
setdiff #
shiftPositions #
show # #
sort #
split #
stdev # Handles bdCharacter.
sub # #
sub<- #
substring #
substring<- #
summary # #
sweep #
t #
tapply # #
trigamma #
union #
unique # #
var # #
which.infinite # #
which.na # #
which.nan # #
xy2cell #
xyCall #
xyplot #
Graph Functions

For more information and examples for using the traditional graph functions, see their individual help topics, or see the section Functions Supporting Graphs on page 63.
Table A.8: Traditional graph functions.
Function name
barplot
boxplot
contour
dotchart
hexbin
hist
hist2d
image
interp
pairs
persp
pie
plot
qqnorm
qqplot
For more information about using the Trellis graph functions, see their
individual help topics, or see the section Functions Supporting
Graphs on page 63.
Table A.9: Trellis graph functions.
Function name
barchart
contourplot
densityplot
dotplot
histogram
levelplot
piechart
Note
The cloud and parallel graphics functions are not implemented for bdFrames.
Data Modeling

For more information and usage examples, see the functions' individual help topics.
Function name
bdCluster
bdGlm
bdLm
bdPrincomp
Function name
bd.model.frame.and.matrix
bs
ns
spline.des
contrasts
contrasts<-
Model Methods

The following table identifies functions implemented for generalized linear modeling, linear regression, principal components modeling, and clustering. The cross-hatch (#) indicates that the function is implemented for the corresponding modeling type.

Table A.12: Modeling and clustering functions.
Function name    Generalized linear modeling (bdGlm)    Linear regression (bdLm)    Principal components (bdPrincomp)    bdCluster
AIC #
all.equal #
anova # #
BIC #
coef # #
deviance # #
durbinWatson #
effects #
family # #
fitted # # # #
formula # #
kappa #
labels #
loadings #
logLik #
model.frame #
model.matrix #
plot # #
predict # # # #
print # # # #
print.summary # # #
qqnorm # #
residuals # #
screeplot #
step # #
summary # # #
Predict from Small Data Models

The following table lists the small data models that support the predict function. For more information and usage examples, see the functions' individual help topics.

Table A.13: Predicting from small data models.
arima.mle
bs
censorReg
coxph
coxph.penal
discrim
factanal
gam
glm
gls
gnls
lm
lme
lmList
lmRobMM
loess
loess.smooth
mlm
nlme
nls
ns
princomp
safe.predict.gam
smooth.spline
smooth.spline.fit
survreg
survReg
survReg.penal
tree
Time Date and Series Functions

The following tables include time date creation functions, and functions for manipulating time and date, time span, time series, and signal series objects.
Time Date Creation

Table A.14: Time date creation functions.

In the following table, the cross-hatch (#) indicates that the function is implemented for the corresponding class. If the table cell is blank, the function is not implemented for the class. This list includes bdVector objects (bdTimeDate and bdTimeSpan) and bdSeries classes (bdSignalSeries, bdTimeSeries).

Table A.15: Time date and series functions.
- # #
[ # # #
[<- #
+ # #
align # #
all.equal # #
Arith # #
as.bdFrame # # #
as.bdLogical # #
bd.coerce # # # #
ceiling # #
coerce/as # # # #
cor # # # #
cumsum #
cut # #
data.frameAux # # #
days #
deltat # #
diff # #
end # #
floor # #
hms #
hours #
match # #
Math # # # #
Math2 # # # #
max # #
mdy #
mean # # # #
median # # # #
min # #
minutes #
months #
plot # # # #
quantile # # # #
quarters #
range # #
seconds #
seriesLag # #
shiftPositions # #
show # # # #
sort # # # #
sort.list # # # #
split # #
start # #
substring<- # # # #
sum #
Summary # # # #
summary # # # #
timeConvert #
trunc # #
var # # # #
wdydy #
weekdays #
yeardays #
years #