Data Science in Theory and Practice

Techniques for Big Data Analytics and Complex Data Sets

Maria Cristina Mariani


University of Texas, El Paso
El Paso, United States

Osei Kofi Tweneboah


Ramapo College of New Jersey
Mahwah, United States

Maria Pia Beccar-Varela


University of Texas, El Paso
El Paso, United States
This first edition first published 2022
© 2022 John Wiley and Sons, Inc.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system,
or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or
otherwise, except as permitted by law. Advice on how to obtain permission to reuse material
from this title is available at http://www.wiley.com/go/permissions

The right of Maria Cristina Mariani, Osei Kofi Tweneboah, and Maria Pia Beccar-Varela to be
identified as the authors of this work has been asserted in accordance with law.

Registered Office
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

Editorial Office
111 River Street, Hoboken, NJ 07030, USA

For details of our global editorial offices, customer services, and more information about Wiley
products visit us at www.wiley.com

Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some
content that appears in standard print versions of this book may not be available in other
formats.

Limit of Liability/Disclaimer of Warranty


In view of ongoing research, equipment modifications, changes in governmental regulations,
and the constant flow of information relating to the use of experimental reagents, equipment,
and devices, the reader is urged to review and evaluate the information provided in the package
insert or instructions for each chemical, piece of equipment, reagent, or device for, among other
things, any changes in the instructions or indication of usage and for added warnings and
precautions. While the publisher and authors have used their best efforts in preparing this work,
they make no representations or warranties with respect to the accuracy or completeness of the
contents of this work and specifically disclaim all warranties, including without limitation any
implied warranties of merchantability or fitness for a particular purpose. No warranty may be
created or extended by sales representatives, written sales materials or promotional statements
for this work. The fact that an organization, website, or product is referred to in this work as a
citation and/or potential source of further information does not mean that the publisher and
authors endorse the information or services the organization, website, or product may provide
or recommendations it may make. This work is sold with the understanding that the publisher is
not engaged in rendering professional services. The advice and strategies contained herein may
not be suitable for your situation. You should consult with a specialist where appropriate.
Further, readers should be aware that websites listed in this work may have changed or
disappeared between when this work was written and when it is read. Neither the publisher nor
authors shall be liable for any loss of profit or any other commercial damages, including but not
limited to special, incidental, consequential, or other damages.

Library of Congress Cataloging-in-Publication Data applied for


ISBN: 9781119674689

Cover Design: Wiley


Cover Image: © nobeastsofierce/Shutterstock

Set in 9.5/12.5pt STIXTwoText by Straive, Chennai, India

10 9 8 7 6 5 4 3 2 1

Contents

List of Figures xvii


List of Tables xxi
Preface xxiii

1 Background of Data Science 1


1.1 Introduction 1
1.2 Origin of Data Science 2
1.3 Who is a Data Scientist? 2
1.4 Big Data 3
1.4.1 Characteristics of Big Data 4
1.4.2 Big Data Architectures 5

2 Matrix Algebra and Random Vectors 7


2.1 Introduction 7
2.2 Some Basics of Matrix Algebra 7
2.2.1 Vectors 7
2.2.2 Matrices 8
2.3 Random Variables and Distribution Functions 12
2.3.1 The Dirichlet Distribution 15
2.3.2 Multinomial Distribution 17
2.3.3 Multivariate Normal Distribution 18
2.4 Problems 19

3 Multivariate Analysis 21
3.1 Introduction 21
3.2 Multivariate Analysis: Overview 21
3.3 Mean Vectors 22
3.4 Variance–Covariance Matrices 24
3.5 Correlation Matrices 26

3.6 Linear Combinations of Variables 28


3.6.1 Linear Combinations of Sample Means 29
3.6.2 Linear Combinations of Sample Variance and Covariance 29
3.6.3 Linear Combinations of Sample Correlation 30
3.7 Problems 31

4 Time Series Forecasting 35


4.1 Introduction 35
4.2 Terminologies 36
4.3 Components of Time Series 39
4.3.1 Seasonal 39
4.3.2 Trend 40
4.3.3 Cyclical 41
4.3.4 Random 42
4.4 Transformations to Achieve Stationarity 42
4.5 Elimination of Seasonality via Differencing 44
4.6 Additive and Multiplicative Models 44
4.7 Measuring Accuracy of Different Time Series Techniques 45
4.7.1 Mean Absolute Deviation 46
4.7.2 Mean Absolute Percent Error 46
4.7.3 Mean Square Error 47
4.7.4 Root Mean Square Error 48
4.8 Averaging and Exponential Smoothing Forecasting Methods 48
4.8.1 Averaging Methods 49
4.8.1.1 Simple Moving Averages 49
4.8.1.2 Weighted Moving Averages 51
4.8.2 Exponential Smoothing Methods 54
4.8.2.1 Simple Exponential Smoothing 54
4.8.2.2 Adjusted Exponential Smoothing 55
4.9 Problems 57

5 Introduction to R 61
5.1 Introduction 61
5.2 Basic Data Types 62
5.2.1 Numeric Data Type 62
5.2.2 Integer Data Type 62
5.2.3 Character 63
5.2.4 Complex Data Types 63
5.2.5 Logical Data Types 64
5.3 Simple Manipulations – Numbers and Vectors 64
5.3.1 Vectors and Assignment 64

5.3.2 Vector Arithmetic 65


5.3.3 Vector Index 66
5.3.4 Logical Vectors 67
5.3.5 Missing Values 68
5.3.6 Index Vectors 69
5.3.6.1 Indexing with Logicals 69
5.3.6.2 A Vector of Positive Integral Quantities 69
5.3.6.3 A Vector of Negative Integral Quantities 69
5.3.6.4 Named Indexing 69
5.3.7 Other Types of Objects 70
5.3.7.1 Matrices 70
5.3.7.2 List 72
5.3.7.3 Factor 73
5.3.7.4 Data Frames 75
5.3.8 Data Import 76
5.3.8.1 Excel File 76
5.3.8.2 CSV File 76
5.3.8.3 Table File 77
5.3.8.4 Minitab File 77
5.3.8.5 SPSS File 77
5.4 Problems 78

6 Introduction to Python 81
6.1 Introduction 81
6.2 Basic Data Types 82
6.2.1 Number Data Type 82
6.2.1.1 Integer 82
6.2.1.2 Floating-Point Numbers 83
6.2.1.3 Complex Numbers 84
6.2.2 Strings 84
6.2.3 Lists 85
6.2.4 Tuples 86
6.2.5 Dictionaries 86
6.3 Number Type Conversion 87
6.4 Python Conditions 87
6.4.1 If Statements 88
6.4.2 The Else and Elif Clauses 89
6.4.3 The While Loop 90
6.4.3.1 The Break Statement 91
6.4.3.2 The Continue Statement 91
6.4.4 For Loops 91

6.4.4.1 Nested Loops 92


6.5 Python File Handling: Open, Read, and Close 93
6.6 Python Functions 93
6.6.1 Calling a Function in Python 94
6.6.2 Scope and Lifetime of Variables 94
6.7 Problems 95

7 Algorithms 97
7.1 Introduction 97
7.2 Algorithm – Definition 97
7.3 How to Write an Algorithm 98
7.3.1 Algorithm Analysis 99
7.3.2 Algorithm Complexity 99
7.3.3 Space Complexity 100
7.3.4 Time Complexity 100
7.4 Asymptotic Analysis of an Algorithm 101
7.4.1 Asymptotic Notations 102
7.4.1.1 Big O Notation 102
7.4.1.2 The Omega Notation, Ω 102
7.4.1.3 The Θ Notation 102
7.5 Examples of Algorithms 104
7.6 Flowchart 104
7.7 Problems 105

8 Data Preprocessing and Data Validations 109


8.1 Introduction 109
8.2 Definition – Data Preprocessing 109
8.3 Data Cleaning 110
8.3.1 Handling Missing Data 110
8.3.2 Types of Missing Data 110
8.3.2.1 Missing Completely at Random 110
8.3.2.2 Missing at Random 110
8.3.2.3 Missing Not at Random 111
8.3.3 Techniques for Handling the Missing Data 111
8.3.3.1 Listwise Deletion 111
8.3.3.2 Pairwise Deletion 111
8.3.3.3 Mean Substitution 112
8.3.3.4 Regression Imputation 112
8.3.3.5 Multiple Imputation 112
8.3.4 Identifying Outliers and Noisy Data 113
8.3.4.1 Binning 113

8.3.4.2 Box and Whisker plot 113


8.4 Data Transformations 115
8.4.1 Min–Max Normalization 115
8.4.2 Z-score Normalization 115
8.5 Data Reduction 116
8.6 Data Validations 117
8.6.1 Methods for Data Validation 117
8.6.1.1 Simple Statistical Criterion 117
8.6.1.2 Fourier Series Modeling and SSC 118
8.6.1.3 Principal Component Analysis and SSC 118
8.7 Problems 119

9 Data Visualizations 121


9.1 Introduction 121
9.2 Definition – Data Visualization 121
9.2.1 Scientific Visualization 123
9.2.2 Information Visualization 123
9.2.3 Visual Analytics 124
9.3 Data Visualization Techniques 126
9.3.1 Time Series Data 126
9.3.2 Statistical Distributions 127
9.3.2.1 Stem-and-Leaf Plots 127
9.3.2.2 Q–Q Plots 127
9.4 Data Visualization Tools 129
9.4.1 Tableau 129
9.4.2 Infogram 130
9.4.3 Google Charts 132
9.5 Problems 133

10 Binomial and Trinomial Trees 135


10.1 Introduction 135
10.2 The Binomial Tree Method 135
10.2.1 One Step Binomial Tree 136
10.2.2 Using the Tree to Price a European Option 139
10.2.3 Using the Tree to Price an American Option 140
10.2.4 Using the Tree to Price Any Path Dependent Option 141
10.3 Binomial Discrete Model 141
10.3.1 One-Step Method 141
10.3.2 Multi-step Method 145
10.3.2.1 Example: European Call Option 146
10.4 Trinomial Tree Method 147

10.4.1 What is the Meaning of Little o and Big O? 148


10.5 Problems 148

11 Principal Component Analysis 151


11.1 Introduction 151
11.2 Background of Principal Component Analysis 151
11.3 Motivation 152
11.3.1 Correlation and Redundancy 152
11.3.2 Visualization 153
11.4 The Mathematics of PCA 153
11.4.1 The Eigenvalues and Eigenvectors 156
11.5 How PCA Works 159
11.5.1 Algorithm 160
11.6 Application 161
11.7 Problems 162

12 Discriminant and Cluster Analysis 165


12.1 Introduction 165
12.2 Distance 165
12.3 Discriminant Analysis 166
12.3.1 Kullback–Leibler Divergence 167
12.3.2 Chernoff Distance 167
12.3.3 Application – Seismic Time Series 169
12.3.4 Application – Financial Time Series 171
12.4 Cluster Analysis 173
12.4.1 Partitioning Algorithms 174
12.4.2 k-Means Algorithm 174
12.4.3 k-Medoids Algorithm 175
12.4.4 Application – Seismic Time Series 176
12.4.5 Application – Financial Time Series 176
12.5 Problems 177

13 Multidimensional Scaling 179


13.1 Introduction 179
13.2 Motivation 180
13.3 Number of Dimensions and Goodness of Fit 182
13.4 Proximity Measures 183
13.5 Metric Multidimensional Scaling 183
13.5.1 The Classical Solution 184
13.6 Nonmetric Multidimensional Scaling 186
13.6.1 Shepard–Kruskal Algorithm 186
13.7 Problems 187

14 Classification and Tree-Based Methods 191


14.1 Introduction 191
14.2 An Overview of Classification 191
14.2.1 The Classification Problem 192
14.2.2 Logistic Regression Model 192
14.2.2.1 l1 Regularization 193
14.2.2.2 l2 Regularization 194
14.3 Linear Discriminant Analysis 194
14.3.1 Optimal Classification and Estimation of Gaussian Distribution 195
14.4 Tree-Based Methods 197
14.4.1 One Single Decision Tree 197
14.4.2 Random Forest 198
14.5 Applications 200
14.6 Problems 202

15 Association Rules 205


15.1 Introduction 205
15.2 Market Basket Analysis 205
15.3 Terminologies 207
15.3.1 Itemset and Support Count 207
15.3.2 Frequent Itemset 207
15.3.3 Closed Frequent Itemset 207
15.3.4 Maximal Frequent Itemset 208
15.3.5 Association Rule 208
15.3.6 Rule Evaluation Metrics 208
15.4 The Apriori Algorithm 210
15.4.1 An example of the Apriori Algorithm 211
15.5 Applications 213
15.5.1 Confidence 214
15.5.2 Lift 215
15.5.3 Conviction 215
15.6 Problems 216

16 Support Vector Machines 219


16.1 Introduction 219
16.2 The Maximal Margin Classifier 219
16.3 Classification Using a Separating Hyperplane 223
16.4 Kernel Functions 225
16.5 Applications 225
16.6 Problems 227

17 Neural Networks 231


17.1 Introduction 231
17.2 Perceptrons 231
17.3 Feed Forward Neural Network 231
17.4 Recurrent Neural Networks 233
17.5 Long Short-Term Memory 234
17.5.1 Residual Connections 235
17.5.2 Loss Functions 236
17.5.3 Stochastic Gradient Descent 236
17.5.4 Regularization – Ensemble Learning 237
17.6 Application 237
17.6.1 Emergent and Developed Market 237
17.6.2 The Lehman Brothers Collapse 237
17.6.3 Methodology 238
17.6.4 Analyses of Data 238
17.6.4.1 Results of the Emergent Market Index 238
17.6.4.2 Results of the Developed Market Index 238
17.7 Significance of Study 239
17.8 Problems 240

18 Fourier Analysis 245


18.1 Introduction 245
18.2 Definition 245
18.3 Discrete Fourier Transform 246
18.4 The Fast Fourier Transform (FFT) Method 247
18.5 Dynamic Fourier Analysis 250
18.5.1 Tapering 251
18.5.2 Daniell Kernel Estimation 252
18.6 Applications of the Fourier Transform 253
18.6.1 Modeling Power Spectrum of Financial Returns Using Fourier
Transforms 253
18.6.2 Image Compression 259
18.7 Problems 259

19 Wavelets Analysis 261


19.1 Introduction 261
19.1.1 Wavelets Transform 262
19.2 Discrete Wavelets Transforms 264
19.2.1 Haar Wavelets 265
19.2.1.1 Haar Functions 265
19.2.1.2 Haar Transform Matrix 266

19.2.2 Daubechies Wavelets 267


19.3 Applications of the Wavelets Transform 269
19.3.1 Discriminating Between Mining Explosions and Cluster of
Earthquakes 269
19.3.1.1 Background of Data 269
19.3.1.2 Results 269
19.3.2 Finance 271
19.3.3 Damage Detection in Frame Structures 275
19.3.4 Image Compression 275
19.3.5 Seismic Signals 275
19.4 Problems 276

20 Stochastic Analysis 279


20.1 Introduction 279
20.2 Necessary Definitions from Probability Theory 279
20.3 Stochastic Processes 280
20.3.1 The Index Set  281
20.3.2 The State Space  281
20.3.3 Stationary and Independent Components 281
20.3.4 Stationary and Independent Increments 282
20.3.5 Filtration and Standard Filtration 283
20.4 Examples of Stochastic Processes 284
20.4.1 Markov Chains 285
20.4.1.1 Examples of Markov Processes 286
20.4.1.2 The Chapman–Kolmogorov Equation 287
20.4.1.3 Classification of States 289
20.4.1.4 Limiting Probabilities 290
20.4.1.5 Branching Processes 291
20.4.1.6 Time Homogeneous Chains 293
20.4.2 Martingales 294
20.4.3 Simple Random Walk 294
20.4.4 The Brownian Motion (Wiener Process) 294
20.5 Measurable Functions and Expectations 295
20.5.1 Radon–Nikodym Theorem and Conditional Expectation 296
20.6 Problems 299

21 Fractal Analysis – Lévy, Hurst, DFA, DEA 301


21.1 Introduction and Definitions 301
21.2 Lévy Processes 301
21.2.1 Examples of Lévy Processes 304
21.2.1.1 The Poisson Process (Jumps) 305
21.2.1.2 The Compound Poisson Process 305

21.2.1.3 Inverse Gaussian (IG) Process 306


21.2.1.4 The Gamma Process 307
21.2.2 Exponential Lévy Models 307
21.2.3 Subordination of Lévy Processes 308
21.2.4 Stable Distributions 309
21.3 Lévy Flight Models 311
21.4 Rescaled Range Analysis (Hurst Analysis) 312
21.5 Detrended Fluctuation Analysis (DFA) 315
21.6 Diffusion Entropy Analysis (DEA) 316
21.6.1 Estimation Procedure 317
21.6.1.1 The Shannon Entropy 317
21.6.2 The H–𝛼 Relationship for the Truncated Lévy Flight 319
21.7 Application – Characterization of Volcanic Time Series 321
21.7.1 Background of Volcanic Data 321
21.7.2 Results 321
21.8 Problems 323

22 Stochastic Differential Equations 325


22.1 Introduction 325
22.2 Stochastic Differential Equations 325
22.2.1 Solution Methods of SDEs 326
22.3 Examples 335
22.3.1 Modeling Asset Prices 335
22.3.2 Modeling Magnitude of Earthquake Series 336
22.4 Multidimensional Stochastic Differential Equations 337
22.4.1 The multidimensional Ornstein–Uhlenbeck Processes 337
22.4.2 Solution of the Ornstein–Uhlenbeck Process 338
22.5 Simulation of Stochastic Differential Equations 340
22.5.1 Euler–Maruyama Scheme for Approximating Stochastic Differential
Equations 340
22.5.2 Euler–Milstein Scheme for Approximating Stochastic Differential
Equations 341
22.6 Problems 343

23 Ethics: With Great Power Comes Great Responsibility 345


23.1 Introduction 345
23.2 Data Science Ethical Principles 346
23.2.1 Enhance Value in Society 346
23.2.2 Avoiding Harm 346
23.2.3 Professional Competence 347
23.2.4 Increasing Trustworthiness 348

23.2.5 Maintaining Accountability and Oversight 348


23.3 Data Science Code of Professional Conduct 348
23.4 Application 350
23.4.1 Project Planning 350
23.4.2 Data Preprocessing 350
23.4.3 Data Management 350
23.4.4 Analysis and Development 351
23.5 Problems 351

Bibliography 353
Index 359

List of Figures

Figure 4.1 Time series data of phase arrival times of an earthquake. 36


Figure 4.2 Time series data of financial returns corresponding to Bank of
America (BAC) stock index. 37
Figure 4.3 Seasonal trend component. 40
Figure 4.4 Linear trend component. The horizontal axis is time t, and the
vertical axis is the time series Yt . (a) Linear increasing trend.
(b) Linear decreasing trend. 41
Figure 4.5 Nonlinear trend component. The horizontal axis is time t and the
vertical axis is the time series Yt . (a) Nonlinear increasing trend.
(b) Nonlinear decreasing trend. 41
Figure 4.6 Cyclical component (imposed on the underlying trend). The
horizontal axis is time t and the vertical axis is the time series
Yt . 42
Figure 7.1 The big O notation. 102
Figure 7.2 The Ω notation. 103
Figure 7.3 The Θ notation. 103
Figure 7.4 Symbols used in flowchart. 105
Figure 7.5 Flowchart to add two numbers entered by user. 106
Figure 7.6 Flowchart to find all roots of a quadratic equation
ax2 + bx + c = 0. 107
Figure 7.7 Flowchart. 108
Figure 8.1 The box plot. 113
Figure 8.2 Box plot example. 114
Figure 9.1 Scatter plot of temperature versus ice cream sales. 122
Figure 9.2 Heatmap of handwritten digit data. 124
Figure 9.3 Map of earthquake magnitudes recorded in Chile. 125

Figure 9.4 Spatial distribution of earthquake magnitudes (Mariani et al.


2016). 126
Figure 9.5 Number of text messages sent. 128
Figure 9.6 Normal Q–Q plot. 128
Figure 9.7 Risk of loan default. Source: Tableau Viz Gallery. 130
Figure 9.8 Top five publishing markets. Source: Modified from International
Publishers Association – Annual Report. 131
Figure 9.9 High yield defaulted issuer and volume trends. Source: Based on
Fitch High Yield Default Index, Bloomberg. 131
Figure 9.10 Statistics page for popular movies and cinema locations. Source:
Google Charts. 132
Figure 10.1 One-step binomial tree for the return process. 137
Figure 11.1 Height versus weight. 153
Figure 11.2 Visualizing low-dimensional data. 154
Figure 11.3 2D data set. 157
Figure 11.4 First PCA axis. 157
Figure 11.5 Second PCA axis. 157
Figure 11.6 New axis. 158
Figure 11.7 Scatterplot of Royal Dutch Shell stock versus Exxon Mobil
stock. 161
Figure 12.1 Classification (by quadrant) of earthquakes and explosions using
the Chernoff and Kullback–Leibler differences. 171
Figure 12.2 Classification (by quadrant) of Lehman Brothers collapse and
Flash crash event using the Chernoff and Kullback–Leibler
differences. 173
Figure 12.3 Clustering results for the earthquake and explosion series based
on symmetric divergence using PAM algorithm. 176
Figure 12.4 Clustering results for the Lehman Brothers collapse, Flash crash
event, Citigroup (2009), and IAG (2011) stock data based on
symmetric divergence using the PAM algorithm. 177
Figure 13.1 Scatter plot of data in Table 13.1. 180
Figure 16.1 The xy-plane and several other horizontal planes. 220
Figure 16.2 The xy-plane and several parallel planes. 221
Figure 16.3 The plane x + y + z = 1. 221
Figure 16.4 Two class problem when data is linearly separable. 224

Figure 16.5 Two class problem when data is not linearly separable. 224
Figure 16.6 ROC curve for linear SVM. 226
Figure 16.7 ROC curve for nonlinear SVM. 227
Figure 17.1 Single hidden layer feed-forward neural networks. 232
Figure 17.2 Simple recurrent neural network. 234
Figure 17.3 Long short-term memory unit. 235
Figure 17.4 Philippines (PSI). (a) Basic RNN. (b) LSTM. 239
Figure 17.5 Thailand (SETI). (a) Basic RNN. (b) LSTM. 240
Figure 17.6 United States (NASDAQ). (a) Basic RNN. (b) LSTM. 241
Figure 17.7 JPMorgan Chase & Co. (JPM). (a) Basic RNN. (b) LSTM. 242
Figure 17.8 Walmart (WMT). (a) Basic RNN. (b) LSTM. 243
Figure 18.1 3D power spectra of the daily returns from the four analyzed stock
companies. (a) Discover. (b) Microsoft. (c) Walmart. (d) JPM
Chase. 255
Figure 18.2 3D power spectra of the returns (generated per minute) from the
four analyzed stock companies. (a) Discover. (b) Microsoft.
(c) Walmart. (d) JPM Chase. 257
Figure 19.1 Time-frequency image of explosion 1 recorded by ANMO
(Table 19.2). 270
Figure 19.2 Time-frequency image of earthquake 1 recorded by ANMO
(Table 19.2). 270
Figure 19.3 Three-dimensional graphic information of explosion 1 recorded
by ANMO (Table 19.2). 272
Figure 19.4 Three-dimensional graphic information of earthquake 1 recorded
by ANMO (Table 19.2). 272
Figure 19.5 Time-frequency image of explosion 2 recorded by TUC
(Table 19.3). 273
Figure 19.6 Time-frequency image of earthquake 2 recorded by TUC
(Table 19.3). 273
Figure 19.7 Three-dimensional graphic information of explosion 2 recorded
by TUC (Table 19.3). 274
Figure 19.8 Three-dimensional graphic information of earthquake 2 recorded
by TUC (Table 19.3). 274
Figure 21.1 R∕S for volcanic eruptions 1 and 2. 322
Figure 21.2 DFA for volcanic eruptions 1 and 2. 323
Figure 21.3 DEA for volcanic eruptions 1 and 2. 323

List of Tables

Table 2.1 Examples of random vectors. 13


Table 3.1 Ramus Bone Length at Four Ages for 20 Boys. 33
Table 4.1 Time series data of the volume of sales over a six-hour
period. 50
Table 4.2 Simple moving average forecasts. 50
Table 4.3 Time series data used in Example 4.6. 52
Table 4.4 Weighted moving average forecasts. 52
Table 4.5 Trend projection of weighted moving average forecasts. 53
Table 4.6 Exponential smoothing forecasts of volume of sales. 55
Table 4.7 Exponential smoothing forecasts from Example 4.9. 56
Table 4.8 Adjusted exponential smoothing forecasts. 57
Table 6.1 Numbers. 83
Table 6.2 Files mode in Python. 93
Table 7.1 Common asymptotic notations. 103
Table 9.1 Temperature versus ice cream sales. 122
Table 12.1 Events information. 170
Table 12.2 Discriminant scores for earthquakes and explosions groups. 170
Table 12.3 Discriminant scores for Lehman Brothers collapse and Flash crash
event. 172
Table 12.4 Discriminant scores for Citigroup in 2009 and IAG stock in
2011. 172
Table 13.1 Data matrix. 180
Table 13.2 Distance matrix. 181
Table 13.3 Stress and goodness of fit. 182

Table 13.4 Data matrix. 188


Table 14.1 Models’ performances on the test dataset with 23 variables using
AUC and mean square error (MSE) values for the five
models. 201
Table 14.2 Top 10 variables selected by the Random forest algorithm. 201
Table 14.3 Performance for the four models using the top 10 features from
model Random forest on the test dataset. 201
Table 15.1 Market basket transaction data. 206
Table 15.2 A binary 0∕1 representation of market basket transaction
data. 206
Table 15.3 Grocery transactional data. 211
Table 15.4 Transaction data. 216
Table 16.1 Models' performances on the test dataset. 226
Table 18.1 Percentage of power for Discover data. 254
Table 18.2 Percentage of power for JPM data. 254
Table 18.3 Percentage of power for Microsoft data. 254
Table 18.4 Percentage of power for Walmart data. 254
Table 19.1 Determining p and q for N = 16. 266
Table 19.2 Percentage of total power (energy) for Albuquerque, New Mexico
(ANMO) seismic station. 271
Table 19.3 Percentage of total power (energy) for Tucson, Arizona (TUC)
seismic station. 271
Table 21.1 Moments of the Poisson distribution with intensity 𝜆. 306
Table 21.2 Moments of the Γ(a, b) distribution. 307
Table 21.3 Scaling exponents of Volcanic Data time series. 322

Preface

This textbook is dedicated to practitioners, graduate, and advanced undergraduate
students who are interested in Data Science, Business Analytics, and Statistical and
Mathematical Modeling in different disciplines such as Finance, Geophysics, and
Engineering. This book is designed to serve as a textbook for several courses in the
aforementioned areas and as a reference guide for practitioners in the industry.
The book has a strong theoretical background and several applications to
specific practical problems. It contains numerous techniques applicable to
modern data science and other disciplines. In today's world, many fields are
confronted with increasingly large amounts of complex data. Financial, healthcare,
and geophysical data sampled with high frequency are no exception. These
staggering amounts of data pose special challenges to the world of finance and
other disciplines such as healthcare and geophysics, as traditional models and
information technology tools can be poorly suited to grapple with their size
and complexity. Probabilistic modeling, mathematical modeling, and statistical
data analysis attempt to discover order from apparent disorder; this textbook may
serve as a guide to various new systematic approaches on how to implement these
quantitative activities with complex data sets.
The textbook is split into five distinct parts. In the first part of this book,
Foundations of Data Science, we discuss some fundamental mathematical and
statistical concepts which form the basis for the study of data science. In the second
part of the book, Data Science in Practice, we present a brief introduction to
R and Python programming and to how to write algorithms. In addition, various
techniques for data preprocessing, validations, and visualizations are
discussed. In the third part, Data Mining and Machine Learning Techniques for
Complex Data Sets, and the fourth part of the book, Advanced Models for Big Data
Analytics and Complex Data Sets, we provide exhaustive techniques for
analyzing and predicting different types of complex data sets.

We conclude this book with a discussion of ethics in data science: With great
power comes great responsibility.
The authors express their deepest gratitude to Wiley for making the publication
a reality.

El Paso, TX and Mahwah, NJ, USA Maria Cristina Mariani


September 2021 Osei Kofi Tweneboah
Maria Pia Beccar-Varela
1 Background of Data Science

1.1 Introduction
Data science is one of the most promising and high-demand career paths for skilled
professionals in the 21st century. Currently, successful data professionals understand
that they must advance past the traditional skills of analyzing large amounts
of data, statistical learning, and programming. In order to explore and discover
useful information for their companies or organizations, data scientists must
have a good grip of the full spectrum of the data science life cycle and a level
of flexibility and understanding to maximize returns at each phase of the process.
Data science is a “concept to unify statistics, mathematics, computer science,
data analysis, machine learning and their related methods” in order to find trends,
understand, and analyze actual phenomena with data. Due to the Coronavirus
disease (COVID-19), many colleges, institutions, and large organizations asked their
nonessential employees to work virtually. The virtual meetings have provided colleges
and companies with plenty of data. Some aspects of these data suggest that
virtual fatigue is on the rise. Virtual fatigue is defined as the burnout associated
with overdependence on virtual platforms for communication. Data science
provides tools to explore and reveal the best and worst aspects of virtual work.
In the past decade, data scientists have become necessary assets and are present
in almost all institutions and organizations. These professionals are data-driven
individuals with high-level technical skills who are capable of building complex
quantitative algorithms to organize and synthesize large amounts of information
used to answer questions and drive strategy in their organization. This is coupled
with the experience in communication and leadership needed to deliver tangible
results to various stakeholders across an organization or business.
Data scientists need to be curious and result-oriented, with good (domain-specific)
knowledge and communication skills that allow them to explain very technical
results to their nontechnical counterparts. They possess a strong quantitative
background in statistics and mathematics as well as programming knowledge with
a focus on data warehousing, mining, and modeling to build and analyze
algorithms. In fact, data scientists are a group of analytical data experts who have the
technical skills to solve complex problems and the curiosity to explore what
problems need to be solved.

1.2 Origin of Data Science


Data scientists are part mathematician, part statistician, and part computer scientist.
And because they span both the business and information technology (IT) worlds,
they're in high demand and well paid. Data scientists were not very popular
some decades ago; however, their sudden popularity reflects how businesses now
think about “big data.” Big data is defined as a field that treats ways to analyze,
systematically extract information from, or otherwise deal with data sets that are
too large or complex to be dealt with by traditional data-processing application
software. That bulky mass of unstructured information can no longer be ignored
and forgotten. It is a virtual gold mine that helps boost revenue, as long as there
is someone who explores and discovers business insights that no one thought
to look for before. Many data scientists began their careers as statisticians,
business analysts, or data analysts. However, as big data began to grow and evolve,
those roles evolved as well. Data is no longer just an add-on for IT to handle.
It is vital information that requires analysis, creative curiosity, and the ability
to translate high-tech ideas into innovative ways to make profit and to help
practitioners make informed decisions.

1.3 Who is a Data Scientist?


The term “data scientist” was coined as recently as 2008, when companies realized
the need for data professionals who are skilled in organizing and analyzing
massive amounts of data. Data scientists are quantitative and analytical data
experts who utilize their skills in both technology and social science to find trends
and manage the data around them. With the growth of big data integration in business,
they have evolved at the forefront of the data revolution. They are part mathematicians,
statisticians, computer programmers, and analysts who are equipped
with a diverse and wide-ranging skill set, balancing knowledge in several computer
programming languages with advanced experience in statistical learning
and data visualization.
There is not a definitive job description when it comes to the data scientist role.
However, we outline here some of the tasks they perform:
● Collecting and recording large amounts of unruly data and transforming it into
a more usable format.
● Solving business-related problems using data-driven techniques.


● Working with a variety of programming languages, including SAS, Minitab, R,
and Python.
● Having a strong background in mathematics and statistics, including statistical
tests and distributions.
● Staying on top of quantitative and analytical techniques such as machine learning,
deep learning, and text analytics.
● Communicating and collaborating with both IT and business.
● Looking for order and patterns in data, as well as spotting trends that enable
businesses to make informed decisions.
Some of the useful tools that every data scientist or practitioner needs are outlined
below:
● Data preparation: The process of cleaning and transforming raw data into suitable
formats prior to processing and analysis.
● Data visualization: The presentation of data in a pictorial or graphical format so
it can be easily analyzed.
● Statistical learning or machine learning: A branch of artificial intelligence based
on mathematical algorithms and automation. Artificial intelligence (AI) refers
to the process of building smart machines capable of performing tasks that typically
require human intelligence. They are designed to make decisions, often
using real-time data. Real-time data are information that is passed along to the
end user as soon as it is gathered.
● Deep learning: An area of statistical learning research that uses data to model
complex abstractions.
● Pattern recognition: Technology that recognizes patterns in data (often used
interchangeably with machine learning).
● Text analytics: The process of examining unstructured data and drawing meaning
out of written communication.
We will discuss all of the above tools in detail in this book. There are several scientific
and programming skills that every data scientist should have. They must
be able to utilize key technical tools and skills, including R, Python, SAS, SQL,
Tableau, and several others. Due to ever-growing technology, data scientists
must always learn new and emerging techniques to stay on top of their game. We
will discuss R and Python programming in Chapters 5 and 6.

1.4 Big Data


Big data is a term applied to ways to analyze, systematically extract information
from, or otherwise deal with data sets that are too large or complex to be dealt
with by classical data-processing tools. In particular, it refers to data sets whose
size or type is beyond the ability of traditional relational databases to capture,
manage, and process the data with low latency. Sources of big data include data
from sensors, the stock market, devices, video/audio, networks, log files, transactional
applications, the web, and social media, with much of it generated in real time and at a
very large scale.
In recent times, use of the term “big data” (both stored and real-time) tends
to refer to the use of user behavior analytics (UBA), predictive analytics, or certain
other advanced data analytics methods that extract value from data. UBA solutions
look at patterns of human behavior, and then apply algorithms and statistical
analysis to detect meaningful anomalies from those patterns, anomalies that indicate
potential threats: for example, detection of hackers, insider threats,
targeted attacks, and financial fraud.
Predictive analytics deals with the process of extracting information from
existing data sets in order to determine patterns and predict future outcomes and
trends. Generally, predictive analytics does not tell you what will happen in the
future; rather, it forecasts what might happen in the future with some degree
of certainty. Predictive analytics goes hand in hand with big data: businesses and
organizations collect large amounts of real-time customer data, and predictive
analytics uses this historical data, combined with customer insight, to forecast
future events. Predictive analytics helps organizations use big data to move
from a historical view to a forward-looking perspective of the customer. In this
book, we will discuss several methods for analyzing big data.

1.4.1 Characteristics of Big Data


Big data has one or more of the following characteristics: high volume, high velocity,
high variety, and high veracity. That is, the data sets are characterized by huge
amounts (volume) of frequently updated data (velocity) in various types, such as
numeric, textual, audio, images, and videos (variety), with high quality (veracity).
We briefly discuss each characteristic below.

Volume: Volume describes the quantity of generated and stored data. The size of
the data determines the value and potential insight, and whether it can be
considered big data or not.

Velocity: Velocity describes the speed at which the data is generated and processed
to meet the demands and challenges that lie in the path of growth and development.
Big data is often available in both stored and real-time forms. Compared to small
data, big data are produced more continually (the relevant time scale could be
nanoseconds, seconds, minutes, hours, etc.). Two types of velocity related to big
data are the frequency of generation and the frequency of handling, recording,
and reporting.

Variety: Variety describes the types and formats of the data. This helps people
who analyze it to effectively use the resulting insight. Big data draws from
different formats and completes missing pieces through data fusion. Data fusion
is a term used to describe the technique of integrating multiple data sources to
produce more consistent, accurate, and useful information than that provided by
any individual data source.

Veracity: Veracity describes the quality of data and the data value. The quality
of data obtained can greatly affect the accuracy of the analyzed results. In the
next subsection we will discuss some big data architectures. A comprehensive
study of this topic can be found in the application architecture guide of the
Microsoft technical documentation.

1.4.2 Big Data Architectures


Big data architectures are designed to handle the ingestion, processing, and analysis
of data that is too large or complex for classical data-processing application
tools. Some popular big data architectures are the Lambda architecture, the Kappa
architecture, and the Internet of Things (IoT). We refer the reader to the Microsoft
technical documentation on big data architectures for a detailed discussion of the
different architectures. Almost all big data architectures include all or some of the
following components:
● Data sources: All big data solutions begin with one or more data sources. Some
common data sources include the following: application data stores such as
relational databases, static files produced by applications such as web server log
files, and real-time data sources such as Internet of Things (IoT) devices.
● Data storage: Data for batch processing operations is typically stored in a distributed
file store that can hold high volumes of large files in various formats.
This kind of store is often called a data lake. A data lake is a storage repository
that allows one to store structured and unstructured data at any scale until it is
needed.
● Batch processing: Since data sets are enormous, often a big data solution must
process data files using long-running batch jobs to filter, aggregate, and otherwise
prepare the data for analysis. Normally, these jobs involve reading source
files, processing them, and writing the output to new files. Options include running
U-SQL jobs or using Java, Scala, R, or Python programs. U-SQL is a data
processing language that merges the benefits of SQL with the expressive power
of one's own code.
● Real-time message ingestion: If the solution includes real-time sources, the architecture
must include a way to capture and store real-time messages for stream
processing. This might be a simple data store, where incoming messages are
stored into a folder for processing. However, many solutions need a message
ingestion store to act as a buffer for messages and to support scale-out processing,
reliable delivery, and other message queuing semantics.
● Stream processing: After obtaining real-time messages, the solution must process
them by filtering, aggregating, and preparing the data for analysis. The processed
stream data is then written to an output sink.
● Analytical data store: Several big data solutions prepare data for analysis and
then serve the processed data in a structured format that can be queried using
analytical tools. The analytical data store used to serve these queries can be a
Kimball-style relational data warehouse, as observed in most classical business
intelligence (BI) solutions. Alternatively, the data could be presented through a
low-latency NoSQL technology, such as HBase, or an interactive Hive database
that provides a metadata abstraction over data files in the distributed data store.
● Analysis and reporting: The goal of most big data solutions is to provide insights
into the data through analysis and reporting. Users can analyze the data using
mathematical and statistical models as well as data visualization techniques.
Analysis and reporting can also take the form of interactive data exploration by
data scientists or data analysts.
● Orchestration: Several big data solutions consist of repeated data processing
operations, encapsulated in workflows, that transform source data, move data
between multiple sources and sinks, load the processed data into an analytical
data store, or move the results to a report or dashboard.
2 Matrix Algebra and Random Vectors

2.1 Introduction

The matrix algebra and random vectors presented in this chapter will enable us to
precisely state statistical models. We will begin by discussing some basic concepts
that will be essential throughout this chapter. For more details on matrix algebra
please consult Axler (2015).

2.2 Some Basics of Matrix Algebra


2.2.1 Vectors

Definition 2.1 (Vector) A vector x is an array of real numbers x1, x2, …, xn,
and it is written as:

$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}.$$

Definition 2.2 (Scalar multiplication of vectors) The product of a scalar c
and a vector is the vector obtained by multiplying each entry in the vector by the
scalar:

$$c\mathbf{x} = \begin{bmatrix} cx_1 \\ cx_2 \\ \vdots \\ cx_n \end{bmatrix}.$$


Definition 2.3 (Vector addition) The sum of two vectors of the same size is
the vector obtained by adding corresponding entries in the vectors:

$$\mathbf{x} + \mathbf{y} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} + \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} x_1 + y_1 \\ x_2 + y_2 \\ \vdots \\ x_n + y_n \end{bmatrix},$$

so that x + y is the vector with ith element xi + yi.
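These operations are easy to verify numerically. Below is a minimal Python sketch using NumPy (assumed available; the vectors and the scalar are illustrative values, not from the original text):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])  # a vector x with n = 3 entries (illustrative)
y = np.array([4.0, 5.0, 6.0])  # a second vector of the same size
c = 2.0                        # a scalar

print(c * x)   # scalar multiplication, Definition 2.2: [2. 4. 6.]
print(x + y)   # vector addition, Definition 2.3: [5. 7. 9.]
```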

2.2.2 Matrices

Definition 2.4 (Matrix) Let m and n denote positive integers. An m-by-n
matrix is a rectangular array of real numbers with m rows and n columns:

$$A = \begin{bmatrix} A_{1,1} & \cdots & A_{1,n} \\ \vdots & & \vdots \\ A_{m,1} & \cdots & A_{m,n} \end{bmatrix}.$$

The notation Ai,j denotes the entry in row i, column j of A. In other words,
the first index refers to the row number and the second index refers to the column
number.

Example 2.1 If

$$A = \begin{pmatrix} 1 & 4 & 8 \\ 0 & 4 & 9 \\ 7 & -1 & 7 \end{pmatrix},$$

then $A_{3,1} = 7$.

Definition 2.5 (Transpose of a matrix) The transpose operation $A^T$ of a
matrix changes the columns into rows, i.e. in matrix notation $(A^T)_{i,j} = A_{j,i}$,
where “T” denotes transpose.

Example 2.2 If

$$A_{2\times 3} = \begin{pmatrix} 1 & 4 & 8 \\ 0 & 4 & 9 \end{pmatrix}, \quad \text{then} \quad A^T_{3\times 2} = \begin{pmatrix} 1 & 0 \\ 4 & 4 \\ 8 & 9 \end{pmatrix}.$$

Definition 2.6 (Scalar multiplication of a matrix) The product of a scalar
c and a matrix is the matrix obtained by multiplying each entry in the matrix
by the scalar:

$$cA = \begin{bmatrix} cA_{1,1} & \cdots & cA_{1,n} \\ \vdots & & \vdots \\ cA_{m,1} & \cdots & cA_{m,n} \end{bmatrix}.$$

In other words, $(cA)_{i,j} = cA_{i,j}$.

Definition 2.7 (Matrix addition) The sum of two matrices of the same size is
the matrix obtained by adding corresponding entries in the matrices:

$$A + B = \begin{bmatrix} A_{1,1} & \cdots & A_{1,n} \\ \vdots & & \vdots \\ A_{m,1} & \cdots & A_{m,n} \end{bmatrix} + \begin{bmatrix} B_{1,1} & \cdots & B_{1,n} \\ \vdots & & \vdots \\ B_{m,1} & \cdots & B_{m,n} \end{bmatrix} = \begin{bmatrix} A_{1,1}+B_{1,1} & \cdots & A_{1,n}+B_{1,n} \\ \vdots & & \vdots \\ A_{m,1}+B_{m,1} & \cdots & A_{m,n}+B_{m,n} \end{bmatrix}.$$

In other words, $(A + B)_{i,j} = A_{i,j} + B_{i,j}$.

Definition 2.8 (Matrix multiplication) Suppose A is an m-by-n matrix
and B is an n-by-p matrix. Then AB is defined to be the m-by-p matrix whose
entry in row i, column j, is given by the following equation:

$$(AB)_{i,j} = \sum_{k=1}^{n} A_{i,k} B_{k,j}.$$

In other words, the entry in row i, column j, of AB is computed by taking row
i of A and column j of B, multiplying together corresponding entries, and then
summing. The number of columns of A must be equal to the number of rows of B.

Example 2.3 If

$$A = \begin{bmatrix} 1 & 4 \\ 0 & 4 \\ 7 & -1 \end{bmatrix} \quad \text{and} \quad B = \begin{bmatrix} 1 & 1 \\ 2 & 1 \end{bmatrix},$$

then

$$AB = \begin{bmatrix} 1(1)+4(2) & 1(1)+4(1) \\ 0(1)+4(2) & 0(1)+4(1) \\ 7(1)+(-1)(2) & 7(1)+(-1)(1) \end{bmatrix} = \begin{bmatrix} 9 & 5 \\ 8 & 4 \\ 5 & 6 \end{bmatrix}.$$
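The transpose and the matrix product can be reproduced with NumPy; the following short sketch (not part of the original text) uses the matrices of Example 2.3:

```python
import numpy as np

A = np.array([[1, 4],
              [0, 4],
              [7, -1]])   # the 3-by-2 matrix A of Example 2.3
B = np.array([[1, 1],
              [2, 1]])    # the 2-by-2 matrix B of Example 2.3

print(A.T)      # transpose (Definition 2.5): columns become rows
print(A @ B)    # matrix product (Definition 2.8): [[9 5], [8 4], [5 6]]
```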
Definition 2.9 (Square matrix) A matrix A is said to be a square matrix if
the number of rows is the same as the number of columns.

Definition 2.10 (Symmetric matrix) A square matrix A is said to be symmetric
if $A = A^T$, or in matrix notation $(A^T)_{i,j} = A_{i,j} = A_{j,i}$ for all i and j.

Example 2.4 The matrix $A = \begin{bmatrix} 1 & 4 \\ 4 & 4 \end{bmatrix}$ is symmetric; the matrix $B = \begin{bmatrix} 1 & 6 \\ 4 & -4 \end{bmatrix}$ is not symmetric.

Definition 2.11 (Trace) For any square matrix A, the trace of A, denoted
by tr(A), is defined as the sum of the diagonal elements, i.e.

$$\mathrm{tr}(A) = \sum_{i=1}^{n} a_{ii} = a_{11} + a_{22} + \cdots + a_{nn}.$$

Example 2.5 Let A be a matrix with

$$A = \begin{bmatrix} 1 & 4 & 9 \\ 1 & 0 & 0 \\ 1 & 4 & -9 \end{bmatrix}.$$

Then

$$\mathrm{tr}(A) = \sum_{i=1}^{3} a_{ii} = a_{11} + a_{22} + a_{33} = 1 + 0 + (-9) = -8.$$

We remark that the trace is only defined for square matrices.

Definition 2.12 (Determinant of a matrix) Suppose A is an n-by-n matrix,

$$A = \begin{bmatrix} a_{1,1} & \cdots & a_{1,n} \\ \vdots & & \vdots \\ a_{n,1} & \cdots & a_{n,n} \end{bmatrix}.$$

The determinant of A, denoted det A or |A|, is defined by

$$\det A = a_{i1} C_{i1} + a_{i2} C_{i2} + \cdots + a_{in} C_{in},$$

where the Cij are referred to as the “cofactors” and are computed from

$$C_{ij} = (-1)^{i+j} \det M_{i,j}.$$

The term Mij is known as the “minor matrix” and is the matrix you get if you
eliminate row i and column j from matrix A.
Finding the determinant depends on the dimension of the matrix A; determinants
only exist for square matrices.

Example 2.6 For a 2 by 2 matrix

$$A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}$$

we have

$$\det A = |A| = \begin{vmatrix} a & b \\ c & d \end{vmatrix} = ad - bc.$$

Example 2.7 For a 3 by 3 matrix

$$A = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}$$

we have

$$\det A = |A| = \begin{vmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{vmatrix} = a_{11}(a_{22}a_{33} - a_{23}a_{32}) - a_{12}(a_{21}a_{33} - a_{23}a_{31}) + a_{13}(a_{21}a_{32} - a_{22}a_{31}).$$
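Both quantities are built into NumPy. A brief sketch (illustrative, not from the original text), using the matrix of Example 2.5:

```python
import numpy as np

A = np.array([[1, 4, 9],
              [1, 0, 0],
              [1, 4, -9]])   # the square matrix of Example 2.5

print(np.trace(A))           # trace: 1 + 0 + (-9) = -8
print(np.linalg.det(A))      # determinant: 72.0 (up to floating-point error)
```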

Definition 2.13 (Positive definite matrix) A square n × n matrix A is called
positive definite if, for any vector u ∈ ℝn not identically zero, we have
$u^T A u > 0$.

Example 2.8 Let A be the 2 by 2 matrix

$$A = \begin{bmatrix} 9 & -2 \\ -2 & 6 \end{bmatrix}.$$

To show that A is positive definite, by definition

$$u^T A u = [u_1, u_2] \begin{bmatrix} 9 & -2 \\ -2 & 6 \end{bmatrix} \begin{bmatrix} u_1 \\ u_2 \end{bmatrix} = 9u_1^2 - 4u_1u_2 + 6u_2^2 = (2u_1 - u_2)^2 + 5(u_1^2 + u_2^2) > 0 \quad \text{for } [u_1, u_2] \neq [0, 0].$$

Therefore, A is positive definite.
Definition 2.14 (Positive semidefinite matrix) A matrix A is called positive
semidefinite (or nonnegative definite) if, for any vector u ∈ ℝn, we have
$u^T A u \geq 0$.

Definition 2.15 (Negative definite matrix) A square n × n matrix A is
called negative definite if, for any vector u ∈ ℝn not identically zero, we have
$u^T A u < 0$.

Example 2.9 Let A be the 2 by 2 matrix

$$A = \begin{bmatrix} -2 & 1 \\ 1 & -2 \end{bmatrix}.$$

To show that A is negative definite, by definition

$$u^T A u = [u_1, u_2] \begin{bmatrix} -2 & 1 \\ 1 & -2 \end{bmatrix} \begin{bmatrix} u_1 \\ u_2 \end{bmatrix} = -2u_1^2 + 2u_1u_2 - 2u_2^2 = -(u_1 - u_2)^2 - (u_1^2 + u_2^2) < 0 \quad \text{for } [u_1, u_2] \neq [0, 0].$$

Therefore, A is negative definite.

Definition 2.16 (Negative semidefinite matrix) A matrix A is called negative
semidefinite if, for any vector u ∈ ℝn, we have
$u^T A u \leq 0$.

We state the following theorem without proof.

Theorem 2.1 A 2 by 2 symmetric matrix

$$A = \begin{bmatrix} a & b \\ b & d \end{bmatrix}$$

is:
1. positive definite if and only if a > 0 and det A > 0;
2. negative definite if and only if a < 0 and det A > 0;
3. indefinite if and only if det A < 0.
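For symmetric matrices, definiteness can equivalently be checked through the signs of the eigenvalues (all positive for positive definite, all negative for negative definite). The following sketch, assuming NumPy, classifies the matrices of Examples 2.8 and 2.9; numerical tolerances are ignored for simplicity:

```python
import numpy as np

def definiteness(A):
    """Classify a symmetric matrix by the signs of its eigenvalues."""
    eig = np.linalg.eigvalsh(A)     # eigenvalues of a symmetric matrix
    if np.all(eig > 0):
        return "positive definite"
    if np.all(eig < 0):
        return "negative definite"
    if np.all(eig >= 0):
        return "positive semidefinite"
    if np.all(eig <= 0):
        return "negative semidefinite"
    return "indefinite"

print(definiteness(np.array([[9, -2], [-2, 6]])))   # Example 2.8: positive definite
print(definiteness(np.array([[-2, 1], [1, -2]])))   # Example 2.9: negative definite
```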

2.3 Random Variables and Distribution Functions


We begin this section with the definition of 𝜎-algebra.

Definition 2.17 (σ-algebra) A σ-algebra $\mathcal{F}$ is a collection of subsets of Ω
satisfying the following conditions:
1. ∅ ∈ $\mathcal{F}$.
2. If F ∈ $\mathcal{F}$ then its complement $F^c$ ∈ $\mathcal{F}$.
3. If F1, F2, … is a countable collection of sets in $\mathcal{F}$ then their union $\cup_{n=1}^{\infty} F_n \in \mathcal{F}$.

Definition 2.18 (Measurable functions) A real-valued function f defined
on Ω is called measurable with respect to a sigma algebra $\mathcal{F}$ in that space if the
inverse image of the set B, defined as $f^{-1}(B) \equiv \{\omega \in \Omega : f(\omega) \in B\}$, is a set in the
σ-algebra $\mathcal{F}$, for all Borel sets B of ℝ. Borel sets are sets that are constructed from
open or closed sets by repeatedly taking countable unions, countable intersections,
and relative complements.

Definition 2.19 (Random vector) A random vector X is any measurable
function defined on the probability space $(\Omega, \mathcal{F}, P)$ with values in ℝn (Table 2.1).

Measurable functions will be discussed in detail in Section 20.5.


Suppose we have a random vector X defined on a space $(\Omega, \mathcal{F}, P)$. The sigma
algebra generated by X is the smallest sigma algebra in $(\Omega, \mathcal{F}, P)$ that contains all
the preimages of sets in ℝ through X. That is,

$$\sigma(X) = \sigma(\{X^{-1}(B) \mid B \text{ a Borel set in } \mathbb{R}\}).$$

This abstract concept is necessary to make sure that we may calculate any probability
related to the random variable X.
Any random vector has a distribution function, defined similarly to the
one-dimensional case. Specifically, if the random vector X has components
X = (X1, …, Xn), its cumulative distribution function or cdf is defined as:

$$F_X(\mathbf{x}) = P(X \leq \mathbf{x}) = P(X_1 \leq x_1, \dots, X_n \leq x_n) \quad \text{for all } \mathbf{x}.$$

Associated with a random variable X and its cdf FX is another function,
called the probability density function (pdf) or probability mass function (pmf).
The terms pdf and pmf refer to the continuous and discrete cases of random
variables, respectively.

Table 2.1 Examples of random vectors.

Experiment              Random variable
Toss two dice           X = sum of the numbers
Toss a coin 10 times    X = number of tails in 10 tosses
Definition 2.20 (Probability mass function) The pmf of a discrete random
variable X is given by

$$f_X(x) = P(X = x) \quad \text{for all } x.$$

Definition 2.21 (Probability density function) The pdf, fX(x), of a continuous
random variable X is the function that satisfies

$$F(\mathbf{x}) = F(x_1, \dots, x_n) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_n} f_X(t_1, \dots, t_n)\, dt_n \cdots dt_1.$$

We will discuss these notations in detail in Chapter 20.
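As a quick one-dimensional illustration, the pmf, pdf, and cdf of standard distributions are available in scipy.stats (a sketch, not part of the original text; SciPy is assumed installed):

```python
from scipy import stats

# pmf of a discrete random variable: X ~ Binomial(n = 10, p = 0.5),
# e.g. the number of tails in 10 coin tosses (Table 2.1)
print(stats.binom.pmf(4, n=10, p=0.5))   # P(X = 4)

# pdf and cdf of a continuous random variable: X ~ N(0, 1)
print(stats.norm.pdf(0.0))    # density f_X(0)
print(stats.norm.cdf(1.96))   # F_X(1.96) = P(X <= 1.96), about 0.975
```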


Using these concepts, we can define the moments of the distribution. In fact,
suppose that g : ℝn → ℝ is any function; then we can calculate the expected value
of the random variable g(X1, …, Xn), when the joint density exists, as:

$$E[g(X_1, \dots, X_n)] = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} g(x_1, \dots, x_n) f(x_1, \dots, x_n)\, dx_1 \cdots dx_n.$$

Now we can define the moments of the random vector. The first moment is a
vector:

$$E[X] = \mu_X = \begin{pmatrix} E[X_1] \\ \vdots \\ E[X_n] \end{pmatrix}.$$

The expectation applies to each component in the random vector. Expectations of
functions of random vectors are computed just as with univariate random variables.
We recall that the expectation of a random variable is its average value.
The second moment requires calculating all the combinations of the components.
The result can be presented in matrix form. The second central moment can be
presented as the covariance matrix:

$$\mathrm{Cov}(X) = E[(X - \mu_X)(X - \mu_X)^T] = \begin{pmatrix} \mathrm{Var}(X_1) & \mathrm{Cov}(X_1, X_2) & \cdots & \mathrm{Cov}(X_1, X_n) \\ \mathrm{Cov}(X_2, X_1) & \mathrm{Var}(X_2) & \cdots & \mathrm{Cov}(X_2, X_n) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{Cov}(X_n, X_1) & \mathrm{Cov}(X_n, X_2) & \cdots & \mathrm{Var}(X_n) \end{pmatrix}, \tag{2.1}$$

where we used the transpose matrix notation, and since Cov(Xi, Xj) =
Cov(Xj, Xi), the matrix is symmetric.
We note that the covariance matrix is positive semidefinite (nonnegative definite),
i.e. for any vector u ∈ ℝn, we have $u^T \mathrm{Cov}(X)\, u \geq 0$.
Now we explain why the covariance matrix has to be positive semidefinite. Take any
vector u ∈ ℝn. Then the product

$$u^T X = \sum_i u_i X_i \tag{2.2}$$

is a (one-dimensional) random variable, and its variance must be nonnegative.
This is because in the one-dimensional case the variance of a random variable
is defined as Var(X) = E(X − E[X])². We see that the variance is nonnegative for
every random variable, and it is equal to zero if and only if the random variable is
constant. The expectation of (2.2) is $E[u^T X] = u^T \mu_X$. Then we can write (since for
any number a, $a^2 = aa^T$)

$$\mathrm{Var}(u^T X) = E[(u^T X - u^T \mu_X)^2] = E[(u^T X - u^T \mu_X)(u^T X - u^T \mu_X)^T] = E[u^T (X - \mu_X)(X - \mu_X)^T u] = u^T \mathrm{Cov}(X)\, u.$$

Since the variance is always nonnegative, the covariance matrix must be nonnegative
definite (or positive semidefinite). We recall that a square symmetric matrix
A ∈ ℝn×n is positive semidefinite if $u^T A u \geq 0$ for all u ∈ ℝn. The distinction
between definite and semidefinite is in fact important in the context of random
variables, since one may be able to construct a linear combination $u^T X$ which is
not always constant but whose variance is equal to zero.
The covariance matrix is discussed in detail in Chapter 3.
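In practice, the mean vector and covariance matrix are estimated from a sample. A minimal sketch, assuming NumPy and using simulated data (illustrative, not from the original text):

```python
import numpy as np

rng = np.random.default_rng(0)           # seeded generator for reproducibility
X = rng.normal(size=(1000, 3))           # 1000 draws of a 3-dimensional random vector

mu_hat = X.mean(axis=0)                  # sample estimate of the first moment E[X]
Sigma_hat = np.cov(X, rowvar=False)      # sample covariance matrix (3-by-3, symmetric)

# the quadratic form u^T Cov(X) u is nonnegative for any u (positive semidefiniteness)
u = np.array([1.0, -2.0, 0.5])           # an arbitrary test vector
print(u @ Sigma_hat @ u >= 0)            # True
```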
We now present examples of multivariate distributions.

2.3.1 The Dirichlet Distribution


Before we discuss the Dirichlet distribution, we define the Beta distribution.

Definition 2.22 (Beta distribution) A random variable X is said to have a
Beta distribution with parameters α and β if it has a pdf f(x) defined as:

$$f(x) = \begin{cases} \dfrac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha-1}(1-x)^{\beta-1}, & \text{if } 0 < x < 1, \\ 0, & \text{otherwise}, \end{cases}$$

where α > 0 and β > 0.

The Dirichlet distribution Dir(𝜶), named after Johann Peter Gustav Lejeune
Dirichlet (1805–1859), is a multivariate distribution parameterized by a vector 𝜶
of positive parameters (α1, …, αn).
Specifically, the joint density of an n-dimensional random vector X ∼ Dir(𝜶) is
defined as:

$$f(x_1, \dots, x_n) = \frac{1}{B(\boldsymbol{\alpha})} \left( \prod_{i=1}^{n} x_i^{\alpha_i - 1}\, \mathbf{1}_{\{x_i > 0\}} \right) \mathbf{1}_{\{x_1 + \cdots + x_n = 1\}},$$

where $\mathbf{1}_{\{x_1 + \cdots + x_n = 1\}}$ is an indicator function.
Definition 2.23 (Indicator function) The indicator function of a subset A of
a set X is a function

$$\mathbf{1}_A : X \to \{0, 1\}$$

defined as

$$\mathbf{1}_A(x) = \begin{cases} 1, & \text{if } x \in A, \\ 0, & \text{if } x \notin A. \end{cases}$$

The components of the random vector X thus are always positive and have the
property X1 + · · · + Xn = 1. The normalizing constant B(𝜶) is the multinomial
beta function, defined as:

$$B(\boldsymbol{\alpha}) = \frac{\prod_{i=1}^{n} \Gamma(\alpha_i)}{\Gamma\!\left(\sum_{i=1}^{n} \alpha_i\right)} = \frac{\prod_{i=1}^{n} \Gamma(\alpha_i)}{\Gamma(\alpha_0)},$$

where we used the notation $\alpha_0 = \sum_{i=1}^{n} \alpha_i$ and $\Gamma(x) = \int_0^{\infty} t^{x-1} e^{-t}\, dt$ for the Gamma
function.
Because the Dirichlet distribution creates n positive numbers that always sum
to 1, it is extremely useful for creating candidates for probabilities of n possible
outcomes. This distribution is very popular and related to the multinomial distribution,
which needs n numbers summing to 1 to model the probabilities in the
distribution. The multinomial distribution is defined in Section 2.3.2.
With the notation mentioned above and α0 as the sum of all parameters, we can
calculate the moments of the distribution. The first moment vector has coordinates:

$$E[X_i] = \frac{\alpha_i}{\alpha_0}.$$

The covariance matrix has elements:

$$\mathrm{Var}(X_i) = \frac{\alpha_i(\alpha_0 - \alpha_i)}{\alpha_0^2(\alpha_0 + 1)},$$

and when i ≠ j,

$$\mathrm{Cov}(X_i, X_j) = \frac{-\alpha_i \alpha_j}{\alpha_0^2(\alpha_0 + 1)}.$$

The covariance matrix is singular (its determinant is zero).
Finally, the univariate marginal distributions are all beta with parameters Xi ∼
Beta(αi, α0 − αi). All these results can be found in Balakrishnan and Nevzorov
(2004). Please refer to Lin (2016) for the proofs of the properties of the Dirichlet
distribution.
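The moments above can be checked by simulation; the sketch below assumes NumPy and uses an illustrative parameter vector (not from the original text):

```python
import numpy as np

alpha = np.array([2.0, 3.0, 5.0])   # illustrative Dirichlet parameters
a0 = alpha.sum()

rng = np.random.default_rng(1)
samples = rng.dirichlet(alpha, size=100_000)

print(samples[0].sum())             # each draw is positive and sums to 1
print(samples.mean(axis=0))         # close to E[X_i] = alpha_i / alpha_0 = [0.2, 0.3, 0.5]
print(alpha * (a0 - alpha) / (a0**2 * (a0 + 1)))   # theoretical Var(X_i) for comparison
```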
2.3.2 Multinomial Distribution


We begin with a definition of the binomial distribution.

Definition 2.24 (Binomial distribution) A random variable X is said to
have a binomial distribution with parameters n and p if it has the pmf

$$P(x; p, n) = \binom{n}{x} p^x (1-p)^{n-x} \quad \text{for } x = 0, 1, \dots, n,$$

where p is the probability of success on an individual trial and n is the number of
trials in the binomial experiment.

The multinomial distribution is a generalization of the binomial distribution.
Specifically, assume that each of n independent trials may result in one of the k
outcomes generically labeled S = {1, 2, …, k}, with corresponding probabilities
(p1, …, pk). Now define a vector X = (X1, …, Xk), where each of the Xi counts
the number of outcomes i in the resulting sample of size n. The joint distribution
of the vector X is

$$f(x_1, \dots, x_k) = \frac{n!}{x_1! \cdots x_k!}\, p_1^{x_1} \cdots p_k^{x_k}\, \mathbf{1}_{\{x_1 + \cdots + x_k = n\}}.$$

In the same way as the binomial probabilities appear as coefficients in the binomial
expansion of (p + (1 − p))ⁿ, the multinomial probabilities are the coefficients in the
multinomial expansion (p1 + · · · + pk)ⁿ, so they sum to 1. This expansion in fact
gives the name of the distribution.
If we label the outcome i as a success and everything else a failure, then Xi simply
counts successes in n independent trials and thus Xi ∼ Binom(n, pi). Thus, the
first moment of the random vector and the diagonal elements in the covariance
matrix are easy to calculate as npi and npi(1 − pi), respectively. The off-diagonal
elements (covariances) are not that complicated to calculate either; for i ≠ j,
Cov(Xi, Xj) = −npipj. Beyond these first two moments, however, computations for
multinomial random vectors become more difficult.
The one-dimensional marginal distributions are binomial; however, the joint
distribution of (X1, …, Xr), the first r components, is not multinomial. Instead,
suppose we group the first r categories into one and we let Y = X1 + · · · + Xr. Because
the categories are linked, that is, X1 + · · · + Xk = n, we also have that Y = n −
Xr+1 − · · · − Xk. We can easily verify that the vector (Y, Xr+1, …, Xk), or equivalently
(n − Xr+1 − · · · − Xk, Xr+1, …, Xk), will have a multinomial distribution
with associated probabilities (pY, pr+1, …, pk) = (p1 + · · · + pr, pr+1, …, pk).
Next consider the conditional distribution of the first r components given the
last k − r components. That is, the distribution of

$$(X_1, \dots, X_r) \mid X_{r+1} = n_{r+1}, \dots, X_k = n_k.$$
This distribution is also multinomial with the number of elements n − nr+1 − · · · − nk
and probabilities (p′1, … , p′r), where

p'_i = \frac{p_i}{p_1 + \cdots + p_r}.
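
The binomial marginals, the covariances, and the grouping property described above are easy to verify by simulation. The following is a minimal sketch in Python (n = 100, the probability vector, and the choice of grouping the first two categories are all arbitrary illustrations; NumPy is assumed to be available):

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n = 100
p = np.array([0.1, 0.2, 0.3, 0.4])           # arbitrary probabilities, k = 4

draws = rng.multinomial(n, p, size=100_000)  # each row sums to n

# Marginals X_i ~ Binom(n, p_i): mean n p_i and variance n p_i (1 - p_i)
print(draws.mean(axis=0), n * p)
print(draws.var(axis=0), n * p * (1 - p))

# Off-diagonal covariance: Cov(X_i, X_j) = -n p_i p_j
print(np.cov(draws[:, 0], draws[:, 1])[0, 1], -n * p[0] * p[1])

# Grouping property: Y = X_1 + X_2 counts successes with probability p_1 + p_2
y = draws[:, 0] + draws[:, 1]
print(y.mean(), n * (p[0] + p[1]))           # both ≈ 30
print(y.var(), n * 0.3 * 0.7)                # Binom(n, 0.3) variance ≈ 21
```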

2.3.3 Multivariate Normal Distribution


A vector X is said to have a k-dimensional multivariate normal distribution
(denoted X ∼ MVNk(𝜇, Σ)) with mean vector 𝜇 = (𝜇1, … , 𝜇k) and covariance
matrix Σ = (𝜎ij), i, j ∈ {1, … , k}, if its density can be written as

f(x) = \frac{1}{(2\pi)^{k/2} \det(\Sigma)^{1/2}}\, e^{-\frac{1}{2}(x-\mu)^{T} \Sigma^{-1} (x-\mu)},

where we used the usual notations for the determinant, transpose, and inverse of
a matrix. The vector of means 𝜇 may have any elements in ℝ, but, just as the
standard deviation has to be positive in the one-dimensional case, in the
multivariate case the covariance matrix Σ has to be symmetric and positive
definite.
The multivariate normal thus defined has many nice properties. The basic one
is that the one-dimensional distributions are all normal, that is, Xi ∼ N(𝜇i, 𝜎ii) and
Cov(Xi, Xj) = 𝜎ij. This is also true for any marginal. For example, if (Xr, … , Xk) are
the last coordinates, then

\begin{pmatrix} X_r \\ X_{r+1} \\ \vdots \\ X_k \end{pmatrix} \sim \mathrm{MVN}_{k-r+1}\!\left( \begin{pmatrix} \mu_r \\ \mu_{r+1} \\ \vdots \\ \mu_k \end{pmatrix}, \begin{pmatrix} \sigma_{r,r} & \sigma_{r,r+1} & \cdots & \sigma_{r,k} \\ \sigma_{r+1,r} & \sigma_{r+1,r+1} & \cdots & \sigma_{r+1,k} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{k,r} & \sigma_{k,r+1} & \cdots & \sigma_{k,k} \end{pmatrix} \right).

So any particular subvector of components is normal.
The conditional distribution of a multivariate normal is also a multivariate normal.
Given that X ∼ MVNk(𝜇, Σ) and using the vector notation above with
X_1 = (X1, … , Xr) and X_2 = (Xr+1, … , Xk), we can write the vector 𝜇 and the
matrix Σ in block form as

\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} \quad \text{and} \quad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix},

where the dimensions are chosen accordingly to match the two vectors (r and k − r).
Thus, the conditional distribution of X_1 given X_2 = a, for some vector a, is

X_1 \mid X_2 = a \sim \mathrm{MVN}_r\!\left( \mu_1 - \Sigma_{12}\Sigma_{22}^{-1}(\mu_2 - a),\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \right).
Furthermore, the vectors X_2 and X_1 − \Sigma_{12}\Sigma_{22}^{-1}X_2 are independent. Finally, any
affine transformation AX + b, where A is a k × k matrix and b is a k-dimensional
constant vector, is also a multivariate normal with mean vector A𝜇 + b and
covariance matrix A\Sigma A^T. Please refer to the texts by Axler (2015) and Johnson
and Wichern (2014) for more details on the multinomial and multivariate
normal distributions.
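
The block-partitioned conditioning formulas above translate directly into a few lines of linear algebra. The following is a minimal sketch in Python (the mean vector, covariance matrix, split point r, and observed value a are all hypothetical choices for illustration; NumPy is assumed to be available):

```python
import numpy as np

# Hypothetical 3-dimensional example: condition X1 = (X_1) on X2 = (X_2, X_3)
mu = np.array([1.0, 0.0, 2.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])   # symmetric and positive definite

r = 1                                 # split after the first coordinate
mu1, mu2 = mu[:r], mu[r:]
S11, S12 = Sigma[:r, :r], Sigma[:r, r:]
S21, S22 = Sigma[r:, :r], Sigma[r:, r:]

a = np.array([0.5, 1.0])              # observed value of X2
S22_inv = np.linalg.inv(S22)

# X1 | X2 = a ~ MVN_r(mu1 - S12 S22^{-1} (mu2 - a), S11 - S12 S22^{-1} S21)
cond_mean = mu1 - S12 @ S22_inv @ (mu2 - a)
cond_cov = S11 - S12 @ S22_inv @ S21
print(cond_mean, cond_cov)
```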

2.4 Problems

1 If A and B are two matrices, prove the following properties of the trace of a
matrix.
(a) tr(AB) = tr(BA).
(b) tr(A + B) = tr(A) + tr(B).
(c) tr(cA) = c tr(A), for any constant c.

2 If A and B are two matrices, prove the following properties of the determinant
of a matrix.
(a) det A = det A^T.
(b) det(AB) = det A ⋅ det B = det(BA).

3 Let

A = \begin{pmatrix} 1 & 4 & 8 \\ 0 & 4 & 9 \end{pmatrix}, \quad B = \begin{pmatrix} 2 & 4 & -3 \\ 1 & 8 & 9 \end{pmatrix}.

(a) Find A + B.
(b) Find A − B.
(c) Find A′ A.
(d) Find AA′ .

4 Let

A = \begin{pmatrix} 1 & 4 & 8 \\ 0 & 4 & 9 \end{pmatrix}, \quad B = \begin{pmatrix} 2 & 4 \\ 1 & 8 \\ -3 & 9 \end{pmatrix}.

(a) Find AB.


(b) Find BA.
(c) Compare tr(AB) and tr(BA).

5 Let

A = \begin{pmatrix} 5 & 4 & 8 \\ 0 & 4 & 3 \end{pmatrix}, \quad B = \begin{pmatrix} 0 & 4 \\ 1 & 3 \\ -3 & 2 \end{pmatrix}.

(a) Find (AB)^{-1}.
(b) Show that BA is not invertible.