Data Analytics for the Social Sciences

Data Analytics for the Social Sciences is an introductory, graduate-level treatment of data analytics for social
science. It features applications in the R language, arguably the fastest growing and leading statistical tool for
researchers.
The book starts with an ethics chapter on the uses and potential abuses of data analytics. Chapters 2 and 3 show
how to implement a broad range of statistical procedures in R. Chapters 4 and 5 deal with regression and classifica-
tion trees and with random forests. Chapter 6 deals with machine learning models and the “caret” package, which
makes available to the researcher hundreds of models. Chapter 7 deals with neural network analysis, and Chapter
8 deals with network analysis and visualization of network data. A final chapter treats text analysis, including web
scraping, comparative word frequency tables, word clouds, word maps, sentiment analysis, topic analysis, and more.
All empirical chapters have two “Quick Start” exercises designed to allow quick immersion in chapter topics, fol-
lowed by “In Depth” coverage. Data are available for all examples and runnable R code is provided in a “Command
Summary”. An appendix provides an extended tutorial on R and RStudio. Almost 30 online supplements complement
the book, including “books within the book” on a variety of topics, such as agent-based modeling.
Rather than focusing on equations, derivations, and proofs, this book emphasizes hands-on generation of output
for various social science models and how to interpret that output. It is suitable for advanced undergraduate
and graduate students learning statistical data analysis.

G. David Garson teaches advanced research methodology in the School of Public and International Affairs, North
Carolina State University, USA. Founder and longtime editor emeritus of the Social Science Computer Review, he
is president of Statistical Associates Publishing, which provides free digital texts worldwide. His degrees are from
Princeton University (BA, 1965) and Harvard University (PhD, 1969).
Data Analytics for the Social Sciences

Applications in R

G. David Garson
First published 2022
by Routledge
2 Park Square, Milton Park, Abingdon, Oxon OX14 4RN

and by Routledge
605 Third Avenue, New York, NY 10158

Routledge is an imprint of the Taylor & Francis Group, an informa business

© 2022 G. David Garson

The right of G. David Garson to be identified as author of this work has been asserted by them in accordance with sections 77
and 78 of the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this book may be reprinted or reproduced or utilised in any form or by any electronic,
mechanical, or other means, now known or hereafter invented, including photocopying and recording, or in any information
storage or retrieval system, without permission in writing from the publishers.

Trademark notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification
and explanation without intent to infringe.

British Library Cataloguing-in-Publication Data


A catalogue record for this book is available from the British Library

Library of Congress Cataloging-in-Publication Data


A catalog record has been requested for this book

ISBN: 978-0-367-62429-3 (hbk)


ISBN: 978-0-367-62427-9 (pbk)
ISBN: 978-1-003-10939-6 (ebk)

DOI: 10.4324/9781003109396

Typeset in Times
by KnowledgeWorks Global Ltd.

Access the Support Material: www.routledge.com/9780367624293


This book is dedicated, as I am, to my radiant Irish-
American soulmate and happiness, Kathryn Kallio.

– Dave Garson, April 2021


Contents
Acknowledgments xvi
Preface xvii

1 Using and abusing data analytics in social science 1


1.1 Introduction 1
1.2 The promise of data analytics for social science 3
1.2.1 Data analytics in public affairs and public policy 3
1.2.2 Data analytics in the social sciences 3
1.2.3 Data analytics in the humanities 4
1.3 Research design issues in data analytics 4
1.3.1 Beware the true believer 4
1.3.2 Pseudo-objectivity in data analytics 4
1.3.3 The bias of scholarship based on algorithms using big data 5
1.3.4 The subjectivity of algorithms 8
1.3.5 Big data and big noise 9
1.3.6 Limitations of the leading data science dissemination models 9
1.4 Social and ethical issues in data analytics 10
1.4.1 Types of ethical issues in data analytics 10
1.4.2 Bias toward the privileged 11
1.4.3 Discrimination 12
1.4.4 Diversity and data analytics 13
1.4.5 Distortion of democratic processes 14
1.4.6 Undermining of professional ethics 14
1.4.7 Privacy, profiling, and surveillance issues 15
1.4.8 The transparency issue 18
1.5 Summary: Technology and power 19
Endnotes 21

2 Statistical analytics with R, Part 1 22


PART I: OVERVIEW OF STATISTICAL ANALYSIS WITH R 22
2.1 Introduction 22
2.2 Data and packages used in this chapter 22
2.2.1 Example data 22
2.2.2 R packages used 23
PART II: QUICK START ON STATISTICAL ANALYSIS WITH R 24
2.3 Descriptive statistics 24
2.4 Linear multiple regression 26
PART III: STATISTICAL ANALYSIS WITH R IN DETAIL 33
2.5 Hypothesis testing 33
2.5.1 One-sample test of means 34
2.5.2 Means test for two independent samples 35
2.5.3 Means test for two dependent samples 35
2.6 Crosstabulation, significance, and association 36
2.7 Loglinear analysis for categorical variables 38
2.8 Correlation, correlograms, and scatterplots 38
2.9 Factor analysis (exploratory) 43
2.10 Multidimensional scaling 44
2.11 Reliability analysis 44
2.11.1 Cronbach’s alpha and Guttman’s lower bounds 46
2.11.2 Guttman’s lower bounds and Cronbach’s alpha 46
2.11.3 Krippendorff’s alpha and Cohen’s kappa 48
2.12 Cluster analysis 49
2.12.1 Hierarchical cluster analysis 50
2.12.2 K-means clustering 50
2.12.3 Nearest neighbor analysis 59
2.13 Analysis of variance 60
2.13.1 Data and packages used 60
2.13.2 GLM univariate: ANOVA 61
2.13.3 GLM univariate: ANCOVA 66
2.13.4 GLM multivariate: MANOVA 67
2.13.5 GLM multivariate: MANCOVA 70
2.14 Logistic regression 73
2.14.1 ROC and AUC analysis 77
2.14.2 Confusion table and accuracy 77
2.15 Mediation and moderation 79
2.16 Chapter 2 command summary 89
Endnotes 89

3 Statistical analytics with R, Part 2 91


PART I: OVERVIEW OF STATISTICAL ANALYTICS WITH R 91
3.1 Introduction 91
3.2 Data and packages used in this chapter 91
3.2.1 Example data 91
3.2.2 R Packages used 92
PART II: QUICK START ON STATISTICAL ANALYSIS PART 2 92
3.3 Quick start: Linear regression as a generalized linear modeling (GZLM) 92
3.3.1 Background to GZLM 92
3.3.2 The linear model in glm() 92
3.3.3 GZLM output 93
3.3.4 Fitted value, residuals, and plots 94
3.3.5 Noncanonical custom links 97
3.3.6 Multiple comparison tests 98
3.3.7 Estimated marginal means (EMM) 98
3.4 Quick start: Testing if multilevel modeling is needed 99
PART III: STATISTICAL ANALYSIS, PART 2, IN DETAIL 101
3.5 Generalized linear models (GZLM) 101
3.5.1 Introduction 101
3.5.2 Setup for GZLM models in R 103
3.5.3 Binary logistic regression example 104
3.5.4 Gamma regression model 105
3.5.5 Poisson regression model 108
3.5.6 Negative binomial regression 113
3.6 Multilevel modeling (MLM) 115
3.6.1 Introduction 115
3.6.2 Setup and data 115
3.6.3 The random coefficients model 116
3.6.4 Likelihood ratio test 119
3.7 Panel data regression (PDR) 119
3.7.1 Introduction 119
3.7.2 Types of PDR model 120
3.7.3 The Hausman test 122
3.7.4 Setup and data 123
3.7.5 PDR with the plm package 124
3.7.6 PDR with the panelr package 133
3.8 Structural equation modeling (SEM) 134
3.9 Missing data analysis and data imputation 134
3.10 Chapter 3 command summary 134
Endnotes 134

4 Classification and regression trees in R 136


PART I: OVERVIEW OF CLASSIFICATION AND REGRESSION TREES WITH R 136
4.1 Introduction 137
4.2 Advantages of decision tree analysis 137
4.3 Limitations of decision tree analysis 138
4.4 Decision tree terminology 139
4.5 Steps in decision tree analysis 140
4.6 Decision tree algorithms 140
4.7 Random forests and ensemble methods 142
4.8 Software 143
4.8.1 R language 143
4.8.2 Stata 144
4.8.3 SAS 144
4.8.4 SPSS 144
4.8.5 Python language 144
4.9 Data and packages used in this chapter 144
4.9.1 Example data 144
4.9.2 R packages used 145
PART II: QUICK START – CLASSIFICATION AND REGRESSION TREES 145
4.10 Classification tree example: Survival on the Titanic 145
4.11 Regression tree example: Correlates of murder 149
PART III: CLASSIFICATION AND REGRESSION TREES, IN DETAIL 152
4.12 Overview 152
4.13 The rpart() program 153
4.13.1 Introduction 153
4.13.2 Training and validation datasets 155
4.13.3 Setup for rpart() trees 156
4.14 Classification trees with the rpart package 158
4.14.1 The basic rpart classification tree 158
4.14.2 Printing tree rules 160
4.14.3 Visualization with prp() and draw.tree() 161
4.14.4 Visualization with fancyRpartPlot() 163
4.14.5 Interpreting tree summaries 164
4.14.6 Listing nodes by country and countries by node 169
4.14.7 Node distribution plots 170
4.14.8 Saving predictions and residuals 171
4.14.9 Cross-validation and pruning 173
4.14.10 The confusion matrix and model performance metrics 176
4.14.11 The ROC curve and AUC 182
4.14.12 Lift plots 184
4.14.13 Gains plots 186
4.14.14 Precision vs. recall plot 186
4.15 Regression trees with the rpart package 189
4.15.1 Setup 189
4.15.2 Creating an rpart regression tree 189
4.15.3 Printing tree rules 192
4.15.4 Visualization with prp() and fancyRpartPlot() 192
4.15.5 Interpreting tree summaries 194
4.15.6 The CP table 197
4.15.7 Listing nodes by country and countries by node 198
4.15.8 Saving predictions and residuals 199
4.15.9 Plotting residuals 200
4.15.10 Cross-validation and pruning 201
4.15.11 R-squared for regression trees 202
4.15.12 MSE for regression trees 205
4.15.13 The confusion matrix 206
4.15.14 The ROC curve and AUC 206
4.15.15 Gains plots 206
4.15.16 Gains plot with OLS comparison 209
4.16 The tree package 212
4.17 The ctree() program for conditional decision trees 212
4.18 More decision trees programs for R 212
4.19 Chapter 4 command summary 213
Endnotes 213

5 Random forests 215


PART I: OVERVIEW OF RANDOM FORESTS IN R 215
5.1 Introduction 215
5.1.1 Social science examples of random forest models 215
5.1.2 Advantages of random forests 216
5.1.3 Limitations of random forests 217
5.1.4 Data and packages 217
PART II: QUICK START – RANDOM FORESTS 218
5.2 Classification forest example: Searching for the causes of happiness 218
5.3 Regression forest example: Why so much crime in my town? 221
PART III: RANDOM FORESTS, IN DETAIL 226
5.4 Classification forests with randomForest() 226
5.4.1 Setup 226
5.4.2 A basic classification model 227
5.4.3 Output components of randomForest() objects for classification models 230
5.4.4 Graphing a randomForest tree? 238
5.4.5 Comparing randomForest() and rpart() performance 239
5.4.6 Tuning the random forest model 241
5.4.7 MDS cluster analysis of the RF classification model 250
5.5 Regression forests with randomForest() 253
5.5.1 Introduction 253
5.5.2 Setup 254
5.5.3 A basic regression model 254
5.5.4 Output components for regression forest models 256
5.5.5 Graphing a randomForest tree? 260
5.5.6 MDS plots 260
5.5.7 Quartile plots 261
5.5.8 Comparing randomForest() and rpart() regression models 262
5.5.9 Tuning the randomForest() regression model 263
5.5.10 Outliers: Identifying and removing 268
5.6 The randomForestExplainer package 272
5.6.1 Setup for the randomForestExplainer package 272
5.6.2 Minimal depth plots 273
5.6.3 Multiway variable importance plots 274
5.6.4 Multiway ranking of variable importance 277
5.6.5 Comparing randomForest and OLS rankings of predictors 278
5.6.6 Which importance criteria? 280
5.6.7 Interaction analysis 281
5.6.8 The explain_forest() function 286
5.7 Summary 286
5.8 Conditional inference forests 287
5.9 MDS plots for random forests 287
5.10 More random forest programs for R 287
5.11 Command summary 289
Endnotes 289

6 Modeling and machine learning 291


PART I: OVERVIEW OF MODELING AND MACHINE LEARNING 291
6.1 Introduction 291
6.1.1 Social science examples of modeling and machine learning in R 292
6.1.2 Advantages of modeling and machine learning in R 294
6.1.3 Limitations of modeling and machine learning in R 294
6.1.4 Data, packages, and default directory 295
PART II: QUICK START – MODELING AND MACHINE LEARNING 297
6.2 Example 1: Bayesian modeling of county-level poverty 297
6.2.1 Introduction 297
6.2.2 Setup 297
6.2.3 Correlation plot 298
6.2.4 The Bayes generalized linear model 300
6.3 Example 2: Predicting diabetes among Pima Indians with mlr3 307
6.3.1 Introduction 307
6.3.2 Setup 307
6.3.3 How mlr3 works 307
6.3.4 The Pima Indian data 309
PART III: MODELING AND MACHINE LEARNING IN DETAIL 316
6.4 Illustrating modeling and machine learning with SVM in caret 316
6.4.1 How SVM works 317
6.4.2 SVM algorithms compared to logistic and OLS regression 317
6.4.3 SVM kernels, types, and parameters 318
6.4.4 Tuning SVM models 319
6.4.5 SVM and longitudinal data 319
6.5 SVM versus OLS regression 320
6.6 SVM with the caret package: Predicting world literacy rates 320
6.6.1 Setup 321
6.6.2 Constructing the SVM regression model with caret 322
6.6.3 Obtaining predicted values and residuals 323
6.6.4 Model performance metrics 323
6.6.5 Variable importance 324
6.6.6 Other output elements 324
6.6.7 SVM plots 325
6.7 Tuning SVM models 326
6.7.1 Tuning for the train() command from the caret package 327
6.7.2 Tuning for the svm() command from the e1071 package 328
6.7.3 Cross-validating SVM models 330
6.7.4 Using e1071 in caret rather than the default kern package 331
6.8 SVM classification models: Classifying U.S. Senators 333
6.8.1 The “senate” example and setup 333
6.8.2 SVM classification with alternative kernels: Senate example 333
6.8.3 Tuning the SVM binary classification model 338
6.9 Gradient boosting machines (GBM) 341
6.9.1 Introduction 341
6.9.2 Setup and example data 342
6.9.3 Metrics for comparing models 343
6.9.4 The caret control object 343
6.9.5 Training the GBM model under caret 344
6.10 Learning vector quantization (LVQ) 345
6.10.1 Introduction 345
6.10.2 Setup and example data 346
6.10.3 Metrics for comparing models 346
6.10.4 The caret control object 346
6.10.5 Training the LVQ model under caret 346
6.11 Comparing models 347
6.12 Variable importance 349
6.12.1 Leave-one-out modeling 349
6.12.2 Recursive feature elimination (RFE) with caret 350
6.12.3 Other approaches to variable importance 352
6.13 SVM classification for a multinomial outcome 352
6.14 Command summary 352
Endnotes 352

7 Neural network models and deep learning 355


PART I: OVERVIEW OF NEURAL NETWORK MODELS AND DEEP LEARNING 355
7.1 Overview 355
7.2 Data and packages 356
7.3 Social science examples 357
7.4 Pros and cons of neural networks 358
7.5 Artificial neural network (ANN) concepts 359
7.5.1 ANN terms 359
7.5.2 R software programs for ANN 362
7.5.3 Training methods for ANN 363
7.5.4 Algorithms in neuralnet 363
7.5.5 Algorithms in nnet 363
7.5.6 Tuning ANN models 364
PART II: QUICK START – NEURAL NETWORK MODELS 364
7.6 Example 1: Analyzing NYC airline delays 364
7.6.1 Introduction 364
7.6.2 General setup 364
7.6.3 Data preparation 364
7.6.4 Modeling NYC airline delays 365
7.7 Example 2: The classic iris classification example 370
7.7.1 Setup 370
7.7.2 Exploring separation with a violin plot 371
7.7.3 Normalizing the data 371
7.7.4 Training the model with nnet in caret 372
7.7.5 Obtain model predictions 374
7.7.6 Display the neural model 375
PART III: NEURAL NETWORK MODELS IN DETAIL 375
7.8 Analyzing Boston crime via the neuralnet package 375
7.8.1 Setup 376
7.8.2 The linear regression model for unscaled data 377
7.8.3 The neuralnet model for unscaled data 379
7.8.4 Scaling the data 379
7.8.5 The linear regression model for scaled data 379
7.8.6 The neuralnet model for scaled data 380
7.8.7 Neuralnet results for the training data 381
7.8.8 Model performance plots 382
7.8.9 Visualizing the neuralnet model 383
7.8.10 Variable importance for the neuralnet model 384
7.9 Analyzing Boston crime via neuralnet under the caret package 386
7.10 Analyzing Boston crime via nnet in caret 386
7.10.1 Setup 387
7.10.2 The nnet/caret model of Boston crime 388
7.10.3 Variable importance for the nnet/caret model 392
7.10.4 Further tuning the nnet model outside caret 393
7.11 A classification model of marital status using nnet 395
7.11.1 Setup 395
7.11.2 The nnet classification model of marital status 397
7.12 Neural network analysis using “mlr3keras” 400
7.13 Command summary 400
Endnotes 400

8 Network analysis 401


PART I: OVERVIEW OF NETWORK ANALYSIS WITH R 401
8.1 Introduction 401
8.2 Data and packages used in this chapter 401
8.3 Concepts in network analysis 403
8.4 Getting data into network format 404
PART II: QUICK START ON NETWORK ANALYSIS WITH R 405
8.5 Quick start exercise 1: The Medici family network 405
8.6 Quick start exercise 2: Marvel hero network communities 409
PART III: NETWORK ANALYSIS WITH R IN DETAIL 416
8.7 Interactive network analysis with visNetwork 416
8.7.1 Undirected networks: Research team management 417
8.7.2 Clustering by group: Research team grouped by gender 421
8.7.3 A larger network with navigation and circle layout 422
8.7.4 Visualizing classification and regression trees: National literacy 425
8.7.5 A directed network (asymmetrical relationships in a research team) 426
8.8 Network analysis with igraph 429
8.8.1 Term adjacency networks: Gubernatorial websites and the covid pandemic 429
8.8.2 Similarity/distance networks with igraph: Senate interest group ratings 436
8.8.3 Communities, modularity, and centrality 440
8.8.4 Similarity network analysis: All senators 447
8.9 Using intergraph for network conversions 453
8.10 Network-on-a-map with the diagram and maps packages 457
8.11 Network analysis with the statnet and network packages 462
8.11.1 Introduction 462
8.11.2 Visualization 467
8.11.3 Neighborhoods 470
8.11.4 Cluster analysis 472
8.12 Clique analysis with sna 473
8.12.1 A simplified clique analysis 473
8.12.2 A clique analysis of the DHHS formal network 475
8.12.3 K-core analysis of the DHHS formal network 481
8.13 Mapping international trade flow with statnet and Intergraph 481
8.14 Correlation networks with corrr 481
8.15 Network analysis with tidygraph 484
8.15.1 Introduction 484
8.15.2 A simple tidygraph example 484
8.15.3 Network conversions with tidygraph 490
8.15.4 Finding community clusters with tidygraph 491
8.16 Simulating networks 494
8.16.1 Agent-based network modeling with SchellingR 494
8.16.2 Agent-based network modeling with RSiena 499
8.16.3 Agent-based network modeling with NetLogoR 499
8.17 Summary 500
8.18 Command summary 501
Endnotes 501

9 Text analytics 503


PART I: OVERVIEW OF TEXT ANALYTICS WITH R 503
9.1 Overview 503
9.2 Data used in this chapter 503
9.3 Packages used in this chapter 504
9.4 What is a corpus? 505
9.5 Text files 505
9.5.1 Overview 505
9.5.2 Archived texts 505
9.5.3 Project Gutenberg archive 506
9.5.4 Comma-separated values (.csv) files 509
9.5.5 Text from Word .docx files with the textreadr package 509
9.5.6 Text from other formats with the readtext package 512
9.5.7 Text from raw text files 514
PART II: QUICK START ON TEXT ANALYTICS WITH R 516
9.6 Quick start exercise 1: Key word in context (kwic) indexing 516
9.7 Quick start exercise 2: Word frequencies and histograms 518
PART III: TEXT ANALYTICS WITH R IN DETAIL 523
9.8 Web scraping 523
9.8.1 Overview 523
9.8.2 Web scraping: The “htm2txt” package 524
9.8.3 Web scraping: The “rvest” package 527
9.9 Social media scraping 531
9.9.1 Analysis of Twitter data: Trump and the New York Times 532
9.9.2 Social media scraping with twitter 536
9.10 Leading text formats in R 539
9.10.1 Overview 539
9.10.2 Formats related to the “tidytext” package 540
9.10.3 Formats related to the “tm” package 543
9.10.4 Formats related to the “quanteda” package 547
9.10.5 Common text file conversions 552
9.11 Tokenization 554
9.11.1 Overview 554
9.11.2 Word tokenization 554
9.12 Character encoding 557
9.13 Text cleaning and preparation 559
9.14 Analysis: Multigroup word frequency comparisons 559
9.14.1 Multigroup analysis in tidytext 559
9.14.2 Multigroup analysis with quanteda’s textstat_keyness() command 563
9.14.3 Multigroup analysis with textstat_frequency() in quanteda and ggplot2 566
9.15 Analysis: Word clouds 567
9.16 Analysis: Comparison clouds 572
9.17 Analysis: Word maps and word correlations 574
9.17.1 Working with the tdm format 574
9.17.2 Working with the dtm format 575
9.17.3 Word frequencies and word correlations 576
9.17.4 Correlation plots of word and document associations 577
9.17.5 Plotting word stem correlations for word pairs 581
9.17.6 Word correlation maps 584
9.18 Analysis: Sentiment analysis 587
9.18.1 Overview 587
9.18.2 Example: sentiment analysis of news articles 587
9.19 Analysis: Topic modeling 596
9.19.1 Overview 596
9.19.2 Topic analysis example 1: Modeling topic frequency over time 597
9.19.3 Topic analysis example 2: LDA analysis 603
9.20 Analysis: Lexical dispersion plots 610
9.21 Analysis: Bigrams and ngrams 611
9.22 Command Summary 612
Endnotes 612

Appendix 1: Introduction to R and RStudio 613


Appendix 2: Data used in this book 658
References 668
Index 678
Acknowledgments
I would like to thank the dozens of reviewers, anonymous and otherwise, who provided valuable feedback on the
proposal for this work, and for the work itself, though all errors are my own, of course. The extensive R com-
munity is something all authors using R, myself included, must acknowledge and praise. Particular thanks go to
Sarah Bauduin for her detailed help with the module on NetLogoR, and to Florian Pfisterer for assistance with the
mlr3keras package. I am also obliged to former doctoral student Kate Albrecht for her authorship of the online sup-
plement on modeling with RSiena, and to current doctoral student Brad Johnson for his creation of the PowerPoint
slides which accompany this text.
Preface
This book is intended to be an introductory graduate-level treatment of data analytics for social science. The reader
may ask, “Why ‘data analytics’ rather than ‘data science’?” When I started writing this book 2 years ago, Google
searching showed “data analytics” to be the more prevalent term. Today, in Spring 2021, the tide has shifted and
“data science” is more prevalent by about a 2:1 margin. However, where “data science” carries a strong connotation
of computer science and programming, I feel the term “data analytics” carries a connotation of applications for
social, economic, and organizational analysis. I hope one function of my work is to show that while some types of
analysis benefit from a bit of programming (e.g., for looping through repetitive functions), the social science student
or researcher need not feel that they need to take a career detour into computer science simply to be able to use many
of the tools of data science in their dissertations and research.
The subtitle of this book is “Applications in R”. R is a language which is used for statistics, data analysis,
text analysis, and machine learning. R arguably is the fastest-growing and leading statistical tool for researchers.
Social scientists can take advantage of thousands of cutting-edge programs for an “alphabet soup” of applications,
including agent-based modeling, Bayesian modeling, cluster analysis, correlation, correspondence analysis, data
management, decision trees, descriptive statistics, economics, factor analysis, forecasting, generalized linear model-
ing, instrumental variables regression, logistic regression, longitudinal and time series analysis, machine learning
models, mapping and spatial analysis, mediation and moderation analysis, multiple linear regression, multilevel
modeling, network analysis, neural network analysis, panel data regression, path analysis, partial least squares
modeling, power analysis, reliability analysis, significance testing, structural equation modeling, survey research,
text analytics, and visualization of data – and many more. New state-of-the-art R packages are added daily in an
ever-expanding universe of research tools, many created by leading scholars in their fields.
R is free and thus liberates the researcher from dependency on the willingness of his or her institution to provide
the needed software. Moreover, it is platform-independent and may be used with any operating system. Also, R is
open-source, with all source code available to those inclined to look “under the hood”. Statistical algorithms are not
locked in proprietary “black boxes”. R packages are available to import from and export to a variety of data sources,
such as SPSS, SAS, Stata, and Excel, to name a few. Although R is quite full-featured in its own right, it also may
be integrated with other programming environments, such as Python, Java, and C/C++. Starting with version 1.4,
RStudio offers access to Python tools and packages through its Python interpreter, the “reticulate” package, and
through other avenues. All of this is supported by a very large user community with a full array of
mailing lists (through which help questions may be posed and answered), blogs, conferences, journals, training
opportunities, and archives.
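To give a minimal sketch of this interoperability (the file names below are hypothetical placeholders, not datasets
used in this book), the “haven” and “readxl” packages import data from other statistical environments, and
“reticulate” calls Python from within an R session:

    library(haven)      # import SPSS, SAS, and Stata files
    library(readxl)     # import Excel workbooks
    library(reticulate) # run Python from within R

    gss    <- read_sav("survey.sav")     # an SPSS file (hypothetical name)
    panel  <- read_dta("panel.dta")      # a Stata file (hypothetical name)
    budget <- read_excel("budget.xlsx")  # an Excel workbook (hypothetical name)

    np <- import("numpy")                # attach a Python package via reticulate
    np$mean(c(2, 4, 6, 8))               # call a Python function on R data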
This is not a book for statisticians or advanced users. The reader will not find a forbidding mass of equations,
derivations, and proofs that may rightly be associated with later courses on data analysis. Rather, to make this the
introductory-level book it was intended to be, I have emphasized how to obtain output for various common types of
models, how to interpret the output, assumptions underlying the interpretation, and differences among R applica-
tion packages. Among the helpful features of the book are these:

• All empirical chapters have two “Quick Start” exercises designed to allow students to immerse themselves
quickly in and obtain successful results from R analyses related to the chapter topic.
• In the Support Material (www.routledge.com/9780367624293), all chapters have an abstract, which gives an
overview of the contents.
• All chapters have “Review Questions” for students in the student section of the Support Material (www.
routledge.com/9780367624293), with answers and comments in the instructor section.
• All chapters have text boxes highlighting the applicability of the chapter topic, and of R, to recent published
examples of social science research.
• Appendix 1 provides a book-within-the-book on “Introduction to R and RStudio”.
• Data for all examples are available to the student in the Support Material (www.routledge.com/9780367624293)
and are described in Appendix 1. These are listed in Appendix 2.
• The Support Material (www.routledge.com/9780367624293) also has “Command Summaries” of the run-
nable R code for each chapter, stripped of commentary and output in the chapters themselves.

In terms of organization of the book, I chose to start with a chapter on the uses and potential abuses of data analyt-
ics, emphasizing issues in ethics. Chapters 2 and 3 show how to implement a broad range of statistical procedures
in R. I thought this to be important in order that students and researchers see data analytics in R as something
having great continuity with what they already know. Chapters 4 and 5 deal with regression and classification trees
and with random forests. In addition to being valuable tools for prediction and classification in their own right,
these particular tools are often found desirable because they imitate the way ordinary people make decisions and
because they can be visualized graphically. Chapter 6 deals with machine learning models such as support vector
machines. A focus is placed on the “caret” package, which makes available to the researcher dozens of types of
models and facilitates comparison of their results on a cross-validated basis. Chapter 7 deals with neural network
analysis, a topic associated in the public eye with “artificial intelligence” and which also is a tool that may generate
superior solutions. Chapter 8 focuses on network analysis. A very broad range of social science data may be treated
as network data, and data relationships may be visualized in network diagrams. A final chapter treats text analysis,
including text acquisition through web scraping and other means; showing text relationships through comparative
word frequency tables, word clouds, and word maps; and use of sentiment analysis and topic analysis. In fact, topics
are so numerous that for space reasons some content is placed in online supplements in the Support Material (www.
routledge.com/9780367624293) to the text. Some supplements, such as agent-based modeling, are “books within the
book” bonuses for the reader of this text.
Data analytics represents a paradigm shift in social science research methodology. When I took my first teaching
position at Tufts University, we ran statistics, often in the Fortran language, on a “mainframe” with only 8 kilobytes
of memory! The “computer lab” at my next teaching position, at North Carolina State University, was initially
centered on sorting machines for IBM punch-card data. The teaching of research methods since then has been a
constant process of learning new tools and procedures. As social scientists we need to ride the wave of the paradigm
shift, not fear the learning curve all new things bring with them. I hope this book can be a small contribution to what
can only be described as a revolution in the teaching of research methods for social science. Happy data surfing!
G. DAVID GARSON
School of Public and International Affairs
North Carolina State University
April, 2021
Chapter 1
Using and abusing data
analytics in social science
1.1 Introduction
The use and abuse of data analytics (DA), data science, and artificial intelligence (AI) is of major concern in
business, government, and academia. In late 2019, based on a survey of 350 US and UK executives involved in
AI and machine learning, DataRobot (2019a, 2019b), itself a developer of machine learning automation plat-
forms, issued a news release on its report, headlining “Nearly half of AI professionals are ‘very to extremely’
concerned about AI bias.” Critics think the percentage should be even higher. This chapter has a triple pur-
pose. First, published literature in the social and policy sciences is used to illustrate the promise of big data
and DA, highlighting a variety of specific ways in which DA are useful. However, the other two sections of
this chapter are cautionary. The second section inventories threats to good research design common among
researchers employing big data and DA. The third section inventories various ethical issues
associated with big data and DA. The question underlying this chapter is whether, in terms of big data and
DA, we are marching toward a better society or toward an Orwellian “1984”. As in all such questions, the
answer is, “Some of both”.
Before beginning, a word about terminology is needed. The terms “data science”, “data analytics”, “machine
learning”, and “artificial intelligence” overlap in scope. In this volume, these “umbrella” terms may be used
interchangeably by the author and by other authors who are cited. However, connotations differ. Data science
suggests work done by graduates of data science programs, which are dominated by computer science depart-
ments. DA connotes the application of data science methods to other disciplines, such as social science. Machine
learning refers to any of a large number of algorithms which may be used for classification and prediction. AI
refers to algorithms that adjust and hopefully improve in effectiveness across iterations, such as neural networks
of various types. (In this book we do not refer to the broader popular meaning of artificial human intelligence
as portrayed in science fiction.) The common denominator of all these admittedly fuzzy terms is what is often
called “algorithmic thinking”, meaning reliance on computer algorithms to arrive at classifications, predictions,
and decisions. All approaches may utilize “big data”, referring to the capacity of these methods to deal with enor-
mous sets of mixed numeric, text, and even video data, such as may be scraped from the internet. Big data may
magnify bias associated with algorithmic thinking but it is not a prerequisite for bias and abuse in the application
of data science methods.
Official policy on ethics for information technology, including DA, is found in the 2012 “Menlo Report” of the
Directorate of Science & Technology of the US Department of Homeland Security. This report was followed up
by a “companion” document containing case studies and further guidance (Dittrich, Kenneally, & Bailey, 2013).

DOI: 10.4324/9781003109396-1

The Menlo Report contains highly generalized guidelines for ethical practice in the domain of DA. In a nutshell,
it sets out four principles:

1. Respect for persons: DA projects should be based on informed consent of those participating in or impacted
by the project.
The problem, of course, is that the whole basis of “big data” approaches is that huge amounts of data are col-
lected without realistic possibility of gathering true informed consent. Even when data are collected directly
from the person, consent takes the form of a button click, giving “consent” to fine print in legalese. This token
consent may even be obtained coercively as failure to click may deny the person the right to make a purchase
or obtain some other online benefits.
2. Beneficence: This is the familiar “do not harm” ethic with roots going back to the Hippocratic Oath for doc-
tors. In practical terms, DA projects are called upon to undertake systematic assessments of risks and harms
as well as benefits.
The problem is that DA projects are mostly commissioned with deliverables set beforehand and with tight
timetables. For the most part, the technocratic staff of DA projects is ill-trained to undertake true cost-benefit
studies, even if time constraints and work contracts permitted them. The Menlo Report itself provides a giant
loophole, noting that there are long-term social benefits to having research. It is easy to see these benefits as
outweighing diffuse costs which take the form of loss of confidentiality and privacy, violations of data integ-
rity, and individual or group impairment of reputation. The reality is that few, if any, DA projects are halted
due to lack of “beneficence”, though placing a privacy policy on one’s website or obtaining pro forma
“consent” is commonplace. The costs in time and money of challenging shortcomings in “beneficence” fall
on the aggrieved person, who often finds that pro-business legislation and courts, not to mention the superior legal
staff of corporations and governments, make the chance of success dim.
3. Justice: The principle of information justice means that all persons are treated equally with regard to data
selection without bias. Also, benefits of information technology are to be distributed fairly.
The problem is that on the selection side, profiling is inherent in big data analysis. Profiling, in turn, is famously
subject to bias. On the fair distribution side, the Menlo Report and DA projects generally interpret fairness in
terms of individual need, individual effort, societal contribution, and overall merit. These fairness concepts
are subjective and extremely vague. If information justice is considered at all, such elastic concepts are easily
invoked to justify existing DA practices without need for revision.
4. Respect for law and the public interest: DA projects should be based on legal “due diligence”, transparency
with regard to DA methods and results, and DA should be subject to accountability.
DA projects lack “due diligence” if there is no evidence that some effort was undertaken to conform to relevant
laws dealing with privacy and data integrity. The corporation or government agency which commissions a DA
project is wise to have such evidence, usually in the form of an official privacy policy, a policy on data sharing,
and so on. These policies are frequently posted on the web, giving evidence of “transparency”. The problem is
that this primarily serves for legal protection of the corporation or government entity and is rarely a constraint
on what the DA project actually does.

It is common in many domains for ethical guidelines to lack impact. An illustration at this writing is the ethical
standards document of the American Society for Public Administration in the era of the Trump presidency and its
many challenges to ethics. Like that document, the usefulness of the Menlo Report is primarily to call attention to
ethical issues, not actually to regulate DA projects.
Ostensibly, every US federal agency has appointed a “data steward” responsible for each database it maintains.
Although stewardship attaches to databases rather than to each algorithm-based program, most agencies have a data steward statement of respon-
sibilities that often includes duties in the areas of data privacy, transparency, and other values. An example
is in the “Readings and References” section of the student Support Material (www.routledge.com/9780367624293) for
this book.1 There may be a Data Stewardship Executive Policy Committee to oversee data stewardship, as there is
in the US Census Bureau. A literature review by the author was unable to find even a single empirical study of the
effectiveness of governmental data stewards, though prescriptive articles on what makes a data steward effective
abound. “The proof is in the pudding” must be the investigatory rule here. Much of this chapter is devoted to illus-
trations of problems with the pudding.
Petrozzino (2020), addressing the Menlo Report, has argued that formal ethical principles do make a differ-
ence. Petrozzino, a Principal Cybersecurity Engineer within the National Security Engineering Center operated by
MITRE for the US Department of Defense, concluded her analysis by writing, “The enthusiasm of organizations to
use big data should be married with the appropriate analysis of potential impact to individuals, groups, and society.
Without this analysis, the potential issues are numerous and substantively damaging to their mission, organization,
and external stakeholders” (p. 17). Like Biblical principles of morality, it is largely up to the individual to act upon
ethical principles. However, it is thought better for the DA project director to have principles than not to have them!

1.2 The promise of data analytics for social science


1.2.1 Data analytics in public affairs and public policy
The Menlo Report discussed earlier specifically calls attention to the societal value of basic research based on big
data. Big data and DA have been applied to public policy problems as diverse as making health-care
delivery more efficient (Sousa et al., 2019), improving the state of the art in biomedicine (Mittelstadt, 2019), advanc-
ing the techniques of forensic accounting (Zabihollah & Wang, 2019), improving crop selection in agriculture
(Tseng, Cho, & Wu, 2019), estimating travel time in transportation networks (Bertsimas et al., 2019), and identifying
trucks involved in illegal construction waste dumping (Lu, 2019). Likewise, Hauer (2019: 222) is one of many who
have noted the sweeping scope of the algorithms that implement DA. He wrote, “Algorithms plan flights and then fly
with planes. Algorithms run factories, the bank is a vast array of algorithms, evaluating our credit score, algorithms
collect revenue and keep records, read medical images, diagnose cancer, drive cars, write scientific texts, compose
music, conduct symphony orchestras, navigate drones, speak to us and for us, write film scenarios, invent chemical
formulations for a new cosmetic cream, order, advise, paint pictures. Climate models decide what is a safe carbon
dioxide level in the atmosphere. NSA algorithms decide whether you are a potential terrorist.”
In the same vein, Cathy Petrozzino has observed, “the public sector at every level – federal, state, local, and
tribal – also has benefited from its creation of big data collections and applications of data science.” She gave such
examples as the Care Assessment Needs (CAN) system of the Veterans Health Administration (VHA), and the
Office of Anti-Fraud Program of the Social Security Administration (Petrozzino, 2020: 14).
Public health and the provision of medical care form one domain that has been a center of big data and DA
activity. Garattini et al. (2019: 69), for instance, have noted many benefits of big data in medicine, where DA “offers
the capacity to rationalize, understand and use big data to serve many different purposes, from improved services
modelling to prediction of treatment outcomes, to greater patient and disease stratification. In the area of infectious
diseases, the application of big data analytics has introduced a number of changes in the information accumulation
models… Big data analytics is fast becoming a crucial component for the modeling of transmission – aiding infec-
tion control measures and policies – emergency response analyses required during local or international outbreaks.”

1.2.2 Data analytics in the social sciences


Given the DA revolution in public and private sectors, it would be surprising not to see a rapid gravitation of the
social sciences in the same direction and, indeed, this is happening quickly in the current era. The work of Richard
Hendra, director of the Manpower Demonstration Research Corporation’s (MDRC) Center for Data Insights
(https://www.mdrc.org/), exemplifies how a social scientist can employ data analytic methods to address some of the
nation’s toughest social policy challenges through leveraging already collected data to derive actionable insights to
help improve well-being among low-income individuals and families. Illustrative projects include a nonprofit initia-
tive that focuses on leveraging MIS data to improve program targeting and a national effort to improve DA capacity
and infrastructure in the Temporary Assistance for Needy Families (TANF) system. Other application areas include
employment, housing, criminal justice, financial inclusion, and substance abuse issues. Hendra’s work centers on
how data science fits within long-term learning agendas, using techniques like random forests and ensemble meth-
ods to complement the causal inference studies that MDRC is known for.

1.2.3 Data analytics in the humanities


We would be remiss before closing this subsection not to mention that DA and big data open up new opportunities
for scholars working in the humanities, where text analysis is paramount. Thus boyd (sic.) and Crawford (2012: 667)
noted, “Big Data offers the humanistic disciplines a new way to claim the status of quantitative science and objective
method. It makes many more social spaces quantifiable.”

1.3 Research design issues in data analytics


1.3.1 Beware the true believer
Almost a decade ago the authors boyd and Crawford (2012: 666) found “an arrogant undercurrent in many Big Data
debates where other forms of analysis are too easily sidelined… This is not a space that has been welcoming to
older forms of intellectual craft.” This intellectual arrogance continues to the present day. For instance, this author
(Garson) has experienced data science students having been taught that soon conventional statistical analysis would
be a thing of the past. As boyd and Crawford noted, intellectual arrogance has the potential to “crystalize into new
orthodoxies”, discouraging collaboration and inhibiting rather than promoting innovation. The deserved praise for
the potential of big data and DA must be tempered with recognition that there are many quantitative, qualitative,
and mixed paths to knowledge. Moreover, many of the “new” machine learning techniques like deep learning with
neural networks or text analytic content analysis antedate the rise of modern data science, and correspondingly data
science texts today commonly present linear and logistic regression, cluster analysis, multidimensional scaling, and
other “traditional” statistical approaches as integral to DA, albeit often done in R or Python rather than SPSS, SAS,
or Stata.
Technocratic isolation encourages “true believership”; diversity mitigates it. Speaking of ethics and bias in the
application of machine learning and AI in response to the COVID pandemic crisis of 2020, Sipior (2020) wrote,
“A diversity of disciplines is the key to success in AI, especially to minimize risk associated with rapid deploy-
ment … Team membership should be well-rounded, from a wide range of backgrounds and skill sets, for complex
problem-solving with innovative solutions and for recognizing the potential for bias. To address issues such as bias,
ethics, and compliance, among others, roles such as an AI Ethicist, attorney, and/or review board, may be added.”
She quotes Shellenbarger (2019), “The biases that are implicit in one team member are clear to, and avoided by,
another… So it’s really key to get people who aren’t alike.” Diversity in the algorithm-development team requires
not only data scientists but also subject matter experts, ethicists, and, above all, representation of populations likely
to be impacted. In reality, however, diversity is expensive in both time and money. While diversity in data analytic
development teams is the gold standard, gold is hard to come by and is rare in nature.

1.3.2 Pseudo-objectivity in data analytics


Social science topics from education to elections have been taken up by data scientists in academia and in consult-
ing firms with increasing frequency in recent years. Often their work is presented as reporting objective facts drawn
from mountains of big data. Those from computer science and technical backgrounds are accustomed to thinking in
terms of fact-based knowledge. Pseudo-objectivity replaces objectivity, however, when information is confounded
with knowledge.
Kusner and Loftus (2020) discuss the “we just report the facts” problem in data science, using the example of an
algorithm deployed across the United States, but found by Obermeyer et al. (2019) to underestimate the health needs
of black patients. The data scientists behind the algorithm chose to use health-care costs as a measure of health
needs. This failed to take into account the fact that health-care costs for black patients historically have been lower
due to relative lack of access to treatment, in turn due to racism and income inequalities (Glauser, 2020). This poor
research design led to wrong inferences about the health needs of the black population. Kusner and Loftus went on
to note the tendency of data science algorithms to be developed by technocrats who focus on bivariate correlations
at a surface level, hoping that big data washes out any research design shortcomings. However, this is a false hope.
As social scientists are well aware, what is needed prior to deploying an algorithm is to have a validated model of
the outcome of interest (patient health in this case), taking into account all major relevant variables and analyzing
the data on a multivariate basis. When such a model does not exist, as is often the case, analysis is exploratory at
best and is “not ready for prime time”.
This example illustrates how machine learning and AI can maintain and amplify inequity. Most algorithms
exploit crude correlations in data. Yet these correlations are often by-products of more salient social relationships
(in the health-care example, treatment that is inaccessible is, by definition, cheaper), or chance occurrences that will
not replicate. To identify and mitigate discriminatory relationships embedded in data, we need models that capture
or account for the causal pathways that give rise to them.

TEXT BOX 1.1 John von Neumann on Mathematical Models

John von Neumann was a Hungarian who emigrated to the United States and helped develop the massive
code-breaking first-generation computers of the WWII era. Regarded as the foremost mathematician of his
time, his book, The Computer and the Brain (1958), was published posthumously. It is widely acknowledged
that artificial intelligence and machine learning owe a great deal to his work (Findler, 1988).
Some of his insights, collected in the quotations below, remain relevant to work today in the areas of data
analytics, data science, and AI.

The sciences do not try to explain, they hardly even try to interpret, they mainly make models. By
a model is meant a mathematical construct which, with the addition of certain verbal interpreta-
tions, describes observed phenomena. The justification of such a mathematical construct is solely
and precisely that it is expected to work – that is correctly to describe phenomena from a reason-
ably wide area.
It is exceptional that one should be able to acquire the understanding of a process without hav-
ing previously acquired a deep familiarity with running it, with using it, before one has assimilated
it in an instinctive and empirical way… Thus any discussion of the nature of intellectual effort in
any field is difficult, unless it presupposes an easy, routine familiarity with that field. In mathemat-
ics this limitation becomes very severe.
Truth is much too complicated to allow anything but approximations.
There’s no sense in being precise when you don’t even know what you’re talking about.
Can we survive technology?

John von Neumann
Information is factual. Knowledge is interpretive. As soon as the analyst seeks to understand what data mean
inherently, the subjective process of interpretation has begun. Indeed, subjectivity antecedes data collection since
the researcher must selectively decide what information to collect and what to ignore. Even if their topic is the same,
different researchers will make different decisions about the types, sources, variables, dates, and other aspects
of their intended data corpus, whether quantitative or textual, “big” or traditional. Thus David Bollier (2010: 13)
observed, “Big Data is not self-explanatory”. He gives the example of data-cleaning. All data, perhaps especially big
data, require cleaning. Cleaning involves subjective decisions about which data elements matter. Cleaned data are
no longer objective data yet data cleaning is essential. When data come from multiple sources, each with their own
biases and sources of error, the problem is compounded.
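A minimal sketch in R, using made-up numbers, shows how two equally defensible cleaning decisions yield two
different “objective” results from the same raw data:

    # Hypothetical raw data: incomes with missing values and one extreme value
    income <- c(22000, 35000, 41000, NA, 29000, 950000, NA, 38000)

    # Cleaning choice 1: drop missing values but keep every observation
    mean(income, na.rm = TRUE)   # about 185,833 - pulled upward by the extreme value

    # Cleaning choice 2: also treat values above the 95th percentile as errors
    cutoff  <- quantile(income, 0.95, na.rm = TRUE)
    trimmed <- income[!is.na(income) & income <= cutoff]
    mean(trimmed)                # 33,000 - a very different "fact" from the same data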

1.3.3 The bias of scholarship based on algorithms using big data


1.3.3.1 Bias in access to the means of scholarly production
The old Marxian viewpoint that ownership of the means of production is the key determinant of outcomes has some
relevance to bias in big data research. Large social media companies own the data and are not obligated to release it.
At the same time, in-house researchers and those with direct access to the “full stream” of social media data are in a
privileged position in terms of scholarship. From a scholar’s perspective, such access is valuable and worth protect-
ing and this vested interest can produce bias. Thus boyd and Crawford (2012: 674) observed, “Big Data researchers
with access to proprietary data sets are less likely to choose questions that are contentious to a social media com-
pany if they think it may result in their access being cut. The chilling effects on the kinds of research questions that
can be asked – in public or private – are something we all need to consider.”

1.3.3.2 Bias in the tools of scholarly production


The bias of big data has also been noted by boyd and Crawford (2012: 666), who have pointed out the poor archiving
and search function of such big data sources as Twitter and Facebook. Often these sources are “black boxes” with
proprietary restrictions on full access by social scientists. As an example, Crimson Hexagon is a service that makes
available longitudinal Twitter data, which Twitter itself does not. Its main clientele are corporations interested in fol-
lowing public consumer trends. To better serve this clientele, Crimson Hexagon uses an algorithm to ascribe gender
to individual tweets. The algorithm is a corporate secret and is of unknown scientific validity. The social science
researcher is faced with the bad alternatives of not using gender as a variable, using gender without validation, or
refusing to publish research based on “black box” methods.
More generally, there is a bias toward using what is generally available, which is data that may be “scraped”
from social media, blogs, websites, and other online sources. This is cited as an advantage of the big data approach:
the ability to retrieve large amounts of data at costs far below those associated with traditional means, such as
national surveys and panels. This bias focuses researchers on topics, boyd and Crawford noted (p. 666), “in the
present or immediate past – tracking reactions to an election, TV finale, or natural disaster – because of the sheer
difficulty or impossibility of accessing older data.”
While the number of observations in large datasets drawn from social media, blogs, or websites may vastly
exceed the number of observations in traditional survey research, this does not make them a better basis for inter-
pretation, let alone make them free from error. “Large data sets from internet sources are often unreliable,” boyd
and Crawford (2012: 668) note, and are “prone to outages and losses, and these errors and gaps are magnified when
multiple data sets are used together.” With big data it is often difficult to establish the representativeness essential in
the data on which an algorithm is based.
Algorithms are developed using training datasets. For instance, if an algorithm developer in the medical field
has not employed random sampling for the training set, then, as Glauser (2020: E21) noted in a medical journal, “A
program trained on lung scans may seem neutral, but if the training data sets include only images from patients from
one sex or racial group, it may miss health conditions in diverse populations.” In the same vein, Mannes (2020: 64)
observed, “Many issues with algorithmic bias are the result of decisions about what data are used to train the model.
Including and excluding variables, as well as errors in data curation and collection, can skew the AI’s results.” There
is a twofold take-away from this observation:

1. Interpretation is sounder when the data sample is randomly selected from the universe to which the researcher wishes to generalize, or at least is representative of the desired sampling frame (illustrated in the sketch below).
2. Model specification must include the proper variables. For instance, Wykstra (2018) noted how an algorithm assigning scores predicting likelihood of recidivism differed dramatically depending on whether the predictor was past arrests or past convictions. If the true causes are not included in the model (true causes are often unknowable, and good indicators of them may be unavailable), the reliability of the model suffers, and the rate of false predictions can pose serious problems.
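To make the first take-away concrete, the following minimal R sketch contrasts a convenience training set with a randomly drawn one. The data frame and all names are hypothetical illustrations, not data from any study cited above.

    set.seed(42)
    # Hypothetical data: 10,000 records from two groups, collected group by group
    cases <- data.frame(
      group   = rep(c("A", "B"), times = c(8000, 2000)),
      outcome = rbinom(10000, 1, 0.3)
    )

    # Convenience training set: the first 2,000 rows only -- all group "A",
    # so a model trained on it never "sees" group B at all
    train_biased <- cases[1:2000, ]

    # Representative training set: a simple random sample of rows
    train_random <- cases[sample(nrow(cases), 2000), ]

    prop.table(table(cases$group))         # population shares: .80 / .20
    prop.table(table(train_biased$group))  # biased set: 1.00 / .00
    prop.table(table(train_random$group))  # random set: close to .80 / .20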

An example of big data bias based on scraping social media comments is given by Papakyriakopoulos, Carlos, and
Hegelich (2020), who studied German users’ political comments and parties’ posts on social media. “We quanti-
tatively demonstrate”, they wrote, “that hyperactive users have a significant role in the political discourse: They
become opinion leaders, as well as having an agenda-setting effect, thus creating an alternate picture of public opin-
ion.” The authors found hyperactive users participated in discussions differently, liked different content, and that
they became opinion leaders whose comments were more popular than those of ordinary users. Other research has
shown that some hyperactive users are paid political spammers or even “bots”, not random individuals who happen
to be more active. The bias introduced by hyperactive users translates directly into bias in recommender systems,
such as those used by Facebook and all major social networks, leading to “the danger of algorithmic manipulation
of political communication” by these networks.
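One practical response is to test findings for robustness to hyperactivity. The R sketch below, using the dplyr package on a simulated comment stream (all names and thresholds are illustrative), flags accounts above the 99th percentile of comment counts so that analyses can be rerun with and without them.

    library(dplyr)

    # Hypothetical comment stream: 300 accounts, 3 of them hyperactive
    set.seed(3)
    comments <- data.frame(
      user_id = sample(sprintf("user%03d", 1:300), 5000, replace = TRUE,
                       prob = c(rep(20, 3), rep(1, 297)))
    )

    # Comments per account, with a flag for the top 1% most active accounts
    activity <- count(comments, user_id, name = "n_comments")
    cutoff   <- quantile(activity$n_comments, 0.99)
    activity <- mutate(activity, hyperactive = n_comments > cutoff)

    # Robustness check: rerun the analysis excluding hyperactive accounts
    comments_trimmed <- semi_join(comments,
                                  filter(activity, !hyperactive),
                                  by = "user_id")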

1.3.3.3 Bias in the methods of scholarly production


A great deal of what comes out of DA may be characterized as pattern-matching. The researcher uses data analytic
tools to sift through masses of data in order to show constellations of variables associated with an outcome of inter-
est. However, correlation is not causation. The data analyst cannot escape the conundrums that always plague
social science. Intricate and difficult problems of interpretation include, to take a few examples, the problem of
mutual causation (non-recursivity), the problem of causation by unmeasured variables (the endogeneity problem),
and problems associated with any number of threats to model validity (e.g., making causal inferences without
longitudinal data, generalizing about individuals based on aggregate data, cross-cultural differentials of meaning
regarding the ostensibly same construct, etc.). Biased if not flatly erroneous interpretation arises from naïve applica-
tion of pattern matching to big data. Basics of statistical research still apply, such as knowing that correlation is not
causation (e.g., one researcher showed via data mining that American stock market changes correlated well with
butter production in Bangladesh – an example of spurious correlation (Leinweber, 2007)). Often data scientists find
themselves well advised to turn back to traditional statistical forms of modeling which address complex research
problems.
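The spurious-correlation trap is easy to reproduce. In the minimal R sketch below, two independent random walks, stand-ins for any pair of trending series such as a stock index and butter production, routinely correlate strongly despite having no causal connection.

    # Two independent random walks: no causal connection whatsoever
    set.seed(7)
    x <- cumsum(rnorm(100))
    y <- cumsum(rnorm(100))

    cor(x, y)       # often large in absolute value
    cor.test(x, y)  # often "highly significant" -- pattern without causation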
To take an example, Loni Hagen commented on big data research that developed an algorithm purporting to flag "troll", "bot", and "Russian propaganda" messages on social media. She observed that machine learning is "good at learning biases and the majority rules of the world" but that, "As a consequence, minority rules can be easily ignored in the process of machine learning" (personal email to author, 2/12/2020). The machine algorithm, in
essence, incorporated the reasoning that (1) Russian troll messages are highly critical of person X; (2) the message
at hand is highly critical of person X; (3) therefore, the message may be classed as a troll message. In another line
of reasoning, (1) user opinions are the gold standard for judging messages to be trolling; (2) a given message has
received negative user comments and has sometimes been flagged as trolling; (3) therefore, the message is a troll.
Hagen points out that these fallacious lines of reasoning wind up labeling minority opinion messages as “non-
genuine accounts” or “trolls”. The practical bias of such big data algorithms may be, in Hagen’s words, “to oppress
minority opinions by automatically flagging them.”
In the area of medicine and AI, Parikh, Teeple, and Navathe (2019: 2377) have observed that “clinicians may have
a propensity to trust suggestions from AI decision support systems, which summarize large numbers of inputs into
automated real-time predictions, while inadvertently discounting relevant information from nonautomated systems –
so-called automation complacency”. That is, data science and AI provide a cloak of mystification and legitimacy to
recommendations that might otherwise be subject to scrutiny and challenge. This reflects the proven tendency for complacency and bias to be associated with human use of automated technology (Parasuraman & Manzey, 2010). This
bias is particularly likely to exist where AI reinforces existing biases. Parikh and his colleagues give the example of
the Framingham Study, a classic study of factors in heart disease, used by doctors for decades but now known to be
biased due to having been based on an overwhelmingly non-Hispanic white population. When an algorithm applies the Framingham Risk Score to black populations with otherwise similar clinical characteristics, prediction errors occur. While this particular bias has been recognized, one must wonder how many other examples are
unrecognized. Parikh gives other examples, such as AI algorithms using electronic health records (EHR) that wrongly fail to recommend cardiac ischemia testing for older women, or make incor-
rect estimates of breast cancer in black women due to treatment of missing data, not recognizing that missingness is
related to race. These authors conclude, “While all predictive models may automate bias, AI may be unique in the
extent to which bias is unrecognized” (p. 2377).

1.3.3.4 Bias of social media data itself


Social media users are not a random sample of the American population, let alone the world, yet some in the DA
field act as if having hundreds of thousands or even millions of data points is a substitute. However, generalization
made on the basis of force of numbers is not good social science. The authors boyd and Crawford (2012: 669) note
of Twitter data, “Regardless of the number of tweets, it is not a representative sample as the data are skewed from
the beginning.” In this section, we enumerate some of the many cautions that attach to social media data, often the
type of data to which data analytic methods are applied.
Based on article counts in Summon for the 2014–2019 period, Facebook and Twitter were the dominant sources
of data for scholarly articles (about 280,000 articles each), followed by YouTube (116k), Instagram (75k), and WhatsApp (20k). This huge number of articles reflects the relative ease with which social scientists may scrape
social media data. In this section, we take Twitter data as an example, but its limitations often are similar to limita-
tions on all social media data.
Three of the many limitations of Twitter data are those listed below.

1. Problems in acquiring unbiased data: Twitter is popular among scholars because it provides some tweets
through its public APIs. A few companies and large institutions have access, in theory, to all public tweets
(those not made private by users). The great majority of researchers must be content with access to 10% or 1%
Twitter streams covering a time-limited period. The sampling process is not revealed in detail to research-
ers. Some tweets are eliminated because they come from protected accounts. Others are eliminated because
not-entirely-accurate algorithms determine they contain spam, pornography, or other forbidden content. For
those that are included, there is the problem of overcounting due to some people having multiple accounts and
undercounting because sometimes multiple people use the same account. Then there is the much-publicized problem that a nontrivial amount of use reflects bots, which send content on an automated basis, or reflects the work of banks of human agents working for some entity.
2. Difficulty in defining users: It is difficult to distinguish just what Twitter “use” and “participation” is. A few
years back, Twitter (2011) noted that 40% of active users are passive, listening but not posting. With survey
research it is possible, for example, to analyze the views of both those who voted and also the views of non-
voters. In contrast, in Twitter research it is not possible to compare the sentiments of those who tweeted with
sentiments of those who just listened.
3. Dangers of pooling data: When handling data from multiple sources, pooling issues arise. Serious errors of
interpretation may well arise when different sets of data are combined, as not infrequently happens in “big
data” research on social media sources. These problems are outlined, for instance, in Knapp (2013). Suffice it
to say, combining social media data from multiple sources may be difficult or impossible to do without incur-
ring bias.

1.3.3.5 Bias of big data network research


A common type of big data analysis takes the form of network analysis, sometimes presented in graphical con-
nected-circles format called sociograms. Data may be articulated (e.g., email connections) or behavioral (e.g., prox-
imity based on cell phone GPS data). Measures such as centrality to the network may be calculated, imputing
more network importance to units with higher centrality coefficients. This is reminiscent of a line of social science research pioneered by scholars such as Jacob Moreno (1934) in psychosociology and later Floyd Hunter (1969) in political sociology. But where classical sociological research focused on interpersonal and political relationships of consequence, much network research based on big data focuses on what Granovetter (1973) called "weak ties" (e.g., being "friended" on Facebook, where the user may have thousands of "friends"). As boyd and Crawford (2012: 671)
have noted, “Not every connection is equivalent to every other connection, and neither does frequency of contact
indicate strength of relationship." Bias and erroneous interpretation arise when weak ties are confounded with
strong ones, and when social context variables are not part of the data being analyzed.
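The weak-tie problem can be made concrete with the igraph package. In the hypothetical sketch below, unweighted degree centrality treats every connection identically, whereas weighted "strength" lets intensity of contact matter, and the two can rank actors quite differently.

    library(igraph)

    # Hypothetical edge list; 'weight' stands for frequency/intensity of contact
    edges <- data.frame(
      from   = c("A", "A", "A", "B", "C"),
      to     = c("B", "C", "D", "C", "D"),
      weight = c(5, 1, 1, 4, 1)
    )
    g <- graph_from_data_frame(edges, directed = FALSE)

    degree(g)                           # every tie counts the same: A and C lead
    strength(g, weights = E(g)$weight)  # strong ties dominate: B leads instead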

1.3.4 The subjectivity of algorithms


When DA is engaged in support of algorithmically based decision-making, as in banking decisions about credit,
insurance decisions about eligibility, or university decisions about student admittance, decisions must be made
about what variables are employed by the algorithm. Even the most apparently fair-minded approach may be biased,
as when available economic-based variables are included while unavailable human values variables are excluded.
Bias is even more likely when the algorithm is embedded in a “black box” making assessment of the role of vari-
ables difficult or impossible, as in many AI applications. As another instance, it is common for employers to use
algorithms to screen job applicants based on test scores without ever establishing that in fact those with higher test
scores can perform specific job-related tasks better than, say, those with medium test scores.
In the selection and weighting of variables, biases may be introduced by the analyst creating the algorithm or by
his or her employer. There is even the possibility of a politics of algorithms, in which interested parties lobby to have
their interests represented. For instance, there is possible bias in the credit rating industry, as when groups lobby a
credit bureau to have membership in their organization counted as a plus or when discount stores lobby to have high
rates of credit card spending in their stores not count as a minus. Zarsky (2016: 125) concluded, “Lobbying obvi-
ously increases unfair outcomes of the processes mentioned because it facilitates a biased decision-making process
that systematically benefits stronger and well-organized social segments (and thus is unfair to weaker segments).”
The problem of subjectivity in the development of algorithms is compounded by the tendency of data scientists
and the public alike to anthropomorphize them. David Watson observed, “Algorithms are not ‘just like us’ and the
temptation to pretend they are can have profound ethical consequences when they are deployed in high-risk domains
like finance and clinical medicine. By anthropomorphizing a statistical model, we implicitly grant it a degree of
agency that not only overstates its true abilities, but robs us of our own autonomy” (Watson, 2019: 435). The prob-
lem is that it is not ethically neutral to blindly accept that AI, being rooted in neural sciences of the human mind,
is therefore to be seen, as human beings are seen, as agents having their own set of ethics. Rather than being like
humans, AI applications are tools. Like all tools, they tend to be used in the interest of those who fund them. While
it is common to observe that DA may be used for good or evil, a more accurate generalization is to say that on aver-
age, DA tends to serve powerful interests in society. Ethical vigilance by human beings is of utmost importance.

1.3.5 Big data and big noise


Social scientists have long understood that what counts is data quality, not data quantity. For instance, a scientific
national survey in the United States may be accomplished with fewer than 2,000 respondents. Having 20,000 or
even 2 million data points does not give a better sample. Likewise, in their classic political science study of Pearl
Harbor, Wohlstetter and Schelling (1962) showed that the failure of decision in that event was not due to too little
warning information but too much, combined with failure to properly analyze the data at hand.
A case in point was reported in 2020 by the Fragile Families Project, in which high-quality data were collected
in a panel study with a view to undertaking predictive modeling of family outcomes as a basis for policy analysis in
social and criminal justice programs. Data were collected on children at ages 1, 3, 5, 9, and 15. The project sponsored a competition to predict six life outcomes (e.g., grade point average) at age 15, based only on data from the first four waves of the study. The competition received 457 applications from 68 institutions around the globe, including
several teams based at Princeton University. The competitors used a variety of machine learning AI techniques. As
reported by Virginia Tech, “Even after using state-of-the-art modeling and a high-quality dataset containing 13,000
data points for more than 4,000 families, the best AI predictive models were not very accurate” (Jimenez & Daniels,
2020). However, the use of large datasets may confer misplaced legitimacy and may mislead researchers and poli-
cymakers into assuming accuracy is assured.
MacFeely (2019) has made related points about big data. While acknowledging the potential benefit of big data,
he noted, “Big data also present enormous statistical and governance challenges and potential pitfalls: Legal; ethi-
cal; technical; and reputational. Big data also present a significant expectations management challenge, as it seems
many hold the misplaced belief that accessing big data is straightforward and that their use will automatically and
dramatically reduce the costs of producing statistical information. As yet the jury is out on whether big data will
offer official statistics anything especially useful. Beyond the hype of big data, and hype it may well be, statisticians
understand that big data are not always better data and that more data doesn’t automatically mean more insight. In
fact more data may simply mean more noise.” A big data researcher may brag about having 800,000 data points
compared to the 800 of a survey researcher studying the same topic. However, that is no evidence at all that the
former is a better basis for decision than the latter and is no evidence that the use of either dataset is appropriate
and valid.
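The point is easy to simulate. In the illustrative R sketch below, a "big data" source of 800,000 observations that over-represents high values estimates a population mean far worse than a simple random sample of 800.

    set.seed(1)
    population <- rnorm(2e6)   # true mean is 0

    # "Big data": 800,000 observations, but higher values are more likely
    # to be captured (inclusion probability rises with the value itself)
    big_biased <- sample(population, 8e5, replace = TRUE,
                         prob = pnorm(population))

    # Small but random: 800 observations by simple random sampling
    small_srs <- sample(population, 800)

    mean(big_biased)  # roughly 0.56 -- badly off despite n = 800,000
    mean(small_srs)   # near 0 despite n = 800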

1.3.6 Limitations of the leading data science dissemination models


R and Python, along with CRAN and GitHub distribution channels, lack quality control except for virus/malware
checking and checking for program error messages on multiple platforms.2 Even the R distribution itself comes with
no warranty. On this problem, Marc Schwartz observed, “Even if you narrowly define ‘safe’ as being virus/malware
free and even if the CRAN maintainers have extensive screening in place, the burden will still be on the end users to
test/scan the downloaded packages (whether in source or binary form), according to some a priori defined standard
operating procedures, to achieve a level of confidence, that the packages pass those tests/scans.”3 However, the end
user is typically ill-equipped to evaluate bias and error in the algorithms underlying packages the user intends to
employ.
Of course, proprietary statistical and other software also may contain algorithmic errors. Moreover, unlike R and Python packages, commercial packages for the most part do not make source code available for inspection. However, companies do have paid staff to undertake quality control and vetting, and capitalist competition motivates companies to offer products which "work" lest profits suffer. In the community-supported world of R and Python, in contrast, such quality control work is unpaid, unsystematic, and idiosyncratic. For these reasons this author recommends that, in the area of statistical methods, researchers cross-check and confirm critical results obtained from R and Python packages with results from major commercial packages. Even when results can be verified by forcing settings to agree, the researcher may find that default settings in community-supported software are unconventional.
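One small, concrete instance of unconventional defaults: base R's quantile() function implements nine algorithms, and its default (type = 7) differs from the convention used by SPSS and Minitab (type = 6), so the "same" percentile can disagree across packages unless settings are forced to match.

    x <- c(1, 2, 3, 4, 10)
    quantile(x, probs = 0.25)            # R default (type = 7) returns 2.0
    quantile(x, probs = 0.25, type = 6)  # SPSS/Minitab convention returns 1.5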

1.4 Social and ethical issues in data analytics


1.4.1 Types of ethical issues in data analytics
By way of introduction, La Fors, Custers, and Keymolen (2019: 217) have enumerated ten major ways in which the
rise of big data and DA poses threats to ethics. These are paraphrased below:

1. Human welfare: Algorithm-driven decisions on matters ranging from employment to education may lead to de
facto discrimination against and unfair treatment of citizens.
2. Autonomy: DA-driven profiling and consumer targeting can undermine the exercise of free choice and affect
the news, politics, product advertising, and even cultural information to which the individual is exposed.
3. Justice: Algorithmic profiling can flag false positives or false negatives in law enforcement, resulting in sys-
tematic unfairness and injustices.
4. Solidarity: Non-transparent decisions made by complex algorithms based on big data may prioritize some
groups over others without ever affording the opportunity for the mobilization of potential group solidarity in
defense against these decisions.
5. Dignity: Algorithmic profiling can lead to stigmatization and assault on human dignity. Being treated "as a number" is inherent in algorithmic policymaking but is also inherently dehumanizing to the affected individual, who would often favor case-by-case decisions by human beings. Mannes (2020: 61) thus writes that AI can produce not only financial loss or even physical injury but also "more subtle harms such as instantiating human bias or undermining individual dignity."
6. Non-maleficence: Non-maleficence refers to the medical principle of doing no harm, such as a doctor's duty to end a course of treatment found to be harmful. Big data analytics, however, puts non-maleficence as a value under pressure due to the prevalence of non-transparent data reuse and repurposing.
7. Accountability: Citizens affected by DA algorithms may well be unaware they are affected; even if aware, they may not understand the implications of related decisions affecting them; and even if they do understand, citizens may well not know whom to hold accountable or how to do so.
8. Privacy: Even when “opt-in” or “opt-out” privacy protections are in place, the correlations among variables
in personal data in big data initiatives allow for easy re-identification and consequent intrusion on privacy.
Studying verbatim Twitter quotations found in journal articles, for instance, Ayers, Nebeker, and Dredze
(2018) found that in 84% of cases, re-identification was possible.
9. Environmental welfare: The “digitalization of everything” also has indirect environmental effects, neglect
of which is an ethical issue. An example is neglecting the issue of increased lithium mining to support the
millions of batteries needed in a digital world, knowing that lithium mining is associated with chemical leak-
age and soil and water pollution. Impacts are not equally distributed, raising issues of environmental justice
as well.
10. Trustworthiness: Ethically negative consequences enumerated above may well lead to diminished trust in
institutions associated with these consequences. Diminished trust, in turn, is associated with diminished
social capital and with negative consequences for society as a whole.

1.4.2 Bias toward the privileged


Numerous social science articles have documented the “digital divide” (access) and “second digital divide” (use)
that favor higher-status groups in society. Recent research by Eszter Hargittai studied social media use, focusing on
Twitter, Facebook, LinkedIn, Tumblr, and Reddit, based on a nationally representative US sample administered by
NORC at the University of Chicago. This panel is noted for supplementing area probability sampling with additional
coverage of hard-to-survey population segments, such as rural and low-income households. Hargittai’s abstract
summarized, “Those of higher socioeconomic status are more likely to be on several platforms suggesting that big
data derived from social media tend to oversample the views of more privileged people. Additionally, internet skills
are related to using such sites, again showing that opinions visible on these sites do not represent all types of people
equally. The article cautions against relying on content from such sites as the sole basis of data to avoid dispropor-
tionately ignoring the perspectives of the less privileged. Whether business interests or policy considerations, it is
important that decisions that concern the whole population are not based on the results of analyses that favor the
opinions of those who are already better off” (Hargittai, 2020).
The bias toward the privileged documented by Hargittai is just the tip of the iceberg. Social media, websites,
blogs, and other sources of big data are tools. Those with greater resources to use tools do so. That is, there is a
multiplier effect. Not only do those higher in social status use the internet more, they also hire others to do so on
their behalf. Those at the top of the status pecking order are in a position to commission websites, pay legions of
blog posters, underwrite bot campaigns, fund banks of social media tweeters, and hire services to conduct
online “PR” campaigns to promote their interests, products, or candidates. With the internet landscape biased in
this manner, it is all too easy for social scientists to fall prey to the same biases because “that’s what the data say”.
What the data say depends on for whom the data were created. Writing of advances in medicine associated with big data analytics, researchers have found that "Innovators and early adopters are generally from higher-resourced envi-
ronments. This leads to data and findings biased towards those environments. Such biased data in turn continue
to be used to generate new discoveries, further obscuring potentially underrepresented populations, and creating
a nearly inescapable cycle of health inequity” (Tossas-Milligan & Winn, 2019: 86). The same bias exists in other
domains.
Virginia Eubanks (2019), author of Automating Inequality: How High-Tech Tools Profile, Police, and Punish the
Poor, has outlined the impact of digital decision tools on low-income populations. She observed, "At lectures,
conferences, and gatherings, I am often approached by engineers or data scientists who want to talk about the eco-
nomic and social implications of their designs" (p. 212). She has found that high-tech proposals from data scientists fail to meet even feeble standards in terms of "dismantling the digital poorhouse", and she calls for a revolution in
thinking about how digital skills might be redirected to protect human rights and strengthen human capacity, par-
ticularly with regard to poverty.
The relation of big data and DA to human rights issues has been widely recognized but what to do is an unre-
solved matter. Nersessian (2018: 851), for example, has noted, “Even in advanced economies, the inherently global
nature of big data makes it difficult to effectively regulate at the national level, and many domestic laws and poli-
cies are behind the curve.” While Nersessian, citing the United Nations’ “Guiding Principles” document (United
Nations, 2011), advocates using international human rights law to restrict the use of big data by “taking off the table”
any use that violates human rights, at least at present this is no more an effective form of regulation than is legisla-
tion by individual nations.
Digital bias toward the privileged is not limited to matters of poverty and race. There is a digital divide within academia as well, with privileged and deprived classes of scholars. This is expressed well by boyd and
Crawford (2012: 673–674) who noted the policies of social media companies regulate access to their data and
impose fees for better access. These authors wrote, “This produces considerable unevenness in the system: Those
with money – or those inside the company – can produce a different type of research than those outside. Those
without access can neither reproduce nor evaluate the methodological claims of those who have privileged access.
It is also important to recognize that the class of the Big Data rich is reinforced through the university system: Top-
tier, well-resourced universities will be able to buy access to data, and students from the top universities are the ones
most likely to be invited to work within large social media companies. Those from the periphery are less likely to
get those invitations and develop their skills.” The result of the academic digital divide is a widening of the gap in
the capacity to do scholarship with big data.

1.4.3 Discrimination
Scholarly studies have routinely found that computer algorithms, the fodder of DA, may promote bias. A 2015
Carnegie Mellon University study of employment websites found that Google's algorithms displayed high-paying job ads to men at about six times the rate that the same ad was displayed to women. A University of Washington study
found that Google Images searches for “C.E.O.” returned 11% female images whereas the percentage of CEOs who
are women is over twice that (27%). Crawford (2017) gives numerous instances of discriminatory effects, such as AI
applications classifying men as doctors and women as nurses, or not processing darker skin tones. Based on research
in the field, Garcia (2016: 112) observed, “It doesn’t take active prejudice to produce skewed results in web searches,
data-driven home loan decisions, or photo-recognition software. It just takes distorted data that no one notices and
corrects for. Thus, as we begin to create artificial intelligence, we risk inserting racism and other prejudices into the
code that will make decisions for years to come.”
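Disparities of this kind are straightforward to test. The R sketch below assumes, purely for illustration, that the 11% figure came from 100 top image results, and asks whether that share is consistent with the 27% benchmark.

    # Hypothetical: 11 female images among 100 results, benchmark 27%
    binom.test(x = 11, n = 100, p = 0.27)
    # A small p-value indicates the results under-represent women
    # relative to the real-world benchmark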
The complexity of fairness/discrimination issues involving data analytics and big data is illustrated in the
debate between ProPublica and the firm “equivant” (formerly Northpointe) over the COMPAS system. COMPAS,
the Correctional Offender Management Profiling for Alternative Sanctions system, is widely used in the correc-
tional community to identify likely recidivists and is advertised by the equivant company as "Software for Justice".
Presumably COMPAS information is used by law enforcement for closer tracking of former inmates with high recid-
ivism COMPAS scores. A 2016 study by the public interest group ProPublica showed that COMPAS “scored black
offenders more harshly than white offenders who have similar or even more negative backgrounds” (Petrozzino,
2020: 2, referring to Angwin et al., 2016). The equivant company responded by arguing there was no discrimination
since the COMPAS accuracy rate was not significantly different for whites as compared to blacks, and thus was fair.
ProPublica, in turn, defended their charge of discrimination in a later article (Dressel & Farid, 2018) which argued
that fairness should not be gauged by overall accuracy but by the “false positive” rate, since that reflected the area of
potential discriminatory impact. By that criterion, COMPAS had a significantly higher false positive rate for blacks
than for whites. Dressel and Farid concluded, “Black defendants who did not recidivate were incorrectly predicted
to reoffend at a rate of 44.9%, nearly twice as high as their white counterparts at 23.5%; and white defendants who
did recidivate were incorrectly predicted to not reoffend at a rate of 47.7%, nearly twice as high as their black coun-
terparts at 28.0%. In other words, COMPAS scores appeared to favor white defendants over black defendants by
underpredicting recidivism for white and overpredicting recidivism for black defendants.” In this case, fairness or
information justice could be defined in two ways, leading to opposite inferences. It is hardly surprising that those
responsible for and heavily invested in a DA project like COMPAS chose to select a fairness definition favorable to
their interests. It is not so much a case of “lying with statistics” as it is a case of data analysis resting on debatable
assumptions and definitions.
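The two competing fairness definitions can be computed side by side. The R sketch below uses simulated data in which predictions lean harsher on one group; it illustrates the metrics and is not a re-analysis of COMPAS.

    # Simulated scores: predictions are harsher on group A at every
    # level of actual recidivism (all numbers are illustrative)
    set.seed(99)
    n     <- 2000
    grp   <- sample(c("A", "B"), n, replace = TRUE)
    recid <- rbinom(n, 1, 0.4)
    pred  <- rbinom(n, 1, 0.25 + 0.35 * recid + 0.20 * (grp == "A"))
    scores <- data.frame(group = grp, recidivated = recid, predicted = pred)

    # Per-group accuracy ("calibration") versus error-rate balance
    fairness <- function(d) {
      c(accuracy  = mean(d$predicted == d$recidivated),
        false_pos = mean(d$predicted[d$recidivated == 0] == 1),
        false_neg = mean(d$predicted[d$recidivated == 1] == 0))
    }
    sapply(split(scores, scores$group), fairness)
    # Similar accuracy across groups can coexist with sharply unequal
    # false positive rates -- the crux of the ProPublica/equivant dispute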
A 2019 systematic literature review of big data and discrimination by Maddalena Favaretto and her colleagues at
the Institute for Biomedical Ethics, University of Basel, found that most research addressing big data and discrimi-
nation focused on such recommendations as better algorithms, more transparency, and more regulation (Favaretto,
De Clercq, and Elger, 2019). However, these authors found that “our study results identify a considerable number of
barriers to the proposed strategies, such as technical difficulties, conceptual challenges, human bias and shortcom-
ings of legislation, all of which hamper the implementation of such fair data mining practices” (p. 23). Moreover,
the DA literature was found to have rarely discussed “how data mining technologies, if properly implemented, could
also be an effective tool to prevent unfair discrimination and promote equality” (p. 24). That is, existing research
focuses on avoiding discriminatory abuse of big data systems, neglecting the possible use of big data to mitigate
discrimination itself.
Algorithms may enact practices which violate the law. In July, 2020, the Lawyers’ Committee for Civil Rights
under Law filed an amicus brief in a lawsuit against Facebook for redlining, an illegal practice by which minority
groups are effectively obstructed from financing, such as for the purchase of homes in certain areas. Referring
to Facebook financial services advertisements, The Lawyers’ Committee for Civil Rights Under Law (2020)
argued, “Redlining is discriminatory and unjust whether it takes place online or offline and we must not allow
corporations to blame technology for harmful decisions made by CEOs”. The lawsuit contended that digital
advertising on Facebook discriminated based on the race, gender, and age of its users and then provided different
services to these users, excluding them from economic opportunities. This discriminatory practice was based on
profiling of Facebook users. Different users were provided different services based on their algorithm-generated
profiles, resulting in “digital redlining”. (At this writing the case (Opiotennione v. Facebook, Inc.) has not been
adjudicated.)
Likewise, discrimination is inherent in big data systems, which are more effective for some racial groups than
others. The MIT Media Lab, for instance, found that facial recognition software correctly identified white males
99–100% of the time, but the rate for black women was as low as 65% (Campbell, 2019: 54). The higher the rate of
misidentifications, the greater the chance that actions taken on the basis of the algorithms of such software might
be racially discriminatory. Concerns over misidentification using algorithms led San Francisco in May, 2019, to
become the first city to ban facial-recognition software in its police department. The American Civil Liberties
Union (ACLU) has demanded a ban on using facial recognition software by the government and law enforcement
after finding that “Facial recognition technology is known to produce biased and inaccurate results, particularly
when applied to people of color” (Williams, 2020: 11).
In a test, the ACLU ran images of members of Congress against a mug shot database, finding 28 instances where
members of Congress were wrongly identified as possible criminals. Again, people of color were disproportion-
ately represented in the false positive group, including civil rights leader John Lewis (Williams, 2020: 13). A later
ACLU report headlined, “Untold Number of People Implicated in Crimes They Didn’t Commit Because of Face
Recognition" (ACLU, 2020). Inaccuracy, however, has not prevented the widespread and growing use of facial recognition software, nor convictions based on its identifications. Likewise, ICE now routinely uses facial recognition
software to sift through ID cards and drivers’ licenses to find and deport undocumented people in a secret system
largely devoid of protections for those fingered by the software (Williams, 2020: 13).
Discriminatory impacts are even more likely when the algorithm in question draws on discriminatory views in
Twitter and other social media. Garcia (2016: 111) gives the example of “Tay”, an AI bot created by Microsoft for use
on Twitter. The intent of the algorithm was to create a self-learning AI conversationalist. The one-day Tay experi-
ment ended in failure when, starting with neutral language, “in a mere 12 hours, Tay went from upbeat conversa-
tionalist to foul-mouthed, racist Holocaust denier who said feminists ‘should all die and burn in hell’ and that the
actor ‘Ricky Gervais learned totalitarianism from Adolf Hitler, the inventor of atheism’”. That is, the Tay algorithm
amplified existing extremist views of a discriminatory nature.

1.4.4 Diversity and data analytics


Sage Publications is the world’s largest publisher in the field of statistics, data analysis, and research design in social
science. Its “Sage OCEAN” initiative (https://ocean.sagepub.com/) seeks to support social scientists engaging with
computational research methods, data science, and big data. Eve Kraicer (2019), associated with this initiative,
noted, “Here at SAGE Ocean, we’ve been collecting data on the landscape of tools for computational social sci-
ence. While looking through the data, we found an incredible variety, from resources to aid crowdsourcing to text
analysis to social media analysis. Despite this diversity at the technical level of the tools, we found a persistent lack
of diversity in terms of who built these tools.”
Kraicer reported findings that showed that 90% of the founders, chief technical officers, and software developers
were male and that a majority were white. While this is not unusual in science, technology, engineering, and math-
ematics (STEM) fields, it is nonetheless problematic for two major reasons:

1. The modeling effect: Social science research (e.g., Riccucci, Van Ryzin, and Li, 2016) has shown that when
roles are representative by gender, race, or other categories, people from those categories are more likely to
seek to play those roles also. In the case of DA, lack of representativeness may inhibit becoming a user of
DA tools or a developer of them. Kraicer wrote, "The gap … could limit both who we imagine as a
computational social scientist, and even how computational social science should work.”
2. Standpoint theory: Standpoint theory research (e.g., Hekman, 1997) has shown that “where you stand” is cor-
related with the kinds of questions you ask and the kinds of answers you find. In part this is due to differential
access to knowledge, tools, and resources, but “where you stand” also has to do with your role as a woman, a
person of color, or with other life experiences. The body of DA research may be influenced by lack of repre-
sentativeness in the field. Kraicer noted, "Our social position informs what and how we research, and using
tools built from a single perspective may limit what we think to ask and test.”

In line with this, Frey, Patton, and Gaskell (2020) noted that “When analyzing social media data from marginal-
ized communities, algorithms lack the ability to accurately interpret offline context, which may lead to dangerous
assumptions about and implications for marginalized communities” (p. 42). Taking youth gangs as an example of
a marginalized community whose social media communication can be misinterpreted by algorithms, leading to
dire consequences for some and failure to provide services for others, Frey and his associates undertook an experi-
ment in which gang members became involved in the development of algorithms for processing relevant social
media messages. They found “the complexity of social media communication can only be uncovered through the
involvement of people who have knowledge of the localized language, culture, and changing nature of community
climate… If the gap between people who create algorithms and people who experience the direct impacts of them
persists, we will likely continue to reinforce the very social inequities we hope to ameliorate” (pp. 54–55). While
implementing the Frey experiment on a mass basis seems unlikely, to say the least, the experiment
did highlight how and why algorithms for processing social media may lead to error and bias.

1.4.5 Distortion of democratic processes


“Social bots” are computer programs whose algorithms mimic the communication of human beings but whose con-
tent is dictated by whatever individual, group, or government is paying for them. These algorithms, of course, are
made possible by the advance of DA methods using big data. In the 2016 presidential elections, it is estimated that
150,000,000 Americans encountered Russian disinformation on Facebook and Instagram alone (McNamee, 2020:
21). Hagen et al. (2020) studied the use of social bots in the 2020 election, concluding “Specifically, we found that
bot-like accounts created the appearance of a virtual community around far-right political messaging, obscured the
influence of traditional actors (i.e., media personalities, subject matter experts, etc.), and influenced network senti-
ment by amplifying pro-Trump messaging.”
In addition to bias in gathering and interpreting big data, data analytics suffers from another ethical problem:
Withholding data from those who need it. Even if data scientists are scrupulously ethical and adhere to sound
research design, their superiors may not be so. “Officials may have incentives to hide coronavirus cases. China,
Indonesia and Iran have all come under scrutiny for their statistics. ‘Juking the stats’ is not unknown in other
contexts in the U.S., either" (O'Neill, 2020). In June 2020, Brazil removed months of COVID-19 data from a government website amid criticism of its president's handling of the outbreak. In the United States, as
one of several such instances, a Florida newspaper editorialized that “The state of Florida is hiding information
about coronavirus deaths from citizens. Under the direction of Gov. Ron DeSantis and the Florida Department of
Health, the state has consistently refused to inform the public about deaths and infections in Florida nursing homes,
prisons and now, coronavirus deaths as documented by public medical examiners" (Pensacola News Journal, 2020).
Subsequently Florida’s COVID-19 data and dashboard manager was “forced to resign after voicing concerns over
being told to delete coronavirus data” (CBS News, 2020).

1.4.6 Undermining of professional ethics


One dimension associated with big data and DA is the undermining of existing systems of professional ethics and
accountability. In the area of medicine, for instance, Chiauzzi and Wick (2019) have written, “The availability of
large data sets has attracted researchers who are not traditionally associated with health data and its associated ethi-
cal considerations, such as computer and data scientists. Reliance on oversight by ethics review boards is inadequate
and, due to the public availability of social media data, there is often confusion between public and private spaces. In
addition, social media participants and researchers may pay little attention to traditional terms of use.” When medi-
cal professionals defer to AI and anthropomorphize its results, professional ethics may risk being compromised.
In their article, these authors presented four case studies involving commercial scraping, de-anonymization of
forum users, fake profile data, and multiple scraper bots. In each case, the authors found serious violations of spe-
cific guidelines set forth by the Council for International Organizations of Medical Sciences (CIOMS). Violations,
which the authors labeled forms of “digital trespass”, involved “unauthorized scraping of social media data, entry of
false information, misrepresentation of researcher identities of participants on forums, lack of ethical approval and
informed consent, use of member quotations, and presentation of findings at conferences and in journals without
verifying accurate potential biases and limitations of the data” (Chiauzzi & Wick, 2019: n.p., abstract).
While attention to ethical issues in data science has been increasing, it is also widely acknowledged that ethical
training in data science has been deficient. In their article, “Data science education: We’re missing the boat, again”,
Howe et al. (2017), for example, called for new efforts in data science classes, focusing on ethics, legal compliance,
scientific reproducibility, data quality, and algorithmic bias.
The undermining of professional standards has consequences for research results. For instance, in classic mul-
tivariate procedures such as confirmatory factor analysis and multigroup structural equation modeling, or even in
exploratory factor analysis, social scientists have sought to address the common problem that different groups may
attach different meanings to constructs. Chiauzzi and Wick (2019) give the example of differences over the mean-
ing of “treatment” in medical studies, where patients routinely define treatment in broader terms than do doctors.
Patients, for instance, may include not just medications but also “pets” and “handicapped parking stickers” as part
of “treatment”. Women more than men may attach social dimensions to “treatment”. Algorithm-makers may follow
the precepts of computer science without due sensitivity to the need for more subtle and appropriate development of
the measurement model for multivariate analysis. Chiauzzi and Wick conclude that “Faulty data assumptions and
researcher biases may cascade into poorly built algorithms that lead to ultimate inaccurate (and possible harmful)
conclusions.”
The worst impact on professional ethics of DA, data science, AI, and big data may be on the horizon as the auto-
mation of AI itself threatens to institutionalize poor ethical decision-making now common in the field. Dakuo Wang
et al. (2019) of IBM Research USA recently surveyed nearly two dozen corporate data scientists, publishing their
results in an article titled, “Human-AI collaboration in data science: Exploring data scientists’ perceptions of auto-
mated AI.” Though automation of the creation of AI applications is not yet widespread in business or government,
Wang and his colleagues found that “while informants expressed concerns about the trend of automating their jobs,
they also strongly felt it was inevitable” (p. 1). The issue for the future is what “it” is and if automated AI creation
will rest on underlying assumptions that perpetuate biases and unethical practices of the past.

1.4.7 Privacy, profiling, and surveillance issues


As it is in traditional social science research, the privacy issue is a contentious one in the domain of DA, particularly
with regard to “big data” of the social media variety. As the issue is still evolving, lacking consensus among social
scientists, and is even subject to litigation, we cannot here set forth clear guidelines. Often commentators content
themselves to note that serious ethical issues are raised and social scientists must wrestle with them and adopt
research management policies they deem appropriate (e.g., boyd & Crawford, 2012: 671–673). Speaking of informa-
tion, which a citizen in earlier days would have regarded as private and protected, Chief Justice John Roberts noted,
“The fact that technology now allows an individual to carry such information in his hand does not make the infor-
mation any less worthy of the protection for which the Founders fought” (Riley v. California, 573 U.S. 373, 2014).4
An extreme example of data analytics gone to the dark side is provided by China, which has sold its AI-enhanced
surveillance system to at least 18 other countries as of 2019. Campbell (2019: 54) reports how China “is also rolling
out Big Data and surveillance to inculcate ‘positive’ behavior in its citizens.” By combining facial, voice, and gait
recognition software with intense use of cameras (one camera for every six citizens) and feeding data into computer
algorithms, DA is being used to identify and penalize everything from fighting with one’s neighbors to visiting a
mosque to posting the wrong material online to actual crimes. After surveillance systems were installed in taxicabs,
a driver sparked images of Orwell’s 1984 when he told Time magazine that “Now I can’t cuddle my girlfriend off
duty or curse my bosses” (p. 54). The result is a dystopian society in which persecuted groups like the Uighurs feel
compelled to be ultra-patriotic, displaying images of President Xi Jinping in their stores and making posts laudatory
of the regime to social media. Over a million people have been rounded up, partly enabled by DA, and sent to “re-
education centers”, where dire conditions prevail.
All tools may be used for good or evil. The CEO of Watrix, one of the suppliers for surveillance systems in
China, stated, “From our perspective, we just provide the technology. As for how it’s used, like all high tech, it may
be a double-edged sword” (Campbell, 2019: 55). This is a prevalent attitude in the big data community. Facebook,
for instance, disavows any responsibility for contributing to the rise of hate groups in America, for allowing Russians
to hack American and other elections via social media, or for racial bias in outcomes.
An example closer to home is Google’s “Project Nightingale”, an effort to digitize and store up to 50 mil-
lion health-care records obtained from Ascension, a leading US health-care provider. As reported by the Wall
St. Journal, the Guardian, and in a medical journal by Schneble, Elger, and Shaw (2020), a project employee
blew the whistle on misconduct in failing to protect the privacy and confidentiality of personal health information.
Specifically, the whistleblower charged and the Wall St. Journal confirmed that patients and doctors were not asked
for informed consent to share data and were not even notified. Also, health data were transmitted without anony-
mization with the result that Google employees had full access to non-anonymous patient health-care records. All
this occurred in spite of Google requiring training in medical data ethics. In their medical journal article, Schneble and colleagues conclude that data science and AI should not be exempt from scrutiny and prior approval by Institutional Review
Boards. The challenge, of course, is assuring IRB independence from employer interests.
Medicine provides other leading examples of privacy issues pertaining to big data and DA. Garattini et al. (2019:
69), for instance, cite four major categories of ethical issues in the medical sector:

1. Automation and algorithmic methods may restrict freedom of choice by the patient over what is done with
the data that individual provides. There is great “difficulty for individuals to be fully aware of what happens
to their data after collection … the initial data often moves through an information value chain: From data
collectors, to data aggregators, to analysts/advisors, to policy makers, to implementers. … with the final actor/
implementer using the data for purposes that can be very different from the initial intention of the individual
that provided the data” (p. 74). The authors suggest that offering the freedom to opt-out of data collection or
at least the option to seek a second, independent decision could be a remedy for patients, but opt-out strategies
have not proved effective consumer protection in other areas and second opinions may be prohibitively costly
for many patients even if possible in principle.
2. Big data analytics complexity may effectively make informed consent impossible. Garattini et al. (2019:
75–76) cite a recent Ebola outbreak in explaining the impossibility of applying informed consent in the con-
text of viral outbreaks, for instance.
3. Data analytics may well serve as a form of profiling individual and group identities, with consequent issues for
fair health access and justice. Garattini et al. (2019: 76) write, “In the case of viral diagnostics for example, the
amount and granularity of information provides not only the knowledge regarding potential drug resistance
parameters by the infecting organism but also the reconstruction of infectious disease outbreaks, transform-
ing the question of ‘who infected whom’ into ‘they infected them’, i.e., from the more general to the defini-
tive form” (cf. Pak & Kasarskis, 2015). To take another example, Lu (2019) was able to use data analytics to
identify trucks engaged in illegal construction site dumping with .84 precision, meaning that 16% of trucks
profiled as such were not illegal.
4. Big data analytics is normalizing surveillance of the population and changing the capabilities for and norms
regarding population-wide interventions of various types. Garattini et al. (2019: 77) note that in the area of
monitoring infectious diseases, big data may include information on social media, search engine search word
trends, and other indirect measures such that “Algorithms can provide automated decision support as part of
clinical workflow at the time and location of the decision-making, without requiring clinician initiative,” as,
for example, hospital-level or government-level to mount vaccination programs. The decision about vaccina-
tion is elevated from the realm of doctor-patient norms to the realm of norm pertaining to public health policy,
with attendant benefits but also risks. The authors note, “The overall consequences for individuals, groups,
healthcare providers and society as a whole remain poorly understood” (p. 80).
What is legal may not be ethical when it comes to DA. On the one hand there are powerful arguments in favor of
treating data scraped from the web and social media as public:

1. The data are in fact publicly accessible. Moreover, individuals who post do so knowing this. Journalists, law
enforcement authorities, teachers, and others have frequently warned that one should not post unless one is
willing for one’s community, friends, workplaces, and the public to know what is posted. Users frequently use
the public nature of posting to re-tweet or otherwise disseminate posted information themselves.
2. The courts have not prevented large corporations, government, and other entities from collecting web and
social media data on a mass basis. For example, it is now routine for a person’s posts about seeking to buy a
particular automobile or other item to result in email and pop-up web advertisements directed to that person.
Indeed, doing just this has become a giant business in its own right. At this writing it seems extremely unlikely
that there will be a legal sea change in favor of privacy.
3. In social science, the open science movement has emphasized data availability. The ability of other schol-
ars to replicate a researcher’s work is fundamental to the scientific method. If research cannot be repli-
cated, it is suspect. Replication requires access to the researcher’s data. The National Science Foundation
policy states “Investigators are expected to share with other researchers, at no more than incremental
cost and within a reasonable time, the primary data, samples, physical collections and other supporting
materials created or gathered in the course of work under NSF grants. Grantees are expected to encour-
age and facilitate such sharing” (https://www.nsf.gov/bfa/dias/policy/dmp.jsp). It is not uncommon for
other research funding organizations to require the public archiving of research data they have funded.
Following the replicability principle, many journals will not publish papers based on proprietary, classi-
fied, or otherwise unavailable data such that it is impossible to check the validity of the author’s work. The
replicability principle applies to all research data and does not make exception for data scraped from the
web or social media.

On the other hand, there are strong arguments for privacy also. Most of these revolve around the Hippocratic Oath,
which emphasizes the “Do no harm” principle, which is also seen as a professional obligation. What is legal is not
necessarily ethical. Institutional Review Boards have long been established with the charge of promoting ethical
behavior in survey and experimental research. In both of those contexts, unlike the context of social media, it is
possible and expected to obtain informed consent at the individual level. Attempts have been made to apply the
informed consent principle to the digital world, notably the European Union Data Directive. Article 7 of this directive allows subjects to block usage of their personal data without consent, and its Article 12 requires that subjects
receive an account of digitally-based decisions which impact them. A 2018 EU evaluation of the directive revealed
considerable debate about its effectiveness.
Injury to the respondents might be incurred by release of individually identifiable information on sensitive issues
such as health (employers and insurers might otherwise use this), illegal activities (law enforcement might use infor-
mation on drug use), sexual views and activities (making this public could disrupt marriages), and views on race, abor-
tion, and other sensitive issues (release of this could lead to harassment by neighbors and the community). IRBs have
generally taken the view that data gathering (e.g., all survey items or interview protocols) requires written consent of
the individual. Applying this principle to social media and other big data may lead to a policy of not releasing data
(e.g., not releasing tweets gathered from the public Twitter API) unless anonymized in order to protect individuals
from possible injury.
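What such anonymization might look like in practice is sketched below in R, using the digest package to hash account identifiers and a regular expression to redact @-mentions. This is a minimal illustration only: unsalted hashes are vulnerable to dictionary attacks, and verbatim text can often be re-identified by search, as Ayers et al. (2018) showed.

    library(digest)

    tweets <- data.frame(
      user_id = c("alice01", "bob_99"),   # hypothetical accounts
      text    = c("Thanks @carol for the link!",
                  "Heading to the clinic again"),
      stringsAsFactors = FALSE
    )

    # Replace direct identifiers with one-way hashes (a secret salt would
    # be needed in practice to resist dictionary attacks)
    tweets$user_hash <- vapply(tweets$user_id, digest,
                               character(1), algo = "sha256")
    tweets$user_id   <- NULL

    # Redact @-mentions of other users from the text
    tweets$text <- gsub("@\\w+", "@[redacted]", tweets$text)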
Given the pro-public and pro-privacy arguments, social scientists are forced to do more than ponder the ethical
issues. At the end of the day, decisions must be made about data access. Compromise policies must be adopted. To
give one example of such compromise, the following are guidelines from the Social Science Computer Review with
regard to their “data availability” requirement for all articles:

There are a variety of ways to fulfill the data availability requirement:
• Refer to the url of a public archive through which the anonymized data are available.
• State that the data are in an online supplement file hosted by the journal.
• Refer to a public source of the data, with url or contact information.
• State that data are available for use under controlled conditions by applying to a board/department/committee
whose charge includes making data available for replication, giving contact information.
• State that the data may be purchased at a non-prohibitive price from a third party, whose contact information
is given.
• State that the anonymized data are available from an author at a given email address.
• State that the variance-covariance matrix and variable-level descriptive statistics are available from an author at a given email address. (Many statistical procedures, such as factor analysis or structural equation modeling, may be performed with such input, not requiring individual-level data; see the R sketch after this list.)
• In the case of data scraped from social media or the web, it is sufficient if an appendix contains detailed infor-
mation that would enable a reader-researcher to reconstruct the same or a similar dataset.
• In rare cases, dataset availability is not relevant to the particular article. Check with the editor about such an
exception.
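As a hedged illustration of the summary-statistics option above, a structural equation model can be estimated in R from a covariance matrix alone. The variable names, matrix values, and sample size below are invented for the example:

library(lavaan)

vars <- c("x1", "x2", "y")
S <- matrix(c(1.00, 0.45, 0.30,
              0.45, 1.00, 0.25,
              0.30, 0.25, 1.00),
            nrow = 3, dimnames = list(vars, vars))

# sample.cov and sample.nobs substitute for individual-level data
fit <- sem("y ~ x1 + x2", sample.cov = S, sample.nobs = 300)
summary(fit)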

This particular journal noted that the alternative to the foregoing policy would not be the absence of a data availability statement, but rather a statement from the journal that the data are unavailable for replication and that findings based on inference from the data should consequently be viewed as unverifiable.

1.4.8 The transparency issue


A leading purpose of data analytics is to support policy decision-making. A prime democratic principle of policy
decision-making is that it should be transparent or at least explainable to those affected by the decision. The purpose
and the principle are in conflict. Tal Zarsky has outlined how automation of algorithm-based decision-making
founded on data analytics, whether it is decisions about credit-worthiness in the banking sector or life-and-death
decisions about drone strikes in the military sector, inherently involves an increase in opacity. Zarsky (2016: 121)
thus wrote, “Analysis based upon mined data, premised on thousands of parameters, may be difficult to explain
to humans. Therefore, achieving transparency in such cases presents substantial challenges. Equally, the firm gov-
erning through such data analysis would find it difficult to adequately explain the ‘real reason’ for its automated
response – even after making a good faith effort to do so.” When data analysts use “machine learning” and “deep
learning” procedures, “black boxes” are created, which undermine transparency.
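One partial remedy, sketched below with the randomForest package and R's built-in iris data (an illustrative example, not a procedure prescribed by Zarsky), is to report variable importance measures that at least summarize which inputs drive an otherwise opaque model:

library(randomForest)

set.seed(123)  # make the forest reproducible
rf <- randomForest(Species ~ ., data = iris, importance = TRUE)
importance(rf)   # per-variable contributions to accuracy and node purity
varImpPlot(rf)   # visual summary a non-technical audience can inspect

Such summaries do not fully explain individual decisions, but they give an outside reviewer something concrete to interrogate.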
That citizens tend to accept technology-based decisions as valid (Citron, 2007) only compounds the problem
from the viewpoint of democratic theory. Blind and unquestioning acceptance of authoritative decisions by the
public is antithetical to the premises of democracy. The mantle of data analytic technology has the capacity to cloak
decisions in an aura of science. Under democratic principles, however, decision-making and governance is supposed
to be founded on legitimacy, not mystery. Legitimacy, in turn, is supposed to be rooted in the will of the people,
which in turn requires transparency.
Unfortunately, providing transparency is not simple and may be impossible. The citizen who wishes to challenge a decision made by algorithms may well find legal and institutional restrictions in the way: privacy laws may prevent access to information on what happened to comparable patients in medicine, military and law enforcement secrecy may classify the relevant information, or ownership rights may lead media companies to deny access altogether. Even where there is some measure of transparency, as in the credit and medical industries, both of which ostensibly extend to citizens the right to review records and file corrections, practice undermines the theoretical benefits.
Kemper and Kolkman (2019) made the point that “one elementary – yet key – question remains largely undis-
cussed. If transparency is a primary concern, then to whom should algorithms be transparent? We consider algo-
rithms as socio-technical assemblages and conclude that without a critical audience, algorithms cannot be held
accountable.” That is, if meaningful transparency is to exist, it must exist for an independent critical audience.
Kemper and Kolkman conclude, “The value of transparency fundamentally depends on enlisting and maintaining
critical and informed audiences.” In other areas of public policy and governance, the critical audience might take
such forms as public hearings, ombudsmen, citizen review boards, or inspectors general. However, these forms of
institutionalizing a critical audience are time-intensive and do not mesh well with the production needs of algo-
rithm-based systems and may be hamstrung by such issues as corporate property rights and governmental secrecy,
not to mention their sheer cost. Moreover, the effectiveness of such remedies in other spheres has a spotty record at
best, even were those commissioning algorithmic systems inclined to make trouble for themselves by institutional-
izing an independent critical audience as part of their development process.
The meaningfulness of most transparency measures is questionable at present. While transparency is widely
given lip service, in practice very few citizens avail themselves of the ostensible opportunities (Zarsky, 2016:
122). To the extent that people do challenge algorithmic decisions, such as those related to their financial credit, they raise transaction costs for the credit-giving institution, which is apt to respond not by making challenges easy but by doing just the opposite. From the citizen's point of view, taking advantage of transparency opportunities imposes high costs in time and sometimes even legal fees. The few who do challenge may well give up after protracted dealings with the institution. The high price of implementing meaningful transparency is why, by and large, it does not exist in most settings where algorithmically-based decision-making prevails.

1.5 Summary: Technology and power


Philip Brey has argued that certain types of technology act as facilitators, enablers, and ensurers of certain types of
power structures in society, sectors, and even organizations. In Brey’s theoretical framework there are five ways in which power is exercised. Building on Brey, Mark Ryan has further shown how this works in the agricultural
sector and in environmental policy specifically with regard to the technology of data analytics. In farming, agricul-
tural big data analytics (ABDA) presents itself as a politically neutral and beneficial way of improving farming prac-
tices, improving agricultural decision-making, and creating a sustainable, environment-friendly future. However, it
is not that simple. Brey’s five modes of power are listed below, along with Ryan’s corresponding illustrations:

1. Manipulation: ABDA can be used as a form of manipulative power to initiate cheap land grabs in ways farm-
ers would not have agreed to willingly.
2. Seduction: ABDA can pressure farmers to install monitors on their farms and limit access to them, limiting farmers' freedom and otherwise encouraging practices farmers would not themselves have chosen.
3. Leadership: Agricultural technology providers get farmers to agree to the use of ABDA without their informed consent with regard to data ownership, data sharing, and data privacy.
4. Coercion: Agricultural technology providers threaten farmers with the loss of big data analytics if farmers do not obey their policies, and farmers are coerced into remaining with the provider out of fear of legal and economic reprisal.
5. Force: Agricultural technology providers use ABDA to calculate farmer willingness-to-pay rates and then use
this information to force farmers into vulnerable financial positions.

Ryan, who goes into much more detail on each of these five points in his article, makes the case that, far from being neutral, data analytics is instrumental to the exercise of power. Data analytics has the proven potential to
give agricultural technology providers the upper hand in the game of power, much as it does in all sectors of the
economy.
In this chapter, we started with a brief account of the promise of DA, data science, and AI. As this story is prominent in the media, our account was kept brief, aiming to acknowledge the positives both in general and for social science specifically. However, most of this chapter has been devoted to the much-needed but less-told story of the
perils and pitfalls of big data and algorithmic policymaking both in terms of research design problems and in terms
of social and ethical issues.
In matters of research design, this chapter called attention to the very real problem of “true believership” and the disinclination of data science as a field to see the possibility that there may be multiple paths to the truth, including
traditional statistics on the one hand and qualitative research on the other. Those who use data analytics must recog-
nize pseudo-objectivity when they see it in research and recognize that progress is made not by denying bias exists
but rather by acknowledging it and seeking to counterbalance it. This is an enormous challenge given the limitations
in the way both big data and the tools to analyze it are created.

TEXT BOX 1.2 Data Ethics – Top Ten Checklist

10. Does the organization restrict data collection to the necessary? Ethical compromise often arises from
collecting all data in sight. In contrast, ethical practices are better promoted by a policy of data minimi-
zation, which means collecting only data necessary to achieve organizational goals.
9. Does the organization repurpose data? If data authorized by the sources for one purpose are then
repurposed to other goals, the principle of informed consent is violated. This problem is confounded
when the repurposing is done by another entity to which the data are sold or shared.
8. Does the organization promote data transparency? No matter what other internal and external mecha-
nisms the organization puts into place to assure ethical data practices, they will never be comprehensive.
By making data and systems as transparent as possible, additional feedback will be forthcoming, some-
times from unexpected sources. More feedback promotes better and more ethical decision-making.
7. Does the organization promote a culture of data ethics? The organization must care about broader
values than short-term profits or political advantage. Promoting an organizational culture of data ethics
may involve embedding this culture in job descriptions, hiring processes, orientations, ongoing training,
manuals and reports, and job evaluations.
6. Does the organization reward data ethics entrepreneurs? In every area of successful innovation, imple-
mentation of the innovation is promoted when there is an advocate promoting change. If there is such
a data ethics entrepreneur, that person should be rewarded, not only for the person’s sake but also as a
statement of the organization’s values and culture.
5. Does the organization hire data scientists who care about ethics? Rather than force people to change, it
is better to hire the right people at the outset. Newly-hired data scientists should understand that focus-
ing on more modest but more ethical outcomes takes precedence over constructing unbridled systems
which might be technologically feasible.
4. Does the organization seek to counter algorithmic bias? Ethical lapses are often traced to biased and
flawed model assumptions. Short of hiring better analysts to begin, giving them ethical mandates, and
allowing them time to do their job, bias is also minimized if the project team includes not only technical
data science staff but also subject matter experts, research methodologists from outside data science,
representatives of affected groups, and peer reviewers.
3. Are impact studies conducted prior to system deployment? In addition to countering bias by a diverse
development team, requiring a formal, independent data system impact study alerts the organization to
prospective ethical problems.
2. Does the CEO support data ethics? Studies of technology acceptance and diffusion show many success factors, but prime among them is strong support by the chief executive officer for the innovation. This applies to introducing data ethics mechanisms into the organization.
1. Is someone responsible for data ethics? While all organizational members share ethical responsibili-
ties, the organization needs (1) a named data steward for each data system deployed; (2) oversight of the
data steward by an in-house Ethics Review Committee or the like; and (3) an annual independent and
external data ethics audit involving a data ethicist.

There are many types of social and ethical issues in data analytics, data science, and AI. Foremost is the fact that, when all is said and done, these are tools. Tools may be used for good or evil. Tools may be best and most fully exploited by those with the resources to do so, which is why studies find a bias toward the privileged in society. Specific ethical issues such as discrimination or the undermining of privacy are becoming better known, but these issues are the tip
of the iceberg. Submerged beneath the surface but posing a greater and more subtle danger to society are threats to
democracy, professional standards, and the way decisions are made. Algorithmic rigidity, misleading profiling, and
failure to reap the benefits of diversity are true and present dangers. It was said of those who fought despotism from
within in another time, “they did what they could”. It is trite but accurate to say that eternal vigilance is the price of
freedom. This applies to the digital world as well. As social science scholars, we must do what we can, supporting
transparency, diversity, and the public good in a problematic economic and political environment.

Endnotes
1. https://ncvhs.hhs.gov/wp-content/uploads/2014/05/090930lt.pdf
2. http://kbroman.org/pkg_primer/pages/cran.html
3. https://stat.ethz.ch/pipermail/r-help/2016-December/443689.html
4. In this case, the Supreme Court held unanimously that warrantless search and seizure of a cell phone with its digital con-
tents during an arrest is unconstitutional.