
Corpus Linguistics and Statistics with R: Introduction to Quantitative Methods in Linguistics, 1st Edition


Quantitative Methods in the Humanities and Social Sciences

Guillaume Desagulier

Corpus Linguistics and Statistics with R
Introduction to Quantitative Methods in Linguistics
Quantitative Methods in the Humanities and Social Sciences

Editorial Board
Thomas DeFanti, Anthony Grafton, Thomas E. Levy, Lev Manovich,
Alyn Rockwood

Quantitative Methods in the Humanities and Social Sciences is a book series designed to foster
research-based conversation with all parts of the university campus from buildings of ivy-covered
stone to technologically savvy walls of glass. Scholarship from international researchers and the
esteemed editorial board represents the far-reaching applications of computational analysis, statistical
models, computer-based programs, and other quantitative methods. Methods are integrated in
a dialogue that is sensitive to the broader context of humanistic study and social science research.
Scholars, including among others historians, archaeologists, classicists and linguists, promote this
interdisciplinary approach. These texts teach new methodological approaches for contemporary
research. Each volume exposes readers to a particular research method. Researchers and students
then benefit from exposure to subtleties of the larger project or corpus of work in which the quantitative
methods come to fruition.

More information about this series at http://www.springer.com/series/11748


Guillaume Desagulier
Université Paris 8
Saint Denis, France

Additional material to this book can be downloaded from http://extras.springer.com.

ISSN 2199-0956 ISSN 2199-0964 (electronic)


Quantitative Methods in the Humanities and Social Sciences
ISBN 978-3-319-64570-4 ISBN 978-3-319-64572-8 (eBook)
DOI 10.1007/978-3-319-64572-8

Library of Congress Control Number: 2017950518

© Springer International Publishing AG 2017


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned,
specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any
other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or
dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the
absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for
general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and
accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect
to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.

Printed on acid-free paper

This Springer imprint is published by Springer Nature


The registered company is Springer International Publishing AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
To Fatima, Idris, and Hanaé (who knows how to
finish a paragraph).

“Piled Higher and Deeper” by Jorge Cham, www.phdcomics.com


Preface

Who Should Read This Book

In the summer of 2008, I gave a talk at an international conference in Brighton. The talk was about constructions
involving multiple hedging in American English (e.g., I’m gonna have to ask you to + VP). I remember
this talk because even though I had every reason to be happy (the audience showed sustained interest and a
major linguist in my field gave me positive feedback), I remember feeling a pang of dissatisfaction. Because
my research was mostly theoretical at the time, I had concluded my presentation with the phrase “pending
empirical validation” one too many times. Of course, I had used examples gleaned from the renowned Corpus
of Contemporary American English, but my sampling was not exhaustive and was certainly biased. Even
though I felt I had answered my research questions, I had provided no quantitative summary. I went home
convinced that it was time to buttress my research with corpus data. I craved a better understanding of
corpora, their constitution, their assets, and their limits. I also wished to extract the data I needed the way I
wanted, beyond what traditional, prepackaged corpus tools have to offer.
I soon realized that the kind of corpus linguistics that I was looking for was technically demanding,
especially for the armchair linguist that I was. In the summer of 2010, my lab offered me the chance to attend a
one-week boot camp in Texas whose instructor, Stefan Th. Gries (University of California, Santa Barbara), had just
published Quantitative Corpus Linguistics with R. This boot camp was a career-changing opportunity. I
went on to teach myself more elaborate corpus-linguistics techniques as well as the kinds of statistics that
linguists generally have other colleagues do for them. This led me to collaborate with great people outside
my field, such as mathematicians, computer engineers, and experimental linguists. While doing so, I never
set aside my research in theoretical linguistics. I can say that acquiring empirical skills has made me a better
theoretical linguist.
If the above lines echo your own experience, this book is perfect for you. While written for a readership
with little or no background in corpus linguistics, computer programming, or statistics, Corpus Linguistics
and Statistics with R will also appeal to readers with more experience in these fields. Indeed, while presenting
in detail the text-mining apparatus used in traditional corpus linguistics (frequency lists, concordance tables,
collocations, etc.), the text also introduces the reader to some appealing techniques that I wish I had become
acquainted with much earlier in my career (motion charts, word clouds, network graphs, etc.).


Goals

This is a book on empirical linguistics written from a theoretical linguist’s perspective. It provides both
a theoretical discussion of what quantitative corpus linguistics entails and detailed, hands-on, step-by-step
instructions to implement the techniques in the field.

A Note to Instructors

I have written this book so that instructors feel free to teach the chapters individually in the order that they
want. For a one-semester course, the emphasis can be on either:
• Methods in corpus linguistics (Part I)
• Statistics for corpus linguistics (Part II)
In either case, I recommend always including Chaps. 1 and 2 to make sure that the students are
familiar with the core concepts of corpus-based research and R programming.
Note that it is also possible to include all the chapters in a one-semester course, as I do on a regular
basis. Be aware, however, that this leaves the students less time to apply the techniques to their own research
projects. Experience tells me that the students need sufficient time to work through all of the earlier chapters,
as well as to adjust to the statistical analysis material.

Supplementary Materials

Throughout the chapters, readers will make extensive use of datasets and code. These materials are available
from the book’s Springer Extras website:
http://extras.springer.com/2017/978-3-319-64570-4
I recommend that you check the repository on a regular basis for updates.

Acknowledgments

Writing this book would not have been possible without the help and support of my family, friends, and
colleagues. Thanks go to Fatima, Idris, and Hanaé for their patience. In particular, I would like to thank
those who agreed to proofread and test-drive early drafts in real-life conditions. Special thanks go to Antoine
Chambaz for his insightful comments on Chap. 8.

Guillaume Desagulier
Paris, France
April 2017
Contents

1 Introduction
1.1 From Introspective to Corpus-Informed Judgments
1.2 Looking for Corpus Linguistics
1.2.1 What Counts as a Corpus
1.2.2 What Linguists Do with the Corpus
1.2.3 How Central the Corpus Is to a Linguist’s Work
References

Part I Methods in Corpus Linguistics

2 R Fundamentals
2.1 Introduction
2.2 Downloads and Installs
2.2.1 Downloading and Installing R
2.2.2 Downloading and Installing RStudio
2.2.3 Downloading the Book Materials
2.3 Setting the Working Directory
2.4 R Scripts
2.5 Packages
2.5.1 Downloading Packages
2.5.2 Loading Packages
2.6 Simple Commands
2.7 Variables and Assignment
2.8 Functions and Arguments
2.8.1 Ready-Made Functions
2.8.2 User-Defined Functions
2.9 R Objects
2.9.1 Vectors
2.9.2 Lists
2.9.3 Matrices
2.9.4 Data Frames (and Factors)
2.10 for Loops
2.11 if and if...else Statements
2.11.1 if Statements
2.11.2 if...else Statements
2.12 Cleanup
2.13 Common Mistakes and How to Avoid Them
2.14 Further Reading
Exercises
References

3 Digital Corpora
3.1 A Short Typology
3.2 Corpus Compilation: Kennedy’s Five Steps
3.3 Unannotated Corpora
3.3.1 Collecting Textual Data
3.3.2 Character Encoding Issues
3.3.3 Creating an Unannotated Corpus
3.4 Annotated Corpora
3.4.1 Markup
3.4.2 POS-Tagging
3.4.3 POS-Tagging in R
3.4.4 Semantic Tagging
3.5 Obtaining Corpora
Exercise
References

4 Processing and Manipulating Character Strings
4.1 Introduction
4.2 Character Strings
4.2.1 Definition
4.2.2 Loading Several Text Files
4.3 First Forays into Character String Processing
4.3.1 Splitting
4.3.2 Matching
4.3.3 Replacing and Deleting
4.3.4 Limitations
4.4 Regular Expressions
4.4.1 Overview
4.4.2 Literals vs. Metacharacters
4.4.3 Line Anchors
4.4.4 Quantifiers
4.4.5 Alternations and Groupings
4.4.6 Character Classes
4.4.7 Lazy vs. Greedy Matching
4.4.8 Backreference
4.4.9 Exact Matching with strapply()
4.4.10 Lookaround
Exercises

5 Applied Character String Processing
5.1 Introduction
5.2 Concordances
5.2.1 A Concordance Based on an Unannotated Corpus
5.2.2 A Concordance Based on an Annotated Corpus
5.3 Making a Data Frame from an Annotated Corpus
5.3.1 Planning the Data Frame
5.3.2 Compiling the Data Frame
5.3.3 The Full Script
5.4 Frequency Lists
5.4.1 A Frequency List of a Raw Text File
5.4.2 A Frequency List of an Annotated File
Exercises
References

6 Summary Graphics for Frequency Data
6.1 Introduction
6.2 Plots, Barplots, and Histograms
6.3 Word Clouds
6.4 Dispersion Plots
6.5 Strip Charts
6.6 Reshaping Tabulated Data
6.7 Motion Charts
Exercises
References

Part II Statistics for Corpus Linguistics

7 Descriptive Statistics
7.1 Variables
7.2 Central Tendency
7.2.1 The Mean
7.2.2 The Median
7.2.3 The Mode
7.3 Dispersion
7.3.1 Quantiles
7.3.2 Boxplots
7.3.3 Variance and Standard Deviation
Exercises

8 Notions of Statistical Testing
8.1 Introduction
8.2 Probabilities
8.2.1 Definition
8.2.2 Simple Probabilities
8.2.3 Joint and Marginal Probabilities
8.2.4 Union vs. Intersection
8.2.5 Conditional Probabilities
8.2.6 Independence
8.3 Populations, Samples, and Individuals
8.4 Random Variables
8.5 Response/Dependent vs. Explanatory/Descriptive/Independent Variables
8.6 Hypotheses
8.7 Hypothesis Testing
8.8 Probability Distributions
8.8.1 Discrete Distributions
8.8.2 Continuous Distributions
8.9 The χ² Test
8.9.1 A Case Study: The Quotative System in British and Canadian Youth
8.10 The Fisher Exact Test of Independence
8.11 Correlation
8.11.1 Pearson’s r
8.11.2 Kendall’s τ
8.11.3 Spearman’s ρ
8.11.4 Correlation Is Not Causation
Exercises
References

9 Association and Productivity
9.1 Introduction
9.2 Cooccurrence Phenomena
9.2.1 Collocation
9.2.2 Colligation
9.2.3 Collostruction
9.3 Association Measures
9.3.1 Measuring Significant Co-occurrences
9.3.2 The Logic of Association Measures
9.3.3 A Quick Inventory of Association Measures
9.3.4 A Loop for Association Measures
9.3.5 There Is No Perfect Association Measure
9.3.6 Collostructions
9.3.7 Asymmetric Association Measures
9.4 Lexical Richness and Productivity
9.4.1 Hapax-Based Measures
9.4.2 Types, Tokens, and Type-Token Ratio
9.4.3 Vocabulary Growth Curves
Exercise
References

10 Clustering Methods
10.1 Introduction
10.1.1 Multidimensional Data
10.1.2 Visualization
10.2 Principal Component Analysis
10.2.1 Principles of Principal Component Analysis
10.2.2 A Case Study: Characterizing Genres with Prosody in Spoken French
10.2.3 How PCA Works
10.3 An Alternative to PCA: t-SNE
10.4 Correspondence Analysis
10.4.1 Principles of Correspondence Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
10.4.2 Case Study: General Extenders in the Speech of English Teenagers . . . . . . . . . . . . . . 257
10.4.3 How CA Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
10.4.4 Supplementary Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
10.5 Multiple Correspondence Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
10.5.1 Principles of Multiple Correspondence Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
10.5.2 Case Study: Predeterminer vs. Preadjectival Uses of Quite and Rather . . . . . . . . . . . . 270
10.5.3 Confidence Ellipses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
10.5.4 Beyond MCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
10.6 Hierarchical Cluster Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
10.6.1 The Principles of Hierarchical Cluster Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
10.6.2 Case Study: Clustering English Intensifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
10.6.3 Cluster Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
10.6.4 Standardizing Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
10.7 Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
10.7.1 What Is a Graph? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
10.7.2 The Linguistic Relevance of Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292

A Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
A.1 Chapter 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
A.1.1 Dispersion Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
A.2 Chapter 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
A.2.1 Contingency Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
A.2.2 Discrete Probability Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
A.2.3 A χ 2 Distribution Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300

B Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301

Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
Chapter 1
Introduction

Abstract In this chapter, I explain the theoretical relevance of corpora. I answer three questions: What
counts as a corpus? What do linguists do with the corpus? What status does the corpus have in the linguist’s
approach to language?

1.1 From Introspective to Corpus-Informed Judgments

Linguists investigate what it means to know and speak a language. Despite having the same objective,
linguists have their theoretical preferences, depending on what they believe is the true essence of language.
Supporters of transformational-generative grammar (Chomsky 1957, 1962, 1995) claim that the core of
grammar consists of a finite set of abstract, algebraic rules. Because this core is assumed to be common to
all the natural languages in the world, it is considered a universal grammar. The idiosyncrasies of language
are relegated to the periphery of grammar. Such is the case of the lexicon, context, elements of inter-speaker
variation, cultural connotations, mannerisms, non-standard usage, etc.
In generative grammar, pride of place is given to syntax, i.e. the way in which words are combined to
form larger constituents such as phrases, clauses, and sentences. Syntax hinges on an opposition between
deep structure and surface structure. The deep structure is the abstract syntactic representation of a sentence,
whereas the surface structure is the syntactic realization of the sentence as a string of words in speech.
For example, the sentence an angry cow injured a farmer with an axe has one surface structure (i.e. one
realization as an ordered string of words) but two alternative interpretations at the level of the deep structure:
(a) an angry cow injured a farmer who had an axe in his hands, and (b) an angry cow used an axe to injure a
farmer. In other words, a sentence is generated from the deep structure down to the surface structure.
More generally, generative grammar is a “top-down” approach to language: a limited set of abstract rules
“at the top” is enough to generate and account for an infinite number of sentences “at the bottom”. What
generative linguists truly look for is the finite set of rules that core grammar consists of, in other words speakers’
competence, as opposed to speakers’ performance. This is to the detriment of idiosyncrasies of all kinds.
Conversely, theories such as functional linguistics (Dik 1978, 1997; Givón 1995), cognitive linguistics
(Langacker 1987, 1991; Goldberg 1995), and contemporary typology (Greenberg 1963) advocate a “bottom-up”
approach to language. It is usage that shapes the structure of language (Bybee 2006, 2010; Langacker
1988). Grammar is therefore derivative, not generative. There is no point in separating competence and
performance anymore because competence builds on performance. In the same vein, grammar has no
core or periphery: it is a structured inventory of symbolic units (Goldberg 2003). Therefore, any linguistic
unit is worth being taken into consideration with equal attention: morphemes (un-, -ness), words or phrases
(corpus linguistics), ritualized or formulaic expressions (break a leg!), idioms (he snuffed it), non-canonical
phrasal expressions (sight unseen), semi-schematic expressions (e.g. just because X doesn’t mean Y), and
fully schematic expressions (e.g. the ditransitive construction).

© Springer International Publishing AG 2017
G. Desagulier, Corpus Linguistics and Statistics with R, Quantitative Methods in the Humanities
and Social Sciences, DOI 10.1007/978-3-319-64572-8_1

Like biologists, who apprehend life indirectly, i.e. by studying the structure, function, growth, evolution,
distribution, and taxonomy of living cells and organisms, linguists apprehend language through its
manifestations. For this reason, all linguistic theories rely on native speakers acting as informants to provide data.
On the basis of such data, linguists can formulate hypotheses and test them.
Generative linguists are known to rely on introspective judgments as their primary source of data. This is
a likely legacy of de Saussure, the father of modern linguistics, who delimited the object of study (langue)
as a structured system independent from context and typical of an ideal speaker. However, the method can
be deemed faulty on at least two counts. First, there is no guarantee that linguists’ introspective
acceptability judgments always match what systematically collected data would reveal (Sampson 2001; Wasow
and Arnold 2005). Second, for an intuition of well-formedness to be valid, it should at least be formulated
by a linguist who is a native speaker of the language under study.¹ If maintained, this constraint limits
the scope of a linguist’s work and invalidates a significant proportion of the research published worldwide,
judging from the number of papers written by linguists who are not fluent speakers of the languages they
investigate, including among generativists. This radical position is hardly sustainable in practice. Some
generativists working on early language acquisition have used real performance data to better understand the
development of linguistic competence.²
Because of its emphasis on language use in all its complexity, the “bottom-up” approach provides fertile
ground for corpus-informed judgments. It is among its ranks that linguists, dissatisfied with the practice
of using themselves as informants, have turned to corpora, judging them to be a far better source than
introspection to test their hypotheses. Admittedly, corpora have their limitations. The most frequent criticism
that generative linguists level against corpus linguists is that no corpus can ever provide negative evidence.
In other words, no corpus can indicate whether a given sentence is impossible.³ The first response to this
critique is simple: grammar rules are generalizations over actual usage and negative evidence is of little
import. The second response is that there are statistical methods that can either handle the non-occurrence
of a form in a corpus or estimate the probability of occurrence of a yet unseen unit. A more serious criticism
is the following: no corpus, however large and balanced, can ever hope to be representative of a speaker,
let alone of a language. Insofar as this criticism has to do with the nature and function of corpora, I address
it in Chap. 3.
Linguists make use of corpus-informed judgment because (a) they believe their work is based on a
psychologically realistic view of grammar (as opposed to an ideal view of grammar), and (b) such a view can be
operationalized via corpus data. These assumptions underlie this book. What remains to be shown is what
counts as corpus data.

¹ In fact, Labov (1975) shows that even native speakers do not know how they speak.
² See McEnery and Hardie (2012, p. 168) for a list of references.
³ Initially, generativists claimed that the absence of negative evidence from children’s linguistic experience is an argument in favor
of the innateness of grammatical knowledge. This claim is now used beyond language acquisition against corpus linguistics.

1.2 Looking for Corpus Linguistics

Defining corpus linguistics is no easy task. It typically fills up entire textbooks such as Biber et al. (1998),
Kennedy (1998), McEnery and Hardie (2012), and Meyer (2002). Because corpus linguistics is changing
fast, I believe a discussion of what it is is more profitable than an elaborate definition that might soon be
outdated. The discussion that follows hinges on three questions:
• what counts as a corpus?
• what do linguists do with the corpus?
• what status does the corpus have in the linguist’s approach to language?
Each of the above questions has material, technical, and theoretical implications, and offers no straightfor-
ward answer.

1.2.1 What Counts as a Corpus

A corpus (plural corpora) is a body of material (textual, graphic, audio, and/or video) upon which some
analysis is based. Several disciplines make use of corpora: linguistics of course, but also literature, philosophy,
art, and science. A corpus is not just a collection of linguistically relevant material. For that collection to
count as a corpus, it has to meet a number of criteria: sampling, balance, representativeness, comparability,
and naturalness.

1.2.1.1 A Sample of Samples

A corpus is a finite sample of genuine linguistic productions by native speakers. Even a monitor corpus such
as The Bank of English, which grows over time, has a fixed size at the moment when the linguist taps into it.
Usually, a corpus has several parts characterized by mode (spoken, written), genre (e.g. novel, newspaper),
or period (e.g. 1990–1995). These parts are themselves sampled from a number of
theoretically infinite sources.
Sampling should not be seen as a shortcoming. In fact, it is thanks to it that corpus linguistics can be very
ambitious. Like astrophysicists, who infer knowledge about the properties of the universe and beyond from
the study of an infinitesimal portion of it, corpus linguists use finite portions of language use in the hope that
they will reveal the laws of a given language, or some aspect of it. A corpus is therefore a sample of samples
or, more precisely, a representative and balanced sample of representative and balanced samples.

1.2.1.2 Representativeness

A corpus is representative when its sampling scheme is faithful to the variability that characterizes the target
language. Suppose you want to study the French spoken by Parisian children. The corpus you will design
for this study will not be representative if it consists only of conversations with peers. To be representative,
the corpus will have to include a sizeable portion of conversations with other people, such as parents, school
teachers, and other educators.
Biber (1993, p. 244) writes: “[r]epresentativeness refers to the extent to which a sample includes the full
range of variability in a population.” Variability is a function of situational and linguistic parameters.
Situational parameters include mode (spoken vs. written), format (published, unpublished), setting (institutional,
public, private, etc.), addressee (present, absent, single, interactive, etc.), author (gender, age, occupation,
etc.), factuality (informational, imaginative, etc.), purposes (information, instruction, entertainment, etc.), or
topics (religion, politics, education, etc.).
Linguistic parameters focus on the distribution of language-relevant features in a language. Biber (1993,
p. 249) lists ten grammatical classes that are used in most variation studies (e.g. nouns, pronouns,
prepositions, passives, contractions, or WH-clauses). Each class has a distinctive distribution across text categories,
which can be used to guarantee a representative selection of text types. For example, pronouns and
contractions are interactive and typically occur in texts with a communicative function. In contrast, WH-clauses
require structural elaboration, typical of informative texts.
Distribution is a matter of numerical frequencies. If you are interested in the verbal interaction between
waiters and customers in restaurants, unhedged suggestions such as “do you want fries with that?” will
be overrepresented if 90% of your conversations come from fast-food restaurants. Conversely, forms of
hedged suggestion such as “may I suggest…?” are likely to be underrepresented. To guarantee the
representation of the full range of linguistic variation existing in a specific dialect/register/context, distributional
considerations such as the number of words per text and the number of texts per text type must be taken into
account.
Corpus compilers use sampling methodologies for the inclusion of texts in a corpus. Sampling techniques
based on sociological research make it possible to obtain relative proportions of strata in a population thanks
to demographically accurate samples. Crowdy (1993) is an interesting illustration of the demographic
sampling method used to compile the spoken component of the British National Corpus (The British National
Corpus 2007). To fully represent the regional variation of British English, the United Kingdom was
divided into three supra-regions (the North, the Midlands, and the South) and twelve areas (five for the North,
three for the Midlands, and four for the South). Yet, as Biber (1993, p. 247) points out, the demographic
representativeness of corpus design is not as important to linguists as the linguistic representativeness.

1.2.1.3 Balance

A corpus is said to be balanced when the proportion of the sampled elements that make it representative
corresponds to the proportion of the same elements in the target language. Conversely, an imbalanced
corpus introduces skews in the data. Like representativeness, the balance of a corpus also depends on the
sampling scheme.
Corpus-linguistics textbooks frequently present the Brown Corpus (Francis and Kučera 1979) and its
British counterpart, the Lancaster-Oslo-Bergen corpus (Johansson et al. 1978), as paragons of balanced
corpora. Each attempts to provide a representative and balanced collection of written English in the early
1960s. For example, the compilers of the Brown Corpus made sure that all the genres and subgenres of the
collection of books and periodicals in the Brown University Library and the Providence Athenaeum were
represented. Balance was achieved by choosing the number of text samples to be included in each category.
By way of example, because there were about thirteen times as many books in learned and scientific writing
as in science fiction, 80 texts of the former genre were included, and only 6 of the latter genre.
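The arithmetic behind this kind of balancing can be sketched in a few lines of R. This is only an illustration of proportional allocation: the genre labels, the population counts, and the sample budget below are invented for the example and are not the figures used by the compilers of the Brown Corpus.

```r
# Hypothetical population counts per genre (invented for illustration only;
# these are NOT the actual holdings behind the Brown Corpus).
population <- c(learned = 1300, science_fiction = 100, press = 700)

# Total number of text samples we can afford to include (an assumption).
n_samples <- 100

# Proportional allocation: each genre receives a share of the samples equal
# to its share of the population.
shares <- population / sum(population)
allocation <- round(shares * n_samples)

print(data.frame(population, share = round(shares, 3), allocation))
```

With these made-up counts, the learned genre is thirteen times larger than science fiction in the population, and the allocation preserves roughly that ratio. Note that rounding means the allocated counts will not always add up exactly to the sample budget.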
Because the Brown Corpus and the LOB corpus are snapshot corpora, targeting a well-delimited mode
in a well-delimited context, achieving balance was fairly easy. Compiling a reference corpus with spoken
data is far more difficult for two reasons. First, because obtaining these resources supposes that informants are
willing to record their conversations, they take time to collect and transcribe, and they are more expensive.
Second, you must have an exact idea of what proportion each mode, genre, or subgenre represents in the
target language. If you know that 10% of the speech of Parisian children consists of monologue, your corpus
should consist of about 10% of monologue recordings. But as you may have guessed, corpus linguists have
no way of knowing these proportions. They are, at best, educated estimates.

1.2.1.4 An Ideal

Sampling methods imply a compromise between what is theoretically desirable and what is feasible. The
abovementioned criteria are therefore more of an ideal than an attainable goal (Leech 2007).
Although planned and designed as a representative and balanced corpus, the BNC is far from meeting this
ideal. Natural languages are primarily spoken. Yet, 90% of the BNC consists of written texts. However, as
Gries (2009) points out, a salient written expression may have a bigger impact on speakers’ linguistic systems
than a thousand words of conversation. Furthermore, as Biber (1993) points out, linguistic representativeness
is far more important than the representativeness of the mode.

1.2.1.5 The Materiality of Corpora

The above paragraphs have shown that a lot of thinking goes on before a corpus is compiled. However,
the theoretical status of a corpus should not blind us to the fact that we first apprehend a corpus through
the materiality of its database, or lack thereof. That materiality is often complex. The original material of a
linguistic corpus generally consists of one or several of the following:
• audio recordings;
• audio-video recordings;
• text material.
This material may be stored in a variety of ways. For example, audio components used to appear in the form of
reel-to-reel magnetic tapes, audio tapes, or CDs. Nowadays, they are stored
as digital files on the web. The datasets of the CHILDES database are available in CHAT, a standardized
encoding format. The files, which are time aligned and audio linked, are to be processed with CLAN
(http://childes.psy.cmu.edu/). In general, multimodal components include both the original
audio/video files and, optionally, a written transcription to allow for quantification.
As a corpus linguist working mainly on the English language, I cannot help paying tribute to the first
“large-scale” corpus of English: the Survey of English Usage corpus (also known as the Survey Corpus). It
was initiated by Randolph Quirk in 1959 and took thirty years to complete. The written component consists
of 200 texts of 5000 words each, amounting to one million words of British English produced between 1955
and 1985. The Survey Corpus also has a spoken component in the form of now digitized tape recordings of
monologues and dialogues. The corpus was originally compiled from magnetic data recordings, transcribed
by hand, and typed on thousands of 6-by-4-inch paper slips stored in a filing cabinet. Each lexical item is
annotated for grammatical features and stored likewise. For example, all verb phrases are filed in the verb
phrase section. The spoken component is also transcribed and annotated for prosodic features. It still exists
under the name of the London-Lund Corpus.
The Survey Corpus was not computerized until the early 1980s.⁴ With a constant decrease in the price
of disk storage and an ever-increasing supply of tools to process an ever-increasing amount of electronic data,
all corpus projects are now assumed to be digital from stage one. For this reason, when my students fail to
bring their laptop to their first corpus linguistics class, I suggest they take a trip to University College London
and browse manually through the many thousands of slips filed in the cabinets of the Survey Corpus.
Although the students always admire the work of corpus linguistics pioneers (Randolph Quirk, Sidney
Greenbaum, Geoffrey Leech, Jan Svartvik, David Crystal, etc.), they never forget their laptops afterwards.

1.2.1.6 Does Size Matter?

Most contemporary corpora in the form of digital texts (from either written or transcribed spoken sources)
are very large. The size of a corpus is relative. When it first came out, in the mid 1990s, the BNC was
considered a very big corpus as it consisted of 100 million words. Compared to the Corpus of Contemporary
American English (450 million words), the Bank of English (45 billion words), the ukWaC corpus of English
(2.25 billion words),⁵ or Sketch Engine’s enTenTen12 corpus (13 billion words), the BNC does not seem
that large anymore.
What with the availability of digital material and a rich offer of automatic annotation and markup tools,
we are currently witnessing an arms race in terms of corpus size. This is evidenced by the increasing number
of corpora that are compiled from the web (Baroni et al. 2009). On the one hand, a large corpus guarantees
the presence of rare linguistic forms. This is no trifling matter insofar as the distribution of linguistic units
in language follows a Zipfian distribution: there is a large number of rare units. On the other hand, a very
large corpus loses in terms of representativeness and balance, unless huge means are deployed to compile it
with respect to the abovementioned corpus ideal (which, in most cases, has an unbearable cost). Finally, no
corpus extraction goes without noise (i.e. unwanted data that is hard to filter out in a query). The larger the
corpus, the more the noise.
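To make the point about rare units concrete, here is a minimal base-R sketch that builds a frequency list and counts the forms that occur only once (hapax legomena). The sentence is a made-up toy text, far too small to show a genuine Zipfian curve, so the output only hints at the pattern: a handful of frequent types and many types attested once.

```r
# A made-up toy text (not drawn from any corpus mentioned in this chapter).
txt <- "the cat sat on the mat and the dog sat on the log while a bird sang"

# Tokenize on whitespace and build a frequency list, most frequent first.
tokens <- strsplit(tolower(txt), "\\s+")[[1]]
freq_list <- sort(table(tokens), decreasing = TRUE)
print(freq_list)

# Frequency spectrum: how many types occur once, twice, and so on.
spectrum <- table(as.vector(freq_list))
print(spectrum)

# Hapax legomena are the types that occur exactly once.
hapaxes <- names(freq_list)[as.vector(freq_list) == 1]
cat("tokens:", length(tokens),
    "types:", length(freq_list),
    "hapaxes:", length(hapaxes), "\n")
```

Even in this tiny sample, most types are hapaxes. In a real corpus the same imbalance holds at a much larger scale, which is why a bigger corpus is more likely to contain a given rare form.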
Depending on your case study and the language you investigate, using a small corpus is not a bad thing.
For example, Ghadessy et al. (2001) show that small corpora can go a long way in English language teaching.
Unless you study a macrolanguage (e.g. Mandarin, Spanish, English, or Hindi), it is likely that you will not
be able to find any corpus. The problem is even more acute for endangered or minority languages (e.g.
Breton in France, or Amerindian languages in the Americas) for which you only have a sparse collection of texts
and sometimes occasional recordings at your disposal. Once compiled into a corpus, those scarce resources
can serve as the basis of major findings, such as Boas and Schuchard (2012) in Texas German, or Hollmann
and Siewierska (2007) in Lancashire dialects.
Small size becomes a problem if the unit you are interested in is not well represented. All in all, size
matters, but if it is wisely used, a small corpus is worth more than a big corpus that is used unwisely.

1.2.2 What Linguists Do with the Corpus

Having a corpus at one’s disposal is a necessary but insufficient condition for corpus linguistics. Outside
linguistics, other disciplines in the humanities make use of large-scale collections of texts. What they
generally do with those texts is known as text mining. Corpus linguists may use text-mining techniques at some
point (for instance when they make frequency lists). However, the goal of text-mining techniques is to obtain
knowledge about a text or group of texts (Jockers 2014), whereas the goal of corpus-linguistics techniques
is to better understand the rules governing a language as a whole, or at least some aspect of that language
(e.g. a specific register or dialect).

⁴ Transcriptions of the spoken component of the Survey Corpus were digitized in the form of the London-Lund Corpus.
⁵ Ferraresi (2007).

1.2.2.1 Generalization

You do not suddenly become a corpus linguist by running a query in a digital text database in search of a
savory example.⁶ You investigate a corpus because you believe your findings can be extended to the target
language or variety. This is called generalization.
Generalization implies a leap of faith from what we can infer from a finite set of observable data to the
laws of the target language. Outside corpus linguistics, not everyone agrees. In an interview (Andor 2004,
p. 97), Chomsky states that collecting huge amounts of data in the hope of coming up with generalizations
is unique in science. He makes it sound like it is a pipe dream. To reformulate Chomsky’s claim, a corpus
should not be a basis for linguistic studies if it cannot represent language in its full richness and complexity.
Most corpus linguists rightly counter that they do not aim to explain all of a language in every study (Glynn
2010) and that the limits of their generalizations are the limits of the corpus. Furthermore, even if no corpus
can provide access to the true, unknown law of a language, a corpus can be considered a sample drawn from
this law. As you will discover in the second part of the book, there are ways of bridging the gap between
what we can observe and measure from a corpus, and what we do not know about a language. Ambitious
statistics is needed to achieve this. I am specifically referring to the statistics used in biostatistics, where
scientists commonly bridge the gap between what they can observe from a group of patients and what they
can infer about a disease, contra Chomsky’s intuition.
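As a small foretaste of how a sample can speak about an unknown quantity, the sketch below estimates a proportion and its confidence interval with `binom.test()` from base R. The counts are invented, and the exact binomial test is only one simple option chosen here for illustration; it is not necessarily the method developed later in the book. The interval is meaningful only if the sentences can be treated as a random sample from the target variety, which is precisely what representativeness is about.

```r
# Invented counts for illustration: suppose a construction is attested in
# 30 of 200 randomly sampled sentences of a (hypothetical) corpus.
hits <- 30
n <- 200

# Exact binomial test from base R's stats package; we only use the point
# estimate and the 95% confidence interval it reports.
res <- binom.test(hits, n, conf.level = 0.95)
estimate <- unname(res$estimate)
ci <- res$conf.int

cat("observed proportion:", estimate, "\n")
cat("95% confidence interval:", round(ci[1], 3), "to", round(ci[2], 3), "\n")
```

The observed proportion describes the sample. The interval is a hedged statement about the population from which the sample was drawn, which is the kind of step that generalization requires.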

1.2.2.2 Quantification

Corpus linguistics is quantitative when the study of linguistic phenomena based on corpora is systematic
and exhaustive. Gries (2014, p. 365) argues that corpus linguistics in the usage-based sense of the word
is a distributional science. A distributional science infers knowledge from the distribution, dispersion, and
co-occurrence of data. For this reason, quantitative corpus linguists typically focus on corpus frequencies
by means of frequency lists, concordances, measures of dispersion, and co-occurrence frequencies.
All linguists aim at some form of generalization, but not all of them engage in some form of quantification
to meet this aim. This means that you can make a qualitative and quantitative use of a corpus. Qualitative
corpus analysis may consist in formulating a hypothesis and testing it by looking for potential
counterexamples in the corpus. It may also consist in using the corpus to refine the hypothesis at some stage. The
concordancer is a corpus tool par excellence meant for qualitative inspection. It is used to examine a word
in its co-text regardless of any form of quantification.
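What a concordancer does can be sketched in base R as follows. The function name `kwic()`, its arguments, and the toy sentence are all hypothetical and chosen for this illustration; they do not reproduce any particular concordancing tool. The function returns each occurrence of a node word with a few tokens of co-text on either side.

```r
# Toy text invented for illustration.
txt <- "a corpus is a sample and a large corpus is not always a better corpus"
tokens <- strsplit(txt, " ", fixed = TRUE)[[1]]

# kwic(): hypothetical helper returning every match of `node` together with
# up to `span` tokens of co-text to its left and to its right.
kwic <- function(tokens, node, span = 3) {
  n <- length(tokens)
  hits <- which(tokens == node)
  left <- vapply(hits, function(i) {
    idx <- if (i > 1) max(1, i - span):(i - 1) else integer(0)
    paste(tokens[idx], collapse = " ")
  }, character(1))
  right <- vapply(hits, function(i) {
    idx <- if (i < n) (i + 1):min(n, i + span) else integer(0)
    paste(tokens[idx], collapse = " ")
  }, character(1))
  data.frame(left = left, node = rep(node, length(hits)), right = right,
             stringsAsFactors = FALSE)
}

conc <- kwic(tokens, "corpus")
print(conc)
```

Each row is one occurrence of the node word in its co-text. Reading down the rows is a qualitative exercise; counting what appears in the left and right columns is the first step toward quantification.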
An oft-heard misconception about corpus linguistics is that the quantitative methods involved are
suspiciously objective and miss some information that only the linguist’s expert subjectivity can bring.
Nothing is further from the truth. First, you should never start exploring a corpus, let alone quantify your
findings, if you do not have a research question, a research hypothesis, and a clear idea as to how the corpus
will help you answer the question and test the hypothesis. Second, you should never consider the quantified
corpus findings as the final stage in your research project. Uninterpreted results are useless because they do
not speak for themselves: “quantification in empirical research is not about quantification, but about data
management and hypothesis testing” (Geeraerts 2010).

⁶ The practice is known as “chasing butterflies”.

1.2.3 How Central the Corpus Is to a Linguist’s Work

Some linguists believe that using a corpus is but one of several steps in a research project (although a
significant one). For example, a linguist might generate a working hypothesis regarding a linguistic form
and then decide to test that hypothesis by observing the behavior of that linguistic form in a corpus and by
generalizing over the findings. Other linguists adopt a more radical stance. They claim that the corpus is the
only possible approximation we have of a speaker’s linguistic competence. It is therefore a linguist’s job to
investigate corpora.

1.2.3.1 The Corpus as a Step in the Empirical Cycle

To most corpus linguists, corpora and quantitative methods are only a moment in what the cognitive
semanticist Dirk Geeraerts calls “the empirical cycle” (Geeraerts 2010).7 Empirical data are of two kinds:
observational and experimental. Observational data are collected as they exist, whereas experimental data
have to be elicited. Corpus data are observational.
The flowchart in Fig. 1.1 is a representation of D. Geeraerts’s empirical cycle, adapted to quantitative
corpus linguistics. The corpus, which appears in green, is but a moment, albeit a central one, of the empirical
cycle. The left part of the chart (where the steps are connected by means of dashed arrows) is iterative. The
cycle involves several rounds of hypothesis formulating/updating, operationalizing, corpus data gathering,
and hypothesis testing. If the empirical testing is satisfactory, the findings can inform back theory (see the
right part of the chart), which in turn helps formulate other hypotheses, and so on and so forth.
Those who reject corpus linguistics on the basis that it is too objective should notice that nearly all the
blocks of the flowchart involve subjective appreciation from the linguist. This is valid also for the block “test
hypotheses” with quantification and statistics insofar as you cannot quantify or run statistics blindly. The
choice of quantitative methods depends largely on what the corpus gives you.
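
As an illustration of how the data constrain the choice of a test, here is a hedged sketch in R. The counts are invented, and the threshold used (an expected frequency below 5) is a common rule of thumb assumed for the example, not a rule stated in this chapter.

```r
# Invented counts, for illustration only: two forms across two registers.
counts <- matrix(c(12, 3, 5, 9), nrow = 2,
                 dimnames = list(form = c("A", "B"),
                                 register = c("spoken", "written")))

# Expected frequencies under independence: (row total * column total) / N.
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Common rule of thumb (an assumption here, not a rule from the text):
# prefer Fisher's exact test when any expected frequency is below 5.
result <- if (any(expected < 5)) fisher.test(counts) else chisq.test(counts)

result$method   # which test was actually run
result$p.value  # to be interpreted, not reported blindly
```

With sparser counts the same code would switch to Fisher's exact test, which is one small way in which what the corpus gives you shapes the method.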

1.2.3.2 The Corpus as the Alpha and Omega of Linguistics

Some linguists adopt a much stronger stance on the place of corpora in their work. They consider that the
grammar of speakers takes the form of a mental corpus.
The idea of grammar being represented as a mental corpus originates from outside corpus linguistics
per se. Cognitive Grammar, an influential usage-based theory of language, defines grammar as a structured
inventory of symbolic linguistic units (Langacker 1987, p. 57). In this theory, grammar is the psychological
representation of linguistic knowledge. A unit is symbolic because its formal component and its semantic

7 In case it is not clear enough, this paper is a must-read if you plan to do corpus-based semantics.

[Flowchart. Blocks: previous works; theory; ask a theory-informed question; formulate hypotheses;
operationalize hypotheses; gather corpus data; quantify your observations; use statistics; test hypotheses;
interpret the results; are the results significant? (yes/no); update hypotheses; inform back theory.]

Fig. 1.1: The empirical cycle (adapted to quantitative corpus linguistics)

component are linked by conventions. This is reminiscent of Ferdinand de Saussure’s “arbitrariness of the
sign”. Unlike generative linguistics, which separates form and meaning, Cognitive Grammar advocates a
uniform representation that combines both. Central in Cognitive Grammar, and of major interest to corpus
linguistics because of its roots in frequency, is the concept of entrenchment: “[w]ith repeated use, a novel
structure becomes progressively entrenched, to the point of becoming a unit [. . . ]” (Langacker 1987, p. 59).
Each time a unit (a phoneme, a morpheme, a morphosyntactic pattern, etc.) is used, it activates a node or
a pattern of nodes in the mind. The more the unit occurs, the more likely it is to be stored independently.
In other words, linguistic knowledge is a repository of units memorized from experiences of language use.
Each experience leaves a trace in the speaker’s memory and interacts with previously stored units. This
dynamic repository is what Taylor (2012) calls a mental corpus.
In line with the above, some linguists consider that the closest approximation of a mental corpus is a
linguist’s corpus. Tognini-Bonelli (2001) distinguishes between corpus-based and corpus-driven linguistics.
Corpus-based linguistics is the kind exemplified in Sect. 1.2.3.1. Corpus-driven linguistics is the radical
extension of corpus-based linguistics: the corpus is not part of a method but our sole access to language
competence. According to McEnery and Hardie (2012), this idea has its roots in the works of scholars
inspired by J.R. Firth, such as John Sinclair, Susan Hunston, Bill Louw, Michael Stubbs, or Wolfgang
Teubert. Studies in this tradition typically focus on collocations and discourse effects. Let me zoom in on
collocation, as it is a concept we shall come back to later.
According to Firth (1957), central aspects of the meaning of a linguistic unit are lost if we examine the
unit independently from its context.8 For example, a seemingly vague word such as something acquires
a negative meaning once it precedes the verb happen, as in: [. . . ] a lady stopped me and she said has
something happened to your mother? And I said, yes, she died [. . . ] (BNC-FLC). A unit and its context
are not mere juxtapositions. They form a consistent network whose nodes are bound by mutual expectations
(Firth 1957, p. 181). Firth was not concerned with corpora but with the psychological relevance of retrieving
meaning from context. Firth’s intuitions were translated into a corpus-driven methodology through the work
of “neo-Firthians”, such as Sinclair (1966, 1987, 1991), Sinclair and Carter (2004), Teubert (2005), or Stubbs
(2001).
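
The intuition that a word is known by the company it keeps can be sketched in a few lines of R. The token vector below is an assumption made for the example: its first six words echo the BNC example quoted above, and the rest are invented. The code counts which words occur immediately to the right of a node word. This is a raw co-occurrence count, not an association measure.

```r
# Toy token vector: the first six words echo the BNC example quoted above;
# the remaining words are invented for illustration.
tokens <- c("has", "something", "happened", "to", "your", "mother",
            "something", "happened", "yesterday",
            "she", "said", "something", "nice")

node <- "something"

# Positions of the node word, ignoring a node in final position
# (it has no right-hand neighbour).
pos <- which(tokens == node)
pos <- pos[pos < length(tokens)]

# Frequency table of the words found immediately after the node.
right_neighbours <- table(tokens[pos + 1])
sort(right_neighbours, decreasing = TRUE)
```

In this toy sample, happened follows something twice and nice once. A real collocation study would compare such counts with what chance alone would predict.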
As always, even though radical ideas are often theoretically challenging, the middle ground is preferable
in practice. This is why most quantitative corpus linguists consider that a corpus is a central part of research,
but not the alpha and omega of linguistics. You can still do corpus-based linguistics and make use of popular
corpus-driven methods such as concordances or association measures.

References

Andor, József. 2004. The Master and His Performance: An Interview with Noam Chomsky. Intercultural
Pragmatics 1 (1): 93–111. doi:10.1515/iprg.2004.009.
Baroni, Marco, et al. 2009. The WaCky Wide Web: A Collection of Very Large Linguistically Processed
Web-Crawled Corpora. Language Resources and Evaluation 43 (3): 209–226.
Biber, Douglas. 1993. Representativeness in Corpus Design. Literary and Linguistic Computing 8 (4):
241–257.
Biber, Douglas, Susan Conrad, and Randi Reppen. 1998. Corpus Linguistics: Investigating Language Struc-
ture and Use. Cambridge: Cambridge University Press.
Boas, Hans C., and Sarah Schuchard. 2012. A Corpus-Based Analysis of Preterite Usage in Texas German.
In Proceedings of the 34th Annual Meeting of the Berkeley Linguistics Society.
Bybee, Joan. 2010. Language, Usage, and Cognition. Cambridge: Cambridge University Press.
Bybee, Joan L. 2006. From Usage to Grammar: The Mind’s Response to Repetition. Language 82 (4):
711–733.
Chomsky, Noam. 1957. Syntactic structures. The Hague: Mouton.
Chomsky, Noam. 1962. Aspects of the Theory of Syntax. Cambridge, MA: MIT Press.
Chomsky, Noam. 1995. The Minimalist Program. Cambridge, MA: MIT Press.
Crowdy, Steve. 1993. Spoken Corpus Design. Literary and Linguistic Computing 8 (4): 259–265.
Dik, Simon C. 1978. Functional Grammar. Amsterdam, Oxford: North-Holland Publishing Co.
Dik, Simon C. 1997. The Theory of Functional Grammar, ed. Kees Hengeveld, 2nd ed. Berlin, New York:
Mouton de Gruyter.
Ferraresi, Adriano. 2007. Building a Very Large Corpus of English Obtained by Web Crawling: ukWaC.
Master’s Thesis. University of Bologna.

8 “You shall know a word by the company it keeps” (Firth 1957, p. 179).
Firth, John. 1957. A Synopsis of Linguistic Theory, 1930–1955. In Selected Papers of J.R. Firth 1952–1959,
ed. Frank Palmer, 168–205. London: Longman.
Francis, W. Nelson, and Henry Kučera. 1979. Manual of Information to Accompany a Standard Corpus of
Present-Day Edited American English, for Use with Digital Computers. Department of Linguistics. Brown
University. http://www.hit.uib.no/icame/brown/bcm.html (visited on 03/10/2015).
Geeraerts, Dirk. 2010. The Doctor and the Semantician. In Quantitative Methods in Cognitive Semantics:
Corpus-Driven Approaches, ed. Dylan Glynn and Kerstin Fischer, 63–78. Berlin, Boston: Mouton De
Gruyter.
Ghadessy, Mohsen, Alex Henry, and Robert L. Roseberry. 2001. Small Corpus Studies and ELT. Amsterdam:
John Benjamins.
Givón, Talmy. 1995. Functionalism and Grammar. Amsterdam: John Benjamins.
Glynn, Dylan. 2010. Corpus-Driven Cognitive Semantics: Introduction to the Field. In Quantitative Methods
in Cognitive Semantics: Corpus-driven Approaches, 1–42. Berlin: Mouton de Gruyter.
Goldberg, Adele E. 1995. Constructions: A Construction Grammar Approach to Argument Structure.
Chicago: University of Chicago Press.
Goldberg, Adele E. 2003. Constructions: A New Theoretical Approach to Language. Trends in Cognitive
Sciences 7 (5): 219–224. http://dx.doi.org/10.1016/S1364-6613(03)00080-9.
Greenberg, Joseph H. 1963. Some Universals of Grammar with Particular Reference to the Order of Mean-
ingful Elements. In Universals of Human Language, ed. Joseph H. Greenberg, 73–113. Cambridge: MIT
Press.
Gries, Stefan Thomas. 2009. What is Corpus Linguistics? Language and Linguistics Compass 3:
1225–1241. doi:10.1111/j.1749-818X.2009.00149.x.
Gries, Stefan Thomas. 2014. Frequency Tables: Tests, Effect Sizes, and Explorations. In Corpus Methods
for Semantics: Quantitative Studies in Polysemy and Synonymy, ed. Dylan Glynn and Justyna Robinson,
365–389. Amsterdam: John Benjamins.
Hollmann, Willem B., and Anna Siewierska. 2007. A Construction Grammar Account of Possessive Con-
structions in Lancashire Dialect: Some Advantages and Challenges. English Language and Linguistics 11
(2): 407–424. doi:10.1017/S1360674307002304.
Jockers, Matthew. 2014. Text Analysis with R for Students of Literature. New York: Springer.
Johansson, Stig, Geoffrey Leech, and Helen Goodluck. 1978. Manual of Information to Accompany the
Lancaster-Oslo/Bergen Corpus of British English, for Use with Digital Computers. Department of En-
glish. University of Oslo. http://clu.uni.no/icame/manuals/LOB/INDEX.HTM (visited on
03/10/2015).
Kennedy, Graeme. 1998. An Introduction to Corpus Linguistics. Harlow: Longman.
Labov, William. 1975. Empirical Foundations of Linguistic Theory. In The Scope of American Linguis-
tics: Papers of the First Golden Anniversary Symposium of the Linguistic Society of America, ed. Robert
Austerlitz, 77–133. Lisse: Peter de Ridder.
Langacker, Ronald W. 1987. Foundations of Cognitive Grammar: Theoretical Prerequisites, Vol. 1. Stan-
ford: Stanford University Press.
Langacker, Ronald W. 1988. A Usage-Based Model. In Topics in Cognitive Linguistics, ed. Brygida Rudzka-
Ostyn, 127–161. Amsterdam, Philadelphia: John Benjamins.
Langacker, Ronald W. 1991. Foundations of Cognitive Grammar: Descriptive Application, Vol. 2. Stanford:
Stanford University Press.
Leech, Geoffrey. 2007. New Resources, or Just Better Old Ones? The Holy Grail of Representativeness. In
Corpus Linguistics and the Web, ed. Marianne Hundt, Nadja Nesselhauf, and Carolin Biewer, 133–149.
Amsterdam: Rodopi.
McEnery, Tony, and Andrew Hardie. 2012. Corpus Linguistics: Method, Theory and Practice. Cambridge
Textbooks in Linguistics. Cambridge, New York: Cambridge University Press.
Meyer, Charles F. 2002. English Corpus Linguistics: An Introduction. Cambridge: Cambridge University
Press.
Sampson, Geoffrey. 2001. Empirical Linguistics. London: Continuum.
Sinclair, John. 1966. Beginning the Study of Lexis. In In Memory of J.R. Firth, ed. C.E. Bazell, et al.,
410–431. London: Longman.
Sinclair, John. 1987. Collocation: A Progress Report. In Language Topics: Essays in Honour of Michael
Halliday, ed. R. Steele, and T. Threadgold, Vol. 2, 319–331. Amsterdam: John Benjamins.
Sinclair, John. 1991. Corpus, Concordance, Collocation. Oxford: Oxford University Press.
Sinclair, John, and Ronald Carter. 2004. Trust the Text: Language, Corpus and Discourse. London:
Routledge.
Stubbs, Michael. 2001. Words and Phrases: Corpus Studies of Lexical Semantics. Oxford: Wiley–Blackwell.
Taylor, John R. 2012. The Mental Corpus: How Language is Represented in the Mind. Oxford: Oxford
University Press.
Teubert, Wolfgang. 2005. My Version of Corpus Linguistics. International Journal of Corpus Linguistics 10
(1): 1–13. doi:10.1075/ijcl.10.1.01teu.
The British National Corpus. 2007. BNC XML Edition. Version 3. http://www.natcorp.ox.ac.uk/.
Distributed by Oxford University Computing Services on behalf of the BNC Consortium.
Tognini-Bonelli, Elena. 2001. Corpus Linguistics at Work. Amsterdam: John Benjamins.
Wasow, Thomas, and Jennifer Arnold. 2005. Intuitions in Linguistic Argumentation. Lingua 115 (11):
1481–1496. http://dx.doi.org/10.1016/j.lingua.2004.07.001.
Part I
Methods in Corpus Linguistics

If you are an absolute beginner, there is a chance that coding will be frustrating at first. The good news is that
it will not last. An excellent reason for being optimistic is that you will never code alone. R is maintained,
developed, and supported by a big community of friendly users. Before you know it, you will be part of
that community and you will be writing your own R scripts. You will find below a list of helpful online
resources.

Getting help from the online community


• Stack Overflow: a question and answer site on a variety of topics, including R
http://stackoverflow.com/questions/tagged/r
• R-bloggers: a compilation of bloggers’ contributions on R
http://www.r-bloggers.com/
• inside-R: a community site dedicated to R
http://www.inside-r.org/blogs
• GitHub: a collaborative code repository (not limited to R)
https://github.com/
type #R in GitHub’s search engine
• The R mailing list
https://stat.ethz.ch/mailman/listinfo/r-help

The most frequently asked question from my students is “how long does it take to master R?”. There is
no straight answer because several variables are involved.
First of all, you do not need to master all of R’s features to start using it in corpus linguistics. As you will
realize in the next chapter, generating a frequency list does not require an engineer’s skills.
Secondly, even though your programming experience does make a difference in how fast you learn the
R language, it does not teach you how to select the appropriate tool for the appropriate task. This means
that your learning of the R language is a direct function of what you need to do with it. The best R learners
among my students also happen to be the best linguists.
Thirdly, over and above your programming experience, your level of motivation is key. Every time I teach
a course on corpus linguistics and statistics with R, it is only a matter of days until I receive my first emails
from students asking me for advice regarding some advanced task that we have not covered yet and whose
code they are eager to crack.
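
To give a sense of how little code a frequency list requires, here is a minimal sketch in base R. It assumes a single invented sentence as input and a deliberately naive tokenization (lowercasing and splitting on whitespace). A real corpus would need more careful preparation.

```r
# An invented sentence standing in for a corpus file.
text <- "the cat sat on the mat and the dog sat too"

# Naive tokenization: lowercase, then split on whitespace.
words <- strsplit(tolower(text), "\\s+")[[1]]

# table() counts each word type; sort() puts the most frequent first.
freq_list <- sort(table(words), decreasing = TRUE)

head(freq_list, 3)
```

Here the comes first with three occurrences, followed by sat with two.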
You can take advantage of this book in two ways. The most basic use is to treat it like a recipe book
because you do not want to invest time in learning the inner logic of a programming language. In this case,
you just need to adapt the scripts to suit your needs. A more advanced (and far more beneficial) use is to
take the book as an opportunity to learn corpus techniques along with a new programming language, killing
two birds with one stone. In my experience, the advantage of the second option is that learning R will help
you become a better quantitative corpus linguist. In any case, just like learning how to play the piano, it is
only through regular practice that you will get familiar with R commands, R objects, R scripts, functions,
and even learn to recognize R oddities (Burns 2011).
Chapter 2
R Fundamentals

Abstract This chapter is designed to get linguists familiar with the R environment. First, it explains how
to download and install R and R packages. It moves on to teach how to enter simple commands, use ready-
made functions, and write user-defined functions. Finally, it introduces basic R objects: the vector, the list,
the matrix, and the data frame. Although meant for R beginners, this chapter can be read as a refresher
course by those readers who have some experience in R.

2.1 Introduction

If you are new to R, this chapter is the place to start. If you have some experience with R or if you are too
impatient, feel free to skip these pages. You can still come back to them if you get stuck at some point and
hit a wall.
There are many other books that do a great job at introducing R in depth. This chapter is not meant to
compete with them. Nevertheless, because I firmly believe that getting familiar with R basics is a prerequisite
to understanding what we can truly do as corpus linguists, I have designed this chapter to provide an introduction
to R that both covers the bare necessities and anticipates the kind of coding and data structures involved in
corpus linguistics.

2.2 Downloads and Installs

R is an interpreted language: the user types and enters the commands from a script and the system interprets
the commands. Most users run R from a window system: the commands are entered via the R console.

© Springer International Publishing AG 2017
G. Desagulier, Corpus Linguistics and Statistics with R, Quantitative Methods in the Humanities
and Social Sciences, DOI 10.1007/978-3-319-64572-8_2

2.2.1 Downloading and Installing R

First, visit the Comprehensive R Archive Network (CRAN, https://cran.r-project.org/) and
download the most recent release of R.1
• If you are a Windows user, click on “Download R for Windows”, then on “base”. Download the setup
file (.exe) by clicking on “Download R 3.X.X for Windows” (where 3.X.X stands for the identification
number of the most recent release) and run it. The installation program gives you step by step instruc-
tions to install R in the default directory of your system. R should appear in the usual Programs folder.
Double click the icon to launch the R graphical user interface (GUI).
• If, like me, you are a Mac OS or macOS user, click on “Download R for (Mac) OS X” and select the
release that corresponds to your OS version (i.e. Snow Leopard, Lion, Mountain Lion, Mavericks, or
Yosemite). The file takes the form of an Apple software package whose name ends with .pkg. The
installer guides you until R is installed on your system. You should find it in the usual Applications
folder. Double click the icon to launch the R GUI.
• Finally, if you are a Linux user, click on “Download R for Linux”, open the parent directory that corre-
sponds to your distribution (Debian, Red Hat, SUSE, or Ubuntu), and follow the instructions until you
have the base installation of R on your system. To start R, type “R” at the command line.
You now have the base installation of R on your system. You may want to visit the CRAN website once
in a while to make sure you have the most up-to-date release, which includes bug fixes. However, before you
upgrade, I advise you to make sure that all your packages (see Sect. 2.5 below) are compatible with that
version. Most of the time, packages are tested on beta versions of the most recent R release, but it pretty
much depends on how much time the authors have until the official version is released. Note that you can
always easily come back to older releases if incompatibility issues arise.
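
Before upgrading, you can ask R which version is running and under which R version each installed package was built. The sketch below uses only base R functions. The version number "3.3.1" is the release mentioned in the footnote and serves here as an arbitrary point of comparison.

```r
# Version of R currently running, as an object that can be compared.
current <- getRversion()
R.version.string                 # the same information as readable text

# Is this installation at least as recent as R 3.3.1?
is_recent <- current >= "3.3.1"
is_recent

# Installed packages, with their version and the R version they were built under.
pkgs <- installed.packages()[, c("Package", "Version", "Built"), drop = FALSE]
head(pkgs)
```

A package whose Built value is much older than your R version is a reasonable candidate to check first, although this is only a heuristic and not a compatibility guarantee.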

2.2.2 Downloading and Installing RStudio

RStudio is an Integrated Development Environment (IDE) for R. In other words, R must be installed first for
RStudio to run. RStudio has many added values, the main one in my view being a nice windowing display
that allows you to visualize the R console, scripts, plots, data, packages, files, etc. in the same environment.
Although I appreciate RStudio, as long as it remains free, open source, and faithful to the initial spirit of
those who created R, my own preference goes to the basic console version of R. I do not have any objective
reason for this, except that I took my first steps in R using the R console. My workflow is now based on the
console. In the remainder of the book, all instructions and illustrations will be relative to the console version
of R and entirely compatible with RStudio.
To download and install RStudio Desktop (Open Source Edition), visit https://www.rstudio.com/
and follow the instructions.

1 At the time of writing, the latest R release is 3.3.1, “Bug in Your Hair”.

2.2.3 Downloading the Book Materials

Now that R and/or RStudio are installed, it is time to download the book materials. You can download
a zipped archive (CLSR.zip for “Corpus Linguistics and Statistics with R”) from
http://extras.springer.com/2018/978-3-319-64570-4. The archive contains:
• the code for each chapter;
• input files.
Unzip the archive and place the resulting CLSR folder at the root of your hard drive, i.e. C: on Windows
and Macintosh HD on a Mac. One obvious reason for this location is that you will not have to type long
pathnames when you want to access the files in the folder.

2.3 Setting the Working Directory

Before proceeding further, you must set the working directory. The working directory is a folder which you
want R to read data from and store output into. Because you are probably using this book as a companion
in your first foray into quantitative linguistics with R, I recommend that you set the CLSR folder as your
working directory.
You are now about to enter your first R command. Commands are typed directly in the R console, right
after the prompt >. To know your default working directory, type getwd() and press ENTER.
> getwd()

Set the working directory by entering the path to the desired directory/folder:
> setwd("C:/CLSR") # Windows
> setwd("/CLSR") # Mac

Finally, make sure the working directory has been set correctly by typing getwd() again.2
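
The three steps above (check, set, check again) can also be written defensively. The following is a sketch under one assumption: it uses a temporary folder named CLSR as a stand-in, so that it runs on any system. In practice you would replace target with the path to your own CLSR folder.

```r
# Remember the current working directory so that it can be restored.
old_wd <- getwd()

# Stand-in for the CLSR folder: a folder of that name in R's temporary
# directory (an assumption made so that the example runs anywhere).
target <- file.path(tempdir(), "CLSR")
if (!dir.exists(target)) dir.create(target)

# Set the working directory, then check that the change took effect.
setwd(target)
current <- basename(getwd())
current

# Go back to where we started.
setwd(old_wd)
```

Checking that the folder exists before calling setwd() avoids the error R raises when the path is mistyped or the folder is missing.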

2.4 R Scripts

Simple operations can be typed and entered directly into the R console. However, corpus linguistics gener-
ally involves a series of distinct operations whose retyping is tedious. To save you the time and trouble of
retyping the same lines of code and to separate commands from results, R users type commands in a script
and save the script in a file with a special extension (.r or .R). Thanks to this extension, your system knows
that the file must be opened in R.
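
As a minimal sketch of what a script is and how it is run, the example below writes three lines of R code to a file and executes them with source(). The file name my_first_script.R and its location in R's temporary directory are assumptions made for the illustration. Normally you would type the script in an editor and save it yourself.

```r
# Hypothetical script file, created in R's temporary directory.
script_path <- file.path(tempdir(), "my_first_script.R")

# The content of the script: a comment and two commands.
writeLines(c("# my_first_script.R",
             "x <- c(2, 4, 6)",
             "mean_x <- mean(x)"),
           script_path)

# source() runs every command in the file, in order.
source(script_path)

# Objects created by the script are now available in the session.
mean_x
```

The commands live in the file and the results appear in the console, which is the separation of commands from results described above.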
To create a script file, there are several options:
• via the drop-down menu: File > New Document;
• via the R GUI: click on the blank page;

2 Note that if you have a PC running on Windows, you might be denied access to the C: drive. The default behavior of that
drive can be overridden.
Another random document with
no related content on Scribd:
7
Getting to the chateau was difficult the next Saturday, although
Madame Perceval had been right and the snow had stopped and
temporarily dashed the skiers' hopes. But enough snow remained on
the ground so that Flip put on her spiked ski-boots to help her climb
the mountain. Up above her the mountain had a striped, zebra-like
look, long streaks of snow alternating with rock or the darker lines of
the evergreens. The air was cold and clear and sent the color flying
to her cheeks.
Paul greeted her with a relieved shout, crying, "Are you all better,
Flip?"
"Oh yes, I feel fine now."
"I was worried about you. I was afraid you might have caught more
cold from coming last Sunday. You shouldn't have, you know."
"I had to," Flip said. "I promised."
"I knew something you couldn't help had kept you. Of course I was a
little afraid you'd been caught and they were keeping you from
coming. Did you have any trouble getting here today? What will you
do when there's a real snow, Flip? You'll never be able to make it."
"I'll make it," Flip assured him. "Where's Ariel?"
"He's home with my father. Flip, I—I've done something that may
make you angry."
"What?"
"Well, I got to thinking. It's so terribly cold in the chateau; I'm sure
that's why you caught cold, and I didn't think we should go back
there in the damp today so I told my father about you. He won't give
us away, Flip, I made him promise."
"Are you sure?" Flip asked anxiously.
"Quite sure. My father would never break his word. Anyhow he's a
philosopher and things like girls' schools and rules and regulations
and things don't seem as important to him as they do to other
people. He told me to bring you home with me and he said he'd fix
some real hot chocolate for us. So come along."
Flip followed Paul over the snow, past the chateau, and down an
overgrown driveway. Grass and weeds and bits of stubble poked up
through the snow and it did not look like much of a snowfall here
though the drifts had seemed formidable enough on her way up the
mountain from school.
A tall, stooped man, whom Flip recognized as the one she had seen
Paul with in the chalet on the Col de Jaman, met them at the door to
the lodge. Ariel came bounding out to welcome them noisily.
"My father," Paul announced formally. "Monsieur Georges Laurens.
Papa, my friend, Miss Philippa Hunter."
Georges Laurens bowed. "I am happy indeed to meet you, Miss
Hunter. Come in by the fire and get warm." He led them into a room,
comfortable from the blazing fire in the stone fireplace, and gently
pushed Flip into an easy chair. She looked about her. Two beautiful
brocades were hung on the walls and there were what seemed like
hundreds of books in improvised bookshelves made of packing
cases. Two or three lamps were already lit against the early
darkness which had settled about the mountain side by this time of
the afternoon, and Flip saw a copper saucepan filled with hot
chocolate sitting on the hearth.
"Flip's afraid you'll let the cat out of the bag, papa," Paul said.
Georges Laurens took a long spoon, stirred the chocolate, and
poured it out. He handed a cup to Flip and pushed Ariel away from
the saucepan. "Watch out, you'll burn your nose again." Then he
turned to Flip. "Why should I let the cat out of the bag? You aren't
doing anyone any harm and you're giving a great deal of pleasure to
my lonely Paul. In fact, I like so much the idea of Paul's having your
companionship that my only concern is how to help you continue
your visits. As soon as we have a heavy snow you won't be able to
climb up the mountains through the woods to us, and in any event
someone would be sure to find you out sooner or later and you
would be forbidden to come if nothing else. These are facts we have
to face, isn't that so?"
"Yes, that's so," Flip said.
"She has to come," Paul said very firmly.
Georges Laurens took off his heavy steel-rimmed spectacles and
wiped them on his handkerchief. Then he took the tongs and placed
another log on the fire. "My suggestion is this: Why don't I go to the
headmistress of this school and get permission for Miss Flip to come
to tea with us every Saturday or Sunday afternoon. That would be
allowed, wouldn't it?"
"I don't know," Flip said. "Esmée Bodet's parents are spending a
month in Montreux and she has dinner with them every Sunday. But
Paul's a boy and we're not allowed to have dates until we're
Seniors."
"I think if I were very charming," Georges Laurens refilled her cup
with hot chocolate from the copper saucepan, "I could manage your
headmistress. What is her name?"
"Mlle. Dragonet," Flip told him. "We call her The Dragon," she said,
then added, remembering the visit in the infirmary, "but she's really
quite human."
Georges Laurens laughed. "Well, I shall be St. George, then, and
conquer the dragon. I will brave her in her den this very afternoon."
"And now I suggest that you get back to your school and tomorrow
we will have a proper visit, and I will come for you and bring you
over." He held out his hand. "I promise."
8
It never occurred to Flip that on this last forbidden trip to the
chateau she might be caught. Luck had been her friendly companion
in the venture and now that the visits to Paul were about to be
approved by authority, surely fortune would not forsake her. But, just
as she came to the clearing where the railroad tracks ran through the
woods, she saw two figures in warm coats and snow boots and
recognized Madame Perceval and Signorina del Rossi. She darted
behind a tree, but they had evidently caught a glimpse of her blue
uniform coat, for Signorina put a gloved hand on Madame Perceval's
arm and said something in a low voice, and Madame Perceval called
out sharply,
"Who is it?"
Flip thought of making a wild dash for safety, but she knew it would
be useless. They were between her and the school and they would
be bound to recognize her if she tried to run past them. So she
stepped out from behind the tree and confronted them just as a train
came around the bend. In a moment the train was between them;
she was not sure whether or not they had had an opportunity to
recognize her in the misty dark; the school uniforms were all identical
and there were dozens of girls with short fair hair. Now was her
chance to run and hide. They would never find her in the dark of the
woods and the train would give her a good chance to get a head
start. But somehow, even if this meant that she would never be given
permission to see Paul, she could not run like a coward from
Madame Perceval, so she stood very quietly, cold with fear, until the
train had passed. Then she crossed the tracks to them.
"Thank you for waiting, Philippa," Madame Perceval said.
She stood, numbly staring at the art teacher, her fingers twisting
unhappily inside her mittens.
"Did you know you were out of bounds, Philippa?" Madame Perceval
asked her.
She shook her head. "I didn't remember where the bounds were."
Then she added, "but I was pretty sure I was out of them."
Signorina stood looking at her with the serene half-smile that seldom
left her face even when she had to cope with the dullest and most
annoying girls in her Italian classes. "Where were you going, little
one?"
"Back to school."
"Where from?"
"I was—walking."
"Was it necessary to go out of bounds on your walk?" Madame
Perceval asked coldly. "Mlle. Dragonet is very severe with girls who
cross the railroad tracks."
Flip remembered the walk on which she had first met Ariel, and how,
somehow, it had been necessary to go up, up, the mountain. "I
wanted to climb."
"Were you alone?" Madame Perceval looked at her piercingly but the
dark hid the girl's expression. When she hesitated, Madame
pursued, "Did you meet anyone?"
"Yes," Flip answered so low that she could scarcely be heard.
"You'd better come back to the school with me," Madame Perceval
said. She turned to Signorina. "Go along, Signorina. Tell them I'll
come when I can."
In silence Flip followed Madame down the mountain. When she
slipped on a piece of ice and her long legs went flying over her head,
Madame helped her to pick herself up and brush off the snow, but
she said nothing. They left the trees and crossed the lawn, covered
with patches of snow, and went into the big Hall. Madame Perceval
led the way upstairs, and Flip followed her, on up the five flights and
down the hall to Madame's own rooms. Madame switched on the
lights and when she spoke her voice was suddenly easy and
pleasant.
"Sit down, Philippa." Flip's spindly legs seemed to collapse under her
like a puppy's as she sat on the stool in front of the fire. "Now,"
Madame went on. "Can you tell me about it?"
Flip shook her head and stared miserably up at Madame, "No,
Madame."
"Who did you go to meet?"
"I'd rather not say. Please."
"Was it anybody from school?"
"No, Madame."
"Did anybody at school have anything to do with it?"
"No, Madame. There wasn't anybody else but me."
"And you can't tell me who it was you went to meet?"
"No. I'm sorry."
"Philippa," Madame said slowly. "I know you've been trying hard and
that the going has been rough for you. I understand your need for
interests outside the school. But the rules we have here are all for a
definite purpose and they were not made to be lightly broken."
"I wasn't breaking them lightly, Madame."
"Once a girl ran away and was killed crossing the railroad tracks.
They are dangerous, especially after dark. You see they are placed
out of bounds for a very good reason. And if there's anybody you
want to see outside school it's not difficult to get permission. If you
were one of the senior girls I might think you were slipping away to
meet one of the boys from the school up the mountain. But I know
that's not the case. I don't like having to give penalties and if you'll
tell me about it I promise you I'll be as lenient as I can."
But Flip's thoughts were rushing around in confusion, and she
thought, if I tell now they'll never give me permission to see Paul.
So she just shook her head while she continued to stare helplessly
at the art teacher.
Madame started to speak again, but just then the telephone rang
and she went over to it. "Yes?... Yes, Signorina!" She listened for a
moment, then burst out laughing and continued the conversation in
Italian. Flip could tell that she was pleased and rather excited about
something. They talked for a long time and Flip could tell that
Madame was asking Signorina a great many questions. When she
hung up she turned to Flip, and her face was half smiling, half
serious. "Philippa," she said, "I know I can trust you."
"Yes, Madame."
"And I want you to prove yourself worthy of my trust. Will you?"
"I'll try, Madame."
"So it's Paul you've been running off to meet," Madame Perceval
said with a smile.
Flip jerked erect on her stool. "How did you know!"
Now Madame laughed, a wonderful, friendly laugh that took Flip and
made her part of a secret they were to share together. "Paul's father,
Georges Laurens, is my brother-in-law."
Flip's jaw dropped open. "But Madame!" she sputtered. "But
Madame!"
Madame laughed and laughed. Finally she said, "I think you and I
had better have another little tea party," and she reached for the
telephone. "Fräulein Hauser," she said, "I am excusing Philippa
Hunter from tea." Then she went into her little kitchen and put the
kettle on. When she came back she said, "Signorina and I were on
our way over to borrow a book from Georges and have a visit with
him and Paul when we bumped into you. Now tell me how you found
Paul."
Flip told her about the way Ariel had come jumping out of the
undergrowth at her and how he led her to the chateau and to Paul.
"I see." Madame nodded. "Now tell me what Paul has told you about
himself."
"Why—nothing much," Flip said. "I mean, I know his mother's singing
in Italy and Monsieur Laurens is writing a book and Paul is going to
be a doctor...."
Madame Perceval nodded again. "I see," she repeated thoughtfully.
"Now, Philippa, I suppose you realize that you should be penalized.
You've been breaking rules right and left. It's a pretty serious
situation."
"I know, Madame. Please punish me. I can stand anything as long as
I can see Paul again. If I can't see him again I shall die."
"I don't think you'd die, Philippa. And since you're not a senior you're
not allowed to have dates. Not seeing Paul would be automatic
before your penalties were even considered."
The color drained from Flip's face and she stared up at Madame
Perceval, but she did not move or say anything.
Madame spread cheese on a cracker, handed it absently to Flip, and
leaned back in her chair. She held the cheese knife in her hand and
suddenly she slapped it against her palm with a decisive motion. "I'm
not going to forbid you to see Paul, Philippa," she said, "but you will
have to have a penalty and a stiff one, because the fact that it was
Paul you were seeing does not lessen the seriousness of your
offense, but I'll decide on that tomorrow. In the meantime I want to
talk to you about Paul." For a moment Madame Perceval looked
probingly at Flip. Then, as though satisfied with what she saw, she
continued. "We've been worried about Paul, and I think you can help
us."
"Me, Madame?" Flip asked.
"Yes, you. Yes, I think you of all people, Philippa."
"But how, Madame?"
"First of all simply by seeing him. Signorina told me that Georges
was planning to get permission for you to come to the gate house to
see Paul once a week. I shall see that you get the permission. And
remember, Philippa, that I am doing this for Paul's sake, not yours."
"Yes, Madame. But Madame—"
"But what, Philippa?"
"Monsieur Laurens asked what Mlle. Dragonet's name was. Wouldn't
he know?"
Madame Perceval laughed. "He was just playing along with Paul.
Paul didn't want you to know he had any connection with the school."
"Oh. But—"
"But what, Philippa?"
"Why are you worried about Paul, Madame?"
"I can't tell you that now, Flip. Paul will let you know himself sooner
or later. In the meantime, the best way that you can help him is to
continue trying to get on with the girls here at school, and to become
really happy here. That sounds like rather a tall order to you, doesn't
it? But I think you can do it. You'd like to, wouldn't you?"
"Oh, Madame, you know I would. But please—I don't see—how
would that help Paul?"
"Perhaps you've discovered already," Madame said, "that Paul has a
horror of anything he can label an institution. He knows that you hate
this institution. Because he respects you, if he could watch you grow
to like it here he might be willing to go back to school himself.
Georges is tutoring him but he needs regular schooling. If he really
wants to be a doctor he cannot dispense with formal education, and I
believe that one day Paul will make a very brilliant doctor. I know this
is all very confusing to you, Philippa, but you must trust me as I am
trusting you. All I can tell you is that I think you can help Paul and
because of this I am willing to disregard the manner in which you
have been seeing him up to now and to see that you have official
permission to see him in the future." Madame Perceval stood up.
"You'd better report to me tomorrow and I'll tell you what your
penalties are. I'm afraid this has been a very slim tea for you,
Philippa. If you hurry down to the dining room there may be a few
scraps left."
"I'm not hungry," Flip said. "Thank you for—for everything, Madame."
Madame put her hand on Flip's shoulder. "I'm very glad this
happened to you, Philippa, instead of—say one of your roommates.
Very glad." She was smiling warmly and Flip's heart leaped with joy
at this great praise. Madame gave her a little shove. "Run along
now," she said.
9
Flip was ready and waiting in the big Hall when Monsieur Laurens
came for her the next afternoon. The girls were all curious and rather
envious, when, in answer to their questions they learned that Flip
had been given special permission to have tea with Madame's
nephew; and she felt that her stock had gone up with them.
"My aunt, Pill, it's really a date!" Gloria whistled.
"I bet Pill's never been out with a boy before," Esmée said. "Have
you, Pill? Usually the Americans have more dates than the rest of us
but I bet this is Pill's first date. Are you going to let him kiss you,
Pill?"
"Don't be silly," Flip said.
"Anyhow Black and Midnight said it wasn't a date," Sally added. "I
bet this nephew's just a child."
Erna whispered to Flip, "Esmée and Sally're just boy crazy. Don't
mind them. Personally I think boys are dopes."
At the gate house an hour later, Flip and Paul lay on the great rug in
front of the fire and roasted chestnuts while Georges Laurens
watched from his chair, and Ariel rested his head on his master's
knee.
"So you don't like school?" Georges Laurens asked Flip.
"No, sir."
"Why not?"
"I can't seem to fit in. I'm different."
"And I suppose you despise the other girls?" Georges Laurens
asked.
Flip looked surprised for a moment, then hesitated, thinking his
question over as she opened and ate a chestnut. "No. I don't despise
them. I'm just uncomfortable with them," she answered finally,
chewing the delicate tender meat and staring at the delicate unicorn
in the tapestry on the wall above her.
"But you want to be like them anyhow?" Georges Laurens pursued.
She nodded, then added, "I want to be like them and like myself,
too."
"You think quite a lot of yourself?"
"Oh, no!" She shook her head vehemently. "It isn't that at all. I think
I'm—I'm not anything I want to be. It's just that there are certain
things outside me and the way I feel about them that I wouldn't want
changed. The way I feel about the mountains and the lake. And
stars. I love them so very much. And I don't think the others really
care about them. I don't think they really see them. And it's the way I
feel about things like the mountains and the lake and stars that I
wouldn't want changed."
"You want a great deal, my little Flip," Georges Laurens said, gently
stroking Ariel's head, "when you want to be exactly like everybody
else and yet be different at the same time."
Paul reached for another chestnut and rolled lazily onto his back. "I
sympathize with you, Flip. It's horrible to be in an institution. Couldn't
you have stayed at home with your parents?"
"I wanted to," Flip said, "but Gram's in New York and right now my
father's in China, and my mother's dead. I wanted to travel around
with father but he said he was going to go to all sorts of places I
couldn't go, and I couldn't miss school anyhow." Remembering her
promise to Madame Perceval she added, "and I don't hate school
nearly as much as I used to, Paul. Truly I don't."
"What do you like about it?" Paul asked bluntly.
"Oh, lots of things," Flip said vaguely. "Well—look at all the things
you can learn at school you couldn't learn by yourself. I mean not
only dull things. Art, for instance. Madame Perceval's taught me all
kinds of things in a few months."
"Go on," Paul said.
"And skiing—Fräulein Hauser's going to teach me to ski."
"I know how to ski," Paul said.
Flip tried again. "Well, there's music. They teach us lots about music
and that's fun."
"This is the best way to learn about music," Paul said, going to the
phonograph and turning it on. "You don't have to be in school to
listen to good music."
Flip gave up.
The record on the victrola was Bach's Jesu, Joy of Man's Desiring. It
was music that Flip knew and she sat quietly staring into the fire and
listening. It was the first time in three years that she had been able to
listen to that music. At home in New York in the Christmases of her
childhood her mother had played it and played it. The Christmas
after her mother's death Flip had found the record broken and was
glad. But now she was listening to it with a kind of peace. She looked
over at Paul and said softly, "My mother used to love that...."
But Paul did not hear. He jumped up and turned off the record before
it had played to the end and said, "Let's go for a walk."
Flip followed him outside. The evening was still and cold and there
was a hint of blue-green left in the sky. The stars were beginning to
come out. Flip looked up at the first one she saw and made a wish.
—I wish Paul may always like me. Please, God. Amen. She wished
on the star and there was a sudden panic in her mind because the
Paul walking beside her was not the Paul with whom she had spent
the afternoon. His face in the last light as she glanced at it out of the
corner of her eye seemed stern, even angry, and he seemed to be
miles and miles away from her. He had withdrawn his
companionship and she searched desperately for a way to bring him
back to her.
"Paul," she hesitated, then gathered her courage and went on, "do
you remember Christmas when you were very little?"
"No," he answered harshly, "I don't remember."
She felt as though he had slapped her. Why wouldn't he remember?
She remembered those first Christmases so vividly. Was he just
trying to keep her from talking? Had she unwittingly done something
to make him angry?
She glanced at him again but his face was unrelenting and she
clenched her mittened hands tightly inside her pockets and said over
and over to herself,—please God, please God, please God....
"I don't remember!" Paul suddenly cried out and abruptly stopped his
rapid walking and wheeled about to face her. "I don't remember." His
voice was no longer harsh but he spoke with an intensity that
frightened Flip.
She could only ask, "But why, Paul? Why?"
He reached out for her hands and held them so tightly that it hurt. "I
don't know why ... that's the hardest part. I don't remember anything
at all beyond the last few years."
Flip tried to make it seem unimportant—to say something, anything
that would make Paul relax. None of this was as serious as he
thought—lots of people had poor memories. Anyway it had nothing
to do with their friendship. "Paul," she began—but he was not
listening. He was not even conscious that she had spoken.
"I don't want to frighten you," Paul said, "but Flip I have to tell you—I
don't know who I am."
CHAPTER FOUR
The Lost Boy
FLIP did not say anything. She just stood there and let Paul hold her
hands too tightly and she felt that somehow the pain in her hands
might ease the pain in his mind. Then he dropped her hands and
started to walk again, but more slowly. When he began to speak she
listened intently, but it was impossible to make it seem real. The
story Paul was pouring out to her now was like a movie, or
something read in a book. The concentration camps. The children
and the children's parents gassed and burned. The cold and the
hunger, and afterwards the lostness. The children in the DP camps.
The children roaming and scavenging the streets like hungry wolves.
"I was one of the lucky ones," Paul said in a low voice. "My mother
and father found me. I mean—Monsieur and Madame Laurens....
You'll have to understand Flip, if I keep calling them my mother and
father—but that's the way I think of them now, and I don't remember
anyone else for a mother and father."
Flip nodded, and Paul continued, his face tense in the starlight.
"They found me in a bombed out cellar in Berlin when my mother
was singing there for the troops just after the war was over. I'd been
trapped there somehow and I was nearly dead I guess, but I kept on
calling and they found me and rescued me. And for some reason I
didn't want to be rescued. It's like sometimes when you try to save
an animal he snarls and bites at you before he realizes that you
aren't going to hurt him more. A dog was run over on our street once
—not Ariel, another dog—and he kept trying to bite at me for a long
time until he realized that I wanted to help him. His back was broken
and I had to chloroform him. Dr. Bejart helped me." Paul stopped
talking and continued to walk so rapidly that Flip almost had to run to
keep up with him. She looked up through the bare trees and the last
color had drained from the sky and the full flowering of stars was out
and they seemed to be caught in the topmost branches of the trees
like blossoms. By their light she could see Paul quite clearly but she
knew that she must not say anything to him. They had walked
beyond the chateau now and behind her she could hear an owl
calling forlornly from one of the turrets.
"I don't really remember anything before my mother and father found
me," Paul said. "Sometimes I remember bits of the concentration
camp. Aunt Colette thinks it's because of the concentration camp that
I'm afraid of institutions. I might as well admit it, Flip, I am afraid of
institutions. I think if I could remember I wouldn't be afraid.
Sometimes when I'm in the chateau I feel as though I were going to
remember but I never do. I remember bits of the camp, the way you
sometimes remember bits of a nightmare, but when I try really to
remember it's like going out of a bright room into a dark room and
you can't see anything in the dark except strange shapes and
shadows...."
"Oh, Paul," Flip whispered. She could think of no words of comfort or
reassurance so she whispered "Oh, Paul...." again to let him know
that she was listening and caring.
"It's like being blind," Paul said, "not remembering. When people talk
about the five senses they forget memory. Memory's like a sense....
Flip, I have never said these things to anyone before." He turned
sharply and they started walking back to the gate house. After a
while he asked her, "Are you cold?"
"No," Flip said.
"You must be cold. You wouldn't be shivering if you weren't. We'll go
back and roast some more chestnuts." She walked along beside him
and suddenly he turned to her and smiled and his voice was Paul's
voice again, "We're going to have wonderful times this winter, Flip!"
he said. "When you learn how to ski we can go skiing together. And
in the spring we can go for trips on the lake and in the summer we
can go swimming. I'm glad you came to the chateau, Flip."
"Oh!" Flip said, "supposing I hadn't."
2
The next afternoon the sky clouded over and it began to snow and it
snowed all afternoon and all night and the following afternoon skiing
began. Fräulein Hauser met the beginners in the ski room, and told
them the various parts of the skis and the ski sticks, and how to take
care of them. Flip clutched Eunice's discarded skis and felt happy
with the excited kind of anticipation that comes before Christmas.
Somehow she knew she would be able to ski and maybe if she
turned out to be a really wonderful skier the girls would like her better
and then she would begin to like school and she would be better
able to help Paul.
But when they got out on the gentle slopes where Fräulein Hauser
taught the beginners it wasn't at all the way she had imagined and
hoped it would be. Instead of all at once being able to fly over the
snow like a bird as she had dreamed, she found that no sooner was
she on her feet than she was flat on her back, skis up in the air, or
with her head buried in the snow, or doing a kind of wild splits.
Fräulein Hauser was not unkind, but after a while she said,
"You don't seem to have much aptitude for this, do you, Philippa?"
Flip gritted her teeth. "I'll learn."
"I hope so." Fräulein Hauser sounded dubious.
Every afternoon Flip went out grimly with the beginners. She was
covered with bruises and every muscle in her body ached, but she
was determined that she was going to learn how to ski, that in this
one thing at any rate she would not fail. When the other beginners
laughed at her tumbles she tried desperately to laugh back, to
pretend that she thought it was funny, too.
At the end of the skiing class on Friday afternoon, Fräulein Hauser
called her back to the ski room as the others left.
"I don't want to hurt your feelings, Philippa, but I think you'd better
drop skiing. You'll enjoy the ice-skating when the hockey field is
flooded, I'm sure, and in the meantime there are walks, and gym
work."
"But why, please, Fräulein Hauser?" Flip gasped in dismay.
"You just don't seem able to learn, and I'll have to admit I can't teach
you. I'm afraid you'll hurt yourself in one of your falls and I think it
would be best if you just give it up."
Flip looked at the racks and racks of skis as they suddenly began
blurring together. "I'd rather keep on, please, Fräulein Hauser, if it's
all right."
"I'm afraid it isn't all right," Fräulein Hauser said impatiently. "I just
can't have you in my class. I'll put you on the walk list for tomorrow."
Flip turned her head and left. She walked blindly down the corridor
but she had managed to control her tears by the time she got to the
big Hall.
3
On Sunday she could not help telling Paul of her defeat. Paul had
immediately seen that something was wrong, asking, "What's the
matter?"
"I know I could learn to ski if she'd just let me go on trying," Flip
persisted. "I know I could." Ariel was licking her face in a worried
manner and she put her head down on his back to try to hide the
tears that were threatening.
"Bring your skis over next Sunday and I'll help you," Paul told her.
"Oh, would you really, Paul?"
"I said I would. Do you think it's because of your bad knee you told
me about?"
Flip shook her head. "No. My father asked the doctor when Eunice
gave me her skis and he said skiing was fine for me. So it isn't that."
"Well, bring your skis next time then," Paul told her.
So Flip brought her skis over. Madame Perceval arrived just as they
were about to set off.
"Hello, Philippa, Paul, what's this?" she asked, fending off Ariel's
frantic welcome.
"I'm going to teach Flip to ski," Paul announced.
"Oh?"
"Fräulein Hauser said I had to drop skiing," Flip explained.
"Why, Philippa?"
"She said I just couldn't learn and she couldn't teach me. But
Madame, I'm sure I can learn, I'm sure I can."
"Why don't I go out with you and Paul?" Madame Perceval said, "and
we'll see."
She watched while Flip put on her skis, watched her push off, fall
down, push off and fall down again.
"Where did you get your skis?" she asked.
"A friend of my father's gave them to me. They were hers."
"Take them off for a moment," Madame Perceval said. "Now raise
your arm." She measured the skis against Flip. "Just as I thought.
They're much too long for you. I don't know what your father's friend
was thinking of. She can't know much about skiing."
"Well, she says she's skied a lot," Flip said. "Maybe she was trying to
impress father. He doesn't know anything about skiing. He used to
use snow shoes when he was a boy."
Madame Perceval took the skis away from Flip. "No wonder you
couldn't learn on these. They would be too long for Paul. I don't know
why Fräulein Hauser didn't notice it at once."
"She probably would have on anybody else. People just expect me
to be bad at sports."
Madame Perceval laughed. "You're probably right, Philippa. And
Fräulein Hauser certainly has her hands full with beginners this year.
Now, there's a pair of very good ash skis back at school that would
be just about right for you. One of the girls from last year left them. I
think I'll run along back and get them. You and Paul wait inside for
me."
"Oh, thank you, Madame!" Flip cried.
"Thank you, Aunt Colette," Paul added.
She and Paul went indoors. Georges Laurens was shut up in his tiny
study, deep in concentration, so they did not speak to him, but went
over to the fire, stripping off jackets and sweaters. For a moment
they were silent and Flip knew that Paul did not want to talk about
any of the things he had told her, or to have her talk about them.
"Papa's been writing all day, except when he went to get you," Paul
said, talking nervously as he stared into the fire. "I was afraid that he
might forget to go for you, but he didn't."
"Thank goodness for that," Flip sighed.
Paul stood up. "I'm hungry. I'll go get us some bread and cheese
from Thérèse." He disappeared in the direction of the kitchen and
came back with a chunk of cheese, half a loaf, and a bone handled
carving knife. Flip lay on the hearth, using Ariel as a pillow.
"Aunt Colette was over here last night," Paul said, "and that Italian
teacher, Signorina what's-her-name."
"Signorina del Rossi."
"That's right," Paul said. "And they were talking about you."
"About me! What did they say?" Flip cried, sitting up.
"Well, I didn't hear all of it because I was reading."
"But what did they say?" Flip asked again.
"Well, Signorina was saying that it was the first time Aunt Colette's
ever taken a special interest in any one girl. And Aunt Colette
laughed and said that you had great talent and then she said that an
artist's life was a hard one but she was afraid you were stuck with it.
And then she said—now, don't get angry with me, Flip—"
"Go on."
"Well, she said you were a nice child when you didn't spoil it by
being sorry for yourself."
"Oh," Flip said. "Oh." And she lay down again rubbing her cheek
against Ariel's fur.
"Here," Paul said. "Have some more bread and cheese.... You aren't
angry at me, are you, Flip?"
"No."
"Are you sorry for yourself?"
"Yes. I think I am sometimes—"
"Why?"
"Oh—because I want my mother. And the girls don't like me. Oh, and
everything. And I want to be with father instead of at school. But I
don't feel that way so much any more, Paul. And if I can learn how to
ski it will be wonderful. And I love coming here every week-end. And
I'm beginning to like school. Truly I am."
"Why do you keep saying that?" Paul asked, holding the loaf against
his chest and cutting off another chunk. "You keep saying you like
school so much and I don't believe you really do at all."
"I do," Flip persisted. "I don't hate it the way I used to." And she
realized with a start that her words were true. While she didn't
actually like school, she no longer hated it with the sickening passion
of only a few short weeks ago.
"Aunt Colette said something else," Paul went on. "Do you want to
hear?"
"Of course."
"She said you reminded her of Denise."
"Who's Denise?"
"Her daughter."
"What!" Flip yelled. "Her daughter!"
"Hush. Here she comes. Have some more bread and cheese, Flip,"
Paul said as Madame Perceval came in carrying a pair of skis.
"Here you are, Philippa," Madame Perceval held the skis up. "Let's
try these for size."
