Quantitative Methods in the Humanities
and Social Sciences
Guillaume Desagulier
Corpus Linguistics
and Statistics
with R
Introduction to Quantitative Methods in
Linguistics
Quantitative Methods in the Humanities and Social Sciences
Editorial Board
Thomas DeFanti, Anthony Grafton, Thomas E. Levy, Lev Manovich,
Alyn Rockwood
Quantitative Methods in the Humanities and Social Sciences is a book series designed to foster
research-based conversation with all parts of the university campus from buildings of ivy-covered
stone to technologically savvy walls of glass. Scholarship from international researchers and the
esteemed editorial board represents the far-reaching applications of computational analysis, statis-
tical models, computer-based programs, and other quantitative methods. Methods are integrated in
a dialogue that is sensitive to the broader context of humanistic study and social science research.
Scholars, including among others historians, archaeologists, classicists and linguists, promote this
interdisciplinary approach. These texts teach new methodological approaches for contemporary
research. Each volume exposes readers to a particular research method. Researchers and students
then benefit from exposure to subtleties of the larger project or corpus of work in which the quan-
titative methods come to fruition.
Guillaume Desagulier
Université Paris 8
Saint Denis, France
Preface
In the summer of 2008, I gave a talk at an international conference in Brighton. The talk was about constructions involving multiple hedging in American English (e.g., I’m gonna have to ask you to + VP). I remember
this talk because even though I had every reason to be happy (the audience showed sustained interest and a
major linguist in my field gave me positive feedback), I remember feeling a pang of dissatisfaction. Because
my research was mostly theoretical at the time, I had concluded my presentation with the phrase “pending
empirical validation” one too many times. Of course, I had used examples gleaned from the renowned Cor-
pus of Contemporary American English, but my sampling was not exhaustive and certainly biased. Even
though I felt I had answered my research questions, I had provided no quantitative summary. I went home
convinced that it was time to buttress my research with corpus data. I craved a better understanding of
corpora, their constitution, their assets, and their limits. I also wished to extract the data I needed the way I
wanted, beyond what traditional, prepackaged corpus tools have to offer.
I soon realized that the kind of corpus linguistics that I was looking for was technically demanding,
especially for the armchair linguist that I was. In the summer of 2010, my lab gave me the opportunity to attend a one-week boot camp in Texas whose instructor, Stefan Th. Gries (University of California, Santa Barbara), had just
published Quantitative Corpus Linguistics with R. This boot camp was a career-changing opportunity. I
went on to teach myself more elaborate corpus-linguistics techniques as well as the kinds of statistics that
linguists generally have other colleagues do for them. This led me to collaborate with great people outside
my field, such as mathematicians, computer engineers, and experimental linguists. While doing so, I never
left aside my research in theoretical linguistics. I can say that acquiring empirical skills has made me a better
theoretical linguist.
If the above lines echo your own experience, this book is perfect for you. While written for a readership
with little or no background in corpus linguistics, computer programming, or statistics, Corpus Linguistics
and Statistics with R will also appeal to readers with more experience in these fields. Indeed, while presenting
in detail the text-mining apparatus used in traditional corpus linguistics (frequency lists, concordance tables,
collocations, etc.), the text also introduces the reader to some appealing techniques that I wish I had become
acquainted with much earlier in my career (motion charts, word clouds, network graphs, etc.).
Goals
This is a book on empirical linguistics written from a theoretical linguist’s perspective. It provides both
a theoretical discussion of what quantitative corpus linguistics entails and detailed, hands-on, step-by-step
instructions to implement the techniques in the field.
A Note to Instructors
I have written this book so that instructors feel free to teach the chapters individually in the order that they
want. For a one-semester course, the emphasis can be on either:
• Methods in corpus linguistics (Part I)
• Statistics for corpus linguistics (Part II)
In either case, I would recommend always including Chaps. 1 and 2 to make sure that the students are
familiar with the core concepts of corpus-based research and R programming.
Note that it is also possible to include all the chapters in a one-semester course, as I do on a regular
basis. Be aware, however, that this leaves the students less time to apply the techniques to their own research
projects. Experience tells me that the students need sufficient time to work through all of the earlier chapters,
as well as to become accustomed to the statistical analysis material.
Supplementary Materials
Throughout the chapters, readers will make extensive use of datasets and code. These materials are available
from the book’s Springer Extras website:
http://extras.springer.com/2017/978-3-319-64570-4
I recommend that you check the repository on a regular basis for updates.
Acknowledgments
Writing this book would not have been possible without the help and support of my family, friends, and
colleagues. Thanks go to Fatima, Idris, and Hanaé for their patience. In particular, I would like to thank
those who agreed to proofread and test-drive early drafts in real-life conditions. Special thanks go to Antoine
Chambaz for his insightful comments on Chap. 8.
Contents
1 Introduction
  1.1 From Introspective to Corpus-Informed Judgments
  1.2 Looking for Corpus Linguistics
    1.2.1 What Counts as a Corpus
    1.2.2 What Linguists Do with the Corpus
    1.2.3 How Central the Corpus Is to a Linguist’s Work
  References
  2.11.1 if Statements
  2.11.2 if...else Statements
  2.12 Cleanup
  2.13 Common Mistakes and How to Avoid Them
  2.14 Further Reading
  Exercises
  References
3 Digital Corpora
  3.1 A Short Typology
  3.2 Corpus Compilation: Kennedy’s Five Steps
  3.3 Unannotated Corpora
    3.3.1 Collecting Textual Data
    3.3.2 Character Encoding Issues
    3.3.3 Creating an Unannotated Corpus
  3.4 Annotated Corpora
    3.4.1 Markup
    3.4.2 POS-Tagging
    3.4.3 POS-Tagging in R
    3.4.4 Semantic Tagging
  3.5 Obtaining Corpora
  Exercise
  References
A Appendix
  A.1 Chapter 6
    A.1.1 Dispersion Plots
  A.2 Chapter 8
    A.2.1 Contingency Table
    A.2.2 Discrete Probability Distributions
    A.2.3 A χ2 Distribution Table
B Bibliography
Solutions
Index
Chapter 1
Introduction
Abstract In this chapter, I explain the theoretical relevance of corpora. I answer three questions: what counts as a corpus? What do linguists do with the corpus? What status does the corpus have in the linguist’s approach to language?
1.1 From Introspective to Corpus-Informed Judgments
Linguists investigate what it means to know and speak a language. Despite having the same objective,
linguists have their theoretical preferences, depending on what they believe is the true essence of language.
Supporters of transformational-generative grammar (Chomsky 1957, 1962, 1995) claim that the core of
grammar consists of a finite set of abstract, algebraic rules. Because this core is assumed to be common to
all the natural languages in the world, it is considered a universal grammar. The idiosyncrasies of language
are relegated to the periphery of grammar. Such is the case of the lexicon, context, elements of inter-speaker
variation, cultural connotations, mannerisms, non-standard usage, etc.
In generative grammar, pride of place is given to syntax, i.e. the way in which words are combined to
form larger constituents such as phrases, clauses, and sentences. Syntax hinges on an opposition between
deep structure and surface structure. The deep structure is the abstract syntactic representation of a sentence,
whereas the surface structure is the syntactic realization of the sentence as a string of words in speech.
For example, the sentence an angry cow injured a farmer with an axe has one surface structure (i.e. one
realization as an ordered string of words) but two alternative interpretations at the level of the deep structure:
(a) an angry cow injured a farmer who had an axe in his hands, and (b) an angry cow used an axe to injure a
farmer. In other words, a sentence is generated from the deep structure down to the surface structure.
More generally, generative grammar is a “top-down” approach to language: a limited set of abstract rules
“at the top” is enough to generate and account for an infinite number of sentences “at the bottom”. What gen-
erative linguists truly look for is the finite set of rules that core grammar consists of, in other words speakers’
competence, as opposed to speakers’ performance. This is to the detriment of idiosyncrasies of all kinds.
Conversely, theories such as functional linguistics (Dik 1978, 1997; Givón 1995), cognitive linguistics
(Langacker 1987, 1991; Goldberg 1995), and contemporary typology (Greenberg 1963) advocate a “bottom-
up” approach to language. It is usage that shapes the structure of language (Bybee 2006, 2010; Langacker
1988). Grammar is therefore derivative, not generative. There is no longer any point in separating competence and performance, because competence builds on performance. In the same vein, grammar has no
core or periphery: it is a structured inventory of symbolic units (Goldberg 2003). Therefore, any linguistic
unit is worth being taken into consideration with equal attention: morphemes (un-, -ness), words or phrases
(corpus linguistics), ritualized or formulaic expressions (break a leg!), idioms (he snuffed it), non-canonical
phrasal expressions (sight unseen), semi-schematic expressions (e.g. just because X doesn’t mean Y), and
fully schematic expressions (e.g. the ditransitive construction).
Like biologists, who apprehend life indirectly, i.e. by studying the structure, function, growth, evolution,
distribution, and taxonomy of living cells and organisms, linguists apprehend language through its manifes-
tations. For this reason, all linguistic theories rely on native speakers acting as informants to provide data.
On the basis of such data, linguists can formulate hypotheses and test them.
Generative linguists are known to rely on introspective judgments as their primary source of data. This is
a likely legacy of de Saussure, the father of modern linguistics, who delimited the object of study (langue)
as a structured system independent from context and typical of an ideal speaker. However, the method can
be deemed faulty on at least two accounts. First, there is no guarantee that linguists’ introspective accept-
ability judgments always match what systematically collected data would reveal (Sampson 2001; Wasow
and Arnold 2005). Second, for an intuition of well-formedness to be valid, it should at least be formulated
by a linguist who is a native speaker of the language under study.1 If maintained, this constraint limits
the scope of a linguist’s work and invalidates a significant proportion of the research published worldwide,
judging from the number of papers written by linguists who are not fluent speakers of the languages they
investigate, including among generativists. This radical position is hardly sustainable in practice. Some
generativists working on early language acquisition have used real performance data to better understand the
development of linguistic competence.2
Because of its emphasis on language use in all its complexity, the “bottom-up” approach provides fertile
ground for corpus-informed judgments. It is among its ranks that linguists, dissatisfied with the practice
of using themselves as informants, have turned to corpora, judging them to be a far better source than
introspection to test their hypotheses. Admittedly, corpora have their limitations. The most frequent criticism
that generative linguists level against corpus linguists is that no corpus can ever provide negative evidence.
In other words, no corpus can indicate whether a given sentence is impossible.3 The first response to this
critique is simple: grammar rules are generalizations over actual usage and negative evidence is of little
import. The second response is that there are statistical methods that can either handle the non-occurrence
of a form in a corpus or estimate the probability of occurrence of a yet unseen unit. A more serious criticism
is the following: no corpus, however large and balanced, can ever hope to be representative of a speaker,
let alone of a language. Insofar as this criticism has to do with the nature and function of corpora, I address
it in Chap. 3.
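The second response can be sketched in R, the language used throughout this book. The snippet below is a minimal illustration of the zero-frequency idea, assuming a toy frequency list and the simple Good–Turing estimate, under which the probability mass of unseen types is approximated by N1/N, the share of hapax legomena among all tokens:

```r
# Toy frequency list: counts of word types in a small corpus sample
freqs <- c(the = 102, of = 51, corpus = 7, hapax1 = 1, hapax2 = 1, hapax3 = 1)

N  <- sum(freqs)        # total number of tokens in the sample
N1 <- sum(freqs == 1)   # number of types seen exactly once (hapax legomena)

# Simple Good-Turing estimate of the total probability mass
# reserved for types that never occur in the corpus
p_unseen <- N1 / N
p_unseen
```

This is a sketch, not a full smoothing method: real applications (e.g. language modeling) refine the estimate and redistribute the reserved mass across candidate unseen units.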
Linguists make use of corpus-informed judgment because (a) they believe their work is based on a psy-
chologically realistic view of grammar (as opposed to an ideal view of grammar), and (b) such a view can be
operationalized via corpus data. These assumptions underlie this book. What remains to be shown is what
counts as corpus data.
1 In fact, Labov (1975) shows that even native speakers do not know how they speak.
2 See McEnery and Hardie (2012, p. 168) for a list of references.
3 Initially, generativists claimed that the absence of negative evidence from children’s linguistic experience was an argument in favor of the innateness of grammatical knowledge. This claim is now used beyond language acquisition, against corpus linguistics.
1.2 Looking for Corpus Linguistics
Defining corpus linguistics is no easy task. It typically fills up entire textbooks such as Biber et al. (1998),
Kennedy (1998), McEnery and Hardie (2012), and Meyer (2002). Because corpus linguistics is changing
fast, I believe a discussion of what it is is more profitable than an elaborate definition that might soon be
outdated. The discussion that follows hinges on three questions:
• what counts as a corpus?
• what do linguists do with the corpus?
• what status does the corpus have in the linguist’s approach to language?
Each of the above questions has material, technical, and theoretical implications, and offers no straightfor-
ward answer.
1.2.1 What Counts as a Corpus
A corpus (plural corpora) is a body of material (textual, graphic, audio, and/or video) upon which some analysis is based. Several disciplines make use of corpora: linguistics of course, but also literature, philosophy,
art, and science. A corpus is not just a collection of linguistically relevant material. For that collection to
count as a corpus, it has to meet a number of criteria: sampling, balance, representativeness, comparability,
and naturalness.
1.2.1.1 Sampling
A corpus is a finite sample of genuine linguistic productions by native speakers. Even a monitor corpus such
as The Bank of English, which grows over time, has a fixed size at the moment when the linguist taps into it.
Usually, a corpus has several parts characterized by mode (spoken, written), genre (e.g. novel, newspaper), or period (e.g. 1990–1995). These parts are themselves sampled from a number of theoretically infinite sources.
Sampling should not be seen as a shortcoming. In fact, it is sampling that allows corpus linguistics to be very
ambitious. Like astrophysics, which infers knowledge about the properties of the universe and beyond from
the study of an infinitesimal portion of it, corpus linguists use finite portions of language use in the hope that
they will reveal the laws of a given language, or some aspect of it. A corpus is therefore a sample of samples
or, more precisely, a representative and balanced sample of representative and balanced samples.
1.2.1.2 Representativeness
A corpus is representative when its sampling scheme is faithful to the variability that characterizes the target
language. Suppose you want to study the French spoken by Parisian children. The corpus you will design
for this study will not be representative if it consists only of conversations with peers. To be representative,
the corpus will have to include a sizeable portion of conversations with other people, such as parents, school
teachers, and other educators.
Biber (1993, p. 244) writes: “[r]epresentativeness refers to the extent to which a sample includes the full
range of variability in a population.” Variability is a function of situational and linguistic parameters. Situa-
tional parameters include mode (spoken vs. written), format (published, unpublished), setting (institutional,
public, private, etc.), addressee (present, absent, single, interactive, etc.), author (gender, age, occupation,
etc.), factuality (informational, imaginative, etc.), purposes (information, instruction, entertainment, etc.), or
topics (religion, politics, education, etc.).
Linguistic parameters focus on the distribution of language-relevant features in a language. Biber (1993,
p. 249) lists ten grammatical classes that are used in most variation studies (e.g. nouns, pronouns, preposi-
tions, passives, contractions, or WH-clauses). Each class has a distinctive distribution across text categories,
which can be used to guarantee a representative selection of text types. For example, pronouns and contrac-
tions are interactive and typically occur in texts with a communicative function. In contrast, WH-clauses
require structural elaboration, typical of informative texts.
Distribution is a matter of numerical frequencies. If you are interested in the verbal interaction between
waiters and customers in restaurants, unhedged suggestions such as “do you want fries with that?” will
be overrepresented if you include 90% of conversations from fast-food restaurants. Conversely, forms of
hedged suggestion such as “may I suggest. . . ?” are likely to be underrepresented. To guarantee the repre-
sentation of the full range of linguistic variation existing in a specific dialect/register/context, distributional
considerations such as the number of words per text and the number of texts per text type must be taken into account.
Corpus compilers use sampling methodologies for the inclusion of texts in a corpus. Sampling techniques
based on sociological research make it possible to obtain relative proportions of strata in a population thanks
to demographically accurate samples. Crowdy (1993) is an interesting illustration of the demographic sampling method used to compile the spoken component of the British National Corpus (2007). To fully represent the regional variation of British English, the United Kingdom was divided into three supra-regions (the North, the Midlands, and the South) and twelve areas (five for the North,
three for the Midlands, and four for the South). Yet, as Biber (1993, p. 247) points out, the demographic
representativeness of corpus design is not as important to linguists as the linguistic representativeness.
1.2.1.3 Balance
A corpus is said to be balanced when the proportion of the sampled elements that make it representative
corresponds to the proportion of the same elements in the target language. Conversely, an imbalanced
corpus introduces skews in the data. Like representativeness, the balance of a corpus also depends on the
sampling scheme.
Corpus-linguistics textbooks frequently present the Brown Corpus (Francis and Kučera 1979) and its
British counterpart, the Lancaster-Oslo-Bergen corpus (Johansson et al. 1978) as paragons of balanced cor-
pora. Each attempts to provide a representative and balanced collection of written English in the early
1960s. For example, the compilers of the Brown Corpus made sure that all the genres and subgenres of the
collection of books and periodicals in the Brown University Library and the Providence Athenaeum were
represented. Balance was achieved by choosing the number of text samples to be included in each category.
By way of example, because there were about thirteen times as many books in learned and scientific writing
as in science fiction, 80 texts of the former genre were included, and only 6 of the latter genre.
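The proportional allocation logic can be sketched in R. The holding counts below are illustrative round numbers, not the actual Brown figures; only the 500-text total matches the Brown Corpus design:

```r
# Illustrative library holdings per genre (round numbers, not the actual Brown figures)
holdings <- c(learned = 1300, science_fiction = 100)

# Each genre's share of the sampling frame
shares <- holdings / sum(holdings)

# Allocate a 500-text corpus proportionally to those shares
allocation <- round(shares * 500)
allocation
```

The same ratio-driven reasoning scales to any number of genres: compute each genre’s share of the sampling frame, then multiply by the target corpus size.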
Because the Brown Corpus and the LOB corpus are snapshot corpora, targeting a well-delimited mode in a well-delimited context, achieving balance was fairly easy. Compiling a reference corpus with spoken data is far more difficult, for two reasons. First, because obtaining these resources supposes that informants are willing to have their conversations recorded, spoken data take time to collect and transcribe, and they are more expensive. Second, you must have an exact idea of what proportion each mode, genre, or subgenre represents in the
target language. If you know that 10% of the speech of Parisian children consists of monologue, your corpus
should consist of about 10% of monologue recordings. But as you may have guessed, corpus linguists have
no way of knowing these proportions. They are, at best, educated estimates.
1.2.1.4 An Ideal
Sampling methods imply a compromise between what is theoretically desirable and what is feasible. The
abovementioned criteria are therefore more of an ideal than an attainable goal (Leech 2007).
Although planned and designed as a representative and balanced corpus, the BNC is far from meeting this
ideal. Natural languages are primarily spoken. Yet, 90% of the BNC consists of written texts. However, as
Gries (2009) points out, a salient written expression may have a bigger impact on speakers’ linguistic systems
than a thousand words of conversation. Furthermore, as Biber (1993) points out, linguistic representativeness
is far more important than the representativeness of the mode.
The above paragraphs have shown that a lot of thinking goes on before a corpus is compiled. However,
the theoretical status of a corpus should not blind us to the fact that we first apprehend a corpus through
the materiality of its database, or lack thereof. That materiality is often complex. The original material of a
linguistic corpus generally consists of one or several of the following:
• audio recordings;
• audio-video recordings;
• text material.
This material may be stored in a variety of ways. For example, audio components may appear in the form of reel-to-reel magnetic tapes, audio tapes, or CDs, as was once the case. Nowadays, they are stored
as digital files on the web. The datasets of the CHILDES database are available in CHAT, a standardised
encoding format. The files, which are time aligned and audio linked, are to be processed with CLAN
(http://childes.psy.cmu.edu/). In general, multimodal components include both the original
audio/video files and, optionally, a written transcription to allow for quantification.
As a corpus linguist working mainly on the English language, I cannot help paying tribute to the first
“large-scale” corpus of English: the Survey of English Usage corpus (also known as the Survey Corpus). It
was initiated by Randolph Quirk in 1959 and took thirty years to complete. The written component consists
of 200 texts of 5000 words each, amounting to one million words of British English produced between 1955
and 1985. The Survey Corpus also has a spoken component in the form of now digitized tape recordings of
monologues and dialogues. The corpus was originally compiled from magnetic tape recordings, transcribed by hand, and typed on thousands of 6-by-4-inch paper slips stored in a filing cabinet. Each lexical item was annotated for grammatical features and stored likewise. For example, all verb phrases are filed in the verb
phrase section. The spoken component is also transcribed and annotated for prosodic features. It still exists
under the name of the London-Lund Corpus.
The Survey Corpus was not computerized until the early 1980s.4 With a constant decrease in the price
of disk storage and an ever increasing offer of tools to process an ever increasing amount of electronic data,
all corpus projects are now assumed to be digital from stage one. For this reason, when my students fail to
bring their laptop to their first corpus linguistics class, I suggest they take a trip to University College London
and browse manually through the many thousands of slips filed in the cabinets of the Survey Corpus. Although the students are always full of admiration for the work of corpus linguistics pioneers (Randolph Quirk, Sidney Greenbaum, Geoffrey Leech, Jan Svartvik, David Crystal, etc.), they never forget their laptops afterwards.
Most contemporary corpora in the form of digital texts (from either written or transcribed spoken sources)
are very large. The size of a corpus is relative. When it first came out, in the mid 1990s, the BNC was
considered a very big corpus as it consisted of 100 million words. Compared to the Corpus of Contemporary
American English (450 million words), the Bank of English (45 billion words), the ukWaC corpus of English
(2.25 billion words),5 or Sketch Engine’s enTenTen12 corpus (13 billion words), the BNC does not seem
that large anymore.
What with the availability of digital material and a rich offer of automatic annotation and markup tools,
we are currently witnessing an arms race in terms of corpus size. This is evidenced by the increasing number
of corpora that are compiled from the web (Baroni et al. 2009). On the one hand, a large corpus guarantees
the presence of rare linguistic forms. This is no trifling matter insofar as the distribution of linguistic units
in language follows a Zipfian distribution: there is a large number of rare units. On the other hand, a very
large corpus loses in terms of representativeness and balance, unless huge means are deployed to compile it
with respect to the abovementioned corpus ideal (which, in most cases, has an unbearable cost). Finally, no
corpus extraction goes without noise (i.e. unwanted data that is hard to filter out in a query). The larger the
corpus, the more the noise.
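The Zipfian pattern mentioned above is easy to observe on any text: when word types are ranked by decreasing frequency, a handful of high-frequency types accounts for most tokens, while many types occur only once (hapax legomena). A minimal sketch in Python; the toy text is invented purely for illustration:

```python
from collections import Counter

# Toy text, invented for illustration.
text = ("the cat sat on the mat and the dog sat near the cat "
        "while a bird watched the dog and the mat")
freqs = Counter(text.split())

# Rank the word types by decreasing frequency.
for rank, (word, freq) in enumerate(freqs.most_common(), start=1):
    print(rank, word, freq)

# Many types occur only once (hapax legomena).
hapaxes = [w for w, f in freqs.items() if f == 1]
print(len(hapaxes), "hapaxes out of", len(freqs), "types")  # 6 hapaxes out of 12 types
```

Even in this 22-token sample, half of the types are hapaxes, which is why rare forms only become reliably observable in very large corpora.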
Depending on your case study and the language you investigate, using a small corpus is not a bad thing.
For example, Ghadessy et al. (2001) show that small corpora can go a long way in English language teaching.
Unless you study a macrolanguage (e.g. Mandarin, Spanish, English, or Hindi), it is likely that you will not be able to find a ready-made corpus. The problem is even more acute for endangered or minority languages (e.g. Breton in France, or Amerindian languages in the Americas), for which you only have a sparse collection of texts and sometimes occasional recordings at your disposal. Once compiled into a corpus, those scarce resources
can serve as the basis of major findings, such as Boas and Schuchard (2012) in Texas German, or Hollmann
and Siewierska (2007) in Lancashire dialects.
Small size becomes a problem if the unit you are interested in is not well represented. All in all, size matters, but a small corpus used wisely is worth more than a big corpus used unwisely.
Having a corpus at one’s disposal is a necessary but insufficient condition for corpus linguistics. Outside
linguistics, other disciplines in the humanities make use of large-scale collections of texts.
4 Transcriptions of the spoken component of the Survey Corpus were digitized in the form of the London-Lund Corpus.
5 Ferraresi (2007).
What they generally do with those texts is known as text mining. Corpus linguists may use text-mining techniques at some
point (for instance when they make frequency lists). However, the goal of text mining techniques is to obtain
knowledge about a text or group of texts (Jockers 2014), whereas the goal of corpus-linguistics techniques
is to better understand the rules governing a language as a whole, or at least some aspect of that language
(e.g. a specific register or dialect).
1.2.2.1 Generalization
You do not suddenly become a corpus linguist by running a query in a digital text database in search of a
savory example.6 You investigate a corpus because you believe your findings can be extended to the target
language or variety. This is called generalization.
Generalization implies a leap of faith from what we can infer from a finite set of observable data to the
laws of the target language. Outside corpus linguistics, not everyone agrees. In an interview (Andor 2004,
p. 97), Chomsky states that collecting huge amounts of data in the hope of coming up with generalizations
is unique in science. He makes it sound like it is a pipe dream. To reformulate Chomsky’s claim, a corpus
should not be a basis for linguistic studies if it cannot represent language in its full richness and complexity.
Most corpus linguists rightly counter that they do not aim to explain all of a language in every study (Glynn
2010) and that the limits of their generalizations are the limits of the corpus. Furthermore, even if no corpus
can provide access to the true, unknown law of a language, a corpus can be considered a sample drawn from
this law. As you will discover in the second part of the book, there are ways of bridging the gap between
what we can observe and measure from a corpus, and what we do not know about a language. Achieving this requires ambitious statistics, specifically the kind of inferential statistics used in biostatistics, where scientists routinely bridge the gap between what they can observe in a group of patients and what they can infer about a disease, contra Chomsky’s intuition.
1.2.2.2 Quantification
Corpus linguistics is quantitative when the study of linguistic phenomena based on corpora is systematic
and exhaustive. Gries (2014, p. 365) argues that corpus linguistics in the usage-based sense of the word
is a distributional science. A distributional science infers knowledge from the distribution, dispersion, and
co-occurrence of data. For this reason, quantitative corpus linguists typically focus on corpus frequencies
by means of frequency lists, concordances, measures of dispersion, and co-occurrence frequencies.
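Of the measures just listed, co-occurrence frequency lends itself to a very short sketch: count how often two word forms appear within a fixed window of each other. The window size and sample phrase below are arbitrary choices for illustration:

```python
from collections import Counter

def cooccurrences(tokens, window=2):
    """Count unordered pairs of word forms occurring within `window` tokens of each other."""
    pairs = Counter()
    for i, w in enumerate(tokens):
        # Look only forward; sorting the pair makes it order-independent.
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            pairs[tuple(sorted((w, tokens[j])))] += 1
    return pairs

tokens = "strong tea and strong coffee but powerful arguments".split()
pairs = cooccurrences(tokens)
print(pairs[("strong", "tea")])  # 2: the pair falls inside the window twice
```

Real co-occurrence studies would add association measures on top of these raw counts, but the counting step itself is no more complicated than this.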
All linguists aim at some form of generalization, but not all of them engage in quantification to meet this aim. In other words, a corpus lends itself to both qualitative and quantitative use. Qualitative
corpus analysis may consist in formulating a hypothesis and testing it by looking for potential counterex-
amples in the corpus. It may also consist in using the corpus to refine the hypothesis at some stage. The
concordancer is a corpus tool par excellence meant for qualitative inspection. It is used to examine a word
in its co-text regardless of any form of quantification.
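The core operation of a concordancer, the Key Word In Context (KWIC) display, can itself be sketched in a few lines: locate each occurrence of a node word and align it with a fixed span of co-text on either side. The sample sentence and span size below are invented for illustration:

```python
def kwic(tokens, node, span=3):
    """Return each occurrence of `node` with `span` words of co-text on each side."""
    lines = []
    for i, w in enumerate(tokens):
        if w == node:
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            lines.append(f"{left} [{w}] {right}")
    return lines

tokens = "time flies like an arrow and fruit flies like a banana".split()
for line in kwic(tokens, "flies"):
    print(line)
# time [flies] like an arrow
# arrow and fruit [flies] like a banana
```

Note that no quantification is involved: the output is simply laid out for the linguist's qualitative inspection of the node word in its co-text.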
An oft-heard misconception about corpus linguistics is that its quantitative methods are too objective and miss information that only the linguist’s expert subjectivity can bring.
Nothing is further from the truth. First, you should never start exploring a corpus, let alone quantify your
findings, if you do not have a research question, a research hypothesis, and a clear idea as to how the corpus
will help you answer the question and test the hypothesis. Second, you should never consider the quantified
corpus findings as the final stage in your research project. Uninterpreted results are useless because they do
not speak for themselves: “quantification in empirical research is not about quantification, but about data
management and hypothesis testing” (Geeraerts 2010).
Some linguists believe that using a corpus is but one of several steps in a research project (although a
significant one). For example, a linguist might generate a working hypothesis regarding a linguistic form
and then decide to test that hypothesis by observing the behavior of that linguistic form in a corpus and by
generalizing over the findings. Other linguists adopt a more radical stance. They claim that the corpus is the
only possible approximation we have of a speaker’s linguistic competence. It is therefore a linguist’s job to
investigate corpora.
To most corpus linguists, corpora and quantitative methods are only a moment in what the cognitive se-
manticist Dirk Geeraerts calls “the empirical cycle” (Geeraerts 2010).7 Empirical data are of two kinds:
observational and experimental. Observational data are collected as they occur naturally, whereas experimental data have to be elicited. Corpus data are observational, not experimental.
The flowchart in Fig. 1.1 is a representation of D. Geeraerts’s empirical cycle, adapted to quantitative
corpus linguistics. The corpus, which appears in green, is but a moment, albeit a central one, of the empirical
cycle. The left part of the chart (where the steps are connected by means of dashed arrows) is iterative. The
cycle involves several rounds of hypothesis formulating/updating, operationalizing, corpus data gathering,
and hypothesis testing. If the empirical testing is satisfactory, the findings feed back into theory (see the right part of the chart), which in turn helps formulate further hypotheses, and so on.
Those who reject corpus linguistics on the basis that it is too objective should notice that nearly all the
blocks of the flowchart involve subjective judgment from the linguist. This also holds for the “test hypotheses” block, insofar as you cannot quantify or run statistics blindly. The
choice of quantitative methods depends largely on what the corpus gives you.
Some linguists adopt a much stronger stance on the place of corpora in their work. They consider that the
grammar of speakers takes the form of a mental corpus.
The idea of grammar being represented as a mental corpus originates from outside corpus linguistics
per se. Cognitive Grammar, an influential usage-based theory of language, defines grammar as a structured
inventory of symbolic linguistic units (Langacker 1987, p. 57). In this theory, grammar is the psychological
representation of linguistic knowledge. A unit is symbolic because its formal component and its semantic
7 In case it is not clear enough, this paper is a must-read if you plan to do corpus-based semantics.