Quantitative Methods in the Humanities
and Social Sciences
Guillaume Desagulier
Corpus Linguistics
and Statistics
with R
Introduction to Quantitative Methods in
Linguistics
Quantitative Methods in the Humanities and Social Sciences
Editorial Board
Thomas DeFanti, Anthony Grafton, Thomas E. Levy, Lev Manovich,
Alyn Rockwood
Quantitative Methods in the Humanities and Social Sciences is a book series designed to foster
research-based conversation with all parts of the university campus from buildings of ivy-covered
stone to technologically savvy walls of glass. Scholarship from international researchers and the
esteemed editorial board represents the far-reaching applications of computational analysis, statis-
tical models, computer-based programs, and other quantitative methods. Methods are integrated in
a dialogue that is sensitive to the broader context of humanistic study and social science research.
Scholars, including among others historians, archaeologists, classicists and linguists, promote this
interdisciplinary approach. These texts teach new methodological approaches for contemporary
research. Each volume exposes readers to a particular research method. Researchers and students
then benefit from exposure to subtleties of the larger project or corpus of work in which the quan-
titative methods come to fruition.
Guillaume Desagulier
Université Paris 8
Saint Denis, France
Preface
In the summer of 2008, I gave a talk at an international conference in Brighton. The talk was about constructions involving multiple hedging in American English (e.g., I’m gonna have to ask you to + VP). I remember
this talk because even though I had every reason to be happy (the audience showed sustained interest and a
major linguist in my field gave me positive feedback), I remember feeling a pang of dissatisfaction. Because
my research was mostly theoretical at the time, I had concluded my presentation with the phrase “pending
empirical validation” one too many times. Of course, I had used examples gleaned from the renowned Cor-
pus of Contemporary American English, but my sampling was not exhaustive and certainly biased. Even
though I felt I had answered my research questions, I had provided no quantitative summary. I went home
convinced that it was time to buttress my research with corpus data. I craved a better understanding of
corpora, their constitution, their assets, and their limits. I also wished to extract the data I needed the way I
wanted, beyond what traditional, prepackaged corpus tools have to offer.
I soon realized that the kind of corpus linguistics that I was looking for was technically demanding,
especially for the armchair linguist that I was. In the summer of 2010, my lab gave me the opportunity to attend a one-week boot camp in Texas whose instructor, Stefan Th. Gries (University of California, Santa Barbara), had just
published Quantitative Corpus Linguistics with R. This boot camp was a career-changing opportunity. I
went on to teach myself more elaborate corpus-linguistics techniques as well as the kinds of statistics that
linguists generally have other colleagues do for them. This led me to collaborate with great people outside
my field, such as mathematicians, computer engineers, and experimental linguists. While doing so, I never
left aside my research in theoretical linguistics. I can say that acquiring empirical skills has made me a better
theoretical linguist.
If the above lines echo your own experience, this book is perfect for you. While written for a readership
with little or no background in corpus linguistics, computer programming, or statistics, Corpus Linguistics
and Statistics with R will also appeal to readers with more experience in these fields. Indeed, while presenting
in detail the text-mining apparatus used in traditional corpus linguistics (frequency lists, concordance tables,
collocations, etc.), the text also introduces the reader to some appealing techniques that I wish I had become
acquainted with much earlier in my career (motion charts, word clouds, network graphs, etc.).
Goals
This is a book on empirical linguistics written from a theoretical linguist’s perspective. It provides both
a theoretical discussion of what quantitative corpus linguistics entails and detailed, hands-on, step-by-step
instructions to implement the techniques in the field.
A Note to Instructors
I have written this book so that instructors feel free to teach the chapters individually in the order that they
want. For a one-semester course, the emphasis can be on either:
• Methods in corpus linguistics (Part I)
• Statistics for corpus linguistics (Part II)
In either case, I would recommend always including Chaps. 1 and 2 to make sure that the students are
familiar with the core concepts of corpus-based research and R programming.
Note that it is also possible to include all the chapters in a one-semester course, as I do on a regular
basis. Be aware, however, that this leaves the students less time to apply the techniques to their own research
projects. Experience tells me that the students need sufficient time to work through all of the earlier chapters,
as well as to become accustomed to the statistical analysis material.
Supplementary Materials
Throughout the chapters, readers will make extensive use of datasets and code. These materials are available
from the book’s Springer Extras website:
http://extras.springer.com/2017/978-3-319-64570-4
I recommend that you check the repository on a regular basis for updates.
Acknowledgments
Writing this book would not have been possible without the help and support of my family, friends, and
colleagues. Thanks go to Fatima, Idris, and Hanaé for their patience. In particular, I would like to thank
those who agreed to proofread and test-drive early drafts in real-life conditions. Special thanks go to Antoine
Chambaz for his insightful comments on Chap. 8.
Contents
1 Introduction
  1.1 From Introspective to Corpus-Informed Judgments
  1.2 Looking for Corpus Linguistics
    1.2.1 What Counts as a Corpus
    1.2.2 What Linguists Do with the Corpus
    1.2.3 How Central the Corpus Is to a Linguist’s Work
  References
  2.11.1 if Statements
  2.11.2 if...else Statements
  2.12 Cleanup
  2.13 Common Mistakes and How to Avoid Them
  2.14 Further Reading
  Exercises
  References
3 Digital Corpora
  3.1 A Short Typology
  3.2 Corpus Compilation: Kennedy’s Five Steps
  3.3 Unannotated Corpora
    3.3.1 Collecting Textual Data
    3.3.2 Character Encoding Issues
    3.3.3 Creating an Unannotated Corpus
  3.4 Annotated Corpora
    3.4.1 Markup
    3.4.2 POS-Tagging
    3.4.3 POS-Tagging in R
    3.4.4 Semantic Tagging
  3.5 Obtaining Corpora
  Exercise
  References
A Appendix
  A.1 Chapter 6
    A.1.1 Dispersion Plots
  A.2 Chapter 8
    A.2.1 Contingency Table
    A.2.2 Discrete Probability Distributions
    A.2.3 A χ2 Distribution Table
B Bibliography
Solutions
Index
Chapter 1
Introduction
Abstract In this chapter, I explain the theoretical relevance of corpora. I answer three questions: what counts as a corpus? What do linguists do with the corpus? What status does the corpus have in the linguist’s approach to language?
1.1 From Introspective to Corpus-Informed Judgments
Linguists investigate what it means to know and speak a language. Despite having the same objective,
linguists have their theoretical preferences, depending on what they believe is the true essence of language.
Supporters of transformational-generative grammar (Chomsky 1957, 1962, 1995) claim that the core of
grammar consists of a finite set of abstract, algebraic rules. Because this core is assumed to be common to
all the natural languages in the world, it is considered a universal grammar. The idiosyncrasies of language
are relegated to the periphery of grammar. Such is the case of the lexicon, context, elements of inter-speaker
variation, cultural connotations, mannerisms, non-standard usage, etc.
In generative grammar, pride of place is given to syntax, i.e. the way in which words are combined to
form larger constituents such as phrases, clauses, and sentences. Syntax hinges on an opposition between
deep structure and surface structure. The deep structure is the abstract syntactic representation of a sentence,
whereas the surface structure is the syntactic realization of the sentence as a string of words in speech.
For example, the sentence an angry cow injured a farmer with an axe has one surface structure (i.e. one
realization as an ordered string of words) but two alternative interpretations at the level of the deep structure:
(a) an angry cow injured a farmer who had an axe in his hands, and (b) an angry cow used an axe to injure a
farmer. In other words, a sentence is generated from the deep structure down to the surface structure.
More generally, generative grammar is a “top-down” approach to language: a limited set of abstract rules
“at the top” is enough to generate and account for an infinite number of sentences “at the bottom”. What gen-
erative linguists truly look for is the finite set of rules that core grammar consists of, in other words speakers’
competence, as opposed to speakers’ performance. This is to the detriment of idiosyncrasies of all kinds.
Conversely, theories such as functional linguistics (Dik 1978, 1997; Givón 1995), cognitive linguistics
(Langacker 1987, 1991; Goldberg 1995), and contemporary typology (Greenberg 1963) advocate a “bottom-
up” approach to language. It is usage that shapes the structure of language (Bybee 2006, 2010; Langacker
1988). Grammar is therefore derivative, not generative. There is no longer any point in separating competence and performance, because competence builds on performance. In the same vein, grammar has no
core or periphery: it is a structured inventory of symbolic units (Goldberg 2003). Therefore, any linguistic
unit is worth being taken into consideration with equal attention: morphemes (un-, -ness), words or phrases
(corpus linguistics), ritualized or formulaic expressions (break a leg!), idioms (he snuffed it), non-canonical
phrasal expressions (sight unseen), semi-schematic expressions (e.g. just because X doesn’t mean Y), and
fully schematic expressions (e.g. the ditransitive construction).
Like biologists, who apprehend life indirectly, i.e. by studying the structure, function, growth, evolution,
distribution, and taxonomy of living cells and organisms, linguists apprehend language through its manifes-
tations. For this reason, all linguistic theories rely on native speakers acting as informants to provide data.
On the basis of such data, linguists can formulate hypotheses and test them.
Generative linguists are known to rely on introspective judgments as their primary source of data. This is
a likely legacy of de Saussure, the father of modern linguistics, who delimited the object of study (langue)
as a structured system independent from context and typical of an ideal speaker. However, the method can
be deemed faulty on at least two accounts. First, there is no guarantee that linguists’ introspective accept-
ability judgments always match what systematically collected data would reveal (Sampson 2001; Wasow
and Arnold 2005). Second, for an intuition of well-formedness to be valid, it should at least be formulated
by a linguist who is a native speaker of the language under study.1 If maintained, this constraint limits
the scope of a linguist’s work and invalidates a significant proportion of the research published worldwide,
judging from the number of papers written by linguists who are not fluent speakers of the languages they
investigate, including among generativists. This radical position is hardly sustainable in practice. Some
generativists working on early language acquisition have used real performance data to better understand the
development of linguistic competence.2
Because of its emphasis on language use in all its complexity, the “bottom-up” approach provides fertile
ground for corpus-informed judgments. It is among its ranks that linguists, dissatisfied with the practice
of using themselves as informants, have turned to corpora, judging them to be a far better source than
introspection to test their hypotheses. Admittedly, corpora have their limitations. The most frequent criticism
that generative linguists level against corpus linguists is that no corpus can ever provide negative evidence.
In other words, no corpus can indicate whether a given sentence is impossible.3 The first response to this
critique is simple: grammar rules are generalizations over actual usage and negative evidence is of little
import. The second response is that there are statistical methods that can either handle the non-occurrence
of a form in a corpus or estimate the probability of occurrence of a yet unseen unit. A more serious criticism
is the following: no corpus, however large and balanced, can ever hope to be representative of a speaker,
let alone of a language. Insofar as this criticism has to do with the nature and function of corpora, I address
it in Chap. 3.
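The second response can be sketched in R, the language used throughout this book. The snippet below is a minimal illustration of the zero-frequency idea, assuming a toy frequency list and the simple Good–Turing estimate, under which the probability mass of unseen types is approximated by N1/N, the share of hapax legomena among all tokens:

```r
# Toy frequency list: counts of word types in a small corpus sample
freqs <- c(the = 102, of = 51, corpus = 7, hapax1 = 1, hapax2 = 1, hapax3 = 1)

N  <- sum(freqs)        # total number of tokens in the sample
N1 <- sum(freqs == 1)   # number of types seen exactly once (hapax legomena)

# Simple Good-Turing estimate of the total probability mass
# reserved for types that never occur in the corpus
p_unseen <- N1 / N
p_unseen
```

This is a sketch, not a full smoothing method: real applications (e.g. language modeling) refine the estimate and redistribute the reserved mass across candidate unseen units.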
Linguists make use of corpus-informed judgment because (a) they believe their work is based on a psy-
chologically realistic view of grammar (as opposed to an ideal view of grammar), and (b) such a view can be
operationalized via corpus data. These assumptions underlie this book. What remains to be shown is what
counts as corpus data.
1 In fact, Labov (1975) shows that even native speakers do not know how they speak.
2 See McEnery and Hardie (2012, p. 168) for a list of references.
3 Initially, generativists claimed that the absence of negative evidence from children’s linguistic experience was an argument in favor of the innateness of grammatical knowledge. This claim is now used beyond language acquisition, against corpus linguistics.
1.2 Looking for Corpus Linguistics
Defining corpus linguistics is no easy task. It typically fills up entire textbooks such as Biber et al. (1998),
Kennedy (1998), McEnery and Hardie (2012), and Meyer (2002). Because corpus linguistics is changing
fast, I believe a discussion of what it is is more profitable than an elaborate definition that might soon be
outdated. The discussion that follows hinges on three questions:
• what counts as a corpus?
• what do linguists do with the corpus?
• what status does the corpus have in the linguist’s approach to language?
Each of the above questions has material, technical, and theoretical implications, and offers no straightfor-
ward answer.
1.2.1 What Counts as a Corpus
A corpus (plural corpora) is a body of material (textual, graphic, audio, and/or video) upon which some analysis is based. Several disciplines make use of corpora: linguistics of course, but also literature, philosophy,
art, and science. A corpus is not just a collection of linguistically relevant material. For that collection to
count as a corpus, it has to meet a number of criteria: sampling, balance, representativeness, comparability,
and naturalness.
1.2.1.1 Sampling
A corpus is a finite sample of genuine linguistic productions by native speakers. Even a monitor corpus such
as The Bank of English, which grows over time, has a fixed size at the moment when the linguist taps into it.
Usually, a corpus has several parts characterized by mode (spoken, written), genre (e.g. novel, newspaper), or period (e.g. 1990–1995). These parts are themselves sampled from a number of theoretically infinite sources.
Sampling should not be seen as a shortcoming. In fact, it is sampling that allows corpus linguistics to be very
ambitious. Like astrophysics, which infers knowledge about the properties of the universe and beyond from
the study of an infinitesimal portion of it, corpus linguists use finite portions of language use in the hope that
they will reveal the laws of a given language, or some aspect of it. A corpus is therefore a sample of samples
or, more precisely, a representative and balanced sample of representative and balanced samples.
1.2.1.2 Representativeness
A corpus is representative when its sampling scheme is faithful to the variability that characterizes the target
language. Suppose you want to study the French spoken by Parisian children. The corpus you will design
for this study will not be representative if it consists only of conversations with peers. To be representative,
the corpus will have to include a sizeable portion of conversations with other people, such as parents, school
teachers, and other educators.
Biber (1993, p. 244) writes: “[r]epresentativeness refers to the extent to which a sample includes the full
range of variability in a population.” Variability is a function of situational and linguistic parameters. Situa-
tional parameters include mode (spoken vs. written), format (published, unpublished), setting (institutional,
public, private, etc.), addressee (present, absent, single, interactive, etc.), author (gender, age, occupation,
etc.), factuality (informational, imaginative, etc.), purposes (information, instruction, entertainment, etc.), or
topics (religion, politics, education, etc.).
Linguistic parameters focus on the distribution of language-relevant features in a language. Biber (1993,
p. 249) lists ten grammatical classes that are used in most variation studies (e.g. nouns, pronouns, preposi-
tions, passives, contractions, or WH-clauses). Each class has a distinctive distribution across text categories,
which can be used to guarantee a representative selection of text types. For example, pronouns and contrac-
tions are interactive and typically occur in texts with a communicative function. In contrast, WH-clauses
require structural elaboration, typical of informative texts.
Distribution is a matter of numerical frequencies. If you are interested in the verbal interaction between
waiters and customers in restaurants, unhedged suggestions such as “do you want fries with that?” will
be overrepresented if you include 90% of conversations from fast-food restaurants. Conversely, forms of
hedged suggestion such as “may I suggest. . . ?” are likely to be underrepresented. To guarantee the repre-
sentation of the full range of linguistic variation existing in a specific dialect/register/context, distributional
considerations such as the number of words per text and the number of texts per text type must be taken into account.
Corpus compilers use sampling methodologies for the inclusion of texts in a corpus. Sampling techniques
based on sociological research make it possible to obtain relative proportions of strata in a population thanks
to demographically accurate samples. Crowdy (1993) is an interesting illustration of the demographic sampling method used to compile the spoken component of the British National Corpus (2007). To fully represent the regional variation of British English, the United Kingdom was divided into three supra-regions (the North, the Midlands, and the South) and twelve areas (five for the North,
three for the Midlands, and four for the South). Yet, as Biber (1993, p. 247) points out, the demographic
representativeness of corpus design is not as important to linguists as the linguistic representativeness.
1.2.1.3 Balance
A corpus is said to be balanced when the proportion of the sampled elements that make it representative
corresponds to the proportion of the same elements in the target language. Conversely, an imbalanced
corpus introduces skews in the data. Like representativeness, the balance of a corpus also depends on the
sampling scheme.
Corpus-linguistics textbooks frequently present the Brown Corpus (Francis and Kučera 1979) and its
British counterpart, the Lancaster-Oslo-Bergen corpus (Johansson et al. 1978) as paragons of balanced cor-
pora. Each attempts to provide a representative and balanced collection of written English in the early
1960s. For example, the compilers of the Brown Corpus made sure that all the genres and subgenres of the
collection of books and periodicals in the Brown University Library and the Providence Athenaeum were
represented. Balance was achieved by choosing the number of text samples to be included in each category.
By way of example, because there were about thirteen times as many books in learned and scientific writing
as in science fiction, 80 texts of the former genre were included, and only 6 of the latter genre.
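The proportional allocation logic can be sketched in R. The holding counts below are illustrative round numbers, not the actual Brown figures; only the 500-text total matches the Brown Corpus design:

```r
# Illustrative library holdings per genre (round numbers, not the actual Brown figures)
holdings <- c(learned = 1300, science_fiction = 100)

# Each genre's share of the sampling frame
shares <- holdings / sum(holdings)

# Allocate a 500-text corpus proportionally to those shares
allocation <- round(shares * 500)
allocation
```

The same ratio-driven reasoning scales to any number of genres: compute each genre’s share of the sampling frame, then multiply by the target corpus size.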
Because the Brown Corpus and the LOB corpus are snapshot corpora, targeting a well-delimited mode in a well-delimited context, achieving balance was fairly easy. Compiling a reference corpus with spoken data is far more difficult, for two reasons. First, because obtaining these resources supposes that informants are willing to have their conversations recorded, spoken data take time to collect and transcribe, and they are more expensive. Second, you must have an exact idea of what proportion each mode, genre, or subgenre represents in the
target language. If you know that 10% of the speech of Parisian children consists of monologue, your corpus
should consist of about 10% of monologue recordings. But as you may have guessed, corpus linguists have
no way of knowing these proportions. They are, at best, educated estimates.
1.2.1.4 An Ideal
Sampling methods imply a compromise between what is theoretically desirable and what is feasible. The
abovementioned criteria are therefore more of an ideal than an attainable goal (Leech 2007).
Although planned and designed as a representative and balanced corpus, the BNC is far from meeting this
ideal. Natural languages are primarily spoken. Yet, 90% of the BNC consists of written texts. However, as
Gries (2009) points out, a salient written expression may have a bigger impact on speakers’ linguistic systems
than a thousand words of conversation. Furthermore, as Biber (1993) points out, linguistic representativeness
is far more important than the representativeness of the mode.
The above paragraphs have shown that a lot of thinking goes on before a corpus is compiled. However,
the theoretical status of a corpus should not blind us to the fact that we first apprehend a corpus through
the materiality of its database, or lack thereof. That materiality is often complex. The original material of a
linguistic corpus generally consists of one or several of the following:
• audio recordings;
• audio-video recordings;
• text material.
This material may be stored in a variety of ways. For example, audio components may appear in the form of reel-to-reel magnetic tapes, audio tapes, or CDs, as was once the case. Nowadays, they are stored
as digital files on the web. The datasets of the CHILDES database are available in CHAT, a standardised
encoding format. The files, which are time aligned and audio linked, are to be processed with CLAN
(http://childes.psy.cmu.edu/). In general, multimodal components include both the original
audio/video files and, optionally, a written transcription to allow for quantification.
As a corpus linguist working mainly on the English language, I cannot help paying tribute to the first
“large-scale” corpus of English: the Survey of English Usage corpus (also known as the Survey Corpus). It
was initiated by Randolph Quirk in 1959 and took thirty years to complete. The written component consists
of 200 texts of 5000 words each, amounting to one million words of British English produced between 1955
and 1985. The Survey Corpus also has a spoken component in the form of now digitized tape recordings of
monologues and dialogues. The corpus was originally compiled from magnetic tape recordings, transcribed by hand, and typed on thousands of 6-by-4-inch paper slips stored in a filing cabinet. Each lexical item was annotated for grammatical features and stored likewise. For example, all verb phrases are filed in the verb
phrase section. The spoken component is also transcribed and annotated for prosodic features. It still exists
under the name of the London-Lund Corpus.
The Survey Corpus was not computerized until the early 1980s.4 With a constant decrease in the price
of disk storage and an ever increasing offer of tools to process an ever increasing amount of electronic data,
all corpus projects are now assumed to be digital from stage one. For this reason, when my students fail to
bring their laptop to their first corpus linguistics class, I suggest they take a trip to University College London
and browse manually through the many thousands of slips filed in the cabinets of the Survey Corpus. Although the students are always full of admiration for the work of corpus linguistics pioneers (Randolph Quirk, Sidney Greenbaum, Geoffrey Leech, Jan Svartvik, David Crystal, etc.), they never forget their laptops afterwards.
Most contemporary corpora in the form of digital texts (from either written or transcribed spoken sources)
are very large. The size of a corpus is relative. When it first came out, in the mid 1990s, the BNC was
considered a very big corpus as it consisted of 100 million words. Compared to the Corpus of Contemporary
American English (450 million words), the Bank of English (45 billion words), the ukWaC corpus of English
(2.25 billion words),5 or Sketch Engine’s enTenTen12 corpus (13 billion words), the BNC does not seem
that large anymore.
What with the availability of digital material and a rich offer of automatic annotation and markup tools,
we are currently witnessing an arms race in terms of corpus size. This is evidenced by the increasing number
of corpora that are compiled from the web (Baroni et al. 2009). On the one hand, a large corpus guarantees
the presence of rare linguistic forms. This is no trifling matter insofar as the distribution of linguistic units
in language follows a Zipfian distribution: there is a large number of rare units. On the other hand, a very
large corpus loses in terms of representativeness and balance, unless huge means are deployed to compile it
with respect to the abovementioned corpus ideal (which, in most cases, has an unbearable cost). Finally, no
corpus extraction goes without noise (i.e. unwanted data that is hard to filter out in a query). The larger the
corpus, the more the noise.
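The Zipfian pattern mentioned above is easy to observe on any text: when word types are ranked by decreasing frequency, a handful of high-frequency types accounts for most tokens, while many types occur only once (hapax legomena). A minimal sketch in Python; the toy text is invented purely for illustration:

```python
from collections import Counter

# Toy text, invented for illustration.
text = ("the cat sat on the mat and the dog sat near the cat "
        "while a bird watched the dog and the mat")
freqs = Counter(text.split())

# Rank the word types by decreasing frequency.
for rank, (word, freq) in enumerate(freqs.most_common(), start=1):
    print(rank, word, freq)

# Many types occur only once (hapax legomena).
hapaxes = [w for w, f in freqs.items() if f == 1]
print(len(hapaxes), "hapaxes out of", len(freqs), "types")  # 6 hapaxes out of 12 types
```

Even in this 22-token sample, half of the types are hapaxes, which is why rare forms only become reliably observable in very large corpora.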
Depending on your case study and the language you investigate, using a small corpus is not a bad thing.
For example, Ghadessy et al. (2001) show that small corpora can go a long way in English language teaching.
Unless you study a macrolanguage (e.g. Mandarin, Spanish, English, or Hindi), it is likely that you will not be able to find a ready-made corpus. The problem is even more acute for endangered or minority languages (e.g. Breton in France, or Amerindian languages in the Americas), for which you only have a sparse collection of texts and sometimes occasional recordings at your disposal. Once compiled into a corpus, those scarce resources
can serve as the basis of major findings, such as Boas and Schuchard (2012) in Texas German, or Hollmann
and Siewierska (2007) in Lancashire dialects.
Small size becomes a problem if the unit you are interested in is not well represented. All in all, size matters, but a small corpus used wisely is worth more than a big corpus used unwisely.
Having a corpus at one’s disposal is a necessary but insufficient condition for corpus linguistics. Outside
linguistics, other disciplines in the humanities make use of large-scale collections of texts.
4 Transcriptions of the spoken component of the Survey Corpus were digitized in the form of the London-Lund Corpus.
5 Ferraresi (2007).
What they generally do with those texts is known as text mining. Corpus linguists may use text-mining techniques at some
point (for instance when they make frequency lists). However, the goal of text mining techniques is to obtain
knowledge about a text or group of texts (Jockers 2014), whereas the goal of corpus-linguistics techniques
is to better understand the rules governing a language as a whole, or at least some aspect of that language
(e.g. a specific register or dialect).
1.2.2.1 Generalization
You do not suddenly become a corpus linguist by running a query in a digital text database in search of a
savory example.6 You investigate a corpus because you believe your findings can be extended to the target
language or variety. This is called generalization.
Generalization implies a leap of faith from what we can infer from a finite set of observable data to the
laws of the target language. Outside corpus linguistics, not everyone agrees. In an interview (Andor 2004,
p. 97), Chomsky states that collecting huge amounts of data in the hope of coming up with generalizations
is unique in science. He makes it sound like it is a pipe dream. To reformulate Chomsky’s claim, a corpus
should not be a basis for linguistic studies if it cannot represent language in its full richness and complexity.
Most corpus linguists rightly counter that they do not aim to explain all of a language in every study (Glynn
2010) and that the limits of their generalizations are the limits of the corpus. Furthermore, even if no corpus
can provide access to the true, unknown law of a language, a corpus can be considered a sample drawn from
this law. As you will discover in the second part of the book, there are ways of bridging the gap between
what we can observe and measure from a corpus, and what we do not know about a language. Achieving this requires ambitious statistics, specifically the kind of inferential statistics used in biostatistics, where scientists routinely bridge the gap between what they can observe in a group of patients and what they can infer about a disease, contra Chomsky’s intuition.
1.2.2.2 Quantification
Corpus linguistics is quantitative when the study of linguistic phenomena based on corpora is systematic
and exhaustive. Gries (2014, p. 365) argues that corpus linguistics in the usage-based sense of the word
is a distributional science. A distributional science infers knowledge from the distribution, dispersion, and
co-occurrence of data. For this reason, quantitative corpus linguists typically focus on corpus frequencies
by means of frequency lists, concordances, measures of dispersion, and co-occurrence frequencies.
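Of the measures just listed, co-occurrence frequency lends itself to a very short sketch: count how often two word forms appear within a fixed window of each other. The window size and sample phrase below are arbitrary choices for illustration:

```python
from collections import Counter

def cooccurrences(tokens, window=2):
    """Count unordered pairs of word forms occurring within `window` tokens of each other."""
    pairs = Counter()
    for i, w in enumerate(tokens):
        # Look only forward; sorting the pair makes it order-independent.
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            pairs[tuple(sorted((w, tokens[j])))] += 1
    return pairs

tokens = "strong tea and strong coffee but powerful arguments".split()
pairs = cooccurrences(tokens)
print(pairs[("strong", "tea")])  # 2: the pair falls inside the window twice
```

Real co-occurrence studies would add association measures on top of these raw counts, but the counting step itself is no more complicated than this.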
All linguists aim at some form of generalization, but not all of them engage in quantification to meet this aim. In other words, a corpus lends itself to both qualitative and quantitative use. Qualitative
corpus analysis may consist in formulating a hypothesis and testing it by looking for potential counterex-
amples in the corpus. It may also consist in using the corpus to refine the hypothesis at some stage. The
concordancer is a corpus tool par excellence meant for qualitative inspection. It is used to examine a word
in its co-text regardless of any form of quantification.
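The core operation of a concordancer, the Key Word In Context (KWIC) display, can itself be sketched in a few lines: locate each occurrence of a node word and align it with a fixed span of co-text on either side. The sample sentence and span size below are invented for illustration:

```python
def kwic(tokens, node, span=3):
    """Return each occurrence of `node` with `span` words of co-text on each side."""
    lines = []
    for i, w in enumerate(tokens):
        if w == node:
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            lines.append(f"{left} [{w}] {right}")
    return lines

tokens = "time flies like an arrow and fruit flies like a banana".split()
for line in kwic(tokens, "flies"):
    print(line)
# time [flies] like an arrow
# arrow and fruit [flies] like a banana
```

Note that no quantification is involved: the output is simply laid out for the linguist's qualitative inspection of the node word in its co-text.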
An oft-heard misconception about corpus linguistics is that its quantitative methods are too objective and miss information that only the linguist’s expert subjectivity can bring.
Nothing is further from the truth. First, you should never start exploring a corpus, let alone quantify your
findings, if you do not have a research question, a research hypothesis, and a clear idea as to how the corpus
will help you answer the question and test the hypothesis. Second, you should never consider the quantified
corpus findings as the final stage in your research project. Uninterpreted results are useless because they do
not speak for themselves: “quantification in empirical research is not about quantification, but about data
management and hypothesis testing” (Geeraerts 2010).
Some linguists believe that using a corpus is but one of several steps in a research project (although a
significant one). For example, a linguist might generate a working hypothesis regarding a linguistic form
and then decide to test that hypothesis by observing the behavior of that linguistic form in a corpus and by
generalizing over the findings. Other linguists adopt a more radical stance. They claim that the corpus is the
only possible approximation we have of a speaker’s linguistic competence. It is therefore a linguist’s job to
investigate corpora.
To most corpus linguists, corpora and quantitative methods are only a moment in what the cognitive se-
manticist Dirk Geeraerts calls “the empirical cycle” (Geeraerts 2010).7 Empirical data are of two kinds:
observational and experimental. Observational data are collected as they occur naturally, whereas experimental data have to be elicited. Corpus data are observational, not experimental.
The flowchart in Fig. 1.1 is a representation of D. Geeraerts’s empirical cycle, adapted to quantitative
corpus linguistics. The corpus, which appears in green, is but a moment, albeit a central one, of the empirical
cycle. The left part of the chart (where the steps are connected by means of dashed arrows) is iterative. The
cycle involves several rounds of hypothesis formulating/updating, operationalizing, corpus data gathering,
and hypothesis testing. If the empirical testing is satisfactory, the findings feed back into theory (see the right part of the chart), which in turn helps formulate further hypotheses, and so on.
Those who reject corpus linguistics on the basis that it is too objective should notice that nearly all the
blocks of the flowchart involve subjective judgment from the linguist. This also holds for the “test hypotheses” block, insofar as you cannot quantify or run statistics blindly. The
choice of quantitative methods depends largely on what the corpus gives you.
Some linguists adopt a much stronger stance on the place of corpora in their work. They consider that the
grammar of speakers takes the form of a mental corpus.
The idea of grammar being represented as a mental corpus originates from outside corpus linguistics
per se. Cognitive Grammar, an influential usage-based theory of language, defines grammar as a structured
inventory of symbolic linguistic units (Langacker 1987, p. 57). In this theory, grammar is the psychological
representation of linguistic knowledge. A unit is symbolic because its formal component and its semantic
7 In case it is not clear enough, this paper is a must-read if you plan to do corpus-based semantics.