Beyond lexical frequencies: using R for text analysis in the digital humanities

  • Original Paper
  • Language Resources and Evaluation

Abstract

This paper presents a combination of R packages (user-contributed toolkits written in a common core programming language) to facilitate the humanistic investigation of digitised, text-based corpora. Our survey of text analysis packages includes those of our own creation (cleanNLP and fasttextM) as well as packages built by other research groups (stringi, readtext, hyphenatr, quanteda, and hunspell). By operating on generic object types, these packages unite research innovations in corpus linguistics, natural language processing, machine learning, statistics, and digital humanities. We begin by expanding on the theoretical benefits of R as an elaborate gluing language for bringing together several areas of expertise, and compare it to linguistic concordancers and other tool-based approaches to text analysis in the digital humanities. We then showcase the practical benefits of such an ecosystem by illustrating how R packages have been integrated into a digital humanities project. Throughout, the focus is on moving beyond the bag-of-words, lexical frequency model by incorporating linguistically driven analyses in research.

Notes

  1. The environment can be saved so that the research may still be reproduced even after packages are discontinued and R versions have changed; for details on doing this see Ushey et al. (2016).

  2. A press release from Oracle in 2012, http://www.oracle.com/us/corporate/press/1515738, estimates that there were at least 2 million users of R. By metrics such as Google searches, downloads, and blog posts, this number has continued to grow over the past 5 years (Hornik et al. 2017).

  3. Stefan Gries devised an R script, implementing the function exact.matches(), that literally turns R into a concordancer (Gries 2009).

  4. See Ballier (forthcoming).

  5. A full set of code and data for replication can be found at https://github.com/statsmaths/beyond-lexical-frequencies.

  6. In this case, Wikipedia includes an infobox at the top of the page listing the country’s capital city. Our analysis ignores this box, using only the raw text to illustrate how information can be extracted from completely unstructured text.

  7. Interestingly, this was not always in Virginia. When the original string was, for example, ‘Dallas, Texas’, all three locations pointed to the Dallas in Texas regardless of what was tacked onto the end.

  8. In the past year the udpipe package has done an admirable job of extending lemmatisation and dependency parsing to a larger set of target languages (Wijffels 2018).

  9. Mutatis mutandis, this also applies to corpus linguistics, where Stefan Gries has advocated more complex modelling of L1 to investigate L2 production, promoting what he calls MuPDAR (Multifactorial Prediction and Deviation Analysis Using R; Gries and Deshors 2014).

References

  • Allaire, J., Cheng, J., Xie, Y., McPherson, J., Chang, W., Allen, J., Wickham, H., Atkins, A., Hyndman, R., & Arslan, R. (2017). rmarkdown: Dynamic documents for R. R package version 1.6. https://cran.r-project.org/package=rmarkdown.

  • Anthony, L. (2004). AntConc: A learner and classroom friendly, multi-platform corpus analysis toolkit. In Proceedings of IWLeL (pp. 7–13).

  • Anthony, L. (2013). A critical look at software tools in corpus linguistics. Linguistic Research, 30(2), 141–161.

  • Arnold, T., & Benoit, K. (2017). tif: Text interchange format. R package version 0.2. https://github.com/ropensci/tif/.

  • Arnold, T., Lissón, P., & Ballier, N. (2017). fasttextM: Work with bilingual word embeddings. R package version 0.0.1. https://github.com/statsmaths/fasttextM/.

  • Arnold, T. (2017). A tidy data model for natural language processing using cleanNLP. The R Journal, 9(2), 1–20.

  • Arnold, T., & Tilton, L. (2015). Humanities data in R. New York: Springer.

  • Baayen, R. H. (2008). Analyzing linguistic data: A practical introduction to statistics using R. Cambridge: Cambridge University Press.

  • Baglama, J., Reichel, L., & Lewis, B. W. (2017). irlba: Fast truncated singular value decomposition and principal components analysis for large dense and sparse matrices. R package version 2.2.1. https://cran.r-project.org/package=irlba.

  • Ballier, N., & Lissón, P. (2017). R-based strategies for DH in English Linguistics: A case study. In Bockwinkel, P., Declerck, T., Kübler, S., Zinsmeister, H. (eds), Proceedings of the Workshop on Teaching NLP for Digital Humanities, CEUR Workshop Proceedings, Berlin, Germany (Vol. 1918, pp. 1–10). http://ceur-ws.org/Vol-1918/ballier.pdf.

  • Ballier, N. (2016). R, pour un écosystème du traitement des données? L’exemple de la linguistique. In P. Caron (Ed.), Données, Métadonnées des corpus et catalogage des objets en sciences humaines et sociales. Rennes: Presses universitaires de Rennes.

  • Bastian, M., Heymann, S., Jacomy, M., et al. (2009). Gephi: An open source software for exploring and manipulating networks. International Conference on Web and Social Media, 8, 361–362.

  • Becker, R. A., & Chambers, J. M. (1984). S: An interactive environment for data analysis and graphics. Boca Raton: CRC Press.

  • Bécue-Bertaut, M., & Lebart, L. (2018). Analyse textuelle avec R. Rennes: Presses universitaires de Rennes.

  • Benoit, K., & Matsuo, A. (2017). spacyr: R Wrapper to the spaCy NLP Library. R package version 0.9.0. https://cran.r-project.org/package=spacyr.

  • Benoit, K., & Obeng, A. (2017). readtext: Import and handling for plain and formatted text files. R package version 0.50. https://cran.r-project.org/package=readtext.

  • Benoit, K., Watanabe, K., Nulty, P., Obeng, A., Wang, H., Lauderdale, B., & Lowe, W. (2017). Quanteda: Quantitative analysis of textual data. R package version 0.99.9. https://cran.r-project.org/package=quanteda.

  • Berry, D. M. (2011). The computational turn: Thinking about the digital humanities. Culture Machine, 12, 1–22.

  • Bird, S. (2006). NLTK: The natural language toolkit. In Proceedings of the COLING/ACL on interactive presentation sessions, Association for Computational Linguistics (pp. 69–72).

  • Blevins, C., & Mullen, L. (2015). Jane, John ... Leslie? A historical method for algorithmic gender prediction. Digital Humanities Quarterly 9(3).

  • Bradley, J., & Rockwell, G. (1992). Towards new research tools in computer-assisted text analysis. In Canadian Learned Societies Conference.

  • Brezina, V., McEnery, T., & Wattam, S. (2015). Collocations in context: A new perspective on collocation networks. International Journal of Corpus Linguistics, 20(2), 139–173.

  • Camargo, B. V., & Justo, A. M. (2013). Iramuteq: um software gratuito para análise de dados textuais. Temas em Psicologia, 21(2), 513–518.

  • Chang, W., Cheng, J., Allaire, J., Xie, Y., & McPherson, J. (2017). shiny: Web application framework for R. R package version 1.0.4. https://cran.r-project.org/package=shiny.

  • Deschamps, R. (2017). Correspondence analysis for historical research with R. The Programming Historian. https://programminghistorian.org/en/lessons/correspondence-analysis-in-R.

  • Dewar, T. (2016). R basics with tabular data. The Programming Historian. https://programminghistorian.org/en/lessons/r-basics-with-tabular-data.

  • Donaldson, J. (2016). tsne: T-distributed stochastic neighbor embedding for R (t-SNE). R package version 0.1-3. https://cran.r-project.org/package=tsne.

  • Eder, M., Rybicki, J., & Kestemont, M. (2016). Stylometry with R: A package for computational text analysis. R Journal, 8(1), 107–121.

  • Feinerer, I., Hornik, K., & Meyer, D. (2008). Text mining infrastructure in R. Journal of Statistical Software, 25(5), 1–54.

  • Fleury, S., & Zimina, M. (2014). Trameur: A framework for annotated text corpora exploration. In COLING (Demos) (pp. 57–61).

  • Gagolewski, M. (2017). R package stringi: Character string processing facilities. https://cran.r-project.org/package=stringi.

  • Gerdes, K. (2014). Corpus collection and analysis for the linguistic layman: The Gromoteur. http://gromoteur.ilpga.fr/.

  • Goldstone, A., & Underwood, T. (2014). The quiet transformations of literary studies: What thirteen thousand scholars could tell us. New Literary History, 45(3), 359–384.

  • Gries, S. (2009). Quantitative corpus linguistics with R: A practical introduction. London: Routledge.

  • Gries, S. (2013). Statistics for linguistics with R: A practical introduction. Berlin: Walter de Gruyter.

  • Gries, S. T., & Deshors, S. C. (2014). Using regressions to explore deviations between corpus data and a standard/target: Two suggestions. Corpora, 9(1), 109–136.

  • Gries, S. T., & Wulff, S. (2012). Regression analysis in translation studies. Quantitative methods in corpus-based translation studies: A practical guide to descriptive translation research (pp. 35–52). Amsterdam: Benjamins.

  • Grün, B., & Hornik, K. (2011). topicmodels: An R package for fitting topic models. Journal of Statistical Software, 40(13), 1–30. https://doi.org/10.18637/jss.v040.i13.

  • Heiden, S. (2010). The TXM platform: Building open-source textual analysis software compatible with the TEI encoding scheme. In 24th Pacific Asia conference on language, information and computation, Institute for Digital Enhancement of Cognitive Development, Waseda University (pp. 389–398).

  • Honnibal, M., & Johnson, M. (2015). An improved non-monotonic transition system for dependency parsing. In Proceedings of the 2015 conference on empirical methods in natural language processing, Association for Computational Linguistics, Lisbon, Portugal (pp. 1373–1378).

  • Hornik, K. (2016). openNLP: Apache OpenNLP tools interface. R package version 0.2-6. https://cran.r-project.org/package=openNLP.

  • Hornik, K. (2017a). NLP: Natural language processing infrastructure. R package version 0.1-11. https://cran.r-project.org/package=NLP.

  • Hornik, K. (2017b). R FAQ. https://cran.r-project.org/doc/FAQ/R-FAQ.html.

  • Hornik, K., Ligges, U., & Zeileis, A. (2017). Changes on CRAN. The R Journal, 9(1), 505–507.

  • Ihaka, R., & Gentleman, R. (1996). R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5(3), 299–314.

  • Jockers, M. L. (2013). Macroanalysis: Digital methods and literary history. Champaign: University of Illinois Press.

  • Jockers, M. L. (2014). Text analysis with R for students of literature. New York: Springer.

  • Johnson, K. (2008). Quantitative methods in linguistics. London: Wiley.

  • Kahle, D., & Wickham, H. (2013). ggmap: Spatial visualization with ggplot2. The R Journal, 5(1), 144–161.

  • Kilgarriff, A., Baisa, V., Bušta, J., Jakubíček, M., Kovář, V., Michelfeit, J., et al. (2014). The sketch engine: Ten years on. Lexicography, 1(1), 7–36.

  • Klaussner, C., Nerbonne, J., & Çöltekin, Ç. (2015). Finding characteristic features in stylometric analysis. Digital Scholarship in the Humanities, 30(suppl 1), i114–i129.

  • Komen, E. R. (2011). Cesax: Coreference editor for syntactically annotated xml corpora. Reference manual Nijmegen. Nijmegen: Radboud University Nijmegen.

  • Lamalle, C., Martinez, W., Fleury, S., Salem, A., Fracchiolla, B., Kuncova, A., & Maisondieu, A. (2003). Lexico3–outils de statistique textuelle. manuel d’utilisation. SYLED–CLA2T, Université de la Sorbonne nouvelle–Paris 3:48.

  • Lancashire, I., Bradley, J., McCarty, W., Stairs, M., & Wooldridge, T. (1996). Using TACT with electronic texts. New York: MLA.

  • Levine, L. W. (1988). Documenting America (Vol. 2, pp. 1935–1943). Berkeley: University of California Press.

  • Levshina, N. (2015). How to do linguistics with R: Data exploration and statistical analysis. Amsterdam: John Benjamins Publishing Company.

  • Lienou, M., Maitre, H., & Datcu, M. (2010). Semantic annotation of satellite images using latent dirichlet allocation. IEEE Geoscience and Remote Sensing Letters, 7(1), 28–32.

  • Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J. R., Bethard, S., & McClosky, D. (2014). The Stanford CoreNLP natural language processing toolkit. In ACL (system demonstrations) (pp. 55–60).

  • McEnery, T., & Hardie, A. (2011). Corpus linguistics: Method, theory and practice. Cambridge: Cambridge University Press.

  • Michalke, M. (2017). koRpus: An R package for text analysis. (Version 0.10-2). https://cran.r-project.org/package=koRpus.

  • Mimno, D. (2013). mallet: A wrapper around the Java machine learning tool MALLET. R package version 1.0. https://cran.r-project.org/package=mallet.

  • Morton, T., Kottmann, J., Baldridge, J., & Bierner, G. (2005). OpenNLP: A Java-based NLP toolkit. In EACL.

  • O’Donnell, M. (2008). The UAM CorpusTool: Software for corpus annotation and exploration. In Proceedings of the XXVI Congreso de AESLA, Almeria, Spain (pp. 3–5).

  • Ooms, J. (2017). hunspell: High-performance Stemmer, Tokenizer, and spell checker for R. R package version 2.6. https://cran.r-project.org/package=hunspell.

  • O’Sullivan, J., Jakacki, D., & Galvin, M. (2015). Programming in the digital humanities. Digital Scholarship in the Humanities, 30(suppl 1), i142–i147.

  • Peng, R. D. (2011). Reproducible research in computational science. Science, 334(6060), 1226–1227.

  • Rayson, P. (2009). Wmatrix: A web-based corpus processing environment. http://ucrel.lancs.ac.uk/wmatrix/.

  • Rinker, T. W. (2013). qdap: Quantitative discourse analysis package. Buffalo, NY: University at Buffalo/SUNY. 2.2.8.

  • RStudio Team. (2017). RStudio: Integrated development environment for R. Boston, MA: RStudio Inc.

  • Rudis, B., Levien, R., Engelhard, R., Halls, C., Novodvorsky, P., Németh, L., & Buitenhuis, N. (2016). hyphenatr: Tools to Hyphenate Strings Using the ’Hunspell’ Hyphenation Library. R package version 0.3.0. https://cran.r-project.org/package=hyphenatr.

  • Salkie, R. (1995). Intersect: A parallel corpus project at Brighton University. Computers and Texts, 9, 4–5.

  • Schreibman, S., Siemens, R., & Unsworth, J. (2015). A new companion to digital humanities. London: Wiley.

  • Scott, M. (1996). WordSmith tools, Stroud: Lexical analysis software. https://lexically.net/wordsmith/.

  • Siddiqui, N. (2017). Data wrangling and management in R. The Programming Historian. https://programminghistorian.org/en/lessons/data_wrangling_and_management_in_R.

  • Sievert, C., & Shirley, K. (2015). LDAtools: Tools to fit a topic model using Latent Dirichlet Allocation (LDA). R package version 0.1. https://cran.r-project.org/package=LDAtools.

  • Simon, N., Friedman, J., Hastie, T., & Tibshirani, R. (2011). Regularization paths for Cox’s proportional hazards model via coordinate descent. Journal of Statistical Software, 39(5), 1–13.

  • Sinclair, S., Rockwell, G., et al. (2016). Voyant tools. http://voyant-tools.org/. Accessed 4 Sept 2018.

  • Gries, S. T., & Hilpert, M. (2008). The identification of stages in diachronic data: Variability-based neighbour clustering. Corpora, 3(1), 59–81.

  • Underwood, T. (2017). A genealogy of distant reading. Digital Humanities Quarterly. http://digitalhumanities.org/dhq/vol/11/2/000317/000317.html.

  • Ushey, K., McPherson, J., Cheng, J., Atkins, A., & Allaire, J. (2016). packrat: A dependency management system for projects and their R package dependencies. R package version 0.4.8-1. https://cran.r-project.org/package=packrat.

  • Wang, X., & Grimson, E. (2008). Spatial latent Dirichlet allocation. In Advances in neural information processing systems 20 (pp. 1577–1584). Curran Associates, Inc. http://papers.nips.cc/paper/3278-spatial-latent-dirichlet-allocation.pdf.

  • Welbers, K., Van Atteveldt, W., & Benoit, K. (2017). Text analysis in R. Communication Methods and Measures, 11(4), 245–265.

  • Wiedemann, G., & Niekler, A. (2017). Hands-on: A five day text mining course for humanists and social scientists in R. In Proceedings of the 1st workshop teaching NLP for digital humanities.

  • Wiedemann, G. (2016). Text mining for qualitative data analysis in the social sciences. New York: Springer.

  • Wijffels, J. (2018). udpipe: Tokenization, parts of speech tagging, lemmatization and dependency parsing with the ’UDPipe’ ’NLP’ Toolkit. R package version 0.6.1. https://cran.r-project.org/package=udpipe.

  • Wold, S., Esbensen, K., & Geladi, P. (1987). Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 2(1–3), 37–52.

  • Xie, Y. (2014). knitr: A comprehensive tool for reproducible research in R. In Stodden, V., Leisch, F., & Peng, R. D. (eds), Implementing reproducible computational research. Chapman and Hall/CRC. ISBN: 978-1466561595.

Author information

Corresponding author

Correspondence to Taylor Arnold.

About this article

Cite this article

Arnold, T., Ballier, N., Lissón, P. et al. Beyond lexical frequencies: using R for text analysis in the digital humanities. Lang Resources & Evaluation 53, 707–733 (2019). https://doi.org/10.1007/s10579-019-09456-6

  • DOI: https://doi.org/10.1007/s10579-019-09456-6
