Abstract
Software tools are vital to corpus-based research, but they can also restrict the types of corpora supported and the range of analyses that can be performed. For example, corpus analysis tools, as general-purpose software, do not include specific features for processing corpora of theatre plays. The situation is even worse for parallel corpora of theatrical texts, since there is currently a lack of software that allows for both the alignment and the analysis of such corpora. In this contribution, we first outline the peculiarities of theatre texts and suggest three software features to address them: annotation of the structural units of plays, alignment at the utterance level, and concordances and statistics based on the annotated units. Second, we present the specific functionalities of TAligner and ACM for building and analysing parallel corpora of play texts, showing how these tools open up new avenues of research.
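To make the three proposed features more concrete, the following is a minimal Python sketch. It is not the implementation of TAligner or ACM-theatre. It assumes that both texts are already segmented into utterances annotated with act and speaker, and that source and target contain the same number of utterances in the same order (a simplification, since translations of plays may add, omit or merge utterances). The names `Utterance`, `align_utterances` and `concordance`, as well as the example data, are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Utterance:
    """One annotated structural unit: a character's turn within an act."""
    act: int
    speaker: str
    text: str


Pair = Tuple[Utterance, Utterance]


def align_utterances(source: List[Utterance], target: List[Utterance]) -> List[Pair]:
    """Pair utterances by position (assumption: strict 1:1 correspondence)."""
    if len(source) != len(target):
        raise ValueError("this sketch only handles 1:1 utterance alignment")
    return list(zip(source, target))


def concordance(pairs: List[Pair], keyword: str,
                speaker: Optional[str] = None) -> List[Pair]:
    """Return aligned pairs whose source text contains `keyword`,
    optionally restricted to utterances by one speaker."""
    needle = keyword.lower()
    return [
        (src, tgt) for src, tgt in pairs
        if needle in src.text.lower()
        and (speaker is None or src.speaker == speaker)
    ]


# Invented example data, for illustration only.
source = [
    Utterance(1, "ANNA", "Where is the letter?"),
    Utterance(1, "BRUNO", "I burned the letter."),
    Utterance(1, "ANNA", "Then we are lost."),
]
target = [
    Utterance(1, "ANNA", "¿Dónde está la carta?"),
    Utterance(1, "BRUNO", "Quemé la carta."),
    Utterance(1, "ANNA", "Entonces estamos perdidos."),
]

pairs = align_utterances(source, target)
hits = concordance(pairs, "letter", speaker="BRUNO")
for src, tgt in hits:
    print(f"{src.speaker}: {src.text}  |  {tgt.speaker}: {tgt.text}")
```

Because the speaker is stored on each unit, the same query can be run for the whole play or for a single character, which is the kind of restriction a general-purpose concordancer does not offer without structural annotation.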
Data availability
Not applicable.
Code availability
Software application.
Notes
Following Xiao and Yue (2009, p. 240), parallel corpora are understood as sets of source texts aligned with their translations. From the perspective of corpus tools, parallel corpora are more complex than monolingual or comparable ones, since issues such as alignment need to be considered (Sanjurjo-González, 2018, p. 25). Although the focus of this contribution is on parallel corpora, the adjustments to tools for structural annotation and specific analytical functions may also be useful for monolingual corpora.
TAligner can be accessed at https://addi.ehu.es/handle/10810/42445.
Apart from theatre plays, film and TV scripts also belong to the dramatic field (Esslin, 1990, p. 31) and share these structural peculiarities. The analysis of these text types could therefore also benefit from the advances in analytical tools suggested here.
Theatre-specific features are marked with asterisks.
Evert (2014) points out that “cwb-align isn't a particularly sophisticated sentence aligner, so it's likely to get some cases wrong”.
Line breaks might be useful for alignment; however, an utterance can span more than one paragraph, and the use of line breaks in plays may be inconsistent. Moreover, the features of AntPConc fall far short of those of the tool's monolingual version; AntPConc can be considered a simple parallel concordancer.
The application offers frequency lists, but so far they take texts as wholes (Sanz-Villar & Andaluz-Pinedo, 2021).
Registration of ACTRES Corpus Manager is pending.
A recently developed tool for custom annotation, OpenTagger (Sanjurjo-González & Andaluz-Pinedo, 2020), is planned to be integrated into ACM-theatre in the future.
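The notes above on line breaks and on whole-text frequency lists can be illustrated with a short Python sketch. It does not describe how TAligner or ACM-theatre work internally. It assumes a hypothetical plain-text layout in which every utterance opens with an upper-case speaker label followed by a colon, and the function names are invented for illustration. Segmenting on speaker labels rather than on line breaks keeps multi-paragraph utterances whole, and the resulting units allow word frequencies to be counted per character rather than per text.

```python
import re
from collections import Counter, defaultdict

# Assumed convention: an utterance starts with an upper-case label and a colon.
SPEAKER_LINE = re.compile(r"^([A-Z][A-Z .'-]*):\s*(.*)$")


def split_utterances(play_text):
    """Return (speaker, text) pairs; unlabelled lines continue the
    current utterance, so paragraphs inside a turn stay together."""
    utterances = []
    for line in play_text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines separate paragraphs, not utterances
        match = SPEAKER_LINE.match(line)
        if match:
            utterances.append([match.group(1), match.group(2)])
        elif utterances:
            utterances[-1][1] += " " + line
    return [(speaker, text.strip()) for speaker, text in utterances]


def frequencies_by_speaker(utterances):
    """Count lower-cased word forms separately for each speaker."""
    counts = defaultdict(Counter)
    for speaker, text in utterances:
        counts[speaker].update(re.findall(r"\w+", text.lower()))
    return dict(counts)


# Invented example: ANNA's utterance spans two paragraphs.
sample = """ANNA: I waited all night.

Nobody came. Nobody wrote.
BRUNO: I wrote twice.
"""

utterances = split_utterances(sample)
freqs = frequencies_by_speaker(utterances)
print(utterances)
print(freqs["ANNA"].most_common(3))
```

A real play text would need a more robust label pattern (stage directions, speaker names in mixed case, shared lines), which is one reason explicit structural annotation is preferable to layout heuristics.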
References
Anthony, L. (2013). A critical look at software tools in corpus linguistics. Linguistic Research. https://doi.org/10.17250/khisli.30.2.201308.001
Anthony, L. (2014). AntPConc (Version 1.1.0) [Software]. Tokyo: Waseda University. Retrieved April 3, 2020, from http://www.laurenceanthony.net/
Archer, D., Wilson, A. & Rayson, P. (2002). Introduction to the USAS category system. Benedict project report. http://ucrel.lancs.ac.uk/usas/usas%20guide.pdf
Arrula, G. (2018). Autoitzulpenaren teoria eta praktika Euskal Herrian / Theory and practice of self-translation in the Basque Country (Doctoral dissertation). Bilbao: Universidad del País Vasco. Retrieved April 3, 2020, from http://hdl.handle.net/10810/27983
Bandín, E. (2007). Traducción, recepción y censura de teatro clásico inglés en la España de Franco. Estudio descriptivo-comparativo del Corpus TRACEtci (1939–1985) (Doctoral dissertation). Universidad de León. Retrieved April 3, 2020, from https://buleria.unileon.es/handle/10612/1885
Culpeper, J. (2014). Keywords and Characterization. An Analysis of Six Characters in Romeo and Juliet. In D. L. Hoover, J. Culpeper, & K. O’Halloran (Eds.), Digital literary studies. Corpus approaches to poetry, prose and drama (pp. 9–33). Routledge.
Doval, I., & Sánchez-Nieto, T. (2019). Parallel corpora in focus: An account of current achievements and challenges. In I. Doval & M. T. Sánchez Nieto (Eds.), Parallel corpora for contrastive and translation studies: New resources and applications (pp. 1–15). John Benjamins.
Esslin, M. (1990). The Field of Drama. Methuen.
Evert, S. (2014). [CWB] A question about the aligning using cwb-encoding (CWB mailing list). Retrieved April 3, 2020, from http://liste.sslmit.unibo.it/pipermail/cwb/2014-January/001529.html
Evert, S. (2020). The IMS Open Corpus Workbench (CWB) CQP Query Language Tutorial. Retrieved September 2, 2020, from http://cwb.sourceforge.net/files/CQP_Tutorial.pdf
Evert, S., & Hardie, A. (2011). Twenty-first century Corpus Workbench: Updating a query architecture for the new millennium. In Proceedings of the Corpus Linguistics Conference 2011. University of Birmingham.
Gutiérrez-Lanza, C., Bandín, E., García-González, J. E., & Lobejón-Santos, S. (2015). Desarrollo de software de etiquetado y alineación textual: TRACE Corpus Tagger/Aligner 1.0©. Paper presented at the II Congreso Internacional de Humanidades Digitales Hispánicas: Innovación, globalización e impacto, Madrid, Spain. Retrieved April 3, 2020, from http://hdh2015.linhd.es/ebook/hdh15-gutierrezlanza.xhtml
Hardie, A. (2012). CQPweb—Combining power, flexibility and usability in a corpus analysis tool. International Journal of Corpus Linguistics. https://doi.org/10.1075/ijcl.17.3.04har
Johansson, S., & Hofland, K. (1994). Towards an English-Norwegian parallel corpus. In U. Fries, G. Tottie, & P. Schneider (Eds.), Creating and Using English Language Corpora (pp. 25–37). Rodopi.
Kilgarriff, A., Baisa, V., Bušta, J., Jakubíček, M., Kovář, V., Michelfeit, J., Rychlý, P., & Suchomel, V. (2014). The sketch engine: Ten years on. Lexicography. https://doi.org/10.1007/s40607-014-0009-9
Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. Proceedings of Machine Translation Summit, X(5), 79–86.
Lavid, J. (2019). Discourse annotation in the MULTINOT corpus: Issues and challenges. In I. Doval & M. T. Sánchez Nieto (Eds.), Parallel corpora for contrastive and translation studies: New resources and applications (pp. 159–182). John Benjamins.
Marco, J. (2019). Living with parallel corpora: The potentials and limitations of their use in translation research. In I. Doval & M. T. Sánchez Nieto (Eds.), Parallel corpora for contrastive and translation studies: New resources and applications (pp. 39–56). John Benjamins.
McEnery, T., & Hardie, A. (2012). Corpus linguistics: Method, theory and practice. Cambridge University Press.
Merino-Álvarez, R. (1992). Rewriting for the Spanish stage. KOINÉ. Annali della Scuola Superiore per Interpreti e Traduttori San Pellegrino, 2(1–2), 283–289.
Merino-Álvarez, R. (1994). Traducción, tradición y manipulación: Teatro inglés en España 1950–1990. Universidad de León/Universidad del País Vasco.
Merino-Álvarez, R. (2007). La homosexualidad censurada: estudio sobre corpus de teatro TRACEti inglés-español (desde 1960). In R. Merino-Álvarez (Ed.), Traducción y censura en España (1939–1985). Estudios sobre corpus TRACE: cine, narrativa, teatro. Universidad de León/Universidad del País Vasco.
Merino-Álvarez, R., & Andaluz-Pinedo, O. (2017). Peter Shaffer en la cultura española. Creneida. Anuario De Literaturas Hispánicas, 5, 239–278.
Miller, A. (1955). The Crucible. Penguin Books.
Molés-Casés, T., & Oster, U. (2019). Indexation and analysis of a parallel corpus using CQPweb: The COVALT PAR_ES Corpus (EN/FR/DE > ES). In I. Doval & M. T. Sánchez Nieto (Eds.), Parallel corpora for contrastive and translation studies: New resources and applications (pp. 197–214). John Benjamins.
Oksefjell, S. (1999). A description of the English-Norwegian parallel corpus. International Journal of Corpus Linguistics, 4(2), 197–219.
Pérez, M. (2004). Traducciones censuradas de teatro norteamericano en la España de Franco (1939–1963) (Doctoral dissertation). Universidad del País Vasco.
Rafalovitch, A., & Dale, R. (2009). United Nations general assembly resolutions: A six-language parallel corpus. MT Summit XII (pp. 292–299). AMTA.
Sanjurjo-González, H. (2017a). ACTRES Corpus Manager. [Computer software].
Sanjurjo-González, H. (2017b). Creación de un framework para el tratamiento de corpus lingüísticos - Development of a framework for corpus linguistic (Doctoral dissertation). Universidad de León, León, Spain. Retrieved April 3, 2020, from https://buleria.unileon.es/handle/10612/6920
Sanjurjo-González, H. (2018). Creación de un framework para el tratamiento de corpus lingüísticos (Development of a framework for corpus linguistic analysis). Universidad de León, Área de Publicaciones.
Sanjurjo-González, H., & Andaluz-Pinedo, O. (2020). OpenTagger: A flexible and user-friendly linguistic tagger. 56th Linguistics Colloquium. http://hdl.handle.net/10810/48683
Sanjurjo-González, H., & Izquierdo, M. (2019). P-ACTRES 2.0: A parallel corpus for cross-linguistic research. In I. Doval & M. T. Sánchez Nieto (Eds.), Parallel corpora for contrastive and translation studies: New resources and applications (pp. 215–232). John Benjamins.
Sanz-Villar, Z. (2015). Unitate fraseologikoen itzulpena: alemana-euskara. Literatur testuen corpusean oinarritutako analisia (Doctoral dissertation). University of the Basque Country UPV/EHU. Retrieved April 3, 2020, from http://hdl.handle.net/10810/15128
Sanz-Villar, Z. (2019). An overview of Basque corpora and the extraction of certain multi-word expressions from a translational corpus. In I. Doval & M. T. Sánchez Nieto (Eds.), Parallel corpora for contrastive and translation studies: New resources and applications (pp. 233–247). John Benjamins.
Sanz-Villar, Z., & Andaluz-Pinedo, O. (2021). TAligner 3.0: a tool to create parallel and multilingual corpora. In J. Lavid, C. Maíz-Arévalo, & J. R. Zamorano-Mansilla (Eds.), Corpora in translation research: recent advances and applications (pp. 126–146). John Benjamins.
Scott, M. (2012). WordSmith Tools (Version 6) [Software]. Stroud: Lexical Analysis Software. Retrieved April 3, 2020, from http://www.lexically.net/wordsmith/
Stührenberg, M. (2012). The TEI and current standards for structuring linguistic data. An overview. Journal of the Text Encoding Initiative. https://doi.org/10.4000/jtei.523
TEI Consortium (2019). Performance Texts. TEI P5: Guidelines for Electronic Text Encoding and Interchange (pp. 234–259). https://tei-c.org/release/doc/tei-p5-doc/en/Guidelines.pdf
Xiao, R., & Yue, M. (2009). Using corpora in translation studies: The state of the art. In P. Baker (Ed.), Contemporary Corpus Linguistics (pp. 237–261). Continuum.
Zeldes, A., Lüdeling, A., Ritz, J., & Chiarcos, C. (2009). ANNIS: A search tool for multi-layer annotated corpora. In M. Mahlberg, V. González Díaz & C. Smith (Eds.), Proceedings of the Corpus Linguistics Conference 2009 (pp. 358–362). University of Liverpool.
Zubillaga, N. (2013). Alemanetik euskaratutako haur- eta gazte-literatura: zuzeneko nahiz zeharkako itzulpenen azterketa corpus baten bidez (Doctoral dissertation). Universidad del País Vasco. Retrieved April 3, 2020, from http://hdl.handle.net/10810/12431
Zubillaga, N., Sanz-Villar, Z., & Uribarri, I. (2015). Building a trilingual parallel corpus to analyse literary translations from German into Basque. In C. Fantinuoli & F. Zanettin (Eds.), New directions in corpus-based translation studies (pp. 71–92). Language Science Press.
Acknowledgements
Research group TRALIMA/ITZULIK, University of the Basque Country UPV/EHU, GIU 16/48, Basque Government consolidated research group, IT1209-19. Research group ACTRES. Part of this study has been supported by the Spanish Agency for Research, Development and Innovation (Ministry of Economy and Competitiveness) [FFI2016-75672-R]. Red de Excelencia CorpusNet, funded by Ministry of Economy and Competitiveness project [FFI2016-81934-RED]. At the time of writing, the co-author Olaia Andaluz-Pinedo is a doctoral student funded by the University of the Basque Country UPV/EHU [PIF]. We would like to thank the reviewers for their useful comments.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
About this article
Cite this article
Andaluz-Pinedo, O., Sanjurjo-González, H. Corpus tools for parallel corpora of theatre plays: an introduction to TAligner and ACM-theatre. Lang Resources & Evaluation 56, 651–671 (2022). https://doi.org/10.1007/s10579-022-09585-5
DOI: https://doi.org/10.1007/s10579-022-09585-5