

A Generic Tool for Annotating TEI-Compliant Corpora: An ELT-Based Approach to Corpus Annotation [1]

2010, … -Based Approaches to …

Chapter · January 2009

Alcaraz, Jose Maria; Pérez-Paredes, Pascual; Tornero, Encarnación
University of Murcia, Spain
jmalcaraz@dif.um.es, pascualf@um.es, nanitornero@yahoo.es

17.1. Introduction: annotation and language pedagogy

Annotation plays a significant role in Data Driven Learning (DDL). If annotation is pedagogically oriented, this role may be even more relevant. In cognitive-mediated learning processes, such as foreign language learning, it is common to find tools, materials and pedagogic media which guide and help the learner to understand the information she is trying to learn. These pedagogic tools may be of different kinds. They could be elements such as a slide, a photograph or a figure; teaching media such as a board, a computer or an overhead projector; and learning tools such as computer programs, books or any other conventional source of knowledge. In this context, a computer could be considered a tool, a medium and a piece of material at the same time.
Annotation tools allow the user to enrich texts or a linguistic corpus with additional information or meta-information which might be of interest to researchers or applied linguists. This possibility may be helpful in developing high quality pedagogic materials ranging from plain texts to well-designed annotation of information. This was a very important concern in SACODEYL. Leech (1993) states that generic annotation does not necessarily lead to high quality standards. It is first necessary to design and select the characteristics of the text we want to annotate depending on what the annotation is going to be used for. However, the challenge for a generic tool is tremendously ambitious, as the possible applications that linguists could expect are numerous. For example, Levy (1997, 49-50) has classified the general field of CALL into 24 subfields, such as language data processing, language teaching methodology, linguistics or second language acquisition, among others. The scope of application of a given tool, such as a test, will necessarily determine the way and the qualities of the annotation process. In other words, the scope of the application of a text or a corpus precedes the design of the annotation approach. Thus, in the field of language teaching, topic-driven approaches (Braun 2006) highlight the most relevant characteristics from a pedagogic perspective. If we analyse some of the current annotation tools (Pérez-Paredes and Alcaraz 2009, Pérez-Paredes et al. 2007), we can observe that most of them carry out the annotation process from the perspective of the researcher in linguistics, who needs textual annotation at the morpho-syntactic level.
Hence, almost all the annotators specialize in morphological and syntactic text annotation, carried out in a manual, semi-automatic or automatic way with, inter alia, Xerox EAGLES (Cutting and Pedersen 1993), TreeTagger (Schmid 1995), CLAWS (Garside 1987) and FreeLing (Atserias, Casas, Comelles, González, Padró and Padró 2006). Although such tools are in the majority, there are now annotation solutions which allow a more general and open process of annotation, not only a specific, closed set of labels or tags. This is the logical consequence of changing the representation system of the annotation process from the traditional way described above to a new approach for representing the information based on XML. There is therefore an increasing number of new tools based on XML-aware annotation, among which we find Callisto (Bayer, Doran, Condon and Gertner 2006), LACITO (Jacobson 2006) and LT XML (Grover, Matthews and Tobin 2006). However, even though there are now more XML-based tools which support an annotation process, these tools do not generally enable the user to define which characteristics to annotate. Ideally, these XML tools would be generic and extensible. We can find some proposals in this direction, such as Dexter (Garretson 2006) or EXMARaLDA (Schmidt 2004). These tools are a step towards annotation tools that can be used generically in different contexts of knowledge representation, for example by foreign language teachers who want to prepare teaching materials. However, these tools are designed with a research aim in mind and it is not easy to use them pedagogically. Furthermore, these solutions code the annotation of the linguistic corpora using their own XML schema, not following any standardized mechanism [2] to represent the linguistic annotation of data and metadata.
In the language teaching context, we therefore need a tool which enables pedagogic annotation and so facilitates the development of high quality pedagogic resources to be used by both learners and teachers in the context of DDL. These pedagogic resources may be used later for data-driven teaching or just for creating CALL exercises. What is more, it would be advisable to produce pedagogic resources which could be re-used by other applications, which implies using a universally accepted academic standard to represent the information of the annotation (Ward 2002). This way, any application using standard-compliant XML annotated corpora could reuse these new pedagogic resources. The bottom line here is that if pedagogy can be annotated, language learning resources which make use of corpus-based materials are more likely to be implemented in the classroom.

17.2. SACODEYL Annotator

17.2.1. Using SACODEYL Annotator

SACODEYL Annotator [3] has been developed as part of ongoing work on System Aided Compilation and Open Distribution of European Youth Language. The aim of the tool is to give the multilingual and multinational SACODEYL team the means to annotate seven different corpora. The main distinctive feature of the tool is that it was developed from the outset to implement pedagogical annotation, which means that we have not adapted or converted other tools that may have been developed with other aims. Another design principle has been ease of use for the annotators, combined with power and robustness in terms of the output data. It is expected that different users will have an interest in the tool, from a computational linguist interested in annotating texts to a language learner who wishes to navigate the features annotated in a corpus and thus become more acquainted with the sort of meta-information that has been included by the annotators. Language teachers, in turn, are likely to be interested in material selection and/or development.
All of these users will find a very friendly interface that greatly facilitates both the annotation and the navigation process. Let us examine this interface in detail.

Figure 1. SACODEYL Annotator: a multi-purpose generic tool

Figure 1 shows the layout of the main window of the tool. On the left, we find the annotation structure established by the annotator(s). On the right, we find the annotation performed on a text, in the example above an oral text. Let us concentrate first on the left-hand side of the application. This area of the tool clearly shows the potential of SACODEYL Annotator to become a truly generic tool for problem-oriented tagging (McEnery and Wilson 1996). This is possible because annotators are given the chance to decide on the tags they want to work with, and the tool takes care of the rest, that is, the application performs management, extension, addition, modification and suppression functions on this set of tags. This is a key point in the development of a multi-purpose application that seeks to meet the needs of a wide range of language professionals.

On the right-hand side we find the annotation of an oral text. The area is divided into four columns. In the first column we find information about the section of the text that is being annotated (section 1 ... section n). In SACODEYL this is a crucial issue, as the different texts of a corpus are segmented with didactic exploitation in mind. For further discussion of the idea of section see Braun (2005), Pérez-Paredes et al. (2007) and Pérez-Paredes and Alcaraz (2009). In the second column (under Applied Taxonomies) we find the annotation that has been assigned to a section, while in the third it is possible to identify the speaker or contributor. Finally, in the fourth column we can see the text proper.
Here some highlighting marks the relationship established between the tags and the stretch of text that motivated the assignment of a particular sub-taxonomy or tag to a section. This link is optional, that is, a tag can be assigned and the annotator may decide not to establish a link between the language data and the annotation. In SACODEYL we call this highlighted stretch of text a keyword. Figure 2 shows how a search tool may render the annotation performed on section 1 of the text previously displayed in Figure 1:

Figure 2. Taxonomy tree as rendered by SACODEYL Search Tool.

As seen above, the tool has been developed to meet the needs of a very wide range of users, and as a consequence no a priori knowledge of CL is needed in order to start annotating right away. The tool is very easy to use: tag assignment is performed through drag and drop, and keyword assignment through basic select and click operations. To facilitate this process the application filters the information shown on screen, so users can decide which highlighted keywords they want to see or hide. Secure deletion of the annotation is also provided. The tool is so intuitive that even learners with no CL background whatsoever might use it to navigate the annotation.

But there is more to the tool. SACODEYL Annotator allows for the management of multiple corpus files, an underlying principle in SACODEYL. Users can thus create, import or select different corpora and apply the same or different annotation schemes to each of them. The user may work on texts of different kinds, spoken or written, monologic or dialogic. The tool uses the Unicode standard (Needleman 2000), which gives it truly multilingual power.
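The relationship described above between a section, its assigned tags and the optional keywords can be pictured as a small data model. The following Python sketch is only an illustration under our own assumptions: the class names `Section`, `Annotation` and `Keyword`, the methods `assign` and `highlighted`, and the slash-separated tag paths are hypothetical and are not taken from the SACODEYL Annotator code.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class Keyword:
    """Hypothetical keyword: a highlighted stretch of the section text."""
    start: int  # character offset where the keyword begins (inclusive)
    end: int    # character offset where the keyword ends (exclusive)


@dataclass
class Annotation:
    """A tag applied to a section, optionally linked to a keyword."""
    tag: str
    keyword: Optional[Keyword] = None  # the link to the text is optional


@dataclass
class Section:
    """One row of the annotation view: section, speaker, text, applied tags."""
    section_id: str
    speaker: str
    text: str
    annotations: List[Annotation] = field(default_factory=list)

    def assign(self, tag: str, phrase: Optional[str] = None) -> None:
        """Assign a tag; if a phrase is given, link its first occurrence as keyword."""
        keyword = None
        if phrase is not None:
            start = self.text.find(phrase)
            if start < 0:
                raise ValueError("phrase not found in section text")
            keyword = Keyword(start, start + len(phrase))
        self.annotations.append(Annotation(tag, keyword))

    def highlighted(self, visible_tags: Set[str]) -> List[Tuple[str, str]]:
        """Return (tag, keyword text) pairs for the tags the user chose to show."""
        return [
            (a.tag, self.text[a.keyword.start:a.keyword.end])
            for a in self.annotations
            if a.keyword is not None and a.tag in visible_tags
        ]


section = Section("section1", "Interviewee", "I go on Facebook every day.")
section.assign("Hobbies/Using technologies", phrase="Facebook")  # tag with keyword
section.assign("Topic/Routines")                                 # tag without keyword
print(section.highlighted({"Hobbies/Using technologies"}))
```

Keeping the keyword optional mirrors the point made above that a tag may be assigned without any link to the language data, and the `visible_tags` argument stands in for the on-screen filter that lets users show or hide highlighted keywords.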
For SACODEYL this means that all seven language corpora (DE, EN, ES, FR, IT, LT, RO) can be annotated with the same tool, but for the generic potential discussed above it means that any corpus of any language could be annotated with it, from Chinese to Korean, just to cite two important non-Western languages. Also, it must be stressed that, apart from the language of the corpus, SACODEYL Annotator will read files encoded according to different standards: ANSI, ASCII, ISO, UTF-8 and Unicode.

A key issue in CL is how meta-data are handled. SACODEYL Annotator allows the different XML entities in a corpus, i.e. texts, to be assigned all kinds of meta-information such as title, author, editor, date, participants, description, language, etc. Figure 3 shows how this is done:

Figure 3. SACODEYL Annotator meta-data screen

An interesting feature is the possibility for annotators to attach external resources or data to a particular section. In the framework of SACODEYL, it has been envisaged that this feature will be used to enrich the corpus pedagogically and feed the DDL web system with links to web services such as pages, multimedia, textual resources and FLT activities. In SACODEYL we have for the most part used this feature to feed our DDL web system, although learners or teachers may very well use it to enrich their language experiences in different ways.

It is worth mentioning that most annotation tools will not let users modify or edit the linguistic data that is being annotated. In the framework of SACODEYL this powerful feature has played a very important role in securing consistency and accuracy. These textual alterations can easily be made while preserving the annotated tags, which facilitates the transcription-annotation-data delivery process. This is another feature that will be of interest to different professionals in a wide array of fields.
So far we have discussed the usefulness of SACODEYL Annotator in the annotation of corpus-based learning resources. However, the tool has been designed with a generic use in mind. SACODEYL Annotator is language input-independent, as different languages and text typologies can be annotated. A case in point is spoken language, where different contributors can be represented by the tool interface, making the annotation and navigation process more intuitive. Also, the tool is discipline-independent, because annotators are given flexibility to establish the use that the corpus will be put to and, accordingly, the discipline where the annotated corpus is to be delivered. For users who are not XML-aware, it is particularly relevant that both annotation and ‘taxonomy definition’ can be performed with the same tool and on the same screen interface.

Within the field of language and linguistics, SACODEYL Annotator allows very refined uses and applications. Some of these include translation and interpretation studies, general and specific language learning purposes, computational studies, creation of folksonomies and the generation of ontologies. Last but not least, SACODEYL Annotator is multi-user oriented, as it may cater for different and simultaneous needs, including those of teachers, learners and materials developers.

Having discussed the generic potential of the tool, let us now outline the technology that makes these generic uses possible in SACODEYL Annotator: the Text Encoding Initiative.

17.2.2. TEI as standardization method

One of the challenges to be met by system developers is that of standards and normalization. In our case the main issue was to decide on the way in which our linguistic data and our annotation should be stored.
As discussed in Pérez-Paredes and Alcaraz (2009), our aim was to develop tools and products that could be reusable and, in this way, contribute significantly to the ever-growing open-content movement. We were aware that existing ad hoc solutions could provide us with tools that could do the job, but we still felt that, given the nature of our initiative, we should strive for standardization as a goal. With this goal in mind, we decided to use the standard XML representation of the Text Encoding Initiative. TEI is a widely used standard for text encoding that provides an XML schema for storing corpora and the metadata associated with them. The main aim of TEI is to offer a common framework for text encoding and to cover all the different aspects and features that could be associated with any text or corpus. This way, spoken discourse features such as pauses or breaks can be treated uniformly across different software applications. In the case of written texts, structural divisions of a text at different levels (documents, sections, paragraphs, sentences or words), bibliographic description, tables of contents, tagset description and metadata can be conveniently stored in standard XML with a wide range of tools.

The number of XML tools that support TEI is increasing by the day: oXygen by SyncRO Soft (2007), OpenOffice (Haugland and Jones 2002), TEI Emacs (Lease 2005), Anastasia (Scholarly Digital Editions 2004), TEI Publisher (Lease 2005) and, inter alia, Xaira (Burnard 1995). SACODEYL Annotator benefits from this standard coding while providing users with an extremely intuitive interface. Our SACODEYL XML files are corpus files that contain the language data, the language data structure information and the annotation proper (Pérez-Paredes et al. 2007). It must be stressed that the tool easily adapts to the needs of advanced users or computational linguists who wish to work on the XML code itself.
This feature gives advanced XML-aware users the possibility to make changes directly in the code. This can be better appreciated in Figure 4:

Figure 4. SACODEYL Annotator XML edition and exploration screen.

17.3. Annotation in the foreign language classroom

Adapting texts and corpora to the needs of the language classroom is an area where SACODEYL Annotator may be instrumental. It is well known that the Text Encoding Initiative allows the subdivision of a text into meaningful fragments for analytic purposes, a feature which has been conveniently adapted in SACODEYL Annotator for the representation of our own section element. The annotation categories are declared in a <classDecl> element, which allows annotators to create extensible subcategories as they see fit. The section element has been integrated into the <div> tags. An example of the categories annotated on a section of the Spanish corpus follows:

<div decls="#routinesTopic #Adverbios #TextOrganizationFeatures #futurePlanTopic #TipicalOffSpokenLang" type="event" xml:id="R2738C0D1">
<head>Una semana de mi vida</head>

An important feature of the SACODEYL system is that every corpus can be browsed and searched dynamically, in the sense that each corpus informs our search tool about the different annotated categories that have been applied to the corresponding sections. Figure 5 exemplifies this point:

Figure 5. Annotation in action as displayed by SACODEYL Search Tool

This is a major breakthrough in the customization of corpus-based language learning and teaching. To date, language professionals have been prompted to make use of materials whose primary orientation was linguistic research. In this sense, annotation can give language learning and teaching stakeholders the chance to adapt corpus methods and resources to the type of authenticity that is sought after in the language classroom.
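As a rough illustration of how a search tool might learn from a corpus file which categories have been applied to which sections, the following Python sketch reads the `decls` pointers of each `<div>` and builds a category index. It is a minimal sketch under stated assumptions, not the SACODEYL implementation: the function names `index_sections` and `refine` are ours, the sample document is a cut-down fragment rather than a complete TEI file, its second `<div>` (id `S2`, title "Planes") is invented for the example, and we assume the TEI P5 namespace is declared on the root element.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

TEI_NS = "http://www.tei-c.org/ns/1.0"                    # TEI P5 namespace (assumed present)
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"       # how ElementTree exposes xml:id

# Illustrative fragment only; the second <div> is hypothetical.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div decls="#routinesTopic #Adverbios" type="event" xml:id="R2738C0D1">
      <head>Una semana de mi vida</head>
    </div>
    <div decls="#futurePlanTopic #Adverbios" type="event" xml:id="S2">
      <head>Planes</head>
    </div>
  </body></text>
</TEI>"""


def index_sections(tei_xml):
    """Map each category id to the (section id, title) pairs that point to it."""
    root = ET.fromstring(tei_xml)
    index = defaultdict(list)
    for div in root.iter("{%s}div" % TEI_NS):
        head = div.find("{%s}head" % TEI_NS)
        title = head.text if head is not None else ""
        # decls holds space-separated pointers such as "#routinesTopic"
        for pointer in div.get("decls", "").split():
            index[pointer.lstrip("#")].append((div.get(XML_ID), title))
    return dict(index)


def refine(index, *categories):
    """Return the sections annotated with every one of the given categories."""
    sets = [set(index.get(category, [])) for category in categories]
    return set.intersection(*sets) if sets else set()


index = index_sections(SAMPLE)
print(sorted(index))                                  # categories found in the corpus
print(refine(index, "Adverbios", "futurePlanTopic"))  # sections carrying both categories
```

Because the index is derived from the corpus file itself, adding a new category during annotation would make it available to such a search without any change to the search code, which is the sense in which a corpus can be said to inform the search tool.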
The search interface shown in Figure 5 can be reached from the SACODEYL website http://www.um.es/sacodeyl or from the SACODEYL dedicated server http://www.purl.org/sacodeyl/search. This search interface dynamically reads the annotation and renders a query tree based on the information which has been provided by the annotators. This is a ready-to-use example of how pedagogic annotation can be used in varied language learning contexts. In the case of SACODEYL the aim was to develop annotation which could serve as a pedagogic mediator in the process of foreign language learning of young Europeans. Although the possibilities are unlimited, the annotation categories which were used by the seven language teams in SACODEYL focused very significantly on topics, CEFRL levels and the features of spoken language. Learners and teachers interested in evidence-driven language learning can use the power of annotation to query multimodal corpora. Suppose a group of learners is interested in learning more about the hobbies topic area.

Figure 6. Searching for sections where “Hobby” has been annotated

The Search interface (Figure 6) displays 71 results for this corpus, which is probably far too many. Learners may now want to refine their search and establish technology as a subset within the results.

Figure 7. Refining a search

Now the learners get seven sections where Hobbies > Using technologies are used. In a way, we have applied CL methods to the notion of topic and pedagogic section, which we expect to be useful in most FLT contexts. Learners now have sections which deal with a very restricted thematic area and which can be further searched. Figure 8 shows how one of these sections has been annotated as displaying modality, while retaining the thematic feature:

Figure 8. A section in the SACODEYL Search Tool

This section, which has been called “On the Internet”, can be viewed in isolation or in the context of the whole interview/text and, interestingly, displays in red a feature which the annotation team has found of interest from a pedagogic perspective: modality. Now the search has been expanded into Hobbies > Using technologies > Modality by way of the suggested features added by the annotators. If the learner is interested in the section, she can watch it, as shown in Figure 9:

Figure 9. A multimodal section

Of course, word search is central to the application. Figure 10 shows a search for “Facebook” in SACODEYL:

Figure 10. Word search

Using this search, learners could build up a sense of the contexts where one could expect to find Facebook in discourse, that is, being a member of Facebook, find Facebook really good, Facebook is for slightly older people, go on Facebook, etc. While such results are not as representative of English discourse as the BNC, they can still compensate for the important weaknesses of representative corpora in pedagogic contexts.

17.4. Conclusions and future work

SACODEYL Annotator has already enabled the SACODEYL team to accomplish DDL-oriented pedagogical annotation (Tornero et al. 2007). However, it is our intention to refine and improve the tool to make it as generic and flexible as possible. The tool may contribute to building knowledge in many disciplines and provide textual resources with different kinds of annotated enrichment. In particular, the tool could be helpful in CALL-related fields by providing high quality pedagogical materials stored in a standard format. Furthermore, these materials could also be re-used by the wide range of tools that support TEI. SACODEYL is thus the first major effort where pedagogic DDL has been implemented. By using TEI standardization, we hope to make this effort even more meaningful to the FLT and linguistic community.
This environment can be viewed as a language learning platform which integrates multimodal search facilities, including section search and browse plus the more traditional concordance lines. So far we have implemented the P5 version of the TEI guidelines, which were released on 1 November 2007. Future work on SACODEYL Annotator is focused on the dissemination of the tool in connection with Text Encoding Initiative tools and utilities such as XAIRA (Bernard 2004), TAPoR [4], PhiloLogic [5] or Wordhoard [6].

A wiki [7] has been established to attract the interest of fellow researchers in pedagogical annotation, and we expect to continue to develop SACODEYL Annotator into a more powerful device- and system-independent tool to store and process texts and corpora that can be used in the language classroom.

References

Anastasia Scholarly Digital Editions. (2004), ‘Anastasia: Analytical System Tools and SGML/XML Integration Applications’. Available through http://anastasia.sourceforge.net/whatis.html

Atserias, J., Casas, E., Comelles, M., González, L., Padró and M. Padró. (2006), ‘FreeLing 1.3: Syntactic and semantic services in an open-source NLP library’, in Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'06). Genoa, Italy.

Bayer, S., Doran, C., Condon, S. and Gertner, A. (2006), ‘Dialogue annotation as a correction task’, in 9th International Conference on Intelligent User Interfaces, 2006.

Bernard, L. (2004), ‘BNC-Baby and Xaira’, in Proceedings of the Sixth Teaching and Language Corpora conference, Granada, pp. 84.

Braun, S. (2005), ‘From pedagogically relevant corpora to authentic language learning contents’, ReCALL 17, (1), 47-64.

Braun, S. (2006), ‘ELISA - a pedagogically enriched corpus for language learning purposes’, in S. Braun, K. Kohn, and J. Mukherjee (eds), Corpus Technology and Language Pedagogy: New Resources, New Tools, New Methods. Frankfurt am Main: Peter Lang, pp. 25-47.

Braun, S. (2007), ‘Integrating corpus work into secondary education: from data-driven learning to needs-driven corpora’. ReCALL 19, (3), 307-328.

Burnard, L. (1995), ‘The Text Encoding Initiative: an overview’, in G. Leech, G. Myers, and J. Thomas (eds), Spoken English on Computer: Transcription, Markup and Applications. Harlow: Longman.

Burnard, L. and Berglund, Y. (2007), ‘Exploring BNC XML Edition with Xaira’. 28th Annual Conference of the International Computer Archive for Modern and Mediaeval English (ICAME).

Cushion, S. (2004), ‘Increasing accessibility by pooling digital resources’. ReCALL 16, (1), 41-50.

Cutting, D. and Pedersen, J. (1993), The xerox part-of-speech tagger version. Via citeseer.ist.psu.edu/cutting93xerox.html.

EAGLES Guidelines. (1996), Expert Advisory Group on Language Engineering Standards. http://www.ilc.cnr.it/EAGLES96/browse.html

Garretson, G. (2006), ‘Dexter: free tools for analyzing texts’. V International Congress of AELFE. Academic and Professional Communication in the 21st Century: Genres and Rhetoric in the Construction of Disciplinary Knowledge.

Garside, R. (1987), ‘The CLAWS word-tagging system’, in R. Garside, G. Leech, and G. Sampson (eds), The Computational Analysis of English. Harlow: Longman.

Grover, C., Matthews, M. and Tobin, R. (2006), ‘Tools to Address the Interdependence between Tokenisation and Standoff Annotation’, in Proceedings of NLPXML-2006 (Multidimensional Markup in Natural Language Processing), pp. 19-26.

Haugland, S. and Jones, F. (2002), OpenOffice.Org 1.0 Resource Kit. Indianapolis: Prentice Hall PTR.

Jacobson, M. (2006), ‘Le projet "Archivage" du LACITO’. Langues et cité. 6, pp. 11.

Lease, E. (2005), ‘Creating and managing XML with open source software’. Library Hi Tech Journal 23, (4), 526-540.

Leech, G. (1993), Literary and Linguistic Computing 8, (4), 275-281.

Levy, M. (1997), Computer-Assisted Language Learning: Context and Conceptualization. Oxford University Press.

McEnery, T. and Wilson, A. (1996), Corpus linguistics. Edinburgh: Edinburgh University Press.

Mercader, A., Pérez-Paredes, P., Alcaraz, J. M. and Tornero, E. (2007), ‘The role of pedagogic annotation in DDL’. Paper presented at the 1st International Conference on Corpus-Based Approaches to ELT. Universitat Jaume I, Castellón, November 2007.

Needleman, M. (2000), ‘The Unicode Standard’. Serial Review Journal 26, (2), 51-54.

Pérez-Paredes, P., Alcaraz, J. M., Mercader, A. and Tornero, E. (2007), ‘Extracting data from xml annotated corpora: not so mysterious ways into data driven learning (DDL)’. Paper presented at the 1st International Conference on Corpus-Based Approaches to ELT. Universitat Jaume I, Castellón, November 2007.

Pérez-Paredes, P. and Alcaraz, J. M. (2009), ‘Developing annotation solutions for online data-driven learning’. ReCALL 21, (1), 55-75.

Schmid, H. (1995), TreeTagger - a language independent part-of-speech tagger. Institut für Maschinelle Sprachverarbeitung, 1995.

SyncRO Soft Ltd. (2007), oXygen XML Editor. http://www.oxygenxml.com

Schmidt, T. (2004), ‘Transcribing and annotating spoken language with EXMARaLDA’, in Proceedings of the LREC-Workshop on XML based richly annotated corpora, Lisbon 2004.

Tornero, E., Pérez-Paredes, P., Mercader, A. and Alcaraz, J. M. (2007), ‘Annotating Spanish youngsters spoken language for DDL applications’. Paper presented at the EUROCALL Conference. University of Ulster at Coleraine, September 2007.

Ward, M. (2002), ‘Reusable XML technologies and the development of language learning materials’. ReCALL 14, (2), 285-294.

[1] System Aided Compilation and Open Distribution of European Youth Language, research funded by the European Commission under the Socrates-Minerva initiative (225836-CP-1-2005-1-ES-MINERVA).

[2] The importance of standards in computer science lies beyond the scope of this article. Suffice it to say that if standard XML is used, more and more users and applications will reuse the annotated resources.
[3] SACODEYL Site, http://www.um.es/sacodeyl. URL last accessed 15/07/2009

[4] http://portal.tapor.ca/portal/portal URL last accessed 15/07/2009

[5] http://www.lib.uchicago.edu/efts/ARTFL/philologic/ URL last accessed 15/07/2009

[6] http://wordhoard.northwestern.edu/userman/index.html URL last accessed 15/07/2009

[7] http://www.tei-c.org/wiki/index.php/Sacodeyl_Annotator URL last accessed 15/07/2009