A generic tool for annotating TEI-compliant
corpora
Chapter · January 2009
17. A Generic Tool for Annotating TEI-Compliant Corpora: an ELT-based approach to corpus annotation¹
Alcaraz, Jose Maria; Pérez-Paredes, Pascual; Tornero, Encarnación
University of Murcia, Spain
jmalcaraz@dif.um.es, pascualf@um.es, nanitornero@yahoo.es
17.1. Introduction: annotation and language pedagogy
Annotation plays a significant role in Data Driven Learning (DDL). If annotation is
pedagogically oriented, this role may be even more relevant. In cognitively mediated learning
processes, such as foreign language learning, it is usual to find tools, materials and pedagogic
media which guide and help the learner to understand the information she is trying to learn.
These pedagogic aids may be of different kinds: elements such as a slide, a
photograph or a figure; teaching media such as a board, a computer or an overhead projector;
and learning tools such as computer programs, books or any other conventional source of
knowledge. In this context, a computer could be considered a tool, a medium and a
piece of material at the same time.
Annotation tools allow the user to enrich texts or a linguistic corpus with
additional information or meta-information which might be of interest to researchers or applied
linguists. This possibility may be helpful in developing high quality pedagogic materials
ranging from plain texts to well-designed annotations. This was a very important
concern in SACODEYL.
Leech (1993) states that generic annotation does not necessarily lead to high quality standards.
It is first necessary to design and select the characteristics of the text we want to annotate
depending on what the annotation is going to be used for. However, the challenge for a generic
tool is tremendously ambitious, as the applications that linguists could expect are
numerous. For example, Levy (1997, 49-50) has classified the general field of CALL into 24 subfields,
such as language data processing, language teaching methodology, linguistics or second
language acquisition, among others. The scope of application of a given tool, such as a test, will
necessarily determine the way and the qualities of the annotation process. In other words, the
scope of the application of a text or a corpus precedes the design of the annotation approach.
Thus, in the field of language teaching, topic-driven approaches (Braun 2006) highlight the
most relevant characteristics from a pedagogic perspective.
If we analyse some of the current annotation tools (Pérez-Paredes and Alcaraz 2009, Pérez-Paredes
et al. 2007), we can observe that most of them carry out the annotation process from the
perspective of the researcher in linguistics, who needs textual annotation at the morpho-syntactic
level. Hence, almost all the annotators specialize in carrying out morphological and
syntactic text annotation in a manual, semi-automatic or automatic way; they include, inter alia,
Xerox-EAGLES (Cutting and Pedersen 1993), TreeTagger (Schmid 1995), CLAWS (Garside 1987)
and FreeLing (Atserias, Casas, Comelles, González, Padró and Padró 2006).
Despite the predominance of such tools, there are now annotation solutions which allow a more
general and open process of annotation, not only a specific, closed set of labels or tags. This is
the logical consequence of changing the representation system of the annotation process from
the traditional way described above to a new approach for representing the information based on
XML. Therefore, there is an increasing number of new tools based on XML-aware annotation,
among which we find Callisto (Bayer, Doran, Condon and Gertner 2006), LACITO (Jacobson
2006) and LT XML (Grover, Matthews and Tobin 2006).
However, even though there are now more XML-based tools which support annotation,
these tools do not generally enable the user to define which characteristics to annotate.
Ideally, these XML tools would be generic and extensible. We can find some
proposals in this direction, such as Dexter (Garretson 2006) or EXMARaLDA (Schmidt 2004).
These tools are a step forward in getting annotation tools to be used in a generic way in different
contexts of knowledge representation, for example by foreign language teachers who want to
prepare teaching materials. However, these tools are designed with a research aim in mind and it
is not easy to use them pedagogically. Furthermore, these solutions encode the annotation of the
linguistic corpora using their own XML schema, without following any standardized mechanism to
represent the linguistic annotation of data and metadata.²
In the language teaching context, we therefore need a tool which enables pedagogic annotation
and facilitates the development of high quality pedagogic resources to be used by both learners and
teachers in the context of DDL. These pedagogic resources may be used later for data-driven
teaching or just for creating CALL exercises. What is more, it would be advisable to produce
pedagogic resources which could be re-used by other applications, which implies using a
universally accepted academic standard to represent the information of the annotation (Ward
2002). This way, any application using standard-compliant XML annotated corpora could re-use
these new pedagogic resources.
The bottom line here is that if pedagogy can be annotated, language learning resources which
make use of corpus-based materials are more likely to be implemented in the classroom.
17.2. SACODEYL Annotator
17.2.1. Using SACODEYL Annotator
SACODEYL Annotator³ has been developed as part of ongoing work on System Aided
Compilation and Open Distribution of European Youth Language (SACODEYL). The aim of the tool is to give
the multilingual and multinational SACODEYL team the means to annotate seven different
corpora. The main distinctive feature of the tool is that it was designed from the outset to
implement pedagogical annotation, which means that we have not adapted or converted other
tools that may have been developed with other aims. Another design principle has been that of
providing ease of use for the annotators as well as power and robustness in terms of the output
data.
It is expected that different users will have an interest in the tool, from a computational linguist
interested in annotating texts to a language learner who wishes to navigate the features annotated
in a corpus, and thus become more acquainted with the sort of meta-information that has been
included by the annotators. Language teachers, for their part, will show a natural inclination towards
material selection and/or development. All of these users will find a friendly interface that
facilitates both the annotation and the navigation process. Let us examine this
interface in detail.
Figure 1. SACODEYL Annotator: a multi-purpose generic tool
Figure 1 shows the layout of the main window of the tool. On the left, we can find the
annotation structure established by the annotator(s). On the right, we find the annotation
performed on a text, in the example above an oral text. Let us concentrate first on the left-hand
side of the application.
This area of the tool shows the potential of SACODEYL Annotator to become a truly
generic tool for problem-oriented tagging (McEnery and Wilson 1996). This is possible because
annotators are given the chance to decide on the tags they want to work with, and the tool takes
care of the rest, that is, the application handles the management, extension, addition, modification
and deletion of this set of tags. This is a key point in the development of a multi-purpose
application that seeks to meet the needs of a wide range of language professionals.
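The tag management just described can be pictured as a handful of operations on a small tree. The following Python sketch is illustrative only: the Tag class, its methods and the tag identifiers are hypothetical and are not taken from the SACODEYL Annotator code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Tag:
    """One node of a user-defined taxonomy (hypothetical structure)."""
    tag_id: str
    label: str
    children: List["Tag"] = field(default_factory=list)

    def add(self, tag_id: str, label: str) -> "Tag":
        """Create a sub-tag under this node and return it."""
        child = Tag(tag_id, label)
        self.children.append(child)
        return child

    def find(self, tag_id: str) -> Optional["Tag"]:
        """Depth-first lookup of a tag by identifier."""
        if self.tag_id == tag_id:
            return self
        for child in self.children:
            hit = child.find(tag_id)
            if hit is not None:
                return hit
        return None

    def remove(self, tag_id: str) -> bool:
        """Delete the first descendant with this identifier, subtree included."""
        for i, child in enumerate(self.children):
            if child.tag_id == tag_id:
                del self.children[i]
                return True
            if child.remove(tag_id):
                return True
        return False


# Build, modify and prune a small taxonomy (all identifiers are invented).
root = Tag("topics", "Topics")
hobbies = root.add("hobbyTopic", "Hobbies")
hobbies.add("usingTechnologies", "Using technologies")
root.add("routinesTopic", "Routines")
root.find("usingTechnologies").label = "Using technology"
removed = root.remove("routinesTopic")
print([child.tag_id for child in root.children])
```

A tree was chosen here because sub-taxonomies nest under broader categories; a real tool would also have to update any sections that still point to a deleted tag.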
On the right-hand side we can find the annotation of an oral text. The area is divided into four
columns. In the first column we find information as to the section of the text that is
being annotated (section 1 ... section n). In SACODEYL this is a crucial issue, as the different
texts of a corpus are segmented with didactic exploitation in mind. For a further discussion
of the idea of section see Braun (2005), Pérez-Paredes et al. (2007) and Pérez-Paredes and
Alcaraz (2009). In the second column (under Applied Taxonomies) we find the annotation that
has been assigned to a section, while in the third it is possible to identify the speaker or
contributor. Finally, in the fourth column we can see the text proper. Here some
highlighting marks the relationship established between a tag and the stretch of text
that motivated the assignment of a particular sub-taxonomy or tag to a section. This link is optional,
that is, a tag can be assigned and the annotator may decide not to establish a link between the
language data and the annotation. In SACODEYL we call this highlighted stretch of text a
keyword.
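To make the relation between sections, tags and optional keywords concrete, here is a minimal Python sketch. It is an assumption about how such data could be modelled, not a description of the tool's internals: the class names, the character-offset representation of keywords and the sample sentence are all invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class TagAssignment:
    """A tag applied to a section, optionally linked to a keyword."""
    tag_id: str
    # (start, end) character offsets into the section text, or None
    # when the annotator chose not to link the tag to any stretch of text.
    keyword_span: Optional[Tuple[int, int]] = None


@dataclass
class Section:
    """One pedagogic section: identifier, speaker, text and its tags."""
    section_id: str
    speaker: str
    text: str
    assignments: List[TagAssignment] = field(default_factory=list)

    def assign(self, tag_id: str, keyword: Optional[str] = None) -> None:
        """Apply a tag; if a keyword is given, link its first occurrence."""
        span = None
        if keyword is not None:
            start = self.text.find(keyword)
            if start == -1:
                raise ValueError(f"{keyword!r} does not occur in the section")
            span = (start, start + len(keyword))
        self.assignments.append(TagAssignment(tag_id, span))

    def keywords(self, tag_id: str) -> List[str]:
        """Return the highlighted stretches of text linked to a tag."""
        return [
            self.text[a.keyword_span[0]:a.keyword_span[1]]
            for a in self.assignments
            if a.tag_id == tag_id and a.keyword_span is not None
        ]


# Invented sample sentence; the tag identifiers are hypothetical.
section = Section("section1", "Speaker A",
                  "I usually go on the Internet after school.")
section.assign("Adverbios", keyword="usually")  # tag linked to a keyword
section.assign("hobbyTopic")                    # tag without a keyword
print(section.keywords("Adverbios"))
```

Storing the keyword as an offset pair rather than as a copy of the string keeps the link tied to one position in the text, which matters when the same word occurs several times in a section.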
Figure 2 shows how a search tool may render the annotation performed on section 1 of the text
previously displayed in Figure 1:
Figure 2. Taxonomy tree as rendered by SACODEYL Search Tool.
As seen above, the tool has been developed to meet the needs of a very wide range of users, and
as a consequence no prior knowledge of corpus linguistics (CL) is needed in order to start annotating right away.
The tool is easy to use: tag assignment is performed through drag and drop, and keyword
assignment through basic select-and-click operations. To facilitate this process the application
filters the information shown on screen, so users can decide which highlighted keywords
they want to see or hide. Safe deletion of annotations is also provided. The tool is so
intuitive that even learners with no CL background whatsoever might use it to navigate the
annotation.
But there is more to the tool. SACODEYL Annotator allows for the management of multiple
corpus files, an underlying principle in SACODEYL. Users can thus create, import or select
different corpora and apply the same or different annotation schemes to each of them. The
user may work on texts of different kinds, spoken or written, monologic or dialogic. The tool
uses the Unicode standard (Needleman 2000), which gives it truly multilingual power. For
SACODEYL this means that all seven language corpora (DE, EN, ES, FR, IT, LT, RO) can be
annotated with the same tool; for the generic potential discussed above it means that a
corpus in any language could be annotated with it, from Chinese to Korean, just to cite two
important non-Western languages. Also, it must be stressed that, apart from the language of the
corpus, SACODEYL Annotator will read files encoded according to different standards: ANSI,
ASCII, ISO, UTF-8 and Unicode.
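How a tool might accept input in several encodings can be sketched as follows. This is not the strategy used by SACODEYL Annotator, which the chapter does not describe; it is one common approach, and it assumes that "ANSI" means Windows code page 1252 and "ISO" means ISO 8859-1. The function name is hypothetical.

```python
import codecs
from typing import Tuple


def decode_corpus_bytes(raw: bytes) -> Tuple[str, str]:
    """Decode raw file content, returning (text, encoding actually used)."""
    # A byte order mark, when present, identifies the encoding reliably.
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):].decode("utf-8"), "utf-8"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return raw.decode("utf-16"), "utf-16"
    # Without a BOM, try strict UTF-8 first (plain ASCII is a subset of it),
    # then code page 1252, which rejects a few undefined byte values.
    for encoding in ("utf-8", "cp1252"):
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # ISO 8859-1 maps every byte to a character, so it never fails.
    return raw.decode("latin-1"), "latin-1"


print(decode_corpus_bytes("café".encode("utf-8")))
print(decode_corpus_bytes("café".encode("cp1252")))
print(decode_corpus_bytes("한국어".encode("utf-16")))
```

The order of the attempts matters: single-byte encodings accept almost any input, so they are tried last and only as a fallback.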
A key issue in CL is how meta-data are handled. SACODEYL Annotator allows the
different XML entities in a corpus, i.e. texts, to be assigned all kinds of meta-information such as
title, author, editor, date, participants, description, language, etc. Figure 3 shows how this is
done:
Figure 3. SACODEYL Annotator meta-data screen
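In a TEI document, meta-information of this kind is normally kept in the header. The Python sketch below builds a minimal header with the standard library; the element names follow the TEI Guidelines, but which elements SACODEYL Annotator actually writes is not stated here, so the selection, the function name and the placeholder values are assumptions.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# Serialize TEI elements without a prefix (default namespace).
ET.register_namespace("", TEI_NS)


def _el(parent, name, text=None, **attrs):
    """Append a TEI-namespaced child element and return it."""
    node = ET.SubElement(parent, f"{{{TEI_NS}}}{name}", attrs)
    node.text = text
    return node


def build_header(title, author, lang_code, lang_name):
    """Build a minimal teiHeader holding title, author and language."""
    header = ET.Element(f"{{{TEI_NS}}}teiHeader")
    file_desc = _el(header, "fileDesc")
    title_stmt = _el(file_desc, "titleStmt")
    _el(title_stmt, "title", title)
    _el(title_stmt, "author", author)
    # publicationStmt and sourceDesc are required parts of fileDesc;
    # the prose inside them is placeholder text.
    _el(_el(file_desc, "publicationStmt"), "p", "Unpublished sample")
    _el(_el(file_desc, "sourceDesc"), "p", "Transcribed interview")
    lang_usage = _el(_el(header, "profileDesc"), "langUsage")
    _el(lang_usage, "language", lang_name, ident=lang_code)
    return header


header = build_header("Una semana de mi vida", "Anonymous contributor",
                      "es", "Spanish")
print(ET.tostring(header, encoding="unicode"))
```

Further fields such as editor, date or participants would be added in the same way, each under the header component that the Guidelines assign to it.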
An interesting feature is the possibility for annotators to attach external resources or data to a
particular section. In the framework of SACODEYL, it has been envisaged that this
feature will be used to enrich the corpus pedagogically and feed the DDL web system with links
to web resources such as pages, multimedia, textual resources and foreign language teaching (FLT) activities. In
SACODEYL we have for the most part used this feature to feed our DDL web system, although
learners or teachers may very well use it to enrich their language experiences in different ways.
It is worth mentioning that most annotation tools will not let users modify or edit the
linguistic data that is being annotated. In the framework of SACODEYL this powerful feature has
played a very important role in securing consistency and accuracy. These textual alterations
can be easily made while preserving the annotated tags, which facilitates the
transcription-annotation-data delivery process. This is another feature that will be of interest to different
professionals in a wide array of fields.
So far we have discussed the usefulness of SACODEYL Annotator in the annotation of corpus-based
learning resources. However, the tool has been designed with a generic use in mind.
SACODEYL Annotator is language input-independent, as different languages and text
typologies can be annotated. A case in point is spoken language, where different contributors can
be represented in the tool interface, making the annotation and navigation process more
intuitive. Also, the tool is discipline-independent, since annotators are free
to establish the use that the corpus will be put to and, accordingly, the discipline
where the annotated corpus is to be delivered. For users who are not XML-aware, it is
particularly relevant that both annotation and ‘taxonomy definition’ can
be performed with the same tool and on the same screen interface.
Within the field of language and linguistics, SACODEYL Annotator allows very refined uses
and applications. Some of these include translation and interpreting studies, general and
specific language learning purposes, computational studies, creation of folksonomies and the
generation of ontologies. Last but not least, SACODEYL Annotator is multi-user oriented, as it
may cater for different and simultaneous needs, such as those of teachers, learners and
materials developers.
Having discussed the generic potential of the tool, let us now briefly describe the technology that
makes these generic uses possible in SACODEYL Annotator: the Text Encoding Initiative (TEI).
17.2.2. TEI as standardization method
One of the challenges to be met by system developers is that of standards and normalization. In
our case the main issue was to decide on the way in which our linguistic data and our annotation
should be stored. As discussed in Pérez-Paredes and Alcaraz (2009), our aim was to develop
tools and products that could be reusable and, in this way, contribute significantly to the
ever-growing open-content movement. We were aware of the fact that existing ad hoc solutions
could provide us with tools that could do the job, but we still felt that, given the nature of our
initiative, we should strive for standardization as a goal.
With this goal in mind, we decided to use the standard XML representation of the Text
Encoding Initiative. TEI is a widely used standard for text encoding that provides an XML
schema for storing corpora and the metadata associated with them. The main aim of
TEI is to offer a common framework for text encoding and cover all the different aspects and
features that could be associated with any text or corpus. This way, spoken discourse features
such as pauses or breaks can be treated uniformly across different software applications. In the
case of written texts, structural divisions of a text at different layers (documents,
sections, paragraphs, sentences or words), bibliographic descriptions, tables of contents, tagset
descriptions and metadata can be conveniently stored in standard XML and processed with a wide range of
tools.
The number of XML tools that support TEI is growing steadily: oXygen by SyncRO Soft
(2007), OpenOffice (Haugland and Jones 2002), TEI Emacs (Lease 2005), Anastasia, by
Scholarly Digital Editions (2004), TEI Publisher (Lease 2005) and, inter alia, Xaira
(Burnard 1995). SACODEYL Annotator benefits from this standard encoding while
providing users with an extremely intuitive interface. Our SACODEYL XML files are
corpus files that contain the language data, the language data structure information and the
annotation proper (Pérez-Paredes et al. 2007).
It must be stressed that the tool easily adapts to the needs of advanced users or computational
linguists who wish to work on the XML code itself: XML-aware
users can edit the underlying code directly. This can be better appreciated in
Figure 4:
Figure 4. SACODEYL Annotator XML edition and exploration screen.
17.3. Annotation in the foreign language classroom
Adapting texts and corpora to the needs of the language classroom is an area where
SACODEYL Annotator may be instrumental. The Text Encoding Initiative
allows the subdivision of a text into meaningful fragments for analytic purposes, a feature which
has been adapted in SACODEYL Annotator for the representation of our own
section element. The annotation categories are declared in a <classDecl> element, which allows
annotators to create extensible subcategories as they see fit. The section element has been
mapped onto the <div> element, whose decls attribute points to the categories applied to the
section. An example of the categories annotated on a section of the
Spanish corpus follows:
<div decls="#routinesTopic #Adverbios #TextOrganizationFeatures #futurePlanTopic #TipicalOffSpokenLang"
type="event" xml:id="R2738C0D1">
<head>Una semana de mi vida</head>
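As an illustration of how an application could read this annotation back, the following Python sketch parses the excerpt and lists the categories it points to. Three things are assumptions made for the example: a closing </div> tag has been added so that the fragment parses, the last pointer is read as the single identifier #TipicalOffSpokenLang, and the TEI namespace that a complete file would declare is left out. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Attribute key under which ElementTree exposes xml:id.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

SAMPLE = """<div decls="#routinesTopic #Adverbios #TextOrganizationFeatures
 #futurePlanTopic #TipicalOffSpokenLang" type="event" xml:id="R2738C0D1">
<head>Una semana de mi vida</head>
</div>"""


def section_categories(div):
    """Return the category identifiers referenced by a section's decls."""
    # decls holds whitespace-separated pointers of the form "#identifier".
    return [pointer.lstrip("#") for pointer in div.get("decls", "").split()]


div = ET.fromstring(SAMPLE)
print(div.get(XML_ID), div.findtext("head"))
for category in section_categories(div):
    print(" ", category)
```

Each identifier printed here would correspond to a category declared in the <classDecl> element of the same corpus file.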
An important feature of the SACODEYL system is that every corpus can be browsed and
searched dynamically, in the sense that each corpus informs our search tool about the different
annotated categories that have been applied to the corresponding sections. Figure 5 exemplifies
this point:
Figure 5. Annotation in action as displayed by SACODEYL Search Tool
This is a major breakthrough in the customization of corpus-based language learning and
teaching. To date, language professionals have been prompted to make use of materials
whose primary orientation was linguistic research. In this sense, annotation can give language
learning and teaching stakeholders the chance to adapt corpus methods and resources to the type
of authenticity that is sought after in the language classroom. The search interface shown in
Figure 5 can be reached from the SACODEYL website http://www.um.es/sacodeyl or from the
SACODEYL dedicated server http://www.purl.org/sacodeyl/search. This search interface
dynamically reads the annotation and renders a query tree based on the information which has
been provided by the annotators. This is a ready-to-use example of how pedagogic annotation
can be used in varied language learning contexts. In the case of SACODEYL the aim was to
develop annotation which could serve as a pedagogic mediator in the foreign language
learning process of young Europeans. Although the possibilities are unlimited, the annotation categories
which were used by the seven language teams in SACODEYL focused mainly on
topics, CEFRL levels and the features of spoken language.
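The idea of a search interface that derives its query options from the corpus itself can be sketched in a few lines of Python. This is not the SACODEYL Search Tool implementation: the miniature corpus, the section identifiers, two of the three headings and the function name are invented, and namespaces are omitted for brevity.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented miniature corpus with three annotated sections.
CORPUS = """<body>
  <div decls="#hobbyTopic #usingTechnologies" xml:id="s1">
    <head>On the Internet</head>
  </div>
  <div decls="#hobbyTopic" xml:id="s2"><head>Sample section A</head></div>
  <div decls="#routinesTopic" xml:id="s3"><head>Sample section B</head></div>
</body>"""


def build_index(root):
    """Map each annotated category to the sections that point to it."""
    index = defaultdict(list)
    for div in root.iter("div"):
        section_id = div.get(XML_ID)
        for pointer in div.get("decls", "").split():
            index[pointer.lstrip("#")].append(section_id)
    return dict(index)


index = build_index(ET.fromstring(CORPUS))
# The query options are whatever categories the annotators actually used.
for category in sorted(index):
    print(f"{category} ({len(index[category])} sections)")
```

Because the index is computed from the file, adding a new category in the annotation tool would make it searchable without any change to the search code.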
Learners and teachers interested in evidence-driven language learning can use the power of
annotation to query multimodal corpora. Say, a group of learners is interested in learning more
about the hobbies topic area.
Figure 6. Searching for sections where “Hobby” has been annotated
The Search interface (Figure 6) displays 71 results for this corpus, which is probably far too
many. Learners may now want to refine their search and establish technology as a subset within
the results.
Figure 7. Refining a search
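Refining a search in this way amounts to keeping only the sections that carry every selected category. A minimal Python sketch of that logic follows; the section identifiers and their annotations are invented, the function name is hypothetical, and the actual search tool may work differently.

```python
def refine(sections, *required):
    """Return the ids of sections annotated with all required categories."""
    wanted = set(required)
    return [sid for sid, tags in sections.items() if wanted <= tags]


# Invented sections and annotations, for illustration only.
sections = {
    "s1": {"Hobbies", "Using technologies", "Modality"},
    "s2": {"Hobbies"},
    "s3": {"Hobbies", "Using technologies"},
    "s4": {"Routines"},
}

print(refine(sections, "Hobbies"))
print(refine(sections, "Hobbies", "Using technologies"))
print(refine(sections, "Hobbies", "Using technologies", "Modality"))
```

Each additional category can only shrink the result list, which is why successive refinements lead from a broad topic to a handful of sections.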
Now the learners get seven sections annotated as Hobbies > Using technologies. In a way, we
have applied CL methods to the notion of topic and pedagogic section, which we expect to be
useful in most FLT contexts. Learners now have sections which deal with a very restricted
thematic area and which can be further searched. Figure 8 shows how one of these sections has
been annotated as displaying modality, while retaining the thematic feature:
Figure 8. A section in the SACODEYL Search Tool
This section, called “On the Internet”, can be viewed in isolation or in the context of
the whole interview/text and, interestingly, displays in red a feature which the annotation team
has found of interest from a pedagogic perspective: modality. The search has now been
expanded into Hobbies > Using technologies > Modality by way of the suggested features
added by the annotators. If the learner is interested in the section, she can watch it, as shown in
Figure 9:
Figure 9. A multimodal section
Of course, word search is central to the application. Figure 10 shows a search for “Facebook” in
SACODEYL:
Figure 10. Word search
Using this search, learners could build up a sense of the contexts where one could expect to find
Facebook in discourse, that is, being a member of Facebook, find Facebook really good,
Facebook is for slightly older people, go on Facebook, etc. While such a corpus is not as
representative of English discourse as the BNC, it can still compensate for the important weaknesses
of representative corpora in pedagogic contexts.
17.4. Conclusions and future work
SACODEYL Annotator has already enabled the SACODEYL team to accomplish DDL-oriented
pedagogical annotation (Tornero et al. 2007). However, it is our intention to refine and
improve the tool to make it as generic and flexible as possible.
The tool may contribute to building knowledge in many disciplines and provide textual
resources with different kinds of annotated enrichment. In particular, the tool could be helpful in
CALL-related fields by providing high quality pedagogical materials stored in a standard format.
Furthermore, these materials could also be re-used by the wide range of tools that support TEI.
SACODEYL is then the first major effort where pedagogic DDL has been implemented. By
using TEI standardization, we hope to make this effort even more meaningful to the FLT and
linguistic community. This environment can be viewed as a language learning platform which
integrates multimodal search facilities, including section search and browse plus the more
traditional concordance lines.
So far we have implemented the P5 version of the TEI Guidelines, which was released on
1 November 2007. Future work on SACODEYL Annotator is focused on the
dissemination of the tool in connection with Text Encoding Initiative tools and utilities such
as XAIRA (Burnard 2004), TAPoR⁴, PhiloLogic⁵ or WordHoard⁶.
A wiki⁷ has been established to attract the interest of fellow researchers in pedagogical
annotation, and we expect to continue to develop SACODEYL Annotator into a more powerful,
device- and system-independent tool to store and process texts and corpora that can be used in
the language classroom.
References
Anastasia Scholarly Digital Editions. (2004), ‘Anastasia: Analytical System Tools and
SGML/XML Integration Applications’. Available through
http://anastasia.sourceforge.net/whatis.html
Atserias, J., Casas, B., Comelles, E., González, M., Padró, L. and Padró, M. (2006),
‘FreeLing 1.3: Syntactic and semantic services in an open-source NLP library’, in
Proceedings of the 5th International Conference on Language Resources and Evaluation
(LREC'06). Genoa, Italy.
Bayer, S., Doran, C., Condon, S. and Gertner, A. (2006), ‘Dialogue annotation as a correction
task’, in 9th International Conference on Intelligent User Interfaces, 2006.
Burnard, L. (2004), ‘BNC-Baby and Xaira’, in Proceedings of the Sixth Teaching and Language
Corpora conference, Granada, pp. 84.
Braun, S. (2005), ‘From pedagogically relevant corpora to authentic language learning
contents’, ReCALL 17, (1), 47-64.
Braun, S. (2006), ‘ELISA - a pedagogically enriched corpus for language learning purposes’, in
S. Braun, K. Kohn, and J. Mukherjee (eds), Corpus Technology and Language Pedagogy: New
Resources, New Tools, New Methods. Frankfurt/M: Peter Lang, pp. 25-47.
Braun, S. (2007), ‘Integrating corpus work into secondary education: from data-driven learning
to needs-driven corpora’. ReCALL 19, (3), 307-328.
Burnard, L. (1995), ‘The Text Encoding Initiative: an overview’, in G. Leech, G. Myers, and J.
Thomas (eds), Spoken English on Computer: Transcription, Markup and Applications. Harlow:
Longman.
Burnard, L. and Berglund, Y. (2007), ‘Exploring BNC XML Edition with Xaira’. 28th Annual
Conference of the International Computer Archive for Modern and Mediaeval English
(ICAME).
Cushion, S. (2004), ‘Increasing accessibility by pooling digital resources’. ReCALL 16, (1), 41-50.
Cutting, D. and Pedersen, J. (1993), The xerox part-of-speech tagger version. Via
citeseer.ist.psu.edu/cutting93xerox.html.
EAGLES Guidelines. (1996), Expert Advisory Group on Language Engineering Standards.
http://www.ilc.cnr.it/EAGLES96/browse.html
Garretson, G. (2006), ‘Dexter: free tools for analyzing texts’. V International Congress of
AELFE. Academic and Professional Communication in the 21st Century: Genres and Rhetoric
in the Construction of Disciplinary Knowledge.
Garside, R. (1987), ‘The CLAWS word-tagging system’, in R. Garside, G. Leech, and G.
Sampson (eds), The Computational Analysis of English. Harlow: Longman.
Grover, C., Matthews, M. and Tobin, R. (2006), ‘Tools to Address the Interdependence
between Tokenisation and Standoff Annotation’, in Proceedings of NLPXML-2006
(Multi-dimensional Markup in Natural Language Processing), pp. 19-26.
Haugland, S. and Jones, F. (2002), OpenOffice.Org 1.0 Resource Kit. Indianapolis: Prentice
Hall PTR.
Jacobson, M. (2006), ‘Le projet "Archivage" du LACITO’. Langues et cité 6, pp. 11.
Lease, E. (2005), ‘Creating and managing XML with open source software’. Library Hi Tech
Journal 23, (4), 526-540.
Leech, G. (1993), ‘Corpus annotation schemes’. Literary and Linguistic Computing 8, (4), 275-281.
Levy, M. (1997), Computer-Assisted Language Learning: Context and Conceptualization.
Oxford: Oxford University Press.
McEnery, T. and Wilson, A. (1996), Corpus linguistics. Edinburgh: University of Edinburgh
Press.
Mercader, A., Pérez-Paredes, P., Alcaraz, J. M. and Tornero, E. (2007), ‘The role of pedagogic
annotation in DDL’. Paper presented at the 1st International Conference on Corpus-Based
Approaches to ELT. Universitat Jaume I, Castellón, November 2007.
Needleman, M. (2000), ‘The Unicode Standard’. Serials Review 26, (2), 51-54.
Pérez-Paredes, P., Alcaraz, J. M., Mercader, A. and Tornero, E. (2007), ‘Extracting data from
XML annotated corpora: not so mysterious ways into data driven learning (DDL)’. Paper
presented at the 1st International Conference on Corpus-Based Approaches to ELT. Universitat
Jaume I, Castellón, November 2007.
Pérez-Paredes, P. and Alcaraz, J. M. (2009), ‘Developing annotation solutions for online
data-driven learning’. ReCALL 21, (1), 55-75.
Schmid, H. (1995), TreeTagger - a language independent part-of-speech tagger. Institut für
Maschinelle Sprachverarbeitung.
SyncRO Soft Ltd. (2007), oXygen XML Editor. http://www.oxygenxml.com
Schmidt, T. (2004), ‘Transcribing and annotating spoken language with EXMARaLDA’, in
Proceedings of the LREC-Workshop on XML based richly annotated corpora, Lisbon 2004.
Tornero, E., Pérez-Paredes, P., Mercader, A. and Alcaraz, J. M. (2007), ‘Annotating Spanish
youngsters spoken language for DDL applications’. Paper presented at the EUROCALL
Conference. University of Ulster at Coleraine, September 2007.
Ward, M. (2002), ‘Reusable XML technologies and the development of language learning
materials’. ReCALL 14, (2), 285-294.
¹ System Aided Compilation and Open Distribution of European Youth Language research funded by the
European Commission under the Socrates-Minerva initiative (225836-CP-1-2005-1-ES-MINERVA).
² The importance of standards in computer science lies beyond the scope of this article. Suffice it to say
that if standard XML is used, more and more users and applications will reuse the annotated resources.
³ SACODEYL Site, http://www.um.es/sacodeyl. URL last accessed 15/07/2009
⁴ http://portal.tapor.ca/portal/portal URL last accessed 15/07/2009
⁵ http://www.lib.uchicago.edu/efts/ARTFL/philologic/ URL last accessed 15/07/2009
⁶ http://wordhoard.northwestern.edu/userman/index.html URL last accessed 15/07/2009
⁷ http://www.tei-c.org/wiki/index.php/Sacodeyl_Annotator URL last accessed 15/07/2009