Haejoong Lee

2014

The DARPA BOLT Program develops systems capable of allowing English speakers to retrieve and understand information from informal foreign language genres. Phase 2 of the program required large volumes of naturally occurring informal text (SMS) and chat messages from individual users in multiple languages to support evaluation of machine translation systems. We describe the design and implementation of a robust collection system capable of capturing both live and archived SMS and chat conversations from willing participants. We also discuss the challenges recruitment at a time when potential participants have acute and growing concerns about their personal privacy in the realm of digital communication, and we outline the techniques adopted to confront those challenges. Finally, we review the properties of the resulting BOLT Phase 2 Corpus, which comprises over 6.5 million words of naturally-occurring chat and SMS in English, Chinese and Egyptian Arabic.

pdf bib
Aikuma: A Mobile App for Collaborative Language Documentation
Steven Bird | Florian R. Hanke | Oliver Adams | Haejoong Lee
Proceedings of the 2014 Workshop on the Use of Computational Methods in the Study of Endangered Languages

2012

Linguistic Data Consortium and the National Institute of Standards and Technology are collaborating to create a large, heterogeneous annotated multimodal corpus to support research in multimodal event detection and related technologies. The HAVIC (Heterogeneous Audio Visual Internet Collection) Corpus will ultimately consist of several thousands of hours of unconstrained user-generated multimedia content. HAVIC has been designed with an eye toward providing increased challenges for both acoustic and video processing technologies, focusing on multi-dimensional variation inherent in user-generated multimedia content. To date the HAVIC corpus has been used to support the NIST 2010 and 2011 TRECVID Multimedia Event Detection (MED) Evaluations. Portions of the corpus are expected to be released in LDC's catalog in the coming year, with the remaining segments being published over time after their use in the ongoing MED evaluations.

2010

pdf bib abs
Transcription Methods for Consistency, Volume and Efficiency
Meghan Lammie Glenn | Stephanie M. Strassel | Haejoong Lee | Kazuaki Maeda | Ramez Zakhary | Xuansong Li
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper describes recent efforts at Linguistic Data Consortium at the University of Pennsylvania to create manual transcripts as a shared resource for human language technology research and evaluation. Speech recognition and related technologies in particular call for substantial volumes of transcribed speech for use in system development, and for human gold standard references for evaluating performance over time. Over the past several years LDC has developed a number of transcription approaches to support the varied goals of speech technology evaluation programs in multiple languages and genres. We describe each transcription method in detail, and report on the results of a comparative analysis of transcriber consistency and efficiency, for two transcription methods in three languages and five genres. Our findings suggest that transcripts for planned speech are generally more consistent than those for spontaneous speech, and that careful transcription methods result in higher rates of agreement when compared to quick transcription methods. We conclude with a general discussion of factors contributing to transcription quality, efficiency and consistency.

pdf bib abs
Technical Infrastructure at Linguistic Data Consortium: Software and Hardware Resources for Linguistic Data Creation
Kazuaki Maeda | Haejoong Lee | Stephen Grimes | Jonathan Wright | Robert Parker | David Lee | Andrea Mazzucchi
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Linguistic Data Consortium (LDC) at the University of Pennsylvania has participated as a data provider in a variety of governmentsponsored programs that support development of Human Language Technologies. As the number of projects increases, the quantity and variety of the data LDC produces have increased dramatically in recent years. In this paper, we describe the technical infrastructure, both hardware and software, that LDC has built to support these complex, large-scale linguistic data creation efforts at LDC. As it would not be possible to cover all aspects of LDCs technical infrastructure in one paper, this paper focuses on recent development. We also report on our plans for making our custom-built software resources available to the community as open source software, and introduce an initiative to collaborate with software developers outside LDC. We hope that our approaches and software resources will be useful to the community members who take on similar challenges.

2008

pdf bib abs
Management of Large Annotation Projects Involving Multiple Human Judges: a Case Study of GALE Machine Translation Post-editing
Meghan Lammie Glenn | Stephanie Strassel | Lauren Friedman | Haejoong Lee | Shawn Medero
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Managing large groups of human judges to perform any annotation task is a challenge. Linguistic Data Consortium coordinated the creation of manual machine translation post-editing results for the DARPA Global Autonomous Language Exploration Program. Machine translation is one of three core technology components for GALE, which includes an annual MT evaluation administered by National Institute of Standards and Technology. Among the training and test data LDC creates for the GALE program are gold standard translations for system evaluation. The GALE machine translation system evaluation metric is edit distance, measured by HTER (human translation edit rate), which calculates the minimum number of changes required for highly-trained human editors to correct MT output so that it has the same meaning as the reference translation. LDC has been responsible for overseeing the post-editing process for GALE. We describe some of the accomplishments and challenges of completing the post-editing effort, including developing a new web-based annotation workflow system, and recruiting and training human judges for the task. In addition, we suggest that the workflow system developed for post-editing could be ported efficiently to other annotation efforts.

pdf bib abs
Annotation Tool Development for Large-Scale Corpus Creation Projects at the Linguistic Data Consortium
Kazuaki Maeda | Haejoong Lee | Shawn Medero | Julie Medero | Robert Parker | Stephanie Strassel
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The Linguistic Data Consortium (LDC) creates a variety of linguistic resources - data, annotations, tools, standards and best practices - for many sponsored projects. The programming staff at LDC has created the tools and technical infrastructures to support the data creation efforts for these projects, creating tools and technical infrastructures for all aspects of data creation projects: data scouting, data collection, data selection, annotation, search, data tracking and worklow management. This paper introduces a number of samples of LDC programming staffs work, with particular focus on the recent additions and updates to the suite of software tools developed by LDC. Tools introduced include the GScout Web Data Scouting Tool, LDC Data Selection Toolkit, ACK - Annotation Collection Kit, XTrans Transcription and Speech Annotation Tool, GALE Distillation Toolkit, and the GALE MT Post Editing Workflow Management System.

2006

We report on the success of a two-pass approach to annotating metadata, speech effects and syntactic structure in English conversational speech: separately annotating transcribed speech for structural metadata, or structural events, (fillers, speech repairs ( or edit dysfluencies) and SUs, or syntactic/semantic units) and for syntactic structure (treebanking constituent structure and shallow argument structure). The two annotations were then combined into a single representation. Certain alignment issues between the two types of annotation led to the discovery and correction of annotation errors in each, resulting in a more accurate and useful resource. The development of this corpus was motivated by the need to have both metadata and syntactic structure annotated in order to support synergistic work on speech parsing and structural event detection. Automatic detection of these speech phenomena would simultaneously improve parsing accuracy and provide a mechanism for cleaning up transcriptions for downstream text processing. Similarly, constraints imposed by text processing systems such as parsers can be used to help improve identification of disfluencies and sentence boundaries. This paper reports on our efforts to develop a linguistic resource providing both spoken metadata and syntactic structure information, and describes the resulting corpus of English conversational speech.

pdf bib abs
A New Phase in Annotation Tool Development at the Linguistic Data Consortium: The Evolution of the Annotation Graph Toolkit
Kazuaki Maeda | Haejoong Lee | Julie Medero | Stephanie Strassel
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

The Linguistic Data Consortium (LDC) has created various annotated linguistic data for a variety of common task evaluation programs and projects to create shared linguistic resources. The majority of these annotated linguistic data were created with highly customized annotation tools developed at LDC. The Annotation Graph Toolkit (AGTK) has been used as a primary infrastructure for annotation tool development at LDC in recent years. Thanks to the direct feedback from annotation task designers and annotators in-house, annotation tool development at LDC has entered a new, more mature and productive phase. This paper describes recent additions to LDC's annotation tools that are newly developed or significantly improved since our last report at the Fourth International Conference on Language Resource and Evaluation Conference in 2004. These tools are either directly based on AGTK or share a common philosophy with other AGTK tools.

2002

pdf bib
Models and Tools for Collaborative Annotation
Xiaoyi Ma | Haejoong Lee | Steven Bird | Kazuaki Maeda
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

pdf bib
TableTrans, MultiTrans, InterTrans and TreeTrans: Diverse Tools Built on the Annotation Graph Toolkit
Steven Bird | Kazuaki Maeda | Xiaoyi Ma | Haejoong Lee | Beth Randall | Salim Zayat
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

pdf bib
Creating Annotation Tools with the Annotation Graph Toolkit
Kazauki Maeda | Steven Bird | Xiaoyi Ma | Haejoong Lee
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

2001

pdf bib
Annotation Tools Based on the Annotation Graph API
Steven Bird | Kazuaki Maeda | Xiaoyi Ma | Haejoong Lee
Proceedings of the ACL 2001 Workshop on Sharing Tools and Resources

pdf bib
The Annotation Graph Toolkit: Software Components for Building Linguistic Annotation Tools
Kazuaki Maeda | Steven Bird | Xiaoyi Ma | Haejoong Lee
Proceedings of the First International Conference on Human Language Technology Research