Using semantics for granularities of tokenization

Authors:

Martin Riedl,

Chris Biemann

Computational Linguistics, Volume 44, Issue 3

Pages 483 - 524

https://doi.org/10.1162/coli_a_00325

Published: 01 September 2018

Abstract

Depending on downstream applications, it is advisable to extend the notion of tokenization from low-level character-based token boundary detection to identification of meaningful and useful language units. This entails both identifying units composed of several single words that form a multiword expression (MWE) and splitting single-word compounds into their meaningful parts. In this article, we introduce unsupervised and knowledge-free methods for these two tasks. The main novelty of our research is that the methods are primarily based on distributional similarity, of which we use two flavors: a sparse count-based and a dense neural-based distributional semantic model. First, we introduce DRUID, a method for detecting MWEs. The evaluation on MWE-annotated data sets in two languages and newly extracted evaluation data sets for 32 languages shows that DRUID compares favorably with previous methods that do not use distributional information. Second, we present SECOS, an algorithm for decompounding close compounds. In an evaluation of four dedicated decompounding data sets across four languages and on data sets extracted from Wiktionary for 14 languages, we demonstrate the superiority of our approach over unsupervised baselines, sometimes even matching the performance of previous language-specific and supervised methods. In a final experiment, we show how both decompounding and MWE information can be used in information retrieval. Here, we obtain the best results when combining word information with MWEs and the compound parts in a bag-of-words retrieval set-up. Overall, our methodology paves the way to automatic detection of lexical units beyond standard tokenization techniques without language-specific preprocessing steps such as POS tagging.
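
The final retrieval set-up mentioned in the abstract, a bag of words that combines plain words with MWEs and compound parts, can be illustrated with a minimal Python sketch. This is not the authors' implementation: in the article the MWEs would come from DRUID and the splits from SECOS, whereas here both resources are small hand-written stand-ins, and the names `enrich_tokens`, `mwe_lexicon`, `compound_splits`, and `max_mwe_len` are hypothetical.

```python
from collections import Counter


def enrich_tokens(tokens, mwe_lexicon, compound_splits, max_mwe_len=3):
    """Build a bag of words from the original tokens, any MWEs found in
    the lexicon, and the parts of any token that has a known split."""
    bag = Counter(tokens)

    # Add every contiguous n-gram (n >= 2) that is listed as an MWE.
    for n in range(2, max_mwe_len + 1):
        for i in range(len(tokens) - n + 1):
            ngram = tuple(tokens[i:i + n])
            if ngram in mwe_lexicon:
                bag[" ".join(ngram)] += 1

    # Add compound parts alongside the unsplit compound itself.
    for tok in tokens:
        for part in compound_splits.get(tok, ()):
            bag[part] += 1

    return bag


# Hypothetical toy resources standing in for MWE detection and
# decompounding output; they are assumptions made for illustration.
mwe_lexicon = {("hot", "dog")}
compound_splits = {"bookshelf": ["book", "shelf"]}

tokens = "the hot dog sat on the bookshelf".split()
bag = enrich_tokens(tokens, mwe_lexicon, compound_splits)
```

Keeping the original words next to the added units means a query for "dog", "hot dog", or "shelf" can each match the same document, which is the intuition behind combining the three kinds of information.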

Published In

Computational Linguistics Volume 44, Issue 3

September 2018

225 pages

ISSN: 0891-2017

EISSN: 1530-9312

Publisher

MIT Press

Cambridge, MA, United States

Qualifiers

Article
