langid.py: an off-the-shelf language identification tool

Authors:

Timothy Baldwin

ACL '12: Proceedings of the ACL 2012 System Demonstrations

Pages 25 - 30

Published: 10 July 2012

Abstract

We present langid.py, an off-the-shelf language identification tool. We discuss the design and implementation of langid.py and provide an empirical comparison on five long-document datasets and two datasets from the microblog domain. We find that langid.py maintains consistently high accuracy across all domains, making it ideal for end users who require language identification but do not want to invest in preparing in-domain training data.

Published In

ACL '12: Proceedings of the ACL 2012 System Demonstrations

July 2012

186 pages

Conference Chair:
Min Zhang
Institute for Infocomm Research, Singapore

Publisher

Association for Computational Linguistics

United States

Qualifiers

Research-article

Acceptance Rates

Overall Acceptance Rate 85 of 443 submissions, 19%
