
langid.py: an off-the-shelf language identification tool

Published: 10 July 2012

Abstract

We present langid.py, an off-the-shelf language identification tool. We discuss the design and implementation of langid.py and provide an empirical comparison on 5 long-document datasets and 2 datasets from the microblog domain. We find that langid.py maintains consistently high accuracy across all domains, making it ideal for end-users who require language identification but do not want to invest in preparing in-domain training data.



Information

Published In

ACL '12: Proceedings of the ACL 2012 System Demonstrations
July 2012
186 pages
  • Conference Chair: Min Zhang

Publisher

Association for Computational Linguistics, United States

Qualifiers

  • Research-article

Acceptance Rates

Overall Acceptance Rate 85 of 443 submissions, 19%

Article Metrics

  • Downloads (last 12 months): 61
  • Downloads (last 6 weeks): 19

Reflects downloads up to 21 Nov 2024

Cited By

  • (2023) Privacy Now or Never. Proceedings of the ACM Symposium on Document Engineering 2023 (1-4). DOI: 10.1145/3573128.3609342. Online publication date: 22-Aug-2023
  • (2020) Experimenting Language Identification for Sentiment Analysis of English Punjabi Code Mixed Social Media Text. International Journal of E-Adoption 12:1 (52-62). DOI: 10.4018/IJEA.2020010105. Online publication date: 1-Jan-2020
  • (2020) Mining multilingual and multiscript Twitter data. International Journal of Business Intelligence and Data Mining 16:1 (107-127). DOI: 10.1504/ijbidm.2020.103847. Online publication date: 1-Jan-2020
  • (2020) Studying Politeness across Cultures using English Twitter and Mandarin Weibo. Proceedings of the ACM on Human-Computer Interaction 4:CSCW2 (1-15). DOI: 10.1145/3415190. Online publication date: 15-Oct-2020
  • (2020) Language identification on massive datasets of short messages using an attention mechanism CNN. Proceedings of the 12th IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (16-23). DOI: 10.1109/ASONAM49781.2020.9381393. Online publication date: 7-Dec-2020
  • (2019) Automatic language identification in texts. Journal of Artificial Intelligence Research 65:1 (675-682). DOI: 10.1613/jair.1.11675. Online publication date: 1-May-2019
  • (2018) The Talk of Norway. Language Resources and Evaluation 52:3 (873-893). DOI: 10.5555/3270332.3270376. Online publication date: 1-Sep-2018
  • (2018) Analyzing Right-wing YouTube Channels. Proceedings of the 10th ACM Conference on Web Science (323-332). DOI: 10.1145/3201064.3201081. Online publication date: 15-May-2018
  • (2018) Measuring, Understanding, and Classifying News Media Sympathy on Twitter after Crisis Events. Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (1-13). DOI: 10.1145/3173574.3174130. Online publication date: 21-Apr-2018
  • (2017) Language Identification in Mixed Script. Proceedings of the 9th Annual Meeting of the Forum for Information Retrieval Evaluation (14-20). DOI: 10.1145/3158354.3158357. Online publication date: 8-Dec-2017
