DOI: 10.3115/1075096.1075146

Unsupervised learning of Arabic stemming using a parallel corpus

Published: 07 July 2003

Abstract

This paper presents an unsupervised learning approach to building a non-English (Arabic) stemmer. The stemming model is based on statistical machine translation and uses an English stemmer and a small (10K sentences) parallel corpus as its sole training resources. No parallel text is needed after the training phase. Monolingual, unannotated text can be used to further improve the stemmer by allowing it to adapt to a desired domain or genre. Examples and results are given for Arabic, but the approach is applicable to any language that needs affix removal. Our resource-frugal approach results in 87.5% agreement with a state-of-the-art, proprietary Arabic stemmer built using rules, affix lists, and human-annotated text, in addition to an unsupervised component. Task-based evaluation using Arabic information retrieval indicates an improvement of 22-38% in average precision over unstemmed text, and reaches 96% of the performance of that proprietary stemmer.
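
The abstract reports two kinds of numbers: agreement between two stemmers and average precision in retrieval. It does not say exactly how either was computed, so the following is only a minimal sketch under stated assumptions: agreement is taken as the fraction of tokens on which two stemmers output the same stem, and average precision is the standard uninterpolated form for a single query. The function names, the Buckwalter-style tokens, and the document IDs are hypothetical and are not taken from the paper.

```python
from typing import Iterable, Sequence


def agreement_rate(stems_a: Sequence[str], stems_b: Sequence[str]) -> float:
    """Fraction of token positions where two stemmers give the same stem.

    Assumption: both stemmers were run on the same token sequence, so the
    outputs are aligned position by position.
    """
    if len(stems_a) != len(stems_b):
        raise ValueError("both outputs must cover the same tokens")
    if not stems_a:
        raise ValueError("cannot compute agreement on empty output")
    matches = sum(a == b for a, b in zip(stems_a, stems_b))
    return matches / len(stems_a)


def average_precision(ranked_ids: Sequence[str], relevant_ids: Iterable[str]) -> float:
    """Uninterpolated average precision for one query.

    Precision is sampled at each rank that holds a relevant document and
    the sum is divided by the total number of relevant documents.
    """
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("at least one relevant document is required")
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


# Hypothetical outputs of two stemmers on four tokens (Buckwalter-style).
learned = ["ktAb", "wld", "drs", "byt"]
reference = ["ktAb", "wld", "mdrs", "byt"]
agreement = agreement_rate(learned, reference)

# Hypothetical ranking for one query with two relevant documents.
ap = average_precision(["d3", "d1", "d7", "d2"], {"d1", "d2"})
```

In this toy data the stemmers agree on three of four tokens, and the two relevant documents appear at ranks 2 and 4. A type-level agreement (counting each distinct word once) would be an equally plausible reading of the reported figure.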

Published In

ACL '03: Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1
July 2003
571 pages

Publisher

Association for Computational Linguistics

United States

Acceptance Rates

Overall Acceptance Rate 85 of 443 submissions, 19%

Cited By

  • (2018) A Study of Graph Based Stemmer in Arabic Extrinsic Plagiarism Detection. Proceedings of the 2nd Mediterranean Conference on Pattern Recognition and Artificial Intelligence, pages 27-32. DOI: 10.1145/3177148.3180089. Online publication date: 27-Mar-2018
  • (2017) Enhancing Arabic stemming process using resources and benchmarking tools. Journal of King Saud University - Computer and Information Sciences 29:2, pages 164-170. DOI: 10.1016/j.jksuci.2016.11.010. Online publication date: 1-Apr-2017
  • (2014) Stemming resource-poor Indian languages. ACM Transactions on Asian Language Information Processing 13:3, pages 1-26. DOI: 10.1145/2629670. Online publication date: 3-Oct-2014
  • (2014) Aligned-Parallel-Corpora Based Semi-Supervised Learning for Arabic Mention Detection. IEEE/ACM Transactions on Audio, Speech and Language Processing 22:2, pages 314-324. DOI: 10.1109/TASLP.2013.2287055. Online publication date: 1-Feb-2014
  • (2014) Transliteration normalization for Information Extraction and Machine Translation. Journal of King Saud University - Computer and Information Sciences 26:4, pages 379-387. DOI: 10.1016/j.jksuci.2014.06.011. Online publication date: 1-Dec-2014
  • (2010) Enhancing mention detection using projection via aligned corpora. Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 993-1001. DOI: 10.5555/1870658.1870755. Online publication date: 9-Oct-2010
  • (2010) Posterior Regularization for Structured Latent Variable Models. The Journal of Machine Learning Research 11, pages 2001-2049. DOI: 10.5555/1756006.1859918. Online publication date: 1-Aug-2010
  • (2010) An accuracy-enhanced light stemmer for arabic text. ACM Transactions on Speech and Language Processing 7:2, pages 1-22. DOI: 10.1145/1921656.1921657. Online publication date: 24-Feb-2010
  • (2009) An extensible crosslinguistic readability framework. Proceedings of the 2nd Workshop on Building and Using Comparable Corpora: from Parallel to Non-parallel Corpora, pages 11-18. DOI: 10.5555/1690339.1690344. Online publication date: 6-Aug-2009
  • (2008) Cross-lingual propagation for morphological analysis. Proceedings of the 23rd national conference on Artificial intelligence - Volume 2, pages 848-854. DOI: 10.5555/1620163.1620204. Online publication date: 13-Jul-2008
