DOI: 10.1145/956863.956890
Article

Statistical transliteration for English-Arabic cross language information retrieval

Published: 03 November 2003

Abstract

Out-of-vocabulary (OOV) words are problematic for cross language information retrieval. One way to deal with OOV words when the two languages have different alphabets is to transliterate the unknown words, that is, to render them in the orthography of the second language. In this study, we present a simple statistical technique to train an English-to-Arabic transliteration model from pairs of names. We call this a selected n-gram model because a two-stage training procedure first learns which n-gram segments should be added to the unigram inventory for the source language, and then learns the translation model over this inventory. This technique requires no heuristics or linguistic knowledge of either language. We evaluate the statistically trained model and a simpler hand-crafted model on a test set of named entities from the Arabic AFP corpus and demonstrate that they perform better than two online translation sources. We also explore the effectiveness of these systems on the TREC 2002 cross language IR task. We find that transliteration either of OOV named entities or of all OOV words is an effective approach for cross language IR.
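
The two-stage idea in the abstract can be illustrated with a toy Python sketch. This is not the authors' implementation: it assumes character-level alignments are already available (the `links` lists below are hand-written), and the three name pairs, the `min_count` threshold, the greedy longest-match segmentation, and all function names are hypothetical choices made for illustration only.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (english, arabic, links), where links[i] is the index
# of the Arabic character aligned to english[i], or None if unaligned.
PAIRS = [
    ("shadi", "شادي", [0, 0, 1, 2, 3]),
    ("rashid", "رشيد", [0, None, 1, 1, 2, 3]),
    ("sami", "سامي", [0, 1, 2, 3]),
]

def group_segments(english, arabic, links):
    """Merge consecutive English chars that are linked to the same Arabic char."""
    segments, prev = [], object()
    for ch, pos in zip(english, links):
        if segments and pos is not None and pos == prev:
            seg, tgt = segments[-1]
            segments[-1] = (seg + ch, tgt)
        else:
            segments.append((ch, arabic[pos] if pos is not None else ""))
        prev = pos
    return segments

def select_ngrams(pairs, min_count=2):
    """Stage 1: keep multi-character source segments seen at least min_count times."""
    counts = Counter()
    for english, arabic, links in pairs:
        for seg, _ in group_segments(english, arabic, links):
            if len(seg) > 1:
                counts[seg] += 1
    return {seg for seg, c in counts.items() if c >= min_count}

def train_model(pairs, inventory):
    """Stage 2: estimate P(arabic | segment) over single chars plus selected n-grams."""
    counts = defaultdict(Counter)
    for english, arabic, links in pairs:
        for seg, tgt in group_segments(english, arabic, links):
            # Toy simplification: multi-char segments outside the inventory are skipped.
            if len(seg) == 1 or seg in inventory:
                counts[seg][tgt] += 1
    return {seg: {t: c / sum(tc.values()) for t, c in tc.items()}
            for seg, tc in counts.items()}

def segment(word, inventory):
    """Greedy longest-match segmentation using the selected inventory."""
    out, i = [], 0
    max_len = max((len(s) for s in inventory), default=1)
    while i < len(word):
        for n in range(min(max_len, len(word) - i), 0, -1):
            piece = word[i:i + n]
            if n == 1 or piece in inventory:
                out.append(piece)
                i += n
                break
    return out

def transliterate(word, inventory, model):
    """Pick the most probable Arabic output for each segment; unseen segments are dropped."""
    out = []
    for seg in segment(word, inventory):
        options = model.get(seg)
        if options:
            out.append(max(options, key=options.get))
    return "".join(out)

inventory = select_ngrams(PAIRS)
model = train_model(PAIRS, inventory)
```

In this toy data the only selected n-gram is `sh`, so `segment("rashid", inventory)` treats `sh` as a single unit rather than as `s` followed by `h`. A real system would need an automatic alignment step and a far larger set of name pairs.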

    Published In

    CIKM '03: Proceedings of the twelfth international conference on Information and knowledge management
    November 2003
    592 pages
    ISBN:1581137230
    DOI:10.1145/956863

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. cross language information retrieval
    2. named entities
    3. out of vocabulary words
    4. statistical transliteration

    Qualifiers

    • Article

    Conference

    CIKM03

    Acceptance Rates

    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

    Cited By

    • (2023) Transliterating Latin to Amharic scripts using user-defined rules and character mappings. International Journal on Digital Libraries 24:1 (63-75). DOI: 10.1007/s00799-023-00346-5. Online publication date: 2-Mar-2023
    • (2021) MOOCs One-Stop Shop: A Realization of a Unified MOOCs Search Engine. IEEE Access 9 (160175-160185). DOI: 10.1109/ACCESS.2021.3130841. Online publication date: 2021
    • (2018) Machine transliteration and transliterated text retrieval: a survey. Sādhanā 43:6. DOI: 10.1007/s12046-018-0828-8. Online publication date: 7-Jun-2018
    • (2017) Bengali-to-English Forward and Backward Machine Transliteration Using Support Vector Machines. Computational Intelligence, Communications, and Business Analytics (552-566). DOI: 10.1007/978-981-10-6430-2_43. Online publication date: 26-Sep-2017
    • (2016) Arabic Cross-Language Information Retrieval. ACM Transactions on Asian and Low-Resource Language Information Processing 15:3 (1-44). DOI: 10.1145/2789210. Online publication date: 28-Jan-2016
    • (2015) An improved approach to English-Hindi based Cross Language Information Retrieval system. Proceedings of the 2015 Eighth International Conference on Contemporary Computing (IC3) (354-359). DOI: 10.1109/IC3.2015.7346706. Online publication date: 20-Aug-2015
    • (2015) Cross Language Duplicate Record Detection in Big Data. Big Data in Complex Systems (147-171). DOI: 10.1007/978-3-319-11056-1_5. Online publication date: 2015
    • (2014) Information Retrieval. Natural Language Processing of Semitic Languages (299-334). DOI: 10.1007/978-3-642-45358-8_10. Online publication date: 25-Mar-2014
    • (2013) A Method to Construct Chinese-Japanese Named Entity Translation Equivalents Using Monolingual Corpora. Natural Language Processing and Chinese Computing (164-175). DOI: 10.1007/978-3-642-41644-6_16. Online publication date: 2013
    • (2013) Improving Cross-Language Information Retrieval by Transliteration Mining and Generation. Multilingual Information Access in South Asian Languages (310-333). DOI: 10.1007/978-3-642-40087-2_29. Online publication date: 2013
