research-article

DeASCIIfication approach to handle diacritics in Turkish information retrieval

Published: 01 March 2016

Abstract

Risk-sensitive evaluation of approaches for handling diacritics in Turkish information retrieval. Application of diacritics restoration to Turkish information retrieval. Investigation of the diacritics sensitivity of stemming algorithms.

The absence of diacritics in text documents or search queries is a serious problem for Turkish information retrieval because it creates homographic ambiguity. Inappropriate handling of diacritics therefore reduces retrieval performance in search engines. A straightforward solution is to normalize tokens by replacing diacritic characters with their American Standard Code for Information Interchange (ASCII) counterparts. However, this so-called ASCIIfication produces either synthetic words that are not legitimate Turkish words or legitimate words whose meanings are completely different from those of the original words. The invalid synthetic words cannot be processed by morphological analysis components (such as stemmers or lemmatizers), which expect valid Turkish words as input. By contrast, synthetic words are not a problem when the text analysis pipeline uses no stemmer or a simple first-n-characters stemmer. This difference highlights the notion of the diacritic sensitivity of stemmers. In this study, we propose and evaluate an alternative solution based on deASCIIfication, which restores accented letters in query terms or text documents. Our risk-sensitive evaluation showed that the diacritics restoration approach yielded more effective and robust results than normalizing tokens to remove diacritics.
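The effect of ASCIIfication described in the abstract can be illustrated with a minimal Python sketch. It is not the pipeline evaluated in the article: the function names (`asciify`, `first_n_stem`, `restore_candidates`), the prefix length of 5, and the four-word lexicon are all assumptions made for illustration, and the lexicon lookup is only a toy stand-in for a real deASCIIfier.

```python
# Illustrative sketch only; not the pipeline evaluated in the article.

# Turkish-specific letters mapped to their ASCII counterparts.
_TURKISH = "çğıöşüÇĞİÖŞÜ"
_ASCII = "cgiosuCGIOSU"
ASCII_MAP = str.maketrans(_TURKISH, _ASCII)


def asciify(token: str) -> str:
    """Replace Turkish diacritic letters with their ASCII look-alikes."""
    return token.translate(ASCII_MAP)


def first_n_stem(token: str, n: int = 5) -> str:
    """Truncation 'stemmer': keep only the first n characters (n is assumed)."""
    return token[:n]


def restore_candidates(ascii_token: str, lexicon: list[str]) -> list[str]:
    """Toy restoration: lexicon words whose ASCIIfied form equals the token."""
    return [word for word in lexicon if asciify(word) == ascii_token]


# Hypothetical lexicon, chosen only to show the two kinds of collision.
LEXICON = ["açı", "acı", "oldu", "öldü"]

if __name__ == "__main__":
    # "öldü" (died) collapses onto the legitimate word "oldu" (happened).
    print(asciify("öldü"))
    # "açı" (angle) and "acı" (pain) both become the synthetic token "aci".
    print(asciify("açı"), asciify("acı"))
    # Truncation does not require its input to be a valid word.
    print(first_n_stem(asciify("öldüler")))
    # Restoring "aci" is ambiguous without context: two candidates remain.
    print(restore_candidates("aci", LEXICON))
```

The last call returns more than one candidate, which shows why restoration cannot be a plain table lookup: a practical deASCIIfier has to pick among candidates using the surrounding context. The sketch also assumes precomposed (NFC) input; decomposed characters would need Unicode normalization first.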

    Information & Contributors

    Information

    Published In

    Information Processing and Management: an International Journal  Volume 52, Issue 2
    March 2016
    186 pages

    Publisher

    Pergamon Press, Inc.

    United States

    Author Tags

    1. Accents
    2. DeASCIIfier
    3. Diacritics restoration
    4. Risk-sensitive evaluation
    5. Stemming
    6. Turkish information retrieval

    Qualifiers

    • Research-article
