Mining a multilingual association dictionary from Wikipedia for cross-language information retrieval

Published: 01 December 2012

Abstract

Wikipedia is characterized by its dense link structure and a large number of articles in different languages, which make it a notable Web corpus for knowledge extraction and mining, in particular for mining multilingual associations. In this paper, motivated by a psychological theory of word meaning, we propose a graph-based approach to constructing a cross-language association dictionary (CLAD) from Wikipedia, which can be used in a variety of cross-language access and processing applications. To evaluate the quality of the mined CLAD and to demonstrate how it can be used in practice, we explore two applications of it to cross-language information retrieval (CLIR). First, we use the mined CLAD to conduct cross-language query expansion; second, we use it to filter out translation candidates with low translation probabilities. Experimental results on a variety of standard CLIR test collections show that these two applications substantially improve CLIR performance, which indicates that the mined CLAD is of sound quality. © 2012 Wiley Periodicals, Inc.
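
The two applications named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the CLAD data format, the scoring, or any thresholds, so the dictionary layout, the function names (`filter_translations`, `expand_query`), the parameters (`min_prob`, `top_k`), and the toy English-French entries are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: source term -> list of (target-language term, association score).
Clad = Dict[str, List[Tuple[str, float]]]


def filter_translations(candidates: Dict[str, float], min_prob: float) -> Dict[str, float]:
    """Keep only translation candidates whose probability is at least min_prob."""
    return {term: prob for term, prob in candidates.items() if prob >= min_prob}


def expand_query(terms: List[str], clad: Clad, top_k: int) -> List[str]:
    """Collect the top_k most strongly associated target-language terms per query term."""
    expanded: List[str] = []
    for term in terms:
        # Sort this term's associations from strongest to weakest.
        ranked = sorted(clad.get(term, []), key=lambda pair: pair[1], reverse=True)
        for target, _score in ranked[:top_k]:
            if target not in expanded:  # avoid duplicate expansion terms
                expanded.append(target)
    return expanded


if __name__ == "__main__":
    # Toy data, invented for this example only.
    toy_clad: Clad = {"bank": [("banque", 0.8), ("rive", 0.3), ("finance", 0.5)]}
    print(expand_query(["bank"], toy_clad, top_k=2))
    print(filter_translations({"banque": 0.7, "rive": 0.05}, min_prob=0.1))
```

In this sketch, expansion and filtering are independent steps; how the paper combines them with a retrieval model is not stated in the abstract.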

Published In

Journal of the American Society for Information Science and Technology, Volume 63, Issue 12
December 2012
208 pages

Publisher

John Wiley & Sons, Inc.

United States

Author Tags

  1. information processing
  2. information retrieval
  3. web mining

Qualifiers

  • Article

Cited By

  • (2022) A Comparative Optimization Model of Japanese Literature Characteristics for Cognitive Retrieval of Cross-Language Information. Computational Intelligence and Neuroscience, 2022. DOI: 10.1155/2022/8195075. Online publication date: 1-Jan-2022
  • (2022) Identifying cross-lingual plagiarism using rich semantic features and deep neural networks. Journal of King Saud University - Computer and Information Sciences, 34:4 (1110-1123). DOI: 10.1016/j.jksuci.2020.04.009. Online publication date: 18-May-2022
  • (2018) What kind of knowledge is in Wikipedia? Unsupervised extraction of properties for similar concepts. Journal of the Association for Information Science and Technology, 65:12 (2489-2497). DOI: 10.5555/3151328.3151337. Online publication date: 17-Dec-2018
  • (2016) Transfer Learning for Cross-Lingual Sentiment Classification with Weakly Shared Deep Neural Networks. Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (245-254). DOI: 10.1145/2911451.2911490. Online publication date: 7-Jul-2016
  • (2014) Interlinking cross language metadata using heterogeneous graphs and Wikipedia. Proceedings of the 2014 International Conference on Dublin Core and Metadata Applications (157-166). DOI: 10.5555/2771234.2771251. Online publication date: 8-Oct-2014
