Adapting pivoted document-length normalization for query size: Experiments in Chinese and English

Published: 01 September 2006

Abstract

The vector space model (VSM) is one of the most widely used information retrieval (IR) models in both academia and industry. In the NTCIR-3 evaluation workshop it was less effective than other retrieval models at the Chinese ad hoc retrieval tasks, but in the NTCIR-4 and NTCIR-5 workshops it was comparable to them. It was unclear whether the lower performance at NTCIR-3 was due to inherent deficiencies of the VSM or to a less effective normalization of document length. To find out, we evaluated the VSM with various pivoted normalizations of document length on the NTCIR-3 collection. We found that, with pivoted normalization, both the retrieval effectiveness and the retrieval speed of the VSM were comparable to those of other competitive retrieval models (for example, 2-Poisson). We proposed a novel adaptive scheme that automatically estimates the (near) best parameters for pivoted document-length normalization based on query size; the new normalization is called adaptive pivoted document-length normalization. This scheme achieved good retrieval effectiveness, sometimes for short (title) queries and sometimes for long queries, without manual adjustment of parameter values. We found that unique, adaptive pivoted normalization can enhance fixed pivoted normalizations on other test collections (TREC-5 and TREC-6). We also evaluated the VSM with adaptive pivoted normalization under pseudo-relevance feedback (PRF) and found that it performs similarly to competitive retrieval models (2-Poisson) with PRF. Hence, we conclude that the VSM with unique (adaptive) pivoted document-length normalization is effective for Chinese IR, and that its retrieval effectiveness is comparable to that of other competitive retrieval models, with or without PRF, for the reference test collections used in this evaluation.
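
As a rough illustration of the idea summarized above, the following Python sketch shows the standard pivoted normalization factor, (1 - slope) * pivot + slope * length, in the form introduced by Singhal, Buckley, and Mitra (1996), together with a slope chosen from the query size. The abstract does not give the article's actual estimator, so `slope_for_query`, its default values, the direction in which the slope changes with query size, and the toy `score` function are all assumptions made for illustration only.

```python
def pivoted_norm(doc_length, pivot, slope):
    """Pivoted normalization factor: (1 - slope) * pivot + slope * doc_length.

    doc_length is the document's raw length measure (for example, its
    number of unique terms); pivot is typically the collection average
    of that measure.
    """
    return (1.0 - slope) * pivot + slope * doc_length


def slope_for_query(query_size, short_slope=0.1, long_slope=0.4,
                    long_query_size=20):
    """HYPOTHETICAL mapping from query size to slope.

    Linearly interpolates between two placeholder slopes and clamps at
    long_query_size. The values and the direction of change are
    illustrative, not results from the article.
    """
    clamped = min(max(query_size, 0), long_query_size)
    frac = clamped / long_query_size
    return short_slope + frac * (long_slope - short_slope)


def score(query_terms, doc_tf, doc_length, pivot):
    """Toy score: summed raw term frequencies divided by the pivoted norm.

    A real VSM would use tf-idf style weights; raw counts keep the
    sketch short.
    """
    slope = slope_for_query(len(query_terms))
    norm = pivoted_norm(doc_length, pivot, slope)
    return sum(doc_tf.get(term, 0) for term in query_terms) / norm


if __name__ == "__main__":
    # The same long document scored for a short and a longer query.
    doc_tf = {"pivoted": 3, "normalization": 2, "query": 1}
    print(score(["pivoted"], doc_tf, doc_length=300, pivot=100))
    print(score(["pivoted", "normalization", "query", "size"], doc_tf,
                doc_length=300, pivot=100))
```

With a slope of 1 the factor reduces to the plain document length, and with a slope of 0 every document is divided by the same constant (the pivot), so the slope controls how strongly long documents are penalized. Making the slope depend on query size is the part that stands in for the adaptive scheme here.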

        Information

        Published In

        ACM Transactions on Asian Language Information Processing, Volume 5, Issue 3
        September 2006
        107 pages
        ISSN: 1530-0226
        EISSN: 1558-3430
        DOI: 10.1145/1194936

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Author Tags

        1. Chinese information retrieval
        2. indexing strategies
        3. pivoted normalization

        Qualifiers

        • Article

        Cited By

        • (2021) A Comparison between Term-Independence Retrieval Models for Ad Hoc Retrieval. ACM Transactions on Information Systems 40:3 (1-37). DOI: 10.1145/3483612. Online publication date: 8-Dec-2021
        • (2019) A topic‐based term frequency normalization framework to enhance probabilistic information retrieval. Computational Intelligence 36:2 (486-521). DOI: 10.1111/coin.12248. Online publication date: 20-Nov-2019
        • (2018) A New Term Frequency Normalization Model for Probabilistic Information Retrieval. The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (1237-1240). DOI: 10.1145/3209978.3210147. Online publication date: 27-Jun-2018
        • (2018) Verbosity normalized pseudo-relevance feedback in information retrieval. Information Processing & Management 54:2 (219-239). DOI: 10.1016/j.ipm.2017.09.006. Online publication date: Mar-2018
        • (2017) Improving Retrieval Performance for Verbose Queries via Axiomatic Analysis of Term Discrimination Heuristic. Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (1201-1204). DOI: 10.1145/3077136.3080761. Online publication date: 7-Aug-2017
        • (2017) A novel Fuzzy-PSO term weighting automatic query expansion approach using combined semantic filtering. Knowledge-Based Systems 136:C (97-120). DOI: 10.1016/j.knosys.2017.09.004. Online publication date: 15-Nov-2017
        • (2016) A novel approach for extraction and representation of main data from web pages to Android application. 2016 IEEE International Conference on Recent Trends in Electronics, Information & Communication Technology (RTEICT) (1126-1130). DOI: 10.1109/RTEICT.2016.7808007. Online publication date: May-2016
        • (2015) A Study of Query Length Heuristics in Information Retrieval. Proceedings of the 24th ACM International on Conference on Information and Knowledge Management (1747-1750). DOI: 10.1145/2806416.2806592. Online publication date: 17-Oct-2015
        • (2012) A constraint to automatically regulate document-length normalisation. Proceedings of the 21st ACM international conference on Information and knowledge management (2443-2446). DOI: 10.1145/2396761.2398662. Online publication date: 29-Oct-2012
        • (2010) A content tendency judgment algorithm for micro-blog platform. 2010 IEEE International Conference on Intelligent Computing and Intelligent Systems (168-172). DOI: 10.1109/ICICISYS.2010.5658492. Online publication date: Oct-2010
