In order for an automatic information retrieval system to effectively retrieve documents related to a given subject area, the content of each document in the system's database must be represented accurately. This study examines the hypothesis that better representations of document content can be constructed if the content analysis method takes into consideration the syntactic structure of document and query texts. Two methods of automatically generating phrases for use as content indicators have been implemented and tested experimentally. The non-syntactic (or statistical) method is based on simple text characteristics such as word frequency and the proximity of words in text. The syntactic method uses augmented phrase structure rules (production rules) to selectively extract phrases from parse trees generated by an automatic syntactic analyzer. Experimental results show that the effect of non-syntactic phrase indexing is inconsistent. For the five collections tested, increases in average precision ranged from 22.7% to 2.2% over simple, single-term indexing. The syntactic phrase indexing method was tested on two collections. Precision figures averaged over all test queries indicate that non-syntactic phrase indexing performs significantly better than syntactic phrase indexing for one collection, but that the difference is insignificant for the other collection. More detailed analysis of individual queries, however, indicates that the performance of both methods is highly variable, and that there is evidence that syntax-based indexing has certain benefits not available with the non-syntactic approach. Possible improvements of both methods of phrase indexing are considered. It is concluded that the prospects for improving the syntax-based approach to document indexing are better than for the non-syntactic approach. The PLNLP system was used for syntactic analysis of document and query texts, and for implementing the syntax-based phrase construction rules. The SMART information retrieval system was used for retrieval experimentation.
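As a rough illustration of the non-syntactic idea described above (phrases chosen from word frequency and word proximity), the following Python sketch pairs nearby non-stopwords and keeps a pair only if at least one of its words is frequent enough across the collection. It is not the procedure used in the study: the function names, the stopword list, the `window` and `min_head_df` parameters, and the alphabetical ordering of phrase components are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for the example.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "for", "to", "is", "on", "from"}


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens in order."""
    return re.findall(r"[a-z]+", text.lower())


def document_frequencies(docs):
    """Count, for each word, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return df


def statistical_phrases(text, df, window=3, min_head_df=2):
    """Return two-word phrases built from proximity and frequency alone.

    Two non-stopwords form a phrase when they occur at most `window`
    token positions apart (stopwords count toward the distance) and at
    least one of them appears in `min_head_df` or more documents.
    Components are sorted so that word order does not matter.
    """
    tokens = tokenize(text)
    phrases = set()
    for i, first in enumerate(tokens):
        if first in STOPWORDS:
            continue
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            second = tokens[j]
            if second in STOPWORDS or second == first:
                continue
            if max(df[first], df[second]) >= min_head_df:
                phrases.add(tuple(sorted((first, second))))
    return phrases


docs = [
    "information retrieval system",
    "retrieval of information from text",
    "syntactic parsing of text",
]
df = document_frequencies(docs)
print(statistical_phrases("retrieval of information", df, window=2))
```

With these three toy documents, both "information" and "retrieval" occur in two documents and lie two positions apart in the query, so the only phrase produced is the pair (information, retrieval). Shrinking `window` to 1 or raising `min_head_df` to 3 yields no phrases, which shows how sensitive such a method is to its thresholds.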