Nothing Special   »   [go: up one dir, main page]

skip to main content
10.5555/779232.779240guidebooksArticle/Chapter ViewAbstractPublication PagesBookacm-pubtype
chapter

Searching large text collections

January 2002
Pages 195 - 243
Published: 01 January 2002 Publication History

Abstract

In this chapter we present the main data structures and algorithms for searching large text collections. We emphasize inverted files, the most used index, but also review suffix arrays, which are useful in a number of specialized applications. We also cover parallel and distributed implementations of these two structures. As an example, we show how mechanisms based upon inverted files can be used to index and search the Web.

References

[1]
O. Alonso and R. Baeza-Yates. A model for visualizing large answers in WWW. In XVIII Int. Conf. of the Chilean CS Society, pages 2-7, Antofagasta, Chile, 1998. IEEE CS Press.]]
[2]
AltaVista. AltaVista: Main page. http://www.altavista.com, 1996.]]
[3]
T. Anderson, D. Culler, and D. Patterson. A case for NOW (network of workstations). IEEE Micro, 15(1):54-64, 1995.]]
[4]
V. N. Anh and A. Moffat. Compressed inverted files with reduced decoding overheads. In Croft et al. (1998), pages 290-297.]]
[5]
A. Apostolico. The myriad virtues of subword trees. In Combinatorial Algorithms on Words, NATO ISI Series, pages 85-96. Springer-Verlag, 1985.]]
[6]
R. Baeza-Yates, E. Barbosa, and N. Ziviani. Hierarchies of indices for text searching. Information Systems, 21(6):497-514, 1996.]]
[7]
R. Baeza-Yates and G. Gonnet. All-against-all sequence matching. Dept. of Computer Science, University of Chile, 1990.]]
[8]
R. Baeza-Yates and G. Gonnet. Fast text searching for regular expressions or automaton searching on a trie. J. of the ACM, 43(6):915-936, 1996.]]
[9]
R. Baeza-Yates and G. Navarro. Block-addressing indices for approximate text retrieval. In Proc. ACM CIKM'97, pages 1-8, 1997. Extended version to appear in JASIS.]]
[10]
R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley & ACM Press, Harlow, UK, 1999.]]
[11]
E. Barbosa and N. Ziviani. From partial to full inverted lists for text searching. In R. Baeza-Yates and U. Manber, editors, Proc. 2nd South American Workshop on String Processing, WSP'95, pages 1-10, 1995.]]
[12]
R. Bayer and K. Unterauer. Prefix B-trees. ACM TODS, 2(1):11-26, Mar 1977.]]
[13]
Tim Berners-Lee, Robert Cailliau, Ari Luotonen, Henrik Frystyk Nielsen, and Arthur Secret. The World-Wide Web. Comm. of the ACM, 37(8):76-82, 1994.]]
[14]
Krisha Bharat, Andrei Broder, Monika Henzinger, Puneet Kumar, and Suresh Venkatasubramanian. The connectivity server: fast access to linkage information on the Web. In 7th WWW Conf., Brisbane, Australia, April 1998.]]
[15]
Philippe Bonnet and Anthony Tomasic. Partial answers for unavailable data sources. In Workshop on Flexible Query-Answering Systems, pages 43-54, 1998.]]
[16]
C. Mic Bowman, Peter B. Danzig, Darren R. Hardy, Udi Manber, and Michael F. Schwartz. The Harvest information discovery and access system. In Proc. 2nd Inter. World Wide Web Conf., pages 763-771, October 1994.]]
[17]
S. Brin. Extracting patterns and relations from the World Wide Web. In Workshop on Web Databases, Valencia, Spain, March 1998.]]
[18]
A. Broder, S. Glassman, M. Manasse, and G. Zweig. Syntactic clustering of the Web. In 6th Int'l WWW Conference, pages 391-404, Santa Clara, CA, USA, April 1997.]]
[19]
A.Z. Broder. On the resemblance and containment of documents. In Conf. on Compression and Complexity of Sequences, pages 21-29, Salerno, Italy, 1997. IEEE Computer Society.]]
[20]
C. Buckley and A. F. Lewit. Optimization of inverted vector searches. In Proc. 8th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 97-110, Montreal, Canada, June 1985. ACM Press, New York.]]
[21]
S. Chakrabarti, B. Dom, P. Raghavan, S. Rajagopalan, D. Gibson, and J. Kleinberg. Automatic resource compilation by analyzing hyperlink structure and associated text. In 7th WWW Conference, pages 65-74, Brisbane, Australia, April 1998.]]
[22]
S. Chawathe, H. García-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J. Ullman, and J. Widom. The TSIMMIS Project: Integration of Heterogeneous Information Sources. In Proc. of IPSJ Conference, pages 7-18, October 1994.]]
[23]
C. Chen. Structuring and visualizing the WWW by generalized similarity analysis. In 8th ACM Conference on Hypertext and Hypermedia, pages 177-186, Southampton, England, 1997.]]
[24]
J. Cho, H. García-Molina, and L. Page. Efficient crawling through URL ordering. In 7th WWW Conference, Brisbane, Australia, April 1998.]]
[25]
D. R. Clark and J. I. Munro. Efficient suffix trees on secondary storage. In Proceedings of the 7th ACM-SIAM Annual Symposium on Discrete Algorithms, pages 383-391, Atlanta, Georgia, 1996.]]
[26]
A. Clauser and P. Ferragina. On constructing suffix arrays in external memory. Algorithmica, 2000. (to appear). Part of this work appeared in European Symposium on Algorithms, LNCS 1643, 1999.]]
[27]
E. G. Coffman, Z. Liu, and R. R. Weber. Optimal robot scheduling for Web search engines. Technical Report 3317, INRIA, France, December 1997.]]
[28]
W. B. Croft, A. Moffat, C. J. van Rijsbergen, R. Wilkinson, and J. Zobel, editors. Proc. 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, August 1998. ACM Press, New York.]]
[29]
O. de Kretser and A. Moffat. Effective document presentation with a locality-based similarity heuristic. In M. Hearst, F. Gey, and R. Tong, editors, Proc. 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 113-120, New York, August 1999. ACM Press.]]
[30]
O. de Kretser, A. Moffat, T. Shimmin, and J. Zobel. Methodologies for distributed information retrieval. In M. P. Papazoglou, M. Takizawa, B. Krämer, and S. Chanson, editors, Proc. 18th International Conference on Distributed Computing Systems, pages 66-73, Los Alamitos, CA, May 1998. IEEE Computer Society.]]
[31]
Alin Deutsch, Mary Fernández, Daniela Florescu, Alon Levy, and Dan Suciu. A query language for XML. http://www.research.att.com/~mff/xml/w3cnote.html, 1998.]]
[32]
Fast. Fast: Main page. http://alltheweb.com, 1998.]]
[33]
P. Ferragina and R. Grossi. The string B-tree: A new data structure for string search in external memory and its applications. J. Assoc. Comput. Mach., 46:236-280, 1999.]]
[34]
W. B. Frakes and R. Baeza-Yates, editors. Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Englewood Cliffs, New Jersey, 1992.]]
[35]
R. G. Gallager and D. C. Van Voorhis. Optimal source codes for geometrically distributed integer alphabets. IEEE Transactions on Information Theory, IT-21(2):228-230, March 1975.]]
[36]
D. Gibson, J. Kleinberg, and P. Raghavan. Inferring Web communities from link topologies. In 9th ACM Conference on Hypertext and Hypermedia, Pittsburgh, USA, 1998.]]
[37]
R. Goldman, J. McHugh, and J. Widom. Lore: A database management system for XML. Technical report, Stanford University Database Group, 1998.]]
[38]
S. W. Golomb. Run-length encodings. IEEE Transactions on Information Theory, IT-12(3):399-401, July 1966.]]
[39]
G. Gonnet. A tutorial introduction to Computational Biochemistry using Darwin. Technical report, Informatik E.T.H., Zuerich, Switzerland, 1992.]]
[40]
G. Gonnet, R. Baeza-Yates, and T. Snider. New indices for text: Pat trees and Pat arrays, chapter 3, pages 66-82. In, Frakes and Baeza-Yates (1992), 1992a.]]
[41]
G. Gonnet, M. Cohen, and S. Benner. Exhaustive matching of the entire protein sequence database. Science, 256:1443-45, 1992b.]]
[42]
Google. Google: Main page. http://www.google.com, 1998.]]
[43]
L. Gravano, K. Chang, H. García-Molina, C. Lagoze, and A. Paepcke. STARTS: Stanford protocol proposal for Internet retrieval and search. Technical report, Stanford University, Digital Library Project, 1997. http://www-db.stanford.edu/~gravano/starts.html.]]
[44]
Luis Gravano, Hector García-Molina, and Ant hony Tomasic. GLOSS: Text-source discovery over the Internet. ACM Transactions on Database Systems, 2000. To appear.]]
[45]
S.J. Green. Automated link generation: can we do better than term repetition. In 7th WWW Conference, Brisbane, Australia, 1998.]]
[46]
D. K. Harman. Overview of the second text retrieval conference (TREC- 2). Information Prcessing & Management, 31(3):271-289, May 1995.]]
[47]
D. K. Harman and G. Candela. Retrieving records from a gigabyte of text on a minicomputer using statistical ranking. Journal of the American Society for Information Science, 41(8):581-589, August 1990.]]
[48]
H. Itoh and H. Tanaka. An efficient method for in memory construction of suffix arrays. In IEEE CS Press, editor, Proc. 6th Symp. on String Processing and Information Retrieval, SPIRE'99, pages 81-89, Cancun, Mexico, 1999.]]
[49]
J. Jájá. An Introduction to Parallel Algorithms. Addison-Wesley, 1992.]]
[50]
B. Kahle. Archiving the Internet. http://www.alexa.com/~brewster/ essays/sciam_article.html, 1997.]]
[51]
M. Kaszkiel, J. Zobel, and R. Sacks-Davis. Efficient passage ranking for document databases. ACM Transactions on Information Systems, 17 (4):406-439, October 1999.]]
[52]
R. Khare and A. Rifkin. XML: A door to automated Web applications. IEEE Internet Computing, 1(4):78-86, 1977.]]
[53]
J. Kitajima and G. Navarro. A fast distributed suffix array generation algorithm. In Proc. 6th Symp. on String Processing and Information Retrieval, SPIRE'99, pages 97-104, Cancun, Mexico, 1999. IEEE CS Press.]]
[54]
J. Kitajima, G. Navarro, B. Ribeiro-Neto, and N. Ziviani. Distributed generation of suffix arrays: a quicksort-based approach. In Proc. 4th South American Workshop on String Processing, WSP'97, pages 53-69. Carleton University Press, 1997.]]
[55]
J. Kitajima, B. Ribeiro-Neto, and N. Ziviani. Network and memory analysis in distributed parallel generation of PAT arrays. In Proc. 8th Brazilian Symposium on Computer Architectures - High-Performance Processing, pages 193-202. Brazilian Computer Society - SBC, 1996.]]
[56]
Jon Kleinberg. Authoritative sources in a hyperlinked environment. In Proc. of the 9th ACM-SIAM Symposium on Discrete Algorithms, pages 668-677, San Francisco, USA, Jan 1998.]]
[57]
Stefan Kurtz. Reducing the space requirement of suffix trees. Software - Practice and Experience, 29(13):1149-1171, 1999.]]
[58]
N. J. Larsson and K. Sadakane. Faster suffix sorting. Technical Report LU-CS-TR:99-214, LUNDFD6/NFCS3140/1-20/(1999), Department of Computer Science, Lund University, Sweden, May 1999.]]
[59]
S. Lawrence and L. Giles. Accessability and distribution of information on the Web. Nature, July 1999.]]
[60]
M. Lesk. Practical Digital Libraries: Books, Bytes, and Bucks. Morgan Kaufmann, San Francisco, 1997.]]
[61]
U. Manber and E. Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, pages 935-948, 1993.]]
[62]
Udi Manber and S. Wu. GLIMPSE: A tool to search through entire file systems. In Usenix Winter 1994 Technical Conference, pages 23-32, San Francisco, CA, January 1994.]]
[63]
M. Marchiori. The quest for correct information on the Web: Hyper search engi nes. In 6th WWW Conf., pages 265-274, Santa Clara, CA, USA, 1997.]]
[64]
M. Mitra, C. Buckley, A. Singhal, and C. Cardie. An analysis of statistical and syntactic phrases. In Proc. RIA0'97 Conf. on Computer-Assisted Information Searching on the Internet, pages 200-214, June 1997.]]
[65]
A. Moffat and T. A. H. Bell. In-situ generation of compressed inverted files. Journal of the American Society for Information Science, 46(7): 537-550, August 1995.]]
[66]
A. Moffat and J. Zobel. Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems, 14(4):349-379, October 1996.]]
[67]
A. Moffat, J. Zobel, and N. Sharman. Text compression for dynamic document databases. IEEE Transactions on Knowledge and Data Engineering, 9(2):302-313, March 1997.]]
[68]
E. S. de Moura, G. Navarro, N. Ziviani, and R. Baeza-Yates. Fast searching on compressed text allowing errors. In Croft et al. (1998), pages 298-306.]]
[69]
M. Nagao and S. Mori. A new method of n-gram statistics for large number of n and automatic extraction of words and phrases from large text data of japanese. In Proc. COLING'94, pages 611-615, 1994.]]
[70]
G. Navarro and R. Baeza-Yates. A new indexing method for approximate string matching. In Maxime Crochemore and Mike Paterson, editors, Combinatorial Pattern Matching (CPM'99), number 1645 in LNCS, pages 163-185, Warwick, UK, July 1999. Springer Verlag.]]
[71]
G. Navarro, E. Barbosa, R. Baeza-Yates, W. Cunto, and N. Ziviani. Binary searching with non-uniform costs and its application to text retrieval. Algorithmica, 2000. To appear. Preliminary version in Proc. ESA '95, LNCS, Springer Verlag.]]
[72]
G. Navarro, J. Kitajima, B. Ribeiro-Neto, and N. Ziviani. Distributed generation of suffix arrays. In Proc. CPM'97, LNCS 1264, pages 102-115, 1997.]]
[73]
Northern. Northern Light: Main page. http://www.northernlight.com, 1997.]]
[74]
M. Persin, J. Zobel, and R. Sacks-Davis. Filtered document retrieval with frequency-sorted indexes. Journal of the American Society for Information Science, 47(10):749-764, October 1996.]]
[75]
Peter Pirolli, James Pitkow, and Ramana Rao. Silk from a sow's ear: Extracting usable structures from the Web. In Proc. of the ACM SIGCHI Conference on Human Factors in Computing Systems, pages 118-125, Zurich, Switzerland, May 1996. ACM Press.]]
[76]
J. Powell and E. Fox. Multilingual federated searching across heterogeneous collections. D-Lib Magazine, September 1998.]]
[77]
D. Quass, A. Rajaraman, Y. Sagiv, J. Ullman, and J. Widom. Querying semistructured heterogeneous information. In Deductive and Object-Oriented Databases, Proc. of the DOOD '95 Conference, pages 319-344, Singapore. December 1995. Springer-Verlag.]]
[78]
M. J. Quinn. Parallel Computing: Theory and Practice. McGraw-Hill, second edition, 1994.]]
[79]
B. Ribeiro-Neto and R. Barbosa. Query performance for tightly coupled distributed digital libraries. In Proc. ACM Int. Conference on Digital Libraries (DL'98), pages 182-190, 1998.]]
[80]
B. Ribeiro-Neto, J. Kitajima, G. Navarro, C. Sant'Ana, and N. Ziviani. Parallel generation of inverted lists for distributed text collections. In Y. Eterovic. editor, Proc. XVIII Chilean Computer Science Society, SCCC'98, pages 149-157. IEEE CS Press, 1998.]]
[81]
B. Ribeiro-Neto, E. Moura, M. Neubert, and N. Ziviani. Efficient distributed algorithms to build inverted files. In Proc. ACM SIGIR'99, pages 105-112, 1999.]]
[82]
K. Sadakane. A fast algorithm for making suffix arrays and for Burrows-Wheeler transformation. In Proc. IEEE Data Compression Conference (DCC'98), pages 129-138, 1998.]]
[83]
G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513-523, 1988.]]
[84]
G. Salton and M. J. McGill. Introduction to Modern Information Retrieval . McGraw-Hill, New York, 1983.]]
[85]
N. Shivakumar and H. García-Molina. Finding near-replicas of documents on the Web. In Workshop on Web Databases, Valencia, Spain, March 1998.]]
[86]
A. Singhal, C. Buckley, and M. Mitra. Pivoted document length normalization. In H.-P. Frei, D. Harman, P. Schäuble, and R. Wilkinson, editors, Proc. 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 21-29, New York, August 1996. ACM Press.]]
[87]
Snap. Snap: Main page. http://www.snap.com, 1998.]]
[88]
Ellen Spertus. ParaSite: Mining structural information on the Web. In 6th Int'l WWW Conference, Santa Clara, CA, USA, April 1997.]]
[89]
W. Szpankowski. Probabilistic analysis of generalized suffix trees. In Proc. CPM'92, pages 1-14, 1992. LNCS 644.]]
[90]
A. Tomasic and H. Garcia-Molina. Performance of inverted indices in shared-nothing distributed text document information retrieval systems. In Proc. Int. Conf. on Parallel and Distributed Information Systems, PDIS'93, 1993.]]
[91]
E. Ukkonen. Approximate string matching over suffix trees. In Proc. CPM'93, pages 228-242, 1993.]]
[92]
R. Weiss, B. Vélez, M. Sheldon, C. Nemprempre, P. Szilagyi, and D.K. Gifford. HyPursuit: A hierarchical network engine that exploits content-link hypertext clustering. In 7th ACM Conference on Hypertext and Hypermedia, pages 180-193, Washington, D.C., USA, 1996.]]
[93]
I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, San Francisco, second edition, 1999.]]
[94]
B. Yuwono and D.L. Lee. Server ranking for distributed text retrieval systems on the Internet. In 5th Int. Conf. on Databases Systems for Advanced Applications (DASFAA), pages 41-50, Melbourne, Australia, 1997.]]
[95]
J. Zobel and A. Moffat. Adding compression to a full-text retrieval system. Software-Practice and Experience, 25(8):891-903, August 1995.]]
[96]
J. Zobel and A. Moffat. Exploring the similarity space. SIGIR Forum, 32(1):18-34, Spring 1998.]]
[97]
J. Zobel, A. Moffat, and K. Ramamohanarao. Inverted files versus signature files for text indexing. ACM Transactions on Database Systems, 23(4):453-490, December 1998.]]

Cited By

View all
  • (2018)Efficiently mining frequent itemsets applied for textual aggregationApplied Intelligence10.1007/s10489-017-1050-948:4(1013-1019)Online publication date: 1-Apr-2018
  • (2013)Faster and smaller inverted indices with treapsProceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval10.1145/2484028.2484088(193-202)Online publication date: 28-Jul-2013
  • (2012)Dual-Sorted inverted lists in practiceProceedings of the 19th international conference on String Processing and Information Retrieval10.1007/978-3-642-34109-0_31(295-306)Online publication date: 21-Oct-2012
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Guide books
Handbook of massive data sets
January 2002
1252 pages
ISBN:1402004893

Publisher

Kluwer Academic Publishers

United States

Publication History

Published: 01 January 2002

Author Tags

  1. index
  2. inverted files
  3. search
  4. suffix arrays
  5. supraindex
  6. web crawlers

Qualifiers

  • Chapter

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 14 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2018)Efficiently mining frequent itemsets applied for textual aggregationApplied Intelligence10.1007/s10489-017-1050-948:4(1013-1019)Online publication date: 1-Apr-2018
  • (2013)Faster and smaller inverted indices with treapsProceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval10.1145/2484028.2484088(193-202)Online publication date: 28-Jul-2013
  • (2012)Dual-Sorted inverted lists in practiceProceedings of the 19th international conference on String Processing and Information Retrieval10.1007/978-3-642-34109-0_31(295-306)Online publication date: 21-Oct-2012
  • (2012)Ranked document retrieval in (almost) no spaceProceedings of the 19th international conference on String Processing and Information Retrieval10.1007/978-3-642-34109-0_16(155-160)Online publication date: 21-Oct-2012
  • (2010)Dual-sorted inverted listsProceedings of the 17th international conference on String processing and information retrieval10.5555/1928328.1928370(309-321)Online publication date: 11-Oct-2010
  • (2006)Inverted files for text search enginesACM Computing Surveys10.1145/1132956.113295938:2(6-es)Online publication date: 25-Jul-2006

View Options

View options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media