Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/775152.775167acmconferencesArticle/Chapter ViewAbstractPublication PagesthewebconfConference Proceedingsconference-collections
Article

Dynamic maintenance of web indexes using landmarks

Published: 20 May 2003 Publication History

Abstract

Recent work on incremental crawling has enabled the indexed document collection of a search engine to be more synchronized with the changing World Wide Web. However, this synchronized collection is not immediately searchable, because the keyword index is rebuilt from scratch less frequently than the collection can be refreshed. An inverted index is usually used to index documents crawled from the web. Complete index rebuild at high frequency is expensive. Previous work on incremental inverted index updates have been restricted to adding and removing documents. Updating the inverted index for previously indexed documents that have changed has not been addressed.In this paper, we propose an efficient method to update the inverted index for previously indexed documents whose contents have changed. Our method uses the idea of landmarks together with the diff algorithm to significantly reduce the number of postings in the inverted index that need to be updated. Our experiments verify that our landmark-diff method results in significant savings in the number of update operations on the inverted index.

References

[1]
L. Arge, O. Procopiuc, S. Ramaswamy, T. Suel, J. Vahrenhold, and J. S. Vitter. A unified approach for indexed and non-indexed spatial joins. Proceedings of the 7th Intl. Conf. on Extending Database Technology (EDBT '00), 1777, 413--429, 2000.]]
[2]
R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 1999.]]
[3]
R. S. Boyer and J. S. Moore. A fast string searching algorithm. Communications of the ACM, 20, 762--772, 1976.]]
[4]
B. Brewington and G. Cybenko. Keeping up with the changing web. IEEE Computer, 33(5), 52--58, May 2000.]]
[5]
E. W. Brown, J. P. Callan, and W. B. Croft. Fast incremental indexing for full-text information retrieval. In 20th Intl. Conf. on Very Large Data Bases, 192--202, 1994.]]
[6]
J. Cho and H. Garcia-Molina. Estimating frequency of change. Submitted for publication, 2000.]]
[7]
J. Cho and H. Garcia-Molina. The evolution of the web and implications for an incremental crawler. 26th Intl. Conf. on Very Large Data Bases, 2000.]]
[8]
C. Clarke and G. Cormack. Dynamic inverted indexes for a distributed full-text retrieval system. Tech. Report CS-95-01, Univ. of Waterloo CS Dept., 1995.]]
[9]
C. Clarke, G. Cormack, and F. Burkowski. Fast inverted indexes with on-line update. Tech. Report CS-94-40, Univ. of Waterloo CS Dept., 1994.]]
[10]
D. Cutting and J. Perdersen. Optimizations for dynamic inverted index maintenance. Proceedings of SIGIR, 405--411, 1990.]]
[11]
W. Frakes and R. Baeza-Yates, editors. Information Retrieval: Data Structures and Algorithms. Prentice-Hall, 1992.]]
[12]
D. E. Knuth, J. H. Morris, and V. B. Pratt. Fast pattern matching in strings. SIAM Journal of Computing, 6, 323--350, 1977.]]
[13]
S. Lawrence and C. L. Giles. Accessibility of information on the web. Nature, 400, 107--109, 1999.]]
[14]
Q. Li and B. Moon. Indexing and querying xml data for regular path expressions. In 27th Intl. Conf. on Very Large Data Bases, 361--370, 2001.]]
[15]
L. Lim, M. Wang, S. Padmanabhan, J. S. Vitter, and R. C. Agarwal. Characterizing web document change. In Advances in Web-Age Information Management, 2nd Intl. Conf., WAIM 2001, 133--144, 2001.]]
[16]
U. Manber and S. Wu. GLIMPSE: A tool to search through entire file systems. In Proceedings of the Winter 1994 USENIX Conf., 23--32. USENIX, 1994.]]
[17]
S. Melnik, S. Raghavan, B. Yang, and H. Garcia-Molina. Building a distributed full-text index for the web. Proceedings of the 10th Intl. WWW Conf., 2001.]]
[18]
L. Page and S. Brin. The anatomy of a large-scale hypertextual web search engine. Proceedings of the 7th Intl. WWW Conf., 107--117, 1998.]]
[19]
A. Tomasic, H. Garcia-Molina, and K. Shoens. Incremental updates of inverted lists for text document retrieval. Proceedings of 1994 ACM SIGMOD Intl. Conf. of Management of Data, 289--300, May 1994.]]
[20]
E. Ukkonen. Algorithms for approximate string matching. Information and Control, 64, 100--118, 1985.]]
[21]
J. S. Vitter. Faster methods for random sampling. Communications of the ACM, 27, July 1984.]]
[22]
J. S. Vitter. An efficient I/O interface for optical disks. ACM Trans. on Database Systems, 129--162, June 1985.]]
[23]
I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers, Los Altos, CA 94022, USA, second edition, 1999.]]
[24]
C. Zhang, J. F. Naughton, D. J. DeWitt, Q. Luo, and G. Lohman. On supporting containment queries in relational database management systems. In Proceedings of 2001 ACM SIGMOD Intl. Conf. of Management of Data, 361--370, 2001.]]

Cited By

View all
  • (2023)Can Persistent Homology provide an efficient alternative for Evaluation of Knowledge Graph Completion Methods?Proceedings of the ACM Web Conference 202310.1145/3543507.3583308(2455-2466)Online publication date: 30-Apr-2023
  • (2022)Immediate Text Search on Streams Using Apoptosic IndexesAdvances in Information Retrieval10.1007/978-3-030-99736-6_11(157-169)Online publication date: 5-Apr-2022
  • (2020)Dynamic Data Retrieval Using Incremental Clustering and IndexingInternational Journal of Information Retrieval Research10.4018/IJIRR.202007010510:3(74-91)Online publication date: 1-Jul-2020
  • Show More Cited By

Index Terms

  1. Dynamic maintenance of web indexes using landmarks

      Recommendations

      Comments

      Please enable JavaScript to view thecomments powered by Disqus.

      Information & Contributors

      Information

      Published In

      cover image ACM Conferences
      WWW '03: Proceedings of the 12th international conference on World Wide Web
      May 2003
      772 pages
      ISBN:1581136803
      DOI:10.1145/775152
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Sponsors

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 20 May 2003

      Permissions

      Request permissions for this article.

      Check for updates

      Author Tags

      1. inverted files
      2. update processing

      Qualifiers

      • Article

      Acceptance Rates

      Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months)3
      • Downloads (Last 6 weeks)0
      Reflects downloads up to 24 Sep 2024

      Other Metrics

      Citations

      Cited By

      View all
      • (2023)Can Persistent Homology provide an efficient alternative for Evaluation of Knowledge Graph Completion Methods?Proceedings of the ACM Web Conference 202310.1145/3543507.3583308(2455-2466)Online publication date: 30-Apr-2023
      • (2022)Immediate Text Search on Streams Using Apoptosic IndexesAdvances in Information Retrieval10.1007/978-3-030-99736-6_11(157-169)Online publication date: 5-Apr-2022
      • (2020)Dynamic Data Retrieval Using Incremental Clustering and IndexingInternational Journal of Information Retrieval Research10.4018/IJIRR.202007010510:3(74-91)Online publication date: 1-Jul-2020
      • (2016)Hybrid Indexing for Versioned Document Search with Cluster-based RetrievalProceedings of the 25th ACM International on Conference on Information and Knowledge Management10.1145/2983323.2983733(377-386)Online publication date: 24-Oct-2016
      • (2014)Incremental Text Indexing for Fast Disk-Based SearchACM Transactions on the Web10.1145/25608008:3(1-31)Online publication date: 8-Jul-2014
      • (2012)Optimizing positional index structures for versioned document collectionsProceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval10.1145/2348283.2348319(245-254)Online publication date: 12-Aug-2012
      • (2012)Searching web dataWeb Semantics: Science, Services and Agents on the World Wide Web10.1016/j.websem.2011.04.00410(33-58)Online publication date: 1-Jan-2012
      • (2012)Storing and Indexing Massive RDF DatasetsSemantic Search over the Web10.1007/978-3-642-25008-8_2(31-60)Online publication date: 28-Jan-2012
      • (2010)Efficient Dynamic Index Structure for SSD (SPM)The Journal of the Korea Contents Association10.5392/JKCA.2010.10.2.05410:2(54-62)Online publication date: 28-Feb-2010
      • (2010)Online update of b-treesProceedings of the 19th ACM international conference on Information and knowledge management10.1145/1871437.1871460(149-158)Online publication date: 26-Oct-2010
      • Show More Cited By

      View Options

      Get Access

      Login options

      View options

      PDF

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      Media

      Figures

      Other

      Tables

      Share

      Share

      Share this Publication link

      Share on social media