Article

Dynamic maintenance of web indexes using landmarks

Authors:

Sriram Padmanabhan,

Jeffrey Scott Vitter,

Ramesh AgarwalAuthors Info & Claims

WWW '03: Proceedings of the 12th international conference on World Wide Web

Pages 102 - 111

https://doi.org/10.1145/775152.775167

Published: 20 May 2003 Publication History

Abstract

Recent work on incremental crawling has enabled the indexed document collection of a search engine to be more synchronized with the changing World Wide Web. However, this synchronized collection is not immediately searchable, because the keyword index is rebuilt from scratch less frequently than the collection can be refreshed. An inverted index is usually used to index documents crawled from the web. Complete index rebuild at high frequency is expensive. Previous work on incremental inverted index updates have been restricted to adding and removing documents. Updating the inverted index for previously indexed documents that have changed has not been addressed.In this paper, we propose an efficient method to update the inverted index for previously indexed documents whose contents have changed. Our method uses the idea of landmarks together with the diff algorithm to significantly reduce the number of postings in the inverted index that need to be updated. Our experiments verify that our landmark-diff method results in significant savings in the number of update operations on the inverted index.

References

[1]

L. Arge, O. Procopiuc, S. Ramaswamy, T. Suel, J. Vahrenhold, and J. S. Vitter. A unified approach for indexed and non-indexed spatial joins. Proceedings of the 7th Intl. Conf. on Extending Database Technology (EDBT '00), 1777, 413--429, 2000.]]

Digital Library

[2]

R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 1999.]]

Digital Library

[3]

R. S. Boyer and J. S. Moore. A fast string searching algorithm. Communications of the ACM, 20, 762--772, 1976.]]

Digital Library

[4]

B. Brewington and G. Cybenko. Keeping up with the changing web. IEEE Computer, 33(5), 52--58, May 2000.]]

Digital Library

[5]

E. W. Brown, J. P. Callan, and W. B. Croft. Fast incremental indexing for full-text information retrieval. In 20th Intl. Conf. on Very Large Data Bases, 192--202, 1994.]]

Digital Library

[6]

J. Cho and H. Garcia-Molina. Estimating frequency of change. Submitted for publication, 2000.]]

[7]

J. Cho and H. Garcia-Molina. The evolution of the web and implications for an incremental crawler. 26th Intl. Conf. on Very Large Data Bases, 2000.]]

Digital Library

[8]

C. Clarke and G. Cormack. Dynamic inverted indexes for a distributed full-text retrieval system. Tech. Report CS-95-01, Univ. of Waterloo CS Dept., 1995.]]

[9]

C. Clarke, G. Cormack, and F. Burkowski. Fast inverted indexes with on-line update. Tech. Report CS-94-40, Univ. of Waterloo CS Dept., 1994.]]

[10]

D. Cutting and J. Perdersen. Optimizations for dynamic inverted index maintenance. Proceedings of SIGIR, 405--411, 1990.]]

Digital Library

[11]

W. Frakes and R. Baeza-Yates, editors. Information Retrieval: Data Structures and Algorithms. Prentice-Hall, 1992.]]

Digital Library

[12]

D. E. Knuth, J. H. Morris, and V. B. Pratt. Fast pattern matching in strings. SIAM Journal of Computing, 6, 323--350, 1977.]]

[13]

S. Lawrence and C. L. Giles. Accessibility of information on the web. Nature, 400, 107--109, 1999.]]

[14]

Q. Li and B. Moon. Indexing and querying xml data for regular path expressions. In 27th Intl. Conf. on Very Large Data Bases, 361--370, 2001.]]

Digital Library

[15]

L. Lim, M. Wang, S. Padmanabhan, J. S. Vitter, and R. C. Agarwal. Characterizing web document change. In Advances in Web-Age Information Management, 2nd Intl. Conf., WAIM 2001, 133--144, 2001.]]

Digital Library

[16]

U. Manber and S. Wu. GLIMPSE: A tool to search through entire file systems. In Proceedings of the Winter 1994 USENIX Conf., 23--32. USENIX, 1994.]]

Digital Library

[17]

S. Melnik, S. Raghavan, B. Yang, and H. Garcia-Molina. Building a distributed full-text index for the web. Proceedings of the 10th Intl. WWW Conf., 2001.]]

Digital Library

[18]

L. Page and S. Brin. The anatomy of a large-scale hypertextual web search engine. Proceedings of the 7th Intl. WWW Conf., 107--117, 1998.]]

Digital Library

[19]

A. Tomasic, H. Garcia-Molina, and K. Shoens. Incremental updates of inverted lists for text document retrieval. Proceedings of 1994 ACM SIGMOD Intl. Conf. of Management of Data, 289--300, May 1994.]]

Digital Library

[20]

E. Ukkonen. Algorithms for approximate string matching. Information and Control, 64, 100--118, 1985.]]

Digital Library

[21]

J. S. Vitter. Faster methods for random sampling. Communications of the ACM, 27, July 1984.]]

Digital Library

[22]

J. S. Vitter. An efficient I/O interface for optical disks. ACM Trans. on Database Systems, 129--162, June 1985.]]

Digital Library

[23]

I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers, Los Altos, CA 94022, USA, second edition, 1999.]]

Digital Library

[24]

C. Zhang, J. F. Naughton, D. J. DeWitt, Q. Luo, and G. Lohman. On supporting containment queries in relational database management systems. In Proceedings of 2001 ACM SIGMOD Intl. Conf. of Management of Data, 361--370, 2001.]]

Digital Library

Cited By

Bastos ASingh KNadgeri AHoffart JSingh MSuzumura T(2023)Can Persistent Homology provide an efficient alternative for Evaluation of Knowledge Graph Completion Methods?Proceedings of the ACM Web Conference 202310.1145/3543507.3583308(2455-2466)Online publication date: 30-Apr-2023
https://dl.acm.org/doi/10.1145/3543507.3583308
Eades PWirth AZobel J(2022)Immediate Text Search on Streams Using Apoptosic IndexesAdvances in Information Retrieval10.1007/978-3-030-99736-6_11(157-169)Online publication date: 5-Apr-2022
https://doi.org/10.1007/978-3-030-99736-6_11
Uma Priya D Santhi Thilagam P (2020)Dynamic Data Retrieval Using Incremental Clustering and IndexingInternational Journal of Information Retrieval Research10.4018/IJIRR.202007010510:3(74-91)Online publication date: 1-Jul-2020
https://doi.org/10.4018/IJIRR.2020070105
Show More Cited By

Index Terms

Dynamic maintenance of web indexes using landmarks
1. Applied computing
  1. Document management and text processing
2. Information systems
  1. Information retrieval

Recommendations

Efficient Update of Indexes for Dynamically Changing Web Documents

Recent work on incremental crawling has enabled the indexed document collection of a search engine to be more synchronized with the changing World Wide Web. However, this synchronized collection is not immediately searchable, because the keyword index ...
Efficient Textual Web Retrieval using Wavelet Tree

Searching on the web is one of the most progressive and expanding field nowadays. A large amount of information is available on the World Wide Web, motivating the need of efficient text indexing method that support fast text retrieval. In the past, two ...
Incremental maintenance of XML structural indexes
SIGMOD '04: Proceedings of the 2004 ACM SIGMOD international conference on Management of data

Increasing popularity of XML in recent years has generated much interest in query processing over graph-structured data. To support efficient evaluation of path expressions, many structural indexes have been proposed. The most popular ones are the 1-...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

WWW '03: Proceedings of the 12th international conference on World Wide Web

May 2003

772 pages

ISBN:1581136803

DOI:10.1145/775152

Conference Chairs:
Gusztáv Hencsey
MTA SZTAKI, Hungary
,
Bebo White
Stanford Linear Accelerator Center, USA
,
Program Chairs:
Yih-Farn Robin Chen
AT&T Labs -- Research, USA
,
László Kovács
MTA SZTAKI, Hungary
,
Steve Lawrence
Google Inc., USA

Copyright © 2003 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

ACM: Association for Computing Machinery

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 May 2003

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Article

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

20
Total Citations
View Citations
880
Total Downloads

Downloads (Last 12 months)3
Downloads (Last 6 weeks)0

Reflects downloads up to 24 Sep 2024

Other Metrics

View Author Metrics

Citations

Cited By

Bastos ASingh KNadgeri AHoffart JSingh MSuzumura T(2023)Can Persistent Homology provide an efficient alternative for Evaluation of Knowledge Graph Completion Methods?Proceedings of the ACM Web Conference 202310.1145/3543507.3583308(2455-2466)Online publication date: 30-Apr-2023
https://dl.acm.org/doi/10.1145/3543507.3583308
Eades PWirth AZobel J(2022)Immediate Text Search on Streams Using Apoptosic IndexesAdvances in Information Retrieval10.1007/978-3-030-99736-6_11(157-169)Online publication date: 5-Apr-2022
https://doi.org/10.1007/978-3-030-99736-6_11
Uma Priya D Santhi Thilagam P (2020)Dynamic Data Retrieval Using Incremental Clustering and IndexingInternational Journal of Information Retrieval Research10.4018/IJIRR.202007010510:3(74-91)Online publication date: 1-Jul-2020
https://doi.org/10.4018/IJIRR.2020070105
Jin XAgun DYang TWu QShen YZhao SMukhopadhyay SZhai CBertino ECrestani FMostafa JTang JSi LZhou XChang YLi YSondhi P(2016)Hybrid Indexing for Versioned Document Search with Cluster-based RetrievalProceedings of the 25th ACM International on Conference on Information and Knowledge Management10.1145/2983323.2983733(377-386)Online publication date: 24-Oct-2016
https://dl.acm.org/doi/10.1145/2983323.2983733
Margaritis GAnastasiadis S(2014)Incremental Text Indexing for Fast Disk-Based SearchACM Transactions on the Web10.1145/25608008:3(1-31)Online publication date: 8-Jul-2014
https://dl.acm.org/doi/10.1145/2560800
He JSuel THersh WCallan JMaarek YSanderson M(2012)Optimizing positional index structures for versioned document collectionsProceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval10.1145/2348283.2348319(245-254)Online publication date: 12-Aug-2012
https://dl.acm.org/doi/10.1145/2348283.2348319
Delbru RCampinas STummarello G(2012)Searching web dataWeb Semantics: Science, Services and Agents on the World Wide Web10.1016/j.websem.2011.04.00410(33-58)Online publication date: 1-Jan-2012
https://dl.acm.org/doi/10.1016/j.websem.2011.04.004
Luo YPicalausa FFletcher GHidders JVansummeren S(2012)Storing and Indexing Massive RDF DatasetsSemantic Search over the Web10.1007/978-3-642-25008-8_2(31-60)Online publication date: 28-Jan-2012
https://doi.org/10.1007/978-3-642-25008-8_2
Jin DKim JYou BJung H(2010)Efficient Dynamic Index Structure for SSD (SPM)The Journal of the Korea Contents Association10.5392/JKCA.2010.10.2.05410:2(54-62)Online publication date: 28-Feb-2010
https://doi.org/10.5392/JKCA.2010.10.2.054
Barsky MThomo AToth ZZuzarte CHuang JKoudas NJones GWu XCollins-Thompson KAn A(2010)Online update of b-treesProceedings of the 19th ACM international conference on Information and knowledge management10.1145/1871437.1871460(149-158)Online publication date: 26-Oct-2010
https://dl.acm.org/doi/10.1145/1871437.1871460
Show More Cited By

View Options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents