A Query-Dependent Duplicate Detection Approach for Large Scale Search Engines

Shaozhi Ye, Ruihua Song, Ji-Rong Wen & Wei-Ying Ma

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3007)

Included in the following conference series:

Asia-Pacific Web Conference

Abstract

Duplicated Web pages greatly hurt the perceived relevance of a search engine. Existing methods for detecting duplicated Web pages fall into two categories: offline and online. Offline methods aim to detect all duplicates in a large set of Web pages, but none of the reported methods can process more than 30 million Web pages, which is about 1% of the pages indexed by today's commercial search engines. Online methods, in contrast, focus on removing duplicated pages from the search results at run time. Although the number of pages to be processed is smaller, these methods can substantially increase the response time of search engines. Our experiments on real query logs show a significant difference between popular and unpopular queries in terms of query number and duplicate distributions. Based on this observation, we propose a hybrid query-dependent duplicate detection method that combines the advantages of both offline and online methods. This hybrid method provides a solution for duplicate detection that is both effective and scalable.
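
The abstract does not spell out how the hybrid method divides the work, so the following is only a minimal sketch under stated assumptions: results for popular queries are deduplicated ahead of time (the offline part), and results for all other queries are deduplicated when the query arrives (the online part). The word-shingle Jaccard similarity used here is a generic near-duplicate test chosen for illustration, not necessarily the measure used in the paper. The names `shingles`, `jaccard`, `dedupe`, `HybridDeduper`, the `search` callback, and the 0.9 threshold are all hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

Shingle = Tuple[str, ...]


def shingles(text: str, k: int = 3) -> Set[Shingle]:
    """Return the set of k-word shingles of a page (illustrative choice)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: Set[Shingle], b: Set[Shingle]) -> float:
    """Jaccard similarity of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedupe(results: List[str], pages: Dict[str, str],
           threshold: float = 0.9) -> List[str]:
    """Keep each result unless it is a near-duplicate of an earlier kept one."""
    kept: List[str] = []
    kept_shingles: List[Set[Shingle]] = []
    for url in results:
        s = shingles(pages[url])
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(url)
            kept_shingles.append(s)
    return kept


class HybridDeduper:
    """Assumed split: popular queries offline, everything else online."""

    def __init__(self, pages: Dict[str, str],
                 search: Callable[[str], List[str]],
                 popular_queries: List[str],
                 threshold: float = 0.9) -> None:
        self.pages = pages
        self.search = search
        self.threshold = threshold
        # Offline step: deduplicate the result lists of popular queries once.
        self.precomputed: Dict[str, List[str]] = {
            q: dedupe(search(q), pages, threshold) for q in popular_queries
        }

    def results(self, query: str) -> List[str]:
        # Popular query: a lookup, so no extra work at response time.
        if query in self.precomputed:
            return self.precomputed[query]
        # Unpopular query: deduplicate only this query's results, online.
        return dedupe(self.search(query), self.pages, self.threshold)
```

Under these assumptions the offline step only ever touches pages returned for popular queries rather than the whole index, and the online step is paid only by the queries that were not precomputed.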

Author information

Authors and Affiliations

Microsoft Research Asia, 5F, Sigma Center, No 49 Zhichun Rd, Beijing, China, 100080
Shaozhi Ye, Ruihua Song, Ji-Rong Wen & Wei-Ying Ma

Editor information

Editors and Affiliations

Chinese University of Hong Kong, Hong Kong, China
Jeffrey Xu Yu
The University of New South Wales, NSW 2052, Australia
Xuemin Lin
Department of Computer Science, Tsinghua University, 100084, Beijing, P.R. China
Hongjun Lu
Victoria University, Australia
Yanchun Zhang

About this paper

Cite this paper

Ye, S., Song, R., Wen, JR., Ma, WY. (2004). A Query-Dependent Duplicate Detection Approach for Large Scale Search Engines. In: Yu, J.X., Lin, X., Lu, H., Zhang, Y. (eds) Advanced Web Technologies and Applications. APWeb 2004. Lecture Notes in Computer Science, vol 3007. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24655-8_6

DOI: https://doi.org/10.1007/978-3-540-24655-8_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-21371-0
Online ISBN: 978-3-540-24655-8
eBook Packages: Springer Book Archive