Abstract
Duplication of Web pages greatly hurts the perceived relevance of a search engine. Existing methods for detecting duplicated Web pages fall into two categories: offline and online. Offline methods aim to detect all duplicates in a large set of Web pages, but none of the reported methods can process more than 30 million Web pages, which is about 1% of the pages indexed by today's commercial search engines. In contrast, online methods focus on removing duplicated pages from the search results at run time. Although the number of pages to be processed is smaller, these methods can substantially increase the response time of search engines. Our experiments on real query logs show that popular and unpopular queries differ significantly in terms of query number and duplicate distributions. Based on this observation, we propose a hybrid query-dependent duplicate detection method that combines the advantages of both offline and online methods. This hybrid method provides a solution for duplicate detection that is both effective and scalable.
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ye, S., Song, R., Wen, JR., Ma, WY. (2004). A Query-Dependent Duplicate Detection Approach for Large Scale Search Engines. In: Yu, J.X., Lin, X., Lu, H., Zhang, Y. (eds) Advanced Web Technologies and Applications. APWeb 2004. Lecture Notes in Computer Science, vol 3007. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24655-8_6
DOI: https://doi.org/10.1007/978-3-540-24655-8_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-21371-0
Online ISBN: 978-3-540-24655-8
eBook Packages: Springer Book Archive