DOI: 10.1145/956863.956946
Article

Online duplicate document detection: signature reliability in a dynamic retrieval environment

Published: 03 November 2003

Abstract

As online document collections continue to expand, both on the Web and in proprietary environments, the need for duplicate detection becomes more critical. Few users wish to retrieve search results consisting of sets of duplicate documents, whether identical duplicates or close matches. Our goal in this work is to investigate the phenomenon and determine one or more approaches that minimize its impact on search results.

Recent work has focused on using some form of signature to characterize a document in order to reduce the complexity of document comparisons. A representative technique constructs a 'fingerprint' of the rarest or richest features in a document, using collection statistics as criteria for feature selection. One challenge of this approach is that, in production environments, document collections are always changing: new documents and new versions of documents arrive frequently, and other documents are periodically removed. When an enterprise freezes a training collection in order to stabilize the underlying repository of such features and its associated collection statistics, issues of coverage and completeness arise. We show that even with very large training collections possessing extremely high feature correlations before and after updates, the underlying fingerprints remain sensitive to subtle changes. We explore alternative solutions that benefit from the development of massive meta-collections made up of sizable components from multiple domains. This technique appears to offer a practical foundation for fingerprint stability. We also consider mechanisms for updating training collections while mitigating signature instability.

Our research is divided into three parts. We begin with a study of the distribution of duplicate types in two broad-ranging news collections consisting of approximately 50 million documents. We then examine the utility of document signatures in addressing identical or nearly identical duplicate documents, and the sensitivity of those signatures to collection updates. Finally, we investigate a flexible method of characterizing and comparing documents in order to permit the identification of non-identical duplicates. This method has produced promising results in an extensive evaluation using a production-based test collection created by domain experts.
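
The fingerprint instability described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the whitespace tokenizer, the smoothed IDF score used as the rarity criterion, the cutoff `k`, the SHA-1 hash, the toy documents, and the function names `doc_frequencies` and `fingerprint` are all assumptions made for illustration. The sketch selects the `k` rarest terms of a document according to the training collection's statistics, hashes them, and then repeats the computation after the collection is updated.

```python
import hashlib
import math
from collections import Counter


def doc_frequencies(collection):
    """Count, for each term, how many documents in the collection contain it."""
    df = Counter()
    for text in collection:
        df.update(set(text.lower().split()))
    return df


def fingerprint(text, df, n_docs, k=3):
    """SHA-1 digest of the k rarest terms of `text` under the given statistics."""
    terms = set(text.lower().split())
    # Smoothed IDF (an assumed rarity score): unseen terms score highest.
    idf = {t: math.log((n_docs + 1) / (df.get(t, 0) + 1)) for t in terms}
    # Rank by descending IDF; break ties alphabetically so the result is deterministic.
    rarest = sorted(terms, key=lambda t: (-idf[t], t))[:k]
    # Hash the selected terms in alphabetical order, independent of word order in the text.
    return hashlib.sha1(" ".join(sorted(rarest)).encode("utf-8")).hexdigest()


# Toy training collection (hypothetical data).
training = [
    "markets rallied as the central bank held rates",
    "the central bank held rates steady on tuesday",
    "storm damage closed the coastal highway",
]
doc = "the central bank held rates as markets rallied"

# Fingerprint under the original collection statistics.
before = fingerprint(doc, doc_frequencies(training), len(training))

# Two new documents arrive and make some formerly rare terms common.
updated = training + [
    "markets rallied again as oil fell",
    "markets rallied as tech stocks rose",
]

# Fingerprint of the same, unchanged document under the updated statistics.
after = fingerprint(doc, doc_frequencies(updated), len(updated))
```

In this toy setup the document itself never changes, yet `before` and `after` differ, because the update shifts which terms count as rarest. Freezing the training collection, as the abstract notes, avoids that shift at the cost of coverage for terms that appear only in later documents.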



Published In

CIKM '03: Proceedings of the twelfth international conference on Information and knowledge management
November 2003
592 pages
ISBN: 1581137230
DOI: 10.1145/956863
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data management
  2. doc signatures
  3. duplicate document detection

Qualifiers

  • Article

Conference

CIKM03

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
