A document comparison scheme for secure duplicate detection

Federica Mandreoli¹,
Riccardo Martoglia¹ &
Paolo Tiberio¹


Abstract

The ever-growing volumes of textual information from various sources have fostered the development of digital libraries, making digital content readily accessible but also easy for malicious users to plagiarize, which gives rise to security problems. In this paper, we introduce a duplicate detection scheme that determines, with particularly high accuracy, the degree to which one document is similar to another. Our pairwise document comparison scheme detects the resemblance between the content of documents by considering document chunks, which represent the contexts of words selected from the text. The resulting duplicate detection technique offers a good level of security in the protection of intellectual property while improving the availability of the data stored in the digital library and the correctness of the search results. Finally, the paper addresses efficiency and scalability issues by introducing new data reduction techniques.
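
To make the idea of chunk-based pairwise comparison concrete, the following is a minimal sketch and not the authors' algorithm, which this preview does not specify. It assumes fixed-size word windows as chunks, Jaccard overlap of word sets as the chunk similarity, and a hypothetical matching threshold. The names `chunks`, `chunk_similarity` and `resemblance`, and the parameters `size` and `threshold`, are illustrative only.

```python
import re


def chunks(text, size=5):
    """Split text into non-overlapping chunks of `size` lowercase words.

    Assumption: fixed-size word windows stand in for the paper's chunks.
    """
    words = re.findall(r"\w+", text.lower())
    return [tuple(words[i:i + size]) for i in range(0, len(words), size)]


def chunk_similarity(a, b):
    """Jaccard similarity between the word sets of two chunks.

    Assumption: an illustrative measure, not the one used in the paper.
    """
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def resemblance(doc_a, doc_b, size=5, threshold=0.6):
    """Fraction of doc_a's chunks that have a similar enough chunk in doc_b."""
    ca, cb = chunks(doc_a, size), chunks(doc_b, size)
    if not ca:
        return 0.0
    matched = sum(
        1 for x in ca
        if any(chunk_similarity(x, y) >= threshold for y in cb)
    )
    return matched / len(ca)
```

Under these assumptions the score is asymmetric: `resemblance(a, b)` measures how much of `a` is covered by `b`, so a short excerpt copied from a long document scores high in one direction and low in the other.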



Author information

Authors and Affiliations

Dipartimento di Ingegneria dell’Informazione, Università di Modena e Reggio Emilia, via Vignolese 905, 41100, Modena, Italy
Federica Mandreoli, Riccardo Martoglia & Paolo Tiberio


Corresponding author

Correspondence to Federica Mandreoli.


About this article

Cite this article

Mandreoli, F., Martoglia, R. & Tiberio, P. A document comparison scheme for secure duplicate detection. Int J Digit Libr 4, 223–244 (2004). https://doi.org/10.1007/s00799-004-0079-7


Published: 01 November 2004
Issue Date: November 2004
DOI: https://doi.org/10.1007/s00799-004-0079-7
