Abstract
The ever-growing volume of textual information from various sources has fostered the development of digital libraries, making digital content readily accessible but also easy for malicious users to plagiarize, which gives rise to security problems. In this paper, we introduce a duplicate detection scheme that determines, with high accuracy, the degree to which one document is similar to another. Our pairwise document comparison scheme detects the resemblance between the content of documents by considering document chunks, which represent the contexts of words selected from the text. The resulting duplicate detection technique offers a good level of security for protecting intellectual property while improving both the availability of the data stored in the digital library and the correctness of the search results. Finally, the paper addresses efficiency and scalability issues by introducing new data reduction techniques.
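To make the idea of chunk-based pairwise comparison concrete, the following is a minimal sketch and not the scheme proposed in the paper. It assumes that a chunk is a window of consecutive words and that resemblance is the Jaccard overlap of two chunk sets. Both choices, along with the names `chunks`, `resemblance`, and the `size` parameter, are illustrative assumptions rather than details taken from the article.

```python
import re


def chunks(text, size=3):
    """Return the set of overlapping word windows of length `size`.

    Assumption: a "chunk" is modeled as a tuple of consecutive lowercase
    words; the paper's actual chunk definition may differ.
    """
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        # Short texts yield a single chunk, or none if the text is empty.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def resemblance(doc_a, doc_b, size=3):
    """Jaccard overlap of the two chunk sets, a value in [0, 1].

    Assumption: Jaccard is used here only as a simple stand-in for a
    document similarity measure.
    """
    a, b = chunks(doc_a, size), chunks(doc_b, size)
    if not a and not b:
        return 1.0  # two empty documents are treated as identical
    return len(a & b) / len(a | b)
```

Under these assumptions, two documents that share most of their word windows score close to 1, while unrelated documents score close to 0. A real system would also need the data reduction step the abstract mentions in order to avoid comparing every chunk of every document pair.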
Cite this article
Mandreoli, F., Martoglia, R. & Tiberio, P. A document comparison scheme for secure duplicate detection. Int J Digit Libr 4, 223–244 (2004). https://doi.org/10.1007/s00799-004-0079-7
DOI: https://doi.org/10.1007/s00799-004-0079-7