An Overview of XML Duplicate Detection Algorithms

Pável Calado³,
Melanie Herschel⁴ &
Luís Leitão³

Part of the book series: Studies in Fuzziness and Soft Computing ((STUDFUZZ,volume 255))

690 Accesses

Abstract

Fuzzy duplicate detection aims at identifying multiple representations of real-world objects in a data source, and is a task of critical relevance in data cleaning, data mining, and data integration tasks. It has a long history for relational data, stored in a single table or in multiple tables with an equal schema. However, algorithms for fuzzy duplicate detection in more complex structures, such as hierarchies of a data warehouse, XML data, or graph data have only recently emerged. These algorithms use similarity measures that consider the duplicate status of their direct neighbors to improve duplicate detection effectiveness. In this chapter, we study different approaches that have been proposed for XML fuzzy duplicate detection. Our study includes a description and analysis of the different approaches, as well as a comparative experimental evaluation performed on both artificial and real-world data. The two main dimensions used for comparison are the methods effectiveness and efficiency. Our comparison shows that the DogmatiX system [44] is the most effective overall, as it yields the highest recall and precision values for various kinds of differences between duplicates. Another system, called XMLDup [27] has a similar performance, being most effective especially at low recall values. Finally, the SXNM system [36] is the most efficient, as it avoids executing too many pairwise comparisons, but its effectiveness is greatly affected by errors in the data.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 129.00; Price excludes VAT (USA)

Softcover Book: USD 169.99; Price excludes VAT (USA)

Hardcover Book: USD 169.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Detecting Near Duplicate Dataset

Near-Duplicate Document Detection Using Semantic-Based Similarity Measure: A Novel Approach

Incorporating Clustering into Set Similarity Join Algorithms: The SjClust Framework

References

Ananthakrishna, R., Chaudhuri, S., Ganti, V.: Eliminating fuzzy duplicates in data warehouses. In: Conference on Very Large Databases (VLDB), Hong Kong, China, pp. 586–597 (2002)
Google Scholar
Baeza-Yates, R.A., Ribeiro-Neto, B.: Modern information retrieval. Addison-Wesley Longman Publishing Co., Inc., Boston (1999)
Google Scholar
Batini, C., Scannapieco, M.: Data quality - concepts, methodologies and techniques. Springer, Berlin (2006)
MATH Google Scholar
Bhattacharya, I., Getoor, L.: Relational clustering for multi-type entity resolution. In: KDD Workshop on Multi-Relational Data Mining, MRDM (2005)
Google Scholar
Bhattacharya, I., Getoor, L.: A latent dirichlet model for unsupervised entity resolution. In: Conference on Data Mining (SDM), Bethesda, MD (2006)
Google Scholar
Bhattacharya, I., Getoor, L.: Collective entity resolution in relational data. ACM Transactions on Knowledge Discovery from Data 1(1) (2007)
Google Scholar
Bilenko, M., Mooney, R.J.: Adaptive duplicate detection using learnable string similarity measures. In: Conference on Knowledge Discovery and Data Mining (KDD), Washington, DC, pp. 39–48 (2003)
Google Scholar
Bille, P.: A survey on tree edit distance and related problems. Theoretical Computer Science 337(1–3), 217–239 (2005)
Article MATH MathSciNet Google Scholar
Bleiholder, J., Naumann, F.: Data fusion. ACM Comput. Surv. 41(1), 1–41 (2008), http://doi.acm.org/10.1145/1456650.1456651
Article Google Scholar
Carvalho, J.C.P., da Silva, A.S.: Finding similar identities among objects from multiple web sources. In: CIKM Workshop on Web Information and Data Management (WIDM), New Orleans, Louisiana, USA, pp. 90–93 (2003)
Google Scholar
Chaudhuri, S., Ganjam, K., Ganti, V., Motwani, R.: Robust and efficient fuzzy match for online data cleaning. In: Conference on the Management of Data (SIGMOD), San Diego, CA, pp. 313–324 (2003)
Google Scholar
Chaudhuri, S., Ganti, V., Motwani, R.: Robust identification of fuzzy duplicates. In: International Conference on Data Engineering (ICDE), Tokyo, Japan, pp. 865–876 (2005)
Google Scholar
Chen, Z., Kalashnikov, D.V., Mehrotra, S.: Exploiting relationships for object consolidation. In: SIGMOD Workshop on Information Quality in Information Systems (IQIS), Baltimore, MD (2005)
Google Scholar
Cohen, W.W., Richman, J.: Learning to match and cluster large high-dimensional data sets for data integration. In: Conference on Knowledge Discovery and Data Mining (KDD), Edmonton, Alberta, Canada, pp. 475–480 (2002)
Google Scholar
Doan, A., Lu, Y., Lee, Y., Han, J.: Profile-based object matching for information integration. IEEE Intelligent Systems 18(5), 54–59 (2003)
Article Google Scholar
Dong, X., Halevy, A., Madhavan, J.: Reference reconciliation in complex information spaces. In: Conference on the Management of Data (SIGMOD), Baltimore, MD, pp. 85–96 (2005)
Google Scholar
Elfeky, M.G., Elmagarmid, A.K., Verykios, V.S.: Tailor: A record linkage tool box. In: International Conference on Data Engineering (ICDE), San Jose, CA, pp. 17–28 (2002)
Google Scholar
Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: A survey. IEEE Trans. Knowl. Data Eng. 19(1) (2007)
Google Scholar
Fellegi, I.P., Sunter, A.B.: A theory for record linkage. Journal of the American Statistical Association (1969)
Google Scholar
Guha, S., Jagadish, H.V., Koudas, N., Srivastava, D., Yu, T.: Approximate XML joins. In: Conference on the Management of Data (SIGMOD), Madison, WI (2002)
Google Scholar
Hernández, M.A., Stolfo, S.J.: The merge/purge problem for large databases. In: Conference on the Management of Data (SIGMOD), San Jose, CA, pp. 127–138 (1995)
Google Scholar
Herschel, M., Naumann, F.: Scaling up duplicate detection in graph data. In: Conference on Information and Knowledge Management (CIKM), pp. 1325–1326 (2008)
Google Scholar
Jin, L., Li, C., Mehrotra, S.: Efficient record linkage in large data sets. In: Conference on Database Systems for Advanced Applications (DASFAA), Kyoto, Japan (2003)
Google Scholar
Kalashnikov, D.V., Mehrotra, S.: Domain-independent data cleaning via analysis of entity-relationship graph. ACM Transactions on Database Systems (TODS) 31(2), 716–767 (2006)
Article Google Scholar
Kuhn, H.W.: The hungarian method for the assignment problem. Naval Research Logistics Quarterly 2(1–2), 83–97 (1955)
Article MathSciNet Google Scholar
Lehti, P., Fankhauser, P.: Unsupervised duplicate detection using sample non-duplicates. Journal on Data Semantics VII 4244, 136–164 (2006)
Article Google Scholar
Leitão, L., Calado, P., Weis, M.: Structure-based inference of XML similarity for fuzzy duplicate detection. In: Conference on Information and Knowledge Management (CIKM), pp. 293–302 (2007)
Google Scholar
Levenshtein, V.: Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Soviet Physics Doklady 10, 707 (1966)
MathSciNet Google Scholar
Lim, E.P., Srivastava, J., Prabhakar, S., Richardson, J.: Entity identification in database integration. In: International Conference on Data Engineering (ICDE), Vienna, Austria, pp. 294–301 (1993)
Google Scholar
McCallum, A., Wellner, B.: Object consolidation by graph partitioning with a conditionally-trained distance metric. In: KDD Workshop on Data Cleaning, Record Linkage and Object Consolidation, Washington, DC (2003)
Google Scholar
Milano, D., Scannapieco, M., Catarci, T.: Structure aware XML object identification. In: VLDB Workshop on Clean Databases (CleanDB), Seoul, Korea (2006)
Google Scholar
Minkov, E., Cohen, W.W., Ng, A.Y.: Contextual search and name disambiguation in email using graphs. In: Conference on Research and Development in Information Retrieval (SIGIR), Seattle, Washington, USA, pp. 27–34 (2006)
Google Scholar
Monge, A.E., Elkan, C.P.: An efficient domain-independent algorithm for detecting approximately duplicate database records. In: SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD), Tucson, AZ (1997)
Google Scholar
Newcombe, H., Kennedy, J., Axford, S., James, A.: Automatic linkage of vital records. Science 130(3381), 954–959 (1959)
Article Google Scholar
Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of plausible inference, 2nd edn. Morgan Kaufmann Publishers, San Francisco (1988)
Google Scholar
Puhlmann, S., Weis, M., Naumann, F.: XML duplicate detection using sorted neigborhoods. In: Conference on Extending Database Technology (EDBT), Munich, Germany, pp. 773–791 (2006)
Google Scholar
Rahm, E., Bernstein, P.A.: A survey of approaches to automatic schema matching. VLDB Journal 10(4), 334–350 (2001)
Article MATH Google Scholar
Rahm, E., Do, H.H.: Data cleaning: Problems and current approaches. IEEE Data Engineering Bulletin 23, 3–13 (2000)
Google Scholar
Sarawagi, S., Bhamidipaty, A.: Interactive deduplication using active learning. In: Conference on Knowledge Discovery and Data Mining (KDD), Edmonton, Alberta, pp. 269–278 (2002)
Google Scholar
Singla, P., Domingos, P.: Multi-relational record linkage. In: KDD Workshop on Multi-Relational Data Mining (MRDM), Seattle, WA (2004)
Google Scholar
Singla, P., Domingos, P.: Object identification with attribute-mediated dependences. In: Conference on Principals and Practice of Knowledge Discovery in Databases (PKDD), Porto, Portugal, pp. 297–308 (2005)
Google Scholar
Weis, M.: Duplicate Detection in XML Data. WiKu-Verlag Verlag fuer Wissenschaft und Kultur (2008)
Google Scholar
Weis, M., Naumann, F.: Duplicate detection in XML. In: SIGMOD Workshop on Information Quality in Information Systems (IQIS), Paris, France, pp. 10–19 (2004)
Google Scholar
Weis, M., Naumann, F.: Dogmatix tracks down duplicates in XML. In: Conference on the Management of Data (SIGMOD), Baltimore, MD, pp. 431–442 (2005)
Google Scholar
Weis, M., Naumann, F.: Detecting duplicates in complex XML data. In: International Conference on Data Engineering (ICDE), Atlanta, GA (2006)
Google Scholar
Weis, M., Naumann, F., Jehle, U., Lufter, J., Schuster, H.: Industry-scale duplicate detection. In: Proceedings of the VLDB Endowment (PVLDB), vol. 1(2), pp. 1253–1264 (2008)
Google Scholar
Whang, S.E., Menestrina, D., Koutrika, G., Tin Theobald, M., Garcia-Molina, H.: Entity resolution with iterative blocking. In: Conference on the Management of Data (SIGMOD). Providence, Rhode Island (2009)
Google Scholar
Yin, X., Han, J., Yu, P.S.: LinkClus: Efficient clustering via heterogeneous semantic links. In: Conference on Very Large Databases (VLDB), Seoul, Korea, pp. 427–438 (2006)
Google Scholar

Download references

Author information

Authors and Affiliations

IST/INESC-ID, Av. Prof. Cavaco Silva, 2744-016, Porto Salvo, Portugal
Pável Calado & Luís Leitão
Universität Tübingen, Sand 13, 72076, Tübingen, Germany
Melanie Herschel

Authors

Pável Calado
View author publications
You can also search for this author in PubMed Google Scholar
Melanie Herschel
View author publications
You can also search for this author in PubMed Google Scholar
Luís Leitão
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

Northeastern University, Liaoning, China
Zongmin Ma & Li Yan &

Rights and permissions

Reprints and permissions

Copyright information

About this chapter

Cite this chapter

Calado, P., Herschel, M., Leitão, L. (2010). An Overview of XML Duplicate Detection Algorithms. In: Ma, Z., Yan, L. (eds) Soft Computing in XML Data Management. Studies in Fuzziness and Soft Computing, vol 255. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14010-5_8

Download citation

DOI: https://doi.org/10.1007/978-3-642-14010-5_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-14009-9
Online ISBN: 978-3-642-14010-5
eBook Packages: EngineeringEngineering (R0)

Publish with us

Policies and ethics