Abstract
Fuzzy duplicate detection aims at identifying multiple representations of real-world objects in a data source, and is a task of critical relevance in data cleaning, data mining, and data integration tasks. It has a long history for relational data, stored in a single table or in multiple tables with an equal schema. However, algorithms for fuzzy duplicate detection in more complex structures, such as hierarchies of a data warehouse, XML data, or graph data have only recently emerged. These algorithms use similarity measures that consider the duplicate status of their direct neighbors to improve duplicate detection effectiveness. In this chapter, we study different approaches that have been proposed for XML fuzzy duplicate detection. Our study includes a description and analysis of the different approaches, as well as a comparative experimental evaluation performed on both artificial and real-world data. The two main dimensions used for comparison are the methods effectiveness and efficiency. Our comparison shows that the DogmatiX system [44] is the most effective overall, as it yields the highest recall and precision values for various kinds of differences between duplicates. Another system, called XMLDup [27] has a similar performance, being most effective especially at low recall values. Finally, the SXNM system [36] is the most efficient, as it avoids executing too many pairwise comparisons, but its effectiveness is greatly affected by errors in the data.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Ananthakrishna, R., Chaudhuri, S., Ganti, V.: Eliminating fuzzy duplicates in data warehouses. In: Conference on Very Large Databases (VLDB), Hong Kong, China, pp. 586–597 (2002)
Baeza-Yates, R.A., Ribeiro-Neto, B.: Modern information retrieval. Addison-Wesley Longman Publishing Co., Inc., Boston (1999)
Batini, C., Scannapieco, M.: Data quality - concepts, methodologies and techniques. Springer, Berlin (2006)
Bhattacharya, I., Getoor, L.: Relational clustering for multi-type entity resolution. In: KDD Workshop on Multi-Relational Data Mining, MRDM (2005)
Bhattacharya, I., Getoor, L.: A latent dirichlet model for unsupervised entity resolution. In: Conference on Data Mining (SDM), Bethesda, MD (2006)
Bhattacharya, I., Getoor, L.: Collective entity resolution in relational data. ACM Transactions on Knowledge Discovery from Data 1(1) (2007)
Bilenko, M., Mooney, R.J.: Adaptive duplicate detection using learnable string similarity measures. In: Conference on Knowledge Discovery and Data Mining (KDD), Washington, DC, pp. 39–48 (2003)
Bille, P.: A survey on tree edit distance and related problems. Theoretical Computer Science 337(1–3), 217–239 (2005)
Bleiholder, J., Naumann, F.: Data fusion. ACM Comput. Surv. 41(1), 1–41 (2008), http://doi.acm.org/10.1145/1456650.1456651
Carvalho, J.C.P., da Silva, A.S.: Finding similar identities among objects from multiple web sources. In: CIKM Workshop on Web Information and Data Management (WIDM), New Orleans, Louisiana, USA, pp. 90–93 (2003)
Chaudhuri, S., Ganjam, K., Ganti, V., Motwani, R.: Robust and efficient fuzzy match for online data cleaning. In: Conference on the Management of Data (SIGMOD), San Diego, CA, pp. 313–324 (2003)
Chaudhuri, S., Ganti, V., Motwani, R.: Robust identification of fuzzy duplicates. In: International Conference on Data Engineering (ICDE), Tokyo, Japan, pp. 865–876 (2005)
Chen, Z., Kalashnikov, D.V., Mehrotra, S.: Exploiting relationships for object consolidation. In: SIGMOD Workshop on Information Quality in Information Systems (IQIS), Baltimore, MD (2005)
Cohen, W.W., Richman, J.: Learning to match and cluster large high-dimensional data sets for data integration. In: Conference on Knowledge Discovery and Data Mining (KDD), Edmonton, Alberta, Canada, pp. 475–480 (2002)
Doan, A., Lu, Y., Lee, Y., Han, J.: Profile-based object matching for information integration. IEEE Intelligent Systems 18(5), 54–59 (2003)
Dong, X., Halevy, A., Madhavan, J.: Reference reconciliation in complex information spaces. In: Conference on the Management of Data (SIGMOD), Baltimore, MD, pp. 85–96 (2005)
Elfeky, M.G., Elmagarmid, A.K., Verykios, V.S.: Tailor: A record linkage tool box. In: International Conference on Data Engineering (ICDE), San Jose, CA, pp. 17–28 (2002)
Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: A survey. IEEE Trans. Knowl. Data Eng. 19(1) (2007)
Fellegi, I.P., Sunter, A.B.: A theory for record linkage. Journal of the American Statistical Association (1969)
Guha, S., Jagadish, H.V., Koudas, N., Srivastava, D., Yu, T.: Approximate XML joins. In: Conference on the Management of Data (SIGMOD), Madison, WI (2002)
Hernández, M.A., Stolfo, S.J.: The merge/purge problem for large databases. In: Conference on the Management of Data (SIGMOD), San Jose, CA, pp. 127–138 (1995)
Herschel, M., Naumann, F.: Scaling up duplicate detection in graph data. In: Conference on Information and Knowledge Management (CIKM), pp. 1325–1326 (2008)
Jin, L., Li, C., Mehrotra, S.: Efficient record linkage in large data sets. In: Conference on Database Systems for Advanced Applications (DASFAA), Kyoto, Japan (2003)
Kalashnikov, D.V., Mehrotra, S.: Domain-independent data cleaning via analysis of entity-relationship graph. ACM Transactions on Database Systems (TODS) 31(2), 716–767 (2006)
Kuhn, H.W.: The hungarian method for the assignment problem. Naval Research Logistics Quarterly 2(1–2), 83–97 (1955)
Lehti, P., Fankhauser, P.: Unsupervised duplicate detection using sample non-duplicates. Journal on Data Semantics VII 4244, 136–164 (2006)
Leitão, L., Calado, P., Weis, M.: Structure-based inference of XML similarity for fuzzy duplicate detection. In: Conference on Information and Knowledge Management (CIKM), pp. 293–302 (2007)
Levenshtein, V.: Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Soviet Physics Doklady 10, 707 (1966)
Lim, E.P., Srivastava, J., Prabhakar, S., Richardson, J.: Entity identification in database integration. In: International Conference on Data Engineering (ICDE), Vienna, Austria, pp. 294–301 (1993)
McCallum, A., Wellner, B.: Object consolidation by graph partitioning with a conditionally-trained distance metric. In: KDD Workshop on Data Cleaning, Record Linkage and Object Consolidation, Washington, DC (2003)
Milano, D., Scannapieco, M., Catarci, T.: Structure aware XML object identification. In: VLDB Workshop on Clean Databases (CleanDB), Seoul, Korea (2006)
Minkov, E., Cohen, W.W., Ng, A.Y.: Contextual search and name disambiguation in email using graphs. In: Conference on Research and Development in Information Retrieval (SIGIR), Seattle, Washington, USA, pp. 27–34 (2006)
Monge, A.E., Elkan, C.P.: An efficient domain-independent algorithm for detecting approximately duplicate database records. In: SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD), Tucson, AZ (1997)
Newcombe, H., Kennedy, J., Axford, S., James, A.: Automatic linkage of vital records. Science 130(3381), 954–959 (1959)
Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of plausible inference, 2nd edn. Morgan Kaufmann Publishers, San Francisco (1988)
Puhlmann, S., Weis, M., Naumann, F.: XML duplicate detection using sorted neigborhoods. In: Conference on Extending Database Technology (EDBT), Munich, Germany, pp. 773–791 (2006)
Rahm, E., Bernstein, P.A.: A survey of approaches to automatic schema matching. VLDB Journal 10(4), 334–350 (2001)
Rahm, E., Do, H.H.: Data cleaning: Problems and current approaches. IEEE Data Engineering Bulletin 23, 3–13 (2000)
Sarawagi, S., Bhamidipaty, A.: Interactive deduplication using active learning. In: Conference on Knowledge Discovery and Data Mining (KDD), Edmonton, Alberta, pp. 269–278 (2002)
Singla, P., Domingos, P.: Multi-relational record linkage. In: KDD Workshop on Multi-Relational Data Mining (MRDM), Seattle, WA (2004)
Singla, P., Domingos, P.: Object identification with attribute-mediated dependences. In: Conference on Principals and Practice of Knowledge Discovery in Databases (PKDD), Porto, Portugal, pp. 297–308 (2005)
Weis, M.: Duplicate Detection in XML Data. WiKu-Verlag Verlag fuer Wissenschaft und Kultur (2008)
Weis, M., Naumann, F.: Duplicate detection in XML. In: SIGMOD Workshop on Information Quality in Information Systems (IQIS), Paris, France, pp. 10–19 (2004)
Weis, M., Naumann, F.: Dogmatix tracks down duplicates in XML. In: Conference on the Management of Data (SIGMOD), Baltimore, MD, pp. 431–442 (2005)
Weis, M., Naumann, F.: Detecting duplicates in complex XML data. In: International Conference on Data Engineering (ICDE), Atlanta, GA (2006)
Weis, M., Naumann, F., Jehle, U., Lufter, J., Schuster, H.: Industry-scale duplicate detection. In: Proceedings of the VLDB Endowment (PVLDB), vol. 1(2), pp. 1253–1264 (2008)
Whang, S.E., Menestrina, D., Koutrika, G., Tin Theobald, M., Garcia-Molina, H.: Entity resolution with iterative blocking. In: Conference on the Management of Data (SIGMOD). Providence, Rhode Island (2009)
Yin, X., Han, J., Yu, P.S.: LinkClus: Efficient clustering via heterogeneous semantic links. In: Conference on Very Large Databases (VLDB), Seoul, Korea, pp. 427–438 (2006)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Calado, P., Herschel, M., Leitão, L. (2010). An Overview of XML Duplicate Detection Algorithms. In: Ma, Z., Yan, L. (eds) Soft Computing in XML Data Management. Studies in Fuzziness and Soft Computing, vol 255. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14010-5_8
Download citation
DOI: https://doi.org/10.1007/978-3-642-14010-5_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-14009-9
Online ISBN: 978-3-642-14010-5
eBook Packages: EngineeringEngineering (R0)