A node resistance-based probability model for resolving duplicate named entities

Published: 01 September 2020

Abstract

Duplicate entities tend to seriously degrade data quality. Despite remarkable recent progress, existing methods still produce a large number of false positives (i.e., entities judged to be duplicates when they are not), which are likely to impair accuracy. To address this challenge, we propose a novel node resistance-based probability model in which we view a given data set as a graph of entities linked to each other via relationships, and then compute a probability value between two entities to measure how similar they are. In particular, each node in the graph has its own resistance value, equal to 1 − confidence (normalized to the range 0–1), and a resistance · probability value is filtered out at each node while the probability value is computed. To evaluate the proposed model, we performed extensive experiments on different data sets, including ACM (https://dl.acm.org), DBLP (https://dblp.uni-trier.de), and IMDB (https://imdb.com). Our experimental results show that the proposed probability model outperforms the existing probability model, improving average F1 scores by up to 14% and never worsening them.
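
The abstract does not give the model's formulas, so the following is only a minimal sketch of one possible reading of the idea, not the authors' implementation. It assumes that probability mass starts at 1 on the source entity, is split equally among a node's neighbors at each step, and that every intermediate node removes resistance · probability (with resistance = 1 − confidence) before passing the rest on. The function name `connection_probability`, the `max_hops` cutoff, the equal split, and the toy graph are all hypothetical choices made for illustration.

```python
def connection_probability(graph, confidence, source, target, max_hops=3):
    """Probability mass that reaches `target` from `source` along simple paths.

    graph:      dict mapping each node to a list of its neighbors
    confidence: dict mapping a node to a confidence in [0, 1];
                nodes that are missing are assumed fully trusted (1.0)
    """
    total = 0.0

    def walk(node, prob, visited, hops):
        nonlocal total
        if node == target:
            total += prob  # mass that arrived at the target
            return
        if hops == max_hops or not graph[node]:
            return
        if node != source:
            # Assumed filtering step: the node removes resistance * probability.
            resistance = 1.0 - confidence.get(node, 1.0)
            prob -= resistance * prob
        # Assumed transition rule: split what is left equally among neighbors.
        step = prob / len(graph[node])
        for neighbor in graph[node]:
            if neighbor not in visited:  # follow simple paths only
                walk(neighbor, step, visited | {neighbor}, hops + 1)

    walk(source, 1.0, {source}, 0)
    return total


# Toy graph: two author nodes linked through one shared paper node.
toy_graph = {"a1": ["p1"], "p1": ["a1", "a2"], "a2": ["p1"]}
score_trusted = connection_probability(toy_graph, {"p1": 1.0}, "a1", "a2")
score_resisted = connection_probability(toy_graph, {"p1": 0.8}, "a1", "a2")
```

Under these assumptions, lowering the confidence of the shared node `p1` lowers the score between `a1` and `a2`, which matches the abstract's stated goal of reducing false positives that arise through low-confidence connections.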

        Published In

        Scientometrics, Volume 124, Issue 3
        Sep 2020
        987 pages

        Publisher

        Springer-Verlag

        Berlin, Heidelberg

        Publication History

        Published: 01 September 2020
        Received: 19 April 2018

        Author Tags

        1. Similarity
        2. Text mining
        3. Disambiguation

        Qualifiers

        • Research-article
