Author name disambiguation using a graph model with node splitting and merging based on bibliographic information

Dongwook Shin¹, Taehwan Kim¹, Joongmin Choi¹ & Jungsun Kim¹

Abstract

Author ambiguity arises mainly in two ways: when several different authors write their names in the same way, generally known as the namesake problem, and when the name of a single author is written in many different ways, referred to as the heteronymous name problem. These author ambiguity problems have long been an obstacle to efficient information retrieval in digital libraries, causing incorrect identification of authors and impeding correct classification of their publications. Distinguishing such authors is a nontrivial task, especially when very limited information about them is available. In this paper, we propose a graph-based approach to author name disambiguation, in which a graph model is constructed from co-author relations and author ambiguity is resolved by graph operations such as vertex (or node) splitting and merging based on co-authorship. In our framework, called a Graph Framework for Author Disambiguation (GFAD), the namesake problem is solved by splitting an author vertex involved in multiple cycles of co-authorship, and the heteronymous name problem is handled by merging multiple author vertices having similar names if those vertices are connected to a common vertex. Experiments were carried out on the real DBLP and Arnetminer collections, and the performance of GFAD was compared with that of three representative unsupervised author name disambiguation systems. We confirm that GFAD shows better overall performance on representative evaluation metrics. An additional contribution is that we have released the refined DBLP collection to the public to facilitate organizing a performance benchmark for future author disambiguation systems.
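The two graph operations named in the abstract can be illustrated with a minimal Python sketch. This is not the GFAD implementation: the function names, the toy records, the use of `difflib.SequenceMatcher` as the name-similarity measure, and the 0.8 threshold are all assumptions made for illustration. Splitting is simplified to grouping a vertex's co-authors by whether they remain connected once that vertex is removed, which is one simplified reading of "involved in multiple cycles of co-authorship".

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def build_graph(records):
    """Build an undirected co-author graph: name -> set of co-author names."""
    graph = defaultdict(set)
    for authors in records:
        for name in authors:
            graph[name]  # make sure single-author papers still create a vertex
        for a, b in combinations(authors, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def split_vertex(graph, v):
    """Group v's co-authors by connectivity in the graph with v removed.

    Co-authors in the same group are linked by a path that avoids v, so they
    lie on a common cycle through v. Separate groups hint at namesakes.
    """
    groups, seen = [], set()
    for start in sorted(graph[v]):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # depth-first search that never passes through v
            node = stack.pop()
            if node == v or node in component:
                continue
            component.add(node)
            stack.extend(graph[node])
        seen |= component
        groups.append(sorted(component & graph[v]))
    return groups


def merge_candidates(graph, threshold=0.8):
    """Return pairs of similar names that share at least one co-author.

    The similarity measure and threshold are assumptions, not the paper's.
    """
    pairs = []
    for a, b in combinations(sorted(graph), 2):
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold and graph[a] & graph[b]:
            pairs.append((a, b))
    return pairs


# Hypothetical citation records (author lists only).
records = [
    ["Wei Wang", "Alice Moreau", "Boris Petrov"],
    ["Wei Wang", "Chidi Okafor", "Dana Schmidt"],
    ["Jungsun Kim", "Alice Moreau"],
    ["Jung-sun Kim", "Alice Moreau"],
]
graph = build_graph(records)
print(split_vertex(graph, "Wei Wang"))
print(merge_candidates(graph))
```

In this toy data, "Wei Wang" has two co-author groups with no link between them other than the shared name, so the sketch proposes splitting that vertex in two, while the two spellings of "Jungsun Kim" share a co-author and are proposed for merging. A real system would also need the outlier handling and threshold tuning that the notes below mention.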

Notes

http://dblp.uni-trier.de/.
http://citeseer.ist.psu.edu/.
http://www.ncbi.nlm.nih.gov/pubmed.
http://www.lbd.dcc.ufmg.br/bdbcomp/.
http://arnetminer.org/.
By a citation record, we mean a set of bibliographic attributes containing author names, paper title, and publication venue of a particular publication.
GFAD also relies on the paper title in addition to co-authorship, but the title is used only at the outlier removal step, if necessary, to meet the specific objectives of the system.
http://meta.wikimedia.org/wiki/WikiAuthors.
http://www.paritycomputing.com/web/index.html.
http://info.scival.com/experts.
Isolated vertices can arise during the graph construction process and/or after the namesake resolution process.
To maximize the chance of selecting different name variations that denote the same person, while minimizing the chance of judging similar names of different persons to be the same person, suitable threshold values must first be determined manually. We therefore determined the threshold value empirically, after experimenting with 200 randomly collected name pairs comprising 100 pairs of name variations and 100 pairs of similar names.
To measure and analyze the relative occurrence frequencies of the three failure cases, we randomly selected 18 ambiguous groups from the Arnetminer collection.
http://ieeexplore.ieee.org/Xplore/home.jsp.
http://www.paritycomputing.com/web/index.html.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Hanyang University, 55 Hanyangdaehak-ro, Sangrok-Gu, Ansan, Gyeonggi-Do, 426-791, South Korea
Dongwook Shin, Taehwan Kim, Joongmin Choi & Jungsun Kim

Corresponding author

Correspondence to Jungsun Kim.

About this article

Cite this article

Shin, D., Kim, T., Choi, J. et al. Author name disambiguation using a graph model with node splitting and merging based on bibliographic information. Scientometrics 100, 15–50 (2014). https://doi.org/10.1007/s11192-014-1289-4

Received: 27 March 2013
Published: 19 April 2014
Issue Date: July 2014
DOI: https://doi.org/10.1007/s11192-014-1289-4
