Abstract
Author ambiguity mainly arises in two ways: when several different authors write their names in the same way, generally known as the namesake problem, and when the name of a single author is written in many different ways, referred to as the heteronymous name problem. These problems have long been an obstacle to efficient information retrieval in digital libraries, causing incorrect identification of authors and impeding correct classification of their publications. Distinguishing such authors is a nontrivial task, especially when very little information about them is available. In this paper, we propose a graph-based approach to author name disambiguation, in which a graph model is constructed from co-author relations and author ambiguity is resolved by graph operations, namely vertex (or node) splitting and merging based on co-authorship. In our framework, called a Graph Framework for Author Disambiguation (GFAD), the namesake problem is solved by splitting an author vertex involved in multiple cycles of co-authorship, and the heteronymous name problem is handled by merging multiple author vertices that have similar names if those vertices are connected to a common vertex. Experiments were carried out on real DBLP and Arnetminer collections, and the performance of GFAD was compared with that of three representative unsupervised author name disambiguation systems. The results confirm that GFAD achieves better overall performance on representative evaluation metrics. As an additional contribution, we have released the refined DBLP collection to the public to facilitate a performance benchmark for future author disambiguation systems.
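The merging step described above can be illustrated with a minimal sketch. It is not the GFAD implementation: the adjacency-set graph representation, the function names, the use of `difflib.SequenceMatcher` as the name similarity measure, the threshold of 0.8, and the rule that two directly connected vertices are never merged are all assumptions made for illustration. The splitting step for the namesake problem is not covered here.

```python
from difflib import SequenceMatcher
from itertools import combinations


def name_similarity(a, b):
    """Stand-in string similarity in [0, 1] (assumption: not the paper's measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def merge_similar_vertices(graph, threshold=0.8):
    """Merge author vertices with similar names that share a common neighbour.

    `graph` maps each author name to the set of its co-author names
    (undirected, so every edge appears in both adjacency sets).
    """
    graph = {v: set(nbrs) for v, nbrs in graph.items()}  # work on a copy
    merged = True
    while merged:
        merged = False
        for u, v in combinations(sorted(graph), 2):
            if v in graph[u]:
                continue  # assumption: direct co-authors are different people
            if not (graph[u] & graph[v]):
                continue  # no common co-author vertex
            if name_similarity(u, v) < threshold:
                continue  # names are not similar enough
            # Fold v into u: redirect every edge of v to u, then drop v.
            for w in graph.pop(v):
                graph[w].discard(v)
                graph[w].add(u)
                graph[u].add(w)
            merged = True
            break  # the vertex set changed, so restart the scan
    return graph


# Hypothetical example data: two spellings of one name sharing a co-author.
example = {
    "Jonathan Smith": {"Carol Lee"},
    "Jonathon Smith": {"Carol Lee"},
    "Alice Brown": {"Carol Lee"},
    "Carol Lee": {"Jonathan Smith", "Jonathon Smith", "Alice Brown"},
}
result = merge_similar_vertices(example)
```

In this example the two spellings of "Smith" collapse into one vertex because they share the co-author "Carol Lee", while "Alice Brown" is left alone because her name is not similar to either spelling.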
Notes
By a citation record, we mean the set of bibliographic attributes of a particular publication, comprising its author names, paper title, and publication venue.
GFAD also relies on the paper title in addition to co-authorship, but the title is used only at the outlier removal step, if necessary, to meet the specific objectives of the system.
Isolated vertices can arise during the graph construction process and/or after the namesake resolution process.
A suitable threshold value must first be determined manually, so as to maximize the chance of selecting different name variations that denote the same person while minimizing the chance of judging similar names that denote different persons as the same person. We therefore determined the threshold value empirically, by experimenting with 200 randomly collected name pairs consisting of 100 pairs of name variations and 100 pairs of similar names.
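One way such an empirical threshold selection could be carried out is sketched below. This is an illustration only: the similarity measure (`difflib.SequenceMatcher`), the accuracy criterion, the candidate thresholds, and the example name pairs are all assumptions, and the note does not say which procedure or measure was actually used.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Stand-in string similarity in [0, 1] (assumption: not the paper's measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pick_threshold(variation_pairs, distinct_pairs, candidates):
    """Return (threshold, accuracy) for the best candidate on the labelled pairs.

    A variation pair (same person) counts as correct when its similarity is at
    or above the threshold; a distinct pair (different persons) counts as
    correct when its similarity is below it. Ties keep the earliest candidate.
    """
    total = len(variation_pairs) + len(distinct_pairs)
    best_t, best_acc = None, -1.0
    for t in candidates:
        correct = sum(name_similarity(a, b) >= t for a, b in variation_pairs)
        correct += sum(name_similarity(a, b) < t for a, b in distinct_pairs)
        acc = correct / total
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc


# Hypothetical labelled pairs standing in for the 100 + 100 pairs in the note.
variation_pairs = [("Jonathan Smith", "Jonathon Smith")]
distinct_pairs = [("Alice Brown", "Carol Lee"), ("Wei Wang", "Wei Wong")]
best_t, best_acc = pick_threshold(variation_pairs, distinct_pairs, [0.5, 0.8, 0.9])
```

Accuracy weights both kinds of error equally; if wrongly merging two different persons is considered more costly than missing a name variation, the criterion could be weighted accordingly.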
To measure and analyze the relative occurrence frequencies of the three failure cases, we randomly selected 18 ambiguous groups from the Arnetminer collection.
About this article
Cite this article
Shin, D., Kim, T., Choi, J. et al. Author name disambiguation using a graph model with node splitting and merging based on bibliographic information. Scientometrics 100, 15–50 (2014). https://doi.org/10.1007/s11192-014-1289-4
DOI: https://doi.org/10.1007/s11192-014-1289-4