Abstract
Author name ambiguity is a long-standing problem that impairs the accuracy of publication retrieval and bibliometric methods. Most existing disambiguation methods are built on similarity measures, such as the Jaccard coefficient, between the two sets of papers to be disambiguated, where each set is represented by a set of categorical features, e.g., coauthors and publication venues. Such measures perform poorly when the two sets are small, which is typical in author name disambiguation. In this paper, we propose a novel categorical set similarity measure. We model an author's preference, e.g., for venues, using a categorical distribution, and derive a likelihood ratio to estimate the likelihood that the two sets are drawn from the same distribution. This likelihood ratio is used as the similarity measure to decide whether two sets belong to the same author. The measure is mathematically principled and is verified to perform well even when the cardinalities of the two compared sets are small. Additionally, we propose a new method to estimate the number of distinct authors for a given name, based on name statistics extracted from a digital library. Experiments show that our method significantly outperforms a baseline method, a widely used benchmark method, and a real system.
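To make the contrast between a set-overlap measure and a distribution-based measure concrete, the following is a minimal sketch only. It is not the measure derived in the paper, whose formula the abstract does not give. It assumes a generic log-likelihood ratio that compares one pooled categorical model against two separate ones, with additive smoothing. The function names, the `alpha` parameter, and the venue lists are all hypothetical.

```python
from collections import Counter
from math import log


def jaccard(a, b):
    """Jaccard coefficient between the distinct items of two collections."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def smoothed_probs(counts, vocab, alpha):
    """Additively smoothed categorical estimate over vocab (assumed scheme)."""
    total = sum(counts.values()) + alpha * len(vocab)
    return {k: (counts.get(k, 0) + alpha) / total for k in vocab}


def log_likelihood(counts, probs):
    """Log-likelihood of the observed counts under a categorical distribution."""
    return sum(c * log(probs[k]) for k, c in counts.items())


def log_likelihood_ratio(a, b, alpha=1.0):
    """Log-likelihood of both samples under one pooled categorical model,
    minus their log-likelihood under two separately fitted models.
    Higher values favour the hypothesis of a single shared distribution."""
    ca, cb = Counter(a), Counter(b)
    vocab = set(ca) | set(cb)
    shared = smoothed_probs(ca + cb, vocab, alpha)
    own_a = smoothed_probs(ca, vocab, alpha)
    own_b = smoothed_probs(cb, vocab, alpha)
    same = log_likelihood(ca, shared) + log_likelihood(cb, shared)
    separate = log_likelihood(ca, own_a) + log_likelihood(cb, own_b)
    return same - separate


# Hypothetical venue lists for three small sets of papers.
papers_a = ["KDD", "KDD", "ICDM"]
papers_b = ["KDD", "ICDM", "ICDM"]
papers_c = ["SIGGRAPH", "CHI", "CHI"]

print(jaccard(papers_a, papers_b), log_likelihood_ratio(papers_a, papers_b))
print(jaccard(papers_a, papers_c), log_likelihood_ratio(papers_a, papers_c))
```

Unlike the Jaccard coefficient, which only sees which venues occur, the ratio in this sketch uses how often each venue occurs, so it still yields a graded score when the sets contain only a few papers.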
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Li, S., Cong, G., Miao, C. (2012). Author Name Disambiguation Using a New Categorical Distribution Similarity. In: Flach, P.A., De Bie, T., Cristianini, N. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2012. Lecture Notes in Computer Science, vol 7523. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33460-3_42
DOI: https://doi.org/10.1007/978-3-642-33460-3_42
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33459-7
Online ISBN: 978-3-642-33460-3
eBook Packages: Computer Science, Computer Science (R0)