Author Name Disambiguation Using a New Categorical Distribution Similarity

Shaohua Li, Gao Cong & Chunyan Miao

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7523)

Included in the following conference series:

Joint European Conference on Machine Learning and Knowledge Discovery in Databases

Abstract

Author name ambiguity is a long-standing problem that impairs the accuracy of publication retrieval and bibliometric methods. Most existing disambiguation methods are built on similarity measures, e.g., the Jaccard coefficient, between the two sets of papers to be disambiguated, with each set represented by a set of categorical features, e.g., coauthors and publication venues. Such measures perform poorly when the two sets are small, which is typical in author name disambiguation. In this paper, we propose a novel categorical set similarity measure. We model an author's preference, e.g., for venues, using a categorical distribution, and derive a likelihood ratio to estimate the likelihood that the two sets are drawn from the same distribution. This likelihood ratio is used as the similarity measure to decide whether two sets belong to the same author. The measure is mathematically principled and is verified to perform well even when the cardinalities of the two compared sets are small. Additionally, we propose a new method to estimate the number of distinct authors for a given name, based on name statistics extracted from a digital library. Experiments show that our method significantly outperforms a baseline method, a widely used benchmark method, and a real system.
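
The abstract does not give the derivation of the likelihood ratio, so the following is only an illustrative sketch of the general idea, not the authors' measure. It contrasts the Jaccard coefficient with a standard Bayesian likelihood ratio that asks whether two small sets of categorical features (here, venues) are better explained by one shared categorical distribution or by two separate ones. The symmetric Dirichlet prior, the smoothing parameter `alpha`, the fixed vocabulary `vocab`, and all function names are assumptions introduced for this example.

```python
from collections import Counter
from math import lgamma


def jaccard(a, b):
    """Jaccard coefficient between the distinct categories in a and b."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def log_marginal(counts, vocab, alpha=1.0):
    """Log probability of an observed sequence of categories when the
    categorical parameters are integrated out under a symmetric
    Dirichlet(alpha) prior (an assumption of this sketch)."""
    n = sum(counts.values())
    k = len(vocab)
    result = lgamma(k * alpha) - lgamma(k * alpha + n)
    for category in vocab:
        result += lgamma(alpha + counts.get(category, 0)) - lgamma(alpha)
    return result


def log_likelihood_ratio(a, b, vocab, alpha=1.0):
    """log of P(a, b | one shared distribution) divided by
    P(a) * P(b) under two independent distributions.
    Positive values favour the 'same author' hypothesis."""
    ca, cb = Counter(a), Counter(b)
    return (
        log_marginal(ca + cb, vocab, alpha)
        - log_marginal(ca, vocab, alpha)
        - log_marginal(cb, vocab, alpha)
    )


# Hypothetical venue lists for three small paper sets.
vocab = ["KDD", "ICDM", "ICDE", "JCDL", "NIPS"]
set_a = ["KDD", "KDD", "ICDM"]
set_b = ["KDD", "ICDM"]
set_c = ["JCDL", "NIPS"]

same = log_likelihood_ratio(set_a, set_b, vocab)
different = log_likelihood_ratio(set_a, set_c, vocab)
print(jaccard(set_a, set_b), jaccard(set_a, set_c))
print(same, different)
```

Unlike the Jaccard coefficient, which only looks at which categories overlap, a ratio of this kind also uses how often each category occurs and how large the vocabulary is, which is one way to obtain a graded score from very small sets.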

Author information

Authors and Affiliations

Nanyang Technological University, Singapore
Shaohua Li, Gao Cong & Chunyan Miao

Editor information

Editors and Affiliations

Intelligent Systems Laboratory, University of Bristol, Merchant Venturers Building, Woodland Road, BS8 1UB, Bristol, UK
Peter A. Flach, Tijl De Bie & Nello Cristianini

About this paper

Cite this paper

Li, S., Cong, G., Miao, C. (2012). Author Name Disambiguation Using a New Categorical Distribution Similarity. In: Flach, P.A., De Bie, T., Cristianini, N. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2012. Lecture Notes in Computer Science, vol 7523. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33460-3_42

DOI: https://doi.org/10.1007/978-3-642-33460-3_42
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33459-7
Online ISBN: 978-3-642-33460-3
eBook Packages: Computer Science, Computer Science (R0)
