DOI: 10.1145/3132847.3133008

Nationality Classification Using Name Embeddings

Published: 06 November 2017

Abstract

Nationality identification unlocks important demographic information, with many applications in biomedical and sociological research. Existing name-based nationality classifiers use name substrings as features and are trained on small, unrepresentative sets of labeled names, typically extracted from Wikipedia. As a result, these methods achieve limited performance and cannot support fine-grained classification.
We exploit the phenomenon of homophily in communication patterns to learn name embeddings, a new representation that encodes gender, ethnicity, and nationality and is readily applicable to building classifiers and other systems. Through our analysis of 57M contact lists from a major Internet company, we design a fine-grained nationality classifier covering 39 groups representing over 90% of the world population. In an evaluation against other published systems over 13 common classes, our F1 score (0.795) is substantially better than that of our closest competitor, Ethnea (0.580). To the best of our knowledge, this is the most accurate fine-grained nationality classifier available.
As a social media application, we apply our classifiers to the followers of major Twitter celebrities over six different domains. We demonstrate stark differences in the ethnicities of the followers of Trump and Obama, and in the sports and entertainment favored by different groups. Finally, we identify an anomalous political figure whose presumably inflated following appears largely incapable of reading the language he posts in.
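
The homophily idea in the abstract can be illustrated with a minimal sketch. This is not the authors' method: the abstract does not say how the embeddings are trained, so the sketch swaps in raw co-occurrence counts as a crude stand-in for learned embeddings and a nearest-centroid rule as the classifier. The contact lists, the seed labels, and every function name below are hypothetical and exist only for illustration.

```python
from collections import Counter, defaultdict
from itertools import permutations
from math import sqrt

# Hypothetical toy contact lists; the paper analyzes 57M real ones.
CONTACT_LISTS = [
    ["maria", "jose", "carmen"],
    ["jose", "carmen", "luis"],
    ["maria", "luis", "jose"],
    ["luis", "carmen", "maria"],
    ["wei", "li", "chen"],
    ["li", "chen", "ming"],
    ["wei", "ming", "li"],
    ["ming", "chen", "wei"],
]

# Hypothetical seed labels for a few names (assumption, not from the paper).
SEED_LABELS = {
    "maria": "Hispanic",
    "jose": "Hispanic",
    "wei": "Chinese",
    "li": "Chinese",
}


def cooccurrence_vectors(contact_lists):
    """Represent each name by counts of the names it shares a list with."""
    vectors = defaultdict(Counter)
    for names in contact_lists:
        for a, b in permutations(set(names), 2):
            vectors[a][b] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(weight * v.get(key, 0) for key, weight in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def class_centroids(vectors, labels):
    """Sum the vectors of the labeled names in each class."""
    sums = defaultdict(Counter)
    for name, label in labels.items():
        sums[label].update(vectors.get(name, Counter()))
    return sums


def classify(name, vectors, centroids):
    """Assign the class whose centroid is most similar to the name's vector."""
    vec = vectors.get(name, Counter())
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


vectors = cooccurrence_vectors(CONTACT_LISTS)
centroids = class_centroids(vectors, SEED_LABELS)

for unlabeled in ("carmen", "chen"):
    print(unlabeled, "->", classify(unlabeled, vectors, centroids))
```

Because names in this toy data only co-occur with names from the same group, an unlabeled name lands nearest the centroid of the group it communicates with, which is the homophily signal the abstract describes. A real system would presumably replace the count vectors with dense learned embeddings and the centroid rule with a trained classifier.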



Published In

CIKM '17: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management
November 2017
2604 pages
ISBN: 9781450349185
DOI: 10.1145/3132847

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. ethnicity classification
  2. name embedding
  3. nationality classification

Qualifiers

  • Research-article

Conference

CIKM '17
Acceptance Rates

CIKM '17 Paper Acceptance Rate 171 of 855 submissions, 20%;
Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
