Decoding Demographic un-fairness from Indian Names

Vahini Medidoddi¹²,
Jalend Bantupalli¹²,
Souvic Chakraborty¹² &
…
Animesh Mukherjee¹²

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13618)

Included in the following conference series:

International Conference on Social Informatics

Abstract

Demographic classification is essential in fairness assessment in recommender systems or in measuring unintended bias in online networks and voting systems. Important fields like education and politics, which often lay a foundation for the future of equality in society, need scrutiny to design policies that can better foster equality in resource distribution constrained by the unbalanced demographic distribution of people in the country.

We collect three publicly available datasets to train state-of-the-art classifiers in the domain of gender and caste classification. We train the models in the Indian context, where the same name can have different styling conventions (Jolly Abraham/Kumar Abhishikta in one state may be written as Abraham Jolly/Abishikta Kumar in the other). Finally, we also perform cross-testing (training and testing on different datasets) to understand the efficacy of the above models.

We also perform an error analysis of the prediction models. Finally, we attempt to assess the bias in the existing Indian system as case studies and find some intriguing patterns manifesting in the complex demographic layout of the sub-continent across the dimensions of gender and caste.

Notes

1. https://www.britannica.com/place/India/Indo-European-languages.
2. https://github.com/vahini01/IndianDemographics.
3. Detailed stats are available in Appendix A.
4. https://www.kooapp.com/.
5. https://resultsarchives.nic.in.
6. Maintained on the same website as CBSE.
7. https://en.wikipedia.org/wiki/Scheduled_Castes_and_Scheduled_Tribes.
8. https://www.kooapp.com/feed.
9. https://www.kooapp.com/.
10. IndicBERT supports the following 12 languages: Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu.
11. https://en.wikipedia.org/wiki/2011_Census_of_India.

References

Ambekar, A., Ward, C.B., Mohammed, J., Male, S., Skiena, S.: Name-ethnicity classification from open sources. In: KDD, pp. 49–58. Association for Computing Machinery, New York, NY, USA (2009)
Gender API. https://gender-api.com (2021)
Name API. https://www.nameapi.org/en/home/ (2021)
Chakraborty, S., Dutta, P., Roychowdhury, S., Mukherjee, A.: CRUSH: contextually regularized and user anchored self-supervised hate speech detection. In: Findings of the Association for Computational Linguistics: NAACL 2022, pp. 1874–1886. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.findings-naacl.144. https://aclanthology.org/2022.findings-naacl.144
Chakraborty, S., Goyal, P., Mukherjee, A.: Aspect-based sentiment analysis of scientific reviews. In: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020, pp. 207–216. JCDL 2020, Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3383583.3398541
Chakraborty, S., Goyal, P., Mukherjee, A.: (IM) balance in the representation of news? an extensive study on a decade long dataset from India. International Conference on Social Informatics, SocInfo (2022). arXiv preprint arXiv:2110.14183
Genderize. https://genderize.io/ (2021)
Hu, Y., Hu, C., Tran, T., Kasturi, T., Joseph, E., Gillingham, M.: What’s in a name? - gender classification of names with character based machine learning models (2021)
Krüger, S., Hermann, B.: Can an online service predict gender? on the state-of-the-art in gender identification from texts. In: Proceedings of the 2nd International Workshop on Gender Equality in Software Engineering, pp. 13–16. GE 2019. IEEE Press, Canada (2019). https://doi.org/10.1109/GE.2019.00012
Mueller, J., Stumme, G.: Gender inference using statistical name characteristics in twitter. In: Proceedings of the 3rd Multidisciplinary International Social Networks Conference on Social Informatics 2016, Data Science 2016. MISNC, SI, DS 2016, Association for Computing Machinery, New York, NY, USA (2016). https://doi.org/10.1145/2955129.2955182
Onograph. https://forebears.io/onograph/ (2021)
Parasurama, P.: raceBERT - a transformer-based model for predicting race and ethnicity from names (2021). arXiv preprint arXiv:2112.03807
Singh, A.K., et al.: What’s kooking? characterizing india’s emerging social network, koo. In: Proceedings of the 2021 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pp. 193–200. Association for Computing Machinery, New York, NY, USA (2021)
Sood, G., Laohaprapanon, S.: Predicting race and ethnicity from the sequence of characters in a name (2018)
Swami, S., Khandelwal, A., Shrivastava, M., Akhtar, S.: LRTC IIITH at IBEREVAL 2017: stance and gender detection in tweets on catalan independence. In: CEUR Workshop Proceedings 1881, 199–203 (2017), 2nd Workshop on Evaluation of Human Language Technologies for Iberian Languages, IBEREVAL (2017)
Tang, C., Ross, K., Saxena, N., Chen, R.: What’s in a name: a study of names, gender inference, and gender behavior in facebook. In: Xu, J., Yu, G., Zhou, S., Unland, R. (eds.) Database Systems for Advanced Applications - 16th International Conference, DASFAA 2011, International Workshops, pp. 344–356 (2011)
Treeratpituk, P., Giles, C.L.: Name-ethnicity classification and ethnicity-sensitive name matching. In: Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, pp. 1141–1147. AAAI2012, AAAI Press, Canada (2012)
Tripathi, A., Faruqui, M.: Gender prediction of Indian names. In: IEEE Technology Students’ Symposium, pp. 137–141. IEEE, Kharagpur (2011). https://doi.org/10.1109/TECHSYM.2011.5783842

Author information

Authors and Affiliations

Indian Institute of Technology, Kharagpur, West Bengal, India
Vahini Medidoddi, Jalend Bantupalli, Souvic Chakraborty & Animesh Mukherjee

Corresponding author

Correspondence to Souvic Chakraborty.

Editor information

Editors and Affiliations

Universität Koblenz-Landau, Koblenz, Germany
Frank Hopfgartner
National University of Singapore, Singapore, Singapore
Kokil Jaidka
GESIS – Leibniz-Institut für Sozialwissenschaften, Cologne, Germany
Philipp Mayr
University of Glasgow, Glasgow, UK
Joemon Jose
University of Glasgow, Glasgow, UK
Jan Breitsohl

Appendix

1.1 Dataset Statistics

Table 4 displays the dataset stats.

Table 4. The table below contains information on datasets that are used to train models and conduct case studies.

1.2 Baseline APIs and Models

We used several publicly available gender classification APIs as baselines and compared their results with those obtained from our transformer-based methods.

Gender API [2]: Gender-API.com is a simple-to-implement solution that adds gender information to existing records. It receives input via an API and returns the split-up name (first name, last name) and gender to the app or the website. According to the website, it will search for the name in a database belonging to the specific country, and if it is not found, it will perform a global lookup. If it cannot find a name in a global lookup, it performs several normalizations on the name to correct typos and cover all spelling variants.

Onograph API [11]: OnoGraph is a set of services that predicts a person’s characteristics based on their name. It can predict nationality, gender, and location (where they live). The services are based on the world’s largest private database of living people, which contains over 4.25 billion people (as of July 2020). According to the documentation, “OnoGraph’s results are the most accurate of any comparable service; and it recognizes around 40 million more names than the nearest comparable service.”

Genderize API [7]: A simple API that predicts a person’s gender based on their name. A request returns a response with the following keys: name, gender, probability, and count. The probability denotes the certainty of the assigned gender. The count indicates the number of data rows examined to compute the response.
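
A minimal sketch of how a response of this shape could be consumed, assuming the certainty score is returned under the key `probability` and that `gender` may be null for unknown names. The helper `label_from_response`, the 0.6 threshold, and the sample payload are illustrative assumptions, not values from the paper.

```python
import json

def label_from_response(payload: str, min_probability: float = 0.6):
    """Return the predicted gender from a Genderize-style JSON string, or None if unsure."""
    data = json.loads(payload)
    gender = data.get("gender")                 # assumed to be null for unknown names
    probability = data.get("probability", 0.0)  # certainty of the assigned gender
    if gender is None or probability < min_probability:
        return None
    return gender

# Hypothetical payload; the numbers are made up for illustration.
sample = '{"name": "karishma", "gender": "female", "probability": 0.98, "count": 1200}'
print(label_from_response(sample))
```

Returning None for low-certainty responses keeps uncertain names out of the evaluation instead of forcing a label.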

1.3 Model Description

Logistic Regression: We concatenate the different parts of the name and compute character n-grams. Next we obtain TF-IDF scores from the character n-grams and pass them as features to the logistic regression model.
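
The feature extraction step can be sketched in plain Python as follows. This is a simplified illustration: it assumes name parts are joined without a separator and uses an unsmoothed tf × log(N/df) weighting, both of which are assumptions, since the paper does not state the exact variant. The function names are hypothetical.

```python
import math
from collections import Counter

def char_ngrams(name: str, n_min: int = 1, n_max: int = 6):
    """Concatenate the name parts and list all character n-grams of length n_min..n_max."""
    text = "".join(name.lower().split())  # assumption: parts joined without a separator
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]

def tfidf_features(names):
    """Return one {ngram: weight} dict per name using tf * log(N / df)."""
    docs = [Counter(char_ngrams(name)) for name in names]
    df = Counter(gram for doc in docs for gram in doc)  # document frequency per n-gram
    n_docs = len(docs)
    return [{gram: (count / sum(doc.values())) * math.log(n_docs / df[gram])
             for gram, count in doc.items()}
            for doc in docs]

features = tfidf_features(["Anu", "Ana"])
```

The resulting sparse vectors would then be passed to a logistic regression classifier. N-grams shared by every name (here "a") receive zero weight, while distinguishing ones (here "u") receive positive weight.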

SVM: The objective of the support vector machine algorithm is to identify a hyperplane in N-dimensional space (N = the number of features) that separates the data points of the two classes. Many hyperplanes can split the two groups of data points; our goal is to find the one with the greatest margin, that is, the greatest distance to the nearest data points of both classes.
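
For reference, the maximum-margin idea can be written as the standard textbook hard-margin problem for the linear case. The symbols (weight vector w, bias b, feature vectors x_i, labels y_i in {-1, +1}, and m training points) are notation introduced here and do not appear in the paper; practical implementations typically use a soft-margin variant and, with a non-linear kernel, apply the same idea in the kernel-induced feature space.

```latex
\min_{\mathbf{w},\,b}\ \frac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i\,(\mathbf{w}^{\top}\mathbf{x}_i + b) \ge 1, \qquad i = 1,\dots,m
```

Under these constraints the margin width is 2/||w||, so minimizing ||w|| maximizes the margin.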

Char CNN: Character-level CNN (char-CNN) is a well-known text classification algorithm. Each character is encoded with a fixed-length trainable embedding. A 1-D CNN is applied to the matrix created by concatenating the above vectors. In our model, we utilize 256 convolution filters in a single hidden layer of 1D convolution with a kernel size of 7.

Char LSTM: A name is a sequence of characters. As in char-CNN, each character of the input name is mapped to a trainable embedding vector and provided as input. Our model employs a single LSTM layer with 64 features and a dropout layer with a rate of 20%.

Transformer Models

We choose BERT for demographic categorization, using full names as inputs, because it has proven highly effective at sequence modeling on English data.
mBERT is trained using a masked language modeling (MLM) objective on the 104 languages with the largest Wikipedias.
IndicBERT is a multilingual ALBERT model trained exclusively on 12 major Indian languages (see Note 10). IndicBERT has far fewer parameters than other multilingual models.
MuRIL is pre-trained on 17 Indian languages and their transliterated counterparts. It employs a different tokenizer from the BERT model. Because it is pre-trained on Indian languages, it is an appropriate candidate for categorization based on Indian names.

Hyperparameters

LR: learning rate = 0.003, n-gram range = (1–6)
SVM: kernel = rbf, n-gram range = (1–6), degree = 3, gamma = scale
Char CNN: learning rate = 0.001, hidden layers = 1, filters = 256, kernel size = 7, optimizer = adam
Char LSTM: learning rate = 0.001, dropout = 0.2, hidden layers = 1, features = 64, optimizer = adam
Transformer models: models = [bert-base-uncased, google/muril-base-cased, ai4bharat/indic-bert, bert-base-multilingual-uncased], epochs = 3, learning rate = 0.00005
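
For reuse in experiments, the settings listed above could be collected in a single configuration object. The sketch below only restates the listed values; the dictionary layout and key names are assumptions and are not taken from the authors' codebase.

```python
# Hyperparameters as listed above; the structure and key names are illustrative.
HYPERPARAMETERS = {
    "logistic_regression": {"learning_rate": 0.003, "ngram_range": (1, 6)},
    "svm": {"kernel": "rbf", "ngram_range": (1, 6), "degree": 3, "gamma": "scale"},
    "char_cnn": {"learning_rate": 0.001, "hidden_layers": 1, "filters": 256,
                 "kernel_size": 7, "optimizer": "adam"},
    "char_lstm": {"learning_rate": 0.001, "dropout": 0.2, "hidden_layers": 1,
                  "features": 64, "optimizer": "adam"},
    "transformer": {
        "models": [
            "bert-base-uncased",
            "google/muril-base-cased",
            "ai4bharat/indic-bert",
            "bert-base-multilingual-uncased",
        ],
        "epochs": 3,
        "learning_rate": 5e-5,
    },
}
```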

1.4 Results

More detailed results are given in Tables 5 and 6.

Handling of Corner Cases: As a name can be shared across genders or castes, we use majority voting in order to assign each name a binary label for both the gender and caste classification tasks. In case of a tie, we assign the label arbitrarily.
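
The majority-voting step can be sketched as follows. The function name, the record format (name, label), and the use of a fixed `tie_label` to resolve ties are assumptions made for illustration; the paper only states that ties are resolved arbitrarily.

```python
from collections import Counter, defaultdict

def majority_labels(records, tie_label):
    """Map each name to its most frequent label; tied names receive `tie_label`."""
    votes = defaultdict(Counter)
    for name, label in records:
        votes[name][label] += 1
    resolved = {}
    for name, counts in votes.items():
        ranked = counts.most_common(2)
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            resolved[name] = tie_label      # tie: fall back to an arbitrary fixed label
        else:
            resolved[name] = ranked[0][0]   # clear majority
    return resolved

# Hypothetical records for illustration.
records = [("kiran", "M"), ("kiran", "F"), ("kiran", "M"),
           ("jyoti", "F"), ("jyoti", "M")]
labels = majority_labels(records, tie_label="F")
```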

Table 5. Performance of the models for gender classification on each dataset.

Table 6. Performance of the models for caste classification on the AIEEE dataset.

1.5 Error Analysis - Baseline APIs vs Our Models

Table 7 lists some of the best and worst test cases for the best performing baselines and the best performing transformer-based models. Both types of models perform best when the first name (first word) is a good representative of the gender (e.g., Karishma Chettri). Baselines usually fail in three cases: the presence of a parental name or surname (e.g., Avunuri Aruna), longer names where gender is represented by multiple words (e.g., Kollipara Kodahda Rama Murthy), and core Indian names (e.g., Laishram Priyabati, Gongkulung Kamei). The main reason for the better performance of transformer models might be that they are trained on complete names and larger datasets, which lets them handle the complexity of Indian names better. However, both types of models tend to fail in the presence of unusual and highly complicated names (e.g., Raj Blal Rawat, Pullammagari Chinna Maddileti).

Table 7. Some common errors made by the best performing baselines and the best performing transformer models. W stands for wrong and C stands for correct. In each two-letter code, the first letter gives the transformer model's result and the second gives the API's result; e.g., WC lists names where the transformer predicted wrong while the API predicted correctly. The letter in brackets denotes the gender (M for male and F for female). The listed names have multiple instances in the datasets, so none of the names uniquely identifies any person.
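
The two-letter codes used in Table 7 can be derived mechanically from the predictions. The sketch below assumes gender labels encoded as "M" and "F"; the function name and argument order are hypothetical.

```python
def outcome_code(truth: str, model_pred: str, api_pred: str) -> str:
    """Two-letter code: first letter for the transformer model, second for the API."""
    model = "C" if model_pred == truth else "W"
    api = "C" if api_pred == truth else "W"
    return model + api

# A female name that the transformer labels "M" and the API labels "F" falls under WC.
code = outcome_code(truth="F", model_pred="M", api_pred="F")
```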

1.6 Case Studies - Values of Median Percentile

Table 8 displays values that are plotted in the left plot of Fig. 1.

1.7 Case Studies - State Wise Results

To understand the state-wise distribution of caste and gender, we answer the following additional research questions (ARQs).

ARQ1: Which states in India have the highest representation of females and backward castes in higher education compared to its population?
ARQ2: Which states in India have been successful in achieving a significant decrease in bias toward females and backward castes over time? Which states are lacking in this aspect?

Table 8. Median percentile of women and reserved-category students in the AIEEE data.

ARQ1: Which states in India have the highest representation of females and backward castes in higher education compared to its population?

The AIEEE dataset has the state information for each data point. We also collect the state-wise population record from Census 2011 (see Note 11). We compute the population-normalized fraction of women and backward caste people writing the AIEEE 2011 exam. From the results plotted in Fig. 2, we observe that the states with the highest population-normalized representation of women writing the AIEEE exam are Jammu & Kashmir, Himachal Pradesh, Punjab, West Bengal, and Maharashtra. Similarly, the states with the highest population-normalized representation of backward castes writing the AIEEE exam are West Bengal, Maharashtra, Punjab, Uttarakhand, and Jammu & Kashmir. We believe that the education policies of these states could serve as guidance for improving the situation in the other Indian states.
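
The exact normalization is not spelled out in the text. One plausible reading, sketched below purely as an assumption, divides a group's share among a state's exam candidates by the group's share of that state's population, so that a value of 1 means proportional representation. The function name and all numbers are hypothetical.

```python
def normalized_representation(group_candidates: int, total_candidates: int,
                              group_population: int, total_population: int) -> float:
    """Group's share among exam candidates divided by its share of the state population."""
    exam_share = group_candidates / total_candidates
    population_share = group_population / total_population
    return exam_share / population_share

# Hypothetical state: 250 of 1,000 candidates are women; women are 480,000 of 1,000,000 people.
ratio = normalized_representation(250, 1000, 480_000, 1_000_000)
under_represented = ratio < 1.0
```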

ARQ2: Which states in India have been successful in achieving a significant decrease in bias toward females and backward castes over time? Which states are lacking in this aspect?

One way to measure the reduction (increase) in bias is to check for the increase (decrease) in the population-normalized percentage of women and backward caste candidates over time. To this end, we obtained the rate of change of the population-normalized percentage of women and backward caste candidates taking the AIEEE exam. For each state, the rate of change is measured as the slope of the best-fit line (linear regression) of the year versus population-normalized percentage scatter plot. The year range considered was 2004 to 2011.
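
The slope of the best-fit line can be computed with ordinary least squares, as in the sketch below. The function name and the yearly values are hypothetical; only the 2004 to 2011 year range comes from the text.

```python
def trend_slope(years, values):
    """Ordinary least-squares slope of `values` regressed on `years`."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    sxx = sum((x - mean_x) ** 2 for x in years)
    return sxy / sxx

years = list(range(2004, 2012))                   # 2004 to 2011 inclusive
values = [10 + 0.5 * (y - 2004) for y in years]   # hypothetical percentages for one state
slope = trend_slope(years, values)
```

A positive slope indicates growing representation over the period, and a negative slope indicates a decline.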

Table 9. Gender and caste breakup (%) in the Koo data.

Table 10. % of users in the oldest 1% of the data, sorted by creation date.

From Fig. 3, we observe that the most successful states in reducing gender inequality are Himachal Pradesh, Andhra Pradesh (Seemandhra and Telangana), Haryana, and Maharashtra. With respect to reducing caste inequality, we find that West Bengal, Punjab, Uttarakhand, Maharashtra, and Karnataka are the most successful.

1.8 Distribution of Caste and Gender in Koo

Table 11. % of users in the most recent 1% of the data, sorted by creation date.

Table 12. % of users in the top 1% of the data, sorted by number of followers.

Table 13. % of users in the bottom 1% of the data, sorted by number of followers.

In Table 9 we show the % breakup of the cross-sectional categories in the Koo dataset. We observe that the largest representation is from general category males, while the smallest is from reserved category females. At the latest time point (see Table 11) we observe higher female representation than at the oldest time point (see Table 10). The % of females (both general and reserved) among the top 1% of users sorted by followers is relatively larger than among the bottom 1% (see Tables 12 and 13). The opposite holds for males (both general and reserved). We believe that a possible reason could be that women have closed coteries of followership.
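
A breakup of this kind could be computed as in the sketch below, which slices the top or bottom fraction of users by a sort key and reports the percentage of each (caste, gender) category. The record fields `caste`, `gender`, and `followers`, the function name, and the synthetic users are all assumptions for illustration.

```python
from collections import Counter

def category_share(users, key, fraction=0.01, top=True):
    """Percentage breakup of (caste, gender) categories in the top or bottom slice by `key`."""
    ranked = sorted(users, key=lambda u: u[key], reverse=top)
    k = max(1, int(len(ranked) * fraction))   # size of the slice, at least one user
    counts = Counter((u["caste"], u["gender"]) for u in ranked[:k])
    return {category: 100.0 * c / k for category, c in counts.items()}

# Synthetic users: the two most-followed accounts are labeled female.
users = [{"caste": "general", "gender": "F" if i >= 198 else "M", "followers": i}
         for i in range(200)]
top_share = category_share(users, key="followers", top=True)
bottom_share = category_share(users, key="followers", top=False)
```

The same function covers the creation-date tables by sorting on a creation timestamp instead of the follower count.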

1.9 Ethical Implications

Like any other classification task, this one can potentially be misused in the hands of malicious actors. Instead of reducing bias, the same technology can be used to enforce discrimination. Hence, we request researchers to exercise caution while using this technology, as some demographic classification APIs are already publicly available. Further, to keep personally identifiable data private, we open-source the codebase used to collect the data points instead of sharing the datasets, a policy ubiquitous among social science researchers.

About this paper

Cite this paper

Medidoddi, V., Bantupalli, J., Chakraborty, S., Mukherjee, A. (2022). Decoding Demographic un-fairness from Indian Names. In: Hopfgartner, F., Jaidka, K., Mayr, P., Jose, J., Breitsohl, J. (eds) Social Informatics. SocInfo 2022. Lecture Notes in Computer Science, vol 13618. Springer, Cham. https://doi.org/10.1007/978-3-031-19097-1_33

DOI: https://doi.org/10.1007/978-3-031-19097-1_33
Published: 12 October 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19096-4
Online ISBN: 978-3-031-19097-1
eBook Packages: Computer Science, Computer Science (R0)
