Abstract
Demographic classification is essential in fairness assessment in recommender systems or in measuring unintended bias in online networks and voting systems. Important fields like education and politics, which often lay a foundation for the future of equality in society, need scrutiny to design policies that can better foster equality in resource distribution constrained by the unbalanced demographic distribution of people in the country.
We collect three publicly available datasets to train state-of-the-art classifiers in the domain of gender and caste classification. We train the models in the Indian context, where the same name can have different styling conventions (Jolly Abraham/Kumar Abhishikta in one state may be written as Abraham Jolly/Abishikta Kumar in the other). Finally, we also perform cross-testing (training and testing on different datasets) to understand the efficacy of the above models.
We also perform an error analysis of the prediction models. As case studies, we then assess the bias in the existing Indian system and find some intriguing patterns in the complex demographic layout of the sub-continent across the dimensions of gender and caste.
Notes
- 1.
- 2.
- 3. Detailed statistics are available in Appendix A.
- 4.
- 5.
- 6. maintained in the same website as CBSE.
- 7.
- 8.
- 9.
- 10. IndicBERT supports the following 12 languages: Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu.
- 11.
References
Ambekar, A., Ward, C.B., Mohammed, J., Male, S., Skiena, S.: Name-ethnicity classification from open sources. In: KDD, pp. 49–58. Association for Computing Machinery, New York, NY, USA (2009)
Gender API. https://gender-api.com (2021)
Name API. https://www.nameapi.org/en/home/ (2021)
Chakraborty, S., Dutta, P., Roychowdhury, S., Mukherjee, A.: CRUSH: contextually regularized and user anchored self-supervised hate speech detection. In: Findings of the Association for Computational Linguistics: NAACL 2022, pp. 1874–1886. Association for Computational Linguistics, Seattle, United States (2022). https://doi.org/10.18653/v1/2022.findings-naacl.144. https://aclanthology.org/2022.findings-naacl.144
Chakraborty, S., Goyal, P., Mukherjee, A.: Aspect-based sentiment analysis of scientific reviews. In: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020, pp. 207–216. JCDL 2020, Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3383583.3398541
Chakraborty, S., Goyal, P., Mukherjee, A.: (IM) balance in the representation of news? an extensive study on a decade long dataset from India. International Conference on Social Informatics, SocInfo (2022). arXiv preprint arXiv:2110.14183
Genderize. https://genderize.io/ (2021)
Hu, Y., Hu, C., Tran, T., Kasturi, T., Joseph, E., Gillingham, M.: What’s in a name? - gender classification of names with character based machine learning models (2021)
Krüger, S., Hermann, B.: Can an online service predict gender? on the state-of-the-art in gender identification from texts. In: Proceedings of the 2nd International Workshop on Gender Equality in Software Engineering, pp. 13–16. GE 2019. IEEE Press, Canada (2019). https://doi.org/10.1109/GE.2019.00012
Mueller, J., Stumme, G.: Gender inference using statistical name characteristics in twitter. In: Proceedings of the 3rd Multidisciplinary International Social Networks Conference on Social Informatics 2016, Data Science 2016. MISNC, SI, DS 2016, Association for Computing Machinery, New York, NY, USA (2016). https://doi.org/10.1145/2955129.2955182
Onograph. https://forebears.io/onograph/ (2021)
Parasurama, P.: raceBERT - a transformer-based model for predicting race and ethnicity from names (2021). arXiv preprint arXiv:2112.03807
Singh, A.K., et al.: What’s kooking? characterizing india’s emerging social network, koo. In: Proceedings of the 2021 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pp. 193–200. Association for Computing Machinery, New York, NY, USA (2021)
Sood, G., Laohaprapanon, S.: Predicting race and ethnicity from the sequence of characters in a name (2018)
Swami, S., Khandelwal, A., Shrivastava, M., Akhtar, S.: LRTC IIITH at IBEREVAL 2017: stance and gender detection in tweets on catalan independence. In: CEUR Workshop Proceedings 1881, 199–203 (2017), 2nd Workshop on Evaluation of Human Language Technologies for Iberian Languages, IBEREVAL (2017)
Tang, C., Ross, K., Saxena, N., Chen, R.: What’s in a name: a study of names, gender inference, and gender behavior in facebook. In: Xu, J., Yu, G., Zhou, S., Unland, R. (eds.) Database Systems for Advanced Applications - 16th International Conference, DASFAA 2011, International Workshops, pp. 344–356 (2011)
Treeratpituk, P., Giles, C.L.: Name-ethnicity classification and ethnicity-sensitive name matching. In: Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, pp. 1141–1147. AAAI2012, AAAI Press, Canada (2012)
Tripathi, A., Faruqui, M.: Gender prediction of Indian names. In: IEEE Technology Students’ Symposium, pp. 137–141. IEEE, Kharagpur (2011). https://doi.org/10.1109/TECHSYM.2011.5783842
Appendix
1.1 Dataset Statistics
Table 4 displays the dataset statistics.
1.2 Baseline APIs and Models
We used several publicly available gender classification APIs as baselines and compared them with the results obtained from our transformer-based methods.
Gender API [2]: Gender-API.com is a simple-to-implement solution that adds gender information to existing records. It receives input via an API and returns the split-up name (first name, last name) and gender to the app or the website. According to the website, it will search for the name in a database belonging to the specific country, and if it is not found, it will perform a global lookup. If it cannot find a name in a global lookup, it performs several normalizations on the name to correct typos and cover all spelling variants.
Onograph API [11]: OnoGraph is a set of services that predicts a person’s characteristics based on their name. It can predict nationality, gender, and location (where they live). The services are based on the world’s largest private database of living people, which contains over 4.25 billion people (as of July 2020). According to the documentation, “OnoGraph’s results are the most accurate of any comparable service; and it recognizes around 40 million more names than the nearest comparable service.”
Genderize API [7]: It is a simple API that predicts a person’s gender based on their name. The request generates a response with the following keys: name, gender, probability, and count. The probability denotes the certainty of the assigned gender. The count indicates the number of data rows examined to calculate the response.
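As a rough illustration of how such a response might be consumed, the sketch below builds a request URL and parses a response. The endpoint `https://api.genderize.io`, the key name `probability`, the `min_probability` threshold, and the sample payload values are assumptions made for illustration only; they are not taken from the paper, and no network call is made.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint of the public Genderize service (not stated in the paper).
GENDERIZE_URL = "https://api.genderize.io"


def build_request_url(first_name: str) -> str:
    """Return the GET URL that would query the service for one name."""
    return f"{GENDERIZE_URL}?{urlencode({'name': first_name})}"


def parse_prediction(payload: str, min_probability: float = 0.5):
    """Return the predicted gender, or None if it is missing or too uncertain.

    `min_probability` is a hypothetical cut-off chosen for this sketch.
    """
    data = json.loads(payload)
    gender = data.get("gender")
    if gender is None or data.get("probability", 0.0) < min_probability:
        return None
    return gender


# Made-up payload with the four keys described above (values are illustrative).
sample = '{"name": "karishma", "gender": "female", "probability": 0.98, "count": 1234}'
prediction = parse_prediction(sample)
```

Returning `None` for low-confidence answers keeps uncertain predictions out of any downstream accuracy computation, which is one possible design choice among several.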
1.3 Model Description
Logistic Regression: We concatenate the different parts of the name and compute character n-grams. Next we obtain TF-IDF scores from the character n-grams and pass them as features to the logistic regression model.
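The feature-extraction step can be sketched in plain Python as follows. This is a minimal sketch, not the authors' code: it assumes that "concatenating" means joining the lower-cased name parts without spaces, and it uses one common smoothed TF-IDF variant (relative term frequency times `ln((1 + n) / (1 + df)) + 1`). The function names are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(name, n_min=1, n_max=6):
    """Character n-grams of a name whose parts are joined into one string."""
    text = "".join(name.lower().split())  # assumption: drop spaces between parts
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]


def tfidf_features(names, n_min=1, n_max=6):
    """Return one {n-gram: TF-IDF weight} dict per name."""
    docs = [Counter(char_ngrams(name, n_min, n_max)) for name in names]
    # Document frequency: in how many names each n-gram occurs.
    doc_freq = Counter(gram for doc in docs for gram in doc)
    n_docs = len(docs)
    features = []
    for doc in docs:
        total = sum(doc.values())
        features.append({
            gram: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[gram])) + 1)
            for gram, count in doc.items()
        })
    return features
```

The resulting sparse dictionaries would then be vectorized and passed to a logistic regression classifier; n-grams shared by many names receive lower weights than rarer, more discriminative ones.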
SVM: A support vector machine seeks a hyperplane in N-dimensional space (N = the number of features) that separates the data points of the two classes. Several hyperplanes may split the two groups of data points; the one chosen is the hyperplane with the greatest margin, i.e., the greatest distance to the nearest data points of both classes.
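For reference, the margin-maximization idea is usually written as the following optimization problem. This is the textbook hard-margin form for linearly separable data, given here as an illustration; the symbols (weights w, bias b, labels y_i in {-1, +1}, m training points) are introduced for this sketch, and with an RBF kernel the inputs x_i are implicitly mapped to a feature space and slack terms are typically added.

```latex
\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i\,\bigl(w^{\top} x_i + b\bigr) \ge 1, \qquad i = 1,\dots,m
```

Under these constraints the margin between the two classes has width 2 / ||w||, so minimizing ||w|| maximizes the margin.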
Char CNN: Character-level CNN (char-CNN) is a well-known text classification algorithm. Each character is encoded with a fixed-length trainable embedding. A 1-D CNN is applied to the matrix created by concatenating the above vectors. In our model, we utilize 256 convolution filters in a single hidden layer of 1D convolution with a kernel size of 7.
Char LSTM: A name is a sequence of characters. Like char-CNN, each character of the input name is transformed into trainable embedding vectors and provided as input. Our model employs a single LSTM layer with 64 features and a 20% dropout layer.
Transformer Models
- We choose BERT for demographic categorization, using full names as inputs, because it has proven to be highly effective in sequence modeling of English data.
- mBERT is trained using a masked language modeling (MLM) objective on the 104 languages with the largest Wikipedias.
- IndicBERT is a multilingual ALBERT model that has only been trained on 12 major Indian languages (see Note 10). IndicBERT has far fewer parameters than other multilingual models.
- MuRIL is pre-trained on 17 Indian languages and their transliterated counterparts. It employs a different tokenizer from the BERT model. This model is an appropriate candidate for categorization based on Indian names because it is pre-trained on Indian languages.
Hyperparameters
- LR: learning rate = 0.003, n-gram range = (1–6)
- SVM: kernel = rbf, n-gram range = (1–6), degree = 3, gamma = scale
- Char CNN: learning rate = 0.001, hidden layers = 1, filters = 256, kernel size = 7, optimizer = adam
- Char LSTM: learning rate = 0.001, dropout = 0.2, hidden layers = 1, features = 64, optimizer = adam
- Transformer models: models = [bert-base-uncased, google/muril-base-cased, ai4bharat/indic-bert, bert-base-multilingual-uncased], epochs = 3, learning rate = 0.00005
1.4 Results
More detailed results are given in Tables 5 and 6.
Handling of Corner Cases: As a name can be common across both genders or castes, we use majority voting in order to assign each name a single binary label for both the gender and the caste classification tasks. In case of a tie, we assign the label arbitrarily.
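The labelling rule can be sketched as below. This is a minimal illustration under stated assumptions: `majority_labels` and `tie_label` are hypothetical names, the tie is resolved with a fixed caller-supplied label (one way of making an arbitrary choice reproducible), and the sample records are made up.

```python
from collections import Counter, defaultdict


def majority_labels(records, tie_label):
    """Map each name to its most frequent label; use `tie_label` on ties.

    `records` is an iterable of (name, label) pairs, where the same name
    may appear several times with different labels.
    """
    votes = defaultdict(Counter)
    for name, label in records:
        votes[name][label] += 1

    resolved = {}
    for name, counts in votes.items():
        ranked = counts.most_common()
        # A tie exists when the two most frequent labels have equal counts.
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            resolved[name] = tie_label
        else:
            resolved[name] = ranked[0][0]
    return resolved


# Made-up records for illustration only.
records = [
    ("kiran", "M"), ("kiran", "F"), ("kiran", "M"),
    ("jyoti", "F"), ("jyoti", "M"),
]
labels = majority_labels(records, tie_label="F")
```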
1.5 Error Analysis - Baseline APIs vs Our Models
Table 7 lists some of the best and worst test cases for the best performing baselines and the best performing transformer-based models. Both types of models perform best when the first name (first word) is a good representative of the gender (e.g., Karishma Chettri). Baselines usually fail in three cases: the presence of a parental name or surname (e.g., Avunuri Aruna), longer names where gender is represented by multiple words (e.g., Kollipara Kodahda Rama Murthy), and core Indian names (e.g., Laishram Priyabati, Gongkulung Kamei). The main reason for the better performance of the transformer models might be that they are trained on complete names and larger datasets, which helps them handle the complexity of Indian names. However, both types of models tend to fail in the presence of unusual and highly complicated names (e.g., Raj Blal Rawat, Pullammagari Chinna Maddileti).
1.6 Case Studies - Values of Median Percentile
Table 8 displays values that are plotted in the left plot of Fig. 1.
1.7 Case Studies - State Wise Results
To understand the state-wise distribution of caste and gender, we answer the following additional research questions (ARQs).
- ARQ1: Which states in India have the highest representation of females and backward castes in higher education compared to their population?
- ARQ2: Which states in India have been successful in achieving a significant decrease in bias toward females and backward castes over time? Which states are lacking in this aspect?
ARQ1: Which states in India have the highest representation of females and backward castes in higher education compared to their population?
The AIEEE dataset has the state information for each data point. We also collect the state-wise population record from Census 2011 (see Note 11). We compute the population-normalized fraction of women and backward caste people writing the AIEEE 2011 exam. From the plotted results in Fig. 2, we observe that the states with the highest population-normalized representation of women writing the AIEEE exam are Jammu & Kashmir, Himachal Pradesh, Punjab, West Bengal, and Maharashtra. Similarly, the states with the highest population-normalized representation of backward castes writing the AIEEE exam are West Bengal, Maharashtra, Punjab, Uttarakhand, and Jammu & Kashmir. We believe that the education policies of these states could serve as suitable guidance for improving the condition of the other Indian states.
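The normalization can be sketched as follows. The exact definition is not spelled out in the text, so this sketch assumes one plausible reading: the number of candidates from a group in a state divided by that state's population. The function names, the placeholder state names, and all numbers are hypothetical.

```python
def population_normalized(candidates_by_state, population_by_state):
    """Candidates per inhabitant for each state with a known, non-zero population."""
    return {
        state: candidates_by_state[state] / population_by_state[state]
        for state in candidates_by_state
        if population_by_state.get(state)
    }


def top_states(ratios, k=5):
    """Names of the k states with the highest normalized representation."""
    return sorted(ratios, key=ratios.get, reverse=True)[:k]


# Made-up counts for illustration; not taken from AIEEE or Census data.
women_candidates = {"State A": 200, "State B": 150, "State C": 90}
population = {"State A": 1_000_000, "State B": 500_000, "State C": 0}

ratios = population_normalized(women_candidates, population)
ranking = top_states(ratios, k=2)
```

Normalizing by population prevents large states from dominating the ranking merely because they have more candidates in absolute terms.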
ARQ2: Which states in India have been successful in achieving a significant decrease in bias toward females and backward castes over time? Which states are lacking in this aspect?
One way to measure the reduction (increase) in bias is to check for an increase (decrease) in the population-normalized percentage of women and backward caste candidates over time. To this end, we obtained the rate of change of the population-normalized percentage of women and backward caste candidates taking the AIEEE exam. For each state, the rate of change is measured as the slope of the best-fit line (linear regression) of the year versus population-normalized percentage scatter plot. The year range considered was 2004 to 2011.
From Fig. 3, we observe that the most successful states in reducing gender inequality are Himachal Pradesh, Andhra Pradesh (Seemandhra and Telangana), Haryana, and Maharashtra. With respect to reducing caste inequality, we find that West Bengal, Punjab, Uttarakhand, Maharashtra, and Karnataka are the most successful.
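The per-state rate of change described above is the slope of an ordinary least-squares line. A minimal sketch is given below; `best_fit_slope` is a hypothetical helper name and the yearly values are made up for illustration.

```python
def best_fit_slope(years, values):
    """Slope of the least-squares line through the (year, value) points."""
    if len(years) != len(values) or len(set(years)) < 2:
        raise ValueError("need matching lists with at least two distinct years")
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    # slope = covariance(x, y) / variance(x)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    s_xx = sum((x - mean_x) ** 2 for x in years)
    return s_xy / s_xx


# Made-up, exactly linear series over the 2004 to 2011 range.
years = list(range(2004, 2012))
percentages = [1.0 + 0.5 * (year - 2004) for year in years]
slope = best_fit_slope(years, percentages)
```

A positive slope indicates growing representation over the period (reduced bias under the measure above), while a negative slope indicates the opposite.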
1.8 Distribution of Caste and Gender in Koo
In Table 9 we show the % breakup of the cross-sectional categories in the Koo dataset. We observe that the largest representation is from the general category males while the smallest is from the reserved category females. At the latest time point (see Table 11) we observe higher female representation than at the oldest time point (see Table 10). The % of females (both general and reserved) among the top 1% of users sorted by followers is relatively larger than among the bottom 1% (see Tables 12 and 13). The opposite holds for males (both general and reserved). We believe that a possible reason could be that women have closed coteries of followership.
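The cross-sectional breakup and the top-1% comparison can be sketched as below. This is an illustrative sketch only: the record fields (`caste`, `gender`, `followers`), the function names, and the sample users are hypothetical and do not come from the Koo dataset.

```python
from collections import Counter


def percent_breakup(users):
    """Percentage of users in each (caste, gender) cross-section."""
    counts = Counter((u["caste"], u["gender"]) for u in users)
    total = sum(counts.values())
    return {group: 100.0 * count / total for group, count in counts.items()}


def top_by_followers(users, fraction):
    """The top `fraction` of users by follower count (at least one user)."""
    k = max(1, int(len(users) * fraction))
    return sorted(users, key=lambda u: u["followers"], reverse=True)[:k]


# Made-up users for illustration.
users = [
    {"caste": "general", "gender": "male", "followers": 10},
    {"caste": "general", "gender": "male", "followers": 40},
    {"caste": "general", "gender": "female", "followers": 90},
    {"caste": "reserved", "gender": "female", "followers": 5},
]

overall = percent_breakup(users)
top_slice = percent_breakup(top_by_followers(users, 0.25))
```

Comparing `overall` with the breakup of the top and bottom slices shows whether a group is over- or under-represented among highly followed users.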
1.9 Ethical Implications
Like any other classification technology, demographic classification can be misused in the hands of malicious actors. Instead of reducing bias, the same technology can be used to enforce discrimination. Hence, we request researchers to exercise caution while using this technology, as some demography classification APIs are already publicly available. Further, to keep personally identifiable data private, we open-source the codebase used to collect the data points instead of sharing the datasets, a policy ubiquitous among social science researchers.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Medidoddi, V., Bantupalli, J., Chakraborty, S., Mukherjee, A. (2022). Decoding Demographic un-fairness from Indian Names. In: Hopfgartner, F., Jaidka, K., Mayr, P., Jose, J., Breitsohl, J. (eds) Social Informatics. SocInfo 2022. Lecture Notes in Computer Science, vol 13618. Springer, Cham. https://doi.org/10.1007/978-3-031-19097-1_33
DOI: https://doi.org/10.1007/978-3-031-19097-1_33
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19096-4
Online ISBN: 978-3-031-19097-1
eBook Packages: Computer Science, Computer Science (R0)