

PLOS Digit Health. 2024 Oct; 3(10): e0000456.
Published online 2024 Oct 29. https://doi.org/10.1371/journal.pdig.0000456
PMCID: PMC11521266
PMID: 39471154

Inferring gender from first names: Comparing the accuracy of Genderize, Gender API, and the gender R package on authors of diverse nationality

Alexander D. VanHelene, Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing,1,2,3 Ishaani Khatri, Conceptualization, Writing – original draft, Writing – review & editing,4 C. Beau Hilton, Formal analysis, Writing – review & editing,5 Sanjay Mishra, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Visualization, Writing – review & editing,1,2,4 Ece D. Gamsiz Uzun, Visualization, Writing – review & editing,2,4,6,7 and Jeremy L. Warner, Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Resources, Validation, Visualization, Writing – original draft, Writing – review & editing (corresponding author),1,2,4,8,9,*
Miguel Ángel Armengol de la Hoz, Editor


Abstract

Meta-researchers commonly leverage tools that infer gender from first names, especially when studying gender disparities. However, tools vary in their accuracy, ease of use, and cost. The objective of this study was to compare the accuracy and cost of the commercial software Genderize and Gender API, and the open-source gender R package. Differences in binary gender prediction accuracy between the three services were evaluated. Gender prediction accuracy was tested on a multinational dataset of 32,968 gender-labeled clinical trial authors. Additionally, two datasets from previous studies with 5,779 and 6,131 names, respectively, were re-evaluated with modern implementations of Genderize and Gender API. The gender inference accuracy of Genderize and Gender API were compared, both with and without supplying trialists' country of origin in the API call. The accuracy of the gender R package was only evaluated without supplying countries of origin. The accuracy of Genderize, Gender API, and the gender R package was defined as the percentage of correct gender predictions. Accuracy differences between methods were evaluated using McNemar's test. Genderize and Gender API demonstrated 96.6% and 96.1% accuracy, respectively, when countries of origin were not supplied in the API calls. Genderize and Gender API achieved the highest accuracy when predicting the gender of German authors, with accuracies greater than 98%. Genderize and Gender API were least accurate with South Korean, Chinese, Singaporean, and Taiwanese authors, demonstrating below 82% accuracy. Genderize can provide similar accuracy to Gender API while being 4.85x less expensive. The gender R package achieved below 86% accuracy on the full dataset. In the replication studies, Genderize and Gender API demonstrated better performance than in the original publications. Our results indicate that Genderize and Gender API achieve similar accuracy on a multinational dataset. The gender R package is uniformly less accurate than Genderize and Gender API.

Author summary

Gender disparities in academia have prompted researchers to investigate gender gaps in professorship roles and publication authorship. Of particular concern are the gender gaps in cancer clinical trial authorship. Methodologies that evaluate gender disparities in academia, such as determining the gender ratios of academic departments or of publishing authors in a discipline, often rely on tools that infer gender from first names. However, researchers must choose between gender prediction tools that vary in their accuracy, ease of use, and cost. We evaluated the binary gender prediction accuracy of Genderize, Gender API, and the gender R package on a gold-standard dataset of 32,968 clinical trialists from around the world. Genderize and Gender API are commercially available, while the gender R package is free and open source. We found that Genderize and Gender API were more accurate than the gender R package. In addition, Genderize is cheaper than Gender API, but is more sensitive to inconsistencies in name formatting and to the presence of diacritical marks. Both Genderize and Gender API were most accurate with non-Asian names.

Introduction

One of the most well-documented disparities in STEM is gender disparity [1,2]. This issue is especially notable in the cancer clinical trial domain, where underrepresentation of women in the leadership of pivotal trials has been documented as recently as the last decade [3]. Gender disparities in healthcare accessibility have also been documented, and gender disparity affects public health, policy-making, and diversity metrics [4]. The study of gender disparity in scientific authorship and other contexts often requires the determination of gender from very limited data, e.g., author forenames. Software [5–10] that infers gender from forenames can enable researchers to automate gender prediction in large datasets. Commercial gender prediction services [11,12] such as Genderize and Gender API programmatically predict gender from first names. The gender R package [13] is an open-source alternative to these proprietary tools.

Gender prediction software has demonstrated high accuracy when evaluating non-Asian first names but often falters on names from Asian cultures [14]. Further, the presence of diacritical marks and hyphens reportedly affects the accuracy of gender prediction in some tools [15]. Few studies [16] to date have evaluated how the accuracy of gender prediction software differs between non-Asian and Asian names. To our knowledge, no studies have evaluated how different ways of delimiting two-part first names (e.g., Jean-Pierre vs. Jean Pierre vs. Jeanpierre) affect gender prediction accuracy.

We compared the gender prediction accuracy of Genderize, Gender API, and the gender R package using a large manually curated registry of cancer clinical trialists with labeled genders and diverse nationalities. In addition, we quantified the accuracy of these tools by author nationality and compared different strategies for delimiting two-part forenames, which are common in the English language spelling of Korean, Chinese, Singaporean, and Taiwanese names.

Materials and methods

Three gender prediction tools (Genderize, Gender API, and the gender R package) were tested on a gold-standard registry of cancer clinical trialists with manually determined binary gender. Trialists' names and affiliations were sourced from the HemOnc knowledge base, a continually growing resource created to capture the standard-of-care treatments in the fields of hematology and oncology. The methodology for building the HemOnc knowledge base has been previously described [17]. Likewise, the eligibility criteria for including trialists in the HemOnc knowledge base are defined on HemOnc.org [18]. The binary gender classifications used in our study refer to socially constructed gender categories, not biological sex [19,20]. Names in HemOnc are primarily sourced from the MEDLINE records of published clinical trials and undergo extensive normalization to account for the presence of diacritics, middle initials, misspellings, multipart last names represented as middle names, and other variations. When first names are not available through MEDLINE, the original manuscripts are examined for this information. Binary gender is determined by a combination of automated mappings of typically masculine or feminine forenames (e.g., John; Rebecca), web searches of publicly available information such as biographies on academic web pages, and consensus determinations including consultation with native speakers. If gender cannot be determined after these efforts, the author is labeled as "unknown gender." A subset of journals does not provide forenames; in these cases, the gender is labeled as "could not be determined." Country affiliations sourced from MEDLINE also undergo extensive normalization.

Evaluation metrics

Gender prediction accuracy was defined as the percent of individuals whose gender was correctly predicted, as compared to the gold-standard dataset. The percent of incorrect gender predictions and the percent of names with no predicted gender were also calculated. For binary statistical tests, gender predictions were categorized as successes or failures: correct gender predictions were defined as successes, while names with incorrect or absent predictions were failures.
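As a concrete illustration of these definitions, the following minimal sketch computes the three percentages and the success/failure counts. This is Python (the study's analyses were performed in R), and the function name and the "man"/"woman"/None label encoding are hypothetical:

```python
def summarize_predictions(gold, predicted):
    """Percent correct, incorrect, and missing predictions vs. gold-standard labels.

    A missing prediction is represented here as None (an assumed encoding).
    """
    n = len(gold)
    correct = sum(1 for g, p in zip(gold, predicted) if p is not None and p == g)
    missing = sum(1 for p in predicted if p is None)
    incorrect = n - correct - missing
    return {
        "correct_pct": 100 * correct / n,
        "incorrect_pct": 100 * incorrect / n,
        "no_prediction_pct": 100 * missing / n,
        # For binary tests: success = correct; failure = incorrect or absent.
        "successes": correct,
        "failures": incorrect + missing,
    }
```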

Evaluation protocol

All trialists with a gender determination were evaluated with Genderize and Gender API on 2023-11-21 using the R package httr (version 1.4.7). Both US Social Security Administration (SSA) and US Census Integrated Public Use Microdata Series (IPUMS) name datasets were used as a reference when predicting names with the gender R package [13] (version 0.6.0).

Genderize and Gender API were used to predict names both with and without supplying a country of origin for the subset of authors with a singular country of affiliation. The gender R package was tested only without supplying country names because the SSA and IPUMS methods do not provide that functionality. The SSA and IPUMS methods source names from 1932 to 2012 and from 1789 to 1930, respectively [21]. The gender R package provides gender probabilities rather than explicit gender predictions, so we assigned the gender with the highest reported probability. For example, the name Mark was classified as a man's name because the gender R package returned a 99.7% probability that 'Mark' is a man's name when the IPUMS method is selected.
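The highest-probability assignment rule described above can be sketched as follows. This hypothetical Python helper is illustrative only (the study used the gender R package directly); how a tie at exactly 50% would be handled is not stated in the text, so this sketch simply breaks ties toward "woman":

```python
def assign_gender(prob_man):
    """Pick the more probable gender given P(name is a man's name).

    E.g., 'Mark' with prob_man = 0.997 is assigned 'man'. Tie-breaking at
    exactly 0.5 is an assumption of this sketch.
    """
    return "man" if prob_man > 0.5 else "woman"
```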

Two-part names were concatenated without any delimiter (e.g., Jean-Pierre was converted to jeanpierre). Middle names were removed, unless an author had only a first initial and a full middle name, in which case the middle name was used. Gender bias in name prediction was descriptively evaluated by calculating the percent of names that were misgendered, compared to the gold-standard labeled dataset. In an additional analysis, accuracy differences resulting from delimiting two-part first names with different characters were evaluated. Two-part first name prediction accuracy was also evaluated using only the first half of two-part names. For example, the name Jean-Pierre was tested four ways: 1) jean-pierre; 2) jean pierre; 3) jeanpierre; and 4) jean.
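The four representations tested for a two-part forename can be generated as in this sketch (a hypothetical Python helper; the paper's actual preprocessing code is not shown):

```python
def two_part_variants(name):
    """Return the four tested representations of a hyphenated two-part forename."""
    parts = name.lower().split("-")
    return {
        "hyphen": "-".join(parts),       # jean-pierre
        "space": " ".join(parts),        # jean pierre
        "concatenated": "".join(parts),  # jeanpierre
        "first_half": parts[0],          # jean
    }
```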

In addition to predicting the gender of a first name, Genderize and Gender API report an estimated probability that a gender prediction is correct. We evaluated the agreement between these API-reported probability estimates and the gold-standard labeled dataset with linear regressions and Brier scores. Names with a reported probability less than or equal to 50% were excluded from the regressions and Brier scores.
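A Brier score in this setting is the mean squared difference between the API-reported probability and the observed outcome (1 if the prediction was correct, 0 otherwise). The following is a minimal Python sketch under that assumption, mirroring the exclusion of probabilities at or below 50%; it is illustrative, not the study's R code:

```python
def brier_score(confidences, outcomes):
    """Mean squared error between reported confidence and observed correctness.

    confidences: API-reported probabilities in [0, 1].
    outcomes: 1 if the prediction was correct, 0 otherwise.
    Pairs with confidence <= 0.5 are excluded, as in the protocol above.
    """
    pairs = [(p, o) for p, o in zip(confidences, outcomes) if p > 0.5]
    return sum((p - o) ** 2 for p, o in pairs) / len(pairs)
```

Lower scores indicate better-calibrated confidence estimates.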

Precision, recall, and both gender-specific and global F1 scores were calculated for each gender inference tool. For gender-specific F1 scores, one gender was designated the positive prediction, and the other gender the negative prediction. Gender-specific F1 scores were calculated for both men and women. For global F1 scores, correctly gendered men's names and correctly gendered women's names were classified as true positives. Names that yielded no gender predictions were classified as false negatives.
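The gender-specific F1 score follows the standard definition once one gender is designated the positive class. The sketch below is a hypothetical Python helper; treating an absent prediction for a positive-class name as a false negative is one reasonable reading of the protocol above, not a detail the paper states:

```python
def f1_for_gender(gold, predicted, positive):
    """Standard F1 with `positive` as the positive class.

    An absent prediction (None) for a positive-class name counts as a
    false negative here (an assumption of this sketch).
    """
    tp = sum(1 for g, p in zip(gold, predicted) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, predicted) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, predicted) if g == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```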

Reanalysis of prior studies’ data

The gender prediction accuracies of Genderize and Gender API were also separately evaluated using publicly available datasets from two studies [16,22] that tested gender prediction in 2018 and 2021, respectively. The dataset [23] provided by Santamaria 2018 consisted of 5,779 names sourced from various other datasets. The dataset [24] sourced from Sebo 2021 consisted of 6,131 Swiss physicians. Following the original experimental designs, the names from these public datasets were not modified prior to our evaluation on 2023-11-07, nor were nationalities supplied to Genderize and Gender API.

Software

All software accuracy comparisons were computed in R version 4.3.1. Differences in accuracy between methods were evaluated using the default R stats package implementation of McNemar’s test [25]. Data analysis was facilitated with tidyverse [26] (version 2.0.0), haven [27] (version 2.5.3), readxl [28] (version 1.4.3), testthat [29] (version 3.1.10), ggpmisc [30] (version 0.5.5), and patchwork [31] (version 1.1.3) R libraries.
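McNemar's test compares two paired classifiers using only the discordant pairs (names one method got right and the other got wrong). The default R stats implementation applies a continuity correction; the statistic can be sketched in Python as below (illustrative only; the study used stats::mcnemar.test in R, and the usage values in the test are hypothetical):

```python
import math

def mcnemar_chi2(b, c):
    """McNemar chi-square statistic (1 df) with continuity correction.

    b, c: counts of the two discordant cells (method A correct / B wrong,
    and vice versa). Assumes b + c > 0.
    """
    return (abs(b - c) - 1) ** 2 / (b + c)

def mcnemar_p(b, c):
    """Approximate p-value from the chi-square(1 df) survival function."""
    chi2 = mcnemar_chi2(b, c)
    # For 1 df, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))
```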

Results

Out of 40,273 unique clinical trialists present in the HemOnc KB as of 2023-11-21, 37,420 (92.9%) had a resolvable name and were thus eligible for gender determination (S1 Fig). This group was sourced from 7,473 clinical trial manuscripts published between 1947 and 2023. After excluding trialists with gender not yet determined (n = 4,360, 11.7%), those with a determined unknown gender (n = 78, 0.2%), and those with a determined gender but initial-only first names (n = 14, <0.1%), the final analysis set included 32,968 trialists with predetermined binary gender. Of the 32,968 trialists, 11,398 (34.6%) were designated as women. There were 7,849 unique names after normalizing first initial/middle name combinations to only include a middle name. The remainder of names were shared by more than one individual. Michael was the most common name, with 473 (1.4%) occurrences. Only 1,899 (24.2%) names occurred more than twice.

Of 25,240 trialists with a known site affiliation, 24,930 (98.8%) were affiliated with sites in a single country and were assigned the country of their affiliated institution when querying Genderize and Gender API with nationalities. When excluding clinical trialists without a recorded country of origin, the numbers of trialists and unique names were 24,930 and 6,756, respectively. The final analysis set included trialists from 87 countries, the most abundant being the US with 9,485 (38%) affiliated trialists. There were 7,569 first name-country combinations that occurred only once. The most common first name-country combination was David-US with 201 (0.8%) instances. Only 1,760 (7.1%) first name-country combinations appeared more than twice.

All names that were misgendered more than once are catalogued in S1 through S6 Tables. The name Andrea was most frequently misgendered when calling the Genderize and Gender API services without countries in the API call. Jan was the most misgendered name for the gender R package SSA method and for Genderize when countries were included. The most frequently misgendered names for Gender API when countries were included and for the gender R package IPUMS method were Laurence and Nicole, respectively. The 100 most common trialist name-country combinations are presented in S7 Table.

Gender prediction accuracy when country of origin was not supplied (baseline case)

The overall accuracy of Genderize when predicting gender for the full dataset without supplying country was 96.6%, with 2.3% incorrect gender predictions and 1.1% of names yielding no prediction (Table 1). Similarly, the overall accuracy of Gender API was 96.1%, with 2.7% incorrect gender predictions and 1.1% of names resulting in no prediction. The accuracy of the gender R package's predictions was lower: 79.8% and 85.7% with the IPUMS and SSA methods, respectively. Names of men were misgendered as women less than 3% of the time for all gender prediction tools (Table 1). Names of women were misgendered over 3% of the time for all services except the gender R package when using SSA data as a reference. The difference in the percent of correct gender predictions between Genderize and Gender API was significant in favor of Genderize (p<0.001). Likewise, the accuracy difference between the gender R package methods was also significant (p<0.001), in favor of the SSA method. The accuracy differences between the gender R package methods and both Genderize and Gender API were significant, in favor of Genderize and Gender API in all cases (p<0.001). Gender API demonstrated slightly higher gender prediction accuracy when two-part names were delimited with a space: the percent of correctly inferred genders rose from 96.1% to 96.3%. Precision, recall, and F1 scores for all gender inference tools are presented in S8 Table.

Table 1

Accuracy of gender predictions on 32,968 included trialists.
Method^a | Correct, n (%) | Incorrect, n (%) | No Prediction, n (%) | Men Incorrectly Gendered as Women, n (%)^b | Women Incorrectly Gendered as Men, n (%)^b
Genderize | 31,857/32,968 (96.6%) | 763/32,968 (2.3%) | 348/32,968 (1.1%) | 401/21,324 (1.9%) | 362/11,296 (3.2%)
Gender API | 31,690/32,968 (96.1%) | 899/32,968 (2.7%) | 379/32,968 (1.1%) | 393/21,320 (1.8%) | 506/11,269 (4.5%)
gender (IPUMS) | 26,294/32,968 (79.8%) | 1,366/32,968 (4.1%) | 5,308/32,968 (16.1%) | 508/17,941 (2.8%) | 858/9,719 (8.8%)
gender (SSA) | 28,266/32,968 (85.7%) | 590/32,968 (1.8%) | 4,112/32,968 (12.5%) | 489/18,595 (2.6%) | 101/10,261 (1.0%)

^a Two-part first names were appended together without a delimiting character.

^b Denominators are not consistent across rows because names that did not return a gender prediction for a given service were excluded.

After restricting Genderize's predictions to trialists affiliated with a single country, the percentages of correct, incorrect, and missing predictions were 96.2%, 2.6%, and 1.2%, respectively (Fig 1A). Genderize achieved the highest accuracy when evaluating first names from German authors, and the lowest accuracy when evaluating names from South Korean, Chinese, Singaporean, and Taiwanese authors. When evaluating the same 24,929 clinical trialists with Gender API, the percentages of correct, incorrect, and missing predictions were 95.8%, 3%, and 1.3%, respectively. Gender API also had high accuracy when predicting the gender of German authors, and the lowest accuracy when evaluating names from South Korean, Chinese, Singaporean, and Taiwanese authors. The difference in accuracy between Genderize and Gender API was significant (p<0.001), in favor of Genderize.

Fig 1. Accuracy of Genderize and Gender API.

Panel A shows the gender prediction accuracies when countries are not included in the API call. Panel B shows the results when countries are included in the API call. The top 3 countries with the most trialists and the top 4 East-Asian/Southeast-Asian countries are plotted. Method a is Genderize and method b is Gender API. Each bar is labeled with the fraction and percentage of correct gender predictions. Two-part first names were appended together without a delimiting character.

Gender prediction accuracy when country of origin was supplied to the API

The gender prediction accuracies when countries of origin were supplied to Genderize and Gender API are visualized in Fig 1B. Supplying the countries of origin alongside first names in the API call decreased the percentage of correct gender predictions by Genderize from 96.2% to 95.4%, while also reducing the percentage of incorrect predictions from 2.6% to 2.1%. Conversely, including countries of origin increased the percentage of correct gender predictions by Gender API from 95.8% to 96% and decreased incorrect predictions from 3% to 2.7%. Supplying countries also increased the percentage of names with no gender prediction for Genderize from 1.2% to 2.5%, while Gender API remained constant at 1.3%. The difference in accuracy between Genderize and Gender API was significant in favor of Gender API (p<0.001).

Gender prediction accuracy when using different characters to delimit two-part forenames

Of the 32,968 trialists in our dataset, 24,930 (75.6%) were affiliated with sites in a single country and were assigned the country of their affiliation when querying Genderize and Gender API with nationalities. Gender prediction accuracy when evaluating two-part names was higher when countries were not included in the API call in all contexts except when calling Genderize with the first half of a two-part name, e.g., Jean-Pierre as jean. Genderize was most accurate (76.4%) when no character was used to delimit two-part names, e.g., Jean-Pierre represented as jeanpierre (Fig 2). Genderize provided zero predictions for two-part first names delimited with a space. In contrast, Gender API achieved its highest gender prediction accuracy when delimiting two-part names with a space (83.5%). Gender prediction accuracy for two-part names was worse than for one-part names when countries were not included in the API call and two-part names were joined without a delimiter: OR 0.07 (95% CI 0.06–0.08) for Genderize and OR 0.08 (95% CI 0.07–0.09) for Gender API (Fig 3).

Fig 2. Accuracy of the gender R package.

Columns a and b show the accuracy of the gender R package's SSA and IPUMS methods, respectively. The top 3 countries with the most trialists and the top 4 East-Asian/Southeast-Asian countries are plotted. Only trialists affiliated with sites in a single country are plotted.

Fig 3. Accuracy of gender predictions based on delimiter between two-part names.

Panel A is Genderize and Panel B is Gender API. Plot facets correspond to the type of delimiter separating two-part names. Stacked bars correspond to correct, incorrect, and no predictions, respectively. Bars are labeled with the count and percent of correct gender predictions.

Accuracy differences between Genderize and Gender API were evaluated for statistical significance by comparing the percent of correct gender predictions within each delimiter category. The difference in gender prediction accuracy between Genderize and Gender API when evaluating two-part names without a delimiting character and including countries in the API call was not significant. All other comparisons between Genderize and Gender API were significant in favor of Gender API (p<0.001).

Gender prediction accuracy by API-reported confidence thresholds

There was high agreement overall between the gender prediction services and the gold-standard labeled dataset (Fig 4). Genderize reported over 50% confidence in gender predictions for 32,573 (98.8%) trialists. Similarly, Gender API reported over 50% confidence for 32,587 (98.8%) trialists. Gender API demonstrated a correlation of 0.91 between its reported confidence and actual accuracy, compared to Genderize's correlation of 0.82. The Brier scores for Gender API and Genderize were 0.0077 and 0.0048, respectively.

Fig 4. Experimental name prediction accuracy at different API probability cutoffs.

Names with gender predictions were aggregated into the following API-reported probability bins: 50%-55%, 55%-60%, 60%-65%, 65%-70%, 70%-75%, 75%-80%, 80%-85%, 85%-90%, 90%-95%, and 95%-100%. The API-reported probabilities within each bin were averaged and plotted on the x-axis. The experimentally determined gender prediction accuracies for the names in each bin are visualized on the y-axis.
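The binning procedure described above can be sketched as follows. This hypothetical Python helper is illustrative only: bin-edge handling at exact boundaries is an assumption, and predictions at or below 50% confidence are dropped as in the study's protocol:

```python
def bin_accuracy(confidences, outcomes):
    """Group predictions into 5-point bins keyed by the bin's lower edge (in percent).

    Returns {lower_edge: (mean reported confidence, observed accuracy)}.
    outcomes: 1 if the prediction was correct, 0 otherwise.
    """
    bins = {}
    for p, o in zip(confidences, outcomes):
        if p <= 0.5:
            continue  # predictions at or below 50% confidence are excluded
        lo = min(int(p * 100) // 5 * 5, 95)  # clamp 100% into the 95-100 bin
        bins.setdefault(lo, []).append((p, o))
    return {
        lo: (sum(p for p, _ in v) / len(v), sum(o for _, o in v) / len(v))
        for lo, v in bins.items()
    }
```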

Replication of analyses by Santamaria 2018 and Sebo 2021

The original dataset used by Santamaria consisted of 5,779 first names with known genders, 34% of which were women's names. Only 0.4% of the 5,779 names had diacritical marks, and 1.1% and 2% of names contained spaces or hyphens, respectively. The original paper reported 80% accuracy using Genderize and 87% using Gender API. In our re-analysis, Genderize predicted the correct gender 92.5% of the time, and Gender API achieved 92.8% accuracy. The difference in accuracy between Genderize and Gender API was not statistically significant.

The dataset originally analyzed by Sebo 2021 included the names of 6,131 Swiss physicians, of whom 50.3% were women. Diacritical marks were present in 6.6% of names; spaces appeared in 10.2% of names, and hyphens in 6.6%. The original paper reported 81% accuracy using Genderize and 97% with Gender API. In our re-analysis, the accuracies of Genderize and Gender API on these 6,131 names were 86.2% and 98%, respectively. McNemar's test indicated that the difference in accuracy was statistically significant (p<0.001), in favor of Gender API. Gender API was 99.5% accurate when evaluating names with diacritical marks, while Genderize was 71.7% accurate.

Cost and accessibility

Genderize and Gender API provide a graphical user interface, while the gender R package requires programming. Genderize [11] provides 1,000 free predictions per day, whereas Gender API [12] only allows 100 free predictions per month. Gender API currently costs 4.85x more than Genderize for a monthly subscription that provides 100,000 predictions. The costs mentioned herein are accurate as of April 2024.

Discussion

Tools [23,24] that infer gender from first names are commonly used by meta-researchers to evaluate gender ratios in academia. A study [32] by Holman et al in 2018 reported persistent gender disparities in traditionally male-dominated fields. Gender inference services like Genderize and Gender API facilitate programmatic gender mapping for methodologies that evaluate gender ratios based on first names.

Genderize and Gender API both demonstrated over 95% overall accuracy on our gold-standard dataset of cancer clinical trialists. Genderize was slightly more accurate than Gender API when countries were not included in the API call. Conversely, Gender API performed slightly better than Genderize when countries were included. For both services, including countries reduced the number of incorrect gender assignments at the cost of increasing the number of names with no predicted gender (Fig 1, S9 and S10 Tables). The gender R package performed worse than Genderize or Gender API (Table 1). The high accuracy of Genderize and Gender API, coupled with relatively modest usage fees, suggests that these proprietary tools may be superior to the gender R package for many use cases.

Genderize and Gender API differed in how their accuracy was affected by the delimiter separating two-part first names. Genderize was most accurate when two-part first names were appended together without a delimiter (Fig 2). In fact, Genderize appeared to be incompatible with two-part first names delimited by a space, as the service yielded zero predictions when evaluating such names. Conversely, Gender API performed best when two-part first names were delimited with a space. The slightly higher overall gender prediction accuracy attained by Genderize compared to Gender API is partially an artifact of our decision to append two-part names without a delimiter in the baseline comparison, since Gender API performed best when two-part names were delimited with a space.

The gender R package was less accurate than both Genderize and Gender API on all fronts (Table 1, Figs 1 and 2). Between the two gender R package methods, the SSA method proved more accurate. Interestingly, the SSA method misgendered men as women more frequently than the reverse. In contrast, all other tools evaluated in this study misgendered women as men more frequently.

The present study provides an updated accuracy benchmark for tools that infer gender from first names. Furthermore, this study adds to a growing body of literature on the accuracy of gender inference tools. Santamaria and Mihaljevic compared Gender API, gender-guesser, Genderize, NameAPI, and NamSor in 2018, and concluded that Gender API achieved superior accuracy [16]. In 2021, Sebo compared Gender API, Genderize, and NamSor, ultimately reporting that Gender API demonstrated the highest accuracy, followed by NamSor [22].

A commonality between the present study and several previous studies was the lower prediction accuracy of Genderize and Gender API when evaluating names of people in Asian countries, with the exception of Japanese names [16,22]. The higher accuracy achieved by both services in our re-analysis of Santamaria's dataset indicates that both services have improved since 2015, although Genderize improved by a larger margin. Gender API outperformed Genderize when re-analyzing Sebo's dataset largely because Gender API could handle two-part names delimited by spaces and names with diacritical marks. In fact, a follow-up study [15] by the same author recommended removing diacritical marks and modifying two-part names to improve the accuracy of Genderize.

Tools that infer gender from first names will never achieve perfect accuracy because many names are gender-neutral. Additionally, the gender of certain names correlates with country of origin. For example, Andrea, Gabriele, and Daniele are more likely to correspond to men if they are of Italian origin. In fact, Andrea was the most frequently misgendered name when Genderize and Gender API were called without including countries of origin. Importantly, the higher accuracy achieved after reanalyzing Sebo and Santamaria’s datasets suggests that the accuracy of Genderize and Gender API is likely to increase over time as their respective companies continue to improve their products. Indeed, algorithms such as Genderize and Gender API must be periodically calibrated in order to maintain their accuracy in changing contexts [33,34].

This study’s results should be interpreted with certain caveats in mind. We did not filter out recurring first names because the count of names in real-world datasets like ours tends to follow a long-tail distribution [35]. The process for determining the “gold-standard” gender of each trialist relied on inference from available information. Affiliation data were missing for a substantial subset of authors, mostly due to the older practice of MEDLINE including only the affiliation of the first author; in addition, a substantial number of high-profile oncology journals (e.g., the Journal of Clinical Oncology and Blood) did not provide clear one-to-one author-to-affiliation mappings for a period of time, an issue affecting at least 320 (4%) of the manuscripts in the HemOnc KB. A subset of authors in the HemOnc KB (18.1%) have not had their gender determined, and this subset differs in important ways from the authors with determined genders. Most notably, the undetermined subset has many more Asian hyphenated names (43.8% vs 3.8%) and authors with a country of affiliation including South Korea, China, Singapore, and/or Taiwan (45.6% vs 3.42%). It is thus likely that our results represent a “best-case scenario” and that automated gender mapping will become increasingly difficult as cancer clinical trials are increasingly conducted in the Asia-Pacific region [36,37]. Additionally, a researcher’s nationality in our dataset does not always reflect the cultural origin of their first name, as some researchers immigrated to the country of their academic affiliation. Furthermore, the database of gendered researchers in the HemOnc KB is constrained by the diversity of the clinical trialist community.

All tools we evaluated were least accurate with names from Asian countries other than Japan. Many Asian names that have an identifiable gender in their native script become gender-neutral when transliterated to the Latin alphabet [16]. Journals should consider publishing Asian names in native characters in addition to romanized forms. Even so, a 2022 study found that many Chinese characters traditionally used in women’s names have become more gender-neutral over time [38]. Similarly, a 2023 study reported that the phonological characteristics indicating gender in the English spelling of Korean names have also become more ambiguous post-2020 [39]. The prevalence of gender-neutral names underscores the importance of researchers maintaining a public presence that includes self-reported demographic information.

In this study, the gender R package was tested with the SSA and IPUMS methods because those name datasets are more recent than that of the package’s NAPP method, which does not include names post-1910. The package’s Kantrowitz method was also not used because potentially gender-neutral names like Alex, Jamie, and Andrea are listed as having a gender of “either”, without a probability indicating which gender is more likely.
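
Corpus-based methods such as SSA and IPUMS amount to counting: the predicted gender is whichever accounts for the larger share of historical records for a name, with that share serving as the probability. A simplified Python sketch of this counting approach, using made-up tallies rather than actual SSA counts:

```python
from typing import Dict, Optional, Tuple

def predict_from_counts(
    name: str, counts: Dict[str, Tuple[int, int]]
) -> Tuple[Optional[str], float]:
    """Predict gender from (male_count, female_count) tallies for a name.

    Returns (gender, proportion), where proportion is the share of records
    matching the predicted gender; returns (None, 0.0) for unseen names.
    """
    record = counts.get(name.lower())
    if record is None:
        return None, 0.0
    male, female = record
    total = male + female
    if total == 0:
        return None, 0.0
    if male >= female:
        return "male", male / total
    return "female", female / total

# Hypothetical tallies: one strongly gendered name, one near-neutral name.
tallies = {"margaret": (120, 99_880), "jamie": (48_000, 52_000)}
print(predict_from_counts("Margaret", tallies))  # high-confidence prediction
print(predict_from_counts("Jamie", tallies))     # near coin-flip probability
```

Names like Jamie illustrate why such methods hit an accuracy ceiling: the best possible prediction for a near-neutral name is still wrong nearly half the time.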

It is important to note that Genderize, Gender API, and the gender R package assume a gender binary, whereas a recent survey [40] found that 1.6% of U.S. adults identify as transgender or nonbinary. When evaluating gender ratios using gender inference tools, it is imperative that inclusivity be considered. However, building a tool that predicts the likelihood of an individual being non-binary based solely on first names would be challenging, due in part to the scarcity of gender-labelled datasets that include non-binary individuals. We suggest that academic journals could facilitate research on gender equity among clinical trialists by publishing self-reported author gender identities. Critically, methodologies that rely on tools that infer binary gender must acknowledge the exclusion of non-binary individuals as a limitation. Accounting for the presence of non-binary individuals may not be feasible without self-reported gender datasets. Accordingly, gender researchers should attempt to develop or locate datasets labelled with non-binary identities when conducting meta-research on gender disparities. For example, a 2022 study evaluated the proportions of men, women, and non-binary corresponding authors in physics journals using self-reported gender [41]. Furthermore, researchers should consider prioritizing accuracy over cost when selecting gender inference tools in order to reduce the number of names that are misgendered.

Both Genderize and Gender API demonstrated high gender prediction accuracy with non-Asian names that were highly normalized, i.e., without middle or last names or diacritical marks. The cost per name is also several times lower with Genderize than with Gender API. However, Genderize loses accuracy relative to Gender API when name formatting becomes less consistent. The SSA and IPUMS methods of the gender R package were less accurate but are open-source alternatives. The results from this study provide a new benchmark for gender inference tools. Replicating the studies of Santamaría 2018 and Sebo 2021 demonstrated that Genderize and Gender API have improved over time; we expect their accuracy and features to continue improving.

Supporting information

S1 Fig

Flowchart depicting trialist inclusions from HemOnc Knowledgebase (KB).

The HemOnc KB is a growing resource, and 4,360 trialists had not had their genders evaluated at the time of this study. Of the 32,968 trialists included in the study, 24,930 were affiliated with sites in a single country.

(DOCX)

S1 Table

Names misgendered more than once, Gender API with country provided.

(CSV)

S2 Table

Names misgendered more than once, Gender API without country provided.

(CSV)

S3 Table

Names misgendered more than once, gender R package, IPUMS method.

(CSV)

S4 Table

Names misgendered more than once, gender R package, SSA method.

(CSV)

S5 Table

Names misgendered more than once, Genderize API with country provided.

(CSV)

S6 Table

Names misgendered more than once, Genderize API without country provided.

(CSV)

S7 Table

100 most common trialist name-country combinations.

(DOCX)

S8 Table

Precision, recall, and F1 scores.

(DOCX)

S9 Table

Gender prediction accuracy for all countries with at least 20 trialists when countries are not included in the API call.

(DOCX)

S10 Table

Gender prediction accuracy for all countries with at least 20 trialists when countries are included in the API call.

(DOCX)

Acknowledgments

We would like to acknowledge the efforts of the editorial board of HemOnc.org.

Funding Statement

This work was supported by grants from the National Cancer Institute: U24 CA265879 and U24 CA248010 (https://www.cancer.gov). ADV, SM, and JLW were supported by U24 CA265879. ADV, EDGU, and JLW were supported by U24 CA248010. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Data Availability

The data that support the findings of this study are publicly available from the Harvard DataVerse: https://dataverse.harvard.edu/dataset.xhtml?persistentId=10.7910/DVN/HZSTFZ.

References

1. Chatterjee P, Werner RM. Gender Disparity in Citations in High-Impact Journal Articles. JAMA Netw Open. 2021;4:e2114509. doi:10.1001/jamanetworkopen.2021.14509
2. Murphy M, Callander JK, Dohan D, Grandis JR. Women’s Experiences of Promotion and Tenure in Academic Medicine and Potential Implications for Gender Disparities in Career Advancement: A Qualitative Analysis. JAMA Netw Open. 2021;4:e2125843. doi:10.1001/jamanetworkopen.2021.25843
3. Dymanus KA, Butaney M, Magee DE, Hird AE, Luckenbaugh AN, Ma MW, et al. Assessment of gender representation in clinical trials leading to FDA approval for oncology therapeutics between 2014 and 2019: A systematic review-based cohort study. Cancer. 2021;127:3156–3162. doi:10.1002/cncr.33533
4. Rapp KS, Volpe VV, Hale TL, Quartararo DF. State-Level Sexism and Gender Disparities in Health Care Access and Quality in the United States. J Health Soc Behav. 2022;63:2–18. doi:10.1177/00221465211058153
5. Wais K. Gender Prediction Methods Based on First Names with genderizeR. R J. 2016;8/1:17–37.
6. Cevik M, Haque SA, Manne-Goehler J, Kuppalli K, Sax PE, Majumder MS, et al. Gender disparities in coronavirus disease 2019 clinical trial leadership. Clin Microbiol Infect. 2021;27:1007–1010. doi:10.1016/j.cmi.2020.12.025
7. Topaz CM, Sen S. Gender Representation on Journal Editorial Boards in the Mathematical Sciences. PLOS ONE. 2016;11:e0161357. doi:10.1371/journal.pone.0161357
8. Nielsen MW, Andersen JP, Schiebinger L, Schneider JW. One and a half million medical papers reveal a link between author gender and attention to gender and sex analysis. Nat Hum Behav. 2017;1:791–796. doi:10.1038/s41562-017-0235-x
9. Sebo P, Clair C. Are female authors under-represented in primary healthcare and general internal medicine journals? Br J Gen Pract. 2021;71:302.1–302. doi:10.3399/bjgp21X716249
10. Szymkowiak M. Genderizing fisheries: Assessing over thirty years of women’s participation in Alaska fisheries. Mar Policy. 2020;115:103846. doi:10.1016/j.marpol.2020.103846
11. Genderize Documentation. [cited 2 Jan 2024]. https://genderize.io/
12. Gender API—Determines the gender of a first name. [cited 2 Jan 2024]. https://gender-api.com/
13. Mullen L. gender: Predict Gender from Names Using Historical Data. 2021. https://github.com/lmullen/gender
14. Sebo P. How accurate are gender detection tools in predicting the gender for Chinese names? A study with 20,000 given names in Pinyin format. J Med Libr Assoc. 2021;110. doi:10.5195/jmla.2022.1289
15. Sebo P. Using genderize.io to infer the gender of first names: how to improve the accuracy of the inference. J Med Libr Assoc. 2021;109. doi:10.5195/jmla.2021.1252
16. Santamaría L, Mihaljević H. Comparison and benchmark of name-to-gender inference services. PeerJ Comput Sci. 2018;4:e156. doi:10.7717/peerj-cs.156
17. Warner JL, Cowan AJ, Hall AC, Yang PC. HemOnc.org: A Collaborative Online Knowledge Platform for Oncology Professionals. J Oncol Pract. 2015;11:e336–e350. doi:10.1200/JOP.2014.001511
18. Eligibility criteria | HemOnc.org—A Hematology Oncology Wiki. [cited 16 Apr 2024]. https://hemonc.org/wiki/Eligibility_criteria
19. Heidari S, Babor TF, De Castro P, Tort S, Curno M. Sex and Gender Equity in Research: rationale for the SAGER guidelines and recommended use. Res Integr Peer Rev. 2016;1:2. doi:10.1186/s41073-016-0007-6
20. CIHR Institute of Gender and Health. What a difference sex and gender make: a gender, sex and health research casebook. 2012 [cited 18 Jan 2024].
21. Mihaljević H, Tullney M, Santamaría L, Steinfeldt C. Reflections on Gender Analyses of Bibliographic Corpora. Front Big Data. 2019;2:29. doi:10.3389/fdata.2019.00029
22. Sebo P. Performance of gender detection tools: a comparative study of name-to-gender inference services. J Med Libr Assoc. 2021;109. doi:10.5195/jmla.2021.1185
23. Mihaljević H, Santamaría L. Evaluation of name-based gender inference methods. GenderGapSTEM-PublicationAnalysis; 2023. https://github.com/GenderGapSTEM-PublicationAnalysis/name_gender_inference
24. Sebo P. Performance of gender detection tools: a comparative study of name-to-gender inference services [dataset]. 2021 [cited 2 Jan 2024]. doi:10.17605/OSF.IO/KR2MX
25. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. https://www.R-project.org/
26. Wickham H, Averick M, Bryan J, Chang W, McGowan L, François R, et al. Welcome to the Tidyverse. J Open Source Softw. 2019;4:1686. doi:10.21105/joss.01686
27. Wickham H, Miller E, Smith D. haven: Import and Export “SPSS”, “Stata” and “SAS” Files. 2023. https://CRAN.R-project.org/package=haven
28. Wickham H, Bryan J. readxl: Read Excel Files. 2023. https://CRAN.R-project.org/package=readxl
29. Wickham H. testthat: Get Started with Testing. 2011. https://journal.r-project.org/archive/2011-1/RJournal_2011-1_Wickham.pdf
30. Aphalo P. ggpmisc: Miscellaneous Extensions to “ggplot2”. 2023. https://CRAN.R-project.org/package=ggpmisc
31. Pedersen T. patchwork: The Composer of Plots. 2023. https://CRAN.R-project.org/package=patchwork
32. Holman L, Stuart-Fox D, Hauser CE. The gender gap in science: How long until women are equally represented? PLOS Biol. 2018;16:e2004956. doi:10.1371/journal.pbio.2004956
33. Lu J, Liu A, Dong F, Gu F, Gama J, Zhang G. Learning under Concept Drift: A Review. IEEE Trans Knowl Data Eng. 2018. doi:10.1109/TKDE.2018.2876857
34. Widmer G, Kubat M. Learning in the presence of concept drift and hidden contexts. Mach Learn. 1996;23:69–101. doi:10.1007/BF00116900
35. Clauset A, Shalizi CR, Newman MEJ. Power-Law Distributions in Empirical Data. SIAM Rev. 2009;51:661–703. doi:10.1137/070710111
36. Akiki V, Troussard X, Metges J, Devos P. Global trends in oncology research: A mixed-methods study of publications and clinical trials from 2010 to 2019. Cancer Rep. 2023;6:e1650. doi:10.1002/cnr2.1650
37. Terada M, Nakamura K, Matsuda T, Okuma HS, Sudo K, Yusof A, et al. A new era of the Asian clinical research network: a report from the ATLAS international symposium. Jpn J Clin Oncol. 2023;53:619–628. doi:10.1093/jjco/hyad033
38. Huang Y, Wang T. MULAN in the name: Causes and consequences of gendered Chinese names. China Econ Rev. 2022;75:101851. doi:10.1016/j.chieco.2022.101851
39. Kim J, Obasi SN. Phonological Trends of Gendered Names in Korea and the U.S.A. Names. 2023;71:36–46. doi:10.5195/names.2023.2485
40. Brown A, Menasce Horowitz J, Parker K, Minkin R. The Experiences, Challenges and Hopes of Transgender and Nonbinary U.S. Adults. Pew Research Center. 7 Jun 2022 [cited 2 Jan 2024]. https://www.pewresearch.org/social-trends/2022/06/07/the-experiences-challenges-and-hopes-of-transgender-and-nonbinary-u-s-adults/
41. Son J-Y, Bell ML. Scientific authorship by gender: trends before and during a global pandemic. Humanit Soc Sci Commun. 2022;9:348. doi:10.1057/s41599-022-01365-4
2024 Oct; 3(10): e0000456.
Published online 2024 Oct 29. 10.1371/journal.pdig.0000456.r001

Decision Letter 0

26 Mar 2024

PDIG-D-24-00041

Inferring Gender from First Names: Comparing the Accuracy of Genderize, Gender API, and the gender R Package on Authors of Diverse Nationality

PLOS Digital Health

Dear Dr. Warner,

Thank you for submitting your manuscript to PLOS Digital Health. After careful consideration, we feel that it has merit but does not fully meet PLOS Digital Health's publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript within 60 days, by May 25 2024 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at digitalhealth@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pdig/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

* A rebuttal letter that responds to each point raised by the editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

* A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

* An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

We look forward to receiving your revised manuscript.

Kind regards,

Miguel Ángel Armengol de la Hoz, Ph.D.

Section Editor

PLOS Digital Health

Journal Requirements:

Additional Editor Comments (if provided):

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Does this manuscript meet PLOS Digital Health’s publication criteria? Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe methodologically and ethically rigorous research with conclusions that are appropriately drawn based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Partly

Reviewer #4: Partly

Reviewer #5: Yes

Reviewer #6: Partly

Reviewer #7: Partly

--------------------

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: N/A

Reviewer #2: Yes

Reviewer #3: Yes

Reviewer #4: N/A

Reviewer #5: Yes

Reviewer #6: Yes

Reviewer #7: Yes

--------------------

3. Have the authors made all data underlying the findings in their manuscript fully available (please refer to the Data Availability Statement at the start of the manuscript PDF file)?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception. The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

Reviewer #4: Yes

Reviewer #5: Yes

Reviewer #6: Yes

Reviewer #7: No

--------------------

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS Digital Health does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

Reviewer #4: Yes

Reviewer #5: Yes

Reviewer #6: Yes

Reviewer #7: Yes

--------------------

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Several researches uses such type of technology to seek assistance in their work. In light of this, it becomes a necessary point to determine which technology could be best suited to assist in the research work. Very useful topic and well presented article. Just kindly re-check once the line 280-283 of Pg-15 as it develops confusion regarding the observed result. Kindly clarify once.

Reviewer #2: The study of VanHelene et al. compared the accuracy and cost of commercial software Genderize and Gender API, and the open-source gender R package, in inferring gender from first names. It found that Genderize and Gender API had high accuracy rates, especially for German authors, but were less accurate for authors from South Korea, China, Singapore, and Taiwan. The study's merit lies in providing valuable insights into the performance and cost-effectiveness of gender inference tools, aiding researchers in making informed decisions when studying gender disparities.

Comments:

1. Abstract: incomplete sentence „The accuracy of the gender R package was only evaluated without supplying countries of origin since.“

2. Line 55: Why are gender gaps in the authorship (not patient inclusion) of cancer clinical trials of particular concern?

3. Lines 65-66, 80-81, 327: How did the authors define Western / Non-Western names? The present approach only evaluated names from Western / Asian countries.

4. Lines 106-110: The R package only returns aggregate percentages for each name. The authors should describe in greater detail how they calculated the numbers in Table 1 based on these aggregate percentages. Did they a) multiply probabilities with total number of persons with given name and round the result or b) assume that all persons of a given name belong to the same gender? Example: index = 53717 / first_name = ilya / probability = 0.5008 / predicted_gender = woman / man_count = 2 / woman_count = 0. In this example scenario a) would yield 1 man and 1 woman (accuracy 50%) while scenario b) would yield 0 man and 2 woman (accuracy 0%).

5. Lines 145-151: The addition of an flow chart depicting inclusion/exclusion of trialists as described would be helpful for the reader.

6. Lines 171-173: A supplementary list of the most common names that yielded incorrect predictions would be useful for later studies.

7. Lines 174-175 misses a direct statistical comparison between all methods used. Only comparisons between Genderize vs. Gender API and SSA vs. IPUMS are reported.

8. Lines 176-177: For the sake of clarity, these lines should be moved to the result section that compares the different delimiters.

9. Line 207: Figure 1 should include bar graphs and statistical comparisons for the R package options that were investigated

10. Line 274: Please provide year and month of this cost analysis as prices can be highly dynamic.

11. What year or range of years did the authors impute into the R algorithm? This could greatly impact the accuracy of the algorithm.

Reviewer #3: This paper could benefit from a more comprehensive discussion on several aspects:

- Why and when names-based gender identification needs to be used, including the limitations and potential biases inherent in this approach

- An in-depth analysis of the methodological challenges, particularly concerning names that are gender-neutral, and an estimation of the maximum accuracy achievable under these conditions.

- Identifying names that are frequently misclassified would enhance the paper's insightfulness.

- Recommendations for more effective methods, such as integrating / combining multiple approaches or adding more contextual information to enhance prediction accuracy, should be explored.

- The impact of including the country of origin on prediction accuracy needs further examination, including whether specific names significantly affect the results.

- The discussion on non-binary gender options requires clarification on how they integrate with name-based gender prediction methods, questioning the assumption that certain names can indicate non-binary identities.

- This topic should extend into a broader dialogue on methodological limitations, citing examples like Meryll, Parker, and Jamie, which are ambiguous in terms of gender.

- Suggestions for improving accuracy, particularly in regions where current methods show diminished effectiveness, would greatly contribute to the paper's value.

- Exploring how the accuracy of the model might shift over time, in response to changes in the popularity of names, would also be insightful.

Reviewer #4: I thoroughly enjoyed reading the article titled "Inferring Gender from First Names: Comparing the Accuracy of Genderize, Gender API, and the gender R Package on Authors of Diverse Nationality". It provided a comprehensive overview of an important subject.

This article presents a comparative analysis of the accuracy and cost-effectiveness of three gender inference tools commonly used in meta-research: Genderize, Gender API, and the gender R package. The study evaluates these tools' performance in predicting the gender of authors listed in clinical trial publications across various countries.

One of the strengths of this article is its clear objective, which is to compare the accuracy and cost of different gender inference tools. The use of a large multinational dataset also adds credibility to the findings, as it allows for a comprehensive evaluation of the tools' performance across diverse cultural contexts (generalization).

Although the manuscript follows the IMRaD format (Introduction, Method, Results and Discussion), which is a widely recognized structure used in scientific writing, especially in journals such as PLOS ONE, I missed an independent and robust literature review section.

The authors rely on a dataset of clinical trial authors sourced from the HemOnc knowledge base, which may not accurately represent the diversity of names and genders worldwide [I am not sure]. Can you please elaborate on this subject?

The article needs to address the limitations of binary gender classification and the exclusion of transgender and nonbinary individuals from the analysis. This reflects a broader issue of gender essentialism. Thus, the article treats gender as a binary variable without critically engaging with the social construction of gender and its fluidity. By reinforcing binary gender norms, the article perpetuates outdated and harmful stereotypes, neglecting the diverse experiences and identities of individuals across the gender spectrum.

The comparison between the accuracy of Genderize and Gender API appears to ignore critical specificities regarding names from non-Western cultures. While the study acknowledges the lower accuracy of these tools with Asian names, it does not provide meaningful information about the reasons behind this disparity or propose solutions to resolve it. Furthermore, the article's emphasis on cost-effectiveness ignores ethical considerations and the potential consequences of prioritizing accessibility over accuracy in gender prediction.

The article could benefit from a discussion of future research directions and potential improvements in gender prediction methodologies. It could include exploring alternative approaches such as probabilistic modeling, incorporating user feedback and community input, and integrating intersectional perspectives to address the limitations of existing tools. In essence, I believe that the article has the potential to significantly enhance its robustness in its present form. Any limitations that cannot be adequately addressed at this stage should be acknowledged, and avenues for further research should be suggested for future exploration.

In conclusion, although the article introduces new insights into the challenges and complexities of gender prediction in research, the interpretation of results and ethical considerations somewhat undermine the credibility and relevance of its conclusions. A more rigorous and inclusive approach to gender inference, along with critical reflection on the limitations of existing tools, is essential to promoting gender equity and inclusion in research practices.

Reviewer #5: Good study that is well written and communicated. Some comments are needed for improvement:

-The concluding statement of the abstract needs rephrasing to sum up the findings.

-In the methods: The list of definitions could be placed as a separate section

-Make sure all the criteria are defined in the method for e.g.high profile journals

-Discuss in detail the differences in predictions between the different tools used. This is shown in the results communicated but not discussed thoroughly

-In the discussion, highlight the performance with non-Western names

Reviewer #6: Report

Research objective:Comparing prediction accuracies of three prediction tools;

Genderize,Gender API and open source gender R package using multinational dataset of clinicial tria author names.

Observations

1.Research objective clearly stated

2.Research methods stated

3.machine learning algorithms used, stated with brief explanations

Issues

1. How does api's work?

2. what is the basis for comparing these apis?

3. How does the internal structures of these apis influence these comparisons?

4. Concerning the choice of these api's, what are their related differences?

4. what are the study constraints and identified bias.

Reviewer #7: This research provides a helpful tool to evaluate the gender attribution. However, it is unclear what is the significance or relevance to digital health. For example, is this helping to improve the accuracy of trials? How could this improve the recognition of gender through first names? Evaluating packages/tools may carry practical value, but it needs to be highlighted. Also, the statistical method used is very basic.

--------------------

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

Do you want your identity to be public for this peer review? If you choose “no”, your identity will remain anonymous but your review may still be made public.

For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

Reviewer #4: No

Reviewer #5: No

Reviewer #6: No

Reviewer #7: No

--------------------

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

    2024 Oct; 3(10): e0000456.
    Published online 2024 Oct 29. 10.1371/journal.pdig.0000456.r002

    Author response to Decision Letter 0

    30 May 2024

      2024 Oct; 3(10): e0000456.
      Published online 2024 Oct 29. 10.1371/journal.pdig.0000456.r003

      Decision Letter 1

      16 Jul 2024

      PDIG-D-24-00041R1

      Inferring Gender from First Names: Comparing the Accuracy of Genderize, Gender API, and the gender R Package on Authors of Diverse Nationality

      PLOS Digital Health

      Dear Dr. Warner,

      Thank you for submitting your manuscript to PLOS Digital Health. After careful consideration, we feel that it has merit but does not fully meet PLOS Digital Health's publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript within 60 days, by Sep 14 2024 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at digitalhealth@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pdig/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

      Please include the following items when submitting your revised manuscript:

      * A rebuttal letter that responds to each point raised by the editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

      * A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

      * An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

      If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

      We look forward to receiving your revised manuscript.

      Kind regards,

      Miguel Ángel Armengol de la Hoz, Ph.D.

      Section Editor

      PLOS Digital Health



      Reviewers' comments:

      Reviewer's Responses to Questions

      Comments to the Author

      1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

      Reviewer #8: (No Response)

      Reviewer #9: All comments have been addressed

      Reviewer #10: All comments have been addressed

      Reviewer #11: All comments have been addressed

      --------------------

      2. Does this manuscript meet PLOS Digital Health’s publication criteria? Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe methodologically and ethically rigorous research with conclusions that are appropriately drawn based on the data presented.

      Reviewer #8: No

      Reviewer #9: Yes

      Reviewer #10: Partly

      Reviewer #11: Yes

      --------------------

      3. Has the statistical analysis been performed appropriately and rigorously?

      Reviewer #8: No

      Reviewer #9: Yes

      Reviewer #10: No

      Reviewer #11: Yes

      --------------------

      4. Have the authors made all data underlying the findings in their manuscript fully available (please refer to the Data Availability Statement at the start of the manuscript PDF file)?

      The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception. The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

      Reviewer #8: No

      Reviewer #9: Yes

      Reviewer #10: (No Response)

      Reviewer #11: Yes

      --------------------

      5. Is the manuscript presented in an intelligible fashion and written in standard English?

      PLOS Digital Health does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

      Reviewer #8: No

      Reviewer #9: Yes

      Reviewer #10: Yes

      Reviewer #11: Yes

      --------------------

      6. Review Comments to the Author

      Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #8: COMMENTS NOT ADDRESSED FULLY

Reviewer #9: Thanks to the authors for answering my concerns. The authors have addressed my concerns. Thus, I recommend that the journal accept the current manuscript.

Reviewer #10: Revisions suggested:

      1. Accuracy alone can be misleading, especially in datasets with class imbalance. Precision, recall, and F1 score provide a more detailed understanding of the performance by considering the trade-offs between false positives and false negatives.

2. The authors have noted the presence of missing values in the data; trialists with unresolved names or undetermined genders were excluded from the analysis. The authors should provide a detailed explanation of how this exclusion might affect the results. The exclusion of a significant proportion of trialists with missing data (7.1% unresolved names, 11.9% unresolved genders) could introduce bias, especially if the missing data are not missing at random.

      3. The introduction should briefly discuss the importance of accurate gender prediction in broader contexts beyond academia, such as implications for public health, policy-making, and diversity metrics.

      4. The manuscript should provide a more in-depth analysis of why non-Western names, especially from Asian countries, are less accurately predicted. This should include a discussion of cultural and linguistic factors affecting name-gender associations.

      5. Expand the discussion on the limitations of using binary gender classifications and the exclusion of non-binary and transgender identities. This section should also discuss the potential biases introduced by these limitations and suggest ways future research can address them.
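The class-imbalance concern in item 1 above can be sketched with a small helper. This is an illustrative example only; the function and labels below are not part of the study's code or data:

```python
def classification_metrics(y_true, y_pred, positive="female"):
    """Overall accuracy, plus precision/recall/F1 for one class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1
```

On a hypothetical sample that is 90% male, a predictor that always answers "male" scores 90% accuracy yet 0% recall and F1 for the female class, which is why accuracy alone can be misleading.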

      Reviewer #11: I think this paper is well-written and transparent, with an appropriate statistical analysis and that its conclusions follow from the analysis. The figures and tables are all appropriate.

However, I think the paper has a fundamental flaw that makes it of limited usefulness, which I will try to explain here.

From examination of the Genderize and Gender API websites, as well as the source code of the gender package (and the genderdata package on which it depends), it appears that all three "methods" discussed in the paper are actually the same method: a simple look-up of the given name in the database associated with the method. Of course, there is some simple text parsing with the API-based tools that makes the input less sensitive to formatting (in particular with Gender API), but otherwise this study is comparing the accuracy of four databases, not three methods. The R package actually allows the user to access the Genderize database by specifying method = "genderize" rather than method = "ssa" or method = "ipums" inside the gender() function, and this would presumably give identical results to those reported for Genderize in this paper. The R package also contains two other historical databases, namely the North Atlantic Population Project (NAPP) database and the Kantrowitz name corpus database. Neither is even mentioned in this paper, for some reason.

That being the case, it seems the paper is actually comparing four databases: the actively commercially maintained Gender API database, the actively commercially maintained Genderize database, the historical, unmaintained SSA database, and the even older, unmaintained IPUMS database.

The paper's conclusion, namely that Genderize and Gender API are more accurate than the gender R package, is saying little other than that actively maintained and up-to-date name databases are more accurate than unmaintained historical databases that have not been updated for many years. This should surprise no one, yet that is not how the results are presented.

Because the paper treats the APIs like a "black box", the casual reader would be left with the impression that they are something more sophisticated than they actually are: simply well-maintained lookup tables. Furthermore, the R package can presumably be induced to equal their accuracy with a simple change of input parameters and an API key for the Genderize database, so framing this as the R package being less accurate is just misleading. It is the open-source data to which the package has free access that is lacking, not the software itself.

In addition, I have concerns about the longevity of any conclusions drawn from this study. APIs change, companies go bust, and R packages have a fairly high attrition rate too (maybe 40% per 10 years). This paper describes a snapshot of three currently available methods to predict gender from names that are very unlikely to all stay in their current format for more than a few years, particularly with the advent of context-rich AI analysis.

On that subject, it is entirely possible that even the current iteration of ChatGPT could give a more accurate gender prediction than any of the methods discussed here, and this will only improve with time. Ultimately, a medical/sociological scientific report on the accuracy of an inscrutable proprietary method that is subject to change and cannot be replicated does little to advance our knowledge.
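Reviewer #11's characterization of all three tools as name look-up tables can be illustrated with a toy sketch. The table entries, probabilities, and threshold below are invented for illustration and are not taken from any of the actual services:

```python
# Each "method" maps a normalized given name to the majority gender (and the
# proportion of that gender) in its backing database. Entries are invented.
NAME_DB = {
    "maria": ("female", 0.99),
    "james": ("male", 0.99),
    "jun": ("male", 0.55),  # ambiguous names carry low probabilities
}

def infer_gender(first_name, threshold=0.6):
    """Return the database's gender guess, or None if unknown or too uncertain."""
    entry = NAME_DB.get(first_name.strip().lower())
    if entry is None:
        return None
    gender, probability = entry
    return gender if probability >= threshold else None
```

Under this view, the accuracy differences the paper reports reflect the coverage and currency of each backing table rather than any algorithmic difference.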

      --------------------

      7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

      Do you want your identity to be public for this peer review? If you choose “no”, your identity will remain anonymous but your review may still be made public.

      For information about this choice, including consent withdrawal, please see our Privacy Policy.

      Reviewer #8: Yes: Mubashir Zafar

      Reviewer #9: No

      Reviewer #10: No

      Reviewer #11: No

      --------------------

      [NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, log in and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

        2024 Oct; 3(10): e0000456.
        Published online 2024 Oct 29. 10.1371/journal.pdig.0000456.r004

        Author response to Decision Letter 1

        10 Aug 2024

        Attachment

        Submitted filename:

          2024 Oct; 3(10): e0000456.
          Published online 2024 Oct 29. 10.1371/journal.pdig.0000456.r005

          Decision Letter 2

          9 Sep 2024

          Inferring Gender from First Names: Comparing the Accuracy of Genderize, Gender API, and the gender R Package on Authors of Diverse Nationality

          PDIG-D-24-00041R2

          Dear Dr. Warner,

          We are pleased to inform you that your manuscript 'Inferring Gender from First Names: Comparing the Accuracy of Genderize, Gender API, and the gender R Package on Authors of Diverse Nationality' has been provisionally accepted for publication in PLOS Digital Health.

          Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow-up email from a member of our team. 

          Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

          IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they'll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact digitalhealth@plos.org.

          Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Digital Health.

          Best regards,

          Miguel Ángel Armengol de la Hoz, Ph.D.

          Section Editor

          PLOS Digital Health

          ***********************************************************

          Reviewer Comments (if any, and for reference):

          Reviewer's Responses to Questions

          Comments to the Author

          1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

          Reviewer #8: (No Response)

          Reviewer #9: All comments have been addressed

          Reviewer #10: All comments have been addressed

          Reviewer #11: All comments have been addressed

          **********

          2. Does this manuscript meet PLOS Digital Health’s publication criteria? Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe methodologically and ethically rigorous research with conclusions that are appropriately drawn based on the data presented.

          Reviewer #8: Partly

          Reviewer #9: Yes

          Reviewer #10: Yes

          Reviewer #11: Yes

          **********

          3. Has the statistical analysis been performed appropriately and rigorously?

          Reviewer #8: No

          Reviewer #9: Yes

          Reviewer #10: Yes

          Reviewer #11: Yes

          **********

          4. Have the authors made all data underlying the findings in their manuscript fully available (please refer to the Data Availability Statement at the start of the manuscript PDF file)?

          The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception. The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

          Reviewer #8: Yes

          Reviewer #9: Yes

          Reviewer #10: Yes

          Reviewer #11: Yes

          **********

          5. Is the manuscript presented in an intelligible fashion and written in standard English?

          PLOS Digital Health does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

          Reviewer #8: Yes

          Reviewer #9: Yes

          Reviewer #10: Yes

          Reviewer #11: Yes

          **********

          6. Review Comments to the Author

          Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #8: The methods section needs to be rewritten.

It is missing some information: validity of the questionnaire, sample size calculation, and the appropriate statistical tests.

Reviewer #9: The authors have addressed all my concerns.

          Reviewer #10: All my comments and concerns have been addressed.

          Reviewer #11: Thank you for taking the time to respond to my comments and updating the paper accordingly. While I think my concerns about the framing of the results (and the longevity of their validity) are justified, and while I do think that the peer-review process is an important part of assessing the likely utility of a paper, I accept the authors' point that it will be useful for meta researchers to be able to cite the accuracy of existing tools in this field. On reflection, I would be happy to cite this paper as justification for selecting one of the tools. The authors are to be commended for being so accommodating and responsive to reviewer feedback.

          **********

          7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

          Do you want your identity to be public for this peer review? If you choose “no”, your identity will remain anonymous but your review may still be made public.

          For information about this choice, including consent withdrawal, please see our Privacy Policy.

          Reviewer #8: Yes: Mubashir Zafar

          Reviewer #9: No

          Reviewer #10: Yes: Aasim Ayaz Wani

          Reviewer #11: No

          **********


            Articles from PLOS Digital Health are provided here courtesy of PLOS

            Citations & impact 


            This article has not been cited yet.




            Funding 


            Funders who supported this work.

            NCI NIH HHS (2)

            National Cancer Institute (2)