

Human-Centric Multimodal Machine Learning: Recent Advances and Testbed on AI-based Recruitment

Alejandro Peña, Ignacio Serna, Aythami Morales, Julian Fierrez, Alfonso Ortega, Ainhoa Herrarte, Manuel Alcantara and Javier Ortega-Garcia

Universidad Autónoma de Madrid, Madrid, 28049, Spain.

Contributing authors: alejandro.penna@uam.es; ignacio.serna@uam.es; aythami.morales@uam.es; julian.fierrez@uam.es; alfonso.ortega@uam.es; ainhoa.herrarte@uam.es; manuel.alcantara@uam.es; javier.ortega@uam.es

Abstract
The presence of decision-making algorithms in society is rapidly increas-
ing nowadays, while concerns about their transparency and the possi-
bility of these algorithms becoming new sources of discrimination are
arising. There is a certain consensus about the need to develop AI
applications with a Human-Centric approach. Human-Centric Machine
Learning needs to be developed based on four main requirements: (i) util-
ity and social good; (ii) privacy and data ownership; (iii) transparency
and accountability; and (iv) fairness in AI-driven decision-making pro-
cesses. All these four Human-Centric requirements are closely related
to each other. With the aim of studying how current multimodal algo-
rithms based on heterogeneous sources of information are affected by
sensitive elements and inner biases in the data, we propose a ficti-
tious case study focused on automated recruitment: FairCVtest. We
train automatic recruitment algorithms using a set of multimodal syn-
thetic profiles including image, text, and structured data, which are
consciously scored with gender and racial biases. FairCVtest shows the
capacity of the Artificial Intelligence (AI) behind automatic recruit-
ment tools built this way (a common practice in many other application
scenarios beyond recruitment) to extract sensitive information from


unstructured data and exploit it in combination with data biases in unde-


sirable (unfair) ways. We present an overview of recent works developing
techniques capable of removing sensitive information and biases from
the decision-making process of deep learning architectures, as well as
commonly used databases for fairness research in AI. We demonstrate
how learning approaches developed to guarantee privacy in latent spaces
can lead to unbiased and fair automatic decision-making processes. Our
methodology and results show how to generate fairer AI-based tools
in general, and in particular fairer automated recruitment systems.∗

Keywords: Automated recruitment, bias, biometrics, computer vision, deep


learning, FairCV, fairness, multimodal, natural language processing.

1 Introduction
Artificial Intelligence plays a key role in people’s lives nowadays, with auto-
matic systems being deployed in a large variety of fields, such as healthcare,
education, or jurisprudence. The data science community’s breakthroughs of
the last decades along with the large amounts of data currently available have
made possible such deployment, allowing us to train deep models that achieve
a performance never seen before. The emergence of deep learning technologies
has generated a paradigm shift, with handcrafted algorithms being replaced
by data-driven approaches. However, the application of machine learning algo-
rithms built using training data collected from society can lead to adverse
effects, as these data may reflect current socio-cultural and historical biases [1].
In this scenario, automated decision-making models have the capacity to repli-
cate human biases present in the data, or even amplify them [2][3][4][5][6] if
appropriate measures are not taken.
There are relevant models based on machine learning that have been shown
to make decisions largely influenced by demographic attributes in various
fields. For example, Google’s [7] and Facebook’s [8] ad delivery systems gener-
ated undesirable discrimination with disparate performance across population
groups. In 2016, ProPublica researchers [9] analyzed several Broward County
defendants’ criminal records 2 years after being assessed with the recidivism
system COMPAS, finding that the algorithm was biased against black defen-
dants. New York’s insurance regulator probed UnitedHealth Group over its
use of an algorithm that researchers found to be racially biased: the algorithm
prioritized healthier white patients over sicker black ones [10]. Apple Credit
service granted higher credit limits to men than women even though it was pro-
grammed to be blind to that variable [11]. Face analysis technologies have also
shown a gap in performance between some demographic groups [2][12][13][14]
as a major consequence of an undiverse representation of society in the train-
ing data. Moreover, as Balakrishnan et al. pointed out [15], the problem of


∗ Paper based on the keynote by Prof. Julian Fierrez at ICPRAM 2021.

data bias goes beyond the training set, as we need a bias-free evaluation set
in order to correctly assess algorithmic fairness.
The usage of AI technologies is also growing in the labor market [16], where
automatic-decision making systems are commonly used in different stages
within the hiring pipeline [17]. However, automatic tools in this area have also
exhibited worrying biased behaviors, such as Amazon’s recruiting tool prefer-
ring male candidates over female ones [18]. Ensuring that all social groups have
equal opportunities in the labor market is crucial to overcome the disadvantages of
minority groups, which have been historically penalized [19]. Some works are
starting to address fairness in hiring [20][21][22], but the lack of transparency
(i.e. both the models and their training data are usually private for legal or
corporate reasons [20]) hinders the technical evaluation of these systems.
In response to the deployment of automatic systems, along with the con-
cerns about their fairness, governments are adopting regulations on this
matter, placing special emphasis on personal data processing and preventing
algorithmic discrimination. Among these regulations, the European Union’s
General Data Protection Regulation (GDPR)1 is especially relevant for its
impact on the use of machine learning algorithms [23]. The GDPR aims to
protect EU citizens’ rights concerning data protection and privacy by regu-
lating how to collect, store, and process personal data (e.g. Articles 17 and
44), and requires measures to prevent discriminatory effects while processing
sensitive data (according to Article 9, sensitive data includes “personal data
revealing racial or ethnic origin, political opinions, religious or philosophical
beliefs”). Thus, research on transparency, fairness, or explainability in machine
learning is not only an ethical matter, but a legal concern and the basis for the
development of responsible and helpful AI systems that can be trusted [24].
On the other hand, one of the most active areas in Machine Learning (ML)
concerns the development of new multimodal models capable of understanding
and processing information from multiple heterogeneous sources of informa-
tion [25]. Among such sources of information we can include structured data
(e.g. tabular data), and unstructured data from images, audio, and text. The
implementation of these models in society must be accompanied by effective
measures to prevent algorithms from becoming a source of discrimination. In
this scenario, where multiple sources of both structured and unstructured data
play a key role in algorithms’ decisions, the task of detecting and preventing
biases becomes even more relevant and difficult.
In this environment of desirable fair and trustworthy AI, the main
contributions of this work are:
• We review the latest advances in Human-Centric ML research with special
focus on the publicly available databases proposed by the community.
• We present a new public experimental framework around automated recruit-
ment, aimed at studying how multimodal machine learning is influenced by
demographic biases present in the training datasets: FairCVtest.2

1. https://gdpr.eu/
2. https://github.com/BiDAlab/FairCVtest

• We evaluate the capacity of both pretrained models and data-driven


technologies to extract demographic information and learn biased target
functions from multimodal sources of information, including images, texts,
and structured data from resumes.
• We evaluate a discrimination-aware learning method based on the elimina-
tion of sensitive information such as gender or ethnicity from the learning
process of multimodal approaches, and apply it to our automatic recruitment
testbed for improving fairness among demographic groups.
Our results demonstrate the high capacity of commonly used learning
methods to expose sensitive information (e.g. gender and ethnicity) from dif-
ferent data domains, and the necessity to implement appropriate techniques
to guarantee discrimination-free decision-making processes.
A preliminary version of this article was published in [26]. This article
significantly improves [26] in the following aspects:
• We extend FairCVdb by incorporating a name and a short biography into each
profile. To the best of our knowledge, this upgrade makes FairCVdb the first
fairness research database including image, text and structured data.
• We provide more extensive experiments within FairCVtest, where we analyze
the impact of data bias on an automatic recruitment tool under different
scenarios. In these experiments, we use commonly used fairness criteria to
quantify this impact. We also measure the sensitive information exploited in
the decision-making process, whereas [26] limited the experiments to a more
qualitative analysis. Furthermore, by adding text data to our dataset, we
extend FairCVtest with Natural Language Processing techniques.
• We provide a survey on fairness research in AI, in which we review some of
the methods proposed in recent years to prevent algorithmic discrimination,
and the most commonly used databases in the field.
The rest of the paper is structured as follows: Section 2 presents an overview
of explainability in ML models, discrimination-aware ML approaches, and
Human-Centric ML databases. Section 3 describes the considered automatic
hiring pipeline, examines the information available in a resume, highlighting the
sensitive data associated with it, and describes the dataset created in this work:
FairCVdb. Section 4 presents the general framework for our work, includ-
ing problem formulation. Section 5 reports the experiments in our testbed
FairCVtest after describing the experimental methodology and the different
scenarios evaluated. Finally, Section 6 summarizes the main conclusions.

2 Human-Centric Research in Machine Learning
The recent advances in AI and the large amounts of data available have
made possible the deployment of automatic decision-making algorithms in our
society. Due to their great impact on people’s lives, especially in high-stake

settings, it is essential that these systems are responsible and trustworthy. How-
ever, many models have been shown to make decisions based
on attributes considered private (e.g. gender3 and ethnicity), or to exhibit
systematic discrimination against individuals belonging to disadvantaged
groups. We can find examples of such unfair treatment in various fields, such
as healthcare [10][29], ad delivery systems [7][8][30], hiring [16][18], and both
facial analysis [5][12][13] and NLP technologies [31][32].
In the following sections we will present recent advances in Human-Centric
ML research related to: 1) explainability and interpretability of ML models;
2) discrimination-aware ML approaches; and 3) databases for Human-Centric
ML research.

2.1 Interpretable and Explainable ML


One of the long-term goals in deep learning is to learn abstract represen-
tations, which are generally invariant to local changes in the input [33]. It
has been observed that many learned representations correspond to human-
interpretable concepts. But it is not quite clear what function they serve and
whether they have a causal role that reveals how the network models its higher-level
notions [34]. Research is showing that not all representations in the convolu-
tional layers of a DNN correspond to natural parts, raising the possibility of
a different decomposition of the world than humans might expect, calling for
further study into the exact nature of the learned representations [35][36].
There is significant work on understanding neural networks. Most methods
typically focus on what a network looks at when making a decision [37][38];
other approaches seek to train explanatory models [39] or networks [40] that
generate human-readable text.
We can distinguish between two types of approaches for generating a better
understanding of an AI model: interpretable and explainable. As defined in
[41], an interpretation is the mapping of an abstract concept (e.g. a predicted
class) into a domain that the human can make sense of, e.g. images or text;
and an explanation is the collection of features of the interpretable domain
that have contributed, for a given example, to producing a decision.
On the interpretation side we have Activation Maximization, which con-
sists of looking for an input pattern that produces a maximum response of
the model. It was introduced in [42], but such visualization technique has a
limitation: as complexity increases, it becomes more difficult to find a sim-
ple representation of a higher layer unit, because the optimization does not
converge to a single global minimum. Simonyan et al. suggested performing
the optimization with respect to the input image, obtaining
an artificial image representative of the class of interest [43].
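As a brief illustration of this idea, the sketch below (a minimal Python/PyTorch example of our own; the class index, learning rate, and L2 weight are assumptions, and it is not the implementation of [42][43]) optimizes an input image so that the score of a chosen class of a pretrained CNN is maximized:

```python
import torch
from torchvision import models

# Minimal activation maximization sketch: start from a neutral input and follow
# the gradient of a chosen class score of a pretrained network.
model = models.resnet50(weights="IMAGENET1K_V1").eval()
target_class = 243                                    # illustrative class index
x = torch.zeros(1, 3, 224, 224, requires_grad=True)   # image to be optimized
optimizer = torch.optim.Adam([x], lr=0.05)

for step in range(200):
    optimizer.zero_grad()
    class_score = model(x)[0, target_class]
    # Maximize the class score while penalizing inputs far from the origin
    # (the simple L2-norm regularizer mentioned in the text).
    loss = -class_score + 1e-4 * x.norm()
    loss.backward()
    optimizer.step()
# x now holds an artificial image representative of the target class
```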

3
We are aware of the studies that move away from the traditional view of gender as a binary
variable [27], and the difference between gender identity and biological sex. Despite the limitations
of such model [28], in this paper we use “gender” to refer to the external perception of biological
sex, in line with the work historically developed in gender classification into male and female
individuals.

One way of improving activation maximization to enable enhanced visu-


alizations of learned features is with the so-called expert. That is, in the
function to be maximized, the L2-norm regularizer (a term that penalizes
inputs that are farther away from the origin) is replaced by a more sophis-
ticated one, called expert [35][44][45][46]. Another way is via deep generative
models, incorporating such model in the activation maximization framework
[47].
On the explanation side we have Sensitivity Analysis: how much changes
in each pixel affect the prediction. Initially intended for pruning neural net-
works and reducing the dimensionality of their input vector, it was particularly
useful for understanding the sensitivity of performance with respect to their
structure, parameters, and input variables [48][49]. More recently, it has been
used for explaining the classification of images by deep neural networks.
Simonyan et al. [43] applied partial derivatives to compute saliency maps. They
show the sensitivity of each of the input image pixels, where the sensitivity of
a pixel measures to what extent small changes in its value make the image
belong more or less to the class (local explanation).
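A minimal sketch of such a gradient-based saliency map (our own PyTorch example following the idea of [43], not its exact implementation):

```python
import torch

def saliency_map(model, image, target_class):
    """Sensitivity of each input pixel: gradient of the class score with respect
    to the image, reduced over color channels (Simonyan et al. [43])."""
    image = image.clone().requires_grad_(True)       # image: tensor (3, H, W)
    class_score = model(image.unsqueeze(0))[0, target_class]
    class_score.backward()
    # One sensitivity value per pixel: maximum absolute gradient over channels
    return image.grad.abs().max(dim=0).values
```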
Alternatives for explaining deep neural network predictions are back-
ward propagation techniques. Some are: deconvolution, layer-wise relevance
propagation (LRP) and guided backprop.
Zeiler and Fergus [50] proposed deconvolution to compute a heatmap show-
ing which input pattern originally caused a certain activation in the feature
maps. The idea behind the deconvolution approach is to map the activations
from the network back to pixel space using a backpropagation rule. The quan-
tity being propagated can be filtered to retain only what passes through certain
neurons or feature maps.
The LRP method [37] applies a propagation rule that distributes back
(without gradients) the classification output f (x) decomposed into pixel rel-
evances onto the input variables. This algorithm can be used to visualize the
contribution of pixels both for and against a given class.
Guided backprop is the extension of the deconvolution approach for visual-
izing features learned by CNNs. Proposed in [51], it combines backpropagation
and deconvolution by masking out the values for which at least one of the
entries of the top gradient (deconvnet) or bottom data (backpropagation) is
negative.
Another very well known backpropagation-based method combining gradi-
ents, network weights, and activations is Grad-CAM [38]. Gradient-weighted
Class Activation Mapping (Grad-CAM) uses the gradients of the class score
with respect to the input image to produce a coarse localization map high-
lighting the important regions in the image for predicting the concept. It can
be combined with guided backpropagation for fine-grained visualizations of
class-discriminative features.
Since these methods selectively illustrate only one of the multiple patterns a filter
represents, explanatory graphs provide a workaround: [52] proposed a method
disentangling part patterns from each filter to represent the semantic hierarchy
hidden inside a CNN.

Some other methods have gone beyond visualization of CNNs and diag-
nosed CNN representations to gain a deep understanding of the features
encoded in a CNN. Others report the inconsistency of some widely deployed
saliency methods, as they are not independent of both the data on which the
model was trained and the model parameters [53].
Szegedy et al. [54] reported the existence of blind spots and counter-
intuitive properties of neural networks. They found that it is possible to change
the network’s prediction by applying an imperceptible optimized perturbation
to the input image, which they called an adversarial example, paving the
way for a series of works that sought to produce images with which to fool the
models [55][56][57].
Other studies aiming to understand deep neural networks rely on neuron
ablation techniques. These seek a complete functional understanding of the
model, trying to elucidate its inner workings or shed light on its internal rep-
resentations. Bau et al. found evidence for the emergence of disentangled,
human-interpretable units (of objects, materials and colors) during training
[34].

2.2 Discrimination-aware Learning


In order to prevent automated systems from making decisions based on pro-
tected attributes or reproducing biased behaviors against disadvantaged groups,
the research community has devised various ways to improve fairness in
AI systems. These approaches are usually divided in the literature into
pre-processing, in-processing, and post-processing techniques [24].
The pre-processing techniques aim to transform the input domain to pre-
vent discrimination and remove sensitive information. The authors of [58]
propose to remove sensitive information while improving model interpretability
by learning a data-to-data transformation in the input domain, where the new
representation achieves a certain fairness criterion. This transformation is based
on both neural style transfer and kernel Hilbert spaces. A similar approach is
proposed in [59], which seeks to generate a new dataset similar to a given one,
but fairer with respect to a certain protected attribute. For this purpose, a fair-
ness criterion is added to the loss function of an auxiliary GAN [60]. In [61] the
authors address the pre-processing transformation as an optimization problem
which trades off discrimination and utility at a probabilistic level, while control-
ling sample distortion on an individual level. More recently, Ramaswamy et
al. proposed [62] a method for augmenting real datasets with GAN-generated
synthetic images by modifying vectors in the GAN latent space to de-correlate
sensitive and target attributes.
In-processing approaches focus on the learning process as the key point
to prevent biased models, by changing the optimization objective or impos-
ing fairness constraints. In [63] the authors propose an adaptation of Domain
Adaptation Neural Networks [64] to generate agnostic feature representations,
unbiased with respect to a certain protected attribute. Also based on domain adap-
tation, in [65] the authors reduce racial biases in face recognition using mutual

information and unsupervised domain adaptation, from a labeled domain (i.e.


Caucasian individuals) to an unlabeled one (i.e. non-Caucasian individuals). A
method to mitigate bias in occupation classification without having access to
protected attributes is developed in [66], by reducing the correlation between
the classifier’s output for each individual and the word embeddings of their
names. Wang et al. studied in [13] the use of an adaptive margin in large
margin face recognition loss functions [67] to reduce the gap in performance
between different ethnicity groups. They proposed to use deep Q-learning to
adaptively find the margin for each demographic group during training.
More recently, in-processing approaches based on adversarial learning
frameworks [68] have been explored. A joint learning and unlearning method
is proposed in [69] to simultaneously learn the main classification task while
unlearning biases by applying confusion loss, based on computing the cross
entropy between the output of the best bias classifier and a uniform distribu-
tion. The authors of [70] introduced a new regularization loss based on mutual
information between feature embeddings and bias, training the networks using
adversarial and gradient reversal [64] techniques. In [71] an extension of triplet
loss [72] is applied to remove sensitive information in feature embeddings,
without losing performance in the main task.
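To make these adversarial schemes concrete, the sketch below shows a gradient reversal layer in the spirit of [64], shared between a main task head and an adversarial head that tries to predict the protected attribute. It is a minimal PyTorch example with hypothetical layer sizes, not the code of [69][70][71]:

```python
import torch
from torch import nn
from torch.autograd import Function

class GradReverse(Function):
    """Gradient Reversal Layer [64]: identity in the forward pass, negated
    (and scaled) gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class AdversarialDebiasNet(nn.Module):
    # Hypothetical sizes: a shared feature extractor feeds both the main task
    # head and an adversarial head predicting the protected attribute.
    def __init__(self, in_dim=20, lambd=1.0):
        super().__init__()
        self.lambd = lambd
        self.features = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU())
        self.task_head = nn.Linear(32, 1)    # main prediction
        self.adv_head = nn.Linear(32, 2)     # protected attribute (e.g. gender)

    def forward(self, x):
        f = self.features(x)
        y = self.task_head(f)
        a = self.adv_head(GradReverse.apply(f, self.lambd))
        return y, a
# Minimizing task loss + adversary loss with the reversed gradient pushes the
# shared features to become uninformative about the protected attribute.
```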
Finally, post-processing methods assume that the output of the model
may be biased, so they apply a transformation on it to improve fairness
between demographic groups. Some works in this line have proposed to prevent
unfairness using discrimination-aware data-mining [73][74]. In [75], the authors
propose a framework that enables a human manager to select how to make the
trade-off between fairness and utility. Then, the method selects a threshold for
each demographic group to obtain an optimal classifier according to the man-
ager’s preferences. Post-processing techniques are also common among studies
on fairness in ranking [76][77][78], which are close to our work here.

2.3 Databases
The datasets used for learning or inference may be the most critical elements
of the machine learning process where bias can appear. As these data are
collected from society, they may reflect sociocultural biases [1], or reflect an
unbalanced representation of the different demographic groups composing it.
A naive approach would be to remove all sensitive information from data,
but this is almost infeasible in a general AI setup (e.g. [31] demonstrates how
removing explicit gender indicators from personal biographies is not enough
to remove the gender bias from an occupation classifier, as other words may
serve as “proxy”). On the other hand, collecting large datasets that represent
broad social diversity in a balanced manner can be extremely costly, and not
enough to avoid disparate treatment between groups [13].
The biases introduced in the dataset used to train machine learning models
typically reflect human biases present in society, or are related to an inaccu-
rate representation of groups [89][90]. In view of this situation, the scientific

Database | #Samples | Image | Text | Cat./Num. | Demographic | Access
UCI Adult Income [79] | 48.8K | ✗ | ✗ | ✓ | Ethnicity, Gender | archive.ics.uci.edu/ml/datasets/adult
German Credit [79] | 1K | ✗ | ✗ | ✓ | Age | archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)
Bank Marketing [80] | 41.1K | ✗ | ✗ | ✓ | Age | archive.ics.uci.edu/ml/datasets/Bank+Marketing
ProPublica Recidivism [9] | 11.8K | ✗ | ✗ | ✓ | Ethnicity | github.com/propublica/compas-analysis
Common Crawl Bios [31] | 397K | ✗ | ✓ | ✗ | Gender | github.com/microsoft/biosbias
WinoBias [81] | 3.2K | ✗ | ✓ | ✗ | Gender | github.com/uclanlp/corefBias
CelebA [82] | 202.6K | ✓ | ✗ | ✓ | Gender | mmlab.ie.cuhk.edu.hk/projects/CelebA.html
IMDB-WIKI [83] | 523K | ✓ | ✗ | ✗ | Age, Gender | data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/
Cleaned IMDB [69] | 140K | ✓ | ✗ | ✗ | Age, Gender | robots.ox.ac.uk/~vgg/data/laofiw/
MORPH [84] | 55K | ✓ | ✗ | ✗ | Age, Ethnicity, Gender | ebill.uncw.edu/C20231_ustores/web/product_detail.jsp?PRODUCTID=8
PPB [12] | 1.3K | ✓ | ✗ | ✗ | Ethnicity, Gender | gendershades.org/index.html
LAOFIW [69] | 14K | ✓ | ✗ | ✗ | Ethnicity, Gender | robots.ox.ac.uk/~vgg/data/laofiw/
FairFace [85] | 108.5K | ✓ | ✗ | ✗ | Age, Ethnicity, Gender | github.com/joojs/fairface
Diversity in Faces [86] | 1M | ✓ | ✗ | ✗ | Age, Ethnicity, Gender | research.ibm.com/artificial-intelligence/trusted-ai/
DiveFace [71] | 120K | ✓ | ✗ | ✗ | Ethnicity, Gender | github.com/BiDAlab/DiveFace
BFW [87] | 20K | ✓ | ✗ | ✗ | Ethnicity, Gender | github.com/visionjo/facerec-bias-bfw
DemogPairs [88] | 10.8K | ✓ | ✗ | ✗ | Gender, Ethnicity | download.hertasecurity.com/research/DemogPairs.zip
RFW [65] | 40K | ✓ | ✗ | ✗ | Ethnicity | whdeng.cn/RFW/testing.html
BUPT-B [13] | 1.3M | ✓ | ✗ | ✗ | Ethnicity | whdeng.cn/RFW/Trainingdataste.html
BUPT-G [13] | 2M | ✓ | ✗ | ✗ | Ethnicity | whdeng.cn/RFW/Trainingdataste.html
FairCVdb (Ours) | 24K | ✓ | ✓ | ✓ | Ethnicity, Gender | github.com/BiDAlab/FairCVtest

Table 1 Summary of the most common public databases for AI fairness and bias research. We specify the different modalities included in each dataset (i.e. images, texts, and categorical/numerical attributes), along with the demographic attributes typically studied with each one.

community has put lots of effort into collecting databases that improve the rep-
resentation of different demographic groups, which can be used to suppress the
presence of bias. In this section, we discuss some of the most commonly used
databases in AI fairness research, either because of the biases they present, or
their absence (i.e. databases more balanced in terms of certain demographic
attributes). Table 1 provides an overview of these databases, including the
number of samples, data modality and the demographic attributes studied
with each one. The Adult Income dataset [79] from the UCI repository is
frequently used in gender and ethnicity bias mitigation. The main task of
the database is to predict whether a person earns more or less than $50K
per year. The database includes 48,842 samples with 14 numerical/categor-
ical attributes each, such as education level, capital gain, or occupation, some
of which contain missing values.
The German Credit dataset [79] contains 1K entries with 20 different
categorical/numerical attributes, where each entry represents a loan applicant
assessed by a bank. The applicants are classified as good or bad credit risk, showing
an age bias against young people. Also related to age biases, the Bank Mar-
keting database [80] contains marketing campaign data of a Portuguese bank
institution. With more than 41K samples, the goal is to predict if the client
will subscribe to a term deposit, based on 20 categorical/numerical attributes
including personal data and socioeconomic contextual information.

The ProPublica Recidivism dataset [9] provides more than 11K pretrial
defendant records, assessed with the COMPAS algorithm to predict their
likelihood of recidivism. After a 2-year study, the researchers found that the
algorithm was biased against African-Americans, showing both higher false
positive and lower false negative rates than white defendants.
In the study of demographic bias in NLP technologies,4 we can cite the
Common Crawl Bios dataset [31], which contains nearly 400K short biogra-
phies collected from Common Crawl. The goal of the dataset is to predict the
occupation from these bios, out of 28 possible occupations showing high gender
imbalances. The dataset also provides a “gender blinded” version of each bio,
where explicit gender indicators have been removed (e.g. pronouns or names).
On a closely related task, the WinoBias database [81] provides 3,160 sen-
tences, where the goal is to find all the expressions related to a certain entity.
Centered on person entities referred to by their occupations, the dataset requires
linking gender pronouns to male/female stereotypical occupations.
We now focus on face datasets, which are the basis for different face
analysis tasks such as face recognition or gender classification. The CelebA
database [82] contains nearly 202.6K images from more than 10K celebri-
ties. Each image is annotated with 5 facial landmarks, along with 40
binary attributes including appearance features, demographic information, or
attractiveness, which shows a strong gender bias.
The IMDB-WIKI dataset [83] provides 460.7K images from the IMDB
profiles of 20,284 different celebrities, along with 62.3K images from
Wikipedia. Images were labeled using the information available in the profiles
(i.e. name, gender, and birth date), extracting an age label by comparing the
timestamp of the images and the birth date. The dataset presents a gender
bias in the age distributions, as we encounter younger females and older males.
Due to the image acquisition process, some labels are noisy, so the authors
of [69] released the cleaned IMDB dataset, with 60K cleaned images for age
prediction and 80K for gender classification obtained from the IMDb split.
Also related to age studies, the MORPH database [84] provides 55K
images from 13K individuals, aimed at studying the effect of age-progression on
different facial tasks. The database is longitudinal with age, having pictures of
the same user over time. The database is strongly unbalanced with respect to
gender and ethnicity, with 65% images belonging to African-American males.
Some databases aim to mitigate biases in face analysis technologies by
putting emphasis on demographic balance and diversity. Pilot Parliaments
Benchmark (PPB) [12] is a dataset of 1,270 parliamentarian images from 6
different countries in Europe and Africa. The images are balanced with respect
to gender and skin color, which are available as labels (the skin color is codi-
fied using the six-point Fitzpatrick system). The Labeled Ancestral Origin
Faces in the Wild (LAOFIW) dataset [69] provides 14K images manually
divided into 4 ancestral origin groups. The database is balanced with respect

4. There are several works that study demographic biases in word embeddings [32][91], working with representation spaces trained on large corpora of text from Wikipedia, Common Crawl or Google News, among other sources.

to ancestral origin and gender, and contains a variety of poses and illumination conditions. Also
emphasizing ethnicity balance, the FairFace database [85] contains more than
100K images equally distributed in 7 ethnicity groups (White, Black, Indian,
East Asian, Southeast Asian, Middle East, and Latino), also providing gender
and age labels. Aimed at studying facial diversity, Diversity in Faces [86] pro-
vides 1M images annotated with 10 different facial systems including gender,
age, skin color, pose, and facial contrast labels, among others.
If we look at face recognition databases, DiveFace [71] contains face images
equitably distributed among 6 demographic classes related to gender and 3
ethnic groups (Black, Asian, and Caucasian), including 24K different identities
and a total of 120K images. The DemogPairs database [88] also proposes 6
balanced demographic groups related to gender and ethnicity, each one with
100 subjects and 1.8K images. For its part, the Balanced Faces in the Wild
(BFW) database [87] presents 8 demographic groups related to gender and
4 ethnicity groups (Asian, Black, Indian and White), each one with 100 differ-
ent users and 2.5K images. Finally, Wang and Deng proposed three different
databases based on MS-Celeb-1M [92], namely Racial Faces in the Wild
(RFW) [65], BUPT-B [13] and BUPT-G [13]. While RFW is designed as
a validation dataset, aimed to measure ethnicity biases, both BUPT-B and
BUPT-G are proposed as ethnicity-aware training datasets. RFW defines 4
ethnic groups (Caucasian, Asian, Indian, and African), each one with 10K
images and 3K different subjects. On the other hand, both BUPT-B and BUPT-
G propose the same ethnic groups, the first one almost ethnicity-balanced
with 1.3M images and 28K subjects, while the latter contains 2M images
and 38K subjects, which are distributed approximating the world’s population
distribution.

3 FairCVdb: Dataset for Multimodal Bias Research

3.1 AI in Hiring Processes
The usage of predictive tools in recruitment processes is increasing nowadays.
Employers have adopted these tools in an attempt to reduce the time and
cost of hiring, or to maximize the quality of the hiring process, among other
reasons [16]. Rather than a single-point decision, the hiring pipeline is
a multi-stage process, which can be broadly divided into four stages [16]. In
the sourcing stage the employers attract potential candidates through adver-
tisements or job postings. Then, during screening the employers assess the
applicants to choose a subset to interview individually. Finally, employers make
a final decision (i.e. whether to hire or reject each applicant) in the selection
stage. All of these stages can benefit from the use of automatic algorithms5 ,
as well as suffer from algorithmic discrimination if systems are not carefully
designed. The labor market has a long history of unfair treatment of minority

5. https://www.hirevue.com/

Fig. 1 Information blocks in a resume and personal attributes that can be derived from
each one. The number of crosses represents the level of sensitive information (+++ = high,
++ = medium, + = low).

groups [19][93], which makes bias prevention a crucial step in automatic hir-
ing tools design. Although the study of fairness in algorithmic hiring has been
limited [21], some works are starting to address this topic [20][22][94].
For the purpose of studying discrimination in Artificial Intelligence at large,
and particularly in hiring processes, in this work we propose a new experimen-
tal framework inspired by a fictitious automated recruiting system: FairCVtest.
Our work can be framed within the screening stage of the hiring pipeline, where
an automatic tool determines a score from a set of applicants’ resumes. We
chose this application because it comprises personal information of a diverse
nature [95].
The resume is traditionally composed of structured data including name,
position, age, gender, experience, or education, among others (see Figure 1),
and also includes unstructured data such as a face photo or a short biography.
A face image is rich in unstructured information such as identity, gender,
ethnicity, or age [96][97]. That information can be recognized in the image,
but it requires a cognitive or automatic process trained previously for that
task. The text is also rich in unstructured information. The language and the
way we use it determine attributes related to one’s nationality,
age, or gender. Both image and text represent two of the domains that have
attracted major interest from the AI research community in recent years.
The Computer Vision and the Natural Language Processing communities have
boosted the algorithmic capabilities in image and text analysis through the
usage of massive amounts of data, large computational capabilities (GPUs),
and deep learning techniques.
The resumes used in the proposed FairCVtest framework include merits
of the candidate (e.g. experience, education level, languages, etc.), two demo-
graphic attributes (gender and ethnicity), and a face photograph (see Section
3.2 for all the details).

3.2 FairCVdb: Dataset Description


In this work we present FairCVdb, a new dataset with 24,000 synthetic resume
profiles for both fairness and multimodal research in AI. Each profile includes
2 demographic attributes (gender and ethnicity), an occupation, a face image,
a name, 7 attributes obtained from 5 information blocks that are usually found
in a standard resume, and a short biography. The profiles comprise data from
different nature including structured and unstructured data:
• Demographic attributes (structured data): Each profile has been gener-
ated according to two gender classes and three ethnicity classes. These
demographic attributes determine the face image (gender and ethnicity
related), name (gender related), and pronouns in the short biography
(gender related).
• Face image (unstructured data - image): Each profile contains a real and
unique face image assigned from the DiveFace database [71], which was
introduced in Section 2.3. DiveFace6 contains face images from 24K different
identities with their corresponding gender and ethnicity attributes.
• Short Biography (unstructured data - text ): We use the Common Crawl Bios
dataset [31] to associate a short biography, a name, and an occupation (from
a pool of 10 different occupations) to each profile.
• Candidate competencies (structured data): The 5 information blocks are: 1)
education attainment, 2) availability, 3) previous experience, 4) the existence
of a recommendation letter, and 5) language proficiency in a set of 3 different
and common languages. Each language is encoded with an individual feature
(3 features in total) that represents the level of knowledge in that language.
We will refer to these resume features as candidate competencies.
As we previously mentioned in Section 2.3, the Common Crawl Bios
dataset7 [31] contains online biographies collected from Common Crawl relat-
ing to 28 different occupations. Gender and occupation labels are available for
each biography, as well as a “blinded” version of the bio, in which explicit
gender indicators have been removed. For example, a biography labeled as
[Attorney, Female] is presented as: Andrea Jepsen is an attorney with the
School Law Center, a law firm focusing on the rights of students and families
in education and school law disputes. She has worked with people with disabil-
ities since 1997 in a variety of roles, including as an early childhood special
education service coordinator, and as a legal services provider working regu-
larly in the courts and in administrative proceedings. Ms. Jepsen’s broad legal
experience has involved representing clients in a variety of critical legal issues
related to education, housing, elder law matters, public benefits, family law dis-
putes, probate and other concerns. Note that we underlined explicit gender
indicators removed in the “blinded bio”, and that both name and occupation
can be found in the first sentence of each biography, so this sentence was not
included in the bios.
6. https://github.com/BiDAlab/DiveFace
7. https://github.com/microsoft/biosbias

We select 24K biographies, and their corresponding blinded versions, from


a subset of 10 different occupations. Each biography is associated according
to gender to one FairCV profile, providing as well an occupation label and
a name to the profiles, which we obtain by processing the first sentence of
each bio. We group the occupations in 4 professional sectors: 1) audiovisual
communication and journalism, with journalist, photographer, and filmmaker ;
2) administration and jurisdiction, with attorney and accountant ; 3) health-
care, with surgeon, nurse, and physician; and 4) education, with professor and
teacher. Each professional sector has the same number of samples (i.e. 6K bios),
and is gender-balanced. Furthermore, we define a suitability attribute (S), rep-
resenting the affinity degree of each sector with the potential job to which the
resumes apply. The association of this attribute with each sector has purely
academic purposes, without seeking to state the usefulness or importance of
each of them.
The score T^j for a profile j is generated by a linear combination of the
candidate competencies x^j = [x^j_1, ..., x^j_n] and the suitability attribute S^j as:

    T^j = \beta^j + \sum_{i=1}^{n} \alpha_i x^j_i + \alpha_s S^j \qquad (1)

where n = 7 is the number of features (competencies), \alpha_i are the weighting
factors for each competency x^j_i (fixed manually based on consultation with a
human recruitment expert), and \beta^j is a small Gaussian noise term introducing a
small degree of variability (i.e. two profiles with the same competencies do not
necessarily have to obtain the same result in all cases). These scores T^j will
serve as groundtruth in our experiments.
Note that, by not taking into account gender or ethnicity information dur-
ing the score generation in Equation (1), these scores become agnostic to this
information, and should be equally distributed among different demographic
groups. Thus, we will refer to this target function as Unbiased scores T U , from
which we define two target functions that include two types of bias: Gender
bias T G and Ethnicity bias T E . Biased scores are generated by applying a
penalty factor Tδ to certain individuals belonging to a particular demographic
group. For the Gender-biased scores T G , we apply a penalty factor on the
female group, while in the Ethnicity-biased scores T E we apply the penalty
factor to one ethnic group, and the inverse to another one (i.e. the individuals
belonging to this group are overrated in T E , showing a higher score than in
T U ). This leads to a set of scores where, with the same competencies, certain
groups have lower scores than others, simulating the case where the process
is influenced by certain cognitive biases introduced by humans, protocols, or
automatic systems.
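For illustration, a minimal sketch of this score-generation process follows (Python; the weighting factors, penalty value, and gender encoding below are placeholders, not the actual values fixed with the recruitment expert):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def unbiased_score(x, S, alpha, alpha_s, noise_std=0.01):
    """Eq. (1): T = beta + sum_i alpha_i * x_i + alpha_s * S,
    with beta a small Gaussian noise term."""
    beta = rng.normal(0.0, noise_std)
    return beta + float(np.dot(alpha, x)) + alpha_s * S

def gender_biased_score(t_u, gender, penalty=0.2):
    """T^G: apply a penalty factor to one gender group
    (here gender == 1, an assumed encoding)."""
    return t_u - penalty if gender == 1 else t_u

# Example profile with hypothetical weights for the 7 competencies
alpha = np.array([0.2, 0.1, 0.1, 0.2, 0.1, 0.1, 0.1])
x = np.array([0.8, 1.0, 0.6, 0.8, 0.6, 0.4, 0.2])
t_u = unbiased_score(x, S=0.75, alpha=alpha, alpha_s=0.1)
t_g = gender_biased_score(t_u, gender=1)
```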
Table 2 summarizes the features that make up each profile, as well as their
labels. We divided FairCVdb into two splits, with 80% of the synthetic
profiles (19,200 CVs) as training set, and the remaining 20% (4,800 CVs) as
validation set. Both sets are almost perfectly balanced among gender, ethnicity

Name | Type | Data values
Education | I | x_1 ∈ {0.2, 0.4, 0.6, 0.8, 1}
Recommendation | I | x_2 ∈ {0, 1}
Availability | I | x_3 ∈ {0.2, 0.4, 0.6, 0.8, 1}
Previous experience | I | x_4 ∈ {0, 0.2, 0.4, 0.6, 0.8, 1}
Language proficiency | I | x_i ∈ {0, 0.2, 0.4, 0.6, 0.8, 1}, i ∈ {5, 6, 7}
Face Image | I | I[m, n], m, n ∈ [0, 119]
Face embedding | I | f ∈ R^20, ||f|| = 1
Agnostic face embedding | I | f_a ∈ R^20, ||f_a|| = 1
Name | I | Text data
Biography | I | Text data
Agnostic Biography | I | Text data
Gender | I/T | G ∈ {0, 1}
Ethnicity | I/T | E ∈ {0, 1, 2}
Occupation | I/T | O: N ∈ [0, 11]
Suitability | I/T | S ∈ {0.25, 0.5, 0.75, 1}
Blind score | T | T^U: R ∈ [0, 1]
Gender biased score | T | T^G: R ∈ [0, 1]
Ethnicity biased score | T | T^E: R ∈ [0, 1]

Table 2 Overview of the different attributes available in each FairCV profile. We include the possible values of each attribute, as well as its nature as Input and/or Target.

and professional sector. Fig. 2 presents four visual examples of the resumes
generated with FairCVdb.

4 FairCVtest: Description
4.1 General Learning Framework
The multimodal model, represented by its parameter vector w^F (F for
fused model [95]), is trained using features learned by M independent models
{w^1, ..., w^M}, where each model produces n_i features x_i = [x_{i1}, ..., x_{in_i}] ∈ R^{n_i}.
Without loss of generality, Figure 3 presents the learning framework for
M = 3. The learning process is guided by a Target function T and a learn-
ing strategy that minimizes the error between the output O and the Target
function T. In our framework, x_i is data obtained from the resume, w^i
are models trained specifically for different information domains (e.g. images,
text), and T is a score within the interval [0, 1] ranking the candidates accord-
ing to their merits. A score close to 0 corresponds to the worst candidate, while
the best candidate would get 1. The learning strategy is traditionally based
on the minimization of a loss function defined to obtain the best performance.
The most popular approach for supervised learning is to train the model w^F by
minimizing a loss function L over a set of training samples S:

    \min_{w^F} \sum_{x^j \in S} L(O(x^j \mid w^F), T^j) \qquad (2)
Biases can be introduced in different stages of the learning process (see
Figure 3): in the Data used to train the models (A), the Preprocessing or

Fig. 2 Visual examples of the FairCVdb synthetic resumes, including a face image, a name,
an occupation, a short biography and the candidate competencies.

Fig. 3 Block diagram of the automatic multimodal learning process and 6 (A to E ) stages
where bias can appear.

Feature generation (B ), the Target function (C ), and the Learning strategy


(D ). As a result of the biases introduced at any of these points (A to D), we
may obtain biased Results (R). In this work we focus on the Target function
(C ) and the Learning strategy (D ). The Target function is critical as it could
introduce cognitive biases from biased processes.

4.2 FairCVtest: Multimodal Learning Architecture for Automatic CV Analysis
Figure 4 summarizes the learning architecture proposed to study the differ-
ent scenarios of FairCVtest. We designed the candidate score predictor as a
multimodal neural network with three input branches: i) face image, ii) text
biography, and iii) candidate competencies. The learning architecture includes
two specific models to process the face image and text data from the biography,
before fusing the information from all three modalities.

4.2.1 Face Analysis Model


We use the face image from each profile, and the pre-trained model ResNet-
50 [98] as feature extractor to obtain feature embeddings from the applicants’
face attributes. ResNet-50 is a popular Convolutional Neural Network com-
posed of 50 layers including residual or “shortcut” connections to improve
accuracy as the net depth increases (i.e. solving the “vanishing gradient”
problem). ResNet-50’s last convolutional layer outputs embeddings with 2048
features, so we added a fully connected layer to perform a bottleneck that com-
presses these embeddings to just 20 features (maintaining the competitive face
recognition performance), so that its size approximates that of the candidate
competencies. Note that our face model was trained exclusively for the
task of face recognition. However, although gender or ethnicity information
was not intentionally employed during the training process, this informa-
tion is part of the face attributes. Therefore, an AI system trained on these
face embeddings could detect the protected attributes without being explicitly
trained for this task.
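A minimal sketch of this face branch follows (PyTorch; in the text the backbone is fine-tuned for face recognition, whereas here we simply attach the bottleneck to a generic pre-trained ResNet-50, which is a simplifying assumption):

```python
import torch
from torch import nn
from torchvision import models

# ResNet-50 backbone used as a feature extractor (2048-d embedding), followed
# by a fully connected bottleneck that compresses the embedding to 20 features.
backbone = models.resnet50(weights="IMAGENET1K_V1")
backbone.fc = nn.Identity()                       # expose the 2048-d embedding
face_branch = nn.Sequential(backbone, nn.Linear(2048, 20))

with torch.no_grad():
    face_embedding = face_branch(torch.rand(1, 3, 120, 120))  # -> shape (1, 20)
```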

4.2.2 Text Analysis Model


The second branch is aimed at extracting a text representation from the bios,
using a bidirectional LSTM layer composed of 32 units and hyperbolic tangent
activation. This branch receives as input a sequence of word vectors. We use
the fastText8 word embeddings [99] to represent each word in the biographies
as 300-dimensional word vectors. Note that these word vectors were trained on
a different Common Crawl subset than the one used to extract the biographies
of [31].
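A sketch of how each biography can be turned into the input sequence of this branch (Python/gensim; the vector file name refers to the public fastText English release and the maximum sequence length is an assumption of this example):

```python
import numpy as np
from gensim.models import KeyedVectors

# Pre-trained 300-d fastText vectors (publicly released "crawl-300d-2M" file).
word_vectors = KeyedVectors.load_word2vec_format("crawl-300d-2M.vec")

def bio_to_sequence(bio, max_len=40):
    """Map a short biography to a padded sequence of 300-d word vectors."""
    words = [w for w in bio.lower().split() if w in word_vectors]
    seq = np.zeros((max_len, 300), dtype=np.float32)
    for i, w in enumerate(words[:max_len]):
        seq[i] = word_vectors[w]
    return seq   # sequence fed to the bidirectional LSTM branch
```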

4.2.3 Multimodal Model


The face and text features obtained from their respective models are combined
with the candidate competencies to feed the multimodal network. This network
is composed of two hidden layers, with 40 and 20 neurons respectively, and
ReLU activation, and only one neuron with sigmoid activation in the output
layer. Note that, as the target functions T in FairCVdb are real-valued scores
within the interval [0, 1], we treat this task as a regression problem. A binary
classifier can be obtained by thresholding the predicted scores (i.e. switching
from a scoring tool to a selection tool), as we will show in Section 5.1.

8. https://fasttext.cc/docs/en/english-vectors.html

Fig. 4 Multimodal learning architecture, composed of a Convolutional Neural Network (ResNet-50 [98]), a BLSTM, and a fully connected network used to fuse the features from different domains (image, text, and structured data). Note that, in the agnostic scenario, we include a sensitive information removal module [71] over the ResNet network to generate agnostic face embeddings.
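Putting the three branches together, a minimal PyTorch sketch of the fusion network could look as follows (layer sizes are taken from the text; the class name, sequence handling, and competency dimension are our own assumptions):

```python
import torch
from torch import nn

class ScorePredictor(nn.Module):
    """Sketch of the fusion network in Fig. 4: 20-d face embeddings (from the
    ResNet-50 bottleneck), a bidirectional LSTM of 32 units over 300-d fastText
    word vectors, and the candidate competencies, fused by two hidden layers
    (40 and 20 neurons, ReLU) and a sigmoid output score in [0, 1]."""
    def __init__(self, n_competencies=7):
        super().__init__()
        self.bio_lstm = nn.LSTM(input_size=300, hidden_size=32,
                                batch_first=True, bidirectional=True)
        fused_dim = 20 + 2 * 32 + n_competencies
        self.head = nn.Sequential(
            nn.Linear(fused_dim, 40), nn.ReLU(),
            nn.Linear(40, 20), nn.ReLU(),
            nn.Linear(20, 1), nn.Sigmoid())

    def forward(self, face_emb, bio_seq, competencies):
        _, (h, _) = self.bio_lstm(bio_seq)          # h: (2, batch, 32)
        bio_feat = torch.cat([h[0], h[1]], dim=-1)  # forward + backward states
        fused = torch.cat([face_emb, bio_feat, competencies], dim=-1)
        return self.head(fused)
```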

4.2.4 Privacy-enhancing Representation Learning


With the aim of generating another representation, agnostic with regard to
gender and ethnicity, we use the method proposed in [71], called SensitiveNets.
This method was proposed to improve the privacy in face biometrics, by incor-
porating an adversarial regularizer capable of removing sensitive information
from pre-trained feature embeddings without losing performance in the main
task. Thus, two different face representations are available for each profile,
one containing gender and ethnicity sensitive information, and a second one
“blind” or agnostic to these attributes. In order to remove sensitive information
from the learned space, Equation (2) is replaced by:

    \min_{w^F} \sum_{x^j \in S} L(O(x^j \mid w^F), T^j) + \Delta \qquad (3)

where \Delta is an adversarial regularizer introduced to measure the amount of
sensitive information available in the learned space represented by w^j:

    \Delta = \log\{ 1 + | 0.9 - P(\mathrm{Male} \mid x^j) | \} \qquad (4)


The probability P is the output of a classifier trained to detect the sensitive
attribute in the learned space (e.g., Gender in this example). In other words,
P is the probability of observing Male features in the learned space after the
sensitive information suppression (see [71] for details).
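A minimal sketch of this regularizer follows (our own PyTorch transcription of Eq. (4), not the original SensitiveNets code of [71]; the auxiliary probe and the names in the usage comment are assumptions):

```python
import torch

def sensitive_delta(p_male):
    """Eq. (4): Delta = log(1 + |0.9 - P(Male | x)|), where p_male is the sigmoid
    output of an auxiliary gender probe applied to the learned representation."""
    return torch.log(1.0 + torch.abs(0.9 - p_male))

# Eq. (3) in practice (names assumed): the task loss is augmented with the
# mean regularizer computed over the batch.
# total_loss = task_loss + sensitive_delta(p_male).mean()
```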

4.3 Scenarios and Protocols


In order to evaluate how and to what extent an algorithm is influenced by
biases that are present in the FairCVdb target function, we use the FairCVdb
dataset previously introduced in Section 3 to train a recruitment system under

3 different scenarios. The proposed testbed (FairCVtest) consists of FairCVdb,


the trained recruitment systems, and the related experimental protocols.
We present 3 different versions of the recruitment system, with slight
differences in the input data and the target function aimed at studying
gender/ethnicity biases in multimodal learning. The 3 scenarios included
in FairCVtest were all trained using the candidate competencies, a face
representation, and a short bio, with the following particular configurations:
• Neutral: Training with Unbiased scores T U , the original face representa-
tion extracted with ResNet-50 [98], and the biography with explicit gender
indicators.
• Biased: Training with Biased scores T (G/E) , the original face representation,
and the biography with explicit gender indicators.
• Agnostic: Training with Biased scores T (G/E) , the gender and ethnicity
agnostic representation learned with [71], and the “blind” biography.
The experiments performed in the next section evaluate the capacity
of the recruitment AI in each scenario to detect protected attributes (e.g.
gender, ethnicity) without being explicitly trained for this task.

5 Experiments and Results


In this section we train and evaluate different recruitment models, aimed at
predicting a score from the candidate resumes. Each recruitment tool follows the
configuration of one of the scenarios exposed in Section 4.3, and was trained
for 16 epochs using Adam optimizer (α = 0.001, β1 = 0.9 and β2 = 0.999),
batch size of 128, and mean absolute error as loss metric.
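This training configuration can be sketched as follows (PyTorch; random tensors stand in for FairCVdb, and ScorePredictor refers to the fusion network sketched in Section 4.2.3):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for FairCVdb: 20-d face embeddings, padded bio sequences
# of 300-d word vectors, 7 competencies, and one target score per profile.
n = 1024
data = TensorDataset(torch.rand(n, 20), torch.rand(n, 40, 300),
                     torch.rand(n, 7), torch.rand(n))
loader = DataLoader(data, batch_size=128, shuffle=True)

model = ScorePredictor()                       # fusion network from Section 4.2.3
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
criterion = torch.nn.L1Loss()                  # mean absolute error

for epoch in range(16):
    for face, bio, comp, target in loader:
        optimizer.zero_grad()
        loss = criterion(model(face, bio, comp).squeeze(-1), target)
        loss.backward()
        optimizer.step()
```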
In Figure 5 we can observe the distribution of the scores predicted from our
validation set, by gender or ethnicity, in both Neutral and Biased scenarios. As
a measure of the bias’ impact on the classifier, we compute the Kullback-Leibler
divergence KL(P ‖ Q) between demographic distributions. In the gender case,
we define P as the male score distribution and Q as the female’s one, while
in the ethnicity setup we make 1-1 comparisons (i.e. G1 vs G2, G1 vs G3 and
G2 vs G3) and report the average divergence. In the Neutral Scenario (see top
row in Figure 5) there is no difference between demographic groups, as can
be corroborated with the KL divergence tending to zero in both cases (KL
=0.019 in the gender case, KL = 0.023 in the ethnicity one). As expected,
using the unbiased scores T U as target function and a balanced training set
leads us to an unbiased classifier, even in the presence of data containing
demographic information (as we will see in Section 5.2). On the other hand, the
demographic difference is clearly visible in the Biased scenarios. This difference
is most noticeable in the gender case (see bottom-left plot in Figure 5), with
the KL divergence rising to 0.320, compared to its low value in the Neutral
setup. Turning to the Ethnicity-biased Scenario, the average KL divergence
rises to 0.178. However, the difference between Groups 1 and 3 is close to that
seen between male-female classes, with a KL divergence around 0.317. Note

Fig. 5 Hiring score distributions by gender (left) and ethnicity (right). The top row presents
hiring score distributions in the Neutral Scenario, while the bottom presents them in the
Gender- and Ethnicity-biased Scenarios.

that gender or ethnicity are not inputs of our model, but rather the system is
able to detect this sensitive information from some of the input features (i.e.
the face embedding, the biography, or the competencies). Therefore, despite
not having explicit access to demographic attributes, the classifier is able to
detect this information and find its correlation with the biases introduced in
the scores, and so it ends up reproducing them.
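For reference, the divergence between two groups' score distributions can be estimated as in the following sketch (Python/SciPy; the histogram binning is an assumption of this example, not necessarily the procedure used in our experiments):

```python
import numpy as np
from scipy.stats import entropy

def kl_between_groups(scores, groups, bins=20):
    """KL(P || Q) between the predicted-score histograms of two demographic
    groups (labels 0 and 1), used to compare distributions as in Fig. 5."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    p, _ = np.histogram(scores[groups == 0], bins=edges)
    q, _ = np.histogram(scores[groups == 1], bins=edges)
    eps = 1e-8                          # avoid empty bins
    return entropy(p + eps, q + eps)    # scipy normalizes and returns sum p*log(p/q)
```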
The third scenario provided by FairCVtest, which we call Agnostic Sce-
nario, aims to prevent the system from inheriting data biases. As we introduced in
Section 4.3, the Agnostic Scenario uses a gender blind version of the biogra-
phies, as well as a face embedding where sensitive information has been
removed using the method of [71]. Figure 6 presents the hiring score distribu-
tions in this setup. As we can see, the gender distributions are close to the ones
observed in the Neutral Scenario (see top-left plot in Figure 5), despite using
gender-biased labels during training. In the ethnicity case, we can observe a
slight difference between groups, much smoother than the one we saw in the
Biased Scenario (see bottom-left plot in Figure 5), as can be confirmed with
the KL divergence (i.e. 0.061, compared to the biased case where this value is
around 0.178). However, this gap on the scores between demographic groups
still has margin to decrease to a level similar to that of the Neutral Scenario.
The difference observed in the behavior of gender and ethnicity agnostic cases

Fig. 6 Hiring score distributions by gender (left) and ethnicity (right), in the Agnostic
Scenario.

can be explained by the fact that we removed almost all gender information
from the input (i.e. face embedding and biography), but for the ethnicity we
only took measures on the face embedding, not on the competencies. Thus,
competencies are acting as a soft proxy for the ethnicity group.
Note that our agnostic approach does not seek to make the system
capable of detecting whether a score is unfair, nor to compensate for such bias,
but rather to blind it to sensitive attributes with the aim of preventing the model
from establishing a correlation between the demographic groups and score biases.
This fact can be corroborated with the training loss, which has a higher value
in the Agnostic Scenario (0.035 for gender, 0.044 for ethnicity) than in the
Biased Scenario (0.49 for gender, 0.64 for ethnicity). By removing sensitive
information from the input, the model is not able to learn what motivates the
difference in the scores between individuals with similar competencies, as it is
blind to the demographic group, and therefore its output does not approximate
correctly the biased target function after training.

5.1 Fairness in Recruitment Tools: Learning Demographic Parity
Now that we have analyzed the effect of data biases on the score distributions, in
this section we evaluate their impact on the final decision of a screening process.
A screening tool is used to assess a set of individuals according to certain
criteria in order to select a subset of the “best” ones. The outcome of such a process
could be a list of selected candidates (e.g. applicants selected for an individual
interview) or a top-k ranking that measures the relative quality of the k best
individuals from the set. We propose an experiment to simulate a screening
process with FairCVtest, using the recruitment tools that we trained in the
previous section. For each scenario, we predict the scores from a pool including
the 4,800 resumes of our validation set, and select the top-1000 candidates (i.e.
the candidates with the highest scores) among them. By selecting the 1000
candidates with the highest scores, we establish a thresholding rule to classify
the candidates into two categories, therefore switching from a regression task to
a binary classification task.
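This thresholding rule can be sketched in a few lines; the pool size, the variable names, and the uniformly drawn scores below are illustrative assumptions, not the exact FairCVtest code.

    import numpy as np

    def select_top_k(scores, k=1000):
        """Boolean mask for the k candidates with the highest predicted scores,
        i.e. the rule that turns the regression output into a binary decision."""
        threshold = np.sort(scores)[-k]   # score of the k-th best candidate
        return scores >= threshold        # ties at the threshold are all kept

    # Illustrative usage with a pool of 4,800 synthetic scores
    rng = np.random.default_rng(0)
    scores = rng.uniform(0.0, 1.0, 4800)
    selected = select_top_k(scores, k=1000)
    print(selected.sum())  # 1000 candidates flagged as selected (barring ties)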
We will measure fairness in each scenario using the demographic parity
criterion. This criterion requires a classifier's decision to be statistically inde-
pendent of a protected attribute (i.e. gender or ethnicity in our experiments).
As we are working with balanced groups, the criterion implies that all demo-
graphic groups should have the same rate of appearance in the top. We can
measure demographic parity between two groups through the p% score as:
 
p\% = \min\left( \frac{P(\hat{y} = 1 \mid z = 0)}{P(\hat{y} = 1 \mid z = 1)}, \; \frac{P(\hat{y} = 1 \mid z = 1)}{P(\hat{y} = 1 \mid z = 0)} \right) \qquad (5)
where ŷ is a trained classifier's prediction, and z is a binary protected attribute.
The p% score quantifies how far from equality the model's decisions are.
According to the U.S. Equal Employment Opportunity Commission “4/5
rule” [100], the positive rate of a protected group should not be less than 4/5
of that of the group with the highest positive rate. Otherwise, the protected
group could be suffering disparate impact. Hence, we will use this rate as an
indicator that a model is biased.
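As a sketch of how Eqn. 5 and the 4/5 rule are applied to such a selection, assume a boolean selection mask selected and an array of group labels groups (both hypothetical, with toy selection rates chosen for illustration only):

    import numpy as np

    def p_percent(selected, groups, g0, g1):
        """p% score of Eqn. 5: ratio of selection rates between two groups,
        taking the smaller of the two directions."""
        r0 = selected[groups == g0].mean()   # P(y_hat = 1 | z = g0)
        r1 = selected[groups == g1].mean()   # P(y_hat = 1 | z = g1)
        return min(r0 / r1, r1 / r0)

    # Toy example: selection rates of roughly 25% vs 17% violate the 4/5 rule
    rng = np.random.default_rng(0)
    groups = np.array(["male"] * 2400 + ["female"] * 2400)
    selected = rng.uniform(size=4800) < np.where(groups == "male", 0.25, 0.17)
    score = p_percent(selected, groups, "male", "female")
    print(f"p% = {score:.2%}, disparate impact flag: {score < 0.8}")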
Table 3 presents the top-1000 candidates in each scenario, by gender and
ethnicity group. In the ethnicity case, we compute three p% scores per model
by making 1-1 comparisons between the three ethnic groups. As we can observe,
in the Neutral Scenario the classifier shows no demographic bias, with both
gender and ethnicity groups having a balanced representation in the ranking.
This can be corroborated with the p% score, which reaches values higher than
90% in all cases. In the Biased Scenario the Male and Ethnic Group 1 groups
are significantly favored, and the difference between groups is now clearly
visible. In the gender case, almost 70% of the individuals in the top belong
to the Male group, which reduces the p% score to nearly 40%. On the other
hand, the first ethnic group represents almost half of the top, with the third
one exhibiting only 18.8%. For both G2 and G3 the p% score points out unfair
treatment (i.e. a value under 80%) with respect to G1 (see p1% and p2% in
Table 3). Finally, in the Agnostic Scenario the demographic differences were
significantly reduced with respect to the Biased one, with Male and Female
rates showing even more balance than in the Neutral Scenario. The reduction
of the gap among ethnic rates is enough to surpass the 80% threshold of the
p% score, but still leaves room for improvement, with a difference of nearly 6%
between G1 and G3. This is not surprising, as we already observed in Figure 6
a slight difference between the score distributions of the ethnicity groups.

5.2 Privacy in Recruitment Tools: Removing Sensitive Information
We have observed in the previous sections the impact of demographic biases
on both the score distribution and the selection rates in different scenarios.
In these experiments, the difference between groups was a consequence of the
biases introduced in the target function.

             Gender                        Ethnicity
Scenario     Male      Female    p%        Group 1   Group 2   Group 3   p1%      p2%      p3%
Neutral      51.90%    48.10%    92.68%    34.20%    35.00%    30.80%    97.71%   90.06%   88.00%
Biased       72.90%    27.10%    37.17%    50.80%    30.40%    18.80%    59.84%   37.01%   61.84%
Agnostic     52.80%    47.20%    89.39%    36.70%    32.70%    30.60%    89.10%   83.38%   93.58%
Table 3  Distribution of the top-1000 candidates in each scenario of FairCVtest, by
gender and ethnicity group. We include the p% score (see Eqn. 5) as a measure of the
difference between groups. In the ethnicity case, p1% compares G1 vs. G2, p2% compares
G1 vs. G3, and p3% compares G2 vs. G3.

However, as can be seen in the Agnostic Scenario, by removing gender and ethnicity
information from the input we can prevent the model from reproducing those biases,
as it cannot see which factor determines the score penalty for some individuals.
Since the key of our Agnostic Scenario is the removal of sensitive informa-
tion, in this section we will analyze the demographic information extracted by
the hiring tool in each scenario. To this aim, we use multimodal feature embed-
dings extracted by the recruitment tool to train and evaluate the performance
of both gender and ethnicity classifiers. We obtain these embeddings as the
output of the first dense layer of our learning architecture (see Section 4.3), in
which the information from different data domains has already been fused. For
each scenario, we train 3 different classification algorithms, namely Support
Vector Machines (SVM), Random Forests (RF), and Neural Networks (NN).
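A minimal scikit-learn sketch of this probing setup follows; the arrays embeddings and labels, as well as the hyperparameters, are illustrative assumptions rather than the exact configuration used in our experiments.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.neural_network import MLPClassifier

    def probe_sensitive_info(embeddings, labels, seed=0):
        """Train simple classifiers on the recruitment tool's fused embeddings
        to estimate how much demographic information they still encode."""
        X_tr, X_te, y_tr, y_te = train_test_split(
            embeddings, labels, test_size=0.2, random_state=seed, stratify=labels)
        probes = {
            "SVM": SVC(kernel="rbf"),
            "RF": RandomForestClassifier(n_estimators=100, random_state=seed),
            "NN": MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                                random_state=seed),
        }
        return {name: clf.fit(X_tr, y_tr).score(X_te, y_te)
                for name, clf in probes.items()}

    # Toy usage: random embeddings should yield near chance-level accuracy
    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(1000, 32))
    labels = rng.integers(0, 2, size=1000)
    print(probe_sensitive_info(embeddings, labels))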
Table 4 presents the accuracies obtained by each classification algorithm in
the 3 scenarios of FairCVtest. The results show a different behavior between
scenarios and demographic traits. As expected, the setup in which most sen-
sitive information can be extracted (gender and ethnicity in this work) is the
Biased one for both attributes. The SVM classifier obtains the highest valida-
tion accuracies, with almost 90% in the gender case and 76.40% in the ethnicity
one. Note that none of these values reach state-of-the-art performance (i.e.
neither the ResNet-50 model nor the hiring tools were explicitly trained to
classify those attributes), but both of them warn of large amounts of sensi-
tive information within the embeddings. On the other hand, both the Neutral and
Agnostic scenarios show lower accuracies than the Biased configuration. How-
ever, we can see a gap in performance between them, with all the classifiers
showing higher accuracy in the Neutral Scenario. This fact demonstrates that,
despite training with the unbiased scores T^U, which have no relationship with
any demographic group membership, the embeddings extracted in the Neutral
Scenario still contain some sensitive information. By using the gender-blinded
biographies and the face embeddings from which demographic information has been
removed, we reduced the amount of latent sensitive information within the agnostic
embeddings. This reduction leads to almost random-choice accuracies in
the gender case (i.e. in a binary task, the random-choice classifier's accuracy
is 50%), but in the ethnicity case the classifiers remain far from this limit (i.e. 33%,
corresponding to 3 ethnic groups), since there is still some information related
to that sensitive attribute in the candidate competencies.

             Gender Classification           Ethnicity Classification
Scenario     SVM       RF        NN          SVM       RF        NN
Neutral      65.04%    62.25%    63.92%      54.13%    51.94%    50.29%
Biased       89.50%    88.46%    86.37%      76.40%    75.88%    74.31%
Agnostic     54.13%    51.94%    52.94%      48.85%    48.13%    49.71%
Table 4  Accuracy of different classification algorithms, trained with feature embeddings
extracted by the recruitment tool in each scenario (SVM = Support Vector Machines, RF
= Random Forests, NN = Neural Networks).

6 Conclusions
The development of Human-Centric Artificial Intelligence applications will be
critical to ensure the correct deployment of AI technologies in our society. In
this paper we have revised the recent advances in this field, with particular
attention to available databases proposed by the research community. We have
also presented FairCVtest, a new experimental framework (publicly available at https://github.com/BiDAlab/FairCVtest)
on AI-based automated recruitment to study how multimodal machine learning
is affected by biases present in the training data. Using FairCVtest, we have
studied the capacity of common deep learning algorithms to expose and exploit
sensitive information from commonly used structured and unstructured data.
The contributed experimental framework includes FairCVdb, a large set
of 24,000 synthetic profiles with information typically found in job applicants’
resumes from different data domains (e.g. face images, text data and struc-
tured data). These profiles were scored introducing gender and ethnicity biases,
which resulted in gender and ethnicity discrimination in the learned models
targeted to generate candidate scores for hiring purposes. In this scenario, the
system was able to extract demographic information from the input data, and
learn its relation with the biases introduced in the scores. This behavior is not
limited to the case studied, where the bias lies in the target function. Fea-
ture selection or unbalanced data can also become sources of biases. This last
case is common when datasets are collected from historical sources that fail to
represent the diversity of our society.
We discussed recent methods to prevent the undesired effects of algorithmic
biases, as well as the databases most widely used in bias and fairness
research in AI. We then experimented with one of these methods, known as
SensitiveNets, to improve fairness in this AI-based recruitment framework.
Our agnostic setup removes sensitive information from the text data at the input
level, and applies SensitiveNets to remove it from the face images during the
learning process. After this demographic “blinding” process, the recruitment
system did not show discriminatory treatment even in the presence of biases
in the training data, thus improving equity among different demographic groups.
The most common approach to analyze algorithmic discrimination is
through group-based bias [14]. However, recent works are now starting to inves-
tigate biased effects in AI with user-specific methods, e.g. [75][101]. We plan


to update FairCVtest with such user-specific biases in addition to the considered
group-based bias. Other future work includes extending our testbed to
other multimodal setups like smartphone-based interaction with application
to authentication [102], behavior understanding [103], and remote monitor-
ing/assessment [104]. Finally, we also foresee worthwhile research in the extension
of the presented bias-assessment [105] and bias-reduction methods [71] based
on recent advances in biometric template protection [106] and distributed
privacy preservation [107].
Acknowledgments. This work has received funding from different projects,
including BBforTAI (PID2021-127641OB-I00 MICINN/FEDER), Human-
CAIC (TED2021-131787B-I00), TRESPASS-ETN (MSCA-ITN-2019-860813),
and PRIMA (MSCA-ITN-2019-860315). The work of A. Peña is supported
by an FPU Fellowship (FPU21/00535) from the Spanish MIU. Also, I. Serna is
supported by an FPI Fellowship from the UAM.

7 Compliance with Ethical Standards


Conflict of Interest. On behalf of all authors, the corresponding author
states that there is no conflict of interest.
Ethical Approval. This article does not contain any studies with human
participants performed by any of the authors.
Funding. This work has received funding from different projects, includ-
ing BBforTAI (PID2021-127641OB-I00 MICINN/FEDER), HumanCAIC
(TED2021-131787B-I00), TRESPASS-ETN (MSCA-ITN-2019-860813), and
PRIMA (MSCA-ITN-2019-860315). The work of A. Peña is supported by
an FPU Fellowship (FPU21/00535) from the Spanish MIU. Also, I. Serna is
supported by an FPI Fellowship from the UAM.

References
[1] Barocas, S., Selbst, A.D.: Big data’s disparate impact. California Law
Review (2016)

[2] Acien, A., Morales, A., Vera-Rodriguez, R., Bartolome, I., Fierrez, J.:
Measuring the Gender and Ethnicity Bias in Deep Models for Face
Recognition. In: Proceedings of Iberoamerican Congress on Pattern
Recognition (IbPRIA), Madrid, Spain (2018)

[3] Drozdowski, P., Rathgeb, C., Dantcheva, A., Damer, N., Busch, C.:
Demographic bias in biometrics: A survey on an emerging challenge.
IEEE Transactions on Technology and Society 1, 89–103 (2020)

[4] Nagpal, S., Singh, M., Singh, R., Vatsa, M., K. Ratha, N.: Deep learning
for face recognition: Pride or prejudiced? arXiv/1904.01219 (2019)

[5] Zhao, J., Wang, T., Yatskar, M., Ordonez, V., Chang, K.: Men also
like shopping: Reducing gender bias amplification using corpus-level con-
straints. In: Proceedings of Conference on Empirical Methods in Natural
Language Processing, pp. 2979–2989 (2017)

[6] Noble, S.U.: Algorithms of Oppression: How Search Engines Reinforce Racism. NYU Press, New York (2018)

[7] Sweeney, L.: Discrimination in online ad delivery. Queue 11, 10–29 (2013)

[8] Ali, M., Sapiezynski, P., Bogen, M., Korolova, A., Mislove, A., Rieke,
A.: Discrimination through optimization: How Facebook’s ad delivery
can lead to skewed outcomes. In: Proceedings of the ACM Conference
on Human-Computer Interaction (2019)

[9] Angwin, J., Larson, J., Mattu, S., Kirchner, L.: Machine bias. ProPublica
(2016)

[10] Evans, M., Mathews, A.W.: New York regulator probes United Health
algorithm for racial bias. The Wall Street Journal (2019)

[11] Knight, W.: The Apple Card didn’t ’see’ gender — and that’s the
problem. Wired (2019)

[12] Buolamwini, J., Gebru, T.: Gender shades: Intersectional accuracy dis-
parities in commercial gender classification. In: Proceedings of the ACM
Conference on Fairness, Accountability, and Transparency, NY, USA
(2018)

[13] Wang, M., Deng, W.: Mitigating bias in face recognition using skewness-
aware reinforcement learning. In: IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), pp. 9322–9331 (2020)

[14] Serna, I., Morales, A., Fierrez, J., Cebrian, M., Obradovich, N., Rah-
wan, I.: Algorithmic discrimination: Formulation and exploration in deep
learning-based face biometrics. In: Proceedings of the AAAI Workshop
on SafeAI (2020)

[15] Balakrishnan, G., Xiong, Y., Xia, W., Perona, P.: Towards causal bench-
marking of bias in face analysis algorithms. In: European Conference on
Computer Vision (ECCV), pp. 547–563 (2020)

[16] Bogen, M., Rieke, A.: Help wanted: Examination of hiring algorithms,
equity, and bias. Technical report (2018). https://www.upturn.org

[17] Black, J.S., van Esch, P.: AI-enabled recruiting: What is it and how

should a manager use it? Business Horizons 63, 215–226 (2020)

[18] Dastin, J.: Amazon scraps secret AI recruiting tool that showed bias
against women. Reuters (2018)

[19] Bertrand, M., Mullainathan, S.: Are Emily and Greg more employ-
able than Lakisha and Jamal? A field experiment on labor market
discrimination. American economic review 94, 991–1013 (2004)

[20] Raghavan, M., Barocas, S., Kleinberg, J., Levy, K.: Mitigating bias in
algorithmic hiring: Evaluating claims and practices. In: Conference on
Fairness, Accountability, and Transparency, pp. 469–481 (2020)

[21] Schumann, C., Foster, J.S., Mattei, N., Dickerson, J.P.: We need fair-
ness and explainability in algorithmic hiring. In: Proceedings of the
19th International Conference on Autonomous Agents and MultiAgent
Systems, pp. 1716–1720 (2020)

[22] Sánchez-Monedero, J., Dencik, L., Edwards, L.: What does it mean to
’solve’ the problem of discrimination in hiring? Social, technical and legal
perspectives from the UK on automated hiring systems. In: Conference
on Fairness, Accountability, and Transparency, pp. 458–468 (2020)

[23] Goodman, B., Flaxman, S.: EU regulations on algorithmic decision-making and a ”Right to explanation”. AI Magazine 38 (2016)

[24] Cheng, L., Varshney, K.R., Liu, H.: Socially responsible ai algo-
rithms: issues, purposes, and challenges. Journal of Artificial Intelligence
Research 71, 1137–1181 (2021)

[25] Baltrušaitis, T., Ahuja, C., Morency, L.: Multimodal machine learning:
A survey and taxonomy. IEEE Transactions on Pattern Analysis and
Machine Intelligence 41, 423–443 (2019)

[26] Peña, A., Serna, I., Morales, A., Fierrez, J.: Bias in multimodal AI:
Testbed for fair automatic recruitment. In: IEEE CVPR Workshop on
Fair, Data Efficient and Trusted Computer Vision (2020)

[27] Richards, C., Bouman, W.P., Seal, L., Barker, M.J., Nieder, T.O.,
T’Sjoen, G.: Non-binary or genderqueer genders. International Review
of Psychiatry 28(1), 95–102 (2016)

[28] Keyes, O.: The misgendering machines: Trans/HCI implications of automatic gender recognition. Proceedings of the ACM on Human-Computer Interaction 2(CSCW), 1–22 (2018)

[29] Larrazabal, A.J., Nieto, N., Peterson, V., Milone, D.H., Ferrante, E.:

Gender imbalance in medical imaging datasets produces biased classifiers for computer-aided diagnosis. Proceedings of the National Academy of Sciences 117, 12592–12594 (2020)

[30] Speicher, T., Ali, M., Venkatadri, G., Ribeiro, F.N., Arvanitakis, G.,
Benevenuto, F., Gummadi, K.P., Loiseau, P., Mislove, A.: Potential for
discrimination in online targeted advertising. In: Conference on Fairness,
Accountability and Transparency, pp. 5–19 (2018)

[31] De-Arteaga, M., Romanov, R., Wallach, H., Chayes, J., Borgs, C., et al.:
Bias in bios: A case study of semantic representation bias in a high-stakes
setting. In: Conference on Fairness, Accountability, and Transparency,
pp. 120–128 (2019)

[32] Bolukbasi, T., Chang, K., Zou, J.Y., Saligrama, V., Kalai, A.T.: Man is
to computer programmer as woman is to homemaker? Debiasing word
embeddings. Advances in Neural Information Processing Systems (2016)

[33] Bengio, Y., Courville, A., Vincent, P.: Representation learning: A review
and new perspectives. IEEE Transactions on Pattern Analysis and
Machine Intelligence 35(8), 1798–1828 (2013)

[34] Bau, D., Zhu, J., Strobelt, H., Lapedriza, A., Zhou, B., Torralba, A.:
Understanding the role of individual units in a deep neural network.
Proceedings of the National Academy of Sciences 117(48), 1–8 (2020)

[35] Yosinski, J., Clune, J., Nguyen, A., Fuchs, T., Lipson, H.: Understanding
neural networks through deep visualization. In: Intenational Conference
on Machine Learning (ICML) Deep Learning Workshop, Lille, France
(2015)

[36] Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F.A.,
Brendel, W.: ImageNet-trained CNNs are biased towards texture;
increasing shape bias improves accuracy and robustness. In: Interna-
tional Conference on Learning Representations (ICLR), New Orleans,
Louisiana, USA (2019)

[37] Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K., Samek,
W.: On pixel-wise explanations for non-linear classifier decisions by layer-
wise relevance propagation. PloS one 10(7), 1–46 (2015)

[38] Selvaraju, R., Cogswell, M., et al.: Grad-CAM: Visual explanations from
deep networks via gradient-based localization. In: IEEE International
Conference on Computer Vision (CVPR), Honolulu, Hawaii, USA, pp.
618–626 (2017). IEEE

[39] Ortega, A., Fierrez, J., Morales, A., Wang, Z., de la Cruz, M., Alonso,

C.L., Ribeiro, T.: Symbolic AI for XAI: Evaluating LFIT inductive programming for explaining biases in machine learning. Computers 10(11), 154 (2021)

[40] Hendricks, L.A., Akata, Z., Rohrbach, M., Donahue, J., Schiele, B.,
Darrell, T.: Generating visual explanations. In: European Conference
on Computer Vision (ECCV), Amsterdam, The Netherlands, pp. 3–19
(2016). Springer

[41] Montavon, G., Samek, W., Müller, K.: Methods for interpreting and
understanding deep neural networks. Digital Signal Processing 73
(2018)

[42] Erhan, D., Bengio, Y., Courville, A., Vincent, P.: Visualizing Higher-
Layer Features of a Deep Network. University of Montreal 1341(3)
(2009)

[43] Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional net-
works: Visualising image classification models and saliency maps. In:
International Conference on Learning Representations (ICLR) Work-
shop, Banff, Canada (2014)

[44] Mahendran, A., Vedaldi, A.: Understanding deep image representations by inverting them. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, pp. 5188–5196 (2015). IEEE

[45] Nguyen, A., Clune, J., Bengio, Y., Dosovitskiy, A., Yosinski, J.: Plug
& Play generative networks: Conditional iterative generation of images
in latent space. In: IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), Honolulu, Hawaii, USA (2017). IEEE

[46] Nguyen, A., Yosinski, J., Clune, J.: Multifaceted feature visualization:
Uncovering the different types of features learned by each neuron in
deep neural networks. In: Intenational Conference on Machine Learning
(ICML) Deep Learning Workshop, New York, NY, USA (2016)

[47] Nguyen, A., Dosovitskiy, A., Yosinski, J., Brox, T., Clune, J.: Syn-
thesizing the preferred inputs for neurons in neural networks via deep
generator networks. In: Conference on Neural Information Processing
Systems (NIPS), Barcelona, Spain, pp. 3395–3403 (2016)

[48] Karnin, E.D.: A simple procedure for pruning back-propagation trained neural networks. Transactions on Neural Networks 1(2), 239–242 (1990)

[49] Zurada, J.M., Malinowski, A., Cloete, I.: Sensitivity analysis for minimization of input data dimension for feedforward neural network. In: International Symposium on Circuits and Systems (ISCAS), vol. 6, pp.

447–450 (1994)

[50] Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional net-
works. In: European Conference on Computer Vision (ECCV), Zurich,
Switzerland, pp. 818–833 (2014). Springer

[51] Springenberg, J.T., Dosovitskiy, A., Brox, T., Riedmiller, M.: Striving
for simplicity: The all convolutional net. In: International Conference on
Learning Representations (ICLR), San Diego, CA, USA (2015)

[52] Zhang, Q., Cao, R., Shi, F., Wu, Y.N., Zhu, S.: Interpreting CNN
knowledge via an explanatory graph. In: AAAI Conference on Artificial
Intelligence, vol. 32. AAAI Press, New Orleans, Louisiana, USA (2018)

[53] Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., Kim,
B.: Sanity checks for saliency maps. In: Advances in Neural Information
Processing Systems (NIPS), vol. 31, pp. 9525–9536. Curran Associates
Inc., Montréal, Canada (2018)

[54] Szegedy, C., Zaremba, W., Sutskever, I., Estrach, J.B., Erhan, D.,
Goodfellow, I., Fergus, R.: Intriguing properties of neural networks. In:
International Conference on Learning Representations (ICLR), Banff,
Canada (2014)

[55] Koh, P.W., Liang, P.: Understanding black-box predictions via influence
functions. In: International Conference on Machine Learning (ICML),
vol. 70, pp. 1885–1894. PMLR, Sydney, Australia (2017)

[56] Nguyen, A., Yosinski, J., Clune, J.: Deep neural networks are easily
fooled: High confidence predictions for unrecognizable images. In: IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), pp.
427–436 (2015). IEEE

[57] Su, J., Vargas, D.V., Sakurai, K.: One pixel attack for fooling deep neural
networks. Transactions on Evolutionary Computation 23(5), 828–841
(2019)

[58] Quadrianto, N., Sharmanska, V., Thomas, O.: Discovering fair represen-
tations in the data domain. In: IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), pp. 8227–8236 (2019)

[59] Sattigeri, P., Hoffman, S.C., Chenthamarakshan, V., Varshney, K.R.: Fairness GAN: Generating datasets with fairness properties using a generative adversarial network. IBM Journal of Research and Development 63, 1–9 (2019)

[60] Odena, A., Olah, C., Shlens, J.: Conditional image synthesis with auxil-
iary classifier GANs. In: International Conference on Machine Learning
(ICML), Sydney, Australia, pp. 2642–2651 (2017)

[61] Calmon, F.P., Wei, D., Vinzamuri, B., Ramamurthy, K.N., Varshney,
K.R.: Optimized pre-processing for discrimination prevention. In: Pro-
ceedings of the 31st International Conference on Neural Information
Processing Systems, pp. 3995–4004 (2017)

[62] Ramaswamy, V.V., Kim, S.S., Russakovsky, O.: Fair attribute classifica-
tion through latent space de-biasing. In: IEEE Conference on Computer
Vision and Pattern Recognition, pp. 9301–9310 (2021)

[63] Jia, S., Lansdall-Welfare, T., Cristianini, N.: Right for the right reason:
Training agnostic networks. In: Advances in Intelligent Data Analysis
XVII, pp. 164–174 (2018)

[64] Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Lavio-
lette, F., Marchand, M., Lempitsky, V.: Domain-adversarial training of
neural networks. Journal of Machine Learning Research 17, 2096–2030
(2016)

[65] Wang, M., Deng, W., Hu, J., Tao, X., Huang, Y.: Racial faces in the wild:
Reducing racial bias by information maximization adaptation network.
In: IEEE International Conference on Computer Vision (ICCV), pp. 692–
702 (2019)

[66] Romanov, A., De-Arteaga, M., Wallach, H., Chayes, J., Borgs, C., et al.:
What’s in a name? Reducing bias in bios without access to protected
attributes. In: Proceedings of the 2019 Conference of the North Amer-
ican Chapter of the Association for Computational Linguistics: Human
Language Technologies, pp. 4187–4195 (2019)

[67] Deng, J., Guo, J., Xue, N., Zafeiriou, S.: ArcFace: Additive angular mar-
gin loss for deep face recognition. In: IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), pp. 4690–4699 (2019)

[68] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., et al.: Generative
adversarial nets. Advances in Neural Information Processing Systems 27,
2672–2680 (2014)

[69] Alvi, M., Zisserman, A., Nellaker, C.: Turning a blind eye: Explicit
removal of biases and variation from deep neural network embeddings.
In: European Conference on Computer Vision (ECCV) (2018)

[70] Kim, B., Kim, H., Kim, K., Kim, S., Kim, J.: Learning not to learn:
Training deep neural networks with biased data. In: IEEE Conference

on Computer Vision and Pattern Recognition (CVPR), pp. 9012–9020 (2019)

[71] Morales, A., Fierrez, J., Vera-Rodriguez, R., Tolosana, R.: SensitiveNets:
Learning agnostic representations with application to face recognition.
IEEE Transactions on Pattern Analysis and Machine Intelligence 43(6),
2158–2164 (2021)

[72] Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: A unified embedding
for face recognition and clustering. In: IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), pp. 815–823 (2015)

[73] Berendt, B., Preibusch, S.: Exploring discrimination: A user-centric evaluation of discrimination-aware data mining. In: IEEE International Conference on Data Mining Workshops, pp. 344–351 (2012)

[74] Pedreshi, D., Ruggieri, S., Turini, F.: Discrimination-aware data mining.
In: Proceedings of the 14th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, pp. 560–568 (2008)

[75] Zhang, Y., Bellamy, R., Varshney, K.R.: Joint optimization of AI fair-
ness and utility: A human-centered approach. In: Proceedings of the
AAAI/ACM Conference on AI, Ethics, and Society, pp. 400–406 (2020)

[76] Yang, K., Stoyanovich, J.: Measuring fairness in ranked outputs. In: Pro-
ceedings of the 29th International Conference on Scientific and Statistical
Database Management, pp. 1–6 (2017)

[77] Celis, L.E., Straszak, D., Vishnoi, N.K.: Ranking with fairness con-
straints. arXiv/1704.06840 (2017)

[78] Zehlike, M., Bonchi, F., Castillo, C., Hajian, S., Megahed, M., Baeza-
Yates, R.: FA*IR: A fair top-k ranking algorithm. In: Proceedings of the
2017 ACM on Conference on Information and Knowledge Management,
pp. 1569–1578 (2017)

[79] Dua, D., Graff, C.: UCI Machine Learning Repository (2017). http://archive.ics.uci.edu/ml

[80] Moro, S., Cortez, P., Rita, P.: A data-driven approach to predict the
success of bank telemarketing. Decision Support Systems 62, 22–31
(2014)

[81] Zhao, J., Wang, T., Yatskar, M., Ordonez, V., Chang, K.: Gender bias in
coreference resolution: Evaluation and debiasing methods. In: Conference
of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, vol. 2 (2018)

[82] Liu, Z., Luo, P., Wang, X., Tang, X.: Deep learning face attributes in the
wild. In: International Conference on Computer Vision (ICCV) (2015)

[83] Rothe, R., Timofte, R., Van Gool, L.: Dex: Deep expectation of apparent
age from a single image. In: IEEE International Conference on Computer
Vision Workshops (CVPRW), pp. 10–15 (2015)

[84] Ricanek Jr., K., Tesafaye, T.: Morph: A longitudinal image database of
normal adult age-progression. In: International Conference on Automatic
Face and Gesture Recognition, pp. 341–345 (2006)

[85] Karkkainen, K., Joo, J.: FairFace: Face attribute dataset for balanced
race, gender, and age for bias measurement and mitigation. In: IEEE
Winter Conference on Applications of Computer Vision, pp. 1548–1558
(2021)

[86] Merler, M., Ratha, N., Feris, S.R., Smith, J.R.: Diversity in faces.
arXiv/1901.10436 (2019)

[87] Robinson, J.P., Livitz, G., Henon, Y., Qin, C., Fu, Y., Timoner, S.: Face
recognition: Too bias, or not too bias? In: IEEE Conference on Computer
Vision and Pattern Recognition Workshops (CVPRW) (2020)

[88] Hupont, I., Fernández, C.: DemogPairs: Quantifying the impact of demo-
graphic imbalance in deep face recognition. In: IEEE International
Conference on Automatic Face & Gesture Recognition (2019)

[89] Torralba, A., Efros, A.A.: Unbiased look at dataset bias. In: IEEE
Conference on Computer Vision and Pattern Recognition (CVPR)
(2011)

[90] Serna, I., Morales, A., Fierrez, J., Cebrian, M., Obradovich, N., Rahwan,
I.: SensitiveLoss: Improving accuracy and fairness of face representa-
tions with discrimination-aware deep learning. Artificial Intelligence 305
(2022)

[91] Garg, N., Schiebinger, L., Jurafsky, D., Zou, J.: Word embeddings quan-
tify 100 years of gender and ethnic stereotypes. Proceedings of the
National Academy of Sciences 115, 3635–3644 (2018)

[92] Guo, Y., Zhang, L., Hu, Y., He, X., Gao, J.: MS-Celeb-1M: A dataset
and benchmark for large-scale face recognition. In: European Conference
on Computer Vision (ECCV) (2016)

[93] Bendick Jr, M., Jackson, C.W., Romero, J.H.: Employment discrimina-
tion against older workers: An experimental study of hiring practices.
Journal of Aging & Social Policy 8, 25–46 (1997)

[94] Cowgill, B.: Bias and productivity in humans and algorithms: The-
ory and evidence from resume screening. Columbia Business School 29
(2018)

[95] Fierrez, J., Morales, A., Vera-Rodriguez, R., Camacho, D.: Multiple
classifiers in biometrics. Part 1: Fundamentals and review. Information
Fusion 44, 57–64 (2018)

[96] Gonzalez-Sosa, E., Fierrez, J., Vera-Rodriguez, R., Alonso-Fernandez, F.: Facial soft biometrics for recognition in the wild: Recent works, annotation and COTS evaluation. IEEE Transactions on Information Forensics and Security 13, 2001–2014 (2018)

[97] Ranjan, R., Sankaranarayanan, S., Bansal, A., Bodla, N., Chen, J., Patel,
V.M., Castillo, C.D., Chellappa, R.: Deep learning for understanding
faces: Machines may be just as good, or better, than humans. IEEE
Signal Processing Magazine 35, 66–83 (2018)

[98] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image
recognition. In: IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pp. 770–778 (2016)

[99] Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C., Joulin, A.: Advances
in pre-training distributed word representations. In: Proceedings of the
International Conference on Language Resources and Evaluation (LREC
2018) (2018)

[100] Biddle, D.: Adverse Impact and Test Validation: A Practitioner’s Guide
to Valid and Defensible Employment Testing. Routledge, London (2017)

[101] Bakker, M., Valdes, H.R., Tu, D.P., Gummadi, K.P., Varshney, K.R.,
et al.: Fair enough: Improving fairness in budget-constrained decision
making using confidence thresholds. In: AAAI Workshop on Artificial
Intelligence Safety, New York, NY, USA, pp. 41–53 (2020)

[102] Acien, A., Morales, A., Vera-Rodriguez, R., Fierrez, J., Delgado, O.:
Smartphone sensors for modeling human-computer interaction: Gen-
eral outlook and research datasets for user authentication. In: IEEE
Conference on Computers, Software, and Applications (COMPSAC)
(2020)

[103] Acien, A., Morales, A., Fierrez, J., Vera-Rodriguez, R., Bartolome,
I.: BeCAPTCHA: Detecting human behavior in smartphone interac-
tion using multiple inbuilt sensors. In: AAAI Workshop on Artificial
Intelligence for Cyber Security (AICS) (2020)

[104] Hernandez-Ortega, J., Daza, R., Morales, A., Fierrez, J., Ortega-Garcia,

J.: edBB: Biometrics and behavior for assessing remote education. In:
AAAI Workshop on Artificial Intelligence for Education (AI4EDU)
(2020)

[105] Serna, I., DeAlcala, D., Morales, A., Fierrez, J., Ortega-Garcia, J.:
IFBiD: Inference-free bias detection. In: AAAI Workshop on Artificial
Intelligence Safety (SafeAI). CEUR, vol. 3087 (2022)

[106] Gomez-Barrero, M., Maiorana, E., Galbally, J., Campisi, P., Fierrez, J.:
Multi-biometric template protection based on homomorphic encryption.
Pattern Recognition 67, 149–163 (2017)

[107] Hassanpour, A., Moradikia, M., Yang, B., Abdelhadi, A., Busch, C.,
Fierrez, J.: Differential privacy preservation in robust continual learning.
IEEE Access 10, 24273–2428 (2022)
