Human-Centric Multimodal Machine Learning: Recent Advances and Testbed On AI-based Recruitment
Abstract
The presence of decision-making algorithms in society is rapidly increasing, while concerns about their transparency and the possibility of these algorithms becoming new sources of discrimination are rising. There is a certain consensus about the need to develop AI applications with a Human-Centric approach. Human-Centric Machine
Learning needs to be developed based on four main requirements: (i) util-
ity and social good; (ii) privacy and data ownership; (iii) transparency
and accountability; and (iv) fairness in AI-driven decision-making pro-
cesses. All these four Human-Centric requirements are closely related
to each other. With the aim of studying how current multimodal algo-
rithms based on heterogeneous sources of information are affected by
sensitive elements and inner biases in the data, we propose a ficti-
tious case study focused on automated recruitment: FairCVtest. We
train automatic recruitment algorithms using a set of multimodal syn-
thetic profiles including image, text, and structured data, which are
consciously scored with gender and racial biases. FairCVtest shows the
capacity of the Artificial Intelligence (AI) behind automatic recruit-
ment tools built this way (a common practice in many other application
scenarios beyond recruitment) to extract sensitive information from
Springer Nature 2021 LATEX template
1 Introduction
Artificial Intelligence plays a key role in people’s lives nowadays, with auto-
matic systems being deployed in a large variety of fields, such as healthcare,
education, or jurisprudence. The data science community’s breakthroughs of the last decades, along with the large amounts of data currently available, have made such deployment possible, allowing us to train deep models that achieve a performance never seen before. The emergence of deep learning technologies
has generated a paradigm shift, with handcrafted algorithms being replaced
by data-driven approaches. However, the application of machine learning algo-
rithms built using training data collected from society can lead to adverse
effects, as these data may reflect current socio-cultural and historical biases [1].
In this scenario, automated decision-making models have the capacity to repli-
cate human biases present in the data, or even amplify them [2][3][4][5][6] if
appropriate measures are not taken.
There are relevant models based on machine learning that have been shown
to make decisions largely influenced by demographic attributes in various
fields. For example, Google’s [7] and Facebook’s [8] ad delivery systems gener-
ated undesirable discrimination with disparate performance across population
groups. In 2016, ProPublica researchers [9] analyzed several Broward County defendants’ criminal records 2 years after they were assessed with the recidivism system COMPAS, finding that the algorithm was biased against black defendants. New York’s insurance regulator probed UnitedHealth Group over its use of an algorithm that researchers found to be racially biased: the algorithm prioritized healthier white patients over sicker black ones [10]. The Apple Card service granted higher credit limits to men than to women even though it was programmed to be blind to that variable [11]. Face analysis technologies have also shown a gap in performance between demographic groups [2][12][13][14], largely as a consequence of an unrepresentative sampling of society in the training data. Moreover, as Balakrishnan et al. pointed out [15], the problem of
∗ Paper based on the keynote by Prof. Julian Fierrez at ICPRAM 2021.
data bias goes beyond the training set, as we need a bias-free evaluation set
in order to correctly assess algorithmic fairness.
The usage of AI technologies is also growing in the labor market [16], where
automatic-decision making systems are commonly used in different stages
within the hiring pipeline [17]. However, automatic tools in this area have also
exhibited worrying biased behaviors, such as Amazon’s recruiting tool preferring male candidates over female ones [18]. Ensuring that all social groups have equal opportunities in the labor market is crucial to overcoming the disadvantages of minority groups, which have been historically penalized [19]. Some works are
starting to address fairness in hiring [20][21][22], but the lack of transparency
(i.e. both the models and their training data are usually private for legal or
corporate reasons [20]) hinders the technical evaluation of these systems.
In response to the deployment of automatic systems, along with the concerns about their fairness, governments are adopting regulations in this matter, placing special emphasis on personal data processing and preventing algorithmic discrimination. Among these regulations, the European Union’s General Data Protection Regulation (GDPR)1 is especially relevant for its
impact on the use of machine learning algorithms [23]. The GDPR aims to
protect EU citizens’ rights concerning data protection and privacy by regu-
lating how to collect, store, and process personal data (e.g. Articles 17 and
44), and requires measures to prevent discriminatory effects while processing
sensitive data (according to Article 9, sensitive data includes “personal data
revealing racial or ethnic origin, political opinions, religious or philosophical
beliefs”). Thus, research on transparency, fairness, or explainability in machine learning is not only an ethical matter, but also a legal concern and the basis for the development of responsible and helpful AI systems that can be trusted [24].
On the other hand, one of the most active areas in Machine Learning (ML) is the development of new multimodal models capable of understanding and processing information from multiple heterogeneous sources [25]. Such sources include structured data (e.g. tabular data) and unstructured data from images, audio, and text. The
implementation of these models in society must be accompanied by effective
measures to prevent algorithms from becoming a source of discrimination. In
this scenario, where multiple sources of both structured and unstructured data
play a key role in algorithms’ decisions, the task of detecting and preventing
biases becomes even more relevant and difficult.
In this environment of desirable fair and trustworthy AI, the main
contributions of this work are:
• We review the latest advances in Human-Centric ML research with special focus on the publicly available databases proposed by the community.
• We present a new public experimental framework around automated recruitment, aimed at studying how multimodal machine learning is influenced by demographic biases present in the training datasets: FairCVtest.2
1 https://gdpr.eu/
2 https://github.com/BiDAlab/FairCVtest
settings, it is essential that these systems are responsible and trustworthy. However, many models have been shown to make decisions based on attributes considered private (e.g. gender3 and ethnicity), or to exhibit systematic discrimination against individuals belonging to disadvantaged
groups. We can find examples of such unfair treatment in various fields, such
as healthcare [10][29], ad delivery systems [7][8][30], hiring [16][18], and both
facial analysis [5][12][13] and NLP technologies [31][32].
In the following sections we will present recent advances in Human-Centric
ML research related with: 1) explainability and interpretability of ML models;
2) discrimination-aware ML approaches; and 3) databases for Human-Centric
ML research.
3 We are aware of studies that move away from the traditional view of gender as a binary variable [27], and of the difference between gender identity and biological sex. Despite the limitations of such a model [28], in this paper we use “gender” to refer to the external perception of biological sex, in line with the work historically developed in gender classification into male and female individuals.
Some other methods have gone beyond visualization of CNNs and diagnosed CNN representations to gain a deep understanding of the features encoded in a CNN. Others report the unreliability of some widely deployed saliency methods, whose outputs can be largely independent of both the data on which the model was trained and the model parameters [53].
Szegedy et al. [54] reported the existence of blind spots and counterintuitive properties of neural networks. They found that it is possible to change the network’s prediction by applying an imperceptible optimized perturbation to the input image, which they called an adversarial example. This finding paved the way for a series of works that sought to produce images with which to fool the models [55][56][57].
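The idea of an optimized, sign-of-gradient perturbation can be sketched on a toy logistic-regression model. All weights and inputs below are hypothetical stand-ins for a real network, and the sketch follows the sign-of-gradient attack popularized by the follow-up works, not Szegedy et al.'s original formulation:

```python
import numpy as np

# Hypothetical fixed "model": logistic regression over 3 input features.
w = np.array([2.0, -3.0, 1.0])
b = 0.5

def predict(x):
    """Probability assigned to the true class (class 1)."""
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

def sign_gradient_attack(x, y, eps):
    """Move x by eps along the sign of the input gradient of the
    cross-entropy loss, increasing the loss for label y."""
    grad_x = (predict(x) - y) * w  # d(cross-entropy)/dx for logistic regression
    return x + eps * np.sign(grad_x)

x = np.array([0.2, -0.1, 0.4])
x_adv = sign_gradient_attack(x, y=1.0, eps=0.25)
print(predict(x), predict(x_adv))  # confidence in the true class drops
```

Even though each input coordinate moves by at most eps, the perturbation is aligned with the loss gradient, so the model's confidence in the correct class decreases.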
Other studies aiming to understand deep neural networks rely on neuron ablation techniques. These seek a complete functional understanding of the model, trying to elucidate its inner workings or shed light on its internal representations. Bau et al. found evidence for the emergence of disentangled, human-interpretable units (of objects, materials, and colors) during training [34].
2.3 Databases
The datasets used for learning or inference may be the most critical elements
of the machine learning process where bias can appear. As these data are
collected from society, they may reflect sociocultural biases [1], or reflect an
unbalanced representation of the different demographic groups composing it.
A naive approach would be to remove all sensitive information from the data, but this is almost infeasible in a general AI setup (e.g. [31] demonstrates that removing explicit gender indicators from personal biographies is not enough to remove the gender bias from an occupation classifier, as other words may serve as proxies). On the other hand, collecting large datasets that represent broad social diversity in a balanced manner can be extremely costly, and is still not enough to avoid disparate treatment between groups [13].
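The proxy effect can be illustrated with a toy simulation (all quantities below are hypothetical): even after the sensitive attribute is dropped from the input, a single remaining feature that agrees with it, say, 90% of the time lets a trivial rule recover it well above chance:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
gender = rng.integers(0, 2, n)          # sensitive attribute, removed from the input
proxy = np.where(rng.random(n) < 0.9,   # a remaining feature (e.g. an occupation word)
                 gender, 1 - gender)    # that agrees with gender 90% of the time

# A "classifier" that never sees gender directly, only the proxy feature:
accuracy = np.mean(proxy == gender)
print(f"gender recovered from the proxy alone: {accuracy:.1%}")
```

This is why blinding a model to explicit sensitive attributes does not by itself remove the information from the input representation.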
The biases introduced in the datasets used to train machine learning models typically reflect human biases present in society, or are related to an inaccurate representation of groups [89][90]. In view of this situation, the scientific community has put a lot of effort into collecting databases that improve the representation of different demographic groups, which can be used to suppress the presence of bias. In this section, we discuss some of the most commonly used
databases in AI fairness research, either because of the biases they present, or
their absence (i.e. databases more balanced in terms of certain demographic
attributes). Table 1 provides an overview of these databases, including the
number of samples, data modality and the demographic attributes studied
with each one. The Adult Income dataset [79] from the UCI repository is frequently used in gender and ethnicity bias mitigation. The main task of the database is to predict whether a person will earn more or less than $50K per year. The database includes 48,842 samples with 14 numerical/categorical attributes each, such as education level, capital gain, or occupation, and contains missing values.
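In bias-mitigation work, such tabular data are typically used to compare positive-outcome rates across demographic groups. A minimal sketch, with a hypothetical hand-made miniature in place of the real 48,842 records:

```python
import numpy as np

# Hypothetical Adult-style records: [group, label]
# group: 0 = female, 1 = male; label: 1 = income > $50K
records = np.array([
    [1, 1], [1, 1], [1, 0], [1, 1], [0, 0],
    [0, 0], [0, 1], [0, 0], [1, 0], [0, 0],
])

def positive_rate(records, group):
    """Fraction of a demographic group receiving the positive label."""
    mask = records[:, 0] == group
    return records[mask, 1].mean()

gap = positive_rate(records, 1) - positive_rate(records, 0)
print(f"statistical parity difference: {gap:.2f}")  # → 0.40
```

A statistical parity difference of zero would mean both groups receive the positive outcome at the same rate; mitigation methods evaluated on Adult typically try to shrink this gap without destroying predictive accuracy.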
The German Credit dataset [79] contains 1K entries with 20 different categorical/numerical attributes, where each entry represents a loan applicant classified by a bank as a good or bad credit risk, showing age bias toward young people. Also related to age biases, the Bank Marketing database [80] contains marketing campaign data from a Portuguese banking institution. With more than 41K samples, the goal is to predict whether the client will subscribe to a term deposit, based on 20 categorical/numerical attributes including personal data and socioeconomic contextual information.
The ProPublica Recidivism dataset [9] provides more than 11K pretrial defendants’ records, assessed with the COMPAS algorithm to predict their likelihood of recidivism. After a 2-year study, the researchers found that the algorithm was biased against African-Americans, who showed both higher false positive and lower false negative rates than white defendants.
In the study of demographic bias in NLP technologies,4 we can cite the Common Crawl Bios dataset [31], which contains nearly 400K short biographies collected from Common Crawl. The goal of the dataset is to predict the occupation from these bios, out of 28 possible occupations showing high gender imbalances. The dataset also provides a “gender blinded” version of each bio, where explicit gender indicators have been removed (e.g. pronouns or names). On a closely related task, the WinoBias database [81] provides 3,160 sentences, where the goal is to find all the expressions related to a certain entity. Centered on person entities referred to by their occupations, the dataset requires linking gender pronouns to male/female stereotypical occupations.
We now focus on face datasets, which are the basis for different face analysis tasks such as face recognition or gender classification. The CelebA database [82] contains nearly 202.6K images of more than 10K celebrities. Each image is annotated with 5 facial landmarks, along with 40 binary attributes including appearance features, demographic information, or attractiveness, which shows a strong gender bias.
The IMDB-WIKI dataset [83] provides 460.7K images from the IMDB profiles of 20,284 different celebrities, along with 62.3K images from Wikipedia. Images were labeled using the information available in the profiles (i.e. name, gender, and birth date), extracting an age label by comparing the timestamp of the images and the birth date. The dataset presents a gender bias in the age distributions, as we encounter younger females and older males. Due to the image acquisition process, some labels are noisy, so the authors of [69] released the cleaned IMDB dataset, with 60K cleaned images for age prediction and 80K for gender classification, obtained from the IMDb split.
Also related to age studies, the MORPH database [84] provides 55K images of 13K individuals, aimed at studying the effect of age progression on different facial tasks. The database is longitudinal with age, having pictures of the same user over time. The database is strongly unbalanced with respect to gender and ethnicity, with 65% of images belonging to African-American males.
Some databases aim to mitigate biases in face analysis technologies by putting emphasis on demographic balance and diversity. The Pilot Parliaments Benchmark (PPB) [12] is a dataset of 1,270 images of parliamentarians from 6 different countries in Europe and Africa. The images are balanced with respect to gender and skin color, which are available as labels (the skin color is codified using the six-point Fitzpatrick system). The Labeled Ancestral Origin Faces in the Wild (LAOFIW) dataset [69] provides 14K images manually divided into 4 ancestral origin groups. The database is balanced with respect
4 There are several works that study demographic biases in word embeddings [32][91], working with representation spaces trained on large corpora of texts from Wikipedia, Common Crawl or Google News, among other sources.
to ancestral origin and gender, with a variety of poses and illumination. Also emphasizing ethnicity balance, the FairFace database [85] contains more than 100K images equally distributed among 7 ethnicity groups (White, Black, Indian, East Asian, Southeast Asian, Middle East, and Latino), also providing gender and age labels. Aimed at studying facial diversity, Diversity in Faces [86] provides 1M images annotated with 10 different facial coding schemes including gender, age, skin color, pose, and facial contrast labels, among others.
If we look at face recognition databases, DiveFace [71] contains face images equitably distributed among 6 demographic classes related to gender and 3 ethnic groups (Black, Asian, and Caucasian), including 24K different identities and a total of 120K images. The DemogPairs database [88] also proposes 6 balanced demographic groups related to gender and ethnicity, each one with 100 subjects and 1.8K images. For its part, the Balanced Faces in the Wild (BFW) database [87] presents 8 demographic groups related to gender and 4 ethnicity groups (Asian, Black, Indian, and White), each one with 100 different users and 2.5K images. Finally, Wang and Deng proposed three different databases based on MS-Celeb-1M [92], namely Racial Faces in the Wild (RFW) [65], BUPT-B [13], and BUPT-G [13]. While RFW is designed as a validation dataset, aimed at measuring ethnicity biases, both BUPT-B and BUPT-G are proposed as ethnicity-aware training datasets. RFW defines 4 ethnic groups (Caucasian, Asian, Indian, and African), each one with 10K images and 3K different subjects. On the other hand, both BUPT-B and BUPT-G propose the same ethnic groups, the first one almost ethnicity-balanced with 1.3M images and 28K subjects, while the latter contains 2M images and 38K subjects, distributed approximating the world’s population distribution.
Fig. 1 Information blocks in a resume and personal attributes that can be derived from each one. The number of crosses represents the level of sensitive information (+++ = high, ++ = medium, + = low).
groups [19][93], which makes bias prevention a crucial step in the design of automatic hiring tools. Although the study of fairness in algorithmic hiring has been limited [21], some works are starting to address this topic [20][22][94].
For the purpose of studying discrimination in Artificial Intelligence at large, and particularly in hiring processes, in this work we propose a new experimental framework inspired by a fictitious automated recruiting system: FairCVtest. Our work can be framed within the screening stage of the hiring pipeline, where an automatic tool determines a score from a set of applicants’ resumes. We chose this application because it comprises personal information of different natures [95].
The resume is traditionally composed of structured data including name, position, age, gender, experience, or education, among others (see Figure 1), and also includes unstructured data such as a face photo or a short biography. A face image is rich in unstructured information such as identity, gender, ethnicity, or age [96][97]. That information can be recognized in the image, but doing so requires a cognitive or automatic process previously trained for that task. Text is also rich in unstructured information: the language, and the way we use it, reveal attributes related to one’s nationality, age, or gender. Image and text represent two of the domains that have attracted major interest from the AI research community in recent years. The Computer Vision and Natural Language Processing communities have boosted algorithmic capabilities in image and text analysis through the use of massive amounts of data, large computational resources (GPUs), and deep learning techniques.
The resumes used in the proposed FairCVtest framework include merits
of the candidate (e.g. experience, education level, languages, etc.), two demo-
graphic attributes (gender and ethnicity), and a face photograph (see Section
3.2 for all the details).
and professional sector. Fig. 2 presents four visual examples of the resumes
generated with FairCVdb.
4 FairCVtest: Description
4.1 General Learning Framework
The multimodal model, represented by its parameter vector w^F (F for fused model [95]), is trained using features learned by M independent models {w^1, ..., w^M}, where each model produces n_i features x^i = [x^i_1, ..., x^i_{n_i}] ∈ R^{n_i}. Without loss of generality, Figure 3 presents the learning framework for M = 3. The learning process is guided by a Target function T, and a learning strategy that minimizes the error between the output O and the Target function T. In our framework, x^i is data obtained from the resume, w^i are models trained specifically for different information domains (e.g. images, text), and T is a score within the interval [0, 1] ranking the candidates according to their merits. A score close to 0 corresponds to the worst candidate, while the best candidate would get 1. The learning strategy is traditionally based on the minimization of a loss function defined to obtain the best performance. The most popular approach for supervised learning is to train the model w^F by minimizing a loss function L over a set of training samples S:

$$\min_{\mathbf{w}^{F}} \sum_{\mathbf{x}^{j} \in \mathcal{S}} L\left(O(\mathbf{x}^{j} \mid \mathbf{w}^{F}),\, T^{j}\right) \qquad (2)$$
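Equation (2) can be sketched numerically as follows. The domain-specific features, target scores, and the logistic fused model below are all synthetic stand-ins (the paper's actual architecture is described in Section 4.3); the loop is simply the generic strategy the equation describes, i.e. gradient descent on L over the training set S:

```python
import numpy as np

rng = np.random.default_rng(42)

# Features produced by M = 3 hypothetical domain models (image, text, structured),
# with n_1 = 8, n_2 = 4, n_3 = 2 features each; fused by concatenation.
n_samples, sizes = 200, (8, 4, 2)
x = np.concatenate([rng.normal(size=(n_samples, n)) for n in sizes], axis=1)

# Synthetic target score T in [0, 1] for each sample.
T = 1.0 / (1.0 + np.exp(-(x @ rng.normal(size=x.shape[1]))))

def output(x, wF):
    """Fused model output O(x | w^F): a logistic score in [0, 1]."""
    return 1.0 / (1.0 + np.exp(-(x @ wF)))

# Minimize the squared loss L(O, T) over S by gradient descent on w^F.
wF = np.zeros(x.shape[1])
for _ in range(2000):
    O = output(x, wF)
    grad = x.T @ ((O - T) * O * (1 - O)) / n_samples
    wF -= 1.0 * grad

mse = np.mean((output(x, wF) - T) ** 2)
print(f"training MSE: {mse:.4f}")
```

The point of the sketch is that the model fits whatever regularities link the fused features to T; if T encodes a demographic bias, the same minimization will reproduce it.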
Biases can be introduced in different stages of the learning process (see
Figure 3): in the Data used to train the models (A), the Preprocessing or
Fig. 2 Visual examples of the FairCVdb synthetic resumes, including a face image, a name,
an occupation, a short biography and the candidate competencies.
Fig. 3 Block diagram of the automatic multimodal learning process and 6 (A to E ) stages
where bias can appear.
Fig. 5 Hiring score distributions by gender (left) and ethnicity (right). The top row presents
hiring score distributions in the Neutral Scenario, while the bottom presents them in the
Gender- and Ethnicity-biased Scenarios.
that gender or ethnicity are not inputs of our model, but rather the system is
able to detect this sensitive information from some of the input features (i.e.
the face embedding, the biography, or the competencies). Therefore, despite
not having explicit access to demographic attributes, the classifier is able to
detect this information and find its correlation with the biases introduced in
the scores, and so it ends up reproducing them.
The third scenario provided by FairCVtest, which we call the Agnostic Scenario, aims to prevent the system from inheriting data biases. As introduced in Section 4.3, the Agnostic Scenario uses a gender-blind version of the biographies, as well as a face embedding from which sensitive information has been removed using the method of [71]. Figure 6 presents the hiring score distributions in this setup. As we can see, the gender distributions are close to the ones observed in the Neutral Scenario (see top-left plot in Figure 5), despite using gender-biased labels during training. In the ethnicity case, we can observe a slight difference between groups, much smoother than the one we saw in the Biased Scenario (see bottom-left plot in Figure 5), as can be confirmed with the KL divergence (i.e. 0.061, compared to the biased case where this value is around 0.178). However, this gap in the scores between demographic groups still has margin to decrease to a level similar to that of the Neutral Scenario. The difference observed in the behavior of the gender and ethnicity agnostic cases
Fig. 6 Hiring score distributions by gender (left) and ethnicity (right), in the Agnostic
Scenario.
can be explained by the fact that we removed almost all gender information from the input (i.e. face embedding and biography), whereas for ethnicity we only took measures on the face embedding, not on the competencies. Thus, the competencies act as a soft proxy for the ethnicity group.
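The KL divergence used above to compare score distributions can be computed from histogram counts as follows (a generic sketch; the bin counts here are made up, not the FairCVtest distributions):

```python
import numpy as np

def kl_divergence(p_counts, q_counts, eps=1e-12):
    """Discrete KL divergence D(P || Q) between two normalized histograms."""
    p = np.asarray(p_counts, dtype=float)
    q = np.asarray(q_counts, dtype=float)
    p, q = p / p.sum(), q / q.sum()          # normalize counts to probabilities
    p, q = np.clip(p, eps, None), np.clip(q, eps, None)  # avoid log(0)
    return float(np.sum(p * np.log(p / q)))

# Hypothetical hiring-score histograms for two demographic groups:
group_a = [5, 20, 40, 25, 10]
group_b = [8, 25, 35, 22, 10]
print(f"KL divergence: {kl_divergence(group_a, group_b):.3f}")
```

A value of 0 means the two groups' score distributions are identical; larger values indicate a bigger distributional gap, which is how the 0.061 vs 0.178 comparison above should be read.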
Note that our agnostic approach does not seek to make the system capable of detecting whether a score is unfair, nor to compensate for such bias, but rather to blind the system to sensitive attributes with the aim of preventing the model from establishing a correlation between the demographic groups and the score biases. This fact can be corroborated with the training loss, which has a higher value in the Agnostic Scenario (0.035 for gender, 0.044 for ethnicity) than in the Biased Scenario (0.49 for gender, 0.64 for ethnicity). By removing sensitive information from the input, the model is not able to learn what motivates the difference in scores between individuals with similar competencies, as it is blind to the demographic group, and therefore its output does not correctly approximate the biased target function after training.
                Gender                    |  Ethnicity
Scenario    Male     Female   p%          |  Group 1  Group 2  Group 3   p1%     p2%     p3%
Neutral     51.90%   48.10%   92.68%      |  34.20%   35.00%   30.80%    97.71%  90.06%  88.00%
Biased      72.90%   27.10%   37.17%      |  50.80%   30.40%   18.80%    59.84%  37.01%  61.84%
Agnostic    52.80%   47.20%   89.39%      |  36.70%   32.70%   30.60%    89.10%  83.38%  93.58%
Table 3 Distribution of the top 1000 candidates in each Scenario of FairCVtest, by gender and ethnicity group. We include the p% score (see Eqn. 5) as a measure of the difference between groups. In the ethnicity case, p1% compares G1 vs G2, p2% compares G1 vs G3, and p3% compares G2 vs G3.
biases introduced in the target function. However, as can be seen in the Agnostic Scenario, by removing gender and ethnicity information from the input we can prevent the model from reproducing those biases, as it cannot see which factor determines the score penalty for some individuals.
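The p% score reported in Table 3 is consistent with the standard adverse-impact ratio, i.e. the smaller selection rate divided by the larger one (we do not reproduce Eqn. 5 here; the sketch below simply matches the gender column of Table 3):

```python
def p_percent(rate_a, rate_b):
    """Adverse-impact p% score: ratio of the smaller to the larger
    selection rate, as a percentage (the '80% rule' compares it to 80)."""
    lo, hi = sorted((rate_a, rate_b))
    return 100.0 * lo / hi

# Gender shares of the top-1000 candidates, taken from Table 3:
print(round(p_percent(51.90, 48.10), 2))  # Neutral  → 92.68
print(round(p_percent(72.90, 27.10), 2))  # Biased   → 37.17
print(round(p_percent(52.80, 47.20), 2))  # Agnostic → 89.39
```

Under the 80% rule, the Biased Scenario (37.17%) would clearly indicate adverse impact, while the Neutral and Agnostic Scenarios stay above the threshold.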
Since the key to our Agnostic Scenario is the removal of sensitive information, in this section we analyze the demographic information extracted by the hiring tool in each scenario. To this aim, we use the multimodal feature embeddings extracted by the recruitment tool to train and evaluate the performance of both gender and ethnicity classifiers. We obtain these embeddings as the output of the first dense layer of our learning architecture (see Section 4.3), in which the information from the different data domains has already been fused. For each scenario, we train 3 different classification algorithms, namely Support Vector Machines (SVM), Random Forests (RF), and Neural Networks (NN).
Table 4 presents the accuracies obtained by each classification algorithm in the 3 scenarios of FairCVtest. The results show a different behavior between scenarios and demographic traits. As expected, the setup in which the most sensitive information can be extracted (gender and ethnicity in this work) is the Biased one for both attributes. The SVM classifier obtains the highest validation accuracies, with almost 90% in the gender case and 76.40% in the ethnicity one. Note that none of these values reach state-of-the-art performance (i.e. neither the ResNet-50 model nor the hiring tools were explicitly trained to classify those attributes), but both of them warn of large amounts of sensitive information within the embeddings. On the other hand, both the Neutral and Agnostic scenarios show lower accuracies than the Biased configuration. However, we can see a gap in performance between them, with all the classifiers showing higher accuracy in the Neutral Scenario. This fact demonstrates that, despite training with the unbiased scores T^U, which have no relationship with any demographic group membership, the embeddings extracted in the Neutral Scenario contain some sensitive information. By using the gender-blinded bios and the face embeddings from which demographic information has been removed, we reduced the amount of latent sensitive information within the agnostic embeddings. This reduction leads to almost random-choice accuracies in the gender case (i.e. in a binary task, the random-choice classifier’s accuracy is 50%), but in the ethnicity case the classifiers fall far from this limit (i.e. 33%, corresponding to 3 ethnic groups), since there is still some information related to that sensitive attribute in the candidate competencies.
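This probing methodology, training simple classifiers on the embeddings and reading their accuracy as a measure of sensitive-information leakage, can be sketched with synthetic stand-ins. The embeddings below are random vectors with or without an injected gender direction (not the FairCVtest embeddings), and a least-squares linear probe stands in for the SVM/RF/NN classifiers:

```python
import numpy as np

rng = np.random.default_rng(7)

n, d = 1000, 32
gender = rng.integers(0, 2, n)
base = rng.normal(size=(n, d))

# "Biased" embeddings carry a gender direction; "agnostic" ones do not.
biased_emb = base + np.outer(gender - 0.5, np.ones(d)) * 1.5
agnostic_emb = base

def probe_accuracy(emb, labels):
    """Least-squares linear probe on the embeddings; returns training accuracy."""
    w, *_ = np.linalg.lstsq(emb, labels - 0.5, rcond=None)
    return float(np.mean((emb @ w > 0) == (labels == 1)))

print(probe_accuracy(biased_emb, gender))    # high: gender is recoverable
print(probe_accuracy(agnostic_emb, gender))  # near chance (0.5)
```

High probe accuracy signals that the sensitive attribute is linearly recoverable from the representation, which is exactly what the Biased-vs-Agnostic comparison in Table 4 measures.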
6 Conclusions
The development of Human-Centric Artificial Intelligence applications will be
critical to ensure the correct deployment of AI technologies in our society. In
this paper we have reviewed the recent advances in this field, with particular attention to the available databases proposed by the research community. We have
also presented FairCVtest, a new experimental framework (publicly available9 )
on AI-based automated recruitment to study how multimodal machine learning
is affected by biases present in the training data. Using FairCVtest, we have
studied the capacity of common deep learning algorithms to expose and exploit
sensitive information from commonly used structured and unstructured data.
The contributed experimental framework includes FairCVdb, a large set of 24,000 synthetic profiles with information typically found in job applicants’ resumes, spanning different data domains (e.g. face images, text data, and structured data). These profiles were consciously scored with gender and ethnicity biases,
which resulted in gender and ethnicity discrimination in the learned models
targeted to generate candidate scores for hiring purposes. In this scenario, the
system was able to extract demographic information from the input data, and
learn its relation with the biases introduced in the scores. This behavior is not
limited to the case studied, where the bias lies in the target function. Fea-
ture selection or unbalanced data can also become sources of biases. This last
case is common when datasets are collected from historical sources that fail to
represent the diversity of our society.
We discussed recent methods to prevent undesired effects of algorithmic
biases, as well as the most widely used databases in the bias and fairness
research in AI. We then experimented with one of these methods, known as
SensitiveNets, to improve fairness in this AI-based recruitment framework.
Our agnostic setup removes sensitive information from text data at the input level, and applies SensitiveNets to remove it from the face images during the
learning process. After the demographic “blinding” process, the recruitment
system did not show discriminatory treatment even in the presence of biases
in training data, thus improving equity among different demographic groups.
The most common approach to analyze algorithmic discrimination is
through group-based bias [14]. However, recent works are now starting to inves-
tigate biased effects in AI with user-specific methods, e.g. [75][101]. We plan
9 https://github.com/BiDAlab/FairCVtest
References
[1] Barocas, S., Selbst, A.D.: Big data’s disparate impact. California Law
Review (2016)
[2] Acien, A., Morales, A., Vera-Rodriguez, R., Bartolome, I., Fierrez, J.:
Measuring the Gender and Ethnicity Bias in Deep Models for Face
Recognition. In: Proceedings of Iberoamerican Congress on Pattern
Recognition (IbPRIA), Madrid, Spain (2018)
[3] Drozdowski, P., Rathgeb, C., Dantcheva, A., Damer, N., Busch, C.:
Demographic bias in biometrics: A survey on an emerging challenge.
IEEE Transactions on Technology and Society 1, 89–103 (2020)
[4] Nagpal, S., Singh, M., Singh, R., Vatsa, M., Ratha, N.K.: Deep learning for face recognition: Pride or prejudiced? arXiv:1904.01219 (2019)
[5] Zhao, J., Wang, T., Yatskar, M., Ordonez, V., Chang, K.: Men also
like shopping: Reducing gender bias amplification using corpus-level con-
straints. In: Proceedings of Conference on Empirical Methods in Natural
Language Processing, pp. 2979–2989 (2017)
[8] Ali, M., Sapiezynski, P., Bogen, M., Korolova, A., Mislove, A., Rieke,
A.: Discrimination through optimization: How Facebook’s ad delivery
can lead to skewed outcomes. In: Proceedings of the ACM Conference
on Human-Computer Interaction (2019)
[9] Angwin, J., Larson, J., Mattu, S., Kirchner, L.: Machine bias. ProPublica
(2016)
[10] Evans, M., Mathews, A.W.: New York regulator probes United Health
algorithm for racial bias. The Wall Street Journal (2019)
[11] Knight, W.: The Apple Card didn’t ’see’ gender — and that’s the
problem. Wired (2019)
[12] Buolamwini, J., Gebru, T.: Gender shades: Intersectional accuracy dis-
parities in commercial gender classification. In: Proceedings of the ACM
Conference on Fairness, Accountability, and Transparency, NY, USA
(2018)
[13] Wang, M., Deng, W.: Mitigating bias in face recognition using skewness-
aware reinforcement learning. In: IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), pp. 9322–9331 (2020)
[14] Serna, I., Morales, A., Fierrez, J., Cebrian, M., Obradovich, N., Rah-
wan, I.: Algorithmic discrimination: Formulation and exploration in deep
learning-based face biometrics. In: Proceedings of the AAAI Workshop
on SafeAI (2020)
[15] Balakrishnan, G., Xiong, Y., Xia, W., Perona, P.: Towards causal bench-
marking of bias in face analysis algorithms. In: European Conference on
Computer Vision (ECCV), pp. 547–563 (2020)
[16] Bogen, M., Rieke, A.: Help wanted: Examination of hiring algorithms,
equity, and bias. Technical report (2018). https://www.upturn.org
[17] Black, J.S., van Esch, P.: AI-enabled recruiting: What is it and how should a manager use it? Business Horizons 63(2), 215–226 (2020)
[18] Dastin, J.: Amazon scraps secret AI recruiting tool that showed bias
against women. Reuters (2018)
[19] Bertrand, M., Mullainathan, S.: Are Emily and Greg more employ-
able than Lakisha and Jamal? A field experiment on labor market
discrimination. American Economic Review 94, 991–1013 (2004)
[20] Raghavan, M., Barocas, S., Kleinberg, J., Levy, K.: Mitigating bias in
algorithmic hiring: Evaluating claims and practices. In: Conference on
Fairness, Accountability, and Transparency, pp. 469–481 (2020)
[21] Schumann, C., Foster, J.S., Mattei, N., Dickerson, J.P.: We need fair-
ness and explainability in algorithmic hiring. In: Proceedings of the
19th International Conference on Autonomous Agents and MultiAgent
Systems, pp. 1716–1720 (2020)
[22] Sánchez-Monedero, J., Dencik, L., Edwards, L.: What does it mean to
’solve’ the problem of discrimination in hiring? Social, technical and legal
perspectives from the UK on automated hiring systems. In: Conference
on Fairness, Accountability, and Transparency, pp. 458–468 (2020)
[24] Cheng, L., Varshney, K.R., Liu, H.: Socially responsible AI algorithms: Issues, purposes, and challenges. Journal of Artificial Intelligence Research 71, 1137–1181 (2021)
[25] Baltrušaitis, T., Ahuja, C., Morency, L.: Multimodal machine learning:
A survey and taxonomy. IEEE Transactions on Pattern Analysis and
Machine Intelligence 41, 423–443 (2019)
[26] Peña, A., Serna, I., Morales, A., Fierrez, J.: Bias in multimodal AI:
Testbed for fair automatic recruitment. In: IEEE CVPR Workshop on
Fair, Data Efficient and Trusted Computer Vision (2020)
[27] Richards, C., Bouman, W.P., Seal, L., Barker, M.J., Nieder, T.O.,
T’Sjoen, G.: Non-binary or genderqueer genders. International Review
of Psychiatry 28(1), 95–102 (2016)
[29] Larrazabal, A.J., Nieto, N., Peterson, V., Milone, D.H., Ferrante, E.: Gender imbalance in medical imaging datasets produces biased classifiers for computer-aided diagnosis. Proceedings of the National Academy of Sciences 117(23), 12592–12594 (2020)
[30] Speicher, T., Ali, M., Venkatadri, G., Ribeiro, F.N., Arvanitakis, G.,
Benevenuto, F., Gummadi, K.P., Loiseau, P., Mislove, A.: Potential for
discrimination in online targeted advertising. In: Conference on Fairness,
Accountability and Transparency, pp. 5–19 (2018)
[31] De-Arteaga, M., Romanov, A., Wallach, H., Chayes, J., Borgs, C., et al.:
Bias in bios: A case study of semantic representation bias in a high-stakes
setting. In: Conference on Fairness, Accountability, and Transparency,
pp. 120–128 (2019)
[32] Bolukbasi, T., Chang, K., Zou, J.Y., Saligrama, V., Kalai, A.T.: Man is
to computer programmer as woman is to homemaker? Debiasing word
embeddings. Advances in Neural Information Processing Systems (2016)
[33] Bengio, Y., Courville, A., Vincent, P.: Representation learning: A review
and new perspectives. IEEE Transactions on Pattern Analysis and
Machine Intelligence 35(8), 1798–1828 (2013)
[34] Bau, D., Zhu, J., Strobelt, H., Lapedriza, A., Zhou, B., Torralba, A.:
Understanding the role of individual units in a deep neural network.
Proceedings of the National Academy of Sciences 117(48), 1–8 (2020)
[35] Yosinski, J., Clune, J., Nguyen, A., Fuchs, T., Lipson, H.: Understanding
neural networks through deep visualization. In: International Conference
on Machine Learning (ICML) Deep Learning Workshop, Lille, France
(2015)
[36] Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F.A.,
Brendel, W.: ImageNet-trained CNNs are biased towards texture;
increasing shape bias improves accuracy and robustness. In: Interna-
tional Conference on Learning Representations (ICLR), New Orleans,
Louisiana, USA (2019)
[37] Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K., Samek,
W.: On pixel-wise explanations for non-linear classifier decisions by layer-
wise relevance propagation. PLoS ONE 10(7), 1–46 (2015)
[38] Selvaraju, R., Cogswell, M., et al.: Grad-CAM: Visual explanations from
deep networks via gradient-based localization. In: IEEE International Conference on Computer Vision (ICCV), Venice, Italy, pp. 618–626 (2017). IEEE
[39] Ortega, A., Fierrez, J., Morales, A., Wang, Z., de la Cruz, M., Alonso,
C.L., Ribeiro, T.: Symbolic AI for XAI: Evaluating LFIT inductive programming for explaining biases in machine learning. Computers 10(11), 154
(2021)
[40] Hendricks, L.A., Akata, Z., Rohrbach, M., Donahue, J., Schiele, B.,
Darrell, T.: Generating visual explanations. In: European Conference
on Computer Vision (ECCV), Amsterdam, The Netherlands, pp. 3–19
(2016). Springer
[41] Montavon, G., Samek, W., Müller, K.: Methods for interpreting and
understanding deep neural networks. Digital Signal Processing 73
(2018)
[42] Erhan, D., Bengio, Y., Courville, A., Vincent, P.: Visualizing Higher-
Layer Features of a Deep Network. University of Montreal 1341(3)
(2009)
[43] Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional net-
works: Visualising image classification models and saliency maps. In:
International Conference on Learning Representations (ICLR) Work-
shop, Banff, Canada (2014)
[45] Nguyen, A., Clune, J., Bengio, Y., Dosovitskiy, A., Yosinski, J.: Plug
& Play generative networks: Conditional iterative generation of images
in latent space. In: IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), Honolulu, Hawaii, USA (2017). IEEE
[46] Nguyen, A., Yosinski, J., Clune, J.: Multifaceted feature visualization: Uncovering the different types of features learned by each neuron in deep neural networks. In: International Conference on Machine Learning
(ICML) Deep Learning Workshop, New York, NY, USA (2016)
[47] Nguyen, A., Dosovitskiy, A., Yosinski, J., Brox, T., Clune, J.: Syn-
thesizing the preferred inputs for neurons in neural networks via deep
generator networks. In: Conference on Neural Information Processing
Systems (NIPS), Barcelona, Spain, pp. 3395–3403 (2016)
[49] Zurada, J.M., Malinowski, A., Cloete, I.: Sensitivity analysis for minimization of input data dimension for feedforward neural networks. In:
International Symposium on Circuits and Systems (ISCAS), vol. 6, pp.
447–450 (1994)
[50] Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: European Conference on Computer Vision (ECCV), Zurich,
Switzerland, pp. 818–833 (2014). Springer
[51] Springenberg, J.T., Dosovitskiy, A., Brox, T., Riedmiller, M.: Striving
for simplicity: The all convolutional net. In: International Conference on
Learning Representations (ICLR), San Diego, CA, USA (2015)
[52] Zhang, Q., Cao, R., Shi, F., Wu, Y.N., Zhu, S.: Interpreting CNN
knowledge via an explanatory graph. In: AAAI Conference on Artificial
Intelligence, vol. 32. AAAI Press, New Orleans, Louisiana, USA (2018)
[53] Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., Kim,
B.: Sanity checks for saliency maps. In: Advances in Neural Information
Processing Systems (NIPS), vol. 31, pp. 9525–9536. Curran Associates
Inc., Montréal, Canada (2018)
[54] Szegedy, C., Zaremba, W., Sutskever, I., Estrach, J.B., Erhan, D.,
Goodfellow, I., Fergus, R.: Intriguing properties of neural networks. In:
International Conference on Learning Representations (ICLR), Banff,
Canada (2014)
[55] Koh, P.W., Liang, P.: Understanding black-box predictions via influence
functions. In: International Conference on Machine Learning (ICML),
vol. 70, pp. 1885–1894. PMLR, Sydney, Australia (2017)
[56] Nguyen, A., Yosinski, J., Clune, J.: Deep neural networks are easily
fooled: High confidence predictions for unrecognizable images. In: IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), pp.
427–436 (2015). IEEE
[57] Su, J., Vargas, D.V., Sakurai, K.: One pixel attack for fooling deep neural
networks. IEEE Transactions on Evolutionary Computation 23(5), 828–841
(2019)
[58] Quadrianto, N., Sharmanska, V., Thomas, O.: Discovering fair represen-
tations in the data domain. In: IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), pp. 8227–8236 (2019)
[60] Odena, A., Olah, C., Shlens, J.: Conditional image synthesis with auxil-
iary classifier GANs. In: International Conference on Machine Learning
(ICML), Sydney, Australia, pp. 2642–2651 (2017)
[61] Calmon, F.P., Wei, D., Vinzamuri, B., Ramamurthy, K.N., Varshney,
K.R.: Optimized pre-processing for discrimination prevention. In: Pro-
ceedings of the 31st International Conference on Neural Information
Processing Systems, pp. 3995–4004 (2017)
[62] Ramaswamy, V.V., Kim, S.S., Russakovsky, O.: Fair attribute classifica-
tion through latent space de-biasing. In: IEEE Conference on Computer
Vision and Pattern Recognition, pp. 9301–9310 (2021)
[63] Jia, S., Lansdall-Welfare, T., Cristianini, N.: Right for the right reason:
Training agnostic networks. In: Advances in Intelligent Data Analysis
XVII, pp. 164–174 (2018)
[64] Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Lavio-
lette, F., Marchand, M., Lempitsky, V.: Domain-adversarial training of
neural networks. Journal of Machine Learning Research 17, 2096–2030
(2016)
[65] Wang, M., Deng, W., Hu, J., Tao, X., Huang, Y.: Racial faces in the wild:
Reducing racial bias by information maximization adaptation network.
In: IEEE International Conference on Computer Vision (ICCV), pp. 692–
702 (2019)
[66] Romanov, A., De-Arteaga, M., Wallach, H., Chayes, J., Borgs, C., et al.:
What’s in a name? Reducing bias in bios without access to protected
attributes. In: Proceedings of the 2019 Conference of the North Amer-
ican Chapter of the Association for Computational Linguistics: Human
Language Technologies, pp. 4187–4195 (2019)
[67] Deng, J., Guo, J., Xue, N., Zafeiriou, S.: ArcFace: Additive angular mar-
gin loss for deep face recognition. In: IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), pp. 4690–4699 (2019)
[68] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., et al.: Generative
adversarial nets. Advances in Neural Information Processing Systems 27,
2672–2680 (2014)
[69] Alvi, M., Zisserman, A., Nellaker, C.: Turning a blind eye: Explicit
removal of biases and variation from deep neural network embeddings.
In: European Conference on Computer Vision (ECCV) (2018)
[70] Kim, B., Kim, H., Kim, K., Kim, S., Kim, J.: Learning not to learn: Training deep neural networks with biased data. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
[71] Morales, A., Fierrez, J., Vera-Rodriguez, R., Tolosana, R.: SensitiveNets:
Learning agnostic representations with application to face recognition.
IEEE Transactions on Pattern Analysis and Machine Intelligence 43(6),
2158–2164 (2021)
[72] Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: A unified embedding
for face recognition and clustering. In: IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), pp. 815–823 (2015)
[74] Pedreshi, D., Ruggieri, S., Turini, F.: Discrimination-aware data mining.
In: Proceedings of the 14th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, pp. 560–568 (2008)
[75] Zhang, Y., Bellamy, R., Varshney, K.R.: Joint optimization of AI fair-
ness and utility: A human-centered approach. In: Proceedings of the
AAAI/ACM Conference on AI, Ethics, and Society, pp. 400–406 (2020)
[76] Yang, K., Stoyanovich, J.: Measuring fairness in ranked outputs. In: Pro-
ceedings of the 29th International Conference on Scientific and Statistical
Database Management, pp. 1–6 (2017)
[77] Celis, L.E., Straszak, D., Vishnoi, N.K.: Ranking with fairness con-
straints. arXiv/1704.06840 (2017)
[78] Zehlike, M., Bonchi, F., Castillo, C., Hajian, S., Megahed, M., Baeza-
Yates, R.: FA*IR: A fair top-k ranking algorithm. In: Proceedings of the
2017 ACM on Conference on Information and Knowledge Management,
pp. 1569–1578 (2017)
[79] Dua, D., Graff, C.: UCI Machine Learning Repository (2017). http://
archive.ics.uci.edu/ml
[80] Moro, S., Cortez, P., Rita, P.: A data-driven approach to predict the
success of bank telemarketing. Decision Support Systems 62, 22–31
(2014)
[81] Zhao, J., Wang, T., Yatskar, M., Ordonez, V., Chang, K.: Gender bias in
coreference resolution: Evaluation and debiasing methods. In: Conference
of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, vol. 2 (2018)
[82] Liu, Z., Luo, P., Wang, X., Tang, X.: Deep learning face attributes in the
wild. In: International Conference on Computer Vision (ICCV) (2015)
[83] Rothe, R., Timofte, R., Van Gool, L.: Dex: Deep expectation of apparent
age from a single image. In: IEEE International Conference on Computer Vision Workshops (ICCVW), pp. 10–15 (2015)
[84] Ricanek Jr., K., Tesafaye, T.: Morph: A longitudinal image database of
normal adult age-progression. In: International Conference on Automatic
Face and Gesture Recognition, pp. 341–345 (2006)
[85] Karkkainen, K., Joo, J.: FairFace: Face attribute dataset for balanced
race, gender, and age for bias measurement and mitigation. In: IEEE
Winter Conference on Applications of Computer Vision, pp. 1548–1558
(2021)
[86] Merler, M., Ratha, N., Feris, R.S., Smith, J.R.: Diversity in faces.
arXiv/1901.10436 (2019)
[87] Robinson, J.P., Livitz, G., Henon, Y., Qin, C., Fu, Y., Timoner, S.: Face
recognition: Too bias, or not too bias? In: IEEE Conference on Computer
Vision and Pattern Recognition Workshops (CVPRW) (2020)
[88] Hupont, I., Fernández, C.: DemogPairs: Quantifying the impact of demo-
graphic imbalance in deep face recognition. In: IEEE International
Conference on Automatic Face & Gesture Recognition (2019)
[89] Torralba, A., Efros, A.A.: Unbiased look at dataset bias. In: IEEE
Conference on Computer Vision and Pattern Recognition (CVPR)
(2011)
[90] Serna, I., Morales, A., Fierrez, J., Cebrian, M., Obradovich, N., Rahwan,
I.: SensitiveLoss: Improving accuracy and fairness of face representa-
tions with discrimination-aware deep learning. Artificial Intelligence 305
(2022)
[91] Garg, N., Schiebinger, L., Jurafsky, D., Zou, J.: Word embeddings quan-
tify 100 years of gender and ethnic stereotypes. Proceedings of the
National Academy of Sciences 115, 3635–3644 (2018)
[92] Guo, Y., Zhang, L., Hu, Y., He, X., Gao, J.: MS-Celeb-1M: A dataset
and benchmark for large-scale face recognition. In: European Conference
on Computer Vision (ECCV) (2016)
[93] Bendick Jr., M., Jackson, C.W., Romero, J.H.: Employment discrimina-
tion against older workers: An experimental study of hiring practices.
Journal of Aging & Social Policy 8, 25–46 (1997)
[94] Cowgill, B.: Bias and productivity in humans and algorithms: The-
ory and evidence from resume screening. Columbia Business School 29
(2018)
[95] Fierrez, J., Morales, A., Vera-Rodriguez, R., Camacho, D.: Multiple
classifiers in biometrics. Part 1: Fundamentals and review. Information
Fusion 44, 57–64 (2018)
[97] Ranjan, R., Sankaranarayanan, S., Bansal, A., Bodla, N., Chen, J., Patel,
V.M., Castillo, C.D., Chellappa, R.: Deep learning for understanding
faces: Machines may be just as good, or better, than humans. IEEE
Signal Processing Magazine 35, 66–83 (2018)
[98] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image
recognition. In: IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pp. 770–778 (2016)
[99] Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C., Joulin, A.: Advances
in pre-training distributed word representations. In: Proceedings of the
International Conference on Language Resources and Evaluation (LREC
2018) (2018)
[100] Biddle, D.: Adverse Impact and Test Validation: A Practitioner’s Guide
to Valid and Defensible Employment Testing. Routledge, London (2017)
[101] Bakker, M., Valdes, H.R., Tu, D.P., Gummadi, K.P., Varshney, K.R.,
et al.: Fair enough: Improving fairness in budget-constrained decision
making using confidence thresholds. In: AAAI Workshop on Artificial
Intelligence Safety, New York, NY, USA, pp. 41–53 (2020)
[102] Acien, A., Morales, A., Vera-Rodriguez, R., Fierrez, J., Delgado, O.:
Smartphone sensors for modeling human-computer interaction: Gen-
eral outlook and research datasets for user authentication. In: IEEE
Conference on Computers, Software, and Applications (COMPSAC)
(2020)
[103] Acien, A., Morales, A., Fierrez, J., Vera-Rodriguez, R., Bartolome,
I.: BeCAPTCHA: Detecting human behavior in smartphone interac-
tion using multiple inbuilt sensors. In: AAAI Workshop on Artificial
Intelligence for Cyber Security (AICS) (2020)
[104] Hernandez-Ortega, J., Daza, R., Morales, A., Fierrez, J., Ortega-Garcia,
J.: edBB: Biometrics and behavior for assessing remote education. In:
AAAI Workshop on Artificial Intelligence for Education (AI4EDU)
(2020)
[105] Serna, I., DeAlcala, D., Morales, A., Fierrez, J., Ortega-Garcia, J.:
IFBiD: Inference-free bias detection. In: AAAI Workshop on Artificial
Intelligence Safety (SafeAI). CEUR, vol. 3087 (2022)
[106] Gomez-Barrero, M., Maiorana, E., Galbally, J., Campisi, P., Fierrez, J.:
Multi-biometric template protection based on homomorphic encryption.
Pattern Recognition 67, 149–163 (2017)
[107] Hassanpour, A., Moradikia, M., Yang, B., Abdelhadi, A., Busch, C.,
Fierrez, J.: Differential privacy preservation in robust continual learning.
IEEE Access 10, 24273–2428 (2022)