
Statistical theory of algorithmic fairness

M2 Stat. et ML // Université Paris-Saclay

Evgenii Chzhen
evgenii.chzhen@cnrs.fr
Contents

1 Introduction 5
1.1 Examples of unwanted behaviour . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Supervised learning with sensitive attributes . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Definitions of fairness: binary classification . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 Definitions of fairness: general case of supervised learning . . . . . . . . . . . . . . . 9
1.5 Risk based fairness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.6 Main approaches and course overview . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.6.1 Post-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.6.2 In-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.6.3 Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.6.4 What is this course about? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.7 Bibliographic references . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.8 Other courses on algorithmic fairness . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2 Post-processing: deriving the form of optimal fair classifier 17


2.1 Optimal classifier under Demographic Parity constraint . . . . . . . . . . . . . . . . 17
2.2 Post-processing algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3 Bibliographic references . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

3 In-processing: fair ERM and reduction to iterative cost-sensitive classification 33


3.1 Linear classifiers and unawareness . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2 Fair empirical risk minimization: a direct approach . . . . . . . . . . . . . . . . . . . 35
3.3 Reduction to cost-sensitive classification: passing to randomized classifiers . . . . . . 38
3.3.1 Approximate saddle point yields guarantees . . . . . . . . . . . . . . . . . . . 39
3.4 Finding approximate saddle point . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.5 Bibliographic references . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

4 Regression under demographic parity constraint 47


4.1 Deriving the form of optimal prediction under the independence constraint . . . . . 47
4.2 Post-processing estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.3 Risk guarantee . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.4 Bibliographic references . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

Notation
For K ∈ N we denote by [K] the set of the first K positive integers. Given two spaces X and Y,
we denote by Y^X the set of all functions from X to Y. If X and Y are measurable spaces, we additionally
(and silently) assume that Y^X denotes all measurable functions from X to Y. Most of the
time X, Y will be topological spaces equipped with their Borel σ-algebras. From now on, all
measurability issues are silently ignored. The sign (· ⊥⊥ ·) is used to denote independence of random
variables (or elements) and (· ⊥⊥ ·) | · denotes conditional independence. In what follows, P and E are
used to denote the distribution of (X, S, Y) (the definition of this triplet will be provided later) and
the expectation with respect to this distribution, while P, E stand for the generic probability and
expectation (all the randomness is averaged).

Chapter 1

Introduction

1.1 Examples of unwanted behaviour


We begin this lecture by providing a few well-known real-world examples of unwanted behaviour
from algorithmic systems—which we shall vaguely define as an algorithm or a pool of algorithms,
possibly interacting with humans. We refer the reader to Barocas et al. (2019) for an extended
presentation of such examples.
Perhaps the most famous example comes from COMPAS1 , a risk-assessment software developed
and owned by Northpointe, a technology and management consulting firm. The software is used in
U.S. courts to predict recidivism risks of defendants. Its predictions are presented to judges during
criminal sentencing as additional insights. A study led by Propublica, an American newsroom
specialized in investigative journalism, showed that COMPAS has a low overall accuracy and that,
even though race is not used as an input, COMPAS scores yield a high discrepancy in False
Positive and False Negative rates across ethnicity groups: “Overall, Northpointe’s assessment tool
correctly predicts recidivism 61 percent of the time. But blacks are almost twice as likely as whites
to be labeled a higher risk but not actually re-offend. It makes the opposite mistake among whites:
They are much more likely than blacks to be labeled lower risk but go on to commit other crimes.”
Angwin et al. (2016). Court decisions have a huge impact on people’s lives as they can lead to
severe restrictions of freedom. It is therefore essential to ensure that the information provided by
the algorithm contradicts neither moral nor legal grounds while being useful. Another issue
regarding COMPAS is its opacity: being proprietary software, its code is protected by trade secrets
and thus cannot be examined directly by the public.
Another example is given to us by StreetBump, a smartphone app developed by the City of
Boston which uses GPS and accelerometer data to passively detect potholes while its users are
driving in the city. Kate Crawford was the first to point out the limitations of such an app:
“StreetBump has a signal problem. People in lower income groups in the US are less likely to have
smartphones, and this is particularly true of older residents, where smartphone penetration can be
as low as 16%. For cities like Boston, this means that smartphone data sets are missing inputs from
significant parts of the population — often those who have the fewest resources.”2
Finally, let us briefly mention a modern redlining3 example given in Barocas et al. (2019):
“Amazon uses a data-driven system to determine the neighbourhoods in which to offer free same-day
delivery. A 2016 study found stark disparities in the demographic make-up of these neighbourhoods:
1 COMPAS stands for Correctional Offender Management Profiling for Alternative Sanctions.
2 See Kate Crawford’s article The Hidden Biases in Big Data in Harvard Business Review.
3 “A discriminatory practice in which services (financial and otherwise) were withheld from potential customers
who resided in neighborhoods regarded as collectively financial risky; these residents largely belonged to racial and
ethnic minorities” (Wikipedia).


in many U.S. cities, white residents were more than twice as likely as black residents to live in one
of the qualifying neighbourhoods.”

1.2 Supervised learning with sensitive attributes


Supervised learning: a reminder. Here we give a quick reminder on the statistical supervised
learning vocabulary. Normally, all the content of this paragraph should be familiar to you, especially
if you followed a basic course in statistical learning.
In the standard supervised learning paradigm, we have access to data (Z1, Y1), . . . , (Zn, Yn) ∈
Z × Y. In the statistical learning framework, it is commonly assumed that these data are i.i.d.
realizations of some random variables (Z, Y ) ∼ P. The core task of supervised learning is to extract
information from these data to make useful predictions in the future. The next definition formalizes
the notion of algorithm, a mapping to obtain a prediction function from data.
Definition 1.2.1 (Algorithm). An algorithm is a mapping

fˆ : ∪_{n=1}^{∞} (Z × Y)^n → Y^Z ,

where Y^Z denotes all functions from Z to Y.4
Note that, given the definition of an algorithm, we need to write

fˆ((Z1, Y1), . . . , (Zn, Yn))(x)

to denote the prediction of fˆ at x, trained on (Z1, Y1), . . . , (Zn, Yn). However, as is often done, we
will only write fˆ(x), dropping the explicit dependency on the data. As a rule of thumb, whenever the
“hat” notation is used, the object under the “hat” denotes a data-driven algorithm. In what follows
and in the whole lecture notes, E and P are used for (Z, Y), while E and P are the generic expectation
and probability, which integrate out all the randomness. Just to get used to this notation, it makes
sense to understand that when we write E[fˆ(Z)], we actually mean E[fˆ(Z) | (Z1, Y1), . . . , (Zn, Yn)].
The quality of a prediction function f is measured by its risk—expected loss:

R(f) := E[ℓ(Y, f(Z))] ,

where ℓ : Y × Y → R+ is a loss function. The Bayes optimal prediction f* is defined as

f* ∈ arg min_{f : Z→Y} R(f) .

Finally, our goal can be formalized as: find an algorithm fˆ with small excess risk

E(fˆ) := R(fˆ) − R(f*) ;

since fˆ depends on the data, it is random, and hence the above should be understood in
expectation or with high probability.
Example 1.2.2. Two classical examples: classification with the 0/1-loss and regression with the
squared loss.
• Classification. Loss: ℓ(y, y′) = I(y ≠ y′); Bayes optimal f*(z) = I(η(z) ≥ 1/2), with
η(z) = E[Y | Z = z]. Excess risk:

E(fˆ) = E[ |2η(Z) − 1| I(fˆ(Z) ≠ f*(Z)) ] .

4 In these lecture notes we will avoid discussions on measurability.


• Regression. Loss: ℓ(y, y′) = (y − y′)²; Bayes optimal f*(z) = η(z), with η(z) = E[Y | Z = z].
Excess risk:

E(fˆ) = E[ (fˆ(Z) − f*(Z))² ] .
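To make these formulas concrete, here is a minimal Monte Carlo sketch in Python; the regression function η and its deliberately mis-specified estimate η̂ are arbitrary choices made only for this illustration, not objects from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

def eta(z):
    # synthetic regression function eta(z) = P(Y = 1 | Z = z), an arbitrary choice
    return 1.0 / (1.0 + np.exp(-3.0 * z))

def eta_hat(z):
    # a deliberately mis-specified "estimate" of eta, shifted to the right
    return 1.0 / (1.0 + np.exp(-3.0 * (z - 0.3)))

n = 200_000
Z = rng.normal(size=n)                    # Z ~ N(0, 1)
Y = rng.binomial(1, eta(Z))               # Y | Z ~ Bernoulli(eta(Z))

f_star = (eta(Z) >= 0.5).astype(int)      # Bayes classifier I(eta(z) >= 1/2)
f_hat = (eta_hat(Z) >= 0.5).astype(int)   # plug-in classifier based on eta_hat

# Two Monte Carlo estimates of the excess risk E(f_hat); they agree up to sampling error.
print(np.mean(f_hat != Y) - np.mean(f_star != Y))
print(np.mean(np.abs(2 * eta(Z) - 1) * (f_hat != f_star)))
```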

Supervised learning: adding sensitive information and fairness constraints. To formalize


the idea of fairness and non-discrimination, we will assume that the features Z can be split into
a couple (X, S). We should think of X as the non-sensitive information and S as the sensitive
information. Thus, in our model, we have a triplet (X, S, Y ) ∈ X × S × Y following some unknown
distribution P. We would like to point out that, in the considered framework, the choice of the
sensitive attributes belongs to the decision-maker, potentially incentivized by legal or ethical motives.
It is not the statistician’s task to determine which feature should be regarded as sensitive. Given
this decoupling between the features and their meaning, we need to explicitly specify the type of permitted
prediction functions—those that are allowed to use S and those that are not.
Definition 1.2.3 (Fairness through awareness). Fairness through awareness is a framework in
which the prediction functions are of the form

f : X ×S →Y .

For discrete sets S, the awareness framework allows one to build a separate prediction function for
each value of the sensitive attribute. For this reason it might be considered illegal (see e.g., Loi n°
2008-496 du 27 mai 2008 for French regulations) as it introduces direct discrimination. To avoid this
issue and formally comply with the enforced laws, we often need to consider the fairness through
unawareness framework.
Definition 1.2.4 (Fairness through unawareness). Fairness through unawareness is a framework
in which the prediction functions are of the form

f :X →Y .

One might be tempted to say that by considering f : X → Y, we eliminate all possible


discrimination, since the sensitive attribute is not used explicitly. However, note that the triplet
(X, S, Y ) follows some joint distribution P, that is, X and S are not necessarily independent. Thus,
even if we do not explicitly use the sensitive information S for prediction, there are prediction
functions that can give discriminatory predictions for those x which are more likely to possess
certain sensitive characteristics. As a very straightforward example, think of the case of S being
male or female and X, among other descriptive characteristics, containing the first name of the person.
Clearly, in most cultures, the first name is an extremely predictive characteristic of gender.
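To see this effect numerically, here is a minimal simulation sketch (the data-generating process is entirely our own, chosen purely for illustration): a classifier trained without access to S can still produce very different acceptance rates across the two groups, because X carries information about S.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 20_000

# Synthetic data: S in {-1, +1}; X is correlated with S (a "proxy"); Y depends on X.
S = rng.choice([-1, 1], size=n)
X = (S + rng.normal(scale=1.0, size=n)).reshape(-1, 1)   # X leaks information about S
Y = rng.binomial(1, 1 / (1 + np.exp(-2 * X[:, 0])))

# Unawareness: the classifier sees only X, never S.
clf = LogisticRegression().fit(X, Y)
pred = clf.predict(X)

# Acceptance rates per group are far from equal, even though S was never used.
for s in (-1, 1):
    print(f"P(g(X) = 1 | S = {s:+d}) ≈ {pred[S == s].mean():.2f}")
```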
Let us write Z ∈ Z whenever we do not care about the particular choice of the prediction functions.
That is, Z = X with Z = X in the case of unawareness, and Z = (X, S) with Z = X × S in the case of
awareness.

What is S and what is discrimination? In these lecture notes it is always assumed that S is
something fixed in advance. In particular, we are not going to pursue the direction where S needs to
be learned (this problem is still largely open). The question of what is S and what is discrimination
is actually non-trivial. Here, we will draw inspiration from French law (other countries have
their alternatives). According to French law number 2008-496, Article 1, the following attributes
are protected:
“Direct discrimination is the situation in which, on the grounds of their origin, their sex, their family
situation, their pregnancy, their physical appearance, the particular vulnerability resulting from their
economic situation, apparent or known to its author, their surname, their place of residence or their
bank domiciliation, their state of health, their loss of autonomy, their disability, their genetic
characteristics, their morals, their sexual orientation, their gender identity, their age, their political
opinions, their trade-union activities, their ability to express themselves in a language other than
French, or their membership or non-membership, real or supposed, of a given ethnic group, nation,
alleged race or religion, a person is treated less favourably than another is, has been or would have
been treated in a comparable situation.” (translated from the French original)

One of the messages of this long list suggests a fruitful research direction: there are a lot of sensitive
attributes and, hence, we need to develop algorithms which are able to cope with the situation
|S| ≥ max{dim(X ), n}. We will touch upon this issue in the context of online multi-calibration.
One can also remark that, seemingly, the one and only constraint that the above law enforces
is the unawareness framework—the explicit use of the sensitive attribute is prohibited. Importantly,
the aforementioned law has a second part, which outlines the much more intricate notion of
indirect discrimination.

“Indirect discrimination is a provision, a criterion or a practice that is neutral in appearance, but
liable to entail, for one of the grounds mentioned in the first paragraph, a particular disadvantage
for some persons compared with other persons, unless this provision, criterion or practice is
objectively justified by a legitimate aim and the means of achieving this aim are necessary and
appropriate.” (translated from the French original)

Already here, we can see that simply erasing the sensitive attribute from the input is not enough
from the legal perspective: one needs to ensure that no “discrimination indirecte” (indirect
discrimination) happens. The formalization of indirect discrimination is notoriously difficult—it
depends on the problem at hand, on the impact of the eventual decision on the individuals, on the
understanding of the socio-economic environment, and on many more factors, which are not (yet?)
necessarily formalizable within the mathematical language. Nevertheless, given our formalism of
statistical learning, we can propose some seemingly context-independent notions of fairness.

Individual and group fairness. Basically, the mathematical definitions of fairness can be divided
into two groups: individual fairness and group fairness. The former notion reflects the principle
that similar individuals must be treated similarly, which translates into Lipschitz type constraints
on possible (often randomized) prediction rules. The latter defines fairness on population level
via (conditional) statistical independence of a prediction from sensitive attribute S (e.g., gender,
ethnicity). The high-level idea of group fairness notions can be seen as bounding or diminishing an
eventual discrepancy between group-wise distributions of the predicted outcomes. In these lecture
notes we will focus exclusively on group fairness. We provide formal definitions of several group
fairness concepts in the next section.

1.3 Definitions of fairness: binary classification


The group fairness notions are best understood in the case of binary classification, that is, when
Y = {0, 1} and a classifier is a map g : Z → {0, 1}.
Let us think of g(Z) = 1 as meaning that a candidate Z is accepted to the university by the classifier
g, and of Y = 1 as meaning that the candidate Z has successfully graduated from the university. The
first idea is to require equal acceptance rates for all values of the sensitive attribute.

Definition 1.3.1 (Demographic Parity). A classifier g : Z → {0, 1} satisfies the demographic parity
constraint w.r.t. the distribution P if

P(g(Z) = 1 | S = s) = P(g(Z) = 1) , ∀s ∈ S .


The above constraint (or its approximate version) merely introduces diversity among the potential
students of the university. In particular, classifiers satisfying the demographic parity constraint can
act as a compensation for historical injustices that were present in the society. In particular, were
the university in question segregated (one of the groups greatly dominates the other one), deploying
a classifier which minimizes P(g(Z) ≠ Y) would, at best, preserve the segregation. Even worse, it can
amplify it.
Changing the context from university admission to credit scoring (g(Z) = 1 now means that the
bank approves the loan, Y = 1 corresponds to the individual Z paying back the loan, and Z contains
all the relevant (e.g., salary) and irrelevant (e.g., gender) information about the individual’s
financial situation), we note that demographic parity potentially sets some people up for failure. Indeed,
if the paying abilities across groups vary significantly, i.e., the (P(Y = 1 | S = s))_{s∈S} are different, then
by giving loans equally to each sub-population the bank might either lose a lot of money or set one
of the populations up for inevitable default. To this end, it was proposed to incorporate the response
variable Y into the eventual fairness constraint, leading to two related notions.

Definition 1.3.2 (Equal Opportunity). A classifier g : Z → {0, 1} satisfies the equal opportunity
constraint w.r.t. the distribution P if

P(g(Z) = 1 | S = s, Y = 1) = P(g(Z) = 1 | Y = 1) , ∀s ∈ S .

Definition 1.3.3 (Equal Odds). A classifier g : Z → {0, 1} satisfies equal odds constraint w.r.t.
the distribution P if

P(g(Z) = y | S = s, Y = y) = P(g(Z) = y | Y = y) , ∀s ∈ S, y ∈ {0, 1} .

While equal opportunity only requires the equalization of true positive rates across groups, equal odds
demands the equalization of true negative rates across groups on top of it. g(Z) = 1 is often interpreted
as a positive decision (a certain opportunity is given to Z), hence the choice of terminology.
In general, given three random variables (g(Z), S, Y) with g(Z), Y ∈ {0, 1}, we can consider
other statistics beyond true positive and true negative rates. For instance, one can define fairness as
the equality of group-wise positive predictive values, defined as

PPV_s(g) = P(Y = 1 | S = s, g(Z) = 1) .

At this point, the reader should have a better sense of group fairness notions in the case of binary
classification. The next section introduces a broader view of those notions in the supervised learning
setting.
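In practice, these binary-classification notions are checked empirically by comparing group-wise rates. The following sketch (plain numpy; the function and array names are ours, introduced only for illustration) computes the demographic parity, equal opportunity, equalized odds and PPV gaps between two groups from vectors of predictions, labels and sensitive attributes.

```python
import numpy as np

def fairness_gaps(pred, y, s):
    """Absolute gaps between the two groups s = -1 and s = +1.

    pred, y in {0, 1}; s in {-1, +1}; all three are 1-D arrays of the same length.
    """
    def rate(mask):
        # empirical P(g = 1 | mask); nan if the conditioning event is empty
        return pred[mask].mean() if mask.any() else np.nan

    g_neg, g_pos = (s == -1), (s == 1)
    dp = abs(rate(g_neg) - rate(g_pos))                                   # demographic parity gap
    tpr = abs(rate(g_neg & (y == 1)) - rate(g_pos & (y == 1)))            # equal opportunity (TPR) gap
    tnr = abs((1 - rate(g_neg & (y == 0))) - (1 - rate(g_pos & (y == 0))))  # TNR gap
    ppv = abs(y[g_neg & (pred == 1)].mean() - y[g_pos & (pred == 1)].mean())  # PPV gap
    return {"DP": dp, "EOpp": tpr, "EOdds": max(tpr, tnr), "PPV": ppv}
```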

1.4 Definitions of fairness: general case of supervised learning


Before introducing group fairness notions in the general case of supervised learning, let us make a
technical remark. In order to be completely rigorous, since the group fairness notions depend on
the data distribution P, we need to be careful with degenerate distributions P. We will avoid this
benign technicality by always assuming that the support of P is the whole X × S × Y, that is (only
intuitively), all values of (X, S, Y) ∈ X × S × Y are possible. Otherwise, we would need to restrict
ourselves to the support of P (to avoid conditioning on zero-probability events).
The following three fairness criteria underlie most of the modern5 research on group fairness. It
was noted by S. Barocas and M. Hardt in their NeurIPS tutorial talk (or, at least, the authors learned
about it from that talk) that all other group fairness criteria fall within the scope of one of these three.
5 At the time of these lectures.


Definition 1.4.1 (Independence). We say that f : Z → Y satisfies the independence criterion w.r.t.
the distribution P of (X, S, Y) if

f(Z) ⊥⊥ S .

Independence, as a fairness criterion, is extremely natural. Essentially, this criterion corresponds
to the first instinct for formalizing non-discrimination based on the protected attribute—just make
the prediction (as a random variable!) independent from it. Yet, for this criterion, we do not
actually need to know the whole joint distribution P of (X, S, Y); we only need its marginal on the
first two components, P_{(X,S)}. This simple observation generated a number of critiques of this
definition and, eventually, other definitions were proposed—those that actually rely on the whole
distribution P.
Definition 1.4.2 (Separation). We say that f : Z → Y satisfies the separation criterion w.r.t. the
distribution P of (X, S, Y) if

(f(Z) ⊥⊥ S) | Y .

Whereas the independence criterion imposes a complete decorrelation between the prediction
f (Z) and the sensitive attribute S, the separation criterion allows those random variables to be
correlated, to the extent that it is justified by the target variable Y . Equal odds (and, in some way,
equal opportunity) can be seen as a special case of the separation criterion in the binary classification
setting.
Definition 1.4.3 (Sufficiency). We say that f : Z → Y satisfies the sufficiency criterion w.r.t. the
distribution P of (X, S, Y) if

(Y ⊥⊥ S) | f(Z) .

Intuitively, the sufficiency criterion requires that all the information shared between the
sensitive attribute S and the target variable Y is contained in the prediction f(Z). The name is
inspired by the analogous terminology in statistics.

Disparate Learning Process. At first sight it might seem unnatural to combine a group
fairness constraint with the unawareness setup. Indeed, the unawareness setup prohibits the explicit
use of S, while the above fairness constraints rely explicitly on the unknown probabilistic
relation between (X, Y) and S. Thus, without additional modeling effort, it is hopeless to recover
this relation without ever observing the sensitive attribute.
The aforementioned issue motivates the so-called Disparate Learning Process (DLP), where the
sensitive attribute can be used at training time but is never used at inference time. More formally,
recalling the definition of an algorithm, a DLP can be defined as

fˆ : ∪_{n=1}^{∞} (X × S × Y)^n → Y^X .

Though not proven formally (actually, it is not even clear how to formally phrase it), empirical
evidence suggests that any such algorithm tailored to minimize some risk is bound to guess the
sensitive attribute S from an observation X. Such guessing can lead to ridiculous policies that have
very little to do with fairness (see bibliographic references).

Impossibility theorems. One naturally wonders whether we can achieve all of the above-mentioned
fairness constraints simultaneously. Unfortunately, in general it is impossible to satisfy even any two
of them. These impossibility results are summarized below.


Proposition 1.4.4. Assume that f : Z → Y satisfies independence and sufficiency w.r.t. the
distribution P of (X, S, Y); then Y ⊥⊥ S.

Proof. Fix two arbitrary (measurable) sets A_Y ⊂ Y and A_S ⊂ S. Since f is assumed to be such that
(S ⊥⊥ Y) | f(Z) (sufficiency), the tower rule gives

P(S ∈ A_S, Y ∈ A_Y) = E[ P(S ∈ A_S | f(Z)) · P(Y ∈ A_Y | f(Z)) ] .

Furthermore, since f(Z) ⊥⊥ S, we have P(S ∈ A_S | f(Z)) = P(S ∈ A_S) almost surely, and the above
implies that

P(S ∈ A_S, Y ∈ A_Y) = P(S ∈ A_S) · P(Y ∈ A_Y) .

We conclude that, unless the sensitive attribute has no impact on the outcome Y, there is no
prediction function which satisfies independence and sufficiency simultaneously.
Proposition 1.4.5. Let Y = {0, 1}. Assume that f : Z → Y satisfies independence and separation
w.r.t. the distribution P of (X, S, Y); then f(Z) ⊥⊥ Y or Y ⊥⊥ S (or both).

Proof. Take an arbitrary (measurable) set A_f ⊂ Y. On the one hand, since f is assumed to be such
that (f(Z) ⊥⊥ S) | Y (separation) and f(Z) ⊥⊥ S (independence), then

P(f(Z) ∈ A_f) = P(f(Z) ∈ A_f | S) = ∑_{y∈{0,1}} P(f(Z) ∈ A_f | S, Y = y) P(Y = y | S)
              = ∑_{y∈{0,1}} P(f(Z) ∈ A_f | Y = y) P(Y = y | S) ,

almost surely. On the other hand, it holds that

P(f(Z) ∈ A_f) = ∑_{y∈{0,1}} P(f(Z) ∈ A_f | Y = y) P(Y = y) .

Thus, the combination of the two expressions yields

∑_{y∈{0,1}} P(f(Z) ∈ A_f | Y = y) ( P(Y = y | S) − P(Y = y) ) = 0

almost surely. Furthermore, since P(Y = 0) = 1 − P(Y = 1) (and the same holds conditionally on S),
the above can be written as

P(f(Z) ∈ A_f | Y = 1) ( P(Y = 1 | S) − P(Y = 1) ) = P(f(Z) ∈ A_f | Y = 0) ( P(Y = 1 | S) − P(Y = 1) ) .

Thus, P(Y = 1 | S) = P(Y = 1) almost surely, or P(f(Z) ∈ A_f | Y = 1) = P(f(Z) ∈ A_f | Y = 0), or
both.
The next result, given in the context of binary classification, is similar to the previous one. If
we want to enforce sufficiency and separation at the same time, then either f(Z) carries no
information about Y or the distribution is very specific (or both). Note that the restriction to binary
classification is crucial in this proof. Furthermore, there is a counter-example: if Y = {1, 2, 3},
then there exists a distribution P which contradicts the above statement.
Proposition 1.4.6. Let Y = {0, 1} and |S| < ∞. Assume that f : Z → Y satisfies sufficiency and
separation; then either Y ⊥⊥ S, or P(f(Z) = 1 | Y = 1) = 0, or f(Z) ⊥⊥ Y.


Proof. Since f is assumed to satisfy (Y ⊥⊥ S) | f(Z) (sufficiency) and (f(Z) ⊥⊥ S) | Y (separation),
then for all s, s′ ∈ S

P(Y = 1 | S = s, f(Z) = 1) = P(f(Z) = 1 | S = s, Y = 1) P(Y = 1 | S = s) / P(f(Z) = 1 | S = s)
  = P(f(Z) = 1 | Y = 1) P(Y = 1 | S = s) / ∑_{y∈{0,1}} P(f(Z) = 1 | Y = y) P(Y = y | S = s)
  = P(Y = 1 | S = s′, f(Z) = 1)
  = P(f(Z) = 1 | Y = 1) P(Y = 1 | S = s′) / ∑_{y∈{0,1}} P(f(Z) = 1 | Y = y) P(Y = y | S = s′) .

If Y ⊥⊥ S, then there is nothing to prove. Otherwise, there exist s, s′ ∈ S such that
P(Y = 1 | S = s′) ≠ P(Y = 1 | S = s). For this choice of s, s′ ∈ S, set p_s = P(Y = 1 | S = s) and
p_{s′} = P(Y = 1 | S = s′). Then, in order to satisfy the above condition, TPR = P(f(Z) = 1 | Y = 1)
and FPR = P(f(Z) = 1 | Y = 0) must be such that

TPR / (TPR·p_s + FPR·(1 − p_s)) = TPR / (TPR·p_{s′} + FPR·(1 − p_{s′})) .

If TPR = 0, then the statement follows. Otherwise, the above equality implies that

TPR·(p_s − p_{s′}) = FPR·(p_s − p_{s′}) ,

that is, TPR = FPR since p_s ≠ p_{s′}. Hence, if TPR ≠ 0, then f(Z) ⊥⊥ Y.

The above impossibility result can be proven for a generic triplet (f(Z), S, Y) assuming that all
the events involved have non-zero probability; it corresponds to the intersection property of
conditional independence.
The interpretation from the fairness perspective is not very promising. We have deduced that,
unless the distribution P is very particular, there is no prediction function that can satisfy any two
of these (natural!) fairness criteria simultaneously. Depending on the level of pessimism, the reader
can draw several conclusions, each of them leading to different research directions.

• Conclusion 1: instinctive. Group fairness as a framework for fair statistical learning is
not interesting. Consequently, a research direction is: which theoretical framework is less
pessimistic, but still intuitive and useful?
• Conclusion 2: optimistic. The above results simply say that we cannot have everything.
We need to make an informed choice of the fairness criterion for our problem. In particular,
given a choice of fairness definition, how to adapt our algorithms to satisfy it in a provable
manner?
• Conclusion 3: theoretician. It is a good first step6 , but one should actually provide a
complete description of the space of three criteria in their approximate form and show which
relaxations are achievable.

The first conclusion is indeed very natural and it has actually sparked a lot of interesting research
in the direction of theoretical foundations of fair learning. In these lectures, however, we will stick
with the second conclusion and try to be optimistic. To the best of our knowledge, the direction
sketched in Conclusion 3 remains largely open. That is to say, there are no conclusive theoretical
studies of the three presented constraints.
6 V. I. Arnold, in one of his interviews with S. P. Kapitsa, attributed the following quote to H. Poincaré: “In
mathematics there are two types of questions: those with a binary answer and the interesting ones”. (The translation
from Russian is frivolous.)


1.5 Risk based fairness


The group-fairness constraints presented in the previous section relied purely on the probabilistic
relations between (X, S, Y) and the prediction f(Z). Another direction is to incorporate
fairness constraints via the risk. A natural attempt is to require equality of group-wise risks.

Definition 1.5.1. We say that f : Z → Y equalizes group-wise risks w.r.t. the distribution P of
(X, S, Y) and a loss function ℓ if

E[ℓ(f(Z), Y) | S = s] = E[ℓ(f(Z), Y)]   ∀s ∈ S .

While natural, it was only recently observed that enforcing the above fairness constraint is
not necessarily desirable. The part “not necessarily desirable” actually corresponds to a precise
mathematical statement phrased in terms of Pareto domination.
Let r_1, . . . , r_K : A → R. Consider the multi-criteria optimization problem

min_{a∈A} ( r_1(a), . . . , r_K(a) ) .

To give a meaning to the above expression, we define a partial ordering on A via the notion of
Pareto dominance.
Definition 1.5.2. We say that a_1 ∈ A dominates a_2 ∈ A in the sense of Pareto, and write
a_1 ⪯ a_2, if

r_s(a_1) ≤ r_s(a_2)   ∀s ∈ [K] .

If at least one of the above inequalities is strict, then we say that a1 strictly Pareto dominates
a2 .

Definition 1.5.3. We say that f : Z → Y achieves min-max fairness w.r.t. the distribution P of
(X, S, Y) and a loss function ℓ if it solves

min_{f : Z→Y} max_{s∈S} { r_s(f) := E[ℓ(f(Z), Y) | S = s] } .

Proposition 1.5.4. Assume that f satisfies Definition 1.5.1 and f* satisfies Definition 1.5.3.
Consider the multi-criteria optimization problem

min_{f : Z→Y} ( r_1(f), . . . , r_K(f) ) ,

where r_s(f) = E[ℓ(f(Z), Y) | S = s]. Then, f* dominates f in the sense of Pareto. Moreover, if

max_{s∈S} r_s(f) > max_{s∈S} r_s(f*) ,

then f* strictly dominates f.

Note how the above statement actually contains two separate claims. Firstly, it says that f ∗ and
f are comparable and only then it states that f ∗ should be preferred in the sense of Pareto.

Proof. By definition of f* it holds that

max_{s∈S} r_s(f) ≥ max_{s∈S} r_s(f*) .


Furthermore, since f equalizes the group-wise risks, max_{s∈S} r_s(f) = r_{s′}(f) for all s′ ∈ S. Thus,
the above inequality implies that for all s′ ∈ S

r_{s′}(f) ≥ max_{s∈S} r_s(f*) ≥ r_{s′}(f*) .

The above implies that f* Pareto dominates f. The second claim is derived similarly.

While we have managed to give a precise mathematical meaning to the phrase “equal risks are not
necessarily desirable” by relying on the notion of Pareto dominance, it does not mean that equal
group-wise risks are useless. Social and ethical requirements are not necessarily captured by the
multi-criteria optimization, and both fairness notions can co-exist.
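As a small computational illustration (the helper functions below are our own, not objects from the notes), given group-wise empirical risks of two predictors one can check Pareto dominance directly; this is how the comparison in Proposition 1.5.4 would be verified on data.

```python
import numpy as np

def group_risks(pred, y, s, groups=(-1, 1)):
    # empirical r_s(f): mean 0/1-loss within each group
    return np.array([np.mean(pred[s == g] != y[s == g]) for g in groups])

def pareto_dominates(r1, r2):
    """True if the predictor with group-wise risks r1 Pareto dominates the one with risks r2."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    return bool(np.all(r1 <= r2))

# Example with fictitious group-wise risks: the min-max predictor (left) dominates the
# risk-equalizing one (right), strictly so since its worst group risk is smaller.
print(pareto_dominates([0.12, 0.18], [0.20, 0.20]))   # True
```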

1.6 Main approaches and course overview


In these lecture notes we shall stick to the following presentation style:

1. Make a choice of risk and fairness constraint and define the “gold standard” prediction function;

2. Define the data available for building an estimator;

3. Define the estimator;

4. Derive excess risk and fairness guarantees.

We shall mainly stick to the demographic parity constraint under the awareness framework—that is,
we consider f(X, S) ⊥⊥ S. We shall indicate explicitly whether the considered algorithm can be extended
beyond demographic parity and awareness. If, in contrast, the extension is unknown and challenging,
an open question shall be proposed.

1.6.1 Post-processing
The first type of algorithms to achieve some fairness constraint is called post-processing. In this
paradigm, it is assumed that an initial base estimator fˆ is given. Then, the typical questions are

1. What kind of data do we need to make fˆ fair in the sense of some definition?

2. Which excess risk can one guarantee when compared to a prediction function that minimizes
the risk under the considered fairness constraint?

On a very high level, the post-processing approach is often built on the relation between the solutions
of

min_{f : Z→Y} R(f)   and   min_{f : Z→Y} { R(f) : f satisfies some fairness constraint } .

Most of the time, it is possible to show that the solution of the latter problem can be obtained
from the solution of the former problem via some transformation f ↦ T(f), which depends on
the unknown distribution P. Then, the post-processing algorithms are tailored to estimate this
transformation by some T̂ with the help of additional data and output T̂(fˆ).


1.6.2 In-processing
In-processing, unlike post-processing, aims at enforcing some fairness constraint at training
time. Typically, this approach is framed as empirical risk minimization under empirical fairness
constraints:

min_{f∈F} { R̂(f) : f satisfies some empirical fairness constraint } ,

where F is some class of prediction functions known beforehand. In some sense, which hopefully will
be clarified later, post-processing is inspired by non-parametric statistics, while in-processing
by statistical learning. While already in the presented form the estimator can achieve reasonable
statistical guarantees, the main challenge here is to incorporate the fairness constraints into the
optimization loop. That is, if one is able to solve

min_{f∈F} R̂(f) ,

how can one use this black-box solver to achieve the desired fairness constraint? A minimal sketch
of the penalized variant of this idea is given below.
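The following sketch is one (simplistic) way to act on the in-processing idea: it adds a demographic parity penalty to a logistic-regression objective and minimizes the sum with scipy. It is only an illustration of the principle—not the reduction approach studied in Chapter 3—and the penalty weight, optimizer and function names are arbitrary choices of ours.

```python
import numpy as np
from scipy.optimize import minimize

def fit_dp_penalized_logreg(X, y, s, penalty=5.0):
    """Logistic regression with a demographic-parity penalty (illustrative sketch only)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])            # add intercept column

    def objective(w):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))                 # predicted P(Y = 1 | X)
        logloss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
        dp_gap = abs(p[s == 1].mean() - p[s == -1].mean())  # soft demographic parity gap
        return logloss + penalty * dp_gap

    w0 = np.zeros(Xb.shape[1])
    return minimize(objective, w0, method="Nelder-Mead").x

# Usage: w = fit_dp_penalized_logreg(X, y, s); predict with sigmoid(Xb @ w) >= 0.5.
```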

1.6.3 Pre-processing
Pre-processing is yet another way of building algorithms that satisfy some fairness constraint. One
of the main goals in this direction is to find a feature representation z ↦ φ(z) which ensures the
independence of the representation φ(Z) from the sensitive attribute S. Once such a feature
representation has been found, one can use any standard ML algorithm on top of it. This direction
is particularly relevant in applications where the prediction task (i.e., Y) is not known beforehand.
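As a toy illustration of this idea (and only of its weakest, first-order form), one can center the features group-wise; this removes mean differences between groups, not full statistical dependence, which is why actual pre-processing methods are more sophisticated. The helper below is our own naming.

```python
import numpy as np

def groupwise_center(X, s, groups=(-1, 1)):
    """phi(x) = x - E[X | S]: removes group-wise mean differences (not full dependence)."""
    Phi = X.astype(float).copy()
    for g in groups:
        Phi[s == g] -= X[s == g].mean(axis=0)
    return Phi
```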

1.6.4 What is this course about?


The “minimum” goal for this 21h course is to study two algorithms: post-processing and in-processing
for binary classification under the demographic parity constraint. In an ambitious scenario (if time
permits and the students are comfortable with the material) we shall consider regression under the
independence constraint; online multi-calibration; and, possibly, one pre-processing algorithm.
All the mathematical tools that we are going to use for the study of the first two algorithms
should be accessible to those who have followed Statistical Learning Theory (S. Arlot, J. Mourtada,
or V. Perchet edition) and/or High-dimensional Statistics (Ch. Giraud & M. Lerasle edition). For
regression under independence constraints it is useful to be familiar with one-dimensional Wasserstein
distances (a short intro will be provided). For online multi-calibration no prior knowledge is assumed,
but familiarity with calibration can be useful.
The content of the course is very much biased towards theory-driven approaches to algorithmic
fairness. In particular, at every moment, it should be clear to the reader which final goal we aim to
achieve and which fairness definition is considered. This is largely explained by the nature of the
Master’s program where this course first appeared and by the personal tastes and contributions of
the authors. However, we stress that a completely formal approach to fairness (in its broad sense) is
not necessarily the way to go, and that to truly tackle the issue of algorithmic fairness one needs an
inter-disciplinary approach involving a wide range of expertise.

1.7 Bibliographic references


A decade before the first version of these notes, only a handful of papers in the machine learning
community were concerned with algorithmic discrimination (Dwork et al., 2012; Kamiran and Calders,
2009; Luong et al., 2011; Pedreschi et al., 2009; Pedreshi et al., 2008; Zemel et al., 2013). The
definition of individual fairness, which is not covered in these lecture notes, is due to Dwork et al.
(2012). It is probably after NIPS’16 that the issue of algorithmic fairness started to receive
increasing attention from statisticians and computer scientists. Notably, NIPS’16 featured at least four
very influential works on algorithmic fairness (Bolukbasi et al., 2016; Goh et al., 2016; Hardt et al.,
2016; Joseph et al., 2016). From here on, it is pretty much impossible to list all the papers. The
notions of equal opportunity and equalized odds are due to Hardt et al. (2016), and the impossibility
results were established concurrently by Chouldechova (2017) and Kleinberg et al. (2016). The
exposition of the impossibility theorems in these lecture notes follows the one in Barocas et al. (2019).
The term Disparate Learning Process was first used by Lipton et al. (2018), where the authors
highlighted, theoretically and empirically, several issues that can arise in this context and that can
otherwise be avoided in the awareness framework. The notion of min-max fairness supported by the
Pareto dominance interpretation was first introduced by Martinez et al. (2020) and has started to
gain increasing interest in the community (see e.g., Diana et al., 2021; Martinez et al., 2021).

1.8 Other courses on algorithmic fairness


Here we collect some courses on algorithmic fairness which, for the interested reader, can serve as
great complementary reading. If you find any other course that is related to algorithmic fairness and
is available online, please let us know. Unlike the present course, these alternatives feature a much
greater discussion of the social and ethical aspects of fairness.

• University of Washington, by Jamie Morgenstern:


https://courses.cs.washington.edu/courses/cse599m/20wi/
• Carnegie Mellon University, by Hoda Heidari
https://www.cs.cmu.edu/~hheidari/mles-fall-21.html
• UC Berkeley, by Moritz Hardt
https://fairmlclass.github.io
• University of Michigan, by H.V. Jagadish
coursera

• NYU, by Julia Stoyanovich and George Wood


https://dataresponsibly.github.io/rds/modules/fairness/week1/
• Duke University, by Ashwin Machanavajjhala
https://sites.duke.edu/cs590f18privacyfairness/

• University Paris-Saclay, by Isabelle Guyon and Kim Gerdes


Fairness in Artificial Intelligence, M1 in computer science

Is something missing? Drop a line: evgenii.chzhen@universite-paris-saclay.fr

Chapter 2

Post-processing: deriving the form of
optimal fair classifier

As we have seen in the previous class, a fairness constraint is a dichotomy describing which prediction
functions are allowed. Notably, all the definitions that we have seen so far actually depend on the
underlying, unknown, distribution P of (X, S, Y). Similarly to the Bayes optimal prediction, which
minimizes the expected loss over all prediction functions, we can define the fair optimal prediction,
which minimizes the same quantity over those predictions that we have decided to call fair. Namely,
from a statistical perspective, we might want to estimate a solution of

min_{f : Z→Y} { E[ℓ(f(Z), Y)] : f satisfies some fairness constraint } .

In this chapter we are going to consider a particular case of the above problem. Specifically, we
will treat the following setup:

• Awareness: Z = X × S;

• Classification: ℓ(f(Z), Y) = I(f(Z) ≠ Y);

• Binary group: S = {−1, 1};

• Demographic Parity: P(f (Z) = 1 | S = −1) = P(f (Z) = 1 | S = 1).

Note that here we are not restricting f to some known class of prediction functions (as we usually
do in learning theory). The analysis of this chapter is closer, in spirit, to the literature on non-
parametric classification, where instead of restricting our attention to a lower-complexity class of
prediction functions, we impose direct assumptions on the Bayes optimal prediction.

2.1 Optimal classifier under Demographic Parity constraint


Let (X, S, Y) ∈ R^d × {−1, 1} × {0, 1} be the triplet consisting of the feature vector, the sensitive
attribute, and the binary label to be predicted. Introduce η(X, S) := E[Y | X, S], the regression
function of this problem. The name regression function comes from the fact that the following
decomposition always holds (in particular, it holds for a binary label Y):

Y = E[Y | X, S] + ξ ,   where ξ := Y − E[Y | X, S] and E[ξ | X, S] = 0 .

This function η, as in the classical setup of binary classification, will play a key
role in the analysis of this part. A classifier is any (measurable) function g : R^d × {−1, 1} → {0, 1}.
The mathematical expressions of this chapter will often rely heavily on the fact that S = {−1, 1}. It
is possible to re-write everything for arbitrary S, but the resulting expressions are too long; thus,
for conciseness, we stick to S = {−1, 1}. For example, the demographic parity constraint in the case
S = {−1, 1} can be compactly written as

∑_{s∈{−1,1}} s · P(g(Z) = 1 | S = s) = 0 .

This expression of demographic parity will be used later in the text, when the usual expression
becomes too long to fit on one line.
We are interested in the following optimization problem:

g* ∈ arg min_{g : R^d×{−1,1}→{0,1}} { P(g(X, S) ≠ Y) : P(g(X, S) = 1 | S = −1) = P(g(X, S) = 1 | S = 1) } .

It is good to recall that constant predictions do satisfy the above constraint—the feasible set is
non-empty.
The following condition is central for the analysis. It is proposed to think of a way to relax it in the
exercise at the end of this subsection.
Assumption 2.1.1. We assume that t ↦ P(η(X, S) ≤ t | S = s) is continuous for s ∈ {−1, 1}.
Let us understand what this assumption entails. Assume that there exists x ∈ R^d such
that P(X = x | S = 1) > 0. In this case the above continuity assumption is violated, since
P(η(X, S) = η(x, 1) | S = 1) > 0. Thus, no matter the function η, the measure P(· | S = s) cannot
concentrate mass on individual feature vectors. Secondly, we cannot have a set of individuals
A ⊂ R^d with P(X ∈ A | S = 1) > 0 such that they all have the same conditional probability of having
a positive label. The fact that P(· | S = s) cannot have atoms is not particularly restrictive
(personally, I have never faced a dataset which has two identical entries with different labels). The
second part, however, is more of a restriction, at least at first sight. Yet, empirical results show that
this assumption is of little concern in practice.
As in the standard unconstrained classification setting, where the Bayes rule is given by the
classifier I(η(x, s) > 1/2), having such a simple expression in the context of fairness can be very
beneficial. The next result shows that under the above assumption, we still remain in the realm of
“thresholding conditional probabilities”.
Theorem 2.1.1: Optimal fair classifier under demographic parity

Let Assumption 2.1.1 hold and set p_s := P(S = s) for s ∈ {−1, 1}. Then

g*(x, s) = I( η(x, s) ≥ 1/2 + sλ*/(2 p_s) )   ∀(x, s) ∈ R^d × {−1, 1} ,

with

λ* ∈ arg min_{λ∈R} ∑_{s∈S} E[ max( p_s{2η(X, S) − 1} − sλ, 0 ) | S = s ] ,

is a solution of

min_{g : R^d×S→{0,1}} { P(g(X, S) ≠ Y) : P(g(X, S) = 1 | S = −1) = P(g(X, S) = 1 | S = 1) } .

Furthermore, it holds that

R(g*) = E[Y] − min_{λ∈R} ∑_{s∈S} E[ max( p_s{2η(X, S) − 1} − sλ, 0 ) | S = s ] .   (2.1)


Before proving it, let us have a closer look at the above result. Intuitively, if the pair (X, Y) is
independent from S, the introduced constraint should not matter at all. Indeed, in this case the
very introduction of the sensitive attribute is void. Thus, we hope that in this case λ* = 0. To see
this, we observe that under this independence η(·, −1) ≡ η(·, 1) and P(· | S = −1) ≡ P(· | S = 1).
Hence, under this independence, λ* is a minimizer of

E[ max( p_{−1}{2η(X, 1) − 1} + λ, 0 ) ] + E[ max( p_1{2η(X, 1) − 1} − λ, 0 ) ] .

Using the fact that max{a, 0} + max{b, 0} ≥ max{a + b, 0} for all a, b ∈ R, and noting that the resulting
lower bound E[max((p_{−1} + p_1){2η(X, 1) − 1}, 0)] is attained at λ = 0, we conclude that λ* can
be taken as 0. To sum up, if (X, Y) is independent from S, then (under the continuity assumption)
Theorem 2.1.1 reduces to the expression for the Bayes rule. The second observation that we make
concerns the exact expression for g*—it thresholds each group in opposite directions. That is to
say, the acceptance threshold for one of the groups is increased while for the other it is decreased.
Also, the division by p_s in the indicator appearing in the expression for g* makes the threshold
adjustment more drastic for the “minority” group, i.e., for s ∈ {−1, 1} with P(S = s) ≪ 1. From the
social perspective it is certainly not obvious that this is “the way to go” and, on the face of it, this
strategy is rather questionable.
At this moment one might have a reasonable question: why are we learning all this? First, notice
that it will take us two pages of math derivations to come to these conclusions; that is, while at first
sight demographic parity seems to be a useful and very intuitive concept, we still need to develop a
sufficient understanding of it. Second, the statistical results that we prove in what follows do not
have to be used exclusively in the context of fairness. So the answer depends on one’s preference:
learn it for the tools, learn it for the deeper understanding, or close these lecture notes.
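To make the statement of Theorem 2.1.1 concrete, here is a small numerical sketch; the distribution and the function names are synthetic choices of ours, made only for illustration. With η and p_s known, the dual objective is a convex function of a single real variable and can be minimized with a scalar solver, after which g* is obtained by group-wise thresholding.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
n = 200_000

# Synthetic population: group -1 is a minority with a shifted feature distribution.
S = rng.choice([-1, 1], size=n, p=[0.2, 0.8])
X = rng.normal(loc=0.5 * S, scale=1.0, size=n)
eta = lambda x, s: 1 / (1 + np.exp(-(x + 0.5 * s)))     # assumed known in this sketch
p = {s: np.mean(S == s) for s in (-1, 1)}

def dual_objective(lam):
    # sum_s E[ max( p_s (2*eta - 1) - s*lam, 0 ) | S = s ], estimated on the sample
    return sum(
        np.mean(np.maximum(p[s] * (2 * eta(X[S == s], s) - 1) - s * lam, 0.0))
        for s in (-1, 1)
    )

lam_star = minimize_scalar(dual_objective, bounds=(-1, 1), method="bounded").x
g_star = lambda x, s: (eta(x, s) >= 0.5 + s * lam_star / (2 * p[s])).astype(int)

# Check demographic parity of g*: approximately equal acceptance rates across groups.
print([g_star(X[S == s], s).mean() for s in (-1, 1)])
```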

Proof. The proof of this result mimics the derivation of the Bayes optimal classifier. To do
so, we denote by (P) the value of the minimization problem of interest, that is, (P) equals

min_{g : R^d×S→{0,1}} { P(g(X, S) ≠ Y) : P(g(X, S) = 1 | S = −1) = P(g(X, S) = 1 | S = 1) } .

Introducing the Lagrangian

L(g, λ) = P(g(X, S) ≠ Y) + λ ( P(g(X, S) = 1 | S = −1) − P(g(X, S) = 1 | S = 1) ) ,

we can equivalently express (P) as

(P) = min_{g : R^d×S→{0,1}} max_{λ∈R} L(g, λ) .

Reminder 2.1.1: Max-min inequality

For any function f : A × B → R it holds that

min_{a∈A} max_{b∈B} f(a, b) ≥ max_{b∈B} min_{a∈A} f(a, b) .

The max-min inequality asserts that

min_{g : R^d×S→{0,1}} max_{λ∈R} L(g, λ) ≥ max_{λ∈R} min_{g : R^d×S→{0,1}} L(g, λ) =: (D) ,

where we introduce (D). From now on our goal is to solve the max-min problem, obtaining the value
of (D), and then to establish that under Assumption 2.1.1 (D) = (P).


Solving the max-min problem. We start with the unconstrained inner minimum over the classifiers.
First, we simplify the introduced Lagrangian, reducing it to an easier-to-handle form. The only
ingredient in the following manipulations is the tower rule for conditional expectation. Since for all
a, b ∈ {0, 1} we have I(a ≠ b) = a(1 − b) + (1 − a)b, we get

P(g(X, S) ≠ Y) = E[g(X, S)(1 − Y) + (1 − g(X, S))Y]
             = E[Y] + E[g(X, S)(1 − 2η(X, S))]
             = E[Y] + ∑_{s∈S} p_s E[g(X, S)(1 − 2η(X, S)) | S = s] ,

where p_s = P(S = s). Furthermore, for any s ∈ S = {−1, 1} we have

s P(g(X, S) = 1 | S = s) = s E[g(X, S) | S = s] .

Using the above two displays, the Lagrangian can be expressed as

L(g, λ) = E[Y] + ∑_{s∈S} E[ g(X, S)( p_s{1 − 2η(X, S)} + sλ ) | S = s ] .

We are now in a position to solve the inner unconstrained minimization problem appearing in the
definition of (D). Fix an arbitrary λ ∈ R and denote by g_λ any solution of

min_{g : R^d×S→{0,1}} L(g, λ) .

Clearly, it holds for all (λ, g) that

L(g, λ) ≥ E[Y] + ∑_{s∈S} E[ min( p_s{1 − 2η(X, S)} + sλ, 0 ) | S = s ] .

Thus, we can take g_λ defined point-wise for all (x, s) ∈ R^d × S as

g_λ(x, s) = I( p_s{1 − 2η(x, s)} + sλ ≤ 0 ) .

N.B. It is interesting to note that if λ = 0, the corresponding g_λ coincides with the Bayes classifier.
No surprise here, since L(g, 0) is just the usual misclassification risk, but it is a good sanity check.
Substituting the expression for g_λ into the Lagrangian and using the fact that max_a{f(a)} =
−min_a{−f(a)}, we deduce that

(D) = E[Y] − min_{λ∈R} ∑_{s∈S} E[ max( p_s{2η(X, S) − 1} − sλ, 0 ) | S = s ] .

The objective function of the minimization problem appearing in the above expression,

λ ↦ ∑_{s∈S} E[ max( p_s{2η(X, S) − 1} − sλ, 0 ) | S = s ] ,

is convex. Indeed, for each fixed (x, s) ∈ R^d × S the functions λ ↦ 0 and λ ↦ p_s{2η(x, s) − 1} − sλ
are convex, implying that λ ↦ max( p_s{2η(x, s) − 1} − sλ, 0 ) is convex. Taking the expectation over
(X, S) of λ ↦ max( p_s{2η(x, s) − 1} − sλ, 0 ) preserves this convexity (as does summing two convex
functions). Furthermore, notice that

lim_{|λ|→+∞} ∑_{s∈S} E[ max( p_s{2η(X, S) − 1} − sλ, 0 ) | S = s ] = +∞ ,

that is, this function is also coercive; hence a global minimizer exists.
Reminder 2.1.2: Sub-differential

Let F : R^d → R be a convex function. The sub-differential ∂F(x) of F at a point x ∈ R^d is
defined as

∂F(x) = { g ∈ R^d : F(x′) ≥ F(x) + ⟨g, x′ − x⟩ ∀x′ ∈ R^d } .
Proposition 2.1.2. Consider

F(x) = E[f(x, ω)] ,

where ω is an element of some probability space (Ω, Σ, P) and x ↦ f(x, ω) is convex for all
ω ∈ Ω. For every x ∈ R^d define

D(x) = { ω ∈ Ω : f(·, ω) is not differentiable at x } .

Assume that D(x) is measurable; then F is differentiable at the point x ∈ R^d if and only if
P(D(x)) = 0. In this case,

∇F(x) = ∫ ∇_x f(x, ω) I(ω ∉ D(x)) dP(ω) .

See (Bertsekas, 1973, Proposition 3).

Therefore, we can use the first-order optimality condition for convex non-smooth minimization
problems:

0 ∈ ∂ ∑_{s∈S} E[ max( p_s{2η(X, S) − 1} − sλ, 0 ) | S = s ] .

Our goal now is to calculate the above sub-differential. Let ω = (X, S); then the mapping
λ ↦ ∑_{s∈S} E[ max( p_s{2η(X, S) − 1} − sλ, 0 ) | S = s ] can be expressed as E[f(λ, ω)] with
f(λ, ω) = p_S^{−1} max( p_S{2η(X, S) − 1} − Sλ, 0 ). Since for any λ ∈ R we have, by Assumption 2.1.1,
that P( p_S{2η(X, S) − 1} = Sλ ) = 0, then, using Proposition 2.1.2, we deduce that

λ ↦ ∑_{s∈S} E[ max( p_s{2η(X, S) − 1} − sλ, 0 ) | S = s ]

is differentiable for every λ ∈ R. Furthermore,

∂ ∑_{s∈S} E[ max( p_s{2η(X, S) − 1} − sλ, 0 ) | S = s ]

equals

{ − ∑_{s∈S} s P( p_s{2η(X, S) − 1} − sλ ≥ 0 | S = s ) } ,

that is, it is a singleton.


The first-order optimality condition for convex non-smooth problems, in conjunction with the
above arguments, implies that any λ which minimizes

∑_{s∈S} E[ max( p_s{2η(X, S) − 1} − sλ, 0 ) | S = s ]

must satisfy

∑_{s∈S} s P( p_s{2η(X, S) − 1} − sλ ≥ 0 | S = s ) = 0 .

Or, equivalently,

P( p_{−1}{2η(X, S) − 1} + λ ≥ 0 | S = −1 ) = P( p_1{2η(X, S) − 1} − λ ≥ 0 | S = 1 ) .

Establishing strong duality. Looking at the form of g_λ, we have deduced that the pair
(λ*, g_{λ*}) which solves (D) satisfies the following three conditions:

λ* ∈ arg min_{λ∈R} ∑_{s∈S} E[ max( p_s{2η(X, S) − 1} − sλ, 0 ) | S = s ] ,
g_{λ*}(x, s) = I( p_s{1 − 2η(x, s)} + sλ* ≤ 0 )   ∀(x, s) ∈ R^d × {−1, 1} ,
P( g_{λ*}(X, S) = 1 | S = −1 ) = P( g_{λ*}(X, S) = 1 | S = 1 ) .

In particular, g_{λ*} is feasible for the problem (P). If g* is a solution of (P), then weak duality
implies

P(g*(X, S) ≠ Y) = (P) ≥ (D) = L(g_{λ*}, λ*) = P(g_{λ*}(X, S) ≠ Y) .

Thus, we can set g* ≡ g_{λ*} and the proof is concluded.


Exercise 2.1.3. Extend the previous result to approximate demographic parity. That is, for some
ε ∈ [0, 1], we want to solve

min_{g : R^d×S→{0,1}} { P(g(X, S) ≠ Y) : |P(g(X, S) = 1 | S = −1) − P(g(X, S) = 1 | S = 1)| ≤ ε } .

Exercise 2.1.4. Extend the previous result to the case of unawareness. That is, we want to solve

min_{g : R^d→{0,1}} { P(g(X) ≠ Y) : P(g(X) = 1 | S = −1) = P(g(X) = 1 | S = 1) } .

What is the alternative to Assumption 2.1.1?

Exercise 2.1.5 (S = {1, . . . , K}). Generalize the previous result to the following problem

min_{g : R^d×S→R} { ∑_{s∈S} E[r_s(X) g(X, S) | S = s] : ∑_{s∈S} E[c_{j,s}(X) g(X, S) | S = s] ≤ 0, j ∈ [m] } ,

where m ∈ N and r_s, c_{j,s} : R^d → R are some fixed functions. What is the alternative to
Assumption 2.1.1? Demonstrate that this formulation is general enough to accommodate the previous
results and exercises, and that it allows one to consider the Equal Opportunity constraint.

Exercise 2.1.6 (S = {1, . . . , K}). In the context of the previous exercise, consider the case of
unawareness.
Proof of Solenne Gaucher. In March 2022, Solenne proposed another, much more elegant
and simpler proof. It is presented here in the case of S = {1, . . . , K} =: [K] and approximate
demographic parity. Let us first introduce some notation that will eventually be used and re-
introduced again in Chapter 5. For every s ∈ [K], let µ_s(η) be the univariate probability measure
defined by µ_s(η)(A) = P(η(X, S) ∈ A | S = s) for every Borel set A ⊂ [0, 1]. The cumulative
distribution function of µ_s(η) is denoted by F_{µ_s(η)} and the quantile function of µ_s(η) (the inverse of
F_{µ_s(η)}) is denoted by F^{-1}_{µ_s(η)}.


Theorem 2.1.2: Alternative representation of fair optimal classifier

Assume that for every s ∈ [K], the cumulative distribution function F_{µ_s(η)} of µ_s(η) is invertible.
Then, for any ε ≥ 0, a solution of

min { R(g) : max_{s∈[K]} |E[g(X, S) | S = s] − E[g(X, S)]| ≤ ε }

can be written as

g_β(x, s) = I( η(x, s) ≥ F^{-1}_{µ_s(η)}(β_s) ) ,

where β = (β_1, . . . , β_K)^⊤ is a solution of

min_{β∈[0,1]^K} { ∑_{s=1}^{K} p_s ∫_{β_s}^{1} ( 1 − 2F^{-1}_{µ_s(η)}(t) ) dt : max_{s∈[K]} |β_s − ⟨p, β⟩| ≤ ε } ,

with p = (p_1, . . . , p_K)^⊤.

Corollary 2.1.6.1. Under the assumptions of the above theorem, if ε = 0, then

β_1 = . . . = β_K = β ,

with β being the median of the 2-Wasserstein barycenter of (µ_s(η))_{s∈[K]} weighted by (p_s)_{s∈[K]}, i.e.,

∑_{s=1}^{K} p_s F^{-1}_{µ_s(η)}(β) = 1/2 .

The part on the 2-Wasserstein barycenter will become clear after reading Chapter 4; at this stage
it is sufficient to view β as the root of the above equation. This description of β gives the following
interpretation—a threshold stability observation:

Average threshold of fair optimal = Average threshold of Bayes optimal .

This was clear from the previous proof in the case K = 2, but is not at all evident for K > 2.
Interestingly, in the case of regression, studied in Chapter 4, there will be a somewhat similar
“average stability” property. Also, for the case of K = 2 (we use S = {−1, 1}) and ε = 0 it is
informative to look at the connection between the proof based on duality and Solenne’s proof. Note
that the first-order optimality condition for

min { ∑_{s∈{−1,1}} p_s ∫_{β_s}^{1} ( 1 − 2F^{-1}_{µ_s(η)}(t) ) dt : ∑_{s∈{−1,1}} s β_s = 0 }

reads as

p_s( 2F^{-1}_{µ_s(η)}(β_s) − 1 ) − sλ* = 0  and  β_1 = β_{−1}   ⟺   β_s = F_{µ_s(η)}( 1/2 + sλ*/(2p_s) )  and  β_1 = β_{−1} ,

that is, Solenne’s description of the optimal fair classifier lives in the dual space of the optimization
problem that we need to solve to obtain the optimal value of λ* in Theorem 2.1.1.

Proof of Theorem 2.1.2. Set q(x, s) := F_{µ_s(η)}(η(x, s)). Then, it is easy to verify that

Law(q(X, S) | S = s) = Unif(0, 1)   and   Law(q(X, S)) = Unif(0, 1) .

We already know from Theorem 2.1.1 that the optimal solution is given by a group-wise thresholding
classifier. That is, we can restrict our attention to classifiers of the form

g(x, s) = I( η(x, s) ≥ α_s ) .

In particular, since q(x, s) is obtained from η(x, s) by a strictly increasing transformation, every g
of the above form can be written as

g(x, s) = I( q(x, s) ≥ β_s ) ,

where β_s = F_{µ_s(η)}(α_s). Furthermore, since (q(X, S) | S = s) is distributed uniformly on (0, 1),
the demographic parity constraint for the above g is equivalent to: for all s′ ∈ [K],

| β_{s′} − ∑_{s=1}^{K} p_s β_s | ≤ ε .

Let β = (β_1, . . . , β_K); thus, by an abuse of notation, we can write

min { R(g) : max_{s∈[K]} |E[g(X, S) | S = s] − E[g(X, S)]| ≤ ε } = min_{β∈[0,1]^K} { R(β) : max_{s′∈[K]} | β_{s′} − ∑_{s=1}^{K} p_s β_s | ≤ ε } ,

where R(β) = ∑_{s=1}^{K} p_s P( Y ≠ I(q(X, S) ≥ β_s) | S = s ). This is the main, amazing idea: the
reduction from the constrained problem over classifiers to a constrained problem over thresholds.
Furthermore, we can write

R(β) = E[Y] + ∑_{s=1}^{K} p_s E[ I(q(X, S) ≥ β_s)(1 − 2η(X, S)) | S = s ]
     = E[Y] + ∑_{s=1}^{K} p_s E[ I(q(X, S) ≥ β_s)( 1 − 2F^{-1}_{µ_s(η)}(q(X, S)) ) | S = s ]
     = E[Y] + ∑_{s=1}^{K} p_s ∫_{β_s}^{1} ( 1 − 2F^{-1}_{µ_s(η)}(t) ) dt .

The proof is concluded.

2.2 Post-processing algorithm


In the standard unconstrained classification, the Bayes classifier depends only on the regression
function η. In particular, if we knew this function, we can build the most accurate classifier. Since
it is not known, a natural attempt to mimic the Bayes classifier is to first estimate η by some η̂
and then to threshold it by 1/2. The rational behind this approach is given by the following link
between excess risk of classification and the regression risk:

E(ĝ) := P(ĝ(X, S) 6= Y | η̂)− min P(g(X, S) 6= Y )


g
(2.2)
≤ 2E [|η(X, S) − η̂(X, S)| | η̂] ,

Page 24 of 63
CHAPTER 2 Selected topics on algorithmic fairness

where ĝ(x, s) = I(η̂(x, s) ≥ 1/2). The proof of this result is short. We know (see lecture 1 for the
reminder) that that the excess risk of any classifier g and of ĝ in particular equals to

2E|η(X, S) − 1/2|I(ĝ(X, S) 6= g ∗ (X, S)) .

By definition of the plug-in rule if ĝ(X, S) 6= g ∗ (X, S), then η̂ and η are on different sides of 1/2,
implying that on this event |η(X, S) − 1/2| ≤ |η(X, S) − η̂(X, S)|. It important to mention that
attacking classification via regression is not necessarily optimal. Actually, intuitively, classification
is a simpler task—we do not actually need to estimate η, we only need to estimate the decision
boundary. Indeed, for instance imagine that X ∈ R and η(·, s) : R → [0, 1] is monotone on R. In
this case estimating η(·, s) and I(η(·, s) ≥ 1/2) are two very distinct problems. The former case
is known as isotonic regression where the minimax optimal rate is O(n−1/3 ) (see e.g., Nemirovski
et al., 1985), while the latter case corresponds to learning on a VC class (class of thresholds) and we
can achieve classification rate of O(n−1/2 ) (see e.g., Yang, 1999). Nevertheless, for richer classes of
regression functions (on classes of smoothness), the plug-in estimator can be optimal.
Unfortunately, we will not have such a short and neat proof in what follows. Additional points
will be given for any significant simplifications of what follows. The first algorithm that we will
consider in this course comes as a natural attempt to exploit the result of Theorem 2.1.1 and follow
the ideas developed in the case of unconstrained binary classification. Recall that the classifier g ∗
derived in Theorem 2.1.1 has the following form

1 sλ∗
 

g (x, s) = I η(x, s) ≥ + ∀(x, s) ∈ Rd × {−1, 1} ,
2 2ps

with
X
λ∗ ∈ arg min
  
E max ps {2η(X, S) − 1} − sλ, 0 | S = s .
λ∈R s∈S

Unlike the unconstrained case, the knowledge of η is not anymore sufficient for us—we also need
some information about the conditional distribution of the features X | S = s. Yet, again, in the
spirit on unconstrained classification, let us imagine that η is known. In that case, it is sufficient to
estimate the conditional distribution of X given S = s and the marginal distribution of the sensitive
attribute. The latter is less crucial, and, in what follows, let us assume that ps = P(S = s) is known
beforehand (it can be well estimated by empirical frequencies).

Base estimator and data. Keeping the above discussion in mind, let us assume that some initial
estimator η̂ is provided. One of our eventual goals will be establishing a guarantee which is similar
to that in Eq. (2.2). For this reason, at this stage, η̂ can have any nature and we do not even need
to assume that it does something intelligent.
To estimate the distribution of X | S = s, for each s ∈ {−1, 1} we collect two independent
samples: one for s = −1 and the other for s = 1

(X s1 , . . . , X sns ) ,

which are i.i.d. following PX|S=s . Here we are allowing ourselves to sample from each sub-population.
In case it is not possible in a given application, we can replace the above sample by and i.i.d sample
from P(X,S)
(X 1 , S1 ), . . . , (X n , Sn ) ,
then setting ns = # {i ∈ [n] : Si = s} this new sample is equivalent to the previous one. The only
difference is that in this latter case ns is random, but conditionally on ns the analysis of both
sampling schemes is identical.

Page 25 of 63
CHAPTER 2 Selected topics on algorithmic fairness

Equipped with the base estimator η̂ and with the sample (X s1 , . . . , X sns )s=−1,1 , we can define
“double plug-in” estimator as
!
1 sλ̂
ĝ(x, s) = I η̂(x, s) > + ∀(x, s) ∈ Rd × {−1, 1} , (2.3)
2 2ps

with
( ns
)
X 1 X
max ps {2η̂(X si , s) − 1} − sλ, 0 (2.4)

λ̂ ∈ arg min .
λ∈R ns i=1
s∈S

Note that the minimization problem in Eq. (2.4) is univariate and convex (the convexity follows
from the same arguments as in the proof of Theorem 2.1.1), hence we can solve it up to an arbitrary
precision with a simple bisection method.
Exercise 2.2.1. Implement the above method in the language of your choice and provide benchmark
experiments.

Some deterministic properties of the algorithm Before establishing statistical guarantees


for the method, we establish some of its properties, which hold independently from the generating
process behind the data. The next result tells us that we can start this bisection method on the
interval [−1, 1].
Proposition 2.2.2. Assume that η̂ takes values in [0, 1]. Any solution of the minimization problem
in Eq. (2.4) belongs to [−1, 1].
Proof. Let us denote the objective function to be minimized by
( ns
)
X 1 X
s

G(λ) = max ps {2η̂(X i , s) − 1} − sλ, 0 .
ns i=1
s∈S

As in Theorem 2.1.1, G(λ) is convex. Note that G(λ) > G(1) for all λ > 1 and G(λ) > G(−1) for
all λ < −1.

In what follows it we will make use of the following notation: for any C ⊂ Rd and g : Rd × S → R
ns ns
1 X 1 X
P̂(X ∈ C | S = s) := I(X si ∈ C) and Ê[g(X, S) | S = s] := g(X si , s) .
ns i=1 ns i=1

The next result shows that the estimator ĝ nearly achieves demographic parity on the sample.
Proposition 2.2.3. Assume that for all s ∈ {−1, 1}, all elements of the following set are distinct

{η̂(X s1 , s), . . . , η̂(X sns , s)} .

Then, the “double plug-in” ĝ defined in Eq. (2.3) satisfies


1 1
P̂ (ĝ(X, S) = 1 | S = −1) − P̂ (ĝ(X, S) = 1 | S = 1) ≤ + .

n1 n−1
The above result states that the classifier ĝ approximately satisfies the demographic parity
constraint on the additional unlabeled sample. The statement of this result is completely deterministic
and follows from the very definition of the estimator.

Page 26 of 63
CHAPTER 2 Selected topics on algorithmic fairness

Proof. The first order optimality condition for the optimization problem in Eq. (2.4), imply that
( ns
)
X 1 X
s

0∈∂ max ps {2η̂(X i , s) − 1} − sλ̂, 0 .
ns i=1
s∈S

Unlike the population level, where our remedy was Assumption 2.1.1, which allowed us to “average-out”
all the points of non-differentiability, in that case we need to actually compute the sub-differential.
Luckily, the non-smoothness has a very simple structure. We can write the following expression for
this sub-differential at λ̂
ns
X X 1 X  
sP̂ (ĝ(X, S) = 1 | S = s) + s · [0, 1] · I ps {2η̂(X si , s) − 1} = sλ̂ ,
ns i=1
s∈S s∈{−1,1}

where the summation is understood in Minkowskii’s sense. In particular, the first order optimality
condition states there exist (ρsi )s∈{−1,1},i∈[ns ] such that
ns
X X 1 X  
sP̂ (ĝ(X, S) = 1 | S = s) + s · ρsi · I ps {2η̂(X si , s) − 1} = sλ̂ = 0 .
ns i=1
s∈S s∈{−1,1}

The above implies that the pair (ĝ, λ̂) which constitute our algorithm satisfies
ns 
X X 1 X 
sP̂ (ĝ(X, S) = 1 | S = s) ≤ I ps {2η̂(X si , s) − 1} = sλ̂ .

ns i=1


s∈S s∈{−1,1}

We want to claim that under the assumption of the proposition for each s ∈ {−1, 1} it holds that
ns 
X 
I ps {2η̂(X si , s) − 1} = sλ̂ ≤ 1 .
i=1

But this is obvious, since each of the numbers {η̂(X si , s)}i∈[ns ] are assumed to be different and λ̂ is
some fixed number.

Statistical analysis Once the estimator is provided and its (deterministic) properties are under-
stood, we naturally ask the following question: which statistical guarantees do we need to establish?
Given the fact that we try to mimic the behaviour of fair optimal classifier, we will aim at establishing
two results: first, (near) satisfaction of the demographic parity constraint; second, near optimality
in terms of risk. More precisely, we want to show that our estimator ĝ satisfies

P (|P(ĝ(X, S) = 1 | S = −1) − P(ĝ(X, S) = 1 | S = 1)| ≥ δ) −→ 0 ,

when n−1 , n1 −→ +∞. Furthermore, since our classifier relies on the initial estimator η̂, we want to
make sure that as long as η̂ −→ η (say in `2 sense) we have

P (R(ĝ) ≥ R(g ∗ ) + δ) −→ 0 ,

when n−1 , n1 −→ +∞. In this part our goal is to establish finite sample versions of the above
guarantees.
First of all, we establish the fairness properties of the proposed algorithm. Namely, we will show
that under mild assumptions on η̂ the post-processed estimator nearly satisfies the demographic
parity constraint. The only tool for this analysis is the Dvoretzky-Kiefer-Wolfowitz inequality
recalled below and the deterministic properties established in the previous section.

Page 27 of 63
CHAPTER 2 Selected topics on algorithmic fairness

Reminder 2.2.1: DKW inequality


Theorem 2.2.4 (Dvoretzky-Kiefer-Wolfowitz inequality). Let U1 , . . . , Un , U be i.i.d. real
valued random variables. Then, for all δ ∈ (0, 1), with probability at least 1 − δ
n r
1 X log(2/δ)
sup I(Ui ≤ t) − P(U ≤ t) ≤ .

t∈R n i=1 2n

see (Massart, 1990, Corollary 1)

Theorem 2.2.1: Fairness guarantee


Assume that η̂ is such that for all s ∈ {−1, 1}, all elements of the following set are distinct

{η̂(X s1 , s), . . . , η̂(X sns , s)} ,

almost surely. Then, for all δ ∈ (0, 1) the “double post-procesing” estimator ĝ satisfies with
probability at least 1 − δ
s 
X log(4/δ ) 1
|P(ĝ(X, S) = 1 | S = −1) − P(ĝ(X, S) = 1 | S = 1)| ≤  + 
2ns ns
s∈{−1,1}

Proof. Using the triangle inequality we deduce that




X X
sP(ĝ(X, S) = 1 | S = s) ≤ S) = 1 | S = s) − S) = 1 | S = s)


P(ĝ(X, P̂(ĝ(X,
s∈{−1,1} s∈{−1,1}


X
+ sP̂(ĝ(X, S) = 1 | S = s) .
s∈{−1,1}

Under the conditions of the theorem, Proposition 2.2.3 asserts that




X X  1
sP(ĝ(X, S)=1 | S=s) ≤ S)=1 | S=s)− S)=1 | S=s) + ,

P̂(ĝ(X,
ns
P(ĝ(X,
s∈{−1,1} s∈{−1,1}

almost surely. It remains to bound the first term. To this end, we recall that
 
I(ĝ(X, S) = 1) = I η̂(X, S) > θ̂ ,

where θ̂ = 1/2 + S λ̂/2pS . Thus, we can bound P(ĝ(X, S) = 1 | S = s) − P̂(ĝ(X, S) = 1 | S = s)

by the following empirical process

sup P(η̂(X, S) ≤ t | S = s) − P̂(η̂(X, S) ≤ t | S = s) .

t∈R

Hence, Dvoretzky-Kiefer-Wolfowitz inequality applied to the above empirical process for each
s ∈ {−1, 1} and the union bound asserts that for all δ ∈ (0, 1) with probability at least 1 − δ
s 

4
 log( /δ) + 1  .
X X
sP(ĝ(X, S) = 1 | S = s) ≤

s∈{−1,1} s∈{−1,1} 2ns ns

Page 28 of 63
CHAPTER 2 Selected topics on algorithmic fairness

Note that to get the above result we have used Dvoretzky-Kiefer-Wolfowitz inequality conditionally
on η̂. In particular, for this argument, it is important that η̂ is independent from the additional
unlabeled dataset.
Note that in the above result we did not use Assumption 2.1.1. This is not surprising, since 2: 2022—Lecture 2
we have only derived the satisfaction of demographic parity constraint, while Assumption 2.1.1 is
used to derive the form of the optimal classifier. In other words, Assumption 2.1.1 has no impact
on our estimator. However, we assume that an analogous condition is satisfied on the level of the
estimator—all elements of {η̂(X s1 , s), . . . , η̂(X sns , s)} are almost distinct. As is, this condition does
not allow us to use any estimator that is based on partitioning (e.g., decision tree or kNN). This
limitation can be bypassed by considering slightly randomized version of η̂. In particular, we can
always set

η̂(x, s) 7→ η̂(x, s) + ξ(x, s) ,

where ξ(x, s) is a continuous random variable with bounded support, which is independent from η̂
and all the data. Essentially, this step allows to ensure that no ties occurs. The analysis can be
extended (try it yourself) to this randomized estimator.
Fairness guarantee is (at best) a half of the story. We need to derive complement it with some
control on the accuracy. Recall that our departure point was the comparison inequality (2.2). The
next result establishes a version of this inequality.
Theorem 2.2.2: Risk guarantee
Assume that η̂ takes values in [0, 1]. Let Assumption 2.1.1 be satisfied and the condition of
Theorem 2.2.1 hold. Then, for all δ ∈ (0, 1) the “double plug-in” estimator ĝ satisfies with
probability 1 − δ
s 
4
 log( /δ) + 1  .
X
R(ĝ) − R(g ∗ ) ≤ 4E|η(X, S) − η̂(X, S)| +
2ns ns
s∈{−1,1}

Proof. Following similar lines as in the proof of Theorem 2.1.1 we can deduce that
X X
R(ĝ) = E[Y ]+ E[ĝ(X, S)(ps {1 − 2η(X, S)} + sλ̂) | S = s]−λ̂ sP(ĝ(X, S) = 1 | S = s) .
s∈S s∈{−1,1}

Proposition 2.2.2 asserts that |λ̂| ≤ 1. Thus, applying Theorem 2.2.1 we deduce that with probability
at least 1 − δ we have
s 
4
 log( /δ) + 1  .
X X
R(ĝ) ≤ E[Y ] + E[ĝ(X, S)(ps {1 − 2η(X, S)} + sλ̂) | S = s] +
2ns ns
s∈S s∈{−1,1}

Triangles inequality implies that


X X
E[ĝ(X, S)(ps {1 − 2η(X, S)} + sλ̂) | S = s] ≤ E[ĝ(X, S)(ps {1 − 2η̂(X, S)} + sλ̂) | S = s]
s∈S s∈S
+ 2E|η(X, S) − η̂(X, S)| .

Recall that by definition of ĝ it holds that


 
ĝ(x, s) = I ps {1 − 2η̂(X, S)} + sλ̂ < 0 .

Page 29 of 63
CHAPTER 2 Selected topics on algorithmic fairness

Thus, we can deduce the following upper bound on the risk of ĝ: with probability at least 1 − δ
X
R(ĝ) ≤ E[Y ] + E[min{ps {1 − 2η̂(X, S)} + sλ̂, 0} | S = s]
s∈S
s 
X log(4/δ) 1
+ 2E|η(X, S) − η̂(X, S)| +  + .
2ns ns
s∈{−1,1}

Finally, Equation 2.1 of Theorem 2.1.1 asserts that


X 
R(g ∗ ) = E[Y ] − min
 
E max ps {2η(X, S) − 1} − sλ, 0 | S = s
λ∈R
s∈S
X   
= E[Y ] + max E min ps {1 − 2η(X, S)} + sλ, 0 | S = s
λ∈R
s∈S
X h  i
≥ E[Y ] + E min ps {1 − 2η(X, S)} + sλ̂, 0 | S = s .
s∈S

For the sake of a more compact presentation, we introduce



H(η, λ, s) := E[min ps {1 − 2η(X, S)} + sλ, 0 | S = s] .
In this notation, the upper bound derived on R(ĝ) in conjunction with the lower bound for R(g ∗ )
yield which, in conjunction with the above bound on R(ĝ) implies that with probability at least
1−δ
X 
R(ĝ) − R(g ∗ ) ≤ H(η̂, λ̂, s) − H(η, λ̂, s) + 2E|η(X, S) − η̂(X, S)|
s∈S
s 
X log(4/δ) 1
+  +
2ns ns
s∈{−1,1}
s 
X log(4/δ) 1
≤ 4E|η(X, S) − η̂(X, S)| +  + ,
2ns ns
s∈{−1,1}

where for the last inequality we use the fact that H(η, λ, s) − H(η 0 , λ, s) ≤ ps E[|η(X, S) − η 0 (X, s)| |
S = s] for any λ ∈ R and s ∈ {−1, 1} (which follows from min{a, 0} − min{b, 0} ≤ |a − b| for all
a, b ∈ R).
Inspecting the proof of Theorem 2.2.2 one can notice that all the arguments were deterministic,
the randomness was involved only implicitly and controlled by Theorem 2.2.1. Thus, we can actually
make a slightly stronger statement
Corollary 2.2.4.1. Assume that η̂ takes values in [0, 1]. Let Assumption 2.1.1 be satisfied and the
condition of Theorem 2.2.1 hold. Then, for all δ ∈ (0, 1), the “double plug-in” estimator ĝ satisfies
with probability at least 1 − δ
s 
X log(4/δ ) 1
R(ĝ) ≤ R(g ∗ ) + 4E|η(X, S) − η̂(X, S)| +  +  ,
2ns ns
s∈{−1,1}
s 
X log(4/δ ) 1
|P(ĝ(X, S) = 1 | S = −1) − P(ĝ(X, S) = 1 | S = 1)| ≤  +  ,
2ns ns
s∈{−1,1}

simultaneously.

Page 30 of 63
CHAPTER 2 Selected topics on algorithmic fairness

Given the initial goal of estimating the demographic parity fair optimal classifier, the above
corollary and the derived bounds can be considered as satisfactory. Indeed, we have essentially
managed to mimic the “classical” plug-in approach to binary classification with a rather mild
additional assumption—the continuity of η(X, S). The appealing feature of the approach that we
have considered here is its “robustness” to the exact constraint. In particular, one can replace it
by equal opportunity/relaxed demographic parity or nearly any other constraint. This approach is
summarized schematically bellow
• Define explicitly the optimal classifier
• Simplify, as possible, the expression for it; obtain closed form expression
• Build plug-in classifier, estimating the unknown quantities appearing in its expression;

• Using the derived relations provide guarantees on the constraint satisfaction and risk;
Again recall that the fairness guarantee in Theorem 2.1.1 is established in a “distribution-free
manner”—it holds no matter the underlying distribution (to be more precise, we need to make sure
that no ties occurs in our estimator, see the corresponding discussion). Hence, this approach is
relevant from fairness satisfaction perspective. However, especially in the learning literature, plug-in
type approaches are not favored. Instead, Empirical Risk Minimization algorithms are often preferred.
Plug-in approaches are often studied (from minimax point of view) on huge non-parametric classes
of smoothness where the optimal rates of classification are identical to that of corresponding problem
of regression (this is the result of Yang (1999)). As you know, non-parametric rates of convergence
are very slow (though optimal) and such non-parametric assumption (as a choice of reality proxy)
is not very appealing. Later, Audibert and Tsybakov (2007) resurrected the plug-in approach
on non-parametric classes via the margin assumption. The question of usefulness of the margin
assumption in the case of fairness constrained classification remains open and any contribution in
this direction is welcomed.

2.3 Bibliographic references


Post-processing as an approach to build algorithms satisfying group-fairness constraints gained
much popularity after the observation that optimal classifiers under several fairness constraints can
be obtained via group-wise thresholding Corbett-Davies et al. (2017); Hardt et al. (2016). This
idea developed in several works Celis et al. (2019); Chzhen et al. (2019); Menon and Williamson
(2018), mostly inspired by the contribution of Hardt et al. (2016) and the introduction of the
equal opportunity fairness constraint. The analysis of this chapter is an outcome of a substantial
simplification of a series of works (Chzhen et al., 2019, 2020a; Schreuder and Chzhen, 2021) (all
being in different contexts, but sharing the core ideas), where such post-processing approach was
developed and statistically analysed.

Page 31 of 63
CHAPTER 2 Selected topics on algorithmic fairness

Page 32 of 63
Chapter 3

In-processing: fair ERM and


reduction to iterative cost-sensitive
classification

We consider the same formal setup as in the previous chapter. The main difference with the previous
chapter is our view on the target. In particular, we consider a fixed and known set of classifiers
G ⊂ Z {0,1} . Then, our goal is to mimic the performance of
gG∗ ∈ arg min {P(g(Z) 6= Y ) : g is fair w.r.t. P} . (3.1)
g∈G

In the previous chapter we have been dealing with the same problem but without the restriction
of the possible set of classifiers. In particular, we have seen that if we pick demographic parity as
our fairness constraint, then, under the continuity assumption the fair optimal prediction exists.
Meanwhile, here, the introduction of G can actually pose some issues. For one, we are not even sure
that the set
{g is fair w.r.t. P} ∩ G ,
is non-empty—it depends on the combination of G and P. Actually, if we want to perform actual
learning, we want that G is such that {g is fair w.r.t. P} ∩ G is non-empty for a large family (or
even all of them) of distributions P. Note that this question is much less important for the case of
awareness Z = X × S—since we can pick separate prediction function for each group, it is reasonable
to expect that there are choices that guarantee us the satisfaction of the eventual fairness constraint.
In the case of unawareness, however, the situation is not that clear. In the next section we will
study this case more carefully.

3.1 Linear classifiers and unawareness


We place ourselves in the case of Z = X —unawareness framework. To understand if {g is fair w.r.t. P}∩
G is non-empty we need to make our choice of fairness definition and G. As in the most of this
lecture note, we focus on demographic parity constraint
{g is fair w.r.t. P} ⇐⇒ {P(g(X) = 1 | S = s) = P(g(X) = 1) ∀s ∈ S} .
Which set of prediction functions should we study? There are too many choices to consider, so let
us focus on those that are often used in practice—linear classifiers. That is, G = Glin where
Glin = g : ∃(w, b) ∈ Rd+1 s.t. ∀x ∈ Rd

g(x) = I(hw, xi + b ≥ 0) .

33
CHAPTER 3 Selected topics on algorithmic fairness

Note that a lot of classifiers used in practice can be represented as I(hw, Φ(x)i + b ≥ 0), where
x 7→ Φ(x) is some feature mapping—they are linear in a possibly lifted space. As long as this
feature mapping is not actually learned (e.g., vanilla SVM) we can think that our observations are
(Φ(X i ), Si , Yi )ni=1 . Having this in mind, the following discussion has a broader applicability than
the case of purely linear classifiers.
With this choice of Glin , the feasible set of the problem of interest is non-empty—we can take
(w, b) = (0, 1) and (w, b) = (0, −1), which corresponds to constant predictions (those that always
give 1 and those that give 0 respectively). Of course, we are not nearly satisfied with only constant
predictions: they are not even truly linear classifiers (rather degenerate ones). The more interesting
question is: is the set Glin ∩ {g is fair w.r.t. P} contains non-constant classifiers for any distribution
P? To partially answer this question, we recall (we write recall, but it is not common knowledge in
statistical community; so, no worries.) the following classical result in geometry.
Reminder 3.1.1: Ham-sandwich theorem
Theorem 3.1.1. Given any d continuous probability measures µ1 , . . . , µd on Rd , there exists
a linear classifier g ∈ Glin , which satisfies
 1
x ∈ Rd : g(x, 1) = 1 =

µi ∀i = 1, . . . , d .
2
Such a classifier is called the ham-sandwich cut.
Furthermore, the above theorem is “sharp” in the following sense.
1. There exist d + 1 continuous probability measures µ1 , . . . , µd , µd+1 on Rd such that for any
g ∈ Glin either there exists i, j ∈ [d + 1] with

µi x ∈ Rd : g(x, 1) = 1 6= µj x ∈ Rd : g(x, 1) = 1
   
,

or g(x, 1) ≡ 1 or g(x, 1) ≡ 0.
2. There exist two continuous probability measures µ1 , µ2 on Rd such that for any g ∈ Glin
such that

µ1 x ∈ Rd : g(x, 1) = 1 = µ2 x ∈ Rd : g(x, 1) = 1 = α ,
   

it holds that α = 1/2.


(see e.g., Matousek, 2007)

The interested reader can lookup the proof of this result in (Section 3 in Matousek, 2007), which
is classically based on the Borsuk-Ulam theorem. For the first counterexample one can think of
d + 1 measures diffused around the basis vectors and the origin in Rd , while for the second one can
imagine µ1 being diffused in the vicinity of the origin and µ2 being diffused on a sphere centered at
the origin with a large enough radius.
Which conclusions can we draw from the Ham-sandwich theorem? First of all, as long as we
have at most d groups (d being the ambient dimension of features), we will always find a linear
classifier which satisfies the demographic parity—the ham-sandwich cut gives it. In particular, for
any K ≤ d, assuming that the distributions of (X | S = s) are continuous for all s ∈ [K], there
always exists a linear classifier g ∈ Glin such that

1
P(g(X, S) = 1 | S = s) = P(g(X, S) = 1) = , ∀s ∈ [K] .
2
Thus, apart from the constant classifiers we always have ham-sandwich cut classifiers in Glin . Secondly,
if the number of groups K > d, then there exist group-wise distributions of the features, such that
only constant classifiers achieve the demographic parity in Glin (take all or no-one). Finally, in

Page 34 of 63
CHAPTER 3 Selected topics on algorithmic fairness

general, constant classifiers and ham-sandwich cuts are the only classifiers that satisfy demographic
parity in Glin (take all, or no one, or half from each). The attentive reader will have noticed that
this discussion is actually applicable for the equalized odds as well as for the equal opportunity
fairness constraints. One only needs to consider the law of (X | S = s, Y = y). Practically, this
discussion implies that it is important to make sure that K < d, for instance, by lifting the original
space with polynomial feature map (in that case we enter the realm of polynomial ham-sandwich
theorem, which allows to bisect more measures).
This, rather simple, analysis suggests that the approach based on the restriction of the set of
classifiers to “usual” learning concepts should be used with caution. Nevertheless, it is popular in
the literature, and as we shall see later, the issue of restricting the set of classifiers is actually often
bypassed (but without explicitly internalizing potential issues) via randomized classifiers.

3.2 Fair empirical risk minimization: a direct approach


Once we understood potential pitfalls on this approach, we will focus on the following problem

gG∗ ∈ arg min {P(g(X, S) 6= Y ) : P(g(X, S) = 1 | S = 1) = P(g(X, S) = 1 | S = 1)} , (3.2)


g∈G

where G is a known set of classifiers on Rd × {−1, 1}. Given the form of the problem in Eq. (3.2)
and our experience in statistical learning, a natural approach is empirical risk minimization. In that
case, however, it is more appropriate to call such an approach empirical risk/empirical constraint
minimization, but we stick to the simpler terminology.
In particular, let Ds = {(X s1 , S, Y1s ), . . . , (X sn , S, Yns )} be ns i.i.d. samples following the distribu-
tion of (X, S, Y ) | S = s for all s ∈ {−1, 1}. Note that in this section we again deploy the per-group
sampling scheme and assume that ps = P(S = s) are known for all s ∈ {−1, 1}. Again, this is
mainly done to slightly simplify the proofs and to avoid the annoying scenario when no samples are
observed from one of the groups. In particular, we again treat ns := |Ds | as non-random.
Given a parameter ε > 0 to be specified, we build our estimator as
 
 X p X
s 1 X 1 X 
ĝG ∈ arg min I(g(X, S) 6= Y ) : g(X, S) − g(X, S) ≤ ε .
g∈G  ns n−1 D−1 n1 
s∈{−1,1} Ds D1
(3.3)

For compactness, in the above formulation, notation D stands for (X,S,Y )∈D . Notice that
P P

compared to the problem in Eq. (3.2) we have added relaxation parameter ε > 0. The addition of
this extra parameter is due to the finite sample effects that take place once we pass from the true
quantities to their estimates.
To derive statistical guarantees, we first remind some standard learning theory results. It is
generally supposed that the reader is already familiar with these results. Actually, we will treat
these results as “black boxes” and a deep understanding is not necessary per-se.
First we start with the standard concentration inequality.
Reminder 3.2.1: Hoeffding’s inequality
Let V1 , . . . , Vn , V be i.i.d. random variables such that V ∈ [0, 1] almost surely, then for all
δ ∈ (0, 1), with probability at least 1 − δ we have
n r
1 X log(2/δ)
Vi − E[V ] ≤ .

n 2n


i=1

We will also need some results from VC and empirical process theories.

Page 35 of 63
CHAPTER 3 Selected topics on algorithmic fairness

Reminder 3.2.2: VC-dimension


Definition 3.2.1 (VC-dimension). Let G be a class of binary classifiers on Z with |Z| = ∞.
For any finite Z 0 ⊂ Z set

G|Z 0 = {(g(z))z∈Z 0 : g ∈ G} .
0
We say that G shatters Z 0 if |G|Z 0 | = 2|Z | . The VC-dimension of G, is the size of the largest
subset of Z 0 that can be shattered by G. If every finite Z 0 ⊂ Z is shattered by G, we say that
the VC-dimension of G is infinite.

Reminder 3.2.3: VC-bound


Theorem 3.2.2. Let G be a class of measurable binary classifiers on Z with finite VC-
dimension V (G) ≥ 1. Let Z 1 , . . . , Z n , Z be i.i.d. random variables distributed according to P,
then there exists C ≥ 0 such for any such distribution P and all δ ∈ (0, 1) with probability at
least 1 − δ we have
n r
1 X V (G) + log(2/δ)
sup g(Z i ) − E[g(Z)] ≤ C .

g∈G n i=1
n

(see e.g., Vershynin, 2018, Theorem 8.3.23 + McDiarmid’s inequality)

Example 3.2.3. If Z = Rd and G is the set of linear classifiers on Rd , then V (G) = d + 1. In


particular, for d = 1 the result from the above reminder is equivalent up to constant factors to the
Dvoretzky–Kiefer–Wolfowitz inequality that we have used in the previous chapter. The proof of this
result follows by application of Radon’s lemma for the upper bound and taking points on the simplex
for the lower bound.

Reminder 3.2.4: A useful lemma


Lemma 3.2.4. Let G be a class of binary classifiers on Z. Consider

F = {f : Z × {0, 1} → {0, 1} : f (z, y) = I(g(z) 6= y), (z, y) ∈ Z × {0, 1}, g ∈ G} ,

a class of binary classifiers on Z × {0, 1}, then V (F) = V (G).


see e.g., Lemma 8.1 in Hajek and Raginsky’s lecture note
http://maxim.ece.illinois.edu/teaching/SLT/SLT.pdf
Given the above reminders, we can quickly analyse the statistical properties of the estimator in
Eq. (3.3).
Theorem 3.2.5. Let G be a class of measurable classifiers on Z = Rd × {−1, 1} with finite VC-
dimension V (G). Consider gG∗ and ĝG defined in Eqs. (3.2) and (3.3) respectively, for all δ ∈ (0, 1)
q
4/δ )
set ε = s∈{−1,1} log(
P
2ns , then with probability at least 1 − δ it holds that
s
X V (G) + log(16/δ)
R(ĝG ) ≤ R(gG∗ ) + 2C ps ,
ns
s∈{−1,1}
s

P(ĝG (X, S) = 1 | S = −1) − P(ĝG (X, S) = 1 | S = 1) ≤ ε + C
X V (G) + log(4/δ)
,
ns
s∈{−1,1}

where C is defined in Theorem 3.2.2.

Page 36 of 63
CHAPTER 3 Selected topics on algorithmic fairness

Let us first discuss this result. First of all we note that the fairness guarantee of Theorem 3.2.5
yields similar sample complexity to that of Corollary 2.2.4.1 that we derived for the plug-in rule.
In particular, both guarantees suffer from unbalanced sample—the rate is mainly governed by
the size of the smallest group. The risk guarantees of Theorem 3.2.5 and Corollary 2.2.4.1 are
rather different. First of all they differ in the very first term; recall, that we did not constrain
the set of possible classifiers in the previous chapter. Furthermore, while the risk bound of
Corollary 2.2.4.1 is informative only if we are able to properly estimate the regression function
η(X, S) = P(Y = 1 | X, S), the bound of Theorem 3.2.5 does not rely on any such estimator.
Looking at the rate of convergence for the risk, we note that if ns = ps n for some n ∈ N, then this
rate does not actually suffer from group imbalance. This is not the case for Corollary 2.2.4.1, where
the bound did depend only on ns and not on n.
Proof. Since ĝG is feasible for the problem in Eq. (3.3), then


1 X 1 X

|D−1 | ĝG (X, S) − ĝG (X, S) ≤ ε .
|D 1 |
D−1 D1
Triangle inequality implies that
|P(ĝG (X, S) = 1 | S = −1)−P(ĝG (X, S) = 1 | S = 1)| ≤

X 1 X
ε+ ĝG (X, S) − E[ĝG (X, S) | S = s] .

|Ds |


s∈{−1,1} Ds

By definition, the sets Ds are composed of |Ds | i.i.d. samples from the distribution of (X, S, Y ) | S = s
therefore we can apply Theorem 3.2.2 to deduce that
 X s 
X V (G) + log(4/δ)
(3.4)

P sP(ĝG (X, S) = 1 | S = s) > ε + C ≤δ .
ns
s∈{−1,1} s∈{−1,1}

Thus, we have derived the fairness guarantee. For the risk guarantee, we note that by definition of
gG∗ we have
P(gG∗ (X, S) = 1 | S = 1) = P(gG∗ (X, S) = 1 | S = 1) .
Again, by triangle’s inequality we deduce that


1 X ∗ 1 X

X 1 X
∗ ∗

gG (X, S) − gG (X, S) ≤ gG (X, S) − E[gG (X, S) | S = s] .

|D−1 | |D1 | |Ds |


D−1 D1 s∈{−1,1} Ds

Note that, unlike in the part where we derived the fairness constraint, we do not actually need to
control the supremum of an empirical process. Indeed, gG∗ is some fixed, data independent, classifier.
Thus, using Hoeffding’s inequality we deduce that with probability at least 1 − δ
s

1 X ∗ 1 X X log(4/δ)

(3.5)


|D−1 | gG (X, S) − gG (X, S) ≤ =ε .
|D1 | 2ns
D−1 D1 s∈{−1,1}
That is, with probability 1 − δ, gG∗ is feasible for the optimization problem in Eq. (3.3). From now
on we execute the standard proof of ERM-type generalization bound. Set
 
X 1 X
R̂(g) := ps R̂s (g) := I(g(X, S) 6= Y ) ,
ns
s∈{−1,1} Ds
X  
R(g) = ps Rs (g) := P(g(X, S) 6= Y | S = s) ,
s∈{−1,1}

Page 37 of 63
CHAPTER 3 Selected topics on algorithmic fairness

and define the following event


 
 1 X 1 X 
gG∗ (X, S) − gG∗ (X, S) ≤ ε

A := .
 |D−1 | |D1 | 
D−1 D1

Note that we have already shown that P(A) ≥ 1 − δ. We can deduce from the definition of ĝG that,
on the event A,
R(ĝG ) − R(gG∗ ) ≤ R(ĝG ) − R̂(ĝG ) + R̂(gG∗ ) − R(gG∗ ) + R̂(ĝG∗ ) − R̂(gG∗ )
  
| {z }
≤0 on A
X
≤ R(ĝG ) − R̂(ĝG ) + R̂(gG∗ ) − R(gG∗ ) ≤ 2
 
ps sup R̂s (g) − Rs (g) .

g∈G
s∈{−1,1}

To conclude let us put ourselves in the context of Lemma 3.2.4 and realize that for all s ∈ {−1, 1}


1 X
sup R̂s (g) − Rs (g) = sup f (X, Y ) − E[f (X, Y ) | S = s] .

g∈G f ∈F ns
(X,S,Y )∈Ds

Thus, first applying Lemma 3.2.4 and then Theorem 3.2.2 we deduce that with probability at least
1−δ
s
X X V (G) + log(4/δ)
2 ps sup R̂s (g) − R(g) ≤ 2C ps .

g∈G ns
s∈{−1,1} s∈{−1,1}

Hence, on the intersection of the above event with A it holds that


s
X V (G) + log(4/δ)
R(ĝG ) − R(gG∗ ) ≤ 2C ps .
ns
s∈{−1,1}

Union bound implies that with probability at least 1 − δ we have


s
X V (G) + log(8/δ)
R(ĝG ) − R(gG∗ ) ≤ 2C ps .
ns
s∈{−1,1}

We conclude by combining our fairness bound with the risk bound.


After going through the proof let us note couple of drawbacks of this approach. First of all this
proof does not really teach us anything about the original problem of fair classification; Secondly,
even if we had a black-box which performs empirical risk minimization on misclassification loss
without any constraints (which we do not, at least for general G), it is still unclear how to properly
incorporate constraints in such a black-box. While the first issue is more philosophical, and certainly
subject to criticism, the second issue is very practical. In the next section we will partially address
it.
22—Lecture 3

3.3 Reduction to cost-sensitive classification: passing to ran-


domized classifiers
In this section we well address one of the limitations of the previous ERM algorithm—this time, we
will have an idea of how to implement it. Actually, the algorithms of this section is implemented

Page 38 of 63
CHAPTER 3 Selected topics on algorithmic fairness

in the fairlearn—a python package initially developed by a group of researchers from Microsoft
which became an independent project later.
On a (very) high level, the algorithm utilizes the idea that all optimization problems are actually
linear programs. At first sight, this idea does not make any sense—optimization is a mature field, if
the above claim is true, why do we need all the available techniques. Stay with me. As a motivating
example, let us consider the following optimization problem

(P1 ) := min {f (x) : Ax ≤ b} , (3.6)


x∈Θ

where f : Θ → R, Θ ⊂ Rd , A ∈ Rm×d and b ∈ Rm . Assume that the minimum of the above problem
is attained. Clearly, the above problem is rather generic, actually, pretty much any optimization
problem can be represented in such a way. Thus, looking at it, it has nothing to do with linear
programming in general. In order to pass to a linear program, we are going to relax this problem by
introducing ∆(Θ)—the set of probability distributions supported on Θ. Thus, we can write
Z Z 
min {f (x) : Ax ≤ b} ≥ min f (x) dµ(x) : Ax dµ(x) ≤ b =: (P2 ) . (3.7)
x∈Θ µ∈∆(Θ)

Indeed, a solution x∗ of the problem in Eq. (3.6) can be seen a µ = δx∗ —the Dirac measure at
x∗ . The optimization problem in (P2 ) is a linear program. Of course, in general, (P2 ) does not
have anything to do with (P1 ). However, for our purposes of building fair classifiers, this idea is
useful: we will think of x as a classifier; Θ as a restriction of the family of classifiers; f as the risk;
and the linear constraint Ax ≤ b as some fairness constraint. In this paradigm, by solving (P2 ) we
will find a measure µ∗ , which, in the context of our interpretation will be supported on classifiers.
The prediction according to the fuzzy classifier µ∗ will be made by first sampling from µ∗ one fixed
classifier and, then, making the prediction according to the sampled classifier.
In order to execute the above philosophy, we need some efficient solver for the problem (P2 ).
Introducing Lagrange multipliers, (P2 ) can be written as
Z Z 
>
(P2 ) = min max
m
f (x) dµ(x) + λ Ax dµ(x) − b .
µ∈∆(Θ) λ∈R , λ≥0

For reasons that will be clarified later in the text and for our purposes, instead of solving (P2 ) in
the for described above, we shall rather need a solver for
Z Z 
>
B
(P2 ) := min m
max f (x) dµ(x) + λ Ax dµ(x) − b , (3.8)
µ∈∆(Θ) λ∈R+ , kλk1 ≤B

for some B > 0. In the above min max problem we have introduced an additional constraint on λ. In
the next section, we shall develop this intuition for the already familiar problem of risk minimization
under the fairness constraints.

3.3.1 Approximate saddle point yields guarantees


Recall, that in this chapter we are interested in the following problem
 
 
X
min R(g) : sP(g(X, S) = 1 | S = s) =0 ,
g∈G 
s∈{−1,1}

where G is some known class of binary classifiers. As it is described in the beginning of the previous
section, instead of working with classifiers, we are going to consider distributions on them ∆(G).

Page 39 of 63
CHAPTER 3 Selected topics on algorithmic fairness

Thus, we reformulate our initial problem as


 
 X 
 
(P ) = min Eg∼µ [R(g)] : sEg∼µ P(g(X, S) = 1 | S = s) = 0 ,
µ∈∆(G)  s∈{−1,1} 

where g ∼ µ is assumed to be independent from (X, S, Y ). In words, for prediction, a randomized


classifier µ ∈ ∆(G) first samples g ∼ µ independently from the new observation (X, S), then
the prediction is performed by g(X, S). Note that the optimal value of this new problem is not
larger than that of the original one. Furthermore, it is important to understand that for this new
formulation, the discussion that relied on the ham-sandwich theorem is not applicable even in the
case of Glin . Indeed, the very same observation (x, s) can be classified differently, depending on the
actual realization of g ∼ µ.
While in the previous sections, for the sake of simplicity and transparency, we were dealing
exclusively with the demographic parity constraint, here it is actually easier to work with a more
general formulation. In particular, we observe that (P ) can be represented as

(P ) = min {R(µ) : Am(µ) ≤ c} , (3.9)


µ∈∆(G)

where, by abuse of notation, we set R(µ) = Eg∼µ [R(g)], m(µ) = E(g,X,S)∼µ⊗PX,S [g(X, S) | S =
>
s] s∈S and, in the case of demographic parity,
 
1 −1
A= and c = (0, 0)> .
−1 1

More generally, we can consider


>
m(µ) = E(g,X,S,Y )∼µ⊗PX,S [g(X, S) | Ej ] j∈J ,

where (Ej )j∈J defines some events, which are independent from g ∼ µ. For instance, in the case of
equal opportunity with S = {−1, 1}, we can set J = {1, 2}

E1 = {S = 1, Y = 1} and E2 = {S = −1, Y = 1} .

With an appropriate choice of A and c we can recover the exact and approximate equal opportunity
constraint. We remark that in this paradigm, the matrix A, the vector c, and the events (Ej )j∈J are
assumed to be known beforehand, while the unknown quantities lie within the vector m(µ). In what
follows, we are going to use this general, linear constraint formulation, but the statistical analysis
will be performed only for the combinations of A, c, m, corresponding to the exact demographic
parity constraint.

Exercise 3.3.1. Following the analysis for the case of exact demographic parity constraint, extend
it to the case of general A, c.

ERM over randomized classifiers In order to find an approximate solution of (P ), as in the


previous section, we are going to consider the empirical version of it. The main difference in this
case is that we are going to specify a more computationally friendly algorithm to solve it. More
precisely, for some ε > 0 we introduce
n o
(P̂ ) = min R̂(µ) : Am̂(µ) ≤ cε , (3.10)
µ∈∆(G)

Page 40 of 63
CHAPTER 3 Selected topics on algorithmic fairness

where R̂(µ) = Eg∼µ [R̂(g)] and

1 X
m̂s (µ) := Eg∼µ [g(X, S)] ∀s ∈ {−1, 1} and cε = (ε, ε)> .
ns
(X,S,Y )∈Ds

Note that the problem in Eq. (3.10) is exactly the LP relaxation of the problem in Eq. (3.3) that
we considered in the study of ERM under empirical fairness constraints (only written in a more
convenient, for this study, notation).
Similarly, to the way we have derived the expression for the optimal fair prediction, we define a
Lagrangian

L̂(µ, λ) := R̂(µ) + λ> (Am̂(µ) − cε ) . (3.11)

We remind that in the case of demographic parity, A ∈ R2×2 and λ ∈ R2+ .

Proposition 3.3.2. Let A, c correspond to the demographic parity constraint with S = {−1, 1}.
Assume that for some B > 0 there exist τ > 0, µ̂ ∈ ∆(G) and λ̂ ∈ R2+ with kλ̂k1 ≤ B such that

L̂(µ̂, λ̂) ≤ L̂(µ, λ̂) + τ ∀µ ∈ ∆(G) ,


(3.12)
L̂(µ̂, λ̂) ≥ L̂(µ̂, λ) − τ ∀λ ∈ R2+ , kλk1 ≤ B .

Let G be a class of measurable classifiers on Z = Rd × {−1, 1} with finite


q VC-dimension V (G) and
log(4/δ)
µ be any minimizer of the problem in Eq. (3.9). set ε = s∈{−1,1}

P
2ns , then with probability
at least 1 − δ it holds that
s

X V (G) + log(16/δ)
R(µ̂) ≤ R(µ ) + 2τ + 2C ps ,
ns
s∈{−1,1}

Eg∼µ̂ P(g(X, S) = 1 | S = −1) − P(g(X, S) = 1 | S = 1) ≤ ε+ 1+2τ
 
B
s
X V (G) + log(4/δ)
+C ,
ns
s∈{−1,1}

where C is defined in Theorem 3.2.2.

Essentially, Proposition 3.12 states that is one is able to find a pair (µ̂, λ̂) which satisfies the
conditions of Eq. (3.12), then for sufficiently small τ > 0 and sufficiently large B > 0, one recovers the
guarantee of Theorem 3.2.5 stated for pure ERM. Let us compare the result of the above proposition
with that of Theorem 3.2.5. First of all, we recall that Proposition 3.3.2 is stated for randomized
classifiers, while Theorem 3.2.5 analysed deterministic once. Secondly, unlike Theorem 3.2.5, the
above proposition is potentially easier to combine with an appropriate optimization algorithm, which
can provably guarantee the satisfaction of Eq. (3.12). Of course, it is still, at this stage, unclear how
to satisfy the conditions of Eq. (3.12).

Proof. The first step is to show that if such pair (µ̂, λ̂) exists, then µ̂ approximately satisfies the
empirical constraints and is an approximate minimizer of the empirical risk under these constraints.
By the second property in Eq. (3.12), it holds for all λ ∈ R2+ , kλk1 ≤ B that

>
Eg∼µ̂ [R̂(g)] + λ̂ (Am̂(µ̂) − cε ) ≥ Eg∼µ̂ [R̂(g)] + λ> (Am̂(µ̂) − cε ) − τ .

Page 41 of 63
CHAPTER 3 Selected topics on algorithmic fairness

Which, after rearranging and an appropriate choice of λ implies that

>
λ̂ (Am̂(µ̂) − cε ) + τ ≥ B · kAm̂(µ̂) − cε k∞,+ , (3.13)

where kxk∞,+ = maxj max{xj , 0}. Furthermore note that for any µ such that Am(µ) ≤ cε , we
have thanks to the first property in Eq. (3.12) and the fact that λ ≥ 0 that

>
Eg∼µ̂ [R̂(g)] + λ̂ (Am̂(µ̂) − cε ) ≤ Eg∼µ [R̂(g)] + τ . (3.14)

The combination of Eq. (3.14) and Eq. 3.13 yields

Eg∼µ [R̂(g)] + τ ≥ Eg∼µ̂ [R̂(g)] + B · kAm̂(µ̂) − cε k∞,+ − τ . (3.15)

In particular, from the above we conclude that for any µ such that Am(µ) ≤ cε it holds that

Eg∼µ̂ [R̂(g)] ≤ Eg∼µ [R̂(g)] + 2τ . (3.16)


Eg∼µ [R̂(g)] + 2τ 1 + 2τ
Am̂(µ̂) ≤ cε + 1 ≤ cε + 1 , (3.17)
B B

with 1 = (1, 1)> . The above means that µ̂ is an approximate minimizer of the empirical risk under
the fairness constraint and that it does satisfy this constraint approximately. From now on it remains
to execute the proof for the empirical risk minimizer presented in Theorem 3.2.5. Let us describe
this reduction. Note that Jensen’s inequality

kA(m̂(µ̂) − m(µ̂))k∞ ≤ Eg∼µ̂ kA(m̂(g) − m(g))k∞ ≤ sup kA(m̂(g) − m(g))k∞ .


g∈G

The above quantity is exactly the one that we control in Eq. (3.4) in the proof for the pure ERM.
For the risk, we note that if µ∗ is a solution of problem in Eq. (3.9), then by the same arguments as
in Eq. (3.5) weq deduce that µ∗ is feasible with high probability for problem in Eq. 3.10 as long as
4/δ )
ε = s∈{−1,1} log( 2ns . From here, we finish by exactly the same argument as in Theorem 3.2.5 to
P

derive the final risk guarantee.

Note that from the perspective of statistical analysis, the passage from deterministic classifiers
to the randomized once does not significantly modify the proof strategy. Essentially, while pure
ERM by its very definition guarantees empirical optimality and satisfaction of empirical constraints,
here we establish these properties only approximately. Once it is done, the proof goes along the
lines of that of Theorem 3.2.5.

3.4 Finding approximate saddle point


Our goal in this section is to develop an algorithm, which outputs (µ̂, λ̂) satisfying the approximate
saddle point conditions in Eq. (3.12). To develop such a result, we will make use of the theory devel-
oped for online linear optimization and assume that we have access to a cost-sensitive classification
black-box solver.

Page 42 of 63
CHAPTER 3 Selected topics on algorithmic fairness

Reminder 3.4.1: Follow The Regularized Leader (FTRL)


Let Λ ⊂ RK be a convex set. Consider the following protocol: for t = 1, 2, . . .
1. Output z t ∈ Λ;

2. Receive r t ;
3. Suffer loss hz t , r t i;
The goal is to minimize the regret defined as
T
X T
X
RT = max hz, r t i − hz t , r t i .
z∈Λ
t=1 t=1

Theorem 3.4.1. Assume that Λ = z ∈ RK
+ : kzk1 = B , consider the following algorithm

−1
( T K
)
X 1X
z T = arg min − hz, r t i + zi log(zi ) ∀T ≥ 1 , (3.18)
z∈Λ t=1
η i=1

then it holds that


B log(K)
RT ≤ + ηBT max kr t k2∞ .
η t=1,...,T

Furthermore, the optimization problem in Eq. (3.18) admits a closed form solution for all
T ≥ 1 and i ∈ [K] as
T
exp (−η(r̄ T −1 )i ) X
(z T )i = B · PK where r̄ T = r t = r̄ T −1 + r T and r 0 = 0 .
j=1 exp(−η(r̄ T −1 )j ) t=1

Gilles Stoltz proves the above result, see also (Orabona, 2019).
Let us return back to the problem of interest—finding approximate saddle point of the Lagrangian
defined in Eq. (3.11) for the demograpphic parity constraint (so that λ ∈ R2 ). We analyze
the following procedure which is called Exp. gradient reduction for fair classification:
initialize λ1 = (1, 1) ∈ R2 , r 0 = (0, 0); for t = 1, . . . , T
exp(−η(r̄ t−1 )i )
1. Output: (λt )i = B · 1+ 2j=1 exp(−η(r̄ t−1 )j )
P for i = 1, 2;

2. Cost-sensitive classification: gt ∈ arg ming∈G L̂(g, λt );


3. Receive: r t = Am̂(gt ) − cε ;
4. Update: r̄ t = r̄ t−1 + (Am̂(gt ) − cε );
PT
Approximate saddle point candidate: µ̂ = Unif{g1 , . . . , gT }; λ̂ = T1 t=1 λt .
The above procedure has three input parameters: the upper-bound B on the sum of the
components of λ; the regularization parameter η > 0; the number of iterations T ∈ N. The eventual
choice of these parameters will be given following the approximate saddle point guarantees that
we shall derive. In order to implement this procedure in practice, one needs access to a black-box
solver for

min L̂(g, λt ) .
g∈G

Page 43 of 63
CHAPTER 3 Selected topics on algorithmic fairness

Let us understand why it is reasonable to expect that such a solver is available. To this end, we
re-write this optimization problem using appropriately chosen cost-sensitive classification objective.
Exercise 3.4.2. Show that ming∈G L̂(g, λt ) can be represented as a cost-sensitive classification in
the sense given in Eq. 3.19.
Cost-sensitive classification is any optimization problem that can represented as
n  
1X 0
min 1
ci I(g(X i , Si ) = 1) + ci I(g(X i , Si ) = 0) , (3.19)
g∈G n
i=1

for some (c0i , c1i )i=1,...,n representing the cost of each type of errors on the observation (X i , Si , Yi ).
For example, usual classification can be represented in above form using c0i = I(Yi = 0) and
c1i = I(Yi = 1). Note that the above problem is still computationally infeasible in general; however,
since it does not involve any constraints except the restriction of the class of classifiers, it is easier
to expect that an efficient solver is available in this case. Despite this discussion, there is still a
gap between theory and practice in this approach; in practice the indicator functions are replaced
by their convex surrogates (e.g., cost-sensitive logistic regression), to the best of our knowledge, a
complete theoretical justification of this replacement missing (the cost-sensitive solver is actually
used in the loop of another algorithm).
Proposition 3.4.3. The Exp. gradient reduction for fair classification, instantiated
with demographic parity constrains with S = {−1, 1}, satisfies the approximate saddle point condition
in Eq. (3.12) with
B log(3)
τ≤ + ηB .
ηT
p p
In particular, setting η = log(3)/T it holds that τ ≤ 2B log(3)/T .
Proof. Our goal is to apply Theorem 3.4.1 with
 
Λ = z ∈ R3+ : kzk1 = B and z = (λ, z3 ) and r t =

Am̂(gt ) − cε , 0 .

Note that we have added a dummy dimension, to the dual variable λ. The result of Theorem 3.4.1
then reads as: for all λ ≥ 0 with kλk1 ≤ B
t
1X B log(3)
hλ, Am̂(µ̂) − cε i − hλt , Am̂(gt ) − cε i ≤ + ηB max kAm̂(gt ) − cε k2∞ .
T t=1 ηT t=1,...,T

For the demographic parity constraint maxt=1,...,T kAm̂(gt ) − cε k2∞ ≤ 1.


Furthermore, for any λ ≥ 0 and kλk1 ≤ B, we have
t
1X B log(3)
L̂(µ̂, λ) = R̂(µ̂) + λ> (Am̂(µ̂) − cε ) ≤ R̂(µ̂) + hλt , Am̂(gt ) − cε i + + ηB
T t=1 ηT
T
1X B log(3)
= L̂(gt , λt ) + + ηB .
T t=1 ηT

Since gt ∈ arg ming∈G L̂(g, λt ), then (by linearity, check this!) for all µ ∈ ∆(G) it holds that
L̂(gt , λt ) ≤ L̂(µ, λt ). In particular, using this argument and the above chain of inequalities, we
deduce that for any λ ≥ 0 and kλk1 ≤ B
T
1X B log(3) B log(3)
L̂(µ̂, λ) ≤ L̂(µ̂, λt ) + + ηB = L̂(µ̂, λ̂) + + ηB .
T t=1 ηT ηT

Page 44 of 63
CHAPTER 3 Selected topics on algorithmic fairness

Thus, we have established the second condition for the approximate saddle point in Eq. (3.12).
To establish the first condition, following similar arguments, we deduce that for any µ ∈ ∆(G)
T T T  
1X 1X 1X B log(3)
L̂(µ, λ̂) = L̂(µ, λt ) ≥ L̂(gt , λt ) ≥ L̂(gt , λ̂) − + ηB .
T t=1 T t=1 T t=1 ηT
| {z }
=L̂(µ̂,λ̂)

The combination of Proposition 3.4.3 with Proposition 3.3.2 yields the final theorem of this
chapter, which summarizes theoretical guarantees on the Exp. gradient reduction for fair
classification algorithm.
Theorem 3.4.1: Exponentiated gradient algorithm for fair learning
Let G be a class of measurable classifiers on Z = Rd × {−1, 1} with finite VC-dimension
q
4/δ )
V = V (G) and µ∗ be any minimizer of the problem in Eq. (3.9). set ε = s∈{−1,1} log( 2ns ,
P

then for all T, B > 0 the Exp. gradient reduction for fair classification algorithm
satisfies probability at least 1 − δ
r s
∗ log(3) X V + log(16/δ)
R(µ̂) ≤ R(µ ) + 4B + 2C ps ,
T ns
s∈{−1,1}
r
Eg∼µ̂ P(g(X, S) = 1 | S = −1) − P(g(X, S) = 1 | S = 1) ≤ ε+B + 4B log(3)
−1
 
T
s
X V + log(4/δ)
+C ,
ns
s∈{−1,1}

where C is defined in Theorem 3.2.2.


This last result suggests the following choice of parameters: B = O((n−1 + n+1 )1/2 ) and
T = O((n−1 + n+1 )2 ). Theoretically, the amount of iterations of the Exp. gradient reduction
for fair classification needed for the above guarantee is extremely large—quadratic in the
input sample size. Note that on each iteration we need to solve a cost-sensitive classification problem.
From this perspective it is a big drawback of this approach. However, empirical studies show that a
very competitive performance can be obtained using scikit-learn implementations of cost-sensitive
classification (e.g., based on logistic regression) and performing T = 5 to T = 10 iterations of this
algorithm. The readers are invited to to try this approach using the implementation provided by
the fairlearn package. 4: 2022—Lecture 4

3.5 Bibliographic references


A statistical study of fair empirical risk minimization was started by Donini et al. (2018); Woodworth
et al. (2017), who derived (versions of) Theorem 3.2.5 and indicated several computational and
statistical challenges when building in-processing algorithms satisfying fairness constraints. The
reduction to cost-sensitive classification was developed by Agarwal et al. (2018) and all guarantees
originate from this work (in particular, the claim about sufficiency of taking T = 5 is also taken from
this work). Later, Agarwal et al. (2019) by means of discretization extended this approach to handle
the problem of regression under various fairness constraints, building a reduction to cost-sensitive
multi-class classification.

Page 45 of 63
CHAPTER 3 Selected topics on algorithmic fairness

There are many more in-processing algorithms developed by the community. This choice in
this chapter is explained by the fact that the chosen approaches are supported by rather general
risk/fairness guarantees, which are easily presentable to graduate level student. Many works put
emphasize on the computational aspects of the problem, avoiding statistical issues that might arise.
Lohaus et al. (2020) provide a broad discussion and compare different methods arguing that it is
common for computationally feasible methods to be unjustified statistically.

Page 46 of 63
Chapter 4

Regression under demographic parity


constraint

In this chapter we go beyond binary classification and consider the problem of regression under the
independence constraint. Let (X, S, Y ) ∈ Rd × {1, 2} × R be a random triplet distributed according
to P. As before, we set ps = P(S = s) for all s ∈ {1, 2}
Our goal is to estimate any solution of the following optimization problem
E[(Y − f (X, S))2 ] : f (X, S) ⊥ (4.1)

min ⊥S .
f : Rd ×{0,1}→R

Note that, actually, fair regression can be useful even in the context of binary classification. Indeed,
let Y ∈ {0, 1} and f ∗ be a solution of the problem in Eq. (4.1), then for any θ ∈ R, any classifier
gθ : Rd × {0, 1} → {0, 1} of the form
gθ (x, s) = I(f ∗ (x, s) ≥ θ) ,
satisfies the demographic parity constraint. Indeed, since f ∗ (X, S) ⊥
⊥ S, then
P(f ∗ (X, S) ≥ θ | S = 1) = P(f ∗ (X, S) ≥ θ | S = 2) ,
which is equivalent to the demographic parity of gθ . That is, having an accurate estimate of f ∗ allows
a post-hoc specification of the threshold, without damaging fairness in the sense of demographic
parity. One can also view the above optimization problem as binary classification with Brier score
as the measure of performance.

4.1 Deriving the form of optimal prediction under the inde-


pendence constraint
We shall work with the problem in Eq. (4.1) under a similar assumption that was used for the
construction of the post-processing method for binary classification under the demographic parity
constraint. Namely, we set the regression function η(X, S) = E[Y | X, S].
Assumption 4.1.1. We assume that t 7→ P(η(X, S) ≤ t | S = s) is continuous for all s ∈ {1, 2}
and that E[η 2 (X, S) | S = s] < ∞.
Since we are interested only in a minimizer of the problem in Eq. (4.1), then, by the Pythagoras
theorem we can equivalently solve
E[(η(X, S) − f (X, S))2 ] : f (X, S) ⊥ (4.2)

min ⊥S .
f : Rd ×{0,1}→R

47
CHAPTER 4 Selected topics on algorithmic fairness

Indeed, for any f : Rd × {0, 1} → R we can write

E[(Y − f (X, S))2 ] = E[Y 2 ] − 2 E[Y f (X, S)] +E[f 2 (X, S)] ± E[η 2 (X, S)]
| {z }
E[η(X,S)f (X,S)]

= E[Y 2 − η 2 (X, S)] + E[(η(X, S) − f (X, S))2 ] .

In what follows we will need several basic facts about 2-Wasserstein distance. Possibly these facts
are covered by M. Cuturi in his course on optimal transport.
Reminder 4.1.1: Additional notation
Denote by P2 (R) the space of univariate measures with finite second moment. For any two
univariate measures µ, ν denote by Π(µ, ν) all couplings of µ and ν, i.e., all measures on R2
with the first marginal equal to µ and the second one to ν. For any T : Rd → R and any
probability measure µ on Rd we denote by T ]µ the image measure, i.e., for all A ⊂ R we have
T ]µ(A) = µ({x ∈ R : T (x) ∈ A}). For any univariate measure µ we denote by Fµ : R → [0, 1]
its cumulative distribution function. For any univariate measure µ we denote by Fµ−1 : [0, 1] →
R its quantile function defined for all q ∈ [0, 1] as Fµ−1 (q) = inf {t ∈ R : q ≤ Fµ (t)}.
Definition 4.1.2. Given two measures µ, ν ∈ P2 (R) we define 2-Wasserstein distance between
them as
 Z 1/2
2
W2 (µ, ν) = inf (x − y) dγ(x, y) .
γ∈Π(µ,ν)

A coupling γ ∈ Π(µ, ν) for which the above infimum is achieved is called optimal coupling
between µ and ν
Theorem 4.1.3. Assume that µ ∈ P2 (R) is continuous. Then, for any ν ∈ P2 (R) the optimal
coupling γ between µ and ν is given by the distribution of

(X, Fν−1 ◦ Fµ (X)) with X ∼ µ ,

and if T (x) = Fν−1 ◦ Fµ (x), then


Z 1/2
W2 (µ, ν) = (x − T (x))2 dµ(x) .

Definition 4.1.4. Let µ1 , µ2 ∈ P2 (R) and w1 + w2 = 1 with w1 , w2 ≥ 0, then the weighted


Fréchet mean between µ1 and µ2 is defined as

arg min w1 W22 (µ1 , ν) + w2 W22 (µ2 , ν) .



ν∈P2 (R)

Theorem 4.1.5. Assume that μ_1, μ_2 ∈ P_2(ℝ) are continuous. Then, for any w_1 + w_2 = 1 with w_1, w_2 ≥ 0, the quantile function of the weighted Fréchet mean ν between μ_1 and μ_2 is given by

    q ↦ w_1 F_{μ_1}^{-1}(q) + w_2 F_{μ_2}^{-1}(q) .

For any function f : ℝ^d × {1, 2} → ℝ let us set μ_s(f) := f(·, s)#P_{X|S=s}, where P_{X|S=s} stands for the distribution of X given S = s. Note that, just by the very definition of the 2-Wasserstein distance, it holds for any f : ℝ^d × {1, 2} → ℝ that

    E[(η(X, S) − f(X, S))^2] ≥ p_1 W_2^2(μ_1(η), μ_1(f)) + p_2 W_2^2(μ_2(η), μ_2(f)) .


In particular, if f satisfies the independence constraint, then, by definition, μ_1(f) = μ_2(f) and the optimal value of the problem in Eq. (4.2) is lower-bounded by

    inf_{ν ∈ P_2(ℝ)}  p_1 W_2^2(μ_1(η), ν) + p_2 W_2^2(μ_2(η), ν) ,

which is the objective of the weighted Fréchet mean, whose solution is given by Theorem 4.1.5. Actually, as shown in the next result, under Assumption 4.1.1 the lower bound established above is an equality, and we can explicitly write the expression for the optimal prediction function under the independence constraint.
Theorem 4.1.1: Fair optimal regression function

Let Assumption 4.1.1 be satisfied. Then a solution f* of the optimization problem in Eq. (4.2) can be defined for all (x, s) ∈ ℝ^d × {1, 2} as

    f*(x, s) = Σ_{s′ ∈ {1,2}} p_{s′} F_{μ_{s′}(η)}^{-1} ∘ F_{μ_s(η)}(η(x, s)) .

Proof. Let us first check that, under the assumptions of the theorem, f* defined above indeed satisfies the independence constraint. Indeed, for any s ∈ {1, 2}, under Assumption 4.1.1, the random variable F_{μ_s(η)}(η(X, S)) is distributed uniformly on [0, 1] conditionally on S = s. Hence, letting U be distributed uniformly on [0, 1], we deduce that for any s ∈ {1, 2}

    Law(f*(X, S) | S = s) = Law( Σ_{s′ ∈ {1,2}} p_{s′} F_{μ_{s′}(η)}^{-1}(U) ) ,

with the right hand side not depending on s. Therefore, f* indeed satisfies the independence constraint.
As described above, by the definition of the 2-Wasserstein distance via minimization over couplings, for any f such that f(X, S) ⊥⊥ S it holds that

    E[(η(X, S) − f(X, S))^2] ≥ inf_{ν ∈ P_2(ℝ)}  p_1 W_2^2(μ_1(η), ν) + p_2 W_2^2(μ_2(η), ν) .

By Assumption 4.1.1 the measures μ_1(η) and μ_2(η) are continuous, thus Theorem 4.1.5, combined with the second claim of Theorem 4.1.3 and the above inequality, yields

    E[(η(X, S) − f(X, S))^2] ≥ Σ_{s ∈ {1,2}} p_s ∫ ( x − F_{ν*}^{-1} ∘ F_{μ_s(η)}(x) )^2 d(μ_s(η))(x) ,

where F_{ν*}^{-1}(·) = p_1 F_{μ_1(η)}^{-1}(·) + p_2 F_{μ_2(η)}^{-1}(·). Finally, recalling that μ_s(η) = η(·, s)#P_{X|S=s} for each s ∈ {1, 2}, it holds that

    ∫ ( x − F_{ν*}^{-1} ∘ F_{μ_s(η)}(x) )^2 d(μ_s(η))(x) = E[ ( η(X, S) − F_{ν*}^{-1} ∘ F_{μ_s(η)}(η(X, S)) )^2 | S = s ]
                                                        = E[ (η(X, S) − f*(X, S))^2 | S = s ] .

Thus, we have shown that under Assumption 4.1.1, f* satisfies the independence constraint and for any f such that f(X, S) ⊥⊥ S it holds that

    E[(η(X, S) − f(X, S))^2] ≥ E[(η(X, S) − f*(X, S))^2] ,

the proof is concluded.
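To make the formula of Theorem 4.1.1 concrete, here is a toy numerical sketch (ours, not part of the notes) in a model where η(X, S) | S = s is Gaussian N(m_s, σ_s^2), so that the group-wise CDFs and quantile functions are available in closed form; all constants are arbitrary. It also anticipates the observations discussed below: the group-wise laws of f*(X, S) coincide and the overall mean is preserved.

```python
# Theorem 4.1.1 in a toy Gaussian model: f*(x, s) = sum_{s'} p_{s'} F_{s'}^{-1}(F_s(eta(x, s))).
import numpy as np
from scipy.stats import norm, ks_2samp

rng = np.random.default_rng(2)
p = {1: 0.3, 2: 0.7}                                  # p_s = P(S = s)
m, sig = {1: 0.0, 2: 2.0}, {1: 1.0, 2: 0.5}           # group-wise mean / std of eta(X, S)

n = 200_000
S = rng.choice([1, 2], size=n, p=[p[1], p[2]])
eta = rng.normal([m[s] for s in S], [sig[s] for s in S])   # draws of eta(X, S)

fair = np.empty(n)
for s in (1, 2):
    mask = S == s
    q = norm.cdf(eta[mask], loc=m[s], scale=sig[s])        # F_{mu_s(eta)}(eta)
    fair[mask] = sum(p[sp] * norm.ppf(q, loc=m[sp], scale=sig[sp]) for sp in (1, 2))

print(ks_2samp(fair[S == 1], fair[S == 2]).statistic)  # ~0: group-wise laws coincide
print(eta.mean(), fair.mean())                         # averages agree (mean preservation)
```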


First of all, it is easy to note that the above result, as well as its proof, generalizes directly to the case of a sensitive attribute taking an arbitrary finite number of values. With some extra technical arguments, it is even possible to extend this result to S = ℝ, i.e., a continuous sensitive attribute. We note that the proof of Theorem 4.1.1 relies rather heavily on the fact that our prediction function f is placed in the context of awareness, i.e., f : ℝ^d × {1, 2} → ℝ. In particular, a similar characterisation for the unawareness setup, i.e., for

    min_{f : ℝ^d → ℝ} { E[(Y − f(X))^2] : f(X) ⊥⊥ S } ,

is still unknown. Actually, if one is willing to attack the unawareness case, it is somewhat natural to start with the following simplified problem: given two continuous real-valued random variables A_1, A_2, characterise the solution of

    min_{T : ℝ → ℝ} { E[(A_1 − T(A_1))^2] + E[(A_2 − T(A_2))^2] : Law(T(A_1)) = Law(T(A_2)) } .

One can observe that the above problem is easy to handle if the intersection of the supports of A_1 and A_2 has zero measure (under the distributions of A_1 and A_2).

Exercise 4.1.6. Assume that the distribution P is such that there exists τ : ℝ^d → {1, 2} such that

    (X, S, Y) = (X, τ(X), Y)   almost surely.

Find an explicit expression for a solution of

    min_{f : ℝ^d → ℝ} { E[(Y − f(X))^2] : f(X) ⊥⊥ S } .

Insights about the independence constraint. Theorem 4.1.1 is instructive and allows us to draw several insights about the combination of the independence constraint and the awareness framework.

Observation 4.1.7 (Rational ordering). Let Assumption 4.1.1 be satisfied. For any s ∈ {1, 2} and any x′, x ∈ ℝ^d it holds that

    η(x, s) ≤ η(x′, s)  ⟹  f*(x, s) ≤ f*(x′, s) .

Proof. It is sufficient to observe that, under Assumption 4.1.1, according to Theorem 4.1.1, for any (x, s) the value of f*(x, s) is obtained by applying a non-decreasing transformation to η(x, s).

The interpretation of this result is straightforward. Viewing η as the prediction strategy in the “unfair” world, i.e., according to the current distribution of the data, f* preserves its group-wise ordering. In particular, if η(x, s) is interpreted as the salary assigned to the individual (x, s), then higher paid individuals remain higher paid when compared to individuals sharing the same sensitive attribute.

Observation 4.1.8 (Mean preservation). Let Assumption 4.1.1 be satisfied, then

    E[η(X, S)] = E[f*(X, S)] .

Proof. Indeed, under Assumption 4.1.1 (recall inverse transform sampling) it holds for all s, s′ ∈ {1, 2} that

    Law( F_{μ_{s′}(η)}^{-1} ∘ F_{μ_s(η)}(η(X, S)) | S = s ) = μ_{s′}(η) .


Hence, for the expectation of f*(X, S) we deduce that

    E[f*(X, S)] = Σ_{s′ ∈ {1,2}} p_{s′} E[ F_{μ_{s′}(η)}^{-1} ∘ F_{μ_S(η)}(η(X, S)) ]
                = Σ_{s′ ∈ {1,2}} p_{s′} Σ_{s ∈ {1,2}} p_s E[ F_{μ_{s′}(η)}^{-1} ∘ F_{μ_s(η)}(η(X, S)) | S = s ]
                = Σ_{s′ ∈ {1,2}} p_{s′} Σ_{s ∈ {1,2}} p_s E[η(X, S) | S = s′]
                = Σ_{s′ ∈ {1,2}} p_{s′} E[η(X, S) | S = s′] = E[η(X, S)] ,

where the third equality uses that, conditionally on S = s, the expectation of F_{μ_{s′}(η)}^{-1} ∘ F_{μ_s(η)}(η(X, S)) equals the mean of μ_{s′}(η), that is, E[η(X, S) | S = s′].

Again interpreting η(x, s) as the salary assigned to the individual (x, s) in the current “unfair” reality, the above result shows that, in order to satisfy the independence constraint, we do not need to increase the average budget. In particular, it implies that while one of the groups will receive a higher salary on average, the average salary of the other group will be decreased.

Probably Theorem 4.1.1 and the insights that it provides are the most convincing reasons (at least for the authors of the course) to study algorithmic fairness from a mathematical perspective. It is hardly possible to deduce these intrinsic properties of the independence constraint without relying on rather advanced mathematical machinery.

4.2 Post-processing estimation


As before, we suppose that we are able to sample from each sensitive group. That is, we have access to {X_1^s, …, X_{2n_s}^s} i.i.d. from P_{X|S=s} for each s ∈ {1, 2}. Again, we suppose that p_s = P(S = s) is known.

Given a base estimator f̂ and the samples described above, our goal is to build a post-processing operator f̂ ↦ T̂(f̂) such that T̂(f̂)(X, S) satisfies the independence constraint and T̂(f̂) estimates f* well.
Our main insight in the construction of this post-processing operator comes from the following
lemma, which introduces a “smoothed” version of the empirical cumulative distribution function.
Lemma 4.2.1. Let n ≥ 1. Let V_1, …, V_n, V_{n+1} be exchangeable real-valued random variables and let U be distributed uniformly on [0, 1], independently from V_1, …, V_n, V_{n+1}. Then the statistic

    T(V_1, …, V_n, V_{n+1}, U) = (1/(n+1)) ( Σ_{i=1}^{n} I(V_i < V_{n+1}) + U · ( 1 + Σ_{i=1}^{n} I(V_i = V_{n+1}) ) )

is distributed uniformly on [0, 1].


Note that if V1 , . . . , Vn , Vn+1 are i.i.d. (or, more generally, exchangeable) continuous real valued
random variables, then the statistics
n
1X
T 0 (V1 , . . . , Vn , Vn+1 ) = I(Vi ≤ Vn+1 ) ,
n i=1

is distributed uniformly on {(0/n), (1/n), . . . , (n/n)}. Indeed, since these random variables are gener-
ated from a continuous distribution, then for any k = 0, . . . , n it holds that
X
I T 0 (Vσ(1) , . . . , Vσ(n) , Vσ(n+1) ) = k/n = 1 ,

σ

Page 51 of 63
CHAPTER 4 Selected topics on algorithmic fairness

where the summation goes over all 1-cyclic permutations σ of [n+1] (there are n+1 of them). Taking
expectation and using the exchangeability yields this observation. In case we lift the assumption
of continuity, T 0 (V1 , . . . , Vn , Vn+1 ) is no longer distributed uniformly on {(0/n), (1/n), . . . , (n/n)} in
general, but is stochastically dominated by a random variable that is distributed uniformly on
this set. In contrast to these results, Lemma 4.2.1 states that with the help of additional mild
randomization by U , the newly constructed statistics is distributed uniformly on the whole [0, 1]
(we are essentially smoothing each distinct atom).
Proof. Conditionally on V_{1:n+1} := (V_1, …, V_n, V_{n+1}), let V be a random variable distributed according to (1/(n+1)) Σ_{i=1}^{n+1} δ_{V_i}, independently of U, and denote by F_{V|V_{1:n+1}} the cumulative distribution function of V conditionally on V_{1:n+1}. Define, for all (v, u) ∈ ℝ^2, the statistic

    T(v, u) = P(V < v | V_{1:n+1}) + u · P(V = v | V_{1:n+1}) .

Note that T(V_{n+1}, U) = T(V_1, …, V_n, V_{n+1}, U). In what follows we show that T(V_{n+1}, U) is distributed uniformly on [0, 1]. Fix some t ∈ [0, 1]. Define v(t) = F_{V|V_{1:n+1}}^{-1}(t) and

    u(t) = 1                                                        if P(V = v(t) | V_{1:n+1}) = 0 ,
    u(t) = ( t − P(V < v(t) | V_{1:n+1}) ) / P(V = v(t) | V_{1:n+1})   otherwise .

One can verify that, by construction of (v(t), u(t)) ∈ ℝ^2,

    {T(v, u) ≤ t}  ⟺  {v < v(t)} ⊔ ( {v = v(t)} ∩ {u ≤ u(t)} ) ,

where ⊔ stands for the disjoint union of two sets. Thus, since the events {v < v(t)} and {v = v(t)} ∩ {u ≤ u(t)} are disjoint and U is distributed uniformly on [0, 1] and is independent from V and V_{1:n+1}, it holds that

    P(T(V, U) ≤ t | V_{1:n+1}) = P(V < v(t) | V_{1:n+1}) + P(U ≤ u(t) | V_{1:n+1}) P(V = v(t) | V_{1:n+1}) = t .

Integrating the above equality we get P(T(V, U) ≤ t) = t. To conclude, we first notice that, thanks to the definition of V, it holds that

    P(T(V, U) ≤ t) = E[ P(T(V, U) ≤ t | V_{1:n+1}) ]
                  = E[ (1/(n+1)) Σ_{i=1}^{n+1} P(T(V_i, U) ≤ t | V_{1:n+1}) ]
                  = (1/(n+1)) Σ_{i=1}^{n+1} P(T(V_i, U) ≤ t) .

Since the random variables V_1, …, V_n, V_{n+1} are exchangeable, the Bernoulli random variables I(T(V_i, U) ≤ t) for i ∈ [n+1] are identically distributed, hence

    t = P(T(V, U) ≤ t) = (1/(n+1)) Σ_{i=1}^{n+1} P(T(V_i, U) ≤ t) = P(T(V_{n+1}, U) ≤ t) .

The proof is concluded.
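A quick simulation (ours) illustrating Lemma 4.2.1: for i.i.d. draws from a distribution with atoms (Poisson below, an arbitrary choice), the randomized statistic T is uniform on [0, 1], while the plain empirical CDF evaluated at V_{n+1} visibly is not.

```python
# Smoothed rank statistic of Lemma 4.2.1 versus the unsmoothed one, tested for uniformity.
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(3)
n, reps = 20, 50_000
V = rng.poisson(2.0, size=(reps, n + 1))        # columns 0..n-1 play V_1..V_n, column n plays V_{n+1}
U = rng.uniform(size=reps)

less = (V[:, :n] < V[:, [n]]).sum(axis=1)
ties = (V[:, :n] == V[:, [n]]).sum(axis=1)
T = (less + U * (1 + ties)) / (n + 1)

print(kstest(T, "uniform").statistic)           # close to 0: T is uniform on [0, 1]
print(kstest((V[:, :n] <= V[:, [n]]).sum(axis=1) / n, "uniform").statistic)  # noticeably larger
```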

Having the above result in mind, let us define the post-processing operator T̂. For any f : ℝ^d × {1, 2} → ℝ we introduce the notation

    f_i^s = f(X_i^s, s)   for all s ∈ {1, 2}, i ∈ [2n_s] .

Using the above notation, for any prediction function f and s ∈ {1, 2} we define two estimators

    F̂_{s,f,1}(t) = (1/n_s) Σ_{i=1}^{n_s} I(f_i^s ≤ t) ,

    F̂_{s,f,2}(t) = (1/(n_s + 1)) ( Σ_{i=n_s+1}^{2n_s} I(f_i^s < t) + U · ( 1 + Σ_{i=n_s+1}^{2n_s} I(f_i^s = t) ) ) ,

where U is distributed uniformly on [0, 1] and is independent from all the data, f and (X, S, Y).
The post-processing estimator is then defined as

    T̂(f)(x, s) = Σ_{s′ ∈ {1,2}} p_{s′} F̂_{s′,f,1}^{-1} ∘ F̂_{s,f,2}(f(x, s)) .
We can prove the following fairness guarantee for this estimator.


Theorem 4.2.1: Fairness guarantee

For any distribution P of (X, S, Y) and for any prediction function f : ℝ^d × {1, 2} → ℝ it holds that

    Law(T̂(f)(X, S) | S = 1) = Law(T̂(f)(X, S) | S = 2) ,

that is, T̂(f)(X, S) ⊥⊥ S.

Before providing the proof of this result, let us understand the statement. It might seem magical that we have managed to establish exact independence, while, for the simpler problem of classification, we only established approximate independence. The devil is in the details. Theorem 4.2.1 states that if we look at the joint distribution of (T̂(f)(X, S), S), then the two components are independent. In particular, in this joint distribution, the randomness w.r.t. the sample as well as w.r.t. the introduced random variable U distributed uniformly on [0, 1] is “averaged out”. Meanwhile, the fairness guarantees for the post-processing method in the context of binary classification were established with high probability w.r.t. the available sample. The two are very different and, in general, neither follows from the other. However, for this particular estimator T̂ (or a slight modification of it) it is actually possible to establish a high-probability guarantee on the violation of independence measured in the Kolmogorov–Smirnov distance (see the bibliographic references).

Proof. We are going to show that the Kolmogorov–Smirnov distance between Law(T̂(f)(X, S) | S = 1) and Law(T̂(f)(X, S) | S = 2) equals zero, which immediately yields the desired result. Set

    Δ = sup_{t ∈ ℝ} | P(T̂(f)(X, S) ≤ t | S = 1) − P(T̂(f)(X, S) ≤ t | S = 2) | .

First let us derive a simple result on generalized inverse functions.

Lemma 4.2.2. Let F_1, F_2 : ℝ → [0, 1] be two non-decreasing right-continuous functions. Then, for all p_1 + p_2 = 1 with p_1, p_2 ≥ 0, there exists G : ℝ → [0, 1] such that for all q ∈ [0, 1] and x ∈ ℝ

    p_1 F_1^{-1}(q) + p_2 F_2^{-1}(q) ≤ x  ⟺  q ≤ G(x) .

In this statement it is important that G depends only on F_1, F_2, p_1, p_2 and not on the particular choice of q. The proof can be skipped on a first reading; it is not that insightful.


Proof. Define G(x) as

    G(x) = sup { q′ ∈ [0, 1] : p_1 F_1^{-1}(q′) + p_2 F_2^{-1}(q′) ≤ x } .

By definition of G, if for some q ∈ [0, 1] we have p_1 F_1^{-1}(q) + p_2 F_2^{-1}(q) ≤ x, then G(x) ≥ q. Conversely, if for some q ∈ [0, 1] it happens that G(x) ≥ q, then, since both F_1^{-1} and F_2^{-1} are non-decreasing, for every q′ < q we have p_1 F_1^{-1}(q′) + p_2 F_2^{-1}(q′) ≤ x. Both F_1^{-1} and F_2^{-1} are left-continuous (check it!), hence taking the limit q′ → q− in the previous inequality we conclude.
Note that, by construction, T̂(f) is defined as

    T̂(f)(x, s) = Q̂_{f,1} ∘ F̂_{s,f,2}(f(x, s)) ,   where Q̂_{f,1}(q) := Σ_{s′ ∈ {1,2}} p_{s′} F̂_{s′,f,1}^{-1}(q)

is of the form suitable for the application of Lemma 4.2.2. Furthermore, thanks to the splitting of the data, Q̂_{f,1} is independent from F̂_{S,f,2}(f(X, S)) conditionally on S = 1 or S = 2. Hence, by Lemma 4.2.2, for some Ĝ which depends only on Q̂_{f,1}, we have

    Δ = sup_{t ∈ ℝ} | P(F̂_{1,f,2}(f(X, S)) ≤ Ĝ(t) | S = 1) − P(F̂_{2,f,2}(f(X, S)) ≤ Ĝ(t) | S = 2) | .

Inspecting F̂_{s,f,2}(f(X, S)), we observe that, thanks to Lemma 4.2.1, Law(F̂_{s,f,2}(f(X, S)) | S = s, Ĝ(t)) = Unif([0, 1]). Thus, we have deduced that

    Δ = sup_{t ∈ ℝ} | E[Ĝ(t) | S = 1] − E[Ĝ(t) | S = 2] | = 0 ,

where the last equality follows thanks to the splitting of the data into two parts.
We now turn to the risk guarantee for this post-processing estimator. Such a guarantee can be obtained under additional assumptions on the distribution P, which are more restrictive than Assumption 4.1.1. The eventual bound has the flavour of the one derived for the post-processing algorithm in binary classification; it contains two terms: the quality of the base estimator and the statistical price that we pay for the estimation of the cumulative distribution function and the quantile function.

4.3 Risk guarantee


The goal of this section is to derive a risk guarantee for T̂(f). As before, we are looking for a post-processing bound of the following type:

    E|T̂(f)(X, S) − f*(X, S)| ≤ C ( E|f(X, S) − η(X, S)| + remainder_{n_1, n_2} ) ,

where C > 0 is some constant. Note that in this section we are going to work with the L_1-estimation risk. It turns out that bounding the L_1-estimation risk is much simpler than bounding any other risk under the considered assumptions, due to a very particular expression of the 1-Wasserstein distance. While the fairness guarantee was assumption-free (it holds for any distribution P and for any base predictor f), the risk guarantee requires extra assumptions, presented below.
Assumption 4.3.1. For any s ∈ {1, 2} we assume that μ_s(η):

1. is supported on an interval (finite or infinite);
2. admits a density q_s w.r.t. the Lebesgue measure;
3. satisfies the following: there exist c_1(s) > 0 and c_2(s) ≥ c_1(s) such that

    c_1(s) ≤ q_s(y) ≤ c_2(s)   for all y in the support of μ_s(η) .

How restrictive is this assumption? Overall, quite restrictive. Intuitively, the first point rules out “gaps” in the distribution of (η(X, S) | S = s): every value in the support, except possibly the two endpoints, has a non-negligible left and right neighbourhood. The second point is arguably the least restrictive among the three. The last one roughly says that there are no very rare values of η(X, S) or, equivalently, that μ_s(η) is uniform-like. To actually derive the risk bound we need several preparatory lemmas.

The next lemma establishes some properties of the cumulative distribution function and the quantile function of (η(X, S) | S = s). The proof is rather straightforward and can be skipped on a first reading.
Lemma 4.3.2. Let Assumption 4.3.1 be satisfied, then for any s ∈ {1, 2}:

• t ↦ F_{μ_s(η)}(t) is strictly increasing on the support of μ_s(η);
• t ↦ F_{μ_s(η)}(t) is c_2(s)-Lipschitz continuous;
• t ↦ F_{μ_s(η)}^{-1}(t) is (1/c_1(s))-Lipschitz continuous.

Proof. Fix some s ∈ {1, 2}. By Assumption 4.3.1, there exists an interval I ⊂ ℝ such that for every t ∈ I

    F_{μ_s(η)}(t) = ∫_{I ∩ {y ≤ t}} q_s(y) dy   and   q_s(y) > 0 for y ∈ I .

Thus, for t_1 > t_2 ∈ I we have

    F_{μ_s(η)}(t_1) − F_{μ_s(η)}(t_2) = ∫_{t_2}^{t_1} q_s(y) dy > 0 .

Hence the first item. The second one follows immediately from the above and the assumption that q_s(y) ≤ c_2(s). Finally, the third item follows from the inverse function rule (the derivative of the inverse function exists in the proper sense, since F_{μ_s(η)} is strictly increasing).
As has already been mentioned, our choice of the risk is dictated by a very specific expression for the 1-Wasserstein distance. This expression is presented in the following lemma, and the main ideas of its proof are given.

Lemma 4.3.3. Let μ, ν be two absolutely integrable univariate measures, then

    W_1(μ, ν) = inf_{γ ∈ Π(μ, ν)} ∫ |x − y| dγ(x, y) = ∫_0^1 |F_μ^{-1}(t) − F_ν^{-1}(t)| dt = ∫_ℝ |F_μ(t) − F_ν(t)| dt .

Proof. The last equality is best seen on a plot. For the second one, let (A, B) be arbitrary random variables such that A ∼ μ and B ∼ ν. Then,

    E[|A − B|] = E(A − B)_+ + E(B − A)_+ .

For the first term on the right hand side we can write, by Fubini’s theorem,

    E(A − B)_+ = E[ ∫_0^∞ P(A − B > t | B) dt ] = E[ ∫_B^∞ P(A > t | B) dt ]
              = E[ ∫_{−∞}^∞ P(A > t, B ≤ t | B) dt ] = ∫_{−∞}^∞ P(A > t, B ≤ t) dt .

Thus, we have shown that

    E(A − B)_+ = ∫_{−∞}^∞ ( F_ν(t) − P(A ≤ t, B ≤ t) ) dt .

Repeating the same argument for E(B − A)_+ we deduce that

    E[|A − B|] = ∫_{−∞}^∞ ( F_ν(t) + F_μ(t) − 2 P(A ≤ t, B ≤ t) ) dt
              ≥ ∫_{−∞}^∞ ( F_ν(t) + F_μ(t) − 2 min{F_ν(t), F_μ(t)} ) dt = ∫_ℝ |F_μ(t) − F_ν(t)| dt .

The above inequality establishes that

    W_1(μ, ν) ≥ ∫_ℝ |F_μ(t) − F_ν(t)| dt .

The reverse inequality is left as an exercise. Hint: construct an explicit coupling via the quantile transform.
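The CDF expression of Lemma 4.3.3 is easy to check numerically on empirical measures; the sketch below (ours) compares it with scipy's built-in 1-Wasserstein distance between two arbitrary samples.

```python
# Check that int |F_a(t) - F_b(t)| dt between two empirical CDFs matches scipy's W_1.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(4)
a = rng.normal(0.0, 1.0, size=5000)
b = rng.exponential(2.0, size=5000)

grid = np.sort(np.concatenate([a, b]))          # both empirical CDFs are constant between grid points
F_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
F_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
cdf_formula = np.sum(np.abs(F_a - F_b)[:-1] * np.diff(grid))

print(cdf_formula, wasserstein_distance(a, b))   # the two numbers coincide up to float error
```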

We will need to control, in expectation, the 1-Wasserstein distance between the empirical and the true distribution.

Lemma 4.3.4. Let X_1, …, X_n be i.i.d. absolutely integrable random variables following some distribution μ, and let μ̂_n be their empirical distribution. Then

    E[W_1(μ̂_n, μ)] ≤ (1/√n) ∫_ℝ √( F_μ(x)(1 − F_μ(x)) ) dx .

Proof. By Lemma 4.3.3, Fubini’s theorem and Jensen’s inequality,

    E[W_1(μ̂_n, μ)] = ∫_ℝ E|F_{μ̂_n}(x) − F_μ(x)| dx ≤ ∫_ℝ √( E[ (F_{μ̂_n}(x) − F_μ(x))^2 ] ) dx
                   = (1/√n) ∫_ℝ √( F_μ(x)(1 − F_μ(x)) ) dx ,

where the last equality uses that n F_{μ̂_n}(x) is a Binomial(n, F_μ(x)) random variable, so that E[(F_{μ̂_n}(x) − F_μ(x))^2] = F_μ(x)(1 − F_μ(x))/n.
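A small Monte Carlo experiment (ours) illustrating the 1/√n rate of Lemma 4.3.4 for μ = Unif[0, 1], where W_1(μ̂_n, μ) = ∫_0^1 |F_{μ̂_n}(x) − x| dx can be approximated by a Riemann sum:

```python
# Empirical scaling of E[W_1(mu_hat_n, mu)] with n for the uniform distribution on [0, 1].
import numpy as np

rng = np.random.default_rng(5)
grid = np.linspace(0.0, 1.0, 2001)

def w1_to_uniform(sample):
    F_hat = np.searchsorted(np.sort(sample), grid, side="right") / len(sample)
    return np.mean(np.abs(F_hat - grid))        # Riemann approximation of the integral over [0, 1]

for n in (100, 400, 1600, 6400):
    est = np.mean([w1_to_uniform(rng.uniform(size=n)) for _ in range(200)])
    print(n, est, est * np.sqrt(n))             # est * sqrt(n) stabilizes around a constant
```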

Note that the previous lemma does not say that ∫_ℝ √(F_μ(x)(1 − F_μ(x))) dx is finite! Actually, one needs much more than just one moment, as highlighted below without proof.

Lemma 4.3.5 (Proof is omitted, see e.g., Bobkov and Ledoux (2019)). Let X ∼ μ be a univariate random variable. Assume that for some δ > 0 it holds that E|X|^{2+δ} < ∞. Then

    ∫_ℝ √( F_μ(x)(1 − F_μ(x)) ) dx < ∞ .

The above condition is stated as sufficient, but it is also essentially necessary. The final ingredient for our risk bound is the following lemma.

Lemma 4.3.6. Let μ, ν be two univariate probability distributions. Assume that μ admits a density w.r.t. the Lebesgue measure which is bounded from above by C_μ. Then for any A ∼ μ and B ∼ ν it holds that

    E|F_μ(A) − F_ν(B)| ≤ C_μ E|A − B| .


Proof. Let (A′, B′) be an independent copy of (A, B). Then it holds that

    E|F_μ(A) − F_ν(B)| = E| P(A′ ≤ A | (A, B)) − P(B′ ≤ B | (A, B)) |
                       = E| E[ I(A′ ≤ A) | (A, B) ] − E[ I(B′ ≤ B) | (A, B) ] |
                       ≤ E[ E[ I( |B′ − B| ≤ |A − B| + |A′ − B′| ) | (A, B) ] ]
                       = P( |B′ − B| ≤ |A − B| + |A′ − B′| ) .

Since I(a ≤ b + c) ≤ I(2a ≤ b) + I(2b ≤ c) we get

    E|F_μ(A) − F_ν(B)| ≤ P( 2|B′ − B| ≤ |A − B| ) + P( 2|B − B′| ≤ |A′ − B′| ) .

Note that since (A′, B′) is independent from and distributed identically to (A, B), then for any G : (ℝ^2)^2 → ℝ we have Law( G((A′, B′), (A, B)) ) = Law( G((A, B), (A′, B′)) ). In particular, it holds that

    P( 2|B′ − B| ≤ |A − B| ) = P( 2|B′ − B| ≤ |A′ − B′| ) .

The above implies that

    E|F_μ(A) − F_ν(B)| ≤ 2 P( 2|B′ − B| ≤ |A′ − B′| )
                       = 2 P( B′ − (1/2)|A′ − B′| ≤ B ≤ B′ + (1/2)|A′ − B′| )
                       = 2 E[ P( B′ − (1/2)|A′ − B′| ≤ B ≤ B′ + (1/2)|A′ − B′| | (A′, B′) ) ]
                       ≤ C_μ E|A′ − B′| = C_μ E|A − B| ,

where the last inequality follows from the fact that F_μ is C_μ-Lipschitz continuous.
It is crucial that we do not assume any particular coupling between A and B. In particular, as a curious corollary of the above result, taking the infimum over all couplings on both sides, we deduce that under the conditions of the above lemma

    W_1(F_ν#ν, F_μ#μ) ≤ C_μ W_1(ν, μ) .

In other words, the mapping μ ↦ F_μ#μ is C_μ-Lipschitz in the space of measures equipped with the 1-Wasserstein distance, for those μ that admit a density bounded by C_μ.
Finally, we are in a position to derive the guarantee for the risk.

Theorem 4.3.1: Estimation guarantee

Assume that Law(f(X, S) | S = s) is continuous for each s ∈ {1, 2} and let Assumption 4.3.1 be satisfied. Then there exists C, depending only on c_1(1), c_1(2), c_2(1), c_2(2), such that

    E|T̂(f)(X, S) − f*(X, S)| ≤ C ( E|f(X, S) − η(X, S)| + 1/√n_1 + 1/√n_2 ) .

Proof. Recall that for all (x, s) ∈ ℝ^d × {1, 2}

    T̂(f)(x, s) = Σ_{s′ ∈ {1,2}} p_{s′} F̂_{s′,f,1}^{-1} ∘ F̂_{s,f,2}(f(x, s))   and   f*(x, s) = Σ_{s′ ∈ {1,2}} p_{s′} F_{μ_{s′}(η)}^{-1} ∘ F_{μ_s(η)}(η(x, s)) .

By the triangle inequality, in order to derive the claimed result, it is sufficient to control

    E[ |F̂_{s′,f,1}^{-1} ∘ F̂_{s,f,2}(f(X, S)) − F_{μ_{s′}(η)}^{-1} ∘ F_{μ_s(η)}(η(X, S))| | S = s ] ,


for any s′, s ∈ {1, 2}. Again using the triangle inequality, we decompose the above quantity into the following three parts:

    M_1 = E[ |F̂_{s′,f,1}^{-1} ∘ F̂_{s,f,2}(f(X, S)) − F̂_{s′,η,1}^{-1} ∘ F̂_{s,f,2}(f(X, S))| | S = s ] ,
    M_2 = E[ |F̂_{s′,η,1}^{-1} ∘ F̂_{s,f,2}(f(X, S)) − F_{μ_{s′}(η)}^{-1} ∘ F̂_{s,f,2}(f(X, S))| | S = s ] ,
    M_3 = E[ |F_{μ_{s′}(η)}^{-1} ∘ F̂_{s,f,2}(f(X, S)) − F_{μ_{s′}(η)}^{-1} ∘ F_{μ_s(η)}(η(X, S))| | S = s ] .

Here F̂_{s′,η,1} denotes the empirical counterpart of F_{μ_{s′}(η)} built from the first half of the sample, i.e., F̂_{s′,f,1} with f replaced by η.

We bound each of these three terms separately.

Bounding M_1. Recall that, according to Lemma 4.2.1, the random variable F̂_{s,f,2}(f(X, S)), conditionally on S = s, is distributed uniformly on [0, 1]. Furthermore, thanks to the enforced splitting of the data, (F̂_{s′,f,1}^{-1}, F̂_{s′,η,1}^{-1}) is independent from F̂_{s,f,2}(f(X, S)) conditionally on S = s. Thus, it holds that

    M_1 = E[ ∫_0^1 |F̂_{s′,f,1}^{-1}(u) − F̂_{s′,η,1}^{-1}(u)| du ] ,

which corresponds to the 1-Wasserstein distance between the empirical counterparts of Law(f(X, S) | S = s′) and Law(η(X, S) | S = s′). In particular, recalling the definition of the 1-Wasserstein distance as a minimum over couplings, we deduce that

    M_1 ≤ E[ (1/n_{s′}) Σ_{i=1}^{n_{s′}} |f(X_i^{s′}, s′) − η(X_i^{s′}, s′)| ] = E[ |f(X, S) − η(X, S)| | S = s′ ] .
Bounding M_2. Similarly to M_1, we first deduce that

    M_2 = E[ ∫_0^1 |F̂_{s′,η,1}^{-1}(u) − F_{μ_{s′}(η)}^{-1}(u)| du ] ,

which corresponds to the 1-Wasserstein distance between Law(η(X, S) | S = s′) and its empirical counterpart. Thanks to Assumption 4.3.1, we can apply Lemma 4.3.4 and Lemma 4.3.5 to deduce that, for some C = C(η, s′) > 0 which depends only on Law(η(X, S) | S = s′), we have

    M_2 ≤ C / √n_{s′} .

Bounding M_3. This is the most involved bound, and here we will use all the remaining assumptions. First, under Assumption 4.3.1, q ↦ F_{μ_{s′}(η)}^{-1}(q) is (1/c_1(s′))-Lipschitz. Thus,

    M_3 ≤ (1/c_1(s′)) · E[ |F̂_{s,f,2}(f(X, S)) − F_{μ_s(η)}(η(X, S))| | S = s ] .

Since we assumed that Law(f(X, S) | S = s) is continuous, almost surely there are no ties between f(X, S) and the f_i^s, so that, at t = f(X, S),

    F̂_{s,f,2}(t) = (1/(n_s + 1)) Σ_{i=n_s+1}^{2n_s} I(f_i^s ≤ t) + U/(n_s + 1)
                 = (1/n_s) Σ_{i=n_s+1}^{2n_s} I(f_i^s ≤ t) + U/(n_s + 1) − (1/(n_s(n_s + 1))) Σ_{i=n_s+1}^{2n_s} I(f_i^s ≤ t) .

In particular, using Jensen’s inequality, we can deduce that for any t ∈ ℝ

    E|F̂_{s,f,2}(t) − F_{μ_s(f)}(t)| ≤ E| (1/n_s) Σ_{i=n_s+1}^{2n_s} I(f_i^s ≤ t) − P(f(X, S) ≤ t | S = s) | + 2/(n_s + 1)
                                    ≤ (1/√n_s) √( F_{μ_s(f)}(t)(1 − F_{μ_s(f)}(t)) ) + 2/(n_s + 1)
                                    ≤ 1/(2√n_s) + 2/(n_s + 1) ≤ C/√n_s

for some universal C > 0. The above arguments allow us to deduce that

    M_3 ≤ (1/c_1(s′)) · ( C/√n_s + E[ |F_{μ_s(f)}(f(X, S)) − F_{μ_s(η)}(η(X, S))| | S = s ] ) .

To bound the last term on the right hand side of the above inequality, we note that, by Assumption 4.3.1, μ_s(η) admits a density upper bounded by c_2(s); thus, applying Lemma 4.3.6, we obtain

    E[ |F_{μ_s(f)}(f(X, S)) − F_{μ_s(η)}(η(X, S))| | S = s ] ≤ c_2(s) · E[ |f(X, S) − η(X, S)| | S = s ] .

Hence, we have shown that

    M_3 ≤ C/(c_1(s′) √n_s) + (c_2(s)/c_1(s′)) · E[ |f(X, S) − η(X, S)| | S = s ] .

Putting everything together. We have

    E|T̂(f)(X, S) − f*(X, S)| = Σ_{s ∈ {1,2}} p_s E[ |T̂(f)(X, S) − f*(X, S)| | S = s ]
                              ≤ Σ_{s ∈ {1,2}} Σ_{s′ ∈ {1,2}} p_s p_{s′} (M_1 + M_2 + M_3) .

The proof is concluded.


4.4 Bibliographic references


Several works establish connections between optimal transport theory and binary classification under demographic parity constraints (Chiappa et al., 2020; Gordaliza et al., 2019; Jiang et al., 2019). Notably, Jiang et al. (2019) observe that post-processing via 1-Wasserstein transport leads to the minimal expected changes of the base prediction. Later, Chzhen et al. (2020b) and Le Gouic et al. (2020) establish the explicit expression for the optimal prediction under the independence constraint in the awareness framework and propose a version of the post-processing approach described here. Chzhen et al. (2020b), among other guarantees, derive an in-expectation control of the Kolmogorov–Smirnov distance between the group-wise distributions of the post-processed predictor and complement it with guarantees on the estimation of f*. Chzhen and Schreuder (2020), building on these works, introduce a notion of unfairness based on the Fréchet-mean formulation and analyse the trade-off between the squared loss and the (approximate) independence constraint.

Bibliography

Agarwal, A., Beygelzimer, A., Dudík, M., Langford, J., and Wallach, H. (2018). A reductions
approach to fair classification. arXiv preprint arXiv:1803.02453.

Agarwal, A., Dudik, M., and Wu, Z. S. (2019). Fair regression: Quantitative definitions and
reduction-based algorithms. In International Conference on Machine Learning.

Angwin, J., Larson, J., Mattu, S., and Kirchner, L. (2016). Machine bias. ProPublica.

Audibert, J. Y. and Tsybakov, A. (2007). Fast learning rates for plug-in classifiers. The Annals of
Statistics, 35(2):608–633.

Barocas, S., Hardt, M., and Narayanan, A. (2019). Fairness and Machine Learning. fairmlbook.org.
http://www.fairmlbook.org.

Bertsekas, D. P. (1973). Stochastic optimization problems with nondifferentiable cost functionals.


Journal of Optimization Theory and Applications, 12(2):218–231.

Bobkov, S. and Ledoux, M. (2019). One-Dimensional Empirical Measures, Order Statistics, and
Kantorovich Transport Distances. Memoirs of the American Mathematical Society. American
Mathematical Society.

Bolukbasi, T., Chang, K.-W., Zou, J. Y., Saligrama, V., and Kalai, A. T. (2016). Man is to computer
programmer as woman is to homemaker? debiasing word embeddings. In Lee, D., Sugiyama, M.,
Luxburg, U., Guyon, I., and Garnett, R., editors, Advances in Neural Information Processing
Systems, volume 29. Curran Associates, Inc.

Celis, L. E., Huang, L., Keswani, V., and Vishnoi, N. K. (2019). Classification with fairness
constraints: A meta-algorithm with provable guarantees. In Proceedings of the Conference on
Fairness, Accountability, and Transparency, FAT* ’19, page 319–328, New York, NY, USA.
Association for Computing Machinery.

Chiappa, S., Jiang, R., Stepleton, T., Pacchiano, A., Jiang, H., and Aslanides, J. (2020). A general
approach to fairness with optimal transport. In AAAI.

Chouldechova, A. (2017). Fair prediction with disparate impact: A study of bias in recidivism
prediction instruments. Big data, 5(2):153–163.

Chzhen, E., Denis, C., Hebiri, M., Oneto, L., and Pontil, M. (2019). Leveraging labeled and unlabeled
data for consistent fair binary classification. In Advances in Neural Information Processing Systems
32, pages 12739–12750. Curran Associates, Inc.

Chzhen, E., Denis, C., Hebiri, M., Oneto, L., and Pontil, M. (2020a). Fair regression via plug-in
estimator and recalibration with statistical guarantees. NeurIPS.


Chzhen, E., Denis, C., Hebiri, M., Oneto, L., and Pontil, M. (2020b). Fair regression with wasserstein
barycenters. NeurIPS.

Chzhen, E. and Schreuder, N. (2020). A minimax framework for quantifying risk-fairness trade-off
in regression. arXiv preprint arXiv:2007.14265.

Corbett-Davies, S., Pierson, E., Feller, A., Goel, S., and Huq, A. (2017). Algorithmic decision
making and the cost of fairness. In Proceedings of the 23rd acm sigkdd international conference
on knowledge discovery and data mining, pages 797–806.

Diana, E., Gill, W., Kearns, M., Kenthapadi, K., and Roth, A. (2021). Minimax group fairness:
Algorithms and experiments. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics,
and Society, pages 66–76.

Donini, M., Oneto, L., Ben-David, S., Shawe-Taylor, J. S., and Pontil, M. (2018). Empirical risk
minimization under fairness constraints. In Neural Information Processing Systems.

Dwork, C., Hardt, M., Pitassi, T., Reingold, O., and Zemel, R. (2012). Fairness through awareness.
In Proceedings of the 3rd innovations in theoretical computer science conference, pages 214–226.

Goh, G., Cotter, A., Gupta, M., and Friedlander, M. P. (2016). Satisfying real-world goals with
dataset constraints. In Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., and Garnett, R., editors,
Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc.

Gordaliza, P., Del Barrio, E., Fabrice, G., and Loubes, J. M. (2019). Obtaining fairness using
optimal transport theory. In International Conference on Machine Learning.

Hardt, M., Price, E., and Srebro, N. (2016). Equality of opportunity in supervised learning. In
Neural Information Processing Systems.

Jiang, R., Pacchiano, A., Stepleton, T., Jiang, H., and Chiappa, S. (2019). Wasserstein fair
classification. arXiv preprint arXiv:1907.12059.

Joseph, M., Kearns, M., Morgenstern, J. H., and Roth, A. (2016). Fairness in learning: Classic and
contextual bandits. In Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., and Garnett, R., editors,
Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc.

Kamiran, F. and Calders, T. (2009). Classifying without discriminating. In International Conference


on Computer, Control and Communication.

Kleinberg, J., Mullainathan, S., and Raghavan, M. (2016). Inherent trade-offs in the fair determina-
tion of risk scores. arXiv preprint arXiv:1609.05807.

Le Gouic, T., Loubes, J., and Rigollet, P. (2020). Projection to fairness in statistical learning. arXiv
preprint arXiv:2005.11720.

Lipton, Z., McAuley, J., and Chouldechova, A. (2018). Does mitigating ML’s impact disparity require
treatment disparity? In Advances in Neural Information Processing Systems, pages 8125–8135.

Lohaus, M., Perrot, M., and Von Luxburg, U. (2020). Too relaxed to be fair. In International
Conference on Machine Learning, pages 6360–6369. PMLR.

Luong, B. T., Ruggieri, S., and Turini, F. (2011). K-nn as an implementation of situation testing for
discrimination discovery and prevention. In Proceedings of the 17th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, KDD ’11, page 502–510, New York, NY,
USA. Association for Computing Machinery.


Martinez, N., Bertran, M., and Sapiro, G. (2020). Minimax pareto fairness: A multi objective
perspective. In International Conference on Machine Learning, pages 6755–6764. PMLR.
Martinez, N. L., Bertran, M. A., Papadaki, A., Rodrigues, M., and Sapiro, G. (2021). Blind pareto
fairness and subgroup robustness. In International Conference on Machine Learning, pages
7492–7501. PMLR.

Massart, P. (1990). The tight constant in the dvoretzky-kiefer-wolfowitz inequality. The Annals of
Probability, 18(3):1269–1283.
Matousek, J. (2007). Using the Borsuk-Ulam Theorem: Lectures on Topological Methods in Combi-
natorics and Geometry. Springer Publishing Company, Incorporated.

Menon, A. K. and Williamson, R. C. (2018). The cost of fairness in binary classification. In


Conference on Fairness, Accountability and Transparency.
Nemirovski, A. S., Polyak, B. T., and Tsybakov, A. B. (1985). Convergence rate of nonparametric
estimates of maximum-likelihood type. Problemy peredachi informatsii, 21(4):17–33.
Orabona, F. (2019). A modern introduction to online learning. ArXiv, abs/1912.13213.

Pedreschi, D., Ruggieri, S., and Turini, F. (2009). Measuring discrimination in socially-sensitive
decision records. In Proceedings of the 2009 SIAM international conference on data mining, pages
581–592. SIAM.
Pedreshi, D., Ruggieri, S., and Turini, F. (2008). Discrimination-aware data mining. In Proceedings
of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining,
pages 560–568.
Schreuder, N. and Chzhen, E. (2021). Classification with abstention but without disparities. UAI.
Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications in Data
Science. Cambridge University Press.

Woodworth, B., Gunasekar, S., Ohannessian, M. I., and Srebro, N. (2017). Learning non-
discriminatory predictors. In Conference on Learning Theory, pages 1920–1953. PMLR.
Yang, Y. (1999). Minimax nonparametric classification: Rates of convergence. IEEE Transactions
on Information Theory, 45(7):2271–2284.

Zemel, R., Wu, Y., Swersky, K., Pitassi, T., and Dwork, C. (2013). Learning fair representations. In
International Conference on Machine Learning.
