Byzantine-Robust Distributed Learning: Towards Optimal Statistical Rates

Dong Yin¹   Yudong Chen²   Kannan Ramchandran¹   Peter Bartlett¹ ³

¹Department of EECS, UC Berkeley  ²School of ORIE, Cornell University  ³Department of Statistics, UC Berkeley. Correspondence to: Dong Yin <dongyin@berkeley.edu>.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract

In this paper, we develop distributed optimization algorithms that are provably robust against Byzantine failures—arbitrary and potentially adversarial behavior, in distributed computing systems, with a focus on achieving optimal statistical performance. A main result of this work is a sharp analysis of two robust distributed gradient descent algorithms based on median and trimmed mean operations, respectively. We prove statistical error rates for all of strongly convex, non-strongly convex, and smooth non-convex population loss functions. In particular, these algorithms are shown to achieve order-optimal statistical error rates for strongly convex losses. To achieve better communication efficiency, we further propose a median-based distributed algorithm that is provably robust and uses only one communication round. For strongly convex quadratic loss, we show that this algorithm achieves the same optimal error rate as the robust distributed gradient descent algorithms.

1. Introduction

Many tasks in computer vision, natural language processing and recommendation systems require learning complex prediction rules from large datasets. As the scale of the datasets in these learning tasks continues to grow, it is crucial to utilize the power of distributed computing and storage. In such large-scale distributed systems, robustness and security issues have become a major concern. In particular, individual computing units—known as worker machines—may exhibit abnormal behavior due to crashes, faulty hardware, stalled computation or unreliable communication channels. Security issues are only exacerbated in the so-called Federated Learning setting, a modern distributed learning paradigm that is more decentralized, and that uses the data owners' devices (such as mobile phones and personal computers) as worker machines (McMahan & Ramage, 2017; Konečnỳ et al., 2016). Such machines are often more unpredictable, and in particular may be susceptible to malicious and coordinated attacks.

Due to the inherent unpredictability of this abnormal (sometimes adversarial) behavior, it is typically modeled as Byzantine failure (Lamport et al., 1982), meaning that some worker machines may behave completely arbitrarily and can send any message to the master machine that maintains and updates an estimate of the parameter vector to be learned. Byzantine failures can incur major degradation in learning performance. It is well known that standard learning algorithms based on naive aggregation of the workers' messages can be arbitrarily skewed by a single Byzantine-faulty machine. Even when the messages from Byzantine machines take only moderate values—hence difficult to detect—and when the number of such machines is small, the performance loss can still be significant. We demonstrate such an example in our experiments in Section 7.

In this paper, we aim to develop distributed statistical learning algorithms that are provably robust against Byzantine failures. While this objective is considered in a few recent works (Feng et al., 2014; Blanchard et al., 2017; Chen et al., 2017), a fundamental problem remains poorly understood, namely the optimal statistical performance of a robust learning algorithm. A learning scheme in which the master machine always outputs zero regardless of the workers' messages is certainly not affected by Byzantine failures, but it will not return anything statistically useful either. On the other hand, many standard distributed algorithms that achieve good statistical performance in the absence of Byzantine failures become completely unreliable otherwise. Therefore, a main goal of this work is to understand the following questions: what is the best achievable statistical performance while being Byzantine-robust, and how can we design an algorithm that achieves such performance?

To formalize this question, we consider a standard statistical setting of empirical risk minimization (ERM). Here nm data points are sampled independently from some distribution and distributed evenly among m machines, αm of which are Byzantine. The goal is to learn a parametric model by minimizing some loss function defined by the data. In this statistical setting, one expects that the error in learning the parameter, measured in an appropriate metric, should decrease when the amount of data nm becomes larger and the fraction of Byzantine machines α becomes smaller.
In fact, we can show that, at least for strongly convex problems, no algorithm can achieve an error lower than

    Ω̃( α/√n + 1/√(nm) ) = Ω̃( (1/√n) (α + 1/√m) ),

regardless of communication costs;¹ see Observation 1 in Section 6. Intuitively, the above error rate is the optimal rate that one should target, as 1/√n is the effective standard deviation for each machine with n data points, α is the bias effect of the Byzantine machines, and 1/√m is the averaging effect of the m normal machines. When there are no or few Byzantine machines, we see the usual scaling 1/√(nm) with the total number of data points; when some machines are Byzantine, their influence remains bounded, and moreover is proportional to α. If an algorithm is guaranteed to attain this bound, we are assured that we do not sacrifice the quality of learning when trying to guard against Byzantine failures—we pay a price that is unavoidable, but otherwise achieve the best possible statistical accuracy in the presence of Byzantine failures.

Another important consideration for us is communication efficiency. As communication between machines is costly, one cannot simply send all data to the master machine. This constraint precludes direct application of standard robust learning algorithms (such as M-estimators (Huber, 2011)), which assume access to all data. Instead, a desirable algorithm should involve a small number of communication rounds as well as a small amount of data communicated per round. We consider a setting where in each round a worker or master machine can only communicate a vector of size O(d), where d is the dimension of the parameter to be learned. In this case, the total communication cost is proportional to the number of communication rounds.

To summarize, we aim to develop distributed learning algorithms that simultaneously achieve two objectives:

• Statistical optimality: attain an Õ(α/√n + 1/√(nm)) rate.
• Communication efficiency: O(d) communication per round, with as few rounds as possible.

To the best of our knowledge, no existing algorithm achieves these two goals simultaneously. In particular, previous robust algorithms either have unclear or sub-optimal statistical guarantees, or incur a high communication cost and hence are not applicable in a distributed setting—we discuss related work in more detail in Section 2.

¹Throughout the paper, unless otherwise stated, Ω(·) and O(·) hide universal multiplicative constants; Ω̃(·) and Õ(·) further hide terms that are independent of α, n, m or logarithmic in n, m.

Our Contributions  We propose two robust distributed gradient descent (GD) algorithms, one based on the coordinate-wise median, and the other on the coordinate-wise trimmed mean. We establish their statistical error rates for strongly convex, non-strongly convex, and non-convex population loss functions. In particular, for strongly convex losses, we show that these algorithms achieve order-optimal statistical rates under mild conditions. We further propose a median-based robust algorithm that only requires one communication round, and show that it also achieves the optimal rate for strongly convex quadratic losses. The statistical error rates of these three algorithms are summarized as follows.

• Median-based GD: Õ(α/√n + 1/√(nm) + 1/n), order-optimal for strongly convex loss if n ≳ m.
• Trimmed-mean-based GD: Õ(α/√n + 1/√(nm)), order-optimal for strongly convex loss.
• Median-based one-round algorithm: Õ(α/√n + 1/√(nm) + 1/n), order-optimal for strongly convex quadratic loss if n ≳ m.

A major technical challenge in our statistical setting is the following: the nm data points are sampled once and fixed, and each worker machine has access to the same set of data throughout the learning process. This creates complicated probabilistic dependency across the iterations of the algorithms. Worse yet, the Byzantine machines may create further unspecified dependency. We overcome this difficulty by proving certain uniform bounds via careful covering arguments. Furthermore, for the analysis of the median-based algorithms, we cannot simply adapt standard techniques (such as those in Minsker et al. (2015)), which can only show that the output of the master machine is as accurate as that of one normal machine, leading to a sub-optimal O(1/√n) rate even without Byzantine failures. Instead, we make use of a more delicate argument based on normal approximation and Berry-Esseen-type inequalities.

2. Related Work

Outlier-robust estimation in non-distributed settings is a classical topic in statistics (Huber, 2011). Particularly relevant to us is the so-called median-of-means method, in which one partitions the data into m subsets, computes an estimate from each sub-dataset, and finally takes the median of these m estimates. This idea is studied in Nemirovskii et al. (1983); Jerrum et al. (1986); Alon et al. (1999); Lerasle & Oliveira (2011); Minsker et al. (2015), and has been applied to bandit and least squares regression problems (Bubeck et al., 2013; Lugosi & Mendelson, 2016; Kogler & Traxler, 2016) as well as to problems involving heavy-tailed distributions (Hsu & Sabato, 2016; Lugosi & Mendelson, 2017). In a very recent work, Minsker & Strawn (2017) provide a new analysis of median-of-means using normal approximation. We borrow some techniques from this paper, but need to address
a significantly harder problem: 1) we deal with the Byzantine setting with arbitrary/adversarial outliers, which is not considered in their paper; 2) we study iterative algorithms for general multi-dimensional problems with convex and non-convex losses, while they mainly focus on one-shot algorithms for mean-estimation-type problems.

The median-of-means method is used in the context of Byzantine-robust distributed learning in two recent papers. In particular, the work of Feng et al. (2014) considers a simple one-shot application of median-of-means, and only proves a sub-optimal Õ(1/√n) error rate as mentioned. The work of Chen et al. (2017) considers only strongly convex losses, and seeks to circumvent the above issue by grouping the worker machines into mini-batches; however, their rate Õ(√α/√n + 1/√(nm)) still falls short of being optimal, and in particular their algorithm fails even when there is only one Byzantine machine in each mini-batch.

Other methods have been proposed for Byzantine-robust distributed learning and optimization; e.g., Su & Vaidya (2016a;b). These works consider optimizing fixed functions and do not provide guarantees on statistical error rates. Most relevant is the work by Blanchard et al. (2017), who propose to aggregate the gradients from worker machines using a robust procedure. Their optimization setting—which is at the level of stochastic gradient descent and assumes unlimited, independent access to a strong stochastic gradient oracle—is fundamentally different from ours; in particular, they do not provide a characterization of the statistical errors given a fixed number of data points.

Communication efficiency has been studied extensively in non-Byzantine distributed settings (McMahan et al., 2016; Yin et al., 2017). An important class of algorithms is based on one-round aggregation methods (Zhang et al., 2012; 2015; Rosenblatt & Nadler, 2016). More sophisticated algorithms have been proposed in order to achieve better accuracy than the one-round approach while maintaining lower communication costs; examples include DANE (Shamir et al., 2014), DiSCO (Zhang & Lin, 2015), distributed SVRG (Lee et al., 2015) and their variants (Reddi et al., 2016; Wang et al., 2017). Developing Byzantine-robust versions of these algorithms is an interesting future direction.

For outlier-robust estimation in non-distributed settings, much progress has been made recently in terms of improved performance in high-dimensional problems (Diakonikolas et al., 2016; Lai et al., 2016; Bhatia et al., 2015), as well as in developing list-decodable and semi-verified learning schemes when a majority of the data points are adversarial (Charikar et al., 2017). These results are not directly applicable to our distributed setting with general loss functions, but it is nevertheless an interesting future problem to investigate their potential extension to our problem.

3. Problem Setup

In this section, we formally set up our problem and introduce a few concepts key to the algorithm design and analysis. Suppose that training data points are sampled from some unknown distribution D on the sample space Z. Let f(w; z) be a loss function of a parameter vector w ∈ W ⊆ R^d associated with the data point z, where W is the parameter space, and let F(w) := E_{z∼D}[f(w; z)] be the corresponding population loss function. Our goal is to learn a model defined by the parameter that minimizes the population loss:

    w* = arg min_{w ∈ W} F(w).    (1)

The parameter space W is assumed to be convex and compact with diameter D, i.e., ‖w − w'‖_2 ≤ D for all w, w' ∈ W.

We consider a distributed computation model with one master machine and m worker machines. Each worker machine stores n data points, each of which is sampled independently from D. Denote by z_{i,j} the j-th data point on the i-th worker machine, and by F_i(w) := (1/n) Σ_{j=1}^n f(w; z_{i,j}) the empirical risk function for the i-th worker. We assume that an α fraction of the m worker machines are Byzantine, and the remaining 1 − α fraction are normal. With the notation [N] := {1, 2, ..., N}, we index the set of worker machines by [m], and denote the set of Byzantine machines by B ⊂ [m] (thus |B| = αm). The master machine communicates with the worker machines using some predefined protocol. The Byzantine machines need not obey this protocol and can send arbitrary messages to the master; in particular, they may have complete knowledge of the system and learning algorithms, and can collude with each other.

We introduce the coordinate-wise median and trimmed mean operations, which serve as building blocks for our algorithms.

Definition 1 (Coordinate-wise median). For vectors x^i ∈ R^d, i ∈ [m], the coordinate-wise median g := med{x^i : i ∈ [m]} is a vector with its k-th coordinate being g_k = med{x^i_k : i ∈ [m]} for each k ∈ [d], where med is the usual (one-dimensional) median.

Definition 2 (Coordinate-wise trimmed mean). For β ∈ [0, 1/2) and vectors x^i ∈ R^d, i ∈ [m], the coordinate-wise β-trimmed mean g := trmean_β{x^i : i ∈ [m]} is a vector with its k-th coordinate being g_k = (1/((1 − 2β)m)) Σ_{x ∈ U_k} x for each k ∈ [d]. Here U_k is a subset of {x^1_k, ..., x^m_k} obtained by removing the largest and smallest β fraction of its elements.
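To make Definitions 1 and 2 concrete, the following short NumPy sketch (our own illustration, not code from the paper) computes both aggregates; when β·m is not an integer it simply rounds the number of trimmed values down. Note how the wildly corrupted fifth vector barely moves either aggregate.

```python
import numpy as np

def coordinate_median(xs):
    """Coordinate-wise median (Definition 1) of an (m, d) array of m vectors in R^d."""
    return np.median(xs, axis=0)

def coordinate_trimmed_mean(xs, beta):
    """Coordinate-wise beta-trimmed mean (Definition 2).

    For each coordinate, drop the largest and smallest beta-fraction of the m
    values (rounding beta*m down) and average the rest.
    """
    m = xs.shape[0]
    b = int(np.floor(beta * m))      # values trimmed on each side, per coordinate
    s = np.sort(xs, axis=0)          # sort every coordinate independently
    return (s[b:m - b] if b > 0 else s).mean(axis=0)

# Five vectors in R^3, one of them wildly corrupted.
vectors = np.array([[1.0, 2.0, 3.0],
                    [1.1, 1.9, 3.2],
                    [0.9, 2.1, 2.8],
                    [1.0, 2.0, 3.1],
                    [100.0, -50.0, 7.0]])     # "Byzantine" vector
print(coordinate_median(vectors))             # stays close to [1, 2, 3]
print(coordinate_trimmed_mean(vectors, 0.2))  # trims one value per side per coordinate
```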
For the analysis, we need several standard definitions concerning random variables and vectors.

Definition 3 (Variance of random vectors). For a random vector x, define its variance as Var(x) := E[‖x − E[x]‖_2^2].

Definition 4 (Absolute skewness). For a one-dimensional random variable X, define its absolute skewness² as γ(X) := E[|X − E[X]|^3] / Var(X)^{3/2}. For a d-dimensional random vector x, we define its absolute skewness as the vector of the absolute skewness of each coordinate of x, i.e., γ(x) := [γ(x_1) γ(x_2) ··· γ(x_d)]^T.

²Note the difference with the usual skewness E[(X − E[X])^3] / Var(X)^{3/2}.

Definition 5 (Sub-exponential random variables). A random variable X with E[X] = µ is called v-sub-exponential if E[e^{λ(X − µ)}] ≤ e^{v^2 λ^2 / 2} for all |λ| < 1/v.

Finally, we need several standard concepts from convex analysis regarding a differentiable function h(·) : R^d → R.

Definition 6 (Lipschitz). h is L-Lipschitz if |h(w) − h(w')| ≤ L‖w − w'‖_2 for all w, w'.

Definition 7 (Smoothness). h is L'-smooth if ‖∇h(w) − ∇h(w')‖_2 ≤ L'‖w − w'‖_2 for all w, w'.

Definition 8 (Strong convexity). h is λ-strongly convex if h(w') ≥ h(w) + ⟨∇h(w), w' − w⟩ + (λ/2)‖w' − w‖_2^2 for all w, w'.

4. Robust Distributed Gradient Descent

We describe two robust distributed gradient descent algorithms, one based on the coordinate-wise median and the other on the trimmed mean. These two algorithms are formally given in Algorithm 1 as Option I and Option II, respectively, where the symbol ∗ represents an arbitrary vector.

Algorithm 1  Robust Distributed Gradient Descent

Require: initial parameter vector w^0 ∈ W, algorithm parameters β (for Option II), η and T.
for t = 0, 1, 2, ..., T − 1 do
    Master machine: send w^t to all the worker machines.
    for i ∈ [m] in parallel do
        Worker machine i: compute the local gradient
            g^i(w^t) ← ∇F_i(w^t)  (normal machines),   g^i(w^t) ← ∗  (Byzantine machines);
        send g^i(w^t) to the master machine.
    end for
    Master machine: compute the aggregate gradient
        g(w^t) ← med{g^i(w^t) : i ∈ [m]}         (Option I)
        g(w^t) ← trmean_β{g^i(w^t) : i ∈ [m]}    (Option II)
    update the model parameter: w^{t+1} ← Π_W(w^t − η g(w^t)).
end for

In each parallel iteration of the algorithms, the master machine broadcasts the current model parameter to all worker machines. The normal worker machines compute the gradients of their local loss functions and then send the gradients back to the master machine. The Byzantine machines may send any messages of their choice. The master machine then performs a gradient descent update on the model parameter with step-size η, using either the coordinate-wise median or the trimmed mean of the received gradients. The Euclidean projection Π_W(·) ensures that the model parameter stays in the parameter space W.

Below we provide statistical guarantees on the error rates of these algorithms, and compare their performance. Throughout we assume that each loss function f(w; z) and the population loss function F(w) are smooth:

Assumption 1 (Smoothness of f and F). For any z ∈ Z, the partial derivative of f(·; z) with respect to the k-th coordinate of its first argument, denoted by ∂_k f(·; z), is L_k-Lipschitz for each k ∈ [d], and the function f(·; z) is L-smooth. Let L̂ := (Σ_{k=1}^d L_k^2)^{1/2}. Also assume that the population loss function F(·) is L_F-smooth.

It is easy to see that L_F ≤ L ≤ L̂. We note that L̂ appears because we perform coordinate-wise operations, and the quantity L̂ combines the smoothness parameters of all d partial derivatives. When the dimension of w is high, the quantity L̂ may be large. However, we will soon see that L̂ only appears in the logarithmic factors in our bounds and thus does not have a significant impact.

In addition, when F(·) is convex, we assume that w*, the minimizer of F(·) in W, is also the minimizer of F(·) in R^d. Formally, we have

Assumption 2 (Minimizer in W). Suppose that F(w) is convex, and let w* = arg min_{w ∈ W} F(w). We assume that ∇F(w*) = 0.

4.1. Median-based Gradient Descent

We first consider our median-based algorithm, namely Algorithm 1 with Option I. We impose the assumptions that the gradient of the loss function f has bounded variance, and that each coordinate of the gradient has bounded absolute skewness:

Assumption 3 (Bounded variance of gradient). For any w ∈ W, Var(∇f(w; z)) ≤ V^2.

Assumption 4 (Bounded skewness of gradient). For any w ∈ W, ‖γ(∇f(w; z))‖_∞ ≤ S.

These assumptions are satisfied in many learning problems with small values of V^2 and S. Below we provide a concrete example in terms of a linear regression problem.

Proposition 1. Suppose that each data point z = (x, y) ∈ R^d × R is generated by y = x^T w* + ξ with some w* ∈ W. Assume that the elements of x are independent and uniformly distributed in {−1, 1}, and that the noise ξ ∼ N(0, σ^2) is independent of x. With the quadratic loss function f(w; x, y) = (1/2)(y − x^T w)^2, we have Var(∇f(w; x, y)) = (d − 1)‖w − w*‖_2^2 + dσ^2, and ‖γ(∇f(w; x, y))‖_∞ ≤ 480.

We prove Proposition 1 in Appendix A.1. In this example, the upper bound V on Var(∇f(w; x, y)) depends on the dimension d and the diameter of the parameter space, and if the diameter is a constant, we have V = O(√d). Moreover, the gradient skewness is bounded by a universal constant S regardless of the size of the parameter space.
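Before turning to the statistical guarantees, the following minimal simulation sketch (our own illustration, not the authors' code) puts Algorithm 1 together with the linear-regression setup of Proposition 1. The sign-flipped gradient sent by the Byzantine workers, the step size, and the omission of the projection Π_W are illustrative choices only; real Byzantine workers may send anything.

```python
import numpy as np

def local_gradient(w, X, y):
    """Average gradient of the quadratic loss (1/2)(y - x^T w)^2 over one worker's data."""
    return X.T @ (X @ w - y) / len(y)

def robust_distributed_gd(Xs, ys, byzantine, option="median", beta=0.1, eta=0.1, T=200):
    """Minimal sketch of Algorithm 1: distributed GD with robust gradient aggregation.

    Xs, ys   : lists holding each worker's local design matrix and labels.
    byzantine: indices of Byzantine workers; here they send -10 times the true
               gradient, purely for illustration.
    option   : "median" (Option I) or "trimmed_mean" (Option II).
    """
    d = Xs[0].shape[1]
    w = np.zeros(d)
    for _ in range(T):
        grads = []
        for i, (X, y) in enumerate(zip(Xs, ys)):
            g = local_gradient(w, X, y)          # normal worker's message
            if i in byzantine:
                g = -10.0 * g                    # arbitrary adversarial message
            grads.append(g)
        grads = np.stack(grads)
        if option == "median":
            agg = np.median(grads, axis=0)                      # Option I
        else:
            m = grads.shape[0]
            b = int(np.floor(beta * m))
            s = np.sort(grads, axis=0)
            agg = (s[b:m - b] if b > 0 else s).mean(axis=0)     # Option II
        w = w - eta * agg    # projection onto W omitted in this sketch
    return w

# Toy run in the setting of Proposition 1: y = x^T w* + noise, x uniform in {-1, 1}^d.
rng = np.random.default_rng(0)
d, m, n = 5, 10, 200
w_star = rng.normal(size=d)
Xs = [rng.choice([-1.0, 1.0], size=(n, d)) for _ in range(m)]
ys = [X @ w_star + 0.1 * rng.normal(size=n) for X in Xs]
w_hat = robust_distributed_gd(Xs, ys, byzantine={0, 1}, option="median")
print(np.linalg.norm(w_hat - w_star))   # small despite two Byzantine workers
```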
We now state our main technical results on the median-based algorithm, namely statistical error guarantees for strongly convex, non-strongly convex, and smooth non-convex population loss functions F.

Strongly Convex Losses:  We first consider the case where the population loss function F(·) is strongly convex. Note that we do not require strong convexity of the individual loss function f(·; z).

Theorem 1. Consider Option I in Algorithm 1. Suppose that Assumptions 1, 2, 3, and 4 hold, F(·) is λ_F-strongly convex, and the fraction α of Byzantine machines satisfies

    α + √( d log(1 + nm L̂ D) / (m(1 − α)) ) + 0.4748 S/√n ≤ 1/2 − ε    (2)

for some ε > 0. Choose step-size η = 1/L_F. Then, with probability at least 1 − 4d/(1 + nm L̂ D)^d, after T parallel iterations, we have

    ‖w^T − w*‖_2 ≤ (1 − λ_F/(L_F + λ_F))^T ‖w^0 − w*‖_2 + (2/λ_F) ∆,

where

    ∆ := O( C_ε V ( α/√n + √( d log(nm L̂ D) / (nm) ) + S/n ) ),    (3)

and C_ε is defined as

    C_ε := √(2π) exp( (1/2) (Φ^{-1}(1 − ε))^2 ),    (4)

with Φ^{-1}(·) being the inverse of the cumulative distribution function Φ(·) of the standard Gaussian distribution.

We prove Theorem 1 in Appendix B. In (3), we hide universal constants and a higher-order term that scales as 1/(nm), and the factor C_ε is a function of ε; as a concrete example, C_ε ≈ 4 when ε = 1/6. Theorem 1, together with the inequality log(1 − x) ≤ −x, guarantees that after running T ≥ ((L_F + λ_F)/λ_F) log( λ_F ‖w^0 − w*‖_2 / (2∆) ) parallel iterations, with high probability we can obtain a solution ŵ = w^T with error ‖ŵ − w*‖_2 ≤ (4/λ_F) ∆.

Here we achieve an error rate (defined as the distance between ŵ and the optimal solution w*) of the form Õ(α/√n + 1/√(nm) + 1/n). In Section 6, we provide a lower bound showing that the error rate of any algorithm is Ω̃(α/√n + 1/√(nm)). Therefore the first two terms in the upper bound cannot be improved. The third term 1/n is due to the dependence of the median on the skewness of the gradients. When each worker machine has a sufficient amount of data, more specifically n ≳ m, we achieve an order-optimal error rate up to logarithmic factors.

Non-strongly Convex Losses:  We next consider the case where the population risk function F(·) is convex, but not necessarily strongly convex. In this case, we need a mild technical assumption on the size of the parameter space W.

Assumption 5 (Size of W). The parameter space W contains the following ℓ_2 ball centered at w*: {w ∈ R^d : ‖w − w*‖_2 ≤ 2‖w^0 − w*‖_2}.

This assumption (and Assumption 6 below) ensures that the iterates w^t always stay in W without projection. Doing so streamlines our analysis, as our main focus is on robustness. We then have the following result on the convergence rate in terms of the value of the population risk function.

Theorem 2. Consider Option I in Algorithm 1. Suppose that Assumptions 1, 2, 3, 4 and 5 hold, the population loss F(·) is convex, and α satisfies (2) for some ε > 0. Define ∆ as in (3), and choose step-size η = 1/L_F. Then, with probability at least 1 − 4d/(1 + nm L̂ D)^d, after T = (L_F/∆) ‖w^0 − w*‖_2 parallel iterations, we have

    F(w^T) − F(w*) ≤ 16 ‖w^0 − w*‖_2 ∆ ( 1 + ∆/(2 L_F) ).

We prove Theorem 2 in Appendix C. We observe that the error rate, defined as the excess risk F(w^T) − F(w*), again has the form Õ(α/√n + 1/√(nm) + 1/n).

Non-convex Losses:  When F(·) is non-convex but smooth, we need a somewhat different technical assumption on the size of W.

Assumption 6 (Size of W). Suppose that ‖∇F(w)‖_2 ≤ M for all w ∈ W. We assume that W contains the ℓ_2 ball {w ∈ R^d : ‖w − w^0‖_2 ≤ (2/∆^2)(M + ∆)(F(w^0) − F(w*))}, where ∆ is defined as in (3).

We have the following guarantee on the rate of convergence to a critical point of the population loss F(·).

Theorem 3. Consider Option I in Algorithm 1. Suppose that Assumptions 1, 3, 4 and 6 hold, and α satisfies (2) for some ε > 0. Define ∆ as in (3), and choose step-size η = 1/L_F. With probability at least 1 − 4d/(1 + nm L̂ D)^d, after T = (2 L_F/∆^2)(F(w^0) − F(w*)) parallel iterations, we have

    min_{t=0,1,...,T} ‖∇F(w^t)‖_2 ≤ √2 ∆.

We prove Theorem 3 in Appendix D. We again obtain an Õ(α/√n + 1/√(nm) + 1/n) error rate, in terms of the gap to a critical point of F(w).

4.2. Trimmed-mean-based Gradient Descent

We next analyze the robust distributed gradient descent algorithm based on the coordinate-wise trimmed mean, namely Option II in Algorithm 1. Here we need stronger assumptions on the tail behavior of the partial derivatives of the loss functions—in particular, sub-exponentiality.

Assumption 7 (Sub-exponential gradients). We assume that for all k ∈ [d] and w ∈ W, the partial derivative of f(w; z) with respect to the k-th coordinate of w, ∂_k f(w; z), is v-sub-exponential.
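To recall what Assumption 7 entails, a standard Chernoff argument (a textbook fact, included here only as a reminder and not part of the paper's analysis) converts the moment-generating-function condition of Definition 5 into an exponential tail bound:

```latex
% For a v-sub-exponential X with mean \mu (Definition 5), any 0 < \lambda < 1/v and t > 0 give
\Pr\{X - \mu \ge t\}
  \le e^{-\lambda t}\,\mathbb{E}\bigl[e^{\lambda (X - \mu)}\bigr]
  \le \exp\Bigl(\tfrac{1}{2} v^2 \lambda^2 - \lambda t\Bigr).
% Choosing \lambda = t/v^2 when t \le v, and letting \lambda \uparrow 1/v when t > v, yields
\Pr\{|X - \mu| \ge t\} \le 2 \exp\Bigl(-\tfrac{1}{2}\min\{t^2/v^2,\; t/v\}\Bigr).
```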
The sub-exponential property implies that all the moments of the derivatives are bounded. This is a stronger assumption than the bounded absolute skewness (hence bounded third moments) required by the median-based GD algorithm. We use the same example as in Proposition 1 and show that the derivatives of the loss are indeed sub-exponential.

Proposition 2. Consider the regression problem in Proposition 1. For all k ∈ [d] and w ∈ W, the partial derivative ∂_k f(w; z) is √(σ^2 + ‖w − w*‖_2^2)-sub-exponential.

Proposition 2 is proved in Appendix A.3. We now proceed to establish the statistical guarantees of the trimmed-mean-based algorithm for different loss function classes. When the population loss F(·) is convex, we again assume that the minimizer of F(·) in W is also its minimizer in R^d. The next three theorems are analogues of Theorems 1–3 for the median-based GD algorithm.

Strongly Convex Losses:  We have the following result.

Theorem 4. Consider Option II in Algorithm 1. Suppose that Assumptions 1, 2, and 7 hold, F(·) is λ_F-strongly convex, and α ≤ β ≤ 1/2 − ε for some ε > 0. Choose step-size η = 1/L_F. Then, with probability at least 1 − 4d/(1 + nm L̂ D)^d, after T parallel iterations, we have

    ‖w^T − w*‖_2 ≤ (1 − λ_F/(L_F + λ_F))^T ‖w^0 − w*‖_2 + (2/λ_F) ∆',

where

    ∆' := O( v d ( β/√n + 1/√(nm) ) √( log(nm L̂ D) ) ).    (5)

We prove Theorem 4 in Appendix E. In (5), we hide universal constants and higher-order terms that scale as β/n or 1/(nm). By running T ≥ ((L_F + λ_F)/λ_F) log( λ_F ‖w^0 − w*‖_2 / (2∆') ) parallel iterations, we can obtain a solution ŵ = w^T satisfying ‖ŵ − w*‖_2 ≤ Õ(β/√n + 1/√(nm)). Note that one needs to choose the trimmed-mean parameter β to satisfy β ≥ α. If we set β = cα for some universal constant c ≥ 1, we achieve an order-optimal error rate Õ(α/√n + 1/√(nm)).

Non-strongly Convex Losses:  Again imposing Assumption 5 on the size of W, we have the following guarantee.

Theorem 5. Consider Option II in Algorithm 1. Suppose that Assumptions 1, 2, 5 and 7 hold, F(·) is convex, and α ≤ β ≤ 1/2 − ε for some ε > 0. Choose step-size η = 1/L_F, and define ∆' as in (5). Then, with probability at least 1 − 4d/(1 + nm L̂ D)^d, after T = (L_F/∆') ‖w^0 − w*‖_2 parallel iterations, we have

    F(w^T) − F(w*) ≤ 16 ‖w^0 − w*‖_2 ∆' ( 1 + ∆'/(2 L_F) ).

The proof of Theorem 5 is similar to that of Theorem 2, and we refer readers to Remark 1 in Appendix E. Again, by choosing β = cα (c ≥ 1), we obtain the Õ(α/√n + 1/√(nm)) error rate in the function value of F(w).

Non-convex Losses:  In this case, imposing a version of Assumption 6 on the size of W, we have the following.

Theorem 6. Consider Option II in Algorithm 1, and define ∆' as in (5). Suppose that Assumptions 1 and 7 hold, Assumption 6 holds with ∆ replaced by ∆', and α ≤ β ≤ 1/2 − ε for some ε > 0. Choose step-size η = 1/L_F. Then, with probability at least 1 − 4d/(1 + nm L̂ D)^d, after T = (2 L_F/∆'^2)(F(w^0) − F(w*)) parallel iterations, we have

    min_{t=0,1,...,T} ‖∇F(w^t)‖_2 ≤ √2 ∆'.

The proof of Theorem 6 is similar to that of Theorem 3; see Remark 1 in Appendix E. By choosing β = cα with c ≥ 1, we again achieve the statistical rate Õ(α/√n + 1/√(nm)).

4.3. Comparisons

We compare the performance guarantees of the above two robust distributed GD algorithms. The trimmed-mean-based algorithm achieves the statistical error rate Õ(α/√n + 1/√(nm)), which is order-optimal for strongly convex loss. In comparison, the rate of the median-based algorithm is Õ(α/√n + 1/√(nm) + 1/n), which has an additional 1/n term and is only optimal when n ≳ m. In particular, the trimmed-mean-based algorithm has better rates when each worker machine has a small local sample size—the rates are meaningful even in the extreme case n = O(1). On the other hand, the median-based algorithm requires milder tail/moment assumptions on the loss derivatives (bounded skewness) than its trimmed-mean counterpart (sub-exponentiality). Finally, the trimmed-mean operation requires an additional parameter β, which can be any upper bound on the fraction α of Byzantine machines in order to guarantee robustness. Using an overly large β may lead to a looser bound and sub-optimal performance. In contrast, median-based GD does not require knowledge of α. We summarize these observations in Table 1. We see that the two algorithms are complementary to each other, and our experimental results corroborate this point.

                     median GD                         trimmed-mean GD
Rate                 Õ(α/√n + 1/√(nm) + 1/n)          Õ(α/√n + 1/√(nm))
∂_k f(w; z)          Bounded skewness                  Sub-exponential
α known?             No                                Yes

Table 1. Comparison between the two robust distributed gradient descent algorithms.

5. Robust One-round Algorithm

As mentioned, in our distributed computing framework, the communication cost is proportional to the number of parallel iterations. The above two GD algorithms both require a number of iterations that depends on the desired accuracy. Can we further reduce the communication cost while keeping the algorithm Byzantine-robust and statistically optimal?
A natural candidate is the so-called one-round algorithm. Previous work has considered a standard one-round scheme in which each local machine computes the empirical risk minimizer (ERM) using its local data, and the master machine receives all the workers' ERMs and computes their average (Zhang et al., 2012). Clearly, a single Byzantine machine can arbitrarily skew the output of this algorithm. We instead consider a Byzantine-robust one-round algorithm. As detailed in Algorithm 2, we employ the coordinate-wise median operation to aggregate all the ERMs.

Algorithm 2  Robust One-round Algorithm

for i ∈ [m] in parallel do
    Worker machine i: compute and send to the master machine
        ŵ^i ← arg min_{w ∈ W} F_i(w)  (normal machines),   ŵ^i ← ∗  (Byzantine machines).
end for
Master machine: compute ŵ ← med{ŵ^i : i ∈ [m]}.

Our main result is a characterization of the error rate of Algorithm 2 in the presence of Byzantine failures. We are only able to establish such a guarantee when the loss functions are quadratic and W = R^d. However, one can implement this algorithm in problems with other loss functions.

Definition 9 (Quadratic loss function). The loss function f(w; z) is quadratic if it can be written as

    f(w; z) = (1/2) w^T H w + p^T w + c,

where z = (H, p, c), and H, p, and c are drawn from the distributions D_H, D_p, and D_c, respectively.

Denote by H_F, p_F, and c_F the expectations of H, p, and c, respectively. Thus the population risk function takes the form F(w) = (1/2) w^T H_F w + p_F^T w + c_F.

We need a technical assumption which guarantees that each normal worker machine has a unique ERM.

Assumption 8 (Strong convexity of F_i). With probability 1, the empirical risk function F_i(·) on each normal machine is strongly convex.

Note that this assumption is imposed on F_i(w), rather than on the individual loss f(w; z) associated with a single data point. This assumption is satisfied, for example, when all the f(·; z)'s are strongly convex, or in linear regression problems with the features x drawn from some continuous distribution (e.g., isotropic Gaussian) and n ≥ d. We have the following guarantee for the robust one-round algorithm.

Theorem 7. Suppose that for all z ∈ Z the loss function f(·; z) is convex and quadratic, F(·) is λ_F-strongly convex, and Assumption 8 holds. Assume that α satisfies

    α + √( log(nmd) / (2m(1 − α)) ) + C̃/√n ≤ 1/2 − ε

for some ε > 0, where C̃ is a quantity that depends on D_H, D_p, and λ_F and is monotonically decreasing in n. Then, with probability at least 1 − 4/(nm), the output ŵ of the robust one-round algorithm satisfies

    ‖ŵ − w*‖_2 ≤ (C_ε σ̃ / √n) ( α + √( log(nmd) / (2m(1 − α)) ) + C̃/√n ),

where C_ε is defined as in (4) and

    σ̃^2 := E[ ‖ H_F^{-1} ( (H − H_F) H_F^{-1} p_F − (p − p_F) ) ‖_2^2 ],

with H and p drawn from D_H and D_p, respectively.

We prove Theorem 7 and provide an explicit expression for C̃ in Appendix F. In terms of the dependence on α, n, and m, the robust one-round algorithm achieves the same error rate as the robust gradient descent algorithm based on the coordinate-wise median, i.e., Õ(α/√n + 1/√(nm) + 1/n), for quadratic problems. Again, this rate is optimal when n ≳ m. Therefore, at least for quadratic loss functions, the robust one-round algorithm has theoretical performance similar to that of the robust gradient descent algorithm, with significantly less communication cost. Our experiments show that the one-round algorithm has good empirical performance for other losses as well.
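For the quadratic (least-squares) case covered by Theorem 7, Algorithm 2 can be sketched in a few lines of NumPy (our own illustration, not the authors' implementation): each normal machine solves its local least-squares problem in closed form, Byzantine machines report an arbitrary vector, and the master takes the coordinate-wise median.

```python
import numpy as np

def robust_one_round(Xs, ys, byzantine, rng=None):
    """Minimal sketch of Algorithm 2 for quadratic (least-squares) losses.

    Normal worker i sends its local ERM, i.e., the least-squares solution on its
    own data; Byzantine workers send an arbitrary vector (random noise here,
    purely for illustration). The master aggregates with the coordinate-wise median.
    """
    rng = rng or np.random.default_rng(0)
    local_erms = []
    for i, (X, y) in enumerate(zip(Xs, ys)):
        if i in byzantine:
            w_i = rng.normal(scale=100.0, size=X.shape[1])   # arbitrary message
        else:
            w_i, *_ = np.linalg.lstsq(X, y, rcond=None)      # local ERM
        local_erms.append(w_i)
    return np.median(np.stack(local_erms), axis=0)           # coordinate-wise median
```

Each worker communicates a single O(d) vector, so the whole procedure uses one communication round, in contrast to the T rounds of Algorithm 1.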
6. Lower Bound

In this section, we provide a lower bound on the error rate for strongly convex losses, which implies that the α/√n + 1/√(nm) term is unimprovable. This lower bound is derived using a mean estimation problem, and is an extension of the lower bounds in the robust mean estimation literature such as Chen et al. (2015); Lai et al. (2016).

We consider the problem of estimating the mean µ of some random variable z ∼ D, which is equivalent to solving the following minimization problem:

    µ = arg min_{w ∈ W} E_{z∼D}[ ‖w − z‖_2^2 ].    (6)

Note that this is a special case of the general learning problem (1). We consider the same distributed setting as in Section 4, with a minor technical difference regarding the Byzantine machines. We assume that each of the m worker machines is Byzantine with probability α, independently of each other. The parameter α is therefore the expected fraction of Byzantine machines. In this setting we have the lower bound in Observation 1. In Appendix G, we also discuss how to translate this average-case bound into a lower bound that holds under the setting of our main theorems, that is, when an unknown set of αm Byzantine machines is selected without any assumption.

Observation 1. Consider the distributed mean estimation problem in (6) with Byzantine failure probability α, and suppose that z has a Gaussian distribution with mean µ and covariance matrix σ^2 I (σ = O(1)). Then, any algorithm that computes an estimate µ̂ of the mean from the data has a constant probability of error ‖µ̂ − µ‖_2 = Ω( α/√n + √(d/(nm)) ).

We prove Observation 1 in Appendix G. According to this observation, we see that the α/√n + 1/√(nm) dependence cannot be avoided, which in turn implies the order-optimality of the results in Theorem 1 (when n ≳ m) and Theorem 4.

7. Experiments

We conduct experiments to show the effectiveness of the median and trimmed mean operations. Our experiments are implemented with TensorFlow (Abadi et al., 2016) on the Microsoft Azure system. We use the MNIST (LeCun et al., 1998) dataset and randomly partition the 60,000 training data points into m subsamples of equal size. We use these subsamples to represent the data on the m machines.

In the first experiment, we compare the performance of distributed gradient descent algorithms in the following 4 settings: 1) α = 0 (no Byzantine machines), using vanilla distributed gradient descent (aggregating the gradients by taking the mean); 2) α > 0, using vanilla distributed gradient descent; 3) α > 0, using the median-based algorithm; and 4) α > 0, using the trimmed-mean-based algorithm. We generate the Byzantine machines in the following way: we replace every training label y on these machines with 9 − y (e.g., 0 is replaced with 9, 1 is replaced with 8, etc.), and the Byzantine machines simply compute gradients based on these data. We also note that when generating the Byzantine machines, we do not simply add extreme values to the features or gradients; instead, the Byzantine machines send messages with moderate values to the master machine.

We train a multi-class logistic regression model and a convolutional neural network (CNN) using distributed gradient descent, and for each model, we compare the test accuracies in the aforementioned 4 settings. For the convolutional neural network model, we use the stochastic version of the distributed gradient descent algorithm; more specifically, in every iteration, each worker machine computes the gradient using 10% of its local data. We plot the test error as a function of the number of parallel iterations (i.e., communication rounds) in Figure 1. The final test accuracies are presented in Tables 2 and 3.

[Figure 1: two panels, "Logistic regression" and "Convolutional neural network", plotting test error against the number of parallel iterations for mean aggregation with α = 0 and for mean, median, and trimmed-mean aggregation with α > 0.]

Figure 1. Test error vs. the number of parallel iterations. For logistic regression, we set m = 40, and for trimmed mean, we choose β = 0.05; for the CNN, we set m = 10, and for trimmed mean, we choose β = 0.1.

α            0       0.05
Algorithm    mean    mean    median    trimmed mean
Acc. (%)     88.0    76.8    87.2      86.9

Table 2. Test accuracy of the logistic regression model using gradient descent. We set m = 40, and for trimmed mean, we choose β = 0.05.

α            0       0.1
Algorithm    mean    mean    median    trimmed mean
Acc. (%)     94.3    77.3    87.4      90.7

Table 3. Test accuracy of the CNN using gradient descent. We set m = 10, and for trimmed mean, we choose β = 0.1.

As we can see, in the adversarial settings, the vanilla distributed gradient descent algorithm suffers from severe performance loss, while with the median and trimmed mean operations we observe significant improvements in test accuracy. This shows that these two operations are indeed effective in defending against Byzantine failures.

In the second experiment, we compare the performance of distributed one-round algorithms in the following 3 settings: 1) α = 0, mean aggregation; 2) α > 0, mean aggregation; and 3) α > 0, median aggregation. To show that our algorithm is able to defend against different types of adversarial behavior, we generate the Byzantine machines differently from the first experiment—the training labels are i.i.d. uniformly sampled from {0, ..., 9}, and these machines train models using the faulty data. We choose the multi-class logistic regression model, and the test accuracies are presented in Table 4.

α            0       0.1
Algorithm    mean    mean    median
Acc. (%)     91.8    83.7    89.0

Table 4. Test accuracy of the logistic regression model using the one-round algorithm. We set m = 10.

As we can see, although the theoretical guarantee for the one-round algorithm is only proved for quadratic loss, in practice the median-based one-round algorithm still improves the test accuracy in problems with other loss functions, such as the logistic loss here.
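For readers who want to reproduce the attack setups above, both kinds of faulty workers are easy to emulate; the following short sketch (our own illustration, not the authors' released code) shows the two label corruptions on an MNIST-style label vector.

```python
import numpy as np

rng = np.random.default_rng(0)

def flip_labels(y):
    """First experiment: a Byzantine worker replaces every label y with 9 - y."""
    return 9 - y

def random_labels(y):
    """Second experiment: a Byzantine worker draws labels i.i.d. uniformly from {0, ..., 9}."""
    return rng.integers(0, 10, size=y.shape)

y = np.array([0, 1, 2, 7, 9])
print(flip_labels(y))     # [9 8 7 2 0]
print(random_labels(y))   # five digits drawn uniformly at random
```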
Acknowledgements

We gratefully acknowledge the support of the NSF through grant IIS-1619362, CIF award 1703678, CRII award 1657420 and grant 1704828. We also acknowledge the support of Berkeley DeepDrive Industry Consortium and Gift award from Huawei. Cloud computing resources are provided by a Microsoft Azure for Research award.

References

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pp. 265–283, 2016.

Alon, N., Matias, Y., and Szegedy, M. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137–147, 1999.

Bhatia, K., Jain, P., and Kar, P. Robust regression via hard thresholding. In Advances in Neural Information Processing Systems, pp. 721–729, 2015.

Blanchard, P., Mhamdi, E. M. E., Guerraoui, R., and Stainer, J. Byzantine-tolerant machine learning. arXiv preprint arXiv:1703.02757, 2017.

Bubeck, S., Cesa-Bianchi, N., and Lugosi, G. Bandits with heavy tail. IEEE Transactions on Information Theory, 59(11):7711–7717, 2013.

Charikar, M., Steinhardt, J., and Valiant, G. Learning from untrusted data. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 47–60. ACM, 2017.

Chen, M., Gao, C., and Ren, Z. Robust covariance matrix estimation via matrix depth. arXiv preprint arXiv:1506.00691, 2015.

Chen, Y., Su, L., and Xu, J. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. arXiv preprint arXiv:1705.05491, 2017.

Diakonikolas, I., Kamath, G., Kane, D. M., Li, J., Moitra, A., and Stewart, A. Robust estimators in high dimensions without the computational intractability. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pp. 655–664. IEEE, 2016.

Feng, J., Xu, H., and Mannor, S. Distributed robust learning. arXiv preprint arXiv:1409.5937, 2014.

Hsu, D. and Sabato, S. Loss minimization and parameter estimation with heavy tails. The Journal of Machine Learning Research, 17(1):543–582, 2016.

Huber, P. J. Robust statistics. In International Encyclopedia of Statistical Science, pp. 1248–1251. Springer, 2011.

Jerrum, M. R., Valiant, L. G., and Vazirani, V. V. Random generation of combinatorial structures from a uniform distribution. Theoretical Computer Science, 43:169–188, 1986.

Kogler, A. and Traxler, P. Efficient and robust median-of-means algorithms for location and regression. In Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), 2016 18th International Symposium on, pp. 206–213. IEEE, 2016.

Konečnỳ, J., McMahan, H. B., Yu, F. X., Richtárik, P., Suresh, A. T., and Bacon, D. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.

Lai, K. A., Rao, A. B., and Vempala, S. Agnostic estimation of mean and covariance. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pp. 665–674. IEEE, 2016.

Lamport, L., Shostak, R., and Pease, M. The Byzantine generals problem. ACM Transactions on Programming Languages and Systems (TOPLAS), 4(3):382–401, 1982.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Lee, J. D., Lin, Q., Ma, T., and Yang, T. Distributed stochastic variance reduced gradient methods and a lower bound for communication complexity. arXiv preprint arXiv:1507.07595, 2015.

Lerasle, M. and Oliveira, R. I. Robust empirical mean estimators. arXiv preprint arXiv:1112.3914, 2011.

Lugosi, G. and Mendelson, S. Risk minimization by median-of-means tournaments. arXiv preprint arXiv:1608.00757, 2016.

Lugosi, G. and Mendelson, S. Sub-Gaussian estimators of the mean of a random vector. arXiv preprint arXiv:1702.00482, 2017.

McMahan, B. and Ramage, D. Federated learning: Collaborative machine learning without centralized training data. https://research.googleblog.com/2017/04/federated-learning-collaborative.html, 2017.

McMahan, H. B., Moore, E., Ramage, D., Hampson, S., et al. Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629, 2016.
Minsker, S. and Strawn, N. Distributed statistical estimation and rates of convergence in normal approximation. arXiv preprint arXiv:1704.02658, 2017.

Minsker, S. et al. Geometric median and robust estimation in Banach spaces. Bernoulli, 21(4):2308–2335, 2015.

Nemirovskii, A., Yudin, D. B., and Dawson, E. R. Problem complexity and method efficiency in optimization. 1983.

Reddi, S. J., Konečnỳ, J., Richtárik, P., Póczós, B., and Smola, A. AIDE: Fast and communication efficient distributed optimization. arXiv preprint arXiv:1608.06879, 2016.

Rosenblatt, J. D. and Nadler, B. On the optimality of averaging in distributed statistical learning. Information and Inference: A Journal of the IMA, 5(4):379–404, 2016.

Shamir, O., Srebro, N., and Zhang, T. Communication-efficient distributed optimization using an approximate Newton-type method. In International Conference on Machine Learning, pp. 1000–1008, 2014.

Su, L. and Vaidya, N. H. Fault-tolerant multi-agent optimization: optimal iterative distributed algorithms. In Proceedings of the 2016 ACM Symposium on Principles of Distributed Computing, pp. 425–434. ACM, 2016a.

Su, L. and Vaidya, N. H. Non-Bayesian learning in the presence of Byzantine agents. In International Symposium on Distributed Computing, pp. 414–427. Springer, 2016b.

Wang, S., Roosta-Khorasani, F., Xu, P., and Mahoney, M. W. GIANT: Globally improved approximate Newton method for distributed optimization. arXiv preprint arXiv:1709.03528, 2017.

Yin, D., Pananjady, A., Lam, M., Papailiopoulos, D., Ramchandran, K., and Bartlett, P. Gradient diversity: a key ingredient for scalable distributed learning. arXiv preprint arXiv:1706.05699, 2017.

Zhang, Y. and Lin, X. DiSCO: Distributed optimization for self-concordant empirical loss. In International Conference on Machine Learning, pp. 362–370, 2015.

Zhang, Y., Wainwright, M. J., and Duchi, J. C. Communication-efficient algorithms for statistical optimization. In Advances in Neural Information Processing Systems, pp. 1502–1510, 2012.

Zhang, Y., Duchi, J., and Wainwright, M. Divide and conquer kernel ridge regression: A distributed algorithm with minimax optimal rates. The Journal of Machine Learning Research, 16(1):3299–3340, 2015.
