the parameter, measured in an appropriate metric, should decrease when the amount of data nm becomes larger and the fraction of Byzantine machines α becomes smaller. In fact, we can show that, at least for strongly convex problems, no algorithm can achieve an error lower than

\[
\widetilde{\Omega}\Big( \frac{\alpha}{\sqrt{n}} + \frac{1}{\sqrt{nm}} \Big) \;=\; \widetilde{\Omega}\Big( \frac{1}{\sqrt{n}} \Big( \alpha + \frac{1}{\sqrt{m}} \Big) \Big),
\]

regardless of communication costs; see Observation 1 in Section 6. Intuitively, the above error rate is the optimal rate that one should target, as 1/√n is the effective standard deviation for each machine with n data points, α is the bias effect of the Byzantine machines, and 1/√m is the averaging effect of the m normal machines. When there are no or few Byzantine machines, we see the usual scaling 1/√(nm) with the total number of data points; when some machines are Byzantine, their influence remains bounded and, moreover, is proportional to α. If an algorithm is guaranteed to attain this bound, we are assured that we do not sacrifice the quality of learning when trying to guard against Byzantine failures—we pay a price that is unavoidable, but otherwise achieve the best possible statistical accuracy in the presence of Byzantine failures.

Another important consideration for us is communication efficiency. As communication between machines is costly, one cannot simply send all data to the master machine. This constraint precludes direct application of standard robust learning algorithms (such as M-estimators (Huber, 2011)), which assume access to all data. Instead, a desirable algorithm should involve a small number of communication rounds as well as a small amount of data communicated per round. We consider a setting where in each round a worker or master machine can only communicate a vector of size O(d), where d is the dimension of the parameter to be learned. In this case, the total communication cost is proportional to the number of communication rounds.

Our Contributions. We propose two robust distributed gradient descent (GD) algorithms, one based on coordinate-wise median and the other on coordinate-wise trimmed mean. We establish their statistical error rates for strongly convex, non-strongly convex, and non-convex population loss functions. In particular, for strongly convex losses, we show that these algorithms achieve order-optimal statistical rates under mild conditions. We further propose a median-based robust algorithm that only requires one communication round, and show that it also achieves the optimal rate for strongly convex quadratic losses. The statistical error rates of these three algorithms are summarized as follows.

• Median-based GD: Õ(α/√n + 1/√(nm) + 1/n), order-optimal for strongly convex loss if n ≳ m.
• Trimmed-mean-based GD: Õ(α/√n + 1/√(nm)), order-optimal for strongly convex loss.
• Median-based one-round algorithm: Õ(α/√n + 1/√(nm) + 1/n), order-optimal for strongly convex quadratic loss if n ≳ m.

A major technical challenge in our statistical setting is the following: the nm data points are sampled once and fixed, and each worker machine has access to the same set of data throughout the learning process. This creates complicated probabilistic dependencies across the iterations of the algorithms. Worse yet, the Byzantine machines may create further unspecified dependencies. We overcome this difficulty by proving certain uniform bounds via careful covering arguments. Furthermore, for the analysis of the median-based algorithms, we cannot simply adapt standard techniques (such as those in Minsker et al. (2015)), which can only show that the output of the master machine is as accurate as that of a single normal machine, leading to a sub-optimal O(1/√n) rate even without Byzantine failures. Instead, we make use of a more delicate argument based on normal approximation and Berry–Esseen-type inequalities.
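To make the aggregation step of the two robust GD algorithms concrete, the following is a minimal NumPy sketch of the two coordinate-wise rules applied to the gradients reported by the m workers. It is only an illustration of the aggregation operators; the function names and the (m, d) array layout are our own choices, not taken from the paper or its code.

    import numpy as np

    def coordinatewise_median(grads):
        # grads: (m, d) array, one reported gradient per worker machine;
        # Byzantine machines may report arbitrary vectors.
        return np.median(grads, axis=0)

    def coordinatewise_trimmed_mean(grads, beta):
        # For each coordinate, drop the largest and smallest beta-fraction
        # of the m reported values and average the rest (requires beta >= alpha).
        m = grads.shape[0]
        k = int(np.floor(beta * m))
        sorted_grads = np.sort(grads, axis=0)  # sorts each coordinate separately
        kept = sorted_grads[k:m - k] if k > 0 else sorted_grads
        return kept.mean(axis=0)

In each round, the master would then update the parameter with the aggregated vector, w ← w − η g, in place of the ordinary average of the worker gradients.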
the gradient skewness is bounded by a universal constant S regardless of the size of the parameter space.

We now state our main technical results on the median-based algorithm, namely statistical error guarantees for strongly convex, non-strongly convex, and smooth non-convex population loss functions F.

Strongly Convex Losses: We first consider the case where the population loss function F(·) is strongly convex. Note that we do not require strong convexity of the individual loss function f(·; z).

Theorem 1. Consider Option I in Algorithm 1. Suppose that Assumptions 1, 2, 3, and 4 hold, F(·) is λ_F-strongly convex, and the fraction α of Byzantine machines satisfies

\[
\alpha + \sqrt{\frac{d \log(1 + nm\hat{L}D)}{m(1-\alpha)}} + 0.4748\,\frac{S}{\sqrt{n}} \;\le\; \frac{1}{2} - \epsilon \qquad (2)
\]

for some ε > 0. Choose step-size η = 1/L_F. Then, with probability at least 1 − 4d/(1 + nm\hat{L}D)^d, after T parallel iterations, we have

\[
\|w^T - w^*\|_2 \;\le\; \Big(1 - \frac{\lambda_F}{L_F + \lambda_F}\Big)^{T} \|w^0 - w^*\|_2 + \frac{2}{\lambda_F}\,\Delta,
\]

where

\[
\Delta := O\Big( C_\epsilon V \Big( \frac{\alpha}{\sqrt{n}} + \sqrt{\frac{d \log(nm\hat{L}D)}{nm}} \Big) + \frac{S}{n} \Big), \qquad (3)
\]

and C_ε is defined as

\[
C_\epsilon := \sqrt{2\pi}\, \exp\Big( \frac{1}{2} \big( \Phi^{-1}(1-\epsilon) \big)^2 \Big), \qquad (4)
\]

with Φ^{-1}(·) being the inverse of the cumulative distribution function of the standard Gaussian distribution Φ(·).

We prove Theorem 1 in Appendix B. In (3), we hide universal constants and a higher order term that scales as 1/(nm), and the factor C_ε is a function of ε; as a concrete example, C_ε ≈ 4 when ε = 1/6. Theorem 1, together with the inequality log(1 − x) ≤ −x, guarantees that after running T ≥ ((L_F + λ_F)/λ_F) log(λ_F‖w^0 − w^*‖_2/(2∆)) parallel iterations, with high probability we can obtain a solution ŵ = w^T with error ‖ŵ − w^*‖_2 ≤ (4/λ_F)∆.

Here we achieve an error rate (defined as the distance between ŵ and the optimal solution w^*) of the form Õ(α/√n + 1/√(nm) + 1/n). In Section 6, we provide a lower bound showing that the error rate of any algorithm is Ω̃(α/√n + 1/√(nm)). Therefore the first two terms in the upper bound cannot be improved. The third term 1/n is due to the dependence of the median on the skewness of the gradients. When each worker machine has a sufficient amount of data, more specifically n ≳ m, we achieve an order-optimal error rate up to logarithmic factors.

Non-strongly Convex Losses: We next consider the case where the population risk function F(·) is convex, but not necessarily strongly convex. In this case, we need a mild technical assumption on the size of the parameter space W.

Assumption 5 (Size of W). The parameter space W contains the following ℓ_2 ball centered at w^*: {w ∈ R^d : ‖w − w^*‖_2 ≤ 2‖w^0 − w^*‖_2}.

This assumption (and Assumption 6 below) ensures that the iterates w^t always stay in W without projection. Doing so streamlines our analysis, as our main focus is on robustness. We then have the following result on the convergence rate in terms of the value of the population risk function.

Theorem 2. Consider Option I in Algorithm 1. Suppose that Assumptions 1, 2, 3, 4, and 5 hold, the population loss F(·) is convex, and α satisfies (2) for some ε > 0. Define ∆ as in (3), and choose step-size η = 1/L_F. Then, with probability at least 1 − 4d/(1 + nm\hat{L}D)^d, after T = L_F‖w^0 − w^*‖_2/∆ parallel iterations, we have

\[
F(w^T) - F(w^*) \;\le\; 16\,\|w^0 - w^*\|_2\,\Delta + \frac{1}{2 L_F}\,\Delta^2.
\]

We prove Theorem 2 in Appendix C. We observe that the error rate, defined as the excess risk F(w^T) − F(w^*), again has the form Õ(α/√n + 1/√(nm) + 1/n).

Non-convex Losses: When F(·) is non-convex but smooth, we need a somewhat different technical assumption on the size of W.

Assumption 6 (Size of W). Suppose that ∀ w ∈ W, ‖∇F(w)‖_2 ≤ M. We assume that W contains the ℓ_2 ball {w ∈ R^d : ‖w − w^0‖_2 ≤ (2/∆^2)(M + ∆)(F(w^0) − F(w^*))}, where ∆ is defined as in (3).

We have the following guarantees on the rate of convergence to a critical point of the population loss F(·).

Theorem 3. Consider Option I in Algorithm 1. Suppose that Assumptions 1, 3, 4, and 6 hold, and α satisfies (2) for some ε > 0. Define ∆ as in (3), and choose step-size η = 1/L_F. With probability at least 1 − 4d/(1 + nm\hat{L}D)^d, after T = (2L_F/∆^2)(F(w^0) − F(w^*)) parallel iterations, we have

\[
\min_{t=0,1,\ldots,T} \|\nabla F(w^t)\|_2 \;\le\; \sqrt{2}\,\Delta.
\]

We prove Theorem 3 in Appendix D. We again obtain an Õ(α/√n + 1/√(nm) + 1/n) error rate in terms of the gap to a critical point of F(w).

4.2. Trimmed-mean-based Gradient Descent

We next analyze the robust distributed gradient descent algorithm based on coordinate-wise trimmed mean, namely Option II in Algorithm 1. Here we need stronger assumptions on the tail behavior of the partial derivatives of the loss functions—in particular, sub-exponentiality.

Assumption 7 (Sub-exponential gradients). We assume that for all k ∈ [d] and w ∈ W, the partial derivative of f(w; z) with respect to the k-th coordinate of w, ∂_k f(w; z), is v-sub-exponential.
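For reference, one standard form of this definition (the paper's formal definition may normalize the constants differently): a random variable X with mean μ is called v-sub-exponential if

\[
\mathbb{E}\big[ e^{\lambda (X - \mu)} \big] \;\le\; e^{v^2 \lambda^2 / 2} \qquad \text{for all } |\lambda| \le \tfrac{1}{v}.
\]

In particular, such a bound on the moment generating function controls all moments of X − μ, which is the property used next.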
The sub-exponential property implies that all the moments of the derivatives are bounded. This is a stronger assumption than the bounded absolute skewness (hence bounded third moments) required by the median-based GD algorithm. We use the same example as in Proposition 1 and show that the derivatives of the loss are indeed sub-exponential.

Proposition 2. Consider the regression problem in Proposition 1. For all k ∈ [d] and w ∈ W, the partial derivative ∂_k f(w; z) is √(σ² + ‖w − w^*‖_2²)-sub-exponential.

Proposition 2 is proved in Appendix A.3. We now proceed to establish the statistical guarantees of the trimmed-mean-based algorithm for different loss function classes. When the population loss F(·) is convex, we again assume that the minimizer of F(·) in W is also its minimizer in R^d. The next three theorems are analogues of Theorems 1–3 for the median-based GD algorithm.

Strongly Convex Losses: We have the following result.

Theorem 4. Consider Option II in Algorithm 1. Suppose that Assumptions 1, 2, and 7 hold, F(·) is λ_F-strongly convex, and α ≤ β ≤ 1/2 − ε for some ε > 0. Choose step-size η = 1/L_F. Then, with probability at least 1 − 4d/(1 + nm\hat{L}D)^d, after T parallel iterations, we have

\[
\|w^T - w^*\|_2 \;\le\; \Big(1 - \frac{\lambda_F}{L_F + \lambda_F}\Big)^{T} \|w^0 - w^*\|_2 + \frac{2}{\lambda_F}\,\Delta',
\]

where

\[
\Delta' := O\Big( v d \Big( \frac{\beta}{\sqrt{n}} + \frac{1}{\sqrt{nm}} \Big) \sqrt{\log(nm\hat{L}D)} \Big). \qquad (5)
\]

We prove Theorem 4 in Appendix E. In (5), we hide universal constants and higher order terms that scale as β/n or 1/(nm). By running T ≥ ((L_F + λ_F)/λ_F) log(λ_F‖w^0 − w^*‖_2/(2∆')) parallel iterations, we can obtain a solution ŵ = w^T satisfying ‖ŵ − w^*‖_2 ≤ Õ(β/√n + 1/√(nm)). Note that one needs to choose the parameter for the trimmed mean to satisfy β ≥ α. If we set β = cα for some universal constant c ≥ 1, we can achieve an order-optimal error rate Õ(α/√n + 1/√(nm)).

Non-convex Losses: In this case, imposing a version of Assumption 6 on the size of W, we have the following.

Theorem 6. Consider Option II in Algorithm 1, and define ∆' as in (5). Suppose that Assumptions 1 and 7 hold, Assumption 6 holds with ∆ replaced by ∆', and α ≤ β ≤ 1/2 − ε for some ε > 0. Choose step-size η = 1/L_F. Then, with probability at least 1 − 4d/(1 + nm\hat{L}D)^d, after T = (2L_F/∆'^2)(F(w^0) − F(w^*)) parallel iterations, we have

\[
\min_{t=0,1,\ldots,T} \|\nabla F(w^t)\|_2 \;\le\; \sqrt{2}\,\Delta'.
\]

The proof of Theorem 6 is similar to that of Theorem 3; see Remark 1 in Appendix E. By choosing β = cα with c ≥ 1, we again achieve the statistical rate Õ(α/√n + 1/√(nm)).

4.3. Comparisons

We compare the performance guarantees of the above two robust distributed GD algorithms. The trimmed-mean-based algorithm achieves the statistical error rate Õ(α/√n + 1/√(nm)), which is order-optimal for strongly convex loss. In comparison, the rate of the median-based algorithm is Õ(α/√n + 1/√(nm) + 1/n), which has an additional 1/n term and is only optimal when n ≳ m. In particular, the trimmed-mean-based algorithm has better rates when each worker machine has a small local sample size—the rates are meaningful even in the extreme case n = O(1). On the other hand, the median-based algorithm requires milder tail/moment assumptions on the loss derivatives (bounded skewness) than its trimmed-mean counterpart (sub-exponentiality). Finally, the trimmed-mean operation requires an additional parameter β, which can be any upper bound on the fraction α of Byzantine machines, in order to guarantee robustness. Using an overly large β may lead to a looser bound and sub-optimal performance. In contrast, median-based GD does not require knowledge of α. We summarize these observations in Table 1. We see that the two algorithms are complementary to each other, and our experimental results corroborate this point.
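To see when the additional 1/n term of the median-based rate matters, the following back-of-the-envelope computation evaluates the three terms for two arbitrarily chosen configurations (constants and logarithmic factors are ignored; the numbers are purely illustrative).

    # Illustrative values only; constants and log factors are ignored.
    for n, m, alpha in [(25, 100, 0.1), (400, 100, 0.1)]:
        t1 = alpha / n ** 0.5      # alpha / sqrt(n)
        t2 = 1 / (n * m) ** 0.5    # 1 / sqrt(nm)
        t3 = 1 / n                 # extra term of the median-based rate
        print(f"n={n}, m={m}: {t1:.4f} {t2:.4f} {t3:.4f}")

With n = 25 < m, the 1/n term (0.04) dominates the other two (0.02 each); with n = 400 ≥ m, it is dominated by α/√n = 0.005. This matches the n ≳ m condition under which the median-based rate is order-optimal.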
5. Robust One-round Algorithm

A natural candidate is the so-called one-round algorithm. Previous work has considered a standard one-round scheme where each local machine computes the empirical risk minimizer (ERM) using its local data and the master machine receives all workers' ERMs and computes their average (Zhang et al., 2012). Clearly, a single Byzantine machine can arbitrarily skew the output of this algorithm. We instead consider a Byzantine-robust one-round algorithm. As detailed in Algorithm 2, we employ the coordinate-wise median operation to aggregate all the ERMs.

Algorithm 2 Robust One-round Algorithm
for i ∈ [m] in parallel do
    Worker machine i: compute and send to the master machine:
        ŵ^i ← arg min_{w ∈ W} F_i(w)    (normal machines)
        ŵ^i ← ∗                          (Byzantine machines)
end for
Master machine: compute ŵ ← med{ŵ^i : i ∈ [m]}.

Our main result is a characterization of the error rate of Algorithm 2 in the presence of Byzantine failures. We are only able to establish such a guarantee when the loss functions are quadratic and W = R^d. However, one can implement this algorithm in problems with other loss functions.

Definition 9 (Quadratic loss function). The loss function f(w; z) is quadratic if it can be written as

\[
f(w; z) = \frac{1}{2} w^\top H w + p^\top w + c,
\]

where z = (H, p, c), and H, p, and c are drawn from the distributions D_H, D_p, and D_c, respectively.

Denote by H_F, p_F, and c_F the expectations of H, p, and c, respectively. Thus the population risk function takes the form F(w) = (1/2) w^⊤ H_F w + p_F^⊤ w + c_F.

We need a technical assumption which guarantees that each normal worker machine has a unique ERM.

Assumption 8 (Strong convexity of F_i). With probability 1, the empirical risk function F_i(·) on each normal machine is strongly convex.

Note that this assumption is imposed on F_i(w), rather than on the individual loss f(w; z) associated with a single data point. This assumption is satisfied, for example, when all f(·; z)'s are strongly convex, or in linear regression problems with the features x drawn from some continuous distribution (e.g., isotropic Gaussian) and n ≥ d. We have the following guarantee for the robust one-round algorithm.

Theorem 7. Suppose that ∀ z ∈ Z, the loss function f(·; z) is convex and quadratic, F(·) is λ_F-strongly convex, and Assumption 8 holds. Assume that α satisfies

\[
\alpha + \sqrt{\frac{\log(nmd)}{2m(1-\alpha)}} + \frac{\widetilde{C}}{\sqrt{n}} \;\le\; \frac{1}{2} - \epsilon
\]

for some ε > 0, where C̃ is a quantity that depends on D_H, D_p, λ_F and is monotonically decreasing in n. Then, with probability at least 1 − 4/(nm), the output ŵ of the robust one-round algorithm satisfies

\[
\|\hat{w} - w^*\|_2 \;\le\; \frac{C_\epsilon}{\sqrt{n}}\,\widetilde{\sigma}\,\Big( \alpha + \sqrt{\frac{\log(nmd)}{2m(1-\alpha)}} \Big) + \frac{\widetilde{C}}{\sqrt{n}},
\]

where C_ε is defined as in (4) and

\[
\widetilde{\sigma}^2 := \mathbb{E}\,\big\| H_F^{-1}\big( (H - H_F) H_F^{-1} p_F - (p - p_F) \big) \big\|_2^2,
\]

with H and p drawn from D_H and D_p, respectively.

We prove Theorem 7 and provide an explicit expression of C̃ in Appendix F. In terms of the dependence on α, n, and m, the robust one-round algorithm achieves the same error rate as the robust gradient descent algorithm based on coordinate-wise median, i.e., Õ(α/√n + 1/√(nm) + 1/n), for quadratic problems. Again, this rate is optimal when n ≳ m. Therefore, at least for quadratic loss functions, the robust one-round algorithm has similar theoretical performance as the robust gradient descent algorithm, with significantly less communication cost. Our experiments show that the one-round algorithm has good empirical performance for other losses as well.

6. Lower Bound

In this section, we provide a lower bound on the error rate for strongly convex losses, which implies that the α/√n + 1/√(nm) term is unimprovable. This lower bound is derived using a mean estimation problem, and is an extension of the lower bounds in the robust mean estimation literature such as Chen et al. (2015); Lai et al. (2016).

We consider the problem of estimating the mean µ of some random variable z ∼ D, which is equivalent to solving the following minimization problem:

\[
\mu = \arg\min_{w \in W} \; \mathbb{E}_{z \sim D}\big[ \|w - z\|_2^2 \big]. \qquad (6)
\]

Note that this is a special case of the general learning problem (1). We consider the same distributed setting as in Section 4, with a minor technical difference regarding the Byzantine machines. We assume that each of the m worker machines is Byzantine with probability α, independently of each other. The parameter α is therefore the expected fraction of Byzantine machines. In this setting we have the lower bound in Observation 1. In Appendix G, we also discuss how we can translate this average-case bound to a lower bound that holds under the setting of our main theorems, that is, where an unknown set of αm Byzantine machines is selected without any assumption.

Observation 1. Consider the distributed mean estimation problem in (6) with Byzantine failure probability α, and suppose that z is drawn from a Gaussian distribution with mean µ and covariance matrix σ²I (σ = O(1)). Then, any algorithm
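For the mean estimation problem (6), the local ERM of each normal machine is just its sample mean, so Algorithm 2 reduces to a coordinate-wise median of per-machine sample means. The following is a minimal simulation of this setting under our own arbitrary choices of the parameters and of the Byzantine behavior; it is an illustration of the aggregation step, not the construction used in the lower-bound proof.

    import numpy as np

    rng = np.random.default_rng(0)
    m, n, d = 100, 50, 10
    mu = np.zeros(d)

    # Each machine holds n i.i.d. Gaussian samples; its local ERM is its sample mean.
    local_means = rng.normal(mu, 1.0, size=(m, n, d)).mean(axis=1)

    # A 10% fraction of machines behaves adversarially and reports arbitrary vectors.
    local_means[:10] = 100.0

    naive = local_means.mean(axis=0)         # standard one-round averaging
    robust = np.median(local_means, axis=0)  # aggregation step of Algorithm 2

    print(np.linalg.norm(naive - mu), np.linalg.norm(robust - mu))

The plain average is dragged toward the adversarial value, while the coordinate-wise median stays close to µ.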
We train a multi-class logistic regression model and a convolutional neural network (CNN) using distributed gradient descent, and for each model, we compare the test accuracies.

Figure 1. Test error vs. the number of parallel iterations (panel shown: logistic regression; curves: α = 0, mean and α = 0.05, mean; axes: test error vs. parallel iterations). For logistic regression, we set m = 40, and for trimmed mean, we choose β = 0.05; for CNN, we set m = 10, and for trimmed mean, we choose β = 0.1.

Table 2. Test accuracy on the logistic regression model using gradient descent. We set m = 40, and for trimmed mean, we choose β = 0.05.

    α            0       0.1
    Algorithm    mean    mean    median    trimmed mean
    Acc. (%)     94.3    77.3    87.4      90.7
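The qualitative pattern in Table 2 can be reproduced in miniature: on a toy least-squares problem with a couple of simulated Byzantine workers, mean aggregation degrades while median and trimmed mean remain accurate. The snippet below is only a schematic illustration with synthetic data and an arbitrary attack, not the paper's experimental setup.

    import numpy as np

    rng = np.random.default_rng(1)
    m, n, d, eta, T = 20, 50, 5, 0.1, 200
    w_star = rng.normal(size=d)

    # Each worker holds its own fixed least-squares dataset.
    X = rng.normal(size=(m, n, d))
    y = np.einsum('mnd,d->mn', X, w_star) + 0.1 * rng.normal(size=(m, n))
    byz = np.zeros(m, dtype=bool)
    byz[:2] = True                      # 2 of 20 workers (10%) are Byzantine

    def aggregate(grads, rule, beta=0.2):
        if rule == "mean":
            return grads.mean(axis=0)
        if rule == "median":
            return np.median(grads, axis=0)
        k = int(beta * m)               # coordinate-wise trimmed mean
        s = np.sort(grads, axis=0)
        return s[k:m - k].mean(axis=0)

    for rule in ["mean", "median", "trimmed"]:
        w = np.zeros(d)
        for _ in range(T):
            resid = np.einsum('mnd,d->mn', X, w) - y
            grads = np.einsum('mnd,mn->md', X, resid) / n   # per-worker gradients
            grads[byz] = -10.0                              # Byzantine reports
            w -= eta * aggregate(grads, rule)
        print(rule, np.linalg.norm(w - w_star))

Under this attack, the mean-aggregated iterates drift far from w*, while the median- and trimmed-mean-aggregated iterates converge close to it, mirroring the gap between the columns of Table 2.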
References

Blanchard, P., Mhamdi, E. M. E., Guerraoui, R., and Stainer, J. Byzantine-tolerant machine learning. arXiv preprint arXiv:1703.02757, 2017.

Bubeck, S., Cesa-Bianchi, N., and Lugosi, G. Bandits with heavy tail. IEEE Transactions on Information Theory, 59(11):7711–7717, 2013.

Charikar, M., Steinhardt, J., and Valiant, G. Learning from untrusted data. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 47–60. ACM, 2017.

Chen, M., Gao, C., and Ren, Z. Robust covariance matrix estimation via matrix depth. arXiv preprint arXiv:1506.00691, 2015.

Chen, Y., Su, L., and Xu, J. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. arXiv preprint arXiv:1705.05491, 2017.

Diakonikolas, I., Kamath, G., Kane, D. M., Li, J., Moitra, A., and Stewart, A. Robust estimators in high dimensions without the computational intractability. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pp. 655–664. IEEE, 2016.

Feng, J., Xu, H., and Mannor, S. Distributed robust learning. arXiv preprint arXiv:1409.5937, 2014.

Hsu, D. and Sabato, S. Loss minimization and parameter estimation with heavy tails. The Journal of Machine Learning Research, 17(1):543–582, 2016.

Lamport, L., Shostak, R., and Pease, M. The Byzantine generals problem. ACM Transactions on Programming Languages and Systems (TOPLAS), 4(3):382–401, 1982.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Lee, J. D., Lin, Q., Ma, T., and Yang, T. Distributed stochastic variance reduced gradient methods and a lower bound for communication complexity. arXiv preprint arXiv:1507.07595, 2015.

Lerasle, M. and Oliveira, R. I. Robust empirical mean estimators. arXiv preprint arXiv:1112.3914, 2011.

Lugosi, G. and Mendelson, S. Risk minimization by median-of-means tournaments. arXiv preprint arXiv:1608.00757, 2016.

Lugosi, G. and Mendelson, S. Sub-Gaussian estimators of the mean of a random vector. arXiv preprint arXiv:1702.00482, 2017.

McMahan, B. and Ramage, D. Federated learning: Collaborative machine learning without centralized training data. https://research.googleblog.com/2017/04/federated-learning-collaborative.html, 2017.

McMahan, H. B., Moore, E., Ramage, D., Hampson, S., et al. Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629, 2016.
Minsker, S. and Strawn, N. Distributed statistical estimation and rates of convergence in normal approximation. arXiv preprint arXiv:1704.02658, 2017.

Minsker, S. et al. Geometric median and robust estimation in Banach spaces. Bernoulli, 21(4):2308–2335, 2015.

Nemirovskii, A., Yudin, D. B., and Dawson, E. R. Problem complexity and method efficiency in optimization. 1983.

Reddi, S. J., Konečný, J., Richtárik, P., Póczós, B., and Smola, A. AIDE: Fast and communication efficient distributed optimization. arXiv preprint arXiv:1608.06879, 2016.

Rosenblatt, J. D. and Nadler, B. On the optimality of averaging in distributed statistical learning. Information and Inference: A Journal of the IMA, 5(4):379–404, 2016.

Shamir, O., Srebro, N., and Zhang, T. Communication-efficient distributed optimization using an approximate Newton-type method. In International Conference on Machine Learning, pp. 1000–1008, 2014.

Su, L. and Vaidya, N. H. Fault-tolerant multi-agent optimization: Optimal iterative distributed algorithms. In Proceedings of the 2016 ACM Symposium on Principles of Distributed Computing, pp. 425–434. ACM, 2016a.

Su, L. and Vaidya, N. H. Non-Bayesian learning in the presence of Byzantine agents. In International Symposium on Distributed Computing, pp. 414–427. Springer, 2016b.

Wang, S., Roosta-Khorasani, F., Xu, P., and Mahoney, M. W. GIANT: Globally improved approximate Newton method for distributed optimization. arXiv preprint arXiv:1709.03528, 2017.

Yin, D., Pananjady, A., Lam, M., Papailiopoulos, D., Ramchandran, K., and Bartlett, P. Gradient diversity: A key ingredient for scalable distributed learning. arXiv preprint arXiv:1706.05699, 2017.

Zhang, Y. and Lin, X. DiSCO: Distributed optimization for self-concordant empirical loss. In International Conference on Machine Learning, pp. 362–370, 2015.

Zhang, Y., Wainwright, M. J., and Duchi, J. C. Communication-efficient algorithms for statistical optimization. In Advances in Neural Information Processing Systems, pp. 1502–1510, 2012.

Zhang, Y., Duchi, J., and Wainwright, M. Divide and conquer kernel ridge regression: A distributed algorithm with minimax optimal rates. The Journal of Machine Learning Research, 16(1):3299–3340, 2015.