Angle A New Larg-Scale Machine Learning System

National Science Review
RESEARCH ARTICLE 5: 216–236, 2018

doi: 10.1093/nsr/nwx018
Advance access publication 24 February 2017
INFORMATION SCIENCE
Angel: a new large-scale machine learning system

Jie Jiang1,2,† , Lele Yu1,† , Jiawei Jiang1 , Yuhong Liu2 and Bin Cui1,∗
ABSTRACT
Downloaded from https://academic.oup.com/nsr/article/5/2/216/3052720 by guest on 07 December 2021

Machine Learning (ML) techniques now are ubiquitous tools to extract structural information from data
collections. With the increasing volume of data, large-scale ML applications require an efficient
implementation to accelerate the performance. Existing systems parallelize algorithms through either data
parallelism or model parallelism. But data parallelism cannot obtain good statistical efficiency due to the
conflicting updates to parameters while the performance is damaged by global barriers in model parallel
methods. In this paper, we propose a new system, named Angel, to facilitate the development of large-scale
ML applications in production environment. By allowing concurrent updates to model across different
groups and scheduling the updates in each group, Angel can achieve a good balance between hardware
efficiency and statistical efficiency. Besides, Angel reduces the network latency by overlapping the
parameter pulling and update computing and also utilizes the sparseness of data to avoid the pulling of
unnecessary parameters. We also enhance the usability of Angel by providing a set of efficient tools to
integrate with application pipelines and provisioning efficient fault tolerance mechanisms. We conduct
extensive experiments to demonstrate the superiority of Angel.
Keywords: machine learning, distributed system, algorithms, big data analytics
INTRODUCTION (i) Integration with existing ecosystems: To inte-

1 Key Lab of High
Machine Learning (ML) has became an in- grate easily with existing ecosystems, ideally,
Confidence Software dispensable workload to support in enterprise our users wanted a system that is written in
Technologies (MOE), environment—by supporting it, the users can make Java and can be deployed easily with HDFS
School of EECS,
predictions for various applications, such as com- and Yarn. In fact, in our experience, it is diffi-
Peking University,
mercial recommendation and online advertisement. cult to deploy systems such as Petuum in their
Beijing 100871, China environment. However, this imposes a ques-
and 2 Data Platform,
In this paper, we focus on building a system to
support distributed ML for industrial users. tion about performance—can we achieve rea-
Tencent Inc., sonable speed with a Java-based distributed ML
Shenzhen 518057, (Use Case) Tencent: our motivation draws from
the real production environment of Tencent, one system instead of C/C++-based systems?
China
of the largest Internet companies in China [1,2]. (ii) One-system-fits-all: existing ML systems either
∗ Corresponding The cluster that we have access to at Tencent con- support data parallelism [3,6,7] or model paral-
tains thousands of machines, and the analytical team lelism [8,9,10]. Because of the diverse range of
author. E-mail:
needs to run a range of ML tasks, from simple applications that we need to support, our users
bin.cui@pku.edu.cn
models such as Logistic Regression (LR), to so- wanted a ‘single’ system to support all these dif-
† Equally contributed
phisticated models such as Latent Dirichlet Allo- ferent types of parallelisms. This requires us to
to this work. cation (LDA), over datasets that are often hun- carefully design the system architecture to ac-
dreds of gigabytes or even terabytes. The existing commodate both type of parallelisms in an ef-
Received 19 ecosystem other than the ML stack is mainly based fective and efficient way.
September 2016; on Java. Motivated by these challenges, we build An-
Revised 20 This environment imposes challenges of directly gel, a distributed ML systems based on Java. An-
December 2016;
applying existing distributed ML systems such as gel is used in the production environment of Ten-
Accepted 30
Spark [3,4] and Petuum [5], which we have tried at cent to run models including LR, Support Vector
December 2017
first. Machine (SVM), LDA, GBDT, KMeans, Matrix

C The Author(s) 2017. Published by Oxford University Press on behalf of China Science Publishing & Media Ltd. All rights reserved. For permissions, please e-mail:
journals.permissions@oup.com
RESEARCH ARTICLE Jiang et al. 217
Factorization (MF) for commercial recommenda- tical efficiency, Angel partitions the model in each
tion, user profiling and various applications. We have group and schedules the updates to the model. Com-
successfully scaled Angel up to hundreds of ma- pared with model parallelism, the hybrid parallel
chines and processing hundreds of gigabytes of data method in Angel allows concurrent updates to the
with billions of model parameters. Compared with same model parameters. The updates made in differ-
existing systems such as Petuum and Spark, Angel is ent groups then are merged to obtain the updated
at least 2×, and sometimes up to 10× faster. version of the model. By allowing a larger degree of
Premier: data parallelism, model parallelism and parallelism, Angel can significantly improve the per-
synchronization. We provide a brief overview of ex- formance of ML applications.
isting distributed ML systems to set the context for Because the parameters in most ML applications
the system design and technical contribution of An- such as LDA and MF can be modeled as matrices,
gel. Angel abstracts them as matrices to facilitate the de-

One technical challenge in distributed ML is velopment of ML applications. The matrix abstrac-
to achieve parallelism. In existing systems, paral- tion allows users to describe operations with linear
lelism is achieved by either data parallelism [3,6,7] algebra. In ML applications such as LR and SVM,
or model parallelism [8,9,10]. By partitioning train- the model parameters are matrices with only one
ing instances and assigning them to different work- row.
ers, data parallel methods can calculate the updates The parameter matrix is partitioned through a
to the model parameters in parallel. But as the model 2D method to generate a moderate size for each
has to be replicated and synchronized among all partition and balance the load of each server. The
workers, data parallel methods incur significant net- pulling of parameters and the pushing of updates
work and memory overheads. Besides, data paral- are fully optimized in Angel to reduce network
lelism damages the statistical efficiency due to the latency and bandwidth consumption. We perform
conflicting updates on parameters [8,11], which the pulling of parameters and the computing of
means it needs more steps until convergence to a updates one row after another. By overlapping the
give tolerance. Different from data parallel methods, pulling and the computing of different rows, we
it is the model but not the data that is partitioned can reduce the performance degradation due to
in model parallel methods. Model parallelism allows network latency. We also utilize data sparseness to
the deployment of the models with billions or tril- reduce the amount of parameters to pull.
lion parameters that cannot be placed in one worker To efficiently integrate Angel with the whole
node. Good statistical efficiency can also be obtained pipeline of online ML applications, we provide a data
by carefully scheduling the updates of parameters. pre-process module that is tailored for ML tasks to
But the overall performance of ML applications may convert the raw data to ready-to-run linear algebra
still be harmed by the computation barriers in the objects. In order to reduce the memory consump-
model parallel methods. tion caused by training data, Angel provides three
The existence of model replicas brings in the de- storage levels for users to choose according to the re-
mand to synchronize model parameters between source usage of the cluster.
different workers. Existing systems [5,6] proposed Angel now is deployed in a production cluster
to employ parameter servers and delayed synchro- with thousands of nodes that is shared by many other
nization protocols to exchange the updates to the applications. To ensure resource isolation, we utilize
parameters. In practice, we observe that the con- Yarn for resource management. We also provision
vergence rate of ML applications is heavily re- efficient mechanisms to provide fault tolerated exe-
lated to the frequency of model synchronization. cution at scale.
Typically, an ML algorithm converges faster with We implement many common ML algorithms
more frequent model synchronization. However, with Angel and conduct empirical experiments to
synchronization operations, including the pulling of study the performance of Angel. The experimental
parameters and the pushing of updates, introduce results demonstrate that Angel can significantly im-
extra network overheads. Hence, an efficient imple- prove the performance of most ML application, ex-
mentation of synchronization operations is needed hibiting much better performance than state-of-the-
to reduce the incurred overheads. art systems.
System design of Angel. To address the afore- Summary of technical contributions. In sum-
mentioned problems, we propose a new distributed mary, this paper makes the following contributions.
system, named Angel, which is deployed in Tencent
Inc. for ultra large-scale ML applications.
Similar to data parallel methods, Angel partitions (i) We design and implement a system Angel for
training data into different groups and replicates the large-scale ML applications in production envi-
model in each group. Further, to improve the statis- ronment.
218 Natl Sci Rev, 2018, Vol. 5, No. 2 RESEARCH ARTICLE
(ii) We propose a hybrid parallel method to allow among different workers. By performing update
the scheduling of updates while avoiding global aggregating and allowing delayed synchronization,
barriers. parameter servers can achieve acceptable perfor-
(iii) We carefully optimize the implementation of mance for big models with billions or trillions of
model synchronization. By deploying pipelined parameters. Petuum is a state-of-the-art distributed
execution and utilizing data sparseness, we ML system, which provides two different modules,
can significantly reduce the overheads due to Bösen and Strads, to parallelize ML algorithms.
model synchronization. Bösen models the parameters as a shared table and
(iv) We enhance the usability of Angel by providing partitions the table horizontally across different
a set of built-in modules and provisioning effi- servers. Different synchronization protocols are
cient fault-tolerance mechanisms. supported in Bösen to reduce the unnecessary
(v) We validate the effectiveness of Angel by con- network overheads. Strads employs a centralized

ducting comprehensive experiments on the im- coordinator to schedule the updates to the model.
plementation of Angel. After the end of each iteration, a global barrier is
invoked to perform the aggregation operations and
Overview. The rest of this paper is organized select the parameters to update in the next iteration.
as follows. We first review the literature in the TensorFlow [18] is a distributed ML system de-
‘Related work’ section. The overview of Angel veloped by Google Inc. It provides the data-flow
system, the hybrid parallelism, the parameter syn- model to simplify the programming of deep learning
chronization architecture and the details of system algorithms and utilize shared variables to support ef-
implementation are presented in the following ficient parallelization of ML algorithms. In the dis-
four sections. The ‘Evaluation’ section presents the tributed environment, it enables synchronous and
comparison between Angel and state-of-the-art ML asynchronous protocols to accelerate the training of
systems. Finally, we conclude our work in the last large datasets. But TensorFlow is designed to uti-
section. lize multiple GPUs to accelerate the computation-
intensive tasks, especially for deep learning ap-
plications. It cannot handle computation graphs
RELATED WORK with billions of nodes, like large-scale Bayesian
ML systems. Distributed systems were proposed graphs. Moreover, TensorFlow lacks the mechanism
in the past few years to help deal with large-scale of delay-bounded synchronization, which can sig-
ML problems. Data-flow systems [3,12], such nificantly improve the performance of distributed
as Hadoop and Spark, are widely adopted to training.
conduct big data analysis in many companies. Singa [19] is a distributed deep learning plat-
They provide easy programming interfaces for form with layer abstraction-based programming
general distributed data processing and can con- model. Singa integrates both synchronous and asyn-
duct data processing through efficient data-flow chronous protocols to enable users to train with
operators, such as map, join and shuffle. However, various distributed training frameworks including
as these systems lack globally shared states, they Sandblaster, AllReduce, Downpour and Distributed
are unsuitable for ML applications where the Hogwild!. However, its layer-based programming
model is shared in the computing of updates. model cannot be employed for general ML algo-
Besides, without explicit interfaces for commu- rithms. Although hybrid parallelism is also sup-
nication and storage, users cannot optimize the ported in Singa, it is utilized to deal with the big
performance with customized scheduling and model problem in large deep neural networks train-
synchronization. ing. Angel can still obtain speedup even when train-
Graph processing systems provide graph- ing with small models through hybrid parallelism.
parallel programming models where the execution Parallelization of learning algorithms. The
is organized as a graph and users can describe training of ML algorithms is essentially sequential
their applications by defining the behaviors of processes where the outputs of the current iteration
vertices [13,14,15]. Though the graph-parallel are fed as input to the next iteration. There exist
model can provide good abstraction for most ML trade-offs between hardware efficiency and statis-
algorithms, the fine-grained dependencies required tical efficiency in parallelizing optimization algo-
in the execution limited the scalability of graph rithms such as Stochastic Gradient Descent (SGD),
processing systems. ADMM and Coordinate Descent (CD) [20].
Systems which adopt parameter servers Various parallelized versions [21,22,23] of these
[5,6,16,17] partition parameters across multi- optimization algorithms have been proposed to
ple servers to perform the parameter exchanging accelerate the training. The parallelization strategies
DESIGN AND OVERVIEW OF ANGEL

Distribute ML has now been widely adopted to solve
problems with Big Data and Big Model in current In-
ternet companies. A general distributed ML system
is necessary to help users develop various ML ap-
plications. Usually, the production environment im-
poses several constraints on the implementation and
execution of an ML system, such as the limited mem-
ory and shared network resources. Therefore, we de-
sign and implement Angel to facilitate the develop-
ment of ML applications facing practical problems.

Figure 1 presents an overview of Angel’s system ar-
chitecture. In the following part of this section, we
introduce the modules of Angel, and how Angel ex-
ecutes ML applications.
Components of angel
Figure 1. System architecture of Angel. There are four types of roles in an application of An-
gel, including Client, Master, Worker and Server.
(i) Client is the entry point for an Angel ap-
can be categorized into two types: data parallelism
plication. It loads the configurations of the
and model parallelism. The data parallel methods
runtime environment and submits the appli-
partition the training data to parallelize the ML
cation to Yarn. Then it will wait until the
algorithms while the model is partitioned in model
application’s execution completes. Users can
parallel methods. Data parallel implementations
define the parameter matrices by specifying
can obtain better hardware efficiency since there is
the block size and the dimension numbers at
no interference between different workers during
Client.
training. Model parallel implementations however
(ii) Master is launched by Yarn to manage the life
are good at statistical efficiency because all the
cycle of each application, including request-
running tasks are coordinated to avoid concurrent
ing resources, launching containers and killing
updates to the same parameter. The study based on a
workers or servers. Besides that, it is also uti-
single NUMA machine [11] has been conducted to
lized to realize different synchronization proto-
find good trade-off point to balance the two aspects.
cols between workers and provide web pages
Synchronization protocols. Parallelizing ML al-
that present the running states of workers and
gorithms brings in the requirement of model syn-
servers to users.
chronization among different model replicas and
(iii) Worker acts as the computation node to ex-
partitions. Various synchronization protocols have
ecute the code written by users. Each worker
been proposed and are applied to coordinate the
reads the data from HDFS and accesses the
running states of different workers. Bulk Synchro-
model parameters on servers to exchange the
nization Parallel (BSP) [24] protocol forces all run-
updates with each other.
ning workers to synchronize their states through a
(iv) Server acts as a distributed storage for param-
global barrier at the end of each epoch. This proto-
eters. Each server stores one or more parti-
col guarantees the correctness of distributed ML al-
tions of model parameters and provides effi-
gorithms [21], and many existing systems [3,25,26]
cient mechanism in response to get the update
adopt it as their synchronization protocol. Delayed
requests from workers.
synchronization protocol [7,27,28] has been pro-
posed to reduce the waiting time wasted in the BSP
protocol. Recently, Asynchronous Synchronization Angel execution
Parallel (ASP) protocol is employed by some sys- An Angel application starts when users submit their
tems [16,6] and algorithms [29,30,31] to speed up compiled code which contains the logics of their task
distributed ML. Due to the error tolerance prop- through Client. Then Yarn will launch Master to take
erty of ML algorithms, these delayed synchroniza- over the running process. It will request resources
tion protocols perform well and usually can obtain from Yarn’s resource manager to allocate contain-
better performance for the problems with large train- ers for servers and workers. The model is partitioned
ing data collections. into different partitions and the meta information of
all partitions is stored at the master. After the initial- the HDFS or pushed directly to the online advertis-
ization stage, each server will fetch the meta infor- ing systems.
mation of the partitions from the master and allocate In summary, Angel is developed to solve large-
memory storage for these partitions. scale ML problems in production environment. It
The training data are stored in HDFS. Master will provides the hybrid parallel method to accelerate the
partition the input data into different splits with bal- convergence speed of distributed ML algorithms.
anced size and send the meta information of splits Besides, efficient parameter synchronization opera-
to workers. Before the running of ML algorithms, tions are implemented to reduce the network over-
Worker conducts the preprocessing phase to con- heads and improve the overall performance. By
vert the raw data into ready-to-run linear algebra ob- automatically conducting the data pre-processing
jects. The output objects will be stored inside each operations, Angel can be easily integrated with the
worker. We will introduce the detailed implementa- whole pipeline of online applications. Angel also

tion of pre-processing and data storage in the ‘system provisions efficient mechanisms to achieve fault tol-
implementation’ section. erated execution at scale.
Before training, Master first organizes the work-
ers into a set of groups. Then, the training data are
partitioned into different splits, and the model is
replicated at each group. Inside each group, a leader HYBRID PARALLELISM
is selected to schedule the updates to model and the For ML applications involving terabytes of data
others will be categorized as slaves. The updates from and billions of parameters, training on a single ma-
different groups are aggregated to obtain a global chine often takes intolerable times. Parallelizing ML
model with the updated version. Hybrid parallelism programs across multiple machines is a practicable
well generalizes existing parallelization strategies. method adopted by many studies [21,23,29,32,33].
Users can execute distributed ML through data par- Existing systems adopt either data parallelism or
allelism by only launching one worker inside each model parallelism to parallelize ML algorithms. Fig-
group. In such cases, the execution is exactly the ure 2a and b describe the architecture of the two
same as those of data parallel methods. If the num- parallel methods. In data parallel methods, the data
ber of groups is set to 1, then the execution resembles are partitioned and the partitions are assigned to
model parallelism. Users should provide scheduling workers. Each worker owns a replica of the model.
and computing functions for leaders and slaves, re- After a worker completes its computation, the up-
spectively. dates made by the worker are pushed into the global
Server acts as a distributed storage for param- shared model. When other workers pull parameters
eters. It provides get and increment interfaces for from the global shared model, these updates will be
workers to exchange parameters. During the train- reflected in the computation. Since the computa-
ing, each worker first pulls the parameters it needs tion performed at different workers are independent,
using the get method. After computing the updates asynchronous synchronization protocols are widely
to the model, the updates are pushed back to the adopted to reduce the network overheads.
servers. Angel carefully optimizes the pulling opera- Different from data parallel methods, model par-
tions, deploying various methods for different ML al- allel methods do not partition the training data. In-
gorithms. For the algorithms like LDA or MF, whose stead, the model is partitioned and each worker is
models have a large amount of rows, the pulling op- responsible for the updates of a portion of param-
eration is overlapped with the computation opera- eters. Since the parameters do not obey the i.i.d.
tion to reduce the network latency. The sparseness properties, a coordinator is needed in model paral-
of training data is also utilized to reduce the amount lel methods to select those parameters that are in-
of parameters to pull. dependent from each other. By carefully scheduling
During the execution, the running states of the updates to the model, model parallel methods
Worker and Server are maintained at Master and can converge with fewer epochs. But the scheduling
presented to users through web pages. Besides, of updates also enforces global barriers in the exe-
Server writes its snapshot into HDFS periodically. cution. Each worker cannot proceed to the next it-
Once a server comes across a failure, the crash will eration when it completes the computation of the
be detected by Master through the heartbeat infor- current iteration. It must wait for the completion of
mation or the report from workers. Then Master will other workers as well as the notification from the co-
allocate a new container to launch a new server and ordinator which informs the parameters to update
recover the state of the crashed one from its latest in the next iteration. Since the number of indepen-
snapshot. At the end of each Angel application, the dent parameters is limited, model parallel methods
parameter values stored at servers will be written into also suffer from the small degree of parallelism.

Figure 2. Various parallelization strategies for distributed ML: (a) data parallelism, (b) model parallelism and (c) hybrid parallelism.
Angel employs hybrid parallelism to avoid the sweep all the training instances in this group and
problems in data parallelism and model parallelism. only store the partition which belongs to itself.
The architecture of hybrid parallelism is presented
in Fig. 2c. Angel first divides workers into multiple
groups. The training data are partitioned and are as- Worker execution
signed to the groups. Inside each group, one worker Inside each worker, multiple user tasks will be
is selected as the leader randomly and other work- launched to execute the user-provided function in
ers are categorized as slaves. The model is replicated different threads. Each user thread reads its data par-
at different worker groups. The leader schedules the tition and conducts the training operations. The user
updates of the model replica inside this group. The thread can access the parameters at servers through
updates generated by different groups are merged a shared module inside each worker, which is pre-
through the global parameter servers. sented in Fig. 3. To eliminate duplicated parameter
pulling, a lock table is maintained at the worker. Each
Data partitioning request first tries to acquire the lock with its matrix
We recognize the training data as a 2D matrix with ID and row index. If the lock has already been ob-
each row as a data instance. Because the methods tained by other thread, then the request thread will
deployed by different ML algorithms to access data be blocked. The row locks allow us to avoid dupli-
vary a lot, Angel provides three different methods to cated requests.
partition the data matrix inside a worker group and The pulling for one row will be splitted into mul-
they are listed as follows. tiple requests and each request fetches one partition
from the server. The pulling requests are executed
(i) Replicated: the data matrix is replicated at each by multiple pulling threads. The pulling threads lo-
slave since the update of each parameter re- cate the server location from the lookup table in-
quires the whole data matrix. For example, to side worker. Once a partition is fetched, it will be in-
update each parameter in Lasso using CD, the serted into a result buffer. An input merging thread
whole data matrix is required. will merge multiple partitions into a complete row
(ii) Horizontal: in ML algorithms such as LDA and once all partitions have been fetched. Then, all user
MF, all fields of an instance are needed in the threads which are waiting for this row will be waked
computation. Hence, the training data are hor- up to perform the computation.
izontally partitioned in these applications so The updates generated by user threads will be
that each slave owns a portion of the data in- inserted into an update buffer. An output merg-
stances. ing thread is running to merge the updates before
(iii) Vertical: using this partitioning method, each pushing. Once the buffer size exceeds a pre-defined
training instance is partitioned and each slave threshold, the updates will be pushed to servers by
owns a vertical partition of all data instances. the pushing threads.
Linear models, such as LR, SVM, can utilize this
partition method to conduct model parallelism.
Users can choose the partitioning method Leader scheduling
through the configuration of the Angel application. The detailed model parallel implementations of dif-
During the read phase, each worker in a group will ferent ML algorithms can vary a lot due to their
leader’s. Each time the leader calls the CLOCK oper-

ation, the iteration number of its slaves will increase
by one.
The execution of different groups however does
not need to be synchronous. Various synchroniza-
tion protocols can be deployed among different
groups, including BSP, SSP and ASP. Before con-
ducting the pull operation, each worker will request
the minimum iteration number of all workers from
the master. Once this gap of iterations has exceeded
a pre-defined value, i.e. staleness, the slave will wait
until the slowest worker catching up.

Figure 3. Execution inside worker.
PARAMETER SYNCHRONIZATION
The replication of model parameters requires param-
eter synchronization among different workers. The
overall performance is heavily affected by the fre-
quency of model synchronization. Typically, more
frequent synchronization can help improve the sta-
tistical efficiency. However, each synchronization
operation, like pulling or pushing, incurs extra over-
heads. Angel optimizes the performance of these two
operations through a 2D partitioning method, merg-
ing the local updates and designing different pulling
Figure 4. The pseudo code for scheduling. methods for different ML algorithms.
distinct scenarios. Angel provides users a set of flex- Partition of parameters

ible interfaces. Users can develop their applications Since ML algorithms typically treat parameters as
by defining the scheduling and updating functions. linear algebra objects, Angel abstracts parameters
Here, we present an example in Fig. 4 that arranges as a matrix. The model parameters of most ML al-
parameter updating in a round robin style. A leader gorithms can be described in matrices, such as the
function is running to schedule the computation of weight vector in LR or the word-topic count ma-
slaves. It controls how many iterations the compu- trix in LDA [34]. Angel partitions the parameter ma-
tation will last. For each iteration, it schedules the trix into rectangle blocks through a 2D partitioning
parameters to each slave rotationally. This rotation- method.
style scheduling guarantees that slaves will never Each rectangle block is the communication unit
touch the same parameter concurrently while each between workers and servers. Users can control the
slave can access the whole parameters at one iter- number of model partitions through setting each
ation. For each scheduling round, leader sends the block’s row and column size. With more model par-
UPDATE command to each slave with its corre- titions, larger degree of parallelism can be achieved
sponding parameter indices. Each slave receives the and the access load on each server is more balanced.
command, pulls the parameters with the indices and However, a larger partition number will bring in
computes updates with its own data partition. The more maintaining overheads when pushing updates
updates will be pushed into the servers through the to servers. Typically, the partition number should be
increment interface. At the end of each round, each larger than the number of servers.
slave returns an ACKNOWLEDGE message to the
leader. The leader proceeds to next computation
round once all the ACKNOWLEDGE messages are Parameter pulling
received. During the training phase, workers would pull pa-
The master maintains the iteration states for all rameters from servers to calculate the gradients.
workers. Inside one worker group, the execution is For the models with billions of parameters, the
coordinated with the BSP protocol. Therefore, the pulling operations will result in massive commu-
iteration number of all slaves is the same as the nication cost. Moreover, as the communication is
synchronous, the computation will be blocked ow- the optimization algorithm needs to pull yi from the
ing to the network latency. We carefully study the server.
structures of model parameters of different ML algo- From the discussion above, we can see that the
rithms and group them into two categories. The first computation unit, such as sampling a token or pass-
type of algorithms owns a model matrix with lots of ing a rating, requires a portion of disjoint parameters
rows, such as MF and LDA, while the other one em- as input. Therefore, we overlap the parameter pulling
ploys a weight vector with billions of elements, like operation and the computation to reduce the per-
LR and SVM. We propose different pulling strategies formance degradation due to network latency. Naive
for both types of parameters, respectively. parameter fetching method, Algorithm 1, will per-
form poorly due to the latency of remote procedure
Pipelining pulling call. Instead, Angel allows each worker to request pa-
To provide an efficient parameter pulling mech- rameters for many computation units and perform

anism, we first analyze the model access method the training function once a parameter is fetched,
through two ML algorithms, LDA and MF. which is presented in Algorithm 2.
Latent Dirichlet allocation. Griffiths and Angel provides an interface to pull multiple rows
Steyvers [35] proposed using Gibbs sampling to from servers. A list of row indices to be pulled is
perform the inference for LDA model. The posterior
distribution P(Z|W) can then be estimated using Algorithm 1: Naive Parameter Pulling
a collapsed Gibbs sampling algorithm, which, in
each iteration, updates each topic assignment initialization;
zd, i ∈ Z by sampling the full conditional posterior while not done do
distribution: Get next token w(d ,i ) = v from iterator;
Acquire row v from server;
n v,k + β Sampling token w(d ,i ) ;
p(zd ,i = k|wd ,i = v) ∝
v n v ,k + Vβ
Update n (d ,k) and n (v,k) ;
× (n d ,k + α), (1)
where k ∈ [1, K] is a topic, v ∈ [1, V] is a word in the Algorithm 2: Pipeline Parameter Pulling
vocabulary, w d, i denotes the ith word in document
initialization;
d and zd, i the topic assignment to w d, i . Matrix nd, k
Send requests for rows to servers;
and nv, k are two-parameter models which are up-
while not done do
dated by the sample operations. When performing
if Pipeline has ready parameters p then
distributed Gibbs sampling, both the training data
Sampling tokens with p;
and the Matrix nd, k are partitioned to workers by
Update n (d ,k) and n (v,k) ;
document. Matrix nd, k is stored among different
workers while matrix nv, k is shared. To perform each else
Wait on the Pipeline
sample operation for token wd, i , we just need to pull
one row of nv, k where v = w d, i . In other words, the
parameters are pulled row by row in LDA.
passed and a blocked result queue is returned imme-
Matrix factorization. Similar results can also be
diately. The worker will randomly split the row in-
concluded for the MF algorithms. To solve the MF
dices into multiple disjointed sets where each one
optimization problem, we sweep over all known rat-
owns a mini batch of indices. The worker sends the
ings and update the parameters through the follow-
pulling requests with one batch at a time and in-
ing equation for each rating:
serts the fetched rows into the result queue. The
user program takes parameter rows from the result
xu ← xu + η r ui − xuT y i y i − λxu ,
queue and conduct the computation. Since multi-

y i ← y i + η r ui − xuT y i xu − λy i , (2) ple tasks will be launched in each worker and differ-
ent tasks would request the same parameter row at
where xu and yi are the feature vectors for user u and one iteration. The worker will eliminate duplicated
item i, rui is the rating value of item i given by user requests from different tasks to reduce the network
u. η is the learning rate while λ is the regularization overheads.
parameter.
Note that the user feature matrix is partitioned Pruning pulling
by user and the item feature matrix is stored at the For ML algorithms whose model consists of one vec-
servers. Therefore, for each observed rating (u, i, r), tor with billions of elements, the parameter pulling
stead of 2D arrays to reduce the memory consump-

tion. For each row in the list, different data are em-
ployed according to the ML algorithms. In the LDA
algorithm, each sampling operation only modifies
two elements of row n(v, k) , while the MF algorithm
will modify the whole yi feature vector for each rat-
ing. Hence, HashMap is used to merge the updates
for each word in LDA while a dense array is adopted
for each item in MF. When the size of this buffer
exceed a pre-defined threshold, Angel will split the
buffer into multiple partitions and compress them
with different storage formats according to their de-

grees of sparsity. Then, these updates are pushed to
Figure 5. Sparse replicas at each worker.
the servers.
message consists of a list of (key,value) pairs for

SYSTEM IMPLEMENTATION
each partition block, where the key is the index of
non-zero elements. Since data are sparse and each Angel is now deployed in Tencent Inc. and applied
training sample contains many zero elements. The to solve problems including online advertising and
corresponding parameters computed with these user analysis. The complex running environment and
zero elements will not be reflected in the results. the various demands from applications rise chal-
Hence, each worker can request a portion of param- lenges for the deployment of Angel. In the following
eters according to its data partition instead of the parts of this section, we present how angel deals with
complete model replica. these practical problems.
As what is presented in Fig. 5, each worker
only needs to pull and push the dimensions which
are non-zero in its training data matrix. In con- Data partitioning
sideration of the large size of model and scarce To avoid straggler workers, we should divide the
network bandwidth, messages compression is data into balanced partitions before training. Gener-
desirable. ally, it requires an extra partitioning phase over the
Meanwhile, training data often remains un- whole training data in order to balance the compu-
changed between iterations. A worker might re- tation load for each worker. However, this will result
quest parameters with the same set of keys mul- in massive communication cost, especially when fac-
tiple times at different iterations. Hence, it is nat- ing large datasets.
ural for the server to cache the key lists. Later, Instead of incurring a shuffle phase to generate a
the worker only sends a tag that can distinguish balanced partitioning results, Angel utilizes the lo-
it from others to server rather than the whole key cation and meta information of data to achieve the
lists. goal. We observe that the computation complex-
ity is proportional to the size of input data for a
majority of ML algorithms. For example, the sam-
Pushing updates pling times equal the number of tokens in LDA,
In Angel, we asynchronously push the local updates while the number of non-zero items in one exam-
to servers. Each worker maintains an update buffer ple in LR means the number of math calculation for
for all the tasks running on it and a separate thread the gradient computation. Therefore, we can balance
merges the generated updates. The push operation the computation work of each worker through bal-
is conducted automatically once the buffer size ex- ancing the size of data read by each worker. Usu-
ceeds a pre-defined threshold. ally, in a production cluster environment, training
Frequent pushing operations are deserved since data are stored as multiple files on HDFS, and each
we want to propagate the updates generated by each file is splitted into multiple blocks with each block
worker to others as soon as possible. Since each replicated on two or three storage nodes. So, at
pushing operation invokes network overheads, local the beginning of each Angel application, the master
updates are merged to reduce the data in transition. first collects the information of all data blocks from
A matrix is used to merge the updates from differ- the NameNode of HDFS, including the length of
ent workers. Given the sparseness of the updates, the each block and the node locations. Then, it places
matrix is implemented by row-based linked lists in- each data block to the worker that minimizes the
Figure 7. Distributed counter interface.
Figure 6. Dummy processing using Spark. showed in Fig. 7 in Angel. User can register multi-

ple counters in Angel and there are three types of
state for each counter, including WRITE, READ and
skew of computation load and maximizes the data CLEAN. WRITE is the initial state of each counter.
locality. User can write to a counter with the increase inter-
Specifically, the master first sorts the blocks by face once it is WRITE state. A counter changes to
the length in a descending order and assign each READ after user calls aggregate and allows workers
block to workers one by one. At each step, it finds out to get count value and index through getWithIndex
the worker with the most blocks that belongs to the interface. Finally, the counter will be cleaned to re-
same node with current block. If the data size of this lease the memory. We implement the counter based
worker is lower than the upper bound, the master as- on the servers. Each KEY is hashed and stored at the
signs this block to this worker. Otherwise, it assigns servers. Workers will merge the values locally before
this block to another one with the least computation pushing to the server. The aggregate operation in-
load. vokes a barrier, and the master will assign the index
for each KEY.
In the implementation of Spark, the flow of RDD
Data pre-processing feaWithsid results in massive communication cost
In practice, the data collected from online appli- which is the same order of magnitude as the train-
cations cannot be directly used for ML algorithm. ing dataset. However, the results of RDD dum-
Complex ETL operations must be performed to mySamples are still grouped by the sample id. Hence,
clean the raw data and convert it into ready-to- the flow of training data is redundant and can be
run vectors. The most common used operation is avoided with better implementation. Broadcasting
to construct dummy variables [36]. The construc- the feature-index map to each partition can avoid
tion phase of dummy variables is non-trivial and the the flow of training data, but it is also impractical
pseudo code is listed in Fig. 6 with Spark RDD inter- when there exist tremendous features. Using the dis-
face. tributed counters, Angel avoids the flow of training
The extractFeatures function is defined by users data. Moreover, the unbalanced problem introduced
to construct various features from original logs. Ex- by the join operation is also eliminated.
tracted features include simple features which are
transformed from a single field in the original logs,
as well as cross features generated by multiple fields. Data storage strategy
Then, we enumerate all distinct features according to During the training phase of distributed ML, the
the samples and assign a unique index to each fea- training data stay immutable and is iteratively ac-
ture. There are two shortcomings of the above Spark cessed by the optimization algorithm in a random
program. One is that there is a power-law distribu- order. After the initial partitioning phase, there is no
tion of the feature appearance times. This distribu- demand for communication of training data among
tion results in unbalanced load in the join operation workers. Putting the training data inside each worker
in Fig. 6. Besides, loading the training data between is suitable for the iterative access pattern of ML algo-
Spark and Angel is time consuming because we must rithms. But it also brings about huge memory con-
materialize the data into HDFS and read it again. sumption. Since Angel is run at a productive clus-
This problem becomes much more serious with the ter, allocation requests with large memory are not
increasing size of the dataset. allowed by Yarn. In consideration of the access pat-
To address this problem, we integrate a built- tern of ML algorithms, Angel stores the training data
in module to complete the pre-process phase and in workers over three strategies, including Memory,
abstract a distributed counter interface which is Disk and Memory Disk.
Within each worker, user program accesses the by other tasks. Angel provides the ability to recover
training data through a reader with the following in- when servers and workers come across failures.
terface. Servers store the values of model parameters
1 public interface Reader<KEY, VALUE> { which are frequently modified by workers. Hence,
2 boolean nextKeyValue(); we choose to write the snapshot of each parameter
3 KEY getCurrentKey(); partition periodically into persistent storage instead
4 VALUE getCurrentValue(); of writing update logs. Besides the parameter values,
5 } other systems [5,6] also maintain the clock informa-
tion on servers. In these systems, inconsistent clock
This reader provides the ability of scanning the train- view would exist among different servers. Although
ing data with a random order multiple times with- better performance can be obtained for training, it
out concerning the implementation details of data is hard to recover from a consistent state when job

storage. There are three kinds of storage level behind encounters failures. It requires to take snapshot at
the reader which can be configured by users and are each time when servers receive an update request
listed as follows. to keep consistent on the snapshot view, which in-
Memory All the training data samples will be put into troduces big pressure on server. Another problem is
an in-memory array and the reader traverses the introduced when a worker recovers from failures. It
array sequentially. With all the data samples re- needs to ask for all servers to obtain the current clock
siding at memory, the fastest data access speed value it should reside on due to the inconsistent view
would be gained. of clock values.
Disk Under this storage level, all samples are serial- To avoid these problems, Angel manages a con-
ized as binary bytes and written into multiple lo- sistent view of clock information at the master side.
cal files at worker. This storage level only keeps The servers only act as storage service without main-
current key-value pair at memory, thus most of taining any state information. At the initialization
memory space can be left for other usage, such as stage, a thread will be launched to dump the value
parameters and model updates. of parameters into the HDFS periodically. User can
Memory Disk The size of memory used to store configure the backup interval of the snapshot oper-
training data is limited by a configured threshold, ation. When server is crashed, the master node can
and it serves as a buffer for the whole training data. be informed through the broken heartbeat connec-
This hierarchical storage level can both obtain fast tion. A new server will be allocated and initialized
data access speed and low memory pressure for with the snapshot on the HDFS. A small number of
workers. updates might be lost, but we think it is acceptable
because of the error tolerance property of ML algo-
Angel enables launching multiple tasks inside one rithm. When a worker is down, the master will pull
worker. Each task will get a reader to traverse its data up another worker with the clock state of the crashed
partition. Angel will serialize the data into multiple one. The new worker will read the training data from
files. These files will be located at different local disks HDFS, fetch the latest parameters from servers and
to fully utilize the disk I/O bandwidth with concur- start training from the current clock.
rent.
Usually, the raw data read from HDFS, which is
stored as libsvm format or protobuf bytes, cannot be
EVALUATION
directly used for ML algorithms which abstract data In this section, we present the experimental evalua-
as matrices. Directly storing data samples into mem- tion of the proposed distributed ML system, Angel.
ory or disks in their raw format can introduce ex- We first demonstrate the benefits and costs of Angel
tra overheads by format converting whenever they by comparing Angel with Spark and Petuum using
are read. Users can personalize the KEY and VALUE various ML algorithms. We also compare Angel with
type in the training data reader. Angel converts each TensorFlow using the LR algorithm. Then, we eval-
sample into the ready-to-run objects only once, thus uate the performance of hybrid parallelism, the ef-
eliminating the duplicated data format converting fectiveness of hierarchical data management and the
operations. impact of synchronization strategies. We also study
the behaviors of Angel when facing failures.
Fault tolerance
Running a distributed job in a practical environment, Experimental setup
failures are common and may happen due to vari- The experiments are conducted on a cluster with
ous reasons, such as hardware errors or interfered 100 physical nodes, where each node has a 2.2 GHz
Table 1. Dataset statistics.
Model Dataset #row #col #nnz Sparse size
LR kdd2010 19 m 30 m 585 m 8.4 GB

CTR1 35 m 1.5 m 1.1 b 11 G
CTR2 256 m 4m 9b 85 G
CTR3 410 m 4.7 m 14.2 b 130 G
CTR4 528 m 101 m 22 b 167 G
MF netflix [37] 18k 480k 100 m 1 GB
wxjd 21.8 m 253k 2.5 b 15.7 G
LDA pubmed 8m 140 k 737 m 4 GB

corpus1 159 m 3.7 m 19 b 90 G
corpus2 507 m 5.6 m 60 b 270 G
KMeans kdd2010 19 m 30 m 585 m 8.4 GB
pubmed 8m 140k 737 m 4 GB
CPU with 12 cores, 64 GB memory, 12 × 2 TB they are used to evaluate the performance of scala-
SATA hard disks and connected by 1 Gbps network. bility, data pre-processing and fault tolerance on An-
The version of Yarn used in resource management gel. The user behavior dataset wxjd, as well as netflix,
is 2.2.0. Unless otherwise stated, the maximum size is adopted in the MF experiments. Document cor-
of JVM heap is 10 GB. To further demonstrate the pus datasets, pubmed, corpus1 and corpus2, are used
scalability of Angel when dealing with large datasets, to evaluate the performance of LDA, where corpus1
we conduct evaluations of Angel on a shared inner- and corpus2 are set of web pages crawled by crawlers
company cluster with around 5000 physical nodes. in Tencent. The statistics of the datasets are listed in
ML algorithms. Currently, Angel supports effi- Table 1. The sparse size of dataset is the size of text
cient implementations for a wide range of ML al- format where only non-zero elements are stored. The
gorithms to satisfy the requirement of various ap- #nnz value counts the number of non-zero elements
plications, especially aiming at large data and large in the data matrix.
model. For classification tasks, Angel provides three
algorithms, including LR, SVM and LassoLR, while
regression tasks can be solved with Linear Regres-
sion and Lasso. Besides that, users can utilize tree
ensembles methods, like GBDT, to handle classifi- End-to-end comparison
cation and regression tasks. Kmeans and LDA are We first demonstrate the efficiency of Angel by com-
supported in Angel to handle clustering problems. paring it with Spark, Petuum and TensorFlow. Spark
Moreover, MF is also integrated in Angel to deal with follows the paradigm of MapReduce, but is specially
recommendation tasks. optimized for iterative tasks. Now Spark is one of the
Here, we adopt four ML algorithms to compare most popular general-purpose big data processing
Angel with other systems in the following experi- systems in industrial companies. Due to its feasible
ments, namely LR, LDA, MF and Kmeans, because abstraction of distributed datasets and cache mech-
they have different parameters access patterns. LR is anism, Spark is widely adopted to processing ML ap-
a linear algorithm whose model has only one shared plications. Here, we use the algorithm implementa-
row, while Kmeans has one shared matrix with mul- tions from Mllib, the ML library built beyond Spark,
tiple rows. Both the models of LDA and MF are to conduct the comparison. Petuum is an academic
composed of multiple matrices. An LDA’s model prototype system which provides Bösen and Strads
contains two shared matrices and one partitioned to support data parallelism and model parallelism,
matrix, while MF uses one shared matrix and one respectively. Both Spark and Petuum provide the im-
partitioned matrix. plementations of those four ML algorithms adopted
Dataset. Besides the public datasets, including in the experiments. TensorFlow is an open-sourced
netflix, kdd 2010 and pubmed, we also use the logs distributed system designed for deep learning. Due
of Tencent applications to conduct the experiments. to its generalization of programming interface, Ten-
kdd2010 dataset is employed to evaluate the per- sorFlow can also support linear models, such as LR.
formance of different systems on LR. The datasets Therefore, we conduct the comparison with Tensor-
CTR 1-4 record the click through rates of users, and Flow using LR. We use the latest released version of

p
Figure 8. Performance comparison between Petuum, Spark, TensorFlow and Angel.
Petuum and TensorFlow in the experiments and the epochs. But the achieved speedup is damaged by the
version of Spark is 2.0.2. global barriers. Finally, the speedup of Angel is 1.46
when the worker number increases from 160 to 240,
Effects on LDA while it is only 1.25 in Strads. As a consequence,
Comparison with strads. We first compare Angel Angel can obtain a 1.6× performance improvement
with Strads using LDA workloads. The topic number against Strads. There are two reasons that enable An-
is set to 1024 in the experiments. gel achieve faster speed than Strads. The first one
Strads deploys model parallelism for sparse is that Angel employs hybrid parallelism, while the
Gibbs sampling while Angel implements LDA with other one is that in Angel the get operations are over-
hybrid parallelism. We vary the number of machines lapped with the computation operations to reduce
for both Strads and Angel to evaluate their perfor- the network latency.
mance under different environment configurations. Comparison with Bösen. We then compare An-
In Angel’s implementation, each worker group is as- gel with Bösen using the same workloads. Since the
signed with eight workers to conduct model parallel execution of Bösen always fails when using large
sampling. We monitor the loglikelihood value dur- number of machines, we use much less workers in
ing the execution and record the time when the log- these experiments. We use only 40 workers in these
likelihood value reaches −7.8 × 109 . experiments and configure the staleness in Bösen to
The results are illustrated in Fig. 8a. We can see three clocks.
that Angel exhibits better performance than Strads With the update scheduling in each group, An-
when the worker number is 160. It costs 430 s for gel needs only 22 epochs to reach the required value,
Strads with 160 workers to reach the required value, while Bösen needs 33 epochs. With the carefully op-
while the time needed for Angel is only 338 s. timized model synchronization, the performance of
When more workers are involved in the execu- Angel is further optimized. From Fig. 8d, we can see
tion, the running time of both Strads and Angel re- that Angel can obtain a 10× performance gain over
duces, but Angel benefits more from the additional Bösen.
resources. Though Angel needs four more epoches Comparison with Spark. We also conduct the
when running with 240 workers, the running time experiment of LDA algorithms using Spark on the
of each epoch is reduced as well. The model par- pubmed dataset. However, the performance of Spark
allel method enables Strads to converge with less is too slow. It will take more than 1 h to finish one
iteration. Spark abstracts the computation of LDA groups for Angel with each group containing eight
as graph and uses GraphX to perform the computa- workers. In the implementation of Bösen, it adds the
tion. In the implementation of GraphX, each vertex updates from all workers to the global parameters in-
will send the topic distribution to its neighbors to stead of adding the average value of updates. There-
perform the inference, which will generate massive fore, the objective value might increase after the first
communication cost, ≈300 GB shuffle data, in each synchronization operation. We have tried out best to
iteration. On the contrary, only the word-topic ma- adjust the learning rate for Bösen, it still takes more
trix needs to be shared and transferred through net- than 10 000 s to achieve the specified objective value.
work in the implementation of Angel, which can sig- Hence, Angel is at least 10× faster than Bösen.
nificantly reduce the communication cost
Effects on LR

Effects on MF Here, we present the comparison on LR for Angel,
SGD and Alternating least squares (ALS) are two Spark, Petuum and TensorFlow. The experimental
popular optimization algorithms for MF. SGD is a results are given in Fig. 8c. The number of work-
gradient-based method that iteratively updates the ers is all set to 10 for all systems, and each worker
two feature matrices while ALS alternatively calcu- group in Angel has only one worker. The dataset em-
lates the value for one of them. Due to the differ- ployed here is kdd2010. SGD is adopted by all sys-
ent update patterns, Strads and Spark adopt ALS as tems to solve LR. We use the loss value on the train-
its optimization method for MF while Bösen adopts ing dataset to evaluate their convergence rate. The
SGD. To eliminate the effects of different optimiza- learning rate are well tuned for each system. For ex-
tion methods, we implement both SGD and ALS on ample, Spark requires larger learning rate since it will
Angel to compare the performance of MF against average the gradient value for a batch of samples,
Petuum and Spark. Specifically, when comparing while TensorFlow requires a smaller value of learn-
Angel with Strads and Spark, we set ALS as the opti- ing rate since it directly adds the summation of gra-
mization method for Angel while SGD is employed dients generated by samples.
when comparing with Bösen. Comparison with Spark. For Spark, we use
Comparison with Strads and Spark. In the im- the latest code of MLlib, version 2.0.2, to run
plementation of ALS on Angel, we store both the this experiment. There are three different im-
user feature matrix and the item feature matrix on plementations on MLlib. We try all these three
servers. Each iteration is divided into two phases, methods and find that only one of them can
where workers fetch the item matrix and compute successfully complete the running with kdd2010
the value of user matrix at the first phase, while at dataset. The other two methods will fail on the
the second phase workers fetch the user matrix and first synchronization phase due to OOM error or
calculate the item matrix. At each phase, different Requested array size exceeds VM limit error.
worker calculates different parts of the item matrix or We can see that Angel is consistently faster than
user matrix in order to parallelize the computation. Spark. The reason is that Spark needs to broad-
The experiment’s result is showed in Fig. 8b. All cast the model and collect the gradients from work-
three systems adopt 40 workers to conduct the eval- ers for each update operation, meaning that each
uation. Each worker group contains only one worker update operation requires one to two network op-
in Angel. The rank of the feature matrices is set to erations, which much communication costs; while
200 and the dataset used here is netflix. To reach Angel adopts PSGD algorithm, which enables each
the objective value of 3 × 107 , the consumed cost worker make a copy of the model parameters and
for Strads, Spark and Angel is 1893, 633 and 494 s, synchronize them at the end of each iteration. There-
respectively. Therefore, Angel is about four times fore, Spark needs a long time to reach a low training
faster than Strads and 1.3× faster than Spark. loss value while Angel can quickly reach it.
Comparison with Bösen. Here, we give the result Comparison with Bösen. From Fig. 8c, we can
of comparing Angel and Bösen. When using SGD see that Angel is about five times faster than Bösen.
as the optimization method, we store the item fea- Angel needs about 10 s to finish one iteration while
ture matrix on the servers to share it among work- Bösen requires ≈50s. There are three reasons that re-
ers, while the user feature matrix is partitioned and sult in the performance improvement of Angel. The
stored on different workers. At each iteration, each first one is that Bösen lacks the ability to partition
worker fetches the item matrix and updates both the data. It requires users to manually copy the data to
user matrix and item matrix for each rating. each physical node. Moreover, each worker of Bösen
The evaluation result is presented in Fig. 8e. The will read all the data into memory and partition it at
number of workers is 64, and there are eight worker the worker side, which incurs more overheads since
each worker reads more data. Another reason is that

the Bösen needs to synchronize the parameters more
times inside one epoch. Otherwise, the training loss
will diverse and the algorithm will not converge at
last. Extra synchronization operations incur more
network overheads. The third reason why Angel is
faster is that Angel can utilize the sparsity of data to
avoid fetching useless parameters while Bösen can-
not.
Comparison with TensorFlow. Since the de-
fault implementation for LR in TensorFlow is
designed to dense data, it requires users to con-

vert each sparse sample to dense representation.
However, it is a bad way to process dataset with Figure 9. Scalability.
high dimensions, such as kdd2010. Therefore,
we follow the code from https://github.com/
chenghuige/tensorflow-example and realize a calculate the running time until the square error
distributed version for LR algorithm. We manually is reduced to 115 464. We can see from the re-
partition the input data into multiple parts since sults that Angel is faster than Bösen and about two
TensorFlow does not support data partitioning. times faster than Spark. When using a dataset with
From Fig. 8c, we can see that TensorFlow is high-dimensional data, for example kdd2010, An-
much slower than Angel. It takes more than 150 s gel is ≈8× faster compared with Spark because
for TensorFlow to complete one epoch which An- Spark lacks efficient mechanism to gather the data-
gel only requires ≈10 s. The reason that training data generated workers. When running with kdd2010
must be splitted into mini-batches in TensorFlow, dataset, Bösen cannot finish the first iteration since
where each batch is one computation unit that in- it will be blocked forever at the fetching operation of
curs one round of gradient computation and param- parameters. In the implementation of Bösen, it can
eter update. The problem is that larger batch size only partition parameter matrix by row. Therefore, it
will cause errors since TensorFlow does not allow lacks the ability to deal with matrix whose row size
the size of one tensor to exceed 2 GB. Thus, there is larger than 1 million. It requires users to manu-
are about 17 batches inside one epoch, indicating ally partition one row to multiple parts. However,
that there are 17 network communication opera- the open-source implementation provided by Bösen
tions. We carefully tune the batch size for Tensor- for KMeans does not handle this problem.
Flow to guarantee both the success of running and
the best performance under the limitation of tensor Scalability
size. Figure 9 presents the scalability of Angel and Strads.
We vary the number of execution tasks and calculate
Effects on KMeans their speedups in LDA and MF by the reduction in
We continue presenting the performance compari- the running time. The datasets used in LDA and MF
son using KMeans algorithm. The number of work- programs are pubmed and wxjd, respectively.
ers is all set to 10 for Bösen, Spark and Angel and Angel behaves with better scalability than Strads.
there are only one worker in each group for Angel. With the task number increasing from 80 to 240,
We use two different datasets, including pubmed and the speedups achieved by Angel in MF and LDA are
kdd2010, and different number of centers to conduct 2.65 and 2.59, respectively. Meanwhile, the numbers
this comparison. To do the comparison, we evaluate are 2.09 and 1.74 in Strads. The reason is that An-
the running time of these three system to reach the gel adopts hybrid parallelism that allows concurrent
same square errors. To guarantee that each system updates to the model while each step of Strads must
can achieve its best performance, we carefully tune be synchronized with the global barrier. With the
the running parameters for all of them. For example, growth in the degree of parallelism, Angel hence can
the performance of Bösen is sensitive to the batch obtain better speedup.
size. Larger batch size results in more computation
overhead while small batch size incurs more commu-
nication overheads. Parallelization methods
Figure 8f presents the performance of these three We continue presenting the effects of hybrid par-
systems when using these two datasets. For pubmed allelism in this subsection. We vary the number of
dataset, we set the number of centers to 10 and worker group number and the number of workers in
one group to realize data parallelism and model par- Impact of storage level
allelism on Angel. Specifically, when the number of There are three different memory storage levels for
workers in one group is set to 1, the implementation training data that reside at the workers during the
of Angel is exactly data parallelism while if all work- training phase. Figure 11a presents the effects of
ers belong to the same group, they are coordinated these three levels to the convergence time of LR
to train through model parallelism. To demonstrate algorithm on CTR2 dataset with 20 workers un-
the effectiveness of hybrid parallel method, we uti- der different memory budgets. When the memory
lize both LR and LDA to obtain the experiment budget is limited at 4 GB, the worker task using
results. only memory storage level fails because of the Out-
Figure 10a gives the convergence rate over OfMemory exception while tasks can still complete
epoches on LDA, while Fig. 10b presents the con- their execution once setting the level to disk or
vergence rate over running time. We employ pubmed memory disk.

dataset for this comparison. The total number of For each memory budget value, the tasks using
workers are set to 240 and the implementation of the disk level need more running time than their
hybrid parallelism employs eight workers at one counterparts due to the more I/O cost. The mem-
group. We can see from Fig. 10a that model paral- ory disk level can obtain comparable performance
lel method requires the least number of epoches to compared to the memory level while putting no en-
obtain the same loglikelihood value, while data par- forcement on the memory budget. By maintaining
allel method needs much more epoches than model a memory buffer on each worker node when using
parallelism and hybrid parallelism. The reason is that the memory disk level, lower accessing latency and
data parallel method incurs too much conflicting up- lower memory consumption can be obtained at the
dates, while model parallel method carefully sched- same time.
ules the parallel computing of model to obtain the
best convergence rate over epoches. However, since Impact of pre-processing
model parallel method requires to schedule the up- Here, we continue evaluating the running time of
dates among all workers to avoid conflicts and issue the dummy processing phase. Because Petuum does
global barriers to coordinate workers, it needs more not have the ability to pre-process data, we only
time to finish one epoch. Figure 10b demonstrates compare Angel and Spark in the experiments. In the
that hybrid parallelism can obtain the best perfor- implementation of Spark, we broadcast features with
mance compared with the other two counterparts. high frequency to workers instead of join opera-
It needs the least time to reach the convergence tion, which can eliminate skewed tasks. As showed
point. in Fig. 11b, Angel outperforms Spark over an order
Similar results can be derived from Fig. 10c and of magnitude on each dataset. With more training
d, which use LR algorithm to evaluate three parallel data to process, Spark exhibits inferior performance
methods on the kdd2010 dataset. The overall num- than Angel since it transfers more data through
ber of workers is 10 for all of them, and there are two shuffle operations. Thus, more disk I/O and net-
workers in one group for hybrid method. Figure 10c work overheads are encountered. The inefficiency in
shows that model parallelism can reach a lower train- the Spark’s pre-processing significantly degrades the
ing loss value with the same number of epoches com- overall performance of ML applications. Angel can
pared with hybrid parallelism and data parallelism. however complete the pre-processing phase with
However, similar to the results of LDA, model par- almost ignorable cost compared with the succeeding
allelism requires more time for each epoch, which training phase.
can be seen from Fig. 10d. Compared with the other
two parallel methods, hybrid parallelism can achieve
both good convergence rate per epoch and reduce Synchronization
the time cost for one iteration, reaching the given ob-
We continue presenting the effects of different pa-
jective value with the minimum time.
rameter pulling methods.
Impact of pipelining pulling

Data management The pulling of parameters and the computing of up-
In this subsection, we present the efficiency of the dates are performed sequentially in the naive imple-
data management module in Angel. We first examine mentation. Network requests are issued whenever a
the impact of storage level to the performance. Then, new row is needed in the computation. Since each
we compare the integrated data pre-processing mod- network operation has extra overheads and there
ule with Spark. are more than ten thousands of parameter rows in
L
H
H
M
D
D

E
(a) LDA: epoch to convergence (b) LDA: time to convergence
H H
M M
D D
O
E
(c) LR: epoch to convergence (d) LR: time to convergence
Figure 10. Performance comparison between data parallel, model parallel and hybrid parallel.
Figure 11. Effective of data management.
the model, the naive implementation incurs heavy We conduct experiments to evaluate the effects
network overheads. The pipelined execution of pipelined pulling with the MF programs running
method invokes network requests through a batch over the Netfix dataset. We measure the running
method and overlaps the parameter pulling and the time per iteration with different pulling methods.
computing phases. By reducing the unnecessary Since a larger value of rank will lead to more data
waiting time, the performance degraded by network transferred through network, we also vary the rank
operations reduces as well. values in the MF programs.
N N
N P
P
P
O
E

(a) Pipeline pulling (b) Efficiency of pruning (CTR4) (c) Pruning speed (CTR4)
Figure 12. Effectiveness of parameter synchronization.
The experimental results presented in Fig. 12a are are adopted in the experiments. The auto method
identical with our expectation. Because the pipelined which is described in Section ‘Pushing updates’ em-
method can facilitate the utilization of the band- ploys the automatic updates pushing method pro-
width, it helps to improve the performance of the vided by Angel. The 10spe method propagates lo-
MF programs. With the increasing of model size, the cal updates to servers 10 times every epoch, while
programs using pipelined pulling obtain more per- the 1spe method only pushes the updates at the
formance gains than the naive ones. end of each epoch. From Fig. 13a we can see
that the auto method can obtain comparable sta-
tistical efficiency compared with 10spe. They both
Impact of pruning pulling
outperform the convergent speed of 1spe in terms
By pulling only those parameters needed in the com-
of epoch numbers since they propagate their up-
puting, the pruning pulling method can reduce the
dates to other workers more frequently. But when
number of pulled parameters. We run the LR algo-
coming to the running time, the auto method per-
rithm against CTR4 to evaluate the effect of pruning
forms the best. This is because each update pushing
pulling. In the experiments, the number of parame-
operation invokes extra overheads, such as updates
ters exceeds 100 million, and the programs are exe-
merging, splitting and the network cost. With more
cuted over 100 workers.
frequent synchronization, more overheads are intro-
Figure 12b discriminates the time used to pull
duced and the benefits brought by frequent synchro-
parameters and compute updates during the train-
nization become marginal. Because the automatic
ing phase and test phase. Comparing to the naive
sync method utilizes a best-effort strategy to perform
method which pulls all parameters to workers, the
pushing operations, it can well balance the frequency
pruning method can avoid the pulling of those pa-
and the cost of model synchronization, avoiding the
rameters that are not needed in the calculation of
performance degradation due to unnecessary syn-
gradients. Hence, the pruning method can efficiently
chronization operations.
reduce the time needed in pulling operations. For
each iteration, the pruning method only costs 28.5 s
for the pulling operation while it takes 59.2 s for the
naive method at the training phase. The reduction in Fault tolerance
the time of pulling operations can help accelerate the
In this subsection, we evaluate the recovery perfor-
convergence rate.
mance of Angel. We conducted LR algorithm on
Figure 12c shows the objective values over time
both CTR1 and CTR2 datasets. Figure 14a shows the
with different pulling methods. By avoiding pulling
recovery latency when an Angel application comes
redundant parameters, the pruning method takes
across failures. We randomly kill a server node at it-
only 1844 s to complete 20 epochs while the naive
eration 1 and iteration 11. Compared to other itera-
one requires 3227 s.
tions without failures, it costs about more 6 s to com-
plete the two iterations where failures show up. That
Impact of synchronization frequency is, the master node takes only 6 s to detect the crash
Here, we show the impact of different synchroniza- of server node, reallocate a container for a new node
tion frequencies to the convergence speed of LDA and recover the state of failed server from its snap-
algorithm. Three different synchronization methods shot.

Figure 13. Impact of synchronization frequency.
Figure 14. Impact of failover.
Angel enables users to set the backup interval Scalability results

for servers, and we compare the effect of different
Finally, we use large datasets and more workers to
backup intervals on the convergence speed in the
demonstrate the scalability of Angel. We use LDA
execution with failures. The failure is set to happen
algorithm to conduct the evaluation since it is more
at 500 s after the application starts. If the interval is
complicated and its model size is much larger than
set to 600 s, no snapshot had been written to HDFS
others. Two datasets, including corpus1 and corpus2,
when the failure happens. Then, the convergence
are employed to perform the evaluation. The exper-
progress is affected by the server failure. However,
iments are running in a shared inner-company clus-
since only a part of parameters are lost, we do not
ter with 5000 physical machines that are equipped
need to start the entire computation from scratch.
with the same hardware configurations with the pre-
Only those servers affected by the failure will be
vious ones. Researchers and engineers in Tencent
restarted. When the backup interval is set to 30 s, we
submit around 1.2 million applications to this clus-
can see that the execution is merely affected by the
ter every data, with at most 6000 jobs running at the
failure. Note that each iteration here takes ≈80 s to
same time.
complete the training phase and test phase, the num-
The experiments’ results are presented in Fig. 15.
ber of updates not reflected in the latest snapshot is
The number of topics is set to 8000 for these two
very small. The new launched server can restore the
datasets. We vary the number of overall workers but
state with few missing updates. Therefore, the execu-
fix the number of workers in each group at four. For
tion can recover in a short time and the convergence
corpus1, Angel can achieve the convergence point
rate is not damaged.
within 7500 s using 400 workers, while it can be
L
w w
w w

E E
(a) LDA corpus1 #topics=8k (b) LDA corpus2 #topics=8k
Figure 15. Scale results with more workers and large datasets: (a) LDA corpus1 #topics = 8k and (b) LDA corpus2 #top-
ics = 8k.
decreased to 5000 s when the number of workers is prehensive experiment results demonstrated the su-
increased to 1200. For corpus2, Angel can finish the periority of Angel compared to Spark and Petuum.
training within about 3 h using 1200 workers, while
it can achieve convergence in ≈7000 s when using
FUNDING
2000 workers. There are ≈30 billions and 40 billions
of parameters stored on the servers when training This work was supported by the National Natural Science
Foundation of China (61572039), the National Basic Research
dataset corpus1 and corpus2, respectively. These two
Program of China (2014CB340405), Shenzhen Government
experiments prove that Angel can run in a real pro-
Research Project (JCYJ20151014093505032) and Tecent Re-
duction environment with thousands of workers to search Grant (PKU).
handle ML applications that require billions of pa-
Conflict of interest statement. None declared.
rameters.
We also try to run this experiment with Petuum
and Spark. However, Spark will always fail since it
lacks efficient mechanism to process large models, REFERENCES
while Petuum lacks the ability to automatically par- 1. Huang Y, Cui B and Zhang W et al. Tencentrec: Real-time stream
tition the training dataset. It requires a Network recommendation in practice. In: Sellis TK, Davidson SB and Ives
File System to share the input dataset for all work- ZG (eds). Proceedings of SIGMOD Conference 2015. Melbourne,
ers. Otherwise, user should manually copy the in- Victoria, Australia: ACM 2015, 227–38.
put dataset to every worker, which is very time 2. Huang Y, Cui B and Jiang J et al. Real-time video recommen-
consuming for large dataset and impossible for a pro- dation exploration. In: Özcan F, Koutrika G and Madden S (eds).
duction cluster managed by Yarn. Another problem Proceedings of, SIGMOD Conference 2016. San Francisco, CA,
for Strads is that the worker will fail for large data USA: ACM 2016, 35–46.
since it will read all the data into memory. In sum- 3. Zaharia M, Chowdhury M and Franklin MJ et al. Spark: cluster
mary, Petuum cannot be deployed in production en- computing with working sets. In: Nahum EM and Xu D (eds).
vironment for industry-scale applications. Proceedings of HotCloud 2010. Boston, MA, USA: USENIX As-
sociation 2010.
4. Zaharia M, Chowdhury M and Das T et al. Resilient distributed
datasets: a fault-tolerant abstraction for in-memory cluster com-
CONCLUSION puting. In: Gribble SD and Katabi D (eds). Proceedings NSDI Con-
In this paper, we proposed a new general-purpose ference 2012. San Jose, CA, USA: USENIX Association 2012,
distributed ML system, named Angel, which aimed 15–28.
at solving large-scale ML problems faced by big data 5. Xing EP, Ho Q and Dai W et al. Petuum: a new platform for dis-
analytic applications. Angel employed hybrid par- tributed machine learning on big data. IEEE Trans Big Data 2015;
allelism to accelerate the performance of ML algo- 1: 49–67.
rithms. The pulling of parameters and the pushing of 6. Li M, Andersen DG and Park JW et al. Scaling dis-
updates were fully optimized in Angel to reduce the tributed machine learning with the parameter server. In:
network overheads. Angel was deployed in a produc- Flinn J and Levy H (eds). Proceedings of OSDI Confer-
tion cluster and provisioned efficient mechanisms to ence 2014. Broomfield, CO, USA: USENIX Association 2014,
achieve fault-tolerated execution at scale. The com- 583–98.
7. Ho Q, Cipar J and Cui H et al. More effective distributed ml via a stale syn- 22. Bradley JK, Kyrola A and Bickson D et al. Parallel coordinate descent for l1-
chronous parallel parameter server. In: Burges CJC, Bottou L and Ghahramani Z regularized loss minimization. In: Getoor L and Scheffer T (eds). Proceedings of
et al. (eds). Proceedings of NIPS Conference 2013. Lake Tahoe, Nevada, United ICML Conference 2011. Bellevue, Washington, USA: Omnipress 2011, 321–8.
States, 2013, 1223–31. 23. Newman D, Smyth P and Welling M et al. Distributed inference for latent
8. Kim JK, Ho Q and Lee S et al. Strads: a distributed framework for scheduled dirichlet allocation. In: Platt JC, Koller D and Singer Y et al. (eds). Proceed-
model parallel machine learning. In: Cadar C, Pietzuch PR and Keeton K et al. ings of NIPS Conference 2007. Vancouver, British Columbia, Canada: Curran
(eds). Proceedings of EuroSys Conference 2016. London, United Kingdom: ACM Associates, Inc., 1081–8.
2016, 5:1–5:16. 24. Valiant LG A bridging model for parallel computation. Comm ACM 1990; 33:
9. Lee S, Kim JK and Zheng X et al. On model parallelization and scheduling 103–11.
strategies for distributed machine learning. In: Ghahramani Z, Welling M and 25. Dean J and Ghemawat S Mapreduce: simplified data processing on large clus-
Cortes C et al. (eds). Proceedings of NIPS Conference 2014. Montreal, Quebec, ters. Comm. ACM 2008; 51: 107–13.
Canada, 2014, 2834–42. 26. Malewicz G, Austern MH and Bik AJ et al. Pregel: a system for large-scale

10. Wang Y, Zhao X and Sun Z et al. Peacock: learning long-tail topic features for graph processing. In: Elmagarmid AK and Agrawal D (eds). Proceedings of SIG-
industrial applications. ACM Trans Intell Syst Tech 2015; 6: 47. MOD Conference 2010, Indianapolis, Indiana, USA: ACM 2010, 135–46.
11. Zhang C and Ré C. Dimmwitted: a study of main-memory statistical analytics. 27. Zinkevich M, Langford J and Smola AJ. Slow learners are fast. In: Bengio
Proc VLDB Conf 2014; 7: 1283–94. Y, Schuurmans D and Lafferty JD et al. (eds). Proceedings of NIPS Confer-
12. Yu Y, Isard M and Fetterly D et al. Dryadlinq: a system for general-purpose ence 2009. Vancouver, British Columbia, Canada: Curran Associates, Inc. 2009,
distributed data-parallel computing using a high-level language. In: Draves R 2331–9.
and van Renesse R (eds). Proceedings of OSDI Conference 2008. San Diego, 28. Agarwal A and Duchi JC. Distributed delayed stochastic optimization. In:
California, USA: USENIX Association 2008, 1–14. Shawe-Taylor J, Zemel RS and Bartlett PL et al. (eds). Proceedings of NIPS
13. Fan W and Hu C Big graph analyses: from queries to dependencies and asso- Conference 2011. Granada, Spain, 2011, 873–81.
ciation rules. Data Sci Eng 2017; 2: 1–20. 29. Recht B, Re C and Wright S et al. Hogwild: a lock-free approach to parallelizing
14. Shao Y, Cui B and Ma L. Page: a partition aware engine for parallel graph com- stochastic gradient descent. In: Shawe-Taylor J, Zemel RS and Bartlett PL et al.
putation. IEEE Trans Data Knowl Eng 2015; 27: 518–30. (eds). Proceedings of NIPS Conference 2011. Granada, Spain, 2011, 693–701.
15. Shi X, Cui B and Shao Y et al. Tornado: a system for real-time iterative analysis 30. De Sa, C Olukotun K and Ré C. Ensuring rapid mixing and low bias for asyn-
over evolving data. In: Özcan F, Koutrika G and Madden S (eds). Proceedings of chronous gibbs sampling. In: Balcan MF and Weinberger KQ (eds). Proceedings
SIGMOD Conference 2016. San Francisco, CA, USA: ACM 2016, 417–30. of ICML Conference 2016. ACM, 2016, 1567–76.
16. Dean J, Corrado G and Monga R et al. Large scale distributed deep networks. 31. Liu J, Wright SJ and Ré C et al. An asynchronous parallel stochastic coordinate
In: Bartlett PL, Pereira FCN and Burges CJC et al. (eds). Proceedings of NIPS descent algorithm. J Mach Learn Res 2015; 16: 285–322.
Conference 2012. Lake Tahoe, Nevada, United States, 2012, 1232–40. 32. Ahmed A, Aly M and Gonzalez J et al. Scalable inference in latent variable
17. Jiang J, Cui B and Zhang C et al. Heterogeneity-aware distributed parameter models. In: Adar E, Teevan J and Agichtein E et al. (eds). Proceedings of WSDM
servers. In: Proceedings of SIGMOD Conference. 2017. Conference 2012. Seattle, WA, USA: ACM 2012, 123–32.
18. Abadi M, Barham P and Chen J et al. TensorFlow: A system for large-scale 33. Gemulla R, Nijkamp E and Haas PJ et al. Large-scale matrix factorization
machine learning. In: Keeton K, Roscoe T (eds). Proceedings of OSDI Conference with distributed stochastic gradient descent. In: Apté C, Ghosh J and Smyth P
2016. Savannah, GA, USA: USENIX Association 2016, 265–83. (eds). Proceedings SIGKDD Conference 2011. San Diego, CA, USA: ACM 2011,
19. Ooi BC, Tan KL and Wang S et al. Singa: a distributed deep learning platform. 69–77.
In: Zhou X, Smeaton AF and Tian Q et al. (eds). Proceedings of Multimedia 34. Blei DM, Ng AY and Jordan MI. Latent dirichlet allocation. J Mach Learn Res
Conference. Brisbane, Australia: ACM 2015, 685–8. 2003; 3: 993–1022.
20. Wu TT and Lange K. Coordinate descent algorithms for lasso penalized regres- 35. Griffiths TL and Steyvers M. Finding scientific topics. Proc Natl Acad Sci USA
sion. Ann Appl Stat 2008; 2: 224–44. 2004; 101 (suppl 1): 5228–35.
21. Zinkevich M, Weimer M and Li L et al. Parallelized stochastic gradient de- 36. Suits DB. Use of dummy variables in regression equations. J Am Stat Assoc
scent. In: Lafferty JD, Williams CKI and Shawe-Taylor J et al. (eds). Proceedings 1957; 52: 548–51.
of NIPS Conference 2010. Vancouver, British Columbia, Canada: Curran Asso- 37. Bennett J and Lanning S. The netflix prize. In: Proceedings of KDD cup and
ciates, Inc., 2595–603. workshop 2007. San Jose, California, USA: ACM 2007, p. 35.

Angle A New Larg-Scale Machine Learning System

Uploaded by

Copyright:

Available Formats

Angle A New Larg-Scale Machine Learning System

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Angle A New Larg-Scale Machine Learning System

Uploaded by

Copyright:

Available Formats

National Science Review

RESEARCH ARTICLE 5: 216–236, 2018

Angel: a new large-scale machine learning system

Downloaded from https://academic.oup.com/nsr/article/5/2/216/3052720 by guest on 07 December 2021

INTRODUCTION (i) Integration with existing ecosystems: To inte-

Downloaded from https://academic.oup.com/nsr/article/5/2/216/3052720 by guest on 07 December 2021

Downloaded from https://academic.oup.com/nsr/article/5/2/216/3052720 by guest on 07 December 2021

DESIGN AND OVERVIEW OF ANGEL

Downloaded from https://academic.oup.com/nsr/article/5/2/216/3052720 by guest on 07 December 2021

Downloaded from https://academic.oup.com/nsr/article/5/2/216/3052720 by guest on 07 December 2021

Downloaded from https://academic.oup.com/nsr/article/5/2/216/3052720 by guest on 07 December 2021

leader’s. Each time the leader calls the CLOCK oper-

Downloaded from https://academic.oup.com/nsr/article/5/2/216/3052720 by guest on 07 December 2021

distinct scenarios. Angel provides users a set of flex- Partition of parameters

Downloaded from https://academic.oup.com/nsr/article/5/2/216/3052720 by guest on 07 December 2021

stead of 2D arrays to reduce the memory consump-

Downloaded from https://academic.oup.com/nsr/article/5/2/216/3052720 by guest on 07 December 2021

message consists of a list of (key,value) pairs for

Figure 7. Distributed counter interface.

Downloaded from https://academic.oup.com/nsr/article/5/2/216/3052720 by guest on 07 December 2021

Downloaded from https://academic.oup.com/nsr/article/5/2/216/3052720 by guest on 07 December 2021

Table 1. Dataset statistics.

Model Dataset #row #col #nnz Sparse size

LR kdd2010 19 m 30 m 585 m 8.4 GB

Downloaded from https://academic.oup.com/nsr/article/5/2/216/3052720 by guest on 07 December 2021

Downloaded from https://academic.oup.com/nsr/article/5/2/216/3052720 by guest on 07 December 2021

Figure 8. Performance comparison between Petuum, Spark, TensorFlow and Angel.

Downloaded from https://academic.oup.com/nsr/article/5/2/216/3052720 by guest on 07 December 2021

each worker reads more data. Another reason is that

Downloaded from https://academic.oup.com/nsr/article/5/2/216/3052720 by guest on 07 December 2021

Downloaded from https://academic.oup.com/nsr/article/5/2/216/3052720 by guest on 07 December 2021

Impact of pipelining pulling

Downloaded from https://academic.oup.com/nsr/article/5/2/216/3052720 by guest on 07 December 2021

Figure 11. Effective of data management.

Downloaded from https://academic.oup.com/nsr/article/5/2/216/3052720 by guest on 07 December 2021

Figure 12. Effectiveness of parameter synchronization.

Downloaded from https://academic.oup.com/nsr/article/5/2/216/3052720 by guest on 07 December 2021

Figure 14. Impact of failover.

Angel enables users to set the backup interval Scalability results

Downloaded from https://academic.oup.com/nsr/article/5/2/216/3052720 by guest on 07 December 2021

(a) LDA corpus1 #topics=8k (b) LDA corpus2 #topics=8k

Downloaded from https://academic.oup.com/nsr/article/5/2/216/3052720 by guest on 07 December 2021

You might also like