Angel: A New Large-Scale Machine Learning System
INFORMATION SCIENCE
ABSTRACT
Factorization (MF) for commercial recommendation, user profiling and various applications. We have successfully scaled Angel to hundreds of machines, processing hundreds of gigabytes of data with billions of model parameters. Compared with existing systems such as Petuum and Spark, Angel is at least 2×, and sometimes up to 10×, faster.

Preliminaries: data parallelism, model parallelism and synchronization. We provide a brief overview of existing distributed ML systems to set the context for the system design and technical contributions of Angel.

(ii) We propose a hybrid parallel method that allows the scheduling of updates while avoiding global barriers.
(iii) We carefully optimize the implementation of model synchronization. By deploying pipelined execution and utilizing data sparseness, we can significantly reduce the overheads due to model synchronization.
(iv) We enhance the usability of Angel by providing a set of built-in modules and provisioning efficient fault-tolerance mechanisms.
(v) We validate the effectiveness of Angel by conducting comprehensive experiments.

To improve statistical efficiency, Angel partitions the model in each group and schedules the updates to the model. Compared with model parallelism, the hybrid parallel method in Angel allows concurrent updates to the same model parameters. The updates made in different groups are then merged to obtain the updated version of the model. By allowing a larger degree of parallelism, Angel can significantly improve the performance of ML applications. Because the parameters in most ML applications, such as LDA and MF, can be modeled as matrices, Angel abstracts them as matrices to facilitate the de…

… among different workers. By performing update aggregation and allowing delayed synchronization, parameter servers can achieve acceptable performance for big models with billions or trillions of parameters. Petuum is a state-of-the-art distributed ML system which provides two different modules, Bösen and Strads, to parallelize ML algorithms. Bösen models the parameters as a shared table and partitions the table horizontally across different servers. Different synchronization protocols are supported in Bösen to reduce unnecessary network overheads. Strads employs a centralized …
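To make the idea of horizontally partitioning a parameter table concrete, the following minimal Java sketch assigns each row of a parameter matrix to a server by a simple modulo rule and groups a worker's pull requests by owning server. The class and method names are our own illustration, not the Bösen or Angel API.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: horizontal (row-wise) partitioning of a parameter table.
public class RowPartitioner {
    private final int numServers;

    public RowPartitioner(int numServers) {
        this.numServers = numServers;
    }

    // Each row of the parameter matrix is owned by exactly one server.
    public int serverOf(int rowIndex) {
        return rowIndex % numServers;
    }

    // Group the rows a worker wants to pull by owning server, so that one
    // request per server can be sent instead of one request per row.
    public Map<Integer, List<Integer>> groupByServer(int[] rows) {
        Map<Integer, List<Integer>> plan = new HashMap<>();
        for (int r : rows) {
            plan.computeIfAbsent(serverOf(r), s -> new ArrayList<>()).add(r);
        }
        return plan;
    }

    public static void main(String[] args) {
        RowPartitioner p = new RowPartitioner(4);
        System.out.println(p.groupByServer(new int[]{0, 5, 9, 12, 13}));
    }
}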
… can be categorized into two types: data parallelism and model parallelism. The data parallel methods partition the training data to parallelize the ML algorithms, while the model is partitioned in model parallel methods. Data parallel implementations can obtain better hardware efficiency since there is no interference between different workers during training. Model parallel implementations, however, are better at statistical efficiency because all the running tasks are coordinated to avoid concurrent updates to the same parameter. A study based on a single NUMA machine [11] has been conducted to find a good trade-off point that balances the two aspects.

Synchronization protocols. Parallelizing ML algorithms brings in the requirement of model synchronization among different model replicas and partitions. Various synchronization protocols have been proposed and applied to coordinate the running states of different workers. The Bulk Synchronous Parallel (BSP) [24] protocol forces all running workers to synchronize their states through a global barrier at the end of each epoch. This protocol guarantees the correctness of distributed ML algorithms [21], and many existing systems [3,25,26] adopt it as their synchronization protocol. Delayed synchronization protocols [7,27,28] have been proposed to reduce the waiting time wasted in the BSP protocol. Recently, the Asynchronous Parallel (ASP) protocol has been employed by some systems [16,6] and algorithms [29,30,31] to speed up distributed ML. Due to the error tolerance property of ML algorithms, these delayed synchronization protocols perform well and usually obtain better performance for problems with large training data collections.

Components of Angel

Figure 1. System architecture of Angel.

There are four types of roles in an application of Angel, including Client, Master, Worker and Server.

(i) Client is the entry point for an Angel application. It loads the configurations of the runtime environment and submits the application to Yarn. Then it waits until the application's execution completes. Users can define the parameter matrices by specifying the block size and the dimension numbers at the Client.
(ii) Master is launched by Yarn to manage the life cycle of each application, including requesting resources, launching containers and killing workers or servers. Besides that, it is also utilized to realize the different synchronization protocols between workers and to provide web pages that present the running states of workers and servers to users.
(iii) Worker acts as the computation node that executes the code written by users. Each worker reads its data from HDFS and accesses the model parameters on servers to exchange updates with the other workers.
(iv) Server acts as a distributed storage for parameters. Each server stores one or more partitions of the model parameters and provides efficient mechanisms to serve the get and update requests from workers.

Angel execution

An Angel application starts when users submit their compiled code, which contains the logic of their task, through the Client. Yarn then launches the Master to take over the running process. It requests resources from Yarn's resource manager to allocate containers for servers and workers. The model is partitioned into different partitions and the meta information of
all partitions is stored at the master. After the initialization stage, each server fetches the meta information of its partitions from the master and allocates memory storage for these partitions.

The training data are stored in HDFS. The Master partitions the input data into splits of balanced size and sends the meta information of the splits to the workers. Before running the ML algorithms, each Worker conducts a preprocessing phase that converts the raw data into ready-to-run linear algebra objects. The output objects are stored inside each worker. We will introduce the detailed implementa…

… the HDFS or pushed directly to the online advertising systems.

In summary, Angel is developed to solve large-scale ML problems in production environments. It provides the hybrid parallel method to accelerate the convergence speed of distributed ML algorithms. Besides, efficient parameter synchronization operations are implemented to reduce the network overheads and improve the overall performance. By automatically conducting the data pre-processing operations, Angel can be easily integrated with the whole pipeline of online applications. Angel also …
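As a rough illustration of the preprocessing step described above, the sketch below converts one raw text line in a LibSVM-like format into a sparse sample object that training code could consume directly. The SparseSample class and the input format are assumptions made for illustration, not Angel's actual reader classes.

import java.util.ArrayList;
import java.util.List;

// Sketch only: parse "label index:value index:value ..." into a sparse vector.
public class Preprocess {

    public static final class SparseSample {
        public final double label;
        public final int[] indices;
        public final double[] values;
        SparseSample(double label, int[] indices, double[] values) {
            this.label = label; this.indices = indices; this.values = values;
        }
    }

    public static SparseSample parseLine(String line) {
        String[] parts = line.trim().split("\\s+");
        double label = Double.parseDouble(parts[0]);      // first token is the label
        List<Integer> idx = new ArrayList<>();
        List<Double> val = new ArrayList<>();
        for (int i = 1; i < parts.length; i++) {
            String[] kv = parts[i].split(":");            // "featureIndex:featureValue"
            idx.add(Integer.parseInt(kv[0]));
            val.add(Double.parseDouble(kv[1]));
        }
        int[] indices = idx.stream().mapToInt(Integer::intValue).toArray();
        double[] values = val.stream().mapToDouble(Double::doubleValue).toArray();
        return new SparseSample(label, indices, values);
    }

    public static void main(String[] args) {
        SparseSample s = parseLine("1 3:0.5 17:1.2 42:0.7");
        System.out.println(s.label + " nnz=" + s.indices.length);
    }
}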
Angel employs hybrid parallelism to avoid the problems in data parallelism and model parallelism. The architecture of hybrid parallelism is presented in Fig. 2c. Angel first divides the workers into multiple groups. The training data are partitioned and assigned to the groups. Inside each group, one worker is randomly selected as the leader and the other workers are categorized as slaves. The model is replicated at the different worker groups. The leader schedules the updates of the model replica inside its group. The updates generated by different groups are merged through the global parameter servers.

Data partitioning

We view the training data as a 2D matrix with each row as a data instance. Because the methods deployed by different ML algorithms to access data vary a lot, Angel provides three different methods to partition the data matrix inside a worker group, listed as follows.

(i) Replicated: the data matrix is replicated at each slave since the update of each parameter requires the whole data matrix. For example, to update each parameter in Lasso using CD, the whole data matrix is required.
(ii) Horizontal: in ML algorithms such as LDA and MF, all fields of an instance are needed in the computation. Hence, the training data are horizontally partitioned in these applications so that each slave owns a portion of the data instances.
(iii) Vertical: using this partitioning method, each training instance is partitioned and each slave owns a vertical partition of all data instances. Linear models, such as LR and SVM, can utilize this partitioning method to conduct model parallelism.

Users can choose the partitioning method through the configuration of the Angel application. During the read phase, each worker in a group sweeps all the training instances in this group and only stores the partition which belongs to itself.

Worker execution

Inside each worker, multiple user tasks are launched to execute the user-provided function in different threads. Each user thread reads its data partition and conducts the training operations. A user thread can access the parameters at the servers through a shared module inside each worker, which is presented in Fig. 3. To eliminate duplicated parameter pulling, a lock table is maintained at the worker. Each request first tries to acquire the lock with its matrix ID and row index. If the lock has already been obtained by another thread, then the requesting thread is blocked. The row locks allow us to avoid duplicated requests.

The pulling of one row is split into multiple requests, and each request fetches one partition from the server. The pulling requests are executed by multiple pulling threads. The pulling threads locate the server from the lookup table inside the worker. Once a partition is fetched, it is inserted into a result buffer. An input merging thread merges the partitions into a complete row once all of them have been fetched. Then, all user threads which are waiting for this row are woken up to perform the computation.

The updates generated by user threads are inserted into an update buffer. An output merging thread runs to merge the updates before pushing. Once the buffer size exceeds a pre-defined threshold, the updates are pushed to the servers by the pushing threads.

Leader scheduling

The detailed model parallel implementations of different ML algorithms can vary a lot due to their …
… synchronous, the computation will be blocked owing to the network latency. We carefully study the structures of the model parameters of different ML algorithms and group them into two categories. The first type of algorithm owns a model matrix with many rows, such as MF and LDA, while the other employs a weight vector with billions of elements, like LR and SVM. We propose different pulling strategies for the two types of parameters, respectively.

Pipelining pulling

To provide an efficient parameter pulling mech…

… × (n_{d,k} + α),  (1)

where k ∈ [1, K] is a topic, v ∈ [1, V] is a word in the vocabulary, w_{d,i} denotes the ith word in document d and z_{d,i} the topic assignment to w_{d,i}. The matrices n_{d,k} and n_{v,k} are the two parameter models which are updated by the sampling operations. When performing distributed Gibbs sampling, both the training data and the matrix n_{d,k} are partitioned to workers by document. Matrix n_{d,k} is stored among the different workers while matrix n_{v,k} is shared. To perform each sampling operation for a token w_{d,i}, we just need to pull one row of n_{v,k}, where v = w_{d,i}. In other words, the parameters are pulled row by row in LDA.

Matrix factorization. Similar conclusions can also be drawn for the MF algorithms. To solve the MF optimization problem, we sweep over all known ratings and update the parameters through the following equations for each rating:

x_u ← x_u + η[(r_{ui} − x_u^T y_i) y_i − λ x_u],
y_i ← y_i + η[(r_{ui} − x_u^T y_i) x_u − λ y_i],  (2)

where x_u and y_i are the feature vectors for user u and item i, and r_{ui} is the rating value of item i given by user u. η is the learning rate while λ is the regularization parameter.

Note that the user feature matrix is partitioned by user and the item feature matrix is stored at the servers. Therefore, for each observed rating (u, i, r), the optimization algorithm needs to pull y_i from the server.

From the discussion above, we can see that the computation unit, such as sampling a token or passing a rating, requires a portion of disjoint parameters as input. Therefore, we overlap the parameter pulling operation and the computation to reduce the performance degradation due to network latency. The naive parameter fetching method, Algorithm 1, performs poorly due to the latency of remote procedure calls. Instead, Angel allows each worker to request parameters for many computation units and perform …

Algorithm 2: Pipeline Parameter Pulling
  initialization;
  Send requests for rows to servers;
  while not done do
    if Pipeline has ready parameters p then
      Sample tokens with p;
      Update n_{d,k} and n_{v,k};
    else
      Wait on the Pipeline

… passed and a blocked result queue is returned immediately. The worker randomly splits the row indices into multiple disjoint sets, where each one owns a mini-batch of indices. The worker sends the pulling requests one batch at a time and inserts the fetched rows into the result queue. The user program takes parameter rows from the result queue and conducts the computation. Since multiple tasks are launched in each worker, different tasks may request the same parameter row in one iteration. The worker eliminates duplicated requests from different tasks to reduce the network overheads.

Pruning pulling

For ML algorithms whose model consists of one vector with billions of elements, the parameter pulling …
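A minimal sketch of the pruning idea follows, assuming a hypothetical pull() call rather than Angel's real interface: before an iteration, the worker collects the feature indices that actually occur in its local data partition and requests only those entries of the weight vector.

import java.util.Set;
import java.util.TreeSet;

// Sketch only: pull just the weight-vector entries that local data touches.
public class PruningPull {

    // Collect the distinct non-zero feature indices of the local partition.
    static Set<Integer> neededIndices(int[][] featureIndicesPerSample) {
        Set<Integer> needed = new TreeSet<>();
        for (int[] sample : featureIndicesPerSample) {
            for (int idx : sample) {
                needed.add(idx);
            }
        }
        return needed;
    }

    // Hypothetical pull: only the requested entries are ever transferred.
    static double[] pull(Set<Integer> indices) {
        // A real implementation would issue RPCs to the parameter servers here.
        return new double[indices.size()];
    }

    public static void main(String[] args) {
        int[][] localData = { {3, 17, 42}, {3, 8}, {8, 99} };
        Set<Integer> needed = neededIndices(localData);
        double[] weights = pull(needed);
        System.out.println("pulled " + weights.length + " weights instead of the full vector");
    }
}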
Figure 6. Dummy processing using Spark.

Within each worker, the user program accesses the training data through a reader with the following interface.

public interface Reader<KEY, VALUE> {
  boolean nextKeyValue();
  KEY getCurrentKey();
  VALUE getCurrentValue();
}

This reader provides the ability to scan the training data in a random order multiple times without concerning the user with the implementation details of data …

Fault tolerance

When running a distributed job in a practical environment, failures are common and may happen for various reasons, such as hardware errors or interference …

… showed in Fig. 7 in Angel. Users can register multi… by other tasks. Angel provides the ability to recover when servers and workers come across failures.

Servers store the values of the model parameters, which are frequently modified by workers. Hence, we choose to write a snapshot of each parameter partition periodically into persistent storage instead of writing update logs. Besides the parameter values, other systems [5,6] also maintain the clock information on the servers. In these systems, an inconsistent clock view would exist among different servers. Although better performance can be obtained for training, it is hard to recover to a consistent state when a job …

Experimental setup

The experiments are conducted on a cluster with 100 physical nodes, where each node has a 2.2 GHz
CPU with 12 cores, 64 GB memory and 12 × 2 TB SATA hard disks, and the nodes are connected by a 1 Gbps network. The version of Yarn used for resource management is 2.2.0. Unless otherwise stated, the maximum size of the JVM heap is 10 GB. To further demonstrate the scalability of Angel when dealing with large datasets, we conduct evaluations of Angel on a shared inner-company cluster with around 5000 physical nodes.

ML algorithms. Currently, Angel supports efficient implementations of a wide range of ML algorithms to satisfy the requirements of various applications, especially aiming at large data and large models. For classification tasks, Angel provides three algorithms, including LR, SVM and LassoLR, while regression tasks can be solved with Linear Regression and Lasso. Besides that, users can utilize tree ensemble methods, like GBDT, to handle classification and regression tasks. Kmeans and LDA are supported in Angel to handle clustering problems. Moreover, MF is also integrated in Angel to deal with recommendation tasks.

Here, we adopt four ML algorithms to compare Angel with other systems in the following experiments, namely LR, LDA, MF and Kmeans, because they have different parameter access patterns. LR is a linear algorithm whose model has only one shared row, while Kmeans has one shared matrix with multiple rows. The models of both LDA and MF are composed of multiple matrices. An LDA model contains two shared matrices and one partitioned matrix, while MF uses one shared matrix and one partitioned matrix.

Dataset. Besides the public datasets, including netflix, kdd2010 and pubmed, we also use the logs of Tencent applications to conduct the experiments. The kdd2010 dataset is employed to evaluate the performance of different systems on LR. The datasets CTR1-4 record the click-through rates of users, and they are used to evaluate the performance of scalability, data pre-processing and fault tolerance on Angel. The user behavior dataset wxjd, as well as netflix, is adopted in the MF experiments. The document corpus datasets pubmed, corpus1 and corpus2 are used to evaluate the performance of LDA, where corpus1 and corpus2 are sets of web pages crawled by crawlers in Tencent. The statistics of the datasets are listed in Table 1. The sparse size of a dataset is the size of the text format in which only non-zero elements are stored. The #nnz value counts the number of non-zero elements in the data matrix.

End-to-end comparison

We first demonstrate the efficiency of Angel by comparing it with Spark, Petuum and TensorFlow. Spark follows the paradigm of MapReduce, but is specially optimized for iterative tasks. Spark is now one of the most popular general-purpose big data processing systems in industrial companies. Due to its flexible abstraction of distributed datasets and its cache mechanism, Spark is widely adopted to process ML applications. Here, we use the algorithm implementations from MLlib, the ML library built on top of Spark, to conduct the comparison. Petuum is an academic prototype system which provides Bösen and Strads to support data parallelism and model parallelism, respectively. Both Spark and Petuum provide implementations of the four ML algorithms adopted in the experiments. TensorFlow is an open-sourced distributed system designed for deep learning. Due to the generality of its programming interface, TensorFlow can also support linear models, such as LR. Therefore, we conduct the comparison with TensorFlow using LR. We use the latest released versions of
Petuum and TensorFlow in the experiments, and the version of Spark is 2.0.2.

Effects on LDA

Comparison with Strads. We first compare Angel with Strads using LDA workloads. The topic number is set to 1024 in the experiments. Strads deploys model parallelism for sparse Gibbs sampling while Angel implements LDA with hybrid parallelism. We vary the number of machines for both Strads and Angel to evaluate their performance under different environment configurations. In Angel's implementation, each worker group is assigned eight workers to conduct model parallel sampling. We monitor the log-likelihood value during the execution and record the time when the log-likelihood value reaches −7.8 × 10^9.

The results are illustrated in Fig. 8a. We can see that Angel exhibits better performance than Strads when the worker number is 160. It costs Strads 430 s with 160 workers to reach the required value, while the time needed by Angel is only 338 s.

When more workers are involved in the execution, the running time of both Strads and Angel decreases, but Angel benefits more from the additional resources. Though Angel needs four more epochs when running with 240 workers, the running time of each epoch is reduced as well. The model parallel method enables Strads to converge with fewer epochs, but the achieved speedup is damaged by the global barriers. Finally, the speedup of Angel is 1.46 when the worker number increases from 160 to 240, while it is only 1.25 for Strads. As a consequence, Angel can obtain a 1.6× performance improvement over Strads. There are two reasons that enable Angel to achieve faster speed than Strads. The first is that Angel employs hybrid parallelism, while the other is that in Angel the get operations are overlapped with the computation operations to reduce the network latency.

Comparison with Bösen. We then compare Angel with Bösen using the same workloads. Since the execution of Bösen always fails when using a large number of machines, we use far fewer workers in these experiments: only 40 workers, with the staleness in Bösen configured to three clocks. With the update scheduling in each group, Angel needs only 22 epochs to reach the required value, while Bösen needs 33 epochs. With the carefully optimized model synchronization, the performance of Angel is further improved. From Fig. 8d, we can see that Angel can obtain a 10× performance gain over Bösen.

Comparison with Spark. We also conduct the experiment of LDA using Spark on the pubmed dataset. However, the performance of Spark is too slow: it takes more than 1 h to finish one
iteration. Spark abstracts the computation of LDA as a graph and uses GraphX to perform the computation. In the implementation of GraphX, each vertex sends its topic distribution to its neighbors to perform the inference, which generates massive communication cost, ≈300 GB of shuffle data, in each iteration. On the contrary, only the word-topic matrix needs to be shared and transferred through the network in the implementation of Angel, which can significantly reduce the communication cost.

Effects on LR

… one group to realize data parallelism and model parallelism on Angel. Specifically, when the number of workers in one group is set to 1, the implementation of Angel is exactly data parallelism, while if all workers belong to the same group, they are coordinated to train through model parallelism. To demonstrate the effectiveness of the hybrid parallel method, we utilize both LR and LDA to obtain the experimental results.

Figure 10a gives the convergence rate over epochs on LDA, while Fig. 10b presents the convergence rate over running time. We employ pubmed …

… groups for Angel with each group containing eight workers. In the implementation of Bösen, it adds the updates from all workers to the global parameters instead of adding the average value of the updates. Therefore, the objective value might increase after the first synchronization operation. We have tried our best to adjust the learning rate for Bösen; it still takes more than 10 000 s to achieve the specified objective value. Hence, Angel is at least 10× faster than Bösen.

Impact of storage level

There are three different memory storage levels for the training data that reside at the workers during the training phase. Figure 11a presents the effects of these three levels on the convergence time of the LR algorithm on the CTR2 dataset with 20 workers under different memory budgets. When the memory budget is limited to 4 GB, the worker task using the memory-only storage level fails because of an OutOfMemory exception, while tasks can still complete their execution once the level is set to disk or memory-disk.
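The following sketch illustrates what a combined memory-and-disk level could look like: blocks stay on the heap until a byte budget is exhausted and are spilled to local files afterwards. The class, the byte budget and the temp-file layout are illustrative assumptions, not Angel's storage implementation.

import java.io.*;
import java.util.ArrayList;
import java.util.List;

// Sketch only: keep training blocks in memory up to a budget, spill the rest.
public class MemoryDiskStore {
    private final long memoryBudgetBytes;
    private long usedBytes = 0;
    private final List<byte[]> inMemory = new ArrayList<>();
    private final List<File> onDisk = new ArrayList<>();

    public MemoryDiskStore(long memoryBudgetBytes) {
        this.memoryBudgetBytes = memoryBudgetBytes;
    }

    public void addBlock(byte[] block) throws IOException {
        if (usedBytes + block.length <= memoryBudgetBytes) {
            inMemory.add(block);                            // within the memory budget
            usedBytes += block.length;
        } else {
            File f = File.createTempFile("block", ".bin");  // spill to local disk
            try (FileOutputStream out = new FileOutputStream(f)) {
                out.write(block);
            }
            onDisk.add(f);
        }
    }

    public String stats() {
        return inMemory.size() + " blocks in memory, " + onDisk.size() + " on disk";
    }

    public static void main(String[] args) throws IOException {
        MemoryDiskStore store = new MemoryDiskStore(1024);
        for (int i = 0; i < 5; i++) store.addBlock(new byte[512]);
        System.out.println(store.stats());
    }
}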
Figure 10. Performance comparison between data parallel, model parallel and hybrid parallel: (c) LR: epoch to convergence; (d) LR: time to convergence.
… the model, the naive implementation incurs heavy network overheads. The pipelined execution method issues network requests in batches and overlaps the parameter pulling and the computing phases. By reducing the unnecessary waiting time, the performance degradation caused by network operations is reduced as well.

We conduct experiments to evaluate the effects of pipelined pulling with the MF programs running over the netflix dataset. We measure the running time per iteration with different pulling methods. Since a larger value of the rank leads to more data transferred through the network, we also vary the rank values in the MF programs.
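A minimal sketch of the pipelined pulling pattern follows, assuming a stand-in fetchRow() instead of real server RPCs: a fetch thread fills a bounded queue with parameter rows while the compute thread drains it, so network waits overlap with computation. The names are illustrative and not Angel's implementation.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch only: overlap parameter fetching and computation with a bounded queue.
public class PipelinedPulling {

    static double[] fetchRow(int rowId) {
        // Stand-in for a remote pull from the parameter servers.
        return new double[]{rowId, rowId * 0.5};
    }

    public static void main(String[] args) throws InterruptedException {
        int numRows = 8;
        BlockingQueue<double[]> ready = new ArrayBlockingQueue<>(4);

        Thread fetcher = new Thread(() -> {
            try {
                for (int r = 0; r < numRows; r++) {
                    ready.put(fetchRow(r));   // blocks when the pipeline is full
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        fetcher.start();

        double sum = 0;
        for (int r = 0; r < numRows; r++) {
            double[] row = ready.take();      // compute as soon as a row is ready
            sum += row[1];
        }
        fetcher.join();
        System.out.println("processed " + numRows + " rows, sum=" + sum);
    }
}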
The experimental results presented in Fig. 12a are identical to our expectation. Because the pipelined method makes better use of the bandwidth, it helps to improve the performance of the MF programs. With increasing model size, the programs using pipelined pulling obtain larger performance gains over the naive ones.

Impact of pruning pulling

By pulling only those parameters needed in the computation, the pruning pulling method can reduce the number of pulled parameters. We run the LR algorithm against CTR4 to evaluate the effect of pruning pulling. In the experiments, the number of parameters exceeds 100 million, and the programs are executed over 100 workers.

Figure 12b breaks down the time used to pull parameters and compute updates during the training phase and the test phase. Compared to the naive method, which pulls all parameters to the workers, the pruning method avoids pulling those parameters that are not needed in the calculation of the gradients. Hence, the pruning method can efficiently reduce the time needed for pulling operations. For each iteration, the pruning method costs only 28.5 s for the pulling operation, while the naive method takes 59.2 s in the training phase. The reduction in the time of pulling operations helps accelerate the convergence rate.

Figure 12c shows the objective values over time with different pulling methods. By avoiding pulling redundant parameters, the pruning method takes only 1844 s to complete 20 epochs while the naive one requires 3227 s.

Impact of synchronization frequency

Here, we show the impact of different synchronization frequencies on the convergence speed of the LDA algorithm. Three different synchronization methods are adopted in the experiments. The auto method, which is described in Section 'Pushing updates', employs the automatic update pushing method provided by Angel. The 10spe method propagates local updates to the servers 10 times every epoch, while the 1spe method only pushes the updates at the end of each epoch. From Fig. 13a we can see that the auto method obtains statistical efficiency comparable to 10spe. They both outperform the convergence speed of 1spe in terms of epoch numbers since they propagate their updates to other workers more frequently. But when it comes to the running time, the auto method performs the best. This is because each update pushing operation incurs extra overheads, such as update merging, splitting and the network cost. With more frequent synchronization, more overheads are introduced and the benefits brought by frequent synchronization become marginal. Because the automatic sync method utilizes a best-effort strategy to perform the pushing operations, it can well balance the frequency and the cost of model synchronization, avoiding the performance degradation due to unnecessary synchronization operations.

Fault tolerance

In this subsection, we evaluate the recovery performance of Angel. We run the LR algorithm on both the CTR1 and CTR2 datasets. Figure 14a shows the recovery latency when an Angel application comes across failures. We randomly kill a server node at iteration 1 and iteration 11. Compared to the other iterations without failures, it costs about 6 s more to complete the two iterations where the failures show up. That is, the master node takes only 6 s to detect the crash of the server node, reallocate a container for a new node and recover the state of the failed server from its snapshot.
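A simplified sketch of snapshot-based recovery for one server partition is given below; the file format and the snapshot period are illustrative assumptions, not Angel's actual persistence code. The parameter values are periodically serialized to persistent storage, and a restarted server simply reloads the latest snapshot instead of replaying update logs.

import java.io.*;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch only: periodic snapshots of a parameter partition and restore on restart.
public class PartitionSnapshot {
    private final double[] parameters;
    private final File snapshotFile;

    public PartitionSnapshot(int size, File snapshotFile) {
        this.parameters = new double[size];
        this.snapshotFile = snapshotFile;
    }

    // Write the current parameter values to persistent storage.
    public synchronized void write() throws IOException {
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(snapshotFile))) {
            for (double v : parameters) out.writeDouble(v);
        }
    }

    // Reload the latest snapshot after a crash and restart.
    public synchronized void restore() throws IOException {
        try (DataInputStream in = new DataInputStream(new FileInputStream(snapshotFile))) {
            for (int i = 0; i < parameters.length; i++) parameters[i] = in.readDouble();
        }
    }

    // Schedule a snapshot every periodSeconds seconds.
    public void startPeriodicSnapshots(long periodSeconds) {
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
        ses.scheduleAtFixedRate(() -> {
            try { write(); } catch (IOException e) { e.printStackTrace(); }
        }, periodSeconds, periodSeconds, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws IOException {
        PartitionSnapshot part = new PartitionSnapshot(1024, new File("partition-0.snapshot"));
        part.write();
        part.restore();
        System.out.println("snapshot written and restored");
    }
}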
Figure 15. Scale results with more workers and large datasets: (a) LDA corpus1 #topics = 8k and (b) LDA corpus2 #topics = 8k.
decreased to 5000 s when the number of workers is increased to 1200. For corpus2, Angel can finish the training within about 3 h using 1200 workers, while it can achieve convergence in ≈7000 s when using 2000 workers. There are ≈30 billion and ≈40 billion parameters stored on the servers when training datasets corpus1 and corpus2, respectively. These two experiments prove that Angel can run in a real production environment with thousands of workers to handle ML applications that require billions of parameters.

We also try to run this experiment with Petuum and Spark. However, Spark always fails since it lacks an efficient mechanism to process large models, while Petuum lacks the ability to automatically partition the training dataset. It requires a Network File System to share the input dataset among all workers. Otherwise, the user must manually copy the input dataset to every worker, which is very time consuming for large datasets and impossible for a production cluster managed by Yarn. Another problem for Strads is that the worker will fail for large data since it reads all the data into memory. In summary, Petuum cannot be deployed in a production environment for industry-scale applications.

CONCLUSION

In this paper, we proposed a new general-purpose distributed ML system, named Angel, which aims at solving the large-scale ML problems faced by big data analytic applications. Angel employs hybrid parallelism to accelerate the performance of ML algorithms. The pulling of parameters and the pushing of updates are fully optimized in Angel to reduce the network overheads. Angel was deployed in a production cluster and provisioned with efficient mechanisms to achieve fault-tolerant execution at scale. The comprehensive experimental results demonstrated the superiority of Angel compared to Spark and Petuum.

FUNDING

This work was supported by the National Natural Science Foundation of China (61572039), the National Basic Research Program of China (2014CB340405), Shenzhen Government Research Project (JCYJ20151014093505032) and Tencent Research Grant (PKU).

Conflict of interest statement. None declared.

REFERENCES

1. Huang Y, Cui B and Zhang W et al. TencentRec: real-time stream recommendation in practice. In: Sellis TK, Davidson SB and Ives ZG (eds). Proceedings of SIGMOD Conference 2015. Melbourne, Victoria, Australia: ACM, 2015, 227–38.
2. Huang Y, Cui B and Jiang J et al. Real-time video recommendation exploration. In: Özcan F, Koutrika G and Madden S (eds). Proceedings of SIGMOD Conference 2016. San Francisco, CA, USA: ACM, 2016, 35–46.
3. Zaharia M, Chowdhury M and Franklin MJ et al. Spark: cluster computing with working sets. In: Nahum EM and Xu D (eds). Proceedings of HotCloud 2010. Boston, MA, USA: USENIX Association, 2010.
4. Zaharia M, Chowdhury M and Das T et al. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Gribble SD and Katabi D (eds). Proceedings of NSDI Conference 2012. San Jose, CA, USA: USENIX Association, 2012, 15–28.
5. Xing EP, Ho Q and Dai W et al. Petuum: a new platform for distributed machine learning on big data. IEEE Trans Big Data 2015; 1: 49–67.
6. Li M, Andersen DG and Park JW et al. Scaling distributed machine learning with the parameter server. In: Flinn J and Levy H (eds). Proceedings of OSDI Conference 2014. Broomfield, CO, USA: USENIX Association, 2014, 583–98.
7. Ho Q, Cipar J and Cui H et al. More effective distributed ML via a stale synchronous parallel parameter server. In: Burges CJC, Bottou L and Ghahramani Z et al. (eds). Proceedings of NIPS Conference 2013. Lake Tahoe, Nevada, United States, 2013, 1223–31.
8. Kim JK, Ho Q and Lee S et al. Strads: a distributed framework for scheduled model parallel machine learning. In: Cadar C, Pietzuch PR and Keeton K et al. (eds). Proceedings of EuroSys Conference 2016. London, United Kingdom: ACM, 2016, 5:1–5:16.
9. Lee S, Kim JK and Zheng X et al. On model parallelization and scheduling strategies for distributed machine learning. In: Ghahramani Z, Welling M and Cortes C et al. (eds). Proceedings of NIPS Conference 2014. Montreal, Quebec, Canada, 2014, 2834–42.
22. Bradley JK, Kyrola A and Bickson D et al. Parallel coordinate descent for L1-regularized loss minimization. In: Getoor L and Scheffer T (eds). Proceedings of ICML Conference 2011. Bellevue, Washington, USA: Omnipress, 2011, 321–8.
23. Newman D, Smyth P and Welling M et al. Distributed inference for latent Dirichlet allocation. In: Platt JC, Koller D and Singer Y et al. (eds). Proceedings of NIPS Conference 2007. Vancouver, British Columbia, Canada: Curran Associates, Inc., 1081–8.
24. Valiant LG. A bridging model for parallel computation. Comm ACM 1990; 33: 103–11.
25. Dean J and Ghemawat S. MapReduce: simplified data processing on large clusters. Comm ACM 2008; 51: 107–13.
26. Malewicz G, Austern MH and Bik AJ et al. Pregel: a system for large-scale