1 Introduction
By abstracting away the provisioning of compute resources, serverless computing removes much of the complexity of using the cloud. This fairly recent paradigm was started by services such as Google BigQuery [76] and AWS Glue [15], and it has evolved into Function-as-a-Service (FaaS) computing platforms, such as AWS Lambda and Apache OpenWhisk. These platforms allow users to deploy a cloud function and its dependencies. Once deployed, the function is fully managed by the platform provider, which executes it on demand and at scale in a datacenter. Cloud functions are billed at sub-second granularity, and only the time they execute is charged to the user.
Current practices show that serverless computing works well for applications that require a small amount of storage and memory, due to the operational limits set by the providers (see, e.g., AWS Lambda [12]). However, there are further limitations. While cloud functions can initiate outgoing network connections, they cannot directly communicate with each other, and they have little bandwidth compared to a regular virtual machine [24, 101]. This is because the model was originally designed to execute event-driven functions in response to user actions or changes in the storage tier (e.g., uploading a file to Amazon S3 [11]). Despite these constraints, serverless computing applies to many areas. Recent works show that this paradigm can be used to process big data [55, 80, 84], encode videos [35], and perform linear algebra [88] and Monte Carlo simulations [52].
Challenges. All these pioneering works prove that serverless computing can escape its initial area of usage and expand to traditional computing applications. However, programming some of these tasks still faces fundamental challenges. Although the list is too long to recount here, convincing cases of such ill-suited applications are distributed stateful computations, such as machine learning (ML) algorithms. Even an imperative implementation of \(k\)-means [66] raises several issues: first, the need to efficiently handle a globally-shared state at fine granularity (the cluster centroids); second, the need to globally synchronize the cloud functions, so that the algorithm can correctly proceed to the next iteration; and finally, the requirement that the shared state survives system failures.
No serverless system currently addresses all these issues effectively. First, since functions cannot communicate directly with one another, the prevalent practice for sharing state across functions is to use remote storage. For instance, serverless frameworks such as PyWren and NumPyWren [88] use highly-scalable object storage services to transfer state between cloud functions. Since object storage is too slow to share short-lived intermediate state in serverless applications [62], some recent works use faster storage solutions. This is the path taken by Locus [80], which proposes to combine fast, in-memory storage instances with slow storage to scale shuffling operations in MapReduce. However, with all the shared state transiting through storage, one of the major limitations of current serverless systems is the lack of support for handling mutable state at a fine granularity (e.g., to efficiently aggregate small granules of updates). This concern has been recognized in various works [24, 56], but this type of fast, enriched storage layer for serverless computing is not available today in the cloud, leaving fine-grained state sharing as an open issue.
Similarly, FaaS orchestration services (such as AWS Step Functions [14] or OpenWhisk Composer [36]) offer limited capabilities to synchronize cloud functions [39, 56]. They provide no abstraction to signal a function when a condition is fulfilled, or for multiple functions to coordinate, e.g., in order to guarantee data consistency, or to ensure joint progress to the next stage of computation. Of course, such fine-grained synchronization should also be low-latency so as not to significantly slow down the application. Existing stand-alone notification services, such as AWS SNS [20] and AWS SQS [40], add significant latency, sometimes hundreds of milliseconds. This lack of efficient synchronization mechanisms means that each serverless framework needs to develop its own solutions. For instance, PyWren enforces the coordination of the map and reduce phases through object storage, while ExCamera has built a notification system using a long-running VM-based rendezvous server. As of today, there is no general way to let multiple functions coordinate through abstractions hand-crafted by users, so that fine-grained synchronization can truly be achieved.
Contributions. To overcome the aforementioned issues, we propose Crucial, a framework for the development of stateful distributed applications in serverless environments. The base abstraction of Crucial is the cloud thread, which maps a thread to the invocation of a cloud function. Cloud threads manipulate global shared state stored in the distributed shared objects (DSO) layer. To ease data sharing between cloud functions, DSO provides out-of-the-box strong consistency guarantees. The layer also implements fine-grained synchronization, such as collectives, to coordinate the functions. Objects stored in DSO can be either ephemeral or persistent, in which case they are passivated on durable storage. DSO is implemented with the help of state machine replication and executes atop an efficient disaggregated in-memory data store. Cloud threads can run atop any standard FaaS platform.
The programming model of Crucial is quite simple, offering conventional multi-threaded abstractions to the programmer. With the help of a few annotations and constructs, single-machine multi-threaded stateful programs can execute as cloud functions. In particular, since the global state is manipulated as remote shared objects, the interface for mutable state management becomes virtually unlimited, constrained only by the expressiveness of the programming language (Java in our case).
Our evaluation shows that Crucial can scale traditional parallel jobs, such as Monte Carlo simulations, to hundreds of workers using basic code abstractions. For applications that require fine-grained updates, like ML tasks, our system can rival, and even outperform, Apache Spark running on a dedicated cluster. We also establish that an application ported to serverless with Crucial achieves similar performance to a multi-threaded solution running on a dedicated high-end server.
We claim the following novel contributions:
– We provide the first concrete evidence that stateful applications with needs for fine-grained data sharing and synchronization can be efficiently built using stateless cloud functions and a disaggregated shared objects layer.
– We design Crucial, a system for the development and execution of stateful serverless applications. Its simple interface offers fine-grained semantics for both mutable state and synchronization. Crucial is open source and freely available online [6].
– We show that Crucial is a convenient tool to write serverless-native applications, or to port legacy ones. In particular, we describe a methodology to port traditional single-machine applications to serverless.
– Crucial is suited for many applications such as traditional parallel computations, ML algorithms, and complex concurrency tasks. We show that a Monte Carlo simulation and a Mandelbrot computation can easily scale over on-demand serverless resources. Using an extensive evaluation of the \(k\)-means and logistic regression algorithms over a 100 GB dataset, we show that Crucial can lead to an \(18\%\)–\(40\%\) performance improvement over Apache Spark running on dedicated instances at comparable cost. Crucial is also within \(8\%\) of the completion time of the Santa Claus problem running on a local machine.
– Using Crucial, we port to serverless part of Smile [65], a state-of-the-art multi-threaded ML library. The portage impacts less than \(4\%\) of the original code base. It brings elasticity and on-demand capabilities to a traditional single-machine program. With 200 cloud threads, the random forest classification algorithm ported to serverless is up to \(30\%\) faster than a 4-CPU, 160-thread dedicated server solution.
The remainder of the article is structured as follows: Section 2 provides a general background and motivation for this work. We explain Crucial’s programming model in Section 3 and describe its design in Section 4. Section 5 covers implementation details. The evaluation is presented in Section 6, where we validate the effectiveness of Crucial for fine-grained state management and synchronization in serverless environments. We review related work in Section 7 before closing in Section 8.
4 System Design
Figure
1 presents the overall architecture of
Crucial. In what follows, we detail its components and describe the lifecycle of an application in our system.
Crucial encompasses three main components (from left to right in Figure
1): the client application; the FaaS computing layer that runs the cloud threads; and the DSO layer that stores the shared objects. A client application differs from a regular JVM process in two aspects: threads are executed as cloud functions, and they access shared data using the DSO layer. Moreover,
Crucial applications may also rely on external cloud services, such as object storage to fetch input data (not modeled in Figure
1).
4.1 The Distributed Shared Objects Layer
Each object in the DSO layer is uniquely identified by a reference. Fine-grained updates to the shared state are implemented as methods of these objects. Given an object of type \(T\), the reference to this object is \((T, k)\), where \(k\) is either the name of the annotated object field or the value of the parameter \(key\) in the annotation @Shared(key=k). When a cloud thread accesses an object, it uses its reference to remotely invoke the appropriate method.
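To make this concrete, the following sketch shows how a user-defined class could expose a shared object; the annotation and the \((T, k)\) reference scheme come from the description above, while the AtomicCounter type, its methods, and the surrounding class are illustrative assumptions rather than Crucial's actual API.

```java
import java.io.Serializable;

// Sketch only: AtomicCounter and its increment() method are assumed types;
// @Shared is Crucial's annotation as described in the text.
public class MonteCarloTask implements Runnable, Serializable {

    // DSO reference of this object: (AtomicCounter, "hits").
    // Without key, the field name would be used, i.e., (AtomicCounter, "counter").
    @Shared(key = "hits")
    private AtomicCounter counter = new AtomicCounter();

    @Override
    public void run() {
        // Intercepted by a local proxy and executed remotely on the
        // DSO node responsible for the reference (AtomicCounter, "hits").
        counter.increment();
    }
}
```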
Crucial constructs the DSO layer using consistent hashing [
59], similarly to Cassandra [
63]. Each storage node knows the full storage layer membership and thus the mapping from data to node. The location of a shared object
\(o\) is determined by hashing the reference
\((T, k)\) of
\(o\). This offers the following usual benefits: (1) No broadcast is necessary to locate an object; (2) Disjoint-access parallelism [
53] can be exploited; and (3) Service interruption is minimal in the event of server addition and removal. The latter property is useful for persistent objects, as detailed next.
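For illustration, a minimal consistent-hashing lookup over a ring of storage nodes could be written as follows; this is a generic sketch of the technique with an assumed hash function, not Crucial's actual placement code.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Generic consistent-hashing sketch: maps an object reference (type, key) to a
// storage node without any broadcast. Virtual nodes smooth the distribution.
public class ConsistentHashRing {
    private final SortedMap<Integer, String> ring = new TreeMap<>();
    private static final int VIRTUAL_NODES = 128;

    public void addNode(String node) {
        for (int i = 0; i < VIRTUAL_NODES; i++)
            ring.put(hash(node + "#" + i), node);
    }

    public void removeNode(String node) {
        for (int i = 0; i < VIRTUAL_NODES; i++)
            ring.remove(hash(node + "#" + i));
    }

    // Locate the node holding the object referenced by (type, key).
    public String locate(String type, String key) {
        int h = hash(type + "/" + key);
        // First node clockwise from the hash; wrap around to the ring start if needed.
        SortedMap<Integer, String> tail = ring.tailMap(h);
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    private static int hash(String s) {
        return s.hashCode() & 0x7fffffff; // non-negative position on the ring
    }
}
```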
Persistence. One interesting aspect of Crucial is that it can ensure durability of the shared state. This property is appealing, for instance, to support the different phases of a machine learning workflow (training and inference). Objects marked as persistent are replicated \({\it rf}\) (replication factor) times in the DSO layer. They reside in memory to ensure sub-millisecond read/write latency and can be passivated to stable storage using standard mechanisms (marshalling). When a cloud thread accesses a shared object, it contacts one of the server nodes. The operation is then forwarded to the actual replicas storing the object. Each replica executes the incoming call, and one of them sends the result back to the caller. Notice that for ephemeral—non-persistent—objects, \({\it rf}\) is 1.
Consistency.
Crucial provides linearizable objects and programmers can reason about interleaving as in the shared-memory case. This greatly simplifies the writing of stateful serverless applications. For persistent objects, consistency across replicas is maintained with the help of
state machine replication (
SMR) [
87]. To handle membership changes, the DSO layer relies on a variation of virtual synchrony [
27]. Virtual synchrony provides a totally-ordered set of views to the server nodes. In a given view, for some object
\(x\), the operations accessing
\(x\) are sent using total order multicast. The replicas of
\(x\) deliver these operations in a total order and apply them on their local copy of
\(x\) according to this order. A distinct replica (primary) is in charge of sending back the result to the caller. When a membership change occurs, the nodes re-balance data according to the new view. Appendix
A provides a full pseudo-code of this construction together with a proof of correctness.
4.2 Fast Aggregates Through Remote Procedure Call
As indicated in Section 2, stateful applications aggregate and combine small granules of data (e.g., the training phase of an ML algorithm). Unfortunately, cloud functions are not network-addressable and run separately from data. As a consequence, these applications are routinely left with no other choice but to “ship data to code”. This is known as one of the biggest downsides of FaaS platforms [44].
To illustrate this point, consider an AllReduce operation where \(N\) cloud functions need to aggregate their results by applying some commutative and associative operator \(f\) (e.g., a sum). To achieve this, each function first writes its local result in the storage layer. Then, the functions await that their peers do the same, fetch the \(N\) results, and apply \(f\) sequentially. This algorithm is expensive and entails a communication cost of \(N^2\) messages with the storage layer.
Crucial fully resolves this anti-pattern with minimal efforts from the programmer. Complex computations are implemented as object methods in DSO and called by the cloud functions where appropriate. Going back to the above example, each function simply calls
\(f(r)\) on the shared object, where
\(r\) is its local result. This is for instance the case at line 8 in Listing
4 with the method
\({\texttt {counter.addAndGet}}\). With this approach, communication complexity is reduced to
\(\mathcal {O}(N)\) messages with the storage layer.
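As a sketch of this pattern (the SumAllReduce class below is a hypothetical user-defined shared object, not part of Crucial's library), the aggregation can be captured by a single DSO object whose method both applies \(f\) and blocks until all \(N\) contributions have arrived:

```java
import java.io.Serializable;

// Hypothetical shared object for an AllReduce-style sum over N cloud functions.
// Stored in the DSO layer, so every method call runs remotely and atomically.
public class SumAllReduce implements Serializable {
    private int participants;
    private int received = 0;
    private double total = 0.0;

    public SumAllReduce() {}                                   // empty constructor for DSO
    public SumAllReduce(int participants) { this.participants = participants; }

    // Each function calls reduce(r) exactly once with its local result r:
    // O(N) messages to the storage layer instead of O(N^2).
    public synchronized double reduce(double r) throws InterruptedException {
        total += r;
        received++;
        if (received == participants) {
            notifyAll();                   // last contributor releases the waiting callers
        } else {
            while (received < participants) {
                wait();                    // block until all peers have contributed
            }
        }
        return total;
    }
}
```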
We exploit this key feature of
Crucial in our serverless implementation of several ML algorithms (e.g.,
\(k\)-means, linear regression, and random forest). Its performance benefits are detailed in Section
6.2.
4.3 Lifecycle of an Application
The lifecycle of a Crucial application is similar to that of a standard multi-threaded Java one. Every time a CloudThread is started, a Java thread (i.e., an instance of java.lang.Thread) is spawned on the client. This thread pushes the Runnable code attached to the CloudThread to a generic function in the FaaS platform. Then, it waits for the result of the computation before it returns.
Accesses to some shared object of type
T at cloud threads (or at the client) are mediated by a proxy. This proxy is instantiated when a call to “
new T()” occurs, and either the newly created object of type
T belongs to
Crucial’s library, or it is tagged
@Shared. As an example, consider the counter used in Listing
1. When an instance of
PiEstimator is spawned, the field
counter is created. The “
new” statement is intercepted and a local proxy for the counter is instantiated to mediate calls to the remote object hosted in the DSO layer. If this object does not exist in the DSO layer, it is instantiated using the constructor defined at line 5. From thereon, any call to
addAndGet (line 15) is pushed to the DSO layer. These calls are delivered in total order to the object replicas where they are applied before sending back a response value to the caller.
The Java thread remains blocked until the cloud function terminates. This behavior gives cloud threads the appearance of conventional threads, minimizing code changes and allowing the use of the join() method at the client to establish synchronization points (e.g., the fork/join pattern). It must be noted, however, that since cloud functions cannot be canceled or paused, the analogy is not complete. If a failure occurs in a remote cloud function, the error is propagated back to the client application for further processing.
The case of the ServerlessExecutorService builds on the same idea as CloudThread. A standard Java thread pool is used internally to manage the execution of all tasks. In the case of a callable task, the result is accessible to the caller in a Future object.
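A hedged usage sketch of both abstractions follows; the constructors shown and the MonteCarloTask class are illustrative assumptions, while start(), join(), submit(), and Future follow the conventional java.lang.Thread and java.util.concurrent interfaces mimicked by Crucial.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Future;

// Sketch only: constructors are assumed; MonteCarloTask is an illustrative Runnable.
public class Launcher {
    public static void main(String[] args) throws Exception {
        // Fork/join with cloud threads: start() pushes the Runnable to the FaaS
        // platform, join() blocks until the corresponding invocation returns.
        List<CloudThread> threads = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            CloudThread t = new CloudThread(new MonteCarloTask());
            t.start();
            threads.add(t);
        }
        for (CloudThread t : threads) {
            t.join();
        }

        // Same pattern through the executor interface (implements ExecutorService).
        ServerlessExecutorService executor = new ServerlessExecutorService();
        Future<?> done = executor.submit(new MonteCarloTask()); // runs as a cloud function
        done.get();      // empty result for a Runnable, unless an error is propagated
        executor.shutdown();
    }
}
```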
4.4 Fault Tolerance
Fault tolerance in
Crucial is based on the disaggregation of the compute and storage layers. On one hand, writes to DSO can be made durable with the help of data replication. In such a case,
Crucial tolerates the joint failure of up to
\({\it rf}- 1\) servers.
On the other hand,
Crucial offers the same fault-tolerance semantics in the compute layer as the underlying FaaS platform. In AWS Lambda, this means that any failed cloud thread can be re-started and re-executed with the exact same input. Thanks to the cloud thread abstraction,
Crucial allows full control over the retry system. For instance, the user may configure how many retries are allowed and/or the time between them. If retries are permitted, the programmer should ensure that the re-execution is sound (e.g., it is idempotent). Fortunately, atomic writes in the DSO layer make this task easy to achieve. Considering the
\(k\)-means example depicted in Listing
5 (or another iterative algorithm), it simply consists of sharing an iteration counter (line 6). When a thread fails and re-starts, it fetches the iteration counter and continues its execution from thereon.
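A hedged sketch of this retry pattern is shown below; the @Shared counter type and its methods are illustrative, and only the idea of resuming from a shared iteration counter comes from the text.

```java
import java.io.Serializable;

// Sketch of an idempotent iterative task: a re-executed function reads the
// shared iteration counter and skips the iterations that already completed.
// AtomicCounter and its get()/compareAndSet() methods are assumed here.
public class IterativeWorker implements Runnable, Serializable {

    @Shared(key = "iteration")
    private AtomicCounter iteration = new AtomicCounter();   // lives in DSO, survives retries

    private static final int MAX_ITERATIONS = 10;

    @Override
    public void run() {
        // On a retry, start from the first iteration not yet marked as done.
        for (int i = iteration.get(); i < MAX_ITERATIONS; i++) {
            computeStep(i);                          // writes to DSO must be atomic
            iteration.compareAndSet(i, i + 1);       // advance once, even if retries race
        }
    }

    private void computeStep(int i) {
        // application-specific work for iteration i
    }
}
```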
5 Implementation
The implementation of
Crucial is open source and available online [
6]. It consists of around 10K SLOC, including scripts to deploy and run
Crucial applications in the cloud. The DSO layer is written atop the Infinispan in-memory data grid [
69] as a partial rewrite of the
Creson project [
96].
A
Crucial application is written in Java and uses Apache Maven to compile and manage its dependencies. It employs the abstractions listed in Table
1 and has access to scripts that automate its deployment and execution in the cloud.
To run cloud threads, our prototype implementation relies on AWS Lambda. Lambda functions are deployed with the help of a Maven plugin [
5] and invoked via the AWS Java SDK. To control the replay mechanism, calls to Lambda are synchronous. Crucial’s dependence on Lambda is limited, and the framework can execute atop a different FaaS platform with only a few changes. In Section 7.1, we discuss this platform dependency.
The ServerlessExecutorService implements the base ExecutorService interface. It accepts Callable objects and task collections. The invocation of a Callable returns a (local) Future object. This future is completed once a response from AWS Lambda is received. For Runnable tasks, the response is empty unless an error occurs. In that case, the system interprets it and throws an exception at the client machine, referencing the cause.
To create a distributed parallel
for, the programmer uses an instance of
IterativeTask (as illustrated at line 10 in Listing
3). This functional interface is similar to
java.util.function.Consumer, but limited to iteration indexes (i.e., the input parameter must be an integer). Internally, the iterative task creates a collection of Callable objects. In our current prototype, the scheduling is static and based on the number of workers and tasks given as parameters.
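The sketch below illustrates such a distributed parallel for; the IterativeTask definition here is a stand-in for Crucial's interface, and the invokeIterativeTask call is an assumed signature used only for illustration.

```java
import java.io.Serializable;

// Stand-in for Crucial's IterativeTask: a serializable consumer of integer indexes.
@FunctionalInterface
interface IterativeTask extends Serializable {
    void accept(int index);
}

class MandelbrotDriver {
    public static void main(String[] args) throws Exception {
        ServerlessExecutorService executor = new ServerlessExecutorService();
        int nWorkers = 200;
        int nRows = 30_000;

        // Each index corresponds to one row of the image.
        IterativeTask task = row -> computeRow(row);

        // Assumed helper: splits the index range [0, nRows) among nWorkers cloud
        // functions; internally, the indexes are wrapped into Callable objects.
        executor.invokeIterativeTask(task, nWorkers, 0, nRows);
        executor.shutdown();
    }

    static void computeRow(int row) {
        // render one row of the Mandelbrot projection
    }
}
```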
When an AWS Lambda function is invoked, it receives a user-defined
Runnable (or
Callable) object. The object and its content are marshalled and shipped to the remote machine, where they are re-created. Initialization parameters can be given to the constructor. As pointed out in Section
3.1, a distributed reference is sent in lieu of a shared object.
Proxies for the shared objects are woven into the code of the client application using AspectJ [
60]. In the case of user-defined objects, the aspects are applied to the annotated fields (see Section
3.1). Such objects must be serializable and they should contain an empty constructor (similarly to a JavaBean). The
jar archive containing the definition of the objects is uploaded to the DSO servers where it is dynamically loaded.
Synchronization objects (e.g., barriers, semaphores, and futures) follow the structure of their Java counterparts. They rely internally on Java monitors. When a client performs a call to a remote object, it remains blocked until the request completes. The server processes the operation with a designated thread. During the method invocation, that thread may suspend itself through a wait call on the object until another thread wakes it.
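As an illustration of this structure, a DSO-hosted cyclic barrier can be an ordinary monitor-based class; the sketch below is a generic barrier written in this style, not Crucial's actual implementation.

```java
import java.io.Serializable;

// Generic monitor-based cyclic barrier. When hosted in DSO, each await() call is a
// remote invocation whose serving thread blocks in wait() until the barrier trips.
public class SimpleCyclicBarrier implements Serializable {
    private int parties;
    private int waiting = 0;
    private int generation = 0;

    public SimpleCyclicBarrier() {}                         // empty constructor (DSO requirement)
    public SimpleCyclicBarrier(int parties) { this.parties = parties; }

    public synchronized void await() throws InterruptedException {
        int arrivalGeneration = generation;
        waiting++;
        if (waiting == parties) {
            waiting = 0;
            generation++;        // trip the barrier and start a new cycle
            notifyAll();
            return;
        }
        while (arrivalGeneration == generation) {
            wait();              // suspend the serving thread until the last party arrives
        }
    }
}
```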
State machine replication (
SMR) is implemented using Infinispan’s interceptor API. This API enables the execution of custom code during the processing of a data store operation. It follows the visitor pattern as commonly found in storage systems. Infinispan [
69] relies on JGroups [
43] for total order multicast. The current implementation uses Skeen’s algorithm [
19].
In our prototype, the deployment of the storage layer is explicitly managed (like, e.g., AWS ElastiCache). Automatic provisioning of storage resources for serverless computing remains an open issue [
24,
56], with only a couple of works appearing very recently in this area [
62,
80].
6 Evaluation
This section evaluates the performance of Crucial and its usability to program stateful serverless applications.
Outline. We first evaluate the runtime of
Crucial with a series of micro-benchmarks (Section 6.1). Then, we focus on fine-grained updates to shared mutable data (Section 6.2) and fine-grained synchronization (Section 6.3). Furthermore, we detail the (partial) portage to serverless of the Smile library [
65] (Section 6.4). Finally, we analyze the usability of our framework when writing (or porting) applications (Section 6.5).
Goal & scope. The core objective of this evaluation is to understand the benefits of
Crucial to program applications for serverless. To this end, we distinguish two types of applications: serverless-native and ported applications. Serverless-native applications are those written from scratch for a FaaS infrastructure. Ported applications are the ones that were initially single-machine applications and were later modified to execute atop FaaS. For both types of applications, our evaluation campaign aims at providing answers to the following questions:
– How easy is it to program with Crucial? In addressing this question, we specifically focus on the following applications: ML (Section 6.2.1), data analytics (Section 6.3.1), and synchronization tasks (Section 6.3.2). These applications are parallel and stateful, that is, they contain parallel components that need to update a shared state and synchronize to make progress.
– Do applications programmed with Crucial benefit from the capabilities of serverless (e.g., scalability and on-demand pricing)? (Section 6.4.2).
– How efficient is an application programmed with Crucial? For serverless-native applications, we compare Crucial to PyWren, a state-of-the-art solution for serverless programming (Section 6.2.3). We also make a comparison with Apache Spark, the de facto standard approach to program stateful cluster-based programs (Section 6.2.2). For ported applications, we compare Crucial to a scale-up approach, using a high-end server (Section 6.4).
– How costly is Crucial with respect to other solutions? (Section 6.5.3) Here we are interested both in the programming effort to code a serverless application and in its monetary cost when running atop a FaaS platform. Again, answers are provided for both serverless-native and ported applications.
Experimental setup. All the experiments are conducted in Amazon Web Services (AWS), within a Virtual Private Cloud (VPC) located in the us-east-1 region. Unless otherwise specified, we use r5.2xlarge EC2 instances for the DSO layer and 3 GB AWS Lambda functions. Experiments with concurrency over 300 cloud threads are run outside the VPC due to service limitations.
The code of the experiments presented in this section is available online [
6].
6.1 Micro-benchmarks
As depicted in Figure
1, the runtime of
Crucial consists of two components: a FaaS platform and the DSO layer. In this section, we evaluate the performance of this runtime across several micro-benchmarks.
6.1.1 Latency.
Table
2 compares the latency to access a 1 KB object sequentially in
Crucial (DSO), Redis, Infinispan, and S3. We chose Redis because it is a popular key-value store available on almost all cloud platforms, and it has been extensively used as storage substrate in prior serverless systems [
55,
62,
80]. Each function performs 30 K operations and we report the average access latency. In Table
2,
Crucial exhibits a performance similar to other in-memory systems. In particular, it is an order of magnitude faster than S3. This table also depicts the effect of object replication. When data is replicated, SMR adds an extra round-trip, doubling the latency perceived at a client. The number of replicas does not affect this behavior, as shown in the next experiment.
6.1.2 Throughput.
We measure the throughput of Crucial and compare it against Redis. For an accurate picture, replication is enabled in both systems to capture their performance under scenarios of high data availability and durability.
In this experiment, 200 cloud threads access 800 shared objects for 30 seconds. The objects are chosen at random. Each object stores an integer offering basic arithmetic operations. We consider simple and complex operations. The simple operation is a multiplication; the complex one is the sequential execution of 10 K multiplications. In Redis, these operations require several commands, which run as Lua scripts for both consistency and performance.
To replicate data, Redis uses a master-based mechanism. By default, replication is
asynchronous, so the master does not wait for a command to be processed by the replicas. Consequently, clients can observe stale data. In our experiment, to minimize inconsistencies and offer guarantees closer to
Crucial, functions issue a
WAIT command after each write [
82]. This command flushes the pending updates to the replicas before it returns.
We compare the average throughput of the two systems when the replication factor (\({\it rf}\)) of a datum varies as follows: (\({\it rf}=1\)) Both Crucial and Redis (2 shards with no replicas) are deployed over a 2-node cluster; (\({\it rf}=2\)) In the same 2-node cluster, Redis now uses one master and one replica; (\({\it rf}=3\)) We add a third node to the cluster and Redis employs one master and two replicas.
In Figure
2, “Redis
WAIT \(r\)” indicates that
\(r\) is the number of synchronously replicated copies of shared objects.
The experimental results reported in Figure
2 show that
Crucial is not sensitive to the complexity of operations. Redis is
\(50\%\) faster for simple operations because its implementation is optimized and written in C. However, for complex operations,
Crucial is almost five times faster than Redis. Again, implementation-specific details are responsible for this behavior: while Redis is single-threaded, and thus concurrent calls to the Lua scripts run sequentially,
Crucial benefits from disjoint-access parallelism [
53]. When objects are replicated, the comparison is similar. In particular, Figure
2 shows that
Crucial and Redis have close performance when Redis operates in synchronous mode.
This experiment also verifies that the performance of
Crucial is not sensitive to the number of replicas. Indeed, the throughput in Figure
2 is roughly equivalent for all values of
\({\it rf}\ge 2\). This comes from the fact that
Crucial requires a single RTT to propagate an operation to the replicas.
6.1.3 Parallelism.
We first evaluate our framework with the Monte Carlo simulation presented in Listing
1. This algorithm is embarrassingly parallel, relying on a single shared object (a counter). The simulation runs with 1–800 cloud threads and we track the total number of points computed per second. The results, presented in Figure
3(a), show that our system scales linearly and that it exhibits a
\(512\times\) speedup with 800 threads.
We further evaluate the parallelism of
Crucial with the code in Listing
3. This second experiment computes a 30K
\(\times\)30K projection of the Mandelbrot set, with (at most) 1,000 iterations per pixel. As shown in Figure
3(b), the completion time decreases from 150 seconds with 10 threads to 14.5 seconds with 200 threads: a speedup factor of
\(10.2\times\) over the 10-thread execution. This sub-linear speedup is explained by the skew in the coarse-grained row partitioning of the image, which leaves some threads idle longer than others. It also underlines a key benefit of
Crucial. If this task is run on a cluster, the cluster is billed for the entire job duration, even if some of its resources are idle. Running atop serverless resources, this implementation ensures instead that row-dependent tasks are billed for their exact duration.
Takeaways. The DSO layer of Crucial is on par with existing in-memory data stores in terms of latency and throughput. For complex operations, it significantly outperforms Redis due to data access parallelism. Crucial scales linearly to hundreds of cloud threads. Applications written with the framework benefit from the serverless provisioning and billing model to match irregularities in parallel tasks.
6.2 Fine-grained State Management
This section shows that Crucial is efficient for parallel applications that access shared state at fine granularity. We detail the implementation of two ML algorithms in the framework. These algorithms are evaluated against a single-machine solution, as well as two state-of-the-art frameworks for cluster computing (Apache Spark) and FaaS-based computation (PyWren).
6.2.1 A Serverless \(k\)-means.
Listing
5 details the code of a
\(k\)-means clustering algorithm written with
Crucial. This program computes
\(k\) clusters from a set of points across a fixed number of iterations, or until some convergence criterion is met (line 21). The algorithm is iterative, with recurring synchronization points (line 19), and it uses a small mutable shared state. Listing
5 relies on shared objects for the convergence criterion (line 4), the centroids (line 8), and a synchronization object to coordinate the iterations (line 2). At each iteration, the algorithm needs to update both the centroids and the criterion. The corresponding method calls (lines 14, 17, and 18) are executed remotely in DSO.
Figure
3(c) compares the scalability of
Crucial against two EC2 instances:
m5.2xlarge and
m5.4xlarge, with 8 and 16 vCPUs, respectively. In this experiment, the input increases proportionally to the number of threads. We measure the
scale-up, defined as
\(\mathit {scale\text{-}up}= T_1 / T_n\), where
\(T_1\) is the execution time of Listing
5 with one thread, and
\(T_n\) when using
\(n\) threads.
Accordingly,
\(\mathit {scale\text{-}up}= 1\) means a perfect linear scale-up, i.e., the increase in the number of threads keeps up with the increase in the workload size (top line in Figure
3(c)). The scale-up is sub-linear when
\(\mathit {scale\text{-}up}\lt 1\). As expected, the single-machine solution quickly degrades when the number of threads exceeds the number of cores. The solution using
Crucial is within
\(10\%\) of the optimum. For instance, with 160 threads, the scale-up factor is approximately 0.94. This lowers to 0.9 for 320 threads due to the overhead of creating the cloud threads.
6.2.2 Comparison with Spark.
Apache Spark [
104] is a state-of-the-art solution for distributed computation in a cluster. As such, it is extensively used to scale many kinds of applications in the cloud. One of them is ML training, as enabled by Spark’s MLlib [
71] library. Most ML algorithms are iterative and share a modest amount of state that requires per-iteration updates. Consequently, they are a perfect fit to assess the efficiency of fine-grained updates in
Crucial against a state-of-the-art solution. This is the case of logistic regression and
\(k\)-means clustering, which we use in this section to compare
Crucial and Spark.
Setup. For this comparison, we provide equivalent CPU resources to all competitors. In detail,
Crucial experiments are run with 80 concurrent AWS Lambda functions and one storage node. Each AWS Lambda function has 1,792 MB and 2,048 MB of memory for logistic regression and
\(k\)-means, respectively. These values are chosen to have the optimal performance at the lowest cost (see Section
6.5.3).
The DSO layer runs on an
r5.2xlarge EC2 instance. Spark experiments are run in Amazon EMR with 1 master node and 10
m5.2xlarge worker nodes (
Core nodes in EMR terminology), each having 8 vCPUs. Spark executors are configured to utilize the maximum resources possible on each node of the cluster. To improve the fairness of our comparison, the time spent loading the dataset from S3 and parsing it is excluded for both solutions. For Spark, the time to provision the cluster is not counted. Regarding
Crucial, FaaS cold starts are also excluded from measurements due to a global barrier before starting the computation.
Dataset. The input is a 100 GB dataset generated with spark-perf [
28] that contains
\(55.6\)M elements. For logistic regression, each element is labeled and contains 100 numeric features. For
\(k\)-means, each element corresponds to a 100-dimensional point. The dataset has been split into 80 equal-size partitions to ensure that all partitions are small enough to fit into the function memory. Each partition has been stored as an independent file in Amazon S3.
Logistic regression. We evaluate a
Crucial implementation of logistic regression against its counterpart available in Spark’s MLlib [
71]:
LogisticRegressionWithSGD. A key difference between the two implementations is the management of the shared state. At each iteration, Spark broadcasts the current weight coefficients, then computes and aggregates the sub-gradients in a MapReduce phase. In Crucial, the weight coefficients are shared objects. At each iteration, a cloud thread retrieves the current weights, computes the sub-gradients, updates the shared objects, and synchronizes with the other threads. Once all the partial results have been uploaded to the DSO layer, the weights are recomputed, and the threads proceed to the next iteration.
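A condensed sketch of this per-iteration loop on the Crucial side is given below; SharedWeights, SharedBarrier, and their methods are illustrative stand-ins, not the actual classes used in the experiments.

```java
// Illustrative per-worker training loop: the weights live in the DSO layer and
// sub-gradients are combined remotely, so no reduce phase is needed.
class LogisticRegressionWorker {
    void train(double[][] partition, SharedWeights weights, SharedBarrier barrier,
               int iterations, double learningRate) throws InterruptedException {
        for (int iter = 0; iter < iterations; iter++) {
            double[] w = weights.get();                   // fetch the current coefficients
            double[] grad = subGradient(partition, w);    // local computation
            weights.addGradient(grad, learningRate);      // aggregated in the DSO layer
            barrier.await();                              // all workers join, then next iteration
        }
    }

    double[] subGradient(double[][] partition, double[] w) {
        // application-specific gradient computation over the local partition
        return new double[w.length];
    }
}
```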
In Figure
4(a) and (b), we measure the running time of 100 iterations of the algorithm and the logistic loss after each iteration. Results show that the iterative phase is
\(18\%\) faster in
Crucial (62.3 seconds) than with Spark (75.9 seconds), and thus the algorithm converges faster. This gain is explained by the fact that
Crucial aggregates and combines the sub-gradients in the storage layer. On the contrary, each iteration in Spark requires a reduce phase that is costly both in terms of communication and synchronization.
\(k\)-means. We compare the
\(k\)-means implementation described in Section
6.2.1 to the one in MLlib. For both systems, the centroids are initially at random positions and the input data is evenly distributed among tasks. Figure
4(c) shows the completion time of 10 iterations of the clustering algorithm. In this figure, we consider different values of
\(k\) to assess the effectiveness of our solution when the size of the shared state varies. With
\(k=25\),
Crucial completes the 10 iterations
\(40\%\) faster (20.4 seconds) than Spark (34 seconds). The time gap is less noticeable with more clusters because the time spent synchronizing functions becomes proportionally smaller. In other words, the iteration time becomes increasingly dominated by computation. As in the logistic regression experiment,
Crucial benefits from computing centroids in the DSO layer, while Spark requires an expensive reduce phase at each iteration.
6.2.3 Comparison with PyWren.
We close this section by comparing
Crucial to a serverless-native state-of-the-art solution. To date, the most evaluated framework to program stateful serverless applications is PyWren [
55]. Its primitives, such as call_async and map, are comparable to Crucial’s cloud thread and serverless executor abstractions. Our evaluation employs Lithops [7], a recent and improved version of PyWren. PyWren is a MapReduce framework and thus does not natively provide advanced features for state sharing and synchronization. Therefore, following the recommendations of Jonas et al. [55], we use Redis for this purpose.
Setup. We employ the same application, dataset, and configuration as in the previous experiment. The two frameworks use AWS Lambda for execution. A single r5.2xlarge EC2 instance runs DSO for Crucial, or Redis for PyWren.
\(k\)-means. Implementing
\(k\)-means atop PyWren requires storing the shared state in Redis, namely the centroids and the convergence criterion. Following Jonas et al. [55], we use a Lua script to achieve this. At the end of each iteration, every function atomically updates the shared state by calling the script. This approach is the best solution in terms of performance. In particular, it is more efficient than using distributed locking due to the large number of commands needed for the updates. To synchronize across iterations, we use the Redis barrier covered in Section
6.3.2.
The
Crucial and PyWren
\(k\)-means applications are written in different languages (Java and Python, respectively). Consequently, the time spent in computation for the two applications is dissimilar. For that reason, and contrary to the comparison against Spark, Figure
4(d) does not report the completion time. Instead, this figure depicts the average time spent in accessing the shared state during the
\(k\)-means execution for both
Crucial and PyWren. This corresponds to the time spent inside the loop in Listing
5 (excluding line 16).
In Figure
4(d), we observe that the solution combining PyWren and Redis is always slower than
Crucial. This comes from the fact that
Crucial allows efficient fine-grained updates to the shared state. Such results are in line with the ones presented in Section
6.1.2.
Takeaways. The DSO layer of Crucial offers abstractions to program stateful serverless applications. DSO is not only convenient but, as our evaluation confirms, efficient. For two common machine learning tasks, Crucial is up to 40% faster than Spark, a state-of-the-art cluster-based approach, at comparable resource usage. It is also faster than a solution that combines PyWren, a well-known serverless framework, with the Redis data store.
6.3 Fine-grained Synchronization
This section analyzes the capabilities of Crucial to coordinate cloud functions. We evaluate the synchronization primitives available in the framework and compare them to state-of-the-art solutions. We then demonstrate the use of Crucial to solve complex coordination tasks by considering a traditional concurrent programming problem.
6.3.1 Synchronizing a Map Phase.
Many algorithms require synchronization at various stages. In MapReduce [
29], this happens between the map and reduce phases, and it is known as shuffle. Shuffling ensures that the reduce phase starts only once all the appropriate data has been output by the preceding map phase. Shuffling the map output is a costly operation in MapReduce, even if the reduce phase is short. For that reason, when data is small and the reduction operation simple, it is better to skip the reduce phase and instead aggregate the map output directly in the storage layer [30]. Crucial makes this approach easy to implement.
In what follows, we compare different techniques to synchronize cloud functions at the end of a map. Namely, we compare (1) the original solution in PyWren, based on polling S3; (2) the same mechanism but using the Infinispan in-memory key-value data store; (3) the use of Amazon SQS, as proposed in some recent works (e.g., [
61]); and (4) two techniques based on the
Future object available in
Crucial. The first solution outputs a future object per function, then runs the reduce phase. The second aggregates all the results directly in the DSO layer (called auto-reduce).
We compare the above five techniques by running back-to-back the Monte Carlo simulation in Listing
1. The experiment employs 100 cloud functions, each doing 100 M iterations. During a run, we measure the time spent in synchronizing the functions. On average, this accounts for
\(23\%\) of the total time.
Figure
5(a) presents the results of our comparison. Using Amazon S3 proves to be slow, and it exhibits high variability—some experiments being far slower than others. This is explained by the combination of high access latency, eventual consistency, and the polling-based mechanism. The results improve with Infinispan but, being still based on polling, the approach induces a noticeable overhead. Using Amazon SQS is the slowest approach of all, as it needs a polling mechanism that actively reads messages from the remote queue. The solution based on Future objects notifies the client immediately when the results are available. This reduces the number of connections necessary to fetch the result and thus translates into faster synchronization. When the map output is directly aggregated in DSO,
Crucial achieves even better performance, being twice as fast as the polling approach atop S3.
6.3.2 Synchronization Primitives.
Cloud functions need to coordinate when executing parallel tasks. This section evaluates some of the synchronization primitives available in Crucial to this end.
For starters, we study the performance of a barrier when executing an iterative task. In Figure
5(b), we depict a breakdown of the time spent in the phases of each iteration (Invocation, S3 read, Compute, and Sync). The results are reported for 2 cloud functions out of 10—the other functions behave similarly.
The breakdown in Figure
5(b) considers two approaches. The first one launches, at each iteration, a new stage of functions (a0 and a1) that do not use the barrier primitive. The second launches a single stage of functions (b0 and b1) that run all the iterations and use the barrier primitive to synchronize. In the first case, data must be fetched from storage at each iteration, while in the second approach it is only fetched once. Overall, Figure
5(b) shows that this latter mechanism is clearly faster. In particular, the total time spent in coordinating the functions is lower when the barrier is used (Sync).
Figure
5(c) draws a comparison between two barrier objects available in
Crucial and several state-of-the-art solutions. More precisely, the figure reports the performance of the following approaches: (1) a pure cloud-based barrier, which combines Amazon SNS and SQS services to notify the functions; (2) a ZooKeeper cyclic barrier based on the official double barrier [
37] in a 3-node cluster; (3) a non-resilient barrier using the Redis
BLPOP command (“blocking left pop”) on a single server; (4) the default cyclic barrier available in
Crucial, with a single server instance; and (5) a resilient, poll-based (P) barrier implementing the algorithm in [
46] on a 3-node cluster with replication.
To draw this comparison, we measure the time needed to exit 1,000 barriers back-to-back for each approach. An experiment is run 10 times. Figure
5(c) reports the average time to cross a single barrier for a varying number of cloud functions.
The results in Figure
5(c) show that the single server solutions, namely
Crucial and Redis, are the fastest approaches. With 1,800 threads, these barriers are passed after waiting 68 ms on average. The fault-tolerant barriers (
Crucial (P) and ZooKeeper) create more contention, incurring a performance penalty when the level of parallelism increases. With the same number of threads, passing the poll-based barrier of
Crucial takes 287 milliseconds on average. ZooKeeper requires twice that time. The solution using Amazon SNS and SQS is an order of magnitude slower than the rest.
It is worth noting the difference in programming complexity between these barriers. Both barriers implemented in Crucial take around 30 lines of basic Java code. The solution using Redis has the same length, but it requires proper management of the connections to the data store as well as the manual creation and deletion of shared keys. ZooKeeper substantially increases code complexity, as programmers need to deal with a file-system-like interface and carefully set watches, requiring around 90 lines of code. Finally, the SNS and SQS approach is the most involved technique of all, necessitating 150 lines of code and the use of two complex cloud service APIs.
6.3.3 A Concurrency Problem.
Thanks to its coordination capabilities,
Crucial can be used to solve complex concurrency problems. To demonstrate this feature, we consider the Santa Claus problem [
98]. This problem is a concurrent programming exercise in the vein of the dining philosophers, where processes need to coordinate in order to make progress. Common solutions employ semaphores and barriers, while others rely on actors [18].
Problem. The Santa Claus problem involves three sets of entities: Santa Claus, nine reindeer, and a group of elves. The elves work at the workshop until they encounter an issue that needs Santa’s attention. The reindeer are on vacation until Christmas eve, when they gather at the stable. Santa Claus sleeps, and can only be awakened by either a group of three elves to solve a workshop issue, or by the reindeer to go delivering presents. In the first case, Santa solves the issues, and the elves go back to work. In the second, Santa and the reindeer execute the delivery. The reindeer have priority if the two situations above occur concurrently.
Solution. Let us now explain the design of a common solution to this problem [
18]. Each entity (Santa, elves, and reindeer) is a thread. They communicate using two types of synchronization primitives:
groups and
gates. Elves and reindeer try to join a group when they encounter a problem or Christmas is coming, respectively. When a group is full—either including three elves or nine reindeer—the entities enter a room and notify Santa. A room has two gate objects: one for entering and one for exiting. Gates act like barriers, and all the entities in the group wait for Santa to open the gate. When Santa is notified, he looks whether a group is full (either of reindeer or elves, prioritizing reindeer). He then opens the gate and solves the workshop issues or goes delivering presents. This last operation is repeated until enough deliveries, or
epochs, have occurred.
We implemented the above solution in three flavors. The first one uses
plain old Java objects (
POJOs), where groups and gates are monitors and the entities are threads. Our second variation is a refinement of this base approach, where the synchronization objects are stored in the DSO layer. The conversion is straightforward using the API presented in Section
3. In particular, the code of the objects used in the POJO solution is unchanged. Only adding the
@Shared annotation is required. The last refinement consists in using
Crucial’s cloud threads instead of the Java ones.
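To illustrate the second flavor, the sketch below shows a monitor-based gate that could be shared merely by annotating its field; the Gate class, its methods, and the enclosing field are illustrative, not the code used in our solution.

```java
import java.io.Serializable;

// Illustrative monitor-based gate, written exactly as in a plain-Java (POJO) solution.
class Gate implements Serializable {
    private int remaining = 0;        // how many entities may currently pass

    public Gate() {}                  // empty constructor (DSO requirement)

    public synchronized void pass() throws InterruptedException {
        while (remaining == 0) {
            wait();                   // gate closed: wait for Santa to open it
        }
        remaining--;
    }

    public synchronized void open(int n) {
        remaining = n;                // Santa lets n elves or reindeer through
        notifyAll();
    }
}

// In the serverless flavor, the same class is stored in DSO by annotating the field.
class Workshop implements Serializable {
    @Shared(key = "entry-gate")
    private Gate entryGate = new Gate();
}
```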
Evaluation. We consider an instance of the problem with 10 elves, 9 reindeer, and 15
deliveries (epochs of the problem). Table
3 presents the completion time for each of the above solutions.
The results in Table
3 show that
Crucial is efficient in solving the Santa Claus problem, being at most
\(9\%\) slower than a single-machine solution. In detail, storing the group and gate objects in
Crucial induces an overhead of around
\(4\%\) on the completion time. When cloud threads are used instead of Java ones, a small additional overhead of less than a second is incurred. This penalty comes from the remote calls to the FaaS platform that are necessary to start the computation.
Takeaways. The fine-grained synchronization capabilities of Crucial permit cloud functions to coordinate efficiently. The synchronization primitives available in the framework fit iterative tasks well and perform better than state-of-the-art solutions at large scale while being simpler to use. This allows Crucial to solve complex concurrency problems efficiently.
6.4 Smile Library
The previous section presented the portage to serverless of a solution to the Santa Claus problem. In what follows, we further push this logic by considering a complex single-machine program. In detail, we report on the portage to serverless of the random forest classification algorithm available in the Smile library. Smile [
65] is a multi-threaded library for ML, similar to Weka [
48]. It is widely employed to mine datasets with Java and contains around 165 K SLOC. In what follows, we first describe the steps that were taken to conduct the portage using
Crucial. Then, we present performance results against the vanilla version of the library.
6.4.1 Porting smile.classification.RandomForest.
The portage consists of adapting the random forest classification algorithm [
21] with the help of our framework. In the training phase, this algorithm takes as input a structured file (commonly,
.csv or
.arff), which contains the dataset description. It outputs a random forest, i.e., a set of decision trees. During the classification phase, the forest is used to predict the class of the input items. Each decision tree is calculated by a training task (
Callable). The tasks are run in parallel on a multi-core machine during the training phase. During this computation, the algorithm also extracts the
out-of-bag (
OOB) precision, that is, the forest’s error rate induced by the training dataset.
To perform the portage, we take the following three steps. First, a proxy is added to stream input files from a remote object store (e.g., Amazon S3). This proxy lazily extracts the content of the file, and it is passed to each training task at the time of its creation. Second, the training tasks are instantiated in the FaaS platform. With
Crucial, this transformation simply requires using a
ServerlessExecutorService object in lieu of the Java
ExecutorService. Third, the shared-memory matrix that holds the OOB precision is replaced with a DSO object. This step requires changing the initial programming pattern of the library. Indeed, in the original application, the
RandomForest class creates a matrix using the metadata available in the input file (e.g., the number of features). If this pattern is kept, the application must load the input file to kick off the parallel computation, which is clearly inefficient. In the portage, we instead use a barrier to synchronize the concurrent tasks. The first task to enter the barrier is in charge of creating the matrix in the DSO layer.
For performance reasons, Smile uses Java arrays (mono- or multi-dimensional) and not object-oriented constructs (such as ArrayList). As pointed out previously in Section 3.3, it is not possible to build proxies for such objects in Java without changing the bytecode generated during compilation. Thus, the portage requires transforming these arrays into high-level objects. These objects are then replaced with their Crucial counterparts.
Overall, the portage modifies 378 SLOC in the Smile library (version 1.5.3). This is less than \(4\%\) of the original code base required to run the random forest algorithm. We also added scripts to deploy and run the serverless solution in AWS Lambda, as well as performance benchmarks (see below), for a total of around 1K SLOC. Notice that the portage does not preclude local (in-memory) execution, e.g., for testing purposes. This is possible by switching a flag at runtime.
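A condensed sketch of the executor swap and the runtime flag is given below; the class, flag, and task types are illustrative stand-ins rather than Smile's actual internals.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative portage pattern: the only structural change is which
// ExecutorService implementation runs the training tasks.
class ForestTrainer {
    private final boolean serverless;   // runtime flag: cloud functions or local threads

    ForestTrainer(boolean serverless) { this.serverless = serverless; }

    <T> List<Future<T>> trainAll(List<Callable<T>> trainingTasks) throws InterruptedException {
        ExecutorService executor = serverless
                ? new ServerlessExecutorService()   // each task becomes a cloud function
                : Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        try {
            return executor.invokeAll(trainingTasks);   // one decision tree per task
        } finally {
            executor.shutdown();
        }
    }
}
```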
6.4.2 Evaluation Results.
In Figure
6, we compare the vanilla version of Smile to our
Crucial portage. To this end, we use four datasets: (
soil) is built using features extracted from satellite observations to categorize soils [
41]; (
usps) was published by Le Cun et al. [
70], and it contains normalized handwritten digits scanned from envelopes by the US Postal Service; (
credit-card) is a set of both valid and fraudulent credit card transactions [
79]; and (
click) is a
\(1\%\) balanced subset of the KDD 2012 challenge (Task 2) [
51].
We report the performance of each solution during the learning phase. As previously,
Crucial is executed atop AWS Lambda. The DSO layer runs with
\({\it rf}=2\) in a 3-node (4 vCPU, 16 GB of RAM) Kubernetes cluster. For the vanilla version of Smile, we use two different setups: a hyperthreaded quad-core Intel i7-8550U laptop with 16 GB of memory (tagged Smile-8 in Figure
6), and a quad-Intel CLX 6230 hyperthreaded 80-core server with 740 GB of memory (tagged Smile-160 in Figure
6).
As expected, for small datasets (soil and usps), the cost of invocation outweighs the benefits of running over the serverless infrastructure. For the two large datasets, Figure 6(a) shows that the Crucial portage is up to 5\(\times\) faster. Interestingly, for the last dataset, it is 20% faster than with the high-end server.
In Figure
6(b), we scale the number of trees in the random forest, from a single tree to 200. The second
y-axis of this figure indicates the
area under the curve (
AUC) that captures the algorithm’s accuracy. This value is the average obtained after running a 10-fold cross-validation with the training dataset. In Figure
6(b), we observe that the time to compute the random forest triples, from around 10 seconds to 30 seconds. Scaling the number of trees helps improve classification. With 200 trees, the AUC of the computed random forest is 0.9998. This result is in line with prior reported measures [79] and indicates a strong accuracy of the classifier. Figure 6(b) indicates that training a 200-tree forest takes around 30 seconds with Crucial. This computation is around 50% slower with the 160-thread server, and it takes 20 minutes on the laptop test machine (not shown in Figure
6(b)).
Takeaways. Overall, the above results show that the portage is efficient, bringing elasticity and on-demand capabilities to a traditional monolithic multi-threaded library. We focused on the random forest classification algorithm in Smile, but other algorithms in this library can be ported to FaaS with the help of Crucial.
6.5 Usability of Crucial
This section evaluates how Crucial simplifies the writing of stateful serverless applications and their deployment and management in the cloud.
6.5.1 Data Availability.
Our first experiment verifies that Crucial indeed offers high availability for data persisted in the DSO layer. To this end, the membership of DSO is changed during the execution of the serverless
\(k\)-means. Figure
7 shows a 6-minute run during which inferences are executed with the model trained in Section
6.2.1. The model is stored in a cluster of 3 nodes with
\({\it rf}=2\). The inferences are performed using 100 cloud threads. Each inference executes a read of all the objects in the model, i.e., the 200 centroids.
During the experiment, at 120 seconds and 240 seconds, we crash and add, respectively, a storage node to the DSO layer. Figure
7 shows that our system is elastic and resilient to such changes. Indeed, modifications to the membership of the DSO layer affect performance but never block the system. The (abrupt) removal of a node lowers performance by
\(30\%\). The initial throughput of the system (490 inferences per second) is restored 20 seconds after a new storage node is added.
Notice that handling catastrophic (or cascading) events is possible by running DSO across several availability zones, or even datacenters. In such cases, SMR can be tailored to accommodate the increased latency between data replicas [73]. Evaluating these geo-distributed scenarios is, however, outside the scope of this article.
6.5.2 Programming Simplicity.
Each of the applications used in the evaluation is initially a single-machine program. Table
4 lists the modifications that were necessary to move each program to serverless with
Crucial. The differences between the single-machine, parallel code and its serverless counterpart are small. In the case of Smile, as mentioned earlier, they mainly come from the use of low-level non-OOP constructs in the library (e.g., Java arrays). For the other programs, e.g., the logistic regression algorithm detailed in Section
6.2.2, the changes account for less than
\(3\%\).
Starting from a conventional OOP program, Crucial requires only a handful of changes to port it to FaaS. We believe that this smooth transitioning can help everyday programmers to start harvesting the benefits of serverless computing.
6.5.3 Cost Comparison.
Although one may argue that the programming simplicity of serverless computing justifies its higher cost [
55], the cost of running an application serverless should not significantly exceed that of running it with other cloud appliances (e.g., VMs).
Table
5 offers a cost comparison between Spark and
Crucial based on the experiments in Section
6.2.2. The first two columns list the time and cost of the entire experiments, including the time spent loading and parsing input data, but not the resource provisioning time. The last column lists the cost that can be attributed to the iterative phase of each algorithm. To compare the two approaches fairly, we only consider the pricing of on-demand instances.
With the current pricing policy of AWS [
12], the cost per second of
Crucial is always higher than that of Spark: 0.25 and 0.28 cents per second for 1,792 MB and 2,048 MB of function memory, respectively, against 0.15 cents per second for Spark. Thus, when computation dominates the running time, as in
\(k\)-means clustering with
\(k=200\), the cost of using
Crucial is naturally higher. This difference disappears when
Crucial is substantially faster than Spark (e.g.,
\(k=25\)).
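As a rough back-of-the-envelope check of these per-second figures (our own estimate, not a value reported with the experiments), the contribution of the function fleet alone can be derived from the publicly listed AWS Lambda on-demand rate of roughly \(\$1.67\times 10^{-5}\) per GB-second:
\[
80 \times 1.792\,\text{GB} \times \$1.67\times 10^{-5}/(\text{GB}\cdot\text{s}) \approx \$0.0024/\text{s} \approx 0.24 \text{ cents per second},
\]
\[
80 \times 2.048\,\text{GB} \times \$1.67\times 10^{-5}/(\text{GB}\cdot\text{s}) \approx \$0.0027/\text{s} \approx 0.27 \text{ cents per second}.
\]
The small remainder up to the reported 0.25 and 0.28 cents per second is plausibly accounted for by the other billed resources (e.g., the DSO storage nodes).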
To give a complete picture of this cost comparison, two additional remarks are in order. First, the solution with Crucial, which uses 80 concurrent AWS Lambda functions, employs a larger aggregate bandwidth from S3 than the solution with Spark. This reduces the cost difference between the two approaches. Second, as pointed out in Section
6.1.3,
Crucial users only need to pay for the execution time of each function, and not for the time the cluster remains active. The latter includes bootstrapping the cluster, as well as the trial-and-error processes found, for instance, in machine learning training or hyper-parameter tuning [
100].
Takeaways. The programming model of Crucial makes it easy to write conventional object-oriented applications for serverless. Starting from single-machine code, the changes are minimal. In particular, the DSO layer offers the exact same semantics for state sharing and synchronization as a conventional multi-threaded library (e.g., java.util.concurrent). Being serverless, applications written with Crucial are scalable. Moreover, they execute at a cost comparable to that of cluster-based solutions, without a high upfront investment.
8 Closing Remarks
Serverless computing is a recent paradigm to program the cloud. In this paradigm, the quantum of computation is a cloud function. Once deployed on an FaaS platform, the function is executed on demand, at scale, and in a pay-per-use manner. This article details
Crucial, a new programming framework for FaaS platforms. In line with recent works [
55,
94,
105],
Crucial turns serverless computing on its head. Instead of event-driven stateless computations, FaaS is used to run complex stateful programs. In building and evaluating
Crucial, we faced several challenges and opportunities. This section summarizes these observations before closing.
8.1 Lessons Learned
During the development and evaluation of
Crucial, we learned the following lessons that we believe are useful to the community: (
Lesson 1.) Our evaluation of
Crucial was conducted entirely on AWS Lambda, a state-of-the-art platform. Using public cloud services entails some inherent difficulties. Because such a service acts as a black box, certain experimental results are difficult to understand and reproduce [
85]. Anomalies can be due to cold starts, function co-location, and intermittent service disruptions. Therefore, during the evaluation of
Crucial, we took extra care when examining the results. Fortunately, we did not experience major difficulties besides cold starts, which we could easily overcome. (
Lesson 2.) Evaluating the cost of a serverless application requires a fine-grained understanding of the billing model of the various cloud services involved. A typical deployment uses many services—e.g., FaaS, object storage, virtual private cloud, public IPs, and so on. For our cost comparison in Section 6.5.3, we had no choice but to manually extract these costs and aggregate them. This can be an intricate and time-consuming task. (
Lesson 3.) As indicated in Section 5,
Crucial can be used atop any FaaS platform, provided a Java runtime is available. Serverless computing platforms currently have incompatible APIs for uploading and invoking cloud functions. Thus, the choice of a platform must take into account the tooling (e.g., scripts) needed to simplify the deployment and execution of a serverless application. As a result, the tools created to use
Crucial atop AWS Lambda would require some adjustments for other platforms. We believe that this situation will improve over time as cloud vendors start homogenizing their services.
8.2 Limitations and Future Work
FaaS platforms ship data to the cloud functions rather than shipping computation to the data. Crucial partly remedies this problem by providing a storage layer that can execute complex operations near the data. As a consequence, the system runs over two different layers: the FaaS platform itself and DSO (see Figure
1). An interesting direction for improvement would be to run these two layers jointly on the same infrastructure, e.g., under the same container orchestrator, to improve performance.
In Crucial, cloud threads cannot communicate directly with each other; for instance, signaling between threads is not possible (see Section 3.3). This design limitation is intentional, as it allows Crucial to run atop any FaaS platform. However, for efficiency reasons, it could be of interest to exploit direct function-to-function communication, without resorting to an external storage layer.
A cloud thread is a computing abstraction that sits between a lightweight and a heavyweight thread. Namely, contrary to the lightweight model, sharing is made explicit in a Crucial program. However, sharing among cloud threads is not restricted to IPC abstractions, as it is with heavyweight threads (i.e., processes). We believe that this is an interesting trade-off to explore further. On the one hand, implicit sharing simplifies the life of the programmer. On the other hand, explicitly marking shared data could be of interest for performance, e.g., by scheduling computation near data. In addition, we note that closing the gap between cloud threads and conventional threads (either lightweight or heavyweight) would simplify porting single-machine programs to serverless.
8.3 Conclusion
This article presents Crucial, a system to program highly concurrent stateful serverless applications. Crucial can be used to construct demanding serverless programs that require fine-grained support for mutable shared state and synchronization. We show how to use it to implement applications such as traditional data-parallel computations, iterative algorithms, and coordination tasks.
Crucial is built using an efficient disaggregated in-memory data store and an FaaS platform. Contrary to prior works, such as Faasm [
90] and Cloudburst [
94],
Crucial can run on any standard FaaS platform, requiring only a Java runtime.
Our evaluation shows that, for two common ML algorithms, Crucial achieves performance superior or comparable to that of Apache Spark. Crucial is also a good fit for function synchronization, outperforming the ZooKeeper coordination kernel in this task. In particular, it can efficiently solve complex coordination problems despite the inherent costs of its disaggregated design. For data sharing across cloud functions, Crucial compares favorably with storage alternatives such as Redis.
Our framework makes it possible to move traditional single-machine, multi-threaded Java programs to serverless. We use it to port Smile, a state-of-the-art multi-threaded ML library. The port achieves performance comparable to that of a dedicated high-end server, while providing elasticity and on-demand capabilities to the application.
Crucial offers conventional multi-threaded abstractions to the programmer. In our evaluation, the application code bases differ by less than \(6\%\) from standard solutions using plain old Java objects. We believe that this simplicity can help broaden the horizon of serverless computing to unexplored domains.