arXiv:2311.10800v1 [cs.PL] 17 Nov 2023
The Next 700 ML-Enabled Compiler Optimizations
S. VenkataKeerthy (IIT Hyderabad, India)
Siddharth Jain (IIT Hyderabad, India)
Umesh Kalvakuntla (IIT Hyderabad, India)
Pranav Sai Gorantla (IIT Hyderabad, India)
Rajiv Shailesh Chitale (IIT Hyderabad, India)
Eugene Brevdo (Google DeepMind, USA)
Albert Cohen (Google DeepMind, France)
Mircea Trofin (Google, USA)
Ramakrishna Upadrasta (IIT Hyderabad, India)
Abstract
2. Engineering objective-specific features [FKM+ 11], or extracting objective-independent program embeddings [AZLY19,
BNJH18, VAJ+ 20, CFB+ 21], or a combination of both.
3. Setting up a training interface with the compiler, with
examples ranging from communicating the output via
compiler flags [JAVU22], offline file logs [HAAW+ 20], to
generic gym APIs [BCP+ 16] and recent compiler-specific
gym APIs like CompilerGym [CWG+ 22], PolyGym [BGC21]
and Supersonic [WTZ+ 22].
4. Finally, building and deploying the compiler with the
trained model for inference.
There is a growing interest in enhancing compiler optimizations with ML models, yet interactions between compilers
and ML frameworks remain challenging. Some optimizations require tightly coupled models and compiler internals,
raising issues with modularity, performance and framework
independence. Practical deployment and transparency for
the end-user are also important concerns. We propose ML-Compiler-Bridge to enable ML model development within
a traditional Python framework while making end-to-end integration with an optimizing compiler possible and efficient.
We evaluate it on both research and production use cases, for
training and inference, over several optimization problems,
multiple compilers and their versions, and gym infrastructures.
In most works, the process ends with step (3) and a simplified benchmark-oriented version of step (4) to evaluate the
trained model. Indeed, while a number of solutions exist for steps (1) and (2), steps (3) and (4),
which involve model-compiler interaction,
have not yet been adequately addressed by a proper methodology.
The diversity of compiler optimizations and ML models
is associated with an equally broad range of requirements
for model-compiler interaction. In Tab. 1, we illustrate this
on recent proposals. There exist multiple ML frameworks
and even more types of ML models. A model’s input may
be a plain floating point vector, or tensors of different ranks
and shapes. Outputs range from a unique Boolean decision
to complex data structures. These need to be communicated
to the compiler, only once in simple scenarios, or many times and with large amounts of data in
more intricate ones. This may require extensive source
code modifications for the sole purpose of implementing the
compiler-model interface.
Some of these interactions have been explored in the literature and even landed in production; however, there does not
exist a single generic method to address the vast diversity of
scenarios that are imaginable and the trade-offs therein. Such
a situation limits the scope, applicability and effectiveness
of ML for compiler optimizations in the following ways:
1 Introduction
With the success of Machine Learning (ML) models in various
domains, there is a growing interest in applying ML to improve optimization heuristics in compilers [CST02, Ama20].
Several ML and Reinforcement Learning (RL) approaches
have been proposed to improve optimizations like vectorization [HAAW+ 20, MYP+ 19], loop unrolling, distribution [SA05,
JVA+ 22], function inlining [SCWK13, TQB+ 21], register allocation [DAK20, TQBL21, KPM22, VJK+ 23], prediction of
phase sequences [ABP+ 17, HHAM+ 19, JAVU22], among many
others [ABDS18, WO18]. More specifically, the widely used
LLVM compiler [LA04] has support for RL-based inlining
decisions from version 11, and RL-based eviction decisions in
its register allocator from version 14 [TQBL21]. The title of
our paper acknowledges this growing trend and anticipates
the needs of the ML-enabled optimizations that are yet to
come, in the spirit of Landin’s seminal paper [Lan66] on the
diversity of existing and future programming languages.
Setting up an ML-based compiler optimization is a challenging task. In addition to model design, it involves specialized data collection, compiler engineering, and packaging:
• Scalability: Integrating a Python model with C++ code
using wrappers induces significant compile-time
overhead, e.g., 6×–100× [JVA+ 22].
• Integration: Not all optimizations are simple enough
that the outputs of the model can be communicated using
flags [VJK+ 23, JVA+ 22, KPM22, TQB+ 21]. As ML-based
1. Preparing or generating the data sets for training the
model [CPWL17, dSKdSMa+ 21].
Initial version prepared on 1st Sep 2023. Revised on 14th Nov 2023.
S. VenkataKeerthy et al.
Table 1. Diverse ML and RL requirements in previous work; unknown or unclear ones are left blank.
Communication
SLP Vectorization [MYP+ 19]
Model Input
Model Output
NeuroVectorizer [HAAW+ 20] Pragmas in source
Code2Vec vectors
Register Allocation [DAK20]
Register Allocation [KPM22]
No integration
Interference graph
PBQP Graph
Instructions to pack
source code with pragma
added in Python
Coloured IG
Allocated PBQP graph
POSET-RL [JAVU22]
Opt Flags
IR2Vec vectors
Pass sequence
Loop Distribution [JVA+ 22]
Inliner [TQB+ 21]
RegAlloc Eviction [TQBL21]
Python Wrappers
IR2Vec vectors
Precompiled TF model Features
Precompiled TF model Features
IG + node level
gRPC
embeddings
RL4ReAl [VJK+ 23]
LLVM IR
Commn Freq
Model Type
Once, at the end Single Agent FCNN
None
Multiple times
per episode
Distribution sequence
Once, at the end
Yes/No
Once, at the end
Index of Live Range to Evict Once, at the end
Multiple times
Color map
per episode
optimizations grow in popularity, flag-based approaches
become unwieldy.
• Programmability: ML models are typically written in
Python across different frameworks like TensorFlow,
JAX, PyTorch, etc. Expecting the model to be written
in C++ within the compiler is not ML developer-friendly.
• Portability: Several proposals involve a tight coupling
between the compiler and a specific ML framework; we
however believe that a generic compiler infrastructure
like LLVM should remain ML-framework-independent.
#Agents
ML Framework
Single Agent GGNN
Keras, RLLib
NA
LSTM
TensorFlow
Single Agent GCN, ResNet PyTorch
Single Agent FCNN
PyTorch
Two agents
Single Agent
Single Agent
Four agents;
hierarchical
GNN, FCNN
FCNN
FCNN
PyTorch
TensorFlow
TensorFlow
GNN, FCNN
PyTorch, RLLib
model runner and SerDes can be chosen based on the usage scenario and requirement, and these may differ during
training and inference. Our library provides C++ and Python
APIs to expose model runners and SerDes for integration
with compilers and ML frameworks respectively.
We show that the inter-process model runners effectively
support training. Once the model is trained, the in-process
model runners interface the model with the
compiler in a transparent manner, with much lower latency
to aid in deployment. Besides, both our model runner and
SerDes modules can be easily extended to support more
forms of communication and serialization. Our library also
provides C-APIs to aid in integration with C-based compiler
infrastructures like Pluto, GCC, and SQLite.
We evaluate ML-Compiler-Bridge on four ML-enabled
optimizations in LLVM: RL-LoopDistribution, POSET-RL,
RL4ReAl, and Inliner. We show that our library can be
integrated with other compilers like Pluto [BHRS08] and
MLIR [LAB+ 21] with minimal effort. We study the impact of
communication and serialization options on compile time
under different complex scenarios that the existing infrastructures could not handle. We conduct extensive evaluations to measure the overhead caused by each model runner
and SerDes. We also study the impact of integrating ML-Compiler-Bridge with LLVM in terms of additional dependencies, compile time, and binary size overhead. Our
contributions are as follows:
The existing gym libraries primarily aim at facilitating
model training for research and reproducibility by providing a high-level integration. For example, the recent CompilerGym [CWG+ 22] provides a high-level interface in the
form of C++ wrapper methods outside the compiler to invoke out-of-tree compiler APIs to materialize the predicted
actions. Such integration caters well to training certain interactions like Phase Ordering [JAVU22]. However, other
optimizations like RegAlloc [VJK+ 23, KPM22, DAK20], loop
distribution [JVA+ 22] and inlining [TQB+ 21] necessitate a
deeper interfacing of the model within the compiler, with
multiple rounds of interaction for both training and inference scenarios. Further, in these gym libraries, the inference
flow is driven by Python: the compilation starts by invoking
a Python process, breaking the isolation between the end
user and the internal compiler algorithms; this limits deployment opportunities among other downsides. We discuss
these issues in detail in Sec. 4.
To address these shortcomings, we propose ML-Compiler-Bridge, a library that allows ML model development within
a traditional Python framework while providing tightly coupled and efficient end-to-end integration with the compiler.
Our library bridges the compiler and ML model by providing a suite of communication approaches (model runners)
and the related (de-)serialization mechanisms (SerDes) to
cater to diverse scenarios. It also provides support for both
inter- and in-process communication by exposing different
model runners: gRPC and named-pipes for the former, and
the TensorFlow interface and ONNX for the latter. Diverse
SerDes options based on Protobuf, JSON, and native bitstreams improve efficiency and versatility. The appropriate
• We propose ML-Compiler-Bridge, a library to enable
the deeper integration of ML models and the compiler
in a framework-independent manner.
• We provide a suite of two inter-process and two in-process model
runners, and three (de-)serialization mechanisms (SerDes)
to support different interaction scenarios.
• We provide multi-language user APIs: C++ and C APIs to
interface model runners and serializers with compilers
and Python APIs to interface inter-process model runners
with ML frameworks.
• We show that our library is easy to integrate with three
different compilers spanning different representations,
and carry out extensive evaluations on four ML-enabled
optimizations on two versions of LLVM (V10, V17).
• We characterize the impact of each communication and
serialization option on compilation and training times
and analyze additional dependencies and other overheads.
like domain-specific training or fine-tuning at deployment
time. Since ML developers usually prefer developing models
within a Python-based framework, the training process involving a C++ compiler infrastructure like LLVM requires a
communication channel, typically inter-process, while catering to the needs of (de-)serializing data between the native
types of C++ and Python. The distributed nature of training processes may also require extending communication
beyond a single operating system node.
2 Background
Inference. When focusing on inference/deployment, compile time and ease of use become crucial factors. The communication and serialization methods involved should take this
into account, and converting the Python
model to a streamlined C++ implementation should also be considered. These factors
matter even for the simplest forms of communication, like
one-time evaluations of the ML model and communicating
via flags. Making the flow transparent to the user also requires a deeper, end-to-end integration with the compiler.
There is no tool providing the necessary layers of abstraction between the three actors while supporting the
required training and inference scenarios, not to mention
ML-framework independence. Designing such a library and
evaluating its suitability for diverse use cases is the challenge
we tackle in this paper.
Figure 1. ML-enabled compiler optimizations: (1) Inputs and other
metadata required by the model are prepared in the appropriate
format. (2) Serialized input is passed on to the model by a suitable
communication channel. (3) Input is deserialized to appropriate
format. (4) The model is queried to obtain optimization decisions
as output. (5) Output is serialized, and (6) Sent back to the compiler
optimization as a response. (7) The received response is deserialized,
and optimization decisions are taken according to the output.
ML-enabled Compiler Optimizations. The process of
supporting or fully implementing optimization decisions
with one or more ML models involves the steps shown in
Fig. 1. This process repeats until the end of the compilation
process for each ML-based optimization. The above scheme is
generic enough to capture any optimization involving single
or multiple ML models with multiple two-way interactions.
For the cases that would need multiple interactions, steps
(1)–(7) are repeated until the final outcome.
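The seven steps of Fig. 1 can be sketched as a single round trip. The following is a minimal illustration only, assuming JSON as the wire format and a direct function call standing in for the communication channel; the names `toy_model`, `model_endpoint`, and `compiler_query`, as well as the `trip_count` feature and the unrolling rule, are hypothetical and not part of ML-Compiler-Bridge.

```python
import json


def serialize(payload: dict) -> bytes:
    """Steps (1) and (5): turn native values into bytes for the channel."""
    return json.dumps(payload).encode("utf-8")


def deserialize(blob: bytes) -> dict:
    """Steps (3) and (7): recover native values from the received bytes."""
    return json.loads(blob.decode("utf-8"))


def toy_model(features: dict) -> dict:
    """Step (4): a stand-in 'model' (hypothetical rule, not a trained network)."""
    return {"unroll": features["trip_count"] >= 8}


def model_endpoint(request: bytes) -> bytes:
    """Model side: deserialize the query, run the model, serialize the answer."""
    return serialize(toy_model(deserialize(request)))


def compiler_query(features: dict, channel) -> dict:
    """Compiler side: one full round trip through the channel."""
    request = serialize(features)   # (1) input serialization
    response = channel(request)     # (2) query sent, (6) response received
    return deserialize(response)    # (7) output deserialization


decision = compiler_query({"trip_count": 16}, model_endpoint)
print(decision)
```

An optimization that needs several interactions would simply call `compiler_query` repeatedly, feeding each decision back into the next set of features.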
More broadly, there are three actors involved in developing and using such an ML-enabled compiler: (i) the compiler
expert, who develops the compiler optimization; (ii) the ML
expert, who designs the ML model for the optimization problem; and (iii) the end-user, who uses the compiler. Ideally,
compiler experts should use the ML models with minimal
understanding of the internals and processes specific to ML modeling and of the framework on which the model is built to arrive
at the result. Similarly, ML experts should design
the models with minimal or no understanding of compiler
internals, infrastructural details, and integration points, focusing on the optimization objectives and information flow.
For the end-user, the presence of ML-based compiler optimization should be transparent and indistinguishable from
the conventional (non-ML-based) compilation process. To
achieve this scheme of abstraction/segregation among all
three actors, it is important to distinguish between the training and inference flows.
3 ML-Compiler-Bridge
Figure 2. The compiler instantiates a model runner and sets the
input features to be used by the model. MLModelRunner internally
invokes SerDes to serialize the data in one of the supported formats
and query the model. The returned decision is deserialized and
provided to the optimization.
We propose an abstraction mechanism made of two main
components: Serializer and Model Runner. The SerDes module (de-)serializes the data to/from the requested format, and
the MLModelRunner module is responsible for communication with the model. The model runner obtains the serialized data, writes it to a communication channel, queries
the model, and deserializes the output received from the
model. ML-Compiler-Bridge exposes methods to be invoked by the user to interact with the model decoupled
from serialization and communication. We provide three
framework-independent model runners, gRPC, named-pipes,
and ONNX, and one framework-specific TensorFlow model
Training. Typically, training the ML model becomes part
of compiler development and build-up, and inference becomes part of the compiler deployment and execution. However, occasionally this boundary may shift towards the user,
class MLModelRunner {
public:
  // Exposed to the user; returns the model's output
  template <typename T> T evaluate() {
    return *reinterpret_cast<T *>(evaluateUntyped());
  }

protected:
  // To be overridden by derived classes
  virtual void *evaluateUntyped() = 0;
};

Listing 1. Skeleton of MLModelRunner class

service gRPCExample {
  // RPC function for training integration
  rpc getObservation(Action) returns (Observation) {}

  // Mandatory RPC function for model evaluation
  // Blocking call; released upon receiving a response
  rpc getAdvice(Observation) returns (Action) {}
}

Listing 2. Example gRPC function declaration
(3) When input from the compiler is required, the model
sends requests to the compiler with appropriate queries
and waits for the response.
(4) The compiler gets out of the blocked state and processes
the query to generate an appropriate response.
(5) The response is sent back to the client, and the model
goes on to complete training on that input.
runner. These can be combined with three different serializations: Protobuf, JSON, and bitstream. The modular design
enables new forms of communication and serialization to be
added by overriding a minimal set of methods. Fig. 2 shows
the components and interactions of ML-Compiler-Bridge.
3.1 ML Model Runners
We provide two classes of model runners. The inter-process
class provides the easiest mechanism to decouple Python
models from a compiler running as a separate process. The in-process class assumes that the ML model is readily available
in a compiled form and can be accessed within the compiler
through a specific API. Clearly, in-process communication
is designed with inference and deployment in mind, while
inter-process communication enjoys more diverse use cases.
Model runners may support simple ML queries and feedforward networks as well as more involved Reinforcement
Learning (RL) algorithms or Graph Neural Networks (GNNs).
Internally, MLModelRunner is the abstract base class from
which the other model runners are derived (List. 1). It exposes two APIs: populateFeatures() populates the input
features, and evaluate() queries the model. The latter returns the output of the model and is templated according
to the expected output type. Internally, evaluate() invokes
evaluateUntyped() that is to be overridden by the concrete
model runner classes that derive from MLModelRunner. The
MLModelRunner interfaces with the SerDes methods through populateFeatures() to serialize the inputs.
The method populateFeatures() is implemented as a variadic function that uses variable-length key-value pairs as
arguments. The key is a string identifier that describes the
input, and the value is of template type.
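The division of labor between populateFeatures(), evaluate(), and evaluateUntyped() can be illustrated with a small Python analogue of the C++ design. This is a sketch under stated assumptions rather than the library's code: `ModelRunnerSketch`, `ThresholdRunner`, the feature names, and the choice to clear the buffer after each query are all hypothetical, and a plain dictionary stands in for the SerDes buffer.

```python
from abc import ABC, abstractmethod
import json


class ModelRunnerSketch(ABC):
    """Python analogue of the MLModelRunner design (illustrative only)."""

    def __init__(self):
        self._features = {}  # stands in for the SerDes buffer

    def populate_features(self, *pairs):
        """Accept a variable number of (key, value) pairs."""
        for key, value in pairs:
            self._features[key] = value

    def evaluate(self, out_type):
        """User-facing query; converts the untyped result to the requested type."""
        result = out_type(self._evaluate_untyped())
        self._features.clear()  # assumption: each query starts from an empty buffer
        return result

    @abstractmethod
    def _evaluate_untyped(self):
        """Overridden by each concrete runner."""


class ThresholdRunner(ModelRunnerSketch):
    """Hypothetical runner whose 'model' is a fixed size threshold."""

    def _evaluate_untyped(self):
        # Round-trip through JSON to mimic serialization and deserialization.
        payload = json.loads(json.dumps(self._features))
        return payload["callee_size"] < payload["budget"]


runner = ThresholdRunner()
runner.populate_features(("callee_size", 40), ("budget", 100))
should_inline = runner.evaluate(bool)
```

The caller only ever touches `populate_features` and `evaluate`; swapping the communication mechanism amounts to providing another subclass.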
Inference follows the same steps, yet the compiler becomes
the client and the model becomes the server so as to support
a regular compilation process.
3.1.1 Inter-process Model Runners. gRPCModelRunner
uses gRPC and may run the model and the compiler on different machines, whereas pipeModelRunner uses named pipes for single-machine scenarios only. At training time, the compiler acts
as a server and the Python-based ML model acts as a client.
The sequence of steps is described as follows:
(1) Compilation starts and the compiler listens for queries
at the wait() call inserted at the point of interest.
(2) The Python model starts training; this can be started
concurrently with Step (1).
Pipe Model Runner. As the name suggests, pipeModelRunner
relies on named pipes for inter-process communication (the
mkfifo system call). Pipes provide a simple and effective
means of communication that is local to the machine without any network or security constraints.
As pipes are unidirectional, the pipeModelRunner creates
the read and write pipes for communication. The read pipe in
the compiler obtains the data written by the model in Python,
and the write pipe provides the data into the pipe that is
read by the model on the other end. The evaluateUntyped
gRPC Model Runner. gRPC [WZZ93] provides RPC methods specifying the type of input and output in Protobuf format [Pro]. During the build process of the library, the proto
files are automatically translated to C++ and Python code
by invoking the protoc compiler. An example is shown in
List. 2. The generated code defines the Service class that
exposes the RPC methods to be overridden by the user in
the optimization that makes use of gRPCModelRunner. Due
to design constraints of gRPC, we only support Protobuf
serialization with gRPCModelRunner.
gRPCModelRunner takes in the server address and the port
number at which the connection is to be established. In
training mode, gRPCModelRunner starts the server and starts
listening for an RPC call invoked by the model. The overridden RPC method is directly called by the Python model
to generate new observations by applying the action predicted by the model. In inference mode, gRPCModelRunner
starts the gRPC connection at the given address and port.
evaluateUntyped() is overridden to invoke the RPC method
defined by the Python model after preparing the input data,
and getAdvice() serves as the RPC method for inference.
method is overridden to read from and write to the pipes appropriately. read() is a blocking call forcing the compiler to
wait until data is written by the model. Once the data is written, the model enters a blocked state by invoking read()
on the second pipe, waiting for the response from the compiler. The pipe model runner ensures proper opening, closing,
and cleanup of the pipes. pipeModelRunner provides a simpler interface
for establishing communication as the user directly invokes
evaluate() after setting the inputs.
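The two-pipe handshake can be sketched with the Python standard library on a POSIX system. This is a minimal illustration, not the pipeModelRunner implementation: a thread stands in for the separate model process, JSON is assumed as the serialization, and the pipe names, the `num_loops` feature, and the decision rule are hypothetical.

```python
import json
import os
import tempfile
import threading


def model_process(to_model: str, to_compiler: str) -> None:
    """Stand-in for the Python model: read one query, write one answer."""
    with open(to_model, "rb") as rx:       # blocks until the compiler opens its end
        query = json.loads(rx.read())      # read() returns once the writer closes
    advice = {"distribute": query["num_loops"] > 1}  # hypothetical decision rule
    with open(to_compiler, "wb") as tx:
        tx.write(json.dumps(advice).encode("utf-8"))


def compiler_evaluate(features: dict, to_model: str, to_compiler: str) -> dict:
    """Stand-in for the compiler: write the query, then block on the response."""
    with open(to_model, "wb") as tx:
        tx.write(json.dumps(features).encode("utf-8"))
    with open(to_compiler, "rb") as rx:    # blocking read on the second pipe
        return json.loads(rx.read())


with tempfile.TemporaryDirectory() as workdir:
    # Pipes are unidirectional, so one FIFO is created per direction.
    to_model = os.path.join(workdir, "compiler_to_model")
    to_compiler = os.path.join(workdir, "model_to_compiler")
    os.mkfifo(to_model)
    os.mkfifo(to_compiler)

    worker = threading.Thread(target=model_process, args=(to_model, to_compiler))
    worker.start()
    advice = compiler_evaluate({"num_loops": 3}, to_model, to_compiler)
    worker.join()
```

Because opening a FIFO blocks until the other end is opened, neither side needs an explicit synchronization primitive; the temporary directory takes care of removing both pipes afterwards.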
3.1.2 In-process Model Runners. In-process model runners are designed to provide an effective means of compiler
deployment. It is important to optimize the inference time
as it adds to the overall compile time. One may obtain
significantly lower compile times by removing inter-process
communication overhead and by turning the complications
of a compiled model into an advantage: the
query time is reduced compared to models running in Python. Serialization/deserialization overhead is also lowered.
Listing 3. Snippet from ONNXModelRunner showing the
environment-agent interaction to generate an observation
ONNX Model Runner. The Open Neural Network Exchange [LF17] (ONNX) is an open format to represent machine learning models. Models built from various frameworks like TensorFlow, PyTorch, etc. can be represented
in ONNX format in an interoperable manner. Additionally,
it supports a wide variety of hardware architectures ranging from edge devices to general-purpose CPUs and GPUs.
Once the model is trained in Python, it is converted into
a common ONNX representation and is imported into the
compiler via the ONNX runtime. ONNXModelRunner exposes
the necessary wrapper APIs to read the ONNX model, query
it with inputs and obtain outputs. ONNXModelRunner also supports RL
models.
void *ONNXModelRunner::evaluateUntyped() {
  Observation obs = this->env->reset();
  while (true) {
    Action action;
    // Current agent
    auto current_agent = this->agents[this->env->getNextAgent()];
    action = current_agent->computeAction(obs);
    obs = this->env->step(action);
    if (this->env->checkDone())
      break;
  }
  return nullptr;
}
Figure 3. Sequence diagram indicating different events and the
interaction between various classes for RL based optimization by
ONNXModelRunner. Only the methods that are highlighted are to be
overridden by the user. Other methods are internal to the library.
APIs internally. The sequence of events describing this interaction is shown in Fig. 3.
ONNXModelRunner exposes the Environment class with
APIs for standard step and reset operations along with
setDone() API to indicate the end of the episode. step()
returns the next observation given an action. Internally,
the step operator applies the action predicted by the agent
(model) to move on to the next state and returns the new
observation from the environment. step() also signals if
the terminal state is reached by invoking setDone() to stop
the current prediction. reset() resets the environment to
the initial state and returns the initial observation. Hence
ONNXModelRunner invokes reset() first to obtain
the initial observation. This sequence of APIs is invoked
within the evaluateUntyped() of ONNXModelRunner and
is shown in Listing 3. The optimization pass using the
ONNXModelRunner should inherit from Environment and
override step() and reset() depending on the optimization requirements.
ONNXModelRunner queries the model using the C++ APIs.
A map containing the identifier of the agent (label) and the
corresponding model path is passed while instantiating the
ONNXModelRunner. In the case of multiple agents, the identifier of the next one to use is set by the Environment while
returning the observation. ONNXModelRunner queries the
corresponding agent with the observation to obtain the requested action. This process goes on until Environment
invokes setDone().
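The environment-agent loop described above, including the selection of the next agent by the environment, can be mirrored in a few lines of Python. This is an illustrative sketch only: `CountdownEnv`, `FixedAgent`, and `run_episode` are hypothetical names, the agents return fixed actions instead of querying an ONNX model, and the toy task (counting down to zero) has no relation to a real optimization.

```python
class CountdownEnv:
    """Hypothetical environment: the episode ends when the counter reaches zero."""

    def __init__(self, start):
        self.start = start
        self.counter = start
        self.done = False
        self.next_agent = "even"

    def reset(self):
        """Return to the initial state and hand back the initial observation."""
        self.counter = self.start
        self.done = False
        self._pick_agent()
        return self.counter

    def step(self, action):
        """Apply the action, flag the terminal state, return the new observation."""
        self.counter = max(0, self.counter - action)
        if self.counter == 0:
            self.set_done()
        self._pick_agent()
        return self.counter

    def set_done(self):
        self.done = True

    def _pick_agent(self):
        # The environment decides which agent is queried next.
        self.next_agent = "even" if self.counter % 2 == 0 else "odd"


class FixedAgent:
    """Stand-in for an exported model: always proposes the same action."""

    def __init__(self, action):
        self.action = action

    def compute_action(self, observation):
        return self.action


def run_episode(env, agents):
    """Query the agent named by the environment until the episode is done."""
    trace = []
    obs = env.reset()
    while not env.done:
        name = env.next_agent
        action = agents[name].compute_action(obs)
        trace.append((name, action))
        obs = env.step(action)
    return trace


trace = run_episode(CountdownEnv(5), {"even": FixedAgent(2), "odd": FixedAgent(1)})
```

Starting from 5, the "odd" agent acts once and the "even" agent acts twice before the environment signals the end of the episode.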
ONNXModelRunner for plain ML models. ONNXModelRunner
can also be used to query non-RL models by directly invoking the evaluate method upon instantiating the object with
the path to the ONNX model.
ONNXModelRunner for RL. For RL, the agent is usually the
learner trained to predict appropriate actions given the observations from the environment. Exporting a trained model
to ONNX implies exporting only the agent. To facilitate RLbased interaction for a generic multi-agent scenario between
the environment and the agents, ONNXModelRunner provides
Environment and Agent classes separately and accesses the
TensorFlow Model Runners. This is a framework-specific
model runner built on the TensorFlow ahead-of-time (AOT)
saved model. There are two implementations: (i) “Release
Mode Model Runner”, used in production environments, and (ii)
“Model Under Training Model Runner”, intended either for
fine-tuning or for quickly evaluating candidate models and
parameters. TFLite is a scaled down TensorFlow interpreter
designed to be embedded in native binaries, and can be used
to further reduce overheads.
The TensorFlow model runner uses the AOT saved model
compiler which produces a header exposing the model as a
C++ class, and a native object file with its implementation.
The model runner reduces again to a simple adapter [GHJV95]
around that class. The compiler binary does not expose new
runtime dependencies as it is statically linked, and this simplifies its deployment. Note that the model compiler can be
configured to generate code loading the weights from a file
passed via the command line to the LLVM compiler.
JSON SerDes. JSONSerDes overrides the setFeature methods to populate the JSON buffer appropriately, given the
key-value pairs. Similarly, the received data is deserialized
by first converting it to a JSON object; the JSON fields
are then cast to native types and returned. JSON SerDes is also
transparent to the user.
Protobuf SerDes. Protobuf SerDes needs the user to provide the input and output data specifications in a proto file.
These are compiled to generate the C++ and Python sources
(Sec. 3.1.1). ProtobufSerDes serializes the input key-value
pair by overriding the setFeature methods to set the appropriate fields of the message described in the proto file. Deserializing protobuf data to the native format only involves
reading and returning the appropriate fields of the message.
Except for providing the proto file, ProtobufSerDes is transparent to the user.
supports (de)serializing in basic (int, float, double, string,
bool) and compound (vector, list) data types.
Bitstream SerDes. The bitstream starts with a JSON header
which specifies the key (identifier), type and shape of the
tensors, and the order in which they will be serialized. Tensor values themselves are dumped as raw bytes. The received bitstream is interpreted based on the type and shape
specified in the header and converted to native types. Processing the header induces negligible overhead if communicated data does not involve complex data types. Internally,
BitstreamSerDes overrides the setFeature methods similar to the other SerDes to expose the functionality. Fig. 4
shows the class diagrams [GHJV95] of model runners and
SerDes.
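A bitstream of this kind, a JSON header followed by raw tensor bytes, can be sketched with the standard `json` and `struct` modules. The layout below is an assumption made for illustration and is not the library's wire format: in particular, the 4-byte length prefix in front of the header, the little-endian encoding, and the type names in `_FORMATS` are hypothetical choices.

```python
import json
import struct

# Type names mapped to struct format characters (fixed width, little-endian).
_FORMATS = {"int32": "i", "int64": "q", "float32": "f", "float64": "d"}


def pack_tensors(tensors):
    """tensors: list of (key, type_name, shape, flat_values) in serialization order."""
    header = [{"key": k, "type": t, "shape": list(s)} for k, t, s, _ in tensors]
    header_bytes = json.dumps(header).encode("utf-8")
    body = b"".join(
        struct.pack(f"<{len(v)}{_FORMATS[t]}", *v) for _, t, _, v in tensors
    )
    # Assumed framing: a length prefix tells the reader where the header ends.
    return struct.pack("<I", len(header_bytes)) + header_bytes + body


def unpack_tensors(blob):
    """Interpret the raw bytes using the type and shape given in the header."""
    (header_len,) = struct.unpack_from("<I", blob, 0)
    header = json.loads(blob[4:4 + header_len].decode("utf-8"))
    offset = 4 + header_len
    out = {}
    for spec in header:
        count = 1
        for dim in spec["shape"]:
            count *= dim
        fmt = f"<{count}{_FORMATS[spec['type']]}"
        out[spec["key"]] = list(struct.unpack_from(fmt, blob, offset))
        offset += struct.calcsize(fmt)
    return out


blob = pack_tensors([
    ("loop_depth", "int64", (1,), [3]),
    ("embedding", "float32", (2, 2), [0.5, 1.5, -2.0, 4.0]),
])
decoded = unpack_tensors(blob)
```

The header is parsed once per message, while the tensor payload is copied without any per-element text conversion, which is where the format is expected to save time over JSON.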
Figure 4. Class diagram of ML-Compiler-Bridge
3.2 SerDes: Serializer and Deserializer Module
When data is transferred, specifically across two processes,
it is important to convert data that is present in the native
types (of C++ and Python) from one format to another. This
is the purpose of (de-)serialization as implemented by the
SerDes module.
Internally, the MLModelRunner interacts with SerDes to
(de-)serialize C++ native data to model-specific types and
back. The choice of (de-)serialization depends on the optimization and ML model. We currently provide three options:
bitstream, JSON, and Protobuf. They vary in terms of usage
scenario, usage effort, and (de)serialization time. SerDes effectively abstracts away the underlying mechanism while
providing the flexibility of different serialization options.
3.3 C-APIs
We provide C wrappers around the C++ implementation to
integrate with C-based compilers. These wrappers are C++
files written in C-style. Each method internally queries the
original C++ implementation and returns results in a way
compatible with C calling conventions. This code is built as a
separate library that may be linked with a C-based compiler.
We used it with the Pluto polyhedral compiler in particular.
Base SerDes. Internally, each SerDes is derived from the
BaseSerDes class. SerDes uses key-value based serialization
as described in Sec. 3.1. The populateFeatures method of
MLModelRunner invokes the appropriate version of the overloaded setFeature() exposed by BaseSerDes to serialize
inputs. These methods are overridden by the SerDes classes
that derive from BaseSerDes according to the underlying
serializer. This class also exposes the deserialize method
to deserialize the received data and is overridden by the derived classes to obtain the data in native types. Our library
3.4 Extensions
Both MLModelRunners and SerDes can be easily extended
to support new model runners and serializers. New runners
may include TVM [CMJ+ 18], ahead-of-time compiled PyTorch models and FlatBuffers [Fla], and serialization also
supports YAML formats. New model runners can be contributed by inheriting MLModelRunner and overriding the
evaluateUntyped method according to the model runner.
Similarly, a new (de)serializer can be added by inheriting
BaseSerDes and overriding the setFeature and deserialize
methods specific to the new serializer.
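The extension pattern described above can be sketched in Python as follows. This is only an analog of the C++ design under stated assumptions: `ModelRunnerSketch`, `populate_features`, `evaluate_untyped` and `ThresholdRunner` are hypothetical names chosen to mirror the text, and the "model" is a trivial threshold rule rather than a real ML model.

```python
from abc import ABC, abstractmethod


class ModelRunnerSketch(ABC):
    """Hypothetical Python analog of a model runner base class."""

    def __init__(self):
        self.inputs = {}

    def populate_features(self, key, value):
        # Record one named input for the next evaluation.
        self.inputs[key] = value

    @abstractmethod
    def evaluate_untyped(self):
        """Run the model on the recorded inputs and return a raw result."""

    def evaluate(self, out_type):
        # Typed wrapper: convert the raw result, then clear inputs for the next query.
        result = out_type(self.evaluate_untyped())
        self.inputs.clear()
        return result


class ThresholdRunner(ModelRunnerSketch):
    """Toy runner: advises 1 when the mean feature value exceeds a threshold."""

    def __init__(self, threshold):
        super().__init__()
        self.threshold = threshold

    def evaluate_untyped(self):
        values = self.inputs["features"]
        return sum(values) / len(values) > self.threshold


runner = ThresholdRunner(threshold=0.5)
runner.populate_features("features", [0.2, 0.9, 0.7])
advice = runner.evaluate(int)
```

Only `evaluate_untyped` is specific to the new runner; feature population and the typed `evaluate` wrapper are inherited unchanged, which is what keeps the compiler-side call sequence identical across runners.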
    #include <memory>
    #include <string>
    #include <utility>
    #include "MLCompilerBridge/MLModelRunner.h"
    #include "MLCompilerBridge/yourMLModelRunner.h"

    // Instantiate the required model runner with SerDes type
    std::unique_ptr<MLModelRunner> MLRunner =
        std::make_unique<yourModelRunner>(Arg, SerDes::Kind::yourSerDesType);
    // Process Input Features
    std::pair<std::string, InType> p = ...; // Input
    MLRunner->populateFeatures(p);
    // Get ML Advice / Output
    OutType advice = MLRunner->evaluate<OutType>();
    // Use the obtained advice
    ...
Listing 4. C++ APIs of ML-Compiler-Bridge

    import CompilerInterface as CI

    # Instantiate the required CompilerInterface with serdes type
    interface = CI.YourCompilerInterface(Arg, yourSerdesType)
    while True:
        # Send buffer data to compiler and wait for next request
        request = interface.evaluate()
        # Query model to get advice
        # Populates buffer with advice (serialized and stored in serdes)
        interface.populate_buffer(advice)
        # Break on condition
Listing 5. Python APIs of ML-Compiler-Bridge
interactions with an ML model. All the components are configured, compiled and linked during the regular build process of LLVM. Integration challenges range from redesigning
the entire framework of the original publication, to minor
changes to the communication mechanisms.
3.5 Error Checking and Recovery
The model runners and SerDes modules are designed to handle compiler/model crashes, communication failures, and
infinite loops. The failures are handled appropriately by allowing graceful termination of the processes. In the case
of gRPC, we have implemented an exponential backoff algorithm that retries after failures caused by
communication delays arising from network-related issues and packet losses. The communication fails
gracefully upon exhausting the number of retries. In all other
cases, we use a timeout-based mechanism for handling the
failure. These mechanisms proved invaluable in practical
experiments due to compiler bugs and network errors.
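The retry strategy for gRPC can be illustrated with a generic exponential backoff loop. This is a minimal sketch, not the library's implementation: the function `call_with_backoff`, its parameters and default values, and the `FlakyChannel` stand-in are all hypothetical, and a plain `ConnectionError` stands in for a transport failure.

```python
import time


def call_with_backoff(operation, max_retries=5, base_delay=0.01, max_delay=1.0,
                      sleep=time.sleep):
    """Retry `operation` on ConnectionError, doubling the delay after each failure."""
    delay = base_delay
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_retries:
                raise  # retries exhausted: let the caller terminate gracefully
            sleep(min(delay, max_delay))
            delay *= 2


class FlakyChannel:
    """Stand-in for a remote call that fails a fixed number of times, then succeeds."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def request(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("transient failure")
        return "advice"


# Record the delays instead of sleeping so the example runs instantly.
delays = []
channel = FlakyChannel(failures=3)
result = call_with_backoff(channel.request, sleep=delays.append)
```

Capping the delay with `max_delay` bounds the total waiting time, and re-raising after the last attempt leaves the decision on how to terminate to the caller.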
4.1 Phase Ordering of Optimization Passes
POSET-RL predicts the ordering sequence of passes to jointly
optimize code size along with execution time. An RL agent
is trained with the DDQN algorithm [VHGS16] to predict
a subsequence as action, given program embeddings as input observation. There are about 15 predetermined subsequences provided by the authors. The predicted optimization
subsequence is applied on the input program, and the embeddings corresponding to the transformed program are used
as the new observation. This process goes on until reaching
a threshold on the number of subsequences.
In the published version, the above process was not integrated within LLVM but driven from a Python model. An
LLVM-opt process was spawned, passing the optimization
sequence through a compiler flag for each prediction by the
agent. In addition, computing the embeddings involved spawning yet another
process to invoke IR2Vec on the .ll IR file generated by the
compiler. A similar strategy was in place for training.
We revisited the above using ML-Compiler-Bridge to
operate directly within LLVM as a new transformation pass.
Our new PosetRL implements a pass manager that applies
the predicted optimization sequence, and also generates the
next observation by invoking IR2Vec. The MLModelRunner
communicates with the model and serializes the data to be
transferred. The model communicates the predicted optimization subsequence as an integer ID (one among 15) to
PosetRL, and the $\mathbb{R}^{300}$ module-level embedding vectors are
sent to the model for the next prediction. Integrating with
the ONNX model runner only amounts to extending the
Environment class and overriding the step and reset methods.
We also override setDone() to signal the end of the episode
upon reaching the threshold.
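The interaction loop described in this subsection can be sketched as a toy environment. The sketch assumes a Python analog of the Environment interface: the class name `PhaseOrderingEnvSketch`, the method signatures, and the random placeholder observations are hypothetical, rewards are omitted, and no passes are actually run.

```python
import random


class PhaseOrderingEnvSketch:
    """Toy environment: actions are subsequence IDs, observations are embedding vectors."""

    NUM_SUBSEQUENCES = 15   # from the text: about 15 predetermined subsequences
    EMBEDDING_DIM = 300     # from the text: 300-D module-level embeddings

    def __init__(self, max_steps, seed=0):
        self.max_steps = max_steps      # threshold on the number of subsequences
        self.rng = random.Random(seed)
        self.applied = []
        self.done = False

    def _observe(self):
        # Placeholder for invoking IR2Vec on the transformed module.
        return [self.rng.random() for _ in range(self.EMBEDDING_DIM)]

    def reset(self):
        self.applied = []
        self.done = False
        return self._observe()

    def set_done(self):
        self.done = True

    def step(self, action):
        if not 0 <= action < self.NUM_SUBSEQUENCES:
            raise ValueError(f"unknown subsequence ID: {action}")
        self.applied.append(action)     # placeholder for running the passes
        if len(self.applied) >= self.max_steps:
            self.set_done()
        return self._observe(), self.done


env = PhaseOrderingEnvSketch(max_steps=4)
observation = env.reset()
while not env.done:
    action = len(env.applied) % env.NUM_SUBSEQUENCES  # stand-in for the agent's prediction
    observation, done = env.step(action)
```

Each `step` consumes one integer action and produces the next observation, so the data crossing the model-compiler boundary per round is one ID in one direction and one 300-element vector in the other.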
3.6 Compiler/ML Experts View
To use ML-Compiler-Bridge, developers need to invoke a
minimal set of APIs by instantiating the necessary model
runner with appropriate options specifying the SerDes type.
List. 4 illustrates this on an example of invoking a user-defined model runner with a user-defined SerDes from the
compiler. A similar API abstracting the communication and
SerDes in Python is provided (List. 5) to query the ML model
with inter-process model runners and respond to the compiler.
4 Use Cases: ML-LLVM optimizations
We integrated ML-Compiler-Bridge with four ML-based
compiler optimizations in LLVM: phase ordering [JAVU22],
loop distribution [JVA+ 22], register allocation [VJK+ 23] and
method inliner [TQB+ 21]. The first three optimizations are
built using RLLib [LLN+ 18a] with PyTorch [PGM+ 19] and
LLVM V10, using program embeddings called IR2Vec [VAJ+ 20].
The fourth optimization, inlining, uses TensorFlow [AAB+ 15],
is built within LLVM V17, and uses feature-based representations [TQB+ 21]. There are two ML-based register allocators [VJK+ 23, TQBL21] available for LLVM; we chose the
former because it emphasizes finer-grained, high-bandwidth
4.4 LLVM Inliner
The inliner pass traverses call sites in a bottom-up fashion,
one connected component of functions at a time. For a given
component a working queue is initialized with the set of all
static call sites. As the algorithm marks some call sites for
inlining, it appends the former callee’s call sites to the work
queue. The decision to inline or not is made in two steps. First,
it determines legality and whether the user provided any
guidance (always/never inline). Only if the operation is legal
and non-mandatory does a heuristic determine its profitability.
The decision is driven by a simple RL-based model. It takes
a number of scalar features characterizing the caller and callee (instruction counts, basic block counts, maximum loop
depth), the call site itself (the number of compile-time constant parameters), as well as module-wide features (the current number of functions and statically known call edges).
For the published version [TQB+ 21], the cost metric was size,
with no reliance on dynamic profile data. The implementation uses an AOT-compiled TensorFlow model for inference
with C++ APIs. We modularized it to use any model runner.
4.2 Loop Distribution for Vectorization and Locality
Jain et al. [JVA+ 22] improve loop distribution by modeling
SIMD parallelization and locality optimization opportunities. Their approach uses two RL agents with fully-connected networks to
identify the vertex processing order and when to distribute.
Along with these agents, a Gated Graph Neural Network
(GGNN) [LTBZ16] processes the connected components of
the dependence graph, where each node holds the embeddings for the corresponding instructions.
During training, a Python driver spawns a process to invoke the Loop Distribution pass. The RL model processes
the input graph and predicts the sequence of instructions
to be packed together as a loop. Upon applying the prediction, the rewards indicate the effectiveness of distribution.
All these steps involve model-compiler interaction via file
I/O. Inference itself is integrated with LLVM using Python
wrappers.
In this paper, we eliminate the need for Python wrappers,
file I/O and spawning new processes. The model runners internally (de-)serialize data depending on the chosen
SerDes and the MLModelRunner. For the runners that use serialization, the input graph is represented as key-value pairs,
and a variable-length matrix in $\mathbb{R}^{n \times 300}$ encodes the sequence
of $n$ 300-D instruction embeddings. The output takes the
form of a variable-length integer array with node identifiers
that are to be distributed.
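One way to transmit a variable-length matrix is to prefix the flattened values with their shape. The sketch below illustrates this for an $n \times 300$ embedding matrix and for a variable-length array of node identifiers; the byte layout (little-endian, 32-bit shape header, float32 values) and the function names are assumptions for illustration, not the format used by the library.

```python
import struct


def encode_embeddings(rows, dim=300):
    """Pack an n x dim matrix as: n (u32), dim (u32), then n*dim float32 values, row-major."""
    for row in rows:
        if len(row) != dim:
            raise ValueError("every instruction embedding must have `dim` entries")
    flat = [value for row in rows for value in row]
    return struct.pack("<II", len(rows), dim) + struct.pack(f"<{len(flat)}f", *flat)


def decode_embeddings(payload):
    """Inverse of encode_embeddings: read the shape header, then rebuild the rows."""
    n, dim = struct.unpack_from("<II", payload, 0)
    flat = struct.unpack_from(f"<{n * dim}f", payload, 8)
    return [list(flat[i * dim:(i + 1) * dim]) for i in range(n)]


def decode_node_ids(payload):
    """Decode a variable-length integer array: count (u32) followed by that many int32 IDs."""
    (count,) = struct.unpack_from("<I", payload, 0)
    return list(struct.unpack_from(f"<{count}i", payload, 4))


# Two instructions, each with a 300-D embedding (values chosen to be exact in float32).
rows = [[0.5] * 300, [1.0] * 300]
payload = encode_embeddings(rows)
decoded = decode_embeddings(payload)
```

Because $n$ travels in the header, the receiver needs no out-of-band knowledge of how many instructions the loop contains.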
5 Evaluation
We measure compilation time on an Intel Xeon SkyLake
W2133 with 6 cores, 12 threads and 32GB RAM. Training
time is measured on an Intel Xeon W1390P with 8 cores, 16
threads, 64GB RAM and an Nvidia 3060 GPU. We evaluate
POSET-RL, RL-LoopDistribution and RL4ReAl with gRPC,
Pipe and ONNX model runners and different SerDes options,
and take the median of 3 runs. Most experiments use SPEC
CPU 2006 and SPEC CPU 2017 benchmarks.
4.3 RL-Based Register Allocation
We also evaluate RL4ReAl, an RL-based register allocator implementing the splitting, coloring, and spilling sub-tasks as
separate RL agents on LLVM’s Machine IR. These RL agents
pose a formidable engineering challenge in interfacing the
model with the compiler during both training and inference.
Unlike other optimizations that need one single communication at the end, RL4ReAl involves multiple interleaved
communication rounds to obtain a new observation and let
the relevant agent make the next prediction. Also, the RL
agents are arranged hierarchically: the outcome of one agent
determines which agent would be invoked next. Unlike other
use cases, this optimization involves transferring an interference graph where each variable is associated with an $\mathbb{R}^{n \times 100}$
matrix (each one of the $n$ instructions in the live
range of the variable is represented in 100-D), a variable-length integer array to specify interferences and use points,
and a variable-length floating point array of spill weights.
Other metadata like function name, file name, and status are
also sent as string fields. The model returns key-value pairs
mapping variables to split or color decisions. Both training
and inference use gRPC and Protobuf serialization.
We will investigate different communication and serialization improvements in this paper, with specialized scenarios
for distributed training and deployment-friendly inference.
5.1 Impact on Deployment
Tab. 2 shows the POSET-RL compile time using different
model runners. Within the in-process runners, we use ONNX
for PyTorch models and RLLib. Overall, in-process runners
achieve better compile times in all cases in comparison with
any of the inter-process ones. Among the latter, gRPC has
higher compile times (6.8–7.6%) compared to pipes, with
JSON and bitstream SerDes. This is because of the overheads associated with establishing connections and invoking RPC methods. Pipes with Bitstream SerDes yield slightly
higher performance than JSON SerDes due to the lower (de)serialization overhead with bit streams. ONNXModelRunner
yields a 7.2× speedup with POSET-RL compared to the original method in Sec. 4.1 that involved spawning new processes
to invoke the compiler and other dependencies.
In-process model runners natively support multithreaded
compilation, while inter-process model runners necessitate
concurrently running multiple instances of the model resulting in a high memory and compute overhead. Tab. 3 shows
compile times with in-process model runners on LLVM Inliner and RL4ReAl optimizations by varying the degree of
parallelism. As LLVM Inliner and RL4ReAl respectively rely
Table 2. Compile time (in seconds) for POSET-RL.
          Original   gRPC    Pipe + JSON   Pipe + Bitstream   ONNX
SPEC06    5,829      1,318   1,236         1,227              1,140
SPEC17    10,342     1,221   1,141         1,132              1,093
times are shown in Fig. 5(b) for CPU and GPU workers. Using 10 workers with a GPU trainer takes about 2 seconds per
episode, while a CPU trainer with <10, 5, 1> workers takes
<4s, 8s, 15s> respectively. We obtained similar trends among
the workers even upon using pipes for communication.
Table 3. Multithreaded compile time with -O3 (in s) with in-process model runners. Compile time with gRPC is shown for RL4ReAl for comparison.
                            gRPC    1 Thread   2 Threads   4 Threads   8 Threads
LLVM Inliner (TF Runner)    -       596        501         361         307
RL4ReAl (ONNX Runner)       5,572   291        257         248         248
5.2.3 Using Different RL Policies. One may train and
deploy models with different RL policies without impacting the compiler. For this experiment, we evaluate RL4ReAl
with the different RL policies provided by RLlib. We perform hyperparameter tuning using Tune [LLN+ 18b]. We
trained the models with PPO [SWD+ 17], APPO [SWD+ 17],
and A2C [MBM+ 16] policies until convergence. On the SPEC
CPU 2017 benchmarks, this resulted in a 2% improvement on
average using the APPO policy. The PPO and A2C policies perform
similarly to the original paper.
on TensorFlow and PyTorch (and RLlib), we use TensorFlow
and ONNX model runners accordingly. In comparison to
the original gRPC based inference flow of RL4ReAl, the
ONNX runner reduces compile time by 22.4× and 19× using
8 threads and 1 thread respectively. Using RL4ReAl results
in a higher compile time, as it involves a larger number of
model-compiler interactions. This overhead is effectively reduced by using the model runners exposed by ML-Compiler-Bridge.
Similar trends are observed for RL-driven loop distribution [JVA+ 22] on TSVC [Bou] and the LLVM Test Suite [LO].
The ONNX model runner yields an improvement of 16× in
comparison to the original Python wrapper.
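The ratios quoted in this subsection can be recomputed from the compile times in Tab. 2 and Tab. 3. The short script below does so; summing the two SPEC suites before taking ratios is our assumption about how the POSET-RL figures were aggregated, chosen because it reproduces the quoted values.

```python
# Compile times in seconds, copied from Tab. 2 (POSET-RL) and Tab. 3 (RL4ReAl).
poset_rl = {
    "Original":         {"SPEC06": 5829, "SPEC17": 10342},
    "gRPC":             {"SPEC06": 1318, "SPEC17": 1221},
    "Pipe + JSON":      {"SPEC06": 1236, "SPEC17": 1141},
    "Pipe + Bitstream": {"SPEC06": 1227, "SPEC17": 1132},
    "ONNX":             {"SPEC06": 1140, "SPEC17": 1093},
}
rl4real = {"gRPC": 5572, "1 thread": 291, "8 threads": 248}


def total(runner):
    # Assumption: aggregate a runner's compile time by summing both benchmark suites.
    return sum(poset_rl[runner].values())


onnx_speedup = total("Original") / total("ONNX")
grpc_vs_json = 100 * (total("gRPC") / total("Pipe + JSON") - 1)
grpc_vs_bitstream = 100 * (total("gRPC") / total("Pipe + Bitstream") - 1)
rl4real_1t = rl4real["gRPC"] / rl4real["1 thread"]
rl4real_8t = rl4real["gRPC"] / rl4real["8 threads"]

print(f"POSET-RL, ONNX vs. original: {onnx_speedup:.2f}x")
print(f"POSET-RL, gRPC vs. pipes: +{grpc_vs_json:.2f}% (JSON), +{grpc_vs_bitstream:.2f}% (bitstream)")
print(f"RL4ReAl, ONNX vs. gRPC: {rl4real_1t:.2f}x (1 thread), {rl4real_8t:.2f}x (8 threads)")
```

Under this aggregation the script yields roughly 7.24× for ONNX, 6.8% and 7.6% for the gRPC overhead, and about 19.1× and 22.5× for RL4ReAl, in line with the figures in the text up to rounding.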
5.3 Round-Trip Time
Let us finally isolate the Round-Trip Time (RTT) of each
model runner as a limit study of the achievable communication throughput. We consider random floating point vectors
of increasing length ranging from 500 to 50K elements in
steps of 500. The model itself is a single fully-connected
layer that consumes the vector and returns a scalar float.
Fig. 5(c) shows the RTT of the whole process. The TF and
ONNX runners achieve a very high throughput, with total
RTTs of 21ms and 68ms respectively, while Pipes+JSON and
Pipes+Bitstream yield 3154ms and 772ms respectively, and
gRPC yields a larger RTT of 5948ms. These differences can
be attributed to the serialization and communication overhead. The TF and ONNX runners benefit from in-process
communication, proving to be suitable candidates for deployment. The higher throughput of TF is due to the AOT
precompiled model. The Pipe runner proves to be a good candidate for training on local machines. And the gRPC runner
provides support for training in a distributed environment.
This makes all the model runners important in their own way.
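A reduced version of this limit study can be sketched in a few lines. The harness below times only an in-process serialize/deserialize round trip for JSON and for a packed binary encoding, over the same vector lengths as in the text; it involves no model, no pipes and no gRPC, so its absolute numbers are not comparable to Fig. 5(c). The function names are hypothetical.

```python
import json
import random
import struct
import time


def json_round_trip(vector):
    # Text encoding: format every float as a string, then parse it back.
    return json.loads(json.dumps(vector))


def bitstream_round_trip(vector):
    # Binary encoding: copy each float64 as 8 raw bytes, then read them back.
    payload = struct.pack(f"<{len(vector)}d", *vector)
    return list(struct.unpack(f"<{len(vector)}d", payload))


def measure(round_trip, lengths, seed=0):
    """Return {length: seconds} for one serialize/deserialize round trip per vector length."""
    rng = random.Random(seed)
    timings = {}
    for n in lengths:
        vector = [rng.random() for _ in range(n)]
        start = time.perf_counter()
        echoed = round_trip(vector)
        timings[n] = time.perf_counter() - start
        assert echoed == vector  # the round trip must be lossless
    return timings


lengths = range(500, 50_001, 500)   # 500 to 50K elements in steps of 500, as in the text
json_total = sum(measure(json_round_trip, lengths).values())
bitstream_total = sum(measure(bitstream_round_trip, lengths).values())
print(f"JSON: {json_total * 1e3:.1f} ms, bitstream: {bitstream_total * 1e3:.1f} ms")
```

Isolating the serializer in this way separates the (de-)serialization cost from the transport cost, which the full RTT measurement necessarily combines.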
5.2 Impact on Training
In this section, we evaluate the effectiveness of ML-Compiler-Bridge during the training of POSET-RL and RL4ReAl. We
use inter-process model runners for training.
5.2.1 Training Time. Fig. 5(a) shows the cumulative training time and number of training iterations observed in POSET-RL. We obtain large improvements in the training time across
all the model runners. We see similar trends with gRPC and
Pipe, as explained in the previous experiment.
The original training process of POSET-RL involves spawning processes and takes ≈ 10Ks to complete 500 iterations.
In comparison, the gRPC model runner takes about 5.7Ks, while
the pipes with JSON and bitstream serialization options take
about 5.5Ks each. Throughout the iterations, we observe an
overhead of about 20s between JSON and bitstream serialization options. This minimal overhead is associated with
the additional serialization effort involved while using JSON
SerDes. Overall, using the inter-process model runners enables an end-to-end integration of the model and the compiler
during training, which yields a significant improvement.
5.4 Gym Integration
We carried out additional experiments to evaluate the benefits of our library in the context of a state-of-the-art RL
Gym. The two goals are to facilitate deployment and to reduce
compilation time by using in-process model runners. For
this purpose, we trained the pass ordering for code size of
CompilerGym [CWG+ 22] and exported the resulting model
in the ONNX format. We then used our ONNX model runner within LLVM to materialize predictions and generate
code. The inference times are shown in Fig. 6, with speedups
ranging from 2× to 13×. These are primarily due to gRPC
overheads in CompilerGym, as shown in Fig. 5(c).
5.2.2 Multi-Worker Support. ML-Compiler-Bridge supports multi-worker training on both CPUs and GPUs. To
support multiple workers while using gRPC, we expose a
method taking an array of ports to establish connections with
each worker. Similarly, multi-worker support with pipes is
enabled by instantiating one pair of pipes per worker. We extended RL4ReAl to handle multi-worker scenarios; training
5.5 Domain-Specific Compilers
Given LLVM’s dominance in the general-purpose and backend compiler landscape, it forms the natural basis for most
Figure 5. Performance characterization of model runners on different compilers and optimizations: (a) Training times of POSET-RL; (b) Training times of RL4ReAl with CPU/GPU multi-workers; (c) Microbenchmarking of individual Model Runners; (d) MLIR performance with different Model Runners; (e) Pluto performance.
Table 4. LOC to integrate model runners. gRPC shows LOC for API calls and RPC; values in parentheses indicate LOC in the Protobuf specification. Other SerDes do not need any additional code.
Optimizations         Original   gRPC        Pipe   ONNX
POSET-RL              -          3+3 (4)     3      3
RL-LoopDistribution   65         3+3 (5)     4      3
RL4ReAl               75         10+3 (28)   4      3
Figure 6. Compile times on using phase ordering for code size
model with CompilerGym and ONNX model runner
6.1 Lines of Code
In Tab. 4, we show the number of additional Lines of Code
(LOC) to integrate ML-Compiler-Bridge with different compiler optimizations. We observe a significant reduction in
LOC compared to the original published works.
We do not compare with the size of the published version of POSET-RL, as its model was not integrated with the
compiler. With Loop distribution and RL4ReAl, the effort of
writing Python wrappers and invoking protobuf and gRPC
is completely removed. Among the available model runners
and SerDes, only gRPC, ONNX and Protobuf involve (small)
additional code to handle RPC, environment, and Protobuf
messages. It is pertinent to note that ML-Compiler-Bridge
removes the tedious work of managing dependencies like
gRPC and Python wrappers, which was otherwise necessary.
ML/RL tools (Tab. 1). However, some ML-based domain-specific optimizations target higher-level frameworks like
MLIR [LAB+ 21] and the polyhedral compilers Pluto [BHRS08]
and PoCC [PBB10]. Let us illustrate the cases of MLIR and
Pluto.
Integration with MLIR. Given that end-to-end ML compilers based on MLIR are still undergoing rapid changes
[Tea23a], we designed a simple experiment to demonstrate
the integration of ML-Compiler-Bridge with MLIR. We
wrote a custom pass in MLIR to communicate data with a
dummy ML model to mimic a typical ML-compiler interaction. We use the same experimental setup as discussed
in Sec. 5.3 and measure the round-trip time. The results
are shown in Fig. 5(d). This opens up ML-based optimizations in MLIR-native compilers such as IREE and OpenXLA
[Tea23a], Triton [Tea23b], Polygeist [MCZZ21], and many
other frameworks.
6.2 Impact on binary size, compile time and memory
In Tab. 5, we show the compile time, binary size and average
resident set size (RSS) used during compilation of Clang
10 with/without ML-Compiler-Bridge. The difference in
binary size is ≈ 80KB, while the average RSS value differs
by 400KB with the release build time increasing only by a
few seconds. ML-Compiler-Bridge incurs only a negligible
overhead in terms of binary size, compile time and memory
upon statically linking with the production version of Clang.
Integration with Pluto. We also experimented with the
polyhedral source-to-source compiler Pluto. As Pluto is written in C, we use the C-APIs of ML-Compiler-Bridge for
interfacing the models, to illustrate the Pipe model runners
and SerDes. We measured round-trip time using different
SerDes and show the same in Fig. 5(e). This integration opens
new opportunities for ML-based polyhedral optimizations,
including autoscheduling and tile size selection.
Table 5. Comparison of the time taken to build Clang and of the final binary size with/without ML-Compiler-Bridge
Characteristics     Native Clang   Clang with ML-Compiler-Bridge
Compilation Time    5m 7s          5m 15s
Binary Size         102.79 MB      102.87 MB
Average RSS         1.5538 GB      1.5542 GB
6 Discussion
Let us now study the ease of integrating ML-Compiler-Bridge with compiler optimizations.
6.3 Additional dependencies
The current version of ML-Compiler-Bridge is implemented
in C++17 and Python 3.10. While Clang 17.X uses C++17,
Clang 10.X uses C++14. We updated the build system of Clang
10 to use C++17 and fixed the issues arising from the migration of earlier experiments on POSET-RL, RL-LoopDistribution,
and RL4ReAl. We were able to use Clang 17 for Inliner experiments. Though ML-Compiler-Bridge itself does not introduce any dependency, model runners do: gRPCModelRunner,
ONNXModelRunner, and ProtobufSerDes require gRPC, the
ONNX C++ Runtime, and Protobuf setups respectively.
7 Related Work
RL environments for compilers come closest to our work,
such as CompilerGym [CWG+ 22], PolyGym [BGC21], Supersonic [WTZ+ 22]. These primarily aim at facilitating research and reproducibility, which are only two of the broader
ambitions of our research (e.g., deployment, programmable
compiler interface, finer-grained interaction). CompilerGym
internally calls the compiler APIs from a C++ wrapper, and
the communication between the Python model and the wrapper is established by predefined gRPC methods. This limits
the functionality to only the APIs supported by the library
and a particular compiler version with which the library is
compatible. Supersonic [WTZ+ 22] also uses the CompilerGym way of interfacing via gRPC. And, to our understanding,
PolyGym [BGC21] does not provide a programmable compiler interface.
The gym libraries and ML-Compiler-Bridge solve different problems; the former facilitates research and training,
while our library aims to facilitate different interfaces for
communication. We envision ML-Compiler-Bridge to supplement these gym environments by providing a variety
of options for more diverse, finer-grained, and framework-independent interfacing of ML models with compilers, facilitating the transition from research to production.
6.4 Characterization
As discussed earlier, different model runners exhibit different characteristics. During deployment, neither of the
inter-process model runners offers multi-threaded compilation when running a single model instance. It could be done
by instantiating multiple model instances but this would
consume unreasonable amounts of memory. The in-process
model runners however do not face this problem. Though
there is a separate serialization overhead involved with gRPC
and pipe model runners, they are handled automatically
without the involvement of the developer. Due to the nature of inter-process communication, there is a possibility of
encountering communication errors arising from network
and compiler crashes. We handle such cases as explained in
Sec. 3.5. We summarize these characteristics in Tab. 6.
Table 6. Characteristics of different model runners
Characteristics                      gRPC   Pipes
Multithreaded Compilation            ✗      ✗
Distributed Training                 ✓      ✗
Need for separate model process      ✓      ✓
Autoserialization                    ✓      ✓
Communication Fidelity               ✗      ✗
ML Framework agnostic                ✓      ✓
Additional code by compiler writer   Y      N
Serialization Requirement            Y      Y
Time overhead                        Y      Y
ONNX   ✓ ✗ ✓ ✓ Y N
TF     ✓ ✗ ✓ ✗ N N
8 Conclusions
We present ML-Compiler-Bridge, a modular and extensible
library to integrate ML models within compiler optimizations. It provides inter/in-process model runners with different serialization options to support both training/deployment scenarios. We show that a model and compiler pass can
be integrated with only 3 lines of code, while also enabling
very deep interleaving of RL-based algorithms like RL4ReAl,
as well as leaner and production-friendly optimizations like
function inlining.
Our library exposes C++/C and Python APIs for integration with compilers and ML frameworks respectively.
We considered multiple ML frameworks (TensorFlow, PyTorch, RLlib), both feature-based and embedding-based representations, multiple compilers (and versions) written in
different languages to show versatility and suitability of ML-Compiler-Bridge in research and production environments.
We will open-source the library and artifacts with extensive
documentation.
6.5 Limitations
As mentioned earlier, not all model runners are compatible with all ML models due to the nature of the underlying
libraries. For instance, TensorFlow AOT compilation supports any TensorFlow or JAX model, but not PyTorch. Also,
upon exporting the inliner model from TensorFlow to ONNX,
we encountered an operator (TFL-Bucketize¹) that is not
supported by ONNX. To handle such cases, the ONNX runtime allows registering custom operators. Once exported,
the models can be used seamlessly without restriction.
Similarly, Protobuf does not natively support a C runtime.
Hence, our C APIs do not support using the gRPC model
runner with Protobuf serialization. The current TF AOT compilation generates C++ code, making it not directly usable
from C. This issue can be mitigated by using the TF
C-APIs instead of AOT models.
References
[AAB+ 15] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene
Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy
Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian
Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard,
Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath
Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga,
Sherry Moore, Derek Murray, Chris Olah, Mike Schuster,
1 https://www.tensorflow.org/mlir/tfl_ops#tflbucketize_tflbucketizeop
11
S. VenkataKeerthy et al.
[ABDS18]
[ABP+ 17]
[Ama20]
[AZLY19]
[BCP+ 16]
[BGC21]
[BHRS08]
[BNJH18]
[Bou]
[CFB+ 21]
[CMJ+ 18]
[CPWL17]
[CST02]
[CWG+ 22] Chris Cummins, Bram Wasti, Jiadong Guo, Brandon Cui,
Jason Ansel, Sahir Gomez, Somya Jain, Jia Liu, Olivier Teytaud, Benoit Steiner, Yuandong Tian, and Hugh Leather.
Compilergym: Robust, performant compiler optimization
environments for ai research. In 2022 IEEE/ACM International Symposium on Code Generation and Optimization
(CGO), pages 92ś105, 2022.
[DAK20] D Das, S A Ahmad, and V Kumar. Deep learning-based approximate graph-coloring algorithm for register allocation.
In Workshop on the LLVM Compiler Infrastructure in HPC,
pages 23ś32, 2020.
[dSKdSMa+ 21] Anderson Faustino da Silva, Bruno Conde Kind, José Wesley de Souza Magalhães, Jerônimo Nunes Rocha, Breno
Campos Ferreira Guimarães, and Fernando Magno Quintão
Pereira. Anghabench: A suite with one million compilable
c benchmarks for code-size reduction. In Proceedings of the
2021 IEEE/ACM International Symposium on Code Generation and Optimization, CGO ’21, page 378ś390. IEEE Press,
2021.
[FKM+ 11] Grigori Fursin, Yuriy Kashnikov, Abdul Wahid Memon,
Zbigniew Chamski, Olivier Temam, Mircea Namolaru,
Elad Yom-Tov, Bilha Mendelson, Ayal Zaks, Eric Courtois, François Bodin, Phil Barnard, Elton Ashton, Edwin V.
Bonilla, John Thomson, Christopher K. I. Williams, and
Michael F. P. O’Boyle. Milepost GCC: machine learning enabled self-tuning compiler. Int. J. Parallel Program.,
39(3):296ś327, 2011.
[Fla] FlatBuffers. https://flatbuffers.dev/index.html. [Online;
accessed 29-Aug-2022].
[GHJV95] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design patterns: elements of reusable object-oriented
software. Pearson Deutschland GmbH, 1995.
[HAAW+ 20] A Haj-Ali, N K. Ahmed, T Willke, Y S Shao, K Asanovic, and
I Stoica. Neurovectorizer: End-to-end vectorization with
deep reinforcement learning. In CGO’20, page 242ś255,
2020.
[HHAM+ 19] Q Huang, A Haj-Ali, W S Moses, J Xiang, I Stoica,
K Asanović, and J Wawrzynek. Autophase: Compiler phaseordering for hls with deep reinforcement learning. In International Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 308ś308, 2019.
[JAVU22] Shalini Jain, Yashas Andaluri, S. VenkataKeerthy, and Ramakrishna Upadrasta. POSET-RL: phase ordering for optimizing size and execution time using reinforcement learning. In International IEEE Symposium on Performance Analysis of Systems and Software, ISPASS 2022, Singapore, May
22-24, 2022, pages 121ś131. IEEE, 2022.
[JVA+ 22] Shalini Jain, S. VenkataKeerthy, Rohit Aggarwal,
Tharun Kumar Dangeti, Dibyendu Das, and Ramakrishna
Upadrasta. Reinforcement learning assisted loop distribution for locality and vectorization. In 2022 IEEE/ACM
Eighth Workshop on the LLVM Compiler Infrastructure in
HPC (LLVM-HPC), pages 1ś12, 2022.
[KPM22] Minsu Kim, Jeong-Keun Park, and Soo-Mook Moon. Solving pbqp-based register allocation using deep reinforcement learning. In Proceedings of the 20th IEEE/ACM International Symposium on Code Generation and Optimization,
CGO ’22, page 230ś241. IEEE Press, 2022.
[LA04] C Lattner and V Adve. Llvm: A compilation framework
for lifelong program analysis & transformation. CGO’04,
page 75, 2004.
[LAB+ 21] Chris Lattner, Mehdi Amini, Uday Bondhugula, Albert Cohen, Andy Davis, Jacques Pienaar, River Riddle, Tatiana Shpeisman, Nicolas Vasilache, and Oleksandr Zinenko. Mlir:
Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan,
Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous
systems, 2015. Software available from tensorflow.org.
Miltiadis Allamanis, Earl T Barr, Premkumar Devanbu, and
Charles Sutton. A survey of machine learning for big code
and naturalness. ACM Computing Surveys (CSUR), 51(4):81,
2018.
A. H. Ashouri, A Bignoli, G Palermo, C Silvano, S Kulkarni,
and J Cavazos. Micomp: Mitigating the compiler phaseordering problem using optimization sub-sequences and
machine learning. TACO, 14(3), 2017.
Saman P. Amarasinghe. Compiler 2.0: Using machine learning to modernize compiler technology. In Jingling Xue
and Changhee Jung, editors, Proceedings of the 21st ACM
SIGPLAN/SIGBED International Conference on Languages,
Compilers, and Tools for Embedded Systems, LCTES 2020,
London, UK, June 16, 2020, pages 1ś2. ACM, 2020.
U Alon, M Zilberstein, O Levy, and E Yahav. Code2vec:
Learning distributed representations of code. In POPL,
pages 40:1ś40:29, 2019.
Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas
Schneider, John Schulman, Jie Tang, and Wojciech Zaremba.
Openai gym, 2016.
Alexander Brauckmann, Andrés Goens, and Jeronimo Castrillon. Polygym: Polyhedral optimizations as an environment for reinforcement learning. In 2021 30th International
Conference on Parallel Architectures and Compilation Techniques (PACT), pages 17ś29, 2021.
Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. A practical automatic polyhedral parallelizer
and locality optimizer. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and
Implementation, PLDI ’08, page 101ś113, New York, NY,
USA, 2008. Association for Computing Machinery.
T Ben-Nun, A S Jakobovits, and T Hoefler. Neural code comprehension: A learnable representation of code semantics.
In NIPS’18, pages 3589ś3601, 2018.
Michael Boulton. TSVC_2. https://github.com/UoB-HPC/
TSVC_2.git. Accessed 2015-09-16.
Chris Cummins, Zacharias V. Fisches, Tal Ben-Nun, Torsten
Hoefler, Michael F. P. O’Boyle, and Hugh Leather. Programl:
A graph-based program representation for data flow analysis and compiler optimizations. In Marina Meila and Tong
Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021,
Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 2244ś2253. PMLR, 2021.
Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin
Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan
Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind
Krishnamurthy. Tvm: An automated end-to-end optimizing compiler for deep learning. In Proceedings of the 13th
USENIX Conference on Operating Systems Design and Implementation, OSDI’18, page 579ś594, USA, 2018. USENIX
Association.
Chris Cummins, Pavlos Petoumenos, Zheng Wang, and
Hugh Leather. Synthesizing benchmarks for predictive
modeling. In CGO. IEEE, 2017.
Keith D. Cooper, Devika Subramanian, and Linda Torczon.
Adaptive optimizing compilers for the 21st century. J.
Supercomput., 23(1):7–22, 2002.
Scaling compiler infrastructure for domain specific computation. In 2021 IEEE/ACM International Symposium on Code
Generation and Optimization (CGO), pages 2–14, 2021.
[Lan66] P. J. Landin. The next 700 programming languages. Commun. ACM, 9(3):157–166, March 1966.
[LF17] ONNX (Linux Foundation). ONNX: Open Neural Network
Exchange. https://github.com/onnx/onnx, 2017. [Online;
accessed 11-Mar-2023].
[LLN+ 18a] Eric Liang, Richard Liaw, Robert Nishihara, Philipp Moritz,
Roy Fox, Ken Goldberg, Joseph Gonzalez, Michael Jordan,
and Ion Stoica. RLlib: Abstractions for distributed reinforcement learning. In Jennifer Dy and Andreas Krause, editors,
Proceedings of the 35th International Conference on Machine
Learning, volume 80 of Proceedings of Machine Learning
Research, pages 3053–3062. PMLR, 10–15 Jul 2018.
[LLN+ 18b] Richard Liaw, Eric Liang, Robert Nishihara, Philipp Moritz,
Joseph E. Gonzalez, and Ion Stoica. Tune: A research platform for distributed model selection and training, 2018.
[LO] LLVM-Org. LLVM Test Suite. https://github.com/llvm/llvm-test-suite. Accessed 2021-08-25.
[LTBZ16] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard S.
Zemel. Gated graph sequence neural networks. In Yoshua
Bengio and Yann LeCun, editors, 4th International Conference on Learning Representations, ICLR 2016, San Juan,
Puerto Rico, May 2-4, 2016, Conference Track Proceedings,
2016.
[MBM+ 16] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza,
Alex Graves, Timothy Lillicrap, Tim Harley, David Silver,
and Koray Kavukcuoglu. Asynchronous methods for deep
reinforcement learning. In Maria Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1928–1937,
New York, New York, USA, 20–22 Jun 2016. PMLR.
[MCZZ21] William S. Moses, Lorenzo Chelini, Ruizhe Zhao, and Oleksandr Zinenko. Polygeist: Raising C to polyhedral MLIR. In
2021 30th International Conference on Parallel Architectures
and Compilation Techniques (PACT), pages 45–59, 2021.
[MYP+ 19] C Mendis, C Yang, Y Pu, S Amarasinghe, and M Carbin.
Compiler auto-vectorization with imitation learning. In
NeurIPS’19, volume 32, 2019.
[PBB10] Louis-Noel Pouchet, Cédric Bastoul, and Uday Bondhugula. PoCC: the polyhedral compiler collection.
https://web.cs.ucla.edu/~pouchet/software/pocc, 2010. Accessed
2023-10-25.
[PGM+ 19] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer,
James Bradbury, Gregory Chanan, Trevor Killeen, Zeming
Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison,
Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner,
Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An
imperative style, high-performance deep learning library.
In Advances in Neural Information Processing Systems 32,
pages 8024–8035. Curran Associates, Inc., 2019.
[Pro] Protocol Buffers. https://developers.google.com/protocol-buffers. [Online; accessed 29-Aug-2022].
[SA05] M. Stephenson and S. Amarasinghe. Predicting unroll factors using supervised classification. In CGO’05, pages 123–134, March 2005.
[SCWK13] D Simon, J Cavazos, C Wimmer, and S Kulkarni. Automatic
construction of inlining heuristics using machine learning.
In CGO’13, pages 1–12, 2013.
[SWD+ 17] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017.
[Tea23a] The IREE Team. https://github.com/openxla/iree, 2023.
Accessed 2023-11-13.
[Tea23b] The Triton Team. https://github.com/openai/triton, 2023.
Accessed 2023-11-13.
[TQB+ 21] Mircea Trofin, Yundi Qian, Eugene Brevdo, Zinan Lin,
Krzysztof Choromanski, and David Li. MLGO: a machine
learning guided compiler optimizations framework. CoRR,
abs/2101.04808, 2021.
[TQBL21] Mircea Trofin, Yundi Qian, Eugene Brevdo, and David
Li. RFC: MLGO Regalloc: learned eviction policy for
regalloc. https://lists.llvm.org/pipermail/llvm-dev/2021-November/153639.html,
2021. [Online; accessed 08-May-2022].
[VAJ+ 20] S. VenkataKeerthy, R Aggarwal, S Jain, M S Desarkar,
R Upadrasta, and Y. N. Srikant. IR2Vec: LLVM IR Based
Scalable Program Embeddings. ACM Trans. Archit. Code
Optim., 17(4), December 2020.
[VHGS16] Hado Van Hasselt, Arthur Guez, and David Silver. Deep
reinforcement learning with double q-learning. In Proceedings of the AAAI conference on artificial intelligence,
volume 30, 2016.
[VJK+ 23] S. VenkataKeerthy, Siddharth Jain, Anilava Kundu, Rohit Aggarwal, Albert Cohen, and Ramakrishna Upadrasta.
RL4ReAl: Reinforcement Learning for Register Allocation.
In CC 2023, pages 133–144, New York, NY, USA, 2023. Association for Computing Machinery.
[WO18] Zheng Wang and Michael O’Boyle. Machine learning in compiler optimization. Proceedings of the IEEE,
106(11):1879–1901, 2018.
[WTZ+ 22] Huanting Wang, Zhanyong Tang, Cheng Zhang, Jiaqi Zhao,
Chris Cummins, Hugh Leather, and Zheng Wang. Automating reinforcement learning architecture design for code
optimization. In Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction, CC 2022,
pages 129–143, New York, NY, USA, 2022. Association for
Computing Machinery.
[WZZ93] X Wang, H Zhao, and J Zhu. GRPC: A communication
cooperation mechanism in distributed systems. SIGOPS
Oper. Syst. Rev., 27(3):75–86, July 1993.