The Next 700 ML-Enabled Compiler Optimizations

2023, arXiv (Cornell University)

arXiv:2311.10800v1 [cs.PL] 17 Nov 2023 The Next 700 ML-Enabled Compiler Optimizations S. VenkataKeerthy Siddharth Jain Umesh Kalvakuntla IIT Hyderabad, India IIT Hyderabad, India IIT Hyderabad, India Pranav Sai Gorantla Rajiv Shailesh Chitale Eugene Brevdo IIT Hyderabad, India IIT Hyderabad, India Google DeepMind, USA Albert Cohen Mircea Trofin Ramakrishna Upadrasta Google DeepMind, France Google, USA IIT Hyderabad, India Abstract 2. Engineering objective-specific features [FKM+ 11], or extracting objective-independent program embeddings [AZLY19, BNJH18, VAJ+ 20, CFB+ 21], or a combination of both. 3. Setting up a training interface with the compiler, with examples ranging from communicating the output via compiler flags [JAVU22], offline file logs [HAAW+ 20], to generic gym APIs [BCP+ 16] and recent compiler-specific gym APIs like CompilerGym [CWG+ 22], PolyGym [BGC21] and Supersonic [WTZ+ 22]. 4. Finally, building and deploying the compiler with the trained model for inference. There is a growing interest in enhancing compiler optimizations with ML models, yet interactions between compilers and ML frameworks remain challenging. Some optimizations require tightly coupled models and compiler internals, raising issues with modularity, performance and framework independence. Practical deployment and transparency for the end-user are also important concerns. We propose MLCompiler-Bridge to enable ML model development within a traditional Python framework while making end-to-end integration with an optimizing compiler possible and efficient. We evaluate it on both research and production use cases, for training and inference, over several optimization problems, multiple compilers and its versions, and gym infrastructures. In most works, the process ends with step (3) and a simplified benchmark-oriented version of step (4) to evaluate the trained model. Indeed, while there exist a number of solutions for steps (1 & 2), a proper methodology based solutions for steps (3) & (4) that involve model-compiler interaction between have not yet been adequately addressed. The diversity of compiler optimizations and ML models is associated with an equally broad range of requirements for model-compiler interaction. In Tab. 1, we illustrate this on recent proposals. There exists multiple ML frameworks and even more types of ML models. A model’s input may be a plain floating point vector, or tensors of different ranks and shapes. Outputs range from a unique Boolean decision to complex data structures. These need to be communicated with the compiler; it may be only once for simple scenarios, or many times and involving large amounts of data for more intricate ones. And this may involve extensive source code modifications for the sole purpose of implementing the compiler-model interface. Some of these interactions have been explored in the literature and even landed in production; however, there does not exist a single generic method to address the vast diversity of scenarios that are imaginable and the trade-offs therein. Such a situation limits the scope, applicability and effectiveness of ML for compiler optimizations in the following ways: 1 Introduction With the success of Machine Learning (ML) models in various domains, there is a growing interest in applying ML to improve optimization heuristics in compilers [CST02, Ama20]. 
Several ML and Reinforcement Learning (RL) approaches have been proposed to improve optimizations like vectorization [HAAW+ 20, MYP+ 19], loop unrolling, distribution [SA05, JVA+ 22], function inlining [SCWK13, TQB+ 21], register allocation [DAK20, TQBL21, KPM22, VJK+ 23], prediction of phase sequences [ABP+ 17, HHAM+ 19, JAVU22], among many others [ABDS18, WO18]. More specifically, the widely used LLVM compiler [LA04] has support for RL-based inlining decisions from version 11, and RL-based eviction decisions in its register allocator from version 14 [TQBL21]. The title of our paper acknowledges this growing trend and anticipates the needs of the ML-enabled optimizations that are yet to come, in the spirit of Landis’ seminal paper [Lan66] on the diversity of existing and future programming languages. Setting up an ML-based compiler optimization is a challenging task. In addition to model design, it involves specialized data collection, compiler engineering, packaging: • Scalability: Integrating a Python model with C++ code using wrappers induces significant [JVA+ 22] compile time overhead: e.g. 6ל100×. • Integration: Not all optimizations are simple enough that the outputs of the model can be communicated using flags [VJK+ 23, JVA+ 22, KPM22, TQB+ 21]. As ML-based 1. Preparing or generating the data sets for training the model [CPWL17, dSKdSMa+ 21]. Intial version prepared on 1st Sep 2023. Revised on 14th Nov 2023. 1 S. VenkataKeerthy et al. Table 1. Diverse ML and RL requirements in previous work; unknown or unclear ones are left blank. Communication SLP Vectorization [MYP+ 19] Model Input Model Output NeuroVectorizer [HAAW+ 20] Pragmas in source Code2Vec vectors Register Allocation [DAK20] Register Allocation [KPM22] No integration Interference graph PBQP Graph Instructions to pack source code with pragma added in Python Coloured IG Allocated PBQP graph POSET-RL [JAVU22] Opt Flags IR2Vec vectors Pass sequence Loop Distribution [JVA+ 22] Inliner [TQB+ 21] RegAlloc Eviction [TQBL21] Python Wrappers IR2Vec vectors Precompiled TF model Features Precompiled TF model Features IG + node level gRPC embeddings RL4ReAl [VJK+ 23] LLVM IR Commn Freq Model Type Once, at the end Single Agent FCNN None Multiple times per episode Distribution sequence Once, at the end Yes/No Once, at the end Index of Live Range to Evict Once, at the end Multiple times Color map per episode optimizations grow in popularity, flag-based approaches become unwieldy. • Programmability: ML models are typically written in Python across different frameworks like TensorFlow, JAX, PyTorch, etc. Expecting the model to be written in C++ within the compiler is not ML developer-friendly. • Portability: Several proposals involve a tight coupling between the compiler and a specific ML framework; we however believe that a generic compiler infrastructure like LLVM should remain ML-framework-independent. #Agents ML Framework Single Agent GGNN Keras, RLLib NA LSTM TensorFlow Single Agent GCN, ResNet Pytorch Single Agent FCNN PyTorch Two agents Single Agent Single Agent Four agents; hierarchical GNN, FCNN FCNN FCNN PyTorch TensorFlow TensorFlow GNN, FCNN PyTorch, RLLib model runner and SerDes can be chosen based on the usage scenario and requirement, and these may differ during training and inference. Our library provides C++ and Python APIs to expose model runners and SerDes for integration with compilers and ML frameworks respectively. We show that the inter-process model runners effectively supports training. 
Once the model is trained, the in-process model runners provide interfacing of the model within the compiler in a transparent manner, with much lesser latency to aid in deployment. Besides, our both model runner and SerDes modules can be easily extended to support more forms of communication and serialization. Our library also provides C-APIs to aid in integration with C-based compiler infrastructures like Pluto, GCC, and SQLite. We evaluate ML-Compiler-Bridge on four ML-enabled optimizations in LLVM: RL-LoopDistribution, POSET-RL, RL4ReAl, and Inliner. We show that our library can be integrated with other compilers like Pluto [BHRS08] and MLIR [LAB+ 21] with minimal effort. We study the impact of communication and serialization options on compile time under different complex scenarios that the existing infrastructures could not handle. We conduct extensive evaluations to measure the overhead caused by each model runner and SerDes. We also study the impact of integrating MLCompiler-Bridge with LLVM in terms of additional dependencies, compile-time, and binary size overhead. Here are our contributions: The existing gym libraries primarily aim at facilitating model training for research and reproducibility by providing a high-level integration. For example, the recent CompilerGym [CWG+ 22] provides a high-level interface in the form of C++ wrapper methods outside the compiler to invoke out-of-tree compiler APIs to materialize the predicted actions. Such integration caters well to training certain interactions like Phase Ordering [JAVU22]. However, other optimizations like RegAlloc [VJK+ 23, KPM22, DAK20], loop distribution [JVA+ 22] and inlining [TQB+ 21] necessitate a deeper interfacing of the model within the compiler; with multiple rounds of interaction for both training and inference scenarios. Further, in these gym libraries, the inference flow is driven by Python: the compilation starts by invoking a Python process, breaking the isolation between the end user and the internal compiler algorithms; this limits deployment opportunities among other downsides. We discuss these issues in detail in Sec. 4. To address these shortcomings, we propose ML-CompilerBridge, a library that allows ML model development within a traditional Python framework while providing tightly coupled and efficient end-to-end integration with the compiler. Our library bridges the compiler and ML model by providing a suite of communication approaches (model runners) and the related (de-)serialization mechanisms (SerDes) to cater to diverse scenarios. It also provides support for both inter- and in-process communication by exposing different model runners: gRPC and named-pipes for the former, and the TensorFlow interface and ONNX for the latter. Diverse SerDes options based on Protobuf, JSON, and native bitstreams improve efficiency and versatility. The appropriate • We propose ML-Compiler-Bridge, a library to enable the deeper integration of ML models and the compiler in a framework-independent manner. • We provide a suite of two-inter- and two-in-process model runners, and three (de-)serialization mechanisms (SerDes) to support different interaction scenarios. • We provide multi-language user APIs: C++ and C APIs to interface model runners and serializers with compilers and Python APIs to interface inter-process model runners with ML frameworks. 
• We show that our library is easy to integrate with three different compilers spanning different representations, and carry out extensive evaluations on four ML-enabled optimizations on two versions of LLVM (V10, V17). 2 The Next 700 ML-Enabled Compiler Optimizations • We characterize the impact of each communication and serialization options on compilation and training times and analyze additional dependencies and other overheads. like domain-specific training or fine-tuning at deployment time. Since ML developers usually prefer developing models within a Python-based framework, the training process involving a C++ compiler infrastructure like LLVM requires a communication channel, typically inter-process, while catering to the needs of (de-)serializing data between the native types of C++ and Python. The distributed nature of training processes may also require extending communication beyond a single operating system node. 2 Background Input Input Input Program Program Program Opt 1 Opt i Compiler 0: Compilation starts Materialize Predictions 1: Input serialization Communication channel 2: Optimization Query ML Models 3: Input deserialization Training / Inference Inference. When focusing on inference/deployment, compile time and ease-of-use become crucial factors. The communication and serialization methods involved should take this into account, along with considering converting the Python model to a streamlined C++ implementation. These factors are true even for the simplest forms of communication, like one-time evaluations of the ML model and communicating via flags. Making the flow transparent to the user also requires a deeper, end-to-end integration with the compiler. There is no tool providing the necessary layers of abstraction between the three actors while supporting the required training and inference scenarios, not to mention ML-framework independence. Designing such a library and evaluating its suitability for diverse use cases is the challenge we tackle in this paper. 4: Querying 7: Output deserialization 6: Response 5:Output serialization Opt n Figure 1. ML-enabled compiler optimizations: (1) Inputs and other metadata required by the model are prepared in the appropriate format. (2) Serialized input is passed on to the model by a suitable communication channel. (3) Input is deserialized to appropriate format. (4) The model is queried to obtain optimization decisions as output. (5) Output is serialized, and (6) Sent back to the compiler optimization as a response. (7) The received response is deserialized, and optimization decisions are taken according to the output. ML-enabled Compiler Optimizations. The process of supporting or fully implementing optimization decisions with one or more ML models involves the steps shown in Fig. 1. This process repeats until the end of the compilation process for each ML-based optimization. The above scheme is generic enough to capture any optimization involving single or multiple ML models with multiple two-way interactions. For the cases that would need multiple interactions, steps (1)ś(7) are repeated until the final outcome. More broadly, there are three actors involved in developing and using such an ML-enabled compiler. (i) The Compiler expert who develops the compiler optimization, (ii) The ML expert who designs the ML model for the optimization problem, and (iii) The end-user who uses the compiler. 
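To make the seven-step loop of Fig. 1 concrete, the following minimal, self-contained C++ sketch walks a single optimization query through the serialize-communicate-deserialize round trip. The Channel class, the textual wire format, and the stubbed model reply are illustrative placeholders only; they stand in for whatever transport and SerDes mechanism an actual integration would choose.

#include <cstddef>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Placeholder transport: a real integration would use pipes, RPC, or an
// in-process call instead of this in-memory stub.
struct Channel {
  std::string query(const std::string &Request) {
    (void)Request;
    // Stubbed "model": always answer with a fixed decision identifier.
    return "decision:3";
  }
};

// Step 1: serialize the observation (here, a naive comma-separated format).
static std::string serialize(const std::vector<float> &Features) {
  std::ostringstream OS;
  for (size_t I = 0; I < Features.size(); ++I)
    OS << (I ? "," : "") << Features[I];
  return OS.str();
}

// Step 7: deserialize the model's response into a native decision.
static int deserialize(const std::string &Response) {
  return std::stoi(Response.substr(Response.find(':') + 1));
}

int main() {
  std::vector<float> Observation = {0.5f, 1.0f, -2.25f}; // Step 1
  Channel Ch;
  std::string Reply = Ch.query(serialize(Observation));  // Steps 2-6
  int Decision = deserialize(Reply);                      // Step 7
  std::cout << "apply optimization decision " << Decision << "\n";
  return 0;
}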
Ideally, compiler experts should use the ML models with minimal understanding of the internals/process specific to ML modeling and the framework on which the model is built to arrive at the result. Similarly, ML experts should instead design the models with minimal or no understanding of compiler internals, infrastructural details, and integration points, focusing on the optimization objectives and information flow. For the end-user, however, the presence of ML-compiler optimization should be transparent, and indistinguishable from the conventional (non-ML based) compilation process. To achieve this scheme of abstraction/segregation among all three actors, it is important to distinguish between the training and inference flows. 3 ML-Compiler-Bridge ML-Compiler-Bridge SerDes Compilers ProtobufSerDes JSONSerDes ML Models BitstreamSerDes LLVM Data-In GCC MLIR Pluto Data-Out MLModelRunner Inter-process In-process gRPCModelRunner ONNXModelRunner pipeModelRunner TFModelRunner Data-In PyTorch Keras TensorFlow JAX CompilerGym Data-Out PolyGym OpenAI Gym RLLib Stablebaselines3 Figure 2. The compiler instantiates a model runner and sets the input features to be used by the model. MLModelRunner internally invokes SerDes to serialize the data in one of the supported formats and query the model. The returned decision is deserialized and provided to the optimization. We propose an abstraction mechanism made of two main components: Serializer and Model Runner. The SerDes module (de-)serializes the data to/from the requested format, and the MLModelRunner module is responsible for communication with the model. The model runner obtains the serialized data, writes it to a communication channel, queries the model, and deserializes the output received from the model. ML-Compiler-Bridge exposes methods to be invoked by the user to interact with the model decoupled from serialization and communication. We provide three framework-independent model runners, gRPC, named-pipes, and ONNX, and one framework-specific TensorFlow model Training. Typically, training the ML model becomes part of compiler development and build-up, and inference becomes part of the compiler deployment and execution. However, occasionally this boundary may shift towards the user, 3 S. VenkataKeerthy et al. 1 2 3 4 5 6 7 8 9 10 class MLModelRunner { public : // Exposed to the user ; returns model 's output template < typename T > T evaluate () { return * reinterpret_cast <T * >(←↪ evaluateUntyped () ); } protected : // To be overridden by derived classes virtual void * evaluateUntyped () = 0; }; 1 2 3 service gRPCExample { // RPC function for training integration rpc getObservation ( Action ) returns (←↪ Observation ) {} 4 // Mandatory RPC function for model evaluation // Blocking call ; Released upon receiving a ←↪ response rpc getAdvice ( Observation ) returns ( Action ) {} 5 6 7 8 } Listing 2. Example gRPC function declaration Listing 1. Skeleton of MLModelRunner class (3) When input from the compiler is required, the model sends requests to the compiler with appropriate queries and waits for the response. (4) The compiler gets out of the blocked state and processes the query to generate an appropriate response. (5) The response is sent back to the client, and the model goes on to completing training on that input. runner. These can be combined with three different serializations: Protobuf, JSON, and bitstream. The modular design enables new forms of communication and serialization to be added by overriding a minimal set of methods. Fig. 
2 shows the components and interactions of ML-Compiler-Bridge. 3.1 ML Model Runners We provide two classes of model runners. The inter-process class provides the easiest mechanism to decouple Python models from a compiler running as a separate process. The inprocess class assumes that the ML Model is readily available in a compiled form and can be accessed within the compiler through a specific API. Clearly, in-process communication is designed with inference and deployment in mind, while inter-process communication enjoys more diverse use cases. Model runners may support simple ML queries and feedforward networks as well as more involved Reinforcement Learning (RL) algorithms or Graph Neural Networks (GNNs). Internally, MLModelRunner is the abstract base class from which the other model runners are derived (List. 1). It exposes two APIs: populateFeatures() populates the input features, and evaluate() queries the model. The latter returns the output of the model and is templated according to the expected output type. Internally, evaluate() invokes evaluateUntyped() that is to be overridden by the concrete model runner classes that derive from MLModelRunner. The MLModelRunner interfaces with the methods of SerDes using the populateFeatures() so as to serialize the inputs. The method populateFeatures() is implemented as a variadic function that uses variable-length key-value pairs as arguments. The key is a string identifier that describes the input, and the value is of template type. Inference follows the same steps, yet the compiler becomes the client and the model becomes the server so as to support a regular compilation process. 3.1.1 Inter-process Model Runners. gRPCModelRunner uses gRPC may run the model and compiler on different machines, and pipeModelRunner uses named pipes for singlemachine scenarios only. At training time, the compiler acts as a server and the Python-based ML model acts as a client. The sequence of steps is described as follows: (1) Compilation starts and the compiler listens for queries at the wait() call inserted at the point of interest. (2) The Python model starts training; this can be started concurrently with Step (1). Pipe Model Runner. As the name suggests, pipeModelRunner relies on named pipes for inter-process communication (the mkfifo system call). Pipes provide a simple and effective means of communication that is local to the machine without any network or security constraints. As pipes are unidirectional, the pipeModelRunner creates the read and write pipes for communication. The read pipe in the compiler obtains the data written by the model in Python, and the write pipe provides the data into the pipe that is read by the model on the other end. The evaluateUntyped gRPC Model Runner. gRPC [WZZ93] provides RPC methods specifying the type of input and output in Protobuf format [Pro]. During the build process of the library, the proto files are automatically translated to C++ and Python code by invoking the protoc compiler. An example is shown in List. 2. The generated code defines the Service class that exposes the RPC methods to be overridden by the user in the optimization that makes use of gRPCModelRunner. Due to design constraints of gRPC, we only support Protobuf serialization with gRPCModelRunner. gRPCModelRunner takes in the server address and the port number at which the connection is to be established. In training mode, gRPCModelRunner starts the server and starts listening for an RPC call invoked by the model. 
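The training-mode arrangement just described (the compiler acting as gRPC server, the Python model as client) can be sketched as follows. This is a hypothetical, simplified illustration rather than the library's actual gRPCModelRunner: it assumes the C++ stubs generated from the gRPCExample service of Listing 2, and it invents the message fields (action_id, features) that the listing leaves unspecified.

#include <grpcpp/grpcpp.h>
#include <memory>
#include <vector>
#include "grpc_example.grpc.pb.h" // hypothetical header generated from Listing 2's proto

// Compiler-side service for training: the Python agent calls getObservation()
// with its chosen Action, and the compiler replies with the next Observation.
class CompilerService final : public gRPCExample::Service {
  grpc::Status getObservation(grpc::ServerContext *Ctx, const Action *Act,
                              Observation *Obs) override {
    // Materialize the predicted action inside the compiler, then build the
    // next observation from the transformed IR (both elided here).
    applyAction(Act->action_id());
    for (float F : computeFeatures())
      Obs->add_features(F);
    return grpc::Status::OK;
  }

  grpc::Status getAdvice(grpc::ServerContext *Ctx, const Observation *Obs,
                         Action *Act) override {
    // Unused in training mode; inference reverses the client/server roles.
    return grpc::Status::OK;
  }

  void applyAction(int Id) { /* run the corresponding transformation */ }
  std::vector<float> computeFeatures() { return {0.0f, 1.0f}; }
};

int main() {
  CompilerService Service;
  grpc::ServerBuilder Builder;
  Builder.AddListeningPort("0.0.0.0:50051", grpc::InsecureServerCredentials());
  Builder.RegisterService(&Service);
  std::unique_ptr<grpc::Server> Server = Builder.BuildAndStart();
  Server->Wait(); // Step (1): block until the Python trainer sends queries.
  return 0;
}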
The overridden RPC method is directly called by the Python model to generate new observations by applying the action predicted by the model. In inference mode, gRPCModelRunner starts the gRPC connection at the given address and port. evaluateUntyped() is overridden to invoke the RPC method defined by the Python model after preparing the input data, andgetAdvice() serves as the RPC method for inference. 4 The Next 700 ML-Enabled Compiler Optimizations method is overridden to read and write into the pipe appropriately. read() is a blocking call forcing the compiler to wait till data is written by the model. Once the data is written, the model gets to a blocking state by invoking read() on the second pipe waiting for the response from the compiler. The pipe model runner ensures proper opening, closing, and clean up. pipeModelRunner provides a simpler interface for establishing communication as the user directly invokes evaluate() after setting the inputs. 1 2 3 4 5 6 7 8 9 10 11 3.1.2 In-process Model Runners. In-process model runners are designed to provide an effective means of compiler deployment. It is important to optimize the inference time as it adds up to the overall compile time. One may obtain significantly lower compile times by removing inter-process communication overhead, and by turning the complications of a compiled model into an advantage, by reducing the query time compared to models running in Python. Serialization/deserialization overhead is also lowered. 12 13 Listing 3. Snippet from ONNXModelRunner showing the environment-agent interaction to generate an observation ONNX Model Runner. The Open Neural Network Exchange [LF17] (ONNX) is an open format to represent machine learning models. Models built from various frameworks like TensorFlow, PyTorch, etc. can be represented in ONNX format in an interoperable manner. Additionally, it supports a wide variety of hardware architectures ranging from edge devices to general-purpose CPUs and GPUs. Once the model is trained in Python, it is converted into a common ONNX representation and is imported into the compiler via the ONNX runtime. ONNXModelRunner exposes the necessary wrapper APIs to read the ONNX model, query it with inputs and obtain outputs. ONNXModelRunner also RL models. Opt Pass dispatch ONNXModelRunner Agent ONNXModel evaluate() reset() observation computeAction() query() step() action modelOutput void * ONNXModelRunner :: evaluateUntyped () { Observation obs = this -> env -> reset () ; while ( true ) { Action action ; // current agent auto current_agent = this -> agents [ this -> env ←↪ -> getNextAgent () ]; action = current_agent -> computeAction ( obs ); obs = this -> env -> step ( action ); if ( this -> env -> checkDone () ) break ; } return nullptr ; } Figure 3. Sequence diagram indicating different events and the interaction between various classes for RL based optimization by ONNXModelRunner. Only the methods that highlighted are to be overridden by the user. Other methods are internal to the library. APIs internally. The sequence of events describing this interaction is shown in Fig. 3. ONNXModelRunner exposes the Environment class with APIs for standard step and reset operations along with setDone() API to indicate the end of the episode. step() returns the next observation given an action. Internally, the step operator applies the action predicted by the agent (model) to move on to the next state and returns the new observation from the environment. 
step() also signals if the terminal state is reached by invoking setDone() to stop the current prediction. reset() resets the environment to the initial state and returns the initial observation. Hence ONNXModelRunner involves the Reset operator first to obtain the initial observation. This sequence of APIs is invoked within the evaluateUntyped() of ONNXModelRunner and is shown in the Listing 3. The optimization pass using the ONNXModelRunner should inherit from Environment and override step() and reset() depending on the optimization requirements. ONNXModelRunner queries the model using the C++ APIs. A map containing the identifier of the agent (label) and the corresponding model path is passed while instantiating the ONNXModelRunner. In the case of multiple agents, the identifier of the next one to use is set by the Environment while returning the observation. ONNXModelRunner queries the corresponding agent with the observation to obtain the requested action. This process goes on untill Environment invokes setDone(). ONNXModelRunner for plain ML models. ONNXModelRunner can also be used to query non-RL models by directly invoking the evaluate method upon instantiating the object with the path to the ONNX model. ONNXModelRunner for RL. For RL, the agent is usually the learner trained to predict appropriate actions given the observations from the environment. Exporting a trained model to ONNX implies exporting only the agent. To facilitate RLbased interaction for a generic multi-agent scenario between the environment and the agents, ONNXModelRunner provides Environment and Agent classes separately and accesses the TensorFlow Model Runners. This is a framework-specific model runner built on the TensorFlow ahead-of-time (AOT) saved model. There are two implementations: (i) łRelease Mode Model Runnerž used in production environments, (ii) łModel Under Training Model Runnerž intended either for finetuning or when quickly evaluating candidate models and 5 S. VenkataKeerthy et al. parameters. TFLite is a scaled down TensorFlow interpreter designed to be embedded in native binaries, and can be used to further reduce overheads. The TensorFlow model runner uses the AOT saved model compiler which produces a header exposing the model as a C++ class, and a native object file with its implementation. The model runner reduces again to a simple adapter [GHJV95] around that class. The compiler binary does not expose new runtime dependencies as it is statically linked, and this simplifies its deployment. Note that the model compiler can be configured to generate code loading the weights from a file passed via the command line to the LLVM compiler. PipeModelRunner -OutStream : raw_fd_ostream* -evaluateUntyped() : void* -readNBytes(size_t) : void* gRPCModelRunner -stub_ : grpc::Stub* -server_address : string -exit_requested : promise -request : Request* -response : Response* -RunService(grpc::Service *s) : int -SetStub() : int -evaluateUntyped() : void* ProtobufSerDes -request : Message* -response : Message* +setFeature(string, T) : void +getSerializedData() : void* +deserialize(void*) : void* #cleanUp() : void ONNXModelRunner -evaluateUntyped() : void* MLModelRunner JSONSerDes -data : json::Object +setFeature(string, T) : void +getSerializedData() : void* +deserialize(void*) : void* #cleanUp() : void -desJSON(json::Value*) : void* JSON SerDes. JSONSerDes overrides the setFeature methods to populate the JSON buffer appropriately, given the key-value pairs. 
Similarly, the received data is deserialized by first converting it to a JSON object, then the JSON fields are casted to native types and returned. JSON SerDes is also transparent to the user. -compiledModel : TFGen -computeAction(Obs) : void -evaluateUntyped() : void* +getKind() : Kind +getSerializerKind() : BaseSerDes:Kind +evaluate() : T #evaluateUntyped() : void* +populateFeatures(KV var1, KV val2) : T Protobuf SerDes. Protobuf SerDes needs the user to provide the input and output data specifications in a proto file. These are compiled to generate the C++ and Python sources (Sec. 3.1.1). ProtobufSerDes serializes the input key-value pair by overriding the setFeature methods to set the appropriate fields of the message described in the proto file. Deserializing protobuf data to the native format only involves reading and returning the appropriate fields of the message. Except for providing the proto file, ProtobufSerDes is transparent to the user. TFModelRunner -env : Environment -agents : map<string, Agent*> #Serializer : BaseSerDes* +Kind : enum int supports (de)serializing in basic (int, float, double, string, bool) and compound (vector, list) data types. BaseSerDes +Kind : enum int 1 +getKind() : Kind +setFeature(string, T) : void +getSerializedData() : void* +deserialize(void*) : void* #cleanUp() : void Bitstream SerDes. The bitstream starts with a JSON header which specifies the key (identifier), type and shape of the tensors, and the order in which they will be serialized. Tensor values themselves are dumped as raw bytes. The received bitstream is interpreted based on the type and shape specified in the header and converted to native types. Processing the header induces negligible overhead if communicated data does not involve complex data types. Internally, BitstreamSerDes overrides the setFeature methods similar to the other SerDes to expose the functionality. Fig. 4 shows the class diagrams [GHJV95] of model runners and SerDes. BitstreamSerDes -tensorSpecs : vector<TensorSpec> -rawData : vector<void *> +setFeature(string, T) : void +getSerializedData() : void* +deserialize(void*) : void* #cleanUp() : void Figure 4. Class diagram of ML-Compiler-Bridge 3.2 SerDes: Serializer and Deserializer Module When data is transferred, specifically across two processes, it is important to convert data that is present in the native types (of C++ and Python) from one format to another. This is the purpose of (de-)serialization as implemented by the SerDes module. Internally, the MLModelRunner interacts with SerDes to (de-)serialize C++ native data to model-specific types and back. The choice of (de-)serialization depends on the optimization and ML model. We currently provide three options: bitstream, JSON, and Protobuf. They vary in terms of usage scenario, usage effort, and (de)serialization time. SerDes effectively abstracts away the underlying mechanism while providing the flexibility of different serialization options. 3.3 C-APIs We provide C wrappers around the C++ implementation to integrate with C-based compilers. These wrappers are C++ files written in C-style. Each method internally queries the original C++ implementation and returns results in a way compatible with C calling conventions. This code is built as a separate library that may be linked with a C-based compiler. We used it with the Pluto polyhedral compiler in particular. 3.4 Base SerDes. Internally, each SerDes is derived from the BaseSerDes class. SerDes uses key-value based serialization as described in Sec. 3.1. 
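For illustration, the following self-contained sketch implements the header-plus-raw-bytes layout described above for a single float tensor. The exact header schema and the setFeature plumbing of the real BitstreamSerDes may differ; this only conveys the general idea of key-value tensor serialization.

#include <cstring>
#include <iostream>
#include <string>
#include <vector>

// A small JSON header announcing the key, element type and shape of the
// tensor, followed by its raw bytes.
static std::string serializeTensor(const std::string &Key,
                                   const std::vector<float> &Values) {
  std::string Header = "{\"" + Key + "\": {\"type\": \"float\", \"shape\": [" +
                       std::to_string(Values.size()) + "]}}\n";
  std::string Buffer = Header;
  const char *Raw = reinterpret_cast<const char *>(Values.data());
  Buffer.append(Raw, Values.size() * sizeof(float));
  return Buffer;
}

// A real receiver would parse the type and shape from the header; here the
// element count is derived from the payload size for brevity.
static std::vector<float> deserializeTensor(const std::string &Buffer) {
  size_t HeaderEnd = Buffer.find('\n') + 1; // skip the JSON header
  size_t Count = (Buffer.size() - HeaderEnd) / sizeof(float);
  std::vector<float> Values(Count);
  std::memcpy(Values.data(), Buffer.data() + HeaderEnd, Count * sizeof(float));
  return Values;
}

int main() {
  std::string Wire = serializeTensor("embedding", {1.5f, -2.0f, 3.25f});
  for (float V : deserializeTensor(Wire))
    std::cout << V << " ";
  std::cout << "\n";
  return 0;
}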
The populateFeatures method of MLModelRunner invokes the appropriate version of the overloaded setFeature() exposed by BaseSerDes to serialize inputs. These methods are overridden by the SerDes classes that derive from BaseSerDes according to the underlying serializer. This class also exposes the deserialize method to deserialize the received data and is overridden by the derived classes to obtain the data in native types. Our library Extensions Both MLModelRunners and SerDes can be easily extended to support new model runners and serializers. New runners may include TVM [CMJ+ 18], ahead-of-time compiled PyTorch models and FlatBuffers [Fla], and serialization also supports YAML formats. New model runners can be contributed by inheriting MLModelRunner and overriding the evaluateUntyped method according to the model runner. Similarly, a new (de)serializer can be added by inheriting BaseSerDes and overriding the setFeature and deserialize methods specific to the new serializer. 6 The Next 700 ML-Enabled Compiler Optimizations 1 2 # include " MLCompilerBridge / MLModelRunner .h" # include " MLCompilerBridge / yourMLModelRunner .h" 1 3 4 5 6 7 8 9 10 11 12 3 // Instantiate the required model runner with ←↪ SerDes type MLModelRunner * MLRunner = std :: make_unique <←↪ yourModelRunner >( Arg , SerDes :: Kind ::←↪ yourSerDesType ); // Process Input Features std :: pair < std :: string , InType > p = ... // Input MLRunner -> populateFeatures (p); // Get ML Advice / Output OutType advice = MLRunner -> evaluate < OutType >() ; // Use the obtained advice ... import CompilerInterface as CI 2 4 5 6 7 8 9 10 11 # Instantiate the required CompilerInterface ←↪ with serdes type interface = CI . YourCompilerInterface ( Arg , ←↪ yourSerdesType ) while True : # Send buffer data to compiler and wait for ←↪ next request request = interface . evaluate () # Query model to get advice # Populates buffer with advice ( serialized ←↪ and stored in serdes ) interface . populate_buffer ( advice ) # Break on condition Listing 4. C++ APIs of ML-Compiler-Bridge 3.5 Listing 5. Python APIs of ML-Compiler-Bridge interactions with an ML model. All the components are configured, compiled and linked during the regular build process of LLVM. Integration challenges range from redesigning the entire framework of the original publication, to minor changes to the communication mechanisms. Error Checking and Recovery The model runners and SerDes modules are designed to handle compiler/model crashes, communication failures, and infinite loops. The failures are handled appropriately by allowing graceful termination of the processes. In the case of gRPC, we have implemented an exponential backoff algorithm to attempt retries to overcome the failures due to the delays in communication resulting from any networkrelated issues and packet losses. The communication fails gracefully upon exhausting the number of retries. In all other cases, we use a timeout based mechanism for handling the failure. These mechanisms proved invaluable in practical experiments due to compiler bugs and network errors. 3.6 4.1 Phase Ordering of Optimization Passes POSET-RL predicts the ordering sequence of passes to jointly optimize code size along with execution time. An RL agent is trained with the DDQN algorithm [VHGS16] to predict a subsequence as action, given program embeddings as input observation. There are about 15 predetermined subsequences provided by the authors. 
The predicted optimization subsequence is applied on the input program, and the embeddings corresponding to the transformed program are used as the new observation. This process goes on until reaching a threshold on the number of subsequences. In the published version, the above process was not integrated within LLVM but driven from a Python model. An LLVM-opt process was spawned, passing the optimization sequence through a compiler flag for each prediction by the agent. In addition, embeddings involve spawning yet another process to invoke IR2Vec on the .ll IR file generated by the compiler. A similar strategy was in place for training. We revisited the above using ML-Compiler-Bridge to operate directly within LLVM as a new transformation pass. Our new PosetRL implements a pass manager that applies the predicted optimization sequence, and also generates the next observation by invoking IR2Vec. The MLModelRunner communicates with the model and serializes the data to be transferred. The model communicates the predicted optimization subsequence as an integer ID (one among 15) to PosetRL, and the R300 module-level embedding vectors are sent to the model for the next prediction. Integrating with the ONNX model runner only amounts to extending the Environment class and overriding the step, reset methods. We also override setDone() to signal the end of the episode upon reaching the threshold. Compiler/ML Experts View To use ML-Compiler-Bridge, developers need to invoke a minimal set of APIs by instantiating the necessary model runner with appropriate options specifying the SerDes type. List. 4 illustrates this on an example of invoking a userdefined model runner with a user-defined SerDes from the compiler. A similar API abstracting the communication and SerDes in Python is provided (List. 5) to query the ML model with inter-process model runners and respond back. 4 Use Cases: ML-LLVM optimizations We integrated ML-Compiler-Bridge with four ML-based compiler optimizations in LLVM: phase ordering [JAVU22], loop distribution [JVA+ 22], register allocation [VJK+ 23] and method inliner [TQB+ 21]. The first three optimizations are built using RLLib [LLN+ 18a] with PyTorch [PGM+ 19] and LLVM V10, using program embeddings called IR2Vec [VAJ+ 20]. The fourth optimizationÐinliningÐuses TensorFlow [AAB+ 15], is built within LLVM V17, and uses feature-based representations [TQB+ 21]. There are two ML based register allocators [VJK+ 23, TQBL21] available for LLVM; we chose the former because it emphasizes finer-grained, high-bandwidth 7 S. VenkataKeerthy et al. 4.2 Loop Distribution for Vectorization and Locality 4.4 LLVM Inliner The inliner pass traverses call sites in a bottom-up fashion, one connected component of functions at a time. For a given component a working queue is initialized with the set of all static call sites. As the algorithm marks some call sites for inlining, it appends the former callee’s call sites to the work queue. The decision to inline or not is made in two steps. First, it determines legality and whether the user provided any guidance (always/never inline). Only if the operation is legal and non-mandatory, a heuristic determines its profitability. The decision is driven by a simple RL based model. 
It takes a number of scalar features characterizing both the caller/callee (instruction counts, basic block counts, maximum loop depth), the call site itself (the number of compile-time constant parameters), as well as module-wide features (the current number of functions and statically known call edges). For the published version [TQB+ 21], the cost metric was size, with no reliance on dynamic profile data. The implementation uses AOT compiled TensorFlow model for inference with C++ APIs. We modularized it to use any model runner. [JVA+ 22] Jain et al. improve loop distribution by modeling SIMD parallelization and locality optimization opportunities. It uses two RL agents with fully-connected networks to identify the vertex processing order and when to distribute. Along with these agents, a Gated Graph Neural Network (GGNN) [LTBZ16] processes the connected components of the dependence graph, where each node holds the embeddings for the corresponding instructions. During training, a Python driver spawns a process to invoke the Loop Distribution pass. The RL model processes the input graph and predicts the sequence of instructions to be packed together as a loop. Upon applying the prediction, the rewards indicate the effectiveness of distribution. All these steps involve model-compiler interaction via file I/O. Inference itself is integrated with LLVM using Python wrappers. In this paper, we eliminate the need for Python wrappers, file I/O and and spawning new processes. The model runners internally (de-)serialize data depending on the chosen SerDes and the MLModelRunner. For the runners that use serialization, the input graph is represented as key-value pairs, and a variable length matrix in R𝑛×300 encodes the sequence of 𝑛 300-D instruction embeddings. The output takes the form a variable-length integer array with node identifiers that are to be distributed. 4.3 5 Evaluation We measure compilation time on an Intel Xeon SkyLake W2133 with 6 cores, 12 threads and 32GB RAM. Training time is measured on an Intel Xeon W1390P with 8 cores, 16 threads, 64GB RAM and an Nvidia 3060 GPU. We evaluate POSET-RL, RL-LoopDistribution and RL4ReAl with gRPC, Pipe and ONNX model runners and different SerDes options, and take the median of 3 runs. Most experiments use SPEC CPU 2006 and SPEC CPU 2017 benchmarks. RL-Based Register Allocation We also evaluate RL4ReAl, an RL-based register allocator implementing the splitting, coloring, and spilling sub-tasks as separate RL agents on LLVM’s Machine IR. These RL agents pose a formidable engineering challenge in interfacing the model with the compiler during both training and inference. Unlike other optimizations that need one single communication at the end, RL4ReAl involves multiple interleaved communications rounds to obtain a new observation and let the relevant agent make the next prediction. Also them RL agents are arranged hierarchically: the outcome of one agent determines which agent would be invoked next. Unlike other use cases, this optimization involves transferring an interference graph where each variable is associated with a R𝑛×100 matrix, and where each one of the 𝑛 instructions in the live range of the variable is represented in 100-D, a variablelength integer array to specify interferences and use points, and a variable-length floating point array of spill weights. Other metadata like function name, file name, and status are also sent as string fields. The model returns key-value pairs mapping variables to split or color decisions. 
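As an illustration of the payload just described, the following hypothetical sketch assembles one register-allocation query as key-value tensors. The key names are invented, and the actual serialization and hand-off to the model runner (gRPC with Protobuf, as noted next) are elided.

#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Illustrative only: per-variable embedding matrices, interference lists,
// spill weights, and string metadata, grouped as key-value entries.
struct RegAllocQuery {
  std::map<std::string, std::vector<float>> FloatTensors;
  std::map<std::string, std::vector<int64_t>> IntTensors;
  std::map<std::string, std::string> Metadata;
};

int main() {
  RegAllocQuery Q;
  // A live range spanning 4 instructions, each embedded in 100 dimensions,
  // flattened row-major into a 4x100 buffer.
  Q.FloatTensors["vreg42_embedding"] = std::vector<float>(4 * 100, 0.0f);
  Q.IntTensors["vreg42_interferences"] = {7, 13, 58}; // interfering variables
  Q.FloatTensors["spill_weights"] = {0.75f, 2.5f, 0.1f};
  Q.Metadata["function"] = "foo";
  Q.Metadata["status"] = "split";

  // A real pass would serialize Q and query the model here; the reply maps
  // each variable to a split or color decision.
  std::cout << "prepared " << Q.FloatTensors.size() + Q.IntTensors.size()
            << " tensors for the model\n";
  return 0;
}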
Both training and inference use gRPC and Protobuf serialization. We will investigate different communication and serialization improvements in this paper, with specialized scenarios for distributed training and deployment-friendly inference.

5.1 Impact on Deployment

Tab. 2 shows the POSET-RL compile time using different model runners. Within the in-process runners, we use ONNX for PyTorch models and RLLib. Overall, in-process runners achieve better compile times in all cases in comparison with any of the inter-process ones. Among the latter, gRPC has higher compile times (6.8–7.6%) compared to pipes, with JSON and bitstream SerDes. This is because of the overheads associated with establishing connections and invoking RPC methods. Pipes with Bitstream SerDes yield slightly higher performance than JSON SerDes due to the lower (de)serialization overhead with bit streams. ONNXModelRunner yields a 7.2× speedup with POSET-RL compared to the original method in Sec. 4.1 that involved spawning new processes to invoke the compiler and other dependencies.

Table 2. Compile time (in seconds) for POSET-RL.

          Original   gRPC    Pipe + JSON   Pipe + Bitstream   ONNX
SPEC06     5,829     1,318      1,236           1,227         1,140
SPEC17    10,342     1,221      1,141           1,132         1,093

In-process model runners natively support multithreaded compilation, while inter-process model runners necessitate concurrently running multiple instances of the model, resulting in a high memory and compute overhead. Tab. 3 shows compile times with in-process model runners on the LLVM Inliner and RL4ReAl optimizations by varying the degree of parallelism. As the LLVM Inliner and RL4ReAl respectively rely on TensorFlow and PyTorch (and RLlib), we use the TensorFlow and ONNX model runners accordingly. In comparison to the original gRPC-based inference flow of RL4ReAl, the ONNX runner reduces compile time by 22.4× and 19× using 8 threads and 1 thread respectively. Using RL4ReAl results in a higher compile time, as it involves a larger number of model-compiler interactions. This overhead is effectively reduced by using the model runners exposed by ML-Compiler-Bridge. Similar trends are observed for RL-driven loop distribution [JVA+ 22] on TSVC [Bou] and the LLVM Test Suite [LO]. The ONNX model runner yields an improvement of 16× in comparison to the original Python wrapper.

Table 3. Multithreaded compile time with -O3 (in s) with in-process model runners. Compile time with gRPC is shown for RL4ReAl for comparison.

                            gRPC    1 Thread   2 Threads   4 Threads   8 Threads
LLVM Inliner (TF Runner)      -        596        501         361         307
RL4ReAl (ONNX Runner)       5,572      291        257         248         248

5.2 Impact on Training

In this section, we evaluate the effectiveness of ML-Compiler-Bridge during the training of POSET-RL and RL4ReAl. We use inter-process model runners for training.

5.2.1 Training Time. Fig. 5(a) shows the cumulative training time and number of training iterations observed in POSET-RL. We obtain large improvements in the training time across all the model runners. We see similar trends with gRPC and Pipe, as explained in the previous experiment. The original training process of POSET-RL involves spawning processes and takes ≈ 10Ks to complete 500 iterations. In comparison, the gRPC model takes about 5.7Ks, while the pipes with JSON and bitstream serialization options take about 5.5Ks each. Throughout the iterations, we observe an overhead of about 20s between the JSON and bitstream serialization options. This minimal overhead is associated with the additional serialization effort involved while using the JSON SerDes. However, using the inter-process model runners enables an end-to-end integration of the model and the compiler during training, and yields a significant improvement.

5.2.2 Multi-Worker Support. ML-Compiler-Bridge supports multi-worker training on both CPUs and GPUs. To support multiple workers while using gRPC, we expose a method taking an array of ports to establish connections with each worker. Similarly, multi-worker support with pipes is enabled by instantiating one pair of pipes per worker. We extended RL4ReAl to handle multi-worker scenarios; training times are shown in Fig. 5(b) for CPU and GPU workers. Using 10 workers with a GPU trainer takes about 2 seconds per episode, while a CPU trainer with <10, 5, 1> workers takes <4s, 8s, 15s> respectively. We obtained similar trends among the workers even upon using pipes for communication.

5.2.3 Using Different RL Policies. One may train and deploy models with different RL policies without impacting the compiler. For this experiment, we evaluate RL4ReAl with the different RL policies provided by RLlib. We perform hyperparameter tuning using Tune [LLN+ 18b]. We trained the models with PPO [SWD+ 17], APPO [SWD+ 17], and A2C [MBM+ 16] policies until convergence. On the SPEC CPU 2017 benchmarks, this resulted in a 2% improvement on average using the APPO policy. The PPO and A2C policies perform similarly to the original paper.

5.3 Round-Trip Time

Let us finally isolate the Round-Trip Time (RTT) of each model runner as a limit study of the achievable communication throughput. We consider random floating point vectors of increasing length ranging from 500 to 50K elements in steps of 500. The model itself is a single fully-connected layer that consumes the vector and returns a scalar float. Fig. 5(c) shows the RTT of the whole process. The TF and ONNX runners achieve a very high throughput with a total RTT of 21 and 68ms respectively; Pipes+JSON and Pipes+Bitstream yield 3154ms and 772ms respectively, and gRPC yields a larger RTT of 5948ms. These differences can be attributed to the serialization and communication overhead. The TF and ONNX runners benefit from in-process communication, proving to be suitable candidates for deployment. The higher throughput of TF is due to the AOT precompiled model. The Pipe runner proves to be a good candidate for training on local machines. And the gRPC runner provides support for training in a distributed environment. This makes all the model runners important in their own way.

5.4 Gym Integration

We carried out additional experiments to evaluate the benefits of our library in the context of a state-of-the-art RL gym. The two goals are to facilitate deployment and to reduce compilation time by using in-process model runners. For this purpose, we trained the pass ordering for code size model of CompilerGym [CWG+ 22] and exported the resulting model in the ONNX format. We then used our ONNX model runner within LLVM to materialize predictions and generate code. The inference times are shown in Fig. 6, with speedups ranging from 2× to 13×. These are primarily due to gRPC overheads in CompilerGym, as shown in Fig. 5(c).

Figure 6. Compile times on using the phase ordering for code size model with CompilerGym and the ONNX model runner.

5.5 Domain-Specific Compilers

Given LLVM's dominance in the general-purpose and backend compiler landscape, it forms the natural basis for most ML/RL tools (Tab. 1). However, some ML-based domain-specific optimizations target higher-level frameworks like MLIR [LAB+ 21] and the polyhedral compilers Pluto [BHRS08] and PoCC [PBB10]. Let us illustrate the cases of MLIR and Pluto.

Integration with MLIR. Given that end-to-end ML compilers based on MLIR are still undergoing rapid changes [Tea23a], we designed a simple experiment to demonstrate the integration of ML-Compiler-Bridge with MLIR. We wrote a custom pass in MLIR to communicate data with a dummy ML model to mimic a typical ML-compiler interaction. We use the same experimental setup as discussed in Sec. 5.3 and measure the round-trip time. The results are shown in Fig. 5(d). This opens up ML-based optimizations in MLIR-native compilers such as IREE and OpenXLA [Tea23a], Triton [Tea23b], Polygeist [MCZZ21], and many other frameworks.

Integration with Pluto. We also experimented with the polyhedral source-to-source compiler Pluto. As Pluto is written in C, we use the C-APIs of ML-Compiler-Bridge for interfacing the models, illustrating the Pipe model runners and SerDes. We measured the round-trip time using different SerDes and show the same in Fig. 5(e). This integration opens new opportunities for ML-based polyhedral optimizations, including autoscheduling and tile size selection.

Figure 5. Performance characterization of model runners on different compilers and optimizations: (a) training times of POSET-RL, (b) training times of RL4ReAl with CPU/GPU multi-workers, (c) microbenchmarking of individual model runners, (d) MLIR performance, (e) Pluto performance.

6.1 Lines of Code

In Tab. 4, we show the number of additional Lines of Code (LOC) to integrate ML-Compiler-Bridge with different compiler optimizations. We observe a significant reduction in LOC compared to the original published works. We do not compare with the size of the published version of POSET-RL, as its model was not integrated with the compiler. With Loop Distribution and RL4ReAl, the effort of writing Python wrappers and invoking protobuf and gRPC is completely removed. Among the available model runners and SerDes, only gRPC, ONNX and Protobuf involve (small) additional code to handle RPC, environment, and Protobuf messages. It is pertinent to note that ML-Compiler-Bridge removes the tedious work of managing dependencies like gRPC and Python wrappers which was otherwise necessary.

Table 4. LOC to integrate model runners. gRPC shows LOC for API calls and RPC; values in parentheses indicate LOC in the protobuf specification. Other SerDes do not need any additional code.

Optimizations          Original   gRPC        Pipe   ONNX
POSET-RL                  -       3+3 (4)      3      3
RL-LoopDistribution      65       3+3 (5)      4      3
RL4ReAl                  75       10+3 (28)    4      3

6.2 Impact on binary size, compile time and memory

In Tab. 5, we show the compile time, binary size and average resident set size (RSS) used during compilation of Clang 10 with/without ML-Compiler-Bridge. The difference in binary size is ≈ 80KB, while the average RSS value differs by 400KB, with the release build time increasing only by a few seconds. ML-Compiler-Bridge incurs only a negligible overhead in terms of binary size, compile time and memory upon statically linking with the production version of Clang.

Table 5.
Comparisons of time taken to build clang and final binary size with/without ML-Compiler-Bridge Characteristics Compilation Time Binary Size Average RSS 6 Discussion Native Clang Clang with ML-Compiler-Bridge 5m 7s 102.79 MB 1.5538 GB 5m 15s 102.87 MB 1.5542 GB 6.3 Additional dependencies The current version of ML-Compiler-Bridge is implemented in C++17 and Python 3.10. While Clang17.X uses C++17, Let us now study the ease of integrating ML-CompilerBridge with compiler optimizations. 10 The Next 700 ML-Enabled Compiler Optimizations 7 Related Work Clang10.X uses C++14. We updated the build system of Clang 10 to use C++17 and fixed the issues arising from the migration of earlier experiments on POSET-RL, RL-LoopDistribution, and RL4ReAl. We were able to use Clang 17 for Inliner experiments. Though ML-Compiler-Bridge itself does not introduce any dependency, model runners do: gRPCModelRunner, ONNXModelRunner, and ProtobufSerDes require gRPC, the ONNX C++ Runtime, and Protobuf setups respectively. 6.4 RL environments for compilers come closest to our work, such as CompilerGym [CWG+ 22], PolyGym [BGC21], Supersonic [WTZ+ 22]. These primarily aim at facilitating research and reproducibility, which are only two of the broader ambitions of our research (e.g., deployment, programmable compiler interface, finer-grained interaction). CompilerGym internally calls the compiler APIs from a C++ wrapper, and the communication between the Python model and the wrapper is established by predefined gRPC methods. This limits the functionality to only the APIs supported by the library and a particular compiler version with which the library is compatible. Supersonic [WTZ+ 22] also uses the CompilerGym way of interfacing via gRPC. And, to our understanding, PolyGym [BGC21] does not provide a programmable compiler interface. The gym libraries and ML-Compiler-Bridge solve different problems; the former facilitates research and training, while our library aims to facilitate different interfaces for communication. We envision ML-Compiler-Bridge to supplement these gym environments by providing a variety of options for more diverse, finer-grained, and frameworkindependent interfacing of ML models with compilers facilitating the transition from research to production. Characterization As discussed earlier, different model runners exhibit different characteristics. During deployment, neither of the inter-process model runners offer multi-threaded compilation upon running a single model instance. It could be done by instantiating multiple model instances but this would consume unreasonable amounts of memory. The in-process model runners however do not face this problem. Though there is a separate serialization overhead involved with gRPC and pipe model runners, they are handled automatically without the involvement of the developer. Due to the nature of inter-process communication, there is a possibility of encountering communication errors arising from network and compiler crashes. We handle such cases as explained in Sec. 3.5. We summarize these characteristics in Tab. 6. Table 6. 
Characteristics of different model runners Characteristics Multithreaded Compilation Distributed Training Need for separate model process Autoserialization Communication Fidelity ML Framework agnostic Additional code by compiler writer Serialization Requirement Time overhead 6.5 gRPC Pipes ✗ ✓ ✓ ✓ ✗ ✓ Y Y Y ✗ ✗ ✓ ✓ ✗ ✓ N Y Y ONNX TF 8 Conclusions ✓ ✗ ✓ ✓ Y N ✓ ✗ ✓ ✗ N N We present ML-Compiler-Bridge, a modular and extensible library to integrate ML models within compiler optimizations. It provides inter/in-process model runners with different serialization options to support both training/deployment scenarios. We show that a model and compiler pass can be integrated with only 3 lines of code, while also enabling very deep interleaving of RL-based algorithms like RL4ReAl, as well as leaner and production-friendly optimizations like function inlining. Our library exposes C++/C and Python APIs for integration with compilers and ML frameworks respectively. We considered multiple ML frameworks (TensorFlow, PyTorch, RLlib), both feature-based and embedding-based representations, multiple compilers (and versions) written in different languages to show versatility and suitability of MLCompiler-Bridge on research and production environments. We will open-source the library and artifacts with extensive documentation. Limitations As mentioned earlier, not all model runners are compatible with all ML models due to the nature of the underlying libraries. For instance, Tensorflow AOT compilation supports any Tensorflow or JAX model, but not PyTorch. Also, upon exporting the inliner model from TensorFlow to ONNX, we encountered an operator (TFL-Bucketize1 ) that is not supported by ONNX. To handle such cases, the ONNX runtime allows registering custom operators. Once exported, the models can be used seamlessly without restriction. Similarly, protobuf does not natively support C runtime. Hence, our C APIs do not support using the gRPC model runner with protobuf serialization. The current TF AOT compilation generates C++ code thereby making it not usable directly with C. This issue can be mitigated by using TF C-APIs instead of using AOT models. References [AAB+ 15] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, 1 https://www.tensorflow.org/mlir/tfl_ops#tflbucketize_tflbucketizeop 11 S. VenkataKeerthy et al. [ABDS18] [ABP+ 17] [Ama20] [AZLY19] [BCP+ 16] [BGC21] [BHRS08] [BNJH18] [Bou] [CFB+ 21] [CMJ+ 18] [CPWL17] [CST02] [CWG+ 22] Chris Cummins, Bram Wasti, Jiadong Guo, Brandon Cui, Jason Ansel, Sahir Gomez, Somya Jain, Jia Liu, Olivier Teytaud, Benoit Steiner, Yuandong Tian, and Hugh Leather. Compilergym: Robust, performant compiler optimization environments for ai research. In 2022 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pages 92ś105, 2022. [DAK20] D Das, S A Ahmad, and V Kumar. Deep learning-based approximate graph-coloring algorithm for register allocation. In Workshop on the LLVM Compiler Infrastructure in HPC, pages 23ś32, 2020. [dSKdSMa+ 21] Anderson Faustino da Silva, Bruno Conde Kind, José Wesley de Souza Magalhães, Jerônimo Nunes Rocha, Breno Campos Ferreira Guimarães, and Fernando Magno Quintão Pereira. 
7 Related Work

RL environments for compilers come closest to our work: CompilerGym [CWG+22], PolyGym [BGC21], and Supersonic [WTZ+22]. These primarily aim at facilitating research and reproducibility, which are only two of the broader ambitions of our work (e.g., deployment, a programmable compiler interface, finer-grained interaction). CompilerGym internally calls the compiler APIs from a C++ wrapper, and the communication between the Python model and the wrapper is established by predefined gRPC methods. This limits the functionality to the APIs supported by the library, and to the particular compiler version with which the library is compatible. Supersonic [WTZ+22] also uses the CompilerGym way of interfacing via gRPC. And, to our understanding, PolyGym [BGC21] does not provide a programmable compiler interface. The gym libraries and ML-Compiler-Bridge solve different problems: the former facilitate research and training, while our library provides a variety of interfaces for communication. We envision ML-Compiler-Bridge supplementing these gym environments by offering more diverse, finer-grained, and framework-independent interfacing of ML models with compilers, thereby facilitating the transition from research to production.

8 Conclusions

We present ML-Compiler-Bridge, a modular and extensible library to integrate ML models within compiler optimizations. It provides inter- and in-process model runners with different serialization options to support both training and deployment scenarios. We show that a model and a compiler pass can be integrated with only 3 lines of code, while also enabling very deep interleaving of RL-based algorithms like RL4ReAl, as well as leaner and production-friendly optimizations like function inlining. Our library exposes C++/C and Python APIs for integration with compilers and ML frameworks respectively. We considered multiple ML frameworks (TensorFlow, PyTorch, RLlib), both feature-based and embedding-based representations, and multiple compilers (and versions) written in different languages to show the versatility and suitability of ML-Compiler-Bridge in research and production environments. We will open-source the library and artifacts with extensive documentation.

References

[AAB+15] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[ABDS18] Miltiadis Allamanis, Earl T Barr, Premkumar Devanbu, and Charles Sutton. A survey of machine learning for big code and naturalness. ACM Computing Surveys (CSUR), 51(4):81, 2018.

[ABP+17] A. H. Ashouri, A Bignoli, G Palermo, C Silvano, S Kulkarni, and J Cavazos. Micomp: Mitigating the compiler phase-ordering problem using optimization sub-sequences and machine learning. TACO, 14(3), 2017.

[Ama20] Saman P. Amarasinghe. Compiler 2.0: Using machine learning to modernize compiler technology. In Jingling Xue and Changhee Jung, editors, Proceedings of the 21st ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems, LCTES 2020, London, UK, June 16, 2020, pages 1–2. ACM, 2020.

[AZLY19] U Alon, M Zilberstein, O Levy, and E Yahav. Code2vec: Learning distributed representations of code. In POPL, pages 40:1–40:29, 2019.

[BCP+16] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.

[BGC21] Alexander Brauckmann, Andrés Goens, and Jeronimo Castrillon. Polygym: Polyhedral optimizations as an environment for reinforcement learning. In 2021 30th International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 17–29, 2021.

[BHRS08] Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. A practical automatic polyhedral parallelizer and locality optimizer. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '08, pages 101–113, New York, NY, USA, 2008. Association for Computing Machinery.

[BNJH18] T Ben-Nun, A S Jakobovits, and T Hoefler. Neural code comprehension: A learnable representation of code semantics. In NIPS'18, pages 3589–3601, 2018.

[Bou] Michael Boulton. TSVC_2. https://github.com/UoB-HPC/TSVC_2.git. Accessed 2015-09-16.

[CFB+21] Chris Cummins, Zacharias V. Fisches, Tal Ben-Nun, Torsten Hoefler, Michael F. P. O'Boyle, and Hugh Leather. Programl: A graph-based program representation for data flow analysis and compiler optimizations. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 2244–2253. PMLR, 2021.

[CMJ+18] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. Tvm: An automated end-to-end optimizing compiler for deep learning. In Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation, OSDI'18, pages 579–594, USA, 2018. USENIX Association.

[CPWL17] Chris Cummins, Pavlos Petoumenos, Zheng Wang, and Hugh Leather. Synthesizing benchmarks for predictive modeling. In CGO. IEEE, 2017.

[CST02] Keith D. Cooper, Devika Subramanian, and Linda Torczon. Adaptive optimizing compilers for the 21st century. J. Supercomput., 23(1):7–22, 2002.

[CWG+22] Chris Cummins, Bram Wasti, Jiadong Guo, Brandon Cui, Jason Ansel, Sahir Gomez, Somya Jain, Jia Liu, Olivier Teytaud, Benoit Steiner, Yuandong Tian, and Hugh Leather. Compilergym: Robust, performant compiler optimization environments for ai research. In 2022 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pages 92–105, 2022.

[DAK20] D Das, S A Ahmad, and V Kumar. Deep learning-based approximate graph-coloring algorithm for register allocation. In Workshop on the LLVM Compiler Infrastructure in HPC, pages 23–32, 2020.

[dSKdSMa+21] Anderson Faustino da Silva, Bruno Conde Kind, José Wesley de Souza Magalhães, Jerônimo Nunes Rocha, Breno Campos Ferreira Guimarães, and Fernando Magno Quintão Pereira. Anghabench: A suite with one million compilable c benchmarks for code-size reduction. In Proceedings of the 2021 IEEE/ACM International Symposium on Code Generation and Optimization, CGO '21, pages 378–390. IEEE Press, 2021.

[FKM+11] Grigori Fursin, Yuriy Kashnikov, Abdul Wahid Memon, Zbigniew Chamski, Olivier Temam, Mircea Namolaru, Elad Yom-Tov, Bilha Mendelson, Ayal Zaks, Eric Courtois, François Bodin, Phil Barnard, Elton Ashton, Edwin V. Bonilla, John Thomson, Christopher K. I. Williams, and Michael F. P. O'Boyle. Milepost GCC: machine learning enabled self-tuning compiler. Int. J. Parallel Program., 39(3):296–327, 2011.

[Fla] FlatBuffers. https://flatbuffers.dev/index.html. [Online; accessed 29-Aug-2022].

[GHJV95] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design patterns: elements of reusable object-oriented software. Pearson Deutschland GmbH, 1995.

[HAAW+20] A Haj-Ali, N K. Ahmed, T Willke, Y S Shao, K Asanovic, and I Stoica. Neurovectorizer: End-to-end vectorization with deep reinforcement learning. In CGO'20, pages 242–255, 2020.

[HHAM+19] Q Huang, A Haj-Ali, W S Moses, J Xiang, I Stoica, K Asanović, and J Wawrzynek. Autophase: Compiler phase-ordering for hls with deep reinforcement learning. In International Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 308–308, 2019.

[JAVU22] Shalini Jain, Yashas Andaluri, S. VenkataKeerthy, and Ramakrishna Upadrasta. POSET-RL: phase ordering for optimizing size and execution time using reinforcement learning. In International IEEE Symposium on Performance Analysis of Systems and Software, ISPASS 2022, Singapore, May 22-24, 2022, pages 121–131. IEEE, 2022.

[JVA+22] Shalini Jain, S. VenkataKeerthy, Rohit Aggarwal, Tharun Kumar Dangeti, Dibyendu Das, and Ramakrishna Upadrasta. Reinforcement learning assisted loop distribution for locality and vectorization. In 2022 IEEE/ACM Eighth Workshop on the LLVM Compiler Infrastructure in HPC (LLVM-HPC), pages 1–12, 2022.

[KPM22] Minsu Kim, Jeong-Keun Park, and Soo-Mook Moon. Solving pbqp-based register allocation using deep reinforcement learning. In Proceedings of the 20th IEEE/ACM International Symposium on Code Generation and Optimization, CGO '22, pages 230–241. IEEE Press, 2022.

[LA04] C Lattner and V Adve. Llvm: A compilation framework for lifelong program analysis & transformation. CGO'04, page 75, 2004.

[LAB+21] Chris Lattner, Mehdi Amini, Uday Bondhugula, Albert Cohen, Andy Davis, Jacques Pienaar, River Riddle, Tatiana Shpeisman, Nicolas Vasilache, and Oleksandr Zinenko. Mlir: Scaling compiler infrastructure for domain specific computation. In 2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pages 2–14, 2021.

[Lan66] P. J. Landin. The next 700 programming languages. Commun. ACM, 9(3):157–166, mar 1966.

[LF17] ONNX (Linux Foundation). ONNX: Open Neural Network Exchange. https://github.com/onnx/onnx, 2017. [Online; accessed 11-Mar-2023].

[LLN+18a] Eric Liang, Richard Liaw, Robert Nishihara, Philipp Moritz, Roy Fox, Ken Goldberg, Joseph Gonzalez, Michael Jordan, and Ion Stoica. RLlib: Abstractions for distributed reinforcement learning. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3053–3062. PMLR, 10–15 Jul 2018.

[LLN+18b] Richard Liaw, Eric Liang, Robert Nishihara, Philipp Moritz, Joseph E. Gonzalez, and Ion Stoica. Tune: A research platform for distributed model selection and training, 2018.

[LO] LLVM-Org. LLVM Test Suite. https://github.com/llvm/llvm-test-suite. Accessed 2021-08-25.

[LTBZ16] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard S. Zemel. Gated graph sequence neural networks. In Yoshua Bengio and Yann LeCun, editors, 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016.

[MBM+16] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Maria Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1928–1937, New York, New York, USA, 20–22 Jun 2016. PMLR.

[MCZZ21] William S. Moses, Lorenzo Chelini, Ruizhe Zhao, and Oleksandr Zinenko. Polygeist: Raising c to polyhedral mlir. In 2021 30th International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 45–59, 2021.

[MYP+19] C Mendis, C Yang, Y Pu, S Amarasinghe, and M Carbin. Compiler auto-vectorization with imitation learning. In NeurIPS'19, volume 32, 2019.

[PBB10] Louis-Noel Pouchet, Cédric Bastoul, and Uday Bondhugula. PoCC: the polyhedral compiler collection. https://web.cs.ucla.edu/~pouchet/software/pocc, 2010. Accessed 2023-10-25.

[PGM+19] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.

[Pro] Protocol Buffers. https://developers.google.com/protocol-buffers. [Online; accessed 29-Aug-2022].

[SA05] M. Stephenson and S. Amarasinghe. Predicting unroll factors using supervised classification. In CGO'05, pages 123–134, March 2005.

[SCWK13] D Simon, J Cavazos, C Wimmer, and S Kulkarni. Automatic construction of inlining heuristics using machine learning. In CGO'13, pages 1–12, 2013.

[SWD+17] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017.

[Tea23a] The IREE Team. https://github.com/openxla/iree, 2023. Accessed 2023-11-13.

[Tea23b] The Triton Team. https://github.com/openai/triton, 2023. Accessed 2023-11-13.

[TQB+21] Mircea Trofin, Yundi Qian, Eugene Brevdo, Zinan Lin, Krzysztof Choromanski, and David Li. MLGO: a machine learning guided compiler optimizations framework. CoRR, abs/2101.04808, 2021.

[TQBL21] Mircea Trofin, Yundi Qian, Eugene Brevdo, and David Li. RFC: MLGO Regalloc: learned eviction policy for regalloc. https://lists.llvm.org/pipermail/llvm-dev/2021-November/153639.html, 2021. [Online; accessed 08-May-2022].

[VAJ+20] S. VenkataKeerthy, R Aggarwal, S Jain, M S Desarkar, R Upadrasta, and Y. N. Srikant. IR2Vec: LLVM IR Based Scalable Program Embeddings. ACM Trans. Archit. Code Optim., 17(4), December 2020.

[VHGS16] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.

[VJK+23] S. VenkataKeerthy, Siddharth Jain, Anilava Kundu, Rohit Aggarwal, Albert Cohen, and Ramakrishna Upadrasta. RL4ReAl: Reinforcement Learning for Register Allocation. In CC 2023, pages 133–144, New York, NY, USA, 2023. Association for Computing Machinery.

[WO18] Zheng Wang and Michael O'Boyle. Machine learning in compiler optimization. Proceedings of the IEEE, 106(11):1879–1901, 2018.

[WTZ+22] Huanting Wang, Zhanyong Tang, Cheng Zhang, Jiaqi Zhao, Chris Cummins, Hugh Leather, and Zheng Wang. Automating reinforcement learning architecture design for code optimization. In Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction, CC 2022, pages 129–143, New York, NY, USA, 2022. Association for Computing Machinery.

[WZZ93] X Wang, H Zhao, and J Zhu. Grpc: A communication cooperation mechanism in distributed systems. SIGOPS Oper. Syst. Rev., 27(3):75–86, jul 1993.