
AutoRL X: Automated Reinforcement Learning on the Web

Published: 16 December 2024

Abstract

Reinforcement Learning (RL) is crucial in decision optimization, but its inherent complexity often presents challenges in interpretation and communication. Building upon AutoDOViz—an interface that pushed the boundaries of Automated RL for Decision Optimization—this article unveils an open-source expansion with a web-based platform for RL. Our work introduces a taxonomy of RL visualizations and launches a dynamic web platform, leveraging a flexible backend for AutoRL frameworks like ARLO, and Svelte.js for a smooth, interactive front-end user experience. Since AutoDOViz is not open-source, we present AutoRL X, a new interface designed to visualize RL processes. AutoRL X is shaped by the extensive user feedback and expert interviews from the AutoDOViz studies, and it brings forth an intelligent interface with real-time, intuitive visualization capabilities that enhance understanding, collaborative efforts, and personalization of RL agents. Addressing the gap in accurately representing complex real-world challenges within standard RL environments, we demonstrate our tool’s application in healthcare, specifically optimizing brain stimulation trajectories. A user study contrasts the performance of human users optimizing electric fields via a 2D interface with the behavior of RL agents that we visually analyze in AutoRL X, assessing the practicality of automated RL. All our data and code are openly available at: https://github.com/lorifranke/autorlx.

1 Introduction

Reinforcement learning (RL) stands as a cornerstone in the advancement of artificial intelligence (AI), holding potential application in a multitude of domains through its ability to learn and optimize from interaction with arbitrary environments. The emerging field of RL is witnessing a rising demand for intuitive visual tools, specifically for understanding, validating and also increasing trust in machine learning (ML). These tools are crucial in bridging the gap between RL algorithms and their practical application, enabling practitioners and those new to the field to understand, interpret, and interact with RL systems more effectively to solve complex problems.
In light of this need, we build on prior work in AutoDOViz [69], which exemplifies advancements being made toward accessible and insightful RL visualization, empowering users to explore the complexities of RL processes by following the human-within-the-loop approach. With AutoDOViz, we introduced a novel approach to decision optimization (DO) with RL to automate complex decision-making processes. By integrating a user-friendly interface with powerful visualizations, the system facilitates understanding of RL algorithms, making them more approachable for practitioners across various fields. AutoDOViz also accelerates the adoption of DO techniques in real-world scenarios by providing actionable insights and collaborative exploration. Through its innovative features, AutoDOViz set a new standard for interactive, human-centered decision automation for practical applications. Despite such advancements, the reality remains that many existing RL environments and simulations fall short of addressing the nuanced challenges posed by real-world applications. Furthermore, as more domain experts from various fields incorporate ML methods, furthering access to and applicability of RL seems to bear particular potential. A symbiotic relationship between human insight and machine efficiency emerges: while machines excel in optimization and computational tasks, human oversight and intervention ensure relevance and applicability to real-world scenarios.
In this work the following contributions are presented:
First, we strive to democratize RL technology by leveraging open-source frameworks, thereby offering fully accessible code to the community. The spirit of open source is embodied in the AutoRL X platform, ensuring a flexible architecture that can integrate with a variety of back-end engines. We aim to offer an RL experience on the web, making it straightforward and easy to use for a broader audience. Second, we address shortcomings in user interface design identified in the AutoDOViz paper and its user studies. Third, through a detailed use case in healthcare, we conduct a comprehensive user study with tasks in an interactive simulation component. Fourth, we show AutoRL X’s practical impact, for which we derive an RL environment to investigate with AutoRL X. With detailed analysis, we wish to demonstrate the real-world applicability of the proposed platform outside the data science domain.
In essence, our work represents a significant leap forward from previous studies, harnessing the collective advancements in open-source technology and user interface design to bring automated RL closer to solving tangible, complex real-world problems in various fields. The structure and workflow of this article are presented in Figure 2, mapping out the journey and development of our system from theoretical framework to practical application in the dynamic landscape of RL.
Fig. 1.
Fig. 1. RL agent evaluation in AutoRL X.
Fig. 2.
Fig. 2. Workflow: From AutoDOViz with its three different user studies we distill the requirements described in Section 3.2. Next, we design our prototype for AutoRL X. Lastly, we conduct a user study for a real-world use case that we then map to an RL environment.

2 Related Work

RL and Automated ML (AutoML). RL is an influential method in ML for tackling sequential decision-making problems by training autonomous agents [32, 59]. These trained models, often termed agents, operate within specific environments and perform predefined actions. The concept of RL offers a normative explanation deeply grounded in the psychological and neuroscientific viewpoint [56], elucidating how agents can enhance their mastery of an environment. Employing a trial-and-error methodology, RL aims to derive the best policy or behavioral approach [3]. The algorithm selects optimal actions based on feedback, typically environmental rewards, to determine which actions benefit a particular state. In recent years, a variety of algorithms and models have been proposed, such as deep Q-networks (DQNs) [43, 45], deep deterministic policy gradient (DDPG) [39], soft actor-critic (SAC) [23], and proximal policy optimization (PPO) [55], to name but a few. RL’s effectiveness stems from its capability to navigate decision-making in unpredictable settings. This approach finds utility in diverse domains such as autonomous driving [67], robotics [40, 48], health, finance [51, 72], smart grids [75], gaming [44], space exploration [35], and pedagogy [9], among others [38]. Further, RL successfully handles challenges like multi-stage inventory management under demand fluctuations or autonomous manufacturing tasks with resource constraints. It not only minimizes redundant human engagement, such as system calibration or monitoring, but also ensures rapid system adaptability [54]. The primary objective of RL agents is to increase the overall expected reward by discerning the most suitable policy.
Recent advancements in data science and ML have streamlined critical workflows, such as data cleaning, feature selection, model training, and hyperparameter optimization (HPO) [33, 34, 36, 66, 76]. While many innovations address specific steps in this process, a fully automated and user-friendly approach remains rare. In this landscape, AutoML has emerged as a promising solution, extensively studied for its capability to enable ML with minimal human intervention [25]. Growing interest in fully automating the AI lifecycle has led to the concept of automated AI (AutoAI), a term often used interchangeably with AutoML. AutoML’s objective is to automate the entire ML pipeline, from initial data pre-processing to the deployment of a fully trained and evaluated model. One of the most intricate tasks in AutoML, HPO, traditionally demands considerable expertise from data scientists and is highly dependent on the dataset. Consequently, automated HPO has become a focus within AutoML research, striving to minimize the necessity for deep AI knowledge or statistical skills, thus democratizing the creation of ML solutions [20]. This not only empowers specialists in various fields to engage with ML but also liberates data scientists from repetitive tasks, allowing them to dedicate more effort to enhancing visualization strategies—a critical aspect of the interpretability of ML models [68, 70]. The market response to these advancements includes a variety of commercial platforms like Amazon SageMaker [12], Azure AutoML [46], H2O AutoML [37], Google AutoML [5], and IBM AutoAI [57, 62, 70], each offering unique features tailored to different aspects of the ML process. Meanwhile, open-source tools such as AutoKeras [31], TPOT [49], Auto-Sklearn [17], and LALE [26] are expanding the accessibility and flexibility of AutoML for a broader audience. These automation principles extend into RL, where automated reinforcement learning (AutoRL) seeks to automate the RL pipeline. RL’s performance is notably sensitive to the correct tuning of hyperparameters; thus, AutoRL focuses on adjusting these parameters automatically, which is essential for the progress of RL research and application. Despite its potential, AutoRL is not without its own set of hurdles and challenges, signaling an ongoing and dynamic field of study [10, 16, 18, 50, 58]. At the forefront of this initiative is ARLO [47], a pioneering framework for AutoRL presented in 2022. ARLO is a Python library for automating all stages of an RL pipeline. Unlike many other AutoML libraries, ARLO is not tied to specific RL algorithms, making it versatile and extensible. It further strengthens its foundation by relying on recognized open-source platforms and libraries like OpenAI Gym [6] and DeepMind’s MuJoCo [60]. Further libraries for automated RL include, for example, Alibaba’s EasyReinforcementLearning1 or IBM’s AutoDO [41].
Explainable AI (XAI) and Visualization. XAI is essential in integrating advanced AI technologies such as computer vision and natural language processing into the corporate sphere. Significant contributions to this field include researching how visual tools can help demystify AI models and laying down fundamental terminologies and concepts. Critical surveys in the literature [1, 27] underline the importance of visualization for comprehending and explaining AI algorithms. Within RL, XAI shows a tendency to rely on visual aids and straightforward policy summaries to interpret complex algorithms and agents. Despite this progress, challenges persist, especially the absence of user-centric studies that would anchor XAI techniques more firmly in the user experience. Additionally, the predominance of oversimplified examples in research, which fail to encapsulate the complexities of real-world applications, raises concerns about the applicability and robustness of these XAI methodologies. Wells and Bednarz have identified immersive visuals and symbolic representation as promising future research directions that could address some of these challenges in RL [71]. From the current literature, we can derive that visualization is one of the crucial trends for creating explainable and interpretable AI systems within the domain of RL. This focus on employing visual tools not only aligns with the broader objectives of XAI but also represents a strategy for addressing the challenges specific to RL. In general, visualization, including web-based visualization, is crucial for understanding complex ML algorithms, as emphasized by multiple studies [14, 19, 42, 64, 77]. Specifically, visual analytics for RL can enhance trustworthiness during RL training and evaluation processes. Current literature provides a variety of tools, such as ReLVis [54] for tracking RL experimentation, while DRLViz [30] and DRLIVE [64] offer insights into an agent’s internal memory and interactive tracking, respectively. PolicyExplainer [42] allows direct queries to autonomous agents. Other tools, like DQNViz [63] and DynamicsExplorer [24], target specific RL algorithms or policies. Although direct visualization methods exist [21, 29, 45, 74], challenges remain, especially in multi-objective optimization scenarios [11]. By integrating the strengths of past works and addressing the highlighted gaps, we aim to pioneer a robust and intuitive open-source platform for automated RL visualization, thereby enriching the domain’s landscape.
AutoDOViz. AutoDOViz [69] is an interactive platform designed to enhance the user experience in automated DO with RL. Developed with insights from semi-structured expert interviews involving business consultants and DO practitioners from a variety of industries, the system aimed to meet the design requirements essential for human-centered automation in the realm of DO. One of its main achievements is its ability to improve trust and confidence in ML, especially in RL agent models. This was achieved by adding a transparent presentation of reward metrics, making the complexities of automated training and evaluation processes both accessible and comprehensible to users. Furthermore, AutoDOViz incorporates the power of automated DO algorithms for RL pipeline searches, generating insightful policy data and visualizations. These advanced visualization capabilities facilitate more effective communication between DO experts and those from various domains. Another feature of AutoDOViz is its gym composer and the accompanying gym template catalog. This unique addition aimed to lower the entry barrier for data scientists when specifying problems for RL. With an array of pre-defined templates, users can easily find and reuse examples, speeding up the problem-definition process. However, the user study did reveal a hesitancy to contribute to this catalog, largely due to concerns about client confidentiality. The system also features a streaming architecture, introducing a novel Human-within-the-Loop approach for enhanced interactions. However, the study also identified areas for potential improvement, providing guidance for future iterations, which we pick up in Section 3.2.
As pointed out by [65] and other reviewed literature, there is a significant gap in real-time, interactive visualization tools for RL and AutoML. Despite advancements in these areas, existing systems enhance automation and interpretability in ML models but often fall short of providing dynamic, user-friendly features for both novices and experts. The proliferation of RL and AutoML tools has not bridged the gap between complex algorithms and actionable, easily understandable insights for real-time decision-making.

3 System Design

Figure 2 outlines the development process of AutoRL X. Section 3.1 revisits the user interface of AutoDOViz, detailing insights that have been incorporated in AutoRL X. Subsequently, Section 3.2 discusses the requirements and key takeaways from the AutoDOViz experience, setting the stage for the subsequent section describing the architecture of our open-source platform.

3.1 Insights from AutoDOViz

From exploratory interviews, user studies, and reflections on previous work in AutoDOViz [69], we were able to derive further features and suggestions that could drive our system design for AutoRL X. Furthermore, since the AutoDO [41] engine is not easily accessible as open-source software outside the IBM network for industry and clients, we propose AutoRL X as an alternative option to visualize and interact with AutoRL on alternative engines. In AutoDOViz, we further introduced an ensemble of different visualizations for RL analytics, such as state space visualizations, action space visualizations, policy and value function visualizations, training progress and convergence, or agent-environment interaction dynamics. We also proposed a set of interactive tools tailored for RL visualization: for example, drag-and-drop features that allow users to input and then customize their own RL environment, zooming features in the performance charts to focus on particular areas of the state space or time intervals, real-time feedback visualizations for hyperparameter tuning and algorithm adjustments, and line charts to compare different RL algorithms (agents) and their configurations side by side. Principles and requirements that led to the design decisions of AutoDOViz were derived from exploratory semi-structured interviews with two groups. First, we interviewed DO practitioners with roles, titles, and personas ranging from business representatives, business analysts, data engineers, data scientists, quantitative analysts, developers, IT specialists, and salespeople to one or more optimization experts. A second round of remote video interviews was conducted with another user group, namely eight business consultants from domains including the agriculture, oil, automotive, government, manufacturing, and retail industries. The potential target end-users for AutoDOViz were identified to be data scientists. In AutoRL X, however, we hope to also enable domain experts from non-computer-science fields who need to solve optimization problems. Insights from these semi-structured interviews led to 9 design requirements for AutoDOViz for developing human-centered automation for DO using RL. These requirements include (1) creation of generic templates that match common categories of DO problems, (2) visual tools for categorizing DO challenges, (3) fostering stakeholder collaboration, (4) defining user skills and goals, (5) supporting the complete workflow within a unified framework, (6) enhancing trust in automated solutions, (7) demystifying RL training for non-experts, (8) organizing templates by industry for easy access, and (9) offering templates for widespread business issues across various sectors. In the AutoDOViz user interface, a dashboard provides users with straightforward access to three core entities: environments (gyms), engine configurations, and executed jobs (Figure 3). This structured layout enables users, including business domain stakeholders, to trigger executions and access high-level visualizations of their RL experiments without delving into the technicalities of gym implementations. Moreover, AutoDOViz incorporates a configuration wizard that simplifies the complex process of setting up RL agents and their hyperparameters, and it implements two further types of visualizations, transition matrices and trajectory networks, to present behavioral information about the agent and provide detailed insights and increased confidence to the user.
The interface displays a list of selectable RL agents and, for each one, a detailed configuration panel that allows users to adjust hyperparameters, providing options for types, possible values for discrete parameters, ranges for continuous ones, and default values. Further, AutoDOViz’s tutoring interface strategies are applied when modelling the gym, where the composer leads users through a series of decisions.
Fig. 3.
Fig. 3. AutoDOViz’s Job View: The output is visually represented in a dashboard format.

3.2 Requirements

Our main priority in developing an open-source version was to maintain the functionalities that we offered users in the proprietary software. We also aim to incorporate additional findings and qualitative feedback from the user studies, which we used as a baseline to informally derive our requirements. Section 5.2.4 of AutoDOViz [69] describes participants’ post-survey reflections, likes and dislikes, and findings of the post-study questionnaire. The post-study questionnaire consisted of 14 questions, including eleven 5-point Likert agreement scale questions. The user study was conducted with 13 participants who were encouraged to “think aloud” as they followed through with user study tasks while working in AutoDOViz’s user interface (UI).
In the following, we present the requirements that guided the development of AutoRL X as a more refined and user-centric follow-up system. First, participants of the AutoDOViz study felt that the user experience on small screens could be improved, for example by reducing scrolling in the composer screen; the system therefore needs to be optimized for various devices, including tablets and mobile phones (R1). The agent listing screen was also identified by users as an area for improvement; enhancing its layout, functionality, and filter options could provide a smoother experience (R2). One participant expressed a need for more understandable visualizations; addressing this could involve using tooltips, legends, and contextual guides to help users decipher visual data (R3). Suggestions were made to incorporate time sliders to replay real-time feedback on agent progress visualizations, allowing users to rewind, pause, and analyze agents’ actions over time (R4). For those less familiar with the tool, there is a need for additional on-screen explanations, tutorials, or a help section to guide them (R5). Given that one user expressed interest in collaborating using AutoDOViz, introducing features that enable collaboration, such as shared views, commenting, and real-time edits, could be beneficial (R6). Several participants showed interest in integrating AutoDOViz with their existing toolkits; developing plugins or APIs to facilitate integration with popular data science and ML tools could enhance its adoption (R7). Based on feedback, while users are keen on using pre-existing templates, there is hesitance to contribute due to confidentiality concerns; a potential solution is to provide more generic templates or allow for anonymized sharing (R8). Recognizing that preferences for working in shared vs. custom environments are highly use-case dependent, the system could offer more granular control over environment settings, with attention to security, privacy, and cost (R9). While the UI was appreciated for its familiarity, maintaining consistency with popular data science software can ensure users find the platform intuitive (R10). Since the UI successfully allowed data scientists to learn about DO tasks quickly, adding more educational tools, walkthroughs, or interactive demos might enhance user understanding (R11). Finally, user study participants considered transparency, especially regarding metrics, to be essential (R12). Table 1 lists all the requirements we could identify.
Table 1.
Requirement | Description
R1 Enhanced UX for Small Screens | Optimize platform for various devices to improve user experience, e.g., reducing scrolling on small screens like tablets and mobile phones.
R2 Refined Agent Listing | Improve layout, functionality, and filtering options of the agent listing screen for a smoother user experience.
R3 Improved Visual Interpretability | Use tooltips, legends, and contextual guides to make visual data easier to understand for users.
R4 Inclusion of Time Sliders | Integrate time sliders in visualizations to allow users to rewind, pause, and analyze agent actions over time.
R5 Enhanced On-screen Explanations | Provide additional on-screen explanations, tutorials, or help sections to assist users less familiar with the tool.
R6 Collaborative Features | Introduce collaborative features like shared views, commenting, and real-time edits to facilitate teamwork.
R7 Integration with Existing Toolkits | Develop plugins or APIs for integration with popular data science and ML tools to encourage adoption.
R8 Expand the Gym Template Catalog | Offer more generic templates and anonymized sharing options to address confidentiality concerns.
R9 Customizable Environment Preferences | Provide granular control over environment settings with a focus on security, privacy, and cost.
R10 Enhanced UI Consistency | Maintain consistency with popular data science software to ensure intuitive use.
R11 Educational Features | Add more educational tools, walkthroughs, or interactive demos to enhance user understanding of DO tasks.
R12 Transparency Enhancements | Implement features that provide insights into metric calculations and algorithmic choices to increase user trust.
Table 1. Requirements for AutoRL X Derived from AutoDOViz—A Human-Centered Approach

3.3 AutoRL X Architecture

The architectural schema of the AutoRL X system is depicted in Figure 4. Informed by the identified requirements, the design closely mirrors that of our proprietary AutoDOViz platform. As in AutoDOViz, the user of AutoRL X manages three entities: gyms (or environments), engine configurations and agents, and resulting runs (or jobs). The system architecture is structured into the following three principal components:
Fig. 4.
Fig. 4. Architecture of AutoRL X: The heart of the platform is an AutoRL engine to run different RL pipelines. We opted for ARLO [47] as reference implementation, which builds on the OpenAI Gym standard with the MuJoCo three-dimensional (3D) environment extension, while leveraging Mushroom RL agent implementations. Data, such as run and model configurations and logs from our logger, is stored in our SQL database. FastAPI for REST offers flexible communication with any AutoRL framework, connecting to our UI developed with Svelte.js [53].
Backend. The backend of our application uses an open-source AutoRL engine, ARLO [47], to facilitate automatic computation of RL pipelines. ARLO handles OpenAI Gyms [6] and MuJoCo [60] three-dimensional (3D) environments, and it leverages agent implementations from Mushroom RL [13]. It is suitable for diverse research and development scenarios. Figure 5 shows eight ARLO [47] models offered through our UI. While we focus on online RL scenarios throughout this work, next to DQN, PPO, SAC, DDPG, and Gradient of a Partially Observable Markov Decision Process, the ARLO framework also features FQI, DoubleFQI, and LSPI for offline scenarios. ARLO further provides different tuner strategies that our users can choose from in our interface. For example, a Genetic Tuner evolves a population of model configurations by mutation and selection to optimize hyperparameters for performance on a given evaluation metric. We also provide access to the Optuna Tuner,2 which performs HPO by searching through a predefined space and evaluating model performance. It uses advanced algorithms to determine the best set of parameters, with features like trial pruning and parallel execution to speed up the search. From a dropdown menu in the UI, the user can also choose different evaluation metrics for the RL pipeline, such as a discounted reward, which evaluates the average of cumulative rewards received over episodes, adjusted by a discount factor (gamma) to account for the time value of rewards. In contrast, the temporal difference error calculates the average squared deviation between predicted and actual rewards in subsequent states, reflecting the accuracy of the value function. Lastly, a time series rolling average discounted reward tracks this discounted reward as a rolling average over episodes.
Fig. 5.
Fig. 5. Model Selection: Users can select from a variety of offline and online models, configure the number of agents, episodes, and generations, and select one of three offered metrics. The user can then save these as configurations and run them on RL environments.
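To make the first of these metrics concrete, a minimal sketch of the average discounted return over episodes, as described above, could look as follows; the function and variable names are ours for illustration and do not reflect ARLO’s internal API:

import numpy as np

def average_discounted_reward(episode_rewards, gamma=0.99):
    """Average of cumulative discounted rewards across a list of episodes."""
    returns = []
    for rewards in episode_rewards:
        # discount each step's reward by gamma^t to account for the time value of rewards
        discounts = gamma ** np.arange(len(rewards))
        returns.append(float(np.sum(discounts * np.asarray(rewards))))
    return float(np.mean(returns))

# Example: two short episodes with per-step rewards
print(average_discounted_reward([[1.0, 0.0, 1.0], [0.5, 0.5]], gamma=0.9))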
Further enhancing the backend, our system is designed with extensibility in mind (R7). It is not limited to the proposed ARLO framework [47]; the architecture allows for the integration of alternative AutoRL engines. This is achieved by AutoRL X’s robust logging mechanism, which records run metadata and streams model logs into a database, ensuring a structured and retrievable data management process. Additional AutoRL frameworks can be integrated simply by providing a job execution script (in Python) and making it communicate with the REST API provided in AutoRL X.
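As a rough illustration of this integration path, such a job execution script could run its own training loop and push log records to the AutoRL X REST API. The endpoint path, base URL, and payload fields below are illustrative placeholders, not the documented AutoRL X schema:

import json
import urllib.request

API_BASE = "http://localhost:8000"  # assumed local address of the AutoRL X API

def push_log(run_id, step, score):
    """Send one log record for a run to a (hypothetical) logging endpoint."""
    payload = json.dumps({"run_id": run_id, "step": step, "score": score}).encode()
    req = urllib.request.Request(
        f"{API_BASE}/runs/{run_id}/logs",  # placeholder route
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# Inside a custom AutoRL framework's training loop one would call, e.g.:
# push_log(run_id=42, step=step, score=episode_return)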
API. The REST API (see Figure 6), built with FastAPI,3 offers execution of RL gyms in ARLO and integrates with the SQL database via an internal service layer. One part of the API is responsible for managing the database operations related to gyms, configurations, runs, and models. These endpoints contain different HTTP methods to create or send data to the server. The services handle database connections securely and perform queries and updates efficiently to ensure lazy loading on the front end. For example, to get information about model run episode trajectories, we offer a get_trajectory method in the runs endpoint, which retrieves only the currently selected step sequence requested by the user in the UI. The request body must be in JSON format. Figure 6 exemplifies the gyms endpoint, which is essential for operations that involve adding new gym entities or updating existing ones within the system. Each gym instance has multiple attributes that need to be defined, and it supports customization through its various parameters and modules. Our API further comprises other parts, such as a logger with zlib compression support for fast write and read operations, enabling an overall comprehensive and scalable backend solution.
Fig. 6.
Fig. 6. REST API: Details of the ’Push Gym’ REST API endpoint documentation with the required JSON structure for the POST request, including fields such as ’id,’ ’name,’ and ’created_on,’ and indicating the expected successful and error responses.
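To convey the shape of such an endpoint, the following is a reduced, hedged sketch in FastAPI; the route, model fields, and the omitted database call are simplified stand-ins for AutoRL X’s actual service layer:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Gym(BaseModel):
    # Simplified gym attributes; the real schema contains further fields and types
    id: int
    name: str
    created_on: str
    code: str  # Python implementation of the environment

@app.post("/gyms")
def push_gym(gym: Gym):
    """Create or update a gym entity (stand-in for the internal SQL service layer)."""
    # save_gym(gym)  # placeholder for the database write
    return {"status": "ok", "id": gym.id}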
Frontend. Our frontend framework of choice is Svelte.js [53], a modern and developer-friendly framework [4] with compiler-ensured state reactivity from the ground up. Svelte stands out for its innovative approach to building user interfaces, leading to more efficient updates and cleaner code. This advantage lies in Svelte’s departure from the virtual document object model (DOM) paradigm, offering direct manipulation of the DOM and, thus, faster performance and a more straightforward development experience. To obtain responsive and user-friendly interface components, we add the Carbon Design framework. The Carbon Components Svelte library implements the Carbon Design System,4 an open-source design system developed by IBM that emphasizes reuse, consistency, and extensibility. This design is tailored for complex, enterprise-level interfaces to accelerate the development process while maintaining a high standard of design quality and user experience. By utilizing this library, we were able to maintain a consistent look and feel across our application. Our decision was informed by insights gathered from interviews with data scientists using AutoDOViz [69]; many were already acquainted with the Carbon UI library from other software tools they use in their daily work. The familiarity of the UI framework contributed positively to the user experience. Participants noted that this consistency aided them in efficiently performing their tasks, fulfilling the user interface consistency requirement R10 highlighted as a priority for our system’s design.
As a result, as shown in Figure 7 and in line with R2, users of AutoRL X can now select from automatically refined agents in a pipeline run, with novel filter options to search for specific phases such as the learn or test phase, a certain epoch, and to filter through iterations and actions to more closely examine agent behavior. In line with R3 and R4, we have added tooltips and time sliders to better retrieve information from the line charts, making it easier for users to understand the visual data. The user can also see which agents are still running or already finished. Next to the filtering options, we also added the possibility to view agent logs, visited states, and hyperparameters in more detail, as demanded in R12, improving transparency for the user. However, as mentioned by [8, 15, 52], capturing user trust can be challenging, which we also point out as a limitation for further discussion.
Fig. 7.
Fig. 7. Agent Leaderboard: Running an environment with the PPO algorithm. Users can select different agent configurations on the left and add them to the line chart to explore their progress. In the menu above the chart, the user can filter by different phases, epochs, iterations, and actions, besides the option to select different tabs with agent logs, visited states, hyperparameters, and learned policies.
In response to user feedback received in AutoDOViz, where users expressed confusion regarding the navigation of categories in the gym template catalog, we provide a slightly improved design prototype in AutoRL X, as seen in Figure 12. Specifically, in AutoDOViz, we categorized gyms based on the North American Industry Classification System (NAICS) into various business problem categories. Compared to AutoDOViz, we introduce a more user-friendly navigation system in AutoRL X, which guides users through the different gym categories via breadcrumbs to streamline the user experience and facilitate easier exploration. Similar to AutoDOViz, users click through the hierarchy via tiles on each level. On a leaf level, the gym catalog then shows the list of available templates; an example is shown in Figure 13. For requirement R1, we tested our platform on multiple devices: a tablet, a smartphone, and even the mixed reality browser offered on the Meta Quest 3, by connecting via the local network. Overall, we addressed 8 of the 12 requirements from our full list in this work. The remaining requirements R5, R6, R9, and R11 are mapped to GitHub issues in our open-source repository (https://github.com/lorifranke/autorlx).
Extensibility. We provide four points of extensibility: (1) We build on the extensible OpenAI Gym API [6]. However, OpenAI Gym provides only a preliminary rendering infrastructure for RL environments, essentially a pythonic render function. This Python-based setup is disconnected from web applications and is not dynamic or flexible enough to serve as a real-time visualization. (2) Therefore, an extensible feature we have implemented is a 3D visualization showing agent dynamics within a simulated environment using WebGL [22] and Three.js [7]. Both are powerful tools for rendering interactive 3D graphics directly in web browsers without plugins. Unlike OpenAI Gym’s pythonic render function, this novel feature naturally supports interactivity within the web app, providing a native experience for users engaging with our platform. As illustrated in Figure 14, users have the flexibility to optionally create this visualization by inserting TypeScript code via Three.js in the editor. This addition enables visualization of the agent’s movements and actions within the environment, as well as step-by-step agent interactions across epochs, offering a more intuitive understanding of its behavior in 3D space. Users can use the step-back button to see the agent’s previous behavior and click through the sequence. Furthermore, the 3D environment is included as an interactive thumbnail version in the catalog leaf node (see Figure 13). (3) Another point of extensibility is the ARLO [47] framework, which lays the foundation with its different models. Moreover, ARLO offers extensibility by allowing users to customize RL pipelines and add custom stages that incorporate automatic RL training. (4) Lastly, we offer extensibility via gym code parametrization. In AutoRL X, we enhance the UI by enabling users to define individual parameters during the implementation of a gym. This approach marks a significant lesson learned, as it allows users to set parameters externally before the execution of a run and to test multiple gyms with slightly differing parameters (see Figure 9).
Fig. 8.
Fig. 8. Configuration Management: The panel displays the interface for managing a collection of agent-model configurations, which are persistently stored within the system’s database. These configurations can be subsequently retrieved for further analysis or modification.
Fig. 9.
Fig. 9. Creating a new gym in AutoRL X: Users can parameterize their environments to more easily try different variations of an implementation without losing track across runs.
Fig. 10.
Fig. 10. Creating a new gym in AutoRL X: Users can define observation and action spaces for the gym, as well as step functions, as a custom Python implementation using the second tab in the ’edit gym’ window.
Fig. 11.
Fig. 11. Environment Management: The panel illustrates the gym registration and listings module, where users can add, name, duplicate, and inspect existing RL gyms. These definitions are also preserved in the database, enabling continued access across sessions, including after the browser has been closed. Each gym in the list can be inspected by expanding a dropdown menu showing the Python code implementation as well as the optional 3D visualization. Hovering over user-defined gym parameters opens a description.
Fig. 12.
Fig. 12. Gym Catalog: The gym template catalog of AutoRL X, where gym templates can be explored with breadcrumbs, sorted by NAICS categories.
Fig. 13.
Fig. 13. Gym Catalog Leaf Node in the Hierarchy: Two gyms are added in the leaf node of the gym template catalog with a preview.
Fig. 14.
Fig. 14. Creating a 3D Visualization in Edit Gym: Users can add flexible TypeScript code to create a 3D environment that shows the agent’s actions as trajectories.

4 Applications of RL

RL environments range from simple board games to complex simulated robotics or virtual ecosystems, providing controlled spaces where agents can safely explore and learn from interactions. However, transitioning from these well-defined environments to real-world applications is often challenging: real-world scenarios are often unstructured and unpredictable, with noisy data and non-stationary dynamics that can significantly hinder the performance of RL algorithms trained in more controlled settings. Moreover, the ethical and safety considerations involved in trial-and-error learning raise substantial barriers to deploying RL agents where mistakes can have serious consequences, such as in healthcare or autonomous driving.
In AutoDOViz, we identified a variety of domains that use optimization strategies (or DO) by conducting semi-structured interviews with specialists in the agriculture, automotive, government, manufacturing, oil, or retail industries. However, all these examples mainly focused on business applications, and the environments in AutoDOViz’s gym catalog were inspired by these use cases. In this work, we aim to explore a novel real-world problem from the medical/healthcare field that could be solved with RL. We believe that specialists from this domain can further benefit from RL’s capabilities. RL has shown promise in a range of medical applications, although it is still in the early stages of adoption in the field [73]. Examples of successful medical and healthcare applications of RL include its use in critical care, for chronic diseases like cancer, diabetes, and HIV, and in automated medical diagnosis, especially for medical imaging [28]. Furthermore, RL can be used for surgical robots [61]. Other application domains for RL in healthcare involve clinical resource allocation, control, or healthcare management. We opted for a problem that is visually appealing in order to be sufficiently comprehensible to the general reader, despite being very domain specific. This allows us, by way of example, to journey together from problem inception to RL agent run results and their inspection within AutoRL X. This health-related RL challenge was identified together with collaborators from the domain.

4.1 Use Case Scenario: Transcranial Magnetic Stimulation (TMS)

We identify a real-world optimization problem that seems worthwhile to explore with RL pipelines: Brain stimulation refers to the application of electrical, magnetic, or other forms of energy directly or indirectly to the brain to alter its activity, with therapeutic purposes such as treating psychiatric conditions or neurological disorders, or even to enhance cognitive function. There are invasive forms of brain stimulation, in which electrodes are surgically placed on the brain, as well as non-invasive methods like TMS [2]. TMS uses a device that is placed on a patient’s head and creates electromagnetic fields to stimulate nerve cells (neurons) in the brain. In recent years, it has been successfully used to improve symptoms of mental health conditions such as depression and many other neurological and psychiatric disorders. Researchers, clinicians, and doctors are now focused on determining the optimal placement of TMS devices to target specific brain regions effectively.

4.2 Assessing Human Accuracy

With AutoRL X being derived from AutoDOViz and having a similar interface, we opt for a detailed, domain-specific application study instead of another software usability study, as one was already performed on AutoDOViz. For our study, we developed an interactive tool, a two-dimensional (2D) TMS Simulator (Figure 15), with which users handle a brain stimulation device in an abstracted 2D environment. The simulation interface allows users to control a 2D representation of an electromagnetic field to influence a section of the brain’s neuronal activity. In reality, the electromagnetic field is generated by a stimulation device that a doctor hovers around a patient’s head to treat brain diseases. The brain region is exemplified with a grid-like canvas of size \(N\times N\), with each square reflecting a brain cell with an activity level of its enclosed neuron. The starting grid has some initial ‘damages’ with inactive neurons, which exemplify the brain regions that need to be treated with the tool. Activity values can be manipulated using the circular tool representing the electromagnetic field. Users can move this tool with their mouse, adjust its size with a scroll, and initiate or pause treatment with a click. The electric field is applied to the brain cells in the form of a Gaussian distribution, similar to a real-world electromagnetic field. The treatment clicks then change the activity values of the neurons under the circle, adjusting the colors of the squares, with colors ranging from inactive cells (dark grey) to optimal activity (white) to overly stimulated cells (red). Users are provided with a control legend and a color-coded legend for clarity. The simulation also evaluates the user’s performance, calculating a score based on the brain’s overall activity, and charts this score over time. The simulator is accompanied by an interactive tutorial so that users can learn the controls and participate in the study without a session moderator.
Fig. 15.
Fig. 15. Interface of the 2D Simulator: Users can click through the study. Left: An explanation of the goal is shown first, followed by the color legend on the next page. Right: Users can see whether the electric field (purple) is active, as well as the time and current score.
Participants. The participant demographic for our user study, as detailed in Table 2, comprises a sample with diverse educational backgrounds and a binary gender distribution. We recruited the participants via word of mouth and our collaboration network. The group included 13 individuals (\(N=13\)), with a gender representation of \(38.46\%\) female (5 participants) and \(61.54\%\) male (8 participants). In terms of educational attainment, the participants are distributed across the spectrum of formal education levels: \(7.69\%\) (1 participant) have completed high school, \(30.77\%\) (4 participants) hold a Bachelor’s degree, another \(30.77\%\) (4 participants) have attained a Master’s degree, and the remaining \(30.77\%\) (4 participants) possess a PhD. This distribution indicates a relatively balanced representation of educational backgrounds within the cohort, providing a broad perspective in the context of the user study. Of the 13 participants, 4 were classified as Experts (\(E1\)-\(E4\)), having a background in neuroscience or psychology and currently working in research on TMS or administering brain stimulation treatments to patients. The remaining 9 participants, whom we will call General Users, mainly had technical backgrounds in engineering or software development.
Table 2.
Gender | Female: 5 (38.46%) | Male: 8 (61.54%)
Degree | High School: 1 (7.69%) | Bachelor: 4 (30.77%) | Master: 4 (30.77%) | PhD: 4 (30.77%)
Table 2. Participant Demographic Data from Questionnaire
Study Protocol. We created a fully self-paced remote study hosted on a web server. Participants received a link and could click through the study page by page, followed by a link to a questionnaire. Sessions started with a tutorial to introduce the optimization problem for brain stimulation and to familiarize participants with the controls of the tool. Next, users engaged with the simulator to maximize their score. For extra motivation, we displayed the time elapsed since the grid was first shown. At the end of the simulator, we displayed a visualization to let participants explore and reflect on their own performance, before going into a post-study questionnaire.
Post-Study Questionnaire. After the study, participants were asked to fill out a survey with nine questions to gather qualitative feedback on their experience with the simulator, their trust in it, its perceived strengths, potential areas for improvement, as well as overall satisfaction and user experience. The nine questions were rated on a five-point scale, where participants shared how much they agreed or disagreed with statements about the clarity of the task, how easy it was to understand the color codes for brain activity, and whether they were able to interpret the meaning of the displayed score. They also indicated how comfortable it was to use the simulated device and how easy it was to complete the task. We were also interested in their assessment of whether they believed a computer could do the task more quickly or with higher accuracy than a human, and whether they trusted the scores shown. The intent of these two questions was to capture general sentiment about the participants’ beliefs in the capabilities of machine automation vs. human effort in such contexts. Lastly, we asked whether the simulation was a good representation of the real-world 3D task.

4.3 Results

Table 3 shows the results from the user study. We logged times and scores into a database for post-processing. Each participant was assigned an anonymous session ID. Figure 17 shows participants’ performance. Scores ranged from 84.92% to 97.50%, with the majority of participants achieving above 95%. The general user group achieved an average score of 96.60%. Expert users, while still proficient, had a slightly lower average score of 92.44%, suggesting that the tool’s challenges were robust across all levels of expertise. Combined, the overall average score for both groups was 95.32% with a standard deviation of 3.35%. The average time invested by users also differed notably between groups. Time spent in the simulation varied from 9 seconds up to 840 seconds (14 minutes), showing no strong correlation with either scores or total steps taken, suggesting variable efficiency in tool use. General users took an average of 388.44 seconds, while experts completed tasks more rapidly, averaging 216 seconds, potentially reflecting familiarity and efficiency with such tasks. The combined average time for both groups was 335.38 seconds, but with a substantial standard deviation of 234.91 seconds, pointing to significant differences in individual completion times. The level of interaction, measured in total steps taken, varied extensively, with general users averaging 5,851.67 steps and experts 2,125.25 steps, indicating that experts navigated the tool with higher stability, requiring fewer interactions. Steps taken as a measure of interaction also varied widely, from as few as 110 to as many as 10,841. This variability might indicate differences in user strategy, understanding, or the complexity of tasks they encountered. The overall average for both groups was 4,705.08 steps, with a large deviation once again highlighting the diversity in user approaches.
Table 3.
Metric | General Users | Experts | Both Groups
Average score [%] | 96.60 \(\pm\) std | 92.44 \(\pm\) std | 95.32 \(\pm\) 3.35
Average time [s] | 388.44 \(\pm\) std | 216 \(\pm\) std | 335.38 \(\pm\) 234.91
Average steps | 5851.67 \(\pm\) std | 2125.25 \(\pm\) std | 4705.08 \(\pm\) 3287.80
Table 3. User Study Scores, Times and Steps
Fig. 16.
Fig. 16. (a) 2D grid of a brain region with damaged cells in grey. The circular electric field (purple) has a small radius (red arrow). (b) The user can increase the radius of the electric field to a maximum of 1/4 of the grid by using the mouse scroll. (c) The user overexcited cells in a brain region with a too-strong electric field, which is indicated by the red color.
Fig. 17.
Fig. 17. Scores over time for each participant: Progression of scores over time for experts (purple) versus general users (teal), illustrating the performance dynamics in user engagement with the 2D simulator tool. One expert’s curve is barely visible because they finished very quickly. One general user took over 800 seconds, with the line increasing steadily due to very slow movements of the electric field.
The post-study questionnaire data in Table 4 revealed insights into participant experiences with the simulator, delineated between general and expert users. Participants rated their agreement with statements regarding various aspects of the tool on a Likert scale, with scores averaging between 3 and 5. Clarity of the task goal (Q1) and understanding of the color encoding (Q2) received high ratings, indicating a straightforward design and intuitive interface, with averages across both groups of 4.46 (±0.66) and 4.69 (±0.48), respectively. Clarity of the score meaning (Q3), comfort with device manipulation (Q4), and task completion ease (Q5) showed more moderate agreement, suggesting potential areas for improvement in user understanding and interface ergonomics. Notably, participants rated the machine’s speed (Q6) and score (Q7) compared to human performance for task completion highly, averaging 4.31 (±1.03) and 4.46 (±0.78), respectively. This may indicate that participants believe a program or machine would perform better than a human in this task. Trust in the accuracy of displayed scores (Q8) scored slightly lower but still within a positive range, averaging 3.92 (±1.04), hinting at slight reservations about the tool’s reliability. Finally, the simulator’s adequacy as a 2D representation of brain stimulation (Q9) received the lowest average score of 3.15 (±1.14). This could signal a need for a more effective visual representation or for user education regarding the simulation. The standard deviations suggest variability in participant responses, reflecting individual perceptions and experiences with the tool.
Table 4.
Question | Statement | General Users | Experts | Both Groups
Q1 | Task goal clarity | 4.56 | 4.25 | 4.46 \(\pm\) 0.66
Q2 | Ease of understanding color encoding | 4.56 | 5 | 4.69 \(\pm\) 0.48
Q3 | Score interpretability | 3.67 | 3.75 | 3.69 \(\pm\) 1.25
Q4 | Comfort in device manipulation/navigation | 3.67 | 3.75 | 3.69 \(\pm\) 1.11
Q5 | Task completion ease | 3.67 | 3 | 3.46 \(\pm\) 1.20
Q6 | Machine vs. human speed for task | 4.56 | 3.75 | 4.31 \(\pm\) 1.03
Q7 | Machine vs. human score for task | 4.67 | 4 | 4.46 \(\pm\) 0.78
Q8 | Trust in displayed score accuracy | 4 | 3.75 | 3.92 \(\pm\) 1.04
Q9 | Simulator as 2D brain stimulation representation | 3 | 3.5 | 3.15 \(\pm\) 1.14
Table 4. Post-Study Questionnaire Results

5 RL Environment for Brain Stimulation in AutoRL X

Our research explores the application of RL to the previously presented brain stimulation problem by creating a specialized RL environment (reference gym). This simulated environment mimics essential aspects of brain stimulation procedures for educational and research applications. It is designed for ease of use, allowing individuals, irrespective of their RL expertise, to investigate and comprehend the nuances of brain stimulation techniques in a risk-free virtual setting. Similar to how MuJoCo environments5 provide simulation tasks for physical movements, our RL environment broadens the scope to healthcare, particularly the optimization of brain stimulation parameters and thereby the optimal placement of an electromagnetic field on a brain region. This simulation can be essential for further devising personalized treatment strategies. For this algorithmic challenge, we had various options for implementing the 2D TMS Simulator as a gym environment, particularly since the definitions of reward functions and action and observation spaces vary highly with the developer’s decisions. Our goal was to test this environment in AutoRL X and run it with a diverse set of agents.

5.1 Technical Implementation Details

Throughout this section we provide technical details on an exemplary gym implementation for our given TMS problem. We need to make decisions on how to model the state of the environment, which is observed by an RL agent. A state is further assessed in a reward function, which the agent is tasked to optimize. In order to do so, the agent can propose an action, which is then applied to receive a new state, and so on. Typically, the procedure terminates after a certain number of steps or when a certain condition is met.
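This interaction loop can be sketched against the classic OpenAI Gym-style interface assumed throughout this work; the rollout helper and the policy callable below are our own illustrative names, not part of the gym implementation:

# Generic agent-environment interaction loop (classic Gym convention:
# reset() returns a state, step() returns (state, reward, done, info)).
def rollout(env, policy, horizon=150):
    state = env.reset()
    total_reward = 0.0
    for _ in range(horizon):
        action = policy(state)                          # agent proposes an action
        state, reward, done, info = env.step(action)    # environment applies it
        total_reward += reward
        if done:                                        # optional termination condition
            break
    return total_reward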
State/Observation. The state \(s\) of the gym is the representation of neural activity across the 2D grid, captured as a 2D vector. When returning the state in the OpenAI interface, the vector needs to be flattened to 1D: \(s=\text{flatten}(\text{grid})\) where \(\text{grid}\in\mathbb{R}^{N\times N}\) and \(s\in\mathbb{R}^{N^{2}}\) with \(N\) being the side length of the grid.
Action. We can model the action space \(a\) as a 2D vector that represents the position of the TMS device in the grid (\(device\_x\), \(device\_y\)), i.e., the location where the electromagnetic field is applied:
\begin{align*}a=[device\_x,device\_y]^{T},\end{align*}
where \(device\_x,device\_y\in[0,N-1]\) are continuous values. If we wanted to further model the radius or the intensity of the field, we could do so by introducing additional actions analogously.
Transition/Step Dynamics. The transition dynamics are defined by the influence of the device on the grid. When an action is taken, it increases (stimulates) the values in the grid based on a Gaussian distribution centered around the action’s location with a fixed or actionable radius \(R\) and intensity \(I\):
\begin{align}\text{s}^{t+1}_{x,y}=\text{s}^{t}_{x,y}+\text{OPT}\cdot\text{I}\cdot \text{Gaussian}(\text{device_x},\text{device_y},x,y,\text{R}/2).\end{align}
(1)
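As a minimal NumPy sketch of these dynamics, the update of Equation (1) can be written as follows; the default values for the intensity \(I\) and radius \(R\) are our own illustrative choices:

import numpy as np

def gaussian_kernel(cx, cy, N, sigma):
    """2D Gaussian centered at (cx, cy), evaluated on an N x N grid."""
    x, y = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    return np.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))

def apply_stimulation(grid, action, OPT=100.0, I=0.1, R=3.0):
    """One transition step: add the Gaussian-shaped field at the action's position (Equation (1))."""
    device_x, device_y = action
    N = grid.shape[0]
    return grid + OPT * I * gaussian_kernel(device_x, device_y, N, sigma=R / 2.0)

# Example: stimulate the center of a 10 x 10 grid of inactive cells
grid = np.zeros((10, 10))
grid = apply_stimulation(grid, action=(4.5, 4.5))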
Reward Function. The reward function at any time step \(t\) can be modeled as the sum of the errors between the current grid values and the optimal activity level \(OPT\), scaled by the maximum possible reward. Since a reward should be higher the better the solution, we simply take the negative. We further reward stimulation of understimulated neurons more than overstimulation of already stimulated neurons by skewing the value distribution using an exponential function:
\begin{align}\text{reward}(s)=-\frac{\sum\limits_{x=1}^{N}\sum\limits_{y=1}^{N}\left(\frac{\text{grid}(x, y)}{\text{OPT}}-1\right)^{2}\cdot\left(e^{\left(\frac{\text{grid}(x,y)}{\text{ OPT}}-1\right)}-1\right)}{N^{2}\cdot\text{OPT}}.\end{align}
(2)
For visual clarity, Figure 18 shows the modeled function in comparison to a regular parabola.
Fig. 18.
Fig. 18. Reward function that the agent receives: The teal curve shows the reward we pass to the agent when performing an action. We punished the agent more for leaving cells below the optimal value than for overstimulating the cells of the grid with positive values.
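Transcribed directly from Equation (2), a minimal NumPy sketch of this reward reads as follows (the default for \(OPT\) matches our gym; the function name is ours):

import numpy as np

def reward(grid, OPT=100.0):
    """Reward of a grid state, written directly from Equation (2)."""
    ratio = grid / OPT - 1.0
    # squared deviation from the optimal level, skewed by an exponential factor
    skewed_error = (ratio ** 2) * (np.exp(ratio) - 1.0)
    return -float(np.sum(skewed_error)) / (grid.size * OPT)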
Termination. While typically the RL agent stops after a certain number of steps (horizon), we could additionally define a termination condition if all values in the grid reach above the optimal stimulation:
\begin{align*}\text{done}=\min(\text{grid})\geq\text{OPT}.\end{align*}
The listing at the end of this section shows an excerpt of the source code implementation of the gym in Python.

5.2 AutoRL on the TMS Environment

We were interested in comparing the actions that humans took when optimizing their scores with the RL agents’ behaviors and actions. With the help of AutoRL X, we aim to analyze how differently configured RL agents perform in our defined reference environment when solving the same task as the study participants. We also hoped to obtain graphs similar to those in Figure 17, which show the steadily increasing curves of human performance in optimizing the grid values. Such graphs would appear in the agent leaderboard, with each line in the chart representing one agent (Figures 1 and 7). However, while training the agents in the TMS environment, our agents exhibited more back-and-forth behavior, depending on the hyperparameter tuners, the selected reward function, and the number of episodes. This can easily be traced in the 3D visualization in Figure 20.
The reference gym within the platform provides an observable environment where the behavior of the RL agent can be monitored. This allows for real-time observation and analysis of the agent’s interactions with the environment, offering tangible evidence of how different settings or configurations impact the agent’s decisions. For instance, if an agent displays a propensity for selecting red or dark boxes within the environment, it could indicate that the scoring system may need adjustment to ensure the agent is not biased by certain visual features.
Additionally, AutoRL X serves as a valuable tool for conducting sanity checks to verify the fundamental correctness of the environment implementation, essentially a form of debugging the created gyms. Observations might reveal that actions chosen by the agent fall outside the expected range, leading to “out of bounds” behaviors. We also observed that agents do not necessarily move around the grid but sometimes learn to get stuck in a corner and increase the score by repeatedly executing the action on a single cell. Recognizing this enables developers to craft and compare different strategies for clipping or constraining actions, ensuring that the agent operates within the desired parameters (a minimal sketch of such strategies follows after Figure 20). Figure 19 shows the alternative implementations we have added in place of AutoDOViz’s matrix- and graph-based visualizations. Here, the agent’s behavior in the 2D grid and its trajectory are visualized over a single episode, enabled by the highly granular logging. At the beginning of step 1 in the first episode, the grid shows pre-defined, Gaussian-distributed damages, and the agent has not yet interacted. In subsequent iterations, we can see how the agent moves to different positions in the grid, applying the electric field; cells turn red when their value exceeds the optimal value (100 in our case). From the episode progress depicted in Figure 19, we can see that the agent has not yet developed a strategy to find the fastest, optimal trajectory for resolving the damaged cells in the grid. The agent’s trajectories in the TMS environment were additionally tracked in 3D (Figure 20).
Fig. 19.
Fig. 19. Trajectory of the agent for an initial epoch with eight different iterations. The grid is initialized with damaged cells of reduced values (grey). The agent’s current position on the grid is indicated by the blue box.
Fig. 20.
Fig. 20. Tracking the agent’s behavior in the TMS environment: trajectory of the agent (blue cone) on the 10×10 grid in the TMS environment with AutoRL X in dark mode. The user can step forward and backward to retrace the agent’s actions in 3D and track epochs and scores.
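To illustrate how such action-constraining strategies can be compared, the sketch below contrasts clipping to the grid bounds with the absolute-value/modulo wrapping used in the step() method of the listing at the end of this section; the raw action values are hypothetical:
import numpy as np

def clip_action(action, N):
    # Clamp the raw action to the valid grid range [0, N-1].
    return np.clip(action, 0.0, N - 1.0)

def wrap_action(action, N):
    # Wrap the raw action onto the grid, as done via abs() and modulo in step().
    return np.abs(action) % N

raw = np.array([12.7, -3.2])  # hypothetical out-of-bounds action on a 10x10 grid
print(clip_action(raw, 10))   # [9. 0.]
print(wrap_action(raw, 10))   # [2.7 3.2]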
Finally, it is worth mentioning that our developed gym environment is also available as part of an open-source package. This invites users to experiment and engage with the gym, fosters a collaborative and exploratory approach to refining and enhancing it, and supports studying RL models’ behavior. The open-source nature enables community-driven development and diverse contributions that can lead to innovative uses and improvements of gym setups through AutoRL X and beyond.
# Imports assumed for this listing: math and NumPy are standard; Box and
# BaseEnvironment are provided by the AutoRL backend (ARLO/MushroomRL-style).
import math
import numpy as np

class MyEnv(BaseEnvironment):
    def __init__(self, obj_name="TMSSimulator2D", ...):
        super().__init__(...)
        self.gamma = 0.99        # discount factor
        self.horizon = 150       # maximum number of steps per episode
        self.N = 10              # grid side length
        self.OPT = 100           # optimal activity level per cell
        self.n_damages = 10      # number of damaged (understimulated) regions
        self.grid = self._init_grid()
        self.observation_space = Box(
            low=np.array([0.0] * (self.N ** 2)),
            high=np.array([self.OPT * 2.0] * (self.N ** 2)),
            shape=((self.N ** 2),))
        self.action_space = Box(
            low=np.array([0.0, 0.0]),
            high=np.array([self.N - 1.0, self.N - 1.0]),
            shape=(2,)
        )

    def _init_grid(self):
        grid = np.full((self.N, self.N), float(self.OPT))
        # damage the grid with Gaussian-shaped lesions
        np.random.seed(42)
        for _ in range(self.n_damages):
            kernel_x = np.random.randint(0, self.N)
            kernel_y = np.random.randint(0, self.N)
            for x in range(self.N):
                for y in range(self.N):
                    value = self.OPT * self._gauss(kernel_x, kernel_y, x, y, 0.1 * self.N / 2)
                    grid[x][y] -= value
                    grid[x][y] = max(0, grid[x][y])
        return grid

    def _gauss(self, x0, y0, x, y, s):
        return math.exp(-((x - x0) ** 2) / (2 * s ** 2) - ((y - y0) ** 2) / (2 * s ** 2))

    def step(self, action):
        # wrap the continuous action onto the grid
        device_x = int(abs(action[0]) % self.N)
        device_y = int(abs(action[1]) % self.N)
        r = 0.05 * self.N
        intensity = 0.1
        error = 0
        for x in range(self.N):
            for y in range(self.N):
                # stimulate the cell and accumulate the skewed error term
                value = self.OPT * self._gauss(device_x, device_y, x, y, r / 2) * intensity
                self.grid[x][y] += value
                d = -(self.grid[x][y] / self.OPT - 1) * 2
                error = error + d * (np.exp(d) - 1)
        score = -error / (self.N ** 2 * self.OPT)
        done = np.min(self.grid) >= self.OPT
        return self.grid.flatten(), score, done, {}

    def reset(self, initial_state=None):
        self.grid = self._init_grid()
        return self.grid.flatten()
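For readers who want to try the gym directly, the following minimal sketch runs one episode with random actions; the import path is hypothetical and depends on how the open-source package is installed:
import numpy as np
from my_env import MyEnv  # hypothetical import path for the listing above

env = MyEnv()
state = env.reset()
episode_return = 0.0
for t in range(env.horizon):
    # sample a random continuous action within the action-space bounds
    action = np.random.uniform(env.action_space.low, env.action_space.high)
    state, score, done, info = env.step(action)
    episode_return += score
    if done:
        break
print(f"episode return after {t + 1} steps: {episode_return:.4f}")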

6 Discussion

Our development and exploration of a web-based user interface with visualizations demonstrates its effectiveness as an educational, problem-solving, and debugging tool, substantially demystifying RL for a wider audience. Utilizing the AutoRL X platform, our study observed human and RL agents’ behavior within a specially configured gym environment. While humans showed a tendency to improve their performance steadily, RL agents varied their behavior across learning epochs depending on hyperparameters, state and action space definitions, and reward functions. This was particularly evident in the RL agents’ learning patterns, which sometimes resulted in repetitive or suboptimal actions, like getting stuck in a grid corner, a phenomenon that underscores the importance of precise environment design and agent investigation.
In the previous section, we demonstrated that the AutoRL X platform, besides its visual analytics capabilities derived from AutoDOViz, acts as a practical testing and debugging ground for custom-built RL environments, catering to users from novice to expert levels in RL. Through its detailed visual analytics, such as those presented in Figure 19, we could track the agents’ interactions over time, offering insights into the agents’ strategy development, or lack thereof, as they interacted with the grid and adjusted to the simulated damages.
The responses from our post-study questionnaire show a consensus regarding the machine’s superior speed and accuracy, underscoring the benefits of using automation in complex tasks like brain stimulation procedures. However, skepticism surrounding the accuracy of the displayed scores and lack of trust in the 2D simulator indicate a gap in the interface that necessitates more transparent feedback mechanisms.
Our goal of developing a specialized RL environment for medical applications, particularly one simulating the complexities involved in brain stimulation, is an encouraging advancement. In addition, AutoRL X can serve as an educational platform and a research instrument, offering a controlled environment to refine brain stimulation strategies. However, our findings indicate that an RL agent, although programmed to optimize scores, does not always match human strategies, highlighting areas for future visual analytics to advance agent and gym development. Furthermore, agent learning from human demonstrations is a promising approach that has received increasing attention in the recent literature.
In conclusion, the user study and subsequent questionnaire provide valuable insights into human accuracy and the perception of these tasks. The performance scores and the varied time and interaction metrics demonstrate the platform’s adaptability and its potential as a training and research platform.

6.1 Limitations

While AutoRL X makes a significant impact as an open-source platform for RL, it is not without limitations. One challenge we encountered is the integration and deployment of the platform, which presented compatibility issues across various backend frameworks due to differing chipset architectures and outdated dependencies, compounded by the heterogeneity of user environments. Despite flexible architecture decisions, seamless integration across user-specific configurations therefore remains an ongoing task. Furthermore, our interface strives to present information intuitively, yet the complexity of RL makes it challenging to distill information without sacrificing detail. As we continue to refine the user experience, we aim to better tailor the information density to user preferences and expertise levels. Next, despite having selected a diverse population, the user study results would need to be confirmed on a larger sample before generalizing. Similar to AutoDOViz, a further limitation is the question of how to measure trust in AutoRL X, reflected in requirement R12. In AutoDOViz, we established that simple yes-no questions were not expressive enough to capture participants’ feelings about trust in the system. Lastly, the usability of AutoRL X was not directly tested in a usability study; however, as the open-source continuation of AutoDOViz, its design principles were informed by insights from previous extensive user studies. In light of this, we decided to forgo a potentially redundant examination in favor of deploying a different type of user study that provides alternative insights into the process of working in highly domain-specific environments.

6.2 Future Work

For the future trajectory of AutoRL X, the remaining user requirements need to be addressed, such as collaborative features (R6) that enable real-time edits and comments for a cooperative learning environment. Additionally, embedding educational features (R11), such as guided walk-throughs and interactive demonstrations as conducted in the simulator tutorial, could significantly ease the learning curve for users new to RL. The TMS reference gym can be further explored to surpass the human accuracy observed in our user study. Furthermore, the potential for integrating AutoRL X into other tools or platforms should be investigated; this could lead to a more comprehensive ecosystem for RL and ML practitioners, promoting a seamless workflow across various tools. Despite the feedback and iterative improvements we have drawn from AutoDOViz, the need for ongoing refinement based on user engagement remains. Continuous user feedback is necessary for platform evolution, ensuring that AutoRL X meets the latest user demands and anticipates and adapts to the quickly evolving ML landscape. This proactive approach to user-centric design and development will be crucial in maintaining the platform’s relevance and effectiveness.

7 Conclusion

In this article, we have presented AutoRL X, an open-source expansion of our previous work, AutoDOViz, which aims to contribute to a better understanding and utilization of RL in diverse domains. Our contributions encompass various facets that collectively advance the field of RL and highlight the critical role of visual analytics in promoting its understanding, trust, and usage. Our foremost contribution lies in democratizing Automated RL technology through an open-source contribution. This ensures that our code is readily accessible to the community, fostering collaboration and innovation. The flexible architecture of AutoRL X allows seamless integration with various backend engines, making RL more approachable and adaptable for a broader audience. Building upon the insights and feedback garnered from interviews and user studies conducted during the development of AutoDOViz, we have tailored AutoRL X to address the identified user interface elements and to incorporate additional features. This user-centric approach enhances the usability and personalization of RL agents, catering to the evolving needs of practitioners and researchers. Moreover, we have extended our platform’s applicability to a relevant problem in the healthcare domain. By creating a novel RL environment and a 2D simulator visualization component, we demonstrate the real-world potential of RL in optimizing complex healthcare challenges, such as optimizing brain stimulation device trajectories. Our user study, including experts from the healthcare field, provides valuable insights into the performance of RL compared to human decision-making, further solidifying the practicality of Automated RL. In summary, our work aims to take a leap toward more intelligent user interfaces for RL, applying open-source technology and modern user interface design to bridge the gap between complex RL algorithms and tangible real-world problem-solving. By presenting AutoRL X, we hope to have contributed to a broader understanding of RL processes and to emphasize the importance of visualization in enhancing RL trust and usage.

Acknowledgments

The authors wish to thank the participants of the user studies and interviews in AutoDOViz, which helped with the requirements analysis. We also wish to thank the experts of the TMS clinic for their time and valuable input.

