Introduction

Basic locomotion primitives, such as walking and turning, are a prerequisite for more complex behaviours in legged robots (e.g., foraging or environmental monitoring). However, obtaining controllers for such skills is a very challenging task. A popular approach is the use of bio-inspired Central Pattern Generators (CPGs), modelled after neural circuits in the spinal cord of various animals1. These (artificial) neural networks encode continuous-time signals that are smooth, (quasi-)periodic, and parameterised by only a few variables, which makes them appealing for learning locomotion, as the latter requires smooth and (quasi-)periodic behaviour2. CPGs can be interconnected to form a CPG-network that produces more complex signals.

In biology, neuroplasticity is a key element in the optimisation of CPG-networks3. This form of adaptation is a slow, indirect process of operant conditioning that improves the network's overall function. In robotics, a similar method of weight optimisation (WO) is the standard approach to training CPG-networks, in which the connection strengths are changed to improve locomotion4,5,6,7,8,9,10,11,12,13,14,15,16. In this paper, we present a novel way of optimising a CPG-network: rather than focusing on conventional WO, we instead focus on improving the initial state of the network (Initial State Optimisation, ISO). Figure 1 provides a general overview of our method.

Fig. 1: Initial State Optimisation on Central Pattern Generators (CPGs) quickly finds different locomotion skills (Gait: →; Clockwise rotation: ↻; Anti-clockwise rotation: ↺).

A Our robot is initialised with a network of interconnected CPGs in which all weights are frozen. B (top) We sample a random initial state of the CPG-network and produce a time-series (based on its dynamics) that provides motor commands to our robot. B (bottom-left) We collect data of the intermediate CPG states (\(\mathcal{S}\)) and the corresponding robot behaviour (\(\mathcal{D}\)). B (bottom-right) We consider the intermediate CPG states as additional “initial state” samples (bootstrapping) and assess the fitness values for multiple skills at the same time (parallelisation). C In the end, we re-test the best initial state sample for each skill \(\{\boldsymbol{s}_{\circlearrowleft}^{*},\boldsymbol{s}_{\to}^{*},\boldsymbol{s}_{\circlearrowright}^{*}\}\), all encoded in the same CPG-network. The real-world behaviour is shown in the three pictures on the right, with the robot position trajectory as a white line and the heading in red, fading along a (semi-)circle.

ISO is a more direct process than WO: we learn how to initialise the network in a state that produces the desired behaviour. The CPG-network itself is not adapted through WO and only functions as a behavioural encoder storing several motor primitives. Our approach is based on the observation that the CPGs in spinal centres can produce multiple locomotion patterns encoded within the same network3. In animals, it is well documented that the motor cortex and/or peripheral stimuli can start locomotion patterns with signals to the spinal cord that initialise the CPGs17 (although these signals are frequency-encoded by a drive signal rather than a single command). Initial states can be seen as the movement command for a specific (cyclical) motor pattern memorised by the higher-order brain centres (such as the motor cortex). Here, our start command is an input signal that sets the CPG-network to the correct initial state.

Our mathematical analysis of the difference between ISO and WO (provided in Appendix I) shows that, due to the dynamics of CPGs, WO suffers from many local optima caused by redundancies and nonlinearities in the weight search-space. This is not the case for ISO. This mathematical analysis also shows that our random CPG-networks contain infinitely many quasi-periodic cycles due to their elliptic fixed point. As a result, we suggest that instead of optimising limit-cycle behaviours using WO, ISO should be considered to acquire semi-cyclical patterns for robot locomotion. Furthermore, we identify two additional practical tricks that can be utilised during ISO to speed up optimisation (see Fig. 1): 1) Bootstrapping, which amounts to evaluating intermediate network states as additional potential initial state samples (an in-depth parameter analysis can be found in Appendix C.A); 2) Parallelisation, which involves assessing different locomotion skills in parallel (we show the efficacy of this method for up to 100 skills in parallel in Appendix C.B). Using ISO, our robots need only a few trials to learn multiple locomotion skills in parallel. Supplementary video 1 provides a video explanation of the method (also available at https://fudavd.github.io/multi-skill-learning/).

We demonstrate ISO on six different robots that learn the following three locomotion skills: gait, clockwise rotation, and anti-clockwise rotation (for a total of 18 learning tasks). Our experimental validation is two-fold. First, in simulation, we compare the learning performance of ISO with WO and a state-of-the-art (SOTA) baseline reinforcement learning method, Proximal Policy Optimisation (PPO18,19,20). The results show that our method outperforms PPO and achieves 50% of the WO performance in less than 5% of the time, which corresponds to our findings in preliminary experiments (Appendix B). Second, we perform real-world ISO experiments on our set of robots and compare the real-world performance with WO transferred from simulation to reality. To this end, we transfer the best controllers out of 30 repetitions to present the strongest and most robust solutions found by WO. The results show that WO suffers significantly from the sim-to-real transfer and is outperformed by ISO on 11/18 tasks (p < 0.001) in significantly shorter optimisation time. A more detailed layout of our CPG-network, the description of the system, the ISO and WO methods, and the experimental setup can be found in the Methods section. The Appendix provides the mathematical derivation of our propositions, the results of our preliminary experiments, and extended results.

Our conclusions are supported by exploratory experiments on a range of random time-series (Appendix A.A), showing that ISO can obtain performance similar to that of WO in a much shorter time frame for any target behaviour (Fig. B5). Finally, we show the model-agnostic nature of our method by learning 6 locomotion skills with ISO on a test suite of 20 different robots (Appendix C.C).

Results

We compared WO with ISO by learning three different movement skills on six different modular legged robots (see Fig. 2; more technical details in Table 1), resulting in 18 different learning tasks. The three locomotion skills are defined as: (1) gait (→), the absolute planar displacement speed in centimetres per second [cm s−1]; (2 and 3) clockwise and anti-clockwise rotation (↻/↺), the planar rotation speed in radians per second [rad s−1]. First, the performance of ISO and WO is compared with PPO in simulation (parameters are detailed in Table 2); then, ISO and WO are compared in the real world (without PPO, since it performed the worst in the simulation experiments).

Fig. 2: Test suite of six robots.

Left, rendered in simulation; Right, physical instance.

Table 1 Technical description of our six robots
Table 2 Hyperparameters of the optimisation learning algorithms (RevDE, Random Search)

Simulation experiment

In simulation, we optimise each skill for 300 min using ISO and WO, and compare their performance with a PPO baseline18. We use cleanRL's continuous action PPO algorithm for Isaac Gym with default parameters21. For WO, we use a state-of-the-art differential evolution algorithm (RevDE22) in which every trial lasts 1 min (ttrial = 60 s, Ntrials = 300). The trial time is chosen to differentiate between transient behaviour (starting from standing still to moving) and steady-state behaviour of the locomotion skill. Shorter evaluation times emphasise transient behaviour, which can be exploited by the optimiser (e.g., the robot may simply fall sideways to achieve a high rotation speed). Random search is used for ISO, in which we halve the number of trials and double the trial time to increase the effectiveness of bootstrapping, based on our hyperparameter sweep (Appendix C.A; ttrial = 120 s, Ntrials = 150). This results in a 600-fold increase in initial state samples for each trial (using a rolling-window evaluation length of teval = 60 s). It is important to note that, per robot, ISO requires one-third of the time of WO to learn all three skills, thanks to the parallelisation trick (300 min for ISO vs. 900 min for WO). A summary of the experimental parameters is presented in Table 3.
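To make the sample arithmetic concrete, here is a minimal Python sketch (our illustration, not the authors' code) of the bootstrapping count implied by these parameters:

# Rolling-window bootstrapping: every logged CPG state that still has a full
# 60 s evaluation window ahead of it becomes one extra "initial state" sample.
t_trial, t_eval, dt_log = 120.0, 60.0, 0.1    # seconds (Table 3); states logged at 10 Hz

n_samples_per_trial = round((t_trial - t_eval) / dt_log)
print(n_samples_per_trial)                    # 600 samples per 120 s trial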

Table 3 Experimental setup

The learning curves (Fig. 3) show the Mean Best Fitness (MBF, i.e., the highest fitness so far over time, see Algorithm 4) averaged over 30 repetitions for each experiment ± Standard Error (Nrep = 30, ± SE), a metric that allows for the comparison of different learners23. For each morphology, we also report the Time Till Equal Quality (TTEQ) for each skill, which indicates the first time that the MBF of ISO equals the MBF of WO (see Algorithm 5; higher TTEQ is better). The results show an overall better performance for ISO at the start of the learning process, indicating more efficient skill acquisition than WO or PPO. This outperformance is consistent throughout the entire optimisation process for Blokky (→, ↺ and ↻) and Salamander (↺), as reflected by the absence of TTEQ values. In congruence with the preliminary experiments, ISO sharply outperforms WO at the beginning before plateauing for the remainder of the run (Fig. B5). PPO tends to be the most inefficient overall. This is unsurprising, as the number of allowed time steps is several orders of magnitude smaller than in most PPO applications (180,000 vs. 10,000,000 in the original PPO paper18).

Fig. 3: Simulation experiment results.

Initial State optimisation (ISO, blue curves) outperforms both PPO (grey) and Weight optimisation (WO, orange); WO is outperformed for the first 64 min on average. The learning curves show the Mean Best Fitness (Nrep = 30, ± SE) for the gait (→) and anti-/clockwise rotation (↺, ↻) skills of the six different robots in simulation. The vertical dotted lines indicate the first cross-over point after which WO begins to outperform ISO, i.e., the Time Till Equal Quality (TTEQ, higher is better). The specific TTEQ for each skill is presented in each sub-caption (ordered as →, ↺, ↻), while the absence of a cross-over point is denoted with a dash (−).

These findings are unsurprising, since Appendix A reveals the linearity of ISO versus the complex nonlinear coupling of WO. The parallelisation trick makes ISO's learning efficiency independent of the number of skills learned, as shown for up to 100 skills in parallel in Fig. C7. This means that ISO has constant time complexity \(\mathcal{O}(k)\), even though ↺ and ↻ are directly opposing behaviours. Furthermore, even with random CPG-networks, the chances of finding a good initial state are relatively high (we find good performance quickly with random sampling and bootstrapping). Already in the first trials, ISO can identify better sub-samples than WO, thereby outperforming WO in efficiency. This outperformance is maintained for 64 min on average (mean TTEQ). Here, 64 min underestimates the true mean TTEQ, as we omitted instances where WO never crossed ISO (Blokky ↺ and ↻; Salamander ↺).

Comparing the MBF learning curves for each skill separately provides a distorted overview of the efficiency of ISO with respect to WO, because we pit one parallel run of the former against three sequential runs of the latter. In Fig. 4, we show the simulated time needed to reach normalised performance thresholds at [25, 50, 75, 100] percent of the WO performance. Normalisation is based on the final WO MBF values (i.e., performance at 300 min) for each skill of all six robots. For WO, we calculate the relative performance for each skill and show the total simulated time as a stacked bar (WO→ + WO↺ + WO↻, with a maximum time of 900 min). For ISO, we calculate the combined relative performance (that is, the total sum of relative performance over all three skills) as a dark grey bar (ISOΣ, see Algorithm 6). The ISOΣ relative performance provides a fair comparison, as the learning in ISO is parallelised. In the subcaptions of Fig. 4, we also report the specific time needed to obtain 75% relative performance for each robot.

Fig. 4: Normalised performance results.

Initial State optimisation (ISOΣ) obtains 50% of the Weight optimisation performance (WO→, WO↺, WO↻) in less than 5% of the time. The bars indicate the time needed to obtain a normalised performance quality with respect to WO (Nrep = 30). In the sub-captions, we specify the total time needed to attain 75% relative performance in minutes (WO time vs. ISO time).

With respect to normalised performance, ISO is substantially faster than WO. On average, WO takes 270 min to reach 75% performance versus 101 min for ISO, with all morphologies taking less than 200 min for the latter. By the end of the optimisation, four of the six morphologies (Ant, Gecko, Spider, and Stingray) never reached the same level of performance as WO, while the final performances of the Blokky and Salamander morphologies exceeded WO by up to 140%. Furthermore, we found that ISO can already achieve 50% relative performance (with respect to the final WO MBF) within the first 10 min (i.e., 5 trials) for all but the Spider morphology: Ant 9.33 min, Gecko 10 min, Spider 21 min, Blokky 2 min, Salamander 2.66 min, Stingray 6 min. As a result, ISO achieves 50% of the WO performance in less than 5% of the time.

Real-world experiment

In the real-world experiment, we test our ISO method on the physical twins of our simulated robots. Here, we restrict ourselves to only 15 min of total experiment time per robot, including re-evaluation of the acquired skills. Accounting for overhead, we can therefore run five trials of 2 min and re-test each acquired skill for 1 min. Based on the results of the simulation experiments, this very tight budget is sufficient to obtain good performance. Moreover, the 15 min limit is an efficient time for real-world experiments and compares well (as a benchmark time) with other SOTA learning algorithms (12–20 min12,13,24). We compare the performance of the real ISO experiments with the overall best WO controller (henceforth denoted as WO*, the best controller out of 30 runs) for each skill. This method is also known as offline learning with sim-to-real transfer25. The results of the real experiments are presented in Fig. 5. In the real world, WO* performs only similarly to ISO, equating to a 57% performance reduction as a result of the sim-to-real gap. In the end, ISO outperforms WO* in 11 out of 18 skills, based on the overall best run (ranked over all skills within a morphology, shown as striped bars). We want to emphasise that we present the strongest and most robust version of WO, where each skill was optimised in 30 separate runs, for a total of 90 runs per robot (vs. a single run learning all skills in parallel for ISO). Therefore, we argue that real-world ISO is preferable on the grounds that it circumvents the sim-to-real problem, requires fewer computational resources (local cluster vs. Raspberry Pi), and is more scalable with respect to additional skills (parallelisation means constant time complexity \(\mathcal{O}(k)\)). Supplementary video 2 shows the final behaviours obtained using ISO.

Fig. 5: Real-world experiment results.

Initial State optimisation (ISO, blue) on real robots produces similar performance to the sim-to-real transfer of the best of the 30 Weight Optimisation runs (WO*, orange). The bars show the performance of each physical robot on three different skills: gait (→) and anti-/clockwise rotation (↺, ↻). The striped bars indicate the overall best ISO run (based on the total rank over all skills).

ISO can learn all three skills on all six robots in under 15 min. We found only a small average performance drop for ISO on the real robots compared to the ISO simulation experiments. The real-world ISO performance thus corresponds to the predicted ISO performance, indicating that ISO is robust regardless of whether it is run in simulation or in the real world. The overall best real-world ISO runs (based on the total rank over all skills within a morphology) reach 54% of the final WO MBF values (compared to 50% in simulation for the same amount of time). Comparing these runs with WO*, ISO wins eleven out of eighteen learning tasks. The behaviour of these runs is shown in Appendix Fig. D9. If we zoom in closer, we see that ISO has a random bias for rotation in a certain direction within a single run (an effect that is smoothed out when averaging the 30 runs in simulation). The random CPG-weights may induce a bias towards a given rotational skill, as only a limited set of frequencies is available in the CPG-network, defined by the purely imaginary eigenvalues of its canonical form (shown in Appendix I). Slow weight adaptation might resolve this problem by changing the frequency content of the CPG-network.

Optimising in simulation is inevitably vulnerable to the sim-to-real gap26,27,28, as model approximations and simulation errors are exploited and overfitted by the optimiser. When transferring WO* to the real robots, we observed on average a 57% drop in performance compared to simulation, while ISO did not drop significantly in performance (−2%). The specific sim-to-real gap for each morphology and skill is presented in Appendix Table D1. These results show that the WO* controller is badly impacted by the sim-to-real gap, while ISO performance is robust in both conditions. In the end, Fig. 5 shows that ISO achieves comparable performance in the same wall-clock time as sim-to-real WO* (~11 min for ISO vs. ~4 min per skill for sim-to-real WO*). It should be noted that this comparison favours WO* by default, as we took the best controller over 30 repetitions (which took about 6 h in total per robot). Furthermore, in our experiments, we only considered three skills: the wall-clock time of WO increases linearly with the number of skills, while that of ISO remains constant (also demonstrated in Appendix C.B). Real-world ISO is thus competitive with sim-to-real transfer even under the most favourable circumstances for WO.

Preliminary feedback control experiments

Our CPG-controller allows the perturbation of hidden (y-neuron) states that are not directly linked to the signals sent to the servomotors. This provides smooth behavioural changes, as our CPG defines behaviour in terms of velocities. Using this technique, we extend our controller with heading feedback for a targeted locomotion task on our spider robot in simulation (design details are presented in Appendix D.C). The proposed method robustly reached a target at a random location 4 m from the origin in all trials (N = 60) in under 5 min. Supplementary video 3 shows the spider robot during the target-following task.

Discussion

An advantage of CPGs in robotics is that their behaviour is relatively stable and encoded cheaply using few parameters. Instead of learning a whole time-series of motor inputs or a policy over actions at every point in state-space, we only need to describe the CPG-network to realise useful behaviour. Our mathematical analysis (in Appendix A) sheds light on the effects of optimising the different design variables of a CPG-network: network structure, connection strength, and initial state. Based on our findings, we developed a simple and efficient method for learning multiple skills through the initial state.

We provide strong evidence for the potential advantage of learning skills through ISO. The results of the simulation experiments showed that ISO can achieve 50% of the WO performance in less than 5% of the simulated time, highlighting its superior efficiency. The real robot experiments show the advantage of ISO for online multi-skill learning versus offline WO with transfer of the best controllers. We consider task complexity not only in terms of the final behaviour (moving from A to B, or rotating), but also in terms of the difficulty of learning the behaviour (degrees of freedom, amount of prior knowledge, asymmetry in morphology and control, etc.). Compared to other SOTA learning algorithms (12–20 min12,13,24), we perform similarly on complex locomotion tasks. In addition, we achieve this learning performance without the use of a simulator, with limited a priori knowledge, and without a predefined model. Our method outperforms these methods both in terms of speed (the number of skills that it learns in a short amount of time, also seen in Appendix C.B) and generality (model-free control tested on several different morphologies).

As a consequence of their marginal stability (see Appendix I), our SO2-oscillators do not exhibit genuine limit-cycle behaviour. The behaviour we obtain is the result of only local synchronisation of the oscillator frequencies. This is unlike most work on CPGs, where a central frequency strictly defines a limit-cycle phase to coordinate periodic movements in time4,5,6,7,8,9,10,11,12,13,14,15,16. Our method does not require such an internal clock and is capable of learning quasi-periodic movements on much shorter time-scales, while it can also produce non-periodic control. In addition, the formal basis for the strength of our method is detailed in Appendix I. We show that, irrespective of the random weights, the SO2-oscillator network exhibits marginally stable behaviour due to its elliptic fixed point, which proves the existence of (infinitely) many quasi-periodic patterns. Furthermore, we investigated the expressiveness of ISO by estimating 100 randomly generated time-series in parallel (Appendix C.B) and extended the robot learning task to 6 skills in parallel on 20 different robots (Appendix C.C).

The importance of initial states is not limited to the specific application of SO2-oscillators analysed in this work. For any optimisation task (irrespective of the controller/optimisation strategy), the initial state of the robot must lie in a good region of the state space. If this is not the case, then the optimiser must solve two tasks: 1) find the optimal (cyclical) behaviour; 2) move from the bad initial state towards the region of optimal behaviour (a transient pattern). The authors of ref. 29 reported that, in their RL algorithm, the initial state distribution was designed and tuned separately for each task, which took a specialist around two days to complete. The sensitivity of certain dynamical systems to the initial state has already been described as the butterfly effect in chaos theory. Thus, in any system with a recurrent structure (ODEs, LSTMs, reinforcement learning, recurrent neural networks, other forms of CPGs, reservoir computing, etc.), the behaviour of a controller is sensitive to its initial condition. Nevertheless, we found that most work does not consider tuning/optimising the initial state (if reported at all, initial states are often set to 0 or 1).

Learning robot control is a topic of wide interest, with current methods relying on specific robot models and/or massive computing to learn models and behaviours in simulation13. Such learning methods have trouble adapting or learning new behaviours, as integrating/adapting learnt skills often requires sequential learning. This raises problems due to catastrophic forgetting (a problem that arises in lifelong/(deep) reinforcement learning30,31), the curse of dimensionality (e.g., MAP-Elites32,33), and overall time constraints11,29. Parallelisation circumvents catastrophic forgetting and results in constant time complexity \(\mathcal{O}(k)\) for learning multiple skills. In addition, massive parallel computing continues to be plagued by the sim-to-real gap, which we can circumvent by not relying on simulation at all26. Other model-free controllers are several orders of magnitude slower than our ISO method34. Similarly, other CPG-based controllers rely on human expertise to design networks with precise hyper-parameters13,35 and focus on a specific robot morphology14,24. In contrast, our method is truly model-free and straightforward in that it defines a network structure based on morphology with random weights. Finally, the process of storing and selecting patterns shares some resemblance with reservoir computing36 and the use of motor primitives37. However, ISO has the added advantage that it learns multiple skills in parallel and does not suffer from catastrophic forgetting.

The generalisability of ISO shows in the number of robots tested and in its efficacy at learning multiple skills/behaviours in parallel (Appendix C.C). Real-world testing was possible thanks to the time-efficiency of ISO. The design space of modular robots (the set of all possible morphologies) is large and incredibly diverse, insofar as it contains robots of various shapes, with asymmetrical bodies and few or many limbs of potentially varying lengths. Successfully learning multiple skills in parallel on our wide variety of morphologies can be attributed to the model-free nature of the CPGs. Alongside this, we created our CPG-network structure based on simple morphological rules that are not specific to our design and, as such, can be altered without any consequences for the algorithm. This shows the robustness of our method across different types of designs and tasks.

The open-loop skills obtained through ISO can be used for more complex tasks. We demonstrated this through higher-level control in an additional targeted locomotion task (Appendix D.C). In the future, our work can be extended by modulating gait speed and heading direction during forward locomotion. For this, we propose modulating our ODE (Equation A5) by scaling the weight matrix (A) for frequency adaptation and the output state for amplitude control. Different forms of complex feedback control have been implemented in CPG-networks, such as low-level Internal Model Control38 and smooth speed and heading control39; these could be implemented for ISO with feasible adaptations.

Furthermore, our learning framework was originally motivated in the context of a robot evolution system, in which morphologies and controllers evolve in tandem and “newborn” robots undergo individual learning to optimise their inherited brain to their body40. This requires learning methods that are robust and fast, i.e., that work for “all” possible morphologies and need relatively few search steps/evaluations before being deployed in the real world. By leveraging the additional bootstrapped samples in our optimiser, we can employ other data-driven techniques to this end; for example, surrogate models and MAP-Elites/novelty search41,42,43 can help to increase learning speed and encourage diversity in solutions. Parallelised fast learning of modular behaviours can also help in the design of hierarchical control structures such as finite-state machines or behaviour trees41,44,45.

Finally, we want to emphasise the perspective of the CPG-network as a behavioural encoder. CPGs as a behavioural reservoir provide a simple explanation for the multi-functionality of the spinal cord46. The behavioural encoder effectively stores various movement patterns that can be initiated with the appropriate start signal, which corresponds to the initial state. From this perspective, we can speculate that ISO quickly learns network input signals that activate movement skills, whereas WO slowly fine-tunes network dynamics during a lifetime, and evolution optimises network structure across generations. This idea is not too far-fetched, given that newborn babies exhibit predefined locomotion patterns when specific states trigger the walking reflex47. Our work might be the first to shed light on these bio-inspired modes of optimisation in CPGs over different time-scales.

To conclude, learning multiple movement skills is an important challenge in legged robotics, insofar as it is a prerequisite for viable complex behaviour. Considering the specificity of each robot, complex/dynamic environments, recovery after damage, etc., we argue that there is a pressing need for model-free controllers that learn quickly. In this paper, we show that this is achievable using CPG-networks and ISO. ISO is preferable over WO in terms of time-efficiency and with respect to the sim-to-real problem. Future work should aim at improving the search method, further utilising bootstrapping and parallelisation, and implementing more complex, higher-level control.

Methods

This section describes the CPG-network controller, the optimisation methods, and the simulated/physical experimental setups. It does not include any formal analysis of the optimisation of the network itself; for this, we refer the reader to Appendix A.

CPG-network controller

CPGs are pairs of artificial neurons that reciprocally inhibit and excite each other to produce oscillatory behaviour2, and can be mathematically described as a set of Ordinary Differential Equations (ODEs)48. In the extant literature, several different models for CPGs have been proposed4,5,6,7,9,10,11,12,13,14,15,49. Although some details may vary, in essence, the oscillatory behaviour within the CPG is invariably defined in terms of the connectivity between neurons and their initial state. We focus on an inter-connected network of SO2-oscillators50 whose dynamics are described by an anti-symmetric weight-matrix that results in (quasi-)periodic behaviour.

Figure 6a shows a single coupled SO2-oscillator (i.e., CPG) formulated as two paired neurons (an xi- and a yi-neuron). Here, i denotes the specific joint associated with this CPG. The change of a neuron's state is calculated by multiplying the current state of the opposite neuron with a weight (w). The weights \(w_{x_i y_i}\) and \(w_{y_i x_i}\) represent the connection strengths between each pair of neurons. An output neuron oi is coupled to the x-neuron with connection strength \(w_{x_i o_i}\). The output of oi serves as the input command for the servomotor in a specific joint (i.e., servo). In our case, we selected a hyperbolic tangent activation function for our output neuron to normalise the motor input to [−1, 1], which corresponds to the min/max angular positions of the joints. It is important to note that the output neuron only reads out the current state of the x-neuron and, hence, neither influences the CPG dynamics nor has an initial state.

Fig. 6: CPG controller design.

a A single CPG. b The CPG-network schematic of our Spider robot, containing eight CPGs (numbers) with ten pairs of neighbouring connections (blue arrows).

To enable more complex output patterns, symmetrical/bidirectional connections between neighbouring joints (denoted within the neighbour set \(\mathcal{N}\)) are added between the oscillators' x-neurons (see Fig. 6b). We define the set of neighbouring joints of joint i (\(\mathcal{N}_i \subset \mathcal{N}\)) as the set of all joints j ≠ i located within a 2-module distance from i. That is to say, if we represent our robot morphology as a connected acyclic graph (a tree with the top node being the Core module), then all joints j connected within two or fewer edges from joint i are considered neighbouring joints in \(\mathcal{N}_i\). The set \(\mathcal{N}_i\) contains all the neighbouring inter-connections with xi in the CPG-controller; the resulting ODE is shown below.

$$\dot{x}_{i}=w_{y_{i}x_{i}}\,y_{i}+\sum_{j\in \mathcal{N}_{i}}w_{x_{j}x_{i}}\,x_{j}$$
(1)
$$\dot{y}_{i}=w_{x_{i}y_{i}}\,x_{i}$$
(2)

The connections between neighbouring joints j and i are denoted as \(w_{x_j x_i}\). To simplify the search space, the following relations are imposed between the weights: \(w_{x_i y_i}=-w_{y_i x_i}\); \(w_{x_j x_i}=-w_{x_i x_j}\); \(w_{x_i o_i}=1\). The size of the search space for WO and ISO can now be formulated as follows: for a given robot, the number of CPGs equals the number of joints (k), while the number of neighbouring pair connections (\(|\mathcal{N}|\)) follows from the modular tree-based distance. This means that WO has \(N_{weights}=k+|\mathcal{N}|\) weights to optimise, and ISO has \(N_{state}=2k\) initial states (two neurons per CPG oscillator).
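As an illustration, the following minimal Python sketch (our own, not the authors' implementation) integrates Equations (1) and (2) under the weight constraints above with simple Euler steps; the dense coupling matrix is an assumption made for brevity, whereas the paper only connects neighbouring joints:

import numpy as np

def step(x, y, w, W, dt=1/60):
    # Eq. (1): dx_i = w_{y_i x_i} y_i + sum_j w_{x_j x_i} x_j, with w_{y_i x_i} = -w_{x_i y_i}
    dx = -w * y + x @ W
    # Eq. (2): dy_i = w_{x_i y_i} x_i
    dy = w * x
    return x + dt * dx, y + dt * dy

rng = np.random.default_rng(0)
k = 8                                       # number of joints, e.g. the Spider (Fig. 6b)
w = rng.uniform(-1, 1, k)                   # intra-oscillator weights w_{x_i y_i}
W = np.triu(rng.uniform(-1, 1, (k, k)), 1)
W = W - W.T                                 # anti-symmetry: w_{x_j x_i} = -w_{x_i x_j}
x = rng.uniform(-1, 1, k)                   # initial x-neuron states (optimised by ISO)
y = rng.uniform(-1, 1, k)                   # initial y-neuron states
for _ in range(600):                        # 10 s at the 60 Hz update rate
    x, y = step(x, y, w, W)
    motors = np.tanh(x)                     # output neurons o_i in [-1, 1] (w_{x_i o_i} = 1)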

System description

Robots

Our robots consist of modules based on the RoboGen framework51. A modular design space has the distinct advantage of rapid real-world prototyping through pre-fabrication of 3D-printed modules and off-the-shelf components, as also shown in other frameworks52,53,54. We altered the original RoboGen module designs to prevent breakage, improve the servomotors, and allow feedback control. This new design consists of three components: a core component (9 cm × 9 cm × 6 cm, 250 g), brick components (6 cm × 6 cm × 6 cm, 30 g), and active hinges (12 cm × 6 cm × 6 cm, 69 g). The core component is a container with a micro-controller (Raspberry Pi 3B+, https://www.raspberrypi.org/products/raspberry-pi-3-model-b-plus/), a custom-designed bridge to our servomotors (as detailed in ref. 55), and a battery (LiPo, 3.7 V, 7800 mA h). It has four attachment slots on its lateral faces, to which other modules can be attached via screws. The brick component is a smaller cube with similar attachment points on its lateral sides. The active hinge component is a joint actuated by a servomotor (Tower Pro MG996R, detailed specifications at https://www.towerpro.com.tw/product/mg996r/) with an attachment point on both ends.

For our learning experiments, we use six different legged robot designs based on our modular framework. This set can be divided into two groups: (1) bio-inspired, hand-designed robots that resemble animals with distinct locomotion patterns (top row in Fig. 2); (2) automatically designed robots evolved via a novelty-search algorithm (bottom row in Fig. 2). The latter group was evolved to be fast in gait and morphologically diverse23,56. A description of each robot, along with its corresponding CPG-network parameters, is presented in Table 1.

Simulator system description

For our simulation experiments, we use a high-performance GPU-based 3D physics simulator for robot learning named Isaac Gym57 (with an NVIDIA GeForce RTX 3070). Isaac Gym enables parallelisation during optimisation: in WO, this allows us to simulate a whole population of robots in every generation of our evolutionary algorithm, and in ISO, to simulate all our trials at once. We set the simulator update rate to 60 Hz with two sub-steps and use the PhysX physics engine with the temporal Gauss-Seidel solver. The robots are spawned in a neutral position (all joint angles 0 rad) centred on a flat horizontal plane, as illustrated in Fig. 2. Our world-frame coordinate system has its origin on the flat surface, with the x/y-directions spanning the ground and the positive z-direction perpendicular to the surface pointing up, i.e., opposite to the direction of gravity (9.81 m s−2). During simulation, we log both the position {x, y, z} and orientation {ϕx, ϕy, ϕz} of the robot core component each time we update the controller servomotor input (update frequency 10 Hz). At the end of a trial, the logged robot state data (\(\mathcal{D}\)) are used for evaluation.

Real-world system description

For our real-world experiments, we used our own tracking system, which merges the footage from six cameras (Dahua, IPC-HFW1320SP) into a single video-stream on a central laptop. Merging the cameras creates an image of 3964 × 1776 pixels with a resolution of 0.26 mm px−1. The video-stream provides a top-down view of our lab at 30 frames per second. We made a custom tracking script using ArUco markers58 to log both the position and orientation of the Core module every 0.1 s (10 Hz) in the same world-frame coordinates as the simulation experiments (positive z-direction up). From the central laptop, a start command automatically uploads a CPG controller via WiFi. The states of the robot and controller are buffered until the end of a trial, after which fitness values are calculated and all data are saved. The CPG-controller runs locally in Python on a Raspberry Pi, which sends the corresponding PWM servomotor commands with PIGPIO. For stable servomotor behaviour, the CPG-controller commands are provided at 64 Hz.

A trial is initiated by first starting/recording the merged video-stream, followed by sending a start signal from the laptop to the Raspberry Pi. During a trial, all intermediate controller states are logged locally on the Raspberry Pi, while all frames from the video-stream are buffered on the laptop. When the Raspberry Pi finishes, a stop signal is sent to the laptop and the video-stream stops. In this way, the video-stream recording always covers both the beginning and end of the trial. We match the intermediate controller states to the buffered video frames to obtain the CPG initial state samples corresponding to the captured robot states (position and orientation).

Controller optimisation

We assess three basic locomotion skills in this paper (→, ↺ and ↻), each evaluated over 60 s (teval = 60 s). Gait fitness is defined as the absolute planar distance travelled divided by time (f→ [cm s−1], see Equation 3). Rotational fitness is defined as the amount of horizontal rotation (ϕz) in the preferred direction divided by time (f↺/↻ [rad s−1], see Equation 4). For f↺/↻ we use a stable unwrapping algorithm (see Algorithm 3) that accounts for precession projection artefacts to prevent exploitation by our learning algorithm59. We run Nrep repetitions of the same experiment, with a single experiment consisting of several trials that provide samples for the optimiser. In the case of ISO, a single trial contains multiple initial state samples due to the bootstrapping trick. WO and ISO each have their own procedure, which will be explained in turn shortly. A summary of the respective optimiser parameters is shown in Table 2.

Weight optimisation

For Weight optimisation (WO), we represent our CPG-network as a vector of real numbers (\(\mathbb{R}^{N_{weights}}\)) that defines the connection strengths within and between oscillators. Here, the initial state of the CPG is fixed during optimisation to \(\frac{1}{2}\sqrt{2}\) for all neurons. This value induces a \(\frac{1}{2}\pi\) phase-offset between the x- and y-neurons within a single oscillator pair. The WO search-space size for each robot is shown in Table 1 (Nweights). Every time we sample a new vector of weights from the search space, we test it for 60 s (ttrial = 60 s), after which a fitness value is assigned for the corresponding skill.

To optimise our vector of weights, we use an evolutionary algorithm called Reversible Differential Evolution (RevDE, ref. 22). This algorithm iteratively proposes solutions by applying a special set of linearly reversible transformations that produce wide coverage of the search space when selecting new samples22. RevDE works as follows (see Algorithm 1): (1) initialise a population of λ uniformly sampled weight vectors, \(pop \in \mathbb{R}^{\lambda \times N_{weights}}\), in the range U[−1, 1]; (2) obtain fitness values (\(\mathcal{F}\)) over all λ samples; (3) select the top μ samples to obtain genotype vectors \(\mu_1 \in \mathbb{R}^{\mu \times N_{weights}}\); (4) randomly row-shuffle μ1 twice to obtain two additional matrices (μ2, μ3), and create new samples with the following linear relations: λ1 = μ1 + F(μ2 − μ3), λ2 = μ2 + F(μ3 − λ1), λ3 = μ3 + F(λ1 − λ2). Apply uniform crossover with probability CR and repeat from (2). The number of evaluated samples (λ) is three times the top-size μ, which is set to 10 (Table 2). The parameters are set to scaling factor F = 0.5 and cross-over rate CR = 0.9, which have been shown to provide robust performance across different morphologies23.
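One RevDE generation, following steps (2)–(4) above, can be sketched in Python as follows (our illustration; `fitness` stands in for a 60 s trial evaluation, and the crossover-against-parents detail is our assumption):

import numpy as np

def revde_generation(pop, fitness, mu=10, F=0.5, CR=0.9, rng=np.random.default_rng()):
    scores = np.array([fitness(v) for v in pop])            # (2) evaluate all lambda samples
    mu1 = pop[np.argsort(scores)[::-1][:mu]]                # (3) select the top-mu vectors
    mu2, mu3 = rng.permutation(mu1), rng.permutation(mu1)   # (4) two random row shuffles
    lam1 = mu1 + F * (mu2 - mu3)                            # linearly reversible transforms
    lam2 = mu2 + F * (mu3 - lam1)
    lam3 = mu3 + F * (lam1 - lam2)
    parents = np.vstack([mu1, mu2, mu3])
    children = np.vstack([lam1, lam2, lam3])                # lambda = 3 * mu new samples
    keep = rng.random(children.shape) < CR                  # uniform crossover with rate CR
    return np.where(keep, children, parents)

# Usage: pop = rng.uniform(-1, 1, (30, N_weights)), then pop = revde_generation(pop, fitness).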

Algorithm 1

RevDE weight optimisation of CPG-networks

Initial state optimisation

For ISO, we represent our CPG-network as a vector of real numbers (\(\mathbb{R}^{N_{state}}\)) that defines the initial states of all neurons within the network. Here, all weights (both within and between oscillators) are sampled randomly (U[−1, 1]) and remain fixed during optimisation. The ISO search-space size for each robot is shown in Table 1 (Nstate). In every trial, we test a random initial state for 120 s (ttrial = 120 s); to bootstrap the number of samples per trial, we evaluate intermediate CPG-states with a rolling window of teval = 60 s, which results in Nsamples/trial = 600.

At the beginning of an ISO experiment, we fix the weights of the CPG-network to random values in U[−1, 1]. For every trial, we 'start' this random CPG-network from random, uniformly sampled initial states (\(U[-1,1]\in \mathbb{R}^{N_{state}}\)) and log the intermediate states every 0.1 s. This bootstraps the number of samples by ten additional initial states per second. The fitness values are then calculated over the logged sub-samples using a rolling window. We only consider samples with a full 60 s window length, thereby only considering the first 600 initial states, Nsamples/trial = (ttrial − teval)/0.1. Fitness is calculated on each of these 600 sub-samples for all skills in parallel. Per trial, we only store the sub-sample with the highest fitness for each skill (↺/→/↻), i.e., {s↺, s→, s↻} (see Fig. 1B, data-evaluation). Therefore, a single trial provides Nskills fitness values (one for each skill), corresponding to the (intermediate) CPG-state during the trial with the best behaviour for that skill. Subsequently, we repeat this process by loading the same CPG-network (in terms of weights) with a new random initial state, until our (time) budget is spent. The pseudo-code for ISO is shown below (see Algorithm 2). To clarify, the pseudo-code presents a parallelised application (i.e., we TEST all initial states at the same time); in the real world, we TEST each trial sequentially.

Algorithm 2

Random search initial state optimisation of CPG-networks
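As a rough stand-in for Algorithm 2, the following Python sketch (ours; `run_trial` is an assumed helper that integrates the frozen CPG-network from a given initial state for 120 s and returns the 10 Hz logs of CPG states \(\mathcal{S}\) and robot poses \(\mathcal{D}\)) captures the random-search loop with both tricks:

import numpy as np

def iso(run_trial, skills, n_state, n_trials=150, rng=np.random.default_rng()):
    best = {name: (-np.inf, None) for name in skills}   # best (fitness, state) per skill
    for _ in range(n_trials):
        s0 = rng.uniform(-1, 1, n_state)                # random initial state sample
        S, D = run_trial(s0)                            # intermediate states + behaviour
        for t in range(600):                            # only full 60 s windows
            window = D[t:t + 600]                       # rolling window at 10 Hz
            for name, fit in skills.items():            # parallelisation trick
                f = fit(window)
                if f > best[name][0]:
                    best[name] = (f, S[t])              # bootstrapping trick
    return best                                         # best initial state per skill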

Proximal policy optimisation

We use the continuous action-space PPO algorithm of CleanRL21, which is based on the original paper18. Here, we adapt our fitness functions (Equation 3, Equation 4) with teval = 0.1 s for our intermediate rewards. This corresponds to the update frequency of our controller, i.e., our rewards are the (rotational) distance travelled per controller update. Per experiment, we run 300 trials of 60 s (similar to WO and ISO) with a total of 180,000 time steps per skill (540,000 in total). We update our learners in batches of 10 trials that are split into 100 mini-batches over 10 epochs. The remaining parameter values are taken from the original paper.
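For instance, under this setup the intermediate gait reward reduces Equation (3) to the planar displacement between consecutive 10 Hz controller updates (our sketch):

def gait_reward(xy_t, xy_prev, dt=0.1):
    # Eq. (3) with t_eval = 0.1 s: planar distance covered per controller update
    return ((xy_t[0] - xy_prev[0])**2 + (xy_t[1] - xy_prev[1])**2) ** 0.5 / dt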

Experimental setup

We perform two different experiments: the first is a performance comparison between WO and ISO in simulation; in the second, we compare ISO with sim-to-real WO in the real world. The experimental conditions are shown in Table 3.

Simulation experimental setup

For the simulation experiments, we carry out a per-skill comparison of fitness against simulated time in minutes between ISO and WO. It should be noted that the simulated time for WO is reported relative to each skill (as opposed to the cumulative time needed to learn the three skills sequentially). For the purpose of statistical analysis, we repeat each optimisation 30 times (Nrep = 30). We compare the MBF learning curves (Algorithm 4), the ISO out-performance time (Algorithm 6), and the TTEQ (Algorithm 5).

For WO, a separate experiment is run for each skill. An optimisation run for a single skill comprises 300 trials, each 60 s in length. At the end of each trial, a single fitness value is calculated as a sample for the RevDE optimiser. We log the best-fitness-so-far for every trial. In the end, the total simulated time needed to learn three skills is \(t_{total}=300\times 3=900\) min.

For ISO, all three skills are learnt simultaneously in a single run. A single run comprises 150 trials, each 120 s in length. At the end of each trial, rolling-window fitness values are calculated for all skills with the same evaluation time as WO, i.e., 60 s. Within a trial, the best sub-sample fitness represents the trial fitness for each skill. We log the best-fitness-so-far for every trial. In the end, the total time in simulation needed to learn three skills is ttotal = 300 min.

Real-world experimental setup

Our real-world experiments are physical tests of Algorithm 2 on the physical twins. We start by placing a real robot in the centre of a 7 × 4 m arena. We run a CPG-network with random weights and a random initial state from a central laptop for 120 s. A custom-built tracking system using ArUco markers tracks the robot's position and orientation every 0.1 s. At the end of the trial, the tracking data are matched to the controller states at the corresponding times, and rolling fitness values are calculated for all skills. This process is repeated until all trials are complete.

A perfectly tracked run produces 600 samples, but due to noise and image-merging artefacts, this number can sometimes be lower. Although it rarely occurs in practice, missing robot data is filled in with the last previously captured state from the buffer. This operation only affects the fitness values when data is missing at the end of a trial, because → fitness only considers the state at the end, and ↺/↻ fitness sums the angular differences between subsequent samples, which are zero when samples are copied. Furthermore, filling in missing data with the last-seen values underestimates fitness values, insofar as it effectively shortens the evaluation period while still dividing by the same fixed evaluation time teval = 60 s.

For ISO, we test five random initial states per run (Ntrials = 5) with three repetitions (Nrep = 3). After all trials are finished, we determine the best overall initial state sample \(s_0^*\) for each skill and re-test it to obtain the final fitness for each skill in a single run. Including overhead and re-testing of the acquired skills, the total time for the real robot experiments is ttotal ≤ 15 min.

For the sim-to-real experiments, we re-run, on the physical robots, the overall best WO* controller from the 30 runs completed in simulation for each skill. We re-test each WO* controller once. We measure performance in the same fashion as for the real-world ISO setup. Ultimately, we compare the real-world performance of all three ISO runs and WO*.

Metrics

For the simulated environment, we apply 30 repetitions of each learning experiment (30 repetitions × 2 optimisers × 6 robots × 3 skills). We report the MBF over time ± Standard Error (Nrep = 30, ± SE). The best performance over time indicates the highest fitness so far. For each skill, the fitness is calculated as follows:

Gait fitness

We define gait fitness as the absolute planar displacement speed in centimetres per second [cm s−1]. We define our world frame's origin at the ground, where the x- and y-directions span the horizontal ground plane and the z-direction points up. Gait performance is then calculated as follows:

$$f_{\to }(t)=\frac{\sqrt{(x_{t+t_{eval}}-x_{t})^{2}+(y_{t+t_{eval}}-y_{t})^{2}}}{t_{eval}}$$
(3)

where teval represents the evaluation time in seconds (60 s), and (xt, yt) are the planar coordinates at time t.
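A direct Python transcription of Equation (3) (our sketch; `xy` is assumed to hold the 10 Hz log of planar positions in centimetres):

import numpy as np

def gait_fitness(xy, t, t_eval=60.0, dt=0.1):
    i, j = round(t / dt), round((t + t_eval) / dt)           # window start/end indices
    return float(np.linalg.norm(xy[j] - xy[i]) / t_eval)     # [cm/s]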

Rotational fitness

We define rotational fitness as the rotational speed around the z-axis of our world frame in radians per second [rad s−1]. Our skills ↺ and ↻ only define which rotational direction is preferred, which, in turn, changes the sign of f↺/↻. We calculate the rotational fitness as follows:

$$f_{\circlearrowleft }(t)=\frac{1}{t_{eval}}\sum_{\tau =t+dt}^{t+t_{eval}}\mathrm{UNWRAP}_{\phi _{z}}\left(\boldsymbol{\phi }_{\tau },\,\boldsymbol{\phi }_{\tau -dt}\right)$$
(4)

Here, ϕτ denotes the vector of angles [ϕx, ϕy, ϕz] at time τ. We add the differences in the planar angle ϕz between subsequent samples to obtain the total amount of planar rotation achieved. The differences are unwrapped from [0, 2π) with a custom unwrapping algorithm (from ref. 59), which accounts for precession angle artefacts that could otherwise be exploited by optimisers. The pseudo-code is shown below in Algorithm 3.

Algorithm 3

UNWRAP: Calculate relative z-angle between subsequent time-steps
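As a naive stand-in (our sketch; the actual Algorithm 3 additionally corrects the precession projection artefacts from ref. 59, which we omit here), the relative z-angle and Equation (4) can be written as:

import math

def unwrap_z(phi_z_t, phi_z_prev):
    # Shortest signed difference between subsequent z-angles, mapped to (-pi, pi]
    d = (phi_z_t - phi_z_prev) % (2 * math.pi)
    return d - 2 * math.pi if d > math.pi else d

def rotation_fitness(phi_z, t_eval=60.0):
    # Eq. (4): summed unwrapped differences over the window (anti-clockwise positive)
    return sum(unwrap_z(a, b) for a, b in zip(phi_z[1:], phi_z[:-1])) / t_eval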

Mean best fitness

For each skill, we log the fitness values of a single run for each trial, \(\mathcal{F}_{trials}=[f_{skill_{1}},f_{skill_{2}},\ldots,f_{skill_{N_{trials}}}]\in \mathbb{R}^{N_{trials}}\). The repeated experiments (Nrep = 30) are concatenated into \(\mathcal{F}_{runs}\in \mathbb{R}^{N_{rep}\times N_{trials}}\). This allows us to calculate the MBF as follows:

Algorithm 4

Mean Best Fitness
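In NumPy terms, the MBF is a running maximum per run, averaged over repetitions (our sketch of Algorithm 4):

import numpy as np

def mean_best_fitness(F_runs):
    # F_runs: (N_rep, N_trials) array of per-trial fitness values for one skill
    best_so_far = np.maximum.accumulate(F_runs, axis=1)      # highest fitness so far
    mbf = best_so_far.mean(axis=0)
    se = best_so_far.std(axis=0, ddof=1) / np.sqrt(F_runs.shape[0])
    return mbf, se                                           # MBF ± SE per trial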

Time till equal quality

TTEQ measures the first cross-over point of the MBF curves. It indicates how long one optimisation method outperforms another.

Algorithm 5

Time Till Equal Quality
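The cross-over search can be sketched as follows (ours; assuming both MBF curves have been re-sampled onto a common time axis):

import numpy as np

def tteq(mbf_iso, mbf_wo, minutes_per_step=1.0):
    # First time the WO curve catches up with the ISO curve; None maps to '-'
    crossed = np.nonzero(mbf_wo >= mbf_iso)[0]
    return crossed[0] * minutes_per_step if crossed.size else None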

Normalised performance

We normalise performance based on the final MBF values of WO (as it is the best-performing method overall in simulation). While we apply normalisation per skill for WO, for ISO we apply normalisation based on the summed performance over all skills (denoted with Σ). The summed normalisation indicates the average relative performance over time when learning multiple skills in parallel. This allows us to generalise across multiple skills and compare the parallel- and sequential-learning time-efficiency of ISO and WO, respectively.

Algorithm 6

Normalised Performance
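A sketch of the thresholding step (ours, following the description above; averaging the per-skill relative curves for ISOΣ is our reading of the summed normalisation):

import numpy as np

def iso_sigma(mbf_per_skill, wo_finals):
    # Combined relative performance over all skills learned in parallel
    return np.mean([m / f for m, f in zip(mbf_per_skill, wo_finals)], axis=0)

def time_to_threshold(rel_curve, threshold, minutes_per_step):
    # First time a relative-performance curve reaches the threshold (e.g. 0.75)
    hit = np.nonzero(rel_curve >= threshold)[0]
    return hit[0] * minutes_per_step if hit.size else None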

Statistical analysis

All optimisation experiments in simulation were conducted 30 times (Nrep = 30). Within-robot/task comparisons of ISO vs. WO are subject to a standard Student's t-test after passing checks for normality (Shapiro-Wilk test) and equal variances (F-test). Differences are considered significant at p < 0.05, with Bonferroni correction for aggregated results when comparing WO vs. ISO across all skills and morphologies (p < 0.05/m, with m = 18 learning tasks).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.