Introduction

Basic locomotion primitives, such as walking and turning, are a prerequisite for more complex behaviours in legged robots (e.g., foraging or environmental monitoring). However, obtaining controllers for such skills is a very challenging task. A popular approach is the use of bio-inspired Central Pattern Generators (CPGs), modelled after neural circuits in the spinal cord of various animals1. These (artificial) neural networks encode continuous-time signals that are smooth, (quasi-)periodic, and parameterised by only a few variables, which makes them appealing for learning locomotion, as the latter requires smooth and (quasi-)periodic behaviour2. CPGs can be interconnected to form a CPG-network that produces more complex signals.

In biology, neuroplasticity is a key element in the optimisation of CPG-networks3. This form of adaptation is a slow, indirect process of operant conditioning that improves the network's overall function. In robotics, a similar method of weight optimisation (WO) is the standard approach to training CPG-networks, in which the connection strengths are changed to improve locomotion4,5,6,7,8,9,10,11,12,13,14,15,16. In this paper, we present a novel way of optimising a CPG-network: rather than focusing on conventional WO, we instead focus on improving the initial state of the network (Initial State Optimisation, ISO). Figure 1 provides a general overview of our method.

Fig. 1: Initial State Optimisation on Central Pattern Generators (CPGs) quickly finds different locomotion skills (Gait: →; Clockwise rotation: ↻; Anti-clockwise rotation: ↺).

A Our robot is initialised with a network of interconnected CPGs in which all weights are frozen. B (top) We sample a random initial state of the CPG-network and produce a time-series (based on its dynamics) that provides motor commands to our robot. B (bottom-left) We collect data of the intermediate CPG states (\(\mathcal{S}\)) and the corresponding robot behaviour (\(\mathcal{D}\)). B (bottom-right) We consider the intermediate CPG states as additional “initial state” samples (bootstrapping) and assess the fitness values for multiple skills at the same time (parallelisation). C In the end, we re-test the best initial state sample for each skill \(\{\boldsymbol{s}_{\circlearrowleft}^{*},\boldsymbol{s}_{\to}^{*},\boldsymbol{s}_{\circlearrowright}^{*}\}\), all encoded in the same CPG-network. The real-world behaviour is shown in the three pictures on the right, with the robot position trajectory as a white line and the heading in red, fading along a (semi-)circle.

ISO is a more direct process than WO: we learn how to initialise the network in a state that produces the desired behaviour. The CPG-network itself is not adapted through WO and only functions as a behavioural encoder storing several motor primitives. Our approach is based on the observation that the CPGs in spinal centres can produce multiple locomotion patterns encoded within the same network3. In animals, it is well documented that the motor cortex and/or peripheral stimuli can start locomotion patterns with signals to the spinal cord that initialise the CPGs17 (although these signals are frequency-encoded by a drive signal rather than a single command). Initial states can be seen as the movement command for a specific (cyclical) motor pattern memorised by the higher-order brain centres (such as the motor cortex). Here, our start command is an input signal that sets the CPG-network to the correct initial state.

Our mathematical analysis of the difference between ISO and WO (provided in Appendix I) shows that, due to the dynamics of CPGs, WO suffers from many local optima caused by redundancies and nonlinearities in the weight search-space. This is not the case for ISO. This mathematical analysis also shows that our random CPG-networks contain infinitely many quasi-periodic cycles due to their elliptic fixed point. As a result, we suggest that instead of optimising limit-cycle behaviours using WO, ISO should be considered to acquire semi-cyclical patterns for robot locomotion. Furthermore, we identify two additional practical tricks that can be utilised during ISO to speed up optimisation (see Fig. 1): 1) Bootstrapping, which amounts to evaluating intermediate network states as additional potential initial state samples (an in-depth parameter analysis can be found in Appendix C.A); 2) Parallelisation, which involves assessing different locomotion skills in parallel (we show the efficacy of this method for up to 100 skills in parallel in Appendix C.B). Using ISO, our robots need only a few trials to learn multiple locomotion skills in parallel. Supplementary video 1 provides a video explanation of the method (also available at https://fudavd.github.io/multi-skill-learning/).

We demonstrate ISO on six different robots that learn the following three locomotion skills: gait, clockwise rotation, and anti-clockwise rotation (for a total of 18 learning tasks). Our experimental validation is two-fold. First, in simulation, we compare the learning performance of ISO with WO and a state-of-the-art (SOTA) baseline reinforcement learning method, Proximal Policy Optimisation (PPO18,19,20). The results show that our method outperforms PPO and achieves 50% of the WO performance in less than 5% of the time, which corresponds to our findings in preliminary experiments (Appendix B). Second, we perform real-world ISO experiments on our set of robots and compare the real-world performance with WO transferred from simulation to reality. To this end, we transfer the best controllers out of 30 repetitions to present the strongest and most robust solutions found by WO. The results show that WO suffers significantly from the sim-to-real transfer and is outperformed by ISO on 11/18 tasks (p < 0.001) in significantly shorter optimisation time. A more detailed layout of our CPG-network, the description of the system, the ISO and WO methods, and the experimental setup can be found in the Methods section. The Appendix provides the mathematical derivation of our propositions, the results of our preliminary experiments, and extended results.

Our conclusions are supported by exploratory experiments on a range of random time-series (Appendix A.A), showing that ISO can obtain performance similar to that of WO in a much shorter time frame for any target behaviour (Fig. B5). Finally, we show the model-agnostic nature of our method by learning 6 locomotion skills with ISO on a test suite of 20 different robots (Appendix C.C).

Results

We compared WO with ISO by learning three different movement skills on six different modular legged robots (see Fig. 2; more technical details in Table 1), resulting in 18 different learning tasks. The three locomotion skills are defined as: (1) gait (→), the absolute planar displacement speed in centimetres per second [cm s−1]; (2 and 3) clockwise and anti-clockwise rotation (↻/↺), the planar rotation speed in radians per second [rad s−1]. First, the performance of ISO and WO is compared with PPO in simulation (parameters are detailed in Table 2); then, ISO and WO are compared in the real world (without PPO, since it performed the worst in the simulation experiments).

Fig. 2: Test suite of six robots.

Left, rendered in simulation; Right, physical instance.

Table 1 Technical description of our six robots
Table 2 Hyperparameters of the optimisation learning algorithms (RevDE, Random Search)

Simulation experiment

In simulation, we optimise each skill for 300 min using ISO and WO, and compare their performance with a PPO baseline18. We use cleanRL's continuous action PPO algorithm for Isaac Gym with default parameters21. For WO, we use a state-of-the-art differential evolution algorithm (RevDE22) in which every trial lasts 1 min (ttrial = 60 s, Ntrials = 300). The trial time is chosen to differentiate between transient behaviour (starting from standing still to moving) and steady-state behaviour of the locomotion skill. Shorter evaluation times emphasise transient behaviour, which can be exploited by the optimiser (e.g., the robot may simply fall sideways to achieve a high rotation speed). Random search is used for ISO, in which we halve the number of trials and double the trial time to increase the effectiveness of bootstrapping, based on our hyperparameter sweep (Appendix C.A; ttrial = 120 s, Ntrials = 150). This results in a 600-fold increase in initial state samples for each trial (using a rolling-window evaluation length of teval = 60 s). It is important to note that, per robot, ISO requires one-third of the time of WO to learn all three skills, thanks to the parallelisation trick (300 min for ISO vs. 900 min for WO). A summary of the experimental parameters is presented in Table 3.
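To make the sample arithmetic concrete, here is a minimal Python sketch (our illustration, not the authors' code) of the bootstrapping count implied by these parameters:

# Rolling-window bootstrapping: every logged CPG state that still has a full
# 60 s evaluation window ahead of it becomes one extra "initial state" sample.
t_trial, t_eval, dt_log = 120.0, 60.0, 0.1    # seconds (Table 3); states logged at 10 Hz

n_samples_per_trial = round((t_trial - t_eval) / dt_log)
print(n_samples_per_trial)                    # 600 samples per 120 s trial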

Table 3 Experimental setup

The learning curves (Fig. 3) show the Mean Best Fitness (MBF, i.e., the highest fitness so far over time, see Algorithm 4) averaged over 30 repetitions for each experiment ± Standard Error (Nrep = 30, ± SE), a metric that allows for the comparison of different learners23. For each morphology, we also report the Time Till Equal Quality (TTEQ) for each skill, which indicates the first time that the MBF of ISO equals the MBF of WO (see Algorithm 5; higher TTEQ is better). The results show an overall better performance for ISO at the start of the learning process, indicating more efficient skill acquisition than WO or PPO. This outperformance is consistent throughout the entire optimisation process for Blokky (→, ↺ and ↻) and Salamander (↺), as reflected by the absence of TTEQ values. In congruence with the preliminary experiments, ISO sharply outperforms WO at the beginning before plateauing for the remainder of the run (Fig. B5). PPO tends to be the most inefficient overall. This is unsurprising, as the number of allowed time steps is several orders of magnitude smaller than in most PPO applications (180,000 vs. 10,000,000 in the original PPO paper18).

Fig. 3: Simulation experiment results.

Initial State optimisation (ISO, blue curves) outperforms both PPO (grey) and Weight optimisation (WO, orange); WO is outperformed for the first 64 min on average. The learning curves show the Mean Best Fitness (Nrep = 30, ± SE) for the gait (→) and anti-/clockwise rotation (↺, ↻) skills of the six different robots in simulation. The vertical dotted lines indicate the first cross-over point after which WO begins to outperform ISO, i.e., the Time Till Equal Quality (TTEQ, higher is better). The specific TTEQ for each skill is presented in each sub-caption (ordered as →, ↺, ↻), while the absence of a cross-over point is denoted with a dash (−).

These findings are unsurprising, since Appendix A reveals the linearity of ISO versus the complex nonlinear coupling of WO. The parallelisation trick makes ISO's learning efficiency independent of the number of skills learned, as shown for up to 100 skills in parallel in Fig. C7. This means that ISO has constant time complexity \(\mathcal{O}(k)\), even though ↺ and ↻ are directly opposing behaviours. Furthermore, even with random CPG-networks, the chances of finding a good initial state are relatively high (we find good performance quickly with random sampling and bootstrapping). Already in the first trials, ISO can identify better sub-samples than WO, thereby outperforming WO in efficiency. This outperformance is maintained for 64 min on average (mean TTEQ). Here, 64 min underestimates the true mean TTEQ, as we omitted instances where WO never crossed ISO (Blokky ↺ and ↻; Salamander ↺).

Comparing the MBF learning curves for each skill separately provides a distorted overview of the efficiency of ISO with respect to WO, because we pit one parallel run of the former against three sequential runs of the latter. In Fig. 4, we show the simulated time needed to reach normalised performance thresholds at [25, 50, 75, 100] percent of the WO performance. Normalisation is based on the final WO MBF values (i.e., performance at 300 min) for each skill of all six robots. For WO, we calculate the relative performance for each skill and show the total simulated time as a stacked bar (WO→ + WO↺ + WO↻, with a maximum time of 900 min). For ISO, we calculate the combined relative performance (that is, the total sum of relative performance over all three skills) as a dark grey bar (ISOΣ, see Algorithm 6). The ISOΣ relative performance provides a fair comparison, as the learning in ISO is parallelised. In the subcaptions of Fig. 4, we also report the specific time needed to obtain 75% relative performance for each robot.

Fig. 4: Normalised performance results.

Initial State optimisation (ISOΣ) obtains 50% of the Weight optimisation performance (WO→, WO↺, WO↻) in less than 5% of the time. The bars indicate the time needed to obtain a normalised performance quality with respect to WO (Nrep = 30). In the sub-captions, we specify the total time needed to attain 75% relative performance in minutes (WO time vs. ISO time).

With respect to normalised performance, ISO is substantially faster than WO. On average, WO takes 270 min to reach 75% performance versus 101 min for ISO, with all morphologies taking less than 200 min for the latter. By the end of the optimisation, four of the six morphologies (Ant, Gecko, Spider, and Stingray) never reached the same level of performance as WO, while the final performances of the Blokky and Salamander morphologies exceeded WO by up to 140%. Furthermore, we found that ISO can already achieve 50% relative performance (with respect to the final WO MBF) within the first 10 min (i.e., 5 trials) for all but the Spider morphology: Ant 9.33 min, Gecko 10 min, Spider 21 min, Blokky 2 min, Salamander 2.66 min, Stingray 6 min. As a result, ISO achieves 50% of the WO performance in less than 5% of the time.

Real-world experiment

In the real-world experiment, we test our ISO method on the physical twins of our simulated robots. Here, we restrict ourselves to only 15 min of total experiment time per robot, including re-evaluation of the acquired skills. Accounting for overhead, we can therefore run five trials of 2 min and re-test each acquired skill for 1 min. Based on the results of the simulation experiments, this very tight budget is sufficient to obtain good performance. Moreover, the 15 min limit is an efficient time for real-world experiments and compares well (as a benchmark time) with other SOTA learning algorithms (12–20 min12,13,24). We compare the performance of the real ISO experiments with the overall best WO controller (henceforth denoted as WO*, the best controller out of 30 runs) for each skill. This method is also known as offline learning with sim-to-real transfer25. The results of the real experiments are presented in Fig. 5. In the real world, WO* performs only similarly to ISO, equating to a 57% performance reduction as a result of the sim-to-real gap. In the end, ISO outperforms WO* in 11 out of 18 skills, based on the overall best run (ranked over all skills within a morphology, shown as striped bars). We want to emphasise that we present the strongest and most robust version of WO, where each skill was optimised in 30 separate runs, for a total of 90 runs per robot (vs. a single run learning all skills in parallel for ISO). Therefore, we argue that real-world ISO is preferable on the grounds that it circumvents the sim-to-real problem, requires fewer computational resources (local cluster vs. Raspberry Pi), and is more scalable with respect to additional skills (parallelisation means constant time complexity \(\mathcal{O}(k)\)). Supplementary video 2 shows the final behaviours obtained using ISO.

Fig. 5: Real-world experiment results.

Initial State optimisation (ISO, blue) on real robots produces similar performance to the sim-to-real transfer of the best of the 30 Weight Optimisation runs (WO*, orange). The bars show the performance of each physical robot on three different skills: gait (→) and anti-/clockwise rotation (↺, ↻). The striped bars indicate the overall best ISO run (based on the total rank over all skills).

ISO can learn all three skills on all six robots in under 15 min. We found only a small average performance drop for ISO on the real robots compared to the ISO simulation experiments. The real-world ISO performance thus corresponds to the predicted ISO performance, indicating that ISO is robust regardless of whether it is run in simulation or in the real world. The overall best real-world ISO runs (based on the total rank over all skills within a morphology) reach 54% of the final WO MBF values (compared to 50% in simulation for the same amount of time). Comparing these runs with WO*, ISO wins eleven out of eighteen learning tasks. The behaviour of these runs is shown in Appendix Fig. D9. If we zoom in closer, we see that ISO has a random bias for rotation in a certain direction within a single run (an effect that is smoothed out when averaging the 30 runs in simulation). The random CPG-weights may induce a bias towards a given rotational skill, as only a limited set of frequencies is available in the CPG-network, defined by the purely imaginary eigenvalues of its canonical form (shown in Appendix I). Slow weight adaptation might resolve this problem by changing the frequency content of the CPG-network.

Optimising in simulation is inevitably vulnerable to the sim-to-real gap26,27,28, as model approximations and simulation errors are exploited and overfitted by the optimiser. When transferring WO* to the real robots, we observed on average a 57% drop in performance compared to simulation, while ISO did not drop significantly in performance (−2%). The specific sim-to-real gap for each morphology and skill is presented in Appendix Table D1. These results show that the WO* controller is badly impacted by the sim-to-real gap, while ISO performance is robust in both conditions. In the end, Fig. 5 shows that ISO achieves comparable performance in the same wall-clock time as sim-to-real WO* (~11 min for ISO vs. ~4 min per skill for sim-to-real WO*). It should be noted that this comparison favours WO* by default, as we took the best controller over 30 repetitions (which took about 6 h in total per robot). Furthermore, in our experiments, we only considered three skills: the wall-clock time of WO increases linearly with the number of skills, while that of ISO remains constant (also demonstrated in Appendix C.B). Real-world ISO is thus competitive with sim-to-real transfer even under the most favourable circumstances for WO.

Preliminary feedback control experiments

Our CPG-controller allows the perturbation of hidden (y-neuron) states that are not directly linked to the signals sent to the servomotors. This provides smooth behavioural changes, as our CPG defines behaviour in terms of velocities. Using this technique, we extend our controller with heading feedback for a targeted locomotion task on our spider robot in simulation (design details are presented in Appendix D.C). The proposed method robustly reached a target at a random location 4 m from the origin in all trials (N = 60) in under 5 min. Supplementary video 3 shows the spider robot during the target-following task.

Discussion

An advantage of CPGs in robotics is that their behaviour is relatively stable and encoded cheaply using few parameters. Instead of learning a whole time-series of motor inputs or a policy over actions at every point in state-space, we only need to describe the CPG-network to realise useful behaviour. Our mathematical analysis (in Appendix A) sheds light on the effects of optimising the different design variables of a CPG-network: network structure, connection strength, and initial state. Based on our findings, we developed a simple and efficient method for learning multiple skills through the initial state.

We provide strong evidence for the potential advantage of learning skills through ISO. The results of the simulation experiments showed that ISO can achieve 50% of the WO performance in less than 5% of the simulated time, highlighting its superior efficiency. The real robot experiments show the advantage of ISO for online multi-skill learning versus offline WO with transfer of the best controllers. We consider task complexity not only in terms of the final behaviour (moving from A to B, or rotating), but also in terms of the difficulty of learning the behaviour (degrees of freedom, amount of prior knowledge, asymmetry in morphology and control, etc.). Compared to other SOTA learning algorithms (12–20 min12,13,24), we perform similarly on complex locomotion tasks. In addition, we achieve this learning performance without the use of a simulator, with limited a priori knowledge, and without a predefined model. Our method outperforms these methods both in terms of speed (the number of skills that it learns in a short amount of time, also seen in Appendix C.B) and generality (model-free control tested on several different morphologies).

As a consequence of their marginal stability (see Appendix I), our SO2-oscillators do not exhibit genuine limit-cycle behaviour. The behaviour we obtain is the result of only local synchronisation of the oscillator frequencies. This is unlike most work on CPGs, where a central frequency strictly defines a limit-cycle phase to coordinate periodic movements in time4,5,6,7,8,9,10,11,12,13,14,15,16. Our method does not require such an internal clock and is capable of learning quasi-periodic movements on much shorter time-scales, while it can also produce non-periodic control. In addition, the formal basis for the strength of our method is detailed in Appendix I. We show that, irrespective of the random weights, the SO2-oscillator network exhibits marginally stable behaviour due to its elliptic fixed point, which proves the existence of (infinitely) many quasi-periodic patterns. Furthermore, we investigated the expressiveness of ISO by estimating 100 randomly generated time-series in parallel (Appendix C.B) and extended the robot learning task to 6 skills in parallel on 20 different robots (Appendix C.C).

The importance of initial states is not limited to the specific application of SO2-oscillators analysed in this work. For any optimisation task (irrespective of the controller/optimisation strategy), the initial state of the robot must lie in a good region of the state space. If this is not the case, then the optimiser must solve two tasks: 1) find the optimal (cyclical) behaviour; 2) move from the bad initial state towards the region of optimal behaviour (a transient pattern). The authors of ref. 29 reported that, in their RL algorithm, the initial state distribution was designed and tuned separately for each task, which took a specialist around two days to complete. The sensitivity of certain dynamical systems to the initial state has already been described as the butterfly effect in chaos theory. Thus, in any system with a recurrent structure (ODEs, LSTMs, reinforcement learning, recurrent neural networks, other forms of CPGs, reservoir computing, etc.), the behaviour of a controller is sensitive to its initial condition. Nevertheless, we found that most work does not consider tuning/optimising the initial state (if reported at all, initial states are often set to 0 or 1).

Learning robot control is a topic of wide interest, with current methods relying on specific robot models and/or massive computing to learn models and behaviours in simulation13. Such learning methods have trouble adapting or learning new behaviours, as integrating/adapting learnt skills often requires sequential learning. This raises problems due to catastrophic forgetting (a problem that arises in lifelong/(deep) reinforcement learning30,31), the curse of dimensionality (e.g., MAP-Elites32,33), and overall time constraints11,29. Parallelisation circumvents catastrophic forgetting and results in constant time complexity \(\mathcal{O}(k)\) for learning multiple skills. In addition, massive parallel computing continues to be plagued by the sim-to-real gap, which we can circumvent by not relying on simulation at all26. Other model-free controllers are several orders of magnitude slower than our ISO method34. Similarly, other CPG-based controllers rely on human expertise to design networks with precise hyper-parameters13,35 and focus on a specific robot morphology14,24. In contrast, our method is truly model-free and straightforward in that it defines a network structure based on morphology with random weights. Finally, the process of storing and selecting patterns shares some resemblance with reservoir computing36 and the use of motor primitives37. However, ISO has the added advantage that it learns multiple skills in parallel and does not suffer from catastrophic forgetting.

The generalisability of ISO shows in the number of robots tested and in its efficacy at learning multiple skills/behaviours in parallel (Appendix C.C). Real-world testing was possible thanks to the time-efficiency of ISO. The design space of modular robots (the set of all possible morphologies) is large and incredibly diverse, insofar as it contains robots of various shapes, with asymmetrical bodies and few or many limbs of potentially varying lengths. Successfully learning multiple skills in parallel on our wide variety of morphologies can be attributed to the model-free nature of the CPGs. Alongside this, we created our CPG-network structure based on simple morphological rules that are not specific to our design and, as such, can be altered without any consequences for the algorithm. This shows the robustness of our method across different types of designs and tasks.

The open-loop skills obtained through ISO can be used for more complex tasks. We demonstrated this through higher-level control in an additional targeted locomotion task (Appendix D.C). In the future, our work can be extended by modulating gait speed and heading direction during forward locomotion. For this, we propose modulating our ODE (Equation A5) by scaling the weight matrix (A) for frequency adaptation and the output state for amplitude control. Different forms of complex feedback control have been implemented in CPG-networks, such as low-level Internal Model Control38 and smooth speed and heading control39; these could be implemented for ISO with feasible adaptations.

Furthermore, our learning framework was originally motivated in the context of a robot evolution system, in which morphologies and controllers evolve in tandem and “newborn” robots undergo individual learning to optimise their inherited brain to their body40. This requires learning methods that are robust and fast, i.e., that work for “all” possible morphologies and need relatively few search steps/evaluations before being deployed in the real world. By leveraging the additional bootstrapped samples in our optimiser, we can employ other data-driven techniques to this end; for example, surrogate models and MAP-Elites/novelty search41,42,43 can help to increase learning speed and encourage diversity in solutions. Parallelised fast learning of modular behaviours can also help in the design of hierarchical control structures such as finite-state machines or behaviour trees41,44,45.

Finally, we want to emphasise the perspective of the CPG-network as a behavioural encoder. CPGs as a behavioural reservoir provide a simple explanation for the multi-functionality of the spinal cord46. The behavioural encoder effectively stores various movement patterns that can be initiated with the appropriate start signal, which corresponds to the initial state. From this perspective, we can speculate that ISO quickly learns network input signals that activate movement skills, whereas WO slowly fine-tunes network dynamics during a lifetime, and evolution optimises network structure across generations. This idea is not too far-fetched, given that newborn babies exhibit predefined locomotion patterns when specific states trigger the walking reflex47. Our work might be the first to shed light on these bio-inspired modes of optimisation in CPGs over different time-scales.

To conclude, learning multiple movement skills is an important challenge in legged robotics, insofar as it is a prerequisite for viable complex behaviour. Considering the specificity of each robot, complex/dynamic environments, recovery after damage, etc., we argue that there is a pressing need for model-free controllers that learn quickly. In this paper, we show that this is achievable using CPG-networks and ISO. ISO is preferable over WO in terms of time-efficiency and with respect to the sim-to-real problem. Future work should aim at improving the search method, further utilising bootstrapping and parallelisation, and implementing more complex, higher-level control.

Methods

This section describes the CPG-network controller, the optimisation methods, and the simulated/physical experimental setups. It does not include any formal analysis of the optimisation of the network itself; for this, we refer the reader to Appendix A.

CPG-network controller

CPGs are pairs of artificial neurons that reciprocally inhibit and excite each other to produce oscillatory behaviour2, and can be mathematically described as a set of Ordinary Differential Equations (ODEs)48. In the extant literature, several different models for CPGs have been proposed4,5,6,7,9,10,11,12,13,14,15,49. Although some details may vary, in essence, the oscillatory behaviour within the CPG is invariably defined in terms of the connectivity between neurons and their initial state. We focus on an inter-connected network of SO2-oscillators50 whose dynamics are described by an anti-symmetric weight-matrix that results in (quasi-)periodic behaviour.

Figure 6a shows a single coupled SO2-oscillator (i.e., CPG) formulated as two paired neurons (an xi- and a yi-neuron). Here, i denotes the specific joint associated with this CPG. The change of a neuron's state is calculated by multiplying the current state of the opposite neuron with a weight (w). The weights \(w_{x_i y_i}\) and \(w_{y_i x_i}\) represent the connection strengths between each pair of neurons. An output neuron oi is coupled to the x-neuron with connection strength \(w_{x_i o_i}\). The output of oi serves as the input command for the servomotor in a specific joint (i.e., servo). In our case, we selected a hyperbolic tangent activation function for our output neuron to normalise the motor input to [−1, 1], which corresponds to the min/max angular positions of the joints. It is important to note that the output neuron only reads out the current state of the x-neuron and, hence, neither influences the CPG dynamics nor has an initial state.

Fig. 6: CPG controller design.

a A single CPG. b The CPG-network schematic of our Spider robot, containing eight CPGs (numbers) with ten pairs of neighbouring connections (blue arrows).

To enable more complex output patterns, symmetrical/bidirectional connections between neighbouring joints (denoted within the neighbour set \(\mathcal{N}\)) are added between the oscillators' x-neurons (see Fig. 6b). We define the set of neighbouring joints of joint i (\(\mathcal{N}_i \subset \mathcal{N}\)) as the set of all joints j ≠ i located within a 2-module distance from i. That is to say, if we represent our robot morphology as a connected acyclic graph (a tree with the top node being the Core module), then all joints j connected within two or fewer edges from joint i are considered neighbouring joints in \(\mathcal{N}_i\). The set \(\mathcal{N}_i\) contains all the neighbouring inter-connections with xi in the CPG-controller; the resulting ODE is shown below.

$$\dot{x}_{i}=w_{y_{i}x_{i}}\,y_{i}+\sum_{j\in \mathcal{N}_{i}}w_{x_{j}x_{i}}\,x_{j}$$
(1)
$$\dot{y}_{i}=w_{x_{i}y_{i}}\,x_{i}$$
(2)

The connections between neighbouring joints j and i are denoted as \(w_{x_j x_i}\). To simplify the search space, the following relations are imposed between the weights: \(w_{x_i y_i}=-w_{y_i x_i}\); \(w_{x_j x_i}=-w_{x_i x_j}\); \(w_{x_i o_i}=1\). The size of the search space for WO and ISO can now be formulated as follows: for a given robot, the number of CPGs equals the number of joints (k), while the number of neighbouring pair connections (\(|\mathcal{N}|\)) follows from the modular tree-based distance. This means that WO has \(N_{weights}=k+|\mathcal{N}|\) weights to optimise, and ISO has \(N_{state}=2k\) initial states (two neurons per CPG oscillator).
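As an illustration, the following minimal Python sketch (our own, not the authors' implementation) integrates Equations (1) and (2) under the weight constraints above with simple Euler steps; the dense coupling matrix is an assumption made for brevity, whereas the paper only connects neighbouring joints:

import numpy as np

def step(x, y, w, W, dt=1/60):
    # Eq. (1): dx_i = w_{y_i x_i} y_i + sum_j w_{x_j x_i} x_j, with w_{y_i x_i} = -w_{x_i y_i}
    dx = -w * y + x @ W
    # Eq. (2): dy_i = w_{x_i y_i} x_i
    dy = w * x
    return x + dt * dx, y + dt * dy

rng = np.random.default_rng(0)
k = 8                                       # number of joints, e.g. the Spider (Fig. 6b)
w = rng.uniform(-1, 1, k)                   # intra-oscillator weights w_{x_i y_i}
W = np.triu(rng.uniform(-1, 1, (k, k)), 1)
W = W - W.T                                 # anti-symmetry: w_{x_j x_i} = -w_{x_i x_j}
x = rng.uniform(-1, 1, k)                   # initial x-neuron states (optimised by ISO)
y = rng.uniform(-1, 1, k)                   # initial y-neuron states
for _ in range(600):                        # 10 s at the 60 Hz update rate
    x, y = step(x, y, w, W)
    motors = np.tanh(x)                     # output neurons o_i in [-1, 1] (w_{x_i o_i} = 1)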

System description

Robots

Our robots consist of modules based on the RoboGen framework51. A modular design space has the distinct advantage of rapid real-world prototyping through pre-fabrication of 3D-printed modules and off-the-shelf components, as also shown in other frameworks52,53,54. We altered the original RoboGen module designs to prevent breakage, improve the servomotors, and allow feedback control. This new design consists of three components: a core component (9 cm × 9 cm × 6 cm, 250 g), brick components (6 cm × 6 cm × 6 cm, 30 g), and active hinges (12 cm × 6 cm × 6 cm, 69 g). The core component is a container with a micro-controller (Raspberry Pi 3B+, https://www.raspberrypi.org/products/raspberry-pi-3-model-b-plus/), a custom-designed bridge to our servomotors (as detailed in ref. 55), and a battery (LiPo, 3.7 V, 7800 mA h). It has four attachment slots on its lateral faces, to which other modules can be attached via screws. The brick component is a smaller cube with similar attachment points on its lateral sides. The active hinge component is a joint actuated by a servomotor (Tower Pro MG996R, detailed specifications at https://www.towerpro.com.tw/product/mg996r/) with an attachment point on both ends.

For our learning experiments, we use six different legged robot designs based on our modular framework. This set can be divided into two groups: (1) bio-inspired, hand-designed robots that resemble animals with distinct locomotion patterns (top row in Fig. 2); (2) automatically designed robots evolved via a novelty-search algorithm (bottom row in Fig. 2). The latter group was evolved to be fast in gait and morphologically diverse23,56. A description of each robot, along with its corresponding CPG-network parameters, is presented in Table 1.

Simulator system description

For our simulation experiments, we use a high-performance GPU-based 3D physics simulator for robot learning named Isaac Gym57 (with an NVIDIA GeForce RTX 3070). Isaac Gym enables parallelisation during optimisation: in WO, this allows us to simulate a whole population of robots in every generation of our evolutionary algorithm, and in ISO, to simulate all our trials at once. We set the simulator update rate to 60 Hz with two sub-steps and use the PhysX physics engine with the temporal Gauss-Seidel solver. The robots are spawned in a neutral position (all joint angles 0 rad) centred on a flat horizontal plane, as illustrated in Fig. 2. Our world-frame coordinate system has its origin on the flat surface, with the x/y-directions spanning the ground and the positive z-direction perpendicular to the surface pointing up, i.e., opposite to the direction of gravity (9.81 m s−2). During simulation, we log both the position {x, y, z} and orientation {ϕx, ϕy, ϕz} of the robot core component each time we update the controller servomotor input (update frequency 10 Hz). At the end of a trial, the logged robot state data (\(\mathcal{D}\)) are used for evaluation.

Real-world system description

For our real-world experiments, we used our own tracking system, which merges the footage from six cameras (Dahua, IPC-HFW1320SP) into a single video-stream on a central laptop. Merging the cameras creates an image of 3964 × 1776 pixels with a resolution of 0.26 mm px−1. The video-stream provides a top-down view of our lab at 30 frames per second. We made a custom tracking script using ArUco markers58 to log both the position and orientation of the Core module every 0.1 s (10 Hz) in the same world-frame coordinates as the simulation experiments (positive z-direction up). From the central laptop, a start command automatically uploads a CPG controller via WiFi. The states of the robot and controller are buffered until the end of a trial, after which fitness values are calculated and all data are saved. The CPG-controller runs locally in Python on a Raspberry Pi, which sends the corresponding PWM servomotor commands with PIGPIO. For stable servomotor behaviour, the CPG-controller commands are provided at 64 Hz.

A trial is initiated by first starting/recording the merged video-stream, followed by sending a start signal from the laptop to the Raspberry Pi. During a trial, all intermediate controller states are logged locally on the Raspberry Pi, while all frames from the video-stream are buffered on the laptop. When the Raspberry Pi finishes, a stop signal is sent to the laptop and the video-stream stops. In this way, the video-stream recording always covers both the beginning and end of the trial. We match the intermediate controller states to the buffered video frames to obtain the CPG initial state samples corresponding to the captured robot states (position and orientation).

Controller optimisation

We assess three basic locomotion skills in this paper (→, ↺ and ↻), each evaluated over 60 s (teval = 60 s). Gait fitness is defined as the absolute planar distance travelled divided by time (f→ [cm s−1], see Equation 3). Rotational fitness is defined as the amount of horizontal rotation (ϕz) in the preferred direction divided by time (f↺/↻ [rad s−1], see Equation 4). For f↺/↻ we use a stable unwrapping algorithm (see Algorithm 3) that accounts for precession projection artefacts to prevent exploitation by our learning algorithm59. We run Nrep repetitions of the same experiment, with a single experiment consisting of several trials that provide samples for the optimiser. In the case of ISO, a single trial contains multiple initial state samples due to the bootstrapping trick. WO and ISO each have their own procedure, which will be explained in turn shortly. A summary of the respective optimiser parameters is shown in Table 2.

Weight optimisation

For Weight optimisation (WO), we represent our CPG-network as a vector of real numbers (\(\mathbb{R}^{N_{weights}}\)) that defines the connection strengths within and between oscillators. Here, the initial state of the CPG is fixed during optimisation to \(\frac{1}{2}\sqrt{2}\) for all neurons. This value induces a \(\frac{1}{2}\pi\) phase-offset between the x- and y-neurons within a single oscillator pair. The WO search-space size for each robot is shown in Table 1 (Nweights). Every time we sample a new vector of weights from the search space, we test it for 60 s (ttrial = 60 s), after which a fitness value is assigned for the corresponding skill.

To optimise our vector of weights, we use an evolutionary algorithm called Reversible Differential Evolution (RevDE, ref. 22). This algorithm iteratively proposes solutions by applying a special set of linearly reversible transformations that produce wide coverage of the search space when selecting new samples22. RevDE works as follows (see Algorithm 1): (1) initialise a population of λ uniformly sampled weight vectors, \(pop \in \mathbb{R}^{\lambda \times N_{weights}}\), in the range U[−1, 1]; (2) obtain fitness values (\(\mathcal{F}\)) over all λ samples; (3) select the top μ samples to obtain genotype vectors \(\mu_1 \in \mathbb{R}^{\mu \times N_{weights}}\); (4) randomly row-shuffle μ1 twice to obtain two additional matrices (μ2, μ3), and create new samples with the following linear relations: λ1 = μ1 + F(μ2 − μ3), λ2 = μ2 + F(μ3 − λ1), λ3 = μ3 + F(λ1 − λ2). Apply uniform crossover with probability CR and repeat from (2). The number of evaluated samples (λ) is three times the top-size μ, which is set to 10 (Table 2). The parameters are set to scaling factor F = 0.5 and cross-over rate CR = 0.9, which have been shown to provide robust performance across different morphologies23.
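One RevDE generation, following steps (2)–(4) above, can be sketched in Python as follows (our illustration; `fitness` stands in for a 60 s trial evaluation, and the crossover-against-parents detail is our assumption):

import numpy as np

def revde_generation(pop, fitness, mu=10, F=0.5, CR=0.9, rng=np.random.default_rng()):
    scores = np.array([fitness(v) for v in pop])            # (2) evaluate all lambda samples
    mu1 = pop[np.argsort(scores)[::-1][:mu]]                # (3) select the top-mu vectors
    mu2, mu3 = rng.permutation(mu1), rng.permutation(mu1)   # (4) two random row shuffles
    lam1 = mu1 + F * (mu2 - mu3)                            # linearly reversible transforms
    lam2 = mu2 + F * (mu3 - lam1)
    lam3 = mu3 + F * (lam1 - lam2)
    parents = np.vstack([mu1, mu2, mu3])
    children = np.vstack([lam1, lam2, lam3])                # lambda = 3 * mu new samples
    keep = rng.random(children.shape) < CR                  # uniform crossover with rate CR
    return np.where(keep, children, parents)

# Usage: pop = rng.uniform(-1, 1, (30, N_weights)), then pop = revde_generation(pop, fitness).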

Algorithm 1

RevDE weight optimisation of CPG-networks

Initial state optimisation

For ISO, we represent our CPG-network as a vector of real numbers (\(\mathbb{R}^{N_{state}}\)) that defines the initial states of all neurons within the network. Here, all weights (both within and between oscillators) are sampled randomly (U[−1, 1]) and remain fixed during optimisation. The ISO search-space size for each robot is shown in Table 1 (Nstate). In every trial, we test a random initial state for 120 s (ttrial = 120 s); to bootstrap the number of samples per trial, we evaluate intermediate CPG-states with a rolling window of teval = 60 s, which results in Nsamples/trial = 600.

At the beginning of an ISO experiment, we fix the weights of the CPG-network to random values in U[−1, 1]. For every trial, we 'start' this random CPG-network from random, uniformly sampled initial states (\(U[-1,1]\in \mathbb{R}^{N_{state}}\)) and log the intermediate states every 0.1 s. This bootstraps the number of samples by ten additional initial states per second. The fitness values are then calculated over the logged sub-samples using a rolling window. We only consider samples with a full 60 s window length, thereby only considering the first 600 initial states, Nsamples/trial = (ttrial − teval)/0.1. Fitness is calculated on each of these 600 sub-samples for all skills in parallel. Per trial, we only store the sub-sample with the highest fitness for each skill (↺/→/↻), i.e., {s↺, s→, s↻} (see Fig. 1B, data-evaluation). Therefore, a single trial provides Nskills fitness values (one for each skill), corresponding to the (intermediate) CPG-state during the trial with the best behaviour for that skill. Subsequently, we repeat this process by loading the same CPG-network (in terms of weights) with a new random initial state, until our (time) budget is spent. The pseudo-code for ISO is shown below (see Algorithm 2). To clarify, the pseudo-code presents a parallelised application (i.e., we TEST all initial states at the same time); in the real world, we TEST each trial sequentially.

Algorithm 2

Random search initial state optimisation of CPG-networks
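As a rough stand-in for Algorithm 2, the following Python sketch (ours; `run_trial` is an assumed helper that integrates the frozen CPG-network from a given initial state for 120 s and returns the 10 Hz logs of CPG states \(\mathcal{S}\) and robot poses \(\mathcal{D}\)) captures the random-search loop with both tricks:

import numpy as np

def iso(run_trial, skills, n_state, n_trials=150, rng=np.random.default_rng()):
    best = {name: (-np.inf, None) for name in skills}   # best (fitness, state) per skill
    for _ in range(n_trials):
        s0 = rng.uniform(-1, 1, n_state)                # random initial state sample
        S, D = run_trial(s0)                            # intermediate states + behaviour
        for t in range(600):                            # only full 60 s windows
            window = D[t:t + 600]                       # rolling window at 10 Hz
            for name, fit in skills.items():            # parallelisation trick
                f = fit(window)
                if f > best[name][0]:
                    best[name] = (f, S[t])              # bootstrapping trick
    return best                                         # best initial state per skill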

Proximal policy optimisation

We use the continuous action-space PPO algorithm of CleanRL21, which is based on the original paper18. Here, we adapt our fitness functions (Equation 3, Equation 4) with teval = 0.1 s for our intermediate rewards. This corresponds to the update frequency of our controller, i.e., our rewards are the (rotational) distance travelled per controller update. Per experiment, we run 300 trials of 60 s (similar to WO and ISO) with a total of 180,000 time steps per skill (540,000 in total). We update our learners in batches of 10 trials that are split into 100 mini-batches over 10 epochs. The remaining parameter values are taken from the original paper.
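For instance, under this setup the intermediate gait reward reduces Equation (3) to the planar displacement between consecutive 10 Hz controller updates (our sketch):

def gait_reward(xy_t, xy_prev, dt=0.1):
    # Eq. (3) with t_eval = 0.1 s: planar distance covered per controller update
    return ((xy_t[0] - xy_prev[0])**2 + (xy_t[1] - xy_prev[1])**2) ** 0.5 / dt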

Experimental setup

We perform two different experiments: the first is a performance comparison between WO and ISO in simulation; in the second, we compare ISO with sim-to-real WO in the real world. The experimental conditions are shown in Table 3.

Simulation experimental setup

For the simulation experiments, we carry out a per-skill comparison of fitness against simulated time in minutes between ISO and WO. It should be noted that the simulated time for WO is reported relative to each skill (as opposed to the cumulative time needed to learn the three skills sequentially). For the purpose of statistical analysis, we repeat each optimisation 30 times (Nrep = 30). We compare the MBF learning curves (Algorithm 4), the ISO out-performance time (Algorithm 6), and the TTEQ (Algorithm 5).

For WO, a separate experiment is run for each skill. An optimisation run for a single skill comprises 300 trials, each 60 s in length. At the end of each trial, a single fitness value is calculated as a sample for the RevDE optimiser. We log the best-fitness-so-far for every trial. In the end, the total simulated time needed to learn three skills is \(t_{total}=300\times 3=900\) min.

For ISO, all three skills are learnt simultaneously in a single run. A single run comprises 150 trials, each 120 s in length. At the end of each trial, rolling-window fitness values are calculated for all skills with the same evaluation time as WO, i.e., 60 s. Within a trial, the best sub-sample fitness represents the trial fitness for each skill. We log the best-fitness-so-far for every trial. In the end, the total time in simulation needed to learn three skills is ttotal = 300 min.

Real-world experimental setup

Our real-world experiments are physical tests of Algorithm 2 on the physical twins. We start by placing a real robot in the centre of a 7 × 4 m arena. We run a CPG-network with random weights and a random initial state from a central laptop for 120 s. A custom-built tracking system using ArUco markers tracks the robot's position and orientation every 0.1 s. At the end of the trial, the tracking data are matched to the controller states at the corresponding times, and rolling fitness values are calculated for all skills. This process is repeated until all trials are complete.

A perfectly tracked run produces 600 samples, but due to noise and image-merging artefacts, this number can sometimes be lower. Although it rarely occurs in practice, missing robot data is filled in with the last previously captured state from the buffer. This operation only affects the fitness values when data is missing at the end of a trial, because → fitness only considers the state at the end, and ↺/↻ fitness sums the angular differences between subsequent samples, which are zero when samples are copied. Furthermore, filling in missing data with the last-seen values underestimates fitness values, insofar as it effectively shortens the evaluation period while still dividing by the same fixed evaluation time teval = 60 s.

For ISO, we test five random initial states per run (Ntrials = 5) with three repetitions (Nrep = 3). After all trials are finished, we determine the best overall initial state sample \(s_0^*\) for each skill and re-test it to obtain the final fitness for each skill in a single run. Including overhead and re-testing of the acquired skills, the total time for the real robot experiments is ttotal ≤ 15 min.

For the sim-to-real experiments, we re-run, on the physical robots, the overall best WO* controller from the 30 runs completed in simulation for each skill. We re-test each WO* controller once. We measure performance in the same fashion as for the real-world ISO setup. Ultimately, we compare the real-world performance of all three ISO runs and WO*.

Metrics

For the simulated environment, we apply 30 repetitions of each learning experiment (30 repetitions × 2 optimisers × 6 robots × 3 skills). We report the MBF over time ± Standard Error (Nrep = 30, ± SE). The best performance over time indicates the highest fitness so far. For each skill, the fitness is calculated as follows:

Gait fitness

We define gait fitness as the absolute planar displacement speed in centimetres per second [cm s−1]. We define our world frame's origin at the ground, where the x- and y-directions span the horizontal ground plane and the z-direction points up. Gait performance is then calculated as follows:

$$f_{\to }(t)=\frac{\sqrt{(x_{t+t_{eval}}-x_{t})^{2}+(y_{t+t_{eval}}-y_{t})^{2}}}{t_{eval}}$$
(3)

where teval represents the evaluation time in seconds (60 s), and (xt, yt) are the planar coordinates at time t.
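A direct Python transcription of Equation (3) (our sketch; `xy` is assumed to hold the 10 Hz log of planar positions in centimetres):

import numpy as np

def gait_fitness(xy, t, t_eval=60.0, dt=0.1):
    i, j = round(t / dt), round((t + t_eval) / dt)           # window start/end indices
    return float(np.linalg.norm(xy[j] - xy[i]) / t_eval)     # [cm/s]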

Rotational fitness

We define rotational fitness as the rotational speed around the z-axis of our world frame in radians per second [rad s−1]. Our skills ↺ and ↻ only define which rotational direction is preferred, which, in turn, changes the sign of f↺/↻. We calculate the rotational fitness as follows:

$$f_{\circlearrowleft }(t)=\frac{1}{t_{eval}}\sum_{\tau =t+dt}^{t+t_{eval}}\mathrm{UNWRAP}_{\phi _{z}}\left(\boldsymbol{\phi }_{\tau },\,\boldsymbol{\phi }_{\tau -dt}\right)$$
(4)

Here, ϕτ denotes the vector of angles [ϕx, ϕy, ϕz] at time τ. We add the differences in the planar angle ϕz between subsequent samples to obtain the total amount of planar rotation achieved. The differences are unwrapped from [0, 2π) with a custom unwrapping algorithm (from ref. 59), which accounts for precession angle artefacts that could otherwise be exploited by optimisers. The pseudo-code is shown below in Algorithm 3.

Algorithm 3

UNWRAP: Calculate relative z-angle between subsequent time-steps
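As a naive stand-in (our sketch; the actual Algorithm 3 additionally corrects the precession projection artefacts from ref. 59, which we omit here), the relative z-angle and Equation (4) can be written as:

import math

def unwrap_z(phi_z_t, phi_z_prev):
    # Shortest signed difference between subsequent z-angles, mapped to (-pi, pi]
    d = (phi_z_t - phi_z_prev) % (2 * math.pi)
    return d - 2 * math.pi if d > math.pi else d

def rotation_fitness(phi_z, t_eval=60.0):
    # Eq. (4): summed unwrapped differences over the window (anti-clockwise positive)
    return sum(unwrap_z(a, b) for a, b in zip(phi_z[1:], phi_z[:-1])) / t_eval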

Mean best fitness

For each skill, we log the fitness values of a single run for each trial, \(\mathcal{F}_{trials}=[f_{skill_{1}},f_{skill_{2}},\ldots,f_{skill_{N_{trials}}}]\in \mathbb{R}^{N_{trials}}\). The repeated experiments (Nrep = 30) are concatenated into \(\mathcal{F}_{runs}\in \mathbb{R}^{N_{rep}\times N_{trials}}\). This allows us to calculate the MBF as follows:

Algorithm 4

Mean Best Fitness
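In NumPy terms, the MBF is a running maximum per run, averaged over repetitions (our sketch of Algorithm 4):

import numpy as np

def mean_best_fitness(F_runs):
    # F_runs: (N_rep, N_trials) array of per-trial fitness values for one skill
    best_so_far = np.maximum.accumulate(F_runs, axis=1)      # highest fitness so far
    mbf = best_so_far.mean(axis=0)
    se = best_so_far.std(axis=0, ddof=1) / np.sqrt(F_runs.shape[0])
    return mbf, se                                           # MBF ± SE per trial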

Time till equal quality

TTEQ measures the first cross-over point of the MBF curves. It indicates how long one optimisation method outperforms another.

Algorithm 5

Time Till Equal Quality
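The cross-over search can be sketched as follows (ours; assuming both MBF curves have been re-sampled onto a common time axis):

import numpy as np

def tteq(mbf_iso, mbf_wo, minutes_per_step=1.0):
    # First time the WO curve catches up with the ISO curve; None maps to '-'
    crossed = np.nonzero(mbf_wo >= mbf_iso)[0]
    return crossed[0] * minutes_per_step if crossed.size else None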

Normalised performance

We normalise performance based on the final MBF values of WO (as it is the best-performing method overall in simulation). While we apply normalisation per skill for WO, for ISO we apply normalisation based on the summed performance over all skills (denoted with Σ). The summed normalisation indicates the average relative performance over time when learning multiple skills in parallel. This allows us to generalise across multiple skills and compare the parallel- and sequential-learning time-efficiency of ISO and WO, respectively.

Algorithm 6

Normalised Performance
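A sketch of the thresholding step (ours, following the description above; averaging the per-skill relative curves for ISOΣ is our reading of the summed normalisation):

import numpy as np

def iso_sigma(mbf_per_skill, wo_finals):
    # Combined relative performance over all skills learned in parallel
    return np.mean([m / f for m, f in zip(mbf_per_skill, wo_finals)], axis=0)

def time_to_threshold(rel_curve, threshold, minutes_per_step):
    # First time a relative-performance curve reaches the threshold (e.g. 0.75)
    hit = np.nonzero(rel_curve >= threshold)[0]
    return hit[0] * minutes_per_step if hit.size else None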

Statistical analysis

All optimisation experiments in simulation were conducted 30 times (Nrep = 30). Within-robot/task comparisons of ISO vs. WO are subject to a standard Student's t-test after passing checks for normality (Shapiro-Wilk test) and equal variances (F-test). Differences are considered significant at p < 0.05, with Bonferroni correction for aggregated results when comparing WO vs. ISO across all skills and morphologies (p < 0.05/m, with m = 18 learning tasks).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.