1. Introduction
Regular inspections and the detection of chemical pipeline leaks in chemical plants are paramount to safeguarding the environment from harmful chemical contamination, maintaining public health and safety, and reducing resource wastage and operational costs [
1]. Conventionally, such inspections are carried out by human personnel. However, relying on personnel for this task is risky, as it increases their workload and exposes them to physical harm in the event of a sudden pipeline leak [
2]. Hence, substituting robots for human personnel in this task is safer and more efficient, as it removes people from harm's way. At present, some scholars have used robots to detect cracks. For example, one study [
3] proposes the use of lidar and depth cameras to detect 3D cracks, building a robotic 3D crack detection system with an accuracy better than 0.1 mm. However, choosing a suitable robot and an appropriate path-planning algorithm for reaching inspection points within chemical plants is a fundamental prerequisite for addressing the challenge of chemical pipeline leak detection.
Hexapod robots have a distinct advantage over wheeled or tracked robots: they offer more degrees of freedom and rely only on discrete footholds for locomotion. Moreover, they demonstrate greater adaptability in complex environments [4], making them ideal for patrolling chemical plants to detect pipeline leaks. However, the intricate layout of pipelines and the multi-dimensional nature of chemical plant spaces make such inspection challenging, as hexapod robots may need to reach inspection points that are not confined to a single plane. The robot must therefore find the crossing point in the shortest crossing time and plan the shortest in-plane path to the inspection points within the multi-dimensional environment, and this remains a pressing problem that requires resolution.
Previous studies have proposed various path-planning algorithms, falling into three categories [
5,
6,
7]: population-based evolutionary algorithms, graph search planning algorithms, and artificial intelligence algorithms. Population-based evolutionary algorithms like the genetic algorithm (GA) and Ant Colony Optimization (ACO) are commonly used. Luo et al. [
8] proposed constructing unevenly distributed pheromones to address the problems of slow convergence and low search efficiency; the method's effectiveness was verified on a simulated grid map. This method focuses mainly on the start and end points of the path optimization problem. Another study [
9] incorporated Bezier curves into the genetic algorithm to improve its exploration capability and enhance the diversity of the generated solutions, verifying the algorithm's feasibility on a small scene with 20 obstacles and 5 special maps.
Population-based evolutionary algorithms are commonly employed for optimizing paths between start and end points. However, in the multi-dimensional environment of chemical plants, there may exist multiple transition points on a given plane. It becomes essential to select these points based on their respective reward weights. Consequently, such algorithms may lack the environmental feedback characteristics required for formulating effective actions.
The A* algorithm is a well-known graph search planning algorithm. Zuo et al. [
10] proposed a method aimed at enhancing a robot’s adaptability to unfamiliar maps by integrating least squares strategy iteration and the A* algorithm. This approach was validated on a simulation map characterized by multiple structures and extensive obstacle intervals, demonstrating the efficacy of the proposed technique. Li et al. [
11] proposed a novel approach that incorporates a bidirectional alternating search strategy to enhance the A* algorithm. This method aims to solve the problems of extended computation time and excessive turning angles inherent in the A* algorithm by minimizing redundant nodes. Its effectiveness was validated on simulated maps with large structural obstacles as well as experimental maps.
However, graph-based planning algorithms rely on known maps for planning, and the structure of the graph significantly impacts the performance of these algorithms. Given the dense network of pipelines within chemical plants, paths generated by such algorithms may not be suitable for hexapod robot locomotion, potentially resulting in collisions with the pipelines. Therefore, there is a need for path-planning algorithms capable of mapping to the actions of hexapod robots to facilitate their control.
The Q-learning algorithm and Deep Q-Network (DQN) algorithm are two well-known techniques in the field of artificial intelligence. Zou et al. [
12] integrated the Q-learning algorithm with the Dynamic Window approach and the Mayfly Optimization algorithm to improve the algorithm’s global search capability and adaptability to complex environments. H. Quan et al. [
13] enhanced the experience replay and sampling mechanisms within the DQN algorithm to augment its adaptability in dynamic environments. These algorithms can learn and adapt to intricate environments without prior knowledge and exhibit self-learning characteristics. As a result, they are well suited for six-legged robots to undertake path planning in multi-dimensional and complex environments.
Recently, some researchers have employed reinforcement learning algorithms for path planning in hexapod robots. X. Wang et al. [
14] utilized the Soft-max function to enhance the DQN algorithm, thereby improving its adaptability for generating actions for hexapod robots. Additionally, L. Wang et al. [
15] applied the Soft Actor-Critic algorithm for path planning in hexapod robots in outdoor environments, enhancing their robustness in unstructured environments.
One study [
16] revealed that a reinforcement learning algorithm, DDQN, has demonstrated superior performance in decision-making problems with vast state spaces and challenging multi-dimensional environments that are notoriously difficult to model. DDQN uses two neural networks to overcome the overestimation problem, which greatly enhances its effectiveness. As a result, DDQN has been applied in numerous domains, including robot control and path planning, encompassing tasks such as legged robot control [
17], unmanned aerial vehicle logistics and delivery [
18], and path planning in unknown environments [
19].
This study proposes an improved DDQN algorithm, integrated with a hexapod robot, for planning paths to leak detection points in chemical plant pipelines. We used the global search capability and parallel computing ability of the PSO algorithm to overcome the limitations of traditional path-planning algorithms, replacing the conventional random search strategy in the DDQN algorithm and thereby increasing DDQN's ability to acquire environmental data quickly. In addition, we replaced the traditional BP neural network architecture with a four-input convolutional neural network structure to construct DDQN, which reduces the number of learning parameters. To preprocess the map data, we abstracted the multi-dimensional environment into multiple grid-based planes; these planes are grouped into current- and next-layer pairs, dividing the overall path planning into multiple local planning problems. We evaluated the effectiveness of the proposed method using subjective and objective evaluation metrics and comparisons with traditional approaches, and then verified its feasibility through simulations and experiments.
2. PG-DDQN
This section delves into the utilization of the Particle Swarm Optimization (PSO)-guided Double Deep Q-Network (PG-DDQN) algorithm for multi-dimensional path planning with hexapod robots.
Section 2.1 provides a comprehensive overview of a hexapod robot’s motion planning.
Section 2.2 presents an in-depth explanation of the PG-DDQN algorithm.
2.1. Hexapod Robot’s Motion Planning
This study focuses on the fundamental tripod gait. In
Section 3, the simulation and validation primarily revolve around this gait pattern. In the tripod gait, three of the robot's legs remain grounded while the other three are lifted. The tripod gait involves three distinct phases that are crucial to the robot's motion.
Initial phase: At motion onset, the robot establishes stability by supporting itself on three legs, forming a triangular base in contact with the ground. The other three legs are in a swing phase, preparing to make ground contact for the subsequent step.
Transition phase: During this phase, the robot maneuvers the swing legs toward their target positions to establish contact with the ground. Three legs from the previous support base remain grounded, forming a new triangular support base. This phase involves motion control and action planning to guarantee seamless transitions and appropriate step lengths.
Final phase: In the last phase, the previously supporting legs are raised and transitioned into the swing phase, preparing for the next movement. Again, the other three legs simultaneously create a stable triangular support base, maintaining the robot’s stability. Hexapod robots achieve continuous and uninterrupted motion by executing these three phases recurrently.
The hexapod robot movement consists of forward and backward motions, lateral translations, and turning maneuvers. The illustration in
Figure 1 shows movement details of this type of robot, with the supporting legs moving in the direction of the robot’s center of mass (COM) and the swing legs being lifted and moving in the opposite direction. The robot’s continuous motion is maintained by alternating the roles of the supporting and swinging legs.
2.2. Improved DDQN Algorithm
The PG-DDQN algorithm is presented as follows. The environment is processed, and the state space, action space, reward structure, and convolutional neural network are constructed. The PSO algorithm replaces the traditional random search strategy to speed up the accumulation of useful experience data in the experience pool.
2.2.1. DDQN Algorithm
The DDQN algorithm is a variant of the DQN algorithm [
20] that leverages the practice of experience replay to enhance overall stability and performance, alleviating any potential correlations within the training data [
21]. The parameters of the actual (Q-local) action value network are trained during interaction with the environment, while the weights of the target (Q-target) action value network are updated only periodically to reduce the influence of overestimation on the training outcomes. The target value formulation for DDQN is as follows:

$$y_t = r_{t+1} + \gamma\, Q\!\left(s_{t+1}, \arg\max_{a} Q\!\left(s_{t+1}, a; \theta\right); \theta^{-}\right),$$

where $\gamma$ represents the discount factor; $\theta$ denotes the network structure weights in the Q-local neural network; $\theta^{-}$ corresponds to the network structure weights in the Q-target neural network; $s_{t+1}$ symbolizes the environmental state perceived at the next time step; $a$ signifies the selected action; and $r_{t+1}$ represents the reward value.
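To make the target computation concrete, the following minimal PyTorch sketch forms the DDQN target for a sampled minibatch; the function and tensor names (q_local, q_target, rewards, next_states, dones) are illustrative assumptions rather than the authors' implementation.

```python
import torch

def ddqn_target(q_local, q_target, rewards, next_states, dones, gamma):
    """Compute y = r + gamma * Q_target(s', argmax_a Q_local(s', a)) for a minibatch."""
    with torch.no_grad():
        # The Q-local network selects the greedy action in the next state ...
        next_actions = q_local(next_states).argmax(dim=1, keepdim=True)
        # ... and the Q-target network evaluates that action, reducing overestimation.
        next_q = q_target(next_states).gather(1, next_actions).squeeze(1)
        # Terminal transitions receive only the immediate reward.
        return rewards + gamma * next_q * (1.0 - dones)
```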
Figure 2 depicts a schematic illustration of the DDQN model; PG-DDQN builds on this DDQN algorithm.
2.2.2. State Space
The robot must comprehensively understand its surroundings to navigate its environment effectively. A hexapod robot gathers extensive global information about the scene layout, comprising dynamic obstacles, static obstacles, navigable waypoints, its own position, the target location, and its operational status. Four matrices are formulated from the current coordinate information extracted from the environment to represent the prevailing environmental state: the Path Matrix (PM), Obstacle Matrix (OM), Location Matrix (LM), and Target Matrix (TM). The environment is discretized into a grid, and each entry of a matrix corresponds to a grid cell and adopts one of two values: 0, signifying false, and 1, signifying true. When a planar section is extracted from the multi-dimensional environment, the pertinent information is captured in a grid map, producing the four distinct matrices.
Figure 3 showcases how the hexapod robot’s motion model uses these matrices to observe the current state. Then, the state vector of the hexapod,
$s_t$, can be expressed by concatenating the aforementioned four matrices as follows:

$$s_t = \left[\mathrm{PM}, \mathrm{OM}, \mathrm{LM}, \mathrm{TM}\right].$$
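As an illustration of this state encoding, the sketch below builds the four 40 × 40 binary matrices from hypothetical lists of grid coordinates and stacks them into the state tensor; the helper name and its arguments are assumptions made for illustration.

```python
import numpy as np

GRID = 40  # each layer plane is abstracted as a 40 x 40 grid

def build_state(path_cells, obstacle_cells, robot_cell, target_cells):
    """Stack the Path, Obstacle, Location, and Target matrices into a 4 x 40 x 40 state."""
    pm = np.zeros((GRID, GRID), dtype=np.float32)  # Path Matrix (navigable waypoints)
    om = np.zeros_like(pm)                         # Obstacle Matrix (static and dynamic obstacles)
    lm = np.zeros_like(pm)                         # Location Matrix (robot position)
    tm = np.zeros_like(pm)                         # Target Matrix (transition/end points)
    for cell in path_cells:
        pm[cell] = 1.0
    for cell in obstacle_cells:
        om[cell] = 1.0
    lm[robot_cell] = 1.0
    for cell in target_cells:
        tm[cell] = 1.0
    # s_t = [PM, OM, LM, TM], consumed by the CNN as a 4-channel image
    return np.stack([pm, om, lm, tm], axis=0)
```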
2.2.3. Action Space
In the subsequent experimental section, we demonstrate a hexapod robot’s specific movements in a grid-based environment, where its motion is broken down into four categories, i.e., up, down, left, and right, facilitating effective path planning. Each action is associated with a numerical code, devoid of inherent meaning but serving as an identifier due to the numerical nature of neural network outputs (
Figure 4).
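A minimal sketch of this action encoding follows; the particular integer codes are arbitrary identifiers, as noted above.

```python
# Numerical action codes paired with the four grid moves
ACTIONS = {0: "up", 1: "down", 2: "left", 3: "right"}
# Grid displacement (d_row, d_col) associated with each code
MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}

def apply_action(cell, action):
    """Return the grid cell reached after executing an action code."""
    dr, dc = MOVES[action]
    return cell[0] + dr, cell[1] + dc
```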
2.2.4. Reward Structure
Rewards are critical to providing feedback regarding the hexapod robot’s efficacy and performance according to the algorithm, as they help evaluate the actions’ worth and effectiveness.
The instantaneous reward value comprises five parts: $r_{\text{bound}}$ indicates the penalty for the hexapod moving outside the given map, $r_{\text{obs}}$ indicates the penalty for collision with obstacles, $r_{\text{cp}}$ indicates the reward value for selecting the target (crossing) point, $r_{\text{end}}$ indicates a one-time reward for reaching the end point, and $r_{\text{pso}}$ denotes the rewards/penalties accumulated by the PSO guidance.
We define $r_{\text{bound}}$ as follows:

$$r_{\text{bound}} = -120\, n_b,$$

where $n_b$ represents the number of times the hexapod robot touches the boundary during its movement. A robot that leaves the map multiple times must be given a larger penalty to avoid errors in state evaluation, so the penalty value is set here to 120.
$r_{\text{obs}}$ gives the penalty for collisions with obstacles and is computed from two indicator terms: $I_d$, the indicator function showing whether the designated hexapod position collides with a dynamic obstacle, and $I_s$, the indicator showing whether the hexapod collides with a static obstacle.
This study introduces two types of crossing points, i.e., slope-type and trapezoid-type points, with varying reward values. $r_{\text{cp}}$ represents the crossing point reward value: the trapezoid-type transition points have reward values of 25, 30, 35, and 40, while the slope-type transition points have reward values of 30, 35, and 40. $r_{\text{end}}$ is associated with a reward value of 40.
A distance-based shaping component is defined in terms of $d_t$, the distance between the current state and the transition or end point, and $d_{t-2}$, the distance between the state from two time steps ago and the transition or end point. Finally, $r_{\text{pso}}$ depends on $N_p$, the number of particles in the current state, and $d_g$, the distance between the current position and the target point.
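The sketch below assembles the instantaneous reward from the five components described above; because the exact collision penalty and the PSO shaping formula are not fully specified here, the collision penalty magnitude and the externally supplied pso_term are placeholders (assumptions).

```python
def instantaneous_reward(boundary_touches, hit_dynamic, hit_static,
                         crossing_reward, reached_end, pso_term,
                         collision_penalty=-50):  # assumption: the collision penalty value is not given in the text
    """Sum the five components r_bound + r_obs + r_cp + r_end + r_pso."""
    r_bound = -120 * boundary_touches              # penalty of 120 per boundary violation
    r_obs = collision_penalty * (int(hit_dynamic) + int(hit_static))
    r_cp = crossing_reward                         # 25-40 (trapezoid) or 30-40 (slope) when a crossing point is selected, else 0
    r_end = 40 if reached_end else 0               # one-time reward for reaching the end point
    return r_bound + r_obs + r_cp + r_end + pso_term
```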
2.2.5. Neural Networks
Our input data comprise two-dimensional plane data with a layout size of 40 × 40, representing the current state coordinates transformed into the four observation matrices. The neural network architecture is divided into three main components based on these input data.
Figure 5 illustrates the architecture of the neural network.
The first step involves feeding the matrices into the neural network's first convolutional layer, which comprises 16 convolutional kernels, each with a size of 5 × 5 and a stride of 1 (stride = 1); the input data are padded with an additional two-unit border of zero values (padding = 2). Through the convolution operation, we obtain 16 feature matrices of size 40 × 40, which pass through a subsequent max-pooling layer with a pooling window size of 2 × 2 and a stride of 2 × 2, resulting in a 16 × 20 × 20 feature matrix.
In the second step, the 16 × 20 × 20 feature matrix serves as input data, fed into another convolutional layer. This layer comprises 32 convolutional kernels, each with a size of 5 × 5 with the same stride and padding settings as in the first step. The convolution operation produces 32 feature matrices of size 20 × 20, passing through another max-pooling layer with a pooling window size of 2 × 2 and a stride of 2 × 2, resulting in a 32 × 10 × 10 feature matrix.
In the third step, the features obtained from the first two components are flattened into 3200 input nodes and passed through three fully connected layers, ultimately yielding 1 × 4 one-dimensional output data. The neural network parameters are adjusted in the simulation and real-life environments, but the neural network structure does not change.
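The architecture described above can be sketched in PyTorch as follows; the activation functions and the hidden sizes of the three fully connected layers are not stated in the text and are therefore assumptions.

```python
import torch
import torch.nn as nn

class PGDDQNNet(nn.Module):
    """4-channel 40x40 input -> two conv/pool stages -> three FC layers -> 4 action values."""
    def __init__(self, n_actions=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=5, stride=1, padding=2),   # -> 16 x 40 x 40
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),                  # -> 16 x 20 x 20
            nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),  # -> 32 x 20 x 20
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),                  # -> 32 x 10 x 10
        )
        self.head = nn.Sequential(
            nn.Linear(32 * 10 * 10, 512), nn.ReLU(),                # 3200 flattened inputs; hidden sizes assumed
            nn.Linear(512, 128), nn.ReLU(),
            nn.Linear(128, n_actions),                              # 1 x 4 output of Q-values
        )

    def forward(self, x):
        x = self.features(x)
        return self.head(torch.flatten(x, start_dim=1))
```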
2.2.6. PG-DDQN Algorithm
In the DDQN algorithm, random exploration is vital for enriching the decisions made by the algorithm, as the neural network consistently makes the same decisions in response to repetitive observations. Random exploration also facilitates diversity and allows robots to make different decisions, thereby increasing the exploration capacity. However, this approach can be inefficient for path planning in hexapod robots, especially in multi-dimensional environments mapped to grid-based maps.
As depicted in
Figure 6, Panel A displays the hexapod robot approaching a transition point with static obstacles above and below. In this scenario, only the action “Right” is the correct choice, as the actions “Up” and “Down” are incorrect. The action “Left” leads the hexapod robot away from the transition point. In this case, random exploration has a 50% chance of making an incorrect decision, a 25% chance of making a suboptimal decision, and only a 25% chance of making the correct decision.
As shown in Panel B in
Figure 6, the hexapod robot must enter the transition point. Random exploration in this scenario has a 25% chance of making an illegal decision, a 50% chance of making a suboptimal decision, and only a 25% chance of making the correct decision. This scenario also has a 25% chance of terminating the motion model in Panel B.
We introduce the PSO algorithm to replace random exploration, making exploration more efficient and purposeful.
The PSO algorithm proceeds as follows:
Step 1: Data preparation and initialization. First, the obstacle data are constructed using the accumulated experience in the knowledge pool. Next, the positions and velocities of the particles in the swarm are initialized: for each particle $i$, its position vector $\mathbf{x}_i$ and velocity vector $\mathbf{v}_i$ are randomly initialized, and its individual best position $\mathbf{p}_i$ is set initially to its current position $\mathbf{x}_i$. The global optimal position $\mathbf{g}$ is then initialized; the selection of $\mathbf{g}$ requires a combination of angle and distance screening with respect to the final target point. First, the angle and distance between each particle and the final target point are computed as follows:

$$d_i = \sqrt{(x_g - x_i)^2 + (y_g - y_i)^2}, \qquad \theta_i = \arctan\frac{y_g - y_i}{x_g - x_i},$$

where $(x_i, y_i)$ represents the initialized particle coordinates in the frame-selected map data, $(x_g, y_g)$ represents the global map target coordinates, $d_i$ represents the distance, and $\theta_i$ represents the angle between the particle and the target. The candidates for $\mathbf{g}$ are drawn from $S$, the collection of particle coordinates contained within the random map boundary with the particle coordinates as the center of the circle, and are screened by the exploration rate $\varepsilon$, which changes with iteration time.
Step 2: Updating of particle velocity and position. For each particle $i$, the velocity vector is updated using the following equation:

$$\mathbf{v}_i(t+1) = \omega\, \mathbf{v}_i(t) + c_1\, \mathrm{rand}()\,\big(\mathbf{p}_i - \mathbf{x}_i(t)\big) + c_2\, \mathrm{rand}()\,\big(\mathbf{g} - \mathbf{x}_i(t)\big),$$

where $\mathbf{v}_i(t)$ represents the velocity of particle $i$ at time $t$, $\omega$ is the inertia weight, $c_1$ and $c_2$ are the learning factors, and $\mathrm{rand}()$ is a function that generates random numbers. Subsequently, the position vector is updated as follows:

$$\mathbf{x}_i(t+1) = \mathbf{x}_i(t) + \mathbf{v}_i(t+1).$$
Step 3: Updating of the best positions. The fitness of the new position, $\mathbf{x}_i(t+1)$, is evaluated, and the individual best position, $\mathbf{p}_i$, is updated if the fitness of $\mathbf{x}_i(t+1)$ is superior to that of $\mathbf{p}_i$; the global best position $\mathbf{g}$ is updated in the same manner.
Step 4: Iteration. The process of updating particle velocities and positions (Step 2) and updating the global best position (Step 3) is repeated until the stopping condition is met, which may involve reaching the maximum number of iterations or finding a satisfactory solution.
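A compact sketch of one PSO iteration under the update rules above is given below; the fitness function, inertia weight, and learning factor values are placeholders (assumptions), and lower fitness is assumed to be better.

```python
import numpy as np

def pso_step(positions, velocities, pbest, gbest, fitness, w=0.7, c1=1.5, c2=1.5):
    """One PSO iteration: update velocities and positions, then refresh the personal and global bests."""
    n, dim = positions.shape
    r1, r2 = np.random.rand(n, dim), np.random.rand(n, dim)
    # Step 2: velocity and position updates
    velocities = w * velocities + c1 * r1 * (pbest - positions) + c2 * r2 * (gbest - positions)
    positions = positions + velocities
    # Step 3: keep the better of the old and new positions per particle, then refresh the global best
    improved = fitness(positions) < fitness(pbest)
    pbest = np.where(improved[:, None], positions, pbest)
    gbest = pbest[np.argmin(fitness(pbest))]
    return positions, velocities, pbest, gbest

# Step 4 corresponds to calling pso_step repeatedly until the stopping condition is met.
```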
The action selection strategy is as follows:

$$a_t = \begin{cases} a_{\mathrm{PSO}}, & \xi < \varepsilon, \\ \arg\max_{a} Q(s_t, a; \theta), & \text{otherwise}, \end{cases}$$

where $\xi$ is a randomly generated number between 0 and 1 with 8 decimal places in each round, $a_{\mathrm{PSO}}$ represents the action chosen by PSO, the value of $\varepsilon$ lies between 0 and 1, $\arg\max_{a} Q(s_t, a; \theta)$ represents the next action selected using the Q-local neural network, and $y_t$ represents the actual value of the optimization target network. Finally, after a certain number of training steps is reached, the Q-target neural network is updated.
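The PSO-guided action selection can be sketched as follows; how the PSO action is produced is abstracted behind the pso_action argument, which is an assumption for illustration.

```python
import random
import torch

def select_action(state, q_local, pso_action, epsilon):
    """Choose the PSO-guided action with probability epsilon, otherwise the greedy Q-local action."""
    xi = round(random.random(), 8)   # random number in [0, 1) with 8 decimal places
    if xi < epsilon:
        return pso_action            # purposeful exploration guided by PSO instead of a random action
    with torch.no_grad():
        return int(q_local(state.unsqueeze(0)).argmax(dim=1).item())
```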
Figure 7 depicts the PG-DDQN training structure on the abstract plane grid motion model, indicating that the training process consists of several steps.
3. Experiment and Discussion
This section presents an overview of the experiments conducted and the conclusions drawn from them. We performed all experiments on a computer system equipped with an Intel(R) Core(TM) i9-13900HX CPU operating at 2.20 GHz alongside an NVIDIA GeForce RTX 4060 Laptop GPU. To address multi-dimensional environments, an abstraction approach was employed that decomposes the environment into multiple pairs of current and next layers. Each transition from the current layer plane to the next is treated as an instance of localized path planning, and these instances are composed to produce a comprehensive path plan for the multi-dimensional environment. To evaluate the algorithm's performance, we adopted this localized path-planning approach involving transitions from the current to the next layer plane; because the local planning can be replicated according to the dimensionality of the environment, evaluating the algorithm in this localized context suffices to assess its applicability.
To facilitate the experimentation, we constructed the current and next layer planes as two-dimensional grids using Python's Tkinter library. The PG-DDQN algorithm, which is the core of our research, was implemented in Python using the PyTorch framework. The simulations were conducted on Ubuntu 20.04, using Gazebo 11 as the simulation platform.
3.1. Algorithm Comparison
We employed a random obstacle generation method to generate the maps, yielding three sets of current and next layer planes; detailed information is presented in
Figure 8 and
Table 1. The left image shows the map of the current plane, while the right image depicts that of the next plane. Each layer plane has dimensions of 40 × 40, incorporating four dynamic obstacles. The exploration rates for each map are specified in
Table 1. The total number of blocks, encompassing dynamic obstacles, static obstacles, transition points, and target points, denotes the abstract size of the maps.
We employed two distinct performance metrics to measure the algorithm’s overall performance.
- (1)
Overall reward value
The reward value for each training epoch is calculated based on the reward acquisition rules, with the maximum achievable score for the current layer and for the next layer's final destination point being valued at 40 points. Each plane is equipped with transition points: the current layer plane contains seven transition points, while the next layer contains the destination points. The hexapod robot obtains the maximum total reward value by completing the task without penalty; the total reward is computed as the sum of the instantaneous rewards accumulated over the epoch.
- (2)
Time to complete the path
The timing of the robot's journey is initiated at its starting position, and the total time taken to traverse the entire planned path is measured. We apply a threshold range to the total reward value to mitigate the prolonged evaluation of erroneous paths: the path completion time is recorded as zero when the reward value falls outside this threshold range.
In PG-DDQN training, “training” refers to updating the local neural network to achieve convergence.
Table 2 displays the actual statistics of training steps. During training, the algorithm compares the total reward value after each completed epoch to a constraint. If the total reward falls below −5000, the current epoch test is stopped, and a reward value of −5000 is recorded based on this constraint.
These hyperparameters are set according to the reference [
22]. The learning rate is initially set to 0.0025 and is reduced to 95% of its current value each time 10% of the training is completed. The discount factor determines the extent to which future rewards influence current decisions: a discount factor close to 1 makes the agent consider long-term rewards, whereas a discount factor close to 0 makes it focus on immediate rewards. The PG-DDQN parameter settings are shown in
Table 3.
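The learning-rate schedule described above can be sketched as follows, assuming multiplicative decay (each reduction takes 95% of the current value) and an Adam optimizer, which is an assumption, since the optimizer is not named here.

```python
import torch

def make_optimizer_and_scheduler(network, total_epochs):
    """Adam optimizer with an initial lr of 0.0025, decayed to 95% after every 10% of training."""
    optimizer = torch.optim.Adam(network.parameters(), lr=0.0025)
    decay_every = max(1, total_epochs // 10)  # every 10% of the training epochs
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=decay_every, gamma=0.95)
    return optimizer, scheduler
```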
The performance of the PG-DDQN algorithm on the three random maps is shown in
Figure 9.
In
Figure 9, the left plots show the total reward, while the right plots show the total movement time. The x-axis represents the number of training epochs, while the y-axis represents the corresponding value obtained from testing after each epoch. The figure shows that the entire neural network requires approximately 1850 epochs to achieve convergence for each map.
The results from
Figure 10 indicate that the enhanced PG-DDQN algorithm outperforms the standard DQN and DDQN algorithms regarding overall total reward across all three random maps. Additionally, the standard DQN and DDQN algorithms exhibit weaker convergence when abstracting three-dimensional environments into two-dimensional maps than the improved PG-DDQN algorithm. By comparing map 1 and map 2 in
Figure 9, it can be seen that higher exploration rates and increased PSO guidance lead to smaller amplitude fluctuations during the training phase and greater stability during the oscillation period.
Table 4 presents the differences in computational time among the algorithms. Additionally, comparing completion times reveals that the improved algorithm achieves shorter path-planning times than the other two methods.
Figure 11 shows each algorithm's success rate on each random map, i.e., the rate of completing path planning without touching obstacles. The success rate of the improved algorithm grows faster and reaches higher values than those of the standard DDQN and DQN algorithms.
3.2. Simulation Verification
The simulation validation was conducted using the Gazebo simulation platform, setting up the simulation environment as depicted in
Figure 12. The dual-layer simulation environment represents the current and subsequent layers for simulation validation.
Figure 13 shows the simulation model, with the left side representing the mobile obstacle model, implemented using a wheeled robot as a reference for the dynamic obstacles. The right side of
Figure 13 shows the hexapod robot model.
In the Gazebo environment, the PG-DDQN algorithm was integrated with the ROS system to validate performance after training, where path planning is depicted in
Figure 14. In this validation process, the algorithm can handle step-type points for plane traversal. The black box represents the hexapod robot, while the red lines illustrate the paths planned by the algorithm. The orange squares in the current layer indicate the entrances to incline-type point 1, and those in the subsequent layer, the exits from incline-type point 1. Similarly, the blue squares in the current layer mark the entrances to incline-type point 2, while in the following layer, they represent the exits from incline-type point 2. Furthermore, the green squares in the current layer indicate the entrances to step-type points, while in the next layer, they indicate the exits from the step-type points.
Figure 15 and
Figure 16 illustrate the planned trajectory of the hexapod robot during traversal in the current and following layers, respectively.
During traversal in the current layer, the hexapod robot can avoid obstacles encountered during movement, as demonstrated in
Figure 17. In Step 1, the hexapod robot detects a moving obstacle and promptly halts its movement. In Step 2, it assesses the trajectory of the moving obstacle to determine the best course of action. Once the path is evaluated, the hexapod robot resumes its movement, adhering to the planned path in Step 3. The robot makes necessary adjustments to its body path from Steps 4 to 7 to effectively navigate the obstacle. The path-planning process within the moving obstacle’s range is successfully executed from Steps 9 to 12.
In this section, through the algorithm comparison and simulation validation, we draw the following conclusions: the improved PG-DDQN algorithm outperforms the other two reinforcement learning algorithms in convergence stability and in achieving optimal rewards when planning paths in abstract local environments within multi-dimensional scenarios. Moreover, applying the PG-DDQN algorithm, in conjunction with sensor integration, to path planning for hexapod robots demonstrates that the improved approach can efficiently navigate multi-dimensional terrains within the simulated environment.
3.3. Experimental Verification
The experimental verification used the laboratory indoor environment to simulate the chemical plant scene. The roads in the scene are complex; the site environment is shown in
Figure 18.
Due to site restrictions, the site area was divided in order to build a multi-dimensional environment, and different goals to be reached were set at different stages, as shown in
Figure 19.
In the current layer, the hexapod robot successfully avoided interference target points during movement, as shown in
Figure 20.
The hexapod then successfully avoided dynamic obstacles. The movement process is shown in
Figure 21.
Finally, the target point position of the current layer was reached, as shown in
Figure 22.
At this time, the hexapod robot entered the next layer and moved to the target point position of the next layer, as shown in
Figure 23.
In this section, we present the conclusions drawn from physical testing: as can be seen from the verification figures above, the robot did not hit stationary obstacles while moving; when encountering interference points, it changed its route to avoid them, and it also avoided dynamic obstacles. This demonstrates the feasibility of the improved approach for navigating multi-dimensional terrain.
4. Conclusions
This study addresses the challenges encountered in path planning for detecting chemical plant pipeline leaks, presenting a hexapod robot path-planning method based on PG-DDQN. The proposed method exhibits the following characteristics:
(1) To address problems in multi-dimensional environments, we preprocess the multi-dimensional environment to partition the path-planning data, abstracting the environment into multiple grid-based planes in order to optimize data processing and enhance planning efficiency.
(2) The PSO algorithm is implemented as a guiding strategy to supplant the random selection strategy in action selection, augmenting the accumulation of meaningful data within the experience pool and enhancing the algorithm’s convergence rate.
(3) The introduction of the four-input convolutional neural network also accelerates the processing of high-dimensional information.
The proposed method effectively addresses the challenges hexapod robots face in path planning for detecting chemical plant pipeline leaks, such as dynamic environmental scenes, complex road structures, and multi-dimensional survey points. Compared with the standard DQN and DDQN algorithms, this approach demonstrates faster convergence and reduces path-planning time by at least 33.94%. The feasibility of the algorithm was verified through Gazebo simulations and experiments, and the path-planning method has important practical value for pipeline leak detection in chemical plants. There is still room for improvement in the algorithm's completion rate in environments with more obstacles. Follow-up work will continue to improve the algorithm's adaptability to larger maps and combine it with the hexapod robot's gait to achieve path planning in rugged terrain. This study also provides new ideas for the future multi-robot coordinated handling of pipeline repair and chemical material processing.