CN117742387A - Track planning method for hydraulic excavator based on TD3 reinforcement learning algorithm - Google Patents
- Publication number
- CN117742387A (application number CN202311744849.5A)
- Authority
- CN
- China
- Prior art keywords
- bucket
- network
- joint
- actor
- reinforcement learning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The application relates to the technical field of intelligent hydraulic excavators and discloses a trajectory planning method for a hydraulic excavator based on the TD3 reinforcement learning algorithm. With slewing ignored, the excavator working device traces the motion trajectory of the bucket tooth tip through the coupled motion of the boom, arm, and bucket joints during operation; each of these joints is treated as an independent decision-making agent, and the final planned working trajectory is the decision sequence of the three joints. A centralized training-distributed execution scheme is adopted: during training, the environment state together with the joint actions of the three agents is used as the input of the critic evaluation network. The TD3 reinforcement learning algorithm enables autonomous online working-trajectory planning of the excavator without relying on a specific interpolation strategy model, so no interpolation strategy model has to be selected for the target point of the planned path and accurate modeling of complex planning tasks is avoided.
Description
Technical Field
The invention relates to the technical field of intelligent hydraulic excavators, and in particular to a trajectory planning method for a hydraulic excavator based on the TD3 reinforcement learning algorithm.
Background
A trajectory planning method for an intelligent hydraulic excavator enables the machine to automatically plan and execute its motion trajectory through algorithms and supporting technology in order to accomplish specific tasks. Such methods typically involve sensors, computer vision, and control systems that assist the excavator in moving, excavating, or performing other operations within the work area. They allow tasks to be executed more intelligently and efficiently, reduce the need for human intervention, and provide more reliable motion control in complex environments.
At present, conventional optimal trajectory planning methods for intelligent hydraulic excavators rely on interpolation strategies. When excavating in a complex environment, the complex task must be planned and modeled accurately, so in actual use the system response is not fast enough and the execution accuracy of the task varies.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a trajectory planning method for a hydraulic excavator based on the TD3 reinforcement learning algorithm, which solves the prior-art problems that complex tasks must be planned and accurately modeled, that the system response is not fast enough in actual use, and that task execution accuracy varies.
To achieve the above purpose, the invention is realized by the following technical scheme. A trajectory planning method for a hydraulic excavator based on the TD3 reinforcement learning algorithm comprises the following steps:
Step one: with slewing ignored, the excavator working device traces the motion trajectory of the bucket tooth tip through the coupled motion of the boom, arm, and bucket joints during operation; each of the boom, arm, and bucket joints is treated as an independent decision-making agent, and the final planned working trajectory is the decision sequence of the three joints;
Step two: a centralized training-distributed execution scheme is adopted; during training, the environment state together with the joint actions of the three agents is used as the input of the critic evaluation network, so that the output value function contains guidance information for the cooperation of the three joint agents;
Step three: based on the training result of step two, execution is distributed; the agents do not need to communicate their actions with one another, and after sufficient training the boom, arm, and bucket joints operate cooperatively, completing the multi-agent system model; the basic elements of the established multi-agent system model are then defined;
Step four: the point-to-point working task of the excavator is optimized with the TD3 algorithm, the multi-agent system model established in step three is trained, and an Actor-Critic framework is built for each of the boom, arm, and bucket joints.
Preferably, the element definition in step three includes a state-space design: the angles of the boom, arm, and bucket joints are taken as state parameters, the initial joint angles are taken as the input parameters of the strategy network, and the angle value of the next state is computed from the change in joint angle corresponding to the output of the action strategy network, according to the formula:

θ_i = θ_i0 + Δθ_i (i = 2, 3, 4)

where θ_i0 denotes the starting joint angle of the boom, arm, and bucket joints, Δθ_i denotes the change in the corresponding joint angle, and i = 2, 3, 4 correspond to the boom, arm, and bucket joints respectively.
Preferably, the element definition in step three includes an action-space design: the output of the strategy network is defined as the change in joint angle, the actions taken satisfy a_i ~ N(0, 1), a standard normal distribution, and, to reduce the difficulty of decision making, the output information is discretized.
Preferably, the element definition in step three includes a reward-function design. To achieve efficient and stable autonomous operation of the working device within the allowed working range, the reward function of the agent is designed as:

r = r_11 + r_12 + r_13 + r_21 + r_22 + r_23 + r_31 + r_32 + r_33 + r_t

where θ_2, θ_3, θ_4 are the angle values of the boom, arm, and bucket joints in that order; r_11 and r_12 are rewards indicating whether the boom joint motion exceeds the allowed range of motion; r_13 indicates whether the boom joint velocity exceeds its constraint; θ_2min and θ_2max denote the allowed range of motion of the boom joint; v_2 is the velocity constraint of the boom joint; r_21, r_22, r_23 and r_31, r_32, r_33 are the corresponding rewards for whether the arm and bucket joint motions exceed their allowed ranges; d_t is the distance between the current bucket tooth-tip position and the target point; and T is the total working time.
Preferably, θ_2 < θ_2min, θ_2 > θ_2max and the corresponding angular-velocity condition in the reward function are Boolean expressions: the expression evaluates to 0 when the boom joint angle and angular velocity are within the allowed range of motion, and to 1 when they exceed the allowed range.
Preferably, the element definition in step three includes a neural-network design. The Actor and Critic networks in the TD3 algorithm have essentially the same structure: a fully connected network with two hidden layers, each hidden layer containing 512 neurons, with the ReLU function as the activation function, wherein:

the Actor network receives normalized state observations; after the fully connected layers, a Softmax function is set as the last layer of the neural network, converting the output into a probability distribution vector and forming discretized output information;

the Critic network outputs a 1-dimensional state-value function.
Preferably, the element definition in step three includes hyperparameter settings; an Adam network optimizer is used for neural-network training, and the time-optimal trajectory planning procedure based on the TD3 algorithm comprises the following steps:
S1. Initialize the evaluation networks Critic1 and Critic2 and the strategy network Actor, and randomly initialize the network parameters θ_1, θ_2, φ;

S2. Initialize the target networks Critic_T1, Critic_T2, and Actor_T, and set θ'_1 ← θ_1, θ'_2 ← θ_2, φ' ← φ;

S3. Initialize the experience pool β;

S4. For t = 1 to T:

S5. Generate a noisy action a ~ π_φ(s) + ε, ε ~ N(0, σ), compute the reward obtained by executing the action and the new state s' according to the reward function, and put the quadruple (s, a, r, s') into the experience pool β;

S6. Sample N quadruples (s, a, r, s') from the experience pool to train the target networks.
Preferably, step S6 includes:
Actor_T:
Critic_T1 and Critic_T2:
updating the Critic1 and Critic2 network parameters:

if t has accumulated a certain number of steps, do:

updating the parameters φ of the deterministic strategy:

updating by gradient descent:

θ'_i ← τθ_i + (1 − τ)θ'_i

updating the target network parameters: φ' ← τφ + (1 − τ)φ'.
Preferably, in step four the Actor network is used for strategy iteration and update, the Actor_T network is used for experience-pool sampling and update, and the network parameters of the Actor_T network are periodically copied from the Actor network.
Preferably, in step four the Critic1 and Critic2 networks update the Q values used to evaluate the behavior of the current Actor; the Critic_T1 and Critic_T2 networks are responsible for computing the global reward value, and their network parameters are periodically updated from Critic1 and Critic2; finally, by taking high efficiency as the objective, a working path with a large reward value is obtained.
The invention provides a trajectory planning method for a hydraulic excavator based on the TD3 reinforcement learning algorithm. The beneficial effects are as follows:
1. The method does not depend on a specific interpolation strategy model: autonomous online working-trajectory planning of the excavator is achieved with the TD3 reinforcement learning algorithm, and no interpolation strategy model has to be selected for the target point of the planned path, so accurate modeling of complex planning tasks is avoided.

2. Training the excavator working device with the TD3 reinforcement learning algorithm enables fast continuous decision making without solving a control law; by training the boom, arm, and bucket joints with the TD3 algorithm, the working device can make timely decisions and execute the planned result, avoiding the control-law solution required by conventional control methods.

3. Compared with the conventional approach of applying an intelligent optimization algorithm to an interpolation strategy to solve for the optimal trajectory, solving the time-optimal trajectory of the excavator with the TD3 reinforcement learning algorithm effectively reduces the amount of computation and thus helps improve planning efficiency.
Drawings
FIG. 1 is a flow chart of the centralized training-distributed execution of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings; obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the invention without creative effort fall within the protection scope of the invention.
Examples:
Referring to fig. 1, an embodiment of the present invention provides a trajectory planning method for a hydraulic excavator based on the TD3 reinforcement learning algorithm. With slewing ignored, the excavator working device traces the motion trajectory of the bucket tooth tip through the coupled motion of the boom, arm, and bucket joints during operation. Therefore, when planning the time-optimal trajectory of the excavator with the TD3 reinforcement learning algorithm, each of the boom, arm, and bucket joints is treated as an independent decision-making agent, and the final planned working trajectory is the decision sequence of the three joints. Accordingly, the multi-agent system consisting of the boom, arm, and bucket joints adopts a centralized training-distributed execution scheme, as shown in fig. 1, where s is the state, a is the action, and r_1, r_2, ..., r_n are the reward values of the individual agents. During training, the environment state together with the joint actions of the three agents is used as the input of the critic evaluation network, so that the output value function contains guidance information for the cooperation of the three joint agents. During distributed execution, the agents do not need to communicate their actions with one another, and after sufficient training the boom, arm, and bucket joints operate cooperatively.
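For illustration, the following Python sketch shows one way the input of the shared critic evaluation network can be assembled under centralized training, by concatenating the environment state with the joint actions of the boom, arm, and bucket agents. This is a minimal sketch, not code from the patent; the function name, state dimension, and one-dimensional per-joint actions are assumptions.

```python
# Minimal sketch (assumed, not from the patent): build the centralized critic input
# by concatenating the environment state with the three agents' actions.
import numpy as np

def build_critic_input(state, boom_action, arm_action, bucket_action):
    """Concatenate the environment state with the joint action of the three agents."""
    joint_action = np.concatenate([np.atleast_1d(boom_action),
                                   np.atleast_1d(arm_action),
                                   np.atleast_1d(bucket_action)])
    return np.concatenate([np.asarray(state, dtype=np.float32),
                           joint_action.astype(np.float32)])

# Example: an assumed 6-dimensional state (joint angles and velocities) plus one action per joint.
state = np.zeros(6, dtype=np.float32)
critic_input = build_critic_input(state, 0.01, -0.02, 0.005)
print(critic_input.shape)  # (9,)
```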
For the point-to-point working task of the excavator, an Actor-Critic framework is built with the TD3 algorithm for each of the boom, arm, and bucket joints. The Actor network is used for strategy iteration and update; the Actor_T network is used for experience-pool sampling and update, and its network parameters are periodically copied from the Actor network. The Critic1 and Critic2 networks update the Q values used to evaluate the behavior of the current Actor; the Critic_T1 and Critic_T2 networks are responsible for computing the global reward value, and their network parameters are periodically updated from Critic1 and Critic2. Finally, by taking high efficiency as the objective, a working path with a large reward value is obtained.
Before training the multi-agent system with the TD3 algorithm, the basic elements of the model need to be defined.
1. Designing a state space;
The angles of the boom, arm, and bucket joints are taken as state parameters, the initial joint angles are taken as the input parameters of the strategy network, and the angle value of the next state is computed from the change in joint angle corresponding to the output of the action strategy network, according to the following formula:

Formula one: θ_i = θ_i0 + Δθ_i (i = 2, 3, 4)

where θ_i0 denotes the starting joint angle of the boom, arm, and bucket joints, Δθ_i denotes the change in the corresponding joint angle, and i = 2, 3, 4 correspond to the boom, arm, and bucket joints respectively.
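As a concrete illustration of formula one, the short sketch below applies the per-joint angle increments output by the strategy network to the current joint angles. It is a minimal sketch under assumed numeric values; out-of-range motion is left to the reward function described below rather than clipped here.

```python
# Minimal sketch of the state transition in formula one: theta_i = theta_i0 + delta_theta_i
# for i = 2 (boom), 3 (arm), 4 (bucket). All numeric values are illustrative assumptions.
def next_joint_angles(theta0, delta_theta):
    """Return the next-state joint angles given the strategy network's angle increments."""
    return {i: theta0[i] + delta_theta[i] for i in (2, 3, 4)}

theta0 = {2: 0.20, 3: -1.00, 4: -0.50}   # assumed starting joint angles (rad)
delta = {2: 0.05, 3: -0.02, 4: 0.03}     # assumed increments output by the strategy network
print(next_joint_angles(theta0, delta))  # {2: 0.25, 3: -1.02, 4: -0.47}
```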
2. Designing an action space;
The output of the strategy network is defined as the change in joint angle, and the actions taken satisfy a_i ~ N(0, 1), a standard normal distribution. To reduce the difficulty of decision making, the output information is discretized.
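The following sketch illustrates this action-space design: an action is drawn from a standard normal distribution, mapped to an angle increment, and snapped to a discrete bin. The bin count, increment scale, and clipping used for the mapping are assumptions introduced only for illustration, not values stated in the patent.

```python
# Minimal sketch (assumed parameters): sample a_i ~ N(0, 1) and discretize the
# resulting joint-angle increment into a fixed set of bins.
import numpy as np

N_BINS = 11            # assumed number of discrete increments
MAX_INCREMENT = 0.05   # assumed maximum angle change per step (rad)
BINS = np.linspace(-MAX_INCREMENT, MAX_INCREMENT, N_BINS)

def sample_and_discretize(rng):
    a = rng.normal(0.0, 1.0)                        # a_i ~ N(0, 1)
    a = np.clip(a, -3.0, 3.0) / 3.0 * MAX_INCREMENT  # map to the increment range (assumption)
    return float(BINS[np.argmin(np.abs(BINS - a))])  # snap to the nearest bin

rng = np.random.default_rng(0)
print([sample_and_discretize(rng) for _ in range(3)])
```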
3. Designing a reward function;
To achieve efficient and stable autonomous operation of the working device within the allowed working range, the reward function of the agent is designed as follows:

Formula two: piecewise definitions of the individual reward terms r_11 through r_33 and r_t (not reproduced here)

Formula three: r = r_11 + r_12 + r_13 + r_21 + r_22 + r_23 + r_31 + r_32 + r_33 + r_t

where θ_2, θ_3, θ_4 are the angle values of the boom, arm, and bucket joints in that order; r_11 and r_12 are rewards indicating whether the boom joint motion exceeds the allowed range of motion, and r_13 indicates whether the boom joint velocity exceeds its constraint, where θ_2min and θ_2max denote the allowed range of motion of the boom joint and v_2 is the velocity constraint of the boom joint. The conditions θ_2 < θ_2min, θ_2 > θ_2max and the corresponding angular-velocity condition are Boolean expressions: each evaluates to 0 when the boom joint angle or angular velocity is within the allowed range, and to 1 when it exceeds the allowed range. Similarly, r_21, r_22, r_23 and r_31, r_32, r_33 are the corresponding rewards for whether the arm and bucket joint motions exceed their allowed ranges. d_t is the distance between the current bucket tooth-tip position and the target point, and T is the total working time. As formulas two and three show, the reward becomes smaller whenever a joint exceeds its allowed range of motion, and the longer the total motion time and the farther the bucket tooth tip is from the target point, the smaller the reward. Since each joint is an independent agent, the reward value obtained by each agent interacting with the environment is defined to be the same, and the shared Critic evaluation network is affected by the actions of all agents.

Thus the first part of the reward function is the reward obtained when a joint exceeds its allowed range of motion; the second part is the reward given by the total task-completion time and the distance between the current bucket tooth tip and the given target point. From formula two, limiting the joint velocity effectively limits the acceleration values output by the strategy network, i.e. it reduces actions in which the strategy-network output exceeds the allowed range, so that the working device operates stably.
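To make the structure of formulas two and three concrete, the sketch below combines Boolean out-of-range penalties for each joint's angle and velocity with distance and time terms. Because the patent's piecewise term definitions and weights are not reproduced in the text, the penalty weights, limits, and term grouping here are assumptions for illustration only.

```python
# Minimal sketch (assumed weights and limits) of a reward of the form in formula three:
# r = sum of out-of-range penalties for boom/arm/bucket angles and velocities + r_t.
def range_penalty(value, lo, hi, weight=1.0):
    """Return -weight when the Boolean 'out of range' expression is 1, otherwise 0."""
    return -weight * float(value < lo or value > hi)

def reward(thetas, omegas, angle_limits, vel_limits, d_t, total_time,
           w_dist=1.0, w_time=0.1):
    r = 0.0
    for i in (2, 3, 4):                                   # boom, arm, bucket joints
        lo, hi = angle_limits[i]
        r += range_penalty(thetas[i], lo, hi)              # angle out-of-range terms
        r += range_penalty(omegas[i], -vel_limits[i], vel_limits[i])  # velocity term
    r += -w_dist * d_t - w_time * total_time               # r_t: distance and time terms
    return r

# Example with assumed limits and state.
angle_limits = {2: (-0.8, 1.0), 3: (-2.5, -0.4), 4: (-2.8, 0.1)}
vel_limits = {2: 0.5, 3: 0.5, 4: 0.5}
print(reward({2: 0.2, 3: -1.0, 4: -0.5}, {2: 0.1, 3: 0.0, 4: 0.6},
             angle_limits, vel_limits, d_t=0.3, total_time=4.0))
```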
4. Designing a neural network;
The Actor and Critic networks in the TD3 algorithm have essentially the same structure: a fully connected network with two hidden layers, each containing 512 neurons, with the ReLU function as the activation function. The Actor network receives normalized state observations; after the fully connected layers, a Softmax function is set as the last layer of the neural network, converting the output into a probability distribution vector and forming discretized output information. The Critic network outputs a 1-dimensional state-value function.
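A minimal PyTorch sketch of this network design follows: two 512-unit hidden layers with ReLU, a Softmax output on the Actor producing a probability distribution over discretized actions, and a Critic producing a 1-dimensional value. The state, action, and output dimensions are assumptions, as the patent does not state them.

```python
# Minimal sketch (assumed dimensions) of the Actor and Critic networks described above.
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim=6, n_actions=11):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, n_actions), nn.Softmax(dim=-1),  # probability distribution vector
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    def __init__(self, state_dim=6, action_dim=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 1),                              # 1-dimensional value output
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```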
5. Setting super parameters;
For the neural-network training process, an Adam network optimizer is used; the learning rate is 0.00025, the discount factor is 0.99, the clipping factor is 0.2, the batch size is 128, the experience-pool capacity is set to 4000, and the number of initial training samples is 2000.
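The sketch below collects these hyperparameters and shows how the Adam optimizer would be attached to a network. Only the numeric values come from the description above; the dictionary keys and the placeholder network are illustrative assumptions.

```python
# Minimal sketch: training configuration with the values listed above and an Adam optimizer.
import torch
import torch.nn as nn

CONFIG = {
    "learning_rate": 2.5e-4,   # 0.00025
    "discount": 0.99,
    "clip": 0.2,
    "batch_size": 128,
    "buffer_capacity": 4000,
    "warmup_samples": 2000,
}

# Example: attach Adam to a network (placeholder module; in practice the Actor/Critic above).
net = nn.Sequential(nn.Linear(6, 512), nn.ReLU(), nn.Linear(512, 1))
optimizer = torch.optim.Adam(net.parameters(), lr=CONFIG["learning_rate"])
```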
In summary, the time optimal trajectory planning process based on the TD3 algorithm is as follows:
S1. Initialize the evaluation networks Critic1 and Critic2 and the strategy network Actor, and randomly initialize the network parameters θ_1, θ_2, φ;

S2. Initialize the target networks Critic_T1, Critic_T2, and Actor_T, and set θ'_1 ← θ_1, θ'_2 ← θ_2, φ' ← φ;

S3. Initialize the experience pool β;

S4. For t = 1 to T:

S5. Generate a noisy action a ~ π_φ(s) + ε, ε ~ N(0, σ), compute the reward obtained by executing the action and the new state s' according to formulas two and three, and put the quadruple (s, a, r, s') into the experience pool β;

S6. Sample N quadruples (s, a, r, s') from the experience pool to train the target networks, wherein:
Actor_T:
Critic_T1 and Critic_T2:
updating the Critic1 and Critic2 network parameters:

if t has accumulated a certain number of steps, do:

updating the parameters φ of the deterministic strategy:

updating by gradient descent:

θ'_i ← τθ_i + (1 − τ)θ'_i

updating the target network parameters: φ' ← τφ + (1 − τ)φ';

End.
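The update formulas for Actor_T, Critic_T1/Critic_T2, and the deterministic strategy appear only as figures in the original and are not reproduced above. As a hedged illustration of step S6, the sketch below follows the generic TD3 update equations (clipped double-Q target, delayed policy update, soft target update with rate τ); it is an assumption that the patented scheme matches these standard rules, and for simplicity the actor here outputs a continuous increment rather than the discretized distribution described earlier.

```python
# Minimal sketch of a generic TD3 update step (assumed, not the patent's exact formulas).
import torch
import torch.nn.functional as F

def td3_update(actor, actor_t, critics, critics_t, actor_opt, critic_opts,
               batch, step, gamma=0.99, tau=0.005, policy_noise=0.2,
               noise_clip=0.5, policy_delay=2, max_action=1.0):
    s, a, r, s_next = batch  # tensors sampled from the experience pool beta

    with torch.no_grad():
        # Target policy smoothing: add clipped noise to the target action.
        noise = (torch.randn_like(a) * policy_noise).clamp(-noise_clip, noise_clip)
        a_next = (actor_t(s_next) + noise).clamp(-max_action, max_action)
        # Clipped double-Q target using Critic_T1 and Critic_T2.
        q_next = torch.min(critics_t[0](s_next, a_next), critics_t[1](s_next, a_next))
        y = r + gamma * q_next

    # Update Critic1 and Critic2 by gradient descent on the TD error.
    for critic, opt in zip(critics, critic_opts):
        loss = F.mse_loss(critic(s, a), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Delayed update of the deterministic strategy and soft update of the target networks.
    if step % policy_delay == 0:
        actor_loss = -critics[0](s, actor(s)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()
        for net, net_t in [(actor, actor_t), (critics[0], critics_t[0]),
                           (critics[1], critics_t[1])]:
            for p, p_t in zip(net.parameters(), net_t.parameters()):
                p_t.data.mul_(1.0 - tau).add_(tau * p.data)  # theta' <- tau*theta + (1-tau)*theta'
```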
although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (10)
1. A trajectory planning method for a hydraulic excavator based on a TD3 reinforcement learning algorithm, characterized by comprising the following steps:

Step one: with slewing ignored, the excavator working device traces the motion trajectory of the bucket tooth tip through the coupled motion of the boom, arm, and bucket joints during operation; each of the boom, arm, and bucket joints is treated as an independent decision-making agent, and the final planned working trajectory is the decision sequence of the three joints;

Step two: a centralized training-distributed execution scheme is adopted; during training, the environment state together with the joint actions of the three agents is used as the input of the critic evaluation network, so that the output value function contains guidance information for the cooperation of the three joint agents;

Step three: based on the training result of step two, execution is distributed; the agents do not need to communicate their actions with one another, and after sufficient training the boom, arm, and bucket joints operate cooperatively, completing the multi-agent system model; the basic elements of the established multi-agent system model are then defined;

Step four: the point-to-point working task of the excavator is optimized with the TD3 algorithm, the multi-agent system model established in step three is trained, and an Actor-Critic framework is built for each of the boom, arm, and bucket joints.
2. The trajectory planning method for a hydraulic excavator based on a TD3 reinforcement learning algorithm according to claim 1, wherein the element definition in step three includes a state-space design: the angles of the boom, arm, and bucket joints are taken as state parameters, the initial joint angles are taken as the input parameters of the strategy network, and the angle value of the next state is computed from the change in joint angle corresponding to the output of the action strategy network, according to the formula:

θ_i = θ_i0 + Δθ_i (i = 2, 3, 4)

where θ_i0 denotes the starting joint angle of the boom, arm, and bucket joints, Δθ_i denotes the change in the corresponding joint angle, and i = 2, 3, 4 correspond to the boom, arm, and bucket joints respectively.
3. The trajectory planning method for a hydraulic excavator based on a TD3 reinforcement learning algorithm according to claim 1, wherein the element definition in step three includes an action-space design: the output of the strategy network is defined as the change in joint angle, the actions taken satisfy a_i ~ N(0, 1), a standard normal distribution, and, to reduce the difficulty of decision making, the output information is discretized.
4. The trajectory planning method for a hydraulic excavator based on a TD3 reinforcement learning algorithm according to claim 1, wherein the element definition in step three includes a reward-function design; to achieve efficient and stable autonomous operation of the working device within the allowed working range, the reward function of the agent is designed as:

r = r_11 + r_12 + r_13 + r_21 + r_22 + r_23 + r_31 + r_32 + r_33 + r_t

where θ_2, θ_3, θ_4 are the angle values of the boom, arm, and bucket joints in that order; r_11 and r_12 are rewards indicating whether the boom joint motion exceeds the allowed range of motion; r_13 indicates whether the boom joint velocity exceeds its constraint; θ_2min and θ_2max denote the allowed range of motion of the boom joint; v_2 is the velocity constraint of the boom joint; r_21, r_22, r_23 and r_31, r_32, r_33 are the corresponding rewards for whether the arm and bucket joint motions exceed their allowed ranges; d_t is the distance between the current bucket tooth-tip position and the target point; and T is the total working time.
5. The trajectory planning method for a hydraulic excavator based on a TD3 reinforcement learning algorithm according to claim 4, wherein θ_2 < θ_2min, θ_2 > θ_2max and the corresponding angular-velocity condition in the reward function are Boolean expressions: the expression evaluates to 0 when the boom joint angle and angular velocity are within the allowed range of motion, and to 1 when they exceed the allowed range.
6. The trajectory planning method for a hydraulic excavator based on a TD3 reinforcement learning algorithm according to claim 1, wherein the element definition in step three includes a neural-network design; the Actor and Critic networks in the TD3 algorithm have essentially the same structure: a fully connected network with two hidden layers, each hidden layer containing 512 neurons, with the ReLU function as the activation function, wherein:

the Actor network receives normalized state observations; after the fully connected layers, a Softmax function is set as the last layer of the neural network, converting the output into a probability distribution vector and forming discretized output information;

the Critic network outputs a 1-dimensional state-value function.
7. The trajectory planning method for a hydraulic excavator based on a TD3 reinforcement learning algorithm according to claim 4, wherein the element definition in step three includes hyperparameter settings; an Adam network optimizer is used for neural-network training, and the time-optimal trajectory planning procedure based on the TD3 algorithm comprises the following steps:

S1. Initialize the evaluation networks Critic1 and Critic2 and the strategy network Actor, and randomly initialize the network parameters θ_1, θ_2, φ;

S2. Initialize the target networks Critic_T1, Critic_T2, and Actor_T, and set θ'_1 ← θ_1, θ'_2 ← θ_2, φ' ← φ;

S3. Initialize the experience pool β;

S4. For t = 1 to T:

S5. Generate a noisy action a ~ π_φ(s) + ε, ε ~ N(0, σ), compute the reward obtained by executing the action and the new state s' according to the reward function, and put the quadruple (s, a, r, s') into the experience pool β;

S6. Sample N quadruples (s, a, r, s') from the experience pool to train the target networks.
8. The trajectory planning method for a hydraulic excavator based on a TD3 reinforcement learning algorithm according to claim 7, wherein S6 includes:

Actor_T:

Critic_T1 and Critic_T2:

updating the Critic1 and Critic2 network parameters:

if t has accumulated a certain number of steps, do:

updating the parameters φ of the deterministic strategy:

updating by gradient descent:

updating the target network parameters:
9. The trajectory planning method for a hydraulic excavator based on a TD3 reinforcement learning algorithm according to claim 1, wherein in step four the Actor network is used for strategy iteration and update, the Actor_T network is used for experience-pool sampling and update, and the network parameters of the Actor_T network are periodically copied from the Actor network.
10. The trajectory planning method for a hydraulic excavator based on a TD3 reinforcement learning algorithm according to claim 1, wherein in step four the Critic1 and Critic2 networks update the Q values used to evaluate the behavior of the current Actor; the Critic_T1 and Critic_T2 networks are responsible for computing the global reward value, and their network parameters are periodically updated from Critic1 and Critic2; finally, by taking high efficiency as the objective, a working path with a large reward value is obtained.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311744849.5A CN117742387A (en) | 2023-12-18 | 2023-12-18 | Track planning method for hydraulic excavator based on TD3 reinforcement learning algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311744849.5A CN117742387A (en) | 2023-12-18 | 2023-12-18 | Track planning method for hydraulic excavator based on TD3 reinforcement learning algorithm |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117742387A true CN117742387A (en) | 2024-03-22 |
Family
ID=90280775
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311744849.5A Pending CN117742387A (en) | 2023-12-18 | 2023-12-18 | Track planning method for hydraulic excavator based on TD3 reinforcement learning algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117742387A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118238153A (en) * | 2024-05-28 | 2024-06-25 | 华中科技大学 | Autonomous construction method and system for intelligent self-contained bulldozer |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |