
CN117742387A - Track planning method for hydraulic excavator based on TD3 reinforcement learning algorithm - Google Patents

Track planning method for hydraulic excavator based on TD3 reinforcement learning algorithm

Info

Publication number
CN117742387A
Authority
CN
China
Prior art keywords
bucket
network
joint
actor
reinforcement learning
Prior art date
Legal status
Pending
Application number
CN202311744849.5A
Other languages
Chinese (zh)
Inventor
张韵悦
赵志诚
范宇坤
杨凯
武紫东
Current Assignee
Taiyuan Institute of Technology
Original Assignee
Taiyuan Institute of Technology
Priority date
Filing date
Publication date
Application filed by Taiyuan Institute of Technology filed Critical Taiyuan Institute of Technology
Priority to CN202311744849.5A
Publication of CN117742387A


Landscapes

  • Feedback Control In General (AREA)

Abstract

The application relates to the technical field of intelligent hydraulic excavators and discloses a track planning method for a hydraulic excavator based on a TD3 reinforcement learning algorithm. Without considering the rotary (swing) operation, the excavator working device realizes the motion trajectory of the bucket tooth-tip end through the coupled motion of the boom, arm and bucket joints during operation; each of the boom, arm and bucket joints is treated as an independent decision-making agent, and the finally planned operation trajectory is the decision sequence of the three joints. A centralized training-distributed execution mode is adopted, and during training the environment state together with the joint actions of the three agents is used as the input of the evaluator (critic) network. Autonomous online operation trajectory planning of the excavator can thus be realized with the TD3 reinforcement learning algorithm without depending on a specific interpolation strategy model, and no interpolation strategy model needs to be selected according to the target point of the planned path; that is, accurate modeling of complex planning tasks is avoided.

Description

Track planning method for hydraulic excavator based on TD3 reinforcement learning algorithm
Technical Field
The invention relates to the technical field of intelligent hydraulic excavators, in particular to a track planning method for a hydraulic excavator based on a TD3 reinforcement learning algorithm.
Background
The trajectory planning method of an intelligent hydraulic excavator enables the excavator to automatically plan and execute its motion trajectory through algorithms and supporting technologies in order to accomplish specific tasks. Such methods typically involve sensors, computer vision and control systems that assist the excavator in moving, excavating or performing other operations within the work area. They allow tasks to be performed more intelligently and efficiently, reduce the need for human intervention, and provide more reliable motion control in complex environments.
At present, conventional optimal trajectory planning methods for intelligent hydraulic excavators rely on interpolation strategies. When excavating in a complex environment, complex tasks must be planned and accurately modeled, so the system response is not fast enough in actual use and the task execution accuracy is inconsistent.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a track planning method for a hydraulic excavator based on a TD3 reinforcement learning algorithm, which solves the prior-art problems that complex tasks must be planned and accurately modeled, that the system response is not fast enough in actual use, and that the task execution accuracy is inconsistent.
In order to achieve the above purpose, the invention is realized by the following technical scheme: a track planning method for a hydraulic excavator based on a TD3 reinforcement learning algorithm comprises the following steps:
Step one: without considering the rotary operation, the excavator working device realizes the motion trajectory of the bucket tooth-tip end through the coupled motion of the boom, arm and bucket joints during operation; each of the boom, arm and bucket joints is treated as an independent decision-making agent, and the finally planned operation trajectory is the decision sequence of the three joints;
Step two: a centralized training-distributed execution mode is adopted; during training, the environment state together with the joint actions of the three agents is used as the input of the evaluator (critic) network, so that the output evaluation value function contains guidance information for the cooperation of the three joint agents;
Step three: based on the training result of step two, execution is distributed and the agents do not need to communicate their actions with one another; after sufficient training, the boom, arm and bucket joints can operate cooperatively, which completes the establishment of the multi-agent system model; the basic elements of the established multi-agent system model are then defined;
Step four: the point-to-point operation task of the excavator is optimized with the TD3 algorithm, the multi-agent system model established in step three is trained, and an Actor-Critic framework is established for each of the boom, arm and bucket joints.
Preferably, the element definition in step three includes the state space design: the angles of the boom, arm and bucket joints are taken as state parameters, the initial joint angles are taken as the input parameters of the strategy network, and the angle of the next state is calculated from the change in the corresponding joint angle output by the action strategy network, with the specific calculation formula:
θ_i = θ_i0 + Δθ_i (i = 2, 3, 4)
where θ_i0 denotes the joint angle at the starting point and Δθ_i denotes the change in the joint angle; i = 2, 3, 4 correspond in turn to the boom, arm and bucket joints.
Preferably, the element definition in step three includes the action space design: the output of the strategy network is defined as the change in the joint angle, and the action a_i follows the normal distribution N(0, 1); to reduce the difficulty of the decision operation, the output information is discretized.
Preferably, the element definition in step three includes the reward function design; to achieve efficient and stable autonomous operation of the working device within the allowed working range, the reward function of the agent is designed as:
r = r_11 + r_12 + r_13 + r_21 + r_22 + r_23 + r_31 + r_32 + r_33 + r_t
where θ_2, θ_3, θ_4 are, in turn, the angles of the boom, arm and bucket joints; r_11 and r_12 are rewards indicating whether the boom joint motion exceeds the allowable motion range; r_13 indicates whether the boom joint velocity exceeds its constraint, with θ_2min and θ_2max defining the allowable motion range of the boom joint and v_2 the velocity constraint of the boom joint; r_21, r_22, r_23 and r_31, r_32, r_33 are, in turn, the corresponding rewards for whether the arm and bucket joint motions exceed their allowable ranges; D_t is the distance between the current bucket tooth-tip position and the target point; and T is the total operation time.
Preferably, the terms of the reward function involving θ_2, θ_2min, θ_2max and the boom angular velocity are Boolean expressions, i.e. the Boolean expression evaluates to 0 when the boom joint angle and angular velocity are within the allowable motion range, and to 1 when they exceed the allowable range.
Preferably, the element definition in step three includes the neural network design. The Actor and Critic networks in the TD3 algorithm have essentially the same structure: a fully connected network with a double-hidden-layer structure whose hidden layers contain 512 neurons, with the ReLU function as the activation function, wherein:
the Actor network receives the normalized state observation information, and after passing through the full-connection layer, the Actor network sets a Softmax function as the last layer of the neural network, converts an output result into a probability distribution vector, and forms discretized output information;
the Critic network outputs a 1-dimensional state value function.
Preferably, the element definition in step three includes the hyperparameter setting; the Adam optimizer is adopted for the neural network training process, and the time-optimal trajectory planning process based on the TD3 algorithm includes the following steps:
S1: initialize the evaluation networks Critic1 and Critic2 and the strategy network Actor, and randomly initialize the network parameters θ_1, θ_2, φ;
S2: initialize the target networks Critic_T1, Critic_T2 and Actor_T, and set θ'_1 ← θ_1, θ'_2 ← θ_2, φ' ← φ;
S3: initialize the experience pool β;
S4: for t = 1 to T;
S5: generate an action with noise, a ~ π_φ(s) + ε, ε ~ N(0, σ); calculate the reward value obtained by executing the action and the new state s' according to the reward function, and put the tuple (s, a, r, s') into the experience pool β;
S6: sample N tuples (s, a, r, s') from the experience pool for training the target networks.
Preferably, the step S6 includes:
Actor_T:
Critic_T1 and Critic_T2:
updating Critic1 and Critic2 network parameters:
if t has accumulated a certain number of steps, do:
updating the parameters φ of the deterministic strategy:
updating by using a gradient descent method:
θ'_i ← τθ_i + (1 - τ)θ'_i
updating the target network parameters: φ' ← τφ + (1 - τ)φ'.
Preferably, in step four, the Actor network is used for strategy iteration updates, the Actor_T network is used for experience-pool sampling updates, and the network parameters of the Actor_T network are periodically updated from the Actor network.
Preferably, in step four, the Critic1 and Critic2 networks update the Q values used to evaluate the behavior of the current Actor; the Critic_T1 and Critic_T2 networks are responsible for calculating the global reward values, and their network parameters are periodically updated from Critic1 and Critic2; finally, by targeting high efficiency, an operation path with a large reward value is obtained.
The invention provides a track planning method for a hydraulic excavator based on a TD3 reinforcement learning algorithm.
The beneficial effects are as follows:
1. The method does not depend on a specific interpolation strategy model: autonomous online operation trajectory planning of the excavator is realized with the TD3 reinforcement learning algorithm, and no interpolation strategy model needs to be selected according to the target point of the planned path, i.e., accurate modeling of complex planning tasks is avoided.
2. Training the excavator working device with the TD3 reinforcement learning algorithm enables fast continuous decision making without solving a control law: the boom, arm and bucket joints are trained with the TD3 reinforcement learning algorithm so that the working device can make timely decisions and execute the planning result, avoiding the control-law solution required by traditional control methods.
3. Compared with the traditional approach of applying an intelligent optimization algorithm to an interpolation strategy to solve for the optimal trajectory, using the TD3 reinforcement learning algorithm to solve the time-optimal trajectory of the excavator effectively reduces the amount of computation and thus helps improve planning efficiency.
Drawings
FIG. 1 is a flow chart of the centralized training-distributed execution of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Examples:
Referring to FIG. 1, an embodiment of the present invention provides a track planning method for a hydraulic excavator based on a TD3 reinforcement learning algorithm. Without considering the rotary operation, the excavator working device realizes the motion trajectory of the bucket tooth-tip end through the coupled motion of the boom, arm and bucket joints during operation. Therefore, when planning the time-optimal trajectory of the excavator with the TD3 reinforcement learning algorithm, each of the boom, arm and bucket joints is treated as an independent decision-making agent, and the finally planned operation trajectory is the decision sequence of the three joints. Accordingly, in the multi-agent system consisting of the boom, arm and bucket joints, a centralized training-distributed execution mode is adopted; as shown in FIG. 1, s is the state, a is the action, and r_1, r_2, ..., r_n are the reward values of the agents. During training, the environment state together with the joint actions of the three agents is used as the input of the evaluator (critic) network, so that the output evaluation value function contains guidance information for the cooperation of the three joint agents. During distributed execution, the agents do not need to communicate their actions with one another, and after sufficient training the boom, arm and bucket joints operate cooperatively.
For the point-to-point operation task of the excavator, an Actor-Critic framework is established for each of the boom, arm and bucket joints using the TD3 algorithm. The Actor network is used for strategy iteration updates; the Actor_T network is used for experience-pool sampling updates, and its network parameters are periodically updated from the Actor network. The Critic1 and Critic2 networks update the Q values used to evaluate the behavior of the current Actor; the Critic_T1 and Critic_T2 networks are responsible for calculating the global reward values, and their network parameters are periodically updated from Critic1 and Critic2. Finally, by targeting high efficiency, an operation path with a large reward value is obtained.
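As a minimal sketch of the centralized-critic input described above (the dimensions and numeric values are assumptions for illustration, not values taken from the patent), the shared critic sees the environment state concatenated with the joint action of the boom, arm and bucket agents, while each actor only needs its own observation at execution time:

```python
import numpy as np

# Assumed dimensions: 3 joint angles as the state, 1 angle increment per joint agent.
state = np.array([0.8, -1.2, 0.5])                    # theta_2, theta_3, theta_4 (rad)
boom_a, arm_a, bucket_a = 0.02, -0.01, 0.03           # delta-theta decided by each agent

# Centralized training: the shared critic evaluates the state together with the
# joint action of all three agents, so its value function reflects their cooperation.
critic_input = np.concatenate([state, [boom_a, arm_a, bucket_a]])   # shape (6,)

# Distributed execution: each actor acts on the state alone, with no
# inter-agent communication required at run time.
boom_observation = state
```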
Before training the multi-agent system with the TD3 algorithm, the basic elements of the model need to be defined.
1. Designing a state space;
The angles of the boom, arm and bucket joints are taken as state parameters, the initial joint angles are taken as the input parameters of the strategy network, and the angle of the next state is calculated from the change in the corresponding joint angle output by the action strategy network, with the specific calculation formula:
Formula 1: θ_i = θ_i0 + Δθ_i (i = 2, 3, 4)
where θ_i0 denotes the joint angle at the starting point and Δθ_i denotes the change in the joint angle; i = 2, 3, 4 correspond in turn to the boom, arm and bucket joints.
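A minimal sketch of the state transition in Formula 1 (the joint order is that of the patent; the numeric values are assumptions for illustration):

```python
import numpy as np

# Starting joint angles theta_i0 for the boom, arm and bucket joints (rad, assumed values).
theta_start = np.array([0.9, -1.4, 0.6])
# Angle changes delta_theta_i output by the strategy network for one decision step.
delta_theta = np.array([0.02, 0.01, -0.03])

# Formula 1: theta_i = theta_i0 + delta_theta_i, giving the next-state angles
# that are fed back to the strategy network as its new input.
theta_next = theta_start + delta_theta
```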
2. Designing an action space;
The output of the strategy network is defined as the change in the joint angle, and the action a_i follows the normal distribution N(0, 1). To reduce the difficulty of the decision operation, the output information is discretized.
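The patent states only that actions follow N(0, 1) and are discretized; the bin count, amplitude limit and snapping rule below are therefore assumptions used to make the idea concrete:

```python
import numpy as np

def sample_and_discretize(n_joints: int = 3, n_bins: int = 11,
                          max_delta: float = 0.05) -> np.ndarray:
    """Draw one joint-angle increment per agent from N(0, 1) and snap it to a grid."""
    raw = np.random.standard_normal(n_joints)           # a_i ~ N(0, 1)
    scaled = np.clip(raw, -1.0, 1.0) * max_delta        # bound the increment amplitude
    grid = np.linspace(-max_delta, max_delta, n_bins)   # discrete action levels
    idx = np.abs(grid[None, :] - scaled[:, None]).argmin(axis=1)
    return grid[idx]                                    # discretized delta-theta per joint

print(sample_and_discretize())   # e.g. array([ 0.02, -0.01,  0.05])
```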
3. Designing a reward function;
To achieve efficient and stable autonomous operation of the working device within the allowed working range, the reward function of the agent is designed as follows:
Formula 2:
Formula 3: r = r_11 + r_12 + r_13 + r_21 + r_22 + r_23 + r_31 + r_32 + r_33 + r_t
where θ_2, θ_3, θ_4 are, in turn, the angles of the boom, arm and bucket joints; r_11 and r_12 are rewards indicating whether the boom joint motion exceeds the allowable motion range, and r_13 indicates whether the boom joint velocity exceeds its constraint, where θ_2min and θ_2max define the allowable motion range of the boom joint and v_2 is the velocity constraint of the boom joint. The terms involving θ_2, θ_2min, θ_2max and the boom angular velocity are Boolean expressions: they evaluate to 0 when the boom joint angle and angular velocity are within the allowable motion range, and to 1 when they exceed the allowable range. Similarly, r_21, r_22, r_23 and r_31, r_32, r_33 are, in turn, the rewards for whether the arm and bucket joint motions exceed their allowable ranges. D_t is the distance between the current bucket tooth-tip position and the target point, and T is the total operation time. As can be seen from Formulas 2 and 3, the reward becomes smaller when any joint exceeds its allowable motion range, and it also becomes smaller the longer the total motion time and the greater the distance between the bucket tooth tip and the target point. Since each joint is an independent agent, the reward value obtained by each agent when interacting with the environment is defined to be the same, and the shared Critic evaluation network is affected by the actions of all agents.
Thus, the first part of the reward function is the reward obtained when each joint exceeds its allowed range of motion; the second part is the reward given by the total time to complete the task and the distance between the current bucket tooth-tip point and the given target point. From Formula 2, limiting the joint speed effectively limits the acceleration values output by the strategy network, i.e., it reduces actions whose outputs exceed the allowable range, thereby realizing stable operation of the working device.
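A sketch of how Formula 3 can be evaluated in code; the penalty magnitudes and the distance/time weights below are placeholders (the patent does not state them), and only the Boolean structure of the boundary terms follows the text above:

```python
import numpy as np

def reward(theta, theta_min, theta_max, omega, omega_max,
           tip_to_target: float, total_time: float) -> float:
    """Formula 3 sketch: boundary penalties per joint plus a distance/time term.

    theta, omega          -- current joint angles / angular velocities (boom, arm, bucket)
    theta_min, theta_max  -- allowable motion range per joint
    omega_max             -- velocity constraint per joint
    tip_to_target         -- D_t, distance from the bucket tooth tip to the target point
    total_time            -- T, total operation time so far
    """
    r = 0.0
    for th, lo, hi, om, om_lim in zip(theta, theta_min, theta_max, omega, omega_max):
        r += -1.0 * (th < lo)            # r_x1: below the allowable range (Boolean -> 0/1)
        r += -1.0 * (th > hi)            # r_x2: above the allowable range
        r += -1.0 * (abs(om) > om_lim)   # r_x3: velocity constraint violated
    r += -0.1 * tip_to_target - 0.01 * total_time   # r_t: distance and time penalties (assumed weights)
    return r
```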
4. Designing a neural network;
The Actor and Critic networks in the TD3 algorithm have essentially the same structure: a fully connected network with a double-hidden-layer structure whose hidden layers contain 512 neurons, with the ReLU function as the activation function. The Actor network receives the normalized state observation; after the fully connected layers, a Softmax function is set as the last layer of the neural network, converting the output into a probability distribution vector and forming discretized output information. The Critic network outputs a 1-dimensional state value function.
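A minimal PyTorch sketch of these two networks; the layer widths, activation and output heads follow the description above, while the state and action dimensions are assumptions:

```python
import torch
import torch.nn as nn

STATE_DIM, N_ACTION_BINS, JOINT_ACTION_DIM = 3, 11, 3   # assumed dimensions

class Actor(nn.Module):
    """Two 512-unit hidden layers with ReLU; Softmax over discretized actions."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, N_ACTION_BINS), nn.Softmax(dim=-1),
        )

    def forward(self, state):
        # Normalized state observation in, probability distribution over action bins out.
        return self.net(state)

class Critic(nn.Module):
    """Centralized critic: state plus the joint action of the three agents in, scalar value out."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + JOINT_ACTION_DIM, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 1),
        )

    def forward(self, state, joint_action):
        return self.net(torch.cat([state, joint_action], dim=-1))
```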
5. Setting hyperparameters;
For the neural network training process, the Adam optimizer is adopted; the learning rate is 0.00025, the discount rate is 0.99, the clipping rate is 0.2, the batch size is 128, the experience pool capacity is set to 4000, and the initial number of training samples is 2000.
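Collected as a configuration sketch (the values are those stated above; the key names are assumptions):

```python
# Hyperparameters from the embodiment, gathered into one place for reference.
TD3_CONFIG = {
    "optimizer": "Adam",
    "learning_rate": 2.5e-4,   # 0.00025
    "discount": 0.99,          # reward discount rate
    "clip_rate": 0.2,          # "clipping rate" in the text above
    "batch_size": 128,
    "replay_capacity": 4000,   # experience pool size
    "warmup_samples": 2000,    # transitions collected before training starts
}
```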
In summary, the time optimal trajectory planning process based on the TD3 algorithm is as follows:
S1: initialize the evaluation networks Critic1 and Critic2 and the strategy network Actor, and randomly initialize the network parameters θ_1, θ_2, φ;
S2: initialize the target networks Critic_T1, Critic_T2 and Actor_T, and set θ'_1 ← θ_1, θ'_2 ← θ_2, φ' ← φ;
S3: initialize the experience pool β;
S4: for t = 1 to T;
S5: generate an action with noise, a ~ π_φ(s) + ε, ε ~ N(0, σ); calculate the reward value obtained by executing the action and the new state s' according to Formulas 2 and 3, and put the tuple (s, a, r, s') into the experience pool β;
S6: sample N tuples (s, a, r, s') from the experience pool for training the target networks, wherein:
Actor_T:
Critic_T1 and Critic_T2:
updating Critic1 and Critic2 network parameters:
if t has accumulated a certain number of steps, do:
updating the parameters φ of the deterministic strategy:
updating by using a gradient descent method:
θ'_i ← τθ_i + (1 - τ)θ'_i
updating the target network parameters: φ' ← τφ + (1 - τ)φ';
End.
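The target-value and policy-update formulas referenced by the Actor_T and Critic_T labels above are not reproduced in this text, so the sketch below fills them in with the standard TD3 update rules (clipped target-policy smoothing, twin-Q minimum, delayed policy update and soft target updates); it is written for a continuous action vector for clarity, whereas the patent's Actor outputs a discretized probability distribution. Treat it as an assumed reconstruction, not the patent's exact formulas:

```python
import torch
import torch.nn.functional as F

def td3_update(batch, actor, actor_t, critic1, critic2, critic1_t, critic2_t,
               actor_opt, critic_opt, step, gamma=0.99, tau=0.005,
               policy_noise=0.2, noise_clip=0.5, policy_delay=2):
    """One S6-style update; critic_opt is assumed to cover both critics' parameters."""
    s, a, r, s2 = batch   # N sampled tuples (s, a, r, s') from the experience pool

    with torch.no_grad():
        # Actor_T: target action with clipped smoothing noise.
        noise = (torch.randn_like(a) * policy_noise).clamp(-noise_clip, noise_clip)
        a2 = actor_t(s2) + noise
        # Critic_T1 / Critic_T2: take the smaller of the two target Q values.
        y = r + gamma * torch.min(critic1_t(s2, a2), critic2_t(s2, a2))

    # Update Critic1 and Critic2 by gradient descent on the TD error.
    critic_loss = F.mse_loss(critic1(s, a), y) + F.mse_loss(critic2(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Delayed update of the deterministic strategy and soft update of the targets.
    if step % policy_delay == 0:
        actor_loss = -critic1(s, actor(s)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()

        # theta'_i <- tau*theta_i + (1 - tau)*theta'_i  and  phi' <- tau*phi + (1 - tau)*phi'.
        for target, online in ((critic1_t, critic1), (critic2_t, critic2), (actor_t, actor)):
            for p_t, p in zip(target.parameters(), online.parameters()):
                p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```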
although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (10)

1. The track planning method for the hydraulic excavator based on the TD3 reinforcement learning algorithm is characterized by comprising the following steps of:
Step one: without considering the rotary operation, the excavator working device realizes the motion trajectory of the bucket tooth-tip end through the coupled motion of the boom, arm and bucket joints during operation; each of the boom, arm and bucket joints is treated as an independent decision-making agent, and the finally planned operation trajectory is the decision sequence of the three joints;
Step two: a centralized training-distributed execution mode is adopted; during training, the environment state together with the joint actions of the three agents is used as the input of the evaluator (critic) network, so that the output evaluation value function contains guidance information for the cooperation of the three joint agents;
Step three: based on the training result of step two, execution is distributed and the agents do not need to communicate their actions with one another; after sufficient training, the boom, arm and bucket joints can operate cooperatively, which completes the establishment of the multi-agent system model; the basic elements of the established multi-agent system model are then defined;
Step four: the point-to-point operation task of the excavator is optimized with the TD3 algorithm, the multi-agent system model established in step three is trained, and an Actor-Critic framework is established for each of the boom, arm and bucket joints.
2. The track planning method for a hydraulic excavator based on a TD3 reinforcement learning algorithm according to claim 1, wherein the element definition in step three includes the state space design: the angles of the boom, arm and bucket joints are taken as state parameters, the initial joint angles are taken as the input parameters of the strategy network, and the angle of the next state is calculated from the change in the corresponding joint angle output by the action strategy network, with the specific calculation formula:
θ_i = θ_i0 + Δθ_i (i = 2, 3, 4)
where θ_i0 denotes the joint angle at the starting point and Δθ_i denotes the change in the joint angle; i = 2, 3, 4 correspond in turn to the boom, arm and bucket joints.
3. The track planning method for a hydraulic excavator based on a TD3 reinforcement learning algorithm according to claim 1, wherein the element definition in step three includes the action space design: the output of the strategy network is defined as the change in the joint angle, and the action a_i follows the normal distribution N(0, 1); to reduce the difficulty of the decision operation, the output information is discretized.
4. The track planning method for a hydraulic excavator based on a TD3 reinforcement learning algorithm according to claim 1, wherein the element definition in step three includes the reward function design; to achieve efficient and stable autonomous operation of the working device within the allowed working range, the reward function of the agent is designed as:
r = r_11 + r_12 + r_13 + r_21 + r_22 + r_23 + r_31 + r_32 + r_33 + r_t
where θ_2, θ_3, θ_4 are, in turn, the angles of the boom, arm and bucket joints; r_11 and r_12 are rewards indicating whether the boom joint motion exceeds the allowable motion range; r_13 indicates whether the boom joint velocity exceeds its constraint, with θ_2min and θ_2max defining the allowable motion range of the boom joint and v_2 the velocity constraint of the boom joint; r_21, r_22, r_23 and r_31, r_32, r_33 are, in turn, the corresponding rewards for whether the arm and bucket joint motions exceed their allowable ranges; D_t is the distance between the current bucket tooth-tip position and the target point; and T is the total operation time.
5. The track planning method for a hydraulic excavator based on a TD3 reinforcement learning algorithm according to claim 4, wherein the terms of the reward function involving θ_2, θ_2min, θ_2max and the boom angular velocity are Boolean expressions, namely, the Boolean expression evaluates to 0 when the boom joint angle and angular velocity are within the allowable motion range; conversely, it evaluates to 1 when the boom joint angle and angular velocity exceed the allowable range.
6. The track planning method for a hydraulic excavator based on a TD3 reinforcement learning algorithm according to claim 1, wherein the element definition in step three includes the neural network design: the Actor and Critic networks in the TD3 algorithm have essentially the same structure, a fully connected network with a double-hidden-layer structure whose hidden layers contain 512 neurons, with the ReLU function as the activation function, wherein:
the Actor network receives the normalized state observation information, and after passing through the full-connection layer, the Actor network sets a Softmax function as the last layer of the neural network, converts an output result into a probability distribution vector, and forms discretized output information;
the Critic network outputs a 1-dimensional state value function.
7. The track planning method for a hydraulic excavator based on a TD3 reinforcement learning algorithm according to claim 4, wherein the element definition in step three includes the hyperparameter setting; the Adam optimizer is adopted for the neural network training process, and the time-optimal trajectory planning process based on the TD3 algorithm includes the following steps:
S1: initialize the evaluation networks Critic1 and Critic2 and the strategy network Actor, and randomly initialize the network parameters θ_1, θ_2, φ;
S2: initialize the target networks Critic_T1, Critic_T2 and Actor_T, and set θ'_1 ← θ_1, θ'_2 ← θ_2, φ' ← φ;
S3: initialize the experience pool β;
S4: for t = 1 to T;
S5: generate an action with noise, a ~ π_φ(s) + ε, ε ~ N(0, σ); calculate the reward value obtained by executing the action and the new state s' according to the reward function, and put the tuple (s, a, r, s') into the experience pool β;
S6: sample N tuples (s, a, r, s') from the experience pool for training the target networks.
8. The track planning method for a hydraulic excavator based on the TD3 reinforcement learning algorithm of claim 7, wherein S6 includes:
Actor_T:
Critic_T1 and Critic_T2:
updating Critic1 and Critic2 network parameters:
if t has accumulated a certain number of steps, do:
updating the parameters φ of the deterministic strategy:
updating by using a gradient descent method:
updating target network parameters:
9. The track planning method for a hydraulic excavator based on a TD3 reinforcement learning algorithm according to claim 1, wherein, in step four, the Actor network is used for strategy iteration updates, the Actor_T network is used for experience-pool sampling updates, and the network parameters of the Actor_T network are periodically updated from the Actor network.
10. The track planning method for a hydraulic excavator based on a TD3 reinforcement learning algorithm according to claim 1, wherein the Critic1 and Critic2 networks in step four update the Q values used to evaluate the behavior of the current Actor; the Critic_T1 and Critic_T2 networks are responsible for calculating the global reward values, and their network parameters are periodically updated from Critic1 and Critic2; finally, by targeting high efficiency, an operation path with a large reward value is obtained.
CN202311744849.5A 2023-12-18 2023-12-18 Track planning method for hydraulic excavator based on TD3 reinforcement learning algorithm Pending CN117742387A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311744849.5A CN117742387A (en) 2023-12-18 2023-12-18 Track planning method for hydraulic excavator based on TD3 reinforcement learning algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311744849.5A CN117742387A (en) 2023-12-18 2023-12-18 Track planning method for hydraulic excavator based on TD3 reinforcement learning algorithm

Publications (1)

Publication Number Publication Date
CN117742387A true CN117742387A (en) 2024-03-22

Family

ID=90280775

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311744849.5A Pending CN117742387A (en) 2023-12-18 2023-12-18 Track planning method for hydraulic excavator based on TD3 reinforcement learning algorithm

Country Status (1)

Country Link
CN (1) CN117742387A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118238153A * 2024-05-28 2024-06-25 Huazhong University of Science and Technology Autonomous construction method and system for intelligent self-contained bulldozer


Similar Documents

Publication Publication Date Title
Zhang et al. Deep interactive reinforcement learning for path following of autonomous underwater vehicle
WO2022088593A1 (en) Robotic arm control method and device, and human-machine cooperation model training method
CN110682286B (en) Real-time obstacle avoidance method for cooperative robot
CN114047697B (en) Four-foot robot balance inverted pendulum control method based on deep reinforcement learning
CN113510704A (en) Industrial mechanical arm motion planning method based on reinforcement learning algorithm
CN116460860B (en) Model-based robot offline reinforcement learning control method
CN114083539B (en) Mechanical arm anti-interference motion planning method based on multi-agent reinforcement learning
CN112140101A (en) Trajectory planning method, device and system
CN117742387A (en) Track planning method for hydraulic excavator based on TD3 reinforcement learning algorithm
CN113684885B (en) Working machine control method and device and working machine
CN107844460A An underwater multi-robot round-up method based on P-MAXQ
CN115890670A (en) Method for training motion trail of seven-degree-of-freedom redundant mechanical arm based on intensive deep learning
CN117103282A (en) Double-arm robot cooperative motion control method based on MATD3 algorithm
Rastogi et al. Sample-efficient reinforcement learning via difference models
JP2024539851A (en) Coordination of Multiple Robots Using Graph Neural Networks
CN113967909B (en) Direction rewarding-based intelligent control method for mechanical arm
CN114943182A (en) Robot cable shape control method and device based on graph neural network
Borngrund et al. Autonomous navigation of wheel loaders using task decomposition and reinforcement learning
Smart et al. Reinforcement learning for robot control
Prabhu et al. Fuzzy-logic-based reinforcement learning of admittance control for automated robotic manufacturing
CN113759929B (en) Multi-agent path planning method based on reinforcement learning and model predictive control
Revell et al. Sim2real: Issues in transferring autonomous driving model from simulation to real world
CN115922711A (en) Brain-like synchronous tracking control method for double mechanical arms
CN114771783B (en) Control method and system for submarine stratum space robot
Li et al. Manipulator Motion Planning based on Actor-Critic Reinforcement Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination