CN117140527B - Mechanical arm control method and system based on deep reinforcement learning algorithm - Google Patents
- Publication number
- CN117140527B (application CN202311258556.6A)
- Authority
- CN
- China
- Prior art keywords
- mechanical arm
- network
- reinforcement learning
- target
- action
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B25—HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
- B25J—MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- B25J9/00—Programme-controlled manipulators
- B25J9/16—Programme controls
- B25J9/1628—Programme controls characterised by the control loop
- B25J9/163—Programme controls characterised by the control loop learning, adaptive, model based, rule based expert control
Landscapes
- Engineering & Computer Science (AREA)
- Robotics (AREA)
- Mechanical Engineering (AREA)
- Feedback Control In General (AREA)
- Manipulator (AREA)
Abstract
The invention discloses a mechanical arm control method and system based on a deep reinforcement learning algorithm. The method comprises the following steps: obtaining sensing data of a six-axis mechanical arm; based on the reinforcement learning DDPG algorithm, setting an experience playback pool, an agent noise exploration function and a reward function, and performing interactive optimization training on the sensing data to obtain an optimized reinforcement learning DDPG algorithm; and deploying the optimized reinforcement learning DDPG algorithm to the six-axis mechanical arm for motion guidance. By constructing the reinforcement learning DDPG algorithm with an experience playback pool, an agent noise exploration function and a reward function, the invention realizes more accurate target tracking. The mechanical arm control method and system based on the deep reinforcement learning algorithm can be widely applied in the technical field of mechanical arm control.
Description
Technical Field
The invention relates to the technical field of mechanical arm control, in particular to a mechanical arm control method and system based on a deep reinforcement learning algorithm.
Background
With the year-on-year increase in human aerospace activity, space debris, defunct spacecraft and other non-cooperative space targets that need to be captured, repaired or removed are becoming increasingly numerous. Methods for capturing non-cooperative space targets include rigid schemes and flexible schemes such as rope nets and flying claws. In the rigid capture scheme, a capture device is mounted at the end of a mechanical arm, and by controlling the mechanical arm the target can be captured accurately, so that the servicing satellite and the target satellite are docked very firmly. Owing to advantages such as light weight, high flexibility and strong operability, the space mechanical arm has broad application prospects in capture tasks for non-cooperative space targets.
Non-cooperative targets have characteristics such as uncertain motion trajectories and a tendency to escape, which makes them difficult to capture. In mechanical arm control, conventional control methods such as PID control and computed-torque control have several disadvantages. First, they generally depend on an accurate system model and parameters, while in practical applications the dynamic model of the mechanical arm system may be difficult to model accurately, which can make conventional methods unstable or inaccurate in actual operation. Second, their adaptability is insufficient: conventional control methods generally struggle with complex characteristics of the mechanical arm system such as nonlinearity, time variation and coupling; they may require frequent parameter adjustment for complex tasks and cannot share experience between different tasks. Third, parameter tuning is difficult: conventional control methods require manual parameter adjustment, which can be time-consuming and tedious, and under variable system parameters or continuously changing tasks the maintenance and optimization of a traditional controller may become difficult. Fourth, conventional control methods generalize poorly: they are generally unable to learn from a small amount of data with strong generalization, and when the mechanical arm system faces an unknown environment or task, traditional methods may fail to adapt and learn a new strategy.
Disclosure of Invention
In order to solve the technical problems, the invention aims to provide a mechanical arm control method and a mechanical arm control system based on a deep reinforcement learning algorithm, which are used for constructing the reinforcement learning DDPG algorithm by setting an experience playback pool, an agent noise exploration function and a reward function so as to realize more accurate target tracking.
The first technical scheme adopted by the invention is as follows: a mechanical arm control method based on a deep reinforcement learning algorithm comprises the following steps:
Obtaining sensing data of a six-axis mechanical arm;
Based on the reinforcement learning DDPG algorithm, setting an experience playback pool, an agent noise exploration function and a reward function, and performing interactive optimization training on the perception data to obtain an optimized reinforcement learning DDPG algorithm;
and deploying the optimized reinforcement learning DDPG algorithm to the six-axis mechanical arm for motion guidance.
Further, the step of obtaining the sensing data of the six-axis mechanical arm specifically includes:
Torque input ports and joint angle and angular velocity sensors are arranged for all joints of the six-axis mechanical arm, and the state of the agent is defined as the six joint angles and joint angular velocities of the six-axis mechanical arm, the distance errors and velocity errors between the arm end and the target in the x, y and z directions respectively, and the current control torque of each joint.
Further, the reinforcement learning DDPG algorithm includes a primary network and a target network, wherein:
The main network comprises an Actor network and a Critic network, and the target network comprises a Target Actor network and a Target Critic network;
the Actor network is an action network: it takes the agent state s as input and outputs the deterministic action, i.e. the six-degree-of-freedom control torque of the mechanical arm;
The Critic network is an evaluation network used to calculate the Q value, and the value of the action given by the Actor network is evaluated through the Q value;
The network parameters of the Actor network and the Critic network are initialized, and the network parameters of the Target Actor network and the Target Critic network are initialized from them; the Target Actor network and the Target Critic network are used to stabilize the training of the Actor network and the Critic network.
Further, the step of setting an experience playback pool specifically includes:
The agent stores the obtained experience data (s_t, a_t, r_t, s_{t+1}, done) in the experience playback pool and samples it in batches when updating the main network parameters and the target network parameters, wherein s_t represents the state of the agent at time t, a_t represents the action taken by the agent at time t, r_t represents the reward obtained after the action is taken at time t, s_{t+1} represents the state reached at time t+1 after the agent takes the action, and done represents whether the episode task is completed.
Further, the step of agent noise exploration specifically includes:
Noise is added to the action output by the Actor network; when the network is updated and samples are drawn from the memory bank, the mean square error d between the action a' of the sample in the current state and the action a output by the original, not-yet-updated policy network is calculated, compared with the set error threshold d_th, and the standard deviation s of the Gaussian noise is adjusted and updated according to the comparison result.
Further, the expression for the adjustment and update of the standard deviation s of the Gaussian noise is:
in the above expression, s represents the standard deviation of the Gaussian noise, k represents an adaptive adjustment coefficient, d represents the mean square error value, and d_th represents the set error threshold.
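The expression itself is not reproduced in the text above. A piecewise rule consistent with the variables defined here, in the spirit of adaptive parameter-space noise, would be the following sketch (an assumed reconstruction with adjustment coefficient k > 1, not the patent's literal formula):

```latex
s \leftarrow
\begin{cases}
k \cdot s, & d \le d_{th} \\
s / k,     & d > d_{th}
\end{cases}
```

That is, while the deviation of the perturbed actions stays within the error threshold the exploration noise is widened, and once d exceeds d_th the noise is narrowed again.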
Further, the reward functions include a distance reward function, a speed reward function, an action-smoothness reward function, and an energy loss reward function, wherein:
the distance reward function indicates that when the distance error between the end of the agent and the target is smaller than a threshold, the agent obtains a reward that is positively correlated with how close it is to the target, and otherwise obtains a reward negatively correlated with the error; the expression is as follows:
in the above formula, r_1 represents the distance reward function, k_1 represents an adjustable weight coefficient, and e_{ri} represents the position distance error of the end of the mechanical arm in the x, y and z directions respectively;
the speed reward function is used to guide the mechanical arm to move at the desired trajectory speed, and the expression is as follows:
in the above formula, r_2 represents the speed reward function, k_2 represents an adjustable weight coefficient, and e_{vi} represents the velocity error of the end of the mechanical arm in the x, y and z directions respectively;
The action-smoothness reward function represents a control torque penalty, and the expression is as follows:
In the above formula, r_3 represents the action-smoothness reward function, k_3 represents an adjustable weight coefficient, and torque_i represents the control torque of each axis of the mechanical arm;
the energy loss reward function represents a torque change penalty, and the expression is as follows:
where r_4 represents the energy loss reward function, k_4 represents an adjustable weight coefficient, and the two torque terms represent the control torque of each axis of the mechanical arm at the current time step and at the previous time step, respectively.
Further, the step of performing interactive optimization training on the perception data to obtain an optimized reinforcement learning DDPG algorithm specifically includes:
The sensing data is input into an Actor network according to the current state of the six-axis mechanical arm, and control moment corresponding to the sensing data is obtained;
Adding noise to the control moment and taking the control moment as the actual action of the six-axis mechanical arm;
The six-axis mechanical arm executes corresponding actual actions, and the state of the six-axis mechanical arm is updated to obtain the observation state at the next moment;
inputting the current state of the six-axis mechanical arm, rewards obtained by executing corresponding actual actions and the actual actions of the six-axis mechanical arm into a Critic network to obtain a Q estimated value;
inputting the observation state at the next moment into a Target Actor network to obtain the action output of the six-axis mechanical arm at the next moment;
The observation state at the next moment and the action output of the six-axis mechanical arm at the next moment are input into the Target Critic network to obtain the Q estimated value and reward at the next moment;
Determining a first interaction optimization completion mark of the six-axis mechanical arm according to the Q estimation value and the Q estimation value at the next moment;
Storing the current state of the six-axis mechanical arm, the actual action of the six-axis mechanical arm, rewards, the estimated value of the Q at the next moment and the first interaction optimization completion mark into an experience playback pool;
and repeating the interactive optimization steps of the six-axis mechanical arm until the experience playback pool is fully stored, to obtain the optimized reinforcement learning DDPG algorithm.
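For reference, in the loop above the Q target that links the main networks and the target networks takes the standard DDPG form sketched below (the notation is assumed here; the text describes the quantities but does not print the formula). Q' and μ' denote the Target Critic and Target Actor networks, γ the discount factor, and done_t the interaction completion flag:

```latex
y_t = r_t + \gamma\,(1 - done_t)\,Q'\!\bigl(s_{t+1}, \mu'(s_{t+1})\bigr),
\qquad
L(w) = \frac{1}{N}\sum_{t}\bigl(Q(s_t, a_t; w) - y_t\bigr)^{2}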
The second technical scheme adopted by the invention is as follows: a robotic arm control system based on a deep reinforcement learning algorithm, comprising:
the acquisition module is used for acquiring the sensing data of the six-axis mechanical arm;
The interaction module is used for setting an experience playback pool, an agent noise exploration function and a reward function based on the reinforcement learning DDPG algorithm, carrying out interaction optimization training on the perception data, and obtaining an optimized reinforcement learning DDPG algorithm;
the deployment module is used for deploying the optimized reinforcement learning DDPG algorithm to the six-axis mechanical arm to conduct motion guidance.
The method and the system have the following beneficial effects: by adopting a deep reinforcement learning algorithm and learning and optimizing the control strategy from the sensed data, the mechanical arm can be adjusted in real time according to the dynamic information of the target; during training the mechanical arm interacts with the target, and the strategy is continuously optimized through a trial-and-error and reward mechanism; for the dynamic target tracking task, adaptive strategy parameters and an improved experience playback are designed on the basic DDPG framework, and adaptive noise is introduced to improve the exploration of the DDPG algorithm, thereby realizing more accurate target tracking.
Drawings
FIG. 1 is a flow chart of steps of a method for controlling a mechanical arm based on a deep reinforcement learning algorithm according to an embodiment of the present invention;
FIG. 2 is a block diagram of a control system for a robotic arm based on a deep reinforcement learning algorithm according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a six-axis mechanical arm model according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a target space straight line track according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a robot end effector tracking a linear trajectory in accordance with an embodiment of the present invention;
FIG. 6 is a schematic diagram of the coordinate error of a robot end effector tracking a linear trajectory in accordance with an embodiment of the present invention;
FIG. 7 is a schematic diagram of a circular track of a target space in accordance with an embodiment of the present invention;
FIG. 8 is a schematic diagram of a robotic end effector tracking a circular trajectory in accordance with an embodiment of the present invention;
FIG. 9 is a schematic diagram of the coordinate error of a robotic end effector tracking a circular trajectory in accordance with an embodiment of the present invention.
Detailed Description
The invention will now be described in further detail with reference to the drawings and to specific examples. The step numbers in the following embodiments are set for convenience of illustration only, and the order between the steps is not limited in any way, and the execution order of the steps in the embodiments may be adaptively adjusted according to the understanding of those skilled in the art.
Compared with other traditional control methods, the control method based on deep reinforcement learning does not require an accurate system model and can learn a control strategy directly from actual interaction, which makes it suitable for complex nonlinear and time-varying systems. The capture strategy is highly adaptive and predictive: the deep reinforcement learning method can adapt to different tasks and environments, experience can be shared among different tasks through learning in interaction, and the capture strategy can be learned autonomously in actual tasks without manually adjusting parameters, reducing the burden of manual tuning. The method is therefore well suited to the capture problem of a multi-degree-of-freedom mechanical arm for non-cooperative space targets.
Based on this principle, a deep reinforcement learning algorithm is provided for the dynamic trajectory tracking problem of a non-cooperative target in the task of capturing a non-cooperative target with a space manipulator, and its effectiveness is verified in simulation. For the dynamic target tracking task, adaptive strategy parameters and an improved experience playback are designed on the basic DDPG framework, trajectory tracking of the target is realized within an allowable error range, the effectiveness of the reinforcement learning algorithm is verified in a six-degree-of-freedom space manipulator simulation environment, and the algorithm is shown to have high tracking accuracy in the trajectory tracking task.
In this embodiment, a simulation environment in which the mechanical arm tracks a dynamic trajectory target is built in Simscape of Matlab, and the simulation is carried out with Matlab combined with Simulink. For convenience of the subsequent experiments, the parameters of a real six-axis i5 mechanical arm, shown in fig. 3, are taken as the simulation arm parameters. Joint angle sensors measure the angle and angular velocity of each joint in real time during the motion of the mechanical arm, and these are used as state feedback information for deep reinforcement learning. The simulation experiment only considers the end effector tracking and reaching the target position; the position error and velocity error between the end effector and the target point are taken as the main indexes for evaluating the control strategy, and control of the collision force generated by contact between the end effector and the target is not considered.
Referring to fig. 1, the invention provides a mechanical arm control method based on a deep reinforcement learning algorithm, which comprises the following steps:
S1, acquiring sensing data of a six-axis mechanical arm;
Specifically, the URDF model of the mechanical arm is imported into Simulink, dynamics, kinematics and other attributes are added to the model, the initial state of the mechanical arm is adjusted, and torque input ports and joint angle and angular velocity sensors are arranged for all joints of the mechanical arm. The reinforcement learning agent hyper-parameters and the neural networks are set up in Matlab. The state of the agent is defined as the six joint angles and joint angular velocities, the distance errors and velocity errors between the end of the mechanical arm and the target in the x, y and z directions respectively, and the current control torque of each joint; the DH parameters and joint angle ranges are shown in Table 1.
Table 1. DH parameters and ranges of joint angles
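As a minimal sketch of how such a state vector could be assembled from the measured quantities (the ordering and the resulting 24-dimensional layout are assumptions for illustration, not specified by the patent):

```python
import numpy as np

def build_state(joint_angles, joint_velocities, pos_error_xyz, vel_error_xyz, joint_torques):
    """Assemble the agent state from the measured quantities.

    joint_angles, joint_velocities, joint_torques: length-6 sequences (one entry per joint);
    pos_error_xyz, vel_error_xyz: length-3 sequences (end-effector minus target, in x, y, z).
    """
    state = np.concatenate([
        np.asarray(joint_angles, dtype=np.float32),      # 6 joint angles
        np.asarray(joint_velocities, dtype=np.float32),  # 6 joint angular velocities
        np.asarray(pos_error_xyz, dtype=np.float32),     # 3 position errors (x, y, z)
        np.asarray(vel_error_xyz, dtype=np.float32),     # 3 velocity errors (x, y, z)
        np.asarray(joint_torques, dtype=np.float32),     # 6 current joint control torques
    ])
    return state  # 24-dimensional observation vector
```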
S2, based on a reinforcement learning DDPG algorithm, setting an experience playback pool, an agent noise exploration function and a reward function, and performing interactive optimization training on the perception data to obtain an optimized reinforcement learning DDPG algorithm;
S21, setting DDPG a neural network required to be used;
Specifically, the neural networks required by DDPG are set up, including the main networks (an Actor network and a Critic network) and the target networks (a Target Actor network and a Target Critic network) with the same network structure. The Actor network is the action network: the agent state s is taken as input, and the deterministic action, i.e. the six-degree-of-freedom control torque of the mechanical arm, is output. The Critic network is the evaluation network used to calculate the Q value; the value of the action given by the Actor network is evaluated through the Q value and expressed as Q(s, a, w), where s is the state of the mechanical arm, a is the action given by the Actor network, and w is the Critic network parameters. The target networks slow down the network update speed through soft updates, making the outputs more stable and ensuring a more stable learning process for the Critic network. The network parameters w_1 and w_2 of the Actor network and the Critic network are randomly initialized, and the network parameters w_3 and w_4 of the Target Actor network and the Target Critic network are initialized. The agent represents the decision-making body in the reinforcement learning algorithm, i.e. the mechanical arm.
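A minimal PyTorch sketch of these four networks is given below; the layer sizes, activations, state dimension and torque bound are illustrative assumptions, since the patent does not specify a network architecture:

```python
import copy
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, MAX_TORQUE = 24, 6, 10.0  # assumed dimensions and torque bound

class Actor(nn.Module):
    """Deterministic policy: state s -> six-degree-of-freedom control torque."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, ACTION_DIM), nn.Tanh(),
        )
    def forward(self, s):
        return MAX_TORQUE * self.net(s)  # scale tanh output to the torque range

class Critic(nn.Module):
    """Q network: (s, a) -> scalar value used to evaluate the Actor's action."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

actor, critic = Actor(), Critic()
target_actor, target_critic = copy.deepcopy(actor), copy.deepcopy(critic)  # targets start as copies
```

Initializing the target networks as copies of the main networks matches the usual DDPG practice of keeping them synchronized at the start and then updating them softly.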
S22, setting an experience playback pool;
Specifically, an experience playback pool is set; the agent stores the obtained experience data (s_t, a_t, r_t, s_{t+1}, done) in the experience playback pool, and the network parameters are updated with batch sampling, where s_t represents the state of the agent at time t, a_t represents the action taken by the agent at time t, r_t represents the reward obtained after the action is taken at time t, s_{t+1} represents the state reached at time t+1 after the agent takes the action, and done represents whether the episode task is completed.
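A compact Python sketch of such an experience playback pool, storing (s_t, a_t, r_t, s_{t+1}, done) tuples and sampling mini-batches, might look as follows (the capacity and batch size are illustrative assumptions):

```python
import random
from collections import deque

class ReplayPool:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest experience is discarded when full

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=64):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next, done = zip(*batch)
        return s, a, r, s_next, done

    def __len__(self):
        return len(self.buffer)
```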
S23, setting agent noise exploration;
Specifically, agent noise exploration is set up. The action output by a deterministic strategy is a deterministic action and lacks exploration of the environment, so in the training stage noise is added to the action output by the Actor network to give the agent a certain exploration capability. Adaptive noise lets the agent explore more in the early stage of training and gradually reduces exploration in the later stage so that the learned experience is exploited. One improvement is to add Gaussian noise to the parameters of the action policy network, i.e. to use the policy parameter noise technique. When samples are drawn from the memory bank during a network update, the mean square error d between the action a' of the sample in the current state and the action a output by the original policy network before the update is calculated, where the samples represent data in the experience playback pool; d is compared with the set error threshold d_th, and the standard deviation s of the Gaussian noise is adjusted and updated according to the comparison result, with the expression:
in the above expression, s represents the standard deviation of the Gaussian noise, k represents an adaptive adjustment coefficient, d represents the mean square error value, and d_th represents the set error threshold.
S24, constructing a designed reward function;
Specifically, the reward function is designed mainly from four aspects of the target tracking task: distance, speed, motion smoothness and energy loss. The goal is that the final error between the end effector of the mechanical arm and the target is minimized, while the motion amplitude of the mechanical arm during the movement is as small as possible and the energy consumption is as low as possible.
A sparse reward is adopted for the distance term, designed as follows: a distance threshold of 0.01 is set, and when the distance error between the end effector and the target is smaller than 1, 2, 5 or 10 times the threshold, the agent obtains correspondingly scaled multiples of a fixed reward r_1 = 1000 for successfully approaching the target, which encourages the agent to approach the target step by step; otherwise, a reward inversely related to the error is obtained:
in the above formula, r_1 represents the distance reward function, k_1 represents an adjustable weight coefficient, and e_{ri} represents the position distance error of the end of the mechanical arm in the x, y and z directions respectively;
The speed term adopts the reward:
In the above formula, r_2 represents the speed reward function, k_2 represents an adjustable weight coefficient, and e_{vi} represents the velocity error of the end of the mechanical arm in the x, y and z directions respectively; this term guides the mechanical arm to move as closely as possible at the desired trajectory speed;
Two penalties are introduced in terms of motion smoothness and energy loss:
Control torque penalty:
In the above formula, r_3 represents the action-smoothness reward function, k_3 represents an adjustable weight coefficient, and torque_i represents the control torque of each axis of the mechanical arm; this prevents excessive control torque inputs that would cause large motion amplitude, large energy loss and low training stability;
Torque change penalty:
where r_4 represents the energy loss reward function, k_4 represents an adjustable weight coefficient, and the two torque terms represent the control torque of each axis of the mechanical arm at the current time step and at the previous time step, respectively; this limits the rate of change of the torque commanded by the agent and makes the motion of the mechanical arm smoother;
In summary, the total reward function is designed as:
r = r_1 + r_2 + r_3 + r_4
In the above equation, r represents the total reward function.
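A Python sketch of this reward design is given below. The 0.01 threshold, the 1000-point base reward and the 1×/2×/5×/10× bands follow the description; the band multipliers, the specific functional forms of r_1–r_4 (absolute versus squared errors) and the weights k_1–k_4 are assumptions, since the original expressions are not reproduced in the text:

```python
import numpy as np

# Illustrative weights; the patent leaves k1..k4 as adjustable coefficients.
K1, K2, K3, K4 = 1.0, 1.0, 0.01, 0.01
DIST_THRESHOLD = 0.01  # distance threshold from the description

def reward(pos_err, vel_err, torque, torque_prev):
    """pos_err, vel_err: length-3 errors in x, y, z; torque, torque_prev: length-6 joint torques."""
    dist = np.linalg.norm(pos_err)

    # r1: sparse distance reward - a larger bonus the closer the end effector is to the target.
    # The 1000 base reward and the 1x/2x/5x/10x bands follow the text; the band multipliers are assumed.
    if dist < DIST_THRESHOLD:
        r1 = 1000.0
    elif dist < 2 * DIST_THRESHOLD:
        r1 = 500.0
    elif dist < 5 * DIST_THRESHOLD:
        r1 = 200.0
    elif dist < 10 * DIST_THRESHOLD:
        r1 = 100.0
    else:
        r1 = -K1 * np.sum(np.abs(pos_err))        # negatively correlated with the error

    r2 = -K2 * np.sum(np.abs(vel_err))            # speed reward: follow the desired trajectory speed
    r3 = -K3 * np.sum(np.square(torque))          # action-smoothness reward: control torque penalty
    r4 = -K4 * np.sum(np.square(np.asarray(torque) - np.asarray(torque_prev)))  # torque-change penalty

    return r1 + r2 + r3 + r4                      # total reward r = r1 + r2 + r3 + r4
```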
S25, interaction optimization.
Specifically, the method comprises the following steps:
(1) The sensing data is input into an Actor network according to the current state of the six-axis mechanical arm, and control moment corresponding to the sensing data is obtained;
(2) Adding noise to the control moment and taking the control moment as the actual action of the six-axis mechanical arm;
(3) The six-axis mechanical arm executes corresponding actual actions, and the state of the six-axis mechanical arm is updated to obtain the observation state at the next moment;
(4) Inputting the current state of the six-axis mechanical arm and the actual action of the six-axis mechanical arm into a Critic network to obtain a Q estimated value;
(5) Inputting the observation state at the next moment into a Target Actor network to obtain the action output of the six-axis mechanical arm at the next moment;
(6) The observation state at the next moment and the action output of the six-axis mechanical arm at the next moment are input into the Target Critic network to obtain the Q estimated value and reward at the next moment;
(7) Determining a first interaction optimization completion mark of the six-axis mechanical arm according to the Q estimation value and the Q estimation value at the next moment;
(8) Storing the current state of the six-axis mechanical arm, the actual action of the six-axis mechanical arm, rewards, the estimated value of the Q at the next moment and the first interaction optimization completion mark into an experience playback pool;
(9) The interactive optimization steps (1) to (8) of the six-axis mechanical arm are circulated until the experience playback pool is fully stored, and an optimized reinforcement learning DDPG algorithm is obtained;
Through the continuous interaction between the intelligent agent and the environment, the processes of selecting actions, interacting with the environment, storing experience and training the network are repeatedly executed, so that the intelligent agent gradually learns the optimal strategy.
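A sketch of one such update drawn from the experience playback pool, using the networks and replay pool sketched earlier, is shown below. The discount factor, learning rates and soft-update coefficient are assumed values, and the structure follows the standard DDPG update that the description refers to rather than a formula printed in the patent:

```python
import numpy as np
import torch
import torch.nn.functional as F

GAMMA, TAU = 0.99, 0.005  # assumed discount factor and soft-update coefficient
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(batch):
    s, a, r, s_next, done = (torch.as_tensor(np.asarray(x), dtype=torch.float32) for x in batch)
    r, done = r.unsqueeze(-1), done.unsqueeze(-1)

    # Critic update: build the TD target from the target networks, minimise the squared TD error.
    with torch.no_grad():
        y = r + GAMMA * (1.0 - done) * target_critic(s_next, target_actor(s_next))
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update: deterministic policy gradient, i.e. maximise Q of the actor's own action.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft update of the target networks.
    with torch.no_grad():
        for tgt, src in ((target_actor, actor), (target_critic, critic)):
            for p_t, p in zip(tgt.parameters(), src.parameters()):
                p_t.mul_(1.0 - TAU).add_(TAU * p)
```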
And S3, deploying the optimized reinforcement learning DDPG algorithm to a six-axis mechanical arm for motion guidance.
The invention takes the reinforcement learning DDPG algorithm as the basic framework and improves the basic DDPG algorithm for the experimental task of tracking a dynamic target, so as to improve algorithm performance. DDPG is a deep reinforcement learning algorithm for continuous action spaces. It combines the Actor-Critic framework with deep neural networks, uses a Q function as the Critic, and adopts two algorithmic techniques, namely experience replay and target networks: experience data are sampled from the replay buffer to update the deep neural networks, the value function network is updated with the TD error, and actions are output through a deterministic policy, so a high-dimensional continuous action space can be handled effectively. To address the poor exploration of the DDPG algorithm, adaptive noise is introduced. Adaptive noise lets the agent explore more in the early stage of training and gradually reduces exploration in the later stage so that the learned experience is exploited. The reward function plays a vital role in a reinforcement learning algorithm; a detailed reward function model is designed for the experimental task of tracking a dynamic target, and through this reward function the agent is guided to learn and adapt so as to complete the experimental task.
Further, the effectiveness of the deep reinforcement learning algorithm is verified in a six-degree-of-freedom space manipulator tracking simulation environment. The tracking performance of the reinforcement learning algorithm is evaluated in simulation experiments in Matlab for a target moving along a straight line in space and a target moving along a circular curve in space. For the straight-line case, the target trajectory is shown in fig. 4, the tracking trajectory of the mechanical arm end effector in fig. 5, and the coordinate error of the end effector tracking the linear trajectory in fig. 6; the mechanical arm responds rapidly from its initial position. For the circular case, the target space circular trajectory is shown in fig. 7, the tracking trajectory of the end effector in fig. 8, and the coordinate error of the end effector tracking the circular trajectory in fig. 9; again the mechanical arm responds rapidly from its initial position.
Referring to fig. 2, a mechanical arm control system based on a deep reinforcement learning algorithm includes:
the acquisition module is used for acquiring the sensing data of the six-axis mechanical arm;
The interaction module is used for setting an experience playback pool, an agent noise exploration function and a reward function based on the reinforcement learning DDPG algorithm, carrying out interaction optimization training on the perception data, and obtaining an optimized reinforcement learning DDPG algorithm;
the deployment module is used for deploying the optimized reinforcement learning DDPG algorithm to the six-axis mechanical arm to conduct motion guidance.
The content in the method embodiment is applicable to the system embodiment, the functions specifically realized by the system embodiment are the same as those of the method embodiment, and the achieved beneficial effects are the same as those of the method embodiment.
While the preferred embodiment of the present application has been described in detail, the application is not limited to the embodiment, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the application, and these equivalent modifications and substitutions are intended to be included in the scope of the present application as defined in the appended claims.
Claims (2)
1. The mechanical arm control method based on the deep reinforcement learning algorithm is characterized by comprising the following steps of:
the method comprises the steps of obtaining sensing data of a six-axis mechanical arm, wherein torque input ports and joint angle and angular velocity sensors are arranged for all joints of the six-axis mechanical arm, and the state of the agent is defined as the six joint angles and joint angular velocities of the six-axis mechanical arm, the distance errors and velocity errors between the arm end and the target in the x, y and z directions respectively, and the current control torque of each joint;
Based on the reinforcement learning DDPG algorithm, setting an experience playback pool, an agent noise exploration function and a reward function, and performing interactive optimization training on the perception data to obtain an optimized reinforcement learning DDPG algorithm;
The reinforcement learning DDPG algorithm comprises a main network and a target network, wherein the main network comprises an Actor network and a Critic network, and the target network comprises a Target Actor network and a Target Critic network; the Actor network is an action network: it takes the agent state s as input and outputs the deterministic action, i.e. the six-degree-of-freedom control torque of the mechanical arm; the Critic network is an evaluation network used to calculate the Q value, and the value of the action given by the Actor network is evaluated through the Q value; the network parameters of the Actor network and the Critic network are initialized, and the network parameters of the Target Actor network and the Target Critic network are initialized accordingly;
The experience playback pool includes: the agent stores the obtained experience data (s_t, a_t, r_t, s_{t+1}, done) in the experience playback pool and samples it in batches when updating the main network parameters and the target network parameters, wherein s_t represents the agent state at time t, a_t represents the action taken by the agent at time t, r_t represents the reward obtained after the action is taken at time t, s_{t+1} represents the state reached at time t+1 after the agent takes the action, and done represents whether the episode task is completed;
The agent noise exploration comprises: adding noise to the action output by the Actor network; when the Actor network is updated and samples are drawn from the memory bank, calculating the mean square error d between the action a' of the sample in the current state and the action a output by the Actor network before the update, wherein the samples represent data in the experience playback pool; and comparing d with the set error threshold d_th and adjusting and updating the standard deviation s of the Gaussian noise according to the comparison result;
The expression for the adjustment and update of the standard deviation s of the Gaussian noise is:
In the above formula, s represents the standard deviation of the Gaussian noise, k represents an adaptive adjustment coefficient, d represents the mean square error value, and d_th represents the set error threshold;
the reward functions include a distance reward function, a speed reward function, an action-smoothness reward function, and an energy loss reward function, wherein:
the distance reward function indicates that when the distance error between the end of the agent and the target is smaller than a threshold, the agent obtains a reward that is positively correlated with how close it is to the target, and otherwise obtains a reward negatively correlated with the error; the expression is as follows:
in the above formula, r_1 represents the distance reward function, k_1 represents an adjustable weight coefficient, and e_{ri} represents the position distance error of the end of the mechanical arm in the x, y and z directions respectively;
the speed reward function is used to guide the mechanical arm to move at the desired trajectory speed, and the expression is as follows:
in the above formula, r_2 represents the speed reward function, k_2 represents an adjustable weight coefficient, and e_{vi} represents the velocity error of the end of the mechanical arm in the x, y and z directions respectively;
The action-smoothness reward function represents a control torque penalty, and the expression is as follows:
In the above formula, r_3 represents the action-smoothness reward function, k_3 represents an adjustable weight coefficient, and torque_i represents the control torque of each axis of the mechanical arm;
the energy loss reward function represents a torque change penalty, and the expression is as follows:
where r_4 represents the energy loss reward function, k_4 represents an adjustable weight coefficient, and the two torque terms represent the control torque of each axis of the mechanical arm at the current time step and at the previous time step, respectively;
the interactive optimization training is performed on the perception data, and the reinforcement learning DDPG algorithm after optimization further comprises:
The sensing data is input into an Actor network according to the current state of the six-axis mechanical arm, and control moment corresponding to the sensing data is obtained;
Adding noise to the control moment and taking the control moment as the actual action of the six-axis mechanical arm;
The six-axis mechanical arm executes corresponding actual actions, and the state of the six-axis mechanical arm is updated to obtain the observation state at the next moment;
inputting the current state of the six-axis mechanical arm, rewards obtained by executing corresponding actual actions and the actual actions of the six-axis mechanical arm into a Critic network to obtain a Q estimated value;
Inputting the observation state at the next moment into the Target Actor network to obtain the action output of the six-axis mechanical arm at the next moment;
The observation state at the next moment and the action output of the six-axis mechanical arm at the next moment are input into the Target Critic network to obtain the Q estimated value and reward at the next moment;
Determining a first interaction optimization completion mark of the six-axis mechanical arm according to the Q estimation value and the Q estimation value at the next moment;
Storing the current state of the six-axis mechanical arm, the actual action of the six-axis mechanical arm, rewards, the estimated value of the Q at the next moment and the first interaction optimization completion mark into an experience playback pool;
The interactive optimization steps of the six-axis mechanical arm are repeated until the experience playback pool is fully stored, and the optimized reinforcement learning DDPG algorithm is obtained;
and deploying the optimized reinforcement learning DDPG algorithm to the six-axis mechanical arm for motion guidance.
2. A mechanical arm control system based on the deep reinforcement learning algorithm, implementing the method as claimed in claim 1 and comprising the following modules:
the acquisition module is used for acquiring the sensing data of the six-axis mechanical arm;
The interaction module is used for setting an experience playback pool, an agent noise exploration function and a reward function based on the reinforcement learning DDPG algorithm, carrying out interaction optimization training on the perception data, and obtaining an optimized reinforcement learning DDPG algorithm;
the deployment module is used for deploying the optimized reinforcement learning DDPG algorithm to the six-axis mechanical arm to conduct motion guidance.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311258556.6A CN117140527B (en) | 2023-09-27 | 2023-09-27 | Mechanical arm control method and system based on deep reinforcement learning algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311258556.6A CN117140527B (en) | 2023-09-27 | 2023-09-27 | Mechanical arm control method and system based on deep reinforcement learning algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117140527A CN117140527A (en) | 2023-12-01 |
CN117140527B true CN117140527B (en) | 2024-04-26 |
Family
ID=88902671
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311258556.6A Active CN117140527B (en) | 2023-09-27 | 2023-09-27 | Mechanical arm control method and system based on deep reinforcement learning algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117140527B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118034061B (en) * | 2024-03-29 | 2024-09-20 | 合肥工业大学 | SCARA robot approximate constraint robust control method based on deep reinforcement learning |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110340888A (en) * | 2018-10-30 | 2019-10-18 | 大连理工大学 | A kind of robot for space arrests control system, intensified learning method and dynamic modeling method |
JP2021034050A (en) * | 2019-08-21 | 2021-03-01 | Harbin Engineering University | AUV action planning and operation control method based on reinforcement learning
WO2021164276A1 (en) * | 2020-07-31 | 2021-08-26 | 平安科技(深圳)有限公司 | Target tracking method and apparatus, computer device, and storage medium |
CN113524186A (en) * | 2021-07-19 | 2021-10-22 | 山东大学 | Deep reinforcement learning double-arm robot control method and system based on demonstration example |
CN114083539A (en) * | 2021-11-30 | 2022-02-25 | 哈尔滨工业大学 | Mechanical arm anti-interference motion planning method based on multi-agent reinforcement learning |
CN115097736A (en) * | 2022-08-10 | 2022-09-23 | 东南大学 | Active disturbance rejection controller parameter optimization method based on deep reinforcement learning |
CN115179280A (en) * | 2022-06-21 | 2022-10-14 | 南京大学 | Reward shaping method based on magnetic field in mechanical arm control for reinforcement learning |
CN116476042A (en) * | 2022-12-31 | 2023-07-25 | 中国科学院长春光学精密机械与物理研究所 | Mechanical arm kinematics inverse solution optimization method and device based on deep reinforcement learning |
CN116533249A (en) * | 2023-06-05 | 2023-08-04 | 贵州大学 | Mechanical arm control method based on deep reinforcement learning |
CN116587275A (en) * | 2023-05-29 | 2023-08-15 | 华侨大学 | Mechanical arm intelligent impedance control method and system based on deep reinforcement learning |
WO2023168821A1 (en) * | 2022-03-07 | 2023-09-14 | 大连理工大学 | Reinforcement learning-based optimization control method for aeroengine transition state |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101988504B1 (en) * | 2019-02-28 | 2019-10-01 | 아이덴티파이 주식회사 | Method for reinforcement learning using virtual environment generated by deep learning |
KR102457914B1 (en) * | 2021-04-21 | 2022-10-24 | 숭실대학교산학협력단 | Method for combating stop-and-go wave problem using deep reinforcement learning based autonomous vehicles, recording medium and device for performing the method |
- 2023-09-27: CN application CN202311258556.6A granted as CN117140527B, status Active
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110340888A (en) * | 2018-10-30 | 2019-10-18 | 大连理工大学 | A kind of robot for space arrests control system, intensified learning method and dynamic modeling method |
JP2021034050A (en) * | 2019-08-21 | 2021-03-01 | Harbin Engineering University | AUV action planning and operation control method based on reinforcement learning
WO2021164276A1 (en) * | 2020-07-31 | 2021-08-26 | 平安科技(深圳)有限公司 | Target tracking method and apparatus, computer device, and storage medium |
CN113524186A (en) * | 2021-07-19 | 2021-10-22 | 山东大学 | Deep reinforcement learning double-arm robot control method and system based on demonstration example |
CN114083539A (en) * | 2021-11-30 | 2022-02-25 | 哈尔滨工业大学 | Mechanical arm anti-interference motion planning method based on multi-agent reinforcement learning |
WO2023168821A1 (en) * | 2022-03-07 | 2023-09-14 | 大连理工大学 | Reinforcement learning-based optimization control method for aeroengine transition state |
CN115179280A (en) * | 2022-06-21 | 2022-10-14 | 南京大学 | Reward shaping method based on magnetic field in mechanical arm control for reinforcement learning |
CN115097736A (en) * | 2022-08-10 | 2022-09-23 | 东南大学 | Active disturbance rejection controller parameter optimization method based on deep reinforcement learning |
CN116476042A (en) * | 2022-12-31 | 2023-07-25 | 中国科学院长春光学精密机械与物理研究所 | Mechanical arm kinematics inverse solution optimization method and device based on deep reinforcement learning |
CN116587275A (en) * | 2023-05-29 | 2023-08-15 | 华侨大学 | Mechanical arm intelligent impedance control method and system based on deep reinforcement learning |
CN116533249A (en) * | 2023-06-05 | 2023-08-04 | 贵州大学 | Mechanical arm control method based on deep reinforcement learning |
Non-Patent Citations (3)
Title |
---|
Control of Ski Robot Based on Deep Reinforcement Learning; Zegui Wu; 2011 International Conference on Security, Pattern Analysis, and Cybernetics; 2021-06-20; full text *
Path planning and control of space robot capture operations based on reinforcement learning; Cao Yuxue; Proceedings of the 40th Chinese Control Conference (15); 2021-07-26; full text *
Research on intelligent capture control strategy of a space manipulator based on reinforcement learning; Du Desong; China Master's Theses Full-text Database, Information Science and Technology; 2020-02-15; full text *
Also Published As
Publication number | Publication date |
---|---|
CN117140527A (en) | 2023-12-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11529733B2 (en) | Method and system for robot action imitation learning in three-dimensional space | |
Peters et al. | Reinforcement learning by reward-weighted regression for operational space control | |
Mitrovic et al. | Adaptive optimal feedback control with learned internal dynamics models | |
JP7291185B2 (en) | Technologies for force and torque guided robot assembly | |
CN117140527B (en) | Mechanical arm control method and system based on deep reinforcement learning algorithm | |
CN116533249A (en) | Mechanical arm control method based on deep reinforcement learning | |
CN115446867B (en) | Industrial mechanical arm control method and system based on digital twin technology | |
CN111702766A (en) | Mechanical arm self-adaptive door opening screwing method based on force sense guidance | |
CN112847235A (en) | Robot step force guiding assembly method and system based on deep reinforcement learning | |
CN115256401A (en) | Space manipulator shaft hole assembly variable impedance control method based on reinforcement learning | |
CN118201742A (en) | Multi-robot coordination using a graph neural network | |
CN115416024A (en) | Moment-controlled mechanical arm autonomous trajectory planning method and system | |
CN113967909B (en) | Direction rewarding-based intelligent control method for mechanical arm | |
Yan et al. | Hierarchical policy learning with demonstration learning for robotic multiple peg-in-hole assembly tasks | |
CN115918377B (en) | Control method and control device of automatic tree fruit picking machine and automatic tree fruit picking machine | |
CN117086882A (en) | Strengthening learning method based on mechanical arm attitude movement degree of freedom | |
Carneiro et al. | The role of early anticipations for human-robot ball catching | |
Malone et al. | Efficient motion-based task learning for a serial link manipulator | |
US11921492B2 (en) | Transfer between tasks in different domains | |
Yadavari et al. | Addressing challenges in dynamic modeling of stewart platform using reinforcement learning-based control approach | |
Malone et al. | Efficient motion-based task learning | |
Zhou et al. | Intelligent Control of Manipulator Based on Deep Reinforcement Learning | |
Mitrovic | Stochastic optimal control with learned dynamics models | |
CN114378811B (en) | Force and torque directed robotic assembly technique | |
Dag et al. | Monolithic vs. hybrid controller for multi-objective Sim-to-Real learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |