
CN117140527B - Mechanical arm control method and system based on deep reinforcement learning algorithm - Google Patents

Mechanical arm control method and system based on deep reinforcement learning algorithm

Info

Publication number
CN117140527B
CN117140527B (application CN202311258556.6A; also published as CN117140527A)
Authority
CN
China
Prior art keywords
mechanical arm
network
reinforcement learning
target
action
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311258556.6A
Other languages
Chinese (zh)
Other versions
CN117140527A (en)
Inventor
邬树楠
植嘉皓
初未萌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Sun Yat Sen University Shenzhen Campus
Original Assignee
Sun Yat Sen University
Sun Yat Sen University Shenzhen Campus
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University, Sun Yat Sen University Shenzhen Campus filed Critical Sun Yat Sen University
Priority to CN202311258556.6A priority Critical patent/CN117140527B/en
Publication of CN117140527A publication Critical patent/CN117140527A/en
Application granted granted Critical
Publication of CN117140527B publication Critical patent/CN117140527B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B25 - HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J - MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 - Programme-controlled manipulators
    • B25J9/16 - Programme controls
    • B25J9/1628 - Programme controls characterised by the control loop
    • B25J9/163 - Programme controls characterised by the control loop learning, adaptive, model based, rule based expert control

Landscapes

  • Engineering & Computer Science (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Feedback Control In General (AREA)
  • Manipulator (AREA)

Abstract

The invention discloses a mechanical arm control method and system based on a deep reinforcement learning algorithm. The method comprises the following steps: obtaining sensing data of a six-axis mechanical arm; based on the reinforcement learning DDPG algorithm, setting an experience playback pool, an agent noise exploration function and a reward function, and performing interactive optimization training on the sensing data to obtain an optimized reinforcement learning DDPG algorithm; and deploying the optimized reinforcement learning DDPG algorithm to the six-axis mechanical arm for motion guidance. By setting an experience playback pool, an agent noise exploration function and a reward function, the invention constructs a reinforcement learning DDPG algorithm that realizes more accurate target tracking. The mechanical arm control method and system based on the deep reinforcement learning algorithm can be widely applied in the technical field of mechanical arm control.

Description

Mechanical arm control method and system based on deep reinforcement learning algorithm
Technical Field
The invention relates to the technical field of mechanical arm control, in particular to a mechanical arm control method and system based on a deep reinforcement learning algorithm.
Background
With the year-on-year growth of human aerospace activity, space debris, defunct spacecraft and other non-cooperative space targets that need to be captured, repaired or removed are steadily increasing. Methods for capturing non-cooperative space targets include rigid schemes and flexible schemes such as rope nets and flying claws. In a rigid capture scheme, a capture device is mounted at the end of a mechanical arm, and accurate control of the mechanical arm allows the target to be captured so that the service satellite and the target satellite are docked very firmly. Owing to its light weight, high flexibility, strong operability and other advantages, the space mechanical arm has broad application prospects in capture tasks for non-cooperative space targets.
A non-cooperative target has an uncertain motion trajectory, escapes easily, and is therefore a difficulty in the capture process. In mechanical arm control, conventional control methods such as PID control and computed-torque control have several disadvantages. First, they generally depend on an accurate system model and parameters, whereas in practical applications the dynamic model of the mechanical arm system may be difficult to model accurately, which can make the conventional methods unstable or inaccurate in actual operation. Second, their adaptability is insufficient: conventional control methods generally have difficulty coping with nonlinear, time-varying and coupled characteristics of complex mechanical arm systems, may require frequent parameter adjustment when handling complex tasks, and cannot share experience between different tasks. Third, parameter tuning is difficult: conventional control methods require manual parameter adjustment, which can be time-consuming and tedious, and when the parameters of the mechanical arm system vary or the task changes continuously, maintaining and optimizing a conventional controller may become difficult. Fourth, conventional control methods generalize poorly: they usually cannot learn from a small amount of data with strong generalization capability, and when the mechanical arm system faces an unknown environment or task, they may fail to adapt and learn a new strategy.
Disclosure of Invention
In order to solve the above technical problems, the invention aims to provide a mechanical arm control method and system based on a deep reinforcement learning algorithm, which construct a reinforcement learning DDPG algorithm by setting an experience playback pool, an agent noise exploration function and a reward function, so as to realize more accurate target tracking.
The first technical scheme adopted by the invention is as follows: a mechanical arm control method based on a deep reinforcement learning algorithm comprises the following steps:
Obtaining sensing data of a six-axis mechanical arm;
Based on the reinforcement learning DDPG algorithm, setting an experience playback pool, an agent noise exploration function and a reward function, and performing interactive optimization training on the perception data to obtain an optimized reinforcement learning DDPG algorithm;
and deploying the optimized reinforcement learning DDPG algorithm to the six-axis mechanical arm for motion guidance.
Further, the step of obtaining the sensing data of the six-axis mechanical arm specifically includes:
Torque input ports for controlling each joint of the six-axis mechanical arm, together with joint angle and angular velocity sensors, are provided, and the state of the agent is defined as the six joint angles and joint angular velocities of the mechanical arm, the distance errors and velocity errors between the end of the arm and the target in the x, y and z directions respectively, and the current control torque of each joint.
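As a concrete illustration of this state definition, the following minimal Python sketch assembles the observation vector from the quantities named above (6 joint angles, 6 joint angular velocities, 3 position errors, 3 velocity errors and 6 joint torques, 24 values in total); the function name, the use of NumPy and the array layout are illustrative assumptions, not part of the patented method.

import numpy as np

def build_state(joint_angles, joint_velocities, pos_error, vel_error, joint_torques):
    """Assemble the agent state described in the text.

    joint_angles, joint_velocities, joint_torques: length-6 arrays (one value per joint).
    pos_error, vel_error: length-3 arrays (end-effector minus target, in x, y, z).
    """
    return np.concatenate([
        np.asarray(joint_angles, dtype=np.float64),      # 6 joint angles
        np.asarray(joint_velocities, dtype=np.float64),  # 6 joint angular velocities
        np.asarray(pos_error, dtype=np.float64),         # 3 position errors (x, y, z)
        np.asarray(vel_error, dtype=np.float64),         # 3 velocity errors (x, y, z)
        np.asarray(joint_torques, dtype=np.float64),     # 6 current joint torques
    ])  # 24-dimensional state vector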
Further, the reinforcement learning DDPG algorithm includes a primary network and a target network, wherein:
The main network comprises an Actor network and a Critic network, and the target network comprises a Target Actor network and a Target Critic network;
the Actor network is the action network: it takes the state s of the agent as input and outputs a deterministic action, namely the six-degree-of-freedom control torque of the mechanical arm;
the Critic network is the evaluation network: it calculates the Q value, through which the value of the action given by the Actor network is evaluated;
the network parameters of the Actor network and the Critic network, and the network parameters of the Target Actor network and the Target Critic network, are initialized before training, and the target networks are updated slowly so that learning remains stable.
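For readers who prefer code, a minimal PyTorch-style sketch of the two main networks described above is given here: an Actor that maps the state s to a six-dimensional deterministic torque action, and a Critic that maps a (state, action) pair to a scalar Q value. The layer sizes, the tanh output scaling and the torque bound max_torque are illustrative assumptions; the patent does not specify the network architecture.

import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps the agent state to a deterministic 6-DOF torque action."""
    def __init__(self, state_dim=24, action_dim=6, max_torque=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),  # bounded output in [-1, 1]
        )
        self.max_torque = max_torque

    def forward(self, state):
        return self.max_torque * self.net(state)

class Critic(nn.Module):
    """Maps a (state, action) pair to a scalar Q value."""
    def __init__(self, state_dim=24, action_dim=6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))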
Further, the step of setting an experience playback pool specifically includes:
The agent stores the obtained experience data (s_t, a_t, r_t, s_{t+1}, done) in the experience playback pool and samples it in mini-batches when updating the main network parameters and the target network parameters, wherein s_t represents the state of the agent at time t, a_t represents the action taken by the agent at time t, r_t represents the reward obtained after the action is taken at time t, s_{t+1} represents the state reached at time t+1 after the agent takes the action, and done represents whether the episode task is completed.
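A minimal sketch of such an experience playback pool, assuming a fixed capacity and uniform random mini-batch sampling (the capacity value and the use of Python's collections and random modules are illustrative choices, not specified by the patent):

import random
from collections import deque

class ReplayPool:
    """Stores transitions (s_t, a_t, r_t, s_{t+1}, done) and samples mini-batches."""
    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)  # old transitions are discarded when full

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)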
Further, the step of agent noise exploration specifically includes:
Noise is added to the action output by the Actor network. When the network is updated and samples are drawn from the experience playback pool, the mean square error d between the action a' stored in the sample for the current state and the action a output by the policy network before updating for the same state is calculated and compared with a set error threshold d_th, and the standard deviation s of the Gaussian noise is adjusted and updated according to the comparison result.
Further, the standard deviation s of the Gaussian noise is adjusted and updated according to this comparison, where s represents the standard deviation of the Gaussian noise, k represents an adaptive adjustment coefficient, d represents the mean-square-error value, and d_th represents the set error threshold.
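A plausible form of this update, assuming the multiplicative scaling commonly used for adaptive parameter-space noise (enlarge the noise when the perturbed actions remain too close to the unperturbed ones, shrink it otherwise); the exact expression used by the patent is not reproduced here:

s \leftarrow
\begin{cases}
k\,s, & d < d_{th} \\
s/k, & d \geq d_{th}
\end{cases}, \qquad k > 1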
Further, the reward functions include a distance reward function, a speed reward function, an action-smoothness reward function and an energy loss reward function, wherein:
the distance reward function indicates that when the distance error between the end of the agent and the target is smaller than a threshold value, the agent obtains a reward positively correlated with how close it is to the target, and otherwise obtains a reward negatively correlated with the error; here r_1 represents the distance reward function, k_1 represents an adjustable weight coefficient, and e_ri represents the position error of the end of the mechanical arm in the x, y and z directions respectively;
the speed reward function is used for guiding the mechanical arm to move at the expected trajectory speed; here r_2 represents the speed reward function, k_2 represents an adjustable weight coefficient, and e_vi represents the velocity error of the end of the mechanical arm in the x, y and z directions respectively;
the action-smoothness reward function represents a control-torque penalty; here r_3 represents the action-smoothness reward function, k_3 represents an adjustable weight coefficient, and torque_i represents the control torque of each axis of the mechanical arm;
the energy loss reward function represents a torque-change penalty; here r_4 represents the energy loss reward function, k_4 represents an adjustable weight coefficient, and torque_i^t and torque_i^{t-1} represent the control torque of each axis of the mechanical arm at the current time step and the previous time step respectively.
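The four expressions themselves are not reproduced above; plausible forms consistent with these variable definitions are given below, where the use of Euclidean norms, squared torque sums, and the bonus and tolerance values r_bonus and d_tol are assumptions for illustration only:

r_1 = \begin{cases} +\,r_{\mathrm{bonus}}, & \sqrt{e_{rx}^2 + e_{ry}^2 + e_{rz}^2} < d_{\mathrm{tol}} \\ -\,k_1 \sqrt{e_{rx}^2 + e_{ry}^2 + e_{rz}^2}, & \text{otherwise} \end{cases}

r_2 = -k_2 \sqrt{e_{vx}^2 + e_{vy}^2 + e_{vz}^2}

r_3 = -k_3 \sum_{i=1}^{6} \mathrm{torque}_i^{2}

r_4 = -k_4 \sum_{i=1}^{6} \left( \mathrm{torque}_i^{t} - \mathrm{torque}_i^{t-1} \right)^2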
Further, the step of performing interactive optimization training on the perception data to obtain an optimized reinforcement learning DDPG algorithm specifically includes:
The sensing data is input into an Actor network according to the current state of the six-axis mechanical arm, and control moment corresponding to the sensing data is obtained;
Adding noise to the control moment and taking the control moment as the actual action of the six-axis mechanical arm;
The six-axis mechanical arm executes corresponding actual actions, and the state of the six-axis mechanical arm is updated to obtain the observation state at the next moment;
inputting the current state of the six-axis mechanical arm, rewards obtained by executing corresponding actual actions and the actual actions of the six-axis mechanical arm into a Critic network to obtain a Q estimated value;
inputting the observation state at the next moment into a Target Actor network to obtain the action output of the six-axis mechanical arm at the next moment;
The observation state at the next moment and the action output of the six-axis mechanical arm at the next moment are input into a Target Critic network to obtain a Q estimated value and rewards at the next moment;
Determining a first interaction optimization completion mark of the six-axis mechanical arm according to the Q estimation value and the Q estimation value at the next moment;
Storing the current state of the six-axis mechanical arm, the actual action of the six-axis mechanical arm, rewards, the estimated value of the Q at the next moment and the first interaction optimization completion mark into an experience playback pool;
and (3) circulating the interactive optimization step of the six-axis mechanical arm until the experience playback pool is fully stored, and obtaining an optimized reinforcement learning DDPG algorithm.
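A minimal PyTorch-style sketch of the network update implied by these steps follows; it assumes the standard DDPG losses, and the discount factor gamma, the optimizers and the layout of the sampled mini-batch are illustrative assumptions rather than details taken from the patent.

import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99):
    """One DDPG update from a sampled mini-batch of transitions."""
    # state, action, reward, next_state are float tensors; done is a 0/1 float tensor.
    state, action, reward, next_state, done = batch

    # Q estimate for the stored state-action pair (Critic network).
    q_value = critic(state, action)

    # Q estimate at the next moment (Target Actor + Target Critic networks).
    with torch.no_grad():
        next_action = target_actor(next_state)
        next_q = target_critic(next_state, next_action)
        td_target = reward + gamma * (1.0 - done) * next_q

    # Critic update: regress the Q estimate toward the TD target.
    critic_loss = F.mse_loss(q_value, td_target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: follow the deterministic policy gradient.
    actor_loss = -critic(state, actor(state)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()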
The second technical scheme adopted by the invention is as follows: a robotic arm control system based on a deep reinforcement learning algorithm, comprising:
the acquisition module is used for acquiring the sensing data of the six-axis mechanical arm;
The interaction module is used for setting an experience playback pool, an agent noise exploration function and a reward function based on the reinforcement learning DDPG algorithm, carrying out interaction optimization training on the perception data, and obtaining an optimized reinforcement learning DDPG algorithm;
the deployment module is used for deploying the optimized reinforcement learning DDPG algorithm to the six-axis mechanical arm to conduct motion guidance.
The method and the system have the following beneficial effects: by adopting a deep reinforcement learning algorithm, a control strategy is learned and optimized from the perceived data, so that the mechanical arm can adjust in real time according to the dynamic information of the target; during training, the mechanical arm interacts with the target and continuously optimizes its strategy through a trial-and-error and reward mechanism; for the dynamic target tracking task, adaptive strategy parameters and an improved experience playback are designed on top of the basic DDPG framework, and adaptive noise is introduced to improve the exploration capability of the DDPG algorithm, thereby realizing more accurate target tracking.
Drawings
FIG. 1 is a flow chart of steps of a method for controlling a mechanical arm based on a deep reinforcement learning algorithm according to an embodiment of the present invention;
FIG. 2 is a block diagram of a control system for a robotic arm based on a deep reinforcement learning algorithm according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a six-axis mechanical arm model according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a target space straight line track according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a robot end effector tracking a linear trajectory in accordance with an embodiment of the present invention;
FIG. 6 is a schematic diagram of the coordinate error of a robot end effector tracking a linear trajectory in accordance with an embodiment of the present invention;
FIG. 7 is a schematic diagram of a circular track of a target space in accordance with an embodiment of the present invention;
FIG. 8 is a schematic diagram of a robotic end effector tracking a circular trajectory in accordance with an embodiment of the present invention;
FIG. 9 is a schematic diagram of the coordinate error of a robotic end effector tracking a circular trajectory in accordance with an embodiment of the present invention.
Detailed Description
The invention will now be described in further detail with reference to the drawings and to specific examples. The step numbers in the following embodiments are set for convenience of illustration only, and the order between the steps is not limited in any way, and the execution order of the steps in the embodiments may be adaptively adjusted according to the understanding of those skilled in the art.
Compared with other conventional control methods, the control method based on deep reinforcement learning does not need an accurate system model and can learn a control strategy directly from actual interaction, so it is suitable for complex nonlinear and time-varying systems. The capture strategy is highly adaptive and predictive: the deep reinforcement learning method can adapt to different tasks and environments, experience can be shared between different tasks through learning in interaction, and the capture strategy can be learned autonomously in actual tasks without manual parameter tuning, which reduces the burden of manual parameter adjustment. The method is therefore well suited to the capture problem of a multi-degree-of-freedom mechanical arm for non-cooperative space targets.
Based on this principle, a deep reinforcement learning algorithm is proposed for the dynamic trajectory tracking problem of a non-cooperative target in the task of capturing the target with a space manipulator, and its effectiveness is verified in simulation. For the dynamic target tracking task, adaptive strategy parameters and an improved experience playback are designed on the basic DDPG framework, trajectory tracking of the target is realized within the allowable error range, the effectiveness of the reinforcement learning algorithm is verified in a six-degree-of-freedom space manipulator simulation environment, and the algorithm is shown to achieve high tracking accuracy in the trajectory tracking task.
A simulation environment in which the mechanical arm tracks a dynamic trajectory target is built in Simscape within Matlab, and simulation is carried out with Matlab and Simulink together. For convenience in the subsequent experiments, the parameters of the AUBO i5, a real six-axis mechanical arm, are taken as the simulated arm parameters, as shown in Fig. 3. Joint angle sensors measure the angle and angular velocity of each joint in real time during the motion of the mechanical arm, and these measurements serve as the state feedback information for deep reinforcement learning. The simulation experiment only considers the end effector tracking and reaching the target position; the position error and velocity error between the end effector and the target point are taken as the main indicators for evaluating the control strategy, and the control of the collision force generated when the end effector contacts the target is not considered.
Referring to fig. 1, the invention provides a mechanical arm control method based on a deep reinforcement learning algorithm, which comprises the following steps:
S1, acquiring sensing data of a six-axis mechanical arm;
Specifically, a urdf model of the mechanical arm is imported into Simulink, dynamics, kinematics and other attributes are added to the model, the initial state of the mechanical arm is adjusted, and torque input ports and joint angle and angular velocity sensors are set for each joint of the mechanical arm. The reinforcement learning agent hyper-parameters and the neural networks are set in Matlab. The state of the agent is defined as the six joint angles and joint angular velocities of the mechanical arm, the distance errors and velocity errors between the end of the arm and the target in the x, y and z directions respectively, and the current control torque of each joint; the DH parameters and rotation range of each joint are shown in Table 1.
Table 1: DH parameters and ranges of joint angles
S2, based on a reinforcement learning DDPG algorithm, setting an experience playback pool, an agent noise exploration function and a reward function, and performing interactive optimization training on the perception data to obtain an optimized reinforcement learning DDPG algorithm;
S21, setting DDPG a neural network required to be used;
Specifically, the neural networks required by DDPG are set up, including the main Actor network and Critic network, and the Target Actor network and Target Critic network, which have the same structure as the corresponding main networks. The Actor network is the action network: it takes the state s of the agent as input and outputs a deterministic action, the six-degree-of-freedom control torque of the mechanical arm. The Critic network is the evaluation network: it calculates the Q value, through which the value of the action given by the Actor network is evaluated, written Q(s, a, w), where s is the state of the mechanical arm, a is the action given by the Actor network, and w denotes the Critic network parameters. The target networks slow down the network update speed through soft updates, which makes the output more stable and keeps the learning process of the Critic network more stable. The network parameters w_1 and w_2 of the Actor network and the Critic network are randomly initialized, and the network parameters w_3 and w_4 of the Target Actor network and the Target Critic network are initialized. Here the agent denotes the decision-making body in the reinforcement learning algorithm, namely the mechanical arm.
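The soft update mentioned above can be written as theta_target <- tau * theta + (1 - tau) * theta_target. A minimal Python sketch follows, assuming PyTorch modules for the networks; the value of tau is an illustrative assumption, not a figure taken from the patent.

def soft_update(target_net, main_net, tau=0.005):
    """Slowly track the main network parameters, as described for the target networks."""
    for target_param, param in zip(target_net.parameters(), main_net.parameters()):
        target_param.data.copy_(tau * param.data + (1.0 - tau) * target_param.data)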
S22, setting an experience playback pool;
Specifically, an experience playback pool is set up. The agent stores the obtained experience data (s_t, a_t, r_t, s_{t+1}, done) in the experience playback pool, and the network parameters are updated by sampling from it in mini-batches, where s_t represents the state of the agent at time t, a_t represents the action taken by the agent at time t, r_t represents the reward obtained after the action is taken at time t, s_{t+1} represents the state reached at time t+1 after the agent takes the action, and done represents whether the episode task is completed.
S23, setting agent noise exploration;
Specifically, agent noise exploration is set up. The action output by a deterministic policy is a deterministic action and lacks exploration of the environment. In the training stage, noise is added to the action output by the Actor network so that the agent has a certain exploration capability. Adaptive noise lets the agent explore more early in training and gradually reduces exploration later in training, making use of the learned experience. One improvement is to add Gaussian noise to the parameters of the action policy network, i.e. to use the policy parameter noise technique. When the network is updated and samples are drawn from the memory bank (the samples are data in the experience playback pool), the mean square error d between the action a' stored in the sample for the current state and the action a output by the policy network before updating is calculated and compared with the set error threshold d_th, and the standard deviation s of the Gaussian noise is adjusted and updated according to the comparison result, where s represents the standard deviation of the Gaussian noise, k represents an adaptive adjustment coefficient, d represents the mean-square-error value, and d_th represents the set error threshold.
S24, constructing a designed reward function;
Specifically, the designed reward function is constructed; in the target tracking operation it is mainly considered from four aspects: distance, speed, motion smoothness and energy loss. It is desirable that the final error between the end effector of the mechanical arm and the target be minimized, that the motion amplitude of the mechanical arm during motion be as small as possible, and that the energy consumption be as low as possible.
Sparse rewards are adopted for the distance term, designed as follows: a distance threshold of 0.01 is set, and when the distance error between the end and the target is smaller than 1, 2, 5 or 10 times the threshold, the agent obtains a correspondingly scaled multiple of the fixed reward r_1 = 1000 for successfully approaching the target, which is used to encourage the agent to gradually approach the target; otherwise a reward inversely related to the error is obtained, where r_1 represents the distance reward function, k_1 represents an adjustable weight coefficient, and e_ri represents the position error of the end of the mechanical arm in the x, y and z directions respectively;
for the speed term, a reward is adopted in which r_2 represents the speed reward function, k_2 represents an adjustable weight coefficient, and e_vi represents the velocity error of the end of the mechanical arm in the x, y and z directions respectively; it guides the mechanical arm to move as closely as possible at the desired trajectory speed;
Two penalties are introduced for motion smoothness and energy loss:
a control-torque penalty, in which r_3 represents the action-smoothness reward function, k_3 represents an adjustable weight coefficient, and torque_i represents the control torque of each axis of the mechanical arm; it prevents an excessively large control-torque input from causing a large motion amplitude, high energy loss and low training stability;
a torque-variation penalty, in which r_4 represents the energy loss reward function, k_4 represents an adjustable weight coefficient, and torque_i^t and torque_i^{t-1} represent the control torque of each axis of the mechanical arm at the current time step and the previous time step respectively; it limits the rate of change of the torque commanded by the agent and makes the motion of the mechanical arm smoother;
In summary, the total reward function is designed as
r = r_1 + r_2 + r_3 + r_4
where r represents the total reward function.
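A minimal Python sketch of this total reward follows, assuming the penalty forms discussed above; the weights k1 to k4, the bonus value and the distance threshold are illustrative placeholders that would be tuned in practice, not values taken from the patent.

import numpy as np

def total_reward(pos_error, vel_error, torque, prev_torque,
                 k1=1.0, k2=0.1, k3=0.001, k4=0.001,
                 dist_threshold=0.01, bonus=1000.0):
    """r = r1 + r2 + r3 + r4 as in the text; the exact functional forms are assumptions."""
    dist = np.linalg.norm(pos_error)                      # distance error in x, y, z
    r1 = bonus if dist < dist_threshold else -k1 * dist   # distance bonus / penalty
    r2 = -k2 * np.linalg.norm(vel_error)                  # velocity-error penalty
    r3 = -k3 * np.sum(np.square(torque))                  # control-torque penalty
    r4 = -k4 * np.sum(np.square(np.asarray(torque) - np.asarray(prev_torque)))  # torque-change penalty
    return r1 + r2 + r3 + r4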
S25, interaction optimization.
Specifically, the method comprises the following steps:
(1) The sensing data is input into an Actor network according to the current state of the six-axis mechanical arm, and control moment corresponding to the sensing data is obtained;
(2) Adding noise to the control moment and taking the control moment as the actual action of the six-axis mechanical arm;
(3) The six-axis mechanical arm executes corresponding actual actions, and the state of the six-axis mechanical arm is updated to obtain the observation state at the next moment;
(4) Inputting the current state of the six-axis mechanical arm and the actual action of the six-axis mechanical arm into a Critic network to obtain a Q estimated value;
(5) Inputting the observation state at the next moment into a Target Actor network to obtain the action output of the six-axis mechanical arm at the next moment;
(6) The observation state at the next moment and the action output of the six-axis mechanical arm at the next moment are input into a Target Critic network to obtain a Q estimated value and rewards at the next moment;
(7) Determining a first interaction optimization completion mark of the six-axis mechanical arm according to the Q estimation value and the Q estimation value at the next moment;
(8) Storing the current state of the six-axis mechanical arm, the actual action of the six-axis mechanical arm, rewards, the estimated value of the Q at the next moment and the first interaction optimization completion mark into an experience playback pool;
(9) The interactive optimization steps (1) to (8) of the six-axis mechanical arm are circulated until the experience playback pool is fully stored, and an optimized reinforcement learning DDPG algorithm is obtained;
Through the continuous interaction between the intelligent agent and the environment, the processes of selecting actions, interacting with the environment, storing experience and training the network are repeatedly executed, so that the intelligent agent gradually learns the optimal strategy.
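The interaction loop of steps (1) to (9) can be summarized in the following Python sketch; the environment interface env.reset()/env.step(), the episode and step counts, and the Gaussian noise standard deviation are illustrative assumptions rather than elements of the patent.

import numpy as np

def train(env, actor_fn, pool, update_fn, episodes=500, steps=200,
          batch_size=64, noise_std=0.1):
    """Select an action with exploration noise, step the arm, store experience, update the networks."""
    for _ in range(episodes):
        state = env.reset()
        for _ in range(steps):
            action = np.asarray(actor_fn(state))                    # (1) Actor output: control torque
            action = action + np.random.normal(0.0, noise_std, size=action.shape)  # (2) add exploration noise
            next_state, reward, done = env.step(action)             # (3) execute and observe next state
            pool.store(state, action, reward, next_state, done)     # (8) store experience
            if len(pool) >= batch_size:
                update_fn(pool.sample(batch_size))                  # (4)-(7) network updates from a mini-batch
            state = next_state
            if done:
                break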
And S3, deploying the optimized reinforcement learning DDPG algorithm to a six-axis mechanical arm for motion guidance.
The invention takes the reinforcement learning DDPG algorithm as the basic framework and improves the basic DDPG algorithm for the experimental task of tracking a dynamic target, so as to improve its performance. DDPG is a deep reinforcement learning algorithm for continuous action spaces. It combines the Actor-Critic framework with deep neural networks, uses a Q function as the Critic, and adopts two algorithmic techniques, experience replay and a target network: experience data are sampled from the replay buffer to update the deep neural networks, the value-function network is updated using the TD error, and actions are output by a deterministic policy, which allows high-dimensional continuous action spaces to be handled effectively. To address the poor exploration of the basic DDPG algorithm, adaptive noise is introduced. Adaptive noise lets the agent explore more early in training and gradually reduces exploration later in training, making use of the learned experience. The reward function plays a vital role in a reinforcement learning algorithm; a detailed reward-function model is designed for the experimental task of tracking a dynamic target, and through this reward function the agent is guided to learn and adapt so as to complete the task.
Further, the effectiveness of the deep reinforcement learning algorithm is verified in a six-degree-of-freedom space manipulator tracking simulation environment. Simulation experiments are performed in Matlab to verify the tracking performance of the reinforcement learning algorithm on a target moving along a straight line in space and on a target moving along a circular trajectory in space. For the straight-line case, the target's linear trajectory in space is shown in Fig. 4, the tracking trajectory of the mechanical arm end effector is shown in Fig. 5, and the coordinate error of the end effector, which responds quickly from its initial position, is shown in Fig. 6. For the circular case, the target's circular trajectory in space is shown in Fig. 7, the tracking trajectory of the end effector is shown in Fig. 8, and the coordinate error of the end effector is shown in Fig. 9.
Referring to fig. 2, a mechanical arm control system based on a deep reinforcement learning algorithm includes:
the acquisition module is used for acquiring the sensing data of the six-axis mechanical arm;
The interaction module is used for setting an experience playback pool, an agent noise exploration function and a reward function based on the reinforcement learning DDPG algorithm, carrying out interaction optimization training on the perception data, and obtaining an optimized reinforcement learning DDPG algorithm;
the deployment module is used for deploying the optimized reinforcement learning DDPG algorithm to the six-axis mechanical arm to conduct motion guidance.
The content in the method embodiment is applicable to the system embodiment, the functions specifically realized by the system embodiment are the same as those of the method embodiment, and the achieved beneficial effects are the same as those of the method embodiment.
While the preferred embodiment of the present application has been described in detail, the application is not limited to the embodiment, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the application, and these equivalent modifications and substitutions are intended to be included in the scope of the present application as defined in the appended claims.

Claims (2)

1. The mechanical arm control method based on the deep reinforcement learning algorithm is characterized by comprising the following steps of:
the method comprises the steps of obtaining sensing data of a six-axis mechanical arm, wherein torque input ports for controlling each joint of the six-axis mechanical arm and joint angle and angular velocity sensors are arranged, and the state of the agent is defined as the six joint angles and joint angular velocities of the mechanical arm, the distance errors and velocity errors between the end of the arm and the target in the x, y and z directions respectively, and the current control torque of each joint;
Based on the reinforcement learning DDPG algorithm, setting an experience playback pool, an agent noise exploration function and a reward function, and performing interactive optimization training on the perception data to obtain an optimized reinforcement learning DDPG algorithm;
The reinforcement learning DDPG algorithm comprises a main network and a target network, wherein the main network comprises an Actor network and a Critic network, and the target network comprises a Target Actor network and a Target Critic network; the Actor network is an action network, takes the state s of the agent as input, and outputs a deterministic action, the six-degree-of-freedom control torque of the mechanical arm; the Critic network is an evaluation network and is used for calculating the Q value, and the value of the action given by the Actor network is evaluated through the Q value; the network parameters of the Actor network and the Critic network are initialized, and the network parameters of the Target Actor network and the Target Critic network are initialized;
The experience playback pool includes: the agent stores the obtained experience data (s_t, a_t, r_t, s_{t+1}, done) in the experience playback pool, and samples are taken in mini-batches when updating the main network parameters and the target network parameters, wherein s_t represents the state of the agent at time t, a_t represents the action taken by the agent at time t, r_t represents the reward obtained after the action is taken at time t, s_{t+1} represents the state reached at time t+1 after the agent takes the action, and done represents whether the episode task is completed;
The agent noise exploration comprises: adding noise to the action output by the Actor network; in the process of drawing samples from the memory bank when the Actor network is updated, calculating the mean square error d between the action a' of the sample in the current state and the action a output by the Actor network before updating, comparing it with a set error threshold d_th, and adjusting and updating the standard deviation s of the Gaussian noise according to the comparison result, wherein the samples represent data in the experience playback pool;
The expression for adjusting and updating the standard deviation s of the Gaussian noise is as follows:
in the above formula, s represents the standard deviation of the Gaussian noise, k represents an adaptive adjustment coefficient, d represents the mean-square-error value, and d_th represents the set error threshold;
the reward functions include a distance reward function, a speed reward function, an action-smoothness reward function and an energy loss reward function, wherein:
the distance reward function indicates that when the distance error between the end of the agent and the target is smaller than a threshold value, the agent obtains a reward positively correlated with how close it is to the target, and otherwise obtains a reward negatively correlated with the error; the expression is as follows:
in the above formula, r_1 represents the distance reward function, k_1 represents an adjustable weight coefficient, and e_ri represents the position error of the end of the mechanical arm in the x, y and z directions respectively;
the speed reward function is used for guiding the mechanical arm to move at the expected trajectory speed; the expression is as follows:
in the above formula, r_2 represents the speed reward function, k_2 represents an adjustable weight coefficient, and e_vi represents the velocity error of the end of the mechanical arm in the x, y and z directions respectively;
the action-smoothness reward function represents a control-torque penalty; the expression is as follows:
in the above formula, r_3 represents the action-smoothness reward function, k_3 represents an adjustable weight coefficient, and torque_i represents the control torque of each axis of the mechanical arm;
the energy loss reward function represents a torque-change penalty; the expression is as follows:
in the above formula, r_4 represents the energy loss reward function, k_4 represents an adjustable weight coefficient, and torque_i^t and torque_i^{t-1} represent the control torque of each axis of the mechanical arm at the current time step and the previous time step respectively;
the step of performing interactive optimization training on the perception data to obtain the optimized reinforcement learning DDPG algorithm further comprises:
The sensing data is input into an Actor network according to the current state of the six-axis mechanical arm, and control moment corresponding to the sensing data is obtained;
Adding noise to the control moment and taking the control moment as the actual action of the six-axis mechanical arm;
The six-axis mechanical arm executes corresponding actual actions, and the state of the six-axis mechanical arm is updated to obtain the observation state at the next moment;
inputting the current state of the six-axis mechanical arm, rewards obtained by executing corresponding actual actions and the actual actions of the six-axis mechanical arm into a Critic network to obtain a Q estimated value;
Inputting the observation state at the next moment into a Target Actor network to obtain the action output of the six-axis mechanical arm at the next moment;
The observation state at the next moment and the action output of the six-axis mechanical arm at the next moment are input into a Target Critic network to obtain a Q estimated value and rewards at the next moment;
Determining a first interaction optimization completion mark of the six-axis mechanical arm according to the Q estimation value and the Q estimation value at the next moment;
Storing the current state of the six-axis mechanical arm, the actual action of the six-axis mechanical arm, rewards, the estimated value of the Q at the next moment and the first interaction optimization completion mark into an experience playback pool;
The interactive optimization step of the six-axis mechanical arm is circulated until the experience playback pool is fully stored, and an optimized reinforcement learning DDPG algorithm is obtained;
and deploying the optimized reinforcement learning DDPG algorithm to the six-axis mechanical arm for motion guidance.
2. A mechanical arm control system based on a deep reinforcement learning algorithm, for implementing the method as claimed in claim 1, comprising the following modules:
the acquisition module is used for acquiring the sensing data of the six-axis mechanical arm;
The interaction module is used for setting an experience playback pool, an agent noise exploration function and a reward function based on the reinforcement learning DDPG algorithm, carrying out interaction optimization training on the perception data, and obtaining an optimized reinforcement learning DDPG algorithm;
the deployment module is used for deploying the optimized reinforcement learning DDPG algorithm to the six-axis mechanical arm to conduct motion guidance.
CN202311258556.6A 2023-09-27 2023-09-27 Mechanical arm control method and system based on deep reinforcement learning algorithm Active CN117140527B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311258556.6A CN117140527B (en) 2023-09-27 2023-09-27 Mechanical arm control method and system based on deep reinforcement learning algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311258556.6A CN117140527B (en) 2023-09-27 2023-09-27 Mechanical arm control method and system based on deep reinforcement learning algorithm

Publications (2)

Publication Number Publication Date
CN117140527A CN117140527A (en) 2023-12-01
CN117140527B true CN117140527B (en) 2024-04-26

Family

ID=88902671

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311258556.6A Active CN117140527B (en) 2023-09-27 2023-09-27 Mechanical arm control method and system based on deep reinforcement learning algorithm

Country Status (1)

Country Link
CN (1) CN117140527B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118034061B (en) * 2024-03-29 2024-09-20 合肥工业大学 SCARA robot approximate constraint robust control method based on deep reinforcement learning

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110340888A (en) * 2018-10-30 2019-10-18 大连理工大学 A kind of robot for space arrests control system, intensified learning method and dynamic modeling method
JP2021034050A (en) * 2019-08-21 2021-03-01 哈爾浜工程大学 Auv action plan and operation control method based on reinforcement learning
WO2021164276A1 (en) * 2020-07-31 2021-08-26 平安科技(深圳)有限公司 Target tracking method and apparatus, computer device, and storage medium
CN113524186A (en) * 2021-07-19 2021-10-22 山东大学 Deep reinforcement learning double-arm robot control method and system based on demonstration example
CN114083539A (en) * 2021-11-30 2022-02-25 哈尔滨工业大学 Mechanical arm anti-interference motion planning method based on multi-agent reinforcement learning
CN115097736A (en) * 2022-08-10 2022-09-23 东南大学 Active disturbance rejection controller parameter optimization method based on deep reinforcement learning
CN115179280A (en) * 2022-06-21 2022-10-14 南京大学 Reward shaping method based on magnetic field in mechanical arm control for reinforcement learning
CN116476042A (en) * 2022-12-31 2023-07-25 中国科学院长春光学精密机械与物理研究所 Mechanical arm kinematics inverse solution optimization method and device based on deep reinforcement learning
CN116533249A (en) * 2023-06-05 2023-08-04 贵州大学 Mechanical arm control method based on deep reinforcement learning
CN116587275A (en) * 2023-05-29 2023-08-15 华侨大学 Mechanical arm intelligent impedance control method and system based on deep reinforcement learning
WO2023168821A1 (en) * 2022-03-07 2023-09-14 大连理工大学 Reinforcement learning-based optimization control method for aeroengine transition state

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101988504B1 (en) * 2019-02-28 2019-10-01 아이덴티파이 주식회사 Method for reinforcement learning using virtual environment generated by deep learning
KR102457914B1 (en) * 2021-04-21 2022-10-24 숭실대학교산학협력단 Method for combating stop-and-go wave problem using deep reinforcement learning based autonomous vehicles, recording medium and device for performing the method

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110340888A (en) * 2018-10-30 2019-10-18 大连理工大学 A kind of robot for space arrests control system, intensified learning method and dynamic modeling method
JP2021034050A (en) * 2019-08-21 2021-03-01 哈爾浜工程大学 Auv action plan and operation control method based on reinforcement learning
WO2021164276A1 (en) * 2020-07-31 2021-08-26 平安科技(深圳)有限公司 Target tracking method and apparatus, computer device, and storage medium
CN113524186A (en) * 2021-07-19 2021-10-22 山东大学 Deep reinforcement learning double-arm robot control method and system based on demonstration example
CN114083539A (en) * 2021-11-30 2022-02-25 哈尔滨工业大学 Mechanical arm anti-interference motion planning method based on multi-agent reinforcement learning
WO2023168821A1 (en) * 2022-03-07 2023-09-14 大连理工大学 Reinforcement learning-based optimization control method for aeroengine transition state
CN115179280A (en) * 2022-06-21 2022-10-14 南京大学 Reward shaping method based on magnetic field in mechanical arm control for reinforcement learning
CN115097736A (en) * 2022-08-10 2022-09-23 东南大学 Active disturbance rejection controller parameter optimization method based on deep reinforcement learning
CN116476042A (en) * 2022-12-31 2023-07-25 中国科学院长春光学精密机械与物理研究所 Mechanical arm kinematics inverse solution optimization method and device based on deep reinforcement learning
CN116587275A (en) * 2023-05-29 2023-08-15 华侨大学 Mechanical arm intelligent impedance control method and system based on deep reinforcement learning
CN116533249A (en) * 2023-06-05 2023-08-04 贵州大学 Mechanical arm control method based on deep reinforcement learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Control of Ski Robot Based on Deep Reinforcement Learning; Zegui Wu; 2011 International Conference on Security, Pattern Analysis, and Cybernetics; 2021-06-20; full text *
Path planning and control of space robot capture operations based on reinforcement learning (基于强化学习的空间机器人抓捕操作的路径规划与控制); 曹钰雪; Proceedings of the 40th Chinese Control Conference (15); 2021-07-26; full text *
Research on an intelligent capture control strategy for a space manipulator based on reinforcement learning (基于强化学习空间机械臂智能捕获控制策略研究); 杜德嵩; China Master's Theses Full-text Database, Information Science and Technology; 2020-02-15; full text *

Also Published As

Publication number Publication date
CN117140527A (en) 2023-12-01

Similar Documents

Publication Publication Date Title
US11529733B2 (en) Method and system for robot action imitation learning in three-dimensional space
Peters et al. Reinforcement learning by reward-weighted regression for operational space control
Mitrovic et al. Adaptive optimal feedback control with learned internal dynamics models
JP7291185B2 (en) Technologies for force and torque guided robot assembly
CN117140527B (en) Mechanical arm control method and system based on deep reinforcement learning algorithm
CN116533249A (en) Mechanical arm control method based on deep reinforcement learning
CN115446867B (en) Industrial mechanical arm control method and system based on digital twin technology
CN111702766A (en) Mechanical arm self-adaptive door opening screwing method based on force sense guidance
CN112847235A (en) Robot step force guiding assembly method and system based on deep reinforcement learning
CN115256401A (en) Space manipulator shaft hole assembly variable impedance control method based on reinforcement learning
CN118201742A (en) Multi-robot coordination using a graph neural network
CN115416024A (en) Moment-controlled mechanical arm autonomous trajectory planning method and system
CN113967909B (en) Direction rewarding-based intelligent control method for mechanical arm
Yan et al. Hierarchical policy learning with demonstration learning for robotic multiple peg-in-hole assembly tasks
CN115918377B (en) Control method and control device of automatic tree fruit picking machine and automatic tree fruit picking machine
CN117086882A (en) Strengthening learning method based on mechanical arm attitude movement degree of freedom
Carneiro et al. The role of early anticipations for human-robot ball catching
Malone et al. Efficient motion-based task learning for a serial link manipulator
US11921492B2 (en) Transfer between tasks in different domains
Yadavari et al. Addressing challenges in dynamic modeling of stewart platform using reinforcement learning-based control approach
Malone et al. Efficient motion-based task learning
Zhou et al. Intelligent Control of Manipulator Based on Deep Reinforcement Learning
Mitrovic Stochastic optimal control with learned dynamics models
CN114378811B (en) Force and torque directed robotic assembly technique
Dag et al. Monolithic vs. hybrid controller for multi-objective Sim-to-Real learning

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant