CN111198568A - Underwater robot obstacle avoidance control method based on Q learning - Google Patents
- Publication number
- CN111198568A (application CN201911338069.4A)
- Authority
- CN
- China
- Prior art keywords
- underwater robot
- penalty
- robot
- action
- value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/04—Control of altitude or depth
- G05D1/06—Rate of change of altitude or depth
- G05D1/0692—Rate of change of altitude or depth specially adapted for under-water vehicles
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01S—RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
- G01S15/00—Systems using the reflection or reradiation of acoustic waves, e.g. sonar systems
- G01S15/88—Sonar systems specially adapted for specific applications
- G01S15/93—Sonar systems specially adapted for specific applications for anti-collision purposes
Landscapes
- Engineering & Computer Science (AREA)
- Radar, Positioning & Navigation (AREA)
- Remote Sensing (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Aviation & Aerospace Engineering (AREA)
- Automation & Control Theory (AREA)
- Acoustics & Sound (AREA)
- Computer Networks & Wireless Communication (AREA)
- Manipulator (AREA)
- Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)
Abstract
The invention discloses a Q-learning-based obstacle avoidance control method for an underwater robot, belonging to the field of underwater robot control. The method mainly comprises the following steps: establishing the current environment from sonar devices arranged around the underwater robot; setting a safety alert distance and a target threshold range for the underwater robot, and determining the position of the underwater robot in real time using a positioning technology; creating an action space and a neural network, and initializing the action reward-penalty, the state space and the iteration value; setting a reward-penalty mechanism, selecting each action according to the reward-penalty function, and iterating the Q function until the convergence requirement is met so as to approach the target. A neural network approximation is adopted to improve efficiency, and the gradient descent method is used for iteration. The invention improves the response capability and learning capability of the underwater robot, achieves a high data utilization rate, and reduces errors.
Description
Technical Field
The invention belongs to the technical field of underwater robot control, in particular to optimal control for avoiding underwater obstacles in a timely manner, and specifically relates to a Q-learning-based obstacle avoidance control method for an underwater robot.
Background
The ocean covers about 71% of the earth's surface and is becoming a new exploration space for human beings. An underwater robot senses obstacles through specific sensors in order to avoid them. However, the marine environment is highly complex, with reefs, coral, ocean trenches and even sudden events such as rapidly gathering fish shoals, so it is very important that the underwater robot can avoid obstacles smoothly during exploration.
The patent application with publication number CN107121985A discloses a radar obstacle avoidance system for an underwater intelligent robot; the scheme takes a radar transceiver as the main carrier and combines it with the timer of a single-chip microcomputer to avoid obstacles. Although this method can accomplish obstacle avoidance for an underwater robot, radar relies on electromagnetic waves, which attenuate rapidly underwater; the received signals are therefore weak, obstacle avoidance is not timely, and the robot may collide.
Furthermore, the patent application with publication number CN108829134A discloses a real-time autonomous obstacle avoidance method for a deep-sea robot. It models irregular obstacles as geometric spheres, projects them onto horizontal and vertical planes, and uses a tangent method to analyze the heading-infeasible region affected by the obstacles, obtaining the infeasible heading set for an unmanned underwater vehicle; it analyzes the motion characteristics of the vehicle to obtain its heading window and linear velocity window; it searches for the optimal navigation angle by constructing an optimal navigation angle optimization function and builds a leading line-speed model according to the obstacle distribution and the yaw angle; finally, it outputs the navigation angle and linear speed to the vehicle's motion control module to guide real-time obstacle avoidance in a three-dimensional environment. However, this method involves complicated and time-consuming analysis and calculation, and it cannot cope with seabed emergencies such as fish shoals moving about. It is therefore necessary to design an obstacle avoidance control method for an underwater robot that is both timely and highly adaptable, so that it can avoid seabed emergencies promptly and adapt to a variety of complex seabed conditions.
Disclosure of Invention
The invention aims to provide an underwater robot obstacle avoidance control method that avoids obstacles in a timely manner, is highly adaptable, and is widely applicable.
In order to achieve the purpose, the invention adopts the technical scheme that:
an underwater robot obstacle avoidance control method based on Q learning comprises the following steps:
step 1, establishing the current environment of the robot through the signals of sonar receiving devices arranged on the underwater robot; the underwater robot adopts the dynamic model
M·(dv/dt) + C·v + D·v + G = τ (1)
wherein M represents the inertia matrix, C the Coriolis force matrix, D the damping matrix, G the gravity matrix, τ is the control input, and v is the control output;
the underwater robot has 6 degrees of freedom; suppose that in the n-th degree of freedom the distance between the robot and the obstacle is x_n; the underwater robot sets a safety alert distance d, and if x_n < d in the n-th degree of freedom, the underwater robot may collide and takes a corresponding evasive action in that degree of freedom;
step 2, determining the position of the underwater robot at each moment i by a positioning technology; comparing the distance D_i between the underwater robot and the target point at the current moment with the distance D_{i-1} at the previous moment: if D_i > D_{i-1}, the robot is moving away from the target point, and if D_i < D_{i-1}, the robot is approaching the target point; calculating the distance D between the underwater robot and the target point at the current moment and, considering underwater fluctuations, setting a target point threshold d0; if D < d0, the underwater robot has reached the target point; establishing an action space A according to the degrees of freedom of the underwater robot;
step 3, selecting, through Q learning of the underwater robot, the action with the minimum penalty, and setting a per-step reward-penalty mechanism with an initial penalty K; for step 1, the distance reward-penalty function R_1 between the underwater robot and the target point is given by
R_1 = K if D_i > D_{i-1}, and R_1 = -K if D_i < D_{i-1} (2)
that is, if D_i > D_{i-1} a penalty K is given, and if D_i < D_{i-1} a negative penalty -K is given; for step 2, the reward-penalty function R_2 for the underwater robot approaching an obstacle within the safety alert distance is given by formula (3), which indicates that when an obstacle enters the safety alert distance, the reward-penalty value increases as the distance between the underwater robot and the obstacle decreases, and when the obstacle is outside the safety alert distance, the reward-penalty value is K; the total reward-penalty of each step of the underwater robot is R = R_1 + R_2; meanwhile, the underwater robot avoids the obstacle according to the reward-penalty function: when the penalty of the current step is larger than that of the previous step, the underwater robot is approaching the obstacle and moves away from it; when the penalty of the current step is smaller than that of the previous step, the underwater robot is moving away from the obstacle and moves towards the target point;
step 4, assigning weights to the multidimensional input using a neural network, and copying the actual network weights into the target network weights after each training; the weight update follows
net_l = Σ_{m=1..M} ω_m·x_m, y_l = f(net_l) (4)
wherein x_m is the input signal, ω_m the weight, M the total number of neurons, net_l the input-output relation, f the activation function, and y_l the neuron output;
step 5, training the underwater robot to search for the optimal obstacle-avoidance path: initializing the action reward-penalty R; initializing the state matrix S; initializing the total number of training rounds M of the robot; setting an iteration value j to record the number of training rounds; setting a discount factor γ; according to the Q function
Q(s,a) = R(s,a) + γ·max_{a'} Q(s',a') (5)
the Q value equals the reward-penalty R(s,a) for taking action a in state s plus the highest discounted Q value of the next state s'; to seek the maximum Q value, gradient descent is performed so as to minimize the penalty of each step; the updated state of each step is input into the Q-learning network, which then returns the Q values of all possible actions in that state; an action is then selected: when the Q values of the candidate actions are all equal, a random action a is chosen, and when they differ, the action with the highest Q value is chosen; after action a is selected, the underwater robot executes it in state s, moves to the new state s', and receives the reward R; these steps are repeated for M rounds until the Q value meets the convergence requirement.
The technical scheme of the invention is further improved as follows: in step 2, the target point threshold range is a circular area with d0 as a radius and the target point as a center.
The technical scheme of the invention is further improved as follows: in step 1, the safety alert range is a circular moving area with d as a radius and the center of mass of the underwater robot as the center of a circle.
The technical scheme of the invention is further improved as follows: the convergence requirement of the Q value in the step 5 is that the difference between the Q value in the step and the Q value in the previous step is not more than 0.01, namely the Q value reaches the convergence.
Due to the adoption of the technical scheme, the invention has the following technical effects:
1. For seabed emergencies, the method equips the bow, stern, port and starboard of the robot with ranging sonar and forward-looking sonar, so the surrounding obstacle situation can be measured in time and avoided effectively.
2. For complex seabed terrain, the input is weighted by a neural network combined with Q learning, and the weights are updated from experience at every step, giving a higher data utilization efficiency. Learning directly from consecutive samples is inefficient because consecutive samples are highly correlated; combining the neural network with randomly drawn samples breaks this correlation and reduces the variance of the weight updates (a minimal sketch of such random sampling is given after this list).
3. The method also uses a separate target network to handle the TD error in the temporal-difference algorithm. The method therefore has a strong learning ability and adapts quickly to the environment, and when applied to obstacle avoidance control of an underwater robot it can better handle complex tasks.
4. In the method, the underwater robot uses Q learning to select the action with the minimum penalty and sets a per-step reward-penalty mechanism; by setting a reasonable per-step total reward-penalty function, the underwater robot avoids obstacles more accurately and reasonably.
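As referenced in point 2 above, a minimal sketch of storing experience and drawing random samples from it is given below; the buffer size, batch size and transition format are illustrative assumptions, not values taken from the patent:

```python
import random
from collections import deque

# Sketch of an experience store with random sampling, as mentioned in point 2.
# Buffer size and batch size are assumed; transitions are (s, a, r, s_next) tuples.
replay_buffer = deque(maxlen=10000)

def remember(s, a, r, s_next):
    """Store one transition in the experience buffer."""
    replay_buffer.append((s, a, r, s_next))

def sample_batch(batch_size=32):
    """Randomly drawn samples break the correlation between consecutive transitions."""
    return random.sample(list(replay_buffer), min(batch_size, len(replay_buffer)))
```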
Drawings
FIG. 1 is a flow chart of an underwater robot learning process;
FIG. 2 is a schematic diagram of an obstacle avoidance of the underwater robot on a simulated seabed;
in fig. 2: u is an underwater robot; g is a target point; x is an obstacle; 1, 2, 3, 4, and robot learning training.
Detailed Description
The invention is described in further detail below with reference to the following figures and specific embodiments:
As shown in FIG. 1 and FIG. 2, the invention discloses an underwater robot obstacle avoidance control method based on Q learning. The method is applied to an autonomous, untethered underwater robot U, which senses the surrounding environment through sonar receiving devices arranged around it and performs autonomous underwater obstacle avoidance with its own control system. The method is both timely and highly adaptable.
The obstacle avoidance method comprises the following steps:
Step 1, the current environment of the robot is established from the signals of the sonar receiving devices arranged on the underwater robot; the underwater robot adopts the dynamic model
M·(dv/dt) + C·v + D·v + G = τ (1)
wherein M represents the inertia matrix, C the Coriolis force matrix, D the damping matrix, G the gravity matrix, τ is the control input, and v is the control output.
The underwater robot has 6 degrees of freedom. Suppose that in the n-th degree of freedom the distance between the robot and the obstacle is x_n. The safety alert distance set by the underwater robot is d, where the safety alert range is a circular region that moves with the robot, with radius d and centered at the underwater robot's center of mass. If x_n < d in the n-th degree of freedom, the underwater robot may collide, and a corresponding evasive action is taken in that degree of freedom.
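As a concrete illustration of the safety-distance check in step 1, the following is a minimal Python sketch; the function names, the sample sonar readings and the numeric value of the safety alert distance d are illustrative assumptions, not taken from the patent:

```python
# Minimal sketch of the per-degree-of-freedom safety check of step 1.
# `sonar_distances` and SAFETY_DISTANCE_D are illustrative placeholders.
SAFETY_DISTANCE_D = 2.0  # safety alert distance d, in metres (assumed value)

def degrees_needing_evasion(sonar_distances):
    """Return the indices n of the degrees of freedom where x_n < d."""
    return [n for n, x_n in enumerate(sonar_distances) if x_n < SAFETY_DISTANCE_D]

# Example: obstacle distances measured in the 6 degrees of freedom
readings = [5.1, 1.4, 3.0, 0.9, 6.2, 2.5]
print(degrees_needing_evasion(readings))  # -> [1, 3]: evasive action needed there
```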
Step 2, the position of the underwater robot at each moment i is determined by a positioning technology, and the distance D_i between the underwater robot and the target point at the current moment is compared with the distance D_{i-1} at the previous moment. If D_i > D_{i-1}, the robot is moving away from the target point; if D_i < D_{i-1}, the robot is approaching the target point. The distance D between the underwater robot and the target point at the current moment is calculated and, considering underwater fluctuations, a target point threshold d0 is set, where the target threshold range is a circular area with radius d0 centered on the target point; if D < d0, the underwater robot has reached the target point. An action space A is established according to the degrees of freedom of the underwater robot.
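A minimal sketch of the step-2 target check follows; the positions, the Euclidean distance computation and the numeric value of the threshold d0 are assumptions used only for illustration:

```python
import math

# Sketch of the step-2 checks: approaching/moving away and target reached.
TARGET_THRESHOLD_D0 = 0.5  # target point threshold d0 (assumed value)

def distance(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def step2_status(pos_now, pos_prev, target):
    d_i = distance(pos_now, target)      # D_i, distance at the current moment
    d_prev = distance(pos_prev, target)  # D_{i-1}, distance at the previous moment
    return {
        "moving_away": d_i > d_prev,
        "approaching": d_i < d_prev,
        "reached": d_i < TARGET_THRESHOLD_D0,
    }

print(step2_status((1.0, 2.0, -3.0), (1.5, 2.0, -3.2), (0.0, 0.0, -3.0)))
```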
Step 3, the underwater robot selects, through Q learning, the action with the minimum penalty; a per-step reward-penalty mechanism is set with an initial penalty K. For step 1, the distance reward-penalty function R_1 between the underwater robot and the target point is given by
R_1 = K if D_i > D_{i-1}, and R_1 = -K if D_i < D_{i-1} (2)
that is, if D_i > D_{i-1} a penalty K is given, and if D_i < D_{i-1} a negative penalty -K is given. For step 2, the reward-penalty function R_2 for the underwater robot approaching an obstacle within the safety alert distance is given by formula (3), which indicates that when an obstacle enters the safety alert distance, the reward-penalty value increases as the distance between the underwater robot and the obstacle decreases; when the obstacle is outside the safety alert distance, the reward-penalty value is K. The total reward-penalty of each step of the underwater robot is R = R_1 + R_2. Action selection is then performed in combination with the reward-penalty mechanism of the underwater robot: by setting the reward-penalty mechanism and the resulting cost function, the target is approached by iteratively seeking the minimum penalty.
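The per-step reward-penalty of step 3 could be computed as in the sketch below. R_1 follows the piecewise rule of formula (2); the exact expression of formula (3) is not reproduced in the text, so the form used for R_2 here is only an assumption that matches the described behavior (equal to K outside the safety alert distance and growing as the obstacle gets closer):

```python
# Sketch of the per-step reward-penalty of step 3.
K = 1.0                  # initial penalty K (assumed value)
SAFETY_DISTANCE_D = 2.0  # safety alert distance d (assumed value)

def r1(d_i, d_prev):
    """Distance reward-penalty with respect to the target point, formula (2)."""
    return K if d_i > d_prev else -K

def r2(x_n):
    """Obstacle reward-penalty; assumed functional form standing in for formula (3)."""
    if x_n >= SAFETY_DISTANCE_D:
        return K
    return K * SAFETY_DISTANCE_D / max(x_n, 1e-6)  # grows as the obstacle gets closer

def total_reward_penalty(d_i, d_prev, x_n):
    """Total per-step reward-penalty R = R_1 + R_2."""
    return r1(d_i, d_prev) + r2(x_n)

print(total_reward_penalty(d_i=3.0, d_prev=3.5, x_n=0.8))  # -> 1.5
```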
Step 4, weights are assigned to the multidimensional input using a neural network, and after each training the actual network weights are copied into the target network weights. The weight update follows
net_l = Σ_{m=1..M} ω_m·x_m, y_l = f(net_l) (4)
where x_m is the input signal, ω_m the weight, M the total number of neurons, net_l the input-output relation, f the activation function, and y_l the neuron output.
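The neuron relation of formula (4) and the copy of the actual network weights into the target network might look as follows; the layer sizes and the choice of tanh as the activation function f are assumptions:

```python
import numpy as np

# Sketch of the step-4 neuron relation and the actual-to-target weight copy.
rng = np.random.default_rng(0)
actual_weights = rng.normal(size=(6, 8))  # omega_m for a 6-input, 8-neuron layer (assumed sizes)
target_weights = actual_weights.copy()    # target network starts as a copy

def forward(x, weights, f=np.tanh):
    """net_l = sum_m omega_m * x_m, then y_l = f(net_l)."""
    net = x @ weights
    return f(net)

def sync_target(actual, target):
    """After each training round, copy the actual network weights into the target network."""
    target[...] = actual

x = rng.normal(size=6)  # one multidimensional input vector
print(forward(x, actual_weights))
sync_target(actual_weights, target_weights)
```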
Step 5, the underwater robot is trained to search for the optimal obstacle-avoidance path: the action reward-penalty R is initialized; the state matrix S is initialized; the total number of training rounds M of the robot is initialized; an iteration value j is set; a discount factor γ is set. According to the Q function
Q(s,a) = R(s,a) + γ·max_{a'} Q(s',a') (5)
the Q value equals the reward-penalty R(s,a) for taking action a in state s plus the highest discounted Q value of the next state s'. To seek the maximum Q value, gradient descent is performed so as to minimize the penalty of each step. The updated state of each step is input into the Q-learning network, which then returns the Q values of all possible actions in that state. An action is then selected: when the Q values of the candidate actions are all equal, a random action a is chosen, and when they differ, the action with the highest Q value is chosen. After action a is selected, the underwater robot executes it in state s, moves to the new state s', and receives the reward R. These steps are repeated for M rounds until the Q value meets the convergence requirement. In a specific implementation, the convergence requirement is that the Q value of the current step differs from that of the previous step by no more than 0.01; the Q value is then considered converged. The neural network approximation improves efficiency, and the gradient descent method is used to iterate towards the optimal control strategy.
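A minimal tabular sketch of the step-5 training loop is given below. The environment interface, the learning rate and the tabular Q representation are placeholder assumptions; the patent itself approximates Q with a neural network updated by gradient descent, so the table here only illustrates the action-selection rule, the update of formula (5) and the 0.01 convergence check:

```python
import random

# Tabular sketch of the step-5 loop; `env`, its reset()/step() interface, the state and
# action encodings, and the learning rate ALPHA are illustrative assumptions.
GAMMA, ALPHA, MAX_ROUNDS, EPS = 0.9, 0.1, 500, 0.01

def train(env, states, actions):
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(MAX_ROUNDS):                 # at most M training rounds
        s, done, max_delta = env.reset(), False, 0.0
        while not done:
            qs = [Q[(s, a)] for a in actions]
            if len(set(qs)) == 1:               # all Q values equal: pick a random action
                a = random.choice(actions)
            else:                               # otherwise pick the action with the highest Q value
                a = actions[qs.index(max(qs))]
            s_next, r, done = env.step(a)       # execute a, receive reward-penalty R, new state s'
            target = r + GAMMA * max(Q[(s_next, b)] for b in actions)  # formula (5)
            delta = target - Q[(s, a)]
            Q[(s, a)] += ALPHA * delta          # gradient-descent-like update toward the target
            max_delta = max(max_delta, abs(ALPHA * delta))
            s = s_next
        if max_delta <= EPS:                    # convergence: Q changed by no more than 0.01
            break
    return Q
```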
The above-described embodiments are merely illustrative of the preferred embodiments of the present invention, and not restrictive, and various changes and modifications may be made to the technical solution of the present invention by those skilled in the art without departing from the spirit of the present invention, which is defined by the claims.
Claims (4)
1. An underwater robot obstacle avoidance control method based on Q learning is characterized in that: the method comprises the following steps:
step 1, establishing the current environment of the robot through the signals of sonar receiving devices arranged on the underwater robot; the underwater robot adopts the dynamic model
M·(dv/dt) + C·v + D·v + G = τ (1)
wherein M represents the inertia matrix, C the Coriolis force matrix, D the damping matrix, G the gravity matrix, τ is the control input, and v is the control output;
the underwater robot has 6 degrees of freedom; suppose that in the n-th degree of freedom the distance between the robot and the obstacle is x_n; the underwater robot sets a safety alert distance d, and if x_n < d in the n-th degree of freedom, the underwater robot may collide and takes a corresponding evasive action in that degree of freedom;
step 2, determining the position of the underwater robot at each moment i by a positioning technology; comparing the distance D_i between the underwater robot and the target point at the current moment with the distance D_{i-1} at the previous moment: if D_i > D_{i-1}, the robot is moving away from the target point, and if D_i < D_{i-1}, the robot is approaching the target point; calculating the distance D between the underwater robot and the target point at the current moment and, considering underwater fluctuations, setting a target point threshold d0; if D < d0, the underwater robot has reached the target point; establishing an action space A according to the degrees of freedom of the underwater robot;
step 3, selecting, through Q learning of the underwater robot, the action with the minimum penalty, and setting a per-step reward-penalty mechanism with an initial penalty K; for step 1, the distance reward-penalty function R_1 between the underwater robot and the target point is given by
R_1 = K if D_i > D_{i-1}, and R_1 = -K if D_i < D_{i-1} (2)
that is, if D_i > D_{i-1} a penalty K is given, and if D_i < D_{i-1} a negative penalty -K is given; for step 2, the reward-penalty function R_2 for the underwater robot approaching an obstacle within the safety alert distance is given by formula (3), which indicates that when an obstacle enters the safety alert distance, the reward-penalty value increases as the distance between the underwater robot and the obstacle decreases, and when the obstacle is outside the safety alert distance, the reward-penalty value is K; the total reward-penalty of each step of the underwater robot is R = R_1 + R_2; meanwhile, the underwater robot avoids the obstacle according to the reward-penalty function: when the penalty of the current step is larger than that of the previous step, the underwater robot is approaching the obstacle and moves away from it; when the penalty of the current step is smaller than that of the previous step, the underwater robot is moving away from the obstacle and moves towards the target point;
step 4, assigning weights to the multidimensional input using a neural network, and copying the actual network weights into the target network weights after each training; the weight update follows
net_l = Σ_{m=1..M} ω_m·x_m, y_l = f(net_l) (4)
wherein x_m is the input signal, ω_m the weight, M the total number of neurons, net_l the input-output relation, f the activation function, and y_l the neuron output;
step 5, training the underwater robot to search for the optimal obstacle-avoidance path: initializing the action reward-penalty R; initializing the state matrix S; initializing the total number of training rounds M of the robot; setting an iteration value j to record the number of training rounds; setting a discount factor γ; according to the Q function
Q(s,a) = R(s,a) + γ·max_{a'} Q(s',a') (5)
the Q value equals the reward-penalty R(s,a) for taking action a in state s plus the highest discounted Q value of the next state s'; to seek the maximum Q value, gradient descent is performed so as to minimize the penalty of each step; the updated state of each step is input into the Q-learning network, which then returns the Q values of all possible actions in that state; an action is then selected: when the Q values of the candidate actions are all equal, a random action a is chosen, and when they differ, the action with the highest Q value is chosen; after action a is selected, the underwater robot executes it in state s, moves to the new state s', and receives the reward R; these steps are repeated for M rounds until the Q value meets the convergence requirement.
2. The Q learning-based underwater robot obstacle avoidance control method according to claim 1, characterized in that: in step 2, the target point threshold range is a circular area with d0 as a radius and the target point as a center.
3. The Q learning-based underwater robot obstacle avoidance control method according to claim 1, characterized in that: in step 1, the safety alert range is a circular region that moves with the robot, with radius d and centered at the underwater robot's center of mass.
4. The Q learning-based underwater robot obstacle avoidance control method according to claim 1, characterized in that: the convergence requirement for the Q value in step 5 is that the Q value of the current step differs from that of the previous step by no more than 0.01, at which point the Q value is considered converged.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911338069.4A CN111198568A (en) | 2019-12-23 | 2019-12-23 | Underwater robot obstacle avoidance control method based on Q learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911338069.4A CN111198568A (en) | 2019-12-23 | 2019-12-23 | Underwater robot obstacle avoidance control method based on Q learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111198568A true CN111198568A (en) | 2020-05-26 |
Family
ID=70744597
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911338069.4A Pending CN111198568A (en) | 2019-12-23 | 2019-12-23 | Underwater robot obstacle avoidance control method based on Q learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111198568A (en) |
- 2019-12-23 CN CN201911338069.4A patent/CN111198568A/en active Pending
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102795323A (en) * | 2011-05-25 | 2012-11-28 | 中国科学院沈阳自动化研究所 | Unscented Kalman filter (UKF)-based underwater robot state and parameter joint estimation method |
CN109540151A (en) * | 2018-03-25 | 2019-03-29 | 哈尔滨工程大学 | A kind of AUV three-dimensional path planning method based on intensified learning |
CN109240091A (en) * | 2018-11-13 | 2019-01-18 | 燕山大学 | A kind of underwater robot control method based on intensified learning and its control method tracked |
CN109933086A (en) * | 2019-03-14 | 2019-06-25 | 天津大学 | Unmanned plane environment sensing and automatic obstacle avoiding method based on depth Q study |
CN110345948A (en) * | 2019-08-16 | 2019-10-18 | 重庆邮智机器人研究院有限公司 | Dynamic obstacle avoidance method based on neural network in conjunction with Q learning algorithm |
CN110333739A (en) * | 2019-08-21 | 2019-10-15 | 哈尔滨工程大学 | A kind of AUV conduct programming and method of controlling operation based on intensified learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109540151B (en) | AUV three-dimensional path planning method based on reinforcement learning | |
Sun et al. | Mapless motion planning system for an autonomous underwater vehicle using policy gradient-based deep reinforcement learning | |
JP6854549B2 (en) | AUV action planning and motion control methods based on reinforcement learning | |
CN109765929B (en) | UUV real-time obstacle avoidance planning method based on improved RNN | |
CN108803313B (en) | Path planning method based on ocean current prediction model | |
Cao et al. | Target search control of AUV in underwater environment with deep reinforcement learning | |
CN111273670B (en) | Unmanned ship collision prevention method for fast moving obstacle | |
CN109241552A (en) | A kind of underwater robot motion planning method based on multiple constraint target | |
Wang et al. | Cooperative collision avoidance for unmanned surface vehicles based on improved genetic algorithm | |
US20230286158A1 (en) | Autonomous sense and guide machine learning system | |
Yan et al. | Real-world learning control for autonomous exploration of a biomimetic robotic shark | |
Hadi et al. | Adaptive formation motion planning and control of autonomous underwater vehicles using deep reinforcement learning | |
Zhang et al. | Intelligent vector field histogram based collision avoidance method for auv | |
Wang et al. | Obstacle avoidance for environmentally-driven USVs based on deep reinforcement learning in large-scale uncertain environments | |
Wu et al. | Multi-vessels collision avoidance strategy for autonomous surface vehicles based on genetic algorithm in congested port environment | |
CN117311160A (en) | Automatic control system and control method based on artificial intelligence | |
CN109916400B (en) | Unmanned ship obstacle avoidance method based on combination of gradient descent algorithm and VO method | |
Tang et al. | Path planning of autonomous underwater vehicle in unknown environment based on improved deep reinforcement learning | |
CN111198568A (en) | Underwater robot obstacle avoidance control method based on Q learning | |
Jose et al. | Navigating the Ocean with DRL: Path following for marine vessels | |
CN114609925B (en) | Training method of underwater exploration strategy model and underwater exploration method of bionic machine fish | |
CN117406716A (en) | Unmanned ship collision avoidance control method, unmanned ship collision avoidance control device, terminal equipment and medium | |
Li et al. | LSDA-APF: A Local Obstacle Avoidance Algorithm for Unmanned Surface Vehicles Based on 5G Communication Environment. | |
US20220371709A1 (en) | Path planning system and method for sea-aerial cooperative underwater target tracking | |
Ferrandino et al. | A Comparison between Crisp and Fuzzy Logic in an Autonomous Driving System for Boats |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20200526 |
RJ01 | Rejection of invention patent application after publication |