
CN117826603A - Automatic driving control method based on adversarial reinforcement learning - Google Patents

Automatic driving control method based on adversarial reinforcement learning

Info

Publication number
CN117826603A
Authority
CN
China
Prior art keywords
perturber
angle
policy
state
function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410010711.0A
Other languages
Chinese (zh)
Inventor
郝俊锋
任文龙
贺小平
宋周林
黄先昊
夏伟峰
赖薇
赵霄
李凯
冉光炯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Wisdom High Speed Technology Co ltd
Original Assignee
Sichuan Wisdom High Speed Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Wisdom High Speed Technology Co ltd filed Critical Sichuan Wisdom High Speed Technology Co ltd
Priority to CN202410010711.0A priority Critical patent/CN117826603A/en
Publication of CN117826603A publication Critical patent/CN117826603A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05BCONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B13/04Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
    • G05B13/042Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses an automatic driving control method based on adversarial reinforcement learning. The adversarial reinforcement learning uses the PPO algorithm as its base algorithm: an automatic driving control model based on an adversarial PPO algorithm is built in a single-agent environment, and two roles, a protagonist and a perturber (adversary), are introduced into the vehicle simulation environment and interact to obtain control of the target vehicle. Under the incentive of the target reward function, the action space of the perturber is restricted and the target reward function is modified, so that the perturber tends to take dangerous driving actions to minimize the reward when it controls the vehicle, while the protagonist maximizes the reward; after a certain number of training rounds, the protagonist's control policy gains the ability to resist more disturbances.

Description

Automatic driving control method based on adversarial reinforcement learning
Technical Field
The invention relates to the technical field of artificial intelligence and automatic driving, and in particular to an automatic driving control method based on adversarial reinforcement learning.
Background
Automatic driving test methods fall mainly into two categories. The first, represented by Waymo, is real-road testing: the reliability of a rule-based automatic driving algorithm is verified by running the vehicle in an actual road environment. Real-vehicle testing yields real data and reliable results, but the operating conditions it can cover are limited; in particular, the heavy-tailed distribution of test scenarios is difficult to address with real-vehicle testing, and the cost and risk are high. The second category is virtual simulation testing: a test-scenario framework is built, real traffic data and expert experience are collected to form a complete scenario library, and driving simulation is then carried out with computer simulation software. Virtual simulation can enlarge scenario coverage and support large-scale, all-round testing, avoiding the risk of testing on actual roads and greatly improving efficiency and cost; however, because of the gap between the virtual environment and the real road, the reliability of simulation results still needs further verification. In recent years, with the success of DRL algorithms in games, language, sports and other fields, many researchers have begun to apply DRL in more challenging domains and to put theoretical research into practice. The strong self-learning and evolution capability of reinforcement learning and its suitability for high-dimensional, complex problems match the requirements of automatic driving technology well; in the automatic driving field, many researchers therefore combine DRL algorithms with fast simulation, and reinforcement learning has been widely applied to the control of automatic driving vehicles, becoming one of the preferred methods for realizing vehicle automatic driving. Reinforcement learning is a class of learning problems in machine learning. Unlike supervised and unsupervised learning, reinforcement learning gains experience through interaction between an agent and an environment: at each moment the agent observes the environment and takes an action based on the current state; after receiving the action, the environment produces a corresponding reward or penalty based on the resulting state; and the agent then optimizes its next action according to that feedback.
In the early stage of automatic driving technology, control models based on theory and rules allowed vehicles to complete some simple driver-assistance operations. However, as application scenarios become more complex and the level of automation rises, such control models can no longer provide sufficiently refined vehicle control and safe driving, and it is difficult to improve the overall capability of the automatic driving vehicle. The strong self-learning and evolution capability of deep reinforcement learning and its suitability for high-dimensional, complex problems fit the development needs of current automatic driving technology well. At present, deep reinforcement learning outperforms other artificial intelligence algorithms in vehicle control, and its ability to learn state features automatically makes it better suited than traditional control methods for studying automatic driving control in road traffic whose environment is complex and hard to categorize. Existing deep reinforcement learning algorithms have been studied extensively for automatic driving control in simulation, and simulation testing has many advantages; however, simulation scenarios lack the uncertain disturbances of real roads, so the robustness of these algorithms still needs improvement, which limits their application. Moreover, because of the way reinforcement learning learns from observations, the observations are easily disturbed, which ultimately biases the results.
Disclosure of Invention
In view of the above, the present invention provides an automatic driving control method based on adversarial reinforcement learning.
The invention adopts the following technical scheme:
An automatic driving control method based on adversarial reinforcement learning, characterized by comprising the following steps:
step 1: building training environment, designing principal angle andperturbers, initializing the neural network, and expressing principal angle strategy parameters as theta μ The perturbator policy parameter is denoted as θ υ Setting the iteration times of the principal angle as N μ The number of the perturber iterations is N υ
Step 2: front N μ Iterative times, keeping the perturbator strategy θ υ Parameters are unchanged, and the main angle policy parameters theta μ Optimizing; at time step t, the main angle observes state S t And take actionThen the state transitions to S t+1 Obtain rewarding->Optimizing principal angle parameter θ with maximized game to achieve Nash equilibrium μ
The expected reward is calculated using the following formula:
$$\bar{R}(\theta_\mu)=\mathbb{E}_{S_0,\,A_t\sim\mu_{\theta_\mu},\,S_{t+1}\sim p}\!\left[\sum_{t=0}^{T}\gamma^{t}R(S_t,A_t)\right]$$
where μ denotes the protagonist policy, θ_μ the protagonist policy parameters, p the state transition function, γ^t the discount factor, T the total number of time steps, R(S_t, A_t) the reward at time t, S_0 the initial state, S_t the state at time t, and A_t the action decided by the protagonist at time t.
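As a quick numerical illustration (not part of the patent), the discounted sum inside this expectation can be computed from one sampled episode as shown below; the reward values are made up purely for demonstration:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_{t=0}^{T} gamma^t * R(S_t, A_t) for one sampled episode."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# The expectation in the formula above is approximated by averaging this
# quantity over many episodes sampled under the current policies.
episodes = [[1.0, 0.5, -0.2], [0.8, 0.9]]   # illustrative reward sequences
expected_reward = sum(discounted_return(ep) for ep in episodes) / len(episodes)
```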
Step 3: for the last N_ν iterations, keep the protagonist policy parameters θ_μ fixed and optimize the perturber policy parameters θ_ν. At time step t, the perturber observes state S_t and takes action A_t^ν; the state then transitions to S_{t+1} and the perturber receives reward r_t^2. The perturber parameters θ_ν are optimized by maximizing in the game so as to reach a Nash equilibrium.
The expected reward is calculated using the following formula:
$$\bar{R}(\theta_\nu)=\mathbb{E}_{S_0,\,A_t\sim\nu_{\theta_\nu},\,S_{t+1}\sim p}\!\left[\sum_{t=0}^{T}\gamma^{t}R(S_t,A_t)\right]$$
where ν denotes the perturber policy, θ_ν the perturber policy parameters, p the state transition function, γ^t the discount factor, T the total number of time steps, R(S_t, A_t) the reward at time t, S_0 the initial state, S_t the state at time t, and A_t the action decided by the perturber at time t.
Step 4: execute steps 2 and 3 alternately until the protagonist and the perturber have each finished training.
Step 5: the protagonist and the perturber interact in turn to obtain control of the target vehicle; under the incentive of the target reward function the perturber minimizes the reward while the protagonist maximizes the reward, and the protagonist finally obtains a vehicle control policy that resists the perturber.
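The overall alternating scheme of steps 2-5 can be sketched in code as follows. This is a minimal illustration, not the patented implementation: `collect_rollout`, the agent objects and their `ppo_update` method are hypothetical placeholders for the PPO machinery detailed in steps 2.1-3.3 below.

```python
def adversarial_training(env, protagonist, perturber, collect_rollout,
                         n_rounds, n_mu, n_nu):
    """Alternate PPO updates between protagonist and perturber (steps 2-4).

    `collect_rollout(env, actor, opponent)` and `agent.ppo_update(rollout)`
    are assumed interfaces, not part of the patent text.
    """
    for _ in range(n_rounds):
        # Step 2: N_mu iterations optimizing the protagonist while the
        # perturber parameters theta_nu stay fixed (it only acts).
        for _ in range(n_mu):
            rollout = collect_rollout(env, actor=protagonist, opponent=perturber)
            protagonist.ppo_update(rollout)       # maximize reward R^1

        # Step 3: N_nu iterations optimizing the perturber while the
        # protagonist parameters theta_mu stay fixed.
        for _ in range(n_nu):
            rollout = collect_rollout(env, actor=perturber, opponent=protagonist)
            perturber.ppo_update(rollout)         # maximize R^2 = -R^1

    # Step 5: after training, the protagonist policy alone is deployed
    # as the vehicle control strategy that resists the perturber.
    return protagonist
```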
Further, the step 2 specifically includes the following steps:
step 2.1: calculating a loss function L of a main angle Actor network CLIPμ ) Loss function L of Critic network VFμ ) And gradient of
Step 2.2: calculating update target V by using generalized dominance estimation method t tar With a prize value R 1* Maximum is the target;
step 2.3: policy parameter θ for principal angle μ And (5) optimizing.
Further, in step 2.1:
the loss function of the Actor network is
$$L^{CLIP}(\theta_\mu)=\mathbb{E}_t\!\left[\min\!\left(r_t(\theta_\mu)\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta_\mu),1-\epsilon,1+\epsilon\right)\hat{A}_t\right)\right]$$
where θ_μ are the protagonist policy parameters, clip(·) is the clipping function, r_t(θ_μ) is the probability ratio between the new and old Actor networks, Â_t is the advantage estimate, and ε is the clipping hyperparameter, generally taken as 0.2;
the loss function of the Critic network is
$$L^{VF}(\theta_\mu)=\mathbb{E}_t\!\left[\left(v_{\theta_\mu}(S_t)-V_t^{tar}\right)^{2}\right]$$
where v_{θ_μ}(S_t) is the state value estimated by the Critic network and V_t^tar is the update target calculated by generalized advantage estimation;
the policy gradient is
$$\nabla_{\theta_\mu}=\mathbb{E}_t\!\left[\hat{A}_t\,\nabla_{\theta_\mu}\log p_{\theta_\mu}(A_t\mid S_t)\right]$$
where Â_t is the advantage function used to estimate the return and p_{θ_μ}(A_t | S_t) is the probability of action A_t in state S_t under policy parameters θ_μ.
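A minimal PyTorch sketch of the two losses is given below; the tensor names and the sign convention (returning losses to be minimized) are assumptions of this illustration, not part of the patent text.

```python
import torch

def ppo_losses(new_logp, old_logp, advantages, values, v_target, eps=0.2):
    """Clipped surrogate (Actor) and squared-error (Critic) losses.

    new_logp / old_logp: log-probabilities of the taken actions under the
    current and the pre-update Actor network, so exp(new - old) is the
    probability ratio r_t referred to above.
    """
    ratio = torch.exp(new_logp - old_logp)               # r_t(theta)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)   # clip(r_t, 1-eps, 1+eps)
    actor_loss = -torch.mean(torch.min(ratio * advantages, clipped * advantages))
    critic_loss = torch.mean((values - v_target) ** 2)   # (v(S_t) - V_t^tar)^2
    return actor_loss, critic_loss
```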
Further, in step 2.2, the update target V_t^tar is calculated by generalized advantage estimation using the following formula:
$$V_t^{tar}=v_{\theta_\mu}(S_t)+\sum_{l=0}^{T-t-1}(\gamma\lambda)^{l}\,\delta_{t+l},\qquad \delta_t=r_t+\gamma\,v_{\theta_\mu}(S_{t+1})-v_{\theta_\mu}(S_t)$$
where v_{θ_μ}(S_t) is the state value estimated by the protagonist Critic network, r_t is the reward at time t, γ is the discount factor, λ is the GAE parameter, usually between 0.95 and 1, and T is the total number of time steps.
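The GAE recursion can be implemented as below; this is a sketch under the usual GAE formulation, with array shapes and names chosen for illustration only.

```python
import numpy as np

def gae_targets(rewards, values, gamma=0.99, lam=0.97):
    """Update targets V_t^tar from generalized advantage estimation.

    `values` holds the critic estimates v(S_0)..v(S_T); `rewards` holds
    r_0..r_{T-1}; lam is the GAE parameter (typically 0.95-1).
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    v_targets = advantages + values[:T]   # V_t^tar = A_hat_t + v(S_t)
    return v_targets, advantages
```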
Further, in step 2.3, the protagonist policy parameters are optimized using the following formula:
$$R^{1*}=\max_{\mu}\min_{\nu}R^{1}(\mu,\nu)$$
where μ is the protagonist policy, ν is the perturber policy, R^{1*} is the protagonist reward when the protagonist and the perturber reach equilibrium, and R^1 is the protagonist reward function, expressed as
$$R^{1}(\mu,\nu)=\mathbb{E}\!\left[\sum_{t=0}^{T}\gamma^{t}\,r^{1}_{t}\!\left(S_t,A^{\mu}_{t},A^{\nu}_{t}\right)\right]$$
where S_t is the state at time t, A_t^μ and A_t^ν are the actions selected by the protagonist and the perturber according to their respective policies, and r_t^1 is the reward at time t.
Further, the step 3 specifically includes the following steps:
step 3.1: calculating loss function L of a perturbator Actor network CLIPυ ) Loss function L of Critic network VFυ ) And gradient of
Step 3.2: calculating update target V by using generalized dominance estimation method t tar With a prize value R 2* Maximum is the target;
step 3.3: policy parameter θ for perturbers υ And (5) optimizing.
Further, in step 3.1:
the loss function of the Actor network is
$$L^{CLIP}(\theta_\nu)=\mathbb{E}_t\!\left[\min\!\left(r_t(\theta_\nu)\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta_\nu),1-\epsilon,1+\epsilon\right)\hat{A}_t\right)\right]$$
where θ_ν are the perturber policy parameters, clip(·) is the clipping function, r_t(θ_ν) is the probability ratio between the new and old Actor networks, Â_t is the advantage estimate, and ε is the clipping hyperparameter, generally taken as 0.2;
the loss function of the Critic network is
$$L^{VF}(\theta_\nu)=\mathbb{E}_t\!\left[\left(v_{\theta_\nu}(S_t)-V_t^{tar}\right)^{2}\right]$$
where v_{θ_ν}(S_t) is the state value estimated by the Critic network and V_t^tar is the update target calculated by generalized advantage estimation;
the policy gradient is
$$\nabla_{\theta_\nu}=\mathbb{E}_t\!\left[\hat{A}_t\,\nabla_{\theta_\nu}\log p_{\theta_\nu}(A_t\mid S_t)\right]$$
where Â_t is the advantage function used to estimate the return and p_{θ_ν}(A_t | S_t) is the probability of action A_t in state S_t under policy parameters θ_ν.
Further, in step 3.2, the update target V_t^tar is calculated using the following formula:
$$V_t^{tar}=v_{\theta_\nu}(S_t)+\sum_{l=0}^{T-t-1}(\gamma\lambda)^{l}\,\delta_{t+l},\qquad \delta_t=r_t+\gamma\,v_{\theta_\nu}(S_{t+1})-v_{\theta_\nu}(S_t)$$
where v_{θ_ν}(S_t) is the state value estimated by the perturber Critic network, r_t is the reward at time t, γ is the discount factor, λ is the GAE parameter, usually between 0.95 and 1, and T is the total number of time steps.
Further, in step 3.3, the perturber policy parameters are optimized using the following formula:
$$R^{2*}=\max_{\nu}\min_{\mu}R^{2}(\mu,\nu)$$
where μ is the protagonist policy, ν is the perturber policy, R^{2*} is the perturber reward when the protagonist and the perturber reach equilibrium, and R^2 is the perturber reward function, which takes the value
$$R^{2}=-R^{1}.$$
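In code, the zero-sum coupling between the two reward signals can be expressed as below; the individual safety/comfort/efficiency terms and their weights are placeholders invented for illustration, and only the relation R^2 = -R^1 comes from the text.

```python
def step_rewards(collision, speed_mps, jerk, v_min=60 / 3.6, v_max=120 / 3.6):
    """Illustrative per-step rewards for protagonist (r1) and perturber (r2)."""
    safety = -10.0 if collision else 0.0              # placeholder safety term
    efficiency = 1.0 if v_min <= speed_mps <= v_max else -1.0
    comfort = -0.1 * abs(jerk)                        # placeholder comfort term
    r1 = safety + efficiency + comfort                # protagonist reward
    r2 = -r1                                          # perturber reward (zero-sum)
    return r1, r2
```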
The beneficial effects of the invention are as follows:
(1) Continuous control of the autonomous vehicle is realized based on the PPO algorithm; the reward function is constructed jointly from the aspects of safety, comfort and efficiency, and the target vehicle finally performs driving operations such as lane following, lane changing and steering. Compared with the DDPG algorithm, the control model built on the PPO algorithm is safer and more comfortable, but it lacks a certain stability in the face of larger disturbances in the simulation environment.
(2) Under the same simulation conditions, the automatic driving control model built on adversarial PPO converges faster during training than the PPO control model and reaches a higher cumulative reward; when the driving speed, the number of background vehicles and the driving style of the background vehicles change, the adversarial PPO control model performs better in overall performance, safety, comfort and efficiency when controlling the target vehicle, its car-following and lane-changing trajectories are more stable, and its overall robustness is higher.
(3) An asymmetric reward function is constructed: the reward function is first built from the protagonist's driving objective and then made asymmetric according to the adversarial strategy; a reasonable design reduces the difficulty of setting the reward function and speeds up the training of the adversarial model.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the following brief description of the drawings of the embodiments will make it apparent that the drawings in the following description relate only to some embodiments of the present invention and are not limiting of the present invention.
FIG. 1 is a structural diagram of the adversarial reinforcement learning algorithm according to the present invention;
FIG. 2 is a schematic diagram of the training results of the adversarial PPO model according to the present invention;
FIG. 3 is a comparison of the cumulative rewards of the adversarial PPO and PPO models according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more clear, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. It will be apparent that the described embodiments are some, but not all, embodiments of the invention. All other embodiments, which can be made by a person skilled in the art without creative efforts, based on the described embodiments of the present invention fall within the protection scope of the present invention.
The invention will be further described with reference to the drawings and examples.
An automatic driving control method based on adversarial reinforcement learning comprises the following steps:
step 1: building a training environment, designing principal angles and perturbers, initializing a neural network, and expressing principal angle strategy parameters as theta μ The perturbator policy parameter is denoted as θ υ Setting the iteration times of the principal angle as N μ The number of the perturber iterations is N υ
Step 2: front N μ Iterative times, keeping the perturbator strategy θ υ Parameters are unchanged, and the main angle policy parameters theta μ Optimizing; at time step t, the main angle observes state S t And take actionThen the state transitions to S t+1 Obtain rewarding->Optimizing principal angle parameter θ with maximized game to achieve Nash equilibrium μ
The expected rewards are calculated using the following formula:
wherein μ represents the principal angle policy, θ μ Is the principal angle policy parameter, p is the state transfer function, gamma t Is a discount factor, T is the total time step, R (S t ,A t ) For rewarding at time t, S 0 Is in an initial state S t In the state of t time, A t The action obtained by the principal angle decision at the time t.
Wherein, the step 2 specifically comprises the following steps:
step 2.1: and calculating the loss function of the main angle Actor network, the loss function and the gradient of the Critic network.
Wherein, the step 2.1:
The loss function of the Actor network is
$$L^{CLIP}(\theta_\mu)=\mathbb{E}_t\!\left[\min\!\left(r_t(\theta_\mu)\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta_\mu),1-\epsilon,1+\epsilon\right)\hat{A}_t\right)\right]$$
where θ_μ are the protagonist policy parameters, clip(·) is the clipping function, r_t(θ_μ) is the probability ratio between the new and old Actor networks, Â_t is the advantage estimate, and ε is the clipping hyperparameter, generally taken as 0.2;
the loss function of the Critic network is
$$L^{VF}(\theta_\mu)=\mathbb{E}_t\!\left[\left(v_{\theta_\mu}(S_t)-V_t^{tar}\right)^{2}\right]$$
where v_{θ_μ}(S_t) is the state value estimated by the Critic network and V_t^tar is the update target calculated by generalized advantage estimation;
the policy gradient is
$$\nabla_{\theta_\mu}=\mathbb{E}_t\!\left[\hat{A}_t\,\nabla_{\theta_\mu}\log p_{\theta_\mu}(A_t\mid S_t)\right]$$
where Â_t is the advantage function used to estimate the return and p_{θ_μ}(A_t | S_t) is the probability of action A_t in state S_t under policy parameters θ_μ.
Step 2.2: and calculating an updating target by using a generalization dominance estimation method, and taking the maximum rewarding value as the target.
Wherein, in the step 2.2, the update target is calculated by using the following formula by using the generalized dominance estimation method
Wherein,state value estimated for principal angle Critic network, r t-1 For the reward at time T-1, gamma is the discount factor, lambda is the GAE parameter, usually between 0.95 and 1, and T is the total time step.
Step 2.3: and optimizing the strategy parameters of the principal angles.
Wherein, in the step 2.3, the policy parameters of the principal angle are optimized by adopting the following formula:
wherein μ is the principal angle policy, v is the perturbator policy, R 1* For the main angle rewards when the main angle and the perturber reach equilibrium, R 1 The bonus function, which is the principal angle, is expressed as follows,
wherein S is t The state at the time of t is the state,and->Representing actions selected by principal angles and perturbers according to respective strategies,>is a reward at time t.
Step 3: for the last N_ν iterations, keep the protagonist policy parameters θ_μ fixed and optimize the perturber policy parameters θ_ν. At time step t, the perturber observes state S_t and takes action A_t^ν; the state then transitions to S_{t+1} and the perturber receives reward r_t^2. The perturber parameters θ_ν are optimized by maximizing in the game so as to reach a Nash equilibrium.
The expected reward is calculated using the following formula:
$$\bar{R}(\theta_\nu)=\mathbb{E}_{S_0,\,A_t\sim\nu_{\theta_\nu},\,S_{t+1}\sim p}\!\left[\sum_{t=0}^{T}\gamma^{t}R(S_t,A_t)\right]$$
where ν denotes the perturber policy, θ_ν the perturber policy parameters, p the state transition function, γ^t the discount factor, T the total number of time steps, R(S_t, A_t) the reward at time t, S_0 the initial state, S_t the state at time t, and A_t the action decided by the perturber at time t.
Wherein, the step 3 comprises the following specific steps:
step 3.1: calculating loss function L of a perturbator Actor network CLIPυ ) Loss function L of Critic network VFυ ) And gradient of
In step 3.1,
the loss function of the Actor network is
$$L^{CLIP}(\theta_\nu)=\mathbb{E}_t\!\left[\min\!\left(r_t(\theta_\nu)\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta_\nu),1-\epsilon,1+\epsilon\right)\hat{A}_t\right)\right]$$
where θ_ν are the perturber policy parameters, clip(·) is the clipping function, r_t(θ_ν) is the probability ratio between the new and old Actor networks, Â_t is the advantage estimate, and ε is the clipping hyperparameter, generally taken as 0.2;
the loss function of the Critic network is
$$L^{VF}(\theta_\nu)=\mathbb{E}_t\!\left[\left(v_{\theta_\nu}(S_t)-V_t^{tar}\right)^{2}\right]$$
where v_{θ_ν}(S_t) is the state value estimated by the Critic network and V_t^tar is the update target calculated by generalized advantage estimation;
the policy gradient is
$$\nabla_{\theta_\nu}=\mathbb{E}_t\!\left[\hat{A}_t\,\nabla_{\theta_\nu}\log p_{\theta_\nu}(A_t\mid S_t)\right]$$
where Â_t is the advantage function used to estimate the return and p_{θ_ν}(A_t | S_t) is the probability of action A_t in state S_t under policy parameters θ_ν.
Step 3.2: calculating update target V by using generalized dominance estimation method t tar With a prize value R 2* The maximum is the target.
In the step 3.2, the update target V is calculated by the following formula t tar
In the method, in the process of the invention,state value estimated for perturber Critic network, r t-1 For the reward at time T-1, gamma is the discount factor, lambda is the GAE parameter, usually between 0.95 and 1, and T is the total time step.
Step 3.3: policy parameter θ for perturbers υ And (5) optimizing.
In the step 3.3, the policy parameters of the principal angle are optimized by adopting the following formula:
wherein μ is the principal angle policy, v is the perturber policy, R 2* For the main angle rewards when the main angle and the perturber reach equilibrium, R 2 A reward function for the perturber, which takes the value of,
R 2 =-R 1
Step 4: execute steps 2 and 3 alternately until the protagonist and the perturber have each finished training.
Step 5: the protagonist and the perturber interact in turn to obtain control of the target vehicle; under the incentive of the target reward function the perturber minimizes the reward while the protagonist maximizes the reward, and the protagonist finally obtains a vehicle control policy that resists the perturber.
Examples
1. Simulation environment design
In the simulation environment, the road consists of three sections: a straight section, a curved section and a straight section. The lane width is set to 3.75 m, with two lanes in one direction; the curved section satisfies the minimum turning radius; the minimum speed limit of the road in the simulation training is 60 km/h and the maximum speed limit is 120 km/h.
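The road and speed-limit parameters above can be collected into a single configuration object; the keys below are illustrative names, not an interface defined by the patent.

```python
# Illustrative simulation configuration for the scenario described above.
SIM_CONFIG = {
    "road_sections": ["straight", "curve", "straight"],
    "lane_width_m": 3.75,
    "lanes_per_direction": 2,
    "min_turning_radius_respected": True,
    "speed_limit_kmh": {"min": 60, "max": 120},
}
```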
The vehicles in the simulation environment are divided into the target vehicle and background vehicles. Control of the target vehicle is obtained by the protagonist and the perturber in turn, which execute safe or dangerous driving operations respectively. The background vehicles are controlled by a conventional control model, and the politeness parameter ρ is set to 0 in the simulation training, so that the disturbance to the protagonist's safe driving mainly comes from the perturber. After the protagonist gains experience through repeated interaction with the environment, the target vehicle acquires the ability to resist environmental disturbances and learns how to drive safely.
In the adversarial simulation training, the simulation scenario still makes the following assumptions:
(1) The road in the simulation scenario maintains a continuous traffic flow; congestion or slow traffic caused by increased traffic density or by traffic accidents is not considered.
(2) The static elements of the simulation environment retain only the geometric dimensions of the lanes, ignoring the influence of road surface materials and traffic signs on driving; among dynamic traffic factors, only the influence of background vehicles on the target vehicle is considered, and changes in natural conditions such as weather and illumination, as well as emergencies outside the experimental design, are not considered.
(3) All vehicles other than the target vehicle are background vehicles; they are controlled by a designed theoretical driving model and do not have the driving randomness and driving risk of the target vehicle.
2. Simulation training set-up
Under the reinforcement learning framework, environmental disturbances are added so that the perturber samples the worst-case working conditions as much as possible during simulation training and restricts the protagonist's safe driving, while avoiding the problem that an over-capable perturber makes the protagonist's reward grow too slowly in the early stage for it to learn from experience. This is achieved through the design of the reward function and the training rounds: N_iter = 500, N_pro = 10 and N_adv = 1, i.e. in each alternating loop the protagonist is trained 10 times and the perturber once, and 500 such alternating loops are run. This ensures that the perturber influences the protagonist's safety control while the protagonist still receives enough training to learn how to resist the disturbance.
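The 10:1 alternation over 500 rounds can be written as a simple schedule generator; this is a sketch, and how each training session is carried out is left to the PPO update routines described earlier.

```python
N_ITER = 500   # alternating training rounds
N_PRO = 10     # protagonist training sessions per round
N_ADV = 1      # perturber training sessions per round

def training_schedule():
    """Yield (round_index, role) pairs for the alternating schedule."""
    for it in range(N_ITER):
        for _ in range(N_PRO):
            yield it, "protagonist"
        for _ in range(N_ADV):
            yield it, "perturber"

# Example: count how many protagonist updates the full schedule contains.
n_pro_updates = sum(1 for _, role in training_schedule() if role == "protagonist")
assert n_pro_updates == N_ITER * N_PRO
```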
Based on the design of the corresponding asymmetric reward function, the training hyperparameters set for the adversarial PPO control model in this experiment are shown in Table 1.
In the reinforcement learning framework, one round of training consists of training the protagonist and the perturber. Each time N_iter increases by one, the protagonist completes 10 training sessions and the perturber completes 1. Over the N_iter rounds the protagonist learns how to maximize the cumulative reward while the perturber minimizes it as much as possible. In the adversarial simulation training, the protagonist and the perturber obtain control of the target vehicle in turn, and the target vehicle may collide with a background vehicle or the road edge, or exhibit driving behaviours such as driving too slowly or driving in reverse. According to the training objectives of the protagonist and the perturber, the experimental termination conditions are divided into three aspects: abnormal vehicle state, abnormal vehicle position, and simulation termination (a minimal check for these conditions is sketched after the list below):
(1) Abnormal vehicle state
1) The target vehicle collides. When the target vehicle collides with a background vehicle or the road edge, the current training round of the protagonist or the perturber ends and the next training round begins.
2) The target vehicle speed is abnormal. In a protagonist training round, when the speed of the target vehicle does not reach the set minimum speed within the set time, or exceeds the maximum limit, the training round ends and the next training round begins.
(2) Abnormal vehicle position
1) The vehicle drives off the prescribed road. When the target vehicle drives out of the specified test road range, the current training round of the protagonist or the perturber ends and the next training round begins.
2) The vehicle drives in reverse. In a protagonist training round, when the driving direction of the target vehicle is opposite to the forward direction of the road, or the target vehicle decelerates and reverses, the training round ends and the next training round begins.
(3) Simulation termination
1) The maximum simulation step or round is reached. When the simulation time steps of a single round reach the maximum, the round ends; when the training rounds of the protagonist or the perturber reach the maximum, N_iter is incremented by one and the next alternating training round begins; when the number of alternating training rounds reaches the set value, the simulation training ends.
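A minimal check of these termination conditions might look as follows; the `state` field names are hypothetical stand-ins for quantities the simulator would expose, and the timed speed check is simplified.

```python
def episode_done(state, step, max_steps):
    """Return (done, reason) for the termination conditions listed above."""
    if state["collision"]:                                   # (1) abnormal vehicle state
        return True, "collision"
    if not (state["v_min"] <= state["speed"] <= state["v_max"]):
        return True, "speed abnormal"                        # simplified timed check
    if not state["on_road"]:                                 # (2) abnormal vehicle position
        return True, "left the test road"
    if not state["heading_ok"]:
        return True, "reverse driving"
    if step >= max_steps:                                    # (3) simulation termination
        return True, "maximum simulation steps reached"
    return False, ""
```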
TABLE 1 Hyperparameter settings for the adversarial simulation training
3. Model training result analysis
In the simulation scenario, the initial speed of the target vehicle is set to 65 km/h; 5 fixed background vehicles are selected to induce the target vehicle to follow and change lanes, and 5 background vehicles are generated at random within the driving range of the road to increase the randomness of the test environment, travelling at initial speeds of 60-70 km/h. The adversarial model undergoes 5000 rounds of simulation training; the model iterates continuously during training, and the reward finally stabilizes at about 170 and converges. As shown in FIG. 2, in the initial training phase (roughly the first 1000 rounds), the cumulative return of the protagonist is negative and rises slowly because of the influence of the perturber and the inexperience of the model. After a certain number of rounds of "confrontation", the adversarial model discovers experience for avoiding danger, and the cumulative reward rises rapidly in the middle of training. Around round 2500 the cumulative reward of the model, after rising, returns to a stable range. The cumulative return per round then gradually stabilizes, indicating that the adversarial model has converged and can safely control the autonomous vehicle.
The cumulative reward curves of the adversarial PPO algorithm and the PPO algorithm are compared. As shown in FIG. 3, the final cumulative reward of the adversarial model stabilizes around 170 while that of the PPO model stabilizes around 140, indicating that the adversarial model performs better for the same number of training rounds. Judging from the trend of the fitted lines, the PPO model rises faster in the early training phase (before round 1300), while the adversarial PPO model, affected by the perturber, has a lower cumulative reward and rises more slowly. As experience accumulates, the cumulative reward of the adversarial PPO model gradually catches up with and overtakes that of the PPO model, and finally the cumulative rewards of the two models separate and converge respectively.
The present invention is not limited to the above embodiments; any modification, equivalent substitution or improvement made to the above embodiments without departing from the scope of the invention falls within the protection scope of the invention.

Claims (9)

1. An automatic driving control method based on adversarial reinforcement learning, characterized by comprising the following steps:
step 1: build the training environment, design the protagonist and the perturber, and initialize the neural networks; denote the protagonist policy parameters as θ_μ and the perturber policy parameters as θ_ν; set the number of protagonist iterations to N_μ and the number of perturber iterations to N_ν;
step 2: for the first N_μ iterations, keep the perturber policy parameters θ_ν fixed and optimize the protagonist policy parameters θ_μ; at time step t the protagonist observes state S_t and takes action A_t^μ, the state then transitions to S_{t+1} and the protagonist receives reward r_t^1; the protagonist parameters θ_μ are optimized by maximizing in the game so as to reach a Nash equilibrium;
the expected reward is calculated using the following formula:
$$\bar{R}(\theta_\mu)=\mathbb{E}_{S_0,\,A_t\sim\mu_{\theta_\mu},\,S_{t+1}\sim p}\!\left[\sum_{t=0}^{T}\gamma^{t}R(S_t,A_t)\right]$$
where μ denotes the protagonist policy, θ_μ the protagonist policy parameters, p the state transition function, γ^t the discount factor, T the total number of time steps, R(S_t, A_t) the reward at time t, S_0 the initial state, S_t the state at time t, and A_t the action decided by the protagonist at time t;
step 3: for the last N_ν iterations, keep the protagonist policy parameters θ_μ fixed and optimize the perturber policy parameters θ_ν; at time step t the perturber observes state S_t and takes action A_t^ν, the state then transitions to S_{t+1} and the perturber receives reward r_t^2; the perturber parameters θ_ν are optimized by maximizing in the game so as to reach a Nash equilibrium;
the expected reward is calculated using the following formula:
$$\bar{R}(\theta_\nu)=\mathbb{E}_{S_0,\,A_t\sim\nu_{\theta_\nu},\,S_{t+1}\sim p}\!\left[\sum_{t=0}^{T}\gamma^{t}R(S_t,A_t)\right]$$
where ν denotes the perturber policy, θ_ν the perturber policy parameters, p the state transition function, γ^t the discount factor, T the total number of time steps, R(S_t, A_t) the reward at time t, S_0 the initial state, S_t the state at time t, and A_t the action decided by the perturber at time t;
step 4: execute steps 2 and 3 alternately until the protagonist and the perturber have each finished training;
step 5: the protagonist and the perturber interact in turn to obtain control of the target vehicle; under the incentive of the target reward function the perturber minimizes the reward while the protagonist maximizes the reward, and the protagonist finally obtains a vehicle control policy that resists the perturber.
2. The automatic driving control method based on adversarial reinforcement learning according to claim 1, wherein step 2 specifically comprises the following steps:
step 2.1: calculate the loss function L^CLIP(θ_μ) of the protagonist Actor network, the loss function L^VF(θ_μ) of the Critic network, and the policy gradient ∇_{θ_μ};
step 2.2: calculate the update target V_t^tar using generalized advantage estimation, taking maximization of the reward value R^{1*} as the objective;
step 2.3: optimize the protagonist policy parameters θ_μ.
3. The automatic driving control method based on adversarial reinforcement learning according to claim 2, wherein in step 2.1:
the loss function of the Actor network is
$$L^{CLIP}(\theta_\mu)=\mathbb{E}_t\!\left[\min\!\left(r_t(\theta_\mu)\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta_\mu),1-\epsilon,1+\epsilon\right)\hat{A}_t\right)\right]$$
where θ_μ are the protagonist policy parameters, clip(·) is the clipping function, r_t(θ_μ) is the probability ratio between the new and old Actor networks, Â_t is the advantage estimate, and ε is the clipping hyperparameter, generally taken as 0.2;
the loss function of the Critic network is
$$L^{VF}(\theta_\mu)=\mathbb{E}_t\!\left[\left(v_{\theta_\mu}(S_t)-V_t^{tar}\right)^{2}\right]$$
where v_{θ_μ}(S_t) is the state value estimated by the Critic network and V_t^tar is the update target calculated by generalized advantage estimation;
the policy gradient is
$$\nabla_{\theta_\mu}=\mathbb{E}_t\!\left[\hat{A}_t\,\nabla_{\theta_\mu}\log p_{\theta_\mu}(A_t\mid S_t)\right]$$
where Â_t is the advantage function used to estimate the return and p_{θ_μ}(A_t | S_t) is the probability of action A_t in state S_t under policy parameters θ_μ.
4. The automatic driving control method based on adversarial reinforcement learning according to claim 2, wherein in step 2.2 the update target V_t^tar is calculated by generalized advantage estimation using the following formula:
$$V_t^{tar}=v_{\theta_\mu}(S_t)+\sum_{l=0}^{T-t-1}(\gamma\lambda)^{l}\,\delta_{t+l},\qquad \delta_t=r_t+\gamma\,v_{\theta_\mu}(S_{t+1})-v_{\theta_\mu}(S_t)$$
where v_{θ_μ}(S_t) is the state value estimated by the protagonist Critic network, r_t is the reward at time t, γ is the discount factor, λ is the GAE parameter, usually between 0.95 and 1, and T is the total number of time steps.
5. The automatic driving control method based on adversarial reinforcement learning according to claim 2, wherein step 2.3 optimizes the protagonist policy parameters using the following formula:
$$R^{1*}=\max_{\mu}\min_{\nu}R^{1}(\mu,\nu)$$
where μ is the protagonist policy, ν is the perturber policy, R^{1*} is the protagonist reward when the protagonist and the perturber reach equilibrium, and R^1 is the protagonist reward function, expressed as
$$R^{1}(\mu,\nu)=\mathbb{E}\!\left[\sum_{t=0}^{T}\gamma^{t}\,r^{1}_{t}\!\left(S_t,A^{\mu}_{t},A^{\nu}_{t}\right)\right]$$
where S_t is the state at time t, A_t^μ and A_t^ν are the actions selected by the protagonist and the perturber according to their respective policies, and r_t^1 is the reward at time t.
6. The automatic driving control method based on adversarial reinforcement learning according to claim 1, wherein step 3 specifically comprises the following steps:
step 3.1: calculate the loss function L^CLIP(θ_ν) of the perturber Actor network, the loss function L^VF(θ_ν) of the Critic network, and the policy gradient ∇_{θ_ν};
step 3.2: calculate the update target V_t^tar using generalized advantage estimation, taking maximization of the reward value R^{2*} as the objective;
step 3.3: optimize the perturber policy parameters θ_ν.
7. The automatic driving control method based on adversarial reinforcement learning according to claim 6, wherein in step 3.1:
the loss function of the Actor network is
$$L^{CLIP}(\theta_\nu)=\mathbb{E}_t\!\left[\min\!\left(r_t(\theta_\nu)\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta_\nu),1-\epsilon,1+\epsilon\right)\hat{A}_t\right)\right]$$
where θ_ν are the perturber policy parameters, clip(·) is the clipping function, r_t(θ_ν) is the probability ratio between the new and old Actor networks, Â_t is the advantage estimate, and ε is the clipping hyperparameter, generally taken as 0.2;
the loss function of the Critic network is
$$L^{VF}(\theta_\nu)=\mathbb{E}_t\!\left[\left(v_{\theta_\nu}(S_t)-V_t^{tar}\right)^{2}\right]$$
where v_{θ_ν}(S_t) is the state value estimated by the Critic network and V_t^tar is the update target calculated by generalized advantage estimation;
the policy gradient is
$$\nabla_{\theta_\nu}=\mathbb{E}_t\!\left[\hat{A}_t\,\nabla_{\theta_\nu}\log p_{\theta_\nu}(A_t\mid S_t)\right]$$
where Â_t is the advantage function used to estimate the return and p_{θ_ν}(A_t | S_t) is the probability of action A_t in state S_t under policy parameters θ_ν.
8. The method according to claim 6, wherein in step 3.2 the update target V_t^tar is calculated using the following formula:
$$V_t^{tar}=v_{\theta_\nu}(S_t)+\sum_{l=0}^{T-t-1}(\gamma\lambda)^{l}\,\delta_{t+l},\qquad \delta_t=r_t+\gamma\,v_{\theta_\nu}(S_{t+1})-v_{\theta_\nu}(S_t)$$
where v_{θ_ν}(S_t) is the state value estimated by the perturber Critic network, r_t is the reward at time t, γ is the discount factor, λ is the GAE parameter, usually between 0.95 and 1, and T is the total number of time steps.
9. The automatic driving control method based on adversarial reinforcement learning according to claim 6, wherein step 3.3 optimizes the perturber policy parameters using the following formula:
$$R^{2*}=\max_{\nu}\min_{\mu}R^{2}(\mu,\nu)$$
where μ is the protagonist policy, ν is the perturber policy, R^{2*} is the perturber reward when the protagonist and the perturber reach equilibrium, and R^2 is the perturber reward function, which takes the value
$$R^{2}=-R^{1}.$$
CN202410010711.0A 2024-01-04 2024-01-04 Automatic driving control method based on countermeasure reinforcement learning Pending CN117826603A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410010711.0A CN117826603A (en) 2024-01-04 2024-01-04 Automatic driving control method based on countermeasure reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410010711.0A CN117826603A (en) 2024-01-04 2024-01-04 Automatic driving control method based on countermeasure reinforcement learning

Publications (1)

Publication Number Publication Date
CN117826603A true CN117826603A (en) 2024-04-05

Family

ID=90512930

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410010711.0A Pending CN117826603A (en) 2024-01-04 2024-01-04 Automatic driving control method based on countermeasure reinforcement learning

Country Status (1)

Country Link
CN (1) CN117826603A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118651439A (en) * 2024-08-16 2024-09-17 西北工业大学 Star group avoidance autonomous decision-making method based on self-adaption MADDPG


Similar Documents

Publication Publication Date Title
CN110297494B (en) Decision-making method and system for lane change of automatic driving vehicle based on rolling game
CN110969848A (en) Automatic driving overtaking decision method based on reinforcement learning under opposite double lanes
CN113511222B (en) Scene self-adaptive vehicle interaction behavior decision and prediction method and device
Hou et al. Autonomous driving at the handling limit using residual reinforcement learning
Sanchez et al. Gene regulated car driving: using a gene regulatory network to drive a virtual car
CN114973650A (en) Vehicle ramp entrance confluence control method, vehicle, electronic device, and storage medium
CN111824182A (en) Three-axis heavy vehicle self-adaptive cruise control algorithm based on deep reinforcement learning
CN116476825B (en) Automatic driving lane keeping control method based on safe and reliable reinforcement learning
CN116894395A (en) Automatic driving test scene generation method, system and storage medium
CN114355897B (en) Vehicle path tracking control method based on model and reinforcement learning hybrid switching
EP4160478A1 (en) Driving decision-making method, device, and chip
Yan et al. A game-theoretical approach to driving decision making in highway scenarios
CN117826603A (en) Automatic driving control method based on countermeasure reinforcement learning
CN117872800A (en) Decision planning method based on reinforcement learning in discrete state space
CN116224996A (en) Automatic driving optimization control method based on countermeasure reinforcement learning
CN113353102B (en) Unprotected left-turn driving control method based on deep reinforcement learning
CN116872971A (en) Automatic driving control decision-making method and system based on man-machine cooperation enhancement
Kaushik et al. Learning driving behaviors for automated cars in unstructured environments
CN116534011A (en) Course reinforcement learning-based control method for bicycle lane change and import fleet
CN116300944A (en) Automatic driving decision method and system based on improved Double DQN
CN116853243A (en) Vehicle self-adaptive cruise control method based on projection constraint strategy optimization
Zhao et al. Imitation of real lane-change decisions using reinforcement learning
CN116027788A (en) Intelligent driving behavior decision method and equipment integrating complex network theory and part of observable Markov decision process
Yang et al. Decision-making in autonomous driving by reinforcement learning combined with planning & control
CN116052411A (en) Diversion area mixed traffic flow control method based on graph neural network reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination