CN117826603A - Automatic driving control method based on adversarial reinforcement learning - Google Patents
Automatic driving control method based on adversarial reinforcement learning
- Publication number
- CN117826603A CN117826603A CN202410010711.0A CN202410010711A CN117826603A CN 117826603 A CN117826603 A CN 117826603A CN 202410010711 A CN202410010711 A CN 202410010711A CN 117826603 A CN117826603 A CN 117826603A
- Authority
- CN
- China
- Prior art keywords
- perturber
- angle
- policy
- state
- function
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05B—CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
- G05B13/00—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
- G05B13/02—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
- G05B13/04—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
- G05B13/042—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Automation & Control Theory (AREA)
- Feedback Control In General (AREA)
Abstract
The invention discloses an automatic driving control method based on adversarial reinforcement learning. The adversarial reinforcement learning takes the PPO algorithm as its basic algorithm: an automatic driving control model based on an adversarial PPO algorithm is built on a single-agent environment, two roles, a protagonist (Protagonist) and a perturber (Adversary), are introduced into the vehicle simulation environment, and the two roles interact to obtain control of the target vehicle. Under the incentive of the target reward function, the action-space range of the perturber is limited and the target reward function is modified, so that the perturber tends to take dangerous driving actions that minimize the reward when controlling the vehicle while the protagonist maximizes the reward; after a certain number of learning rounds, the protagonist's control policy finally acquires the ability to resist more interference.
Description
Technical Field
The invention relates to the technical field of artificial intelligence/automatic driving, and in particular to an automatic driving control method based on adversarial reinforcement learning.
Background
Automatic driving testing methods fall mainly into two types. The first, represented by Waymo, is real-road testing: the reliability of a rule-based automatic driving algorithm is verified by running the vehicle in an actual road environment. Real-vehicle testing yields real data and reliable results, but the scene conditions it can cover are limited; in particular, the heavy-tailed distribution of test scenarios is difficult to cover by real-vehicle testing, and the test cost and the risk are both high. The second approach is virtual simulation testing: a test-scenario framework is constructed, real traffic data and expert experience are collected to build a complete scenario library, and driving simulation is then carried out with computer simulation software. Virtual simulation enlarges scene coverage and enables large-scale, all-round testing, avoiding the risk of testing on actual roads and greatly improving efficiency and cost; however, because of the gap between the virtual environment and the actual road, the reliability of simulation test results still needs further verification. In recent years, as DRL algorithms have succeeded in games, language, sports and other fields, many scholars have begun to apply them to more challenging domains and to bring theoretical research into real life. The strong self-learning and evolution capability of reinforcement learning and its applicability to high-dimensional, complex problems match the requirements of automatic driving technology well, so in the automatic driving field many researchers combine DRL algorithms with fast simulation; reinforcement learning is therefore widely applied to automatic driving vehicle control research and is one of the preferred methods for realizing vehicle automation. Reinforcement learning is a class of learning problems in machine learning. Unlike supervised and unsupervised learning, reinforcement learning obtains learning experience through interaction between an agent and an environment: based on the current state observed from the environment at each moment, the agent takes a corresponding action, the environment returns a corresponding reward or penalty based on that action, and the agent then optimizes its next action according to this feedback.
In the early development stage of automatic driving technology, control models based on theory and rules could enable the vehicle to complete some simple driver-assistance operations. However, as application scenarios become more complex and automation levels rise, the original control models can no longer achieve finer-grained vehicle control and safe driving, and it is difficult to improve the comprehensive capability of the automated vehicle. The strong self-learning and evolution capability of deep reinforcement learning and its applicability to high-dimensional, complex problems suit the development needs of current automatic driving technology. At present, deep reinforcement learning outperforms other artificial intelligence algorithms in vehicle control, and its ability to automatically learn state features makes it better suited than traditional control methods to automatic driving control research for complex, hard-to-model road traffic. Existing deep reinforcement learning algorithms have been studied extensively for automatic driving control in simulation environments, and simulation testing has many advantages, but simulated scenes lack the uncertain disturbances of actual roads, so the robustness of the algorithms still needs improvement, which limits their application. Meanwhile, because of the way reinforcement learning learns, its observations are easily interfered with, which ultimately biases the results.
Disclosure of Invention
In view of the above, the present invention provides an automatic driving control method based on adversarial reinforcement learning.
The invention adopts the following technical scheme:

An automatic driving control method based on adversarial reinforcement learning, characterized by comprising the following steps:

Step 1: build the training environment, design the protagonist and the perturber, and initialize the neural networks; denote the protagonist policy parameters as θ_μ and the perturber policy parameters as θ_υ, and set the number of protagonist iterations to N_μ and the number of perturber iterations to N_υ;

Step 2: for the first N_μ iterations, keep the perturber policy parameters θ_υ fixed and optimize the protagonist policy parameters θ_μ. At time step t, the protagonist observes state S_t and takes action A_t^μ; the state then transitions to S_(t+1) and the reward r_t^1 is obtained. The protagonist parameters θ_μ are optimized to maximize the game objective so as to reach a Nash equilibrium;

The expected reward is calculated using the following formula:

$$\bar{R}(\theta_\mu)=\mathbb{E}_{S_0,\;A_t\sim\mu(\cdot\mid S_t;\theta_\mu),\;S_{t+1}\sim p(\cdot\mid S_t,A_t)}\left[\sum_{t=0}^{T}\gamma^{t}R(S_t,A_t)\right]$$

where μ denotes the protagonist policy, θ_μ its policy parameters, p the state transition function, γ^t the discount factor, T the total number of time steps, R(S_t, A_t) the reward at time t, S_0 the initial state, S_t the state at time t, and A_t the action decided by the protagonist at time t;

Step 3: for the subsequent N_υ iterations, keep the protagonist policy parameters θ_μ fixed and optimize the perturber policy parameters θ_υ. At time step t, the perturber observes state S_t and takes action A_t^υ; the state then transitions to S_(t+1) and the corresponding reward is obtained. The perturber parameters θ_υ are optimized to maximize the game objective so as to reach a Nash equilibrium;

The expected reward is calculated using the following formula:

$$\bar{R}(\theta_\upsilon)=\mathbb{E}_{S_0,\;A_t\sim\upsilon(\cdot\mid S_t;\theta_\upsilon),\;S_{t+1}\sim p(\cdot\mid S_t,A_t)}\left[\sum_{t=0}^{T}\gamma^{t}R(S_t,A_t)\right]$$

where υ denotes the perturber policy, θ_υ its policy parameters, p the state transition function, γ^t the discount factor, T the total number of time steps, R(S_t, A_t) the reward at time t, S_0 the initial state, S_t the state at time t, and A_t the action decided by the perturber at time t;

Step 4: alternately execute steps 2 and 3 until the protagonist and the perturber have each completed training;

Step 5: the protagonist and the perturber interact to obtain control of the target vehicle; under the incentive of the target reward function, the perturber minimizes the reward, the protagonist maximizes the reward, and the protagonist finally obtains a vehicle control policy that resists the perturber.
Further, step 2 specifically includes the following steps:

Step 2.1: calculate the loss function L^CLIP(θ_μ) of the protagonist Actor network, the loss function L^VF(θ_μ) of the Critic network, and the corresponding policy gradient;

Step 2.2: calculate the update target V_t^tar using the generalized advantage estimation (GAE) method, with maximization of the reward value R_1* as the objective;

Step 2.3: optimize the protagonist policy parameters θ_μ.
Further, in step 2.1:

The loss function L^CLIP(θ_μ) of the Actor network is:

$$L^{CLIP}(\theta_\mu)=\hat{\mathbb{E}}_t\left[\min\left(r_t(\theta_\mu)\hat{A}_t,\;\mathrm{clip}\left(r_t(\theta_\mu),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right]$$

wherein θ_μ are the protagonist policy parameters, clip(r_t(θ_μ), 1-ε, 1+ε) is the clipping function, r_t(θ_μ) is the probability ratio between the new and old Actor networks, Â_t is the estimate of the advantage function, and ε is the clipping hyperparameter, generally taken as 0.2;

The loss function L^VF(θ_μ) of the Critic network is:

$$L^{VF}(\theta_\mu)=\hat{\mathbb{E}}_t\left[\left(v_{\theta_\mu}(S_t)-V_t^{tar}\right)^2\right]$$

wherein v_{θ_μ}(S_t) is the state value estimated by the Critic network and V_t^tar is the update target calculated by the generalized advantage estimation method;

The policy gradient is:

$$\nabla_{\theta_\mu}=\hat{\mathbb{E}}_t\left[\hat{A}_t\,\nabla_{\theta_\mu}\log p_{\theta_\mu}(A_t\mid S_t)\right]$$

wherein Â_t is the advantage function used to estimate the return value and p_{θ_μ}(A_t | S_t) is the probability of taking action A_t in state S_t under the policy parameters θ_μ.
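For illustration, the clipped Actor loss and the Critic loss above can be written out as follows. This is a minimal sketch using PyTorch tensors; the sign convention (the surrogate is negated so that minimizing it with a gradient-descent optimizer maximizes L^CLIP) and the function and argument names are assumptions, not taken from the patent.

```python
import torch

def ppo_losses(new_logp, old_logp, advantages, values, value_targets, eps=0.2):
    """Clipped surrogate L^CLIP and value loss L^VF for one batch of transitions.
    The same form is used for the protagonist (theta_mu) and the perturber (theta_upsilon)."""
    ratio = torch.exp(new_logp - old_logp)                 # r_t: new/old action probability ratio
    clipped_ratio = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    surrogate = torch.min(ratio * advantages, clipped_ratio * advantages)
    actor_loss = -surrogate.mean()                         # minimize -L^CLIP, i.e. maximize L^CLIP
    critic_loss = ((values - value_targets) ** 2).mean()   # L^VF: squared error to V_t^tar
    return actor_loss, critic_loss
```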
Further, in step 2.2, the update target V_t^tar is calculated with the generalized advantage estimation method, wherein v_{θ_μ}(S_t) is the state value estimated by the protagonist Critic network, r_{t-1} is the reward at time t-1, γ is the discount factor, λ is the GAE parameter, usually between 0.95 and 1, and T is the total number of time steps.
Further, in step 2.3, the protagonist policy parameters are optimized according to

$$R_{1*}=\max_{\mu}\min_{\upsilon}R_{1}(\mu,\upsilon)$$

wherein μ is the protagonist policy, υ is the perturber policy, R_1* is the protagonist reward when the protagonist and the perturber reach equilibrium, and R_1 is the protagonist reward function, expressed as

$$R_{1}=\sum_{t=0}^{T}r_{t}^{1}\left(S_t,A_t^{\mu},A_t^{\upsilon}\right)$$

wherein S_t is the state at time t, A_t^μ and A_t^υ are the actions selected by the protagonist and the perturber according to their respective policies, and r_t^1 is the reward at time t.
Further, step 3 specifically includes the following steps:

Step 3.1: calculate the loss function L^CLIP(θ_υ) of the perturber Actor network, the loss function L^VF(θ_υ) of the Critic network, and the corresponding policy gradient;

Step 3.2: calculate the update target V_t^tar using the generalized advantage estimation method, with maximization of the reward value R_2* as the objective;

Step 3.3: optimize the perturber policy parameters θ_υ.
Further, in step 3.1:

The loss function L^CLIP(θ_υ) of the Actor network is:

$$L^{CLIP}(\theta_\upsilon)=\hat{\mathbb{E}}_t\left[\min\left(r_t(\theta_\upsilon)\hat{A}_t,\;\mathrm{clip}\left(r_t(\theta_\upsilon),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right]$$

wherein θ_υ are the perturber policy parameters, clip(r_t(θ_υ), 1-ε, 1+ε) is the clipping function, r_t(θ_υ) is the probability ratio between the new and old Actor networks, Â_t is the estimate of the advantage function, and ε is the clipping hyperparameter, generally taken as 0.2;

The loss function L^VF(θ_υ) of the Critic network is:

$$L^{VF}(\theta_\upsilon)=\hat{\mathbb{E}}_t\left[\left(v_{\theta_\upsilon}(S_t)-V_t^{tar}\right)^2\right]$$

wherein v_{θ_υ}(S_t) is the state value estimated by the Critic network and V_t^tar is the update target calculated by the generalized advantage estimation method;

The policy gradient is:

$$\nabla_{\theta_\upsilon}=\hat{\mathbb{E}}_t\left[\hat{A}_t\,\nabla_{\theta_\upsilon}\log p_{\theta_\upsilon}(A_t\mid S_t)\right]$$

wherein Â_t is the advantage function used to estimate the return value and p_{θ_υ}(A_t | S_t) is the probability of taking action A_t in state S_t under the policy parameters θ_υ.
Further, in step 3.2, the update target V_t^tar is calculated with the generalized advantage estimation method, wherein v_{θ_υ}(S_t) is the state value estimated by the perturber Critic network, r_{t-1} is the reward at time t-1, γ is the discount factor, λ is the GAE parameter, usually between 0.95 and 1, and T is the total number of time steps.
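The patent does not spell out the exact recursion behind V_t^tar; the sketch below shows the standard generalized advantage estimation computation that matches the quantities listed above (rewards, Critic state values, γ and λ), used for both the protagonist and the perturber. The function name and the bootstrap handling are assumptions.

```python
import numpy as np

def gae_value_targets(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Standard GAE: advantages from the backward recursion over TD residuals,
    value targets as V_t^tar = v(S_t) + A_hat_t. `values` holds v(S_0)..v(S_{T-1})."""
    T = len(rewards)
    values = np.append(values, last_value)        # append bootstrap value for the final state
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD residual delta_t
        gae = delta + gamma * lam * gae                          # A_hat_t = delta_t + gamma*lam*A_hat_{t+1}
        advantages[t] = gae
    value_targets = advantages + values[:-1]      # V_t^tar
    return advantages, value_targets
```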
Further, in step 3.3, the perturber policy parameters are optimized with the corresponding game objective, wherein μ is the protagonist policy, υ is the perturber policy, R_2* is the perturber reward when the protagonist and the perturber reach equilibrium, and R_2 is the reward function of the perturber, which takes the value

R_2 = -R_1.
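Since R_2 = -R_1, only the protagonist reward has to be designed; the perturber simply negates it. The sketch below illustrates the shape of such a reward (the description later mentions combining safety, comfort and efficiency); the specific terms and weights are assumptions and are not the patent's actual reward function.

```python
def protagonist_reward(collided, speed, target_speed, jerk,
                       w_safe=10.0, w_eff=1.0, w_comf=0.1):
    """Illustrative R1 combining safety, efficiency and comfort terms.
    All weights and term shapes are placeholders."""
    r_safety = -w_safe if collided else 0.0
    r_efficiency = -w_eff * abs(speed - target_speed) / max(target_speed, 1e-6)
    r_comfort = -w_comf * abs(jerk)
    return r_safety + r_efficiency + r_comfort

def perturber_reward(*args, **kwargs):
    """R2 = -R1: the perturber is rewarded exactly when the protagonist is penalized."""
    return -protagonist_reward(*args, **kwargs)
```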
The beneficial effects of the invention are as follows:

(1) Continuous control of the automated vehicle is realized based on the PPO algorithm; a reward function is constructed jointly from the aspects of safety, comfort and efficiency, and the target vehicle finally achieves driving operations such as lane following, lane changing and steering. Compared with the DDPG algorithm, the control model built on the PPO algorithm is stronger in safety and comfort, but lacks a certain stability when facing larger disturbances in the simulation environment.

(2) Under the same simulation environment conditions, the automatic driving control model built on adversarial PPO converges faster than the PPO control model during training and reaches a higher cumulative reward; when the running speed, the number of background vehicles and the driving style of the background vehicles are changed, the adversarial PPO control model performs better in overall performance, safety, comfort and efficiency when controlling the target vehicle, its car-following and lane-changing trajectories are more stable, and its overall robustness is higher.

(3) An asymmetric reward function is constructed: the reward function is first built from the protagonist's driving objective and is then made asymmetric according to the adversarial strategy. Reasonable design reduces the difficulty of setting the reward function and accelerates the training of the adversarial model.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings of the embodiments are briefly described below. It is apparent that the drawings in the following description relate only to some embodiments of the present invention and do not limit the present invention.

FIG. 1 is a structural diagram of the adversarial reinforcement learning algorithm according to the present invention;

FIG. 2 is a schematic diagram of the training results of the adversarial PPO model according to the present invention;

FIG. 3 is a comparison of the cumulative rewards of the adversarial PPO and PPO models according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments are described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are some, but not all, of the embodiments of the invention. All other embodiments obtained by a person skilled in the art without creative effort, based on the described embodiments, fall within the protection scope of the present invention.
The invention will be further described with reference to the drawings and examples.
An automatic driving control method based on adversarial reinforcement learning comprises the following steps:

Step 1: build the training environment, design the protagonist and the perturber, and initialize the neural networks; denote the protagonist policy parameters as θ_μ and the perturber policy parameters as θ_υ, and set the number of protagonist iterations to N_μ and the number of perturber iterations to N_υ.

Step 2: for the first N_μ iterations, keep the perturber policy parameters θ_υ fixed and optimize the protagonist policy parameters θ_μ. At time step t, the protagonist observes state S_t and takes action A_t^μ; the state then transitions to S_(t+1) and the reward r_t^1 is obtained. The protagonist parameters θ_μ are optimized to maximize the game objective so as to reach a Nash equilibrium.

The expected reward is calculated using the following formula:

$$\bar{R}(\theta_\mu)=\mathbb{E}_{S_0,\;A_t\sim\mu(\cdot\mid S_t;\theta_\mu),\;S_{t+1}\sim p(\cdot\mid S_t,A_t)}\left[\sum_{t=0}^{T}\gamma^{t}R(S_t,A_t)\right]$$

where μ denotes the protagonist policy, θ_μ its policy parameters, p the state transition function, γ^t the discount factor, T the total number of time steps, R(S_t, A_t) the reward at time t, S_0 the initial state, S_t the state at time t, and A_t the action decided by the protagonist at time t.
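The policies μ and υ and their parameters θ_μ and θ_υ referred to above are realized by the Actor and Critic neural networks initialized in step 1. Below is a minimal parameterization sketch in PyTorch; the layer sizes, activations and the Gaussian action distribution are illustrative assumptions and are not specified by the patent.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """One network of this form is built for the protagonist (parameters theta_mu)
    and one for the perturber (parameters theta_upsilon). Sizes and activations are assumed."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.actor = nn.Sequential(                          # policy head: outputs the action mean
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim), nn.Tanh())
        self.log_std = nn.Parameter(torch.zeros(act_dim))    # learned action std (log scale)
        self.critic = nn.Sequential(                         # value head: estimates v(S_t)
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1))

    def forward(self, obs):
        mean = self.actor(obs)
        dist = torch.distributions.Normal(mean, self.log_std.exp())
        return dist, self.critic(obs).squeeze(-1)
```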
Wherein, step 2 specifically comprises the following steps:

Step 2.1: calculate the loss function of the protagonist Actor network, and the loss function and gradient of the Critic network.

In step 2.1:

The loss function L^CLIP(θ_μ) of the Actor network is:

$$L^{CLIP}(\theta_\mu)=\hat{\mathbb{E}}_t\left[\min\left(r_t(\theta_\mu)\hat{A}_t,\;\mathrm{clip}\left(r_t(\theta_\mu),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right]$$

wherein θ_μ are the protagonist policy parameters, clip(r_t(θ_μ), 1-ε, 1+ε) is the clipping function, r_t(θ_μ) is the probability ratio between the new and old Actor networks, Â_t is the estimate of the advantage function, and ε is the clipping hyperparameter, generally taken as 0.2;

The loss function L^VF(θ_μ) of the Critic network is:

$$L^{VF}(\theta_\mu)=\hat{\mathbb{E}}_t\left[\left(v_{\theta_\mu}(S_t)-V_t^{tar}\right)^2\right]$$

wherein v_{θ_μ}(S_t) is the state value estimated by the Critic network and V_t^tar is the update target calculated by the generalized advantage estimation (GAE) method;

The policy gradient is:

$$\nabla_{\theta_\mu}=\hat{\mathbb{E}}_t\left[\hat{A}_t\,\nabla_{\theta_\mu}\log p_{\theta_\mu}(A_t\mid S_t)\right]$$

wherein Â_t is the advantage function used to estimate the return value and p_{θ_μ}(A_t | S_t) is the probability of taking action A_t in state S_t under the policy parameters θ_μ.

Step 2.2: calculate the update target with the generalized advantage estimation method, with maximization of the reward value as the objective.

In step 2.2, the update target V_t^tar is calculated with the generalized advantage estimation method, where v_{θ_μ}(S_t) is the state value estimated by the protagonist Critic network, r_{t-1} is the reward at time t-1, γ is the discount factor, λ is the GAE parameter, usually between 0.95 and 1, and T is the total number of time steps.

Step 2.3: optimize the protagonist policy parameters.

In step 2.3, the protagonist policy parameters are optimized according to

$$R_{1*}=\max_{\mu}\min_{\upsilon}R_{1}(\mu,\upsilon)$$

where μ is the protagonist policy, υ is the perturber policy, R_1* is the protagonist reward when the protagonist and the perturber reach equilibrium, and R_1 is the protagonist reward function, expressed as

$$R_{1}=\sum_{t=0}^{T}r_{t}^{1}\left(S_t,A_t^{\mu},A_t^{\upsilon}\right)$$

where S_t is the state at time t, A_t^μ and A_t^υ are the actions selected by the protagonist and the perturber according to their respective policies, and r_t^1 is the reward at time t.

Step 3: for the subsequent N_υ iterations, keep the protagonist policy parameters θ_μ fixed and optimize the perturber policy parameters θ_υ. At time step t, the perturber observes state S_t and takes action A_t^υ; the state then transitions to S_(t+1) and the corresponding reward is obtained. The perturber parameters θ_υ are optimized to maximize the game objective so as to reach a Nash equilibrium.

The expected reward is calculated using the following formula:

$$\bar{R}(\theta_\upsilon)=\mathbb{E}_{S_0,\;A_t\sim\upsilon(\cdot\mid S_t;\theta_\upsilon),\;S_{t+1}\sim p(\cdot\mid S_t,A_t)}\left[\sum_{t=0}^{T}\gamma^{t}R(S_t,A_t)\right]$$

where υ is the perturber policy, θ_υ its policy parameters, p the state transition function, γ^t the discount factor, T the total number of time steps, R(S_t, A_t) the reward at time t, S_0 the initial state, S_t the state at time t, and A_t the action decided by the perturber at time t.
Wherein, step 3 comprises the following specific steps:

Step 3.1: calculate the loss function L^CLIP(θ_υ) of the perturber Actor network, the loss function L^VF(θ_υ) of the Critic network, and the corresponding policy gradient.

In step 3.1:

The loss function L^CLIP(θ_υ) of the Actor network is:

$$L^{CLIP}(\theta_\upsilon)=\hat{\mathbb{E}}_t\left[\min\left(r_t(\theta_\upsilon)\hat{A}_t,\;\mathrm{clip}\left(r_t(\theta_\upsilon),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right]$$

wherein θ_υ are the perturber policy parameters, clip(r_t(θ_υ), 1-ε, 1+ε) is the clipping function, r_t(θ_υ) is the probability ratio between the new and old Actor networks, Â_t is the estimate of the advantage function, and ε is the clipping hyperparameter, generally taken as 0.2;

The loss function L^VF(θ_υ) of the Critic network is:

$$L^{VF}(\theta_\upsilon)=\hat{\mathbb{E}}_t\left[\left(v_{\theta_\upsilon}(S_t)-V_t^{tar}\right)^2\right]$$

wherein v_{θ_υ}(S_t) is the state value estimated by the Critic network and V_t^tar is the update target calculated by the generalized advantage estimation method;

The policy gradient is:

$$\nabla_{\theta_\upsilon}=\hat{\mathbb{E}}_t\left[\hat{A}_t\,\nabla_{\theta_\upsilon}\log p_{\theta_\upsilon}(A_t\mid S_t)\right]$$

wherein Â_t is the advantage function used to estimate the return value and p_{θ_υ}(A_t | S_t) is the probability of taking action A_t in state S_t under the policy parameters θ_υ.

Step 3.2: calculate the update target V_t^tar with the generalized advantage estimation method, with maximization of the reward value R_2* as the objective.

In step 3.2, the update target V_t^tar is calculated with the generalized advantage estimation method, where v_{θ_υ}(S_t) is the state value estimated by the perturber Critic network, r_{t-1} is the reward at time t-1, γ is the discount factor, λ is the GAE parameter, usually between 0.95 and 1, and T is the total number of time steps.

Step 3.3: optimize the perturber policy parameters θ_υ.

In step 3.3, the perturber policy parameters are optimized with the corresponding game objective, where μ is the protagonist policy, υ is the perturber policy, R_2* is the perturber reward when the protagonist and the perturber reach equilibrium, and R_2 is the reward function of the perturber, which takes the value

R_2 = -R_1.

Step 4: alternately execute steps 2 and 3 until the protagonist and the perturber have each completed training.

Step 5: the protagonist and the perturber interact to obtain control of the target vehicle; under the incentive of the target reward function, the perturber minimizes the reward, the protagonist maximizes the reward, and the protagonist finally obtains a vehicle control policy that resists the perturber.
Examples
1. Simulation environment design
In the simulation environment, the road consists of three sections: a straight section, a curved section and another straight section. The lane width is set to 3.75 m, the road has two unidirectional lanes, and the curved section satisfies the minimum turning radius. In the simulation training the minimum road speed limit is 60 km/h and the maximum speed limit is 120 km/h.
Vehicles in the simulation environment are of two types, the target vehicle and background vehicles. Control of the target vehicle is obtained by the protagonist or the perturber, which respectively execute safe or dangerous driving operations. The background vehicles are controlled by a traditional control model, and the politeness parameter ρ is set to 0 in the simulation training, so that the disturbance to the protagonist's safe driving comes mainly from the perturber. After the protagonist gains experience through repeated interaction with the environment, the target vehicle acquires the ability to resist environmental disturbance and learns how to drive safely.
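The patent does not name the simulator it uses, so the sketch below only illustrates the role handover described in this section through a generic Gym-style interface: whichever role is currently training sends the control action for the target vehicle, and (as stated in the abstract) the perturber's action range is clipped to a narrower interval. All class, method and parameter names are assumptions.

```python
import numpy as np

class TwoRoleControl:
    """Routes control of the target vehicle to the protagonist or the perturber
    and restricts the perturber's action range (the bounds are illustrative)."""
    def __init__(self, env, adv_action_scale=0.5):
        self.env = env                           # Gym-style env driving the target vehicle
        self.adv_action_scale = adv_action_scale

    def reset(self):
        return self.env.reset()

    def step(self, action, role):
        action = np.asarray(action, dtype=np.float32)
        if role == "perturber":
            # limited action space: the perturber only gets a fraction of the full range
            action = np.clip(action, -self.adv_action_scale, self.adv_action_scale)
        return self.env.step(action)             # (observation, reward, done, info)
```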
In the adversarial simulation training, the simulation scene still assumes the following:

(1) The road in the simulation scene keeps a continuous traffic flow; slow traffic caused by increased traffic density and congestion caused by traffic accidents are not considered.

(2) Among the static elements of the simulation environment, only the geometric dimensions of the lanes are kept, and the influence of road surface materials and traffic signs on driving is ignored; among the dynamic traffic factors, only the influence of the background vehicles on the target vehicle is considered, and changes in natural conditions such as weather and illumination, as well as emergencies other than those designed into the experiment, are not considered.

(3) All vehicles other than the target vehicle are background vehicles, which are controlled by the designed theoretical driving model and do not have the driving randomness or driving risk of the target vehicle.
2. Simulation training set-up
Within the reinforcement learning framework, the perturber should sample the worst-case working space as much as possible during simulation training, adding environmental disturbance that restricts the protagonist's safe driving, while avoiding the problem that an overly capable perturber makes the protagonist's reward grow too slowly in the early stage so that no experience can be learned. Through the setting of the reward function and the training rounds, N_iter = 500, N_pro = 10 and N_adv = 1 are set; that is, in each alternating round the protagonist is trained 10 times and the perturber once, and the alternating training loops 500 times. This guarantees the perturber's influence on the protagonist's safety control while ensuring that the protagonist has enough training to learn how to resist the perturbation.
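The alternating schedule described above (and in steps 2 to 4 of the method) can be sketched as a simple loop. The episode-level training routine is elided into a `train_one_episode` placeholder; everything except the counts N_iter = 500, N_pro = 10 and N_adv = 1 is an illustrative assumption.

```python
def alternating_training(train_one_episode, n_iter=500, n_pro=10, n_adv=1):
    """One alternating round = 10 protagonist training episodes + 1 perturber episode;
    500 such rounds in total, i.e. 500 * 10 = 5000 protagonist training episodes."""
    for round_idx in range(n_iter):
        for _ in range(n_pro):
            # perturber parameters theta_upsilon frozen, protagonist parameters theta_mu updated
            train_one_episode(role="protagonist")
        for _ in range(n_adv):
            # protagonist parameters theta_mu frozen, perturber parameters theta_upsilon updated
            train_one_episode(role="perturber")
```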
Based on the design of the related asymmetric reward function, the training hyperparameters set for the adversarial PPO control model in this experiment are shown in Table 1.
Within the reinforcement learning framework, one protagonist-plus-perturber cycle of training counts as one round. Each time N_iter increases by one, the protagonist completes 10 trainings and the perturber completes 1 training. Over the N_iter rounds, the protagonist learns how to maximize the cumulative reward while the perturber minimizes the cumulative reward as much as possible in each round of training. In the adversarial simulation training, the protagonist and the perturber respectively obtain control of the target vehicle, and the target vehicle may collide with a background vehicle or the road edge, or exhibit driving behaviors such as driving too slowly or driving in reverse. According to the training objectives of the protagonist and the perturber, the termination conditions of the experiment are designed in three aspects, abnormal vehicle state, abnormal vehicle position, and simulation termination, as summarized in the check sketched after this list:

(1) Abnormal vehicle state

1) The target vehicle collides. When the target vehicle collides with a background vehicle or the road edge, the current training round of the protagonist or the perturber ends and the next new training round begins.

2) The target vehicle speed is abnormal. In a protagonist training round, when the speed of the target vehicle does not reach the set minimum speed, or exceeds the maximum limit, within the set time, the training round ends and the next new training round begins.

(2) Abnormal vehicle position

1) The vehicle drives out of the prescribed road. When the target vehicle drives out of the prescribed test road range, the current training round of the protagonist or the perturber ends and the next new training round begins.

2) The vehicle drives in reverse. In a protagonist training round, when the driving direction of the target vehicle is opposite to the forward direction of the road, or the target vehicle decelerates and backs up, the training round ends and the next new training round begins.

(3) Simulation termination

1) The maximum simulation step or round is reached. When the simulation time steps of a single round reach the maximum, the round ends; when the training rounds of the protagonist or the perturber reach the maximum, N_iter is incremented by one and the next alternating training round begins; when the number of alternating training rounds reaches the set value, the simulation training ends.
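The termination conditions above can be grouped into one check per simulation step. The sketch below is an assumed implementation: the state keys, the threshold names and the simplification of the "within a set time" speed criterion are illustrative.

```python
def episode_done(state, step, role, max_steps, v_min_kmh=60.0, v_max_kmh=120.0):
    """Returns (done, reason) for the current training episode."""
    if state["collided"]:                    # collision with a background vehicle or the road edge
        return True, "collision"
    if state["off_road"]:                    # drove out of the prescribed test road range
        return True, "off_road"
    if role == "protagonist":
        speed_kmh = state["speed_kmh"]
        if speed_kmh < v_min_kmh or speed_kmh > v_max_kmh:
            return True, "abnormal_speed"    # below the minimum or above the maximum limit
        if state["reversing"]:               # driving against the road direction or backing up
            return True, "reverse_driving"
    if step >= max_steps:                    # maximum simulation steps for a single episode
        return True, "max_steps"
    return False, ""
```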
TABLE 1 Hyperparameter settings for the adversarial simulation training
3. Model training result analysis
In the simulation scene, the initial speed of the target vehicle is set to 65 km/h; 5 fixed background vehicles are selected to induce the target vehicle to follow and change lanes, and 5 more background vehicles are generated at random within the driving range of the road, with initial speeds of 60-70 km/h, to increase the randomness of the test environment. The adversarial model undergoes 5000 rounds of simulation training; the model iterates continuously during training, and the reward finally stabilizes at about 170 and converges. As shown in FIG. 2, in the initial training phase (roughly the first 1000 rounds), the cumulative return of the protagonist is negative and rises slowly, owing to the influence of the perturber and the inexperience of the model. After a certain number of rounds of adversarial play, the adversarial model discovers experience for avoiding danger, and the cumulative reward rises rapidly in the middle of training. Around round 2500, the cumulative reward, after rising, settles into a stable range. The cumulative return per round then gradually stabilizes, indicating that the adversarial model has converged and can safely control the automated driving vehicle.
The cumulative reward curves of the adversarial PPO algorithm and the PPO algorithm are compared. As shown in FIG. 3, the final cumulative reward of the adversarial model stabilizes around 170 while that of the PPO model stabilizes around 140, indicating that the adversarial model performs better under the same number of training rounds. Judged from the trend of the fitted lines, the PPO model rises faster in the early training phase (before round 1300), while the adversarial PPO model, influenced by the perturber, has a lower cumulative reward and rises more slowly. Afterwards, as the adversarial PPO model gains experience, its cumulative reward gradually catches up with and overtakes that of the PPO model, and finally the cumulative rewards of the two models separate and each converges.
The present invention is not limited to the above embodiments. Any modifications, equivalent substitutions and improvements made without departing from the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (9)
1. An automatic driving control method based on adversarial reinforcement learning, characterized by comprising the following steps:

step 1: building the training environment, designing the protagonist and the perturber, and initializing the neural networks; denoting the protagonist policy parameters as θ_μ and the perturber policy parameters as θ_υ, and setting the number of protagonist iterations to N_μ and the number of perturber iterations to N_υ;

step 2: for the first N_μ iterations, keeping the perturber policy parameters θ_υ fixed and optimizing the protagonist policy parameters θ_μ; at time step t, the protagonist observes state S_t and takes action A_t^μ, the state then transitions to S_(t+1) and the reward is obtained, and the protagonist parameters θ_μ are optimized to maximize the game objective so as to reach a Nash equilibrium;

the expected reward is calculated using the following formula:

$$\bar{R}(\theta_\mu)=\mathbb{E}_{S_0,\;A_t\sim\mu(\cdot\mid S_t;\theta_\mu),\;S_{t+1}\sim p(\cdot\mid S_t,A_t)}\left[\sum_{t=0}^{T}\gamma^{t}R(S_t,A_t)\right]$$

wherein μ denotes the protagonist policy, θ_μ its policy parameters, p the state transition function, γ^t the discount factor, T the total number of time steps, R(S_t, A_t) the reward at time t, S_0 the initial state, S_t the state at time t, and A_t the action decided by the protagonist at time t;

step 3: for the subsequent N_υ iterations, keeping the protagonist policy parameters θ_μ fixed and optimizing the perturber policy parameters θ_υ; at time step t, the perturber observes state S_t and takes action A_t^υ, the state then transitions to S_(t+1) and the reward is obtained, and the perturber parameters θ_υ are optimized to maximize the game objective so as to reach a Nash equilibrium;

the expected reward is calculated using the following formula:

$$\bar{R}(\theta_\upsilon)=\mathbb{E}_{S_0,\;A_t\sim\upsilon(\cdot\mid S_t;\theta_\upsilon),\;S_{t+1}\sim p(\cdot\mid S_t,A_t)}\left[\sum_{t=0}^{T}\gamma^{t}R(S_t,A_t)\right]$$

wherein υ denotes the perturber policy, θ_υ its policy parameters, p the state transition function, γ^t the discount factor, T the total number of time steps, R(S_t, A_t) the reward at time t, S_0 the initial state, S_t the state at time t, and A_t the action decided by the perturber at time t;

step 4: alternately executing steps 2 and 3 until the protagonist and the perturber have each completed training;

step 5: the protagonist and the perturber interact to obtain control of the target vehicle; under the incentive of the target reward function, the perturber minimizes the reward, the protagonist maximizes the reward, and the protagonist finally obtains a vehicle control policy that resists the perturber.
2. The automatic driving control method based on adversarial reinforcement learning according to claim 1, wherein step 2 specifically comprises the following steps:

step 2.1: calculating the loss function L^CLIP(θ_μ) of the protagonist Actor network, the loss function L^VF(θ_μ) of the Critic network, and the corresponding policy gradient;

step 2.2: calculating the update target V_t^tar using the generalized advantage estimation method, with maximization of the reward value R_1* as the objective;

step 2.3: optimizing the protagonist policy parameters θ_μ.
3. The automatic driving control method based on adversarial reinforcement learning according to claim 2, wherein in step 2.1:

the loss function L^CLIP(θ_μ) of the Actor network is:

$$L^{CLIP}(\theta_\mu)=\hat{\mathbb{E}}_t\left[\min\left(r_t(\theta_\mu)\hat{A}_t,\;\mathrm{clip}\left(r_t(\theta_\mu),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right]$$

wherein θ_μ are the protagonist policy parameters, clip(r_t(θ_μ), 1-ε, 1+ε) is the clipping function, r_t(θ_μ) is the probability ratio between the new and old Actor networks, Â_t is the estimate of the advantage function, and ε is the clipping hyperparameter, generally taken as 0.2;

the loss function L^VF(θ_μ) of the Critic network is:

$$L^{VF}(\theta_\mu)=\hat{\mathbb{E}}_t\left[\left(v_{\theta_\mu}(S_t)-V_t^{tar}\right)^2\right]$$

wherein v_{θ_μ}(S_t) is the state value estimated by the Critic network and V_t^tar is the update target calculated by the generalized advantage estimation method;

the policy gradient is:

$$\nabla_{\theta_\mu}=\hat{\mathbb{E}}_t\left[\hat{A}_t\,\nabla_{\theta_\mu}\log p_{\theta_\mu}(A_t\mid S_t)\right]$$

wherein Â_t is the advantage function used to estimate the return value and p_{θ_μ}(A_t | S_t) is the probability of taking action A_t in state S_t under the policy parameters θ_μ.
4. The automatic driving control method based on adversarial reinforcement learning according to claim 2, wherein in step 2.2 the update target V_t^tar is calculated with the generalized advantage estimation method, wherein v_{θ_μ}(S_t) is the state value estimated by the protagonist Critic network, r_{t-1} is the reward at time t-1, γ is the discount factor, λ is the GAE parameter, usually between 0.95 and 1, and T is the total number of time steps.
5. The automatic driving control method based on adversarial reinforcement learning according to claim 2, wherein in step 2.3 the protagonist policy parameters are optimized according to

$$R_{1*}=\max_{\mu}\min_{\upsilon}R_{1}(\mu,\upsilon)$$

wherein μ is the protagonist policy, υ is the perturber policy, R_1* is the protagonist reward when the protagonist and the perturber reach equilibrium, and R_1 is the protagonist reward function, expressed as

$$R_{1}=\sum_{t=0}^{T}r_{t}^{1}\left(S_t,A_t^{\mu},A_t^{\upsilon}\right)$$

wherein S_t is the state at time t, A_t^μ and A_t^υ are the actions selected by the protagonist and the perturber according to their respective policies, and r_t^1 is the reward at time t.
6. The automatic driving control method based on adversarial reinforcement learning according to claim 1, wherein step 3 specifically comprises the following steps:

step 3.1: calculating the loss function L^CLIP(θ_υ) of the perturber Actor network, the loss function L^VF(θ_υ) of the Critic network, and the corresponding policy gradient;

step 3.2: calculating the update target V_t^tar using the generalized advantage estimation method, with maximization of the reward value R_2* as the objective;

step 3.3: optimizing the perturber policy parameters θ_υ.
7. The automatic driving control method based on adversarial reinforcement learning according to claim 6, wherein in step 3.1:

the loss function L^CLIP(θ_υ) of the Actor network is:

$$L^{CLIP}(\theta_\upsilon)=\hat{\mathbb{E}}_t\left[\min\left(r_t(\theta_\upsilon)\hat{A}_t,\;\mathrm{clip}\left(r_t(\theta_\upsilon),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right]$$

wherein θ_υ are the perturber policy parameters, clip(r_t(θ_υ), 1-ε, 1+ε) is the clipping function, r_t(θ_υ) is the probability ratio between the new and old Actor networks, Â_t is the estimate of the advantage function, and ε is the clipping hyperparameter, generally taken as 0.2;

the loss function L^VF(θ_υ) of the Critic network is:

$$L^{VF}(\theta_\upsilon)=\hat{\mathbb{E}}_t\left[\left(v_{\theta_\upsilon}(S_t)-V_t^{tar}\right)^2\right]$$

wherein v_{θ_υ}(S_t) is the state value estimated by the Critic network and V_t^tar is the update target calculated by the generalized advantage estimation method;

the policy gradient is:

$$\nabla_{\theta_\upsilon}=\hat{\mathbb{E}}_t\left[\hat{A}_t\,\nabla_{\theta_\upsilon}\log p_{\theta_\upsilon}(A_t\mid S_t)\right]$$

wherein Â_t is the advantage function used to estimate the return value and p_{θ_υ}(A_t | S_t) is the probability of taking action A_t in state S_t under the policy parameters θ_υ.
8. The automatic driving control method based on adversarial reinforcement learning according to claim 6, wherein in step 3.2 the update target V_t^tar is calculated with the generalized advantage estimation method, wherein v_{θ_υ}(S_t) is the state value estimated by the perturber Critic network, r_{t-1} is the reward at time t-1, γ is the discount factor, λ is the GAE parameter, usually between 0.95 and 1, and T is the total number of time steps.
9. The automatic driving control method based on adversarial reinforcement learning according to claim 6, wherein in step 3.3 the perturber policy parameters are optimized with the corresponding game objective, wherein μ is the protagonist policy, υ is the perturber policy, R_2* is the perturber reward when the protagonist and the perturber reach equilibrium, and R_2 is the reward function of the perturber, which takes the value

R_2 = -R_1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410010711.0A CN117826603A (en) | 2024-01-04 | 2024-01-04 | Automatic driving control method based on countermeasure reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410010711.0A CN117826603A (en) | 2024-01-04 | 2024-01-04 | Automatic driving control method based on countermeasure reinforcement learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117826603A true CN117826603A (en) | 2024-04-05 |
Family
ID=90512930
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410010711.0A Pending CN117826603A (en) | 2024-01-04 | 2024-01-04 | Automatic driving control method based on countermeasure reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117826603A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118651439A (en) * | 2024-08-16 | 2024-09-17 | 西北工业大学 | Star group avoidance autonomous decision-making method based on self-adaption MADDPG |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||