
CN117826603A - Automatic driving control method based on adversarial reinforcement learning - Google Patents

Automatic driving control method based on adversarial reinforcement learning

Info

Publication number
CN117826603A
Authority
CN
China
Prior art keywords
perturber
angle
policy
state
function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410010711.0A
Other languages
Chinese (zh)
Inventor
郝俊锋
任文龙
贺小平
宋周林
黄先昊
夏伟峰
赖薇
赵霄
李凯
冉光炯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Wisdom High Speed Technology Co ltd
Original Assignee
Sichuan Wisdom High Speed Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Wisdom High Speed Technology Co ltd filed Critical Sichuan Wisdom High Speed Technology Co ltd
Priority to CN202410010711.0A priority Critical patent/CN117826603A/en
Publication of CN117826603A publication Critical patent/CN117826603A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05BCONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B13/04Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
    • G05B13/042Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses an automatic driving control method based on adversarial reinforcement learning. The adversarial reinforcement learning uses the PPO algorithm as its base algorithm: an automatic driving control model based on an adversarial PPO algorithm is built in a single-agent environment, and two roles, a protagonist and a perturber (adversary), are introduced into the vehicle simulation environment and interact to obtain control of the target vehicle. Under the incentive of the target reward function, the action space of the perturber is restricted and the target reward function is modified, so that the perturber tends to take dangerous driving actions to minimize the reward when it controls the vehicle, while the protagonist maximizes the reward; after a certain number of training rounds, the protagonist's control policy gains the ability to resist more disturbances.

Description

Automatic driving control method based on adversarial reinforcement learning
Technical Field
The invention relates to the technical field of artificial intelligence and automatic driving, and in particular to an automatic driving control method based on adversarial reinforcement learning.
Background
Automatic driving test methods fall mainly into two categories. The first, represented by Waymo, is real-road testing: the reliability of a rule-based automatic driving algorithm is verified by running the vehicle in an actual road environment. Real-vehicle testing yields real data and reliable results, but the operating conditions it can cover are limited; in particular, the heavy-tailed distribution of test scenarios is difficult to address with real-vehicle testing, and the cost and risk are high. The second category is virtual simulation testing: a test-scenario framework is built, real traffic data and expert experience are collected to form a complete scenario library, and driving simulation is then carried out with computer simulation software. Virtual simulation can enlarge scenario coverage and support large-scale, all-round testing, avoiding the risk of testing on actual roads and greatly improving efficiency and cost; however, because of the gap between the virtual environment and the real road, the reliability of simulation results still needs further verification. In recent years, with the success of DRL algorithms in games, language, sports and other fields, many researchers have begun to apply DRL in more challenging domains and to put theoretical research into practice. The strong self-learning and evolution capability of reinforcement learning and its suitability for high-dimensional, complex problems match the requirements of automatic driving technology well; in the automatic driving field, many researchers therefore combine DRL algorithms with fast simulation, and reinforcement learning has been widely applied to the control of automatic driving vehicles, becoming one of the preferred methods for realizing vehicle automatic driving. Reinforcement learning is a class of learning problems in machine learning. Unlike supervised and unsupervised learning, reinforcement learning gains experience through interaction between an agent and an environment: at each moment the agent observes the environment and takes an action based on the current state; after receiving the action, the environment produces a corresponding reward or penalty based on the resulting state; and the agent then optimizes its next action according to that feedback.
In the early stage of automatic driving technology, control models based on theory and rules allowed vehicles to complete some simple driver-assistance operations. However, as application scenarios become more complex and the level of automation rises, such control models can no longer provide sufficiently refined vehicle control and safe driving, and it is difficult to improve the overall capability of the automatic driving vehicle. The strong self-learning and evolution capability of deep reinforcement learning and its suitability for high-dimensional, complex problems fit the development needs of current automatic driving technology well. At present, deep reinforcement learning outperforms other artificial intelligence algorithms in vehicle control, and its ability to learn state features automatically makes it better suited than traditional control methods for studying automatic driving control in road traffic whose environment is complex and hard to categorize. Existing deep reinforcement learning algorithms have been studied extensively for automatic driving control in simulation, and simulation testing has many advantages; however, simulation scenarios lack the uncertain disturbances of real roads, so the robustness of these algorithms still needs improvement, which limits their application. Moreover, because of the way reinforcement learning learns from observations, the observations are easily disturbed, which ultimately biases the results.
Disclosure of Invention
In view of the above, the present invention provides an automatic driving control method based on adversarial reinforcement learning.
The invention adopts the following technical scheme:
An automatic driving control method based on adversarial reinforcement learning, characterized by comprising the following steps:
step 1: building training environment, designing principal angle andperturbers, initializing the neural network, and expressing principal angle strategy parameters as theta μ The perturbator policy parameter is denoted as θ υ Setting the iteration times of the principal angle as N μ The number of the perturber iterations is N υ
Step 2: front N μ Iterative times, keeping the perturbator strategy θ υ Parameters are unchanged, and the main angle policy parameters theta μ Optimizing; at time step t, the main angle observes state S t And take actionThen the state transitions to S t+1 Obtain rewarding->Optimizing principal angle parameter θ with maximized game to achieve Nash equilibrium μ
The expected reward is calculated using the following formula:
$$\bar{R}(\theta_\mu)=\mathbb{E}_{S_0,\,A_t\sim\mu_{\theta_\mu},\,S_{t+1}\sim p}\!\left[\sum_{t=0}^{T}\gamma^{t}R(S_t,A_t)\right]$$
where μ denotes the protagonist policy, θ_μ the protagonist policy parameters, p the state transition function, γ^t the discount factor, T the total number of time steps, R(S_t, A_t) the reward at time t, S_0 the initial state, S_t the state at time t, and A_t the action decided by the protagonist at time t.
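As a quick numerical illustration (not part of the patent), the discounted sum inside this expectation can be computed from one sampled episode as shown below; the reward values are made up purely for demonstration:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_{t=0}^{T} gamma^t * R(S_t, A_t) for one sampled episode."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# The expectation in the formula above is approximated by averaging this
# quantity over many episodes sampled under the current policies.
episodes = [[1.0, 0.5, -0.2], [0.8, 0.9]]   # illustrative reward sequences
expected_reward = sum(discounted_return(ep) for ep in episodes) / len(episodes)
```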
Step 3: for the last N_ν iterations, keep the protagonist policy parameters θ_μ fixed and optimize the perturber policy parameters θ_ν. At time step t, the perturber observes state S_t and takes action A_t^ν; the state then transitions to S_{t+1} and the perturber receives reward r_t^2. The perturber parameters θ_ν are optimized by maximizing in the game so as to reach a Nash equilibrium.
The expected reward is calculated using the following formula:
$$\bar{R}(\theta_\nu)=\mathbb{E}_{S_0,\,A_t\sim\nu_{\theta_\nu},\,S_{t+1}\sim p}\!\left[\sum_{t=0}^{T}\gamma^{t}R(S_t,A_t)\right]$$
where ν denotes the perturber policy, θ_ν the perturber policy parameters, p the state transition function, γ^t the discount factor, T the total number of time steps, R(S_t, A_t) the reward at time t, S_0 the initial state, S_t the state at time t, and A_t the action decided by the perturber at time t.
Step 4: execute steps 2 and 3 alternately until the protagonist and the perturber have each finished training.
Step 5: the protagonist and the perturber interact in turn to obtain control of the target vehicle; under the incentive of the target reward function the perturber minimizes the reward while the protagonist maximizes the reward, and the protagonist finally obtains a vehicle control policy that resists the perturber.
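The overall alternating scheme of steps 2-5 can be sketched in code as follows. This is a minimal illustration, not the patented implementation: `collect_rollout`, the agent objects and their `ppo_update` method are hypothetical placeholders for the PPO machinery detailed in steps 2.1-3.3 below.

```python
def adversarial_training(env, protagonist, perturber, collect_rollout,
                         n_rounds, n_mu, n_nu):
    """Alternate PPO updates between protagonist and perturber (steps 2-4).

    `collect_rollout(env, actor, opponent)` and `agent.ppo_update(rollout)`
    are assumed interfaces, not part of the patent text.
    """
    for _ in range(n_rounds):
        # Step 2: N_mu iterations optimizing the protagonist while the
        # perturber parameters theta_nu stay fixed (it only acts).
        for _ in range(n_mu):
            rollout = collect_rollout(env, actor=protagonist, opponent=perturber)
            protagonist.ppo_update(rollout)       # maximize reward R^1

        # Step 3: N_nu iterations optimizing the perturber while the
        # protagonist parameters theta_mu stay fixed.
        for _ in range(n_nu):
            rollout = collect_rollout(env, actor=perturber, opponent=protagonist)
            perturber.ppo_update(rollout)         # maximize R^2 = -R^1

    # Step 5: after training, the protagonist policy alone is deployed
    # as the vehicle control strategy that resists the perturber.
    return protagonist
```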
Further, the step 2 specifically includes the following steps:
step 2.1: calculating a loss function L of a main angle Actor network CLIPμ ) Loss function L of Critic network VFμ ) And gradient of
Step 2.2: calculating update target V by using generalized dominance estimation method t tar With a prize value R 1* Maximum is the target;
step 2.3: policy parameter θ for principal angle μ And (5) optimizing.
Further, in step 2.1:
the loss function of the Actor network is
$$L^{CLIP}(\theta_\mu)=\mathbb{E}_t\!\left[\min\!\left(r_t(\theta_\mu)\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta_\mu),1-\epsilon,1+\epsilon\right)\hat{A}_t\right)\right]$$
where θ_μ are the protagonist policy parameters, clip(·) is the clipping function, r_t(θ_μ) is the probability ratio between the new and old Actor networks, Â_t is the advantage estimate, and ε is the clipping hyperparameter, generally taken as 0.2;
the loss function of the Critic network is
$$L^{VF}(\theta_\mu)=\mathbb{E}_t\!\left[\left(v_{\theta_\mu}(S_t)-V_t^{tar}\right)^{2}\right]$$
where v_{θ_μ}(S_t) is the state value estimated by the Critic network and V_t^tar is the update target calculated by generalized advantage estimation;
the policy gradient is
$$\nabla_{\theta_\mu}=\mathbb{E}_t\!\left[\hat{A}_t\,\nabla_{\theta_\mu}\log p_{\theta_\mu}(A_t\mid S_t)\right]$$
where Â_t is the advantage function used to estimate the return and p_{θ_μ}(A_t | S_t) is the probability of action A_t in state S_t under policy parameters θ_μ.
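A minimal PyTorch sketch of the two losses is given below; the tensor names and the sign convention (returning losses to be minimized) are assumptions of this illustration, not part of the patent text.

```python
import torch

def ppo_losses(new_logp, old_logp, advantages, values, v_target, eps=0.2):
    """Clipped surrogate (Actor) and squared-error (Critic) losses.

    new_logp / old_logp: log-probabilities of the taken actions under the
    current and the pre-update Actor network, so exp(new - old) is the
    probability ratio r_t referred to above.
    """
    ratio = torch.exp(new_logp - old_logp)               # r_t(theta)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)   # clip(r_t, 1-eps, 1+eps)
    actor_loss = -torch.mean(torch.min(ratio * advantages, clipped * advantages))
    critic_loss = torch.mean((values - v_target) ** 2)   # (v(S_t) - V_t^tar)^2
    return actor_loss, critic_loss
```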
Further, in step 2.2, the update target V_t^tar is calculated by generalized advantage estimation using the following formula:
$$V_t^{tar}=v_{\theta_\mu}(S_t)+\sum_{l=0}^{T-t-1}(\gamma\lambda)^{l}\,\delta_{t+l},\qquad \delta_t=r_t+\gamma\,v_{\theta_\mu}(S_{t+1})-v_{\theta_\mu}(S_t)$$
where v_{θ_μ}(S_t) is the state value estimated by the protagonist Critic network, r_t is the reward at time t, γ is the discount factor, λ is the GAE parameter, usually between 0.95 and 1, and T is the total number of time steps.
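The GAE recursion can be implemented as below; this is a sketch under the usual GAE formulation, with array shapes and names chosen for illustration only.

```python
import numpy as np

def gae_targets(rewards, values, gamma=0.99, lam=0.97):
    """Update targets V_t^tar from generalized advantage estimation.

    `values` holds the critic estimates v(S_0)..v(S_T); `rewards` holds
    r_0..r_{T-1}; lam is the GAE parameter (typically 0.95-1).
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    v_targets = advantages + values[:T]   # V_t^tar = A_hat_t + v(S_t)
    return v_targets, advantages
```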
Further, in step 2.3, the protagonist policy parameters are optimized using the following formula:
$$R^{1*}=\max_{\mu}\min_{\nu}R^{1}(\mu,\nu)$$
where μ is the protagonist policy, ν is the perturber policy, R^{1*} is the protagonist reward when the protagonist and the perturber reach equilibrium, and R^1 is the protagonist reward function, expressed as
$$R^{1}(\mu,\nu)=\mathbb{E}\!\left[\sum_{t=0}^{T}\gamma^{t}\,r^{1}_{t}\!\left(S_t,A^{\mu}_{t},A^{\nu}_{t}\right)\right]$$
where S_t is the state at time t, A_t^μ and A_t^ν are the actions selected by the protagonist and the perturber according to their respective policies, and r_t^1 is the reward at time t.
Further, the step 3 specifically includes the following steps:
step 3.1: calculating loss function L of a perturbator Actor network CLIPυ ) Loss function L of Critic network VFυ ) And gradient of
Step 3.2: calculating update target V by using generalized dominance estimation method t tar With a prize value R 2* Maximum is the target;
step 3.3: policy parameter θ for perturbers υ And (5) optimizing.
Further, in step 3.1:
the loss function of the Actor network is
$$L^{CLIP}(\theta_\nu)=\mathbb{E}_t\!\left[\min\!\left(r_t(\theta_\nu)\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta_\nu),1-\epsilon,1+\epsilon\right)\hat{A}_t\right)\right]$$
where θ_ν are the perturber policy parameters, clip(·) is the clipping function, r_t(θ_ν) is the probability ratio between the new and old Actor networks, Â_t is the advantage estimate, and ε is the clipping hyperparameter, generally taken as 0.2;
the loss function of the Critic network is
$$L^{VF}(\theta_\nu)=\mathbb{E}_t\!\left[\left(v_{\theta_\nu}(S_t)-V_t^{tar}\right)^{2}\right]$$
where v_{θ_ν}(S_t) is the state value estimated by the Critic network and V_t^tar is the update target calculated by generalized advantage estimation;
the policy gradient is
$$\nabla_{\theta_\nu}=\mathbb{E}_t\!\left[\hat{A}_t\,\nabla_{\theta_\nu}\log p_{\theta_\nu}(A_t\mid S_t)\right]$$
where Â_t is the advantage function used to estimate the return and p_{θ_ν}(A_t | S_t) is the probability of action A_t in state S_t under policy parameters θ_ν.
Further, in step 3.2, the update target V_t^tar is calculated using the following formula:
$$V_t^{tar}=v_{\theta_\nu}(S_t)+\sum_{l=0}^{T-t-1}(\gamma\lambda)^{l}\,\delta_{t+l},\qquad \delta_t=r_t+\gamma\,v_{\theta_\nu}(S_{t+1})-v_{\theta_\nu}(S_t)$$
where v_{θ_ν}(S_t) is the state value estimated by the perturber Critic network, r_t is the reward at time t, γ is the discount factor, λ is the GAE parameter, usually between 0.95 and 1, and T is the total number of time steps.
Further, in step 3.3, the perturber policy parameters are optimized using the following formula:
$$R^{2*}=\max_{\nu}\min_{\mu}R^{2}(\mu,\nu)$$
where μ is the protagonist policy, ν is the perturber policy, R^{2*} is the perturber reward when the protagonist and the perturber reach equilibrium, and R^2 is the perturber reward function, which takes the value
$$R^{2}=-R^{1}.$$
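In code, the zero-sum coupling between the two reward signals can be expressed as below; the individual safety/comfort/efficiency terms and their weights are placeholders invented for illustration, and only the relation R^2 = -R^1 comes from the text.

```python
def step_rewards(collision, speed_mps, jerk, v_min=60 / 3.6, v_max=120 / 3.6):
    """Illustrative per-step rewards for protagonist (r1) and perturber (r2)."""
    safety = -10.0 if collision else 0.0              # placeholder safety term
    efficiency = 1.0 if v_min <= speed_mps <= v_max else -1.0
    comfort = -0.1 * abs(jerk)                        # placeholder comfort term
    r1 = safety + efficiency + comfort                # protagonist reward
    r2 = -r1                                          # perturber reward (zero-sum)
    return r1, r2
```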
The beneficial effects of the invention are as follows:
(1) Continuous control of the autonomous vehicle is realized based on the PPO algorithm; the reward function is constructed jointly from the aspects of safety, comfort and efficiency, and the target vehicle finally performs driving operations such as lane following, lane changing and steering. Compared with the DDPG algorithm, the control model built on the PPO algorithm is safer and more comfortable, but it lacks a certain stability in the face of larger disturbances in the simulation environment.
(2) Under the same simulation conditions, the automatic driving control model built on adversarial PPO converges faster during training than the PPO control model and reaches a higher cumulative reward; when the driving speed, the number of background vehicles and the driving style of the background vehicles change, the adversarial PPO control model performs better in overall performance, safety, comfort and efficiency when controlling the target vehicle, its car-following and lane-changing trajectories are more stable, and its overall robustness is higher.
(3) An asymmetric reward function is constructed: the reward function is first built from the protagonist's driving objective and then made asymmetric according to the adversarial strategy; a reasonable design reduces the difficulty of setting the reward function and speeds up the training of the adversarial model.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the following brief description of the drawings of the embodiments will make it apparent that the drawings in the following description relate only to some embodiments of the present invention and are not limiting of the present invention.
FIG. 1 is a structural diagram of the adversarial reinforcement learning algorithm according to the present invention;
FIG. 2 is a schematic diagram of the training results of the adversarial PPO model according to the present invention;
FIG. 3 is a comparison of the cumulative rewards of the adversarial PPO and PPO models according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more clear, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. It will be apparent that the described embodiments are some, but not all, embodiments of the invention. All other embodiments, which can be made by a person skilled in the art without creative efforts, based on the described embodiments of the present invention fall within the protection scope of the present invention.
The invention will be further described with reference to the drawings and examples.
An automatic driving control method based on adversarial reinforcement learning comprises the following steps:
step 1: building a training environment, designing principal angles and perturbers, initializing a neural network, and expressing principal angle strategy parameters as theta μ The perturbator policy parameter is denoted as θ υ Setting the iteration times of the principal angle as N μ The number of the perturber iterations is N υ
Step 2: front N μ Iterative times, keeping the perturbator strategy θ υ Parameters are unchanged, and the main angle policy parameters theta μ Optimizing; at time step t, the main angle observes state S t And take actionThen the state transitions to S t+1 Obtain rewarding->Optimizing principal angle parameter θ with maximized game to achieve Nash equilibrium μ
The expected rewards are calculated using the following formula:
wherein μ represents the principal angle policy, θ μ Is the principal angle policy parameter, p is the state transfer function, gamma t Is a discount factor, T is the total time step, R (S t ,A t ) For rewarding at time t, S 0 Is in an initial state S t In the state of t time, A t The action obtained by the principal angle decision at the time t.
Wherein, the step 2 specifically comprises the following steps:
step 2.1: and calculating the loss function of the main angle Actor network, the loss function and the gradient of the Critic network.
Wherein, the step 2.1:
The loss function of the Actor network is
$$L^{CLIP}(\theta_\mu)=\mathbb{E}_t\!\left[\min\!\left(r_t(\theta_\mu)\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta_\mu),1-\epsilon,1+\epsilon\right)\hat{A}_t\right)\right]$$
where θ_μ are the protagonist policy parameters, clip(·) is the clipping function, r_t(θ_μ) is the probability ratio between the new and old Actor networks, Â_t is the advantage estimate, and ε is the clipping hyperparameter, generally taken as 0.2;
the loss function of the Critic network is
$$L^{VF}(\theta_\mu)=\mathbb{E}_t\!\left[\left(v_{\theta_\mu}(S_t)-V_t^{tar}\right)^{2}\right]$$
where v_{θ_μ}(S_t) is the state value estimated by the Critic network and V_t^tar is the update target calculated by generalized advantage estimation;
the policy gradient is
$$\nabla_{\theta_\mu}=\mathbb{E}_t\!\left[\hat{A}_t\,\nabla_{\theta_\mu}\log p_{\theta_\mu}(A_t\mid S_t)\right]$$
where Â_t is the advantage function used to estimate the return and p_{θ_μ}(A_t | S_t) is the probability of action A_t in state S_t under policy parameters θ_μ.
Step 2.2: and calculating an updating target by using a generalization dominance estimation method, and taking the maximum rewarding value as the target.
Wherein, in the step 2.2, the update target is calculated by using the following formula by using the generalized dominance estimation method
Wherein,state value estimated for principal angle Critic network, r t-1 For the reward at time T-1, gamma is the discount factor, lambda is the GAE parameter, usually between 0.95 and 1, and T is the total time step.
Step 2.3: and optimizing the strategy parameters of the principal angles.
Wherein, in the step 2.3, the policy parameters of the principal angle are optimized by adopting the following formula:
wherein μ is the principal angle policy, v is the perturbator policy, R 1* For the main angle rewards when the main angle and the perturber reach equilibrium, R 1 The bonus function, which is the principal angle, is expressed as follows,
wherein S is t The state at the time of t is the state,and->Representing actions selected by principal angles and perturbers according to respective strategies,>is a reward at time t.
Step 3: for the last N_ν iterations, keep the protagonist policy parameters θ_μ fixed and optimize the perturber policy parameters θ_ν. At time step t, the perturber observes state S_t and takes action A_t^ν; the state then transitions to S_{t+1} and the perturber receives reward r_t^2. The perturber parameters θ_ν are optimized by maximizing in the game so as to reach a Nash equilibrium.
The expected reward is calculated using the following formula:
$$\bar{R}(\theta_\nu)=\mathbb{E}_{S_0,\,A_t\sim\nu_{\theta_\nu},\,S_{t+1}\sim p}\!\left[\sum_{t=0}^{T}\gamma^{t}R(S_t,A_t)\right]$$
where ν denotes the perturber policy, θ_ν the perturber policy parameters, p the state transition function, γ^t the discount factor, T the total number of time steps, R(S_t, A_t) the reward at time t, S_0 the initial state, S_t the state at time t, and A_t the action decided by the perturber at time t.
Wherein, the step 3 comprises the following specific steps:
step 3.1: calculating loss function L of a perturbator Actor network CLIPυ ) Loss function L of Critic network VFυ ) And gradient of
In step 3.1,
the loss function of the Actor network is
$$L^{CLIP}(\theta_\nu)=\mathbb{E}_t\!\left[\min\!\left(r_t(\theta_\nu)\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta_\nu),1-\epsilon,1+\epsilon\right)\hat{A}_t\right)\right]$$
where θ_ν are the perturber policy parameters, clip(·) is the clipping function, r_t(θ_ν) is the probability ratio between the new and old Actor networks, Â_t is the advantage estimate, and ε is the clipping hyperparameter, generally taken as 0.2;
the loss function of the Critic network is
$$L^{VF}(\theta_\nu)=\mathbb{E}_t\!\left[\left(v_{\theta_\nu}(S_t)-V_t^{tar}\right)^{2}\right]$$
where v_{θ_ν}(S_t) is the state value estimated by the Critic network and V_t^tar is the update target calculated by generalized advantage estimation;
the policy gradient is
$$\nabla_{\theta_\nu}=\mathbb{E}_t\!\left[\hat{A}_t\,\nabla_{\theta_\nu}\log p_{\theta_\nu}(A_t\mid S_t)\right]$$
where Â_t is the advantage function used to estimate the return and p_{θ_ν}(A_t | S_t) is the probability of action A_t in state S_t under policy parameters θ_ν.
Step 3.2: calculating update target V by using generalized dominance estimation method t tar With a prize value R 2* The maximum is the target.
In the step 3.2, the update target V is calculated by the following formula t tar
In the method, in the process of the invention,state value estimated for perturber Critic network, r t-1 For the reward at time T-1, gamma is the discount factor, lambda is the GAE parameter, usually between 0.95 and 1, and T is the total time step.
Step 3.3: policy parameter θ for perturbers υ And (5) optimizing.
In the step 3.3, the policy parameters of the principal angle are optimized by adopting the following formula:
wherein μ is the principal angle policy, v is the perturber policy, R 2* For the main angle rewards when the main angle and the perturber reach equilibrium, R 2 A reward function for the perturber, which takes the value of,
R 2 =-R 1
Step 4: execute steps 2 and 3 alternately until the protagonist and the perturber have each finished training.
Step 5: the protagonist and the perturber interact in turn to obtain control of the target vehicle; under the incentive of the target reward function the perturber minimizes the reward while the protagonist maximizes the reward, and the protagonist finally obtains a vehicle control policy that resists the perturber.
Examples
1. Simulation environment design
In the simulation environment, the road consists of three sections: a straight section, a curved section and a straight section. The lane width is set to 3.75 m, with two lanes in one direction; the curved section satisfies the minimum turning radius; the minimum speed limit of the road in the simulation training is 60 km/h and the maximum speed limit is 120 km/h.
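The road and speed-limit parameters above can be collected into a single configuration object; the keys below are illustrative names, not an interface defined by the patent.

```python
# Illustrative simulation configuration for the scenario described above.
SIM_CONFIG = {
    "road_sections": ["straight", "curve", "straight"],
    "lane_width_m": 3.75,
    "lanes_per_direction": 2,
    "min_turning_radius_respected": True,
    "speed_limit_kmh": {"min": 60, "max": 120},
}
```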
The vehicles in the simulation environment are divided into the target vehicle and background vehicles. Control of the target vehicle is obtained by the protagonist and the perturber in turn, which execute safe or dangerous driving operations respectively. The background vehicles are controlled by a conventional control model, and the politeness parameter ρ is set to 0 in the simulation training, so that the disturbance to the protagonist's safe driving mainly comes from the perturber. After the protagonist gains experience through repeated interaction with the environment, the target vehicle acquires the ability to resist environmental disturbances and learns how to drive safely.
In the adversarial simulation training, the simulation scenario still makes the following assumptions:
(1) The road in the simulation scenario maintains a continuous traffic flow; congestion or slow traffic caused by increased traffic density or by traffic accidents is not considered.
(2) The static elements of the simulation environment retain only the geometric dimensions of the lanes, ignoring the influence of road surface materials and traffic signs on driving; among dynamic traffic factors, only the influence of background vehicles on the target vehicle is considered, and changes in natural conditions such as weather and illumination, as well as emergencies outside the experimental design, are not considered.
(3) All vehicles other than the target vehicle are background vehicles; they are controlled by a designed theoretical driving model and do not have the driving randomness and driving risk of the target vehicle.
2. Simulation training set-up
Under the reinforcement learning framework, environmental disturbances are added so that the perturber samples the worst-case working conditions as much as possible during simulation training and restricts the protagonist's safe driving, while avoiding the problem that an over-capable perturber makes the protagonist's reward grow too slowly in the early stage for it to learn from experience. This is achieved through the design of the reward function and the training rounds: N_iter = 500, N_pro = 10 and N_adv = 1, i.e. in each alternating loop the protagonist is trained 10 times and the perturber once, and 500 such alternating loops are run. This ensures that the perturber influences the protagonist's safety control while the protagonist still receives enough training to learn how to resist the disturbance.
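The 10:1 alternation over 500 rounds can be written as a simple schedule generator; this is a sketch, and how each training session is carried out is left to the PPO update routines described earlier.

```python
N_ITER = 500   # alternating training rounds
N_PRO = 10     # protagonist training sessions per round
N_ADV = 1      # perturber training sessions per round

def training_schedule():
    """Yield (round_index, role) pairs for the alternating schedule."""
    for it in range(N_ITER):
        for _ in range(N_PRO):
            yield it, "protagonist"
        for _ in range(N_ADV):
            yield it, "perturber"

# Example: count how many protagonist updates the full schedule contains.
n_pro_updates = sum(1 for _, role in training_schedule() if role == "protagonist")
assert n_pro_updates == N_ITER * N_PRO
```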
Based on the design of the corresponding asymmetric reward function, the training hyperparameters set for the adversarial PPO control model in this experiment are shown in Table 1.
In the reinforcement learning framework, one round of training consists of training the protagonist and the perturber. Each time N_iter increases by one, the protagonist completes 10 training sessions and the perturber completes 1. Over the N_iter rounds the protagonist learns how to maximize the cumulative reward while the perturber minimizes it as much as possible. In the adversarial simulation training, the protagonist and the perturber obtain control of the target vehicle in turn, and the target vehicle may collide with a background vehicle or the road edge, or exhibit driving behaviours such as driving too slowly or driving in reverse. According to the training objectives of the protagonist and the perturber, the experimental termination conditions are divided into three aspects: abnormal vehicle state, abnormal vehicle position, and simulation termination (a minimal check for these conditions is sketched after the list below):
(1) Abnormal vehicle state
1) The target vehicle collides. When the target vehicle collides with a background vehicle or the road edge, the current training round of the protagonist or the perturber ends and the next training round begins.
2) The target vehicle speed is abnormal. In a protagonist training round, when the speed of the target vehicle does not reach the set minimum speed within the set time, or exceeds the maximum limit, the training round ends and the next training round begins.
(2) Abnormal vehicle position
1) The vehicle drives off the prescribed road. When the target vehicle drives out of the specified test road range, the current training round of the protagonist or the perturber ends and the next training round begins.
2) The vehicle drives in reverse. In a protagonist training round, when the driving direction of the target vehicle is opposite to the forward direction of the road, or the target vehicle decelerates and reverses, the training round ends and the next training round begins.
(3) Simulation termination
1) The maximum simulation step or round is reached. When the simulation time steps of a single round reach the maximum, the round ends; when the training rounds of the protagonist or the perturber reach the maximum, N_iter is incremented by one and the next alternating training round begins; when the number of alternating training rounds reaches the set value, the simulation training ends.
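A minimal check of these termination conditions might look as follows; the `state` field names are hypothetical stand-ins for quantities the simulator would expose, and the timed speed check is simplified.

```python
def episode_done(state, step, max_steps):
    """Return (done, reason) for the termination conditions listed above."""
    if state["collision"]:                                   # (1) abnormal vehicle state
        return True, "collision"
    if not (state["v_min"] <= state["speed"] <= state["v_max"]):
        return True, "speed abnormal"                        # simplified timed check
    if not state["on_road"]:                                 # (2) abnormal vehicle position
        return True, "left the test road"
    if not state["heading_ok"]:
        return True, "reverse driving"
    if step >= max_steps:                                    # (3) simulation termination
        return True, "maximum simulation steps reached"
    return False, ""
```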
TABLE 1 Hyperparameter settings for the adversarial simulation training
3. Model training result analysis
In the simulation scenario, the initial speed of the target vehicle is set to 65 km/h; 5 fixed background vehicles are selected to induce the target vehicle to follow and change lanes, and 5 background vehicles are generated at random within the driving range of the road to increase the randomness of the test environment, travelling at initial speeds of 60-70 km/h. The adversarial model undergoes 5000 rounds of simulation training; the model iterates continuously during training, and the reward finally stabilizes at about 170 and converges. As shown in FIG. 2, in the initial training phase (roughly the first 1000 rounds), the cumulative return of the protagonist is negative and rises slowly because of the influence of the perturber and the inexperience of the model. After a certain number of rounds of "confrontation", the adversarial model discovers experience for avoiding danger, and the cumulative reward rises rapidly in the middle of training. Around round 2500 the cumulative reward of the model, after rising, returns to a stable range. The cumulative return per round then gradually stabilizes, indicating that the adversarial model has converged and can safely control the autonomous vehicle.
The cumulative reward curves of the adversarial PPO algorithm and the PPO algorithm are compared. As shown in FIG. 3, the final cumulative reward of the adversarial model stabilizes around 170 while that of the PPO model stabilizes around 140, indicating that the adversarial model performs better for the same number of training rounds. Judging from the trend of the fitted lines, the PPO model rises faster in the early training phase (before round 1300), while the adversarial PPO model, affected by the perturber, has a lower cumulative reward and rises more slowly. As experience accumulates, the cumulative reward of the adversarial PPO model gradually catches up with and overtakes that of the PPO model, and finally the cumulative rewards of the two models separate and converge respectively.
The present invention is not limited to the above embodiments; any modification, equivalent substitution or improvement made to the above embodiments without departing from the scope of the invention falls within the protection scope of the invention.

Claims (9)

1. An automatic driving control method based on adversarial reinforcement learning, characterized by comprising the following steps:
step 1: build the training environment, design the protagonist and the perturber, and initialize the neural networks; denote the protagonist policy parameters as θ_μ and the perturber policy parameters as θ_ν; set the number of protagonist iterations to N_μ and the number of perturber iterations to N_ν;
step 2: for the first N_μ iterations, keep the perturber policy parameters θ_ν fixed and optimize the protagonist policy parameters θ_μ; at time step t the protagonist observes state S_t and takes action A_t^μ, the state then transitions to S_{t+1} and the protagonist receives reward r_t^1; the protagonist parameters θ_μ are optimized by maximizing in the game so as to reach a Nash equilibrium;
the expected reward is calculated using the following formula:
$$\bar{R}(\theta_\mu)=\mathbb{E}_{S_0,\,A_t\sim\mu_{\theta_\mu},\,S_{t+1}\sim p}\!\left[\sum_{t=0}^{T}\gamma^{t}R(S_t,A_t)\right]$$
where μ denotes the protagonist policy, θ_μ the protagonist policy parameters, p the state transition function, γ^t the discount factor, T the total number of time steps, R(S_t, A_t) the reward at time t, S_0 the initial state, S_t the state at time t, and A_t the action decided by the protagonist at time t;
step 3: for the last N_ν iterations, keep the protagonist policy parameters θ_μ fixed and optimize the perturber policy parameters θ_ν; at time step t the perturber observes state S_t and takes action A_t^ν, the state then transitions to S_{t+1} and the perturber receives reward r_t^2; the perturber parameters θ_ν are optimized by maximizing in the game so as to reach a Nash equilibrium;
the expected reward is calculated using the following formula:
$$\bar{R}(\theta_\nu)=\mathbb{E}_{S_0,\,A_t\sim\nu_{\theta_\nu},\,S_{t+1}\sim p}\!\left[\sum_{t=0}^{T}\gamma^{t}R(S_t,A_t)\right]$$
where ν denotes the perturber policy, θ_ν the perturber policy parameters, p the state transition function, γ^t the discount factor, T the total number of time steps, R(S_t, A_t) the reward at time t, S_0 the initial state, S_t the state at time t, and A_t the action decided by the perturber at time t;
step 4: execute steps 2 and 3 alternately until the protagonist and the perturber have each finished training;
step 5: the protagonist and the perturber interact in turn to obtain control of the target vehicle; under the incentive of the target reward function the perturber minimizes the reward while the protagonist maximizes the reward, and the protagonist finally obtains a vehicle control policy that resists the perturber.
2. The automatic driving control method based on adversarial reinforcement learning according to claim 1, wherein step 2 specifically comprises the following steps:
step 2.1: calculate the loss function L^CLIP(θ_μ) of the protagonist Actor network, the loss function L^VF(θ_μ) of the Critic network, and the policy gradient ∇_{θ_μ};
step 2.2: calculate the update target V_t^tar using generalized advantage estimation, taking maximization of the reward value R^{1*} as the objective;
step 2.3: optimize the protagonist policy parameters θ_μ.
3. The automatic driving control method based on adversarial reinforcement learning according to claim 2, wherein in step 2.1:
the loss function of the Actor network is
$$L^{CLIP}(\theta_\mu)=\mathbb{E}_t\!\left[\min\!\left(r_t(\theta_\mu)\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta_\mu),1-\epsilon,1+\epsilon\right)\hat{A}_t\right)\right]$$
where θ_μ are the protagonist policy parameters, clip(·) is the clipping function, r_t(θ_μ) is the probability ratio between the new and old Actor networks, Â_t is the advantage estimate, and ε is the clipping hyperparameter, generally taken as 0.2;
the loss function of the Critic network is
$$L^{VF}(\theta_\mu)=\mathbb{E}_t\!\left[\left(v_{\theta_\mu}(S_t)-V_t^{tar}\right)^{2}\right]$$
where v_{θ_μ}(S_t) is the state value estimated by the Critic network and V_t^tar is the update target calculated by generalized advantage estimation;
the policy gradient is
$$\nabla_{\theta_\mu}=\mathbb{E}_t\!\left[\hat{A}_t\,\nabla_{\theta_\mu}\log p_{\theta_\mu}(A_t\mid S_t)\right]$$
where Â_t is the advantage function used to estimate the return and p_{θ_μ}(A_t | S_t) is the probability of action A_t in state S_t under policy parameters θ_μ.
4. The automatic driving control method based on adversarial reinforcement learning according to claim 2, wherein in step 2.2 the update target V_t^tar is calculated by generalized advantage estimation using the following formula:
$$V_t^{tar}=v_{\theta_\mu}(S_t)+\sum_{l=0}^{T-t-1}(\gamma\lambda)^{l}\,\delta_{t+l},\qquad \delta_t=r_t+\gamma\,v_{\theta_\mu}(S_{t+1})-v_{\theta_\mu}(S_t)$$
where v_{θ_μ}(S_t) is the state value estimated by the protagonist Critic network, r_t is the reward at time t, γ is the discount factor, λ is the GAE parameter, usually between 0.95 and 1, and T is the total number of time steps.
5. The automatic driving control method based on adversarial reinforcement learning according to claim 2, wherein step 2.3 optimizes the protagonist policy parameters using the following formula:
$$R^{1*}=\max_{\mu}\min_{\nu}R^{1}(\mu,\nu)$$
where μ is the protagonist policy, ν is the perturber policy, R^{1*} is the protagonist reward when the protagonist and the perturber reach equilibrium, and R^1 is the protagonist reward function, expressed as
$$R^{1}(\mu,\nu)=\mathbb{E}\!\left[\sum_{t=0}^{T}\gamma^{t}\,r^{1}_{t}\!\left(S_t,A^{\mu}_{t},A^{\nu}_{t}\right)\right]$$
where S_t is the state at time t, A_t^μ and A_t^ν are the actions selected by the protagonist and the perturber according to their respective policies, and r_t^1 is the reward at time t.
6. The automatic driving control method based on adversarial reinforcement learning according to claim 1, wherein step 3 specifically comprises the following steps:
step 3.1: calculate the loss function L^CLIP(θ_ν) of the perturber Actor network, the loss function L^VF(θ_ν) of the Critic network, and the policy gradient ∇_{θ_ν};
step 3.2: calculate the update target V_t^tar using generalized advantage estimation, taking maximization of the reward value R^{2*} as the objective;
step 3.3: optimize the perturber policy parameters θ_ν.
7. The automatic driving control method based on adversarial reinforcement learning according to claim 6, wherein in step 3.1:
the loss function of the Actor network is
$$L^{CLIP}(\theta_\nu)=\mathbb{E}_t\!\left[\min\!\left(r_t(\theta_\nu)\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta_\nu),1-\epsilon,1+\epsilon\right)\hat{A}_t\right)\right]$$
where θ_ν are the perturber policy parameters, clip(·) is the clipping function, r_t(θ_ν) is the probability ratio between the new and old Actor networks, Â_t is the advantage estimate, and ε is the clipping hyperparameter, generally taken as 0.2;
the loss function of the Critic network is
$$L^{VF}(\theta_\nu)=\mathbb{E}_t\!\left[\left(v_{\theta_\nu}(S_t)-V_t^{tar}\right)^{2}\right]$$
where v_{θ_ν}(S_t) is the state value estimated by the Critic network and V_t^tar is the update target calculated by generalized advantage estimation;
the policy gradient is
$$\nabla_{\theta_\nu}=\mathbb{E}_t\!\left[\hat{A}_t\,\nabla_{\theta_\nu}\log p_{\theta_\nu}(A_t\mid S_t)\right]$$
where Â_t is the advantage function used to estimate the return and p_{θ_ν}(A_t | S_t) is the probability of action A_t in state S_t under policy parameters θ_ν.
8. The method according to claim 6, wherein in step 3.2 the update target V_t^tar is calculated using the following formula:
$$V_t^{tar}=v_{\theta_\nu}(S_t)+\sum_{l=0}^{T-t-1}(\gamma\lambda)^{l}\,\delta_{t+l},\qquad \delta_t=r_t+\gamma\,v_{\theta_\nu}(S_{t+1})-v_{\theta_\nu}(S_t)$$
where v_{θ_ν}(S_t) is the state value estimated by the perturber Critic network, r_t is the reward at time t, γ is the discount factor, λ is the GAE parameter, usually between 0.95 and 1, and T is the total number of time steps.
9. The automatic driving control method based on adversarial reinforcement learning according to claim 6, wherein step 3.3 optimizes the perturber policy parameters using the following formula:
$$R^{2*}=\max_{\nu}\min_{\mu}R^{2}(\mu,\nu)$$
where μ is the protagonist policy, ν is the perturber policy, R^{2*} is the perturber reward when the protagonist and the perturber reach equilibrium, and R^2 is the perturber reward function, which takes the value
$$R^{2}=-R^{1}.$$
CN202410010711.0A 2024-01-04 2024-01-04 Automatic driving control method based on countermeasure reinforcement learning Pending CN117826603A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410010711.0A CN117826603A (en) 2024-01-04 2024-01-04 Automatic driving control method based on countermeasure reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410010711.0A CN117826603A (en) 2024-01-04 2024-01-04 Automatic driving control method based on countermeasure reinforcement learning

Publications (1)

Publication Number Publication Date
CN117826603A true CN117826603A (en) 2024-04-05

Family

ID=90512930

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410010711.0A Pending CN117826603A (en) 2024-01-04 2024-01-04 Automatic driving control method based on countermeasure reinforcement learning

Country Status (1)

Country Link
CN (1) CN117826603A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118651439A (en) * 2024-08-16 2024-09-17 西北工业大学 Star group avoidance autonomous decision-making method based on self-adaption MADDPG


Similar Documents

Publication Publication Date Title
CN110297494B (en) Decision-making method and system for lane change of automatic driving vehicle based on rolling game
CN110969848A (en) Automatic driving overtaking decision method based on reinforcement learning under opposite double lanes
CN113511222B (en) Scene self-adaptive vehicle interaction behavior decision and prediction method and device
Hou et al. Autonomous driving at the handling limit using residual reinforcement learning
Sanchez et al. Gene regulated car driving: using a gene regulatory network to drive a virtual car
CN114973650A (en) Vehicle ramp entrance confluence control method, vehicle, electronic device, and storage medium
CN111824182A (en) Three-axis heavy vehicle self-adaptive cruise control algorithm based on deep reinforcement learning
CN116476825B (en) Automatic driving lane keeping control method based on safe and reliable reinforcement learning
CN116894395A (en) Automatic driving test scene generation method, system and storage medium
CN114355897B (en) Vehicle path tracking control method based on model and reinforcement learning hybrid switching
EP4160478A1 (en) Driving decision-making method, device, and chip
Yan et al. A game-theoretical approach to driving decision making in highway scenarios
CN117826603A (en) Automatic driving control method based on countermeasure reinforcement learning
CN117872800A (en) Decision planning method based on reinforcement learning in discrete state space
CN116224996A (en) Automatic driving optimization control method based on countermeasure reinforcement learning
CN113353102B (en) Unprotected left-turn driving control method based on deep reinforcement learning
CN116872971A (en) Automatic driving control decision-making method and system based on man-machine cooperation enhancement
Kaushik et al. Learning driving behaviors for automated cars in unstructured environments
CN116534011A (en) Course reinforcement learning-based control method for bicycle lane change and import fleet
CN116300944A (en) Automatic driving decision method and system based on improved Double DQN
CN116853243A (en) Vehicle self-adaptive cruise control method based on projection constraint strategy optimization
Zhao et al. Imitation of real lane-change decisions using reinforcement learning
CN116027788A (en) Intelligent driving behavior decision method and equipment integrating complex network theory and part of observable Markov decision process
Yang et al. Decision-making in autonomous driving by reinforcement learning combined with planning & control
CN116052411A (en) Diversion area mixed traffic flow control method based on graph neural network reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination