WO2024212657A1 - Method, apparatus, device, medium and program product for training a decision model - Google Patents
Method, apparatus, device, medium and program product for training a decision model
- Publication number
- WO2024212657A1 (PCT/CN2024/073076; CN2024073076W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- strategy
- decision
- model
- training
- loss
- Prior art date
Links
- 238000012549 training Methods 0.000 title claims abstract description 194
- 238000000034 method Methods 0.000 title claims abstract description 76
- 230000002787 reinforcement Effects 0.000 claims abstract description 122
- 230000003044 adaptive effect Effects 0.000 claims description 28
- 230000008859 change Effects 0.000 claims description 26
- 238000005457 optimization Methods 0.000 claims description 25
- 230000006399 behavior Effects 0.000 claims description 23
- 238000004590 computer program Methods 0.000 claims description 14
- 230000008569 process Effects 0.000 description 28
- 238000010586 diagram Methods 0.000 description 11
- 238000004821 distillation Methods 0.000 description 9
- 230000006870 function Effects 0.000 description 9
- 230000007613 environmental effect Effects 0.000 description 5
- 238000012545 processing Methods 0.000 description 5
- 230000003542 behavioural effect Effects 0.000 description 4
- 238000004891 communication Methods 0.000 description 4
- 238000007726 management method Methods 0.000 description 3
- 230000008447 perception Effects 0.000 description 3
- 241000282412 Homo Species 0.000 description 2
- 238000013473 artificial intelligence Methods 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 2
- 230000007423 decrease Effects 0.000 description 2
- 238000000605 extraction Methods 0.000 description 2
- 230000003287 optical effect Effects 0.000 description 2
- 238000004088 simulation Methods 0.000 description 2
- 238000012360 testing method Methods 0.000 description 2
- 230000001133 acceleration Effects 0.000 description 1
- 230000004913 activation Effects 0.000 description 1
- 238000013528 artificial neural network Methods 0.000 description 1
- 238000013480 data collection Methods 0.000 description 1
- 238000013500 data storage Methods 0.000 description 1
- 238000003066 decision tree Methods 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 230000006872 improvement Effects 0.000 description 1
- 230000003993 interaction Effects 0.000 description 1
- 238000010801 machine learning Methods 0.000 description 1
- 239000013307 optical fiber Substances 0.000 description 1
- 239000004065 semiconductor Substances 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Definitions
- Embodiments of the present disclosure generally relate to the field of computers. More specifically, embodiments of the present disclosure relate to methods, devices, equipment, computer-readable storage media, and computer program products for training decision models.
- decision models using artificial intelligence are widely used in fields such as autonomous driving, recommendation decision management, and robot control decision management.
- decision models can be used to determine driving behaviors such as lane changing and braking according to road conditions, thereby achieving autonomous driving.
- a large amount of expert data needs to be collected to train decision models based on supervised learning.
- decision models based on reinforcement learning need to build complex reward functions to learn decision-making experience. Therefore, a solution for training decision models is needed to train human-like decision models with excellent performance.
- An embodiment of the present disclosure provides a solution for training a decision model.
- a method for training a decision model includes determining, based on training data, a first strategy using a supervised learning model in the decision model and a second strategy using a reinforcement learning model in the decision model; determining an imitation learning loss based on the difference between the first strategy and the second strategy; and training the decision model based on the imitation learning loss and the reinforcement learning loss corresponding to the second strategy.
- a decision model applied to the field of autonomous driving can be trained to provide strategies such as lane changing.
- training the decision model includes: determining an adaptive weight for the imitation learning loss; determining an overall learning loss based on the adaptive weight and the imitation learning loss and the reinforcement learning loss; and training the decision model by minimizing the overall learning loss.
- determining the adaptive weight for the imitation learning loss includes: determining an initial weight for the imitation learning loss; before reaching a predetermined training round, updating the initial weight based on a change in the imitation learning loss to determine an updated weight; and after reaching the predetermined round, gradually reducing the updated weight.
- updating the initial weight based on the change of the imitation learning loss includes: if the imitation learning loss of the initial training round is less than the imitation learning loss of the subsequent training round, increasing the initial weight; and if the imitation learning loss of the initial training round is greater than the imitation learning loss of the subsequent training round, maintaining the initial weight.
- the reinforcement learning model can focus more on "imitating" human strategies in the early stages of training, and more on free exploration in the later stages of training, thereby obtaining a decision network that combines the advantages of both supervised learning and reinforcement learning.
- determining the imitation learning loss based on the difference between the first strategy and the second strategy includes: normalizing the first strategy and the second strategy; and determining the imitation learning loss based on the normalized distance between the first strategy and the second strategy.
- the method also includes: training the supervised learning model based on labeled expert data; determining the reasoning performance of the supervised learning model trained based on the expert data, wherein the reasoning performance indicates the quality of the prediction strategy for each decision scenario in a plurality of decision scenarios; and determining the data distribution in the training data corresponding to the plurality of decision scenarios based on the reasoning performance of the supervised learning model.
- the method further comprises: determining a reasoning performance of the reinforcement learning model, the reasoning performance indicating a quality of a prediction strategy for each decision scenario in a plurality of decision scenarios; updating, based on the reasoning performance of the reinforcement learning model, the data distribution in the training data corresponding to the plurality of decision scenarios to determine updated training data; and training the decision model based on the updated training data.
- the data distribution for decision-making scenarios in the training data can be dynamically adjusted, thereby directionally improving the reasoning performance of the decision model for specific decision-making scenarios.
- the method further comprises: generating at least a portion of the training data using a simulator.
- generating data using the simulator comprises: based on at least one of a strategy determined by the reinforcement learning model or a random strategy, using the simulator to generate a behavior corresponding to the at least one strategy as at least a portion of the training data. In this way, the simulator can be used to increase the amount of training data.
- training the decision model includes: determining a supervised learning loss corresponding to the first strategy; and training the decision model based on the imitation learning loss, the reinforcement learning loss, and the supervised learning loss.
- the method also includes: using the trained decision model or the trained reinforcement learning model to determine a driving strategy based on driving-related input data, the driving strategy including at least one of the following: left lane change, right lane change, going straight, overtaking, turning left, turning right, stopping, accelerating, decelerating, and braking.
- a device for training a decision model includes a strategy determination unit, configured to determine a first strategy based on training data using a supervised learning model in the decision model and determine a second strategy using a reinforcement learning model in the decision model; a loss determination unit, configured to determine an imitation learning loss based on a difference between the first strategy and the second strategy; and an optimization unit, configured to train the decision model based on the imitation learning loss and the reinforcement learning loss corresponding to the second strategy.
- the optimization unit is further configured to: determine an adaptive weight for the imitation learning loss; determine an overall learning loss based on the adaptive weight and the imitation learning loss and the reinforcement learning loss; and train the decision model by minimizing the overall learning loss.
- the optimization unit is further configured to: determine an initial weight for the imitation learning loss; before reaching a predetermined training round, update the initial weight based on the change of the imitation learning loss to determine an updated weight; and after reaching the predetermined round, gradually reduce the updated weight.
- the optimization unit is further configured to: increase the initial weight if the imitation learning loss of the initial training round is less than the imitation learning loss of the subsequent training round; and maintain the initial weight if the imitation learning loss of the initial training round is greater than the imitation learning loss of the subsequent training round.
- the device also includes a training data determination unit, which is configured to: train the supervised learning model based on labeled expert data; determine the reasoning performance of the supervised learning model trained based on the expert data, wherein the reasoning performance indicates the quality of the prediction strategy for each decision scenario in a plurality of decision scenarios; and determine the data distribution in the training data corresponding to the plurality of decision scenarios based on the reasoning performance of the supervised learning model.
- the apparatus further comprises a simulator utilizing unit, the simulator utilizing unit being configured to: utilize a simulator to generate at least a portion of the training data.
- the simulator utilizing unit is further configured to: based on at least one of a strategy determined by the reinforcement learning model or a random strategy, use the simulator to generate a behavior corresponding to the at least one strategy as at least a portion of the training data.
- the device also includes a directional optimization unit, which is configured to: determine the reasoning performance of the reinforcement learning model, the reasoning performance indicating the quality of the prediction strategy for each decision scenario in a plurality of decision scenarios; based on the reasoning performance of the reinforcement learning model, update the data distribution in the training data corresponding to the plurality of decision scenarios to determine updated training data; and train the decision model based on the updated training data.
- the loss determination unit is further configured to: normalize the first strategy and the second strategy; and determine the imitation learning loss based on the normalized distance between the first strategy and the second strategy.
- the optimization unit is further configured to: determine a supervised learning loss corresponding to the first strategy; and train the decision model based on the imitation learning loss, the reinforcement learning loss, and the supervised learning loss.
- the device also includes a decision model utilization unit, which is configured to: use the trained decision model or the trained reinforcement learning model to determine a driving strategy based on driving-related input data, the driving strategy including at least one of the following: left lane change, right lane change, going straight, overtaking, turning left, turning right, stopping, accelerating, decelerating, and braking.
- an electronic device comprising: at least one computing unit; and at least one memory, wherein the at least one memory is coupled to the at least one computing unit and stores instructions for execution by the at least one computing unit, and when the instructions are executed by the at least one computing unit, the device implements the method provided in the first aspect.
- a computer-readable storage medium on which a computer program is stored, wherein the computer program is executed by a processor to implement the method provided in the first aspect.
- a computer program product comprising computer executable instructions, which implement part or all of the steps of the method of the first aspect when the instructions are executed by a processor.
- the electronic device of the third aspect, the computer storage medium of the fourth aspect, or the computer program product of the fifth aspect provided above is used to execute at least a portion of the method provided in the first aspect. Therefore, the explanation or description of the first aspect is also applicable to the second aspect, the third aspect, the fourth aspect, and the fifth aspect.
- for the beneficial effects that can be achieved in the second aspect, the third aspect, the fourth aspect, and the fifth aspect, reference may be made to the beneficial effects of the corresponding method, which will not be repeated here.
- FIG. 1 is a schematic diagram showing an example environment in which various embodiments of the present disclosure can be implemented
- FIGS. 2A-2B are schematic diagrams showing an example process of training a decision model according to some embodiments of the present disclosure
- FIG. 3 is a schematic diagram showing an example process of training a decision model in stages according to some embodiments of the present disclosure
- FIG. 4 shows a flowchart of an example process of training a decision model based on a decision scenario according to some embodiments of the present disclosure
- FIG. 5 is a flowchart showing a process of an example method for training a decision model according to some embodiments of the present disclosure
- FIG. 6 shows a schematic block diagram of an apparatus for training a decision model according to some embodiments of the present disclosure.
- FIG. 7 illustrates a block diagram of a computing device capable of implementing various embodiments of the present disclosure.
- decision models suitable for complex scenarios are difficult to train.
- using supervised learning to train decision models usually requires collecting a large amount of expert data so that the decision model can imitate human behavior, thereby obtaining a human-like decision model.
- the distribution of training data is usually uneven.
- expert data generally does not contain negative samples, and the data scenarios are limited, resulting in low robustness of the decision model and potential safety hazards.
- training a decision model with reinforcement learning, in turn, requires a carefully designed reward function.
- supervised learning can be used to train the feature extractor in the decision model, and the feature vector obtained by the feature extractor is used as the input of the reinforcement learning model.
- the feature extractor trained by supervised learning can be used to obtain more accurate low-dimensional features, thereby reducing the amount of data and time required for reinforcement learning.
- the reinforcement learning model in this scheme cannot utilize the human strategies contained in the expert data, and the use of expert data is not efficient.
- a method for training a decision model includes: based on the training data, determining a first strategy using a supervised learning model in the decision model and determining a second strategy using a reinforcement learning model in the decision model.
- the method also includes determining an imitation learning loss based on the difference between the first strategy and the second strategy.
- the method also includes training the decision model based on the imitation learning loss and the reinforcement learning loss corresponding to the second strategy.
- a decision model suitable for the field of autonomous driving can be trained to provide strategies such as lane changing.
- FIG. 1 shows a schematic diagram of an example environment 100 in which multiple embodiments of the present disclosure can be implemented.
- an example environment in which the solution of training a decision model according to the present disclosure can be applied is shown by taking the field of autonomous driving as an example.
- Autonomous driving technology generally includes three aspects: road information perception and reasoning, behavior decision-making, and path planning.
- the perception module 110 can process the original radar and camera information of the road and surrounding vehicles into road and vehicle information with physical meaning.
- the decision module 120 can determine the upper-level decision-making behavior, such as lane change, overtaking, left turn, etc., based on the perceived road and vehicle information.
- the decision module 120 can use the decision model 125 to determine the decision-making behavior, that is, the strategy. Examples of strategies may include left lane change, right lane change, straight driving, overtaking, left turn, right turn, parking, acceleration, deceleration, braking, etc.
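- For illustration only, such a discrete strategy could be represented as a probability distribution over a fixed list of behaviors and resolved to a single decision roughly as in the following Python sketch (the behavior ordering and the argmax-style selection are assumptions, not part of the disclosure):

```python
# Hypothetical ordering of the behaviors mentioned above; the disclosure does not fix an order.
BEHAVIORS = [
    "left_lane_change", "right_lane_change", "straight", "overtake",
    "left_turn", "right_turn", "stop", "accelerate", "decelerate", "brake",
]

def select_behavior(strategy):
    """Pick the behavior with the highest probability/score from a strategy vector."""
    best_index = max(range(len(strategy)), key=lambda i: strategy[i])
    return BEHAVIORS[best_index]

# Example: a strategy that mostly favors going straight.
print(select_behavior([0.05, 0.05, 0.60, 0.05, 0.05, 0.05, 0.05, 0.03, 0.04, 0.03]))  # -> "straight"
```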
- the planning module 130 can plan a path for controlling the vehicle's steering wheel, brake, and throttle so as to achieve the upper-level decision-making behavior.
- the apparatus for training decision models according to the present disclosure may be deployed on a vehicle with computing capabilities, such as a vehicle equipped with a computer system.
- the apparatus for training decision models according to the present disclosure may train decision models 125 based on data collected from the vehicle and/or a simulation environment based on a real vehicle.
- the executable code of the perception module 110, the decision module 120 (including the decision model 125), and the planning module 130 may be stored on a storage component of the vehicle, and may be executed by a computing device of the vehicle, such as a processor, to implement the function of training and/or applying the decision model.
- the apparatus for training decision models according to the present disclosure may be deployed in a distributed manner, such as at least partially deployed on a remote server. It should be understood that the environment 100 shown in FIG. 1 is merely exemplary and does not constitute a limitation on the scope of the present disclosure.
- the scheme for training decision models according to the present disclosure may be applied to other fields such as recommendation decision management.
- FIG. 2A shows a schematic diagram of an example process 200 of training a decision model according to some embodiments of the present disclosure.
- training data 201 is used to train a decision model 210.
- the training data 201 may include annotated expert data, such as behavioral data collected from human drivers and corresponding environmental data. Additionally or alternatively, the training data 201 may include data generated by a simulator. The simulator may simulate and determine the behavioral data based on environmental data. Environmental data may include offline data extracted from a map, for example. Additionally or alternatively, environmental data may include online data dynamically simulated based on the real environment of the vehicle. In some examples, the simulator may determine the corresponding behavioral data using a random strategy or a strategy generated by a reinforcement learning model. It should be understood that unreasonable behaviors may be included in the training data generated by the simulator. These behaviors can be used as negative sample data to improve the robustness of the decision model 210.
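- Purely as a sketch of this data-generation idea (the simulator interface, its method names observe()/step(), and num_behaviors are hypothetical, not an API defined by the disclosure), rolling out a random strategy or a model-provided strategy might look like this:

```python
import random

def generate_simulator_data(simulator, num_steps, policy=None):
    """Collect (environment, behavior) training pairs from a simulator rollout.

    simulator.observe() is assumed to return the current environment state and
    simulator.step(behavior) to execute a behavior; both names are hypothetical.
    If no policy is given, a random strategy is used.
    """
    samples = []
    for _ in range(num_steps):
        env_state = simulator.observe()
        if policy is None:
            behavior = random.randrange(simulator.num_behaviors)  # random strategy
        else:
            behavior = policy(env_state)  # e.g., a strategy produced by the reinforcement learning model
        simulator.step(behavior)
        # Unreasonable behaviors are kept as negative samples to improve robustness.
        samples.append((env_state, behavior))
    return samples
```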
- the decision model 210 includes a supervised learning model 212 and a reinforcement learning model 214.
- the supervised learning model can be any suitable model based on supervised learning, such as a Transformer model, a decision tree model, etc.
- the reinforcement learning model 214 can be any suitable model based on reinforcement learning, such as a Q-learning model, a Monte Carlo model, etc. The scope of the present disclosure is not limited in terms of specific model implementations.
- the first strategy 222 is determined using the supervised learning model 212 in the decision model 210
- the second strategy 224 is determined using the reinforcement learning model 214.
- the first strategy 222 and the second strategy 224 are obtained based on the same input data in the training data 201, so the difference between the first strategy 222 and the second strategy 224 can reflect the difference when the supervised learning model 212 and the reinforcement learning model 214 make decisions for the same input data.
- the supervised learning model 212 can usually apply more human experience than the reinforcement learning model 214, and the reinforcement learning model 214 is more exploratory than the supervised learning model 212.
- the supervised learning model 212 that determines the first strategy 222 may be pre-trained. In other words, the parameters of the supervised learning model 212 have already been determined based on the labeled expert data and are no longer updated in the training process shown in FIG. 2A. Alternatively, the supervised learning model 212 that determines the first strategy 222 may be trained together with the reinforcement learning model 214, with the parameters of the supervised learning model 212 updated during training (as shown in FIG. 2B).
- the policy distillation module 230 determines the imitation learning loss 242.
- the imitation learning loss 242 can reflect the degree to which the reinforcement learning model 214 "imitates" the decision-making of the supervised learning model 212. For example, if the imitation learning loss 242 is small, the reinforcement learning model 214 imitates the supervised learning model 212 to a higher degree when making decisions. This can also be understood as the reinforcement learning model 214 "imitating" the human strategy contained in the expert data. Conversely, if the imitation learning loss 242 is large, the reinforcement learning model 214 imitates the supervised learning model 212 to a lower degree when making decisions. Based on the imitation learning loss 242 determined by the policy distillation module 230, the reinforcement learning model 214 can "distill" the strategy determined by the supervised learning model 212, thereby learning the human experience in the expert data.
- the strategy distillation module 230 may normalize the first strategy 222 and the second strategy 224 , and may determine the imitation learning loss 242 based on the distance between the normalized first strategy 222 and the second strategy 224 .
- the first strategy 222 output by the supervised learning model 212 may be a probability distribution of behaviors, such as (0.6, 0.4, 0), where each value represents the probability of a behavior.
- the second strategy 224 output by the reinforcement learning model 214 may be a similar probability distribution or a set of (state, behavior) values. If the second strategy 224 is a similar probability distribution, the strategy distillation module 230 can determine the imitation learning loss 242 based on the vector distance between the first strategy 222 and the second strategy 224. If the second strategy 224 is a set of (state, behavior) values, the strategy distillation module 230 can normalize the values using a softmax function and compute the distance between the first strategy 222 and the second strategy 224 using relative entropy (KL divergence), thereby obtaining the imitation learning loss 242.
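- A minimal NumPy sketch of the normalization and distance computation described above (the array contents and the epsilon smoothing are illustrative assumptions):

```python
import numpy as np

def softmax(x):
    """Normalize raw (state, behavior) values into a probability distribution."""
    z = np.exp(x - np.max(x))
    return z / np.sum(z)

def imitation_loss(first_strategy, second_strategy_values):
    """KL divergence (relative entropy) between the two normalized strategies."""
    p = np.asarray(first_strategy, dtype=float)                    # e.g. (0.6, 0.4, 0) from the supervised learning model
    q = softmax(np.asarray(second_strategy_values, dtype=float))   # normalized output of the reinforcement learning model
    eps = 1e-8  # avoid log(0) for zero-probability behaviors
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Example with three behaviors (left lane change, right lane change, straight driving).
print(imitation_loss((0.6, 0.4, 0.0), (2.0, 1.0, 0.5)))
```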
- the optimization module 250 trains (also referred to as optimizes) the decision model 210.
- the reinforcement learning loss 244 can be determined based on any suitable loss function, and the scope of the present disclosure is not limited thereto.
- the optimization module 250 trains the decision model 210 or only the reinforcement learning model 214 in the decision model 210 by minimizing the combination of the imitation learning loss 242 and the reinforcement learning loss 244.
- the optimization module 250 may determine an adaptive weight for the imitation learning loss 242, and determine the overall learning loss based on the adaptive weight, the imitation learning loss 242, and the reinforcement learning loss 244.
- the optimization module 250 may train the decision model 210 by minimizing the overall learning loss.
- in some embodiments, the overall learning loss may be determined according to formula (1): loss = loss_rl + λ_n · loss_kl  (1), where loss_kl represents the imitation learning loss 242, loss_rl represents the reinforcement learning loss 244, and λ_n represents the adaptive weight for the imitation learning loss 242.
- the adaptive weight may include two weights for both the imitation learning loss 242 and the reinforcement learning loss 244, and may not be embodied in the form of a coefficient.
- the optimization module 250 may determine the initial weights for the imitation learning loss 242 and determine the adaptive weights by gradual updating. In some embodiments, the optimization module 250 may update the initial weights based on the change in the imitation learning loss 242 before reaching a predetermined training round to determine the updated weights. The optimization module 250 may gradually reduce the updated weights after reaching a predetermined round.
- for example, if the imitation learning loss 242 of the initial training round is less than the imitation learning loss 242 of a subsequent training round, the initial weight can be increased; if the imitation learning loss 242 of the initial training round is greater than the imitation learning loss 242 of the subsequent training round, the initial weight is kept unchanged.
- the adaptive weight λ_n will be reduced to zero in the 2N-th round.
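- A minimal sketch of one possible weight schedule consistent with the description above, assuming formula (1) has the additive form loss = loss_rl + λ_n · loss_kl and assuming a particular step size and decay shape (both illustrative, not fixed by the disclosure):

```python
def adaptive_weight(prev_weight, round_idx, N, prev_kl, curr_kl, step=0.1):
    """Adaptive weight lambda_n for training round `round_idx` (feed the return value back each round).

    Rounds 0..N-1: start from `prev_weight`; increase it by `step` if the imitation
    loss grew compared with the earlier round, otherwise keep it unchanged.
    Rounds N..2N:  decay the weight reached so far linearly, so it is zero at round 2N.
    """
    if round_idx < N:
        return prev_weight + step if prev_kl < curr_kl else prev_weight
    return prev_weight * max(2 * N - round_idx, 0) / max(2 * N - (round_idx - 1), 1)

def overall_loss(loss_rl, loss_kl, weight):
    """Formula (1): weighted combination of reinforcement and imitation losses."""
    return loss_rl + weight * loss_kl
```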
- FIG. 2B shows a schematic diagram of an example process 260 of training a decision model according to some embodiments of the present disclosure.
- the supervised learning model 212 that determines the first strategy 222 can be trained together with the reinforcement learning model 214, and the parameters of the supervised learning model 212 are updated in the training process shown in FIG. 2B.
- a supervised learning loss 262 corresponding to the first strategy 222 can be determined, and the decision model 210 can be trained together based on the imitation learning loss 242, the reinforcement learning loss 244, and the supervised learning loss 262.
- the ability of supervised learning to utilize expert data and the strong generalization of reinforcement learning can be combined to train a high-performing and human-like decision model 210.
- the reinforcement learning model 214 can pay more attention to "imitation" of the strategy determined by the supervised learning model 212 in the early stage of training and pay more attention to autonomous exploration in the later stage of training, thereby improving the efficiency of training the decision model 210, especially the reinforcement learning model 214.
- FIG. 3 shows a schematic diagram of an example process 300 of training a decision model in stages according to some embodiments of the present disclosure. It should be understood that FIG. 3 is only an example from the field of autonomous driving and does not constitute a limitation on the scope of the present disclosure.
- expert data and non-expert data may be collected.
- Expert data may include data collected directly from humans, such as data obtained from the interaction between human experts and the environment.
- expert data may be collected by collecting the driver's manipulation of the vehicle.
- Non-expert data may include data directly generated by non-humans. For example, non-expert data may be collected through a simulator.
- the simulator may simulate the environment of the vehicle and apply a strategy in the environment to generate the behavior of the vehicle as non-expert data.
- the simulator may apply a random strategy or a strategy generated by a decision model to generate the behavior of the vehicle corresponding to the strategy.
- the simulator may apply the strategy output by the reinforcement learning model online to generate non-expert data.
- feature extraction may be performed on the collected expert data and non-expert data to obtain preprocessed training data.
- the environmental data and the behavioral data may be converted into corresponding vector representations.
- specific data may be selected from the collected data for use in training the decision model in the training stage 340.
- data of a specific decision scenario may be selected from the collected data as training data for training the decision model, thereby improving the performance of the decision model for the specific decision scenario.
- the supervised learning model in the decision model may be first trained using the annotated expert data, and the supervised learning model may be tested in a test set to determine the reasoning performance of the supervised learning model trained based on the expert data.
- the reasoning performance may indicate the quality of the prediction strategy for each of the multiple decision scenarios.
- the reasoning performance of the supervised learning model may indicate the quality of the prediction strategy for the lane change scenario, the braking scenario, and the turning scenario, respectively.
- the data selection module can be used to select training data for training the decision model or reinforcement learning model from the collected data.
- the data selection module can determine the data distribution corresponding to multiple decision scenarios in the training data based on the reasoning performance of the supervised learning model. For example, for a specific decision scenario with poor prediction strategy quality, the data of the decision scenario can be added to the training data to improve the reasoning performance of the decision model in a targeted manner.
- the training data can be updated based on the reasoning performance of the reinforcement learning model to be used to improve the reasoning performance of the decision model in subsequent training rounds.
- the reasoning performance of the reinforcement learning model obtained through previous training can be determined, and the reasoning performance indicates the quality of the prediction strategy for each decision scenario in multiple decision scenarios.
- the data distribution corresponding to the multiple decision scenarios in the training data can be updated to determine the updated training data.
- the decision model can be further trained so that the reasoning performance of the decision model for a specific decision scenario is improved.
- the decision model can be trained by combining supervised learning and reinforcement learning.
- policy distillation can be used to make the reinforcement learning model "imitate" the decision-making method of the supervised learning model, thereby passing the policy obtained by supervised learning on to the reinforcement learning model.
- the degree of "imitation" can be adjusted based on adaptive weights, so that the reinforcement learning model imitates the policy obtained by the supervised learning model more at the beginning of training and gradually reduces the degree of imitation, thereby increasing the generalization of the decision model.
- the training phase 340 may include an offline training phase and an online training phase.
- the supervised learning model may be trained based on expert data.
- the strategy output by the reinforcement learning model may be applied to the simulator to generate non-expert data.
- the non-expert data may be used as part of the training data and may be used to further train the reinforcement learning model in subsequent training rounds.
- the simulator may be used to increase the amount of training data, and the amount of training data may be increased for a specific decision scenario, thereby training the decision model more efficiently.
- FIG. 4 shows a flowchart of an example process 400 for training a decision model based on a decision scenario according to some embodiments of the present disclosure.
- the device for training the decision model can be deployed on the vehicle, and the training data can be determined by collecting the operating behavior of the human driver and the simulation of the simulator.
- the trained decision model can determine strategies in a variety of decision scenarios. Examples of decision scenarios may include: decision scenarios that require left lane change, right lane change, and straight driving, respectively.
- a suitable target lane can be selected according to the current overall road conditions, such as issuing a left lane change, a right lane change, or a straight driving instruction to maximize traffic efficiency.
- the process of training the decision model will be described in detail below.
- the decision model may be initialized in box 402.
- Parameters of the supervised learning model and the reinforcement learning model may be initialized, such as the dimension of the neural network and the activation function.
- parameters related to the decision task may be input, such as the dimension of the behavior.
- the dimension of the behavior may be set to 3 to represent left lane change, right lane change, and straight driving, respectively.
- the adaptive weight and the parameters for updating the weight, such as the predetermined training round N, may be initialized.
- process 400 may proceed to block 406 to train the supervised learning model using expert data.
- any feasible supervised learning loss function may be used to train the supervised learning model. Examples of supervised learning loss functions include, but are not limited to, mean square error and cross entropy.
- the process 400 can proceed to box 408.
- the data selection module can test the reasoning performance of the supervised learning model and determine the data distribution of a specific decision scenario (also referred to as determining the data scenario) based on the reasoning performance.
- decision scenarios corresponding to the need to change left lanes, change right lanes, and go straight can be set.
- the data selection module can adjust the distribution of training data in the next training round based on the quality of the prediction strategies of the supervised learning model for these decision scenarios.
- the proportion of the decision scenario can be increased in the training data.
- the adjustment can be made by increasing the data for a scenario by 10% for every 10% decrease in its pass rate, with a minimum adjustment step of 10%.
- the pass rates for decision scenarios requiring left lane change, right lane change, and straight driving are 80%, 80%, and 50%, respectively
- the data proportion of the straight driving scenario can be increased, and the data proportion for these three decision scenarios can be determined as 100%:100%:130%.
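- The pass-rate-driven adjustment described above could be sketched as follows (the scenario names and the choice of the best pass rate as the reference level mirror the example in the text but are assumptions for this sketch):

```python
import math

def adjust_data_distribution(pass_rates_pct, step_pct=10):
    """Return per-scenario data proportions in percent (100 == baseline).

    Each scenario whose pass rate is below the best pass rate receives 10% more
    data for every 10 percentage points of deficit, rounded up to the 10% step.
    """
    best = max(pass_rates_pct.values())
    proportions = {}
    for scenario, rate in pass_rates_pct.items():
        deficit = best - rate
        extra = math.ceil(deficit / step_pct) * step_pct if deficit > 0 else 0
        proportions[scenario] = 100 + extra
    return proportions

# Pass rates of 80%, 80% and 50% yield proportions of 100 : 100 : 130.
print(adjust_data_distribution({"left_change": 80, "right_change": 80, "straight": 50}))
```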
- the decision model (or only the reinforcement learning model) may be trained based on the determined training data.
- the policy distillation module may calculate the imitation learning loss.
- the policy distillation module may determine the imitation learning loss based on the difference between the first strategy output by the supervised learning model and the second strategy output by the reinforcement learning model.
- the reinforcement learning loss can be calculated.
- the reinforcement learning loss can be calculated using a Q-value learning method.
- the adaptive weight module can calculate the overall learning loss based on the imitation learning loss and the reinforcement learning loss.
- the overall learning loss can be calculated with reference to the above formula (1).
- the reinforcement learning model can be trained based on the overall learning loss.
- the process 400 can proceed to box 420.
- in box 420, it can be determined whether the reasoning performance of the reinforcement-learning-based decision model meets the standard. If the reasoning performance meets the standard, the training can be ended in box 422. Conversely, if the reasoning performance does not meet the standard, the process 400 can proceed to box 425.
- the data selection module can select data scenarios based on the reasoning performance of the reinforcement learning model, thereby adjusting the data distribution in the training data.
- the pass rate for decision scenarios requiring left lane change, right lane change and straight driving is increased from 80%, 80%, 50% to 80%, 80%, 60%
- the data ratio for decision scenarios requiring left lane change, right lane change and straight driving can be reduced from 100%: 100%: 130% to 100%: 100%: 120%.
- supervised learning and reinforcement learning can be combined to train a high-performing and human-like decision model.
- the data distribution for a specific decision scenario in the training data can be adjusted based on the reasoning performance of the supervised learning model and/or the reinforcement learning model, so that the decision model can be trained based on the decision scenario, thereby improving the reasoning performance of the decision model in a targeted manner.
- FIG. 5 shows a flowchart of a process 500 of an example method for training a decision model according to some embodiments of the present disclosure.
- at block 510, based on the training data, a first strategy is determined using a supervised learning model in the decision model and a second strategy is determined using a reinforcement learning model in the decision model.
- at block 520, an imitation learning loss is determined based on the difference between the first strategy and the second strategy.
- at block 530, the decision model is trained based on the imitation learning loss and the reinforcement learning loss corresponding to the second strategy.
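- Tying these blocks together, a single training iteration of process 500 might be sketched in PyTorch as follows (the model call signatures and the compute_rl_loss helper are hypothetical placeholders; the frozen supervised model corresponds to the variant in which only the reinforcement learning model is updated):

```python
import torch
import torch.nn.functional as F

def train_step(sl_model, rl_model, optimizer, batch, kl_weight, compute_rl_loss):
    """One iteration: blocks 510 (strategies), 520 (imitation loss), 530 (training)."""
    with torch.no_grad():                              # supervised learning model kept frozen here
        first_strategy = F.softmax(sl_model(batch), dim=-1)
    q_values = rl_model(batch)                         # second strategy as (state, behavior) values
    second_log_probs = F.log_softmax(q_values, dim=-1)

    # Block 520: imitation learning loss as KL divergence between normalized strategies.
    imitation_loss = F.kl_div(second_log_probs, first_strategy, reduction="batchmean")

    # Reinforcement learning loss, e.g. a Q-learning TD error (left to the caller).
    rl_loss = compute_rl_loss(q_values, batch)

    # Block 530: overall loss per formula (1), then one optimization step.
    total_loss = rl_loss + kl_weight * imitation_loss
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
    return total_loss.item()
```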
- training the decision model includes: determining an adaptive weight for the imitation learning loss; determining an overall learning loss based on the adaptive weight and the imitation learning loss and the reinforcement learning loss; and training the decision model by minimizing the overall learning loss.
- determining the adaptive weight for the imitation learning loss includes: determining an initial weight for the imitation learning loss; before reaching a predetermined training round, updating the initial weight based on changes in the imitation learning loss to determine an updated weight; and after reaching the predetermined round, gradually reducing the updated weight.
- updating the initial weights based on changes in the imitation learning loss includes: increasing the initial weights if the imitation learning loss of an initial training round is less than the imitation learning loss of a subsequent training round; and maintaining the initial weights if the imitation learning loss of an initial training round is greater than the imitation learning loss of a subsequent training round.
- the method also includes: training the supervised learning model based on labeled expert data; determining the reasoning performance of the supervised learning model trained based on the expert data, wherein the reasoning performance indicates the quality of the prediction strategy for each decision scenario in a plurality of decision scenarios; and determining the data distribution in the training data corresponding to the plurality of decision scenarios based on the reasoning performance of the supervised learning model.
- determining the imitation learning loss based on the difference between the first strategy and the second strategy includes: normalizing the first strategy and the second strategy; and determining the imitation learning loss based on the normalized distance between the first strategy and the second strategy.
- the method further comprises: generating at least a portion of the training data using a simulator.
- generating data using the simulator comprises: based on at least one of the strategies or random strategies determined by the reinforcement learning model, generating a behavior corresponding to the at least one of the strategies or random strategies using the simulator as at least a portion of the training data.
- the method also includes: determining the reasoning performance of the reinforcement learning model, the reasoning performance indicating the quality of the prediction strategy for each of a plurality of decision scenarios; updating the data distribution in the training data corresponding to the plurality of decision scenarios based on the reasoning performance of the reinforcement learning model to determine updated training data; and training the decision model based on the updated training data.
- training the decision model includes: determining a supervised learning loss corresponding to the first strategy; and training the decision model based on the imitation learning loss, the reinforcement learning loss, and the supervised learning loss.
- the method also includes: using the trained decision model or the trained reinforcement learning model to determine a driving strategy based on input data related to driving, and the driving strategy includes at least one of the following: changing left lanes, changing right lanes, going straight, overtaking, turning left, turning right, stopping, accelerating, decelerating, and braking.
- the advantages of supervised learning and reinforcement learning can be combined to train the decision model.
- the supervised learning model can first be trained using offline expert data to obtain a human-like expert model. Then, the policy of the expert model can be inherited through the policy distillation module as the initial solution of the reinforcement learning model.
- with the data selection module, targeted improvement for specific decision-making scenarios can be achieved based on the strategy of the expert model, thereby obtaining a high-performing, human-like decision-making model.
- FIG. 6 shows a block diagram of an apparatus 600 for training a decision network according to an embodiment of the present disclosure, and the apparatus 600 may include a plurality of modules for performing corresponding steps of process 500 as discussed with reference to FIG. 5.
- the apparatus 600 may be deployed on an on-board device (e.g., a vehicle computer) to improve the decision-making performance of the autonomous driving software.
- the apparatus 600 includes a strategy determination unit 610, which is configured to determine a first strategy based on training data using a supervised learning model in a decision model and to determine a second strategy using a reinforcement learning model in the decision model; a loss determination unit 620, which is configured to determine an imitation learning loss based on the difference between the first strategy and the second strategy; and an optimization unit 630, which is configured to train the decision model based on the imitation learning loss and the reinforcement learning loss corresponding to the second strategy.
- the optimization unit 630 is further configured to: determine an adaptive weight for the imitation learning loss; determine an overall learning loss based on the adaptive weight and the imitation learning loss and the reinforcement learning loss; and train the decision model by minimizing the overall learning loss.
- the optimization unit 630 is further configured to: determine an initial weight for the imitation learning loss; before reaching a predetermined training round, update the initial weight based on the change of the imitation learning loss to determine an updated weight; and after reaching the predetermined round, gradually reduce the updated weight.
- the optimization unit 630 is further configured to: increase the initial weight if the imitation learning loss of the initial training round is less than the imitation learning loss of the subsequent training round; and maintain the initial weight if the imitation learning loss of the initial training round is greater than the imitation learning loss of the subsequent training round.
- the device 600 also includes a training data determination unit, which is configured to: train the supervised learning model based on labeled expert data; determine the reasoning performance of the supervised learning model trained based on the expert data, wherein the reasoning performance indicates the quality of the prediction strategy for each decision scenario in a plurality of decision scenarios; and determine the data distribution in the training data corresponding to the plurality of decision scenarios based on the reasoning performance of the supervised learning model.
- the apparatus 600 further comprises a simulator utilizing unit, the simulator utilizing unit being configured to: utilize a simulator to generate at least a portion of the training data.
- the simulator utilizing unit is further configured to: based on at least one of a strategy determined by the reinforcement learning model or a random strategy, use the simulator to generate a behavior corresponding to the at least one strategy as at least a portion of the training data.
- the device 600 also includes a directional optimization unit, which is configured to: determine the reasoning performance of the reinforcement learning model, the reasoning performance indicating the quality of the prediction strategy for each decision scenario in a plurality of decision scenarios; based on the reasoning performance of the reinforcement learning model, update the data distribution in the training data corresponding to the plurality of decision scenarios to determine updated training data; and train the decision model based on the updated training data.
- the loss determination unit 620 is further configured to: normalize the first strategy and the second strategy; and determine the imitation learning loss based on the normalized distance between the first strategy and the second strategy.
- the optimization unit 630 is further configured to: determine a supervised learning loss corresponding to the first strategy; and train the decision model based on the imitation learning loss, the reinforcement learning loss, and the supervised learning loss.
- the device 600 also includes a decision model utilization unit, which is configured to: utilize the trained decision model or the trained reinforcement learning model to determine a driving strategy based on driving-related input data, and the driving strategy includes at least one of the following: changing left lanes, changing right lanes, going straight, overtaking, turning left, turning right, stopping, accelerating, decelerating, and braking.
- FIG. 7 shows a schematic block diagram of an example device 700 that can be used to implement an embodiment of the present disclosure.
- the device 700 includes a computing unit 701, which can perform various appropriate actions and processes according to computer program instructions stored in a random access memory (RAM) 703 and/or a read-only memory (ROM) 702 or computer program instructions loaded from a storage unit 708 into the RAM 703 and/or ROM 702.
- Various programs and data required for the operation of the device 700 can also be stored in the RAM 703 and/or ROM 702.
- the computing unit 701 and the RAM 703 and/or ROM 702 are connected to each other via a bus 704.
- An input/output (I/O) interface 705 is also connected to the bus 704.
- a number of components in the device 700 are connected to the I/O interface 705, including: an input unit 706, such as a keyboard, a mouse, etc.; an output unit 707, such as various types of displays, speakers, etc.; a storage unit 708, such as a disk, an optical disk, etc.; and a communication unit 709, such as a network card, a modem, a wireless communication transceiver, etc.
- the communication unit 709 allows the device 700 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
- the computing unit 701 may be any of various general and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, digital signal processors (DSPs), and any appropriate processors, controllers, microcontrollers, etc.
- the computing unit 701 performs the various methods and processes described above, such as process 500.
- process 500 may be implemented as a computer software program, which is tangibly contained in a machine-readable medium, such as a storage unit 708.
- part or all of the computer program may be loaded and/or installed on the device 700 via RAM and/or ROM and/or communication unit 709.
- when the computer program is loaded into the RAM and/or ROM and executed by the computing unit 701, one or more steps of the process 500 described above may be performed.
- the computing unit 701 may be configured to perform process 500 in any other appropriate manner (e.g., by means of firmware).
- the above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
- when implemented using software, they may be implemented in whole or in part in the form of a computer program product.
- the computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a server or terminal, the process or function described in the embodiment of the present application is generated in whole or in part.
- the computer instructions can be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
- the computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line) or wireless (e.g., infrared, radio, microwave) means.
- the computer-readable storage medium can be any available medium accessible by a server or terminal, or a data storage device, such as a server or data center, that integrates one or more available media.
- the available medium can be a magnetic medium (such as a floppy disk, a hard disk, and a tape, etc.), an optical medium (such as a digital video disk (DVD), etc.), or a semiconductor medium (such as a solid-state hard disk, etc.).
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Medical Informatics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Feedback Control In General (AREA)
Abstract
Embodiments of the present disclosure provide a method, device, apparatus, medium, and program product for training a decision model, relating to the field of computers. The method includes: based on training data, determining a first strategy using a supervised learning model in the decision model and determining a second strategy using a reinforcement learning model in the decision model (510). The method further includes determining an imitation learning loss based on the difference between the first strategy and the second strategy (520). The method further includes training the decision model based on the imitation learning loss and a reinforcement learning loss corresponding to the second strategy (530). In this way, based on both the imitation learning loss and the reinforcement learning loss, the ability of supervised learning to utilize expert data can be combined with the strong generalization of reinforcement learning, so that a high-performing and human-like decision model is obtained through training. In some embodiments, according to the solution of the present disclosure, a decision model applied to the field of autonomous driving can be trained to provide strategies such as lane changing.
Description
本申请要求于2023年4月10日提交中国专利局、申请号为202310413264.9、发明名称为“用于训练决策模型的方法、装置、设备、介质和程序产品”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
本公开的实施例主要涉及计算机领域。更具体地,本公开的实施例涉及用于训练决策模型的方法、装置、设备、计算机可读存储介质以及计算机程序产品。
目前,利用人工智能的决策模型被广泛应用于诸如自动驾驶、推荐决策管理、机器人控制决策管理等领域。例如,在自动驾驶领域,可以利用决策模型来根据路况确定诸如换道和刹车之类的驾车行为,从而实现自动驾驶。然而,适用于复杂场景的决策模型的训练难度较大。在一些示例中,需要收集大量的专家数据来训练基于监督学习的决策模型。在另一些示例中,基于强化学习的决策模型需要构建复杂的奖励函数来学习决策经验。因此,需要一种训练决策模型的方案,以用于训练类人并且性能优异的决策模型。
发明内容
Embodiments of the present disclosure provide a solution for training a decision model.
In a first aspect of the present disclosure, a method of training a decision model is provided. The method includes: based on training data, determining a first strategy using a supervised learning model in the decision model and determining a second strategy using a reinforcement learning model in the decision model; determining an imitation learning loss based on a difference between the first strategy and the second strategy; and training the decision model based on the imitation learning loss and a reinforcement learning loss corresponding to the second strategy.
In this way, based on both the imitation learning loss and the reinforcement learning loss, the ability of supervised learning to exploit expert data can be combined with the strong generalization of reinforcement learning, so that a human-like decision model with excellent performance can be trained. In some embodiments, according to the solution of the present disclosure, a decision model applicable to the field of autonomous driving can be trained to provide strategies such as lane changing.
In some embodiments of the first aspect, training the decision model based on the imitation learning loss and the reinforcement learning loss corresponding to the second strategy includes: determining an adaptive weight for the imitation learning loss; determining an overall learning loss based on the adaptive weight, the imitation learning loss, and the reinforcement learning loss; and training the decision model by minimizing the overall learning loss.
In some embodiments of the first aspect, determining the adaptive weight for the imitation learning loss includes: determining an initial weight for the imitation learning loss; before a predetermined number of training epochs is reached, updating the initial weight based on a change in the imitation learning loss to determine an updated weight; and after the predetermined number of training epochs is reached, gradually decreasing the updated weight.
In some embodiments of the first aspect, updating the initial weight based on the change in the imitation learning loss includes: increasing the initial weight if the imitation learning loss of an initial training epoch is less than the imitation learning loss of a subsequent training epoch; and keeping the initial weight if the imitation learning loss of the initial training epoch is greater than the imitation learning loss of the subsequent training epoch.
In this way, based on the adaptive weight, the reinforcement learning model can focus more on "imitating" human strategies in the early stage of training and more on free exploration in the later stage of training, so that a decision network combining the advantages of both supervised learning and reinforcement learning is obtained.
In some embodiments of the first aspect, determining the imitation learning loss based on the difference between the first strategy and the second strategy includes: normalizing the first strategy and the second strategy; and determining the imitation learning loss based on a distance between the normalized first strategy and the normalized second strategy.
In some embodiments of the first aspect, the method further includes: training the supervised learning model based on labeled expert data; determining an inference performance of the supervised learning model trained on the expert data, the inference performance indicating a predicted strategy quality for each of a plurality of decision scenarios; and determining, based on the inference performance of the supervised learning model, a data distribution in the training data corresponding to the plurality of decision scenarios.
In some embodiments of the first aspect, the method further includes: determining an inference performance of the reinforcement learning model, the inference performance indicating a predicted strategy quality for each of a plurality of decision scenarios; updating, based on the inference performance of the reinforcement learning model, a data distribution in the training data corresponding to the plurality of decision scenarios to determine updated training data; and training the decision model based on the updated training data.
In this way, the data distribution over decision scenarios in the training data can be adjusted dynamically, so that the inference performance of the decision model for specific decision scenarios can be improved in a targeted manner.
In some embodiments of the first aspect, the method further includes: generating at least a part of the training data with a simulator. In some embodiments, generating data with the simulator includes: based on at least one of a strategy determined by the reinforcement learning model or a random strategy, generating, with the simulator, a behavior corresponding to the at least one of the strategy or the random strategy as at least a part of the training data. In this way, the simulator can be used to increase the amount of training data.
In some embodiments of the first aspect, training the decision model includes: determining a supervised learning loss corresponding to the first strategy; and training the decision model based on the imitation learning loss, the reinforcement learning loss, and the supervised learning loss.
In some embodiments of the first aspect, the method further includes: determining a driving strategy based on driving-related input data using the trained decision model or the trained reinforcement learning model, the driving strategy including at least one of: left lane change, right lane change, going straight, overtaking, turning left, turning right, stopping, accelerating, decelerating, or braking.
In a second aspect of the present disclosure, an apparatus for training a decision model is provided. The apparatus includes a strategy determining unit configured to, based on training data, determine a first strategy using a supervised learning model in the decision model and determine a second strategy using a reinforcement learning model in the decision model; a loss determining unit configured to determine an imitation learning loss based on a difference between the first strategy and the second strategy; and an optimization unit configured to train the decision model based on the imitation learning loss and a reinforcement learning loss corresponding to the second strategy.
In some embodiments of the second aspect, the optimization unit is further configured to: determine an adaptive weight for the imitation learning loss; determine an overall learning loss based on the adaptive weight, the imitation learning loss, and the reinforcement learning loss; and train the decision model by minimizing the overall learning loss.
In some embodiments of the second aspect, the optimization unit is further configured to: determine an initial weight for the imitation learning loss; before a predetermined number of training epochs is reached, update the initial weight based on a change in the imitation learning loss to determine an updated weight; and after the predetermined number of training epochs is reached, gradually decrease the updated weight.
In some embodiments of the second aspect, the optimization unit is further configured to: increase the initial weight if the imitation learning loss of an initial training epoch is less than the imitation learning loss of a subsequent training epoch; and keep the initial weight if the imitation learning loss of the initial training epoch is greater than the imitation learning loss of the subsequent training epoch.
In some embodiments of the second aspect, the apparatus further includes a training data determining unit configured to: train the supervised learning model based on labeled expert data; determine an inference performance of the supervised learning model trained on the expert data, the inference performance indicating a predicted strategy quality for each of a plurality of decision scenarios; and determine, based on the inference performance of the supervised learning model, a data distribution in the training data corresponding to the plurality of decision scenarios.
In some embodiments of the second aspect, the apparatus further includes a simulator utilizing unit configured to generate at least a part of the training data with a simulator. In some embodiments of the second aspect, the simulator utilizing unit is further configured to: based on at least one of a strategy determined by the reinforcement learning model or a random strategy, generate, with the simulator, a behavior corresponding to the at least one of the strategy or the random strategy as at least a part of the training data.
In some embodiments of the second aspect, the apparatus further includes a targeted optimization unit configured to: determine an inference performance of the reinforcement learning model, the inference performance indicating a predicted strategy quality for each of a plurality of decision scenarios; update, based on the inference performance of the reinforcement learning model, a data distribution in the training data corresponding to the plurality of decision scenarios to determine updated training data; and train the decision model based on the updated training data.
In some embodiments of the second aspect, the loss determining unit is further configured to: normalize the first strategy and the second strategy; and determine the imitation learning loss based on a distance between the normalized first strategy and the normalized second strategy.
In some embodiments of the second aspect, the optimization unit is further configured to: determine a supervised learning loss corresponding to the first strategy; and train the decision model based on the imitation learning loss, the reinforcement learning loss, and the supervised learning loss.
In some embodiments of the second aspect, the apparatus further includes a decision model utilizing unit configured to: determine a driving strategy based on driving-related input data using the trained decision model or the trained reinforcement learning model, the driving strategy including at least one of: left lane change, right lane change, going straight, overtaking, turning left, turning right, stopping, accelerating, decelerating, or braking.
In a third aspect of the present disclosure, an electronic device is provided, including: at least one computing unit; and at least one memory coupled to the at least one computing unit and storing instructions for execution by the at least one computing unit, the instructions, when executed by the at least one computing unit, causing the device to implement the method provided in the first aspect.
In a fourth aspect of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored, where the computer program, when executed by a processor, implements the method provided in the first aspect.
In a fifth aspect of the present disclosure, a computer program product is provided, including computer-executable instructions, where the instructions, when executed by a processor, implement some or all of the steps of the method of the first aspect.
It can be understood that the electronic device of the third aspect, the computer storage medium of the fourth aspect, and the computer program product of the fifth aspect provided above are used to perform at least a part of the method provided in the first aspect. Therefore, the explanations or descriptions regarding the first aspect also apply to the second, third, fourth, and fifth aspects. In addition, for the beneficial effects achievable by the second, third, fourth, and fifth aspects, reference may be made to the beneficial effects of the corresponding method, which are not repeated here.
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent with reference to the following detailed description in conjunction with the accompanying drawings. In the drawings, the same or similar reference signs denote the same or similar elements, in which:
FIG. 1 shows a schematic diagram of an example environment in which multiple embodiments of the present disclosure can be implemented;
FIG. 2A and FIG. 2B show schematic diagrams of example processes of training a decision model according to some embodiments of the present disclosure;
FIG. 3 shows a schematic diagram of an example process of training a decision model in stages according to some embodiments of the present disclosure;
FIG. 4 shows a flowchart of an example process of training a decision model based on decision scenarios according to some embodiments of the present disclosure;
FIG. 5 shows a flowchart of a process of an example method of training a decision model according to some embodiments of the present disclosure;
FIG. 6 shows a schematic block diagram of an apparatus for training a decision model according to some embodiments of the present disclosure; and
FIG. 7 shows a block diagram of a computing device capable of implementing multiple embodiments of the present disclosure.
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure can be implemented in various forms and should not be construed as being limited to the embodiments set forth here; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the protection scope of the present disclosure.
In the description of the embodiments of the present disclosure, the term "including" and similar terms should be understood as open-ended inclusion, that is, "including but not limited to". The term "based on" should be understood as "at least partially based on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The terms "first", "second", and the like may refer to different or the same objects. Other explicit and implicit definitions may also be included below.
As briefly mentioned above, decision models applicable to complex scenarios are difficult to train. In some examples, training a decision model with supervised learning usually requires collecting a large amount of expert data so that the decision model can imitate human behavior, thereby obtaining a human-like decision model. In addition, since the behaviors of different experts differ, the distribution of the training data is usually uneven. Moreover, expert data generally contains no negative samples and covers limited data scenarios, so the robustness of the decision model is low and safety risks may arise. In other examples, although training a decision model with reinforcement learning does not depend on expert data and has strong generalization, this approach requires a carefully designed reward function to train the decision model.
At present, some solutions that combine supervised learning and reinforcement learning to train decision models have been proposed. For example, supervised learning may be used to train a feature extractor in the decision model, and the feature vector obtained by the feature extractor is used as the input of the reinforcement learning model. In such a solution, with the feature extractor trained by supervised learning, relatively accurate low-dimensional features can be obtained, thereby reducing the amount of data and the time required for reinforcement learning. However, the reinforcement learning model in such a solution cannot exploit the human strategies contained in the expert data, so the expert data is used inefficiently.
In order to at least partially solve the above problems and other potential problems, various embodiments of the present disclosure provide a solution for training a decision model. In general, according to the various embodiments described herein, a method of training a decision model is provided. The method includes: based on training data, determining a first strategy using a supervised learning model in the decision model and determining a second strategy using a reinforcement learning model in the decision model. The method further includes determining an imitation learning loss based on a difference between the first strategy and the second strategy. The method further includes training the decision model based on the imitation learning loss and a reinforcement learning loss corresponding to the second strategy.
In this way, based on both the imitation learning loss and the reinforcement learning loss, the ability of supervised learning to exploit expert data can be combined with the strong generalization of reinforcement learning, so that a human-like decision model with excellent performance can be trained. In some embodiments, according to the solution of the present disclosure, a decision model applicable to the field of autonomous driving can be trained to provide strategies such as lane changing.
Various example embodiments of the present disclosure are described below with reference to the accompanying drawings. FIG. 1 shows a schematic diagram of an example environment 100 in which multiple embodiments of the present disclosure can be implemented. In FIG. 1, the field of autonomous driving is taken as an example of an environment in which the solution for training a decision model according to the present disclosure can be applied.
Autonomous driving technology generally includes three aspects: road information perception and inference, behavior decision-making, and path planning. As shown in FIG. 1, the perception module 110 can process raw radar and camera information about the road and surrounding vehicles into road and vehicle information with physical meaning. The decision module 120 can determine upper-level decision behaviors, such as lane changing, overtaking, and turning left, according to the perceived road and vehicle information. Specifically, the decision module 120 can use the decision model 125 to determine the decision behavior, that is, the strategy. Examples of strategies may include left lane change, right lane change, going straight, overtaking, turning left, turning right, stopping, accelerating, decelerating, braking, and the like. Based on the determined strategy, the planning module 130 can plan a path for controlling the steering wheel, brake, and throttle of the ego vehicle to realize the upper-level decision behavior.
In some embodiments, the apparatus for training a decision model according to the present disclosure may be deployed on a vehicle with computing capability, for example, a vehicle equipped with a computer system. The apparatus for training a decision model according to the present disclosure may train the decision model 125 based on data collected from the vehicle and/or from a simulation environment based on a real vehicle. The executable code of the perception module 110, the decision module 120 (including the decision model 125), and the planning module 130 may be stored in a storage component of the vehicle and may be executed by a computing device of the vehicle, such as a processor, to implement the functions of training and/or applying the decision model. Additionally or alternatively, the apparatus for training a decision model according to the present disclosure may be deployed in a distributed manner, for example, at least partially deployed on a remote server. It should be understood that the environment 100 shown in FIG. 1 is only exemplary and does not limit the scope of the present disclosure. The solution for training a decision model according to the present disclosure can be applied to other fields such as recommendation decision management.
FIG. 2A shows a schematic diagram of an example process 200 of training a decision model according to some embodiments of the present disclosure. As shown in FIG. 2A, training data 201 is used to train a decision model 210. In some embodiments, the training data 201 may include labeled expert data, for example, behavioral data collected from human drivers and corresponding environmental data. Additionally or alternatively, the training data 201 may include data generated by a simulator. The simulator may determine behavioral data in a simulated manner based on environmental data. The environmental data may include, for example, offline data extracted from a map. Additionally or alternatively, the environmental data may include online data dynamically simulated based on the real environment of the vehicle. In some examples, the simulator may use a random strategy or a strategy generated by the reinforcement learning model to determine the corresponding behavioral data. It should be understood that the training data generated by the simulator may include unreasonable behaviors. These behaviors can serve as negative sample data to improve the robustness of the decision model 210.
As shown in FIG. 2A, the decision model 210 includes a supervised learning model 212 and a reinforcement learning model 214. The supervised learning model may be any suitable model based on supervised learning, such as a Transformer model or a decision tree model. The reinforcement learning model 214 may be any suitable model based on reinforcement learning, such as a Q-learning model or a Monte Carlo model. The scope of the present disclosure is not limited in terms of the specific model implementation.
Based on the training data 201, a first strategy 222 is determined using the supervised learning model 212 in the decision model 210, and a second strategy 224 is determined using the reinforcement learning model 214. It should be understood that the first strategy 222 and the second strategy 224 are obtained based on the same input data in the training data 201, so the difference between the first strategy 222 and the second strategy 224 can reflect the difference between the supervised learning model 212 and the reinforcement learning model 214 when making decisions for the same input data. It can be understood that, when making decisions, the supervised learning model 212 can generally apply more human experience than the reinforcement learning model 214, while the reinforcement learning model 214 is more exploratory than the supervised learning model 212.
In some embodiments, the supervised learning model 212 that determines the first strategy 222 may be a trained model. In other words, the parameters of the supervised learning model 212 have already been determined based on labeled expert data and are no longer updated during the training process shown in FIG. 2A. Alternatively, the supervised learning model 212 that determines the first strategy 222 may be trained together with the reinforcement learning model 214, and the parameters of the supervised learning model 212 are updated during the training process shown in FIG. 2A.
Based on the difference between the first strategy 222 and the second strategy 224, the strategy distillation module 230 determines an imitation learning loss 242. The imitation learning loss 242 can reflect the degree to which the reinforcement learning model 214 "imitates" the supervised learning model 212 when making decisions. For example, if the imitation learning loss 242 is small, the reinforcement learning model 214 "imitates" the supervised learning model 212 to a high degree when making decisions. This can also be understood as the reinforcement learning model 214 having "imitated" the human strategies contained in the expert data. Conversely, if the imitation learning loss 242 is large, the reinforcement learning model 214 "imitates" the supervised learning model 212 to a low degree when making decisions. Based on the imitation learning loss 242 determined by the strategy distillation module 230, the reinforcement learning model 214 can "distill" the strategy determined by the supervised learning model 212, thereby learning the human experience in the expert data.
In some embodiments, depending on the specific implementation of the supervised learning model 212 and the reinforcement learning model 214, the strategy distillation module 230 may normalize the first strategy 222 and the second strategy 224, and may determine the imitation learning loss 242 based on the distance between the normalized first strategy 222 and the normalized second strategy 224.
In some examples, the first strategy 222 output by the supervised learning model 212 may be a probability distribution over behaviors, for example (0.6, 0.4, 0), where each value represents the probability of a behavior. The second strategy 224 output by the reinforcement learning model 214 may be a similar probability distribution or (state, behavior) values. If the second strategy 224 is a similar probability distribution, the strategy distillation module 230 may determine the imitation learning loss 242 based on the vector distance between the first strategy 222 and the second strategy 224. If the second strategy 224 consists of (state, behavior) values, the strategy distillation module 230 may normalize these values using a softmax function and compute the distance between the first strategy 222 and the second strategy 224 using relative entropy (KL divergence), thereby obtaining the imitation learning loss 242.
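By way of illustration only, the following is a minimal Python sketch of such a KL-based imitation loss; the function names, the use of NumPy, and the example numbers are assumptions made for this sketch rather than details specified by the present disclosure.

```python
import numpy as np

def softmax(x):
    # Normalize raw (state, behavior) values into a probability distribution.
    z = np.exp(x - np.max(x))
    return z / z.sum()

def imitation_loss(first_policy_probs, second_policy_values):
    """KL divergence between the supervised policy (probabilities) and the
    RL policy (raw action values normalized with a softmax)."""
    p = np.asarray(first_policy_probs, dtype=float)
    q = softmax(np.asarray(second_policy_values, dtype=float))
    eps = 1e-8  # avoid log(0) for zero-probability behaviors
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

# Example: supervised policy (0.6, 0.4, 0) over three behaviors,
# compared with raw RL action values for the same input.
loss_kl = imitation_loss([0.6, 0.4, 0.0], [2.0, 1.5, -1.0])
```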
Based on the determined imitation learning loss 242 and a reinforcement learning loss 244 corresponding to the second strategy 224, the optimization module 250 trains (also referred to as optimizes) the decision model 210. Depending on the specific implementation of the reinforcement learning model 214, the reinforcement learning loss 244 may be determined based on any suitable loss function, and the scope of the present disclosure is not limited in this respect.
The optimization module 250 trains the decision model 210, or only the reinforcement learning model 214 in the decision model 210, by minimizing a combination of the imitation learning loss 242 and the reinforcement learning loss 244. In some embodiments, the optimization module 250 may determine an adaptive weight for the imitation learning loss 242, and determine an overall learning loss based on the adaptive weight, the imitation learning loss 242, and the reinforcement learning loss 244. The optimization module 250 may train the decision model 210 by minimizing the overall learning loss. For example, the overall learning loss L may be determined with reference to the following equation (1):
L = α · loss_KL + loss_RL    (1)
where loss_KL denotes the imitation learning loss 242, loss_RL denotes the reinforcement learning loss 244, and α denotes the adaptive weight for the imitation learning loss 242. It should be understood that the above equation (1) is only exemplary and does not limit the present disclosure. For example, the adaptive weights may include two weights, one for the imitation learning loss 242 and one for the reinforcement learning loss 244, and need not take the form of a coefficient.
In some embodiments, the optimization module 250 may determine an initial weight for the imitation learning loss 242 and determine the adaptive weight through step-by-step updates. In some embodiments, before a predetermined number of training epochs is reached, the optimization module 250 may update the initial weight based on the change in the imitation learning loss 242 to determine an updated weight. After the predetermined number of training epochs is reached, the optimization module 250 may gradually decrease the updated weight.
In some examples, before the predetermined number of training epochs, the initial weight may be increased if the imitation learning loss 242 increases. For example, if the imitation learning loss of an initial training epoch is less than the imitation learning loss of a subsequent training epoch, the initial weight may be increased. In some examples, the weight may be increased with reference to the formula α_n = 1.1 · α_{n-1}, where α_n denotes the adaptive weight of the n-th epoch and α_{n-1} denotes the adaptive weight of the (n-1)-th epoch.
Conversely, if the imitation learning loss 242 decreases, the initial weight may be kept unchanged. For example, if the imitation learning loss 242 of the initial training epoch is greater than the imitation learning loss 242 of the subsequent training epoch, the initial weight is kept. In some examples, after the predetermined number N of training epochs is reached, the weight may be gradually decreased with reference to a decay formula. According to that formula, the adaptive weight α_n decreases to zero at epoch 2N.
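As an illustration of this schedule, the sketch below combines equation (1) with the weight update just described. The linear decay used after epoch N is an assumption chosen only so that the weight reaches zero at epoch 2N, since the exact decay formula is not reproduced in the text; the class and variable names are likewise assumed.

```python
class AdaptiveWeight:
    """Adaptive weight alpha for the imitation learning loss (a sketch).

    Before epoch N: alpha_n = 1.1 * alpha_{n-1} if the imitation loss grew,
    otherwise alpha_n = alpha_{n-1}. After epoch N: decay so that alpha
    reaches zero at epoch 2N (a linear decay is assumed here).
    """

    def __init__(self, alpha_init: float, n_pre: int):
        self.alpha = alpha_init
        self.n_pre = n_pre          # predetermined number N of training epochs
        self.alpha_at_n = None
        self.prev_loss_kl = None

    def step(self, epoch: int, loss_kl: float) -> float:
        if epoch < self.n_pre:
            if self.prev_loss_kl is not None and loss_kl > self.prev_loss_kl:
                self.alpha *= 1.1   # imitation loss grew: imitate more strongly
        else:
            if self.alpha_at_n is None:
                self.alpha_at_n = self.alpha
            remaining = max(0, 2 * self.n_pre - epoch)
            self.alpha = self.alpha_at_n * remaining / self.n_pre
        self.prev_loss_kl = loss_kl
        return self.alpha


def overall_loss(alpha: float, loss_kl: float, loss_rl: float) -> float:
    # Equation (1): L = alpha * loss_KL + loss_RL
    return alpha * loss_kl + loss_rl
```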
FIG. 2B shows a schematic diagram of an example process 260 of training a decision model according to some embodiments of the present disclosure. In FIG. 2B, the supervised learning model 212 that determines the first strategy 222 may be trained together with the reinforcement learning model 214, and the parameters of the supervised learning model 212 are updated during the training process shown in FIG. 2B. In some embodiments, a supervised learning loss 262 corresponding to the first strategy 222 may be determined, and the decision model 210 may be trained jointly based on the imitation learning loss 242, the reinforcement learning loss 244, and the supervised learning loss 262.
With the processes 200 and 260, based on both the imitation learning loss 242 and the reinforcement learning loss 244, the ability of supervised learning to exploit expert data can be combined with the strong generalization of reinforcement learning, so that a human-like decision model 210 with excellent performance can be trained. In addition, with the adaptive weight, the reinforcement learning model 214 can focus more on "imitating" the strategy determined by the supervised learning model 212 in the early stage of training and more on autonomous exploration in the later stage of training, thereby improving the efficiency of training the decision model 210, and in particular the reinforcement learning model 214.
FIG. 3 shows a schematic diagram of an example process 300 of training a decision model in stages according to some embodiments of the present disclosure. It should be understood that FIG. 3 only takes the field of autonomous driving as an example and does not limit the scope of the present disclosure. As shown in FIG. 3, in the data collection stage 310, expert data and non-expert data can be collected. The expert data may include data collected directly from humans, for example, data obtained from the interaction of human experts with the environment. In the field of autonomous driving, expert data can be collected by recording drivers' control behaviors on the vehicle. The non-expert data may include data not directly generated by humans. For example, non-expert data may be collected with a simulator. The simulator can simulate the environment of the vehicle and apply a strategy in that environment to generate vehicle behaviors as non-expert data. The simulator can apply a random strategy or a strategy generated by the decision model to generate vehicle behaviors corresponding to that strategy. In particular, the simulator can apply the strategy output by the reinforcement learning model online to generate non-expert data.
In the feature extraction stage 320, feature extraction can be performed on the collected expert data and non-expert data to obtain preprocessed training data. For example, the environmental data and behavioral data can be converted into corresponding vector representations. In the data selection stage 330, specific data can be selected from the collected data for training the decision model in the training stage 340. For example, data of a specific decision scenario can be selected from the collected data as the training data for the decision model, thereby improving the performance of the decision model for that specific decision scenario.
In some embodiments, the labeled expert data can first be used to train the supervised learning model in the decision model, and the supervised learning model can be tested on a test set, so as to determine the inference performance of the supervised learning model trained on the expert data. The inference performance can indicate the predicted strategy quality for each of a plurality of decision scenarios. For example, the inference performance of the supervised learning model can indicate the predicted strategy quality for a lane-changing scenario, a braking scenario, and a turning scenario, respectively.
According to the obtained inference performance, a data selection module can be used to select, from the collected data, the training data used to train the decision model or the reinforcement learning model. The data selection module can determine, based on the inference performance of the supervised learning model, the data distribution in the training data corresponding to the plurality of decision scenarios. For example, for a specific decision scenario with poor predicted strategy quality, data of that decision scenario can be increased in the training data to improve the inference performance of the decision model in a targeted manner.
Additionally or alternatively, the training data can be updated based on the inference performance of the reinforcement learning model, so as to improve the inference performance of the decision model in a targeted manner in subsequent training epochs. Specifically, the inference performance of the reinforcement learning model obtained from previous training can be determined, where the inference performance indicates the predicted strategy quality for each of a plurality of decision scenarios. Based on the inference performance of the reinforcement learning model, the data distribution in the training data corresponding to the plurality of decision scenarios can be updated to determine updated training data. Based on the updated training data, the decision model can be further trained, so that the inference performance of the decision model for specific decision scenarios is improved.
In the training stage 340, as described above with reference to FIG. 2A and FIG. 2B, supervised learning and reinforcement learning can both be used to train the decision model. Specifically, strategy distillation can be used to make the reinforcement learning model "imitate" the decision-making of the supervised learning model, so that the strategy obtained through supervised learning is inherited by the reinforcement learning model. In addition, the degree of "imitation" can be adjusted based on the adaptive weight, so that the reinforcement learning model imitates the strategy obtained by the supervised learning model more at the beginning of training and gradually reduces the degree of imitation, thereby increasing the generalization of the decision model.
In some embodiments, the training stage 340 may include an offline training stage and an online training stage. In the offline training stage, the supervised learning model can be trained based on the expert data. In the online training stage, the strategy output by the reinforcement learning model can be applied to the simulator to generate non-expert data. This non-expert data can be used as a part of the training data to further train the reinforcement learning model in subsequent training epochs. In this way, the simulator can be used to increase the amount of training data, and the amount of training data for specific decision scenarios can be increased, thereby training the decision model more efficiently.
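For illustration only, the following Python sketch shows one way such an online data-generation loop could look. The simulator interface (`reset`, `step`), the `rl_policy` callable, and the three-behavior random strategy are assumptions of this sketch, not an interface defined by the present disclosure.

```python
import random

def generate_non_expert_data(simulator, rl_policy=None, episodes=10, max_steps=200):
    """Roll out a random strategy or the RL model's strategy in a simulator and
    record (state, action, next_state, reward) tuples as non-expert data.

    The simulator is assumed to expose reset() -> state and
    step(action) -> (next_state, reward, done); this interface is illustrative.
    """
    data = []
    for _ in range(episodes):
        state = simulator.reset()
        for _ in range(max_steps):
            if rl_policy is not None:
                action = rl_policy(state)           # strategy determined by the RL model
            else:
                action = random.choice([0, 1, 2])   # random strategy over three behaviors
            next_state, reward, done = simulator.step(action)
            data.append((state, action, next_state, reward))
            state = next_state
            if done:
                break
    return data
```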
FIG. 4 shows a flowchart of an example process 400 of training a decision model based on decision scenarios according to some embodiments of the present disclosure. With reference to FIG. 4, the application of the solution for training a decision model according to the present disclosure in the field of autonomous driving is described below. The apparatus for training the decision model may be deployed on the ego vehicle, and the training data may be determined by collecting the operation behaviors of human drivers and by simulation with the simulator. The trained decision model can determine strategies in various decision scenarios. Examples of decision scenarios may include decision scenarios that respectively require a left lane change, a right lane change, and going straight. With the trained decision model, when there is an obstacle vehicle or a slow vehicle ahead, a suitable target lane can be selected according to the current overall road conditions, for example by issuing a left lane change, right lane change, or go-straight instruction, so as to maximize traffic efficiency. The process of training the decision model is described in detail below.
As shown in FIG. 4, the decision model can be initialized at block 402. Parameters of the supervised learning model and the reinforcement learning model, such as the dimensions of the neural networks and the activation functions, can be initialized. Additionally or alternatively, parameters related to the decision task, such as the dimension of the behavior space, can be input. In some examples, the dimension of the behavior space can be set to 3, representing left lane change, right lane change, and going straight, respectively. Additionally or alternatively, the adaptive weight and the parameters used to update the weight, such as the predetermined number N of training epochs, can be initialized.
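A minimal sketch of such an initialization step is given below; the concrete field names and default values (hidden dimensions, activation, initial weight, N) are assumptions made for illustration and are not prescribed by the present disclosure.

```python
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    """Illustrative initialization of the decision-model training run."""
    action_dim: int = 3            # left lane change, right lane change, go straight
    hidden_dims: tuple = (256, 256)
    activation: str = "relu"
    alpha_init: float = 1.0        # initial weight for the imitation learning loss
    n_pre: int = 50                # predetermined number N of training epochs
    scenario_ratios: dict = None   # data mix over decision scenarios

    def __post_init__(self):
        if self.scenario_ratios is None:
            # Start with equal shares for the three decision scenarios.
            self.scenario_ratios = {"left": 1.0, "right": 1.0, "straight": 1.0}

config = TrainingConfig()
```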
At block 404, it can be determined whether the supervised learning model needs to be trained. In some embodiments, if the supervised learning model needs to be trained, the process 400 can proceed to block 406 to train the supervised learning model using the expert data. At block 406, any feasible supervised learning loss function can be used to train the supervised learning model. Examples of supervised learning loss functions include, but are not limited to, mean square error and cross entropy.
Conversely, if a trained supervised learning model can be used directly, the process 400 can proceed to block 408. At block 408, the data selection module can test the inference performance of the supervised learning model and determine the data distribution for specific decision scenarios (also referred to as determining the data scenarios) according to the inference performance. In some embodiments, decision scenarios corresponding respectively to a required left lane change, right lane change, and going straight can be set. The data selection module can adjust the distribution of the training data in the next training epoch based on the predicted strategy quality of the supervised learning model for these decision scenarios.
For example, when the supervised learning model performs poorly in a certain decision scenario, the proportion of data for that decision scenario can be increased in the training data. As a non-limiting example, the adjustment can be made in such a way that every 10% drop in the pass rate increases the data by 10%, with a minimum adjustment of 10%. For example, if the pass rates for the decision scenarios requiring a left lane change, a right lane change, and going straight are 80%, 80%, and 50%, respectively, the data proportion for the go-straight scenario can be increased, and the data proportions for these three decision scenarios can be determined as 100%:100%:130%.
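The adjustment rule just described can be sketched as follows. Treating the best-performing scenario as the reference point is an assumption made so that the sketch reproduces the 100%:100%:130% example; the text does not spell out the baseline, and the function and variable names are likewise illustrative.

```python
def adjust_scenario_ratios(pass_rates, step=0.10, min_adjust=0.10):
    """Increase a scenario's data share by 10% for every 10% that its pass rate
    falls below the best-performing scenario (minimum adjustment 10%)."""
    best = max(pass_rates.values())
    ratios = {}
    for scenario, rate in pass_rates.items():
        gap = best - rate
        if gap <= 0:
            ratios[scenario] = 1.0
        else:
            increase = max(min_adjust, round(gap / step) * step)
            ratios[scenario] = 1.0 + increase
    return ratios

# Pass rates 80%, 80%, 50% give data proportions of roughly 100%:100%:130%.
print(adjust_scenario_ratios({"left": 0.8, "right": 0.8, "straight": 0.5}))
```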
At blocks 410 to 420, the decision model (or only the reinforcement learning model) can be trained based on the determined training data. Specifically, at block 410, the strategy distillation module can compute the imitation learning loss. The strategy distillation module can determine the imitation learning loss based on the difference between the first strategy output by the supervised learning model and the second strategy output by the reinforcement learning model. At block 412, the reinforcement learning loss can be computed. For example, the Q-learning method can be used to compute the reinforcement learning loss.
At block 414, the adaptive weight module can compute the overall learning loss based on the imitation learning loss and the reinforcement learning loss. For example, the overall learning loss can be computed with reference to equation (1) above. At block 416, the reinforcement learning model can be trained based on the overall learning loss. At block 418, it can be determined whether the training has converged. If it has not converged, the process 400 can return to block 410 for the next epoch of training.
Conversely, if the training has converged, the process 400 can proceed to block 420. At block 420, it can be determined whether the inference performance of the reinforcement learning decision model meets the requirement. If the inference performance meets the requirement, the training can end at block 422. Conversely, if the inference performance does not meet the requirement, the process 400 can proceed to block 425. At block 425, the data selection module can select the data scenarios based on the inference performance of the reinforcement learning model, thereby adjusting the data distribution in the training data. For example, if the pass rates for the decision scenarios requiring a left lane change, a right lane change, and going straight are improved from 80%, 80%, and 50% to 80%, 80%, and 60%, the data proportions for these decision scenarios can be reduced from 100%:100%:130% to 100%:100%:120%.
With the process 400, supervised learning and reinforcement learning can be combined to train a human-like decision model with excellent performance. In addition, the data distribution for specific decision scenarios in the training data can be adjusted based on the inference performance of the supervised learning model and/or the reinforcement learning model, so that the decision model can be trained based on decision scenarios and its inference performance can be improved in a targeted manner.
FIG. 5 shows a flowchart of a process 500 of an example method of training a decision model according to some embodiments of the present disclosure. At block 510, based on training data, a first strategy is determined using a supervised learning model in the decision model and a second strategy is determined using a reinforcement learning model in the decision model. At block 520, an imitation learning loss is determined based on a difference between the first strategy and the second strategy. At block 530, the decision model is trained based on the imitation learning loss and a reinforcement learning loss corresponding to the second strategy.
In some embodiments, training the decision model based on the imitation learning loss and the reinforcement learning loss corresponding to the second strategy includes: determining an adaptive weight for the imitation learning loss; determining an overall learning loss based on the adaptive weight, the imitation learning loss, and the reinforcement learning loss; and training the decision model by minimizing the overall learning loss.
In some embodiments, determining the adaptive weight for the imitation learning loss includes: determining an initial weight for the imitation learning loss; before a predetermined number of training epochs is reached, updating the initial weight based on a change in the imitation learning loss to determine an updated weight; and after the predetermined number of training epochs is reached, gradually decreasing the updated weight.
In some embodiments, updating the initial weight based on the change in the imitation learning loss includes: increasing the initial weight if the imitation learning loss of an initial training epoch is less than the imitation learning loss of a subsequent training epoch; and keeping the initial weight if the imitation learning loss of the initial training epoch is greater than the imitation learning loss of the subsequent training epoch.
In some embodiments, the method further includes: training the supervised learning model based on labeled expert data; determining an inference performance of the supervised learning model trained on the expert data, the inference performance indicating a predicted strategy quality for each of a plurality of decision scenarios; and determining, based on the inference performance of the supervised learning model, a data distribution in the training data corresponding to the plurality of decision scenarios.
In some embodiments, determining the imitation learning loss based on the difference between the first strategy and the second strategy includes: normalizing the first strategy and the second strategy; and determining the imitation learning loss based on a distance between the normalized first strategy and the normalized second strategy.
In some embodiments, the method further includes: generating at least a part of the training data with a simulator. In some embodiments, generating data with the simulator includes: based on at least one of a strategy determined by the reinforcement learning model or a random strategy, generating, with the simulator, a behavior corresponding to the at least one of the strategy or the random strategy as at least a part of the training data.
In some embodiments, the method further includes: determining an inference performance of the reinforcement learning model, the inference performance indicating a predicted strategy quality for each of a plurality of decision scenarios; updating, based on the inference performance of the reinforcement learning model, a data distribution in the training data corresponding to the plurality of decision scenarios to determine updated training data; and training the decision model based on the updated training data.
In some embodiments, training the decision model includes: determining a supervised learning loss corresponding to the first strategy; and training the decision model based on the imitation learning loss, the reinforcement learning loss, and the supervised learning loss.
In some embodiments, the method further includes: determining a driving strategy based on driving-related input data using the trained decision model or the trained reinforcement learning model, the driving strategy including at least one of: left lane change, right lane change, going straight, overtaking, turning left, turning right, stopping, accelerating, decelerating, or braking.
According to the solution of the present disclosure, the advantages of supervised learning and reinforcement learning can be combined to train the decision model. For example, offline expert data can first be used to train the supervised learning model to obtain a human-like expert model. Then, the strategy of the expert model can be inherited, through the strategy distillation module, as the initial solution of the reinforcement learning model. In addition, with the data selection module, targeted improvement for decision scenarios can be achieved on the basis of the strategy of the expert model, so that a human-like decision model with excellent performance is obtained.
Example Apparatus and Device
FIG. 6 shows a block diagram of an apparatus 600 for training a decision network according to an embodiment of the present disclosure. The apparatus 600 may include a plurality of modules for performing the corresponding steps in the process 500 as discussed with reference to FIG. 5. The apparatus 600 may be deployed on an in-vehicle device (for example, a head unit) to improve the decision performance of autonomous driving software. The apparatus 600 includes a strategy determining unit 610 configured to, based on training data, determine a first strategy using a supervised learning model in the decision model and determine a second strategy using a reinforcement learning model in the decision model; a loss determining unit 620 configured to determine an imitation learning loss based on a difference between the first strategy and the second strategy; and an optimization unit 630 configured to train the decision model based on the imitation learning loss and a reinforcement learning loss corresponding to the second strategy.
In some embodiments, the optimization unit 630 is further configured to: determine an adaptive weight for the imitation learning loss; determine an overall learning loss based on the adaptive weight, the imitation learning loss, and the reinforcement learning loss; and train the decision model by minimizing the overall learning loss.
In some embodiments, the optimization unit 630 is further configured to: determine an initial weight for the imitation learning loss; before a predetermined number of training epochs is reached, update the initial weight based on a change in the imitation learning loss to determine an updated weight; and after the predetermined number of training epochs is reached, gradually decrease the updated weight.
In some embodiments, the optimization unit 630 is further configured to: increase the initial weight if the imitation learning loss of an initial training epoch is less than the imitation learning loss of a subsequent training epoch; and keep the initial weight if the imitation learning loss of the initial training epoch is greater than the imitation learning loss of the subsequent training epoch.
In some embodiments, the apparatus 600 further includes a training data determining unit configured to: train the supervised learning model based on labeled expert data; determine an inference performance of the supervised learning model trained on the expert data, the inference performance indicating a predicted strategy quality for each of a plurality of decision scenarios; and determine, based on the inference performance of the supervised learning model, a data distribution in the training data corresponding to the plurality of decision scenarios.
In some embodiments, the apparatus 600 further includes a simulator utilizing unit configured to generate at least a part of the training data with a simulator. In some embodiments, the simulator utilizing unit is further configured to: based on at least one of a strategy determined by the reinforcement learning model or a random strategy, generate, with the simulator, a behavior corresponding to the at least one of the strategy or the random strategy as at least a part of the training data.
In some embodiments, the apparatus 600 further includes a targeted optimization unit configured to: determine an inference performance of the reinforcement learning model, the inference performance indicating a predicted strategy quality for each of a plurality of decision scenarios; update, based on the inference performance of the reinforcement learning model, a data distribution in the training data corresponding to the plurality of decision scenarios to determine updated training data; and train the decision model based on the updated training data.
In some embodiments, the loss determining unit 620 is further configured to: normalize the first strategy and the second strategy; and determine the imitation learning loss based on a distance between the normalized first strategy and the normalized second strategy.
In some embodiments, the optimization unit 630 is further configured to: determine a supervised learning loss corresponding to the first strategy; and train the decision model based on the imitation learning loss, the reinforcement learning loss, and the supervised learning loss.
In some embodiments, the apparatus 600 further includes a decision model utilizing unit configured to: determine a driving strategy based on driving-related input data using the trained decision model or the trained reinforcement learning model, the driving strategy including at least one of: left lane change, right lane change, going straight, overtaking, turning left, turning right, stopping, accelerating, decelerating, or braking.
FIG. 7 shows a schematic block diagram of an example device 700 that can be used to implement embodiments of the present disclosure. As shown in the figure, the device 700 includes a computing unit 701, which can perform various appropriate actions and processes according to computer program instructions stored in a random access memory (RAM) 703 and/or a read-only memory (ROM) 702, or computer program instructions loaded from a storage unit 708 into the RAM 703 and/or the ROM 702. Various programs and data required for the operation of the device 700 can also be stored in the RAM 703 and/or the ROM 702. The computing unit 701 and the RAM 703 and/or the ROM 702 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
A plurality of components in the device 700 are connected to the I/O interface 705, including: an input unit 706, such as a keyboard or a mouse; an output unit 707, such as various types of displays and speakers; a storage unit 708, such as a magnetic disk or an optical disc; and a communication unit 709, such as a network card, a modem, or a wireless communication transceiver. The communication unit 709 allows the device 700 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
The computing unit 701 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, or microcontroller. The computing unit 701 performs the various methods and processes described above, such as the process 500. For example, in some embodiments, the process 500 may be implemented as a computer software program tangibly contained in a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed on the device 700 via the RAM and/or the ROM and/or the communication unit 709. When the computer program is loaded into the RAM and/or the ROM and executed by the computing unit 701, one or more steps of the process 500 described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the process 500 in any other appropriate manner (for example, by means of firmware).
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a server or terminal, the processes or functions described in the embodiments of the present application are generated in whole or in part. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (for example, coaxial cable, optical fiber, digital subscriber line) or wireless (for example, infrared, radio, microwave) means. The computer-readable storage medium may be any available medium accessible by a server or terminal, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (such as a floppy disk, a hard disk, or a magnetic tape), an optical medium (such as a digital video disk (DVD)), or a semiconductor medium (such as a solid-state drive).
Furthermore, although the operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although several specific implementation details are contained in the above discussion, these should not be construed as limitations on the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable sub-combination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims.
Claims (22)
- A method of training a decision model, the method comprising: based on driving-related training data, determining a first strategy using a supervised learning model in the decision model and determining a second strategy using a reinforcement learning model in the decision model; determining an imitation learning loss based on a difference between the first strategy and the second strategy; and training the decision model based on the imitation learning loss and a reinforcement learning loss corresponding to the second strategy.
- The method according to claim 1, wherein training the decision model based on the imitation learning loss and the reinforcement learning loss corresponding to the second strategy comprises: determining an adaptive weight for the imitation learning loss; determining an overall learning loss based on the adaptive weight, the imitation learning loss, and the reinforcement learning loss; and training the decision model by minimizing the overall learning loss.
- The method according to claim 2, wherein determining the adaptive weight for the imitation learning loss comprises: determining an initial weight for the imitation learning loss; before a predetermined number of training epochs is reached, updating the initial weight based on a change in the imitation learning loss to determine an updated weight; and after the predetermined number of training epochs is reached, gradually decreasing the updated weight.
- The method according to claim 3, wherein updating the initial weight based on the change in the imitation learning loss comprises: increasing the initial weight if the imitation learning loss of an initial training epoch is less than the imitation learning loss of a subsequent training epoch; and keeping the initial weight if the imitation learning loss of the initial training epoch is greater than the imitation learning loss of the subsequent training epoch.
- The method according to any one of claims 1 to 4, further comprising: training the supervised learning model based on labeled expert data; determining an inference performance of the supervised learning model trained on the expert data, the inference performance indicating a predicted strategy quality for each of a plurality of decision scenarios; and determining, based on the inference performance of the supervised learning model, a data distribution in the training data corresponding to the plurality of decision scenarios.
- The method according to any one of claims 1 to 5, wherein determining the imitation learning loss based on the difference between the first strategy and the second strategy comprises: normalizing the first strategy and the second strategy; and determining the imitation learning loss based on a distance between the normalized first strategy and the normalized second strategy.
- The method according to any one of claims 1 to 6, further comprising: generating at least a part of the training data with a simulator.
- The method according to claim 7, wherein generating data with the simulator comprises: based on at least one of a strategy determined by the reinforcement learning model or a random strategy, generating, with the simulator, a behavior corresponding to the at least one of the strategy or the random strategy as at least a part of the training data.
- The method according to any one of claims 1 to 8, further comprising: determining an inference performance of the reinforcement learning model, the inference performance indicating a predicted strategy quality for each of a plurality of decision scenarios; updating, based on the inference performance of the reinforcement learning model, a data distribution in the training data corresponding to the plurality of decision scenarios to determine updated training data; and training the decision model based on the updated training data.
- The method according to any one of claims 1 to 9, wherein training the decision model comprises: determining a supervised learning loss corresponding to the first strategy; and training the decision model based on the imitation learning loss, the reinforcement learning loss, and the supervised learning loss.
- The method according to any one of claims 1 to 10, further comprising: determining a driving strategy based on driving-related input data using the trained decision model or the trained reinforcement learning model, the driving strategy including at least one of: left lane change, right lane change, going straight, overtaking, turning left, turning right, stopping, accelerating, decelerating, or braking.
- An apparatus for training a decision model, comprising: a strategy determining unit configured to, based on driving-related training data, determine a first strategy using a supervised learning model in the decision model and determine a second strategy using a reinforcement learning model in the decision model; a loss determining unit configured to determine an imitation learning loss based on a difference between the first strategy and the second strategy; and an optimization unit configured to train the decision model based on the imitation learning loss and a reinforcement learning loss corresponding to the second strategy.
- The apparatus according to claim 12, wherein the optimization unit is further configured to: determine an adaptive weight for the imitation learning loss; determine an overall learning loss based on the adaptive weight, the imitation learning loss, and the reinforcement learning loss; and train the decision model by minimizing the overall learning loss.
- The apparatus according to claim 13, wherein the optimization unit is further configured to: determine an initial weight for the imitation learning loss; before a predetermined number of training epochs is reached, update the initial weight based on a change in the imitation learning loss to determine an updated weight; and after the predetermined number of training epochs is reached, gradually decrease the updated weight.
- The apparatus according to claim 14, wherein the optimization unit is further configured to: increase the initial weight if the imitation learning loss of an initial training epoch is less than the imitation learning loss of a subsequent training epoch; and keep the initial weight if the imitation learning loss of the initial training epoch is greater than the imitation learning loss of the subsequent training epoch.
- The apparatus according to any one of claims 12 to 15, further comprising a training data determining unit configured to: train the supervised learning model based on labeled expert data; determine an inference performance of the supervised learning model trained on the expert data, the inference performance indicating a predicted strategy quality for each of a plurality of decision scenarios; and determine, based on the inference performance of the supervised learning model, a data distribution in the training data corresponding to the plurality of decision scenarios.
- The apparatus according to any one of claims 12 to 16, further comprising a simulator utilizing unit configured to: generate at least a part of the training data with a simulator.
- The apparatus according to claim 17, wherein the simulator utilizing unit is further configured to: based on at least one of a strategy determined by the reinforcement learning model or a random strategy, generate, with the simulator, a behavior corresponding to the at least one of the strategy or the random strategy as at least a part of the training data.
- The apparatus according to any one of claims 12 to 18, further comprising a targeted optimization unit configured to: determine an inference performance of the reinforcement learning model, the inference performance indicating a predicted strategy quality for each of a plurality of decision scenarios; update, based on the inference performance of the reinforcement learning model, a data distribution in the training data corresponding to the plurality of decision scenarios to determine updated training data; and train the decision model based on the updated training data.
- An electronic device, comprising: at least one computing unit; and at least one memory coupled to the at least one computing unit and storing instructions for execution by the at least one computing unit, the instructions, when executed by the at least one computing unit, causing the electronic device to perform the method according to any one of claims 1 to 11.
- A computer-readable storage medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of claims 1 to 11.
- A computer program product, comprising computer-executable instructions, wherein the computer-executable instructions, when executed by a processor, implement the method according to any one of claims 1 to 11.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310413264.9 | 2023-04-10 | ||
CN202310413264.9A CN118780387A (zh) | 2023-04-10 | 2023-04-10 | Method, apparatus, device, medium and program product for training a decision model
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024212657A1 (zh) | 2024-10-17
Family
ID=92988905
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2024/073076 WO2024212657A1 (zh) | 2024-01-18 | Method, apparatus, device, medium and program product for training a decision model
Country Status (2)
Country | Link |
---|---|
CN (1) | CN118780387A (zh) |
WO (1) | WO2024212657A1 (zh) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200090048A1 (en) * | 2017-05-19 | 2020-03-19 | Deepmind Technologies Limited | Multi-task neural network systems with task-specific policies and a shared policy |
CN111461226A (zh) * | 2020-04-01 | 2020-07-28 | 深圳前海微众银行股份有限公司 | 对抗样本生成方法、装置、终端及可读存储介质 |
CN112508164A (zh) * | 2020-07-24 | 2021-03-16 | 北京航空航天大学 | 一种基于异步监督学习的端到端自动驾驶模型预训练方法 |
CN113110550A (zh) * | 2021-04-23 | 2021-07-13 | 南京大学 | 一种基于强化学习与网络模型蒸馏的无人机飞行控制方法 |
CN113343979A (zh) * | 2021-05-31 | 2021-09-03 | 北京百度网讯科技有限公司 | 用于训练模型的方法、装置、设备、介质和程序产品 |
CN113835421A (zh) * | 2020-06-06 | 2021-12-24 | 华为技术有限公司 | 训练驾驶行为决策模型的方法及装置 |
CN113962362A (zh) * | 2021-10-18 | 2022-01-21 | 北京百度网讯科技有限公司 | 强化学习模型训练方法、决策方法、装置、设备及介质 |
CN114404977A (zh) * | 2022-01-25 | 2022-04-29 | 腾讯科技(深圳)有限公司 | 行为模型的训练方法、结构扩容模型的训练方法 |
US20230107539A1 (en) * | 2021-10-06 | 2023-04-06 | Samsung Electronics Co., Ltd. | Multi-batch reinforcement learning via multi-imitation learning |
- 2023-04-10: CN application CN202310413264.9A filed; published as CN118780387A (status: pending)
- 2024-01-18: PCT application PCT/CN2024/073076 filed; published as WO2024212657A1
Also Published As
Publication number | Publication date |
---|---|
CN118780387A (zh) | 2024-10-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Ye et al. | Automated lane change strategy using proximal policy optimization-based deep reinforcement learning | |
CN107229973B (zh) | 一种用于车辆自动驾驶的策略网络模型的生成方法及装置 | |
Chen et al. | Joint optimization of sensing, decision-making and motion-controlling for autonomous vehicles: A deep reinforcement learning approach | |
WO2019047646A1 (zh) | 车辆避障方法和装置 | |
CN114148349B (zh) | 一种基于生成对抗模仿学习的车辆个性化跟驰控制方法 | |
WO2020098226A1 (en) | System and methods of efficient, continuous, and safe learning using first principles and constraints | |
CN114926823B (zh) | 基于wgcn的车辆驾驶行为预测方法 | |
Zou et al. | Inverse reinforcement learning via neural network in driver behavior modeling | |
CN114997048A (zh) | 基于探索策略改进的td3算法的自动驾驶车辆车道保持方法 | |
Wang et al. | An end-to-end deep reinforcement learning model based on proximal policy optimization algorithm for autonomous driving of off-road vehicle | |
Hu et al. | Multi-objective optimization for autonomous driving strategy based on Deep Q Network | |
WO2024212657A1 (zh) | 用于训练决策模型的方法、装置、设备、介质和程序产品 | |
CN110390398A (zh) | 在线学习方法 | |
CN113033902A (zh) | 一种基于改进深度学习的自动驾驶换道轨迹规划方法 | |
Yuan et al. | Human feedback enhanced autonomous intelligent systems: a perspective from intelligent driving | |
Cui et al. | An Integrated Lateral and Longitudinal Decision‐Making Model for Autonomous Driving Based on Deep Reinforcement Learning | |
WO2023060586A1 (zh) | 自动驾驶指令生成模型优化方法、装置、设备及存储介质 | |
US20220188621A1 (en) | Generative domain adaptation in a neural network | |
CN115426149A (zh) | 基于雅各比显著图的单交叉口信号灯控制的交通状态对抗扰动生成方法 | |
Wu et al. | Aggregated multi-deep deterministic policy gradient for self-driving policy | |
CN114781064A (zh) | 一种基于社会力的车辆行为建模方法 | |
Si et al. | A Deep Coordination Graph Convolution Reinforcement Learning for Multi‐Intelligent Vehicle Driving Policy | |
CN118560530B (zh) | 一种基于生成对抗模仿学习的多智能体驾驶行为建模方法 | |
Wu et al. | Federated learning-based driving strategies optimization for intelligent connected vehicles | |
Liu et al. | Driver behavior modeling via inverse reinforcement learning based on particle swarm optimization |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 24787763; Country of ref document: EP; Kind code of ref document: A1 |