CN115293334A - Model-based unmanned equipment control method for high sample rate deep reinforcement learning - Google Patents
- Publication number: CN115293334A
- Application number: CN202210963402.6A
- Authority: CN (China)
- Prior art keywords: data, model, state, action, actor
- Legal status: Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses a model-based unmanned equipment control method for high-sample-rate deep reinforcement learning, which comprises the following steps: acquiring trajectory data and storing the trajectory data into an environment buffer pool; updating the environmental state transition model; predicting multi-step interaction trajectories to generate prediction data, and storing the prediction data into a model buffer pool; updating the Actor-Critic strategy model; and continuously and iteratively updating the environmental state transition model and the Actor-Critic strategy model until the performance of the current strategy meets the expected requirement. The invention adopts a model-based deep reinforcement learning method and constructs an environmental state transition model to simulate the interaction between the unmanned equipment and the external environment, which sharply reduces the number of interactions between the unmanned equipment and the real environment; the data generated by the environmental state transition model effectively optimizes the travel control strategy of the unmanned equipment, making control of the unmanned equipment efficient.
Description
Technical Field
The invention relates to unmanned equipment control technology, and in particular to a model-based unmanned equipment control method with high-sample-rate deep reinforcement learning.
Background
At present, control of the travel of unmanned equipment is mainly developed on the basis of traditional control technology, but traditional control technology suffers from problems such as single, inflexible planning of the unmanned equipment's travel route and a lack of coping strategies in complex scenes. With the rapid development of deep learning and reinforcement learning, the strong feature-learning capability of deep neural networks makes it possible to learn the relevant travel features from a large amount of unmanned equipment interaction data; combined with reinforcement learning's modeling of the unmanned equipment travel problem, obstacle avoidance during travel can be realized. However, the problem of low data sample efficiency still exists.
Disclosure of Invention
In order to overcome at least the above-mentioned deficiencies in the prior art, it is an object of the present application to provide a model-based unmanned device control method with high-sample-rate deep reinforcement learning.
The embodiment of the application provides a model-based unmanned equipment control method for high-sample-rate deep reinforcement learning, which comprises the following steps:
controlling the unmanned equipment to use a strategy in an Actor-Critic strategy model to interact with a real environment to obtain track data, and storing the track data into an environment buffer pool;
updating the environmental state transition model through the data in the environmental buffer pool;
using the data in the environment buffer pool to carry out multi-step interactive trajectory prediction on the environment state transition model and the Actor-Critic strategy model to generate prediction data, and storing the prediction data in a model buffer pool;
updating the Actor-Critic strategy model through data in the model buffer pool;
and continuously and iteratively updating the environment state transition model and the Actor-Critic strategy model until the current strategy performance meets the expected requirement.
In the prior art, the time cost of running tests of unmanned equipment in the real environment is very high, and problems such as equipment collisions easily occur when the unmanned equipment operates in the real environment for a long time, so it is difficult to guarantee the large number of samples required to train a model for unmanned equipment operation control. When the method and device are implemented, an Actor-Critic strategy model and an environmental state transition model are constructed first. The Actor-Critic strategy model provides the strategy with which the unmanned equipment operates in the environment, and the environmental state transition model provides a simulation environment in which the unmanned equipment operates; a large number of samples for training the unmanned equipment operation control model can be generated through interaction between the Actor-Critic strategy model and the environmental state transition model, thereby improving the accuracy of the trained model.
In the embodiment of the application, each iterative update of the model first controls the unmanned equipment to operate for a short time in the real environment using the strategy in the Actor-Critic strategy model, and the generated trajectory data is used to update the environmental state transition model; multi-step interactive trajectory prediction is then carried out with the environmental state transition model and the Actor-Critic strategy model to generate a large amount of prediction data that serves as training samples to update the Actor-Critic strategy model. After a certain number of iterations the Actor-Critic strategy model becomes a mature model for providing the unmanned equipment control strategy, on which subsequent tests can be carried out. The embodiment of the application thus provides a high-precision simulation environment for unmanned equipment interaction; the simulation environment is learned from real historical interaction trajectories, and the accuracy of the environment is directly influenced by the loss function of the environmental state transition model during learning. The environmental state transition model of the embodiment of the application corresponds to the simulation environment, and provides safety and efficiency for unmanned equipment interaction.
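For orientation, the iterative procedure described above can be sketched in Python as follows. This is a minimal sketch: all helper functions (`collect_real_trajectories`, `update_env_model`, `rollout_model`, `update_actor_critic`, `evaluate`), the rollout horizon, the step budget and the stopping criterion are illustrative assumptions rather than the patented implementation.

```python
# Minimal sketch of the overall iteration; every helper below is hypothetical and
# stands in for one of the steps described in the text.
env_buffer, model_buffer = [], []          # environment buffer pool, model buffer pool

for iteration in range(num_iterations):    # num_iterations: assumed training budget
    # Run the current Actor-Critic policy briefly in the real environment
    env_buffer += collect_real_trajectories(actor, real_env, steps=1000)

    # Fit the environmental state transition model on real interaction data
    update_env_model(env_model, env_buffer)

    # Generate multi-step predicted trajectories inside the learned model
    model_buffer += rollout_model(env_model, actor, env_buffer, horizon=5)

    # Update the Actor-Critic policy model on model-generated data
    update_actor_critic(actor, critic, model_buffer)

    # Stop once the current policy performance meets the expected requirement
    if evaluate(actor, real_env) >= target_performance:
        break
```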
In one possible implementation manner, controlling the unmanned device to use the strategy in the Actor-Critic strategy model to interact with the real environment to acquire the trajectory data comprises the following steps:
inputting current state data of the unmanned equipment in the real environment as first state data into an Actor network of the Actor-Critic policy model, and receiving an action value output by the Actor network as first action data; the first action data is obtained by the Actor network sampling from the multidimensional Gaussian distribution whose mean and variance are output by the last fully connected layer of the Actor network;
controlling the unmanned equipment to operate in the real environment by the first action data, and acquiring state data of the unmanned equipment at the next moment as second state data and a return value at the next moment as a first return value;
and taking the first state data, the first action data, the second state data and the first return value as the track data.
In a possible implementation manner, the updating of the Actor-Critic policy model by the data in the model buffer pool includes:
sampling a fixed amount of third state data, second action data, second return values and fourth state data from the model buffer pool; the fourth state data is state data at the next moment of the third state data;
inputting the fourth state data into an Actor network of the Actor-Critic policy model to acquire third action data;
inputting the third state data and the second action data into a Critic network of the Actor-Critic policy model, and acquiring a first state action function output by the Critic network; the first state action function is the minimum value of the state action functions respectively output by at least two current Q networks in the Critic network;
inputting the third action data and the fourth state data into a target Q network in the Critic network to obtain a state action function at the next moment as a second state action function;
and calculating a first loss function through the second return value, the first state action function and the second state action function, and updating the Critic network.
In a possible implementation manner, updating the Actor-Critic policy model through the data in the model buffer pool further includes:
calculating a second loss function through a third state action function and the action entropy value and updating the Actor network; the third state action function is the evaluation value of the Critic network when an action value is selected through the Actor network policy under the corresponding state data; the action entropy value is the entropy value of the action value selected by the Actor network policy.
In a possible implementation manner, the generating prediction data by performing multi-step interaction trajectory prediction on the environmental state transition model and the Actor-Critic policy model by using the data in the environmental buffer pool includes:
randomly sampling a preset number of fifth state data from the environment buffer pool;
inputting the fifth state data into an Actor network of the Actor-Critic policy model, and acquiring fourth action data output by the Actor network;
inputting the fifth state data and the fourth action data into the environmental state transition model to obtain state data at the next moment as sixth state data, and taking a return value at the next moment as a third return value;
taking the sixth state data as fifth state data and cyclically acquiring the state data and the return value until a preset condition is met;
and randomly selecting a preset amount of data generated in the cyclic process as the prediction data.
In one possible implementation, the environmental state transition model includes a plurality of mutually independent submodels; each sub-model is trained through the same neural network model and sample, and the initial value of each sub-model in training is different.
In one possible implementation, the environmental state transition model includes a third loss function and a fourth loss function; the third loss function is generated based on how the current state data and the action data are transferred to the state data at the next moment and the generated return value; the fourth loss function is generated based on a difference between the real environment and a simulated environment produced by the environmental state transition model.
In one possible implementation, the fourth loss function is generated based on the following equation:
In the formula, the first term is the performance of the policy of the unmanned device on the real environment and the second term is the performance of the same policy on the simulated environment; s is the sampled state data, a is the sampled action data, γ represents the discount factor in reinforcement learning, |r|_max represents the maximum absolute value of the return value given by the environment, and the remaining term is the difference between the real environment and the simulated environment;
wherein p denotes the p-norm over state data and action data in the real environment and, correspondingly, the p-norm over state data and action data in the simulated environment, D_TV is the total-variation (TV) distance, and s' is the state at the next moment.
In one possible implementation, when the difference between the real environment and the simulated environment is measured by a KL divergence, the network loss function used by the environmental state transition model is implemented by:
in the formula, s n For state data sampled from the environmental buffer pool, a n For action data sampled from the context buffer pool,for the next moment state predicted by the environmental state transition model given the state data and the action data,to correspond toα is a hyperparameter, μ θold The average value of the environmental state transition model at the last update,is corresponding toVariance of (d), μ θ For the latest mean, σ, of the model of the environmental state transitions θ To correspond to mu θ Variance of (D), D kl Is KL divergence, N is normal distribution,is a maximum likelihood estimation function.
In one possible implementation, when the difference between the real environment and the simulated environment is measured by a p-norm, the network loss function used by the environmental state transition model is implemented by:
In the formula, s_n is the state data sampled from the environment buffer pool and a_n is the action data sampled from the environment buffer pool; the next-moment state predicted by the environmental state transition model given the state data and the action data, together with the corresponding predicted return value, also appears in the formula; α is a hyperparameter; the mean of the environmental state transition model at the last update and its corresponding variance appear as in the previous formula; μ_θ is the latest mean of the environmental state transition model; the norm used is the 2-norm; and the remaining term is a maximum likelihood estimation function.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The invention models the interaction between the unmanned equipment and the external environment during travel control as an MDP (Markov decision process) and provides a high-accuracy environmental state transition model that supplies high-precision simulated trajectory samples.
2. The invention solves the optimization problem of the unmanned equipment travel control strategy by means of the Actor and Critic functions, and obtains a high-quality environmental state transition model through an entirely new loss-function optimization.
3. The invention adopts a model-based deep reinforcement learning method and constructs an environmental state transition model to simulate the interaction between the unmanned equipment and the external environment, which sharply reduces the number of interactions between the unmanned equipment and the real environment; the data generated by the environmental state transition model effectively optimizes the travel control strategy of the unmanned equipment, making control of the unmanned equipment efficient.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. In the drawings:
FIG. 1 is a schematic diagram of the steps of an embodiment of the method of the present application;
fig. 2 is a schematic diagram of a network structure of an Actor in the embodiment of the present application;
FIG. 3 is a schematic diagram of a Critic network according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an environmental state transition model according to an embodiment of the present application.
Detailed Description
In order to make the purpose, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it should be understood that the drawings in the present application are for illustrative and descriptive purposes only and are not used to limit the scope of protection of the present application. Further, it should be understood that the schematic drawings are not drawn to scale. The flowcharts used in this application illustrate operations implemented according to some of the embodiments of the present application. It should be understood that the operations of the flow diagrams may be performed out of order, and steps without logical context may be performed in reverse order or simultaneously. In addition, one skilled in the art, under the guidance of the present disclosure, may add one or more other operations to, or remove one or more operations from, the flowchart.
In addition, the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, as presented in the figures, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
Please refer to fig. 1, which is a flowchart of the model-based unmanned equipment control method for high-sample-rate deep reinforcement learning according to an embodiment of the present disclosure. Further, the method may specifically include the contents described in the following steps S1 to S5.
S1: controlling the unmanned equipment to use a strategy in the Actor-Critic strategy model to interact with a real environment to obtain track data, and storing the track data into an environment buffer pool;
S2: updating an environmental state transition model through the data in the environment buffer pool;
S3: using the data in the environment buffer pool to carry out multi-step interactive trajectory prediction on the environmental state transition model and the Actor-Critic strategy model to generate prediction data, and storing the prediction data in a model buffer pool;
S4: updating the Actor-Critic strategy model through data in the model buffer pool;
S5: continuously and iteratively updating the environmental state transition model and the Actor-Critic strategy model until the current strategy performance meets the expected requirement.
In the prior art, the time cost of running and testing unmanned equipment in the real environment is very high, and problems such as equipment collisions easily occur when the unmanned equipment runs in the real environment for a long time, so the sources of the large number of samples required to train a model for unmanned equipment operation control are difficult to guarantee. When the method and device are implemented, an Actor-Critic strategy model and an environmental state transition model are constructed first. The Actor-Critic strategy model provides the strategy with which the unmanned equipment operates in the environment, and the environmental state transition model provides a simulation environment in which the unmanned equipment operates; a large number of samples for training the unmanned equipment operation control model can be generated through interaction between the Actor-Critic strategy model and the environmental state transition model, thereby improving the accuracy of the trained model.
In the embodiment of the application, each iteration of the model is updated by controlling the unmanned equipment to operate for a short time in the real environment using the strategy in the Actor-Critic strategy model, and the generated trajectory data is used to update the environmental state transition model; multi-step interactive trajectory prediction is then performed with the environmental state transition model and the Actor-Critic strategy model to generate a large amount of prediction data as training samples to update the Actor-Critic strategy model, so that after a certain number of iterations the Actor-Critic strategy model becomes a mature model for providing the unmanned equipment control strategy and subsequent tests can be performed. The method and device thus provide a high-precision simulation environment for unmanned equipment interaction; learning is carried out from real historical interaction trajectories, and the accuracy of the environment is directly influenced by the loss function of the environmental state transition model during learning. The environmental state transition model of the embodiment of the application corresponds to the simulation environment, and provides safety and efficiency for unmanned equipment interaction.
In one possible implementation manner, controlling the unmanned device to use the strategy in the Actor-Critic strategy model to interact with the real environment to acquire the track data includes:
inputting current state data of the unmanned equipment in the real environment as first state data into an Actor network of the Actor-Critic policy model, and receiving an action value output by the Actor network as first action data; the first action data is obtained by the Actor network sampling from the multidimensional Gaussian distribution whose mean and variance are output by the last fully connected layer of the Actor network;
controlling the unmanned equipment to operate in the real environment by the first action data, and acquiring state data of the unmanned equipment at the next moment as second state data and a return value at the next moment as a first return value;
and taking the first state data, the first action data, the second state data and the first return value as the track data.
When the method is implemented, the Actor-Critic strategy model comprises an Actor network and a Critic network, each of which comprises fully connected layers and activation layers arranged in sequence. When the Actor network outputs the mean and variance from its last fully connected layer, it applies a tanh function as a nonlinear mapping to the samples drawn from the Gaussian distribution, ensuring that the final action values lie in a valid range. When the first state data is input into the Actor network, the Actor network generates the first action data corresponding to the first state data; the first action data is the strategy for controlling the unmanned equipment to operate in the real environment. By controlling the unmanned equipment in the real environment with this strategy, the state data of the unmanned equipment after executing the strategy, i.e. the second state data, can be obtained, and the return value generated after executing the strategy, i.e. the first return value, can also be obtained. In this way the first state data, the first action data, the second state data and the first return value form a sample pair of the unmanned equipment operating in the real environment; the samples are stored in the environment buffer pool and used to update the simulated environment.
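As an illustration, the sampling step described above can be sketched as follows. This is a minimal PyTorch sketch under the assumption that the Actor outputs the mean and log-standard-deviation of a Gaussian and that actions are squashed with tanh; the layer sizes, the clamping range and the Gym-style environment interface are illustrative assumptions, not details taken from the patent.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Gaussian policy: maps a state to a tanh-squashed action sample."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden, action_dim)
        self.log_std = nn.Linear(hidden, action_dim)

    def forward(self, state):
        h = self.body(state)
        mean, log_std = self.mean(h), self.log_std(h).clamp(-20, 2)
        dist = torch.distributions.Normal(mean, log_std.exp())
        raw_action = dist.rsample()           # sample from the multidimensional Gaussian
        return torch.tanh(raw_action)         # tanh keeps the action in a valid range

def collect_step(actor, env, state, env_buffer):
    """One real-environment interaction: the pair (s, a, s', r) is stored in the buffer."""
    with torch.no_grad():
        action = actor(torch.as_tensor(state, dtype=torch.float32))
    # assumes a classic Gym-style env returning (obs, reward, done, info)
    next_state, reward, done, _ = env.step(action.numpy())
    env_buffer.append((state, action.numpy(), next_state, reward))
    return next_state, done
```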
In a possible implementation manner, the updating of the Actor-Critic policy model by the data in the model buffer pool includes:
sampling a fixed amount of third state data, second action data, second return values and fourth state data from the model buffer pool; the fourth state data is state data at the next moment of the third state data;
inputting the fourth state data into an Actor network of the Actor-Critic policy model to acquire third action data;
inputting the third state data and the second action data into a Critic network of the Actor-Critic policy model, and acquiring a first state action function output by the Critic network; the first state action function is the minimum value of the state action functions respectively output by at least two current Q networks in the Critic network;
inputting the third action data and the fourth state data into a target Q network in the Critic network to obtain a state action function at the next moment as a second state action function;
and calculating a first loss function through the second return value, the first state action function and the second state action function, and updating the Critic network.
In the embodiment of the application, the Critic network is divided into a target Q network and at least two current Q networks. When the action selected in the current state is evaluated, the current Q networks are used for the calculation, and the smaller of the output Q values is selected for updating the Actor network according to the evaluation results. The Actor network decides what action should be taken when the unmanned equipment encounters different environments and thus corresponds to the unmanned equipment itself; the Critic network evaluates the effect of the action selected by the Actor network and thus corresponds to evaluation. Together, the environmental state transition model, the Actor policy network and the Critic network form a complete model-based high-sample-rate deep reinforcement learning algorithm. The unmanned equipment travels by driving the motors in each of its driving mechanisms, and the relevant parameters of each driving mechanism's motor are adjusted according to the model-based high-sample-rate deep reinforcement learning algorithm to cope with the different states the unmanned equipment faces.
In the embodiment of the present application, updating the Actor-Critic policy model corresponds to training the policy model again after supplementing data, where the third state data, the second action data, the second return value and the fourth state data belong to one sample pair, i.e. they correspond to one another. The first state action function is the evaluation output by the Critic network with the third state data and the second action data as input; the third state data and the second action data can be understood as the current state data and the current action data, so the first state action function is the evaluation of the current action data given the current state data. The second state action function is the evaluation output by the Critic network given the third action data and the fourth state data; the third action data is the action at the next moment relative to the current one and the fourth state data is the state at the next moment, so the second state action function is the evaluation of the next-moment action data given the next-moment state data. The second return value is the return value for executing the second action data in the environment, and the first loss function can be calculated from the second return value, the first state action function and the second state action function in order to update the Critic network.
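A minimal sketch of one way to implement this Critic update is given below (PyTorch). The patent only states which quantities enter the first loss function; the discount factor, the mean-squared form of the loss, and regressing the element-wise minimum of the two current Q networks onto the target are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def critic_loss(q1, q2, q_target_net, actor, batch, gamma=0.99):
    """First loss function for the Critic, following the description above.

    batch holds (s3, a2, r2, s4): third state data, second action data,
    second return value and fourth state data sampled from the model buffer pool.
    Q networks are assumed to output tensors of shape (batch_size, 1).
    """
    s3, a2, r2, s4 = batch
    r2 = r2.view(-1, 1)                     # second return value, shaped like the Q outputs

    # first state action function: minimum of the two current Q networks on (s3, a2)
    q_current = torch.min(q1(s3, a2), q2(s3, a2))

    with torch.no_grad():
        a3 = actor(s4)                      # third action data from the Actor network
        q_next = q_target_net(s4, a3)       # second state action function (target Q network)
        td_target = r2 + gamma * q_next     # gamma: assumed discount factor

    # temporal-difference regression; the exact functional form is not fixed by the patent
    return F.mse_loss(q_current, td_target)
```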
For example, please refer to fig. 2, which illustrates the structure of the Actor network in this embodiment. The Actor network in this embodiment comprises 3 fully connected layers with activation layers arranged in sequence, each fully connected layer containing 256 neurons; the input of the network is the current or a given state, and the output is the action to be taken when facing that state.
For example, referring to fig. 3, the Critic network in this embodiment comprises two current Q networks with identical structures, each consisting of 3 fully connected layers with activation layers arranged in sequence and each fully connected layer containing 256 neurons; the input is a given state and its corresponding action, and the output is an evaluation value for this situation.
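For concreteness, a Q network matching this description might be defined as follows (a sketch: the layer sizes follow the figures' description, while the ReLU activation and the single scalar output head are assumptions).

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """One Critic branch: maps a (state, action) pair to a scalar evaluation value."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),                      # evaluation value for this situation
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

# The Critic described above would use two such current Q networks plus a target Q network:
# q1, q2, q_target_net = QNetwork(s_dim, a_dim), QNetwork(s_dim, a_dim), QNetwork(s_dim, a_dim)
```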
In a possible implementation manner, updating the Actor-Critic policy model through the data in the model buffer pool further includes:
calculating a second loss function through a third state action function and the action entropy value and updating the Actor network; the third state action function is the evaluation value of the Critic network when an action value is selected through the Actor network policy under the corresponding state data; the action entropy value is the entropy value of the action value selected by the Actor network policy.
In the embodiment of the application, the loss function of the Actor network consists of two parts: the first part is the evaluation value of the Critic network when a certain action is selected by the strategy in a certain state, i.e. the state action value function, and the second part is the entropy of the action selected by the strategy, so that the Actor network can be updated more accurately.
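A minimal sketch of an Actor loss built from these two parts is shown below (PyTorch). The entropy weight, the sign convention and the use of the minimum of the two current Q networks are assumptions of this sketch; the patent only states that the loss consists of a state action value term and an action entropy term.

```python
import torch

def actor_loss(actor_dist, q1, q2, state, alpha=0.2):
    """Second loss function for the Actor: Critic evaluation plus action entropy.

    actor_dist(state) is assumed to return a torch.distributions.Normal over raw
    actions; alpha is an assumed weight on the entropy term.
    """
    dist = actor_dist(state)
    raw_action = dist.rsample()
    action = torch.tanh(raw_action)                  # action selected by the current policy
    log_prob = dist.log_prob(raw_action).sum(-1)     # tanh correction term omitted in this sketch

    # third state action function: Critic evaluation of the selected action
    q_value = torch.min(q1(state, action), q2(state, action)).squeeze(-1)

    # maximizing Q plus entropy is equivalent to minimizing (alpha * log_prob - Q)
    return (alpha * log_prob - q_value).mean()
```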
In a possible implementation manner, the generating prediction data by performing multi-step interaction trajectory prediction on the environmental state transition model and the Actor-Critic policy model by using the data in the environmental buffer pool includes:
randomly sampling a preset number of fifth state data from the environment buffer pool;
inputting the fifth state data into an Actor network of the Actor-Critic policy model, and acquiring fourth action data output by the Actor network;
inputting the fifth state data and the fourth action data into the environmental state transition model to obtain state data at the next moment as sixth state data, and taking a return value at the next moment as a third return value;
taking the sixth state data as fifth state data and cyclically acquiring the state data and the return value until a preset condition is met;
and randomly selecting a preset amount of data generated in the cyclic process as the prediction data.
In the embodiment of the application, after the fifth state data is input into the Actor network as the current state data, the current decision action generated for this state data, i.e. the fourth action data, can be acquired. Since the environmental state transition model simulates the environment, after the fifth state data and the fourth action data are input into the environmental state transition model, they can be run in the simulated environment to generate the state data at the next moment, i.e. the sixth state data, and the return value produced by the fourth action data, i.e. the third return value. In the embodiment of the present application, multi-step interaction trajectory prediction means that the newly obtained sixth state data is fed back into the Actor network as the current state data, starting a loop; each loop iteration generates a sample pair consisting of the current state, the current action, the return value and the next-moment state. The preset condition for ending the loop may be reaching a predetermined number of loop iterations, or a threshold set on some variable; this is not limited in the embodiment of the application.
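The loop described above can be sketched as follows (PyTorch). The interface `env_model(state, action) -> (next_state, reward)`, the rollout horizon and the batch size are assumptions of this sketch.

```python
import random
import numpy as np
import torch

def rollout_model(env_model, actor, env_buffer, model_buffer, horizon=5, batch=256):
    """Multi-step interaction trajectory prediction inside the learned model."""
    # fifth state data: starting states sampled at random from the environment buffer pool
    start = random.sample(env_buffer, batch)
    states = torch.as_tensor(np.stack([s for (s, *_rest) in start]), dtype=torch.float32)

    for _ in range(horizon):                                   # preset loop condition
        with torch.no_grad():
            actions = actor(states)                            # fourth action data
            next_states, rewards = env_model(states, actions)  # sixth state data, third return value
        for s, a, r, s2 in zip(states, actions, rewards, next_states):
            model_buffer.append((s.numpy(), a.numpy(), float(r), s2.numpy()))
        states = next_states                                   # sixth state becomes the new fifth state
```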
For example, please refer to fig. 4. The environmental state transition model of this embodiment is an integrated (ensemble) model composed of a plurality of neural network models; each neural network model consists of 4 fully connected layers with 3 activation functions arranged in sequence, each fully connected layer has 300 hidden units, and the activation function is swish. The input of the model is a randomly given state and the action taken in that state, and the output is the state at the next moment and the return value at that moment.
In one possible implementation, the environmental state transition model includes a plurality of mutually independent submodels; each submodel is trained through the same neural network model and sample, and the initial value of each submodel in training is different.
When the method and device are implemented, a plurality of neural network models are combined into an integrated environmental state transition model, which has the capability of capturing the uncertainty in the dynamic transitions of the real environment. The mean and variance output by each independent sub-model differ, so more trajectory data covering different scenarios can be generated when the model interacts with the unmanned equipment, which benefits the learning of the policy network. The loss function of the environmental state transition model is a unique design. The model is an ensemble model, each member of which comprises fully connected layers and activation layers arranged in sequence and predicts the state and return value at the next moment. The mean and variance of the next-moment state and return value output by the last fully connected layer of the environmental state transition model follow a Gaussian distribution, which increases the adaptability of the model to complex environments.
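One ensemble member consistent with this description might look as follows (PyTorch sketch: the layer sizes and the swish activation follow fig. 4, while predicting a diagonal Gaussian over the concatenated next state and return value is an assumption about the output parameterization).

```python
import torch
import torch.nn as nn

class DynamicsMember(nn.Module):
    """One sub-model of the ensemble: predicts a Gaussian over (next state, return value)."""
    def __init__(self, state_dim, action_dim, hidden=300):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.SiLU(),   # SiLU is the swish activation
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
        )
        self.head = nn.Linear(hidden, 2 * (state_dim + 1))          # mean and log-variance

    def forward(self, state, action):
        h = self.body(torch.cat([state, action], dim=-1))
        mean, log_var = self.head(h).chunk(2, dim=-1)
        return mean, log_var            # Gaussian parameters for [next_state, reward]

# The ensemble is several members trained on the same data with different random
# initializations, e.g.: ensemble = [DynamicsMember(s_dim, a_dim) for _ in range(7)]
```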
In one possible implementation, the environmental state transition model includes a third loss function and a fourth loss function; the third loss function is generated based on how the current state data and the action data are transferred to the state data at the next moment and the generated return value; the fourth loss function is generated based on a difference between the real environment and a simulated environment produced by the environmental state transition model.
In the embodiment of the application, the environmental state transition model uses a uniquely designed network loss function whose design is derived from theoretical analysis. The design of the loss function considers not only how the current state and action transition to the state and return value at the next moment, but also the difference between the real environment and the simulated environment; a loss function designed from these two parts is able to capture the uncertainty of the environment.
In one possible implementation, the fourth loss function is generated based on the following equation:
in the formula,for the performance of the policy of the drone on the real environment,is the performance of the same strategy of the unmanned equipment on the simulated environment, s is sampled state data, a is sampled action data, gamma represents a discount factor in reinforcement learning, | r | n max The representation environment gives the maximum value of the absolute value of the return value,is the difference between the real environment and the simulated environment;
wherein p is state data and motion data in the real environmentThe p-norm of (a) is,is p-norm, D, of state data and motion data in the simulation environment TV For the TV distance, s' is the state at the next moment.
In the embodiment of the application, this establishes, on the simulation environment, a lower bound on the performance of the unmanned equipment control strategy in the real environment. The performance lower bound still holds in model-based policy iteration and also guarantees monotonic convergence in model-based policy iteration, which is realized by the following formula:
It is worth noting that this performance lower bound characterizes the optimization of performance in the real environment independently of the agent's own policy.
In one possible implementation, when the difference between the real environment and the simulated environment is measured by KL divergence, the network loss function used by the environmental state transition model is implemented by the following equation:
in the formula s n For state data sampled from the environmental buffer pool, a n For action data sampled from the context buffer pool,for the next moment state predicted by the environmental state transition model given the state data and the action data,to correspond toThe value of (a) is a hyperparameter,the average value of the environmental state transition model at the last update,is corresponding toVariance of (d), μ θ For the latest mean, σ, of the model of the environmental state transitions θ To correspond to mu θ Variance of D kl Is KL divergence, N is normal distribution,is a maximum likelihood estimation function.
In the embodiment of the application, for the control of the unmanned robot, the inventor finds that when the difference between the real environment and the simulated environment is measured by the KL divergence and the network is trained with the interaction data between the unmanned robot and the real environment, training can be stopped once the training error falls below a threshold, and a good result can be obtained. Here the predicted return value corresponds to the return obtained in the process of executing a_n in state s_n and transitioning to the predicted next state; given a state and an action, the model predicts the state and return value at the next moment. When the environmental state transition model is updated, the next-moment state and return value of the real environment are sampled from the environment buffer pool and used as the update target of the environmental state transition model. The network loss function ensures that the environmental state transition model can fit the trajectory data in the real environment.
Specifically, when the difference between the real environment and the simulated environment is measured using the KL divergence, it can be characterized as:
the optimization of the performance of the unmanned robot's strategy on the real environment then becomes an unconstrained optimization problem, namely:
where the weighting coefficient is an adjustable hyperparameter, from which the network loss function used by the environmental state transition model is further derived.
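The exact formula in the patent is rendered as an image and is not reproduced here. As a hedged sketch, a loss of this kind can be built by combining a Gaussian maximum-likelihood term with a KL penalty between the previous and the current model distributions, which matches the symbols described above but is an assumption, not the patented formula:

```python
import torch
from torch.distributions import Normal, kl_divergence

def model_loss_kl(mu_theta, sigma_theta, mu_old, sigma_old, target, alpha=0.01):
    """Gaussian negative log-likelihood plus a KL penalty toward the previous model.

    mu_theta / sigma_theta : latest predicted mean and standard deviation for [next state, return]
    mu_old / sigma_old     : mean and standard deviation of the model at the last update
    target                 : real next state and return value sampled from the environment buffer pool
    alpha                  : hyperparameter weighting the KL term
    """
    new_dist = Normal(mu_theta, sigma_theta)
    old_dist = Normal(mu_old.detach(), sigma_old.detach())

    nll = -new_dist.log_prob(target).sum(-1).mean()        # maximum-likelihood term
    kl = kl_divergence(old_dist, new_dist).sum(-1).mean()  # D_KL between old and new model
    return nll + alpha * kl
```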
In one possible implementation, when the difference between the real environment and the simulated environment is measured by a p-norm, the network loss function used by the environmental state transition model is implemented by:
In the formula, s_n is the state data sampled from the environment buffer pool and a_n is the action data sampled from the environment buffer pool; the next-moment state predicted by the environmental state transition model given the state data and the action data, together with the corresponding predicted return value, also appears in the formula; α is a hyperparameter; the mean of the environmental state transition model at the last update and its corresponding variance appear as in the previous formula; μ_θ is the latest mean of the environmental state transition model; the norm used is the 2-norm; and the remaining term is a maximum likelihood estimation function.
When the embodiment of the application is implemented, for unmanned equipment control, the inventor finds that when the difference between the real environment and the simulated environment is measured by a p-norm and the network is trained with the interaction data between the unmanned equipment and the real environment, training can be stopped once the training error falls below a threshold, and a good result can be obtained. Here the predicted return value corresponds to the return obtained in the process of executing a_n in state s_n and transitioning to the predicted next state; given a state and an action, the model predicts the state and return value at the next moment. When the environmental state transition model is updated, the next-moment state and return value of the real environment are sampled from the environment buffer pool and used as the update target of the environmental state transition model. The network loss function ensures that the environmental state transition model can fit the trajectory data in the real environment.
Specifically, by exploiting Lipschitz continuity, the difference between the real environment and the simulated environment can be converted into a p-norm; that is, a link between the dynamic model of the environment and the p-norm is constructed, characterized as:
where the two terms denote the dynamic transition model of the real environment and the dynamic transition model of the simulated environment, respectively.
When the difference between the real environment and the simulation environment is measured by using the p-norm, the corresponding formula is as follows:
the performance of the strategy of the unmanned aerial vehicle on a real environment is changed into an unconstrained optimization problem, and an optimization formula when the p-norm is used for measurement is as follows:
wherein,and deriving a network loss function used by the environment state transition model for an adjustable hyper-parameter.
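Analogously, and again as an assumption about the unreproduced formula, a sketch for the p-norm case can replace the KL penalty with a squared 2-norm between the previous and current predicted means:

```python
import torch
from torch.distributions import Normal

def model_loss_pnorm(mu_theta, sigma_theta, mu_old, target, alpha=0.01):
    """Gaussian negative log-likelihood plus a 2-norm penalty toward the previous mean."""
    nll = -Normal(mu_theta, sigma_theta).log_prob(target).sum(-1).mean()
    penalty = (mu_theta - mu_old.detach()).norm(p=2, dim=-1).pow(2).mean()
    return nll + alpha * penalty
```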
Those of ordinary skill in the art will appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be embodied in electronic hardware, computer software, or combinations of both, and that the components and steps of the examples have been described in a functional general in the foregoing description for the purpose of illustrating clearly the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electric, mechanical or other form of connection.
The elements described as separate components may or may not be physically separate.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention essentially or partially contributes to the prior art, or all or part of the technical solution can be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a grid device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk, and various media capable of storing program codes.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (10)
1. The model-based unmanned equipment control method for high-sample-rate deep reinforcement learning is characterized by comprising the following steps of:
controlling the unmanned equipment to use a strategy in an Actor-Critic strategy model to interact with a real environment to obtain track data, and storing the track data into an environment buffer pool;
updating an environment state transition model through the data in the environment buffer pool;
performing multi-step interactive trajectory prediction on the environmental state transition model and the Actor-Critic strategy model by using the data in the environmental buffer pool to generate prediction data, and storing the prediction data into a model buffer pool;
updating the Actor-Critic strategy model through data in the model buffer pool;
and continuously and iteratively updating the environment state transition model and the Actor-Critic strategy model until the current strategy performance meets the expected requirement.
2. The model-based unmanned aerial vehicle control method for high-sample-rate deep reinforcement learning according to claim 1, wherein controlling the unmanned aerial vehicle to interact with the real environment by using the strategy in the Actor-Critic strategy model to acquire trajectory data comprises:
inputting current state data of the unmanned equipment in the real environment as first state data into an Actor network of the Actor-Critic policy model, and receiving an action value output by the Actor network as first action data; the first action data is obtained by the Actor network sampling from the multidimensional Gaussian distribution whose mean and variance are output by the last fully connected layer of the Actor network;
controlling the unmanned equipment to operate in the real environment by the first action data, and acquiring state data of the unmanned equipment at the next moment as second state data and a return value at the next moment as a first return value;
and taking the first state data, the first action data, the second state data and the first return value as the track data.
3. The model-based unmanned device control method for high-sample-rate deep reinforcement learning according to claim 1, wherein updating the Actor-Critic policy model with data in the model buffer pool comprises:
sampling a fixed amount of third state data, second action data, second return values and fourth state data from the model buffer pool; the fourth state data is state data at the next moment of the third state data;
inputting the fourth state data into an Actor network of the Actor-Critic policy model to acquire third action data;
inputting the third state data and the second action data into a Critic network of the Actor-Critic policy model, and acquiring a first state action function output by the Critic network; the first state action function is the minimum value of the state action functions respectively output by at least two current Q networks in the Critic network;
inputting the third action data and the fourth state data into a target Q network in the Critic network to obtain a state action function at the next moment as a second state action function;
and calculating a first loss function through the second return value, the first state action function and the second state action function, and updating the Critic network.
4. The model-based unmanned aerial vehicle control method for high sample rate deep reinforcement learning according to claim 3, wherein updating the Actor-Critic policy model with data in the model buffer pool further comprises:
calculating a second loss function through a third state action function and the action entropy value and updating the Actor network; the third state action function is the evaluation value of the Critic network when an action value is selected through the Actor network policy under the corresponding state data; the action entropy value is the entropy value of the action value selected by the Actor network policy.
5. The model-based unmanned aerial vehicle control method for high-sample-rate deep reinforcement learning according to claim 1, wherein performing multi-step interaction trajectory prediction on the environmental state transition model and the Actor-Critic policy model using the data in the environmental buffer pool to generate prediction data comprises:
randomly sampling a preset number of fifth state data from the environment buffer pool;
inputting the fifth state data into an Actor network of the Actor-Critic policy model, and acquiring fourth action data output by the Actor network;
inputting the fifth state data and the fourth action data into the environment state transition model to obtain state data at the next moment as sixth state data, and taking a return value at the next moment as a third return value;
taking the sixth state data as fifth state data and cyclically acquiring the state data and the return value until a preset condition is met;
and randomly selecting a preset amount of data generated in the cyclic process as the prediction data.
6. The model-based unmanned aerial vehicle control method for high sample rate deep reinforcement learning of claim 1, wherein the environmental state transition model comprises a plurality of mutually independent sub-models; each sub-model is trained through the same neural network model and sample, and the initial value of each sub-model in training is different.
7. The model-based unmanned aerial vehicle control method for high sample rate deep reinforcement learning of claim 1, wherein the environmental state transition model comprises a third loss function and a fourth loss function; the third loss function is generated based on how the current state data and the action data are transferred to the state data at the next moment and the generated return value; the fourth loss function is generated based on a difference between the real environment and a simulated environment produced by the environmental state transition model.
8. The model-based high sample rate deep reinforcement learning unmanned device control method of claim 7, wherein the fourth loss function is generated based on the following equation:
In the formula, the first term is the performance of the policy of the unmanned device on the real environment and the second term is the performance of the same policy on the simulated environment; γ represents the discount factor in reinforcement learning, |r|_max represents the maximum absolute value of the return value given by the environment, and the remaining term is the difference between the real environment and the simulated environment;
9. The model-based unmanned equipment control method for high sample rate deep reinforcement learning of claim 8, wherein, when the difference between the real environment and the simulated environment is measured by the KL divergence, the network loss function used by the environmental state transition model is implemented by the following equation:
In the formula, s_n is the state data sampled from the environment buffer pool; a_n is the action data sampled from the environment buffer pool; the environmental state transition model predicts the state at the next moment, together with its corresponding variance, given the state data and the action data; α is a hyper-parameter; the mean of the environmental state transition model at its last update and the corresponding variance are retained; μ_θ is the latest mean of the environmental state transition model; σ_θ is the variance corresponding to μ_θ; D_KL is the KL divergence; N denotes a normal distribution; and the remaining term is a maximum likelihood estimation function.
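The equation is likewise missing from the extracted text. A sketch consistent with the symbol definitions above, assuming the loss combines a negative log-likelihood (maximum likelihood) term with a KL penalty between the previous and current Gaussian predictions; the names μ_old and σ_old for the retained quantities are illustrative:

```latex
\mathcal{L}(\theta) \;=\;
-\log \mathcal{N}\!\left(s_{n+1} \,\middle|\, \mu_{\theta}(s_n, a_n),\, \sigma_{\theta}^{2}(s_n, a_n)\right)
\;+\; \alpha\, D_{\mathrm{KL}}\!\left(
\mathcal{N}\!\left(\mu_{\mathrm{old}}(s_n, a_n),\, \sigma_{\mathrm{old}}^{2}(s_n, a_n)\right)
\,\middle\|\,
\mathcal{N}\!\left(\mu_{\theta}(s_n, a_n),\, \sigma_{\theta}^{2}(s_n, a_n)\right)
\right)
```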
10. The model-based unmanned equipment control method for high sample rate deep reinforcement learning of claim 8, wherein, when the difference between the real environment and the simulated environment is measured by a p-norm, the network loss function used by the environmental state transition model is implemented by the following formula:
In the formula, s_n is the state data sampled from the environment buffer pool; a_n is the action data sampled from the environment buffer pool; the environmental state transition model predicts the state at the next moment, together with its corresponding variance, given the state data and the action data; α is a hyper-parameter; the mean of the environmental state transition model at its last update and the corresponding variance are retained; μ_θ is the latest mean of the environmental state transition model; the penalty term uses the 2-norm; and the remaining term is a maximum likelihood estimation function.
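As with claim 9, the formula is lost in extraction. A sketch consistent with the symbol definitions, assuming the KL penalty is replaced by a 2-norm penalty on the change of the predicted mean; the name μ_old is illustrative:

```latex
\mathcal{L}(\theta) \;=\;
-\log \mathcal{N}\!\left(s_{n+1} \,\middle|\, \mu_{\theta}(s_n, a_n),\, \sigma_{\theta}^{2}(s_n, a_n)\right)
\;+\; \alpha \left\lVert \mu_{\theta}(s_n, a_n) - \mu_{\mathrm{old}}(s_n, a_n) \right\rVert_{2}
```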
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210963402.6A CN115293334B (en) | 2022-08-11 | 2022-08-11 | Model-based unmanned equipment control method for high-sample-rate deep reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115293334A true CN115293334A (en) | 2022-11-04 |
CN115293334B CN115293334B (en) | 2024-09-27 |
Family
ID=83827894
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210963402.6A Active CN115293334B (en) | 2022-08-11 | 2022-08-11 | Model-based unmanned equipment control method for high-sample-rate deep reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115293334B (en) |
Patent Citations (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106094813A (en) * | 2016-05-26 | 2016-11-09 | 华南理工大学 | Humanoid robot gait control method based on model-correlated reinforcement learning |
WO2018206504A1 (en) * | 2017-05-10 | 2018-11-15 | Telefonaktiebolaget Lm Ericsson (Publ) | Pre-training system for self-learning agent in virtualized environment |
DE102018216561A1 (en) * | 2018-09-27 | 2020-04-02 | Robert Bosch Gmbh | Method, device and computer program for determining an agent's strategy |
KR20200062887A (en) * | 2018-11-27 | 2020-06-04 | 한국전자통신연구원 | Apparatus and method for assuring quality of control operations of a system based on reinforcement learning. |
KR20200105365A (en) * | 2019-06-05 | 2020-09-07 | 아이덴티파이 주식회사 | Method for reinforcement learning using virtual environment generated by deep learning |
WO2021058626A1 (en) * | 2019-09-25 | 2021-04-01 | Deepmind Technologies Limited | Controlling agents using causally correct environment models |
CN110717600A (en) * | 2019-09-30 | 2020-01-21 | 京东城市(北京)数字科技有限公司 | Sample pool construction method and device, and algorithm training method and device |
WO2021123235A1 (en) * | 2019-12-19 | 2021-06-24 | Secondmind Limited | Reinforcement learning system |
CN111460650A (en) * | 2020-03-31 | 2020-07-28 | 北京航空航天大学 | Unmanned aerial vehicle end-to-end control method based on deep reinforcement learning |
CN111461347A (en) * | 2020-04-02 | 2020-07-28 | 中国科学技术大学 | Reinforced learning method for optimizing experience playback sampling strategy |
CN111582441A (en) * | 2020-04-16 | 2020-08-25 | 清华大学 | High-efficiency value function iteration reinforcement learning method of shared cyclic neural network |
WO2021208771A1 (en) * | 2020-04-18 | 2021-10-21 | 华为技术有限公司 | Reinforced learning method and device |
CN112183288A (en) * | 2020-09-22 | 2021-01-05 | 上海交通大学 | Multi-agent reinforcement learning method based on model |
CN112460741A (en) * | 2020-11-23 | 2021-03-09 | 香港中文大学(深圳) | Control method of building heating, ventilation and air conditioning system |
CN112363402A (en) * | 2020-12-21 | 2021-02-12 | 杭州未名信科科技有限公司 | Gait training method and device of foot type robot based on model-related reinforcement learning, electronic equipment and medium |
CN112766497A (en) * | 2021-01-29 | 2021-05-07 | 北京字节跳动网络技术有限公司 | Deep reinforcement learning model training method, device, medium and equipment |
CN113281999A (en) * | 2021-04-23 | 2021-08-20 | 南京大学 | Unmanned aerial vehicle autonomous flight training method based on reinforcement learning and transfer learning |
CN113359704A (en) * | 2021-05-13 | 2021-09-07 | 浙江工业大学 | Self-adaptive SAC-PID method suitable for complex unknown environment |
CN113510704A (en) * | 2021-06-25 | 2021-10-19 | 青岛博晟优控智能科技有限公司 | Industrial mechanical arm motion planning method based on reinforcement learning algorithm |
CN113419424A (en) * | 2021-07-05 | 2021-09-21 | 清华大学深圳国际研究生院 | Modeling reinforcement learning robot control method and system capable of reducing over-estimation |
CN113485107A (en) * | 2021-07-05 | 2021-10-08 | 清华大学深圳国际研究生院 | Reinforcement learning robot control method and system based on consistency constraint modeling |
CN113467515A (en) * | 2021-07-22 | 2021-10-01 | 南京大学 | Unmanned aerial vehicle flight control method based on virtual environment simulation reconstruction and reinforcement learning |
CN113947022A (en) * | 2021-10-20 | 2022-01-18 | 哈尔滨工业大学(深圳) | Near-end strategy optimization method based on model |
CN114186496A (en) * | 2021-12-15 | 2022-03-15 | 中国科学技术大学 | Method for improving continuous control stability of intelligent agent |
CN114879486A (en) * | 2022-02-28 | 2022-08-09 | 复旦大学 | Robot optimization control method based on reinforcement learning and evolution algorithm |
CN114800515A (en) * | 2022-05-12 | 2022-07-29 | 四川大学 | Robot assembly motion planning method based on demonstration track |
CN114859921A (en) * | 2022-05-12 | 2022-08-05 | 鹏城实验室 | Automatic driving optimization method based on reinforcement learning and related equipment |
Non-Patent Citations (4)
Title |
---|
AZULAY, OSHER ET AL.: "Wheel Loader Scooping Controller Using Deep Reinforcement Learning", IEEE ACCESS, vol. 9, 2 February 2020 (2020-02-02), pages 24145-24154, XP011836475, DOI: 10.1109/ACCESS.2021.3056625 * |
JIE LENG ET AL.: "M-A3C: A Mean-Asynchronous Advantage Actor-Critic Reinforcement Learning Method for Real-Time Gait Planning of Biped Robot", IEEE ACCESS, vol. 10, 20 May 2021 (2021-05-20), pages 76523-76536 * |
LIU QUAN ET AL.: "A Survey of Deep Reinforcement Learning" (深度强化学习综述), CHINESE JOURNAL OF COMPUTERS (计算机学报), vol. 40, 31 December 2017 (2017-12-31), pages 1-28 * |
YANG ZHIYOU: "Research on Optimization Strategies in Reinforcement Learning" (强化学习中的优化策略研究), CHINA MASTER'S THESES FULL-TEXT DATABASE, INFORMATION SCIENCE AND TECHNOLOGY EDITION (中国优秀硕士学位论文全文数据库 信息科技辑), no. 1, 15 January 2022 (2022-01-15), pages 140-539 * |
Also Published As
Publication number | Publication date |
---|---|
CN115293334B (en) | 2024-09-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Groshev et al. | Learning generalized reactive policies using deep neural networks | |
Bianchi et al. | Accelerating autonomous learning by using heuristic selection of actions | |
CN114139637B (en) | Multi-agent information fusion method and device, electronic equipment and readable storage medium | |
CN110991027A (en) | Robot simulation learning method based on virtual scene training | |
US20210158162A1 (en) | Training reinforcement learning agents to learn farsighted behaviors by predicting in latent space | |
CN112119409A (en) | Neural network with relational memory | |
CN112884131A (en) | Deep reinforcement learning strategy optimization defense method and device based on simulation learning | |
CN114261400B (en) | Automatic driving decision method, device, equipment and storage medium | |
US20220366246A1 (en) | Controlling agents using causally correct environment models | |
Andersen et al. | Active exploration for learning symbolic representations | |
CN117077727B (en) | Track prediction method based on space-time attention mechanism and neural ordinary differential equation | |
Kojima et al. | To learn or not to learn: Analyzing the role of learning for navigation in virtual environments | |
CN114239974B (en) | Multi-agent position prediction method and device, electronic equipment and storage medium | |
CN116166642A (en) | Spatio-temporal data filling method, system, equipment and medium based on guide information | |
CN113110101B (en) | Production line mobile robot gathering type recovery and warehousing simulation method and system | |
Ge et al. | Deep reinforcement learning navigation via decision transformer in autonomous driving | |
CN114219066A (en) | Unsupervised reinforcement learning method and unsupervised reinforcement learning device based on Watherstein distance | |
CN115293334A (en) | Model-based unmanned equipment control method for high sample rate deep reinforcement learning | |
CN117933055A (en) | Equipment residual service life prediction method based on reinforcement learning integrated framework | |
Feng et al. | Mobile robot obstacle avoidance based on deep reinforcement learning | |
CN116360435A (en) | Training method and system for multi-agent collaborative strategy based on plot memory | |
Gross et al. | Probabilistic model checking of stochastic reinforcement learning policies | |
Han et al. | Three‐dimensional obstacle avoidance for UAV based on reinforcement learning and RealSense | |
Lauttia | Adaptive Monte Carlo Localization in ROS | |
CN114970714B (en) | Track prediction method and system considering uncertain behavior mode of moving target |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |