Nothing Special   »   [go: up one dir, main page]

CN115293334A - Model-based unmanned equipment control method for high sample rate deep reinforcement learning - Google Patents

Model-based unmanned equipment control method for high sample rate deep reinforcement learning Download PDF

Info

Publication number
CN115293334A
CN115293334A CN202210963402.6A CN202210963402A CN115293334A CN 115293334 A CN115293334 A CN 115293334A CN 202210963402 A CN202210963402 A CN 202210963402A CN 115293334 A CN115293334 A CN 115293334A
Authority
CN
China
Prior art keywords
data
model
state
action
actor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210963402.6A
Other languages
Chinese (zh)
Other versions
CN115293334B (en
Inventor
杨智友
屈鸿
符明晟
李凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202210963402.6A priority Critical patent/CN115293334B/en
Publication of CN115293334A publication Critical patent/CN115293334A/en
Application granted granted Critical
Publication of CN115293334B publication Critical patent/CN115293334B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses a model-based unmanned equipment control method for high-sample-rate deep reinforcement learning, which comprises the following steps of: acquiring track data and storing the track data into an environment buffer pool; updating the environmental state transition model; predicting a multi-step interaction track to generate prediction data, and storing the prediction data into a model buffer pool; updating the Actor-Critic strategy model; and continuously and iteratively updating the environment state transition model and the Actor-Critic strategy model until the current strategy performance meets the expected requirement. The invention adopts a deep reinforcement learning method based on the model to construct an environmental state transition model to simulate the interaction between the unmanned equipment and the external environment, thereby sharply reducing the interaction times between the unmanned equipment and the real environment, and effectively optimizing the advancing control strategy of the unmanned equipment by the data generated by the environmental state transition model, so that the control of the unmanned equipment becomes efficient.

Description

Model-based unmanned equipment control method for high sample rate deep reinforcement learning
Technical Field
The invention relates to an unmanned equipment control technology, in particular to an unmanned equipment control method based on model high sample rate deep reinforcement learning.
Background
At present, the control of the advance of the unmanned equipment is mainly developed based on the traditional control technology, but the traditional control technology has the problems of single planning of the advance line of the unmanned equipment, inflexible line planning, lack of coping strategies in complex scenes and the like. With the rapid development of deep learning technology and reinforcement learning, the strong feature learning capability of a deep neural network is utilized, the relevant advancing features can be learned from a large amount of unmanned equipment interaction data, and the obstacle avoidance in the advancing process of the unmanned equipment can be realized by combining the modeling of reinforcement learning on the advancing problem of the unmanned equipment, but the problem of low data sample efficiency still exists.
Disclosure of Invention
In order to overcome at least the above-mentioned deficiencies in the prior art, it is an object of the present application to provide a model-based high sample rate deep reinforcement learning unmanned device control method.
The embodiment of the application provides a model-based unmanned equipment control method for high-sample-rate deep reinforcement learning, which comprises the following steps:
controlling the unmanned equipment to use a strategy in an Actor-Critic strategy model to interact with a real environment to obtain track data, and storing the track data into an environment buffer pool;
updating the environmental state transition model through the data in the environmental buffer pool;
using the data in the environment buffer pool to carry out multi-step interactive trajectory prediction on the environment state transition model and the Actor-Critic strategy model to generate prediction data, and storing the prediction data in a model buffer pool;
updating the Actor-Critic strategy model through data in the model buffer pool;
and continuously and iteratively updating the environment state transition model and the Actor-Critic strategy model until the current strategy performance meets the expected requirement.
In the prior art, the time cost required by the operation test of the unmanned equipment in the real environment is very high, and meanwhile, the problems of equipment collision and the like easily occur when the unmanned equipment operates in the real environment for a long time, so that a large number of sample sources required by a model for training the operation control of the unmanned equipment are difficult to guarantee. When the method and the device are implemented, an Actor-Critic strategy model and an environment state transition model need to be constructed first, wherein the Actor-Critic strategy model is used for providing a strategy for the unmanned equipment to operate in the environment, the environment state transition model is used for providing a simulation environment for the unmanned equipment to operate, and a large number of samples for training the unmanned equipment operation control model can be generated through interaction between the Actor-Critic strategy model and the environment state transition model, so that the accuracy of the trained model is improved.
In the embodiment of the application, each iteration updating of the model needs to control the unmanned equipment to operate for a short time in a real environment by a strategy in an Actor-critical strategy model, and the generated track data can update the environment state transition model; and then, multi-step interactive trajectory prediction is carried out on the environment state transition model and the Actor-Critic strategy model to generate a large amount of prediction data to serve as training samples to update the Actor-Critic strategy model, so that the Actor-Critic strategy model can become a mature model for providing the unmanned equipment control strategy after a certain number of iterations, and subsequent tests can be carried out. The embodiment of the application can provide a high-precision simulation environment for unmanned equipment interaction, the simulation environment is learned through a real historical interaction track, and the accuracy of the environment is directly influenced by a loss function of an environment state transition model in the learning process. The environmental state transition model of the embodiment of the application corresponds to a simulation environment, and provides interactive safety and high efficiency for the unmanned equipment.
In one possible implementation manner, the controlling the unmanned aerial device to acquire the trajectory data by using the strategy in the Actor-Critic strategy model to interact with the real environment comprises the following steps:
inputting current state data of the unmanned equipment in the real environment as first state data into an Actor network of the Actor-criticic policy model, and receiving an action value output by the Actor network as first action data; the first action data is obtained by sampling the mean value and the variance of multidimensional Gaussian distribution output by the last full-connection layer of the Actor network by the Actor network;
controlling the unmanned equipment to operate in the real environment by the first action data, and acquiring state data of the unmanned equipment at the next moment as second state data and a return value at the next moment as a first return value;
and taking the first state data, the first action data, the second state data and the first return value as the track data.
In a possible implementation manner, the updating the Actor-critical policy model by the data in the model buffer pool includes:
sampling a fixed amount of third state data, second action data, second return values and fourth state data from the model buffer pool; the fourth state data is state data at the next moment of the third state data;
inputting the fourth state data into an Actor network of the Actor-Critic policy model to acquire third action data;
inputting the third state data and the second action data into a criticic network of the Actor-criticic policy model, and acquiring a first state action function output by the criticic network; the first state action function is the minimum value of state action functions respectively output by at least two current Q networks in the Critic network;
inputting the third action data and the fourth state data into a target Q network in the Critic network to obtain a state action function at the next moment as a second state action function;
and calculating a first loss function through the second return value, the first state action function and the second state action function, and updating the criticic network.
In a possible implementation manner, updating the Actor-critical policy model through the data in the model buffer pool further includes:
calculating a second loss function through a third state action function and the action entropy value and updating the Actor network; the third state action function is an evaluation value of the criticic network when an action value is selected through the Actor network policy under corresponding state data; the action entropy value is an entropy value of an action value selected by the Actor network policy.
In a possible implementation manner, the generating prediction data by performing multi-step interaction trajectory prediction on the environmental state transition model and the Actor-Critic policy model by using the data in the environmental buffer pool includes:
randomly sampling a preset number of fifth state data from the environment buffer pool;
inputting the fifth state data into an Actor network of the Actor-Critic policy model, and acquiring fourth action data output by the Actor network;
inputting the fifth state data and the fourth action data into the environmental state transition model to obtain state data at the next moment as sixth state data, and taking a return value at the next moment as a third return value;
taking the sixth state data as fifth state data and circularly acquiring the state data and the return value to preset conditions;
and randomly selecting a preset amount of data generated in the cyclic process as the prediction data.
In one possible implementation, the environmental state transition model includes a plurality of mutually independent submodels; each sub-model is trained through the same neural network model and sample, and the initial value of each sub-model in training is different.
In one possible implementation, the environmental state transition model includes a third loss function and a fourth loss function; the third loss function is generated based on how the current state data and the action data are transferred to the state data at the next moment and the generated return value; the fourth loss function is generated based on a difference between the real environment and a simulated environment produced by the environmental state transition model.
In one possible implementation, the fourth loss function is generated based on the following equation:
Figure BDA0003793739290000051
in the formula,
Figure BDA0003793739290000052
for the performance of the policy of the drone on the real environment,
Figure BDA0003793739290000053
performance of the same policy of the unmanned device on the simulated environment, s is sampled state data, a is sampled action data, gamma represents a discount factor in reinforcement learning, | r | y max The representation environment gives the maximum value of the absolute value of the return value,
Figure BDA0003793739290000054
is the difference between the real environment and the simulated environment;
wherein,
Figure BDA0003793739290000055
calculated according to the following formula:
Figure BDA0003793739290000056
wherein p is a p-norm of state data and motion data in the real environment,
Figure BDA0003793739290000057
is p-norm, D, of state data and motion data in the simulation environment TV For the TV distance, s' is the state at the next moment.
In one possible implementation, when the difference between the real environment and the simulated environment is measured by a KL divergence, the network loss function used by the environmental state transition model is implemented by:
Figure BDA0003793739290000058
in the formula, s n For state data sampled from the environmental buffer pool, a n For action data sampled from the context buffer pool,
Figure BDA0003793739290000059
for the next moment state predicted by the environmental state transition model given the state data and the action data,
Figure BDA00037937392900000510
to correspond to
Figure BDA00037937392900000511
α is a hyperparameter, μ θold The average value of the environmental state transition model at the last update,
Figure BDA00037937392900000512
is corresponding to
Figure BDA00037937392900000513
Variance of (d), μ θ For the latest mean, σ, of the model of the environmental state transitions θ To correspond to mu θ Variance of (D), D kl Is KL divergence, N is normal distribution,
Figure BDA00037937392900000514
is a maximum likelihood estimation function.
In one possible implementation, when the difference between the real environment and the simulated environment is measured by a p-norm, the network loss function used by the environmental state transition model is implemented by:
Figure BDA0003793739290000061
in the formula s n Is a slave ringState data sampled in the border buffer pool, a n For action data sampled from the context buffer pool,
Figure BDA0003793739290000062
for the next moment state predicted by the environmental state transition model given the state data and the action data,
Figure BDA0003793739290000063
to correspond to
Figure BDA0003793739290000064
The value of (a) is a hyperparameter,
Figure BDA0003793739290000065
the mean value of the environmental state transition model at the last update,
Figure BDA0003793739290000066
is corresponding to
Figure BDA0003793739290000067
Variance of (d), μ θ Is the latest mean value of the environmental state transition model,
Figure BDA0003793739290000068
is a norm of 2 times,
Figure BDA0003793739290000069
is a maximum likelihood estimation function.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. the invention provides a high-accuracy environmental state transition model for providing a high-precision simulation track sample by modeling the interaction between the unmanned equipment and the external environment during the advancing control as MDP.
2. The invention solves the optimization problem of the unmanned equipment advancing control strategy by means of the Actor and Critic functions, and obtains a high-quality environmental state transfer model by a brand-new loss function optimization.
3. The invention adopts a model-based deep reinforcement learning method to construct an environmental state transfer model to simulate the interaction between the unmanned equipment and the external environment, thereby sharply reducing the interaction times between the unmanned equipment and the real environment, and the data generated by the environmental state transfer model can effectively optimize the advancing control strategy of the unmanned equipment, so that the control of the unmanned equipment becomes efficient.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. In the drawings:
FIG. 1 is a schematic diagram of the steps of an embodiment of the method of the present application;
fig. 2 is a schematic diagram of a network structure of an Actor in the embodiment of the present application;
FIG. 3 is a schematic diagram of a Critic network according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an environmental state transition model according to an embodiment of the present application.
Detailed Description
In order to make the purpose, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it should be understood that the drawings in the present application are for illustrative and descriptive purposes only and are not used to limit the scope of protection of the present application. Further, it should be understood that the schematic drawings are not drawn to scale. The flowcharts used in this application illustrate operations implemented according to some of the embodiments of the present application. It should be understood that the operations of the flow diagrams may be performed out of order, and steps without logical context may be performed in reverse order or simultaneously. In addition, one skilled in the art, under the guidance of the present disclosure, may add one or more other operations to, or remove one or more operations from, the flowchart.
In addition, the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, as presented in the figures, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
Please refer to fig. 1, which is a flowchart illustrating a method for controlling an unmanned aerial vehicle based on model-based high sample rate deep reinforcement learning according to an embodiment of the present disclosure, and further, the method for controlling an unmanned aerial vehicle based on model-based high sample rate deep reinforcement learning may specifically include the contents described in the following steps S1 to S6.
S1: controlling the unmanned equipment to use a strategy in the Actor-Critic strategy model to interact with a real environment to obtain track data, and storing the track data into an environment buffer pool;
s2: updating an environment state transition model through the data in the environment buffer pool;
s3: using the data in the environment buffer pool to carry out multi-step interactive trajectory prediction on the environment state transition model and the Actor-Critic strategy model to generate prediction data, and storing the prediction data in a model buffer pool;
s4: updating the Actor-Critic strategy model through data in the model buffer pool;
s5: and continuously and iteratively updating the environment state transition model and the Actor-Critic strategy model until the current strategy performance meets the expected requirement.
In the prior art, the time cost required by running and testing the unmanned equipment in the real environment is very high, and meanwhile, the problems of equipment collision and the like easily occur when the unmanned equipment runs in the real environment for a long time, so that the sources of a large number of samples required by training a model for running and controlling the unmanned equipment are difficult to guarantee. When the method and the device are implemented, an Actor-Critic strategy model and an environment state transition model need to be constructed first, wherein the Actor-Critic strategy model is used for providing a strategy for the unmanned equipment to operate in the environment, the environment state transition model is used for providing a simulation environment for the unmanned equipment to operate, and a large number of samples for training the unmanned equipment operation control model can be generated through interaction between the Actor-Critic strategy model and the environment state transition model, so that the accuracy of the trained model is improved.
In the embodiment of the application, each iteration of the model is updated by controlling the unmanned equipment to operate for a short time in a real environment by using a strategy in the Actor-Critic strategy model, and the generated track data can update the environment state transition model; and then, performing multi-step interactive trajectory prediction on the environmental state transition model and the Actor-Critic strategy model to generate a large amount of prediction data as a training sample to update the Actor-Critic strategy model, so that the Actor-Critic strategy model can become a mature model for providing the unmanned equipment control strategy after a certain number of iterations to perform subsequent tests. According to the method and the device, the high-precision simulation environment for unmanned equipment interaction can be provided, learning is carried out through a real historical interaction track, and the accuracy of the environment is directly influenced by a loss function of an environment state transition model in the learning process. The environmental state transition model of the embodiment of the application corresponds to a simulation environment, and provides interactive safety and high efficiency for the unmanned equipment.
In one possible implementation manner, the controlling the unmanned device to use the strategy in the Actor-critical strategy model to interact with the real environment to acquire the track data includes:
inputting current state data of the unmanned equipment in the real environment as first state data into an Actor network of the Actor-criticic policy model, and receiving an action value output by the Actor network as first action data; the first action data is obtained by sampling the mean value and the variance of multidimensional Gaussian distribution output by the last full-connection layer of the Actor network by the Actor network;
controlling the unmanned equipment to operate in the real environment by the first action data, and acquiring state data of the unmanned equipment at the next moment as second state data and a return value at the next moment as a first return value;
and taking the first state data, the first action data, the second state data and the first return value as the track data.
When the method is implemented, the Actor-Critic strategy model comprises an Actor network and a Critic network, the Actor network and the Critic network respectively comprise a full connection layer and an activation layer which are sequentially arranged, and when the Actor network outputs mean value to sum variance according to the last full connection layer, the Actor network uses a tanh function to carry out one-time nonlinear mapping on samples in Gaussian distribution, so that the final action values are ensured to be in an effective range. When the first state data is input into the Actor network, the Actor network generates first action data corresponding to the first state data, the first action data is a strategy for controlling the unmanned equipment to operate in a real environment, and the strategy is used for controlling the unmanned equipment to operate in the real environment, so that the state data of the unmanned equipment after the strategy is operated, namely the second state data, can be obtained, and meanwhile, a return value generated after the strategy is executed, namely the first return value, can also be obtained; this allows the first state data, the first motion data, the second state data, and the first reward value to form a set of pairs of samples of the operation of the drone in the real environment, the samples being stored in the environment buffer pool and used to update the simulated environment.
In a possible implementation manner, the updating the Actor-critical policy model by the data in the model buffer pool includes:
sampling a fixed amount of third state data, second action data, second return values and fourth state data from the model buffer pool; the fourth state data is state data at the next moment of the third state data;
inputting the fourth state data into an Actor network of the Actor-Critic policy model to acquire third action data;
inputting the third state data and the second action data into a criticic network of the Actor-criticic policy model, and acquiring a first state action function output by the criticic network; the first state action function is the minimum value of state action functions output by at least two current Q networks in the Critic network respectively;
inputting the third action data and the fourth state data into a target Q network in the Critic network to obtain a state action function at the next moment as a second state action function;
and calculating a first loss function through the second return value, the first state action function and the second state action function, and updating the Critic network.
In the embodiment of the application, the Critic network is divided into a target Q network and at least two current Q networks, when the action selected in the current state is evaluated, the current Q networks are used for calculation, and the networks with small Q values are selected for updating the Actor network according to the results of the evaluation. The Actor network is used for deciding what action should be taken when the unmanned equipment meets different environments, and corresponds to the unmanned equipment; the Critic network is used for evaluating the influence caused by the action selected by the Actor network, and the Critic network provided by the embodiment of the application corresponds to evaluation. Therefore, a complete model-based high-sampling depth reinforcement learning algorithm can be formed by the environment state transition model, the Actor policy network and the Critic network. The unmanned aerial vehicle walks to drive motors in each driving mechanism of the unmanned aerial vehicle, and relevant parameters of the motors of each driving mechanism are adjusted for coping when the unmanned aerial vehicle faces different states according to a model-based high-sampling depth reinforcement learning algorithm.
In the embodiment of the present application, updating the Actor-Critic policy model corresponds to performing secondary training on the policy model after supplementing data, where the third state data, the second action data, the second return value, and the fourth state data should be data in one sample pair, that is, the third state data, the second action data, the second return value, and the fourth state data should be corresponding relationships. The first state action function is the evaluation of the third state data and the second action data output by the Critic network as input data, the third state data and the second action data can be understood as current state data and current action data, and the first state action function is the evaluation of the current action data under the condition of the current state data; the second state action function is the evaluation of the Critic network output on the condition of third action data and fourth state data, the third action data is the action relative to the current next moment, the fourth state data is the state of the next moment, and the first state action function is the evaluation of the action data of the next moment on the condition of the state data of the next moment; the second return value is a return value for executing the second action data in the environment, and the first loss function can be calculated through the second return value, the first state action function and the second state action function so as to update the criticic network.
For example, please refer to fig. 2, which illustrates a structure of an Actor network in this embodiment, where the Actor network in this embodiment includes 3 fully-connected layers and a rule layer, which are sequentially arranged, each layer of the fully-connected network includes 256 neurons, an input of the fully-connected network is in a current or given state, and an output of the fully-connected network is an action to be taken when the fully-connected network faces the state.
For example, referring to fig. 3, the Critic network in this embodiment includes two current Q networks, whose network structures are completely the same, and includes 3 fully-connected layers and rule layers that are sequentially arranged, where each layer of the fully-connected network includes 256 neurons, the input is a given state and its corresponding action, and the output is an evaluation value for this scenario.
In a possible implementation manner, updating the Actor-critical policy model through the data in the model buffer pool further includes:
calculating a second loss function through a third state action function and the action entropy value and updating the Actor network; the third state action function is an evaluation value of the criticic network when an action value is selected through the Actor network policy under corresponding state data; the action entropy value is an entropy value of an action value selected by the Actor network policy.
In the embodiment of the application, the loss function of the Actor network is composed of two parts, the first part is an evaluation value of the criticc network when a certain action is selected by a strategy under a certain state, namely a state action value function, and the second part is an entropy of the action selected by the strategy, so that the Actor network can be updated more accurately.
In a possible implementation manner, the generating prediction data by performing multi-step interaction trajectory prediction on the environmental state transition model and the Actor-Critic policy model by using the data in the environmental buffer pool includes:
randomly sampling a preset number of fifth state data from the environment buffer pool;
inputting the fifth state data into an Actor network of the Actor-Critic policy model, and acquiring fourth action data output by the Actor network;
inputting the fifth state data and the fourth action data into the environmental state transition model to obtain state data at the next moment as sixth state data, and taking a return value at the next moment as a third return value;
taking the sixth state data as fifth state data and circularly acquiring the state data and the return value to preset conditions;
and randomly selecting a preset amount of data generated in the cyclic process as the prediction data.
In the embodiment of the application, after the fifth state data is input into the Actor network as the current state data, the current decision action generated aiming at the current state data, that is, the fourth action data, can be acquired; since the environmental state transition model is environment-simulated, after the fifth state data and the fourth action data are input into the environmental state transition model, the fifth state data and the fourth action data can be run in the simulated environment to generate state data at the next moment, namely, sixth state data, and a reward value generated by the fourth action data, namely, a third reward value. In the embodiment of the present application, the multiple-interaction trajectory prediction means that newly obtained sixth state data is input into an Actor network as current state data, so as to start a loop process, where each loop generates a sample pair, and the sample pair includes a current state, a current action, a return value, and a next-moment state. The preset condition for the loop may be to reach a predetermined number of loops, or may set a threshold for some variable, which is not limited in this embodiment of the application.
For example, please refer to fig. 4, the environment state transition model of this embodiment is an integrated model composed of a plurality of neural network models, 4 layers of full connections and 3 activation functions are sequentially set for each neural network model, each layer of full connections is set with 300 hidden layers, and the activation functions use swish; the input of the model is a randomly given state and the action in the state, and the output is the state at the next moment and the return value at the moment.
In one possible implementation, the environmental state transition model includes a plurality of mutually independent submodels; each submodel is trained through the same neural network model and sample, and the initial value of each submodel in training is different.
When the method and the device are implemented, the environment state transition model is combined into an integrated environment state transition model through a plurality of neural network models, the integrated environment state transition model has the capability of capturing uncertainty in real environment dynamic transition, the mean value and the variance output by each independent sub-model have differences, more track data of different scenes can be generated when the model is interacted with the unmanned aerial vehicle, and the strategy network learning is facilitated. The loss function of the environmental state transition model is a unique design, and is an aggregate model, and each aggregate model comprises a full connection layer and an activation layer which are sequentially arranged, and the state and the return value of the next moment are predicted. And the mean value and the variance of the state and the return value of the next moment output by the last full-connection layer of the environment state transition model adopt Gaussian distribution to increase the adaptability of the model to the complex environment.
In one possible implementation, the environmental state transition model includes a third loss function and a fourth loss function; the third loss function is generated based on how the current state data and the action data are transferred to the state data at the next moment and the generated return value; the fourth loss function is generated based on a difference between the real environment and a simulated environment produced by the environmental state transition model.
In the embodiment of the application, the environment state transition model uses a uniquely designed network loss function, the design of the loss function is derived according to theoretical analysis, the design of the loss function not only considers how the current state and action are transferred to the state and the return value at the next moment, but also considers the difference between the real environment and the simulated environment, and the loss function designed by using the two functions has uncertainty of capturing the environment.
In one possible implementation, the fourth loss function is generated based on the following equation:
Figure BDA0003793739290000141
in the formula,
Figure BDA0003793739290000142
for the performance of the policy of the drone on the real environment,
Figure BDA0003793739290000143
is the performance of the same strategy of the unmanned equipment on the simulated environment, s is sampled state data, a is sampled action data, gamma represents a discount factor in reinforcement learning, | r | n max The representation environment gives the maximum value of the absolute value of the return value,
Figure BDA0003793739290000144
is the difference between the real environment and the simulated environment;
wherein,
Figure BDA0003793739290000145
calculated according to the following formula:
Figure BDA0003793739290000146
wherein p is state data and motion data in the real environmentThe p-norm of (a) is,
Figure BDA0003793739290000147
is p-norm, D, of state data and motion data in the simulation environment TV For the TV distance, s' is the state at the next moment.
In the embodiment of the application, the environmental state transition model is a performance lower bound of the unmanned equipment control strategy in a real environment, the performance lower bound is established in a simulation environment, the performance lower bound is still established in strategy iteration based on the model, and the performance lower bound also ensures monotonous convergence in strategy iteration based on the model, and is realized by the following formula:
Figure BDA0003793739290000148
it is worth noting that the lower performance bound is characterized by the optimization of performance in real environments independent of the policies of the agent itself.
In one possible implementation, when the difference between the real environment and the simulated environment is measured by KL divergence, the network loss function used by the environmental state transition model is implemented by the following equation:
Figure BDA0003793739290000149
in the formula s n For state data sampled from the environmental buffer pool, a n For action data sampled from the context buffer pool,
Figure BDA0003793739290000151
for the next moment state predicted by the environmental state transition model given the state data and the action data,
Figure BDA0003793739290000152
to correspond to
Figure BDA0003793739290000153
The value of (a) is a hyperparameter,
Figure BDA0003793739290000154
the average value of the environmental state transition model at the last update,
Figure BDA0003793739290000155
is corresponding to
Figure BDA0003793739290000156
Variance of (d), μ θ For the latest mean, σ, of the model of the environmental state transitions θ To correspond to mu θ Variance of D kl Is KL divergence, N is normal distribution,
Figure BDA0003793739290000157
is a maximum likelihood estimation function.
In the embodiment of the application, aiming at the control of the unmanned robot, the inventor finds that the difference between the real environment and the simulated environment is measured by KL divergence, the interactive data of the unmanned robot and the real environment is used for training the network, and a better effect can be obtained under the condition that the training can be stopped when the training error is less than a threshold value; therein
Figure BDA0003793739290000158
Is corresponding to
Figure BDA0003793739290000159
The reported value is
Figure BDA00037937392900001510
To correspond to the state s n Lower execution a n Post transition to state
Figure BDA00037937392900001511
The return value of this process; wherein
Figure BDA00037937392900001512
To predict the state and return value at the next moment after a given state and action, i.e. to predict the state and return value at the next moment
Figure BDA00037937392900001513
To be in a given state s n And a n Then, the state of the next time is predicted
Figure BDA00037937392900001514
And a return value
Figure BDA00037937392900001515
When the environment state transition model is updated, the state and the return value of the real environment at the next moment are sampled from the environment cache pool and are used as the updating target of the environment state transition model. The network loss function is to ensure that the environmental state transition model can fit the trajectory data in the real environment.
Specifically, when the difference between the real environment and the simulated environment is measured using the KL divergence, it can be characterized as:
Figure BDA00037937392900001516
and the performance of the strategy of the unmanned aerial vehicle robot on the real environment is changed into an unconstrained optimization problem, namely:
Figure BDA00037937392900001517
wherein,
Figure BDA0003793739290000161
and further deducing a network loss function used by the environment state transition model for an adjustable hyper-parameter.
In one possible implementation, when the difference between the real environment and the simulated environment is measured by a p-norm, the network loss function used by the environmental state transition model is implemented by:
Figure BDA0003793739290000162
in the formula s n For state data sampled from the environmental buffer pool, a n For the action data sampled from the context buffer pool,
Figure BDA0003793739290000163
for the next moment state predicted by the environmental state transition model given the state data and the action data,
Figure BDA0003793739290000164
is corresponding to
Figure BDA0003793739290000165
The value of (a) is a hyperparameter,
Figure BDA0003793739290000166
the mean value of the environmental state transition model at the last update,
Figure BDA0003793739290000167
to correspond to
Figure BDA0003793739290000168
Variance of (d), μ θ Is the latest mean value of the environmental state transition model,
Figure BDA0003793739290000169
is a norm of 2 times,
Figure BDA00037937392900001610
is a maximum likelihood estimation function.
When the embodiment of the application is implemented, aiming at unmanned aerial vehicle control, the inventor finds that the difference between a real environment and a simulation environment is measured by using a p-norm, the network is trained by using interactive data of the unmanned aerial vehicle and the real environment, and when the training is mistaken, the training is mistakenWhen the difference is less than a threshold value, the training can be stopped, and a better effect can be obtained; therein
Figure BDA00037937392900001611
To correspond to
Figure BDA00037937392900001612
The reported value is
Figure BDA00037937392900001613
To correspond to the state s n Lower execution a n Post transition to state
Figure BDA00037937392900001614
The return value of this process; wherein
Figure BDA00037937392900001615
To predict the state and return value at the next moment after a given state and action, i.e. to predict the state and return value at the next moment
Figure BDA00037937392900001616
To be in a given state s n And a n Then, the state of the next time is predicted
Figure BDA00037937392900001617
And the return value
Figure BDA00037937392900001618
When the environment state transition model is updated, the state and the return value of the real environment at the next moment are sampled from the environment cache pool and are used as the updating target of the environment state transition model. The network loss function is to ensure that the environmental state transition model can fit the trajectory data in the real environment.
Specifically, the difference between the real environment and the simulated environment can be measured using a Lipschitz continuous transformation into a p-norm, specifically, a link between the dynamic model of the environment and the p-norm is constructed, which is characterized as:
Figure BDA0003793739290000171
in the formula,
Figure BDA0003793739290000172
a dynamic transfer model representing the real environment,
Figure BDA0003793739290000173
representing a dynamic transfer model on a simulation environment.
When the difference between the real environment and the simulation environment is measured by using the p-norm, the corresponding formula is as follows:
Figure BDA0003793739290000174
the performance of the strategy of the unmanned aerial vehicle on a real environment is changed into an unconstrained optimization problem, and an optimization formula when the p-norm is used for measurement is as follows:
Figure BDA0003793739290000175
wherein,
Figure BDA0003793739290000176
and deriving a network loss function used by the environment state transition model for an adjustable hyper-parameter.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be embodied in electronic hardware, computer software, or combinations of both, and that the components and steps of the examples have been described in a functional general in the foregoing description for the purpose of illustrating clearly the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electric, mechanical or other form of connection.
The elements described as separate components may or may not be physically separate, as the elements are clearly recognizable to those skilled in the art that the elements and algorithm steps of each example described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or combinations of both, and the components and steps of each example have been described in general terms of function in the foregoing description for clearly illustrating the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention essentially or partially contributes to the prior art, or all or part of the technical solution can be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a grid device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk, and various media capable of storing program codes.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. The model-based unmanned equipment control method for high-sample-rate deep reinforcement learning is characterized by comprising the following steps of:
controlling the unmanned equipment to use a strategy in an Actor-Critic strategy model to interact with a real environment to obtain track data, and storing the track data into an environment buffer pool;
updating an environment state transition model through the data in the environment buffer pool;
performing multi-step interactive trajectory prediction on the environmental state transition model and the Actor-Critic strategy model by using the data in the environmental buffer pool to generate prediction data, and storing the prediction data into a model buffer pool;
updating the Actor-Critic strategy model through data in the model buffer pool;
and continuously and iteratively updating the environment state transition model and the Actor-Critic strategy model until the current strategy performance meets the expected requirement.
2. The model-based unmanned aerial vehicle control method for high-sample-rate deep reinforcement learning according to claim 1, wherein controlling the unmanned aerial vehicle to interact with the real environment by using the strategy in the Actor-critical strategy model to acquire trajectory data comprises:
inputting current state data of the unmanned equipment in the real environment as first state data into an Actor network of the Actor-criticic policy model, and receiving an action value output by the Actor network as first action data; the first action data is obtained by sampling the mean value and the variance of multidimensional Gaussian distribution output by the last full-connection layer of the Actor network by the Actor network;
controlling the unmanned equipment to operate in the real environment by the first action data, and acquiring state data of the unmanned equipment at the next moment as second state data and a return value at the next moment as a first return value;
and taking the first state data, the first action data, the second state data and the first return value as the track data.
3. The model-based unmanned device control method for high-sample-rate deep reinforcement learning according to claim 1, wherein updating the Actor-critical policy model with data in the model buffer pool comprises:
sampling a fixed amount of third state data, second action data, second return values and fourth state data from the model buffer pool; the fourth state data is state data at the next moment of the third state data;
inputting the fourth state data into an Actor network of the Actor-Critic policy model to acquire third action data;
inputting the third state data and the second action data into a criticic network of the Actor-criticic policy model, and acquiring a first state action function output by the criticic network; the first state action function is the minimum value of state action functions respectively output by at least two current Q networks in the Critic network;
inputting the third action data and the fourth state data into a target Q network in the Critic network to obtain a state action function at the next moment as a second state action function;
and calculating a first loss function through the second return value, the first state action function and the second state action function, and updating the criticic network.
4. The model-based unmanned aerial vehicle control method for high sample rate deep reinforcement learning according to claim 3, wherein updating the Actor-Critic policy model with data in the model buffer pool further comprises:
calculating a second loss function through a third state action function and the action entropy value and updating the Actor network; the third state action function is an evaluation value of the criticic network when an action value is selected through the Actor network policy under corresponding state data; the action entropy value is an entropy value of an action value selected by the Actor network policy.
5. The model-based unmanned aerial vehicle control method for high-sample-rate deep reinforcement learning according to claim 1, wherein performing multi-step interaction trajectory prediction on the environmental state transition model and the Actor-Critic policy model using the data in the environmental buffer pool to generate prediction data comprises:
randomly sampling a preset number of fifth state data from the environment buffer pool;
inputting the fifth state data into an Actor network of the Actor-Critic policy model, and acquiring fourth action data output by the Actor network;
inputting the fifth state data and the fourth action data into the environment state transition model to obtain state data at the next moment as sixth state data, and taking a return value at the next moment as a third return value;
taking the sixth state data as fifth state data and circularly acquiring the state data and the return value to preset conditions;
and randomly selecting a preset amount of data generated in the cyclic process as the prediction data.
6. The model-based unmanned aerial vehicle control method for high sample rate deep reinforcement learning of claim 1, wherein the environmental state transition model comprises a plurality of mutually independent sub-models; each sub-model is trained through the same neural network model and sample, and the initial value of each sub-model in training is different.
7. The model-based unmanned aerial vehicle control method for high sample rate deep reinforcement learning of claim 1, wherein the environmental state transition model comprises a third loss function and a fourth loss function; the third loss function is generated based on how the current state data and the action data are transferred to the state data at the next moment and the generated return value; the fourth loss function is generated based on a difference between the real environment and a simulated environment produced by the environmental state transition model.
8. The model-based high sample rate deep reinforcement learning unmanned device control method of claim 7, wherein the fourth loss function is generated based on the following equation:
Figure FDA0003793739280000031
in the formula,
Figure FDA0003793739280000032
for the performance of the policy of the drone on the real environment,
Figure FDA0003793739280000033
gamma represents a discount factor in reinforcement learning, | r |, for performance of the same policy of the unmanned device on the simulated environment max The representation environment gives the maximum value of the absolute value of the return value,
Figure FDA0003793739280000041
is the difference between the real environment and the simulated environment;
wherein,
Figure FDA0003793739280000042
calculated according to the following formula:
Figure FDA0003793739280000043
wherein p is a p-norm of state data and motion data in the real environment,
Figure FDA0003793739280000044
is p-norm, D, of state data and motion data in the simulation environment TV For the TV distance, s' is the state at the next moment, s is the sampled state data, and a is the sampled motion data.
9. The model-based unmanned aerial vehicle control method for high sample rate deep reinforcement learning of claim 8, wherein when the difference between the real environment and the simulated environment is measured by KL divergence, the network loss function used by the environmental state transition model is implemented by the following equation:
Figure FDA0003793739280000045
in the formula, s n For state data sampled from the environmental buffer pool, a n For actions sampled from the environmental buffer poolThe data of the data is transmitted to the data receiver,
Figure FDA0003793739280000046
for the next-time state predicted by the environmental state transition model given the state data and the action data,
Figure FDA0003793739280000047
to correspond to
Figure FDA0003793739280000048
Alpha is a hyper-parameter,
Figure FDA00037937392800000412
the average value of the environmental state transition model at the last update,
Figure FDA0003793739280000049
to correspond to
Figure FDA00037937392800000410
Variance of (d), μ θ For the latest mean, σ, of the model of the environmental state transitions θ To correspond to mu θ Variance of (D), D kl Is KL divergence, N is normal distribution,
Figure FDA00037937392800000411
is a maximum likelihood estimation function.
10. The model-based unmanned aerial vehicle control method for high sample rate deep reinforcement learning of claim 8, wherein when the difference between the real environment and the simulated environment is measured by p-norm, the network loss function used by the environmental state transition model is implemented by the following formula:
Figure FDA0003793739280000051
in the formula, s n For state data sampled from the environmental buffer pool, a n For the action data sampled from the context buffer pool,
Figure FDA0003793739280000052
for the next moment state predicted by the environmental state transition model given the state data and the action data,
Figure FDA0003793739280000053
to correspond to
Figure FDA0003793739280000054
The value of (a) is a hyperparameter,
Figure FDA0003793739280000055
the mean value of the environmental state transition model at the last update,
Figure FDA0003793739280000056
is corresponding to
Figure FDA0003793739280000057
Variance of (d), μ θ Is the latest mean value of the environmental state transition model,
Figure FDA0003793739280000058
is a norm of 2 times,
Figure FDA0003793739280000059
is a maximum likelihood estimation function.
CN202210963402.6A 2022-08-11 2022-08-11 Model-based unmanned equipment control method for high-sample-rate deep reinforcement learning Active CN115293334B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210963402.6A CN115293334B (en) 2022-08-11 2022-08-11 Model-based unmanned equipment control method for high-sample-rate deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210963402.6A CN115293334B (en) 2022-08-11 2022-08-11 Model-based unmanned equipment control method for high-sample-rate deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN115293334A true CN115293334A (en) 2022-11-04
CN115293334B CN115293334B (en) 2024-09-27

Family

ID=83827894

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210963402.6A Active CN115293334B (en) 2022-08-11 2022-08-11 Model-based unmanned equipment control method for high-sample-rate deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN115293334B (en)

Citations (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106094813A (en) * 2016-05-26 2016-11-09 华南理工大学 It is correlated with based on model humanoid robot gait's control method of intensified learning
WO2018206504A1 (en) * 2017-05-10 2018-11-15 Telefonaktiebolaget Lm Ericsson (Publ) Pre-training system for self-learning agent in virtualized environment
CN110717600A (en) * 2019-09-30 2020-01-21 京东城市(北京)数字科技有限公司 Sample pool construction method and device, and algorithm training method and device
DE102018216561A1 (en) * 2018-09-27 2020-04-02 Robert Bosch Gmbh Method, device and computer program for determining an agent's strategy
KR20200062887A (en) * 2018-11-27 2020-06-04 한국전자통신연구원 Apparatus and method for assuring quality of control operations of a system based on reinforcement learning.
CN111461347A (en) * 2020-04-02 2020-07-28 中国科学技术大学 Reinforced learning method for optimizing experience playback sampling strategy
CN111460650A (en) * 2020-03-31 2020-07-28 北京航空航天大学 Unmanned aerial vehicle end-to-end control method based on deep reinforcement learning
CN111582441A (en) * 2020-04-16 2020-08-25 清华大学 High-efficiency value function iteration reinforcement learning method of shared cyclic neural network
KR20200105365A (en) * 2019-06-05 2020-09-07 아이덴티파이 주식회사 Method for reinforcement learning using virtual environment generated by deep learning
CN112183288A (en) * 2020-09-22 2021-01-05 上海交通大学 Multi-agent reinforcement learning method based on model
CN112363402A (en) * 2020-12-21 2021-02-12 杭州未名信科科技有限公司 Gait training method and device of foot type robot based on model-related reinforcement learning, electronic equipment and medium
CN112460741A (en) * 2020-11-23 2021-03-09 香港中文大学(深圳) Control method of building heating, ventilation and air conditioning system
WO2021058626A1 (en) * 2019-09-25 2021-04-01 Deepmind Technologies Limited Controlling agents using causally correct environment models
CN112766497A (en) * 2021-01-29 2021-05-07 北京字节跳动网络技术有限公司 Deep reinforcement learning model training method, device, medium and equipment
WO2021123235A1 (en) * 2019-12-19 2021-06-24 Secondmind Limited Reinforcement learning system
CN113281999A (en) * 2021-04-23 2021-08-20 南京大学 Unmanned aerial vehicle autonomous flight training method based on reinforcement learning and transfer learning
CN113359704A (en) * 2021-05-13 2021-09-07 浙江工业大学 Self-adaptive SAC-PID method suitable for complex unknown environment
CN113419424A (en) * 2021-07-05 2021-09-21 清华大学深圳国际研究生院 Modeling reinforcement learning robot control method and system capable of reducing over-estimation
CN113467515A (en) * 2021-07-22 2021-10-01 南京大学 Unmanned aerial vehicle flight control method based on virtual environment simulation reconstruction and reinforcement learning
CN113485107A (en) * 2021-07-05 2021-10-08 清华大学深圳国际研究生院 Reinforcement learning robot control method and system based on consistency constraint modeling
CN113510704A (en) * 2021-06-25 2021-10-19 青岛博晟优控智能科技有限公司 Industrial mechanical arm motion planning method based on reinforcement learning algorithm
WO2021208771A1 (en) * 2020-04-18 2021-10-21 华为技术有限公司 Reinforced learning method and device
CN113947022A (en) * 2021-10-20 2022-01-18 哈尔滨工业大学(深圳) Near-end strategy optimization method based on model
CN114186496A (en) * 2021-12-15 2022-03-15 中国科学技术大学 Method for improving continuous control stability of intelligent agent
CN114800515A (en) * 2022-05-12 2022-07-29 四川大学 Robot assembly motion planning method based on demonstration track
CN114859921A (en) * 2022-05-12 2022-08-05 鹏城实验室 Automatic driving optimization method based on reinforcement learning and related equipment
CN114879486A (en) * 2022-02-28 2022-08-09 复旦大学 Robot optimization control method based on reinforcement learning and evolution algorithm

Patent Citations (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106094813A (en) * 2016-05-26 2016-11-09 华南理工大学 It is correlated with based on model humanoid robot gait's control method of intensified learning
WO2018206504A1 (en) * 2017-05-10 2018-11-15 Telefonaktiebolaget Lm Ericsson (Publ) Pre-training system for self-learning agent in virtualized environment
DE102018216561A1 (en) * 2018-09-27 2020-04-02 Robert Bosch Gmbh Method, device and computer program for determining an agent's strategy
KR20200062887A (en) * 2018-11-27 2020-06-04 한국전자통신연구원 Apparatus and method for assuring quality of control operations of a system based on reinforcement learning.
KR20200105365A (en) * 2019-06-05 2020-09-07 아이덴티파이 주식회사 Method for reinforcement learning using virtual environment generated by deep learning
WO2021058626A1 (en) * 2019-09-25 2021-04-01 Deepmind Technologies Limited Controlling agents using causally correct environment models
CN110717600A (en) * 2019-09-30 2020-01-21 京东城市(北京)数字科技有限公司 Sample pool construction method and device, and algorithm training method and device
WO2021123235A1 (en) * 2019-12-19 2021-06-24 Secondmind Limited Reinforcement learning system
CN111460650A (en) * 2020-03-31 2020-07-28 北京航空航天大学 Unmanned aerial vehicle end-to-end control method based on deep reinforcement learning
CN111461347A (en) * 2020-04-02 2020-07-28 中国科学技术大学 Reinforced learning method for optimizing experience playback sampling strategy
CN111582441A (en) * 2020-04-16 2020-08-25 清华大学 High-efficiency value function iteration reinforcement learning method of shared cyclic neural network
WO2021208771A1 (en) * 2020-04-18 2021-10-21 华为技术有限公司 Reinforced learning method and device
CN112183288A (en) * 2020-09-22 2021-01-05 上海交通大学 Multi-agent reinforcement learning method based on model
CN112460741A (en) * 2020-11-23 2021-03-09 香港中文大学(深圳) Control method of building heating, ventilation and air conditioning system
CN112363402A (en) * 2020-12-21 2021-02-12 杭州未名信科科技有限公司 Gait training method and device of foot type robot based on model-related reinforcement learning, electronic equipment and medium
CN112766497A (en) * 2021-01-29 2021-05-07 北京字节跳动网络技术有限公司 Deep reinforcement learning model training method, device, medium and equipment
CN113281999A (en) * 2021-04-23 2021-08-20 南京大学 Unmanned aerial vehicle autonomous flight training method based on reinforcement learning and transfer learning
CN113359704A (en) * 2021-05-13 2021-09-07 浙江工业大学 Self-adaptive SAC-PID method suitable for complex unknown environment
CN113510704A (en) * 2021-06-25 2021-10-19 青岛博晟优控智能科技有限公司 Industrial mechanical arm motion planning method based on reinforcement learning algorithm
CN113419424A (en) * 2021-07-05 2021-09-21 清华大学深圳国际研究生院 Modeling reinforcement learning robot control method and system capable of reducing over-estimation
CN113485107A (en) * 2021-07-05 2021-10-08 清华大学深圳国际研究生院 Reinforcement learning robot control method and system based on consistency constraint modeling
CN113467515A (en) * 2021-07-22 2021-10-01 南京大学 Unmanned aerial vehicle flight control method based on virtual environment simulation reconstruction and reinforcement learning
CN113947022A (en) * 2021-10-20 2022-01-18 哈尔滨工业大学(深圳) Near-end strategy optimization method based on model
CN114186496A (en) * 2021-12-15 2022-03-15 中国科学技术大学 Method for improving continuous control stability of intelligent agent
CN114879486A (en) * 2022-02-28 2022-08-09 复旦大学 Robot optimization control method based on reinforcement learning and evolution algorithm
CN114800515A (en) * 2022-05-12 2022-07-29 四川大学 Robot assembly motion planning method based on demonstration track
CN114859921A (en) * 2022-05-12 2022-08-05 鹏城实验室 Automatic driving optimization method based on reinforcement learning and related equipment

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
AZULAY, OSHER ET AL.,: "Wheel Loader Scooping Controller Using Deep Reinforcement Learning", 《 IEEE ACCESS》, vol. 9, 2 February 2020 (2020-02-02), pages 24145 - 24154, XP011836475, DOI: 10.1109/ACCESS.2021.3056625 *
JIE LENG ET AL.,: "M-A3C: A Mean-Asynchronous Advantage Actor-Critic Reinforcement Learning Method for Real-Time Gait Planning of Biped Robot", 《 IEEE ACCESS》, vol. 10, 20 May 2021 (2021-05-20), pages 76523 - 76536 *
刘全等: "深度强化学习综述", 《计算机学报》, vol. 40, 31 December 2017 (2017-12-31), pages 1 - 28 *
杨智友: "强化学习中的优化策略研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》, no. 1, 15 January 2022 (2022-01-15), pages 140 - 539 *

Also Published As

Publication number Publication date
CN115293334B (en) 2024-09-27

Similar Documents

Publication Publication Date Title
Groshev et al. Learning generalized reactive policies using deep neural networks
Bianchi et al. Accelerating autonomous learning by using heuristic selection of actions
CN114139637B (en) Multi-agent information fusion method and device, electronic equipment and readable storage medium
CN110991027A (en) Robot simulation learning method based on virtual scene training
US20210158162A1 (en) Training reinforcement learning agents to learn farsighted behaviors by predicting in latent space
CN112119409A (en) Neural network with relational memory
CN112884131A (en) Deep reinforcement learning strategy optimization defense method and device based on simulation learning
CN114261400B (en) Automatic driving decision method, device, equipment and storage medium
US20220366246A1 (en) Controlling agents using causally correct environment models
Andersen et al. Active exploration for learning symbolic representations
CN117077727B (en) Track prediction method based on space-time attention mechanism and neural ordinary differential equation
Kojima et al. To learn or not to learn: Analyzing the role of learning for navigation in virtual environments
CN114239974B (en) Multi-agent position prediction method and device, electronic equipment and storage medium
CN116166642A (en) Spatio-temporal data filling method, system, equipment and medium based on guide information
CN113110101B (en) Production line mobile robot gathering type recovery and warehousing simulation method and system
Ge et al. Deep reinforcement learning navigation via decision transformer in autonomous driving
CN114219066A (en) Unsupervised reinforcement learning method and unsupervised reinforcement learning device based on Watherstein distance
CN115293334A (en) Model-based unmanned equipment control method for high sample rate deep reinforcement learning
CN117933055A (en) Equipment residual service life prediction method based on reinforcement learning integrated framework
Feng et al. Mobile robot obstacle avoidance based on deep reinforcement learning
CN116360435A (en) Training method and system for multi-agent collaborative strategy based on plot memory
Gross et al. Probabilistic model checking of stochastic reinforcement learning policies
Han et al. Three‐dimensional obstacle avoidance for UAV based on reinforcement learning and RealSense
Lauttia Adaptive Monte Carlo Localization in ROS
CN114970714B (en) Track prediction method and system considering uncertain behavior mode of moving target

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant