
CN115293334A - Model-based unmanned equipment control method for high sample rate deep reinforcement learning - Google Patents


Info

Publication number
CN115293334A
CN115293334A (application CN202210963402.6A)
Authority
CN
China
Prior art keywords
data
model
state
action
actor
Prior art date
Legal status
Granted
Application number
CN202210963402.6A
Other languages
Chinese (zh)
Other versions
CN115293334B (en)
Inventor
杨智友
屈鸿
符明晟
李凡
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202210963402.6A
Publication of CN115293334A
Application granted
Publication of CN115293334B
Legal status: Active


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses a model-based unmanned equipment control method with high-sample-rate deep reinforcement learning, which comprises the following steps: acquiring trajectory data and storing the trajectory data in an environment buffer pool; updating the environment state transition model; predicting multi-step interaction trajectories to generate prediction data and storing the prediction data in a model buffer pool; updating the Actor-Critic strategy model; and continuously and iteratively updating the environment state transition model and the Actor-Critic strategy model until the current strategy performance meets the expected requirement. The invention adopts a model-based deep reinforcement learning method and constructs an environment state transition model to simulate the interaction between the unmanned equipment and the external environment, thereby sharply reducing the number of interactions between the unmanned equipment and the real environment; the data generated by the environment state transition model effectively optimizes the travel control strategy of the unmanned equipment, making control of the unmanned equipment efficient.

Description

Model-based unmanned equipment control method for high sample rate deep reinforcement learning
Technical Field
The invention relates to unmanned equipment control technology, and in particular to an unmanned equipment control method based on model-based high-sample-rate deep reinforcement learning.
Background
At present, travel control of unmanned equipment is mainly developed on the basis of traditional control technology, but traditional control suffers from rigid, single-route planning of the unmanned equipment's travel path and a lack of coping strategies in complex scenes. With the rapid development of deep learning and reinforcement learning, the strong feature-learning capability of deep neural networks can be used to learn the relevant travel features from a large amount of unmanned-equipment interaction data, and, combined with reinforcement-learning modeling of the unmanned equipment's travel problem, obstacle avoidance during travel can be realized; however, the problem of low data-sample efficiency remains.
Disclosure of Invention
In order to overcome at least the above-mentioned deficiencies in the prior art, it is an object of the present application to provide a model-based high sample rate deep reinforcement learning unmanned device control method.
The embodiment of the application provides a model-based unmanned equipment control method for high-sample-rate deep reinforcement learning, which comprises the following steps:
controlling the unmanned equipment to use a strategy in an Actor-Critic strategy model to interact with a real environment to obtain track data, and storing the track data into an environment buffer pool;
updating the environmental state transition model through the data in the environmental buffer pool;
using the data in the environment buffer pool to carry out multi-step interactive trajectory prediction on the environment state transition model and the Actor-Critic strategy model to generate prediction data, and storing the prediction data in a model buffer pool;
updating the Actor-Critic strategy model through data in the model buffer pool;
and continuously and iteratively updating the environment state transition model and the Actor-Critic strategy model until the current strategy performance meets the expected requirement.
In the prior art, the time cost of running tests of unmanned equipment in a real environment is very high, and problems such as equipment collisions easily occur when the unmanned equipment operates in a real environment for a long time, so the large number of samples required to train a model for unmanned-equipment operation control is difficult to guarantee. When the method and the device are implemented, an Actor-Critic strategy model and an environment state transition model are first constructed: the Actor-Critic strategy model provides the strategy by which the unmanned equipment operates in the environment, and the environment state transition model provides a simulated environment for the unmanned equipment to operate in. A large number of samples for training the unmanned-equipment operation control model can be generated through interaction between the Actor-Critic strategy model and the environment state transition model, thereby improving the accuracy of the trained model.
In the embodiment of the application, each iterative update of the model requires the strategy in the Actor-Critic strategy model to control the unmanned equipment to operate for a short time in the real environment, and the generated trajectory data is used to update the environment state transition model. Multi-step interaction trajectory prediction is then carried out with the environment state transition model and the Actor-Critic strategy model to generate a large amount of prediction data, which serves as training samples to update the Actor-Critic strategy model. After a certain number of iterations, the Actor-Critic strategy model becomes a mature model for providing the unmanned-equipment control strategy, and subsequent tests can be carried out. The embodiment of the application thus provides a high-precision simulation environment for unmanned-equipment interaction; the simulation environment is learned from real historical interaction trajectories, and its accuracy is directly influenced by the loss function of the environment state transition model during learning. The environment state transition model of the embodiment of the application corresponds to the simulation environment and provides safety and efficiency for the unmanned equipment's interaction.
In one possible implementation manner, controlling the unmanned equipment to interact with the real environment using the strategy in the Actor-Critic strategy model to acquire the trajectory data comprises the following steps:
inputting current state data of the unmanned equipment in the real environment as first state data into an Actor network of the Actor-Critic policy model, and receiving an action value output by the Actor network as first action data; the first action data is obtained by the Actor network sampling from the multidimensional Gaussian distribution whose mean and variance are output by the last fully connected layer of the Actor network;
controlling the unmanned equipment to operate in the real environment by the first action data, and acquiring state data of the unmanned equipment at the next moment as second state data and a return value at the next moment as a first return value;
and taking the first state data, the first action data, the second state data and the first return value as the track data.
In a possible implementation manner, the updating the Actor-Critic policy model by the data in the model buffer pool includes:
sampling a fixed amount of third state data, second action data, second return values and fourth state data from the model buffer pool; the fourth state data is state data at the next moment of the third state data;
inputting the fourth state data into an Actor network of the Actor-Critic policy model to acquire third action data;
inputting the third state data and the second action data into a Critic network of the Actor-Critic policy model, and acquiring a first state action function output by the Critic network; the first state action function is the minimum value of the state action functions respectively output by at least two current Q networks in the Critic network;
inputting the third action data and the fourth state data into a target Q network in the Critic network to obtain a state action function at the next moment as a second state action function;
and calculating a first loss function through the second return value, the first state action function and the second state action function, and updating the Critic network.
In a possible implementation manner, updating the Actor-Critic policy model through the data in the model buffer pool further includes:
calculating a second loss function through a third state action function and an action entropy value and updating the Actor network; the third state action function is the evaluation value given by the Critic network when an action value is selected by the Actor network policy under the corresponding state data; the action entropy value is the entropy of the action value selected by the Actor network policy.
In a possible implementation manner, the generating prediction data by performing multi-step interaction trajectory prediction on the environmental state transition model and the Actor-Critic policy model by using the data in the environmental buffer pool includes:
randomly sampling a preset number of fifth state data from the environment buffer pool;
inputting the fifth state data into an Actor network of the Actor-Critic policy model, and acquiring fourth action data output by the Actor network;
inputting the fifth state data and the fourth action data into the environmental state transition model to obtain state data at the next moment as sixth state data, and taking a return value at the next moment as a third return value;
taking the sixth state data as fifth state data and cyclically acquiring state data and return values until a preset condition is met;
and randomly selecting a preset amount of data generated in the cyclic process as the prediction data.
In one possible implementation, the environmental state transition model includes a plurality of mutually independent submodels; each sub-model is trained through the same neural network model and sample, and the initial value of each sub-model in training is different.
In one possible implementation, the environmental state transition model includes a third loss function and a fourth loss function; the third loss function is generated based on how the current state data and the action data are transferred to the state data at the next moment and the generated return value; the fourth loss function is generated based on a difference between the real environment and a simulated environment produced by the environmental state transition model.
In one possible implementation, the fourth loss function is generated based on the following equation:
$$\eta[\pi] \;\geq\; \hat{\eta}[\pi] \;-\; \frac{2\gamma\,|r|_{\max}}{(1-\gamma)^{2}}\,\epsilon_m$$
where $\eta[\pi]$ is the performance of the unmanned equipment's policy on the real environment, $\hat{\eta}[\pi]$ is the performance of the same policy on the simulated environment, $s$ is sampled state data, $a$ is sampled action data, $\gamma$ is the discount factor in reinforcement learning, $|r|_{\max}$ is the maximum absolute value of the return given by the environment, and $\epsilon_m$ is the difference between the real environment and the simulated environment;
wherein $\epsilon_m$ is calculated according to the following formula:
$$\epsilon_m \;=\; \mathbb{E}_{(s,a)}\!\left[\, D_{TV}\!\left( p(s' \mid s,a) \,\|\, \hat{p}(s' \mid s,a) \right) \right]$$
where $p(s' \mid s,a)$ is the state transition over state and action data in the real environment, $\hat{p}(s' \mid s,a)$ is the state transition over state and action data in the simulated environment, $D_{TV}$ is the TV (total variation) distance, and $s'$ is the state at the next moment.
In one possible implementation, when the difference between the real environment and the simulated environment is measured by a KL divergence, the network loss function used by the environmental state transition model is implemented by:
$$L(\theta) \;=\; -\,\mathcal{L}_{MLE}\!\left(\hat{s}_{n+1}, \hat{r}_{n+1} \mid s_n, a_n\right) \;+\; \alpha\, D_{kl}\!\left( \mathcal{N}\!\left(\mu_{\theta_{old}}, \sigma_{\theta_{old}}\right) \,\middle\|\, \mathcal{N}\!\left(\mu_{\theta}, \sigma_{\theta}\right) \right)$$
where $s_n$ is state data sampled from the environment buffer pool, $a_n$ is action data sampled from the environment buffer pool, $\hat{s}_{n+1}$ is the next-moment state predicted by the environment state transition model given the state data and the action data, $\hat{r}_{n+1}$ is the return value corresponding to $\hat{s}_{n+1}$, $\alpha$ is a hyper-parameter, $\mu_{\theta_{old}}$ is the mean value of the environment state transition model at the last update, $\sigma_{\theta_{old}}$ is the variance corresponding to $\mu_{\theta_{old}}$, $\mu_{\theta}$ is the latest mean value of the environment state transition model, $\sigma_{\theta}$ is the variance corresponding to $\mu_{\theta}$, $D_{kl}$ is the KL divergence, $\mathcal{N}$ is the normal distribution, and $\mathcal{L}_{MLE}$ is the maximum-likelihood estimation function.
In one possible implementation, when the difference between the real environment and the simulated environment is measured by a p-norm, the network loss function used by the environmental state transition model is implemented by:
$$L(\theta) \;=\; -\,\mathcal{L}_{MLE}\!\left(\hat{s}_{n+1}, \hat{r}_{n+1} \mid s_n, a_n\right) \;+\; \alpha\,\bigl\| \mu_{\theta_{old}} - \mu_{\theta} \bigr\|_{2}$$
where $s_n$ is state data sampled from the environment buffer pool, $a_n$ is action data sampled from the environment buffer pool, $\hat{s}_{n+1}$ is the next-moment state predicted by the environment state transition model given the state data and the action data, $\hat{r}_{n+1}$ is the return value corresponding to $\hat{s}_{n+1}$, $\alpha$ is a hyper-parameter, $\mu_{\theta_{old}}$ is the mean value of the environment state transition model at the last update, $\sigma_{\theta_{old}}$ is the variance corresponding to $\mu_{\theta_{old}}$, $\mu_{\theta}$ is the latest mean value of the environment state transition model, $\|\cdot\|_{2}$ is the 2-norm, and $\mathcal{L}_{MLE}$ is the maximum-likelihood estimation function.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. By modeling the interaction between the unmanned equipment and the external environment during travel control as a Markov decision process (MDP), the invention provides a high-accuracy environment state transition model that supplies high-precision simulated trajectory samples.
2. The invention solves the optimization problem of the unmanned equipment's travel control strategy by means of the Actor and Critic functions, and obtains a high-quality environment state transition model through a new loss-function optimization.
3. The invention adopts a model-based deep reinforcement learning method to construct an environment state transition model that simulates the interaction between the unmanned equipment and the external environment, thereby sharply reducing the number of interactions between the unmanned equipment and the real environment; the data generated by the environment state transition model effectively optimizes the travel control strategy of the unmanned equipment, making control of the unmanned equipment efficient.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. In the drawings:
FIG. 1 is a schematic diagram of the steps of an embodiment of the method of the present application;
fig. 2 is a schematic diagram of a network structure of an Actor in the embodiment of the present application;
FIG. 3 is a schematic diagram of a Critic network according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an environmental state transition model according to an embodiment of the present application.
Detailed Description
In order to make the purpose, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it should be understood that the drawings in the present application are for illustrative and descriptive purposes only and are not used to limit the scope of protection of the present application. Further, it should be understood that the schematic drawings are not drawn to scale. The flowcharts used in this application illustrate operations implemented according to some of the embodiments of the present application. It should be understood that the operations of the flow diagrams may be performed out of order, and steps without logical context may be performed in reverse order or simultaneously. In addition, one skilled in the art, under the guidance of the present disclosure, may add one or more other operations to, or remove one or more operations from, the flowchart.
In addition, the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, as presented in the figures, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
Please refer to FIG. 1, which is a flowchart of the model-based high-sample-rate deep reinforcement learning unmanned equipment control method according to an embodiment of the present disclosure; the method may specifically include the contents described in the following steps S1 to S5.
S1: controlling the unmanned equipment to use a strategy in the Actor-Critic strategy model to interact with a real environment to obtain track data, and storing the track data into an environment buffer pool;
S2: updating an environment state transition model through the data in the environment buffer pool;
S3: using the data in the environment buffer pool to carry out multi-step interactive trajectory prediction on the environment state transition model and the Actor-Critic strategy model to generate prediction data, and storing the prediction data in a model buffer pool;
S4: updating the Actor-Critic strategy model through data in the model buffer pool;
S5: continuously and iteratively updating the environment state transition model and the Actor-Critic strategy model until the current strategy performance meets the expected requirement.
In the prior art, the time cost of running and testing unmanned equipment in a real environment is very high, and problems such as equipment collisions easily occur when the unmanned equipment runs in a real environment for a long time, so the large number of samples required to train a model for unmanned-equipment operation control is difficult to guarantee. When the method and the device are implemented, an Actor-Critic strategy model and an environment state transition model are first constructed: the Actor-Critic strategy model provides the strategy by which the unmanned equipment operates in the environment, and the environment state transition model provides a simulated environment for the unmanned equipment to operate in. A large number of samples for training the unmanned-equipment operation control model can be generated through interaction between the Actor-Critic strategy model and the environment state transition model, thereby improving the accuracy of the trained model.
In the embodiment of the application, each iterative update of the model controls the unmanned equipment to operate for a short time in the real environment using the strategy in the Actor-Critic strategy model, and the generated trajectory data updates the environment state transition model. Multi-step interaction trajectory prediction is then performed with the environment state transition model and the Actor-Critic strategy model to generate a large amount of prediction data as training samples for updating the Actor-Critic strategy model, so that, after a certain number of iterations, the Actor-Critic strategy model becomes a mature model for providing the unmanned-equipment control strategy and subsequent tests can be performed. The method and the device thus provide a high-precision simulation environment for unmanned-equipment interaction; learning is carried out from real historical interaction trajectories, and the accuracy of the environment is directly influenced by the loss function of the environment state transition model during learning. The environment state transition model of the embodiment of the application corresponds to the simulation environment and provides safety and efficiency for the unmanned equipment's interaction.
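To make the iteration described above concrete, the following sketch outlines the loop in Python. It is only an illustrative outline under assumed interfaces: the names actor_critic, dynamics_model, env_pool, model_pool, rollout_fn, evaluate_fn and all hyper-parameter values are placeholders of this description, not identifiers defined by the patent.

```python
def train(env, actor_critic, dynamics_model, env_pool, model_pool,
          rollout_fn, evaluate_fn, target_return,
          n_iterations=1000, env_steps_per_iter=1000,
          policy_updates_per_iter=20, batch_size=256):
    """Hypothetical outer loop: S1 collect real data, S2 fit the dynamics
    model, S3 roll out predicted trajectories, S4 update the policy,
    S5 repeat until the policy is good enough."""
    state = env.reset()
    for _ in range(n_iterations):
        # S1: interact with the real environment using the current policy.
        for _ in range(env_steps_per_iter):
            action = actor_critic.sample_action(state)
            next_state, reward, done, _ = env.step(action)
            env_pool.add(state, action, reward, next_state)
            state = env.reset() if done else next_state

        # S2: update the environment state transition model on real data.
        dynamics_model.update(env_pool)

        # S3: predict multi-step interaction trajectories and store them.
        model_pool.add_many(rollout_fn(dynamics_model, actor_critic, env_pool))

        # S4: update the Actor-Critic strategy model on model-generated data.
        for _ in range(policy_updates_per_iter):
            actor_critic.update(model_pool.sample(batch_size))

        # S5: stop once the current strategy meets the expected performance.
        if evaluate_fn(env, actor_critic) >= target_return:
            break
```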
In one possible implementation manner, controlling the unmanned equipment to use the strategy in the Actor-Critic strategy model to interact with the real environment to acquire the trajectory data includes:
inputting current state data of the unmanned equipment in the real environment as first state data into an Actor network of the Actor-Critic policy model, and receiving an action value output by the Actor network as first action data; the first action data is obtained by the Actor network sampling from the multidimensional Gaussian distribution whose mean and variance are output by the last fully connected layer of the Actor network;
controlling the unmanned equipment to operate in the real environment by the first action data, and acquiring state data of the unmanned equipment at the next moment as second state data and a return value at the next moment as a first return value;
and taking the first state data, the first action data, the second state data and the first return value as the track data.
When the method is implemented, the Actor-Critic strategy model comprises an Actor network and a Critic network, each of which comprises fully connected layers and activation layers arranged in sequence. After the last fully connected layer of the Actor network outputs the mean and variance, the Actor network applies a tanh function to perform a nonlinear mapping on the samples drawn from the Gaussian distribution, ensuring that the final action values lie in a valid range. When the first state data is input into the Actor network, the Actor network generates first action data corresponding to the first state data; the first action data is the strategy for controlling the unmanned equipment to operate in the real environment. By controlling the unmanned equipment to operate in the real environment with this strategy, the state data of the unmanned equipment after executing the strategy, namely the second state data, can be obtained, and the return value generated after executing the strategy, namely the first return value, can also be obtained. In this way the first state data, the first action data, the second state data and the first return value form a sample pair of the unmanned equipment's operation in the real environment, which is stored in the environment buffer pool and used to update the simulated environment.
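A minimal sketch of one such real-environment interaction step is given below, assuming a PyTorch Actor whose last fully connected layer outputs the mean and log-standard-deviation of a multidimensional Gaussian, and a gym-style env.step interface; the log-standard-deviation parameterization and all names are illustrative assumptions, not elements fixed by the patent.

```python
import torch

def collect_transition(env, actor, env_pool, state):
    """One real-environment step: sample an action from the Actor's Gaussian
    head, squash it with tanh, execute it, and store (s, a, r, s') in the
    environment buffer pool."""
    state_t = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
    with torch.no_grad():
        mean, log_std = actor(state_t)           # mean and (log-)spread from the last FC layer
        dist = torch.distributions.Normal(mean, log_std.exp())
        raw_action = dist.sample()               # sample from the multidimensional Gaussian
        action = torch.tanh(raw_action)          # tanh keeps the action in a valid range
    action_np = action.squeeze(0).numpy()
    next_state, reward, done, _ = env.step(action_np)
    env_pool.add(state, action_np, reward, next_state)   # trajectory tuple (s1, a1, r1, s2)
    return next_state, done
```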
In a possible implementation manner, the updating the Actor-Critic policy model by the data in the model buffer pool includes:
sampling a fixed amount of third state data, second action data, second return values and fourth state data from the model buffer pool; the fourth state data is state data at the next moment of the third state data;
inputting the fourth state data into an Actor network of the Actor-Critic policy model to acquire third action data;
inputting the third state data and the second action data into a Critic network of the Actor-Critic policy model, and acquiring a first state action function output by the Critic network; the first state action function is the minimum value of the state action functions respectively output by at least two current Q networks in the Critic network;
inputting the third action data and the fourth state data into a target Q network in the Critic network to obtain a state action function at the next moment as a second state action function;
and calculating a first loss function through the second return value, the first state action function and the second state action function, and updating the Critic network.
In the embodiment of the application, the Critic network is divided into a target Q network and at least two current Q networks. When the action selected in the current state is evaluated, the current Q networks are used for the calculation, and the smaller of the resulting Q values is selected for updating the Actor network. The Actor network decides what action the unmanned equipment should take when it encounters different environments, and thus corresponds to the unmanned equipment itself; the Critic network evaluates the influence of the action selected by the Actor network, and thus corresponds to the evaluation. Together, the environment state transition model, the Actor policy network and the Critic network form a complete model-based high-sample-rate deep reinforcement learning algorithm. The unmanned equipment travels by driving the motors of its drive mechanisms, and the relevant parameters of each drive mechanism's motor are adjusted according to the model-based high-sample-rate deep reinforcement learning algorithm to cope with the different states the unmanned equipment faces.
In the embodiment of the present application, updating the Actor-Critic policy model corresponds to retraining the policy model after supplementing data, where the third state data, the second action data, the second return value and the fourth state data belong to one sample pair, that is, they correspond to one another. The first state action function is the evaluation output by the Critic network with the third state data and the second action data as input; the third state data and the second action data can be understood as the current state data and the current action data, so the first state action function is the evaluation of the current action data under the current state data. The second state action function is the evaluation output by the Critic network for the third action data under the fourth state data; the third action data is the action at the next moment relative to the current one and the fourth state data is the state at the next moment, so the second state action function is the evaluation of the next-moment action data under the next-moment state data. The second return value is the return value obtained for executing the second action data in the environment. The first loss function can then be calculated from the second return value, the first state action function and the second state action function in order to update the Critic network.
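The sketch below illustrates a Critic update of this kind: the two current Q networks are regressed onto a target built from the second return value and the target Q network's evaluation of the next-moment state and action. It is a hedged approximation under assumed interfaces (critic1, critic2, target_critic, the (s3, a2, r2, s4) batch layout), not the patent's exact first loss function.

```python
import torch
import torch.nn.functional as F

def critic_update(critic1, critic2, target_critic, actor, batch,
                  critic_optimizer, gamma=0.99):
    """One Critic update on a batch sampled from the model buffer pool.
    batch = (s3, a2, r2, s4): third state data, second action data,
    second return value and fourth (next-moment) state data."""
    s3, a2, r2, s4 = batch
    with torch.no_grad():
        # Third action data: the Actor's action for the next-moment state.
        mean, log_std = actor(s4)
        a3 = torch.tanh(torch.distributions.Normal(mean, log_std.exp()).sample())
        # Second state-action function: the target Q network's value of (s4, a3).
        q_next = target_critic(s4, a3)
        target = r2 + gamma * q_next
    # Both current Q networks are regressed onto the target; their minimum
    # is the first state-action function used later for the Actor update.
    q1, q2 = critic1(s3, a2), critic2(s3, a2)
    loss = F.mse_loss(q1, target) + F.mse_loss(q2, target)
    critic_optimizer.zero_grad()
    loss.backward()
    critic_optimizer.step()
    return loss.item()
```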
For example, please refer to FIG. 2, which illustrates the structure of the Actor network in this embodiment. The Actor network comprises 3 fully connected layers with ReLU activation layers arranged in sequence; each fully connected layer contains 256 neurons. Its input is the current or a given state, and its output is the action to be taken in that state.
For example, referring to FIG. 3, the Critic network in this embodiment comprises two current Q networks with identical structures, each consisting of 3 fully connected layers with ReLU activation layers arranged in sequence; each fully connected layer contains 256 neurons. The input is a given state and its corresponding action, and the output is an evaluation value for this state-action pair.
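For illustration, the two structures described for FIG. 2 and FIG. 3 might be written in PyTorch as follows; the separate mean/log-variance heads, the clamping range and the final scalar output layer are assumptions beyond what the figures state.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Sketch of the Actor of FIG. 2: three 256-unit fully connected layers
    with ReLU, ending in a Gaussian head over actions."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean_head = nn.Linear(hidden, action_dim)
        self.log_std_head = nn.Linear(hidden, action_dim)

    def forward(self, state):
        h = self.body(state)
        return self.mean_head(h), self.log_std_head(h).clamp(-20, 2)

class QNetwork(nn.Module):
    """Sketch of one current Q network of FIG. 3: the state and action are
    concatenated and passed through three 256-unit ReLU layers to a scalar
    evaluation value."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```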
In a possible implementation manner, updating the Actor-Critic policy model through the data in the model buffer pool further includes:
calculating a second loss function through a third state action function and an action entropy value and updating the Actor network; the third state action function is the evaluation value given by the Critic network when an action value is selected by the Actor network policy under the corresponding state data; the action entropy value is the entropy of the action value selected by the Actor network policy.
In the embodiment of the application, the loss function of the Actor network is composed of two parts: the first part is the evaluation value given by the Critic network when a certain action is selected by the strategy in a certain state, namely the state action value function, and the second part is the entropy of the action selected by the strategy, so that the Actor network can be updated more accurately.
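A possible form of this two-part Actor loss, written in the style of entropy-regularized actor-critic methods, is sketched below; the entropy weight alpha and the tanh log-probability correction are assumptions, not values given in the text.

```python
import torch

def actor_update(actor, critic1, critic2, states, actor_optimizer, alpha=0.2):
    """One Actor update: the loss combines (i) the Critic's evaluation of the
    action chosen by the current policy (the state action value function) and
    (ii) the entropy of that action choice."""
    mean, log_std = actor(states)
    dist = torch.distributions.Normal(mean, log_std.exp())
    raw_action = dist.rsample()                       # reparameterized sample
    action = torch.tanh(raw_action)
    # Log-probability with the tanh change-of-variables correction.
    log_prob = dist.log_prob(raw_action) - torch.log(1 - action.pow(2) + 1e-6)
    log_prob = log_prob.sum(dim=-1, keepdim=True)
    q_value = torch.min(critic1(states, action), critic2(states, action))
    loss = (alpha * log_prob - q_value).mean()        # maximize Q value and entropy
    actor_optimizer.zero_grad()
    loss.backward()
    actor_optimizer.step()
    return loss.item()
```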
In a possible implementation manner, the generating prediction data by performing multi-step interaction trajectory prediction on the environmental state transition model and the Actor-Critic policy model by using the data in the environmental buffer pool includes:
randomly sampling a preset number of fifth state data from the environment buffer pool;
inputting the fifth state data into an Actor network of the Actor-Critic policy model, and acquiring fourth action data output by the Actor network;
inputting the fifth state data and the fourth action data into the environmental state transition model to obtain state data at the next moment as sixth state data, and taking a return value at the next moment as a third return value;
taking the sixth state data as fifth state data and cyclically acquiring state data and return values until a preset condition is met;
and randomly selecting a preset amount of data generated in the cyclic process as the prediction data.
In the embodiment of the application, after the fifth state data is input into the Actor network as the current state data, the current decision action generated for the current state data, namely the fourth action data, can be acquired. Since the environment state transition model simulates the environment, after the fifth state data and the fourth action data are input into the environment state transition model, they can be run in the simulated environment to generate the state data at the next moment, namely the sixth state data, and the return value generated by the fourth action data, namely the third return value. In the embodiment of the present application, multi-step interaction trajectory prediction means that the newly obtained sixth state data is input into the Actor network as the current state data, thereby starting a loop; each loop iteration generates a sample pair comprising the current state, the current action, the return value and the next-moment state. The preset condition for the loop may be reaching a predetermined number of iterations, or a threshold may be set for some variable, which is not limited in this embodiment of the application.
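The multi-step interaction trajectory prediction loop might look as follows, where the preset condition is taken to be a fixed rollout horizon; dynamics_model.predict, env_pool.sample_states, the batch size and the horizon are illustrative assumptions rather than interfaces defined by the patent.

```python
import torch

def rollout_with_model(dynamics_model, actor, env_pool, model_pool,
                       batch_size=400, horizon=5):
    """Start from real states sampled from the environment buffer pool and
    roll the policy forward inside the learned model, storing every predicted
    transition in the model buffer pool."""
    states = env_pool.sample_states(batch_size)       # fifth state data
    for _ in range(horizon):                          # loop until the preset condition (here: fixed horizon)
        with torch.no_grad():
            mean, log_std = actor(states)
            actions = torch.tanh(
                torch.distributions.Normal(mean, log_std.exp()).sample())     # fourth action data
            next_states, rewards = dynamics_model.predict(states, actions)    # sixth state data, third return value
        model_pool.add(states, actions, rewards, next_states)
        states = next_states                          # the sixth state data becomes the new fifth state data
```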
For example, please refer to FIG. 4. The environment state transition model of this embodiment is an ensemble model composed of a plurality of neural network models; each neural network model has 4 fully connected layers with 3 activation functions arranged in sequence, each fully connected layer has 300 hidden units, and the activation function is swish. The input of the model is a randomly given state and the action taken in that state, and the output is the state at the next moment and the return value at that moment.
In one possible implementation, the environmental state transition model includes a plurality of mutually independent submodels; each submodel is trained through the same neural network model and sample, and the initial value of each submodel in training is different.
When the method and the device are implemented, a plurality of neural network models are combined into an integrated environment state transition model, which has the ability to capture the uncertainty in the dynamic transitions of the real environment. The mean and variance output by each independent sub-model differ, so more trajectory data covering different scenes can be generated when the model interacts with the unmanned equipment, which benefits the learning of the policy network. The loss function of the environment state transition model is a dedicated design. The model is an ensemble, and each member comprises fully connected layers and activation layers arranged in sequence and predicts the state and the return value at the next moment. The mean and variance of the next-moment state and return value output by the last fully connected layer of the environment state transition model follow a Gaussian distribution, which increases the adaptability of the model to complex environments.
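An illustrative PyTorch rendering of such an ensemble is given below, matching the 4 fully connected layers of 300 units with swish (SiLU) activations described for FIG. 4; the number of ensemble members and the combined next-state/return output head are assumptions.

```python
import torch
import torch.nn as nn

class GaussianDynamics(nn.Module):
    """One sub-model of the ensemble: four fully connected layers of 300 units
    with swish activations, outputting the mean and log-variance of a Gaussian
    over the next state and the return value."""
    def __init__(self, state_dim, action_dim, hidden=300):
        super().__init__()
        out_dim = state_dim + 1                        # next state plus return value
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 2 * out_dim),            # mean and log-variance
        )

    def forward(self, state, action):
        mean, log_var = self.net(torch.cat([state, action], dim=-1)).chunk(2, dim=-1)
        return mean, log_var

class EnsembleDynamics(nn.Module):
    """Several independently initialized sub-models trained on the same data;
    their differing Gaussians capture the uncertainty of the real dynamics."""
    def __init__(self, state_dim, action_dim, n_models=7):
        super().__init__()
        self.members = nn.ModuleList(
            [GaussianDynamics(state_dim, action_dim) for _ in range(n_models)])

    def forward(self, state, action):
        return [m(state, action) for m in self.members]
```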
In one possible implementation, the environmental state transition model includes a third loss function and a fourth loss function; the third loss function is generated based on how the current state data and the action data are transferred to the state data at the next moment and the generated return value; the fourth loss function is generated based on a difference between the real environment and a simulated environment produced by the environmental state transition model.
In the embodiment of the application, the environment state transition model uses a specially designed network loss function derived from theoretical analysis. The design of the loss function considers not only how the current state and action are transferred to the state and return value at the next moment, but also the difference between the real environment and the simulated environment; a loss function designed with these two parts is able to capture the uncertainty of the environment.
In one possible implementation, the fourth loss function is generated based on the following equation:
$$\eta[\pi] \;\geq\; \hat{\eta}[\pi] \;-\; \frac{2\gamma\,|r|_{\max}}{(1-\gamma)^{2}}\,\epsilon_m$$
where $\eta[\pi]$ is the performance of the unmanned equipment's policy on the real environment, $\hat{\eta}[\pi]$ is the performance of the same policy on the simulated environment, $s$ is sampled state data, $a$ is sampled action data, $\gamma$ is the discount factor in reinforcement learning, $|r|_{\max}$ is the maximum absolute value of the return given by the environment, and $\epsilon_m$ is the difference between the real environment and the simulated environment;
wherein $\epsilon_m$ is calculated according to the following formula:
$$\epsilon_m \;=\; \mathbb{E}_{(s,a)}\!\left[\, D_{TV}\!\left( p(s' \mid s,a) \,\|\, \hat{p}(s' \mid s,a) \right) \right]$$
where $p(s' \mid s,a)$ is the state transition over state and action data in the real environment, $\hat{p}(s' \mid s,a)$ is the state transition over state and action data in the simulated environment, $D_{TV}$ is the TV (total variation) distance, and $s'$ is the state at the next moment.
In the embodiment of the application, the environment state transition model provides a performance lower bound for the unmanned-equipment control strategy in the real environment. The lower bound is established on the simulated environment, still holds under model-based policy iteration, and thereby guarantees monotonic convergence of model-based policy iteration; it is realized by the following formula:
$$\eta[\pi] \;\geq\; \hat{\eta}[\pi] \;-\; \frac{2\gamma\,|r|_{\max}}{(1-\gamma)^{2}}\,\epsilon_m$$
It is worth noting that this performance lower bound characterizes the optimization of performance in the real environment independently of the agent's own policy.
In one possible implementation, when the difference between the real environment and the simulated environment is measured by KL divergence, the network loss function used by the environmental state transition model is implemented by the following equation:
$$L(\theta) \;=\; -\,\mathcal{L}_{MLE}\!\left(\hat{s}_{n+1}, \hat{r}_{n+1} \mid s_n, a_n\right) \;+\; \alpha\, D_{kl}\!\left( \mathcal{N}\!\left(\mu_{\theta_{old}}, \sigma_{\theta_{old}}\right) \,\middle\|\, \mathcal{N}\!\left(\mu_{\theta}, \sigma_{\theta}\right) \right)$$
where $s_n$ is state data sampled from the environment buffer pool, $a_n$ is action data sampled from the environment buffer pool, $\hat{s}_{n+1}$ is the next-moment state predicted by the environment state transition model given the state data and the action data, $\hat{r}_{n+1}$ is the return value corresponding to $\hat{s}_{n+1}$, $\alpha$ is a hyper-parameter, $\mu_{\theta_{old}}$ is the mean value of the environment state transition model at the last update, $\sigma_{\theta_{old}}$ is the variance corresponding to $\mu_{\theta_{old}}$, $\mu_{\theta}$ is the latest mean value of the environment state transition model, $\sigma_{\theta}$ is the variance corresponding to $\mu_{\theta}$, $D_{kl}$ is the KL divergence, $\mathcal{N}$ is the normal distribution, and $\mathcal{L}_{MLE}$ is the maximum-likelihood estimation function.
In the embodiment of the application, for control of the unmanned robot, the inventors have found that when the difference between the real environment and the simulated environment is measured by the KL divergence and the network is trained with the interaction data between the unmanned robot and the real environment, training can be stopped once the training error falls below a threshold, and a good effect is obtained. Here $\hat{r}_{n+1}$ is the return value corresponding to $\hat{s}_{n+1}$, i.e. the return obtained in the process of executing $a_n$ in state $s_n$ and transitioning to state $\hat{s}_{n+1}$; $\mathcal{L}_{MLE}$ predicts the state and return value at the next moment given a state and an action, i.e. given the state $s_n$ and action $a_n$, it predicts the next-moment state $\hat{s}_{n+1}$ and return value $\hat{r}_{n+1}$. When the environment state transition model is updated, the next-moment state and return value of the real environment are sampled from the environment buffer pool and used as the update target of the environment state transition model. The network loss function ensures that the environment state transition model can fit the trajectory data of the real environment.
Specifically, when the difference between the real environment and the simulated environment is measured using the KL divergence, it can be characterized as:
$$\epsilon_m \;=\; \mathbb{E}_{(s,a)}\!\left[\, D_{kl}\!\left( p(\cdot \mid s,a) \,\|\, \hat{p}_{\theta}(\cdot \mid s,a) \right) \right]$$
and maximizing the performance of the unmanned robot's strategy on the real environment becomes an unconstrained optimization problem, namely:
$$\max_{\theta}\; \hat{\eta}[\pi] \;-\; \alpha\,\mathbb{E}_{(s,a)}\!\left[\, D_{kl}\!\left( p(\cdot \mid s,a) \,\|\, \hat{p}_{\theta}(\cdot \mid s,a) \right) \right]$$
where $\alpha$ is an adjustable hyper-parameter, from which the network loss function used by the environment state transition model is further derived.
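A possible implementation of a model-update loss of this kind, assuming diagonal-Gaussian model outputs, is sketched below: the Gaussian negative log-likelihood plays the role of the maximum-likelihood term, and a KL term between the previous and current predictive Gaussians plays the role of the regularizer weighted by alpha. The function name and the default weight are assumptions.

```python
import torch

def model_loss_kl(mean, log_var, old_mean, old_log_var, target, alpha=0.01):
    """Loss for the environment state transition model when the gap between
    the real and simulated environment is measured with a KL divergence:
    negative log-likelihood of the real (next state, return) target plus
    alpha times KL(N(old_mean, old_var) || N(mean, var))."""
    var = log_var.exp()
    # Maximum-likelihood term on real transitions (up to an additive constant).
    nll = 0.5 * (((target - mean) ** 2) / var + log_var).sum(dim=-1).mean()
    # KL divergence between diagonal Gaussians, old model || new model.
    old_var = old_log_var.exp()
    kl = 0.5 * ((old_var / var) + ((mean - old_mean) ** 2) / var
                - 1.0 + log_var - old_log_var).sum(dim=-1).mean()
    return nll + alpha * kl
```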
In one possible implementation, when the difference between the real environment and the simulated environment is measured by a p-norm, the network loss function used by the environmental state transition model is implemented by:
$$L(\theta) \;=\; -\,\mathcal{L}_{MLE}\!\left(\hat{s}_{n+1}, \hat{r}_{n+1} \mid s_n, a_n\right) \;+\; \alpha\,\bigl\| \mu_{\theta_{old}} - \mu_{\theta} \bigr\|_{2}$$
where $s_n$ is state data sampled from the environment buffer pool, $a_n$ is action data sampled from the environment buffer pool, $\hat{s}_{n+1}$ is the next-moment state predicted by the environment state transition model given the state data and the action data, $\hat{r}_{n+1}$ is the return value corresponding to $\hat{s}_{n+1}$, $\alpha$ is a hyper-parameter, $\mu_{\theta_{old}}$ is the mean value of the environment state transition model at the last update, $\sigma_{\theta_{old}}$ is the variance corresponding to $\mu_{\theta_{old}}$, $\mu_{\theta}$ is the latest mean value of the environment state transition model, $\|\cdot\|_{2}$ is the 2-norm, and $\mathcal{L}_{MLE}$ is the maximum-likelihood estimation function.
When the embodiment of the application is implemented, for unmanned aerial vehicle control, the inventors have found that when the difference between the real environment and the simulated environment is measured by the p-norm and the network is trained with the interaction data between the unmanned aerial vehicle and the real environment, training can be stopped once the training error falls below a threshold, and a good effect is obtained. Here $\hat{r}_{n+1}$ is the return value corresponding to $\hat{s}_{n+1}$, i.e. the return obtained in the process of executing $a_n$ in state $s_n$ and transitioning to state $\hat{s}_{n+1}$; $\mathcal{L}_{MLE}$ predicts the state and return value at the next moment given a state and an action, i.e. given the state $s_n$ and action $a_n$, it predicts the next-moment state $\hat{s}_{n+1}$ and return value $\hat{r}_{n+1}$. When the environment state transition model is updated, the next-moment state and return value of the real environment are sampled from the environment buffer pool and used as the update target of the environment state transition model. The network loss function ensures that the environment state transition model can fit the trajectory data of the real environment.
Specifically, the difference between the real environment and the simulated environment can be converted, by means of Lipschitz continuity, into a p-norm; that is, a link between the dynamic transition model of the environment and the p-norm is constructed, characterized as:
$$D_{TV}\!\left( p(\cdot \mid s,a) \,\|\, \hat{p}_{\theta}(\cdot \mid s,a) \right) \;\leq\; L\,\bigl\| p(\cdot \mid s,a) - \hat{p}_{\theta}(\cdot \mid s,a) \bigr\|_{p}$$
where $p(\cdot \mid s,a)$ denotes the dynamic transition model of the real environment, $\hat{p}_{\theta}(\cdot \mid s,a)$ denotes the dynamic transition model of the simulated environment, and $L$ is a Lipschitz constant.
When the difference between the real environment and the simulated environment is measured using the p-norm, the corresponding formula is:
$$\epsilon_m \;=\; \mathbb{E}_{(s,a)}\!\left[\, \bigl\| p(\cdot \mid s,a) - \hat{p}_{\theta}(\cdot \mid s,a) \bigr\|_{p} \right]$$
Maximizing the performance of the unmanned aerial vehicle's strategy on the real environment then becomes an unconstrained optimization problem; the optimization formula when the p-norm is used for measurement is:
$$\max_{\theta}\; \hat{\eta}[\pi] \;-\; \alpha\,\mathbb{E}_{(s,a)}\!\left[\, \bigl\| p(\cdot \mid s,a) - \hat{p}_{\theta}(\cdot \mid s,a) \bigr\|_{p} \right]$$
where $\alpha$ is an adjustable hyper-parameter, from which the network loss function used by the environment state transition model is derived.
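For comparison, a corresponding sketch when the gap is measured with a norm instead of a KL divergence follows; here the regularizer is the 2-norm between the previous and current predicted means, and the weight alpha is again an assumed value.

```python
import torch

def model_loss_pnorm(mean, log_var, old_mean, target, alpha=0.01):
    """Loss for the environment state transition model when the gap between
    the real and simulated environment is measured with a p-norm (p = 2):
    Gaussian negative log-likelihood plus alpha times the 2-norm between the
    previous and current predicted means."""
    var = log_var.exp()
    nll = 0.5 * (((target - mean) ** 2) / var + log_var).sum(dim=-1).mean()
    reg = torch.linalg.vector_norm(old_mean - mean, ord=2, dim=-1).mean()
    return nll + alpha * reg
```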
Those of ordinary skill in the art will appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be embodied in electronic hardware, computer software, or combinations of both, and that the components and steps of the examples have been described in a functional general in the foregoing description for the purpose of illustrating clearly the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electric, mechanical or other form of connection.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention essentially or partially contributes to the prior art, or all or part of the technical solution can be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a grid device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk, and various media capable of storing program codes.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. The model-based unmanned equipment control method for high-sample-rate deep reinforcement learning is characterized by comprising the following steps of:
controlling the unmanned equipment to use a strategy in an Actor-Critic strategy model to interact with a real environment to obtain track data, and storing the track data into an environment buffer pool;
updating an environment state transition model through the data in the environment buffer pool;
performing multi-step interactive trajectory prediction on the environmental state transition model and the Actor-Critic strategy model by using the data in the environmental buffer pool to generate prediction data, and storing the prediction data into a model buffer pool;
updating the Actor-Critic strategy model through data in the model buffer pool;
and continuously and iteratively updating the environment state transition model and the Actor-Critic strategy model until the current strategy performance meets the expected requirement.
2. The model-based unmanned aerial vehicle control method for high-sample-rate deep reinforcement learning according to claim 1, wherein controlling the unmanned aerial vehicle to interact with the real environment by using the strategy in the Actor-Critic strategy model to acquire trajectory data comprises:
inputting current state data of the unmanned equipment in the real environment as first state data into an Actor network of the Actor-Critic policy model, and receiving an action value output by the Actor network as first action data; the first action data is obtained by the Actor network sampling from the multidimensional Gaussian distribution whose mean and variance are output by the last fully connected layer of the Actor network;
controlling the unmanned equipment to operate in the real environment by the first action data, and acquiring state data of the unmanned equipment at the next moment as second state data and a return value at the next moment as a first return value;
and taking the first state data, the first action data, the second state data and the first return value as the track data.
3. The model-based unmanned device control method for high-sample-rate deep reinforcement learning according to claim 1, wherein updating the Actor-Critic policy model with data in the model buffer pool comprises:
sampling a fixed amount of third state data, second action data, second return values and fourth state data from the model buffer pool; the fourth state data is state data at the next moment of the third state data;
inputting the fourth state data into an Actor network of the Actor-Critic policy model to acquire third action data;
inputting the third state data and the second action data into a Critic network of the Actor-Critic policy model, and acquiring a first state action function output by the Critic network; the first state action function is the minimum value of the state action functions respectively output by at least two current Q networks in the Critic network;
inputting the third action data and the fourth state data into a target Q network in the Critic network to obtain a state action function at the next moment as a second state action function;
and calculating a first loss function through the second return value, the first state action function and the second state action function, and updating the Critic network.
4. The model-based unmanned aerial vehicle control method for high sample rate deep reinforcement learning according to claim 3, wherein updating the Actor-Critic policy model with data in the model buffer pool further comprises:
calculating a second loss function through a third state action function and an action entropy value and updating the Actor network; the third state action function is the evaluation value given by the Critic network when an action value is selected by the Actor network policy under the corresponding state data; the action entropy value is the entropy of the action value selected by the Actor network policy.
5. The model-based unmanned aerial vehicle control method for high-sample-rate deep reinforcement learning according to claim 1, wherein performing multi-step interaction trajectory prediction on the environmental state transition model and the Actor-Critic policy model using the data in the environmental buffer pool to generate prediction data comprises:
randomly sampling a preset number of fifth state data from the environment buffer pool;
inputting the fifth state data into an Actor network of the Actor-Critic policy model, and acquiring fourth action data output by the Actor network;
inputting the fifth state data and the fourth action data into the environment state transition model to obtain state data at the next moment as sixth state data, and taking a return value at the next moment as a third return value;
taking the sixth state data as fifth state data and cyclically acquiring state data and return values until a preset condition is met;
and randomly selecting a preset amount of data generated in the cyclic process as the prediction data.
6. The model-based unmanned aerial vehicle control method for high sample rate deep reinforcement learning of claim 1, wherein the environmental state transition model comprises a plurality of mutually independent sub-models; each sub-model is trained through the same neural network model and sample, and the initial value of each sub-model in training is different.
7. The model-based unmanned aerial vehicle control method for high sample rate deep reinforcement learning of claim 1, wherein the environmental state transition model comprises a third loss function and a fourth loss function; the third loss function is generated based on how the current state data and the action data are transferred to the state data at the next moment and the generated return value; the fourth loss function is generated based on a difference between the real environment and a simulated environment produced by the environmental state transition model.
8. The model-based unmanned equipment control method for high-sample-rate deep reinforcement learning according to claim 7, wherein the fourth loss function is generated based on the following equation:

$$\left| \eta(\pi) - \hat{\eta}(\pi) \right| \le \frac{2\gamma\,|r|_{\max}}{(1-\gamma)^{2}}\,\epsilon_{m}$$

where $\eta(\pi)$ is the performance of the policy of the unmanned equipment on the real environment, $\hat{\eta}(\pi)$ is the performance of the same policy on the simulated environment, $\gamma$ is the discount factor in reinforcement learning, $|r|_{\max}$ is the maximum absolute value of the return value given by the environment, and $\epsilon_{m}$ is the difference between the real environment and the simulated environment;

wherein $\epsilon_{m}$ is calculated according to the following formula:

$$\epsilon_{m} = \max_{s,a} D_{TV}\big(p(s' \mid s, a)\,\|\,\hat{p}(s' \mid s, a)\big)$$

where $p$ is the state transition distribution over the state data and action data in the real environment, $\hat{p}$ is the state transition distribution over the state data and action data in the simulated environment, $D_{TV}$ is the TV distance, $s'$ is the state at the next moment, $s$ is the sampled state data, and $a$ is the sampled action data.
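Assuming the simulation-lemma-style bound reconstructed above, the snippet below shows how quickly the admissible return gap grows with the discount factor; the constant $2\gamma|r|_{\max}/(1-\gamma)^{2}$ is part of that assumption.

```python
def return_gap_bound(gamma, r_max, eps_model):
    """Upper bound on |eta(pi) - eta_hat(pi)| under the assumed simulation-lemma form."""
    return 2.0 * gamma * r_max * eps_model / (1.0 - gamma) ** 2

# Example: gamma = 0.99, |r|_max = 1, model error eps_model = 0.01 (TV distance)
print(return_gap_bound(0.99, 1.0, 0.01))  # -> 198.0: small model error still matters at long horizons
```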
9. The model-based unmanned equipment control method for high-sample-rate deep reinforcement learning according to claim 8, wherein, when the difference between the real environment and the simulated environment is measured by the KL divergence, the network loss function used by the environmental state transition model is implemented by the following equation:

$$\mathcal{L}(\theta) = -\,\mathcal{L}_{MLE}\big(\hat{s}_{n+1}, \hat{r}_{n+1} \mid s_{n}, a_{n}\big) + \alpha\, D_{kl}\big(\mathcal{N}(\mu_{\bar{\theta}}, \sigma_{\bar{\theta}})\,\|\,\mathcal{N}(\mu_{\theta}, \sigma_{\theta})\big)$$

where $s_{n}$ is the state data sampled from the environment buffer pool, $a_{n}$ is the action data sampled from the environment buffer pool, $\hat{s}_{n+1}$ is the next-moment state predicted by the environmental state transition model given the state data and the action data, $\hat{r}_{n+1}$ is the return value corresponding to $\hat{s}_{n+1}$, $\alpha$ is a hyper-parameter, $\mu_{\bar{\theta}}$ is the mean value of the environmental state transition model at the last update, $\sigma_{\bar{\theta}}$ is the variance corresponding to $\mu_{\bar{\theta}}$, $\mu_{\theta}$ is the latest mean value of the environmental state transition model, $\sigma_{\theta}$ is the variance corresponding to $\mu_{\theta}$, $D_{kl}$ is the KL divergence, $\mathcal{N}$ is the normal distribution, and $\mathcal{L}_{MLE}$ is the maximum likelihood estimation function.
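One plausible reading of this loss is sketched below: a Gaussian negative log-likelihood term plus an α-weighted KL divergence between the previous and the latest predictive Gaussians. The tensor arguments, the direction of the KL term, and the function name are assumptions made only for illustration.

```python
import torch
from torch.distributions import Normal, kl_divergence

def model_loss_kl(mu_new, sigma_new, mu_old, sigma_old, target, alpha=0.01):
    """Model loss sketch for the KL-divergence variant (hypothetical reading of claim 9)."""
    new_dist = Normal(mu_new, sigma_new)           # latest N(mu_theta, sigma_theta)
    old_dist = Normal(mu_old, sigma_old)           # N(mu_theta_bar, sigma_theta_bar) from the last update
    nll = -new_dist.log_prob(target).mean()        # maximum-likelihood term on next state / return
    kl = kl_divergence(old_dist, new_dist).mean()  # penalise drift away from the previous model
    return nll + alpha * kl
```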
10. The model-based unmanned equipment control method for high-sample-rate deep reinforcement learning according to claim 8, wherein, when the difference between the real environment and the simulated environment is measured by the p-norm, the network loss function used by the environmental state transition model is implemented by the following formula:

$$\mathcal{L}(\theta) = -\,\mathcal{L}_{MLE}\big(\hat{s}_{n+1}, \hat{r}_{n+1} \mid s_{n}, a_{n}\big) + \alpha \left\| \frac{\mu_{\bar{\theta}} - \mu_{\theta}}{\sigma_{\bar{\theta}}} \right\|_{2}$$

where $s_{n}$ is the state data sampled from the environment buffer pool, $a_{n}$ is the action data sampled from the environment buffer pool, $\hat{s}_{n+1}$ is the next-moment state predicted by the environmental state transition model given the state data and the action data, $\hat{r}_{n+1}$ is the return value corresponding to $\hat{s}_{n+1}$, $\alpha$ is a hyper-parameter, $\mu_{\bar{\theta}}$ is the mean value of the environmental state transition model at the last update, $\sigma_{\bar{\theta}}$ is the variance corresponding to $\mu_{\bar{\theta}}$, $\mu_{\theta}$ is the latest mean value of the environmental state transition model, $\|\cdot\|_{2}$ is the 2-norm, and $\mathcal{L}_{MLE}$ is the maximum likelihood estimation function.
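A matching sketch for the 2-norm variant follows; scaling the mean drift by the previous standard deviation is an assumption made only so that $\sigma_{\bar{\theta}}$ from the claim appears in the computation, and all tensor arguments are hypothetical.

```python
import torch
from torch.distributions import Normal

def model_loss_norm(mu_new, sigma_new, mu_old, sigma_old, target, alpha=0.01):
    """Model loss sketch for the 2-norm variant (hypothetical reading of claim 10)."""
    nll = -Normal(mu_new, sigma_new).log_prob(target).mean()              # maximum-likelihood term
    drift = torch.norm((mu_old - mu_new) / sigma_old, p=2, dim=-1).mean() # 2-norm drift penalty
    return nll + alpha * drift
```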
CN202210963402.6A 2022-08-11 2022-08-11 Model-based unmanned equipment control method for high-sample-rate deep reinforcement learning Active CN115293334B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210963402.6A CN115293334B (en) 2022-08-11 2022-08-11 Model-based unmanned equipment control method for high-sample-rate deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210963402.6A CN115293334B (en) 2022-08-11 2022-08-11 Model-based unmanned equipment control method for high-sample-rate deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN115293334A true CN115293334A (en) 2022-11-04
CN115293334B CN115293334B (en) 2024-09-27

Family

ID=83827894

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210963402.6A Active CN115293334B (en) 2022-08-11 2022-08-11 Model-based unmanned equipment control method for high-sample-rate deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN115293334B (en)

Patent Citations (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106094813A (en) * 2016-05-26 2016-11-09 华南理工大学 It is correlated with based on model humanoid robot gait's control method of intensified learning
WO2018206504A1 (en) * 2017-05-10 2018-11-15 Telefonaktiebolaget Lm Ericsson (Publ) Pre-training system for self-learning agent in virtualized environment
DE102018216561A1 (en) * 2018-09-27 2020-04-02 Robert Bosch Gmbh Method, device and computer program for determining an agent's strategy
KR20200062887A (en) * 2018-11-27 2020-06-04 한국전자통신연구원 Apparatus and method for assuring quality of control operations of a system based on reinforcement learning.
KR20200105365A (en) * 2019-06-05 2020-09-07 아이덴티파이 주식회사 Method for reinforcement learning using virtual environment generated by deep learning
WO2021058626A1 (en) * 2019-09-25 2021-04-01 Deepmind Technologies Limited Controlling agents using causally correct environment models
CN110717600A (en) * 2019-09-30 2020-01-21 京东城市(北京)数字科技有限公司 Sample pool construction method and device, and algorithm training method and device
WO2021123235A1 (en) * 2019-12-19 2021-06-24 Secondmind Limited Reinforcement learning system
CN111460650A (en) * 2020-03-31 2020-07-28 北京航空航天大学 Unmanned aerial vehicle end-to-end control method based on deep reinforcement learning
CN111461347A (en) * 2020-04-02 2020-07-28 中国科学技术大学 Reinforced learning method for optimizing experience playback sampling strategy
CN111582441A (en) * 2020-04-16 2020-08-25 清华大学 High-efficiency value function iteration reinforcement learning method of shared cyclic neural network
WO2021208771A1 (en) * 2020-04-18 2021-10-21 华为技术有限公司 Reinforced learning method and device
CN112183288A (en) * 2020-09-22 2021-01-05 上海交通大学 Multi-agent reinforcement learning method based on model
CN112460741A (en) * 2020-11-23 2021-03-09 香港中文大学(深圳) Control method of building heating, ventilation and air conditioning system
CN112363402A (en) * 2020-12-21 2021-02-12 杭州未名信科科技有限公司 Gait training method and device of foot type robot based on model-related reinforcement learning, electronic equipment and medium
CN112766497A (en) * 2021-01-29 2021-05-07 北京字节跳动网络技术有限公司 Deep reinforcement learning model training method, device, medium and equipment
CN113281999A (en) * 2021-04-23 2021-08-20 南京大学 Unmanned aerial vehicle autonomous flight training method based on reinforcement learning and transfer learning
CN113359704A (en) * 2021-05-13 2021-09-07 浙江工业大学 Self-adaptive SAC-PID method suitable for complex unknown environment
CN113510704A (en) * 2021-06-25 2021-10-19 青岛博晟优控智能科技有限公司 Industrial mechanical arm motion planning method based on reinforcement learning algorithm
CN113419424A (en) * 2021-07-05 2021-09-21 清华大学深圳国际研究生院 Modeling reinforcement learning robot control method and system capable of reducing over-estimation
CN113485107A (en) * 2021-07-05 2021-10-08 清华大学深圳国际研究生院 Reinforcement learning robot control method and system based on consistency constraint modeling
CN113467515A (en) * 2021-07-22 2021-10-01 南京大学 Unmanned aerial vehicle flight control method based on virtual environment simulation reconstruction and reinforcement learning
CN113947022A (en) * 2021-10-20 2022-01-18 哈尔滨工业大学(深圳) Near-end strategy optimization method based on model
CN114186496A (en) * 2021-12-15 2022-03-15 中国科学技术大学 Method for improving continuous control stability of intelligent agent
CN114879486A (en) * 2022-02-28 2022-08-09 复旦大学 Robot optimization control method based on reinforcement learning and evolution algorithm
CN114800515A (en) * 2022-05-12 2022-07-29 四川大学 Robot assembly motion planning method based on demonstration track
CN114859921A (en) * 2022-05-12 2022-08-05 鹏城实验室 Automatic driving optimization method based on reinforcement learning and related equipment

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
AZULAY, OSHER ET AL.: "Wheel Loader Scooping Controller Using Deep Reinforcement Learning", IEEE ACCESS, vol. 9, 2 February 2020 (2020-02-02), pages 24145-24154, XP011836475, DOI: 10.1109/ACCESS.2021.3056625 *
JIE LENG ET AL.: "M-A3C: A Mean-Asynchronous Advantage Actor-Critic Reinforcement Learning Method for Real-Time Gait Planning of Biped Robot", IEEE ACCESS, vol. 10, 20 May 2021 (2021-05-20), pages 76523-76536 *
LIU QUAN ET AL.: "A Survey of Deep Reinforcement Learning" (深度强化学习综述), Chinese Journal of Computers (《计算机学报》), vol. 40, 31 December 2017 (2017-12-31), pages 1-28 *
YANG ZHIYOU: "Research on Optimization Strategies in Reinforcement Learning" (强化学习中的优化策略研究), China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库 信息科技辑》), no. 1, 15 January 2022 (2022-01-15), pages 140-539 *

Also Published As

Publication number Publication date
CN115293334B (en) 2024-09-27

Similar Documents

Publication Publication Date Title
Groshev et al. Learning generalized reactive policies using deep neural networks
Bianchi et al. Accelerating autonomous learning by using heuristic selection of actions
CN114139637B (en) Multi-agent information fusion method and device, electronic equipment and readable storage medium
CN110991027A (en) Robot simulation learning method based on virtual scene training
US20210158162A1 (en) Training reinforcement learning agents to learn farsighted behaviors by predicting in latent space
CN112119409A (en) Neural network with relational memory
CN112884131A (en) Deep reinforcement learning strategy optimization defense method and device based on simulation learning
CN114261400B (en) Automatic driving decision method, device, equipment and storage medium
US20220366246A1 (en) Controlling agents using causally correct environment models
Andersen et al. Active exploration for learning symbolic representations
CN117077727B (en) Track prediction method based on space-time attention mechanism and neural ordinary differential equation
Kojima et al. To learn or not to learn: Analyzing the role of learning for navigation in virtual environments
CN114239974B (en) Multi-agent position prediction method and device, electronic equipment and storage medium
CN116166642A (en) Spatio-temporal data filling method, system, equipment and medium based on guide information
CN113110101B (en) Production line mobile robot gathering type recovery and warehousing simulation method and system
Ge et al. Deep reinforcement learning navigation via decision transformer in autonomous driving
CN114219066A (en) Unsupervised reinforcement learning method and unsupervised reinforcement learning device based on Watherstein distance
CN115293334A (en) Model-based unmanned equipment control method for high sample rate deep reinforcement learning
CN117933055A (en) Equipment residual service life prediction method based on reinforcement learning integrated framework
Feng et al. Mobile robot obstacle avoidance based on deep reinforcement learning
CN116360435A (en) Training method and system for multi-agent collaborative strategy based on plot memory
Gross et al. Probabilistic model checking of stochastic reinforcement learning policies
Han et al. Three‐dimensional obstacle avoidance for UAV based on reinforcement learning and RealSense
Lauttia Adaptive Monte Carlo Localization in ROS
CN114970714B (en) Track prediction method and system considering uncertain behavior mode of moving target

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant