
CN112183288A - Multi-agent reinforcement learning method based on model - Google Patents

Multi-agent reinforcement learning method based on model Download PDF

Info

Publication number
CN112183288A
CN112183288A
Authority
CN
China
Prior art keywords
agent
model
reinforcement learning
environment
opponent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011002376.8A
Other languages
Chinese (zh)
Other versions
CN112183288B (en)
Inventor
张伟楠
王锡淮
沈键
周铭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202011002376.8A priority Critical patent/CN112183288B/en
Publication of CN112183288A publication Critical patent/CN112183288A/en
Application granted granted Critical
Publication of CN112183288B publication Critical patent/CN112183288B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G07 CHECKING-DEVICES
    • G07C TIME OR ATTENDANCE REGISTERS; REGISTERING OR INDICATING THE WORKING OF MACHINES; GENERATING RANDOM NUMBERS; VOTING OR LOTTERY APPARATUS; ARRANGEMENTS, SYSTEMS OR APPARATUS FOR CHECKING NOT PROVIDED FOR ELSEWHERE
    • G07C5/00 Registering or indicating the working of vehicles
    • G07C5/08 Registering or indicating performance data other than driving, working, idle, or waiting time, with or without registering driving, working, idle or waiting time
    • G07C5/0808 Diagnosing performance data

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a model-based multi-agent reinforcement learning method, which belongs to the field of multi-agent reinforcement learning and comprises the steps of modeling the multi-agent environment and the policies, generating virtual trajectories of the multiple agents, and updating the policies of the multiple agents using the virtual trajectories. In the invention, each agent makes decisions in a distributed manner, the multi-agent environment and the opponent agents' policies are modeled separately, and the obtained models are used to generate virtual trajectories. This effectively improves the sampling efficiency of multi-agent reinforcement learning, reduces the number of interactions of the agents, lowers the risk of equipment damage, and improves the feasibility of deploying distributed multi-agent reinforcement learning methods in multi-agent tasks.

Description

Multi-agent reinforcement learning method based on model
Technical Field
The invention relates to the field of multi-agent reinforcement learning methods, in particular to a model-based multi-agent reinforcement learning method.
Background
Reinforcement learning is a sub-field of machine learning whose goal is to choose decision-making actions based on received environmental information so as to maximize the expected return. Deep reinforcement learning uses neural networks to approximate the value function and the policy function, and has achieved performance exceeding the average human level on many tasks. In a multi-agent scenario, every agent is learning and improving simultaneously, which makes the environment non-stationary, and the relationship between agents may be competitive, cooperative, or somewhere in between. How and what information should be shared among agents also becomes a difficulty. Because of these problems introduced by the multi-agent setting, single-agent methods cannot be applied directly to multi-agent scenarios. As with single-agent algorithms, multi-agent reinforcement learning algorithms are divided into two categories: model-free and model-based. Among them, model-free multi-agent reinforcement learning algorithms face a more serious sample efficiency problem.
A model-based multi-agent reinforcement learning method aims to improve the sample efficiency of multi-agent reinforcement learning algorithms, that is, to reduce the number of interactions of the agents with the environment and the number of interactions between the agents. In general, reinforcement learning is often sample-inefficient when applied to concrete real-world tasks. In multi-agent reinforcement learning applications, the joint action space and joint state space of the agents reduce sample efficiency even further. When multi-agent reinforcement learning is used to train multi-vehicle autonomous driving, the vehicles usually need massive amounts of training to take reasonable actions in different scenarios, and during this training the vehicles constantly interact with the environment and with each other, so the possibility of vehicle damage is high. Using a model-based approach helps to reduce the training cost.
(I) Analysis of recent patents related to multi-agent reinforcement learning and model-based reinforcement learning:
1. The Chinese invention patent application No. 201811032979.5, "Path planning method based on multi-agent reinforcement learning", proposes a multi-agent path planning method for the aircraft domain. By establishing a global state division model of the aerial flight environment, it improves the survival rate and task completion rate of the aircraft. It mainly uses the environment model for planning, with little consideration of the interaction among agents;
2. The Chinese invention patent application No. 201911121271.1, "Learning method of cooperative agents based on multi-agent reinforcement learning", proposes a method for sharing target parameters among agents and modeling the global environment; the agents share the global model to improve the efficiency of the multi-agent algorithm. Similarly, this method lacks consideration of the interaction among agents.
(II) Analysis of recent research on model-based multi-agent reinforcement learning methods:
The paper "Multi-agent reinforcement learning with approximate model learning for competitive games", published in the journal PLoS One in 2019, uses modeling of the global environment as an auxiliary task to deepen the agents' understanding of the multi-agent environment. However, this work does not improve the sample efficiency of the algorithm.
The paper "Multi-agent reinforcement learning with multi-step generative models", published at the Conference on Robot Learning (CoRL) in 2019, uses a variational autoencoder to model the multi-agent environment and the policies of the opponent agents, directly predicts a segment of trajectory, and then selects the optimal trajectory with model predictive control. The method effectively improves sample efficiency, but the lack of an explicit policy function increases the decision-making cost, and the centralized training and decision-making make the algorithm difficult to deploy in practical applications.
Therefore, those skilled in the art are working on developing a model-based multi-agent reinforcement learning method, Multi-Agent Branched-Rollout Policy Optimization, that can achieve higher sample efficiency in any environment.
Disclosure of Invention
In view of the above drawbacks of the prior art, the technical problem to be solved by the present invention is how to reduce the number of interactions between the agents and the environment and between the agents themselves, while enabling distributed execution.
In order to achieve the above object, the present invention provides a model-based multi-agent reinforcement learning method, characterized in that, in a multi-agent environment, the multi-agent environment and the policies are modeled, virtual trajectories of the multiple agents are generated, and the policies of the multiple agents are updated using the virtual trajectories.
Further, the multiple agents make distributed decisions.
Further, for the current agent i, denote the set of opponent agents as {-i}. The action of the current agent i depends on the joint policy π_{-i} of the opponent agents and the current state s_t. Let the joint action of the opponent agents at time t be a_t^{-i}; the action of the current agent is then represented as a_t^i ~ π_i(·|s_t, a_t^{-i}), where π_i is the policy of the current agent.
Further, each of the multiple agents holds an independent multi-agent environment model p̂^i and a set of opponent policy models {π̂_j^i}, j ∈ {-i}.
Further, a method of dynamically selecting an opponent model is used when generating the virtual trajectory.
Further, for the current agent i, the model of each opponent's policy is represented as π̂_j^i, where j ∈ {-i}. The method of dynamically selecting the opponent model comprises two steps:
Step a: for each opponent policy model π̂_j^i, select a portion of the most recent real interaction data and compute the generalization error of the policy model, denoted ε_j;
Step b: given the virtual trajectory length K, for opponent agent j, use the model of the opponent agent's policy to generate the opponent agent's actions during the first n_j steps, and request from the opponent agent the actions taken under its real policy during the remaining K - n_j steps.
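A minimal Python sketch of these two steps follows. It assumes each opponent policy model exposes a predict(state) method and that recent real (state, action) pairs are stored per opponent; the particular rule used to turn ε_j into n_j is an illustrative assumption, since the description only states that n_j is derived from ε_j.

import numpy as np

def generalization_error(opponent_model, recent_pairs):
    """Mean prediction error of one opponent policy model on recent real
    (state, action) pairs observed for that opponent."""
    errors = [np.linalg.norm(opponent_model.predict(s) - a) for s, a in recent_pairs]
    return float(np.mean(errors)) if errors else float("inf")

def branch_lengths(opponent_models, recent_data, K, n_recent=1000):
    """Compute eps_j for every opponent j and turn it into a rollout length n_j <= K.

    recent_data[j] holds recent real (state, action_j) pairs for opponent j.
    The inverse scaling below is an illustrative assumption: the worse a policy
    model generalizes, the fewer rollout steps it is trusted for.
    """
    eps = {j: generalization_error(m, recent_data[j][-n_recent:])
           for j, m in opponent_models.items()}
    max_eps = max(eps.values()) + 1e-8
    n_j = {j: int(K * (1.0 - e / max_eps)) for j, e in eps.items()}
    return eps, n_j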
Further, the generation of the virtual trajectory comprises the following steps:
Step 1: initialize t = 0; the length of the virtual trajectory is K;
Step 2: select a state s_t from a real trajectory;
Step 3: obtain the joint action a^{-i} of the opponents in state s;
Step 4: obtain the action of the current agent using its policy function: a^i = π_i(s, a^{-i});
Step 5: use the model of the multi-agent environment to predict the state ŝ_{t+1} at the next time step and the reward r_t at the current time step;
Step 6: put (s_t, a^i, a^{-i}, s_{t+1}, r_t) into the experience replay pool D_model;
Step 7: let t = t + 1 and repeat from Step 3 until t > K.
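The rollout of Steps 1 through 7 can be sketched in Python as follows. The interfaces env_model.predict, policy_i.act, model.act and the callback real_opponent_action are assumed names used for illustration; they are not interfaces defined by the patent.

import random

def generate_virtual_trajectory(env_model, policy_i, opponent_models, n_j,
                                real_opponent_action, real_states, K, replay_pool):
    """Branched rollout of length K starting from a state sampled from real data.

    For opponent j, the learned policy model is used for the first n_j[j] steps
    and the real policy is queried afterwards (dynamic opponent-model selection).
    """
    s = random.choice(real_states)            # Step 2: branch point from a real trajectory
    for t in range(K):                        # Steps 3-7
        a_opp = {}
        for j, model in opponent_models.items():
            if t < n_j[j]:
                a_opp[j] = model.act(s)       # action from the opponent policy model
            else:
                a_opp[j] = real_opponent_action(j, s)  # action under the real policy
        a_i = policy_i.act(s, a_opp)          # Step 4: current agent's action
        s_next, r = env_model.predict(s, a_i, a_opp)    # Step 5: model prediction
        replay_pool.append((s, a_i, a_opp, s_next, r))  # Step 6: store in D_model
        s = s_next
    return replay_pool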
Further, the virtual trajectory is generated after the multi-agent environment and the opponent agents' policies have been modeled to a certain accuracy.
Further, when modeling the multi-agent environment and the opponent agents' policies, a Gaussian distribution is used to represent the output of each model; for the multi-agent environment, multiple models are built and the multi-agent environment model is used with an ensemble learning method. Let the number of environment models be B; then the set of environment models is {p̂_b^i}, where b ∈ {1, …, B}, and the opponent policy model is π̂_j^i, where j ∈ {-i}.
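A sketch of such a Gaussian ensemble in PyTorch is shown below; the network architecture, the Gaussian negative log-likelihood loss, and the choice of sampling a random ensemble member at prediction time are illustrative assumptions consistent with common practice, not details fixed by the patent.

import torch
import torch.nn as nn

class GaussianDynamicsModel(nn.Module):
    """One ensemble member: predicts mean and log-variance of (next state, reward)."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden, state_dim + 1)      # next state + reward
        self.log_var = nn.Linear(hidden, state_dim + 1)

    def forward(self, state, joint_action):
        h = self.net(torch.cat([state, joint_action], dim=-1))
        return self.mean(h), self.log_var(h)

    def loss(self, state, joint_action, target):
        """Gaussian negative log-likelihood, minimized by gradient descent."""
        mean, log_var = self(state, joint_action)
        inv_var = torch.exp(-log_var)
        return (((mean - target) ** 2) * inv_var + log_var).mean()

class EnsembleDynamics:
    """Set {p_b}, b = 1..B; a random member is used for each prediction."""
    def __init__(self, B, state_dim, action_dim):
        self.members = [GaussianDynamicsModel(state_dim, action_dim) for _ in range(B)]

    def predict(self, state, joint_action):
        model = self.members[torch.randint(len(self.members), (1,)).item()]
        mean, log_var = model(state, joint_action)
        sample = mean + torch.randn_like(mean) * torch.exp(0.5 * log_var)
        return sample[..., :-1], sample[..., -1]          # next state, reward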
Further, gradient descent is used for the updates when modeling the multi-agent environment and the opponent agents' policies.
Further, when updating the policy of the current agent, a Soft Actor-Critic algorithm is used: the critic part updates the Q function by minimizing the soft Bellman error, and the actor part updates the policy function by minimizing the reparameterized Soft Actor-Critic objective, where e_t is noise sampled from a Gaussian distribution and f is the reparameterization function.
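The following sketch shows one possible form of these two updates in PyTorch, with the Q function and the policy conditioned on the opponents' joint action as in the decision rule above. The exact objectives, the availability of the opponents' next joint action in the batch, and the policy.sample interface are assumptions for illustration rather than the patent's precise formulas.

import torch
import torch.nn.functional as F

def sac_update(batch, policy, q_net, q_target, q_opt, pi_opt, alpha=0.2, gamma=0.99):
    """One Soft Actor-Critic step for agent i; actions are conditioned on the
    opponents' joint action a_opp, consistent with a^i ~ pi_i(.|s, a^{-i})."""
    s, a_i, a_opp, s_next, r, a_opp_next = batch

    # Critic: regress Q(s, a_i, a_opp) onto the entropy-regularized Bellman target.
    with torch.no_grad():
        a_i_next, logp_next = policy.sample(s_next, a_opp_next)
        target = r + gamma * (q_target(s_next, a_i_next, a_opp_next) - alpha * logp_next)
    q_loss = F.mse_loss(q_net(s, a_i, a_opp), target)
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()

    # Actor: reparameterized sample a = f(e; s, a_opp); minimize alpha*log pi - Q.
    a_new, logp = policy.sample(s, a_opp)          # uses e ~ N(0, I) internally
    pi_loss = (alpha * logp - q_net(s, a_new, a_opp)).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
    return q_loss.item(), pi_loss.item()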
As training proceeds, the multi-agent environment model and the opponent agent policy models used in the invention become ever closer to the real environment and real policies, so the generated virtual trajectories become increasingly realistic. Using the generated virtual trajectories, the agent approximates the states reachable through interaction with real agents under real conditions, while also being able to explore states and interaction situations that are difficult to reach with real trajectories. The agent can therefore train effectively on the virtual trajectories, which reduces the chance of experiencing dangerous states and interactions in the real situation, lowers the risk of damage, and reduces the training cost. In summary, the multi-agent environment model and the opponent agent policy models allow the agent to be trained more comprehensively and with richer data.
The invention has the following technical effects:
1. In the invention, each agent can make its decisions independently; optionally, the effect can be further improved by allowing the agents to communicate.
2. The agents of the present invention are not limited to a particular action space (both discrete and continuous action spaces are supported), and the method can therefore be combined with any reinforcement learning algorithm, such as DQN, A3C, or PPO.
3. The agents of the present invention are not limited to a specific state space and can therefore be combined with different modeling methods.
The conception, specific structure and technical effects of the present invention will be further described below with reference to the accompanying drawings, so that the objects, features and effects of the present invention can be fully understood.
Drawings
FIG. 1 is a diagram of a training framework for the method of the present invention;
FIG. 2 is a flow chart of the method of the present invention.
Detailed Description
A preferred embodiment of the present invention will be described below with reference to the accompanying drawings for clarity and understanding of the technical contents thereof. The present invention may be embodied in many different forms of embodiments and the scope of the invention is not limited to the embodiments set forth herein.
The embodiment of the invention provides a model-based multi-agent reinforcement learning method and applies it in an autonomous-driving environment in which there are several vehicles, each with a different destination. The specific steps are as follows:
1. Define the observation space (i.e., the input space of the method) in the vehicle autonomous-driving scenario, which includes the position of the vehicle in a high-definition spatial semantic map, the positions of other vehicles, pedestrians and other individuals in that map, the planned driving trajectory, the distance and direction of surrounding obstacles sensed by the vehicle's sensors, and so on. Define the action space of the vehicle as acceleration, steering, braking, etc. Define the external reward obtainable by the vehicle to be determined by factors such as speed, route, collision and comfort;
2. For each vehicle, randomly initialize a policy function π, a Q-function network, a multi-agent environment model p̂, a set of policy models of the other vehicles {π̂_j}, a real trajectory database D_env and a virtual trajectory database D_model;
3. For each epoch:
(1) Each vehicle updates its multi-agent environment model p̂, where the state s consists of the observations of the vehicles; during training, each vehicle sends its own observation to the other vehicles.
4. For each time t:
(1) Each vehicle updates its models {π̂_j} of the other vehicles' policies;
(2) Each vehicle makes its decision independently, using its models of the other vehicles' policies, and the resulting real interaction data are added to the real trajectory database D_env;
(3) Each vehicle computes the model errors {ε_j} of its models of the other vehicles' policies and computes the length {n_j} for which each model should be used when generating virtual trajectories;
(4) Each vehicle uses the method of dynamically selecting the opponent model and generates virtual trajectories with its own multi-agent environment model, adding them to the virtual trajectory database D_model. During dynamic selection of the opponent model, when vehicle i needs the action of vehicle j under its real policy in state s_t: if the state was generated in the real environment, the observation of vehicle j is first computed from s_t; otherwise, the multi-agent environment model of vehicle i outputs the observation of vehicle j directly. After vehicle i obtains the observation o_j of vehicle j, it transmits o_j to vehicle j, and vehicle j then makes the decision a_j under its real policy and transmits it back to vehicle i (a sketch of this exchange is given at the end of this section).
5. Each vehicle updates its policy function and Q-value function using the data in the real trajectory database and the virtual trajectory database. The loss function of the Q-value function and the loss function of the policy function take the Soft Actor-Critic forms described above, where e_t is noise sampled from a Gaussian distribution and f is the reparameterization function.
In the multi-vehicle autonomous-driving scenario, the method can improve the sample efficiency of the multi-agent reinforcement learning algorithm and reduce the number of real actions taken by the vehicles during training. With only a model-free multi-agent reinforcement learning algorithm, each vehicle has to be trained extensively in the real environment, which carries a high risk of damage. Vehicles using the present method can interact virtually during training, reducing the number of actions in the real environment and thus the risk, while exploring the state and action spaces more comprehensively and learning a better policy under safer conditions.
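As an illustration of the observation exchange in step 4(4) above, the sketch below shows how vehicle i might obtain vehicle j's real-policy action. The object interfaces (observe, predict_observation, policy.act) are assumed names for illustration, not part of the patent.

def request_real_action(vehicle_i, vehicle_j, s_t, state_is_real):
    """Vehicle i asks vehicle j for the action it would take under its real policy.

    If s_t was produced in the real environment, vehicle j's observation is
    computed from the real state; otherwise vehicle i's multi-agent environment
    model outputs the observation directly. Vehicle i then sends o_j to vehicle j
    and receives back the decision a_j made under the real policy.
    """
    if state_is_real:
        o_j = vehicle_j.observe(s_t)
    else:
        o_j = vehicle_i.env_model.predict_observation(s_t, agent_id=vehicle_j.agent_id)
    a_j = vehicle_j.policy.act(o_j)
    return a_j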
The foregoing detailed description of the preferred embodiments of the invention has been presented. It should be understood that numerous modifications and variations could be devised by those skilled in the art in light of the present teachings without departing from the inventive concepts. Therefore, the technical solutions available to those skilled in the art through logic analysis, reasoning and limited experiments based on the prior art according to the concept of the present invention should be within the scope of protection defined by the claims.

Claims (10)

1. A model-based multi-agent reinforcement learning method, characterized in that, in a multi-agent environment, the multi-agent environment and the policies are modeled, virtual trajectories of the multiple agents are generated, and the policies of the multiple agents are updated using the virtual trajectories.
2. The model-based multi-agent reinforcement learning method of claim 1, wherein the multiple agents make distributed decisions.
3. The model-based multi-agent reinforcement learning method of claim 2, characterized in that, for a current agent i, the set of opponent agents is denoted {-i}; the action of the current agent i depends on the joint policy π_{-i} of the opponent agents and the current state s_t; the joint action of the opponent agents at time t is denoted a_t^{-i}, and the action of the current agent is represented as a_t^i ~ π_i(·|s_t, a_t^{-i}), wherein π_i is the policy of the current agent.
4. The model-based multi-agent reinforcement learning method of claim 3, wherein each of said multiple agents holds an independent multi-agent environment model p̂^i and a set of opponent policy models {π̂_j^i}, j ∈ {-i}.
5. The model-based multi-agent reinforcement learning method of claim 4, wherein a method of dynamically selecting an opponent model is used in generating the virtual trajectory.
6. The model-based multi-agent reinforcement learning method of claim 5, characterized in that, for the current agent i, the model of each opponent's policy is represented as π̂_j^i, wherein j ∈ {-i}, and the method of dynamically selecting the opponent model comprises two steps:
step a: for each opponent policy model π̂_j^i, selecting a portion of the most recent real interaction data and computing the generalization error of the policy model, denoted ε_j;
step b: given the virtual trajectory length K, for opponent agent j, using the model of the opponent agent's policy to generate the opponent agent's actions during the first n_j steps, and requesting from the opponent agent the actions taken under its real policy during the remaining K - n_j steps.
7. The model-based multi-agent reinforcement learning method of claim 6, wherein the generation of the virtual trajectory comprises the following steps:
step 1: initializing t = 0, the length of the virtual trajectory being K;
step 2: selecting a state s_t from a real trajectory;
step 3: obtaining the joint action a^{-i} of the opponents in state s;
step 4: obtaining the action of the current agent using its policy function: a^i = π_i(s, a^{-i});
step 5: using the model of the multi-agent environment to predict the state ŝ_{t+1} at the next time step and the reward r_t at the current time step;
step 6: putting (s_t, a^i, a^{-i}, s_{t+1}, r_t) into the experience replay pool D_model;
step 7: letting t = t + 1 and repeating from step 3 until t > K.
8. The model-based multi-agent reinforcement learning method of claim 7, wherein the virtual trajectory is generated after said multi-agent environment and the opponent agent policies have been modeled to a certain accuracy.
9. The model-based multi-agent reinforcement learning method of claim 8, wherein Gaussian distributions are used to represent the model outputs in modeling the multi-agent environment and the opponent agent policies, a plurality of models are built for the multi-agent environment, and the multi-agent environment model is used with an ensemble learning method; let the number of environment models be B, then the set of environment models is {p̂_b^i}, wherein b ∈ {1, …, B}, and the opponent policy model is π̂_j^i, wherein j ∈ {-i}.
10. The model-based multi-agent reinforcement learning method of claim 9, wherein gradient descent is used for the updates in modeling the multi-agent environment and the opponent agent policies.
CN202011002376.8A 2020-09-22 2020-09-22 Multi-agent reinforcement learning method based on model Active CN112183288B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011002376.8A CN112183288B (en) 2020-09-22 2020-09-22 Multi-agent reinforcement learning method based on model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011002376.8A CN112183288B (en) 2020-09-22 2020-09-22 Multi-agent reinforcement learning method based on model

Publications (2)

Publication Number Publication Date
CN112183288A true CN112183288A (en) 2021-01-05
CN112183288B CN112183288B (en) 2022-10-21

Family

ID=73955716

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011002376.8A Active CN112183288B (en) 2020-09-22 2020-09-22 Multi-agent reinforcement learning method based on model

Country Status (1)

Country Link
CN (1) CN112183288B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113239629A (en) * 2021-06-03 2021-08-10 上海交通大学 Method for reinforcement learning exploration and utilization of trajectory space determinant point process
CN113599832A (en) * 2021-07-20 2021-11-05 北京大学 Adversary modeling method, apparatus, device and storage medium based on environment model
CN114114911A (en) * 2021-11-12 2022-03-01 上海交通大学 Automatic hyper-parameter adjusting method based on model reinforcement learning
CN115293334A (en) * 2022-08-11 2022-11-04 电子科技大学 Model-based unmanned equipment control method for high sample rate deep reinforcement learning
CN116079747A (en) * 2023-03-29 2023-05-09 上海数字大脑科技研究院有限公司 Robot cross-body control method, system, computer equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110764507A (en) * 2019-11-07 2020-02-07 舒子宸 Artificial intelligence automatic driving system for reinforcement learning and information fusion
CN110852448A (en) * 2019-11-15 2020-02-28 中山大学 Cooperative intelligent agent learning method based on multi-intelligent agent reinforcement learning
US20200090074A1 (en) * 2018-09-14 2020-03-19 Honda Motor Co., Ltd. System and method for multi-agent reinforcement learning in a multi-agent environment
CN111324358A (en) * 2020-02-14 2020-06-23 南栖仙策(南京)科技有限公司 Training method for automatic operation and maintenance strategy of information system
CN111330279A (en) * 2020-02-24 2020-06-26 网易(杭州)网络有限公司 Strategy decision model training method and device for game AI
CN111639809A (en) * 2020-05-29 2020-09-08 华中科技大学 Multi-agent evacuation simulation method and system based on leaders and panic emotions

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200090074A1 (en) * 2018-09-14 2020-03-19 Honda Motor Co., Ltd. System and method for multi-agent reinforcement learning in a multi-agent environment
CN110764507A (en) * 2019-11-07 2020-02-07 舒子宸 Artificial intelligence automatic driving system for reinforcement learning and information fusion
CN110852448A (en) * 2019-11-15 2020-02-28 中山大学 Cooperative intelligent agent learning method based on multi-intelligent agent reinforcement learning
CN111324358A (en) * 2020-02-14 2020-06-23 南栖仙策(南京)科技有限公司 Training method for automatic operation and maintenance strategy of information system
CN111330279A (en) * 2020-02-24 2020-06-26 网易(杭州)网络有限公司 Strategy decision model training method and device for game AI
CN111639809A (en) * 2020-05-29 2020-09-08 华中科技大学 Multi-agent evacuation simulation method and system based on leaders and panic emotions

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
MICHAEL JANNER ET AL: "When to Trust Your Model: Model-Based Policy Optimization", arXiv:1906.08253v2 *
Center on Frontiers of Computing Studies, Peking University: "Highlights of the IJTCS 2020 Multi-Agent Reinforcement Learning Forum", HTTP://CFCS.PKU.EDU.CN/NEWS/239156.HTM *
WU FENG ET AL: "Research on planning problems for multi-agent systems based on decision theory", China Doctoral Dissertations Full-text Database, Information Science and Technology *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113239629A (en) * 2021-06-03 2021-08-10 上海交通大学 Method for reinforcement learning exploration and utilization of trajectory space determinant point process
CN113239629B (en) * 2021-06-03 2023-06-16 上海交通大学 Method for reinforcement learning exploration and utilization of trajectory space determinant point process
CN113599832A (en) * 2021-07-20 2021-11-05 北京大学 Adversary modeling method, apparatus, device and storage medium based on environment model
CN113599832B (en) * 2021-07-20 2023-05-16 北京大学 Opponent modeling method, device, equipment and storage medium based on environment model
CN114114911A (en) * 2021-11-12 2022-03-01 上海交通大学 Automatic hyper-parameter adjusting method based on model reinforcement learning
CN114114911B (en) * 2021-11-12 2024-04-30 上海交通大学 Automatic super-parameter adjusting method based on model reinforcement learning
CN115293334A (en) * 2022-08-11 2022-11-04 电子科技大学 Model-based unmanned equipment control method for high sample rate deep reinforcement learning
CN116079747A (en) * 2023-03-29 2023-05-09 上海数字大脑科技研究院有限公司 Robot cross-body control method, system, computer equipment and storage medium

Also Published As

Publication number Publication date
CN112183288B (en) 2022-10-21

Similar Documents

Publication Publication Date Title
CN112183288B (en) Multi-agent reinforcement learning method based on model
Sun et al. A fast integrated planning and control framework for autonomous driving via imitation learning
CN113110592B (en) Unmanned aerial vehicle obstacle avoidance and path planning method
CN110989576B (en) Target following and dynamic obstacle avoidance control method for differential slip steering vehicle
Chen et al. Stabilization approaches for reinforcement learning-based end-to-end autonomous driving
CN109726804B (en) Intelligent vehicle driving behavior personification decision-making method based on driving prediction field and BP neural network
Naveed et al. Trajectory planning for autonomous vehicles using hierarchical reinforcement learning
CN111580544B (en) Unmanned aerial vehicle target tracking control method based on reinforcement learning PPO algorithm
Grigorescu et al. Neurotrajectory: A neuroevolutionary approach to local state trajectory learning for autonomous vehicles
Bouton et al. Reinforcement learning with iterative reasoning for merging in dense traffic
Wang et al. Efficient reinforcement learning for autonomous driving with parameterized skills and priors
CN114013443A (en) Automatic driving vehicle lane change decision control method based on hierarchical reinforcement learning
CN116679719A (en) Unmanned vehicle self-adaptive path planning method based on dynamic window method and near-end strategy
CN113511222A (en) Scene self-adaptive vehicle interactive behavior decision and prediction method and device
Hou et al. Hybrid residual multiexpert reinforcement learning for spatial scheduling of high-density parking lots
Elallid et al. Deep Reinforcement Learning for Autonomous Vehicle Intersection Navigation
Regier et al. Improving navigation with the social force model by learning a neural network controller in pedestrian crowds
CN117553798A (en) Safe navigation method, equipment and medium for mobile robot in complex crowd scene
CN116027788A (en) Intelligent driving behavior decision method and equipment integrating complex network theory and part of observable Markov decision process
Coad et al. Safe trajectory planning using reinforcement learning for self driving
CN114386620B (en) Offline multi-agent reinforcement learning method based on action constraint
Li et al. DDPG-Based Path Planning Approach for Autonomous Driving
CN113353102B (en) Unprotected left-turn driving control method based on deep reinforcement learning
Lin et al. Connectivity guaranteed multi-robot navigation via deep reinforcement learning
CN116227622A (en) Multi-agent landmark coverage method and system based on deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant