CN112003269B - Intelligent on-line control method of grid-connected shared energy storage system - Google Patents
Intelligent on-line control method of grid-connected shared energy storage system
- Publication number
- CN112003269B · Application CN202010754472.1A
- Authority
- CN
- China
- Prior art keywords
- cbess
- network
- soc
- action
- grid
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H02—GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
- H02J—CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
- H02J3/00—Circuit arrangements for AC mains or AC distribution networks
- H02J3/008—Circuit arrangements for AC mains or AC distribution networks involving trading of energy or energy transmission rights
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
- G06F30/27—Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
-
- H—ELECTRICITY
- H02—GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
- H02J—CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
- H02J3/00—Circuit arrangements for AC mains or AC distribution networks
- H02J3/28—Arrangements for balancing of the load in a network by storage of energy
-
- H—ELECTRICITY
- H02—GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
- H02J—CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
- H02J3/00—Circuit arrangements for AC mains or AC distribution networks
- H02J3/38—Arrangements for parallely feeding a single network by two or more generators, converters or transformers
- H02J3/381—Dispersed generators
-
- H—ELECTRICITY
- H02—GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
- H02J—CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
- H02J3/00—Circuit arrangements for AC mains or AC distribution networks
- H02J3/38—Arrangements for parallely feeding a single network by two or more generators, converters or transformers
- H02J3/46—Controlling of the sharing of output between the generators, converters, or transformers
- H02J3/466—Scheduling the operation of the generators, e.g. connecting or disconnecting generators to meet a given demand
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2113/00—Details relating to the application field
- G06F2113/04—Power grid distribution networks
-
- H—ELECTRICITY
- H02—GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
- H02J—CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
- H02J2203/00—Indexing scheme relating to details of circuit arrangements for AC mains or AC distribution networks
- H02J2203/20—Simulating, e.g. planning, reliability check, modelling or computer assisted design [CAD]
-
- H—ELECTRICITY
- H02—GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
- H02J—CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
- H02J2300/00—Systems for supplying or distributing electric power characterised by decentralized, dispersed, or local generation
- H02J2300/40—Systems for supplying or distributing electric power characterised by decentralized, dispersed, or local generation wherein a plurality of decentralised, dispersed or local energy generation technologies are operated simultaneously
Landscapes
- Engineering & Computer Science (AREA)
- Power Engineering (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Software Systems (AREA)
- Medical Informatics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Computer Hardware Design (AREA)
- Geometry (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Supply And Distribution Of Alternating Current (AREA)
Abstract
The invention discloses an intelligent online control method for a grid-connected shared energy storage system. The method includes: building two multi-hidden-layer dueling (competitive) Q-network models; establishing a Markov decision process for the CBESS and mapping its charging and discharging behavior onto a reinforcement learning process based on iterative action-value updates; determining the environmental state features and the immediate reward function; entering E episodes of iterative learning; the MG performs the first scheduled dispatch of the episode to obtain the pre-transaction quantities with the external systems, and the CBESS agent perceives the environment to obtain the first state vector s_t; s_t is used as input to the main dueling Q-network to obtain the Q-value outputs for all actions; the remaining state of charge SOC_t of the CBESS is updated to SOC_{t+1}; the MG performs a secondary dispatch for the current period according to the tradable power actually fed back by the CBESS; the priority values of s_t, a_t, r_t and s_{t+1} are computed, and all hyperparameters of the main dueling Q-network are updated through gradient back-propagation; the priorities p_i of the data stored in the sumtree are updated, and the parameters of the main dueling Q-network are copied to the target dueling Q-network.
Description
Technical Field

The invention relates to the technical field of power system automation, and in particular to an intelligent online control method for a grid-connected shared energy storage system.

Background Art

Unlike large, centrally controlled energy storage systems (ESS), a shared/community energy storage system (CESS) is small in scale, typically with a capacity of only a few MWh, and is installed on the secondary side of the transformer at a distribution substation to mitigate the negative effects of continuously varying renewable resources and loads. Once integrated into a grid-connected microgrid (MG), the CESS can improve the flexibility and reliability of the MG through fast charging and discharging. With the deregulation of the distribution market, a CESS can be operated by an independent entity that participates in the market and earns arbitrage through price-responsive behavior. However, in traditional approaches to CESS optimal decision-making, whether centralized optimal control or decentralized coordinated optimization is adopted, complex system modeling, unobservable data and various uncertainties all pose challenges to model-based physical modeling.

Machine learning has developed rapidly in recent years, and its powerful perceptual learning and data analysis capabilities match the needs of big-data applications in smart grids. Reinforcement learning (RL) acquires knowledge of the environment through continuous interaction between the decision-making agent and the environment, and takes actions that affect the environment so as to reach preset goals. Deep learning (DL) does not rely on any analytical equation; instead it uses large amounts of existing data to describe mathematical problems and approximate their solutions, and applying it within RL can effectively alleviate difficulties such as solving the value function. The invention therefore seeks to overcome the modeling difficulty, poor scalability and poor practicality of physical modeling methods, and at the same time to overcome the solution difficulties of traditional intelligent algorithms when the state space is too large, as well as their drawbacks of poor convergence, poor robustness and slow convergence speed.
Summary of the Invention

The purpose of the present invention is to overcome the deficiencies of the prior art and to provide an intelligent online control method for a grid-connected shared energy storage system, comprising the following steps:

Step 1: build two multi-hidden-layer dueling (competitive) Q-network models, a main dueling Q-network and a target dueling Q-network, whose input is the feature vector s_t of the observed state and whose output corresponds to the action value Q(s_t, a_t) of each action a_t in the action set A;

Step 2: establish the Markov decision process of the CBESS and map its charging and discharging behavior onto a reinforcement learning process based on iterative action-value updates; determine the environmental state features and the immediate reward function;

Step 3: enter E episodes of iterative learning; at the start of each episode re-initialize the load curve of the MG, the RDG output, the market prices and the SOC of the shared energy storage;

Step 4: the MG executes the first scheduled dispatch of the episode to obtain the pre-transaction quantities with the external systems; the CBESS agent perceives the environment and obtains the first state vector s_t;

Step 5: use s_t as input to the main dueling Q-network to obtain the Q-value outputs for all actions; select an optimal estimated Q-value from the current Q-value outputs with the ε-greedy method, determine the corresponding action a_t and execute it;

Step 6: the remaining state of charge SOC_t of the CBESS is updated to SOC_{t+1}; judge whether SOC_{t+1} exceeds the range [0,1] to determine whether the limit is violated, compute from this the termination indicator done_t of the current iteration, and at the same time compute the immediate reward r_t after the action;

Step 7: the MG performs a secondary dispatch for the current period according to the tradable power actually fed back by the CBESS, determines the power traded with the external systems, and at the same time gives the pre-transaction power P_{t+1}^{mg.CHE}, P_{t+1}^{mg.grid} for the next period as the perceived state information of the agent in the next period; the state of the system is updated to s_{t+1};

Step 8: compute the priority values of s_t, a_t, r_t and s_{t+1}, and store them, together with the done_t indicator, in the leaf nodes of the sumtree in sequence; once the number of stored samples reaches the preset mini-batch size m, randomly sample m samples from it, compute the current target Q-values and their errors, and update all hyperparameters of the main dueling Q-network through gradient back-propagation;

Step 9: after the Q-network update, recompute and update the priorities p_i of the data stored in the sumtree, copy the parameters of the main dueling Q-network to the target dueling Q-network, and set the current state s = s_{t+1}; if s is a terminal state or the number of iterations T has been reached, the current episode is finished and the procedure returns to Step 3 to continue the loop; otherwise it goes to Step 5 to continue iterating.
Further, the main dueling Q-network is a multi-hidden-layer architecture with a state-value sub-layer of a single neuron and an action-advantage sub-layer of K neurons; the ReLU function is chosen as the activation function to accelerate convergence; the inter-layer weights ω are initialized from a normal distribution, and the biases b are initialized as constants close to 0; the state feature vector s_t, composed of the time index, the state of charge of the CBESS, the market electricity prices, and the pre-transaction power between the MG and the CBESS/upper-level distribution network, is used as the network input; the network outputs the optimal discretized charging/discharging action value Q_t, and finally converges iteratively through network training with prioritized replay of the stored data.

Further, the action set A is obtained as follows:

The action space of the CBESS is divided into K discrete charging/discharging options P_be^(k), uniformly discretizing the action space A:

where A is the set of all possible actions and P_be^(k) denotes the k-th charging/discharging action in the uniformly discretized action space of the CBESS.
Further, establishing the Markov decision process of the CBESS and mapping the charging and discharging behavior of the CBESS onto a reinforcement learning process based on iterative action-value updates is specifically:

The remaining energy of the BESS changes continuously during charging and discharging, and the change depends on the charged/discharged energy and the self-discharge within the period; the recursion for the charging process of the energy storage is

SoC(t) = (1 - σ_sdr)·SoC(t-1) + P_be·(1 - L_c)·Δt / E_cap

The discharging process of the energy storage is expressed as

SoC(t) = (1 - σ_sdr)·SoC(t-1) - P_be·Δt / [E_cap·(1 - L_dc)]

where SoC(t) is the state of charge of the CBESS in period t; P_be(t) is the charging/discharging power of the CBESS in period t; σ_sdr is the self-discharge rate of the storage medium; L_c and L_dc are the charging and discharging losses of the CBESS, respectively; Δt is the length of each calculation window;

The maximum allowable charging/discharging power of the CBESS at time t is determined by its own charging/discharging characteristics and by the remaining state of charge at time t, and during operation the following constraint is satisfied:

SoC_min ≤ SoC(t) ≤ SoC_max

where SoC_max and SoC_min are the upper and lower limits of the CBESS state-of-charge constraint, respectively;

The environmental state features are:

The environmental state feature vector perceived by the CBESS at time t is defined as

s_t = [t, SOC_t, pric_t^{b.pre}, pric_t^{s.pre}, P_t^{mg.CHE}, P_t^{mg.grid}]^T,  s_t ∈ S

where t is the time index; pric_t^{b.pre} and pric_t^{s.pre} denote the predicted selling and purchasing electricity prices of the upper-level grid in period t, respectively; P_t^{mg.CHE} and P_t^{mg.grid} denote the pre-transaction power between the microgrid and the CBESS and between the microgrid and the upper-level grid, respectively;
The immediate reward function is defined as follows: the CBESS obtains an energy arbitrage profit by charging in off-peak periods and then discharging in peak periods; after the actual traded power with the microgrid and with the upper-level grid has been determined, the reward income r_EAP is calculated according to the real-time prices;

The total operation and maintenance cost C_{o,m} of the CBESS is given by

C_1 = |P_be| · c_be

A negative-reward term with coefficient σ is added as a penalty to suppress fluctuation of the power at the grid connection point (P_exc_grid):

r_line = -σ·|P_exc_grid|

If the executed action causes the SOC to exceed [0,1], a large penalty r_exc is given to prevent the agent from making unreasonable decisions in subsequent learning; the immediate reward r_t is:

Further, the MG executing the first scheduled dispatch of the episode to obtain the pre-transaction quantities with the external systems, and the CBESS agent perceiving the environment to obtain the first state vector s_t, includes the following process: for the MG model, the objective is to minimize the operating cost under the predicted price signals, and the objective function of its economic dispatch model is as follows:

where T is the planning horizon; c_z^CDG is the generation cost of the z-th CDG and c_i^es is the operating cost of the i-th microgrid energy storage unit; P_{z,t}^CDG is the power output of the z-th CDG and P_{i,t}^es is the charging/discharging power of the i-th microgrid energy storage unit; P_t^{b.grid} and P_t^{s.grid} denote the selling and purchasing electricity prices of the upper-level distribution network in each period, and P_t^{b.CHE} and P_t^{s.CHE} denote the selling and purchasing prices published by the CBESS operator, respectively;

Based on the forecast data, the microgrid uses a mixed-integer linear programming (MILP) method to obtain the traded power P_t^{mg.CHE} and P_t^{mg.grid} with the CBESS and with the upper-level distribution network in this period, and publishes this transaction information; by perceiving the external environment, the CBESS agent obtains the state feature vector s_t = [t, SOC_t, pric_t^{b.pre}, pric_t^{s.pre}, P_t^{mg.CHE}, P_t^{mg.grid}].
Further, using s_t as input to the main dueling Q-network to obtain the Q-value outputs for all actions, and using the ε-greedy method to select an optimal estimated Q-value from the current Q-value outputs in order to determine the corresponding action a_t and execute it, includes the following process:

Use s_t as input to the main dueling Q-network to obtain the Q-value outputs for all actions; use the ε-greedy method to select a corresponding action a_t from the current Q-value outputs and execute the current action a_t in state s_t. For the ε-greedy policy, a value ε ∈ (0,1) is set first; when an action is selected, the optimal action a* currently regarded as having the maximum Q-value is chosen greedily with probability (1-ε), while with probability ε a potential action is explored at random from all K discrete candidate actions:

where ε gradually decreases from ε_ini to ε_fin over the course of the iterations.

Further, updating the remaining state of charge SOC_t of the CBESS to SOC_{t+1}, judging whether SOC_{t+1} exceeds the range [0,1] to determine whether the limit is violated, computing from this the termination indicator done_t of the current iteration, and at the same time computing the immediate reward r_t after the action, specifically includes the following process: the state of charge SOC_t of the CBESS is updated to SOC_{t+1}, from which it is judged whether the current iteration is in a terminal state, and the immediate reward r_t after the action is computed; the binary variable done is used as the termination indicator and serves as the interruption flag of each iteration:

where done of the current iteration equals 1 if the state of charge violates its limits during operation of the energy storage, and 0 otherwise; done = 1 means the iteration terminates and is exited, and done = 0 means the iteration has not terminated.
Further, in the computation described in Step 8, the priority values of s_t, a_t, r_t and s_{t+1} are computed and stored, together with the done_t indicator, in the leaf nodes of the sumtree in sequence; when the number of stored samples reaches the preset mini-batch size m, m samples are randomly drawn, the current target Q-values and their errors are computed, and all hyperparameters of the main dueling Q-network are updated through gradient back-propagation, where the current target Q-value y_j is:

A proportional prioritization strategy is adopted, i.e. the probability P(i) with which the i-th sample is drawn is:

where α ∈ [0,1] is the exponent that converts the importance of the TD error into a priority; if α = 0, the scheme reduces to uniform random sampling; p_i is the priority of transition i, computed as

p(i) = |δ_i| + ζ

where ζ is a small positive offset;

Importance-sampling weights are used to correct the bias, giving a mean-squared-error loss function L_i(θ_i) that takes the sample priorities into account. Finally, all parameters θ of the main dueling Q-network are updated through gradient back-propagation of the neural network:

ω_j = (N·P(j))^{-β} / max_i ω_i

θ_i = θ_{i-1} + α∇_{θ_i} L_i(θ_i)

where ω_j is the IS weight of sample j and β is a hyperparameter that gradually increases to 1.
The beneficial effects of the present invention are as follows. 1. The invention endows the CBESS with strong online learning and decision-making capabilities in a highly uncertain environment; by approximating the optimal action-value function without relying on any analytical equation, it solves the problem that iterative solution is impossible when the environmental state is continuous and the state space is huge.

2. The joint optimization of the double dueling Q-network structure and the prioritized replay strategy can effectively alleviate the over-estimation problem of the model, significantly improve the accuracy of the agent's decisions and the robustness of convergence, accelerate the convergence of the algorithm, and improve online computational efficiency.

Brief Description of the Drawings

Fig. 1 is a flow chart of the intelligent online control method for a grid-connected shared energy storage system;

Fig. 2 is a schematic diagram of the dueling Q-network structure;

Fig. 3 is a schematic diagram of the sumtree data structure;

Fig. 4 is a schematic diagram of the algorithm structure of the prioritized experience replay strategy.
Detailed Description of the Embodiments

The technical solution of the present invention is described in further detail below with reference to the accompanying drawings, but the scope of protection of the present invention is not limited to the following.

As shown in Fig. 1, the invented data-driven technique for online control decision-making of a grid-connected shared energy storage system includes the following steps:

S1: build two multi-hidden-layer dueling Q-network models, namely the main dueling Q-network and the target dueling Q-network, whose input is the feature vector s_t of the observed state and whose output corresponds to the action value Q(s_t, a_t) of each action a_t in the action set A. First, initialize all parameters of the Q-networks, the capacity D of the data storage structure sumtree, and the priority values of its leaf nodes.

S2: establish the Markov decision process of the CBESS, map its charging and discharging behavior onto a reinforcement learning process based on iterative action-value updates, and determine: 1) the control objective of the algorithm, namely to smooth the power fluctuation at the grid connection point of the microgrid as far as possible while maximizing the market arbitrage of the energy storage; 2) the combination of environmental state features, including the time index of the current period, the remaining energy of the CBESS, the predicted selling/purchasing prices of the upper-level grid, and the pre-transaction power with the distribution network/CBESS obtained from the MG's first economic dispatch; 3) the reward function, including the energy arbitrage profit r_EAP realized by the CBESS through flexible charging and discharging, the total operation and maintenance cost C_{o,m}, the penalty r_line for power fluctuation at the grid connection point, and the penalty r_exc for SOC limit violation of the energy storage.

S3: before each episode starts, re-initialize the uncertainty data, including the load curve of the microgrid, the renewable distributed generation output, and the market price signals;

S4: the microgrid performs the pre-planning of each period based on the forecast data, obtains the pre-transaction power with the CBESS/upper-level distribution network in period t, namely P_t^{mg.CHE}/P_t^{mg.grid}, and publishes this information; at the same time, the CBESS agent perceives the external environment and obtains the state feature vector s_t = [t, SOC_t, pric_t^{b.pre}, pric_t^{s.pre}, P_t^{mg.CHE}, P_t^{mg.grid}].

S5: use s_t as input to the main dueling Q-network to obtain the Q-value outputs for all actions; select an optimal estimated Q-value from the current Q-value outputs with the ε-greedy method, determine the corresponding action a_t and execute it.

S6: the remaining state of charge SOC_t of the CBESS is updated to SOC_{t+1}; judge whether SOC_{t+1} exceeds the range [0,1] to determine whether the limit is violated, compute from this the termination indicator done_t of the current iteration, and at the same time compute the immediate reward r_t after the action.

S7: the MG performs a secondary dispatch for the current period according to the tradable power actually fed back by the CBESS, determines the power traded with the external systems, and at the same time gives the pre-transaction power P_{t+1}^{mg.CHE}, P_{t+1}^{mg.grid} for the next period as the perceived state information of the agent in the next period; at this point the state of the system is updated to s_{t+1}.

S8: compute the priority values of s_t, a_t, r_t and s_{t+1}, and store them, together with the done_t indicator, in the leaf nodes of the sumtree in sequence. Once the number of stored samples reaches the preset mini-batch size m, m samples are drawn from it at random according to their priorities, the current target Q-values and their errors are computed, and all hyperparameters of the main dueling Q-network are updated through gradient back-propagation.

S9: after the Q-network update, the priorities p_i of the data stored in the sumtree need to be recomputed and updated, the parameters of the main dueling Q-network are periodically copied to the target Q-network, and the current state is set to s = s_{t+1}. If s is a terminal state or the number of iteration rounds T has been reached, the current episode is finished and the procedure returns to S3 to continue the loop; otherwise it goes to step S5 to continue iterating.

5.1 The specific process of step S1 is:

By continuously perceiving the power demand of the microgrid and the market environment, the CBESS interacts with the environment under the control objective and receives feedback rewards. A multi-hidden-layer main dueling Q-network architecture with a state-value sub-layer of a single neuron and an action-advantage sub-layer of K neurons is constructed, as shown in Fig. 2; the corresponding target dueling Q-network architecture is identical. The ReLU function is chosen as the activation function to accelerate convergence. The inter-layer weights ω are initialized from a normal distribution, and the biases b are initialized as constants close to 0. The state feature vector s_t, composed of the time index, the state of charge of the CBESS, the market electricity prices, and the pre-transaction power between the MG and the CBESS/upper-level distribution network, is used as the network input; the network outputs the optimal discretized charging/discharging action value Q_t, and finally converges iteratively through network training with prioritized replay of the stored data. In this model-free, reinforcement-learning-based and data-driven intelligent decision-making method for energy storage, a priority-proportional sample replay method based on the sumtree data structure is adopted; combined with DDQN it can appreciably improve policy accuracy and convergence speed and increase the robustness of the algorithm, while the use of the dueling network architecture helps the agent quickly identify the correct action during policy evaluation, with higher computational efficiency, good fitting accuracy and strong adaptability.
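As an illustration only, a minimal PyTorch sketch of such a dueling architecture is given below; the layer widths, the 6-dimensional state input and the initialization constants are assumptions made for the example, not values taken from the patent.

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Multi-hidden-layer dueling Q-network: a shared trunk feeds a single-neuron
    state-value sub-layer and a K-neuron action-advantage sub-layer."""
    def __init__(self, state_dim=6, num_actions=11, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.value = nn.Linear(hidden, 1)                 # state-value sub-layer V(s)
        self.advantage = nn.Linear(hidden, num_actions)   # advantage sub-layer A(s, a)

    def forward(self, state):
        h = self.trunk(state)
        v = self.value(h)
        a = self.advantage(h)
        # standard dueling aggregation: Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a))
        return v + a - a.mean(dim=1, keepdim=True)

def init_weights(module):
    """Normal initialization of inter-layer weights, biases as small constants near 0."""
    if isinstance(module, nn.Linear):
        nn.init.normal_(module.weight, std=0.1)
        nn.init.constant_(module.bias, 0.01)

main_net = DuelingQNetwork()
target_net = DuelingQNetwork()
main_net.apply(init_weights)
target_net.load_state_dict(main_net.state_dict())   # target network starts as a copy
```

In this sketch the target network is simply a delayed copy of the main network, matching the main/target pairing described above.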
The sumtree is the binary tree structure shown in Fig. 3. The root node is at the top, the branch nodes are in the middle layers, and only the leaf nodes at the bottom store samples. Each parent node contains the sum of its two children; the root node is therefore the sum of all priorities, denoted p_total. Because this data structure provides an efficient way of computing cumulative sums of priorities, the sumtree helps to store, update and sample proportional quantities efficiently. During storage, the obtained data are written into the leaf nodes from left to right; once the leaf nodes are full, the oldest data are overwritten one by one from the left. A notable advantage of this approach is that transitions do not need to be sorted by priority, which greatly reduces the computational burden and facilitates real-time training. Before the iterations start, the capacity of the sumtree leaf nodes must be determined and the priority values of the leaf nodes initialized.
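A compact Python sketch of such a sum-tree is shown below; it is an illustrative helper written for this description (array layout and method names are assumptions), not the storage structure actually claimed.

```python
import numpy as np

class SumTree:
    """Binary tree whose leaves hold sample priorities and whose internal nodes
    hold the sum of their children; the root equals p_total."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = np.zeros(2 * capacity - 1)           # internal nodes + leaves
        self.data = np.empty(capacity, dtype=object)     # (s, a, r, s_next, done)
        self.write = 0                                   # next leaf slot to (over)write
        self.size = 0

    def add(self, priority, sample):
        leaf = self.write + self.capacity - 1
        self.data[self.write] = sample
        self.update(leaf, priority)
        self.write = (self.write + 1) % self.capacity    # overwrite oldest when full
        self.size = min(self.size + 1, self.capacity)

    def update(self, leaf, priority):
        change = priority - self.tree[leaf]
        self.tree[leaf] = priority
        while leaf != 0:                                 # propagate the change upward
            leaf = (leaf - 1) // 2
            self.tree[leaf] += change

    def get(self, value):
        """Descend from the root to the leaf whose cumulative-sum interval
        contains `value`, where value is drawn in [0, p_total)."""
        idx = 0
        while 2 * idx + 1 < len(self.tree):
            left = 2 * idx + 1
            if value <= self.tree[left]:
                idx = left
            else:
                value -= self.tree[left]
                idx = left + 1
        return idx, self.tree[idx], self.data[idx - self.capacity + 1]

    @property
    def total(self):
        return self.tree[0]
```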
When a change of the environment state is perceived, the agent controls the CBESS to feed back the corresponding action a_t. The action space of the CBESS is divided into K discrete charging/discharging options P_be^(k), uniformly discretizing the action space A:

where A is the set of all possible actions and P_be^(k) denotes the k-th charging/discharging action in the uniformly discretized action space of the CBESS.
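For example (the sign convention and the rated power are assumptions made for illustration), the K uniformly spaced charging/discharging options can be generated as:

```python
import numpy as np

def discretize_actions(p_max, k=11):
    """Uniformly discretize the CBESS action space into K charge/discharge power
    levels in [-p_max, +p_max]; here negative values denote discharging and
    positive values denote charging."""
    return np.linspace(-p_max, p_max, k)

A = discretize_actions(p_max=1.0, k=11)   # e.g. [-1.0, -0.8, ..., 0.8, 1.0]
```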
5.2 The specific process of step S2 is:

The Markov decision process of the CBESS is established, and the charging and discharging behavior of the CBESS is mapped onto a reinforcement learning process based on iterative action-value updates, specifically:

The remaining energy of the BESS changes continuously during charging and discharging, and the change depends on the charged/discharged energy and the self-discharge within the period. The recursion for the charging process of the energy storage is

SoC(t) = (1 - σ_sdr)·SoC(t-1) + P_be·(1 - L_c)·Δt / E_cap

The discharging process of the energy storage is expressed as

SoC(t) = (1 - σ_sdr)·SoC(t-1) - P_be·Δt / [E_cap·(1 - L_dc)]

where SoC(t) is the state of charge (SoC) of the CBESS in period t; P_be(t) is the charging/discharging power of the CBESS in period t; σ_sdr is the self-discharge rate of the storage medium; L_c and L_dc are the charging and discharging losses of the CBESS, respectively; Δt is the length of each calculation window.

The maximum allowable charging/discharging power of the CBESS at time t is determined by its own charging/discharging characteristics and by the remaining state of charge at time t, and during operation the following constraint is satisfied:

SoC_min ≤ SoC(t) ≤ SoC_max

where SoC_max and SoC_min are the upper and lower limits of the CBESS state-of-charge constraint, respectively.
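The two recursions and the limit check can be coded directly; the sketch below is illustrative only (the signed-power convention, where p_be < 0 denotes discharging, is an assumption of the example):

```python
def update_soc(soc_prev, p_be, dt, e_cap, sigma_sdr, l_c, l_dc):
    """One-step SoC recursion of the CBESS: charging branch for p_be >= 0,
    discharging branch otherwise."""
    if p_be >= 0:
        # charging: conversion losses reduce the energy actually stored
        return (1 - sigma_sdr) * soc_prev + p_be * (1 - l_c) * dt / e_cap
    # discharging: losses increase the energy drawn from the storage
    return (1 - sigma_sdr) * soc_prev - abs(p_be) * dt / (e_cap * (1 - l_dc))

soc_next = update_soc(soc_prev=0.5, p_be=0.2, dt=1.0, e_cap=2.0,
                      sigma_sdr=0.001, l_c=0.05, l_dc=0.05)
soc_violated = not (0.0 <= soc_next <= 1.0)   # used later for done_t and r_exc
```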
Reinforcement learning is a kind of learning that maps environment states to actions, with the goal of maximizing the cumulative reward obtained by the agent while interacting with the environment. RL uses a Markov decision process (MDP) to simplify its modeling; an MDP is usually defined as a four-tuple (S, A, r, f), where S is the set of all environment states and s_t ∈ S denotes the state of the agent at time t; A is the set of actions the agent can perform and a_t ∈ A denotes the action taken by the agent at time t; r is the reward function and r_t ~ r(s_t, a_t) denotes the immediate reward obtained by the agent for executing action a_t in state s_t; f is the state-transition probability distribution function and s_{t+1} ~ f(s_t, a_t) denotes the probability that the agent transitions to the next state s_{t+1} after executing action a_t in state s_t. The goal of the Markov model is, after the state s has been initialized, to find an optimal planning policy V^{π*} that maximizes the expected sum of rewards:

where E_π denotes the expectation of the value under policy π, and 0 < γ < 1 is a decay coefficient that characterizes the importance of future rewards in reinforcement learning.
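The expression itself appears in the original only as a figure; in the standard form consistent with the surrounding definitions it can be written as:

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\Big|\, s_{0}=s \right],
\qquad
\pi^{*} = \arg\max_{\pi} V^{\pi}(s), \qquad 0 < \gamma < 1
```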
When the scale of the problem is relatively small, the algorithm is relatively easy to solve. For practical problems, however, the state space is usually very large; the computational cost of traditional iterative solution is too high, and it suffers from convergence difficulties, slow convergence speed and a tendency toward over-estimation, so the method proposed in the present invention is needed for an improved solution. For the data-driven technique for online control of the grid-connected shared energy storage system proposed in the present invention, the mapping is as follows:

(1) Environmental state features

The environmental state feature vector perceived by the CBESS at time t is defined as

s_t = [t, SOC_t^{be}, pric_t^{b.pre}, pric_t^{s.pre}, P_t^{mg.CHE}, P_t^{mg.grid}]^T,  s_t ∈ S

where t is the time index; pric_t^{b.pre} and pric_t^{s.pre} denote the predicted selling and purchasing electricity prices of the upper-level grid in period t, respectively; P_t^{mg.CHE} and P_t^{mg.grid} denote the pre-transaction power between the microgrid and the CBESS and between the microgrid and the upper-level grid, respectively.
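For illustration (the function and variable names are assumptions of this sketch), the 6-dimensional state vector can be assembled as:

```python
import numpy as np

def build_state(t, soc, price_buy_pred, price_sell_pred, p_mg_che, p_mg_grid):
    """Assemble the state feature vector s_t perceived by the CBESS agent."""
    return np.array([t, soc, price_buy_pred, price_sell_pred, p_mg_che, p_mg_grid],
                    dtype=np.float32)
```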
(2) Feedback reward

In the continuous perception and learning process of the CBESS, after the environment state s_t is given and the action a_t has been selected, the single-step immediate reward r_t obtained includes:

1) The CBESS obtains an energy arbitrage profit (EAP) by charging in off-peak periods and then discharging in peak periods. After the actual traded power with the microgrid and with the upper-level grid has been determined, the reward income r_EAP is calculated according to the real-time prices.

2) In addition to the basic per-unit electricity cost c_be of the CBESS, when its stored energy is close to its limit it may continue operating, which increases the cost. Finally, the total operation and maintenance cost C_{o,m} of the CBESS is given by

C_1 = |P_be| · c_be

3) The CBESS is able to mitigate the negative impact of the MG on the distribution network. A negative-reward term with coefficient σ is therefore added as a penalty to suppress fluctuation of the power at the grid connection point (P_exc_grid):

r_line = -σ·|P_exc_grid|

4) Once an executed action causes the SOC to exceed [0,1], a large penalty r_exc must be given to prevent the agent from making unreasonable decisions in subsequent learning. Finally, the immediate reward r_t is defined as:
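The composed reward expression appears in the original only as a figure; one plausible composition consistent with the four terms just described is sketched below (whether the SoC-violation penalty replaces or is added to the other terms, and the penalty magnitude, are assumptions of this sketch):

```python
def immediate_reward(r_eap, c_om, p_exc_grid, sigma, soc_next, r_exc=-100.0):
    """Single-step reward: arbitrage profit minus O&M cost and a penalty
    proportional to the grid-connection-point power deviation; a large fixed
    penalty is returned instead if the action drives the SoC outside [0, 1]."""
    if not (0.0 <= soc_next <= 1.0):
        return r_exc                       # SoC limit violated
    r_line = -sigma * abs(p_exc_grid)      # grid-connection-point fluctuation penalty
    return r_eap - c_om + r_line
```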
5.3 The specific process of step S3 is:

Before each episode of iteration starts, the uncertainty data are initialized, including the load curve of the microgrid, the renewable distributed generation output and the market price signals. Specifically, the actual values of the load curve, the RDG output and the market electricity prices can be given first, and their forecast errors are assumed to follow certain normal distributions so as to characterize the uncertainty fluctuations.

5.4 The specific process of step S4 is:

For the MG model, the objective is to minimize the operating cost under the predicted price signals; the objective function of its economic dispatch (ED) model is as follows:

where T is the planning horizon; c_z^CDG is the generation cost of the z-th CDG and c_i^es is the operating cost of the i-th microgrid energy storage unit; P_{z,t}^CDG is the power output of the z-th CDG and P_{i,t}^es is the charging/discharging power of the i-th microgrid energy storage unit; P_t^{b.grid} and P_t^{s.grid} denote the selling and purchasing electricity prices of the upper-level distribution network in each period, and P_t^{b.CHE} and P_t^{s.CHE} denote the selling and purchasing prices published by the CBESS operator, respectively.

Based on the forecast data, the microgrid uses a mixed-integer linear programming (MILP) method to obtain the traded power P_t^{mg.CHE} and P_t^{mg.grid} with the CBESS and with the upper-level distribution network in this period, and publishes this transaction information; at the same time, by perceiving the external environment, the CBESS agent obtains the state feature vector s_t = [t, SOC_t, pric_t^{b.pre}, pric_t^{s.pre}, P_t^{mg.CHE}, P_t^{mg.grid}].

5.5 The specific process of step S5 is:

Use s_t as input to the main dueling Q-network to obtain the Q-value outputs for all actions. Use the ε-greedy method to select a corresponding action a_t from the current Q-value outputs and execute the current action a_t in state s_t. For the ε-greedy policy, a value ε ∈ (0,1) is set first; when an action is selected, the optimal action a* currently regarded as having the maximum Q-value is chosen greedily with probability (1-ε), while with probability ε a potential action is explored at random from all K discrete candidate actions.

Here ε gradually decreases from ε_ini to ε_fin over the course of the iterations, so that more exploration is encouraged in the early stage, while the later stage focuses mainly on greedy exploitation so that the algorithm can converge stably.
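A short sketch of the decaying ε-greedy selection, using the DuelingQNetwork sketch above (the linear decay schedule and the names are assumptions for illustration):

```python
import numpy as np
import torch

def epsilon_greedy_action(main_net, state, epsilon, num_actions):
    """With probability epsilon explore a random discrete action; otherwise
    greedily pick the action with the largest estimated Q-value."""
    if np.random.rand() < epsilon:
        return np.random.randint(num_actions)
    with torch.no_grad():
        q_values = main_net(torch.tensor(state, dtype=torch.float32).unsqueeze(0))
    return int(q_values.argmax(dim=1).item())

def decayed_epsilon(step, eps_ini=1.0, eps_fin=0.05, decay_steps=5000):
    """Linear decay of epsilon from eps_ini to eps_fin over decay_steps steps."""
    frac = min(step / decay_steps, 1.0)
    return eps_ini + frac * (eps_fin - eps_ini)
```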
5.6 The specific process of step S6 is:

S6: the state of charge SOC_t of the CBESS is updated to SOC_{t+1}, from which it is judged whether the current iteration is in a terminal state, and the immediate reward r_t after the action is computed. The binary variable done is used as the termination indicator and serves as the interruption flag of each iteration:

where done of the current iteration equals 1 if the state of charge violates its limits during operation of the energy storage, and 0 otherwise; done = 1 means the iteration terminates and is exited, and done = 0 means the iteration has not terminated.
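The indicator itself appears in the original only as a figure; written out in code, consistent with the description above, it is simply:

```python
def done_flag(soc_next):
    """Termination indicator of the current iteration: 1 if the state of charge
    has left the range [0, 1] during operation, otherwise 0."""
    return 1 if (soc_next < 0.0 or soc_next > 1.0) else 0
```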
S7: the MG performs a secondary MILP optimization according to the tradable power actually fed back by the CBESS, determines the power traded with the external systems in the current period, and at the same time gives the pre-transaction power P_{t+1}^{mg.CHE}, P_{t+1}^{mg.grid} for the next period as the perceived state information of the agent in the next period; at this point the state of the system is updated to s_{t+1};

S8: during the continuous iterative updates, the five-tuple {s_t, a_t, r_t, s_{t+1}, done}, consisting of the s_t, a_t, r_t, s_{t+1} and termination indicator done obtained in each period t, is stored in sequence in the leaf nodes of the sumtree. If the number of stored entries reaches the maximum capacity of the leaf nodes, the old data are overwritten one by one on a rolling basis as new data are stored, to keep the samples valid. Once the number of samples reaches the mini-batch training size m, m samples (j = 1, 2, ..., m) are randomly drawn from the leaf nodes according to the prioritized replay mechanism, and the current target Q-value y_j corresponding to each sample is computed:
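The target expression appears in the original only as a figure; the standard double-DQN target consistent with the main/target network pairing described here would read:

```latex
y_{j} = r_{j} + (1 - \mathrm{done}_{j})\,\gamma\,
Q\!\left(s_{j+1},\, \arg\max_{a'} Q(s_{j+1}, a';\, \theta);\, \theta^{-}\right)
```

where θ denotes the parameters of the main dueling Q-network and θ⁻ those of the target network.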
For the prioritized replay mechanism, more important sample data are replayed at a higher frequency. The TD error δ therefore needs to be computed and stored, and samples with a larger |δ| are more likely to be sampled. A proportional prioritization strategy is adopted, which is a stochastic sampling strategy lying between a purely greedy strategy and uniform sampling, i.e. the probability P(i) with which the i-th sample is drawn is:

where α ∈ [0,1] is the exponent that converts the importance of the TD error into a priority. If α = 0, the scheme reduces to uniform random sampling. p_i is the priority of transition i, computed as shown below.
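The two expressions referred to above appear only as figures in the original; in the standard proportional-prioritization form, with the priority definition matching the one given earlier in the summary, they read:

```latex
P(i) = \frac{p_{i}^{\alpha}}{\sum_{k} p_{k}^{\alpha}}, \qquad p_{i} = |\delta_{i}| + \zeta
```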
where ζ is a small positive offset that ensures edge samples whose TD error is 0 can still be drawn. The above process changes the expected distribution of the stochastic updates, and the converged solution therefore changes with it. In view of this, importance-sampling (IS) weights are used to correct the bias, giving a mean-squared-error loss function L_i(θ_i) that takes the sample priorities into account. Finally, all parameters θ of the main dueling Q-network are updated through gradient back-propagation of the neural network:

ω_j = (N·P(j))^{-β} / max_i ω_i

θ_i = θ_{i-1} + α∇_{θ_i} L_i(θ_i)

where ω_j is the IS weight of sample j and β is a hyperparameter that gradually increases to 1. Fig. 4 summarizes the structure of the prioritized experience replay algorithm.
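Putting the pieces together, a sketch of prioritized sampling with importance-sampling weights and a single double-DQN update step is given below; it reuses the DuelingQNetwork and SumTree sketches above, and all hyperparameter values are placeholders rather than the patented settings.

```python
import numpy as np
import torch

def sample_batch(tree, m, beta):
    """Draw m transitions proportionally to priority from the SumTree and compute
    IS weights w_j = (N * P(j))^(-beta) / max_i w_i."""
    segment = tree.total / m
    leaves, batch, probs = [], [], []
    for j in range(m):
        v = np.random.uniform(j * segment, (j + 1) * segment)
        leaf, priority, data = tree.get(v)
        leaves.append(leaf)
        batch.append(data)
        probs.append(priority / tree.total)
    probs = np.array(probs)
    weights = (tree.size * probs) ** (-beta)
    weights = weights / weights.max()
    return leaves, batch, torch.tensor(weights, dtype=torch.float32)

def train_step(main_net, target_net, optimizer, tree,
               m=32, gamma=0.95, beta=0.4, alpha=0.6, zeta=1e-5):
    leaves, batch, weights = sample_batch(tree, m, beta)
    s, a, r, s_next, done = map(np.array, zip(*batch))
    s = torch.tensor(s, dtype=torch.float32)
    s_next = torch.tensor(s_next, dtype=torch.float32)
    a = torch.tensor(a, dtype=torch.int64)
    r = torch.tensor(r, dtype=torch.float32)
    done = torch.tensor(done, dtype=torch.float32)

    q = main_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # double-DQN target: action chosen by the main net, evaluated by the target net
        a_star = main_net(s_next).argmax(dim=1, keepdim=True)
        q_next = target_net(s_next).gather(1, a_star).squeeze(1)
        y = r + (1.0 - done) * gamma * q_next

    td_error = y - q
    loss = (weights * td_error.pow(2)).mean()   # IS-weighted mean-squared-error loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # refresh stored priorities as (|delta_i| + zeta)^alpha so that sampling
    # proportional to the stored values realizes P(i) = p_i^alpha / sum_k p_k^alpha
    for leaf, delta in zip(leaves, td_error.detach().abs().numpy()):
        tree.update(leaf, float((delta + zeta) ** alpha))
```

A full training loop would call train_step once per environment step after the sumtree holds at least m transitions, mirroring step S8.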
S9: after the Q-network update, the priorities p_i of the data stored in the sumtree are recomputed and updated, the parameters of the main dueling Q-network are periodically copied to the target Q-network, and the current state is set to s = s_{t+1}. If s is a terminal state or the number of iteration rounds T has been reached, the current episode of iteration is finished and the procedure returns to S3 to continue the loop; otherwise it goes to step S5 to continue iterating.

The above are only preferred embodiments of the present invention. It should be understood that the present invention is not limited to the forms disclosed herein, and this should not be regarded as an exclusion of other embodiments; the invention can be used in various other combinations, modifications and environments, and can be modified within the scope of the concept described herein through the above teachings or through the skill or knowledge of the relevant field. Modifications and changes made by those skilled in the art that do not depart from the spirit and scope of the present invention shall all fall within the protection scope of the appended claims of the present invention.
Claims (8)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010754472.1A CN112003269B (en) | 2020-07-30 | 2020-07-30 | Intelligent on-line control method of grid-connected shared energy storage system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010754472.1A CN112003269B (en) | 2020-07-30 | 2020-07-30 | Intelligent on-line control method of grid-connected shared energy storage system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112003269A CN112003269A (en) | 2020-11-27 |
CN112003269B true CN112003269B (en) | 2022-06-28 |
Family
ID=73462676
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010754472.1A Active CN112003269B (en) | 2020-07-30 | 2020-07-30 | Intelligent on-line control method of grid-connected shared energy storage system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112003269B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112670982B (en) * | 2020-12-14 | 2022-11-08 | 广西电网有限责任公司电力科学研究院 | Active power scheduling control method and system for micro-grid based on reward mechanism |
CN112671033B (en) * | 2020-12-14 | 2022-12-23 | 广西电网有限责任公司电力科学研究院 | Priority-level-considered microgrid active scheduling control method and system |
CN113126498A (en) * | 2021-04-17 | 2021-07-16 | 西北工业大学 | Optimization control system and control method based on distributed reinforcement learning |
CN114243650B (en) * | 2021-11-17 | 2024-12-03 | 国网浙江省电力有限公司绍兴供电公司 | A distribution network area automation protection method based on low voltage intelligent switch |
CN114048576B (en) * | 2021-11-24 | 2024-05-10 | 国网四川省电力公司成都供电公司 | Intelligent control method for energy storage system for stabilizing power transmission section tide of power grid |
CN114285854B (en) * | 2022-03-03 | 2022-07-05 | 成都工业学院 | Edge computing system and method with storage optimization and security transmission capability |
CN116316755B (en) * | 2023-03-07 | 2023-11-14 | 西南交通大学 | An energy management method for electrified railway energy storage systems based on reinforcement learning |
CN117541036B (en) * | 2024-01-10 | 2024-04-05 | 中网华信科技股份有限公司 | Energy management method and system based on intelligent park |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109347149A (en) * | 2018-09-20 | 2019-02-15 | 国网河南省电力公司电力科学研究院 | Microgrid energy storage scheduling method and device based on deep Q-value network reinforcement learning |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150184549A1 (en) * | 2013-12-31 | 2015-07-02 | General Electric Company | Methods and systems for enhancing control of power plant generating units |
-
2020
- 2020-07-30 CN CN202010754472.1A patent/CN112003269B/en active Active
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109347149A (en) * | 2018-09-20 | 2019-02-15 | 国网河南省电力公司电力科学研究院 | Microgrid energy storage scheduling method and device based on deep Q-value network reinforcement learning |
Non-Patent Citations (1)
Title |
---|
Deep reinforcement learning algorithm for voltage regulation of distribution networks containing energy storage systems; Shi Jingjian et al.; Electric Power Construction (电力建设); 2020-03-01 (No. 03); full text *
Also Published As
Publication number | Publication date |
---|---|
CN112003269A (en) | 2020-11-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112003269B (en) | Intelligent on-line control method of grid-connected shared energy storage system | |
Guo et al. | Real-time optimal energy management of microgrid with uncertainties based on deep reinforcement learning | |
CN112614009B (en) | Power grid energy management method and system based on deep expectation Q-learning | |
CN111884213B (en) | Power distribution network voltage adjusting method based on deep reinforcement learning algorithm | |
Huang et al. | A control strategy based on deep reinforcement learning under the combined wind-solar storage system | |
CN107706932B (en) | An Energy Scheduling Optimization Method Based on Dynamic Adaptive Fuzzy Logic Controller | |
CN116247648A (en) | Deep reinforcement learning method for micro-grid energy scheduling under consideration of source load uncertainty | |
CN113935463A (en) | Microgrid controller based on artificial intelligence control method | |
CN110751318A (en) | IPSO-LSTM-based ultra-short-term power load prediction method | |
CN112952831B (en) | Daily optimization operation strategy for providing stacking service by load side energy storage | |
CN114069650B (en) | Power distribution network closed loop current regulation and control method and device, computer equipment and storage medium | |
CN117621898B (en) | Smart parking lot charging pile charging control method and system considering grid electricity price | |
CN116451880B (en) | Distributed energy optimization scheduling method and device based on hybrid learning | |
CN115313403A (en) | Real-time voltage regulation and control method based on deep reinforcement learning algorithm | |
CN115940294A (en) | Multi-level power grid real-time dispatching strategy adjustment method, system, equipment and storage medium | |
CN117117989A (en) | Deep reinforcement learning solving method for unit combination | |
Liu et al. | Deep reinforcement learning for real-time economic energy management of microgrid system considering uncertainties | |
CN117220318A (en) | Power grid digital driving control method and system | |
CN115345380A (en) | A new energy consumption power dispatching method based on artificial intelligence | |
CN114298429A (en) | Power distribution network scheme aided decision-making method, system, device and storage medium | |
CN117893043A (en) | Hydropower station load distribution method based on DDPG algorithm and deep learning model | |
CN115001002B (en) | Optimal scheduling method and system for solving problem of energy storage participation peak clipping and valley filling | |
Cheng et al. | Real-time dispatch via expert knowledge driven deep reinforcement learning | |
CN114971250B (en) | Comprehensive energy economic dispatch system based on deep Q-learning | |
CN116826762A (en) | Smart distribution network voltage safety control methods, devices, equipment and media |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |