CN112003269B - Intelligent on-line control method of grid-connected shared energy storage system - Google Patents
Intelligent on-line control method of grid-connected shared energy storage system
- Publication number
- CN112003269B · Application CN202010754472.1A
- Authority
- CN
- China
- Prior art keywords
- cbess
- network
- soc
- action
- grid
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H02—GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
- H02J—CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
- H02J3/00—Circuit arrangements for AC mains or AC distribution networks
- H02J3/008—Circuit arrangements for AC mains or AC distribution networks involving trading of energy or energy transmission rights
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
- G06F30/27—Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
-
- H—ELECTRICITY
- H02—GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
- H02J—CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
- H02J3/00—Circuit arrangements for AC mains or AC distribution networks
- H02J3/28—Arrangements for balancing of the load in a network by storage of energy
-
- H—ELECTRICITY
- H02—GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
- H02J—CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
- H02J3/00—Circuit arrangements for AC mains or AC distribution networks
- H02J3/38—Arrangements for parallely feeding a single network by two or more generators, converters or transformers
- H02J3/381—Dispersed generators
-
- H—ELECTRICITY
- H02—GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
- H02J—CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
- H02J3/00—Circuit arrangements for AC mains or AC distribution networks
- H02J3/38—Arrangements for parallely feeding a single network by two or more generators, converters or transformers
- H02J3/46—Controlling of the sharing of output between the generators, converters, or transformers
- H02J3/466—Scheduling the operation of the generators, e.g. connecting or disconnecting generators to meet a given demand
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2113/00—Details relating to the application field
- G06F2113/04—Power grid distribution networks
-
- H—ELECTRICITY
- H02—GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
- H02J—CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
- H02J2203/00—Indexing scheme relating to details of circuit arrangements for AC mains or AC distribution networks
- H02J2203/20—Simulating, e.g. planning, reliability check, modelling or computer assisted design [CAD]
-
- H—ELECTRICITY
- H02—GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
- H02J—CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
- H02J2300/00—Systems for supplying or distributing electric power characterised by decentralized, dispersed, or local generation
- H02J2300/40—Systems for supplying or distributing electric power characterised by decentralized, dispersed, or local generation wherein a plurality of decentralised, dispersed or local energy generation technologies are operated simultaneously
Landscapes
- Engineering & Computer Science (AREA)
- Power Engineering (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Software Systems (AREA)
- Medical Informatics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Computer Hardware Design (AREA)
- Geometry (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Supply And Distribution Of Alternating Current (AREA)
Abstract
The invention discloses an intelligent online control method for a grid-connected shared energy storage system. The method includes: building two multi-hidden-layer dueling (competitive) Q-network models; establishing a Markov decision process for the CBESS and mapping its charging and discharging behavior onto a reinforcement learning process based on iterative action-value updates; determining the environmental state features and the immediate reward function; entering E episodes of iterative learning; the MG performs the first scheduled dispatch of the episode to obtain the pre-transaction quantities with the external systems, and the CBESS agent perceives the environment to obtain the first state vector s_t; s_t is used as input to the main dueling Q-network to obtain the Q-value outputs for all actions; the remaining state of charge SOC_t of the CBESS is updated to SOC_{t+1}; the MG performs a secondary dispatch for the current period according to the tradable power actually fed back by the CBESS; the priority values of s_t, a_t, r_t and s_{t+1} are computed, and all hyperparameters of the main dueling Q-network are updated through gradient back-propagation; the priorities p_i of the data stored in the sumtree are updated, and the parameters of the main dueling Q-network are copied to the target dueling Q-network.
Description
Technical Field

The invention relates to the technical field of power system automation, and in particular to an intelligent online control method for a grid-connected shared energy storage system.

Background Art

Unlike large, centrally controlled energy storage systems (ESS), a shared/community energy storage system (CESS) is small in scale, typically with a capacity of only a few MWh, and is installed on the secondary side of the transformer at a distribution substation to mitigate the negative effects of continuously varying renewable resources and loads. Once integrated into a grid-connected microgrid (MG), the CESS can improve the flexibility and reliability of the MG through fast charging and discharging. With the deregulation of the distribution market, a CESS can be operated by an independent entity that participates in the market and earns arbitrage through price-responsive behavior. However, in traditional approaches to CESS optimal decision-making, whether centralized optimal control or decentralized coordinated optimization is adopted, complex system modeling, unobservable data and various uncertainties all pose challenges to model-based physical modeling.

Machine learning has developed rapidly in recent years, and its powerful perceptual learning and data analysis capabilities match the needs of big-data applications in smart grids. Reinforcement learning (RL) acquires knowledge of the environment through continuous interaction between the decision-making agent and the environment, and takes actions that affect the environment so as to reach preset goals. Deep learning (DL) does not rely on any analytical equation; instead it uses large amounts of existing data to describe mathematical problems and approximate their solutions, and applying it within RL can effectively alleviate difficulties such as solving the value function. The invention therefore seeks to overcome the modeling difficulty, poor scalability and poor practicality of physical modeling methods, and at the same time to overcome the solution difficulties of traditional intelligent algorithms when the state space is too large, as well as their drawbacks of poor convergence, poor robustness and slow convergence speed.
Summary of the Invention

The purpose of the present invention is to overcome the deficiencies of the prior art and to provide an intelligent online control method for a grid-connected shared energy storage system, comprising the following steps:

Step 1: build two multi-hidden-layer dueling (competitive) Q-network models, a main dueling Q-network and a target dueling Q-network, whose input is the feature vector s_t of the observed state and whose output corresponds to the action value Q(s_t, a_t) of each action a_t in the action set A;

Step 2: establish the Markov decision process of the CBESS and map its charging and discharging behavior onto a reinforcement learning process based on iterative action-value updates; determine the environmental state features and the immediate reward function;

Step 3: enter E episodes of iterative learning; at the start of each episode re-initialize the load curve of the MG, the RDG output, the market prices and the SOC of the shared energy storage;

Step 4: the MG executes the first scheduled dispatch of the episode to obtain the pre-transaction quantities with the external systems; the CBESS agent perceives the environment and obtains the first state vector s_t;

Step 5: use s_t as input to the main dueling Q-network to obtain the Q-value outputs for all actions; select an optimal estimated Q-value from the current Q-value outputs with the ε-greedy method, determine the corresponding action a_t and execute it;

Step 6: the remaining state of charge SOC_t of the CBESS is updated to SOC_{t+1}; judge whether SOC_{t+1} exceeds the range [0,1] to determine whether the limit is violated, compute from this the termination indicator done_t of the current iteration, and at the same time compute the immediate reward r_t after the action;

Step 7: the MG performs a secondary dispatch for the current period according to the tradable power actually fed back by the CBESS, determines the power traded with the external systems, and at the same time gives the pre-transaction power P_{t+1}^{mg.CHE}, P_{t+1}^{mg.grid} for the next period as the perceived state information of the agent in the next period; the state of the system is updated to s_{t+1};

Step 8: compute the priority values of s_t, a_t, r_t and s_{t+1}, and store them, together with the done_t indicator, in the leaf nodes of the sumtree in sequence; once the number of stored samples reaches the preset mini-batch size m, randomly sample m samples from it, compute the current target Q-values and their errors, and update all hyperparameters of the main dueling Q-network through gradient back-propagation;

Step 9: after the Q-network update, recompute and update the priorities p_i of the data stored in the sumtree, copy the parameters of the main dueling Q-network to the target dueling Q-network, and set the current state s = s_{t+1}; if s is a terminal state or the number of iterations T has been reached, the current episode is finished and the procedure returns to Step 3 to continue the loop; otherwise it goes to Step 5 to continue iterating.
Further, the main dueling Q-network is a multi-hidden-layer architecture with a state-value sub-layer of a single neuron and an action-advantage sub-layer of K neurons; the ReLU function is chosen as the activation function to accelerate convergence; the inter-layer weights ω are initialized from a normal distribution, and the biases b are initialized as constants close to 0; the state feature vector s_t, composed of the time index, the state of charge of the CBESS, the market electricity prices, and the pre-transaction power between the MG and the CBESS/upper-level distribution network, is used as the network input; the network outputs the optimal discretized charging/discharging action value Q_t, and finally converges iteratively through network training with prioritized replay of the stored data.

Further, the action set A is obtained as follows:

The action space of the CBESS is divided into K discrete charging/discharging options P_be^(k), uniformly discretizing the action space A:

where A is the set of all possible actions and P_be^(k) denotes the k-th charging/discharging action in the uniformly discretized action space of the CBESS.
Further, establishing the Markov decision process of the CBESS and mapping the charging and discharging behavior of the CBESS onto a reinforcement learning process based on iterative action-value updates is specifically:

The remaining energy of the BESS changes continuously during charging and discharging, and the change depends on the charged/discharged energy and the self-discharge within the period; the recursion for the charging process of the energy storage is

SoC(t) = (1 - σ_sdr)·SoC(t-1) + P_be·(1 - L_c)·Δt / E_cap

The discharging process of the energy storage is expressed as

SoC(t) = (1 - σ_sdr)·SoC(t-1) - P_be·Δt / [E_cap·(1 - L_dc)]

where SoC(t) is the state of charge of the CBESS in period t; P_be(t) is the charging/discharging power of the CBESS in period t; σ_sdr is the self-discharge rate of the storage medium; L_c and L_dc are the charging and discharging losses of the CBESS, respectively; Δt is the length of each calculation window;

The maximum allowable charging/discharging power of the CBESS at time t is determined by its own charging/discharging characteristics and by the remaining state of charge at time t, and during operation the following constraint is satisfied:

SoC_min ≤ SoC(t) ≤ SoC_max

where SoC_max and SoC_min are the upper and lower limits of the CBESS state-of-charge constraint, respectively;

The environmental state features are:

The environmental state feature vector perceived by the CBESS at time t is defined as

s_t = [t, SOC_t, pric_t^{b.pre}, pric_t^{s.pre}, P_t^{mg.CHE}, P_t^{mg.grid}]^T,  s_t ∈ S

where t is the time index; pric_t^{b.pre} and pric_t^{s.pre} denote the predicted selling and purchasing electricity prices of the upper-level grid in period t, respectively; P_t^{mg.CHE} and P_t^{mg.grid} denote the pre-transaction power between the microgrid and the CBESS and between the microgrid and the upper-level grid, respectively;
The immediate reward function is defined as follows: the CBESS obtains an energy arbitrage profit by charging in off-peak periods and then discharging in peak periods; after the actual traded power with the microgrid and with the upper-level grid has been determined, the reward income r_EAP is calculated according to the real-time prices;

The total operation and maintenance cost C_{o,m} of the CBESS is given by

C_1 = |P_be| · c_be

A negative-reward term with coefficient σ is added as a penalty to suppress fluctuation of the power at the grid connection point (P_exc_grid):

r_line = -σ·|P_exc_grid|

If the executed action causes the SOC to exceed [0,1], a large penalty r_exc is given to prevent the agent from making unreasonable decisions in subsequent learning; the immediate reward r_t is:

Further, the MG executing the first scheduled dispatch of the episode to obtain the pre-transaction quantities with the external systems, and the CBESS agent perceiving the environment to obtain the first state vector s_t, includes the following process: for the MG model, the objective is to minimize the operating cost under the predicted price signals, and the objective function of its economic dispatch model is as follows:

where T is the planning horizon; c_z^CDG is the generation cost of the z-th CDG and c_i^es is the operating cost of the i-th microgrid energy storage unit; P_{z,t}^CDG is the power output of the z-th CDG and P_{i,t}^es is the charging/discharging power of the i-th microgrid energy storage unit; P_t^{b.grid} and P_t^{s.grid} denote the selling and purchasing electricity prices of the upper-level distribution network in each period, and P_t^{b.CHE} and P_t^{s.CHE} denote the selling and purchasing prices published by the CBESS operator, respectively;

Based on the forecast data, the microgrid uses a mixed-integer linear programming (MILP) method to obtain the traded power P_t^{mg.CHE} and P_t^{mg.grid} with the CBESS and with the upper-level distribution network in this period, and publishes this transaction information; by perceiving the external environment, the CBESS agent obtains the state feature vector s_t = [t, SOC_t, pric_t^{b.pre}, pric_t^{s.pre}, P_t^{mg.CHE}, P_t^{mg.grid}].
Further, using s_t as input to the main dueling Q-network to obtain the Q-value outputs for all actions, and using the ε-greedy method to select an optimal estimated Q-value from the current Q-value outputs in order to determine the corresponding action a_t and execute it, includes the following process:

Use s_t as input to the main dueling Q-network to obtain the Q-value outputs for all actions; use the ε-greedy method to select a corresponding action a_t from the current Q-value outputs and execute the current action a_t in state s_t. For the ε-greedy policy, a value ε ∈ (0,1) is set first; when an action is selected, the optimal action a* currently regarded as having the maximum Q-value is chosen greedily with probability (1-ε), while with probability ε a potential action is explored at random from all K discrete candidate actions:

where ε gradually decreases from ε_ini to ε_fin over the course of the iterations.

Further, updating the remaining state of charge SOC_t of the CBESS to SOC_{t+1}, judging whether SOC_{t+1} exceeds the range [0,1] to determine whether the limit is violated, computing from this the termination indicator done_t of the current iteration, and at the same time computing the immediate reward r_t after the action, specifically includes the following process: the state of charge SOC_t of the CBESS is updated to SOC_{t+1}, from which it is judged whether the current iteration is in a terminal state, and the immediate reward r_t after the action is computed; the binary variable done is used as the termination indicator and serves as the interruption flag of each iteration:

where done of the current iteration equals 1 if the state of charge violates its limits during operation of the energy storage, and 0 otherwise; done = 1 means the iteration terminates and is exited, and done = 0 means the iteration has not terminated.
Further, in the computation described in Step 8, the priority values of s_t, a_t, r_t and s_{t+1} are computed and stored, together with the done_t indicator, in the leaf nodes of the sumtree in sequence; when the number of stored samples reaches the preset mini-batch size m, m samples are randomly drawn, the current target Q-values and their errors are computed, and all hyperparameters of the main dueling Q-network are updated through gradient back-propagation, where the current target Q-value y_j is:

A proportional prioritization strategy is adopted, i.e. the probability P(i) with which the i-th sample is drawn is:

where α ∈ [0,1] is the exponent that converts the importance of the TD error into a priority; if α = 0, the scheme reduces to uniform random sampling; p_i is the priority of transition i, computed as

p(i) = |δ_i| + ζ

where ζ is a small positive offset;

Importance-sampling weights are used to correct the bias, giving a mean-squared-error loss function L_i(θ_i) that takes the sample priorities into account. Finally, all parameters θ of the main dueling Q-network are updated through gradient back-propagation of the neural network:

ω_j = (N·P(j))^{-β} / max_i ω_i

θ_i = θ_{i-1} + α∇_{θ_i} L_i(θ_i)

where ω_j is the IS weight of sample j and β is a hyperparameter that gradually increases to 1.
The beneficial effects of the present invention are as follows. 1. The invention endows the CBESS with strong online learning and decision-making capabilities in a highly uncertain environment; by approximating the optimal action-value function without relying on any analytical equation, it solves the problem that iterative solution is impossible when the environmental state is continuous and the state space is huge.

2. The joint optimization of the double dueling Q-network structure and the prioritized replay strategy can effectively alleviate the over-estimation problem of the model, significantly improve the accuracy of the agent's decisions and the robustness of convergence, accelerate the convergence of the algorithm, and improve online computational efficiency.

Brief Description of the Drawings

Fig. 1 is a flow chart of the intelligent online control method for a grid-connected shared energy storage system;

Fig. 2 is a schematic diagram of the dueling Q-network structure;

Fig. 3 is a schematic diagram of the sumtree data structure;

Fig. 4 is a schematic diagram of the algorithm structure of the prioritized experience replay strategy.
Detailed Description of the Embodiments

The technical solution of the present invention is described in further detail below with reference to the accompanying drawings, but the scope of protection of the present invention is not limited to the following.

As shown in Fig. 1, the invented data-driven technique for online control decision-making of a grid-connected shared energy storage system includes the following steps:

S1: build two multi-hidden-layer dueling Q-network models, namely the main dueling Q-network and the target dueling Q-network, whose input is the feature vector s_t of the observed state and whose output corresponds to the action value Q(s_t, a_t) of each action a_t in the action set A. First, initialize all parameters of the Q-networks, the capacity D of the data storage structure sumtree, and the priority values of its leaf nodes.

S2: establish the Markov decision process of the CBESS, map its charging and discharging behavior onto a reinforcement learning process based on iterative action-value updates, and determine: 1) the control objective of the algorithm, namely to smooth the power fluctuation at the grid connection point of the microgrid as far as possible while maximizing the market arbitrage of the energy storage; 2) the combination of environmental state features, including the time index of the current period, the remaining energy of the CBESS, the predicted selling/purchasing prices of the upper-level grid, and the pre-transaction power with the distribution network/CBESS obtained from the MG's first economic dispatch; 3) the reward function, including the energy arbitrage profit r_EAP realized by the CBESS through flexible charging and discharging, the total operation and maintenance cost C_{o,m}, the penalty r_line for power fluctuation at the grid connection point, and the penalty r_exc for SOC limit violation of the energy storage.

S3: before each episode starts, re-initialize the uncertainty data, including the load curve of the microgrid, the renewable distributed generation output, and the market price signals;

S4: the microgrid performs the pre-planning of each period based on the forecast data, obtains the pre-transaction power with the CBESS/upper-level distribution network in period t, namely P_t^{mg.CHE}/P_t^{mg.grid}, and publishes this information; at the same time, the CBESS agent perceives the external environment and obtains the state feature vector s_t = [t, SOC_t, pric_t^{b.pre}, pric_t^{s.pre}, P_t^{mg.CHE}, P_t^{mg.grid}].

S5: use s_t as input to the main dueling Q-network to obtain the Q-value outputs for all actions; select an optimal estimated Q-value from the current Q-value outputs with the ε-greedy method, determine the corresponding action a_t and execute it.

S6: the remaining state of charge SOC_t of the CBESS is updated to SOC_{t+1}; judge whether SOC_{t+1} exceeds the range [0,1] to determine whether the limit is violated, compute from this the termination indicator done_t of the current iteration, and at the same time compute the immediate reward r_t after the action.

S7: the MG performs a secondary dispatch for the current period according to the tradable power actually fed back by the CBESS, determines the power traded with the external systems, and at the same time gives the pre-transaction power P_{t+1}^{mg.CHE}, P_{t+1}^{mg.grid} for the next period as the perceived state information of the agent in the next period; at this point the state of the system is updated to s_{t+1}.

S8: compute the priority values of s_t, a_t, r_t and s_{t+1}, and store them, together with the done_t indicator, in the leaf nodes of the sumtree in sequence. Once the number of stored samples reaches the preset mini-batch size m, m samples are drawn from it at random according to their priorities, the current target Q-values and their errors are computed, and all hyperparameters of the main dueling Q-network are updated through gradient back-propagation.

S9: after the Q-network update, the priorities p_i of the data stored in the sumtree need to be recomputed and updated, the parameters of the main dueling Q-network are periodically copied to the target Q-network, and the current state is set to s = s_{t+1}. If s is a terminal state or the number of iteration rounds T has been reached, the current episode is finished and the procedure returns to S3 to continue the loop; otherwise it goes to step S5 to continue iterating.

5.1 The specific process of step S1 is:

By continuously perceiving the power demand of the microgrid and the market environment, the CBESS interacts with the environment under the control objective and receives feedback rewards. A multi-hidden-layer main dueling Q-network architecture with a state-value sub-layer of a single neuron and an action-advantage sub-layer of K neurons is constructed, as shown in Fig. 2; the corresponding target dueling Q-network architecture is identical. The ReLU function is chosen as the activation function to accelerate convergence. The inter-layer weights ω are initialized from a normal distribution, and the biases b are initialized as constants close to 0. The state feature vector s_t, composed of the time index, the state of charge of the CBESS, the market electricity prices, and the pre-transaction power between the MG and the CBESS/upper-level distribution network, is used as the network input; the network outputs the optimal discretized charging/discharging action value Q_t, and finally converges iteratively through network training with prioritized replay of the stored data. In this model-free, reinforcement-learning-based and data-driven intelligent decision-making method for energy storage, a priority-proportional sample replay method based on the sumtree data structure is adopted; combined with DDQN it can appreciably improve policy accuracy and convergence speed and increase the robustness of the algorithm, while the use of the dueling network architecture helps the agent quickly identify the correct action during policy evaluation, with higher computational efficiency, good fitting accuracy and strong adaptability.
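As an illustration only, a minimal PyTorch sketch of such a dueling architecture is given below; the layer widths, the 6-dimensional state input and the initialization constants are assumptions made for the example, not values taken from the patent.

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Multi-hidden-layer dueling Q-network: a shared trunk feeds a single-neuron
    state-value sub-layer and a K-neuron action-advantage sub-layer."""
    def __init__(self, state_dim=6, num_actions=11, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.value = nn.Linear(hidden, 1)                 # state-value sub-layer V(s)
        self.advantage = nn.Linear(hidden, num_actions)   # advantage sub-layer A(s, a)

    def forward(self, state):
        h = self.trunk(state)
        v = self.value(h)
        a = self.advantage(h)
        # standard dueling aggregation: Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a))
        return v + a - a.mean(dim=1, keepdim=True)

def init_weights(module):
    """Normal initialization of inter-layer weights, biases as small constants near 0."""
    if isinstance(module, nn.Linear):
        nn.init.normal_(module.weight, std=0.1)
        nn.init.constant_(module.bias, 0.01)

main_net = DuelingQNetwork()
target_net = DuelingQNetwork()
main_net.apply(init_weights)
target_net.load_state_dict(main_net.state_dict())   # target network starts as a copy
```

In this sketch the target network is simply a delayed copy of the main network, matching the main/target pairing described above.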
The sumtree is the binary tree structure shown in Fig. 3. The root node is at the top, the branch nodes are in the middle layers, and only the leaf nodes at the bottom store samples. Each parent node contains the sum of its two children; the root node is therefore the sum of all priorities, denoted p_total. Because this data structure provides an efficient way of computing cumulative sums of priorities, the sumtree helps to store, update and sample proportional quantities efficiently. During storage, the obtained data are written into the leaf nodes from left to right; once the leaf nodes are full, the oldest data are overwritten one by one from the left. A notable advantage of this approach is that transitions do not need to be sorted by priority, which greatly reduces the computational burden and facilitates real-time training. Before the iterations start, the capacity of the sumtree leaf nodes must be determined and the priority values of the leaf nodes initialized.
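A compact Python sketch of such a sum-tree is shown below; it is an illustrative helper written for this description (array layout and method names are assumptions), not the storage structure actually claimed.

```python
import numpy as np

class SumTree:
    """Binary tree whose leaves hold sample priorities and whose internal nodes
    hold the sum of their children; the root equals p_total."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = np.zeros(2 * capacity - 1)           # internal nodes + leaves
        self.data = np.empty(capacity, dtype=object)     # (s, a, r, s_next, done)
        self.write = 0                                   # next leaf slot to (over)write
        self.size = 0

    def add(self, priority, sample):
        leaf = self.write + self.capacity - 1
        self.data[self.write] = sample
        self.update(leaf, priority)
        self.write = (self.write + 1) % self.capacity    # overwrite oldest when full
        self.size = min(self.size + 1, self.capacity)

    def update(self, leaf, priority):
        change = priority - self.tree[leaf]
        self.tree[leaf] = priority
        while leaf != 0:                                 # propagate the change upward
            leaf = (leaf - 1) // 2
            self.tree[leaf] += change

    def get(self, value):
        """Descend from the root to the leaf whose cumulative-sum interval
        contains `value`, where value is drawn in [0, p_total)."""
        idx = 0
        while 2 * idx + 1 < len(self.tree):
            left = 2 * idx + 1
            if value <= self.tree[left]:
                idx = left
            else:
                value -= self.tree[left]
                idx = left + 1
        return idx, self.tree[idx], self.data[idx - self.capacity + 1]

    @property
    def total(self):
        return self.tree[0]
```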
When a change of the environment state is perceived, the agent controls the CBESS to feed back the corresponding action a_t. The action space of the CBESS is divided into K discrete charging/discharging options P_be^(k), uniformly discretizing the action space A:

where A is the set of all possible actions and P_be^(k) denotes the k-th charging/discharging action in the uniformly discretized action space of the CBESS.
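For example (the sign convention and the rated power are assumptions made for illustration), the K uniformly spaced charging/discharging options can be generated as:

```python
import numpy as np

def discretize_actions(p_max, k=11):
    """Uniformly discretize the CBESS action space into K charge/discharge power
    levels in [-p_max, +p_max]; here negative values denote discharging and
    positive values denote charging."""
    return np.linspace(-p_max, p_max, k)

A = discretize_actions(p_max=1.0, k=11)   # e.g. [-1.0, -0.8, ..., 0.8, 1.0]
```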
5.2 The specific process of step S2 is:

The Markov decision process of the CBESS is established, and the charging and discharging behavior of the CBESS is mapped onto a reinforcement learning process based on iterative action-value updates, specifically:

The remaining energy of the BESS changes continuously during charging and discharging, and the change depends on the charged/discharged energy and the self-discharge within the period. The recursion for the charging process of the energy storage is

SoC(t) = (1 - σ_sdr)·SoC(t-1) + P_be·(1 - L_c)·Δt / E_cap

The discharging process of the energy storage is expressed as

SoC(t) = (1 - σ_sdr)·SoC(t-1) - P_be·Δt / [E_cap·(1 - L_dc)]

where SoC(t) is the state of charge (SoC) of the CBESS in period t; P_be(t) is the charging/discharging power of the CBESS in period t; σ_sdr is the self-discharge rate of the storage medium; L_c and L_dc are the charging and discharging losses of the CBESS, respectively; Δt is the length of each calculation window.

The maximum allowable charging/discharging power of the CBESS at time t is determined by its own charging/discharging characteristics and by the remaining state of charge at time t, and during operation the following constraint is satisfied:

SoC_min ≤ SoC(t) ≤ SoC_max

where SoC_max and SoC_min are the upper and lower limits of the CBESS state-of-charge constraint, respectively.
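The two recursions and the limit check can be coded directly; the sketch below is illustrative only (the signed-power convention, where p_be < 0 denotes discharging, is an assumption of the example):

```python
def update_soc(soc_prev, p_be, dt, e_cap, sigma_sdr, l_c, l_dc):
    """One-step SoC recursion of the CBESS: charging branch for p_be >= 0,
    discharging branch otherwise."""
    if p_be >= 0:
        # charging: conversion losses reduce the energy actually stored
        return (1 - sigma_sdr) * soc_prev + p_be * (1 - l_c) * dt / e_cap
    # discharging: losses increase the energy drawn from the storage
    return (1 - sigma_sdr) * soc_prev - abs(p_be) * dt / (e_cap * (1 - l_dc))

soc_next = update_soc(soc_prev=0.5, p_be=0.2, dt=1.0, e_cap=2.0,
                      sigma_sdr=0.001, l_c=0.05, l_dc=0.05)
soc_violated = not (0.0 <= soc_next <= 1.0)   # used later for done_t and r_exc
```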
Reinforcement learning is a kind of learning that maps environment states to actions, with the goal of maximizing the cumulative reward obtained by the agent while interacting with the environment. RL uses a Markov decision process (MDP) to simplify its modeling; an MDP is usually defined as a four-tuple (S, A, r, f), where S is the set of all environment states and s_t ∈ S denotes the state of the agent at time t; A is the set of actions the agent can perform and a_t ∈ A denotes the action taken by the agent at time t; r is the reward function and r_t ~ r(s_t, a_t) denotes the immediate reward obtained by the agent for executing action a_t in state s_t; f is the state-transition probability distribution function and s_{t+1} ~ f(s_t, a_t) denotes the probability that the agent transitions to the next state s_{t+1} after executing action a_t in state s_t. The goal of the Markov model is, after the state s has been initialized, to find an optimal planning policy V^{π*} that maximizes the expected sum of rewards:

where E_π denotes the expectation of the value under policy π, and 0 < γ < 1 is a decay coefficient that characterizes the importance of future rewards in reinforcement learning.
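The expression itself appears in the original only as a figure; in the standard form consistent with the surrounding definitions it can be written as:

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\Big|\, s_{0}=s \right],
\qquad
\pi^{*} = \arg\max_{\pi} V^{\pi}(s), \qquad 0 < \gamma < 1
```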
When the scale of the problem is relatively small, the algorithm is relatively easy to solve. For practical problems, however, the state space is usually very large; the computational cost of traditional iterative solution is too high, and it suffers from convergence difficulties, slow convergence speed and a tendency toward over-estimation, so the method proposed in the present invention is needed for an improved solution. For the data-driven technique for online control of the grid-connected shared energy storage system proposed in the present invention, the mapping is as follows:

(1) Environmental state features

The environmental state feature vector perceived by the CBESS at time t is defined as

s_t = [t, SOC_t^{be}, pric_t^{b.pre}, pric_t^{s.pre}, P_t^{mg.CHE}, P_t^{mg.grid}]^T,  s_t ∈ S

where t is the time index; pric_t^{b.pre} and pric_t^{s.pre} denote the predicted selling and purchasing electricity prices of the upper-level grid in period t, respectively; P_t^{mg.CHE} and P_t^{mg.grid} denote the pre-transaction power between the microgrid and the CBESS and between the microgrid and the upper-level grid, respectively.
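For illustration (the function and variable names are assumptions of this sketch), the 6-dimensional state vector can be assembled as:

```python
import numpy as np

def build_state(t, soc, price_buy_pred, price_sell_pred, p_mg_che, p_mg_grid):
    """Assemble the state feature vector s_t perceived by the CBESS agent."""
    return np.array([t, soc, price_buy_pred, price_sell_pred, p_mg_che, p_mg_grid],
                    dtype=np.float32)
```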
(2) Feedback reward

In the continuous perception and learning process of the CBESS, after the environment state s_t is given and the action a_t has been selected, the single-step immediate reward r_t obtained includes:

1) The CBESS obtains an energy arbitrage profit (EAP) by charging in off-peak periods and then discharging in peak periods. After the actual traded power with the microgrid and with the upper-level grid has been determined, the reward income r_EAP is calculated according to the real-time prices.

2) In addition to the basic per-unit electricity cost c_be of the CBESS, when its stored energy is close to its limit it may continue operating, which increases the cost. Finally, the total operation and maintenance cost C_{o,m} of the CBESS is given by

C_1 = |P_be| · c_be

3) The CBESS is able to mitigate the negative impact of the MG on the distribution network. A negative-reward term with coefficient σ is therefore added as a penalty to suppress fluctuation of the power at the grid connection point (P_exc_grid):

r_line = -σ·|P_exc_grid|

4) Once an executed action causes the SOC to exceed [0,1], a large penalty r_exc must be given to prevent the agent from making unreasonable decisions in subsequent learning. Finally, the immediate reward r_t is defined as:
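The composed reward expression appears in the original only as a figure; one plausible composition consistent with the four terms just described is sketched below (whether the SoC-violation penalty replaces or is added to the other terms, and the penalty magnitude, are assumptions of this sketch):

```python
def immediate_reward(r_eap, c_om, p_exc_grid, sigma, soc_next, r_exc=-100.0):
    """Single-step reward: arbitrage profit minus O&M cost and a penalty
    proportional to the grid-connection-point power deviation; a large fixed
    penalty is returned instead if the action drives the SoC outside [0, 1]."""
    if not (0.0 <= soc_next <= 1.0):
        return r_exc                       # SoC limit violated
    r_line = -sigma * abs(p_exc_grid)      # grid-connection-point fluctuation penalty
    return r_eap - c_om + r_line
```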
5.3 The specific process of step S3 is:

Before each episode of iteration starts, the uncertainty data are initialized, including the load curve of the microgrid, the renewable distributed generation output and the market price signals. Specifically, the actual values of the load curve, the RDG output and the market electricity prices can be given first, and their forecast errors are assumed to follow certain normal distributions so as to characterize the uncertainty fluctuations.

5.4 The specific process of step S4 is:

For the MG model, the objective is to minimize the operating cost under the predicted price signals; the objective function of its economic dispatch (ED) model is as follows:

where T is the planning horizon; c_z^CDG is the generation cost of the z-th CDG and c_i^es is the operating cost of the i-th microgrid energy storage unit; P_{z,t}^CDG is the power output of the z-th CDG and P_{i,t}^es is the charging/discharging power of the i-th microgrid energy storage unit; P_t^{b.grid} and P_t^{s.grid} denote the selling and purchasing electricity prices of the upper-level distribution network in each period, and P_t^{b.CHE} and P_t^{s.CHE} denote the selling and purchasing prices published by the CBESS operator, respectively.

Based on the forecast data, the microgrid uses a mixed-integer linear programming (MILP) method to obtain the traded power P_t^{mg.CHE} and P_t^{mg.grid} with the CBESS and with the upper-level distribution network in this period, and publishes this transaction information; at the same time, by perceiving the external environment, the CBESS agent obtains the state feature vector s_t = [t, SOC_t, pric_t^{b.pre}, pric_t^{s.pre}, P_t^{mg.CHE}, P_t^{mg.grid}].

5.5 The specific process of step S5 is:

Use s_t as input to the main dueling Q-network to obtain the Q-value outputs for all actions. Use the ε-greedy method to select a corresponding action a_t from the current Q-value outputs and execute the current action a_t in state s_t. For the ε-greedy policy, a value ε ∈ (0,1) is set first; when an action is selected, the optimal action a* currently regarded as having the maximum Q-value is chosen greedily with probability (1-ε), while with probability ε a potential action is explored at random from all K discrete candidate actions.

Here ε gradually decreases from ε_ini to ε_fin over the course of the iterations, so that more exploration is encouraged in the early stage, while the later stage focuses mainly on greedy exploitation so that the algorithm can converge stably.
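A short sketch of the decaying ε-greedy selection, using the DuelingQNetwork sketch above (the linear decay schedule and the names are assumptions for illustration):

```python
import numpy as np
import torch

def epsilon_greedy_action(main_net, state, epsilon, num_actions):
    """With probability epsilon explore a random discrete action; otherwise
    greedily pick the action with the largest estimated Q-value."""
    if np.random.rand() < epsilon:
        return np.random.randint(num_actions)
    with torch.no_grad():
        q_values = main_net(torch.tensor(state, dtype=torch.float32).unsqueeze(0))
    return int(q_values.argmax(dim=1).item())

def decayed_epsilon(step, eps_ini=1.0, eps_fin=0.05, decay_steps=5000):
    """Linear decay of epsilon from eps_ini to eps_fin over decay_steps steps."""
    frac = min(step / decay_steps, 1.0)
    return eps_ini + frac * (eps_fin - eps_ini)
```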
5.6 The specific process of step S6 is:

S6: the state of charge SOC_t of the CBESS is updated to SOC_{t+1}, from which it is judged whether the current iteration is in a terminal state, and the immediate reward r_t after the action is computed. The binary variable done is used as the termination indicator and serves as the interruption flag of each iteration:

where done of the current iteration equals 1 if the state of charge violates its limits during operation of the energy storage, and 0 otherwise; done = 1 means the iteration terminates and is exited, and done = 0 means the iteration has not terminated.
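The indicator itself appears in the original only as a figure; written out in code, consistent with the description above, it is simply:

```python
def done_flag(soc_next):
    """Termination indicator of the current iteration: 1 if the state of charge
    has left the range [0, 1] during operation, otherwise 0."""
    return 1 if (soc_next < 0.0 or soc_next > 1.0) else 0
```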
S7: the MG performs a secondary MILP optimization according to the tradable power actually fed back by the CBESS, determines the power traded with the external systems in the current period, and at the same time gives the pre-transaction power P_{t+1}^{mg.CHE}, P_{t+1}^{mg.grid} for the next period as the perceived state information of the agent in the next period; at this point the state of the system is updated to s_{t+1};

S8: during the continuous iterative updates, the five-tuple {s_t, a_t, r_t, s_{t+1}, done}, consisting of the s_t, a_t, r_t, s_{t+1} and termination indicator done obtained in each period t, is stored in sequence in the leaf nodes of the sumtree. If the number of stored entries reaches the maximum capacity of the leaf nodes, the old data are overwritten one by one on a rolling basis as new data are stored, to keep the samples valid. Once the number of samples reaches the mini-batch training size m, m samples (j = 1, 2, ..., m) are randomly drawn from the leaf nodes according to the prioritized replay mechanism, and the current target Q-value y_j corresponding to each sample is computed:
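The target expression appears in the original only as a figure; the standard double-DQN target consistent with the main/target network pairing described here would read:

```latex
y_{j} = r_{j} + (1 - \mathrm{done}_{j})\,\gamma\,
Q\!\left(s_{j+1},\, \arg\max_{a'} Q(s_{j+1}, a';\, \theta);\, \theta^{-}\right)
```

where θ denotes the parameters of the main dueling Q-network and θ⁻ those of the target network.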
For the prioritized replay mechanism, more important sample data are replayed at a higher frequency. The TD error δ therefore needs to be computed and stored, and samples with a larger |δ| are more likely to be sampled. A proportional prioritization strategy is adopted, which is a stochastic sampling strategy lying between a purely greedy strategy and uniform sampling, i.e. the probability P(i) with which the i-th sample is drawn is:

where α ∈ [0,1] is the exponent that converts the importance of the TD error into a priority. If α = 0, the scheme reduces to uniform random sampling. p_i is the priority of transition i, computed as shown below.
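The two expressions referred to above appear only as figures in the original; in the standard proportional-prioritization form, with the priority definition matching the one given earlier in the summary, they read:

```latex
P(i) = \frac{p_{i}^{\alpha}}{\sum_{k} p_{k}^{\alpha}}, \qquad p_{i} = |\delta_{i}| + \zeta
```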
where ζ is a small positive offset that ensures edge samples whose TD error is 0 can still be drawn. The above process changes the expected distribution of the stochastic updates, and the converged solution therefore changes with it. In view of this, importance-sampling (IS) weights are used to correct the bias, giving a mean-squared-error loss function L_i(θ_i) that takes the sample priorities into account. Finally, all parameters θ of the main dueling Q-network are updated through gradient back-propagation of the neural network:

ω_j = (N·P(j))^{-β} / max_i ω_i

θ_i = θ_{i-1} + α∇_{θ_i} L_i(θ_i)

where ω_j is the IS weight of sample j and β is a hyperparameter that gradually increases to 1. Fig. 4 summarizes the structure of the prioritized experience replay algorithm.
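Putting the pieces together, a sketch of prioritized sampling with importance-sampling weights and a single double-DQN update step is given below; it reuses the DuelingQNetwork and SumTree sketches above, and all hyperparameter values are placeholders rather than the patented settings.

```python
import numpy as np
import torch

def sample_batch(tree, m, beta):
    """Draw m transitions proportionally to priority from the SumTree and compute
    IS weights w_j = (N * P(j))^(-beta) / max_i w_i."""
    segment = tree.total / m
    leaves, batch, probs = [], [], []
    for j in range(m):
        v = np.random.uniform(j * segment, (j + 1) * segment)
        leaf, priority, data = tree.get(v)
        leaves.append(leaf)
        batch.append(data)
        probs.append(priority / tree.total)
    probs = np.array(probs)
    weights = (tree.size * probs) ** (-beta)
    weights = weights / weights.max()
    return leaves, batch, torch.tensor(weights, dtype=torch.float32)

def train_step(main_net, target_net, optimizer, tree,
               m=32, gamma=0.95, beta=0.4, alpha=0.6, zeta=1e-5):
    leaves, batch, weights = sample_batch(tree, m, beta)
    s, a, r, s_next, done = map(np.array, zip(*batch))
    s = torch.tensor(s, dtype=torch.float32)
    s_next = torch.tensor(s_next, dtype=torch.float32)
    a = torch.tensor(a, dtype=torch.int64)
    r = torch.tensor(r, dtype=torch.float32)
    done = torch.tensor(done, dtype=torch.float32)

    q = main_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # double-DQN target: action chosen by the main net, evaluated by the target net
        a_star = main_net(s_next).argmax(dim=1, keepdim=True)
        q_next = target_net(s_next).gather(1, a_star).squeeze(1)
        y = r + (1.0 - done) * gamma * q_next

    td_error = y - q
    loss = (weights * td_error.pow(2)).mean()   # IS-weighted mean-squared-error loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # refresh stored priorities as (|delta_i| + zeta)^alpha so that sampling
    # proportional to the stored values realizes P(i) = p_i^alpha / sum_k p_k^alpha
    for leaf, delta in zip(leaves, td_error.detach().abs().numpy()):
        tree.update(leaf, float((delta + zeta) ** alpha))
```

A full training loop would call train_step once per environment step after the sumtree holds at least m transitions, mirroring step S8.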
S9: after the Q-network update, the priorities p_i of the data stored in the sumtree are recomputed and updated, the parameters of the main dueling Q-network are periodically copied to the target Q-network, and the current state is set to s = s_{t+1}. If s is a terminal state or the number of iteration rounds T has been reached, the current episode of iteration is finished and the procedure returns to S3 to continue the loop; otherwise it goes to step S5 to continue iterating.

The above are only preferred embodiments of the present invention. It should be understood that the present invention is not limited to the forms disclosed herein, and this should not be regarded as an exclusion of other embodiments; the invention can be used in various other combinations, modifications and environments, and can be modified within the scope of the concept described herein through the above teachings or through the skill or knowledge of the relevant field. Modifications and changes made by those skilled in the art that do not depart from the spirit and scope of the present invention shall all fall within the protection scope of the appended claims of the present invention.
Claims (8)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010754472.1A CN112003269B (en) | 2020-07-30 | 2020-07-30 | Intelligent on-line control method of grid-connected shared energy storage system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010754472.1A CN112003269B (en) | 2020-07-30 | 2020-07-30 | Intelligent on-line control method of grid-connected shared energy storage system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112003269A CN112003269A (en) | 2020-11-27 |
CN112003269B true CN112003269B (en) | 2022-06-28 |
Family
ID=73462676
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010754472.1A Active CN112003269B (en) | 2020-07-30 | 2020-07-30 | Intelligent on-line control method of grid-connected shared energy storage system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112003269B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112670982B (en) * | 2020-12-14 | 2022-11-08 | 广西电网有限责任公司电力科学研究院 | Active power scheduling control method and system for micro-grid based on reward mechanism |
CN112671033B (en) * | 2020-12-14 | 2022-12-23 | 广西电网有限责任公司电力科学研究院 | Priority-level-considered microgrid active scheduling control method and system |
CN113126498A (en) * | 2021-04-17 | 2021-07-16 | 西北工业大学 | Optimization control system and control method based on distributed reinforcement learning |
CN114243650B (en) * | 2021-11-17 | 2024-12-03 | 国网浙江省电力有限公司绍兴供电公司 | A distribution network area automation protection method based on low voltage intelligent switch |
CN114048576B (en) * | 2021-11-24 | 2024-05-10 | 国网四川省电力公司成都供电公司 | Intelligent control method for energy storage system for stabilizing power transmission section tide of power grid |
CN114285854B (en) * | 2022-03-03 | 2022-07-05 | 成都工业学院 | Edge computing system and method with storage optimization and security transmission capability |
CN116316755B (en) * | 2023-03-07 | 2023-11-14 | 西南交通大学 | An energy management method for electrified railway energy storage systems based on reinforcement learning |
CN117541036B (en) * | 2024-01-10 | 2024-04-05 | 中网华信科技股份有限公司 | Energy management method and system based on intelligent park |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109347149A (en) * | 2018-09-20 | 2019-02-15 | 国网河南省电力公司电力科学研究院 | Microgrid energy storage scheduling method and device based on deep Q-value network reinforcement learning |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150184549A1 (en) * | 2013-12-31 | 2015-07-02 | General Electric Company | Methods and systems for enhancing control of power plant generating units |
-
2020
- 2020-07-30 CN CN202010754472.1A patent/CN112003269B/en active Active
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109347149A (en) * | 2018-09-20 | 2019-02-15 | 国网河南省电力公司电力科学研究院 | Microgrid energy storage scheduling method and device based on deep Q-value network reinforcement learning |
Non-Patent Citations (1)
Title |
---|
Deep reinforcement learning algorithm for voltage regulation of distribution networks containing energy storage systems; Shi Jingjian et al.; Electric Power Construction (电力建设); 2020-03-01 (No. 03); full text *
Also Published As
Publication number | Publication date |
---|---|
CN112003269A (en) | 2020-11-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112003269B (en) | Intelligent on-line control method of grid-connected shared energy storage system | |
Guo et al. | Real-time optimal energy management of microgrid with uncertainties based on deep reinforcement learning | |
CN112614009B (en) | Power grid energy management method and system based on deep expectation Q-learning | |
CN111884213B (en) | Power distribution network voltage adjusting method based on deep reinforcement learning algorithm | |
Huang et al. | A control strategy based on deep reinforcement learning under the combined wind-solar storage system | |
CN107706932B (en) | An Energy Scheduling Optimization Method Based on Dynamic Adaptive Fuzzy Logic Controller | |
CN116247648A (en) | Deep reinforcement learning method for micro-grid energy scheduling under consideration of source load uncertainty | |
CN113935463A (en) | Microgrid controller based on artificial intelligence control method | |
CN110751318A (en) | IPSO-LSTM-based ultra-short-term power load prediction method | |
CN112952831B (en) | Daily optimization operation strategy for providing stacking service by load side energy storage | |
CN114069650B (en) | Power distribution network closed loop current regulation and control method and device, computer equipment and storage medium | |
CN117621898B (en) | Smart parking lot charging pile charging control method and system considering grid electricity price | |
CN116451880B (en) | Distributed energy optimization scheduling method and device based on hybrid learning | |
CN115313403A (en) | Real-time voltage regulation and control method based on deep reinforcement learning algorithm | |
CN115940294A (en) | Multi-level power grid real-time dispatching strategy adjustment method, system, equipment and storage medium | |
CN117117989A (en) | Deep reinforcement learning solving method for unit combination | |
Liu et al. | Deep reinforcement learning for real-time economic energy management of microgrid system considering uncertainties | |
CN117220318A (en) | Power grid digital driving control method and system | |
CN115345380A (en) | A new energy consumption power dispatching method based on artificial intelligence | |
CN114298429A (en) | Power distribution network scheme aided decision-making method, system, device and storage medium | |
CN117893043A (en) | Hydropower station load distribution method based on DDPG algorithm and deep learning model | |
CN115001002B (en) | Optimal scheduling method and system for solving problem of energy storage participation peak clipping and valley filling | |
Cheng et al. | Real-time dispatch via expert knowledge driven deep reinforcement learning | |
CN114971250B (en) | Comprehensive energy economic dispatch system based on deep Q-learning | |
CN116826762A (en) | Smart distribution network voltage safety control methods, devices, equipment and media |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |