CN116436033A - Temperature control load frequency response control method based on user satisfaction and reinforcement learning - Google Patents

Temperature control load frequency response control method based on user satisfaction and reinforcement learning Download PDF

Info

Publication number
CN116436033A
Authority
CN
China
Prior art keywords
temperature
user satisfaction
temperature control
control
load
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310367857.6A
Other languages
Chinese (zh)
Inventor
陈汝斯
刘海光
蔡德福
李大虎
杨旋
周悦
周鲲鹏
孙冠群
王尔玺
王文娜
许典
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
State Grid Hubei Electric Power Co Ltd
Electric Power Research Institute of State Grid Hubei Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
State Grid Hubei Electric Power Co Ltd
Electric Power Research Institute of State Grid Hubei Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, State Grid Hubei Electric Power Co Ltd, Electric Power Research Institute of State Grid Hubei Electric Power Co Ltd filed Critical State Grid Corp of China SGCC
Priority to CN202310367857.6A priority Critical patent/CN116436033A/en
Publication of CN116436033A publication Critical patent/CN116436033A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J3/00Circuit arrangements for ac mains or ac distribution networks
    • H02J3/24Arrangements for preventing or reducing oscillations of power in networks
    • H02J3/241The oscillation concerning frequency
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/11Complex mathematical operations for solving equations, e.g. nonlinear equations, general mathematical optimization problems
    • G06F17/13Differential equations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/004Artificial life, i.e. computing arrangements simulating life
    • G06N3/006Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/092Reinforcement learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00Computing arrangements based on specific mathematical models
    • G06N7/02Computing arrangements based on specific mathematical models using fuzzy logic
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/04Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06Energy or water supply
    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J3/00Circuit arrangements for ac mains or ac distribution networks
    • H02J3/38Arrangements for parallely feeding a single network by two or more generators, converters or transformers
    • H02J3/46Controlling of the sharing of output between the generators, converters, or transformers
    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J2203/00Indexing scheme relating to details of circuit arrangements for AC mains or AC distribution networks
    • H02J2203/20Simulating, e.g. planning, reliability check, modelling or computer assisted design [CAD]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04SSYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S20/00Management or operation of end-user stationary applications or the last stages of power distribution; Controlling, monitoring or operating thereof
    • Y04S20/20End-user application control systems
    • Y04S20/222Demand response systems, e.g. load shedding, peak shaving

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Business, Economics & Management (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Economics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Human Resources & Organizations (AREA)
  • Strategic Management (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Power Engineering (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Algebra (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Operations Research (AREA)
  • Marketing (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Water Supply & Treatment (AREA)
  • Public Health (AREA)
  • Game Theory and Decision Science (AREA)
  • Development Economics (AREA)
  • Primary Health Care (AREA)
  • Quality & Reliability (AREA)
  • Databases & Information Systems (AREA)

Abstract

The invention relates to a temperature-controlled load frequency response control method based on user satisfaction and reinforcement learning, comprising a method for quantifying the satisfaction of temperature-controlled load users and the construction of a deep reinforcement learning agent model. Considering the two control modes of a temperature-controlled load, direct switch control and temperature-setpoint control, an energy-storage index and a discomfort index are defined as the respective load adjustment indexes, and user satisfaction is evaluated with a fuzzy comprehensive evaluation method. A multi-agent model of the temperature-controlled loads is then established based on the soft actor-critic algorithm: user satisfaction and the frequency regulation error are weighted into a comprehensive evaluation index that enters the objective function of the agents' optimization, and each agent updates its parameters from local temperature-controlled load information and the frequency deviation, so that the model learns adaptively and solves the cooperative control problem of temperature-controlled loads participating in frequency response. Compared with the prior art, the invention reduces the system frequency deviation and improves user satisfaction.

Description

Temperature control load frequency response control method based on user satisfaction and reinforcement learning
Technical Field
The invention relates to the technical field of temperature-controlled load frequency response control, and in particular to a temperature-controlled load frequency response control method based on user satisfaction and deep reinforcement learning.
Background
With the rising share of renewable energy in the power grid, the intermittency and volatility of renewable generation pose great challenges to the active power balance and frequency stability of the grid. The traditional power system maintains balance by adjusting the output of generation-side units; this single means of regulation incurs additional economic and environmental costs. Moreover, with the growth of electric load and the wide integration of renewable energy, the regulation capability of the generation side gradually decreases. A new type of power system dominated by renewable energy can use advanced information technology to aggregate and dispatch demand-side resources to provide various ancillary services. Therefore, rational control of demand-side resources can supplement traditional system frequency regulation and thereby enhance the stability of the power system.
Among demand-side resources, a thermostatically controlled load (TCL) is a class of electric equipment whose switching is governed by a thermostat, which converts electricity into heat (or cooling) and whose temperature is adjustable; examples include heat pumps, water heaters, refrigerators, and heating, ventilation and air-conditioning units. Temperature-controlled loads are suited to providing frequency regulation services for three main reasons: first, they are widely distributed in residential, commercial and industrial buildings, so their adjustable potential is large; second, they have good heat storage capacity and can be regarded as distributed energy storage devices; third, their control is flexible and can respond promptly to the power requirements of the system. Therefore, to fully exploit the frequency regulation potential of flexible demand-side resources and keep the grid frequency within a permissible offset range, the control strategy of large-scale demand-side temperature-controlled loads requires intensive research.
The prior art mainly adopts centralized control, decentralized control and hybrid control. Some researchers have established a layered centralized load-tracking control framework that coordinates heterogeneous demand-side temperature-controlled load aggregators and models them with a state-space model. Decentralized control moves the decision mechanism down to the local controller: programs or thresholds are preset locally, and when the load-side device detects a significant parameter change, the load acts according to the preset strategy. Because decisions are made at the local port, the communication requirement is low and the response is fast, but the control effect is strongly affected by user behaviour and by errors of the detection devices. Other studies optimize each load's settings with multi-objective optimization to reduce the required load response and trigger decentralized load control from a frequency response index. Hybrid control combines the characteristics of centralized and decentralized control in a 'centralized parameter setting, decentralized decision' framework that coordinates large numbers of users with the grid control centre through a load aggregator (LA); researchers have established a two-stage hybrid control model for participating in energy market transactions and used hybrid-controlled temperature-controlled loads to smooth photovoltaic and load variations in a microgrid community, although a communication network must then be built between the control centre and all aggregators. In research on temperature-controlled loads providing ancillary services, one work builds a dynamic model and uses direct load control to verify that variable-frequency heat pumps perform well in providing frequency regulation, but it mainly studies the dynamic response of a single air conditioner and gives little discussion to the coordinated control of large-scale air-conditioning loads. Others establish a virtual energy-storage model of variable-frequency air conditioners, shield part of the model information through a layered control framework and simplify the downlink control with a unified broadcast signal, but sacrifice some of the adjustable capacity of the air-conditioner clusters in doing so.
There are two main control modes for temperature-controlled loads: direct switching and temperature setting. Some researchers regulate frequency through direct load switching; within the adjustable capacity of the load, this gives high tracking accuracy and little impact on user comfort. Its drawback is that when the indoor temperatures of the loads concentrate near a temperature boundary, the equipment switches on and off frequently, which not only prevents the regulation task from being completed but also shortens the life of the equipment. Temperature setting avoids these drawbacks, but its power-tracking performance depends on the designed controller (commonly a minimum-variance controller, a sliding-mode controller, an internal model controller, and the like), and it also suffers from a large temperature variation range that affects user comfort. Researchers have built residential building energy management systems (EMS) based on optimization techniques combined with machine-learning models, trained and tested demand-response controllers on real residential data, and reduced energy consumption while maintaining thermal comfort. Accounting for user satisfaction in the load response control process is therefore important for keeping users willing to participate in frequency regulation. One study proposes a hybrid control strategy based on a parallel structure that improves the tracking accuracy of the system and reduces the switching count of the devices, but its temperature variation range is very large, which reduces user comfort.
Recent advances in reinforcement learning offer a new solution to the frequency control problem of power systems; their strong search and learning capability gives them the potential for online optimal decision-making in complex nonlinear frequency control problems. Researchers have used the Q-learning algorithm of deep reinforcement learning to coordinate distributed generation units and eliminate the system frequency deviation. However, Q-learning can only select control actions from a discretized low-dimensional action domain and thus cannot handle problems with continuous variables. Others have proposed deep reinforcement learning algorithms acting on continuous action domains to achieve adaptive load frequency control, but these optimize only a single generating unit or a single residential building and are not suitable for controlling large-scale temperature-controlled loads.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a temperature-controlled load frequency response control method based on user satisfaction and reinforcement learning; the deep reinforcement learning control strategy, built on the soft actor-critic framework, can reduce system frequency fluctuation while improving user satisfaction.
The aim of the invention can be achieved by the following technical scheme:
a temperature control load frequency response control method based on user satisfaction and reinforcement learning comprises the following steps:
1) Establishing a temperature control load model and a power system frequency response model with the temperature control load participating in frequency modulation by adopting a first-order ordinary differential equation;
2) Aiming at temperature control loads adopting two control modes of direct switch control and temperature setting control, respectively establishing user satisfaction degree adjustment indexes of the temperature control loads under the two control modes;
3) According to the user satisfaction degree adjustment index established in the step 2), comprehensive evaluation of the user satisfaction degree is carried out by utilizing a fuzzy comprehensive evaluation method to obtain the user satisfaction degree;
4) Defining a frequency adjustment error index according to the frequency error signal and the tracking power signal of the power system in the control period, and carrying out weighted combination on the user satisfaction degree obtained in the step 3) and the frequency adjustment error index to obtain a comprehensive evaluation index;
5) Establishing a deep reinforcement learning agent model based on the soft actor-critic algorithm, constructing the agent action space and state space from the frequency change of the power system and the operating-state information of the demand-side temperature-controlled loads, and constructing the reward function of the agent model from the comprehensive evaluation index obtained in step 4);
6) Training the agent model with the soft actor-critic algorithm to solve for its optimal strategy. The training process comprises constructing the agent objective function, iterating and updating the agent policy, and updating the agent parameters. The objective function of the agent is built from the action space, state space and reward function of step 5) combined with the policy entropy, and the agent maximizes this objective by continuously optimizing its policy. In this process, the agent performs policy iteration with the Bellman operator and then updates the policy by minimizing the divergence between the new and old policies. The Q-value network and policy network of the agent model are constructed as neural networks, and the Q-value network, the policy network and the temperature parameter iteratively update the neural network parameters according to their respective update rules, so that the objective function of the agent model steadily converges and the optimal strategy of the agent model is obtained;
7) Applying the trained agent model of step 6) online to an actual temperature-controlled load cluster: the real-time operating state, user satisfaction and grid frequency information of the cluster are input into the agent control model, the trained agent rapidly computes the control instruction of the cluster at the current moment, and the cluster adjusts its loads according to that instruction.
Further, the step 1) adopts a first-order ordinary differential equation to establish a temperature control load model, and the specific steps include:
11 Establishing a first-order ordinary differential equation model introducing a state variable and a virtual variable to represent the dynamic characteristics of any temperature control load;
12 Calculating the sum of rated powers of the temperature control load clusters according to the dynamic characteristic equation of the single temperature control load.
Further, step 2) establishes the user satisfaction adjustment indexes of temperature-controlled loads under the two control modes, with the following specific steps:
21) For a directly switch-controlled temperature-controlled load cluster, the influence of the temperature setpoint on user comfort is neglected because the control acts directly on the device switches; an energy-storage index C_s is defined for the cluster, and the agent issues control instructions that keep C_s as close to 0 as possible, reducing the start-stop frequency of the equipment;
22) For a temperature-setpoint-controlled temperature-controlled load cluster, a discomfort index C_u is defined, and the agent issues control instructions that keep C_u as close to 0 as possible, reducing user discomfort.
Further, in step 3), according to the user satisfaction adjustment indexes established in step 2), user satisfaction is evaluated comprehensively with a fuzzy comprehensive evaluation method, with the following specific steps:
31) Construct the user satisfaction factor set containing the energy-storage index C_s and the discomfort index C_u, i.e. U = {C_s, C_u};
32) Construct the user satisfaction comment set, with five comment grades set according to the degree of user satisfaction, i.e. V = {satisfied, fairly satisfied, neutral, less satisfied, dissatisfied};
33) Determine the weight of each influencing factor; since the factor set consists of the energy-storage index C_s and the discomfort index C_u, which are equally important to the user, the weights are set to [0.5, 0.5];
34) Establish the fuzzy judgment matrix, judging the degree to which each factor belongs to each comment, with the membership function chosen as a Gaussian function;
35) Perform the fuzzy comprehensive judgment and evaluate the user satisfaction m, defined so that a smaller m means higher user satisfaction.
Further, in step 4), the user satisfaction obtained in step 3) and the frequency adjustment error index are weighted and combined into a comprehensive evaluation index, with the following specific steps:
41) Evaluate the tracking performance of the system and define the frequency adjustment error index E_RMS; the smaller E_RMS is, the higher the tracking accuracy of the system;
42) Weight and combine the frequency adjustment error index E_RMS with the user satisfaction m to define the comprehensive evaluation index J.
Further, step 5) establishes a deep reinforcement learning agent model based on the soft actor-critic algorithm, with the following specific steps:
51) Establish the input information of the agent model, i.e. the state space of the agent, consisting of the switching states, rated powers, indoor and outdoor temperatures and temperature setpoints of the temperature-controlled load cluster controlled by the agent, the frequency deviation of the power system, and the user satisfaction m calculated in step 3); the state space is input into the agent model to give the agent its perception of the environment;
52) Establish the output control instruction of the agent model, i.e. the action space of the agent; according to the two control modes of direct switch control and temperature-setpoint control, the control instructions of the temperature-controlled loads are set as load switching commands and temperature setpoints, and the constraints on the control instructions are the frequent-switching limit and the allowed setpoint temperature range of the loads;
53) Establish the optimization target of the agent model from the comprehensive evaluation index of step 4), i.e. the reward function required by the agent model, which is set to the negative of the comprehensive evaluation index J formed by the weighted combination of user satisfaction and the frequency adjustment error index.
Further, the objective function of the soft actor-critic algorithm in step 6) maximizes the policy entropy while maximizing the cumulative reward, and the specific steps of constructing the agent objective function are:
61) Construct the objective function containing the entropy regularization term, i.e.

$$J(\pi) = \sum_{q} \mathbb{E}_{(s_q,a_q)\sim p_\pi}\left[ r(s_q,a_q) + \alpha\, \mathcal{H}\big(\pi(\cdot\,|\,s_q)\big) \right]$$

wherein: $\mathbb{E}(\cdot)$ is the expectation; $\pi$ is the policy; $s_q$ is the state space of the q-th agent; $a_q$ is the action space of the q-th temperature-controlled load; $r(s_q,a_q)$ is the reward function of the q-th agent; $(s_q,a_q)\sim p_\pi$ is the state-action trajectory generated by policy $\pi$; $\alpha$ is the temperature term, which determines how strongly the entropy influences the reward; $\mathcal{H}(\pi(\cdot\,|\,s_q))$ is the entropy term of the policy in state $s_q$;
62) Set the entropy term of the policy, calculated as:

$$\mathcal{H}\big(\pi(\cdot\,|\,s_q)\big) = \mathbb{E}_{a_q\sim\pi}\left[-\log \pi(a_q\,|\,s_q)\right]$$
Further, in step 6), the agent performs policy iteration with the Bellman operator, constructed as follows:
71) The cost function consists of the reward function and the expectation over the next state; the Bellman backup operator used for policy updating contains the expectation of the reward function and the new value function, calculated as:

$$T^\pi Q(s_q,a_q) = r(s_q,a_q) + \gamma\,\mathbb{E}_{s_{q+1}}\big[V(s_{q+1})\big]$$

$$Q_{k+1} = T^\pi Q_k$$

wherein: $\mathbb{E}_{s_{q+1}}[\cdot]$ is the expectation over the next state $s_{q+1}$; $T^\pi$ is the Bellman backup operator under policy $\pi$; $\gamma$ is the reward discount factor; $V(s_{q+1})$ is the new value function of state $s_{q+1}$:

$$V(s_{q+1}) = \mathbb{E}_{a_{q+1}\sim\pi}\big[Q(s_{q+1},a_{q+1}) - \alpha \log \pi(a_{q+1}\,|\,s_{q+1})\big]$$
Further, in step 6), the Q-value network outputs a single value through its neural network, and the Q-value network parameters are updated as follows:

$$J_Q(\theta) = \mathbb{E}_{(s_q,a_q)}\left[\frac{1}{2}\Big(Q_\theta(s_q,a_q) - \big(r(s_q,a_q) + \gamma\,\mathbb{E}_{s_{q+1}}[V_\theta(s_{q+1})]\big)\Big)^2\right]$$

wherein: $\theta$ is the Q-value network parameter; $\phi$ is the policy network parameter; $V_\theta$ and $Q_\theta$ are the new value function and the cost function after substituting the Q-value network parameters;
the policy network output is a Gaussian distribution, and the policy network updates the policy as follows:

$$J_\pi(\phi) = \mathbb{E}_{s_q}\left[ D_{KL}\!\left( \pi_\phi(\cdot\,|\,s_q)\,\Big\Vert\, \frac{\exp\big(Q_\theta(s_q,\cdot)\big)}{Z(s_q)} \right)\right]$$

wherein: $Z(s_q)$ is the distribution function in state $s_q$;
the temperature parameter is updated to realize iterative exploration of all feasible actions, with the update rule:

$$J(\alpha) = \mathbb{E}_{a_q\sim\pi_q}\left[-\alpha \log \pi_q(a_q\,|\,s_q) - \alpha H_0\right]$$

wherein: $\pi_q$ is the control strategy of the q-th agent; $H_0$ is the entropy term;
the deep neural networks continuously update the Q-value network parameters, the policy network parameters and the temperature parameter through learning, so that the model steadily converges and the optimal strategy of the agent model is solved.
Compared with the prior art, the temperature control load frequency response control method based on user satisfaction and reinforcement learning has the following advantages:
1. The invention considers the influence of user satisfaction on temperature-controlled load frequency response. For switch-controlled and temperature-setpoint-controlled loads it establishes, respectively, an energy-storage index and a discomfort index as load adjustment indexes representing the satisfaction of temperature-controlled load users, evaluates user satisfaction comprehensively with a fuzzy comprehensive evaluation method to obtain a user satisfaction evaluation index, and takes this index as one of the optimization targets of the temperature-controlled loads' participation in frequency response. At the same time, considering the frequency regulation effect of the power system, the user satisfaction evaluation index and the frequency adjustment error index are weighted into an objective function that is set as the reward function of the agent model. The method markedly improves user satisfaction;
2. The invention establishes a deep reinforcement learning agent model based on the soft actor-critic (SAC) algorithm. Following a Markov decision process (MDP), the agent interacts continuously with the environment: it observes the environment state, takes actions that change that state, and receives corresponding rewards or penalties as guidance for updating the model parameters, so that through continuous learning it maximizes the cumulative reward and makes accurate and effective control decisions. The method markedly reduces frequency fluctuation.
Drawings
FIG. 1 is the frequency regulation model of a power system with temperature-controlled load participation in an embodiment of the invention;
FIG. 2 shows the operating characteristics of a temperature-controlled load under the two control modes, switch control and temperature setting, in an embodiment of the invention;
FIG. 3 is the decision-making process of the soft actor-critic deep reinforcement learning model in an embodiment of the invention.
Detailed Description
The invention will now be described in detail with reference to the drawings and specific examples. It will be apparent that the described embodiments are some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments in accordance with the present disclosure. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
The invention relates to a temperature-controlled load frequency response control method based on user satisfaction and reinforcement learning, and proposes a cooperative temperature-controlled load frequency control method that accounts for user satisfaction through soft actor-critic deep reinforcement learning, so as to solve the frequency control problem of a large-scale power system in which demand-side temperature-controlled loads participate in frequency regulation, thereby improving system frequency control and user satisfaction.
The main principle of the temperature control load frequency response control method based on user satisfaction and deep reinforcement learning established by the invention is as follows:
In terms of user satisfaction evaluation, the influence of the two control modes, switch control and temperature setting, on the satisfaction of temperature-controlled load users is considered. Fig. 2 shows the operating characteristics of a temperature-controlled load under the two control modes: for a directly switch-controlled load the temperature setpoint is kept unchanged and the adjustment command sets the load's switch state, whereas the dispatch command of a setpoint-controlled load adjusts the temperature setpoint up or down. To quantify user satisfaction under the different control modes, an energy-storage index and a discomfort index are established for the two kinds of loads respectively, and user satisfaction is evaluated with a fuzzy comprehensive evaluation method. Then, to realize cooperative frequency control of large-scale temperature-controlled loads, a multi-agent control model is established based on the soft actor-critic algorithm, with user satisfaction and the frequency adjustment deviation as the optimization targets and the on-off states and temperature setpoints of the large-scale loads as the optimization variables; the agents are trained by interacting with the environment, and the trained multi-agent reinforcement learning model, which accounts for user satisfaction, can realize online cooperative frequency response control of the temperature-controlled load cluster.
In terms of the control algorithm for large-scale temperature-controlled load frequency response, the deep reinforcement learning method based on the SAC algorithm lets the agent interact continuously with the environment: it acquires the environment state, takes actions that change that state, and obtains corresponding rewards or penalties as guidance for updating the model parameters, accumulating the maximum reward through continuous learning. In each iteration, the actor first observes the frequency deviation of the power system and the operating state $s_t$ of the temperature-controlled load cluster at that moment and generates an action $a_t$ (i.e. the control variables) through the policy network. The cluster then undergoes a state transition under this control strategy and reaches the next state $s_{t+1}$. Meanwhile, the system environment computes a reward $r(s_t,a_t)$ (the objective function) and feeds it back to the agent, which records $(s_t, a_t, r(s_t,a_t), s_{t+1})$ in the experience pool. The action samples of the actor are then input, together with the system state, into the critic, which computes the action-value function $Q(s_t,a_t)$ to evaluate the policy. This process repeats cyclically, with the actor and critic updating their neural network parameters by gradient descent, so that the model learns adaptively. During training, the cumulative return of the agents over the response period gradually increases and finally stabilizes. By introducing a maximum-entropy exploration incentive, the SAC reinforcement learning algorithm improves robustness and accelerates training, and can make accurate and effective control decisions for large-scale temperature-controlled loads in a complex power supply-demand environment.
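To make the interaction cycle above concrete, the following is a minimal schematic sketch of the observe-act-reward-store loop; the environment and update callables are hypothetical stand-ins for the grid/TCL simulation and the SAC parameter updates, not the patent's implementation.

```python
# Schematic of the actor-critic interaction cycle described above. `env_reset`,
# `env_step` and `update_networks` are hypothetical stand-ins for the grid/TCL
# simulation and the SAC gradient updates; nothing here is the patent's code.
def train_loop(pi, update_networks, env_reset, env_step, num_steps=10_000):
    replay = []                            # experience pool
    s = env_reset()                        # frequency deviation + cluster state
    for _ in range(num_steps):
        a = pi(s)                          # actor: policy network -> action a_t
        s2, r = env_step(a)                # environment returns r(s_t, a_t)
        replay.append((s, a, r, s2))       # record transition in the pool
        update_networks(replay)            # critic evaluates Q(s_t, a_t); both
        s = s2                             # networks updated by gradient descent
    return pi
```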
Based on the principle, the temperature control load frequency response control method based on user satisfaction and deep reinforcement learning specifically comprises the following steps:
In the first step, a first-order ordinary differential equation model that accounts for the indoor environment, the outdoor environment and the building characteristics offers high accuracy and simple calculation and is widely used in practice; it is adopted here to build the temperature-controlled load dynamic model and the frequency response model of a power system with temperature-controlled loads participating in frequency regulation (shown in Fig. 1). The specific operations are as follows:
11) Introduce the state variable $T_i$ and the virtual variable $s_i$ into the model; the operating characteristics of the i-th temperature-controlled load in cooling mode can be expressed as:

$$\frac{dT_i(k)}{dk} = \frac{T_\infty(k) - T_i(k)}{C_i R_i} - \frac{s_i(k)\,P_i}{C_i} \tag{1}$$

wherein the switching rule of $s_i(k)$ is:

$$s_i(k+\Delta k) = \begin{cases} 0, & T_i(k) \le T_i^{min} \\ 1, & T_i(k) \ge T_i^{max} \\ s_i(k), & \text{otherwise} \end{cases} \tag{2}$$

$$T_i^{max} = T_i^{set} + \frac{\delta}{2}, \qquad T_i^{min} = T_i^{set} - \frac{\delta}{2} \tag{3}$$

wherein: $T_\infty(k)$ and $T_i(k)$ are the outdoor and indoor temperatures, respectively; $C_i$, $R_i$ and $P_i$ are the equivalent heat capacity, equivalent thermal resistance and energy transfer rate of the i-th temperature-controlled load; $s_i(k)$ is the load switch state, with on state $s_i(k)=1$ and off state $s_i(k)=0$; $T_i^{max}$ and $T_i^{min}$ are the upper and lower temperature limits during load operation; $T_i^{set}$ is the temperature setpoint; $\delta$ is the width of the temperature dead band, a constant; $k$ and $\Delta k$ are the running time and the control period, respectively. Solving the differential equation yields:

$$T_i(k) = T_\infty - s_i R_i P_i + \big(T_i(0) - T_\infty + s_i R_i P_i\big)\,e^{-k/(C_i R_i)} \tag{4}$$

wherein: $T_i(0)$ is the initial indoor temperature.
For a load cluster consisting of N temperature-controlled loads, the aggregate power consumption $P_{total}(k)$ is the sum of the rated powers of all loads that are switched on, i.e.

$$P_{total}(k) = \sum_{i=1}^{N} s_i(k)\,P_i^{n} \tag{5}$$

$$P_i^{n} = \frac{P_i}{\eta_i} \tag{6}$$

wherein: $P_i^{n}$ is the rated power of the i-th temperature-controlled load; $\eta_i$ is the energy conversion efficiency coefficient of the i-th temperature-controlled load.
Fig. 1 shows the power system frequency response model with temperature-controlled loads participating in frequency regulation, wherein $T_{Ga}$ and $T_{Gb}$ are the time constants of the governor and the turbine, between which a transient-characteristic compensation link is set, a lead-lag transfer function with time constants $T_1$ and $T_2$; $T_R$ is the response delay time constant of the temperature-controlled loads; $T_c$ is the communication delay time constant; $R_{eq}$ is the equivalent droop rate of the unit; $\Delta P_G$ and $\Delta P_L$ are the total generator output power and the system disturbance power, respectively; $H$ and $D$ are the system inertia time constant and the load damping coefficient; and $\Delta f$ is the frequency deviation.
In the second step, user satisfaction quantification indexes are established for temperature-controlled loads under the two different control modes. The specific operations are as follows:
21) For a temperature-controlled load cluster under direct switch control in cooling mode, define the energy-storage index C_s:

[Equation (7): the expression defining the energy-storage index C_s.]

From the definition of C_s: the closer C_s is to 0, the closer the indoor temperatures are to their setpoints, the more uniform the temperature distribution of the temperature-controlled loads, the larger the adjustable potential, and the less frequently the switches operate. Therefore, when issuing control instructions, the agent should keep C_s as close to 0 as possible.
22) For a temperature-controlled load cluster under temperature-setpoint control, define the discomfort index C_u as:

$$C_u = \left| T_i^{set} - T_i^{set}(0) \right| \tag{8}$$

wherein $T_i^{set}(0)$ is the initial temperature setpoint. From the definition of C_u, the further the temperature setpoint deviates from its initial value, the higher the user's discomfort. Therefore, when issuing control instructions, the agent should keep C_u as close to 0 as possible, thereby reducing user discomfort.
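For illustration, the two indexes can be computed as in the sketch below. The discomfort index follows equation (8) directly; because the exact expression of the energy-storage index is not given here, the function below uses a normalized setpoint deviation as a stand-in that merely reproduces the described behaviour (near 0 when indoor temperatures sit at their setpoints) and should not be read as the patent's formula.

```python
import numpy as np

def discomfort_index(T_set_now, T_set_init):
    """Eq. (8): C_u = |T_i^set - T_i^set(0)| for a temperature-setting load."""
    return np.abs(T_set_now - T_set_init)

def energy_storage_index(T, T_set, delta):
    """Hypothetical stand-in for C_s (the patent's exact formula is not stated
    here): mean indoor-temperature deviation from setpoint, normalized by the
    half dead band; 0 when every room sits exactly at its setpoint."""
    return float(np.mean((T - T_set) / (delta / 2)))
```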
In the third step, user satisfaction is evaluated comprehensively with a fuzzy comprehensive evaluation method, with the following specific steps:
31) Construct the user satisfaction factor set containing the energy-storage index C_s and the discomfort index C_u, i.e. U = {C_s, C_u}.
32) Construct the user satisfaction comment set V = {satisfied, fairly satisfied, neutral, less satisfied, dissatisfied}.
33) Determine the weight of each factor. Since the factor set consists of the two factors C_s and C_u, which are equally important to the user, the weights are A = [a_1, a_2] = [0.5, 0.5].
34) Establish the fuzzy judgment matrix. First evaluate the degree to which each factor belongs to each comment. Since most quantities follow a normal distribution, the membership function is chosen as a Gaussian function, i.e.

$$r_{sp} = \exp\!\left( -\frac{(y_s - u_{sp})^2}{2\sigma_{sp}^2} \right) \tag{9}$$

wherein: $y_s$ is the input of the s-th factor, namely $C_s$ or $C_u$; $u_{sp}$ and $\sigma_{sp}$ are the mean and standard deviation of the s-th factor under the p-th comment. The fuzzy evaluation matrix R is:

$$R = \begin{bmatrix} r_{11} & r_{12} & r_{13} & r_{14} & r_{15} \\ r_{21} & r_{22} & r_{23} & r_{24} & r_{25} \end{bmatrix} \tag{10}$$

35) Perform the fuzzy comprehensive judgment. The fuzzy evaluation set is:

$$B = A \circ R = [\,b_1, b_2, b_3, b_4, b_5\,] \tag{11}$$

wherein $\circ$ denotes the fuzzy matrix composition operation. Because the weighted-average fuzzy composition operator reflects the weights clearly and has a strong degree of synthesis, it makes full use of the information in R; the element $b_p$ is therefore:

$$b_p = \sum_{s=1}^{2} a_s\, r_{sp}, \qquad p = 1, \ldots, 5 \tag{12}$$
36) Evaluate the user satisfaction. To obtain a continuous, quantitative score, the grades corresponding to the five elements of B are set to 1, 2, 3, 4 and 5 respectively, and the user satisfaction m is defined as:

$$m = \frac{\sum_{p=1}^{5} p\, b_p}{\sum_{p=1}^{5} b_p} \tag{13}$$

From the definition of m, the smaller m is, the higher the user satisfaction.
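The following runnable sketch walks through steps 31)-36): Gaussian memberships per equation (9), the weighted-average composition of equations (11)-(12), and the satisfaction score of equation (13). The per-comment means and standard deviations u_sp and sigma_sp are invented for illustration; the patent does not state them.

```python
# Sketch of the fuzzy comprehensive evaluation of user satisfaction.
# Membership parameters (u, sigma) are illustrative assumptions.
import numpy as np

A = np.array([0.5, 0.5])                  # factor weights, step 33)
grades = np.array([1, 2, 3, 4, 5])        # satisfied ... dissatisfied, step 36)

# assumed membership parameters u_sp, sigma_sp: 2 factors x 5 comments
u = np.array([[0.0, 0.25, 0.5, 0.75, 1.0],    # energy-storage index C_s
              [0.0, 0.5, 1.0, 1.5, 2.0]])     # discomfort index C_u
sigma = np.full((2, 5), 0.3)

def satisfaction(C_s, C_u):
    y = np.array([C_s, C_u])
    # eq. (9): Gaussian membership r_sp = exp(-(y_s - u_sp)^2 / (2 sigma_sp^2))
    R = np.exp(-(y[:, None] - u) ** 2 / (2 * sigma ** 2))
    B = A @ R                              # eqs. (11)-(12): b_p = sum_s a_s r_sp
    return np.sum(grades * B) / np.sum(B)  # eq. (13): smaller m = more satisfied

print(satisfaction(C_s=0.1, C_u=0.2))      # low indices -> m near the satisfied end
```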
In the fourth step, the user satisfaction evaluation index and the frequency adjustment error index are weighted and combined into a comprehensive evaluation index, which is set as the reward function of the agent model. The specific steps are:
41) To quantify the frequency regulation level of the power system, define the root-mean-square frequency adjustment error index E_RMS as:

$$E_{RMS} = \frac{ \sqrt{ \dfrac{1}{N_s} \displaystyle\sum_{\Delta k = 1}^{N_s} e(\Delta k)^2 } }{ P_{track}^{max} - P_{track}^{min} } \tag{14}$$

wherein: $N_s$ is the number of control periods $\Delta k$; $e(\Delta k)$ is the error signal within the control period $\Delta k$; $P_{track}^{min}$ and $P_{track}^{max}$ are the minimum and maximum values of the tracking power signal, respectively. From the definition of $E_{RMS}$: the smaller $E_{RMS}$ is, the higher the tracking accuracy of the system.
42) To evaluate the control effect comprehensively and provide a basis for optimizing the power distribution signal, define the comprehensive evaluation index J as:

$$J = (1-\lambda)\,E_{RMS} + \lambda m \tag{15}$$

wherein: $\lambda$ is the weight of satisfaction.
To guarantee grid frequency stability with priority, user satisfaction is considered only while the tracking accuracy is within a certain range; once the frequency deviation exceeds the specified range, user satisfaction is no longer considered and the temperature-controlled loads are dispatched to the greatest extent to participate in frequency regulation. The relationship between $\lambda$ and $E_{RMS}$ is:

$$\lambda = \begin{cases} G_1, & E_{RMS} \le F_1 \\ G_2, & F_1 < E_{RMS} \le F_2 \\ G_3, & F_2 < E_{RMS} \le F_3 \\ 0, & E_{RMS} > F_3 \end{cases} \tag{16}$$

wherein: $F_1$, $F_2$, $F_3$, $G_1$, $G_2$, $G_3$ are constants, set to {2%, 3%, 5%, 0.8, 0.5, 0.3} respectively.
In the fifth step, a deep reinforcement learning agent model is established based on the soft actor-critic algorithm (shown in Fig. 3); the agent action space and state space are constructed from the frequency change of the power system and the operating-state information of the demand-side temperature-controlled loads, and the reward function of the agent model is constructed from the comprehensive evaluation index obtained in step 4). The temperature-controlled load control framework based on deep reinforcement learning is shown in Fig. 3, and the specific operations are as follows:
51) Establish the state space of the deep reinforcement learning agent. The state space should reflect the complete, real physical state of the whole system; it comprises, for each temperature-controlled load in the cluster controlled by the agent, the switching state, rated power, indoor and outdoor temperatures, temperature setpoint and control mode, together with the user satisfaction and the frequency deviation of the power system.
52) Establish the action space of the deep reinforcement learning agent. The action-space variables correspond to the control variables of the whole system and comprise the switching command and temperature setpoint of each temperature-controlled load in the cluster controlled by the agent; the constraints on the action space comprise the frequent-switching limit and the setpoint temperature constraint of the loads.
53) Establish the reward mechanism of the deep reinforcement learning agent, composed of the system frequency deviation and the user satisfaction: the frequency deviation is represented by the root-mean-square error index E_RMS and the user satisfaction by the satisfaction index m. Since the reinforcement learning agent maximizes the cumulative return, the reward function is set to the negative of the weighted combination of the frequency deviation and the user satisfaction.
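One possible way to organize the state space, action space and reward of steps 51)-53) in code is sketched below; the field layout is an assumption for illustration, not the patent's data structure.

```python
# Hypothetical containers for the agent's observation, action and reward.
from dataclasses import dataclass
import numpy as np

@dataclass
class AgentState:                 # step 51): environment observation
    switch_states: np.ndarray     # s_i of each load in the cluster
    rated_power: np.ndarray       # P_i^n
    indoor_temp: np.ndarray       # T_i
    outdoor_temp: float           # outdoor temperature
    temp_setpoints: np.ndarray    # T_i^set
    freq_deviation: float         # system frequency deviation
    satisfaction: float           # m from the third step

@dataclass
class AgentAction:                # step 52): control instruction
    switch_cmd: np.ndarray        # on/off commands (switch-controlled loads)
    new_setpoints: np.ndarray     # setpoints (temperature-setting loads)

def reward(E_RMS: float, m: float) -> float:
    """Step 53): reward is the negative composite index J
    (composite_index from the sketch after the fourth step)."""
    return -composite_index(E_RMS, m)
```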
In the sixth step, the agent model is trained with the soft actor-critic algorithm to solve for its optimal strategy. The training process comprises constructing the agent objective function, iterating and updating the agent policy, and updating the agent parameters. The objective function of the agent is built from the action space, the state space and the reward function combined with the policy entropy, and the agent maximizes this objective by continuously optimizing its policy. In this process, the agent performs policy iteration with the Bellman operator and then updates the policy by minimizing the divergence between the new and old policies. The Q-value network and policy network of the agent model are constructed as neural networks, and the Q-value network, the policy network and the temperature parameter iteratively update the neural network parameters according to their respective update rules, so that the objective function of the agent model steadily converges and the optimal strategy of the agent model is obtained.
The objective function of the soft actor-critic algorithm maximizes the policy entropy while maximizing the cumulative reward; the specific steps of constructing the agent objective function are:
61) Construct the objective function containing the entropy regularization term, i.e.

$$J(\pi) = \sum_{q} \mathbb{E}_{(s_q,a_q)\sim p_\pi}\left[ r(s_q,a_q) + \alpha\, \mathcal{H}\big(\pi(\cdot\,|\,s_q)\big) \right] \tag{17}$$

wherein: $\mathbb{E}(\cdot)$ is the expectation; $\pi$ is the policy; $s_q$ is the state space of the q-th agent; $a_q$ is the action space of the q-th temperature-controlled load; $r(s_q,a_q)$ is the reward function of the q-th agent; $(s_q,a_q)\sim p_\pi$ is the state-action trajectory generated by policy $\pi$; $\alpha$ is the temperature term, determining how strongly the entropy influences the reward; $\mathcal{H}(\pi(\cdot\,|\,s_q))$ is the entropy term of the policy in state $s_q$.
62) To avoid greedy sampling falling into a local optimum during the agent's learning, the policy entropy term is calculated as:

$$\mathcal{H}\big(\pi(\cdot\,|\,s_q)\big) = \mathbb{E}_{a_q\sim\pi}\left[-\log \pi(a_q\,|\,s_q)\right] \tag{18}$$
The temperature-controlled load agent performs iterative calculation during training; the iterative strategy is constructed as follows:
71) The cost function consists of the reward function and the expectation over the next state; the cost function is used for policy value evaluation, and the Bellman operator is used for policy updating. They are calculated as:

$$Q^\pi(s_q,a_q) = r(s_q,a_q) + \gamma\,\mathbb{E}_{s_{q+1}}\big[V(s_{q+1})\big] \tag{19}$$

$$T^\pi Q(s_q,a_q) = r(s_q,a_q) + \gamma\,\mathbb{E}_{s_{q+1}}\big[V(s_{q+1})\big] \tag{20}$$

wherein: $\mathbb{E}_{s_{q+1}}[\cdot]$ is the expectation over the next state; $T^\pi$ is the Bellman backup operator under policy $\pi$; $\gamma$ is the reward discount factor; $V(s_{q+1})$ is the new value function of state $s_{q+1}$, calculated as:

$$V(s_{q+1}) = \mathbb{E}_{a_{q+1}\sim\pi}\big[Q(s_{q+1},a_{q+1}) - \alpha \log \pi(a_{q+1}\,|\,s_{q+1})\big] \tag{21}$$

The cost function is continuously updated through the policy:

$$Q_{k+1} = T^\pi Q_k \tag{22}$$

wherein: $Q_k$ is the cost function at the k-th calculation.
Continuously iterating the Bellman backup operator and the above steps yields:

$$\lim_{k \to \infty} Q_k = Q_{soft}^{\pi} \tag{23}$$

wherein: $Q_{soft}^{\pi}$ is the soft Q-value.
72) In the agent policy improvement step, to push the policy toward the exponential form of the Q-value function, the policy update minimizes the KL divergence; that is, the SAC policy update is:

$$\pi_{new} = \arg\min_{\pi' \in \Pi} D_{KL}\!\left( \pi'(\cdot\,|\,s_q) \,\Big\Vert\, \frac{\exp\big(Q^{\pi_{old}}(s_q,\cdot)\big)}{Z^{\pi_{old}}(s_q)} \right) \tag{24}$$

wherein: $D_{KL}$ is the KL divergence; $\Pi$ is the policy set; $Q^{\pi_{old}}$ is the cost function under the old policy $\pi_{old}$; $Z^{\pi_{old}}(s_q)$ is the distribution function under the old policy $\pi_{old}$, used to normalize the distribution.
To establish the soft actor-critic deep reinforcement learning agent model, the SAC algorithm is constructed as follows:
To improve the adaptive learning and generalization capability of the agent model, neural networks comprising a Q-value network and a policy network are constructed.
The Q-value network outputs a single value and the policy network outputs a Gaussian distribution; the Q-value network parameters are learned by minimizing the residual $J_Q(\theta)$, and the Q-value network and the policy network are updated as follows:

$$J_Q(\theta) = \mathbb{E}_{(s_q,a_q)}\left[ \frac{1}{2}\Big( Q_\theta(s_q,a_q) - \big( r(s_q,a_q) + \gamma\,\mathbb{E}_{s_{q+1}}\big[ V_\theta(s_{q+1}) \big] \big) \Big)^2 \right] \tag{25}$$

$$J_\pi(\phi) = \mathbb{E}_{s_q}\left[ D_{KL}\!\left( \pi_\phi(\cdot\,|\,s_q) \,\Big\Vert\, \frac{\exp\big(Q_\theta(s_q,\cdot)\big)}{Z(s_q)} \right) \right] \tag{26}$$

wherein: $\theta$ is the Q-value network parameter; $\phi$ is the policy network parameter; $V_\theta$ and $Q_\theta$ are the new value function and the cost function after substituting the Q-value network parameters; $Z(s_q)$ is the distribution function in state $s_q$; $\alpha$ is the temperature parameter.
The temperature parameter is updated adaptively during training, with the update rule:

$$J(\alpha) = \mathbb{E}_{a_q\sim\pi_q}\left[ -\alpha \log \pi_q(a_q\,|\,s_q) - \alpha H_0 \right] \tag{27}$$

wherein: $\pi_q$ is the control strategy of the q-th agent; $H_0$ is the entropy term.
The deep neural networks continuously update the Q-value network parameters, the policy network parameters and the temperature parameter through learning, so that the model steadily converges and the optimal strategy is solved.
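As a concrete reference for equations (25)-(27), the following is a minimal single-agent PyTorch sketch of the three SAC losses. Network sizes, the absence of twin Q-networks and target-network updates, and all hyperparameters (including the target entropy H0) are simplifying assumptions; the patent's multi-agent implementation details are not specified here.

```python
# Minimal single-agent sketch of the SAC losses in eqs. (25)-(27) (PyTorch).
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Q-value network: outputs a single value for a state-action pair."""
    def __init__(self, s_dim, a_dim, hidden=128):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(s_dim + a_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, 1))
    def forward(self, s, a):
        return self.f(torch.cat([s, a], dim=-1)).squeeze(-1)

class PolicyNet(nn.Module):
    """Policy network: outputs a Gaussian distribution over actions."""
    def __init__(self, s_dim, a_dim, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(s_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, a_dim)
        self.log_std = nn.Linear(hidden, a_dim)
    def sample(self, s):
        h = self.body(s)
        dist = torch.distributions.Normal(self.mu(h),
                                          self.log_std(h).clamp(-5, 2).exp())
        a = dist.rsample()                     # reparameterized sample
        return a, dist.log_prob(a).sum(-1)

def sac_losses(q, q_target, pi, log_alpha, batch, gamma=0.99, H0=-1.0):
    s, a, r, s2 = batch                        # tensors from the experience pool
    alpha = log_alpha.exp()
    with torch.no_grad():                      # Bellman target, eqs. (21)/(25)
        a2, logp2 = pi.sample(s2)
        y = r + gamma * (q_target(s2, a2) - alpha * logp2)
    q_loss = 0.5 * ((q(s, a) - y) ** 2).mean()               # eq. (25)
    a_new, logp = pi.sample(s)
    pi_loss = (alpha.detach() * logp - q(s, a_new)).mean()   # eq. (26), KL form
    alpha_loss = -(log_alpha * (logp.detach() + H0)).mean()  # eq. (27)
    return q_loss, pi_loss, alpha_loss
```

In a full implementation each loss is minimized by its own optimizer and q_target is a slowly updated copy of q; minimizing pi_loss is the standard reparameterized equivalent of the KL objective in equation (26).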
In the seventh step, the trained agent model is applied online to an actual temperature-controlled load cluster: the real-time operating state, user satisfaction and grid frequency information of the cluster are input into the agent control model, the trained agent computes the control instruction of the cluster at the current moment, and the cluster adjusts its loads according to that instruction.
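A deployment loop for this step might look like the sketch below; read_cluster_state, dispatch and wait_for_next_period are hypothetical interfaces to the metering, dispatch and timing infrastructure, and pi is the trained policy network from the sketch above.

```python
# Hypothetical online loop for step seven: the trained policy maps the measured
# cluster state to dispatch commands once per control period. The three callables
# are placeholders for plant interfaces that the patent does not specify.
import torch

def run_online(pi, read_cluster_state, dispatch, wait_for_next_period):
    with torch.no_grad():                      # inference only: no training online
        while True:
            s = torch.as_tensor(read_cluster_state(), dtype=torch.float32)
            a, _ = pi.sample(s)                # switch commands / new setpoints
            dispatch(a.numpy())
            wait_for_next_period()
```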
Considering the influence of user satisfaction on temperature-controlled loads' participation in frequency regulation, and exploiting the offline-training, online-execution character of deep reinforcement learning, the invention provides a temperature-controlled load frequency response control method based on user satisfaction and deep reinforcement learning. First, according to the operating characteristics of the two kinds of temperature-controlled loads, directly switch-controlled and setpoint-controlled, an energy-storage index and a discomfort index are defined respectively to represent the factors influencing user satisfaction. Second, a user satisfaction evaluation system is built from the defined index factors with a fuzzy comprehensive evaluation method, quantifying the user satisfaction of the temperature-controlled loads. Then a deep reinforcement learning agent model based on the SAC algorithm is established; the agent model has good self-learning capability in the face of the stochastic uncertainty of large-scale temperature-controlled loads, and by interacting with environment states such as the operating state of the large-scale loads and the system deviation, the agent adaptively completes the training of the model according to the prescribed policy and parameter update scheme. Applying the trained agent model to a temperature-controlled load cluster in actual operation realizes cooperative control of large-scale temperature-controlled load frequency response while accounting for the users' satisfaction, and thus has good practical engineering value.
While the invention has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and substitutions may be made without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims (9)

1. A temperature-controlled load frequency response control method based on user satisfaction and reinforcement learning, comprising the steps of:
1) Establishing a temperature control load model and a power system frequency response model with the temperature control load participating in frequency modulation by adopting a first-order ordinary differential equation;
2) Aiming at temperature control loads adopting two control modes of direct switch control and temperature setting control, respectively establishing user satisfaction degree adjustment indexes of the temperature control loads under the two control modes;
3) According to the user satisfaction degree adjustment index established in the step 2), comprehensive evaluation of the user satisfaction degree is carried out by utilizing a fuzzy comprehensive evaluation method to obtain the user satisfaction degree;
4) Defining a frequency adjustment error index according to the frequency error signal and the tracking power signal of the power system in the control period, and carrying out weighted combination on the user satisfaction degree obtained in the step 3) and the frequency adjustment error index to obtain a comprehensive evaluation index;
5) Establishing a deep reinforcement learning agent model based on the soft actor-critic algorithm, constructing the agent action space and state space from the frequency change of the power system and the operating-state information of the demand-side temperature-controlled loads, and constructing the reward function of the agent model from the comprehensive evaluation index obtained in step 4);
6) Training the agent model with the soft actor-critic algorithm to solve for its optimal strategy, the training process comprising constructing the agent objective function, iterating and updating the agent policy, and updating the agent parameters; the objective function of the agent is built from the action space, state space and reward function of step 5) combined with the policy entropy, and the agent maximizes this objective by continuously optimizing its policy; in this process the agent performs policy iteration with the Bellman operator and then updates the policy by minimizing the divergence between the new and old policies; the Q-value network and policy network of the agent model are constructed as neural networks, and the Q-value network, the policy network and the temperature parameter iteratively update the neural network parameters according to their respective update rules, so that the objective function of the agent model steadily converges and the optimal strategy of the agent model is obtained;
7) Applying the trained agent model of step 6) online to an actual temperature-controlled load cluster: the real-time operating state, user satisfaction and grid frequency information of the cluster are input into the agent control model, the trained agent computes the control instruction of the cluster at the current moment, and the cluster adjusts its loads according to that instruction.
2. The temperature-controlled load frequency response control method based on user satisfaction and reinforcement learning according to claim 1, wherein establishing the temperature-controlled load model with first-order ordinary differential equations in step 1) comprises the following specific steps:
11) Establishing a first-order ordinary differential equation model, introducing a state variable and a virtual variable, to represent the dynamic characteristics of any temperature-controlled load;
12) Calculating the sum of the rated powers of the temperature-controlled load cluster from the dynamic characteristic equation of the single temperature-controlled load.
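By way of illustration, the following minimal Python sketch shows a first-order TCL model with thermostat hysteresis and the cluster power aggregation of step 12); the parameter values (R, C, rated power, deadband, outdoor temperature) are illustrative assumptions, and the state/virtual-variable formulation of step 11) is not reproduced here.

```python
# Illustrative parameters: R (°C/kW), C (kWh/°C), P (kW) and the deadband
# are assumptions, not values taken from this disclosure.
import math

def tcl_step(T_in, T_out, s, R=2.0, C=2.0, P=5.6, dt=1.0 / 60.0):
    """One exact discrete step (dt in hours) of the first-order ODE
    dT/dt = -(T - T_out + s*R*P) / (R*C) for a cooling load."""
    a = math.exp(-dt / (R * C))
    return a * T_in + (1.0 - a) * (T_out - s * R * P)

def thermostat(T_in, s, T_set=22.0, deadband=1.0):
    """Hysteresis switching of a cooling load around its set-point."""
    if T_in > T_set + deadband / 2.0:
        return 1                               # too warm: switch on
    if T_in < T_set - deadband / 2.0:
        return 0                               # cool enough: switch off
    return s                                   # inside the deadband: hold state

# Step 12): aggregate cluster demand as the sum over running devices.
loads = [{"T": 22.0 + 0.1 * i, "s": i % 2} for i in range(10)]
for _ in range(60):                            # simulate one hour, minute steps
    for tcl in loads:
        tcl["s"] = thermostat(tcl["T"], tcl["s"])
        tcl["T"] = tcl_step(tcl["T"], T_out=32.0, s=tcl["s"])
P_cluster = sum(5.6 * tcl["s"] for tcl in loads)
print(f"aggregate demand: {P_cluster:.1f} kW")
```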
3. The temperature-controlled load frequency response control method based on user satisfaction and reinforcement learning according to claim 1, wherein step 2) establishes the user satisfaction adjustment indices of the temperature-controlled load under the two control modes by the following specific steps:
21) For a temperature-controlled load cluster under direct switch control, whose control mode acts directly on the device switches, neglecting the influence of the temperature set-point on user comfort and defining an energy storage index C_s; the agent issues control instructions that keep C_s as close to 0 as possible, reducing the start-stop frequency of the devices;
22) For a temperature-controlled load cluster under temperature set-point control, defining a discomfort index C_u; the agent issues control instructions that keep C_u as close to 0 as possible, reducing user discomfort.
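The claim does not fix closed-form expressions for C_s and C_u, so the Python sketch below uses plausible stand-in definitions (assumptions introduced only to make the interface concrete), with both indices normalized so that a value near 0 means little impact on the user.

```python
# Hypothetical index definitions (assumptions), normalized so 0 = no impact.

def energy_storage_index(T_in, T_set, deadband):
    """Stand-in C_s for a direct-switch cluster: mean normalized distance of
    each indoor temperature from its deadband centre; values near 0 leave
    thermal slack, so switch commands need not force rapid start-stop cycles."""
    devs = [abs(T - Ts) / (db / 2.0) for T, Ts, db in zip(T_in, T_set, deadband)]
    return sum(devs) / len(devs)

def discomfort_index(T_cmd, T_pref, T_range):
    """Stand-in C_u for a set-point-controlled cluster: mean normalized shift
    of the commanded set-point away from the user-preferred set-point."""
    devs = [abs(Tc - Tp) / Tr for Tc, Tp, Tr in zip(T_cmd, T_pref, T_range)]
    return sum(devs) / len(devs)

C_s = energy_storage_index([21.8, 22.6], [22.0, 22.0], [1.0, 1.0])
C_u = discomfort_index([23.0, 22.5], [22.0, 22.0], [2.0, 2.0])
print(C_s, C_u)        # the agent is rewarded for keeping both near 0
```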
4. The temperature-controlled load frequency response control method based on user satisfaction and reinforcement learning according to claim 1, wherein in step 3) the comprehensive evaluation of user satisfaction is performed with the fuzzy comprehensive evaluation method, based on the user satisfaction adjustment indices established in step 2), by the following specific steps:
31) Constructing the user satisfaction factor set comprising the energy storage index C_s and the discomfort index C_u, i.e. U = {C_s, C_u};
32) Constructing the user satisfaction comment set, with five comment grades set according to the degree of user satisfaction, i.e. V = {satisfied, rather satisfied, average, rather dissatisfied, dissatisfied};
33) Determining the weight of each influencing factor: the factor set consists of the energy storage index C_s and the discomfort index C_u, the two are set to be equally important to the user, and the weight vector is set to [0.5, 0.5];
34) Establishing the fuzzy judgment matrix by judging the degree to which each factor belongs to each comment grade, with a Gaussian function selected as the membership function;
35) Performing the fuzzy comprehensive judgment to evaluate user satisfaction, where it is defined that the smaller the satisfaction score m, the higher the user satisfaction.
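By way of illustration, a minimal Python sketch of this fuzzy comprehensive evaluation follows; the claim fixes the factor set U = {C_s, C_u}, the five comment grades, the [0.5, 0.5] weights and the Gaussian membership shape, while the grade centres, membership width and grade scores below are illustrative assumptions.

```python
# Grade centres, membership width and grade scores are assumptions; the
# factor set, five grades, [0.5, 0.5] weights and Gaussian memberships
# follow the claim.
import math

GRADES = ["satisfied", "rather satisfied", "average",
          "rather dissatisfied", "dissatisfied"]
CENTRES = [0.0, 0.25, 0.5, 0.75, 1.0]     # assumed grade centres on [0, 1]
SIGMA = 0.15                              # assumed Gaussian membership width
SCORES = [0.0, 0.25, 0.5, 0.75, 1.0]      # smaller m = more satisfied

def membership_row(x):
    """Gaussian membership of one factor value in each comment grade."""
    row = [math.exp(-((x - c) ** 2) / (2.0 * SIGMA ** 2)) for c in CENTRES]
    total = sum(row)
    return [v / total for v in row]       # normalized fuzzy row

def fuzzy_satisfaction(C_s, C_u, weights=(0.5, 0.5)):
    """Weights times judgment matrix, then defuzzify to a single score m."""
    R = [membership_row(C_s), membership_row(C_u)]     # fuzzy judgment matrix
    B = [sum(w * R[i][j] for i, w in enumerate(weights))
         for j in range(len(GRADES))]                  # comprehensive vector
    return sum(b * s for b, s in zip(B, SCORES))

m = fuzzy_satisfaction(C_s=0.2, C_u=0.1)
print(f"satisfaction score m = {m:.3f} (smaller means more satisfied)")
```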
5. The temperature-controlled load frequency response control method based on user satisfaction and reinforcement learning according to claim 1, wherein in step 4) the user satisfaction obtained in step 3) and the frequency adjustment error index are weighted and combined into the comprehensive evaluation index by the following specific steps:
41) Evaluating the tracking performance of the system by defining the frequency adjustment error index E_RMS, where the smaller E_RMS is, the higher the tracking accuracy of the system;
42) Weighting and combining the frequency adjustment error index E_RMS with the user satisfaction score m to define the comprehensive evaluation index J.
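By way of illustration, a minimal Python sketch of steps 41) and 42) follows; the RMS form of the tracking error and the weights w1, w2 are illustrative assumptions, as the claim fixes only that J is a weighted combination of E_RMS and m.

```python
# The RMS error form and the weights w1, w2 are assumptions for this sketch.
import math

def e_rms(delta_P_cmd, delta_P_actual):
    """Step 41): RMS error between the tracking power signal and the
    power response actually delivered over one control period."""
    errs = [(c - a) ** 2 for c, a in zip(delta_P_cmd, delta_P_actual)]
    return math.sqrt(sum(errs) / len(errs))

def composite_index(E_RMS, m, w1=0.7, w2=0.3):
    """Step 42): J = w1 * E_RMS + w2 * m; smaller is better for both terms."""
    return w1 * E_RMS + w2 * m

E = e_rms([1.0, 0.8, 0.5], [0.9, 0.7, 0.6])
print(composite_index(E, m=0.15))
```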
6. The temperature-controlled load frequency response control method based on user satisfaction and reinforcement learning according to claim 1, wherein establishing the deep reinforcement learning agent model based on the soft actor-critic algorithm in step 5) comprises the following specific steps:
51) Establishing the input information of the agent model, i.e. the state space of the agent: the switching states, rated powers, indoor and outdoor temperatures and temperature set-points of the temperature-controlled load cluster controlled by the agent, the frequency deviation of the power system, and the user satisfaction score m calculated in step 3) form the state space, which is input into the agent model to realize the agent's perception of its environment;
52) Establishing the output control instructions of the agent model, i.e. the action space of the agent: according to the two control modes of direct switch control and temperature set-point control, the control instructions of the temperature-controlled loads are set as the load switching instruction and the temperature set-point value, and the constraints on the control instructions are set as the switching-frequency limit and the set-point temperature range limit of the temperature-controlled loads;
53) Establishing the optimization target of the agent model, i.e. the reward function required by the agent model, according to the comprehensive evaluation index established in step 4): the reward function is set to the negative of the comprehensive evaluation index J formed by the weighted combination of the user satisfaction score and the frequency adjustment error index.
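By way of illustration, a minimal Python sketch of this agent interface follows: assembling the state vector of step 51), applying the instruction constraints of step 52), and the reward r = -J of step 53); field names and numeric limits are illustrative assumptions.

```python
# Field names and numeric limits are assumptions introduced for this sketch.

def build_state(switch_states, P_rated, T_in, T_out, T_set, delta_f, m):
    """Step 51): flatten the environment observations into one state vector."""
    return [*switch_states, *P_rated, *T_in, T_out, *T_set, delta_f, m]

def constrain_action(on_off, T_cmd, T_min=20.0, T_max=26.0, lockout_ok=True):
    """Step 52): enforce the switching-frequency lockout and set-point range."""
    on_off = on_off if lockout_ok else 0          # block a too-frequent switch
    T_cmd = min(max(T_cmd, T_min), T_max)         # clamp to the allowed band
    return on_off, T_cmd

def reward(J):
    """Step 53): the reward is the negative of the composite index J."""
    return -J

s = build_state([1, 0], [5.6, 5.6], [22.1, 23.0], 32.0, [22.0, 22.0],
                delta_f=-0.05, m=0.15)
print(constrain_action(1, 27.3), reward(0.21))
```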
7. The temperature-controlled load frequency response control method based on user satisfaction and reinforcement learning according to claim 1, wherein the objective function of the soft actor-critic algorithm in step 6) maximizes the policy entropy while maximizing the cumulative reward, and the objective function of the agent is constructed by the following specific steps:
61) Constructing an objective function containing an entropy regularization term:

J(\pi) = \sum_{q} \mathbb{E}_{(s_q, a_q) \sim p_{\pi}} \big[ r(s_q, a_q) + \alpha H(\pi(\cdot \mid s_q)) \big]

wherein: E(·) is the expectation function; π is the policy; s_q is the state of the q-th agent; a_q is the action of the q-th temperature-controlled load; r(s_q, a_q) is the reward function of the q-th agent; (s_q, a_q) ~ p_π is the state-action trajectory formed under policy π; α is the temperature term, which determines the degree of influence of the entropy on the reward; H(π(·|s_q)) is the entropy term of the policy at state s_q;
62) Setting the entropy term of the policy, calculated as:

H(\pi(\cdot \mid s_q)) = \mathbb{E}_{a \sim \pi(\cdot \mid s_q)} \big[ -\log \pi(a \mid s_q) \big]
8. The temperature-controlled load frequency response control method based on user satisfaction and reinforcement learning according to claim 1, wherein the agent in step 6) performs policy iteration using the Bellman operator, constructed specifically as follows:
71) The value function consists of the reward function and an expectation over the next state s_{q+1}; the Bellman operator of the updated policy therefore contains the expected value of the reward function and of the new value function, calculated as follows:

T^{\pi} Q(s_q, a_q) = r(s_q, a_q) + \gamma \, \mathbb{E}_{s_{q+1}} \big[ V(s_{q+1}) \big]

wherein: \mathbb{E}_{s_{q+1}}(\cdot) is the expectation function over the next state s_{q+1}; T^{\pi} is the Bellman backup operator under policy π; γ is the discount factor of the reward; V(s_{q+1}) is the new value function of state s_{q+1}:

V(s_{q+1}) = \mathbb{E}_{a_{q+1} \sim \pi} \big[ Q(s_{q+1}, a_{q+1}) - \alpha \log \pi(a_{q+1} \mid s_{q+1}) \big]
the cost function is continuously realized through strategy updating:
Q k+1 =T π Q k
wherein: q (Q) k A cost function for the kth calculation;
the bellman backup operator and the above steps are iterated continuously, and the method can be realized:
Figure FDA0004167523790000045
wherein:
Figure FDA0004167523790000046
is soft Q;
72) The policy update takes the form of minimizing the KL divergence, i.e. the SAC policy update method is:

\pi_{\mathrm{new}} = \arg\min_{\pi' \in \Pi} D_{\mathrm{KL}} \left( \pi'(\cdot \mid s_q) \,\middle\|\, \frac{\exp\!\big( Q^{\pi_{\mathrm{old}}}(s_q, \cdot) \big)}{Z^{\pi_{\mathrm{old}}}(s_q)} \right)

wherein: D_KL is the KL divergence; Π is the policy set; Q^{π_old} is the value function under the old policy π_old; Z^{π_old}(s_q) is the distribution function under the old policy π_old, used to normalize the distribution.
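By way of illustration, a minimal Python sketch of the soft Bellman backup of step 71) follows, for a toy discrete action set; all numbers, and the discrete-action simplification itself, are illustrative assumptions (the claimed policy is a continuous Gaussian).

```python
# All numbers are illustrative; a discrete two-action toy replaces the
# continuous Gaussian policy purely to keep the arithmetic explicit.
import math

def soft_value(q_next, log_pi_next, alpha):
    """V(s') = E_{a~pi}[ Q(s', a) - alpha * log pi(a|s') ]."""
    probs = [math.exp(lp) for lp in log_pi_next]
    return sum(p * (q - alpha * lp)
               for p, q, lp in zip(probs, q_next, log_pi_next))

def bellman_target(r, q_next, log_pi_next, gamma=0.99, alpha=0.2):
    """T^pi Q(s, a) = r(s, a) + gamma * V(s')."""
    return r + gamma * soft_value(q_next, log_pi_next, alpha)

log_pi = [math.log(0.7), math.log(0.3)]       # next-state action probabilities
print(bellman_target(r=-0.21, q_next=[1.2, 0.8], log_pi_next=log_pi))
```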
9. The temperature-controlled load frequency response control method based on user satisfaction and reinforcement learning according to claim 1, wherein in step 6) the Q-value network outputs a single value through a neural network, and the Q-value network parameters are updated according to:

J_{Q}(\theta) = \mathbb{E}_{(s_q, a_q)} \left[ \tfrac{1}{2} \big( Q_{\theta}(s_q, a_q) - r(s_q, a_q) - \gamma \, \mathbb{E}_{s_{q+1}} \big[ V_{\bar{\theta}}(s_{q+1}) \big] \big)^{2} \right]

wherein: θ is the Q-value network parameter; φ is the policy network parameter; V_{\bar{\theta}} and Q_θ are respectively the new value function and the value function after the Q-value network parameters are substituted;
the policy network output is a gaussian distribution, and the policy network updates the policies as follows:
Figure FDA00041675237900000412
wherein: z(s) q ) Is state s q A time distribution function;
The temperature parameter is updated so that the agent keeps testing all feasible actions, with the update objective:

J(\alpha) = \mathbb{E}_{a_q \sim \pi_q} \big[ -\alpha \log \pi_q(a_q \mid s_q) - \alpha H_{0} \big]

wherein: π_q is the control policy of the q-th agent; H_0 is the target entropy term;
the deep neural network learns to continuously update the Q value network parameter, the strategy network parameter and the temperature parameter, so that the model is continuously converged, and the optimal strategy of the intelligent agent model is solved.