CN109711636A

CN109711636A - River water level prediction method based on chaotic firefly algorithm and gradient boosting tree model

Info

Publication number: CN109711636A
Application number: CN201910018633.8A
Authority: CN
Inventors: 梁雪春; 苏佳佩
Original assignee: Nanjing Tech University
Current assignee: Nanjing Tech University
Priority date: 2019-01-09
Filing date: 2019-01-09
Publication date: 2019-05-03

Abstract

The invention provides a river water level prediction method based on a chaotic firefly algorithm and a gradient boosting tree model, and relates to the fields of information technology and hydrological forecasting. First, data are collected; the required data fall into five categories. The data are then preprocessed, which includes eliminating abnormal values, handling missing values, and normalizing the data. An improved chaotic firefly algorithm is used to optimize the training parameters of the gradient boosting tree model, and the improved model is applied to river water level prediction on structured data. Finally, a training sample set is constructed by randomly selecting part of the five categories of processed data for model training; the GSO algorithm tunes the parameters to obtain the GBDT model under optimal parameters, which generalizes better and improves the accuracy of water level prediction. The model is then checked on a test set, and the errors between the actual and calculated values are compared to verify the performance of the model.

Description

River water level prediction method based on chaotic firefly algorithm and gradient boosting tree model

Technical Field

The invention discloses a river water level prediction method based on a chaotic firefly algorithm and a gradient boosting tree model, and relates to the fields of information technology and hydrological forecasting.

Background

In the early 1970s, the Swedish Meteorological and Hydrological Institute developed a hydrological forecasting model for flood forecasting at hydropower plants; floods were forecast by supplying reasonable forecasting parameters and verifying the forecast results. In recent years, China has developed water conservancy informatization rapidly and has achieved some successful applications in water resource allocation and management. However, China started late in hydrological forecasting: large volumes of water conservancy data remain largely unmined, and there are few means of extracting valuable information from them.

The significance of this disclosure lies mainly in predicting the river water level reasonably and accurately by means of information technology.

At present, hydrological forecasting technology in China is still at a developing stage and, compared with developed countries, faces many problems that urgently need to be solved. The main problems are as follows:

1. Large data volumes: how to apply machine learning methods effectively and accurately to the practical problem of water level prediction when the amount of data is large.

2. Data analysis and preprocessing: the accuracy of a water level prediction model depends on the integrity, validity, and timeliness of the data. Raw data inevitably contain samples that are unsuitable as model input, so missing values and abnormal values must be processed; how to process them for different data types is a difficulty.

3. Index selection under multidimensional data: the index attributes obtained by conventional water level monitoring are not uniform, and the data have many indexes, high dimensionality, non-linearity, and redundancy, so screening the important feature indexes is particularly important.

4. Establishing a water level prediction model: a water level prediction model suited to China's national conditions needs to be established with reference to earlier research on river water level prediction models.

The invention aims to analyze the river water level quantitatively so that the future river water level can be predicted effectively from historical data. The water level is the primary indicator by which water conservancy departments follow changes in water flow; monitoring its dynamics provides an important basis for their decisions, and changes in water level are closely related to people's daily life and to production and construction.

Disclosure of Invention

In view of the complicated behavior of river water level changes, the invention provides a machine learning method for river water level prediction based on data mining and data analysis. A change in water level results from the combined action of many influencing factors, so the selected data comprise historical records of the timestamp, instantaneous flow, accumulated water quantity, flow velocity, and water level, and the river water level is predicted accurately from these structured data.

To solve the above technical problem, the invention provides a river water level prediction method based on a chaotic firefly algorithm and a gradient boosting tree model, which comprises the following steps:

S101: To extract effective information from a large amount of data, a data acquisition scheme is provided. The required data are divided into five types: the timestamp indicates that a record existed, complete and verifiable, at a specific point in time; the accumulated water quantity is the total river flow in the current time period; the instantaneous flow is the amount of fluid passing through the effective cross-section of a closed pipeline or open channel per unit time; the flow velocity is the displacement of the river water per unit time; and the water level most directly reflects the water regime of the water body in the current time period. The aim is to extract information from the structured data to support accurate prediction of the river water level.

The five types of data are detailed in S1011, S1012, S1013, S1014, and S1015. The timestamp data indicate that a record existed, complete and verifiable, at a specific point in time. The accumulated water quantity data reflect the total river flow in the current time period. The instantaneous flow data reflect the amount of fluid passing through the effective cross-section of a closed pipeline or open channel per unit time; river flow is at present measured mainly with flow meters, and because the flow is unstable the error between the measured and actual values is large. The flow velocity data reflect the displacement of the river water per unit time; the velocity differs from point to point within a channel or river course, being smaller near the bottom and the banks and largest near the water surface at the center of the river. The water level data most directly reflect the water regime of the water body in the current time period; water level observations generally include the influence of flow, waves, and ice conditions, and the observation times and their number vary with the course of the water level over the day.
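The five data types can be pictured as one structured record per observation. The following is a minimal sketch only: the class name `HydroRecord`, the field names, the units in the comments, and the sample values are all illustrative assumptions and do not come from the patent.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class HydroRecord:
    """One structured observation; names and units are illustrative assumptions."""
    timestamp: datetime                   # S1011: time point of the record
    cumulative_volume: Optional[float]    # S1012: total flow in the period (e.g. m^3)
    instantaneous_flow: Optional[float]   # S1013: flow per unit time (e.g. m^3/s)
    flow_velocity: Optional[float]        # S1014: displacement per unit time (e.g. m/s)
    water_level: Optional[float]          # S1015: prediction target (e.g. m)

    def features(self) -> List[Optional[float]]:
        """Numeric model inputs; None marks a missing value left for step S102."""
        return [self.cumulative_volume, self.instantaneous_flow, self.flow_velocity]


# Made-up values, used only to show the shape of a record.
record = HydroRecord(datetime(2019, 1, 9, 8, 0), 1250.0, 3.2, None, 4.75)
```

Keeping missing readings as `None` rather than dropping the record leaves the decision about deletion or completion to the preprocessing step.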

S102: Data are collected and then processed. Preprocessing of the structured data includes eliminating abnormal values, handling missing values, and normalizing the data.

S103: The invention provides an improved chaotic firefly algorithm (GSO) for optimizing the training parameters of a Gradient Boosted Decision Tree (GBDT). The improved gradient boosting tree model is applied to river water level prediction on structured data, so that classification and regression tasks can be better realized.
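The three preprocessing operations named in S102 can be sketched for a single numeric column as follows. The patent does not say which rules it uses, so the k-sigma outlier test, mean imputation, and min-max scaling below are assumptions chosen for illustration, as are the function names and sample values.

```python
import statistics


def mask_outliers(values, k=3.0):
    """Replace values more than k standard deviations from the mean with None.

    The k-sigma rule is an assumed criterion; None entries pass through unchanged.
    """
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    std = statistics.pstdev(observed)
    return [None if (v is not None and std > 0 and abs(v - mean) > k * std) else v
            for v in values]


def fill_missing(values):
    """Complete None entries with the mean of the observed values (assumed rule)."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]


def min_max_normalize(values):
    """Scale values linearly to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up water level column with one gap and one implausible spike.
raw = [4.0, 4.2, None, 4.1, 40.0, 4.3]
masked = mask_outliers(raw, k=1.5)   # a small k because the sample is tiny
filled = fill_missing(masked)
scaled = min_max_normalize(filled)
```

Masking outliers as missing values, instead of deleting rows, keeps the five columns aligned by timestamp so that one completion rule handles both cases.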

S1031: Gradient boosting decision tree: this is an iterative decision tree algorithm composed of a plurality of decision trees, in which the conclusions of all trees are accumulated to form the final classifier; the algorithm has strong generalization capability. The result of the model is an ensemble of classification and regression trees (CART tree ensemble), and the model can be expressed as:

$$\hat{y}_i = \sum_{k=1}^{n} f_k(x_i) \tag{1}$$

In formula (1), $f_k(x_i)$ denotes the k-th decision tree and $\hat{y}_i$ denotes a strong classifier formed by linearly summing n weak classifiers. That is, a new decision tree function $f_k(x_i)$ is added to the predicted value of the previous round so that the residual with respect to the true value is reduced as far as possible. The objective function of GBDT is:

$$L = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k) \tag{2}$$

In formula (2), $l$ is a differentiable loss function that measures the difference between the predicted value $\hat{y}_i$ and the true value $y_i$, and $\sum_k \Omega(f_k)$ is the added regularization term [9]. $\Omega$ represents the complexity of a decision tree; it may constrain the number of nodes of the tree, the depth of the tree, or the $L_2$ norm of the scores assigned to the leaf nodes, and thereby prevents over-fitting of the model.

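The objective function of the t-th iteration is given by formula (3), in which C is a constant:

$$L^{(t)} = \sum_{i} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + C \tag{3}$$

Expanding formula (3) by Taylor's formula and taking the second-order form as an approximation of the objective function gives:

$$L^{(t)} \approx \sum_{i} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) + C \tag{4}$$

In formula (4), $g_i = \partial l(y_i, \hat{y}_i^{(t-1)}) / \partial \hat{y}_i^{(t-1)}$ and $h_i = \partial^2 l(y_i, \hat{y}_i^{(t-1)}) / \partial (\hat{y}_i^{(t-1)})^2$ are the first and second derivatives of the loss function with respect to $\hat{y}_i^{(t-1)}$. With the constant terms removed, the objective function of the t-th iteration simplifies to formula (5), and the tree complexity function used herein is formula (6):

$$\tilde{L}^{(t)} = \sum_{i} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) \tag{5}$$

$$\Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} \omega_j^2 \tag{6}$$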

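In formula (6), $\gamma$ is the leaf node coefficient and $T$ is the number of leaf nodes; $\lambda$ is the coefficient of the squared $L_2$ norm, which also acts to prevent over-fitting, and $\omega$ denotes the leaf weights. The decision tree function is redefined as $f_t(x) = \omega_{q(x)}$ [10], i.e. the tree is split into a structure function $q$ and a leaf weight part $\omega$, where $q$ maps an input to the index of a leaf, $q: \mathbb{R}^d \to \{1, 2, \ldots, T\}$. Defining the sample set of each leaf as $I_j = \{ i \mid q(x_i) = j \}$, with $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$ the sums over leaf $j$ of the first and second derivatives of the loss function, the objective function can be rewritten as:

$$\tilde{L}^{(t)} = \sum_{j=1}^{T} \left[ G_j \omega_j + \frac{1}{2} (H_j + \lambda) \omega_j^2 \right] + \gamma T \tag{7}$$

Solving for the minimum of this quadratic gives the optimal leaf weight $\omega_j^*$ and the optimal value $L^*$ of the objective function:

$$\omega_j^* = -\frac{G_j}{H_j + \lambda}, \qquad L^* = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T \tag{8}$$

From the above, once the structure function $q$ of the decision tree is known, the objective function can be calculated from the above formula, and the final problem becomes finding the optimal tree structure $q^*$ for which the objective function is minimal.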
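The closed-form minimum of the per-leaf quadratic can be checked numerically. The sketch below assumes a squared-error loss $l = \frac{1}{2}(\hat{y} - y)^2$, for which the first derivative is $\hat{y} - y$ and the second derivative is 1; the patent does not fix a particular loss, and the function names are hypothetical.

```python
def leaf_statistics(y_true, y_pred):
    """Return (G, H) for one leaf under the assumed loss l = (y_pred - y)^2 / 2."""
    G = sum(p - y for y, p in zip(y_true, y_pred))  # first derivative: y_pred - y
    H = float(len(y_true))                          # second derivative: 1 per sample
    return G, H


def optimal_leaf_weight(G, H, lam):
    """Weight minimising G*w + (H + lam) * w^2 / 2, i.e. w* = -G / (H + lam)."""
    return -G / (H + lam)


def structure_score(leaves, lam, gamma):
    """Objective value at the optimal weights: -1/2 * sum(G^2 / (H + lam)) + gamma * T."""
    return -0.5 * sum(G * G / (H + lam) for G, H in leaves) + gamma * len(leaves)


# Two samples in one leaf, previous prediction 0 for both (made-up numbers).
G, H = leaf_statistics([3.0, 5.0], [0.0, 0.0])
w_unregularised = optimal_leaf_weight(G, H, lam=0.0)   # equals the mean residual
w_regularised = optimal_leaf_weight(G, H, lam=2.0)     # shrunk towards zero
score = structure_score([(G, H)], lam=2.0, gamma=1.0)
```

With `lam=0` the optimal weight is simply the mean residual of the leaf; a positive `lam` shrinks it, which is how the $L_2$ term limits over-fitting.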

S1032: The firefly search algorithm is a heuristic search algorithm based on bionics. The brightness of a firefly is related to the objective value at its position: the brighter the firefly, the better its position, i.e. the better its objective function value. Most fireflies eventually gather at a number of positions, i.e. at the extreme points. The algorithm proceeds as follows:

1. Set the initial state: the number of fireflies n, the maximum attraction β₀, the light intensity absorption coefficient γ, the step factor α, and the maximum number of iterations MaxGeneration or the search precision ε.

2. Initialize the positions of the fireflies at random, and calculate the objective function value of each firefly as its maximum fluorescence brightness I₀.

3. Calculate the relative brightness I and the attraction β of the fireflies in the population, and determine the direction of movement of each firefly from the relative brightness.

4. Update the positions of the fireflies, move the firefly at the best position randomly, and recalculate the brightness of the fireflies.

5. Update the optimal solution of the objective function and its position, and check whether the optimal solution meets the set condition or the maximum number of iterations has been reached; if not, return to step 3 and iterate.

6. Output the global extreme point.
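The six steps above can be sketched as a small minimizer. This is an illustration under stated assumptions, not the patented implementation: a lower objective value is treated as a brighter firefly, attraction is taken in the commonly used form β = β₀·exp(−γ·r²), positions are clipped to a box, the loop stops only on the iteration count (the precision test ε is omitted), and all default parameter values and the `sphere` test function are made up.

```python
import math
import random


def firefly_minimize(objective, dim, bounds, n=15, beta0=1.0, gamma=1.0,
                     alpha=0.2, max_generation=50, seed=0):
    """Minimise `objective` over a box; a lower objective means a brighter firefly."""
    rng = random.Random(seed)
    lo, hi = bounds

    def step(x, target, beta):
        # Move x towards target with attraction beta plus a random perturbation,
        # then clip every coordinate back into [lo, hi].
        return [min(hi, max(lo, a + beta * (b - a) + alpha * (rng.random() - 0.5)))
                for a, b in zip(x, target)]

    # Steps 1-2: random initial positions and their objective values.
    xs = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n)]
    cost = [objective(x) for x in xs]
    b = min(range(n), key=cost.__getitem__)
    best_x, best_cost = list(xs[b]), cost[b]
    history = [best_cost]

    for _ in range(max_generation):
        # Step 3: each firefly moves towards every brighter firefly.
        for i in range(n):
            for j in range(n):
                if cost[j] < cost[i]:
                    r2 = sum((p - q) ** 2 for p, q in zip(xs[i], xs[j]))
                    xs[i] = step(xs[i], xs[j], beta0 * math.exp(-gamma * r2))
                    cost[i] = objective(xs[i])
        # Step 4: the brightest firefly takes a purely random step.
        b = min(range(n), key=cost.__getitem__)
        xs[b] = step(xs[b], xs[b], 0.0)
        cost[b] = objective(xs[b])
        # Step 5: keep the best solution seen so far.
        b = min(range(n), key=cost.__getitem__)
        if cost[b] < best_cost:
            best_x, best_cost = list(xs[b]), cost[b]
        history.append(best_cost)

    # Step 6: output the best point found.
    return best_x, best_cost, history


def sphere(x):
    """Toy objective with its minimum at the origin."""
    return sum(v * v for v in x)


best_x, best_cost, history = firefly_minimize(sphere, dim=2, bounds=(-5.0, 5.0))
```

Recording the best-so-far value separately from the swarm means the returned solution can never get worse, even though the random step in step 4 may move the brightest firefly to a poorer position.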

S1033: The improved firefly algorithm optimizes the GBDT model training parameters. The algorithm is improved in the following two respects:

1. Introducing an inertia weight: when solving a problem, an optimization algorithm is generally expected to show good global search capability in the early stage and fine local exploitation capability in the later stage. The position update of the firefly algorithm is random, so to improve the performance of the algorithm an inertia weight ω is introduced into the position update formula:

$$x_i(t+1) = \omega x_i(t) + \beta \left( x_j(t) - x_i(t) \right) + \alpha \left( \mathrm{rand} - \tfrac{1}{2} \right) \tag{9}$$

2. Adding a chaotic mutation mechanism: to remedy the poor optimization precision of the algorithm, when most points have stopped changing between iterations, the ergodicity of a chaotic system is used to make the particles jump out of the local optimal solution. The chaotic system is given by the Logistic map:

$$X_{n+1} = u X_n (1 - X_n), \quad n = 0, 1, 2, \ldots \tag{10}$$

The invention mainly uses the improved firefly algorithm to optimize three parameters of the GBDT model: the step size (learning_rate), the maximum depth of a decision tree (max_depth), and the maximum number of leaf nodes (max_leaf_nodes). The error between the predictions on the training set and the actual values is taken as the fitness function f(x), and the GBDT model under the optimal parameters is searched for in order to improve the accuracy of the model.
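One way to connect a firefly position to the three GBDT parameters is to let each firefly live in the unit cube and decode its coordinates into parameter values. The sketch below assumes this encoding; the search ranges in `PARAM_BOUNDS` and the callable `train_and_score` are hypothetical, since the patent names the parameters but gives neither bounds nor a training routine. The parameter names happen to match keyword arguments accepted by scikit-learn's `GradientBoostingRegressor`, which is one possible backend.

```python
# Assumed search ranges; the source names the parameters but gives no bounds.
PARAM_BOUNDS = {
    "learning_rate": (0.01, 0.3),
    "max_depth": (2, 10),
    "max_leaf_nodes": (4, 64),
}


def decode(position):
    """Map a position in [0, 1]^3 to a GBDT hyperparameter dictionary."""
    params = {}
    for u, (name, (lo, hi)) in zip(position, PARAM_BOUNDS.items()):
        u = min(1.0, max(0.0, u))          # clip stray coordinates into [0, 1]
        value = lo + u * (hi - lo)
        # learning_rate stays continuous; the two tree-size parameters are integers.
        params[name] = value if name == "learning_rate" else int(round(value))
    return params


def fitness(position, train_and_score):
    """Fitness f(x): the training error returned by a user-supplied callable.

    `train_and_score` is hypothetical: it should train a GBDT with the given
    keyword arguments and return the error against the actual values.
    """
    return train_and_score(**decode(position))


params = decode([0.0, 0.5, 1.0])
```

Searching in the unit cube keeps the step factor α meaningful across parameters with very different scales, at the cost of rounding the two integer parameters after each move.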

Drawings

FIG. 1 is a general flow diagram of the present invention;

FIG. 2 is a schematic diagram of the GBDT algorithm;

Detailed Description

With reference to FIG. 1, the river water level prediction method of the present invention, based on the chaotic firefly algorithm and the gradient boosting tree model, includes the following steps:

A. Data acquisition: the required data are divided into five types, namely timestamp data indicating that a record existed, complete and verifiable, at a specific point in time; accumulated water quantity data giving the total river flow in the current time period; instantaneous flow data giving the amount of fluid passing through the effective cross-section of a closed pipeline or open channel per unit time; flow velocity data giving the displacement of the river water per unit time; and water level data, which most directly reflect the water regime of the water body in the current time period.

B. A decision tree model is trained with the gradient boosting strategy. The result of the model is an ensemble of classification and regression trees (CART tree ensemble), which can be expressed as:

$$\hat{y}_i = \sum_{k=1}^{n} f_k(x_i) \tag{1}$$

In formula (1), $f_k(x_i)$ denotes the k-th decision tree and $\hat{y}_i$ denotes a strong classifier formed by linearly summing n weak classifiers. That is, a new decision tree function $f_k(x_i)$ is added to the predicted value of the previous round so that the residual with respect to the true value is reduced as far as possible. The objective function of GBDT is:

$$L = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k) \tag{2}$$

In formula (2), $l$ is a differentiable loss function that measures the difference between the predicted value $\hat{y}_i$ and the true value $y_i$, and $\sum_k \Omega(f_k)$ is the added regularization term. $\Omega$ represents the complexity of a decision tree; it may constrain the number of nodes of the tree, the depth of the tree, or the $L_2$ norm of the scores assigned to the leaf nodes, and thereby prevents over-fitting of the model.

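The objective function of the t-th iteration is given by formula (3), in which C is a constant:

$$L^{(t)} = \sum_{i} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + C \tag{3}$$

Expanding formula (3) by Taylor's formula and taking the second-order form as an approximation of the objective function gives:

$$L^{(t)} \approx \sum_{i} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) + C \tag{4}$$

where $g_i$ and $h_i$ are the first and second derivatives of the loss function with respect to $\hat{y}_i^{(t-1)}$. Writing $G_j$ and $H_j$ for the sums of $g_i$ and $h_i$ over the samples that fall in leaf $j$, and with $\gamma$, $\lambda$, and $T$ denoting the leaf node coefficient, the $L_2$ coefficient, and the number of leaf nodes, solving for the minimum of the resulting quadratic gives the optimal leaf weight $\omega_j^*$ and the optimal value $L^*$ of the objective function:

$$\omega_j^* = -\frac{G_j}{H_j + \lambda}, \qquad L^* = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T \tag{5}$$

Once the structure function $q$ of the decision tree is known, the objective function can be calculated from this formula. The final problem becomes finding the optimal tree structure $q^*$ for which the objective function is minimal.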

The improved firefly algorithm optimizes the GBDT model training parameters and is characterized as follows:

1. An inertia weight is introduced. When solving a problem, an optimization algorithm is generally expected to show good global search capability in the early stage and fine local exploitation capability in the later stage. The position update of the firefly algorithm is random, so to improve the performance of the algorithm an inertia weight ω is introduced into the position update formula:

$$x_i(t+1) = \omega x_i(t) + \beta \left( x_j(t) - x_i(t) \right) + \alpha \left( \mathrm{rand} - \tfrac{1}{2} \right) \tag{6}$$

ω decreases as the iteration number t increases, which gives the firefly algorithm a good search space. In the early stage ω is large, so the algorithm can jump out of local optimal solutions and its global search capability is ensured. In the later stage ω is small, which preserves the local search capability of the algorithm while speeding up its late-stage search.
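The text only requires that ω decrease with t. As a sketch, the schedule below assumes a linear decrease between two bounds; the linear form, the values 0.9 and 0.4, and the function names are assumptions borrowed from common practice rather than taken from the patent.

```python
import random


def inertia_weight(t, max_generation, w_max=0.9, w_min=0.4):
    """Assumed linear schedule: w_max at t = 0, falling to w_min at t = max_generation."""
    return w_max - (w_max - w_min) * t / max_generation


def weighted_move(x_i, x_j, beta, alpha, w, rng):
    """Position update with inertia weight:
    x_i(t+1) = w * x_i(t) + beta * (x_j(t) - x_i(t)) + alpha * (rand - 1/2).
    """
    return [w * a + beta * (b - a) + alpha * (rng.random() - 0.5)
            for a, b in zip(x_i, x_j)]


rng = random.Random(0)
w_early = inertia_weight(0, 100)
w_late = inertia_weight(100, 100)
# With alpha = 0 the random term vanishes, which makes the move easy to check by hand.
moved = weighted_move([2.0], [4.0], beta=0.25, alpha=0.0, w=0.5, rng=rng)
```

Note that ω multiplies the firefly's own position, so ω < 1 also pulls each coordinate towards the origin; the search ranges should be chosen with that in mind.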

2. A chaotic mutation mechanism is added. To remedy the poor optimization precision of the algorithm, when many points have stopped changing between iterations, the ergodicity of a chaotic system is used to jump out of the local optimal solution. The chaotic system is given by the Logistic map:

$$X_{n+1} = u X_n (1 - X_n), \quad n = 0, 1, 2, \ldots \tag{7}$$

where u is a control parameter; generally u = 4, at which the system is fully chaotic. Given any initial value X₀ ∈ [0, 1], the Logistic system is then in a fully chaotic state, which ensures that the scattered points are global and uniform. When the optimal solution has remained unchanged for h iterations, the mapping condition is triggered: the Logistic map is applied to the positions, which are scattered through the space again for optimization, so that the algorithm does not easily fall into a local optimum and its late-stage accuracy is ensured.
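The trigger and the re-scattering step can be sketched as follows. How a position is fed through the Logistic map is not spelled out in the text, so normalizing each coordinate to [0, 1], applying one map step with u = 4, and scaling back is an assumption, as are the names `StagnationMonitor` and `chaotic_rescatter`.

```python
def logistic_sequence(x0, length, u=4.0):
    """Iterate X_{n+1} = u * X_n * (1 - X_n) and return the next `length` values."""
    xs, x = [], x0
    for _ in range(length):
        x = u * x * (1.0 - x)
        xs.append(x)
    return xs


def chaotic_rescatter(position, bounds, u=4.0):
    """Normalise each coordinate to [0, 1], apply one Logistic step, scale back."""
    lo, hi = bounds
    out = []
    for p in position:
        z = (p - lo) / (hi - lo)
        z = u * z * (1.0 - z)
        out.append(lo + z * (hi - lo))
    return out


class StagnationMonitor:
    """Signals when the best cost has not improved for h consecutive updates."""

    def __init__(self, h):
        self.h = h
        self.best = None
        self.unchanged = 0

    def update(self, best_cost):
        if self.best is None or best_cost < self.best:
            self.best, self.unchanged = best_cost, 0
        else:
            self.unchanged += 1
        if self.unchanged >= self.h:
            self.unchanged = 0   # reset so the trigger can fire again later
            return True
        return False


monitor = StagnationMonitor(h=2)
signals = [monitor.update(c) for c in [1.0, 1.0, 1.0, 0.5]]
scattered = chaotic_rescatter([2.5], bounds=(0.0, 10.0))
```

In practice the starting value matters: with u = 4 the values 0, 0.25, 0.5, 0.75, and 1 land on fixed points of the map after at most two steps, so coordinates sitting exactly on them would need a small perturbation before mapping.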

C. Model checking on the test set: part of the processed data is selected at random as the test set and the remaining data are used as the training set. The GSO algorithm is used for optimization and parameter tuning to obtain the GBDT model under the optimal parameters; the model is then checked on the test set, and the errors between the actual values and the calculated values are analyzed and compared.
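Step C can be illustrated with a random split and two common error measures. The patent does not name its error metric or split ratio, so RMSE, MAE, the 20% test fraction, and the function names below are assumptions.

```python
import math
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Randomly select a fraction of the rows as the test set; the rest is training."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    n_test = int(round(len(rows) * test_fraction))
    test = [rows[i] for i in idx[:n_test]]
    train = [rows[i] for i in idx[n_test:]]
    return train, test


def rmse(actual, predicted):
    """Root mean squared error between actual and calculated values."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error between actual and calculated values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


train, test = train_test_split(list(range(10)))
error_rmse = rmse([1.0, 3.0], [2.0, 5.0])
error_mae = mae([1.0, 3.0], [2.0, 5.0])
```

A random split follows the text; for strongly autocorrelated water level series a chronological split is a common alternative, since a random split can place test points between closely neighboring training points.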

Claims

1. A river water level prediction method based on a chaotic firefly algorithm and a gradient boosting tree model, characterized by comprising the following steps:

S101: Data acquisition: the required data are divided into five types, namely timestamp data indicating that a record existed, complete and verifiable, at a specific point in time; accumulated water quantity data giving the total river flow in the current time period; instantaneous flow data giving the amount of fluid passing through the effective cross-section of a closed pipeline or open channel per unit time; flow velocity data giving the displacement of the river water per unit time; and water level data, which most directly reflect the water regime of the water body in the current time period.

S102: Data preprocessing: the collected data are all structured data, and preprocessing of the structured data comprises eliminating abnormal values, handling missing values, and normalizing the data.

S103: The training parameters of a Gradient Boosted Decision Tree (GBDT) model are optimized with an improved chaotic firefly algorithm (GSO), and the improved gradient boosting tree model is applied to river water level prediction.

S104: A training sample set is constructed by randomly selecting part of the five types of processed data for model training; the GSO algorithm is used for optimization and parameter tuning to obtain the GBDT model under the optimal parameters; the model is checked on a test set, the error with respect to the actual values is calculated, and the performance of the model is verified.

2. The five types of data according to claim 1, wherein:

the data acquisition for water level prediction comprises the following:

S1011: the timestamp indicates that a record existed, complete and verifiable, at a specific point in time;

S1012: the accumulated water quantity reflects the total river flow in the current time period;

S1013: the instantaneous flow reflects the amount of fluid passing through the effective cross-section of a closed pipeline or open channel per unit time; river flow is at present measured mainly with flow meters, and because the flow is unstable the error between the measured and actual values is large;

S1014: the flow velocity reflects the displacement of the river water per unit time; the velocity differs from point to point within a channel or river course, being smaller near the bottom and the banks and largest near the water surface at the center of the river;

S1015: the water level most directly reflects the water regime of the water body in the current time period; water level observations generally include the influence of flow, waves, and ice conditions, and the observation times and their number vary with the course of the water level over the day.

3. The structured data preprocessing of claim 1, wherein:

S1021: when the data are preprocessed, missing values are handled by one of three methods: using the features containing missing values directly; deleting the features containing missing values (effective when an attribute has a large number of missing values and only very few valid values); or completing the missing values. Common feature selection methods fall into three categories: filter, wrapper, and embedded.

4. The GBDT model of S103 according to claim 1, wherein:

S1031: the invention provides a GBDT model that can better realize classification and regression tasks in order to predict the water level.

In GBDT, each training round fits a new model along the gradient direction that reduces the residual left by the previous round, and the accumulated output of all trees is used as the final classifier, so that classification and regression tasks are handled well and over-fitting is less likely. GBDT principle: a decision tree model trained with the gradient boosting strategy. The result of the model is an ensemble of classification and regression trees (CART tree ensemble), which can be expressed as:

$$\hat{y}_i = \sum_{k=1}^{n} f_k(x_i) \tag{1}$$

where $f_k(x_i)$ denotes the k-th decision tree. The objective function of GBDT is:

$$L = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k) \tag{2}$$

In formula (2), $l$ is a differentiable loss function that measures the difference between the predicted value $\hat{y}_i$ and the true value $y_i$, and $\sum_k \Omega(f_k)$ is the added regularization term. $\Omega$ represents the complexity of a decision tree; it may constrain the number of nodes of the tree, the depth of the tree, or the $L_2$ norm of the scores assigned to the leaf nodes, and thereby prevents over-fitting of the model. The objective function of the t-th iteration, in which C is a constant, and its second-order Taylor approximation are:

$$L^{(t)} = \sum_{i} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + C \tag{3}$$

$$L^{(t)} \approx \sum_{i} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) + C \tag{4}$$

In formula (4), $g_i$ and $h_i$ are the first and second derivatives of the loss function with respect to $\hat{y}_i^{(t-1)}$. With the constant terms removed, the objective function of the t-th iteration simplifies to formula (5), and the tree complexity function used herein is formula (6):

$$\tilde{L}^{(t)} = \sum_{i} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) \tag{5}$$

$$\Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} \omega_j^2 \tag{6}$$

In formula (6), $\gamma$ is the leaf node coefficient and $T$ is the number of leaf nodes; $\lambda$ is the coefficient of the squared $L_2$ norm, which also acts to prevent over-fitting, and $\omega$ denotes the leaf weights. The decision tree function is redefined as $f_t(x) = \omega_{q(x)}$, i.e. the tree is split into a structure function $q$ and a leaf weight part $\omega$, where $q$ maps an input to the index of a leaf, $q: \mathbb{R}^d \to \{1, 2, \ldots, T\}$. Defining the sample set of each leaf as $I_j = \{ i \mid q(x_i) = j \}$, the objective function is rewritten as:

$$\tilde{L}^{(t)} = \sum_{j=1}^{T} \left[ G_j \omega_j + \frac{1}{2} (H_j + \lambda) \omega_j^2 \right] + \gamma T \tag{7}$$

where $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$. Solving for the minimum of this quadratic gives the optimal leaf weight $\omega_j^*$ and the optimal value $L^*$ of the objective function:

$$\omega_j^* = -\frac{G_j}{H_j + \lambda} \tag{8}$$

$$L^* = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T \tag{9}$$

From the above, once the structure function $q$ of the decision tree is known, the objective function can be calculated from the above formula. The final problem becomes finding the optimal tree structure $q^*$ for which the objective function is minimal.

5. The firefly algorithm of S103 according to claim 1, wherein:

S1032: the firefly search algorithm is a heuristic search algorithm based on bionics. The brightness of a firefly is related to the objective value at its position: the brighter the firefly, the better its position, i.e. the better its objective function value. Most fireflies eventually gather at a number of positions, i.e. at the extreme points. The relative fluorescence brightness of a firefly is:

$$I = I_0 e^{-\gamma r_{ij}} \tag{10}$$

In formula (10), $I_0$ denotes the brightness of the brightest firefly, $\gamma$ denotes the light absorption coefficient, and $r_{ij}$ is the distance between firefly i and firefly j. The mutual attraction β is:

$$\beta = \beta_0 e^{-\gamma r_{ij}^2} \tag{11}$$

In formula (11), $\beta_0$ represents the maximum attraction, i.e. the attraction at the light source. The position update towards the optimal target is:

$$x_i(t+1) = x_i(t) + \beta \left( x_j(t) - x_i(t) \right) + \alpha \left( \mathrm{rand} - \tfrac{1}{2} \right) \tag{12}$$

In formula (12), $x_i$ and $x_j$ are the spatial positions of fireflies i and j, α is the step factor, and rand is a random factor uniformly distributed on [0, 1].

The firefly algorithm is implemented as follows:

(1) Set the initial state: the number of fireflies n, the maximum attraction β₀, the light intensity absorption coefficient γ, the step factor α, and the maximum number of iterations MaxGeneration or the search precision ε.

(2) Initialize the positions of the fireflies at random, and calculate the objective function value of each firefly as its maximum fluorescence brightness I₀.

(3) Calculate the relative brightness I and the attraction β of the fireflies in the population, and determine the direction of movement of each firefly from the relative brightness.

(4) Update the positions of the fireflies, move the firefly at the best position randomly, and recalculate the brightness of the fireflies.

(5) Update the optimal solution of the objective function and its position, and check whether the optimal solution meets the set condition or the maximum number of iterations has been reached; if not, return to step (3) and iterate.

(6) Output the global extreme point.

6. The improved firefly algorithm for optimizing GBDT model training parameters of claim 1, wherein:

S1033: the firefly algorithm is simple, easy to understand, and has few parameters; few parameters need to be configured when solving a problem, so it is easy to implement. Studies have shown that the algorithm may be more efficient than genetic algorithms, PSO, and other algorithms. However, the firefly algorithm has disadvantages such as a low discovery rate in local search, slow search speed, and low precision. It is improved here in the following two respects:

(1) Introducing an inertia weight

When solving a problem, an optimization algorithm is generally expected to show good global search capability in the early stage and fine local exploitation capability in the later stage. The position update of the firefly algorithm is random, so to improve the performance of the algorithm an inertia weight ω is introduced into the position update formula:

$$x_i(t+1) = \omega x_i(t) + \beta \left( x_j(t) - x_i(t) \right) + \alpha \left( \mathrm{rand} - \tfrac{1}{2} \right) \tag{13}$$

ω decreases as the iteration number t increases, which gives the firefly algorithm a good search space. In the early stage ω is large, so the algorithm can jump out of local optimal solutions and its global search capability is ensured. In the later stage ω is small, which preserves the local search capability of the algorithm while speeding up its late-stage search.

(2) Adding a chaotic mutation mechanism

To remedy the poor optimization precision of the algorithm, when most points have stopped changing between iterations, the ergodicity of a chaotic system is used to jump out of the local optimal solution. The chaotic system is given by the Logistic map:

$$X_{n+1} = u X_n (1 - X_n), \quad n = 0, 1, 2, \ldots \tag{14}$$

where u is a control parameter; generally u = 4, at which the system is fully chaotic. Given any initial value X₀ ∈ [0, 1], the Logistic system is then in a fully chaotic state, which ensures that the scattered points are global and uniform. The algorithm therefore does not easily fall into a local optimum, and its late-stage accuracy is ensured.