Local and Global Learning Methods for Predicting Power of a Combined Gas & Steam Turbine
International Conference on Emerging Trends in Computer and Electronics Engineering (ICETCEE'2012) March 24-25, 2012 Dubai
instance is labeled with the majority class out of the k neighbors. In a regression task, the outputs of the k neighbors are aggregated. In both tasks it is possible to weight the contributions by means of a kernel. A kernel can be seen as a smooth weight function which takes the distance between the test instance and a neighbor as input. The aggregation can later be normalized by the sum of the kernel coefficients so that the weights sum to 1. The most widely used kernel is the Gaussian kernel:

K(u) = \left( \frac{1}{\sqrt{2\pi}} \right)^{d} \exp\!\left( -\frac{\|u\|^{2}}{2} \right)    (1)

where the exponent d is the dimensionality of the multivariate dataset. The smooth, as a weighted sum of the k neighbors, is then attained using:

\hat{g}(x) = \frac{\sum_{t=1}^{k} K(x - x^{t})\, r^{t}}{\sum_{t=1}^{k} K(x - x^{t})}    (2)

B. Multivariate Linear Regression

Polynomial regression of order higher than linear is commonly used in the univariate case. For multivariate datasets, however, it is not common to use polynomials of order higher than linear [13]. A linear model is preferable for two reasons: 1) it is simpler in terms of complexity; 2) the resulting model is easier to interpret, since the coefficients give direct information about the relative importance of the variables. In fact, univariate polynomial regression of order d is a special case of multivariate linear regression where x_1 = x, x_2 = x^2, …, x_d = x^d. Mathematically, if the linear model is [13]

r^{t} = g(x^{t} \mid w_{0}, w_{1}, \ldots, w_{d}) + \epsilon = w_{0} + w_{1} x_{1}^{t} + w_{2} x_{2}^{t} + \cdots + w_{d} x_{d}^{t} + \epsilon    (3)

the error function can be stated as

E = \frac{1}{2} \sum_{t} \left( r^{t} - w_{0} - w_{1} x_{1}^{t} - w_{2} x_{2}^{t} - \cdots - w_{d} x_{d}^{t} \right)^{2}    (4)

Taking partial derivatives with respect to the coefficients w_j, j = 0, …, d, the normal equations are obtained. Defining a bias variable x_0 = 1, let the bias-padded dataset be X, the weight vector be w, and the vector of outputs be r. Then the d+1 normal equations can be written as

X^{T} X w = X^{T} r    (5)

where w can be solved using

w = (X^{T} X)^{-1} X^{T} r    (6)

… networks such as Support Vector Machines and Multi-Layer Perceptrons [13]. Another approach to a multivariate regression problem is fitting higher-order univariate models and aggregating the individual estimates [13]. This approach is called the additive model, the details of which can be found in [17].

C. Feedforward Error-Backpropagating ANN

An Artificial Neural Network (ANN) is simply a predictive model inspired by the human brain. Similar to a biological NN, it has a layered architecture with neurons and connections among the layers. A basic ANN has three layers: input, hidden, and output. The number of hidden layers and the number of hidden units (neurons) can be adjusted depending on problem complexity. It has been proven that using a sufficient number of hidden units in one hidden layer has the same effect as using multiple hidden layers, and such a one-hidden-layer ANN can learn any nonlinear function [13]. A neuron sums the inputs directed to it, weighted by the corresponding connection weights, and transforms the sum using a non-linear activation function. This activation function is generally the sigmoid function:

\mathrm{sigmoid}(x) = \frac{1}{1 + \exp(-x)}    (7)

An important prerequisite for any activation function is differentiability. This property is used in the learning process, i.e. updating the weights based on the error. In a feedforward NN, units take input only from lower layers, process it, and pass it to higher layers. The error, which is the difference between the expected value and the ANN output, is backpropagated through the network using partial derivatives with respect to the weights. Learning stops either when learning stabilizes, when the error falls below a threshold, or when validation-set accuracy does not increase for a specified number of epochs. An epoch is feeding the ANN all training-set instances in random order.

D. K-Means Clustering

K-Means clustering is a centroid-based algorithm. In the machine learning literature [13], K-Means clustering is used for several purposes: to label unlabeled data, to map data into a lower-dimensional space, or to fit local models. This study utilizes K-Means for fitting local models to clusters. Since clustering is stochastic due to the random initialization of the means, it is necessary to aggregate the suitable models obtained from various clusterings for better approximation. In this approach the training set is clustered, and a model (e.g. an ANN) is trained for each cluster. Later, when a test input is to be predicted, the model(s) associated with the mean nearest to this input is/are used.
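The K-Means local-modeling scheme described above — cluster the training set, fit one model per cluster, and answer a query with the model of the nearest mean — can be sketched as follows. This is a minimal pure-Python illustration: the toy dataset and function names are hypothetical, and a constant per-cluster predictor stands in for the per-cluster ANN the paper actually trains.

```python
import random

def kmeans(X, K, iters=20, seed=0):
    # Standard Lloyd iterations: assign each point to its nearest mean,
    # then recompute each mean as the centroid of its assigned points.
    rng = random.Random(seed)
    means = rng.sample(X, K)
    for _ in range(iters):
        clusters = [[] for _ in range(K)]
        for x in X:
            j = min(range(K), key=lambda c: sum((xi - mi) ** 2 for xi, mi in zip(x, means[c])))
            clusters[j].append(x)
        for c in range(K):
            if clusters[c]:  # keep the old mean if a cluster goes empty
                means[c] = [sum(col) / len(clusters[c]) for col in zip(*clusters[c])]
    return means

def fit_local_models(X, r, K):
    # One deliberately simple local model per cluster: the mean output of
    # its members (the paper uses an ANN per cluster; a constant model
    # keeps this sketch short).
    means = kmeans(X, K)
    sums, counts = [0.0] * K, [0] * K
    for x, y in zip(X, r):
        j = min(range(K), key=lambda c: sum((xi - mi) ** 2 for xi, mi in zip(x, means[c])))
        sums[j] += y
        counts[j] += 1
    models = [sums[c] / counts[c] if counts[c] else 0.0 for c in range(K)]
    return means, models

def predict_local(x, means, models):
    # Use the model associated with the cluster mean nearest to the input.
    j = min(range(len(means)), key=lambda c: sum((xi - mi) ** 2 for xi, mi in zip(x, means[c])))
    return models[j]

# Hypothetical toy data: two well-separated groups with different outputs.
X = [[0.0], [0.2], [0.1], [5.0], [5.2], [5.1]]
r = [1.0, 1.2, 1.1, 9.0, 9.2, 9.1]
means, models = fit_local_models(X, r, K=2)
print(round(predict_local([0.05], means, models), 2),
      round(predict_local([5.05], means, models), 2))
```

With only two clusters and clean separation the constant models suffice; in the paper's setting each cluster instead gets its own trained ANN, and models from several clusterings are aggregated to offset the random initialization.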
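The kernel weighting evaluated in the experiments below follows Eqs. (1) and (2): each of the k nearest neighbors contributes its output weighted by a Gaussian kernel of its distance, normalized by the sum of kernel values. A minimal sketch (the dataset values and function names are illustrative, not from the paper):

```python
import math

def gaussian_kernel(u, d):
    # Eq. (1): K(u) = (1/sqrt(2*pi))^d * exp(-||u||^2 / 2)
    sq = sum(ui * ui for ui in u)
    return (1.0 / math.sqrt(2.0 * math.pi)) ** d * math.exp(-sq / 2.0)

def knn_kernel_predict(x, train_X, train_r, k):
    # Order training points by Euclidean distance to x (squared distances
    # preserve the ordering) and keep the k nearest.
    d = len(x)
    order = sorted(range(len(train_X)),
                   key=lambda t: sum((xi - ti) ** 2 for xi, ti in zip(x, train_X[t])))
    nearest = order[:k]
    # Eq. (2): kernel-weighted average of the k neighbors' outputs.
    weights = [gaussian_kernel([xi - ti for xi, ti in zip(x, train_X[t])], d)
               for t in nearest]
    total = sum(weights)
    return sum(w * train_r[t] for w, t in zip(weights, nearest)) / total

# Tiny illustrative dataset with 2 features.
X = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 0.9]]
r = [10.0, 10.2, 20.0, 19.8]
print(round(knn_kernel_predict([0.05, 0.0], X, r, k=2), 2))  # → 10.1
```

Because the weights are normalized by their sum, the prediction is a proper weighted average of the neighbors' outputs, which is what makes the kernel-smoothed estimate insensitive to the absolute scale of the kernel values.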
kernel weighting was applied with the leave-one-day-out setting. The data points corresponding to a specific day were used for validation and the rest were used as training; this was applied for all days. Since the real-life application is predicting the next day's hourly P_E, this setting reflects the most realistic case.

TABLE IV: PRELIMINARY K-NN TESTS WITH LEAVE-ONE-DAY-OUT

k        5     10    15    20    25
MRE(%)   0.82  0.79  0.78  0.77  0.77

Subsets  T  V  AP  RH  MRE(%)  MAE(MW)
1        1  0  0   0   0.92    4.19
2        0  1  0   0   1.61    7.38
3        1  1  0   0   0.81    3.67
4        0  0  1   0   2.58    11.79
5        1  0  1   0   0.91    4.11
6        0  1  1   0   1.30    5.97
7        1  1  1   0   0.78    3.55
8        0  0  0   1   2.82    12.89
9        1  0  0   1   0.84    3.80
10       0  1  0   1   1.33    6.09
11       1  1  0   1   0.78    3.53
12       0  0  1   1   2.30    10.54
13       1  0  1   1   0.85    3.87
14       0  1  1   1   1.24    5.67
15       1  1  1   1   0.77    3.51

Tests with subsets imply that the best accuracy is obtained when all features are used. As expected, T gives the highest individual predictive accuracy. Also, the collective performance of T and V is found to be significantly better than the individual performances. If dimensionality reduction by feature selection is intended, either of the remaining two ambient variables can be used besides T and V.

For subsequent tests, in order to compare the performance of different learning methods, 5x2 cross-validation [21] was applied. In this scheme, the dataset is randomly shuffled 5 times and each shuffle is used in 2-fold CV. The resulting 10 validation-set performances are used for a statistical significance test. The results indicated that both preprocessing and kernel smoothing significantly increase performance. MATLAB's ANN Toolbox uses the mapminmax function to normalize data into the [-1, +1] range. This preprocessing does not change the performance of k-NN; however, scaling with the mean and standard deviation gives the desired outcome. Figure 6 depicts 1-way ANOVA results (compare means) of k-NN Mean Square Error (MSE) for datasets using 1) no preprocessing, no kernel smoothing, 2) only kernel smoothing, 3) only preprocessing, 4) both.

Fig. 6: ANOVA for k-NN MSE Performance of 4 Settings

It was observed that using preprocessing and kernel smoothing requires no more than 5 neighbors to aggregate. Later tests with MATLAB's NN Toolbox were carried out with the 'trainlm' training algorithm and one hidden layer with 10 hidden units. Learning stops when any of the following is reached (default values are given in brackets):
• Max epochs (1000)
• Min error (1 × 10⁻⁵)
• Validation set accuracy does not increase (6 epochs)
• Max η
• Min gradient

Contrary to our expectation, in the 5x2 CV setting the MSE performance of 5-NN was found to be significantly better than that of the fine-tuned ANN, with mean statistics of 16.07 and 16.64, respectively.

Fitting linear regression functions for T in 5x2 CV yields an average MSE performance of 29.45, which is dramatically weaker than both k-NN and ANN. In the same CV setting, while additive regression was found to worsen predictive performance, a multivariate linear regression model over all features was found to reduce the MSE to 21.84.

Lastly, local models were applied as proposed in [22]. In the same 5x2 CV setting, for each validation-set instance a specific local ANN was trained using the 100 nearest neighbors in the training set. This means 5x2xN/2 = 95,680 ANNs were constructed, using 100 training instances for each. After this computationally complex process, the MSE performance was found to be 19.99, which fell significantly behind the global model performance. Later, a more efficient local approach with K-Means clustering is used. In this model, first the training set is clustered and K models are trained for