Kanner 2017
OMAE2017
June 25-30, 2017, Trondheim, Norway
OMAE2017-62077
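[Flowchart of the algorithm. Boxes recovered from the figure: Normalization (X_i → X̄_i, such that 0 ≤ X̄_i ≤ 1); Initial clustering; Assign Bin (assign all X̄_i to a window w_k); Update Centroids (each centroid is the mean of the observations in its window w_k); Update Distance and/or Population Tolerance(s) (d_tol and N_tol are each rescaled by a factor β); Done; Denormalization (X̄_i → X_i).]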
that the largest array created is of size n · max_j(q_j) − 1. To ensure that no dimension is weighted disproportionately, the data is normalized using the min and max of each dimension so that the normalized data and gridlines X̄_i, Ȳ_{j,q_j} ∈ [0, 1].
A two-dimensional example is shown in Fig. 2. Here, the observational data is represented by points and the gridlines by dashed lines. The gridlines are evenly spaced with q_1 = q_2 = 5. All of the data and gridlines have been normalized such that they span [0, 1]. It is apparent that many cells, such as those in the lower-right corner, are empty. Other cells, such as those in the upper-right corner, only contain relatively few observations. The purpose of the extreme clustering algorithm is to strike a balance between the number of clusters created and the distance between an observation and its associated cluster.

2.2 Extreme Clustering
In the extreme clustering loop, clusters are created only if they pass a population threshold test. On a given iteration, the population tolerance is given as a percentage of the total number of observations. If a cell does not contain enough observations, then a cluster is not formed. This concept can be seen in Fig. 3(a), where only 3 cells have passed the population threshold test to form clusters or bins. These bins are shown by the large circles with a number corresponding to the number of observations they represent. They are located at the average position, or centroid, of all the representing observations. The black lines show observations outside of the clusters' respective cells that pass the distance test. In p dimensions, a distance is defined as,

\[ d_k = \alpha \left( \sum_{j=1}^{p} \left( \hat{x}_j - \bar{x}_{j,k} \right)^m \right)^{1/m} \qquad (2) \]
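The normalization, the population threshold test, and the distance of Eq. 2 can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the `cells_per_dim` and `n_tol` parameters, the use of `>=` in the threshold test, and the absolute value inside the distance (added so that odd exponents m behave sensibly) are all assumptions made for illustration. With q_j evenly spaced gridlines that include both ends of [0, 1], there would be q_j − 1 cells in dimension j.

```python
from collections import Counter


def normalize(data):
    """Min-max scale every dimension (column) of `data` to [0, 1]."""
    dims = list(zip(*data))
    lows = [min(d) for d in dims]
    highs = [max(d) for d in dims]
    # Assumption: a constant dimension maps to 0.0 to avoid dividing by zero.
    return [
        tuple((x - lo) / (hi - lo) if hi > lo else 0.0
              for x, lo, hi in zip(row, lows, highs))
        for row in data
    ]


def cell_index(point, cells_per_dim):
    """Index of the evenly spaced grid cell that holds a normalized point."""
    # min(...) puts the boundary value 1.0 into the last cell.
    return tuple(min(int(x * c), c - 1) for x, c in zip(point, cells_per_dim))


def populated_cells(points, cells_per_dim, n_tol):
    """Cells whose population reaches n_tol, a fraction of all observations."""
    counts = Counter(cell_index(pt, cells_per_dim) for pt in points)
    threshold = n_tol * len(points)  # assumption: pass if count >= threshold
    return {cell: c for cell, c in counts.items() if c >= threshold}


def distance(x_hat, centroid, alpha=1.0, m=2):
    """Weighted distance in the spirit of Eq. 2 (abs() is an added assumption)."""
    total = sum(abs(a - b) ** m for a, b in zip(x_hat, centroid))
    return alpha * total ** (1.0 / m)


if __name__ == "__main__":
    raw = [(0.0, 1.0), (1.0, 1.5), (1.5, 2.0), (9.0, 9.0), (10.0, 11.0)]
    pts = normalize(raw)
    print(populated_cells(pts, cells_per_dim=(4, 4), n_tol=0.4))
    print(distance(pts[0], pts[1]))
```

Keeping the three steps as separate functions makes it easy to vary the tolerance or the exponent m independently when experimenting with the number of clusters that form.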
[Figure 3, panels (a) to (h): clusters formed on successive iterations, plotted as normalized y−distance against normalized x−distance (both dimensionless); each circle is labeled with the number of observations it represents.]
data. The bias in the initial gridding of the data can be seen in Fig. 3(h), where the final centroids are all located near the center of each gridded cell (that has observational data). Changing the initial grid would have a large effect on the final position of the clusters. As [6] notes, however, the random initialization of the K-means or the SOM classifications also has a great influence on the final centroids.
Another drawback of this algorithm is that the final number of clusters, m, cannot be determined a priori. That is, the tolerances, N_tol and d_tol, or the initial gridlines must be modified until the desired final number of clusters is achieved. However, after only a few rounds of trial and error, the user can generally obtain m ± 10 clusters fairly quickly.
One of the advantages of the algorithm is its ability to ensure that the clusters cover a wide range of values in each single dimension, much like the MDA. In Sec. 3, it is shown that a bin can be formed due to a single outlier. In that example, the outlier is accounted for in the final centroids, which may (or may not) have a significant impact on the fatigue life. A discussion of how the clusters can be modified to meet the user's specifications can be found in Sec. 4.
Table 1: Number of observations in each color-coded bin shown in Fig. 3, per iteration. The order of the columns corresponds to the bins starting in the
upper-left corner and moving to the right, and then down row-by-row.
3 CASE STUDY
[…] target, the Dutch government established four wind farm sites on the Dutch continental shelf. The Dutch government contracted Deltares to provide a public metocean study for each site. The metocean data for Site III, on which this example is based, can be found in [12]. The site is shown in Fig. 4. The wind conditions are based on high- […] cast simulations with the Delft3D-FLOW model for the same time period as the waves. Only data from this period is provided by Deltares in [12]. The data provided consists only of unimodal wave spectra (i.e., the wave data is not split into wind-wave seas and swell seas). However, it is common to split the wave data into a bimodal spectrum using information from the wind hindcast, with an algorithm such as the one proposed in [13]. The metocean conditions for Site III are determined at (51.694424° N, 2.961111° E), as shown by the + symbol in Fig. 4, where the water depth is 35.1 m [12]. The current data is not taken into account in the binning analysis, but is assumed to be a linear function of the wind speed.

Figure 4: Map of case-study site off the coast of the Netherlands. The data is taken from Site III.

1 RVO.nl

3.1 RESULTS
Two-dimensional views of the results of using this algorithm to find 50 five-dimensional clusters are shown in Figs. 5 and 6. In Fig. 5(b), a cluster has formed over a single observation at Hs = 4.2 m and a wind speed of 3.1 m/s. In general, it is helpful to keep graphically outlying data since they may have undue influence on the fatigue life of the structure. While the algorithm successfully creates a cluster at this low wind speed, it fails to create any clusters with Hs > 6 m. The lack of clusters in this range is due to the initial grid for this data set. The bias of the initial grid can also be seen in Fig. 5(b), where there is a lack of clusters from 8-10 m/s. The absence of clusters in certain wind speed ranges can be troublesome when calculating the fatigue life from such a limited set of data. For instance, some wind speed ranges may cause more damage than others due to resonance interactions. Not including these data can result in unconservative estimates of the fatigue life. If these large absences are noticed, it is prudent to use a data set with a larger number of clusters, which generally covers a larger subspace of the original domain. The ability to use more clusters in a fatigue analysis allows for a much more accurate and representative set of the data. This feature can be seen in Figs. 7 and 8, where wind speed as a function of wave height is shown for approximately 100, 200, 500, 1000 and 2000 clusters. In Figs. 5(b) and 7(a) there are extremely few clusters between 0-3 m/s, 8-10 m/s, and 16-18 m/s. This issue is nearly resolved in Fig. 7(b), where the number of bins has been increased to nearly 200, and by Fig. 8(a) the distribution of clusters looks quite continuous.

(b) Uw vs Hs
Figure 5: Selected data for algorithm with 50 bins.

(b) Uw vs θw
Figure 6: Selected directional data for algorithm with 50 bins.

3.2 Quality Measure
For each set of bins a 'quality' measure can be used to describe how closely the clusters are located to the observations. The quality of a set of bins q is defined as,

\[ q = 100 \left[ 1 - \frac{1}{n} \sum_{j=1}^{n} \left( \frac{1}{p} \sum_{k=1}^{p} \left| \hat{X}_k - X_{j,k} \right| \right) \right], \qquad (3) \]

where the taxicab (Manhattan) distance is used instead of the Euclidean distance. This definition of the distance function penalizes clusters that are far away from a given observation in a single dimension. Increasing q corresponds to an increase in the 'quality' of a set of bins, in the sense that more observations are closer to their associated clusters. The q values for the bins presented in Sec. 3.1 are shown in Table 2.
Nbins    q
50       93.617
101      93.720
193      94.615
565      95.625
981      96.329
1981     96.971
Table 2: Measure of quality of bins as a function of the number of bins.

(c) 1981 bins.
Figure 8: Wind speed as a function of wave height for clusters of various size.
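The quality measure of Eq. 3 can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' code: the function name `quality` and the `labels` argument are hypothetical, the coordinates are assumed to be normalized to [0, 1], and the inner term is read as the difference between an observation and the centroid of its associated cluster in each dimension.

```python
def quality(observations, centroids, labels):
    """Quality q in the spirit of Eq. 3, using the taxicab distance.

    observations: n points, each with p normalized coordinates
    centroids:    cluster centroids in the same normalized space
    labels:       labels[j] is the centroid index associated with
                  observation j (hypothetical bookkeeping, an assumption)
    """
    n = len(observations)
    p = len(observations[0])
    total = 0.0
    for obs, label in zip(observations, labels):
        centroid = centroids[label]
        # Mean absolute difference over the p dimensions.
        total += sum(abs(a - b) for a, b in zip(obs, centroid)) / p
    return 100.0 * (1.0 - total / n)


if __name__ == "__main__":
    obs = [(0.0, 0.0), (1.0, 1.0)]
    print(quality(obs, [(0.0, 0.0), (1.0, 1.0)], [0, 1]))  # one centroid per point
    print(quality(obs, [(0.5, 0.5)], [0, 0]))              # a single shared centroid
```

Read this way, q = 100 means every observation coincides with its centroid. Under the same normalization assumption, the first row of Table 2 (q = 93.617) corresponds to a mean per-dimension distance of 1 − 93.617/100 ≈ 0.064.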
4 WEIGHTING FUNCTION EXTENSION
This section examines the purpose and consequences of varying the α term in Eq. 2. This term provides the user with
[Figure 10 plot: normalized y−distance versus normalized x−distance, both dimensionless.]
Figure 10: Visualization of final step of weighted algorithm.