Statistical Data Mining: Edward J. Wegman

Statistical Data Mining

Lecture 6

Edward J. Wegman
George Mason University
Density Estimation
Outline of Topics

• Kernel Density Estimators
• Adaptive Density Estimators
• Geometrically-based Density Estimators
Kernel Smoothers

Curse of Dimensionality
A Solution – Mixture Models
• Reduce the number of summands
• Easy interpretation for clustering
• Model complexity reduced
• Data requirements perhaps reduced
• Recursive form reduces storage requirements
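As a concrete illustration (not taken from the lecture), a finite mixture density f(x) = Σ_j π_j φ(x; μ_j, σ_j) takes only a few lines of numpy to evaluate; the function name and univariate Gaussian parameterization below are illustrative choices:

```python
import numpy as np

def mixture_density(x, weights, means, sds):
    """Evaluate a univariate Gaussian mixture density at the points x.

    Illustrative sketch: `weights`, `means`, `sds` are hypothetical
    component parameters, one entry per mixture component.
    """
    x = np.asarray(x, dtype=float)[:, None]        # shape (n, 1)
    w = np.asarray(weights, dtype=float)[None, :]  # shape (1, k)
    mu = np.asarray(means, dtype=float)[None, :]
    s = np.asarray(sds, dtype=float)[None, :]
    # normal density of each point under each component, then mix
    comp = np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))
    return (w * comp).sum(axis=1)                  # shape (n,)
```

Once the k components are fixed, each evaluation is O(k), which is the sense in which an established mixture model is computationally cheap.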
Dimension Reduction for Visualization
Linear Form for Kernels
Surface Normal Calculation for Kernels
Some Kernel Images
Finite Mixtures
Some Illustrations
Adaptive Mixtures
• Finite Mixture Estimators are L1 consistent only if the mixture density is right and if the number of terms is right.
• Adaptive Mixture Estimators have a create rule to add terms as needed.
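A minimal sketch of what an update/create rule might look like for a univariate Gaussian adaptive mixture, in the spirit of Priebe-style recursive updates. The create threshold, the fixed initial variance, and the stochastic-EM-style update formulas are illustrative assumptions, not Wegman's exact rule:

```python
import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

class AdaptiveMixture1D:
    """Toy adaptive mixture with an update/create rule (illustrative)."""

    def __init__(self, create_threshold=1e-3, init_var=1.0):
        self.w, self.mu, self.var = [], [], []
        self.n = 0
        self.create_threshold = create_threshold
        self.init_var = init_var

    def density(self, x):
        return sum(w * normal_pdf(x, m, v)
                   for w, m, v in zip(self.w, self.mu, self.var))

    def observe(self, x):
        self.n += 1
        if not self.w or self.density(x) < self.create_threshold:
            # create rule: the point is poorly explained, so start a
            # new component at the observation
            self.w = [w * (self.n - 1) / self.n for w in self.w]
            self.w.append(1.0 / self.n)
            self.mu.append(x)
            self.var.append(self.init_var)
        else:
            # update rule: recursive step weighted by the posterior
            # probability of each component given x
            post = np.array([w * normal_pdf(x, m, v)
                             for w, m, v in zip(self.w, self.mu, self.var)])
            post /= post.sum()
            for j, p in enumerate(post):
                self.w[j] += (p - self.w[j]) / self.n
                self.mu[j] += p * (x - self.mu[j]) / (self.n * self.w[j])
```

(Variances are held fixed here to keep the sketch short; a full rule would update them recursively as well.)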
Update/Create Rule
Create Rule
Some AMDE Illustrations
Bivariate AMDE Illustration
Trivariate AMDE Illustration
AMDE Properties
• Nonparametric because of its adaptive character
• Generally L1 consistent under relatively mild conditions
• Almost always, too many terms are created
• Need a pruning algorithm
AIC Pruning
Penalty Pruning
• Remove components having a small mixing coefficient
• How?
– assume a penalty distribution on the mixing coefficients
– add a penalty term to the likelihood
– maximize the new penalized likelihood function
Penalty Pruning

The new log likelihood is the usual log likelihood plus the logarithm of the penalty term.
Penalty Pruning Algorithm
1. START with the AMDE result.
2. Compute the component responsibilities for each observation.
3. Compute the mixing coefficients, means, and variances.
4. Are some mixing coefficients too small? If YES, remove those components.
5. Compute the likelihood. If the likelihood improved, return to step 4; otherwise STOP.
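The "remove small components and renormalize" step can be sketched as follows; `prune_mixture` and the fixed `min_weight` cutoff are illustrative, since in the lecture's algorithm the decision is driven by the penalized likelihood:

```python
def prune_mixture(weights, min_weight=0.02):
    """Drop components whose mixing coefficient falls below a cutoff,
    then renormalize the survivors to sum to one.

    Returns (new_weights, kept_indices).  Illustrative sketch only.
    """
    kept = [(w, j) for j, w in enumerate(weights) if w >= min_weight]
    total = sum(w for w, _ in kept)
    return [w / total for w, _ in kept], [j for _, j in kept]
```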
Penalty Pruning Simulation
Conclusions about Mixture Models
• Mixture methods simplify model complexity, but may not simplify computational complexity.
• They naturally suggest clustering mechanisms.
• Once established, a mixture model is computationally simple for estimating probabilities.
Motivations
• Computational complexity issues
– Same as adaptive mixtures case
• Model complexity issues
• Simple histogram-like estimator
• Desire for a data-adaptive density estimator
Basic Strategy

• In d dimensions, tessellate using Delaunay tiles, the dual of the Voronoi tessellation. This yields simplices.
• Choose the convex hull vertices and a comparatively small number of data points (say 10%) as tiling vertices, called the tessellation generating set.
Random Tessellations

• The ideas are motivated by histograms.
– Traditionally a histogram is constructed with equal-size bins in one dimension.
– In multiple dimensions, squares, cubes, or hypercubes are traditionally used.
– This leads to the possibility of overcrowded bins where the density is high and empty bins where the density is low.
• So, we construct a tessellation that is based on the density of observations.
Delaunay Tessellation in 2-D
Tessellation Algorithm
• Begin with a random sample in d dimensions.
• Form the convex hull.
• Randomly subsample the observations in the random sample.
• Use these to construct a data-adaptive Delaunay tessellation.
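The four steps above can be sketched with scipy's convex-hull and Delaunay routines; the function name and the 10% subsampling default are assumptions for illustration:

```python
import numpy as np
from scipy.spatial import ConvexHull, Delaunay

def adaptive_tessellation(data, frac=0.10, seed=0):
    """Build a data-adaptive Delaunay tessellation (illustrative sketch).

    The tessellation generating set is the convex-hull vertices plus a
    random subsample of roughly `frac` of the observations.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    hull = ConvexHull(data)                       # step 2: convex hull
    k = max(1, int(frac * len(data)))
    sub = rng.choice(len(data), size=k, replace=False)  # step 3: subsample
    gen = np.unique(np.concatenate([hull.vertices, sub]))
    return Delaunay(data[gen]), data[gen]         # step 4: tessellate
```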
Computing Content

• The Delaunay tessellation yields
– triangles in two dimensions
– tetrahedra in three dimensions
– d-simplices in d dimensions
• In d dimensions, a d-simplex always has d+1 vertices.
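The content of a d-simplex, computed from its d+1 vertices by the determinant formula on the following slide, is short to code; `simplex_content` is an illustrative name:

```python
import math
import numpy as np

def simplex_content(vertices):
    """Content (area, volume, ...) of a d-simplex from its d+1 vertices.

    Uses the bordered determinant: append a column of ones to the
    (d+1) x d vertex matrix, take |det|, and divide by d!.
    """
    v = np.asarray(vertices, dtype=float)       # shape (d+1, d)
    d = v.shape[1]
    m = np.hstack([v, np.ones((d + 1, 1))])     # shape (d+1, d+1)
    return abs(np.linalg.det(m)) / math.factorial(d)
```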
Content of a Simplex

If $x_j = (x_{1j}, \ldots, x_{dj})$, $j = 1, \ldots, d+1$, are $d+1$ arbitrary points, then the general content of the simplex is

$$ D = \frac{1}{d!}\left|\det\begin{pmatrix} x_{11} & x_{21} & \cdots & x_{d1} & 1 \\ x_{12} & x_{22} & \cdots & x_{d2} & 1 \\ \vdots & \vdots & & \vdots & \vdots \\ x_{1,d+1} & x_{2,d+1} & \cdots & x_{d,d+1} & 1 \end{pmatrix}\right|. $$
Counting Points in a d-simplex
• Now that we have a tessellation with d-simplices, we wish to know if a point is interior to a simplex, on the boundary, or exterior to the simplex.
• There will be $d+1$ vertices on the simplex. Call these vertices $y_j = (y_{1j}, y_{2j}, \ldots, y_{dj})$, $j = 1, \ldots, d+1$. If $x$ is interior, it must be a convex linear combination of the vertices $y_j$.
Counting Points in a d-simplex
B5 œ +4 C54 ß 5 œ "ß á ß .ß
."
Thus
4œ"

where  +4 œ " and +4  !Þ If +4 œ ! for one or


."

4œ"
more 4, then the point B will be on the boundary of
the simplex. If +4  ! for one or more 4, then the
point Bwill be external to the simplex. In matrix
form this can be rewritten

" C"" C#" â C." 

" C.." 
Ð"ß B" ß B# ß á ß B. Ñ œ Ð+" ß +# ß á ß +." Ñ ã ã ã ã ã .
C"." C#." â
Counting Points in a d-simplex
In the obvious shorthand form, we can write $x^{*} = aY$. Here $x^{*}$ is an augmented version of $x$. The vertices of the simplex must be linearly independent for the simplex to be non-degenerate. Then $Y^{-1}$ must exist. Thus we have $a = x^{*} Y^{-1}$. Given the point $x$ to be tested and the vertices of the simplex, it is a simple matter to compute $a$. We have an algorithm for testing $x$.
Counting Points in a d-simplex
Theorem: Compute $a = x^{*} Y^{-1}$. A point $x$ is interior to the simplex if and only if all $a_j > 0$. The point $x$ is exterior to the simplex if and only if one or more of the $a_j < 0$. A point $x$ is on the boundary if and only if all $a_j \ge 0$ with at least one equality.

Thus we may recursively test each observation to discover to which simplex it belongs and tally the total count for each simplex in the tessellation.
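The test $a = x^{*} Y^{-1}$ translates directly into code; `barycentric` and `classify` are illustrative names:

```python
import numpy as np

def barycentric(x, vertices):
    """Solve x* = a Y for the coefficients a, where row j of Y is
    (1, y_1j, ..., y_dj) for vertex j of the simplex."""
    v = np.asarray(vertices, dtype=float)        # shape (d+1, d)
    Y = np.hstack([np.ones((len(v), 1)), v])     # shape (d+1, d+1)
    x_star = np.concatenate([[1.0], np.asarray(x, dtype=float)])
    # a Y = x*  is equivalent to  Y^T a^T = (x*)^T
    return np.linalg.solve(Y.T, x_star)

def classify(x, vertices, tol=1e-12):
    """Interior / exterior / boundary test from the theorem above."""
    a = barycentric(x, vertices)
    if np.all(a > tol):
        return "interior"
    if np.any(a < -tol):
        return "exterior"
    return "boundary"
```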
Data Adaptive Density Estimation
Let $n$ be the number of observations, let $n_i$ be the number of observations in the simplex $T_i$, and finally let $|T_i|$ be the content of simplex $T_i$; then the adaptive density $\hat{f}$ is given by

$$ \hat{f}(x) = \sum_i I_i(x)\, \frac{n_i}{n\,|T_i|} $$

where $I_i$ is the indicator function of $T_i$ and $i$ runs over all simplices in the tessellation.
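A sketch of this estimator using scipy; for simplicity it tessellates all of the data rather than a generating subsample, so it illustrates the formula $\hat{f}(x) = n_i/(n|T_i|)$ rather than the full algorithm, and the function name is illustrative:

```python
import math
import numpy as np
from scipy.spatial import Delaunay

def tessellation_density(data, query):
    """Evaluate f_hat at the query points: count per simplex divided
    by n times the simplex content; zero outside the convex hull."""
    data = np.asarray(data, dtype=float)
    n, d = data.shape
    tri = Delaunay(data)
    # n_i: which simplex each observation falls in
    labels = tri.find_simplex(data)
    counts = np.bincount(labels[labels >= 0], minlength=len(tri.simplices))
    # |T_i|: bordered determinant of each simplex's vertices, over d!
    verts = tri.points[tri.simplices]                     # (m, d+1, d)
    ones = np.ones(verts.shape[:2] + (1,))
    dets = np.linalg.det(np.concatenate([verts, ones], axis=2))
    content = np.abs(dets) / math.factorial(d)
    # evaluate at the query points
    idx = tri.find_simplex(np.asarray(query, dtype=float))
    out = np.zeros(len(idx), dtype=float)
    inside = idx >= 0
    out[inside] = counts[idx[inside]] / (n * content[idx[inside]])
    return out
```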
Properties
• We can re-sample the observations in a bootstrap-like manner to obtain additional tessellation generating sets.
• Because each set of tessellations is a finite collection of convex tiles, intersections will also yield convex sets.
• We can construct data-adaptive ASH-like multivariate density estimators à la David Scott.
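The ASH-like averaging idea can be sketched generically: repeat the random-subsample estimate and average. The function name, repetition count, and subsample fraction below are illustrative assumptions, and `estimator` stands in for any subsample-based density estimator:

```python
import numpy as np

def averaged_estimate(data, query, estimator, n_rep=25, frac=0.10, seed=0):
    """Average density estimates over repeated random subsamples.

    `estimator(subsample, query)` must return density values at the
    query points; this mimics ASH-style averaging over tessellations.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    n = len(data)
    k = max(3, int(frac * n))
    vals = []
    for _ in range(n_rep):
        idx = rng.choice(n, size=k, replace=False)
        vals.append(estimator(data[idx], query))
    return np.mean(vals, axis=0)
```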
Properties

• Bias, variance, consistency, and asymptotic distribution results are available.
• Computational complexity is dependent on convex hull algorithms. Algorithms slightly more complex than O(n) are available.
Bounding the Number of Tiles
• Key to both the density estimation and data
quantization is the relationship between the
number of tessellating points and the
number of simplices generated.
• Seidel (1982a, 1982b) gives upper bounds
on the number of d-simplices as a function
of the number of tessellating points.
• We have considered some sampling
experiments based on sampling from both
uniform and Gaussian distributions.
Bounding the Number of Tiles
Consider a hypercube with n equally spaced points along each one-dimensional edge. Then there are n−1 line segments on each edge, and a total of (n−1)^d sub-hypercubes. Each sub-hypercube may be partitioned into d! d-simplices, yielding a total of d!(n−1)^d simplices.
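The count d!(n−1)^d is straightforward to compute; the function name is illustrative:

```python
from math import factorial

def grid_simplex_count(n, d):
    """Number of d-simplices when each of the (n-1)**d sub-hypercubes
    of an n-points-per-edge grid is split into d! simplices."""
    return factorial(d) * (n - 1) ** d
```

For example, a 3-points-per-edge square grid has 2!·2² = 8 triangles.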
The Two Dimensional Case
The Three Dimensional Case
The Four Dimensional Case
The Five Dimensional Case
The Six Dimensional Case
Seidel Bound Versus Normal
n \ d     2: Seidel   2: Normal   3: Seidel   3: Normal   4: Seidel   4: Normal   6: Seidel      6: Normal
100       195         190         4,849       537         9,311       1,769       285,759        21,128
250       495         488         30,874      1,500       60,761      5,601       4,901,959      94,383
500       995         987         124,249     3,143       246,511     12,479      40,428,959     254,076
800       1,595       1,587       318,799     5,125       634,411     21,037      167,486,359    473,462
1000      1,995       1,987       498,499     6,451       993,011     26,870      328,357,959    630,274

Simulation Results

[3-D plot: log of the number of tiles as a function of dimension and log of the number of points.]
Simulation Results
Are these results reasonable?

Note that as dimension increases, most points in the tessellation generating set are on the convex hull. Thus one would expect more stability. This is consistent with the results on a thin shell of a hypersphere.
Thin Shell of a Hypersphere
Hypervolume of a Simplex

The proof is available in the lecture notes Geometric Methods in Statistics, http://www.galaxy.gmu.edu/stats/syllabi/geowithpics.pdf.
