Statistical Data Mining: Edward J. Wegman
Statistical Data Mining: Edward J. Wegman
Statistical Data Mining: Edward J. Wegman
Lecture 6
Edward J. Wegman
George Mason University
Density Estimation
Outline of Topics
i
Curse of Dimensionality
A Solution – Mixture Models
• Reduce the number of summands
• Easy interpretation for clustering
• Model complexity reduced
• Data requirements perhaps reduced
• Recursive form reduces storage
requirements
Dimension Reduction for Visualization
Linear Form for Kernels
Surface Normal Calculation for Kernels
Some Kernel Images
Some Kernel Images
Some Kernel Images
Some Kernel Images
Some Kernel Images
Some Kernel Images
Some Kernel Images
Some Kernel Images
Some Kernel Images
Finite Mixtures
Some Illustrations
Some Illustrations
Adaptive Mixtures
• Finite Mixture Estimators are L1 consistent
only if the mixture density is right and if the
number of terms is right.
• Adaptive Mixture Estimators have a create
rule to add terms as needed.
Update/Create Rule
Update/Create Rule
Create Rule
Some AMDE Illustrations
Some AMDE Illustrations
Bivariate AMDE Illustration
Bivariate AMDE Illustration
Trivariate AMDE Illustration
AMDE Properties
• Nonparametric because of adaptive
character
• Generally L1 consistent under relatively
mild conditions
• Almost always, too many terms created
• Need pruning algorithm
AIC Pruning
AIC Pruning
Penalty Pruning
• Remove components having small mixing
coefficient,
• How?
– assume a penalty distribution on
– add a penalty term to the likelihood
– maximize the new penalized likelihood
function
Penalty Pruning
Compute ij
Compute
j, j, and j
• Begin with a
random
sample in d-
dimensions.
Tessellation Algorithm
• Form the
convex
hull.
Tessellation Algorithm
• Randomly
subsample the
observations in
the random
sample.
Tessellation Algorithm
• Use these to
construct a data
adaptive
Delauney
tessellation.
Computing Content
If B4 œ ÐB"4 ß á ß B.4 Ñß
4 œ "ß á ß . " are . " arbitrary
points, then the general content
of the simplex is
B"" "
B"# "
B#" â B."
"
H œ .x ã ã Þ
B## â B.#
B". "
ã ã
B "
B#. â B..
"ß." B#ß." â B.ß."
Counting Points in a d-simplex
• Now that we have a tessellation with
d-simplices, we wish to know if a
point is interior to a simplex, on the
boundary, or exterior to the simplex.
4œ"
more 4, then the point B will be on the boundary of
the simplex. If +4 ! for one or more 4, then the
point Bwill be external to the simplex. In matrix
form this can be rewritten
" C.."
Ð"ß B" ß B# ß á ß B. Ñ œ Ð+" ß +# ß á ß +." Ñ ã ã ã ã ã .
C"." C#." â
Counting Points in a d-simplex
In the obvious short-hand form, we can write
B† œ + ] . Here B† is an augmented version of B.
The vertices of the simplex must be linearly
independent for the simplex to be non-
degenerate. Then ] " must exist. Thus we
have + œ B† ] " Þ Given the point B to be tested
and the vertices of the simplex, it is a simple
matter to compute +. We have an algorithm for
testing B.
Counting Points in a d-simplex
Theorem: Compute + œ B† ] " . A point B is
interior to the simplex if and only if all +4 !. The
point B is exterior to the simplex if and only if one
or more of the +4 !Þ A point B is on the
boundary if and only if all +4 ! with at least one
equality.
Thus we may recursively test each observation to
discover to which simplex it belongs and tally the
total count for each simplex in the tessellation.
Data Adaptive Density
Estimation
Let 8 be the number of observations, let 83 be the
number of observations in the simplex X3 , and
finally let lX3 l be the content of simplex X3 , then the
adaptive density 0 is given by
n\d 2 2 3 3 4 4 6 6
16
14
LOG OF NUM B E R OF TILE S
12
10
2
7
6 8
5 7
6
4 5
3 4
3
2 2
DIMENSION
LOG OF NO OF POINTS
Simulation Results
Are these results reasonable?