Sublinear time algorithms for earth mover's distance
KD Ba, HL Nguyen, HN Nguyen… - Theory of Computing …, 2011 - Springer
KD Ba, HL Nguyen, HN Nguyen, R Rubinfeld
Theory of Computing Systems, 2011•SpringerWe study the problem of estimating the Earth Mover's Distance (EMD) between probability
distributions when given access only to samples of the distribution. We give closeness
testers and additive-error estimators over domains in [0, 1] d, with sample complexities
independent of domain size—permitting the testability even of continuous distributions over
infinite domains. Instead, our algorithms depend on the dimension of the domain space and
the quality of the result required. We also prove lower bounds for closeness testing, showing …
distributions when given access only to samples of the distribution. We give closeness
testers and additive-error estimators over domains in [0, 1] d, with sample complexities
independent of domain size—permitting the testability even of continuous distributions over
infinite domains. Instead, our algorithms depend on the dimension of the domain space and
the quality of the result required. We also prove lower bounds for closeness testing, showing …
Abstract
We study the problem of estimating the Earth Mover’s Distance (EMD) between probability distributions when given access only to samples of the distribution. We give closeness testers and additive-error estimators over domains in [0,1] d , with sample complexities independent of domain size—permitting the testability even of continuous distributions over infinite domains. Instead, our algorithms depend on the dimension of the domain space and the quality of the result required. We also prove lower bounds for closeness testing, showing the dependencies on these parameters to be essentially optimal. Additionally, we consider whether natural classes of distributions exist for which there are algorithms with better dependence on the dimension, and show that for highly clusterable data, this is indeed the case. Lastly, we consider a variant of the EMD, defined over tree metrics instead of the usual ℓ 1 metric, and give tight upper and lower bounds.
Springer