Article

Testing closeness with unequal sized samples

Authors:

Bhaswar B. Bhattacharya,

Gregory ValiantAuthors Info & Claims

NIPS'15: Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2

Pages 2611 - 2619

Published: 07 December 2015 Publication History

Abstract

We consider the problem of testing whether two unequal-sized samples were drawn from identical distributions, versus distributions that differ significantly. Specifically, given a target error parameter ε > 0, m₁ independent draws from an unknown distribution p with discrete support, and m₂ draws from an unknown distribution q of discrete support, we describe a test for distinguishing the case that p = q from the case that ‖p - q‖₁ ≥ ε. If p and q are supported on at most n elements, then our test is successful with high probability provided m₁ ≥ n^2/3/ε^4/3 and m₂ = Ω (max{n/√m₁ ε², √n/ε²}). We show that this tradeoff is information theoretically optimal throughout this range in the dependencies on all parameters, n, m₁, and ε, to constant factors for worst-case distributions. As a consequence, we obtain an algorithm for estimating the mixing time of a Markov chain on n states up to a log n factor that uses Õ(n^3/2τ_mix) queries to a "next node" oracle. The core of our testing algorithm is a relatively simple statistic that seems to perform well in practice, both on synthetic and on natural language data. We believe that this statistic might prove to be a useful primitive within larger machine learning and natural language processing systems.

References

[1]

J. Acharya, H. Das, A. Jafarpour, A. Orlitsky, and S. Pan, Competitive closeness testing, COLT, 2011.

[2]

J. Acharya, H. Das, A. Jafarpour, A. Orlitsky, and S. Pan, Competitive classification and closeness testing. COLT, 2012.

[3]

J. Acharya, A. Jafarpour, A. Orlitsky, and A. T. Suresh, Sublinear algorithms for outlier detection and generalized closeness testing, ISIT, 3200-3204, 2014.

[4]

J. Acharya, C. Daskalakis, and G. Kamath, Optimal testing for properties of distributions, NIPS, 2015.

[5]

Z. Bar-Yossef, R. Kumar, and D. Sivakumar. Sampling algorithms: lower bounds and applications, STOC, 2001.

[6]

T. Batu, L. Fortnow, R. Rubinfeld, W. D. Smith, and P. White, Testing that distributions are close, FOCS, 2000.

[7]

T. Batu, S. Dasgupta, R. Kumar, and R. Rubinfeld, The complexity of approximating the entropy, SIAM Journal on Computing, 2005.

[8]

T. Batu, E. Fischer, L. Fortnow, R. Kumar, R. Rubinfeld, and P. White, Testing random variables for independence and identity, FOCS, 2001.

[9]

S.-on Chan, I. Diakonikolas, P. Valiant, G. Valiant, Optimal Algorithms for Testing Closeness of Discrete Distributions, Symposium on Discrete Algorithms (SODA), 1193-1203, 2014,

[10]

M. Charikar, S. Chaudhuri, R. Motwani, and V.R. Narasayya, Towards estimation error guarantees for distinct values, Symposium on Principles of Database Systems (PODS), 2000.

[11]

A. Czumaj and C. Sohler, Testing expansion in bounded-degree graphs, FOCS, 2007.

[12]

O. Goldreich and D. Ron, On testing expansion in bounded-degree graphs, ECCC, TR00-020, 2000.

[13]

S. Guha, A. McGregor, and S. Venkatasubramanian, Streaming and sublinear approximation of entropy and information distances, Symposium on Discrete Algorithms (SODA), 2006.

[14]

D. Hsu, A. Kontorovich, and C. Szepesvári, Mixing time estimation in reversible Markov chains from a single sample path, NIPS, 2015.

[15]

S. Kale and C. Seshadhri, An expansion tester for bounded degree graphs, ICALP, LNCS, Vol. 5125, 527-538, 2008.

[16]

A. Keinan and A. G. Clark. Recent explosive human population growth has resulted in an excess of rare genetic variants. Science, 336(6082):740743, 2012.

[17]

D. A. Levin, Y. Peres, and E. L. Wilmer, Markov Chains and Mixing Times, Amer. Math. Soc., 2009.

[18]

A. Nachmias and A. Shapira, Testing the expansion of a graph, Electronic Colloquium on Computational Complexity (ECCC), Vol. 14 (118), 2007.

[19]

M. R. Nelson and D. Wegmann et al., An abundance of rare functional variants in 202 drug target genes sequenced in 14,002 people. Science, 337(6090):100104, 2012.

[20]

L. Paninski, Estimation of entropy and mutual information, Neural Comp., Vol. 15 (6), 1191-1253, 2003.

[21]

L. Paninski, Estimating entropy on m bins given fewer than m samples, IEEE Transactions on Information Theory, Vol. 50 (9), 2200-2203, 2004.

[22]

L. Paninski, A coincidence-based test for uniformity given very sparsely-sampled discrete data, IEEE Transactions on Information Theory, Vol. 54, 4750-4755, 2008.

[23]

S. Raskhodnikova, D. Ron, A. Shpilka, and A. Smith, Strong lower bounds for approximating distribution support size and the distinct elements problem, SIAM Journal on Computing, Vol. 39(3), 813-842, 2009.

[24]

R. Rubinfeld, Taming big probability distributions, XRDS, Vol. 19(1), 24-28, 2012.

[25]

A. Sinclair and M. Jerrum, Approximate counting, uniform generation and rapidly mixing Markov chains, Information and Computation, Vol. 82(1), 93-133, 1989.

[26]

J. A. Tennessen, A.W. Bigham, and T.D. O'Connor et al. Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science, 337(6090):6469, 2012

[27]

G. Valiant and P. Valiant, Estimating the unseen: an n/ log n-sample estimator for entropy and support size, shown optimal via new CLTs, STOC, 2011.

[28]

G. Valiant and P. Valiant, Estimating the unseen: improved estimators for entropy and other properties, NIPS, 2013.

[29]

G. Valiant and P. Valiant, An Automatic Inequality Prover and Instance Optimal Identity Testing, FOCS, 51-60, 2014.

[30]

P. Valiant, Testing symmetric properties of distributions, STOC, 2008.

[31]

P. Valiant, Testing Symmetric Properties of Distributions, PhD thesis, M.I.T., 2008.

Cited By

Canonne CChen XKamath GLevi AWaingarten EMarx D(2021)Random restrictions of high dimensional distributions and uniformity testing with subcube conditioningProceedings of the Thirty-Second Annual ACM-SIAM Symposium on Discrete Algorithms10.5555/3458064.3458085(321-336)Online publication date: 10-Jan-2021
https://dl.acm.org/doi/10.5555/3458064.3458085
Kipnis ADonoho D(2021)Two-sample Testing of Discrete Distributions under Rare/Weak Perturbations2021 IEEE International Symposium on Information Theory (ISIT)10.1109/ISIT45174.2021.9517991(3314-3319)Online publication date: 12-Jul-2021
https://dl.acm.org/doi/10.1109/ISIT45174.2021.9517991
Kamath GTzamos CChan T(2019)AnacondaProceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms10.5555/3310435.3310478(679-693)Online publication date: 6-Jan-2019
https://dl.acm.org/doi/10.5555/3310435.3310478
Show More Cited By

Recommendations

Optimal algorithms for testing closeness of discrete distributions
SODA '14: Proceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms

We study the question of closeness testing for two discrete distributions. More precisely, given samples from two distributions p and q over an n-element set, we wish to distinguish whether p = q versus p is at least ε-far from q, in either ℓ₁ or ℓ₂ ...
Testing Closeness of Discrete Distributions

Given samples from two distributions over an n-element set, we wish to test whether these distributions are statistically close. We present an algorithm which uses sublinear in n, specifically, O(n^2/3ε^−8/3 log n), independent samples from each ...
Testing Probability Distributions using Conditional Samples

We study a new framework for property testing of probability distributions, by considering distribution testing algorithms that have access to a conditional sampling oracle. This is an oracle that takes as input a subset $S \subseteq [N]$ of the domain $[N]$ ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Guide Proceedings

NIPS'15: Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2

December 2015

3626 pages

Publisher

MIT Press

Cambridge, MA, United States

Publication History

Published: 07 December 2015

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

7
Total Citations
View Citations
1
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 16 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Canonne CChen XKamath GLevi AWaingarten EMarx D(2021)Random restrictions of high dimensional distributions and uniformity testing with subcube conditioningProceedings of the Thirty-Second Annual ACM-SIAM Symposium on Discrete Algorithms10.5555/3458064.3458085(321-336)Online publication date: 10-Jan-2021
https://dl.acm.org/doi/10.5555/3458064.3458085
Kipnis ADonoho D(2021)Two-sample Testing of Discrete Distributions under Rare/Weak Perturbations2021 IEEE International Symposium on Information Theory (ISIT)10.1109/ISIT45174.2021.9517991(3314-3319)Online publication date: 12-Jul-2021
https://dl.acm.org/doi/10.1109/ISIT45174.2021.9517991
Kamath GTzamos CChan T(2019)AnacondaProceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms10.5555/3310435.3310478(679-693)Online publication date: 6-Jan-2019
https://dl.acm.org/doi/10.5555/3310435.3310478
Blais ECanonne CGur T(2019)Distribution Testing Lower Bounds via Reductions from Communication ComplexityACM Transactions on Computation Theory10.1145/330527011:2(1-37)Online publication date: 11-Feb-2019
https://dl.acm.org/doi/10.1145/3305270
Acharya JSun ZZhang H(2018)Differentially private testing of identity and closeness of discrete distributionsProceedings of the 32nd International Conference on Neural Information Processing Systems10.5555/3327757.3327792(6879-6891)Online publication date: 3-Dec-2018
https://dl.acm.org/doi/10.5555/3327757.3327792
Daskalakis CKamath GWright JCzumaj A(2018)Which distribution distances are sublinearly testable?Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms10.5555/3174304.3175479(2747-2764)Online publication date: 7-Jan-2018
https://dl.acm.org/doi/10.5555/3174304.3175479
Blais ECanonne CGur T(2017)Distribution testing lower bounds via reductions from communication complexityProceedings of the 32nd Computational Complexity Conference10.5555/3135595.3135623(1-40)Online publication date: 9-Jul-2017
https://dl.acm.org/doi/10.5555/3135595.3135623

View Options

View options

Media

Figures

Other

Tables

View Table of Contents