Nothing Special   »   [go: up one dir, main page]

skip to main content
10.5555/2969442.2969531guideproceedingsArticle/Chapter ViewAbstractPublication PagesnipsConference Proceedingsconference-collections
Article

Testing closeness with unequal sized samples

Published: 07 December 2015 Publication History

Abstract

We consider the problem of testing whether two unequal-sized samples were drawn from identical distributions, versus distributions that differ significantly. Specifically, given a target error parameter ε > 0, m1 independent draws from an unknown distribution p with discrete support, and m2 draws from an unknown distribution q of discrete support, we describe a test for distinguishing the case that p = q from the case that ‖p - q1 ≥ ε. If p and q are supported on at most n elements, then our test is successful with high probability provided m1n2/34/3 and m2 = Ω (max{n/√m1 ε2, √n2}). We show that this tradeoff is information theoretically optimal throughout this range in the dependencies on all parameters, n, m1, and ε, to constant factors for worst-case distributions. As a consequence, we obtain an algorithm for estimating the mixing time of a Markov chain on n states up to a log n factor that uses Õ(n3/2τmix) queries to a "next node" oracle. The core of our testing algorithm is a relatively simple statistic that seems to perform well in practice, both on synthetic and on natural language data. We believe that this statistic might prove to be a useful primitive within larger machine learning and natural language processing systems.

References

[1]
J. Acharya, H. Das, A. Jafarpour, A. Orlitsky, and S. Pan, Competitive closeness testing, COLT, 2011.
[2]
J. Acharya, H. Das, A. Jafarpour, A. Orlitsky, and S. Pan, Competitive classification and closeness testing. COLT, 2012.
[3]
J. Acharya, A. Jafarpour, A. Orlitsky, and A. T. Suresh, Sublinear algorithms for outlier detection and generalized closeness testing, ISIT, 3200-3204, 2014.
[4]
J. Acharya, C. Daskalakis, and G. Kamath, Optimal testing for properties of distributions, NIPS, 2015.
[5]
Z. Bar-Yossef, R. Kumar, and D. Sivakumar. Sampling algorithms: lower bounds and applications, STOC, 2001.
[6]
T. Batu, L. Fortnow, R. Rubinfeld, W. D. Smith, and P. White, Testing that distributions are close, FOCS, 2000.
[7]
T. Batu, S. Dasgupta, R. Kumar, and R. Rubinfeld, The complexity of approximating the entropy, SIAM Journal on Computing, 2005.
[8]
T. Batu, E. Fischer, L. Fortnow, R. Kumar, R. Rubinfeld, and P. White, Testing random variables for independence and identity, FOCS, 2001.
[9]
S.-on Chan, I. Diakonikolas, P. Valiant, G. Valiant, Optimal Algorithms for Testing Closeness of Discrete Distributions, Symposium on Discrete Algorithms (SODA), 1193-1203, 2014,
[10]
M. Charikar, S. Chaudhuri, R. Motwani, and V.R. Narasayya, Towards estimation error guarantees for distinct values, Symposium on Principles of Database Systems (PODS), 2000.
[11]
A. Czumaj and C. Sohler, Testing expansion in bounded-degree graphs, FOCS, 2007.
[12]
O. Goldreich and D. Ron, On testing expansion in bounded-degree graphs, ECCC, TR00-020, 2000.
[13]
S. Guha, A. McGregor, and S. Venkatasubramanian, Streaming and sublinear approximation of entropy and information distances, Symposium on Discrete Algorithms (SODA), 2006.
[14]
D. Hsu, A. Kontorovich, and C. Szepesvári, Mixing time estimation in reversible Markov chains from a single sample path, NIPS, 2015.
[15]
S. Kale and C. Seshadhri, An expansion tester for bounded degree graphs, ICALP, LNCS, Vol. 5125, 527-538, 2008.
[16]
A. Keinan and A. G. Clark. Recent explosive human population growth has resulted in an excess of rare genetic variants. Science, 336(6082):740743, 2012.
[17]
D. A. Levin, Y. Peres, and E. L. Wilmer, Markov Chains and Mixing Times, Amer. Math. Soc., 2009.
[18]
A. Nachmias and A. Shapira, Testing the expansion of a graph, Electronic Colloquium on Computational Complexity (ECCC), Vol. 14 (118), 2007.
[19]
M. R. Nelson and D. Wegmann et al., An abundance of rare functional variants in 202 drug target genes sequenced in 14,002 people. Science, 337(6090):100104, 2012.
[20]
L. Paninski, Estimation of entropy and mutual information, Neural Comp., Vol. 15 (6), 1191-1253, 2003.
[21]
L. Paninski, Estimating entropy on m bins given fewer than m samples, IEEE Transactions on Information Theory, Vol. 50 (9), 2200-2203, 2004.
[22]
L. Paninski, A coincidence-based test for uniformity given very sparsely-sampled discrete data, IEEE Transactions on Information Theory, Vol. 54, 4750-4755, 2008.
[23]
S. Raskhodnikova, D. Ron, A. Shpilka, and A. Smith, Strong lower bounds for approximating distribution support size and the distinct elements problem, SIAM Journal on Computing, Vol. 39(3), 813-842, 2009.
[24]
R. Rubinfeld, Taming big probability distributions, XRDS, Vol. 19(1), 24-28, 2012.
[25]
A. Sinclair and M. Jerrum, Approximate counting, uniform generation and rapidly mixing Markov chains, Information and Computation, Vol. 82(1), 93-133, 1989.
[26]
J. A. Tennessen, A.W. Bigham, and T.D. O'Connor et al. Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science, 337(6090):6469, 2012
[27]
G. Valiant and P. Valiant, Estimating the unseen: an n/ log n-sample estimator for entropy and support size, shown optimal via new CLTs, STOC, 2011.
[28]
G. Valiant and P. Valiant, Estimating the unseen: improved estimators for entropy and other properties, NIPS, 2013.
[29]
G. Valiant and P. Valiant, An Automatic Inequality Prover and Instance Optimal Identity Testing, FOCS, 51-60, 2014.
[30]
P. Valiant, Testing symmetric properties of distributions, STOC, 2008.
[31]
P. Valiant, Testing Symmetric Properties of Distributions, PhD thesis, M.I.T., 2008.

Cited By

View all
  • (2021)Random restrictions of high dimensional distributions and uniformity testing with subcube conditioningProceedings of the Thirty-Second Annual ACM-SIAM Symposium on Discrete Algorithms10.5555/3458064.3458085(321-336)Online publication date: 10-Jan-2021
  • (2021)Two-sample Testing of Discrete Distributions under Rare/Weak Perturbations2021 IEEE International Symposium on Information Theory (ISIT)10.1109/ISIT45174.2021.9517991(3314-3319)Online publication date: 12-Jul-2021
  • (2019)AnacondaProceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms10.5555/3310435.3310478(679-693)Online publication date: 6-Jan-2019
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Guide Proceedings
NIPS'15: Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2
December 2015
3626 pages

Publisher

MIT Press

Cambridge, MA, United States

Publication History

Published: 07 December 2015

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 16 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2021)Random restrictions of high dimensional distributions and uniformity testing with subcube conditioningProceedings of the Thirty-Second Annual ACM-SIAM Symposium on Discrete Algorithms10.5555/3458064.3458085(321-336)Online publication date: 10-Jan-2021
  • (2021)Two-sample Testing of Discrete Distributions under Rare/Weak Perturbations2021 IEEE International Symposium on Information Theory (ISIT)10.1109/ISIT45174.2021.9517991(3314-3319)Online publication date: 12-Jul-2021
  • (2019)AnacondaProceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms10.5555/3310435.3310478(679-693)Online publication date: 6-Jan-2019
  • (2019)Distribution Testing Lower Bounds via Reductions from Communication ComplexityACM Transactions on Computation Theory10.1145/330527011:2(1-37)Online publication date: 11-Feb-2019
  • (2018)Differentially private testing of identity and closeness of discrete distributionsProceedings of the 32nd International Conference on Neural Information Processing Systems10.5555/3327757.3327792(6879-6891)Online publication date: 3-Dec-2018
  • (2018)Which distribution distances are sublinearly testable?Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms10.5555/3174304.3175479(2747-2764)Online publication date: 7-Jan-2018
  • (2017)Distribution testing lower bounds via reductions from communication complexityProceedings of the 32nd Computational Complexity Conference10.5555/3135595.3135623(1-40)Online publication date: 9-Jul-2017

View Options

View options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media