Nothing Special   »   [go: up one dir, main page]

skip to main content
10.5555/3174304.3175479acmconferencesArticle/Chapter ViewAbstractPublication PagessodaConference Proceedingsconference-collections
research-article

Which distribution distances are sublinearly testable?

Published: 07 January 2018 Publication History

Abstract

Given samples from an unknown distribution p and a description of a distribution q, are p and q close or far? This question of "identity testing" has received significant attention in the case of testing whether p and q are equal or far in total variation distance. However, in recent work [VV11a, ADK15, DP17], the following questions have been been critical to solving problems at the frontiers of distribution testing:
• Alternative Distances: Can we test whether p and q are far in other distances, say Hellinger?
• Tolerance: Can we test when p and q are close, rather than equal? And if so, close in which distances?
Motivated by these questions, we characterize the complexity of distribution testing under a variety of distances, including total variation, ℓ2, Hellinger, Kullback-Leibler, and χ2. For each pair of distances d1 and d2, we study the complexity of testing if p and q are close in d1 versus far in d2, with a focus on identifying which problems allow strongly sublinear testers (i.e., those with complexity O(n1−γ) for some γ > 0 where n is the size of the support of the distributions p and q). We provide matching upper and lower bounds for each case. We also study these questions in the case where we only have samples from q (equivalence testing), showing qualitative differences from identity testing in terms of when tolerance can be achieved. Our algorithms fall into the classical paradigm of χ2-statistics, but require crucial changes to handle the challenges introduced by each distance we consider. Finally, we survey other recent results in an attempt to serve as a reference for the complexity of various distribution testing problems.

References

[1]
{ADK15} Jayadev Acharya, Constantinos Daskalakis, and Gautam Kamath. Optimal testing for properties of distributions. In Advances in Neural Information Processing Systems 28, NIPS '15, pages 3577--3598. Curran Associates, Inc., 2015.
[2]
{ADOS17} Jayadev Acharya, Hirakendu Das, Alon Orlitsky, and Ananda Theertha Suresh. A unified maximum likelihood approach for estimating symmetric properties of discrete distributions. In Proceedings of the 34th International Conference on Machine Learning, ICML '17, pages 11--21. JMLR, Inc., 2017.
[3]
{AOST17} Jayadev Acharya, Alon Orlitsky, Ananda Theertha Suresh, and Himanshu Tyagi. Estimating rényi entropy of discrete distributions. IEEE Transactions on Information Theory, 63(1):38--56, 2017.
[4]
{BCG17} Eric Blais, Clément L. Canonne, and Tom Gur. Distribution testing lower bounds via reductions from communication complexity. In Proceedings of the 32nd Computational Complexity Conference, CCC '17, pages 28:1--28:40, Dagstuhl, Germany, 2017. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.
[5]
{BFF+ 01} Tuğkan Batu, Eldar Fischer, Lance Fortnow, Ravi Kumar, Ronitt Rubinfeld, and Patrick White. Testing random variables for independence and identity. In Proceedings of the 42nd Annual IEEE Symposium on Foundations of Computer Science, FOCS '01, pages 442--451, Washington, DC, USA, 2001. IEEE Computer Society.
[6]
{BFR+ 00} Tuğkan Batu, Lance Fortnow, Ronitt Rubinfeld, Warren D. Smith, and Patrick White. Testing that distributions are close. In Proceedings of the 41st Annual IEEE Symposium on Foundations of Computer Science, FOCS '00, pages 259--269, Washington, DC, USA, 2000. IEEE Computer Society.
[7]
{BFR+ 13} Tuğkan Batu, Lance Fortnow, Ronitt Rubinfeld, Warren D. Smith, and Patrick White. Testing closeness of discrete distributions. Journal of the ACM, 60(1):4:1--4:25, 2013.
[8]
{BKR04} Tuğkan Batu, Ravi Kumar, and Ronitt Rubinfeld. Sublinear algorithms for testing monotone and unimodal distributions. In Proceedings of the 36th Annual ACM Symposium on the Theory of Computing, STOC '04, New York, NY, USA, 2004. ACM.
[9]
{BV15} Bhaswar Bhattacharya and Gregory Valiant. Testing closeness with unequal sized samples. In Advances in Neural Information Processing Systems 28, NIPS '15, pages 2611--2619. Curran Associates, Inc., 2015.
[10]
{Can15} Clément L. Canonne. A survey on distribution testing: Your data is big. but is it blue? Electronic Colloquium on Computational Complexity (ECCC), 22(63), 2015.
[11]
{CDGR16} Clément L. Canonne, Ilias Diakonikolas, Themis Gouleakis, and Ronitt Rubinfeld. Testing shape restrictions of discrete distributions. In Proceedings of the 33rd Symposium on Theoretical Aspects of Computer Science, STACS '16, pages 25:1--25:14, Dagstuhl, Germany, 2016. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.
[12]
{CDKS17} Clément L. Canonne, Ilias Diakonikolas, Daniel M. Kane, and Alistair Stewart. Testing Bayesian networks. In Proceedings of the 30th Annual Conference on Learning Theory, COLT '17, pages 370--448, 2017.
[13]
{CDVV14} Siu-On Chan, Ilias Diakonikolas, Gregory Valiant, and Paul Valiant. Optimal algorithms for testing closeness of discrete distributions. In Proceedings of the 25th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '14, pages 1193--1203, Philadelphia, PA, USA, 2014. SIAM.
[14]
{DBNNR11} Khanh Do Ba, Huy L. Nguyen, Huy N. Nguyen, and Ronitt Rubinfeld. Sublinear time algorithms for earth movers distance. Theory of Computing Systems, 48(2):428--442, 2011.
[15]
{DDK18} Constantinos Daskalakis, Nishanth Dikkala, and Gautam Kamath. Testing Ising models. In Proceedings of the 29th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '18, Philadelphia, PA, USA, 2018. SIAM.
[16]
{DGPP16} Ilias Diakonikolas, Themis Gouleakis, John Peebles, and Eric Price. Collision-based testers are optimal for uniformity and closeness. arXiv preprint arXiv:1611.03579, 2016.
[17]
{DK16} Ilias Diakonikolas and Daniel M. Kane. A new approach for testing properties of discrete distributions. In Proceedings of the 57th Annual IEEE Symposium on Foundations of Computer Science, FOCS '16, pages 685--694, Washington, DC, USA, 2016. IEEE Computer Society.
[18]
{DKN15} Ilias Diakonikolas, Daniel M. Kane, and Vladimir Nikishkin. Testing identity of structured distributions. In Proceedings of the 26th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '15, pages 1841--1854, Philadelphia, PA, USA, 2015. SIAM.
[19]
{DP17} Constantinos Daskalakis and Qinxuan Pan. Square Hellinger subadditivity for Bayesian networks and its applications to identity testing. In Proceedings of the 30th Annual Conference on Learning Theory, COLT '17, pages 697--703, 2017.
[20]
{Gol16} Oded Goldreich. The uniform distribution is complete with respect to testing identity to a fixed distribution. Electronic Colloquium on Computational Complexity (ECCC), 23(15), 2016.
[21]
{GR00} Oded Goldreich and Dana Ron. On testing expansion in bounded-degree graphs. Electronic Colloquium on Computational Complexity (ECCC), 7(20), 2000.
[22]
{GS02} Alison L. Gibbs and Francis E. Su. On choosing and bounding probability metrics. International Statistical Review, 70(3):419--435, dec 2002.
[23]
{HJW16} Yanjun Han, Jiao Jiantao, and Tsachy Weissman. Minimax rate-optimal estimation of divergences between discrete distributions. arXiv preprint arXiv:1605.09124, 2016.
[24]
{JHW16} Jiantao Jiao, Yanjun Han, and Tsachy Weiss-man. Minimax estimation of the ℓ1 distance. In Proceedings of the 2016 IEEE International Symposium on Information Theory, ISIT '16, pages 750--754, Washington, DC, USA, 2016. IEEE Computer Society.
[25]
{JVHW17} Jiantao Jiao, Kartik Venkat, Yanjun Han, and Tsachy Weissman. Minimax estimation of functionals of discrete distributions. IEEE Transactions on Information Theory, 61(5):2835--2885, 2017.
[26]
{LC06} Erich Leo Lehmann and George Casella. Theory of Point Estimation. Springer, 2006.
[27]
{Pan08} Liam Paninski. A coincidence-based test for uniformity given very sparsely sampled discrete data. IEEE Transactions on Information Theory, 54(10):4750--4755, 2008.
[28]
{Val11} Paul Valiant. Testing symmetric properties of distributions. SIAM Journal on Computing, 40(6):1927--1968, 2011.
[29]
{VV10a} Gregory Valiant and Paul Valiant. A CLT and tight lower bounds for estimating entropy. Electronic Colloquium on Computational Complexity (ECCC), 17(179), 2010.
[30]
{VV10b} Gregory Valiant and Paul Valiant. Estimating the unseen: A sublinear-sample canonical estimator of distributions. Electronic Colloquium on Computational Complexity (ECCC), 17(180), 2010.
[31]
{VV11a} Gregory Valiant and Paul Valiant. Estimating the unseen: An n/ log n-sample estimator for entropy and support size, shown optimal via new CLTs. In Proceedings of the 43rd Annual ACM Symposium on the Theory of Computing, STOC '11, pages 685--694, New York, NY, USA, 2011. ACM.
[32]
{VV11b} Gregory Valiant and Paul Valiant. The power of linear estimators. In Proceedings of the 52nd Annual IEEE Symposium on Foundations of Computer Science, FOCS '11, pages 403--412, Washington, DC, USA, 2011. IEEE Computer Society.
[33]
{VV17} Gregory Valiant and Paul Valiant. An automatic inequality prover and instance optimal identity testing. SIAM Journal on Computing, 46(1):429--455, 2017.
[34]
{Wag15} Bo Waggoner. lp testing and learning of discrete distributions. In Proceedings of the 6th Conference on Innovations in Theoretical Computer Science, ITCS '15, pages 347--356, New York, NY, USA, 2015. ACM.
[35]
{WY16} Yihong Wu and Pengkun Yang. Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory, 62(6):3702--3720, 2016.

Cited By

View all
  • (2022)Independence testing for bounded degree Bayesian networksProceedings of the 36th International Conference on Neural Information Processing Systems10.5555/3600270.3601363(15027-15038)Online publication date: 28-Nov-2022
  • (2021)Random restrictions of high dimensional distributions and uniformity testing with subcube conditioningProceedings of the Thirty-Second Annual ACM-SIAM Symposium on Discrete Algorithms10.5555/3458064.3458085(321-336)Online publication date: 10-Jan-2021
  • (2019)Minimax optimal estimation of approximate differential privacy on neighboring databasesProceedings of the 33rd International Conference on Neural Information Processing Systems10.5555/3454287.3454504(2417-2428)Online publication date: 8-Dec-2019
  • Show More Cited By
  1. Which distribution distances are sublinearly testable?

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    SODA '18: Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms
    January 2018
    2859 pages
    ISBN:9781611975031
    • Program Chair:
    • Artur Czumaj

    Sponsors

    Publisher

    Society for Industrial and Applied Mathematics

    United States

    Publication History

    Published: 07 January 2018

    Check for updates

    Qualifiers

    • Research-article

    Conference

    SODA '18
    Sponsor:
    SODA '18: Symposium on Discrete Algorithms
    January 7 - 10, 2018
    Louisiana, New Orleans

    Acceptance Rates

    Overall Acceptance Rate 411 of 1,322 submissions, 31%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)3
    • Downloads (Last 6 weeks)0
    Reflects downloads up to 26 Sep 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2022)Independence testing for bounded degree Bayesian networksProceedings of the 36th International Conference on Neural Information Processing Systems10.5555/3600270.3601363(15027-15038)Online publication date: 28-Nov-2022
    • (2021)Random restrictions of high dimensional distributions and uniformity testing with subcube conditioningProceedings of the Thirty-Second Annual ACM-SIAM Symposium on Discrete Algorithms10.5555/3458064.3458085(321-336)Online publication date: 10-Jan-2021
    • (2019)Minimax optimal estimation of approximate differential privacy on neighboring databasesProceedings of the 33rd International Conference on Neural Information Processing Systems10.5555/3454287.3454504(2417-2428)Online publication date: 8-Dec-2019
    • (2019)AnacondaProceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms10.5555/3310435.3310478(679-693)Online publication date: 6-Jan-2019
    • (2018)Learning and testing causal models with interventionsProceedings of the 32nd International Conference on Neural Information Processing Systems10.5555/3327546.3327616(9469-9481)Online publication date: 3-Dec-2018

    View Options

    Get Access

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media