Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3375395.3387643acmconferencesArticle/Chapter ViewAbstractPublication PagespodsConference Proceedingsconference-collections
research-article

The Adversarial Robustness of Sampling

Published: 14 June 2020 Publication History

Abstract

Random sampling is a fundamental primitive in modern algorithms, statistics, and machine learning, used as a generic method to obtain a small yet "representative" subset of the data. In this work, we investigate the robustness of sampling against adaptive adversarial attacks in a streaming setting: An adversary sends a stream of elements from a universe U to a sampling algorithm (e.g., Bernoulli sampling or reservoir sampling), with the goal of making the sample "very unrepresentative" of the underlying data stream. The adversary is fully adaptive in the sense that it knows the exact content of the sample at any given point along the stream, and can choose which element to send next accordingly, in an online manner. Well-known results in the static setting indicate that if the full stream is chosen in advance (non-adaptively), then a random sample of size Ω(d/ε2) is an ε-approximation of the full data with good probability, where d is the VC-dimension of the underlying set system (U, R). Does this sample size suffice for robustness against an adaptive adversary? The simplistic answer is negative : We demonstrate a set system where a constant sample size (corresponding to a VC-dimension of 1) suffices in the static setting, yet an adaptive adversary can make the sample very unrepresentative, as long as the sample size is (strongly) sublinear in the stream length, using a simple and easy-to-implement attack. However, this attack is "theoretical only", requiring the set system size to (essentially) be exponential in the stream length. This is not a coincidence: We show that in order to make the sampling algorithm robust against adaptive adversaries, the modification required is solely to replace the VC-dimension term d in the sample size with the cardinality term log |R|. That is, the Bernoulli and reservoir sampling algorithms with sample size Ω(log |R|/ε2) output a representative sample of the stream with good probability, even in the presence of an adaptive adversary. This nearly matches the bound imposed by the attack.

References

[1]
Noga Alon and Joel H. Spencer. 2016. The Probabilistic Method 4th ed.). Wiley Publishing.
[2]
Amitabha Bagchi, Amitabh Chaudhary, David Eppstein, and Michael T. Goodrich. 2007. Deterministic sampling and range counting in geometric data streams. ACM Transactions on Algorithms, Vol. 3, 2 (2007), 16.
[3]
Omri Ben-Eliezer, Rajesh Jayaram, David P. Woodruff, and Eylon Yogev. 2020. A Framework for Adversarially Robust Streaming Algorithms. In Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS). To appear.
[4]
Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Srndic, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. 2013. Evasion Attacks against Machine Learning at Test Time. In Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD. 387--402.
[5]
Battista Biggio and Fabio Roli. 2018. Wild patterns: Ten years after the rise of adversarial machine learning. Pattern Recognition, Vol. 84 (2018), 317--331.
[6]
Vladimir Braverman, Rafail Ostrovsky, and Gregory Vorsanger. 2015. Weighted sampling without replacement from data streams. Inform. Process. Lett., Vol. 115, 12 (2015), 923--926.
[7]
Bernard Chazelle. 2001. The discrepancy method - randomness and complexity .Cambridge University Press.
[8]
Herman Chernoff. 1952. A Measure of Asymptotic Efficiency for Tests of a Hypothesis Based on the sum of Observations. The Annals of Mathematical Statistics, Vol. 23, 4 (1952), 493--507.
[9]
Timothy Chu, Yu Gao, Richard Peng, Sushant Sachdeva, Saurabh Sawlani, and Junxing Wang. 2018. Graph Sparsification, Spectral Sketches, and Faster Resistance Computation, via Short Cycle Decompositions. In 59th IEEE Annual Symposium on Foundations of Computer Science, FOCS. 361--372.
[10]
Fan Chung and Linyuan Lu. 2006. Concentration inequalities and martingale inequalities: a survey. Internet Mathematics, Vol. 3, 1 (2006), 79--127.
[11]
Yung-Yu Chung, Srikanta Tirthapura, and David P. Woodruff. 2016. A Simple Message-Optimal Algorithm for Random Sampling from a Distributed Stream. IEEE Transactions on Knowledge and Data Engineering, Vol. 28 (2016), 1356--1368.
[12]
Kenneth L. Clarkson, David Eppstein, Gary L. Miller, Carl Sturtivant, and Shang-Hua Teng. 1996. Approximating center points with iterative Radon points. International Journal of Computational Geometry and Applications, Vol. 6, 3 (1996), 357--377.
[13]
Edith Cohen, Nick G. Duffield, Haim Kaplan, Carsten Lund, and Mikkel Thorup. 2011. Efficient Stream Sampling for Variance-Optimal Estimation of Subset Sums. SIAM J. Comput., Vol. 40, 5 (2011), 1402--1431.
[14]
Graham Cormode and Minos N. Garofalakis. 2005. Sketching Streams Through the Net: Distributed Approximate Query Tracking. In Proceedings of the 31st International Conference on Very Large Data Bases, VLDB. 13--24.
[15]
Graham Cormode, S. Muthukrishnan, and Ke Yi. 2011. Algorithms for distributed functional monitoring. ACM Transactions on Algorithms, Vol. 7, 2 (2011), 21:1--21:20.
[16]
Graham Cormode, S. Muthukrishnan, Ke Yi, and Qin Zhang. 2012. Continuous sampling from distributed streams. J. ACM, Vol. 59 (2012), 10:1--10:25.
[17]
Charles D. Cranor, Theodore Johnson, Oliver Spatscheck, and Vladislav Shkapenyuk. 2003. The Gigascope Stream Database. IEEE Data Engineering Bulletin, Vol. 26, 1 (2003), 27--32.
[18]
Daniel Cullina, Arjun Nitin Bhagoji, and Prateek Mittal. 2018. PAC-learning in the Presence of Evasion Adversaries. In Proceedings of the 32Nd International Conference on Neural Information Processing Systems (NIPS). 228--239.
[19]
Amit Daniely, Alon Gonen, and Shai Shalev-Shwartz. 2015. Strongly Adaptive Online Learning. In Proceedings of the 32nd International Conference on Machine Learning, ICML. 1405--1411.
[20]
Nick G. Duffield, Carsten Lund, and Mikkel Thorup. 2005. Estimating flow distributions from sampled flow statistics. IEEE/ACM Transactions on Networking, Vol. 13, 5 (2005), 933--946.
[21]
Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. 2015. The reusable holdout: Preserving validity in adaptive data analysis. Science, Vol. 349, 6248 (2015), 636--638.
[22]
Pavlos S. Efraimidis and Paul G. Spirakis. 2006. Weighted random sampling with a reservoir. Inform. Process. Lett., Vol. 97, 5 (2006), 181--185.
[23]
David A. Freedman. 1975. On Tail Probabilities for Martingales. The Annals of Probability, Vol. 3, 1 (1975), 100--118.
[24]
Mohammed Ghesmoune, Mustapha Lebbah, and Hanene Azzag. 2016. State-of-the-art on clustering data streams. Big Data Analytics, Vol. 1, 1 (2016), 13.
[25]
Anna C. Gilbert, Brett Hemenway, Atri Rudra, Martin J. Strauss, and Mary Wootters. 2012a. Recovering simple signals. In Information Theory and Applications Workshop, ITA. 382--391.
[26]
Anna C. Gilbert, Brett Hemenway, Martin J. Strauss, David P. Woodruff, and Mary Wootters. 2012b. Reusable low-error compressive sampling schemes through privacy. In IEEE Statistical Signal Processing Workshop, SSP. 536--539.
[27]
Ian J. Goodfellow, Patrick D. McDaniel, and Nicolas Papernot. 2018. Making machine learning robust against adversarial inputs. Commun. ACM, Vol. 61, 7 (2018), 56--66.
[28]
Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2017. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv preprint arXiv:1706.02677 (2017).
[29]
Michael Greenwald and Sanjeev Khanna. 2001. Space-efficient Online Computation of Quantile Summaries. SIGMOD Record, Vol. 30, 2 (2001), 58--66.
[30]
Michael B. Greenwald and Sanjeev Khanna. 2016. Quantiles and Equi-depth Histograms over Streams. In Data Stream Management: Processing High-Speed Data Streams, Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi (Eds.). Springer Berlin Heidelberg, 45--86.
[31]
Moritz Hardt and David P. Woodruff. 2013. How robust are linear sketches to adaptive inputs?. In Symposium on Theory of Computing Conference. 121--130.
[32]
Elad Hazan. 2016. Introduction to Online Convex Optimization. Foundations and Trends in Optimization, Vol. 2, 3--4 (2016), 157--325.
[33]
Elad Hazan, Amit Agarwal, and Satyen Kale. 2007. Logarithmic regret algorithms for online convex optimization. Machine Learning, Vol. 69, 2--3 (2007), 169--192.
[34]
Rajesh Jayaram, Gokarna Sharma, Srikanta Tirthapura, and David P. Woodruff. 2019. Weighted Reservoir Sampling from Distributed Streams. In Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS). 218--235.
[35]
Chris Jermaine, Abhijit Pol, and Subramanian Arumugam. 2004. Online Maintenance of Very Large Random Samples. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 299--310.
[36]
Theodore Johnson, S. Muthukrishnan, and Irina Rozenbaum. 2005. Sampling Algorithms in a Stream Operator. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 1--12.
[37]
Zohar Karnin, Kevin Lang, and Edo Liberty. 2016. Optimal Quantile Approximation in Streams. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS). 71--78.
[38]
Donald E. Knuth. 1997. The Art of Computer Programming, Volume 2 (3rd Ed.): Seminumerical Algorithms .Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.
[39]
Yi Li, Philip M. Long, and Aravind Srinivasan. 2001. Improved Bounds on the Sample Complexity of Learning. J. Comput. System Sci., Vol. 62, 3 (2001), 516--527.
[40]
Richard J. Lipton and Jeffrey F. Naughton. 1993. Clocked Adversaries for Hashing. Algorithmica, Vol. 9, 3 (1993), 239--252.
[41]
Thodoris Lykouris, Vahab Mirrokni, and Renato Paes Leme. 2018. Stochastic Bandits Robust to Adversarial Corruptions. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing (STOC 2018). 114--122.
[42]
Saeed Mahloujifar and Mohammad Mahmoody. 2019. Can Adversarially Robust Learning Leverage Computational Hardness?. In Algorithmic Learning Theory, ALT. 581--609.
[43]
Gurmeet Singh Manku, Sridhar Rajagopalan, and Bruce G. Lindsay. 1999. Random Sampling Techniques for Space Efficient Online Computation of Order Statistics of Large Datasets. SIGMOD Record, Vol. 28, 2 (1999), 251--262.
[44]
Michael McCoyd and David A. Wagner. 2018. Background Class Defense Against Adversarial Examples. In 2018 IEEE Security and Privacy Workshops, SP Workshops. 96--102.
[45]
Colin McDiarmid. 1998. Concentration. In Probabilistic Methods for Algorithmic Discrete Mathematics, Michel Habib, Colin McDiarmid, Jorge Ramirez-Alfonsin, and Bruce Reed (Eds.). Springer Berlin Heidelberg, 195--248.
[46]
Ilya Mironov, Moni Naor, and Gil Segev. 2011. Sketching in Adversarial Environments. SIAM J. Comput., Vol. 40, 6 (2011), 1845--1870.
[47]
Omar Montasser, Steve Hanneke, and Nathan Srebro. 2019. VC Classes are Adversarially Robustly Learnable, but Only Improperly. In Proceedings of the Thirty-Second Conference on Learning Theory, COLT (Proceedings of Machine Learning Research), Alina Beygelzimer and Daniel Hsu (Eds.), Vol. 99. 2512--2530.
[48]
Nabil H. Mustafa and Kasturi R. Varadarajan. 2017. Epsilon-approximations and epsilon-nets. In Handbook of Discrete and Computational Geometry, 3rd Edition, Csaba D. Toth, Joseph O'Rourke, and Jacob E. Goodman (Eds.). Chapman and Hall/CRC, New York, Chapter 47, 27.
[49]
Moni Naor and Eylon Yogev. 2015. Bloom Filters in Adversarial Environments. In Advances in Cryptology - CRYPTO 2015 - 35th Annual Cryptology Conference. 565--584.
[50]
M. Tamer Özsu and Patrick Valduriez. 2011. Principles of Distributed Database Systems 3rd ed.). Springer Publishing Company, Incorporated.
[51]
Nicolas Papernot, Patrick D. McDaniel, Ian J. Goodfellow, Somesh Jha, Z. Berkay Celik, and Ananthram Swami. 2017. Practical Black-Box Attacks against Machine Learning. In ACM Asia Conference on Computer and Communications Security, AsiaCCS. 506--519.
[52]
Norbert Sauer. 1972. On the density of families of sets. Journal of Combinatorial Theory, Series A, Vol. 13, 1 (1972), 145 -- 147.
[53]
Shai Shalev-Shwartz. 2012. Online Learning and Online Convex Optimization. Foundations and Trends in Machine Learning, Vol. 4, 2 (2012), 107--194.
[54]
Ohad Shamir and Liran Szlak. 2017. Online Learning with Local Permutations and Delayed Feedback. In Proceedings of the 34th International Conference on Machine Learning, ICML. 3086--3094.
[55]
Saharon Shelah. 1972. A combinatorial problem; stability and order for models and theories in infinitary languages. Pacific J. Math., Vol. 41, 1 (1972), 247--261.
[56]
Chawin Sitawarin, Arjun Nitin Bhagoji, Arsalan Mosenia, Mung Chiang, and Prateek Mittal. 2018. DARTS: Deceiving Autonomous Cars with Toxic Signs. CoRR, Vol. abs/1802.06430 (2018).
[57]
Samuel L. Smith, Pieter-Jan Kindermans, Chris Ying, and Quoc V. Le. 2017. Don't Decay the Learning Rate, Increase the Batch Size. http://arxiv.org/abs/1711.00489 Published as a conference paper at ICLR 2018.
[58]
Nathan Srebro, Karthik Sridharan, and Ambuj Tewari. 2010. Smoothness, Low Noise and Fast Rates. In Advances in Neural Information Processing Systems 23: Annual Conference on Neural Information Processing Systems, NIPS. 2199--2207.
[59]
Subhash Suri, Csaba D. Tóth, and Yunhong Zhou. 2004. Range Counting over Multidimensional Data Streams. In Proceedings of the Twentieth Annual Symposium on Computational Geometry (SCG '04). 160--169.
[60]
Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. 2014. Intriguing properties of neural networks. In 2nd International Conference on Learning Representations, ICLR.
[61]
Michel Talagrand. 1994. Sharper bounds for Gaussian and empirical processes. The Annals of Probability, Vol. 22, 1 (1994), 28--76.
[62]
L. G. Valiant. 1984. A Theory of the Learnable. Commun. ACM, Vol. 27, 11 (1984), 1134--1142.
[63]
V. Vapnik and A. Chervonenkis. 1971. On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities. Theory of Probability & Its Applications, Vol. 16, 2 (1971), 264--280.
[64]
Jeffrey Scott Vitter. 1985. Random Sampling with a Reservoir. ACM Trans. Math. Software, Vol. 11, 1 (1985), 37--57.
[65]
Lu Wang, Ge Luo, Ke Yi, and Graham Cormode. 2013. Quantiles over Data Streams: An Experimental Study. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. 737--748.
[66]
Blake E. Woodworth, Vitaly Feldman, Saharon Rosset, and Nati Srebro. 2018. The Everlasting Database: Statistical Validity at a Fair Price. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems, NeurIPS. 6532--6541.
[67]
Lijun Zhang, Shiyin Lu, and Zhi-Hua Zhou. 2018. Adaptive Online Learning in Dynamic Environments. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems, NeurIPS. 1330--1340.

Cited By

View all
  • (2024)Universal Deoxidation of Semiconductor Substrates Assisted by Machine Learning and Real-Time Feedback ControlACS Applied Materials & Interfaces10.1021/acsami.4c0176516:14(18213-18221)Online publication date: 30-Mar-2024
  • (2024)A Framework for Adversarial Streaming Via Differential Privacy and Difference EstimatorsAlgorithmica10.1007/s00453-024-01259-8Online publication date: 31-Aug-2024
  • (2023)Adversarially robust distributed count tracking via partial differential privacyProceedings of the 37th International Conference on Neural Information Processing Systems10.5555/3666122.3669460(76370-76381)Online publication date: 10-Dec-2023
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
PODS'20: Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems
June 2020
480 pages
ISBN:9781450371087
DOI:10.1145/3375395
  • General Chair:
  • Dan Suciu,
  • Program Chair:
  • Yufei Tao,
  • Publications Chair:
  • Zhewei Wei
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 14 June 2020

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. ε-approximation
  2. adversarial model
  3. random sampling
  4. streaming

Qualifiers

  • Research-article

Funding Sources

  • European Union's Horizon 2020 research and innovation program

Conference

SIGMOD/PODS '20
Sponsor:

Acceptance Rates

Overall Acceptance Rate 642 of 2,707 submissions, 24%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)52
  • Downloads (Last 6 weeks)3
Reflects downloads up to 30 Sep 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Universal Deoxidation of Semiconductor Substrates Assisted by Machine Learning and Real-Time Feedback ControlACS Applied Materials & Interfaces10.1021/acsami.4c0176516:14(18213-18221)Online publication date: 30-Mar-2024
  • (2024)A Framework for Adversarial Streaming Via Differential Privacy and Difference EstimatorsAlgorithmica10.1007/s00453-024-01259-8Online publication date: 31-Aug-2024
  • (2023)Adversarially robust distributed count tracking via partial differential privacyProceedings of the 37th International Conference on Neural Information Processing Systems10.5555/3666122.3669460(76370-76381)Online publication date: 10-Dec-2023
  • (2023)Improved algorithms for white-box adversarial streamsProceedings of the 40th International Conference on Machine Learning10.5555/3618408.3618807(9962-9975)Online publication date: 23-Jul-2023
  • (2023)Practical Dynamic Extension for Sampling IndexesProceedings of the ACM on Management of Data10.1145/36267441:4(1-26)Online publication date: 12-Dec-2023
  • (2023)Coloring in Graph Streams via Deterministic and Adversarially Robust AlgorithmsProceedings of the 42nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems10.1145/3584372.3588681(141-153)Online publication date: 18-Jun-2023
  • (2023)Mirror games against an open book playerTheoretical Computer Science10.1016/j.tcs.2023.114159976:COnline publication date: 17-Oct-2023
  • (2022)Adversarially Robust Streaming Algorithms via Differential PrivacyJournal of the ACM10.1145/355697269:6(1-14)Online publication date: 24-Nov-2022
  • (2022)Uniform approximations for Randomized Hadamard Transforms with applicationsProceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing10.1145/3519935.3519961(659-671)Online publication date: 9-Jun-2022
  • (2022)The White-Box Adversarial Data Stream ModelProceedings of the 41st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems10.1145/3517804.3526228(15-27)Online publication date: 12-Jun-2022
  • Show More Cited By

View Options

Get Access

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media