DOI: 10.1145/3357713.3384290
Research Article · Open Access

Does Learning Require Memorization? A Short Tale about a Long Tail

Published: 22 June 2020

Abstract

State-of-the-art results on image recognition tasks are achieved using over-parameterized learning algorithms that (nearly) perfectly fit the training set and are known to fit even randomly labeled data well. This tendency to memorize seemingly useless training labels is not explained by existing theoretical analyses. Memorization of the training data also presents significant privacy risks when the data contains sensitive personal information, so it is important to understand whether such memorization is necessary for accurate learning.
We provide a simple conceptual explanation and a theoretical model demonstrating that, for natural data distributions, memorization of labels is necessary for achieving close-to-optimal generalization error. The model is motivated and supported by the results of several recent empirical works. In our model, data is sampled from a mixture of subpopulations whose frequencies are chosen from some prior. The model makes it possible to quantify the effect of not fitting the training data on the generalization performance of the learned classifier, and it demonstrates that memorization is necessary whenever the frequencies are long-tailed. Image and text data are known to follow such distributions, so our results establish a formal link between these empirical phenomena. Our results also have concrete implications for the cost of ensuring differential privacy in learning.
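To make the model concrete, here is a minimal simulation sketch in Python (not taken from the paper; the Zipf-style prior and the values of N and n are illustrative assumptions). It draws long-tailed subpopulation frequencies from a prior, samples a training set, and measures the fraction of training examples that are the sole representatives of their subpopulation. Under the paper's argument, a learner that does not fit (memorize) these singleton examples forfeits accuracy on the corresponding subpopulations, so this fraction tracks the cost of not memorizing.

```python
import numpy as np

# Illustrative sketch of the long-tail argument; N, n, and the
# Zipf-style prior are assumptions, not the paper's exact setup.
rng = np.random.default_rng(0)
N = 10_000  # number of subpopulations
n = 10_000  # training-set size

# Long-tailed prior over subpopulation frequencies (Zipf-like).
raw = 1.0 / np.arange(1, N + 1)
freqs = rng.permutation(raw / raw.sum())

# Draw a training set: each example belongs to one subpopulation.
sample = rng.choice(N, size=n, p=freqs)
counts = np.bincount(sample, minlength=N)

# Subpopulations seen exactly once ("singletons"): a learner that
# does not fit these unique examples cannot be accurate on their
# subpopulations, so their share of the data indicates the excess
# error incurred by not memorizing.
singleton_share = (counts == 1).sum() / n
print(f"training examples that are singletons: {singleton_share:.1%}")
```

Making the tail heavier (for instance, increasing N relative to n) raises the singleton share; that is precisely the long-tailed regime in which, per the abstract, memorization becomes necessary for close-to-optimal generalization.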


Published In

STOC 2020: Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing
June 2020
1429 pages
ISBN: 9781450369794
DOI: 10.1145/3357713

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Generalization
  2. Interpolation
  3. Long-tailed Distribution
  4. Overfitting
  5. Privacy-preserving Learning

Conference

STOC '20

Acceptance Rates

Overall Acceptance Rate: 1,469 of 4,586 submissions (32%)
