DOI: 10.1145/3357713.3384290
Research Article · Open Access

Does Learning Require Memorization? A Short Tale about a Long Tail

Published: 22 June 2020

Abstract

State-of-the-art results on image recognition tasks are achieved using over-parameterized learning algorithms that (nearly) perfectly fit the training set and are known to fit even randomly labeled data well. This tendency to memorize seemingly useless training labels is not explained by existing theoretical analyses. Memorization of the training data also presents significant privacy risks when the data contains sensitive personal information, so it is important to understand whether such memorization is necessary for accurate learning.
We provide a simple conceptual explanation and a theoretical model demonstrating that, for natural data distributions, memorization of labels is necessary for achieving close-to-optimal generalization error. The model is motivated and supported by the results of several recent empirical works. In our model, data is sampled from a mixture of subpopulations whose frequencies are chosen from some prior. The model makes it possible to quantify the effect of not fitting the training data on the generalization performance of the learned classifier, and it demonstrates that memorization is necessary whenever the frequencies are long-tailed. Image and text data are known to follow such distributions, so our results establish a formal link between these empirical phenomena. Our results also have concrete implications for the cost of ensuring differential privacy in learning.
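To make the model concrete, here is a minimal simulation sketch in Python (not taken from the paper; the Zipf-style prior and the values of N and n are illustrative assumptions). It draws long-tailed subpopulation frequencies from a prior, samples a training set, and measures the fraction of training examples that are the sole representatives of their subpopulation. Under the paper's argument, a learner that does not fit (memorize) these singleton examples forfeits accuracy on the corresponding subpopulations, so this fraction tracks the cost of not memorizing.

```python
import numpy as np

# Illustrative sketch of the long-tail argument; N, n, and the
# Zipf-style prior are assumptions, not the paper's exact setup.
rng = np.random.default_rng(0)
N = 10_000  # number of subpopulations
n = 10_000  # training-set size

# Long-tailed prior over subpopulation frequencies (Zipf-like).
raw = 1.0 / np.arange(1, N + 1)
freqs = rng.permutation(raw / raw.sum())

# Draw a training set: each example belongs to one subpopulation.
sample = rng.choice(N, size=n, p=freqs)
counts = np.bincount(sample, minlength=N)

# Subpopulations seen exactly once ("singletons"): a learner that
# does not fit these unique examples cannot be accurate on their
# subpopulations, so their share of the data indicates the excess
# error incurred by not memorizing.
singleton_share = (counts == 1).sum() / n
print(f"training examples that are singletons: {singleton_share:.1%}")
```

Making the tail heavier (for instance, increasing N relative to n) raises the singleton share; that is precisely the long-tailed regime in which, per the abstract, memorization becomes necessary for close-to-optimal generalization.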


Published In

STOC 2020: Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing
June 2020
1429 pages
ISBN: 9781450369794
DOI: 10.1145/3357713

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Generalization
  2. Interpolation
  3. Long-tailed Distribution
  4. Overfitting
  5. Privacy-preserving Learning

Conference

STOC '20

Acceptance Rates

Overall Acceptance Rate: 1,469 of 4,586 submissions (32%)
