Analysis of DAWNBench, a Time-to-Accuracy Machine Learning Performance Benchmark

Published: 25 July 2019

Abstract

Researchers have proposed hardware, software, and algorithmic optimizations to improve the computational performance of deep learning. While some of these optimizations perform the same operations faster (e.g., increasing GPU clock speed), many others modify the semantics of the training procedure (e.g., reduced precision) and can impact the final model's accuracy on unseen data. Due to a lack of standard evaluation criteria that consider these trade-offs, it is difficult to directly compare these optimizations. To address this problem, we recently introduced DAWNBench, a benchmark competition focused on end-to-end training time to achieve near-state-of-the-art accuracy on an unseen dataset, a combined metric called time-to-accuracy (TTA). In this work, we analyze the entries from DAWNBench, which received optimized submissions from multiple industrial groups, to investigate the behavior of TTA as a metric as well as trends in the best-performing entries. We show that TTA has a low coefficient of variation and that models optimized for TTA generalize nearly as well as those trained using standard methods. Additionally, even though DAWNBench entries were able to train ImageNet models in under 3 minutes, we find they still underutilize hardware capabilities such as Tensor Cores. Furthermore, we find that distributed entries can spend more than half of their time on communication. We show similar findings with entries to the MLPerf v0.5 benchmark.
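
As a concrete illustration of the metric, the short Python sketch below computes time-to-accuracy from a training log (the elapsed time at which validation accuracy first reaches a target threshold) and its coefficient of variation across repeated runs. The helper names and example data are hypothetical; this is not the official DAWNBench or MLPerf scoring code.

# Minimal sketch of time-to-accuracy (TTA) and its coefficient of variation.
# Illustrative only; helper names and data are hypothetical.
from statistics import mean, stdev
from typing import Optional, Sequence, Tuple

def time_to_accuracy(log: Sequence[Tuple[float, float]],
                     target: float) -> Optional[float]:
    """Elapsed seconds at which validation accuracy first reaches `target`,
    given (elapsed_seconds, val_accuracy) pairs in time order; None if the
    run never reaches the target."""
    for elapsed, accuracy in log:
        if accuracy >= target:
            return elapsed
    return None

def coefficient_of_variation(values: Sequence[float]) -> float:
    """Sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)

# Hypothetical validation logs from three repeated runs, targeting 93% accuracy.
runs = [
    [(600.0, 0.80), (1200.0, 0.91), (1800.0, 0.935)],
    [(600.0, 0.79), (1200.0, 0.92), (1750.0, 0.932)],
    [(600.0, 0.81), (1250.0, 0.90), (1900.0, 0.930)],
]
ttas = [time_to_accuracy(run, target=0.93) for run in runs]
assert all(t is not None for t in ttas)
print("TTA per run (s):", ttas)  # [1800.0, 1750.0, 1900.0]
print("Coefficient of variation:", round(coefficient_of_variation(ttas), 3))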

Published In

ACM SIGOPS Operating Systems Review, Volume 53, Issue 1
July 2019, 90 pages
ISSN: 0163-5980
DOI: 10.1145/3352020

Publisher

Association for Computing Machinery, New York, NY, United States
