
DOI: 10.1609/aaai.v37i9.26347 · Research article · Conference Proceedings

CowClip: reducing CTR prediction model training time from 12 hours to 10 minutes on 1 GPU

Published: 07 February 2023

Abstract

The click-through rate (CTR) prediction task is to predict whether a user will click on a recommended item. As mind-boggling amounts of data are produced online daily, accelerating CTR prediction model training is critical for keeping models up to date and reducing training costs. One approach to increasing training speed is large batch training. However, as shown in computer vision and natural language processing tasks, training with a large batch often suffers a loss of accuracy. Our experiments show that previous scaling rules fail in the training of CTR prediction neural networks. To tackle this problem, we first show theoretically that the widely varying frequencies of feature IDs make it challenging to scale hyperparameters along with the batch size. To stabilize training at large batch sizes, we develop adaptive Column-wise Clipping (CowClip). It enables a simple and effective scaling rule for the embeddings: keep the learning rate unchanged and scale the L2 loss. We conduct extensive experiments with four CTR prediction networks on two real-world datasets and successfully scale the batch size to 128 times the original without accuracy loss. In particular, when training the CTR prediction model DeepFM on the Criteo dataset, our optimization framework enlarges the batch size from 1K to 128K with over 0.1% AUC improvement and reduces training time from 12 hours to 10 minutes on a single V100 GPU.
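The abstract describes clipping embedding gradients column-wise, i.e. per feature ID, so that rare IDs with large, noisy gradients do not destabilize large-batch training. The sketch below is a minimal, hedged illustration of that idea using NumPy: it clips each embedding row's gradient to a norm proportional to that row's current weight norm. The function name `cowclip_sketch`, the `ratio` and `eps` parameters, and the exact thresholding rule are illustrative assumptions, not the paper's verified formulation.

```python
import numpy as np

def cowclip_sketch(grad, weights, ratio=1e-3, eps=1e-5):
    """Illustrative per-ID (column-wise) gradient clipping.

    grad, weights: arrays of shape (num_ids, embed_dim), one row
    per feature ID. Each row's gradient is rescaled so its norm is
    at most ratio * max(||weight row||, eps). This is a sketch of
    the general idea only; the paper's adaptive thresholds may differ.
    """
    w_norm = np.linalg.norm(weights, axis=1, keepdims=True)
    g_norm = np.linalg.norm(grad, axis=1, keepdims=True)
    clip_to = ratio * np.maximum(w_norm, eps)
    # Scale factor <= 1: rows already within the threshold pass through.
    scale = np.minimum(1.0, clip_to / np.maximum(g_norm, 1e-12))
    return grad * scale
```

Because the threshold tracks each ID's own embedding norm, frequent IDs (large, well-trained embeddings) tolerate larger updates than rare IDs, which is one plausible way per-ID clipping could make the abstract's scaling rule — unchanged learning rate, rescaled L2 loss — workable at 128x the batch size.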



Published In

AAAI'23/IAAI'23/EAAI'23: Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence
February 2023
16496 pages
ISBN: 978-1-57735-880-0

Sponsors

  • Association for the Advancement of Artificial Intelligence

Publisher

AAAI Press



