
DOI: 10.1609/aaai.v37i9.26347 · Research article · Conference Proceedings

CowClip: reducing CTR prediction model training time from 12 hours to 10 minutes on 1 GPU

Published: 07 February 2023

Abstract

The click-through rate (CTR) prediction task is to predict whether a user will click on a recommended item. As mind-boggling amounts of data are produced online daily, accelerating CTR prediction model training is critical for keeping models up to date and reducing training costs. One approach to increasing training speed is large batch training. However, as shown in computer vision and natural language processing tasks, training with a large batch often suffers a loss of accuracy. Our experiments show that previous scaling rules fail in the training of CTR prediction neural networks. To tackle this problem, we first show theoretically that the widely varying frequencies of feature IDs make it challenging to scale hyperparameters along with the batch size. To stabilize training at large batch sizes, we develop adaptive Column-wise Clipping (CowClip). It enables a simple and effective scaling rule for the embeddings: keep the learning rate unchanged and scale the L2 loss. We conduct extensive experiments with four CTR prediction networks on two real-world datasets and successfully scale the batch size to 128 times the original without accuracy loss. In particular, when training the CTR prediction model DeepFM on the Criteo dataset, our optimization framework enlarges the batch size from 1K to 128K with over 0.1% AUC improvement and reduces training time from 12 hours to 10 minutes on a single V100 GPU.
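The abstract describes clipping embedding gradients column-wise, i.e. per feature ID, so that rare IDs with large, noisy gradients do not destabilize large-batch training. The sketch below is a minimal, hedged illustration of that idea using NumPy: it clips each embedding row's gradient to a norm proportional to that row's current weight norm. The function name `cowclip_sketch`, the `ratio` and `eps` parameters, and the exact thresholding rule are illustrative assumptions, not the paper's verified formulation.

```python
import numpy as np

def cowclip_sketch(grad, weights, ratio=1e-3, eps=1e-5):
    """Illustrative per-ID (column-wise) gradient clipping.

    grad, weights: arrays of shape (num_ids, embed_dim), one row
    per feature ID. Each row's gradient is rescaled so its norm is
    at most ratio * max(||weight row||, eps). This is a sketch of
    the general idea only; the paper's adaptive thresholds may differ.
    """
    w_norm = np.linalg.norm(weights, axis=1, keepdims=True)
    g_norm = np.linalg.norm(grad, axis=1, keepdims=True)
    clip_to = ratio * np.maximum(w_norm, eps)
    # Scale factor <= 1: rows already within the threshold pass through.
    scale = np.minimum(1.0, clip_to / np.maximum(g_norm, 1e-12))
    return grad * scale
```

Because the threshold tracks each ID's own embedding norm, frequent IDs (large, well-trained embeddings) tolerate larger updates than rare IDs, which is one plausible way per-ID clipping could make the abstract's scaling rule — unchanged learning rate, rescaled L2 loss — workable at 128x the batch size.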



Published In

AAAI'23/IAAI'23/EAAI'23: Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence
February 2023
16496 pages
ISBN: 978-1-57735-880-0

Sponsors

  • Association for the Advancement of Artificial Intelligence

Publisher

AAAI Press



