Accelerating recommendation system training by leveraging popular choices

Published: 01 September 2021

Abstract

Recommender models are commonly used to suggest relevant items to users in e-commerce and online advertising applications. These models use massive embedding tables to store numerical representations of the categorical variables of items and users (memory intensive) and employ neural networks (compute intensive) to generate final recommendations. Training these large-scale recommendation models requires ever-increasing data and compute resources. The highly parallel neural-network portion of these models can benefit from GPU acceleration; however, large embedding tables often cannot fit in the limited-capacity GPU device memory. Hence, this paper deep dives into the semantics of training data and obtains insights about the feature access, transfer, and usage patterns of these models. We observe that, due to the popularity of certain inputs, accesses to the embeddings are highly skewed, with a few embedding entries being accessed up to 10000X more often than others. This paper leverages this asymmetric access pattern to offer a framework, called FAE, and proposes a hot-embedding-aware data layout for training recommender models. This layout uses the scarce GPU memory to store the most frequently accessed embeddings, thereby reducing data transfers from CPU to GPU. At the same time, FAE engages the GPU to accelerate execution over these hot embedding entries. Experiments on production-scale recommendation models with real datasets show that FAE reduces overall training time by 2.3X and 1.52X compared to XDL CPU-only and XDL CPU-GPU execution, respectively, while maintaining baseline accuracy.
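For illustration, the sketch below (ours, not code from FAE or XDL) shows one way such a hot-embedding-aware layout could be expressed in a PyTorch-style setup: access counts from a sample of the training data decide which embedding rows are "hot" and pinned in GPU memory, while the remaining "cold" rows stay in host memory and cross the CPU-GPU link only when a batch touches them. The class and parameter names (HotColdEmbedding, hot_fraction, sample_indices) are hypothetical.

# Minimal sketch (assumption: a PyTorch-style training setup; this is not the
# authors' FAE implementation). Embedding rows are split by profiled access
# frequency: the top hot_fraction of rows live on the GPU, the rest stay in
# host memory and are transferred only when a batch needs them.
import torch

class HotColdEmbedding(torch.nn.Module):
    def __init__(self, num_rows, dim, sample_indices, hot_fraction=0.01,
                 device="cuda"):
        super().__init__()
        # Profile: count how often each row appears in a sample of training data.
        counts = torch.bincount(sample_indices, minlength=num_rows)
        num_hot = max(1, int(hot_fraction * num_rows))
        hot_rows = torch.topk(counts, num_hot).indices
        # Map original row id -> slot in the hot table (-1 means cold).
        remap = torch.full((num_rows,), -1, dtype=torch.long)
        remap[hot_rows] = torch.arange(num_hot)
        self.register_buffer("remap", remap)
        self.hot = torch.nn.Embedding(num_hot, dim).to(device)  # scarce GPU memory
        # Full table kept on the host for simplicity (hot rows are duplicated).
        self.cold = torch.nn.Embedding(num_rows, dim)
        self.device = device

    def forward(self, indices):
        # indices: 1-D LongTensor of row ids, resident on the CPU.
        hot_pos = self.remap[indices]
        is_hot = hot_pos >= 0
        out = torch.empty(indices.numel(), self.hot.embedding_dim,
                          device=self.device)
        mask = is_hot.to(self.device)
        if is_hot.any():
            # Popular rows are served directly from GPU memory.
            out[mask] = self.hot(hot_pos[is_hot].to(self.device))
        if (~is_hot).any():
            # Only the cold rows incur a CPU-to-GPU transfer.
            out[~mask] = self.cold(indices[~is_hot]).to(self.device)
        return out

# Hypothetical usage: profile a sample of inputs, then embed a batch of sparse ids.
sample = torch.randint(0, 1_000_000, (200_000,))
emb = HotColdEmbedding(num_rows=1_000_000, dim=16, sample_indices=sample,
                       device="cuda" if torch.cuda.is_available() else "cpu")
batch = torch.randint(0, 1_000_000, (4096,))
vectors = emb(batch)  # (4096, 16) embedding vectors on the target device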


    Published In

    Proceedings of the VLDB Endowment, Volume 15, Issue 1
    September 2021, 140 pages
    ISSN: 2150-8097

    Publisher

    VLDB Endowment

    Publication History

    Published: 01 September 2021
    Published in PVLDB Volume 15, Issue 1

    Cited By

    • (2024) Embedding Optimization for Training Large-scale Deep Learning Recommendation Systems with EMBark. Proceedings of the 18th ACM Conference on Recommender Systems, 622-632. https://doi.org/10.1145/3640457.3688111. Online publication date: 8-Oct-2024.
    • (2024) The Image Calculator: 10x Faster Image-AI Inference by Replacing JPEG with Self-designing Storage Format. Proceedings of the ACM on Management of Data, 2(1), 1-31. https://doi.org/10.1145/3639307. Online publication date: 26-Mar-2024.
    • (2024) A Universal Sets-level Optimization Framework for Next Set Recommendation. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, 1544-1554. https://doi.org/10.1145/3627673.3679610. Online publication date: 21-Oct-2024.
    • (2024) NeuPIMs: NPU-PIM Heterogeneous Acceleration for Batched LLM Inferencing. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 722-737. https://doi.org/10.1145/3620666.3651380. Online publication date: 27-Apr-2024.
    • (2024) RAP: Resource-aware Automated GPU Sharing for Multi-GPU Recommendation Model Training and Input Preprocessing. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 964-979. https://doi.org/10.1145/3620665.3640406. Online publication date: 27-Apr-2024.
    • (2024) RecFlex: Enabling Feature Heterogeneity-Aware Optimization for Deep Recommendation Models with Flexible Schedules. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, 1-15. https://doi.org/10.1109/SC41406.2024.00047. Online publication date: 17-Nov-2024.
    • (2023) CowClip. Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence, 11390-11398. https://doi.org/10.1609/aaai.v37i9.26347. Online publication date: 7-Feb-2023.
    • (2023) RECom: A Compiler Approach to Accelerating Recommendation Model Inference with Massive Embedding Columns. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 4, 268-286. https://doi.org/10.1145/3623278.3624761. Online publication date: 25-Mar-2023.
    • (2023) BladeDISC: Optimizing Dynamic Shape Machine Learning Workloads via Compiler Approach. Proceedings of the ACM on Management of Data, 1(3), 1-29. https://doi.org/10.1145/3617327. Online publication date: 13-Nov-2023.
    • (2023) Bagpipe: Accelerating Deep Recommendation Model Training. Proceedings of the 29th Symposium on Operating Systems Principles, 348-363. https://doi.org/10.1145/3600006.3613142. Online publication date: 23-Oct-2023.
    • Show More Cited By
