
HET: scaling out huge embedding model training via cache-enabled distributed framework

Published: 01 October 2021

Abstract

Embedding models have been an effective learning paradigm for high-dimensional data. However, one open issue of embedding models is that their representations (latent factors) often result in a large parameter space. We observe that existing distributed training frameworks face a scalability issue with embedding models, since updating and retrieving the shared embedding parameters from servers usually dominates the training cycle. In this paper, we propose HET, a new system framework that significantly improves the scalability of huge embedding model training. We embrace skewed popularity distributions of embeddings as a performance opportunity and leverage it to address the communication bottleneck with an embedding cache. To ensure consistency across the caches, we incorporate a new consistency model into the HET design, which provides fine-grained consistency guarantees on a per-embedding basis. Compared to previous work that only allows staleness for read operations, HET also utilizes staleness for write operations. Evaluations on six representative tasks show that HET achieves up to 88% reduction in embedding communication and up to 20.68× performance speedup over the state-of-the-art baselines.
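The core mechanism described in the abstract, a worker-side embedding cache with fine-grained, per-embedding staleness bounds applied to both reads and writes, can be illustrated with a minimal sketch. The code below is a conceptual illustration only, not HET's actual implementation or API: the class names (ParameterServer, EmbeddingCache), the simple per-embedding version clock, and the plain SGD update rule are all assumptions made for this example.

```python
# Conceptual sketch of a cache-enabled embedding store with per-embedding
# bounded staleness, following the idea in the abstract. This is NOT the HET
# implementation; all names and the SGD update rule are hypothetical.
import numpy as np


class ParameterServer:
    """Holds the authoritative embedding table plus a per-embedding clock."""

    def __init__(self, num_embeddings: int, dim: int):
        self.table = np.zeros((num_embeddings, dim), dtype=np.float32)
        self.clock = np.zeros(num_embeddings, dtype=np.int64)

    def pull(self, key: int):
        return self.table[key].copy(), int(self.clock[key])

    def push(self, key: int, grad: np.ndarray, lr: float = 0.01):
        self.table[key] -= lr * grad
        self.clock[key] += 1


class EmbeddingCache:
    """Worker-side cache: each embedding may drift up to `staleness` local
    steps away from the server copy before it is synchronized, for both
    reads and writes (fine-grained, per-embedding consistency)."""

    def __init__(self, server: ParameterServer, staleness: int = 4):
        self.server = server
        self.staleness = staleness
        # key -> [cached vector, server clock at last sync,
        #         buffered gradient, local steps since last sync]
        self.entries = {}

    def _sync(self, key: int, lr: float = 0.01):
        entry = self.entries.get(key)
        if entry is not None and entry[3] > 0:
            # Stale write: flush the buffered gradient in one round trip.
            self.server.push(key, entry[2], lr)
        vec, clk = self.server.pull(key)
        self.entries[key] = [vec, clk, np.zeros_like(vec), 0]

    def read(self, key: int) -> np.ndarray:
        entry = self.entries.get(key)
        # Stale read: serve from cache unless the entry is missing or has
        # exceeded its per-embedding staleness bound.
        if entry is None or entry[3] >= self.staleness:
            self._sync(key)
            entry = self.entries[key]
        return entry[0]

    def write(self, key: int, grad: np.ndarray, lr: float = 0.01):
        if key not in self.entries:
            self._sync(key)
        entry = self.entries[key]
        entry[0] -= lr * grad   # apply locally so later local reads see it
        entry[2] += grad        # buffer the update for a later server push
        entry[3] += 1
        if entry[3] >= self.staleness:
            self._sync(key, lr)  # bound how far the server copy can lag
```

Because popular ("hot") embeddings are read and written far more often than cold ones, a cache of this kind turns most of their accesses into local operations, which is the kind of embedding-communication reduction the abstract reports.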



Published In

Proceedings of the VLDB Endowment, Volume 15, Issue 2, October 2021, 247 pages
ISSN: 2150-8097
Publisher: VLDB Endowment
Published in PVLDB Volume 15, Issue 2

