DOI: 10.1109/ISCA45697.2020.00083 (research article)

Centaur: a chiplet-based, hybrid sparse-dense accelerator for personalized recommendations

Published: 23 September 2020

Abstract

Personalized recommendation is the backbone machine learning (ML) algorithm powering several important application domains (e.g., ads, e-commerce) serviced from cloud datacenters. Sparse embedding layers are a crucial building block of recommendation models, yet little attention has been paid to properly accelerating this important ML algorithm. This paper first provides a detailed workload characterization of personalized recommendation and identifies two significant performance limiters: memory-intensive embedding layers and compute-intensive multi-layer perceptron (MLP) layers. We then present Centaur, a chiplet-based hybrid sparse-dense accelerator that addresses both the memory throughput challenges of embedding layers and the compute limitations of MLP layers. We implement and demonstrate our proposal on an Intel HARPv2, a package-integrated CPU+FPGA device, achieving a 1.7-17.2x performance speedup and a 1.7-19.5x energy-efficiency improvement over conventional approaches.




Published In

ISCA '20: Proceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture
May 2020
1152 pages
ISBN: 9781728146614

In-Cooperation

  • IEEE

Publisher

IEEE Press

Author Tags

  1. FPGA
  2. accelerator
  3. deep learning
  4. machine learning
  5. neural network
  6. processor architecture

Qualifiers

  • Research-article

Conference

ISCA '20

Acceptance Rates

Overall Acceptance Rate 543 of 3,203 submissions, 17%


Article Metrics

  • Downloads (last 12 months): 69
  • Downloads (last 6 weeks): 9
Reflects downloads up to 16 Dec 2024.


Cited By

  • Hetero-Rec++: Modelling-based Robust and Optimal Deployment of Embeddings Recommendations. Proceedings of the Third International Conference on AI-ML Systems, 2023, pp. 1-9. doi:10.1145/3639856.3639878
  • RECom: A Compiler Approach to Accelerating Recommendation Model Inference with Massive Embedding Columns. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 4, 2023, pp. 268-286. doi:10.1145/3623278.3624761
  • A Survey of Design and Optimization for Systolic Array-based DNN Accelerators. ACM Computing Surveys 56(1), 2023, pp. 1-37. doi:10.1145/3604802
  • MP-Rec: Hardware-Software Co-design to Enable Multi-path Recommendation. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 2023, pp. 449-465. doi:10.1145/3582016.3582068
  • GRACE: A Scalable Graph-Based Approach to Accelerating Recommendation Model Inference. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 2023, pp. 282-301. doi:10.1145/3582016.3582029
  • Optimizing CPU Performance for Recommendation Systems At-Scale. Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023, pp. 1-15. doi:10.1145/3579371.3589112
  • ETTE: Efficient Tensor-Train-based Computing Engine for Deep Neural Networks. Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023, pp. 1-13. doi:10.1145/3579371.3589103
  • Accelerating Personalized Recommendation with Cross-level Near-Memory Processing. Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023, pp. 1-13. doi:10.1145/3579371.3589101
  • Hetero-Rec: Optimal Deployment of Embeddings for High-Speed Recommendations. Proceedings of the Second International Conference on AI-ML Systems, 2022, pp. 1-9. doi:10.1145/3564121.3564134
  • Performance Model and Profile Guided Design of a High-Performance Session Based Recommendation Engine. Proceedings of the 2022 ACM/SPEC International Conference on Performance Engineering, 2022, pp. 133-144. doi:10.1145/3489525.3511692
