7th MLSys 2024: Santa Clara, CA, USA
- Phillip B. Gibbons, Gennady Pekhimenko, Christopher De Sa (eds.): Proceedings of the Seventh Annual Conference on Machine Learning and Systems, MLSys 2024, Santa Clara, CA, USA, May 13-16, 2024. mlsys.org, 2024.
- Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, Arvind Krishnamurthy: Punica: Multi-Tenant LoRA Serving.
- Pratik Fegade, Tianqi Chen, Phillip B. Gibbons, Todd C. Mowry: ACROBAT: Optimizing Auto-batching of Dynamic Deep Learning at Compile Time.
- Gyudong Kim, Mehdi Ghasemi, Soroush Heidari, Seungryong Kim, Young Geun Kim, Sarma B. K. Vrudhula, Carole-Jean Wu: HeteroSwitch: Characterizing and Taming System-Induced Data Heterogeneity in Federated Learning.
- Mohamed Assem Ibrahim, Shaizeen Aga, Ada Li, Suchita Pati, Mahzabeen Islam: JIT-Q: Just-in-time Quantization with Processing-In-Memory for Efficient ML Training.
- Milos Nikolic, Enrique Torres-Sánchez, Jiahui Wang, Ali Hadi Zadeh, Mostafa Mahmoud, Ameer Abdelhadi, Kareem Ibrahim, Andreas Moshovos: Schrodinger's FP: Training Neural Networks with Dynamic Floating-Point Containers.
- Chenyu Jiang, Ye Tian, Zhen Jia, Shuai Zheng, Chuan Wu, Yida Wang: Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping.
- Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, Song Han: AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration.
- Ye Tian, Zhen Jia, Ziyue Luo, Yida Wang, Chuan Wu: DiffusionPipe: Training Large Diffusion Models with Efficient Pipelines.
- Muhammad Adnan, Akhil Arunkumar, Gaurav Jain, Prashant J. Nair, Ilya Soloveychik, Purushotham Kamath: Keyformer: KV Cache reduction through key tokens selection for Efficient Generative Inference.
- Kiwan Maeng, G. Edward Suh: Accelerating ReLU for MPC-Based Private Inference with a Communication-Efficient Sign Estimation.
- Ke Hong, Guohao Dai, Jiaming Xu, Qiuli Mao, Xiuhong Li, Jun Liu, Kangdi Chen, Yuhan Dong, Yu Wang: FlashDecoding++: Faster Large Language Model Inference with Asynchronization, Flat GEMM Optimization, and Heuristics.
- Xuanlei Zhao, Bin Jia, Haotian Zhou, Ziming Liu, Shenggan Cheng, Yang You: HeteGen: Efficient Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices.
- Yifei Xu, Yuning Chen, Xumiao Zhang, Xianshang Lin, Pan Hu, Yunfei Ma, Songwu Lu, Wan Du, Zhuoqing Mao, Ennan Zhai, Dennis Cai: CloudEval-YAML: A Practical Benchmark for Cloud Configuration Generation.
- Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, Baris Kasikci: Atom: Low-Bit Quantization for Efficient and Accurate LLM Serving.
- Jingtian Dang, Jianming Tong, Anupam Golder, Cong Hao, Arijit Raychowdhury, Tushar Krishna: Accurate Low-Degree Polynomial Approximation of Non-Polynomial Operators for Fast Private Inference in Homomorphic Encryption.
- Zhixu Du, Shiyu Li, Yuhao Wu, Xiangyu Jiang, Jingwei Sun, Qilin Zheng, Yongkai Wu, Ang Li, Hai Li, Yiran Chen: SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models.
- Song Bian, Dacheng Li, Hongyi Wang, Eric P. Xing, Shivaram Venkataraman: Does Compressing Activations Help Model Parallel Training?
- Alok Tripathy, Katherine A. Yelick, Aydin Buluç: Distributed Matrix-Based Sampling for Graph Neural Network Training.
- Liang Luo, Buyun Zhang, Michael Tsang, Yinbin Ma, Ching-Hsiang Chu, Yuxin Chen, Shen Li, Yuchen Hao, Yanli Zhao, Guna Lakshminarayanan, Ellie Wen, Jongsoo Park, Dheevatsa Mudigere, Maxim Naumov: Disaggregated Multi-Tower: Topology-aware Modeling Technique for Efficient Large Scale Recommendation.
- Shan Yu, Zhenting Zhu, Yu Chen, Hanchen Xu, Pengzhan Zhao, Yang Wang, Arthi Padmanabhan, Hugo Latapie, Harry Xu: VQPy: An Object-Oriented Approach to Modern Video Analytics.
- Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, Joseph Gonzalez, Ion Stoica: SLoRA: Scalable Serving of Thousands of LoRA Adapters.
- Ilia Markov, Kaveh Alim, Elias Frantar, Dan Alistarh: L-GreCo: Layerwise-adaptive Gradient Compression For Efficient Data-parallel Deep Learning.
- In Gim, Guojun Chen, Seung-Seob Lee, Nikhil Sarda, Anurag Khandelwal, Lin Zhong: Prompt Cache: Modular Attention Reuse for Low-Latency Inference.
- Yunhao Yang, Neel P. Bhatt, Tyler Ingebrand, William Ward, Steven Carr, Atlas Wang, Ufuk Topcu: Fine-Tuning Language Models Using Formal Methods Feedback: A Use Case in Autonomous Systems.
- Amey Agrawal, Nitin Kedia, Jayashree Mohan, Ashish Panwar, Nipun Kwatra, Bhargav S. Gulavani, Ramachandran Ramjee, Alexey Tumanov: VIDUR: A Large-Scale Simulation Framework for LLM Inference.
- Jian Meng, Yuan Liao, Anupreetham Anupreetham, Ahmed Hassan, Shixing Yu, Han-Sok Suh, Xiaofeng Hu, Jae-sun Seo: Torch2Chip: An End-to-end Customizable Deep Neural Network Compression and Deployment Toolkit for Prototype Hardware Accelerator Design.
- Zhenyu Zhang, Shiwei Liu, Runjin Chen, Bhavya Kailkhura, Beidi Chen, Atlas Wang: Q-Hitter: A Better Token Oracle for Efficient LLM Inference via Sparse-Quantized KV Cache.
- Yuxuan Zhu, Jiachen Liu, Mosharaf Chowdhury, Fan Lai: FedTrans: Efficient Federated Learning via Multi-Model Transformation.
- Shixiong Qi, K. K. Ramakrishnan, Myungjin Lee: LIFL: A Lightweight, Event-driven Serverless Platform for Federated Learning.
- Yubo Gao, Maryam Haghifam, Christina Giannoula, Renbo Tu, Gennady Pekhimenko, Nandita Vijaykumar: Proteus: Preserving Model Confidentiality during Graph Optimizations.
- Elias Frantar, Dan Alistarh: QMoE: Sub-1-Bit Compression of Trillion Parameter Models.
- Size Zheng, Renze Chen, Meng Li, Zihao Ye, Luis Ceze, Yun Liang: vMCU: Coordinated Memory Management and Kernel Optimization for DNN Inference on MCUs.
- Yichen Qian, Yongyi He, Rong Zhu, Jintao Huang, Zhijian Ma, Haibin Wang, Yaohua Wang, Xiuyu Sun, Defu Lian, Bolin Ding, Jingren Zhou: UniDM: A Unified Framework for Data Manipulation with Large Language Models.
- Haihao Shen, Naveen Mellempudi, Xin He, Qun Gao, Chang Wang, Mengni Wang: Efficient Post-training Quantization with FP8 Formats.
- Isha Chaudhary, Alex Renda, Charith Mendis, Gagandeep Singh: COMET: Neural Cost Model Explanation Framework.
- Yash Akhauri, Mohamed S. Abdelfattah: On Latency Predictors for Neural Architecture Search.
- Haoran Qiu, Weichao Mao, Archit Patke, Shengkun Cui, Chen Wang, Hubertus Franke, Zbigniew Kalbarczyk, Tamer Basar, Ravi K. Iyer: FLASH: Fast Model Adaptation in ML-Centric Cloud Platforms.