DCS: a fast and scalable device-centric server architecture

Published: 05 December 2015

Abstract

Conventional servers achieve high performance by employing fast CPUs to run compute-intensive workloads while having the operating system manage relatively slow I/O devices through memory accesses and interrupts. However, as emerging workloads become heavily data-intensive and emerging devices (e.g., NVM storage, high-bandwidth NICs, and GPUs) enable low-latency, high-bandwidth device operations, traditional host-centric server architectures fail to deliver high performance due to their inefficient device-handling mechanisms. Moreover, unless this architectural inefficiency is resolved, the performance loss will only grow as these devices become faster.
In this paper, we propose DCS, a novel device-centric server architecture that fully exploits the potential of emerging devices so that server performance scales with device performance. The key idea of DCS is to orchestrate devices to communicate directly with one another while selectively bypassing the host; the host remains responsible for only a few device-related operations (e.g., filesystem lookup). In this way, DCS achieves high I/O performance through direct inter-device communication and high computation performance by fully utilizing host-side resources. To implement DCS, we introduce the DCS Engine, a custom hardware device that orchestrates devices via standard I/O protocols (i.e., PCIe and NVMe), along with its device driver and a user-level library. We show that our FPGA-based DCS prototype significantly improves the performance of emerging server workloads and that the architecture scales with the performance of the devices.
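
To make the device-centric flow concrete, the sketch below separates the control path, which stays on the host, from the data path, which moves directly between devices. It is a minimal illustration only: the names dcs_open, dcs_gpu_alloc, and dcs_d2d_read, and the stub bodies that merely print what each step would do, are hypothetical and are not the paper's actual user-level library API. The point is the division of labor the abstract describes: the host performs the filesystem lookup and issues a single command, while the payload travels from the NVMe SSD to the GPU over PCIe without passing through host memory.

/*
 * Hypothetical sketch of a device-centric I/O call path.  The names
 * (dcs_open, dcs_gpu_alloc, dcs_d2d_read) are invented for illustration and
 * are NOT the paper's real user-level library; the stub bodies below only
 * print what each step would do so the example compiles and runs standalone.
 */
#include <stdio.h>
#include <stdlib.h>

typedef struct { const char *path; } dcs_file_t;   /* host-resolved file handle  */
typedef struct { size_t bytes; }     dcs_gpubuf_t; /* GPU-resident buffer handle */

/* Control path (host): filesystem lookup only, no data movement. */
static dcs_file_t *dcs_open(const char *path) {
    printf("[host]   filesystem lookup for %s\n", path);
    dcs_file_t *f = malloc(sizeof *f);
    f->path = path;
    return f;
}

/* Allocate a PCIe-visible buffer in GPU memory (simulated here). */
static dcs_gpubuf_t *dcs_gpu_alloc(size_t bytes) {
    dcs_gpubuf_t *b = malloc(sizeof *b);
    b->bytes = bytes;
    return b;
}

/* Data path: the orchestrating engine would move the payload
 * SSD -> GPU over PCIe/NVMe, bypassing host DRAM entirely. */
static int dcs_d2d_read(dcs_file_t *f, dcs_gpubuf_t *dst, size_t off, size_t len) {
    printf("[engine] SSD -> GPU: %zu bytes of %s at offset %zu (host DRAM bypassed)\n",
           len, f->path, off);
    (void)dst;
    return 0;
}

int main(void) {
    dcs_file_t   *f   = dcs_open("/data/input.bin");     /* control plane on host */
    dcs_gpubuf_t *buf = dcs_gpu_alloc(64u << 20);         /* 64 MiB in GPU memory  */

    /* One command from the host; the payload never enters host memory. */
    if (dcs_d2d_read(f, buf, 0, 64u << 20) != 0) {
        fprintf(stderr, "direct device-to-device read failed\n");
        return 1;
    }
    free(buf);
    free(f);
    return 0;
}

Keeping only the control-plane work (name resolution, command issue) on the host is what lets overall throughput track device performance rather than CPU-side I/O handling overhead, which is the scaling argument the abstract makes.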

Published In

MICRO-48: Proceedings of the 48th International Symposium on Microarchitecture
December 2015
787 pages
ISBN: 9781450340342
DOI: 10.1145/2830772
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. I/O optimizations
  2. device-to-device communications
  3. server architecture
  4. storage systems

Qualifiers

  • Research-article

Conference

MICRO-48

Acceptance Rates

MICRO-48 paper acceptance rate: 61 of 283 submissions (22%)
Overall acceptance rate: 484 of 2,242 submissions (22%)

Cited By

  • (2024) Performance Characterization of SmartNIC NVMe-over-Fabrics Target Offloading. Proceedings of the 17th ACM International Systems and Storage Conference, pp. 14-24. https://doi.org/10.1145/3688351.3689154
  • (2024) Data Motion Acceleration: Chaining Cross-Domain Multi Accelerators. 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 1043-1062. https://doi.org/10.1109/HPCA57654.2024.00083
  • (2023) GPU-Initiated On-Demand High-Throughput Storage Access in the BaM System Architecture. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pp. 325-339. https://doi.org/10.1145/3575693.3575748
  • (2023) BM-Store: A Transparent and High-performance Local Storage Architecture for Bare-metal Clouds Enabling Large-scale Deployment. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 1031-1044. https://doi.org/10.1109/HPCA56546.2023.10071029
  • (2022) Survey on storage-accelerator data movement. CCF Transactions on High Performance Computing. https://doi.org/10.1007/s42514-022-00112-0
  • (2021) Libpubl. Proceedings of the 13th ACM Workshop on Hot Topics in Storage and File Systems, pp. 64-70. https://doi.org/10.1145/3465332.3470874
  • (2020) TrainBox: An Extreme-Scale Neural Network Training Server Architecture by Systematically Balancing Operations. 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 825-838. https://doi.org/10.1109/MICRO50266.2020.00072
  • (2020) DRAM-Less: Hardware Acceleration of Data Processing with New Memory. 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 287-302. https://doi.org/10.1109/HPCA47549.2020.00032
  • (2019) FIDR. Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, pp. 239-252. https://doi.org/10.1145/3352460.3358303
  • (2019) FlashGPU. Proceedings of the 56th Annual Design Automation Conference 2019, pp. 1-6. https://doi.org/10.1145/3316781.3317827