GPUrpc: Exploring Transparent Access to Remote GPUs

Authors:

Nobuhiko Nishio,

Shinpei Kato

ACM Transactions on Embedded Computing Systems (TECS), Volume 16, Issue 1

Article No.: 17, Pages 1 - 25

https://doi.org/10.1145/2950056

Published: 13 October 2016

Abstract

Graphics processing units (GPUs) are increasingly used for high-performance computing. Programming frameworks for general-purpose computing on GPUs (GPGPU), such as CUDA and OpenCL, are also maturing. Driving this trend is the recent proliferation of mobile devices such as smartphones and wearable computers. These devices increasingly run computationally intensive applications that involve some form of environmental recognition, such as augmented reality (AR) or voice recognition. However, devices with low computational power cannot satisfy such demanding computing requirements. The CPU load of these devices could be reduced by offloading computation onto GPUs in the cloud. This paper presents GPUrpc, a remote procedure call (RPC) extension to Gdev, which is a rich set of runtime libraries and device drivers for achieving first-class GPU resource management. GPUrpc allows developers to use CUDA for GPGPU development work. Existing research uses RPCs based on the CUDA application programming interfaces (APIs); hence, all CUDA APIs require communication. To reduce communication overhead, we use an RPC based on an API at a lower level than the CUDA API and reduce the set of API calls that require communication. Our evaluation, conducted on Linux and NVIDIA GPUs, shows that the basic performance of our prototype implementation is reliable in comparison with the existing method. Evaluation using the Rodinia benchmark suite, which is designed for research in heterogeneous parallel computing, showed that GPUrpc is effective for applications such as image processing and data mining. GPUrpc can also reduce power consumption to approximately 1/6 of that of CPU processing when performing 512 × 512 matrix multiplication.
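The communication argument in the abstract can be illustrated with a toy model. The sketch below is not GPUrpc's implementation: every call name, both client classes, and the call sequence are hypothetical and are not taken from CUDA, Gdev, or the paper. It only counts how many network round trips a client would make if every API call were forwarded, compared with a client that handles bookkeeping calls locally and forwards only the calls assumed to need the remote device.

```python
# Toy model only: every call name below is hypothetical and does not come from
# CUDA, Gdev, or GPUrpc.

# Calls assumed (for illustration) to need no access to the remote device.
LOCAL_CALLS = {"set_block_shape", "set_param"}


class CountingTransport:
    """Stands in for the network: counts round trips instead of sending data."""

    def __init__(self):
        self.round_trips = 0

    def send(self, call, *args):
        self.round_trips += 1


class ForwardEverything:
    """Forwards every API call to the server, one round trip per call."""

    def __init__(self, transport):
        self.transport = transport

    def invoke(self, call, *args):
        self.transport.send(call, *args)


class ForwardDeviceCallsOnly:
    """Handles bookkeeping calls on the client and forwards the rest."""

    def __init__(self, transport):
        self.transport = transport
        self.pending = []  # launch configuration buffered on the client

    def invoke(self, call, *args):
        if call in LOCAL_CALLS:
            self.pending.append((call, args))
            return
        if call == "launch":
            # Ship the buffered configuration together with the launch request.
            args = args + (tuple(self.pending),)
            self.pending = []
        self.transport.send(call, *args)


def matmul_call_sequence():
    """Hypothetical call sequence for one C = A x B kernel run."""
    return (
        [("malloc", name) for name in ("A", "B", "C")]
        + [("copy_to_device", name) for name in ("A", "B")]
        + [("set_block_shape", 16, 16)]
        + [("set_param", name) for name in ("A", "B", "C")]
        + [("launch", "matmul")]
        + [("copy_to_host", "C")]
        + [("free", name) for name in ("A", "B", "C")]
    )


def count_round_trips(client_cls):
    """Replay the call sequence through a client and return its round trips."""
    transport = CountingTransport()
    client = client_cls(transport)
    for call, *args in matmul_call_sequence():
        client.invoke(call, *args)
    return transport.round_trips


if __name__ == "__main__":
    print("forward everything:", count_round_trips(ForwardEverything))
    print("forward device calls only:", count_round_trips(ForwardDeviceCallsOnly))
```

In this made-up sequence of 14 calls, the first client makes 14 round trips and the second makes 10, because the four configuration calls are buffered locally and travel with the launch request. The numbers say nothing about GPUrpc's measured overhead; they only show why the choice of forwarding layer changes how often the client must communicate.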

Cited By

Constantinou G, Sankar Ramachandran G, Alfarrarjeh A, Kim S, Krishnamachari B, Shahabi C. (2019). A Crowd-Based Image Learning Framework using Edge Computing for Smart City Applications. 2019 IEEE Fifth International Conference on Multimedia Big Data (BigMM), 11-20. Online publication date: Sep-2019. https://doi.org/10.1109/BigMM.2019.00-47

Zhao Q, You T, Ma X, Mao Y, Leng S, Yang N, Zhao Z. (2017). Mobile Edge Decoding for Saving Energy and Improving Experience. 2017 IEEE International Conference on Internet of Things (iThings) and IEEE Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData), 475-482. Online publication date: Jun-2017. https://doi.org/10.1109/iThings-GreenCom-CPSCom-SmartData.2017.76

Index Terms

GPUrpc: Exploring Transparent Access to Remote GPUs
1. Software and its engineering
  1. Software organization and properties
    1. Software system structures
      1. Distributed systems organizing principles
      2. Embedded software

Information & Contributors

Information

Published In

ACM Transactions on Embedded Computing Systems Volume 16, Issue 1

Special Issue on VIPES, Special Issue on ICESS2015 and Regular Papers

February 2017

602 pages

ISSN:1539-9087

EISSN:1558-3465

DOI:10.1145/3008024

Editor:
Sandeep K. Shukla
Indian Institute of Technology, India

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Journal Family

ACM Journals for the Design of Smart and Connected Systems

Publication History

Published: 13 October 2016

Accepted: 01 May 2016

Revised: 01 May 2016

Received: 01 October 2015

Published in TECS Volume 16, Issue 1

Qualifiers

Announcement
Research
Refereed

Bibliometrics

Article Metrics

Total Citations: 2
Total Downloads: 305
Downloads (Last 12 months): 17
Downloads (Last 6 weeks): 3

Reflects downloads up to 19 Nov 2024
