Abstract
The last-level cache (LLC) shared by heterogeneous processors such as the CPU and the general-purpose graphics processing unit (GPGPU) brings new opportunities to optimize data sharing among them. Previous work introduced the LLC buffer, which uses part of the LLC storage as a FIFO buffer to enable data sharing between CPU and GPGPU with negligible management overhead. However, the baseline LLC buffer's capacity is limited, which can lead to deadlock when the buffer is full. It also relies on inefficient CPU kernel relaunch and high-overhead atomic operations on the GPGPU for global synchronization. These limitations motivate us to enable back memory and global synchronization on the baseline LLC buffer and make it more practical. The back memory divides the buffer storage into two levels, an LLC level and a memory level. While the two levels are managed as a single queue, the data storage in each level is managed as an individual circular buffer. Data are redirected to the memory level when the LLC level is full and are loaded back into the LLC level when it has free space. A case study of n-queen shows that the back memory achieves performance comparable to that of an LLC buffer with an infinite LLC level. In contrast, the LLC buffer without back memory exhibits a 10% performance degradation caused by buffer space contention. Global synchronization is enabled by peeking at the data about to be read from the buffer. Any request to read data placed in the LLC buffer after the global barrier is allowed only when all the threads have reached the barrier. We adopt breadth-first search (BFS) as a case study and compare the LLC buffer with an optimized implementation of BFS on GPGPU. The results show that the LLC buffer achieves a speedup of 1.70 on average. The global synchronization time on GPGPU and CPU is decreased to 38 and 60–5%, respectively.
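The two-level organization described in the abstract can be illustrated with a small software model. The sketch below is only an illustration under stated assumptions, not the hardware design from the article: the class name `TwoLevelBuffer`, its method names, and the capacity parameters are hypothetical, and Python deques stand in for the per-level circular buffers. It shows the one behavior the abstract spells out: items spill to the memory level when the LLC level is full and move back as soon as the LLC level has free space, while the two levels together behave as a single FIFO queue.

```python
from collections import deque


class TwoLevelBuffer:
    """Toy model (hypothetical names) of a FIFO split into an LLC level and a memory level."""

    def __init__(self, llc_capacity, mem_capacity):
        self.llc_capacity = llc_capacity
        self.mem_capacity = mem_capacity
        self.llc = deque()  # head of the queue; stands in for the LLC-level circular buffer
        self.mem = deque()  # overflow tail; stands in for the memory-level circular buffer

    def push(self, item):
        # Once anything has spilled, new items must queue behind it to keep FIFO order.
        if not self.mem and len(self.llc) < self.llc_capacity:
            self.llc.append(item)
        elif len(self.mem) < self.mem_capacity:
            self.mem.append(item)  # LLC level is full: redirect to the memory level
        else:
            raise OverflowError("both levels are full")

    def pop(self):
        # Reads are always served from the LLC level.
        if not self.llc:
            raise IndexError("buffer is empty")
        item = self.llc.popleft()
        self._refill()
        return item

    def _refill(self):
        # Load spilled items back while the LLC level has free space.
        while self.mem and len(self.llc) < self.llc_capacity:
            self.llc.append(self.mem.popleft())


buf = TwoLevelBuffer(llc_capacity=2, mem_capacity=4)
for i in range(5):
    buf.push(i)  # items 2, 3 and 4 spill to the memory level
drained = [buf.pop() for _ in range(5)]
print(drained)  # [0, 1, 2, 3, 4]
```

In this model a producer blocks (here, gets an error) only when both levels are full, rather than as soon as the small LLC level fills up. That is the intuition behind the abstract's claim that back memory relieves buffer space contention.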
Notes
For simplicity, we use CUDA terminology.
Acknowledgements
This project is supported by the National Natural Science Foundation of China (Grant No. 61379035), the National Natural Science Foundation of Zhejiang Province, China (Grant No. LY14F020005) and the National Natural Science Foundation of Zhejiang Province, China (Grant No. LQ14F02001).
About this article
Cite this article
Yu, L., Pei, Y., Chen, T. et al. Enable back memory and global synchronization on LLC buffer. J Supercomput 73, 5414–5439 (2017). https://doi.org/10.1007/s11227-017-2093-8
DOI: https://doi.org/10.1007/s11227-017-2093-8