Enable back memory and global synchronization on LLC buffer

Licheng Yu¹,
Yulong Pei¹,
Tianzhou Chen¹,
Xueqing Lou¹,
Minghui Wu² &
…
Tiefei Zhang³

193 Accesses
Explore all metrics

Abstract

The last-level cache (LLC) shared by heterogeneous processors such as CPU and general-purpose graphics processing unit (GPGPU) brings new opportunities to optimize data sharing among them. Previous work introduces the LLC buffer, which uses part of the LLC storage as a FIFO buffer to enable data sharing between CPU and GPGPU with negligible management overhead. However, the baseline LLC buffer’s capacity is limited and can lead to deadlock when the buffer is full. It also relies on inefficient CPU kernel relaunch and high overhead atomic operations on GPGPU for global synchronization. These limitations motivate us to enable back memory and global synchronization on the baseline LLC buffer and make it more practical. The back memory divides the buffer storage into two levels. While they are managed as a single queue, the data storage in each level is managed as individual circular buffer. The data are redirected to the memory level when the LLC level is full, and are loaded back to the LLC level when it has free space. The case study of n-queen shows that the back memory has a comparative performance with a LLC buffer of infinite LLC level. On the contrary, LLC buffer without back memory exhibits 10% performance degradation incurred by buffer space contention. The global synchronization is enabled by peeking the data about to be read from the buffer. Any request to read the data in LLC buffer after the global barrier is allowed only when all the threads reach the barrier. We adopt breadth-first search (BFS) as a case study and compare the LLC buffer with an optimized implementation of BFS on GPGPU. The results show the LLC buffer has speedup of 1.70 on average. The global synchronization time on GPGPU and CPU is decreased to 38 and 60–5%, respectively.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Compressed page walk cache

Article 11 November 2021

I/O Acceleration via Multi-Tiered Data Buffering and Prefetching

Article 17 January 2020

Miss-aware LLC buffer management strategy based on heterogeneous multi-core

Article Open access 20 February 2019

Notes

For simplicity, we use CUDA terminologies.

References

Agarwal N, Nellans D, Ebrahimi E, Wenisch TF, Danskin J, Keckler SW (2016) Selective gpu caches to eliminate cpu-gpu hw cache coherence. In: 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp 494–506, doi:10.1109/HPCA.2016.7446089
Al-Saber N, Kulkarni M (2015) Semcache++: Semantics-aware caching for efficient multi-gpu offloading. In: Proceedings of the 29th ACM on International Conference on Supercomputing, ACM, New York, ICS ’15, pp 79–88, doi:10.1145/2751205.2751210
Amini M, Coelho F, Irigoin F, Keryell R (2013) Static Compilation Analysis for Host-Accelerator Communication Optimization, Springer Berlin Heidelberg, Heidelberg, pp 237–251. doi:10.1007/978-3-642-36036-7_16
Asmussen N, Völp M, Nöthen B, Härtig H, Fettweis G (2016) M3: A hardware/operating-system co-design to tame heterogeneous manycores. In: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, ACM, New York, ASPLOS ’16, pp 189–203, doi:10.1145/2872362.2872371
Bakhoda A, Yuan GL, Fung WWL, Wong H, Aamodt TM (2009) Analyzing cuda workloads using a detailed gpu simulator. In: Performance Analysis of Systems and Software, 2009. ISPASS 2009. IEEE International Symposium on, pp 163–174, doi:10.1109/ISPASS.2009.4919648
Binkert N, Beckmann B, Black G, Reinhardt SK, Saidi A, Basu A, Hestness J, Hower DR, Krishna T, Sardashti S, Sen R, Sewell K, Shoaib M, Vaish N, Hill MD, Wood DA (2011) The gem5 simulator. SIGARCH Comput Archit News 39(2):1–7. doi:10.1145/2024716.2024718
Article Google Scholar
Dubach C, Cheng P, Rabbah R, Bacon DF, Fink SJ (2012) Compiling a high-level language for gpus: (via language support for architectures and compilers). In: Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation, ACM, New York, PLDI ’12, pp 1–12, doi:10.1145/2254064.2254066
Group KOW et al. (2008) The opencl specification. 1(29):8
Ham TJ, Aragón JL, Martonosi M (2015) Desc: Decoupled supply-compute communication management for heterogeneous architectures. In: Proceedings of the 48th International Symposium on Microarchitecture, ACM, New York, MICRO-48, pp 191–203, doi:10.1145/2830772.2830800
Harish P, Narayanan PJ (2007) High Performance Computing – HiPC 2007: 14th International Conference, Goa, India, December 18-21, 2007. Proceedings, Springer Berlin Heidelberg, Heidelberg, chap Accelerating Large Graph Algorithms on the GPU Using CUDA, pp 197–208
Hayashi A, Ishizaki K, Koblents G, Sarkar V (2015) Machine-learning-based performance heuristics for runtime cpu/gpu selection. In: Proceedings of the Principles and Practices of Programming on The Java Platform, ACM, New York, PPPJ ’15, pp 27–36, doi:10.1145/2807426.2807429
Ishizaki K, Hayashi A, Koblents G, Sarkar V (2015) Compiling and optimizing java 8 programs for gpu execution. In: 2015 International Conference on Parallel Architecture and Compilation (PACT), pp 419–431, doi:10.1109/PACT.2015.46
Jablin TB, Prabhu P, Jablin JA, Johnson NP, Beard SR, August DI (2011) Automatic cpu-gpu communication management and optimization. In: Proceedings of the 32Nd ACM SIGPLAN Conference on Programming Language Design and Implementation, ACM, New York, PLDI ’11, pp 142–151, doi:10.1145/1993498.1993516
Jablin TB, Jablin JA, Prabhu P, Liu F, August DI (2012) Dynamically managed data for cpu-gpu architectures. In: Proceedings of the Tenth International Symposium on Code Generation and Optimization, ACM, New York, CGO ’12, pp 165–174, doi:10.1145/2259016.2259038
Kato S, McThrow M, Maltzahn C, Brandt S (2012) Gdev: First-class gpu resource management in the operating system. Presented as part of the 2012 USENIX Annual Technical Conference (USENIX ATC 12). USENIX, Boston, pp 401–412
Kato S, Aumiller J, Brandt S (2013) Zero-copy i/o processing for low-latency gpu computing. In: Proceedings of the ACM/IEEE 4th International Conference on Cyber-Physical Systems, ACM, New York, ICCPS ’13, pp 170–178, doi:10.1145/2502524.2502548.
Lee H, Brown KJ, Sujeeth AK, Rompf T, Olukotun K (2014) Locality-aware mapping of nested parallel patterns on gpus. In: Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, IEEE Computer Society, Washington, MICRO-47, pp 63–74, doi:10.1109/MICRO.2014.23.
Licheng Y, Yulong P, Tianzhou C, Xueqing L, Minghui W, Tiefei Z (2016) LLC buffer for arbitrary data sharing in heterogeneous systems. In: High Performance Computing and Communications; IEEE 14th International Conference on Smart City; IEEE 2nd International Conference on Data Science and Systems (HPCC/SmartCity/DSS), 2016 IEEE 18th International Conference on, IEEE, pp 260–267
Luo L, Wong M, Hwu Wm (2010) An effective gpu implementation of breadth-first search. In: Proceedings of the 47th Design Automation Conference, ACM, New York, DAC ’10, pp 52–55, doi:10.1145/1837274.1837289
Margiolas C, O’Boyle MFP (2014) Portable and transparent host-device communication optimization for gpgpu environments. In: Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization, ACM, New York, CGO ’14, pp 55:55–55:65, doi:10.1145/2544137.2544156
Nvidia C (2008) Cuda programming guide
Pai S, Govindarajan R, Thazhuthaveetil MJ (2012) Fast and efficient automatic memory management for gpus using compiler-assisted runtime coherence scheme. In: Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, ACM, New York, PACT ’12, pp 33–42, doi:10.1145/2370816.2370824
Phothilimthana PM, Ansel J, Ragan-Kelley J, Amarasinghe S (2013) Portable performance on heterogeneous architectures. In: Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ACM, New York, ASPLOS ’13, pp 431–444, doi:10.1145/2451116.2451162.
Ren B, Ravi N, Yang Y, Feng M, Agrawal G, Chakradhar S (2016) Automatic and Efficient Data Host-Device Communication for Many-Core Coprocessors, Springer International Publishing, Cham, pp 173–190. doi:10.1007/978-3-319-29778-1_11
Richards M (1997) Backtracking algorithms in MCPL using bit patterns and recursion. Citeseer
Stratton JA, Rodrigues C, Sung I, Obeid N, Chang L, Anssari N, Liu G, Hwu W (2012) The parboil technical report. Tech. rep., IMPACT Technical Report (IMPACT-12-01), University of Illinois Urbana-Champaign
Thoziyoor S, Muralimanohar N, Ahn JH, Jouppi NP (2008) Cacti 5.1. Tech. rep., Technical Report HPL-2008-20, HP Labs
Wang Z, Grewe D, O’boyle MFP (2014) Automatic and portable mapping of data parallel programs to opencl for gpu-based heterogeneous systems. ACM Trans Archit Code Optim 11(4):42:1–42:26, doi:10.1145/2677036
Wolf C, Glaser J, Kepler J (2013) Yosys-a free verilog synthesis suite. In: Proceedings of the 21st Austrian Workshop on Microelectronics (Austrochip)
Xiao S, c Feng W (2010) Inter-block gpu communication via fast barrier synchronization. In: Parallel Distributed Processing (IPDPS), 2010 IEEE International Symposium on, pp 1–12, doi:10.1109/IPDPS.2010.5470477

Download references

Acknowledgements

This project is supported by the National Natural Science Foundation of China (Grant No. 61379035), the National Natural Science Foundation of Zhejiang Province, China (Grant No. LY14F020005) and the National Natural Science Foundation of Zhejiang Province, China (Grant No. LQ14F02001).

Author information

Authors and Affiliations

College of Computer Science and Technology, Zhejiang University, Hangzhou, Zhejiang, China
Licheng Yu, Yulong Pei, Tianzhou Chen & Xueqing Lou
School of Computer and Computing Science, Zhejiang University City College, Hangzhou, Zhejiang, China
Minghui Wu
School of Computer Science and Information Engineering, Zhejiang Gongshang University, Hangzhou, Zhejiang, China
Tiefei Zhang

Authors

Licheng Yu
View author publications
You can also search for this author in PubMed Google Scholar
Yulong Pei
View author publications
You can also search for this author in PubMed Google Scholar
Tianzhou Chen
View author publications
You can also search for this author in PubMed Google Scholar
Xueqing Lou
View author publications
You can also search for this author in PubMed Google Scholar
Minghui Wu
View author publications
You can also search for this author in PubMed Google Scholar
Tiefei Zhang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Licheng Yu.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Yu, L., Pei, Y., Chen, T. et al. Enable back memory and global synchronization on LLC buffer. J Supercomput 73, 5414–5439 (2017). https://doi.org/10.1007/s11227-017-2093-8

Download citation

Published: 15 June 2017
Issue Date: December 2017
DOI: https://doi.org/10.1007/s11227-017-2093-8

Enable back memory and global synchronization on LLC buffer

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Compressed page walk cache

I/O Acceleration via Multi-Tiered Data Buffering and Prefetching

Miss-aware LLC buffer management strategy based on heterogeneous multi-core

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Subscribe and save

Buy Now

Navigation

Enable back memory and global synchronization on LLC buffer

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Compressed page walk cache

I/O Acceleration via Multi-Tiered Data Buffering and Prefetching

Miss-aware LLC buffer management strategy based on heterogeneous multi-core

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now

Search

Navigation