CABARRE: Request Response Arbitration for Shared Cache Management

Published: 09 September 2023

Abstract

Modern multi-processor systems-on-chip (MPSoCs) feature caches shared by multiple cores. These shared caches receive requests issued by the processor cores. Requests that miss in the cache may generate responses, which are received from the lower level of the memory hierarchy and written into the cache. Outstanding requests and responses contend for the shared cache bandwidth. To mitigate the impact of this contention on overall system performance, an efficient request and response arbitration policy is needed.
Prior research on shared cache management has neglected the additional contention caused by responses written to the cache. We propose CABARRE, a novel request and response arbitration policy for shared caches that improves overall system performance. CABARRE improves performance by 23% on average across a set of SPEC workloads compared to straightforward adaptations of state-of-the-art solutions.
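The contention the abstract describes can be illustrated with a toy model: a shared cache with a single port that, each cycle, must choose between serving a pending core request and writing back an incoming response. The sketch below is not the CABARRE policy itself; the `arbitrate` function, its queues, and the two baseline policies (`requests_first` and `round_robin`) are hypothetical names introduced only to show how the arbitration order changes service timing.

```python
from collections import deque


def arbitrate(requests, responses, cycles, policy="requests_first"):
    """Toy single-port cache arbiter: at most one queue entry is served per cycle.

    `requests` and `responses` are sequences of opaque transaction labels.
    Returns the labels in the order they were granted the cache port.
    """
    req_q, rsp_q = deque(requests), deque(responses)
    served = []
    for cycle in range(cycles):
        if policy == "requests_first":
            # Responses are only served when no request is pending.
            pick = req_q if req_q else rsp_q
        else:  # "round_robin": alternate priority between the two queues
            if cycle % 2 == 0:
                pick = req_q if req_q else rsp_q
            else:
                pick = rsp_q if rsp_q else req_q
        if not pick:
            continue  # both queues empty this cycle
        served.append(pick.popleft())
    return served
```

Under `requests_first`, responses are starved until the request queue drains; round-robin interleaves them, trading request latency for response throughput. Balancing this kind of trade-off between outstanding requests and responses is precisely what an arbitration policy at the shared cache must do.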


Cited By

  • (2024) POEM: Performance Optimization and Endurance Management for Non-volatile Caches. ACM Transactions on Design Automation of Electronic Systems 29, 5 (2024), 1–36. DOI: 10.1145/3653452. Online publication date: 27-Mar-2024.
  • (2024) FASTA: Revisiting Fully Associative Memories in Computer Microarchitecture. IEEE Access 12 (2024), 13923–13943. DOI: 10.1109/ACCESS.2024.3355961. Online publication date: 2024.


Published In

ACM Transactions on Embedded Computing Systems, Volume 22, Issue 5s
Special Issue ESWEEK 2023
October 2023, 1394 pages
ISSN: 1539-9087
EISSN: 1558-3465
DOI: 10.1145/3614235
Editor: Tulika Mitra

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 September 2023
Accepted: 13 July 2023
Revised: 02 June 2023
Received: 23 March 2023
Published in TECS Volume 22, Issue 5s


Author Tags

  1. Shared caches
  2. requests
  3. responses
  4. arbitration
  5. cache bandwidth
  6. multi-core systems

Qualifiers

  • Research-article

Funding Sources

  • Semiconductor Research Corporation
