Optimizing Replication, Communication, and Capacity Allocation in CMPs

Authors:

Zeshan Chishti,

Michael D. Powell,

T. N. Vijaykumar

ACM SIGARCH Computer Architecture News, Volume 33, Issue 2

Pages 357 - 368

https://doi.org/10.1145/1080695.1070001

Published: 01 May 2005

Abstract

Chip multiprocessors (CMPs) substantially increase capacity pressure on the on-chip memory hierarchy while requiring fast access. Neither private nor shared caches can provide both large capacity and fast access in CMPs. We observe that compared to symmetric multiprocessors (SMPs), CMPs change the latency-capacity tradeoff in two significant ways. We propose three novel ideas to exploit the changes: (1) Though placing copies close to requestors allows fast access for read-only sharing, the copies also reduce the already-limited on-chip capacity in CMPs. We propose controlled replication to reduce capacity pressure by not making extra copies in some cases, and obtaining the data from an existing on-chip copy. This option is not suitable for SMPs because obtaining data from another processor is expensive and capacity is not limited to on-chip storage. (2) Unlike SMPs, CMPs allow fast on-chip communication between processors for read-write sharing. Instead of incurring slow access to read-write shared data through coherence misses as do SMPs, we propose in-situ communication to provide fast access without making copies or incurring coherence misses. (3) Accessing neighbors' caches is not as expensive in CMPs as it is in SMPs. We propose capacity stealing in which private data that exceeds a core's capacity is placed in a neighboring cache with less capacity demand. To incorporate our ideas, we use a hybrid of private, per-processor tag arrays and a shared data array. Because the shared data array is slow, we employ non-uniform access and distance associativity from previous proposals to hold frequently-accessed data in regions close to the requestor. We extend the previously-proposed Non-uniform access with Replacement And Placement usIng Distance associativity (NuRAPID) to CMPs, and call our cache CMP-NuRAPID.
Our results show that for a 4-core CMP with 8 MB cache, CMP-NuRAPID improves performance by 13% over a shared cache and 8% over private caches for three commercial multithreaded workloads.
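
As a rough illustration of the controlled-replication idea and the hybrid of private tag arrays with a shared data array, the toy Python model below may help. It is a sketch under stated assumptions, not the authors' design: the class `ToyCMPCache`, its method names, and in particular the rule "point to the existing copy on first use, replicate on a later reuse" are hypothetical choices made for this example, since the abstract only says that extra copies are avoided "in some cases". Capacity limits, replacement, and coherence for writes are deliberately left out.

```python
class ToyCMPCache:
    """Toy model: per-core tag arrays that point into one shared data array.

    Assumption for illustration only: a core that finds an on-chip copy
    elsewhere first installs a tag pointer to it (no new copy), and makes
    its own copy only if it reads the block again.
    """

    def __init__(self, num_cores):
        # tags[core][addr] -> index of the data-array frame holding addr
        self.tags = [dict() for _ in range(num_cores)]
        # frames[i] = (addr, home_core): one entry per physical on-chip copy
        self.frames = []

    def copies(self, addr):
        """Number of physical copies of addr in the shared data array."""
        return sum(1 for a, _ in self.frames if a == addr)

    def _allocate(self, core, addr):
        """Place a copy of addr in a frame owned by core and tag it."""
        self.frames.append((addr, core))
        self.tags[core][addr] = len(self.frames) - 1

    def read(self, core, addr):
        """Serve a read and return a label describing how it was served."""
        if addr in self.tags[core]:
            _, home = self.frames[self.tags[core][addr]]
            if home == core:
                return "local hit"
            # Reuse of a remotely held block: now make a nearby copy.
            self._allocate(core, addr)
            return "replicated"
        # Tag miss: look for an existing on-chip copy before going off chip.
        for i, (a, _) in enumerate(self.frames):
            if a == addr:
                self.tags[core][addr] = i  # pointer only, no extra copy
                return "remote pointer"
        self._allocate(core, addr)
        return "filled from memory"


if __name__ == "__main__":
    cache = ToyCMPCache(num_cores=2)
    for core in (0, 1, 1, 1):
        print(core, cache.read(core, 0x40), "copies:", cache.copies(0x40))
```

In this sketch a block that is read only once by a second core never consumes a second frame, which is the capacity saving the abstract attributes to controlled replication; a block that is reused pays one slower access before getting a nearby copy.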

Information & Contributors

Information

Published In

ACM SIGARCH Computer Architecture News Volume 33, Issue 2

ISCA 2005

May 2005

531 pages

ISSN:0163-5964

DOI:10.1145/1080695

ISCA '05: Proceedings of the 32nd annual international symposium on Computer Architecture
June 2005
541 pages
ISBN:076952270X

Copyright © 2005 Authors.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 May 2005

Published in SIGARCH Volume 33, Issue 2

Qualifiers

Article

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 140
Total Downloads: 22
Downloads (Last 12 months): 0
Downloads (Last 6 weeks): 0

Reflects downloads up to 14 Nov 2024

Citations

Cited By

Știrb I, Gillich G (2023). A Low-Level Virtual Machine Just-In-Time Prototype for Running an Energy-Saving Hardware-Aware Mapping Algorithm on C/C++ Applications That Use Pthreads. Energies 16:19 (6781). Online publication date: 23-Sep-2023.
https://doi.org/10.3390/en16196781
Cruz E, Diener M, Pilla L, Navaux P (2021). Online Thread and Data Mapping Using a Sharing-Aware Memory Management Unit. ACM Transactions on Modeling and Performance Evaluation of Computing Systems 5:4 (1-28). Online publication date: 21-Jan-2021.
https://dl.acm.org/doi/10.1145/3433687
Puche J, Petit S, Gómez M, Sahuquillo J (2020). An efficient cache flat storage organization for multithreaded workloads for low power processors. Future Generation Computer Systems 110 (1037-1054). Online publication date: Sep-2020.
https://doi.org/10.1016/j.future.2019.11.024
Zhao X, Adileh A, Yu Z, Wang Z, Jaleel A, Eeckhout L, Manne S, Hunter H, Altman E (2019). Adaptive memory-side last-level GPU caching. Proceedings of the 46th International Symposium on Computer Architecture (411-423). Online publication date: 22-Jun-2019.
https://dl.acm.org/doi/10.1145/3307650.3322235
Puche J, Petit S, Sahuquillo J, Gómez M (2019). FOS: a low-power cache organization for multicores. The Journal of Supercomputing 75:10 (6542-6573). Online publication date: 1-Oct-2019.
https://dl.acm.org/doi/10.1007/s11227-019-02858-x
Das P (2019). Cache Memory Architectures for Handling Big Data Applications: A Survey. Smart Computing Paradigms: New Progresses and Challenges (211-220). Online publication date: 1-Dec-2019.
https://doi.org/10.1007/978-981-13-9680-9_18
H. M. Cruz E, Diener M, O. A. Navaux P (2018). Sharing-Aware Mapping and Parallel Architectures. Thread and Data Mapping for Multicore Systems (9-17). Online publication date: 5-Jul-2018.
https://doi.org/10.1007/978-3-319-91074-1_2
H. M. Cruz E, Diener M, O. A. Navaux P (2018). Introduction. Thread and Data Mapping for Multicore Systems (1-8). Online publication date: 5-Jul-2018.
https://doi.org/10.1007/978-3-319-91074-1_1
Diener M, Cruz E, Alves M, Navaux P, Koren I (2016). Affinity-Based Thread and Data Mapping in Shared Memory Systems. ACM Computing Surveys 49:4 (1-38). Online publication date: 5-Dec-2016.
https://dl.acm.org/doi/10.1145/3006385
Cruz E, Diener M, Pilla L, Navaux P (2016). Hardware-Assisted Thread and Data Mapping in Hierarchical Multicore Architectures. ACM Transactions on Architecture and Code Optimization 13:3 (1-28). Online publication date: 17-Sep-2016.
https://dl.acm.org/doi/10.1145/2975587
