Optimizing Replication, Communication, and Capacity Allocation in CMPs

Published: 01 May 2005

Abstract

Chip multiprocessors (CMPs) substantially increase capacity pressure on the on-chip memory hierarchy while requiring fast access. Neither private nor shared caches can provide both large capacity and fast access in CMPs. We observe that compared to symmetric multiprocessors (SMPs), CMPs change the latency-capacity tradeoff in two significant ways. We propose three novel ideas to exploit the changes: (1) Though placing copies close to requestors allows fast access for read-only sharing, the copies also reduce the already-limited on-chip capacity in CMPs. We propose controlled replication to reduce capacity pressure by not making extra copies in some cases, and obtaining the data from an existing on-chip copy. This option is not suitable for SMPs because obtaining data from another processor is expensive and capacity is not limited to on-chip storage. (2) Unlike SMPs, CMPs allow fast on-chip communication between processors for read-write sharing. Instead of incurring slow access to read-write shared data through coherence misses as do SMPs, we propose in-situ communication to provide fast access without making copies or incurring coherence misses. (3) Accessing neighbors' caches is not as expensive in CMPs as it is in SMPs. We propose capacity stealing in which private data that exceeds a core's capacity is placed in a neighboring cache with less capacity demand. To incorporate our ideas, we use a hybrid of private, per-processor tag arrays and a shared data array. Because the shared data array is slow, we employ non-uniform access and distance associativity from previous proposals to hold frequently-accessed data in regions close to the requestor. We extend the previously-proposed Non-uniform access with Replacement And Placement usIng Distance associativity (NuRAPID) to CMPs, and call our cache CMP-NuRAPID.
Our results show that for a 4-core CMP with 8 MB cache, CMP-NuRAPID improves performance by 13% over a shared cache and 8% over private caches for three commercial multithreaded workloads.
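
As an illustration only, the controlled-replication idea can be sketched as a toy model in Python. The abstract does not say in which cases the extra copy is skipped, so the reuse threshold `REPLICATE_AFTER`, the `Block` class, and the `read` function below are all assumptions made for this sketch and are not taken from the paper.

```python
from dataclasses import dataclass, field

# Assumed policy for this sketch: only spend capacity on a nearby copy
# once the same core has read the block this many times.
REPLICATE_AFTER = 2


@dataclass
class Block:
    home: int                                    # core whose nearby region holds the data
    replicas: set = field(default_factory=set)   # cores that hold an extra nearby copy
    reads: dict = field(default_factory=dict)    # per-core count of remote reads


def read(block: Block, core: int) -> str:
    """Return 'local' if the core has a nearby copy, else 'remote'."""
    if core == block.home or core in block.replicas:
        return "local"
    # No nearby copy: serve the read from the existing on-chip copy.
    block.reads[core] = block.reads.get(core, 0) + 1
    if block.reads[core] >= REPLICATE_AFTER:
        # Reuse observed, so a replica is now worth the capacity it costs.
        block.replicas.add(core)
    return "remote"
```

In this sketch a block that a core reads only once never consumes extra capacity, while a block that is reused is copied close to the reader so that later reads are fast.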



Published In

ACM SIGARCH Computer Architecture News, Volume 33, Issue 2
ISCA 2005
May 2005
531 pages
ISSN: 0163-5964
DOI: 10.1145/1080695

  • ISCA '05: Proceedings of the 32nd Annual International Symposium on Computer Architecture
    June 2005
    541 pages
    ISBN: 076952270X

Publisher

Association for Computing Machinery

New York, NY, United States


