Forwardflow: a scalable core for power-constrained CMPs

Published: 19 June 2010

Abstract

Chip Multiprocessors (CMPs) are now commodity hardware, but commoditization of parallel software remains elusive. In the near term, the current trend of increased core-per-socket count will continue, despite a lack of parallel software to exercise the hardware. Future CMPs must deliver thread-level parallelism when software provides threads to run, but must also continue to deliver performance gains for single threads by exploiting instruction-level parallelism and memory-level parallelism. However, power limitations will prevent conventional cores from exploiting both simultaneously.
This work presents the Forwardflow Architecture, which can scale its execution logic up to run single threads, or down to run multiple threads in a CMP. Forwardflow dynamically builds an explicit internal dataflow representation from a conventional instruction set architecture, using forward dependence pointers to guide instruction wakeup, selection, and issue. Forwardflow's backend is organized into discrete units that can be individually (de-)activated, allowing each core's performance to be scaled by system software at the architectural level.
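
The idea of wakeup guided by forward dependence pointers can be illustrated with a small software model. This is only a sketch under stated assumptions, not the paper's hardware design: the `Instr` record, the register names, and the functions `build_forward_pointers` and `issue_order` are hypothetical, every instruction is assumed to complete in one step, and register reuse is assumed to be handled by linking each consumer to its most recent producer.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Instr:
    """Hypothetical instruction record used only for this illustration."""
    op: str
    dest: Optional[str]
    srcs: Tuple[str, ...]
    # Forward pointers: indices of later instructions that consume this result.
    successors: List[int] = field(default_factory=list)
    # Number of source operands still waiting on an earlier producer.
    pending: int = 0


def build_forward_pointers(program: List[Instr]) -> None:
    """Link each producer to its consumers while scanning in program order."""
    last_writer: Dict[str, int] = {}  # register -> index of most recent producer
    for idx, ins in enumerate(program):
        ins.successors.clear()
        ins.pending = 0
        for reg in ins.srcs:
            producer = last_writer.get(reg)
            if producer is not None:
                program[producer].successors.append(idx)
                ins.pending += 1
        if ins.dest is not None:
            last_writer[ins.dest] = idx


def issue_order(program: List[Instr]) -> List[int]:
    """Return one legal issue order, waking consumers by following pointers."""
    build_forward_pointers(program)
    ready = deque(i for i, ins in enumerate(program) if ins.pending == 0)
    order: List[int] = []
    while ready:
        i = ready.popleft()
        order.append(i)
        # Wakeup: only the successors of the finished instruction are visited,
        # so no search over the whole instruction window is needed.
        for succ in program[i].successors:
            program[succ].pending -= 1
            if program[succ].pending == 0:
                ready.append(succ)
    return order


program = [
    Instr("load", "r1", ()),
    Instr("load", "r2", ()),
    Instr("add", "r3", ("r1", "r2")),
    Instr("mul", "r4", ("r3", "r1")),
    Instr("sub", "r5", ("r2", "r2")),
]
print(issue_order(program))  # -> [0, 1, 2, 4, 3]
```

In this model the `sub` issues ahead of the `mul` because its only producer finishes first, which is the kind of out-of-order behavior that pointer-guided wakeup is meant to expose.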
On single threads, Forwardflow core scaling yields a mean runtime reduction of 21% for a 37% increase in power consumption. For multithreaded workloads, a Forwardflow-based CMP allows system software to select the performance point that best matches available power.
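
As a rough back-of-the-envelope reading of the single-thread figures above, the reported means can be combined into a speedup and a relative energy estimate. This sketch assumes that the 37% figure is average power over the whole run and that both percentages are relative to the same baseline; the abstract does not state either, and combining means across workloads is only approximate. The function name `scaling_tradeoff` is hypothetical.

```python
def scaling_tradeoff(runtime_reduction: float, power_increase: float) -> dict:
    """Derive speedup, energy, and energy-delay relative to the baseline core.

    Assumes energy = average power * runtime (an assumption of this sketch).
    """
    rel_runtime = 1.0 - runtime_reduction   # e.g. 0.79 for a 21% reduction
    rel_power = 1.0 + power_increase        # e.g. 1.37 for a 37% increase
    return {
        "speedup": 1.0 / rel_runtime,
        "relative_energy": rel_power * rel_runtime,
        "relative_energy_delay": rel_power * rel_runtime ** 2,
    }


result = scaling_tradeoff(runtime_reduction=0.21, power_increase=0.37)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions the scaled-up core is about 1.27x faster and uses roughly 8% more energy per run, while the energy-delay product drops to about 0.86 of the baseline.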

Published In

ACM SIGARCH Computer Architecture News, Volume 38, Issue 3
ISCA '10
June 2010
508 pages
ISSN: 0163-5964
DOI: 10.1145/1816038
  • ISCA '10: Proceedings of the 37th annual international symposium on Computer architecture
    June 2010
    520 pages
    ISBN: 9781450300537
    DOI: 10.1145/1815961

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. chip multiprocessor (cmp)
  2. power
  3. scalable core

Qualifiers

  • Research-article

Cited By

  • (2016) A Survey of Techniques for Architecting and Managing Asymmetric Multicore Processors. ACM Computing Surveys 48:3 (1-38). DOI: 10.1145/2856125. Online publication date: 8-Feb-2016
  • (2021) Omegaflow. Proceedings of the 35th ACM International Conference on Supercomputing (152-163). DOI: 10.1145/3447818.3460367. Online publication date: 3-Jun-2021
  • (2017) Dynamic power management techniques in multi-core architectures: A survey study. Ain Shams Engineering Journal 8:3 (445-456). DOI: 10.1016/j.asej.2015.08.010. Online publication date: Sep-2017
  • (2016) FXA: Executing Instructions in Front-End for Energy Efficiency. IEICE Transactions on Information and Systems E99.D:4 (1092-1107). DOI: 10.1587/transinf.2015EDP7316. Online publication date: 2016
  • (2016) A Heterogeneous Von Neumann/Explicit Dataflow Processor. IEEE Micro 36:3 (20-30). DOI: 10.1109/MM.2016.34. Online publication date: May-2016
  • (2015) Exploring the potential of heterogeneous von neumann/dataflow execution models. ACM SIGARCH Computer Architecture News 43:3S (298-310). DOI: 10.1145/2872887.2750380. Online publication date: 13-Jun-2015
  • (2015) Exploring the potential of heterogeneous von neumann/dataflow execution models. Proceedings of the 42nd Annual International Symposium on Computer Architecture (298-310). DOI: 10.1145/2749469.2750380. Online publication date: 13-Jun-2015
  • (2015) Chrysso. Proceedings of the 12th ACM International Conference on Computing Frontiers (1-8). DOI: 10.1145/2742854.2742885. Online publication date: 6-May-2015
  • (2014) A Front-end Execution Architecture for High Energy Efficiency. Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (419-431). DOI: 10.1109/MICRO.2014.35. Online publication date: 13-Dec-2014
  • (2014) Performance analysis and structured parallelisation of the space–time adaptive processing computational kernel on multi-core architectures. International Journal of Parallel, Emergent and Distributed Systems 29:5 (460-498). DOI: 10.1080/17445760.2014.885967. Online publication date: 11-Feb-2014
