Abstract
Continuing progress in silicon technologies and integration levels makes it possible to build complete end-user systems consisting of an extremely high number of cores on a single chip, targeting either embedded or high-performance computing. However, without new paradigms for energy- and thermally-efficient design, producing information and communication systems capable of meeting the computing, storage and communication demands of emerging applications will be unlikely. The broad topic of power and thermal management of massive multicore chips is actively being pursued by a number of researchers worldwide, from a variety of perspectives, ranging from workload modeling to efficient on-chip network infrastructure design to resource allocation. Successful solutions will likely adopt and encompass elements from all, or at least several, levels of abstraction. Starting from these ideas, we consider a holistic approach to establishing the Power-Thermal-Performance (PTP) trade-offs of massive multicore processors by considering three inter-related but distinct angles, viz., on-chip traffic modeling, novel Networks-on-Chip (NoC) architecture, and resource allocation/mapping.
On-line workload learning (mathematical modeling, analysis and prediction) is fundamental for endowing many-core platforms with self-optimizing capabilities [2][3]. This built-in intelligence calls for monitoring the interactions between the set of running applications and the architectural (core and uncore) components, constructing mathematical models of the observed workloads online, and determining the best resource allocation decisions given the limited amount of information about user-to-application-to-system dynamics. However, workload modeling is not a trivial task.
Centralized approaches for analyzing and mining workloads can easily run into scalability issues with an increasing number of monitored processing elements and uncore (routers and interface queues) components, since they either lead to significant traffic and energy overhead or require dedicated system infrastructure. In contrast, learning the most compact mathematical representation of the workload can be done in a distributed manner (within the proximity of the observation/sensing) as long as the mathematical techniques are flexible and exploit the mathematical characteristics of the workloads (degree of periodicity, degree of fractality and temporal scaling) [3]. Note that this strategy does not postulate the mathematical expressions a priori (e.g., a specific order of the autoregressive moving average (ARMA) model). Instead, the periodicity and fractality of the observed computation metrics (e.g., instructions per cycle, last-level cache misses, branch prediction successes and failures, TLB accesses/misses) and communication metrics (request-reply latency, queue utilization, memory queuing delay) dictate the number of coefficients, the linearity or nonlinearity of the dynamical state equations, and the noise terms (e.g., Gaussian distributed) [3]. In other words, dedicated minimal logic can be allocated to interact with the local sensors, analyze the incoming workload at run-time, determine the required number of model parameters and their values as a function of the workload characteristics, and communicate only the workload model parameters to a hierarchical optimization module (autonomous control architecture). For instance, capturing the fractal characteristics of the core and uncore workloads led to the development of a more efficient power management strategy [1] than those based on PID or model predictive control.
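To make this concrete, the following sketch shows one way such minimal logic could classify a locally monitored trace as short- or long-memory and size the model accordingly; it uses standard rescaled-range (R/S) analysis of the Hurst exponent, and the detection statistic, threshold, and decision rule are illustrative assumptions rather than the exact procedure of [3].

```python
import numpy as np

def hurst_rs(trace, min_chunk=8):
    """Estimate the Hurst exponent of a 1-D workload trace via rescaled-range (R/S) analysis."""
    trace = np.asarray(trace, dtype=float)
    sizes, rs_values = [], []
    n, size = len(trace), min_chunk
    while size <= n // 2:
        rs_per_chunk = []
        for start in range(0, n - size + 1, size):
            chunk = trace[start:start + size]
            dev = np.cumsum(chunk - chunk.mean())
            r, s = dev.max() - dev.min(), chunk.std()   # range of cumulative deviations, std. dev.
            if s > 0:
                rs_per_chunk.append(r / s)
        if rs_per_chunk:
            sizes.append(size)
            rs_values.append(np.mean(rs_per_chunk))
        size *= 2
    # H is the slope of log(R/S) versus log(window size)
    h, _ = np.polyfit(np.log(sizes), np.log(rs_values), 1)
    return h

def pick_model(trace, long_memory_threshold=0.6):
    """Toy decision rule: long-memory (fractal) model if H exceeds a threshold, short-memory otherwise."""
    h = hurst_rs(trace)
    kind = ("long-memory (e.g., fractional/multi-fractal state model)"
            if h > long_memory_threshold else "short-memory (e.g., low-order ARMA)")
    return kind, h

# Example: per-epoch last-level-cache-miss counts collected by a local monitor (synthetic,
# strongly persistent trace used here only to exercise the decision rule)
rng = np.random.default_rng(0)
trace = np.cumsum(rng.normal(size=1024)) + rng.normal(size=1024)
model, h = pick_model(trace)
print(f"Estimated H = {h:.2f} -> use a {model}")
```

Only the chosen model family and its fitted coefficients would then be forwarded to the hierarchical optimization module, rather than the raw trace.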
In order to develop a compact and accurate mathematical framework for analyzing and modeling the incoming workload, we describe a general probabilistic approach that models the statistics of the increments in the magnitude of a stochastic process (associated with a specific workload metric) and the intervals of time (inter-event times) between successive changes in the stochastic process [3]. We show that the statistics of these two components of the stochastic process allow us to derive state equations and capture either short-range or long-range memory properties. To demonstrate the benefits of this new workload modeling approach, we describe its integration into a multi-fractal optimal control framework for solving the power management problem for a 64-core NoC-based manycore platform and contrast it with mono-fractal and non-fractal schemes [3].
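The sketch below illustrates the flavor of this decomposition under simplifying assumptions: a workload signal is regenerated by resampling observed increment magnitudes and inter-event times, with heavy-tailed inter-event times standing in for long-range memory and exponential ones for short-range memory. It is a compound-renewal-style toy, not the actual state-equation derivation of [3].

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_workload(increments, inter_event_times, horizon):
    """Regenerate a workload signal by resampling observed increment magnitudes and
    inter-event times (illustrative sketch only)."""
    t, level = 0.0, 0.0
    times, levels = [t], [level]
    while t < horizon:
        t += rng.choice(inter_event_times)   # waiting time until the next change
        level += rng.choice(increments)      # jump in the observed metric
        times.append(t)
        levels.append(level)
    return np.array(times), np.array(levels)

# Heavy-tailed inter-event times (Pareto) mimic long-range memory;
# exponential inter-event times mimic short-range memory.
observed_increments = rng.normal(loc=0.0, scale=1.0, size=500)
long_memory_gaps = rng.pareto(a=1.5, size=500) + 1.0
short_memory_gaps = rng.exponential(scale=np.mean(long_memory_gaps), size=500)

t_lm, x_lm = simulate_workload(observed_increments, long_memory_gaps, horizon=1000.0)
t_sm, x_sm = simulate_workload(observed_increments, short_memory_gaps, horizon=1000.0)
print(f"long-memory path: {len(t_lm)} events, short-memory path: {len(t_sm)} events")
```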
A scalable, low-power, and high-bandwidth on-chip communication infrastructure is essential to sustain the predicted growth in the number of embedded cores on a single die. New interconnection fabrics are key for continued performance improvement and energy reduction of manycore chips, and an efficient and robust NoC architecture is one of the key steps towards achieving that goal. An NoC architecture that incorporates emerging interconnect paradigms will be an enabler for low-power, high-bandwidth manycore chips. Innovative interconnect paradigms based on optical technologies, RF/wireless methods, carbon nanotubes, or 3D integration are promising alternatives that may indeed overcome the obstacles that impede continued advances of the manycore paradigm. These innovations will open new opportunities for research in NoC designs with emerging interconnect infrastructures. In this regard, the wireless NoC (WiNoC) is a promising direction for designing energy-efficient multicore architectures. A WiNoC not only improves energy efficiency and performance, it also opens up opportunities for implementing power management strategies. WiNoCs enable implementation of the two most popular power management mechanisms, viz., dynamic voltage and frequency scaling (DVFS) and voltage frequency islands (VFI).
The wireless links in a WiNoC establish one-hop shortcuts between distant nodes and facilitate energy savings in data exchange [3]. These wireless shortcuts attract a significant amount of the overall traffic within the network; the detoured traffic is substantial, and the low-power wireless links enable energy savings. However, the overall energy dissipation within the network is still dominated by the data traversing the wireline links. Hence, by incorporating DVFS on these wireline links we can save more energy. Moreover, by combining DVFS with suitable congestion-aware routing, we can avoid thermal hotspots in the system [4].
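As a minimal illustration of per-link DVFS with a congestion and thermal guard, the sketch below scales a wireline link's V-f level from its measured utilization and local temperature. The V-f table, thresholds, and sensor interface are illustrative assumptions, not the policy of [4].

```python
# (voltage [V], relative frequency) pairs, highest performance first (illustrative values)
VF_LEVELS = [(1.0, 1.00), (0.9, 0.80), (0.8, 0.60), (0.7, 0.40)]

def pick_vf_level(link_utilization, local_temp_c,
                  util_up=0.75, util_down=0.35, temp_limit_c=85.0, current=0):
    """Return the index into VF_LEVELS for one wireline link.
    - Throttle toward the lowest V-f when the local temperature is too high.
    - Scale up when the link is congested; scale down when it is lightly loaded.
    """
    if local_temp_c >= temp_limit_c:
        return min(current + 1, len(VF_LEVELS) - 1)   # thermal guard: step down in V-f
    if link_utilization > util_up and current > 0:
        return current - 1                            # congested link: more bandwidth
    if link_utilization < util_down and current < len(VF_LEVELS) - 1:
        return current + 1                            # quiet link: save energy
    return current

# Example: one control epoch for three links, (utilization, temperature) read from counters/sensors
links = [(0.82, 70.0), (0.20, 65.0), (0.55, 90.0)]
for (u, t) in links:
    v, f = VF_LEVELS[pick_vf_level(u, t)]
    print(f"util={u:.2f} temp={t:.0f}C -> V={v:.2f}V, f={f:.2f}x")
```

Congestion-aware routing would complement such a controller by steering traffic away from links that are already throttled for thermal reasons.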
It should be noted that for large system sizes, the hardware overhead in terms of on-chip voltage regulators and synchronizers is much higher for DVFS than for VFI. WiNoC-enabled VFI designs mitigate some of the full-system performance degradation inherent in VFI-partitioned multicore designs, and can even eliminate it entirely for certain applications [5]. VFI-partitioned designs used in conjunction with a novel NoC architecture like the WiNoC can achieve significant energy savings while minimizing the impact on the achievable performance.
On-chip power density and temperature continue to increase due to the high integration density of nano-scale transistors and the failure of Dennard scaling as a result of diminishing voltage scaling. Hence, all computing is temperature-constrained computing, and it is therefore key to employ thermal management techniques that keep chip temperatures within safe limits while meeting constraints on spatial/temporal thermal gradients and avoiding wear-out effects [8].
We introduced the novel concept of Dark Silicon Patterning, i.e., spatio-temporal control of the power states of different cores [9]. Sophisticated patterning and thread-to-core mapping decisions are made using knowledge of process variations and of the lateral heat dissipation of power-gated cores in order to enhance the performance of multi-threaded workloads through dynamic core count scaling (DCCS). This is enabled by a lightweight online prediction of the chip's thermal profile for a given patterning candidate. We also present an enhanced temperature-aware resource management technique that, besides the active and dark states of cores, also exploits various grey states (i.e., different voltage-frequency levels) in order to achieve high performance for mixed ILP-TLP workloads under peak temperature constraints. High-ILP applications benefit from high V-f and boosting levels, while high-TLP applications benefit from a larger number of active cores.
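The sketch below conveys the idea of choosing an active/dark pattern under a thermal constraint: candidate patterns with the requested core count are scored by a toy thermal predictor that adds self-heating plus lateral coupling from active neighbours, and the coolest pattern is selected. The predictor and all constants are illustrative assumptions, not the online thermal model of [9].

```python
import itertools
import numpy as np

GRID = 4                          # 4x4 core tile (illustrative)
AMBIENT_C = 45.0
SELF_HEAT_C_PER_W = 8.0           # temperature rise per watt on the active core itself
NEIGHBOUR_HEAT_C_PER_W = 2.0      # lateral heating contributed to each adjacent core
ACTIVE_POWER_W = 3.0

def predict_peak_temp(pattern):
    """Predict the peak temperature of a 0/1 active-core pattern (GRID x GRID array)."""
    temp = np.full((GRID, GRID), AMBIENT_C)
    for r, c in itertools.product(range(GRID), repeat=2):
        if pattern[r, c]:
            temp[r, c] += SELF_HEAT_C_PER_W * ACTIVE_POWER_W
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < GRID and 0 <= cc < GRID:
                    temp[rr, cc] += NEIGHBOUR_HEAT_C_PER_W * ACTIVE_POWER_W
    return temp.max()

def best_pattern(active_count, candidates):
    """Among candidate patterns with the requested active-core count (a DCCS step),
    pick the one with the lowest predicted peak temperature."""
    feasible = [p for p in candidates if p.sum() == active_count]
    return min(feasible, key=predict_peak_temp)

# Two hand-made candidates with 8 active cores: clustered vs. checkerboard (spread-out)
clustered = np.zeros((GRID, GRID), dtype=int); clustered[:2, :] = 1
spread = np.indices((GRID, GRID)).sum(axis=0) % 2
choice = best_pattern(8, [clustered, spread])
print("clustered peak:", predict_peak_temp(clustered))
print("spread peak   :", predict_peak_temp(spread))
print("chosen pattern:\n", choice)
```

With these toy constants the spread-out pattern wins, reflecting why lateral heat dissipation into power-gated (dark) neighbours is worth exploiting when patterning.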
As the scaling trends move from multi-core to many-core processors, centralized solutions become infeasible, thereby requiring distributed techniques. In [6], we proposed an agent-based distributed temperature-aware resource management technique called TAPE. It assigns to each core a so-called agent, a software or hardware entity that acts on behalf of the core. Following the principles of economic theory, these agents negotiate with each other to trade their power budgets in order to fulfil the performance requirements of their tasks while keeping Tpeak ≤ Tcritical. In case of thermal violations, task migration or V-f throttling is triggered, and a penalty is applied to the trading process to improve subsequent decision making.
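A toy version of such budget trading is sketched below: cores whose demand exceeds their budget buy from the core with the most slack, and an agent that recently caused a thermal violation is penalized so it trades less aggressively. The bidding rule, penalty factor, and numbers are illustrative assumptions, not the actual TAPE negotiation protocol of [6].

```python
from dataclasses import dataclass

@dataclass
class CoreAgent:
    core_id: int
    budget_w: float       # current power budget of this core
    demand_w: float       # power needed to meet the task's performance requirement
    penalty: float = 1.0  # >1 after a thermal violation: makes buying more conservative

    def surplus(self):
        return self.budget_w - self.demand_w

def trade_round(agents, step_w=0.5):
    """One negotiation round: each core with a deficit buys budget from the core
    with the largest surplus; penalized agents acquire budget in smaller steps."""
    for buyer in sorted(agents, key=lambda a: a.surplus()):
        if buyer.surplus() >= 0:
            break
        seller = max(agents, key=lambda a: a.surplus())
        if seller.surplus() <= 0:
            break
        amount = min(step_w / buyer.penalty, -buyer.surplus(), seller.surplus())
        seller.budget_w -= amount
        buyer.budget_w += amount

def note_thermal_violation(agent, factor=2.0):
    """Penalize an agent whose core exceeded Tcritical, discouraging aggressive buying."""
    agent.penalty *= factor

agents = [CoreAgent(0, budget_w=3.0, demand_w=4.0),
          CoreAgent(1, budget_w=3.0, demand_w=1.5),
          CoreAgent(2, budget_w=3.0, demand_w=3.5)]
note_thermal_violation(agents[2])   # core 2 recently violated its thermal limit
for _ in range(6):
    trade_round(agents)
for a in agents:
    print(f"core {a.core_id}: budget={a.budget_w:.2f} W, demand={a.demand_w:.2f} W")
```

Because each agent only needs its own counters and a lightweight exchange with its peers, this style of negotiation scales with core count far better than a centralized controller.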