SPRINGER BRIEFS IN COMPUTER SCIENCE

Parallel Computing Hits the Power Wall:
Principles, Challenges, and a Survey of Solutions
SpringerBriefs in Computer Science
Series editors
Stan Zdonik, Brown University, Providence, RI, USA
Shashi Shekhar, University of Minnesota, Minneapolis, MN, USA
Xindong Wu, University of Vermont, Burlington, VT, USA
Lakhmi C. Jain, University of South Australia, Adelaide, SA, Australia
David Padua, University of Illinois Urbana-Champaign, Urbana, IL, USA
Xuemin Sherman Shen, University of Waterloo, Waterloo, ON, Canada
Borko Furht, Florida Atlantic University, Boca Raton, FL, USA
V. S. Subrahmanian, Department of Computer Science, University of Maryland,
College Park, MD, USA
Martial Hebert, Carnegie Mellon University, Pittsburgh, PA, USA
Katsushi Ikeuchi, Meguro-ku, University of Tokyo, Tokyo, Japan
Bruno Siciliano, Dipartimento di Ingegneria Elettrica e delle Tecnologie
dell’Informazione, Università di Napoli Federico II, Napoli, Italy
Sushil Jajodia, George Mason University, Fairfax, VA, USA
Newton Lee, Institute for Education Research and Scholarships, Los Angeles,
CA, USA
SpringerBriefs present concise summaries of cutting-edge research and practical
applications across a wide spectrum of fields. Featuring compact volumes of 50 to
125 pages, the series covers a range of content from professional to academic.
Typical topics might include:
• A timely report of state-of-the-art analytical techniques
• A bridge between new research results, as published in journal articles, and a
contextual literature review
• A snapshot of a hot or emerging topic
• An in-depth case study or clinical example
• A presentation of core concepts that students must understand in order to make
independent contributions
Briefs allow authors to present their ideas and readers to absorb them with minimal
time investment. Briefs will be published as part of Springer’s eBook collection,
with millions of users worldwide. In addition, Briefs will be available for individual
print and electronic purchase. Briefs are characterized by fast, global electronic
dissemination, standard publishing contracts, easy-to-use manuscript preparation
and formatting guidelines, and expedited production schedules. We aim for publication 8–12 weeks after acceptance. Both solicited and unsolicited manuscripts
are considered for publication in this series.
Arthur Francisco Lorenzon
Department of Computer Science
Federal University of Pampa (UNIPAMPA)
Alegrete, Rio Grande do Sul, Brazil

Antonio Carlos Schneider Beck Filho
Institute of Informatics, Campus do Vale
Federal University of Rio Grande do Sul (UFRGS)
Porto Alegre, Rio Grande do Sul, Brazil
This Springer imprint is published by the registered company Springer Nature Switzerland AG.
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
This book is dedicated to the memory of
Márcia Cristina and Aurora Cera.
Acknowledgments
The authors would like to thank the friends and colleagues at the Informatics Institute of the Federal University of Rio Grande do Sul and give special thanks to all the
people in the Embedded Systems Laboratory, who have contributed to this research
since 2013.
The authors would also like to thank the Brazilian research support agencies,
FAPERGS, CAPES, and CNPq.
Chapter 1
Runtime Adaptability: The Key for Improving Parallel Applications

1.1 Introduction
With the increasing complexity of parallel applications, which require ever more computing power, energy consumption has become an important issue. The power consumption of high-performance computing (HPC) systems is expected to grow significantly (up to 100 MW) in the coming years [34]. Moreover, while general-purpose processors are being held back by the limits of the thermal design power (TDP), most embedded devices are mobile and heavily dependent on battery (e.g., smartphones and tablets). Therefore, the primary objective when designing and executing parallel applications is not merely to improve performance but to do so with minimal impact on energy consumption.
Performance improvements can be achieved by exploiting instruction-level parallelism (ILP) or thread-level parallelism (TLP). In the former, independent instructions of a single program are executed simultaneously, usually on a superscalar processor, as long as functional units are available. However, typical instruction streams have only a limited amount of parallelism [122], so considerable microarchitectural design effort brings only marginal performance gains at a very significant area/power overhead. Even on a perfect processor, ILP exploitation reaches an upper bound [85].
Hence, to continue increasing performance and to provide better use of the extra
available transistors, modern designs have started to exploit TLP more aggressively
[7]. In this case, multiple processors simultaneously execute parts of the same
program, exchanging data at runtime through shared variables or message passing.
In the former, all threads share the same memory region, while in the latter each
process has its private memory, and the communication occurs by send/receive
primitives (even though they are also implemented using a shared memory context
when the data exchange is done intra-chip [21]). Regardless of the processor or
communication model, data exchange is usually done through memory regions that
are more distant from the processor (e.g., L3 cache and main memory) and have
higher delay and power consumption when compared to memories that are closer to
it (e.g., register, L1, and L2 caches).
Therefore, even though execution time will decrease because of TLP exploitation, energy will not necessarily follow the same trend, since many other variables are involved:
• Memories that are more distant from the processor will be accessed more often for synchronization and data exchange, increasing the energy related to dynamic power (which grows with the amount of activity in the circuitry).
• A parallel application will usually execute more instructions than its sequential counterpart. Moreover, even considering an ideal scenario (where processors are put on standby with no power consumption), the sum of the execution times of all threads executing on all cores tends to be greater than if the application were executed sequentially on only one core. As a consequence, the resulting energy from static power (directly proportional to how long each hardware component is turned on) consumed by the cores will also be more significant. There are a few exceptions to this rule, such as non-deterministic algorithms, in which a parallel application may execute fewer instructions than its sequential counterpart.
• The memory system (which involves caches and main memory) will be turned on for a shorter time (the total execution time of the application), which will decrease the energy resulting from static power.
Given the aforementioned discussion, cores tend to consume more energy from both dynamic and static power, while memories will usually spend more dynamic power (and hence energy) but tend to save static power, which is very significant [121]. On top of that, neither the performance nor the energy improvements resulting from TLP exploitation are linear, and sometimes they do not scale as the number of threads increases, which means that in many cases the maximum number of threads will not offer the best results.
Moreover, in order to speed up the development process of TLP exploitation and make it as transparent as possible to the software developer, different parallel programming interfaces are used (e.g., OpenMP—Open Multi-Processing [22], PThreads—POSIX Threads [17], or MPI—Message Passing Interface [38]). However, each of these has different characteristics with respect to thread/process management (i.e., creation and finalization), workload distribution, and synchronization.
In addition to the complex scenario of thread scalability, several optimization techniques for power and energy management can be used, such as dynamic voltage and frequency scaling (DVFS) [62] and power gating [47]. The former is a feature of the processor that allows the application to adapt the clock frequency and operating voltage of the processor on the fly. It enables software to change the processing performance to attain low-power consumption while meeting the performance requirements [62]. The latter, power gating, consists of selectively powering down certain blocks in the chip while keeping other blocks powered up; in multicore processors, it switches off unused cores to reduce power consumption.

1.2 Scalability Analysis
Many works have shown that executing an application with the maximum possible number of available threads (the common choice for most software developers [63]) will not necessarily lead to the best possible performance. There are several reasons for this lack of scalability: instruction issue-width saturation; off-chip bus saturation; data synchronization; and concurrent shared memory accesses [51, 64, 95, 114, 115]. In order to measure (through correlation) their real influence, we executed four benchmarks from our set (also used as examples in the next subsections) on a 12-core machine with SMT support. Each one has a limiting characteristic that stands out, as shown in Table 1.1: the benchmark hotspot (HS) saturates the issue-width; fast Fourier transform (FFT), the off-chip bus; MG, the shared memory accesses; and N-body (NB) is bound by data synchronization. To analyze each scalability issue, we considered the Pearson correlation [9]. It takes values from +1 to −1: the stronger the linear association between two variables, the closer the coefficient r is to either +1 or −1; r ≥ 0.9 or r ≤ −0.9 indicates a very strong correlation (the association is directly or inversely proportional, respectively).
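As an illustration of how such a correlation analysis can be computed, the sketch below derives the Pearson coefficient between two measured series (e.g., thread count versus total idle cycles). It is a generic implementation of ours, with hypothetical sample data, not the authors' tooling:

```c
#include <math.h>
#include <stdio.h>

/* Pearson correlation coefficient r between samples x[] and y[].
 * r close to +1/-1 indicates a strong direct/inverse linear association. */
double pearson(const double *x, const double *y, int n) {
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
        sx += x[i]; sy += y[i];
        sxx += x[i] * x[i]; syy += y[i] * y[i];
        sxy += x[i] * y[i];
    }
    double cov = sxy - sx * sy / n;   /* n times the covariance          */
    double vx  = sxx - sx * sx / n;   /* n times the variance of x       */
    double vy  = syy - sy * sy / n;   /* n times the variance of y       */
    return cov / sqrt(vx * vy);       /* the factors of n cancel out     */
}

int main(void) {
    /* hypothetical data: number of threads vs. total idle cycles */
    double threads[] = {1, 4, 7, 10, 13, 16, 19, 22};
    double idle[]    = {0.10, 0.15, 0.20, 0.22, 0.60, 0.75, 0.90, 1.00};
    printf("r = %.3f\n", pearson(threads, idle, 8));
    return 0;
}
```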
directly or inversely proportional). We discuss these bottlenecks next.
Issue-Width Saturation SMT allows many threads to run simultaneously on a core, increasing the probability of having more independent instructions to fill the functional units (FUs). Although it may work well for applications with low ILP, it can lead to the opposite behavior if an individual thread presents enough ILP to issue instructions to all or most of the core's FUs. Then, SMT may lead to resource competition and functional unit contention, resulting in extra idle cycles. Figure 1.1a shows the performance speedup relative to the sequential version and the number of idle cycles (average, represented by the bars, and total) as we increase the number of threads for the HS application. Once we execute with 13 threads, two of them are mapped to the same physical core, activating SMT. From this point on, as the number of threads grows, the average number of idle cycles increases by a small amount or stays constant; however, the total number of idle cycles increases significantly.
[Fig. 1.1 Scalability behavior of parallel applications: speedup, time, and idle cycles normalized w.r.t. the sequential version, with the number of threads (1–22) on the x-axis. (a) Issue-width saturation. (b) Off-chip bus saturation]
Because this application has high ILP, there are not enough resources to
execute both threads concurrently as if each one was executed on a single core.
They become the new critical path of that parallel region, as both threads will delay
the execution of the entire parallel region (threads can only synchronize when all
have reached the barrier). Therefore, performance drops and is almost recovered
only with the maximum number of threads executing. In the end, extra resources are used without improving performance while potentially increasing energy consumption, decreasing resource efficiency.
[Fig. 1.3 Data synchronization. (a) Critical section behavior. (b) Performance/energy degradation]

[Fig. 1.4 Appropriate number of threads (x-axis) considering the improvements over the sequential version (y-axis). (a) Different input sets. (b) Different metrics evaluated. (c) Different multicore processors. (d) Different parallel regions]

Considering the prior scenario, choosing the right number of threads for a given application offers opportunities to improve performance and increase energy efficiency. However, such a task is extremely difficult: besides the huge number of
variables involved, many of them will change according to different aspects of the system at hand and can only be defined at runtime, such as:
• Input set: As shown in Fig. 1.4a, different levels of performance improvement for the LULESH benchmark [57] (also used as the example in the next two items) over its single-threaded version are reached with different numbers of threads (x-axis). However, these levels vary according to the input set (small or medium): while the best number of threads is 12 for the medium input set, the ideal number for the small set is 11.
• Metric evaluated: As Fig. 1.4b shows, the best performance is reached with 12 threads, while 6 threads bring the lowest energy consumption, and 9 presents the best trade-off between both metrics, represented by the energy-delay product (EDP; see the sketch after this list).
• Processor architecture: Fig. 1.4c shows that the best EDP improvements of the parallel application on a 32-core system are achieved with 11 threads. However, the best choice for a 24-core system is 9 threads.
• Parallel regions: Many applications are divided into several parallel regions, in
which each of these regions may have a distinct ideal number of threads, since
their behavior may vary as the application executes. As an example, Fig. 1.4d
shows the behavior of four parallel regions from the Poisson equation benchmark
[94] when running on a 24-core system. One can note that each parallel region is
better executed with a different number of threads.
• Application behavior: A DVFS-enabled system adapts the operating frequency and voltage at runtime according to the application at hand, taking advantage of processor idleness (usually provoked by I/O operations or memory requests). Therefore, a memory- or CPU-bound application will influence the DVFS at different levels.
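For reference, the EDP metric used above simply multiplies energy by execution time, so lower is better. The trivial helper below (our illustration, with hypothetical measurements) makes the trade-off explicit:

```c
#include <stdio.h>

/* Energy-delay product: lower is better; balances performance and energy. */
double edp(double energy_j, double time_s) {
    return energy_j * time_s;
}

int main(void) {
    /* hypothetical measurements for 6, 9, and 12 threads */
    printf("6 threads:  EDP = %.1f\n", edp(80.0, 2.0));   /* lowest energy    */
    printf("9 threads:  EDP = %.1f\n", edp(90.0, 1.5));   /* best trade-off   */
    printf("12 threads: EDP = %.1f\n", edp(110.0, 1.3));  /* best performance */
    return 0;
}
```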
1.3 This Book

Efficiently exploiting thread-level parallelism from new multicore systems has been
challenging for software developers. While blindly increasing the number of threads
may lead to performance gains, it can also result in a disproportionate increase
in energy consumption. In the same way, optimization techniques for reducing
energy consumption, such as DVFS and power gating, can lead to huge performance
loss if used incorrectly. For this reason, correctly choosing the number of threads, the operating processor frequency, and the number of active cores is essential to reach the best compromise between performance and energy. However, such a task is extremely difficult: besides the large number of variables involved, many of them will change according to different aspects of the system at hand and are only defined at runtime, such as the input set of the application, the metric evaluated, the processor microarchitecture, and the behavior of the parallel regions that comprise the application.
In this book, we present and discuss several techniques that address this
challenge.
In Chap. 2, we provide a brief background for the reader. First, we give an overview of parallel computing in software, presenting the parallel programming interfaces widely used in multicore architectures. Second, we present the techniques used in software and hardware to optimize power and energy consumption.

Chapter 2
Fundamental Concepts

2.1 Parallel Computing in Software
Parallel computing exploits the use of multiple processing units to execute parts of
the same program simultaneously. Thus, there is cooperation between the processors
that execute concurrently. However, for this cooperation to occur, processors should
exchange information at runtime. In multicore processors, this can be done through
shared variables or message passing [97]:
[Fig. 2.1 Example of parallel computing. (a) Sequential execution. (b) Parallel execution in four cores]
The shared-variable model relies on an address space in memory that can be accessed by all processors. It is widely used when parallelism is exploited at the thread level, since threads share the same memory address space. In this model, threads can have private variables (to which the thread has exclusive access) and shared variables (to which all threads have access). When threads need to exchange information, they use shared variables located in memory regions that are accessible by all threads (shared memory). Each parallel programming interface provides synchronization operations to control access to shared variables, avoiding race conditions.
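To make the race-condition point concrete, here is a minimal sketch of the shared-variable model using PThreads (detailed below): two threads accumulate into a private variable and a mutex serializes their updates to a shared counter. The code is our illustration, not taken from the book:

```c
#include <pthread.h>
#include <stdio.h>

long counter = 0;                      /* shared variable */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg) {
    long local = 0;                    /* private variable: exclusive to this thread */
    for (int i = 0; i < 1000000; i++)
        local++;
    pthread_mutex_lock(&lock);         /* synchronization: avoid a race on 'counter' */
    counter += local;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter); /* 2000000 with the mutex in place */
    return 0;
}
```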
Message passing is used in environments where the memory space is distributed or where processes do not share the same memory address space. Communication therefore occurs through send/receive operations, which can be point-to-point or collective. In the former, data exchange is done between pairs of processes; in the latter, more than two processes communicate.
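Similarly, here is a minimal sketch of the message-passing model with MPI (also detailed below). It is our own example, assuming the program is launched with two processes; the exchanged value is hypothetical:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = 0;
    if (rank == 0) {
        value = 42;
        /* point-to-point: data exchange between a pair of processes */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    /* collective: every rank in the communicator participates */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```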
The development of applications that can exploit the full parallel potential of multiprocessor architectures depends on many specific aspects of their organization, including the size, structure, and hierarchy of the memory. Operating systems provide transparency concerning the allocation and scheduling of different processes across the various cores. However, when it comes to TLP exploitation, which involves dividing the application into threads or processes, the responsibility lies with the programmer. Therefore, parallel programming interfaces (PPIs) make the extraction of parallelism easier, faster, and less error-prone. Several PPIs are in use nowadays, of which the most common are Open Multi-Processing (OpenMP), POSIX Threads (PThreads), Message Passing Interface (MPI), Threading Building Blocks (TBB), Cilk Plus, and Charm, among others.
OpenMP is a PPI for shared memory in C/C++ and FORTRAN that consists of a set of compiler directives, library functions, and environment variables [22]. Parallelism is exploited through the insertion of directives in the sequential code that inform the compiler how and which parts of the code should be executed in parallel. Synchronization can be implicit (an implied barrier at the end of a parallel region) or explicit (synchronization constructs). By default, whenever there is a synchronization point, OpenMP threads enter a hybrid state (spin-lock and sleep): they poll the shared memory repeatedly until the spin count of the busy-wait loop is reached (spin-lock), and then they enter a sleep state until the end of the synchronization [22]. The amount of time each thread waits actively before waiting passively (without consuming CPU power) varies according to the waiting policy, which gives the number of spins of the busy-wait loop (e.g., the default spin count when OMP_WAIT_POLICY is set to active is 30 billion iterations) [86].
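As a minimal sketch of this directive style (our own illustration, not the book's code), the loop below is parallelized with a single directive. The reduction clause is an explicit synchronization construct, while the implicit barrier at the end of the parallel region is where the spin/sleep waiting policy described above applies:

```c
#include <omp.h>
#include <stdio.h>

int main(void) {
    const int N = 1000000;
    double sum = 0.0;

    /* the directive tells the compiler to run the loop in parallel;
     * 'reduction' safely combines each thread's partial sum */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += i * 0.5;

    /* implicit barrier: all threads have synchronized here; how long they
     * spin before sleeping at such points is governed by OMP_WAIT_POLICY */
    printf("sum = %f (threads available: %d)\n", sum, omp_get_max_threads());
    return 0;
}
```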
PThreads is a standard PPI for C/C++ whose functions allow fine adjustment of the grain size of the workload. Thus, the creation/termination of threads, the workload distribution, and the control of execution are defined by the programmer [17]. PThreads synchronization is done by blocking threads with mutexes, which are inserted in the code by the programmer. In this case, threads lose the processor and wait on standby until the end of the synchronization, when they are rescheduled for execution [117].
Cilk Plus is integrated with a C/C++ compiler and extends the language with the
addition of keywords by the programmer indicating where parallelism is allowed.
Cilk Plus enables programmers to concentrate on structuring programs to expose
parallelism and exploit locality. Thus, the runtime system has the responsibility of
scheduling the computation to run efficiently on a given platform. Besides, it takes
care of details like load balancing, synchronization, and communication protocols.
Unlike PThreads and OpenMP, Cilk Plus works at a finer grain, with a runtime
system that is responsible for efficient execution and predictable performance [79].
TBB is a library that supports parallelism based on a tasking model and can be used with any C++ compiler. TBB requires the use of function objects to specify blocks of code to run in parallel, which relies on templates and generic programming. The synchronization between threads is done by mutual exclusion, in which the threads in this state perform busy-waiting until the end of synchronization [79].
MPI is a standard message-passing library for C/C++ and FORTRAN. It implements an optimization mechanism to provide communication in shared memory environments [38]. MPI is similar to PThreads regarding the explicit exploitation of parallelism. Currently, it is divided into three norms. In MPI-1, all processes are created at the beginning of the execution, and the number of processes does not change throughout program execution. In MPI-2, the creation of processes occurs at runtime, and the number of processes can change during the execution. In MPI-3, the updates include the extension of collective operations to nonblocking versions and extensions to the one-sided operations. Communication between MPI processes occurs through send/receive operations (point-to-point or collective), which are likewise explicitly handled by the programmer.
Multicore architectures have multiple processing units (cores) and a memory system
that enables communication between the cores. Each core is an independent logical
processor with its own resources, such as functional units, an execution pipeline, registers,
among others. The memory system consists of private memories, which are closer to
the core and only accessible by a single core, and shared memories, which are more
distant from the core and can be accessed by multiple cores [43]. Figure 2.2 shows
an example of a multicore architecture with four cores (C0, C1, C2, and C3) and its
private (L1 and L2 caches) and shared memories (L3 cache and main memory).
Multicore processors can exploit TLP. In this case, multiple processors simultaneously execute parts of the same program, exchanging data at runtime through
shared variables or message passing. Regardless of the processor or communication
model, data exchange is done through load/store instructions in shared memory
regions. As Fig. 2.2 shows, these regions are more distant from the processor (e.g.,
L3 cache and main memory), and have a higher delay and power consumption when
compared to memories that are closer to it (e.g., register, L1, and L2 caches) [61].
Among the challenges faced in the design of multicore architectures, one of the most important is related to data access in parallel applications. When private data is accessed, it is brought into the private cache of a core, since no other core will use the same variable. On the other hand, shared data may be replicated in multiple caches, since other processors can access it to communicate. Therefore, while sharing data improves concurrency between multiple processors, it also introduces the cache coherence problem: when a processor writes to any shared data, the information stored in other caches may become invalid. To solve this problem, cache coherence protocols are used.

[Fig. 2.2 A multicore architecture with four cores (C0–C3), private L1 and L2 caches per core, and shared L3 cache and main memory]
Cache coherence protocols are classified into two classes: directory based and
snooping [88]. In the former, a centralized directory maintains the state of each
block in different caches. When an entry is modified, the directory is responsible
for either updating or invalidating the other caches with that entry. In the snooping protocol, rather than keeping the sharing state of a block in a single directory, each cache that has a copy of the data tracks the sharing status of the block. Thus, all the cores observe memory operations and take proper action to update or invalidate the local cache content if needed.
Cache blocks are classified into states, where the number of states depends on the protocol. For instance, simple directory-based and snooping protocols are three-state protocols in which each block is classified as modified, shared, or invalid (they are often called MSI—modified, shared, invalid—protocols). When a cache block is in the modified state, it has been updated in the private cache and cannot reside in any other cache. The shared state indicates that the block in the private cache is potentially shared, and a cache block is invalid when it contains no valid data. Based on the MSI protocol, extensions have been created by adding states. There are two common extensions: MESI, which adds the "exclusive" state to MSI to indicate that a cache block is resident only in a single cache but is clean, and MOESI, which adds the "owned" state to the MESI protocol to indicate that a particular cache owns the associated block, which is out-of-date in memory [43].
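As a rough illustration of how these states drive coherence actions, the sketch below models the core of an MSI snooping protocol: the state of one cache's copy of a line is updated in response to local accesses and observed (snooped) bus events. It is a didactic simplification of ours; real protocols also handle write-backs, ownership transfer, and races:

```c
#include <stdio.h>

typedef enum { INVALID, SHARED, MODIFIED } msi_state_t;

/* events seen by one cache: local reads/writes and snooped bus traffic */
typedef enum { LOCAL_READ, LOCAL_WRITE, BUS_READ, BUS_WRITE } event_t;

msi_state_t msi_next(msi_state_t s, event_t e) {
    switch (s) {
    case INVALID:
        if (e == LOCAL_READ)  return SHARED;    /* fetch a clean copy */
        if (e == LOCAL_WRITE) return MODIFIED;  /* fetch with exclusive intent */
        return INVALID;
    case SHARED:
        if (e == LOCAL_WRITE) return MODIFIED;  /* invalidate other copies */
        if (e == BUS_WRITE)   return INVALID;   /* another core wrote: our copy is stale */
        return SHARED;
    case MODIFIED:
        if (e == BUS_READ)    return SHARED;    /* supply data and downgrade */
        if (e == BUS_WRITE)   return INVALID;   /* another core takes ownership */
        return MODIFIED;
    }
    return INVALID;
}

int main(void) {
    msi_state_t s = INVALID;
    s = msi_next(s, LOCAL_READ);   /* INVALID  -> SHARED   */
    s = msi_next(s, LOCAL_WRITE);  /* SHARED   -> MODIFIED */
    s = msi_next(s, BUS_READ);     /* MODIFIED -> SHARED   */
    printf("final state = %d\n", s);
    return 0;
}
```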
When developing parallel applications, the software developer does not need
to know about all details of cache coherence. However, knowing how the data
exchange occurs at the hardware level can help the programmer to make better
decisions during the development of parallel applications.
2.2 Power and Energy Consumption

Two main components constitute the power used by a CMOS integrated circuit: dynamic and static [58]. The former is the power consumed while the inputs are active, with capacitances charging and discharging; it is directly proportional to the circuit switching activity and given by Eq. (2.1):

    P_dynamic = C · V^2 · A · f    (2.1)
Capacitance (C) depends on the wire lengths of on-chip structures. Designers can influence this metric in several ways. For example, building two smaller cores on-chip, rather than one large core, is likely to reduce average wire lengths, since most wires will interconnect units within a single core.
Supply voltage (V or Vdd) is the main voltage to power the integrated circuit.
Because of its direct quadratic influence on dynamic power, supply voltage is of high importance in power-aware design.
Activity factor (A) refers to how often clock ticks lead to switching activity on
average.
Clock frequency (f) has a fundamental impact on power dissipation because it also indirectly influences supply voltage: higher clock frequencies can require a higher supply voltage. Thus, the combined contribution of supply voltage and clock frequency to the dynamic power equation has a cubic impact on total power dissipation.
While dynamic power dissipation represents the predominant factor in CMOS power consumption, static power has become increasingly prominent in recent technologies. Static power essentially consists of the power used when the transistor is not in the process of switching, and is determined by Eq. (2.2), where the supply voltage is V and the total current flowing through the device is I_static:

    P_static = V · I_static    (2.2)
Energy, in joules, is the integral of the total power consumed (P) over time (T), given by Eq. (2.3):

    Energy = ∫_0^T P(t) dt    (2.3)
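To make the relationship between these quantities concrete, the sketch below evaluates Eq. (2.1) and numerically integrates a sampled power trace per Eq. (2.3). The constants are hypothetical values of ours, not measurements from the book:

```c
#include <stdio.h>

/* Eq. (2.1): dynamic power from capacitance, voltage, activity factor, frequency */
double p_dynamic(double C, double V, double A, double f) {
    return C * V * V * A * f;
}

/* Eq. (2.3): energy as the integral of power over time, approximated here
 * with the trapezoidal rule over samples taken every dt seconds */
double energy(const double *power, int n, double dt) {
    double e = 0.0;
    for (int i = 1; i < n; i++)
        e += 0.5 * (power[i - 1] + power[i]) * dt;
    return e;
}

int main(void) {
    /* hypothetical parameters: 1 nF switched capacitance, 1.2 V, 0.3 activity, 2 GHz */
    printf("P_dyn = %.3f W\n", p_dynamic(1e-9, 1.2, 0.3, 2e9));

    /* hypothetical power trace sampled once per second */
    double trace[] = {10.0, 12.0, 11.5, 9.0};
    printf("E = %.2f J\n", energy(trace, 4, 1.0));
    return 0;
}
```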
Currently, energy is considered one of the most fundamental metrics due to energy restrictions: while most embedded devices are mobile and heavily dependent on battery, general-purpose processors are being held back by the limits of thermal design power. Also, the reduction of energy consumption in HPC is one of the challenges to achieving the Exascale era, since the energy required to maintain these systems corresponds to the power output of a medium-sized nuclear plant [34]. Therefore, several techniques to reduce energy consumption have been proposed, such as DVFS and power gating.
Dynamic voltage and frequency scaling is a feature of the processor that allows software to adapt the clock frequency and operating voltage of a processor on the fly, without requiring a reset [62]. DVFS enables software to change system-on-chip (SoC) processing performance to attain low-power consumption while meeting the performance requirements. The main idea of DVFS is to dynamically scale the supply voltage of the CPU for a given frequency so that it operates at the minimum speed required by the specific task being executed [62]. This can yield a significant reduction in power consumption because of the V^2 relationship shown in Eq. (2.1).
Reducing the operating frequency reduces both processor performance and power consumption. Also, when reducing the voltage, the leakage current from the CPU's transistors decreases, making the processor more energy-efficient and resulting in further gains [99]. However, determining the ideal frequency and voltage for a given point of execution is not a trivial task. To make DVFS management as transparent as possible to the software developer, operating systems provide frameworks that allow each CPU core to have a min/max frequency and a governor to control it. Governors are kernel modules that can drive CPU core frequency/voltage operating points. Currently, the most common available governors are:
• Performance: The frequency of the processor is always fixed at the maximum,
even if the processor is underutilized.
• Powersave: The frequency of the processor is always fixed at the minimum
allowable frequency.
• Userspace: Allows the user or any running userspace program to set the CPU to a specific frequency.
• Ondemand: The frequency of the processor is adjusted according to the workload
behavior, within the range of allowed frequencies.
• Conservative: In the same way as the previous mode, the frequency of the
processor is gradually adjusted based on the workload, but in a more conservative
way.
Besides the pre-defined governors, it is possible to set the processor frequency level manually by editing the configuration of the CPU frequency driver.
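On Linux, the cpufreq driver exposes these knobs through sysfs. The sketch below assumes the conventional /sys/devices/system/cpu/.../cpufreq paths and a kernel built with userspace-governor support; paths and availability vary by system, so treat it as illustrative:

```c
#include <stdio.h>

/* Write a value into a cpufreq sysfs file for cpu0.
 * Requires root privileges; returns 0 on success. */
static int cpufreq_write(const char *file, const char *value) {
    char path[256];
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu0/cpufreq/%s", file);
    FILE *f = fopen(path, "w");
    if (!f) return -1;
    fprintf(f, "%s\n", value);
    fclose(f);
    return 0;
}

int main(void) {
    /* switch cpu0 to the userspace governor, then pin a frequency (in kHz) */
    if (cpufreq_write("scaling_governor", "userspace") != 0 ||
        cpufreq_write("scaling_setspeed", "1200000") != 0) {
        perror("cpufreq");
        return 1;
    }
    printf("cpu0 pinned to 1.2 GHz\n");
    return 0;
}
```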
Power gating consists of selectively powering down certain blocks in the chip while keeping other blocks powered up. The goal of power gating is to minimize leakage current by temporarily switching off power to blocks that are not required in the current operating mode [59]. Power gating can be applied either at the unit level, reducing the power consumption of unused core functional units, or at the core level, in which entire cores may be power gated [56, 76]. Currently, power gating is mainly used in multicore processors to switch off unused cores to reduce power consumption [84].
Power gating requires the presence of a header "sleep" transistor that can set the supply voltage of the circuit to ground level during idle times. It also requires control logic that decides when to power gate the circuit. Every time power gating is applied, an energy overhead cost occurs, due to distributing the sleep signal to the header transistor before the circuit is turned off, and to turning off the sleep signal and driving the voltage when the circuit is powered on again. Therefore, there is a break-even point, which represents the exact point in time where the cumulative leakage energy savings equal the energy overhead incurred by power gating. If, after the decision to power gate a unit, the unit stays idle for a time interval longer than the break-even point, then power gating saves energy. On the other hand, if the unit needs to be active again before the break-even point is reached, then power gating incurs an energy penalty [75].
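The break-even condition lends itself to a simple calculation: gating pays off when the leakage energy saved during the idle interval exceeds the switching overhead, i.e., when t_idle > E_overhead / P_leakage. The sketch below, with hypothetical numbers of ours, applies this rule:

```c
#include <stdio.h>

/* Break-even time: the idle interval beyond which power gating saves energy. */
double break_even(double e_overhead_j, double p_leakage_w) {
    return e_overhead_j / p_leakage_w;
}

/* Net energy saved by gating for t_idle seconds (negative means a penalty). */
double net_savings(double t_idle_s, double e_overhead_j, double p_leakage_w) {
    return p_leakage_w * t_idle_s - e_overhead_j;
}

int main(void) {
    /* hypothetical values: 50 uJ gating overhead, 2 mW leakage in the gated block */
    double e_ov = 50e-6, p_leak = 2e-3;
    printf("break-even = %.1f ms\n", break_even(e_ov, p_leak) * 1e3);  /* 25 ms  */
    printf("net for 40 ms idle: %.1f uJ\n",
           net_savings(40e-3, e_ov, p_leak) * 1e6);                    /* +30 uJ */
    return 0;
}
```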
Chapter 3
The Impact of Parallel Programming Interfaces on Energy

3.1 Methodology

3.1.1 Benchmarks
[Fig. 3.1 Behavior of benchmarks: computation phases versus communication and synchronization across threads/processes 0–3. (a) High communication. (b) Low communication]
the best results for the same benchmark set used in this study; therefore, this scheduling mechanism is used here.

As indicated by [17, 36] and [38], we used parallel tasks for the PThreads and MPI implementations. In such cases, the iterations of the loop were distributed based on the best workload balancing between threads/processes. Moreover, the communication between MPI processes was implemented using nonblocking operations, to provide better performance, as shown in [44].