What’s New POWER Performance

Steve Nasypany
IBM Washington System Center
nasypany@us.ibm.com
Please note

IBM’s statements regarding its plans, directions, and intent are subject
to change or withdrawal without notice and at IBM’s sole discretion.
Information regarding potential future products is intended to outline our
general product direction and it should not be relied on in making a
purchasing decision.

The information mentioned regarding potential future products is not a
commitment, promise, or legal obligation to deliver any material, code
or functionality. Information about potential future products may not be
incorporated into any contract.

The development, release, and timing of any future features or functionality
described for our products remains at our sole discretion.

Performance is based on measurements and projections using standard IBM
benchmarks in a controlled environment. The actual throughput or
performance that any user will experience will vary depending upon many
factors, including considerations such as the amount of multiprogramming in
the user’s job stream, the I/O configuration, the storage configuration, and
the workload processed. Therefore, no assurance can be given that an
individual user will achieve results similar to those stated here.

Credits

This session includes materials developed by Bret Olszewski, Ron Arroyo,
Todd Rosedahl and Tracy Smith of IBM

Special thanks to Shakti Kapoor, Ralf Schmidt-Dannert and Bhargavaram
Akula for their tireless efforts on POWER9 testing

The POWER9 Processor
Power Systems Performance Collateral

https://developer.ibm.com/linuxonpower/perfcol/
https://www14.software.ibm.com/webapp/set2/sas/f/best/home.html
IBM POWER Architecture & Terminology Basics

thread      a hardware/software abstraction on a physical core
processor   collection of cores on the same physical die
chip        may be one or more processors packaged on a single socket
SMT         Simultaneous Multi-threading
SCM         Single-Chip Module: one chip per socket
DCM         Dual-Chip Module: two chips packaged per socket
NUMA        Non-Uniform Memory Architecture
NUCA        Non-Uniform Cache Architecture

                    POWER8 core   POWER9 PowerVM core   POWER9 OpenPOWER core
Cores per chip      Up to 12      Up to 12              Up to 24
Packaging           SCM/DCM       SCM                   SCM
Instruction cache   64 KB         64 KB                 32 KB (not shared)
Data cache          32 KB         64 KB                 32 KB (not shared)
L2 cache            512 KB        512 KB                512 KB (shared)
L3 cache            8 MB          10 MB eDRAM           10 MB eDRAM (shared)
SMT support         SMT8          SMT8                  SMT4
POWER9 - #1 & #2 Top Supercomputers

The world's most powerful computer

The Summit supercomputer at Oak Ridge National Laboratory was built using
technologies available to all businesses
• 200 quadrillion calculations per second
• 250 petabytes storage capacity
• 4,608 IBM AC922 POWER9 systems
• 9,216 CPUs / 101,376 cores
• 27,648 NVIDIA Tesla V100 GPUs
• 25 gigabytes per second between nodes
• 150 gigabytes per second between CPUs and GPUs (NVLink 2.0)
• 100 gigabytes per second to 2 terabytes of DDR4 memory
AC922 Bandwidth

• AC922
  - Designed for HPC and Cognitive workloads
  - NVIDIA NVLink 2.0 GPU to GPU: 150 GB/s
  - NVLink 2.0 CPU to GPU: 150 GB/s vs 32 GB/s for x86 PCIe3
  - Memory bus: 120 GB/s vs ~77 GB/s for x86
CPW – POWER8 E880 vs POWER9 E980

CPW (IBM i), by number of cores per node

Nodes  Model  32c        40c        44c        48c
1      E880   361,180    436,080    n/a        491,060
1      E980   508,900    611,300    639,000    687,500
2      E880   715,740    863,620    n/a        980,230
2      E980   1,012,000  1,216,000  1,271,000  1,368,000
3      E880   1,084,510  1,291,170  n/a        1,470,340
3      E980   1,521,000  1,827,000  1,910,000  2,055,600
4      E880   1,433,800  1,718,720  n/a        1,961,410
4      E980   2,030,000  2,439,000  2,549,000  2,743,000

IBM Power Systems Performance Capabilities Reference


https://www.ibm.com/systems/resources/pcrm.pdf

rPerf Comparisons – POWER8 vs POWER9 SMT

Throughput increase between SMT modes

         ST -> SMT2   SMT2 -> SMT4   SMT4 -> SMT8
POWER8   45%          30%            7%
POWER9   70%          38%            26%
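
The step-wise gains in the table above can be compounded to estimate how much total throughput SMT8 offers relative to single-thread mode. The following is a minimal Python sketch, assuming the percentages compound multiplicatively (the slide does not state this explicitly); the function and variable names are illustrative only.

```python
# Step-wise SMT throughput gains from the table above, as fractions.
# Order: ST->SMT2, SMT2->SMT4, SMT4->SMT8
SMT_STEP_GAINS = {
    "POWER8": [0.45, 0.30, 0.07],
    "POWER9": [0.70, 0.38, 0.26],
}


def cumulative_gain(step_gains):
    """Compound the per-step gains into a total throughput multiple of ST.

    Assumption: each step's gain applies on top of the previous mode.
    """
    total = 1.0
    for gain in step_gains:
        total *= 1.0 + gain
    return total


for arch, gains in SMT_STEP_GAINS.items():
    print(f"{arch}: SMT8 is roughly {cumulative_gain(gains):.2f}x ST throughput")
```

Under that assumption, POWER8 SMT8 lands near 2x single-thread throughput while POWER9 SMT8 lands near 3x, which is the gap the later migration slides rely on.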

Use the IBM Power Systems Performance Report for POWER8 to POWER9
https://www.ibm.com/systems/power/hardware/reports/system_perf.html

For earlier POWER architectures, SMT breakdowns are not provided by the
report. For reference, these rough approximations are used:
• POWER6: Single-Thread sizing is 66% of SMT2 rating
• POWER7/7+: Single-Thread sizing is 56% of SMT4 rating
• POWER7/7+: SMT2 sizing is 83% of SMT4 rating
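
These rules of thumb are simple scalings of a published rating. A small Python sketch of how they might be applied follows; the lookup table mirrors the three approximations above, while the function name and the sample rating of 100.0 are hypothetical.

```python
# (architecture, target mode) -> (fraction of published rating, mode the rating was published for)
LEGACY_FACTORS = {
    ("POWER6", "ST"): (0.66, "SMT2"),
    ("POWER7", "ST"): (0.56, "SMT4"),
    ("POWER7", "SMT2"): (0.83, "SMT4"),
}


def estimate_rperf(arch, target_mode, published_rperf):
    """Scale a published rPerf rating down to a lower SMT mode.

    Only the three approximations quoted on the slide are supported;
    anything else raises KeyError rather than guessing.
    """
    fraction, _reference_mode = LEGACY_FACTORS[(arch, target_mode)]
    return fraction * published_rperf


# Hypothetical POWER7 system with a published SMT4 rating of 100.0
print(estimate_rperf("POWER7", "ST", 100.0))    # single-thread estimate
print(estimate_rperf("POWER7", "SMT2", 100.0))  # SMT2 estimate
```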
rPerf Comparisons – S824 vs S924

The real-world case for migrating PowerVM POWER8 workloads to POWER9 is
from SMT4 to SMT8

                     S924 ST   S924 SMT2   S924 SMT4   S924 SMT8
S924 rPerf           197       335         462         583
S824 SMT2 (285)      -         1.18X       1.62X       2.04X
S824 SMT4 (371)      -         0.90X       1.24X       1.57X
S824 SMT8 (397)      -         0.84X       1.16X       1.47X

rPerf Comparisons – S92X with SMT4 & SMT8
Ratios are shown per system and per core for each S92X target.

                                  S924 24c/3.4-3.9   S924 20c/3.5-3.9   S922 20c/2.9-3.8
Source  Cores/GHz  SMT  rPerf     System  Per core   System  Per core   System  Per core
750     32c/3.6    4    335       1.38x   1.84x      1.18x   1.89x      1.01x   1.62x
750     32c/3.6    8    -         1.74x   2.32x      1.49x   2.39x      1.27x   2.04x
750+    32c/3.5    4    354       1.30x   1.74x      1.12x   1.8x       0.95x   1.53x
750+    32c/3.5    8    -         1.64x   2.19x      1.41x   2.27x      1.20x   1.92x
750+    32c/4.0    4    397       1.16x   1.55x      1.0x    1.6x       0.85x   1.36x
750+    32c/4.0    8    -         1.47x   1.95x      1.26x   2.01x      1.07x   1.72x
S824    24c/3.52   4    371       1.25x   1.25x      1.07x   1.28x      0.91x   1.09x
S824    24c/3.52   8    397       1.47x   1.47x      1.26x   1.51x      1.07x   1.28x
S824    16c/4.15   4    284       1.62x   1.09x      1.39x   1.15x      1.19x   0.95x
S824    16c/4.15   8    304       1.91x   1.28x      1.64x   1.32x      1.40x   0.89x
S824    12c/3.89   4    207       2.23x   1.12x      1.91x   1.15x      1.63x   0.98x
S824    12c/3.89   8    221       2.63x   1.32x      2.62x   1.35x      1.92x   1.16x

rPerf Comparisons – POWER7 vs POWER9 Enterprise
E950 and E980 rPerf

SMT   E950 32c  E950 40c  E950 44c  E950 48c  E980 32c  E980 40c  E980 44c  E980 48c
4     691       820       850       910       722       871       937       1008
8     870       1034      1072      1146      910       1098      1181      1270

Ratios to POWER7 systems (the six ratio columns line up with the 32c, 40c and 48c ratings above)

                                  E950                     E980
Source  Cores/GHz  SMT  rPerf     32c    40c    48c       32c    40c    48c
750+    32c/3.5    4    354       1.95X  2.31X  2.57X     2.04X  2.46X  2.84X
750+    32c/3.5    8    -         2.46X  2.92X  3.23X     2.57X  3.10X  3.58X
750+    32c/4.0    4    397       1.74X  2.06X  2.29X     1.82X  2.19X  2.54X
750+    32c/4.0    8    -         2.19X  2.60X  2.88X     2.30X  2.76X  3.20X
770     32c/3.1    4    306       2.26X  2.68X  2.97X     2.36X  2.84X  3.29X
770     32c/3.1    8    -         2.84X  3.38X  3.74X     2.97X  3.58X  4.15X
770+    32c/3.8    4    411       1.68X  1.99X  2.21X     1.75X  2.12X  2.45X
770+    32c/3.8    8    -         2.11X  2.51X  2.78X     2.21X  2.67X  3.09X
780+    32c/4.4    4    460       1.51X  1.78X  1.98X     1.57X  1.89X  2.19X
780+    32c/4.4    8    -         1.89X  2.24X  2.49X     1.98X  2.38X  2.76X

rPerf Comparisons – POWER8 E850 vs POWER9 Enterprise

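E950 and E980 rPerf

SMT   E950 32c  E950 40c  E950 44c  E950 48c  E980 32c  E980 40c  E980 44c  E980 48c
4     691       820       850       910       722       871       937       1008
8     870       1034      1072      1146      910       1098      1181      1270

Ratios to POWER8 E850 systems (the six ratio columns line up with the 32c, 40c and 48c ratings above)

                                  E950                     E980
Source  Cores/GHz  SMT  rPerf     32c    40c    48c       32c    40c    48c
E850    32c/3.72   4    523       1.32X  1.57X  1.74X     1.38X  1.66X  1.93X
E850    32c/3.72   8    559       1.55X  1.85X  2.05X     1.63X  1.96X  2.27X
E850C   32c/4.22   4    574       1.20X  1.42X  1.58X     1.26X  1.51X  1.75X
E850C   32c/4.22   8    615       1.41X  1.68X  1.86X     1.48X  1.78X  2.06X
E850C   48c/3.65   4    756       0.91X  1.08X  1.20X     0.95X  1.15X  1.33X
E850C   48c/3.65   8    809       1.08X  1.28X  1.41X     1.12X  1.35X  1.57X
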
rPerf Comparisons – POWER8 870/880 vs POWER9 Enterprise

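E950 and E980 rPerf

SMT   E950 32c  E950 40c  E950 44c  E950 48c  E980 32c  E980 40c  E980 44c  E980 48c
4     691       820       850       910       722       871       937       1008
8     870       1034      1072      1146      910       1098      1181      1270

Ratios to POWER8 E870/E880 systems (the six ratio columns line up with the 32c, 40c and 48c ratings above)

                                  E950                     E980
Source  Cores/GHz  SMT  rPerf     32c    40c    48c       32c    40c    48c
E870    32c/4.02   4    594       1.16X  1.38X  1.53X     1.21X  1.46X  1.69X
E870    32c/4.02   8    635       1.37X  1.63X  1.80X     1.43X  1.73X  2.0X
E880    32c/4.35   4    630       1.09X  1.30X  1.44X     1.14X  1.38X  1.60X
E880    32c/4.35   8    675       1.29X  1.53X  1.70X     1.35X  1.63X  1.88X
E880    40c/4.19   4    753       0.92X  1.09X  1.21X     0.96X  1.15X  1.39X
E880    40c/4.19   8    806       1.08X  1.28X  1.42X     1.30X  1.36X  1.57X
E880    48c/4.0    4    860       0.80X  0.99X  1.06X     0.84X  1.01X  1.17X
E880    48c/4.0    8    920       0.94X  1.12X  1.24X     0.99X  1.19X  1.38X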
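
Most ratios on this slide appear to be the target system's rPerf divided by the source system's rPerf at the same SMT mode. That relationship is inferred from the numbers, not stated on the slide. The Python sketch below recomputes a ratio from the published ratings, which is a quick way to cross-check any individual cell; the dictionary layout and function name are illustrative.

```python
# E950/E980 rPerf ratings from the slide, keyed by SMT mode, model and core count.
TARGET_RPERF = {
    4: {"E950": {32: 691, 40: 820, 44: 850, 48: 910},
        "E980": {32: 722, 40: 871, 44: 937, 48: 1008}},
    8: {"E950": {32: 870, 40: 1034, 44: 1072, 48: 1146},
        "E980": {32: 910, 40: 1098, 44: 1181, 48: 1270}},
}


def migration_ratio(source_rperf, target_model, target_cores, smt):
    """Target rPerf divided by source rPerf at the same SMT mode (inferred formula)."""
    return TARGET_RPERF[smt][target_model][target_cores] / source_rperf


# E870 32c/4.02 GHz: rPerf 594 at SMT4 and 635 at SMT8 (values from the slide)
print(round(migration_ratio(594, "E950", 32, smt=4), 2))
print(round(migration_ratio(635, "E980", 48, smt=8), 2))
```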

POWER9 Other Updates (April and August)
Spectre/Meltdown?

While IBM will make no performance-related statement regarding any customer
workload, rPerf ratings have been published that include the patches for AIX &
PowerVM firmware.

POWER8 ratings have been reduced 5-7%. One could infer that POWER9 ratings for
customers not implementing the patches would be higher than those published.

AIX patches do not do anything unless PowerVM firmware contains patches


Security

AIX 7.2 code levels for POWER9 support an option to display speculation
security settings (currently undocumented)
# lparstat -x
LPAR Speculative Execution Mode : 2

POWER8->POWER9 SMT4 to SMT4 (Transactional Workload)

[Chart: SMT4 Partition Processor Consumption vs Throughput. X axis: Processor
Consumed (2.00 to 7.00). Y axis: Throughput (0 to 10,000). Series: P8 SMT4
6vcpus, P9 SMT4 6vcpus, P9 SMT4 5vcpus]

• Migration with same VP count, improved utilization and response times
• 20% reduction in VP, similar response time and reduction in physical
consumption
POWER8->POWER9 SMT8 to SMT8 (Transactional Workload)
[Chart: SMT8 Partition Processor Consumption vs Throughput. X axis: Processor
Consumed (2.00 to 7.00). Y axis: Throughput (0 to 10,000). Series: P8 SMT8
6vcpus, P9 SMT8 6vcpus, P9 SMT8 5vcpus, P9 SMT8 4vcpus]
• Migration same VP count: reduced utilization for same workload with similar or
improved response time and higher throughput
• Migration to 5 vcpu partitions will observe similar response time for same workload
with further reduced PC consumption
• 33% reduction in VP, better or equal response times for utilizations < 80%, higher
throughput, lowered PC
Java & WebSphere on POWER9

Best practices for Java and IBM WebSphere Application Server (WAS) on
IBM POWER9

Workload                      Throughput increase from SMT4 to SMT8
SPECjbb2015 max-jOPS          24.5%
SPECjbb2015 critical-jOPS     37.6%
DayTrader7 throughput         35%


Lab Example, POWER8->POWER9 right-sizing

The following migration estimates are based on IBM internal measurements
on the DayTrader7 workload
• Example: S824/24c 3.52 GHz vs S924/24c
• rPerf P9/P8 ratios: SMT4 1.25x & SMT8 1.47x
• Customer migration experience may vary by workload
• Physc = AIX Physical Consumption

P8 SMT4 -> P9 SMT4, both 6 vcpus

P8 Utilization   P8 Physc   P9 Utilization   P9 Physc   Est. Physc Improvement
20               2.06       17               1.94       6%
40               3.18       32               2.78       14%
60               4.32       46               3.63       19%
80               5.45       61               4.47       22%

P8 SMT8 6 vcpus -> P9 SMT8 4 vcpus

P8 Utilization   P8 Physc   P9 Utilization   P9 Physc   Estimated PC Improvement
20               2.29       17               1.6        43%
40               3.35       36               2.3        46%
60               4.41       56               3          47%
80               5.46       75               3.7        48%
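
The improvement column in both tables is consistent with P8 Physc divided by P9 Physc, minus one. That formula is inferred from the published numbers rather than stated on the slide. A short Python sketch that reproduces the column under that assumption (names are illustrative):

```python
def physc_improvement_percent(p8_physc, p9_physc):
    """Estimated improvement as (P8 Physc / P9 Physc - 1) * 100 (inferred formula)."""
    return (p8_physc / p9_physc - 1.0) * 100.0


# (P8 Physc, P9 Physc) pairs copied from the two tables above
SMT4_SAME_VPS = [(2.06, 1.94), (3.18, 2.78), (4.32, 3.63), (5.45, 4.47)]
SMT8_FEWER_VPS = [(2.29, 1.6), (3.35, 2.3), (4.41, 3.0), (5.46, 3.7)]

for label, rows in (("SMT4, 6 -> 6 vcpus", SMT4_SAME_VPS),
                    ("SMT8, 6 -> 4 vcpus", SMT8_FEWER_VPS)):
    percents = [round(physc_improvement_percent(p8, p9)) for p8, p9 in rows]
    print(label, percents)
```

The same function can be fed with Physc values from your own before/after measurements to compare against the lab estimates.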
Informal OSDB Micro-benchmark – POWER8 vs POWER9

Test                                S824        S924        Relative
                                    3.46 GHz    3.52 GHz    Increase
Insert.SingleIndex.Contested.Rnd    96575       141943      47%
I.MIndex.Contested.Rnd              81808       120636      47%
I.MKeyIndex.Contested.Rnd           76032       109604      44%
I.DocVal.TwentyInt                  67273       101930      55%
I.PartialIndex.FullRange            103715      144893      39%
Update.SetWithIndex.Rnd             71291       104519      46%
U.SetWithMIndex.Rnd                 62483       94582       51%
U.DocVal.TwentyNum                  51662       80280       55%
U.ManyElementsInArrary              2904        5091        75%
MultiUpdate.Contended.NoIndex       85654       124096      44%

P8 results adjusted to P9 frequency
PowerVM LPAR, 4.0 Entitlement, 4 Virtual Processors, SMT8
RHEL 7.4 3.10.0-693.el7.ppc64le (No SPEC fixes)
MongoDB 3.6.2, no tunings, 32 DB threads. mongo-perf microbenchmark
Informal OSDB Micro-benchmark – SMT Levels

                      ST         SMT2        SMT4        SMT8
TPS                   54914      105530      161098      201966
Total Transactions    9886913    18998143    29001033    36353569
Average Latency       1.821 ms   0.948 ms    0.621 ms    0.495 ms

POWER9 24c/~3.5 GHz


PowerVM LPAR, 2.0 Entitlement, 6 Virtual Processors
RHEL 7.4 3.10.0-693.el7.ppc64le (No SPEC fixes)
Postgresql 9.6-3, no tunings, pgbench microbenchmark with 100 clients & 6 threads
Ramdisk to demonstrate CPU/RAM-only differentiation
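
The latency row in this benchmark follows directly from the TPS row and the 100-client setting: with a fixed number of clients each waiting on one transaction at a time, average latency is approximately clients divided by throughput (Little's law). The Python sketch below checks that relationship against the published figures; it assumes all 100 clients stay busy for the whole run, which the slide does not confirm.

```python
CLIENTS = 100  # pgbench client count quoted on the slide

# TPS per SMT mode, copied from the table above
PGBENCH_TPS = {"ST": 54914, "SMT2": 105530, "SMT4": 161098, "SMT8": 201966}


def expected_latency_ms(clients, tps):
    """Little's law for a closed system: latency = concurrency / throughput."""
    return clients / tps * 1000.0


for mode, tps in PGBENCH_TPS.items():
    print(f"{mode}: {expected_latency_ms(CLIENTS, tps):.3f} ms")
```

Because latency is tied to throughput this way, the SMT8 latency gain in the table is the same effect as the SMT8 throughput gain seen from the client side.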

All Migrations should consider moving to SMT8

The architectural changes in POWER9 see much higher improvements in
capacity and latency with SMT8
• The lab chose to be conservative with POWER8 and SMT4
• Because AIX 7.2 had already been developed, there was resistance to
  changing default SMT mode from 4 to 8 when POWER8 shipped
• While the POWER8 architecture saw little improvement between SMT4 and
  SMT8 (typically < 7% for transactional workloads), gains are much more
  significant with POWER9
• Single- & two-socket POWER9 systems may not have greater peak
  bandwidth than scale-out POWER8 CDIMM systems, but they have enough
  - Even In-Memory workloads on POWER8 could rarely achieve ~50% of the
    total memory bandwidth before running out of CPU capacity
  - Standard DDR4 architecture provides better price/performance ratio for
    this class of systems
  - PCIe Gen4 will yield higher I/O performance and is not gated by bus limits
Assessing SMT on POWER9

• SMT is dynamic in AIX and Linux, so it is trivial to test!
• Start with traditional products known to behave well with SMT (WAS, DB2,
  Oracle, SAP, some OSDB). The more current, the better.
• Check for software recommendations, but remember that most did not
  assess SMT8 fully because of the lab’s conservative approach with POWER8
• Older software levels migrating from older architectures should be reviewed
  more carefully. They will not suffer from SMT8, but they may benefit less.
• Open Source products that have never been tested with SMT8 should be
  assessed individually for scaling performance if using higher core counts
  - The concern is that some Open Source products may have never been
    tested with dozens of logical cpu instances (lock/latch contention)
  - Issues appear as non-linear context switch behavior as cores are added or
    SMT is increased

IBM has announced SMT8 will be the default with AIX 7.2 TL3 on POWER9
Larger LPAR Migrations to POWER9 should review VP counts

• Focus on larger workloads where 20-33% reductions may be possible
  - A wide range of performance results show this is possible
  - You can’t reach higher frame utilization with SMT4 and high VP counts
  - Cost of software licensing by core warrants the effort
• Many organizations are slow to reconsider VP changes
  - Larger POWER6 to POWER7 migrations encountered “high physical
    consumption” complaints because of AIX Dispatcher changes, requiring
    post-migration tuning/resizing to meet rPerf expectations
  - AIX more aggressively used Virtual Processors from POWER7 on. This
    algorithm is (mostly) consistent across POWER7, POWER8 and POWER9
  - Reducing VPs aids memory affinity for those customers nervous about
    spanning nodes
  - If you didn’t assess VP sizing with POWER8 or are jumping architectures,
    now is the time to revisit
Migration – Use Performance Capacity Monitor in HMC
Use PCM to identify Virtual Processor delays
- If All Partition Spread graph shows partition is always over entitlement
- If Aggregated Utilization table shows Dispatch Wait Time (Virtual Processors waiting)
- Then partition is waiting for cycles and impacting performance

[Screenshot: PCM graph with regions labelled Under Entitled and Over Entitled]
Migration - Other

• rPerf-level improvements can be expected with POWER8 or POWER9 modes
  - Both support SMT8, the technology is mature
  - Observationally, POWER9 mode will reduce physical consumption
    (covered later) due to dispatcher improvements
  - This is an early statement and may evolve as vendors exploit POWER9
    capabilities in code, optimizations, compiler and JVM improvements
  - Lab/Support will prefer latest mode due to more current profiler tooling
  - Future feature developments may require POWER9 mode, such as
    Interrupt Virtualization Engine and support for low latency storage
Memory Speeds (Scale Out)

DIMM / FC         Speed (Socket)
                  <= Half Populated   > Half Populated
16 GB / EM62      2.6 GHz             2.1 GHz
32 GB / EM63      2.4 GHz             2.1 GHz
64 GB / EM64      2.4 GHz             2.1 GHz
128 GB / EM65     2.4 GHz             2.1 GHz

• S92X Models
  - Peak B/W up to 170 GB/s per socket with DDR4
  - ½ population provides best memory bandwidth
  - Workloads sensitive to memory capacity should populate all slots
  - S914 does not support 128 GB DIMM
• GPU Models
  - NVLINK 2.0 interface integrated between CPU and GPU
  - 4 GPU models, 150 GB/s between CPU and GPU
  - 6 GPU models, 100 GB/s between CPU and GPU
  - Sustained 100 GB/s+ between DDR4 memory and CPU
Enterprise Bandwidth

• Enterprise Systems
  - Peak B/W up to 240 GB/s per socket with CDIMMs
  - 16 Gb/s X-Bus intranode connected fabric
  - 4X increase in SMP A-Bus internode connected fabric
  - 2X I/O bandwidth with PCIe Gen4 slots (8/drawer)
  - DDR3 CDIMMs from E880 models can be moved to E980
  - DDR3 CDIMMs from E850 cannot be moved to E950
  - (Future) Interrupt Virtualization Engine reduces the code path length
    and improves performance compared to the previous architecture:
    interrupt processing moved from Hypervisor into hardware
Enterprise Bandwidth
Power E850 (GBps)

Processor modules   2       3       4
GHz                 3.35    3.35    3.35
Cores               20      30      40
L1 data cache       3225    4837    6450
L2 cache            3225    4837    6450
L3 cache            4300    6450    8600

Power E950

Modules  GHz        Cores  L1 data cache      L2 cache           L3 cache
2        3.6-3.8    16     5,530 - 5,837      5,530 - 5,837      3,686 - 3,891
2        3.4-3.8    20     6,528 - 7,296      6,528 - 7,296      4,352 - 4,864
2        3.2-3.8    22     6,758 - 8,026      6,758 - 8,026      4,506 - 5,350
2        3.15-3.8   24     7,258 - 8,755      7,258 - 8,755      4,838 - 5,837
4        3.6-3.8    32     11,059 - 11,674    11,059 - 11,674    7,373 - 7,782
4        3.4-3.8    40     13,056 - 14,592    13,056 - 14,592    8,704 - 9,728
4        3.2-3.8    44     13,517 - 16,051    13,517 - 16,051    9,011 - 10,701
4        3.15-3.8   48     14,515 - 17,510    14,515 - 17,510    9,677 - 11,674
Enterprise Bandwidth

Power E870 / E880 [GBps]

                 Power E870    Power E870    Power E880
                 32 cores      40 cores      32 cores
                 4.024 GHz     4.190 GHz     4.350 GHz
L1 data cache    6,181         8,045         6,682
L2 cache         6,181         8,045         6,682
L3 cache         8,241         10,726        8,909

Power E980 [GBps]

                 32 cores          40 cores          44 cores          48 cores
                 3.9 to 4.0 GHz    3.7 to 3.9 GHz    3.58 to 3.9 GHz   3.55 to 3.9 GHz
L1 data cache    11,981 - 12,288   14,208 - 14,976   15,122 - 16,474   16,358 - 17,971
L2 cache         11,981 - 12,288   14,208 - 14,976   15,122 - 16,474   16,358 - 17,971
L3 cache         7,987 - 8,192     9,472 - 9,984     10,081 - 10,982   10,901 - 11,980
Utilization Values with Simultaneous Multithreading (SMT)

POWER processor-based systems support up to 8 SMT hardware threads per core.
Tools report each SMT thread as a vcpu (virtual - Linux), lcpu (logical - AIX) or cpu.
These threads can be equally weighted by the Linux Completely Fair Scheduler (CFS).

While commands for monitoring CPU use are similar between Linux and a Unix
derivative like AIX, the utilization numbers are different. AIX uses a calibration
mechanism built into the POWER hardware to account for spare/idle capacity left
in a core based on SMT threads used. Linux does not use this calibration.

Example: to reach 100% ”busy” in SMT4 Linux on a single core, four vcpus would
have to consume 25% each (100/4 = 25)

Core utilization % with 1 busy* thread (1 thread / vcpu)

OS              Linux       AIX       AIX
Architecture    P7/P8/P9    POWER8    POWER9
Mode            P8/P9       P8        P8/P9
SMT Mode
Single Thread   90-99%      99%       99%
2               50%         77%       50%
4               25%         60%       44%
8               12.5%       56%       32%
*Single core, single VP

Learning point: Linux on Power typically understates CPU utilization, so you need to use
tools like sar or mpstat to view individual vcpu use of SMT threads to assess per core use
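
The Linux column in the table is simply an unweighted average across the hardware threads of a core. The Python sketch below shows that arithmetic for per-thread busy percentages such as those reported by mpstat; the function name is illustrative and the input values are made up for the example.

```python
def linux_core_busy_percent(per_thread_busy, smt_mode):
    """Average per-thread busy % over all SMT threads of one core.

    Threads not listed are treated as 0% busy, mirroring how an
    uncalibrated (Linux-style) view weights every thread equally.
    """
    if len(per_thread_busy) > smt_mode:
        raise ValueError("more threads supplied than the SMT mode allows")
    return sum(per_thread_busy) / smt_mode


# One fully busy thread, the rest idle
print(linux_core_busy_percent([100.0], smt_mode=4))  # SMT4 core
print(linux_core_busy_percent([100.0], smt_mode=8))  # SMT8 core
# Four threads at 25% each on an SMT4 core also average to 25%
print(linux_core_busy_percent([25.0, 25.0, 25.0, 25.0], smt_mode=4))
```

AIX would report the same single-busy-thread cases as 44% (SMT4) and 32% (SMT8) on POWER9 according to the table, which is why the two operating systems cannot be compared on raw utilization.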
How does SMT work in POWER9?

For POWER9, SMT levels supported depend on virtualization layer
• OpenPOWER Linux cores support 1, 2 or 4 SMT threads only
• PowerVM Linux or AIX cores support 1, 2, 4 or 8 SMT threads

[Diagram: SMT hardware threads per core with their L1 Cache, L2 Cache (512 KB)
and L3 Cache (10 MB), shown for three environments: All POWER8 Environments,
POWER9 PowerVM AIX/Linux, and POWER9 OpenPOWER Linux]
Dispatch Behavior in AIX
For POWER7 & POWER8, the default dispatch algorithm is known as Raw
Throughput Mode. AIX will dispatch Virtual Processors once a utilization
threshold of 49% has been exceeded (or conversely, fold a VP once below
this threshold). Once all available first SMT threads of all VPs are
executing, it then wraps around and uses the second SMT thread for each
VP.
[Diagram: Virtual Processor 1, VP2, VP3 ... VPN, each mapped to a core
(core1, core2, core3 ... coren) with SMT threads numbered 1 to 8.
Caption: POWER8/POWER9 in POWER8 Mode (AIX)]

Dispatch Behavior in AIX / POWER9

POWER9/AIX in POWER9 Mode behaves a little differently. The VP code is
aware of the core architecture and will place/collapse smaller workloads
slightly more aggressively when workloads are present
• Optimization has the additional impact of reducing physical consumption
• Single thread utilization is calibrated to ~32% in SMT8 (~44% in SMT4),
  which is below the default VP dispatch threshold of ~50% per core
• Because single threads are calibrated lower and equivalent workloads
  generate a lower overall utilization, they are more likely to fall below the
  dispatch threshold, thus lowering physical consumption

But...
• AIX development has decided to adjust the dispatch threshold for POWER9 systems
• The intent is to make low-thread-count database workloads dispatch more
  aggressively for performance (following the POWER7 and POWER8 model)
• The new threshold will be less than the calibrated single-thread utilization of 32%
• Tuning will be to < 32%: APAR IJ10535: P9 VPM FOLD THRESHOLD
• Those using earlier releases can use vpm_fold_threshold=29, which is a
  schedo dynamic tunable (this is short-term guidance before the APAR ships)
• POWER9 Mode still has more awareness of core architecture for better
  cache optimizations
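
The interaction between calibrated utilization and the fold threshold can be pictured with a deliberately simplified model. The Python sketch below is a toy comparison only: it assumes a VP is activated when core utilization exceeds the threshold, and it ignores the moving averages and other heuristics a real dispatcher uses. The function name is hypothetical; the percentages come from the slides.

```python
# Calibrated core utilization (%) with a single busy thread, AIX on POWER9,
# taken from the utilization table earlier in this deck.
AIX_P9_SINGLE_THREAD_UTIL = {1: 99.0, 2: 50.0, 4: 44.0, 8: 32.0}


def exceeds_fold_threshold(core_util_percent, threshold_percent):
    """Toy rule: another VP is activated only above the threshold."""
    return core_util_percent > threshold_percent


smt8_util = AIX_P9_SINGLE_THREAD_UTIL[8]

# With a threshold near 49-50%, one busy SMT8 thread does not trigger another VP
print(exceeds_fold_threshold(smt8_util, 49.0))
# With the short-term guidance of vpm_fold_threshold=29, it does
print(exceeds_fold_threshold(smt8_util, 29.0))
```

This is why the threshold is being tuned below 32%: otherwise a low-thread-count workload in SMT8 would stack threads on one core instead of spreading to another VP.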
POWER6, POWER7/POWER8 and POWER9 AIX Dispatch
[Diagram: hardware thread contexts (Htc0, Htc1, ...) shown busy or idle for
POWER6 SMT2, POWER7/8 SMT4 and POWER9 SMT8 at the point where another Virtual
Processor is activated. Labels in the figure: ~80% busy; ~55% busy / ~45% idle;
~50% busy / ~50% idle (pre-IJ10535); ~30% busy / ~70% idle (post-IJ10535);
physc: ~1.0 in every case; Activate Virtual Processor]
Customers using Scaled Throughput
Scaled Throughput Mode is an alternative AIX dispatch algorithm, where
SMT threads on the same Virtual Processor are executed more
aggressively. In general, this mode:
• Reduces physical consumption by activating more SMT threads
• More “POWER6 like”
• Adopted by customers wanting to reduce physical consumption
without the effort of reducing Virtual Processor counts in a migration
• Trades some performance/latency compared to Raw Mode
• Settings are 2, 4 & 8 and map to how many SMT threads are used
before the next Virtual Processor is activated

[Charts: Raw Throughput vs Scaled Mode 2. Each chart plots Active_Threads,
Active_VP and Phys_Busy (Y axis 0 to 12) over time samples T1 onward]
Customers using Scaled Throughput
Guidance:
• There is nothing wrong with continuing to use Scaled Throughput in
lieu of reducing VP counts, but customers should pursue one strategy
over the other and not adopt both simultaneously in a migration
(without prior testing)
• I generally start with Mode 2 and validate performance before moving
to higher settings
• Settings of 4 & 8 can be expected to have noticeable performance
impacts for single-thread/latency sensitive workloads
• Customers using Scaled Throughput on POWER8 and migrating to
POWER9 should not expect significantly faster per-thread performance,
but core capacity is greater
• The tunable vpm_throughput_core_threshold allows you to define
a core threshold: Raw Mode is used up to this threshold and the dispatcher
then switches to Scaled Mode (think ”warmed up” cores for spiky,
low-latency workloads)
• Mode 1 is an alternative Raw Throughput Mode that uses a longer
moving average for Virtual Processor folding/activation (typically ~7-12%
reduction of physical consumption with little impact on
performance)
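
The difference between Raw and Scaled modes can be sketched as a question of how many software threads are packed onto a Virtual Processor before the next one is activated. The Python function below is a simplified illustration under that assumption only; it is not the AIX algorithm, it ignores utilization thresholds and folding delays, and all names are hypothetical.

```python
import math


def active_vps(runnable_threads, vp_count, threads_per_vp):
    """Estimate how many VPs are active for a given number of runnable threads.

    threads_per_vp=1 approximates Raw Throughput mode (one thread per VP
    until all VPs are in use); 2, 4 or 8 approximate the Scaled settings.
    """
    if runnable_threads <= 0:
        return 0
    return min(vp_count, math.ceil(runnable_threads / threads_per_vp))


# Six runnable threads on an 8-VP partition
for setting in (1, 2, 4, 8):
    print(f"threads per VP = {setting}: {active_vps(6, 8, setting)} active VPs")
```

Fewer active VPs means lower physical consumption, at the cost of more threads sharing a core, which is the performance/latency trade-off the guidance above describes.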
Best tool to review true VP and SMT activity
To view Virtual Processor and SMT activity on an existing workload, the best tool is mpstat
with the -v option. This option displays the actual Virtual Time Base (VTB), which is the
dispatch time for each Virtual Processor at the physical layer, along with physical
consumption (pc) and the activity of the SMT threads.

# mpstat -v 2 5    (2 second interval, five samples)

vcpu  lcpu      us     sy     wa      id   pbusy          pc             VTB(ms)
----  ----    ----   ----   ----    ----   -----          -----          -------
0            55.88   0.53   0.00   43.59   0.34[ 56.4%]   0.60[119.7%]   649
      0      55.88   0.52   0.00    0.47   0.34[ 56.4%]   0.34[ 56.9%]   -
      1       0.00   0.00   0.00   13.95   0.00[  0.0%]   0.08[ 13.9%]   -
      2       0.00   0.00   0.00   15.04   0.00[  0.0%]   0.09[ 15.0%]   -
      3       0.00   0.01   0.00   14.13   0.00[  0.0%]   0.08[ 14.1%]   -
4            56.26   0.92   0.00   42.82   0.07[ 57.2%]   0.13[ 25.5%]   209
      4      56.26   0.87   0.00    1.28   0.07[ 57.1%]   0.07[ 58.4%]   -
      5       0.00   0.04   0.00   14.11   0.00[  0.0%]   0.02[ 14.1%]   -
      6       0.00   0.01   0.00   13.69   0.00[  0.0%]   0.02[ 14.8%]   -
      7       0.00   0.01   0.00   13.75   0.00[  0.0%]   0.02[ 13.9%]   -
8            60.92   0.50   0.00   38.58   0.15[ 61.4%]   0.25[ 49.0%]   404
      8      60.92   0.49   0.00    0.64   0.15[ 61.4%]   0.15[ 62.0%]   -
      9       0.00   0.00   0.00   12.61   0.00[  0.0%]   0.03[ 12.9%]   -
      10      0.00   0.00   0.00   12.66   0.00[  0.0%]   0.03[ 13.0%]   -
      11      0.00   0.00   0.00   12.67   0.00[  0.0%]   0.03[ 13.0%]   -
ALL         173.05   1.95   0.00  124.99   0.56[175.0%]   0.97[194.2%]   1262

Callouts:
• vcpu column: how many VPs are actually dispatching (ignore the numbering scheme)
• VTB(ms) column: dispatch time in milliseconds

AIX 7.1 TL3 SP2 or above required
VIOS 3.1
Beyond general improvements and function, VIOS 3.1 will support these
performance and scalability improvements
- Native support for POWER8 and POWER9 architectures
- Utilizes POWER9 on-chip compression/encryption capabilities to
provide faster, more secure LPM
- SMT8, based on AIX 7.2 TL3
- Interrupt Virtualization Engine (future) and I/O optimizations for
future low-latency NVMe devices
- NPIV port scaling (future)
Dynamic System Optimizer (DSO)
DSO optimizes cache and memory affinity
- Monitors workloads for high cpu and memory utilization
- Associates targeted workloads to a specific core or set of cores
- Determines if memory pages being accessed can be relocated or
resized for higher affinity to cache & core
- Designed for POWER7 and originally shipped with AIX 7.1 TL01
- Supported on POWER8 and POWER9

Remind me again, what’s the difference between AIX Enhanced Affinity,
Dynamic System Optimizer, Dynamic Platform Optimizer?
- Enhanced Affinity optimizes threads to a scheduler domain (think chip)
- DSO optimizes threads within a chip to a core or set of cores
- DSO actively optimizes addressable memory pages for best locality
and size
- Dynamic Platform Optimizer optimizes a partition’s placement within a
frame or node (think “moves partitions” rather than threads)
POWER8 EnergyScale Overview

POWER8 EnergyScale Policies: Frequency
• System ships with fixed nominal frequency
• Maximum frequency (Turbo) achieved when system operating in
  nominal environment
• Workload will behave the same on same system configuration

[Chart: Frequency vs Load Level for the DPS-FP, NOMINAL, DPS and SPS policies]
POWER9 EnergyScale

Two new modes replace the POWER8 dynamic frequency modes
• POWER9 systems will ship with one of these two modes on by default
  - Dynamic Performance Mode: enables dynamic frequency with some
    restrained power/thermal envelope. Default on POWER9 S914
  - Maximum Performance Mode: enables dynamic frequency with highest
    performance operation. Default on S922 and S924 systems.
• Both modes dynamically adjust processor frequency to maximize
  performance
• Enable much higher CPU frequency range compared to POWER8
• For PowerVM systems, these are system wide modes but each CPU socket
  frequency is optimized separately

Factors used to determine the maximum CPU frequency
• CPU Utilization: lighter workloads will run at higher frequencies
• Number of Active Cores: fewer active cores will run at higher
  frequencies
• Environmental Conditions: lower ambient temperatures will run at higher
  frequencies
POWER9 Modes

Dynamic Performance Mode
• Increased performance for typical workloads over Static Nominal
• Less active workloads can use higher frequencies
• Lower active core counts also increase top frequency potential
POWER9 EnergyScale Modes

Maximum Performance Mode
• Increased performance over DPM for nominal environmental conditions
• Takes advantage of nominal environmental conditions by allowing
  increased CPU frequency and power draw
• Lighter workloads can exploit higher frequencies
• Idle state remains at high frequency

[Chart: Frequency vs Utilization Level for DPM, NOMINAL, MPM and SPS]
Monitoring Frequency

AIX
Currently, the AIX tooling only shows the legacy value for Dynamic AND
Maximum Performance Modes on POWER9 / AIX 7.2 (this is a bug)
lparstat -i | grep Saving
Power Saving Mode : Dynamic Power Savings (Favor Performance)

Average of processors on LPAR:
lparstat -E 1 10
Physical Processor Utilisation:
--------Actual-------- ------Normalised------
user sys wait idle freq user sys wait idle
---- ---- ---- ---- --------- ---- ---- ---- ----
0.352 0.003 0.000 0.645 2.3GHz[ 79%] 0.279 0.003 0.000 0.718
0.614 0.005 0.000 0.381 3.7GHz[128%] 0.786 0.006 0.000 0.207
0.614 0.005 0.000 0.382 3.7GHz[128%] 0.785 0.006 0.000 0.209

Regular lparstat output will also show %nsp values
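
In the lparstat -E sample above, the normalised columns look like the actual values scaled by the bracketed frequency percentage, and the percentage itself implies the nominal frequency. The Python sketch below reproduces both calculations; the relationship is inferred from the sample output, and the function names are illustrative.

```python
def normalised_utilisation(actual, freq_percent):
    """Scale an 'Actual' utilisation value by the reported frequency percentage."""
    return actual * freq_percent / 100.0


def implied_nominal_ghz(current_ghz, freq_percent):
    """Back out the nominal frequency from the current frequency and its percentage."""
    return current_ghz / (freq_percent / 100.0)


# Second sample row: user 0.614 at 3.7GHz[128%]
print(round(normalised_utilisation(0.614, 128), 3))
print(round(implied_nominal_ghz(3.7, 128), 2))
```

The small differences against the first sample row (0.279 published) are expected, because the displayed frequency percentage is itself rounded.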


POWER Processor counter interface:
pmcycles -M
This machine runs at 3475 MHz

Monitoring Frequency

Linux
List power management modes
dmesg | grep freq
[ 0.000000] time_init: decrementer frequency = 512.000000 MHz
[ 0.000000] time_init: processor frequency = 2900.000000 MHz

Linux (PowerVM)
ppc64_cpu --frequency
Linux (Non-PowerVM)
List frequency of all cores
cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_cur_freq

Display nominal frequency range
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_available_frequencies

Display frequency range
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_boost_frequencies
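
For scripted monitoring on non-PowerVM Linux, the same sysfs files can be read programmatically. The Python sketch below collects cpuinfo_cur_freq for every CPU and converts the kernel's kHz values to MHz. The function name and the sysfs_root parameter are illustrative; the files may not exist on every system and typically require root to read, so unreadable entries are skipped.

```python
import glob
import os


def read_core_frequencies_mhz(sysfs_root="/sys/devices/system/cpu"):
    """Return {cpu_name: MHz} from cpufreq's cpuinfo_cur_freq files (values are in kHz)."""
    freqs = {}
    pattern = os.path.join(sysfs_root, "cpu[0-9]*", "cpufreq", "cpuinfo_cur_freq")
    for path in sorted(glob.glob(pattern)):
        cpu_name = path.split(os.sep)[-3]  # .../cpuN/cpufreq/cpuinfo_cur_freq
        try:
            with open(path) as handle:
                freqs[cpu_name] = int(handle.read().strip()) / 1000.0
        except (OSError, ValueError):
            # Missing permission or unexpected content: skip this CPU
            continue
    return freqs


if __name__ == "__main__":
    for cpu_name, mhz in read_core_frequencies_mhz().items():
        print(f"{cpu_name}: {mhz:.0f} MHz")
```

Sampling this in a loop while a workload runs gives a simple view of how frequency responds to load on each hardware thread.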

Proving Frequency

No tools in AIX support an indication of the range of frequencies possible on
a system. Commands like prtconf or pmcycles will show the lower frequency. You
must apply a workload to see frequency changes.
Two simple examples would be to create a looping script or use Nigel’s
nstress package to generate a workload on a single cpu.

Option 1: create a script called cpu_freq_test.sh:
#!/usr/bin/ksh
while true
do
:
done

Set execute permissions and run:
chmod 755 cpu_freq_test.sh
./cpu_freq_test.sh

Option 2: download nstress and execute:
./ncpu -p 1 -s 120 &

Then execute pmcycles or lparstat -E as before:
pmcycles -M (Do not use -m)
This machine runs at 3658 MHz

nstress available at:
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/Power+Systems/page/nstress
Idle Power Saver
Idle Power Saver uses custom below-nominal thresholds for frequency adjustments. It can be
combined with Disable All Modes, Dynamic Performance Mode and Maximum Performance Mode.
Do not experiment without support guidance. If you had previous guidance to modify these at
older architecture levels, open a PMR and ask whether that should be continued for POWER9.
[Screenshots: POWER8 and POWER9]
Other

The PowerVP tool will not be supported on POWER9 systems


Pre-reqs for Best Optimization
IBM Java JDK8 SR5
Open JDK 1.8
Compilers
Linux/xlc v13.1.5, v15.1.6
Linux/gcc v7, -mtune=power9
Advanced Toolchain 11.0-3 or later
https://www-01.ibm.com/support/docview.wss?uid=swg27005174&aid=1
Next generation SR-IOV
https://www.ibm.com/support/knowledgecenter/POWER9/p9hcd/fcec3l.htm
https://www.ibm.com/support/knowledgecenter/POWER9/p9hcd/fcec2r.htm
https://www.ibm.com/support/knowledgecenter/9009-22A/p9hcd/fcec2t.htm

Notices and disclaimers

© 2018 International Business Machines Corporation. No part of this
document may be reproduced or transmitted in any form without
written permission from IBM.

U.S. Government Users Restricted Rights: use, duplication or
disclosure restricted by GSA ADP Schedule Contract with IBM.

Information in these presentations (including information relating to
products that have not yet been announced by IBM) has been reviewed
for accuracy as of the date of initial publication and could include
unintentional technical or typographical errors. IBM shall have no
responsibility to update this information. This document is distributed
“as is” without any warranty, either express or implied. In no event
shall IBM be liable for any damage arising from the use of this
information, including but not limited to, loss of data, business
interruption, loss of profit or loss of opportunity. IBM products and
services are warranted per the terms and conditions of the agreements
under which they are provided.

IBM products are manufactured from new parts or new and used parts.
In some cases, a product may not be new and may have been previously
installed. Regardless, our warranty terms apply.

Any statements regarding IBM's future direction, intent or product
plans are subject to change or withdrawal without notice.

Performance data contained herein was generally obtained in controlled,
isolated environments. Customer examples are presented as illustrations of
how those customers have used IBM products and the results they may have
achieved. Actual performance, cost, savings or other results in other
operating environments may vary.

References in this document to IBM products, programs, or services do
not imply that IBM intends to make such products, programs or services
available in all countries in which IBM operates or does business.

Workshops, sessions and associated materials may have been prepared by
independent session speakers, and do not necessarily reflect the views of
IBM. All materials and discussions are provided for informational purposes
only, and are neither intended to, nor shall constitute legal or other
guidance or advice to any individual participant or their specific situation.

It is the customer’s responsibility to ensure its own compliance with legal
requirements and to obtain advice of competent legal counsel as to
the identification and interpretation of any relevant laws and regulatory
requirements that may affect the customer’s business and any actions the
customer may need to take to comply with such laws. IBM does not provide
legal advice or represent or warrant that its services or products will ensure
that the customer follows any law.

Notices and disclaimers
continued
Information concerning non-IBM products was obtained from the
suppliers of those products, their published announcements or other
publicly available sources. IBM has not tested those products about this
publication and cannot confirm the accuracy of performance,
compatibility or any other claims related to non-IBM
products. Questions on the capabilities of non-IBM products should be
addressed to the suppliers of those products. IBM does not warrant the
quality of any third-party products, or the ability of any such third-party
products to interoperate with IBM’s products. IBM expressly disclaims
all warranties, expressed or implied, including but not limited to, the
implied warranties of merchantability and fitness for a purpose.

The provision of the information contained herein is not intended to, and
does not, grant any right or license under any IBM patents, copyrights,
trademarks or other intellectual property right.

IBM, the IBM logo, ibm.com and [names of other referenced IBM
products and services used in the presentation] are trademarks of
International Business Machines Corporation, registered in many
jurisdictions worldwide. Other product and service names might
be trademarks of IBM or other companies. A current list of IBM
trademarks is available on the Web at "Copyright and trademark
information" at: www.ibm.com/legal/copytrade.shtml.