
Qualcomm Hexagon Architecture


Lucian Codrescu

Sr. Director, Technology


Qualcomm Technologies, Inc.

Qualcomm Hexagon DSP: An architecture optimized for mobile multimedia and communications

Qualcomm Technologies, Inc. All Rights Reserved

Hexagon DSP processors in Snapdragon products


aDSP: Real-time media & sensor processing
mDSP: Dedicated modem processing

[Figure: Snapdragon 800 block diagram. Blocks shown: Camera, Adreno GPU, Display, JPEG, Video, four Krait CPUs with 2MB L2, Audio, Sensors, Hexagon aDSP, Misc. Connectivity, Other, Modem with Hexagon mDSP, Multimedia Fabric, System Fabric, Fabric & Memory Controller, and two LPDDR3 interfaces.]


Expansion of Hexagon DSP use cases beyond audio


[Figure: Hexagon DSP use cases by generation.
  HexagonV2/V3: Voice, Audio
  HexagonV4 based products: Image Enhancement (Camera, Still, Video); Computer Vision & Augmented Reality
  HexagonV5 based products: Video; Sensors]

Hexagon DSP is evolving for use beyond voice and audio to computer vision, video and imaging features

The Hexagon DSP evolution


Generational improvements in performance and power efficiency driven by
both architecture and implementation

[Figure: timeline of Hexagon DSP versions]

Version   Process   Date
V1        65nm      Oct 2006
V2        65nm      Dec 2007
V3M       45nm      June 2009
V3C       45nm      Aug 2009
V3L       45nm      Nov 2009
V4M       28nm      Dec 2010
V4C       28nm      Dec 2010
V4L       28nm      Apr 2011
V5A       28nm      Dec 2012
V5H       28nm      Dec 2012

Key characteristics of modem & multimedia applications

Requirements
  Require fixed real-time performance level (fps, Mbit/sec, etc.)
  Extremely aggressive power & area targets

Characteristics
  Mix of signal processing & control code
    For modem, Qualcomm does not use a split CPU/DSP architecture. All processing is done on Hexagon DSP
    Multimedia apps have significant control in the RTOS & frameworks
  Heavy L2$ misses
    Multimedia is data intensive
    Modem is code intensive


Hexagon DSP blends features targeted to modem & multimedia

VLIW
  Need multi-issue to meet performance
  Low complexity for Area & Power

Multi-Threading
  To reduce L2$ miss penalty without the need for a large L2
  Increases instructions/VLIW packet because the compiler doesn't need to schedule latency

Innovate in ISA to maximize IPC
  More work/VLIW packet reduces energy/instruction
  Keep the pipelines full for MIPS/mm2
  Target both Signal Processing & Control code


VLIW: Area & power efficient multi-issue


Variable sized instruction packets (1 to 4 instructions per packet)

Dual 64-bit load/store units
  Also 32-bit ALU

Dual 64-bit execution units
  Standard 8/16/32/64-bit data types
  SIMD vectorized MPY / ALU / SHIFT, Permute, BitOps
  Up to 8 16b MAC/cycle
  2 SP FMA/cycle

Unified 32x32-bit General Register File is best for compiler
  No separate Address or Accum Regs
  Per-Thread

[Figure: core block diagram. Blocks shown: Device DDR Memory, L2 Cache / TCM, Instruction Cache, Instruction Unit, two Data Units (Load/Store/ALU), two Execution Units (64-bit Vector), Data Cache, and one Register File per thread.]
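
The "up to 8 16b MAC/cycle" figure is consistent with two 64-bit execution units each handling four 16-bit lanes (2 x 4 = 8), although the slide does not spell out that derivation. As a rough illustration of what one such unit might do in a cycle, the C sketch below accumulates four signed 16x16 products taken from two 64-bit words. The function names mac4x16, pack4x16 and lane16 are hypothetical and are not Hexagon intrinsics.

```c
#include <stdint.h>

/* Extract signed 16-bit lane i (0..3) from a 64-bit vector value. */
static int16_t lane16(uint64_t v, int i) {
    return (int16_t)(uint16_t)(v >> (16 * i));
}

/* Hypothetical model of one 64-bit vector unit performing four 16x16 MACs:
   acc += a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3]. */
int64_t mac4x16(int64_t acc, uint64_t a, uint64_t b) {
    for (int i = 0; i < 4; i++) {
        acc += (int32_t)lane16(a, i) * (int32_t)lane16(b, i);
    }
    return acc;
}

/* Pack four signed 16-bit values into one 64-bit word, lane 0 lowest. */
uint64_t pack4x16(int16_t l0, int16_t l1, int16_t l2, int16_t l3) {
    return (uint64_t)(uint16_t)l0
         | ((uint64_t)(uint16_t)l1 << 16)
         | ((uint64_t)(uint16_t)l2 << 32)
         | ((uint64_t)(uint16_t)l3 << 48);
}
```

For example, mac4x16(0, pack4x16(1, 2, 3, 4), pack4x16(5, 6, 7, 8)) accumulates 5 + 12 + 21 + 32.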


Maximizing the signal processing code work/packet


Example from inner loop of FFT: Executing 29 simple RISC ops in 1 cycle

{ R17:16 = MEMD(R0++M1)              // 64-bit Load with post-update addressing
  MEMD(R6++M1) = R25:24              // 64-bit Store with post-update addressing
  R20 = CMPY(R20, R8):<<1:rnd:sat    // Complex multiply with round and saturation
  R11:10 = VADDH(R11:10, R13:12)     // Vector 4x16-bit Add
}:endloop0                           // Zero-overhead loops: Dec count, Compare, Jump top
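
To make the amount of work in that single packet more concrete, here is a minimal C sketch of the two arithmetic instructions. It is an approximation written for illustration, not a bit-exact model of the Hexagon instructions: the packing of the complex value (real part in the low halfword, imaginary part in the high halfword) is an assumption, and the function names cmpy_q15, vaddh and sat32 are hypothetical.

```c
#include <stdint.h>

/* Saturate a 64-bit intermediate to the signed 32-bit range. */
static int32_t sat32(int64_t x) {
    if (x > INT32_MAX) return INT32_MAX;
    if (x < INT32_MIN) return INT32_MIN;
    return (int32_t)x;
}

/* Q15 complex multiply with <<1, rounding and saturation.
   Assumed packing: real part in bits 15..0, imaginary part in bits 31..16. */
uint32_t cmpy_q15(uint32_t a, uint32_t b) {
    int32_t ar = (int16_t)(uint16_t)a, ai = (int16_t)(uint16_t)(a >> 16);
    int32_t br = (int16_t)(uint16_t)b, bi = (int16_t)(uint16_t)(b >> 16);
    int64_t re = (int64_t)ar * br - (int64_t)ai * bi;
    int64_t im = (int64_t)ar * bi + (int64_t)ai * br;
    /* *2 is the <<1 step; +0x8000 rounds before the top 16 bits are kept. */
    int32_t re32 = sat32(re * 2 + 0x8000);
    int32_t im32 = sat32(im * 2 + 0x8000);
    uint16_t re16 = (uint16_t)((uint32_t)re32 >> 16);
    uint16_t im16 = (uint16_t)((uint32_t)im32 >> 16);
    return ((uint32_t)im16 << 16) | re16;
}

/* Lane-wise wrapping add of four 16-bit halfwords held in a 64-bit value. */
uint64_t vaddh(uint64_t a, uint64_t b) {
    uint64_t out = 0;
    for (int i = 0; i < 4; i++) {
        uint16_t s = (uint16_t)((a >> (16 * i)) + (b >> (16 * i)));
        out |= (uint64_t)s << (16 * i);
    }
    return out;
}
```

Written out this way, each instruction expands into several multiplies, adds, shifts and saturation checks, which is the sense in which one packet stands in for many simple RISC operations.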


Maximizing the control code work/packet


Hexagon DSP ISA improves control code efficiency
over traditional VLIW

Example C code
void example(int *ptr, int val) {
    if (ptr != 0) {
        *ptr = *ptr + val + 2;
    }
}

Traditional VLIW Assembly Code

{ p0 = cmp.eq(r0,#0) }
{ if (!p0) r2=memw(r0)
  if (p0) jumpr:nt r31 }
{ r2 = add(r2,#2) }
{ r1 = add(r1,r2) }
{ memw(r0) = r1
  jumpr r31 }

Instr/Packet = 7 instr/5 packets = 1.4

Hexagon DSP: Dot-New Predication

{ p0 = cmp.eq(r0,#0)
  if (!p0.new) r2=memw(r0)
  if (p0.new) jumpr:nt r31 }
{ r2 = add(r2,#2) }
{ r1 = add(r1,r2) }
{ memw(r0) = r1
  jumpr r31 }

Hexagon DSP: Compound ALU

{ p0 = cmp.eq(r0,#0)
  if (!p0.new) r2=memw(r0)
  if (p0.new) jumpr:nt r31 }
{ r1 = add(r1,add(r2,#2)) }
{ memw(r0) = r1
  jumpr r31 }

Hexagon DSP: New-Value Store

{ p0 = cmp.eq(r0,#0)
  if (!p0.new) r2=memw(r0)
  if (p0.new) jumpr:nt r31 }
{ r1 = add(r1,add(r2,#2))
  memw(r0) = r1.new
  jumpr r31 }

Instr/Packet = 7 instr/2 packets = 3.5
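
The two Instr/Packet figures follow from simple counting, with a compound instruction counted as two. A small C helper makes the arithmetic explicit; the function name instr_per_packet is hypothetical and is used only for this illustration.

```c
/* Average instructions per packet, counting each compound instruction as 2. */
double instr_per_packet(int simple, int compound, int packets) {
    return (double)(simple + 2 * compound) / (double)packets;
}
```

The traditional VLIW version has 7 simple instructions in 5 packets, so instr_per_packet(7, 0, 5) gives 1.4. The final Hexagon version has 5 simple instructions plus 1 compound add in 2 packets, so instr_per_packet(5, 1, 2) gives 3.5.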

High avg. instructions/packet for targeted use cases

[Figure: bar chart of Average Instructions/VLIW Packet (y-axis 0 to 5; compound instructions count as 2) for Computer Vision, Video, Imaging, Control, and Audio workloads. Source: Qualcomm internal measurements]


Programmer's view of Hexagon DSP HW multi-threading

Hexagon V5 includes three hardware threads
Architected to look like a multi-core with communication through shared memory

[Figure: Thread 0, Thread 1, and Thread 2, each with its own Register File, two data units (DU) and two execution units (XU), sharing the Instruction Cache, the Data Cache, and the L2 Cache / TCM.]



Hexagon DSP V1-V4: Interleaved multi-threading


Simple round-robin thread scheduling
Number of threads matches execution pipe depth (three threads, three execute stages)
All instructions complete before next packet dispatch
Compiler schedules for zero-latency, which helps to increase instructions/VLIW packet
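
The round-robin rotation described above can be sketched in a few lines of C. This is a simplified model for illustration only: the function names are hypothetical, and the per-thread clock calculation simply assumes each thread dispatches on every third cycle.

```c
/* Interleaved multi-threading: under strict round-robin, the hardware thread
   that dispatches on a given cycle is the cycle number modulo the thread count. */
int imt_thread_for_cycle(unsigned cycle, int num_threads) {
    return (int)(cycle % (unsigned)num_threads);
}

/* With strict rotation, each thread effectively sees 1/num_threads
   of the core clock. */
double imt_per_thread_mhz(double core_mhz, int num_threads) {
    return core_mhz / num_threads;
}
```

Under this model a 300 MHz core with three threads gives each thread an effective 100 MHz, and a thread's next packet dispatches only after the other two threads have had their turn, which is why the compiler can treat instruction latency as zero.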

[Figure: dispatch timeline. Thread 0, Thread 1, and Thread 2 dispatch in rotation. Example packets include T0: { Ld Ld Add Cmp } and T1: { St Ld Mpy Add }, with a T2 packet containing Ld, Add, and Jump.]


Hexagon DSP V5: Dynamic HW multi-threading


Recover some performance when threads idle or stalled

Remove a thread from IMT rotation
  On L2 cache misses
  When in wait-for-interrupt or off mode

Additional forwarding to support 2-cycle packets
  VLIW packets with dependencies between long latency instructions will stall
  But many VLIW packets with simple instructions can complete in 2 processor clocks
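
One way to picture removing a thread from the rotation is a round-robin pick that skips threads that are not ready. The C sketch below is an assumption-laden illustration, not the actual hardware policy: the function name dmt_next_thread and the ready[] flags are hypothetical, and the 2-cycle packet timing and forwarding constraints are not modeled.

```c
#include <stdbool.h>

/* Pick the next thread to dispatch after `last`, skipping threads that are
   not ready (for example stalled on an L2 miss, in wait-for-interrupt, or off).
   Returns -1 if no thread is ready. */
int dmt_next_thread(int last, const bool ready[], int num_threads) {
    for (int step = 1; step <= num_threads; step++) {
        int t = (last + step) % num_threads;
        if (ready[t]) {
            return t;
        }
    }
    return -1;
}
```

With all three threads ready this reduces to plain round-robin. When thread 1 is stalled, the rotation alternates between threads 0 and 2, so the remaining threads dispatch more often than one cycle in three.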

[Figure: bar charts comparing IMT and DMT on Coremarks/MHz and on Dhrystone DMIPS/MHz. Source: Qualcomm internal measurements]


Hexagon DSP instructions per cycle

[Figure: bar chart of Average Instructions / Cycle (y-axis 0 to 4.5) comparing IPC_DMT and IPC_IMT for Multi-Threaded Apps and Single-Threaded Apps. Source: Qualcomm internal measurements]


Qualcomm Hexagon DSP architecture

Highly efficient mobile application processor designed for more performance per MHz

[Figure: bar chart of DSP Performance per MHz (BDTImark2000/MHz, y-axis 0 to 20) for the three processors listed in the table.]

Processor                      Clock Rate (MHz)   DSP Performance (BDTImark2000)
Mobile Competitor              430-520            4730-5720
Qualcomm HXGN V4 (1 thread)    100-233            1810-4220
Qualcomm HXGN V4 (3 threads)   300-700            5440-12660*

* Projected best case score for 3 threads

Source: BDTI. For more detailed information see www.BDTI.com. All scores © 2013 BDTI
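
The per-MHz comparison can be reproduced from the clock rate and score ranges on this slide by dividing each score by the matching clock rate. The helper below is a trivial sketch; the function name score_per_mhz is hypothetical.

```c
/* BDTImark2000 score divided by clock rate in MHz. */
double score_per_mhz(double score, double mhz) {
    return score / mhz;
}
```

Using the low end of each range, score_per_mhz(4730, 430) is 11 for the mobile competitor, score_per_mhz(1810, 100) is 18.1 for Hexagon V4 with one thread, and score_per_mhz(5440, 300) is roughly 18.1 for the projected three-thread case. The high ends of the ranges give nearly the same ratios.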


Hexagon DSP Power Benefits


MP3 playback power for competitive smartphones

[Figure: bar chart of MP3 playback power (lower is better) for the Qualcomm Hexagon-based phone and Competitors A through G.]

Power measured at the battery for various phones
Includes everything: DSP, CPU, memory, analog components, etc.
Source: Qualcomm internal measurements


Computer vision offload from ARM/Neon to Hexagon DSP

Augmented Reality Java App finding objects in image using FastCV Feature Detect
Comparison of Feature Detect run on:
  App CPU (ARM/Neon)
  App DSP (Hexagon)

[Figure: three bar charts comparing the two runs]
  CPU Utilization (%): 52% Less CPU
  Detection Time (%): 7% Less Time
  Total Device Power (%): 32% Less Power*

Source: Qualcomm internal measurements. * Power measured at the device battery



Hexagon DSP power for different thread utilizations


Excellent near-linear power scalability (as threads go idle, power used by the thread is nearly eliminated)
Achieved through optimized clock tree design & clock gating

[Figure: two charts, "Dhrystone Power, IMT Mode" and "FIR Power, IMT Mode", each comparing Actual and Ideal power on a 0% to 100% scale. Source: Qualcomm internal measurements]


Hexagon DSP Software Development


Independent Algorithm Developers on Hexagon DSP


Announcing the Hexagon DSP SDK


See the Hexagon DSP SDK in action at Uplinq2013 (www.uplinq.com)

Visit http://developer.qualcomm.com for more information.



Thank you
Follow us on:
For more information on Qualcomm, visit us at:
www.qualcomm.com & www.qualcomm.com/blog
© 2013 Qualcomm Technologies, Inc.
Qualcomm and Hexagon are trademarks of QUALCOMM Incorporated, registered in the United States
and other countries. All QUALCOMM Incorporated trademarks are used with permission. Other
product and brand names may be trademarks or registered trademarks of their respective owners.
Hexagon is a product of Qualcomm Technologies, Inc.