
Design and Implementation of Novel Source Synchronous Interconnection in Modern GPU Chips



Tao Li
GPU group
Advanced Micro Devices (AMD)
Shanghai, China
Tao.Li@amd.com

Greg Sadowski
Research Group
Advanced Micro Devices (AMD)
Boxborough, US
Greg.Sadowski@amd.com

Abstract: As the architecture of GPU chips evolves to provide higher performance at lower power, the topology of the interconnection between graphics shader engines and local frame buffers becomes critical. Source synchronous interconnection has been widely adopted in Network-on-Chip (NoC) designs. For a large GPU chip, the source synchronous bus (SSB) fabric that transfers data between shader engines and frame buffers follows the globally asynchronous locally synchronous (GALS) design style, in order to deal with the challenge of delivering synchronous high-frequency clocks in the GHz range across the full chip. It also reduces the area cost and power consumption of long-distance, wide data transfers.

In this paper, we present the design structure and physical implementation of a novel source synchronous interconnect network for GALS-style GPU topology. It combines the source synchronous bus lane with a Multiple Data Rate (MDR) structure and a transmission clock frequency much higher than the shader clock to provide a high-bandwidth, high-speed, low-area-cost data transmission fabric for GPU chips. We also developed MDR signal bit encoding techniques to reduce the toggle rate of the MDR data nets. With a clock gating scheme and MDR signal encoding techniques adapted to the applications, we could further reduce the total power of the SSB transmission fabric.

Keywords: GPU, SSB, MDR, NoC, GALS, ToF, OCV, AOCV

I. INTRODUCTION

With the emergence of large systems-on-chips, the high-bandwidth interconnection requirements between IP blocks have become a major performance bottleneck [1]. New design methodologies were adopted to solve that issue, such as the packet-switched paradigm called the Network-on-Chip (NoC).

Similarly, the architecture of GPU chips is becoming more and more like a complex SoC, to enable higher bandwidth and higher performance with lower power. There is an SoC-level interconnection fabric that connects all IP blocks for data and control exchange [2]. For GPU design, the interconnection between graphics shader engines and local frame buffers becomes critical for performance. To meet the system performance requirement, there are several approaches to implementing the interconnection, such as synchronous balanced clock distribution with repeaters, an asynchronous interface, a source synchronous bus with retime flops, and 3D IC.

Traditional GPU chips normally adopt a clock mesh design to provide a balanced synchronous clock distribution at GHz frequencies and above across the whole chip. With this mesh structure, the clock skew is minimized to about a dozen picoseconds across the chip, and the whole system is driven at a high clock frequency to get high performance. The interconnection is implemented as a synchronous design, with repeater flops inserted so that the bus can span a large distance. This synchronous interconnection approach has the limitation that all IPs, or at least their interface logic and repeaters, need to run at the same clock frequency, at the cost of the extra area and power consumption of the mesh itself.

The modern GPU design has evolved to be more like an SoC design, with many heterogeneous components running on their own clocks. It takes a lot of work to build a mesh clock structure that delivers the synchronous clock to the interconnect repeaters between all these components. So the globally asynchronous locally synchronous (GALS) methodology has emerged to handle this architecture update [3].

For a GALS system, there are several approaches to implementing the interconnection. One of them is an asynchronous interface design, which has the advantages of high throughput and no need to distribute balanced synchronous clocks across the chip, significantly reducing power consumption. But there are still limitations on the adoption of asynchronous design in a big GPU design, such as the lack of EDA tools to synthesize, implement and analyze asynchronous designs. By its nature, asynchronous design also does not have good DFT support. The area overhead needs to be carefully considered as well: because the bus bit width is high, about 1,000, the area overhead would be significant with a dual rail asynchronous design.

Another approach is the source synchronous bus interconnection [4], which has been widely adopted in Network-on-Chip (NoC) for GALS-style designs. For large GPU designs, the SSB bus fabric transfers data between shader engines and frame buffers, and retime flops are inserted on paths with large distances to accommodate process variation. With an SSB implementation, there is no challenge of delivering synchronous high-frequency clocks in the GHz range across the full chip. It also reduces the area cost and power consumption of long-distance, wide data transfers.

978-1-4799-3378-5/14/$31.00 ©2014 IEEE 130


Authorized licensed use limited to: UNIVERSIDADE DE SAO PAULO. Downloaded on May 12,2024 at 06:37:29 UTC from IEEE Xplore. Restrictions apply.
Fig. 1. GPU system interconnection between shader engines and frame buffer

There are several different types of SSB architectures. The first type uses the same clock edge that launches the data to capture the data; we call it zero cycle. The second type uses the next negative clock edge to capture data; we call it half cycle. The third type uses the next clock edge to capture data; we call it full cycle. The structures of zero cycle and full cycle are exactly the same; only the difference in clock and data arrival timing makes them function differently. Normally we could use clock gating to stop the SSB bus and reduce power consumption while it is idle. However, of these three types of SSB, only the zero cycle type truly supports clock gating, while the other two types would have valid data stalled in the retime flops if we stopped the clock.

Another GALS interconnect approach is 3-D die stacks [5], which convert long 2-D interconnections into short 3-D vertical interconnections with through-silicon vias (TSVs). Combined with other approaches, it could make GPU designs more scalable, with improved performance and faster time to market.

In this paper, we propose a Source Synchronous Bus Multiple Data Rate (SSBMDR) interconnection scheme, which delivers multiple data rate data on an SSB scheme. The SSBMDR interconnection scheme eliminates the energy dissipation of previous synchronous clock distributions and reduces the area and routing resource cost of a wide bus. One could further reduce the total power of the SSB transmission fabric with the additional support of a clock gating scheme and MDR signal bit encoding techniques adapted to the applications.

The remainder of this paper is organized as follows: Section II presents the proposed on-chip SSBMDR interconnection scheme in detail, where all related design constraints are derived and discussed; Section III provides the SSBMDR design implementation and results; Section IV is the conclusion.

Fig. 2. (a) Balanced clock repeater scheme, (b) Source Synchronous Bus

II. ON CHIP SSBMDR INTERCONNECTION SCHEME

As current GPU designs are based on GALS, the natural solution for interconnection is to use SSB, which saves the cost of delivering balanced synchronous clocks across the full chip. Shader engines require interconnection with very high bandwidth, normally a couple of thousand bits wide. For such a wide interconnection, we could not afford to use RDL for routing. RDL routing resources are very limited because most of them are already used for the full chip power delivery network and clock mesh distribution. Such a wide interconnection would have a huge impact on regular signal routing even when using a regular metal layer without a surrounding shield. To handle this issue properly, we propose the SSBMDR scheme, which reduces the routing resource cost and the energy dissipation of the interconnection.

A. SSBMDR Scheme Architecture

There are several typical MDR structures. One recovers the clock from the data stream and is normally used for narrow data bus transmission. Another transmits the clock together with the data. Since the clock recovery circuit has area overhead, we chose to transmit the clock together with the data. The basic structure of SSBMDR is the combination of typical SSB with MDR, shown in Fig. 3, with an MDR driver that sends out N bits of data with an N times fast clock to achieve N data rate transfer. For SSBMDR interconnection, there is a certain ToF (Time of Flight) limit on the distance that the SSB signal can travel safely without resynchronization. MDR retime flops need to be inserted to re-sync the clock and data. At the receiving end of the SSBMDR interconnection, there is logic to de-serialize the MDR data, and FIFO logic to synchronize the received data to the local clock.

Fig. 3. SSBMDR scheme with N data rate

For an SSBMDR scheme with N data rate, the SSB data route cost is reduced from N to 1, which lowers the area cost of repeaters and the routing resource of the interconnection, thus reducing the impact on other normal design logic.

The basic SSBMDR scheme uses an N times fast clock for SSB transit, and the falling clock edges still waste energy. The alternative is to use an N/2 times fast clock and utilize both rising and falling edges for SSB transit.
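The serialization step described in Section II.A can be illustrated with a small behavioral sketch. It is a minimal model under stated assumptions, not the authors' circuit: the function names `mdr_serialize`, `mdr_deserialize`, and `routes_needed` are hypothetical, and each list element stands for the value carried on one data route during one fast-clock cycle. N parallel bits captured on one source clock are sent over a single route in N fast-clock cycles and regrouped at the receiving end.

```python
from typing import List, Sequence


def mdr_serialize(words: Sequence[Sequence[int]], n: int) -> List[int]:
    """Flatten N-bit words into one bit per fast-clock cycle (N cycles per word)."""
    stream: List[int] = []
    for word in words:
        if len(word) != n:
            raise ValueError(f"expected {n} bits per word, got {len(word)}")
        stream.extend(int(bit) & 1 for bit in word)
    return stream


def mdr_deserialize(stream: Sequence[int], n: int) -> List[List[int]]:
    """Regroup the serial stream into N-bit words at the receiving end."""
    if len(stream) % n != 0:
        raise ValueError("stream length must be a multiple of N")
    return [list(stream[i:i + n]) for i in range(0, len(stream), n)]


def routes_needed(bus_width: int, n: int) -> int:
    """Data routes needed when each route carries N bits per source clock."""
    return -(-bus_width // n)  # ceiling division


if __name__ == "__main__":
    words = [[1, 0], [0, 1], [1, 1]]
    stream = mdr_serialize(words, 2)
    print(stream)                      # one route, two fast-clock cycles per word
    print(mdr_deserialize(stream, 2))  # words recovered at the receiver
    print(routes_needed(128, 2))       # routes for a 128-bit bus at N = 2
```

In this model a 128-bit bus at N = 2 needs 64 data routes, which is the "from N to 1" route reduction described above. The sketch ignores the forwarded clock, retime flops, and the receiving FIFO.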

B. MDR Retime Flops of SSBMDR

The MDR retime flops differ with the clock scheme. With one N times fast clock, normal flops are used as SSB retime flops. With an N/2 times fast clock, a double edge flop is used as the SSB retime flop; it captures data on both clock edges, and a 2:1 MUX selects the right data to transmit on the data net, as shown in Fig. 4 (b). With a 1x clock, N flops are used for the retime flop, and an N-to-1 MUX behind the retime flops selects the correct data to transmit on the net. This MUX is controlled by a counter of clock edges to serialize the N-bit data to the output, as shown in Fig. 4 (a).

Fig. 4. (a) N data rate retime flop symbol and structure, (b) double edge flop symbol and structure

C. SSBMDR Scheme Operation

The operation of SSBMDR is similar to regular SSB. With one N times clock transmitted along with the data bits, every rising clock edge transmits one of the N data bits. If we use an N/2 times clock, then the two clock edges each transmit N/2 bits of data. Fig. 5 shows the SSBMDR operation with data rate N = 2.

Fig. 5. SSBMDR operation (N = 2)

D. Reducing Energy Dissipation

We built a model to calculate the energy dissipation reduction of SSBMDR versus regular SSB by breaking down the total energy dissipation. The first part is the repeaters, which are mostly the combinational logic of buffers and inverters; the second part is the retime flops; and the third part is the clock network, as shown in equation (1). The majority of the power dissipation is switching power spent charging and discharging the net load, and it is toggle-rate dependent. With the N data rate SSBMDR, for an average scenario, if we assume we can reduce the toggle rate of SSBMDR nets to be similar to that of regular SSB nets, the energy dissipation on repeaters is 1/N of regular SSB, which includes the net switching power and the cell internal power. As the clock now runs N times faster with SSBMDR, the power dissipation on the clock is N times higher. Because SSBMDR runs with an N times higher frequency clock, the distance between retime flops is 1/N of that of regular SSB. Therefore the number of retime flops is N times that of regular SSB, so their energy is also N times higher.

The energy dissipation difference of the SSBMDR scheme against the regular SSB scheme is modeled as equation (2). Then we get equations (3) and (4), where $R_{repeater}$ is the ratio of regular SSB repeater energy dissipation to the total energy dissipation. Now we just need to check the typical $R_{repeater}$ in a real design to get the reduction of energy dissipation with SSBMDR.

$$E_{SSB} = E_{repeater} + E_{retime} + E_{clock} \tag{1}$$

$$E_{SSB} - E_{SSBMDR} = \left(E_{repeater} + E_{retime} + E_{clock}\right) - \left(\frac{E_{repeater}}{N} + N\,E_{retime} + N\,E_{clock}\right) \tag{2}$$

$$\frac{E_{SSBMDR}}{E_{SSB}} = \frac{R_{repeater}}{N} + N\left(1 - R_{repeater}\right) \tag{3}$$

$$R_{repeater} = \frac{E_{repeater}}{E_{SSB}} \tag{4}$$

Based on real design data in a 28nm process, $R_{repeater}$ is usually above 80%, so we chose N = 2 for our SSBMDR design according to the plot in Fig. 6. As $R_{repeater}$ is related to the distance between retime flops, extending the distance between retime flops gives a bigger $R_{repeater}$, and we could either reduce more energy dissipation with the same N or use a larger N to reduce more area cost and route resource.

Fig. 6. Energy dissipation comparison of SSBMDR/SSB against $R_{repeater}$, with N data rate (N = 1, 2, 4, 8)

As the $R_{repeater}$ value is high, the SSBMDR power dissipation on repeaters is high. Most of the energy is dissipated in the buffers and inverters, so we apply an MDR bit encoding method to further reduce the repeater power consumption.

Fig. 7. DDR net energy dissipation vs SDR

We chose DDR (N = 2) to show the bit encoding methods. Without any bit encoding, the power consumption of a DDR net is shown in Fig. 7. We can see that when Signal 1 is the inverse of Signal 2 and both signals are static, the DDR net has its maximum energy dissipation relative to standard SSB. So we developed several encoding logics, shown in TABLE I, to lower the toggle rate of DDR nets and in turn reduce the power consumption. The area overhead of the bit encoding logic is minor. With the working scenario simulation, we get the plot of the DDR toggle rate against the Sig1 versus Sig2

toggle rate difference, shown in Fig. 8. We found Code-4 to be the best for big toggle differences (above 50%). Code-1 is the best for small toggle differences (below 28%), and basic DDR without encoding seems the best for toggle differences between 28% and 50%. So the best method is to use mixed bit encoding logic for different DDR bus segments. One could use Code-1 for address bus MSB bits and Code-4 for LSB bits. For a data bus in graphics applications, the sign bits have a lower toggle rate, therefore Code-1 is a good choice.

Fig. 8. DDR bit encoding toggle rate against Sig1/Sig2 toggle difference

With the SSBMDR scheme, a clock gating structure can easily be implemented to reduce the energy dissipation when the bus is not transmitting valid data. This is similar to the regular SSB scheme, but it must be guaranteed that all of the data bits are launched and captured with the same clock edge between retime flops.

TABLE I. DDR BIT ENCODING

Definition                             Code-1  Code-2  Code-3  Code-4
Sig1 and Sig2 unaltered from before    00      00      01      00
Sig1 and Sig2 both inverted            11      10      10      10
Sig2 inverted                          01      01      00      11
Sig1 inverted                          10      11      11      01

E. SSBMDR Benefits

The SSBMDR scheme has several benefits:

• Reduction of the energy dissipation. As shown in section D, SSBMDR with an optimized data rate N could reduce the energy dissipation relative to the regular SSB scheme. With bit encoding and clock gating optimization, we could further reduce the energy dissipation of the SSBMDR scheme.

• Reduction of the design area cost. Compared with regular SSB, SSBMDR with N data rate could reduce the area of repeaters from N to 1, although the area of retime flops increases from 1 to N. But since retime flops account for only a small percentage of the total area, the total area of SSBMDR is still reduced.

• Reduction of the route resource. Similar to the area reduction, SSBMDR with N data rate reduces the route resource from N to 1. This helps to minimize the impact on other normal design logic.

F. Design Constraints

There is one major design constraint on SSBMDR, which is the distance between retime flops.

1) Distance between retime flops.

In order to achieve a higher reduction in area and energy dissipation, we needed to make the distance between retime flops larger. The distance between retime flops, D, could be modeled as equation (5) for data rate N = 2, and it is impacted by several factors. One timing variable is the SSBMDR clock period. Another is the skew between the clock and data; the SSBMDR clock and data traverse in parallel under the same conditions, so normally this skew is very small. The variable R is the ratio of crosstalk (SI) impact on the SSBMDR data and clock paths.

[Equation (5): distance between retime flops; not legible in this copy]

a) OCV, On Chip Variation, represents the process variations as well as other design-dependent variations across the logic gates. OCV has a big impact on timing closure for designs in deep sub-micron process nodes at 28nm and below. Moving into the 28nm process and below, the device and parasitic variations are much higher than in previous process nodes. The fixed single OCV derate used for timing signoff is much higher to cover corner cases, which requires a higher margin for timing closure across the whole design. The industry has now turned to AOCV, which is advanced OCV, with a depth-based derate table. As the logic depth becomes deeper, the random cancellation effect means only a small OCV derate is required. This makes timing closure easier for advanced design nodes. AOCV also reserves a margin to model the differences in IR-drop, temperature, and wire parasitics on the various logic paths for clock and data. By nature, the repeater chains used in an SSBMDR design usually have very deep logic levels (depth) between retime flops. The corresponding AOCV derate value is not big at such depth. And because the clock and data logic travel in parallel for long distances, they share similar circumstances. With just a small difference in IR-drop, temperature and wire parasitics, the extra margin in AOCV is also smaller. So if there is no other impact, an SSB structure could span a very long distance with carefully selected repeater cells.

Fig. 9. ToF of different repeater cells with different hop distance

b) Another important factor is ToF, which is decided by the selection of repeater cells. For any given process and standard cell library we need to push the SSBMDR to be as fast as possible. To do this, we need to make the practical ToF as small as possible by carefully selecting the combination of buffer and inverter cells and the hop distance of the repeaters, which maps to the drive load of the repeaters. We plot the circuit

simulation ToF curve with a combination of different repeater cells and hop distances, as in Fig. 9. The ideal hop distance is around 60um~80um, but this would have too much impact on normal design logic. So for real projects, the practical choice of hop distance is about 100um~150um.

c) The crosstalk impact R is also a very important factor. Ideally, without any crosstalk impact, the distance between retime flops could be very long before the signals needed to be resynchronized. But in the real world, crosstalk has the biggest impact on the distance. Moving into the 28nm process and below, the route width and space both shrink, which leads to higher net resistance and capacitance. As the majority of SSBMDR nets are routed in parallel with smaller net spacing, the coupling capacitance is much higher than what we have seen before. Normally we could add shield protection to reduce R, and there is shield protection for the clock route. But there is a tradeoff in adding shield protection for every data bit. Compared with regular SSB, adding shield protection for SSBMDR data bits will increase the route resource from 1/N to 2/N. That means that we could not reduce the route resource cost with SSBMDR unless N > 2. If we used N > 2, the system bus width would be very wide, up to 1,000 bits, which would cost a great deal of area and route resource with shield protection. Another way to mitigate the crosstalk impact is to shift the timing windows of neighbouring bits [6], [7]. When one bit is switching, the neighbouring bits are stable and act like a natural shield. There are many ways to do this, such as using delay cells for data bits or adjusting the clock phase.

III. IMPLEMENTATION AND ANALYSIS RESULT

A. Design Implementation

A 128-data-bit SSBMDR structure with double data rate (N = 2) and a 10mm channel distance was designed in a test chip on a below-20nm technology process. For comparison, a regular SSB of similar topology with 128 data bits was also designed in the same test chip. All interconnect routes use non-DPT (Double Patterning Technology) metals with 0.04um width and 0.04um space, which have a bigger coupling capacitance and resistance than mainstream processes such as 28nm technology, whose metals have 0.05um in both width and space. For even older processes such as 65nm technology, the metal parasitics are much smaller and allow longer distances, as the metal width and space are both 0.14um. The test chip has a target clock frequency of 1GHz. Because of the high metal parasitics, we designed the distance between retime flops to be 1500um for the SSBMDR structure and 3000um for the regular SSB structure. Both the clock and data routes are fully shielded.

The bus data stimulus generation logic and data capture comparison logic were also designed into the test chip to measure the SSBMDR scheme function and performance, as well as the difference between the SSBMDR scheme and the regular SSB scheme. This was to estimate the reduction of energy dissipation with the SSBMDR scheme. All test logic is controlled and observed through the normal IEEE 1149.1 JTAG interface in the test chip.

B. Simulation and Signoff Results

As the test chip is still in the tape-out process, we only show the design simulation and signoff analysis results. The power consumption report is shown in Fig. 10 and Fig. 11, with different data net toggle rates applied to both the SSBMDR and regular SSB structures. We can see that the regular SSB $R_{repeater}$ value is around 85% to 90%, based on equation (4) and Fig. 6. We saw a high energy dissipation reduction with the SSBMDR scheme against regular SSB. With the same data net toggle rate, SSBMDR could reduce the power consumption by up to 45%.

Fig. 10. Repeater power consumption of SSBMDR/SSB, and $R_{repeater}$ value

Fig. 11. SSBMDR power consumption reduction vs regular SSB

IV. CONCLUSION

The architecture of GPU chips has evolved to be more of an SoC design style, thus requiring an interconnection that provides higher performance with lower power. In modern GPU design, the new GALS-style topology of graphics shader engines makes the interconnection to local frame buffers very critical. To solve this challenge under real design constraints, several solutions are being adopted in current designs. One is the source synchronous bus interconnection, which has been widely adopted in Network-on-Chip (NoC). The SSB bus fabric transfers data between the shader engines and the frame buffers, with the sending and receiving ends running at the same clock frequency but with different phase. This is ideal for the GALS design style of a large GPU chip, as it deals with the challenge of delivering synchronous high-frequency clocks in the GHz range across a full chip. But regular SSB interconnection also has a high cost in area, route resource, and power consumption for long-distance, wide data transfers, as only half of the clock edges are actually transmitting data.
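The power trade-off in this comparison follows the energy breakdown of Section II.D: relative to regular SSB, the repeater energy scales by 1/N, while the retime flop and clock energy scale by N. A minimal sketch of that model follows. The function names `ssbmdr_energy_ratio` and `break_even_r` are hypothetical, and the closed form restates the description in the text under its equal-toggle-rate assumption; it is not a formula quoted from the paper and does not include bit encoding or clock gating.

```python
def ssbmdr_energy_ratio(r_repeater: float, n: int) -> float:
    """Modeled SSBMDR energy divided by regular SSB energy.

    r_repeater: share of regular SSB energy spent in repeaters (0..1).
    n: data rate N. Repeater energy scales by 1/N; the remaining share
    (retime flops plus clock) scales by N.
    """
    if not 0.0 <= r_repeater <= 1.0:
        raise ValueError("r_repeater must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return r_repeater / n + n * (1.0 - r_repeater)


def break_even_r(n: int) -> float:
    """Repeater share above which the modeled ratio drops below 1 (for N > 1)."""
    if n <= 1:
        raise ValueError("break-even is only defined for N > 1")
    # Solving r/n + n*(1 - r) = 1 for r gives r = n / (n + 1).
    return n / (n + 1)


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        row = ", ".join(
            f"R={r:.2f}: {ssbmdr_energy_ratio(r, n):.3f}" for r in (0.80, 0.85, 0.90)
        )
        print(f"N={n}: {row}")
```

Under this model a repeater share of 0.8 gives a ratio of 0.8 for N = 2 and roughly 1.0 for N = 4, which is in line with the choice of N = 2 for repeater shares just above 80%. The measured reduction reported in the paper comes from signoff analysis, so it need not equal the value this simplified model produces.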

In this paper, we presented the design structure and physical implementation of a novel SSBMDR scheme, which combines the basic SSB with a DDR structure. This allows it to fit an interconnect network for GALS-style GPU topology and provide high-bandwidth, high-speed data transmission, similar to the SSB scheme. At the same time, SSBMDR could also reduce the area cost and route resource of a wide data transmission fabric for GPU chips. Most importantly, with a clock gating scheme and signal bit encoding techniques adapted to the applications, SSBMDR could further reduce the total power of the SSB transmission fabric. A test chip has been designed to test the SSBMDR scheme. The analysis result shows that SSBMDR improves on regular SSB in energy dissipation, with a reduction of up to 45%. As there are many wide interconnection buses in GALS-style GPU designs, one could reduce more power consumption with this SSBMDR scheme.

REFERENCES

[1] J. D. Meindl et al., "Interconnect opportunities for gigascale integration," IBM J. Res. Develop., vol. 46, no. 2/3, pp. 245-264, 2002.
[2] J. Sell, P. O'Connor, "Main SOC and XBOX One Kinect," Hotchip Symposium, 2013.
[3] A. Edman and C. Svensson, "Timing closure through a globally synchronous, timing partitioned design methodology," in Proc. DAC, Jun. 2004, pp. 71-74.
[4] M. Ghoneima, Y. Ismail, M. Khellah, V. De, "SSMCB: Low-Power Variation-Tolerant Source-Synchronous Multicycle Bus," in IEEE Trans. Circuits and Systems, Feb 2009, pp. 384-394.
[5] A. A. Maashri, G. Sun, X. Dong, V. Narayanan, Y. Xie, "3D GPU architecture using cache stacking: performance, cost, power and thermal analysis," ICCD'09, pp. 254-259, 2009.
[6] M. Khellah, M. Ghoneima, J. Tschanz, Y. Ye, N. Kurd, J. Barkatullah, Y. Ismail, and V. De, "A skewed repeater bus architecture for on-chip energy reduction in microprocessors," in Proc. ICCD, Oct. 2005, pp. 253-257.
[7] M. Ghoneima, M. Khellah, J. Tschanz, Y. Ye, N. Kurd, J. Barkatullah, Y. Ismail, and V. De, "Skewing adjacent line repeaters to reduce the delay and energy dissipation of on-chip buses," in Proc. ISCAS, May 2005, pp. 592-595.
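The bit encodings of TABLE I (Section II.D) can be explored with a small sketch. The interpretation used here is an assumption: on each clock cycle the encoder compares Sig1 and Sig2 with their previous values and emits the two-bit code from the matching table row, and the two code bits are sent back to back on one DDR net. The names `CODES`, `encode`, `decode`, and `toggle_count`, and the all-zero starting state, are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[int, int]

# Rows of TABLE I keyed by (Sig1 inverted?, Sig2 inverted?).
CODES: Dict[str, Dict[Tuple[bool, bool], str]] = {
    "Code-1": {(False, False): "00", (True, True): "11", (False, True): "01", (True, False): "10"},
    "Code-2": {(False, False): "00", (True, True): "10", (False, True): "01", (True, False): "11"},
    "Code-3": {(False, False): "01", (True, True): "10", (False, True): "00", (True, False): "11"},
    "Code-4": {(False, False): "00", (True, True): "10", (False, True): "11", (True, False): "01"},
}


def encode(pairs: Sequence[Pair], code: str) -> List[str]:
    """Map each (Sig1, Sig2) sample to a two-bit symbol from TABLE I."""
    table = CODES[code]
    prev: Pair = (0, 0)  # assumed starting state of Sig1 and Sig2
    symbols: List[str] = []
    for s1, s2 in pairs:
        symbols.append(table[(s1 != prev[0], s2 != prev[1])])
        prev = (s1, s2)
    return symbols


def decode(symbols: Sequence[str], code: str) -> List[Pair]:
    """Recover the (Sig1, Sig2) samples from the two-bit symbols."""
    inverse = {symbol: key for key, symbol in CODES[code].items()}
    prev: Pair = (0, 0)  # must match the encoder's assumed starting state
    pairs: List[Pair] = []
    for symbol in symbols:
        flip1, flip2 = inverse[symbol]
        prev = (prev[0] ^ int(flip1), prev[1] ^ int(flip2))
        pairs.append(prev)
    return pairs


def toggle_count(symbols: Sequence[str]) -> int:
    """Count transitions on the DDR net when the symbols are sent back to back."""
    bits = "".join(symbols)
    return sum(1 for a, b in zip(bits, bits[1:]) if a != b)


if __name__ == "__main__":
    static_pairs = [(1, 0)] * 4  # Sig1 is the inverse of Sig2, both static
    basic = [f"{s1}{s2}" for s1, s2 in static_pairs]
    print("basic DDR toggles:", toggle_count(basic))
    for name in CODES:
        print(name, "toggles:", toggle_count(encode(static_pairs, name)))
```

For the static, complementary case that the text identifies as the worst case for an unencoded DDR net, this sketch shows the unencoded net toggling on every half cycle while Code-1 toggles only once after the first sample. Which code performs best on real traffic depends on the toggle statistics of Sig1 and Sig2, as Fig. 8 indicates.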
