"A million-transistor budget helps this RISC deliver balanced MIPS, Mflops, and graphics performance with no data bottlenecks."

The single-chip i860 executes parallel instructions using supercomputer architectural concepts. New-generation CAD tools and 1-micrometer semiconductor technology made possible the 10 mm x 15 mm processor (see Figure 1), with its balanced integer, floating-point, and graphics performance. To accommodate our performance goals, we divided a million-transistor budget between blocks for integer operations, floating-point operations, and instruction and data cache memories. Inclusion of the RISC (reduced instruction set computing) core, floating-point units, and caches on one chip lets us design wider internal buses, eliminate interchip communication overhead, and offer higher performance. As a result, the i860 avoids off-chip delays and allows users to scale the clock beyond the current 33- and 40-MHz speeds.

We designed the i860 for performance-driven applications such as workstations, minicomputers, application accelerators for existing processors, and parallel supercomputers. The i860 CPU design began with the specification of a general-purpose RISC integer core. However, we felt it necessary to go beyond the traditional 32-bit, one-instruction-per-clock RISC processor. A 64-bit architecture provides the data and instruction bandwidth needed to support multiple operations in each clock cycle. The balanced performance between integer and floating-point computations produces the raw computing power required to support demanding applications such as modeling and simulations.

Finally, we recognized a synergistic opportunity to incorporate a 3D graphics unit that supports interactive visualization of results. The architecture of the i860 CPU provides a complete platform for software vendors developing i860 applications.

Figure 1. Die photograph of the i860 CPU.

Architecture overview. The i860 CPU includes the following units on one chip (see Figure 2):

the RISC integer core,
a memory management unit with paging,
a floating-point control unit,
a floating-point adder unit,
a floating-point multiplier unit,
a 3D graphics unit,
a 4-Kbyte instruction cache,
an 8-Kbyte data cache, and
a bus control unit.

Les Kohn
Neal Margulis
Intel Corp.

0272-1732/89/0800-0015$01.00 © 1989 IEEE — August 1989

Parallel execution. To support the performance available from multiple functional units, the i860 CPU issues up to three operations each clock cycle. In single-instruction mode, the processor issues either a RISC core instruction or a floating-point instruction each cycle. This mode is useful when the instruction performs scalar operations such as operating system routines.

In dual-instruction mode, the RISC core fetches two 32-bit instructions each clock cycle using the 64-bit-wide instruction cache. One 32-bit instruction moves to the RISC core, and the other moves to the floating-point section for parallel execution. This mode allows the RISC core to keep the floating-point units fed by fetching and storing information and performing loop control, while the floating-point section operates on the data.

The floating-point instructions include a set of operations that initiate both an add and a multiply. The add and multiply, combined with the integer operation, result in three operations each clock cycle. With this fine-grained parallelism, the architecture can support traditional vector processing by software libraries that implement a vector instruction set. The inner loops of the software vector routines operate up to the peak floating-point hardware rate of 80 million floating-point operations per second. Consistent with RISC philosophy, the i860 CPU achieves the performance of hardware vector instructions without their complex control logic. The fine-grained parallelism can also be used in other parallel algorithms that cannot be vectorized.

Register and addressing model. The i860 microprocessor contains separate register files for the integer and floating-point units to support parallel execution. In addition to these register files, as can be seen in Figure 3 on page 18, are six control registers and four special-purpose registers. The RISC core contains the integer register file of thirty-two 32-bit registers, designated R0 through R31 and used for storing addresses or data. The floating-point control unit contains a separate set of thirty-two 32-bit floating-point registers designated F0 through F31. These registers can be addressed individually, as sixteen 64-bit registers, or as eight 128-bit registers. The integer registers contain three ports. Five ports in the floating-point registers allow them to be used as a data staging area for performing loads and stores in parallel with floating-point operations.

The i860 operates on standard integer and floating-point data, as well as pixel data formats for graphics operations. All operations on the integer registers execute on 32-bit data as signed or unsigned operations, and additional add and subtract instructions operate on 64-bit-long words. All 64-bit operations occur in the floating-point registers.

The i860 microprocessor supports a paged virtual address space of four gigabytes. Therefore, data and instructions can be stored anywhere in that space, and multibyte data values are addressed by specifying their lowest addressed byte. Data must be accessed on boundaries that are multiples of their size. For example, two-byte data must be aligned to an address divisible by two, four-byte data to an address divisible by four, and so on, up to 16-byte data values. Data in memory can be stored in either little-endian or big-endian format. (Little-endian format sends the least significant byte, D7-D0, first to the lowest memory address, while big-endian sends the most significant byte first.) Code is always stored in little-endian format. Support for big-endian data allows the processor to operate on data produced by a big-endian processor without performing a lengthy data conversion.
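The alignment rule described above — each multibyte datum must sit on an address that is a multiple of its size, up to 16-byte values — can be sketched as a simple check. This is an illustrative model, not i860 code:

```c
#include <assert.h>
#include <stdint.h>

/* An access is aligned, in the sense described above, when the
   address is a multiple of the operand size (2, 4, 8, or 16 bytes). */
static int is_aligned(uint32_t address, uint32_t size_bytes)
{
    return (address % size_bytes) == 0;
}
```

An unaligned access, such as a four-byte load from address 0x1002, violates this rule and would not be a legal i860 data reference.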
Figure 2. Functional units and data paths of the i860 microprocessor.

RISC core

The RISC core fetches both integer and floating-point instructions. It executes load, store, integer, bit, and control transfer instructions. Table 1 on page 19 lists the full instruction set with the 42 core unit instructions and their mnemonics in the left column. All instructions are 32 bits long and follow the load/store, three-operand style of traditional RISC designs. Only load and store instructions operate on memory; all other instructions operate on registers. Most instructions allow users to specify two source registers and a third register for storing the results.

A key feature of the core unit is its ability to execute most instructions in one clock cycle. The RISC core contains a pipeline consisting of four stages: fetch, decode, execute, and write. We used several techniques to hide clock cycles of instructions that may take more

time to complete. Integer register loads from memory take one execution cycle, and the next instruction can begin on the following cycle.

The processor uses a scoreboarding technique to guarantee proper operation of the code and allow the highest possible performance. The scoreboard keeps a history of which registers await data from memory. The actual loading of data takes one clock cycle if it is held in the cache memory, available for ready access, but several cycles if it is in main memory. Using scoreboarding, the i860 microprocessor continues execution unless a subsequent instruction attempts to use the data before it is loaded. This condition would cause execution to freeze. An optimizing compiler can organize the code so that freezing rarely occurs by not referencing the load data in the following cycle. Because the hardware implements scoreboarding, it is never necessary to insert no-op instructions.

We included several control flow optimizations in the core instruction set. The conditional branch instructions have variations with and without a delay slot. A delay slot allows the processor to execute an instruction following a branch while it is fetching from the branch target. Having both delayed and nondelayed variations of branch instructions allows the compiler to optimize the code easily, whether a branch is likely to be taken or not. Test-and-branch instructions execute in one clock cycle, a savings of one cycle when testing special cases. Finally, another one-cycle loop control instruction usefully handles tight loops, such as those in vector routines.

Instead of providing a limited set of locked operations, the RISC core provides lock and unlock instructions. With these two instructions a sequence of up to 32 instructions can be interlocked for multiprocessor synchronization. Thus, traditional test and set opera-

Figure 3. Register set: the integer registers R0 through R31, the floating-point registers F0 through F31, the special-purpose registers (including KR, KL, and Merge), and the control registers.
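As the register model above describes, the floating-point register file can be addressed as thirty-two 32-bit, sixteen 64-bit, or eight 128-bit registers; as noted later for double-precision data, a 64-bit value keeps bits 31...0 in the even register of a pair and bits 63...32 in the odd one. A sketch of that pairing, with `f` standing in for the register file (a hypothetical helper, not i860 code):

```c
#include <assert.h>
#include <stdint.h>

/* Compose the 64-bit value held in the floating-point register pair
   (F[even], F[even+1]): low word in the even register, high word in
   the adjacent odd register, as the i860 register model specifies. */
static uint64_t read_pair(const uint32_t f[32], int even)
{
    return ((uint64_t)f[even + 1] << 32) | f[even];
}
```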
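The scoreboarding freeze described earlier can be illustrated with a toy model (hypothetical and not cycle-accurate): each register awaiting a load records the cycle at which its data arrives, and an instruction that reads such a register stalls until then, while independent instructions proceed.

```c
#include <assert.h>

#define NREGS 32

/* Toy scoreboard: busy_until[r] is the cycle at which register r's
   pending load data becomes available. Return the cycle at which an
   instruction reading src_reg can actually execute; reading a busy
   register before its data arrives freezes execution until it does. */
static int issue(int cycle, int src_reg, const int busy_until[NREGS])
{
    return (busy_until[src_reg] > cycle) ? busy_until[src_reg] : cycle;
}
```

An optimizing compiler plays the same game statically: by not referencing load data on the very next cycle, it keeps the stall case rare.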
Table 1. Instruction-set summary.

Core unit

Load and store instructions
  LD.X       Load integer
  ST.X       Store integer
  FLD.Y      F-P load
  PFLD.Z     Pipelined F-P load
  FST.Y      F-P store
  PST.D      Pixel store

Register-to-register moves
  IXFR       Transfer integer to F-P register
  FXFR       Transfer F-P to integer register

Integer arithmetic instructions
  ADDU       Add unsigned
  ADDS       Add signed
  SUBU       Subtract unsigned
  SUBS       Subtract signed

Shift instructions
  SHL        Shift left
  SHR        Shift right
  SHRA       Shift right arithmetic
  SHRD       Shift right double

Logical instructions
  AND        Logical AND
  ANDH       Logical AND high
  ANDNOT     Logical AND NOT
  ANDNOTH    Logical AND NOT high
  OR         Logical OR
  ORH        Logical OR high
  XOR        Logical exclusive OR
  XORH       Logical exclusive OR high

Control-transfer instructions
  TRAP       Software trap
  INTOVR     Software trap on integer overflow
  BR         Branch direct
  BRI        Branch indirect
  BC         Branch on CC
  BC.T       Branch on CC taken
  BNC        Branch on not CC
  BNC.T      Branch on not CC taken
  BTE        Branch if equal
  BTNE       Branch if not equal
  BLA        Branch on LCC and add
  CALL       Subroutine call
  CALLI      Indirect subroutine call

System control instructions
  FLUSH      Cache flush
  LD.C       Load from control register
  ST.C       Store to control register
  LOCK       Begin interlocked sequence
  UNLOCK     End interlocked sequence

Floating-point unit

Floating-point multiplier instructions
  FMUL.P     F-P multiply
  PFMUL.P    Pipelined F-P multiply
  PFMUL3.DD  Three-stage pipelined F-P multiply
  FMLOW.P    F-P multiply low
  FRCP.P     F-P reciprocal
  FRSQR.P    F-P reciprocal square root

Floating-point adder instructions
  FADD.P     F-P add
  PFADD.P    Pipelined F-P add
  FSUB.P     F-P subtract
  PFSUB.P    Pipelined F-P subtract
  PFGT.P     Pipelined F-P greater-than compare
  PFEQ.P     Pipelined F-P equal compare
  FIX.P      F-P to integer conversion
  PFIX.P     Pipelined F-P to integer conversion
  FTRUNC.P   F-P to integer truncation
  PFTRUNC.P  Pipelined F-P to integer truncation
  PFLE.P     Pipelined F-P less than or equal
  FAMOV      F-P adder move
  PFAMOV     Pipelined F-P adder move

Dual-operation instructions
  PFAM.P     Pipelined F-P add and multiply
  PFSM.P     Pipelined F-P subtract and multiply
  PFMAM      Pipelined F-P multiply with add
  PFMSM      Pipelined F-P multiply with subtract

Long integer instructions
  FISUB.Z    Long-integer subtract
  PFISUB.Z   Pipelined long-integer subtract
  FIADD.Z    Long-integer add
  PFIADD.Z   Pipelined long-integer add

Graphics instructions
  FZCHKS     16-bit z-buffer check
  PFZCHKS    Pipelined 16-bit z-buffer check
  FZCHKL     32-bit z-buffer check
  PFZCHKL    Pipelined 32-bit z-buffer check
  FADDP      Add with pixel merge
  PFADDP     Pipelined add with pixel merge
  FADDZ      Add with z merge
  PFADDZ     Pipelined add with z merge
  FORM       OR with merge register
  PFORM      Pipelined OR with merge register

Assembler pseudo-operations
  MOV        Integer register-register move
  FMOV.Q     F-P register-register move
  PFMOV.Q    Pipelined F-P register-register move
  NOP        Core no-operation
  FNOP       F-P no-operation

CC   Condition code
F-P  Floating-point
LCC  Loop condition code

tions as well as more sophisticated operations, such as compare and swap, can be performed.

The RISC core also executes a pixel store instruction. This instruction operates in conjunction with the graphics unit to eliminate hidden surfaces. Other instructions transfer integer and floating-point registers, examine and modify the control registers, and flush the data cache.

The six control registers accessible by core instructions are the

PSR (processor status),
EPSR (extended processor status),
DB (data breakpoint),
FIR (fault instruction),
Dirbase (directory base), and
FSR (floating-point status) registers.

The PSR contains state information relevant to the current process, such as trap-related and pixel information. The EPSR contains additional state information for the current process, such as the processor type, stepping, and cache size. The DB register generates data breakpoints when the breakpoint is enabled and the address matched. The FIR stores the address of the instruction that causes a trap. The Dirbase register contains the control information for caching, address translation, and bus options. Finally, the FSR contains the floating-point trap and rounding-mode status for the current process. The four special-purpose registers are used with the dual-operation floating-point instructions (described later).

The core unit executes all loads and stores, including those to the floating-point registers. Two types of floating-point loads are available: FLD (floating-point load) and PFLD (pipelined floating-point load). The FLD instruction loads the floating-point register from the cache, or loads the data from memory and fills the cache line if the data is not in the cache. Up to four floating-point registers can be loaded from the cache in one clock cycle. This ability to perform 128-bit loads or stores in one clock cycle is crucial to supplying data at the rate needed to keep the floating-point units executing. The FLD instruction suits scalar floating-point routines, vector data that can fit entirely in the cache, or sections of large data structures that are going to be reused.

For accessing data structures too large to fit into the on-chip cache, the core uses the PFLD instruction. The pipelined load places data directly into the floating-point registers without placing it in the data cache on a cache miss. This operation avoids displacing the data already in the cache that will be reused. Similarly, on a store miss the data writes through to memory without allocating a cache block. Thus, we avoid data cache thrashing, a crucial factor in achieving high sustained performance in large vector calculations.

PFLD also allows up to three accesses to be issued on the pipelined external bus before the data from the first cache miss is returned. The pipelined loads occur directly from memory and do not cause extra bus cycles to fill the cache line, avoiding bus accesses to data that is not needed. The full bus bandwidth of the external bus can be used even though cache misses are being processed. Autoincrement addressing, with an arbitrary increment, increases the flexibility and performance for accessing data structures.

Memory management

The i860's on-chip memory management unit implements the basic features needed for paged virtual memory management and page-level protection. We intentionally duplicated the memory management technique of the 386 and 486 microprocessors' paging system. In this way we can be sure that the processors easily exist in a common operating environment. The similar MMUs are also useful for reusing paging and virtual memory software that is written in C.

The address translation process maps virtual address space onto actual address space in fixed-size blocks called pages. While paging is enabled, the processor translates a linear address to a physical address using page tables. As used in mainframes, the i860 CPU page tables are arranged in a two-level hierarchy. (See Figure 4.) The directory table base (DTB), which is part of the Dirbase register, points to the page directory. This one-page-long directory contains address entries for 1,024 page tables. The page tables are also one page long, and their entries describe 1,024 pages. Each page is 4 Kbytes in size.

Figure 4 also shows the translation from a virtual address to a physical address. The processor uses the upper 10 bits of the linear address as an index into the directory. Each directory entry contains 20 bits of addressing information, part of which contains the address of a page table. The processor uses these 20 bits and the middle 10 bits of the linear address to form the page table address. The address contents of the page table entry and the lower 12 bits (nine address bits and the byte enables) of the linear address form the 32-bit physical address.

The processor creates the paging tables and stores them in memory when it creates the process. If the processor had to access these page tables in memory each time a reference was made, performance would suffer greatly. To save the overhead of the page table lookups, the processor automatically caches mapping information for the 64 most recently used pages in an on-chip, four-way, set-associative translation lookaside buffer. The TLB's 64 entries cover 4 Kbytes each, providing total coverage of 256 Kbytes of memory addresses. The TLB can be flushed by setting a bit in the Dirbase register.
Figure 4. Virtual-to-physical address translation.
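The translation of Figure 4 splits a 32-bit linear address into a 10-bit directory index, a 10-bit page-table index, and a 12-bit offset within a 4-Kbyte page. A sketch of the index extraction (the page-table walk itself is omitted):

```c
#include <assert.h>
#include <stdint.h>

/* Split a linear address the way the i860's two-level page tables do:
   the upper 10 bits index the page directory, the middle 10 bits index
   a page table, and the low 12 bits are the offset within the page. */
static uint32_t dir_index(uint32_t la)   { return la >> 22; }
static uint32_t page_index(uint32_t la)  { return (la >> 12) & 0x3FF; }
static uint32_t page_offset(uint32_t la) { return la & 0xFFF; }
```

Note that the TLB arithmetic in the text checks out: 64 entries of one 4-Kbyte page each cover 256 Kbytes of address space.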

Figure 5. Format of a page table entry: the page frame address in bits 31...12, bits available to the systems programmer or user, and the P (present), W (writable), U (user), write-through, cache disable, A (accessed), and D (dirty) bits. (X indicates Intel reserved; do not use.)
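Since the article says the i860 duplicates the 386's paging technique, the protection check on the Figure 5 bits can be sketched with the 386-style low-order bit layout (present, writable, user); treat these positions as an assumption for illustration, not a definitive i860 encoding:

```c
#include <assert.h>
#include <stdint.h>

/* Page-table-entry bits, following the 386-style layout the article
   says the i860 duplicates (assumed positions, for illustration). */
#define PTE_P 0x1u   /* present  */
#define PTE_W 0x2u   /* writable */
#define PTE_U 0x4u   /* user     */

/* A user-mode write is allowed only to a present, user, writable page;
   any other combination would raise the protection exception the
   following text describes. */
static int user_write_ok(uint32_t pte)
{
    return (pte & (PTE_P | PTE_U | PTE_W)) == (PTE_P | PTE_U | PTE_W);
}
```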

Only when the processor does not find the mapping information for a page in the TLB does it perform a page table lookup from information stored in memory. When a TLB miss does occur, the processor performs the TLB entry replacement entirely in hardware. The hardware reads the virtual-to-physical mapping information from the page directory and the page table entries, and caches this information in the TLB.

The format of a page table entry can be seen in Figure 5. Paging protects supervisor memory from user accesses and also permits write protection of pages. The U (user) and W (write) bits control the access rights. The operating system can allow a user program to have read-and-write, read-only, or no access to a given page or page group. If a memory access violates the page protection attributes, such as U-level code writing a
read-only page, the system generates an exception. While at the user level, the system ignores store control instructions to certain control registers.

The U bit of the PSR is set to 0 when executing at the supervisor level, in which all present pages are readable. Normally, at this level, all pages are also writable. To support a memory management optimization called copy-on-write, the processor sets the write-protection (WP) bit of the EPSR. With WP set, any write to a page whose W bit is not set causes a trap, allowing an operating system to share pages between tasks without making a new copy of the page until it is written.

Of the two remaining control bits, cache disable (CD) and write through (WT), one is reflected on the output pin for a page table bit (PTB), dependent on the setting of the page table bit mode (PBM) in the EPSR. The WT bit, CD bit, and KEN# cache enable pin are internally NORed to determine cachability. If either of these bits is set to one, the processor will not cache that page of data. For systems that use a second-level cache, these bits can be used to manage a second-level coherent cache, with no shared data cached on chip. In addition to controlling cachability with software, the KEN# hardware signal can be used to disable cache reads.

       DO 10, I = 1, 100
    10 X = X*A + C

       FMUL X, A, temp
       FADD temp, C, X

       1 result per 6 clock cycles
       (a)

       DO 10, I = 1, 100
    10 X[I] = A[I] * B[I] + C

       M12TPM A[I], B[I], X[I-6]

       1 result per clock cycle
       (b)

Figure 6. Floating-point execution models: data-dependent code in scalar mode (a) and vector code in pipeline mode (b).

Floating-point unit

Floating-point unit instructions, as listed in Table 1, support both single-precision real and double-precision real data. Both types follow the ANSI/IEEE 754 standard.1 The i860 CPU hardware implements all four modes of IEEE rounding. The special values infinity, NaN (not a number), indefinite, and denormal generate a trap when encountered, and the trap handler produces an IEEE-standard result. The double-precision real data occupies two adjacent floating-point registers, with bits 31...0 stored in an even-numbered register and bits 63...32 stored in the adjacent, higher odd-numbered register.

The floating-point unit includes three-stage pipelined add and multiply units. For single-precision data each unit can produce one result per clock cycle, for a peak rate of 80 Mflops at a 40-MHz clock speed. For double-precision data, the multiplier can produce a result every other cycle. The adder produces a result every cycle, for a peak rate of 60 million floating-point operations per second. The double-precision peak number is 40 Mflops if an algorithm has an even distribution of multiplies and adds. Reducing the double-precision multiply rate saves half of the multiplier tree and is consistent with the data bandwidth available for double-precision operations.

To save silicon area, we did not include a floating-point divide unit. Instead, software performs floating-point divide and square-root operations. Newton-Raphson algorithms use an 8-bit seed provided by a hardware lookup table. Full IEEE rounding can be implemented by using an instruction that returns the low-order bits of a floating-point multiply. Therefore these algorithms can take advantage of the pipeline and allow the 16-bit reciprocals used in many graphics calculations to be performed either in 10 clock cycles or four pipelined cycles.

The floating-point instruction set supports two computation models, scalar and pipelined. In scalar mode new floating-point instructions do not start processing until the previous floating-point instruction completes. This mode is used when a data dependency exists between the operations or when a compiler ignores pipeline scheduling. In the scalar-mode example of Figure 6, each iteration of the DO loop requires the results from the previous iteration and six cycles of execution.

In pipelined mode the same operation can produce a result every clock cycle, and the CPU pipeline stages are exposed to software. The software issues a new floating-point operation to the first stage of the pipeline and gets back the result of the last stage of the pipeline. Destination registers are specified not when the operation begins, but when the result is available. This explicit pipelining avoids tying up valuable floating-point registers for results, so the registers can still be used in the pipeline. Implicit pipelining, using scoreboarding, would cause the registers to become the bottleneck in the floating-point unit.

Pipelining also takes place in a dual-operation mode in which an add and a multiply process in parallel. Figure 7 shows the adder unit, the multiplier unit, the special registers, and the dual-operation data paths. Dual-operation instructions require six operands. The register file provides three of the operands, and the special registers and the interunit bypasses provide the remaining three. The instruction encodings specify the source and destination paths for the units.

Figure 7. Dual-operation data paths.

Referring back to the pipeline-mode example of Figure 6, note that we show the dual-operation instruction M12TPM SRC1, SRC2, RDEST as M12TPM A[I], B[I], X[I-6]. (The M12TPM mnemonic is a variation of the PFAM instruction.) This instruction specifies that the multiply is initiated with SRC1 and SRC2 as the operands. It also specifies that the add is initiated with the result from the multiply and the T register as the operands, and RDEST stores the result from the add. Because of the three stages of the add and multiply pipelines, the available result comes from the operation that started six clock cycles previously.

There are 32 variations of dual-operation instructions. Applications such as fast Fourier transforms, graphics transforms, and matrix operations can be implemented efficiently with these instructions. Some apparently scalar operations, such as adding a series of numbers, can also take advantage of the pipelining capability.

Figure 8. Dual-instruction-mode transitions.

The i860 microprocessor can provide its fast floating-point hardware with the necessary data bandwidth to achieve peak performance for the inner loops of common routines. The dual-instruction mode allows the processor to perform up to 128-bit data loads and stores at the same time it executes a multiply and an add. Figure 8 shows the dual-instruction-mode transitions for an extended sequence of instruction pairs and for a single instruction pair. Programs specify dual-instruction mode in two ways. They can either include a "d." prefix in the mnemonic of a floating-point instruction or use the assembler directives .dual ... .enddual. Either of these methods causes the dual bit, or D-bit, of the floating-point instruction to be set. If the processor, while executing in single-instruction mode, encounters a floating-point instruction with the D-bit set, it executes one more 32-bit instruction before beginning dual-instruction execution. In dual-instruction mode, a floating-point instruction could encounter a clear D-bit. The processor would then execute one more instruction pair before returning to single-instruction mode.

The floating-point hardware also performs integer multiplies and long-integer adds or subtracts. Integer multiplies by constants can be performed in the RISC core using shift instructions. To perform a full integer multiply, the processor transfers two integer registers by using IXFR instructions. The FMLOW instruction performs the actual multiplication, and the FXFR instruction transfers the results back to the core. The total operation takes from four to nine clock cycles, depending on what other instructions can be overlapped.
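The pipeline-mode example of Figure 6(b) — M12TPM A[I], B[I], X[I-6], where the result that becomes available belongs to the operation started six cycles earlier — can be mimicked with a six-deep delay buffer. This is an illustrative software model of the exposed pipeline only, not a simulation of the actual datapath:

```c
#include <assert.h>

#define DELAY 6  /* three multiply stages feeding three add stages */

/* Feed a[i]*b[i] + c into a 6-deep "pipeline" each cycle; the value
   that falls out is the result issued 6 cycles earlier, so x[i-6] is
   written while a[i] and b[i] are being read, as in the Figure 6(b)
   loop. Extra drain cycles flush the last results. */
static void pipelined_axpy(const double *a, const double *b, double c,
                           double *x, int n)
{
    double pipe[DELAY] = {0};
    int head = 0;
    for (int i = 0; i < n + DELAY; i++) {
        double out = pipe[head];          /* result of iteration i-6 */
        if (i >= DELAY)
            x[i - DELAY] = out;
        pipe[head] = (i < n) ? a[i] * b[i] + c : 0.0;
        head = (head + 1) % DELAY;
    }
}
```

One result per loop iteration emerges once the pipeline is full, which is exactly why the software model sustains the hardware's peak rate.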
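The software divide described above can be sketched with the classic Newton-Raphson reciprocal iteration y' = y(2 - xy), whose relative error roughly squares on each step. The crude power-of-two seed below stands in for the i860's 8-bit hardware lookup table (an assumption for illustration; it handles positive x only):

```c
#include <assert.h>

/* Newton-Raphson reciprocal of a positive x: start from a power-of-two
   seed y with 0.5 <= x*y <= 1, then iterate y = y*(2 - x*y). Each
   iteration roughly squares the relative error, so a handful of steps
   reaches full double precision even from a one-bit seed. */
static double nr_recip(double x)
{
    double y = 1.0;
    while (x * y > 1.0) y *= 0.5;   /* crude seed, standing in for  */
    while (x * y < 0.5) y *= 2.0;   /* the 8-bit hardware lookup    */
    for (int i = 0; i < 6; i++)
        y = y * (2.0 - x * y);
    return y;
}
```

With an 8-bit seed the error starts near 2^-8, which is why only a few iterations — and only a few pipelined multiply cycles — are needed in practice.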
Intel 860

Red color 20 (r, g, b, x , Y. z)


Graphics (0-255)
The floating-point hardware of the CPU efficiently
performs the transformation calculations and advanced
lighting calculations required for 3D graphics. The
processor performs 500K transforms/second for 3 x 4
3D matrices, including the trivial reject clipping and
perspective calculations. A 3D image display requires
the use of integer operations for shading and hidden-
surface removal. The graphics unit hardware speeds
these back-end rendering operations and operates di-
rectly into screen buffer memory. It uses the floating-
point registers and operates in parallel with the core.
Graphics instructions take advantage of the 64-bit
data paths and can operate on multiple pixels simulta-
neously, realizing I O times the speed of the RISC
core when performing shading. Instructions support
8-, 16-, and 24/32-bit pixels, operating respectively
on eight, four, or two pixels simultaneously. Figure 9. Pixel interpolation for Gouraud shading of a
In 3D graphics, polygons generally represent the set triangle for red colors and 0-255 intensity levels.
of points on the surface of a solid object. During
transformation, the graphics u n i t calculates only the
vertices of the polygons. The unit knows the locations
and color intensities of the vertices of the polygons. but
points between these vertices must be calculated. These In graphics the :-buffer, which can reside in normal
points, along with their associated data, are called dynamic RAM, stores the depth of the pixel buffer
pixels. If a figure is displayed with only the vertices and currently being displayed. Instructions for ;-buffer
simple lines, it appears as a wireframe drawing. The interpolation calculate the z values between vertices. Z -
simplest wireframe drawing typically shows all verti- buffer check instructions compare the new pixels’ z
ces, even the ones that should be hidden from view by values to the values in the ,--buffer, and if closer, the
an overlapping polygon. To show shaded 3D images, pixels are unmasked in the pixel mask register. The
the graphics unit must display the surface of the poly- RISC core operates in parallel with the graphics unit
gons. Where polygons overlap, it must display the and executes a pixel store instruction. The pixel store
polygon closest to the viewer.
In graphics calculations the z value represents the distance of a pixel from the viewer. Although the depth of each polygon's vertices is known, to overlay polygons not on a vertex, the graphics unit must interpolate the depths from the bordering vertices. This step is called z interpolation. In this step the depths of all points of a polygon can be determined. For overlapping points, the z values of different polygons can be checked and only the pixel data of the polygon closest to the viewer displayed.
To perform the procedure just described, the graphics instructions include intensity interpolation, z interpolation, and z-buffer checks. Intensity interpolation allows smooth linear changes in pixel intensity and color between vertices. This capability provides a smoother appearance than does the flat shading of the polygons. The more data bits per pixel, the smoother the interpolation becomes. The i860 CPU graphics instructions support both Gouraud and higher order shading techniques. Gouraud shading interpolates intensities along the scan lines. Figure 9 illustrates pixel interpolation for Gouraud shading of a triangle. The intensity level across the scan line shown is interpolated from 30 to 27.
updates the pixels that are unmasked in the mask register. If a pixel is updated, the new z value needs to be stored to the z-buffer. The z-buffer check instruction updates the buffer with the minimum z value for each pixel.
Most workstations typically have a base graphics system of a simple frame buffer with simple display hardware. With a frame-buffer graphics system, the i860 CPU can perform Gouraud-shading operations on 50,000 triangles per second at 40 MHz. This level of performance exceeds that of workstations that include costly dedicated graphics processor boards.

Caches

The i860 CPU has a 4-Kbyte instruction cache and an 8-Kbyte data cache, each with its own address and data paths to support concurrent accesses. The data cache supports up to 128-bit accesses on each clock cycle, and the instruction cache supports up to 64-bit accesses. The aggregate bandwidth at 40 MHz is 960 Mbytes/second. Both caches combine two-way set-associative parallelism with a 32-byte line size. Additionally, the data cache uses write-back caching.
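The interpolation and z-buffer check just described can be modeled in ordinary software. The sketch below is a Python approximation of our own devising, not Intel's microcode: the real instructions operate on packed 16- or 32-bit pixels held in 64-bit registers. It shows the linear interpolation used for intensity and depth, and the keep-the-minimum-z update.

```python
def interp(a, b, n):
    """Linearly interpolate n values from a to b inclusive, as the
    graphics unit does for intensity or depth across a scan line."""
    if n == 1:
        return [float(a)]
    return [a + (b - a) * i / (n - 1) for i in range(n)]

def zbuf_check(zbuffer, new_z, new_pixels, framebuffer):
    """Software model of a z-buffer check: where the incoming z is
    smaller (closer to the viewer) than the stored z, update the
    frame buffer and keep the minimum z; return the update mask."""
    mask = []
    for i, z in enumerate(new_z):
        closer = z < zbuffer[i]
        if closer:
            zbuffer[i] = z            # store the new minimum z value
            framebuffer[i] = new_pixels[i]
        mask.append(closer)
    return mask
```

Interpolating from 30 to 27 across a four-pixel scan line yields 30, 29, 28, 27, the values shown for Figure 9.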

24 IEEEMICRO
Both caches use virtual addresses to avoid a critical path in the cache access. Data cache accesses use the TLB lookup for enforcing the page-based protection. Since both caches use virtual tags, software must avoid the aliasing of data. Within a context, each physical address must only be accessed with one virtual address. During context switches, the instruction cache must be invalidated and the data cache flushed. The caches, although large enough to give hit rates above 90 percent within many applications, are too small to provide hits across context changes. Therefore, we did not feel process IDs or a duplicate set of physical tags to avoid flushing the cache between context switches were warranted.
Flushing the data cache is an easy way to avoid aliasing, and a simple calculation shows what little impact flushing a small cache has on performance. A typical i860 CPU context switch, including the data cache flush, takes approximately 65 microseconds. In the worst case, a workstation will change context 200 times per second; multiplying 65 microseconds by 200 times/second yields 13 milliseconds per second, a 1.3 percent performance degradation due to context switching.
Write-back data caching avoids propagating all writes to the external bus, which reduces bus traffic. It also prevents a bottleneck in vector operations where write traffic from the vector result collides with an incoming vector operand. With write-back caching, the hardware necessary to implement transparent caching for multiprocessor systems would have moved costs beyond the silicon budget of this implementation. Instead, we use software to manage cache coherency. Each processor can cache code, vector register data, and private stack data, while shared data remains uncached. Software controls the caching by using a cachable bit in the page table entries to prevent shared data from being cached. External hardware can also assert a cachable enable pin to control cachability of each line's read miss. The flush instruction forces all "dirty" blocks in the data cache back to memory. Flushing is needed before removing a page or changing to a new virtual address space.
We included optimizations for cache-miss processing. Each cachable read miss results in four bus cycles to fill the 32-byte cache line. First, the processor fetches the referenced data word and performs a wraparound fill to read the entire line. The processor can then continue execution when the first word is returned. The processor contains two 128-bit write buffers used for store misses and cache miss processing. When the processor issues a store instruction that misses the cache, it can continue execution while the write buffer carries out the actual memory write. The write buffers support two store misses and also support a delayed write back of the dirty cache line. If a cachable read miss displaces a dirty cache line, three operations take place. The processor writes the dirty line to the write buffer, the cache line read takes place on the external bus, and then the write back occurs.
A convenient software model for managing the data cache for vector computations on large matrices is to treat the data cache as a "vector register set." Vectors, or their intermediate results, that are being reused are kept in the onboard cache by referencing with the normal floating-point load instruction. The vectorization process analyzes nested loops to determine which vectors are reusable in the second-loop level. Vector register references in the vector library routines use the normal floating-point load instruction. Vector memory references use the pipelined floating-point load instruction to stream the data from memory directly into the registers and not disturb the cache. Using the data cache as a vector register set is a more flexible concept than that found in many supercomputers with small, fixed-length vector registers. This concept offers the advantages of a vector register set for vector computations while retaining the flexibility of a data cache for scalar computations.

Bus interface

Designed for scalability to 50 MHz, the i860 CPU external bus performs a 64-bit transfer every two clock cycles. Thus, we achieve the design of a practical TTL (transistor-transistor logic) system, even at 50 MHz. The bus can interface either to a second-level cache or directly to a DRAM system. The bus allows optional pipelining for increasing the access time without decreasing the bandwidth. The full bus bandwidth can be realized from one bank of DRAMs; however, the latency will be greater than if a fast static RAM cache is used.
With the two-cycle transfer rate, the external bus can supply one memory operand for every double-precision add/multiply pair, or two contiguous single-precision operands for every two single-precision add/multiply pairs. The other two vector operands for an add/multiply pair must come from the onboard data cache. This approach provides the same ratio of floating-point rate to external memory bandwidth as the Cray 1. To avoid bus bottlenecks, the vectorization process must try to reuse two of the three vector operands in the second-level inner loop.
The i860 microprocessor contains a synchronous interface with a demultiplexed address and 64-bit-wide data bus. The address bus provides 32-bit addressing, consisting of 29 address lines and separate byte enable signals for each eight data bits. The bidirectional data bus can accept or drive new data on every other clock cycle, yielding a bandwidth of 160 Mbytes per second at 40 MHz.
The bus optionally allows for two levels of bus pipelining selected on a bus cycle-by-cycle basis. When pipelining, a new cycle starts prior to the completion of the outstanding cycles. Two levels of pipelining allow

August 1989 25
Intel i860

three cycles to operate at one time. Fast TTL latches can be used on the address and data bus. This method isolates the memory array from the processor pin timings, allowing easy scalability and providing the maximum time for memory accesses. With pipelining, the maximum data rate of the bus can be sustained even if the access time is six clock cycles. We achieve over 100 nanoseconds of address-to-data access time for a full bandwidth system at 40 MHz.
A summary of the processor pins appears in Table 2. We timed the processor with a single-frequency, TTL-level clock. An optional mode for executing out of one 8-bit-wide EPROM can be entered at reset by activating the INT/CS8 pin. In this mode the processor fetches instructions from the EPROM with the byte-enable signals BE2#-BE0# redefined as address lines A2-A0.

Table 2. Processor-pin summary.

Pin name    Function                    Active state   Input/output

Execution control pins
CLK         Clock                       -              I
RESET       System reset                High           I
HOLD        Bus hold                    High           I
HOLDA       Bus hold acknowledge        High           O
BREQ        Bus request                 High           O
INT/CS8     Interrupt, code size        High           I

Bus interface pins
A31-A3      Address bus                 High           O
BE7#-BE0#   Byte enable                 Low            O
D63-D0      Data bus                    High           I/O
LOCK#       Bus lock                    Low            O
W/R#        Write/read bus cycle        High/Low       O
NENE#       Next near                   Low            O
NA#         Next address request        Low            I
READY#      Transfer acknowledge        Low            I
ADS#        Address status              Low            O

Cache interface pins
KEN#        Cache enable                Low            I
PTB         Page table bit              High           O

Testability pins
SHI         Boundary scan shift input   High           I
BSCN        Boundary scan enable        High           I
SCAN        Shift scan path             High           I

Intel-reserved configuration pins
CC1-CC0     Configuration               High           I

Power and ground pins
Vcc         System power                -              -
Vss         System ground               -              -

A # symbol after a pin name indicates that the signal is active when at the low-voltage level.

The HOLD, HOLDA, and BREQ signals activate arbitration of the processor's local bus. When a DMA controller, or another processor, needs access to the local bus of the CPU, it asserts HOLD. When the CPU completes all of its outstanding bus cycles, it floats the bus interface pins and returns HOLDA active high. The CPU will remain in this state with HOLDA active until HOLD is deasserted. The CPU can continue processing while in HOLD until the external bus is required. At this time it asserts the BREQ output signal. Arbitration logic samples the BREQ signal to arbitrate a shared bus.
The A31-A3 and BE7#-BE0# bus interface pins can access up to 4 gigabytes of address space. The address lines select the 8-byte word, and the byte-enable signals select the byte within the word. For read accesses to cachable memory, the processor caches the entire data so the byte-enable signals are ignored. For write operations the byte-enable signals determine which bytes in memory must be updated. The i860 microprocessor does not, however, allow misaligned accesses. Data of 32 and 16 bits must be placed on 4- and 2-byte boundaries, respectively. However, single-byte data can be placed at any byte location. The 64 bidirectional data pins can transfer 8-, 16-, 32-, or 64-bit quantities; pins D7-D0 signify the least significant byte and D63-D56 signify the most significant byte.
The processor asserts the ADS# output during the first clock cycle of each bus cycle to indicate the start of the bus cycle. The W/R# signal distinguishes the write and read bus cycles. The NENE# output indicates to the DRAM controller that the current address is in the same DRAM page as the previous cycle. As shown later, this information is useful for designing high-performance memory systems.
The NA# input to the CPU controls pipelining and can be asserted before the current cycle ends. When the processor samples NA# active, it can start driving the next bus cycle's address and definition. This can be done two times prior to returning data for any of the cycles. While NA# controls the address and bus cycle definition signals, READY# controls the data operations. When READY# is sampled as active for a read, the processor latches the data from the data bus. When READY# is sampled as active for a write, the processor stops driving the data from that cycle. READY# also serves to end a bus cycle. The LOCK# signal output provides atomic (indivisible) sequences. Using LOCK# prevents the processor from relinquishing the bus even if HOLD is asserted. For multiprocessor systems, the external hardware only needs to lock the first address in a locked sequence.
This processor samples the KEN# input to determine

[Figure 10 timing diagram, "i860 CPU DRAM read cycle": waveforms for CLK, ADS#, NENE#, W/R#, CPU address, NA#, i860 data, READY#, DRAM address (row, then successive column addresses), RAS#, CAS#, and DRAM data.]
Figure 10. The CPU performs four read cycles to fill a cache line.
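The timing in Figure 10 can be approximated with a small model. The Python sketch below is our own illustration, using only the numbers quoted in the text (a 64-bit transfer every two clocks, a six-clock access time, two outstanding pipelined cycles); the function names are not Intel's.

```python
CLOCK_HZ = 40_000_000          # 40-MHz part
TRANSFER_BYTES = 8             # 64-bit data bus
CLOCKS_PER_TRANSFER = 2        # one transfer every other clock

def sustained_bandwidth(clock_hz=CLOCK_HZ):
    """Peak sustained external-bus bandwidth in bytes per second."""
    return clock_hz * TRANSFER_BYTES // CLOCKS_PER_TRANSFER

def pipelined_reads(n, access_clocks=6, issue_interval=2):
    """(address clock, data clock) pairs for n pipelined reads: a new
    address may be driven every issue_interval clocks, and each access
    completes access_clocks after its address was driven."""
    return [(i * issue_interval, i * issue_interval + access_clocks)
            for i in range(n)]
```

Even with a six-clock access time, data arrives every two clocks once the pipeline fills, so the 160-Mbyte/s rate at 40 MHz is sustained.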

if the data for the current read cycle is cachable. Address space that is used for input and output can be decoded to deassert KEN# during I/O accesses. Software can also mark areas of memory as noncachable on a page-by-page basis. If the software has not disabled caching of the page, and KEN# is available for a read cycle, three additional 64-bit bus cycles will be generated to fill the 32-byte cache block.

Interfacing to a DRAM system

Figure 10 shows the processor performing four read cycles as it would do to fill a cache line. Also shown in the figure is the NA# signal returned to the processor, which indicates that the system can accept the next bus cycle. Two NA#s are returned before any of the cycles are completed. To complete a read cycle, the memory system provides the data on the bus and returns READY# to the processor. Once fully pipelined, the memory system provides data and READY# on every other clock cycle. Important for high performance, this data rate can be provided by ordinary static column DRAMs. The processor also provides the control signal NENE# to optimize DRAM control.
The memory system in Figure 11 on the next page consists of an address buffer; an address latch; eight latching data buffers; and a 64-bit-wide, static column-mode DRAM (256K x 4). This arrangement allows the memory size to be increased in increments of two megabytes. Using 256K x 4 memories also has advantages in reducing power and signal-drive requirements. To support the two levels of pipelining, the processor latches both address and data. The address latches hold


[Figure 11 block diagram: the i860 CPU drives a row-address buffer and column-address latch (74F827-class TTL parts) onto a multiplexed DRAM address bus; DRAM control logic decodes ADS#, W/R#, NENE#, NA#, READY#, and the byte enables to generate RAS, CAS, WE#, and buffer controls for a 64-bit bank of 256K x 4 static column-mode DRAMs behind latching data buffers.]
Figure 11. A DRAM system for the i860 microprocessor requires little "glue logic."
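The comparison behind NENE# can be sketched in a few lines. In the Python model below, the bit split (9 column bits over 8-byte words, giving a 4-Kbyte page of 512 words) is our assumption for the 256K x 4 parts in Figure 11, not a value from the article.

```python
WORD_BYTES = 8      # 64-bit external bus: A31-A3 select the word
COL_BITS = 9        # assumed column-address width of the 256K x 4 DRAMs

def same_dram_page(addr, prev_addr):
    """Model of the 'next near' (NENE#) comparison: two byte addresses
    fall in the same DRAM page when all bits above the column field
    match, so a fast static column-mode access can be used instead of
    a full RAS/CAS cycle."""
    page_shift = COL_BITS + (WORD_BYTES.bit_length() - 1)  # 9 + 3 = 12
    return (addr >> page_shift) == (prev_addr >> page_shift)
```

Sequential accesses (common in cache-line fills and vector streaming) hit the same page almost every time, which is why the comparator pays off.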

the address of the previous cycle, while the data from the cycle prior to that is held in the data buffers. Using TTL components on the address and data paths also has the advantage of isolating the memory system from the processor's pin timings.
The two address latches are used for multiplexing the row and column addresses from the processor to the DRAMs' address lines. When accesses occur within the DRAM page, only the column address needs to be supplied to the memory address lines. Most systems that use a fast-access DRAM mode need an additional hardware comparator. The i860 CPU has a comparator (which supplies the NENE# signal on each bus cycle) built into the bus unit. The controller uses this signal to determine if a fast static column-mode access can occur or if a full DRAM cycle needs to take place.
The bidirectional data buffers latch the data for both reads and writes. For reads, the buffers latch data and

return READY# on the following clock cycle. With the two levels of pipelining the total access time is six cycles, while data is available every two cycles. Zero-wait-state operation does not require pipelining for write cycles. When a write occurs, the address and data latched in the buffers allow READY# to be returned to the processor. The actual write cycle occurs after READY# returns to the processor. This delayed write operation allows processor execution to continue even though the write has not fully completed.
Using 85-ns static column-mode DRAMs, the 33-MHz i860 microprocessor can operate at zero wait states for access within the DRAM page. The two-level pipelining and two-clock transfer rate allow the processor to sustain performance without the need for an external cache memory system.

Software support

Both internal development teams and independent vendors provide a full complement of software development tools and operating systems for the i860. Figure 12 shows the software development tools available: C and Fortran compilers, assembler/linker, simulator/debugger, Fortran vectorizer, plus mathematical, vector primitive, and 3D graphics libraries. To support the initial development environments, both Unix System V running on a 386 microprocessor and OS/2 host the cross-compilers. The optimizations used in the compilers include coloring for register allocation, register-based parameter passing for calls, interblock common subexpression and loop invariant elimination, constant propagation, strength reduction, extensive peephole optimizations, and instruction scheduling.
Scientific-application support includes a Fortran vectorizing precompiler. Vectorization occurs in Do and If loops, outer loops, and forward-branching conditional operations. The precompiler recognizes these structures and generates calls to a set of preprogrammed procedures. The preprogrammed procedures are optimized for the processor's instruction set and for managing the data cache as a vector register. Additionally, other high-level languages can call these procedures. We plan to further increase the degree of parallelism that high-level languages can use in the processor. We also provide a library of assembly-language routines for scalar mathematics.
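The precompiler's "data cache as vector register set" strategy amounts to strip-mining. The Python sketch below is a schematic of our own (the strip length and names are illustrative): the reused vector stays cache-resident across a strip, while on the i860 a once-used operand would stream past the cache via pipelined floating-point loads.

```python
def strip_mined_axpy(y, a, x, strip=1024):
    """y[i] += a * x[i] computed in strips sized to fit the data cache.
    In the i860 scheme, y (reused across the strip: read and written)
    stays cache-resident via normal floating-point loads, while x
    (touched once) would stream past the cache via pipelined loads."""
    n = len(y)
    for base in range(0, n, strip):                  # second-level loop
        for i in range(base, min(base + strip, n)):  # inner vector loop
            y[i] += a * x[i]
    return y
```

The second-level loop is where the vectorizer looks for operand reuse, keeping two of the three operands of each add/multiply pair out of external memory.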

[Figure 12 block diagram: the Fortran compiler (with vectorizer) and C compiler produce ASM source for the assembler; the linker combines the result with the vector primitive and math libraries to produce code for the processor.]
Figure 12. Software development environment supporting the i860.


The first 3D visualization tool ported to the i860 CPU is Ardent Computer's Dore. This tool supports both real-time, interactive 3D modeling and higher quality static images. Several windowing environments and other 3D tools and libraries are also being ported. Application software can be run on either a software simulator or an add-in application accelerator. Both share a common debugging interface. The simulator allows the user to model different memory systems and measure their effects on performance. Either Unix V/386 or OS/2 hosts the application accelerator, which includes a runtime operating environment that maps I/O requests back to the host processor.
A multiprocessing version of Unix System V Release 4.0 is under development for the i860 CPU. This is a joint effort by AT&T, Convergent Technologies, Intel, Olivetti, Prime Computer, and others. We plan to maintain source-code compatibility with the high-level languages between the 386, i486, and i860 microprocessors. Specifications for an applications binary interface standard (ABI) will allow portability of application software across multiple vendors' system implementations.

The i860 microprocessor begins the second generation of 32-bit RISC processors. By using a 64-bit architecture, the i860 delivers balanced MIPS, Mflops, and graphics performance. The million-transistor budget lets us integrate the RISC core and provide dedicated, fast floating-point hardware, graphics capabilities, and cache memories on one chip. The design allows maximum parallelism between the functional units while achieving a balance between computation speed and data bandwidth. Mainframe and supercomputer architectural concepts let the processor offer a complete solution to the requirements of high-computation applications.

Les Kohn is a chief architect for high-performance processors at Intel Corporation of Santa Clara, California, where he has worked on various 32- and 64-bit microprocessor design projects. Before joining the company, he worked as a software manager and architect for the NS32000 family at National Semiconductor. His interests include computer architectures and compilers and electronic synthesizers. Kohn received his BS degree in physics from the California Institute of Technology in Pasadena.

Neal Margulis is a senior engineer for high-performance processors at Intel. His interests include processor architecture and system design. Margulis received his degree in electrical engineering from the University of Vermont in Burlington. He is a member of the IEEE Computer Society and Tau Beta Pi.

Questions concerning this article may be directed to the authors through Michael Sullivan at Intel Corporation, SC4-42, 2625 Walsh Avenue, Santa Clara, CA 95051.
