A million-transistor budget helps this RISC deliver balanced MIPS, Mflops, and graphics performance with no data bottlenecks.
The single-chip i860 executes parallel instructions, using mainframe and supercomputer architectural concepts. The … mm x 15 mm processor (see Figure 1) delivers balanced integer, floating-point, and graphics performance. Current-generation CAD tools and 1-micrometer semiconductor technology made the million-transistor budget practical. To accommodate our performance goals, we partitioned the chip between blocks for integer operations, floating-point operations, and instruction and data cache memories. Inclusion of the RISC (reduced instruction set computing) core, floating-point units, and caches on one chip lets us design wider internal buses, eliminate interchip communication overhead, and offer higher performance. As a result, the i860 avoids off-chip delays and allows users to scale the clock beyond the current 33- and 40-MHz speeds.

We designed the i860 for performance-driven applications such as workstations, minicomputers, application accelerators for existing processors, and parallel supercomputers. The i860 CPU design began with the specification of a general-purpose RISC integer core. However, we felt it necessary to go beyond the traditional 32-bit, one-instruction-per-clock RISC processor. A 64-bit architecture provides the data and instruction bandwidth needed to support multiple operations in each clock cycle. The balanced performance between integer and floating-point computations produces the raw computing power required to support demanding applications such as modeling and simulations.

Finally, we recognized a synergistic opportunity to incorporate a 3D graphics unit that supports interactive visualization of results. The architecture of the i860 CPU provides a complete platform for software vendors developing i860 applications.
[Figure 1. Block diagram of the i860 CPU: RISC core, floating-point control unit with adder and multiplier units, memory management, bus control unit, 4-Kbyte instruction cache, 8-Kbyte data cache, core and floating-point register files, and the graphics unit's KR, KI, and merge registers, linked by 32-, 64-, and 128-bit internal buses and a 32-bit external address path.]
… time to complete. Integer register loads from memory take one execution cycle, and the next instruction can begin on the following cycle.

The processor uses a scoreboarding technique to guarantee proper operation of the code and allow the highest possible performance. The scoreboard keeps a history of which registers await data from memory. The actual loading of data takes one clock cycle if it is held in the cache memory buffer available for ready access, but several cycles if it is in main memory. Using scoreboarding, the i860 microprocessor continues execution unless a subsequent instruction attempts to use the data before it is loaded. This condition would cause execution to freeze. An optimizing compiler can organize the code so that freezing rarely occurs by not referencing the load data in the following cycle. Because the hardware implements scoreboarding, it is never necessary to insert NO-OP instructions.

We included several control flow optimizations in the core instruction set. The conditional branch instructions have variations with and without a delay slot. A delay slot allows the processor to execute an instruction following a branch while it is fetching from the branch target. Having both delayed and nondelayed variations of branch instructions allows the compiler to optimize the code easily, whether a branch is likely to be taken or not. Test-and-branch instructions execute in one clock cycle, a savings of one cycle when testing special cases. Finally, another one-cycle loop control instruction usefully handles tight loops, such as those in vector routines.

Instead of providing a limited set of locked operations, the RISC core provides lock and unlock instructions. With these two instructions a sequence of up to 32 instructions can be interlocked for multiprocessor synchronization. Thus, traditional test-and-set operations as well as more sophisticated operations, such as compare and swap, can be performed.
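Returning to the scoreboard: the issue rule reduces to a per-register pending bit. The C sketch below is a minimal model of the idea, not the hardware; all names are ours.

    #include <stdio.h>
    #include <stdint.h>

    /* One pending bit per integer register: set when a load issues,
     * cleared when the data arrives.  Issue logic freezes only when
     * an instruction reads a register whose bit is still set. */
    static uint32_t pending;

    static void issue_load(int rd)   { pending |=   1u << rd;  }
    static void data_arrived(int rd) { pending &= ~(1u << rd); }

    /* Returns 1 if an instruction reading rs1 and rs2 must freeze. */
    static int must_freeze(int rs1, int rs2)
    {
        return (int)(((pending >> rs1) | (pending >> rs2)) & 1u);
    }

    int main(void)
    {
        issue_load(5);                      /* load into r5 (cache miss)  */
        printf("%d\n", must_freeze(6, 7));  /* 0: independent work issues */
        printf("%d\n", must_freeze(5, 7));  /* 1: would use r5 too early  */
        data_arrived(5);
        printf("%d\n", must_freeze(5, 7));  /* 0: data present, no freeze */
        return 0;
    }

A compiler that hoists independent instructions between the load and its first use keeps must_freeze false, which is exactly the scheduling freedom described above.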
[Figure: the i860 register sets: the 32-bit integer registers (r0 and up), the floating-point registers (f0 and up), and control registers including the page directory base, data breakpoint, and floating-point status registers.]
Table 1. Instruction-set summary.

Core unit

Load and store instructions
  LD.X     Load integer
  ST.X     Store integer
  FLD.Y    F-P load
  PFLD.Z   Pipelined F-P load
  FST.Y    F-P store
  PST.D    Pixel store

Register-to-register moves
  IXFR     Transfer integer to F-P register
  FXFR     Transfer F-P to integer register

Integer arithmetic instructions
  ADDU     Add unsigned
  ADDS     Add signed
  SUBU     Subtract unsigned
  SUBS     Subtract signed

Shift instructions
  SHL      Shift left
  SHR      Shift right
  SHRA     Shift right arithmetic
  SHRD     Shift right double

Logical instructions
  AND      Logical AND
  ANDH     Logical AND high
  ANDNOT   Logical AND NOT
  ANDNOTH  Logical AND NOT high
  OR       Logical OR
  ORH      Logical OR high
  XOR      Logical exclusive OR
  XORH     Logical exclusive OR high

Control-transfer instructions
  TRAP     Software trap
  INTOVR   Software trap on integer overflow
  BR       Branch direct
  BRI      Branch indirect
  BC       Branch on CC
  BC.T     Branch on CC taken
  BNC      Branch on not CC
  BNC.T    Branch on not CC taken
  BTE      Branch if equal
  BTNE     Branch if not equal
  BLA      Branch on LCC and add
  CALL     Subroutine call
  CALLI    Indirect subroutine call

System control instructions
  FLUSH    Cache flush
  LD.C     Load from control register
  ST.C     Store to control register
  LOCK     Begin interlocked sequence
  UNLOCK   End interlocked sequence

Floating-point unit

Floating-point multiplier instructions
  FMUL.P     F-P multiply
  PFMUL.P    Pipelined F-P multiply
  PFMUL3.DD  Three-stage pipelined F-P multiply
  FMLOW.P    F-P multiply low
  FRCP.P     F-P reciprocal
  FRSQR.P    F-P reciprocal square root

Floating-point adder instructions
  FADD.P     F-P add
  PFADD.P    Pipelined F-P add
  FSUB.P     F-P subtract
  PFSUB.P    Pipelined F-P subtract
  PFGT.P     Pipelined F-P greater-than compare
  PFEQ.P     Pipelined F-P equal compare
  PFLE.P     Pipelined F-P less-than-or-equal compare
  FIX.P      F-P to integer conversion
  PFIX.P     Pipelined F-P to integer conversion
  FTRUNC.P   F-P to integer truncation
  PFTRUNC.P  Pipelined F-P to integer truncation
  FAMOV      F-P adder move
  PFAMOV     Pipelined F-P adder move

Dual-operation instructions
  PFAM.P   Pipelined F-P add and multiply
  PFSM.P   Pipelined F-P subtract and multiply
  PFMAM    Pipelined F-P multiply with add
  PFMSM    Pipelined F-P multiply with subtract

Long-integer instructions
  FLADD.Z    Long-integer add
  PFLADD.Z   Pipelined long-integer add
  FLSUB.Z    Long-integer subtract
  PFLSUB.Z   Pipelined long-integer subtract

Graphics instructions
  FZCHKS     16-bit z-buffer check
  PFZCHKS    Pipelined 16-bit z-buffer check
  FZCHKL     32-bit z-buffer check
  PFZCHKL    Pipelined 32-bit z-buffer check
  FADDP      Add with pixel merge
  PFADDP     Pipelined add with pixel merge
  FADDZ      Add with z merge
  PFADDZ     Pipelined add with z merge
  FORM       OR with merge register
  PFORM      Pipelined OR with merge register

Assembler pseudo-operations
  MOV      Integer register-register move
  FMOV.Q   F-P register-register move
  PFMOV.Q  Pipelined F-P register-register move
  NOP      Core no-operation
  FNOP     F-P no-operation

CC: condition code. F-P: floating-point. LCC: loop condition code.
The RISC core also executes a pixel store instruction. This instruction operates in conjunction with the graphics unit to eliminate hidden surfaces. Other instructions transfer integer and floating-point registers, examine and modify the control registers, and flush the data cache.

The six control registers accessible by core instructions are the

PSR (processor status),
EPSR (extended processor status),
DB (data breakpoint),
FIR (fault instruction),
Dirbase (directory base), and
FSR (floating-point status) registers.

The PSR contains state information relevant to the current process, such as trap-related and pixel information. The EPSR contains additional state information for the current process, such as the processor type, stepping, and cache size. The DB register generates data breakpoints when the breakpoint is enabled and the address matched. The FIR stores the address of the instruction that causes a trap. The Dirbase register contains the control information for caching, address translation, and bus options. Finally, the FSR contains the floating-point trap and rounding-mode status for the current process. The four special-purpose registers are used with the dual-operation floating-point instructions (described later).

The core unit executes all loads and stores, including those to the floating-point registers. Two types of floating-point loads are available: FLD (floating-point load) and PFLD (pipelined floating-point load). The FLD instruction loads the floating-point register from the cache, or loads the data from memory and fills the cache line if the data is not in the cache. Up to four floating-point registers can be loaded from the cache in one clock cycle. This ability to perform 128-bit loads or stores in one clock cycle is crucial to supplying the data at the rate needed to keep the floating-point units executing. The FLD instruction processes scalar floating-point routines, vector data that can fit entirely in the cache, or sections of large data structures that are going to be reused.

For accessing data structures too large to fit into the on-chip cache, the core uses the PFLD instruction. The pipelined load places data directly into the floating-point registers without placing it in the data cache on a cache miss. This operation avoids displacing the data already in the cache that will be reused. Similarly, on a store miss, the data writes through to memory without allocating a cache block. Thus, we avoid data cache thrashing, a crucial factor in achieving high sustained performance in large vector calculations.

PFLD also allows up to three accesses to be issued on the pipelined external bus before the data from the first cache miss is returned. The pipelined loads occur directly from memory and do not cause extra bus cycles to fill the cache line, avoiding bus accesses to data that is not needed. The full bus bandwidth of the external bus can be used even though cache misses are being processed. Autoincrement addressing, with an arbitrary increment, increases the flexibility and performance for accessing data structures.

Memory management

The i860's on-chip memory management unit implements the basic features needed for paged virtual memory management and page-level protection. We intentionally duplicated the memory management technique of the 386 and 486 microprocessors' paging system. In this way we can be sure that the processors easily coexist in a common operating environment. The similar MMUs are also useful for reusing paging and virtual memory software that is written in C.

The address translation process maps virtual address space onto actual address space in fixed-size blocks called pages. While paging is enabled, the processor translates a linear address to a physical address using page tables. As used in mainframes, the i860 CPU page tables are arranged in a two-level hierarchy. (See Figure 4.) The directory table base (DTB), which is part of the Dirbase register, points to the page directory. This one-page-long directory contains address entries for 1,024 page tables. The page tables are also one page long, and their entries describe 1,024 pages. Each page is 4 Kbytes in size.

Figure 4 also shows the translation from a virtual address to a physical address. The processor uses the upper 10 bits of the linear address as an index into the directory. Each directory entry contains 20 bits of addressing information, part of which contains the address of a page table. The processor uses these 20 bits and the middle 10 bits of the linear address to form the page table address. The address contents of the page table entry and the lower 12 bits (nine address bits and the byte enables) of the linear address form the 32-bit physical address.

The processor creates the paging tables and stores them in memory when it creates the process. If the processor had to access these page tables in memory each time that a reference was made, performance would suffer greatly. To save the overhead of the page table lookups, the processor automatically caches mapping information for the 64 most recently used pages in an on-chip, four-way, set-associative translation lookaside buffer. The TLB's 64 entries each cover 4 Kbytes, providing total coverage of 256 Kbytes of memory addresses. The TLB can be flushed by setting a bit in the Dirbase register.
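The two-level walk just described reduces to a few shifts and masks. Here is a sketch in C under the field widths given above; the toy memory, helper, and names are ours, and the TLB, protection checks, and present-bit handling are omitted.

    #include <stdio.h>
    #include <stdint.h>

    static uint32_t mem[1 << 20];                    /* toy physical memory */
    static uint32_t rd32(uint32_t pa) { return mem[pa >> 2]; }

    /* dtb: directory table base from the Dirbase register. */
    static uint32_t translate(uint32_t dtb, uint32_t linear)
    {
        uint32_t dir    = (linear >> 22) & 0x3FF;    /* upper 10 bits  */
        uint32_t page   = (linear >> 12) & 0x3FF;    /* middle 10 bits */
        uint32_t offset =  linear        & 0xFFF;    /* lower 12 bits  */

        uint32_t pde = rd32((dtb & 0xFFFFF000u) + dir  * 4); /* directory  */
        uint32_t pte = rd32((pde & 0xFFFFF000u) + page * 4); /* page table */
        return (pte & 0xFFFFF000u) | offset;   /* 32-bit physical address */
    }

    int main(void)
    {
        /* Hand-build one directory entry and one page table entry:
         * linear 0x00401ABC -> dir 1, page 1, offset 0xABC. */
        uint32_t dtb = 0x1000, pt = 0x2000, frame = 0x5000;
        mem[(dtb + 1 * 4) >> 2] = pt    | 1;    /* directory entry, present  */
        mem[(pt  + 1 * 4) >> 2] = frame | 1;    /* page table entry, present */
        printf("0x%08X\n", translate(dtb, 0x00401ABC));   /* 0x00005ABC */
        return 0;
    }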
[Figure 4. Address translation: the Dir, Page, and Offset fields of the linear address select a page directory entry and a page table entry to produce the physical address.]

Figure 5. Format of a page table entry. (X indicates Intel reserved; do not use.) [Fields: page frame address; bits available for the systems programmer; reserved bits; Dirty; Accessed; Cache disable; Write-through; User; Writable; Present.]
Only when the processor does not find the mapping information for a page in the TLB does it perform a page table lookup from information stored in memory. When a TLB miss does occur, the processor performs the TLB entry replacement entirely in hardware. The hardware reads the virtual-to-physical mapping information from the page directory and the page table entries, and caches this information in the TLB.

The format of a page table entry can be seen in Figure 5. Paging protects supervisor memory from user accesses and also permits write protection of pages. The U (user) and W (write) bits control the access rights. The operating system can allow a user program to have read and write, read-only, or no access to a given page or page group. If a memory access violates the page protection attributes, such as U-level code writing a read-only page, the system generates an exception. While at the user level, the system ignores store control instructions to certain control registers.
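For illustration, the bit masks below follow the 386-style layout shown in Figure 5; the macro names and the access check are ours, not Intel's definitions.

    #include <stdio.h>
    #include <stdint.h>

    #define PTE_PRESENT   0x001u        /* page present in memory       */
    #define PTE_WRITABLE  0x002u        /* W: writes allowed            */
    #define PTE_USER      0x004u        /* U: user-level access allowed */
    #define PTE_WT        0x008u        /* write-through                */
    #define PTE_CD        0x010u        /* cache disable                */
    #define PTE_ACCESSED  0x020u
    #define PTE_DIRTY     0x040u
    #define PTE_FRAME     0xFFFFF000u   /* 20-bit page frame address    */

    /* Mirrors the U/W rules above: nonzero if a user access must trap. */
    static int user_access_faults(uint32_t pte, int is_write)
    {
        if (!(pte & PTE_PRESENT) || !(pte & PTE_USER))
            return 1;                 /* absent or supervisor-only page */
        return is_write && !(pte & PTE_WRITABLE); /* write, read-only   */
    }

    int main(void)
    {
        uint32_t pte = PTE_PRESENT | PTE_USER;    /* read-only user page */
        printf("read faults:  %d\n", user_access_faults(pte, 0));  /* 0 */
        printf("write faults: %d\n", user_access_faults(pte, 1));  /* 1 */
        return 0;
    }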
    DO 10, I = 1, 100
    10  X = X * A + C

        FMUL  X, A, temp
        FADD  temp, C, X

    1 result per 6 clock cycles
    (a)

    DO 10, I = 1, 100
    10  X[I] = A[I] * B[I] + C

        M12TPM  A[I], B[I], X[I-6]

    1 result per clock cycle
    (b)

Figure 6. Floating-point execution models: data-dependent code in scalar mode (a) and vector code in pipeline mode (b).

The U bit of the PSR is set to 0 when executing at the supervisor level, in which all present pages are readable. Normally, at this level, all pages are also writable. To support a memory management optimization called copy-on-write, the processor sets the write-protection (WP) bit of the EPSR. With WP set, any write to a page whose W bit is not set causes a trap, allowing an operating system to share pages between tasks without making a new copy of the page until it is written.

Of the two remaining control bits, cache disable (CD) and write-through (WT), one is reflected on the output pin for a page table bit (PTB), dependent on the setting of the page table bit mode (PBM) in EPSR. The WT bit, CD bit, and KEN# cache enable pin are internally NORed to determine "cachability." If either of these bits is set to one, the processor will not cache that page of data. For systems that use a second-level cache, these bits can be used to manage a second-level coherent cache, with no shared data cached on chip. In addition to controlling cachability with software, the KEN# hardware signal can be used to disable cache reads.
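The NOR rule above reduces to a one-line predicate. In this sketch the bit positions follow the 386-style page table entry of Figure 5, and ken_active models the active-low KEN# pin already decoded to a logic level; the names are ours.

    #include <stdio.h>
    #include <stdint.h>

    /* A line is cached only if neither page bit nor the KEN# pin
     * disables caching: cachable = NOR(WT, CD, not-KEN). */
    static int line_is_cachable(uint32_t pte, int ken_active)
    {
        int wt = (int)(pte >> 3) & 1;       /* write-through bit */
        int cd = (int)(pte >> 4) & 1;       /* cache-disable bit */
        return !(wt | cd | !ken_active);
    }

    int main(void)
    {
        printf("%d\n", line_is_cachable(0x00u, 1)); /* 1: cachable        */
        printf("%d\n", line_is_cachable(0x10u, 1)); /* 0: CD bit set      */
        printf("%d\n", line_is_cachable(0x00u, 0)); /* 0: KEN# deasserted */
        return 0;
    }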
Floating-point unit
Floating-point unit instructions, as listed in Table 1, support both single-precision real and double-precision real data. Both types follow the ANSI/IEEE 754 standard.¹ The i860 CPU hardware implements all four modes of IEEE rounding. The special values infinity, NaN (not a number), indefinite, and denormal generate a trap when encountered, and the trap handler produces an IEEE-standard result. The double-precision real data occupies two adjacent floating-point registers, with bits 31...0 stored in an even-numbered register and bits 63...32 stored in the adjacent, higher odd-numbered register.

The floating-point unit includes three-stage pipelined add and multiply units. For single-precision data each unit can produce one result per clock cycle, for a peak rate of 80 Mflops at a 40-MHz clock speed. For double-precision data, the multiplier can produce a result every other cycle, while the adder produces a result every cycle, for a peak rate of 60 million floating-point operations per second. The double-precision peak number is 40 Mflops if an algorithm has an even distribution of multiplies and adds. Reducing the double-precision multiply rate saves half of the multiplier tree and is consistent with the data bandwidth available for double-precision operations.

Figure 7. Dual-operation data paths. [Diagram: the SRC1 and SRC2 operands feed the multiplier unit and the adder unit, each of which produces a result routed toward RDEST.]

To save silicon area, we did not include a floating-point divide unit. Instead, software performs floating-point divide and square-root operations. Newton-Raphson algorithms use an 8-bit seed provided by a reciprocal instruction (see FRCP.P and FRSQR.P in Table 1).
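The software divide sequence is easy to sketch in C. The seed function below is a crude stand-in for the hardware's 8-bit approximation; the real library routines also handle rounding and special values.

    #include <stdio.h>
    #include <math.h>

    /* Stand-in for the 8-bit hardware seed: 1/a truncated to about
     * 8 significant bits. */
    static double seed8(double a)
    {
        int e;
        double m = frexp(1.0 / a, &e);              /* m in [0.5, 1) */
        return ldexp(floor(m * 256.0) / 256.0, e);  /* keep ~8 bits  */
    }

    /* Newton-Raphson reciprocal: x' = x*(2 - a*x) roughly doubles
     * the number of correct bits per step, so an 8-bit seed reaches
     * full double precision in three iterations (8 -> 16 -> 32 -> 64). */
    static double recip(double a)
    {
        double x = seed8(a);
        for (int i = 0; i < 3; i++)
            x = x * (2.0 - a * x);
        return x;
    }

    int main(void)
    {
        printf("recip(7) = %.17g\n", recip(7.0));
        printf("1.0/7.0  = %.17g\n", 1.0 / 7.0);
        return 0;
    }

A quotient b/a is then computed as b * recip(a), and square root follows the same pattern from the reciprocal-square-root seed.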
The floating-point instruction set supports two computation models, scalar and pipelined. In scalar mode new floating-point instructions do not start processing until the previous floating-point instruction completes. This mode is used when a data dependency exists between the operations or when a compiler ignores pipeline scheduling. In the scalar-mode example of Figure 6, each iteration of the DO loop requires the results from the previous iteration and takes six cycles to execute. In pipelined mode the same operation can produce a result every clock cycle, as Figure 6b shows. …

[Figure: instruction formats. A 32-bit word holds a single OP; in dual-instruction mode, 64-bit pairs hold a core OP and a floating-point OP, and a d.FP-OP initiates temporary dual-instruction mode.]
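Under the simplifying assumption that one independent operation enters per clock, the toy C model below shows the pipelined behavior described above: a three-clock latency through a three-stage adder, then one result per clock. It models timing only, not the i860's pipeline registers.

    #include <stdio.h>

    #define STAGES 3

    int main(void)
    {
        double a[6] = {1, 2, 3, 4, 5, 6};
        double b[6] = {10, 20, 30, 40, 50, 60};
        double pipe[STAGES] = {0};

        for (int clk = 0; clk < 6 + STAGES; clk++) {
            double result = pipe[STAGES - 1];          /* leaves the pipe */
            for (int s = STAGES - 1; s > 0; s--)       /* advance stages  */
                pipe[s] = pipe[s - 1];
            pipe[0] = (clk < 6) ? a[clk] + b[clk] : 0; /* new op enters   */
            if (clk >= STAGES)                         /* first result    */
                printf("clock %d: a[%d]+b[%d] = %g\n", /* after 3 clocks  */
                       clk, clk - STAGES, clk - STAGES, result);
        }
        return 0;
    }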
Both caches use virtual addresses to avoid a critical path in the cache access. Data cache accesses use the TLB lookup for enforcing the page-based protection. Since both caches use virtual tags, software must avoid the aliasing of data. Within a context, each physical address must only be accessed with one virtual address. During context switches, the instruction cache must be invalidated and the data cache flushed. The caches, although large enough to give hit rates above 90 percent within many applications, are too small to provide hits across context changes. Therefore, we did not feel that process IDs or a duplicate set of physical tags to avoid flushing the cache between context switches were warranted.

Flushing the data cache is an easy way to avoid aliasing, and a simple calculation shows what little impact flushing a small cache has on performance. A typical i860 CPU context switch, including the data cache flush, takes approximately 65 microseconds. In the worst case, a workstation will change context 200 times per second; multiplying 65 microseconds by 200 times per second yields a 1.3 percent performance degradation due to context switching.

Write-back data caching avoids propagating all writes to the external bus, which reduces bus traffic. It also prevents a bottleneck in vector operations where write traffic from the vector result collides with an incoming vector operand. With write-back caching, the hardware necessary to implement transparent caching for multiprocessor systems moved costs beyond the silicon budget of this implementation. Instead, we use software to manage cache coherency. Each processor can cache code, vector register data, and private stack data, while shared data remains uncached. Software controls the caching by using a cachable bit in the page table entries to prevent shared data from being cached. External hardware can also assert a cachable enable pin to control cachability of each line's read miss. The flush instruction forces all "dirty" blocks in the data cache back to memory. Flushing is needed before removing a page or changing to a new virtual address space.

We included optimizations for cache-miss processing. Each cachable read miss results in four bus cycles to fill the 32-byte cache line. First, the processor fetches the referenced data word and performs a wraparound fill to read the entire line. The processor can then continue execution when the first word is returned. The processor contains two 128-bit write buffers used for store misses and cache-miss processing. When the processor issues a store instruction that misses the cache, it can continue execution while the write buffer carries out the actual memory write. The write buffers support two store misses and also support a delayed write back of the dirty cache line. If a cachable read miss displaces a dirty cache line, three operations take place: the processor writes the dirty line to the write buffer, the cache line read takes place on the external bus, and then the write back occurs.

A convenient software model for managing the data cache for vector computations on large matrices is to treat the data cache as a "vector register set." Vectors, or their intermediate results, that are being reused are kept in the onboard cache by referencing them with the normal floating-point load instruction. The vectorization process analyzes nested loops to determine which vectors are reusable at the second loop level. Vector register references in the vector library routines use the normal floating-point load instruction, while vector memory references use the pipelined floating-point load instruction to stream the data from memory directly into the registers without disturbing the cache. Using the data cache as a vector register set is a more flexible concept than that found in many supercomputers with small, fixed-length vector registers. This concept offers the advantages of a vector register set for vector computations while retaining the flexibility of a data cache for scalar computations.

Bus interface

Designed for scalability to 50 MHz, the i860 CPU external bus performs a 64-bit transfer every two clock cycles. Thus, we achieve the design of a practical TTL (transistor-transistor logic) system, even at 50 MHz. The bus can interface either to a second-level cache or directly to a DRAM system. The bus allows optional pipelining for increasing the access time without decreasing the bandwidth. The full bus bandwidth can be realized from one bank of DRAMs; however, the latency will be greater than if a fast static RAM cache is used.

With the two-cycle transfer rate, the external bus can supply one memory operand for every double-precision add/multiply pair, or two contiguous single-precision operands for every two single-precision add/multiply pairs. The other two vector operands for an add/multiply pair must come from the onboard data cache. This approach provides the same ratio of floating-point rate to external memory bandwidth as the Cray-1. To avoid bus bottlenecks, the vectorization process must try to reuse two of the three vector operands in the second-level inner loop.

The i860 microprocessor contains a synchronous interface with a demultiplexed address and a 64-bit-wide data bus. The address bus provides 32-bit addressing, consisting of 29 address lines and separate byte enable signals for each eight data bits. The bidirectional data bus can accept or drive new data on every other clock cycle, yielding a bandwidth of 160 Mbytes per second at 40 MHz.

The bus optionally allows two levels of bus pipelining, selected on a cycle-by-cycle basis. When pipelining, a new cycle starts prior to the completion of the outstanding cycles. Two levels of pipelining allow …
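Returning to the vector-register-set model above, the C fragment below marks which references a vectorizer would map to cached (FLD) and streaming (PFLD) accesses; the function and blocking are illustrative, not library code.

    /* y[] is the vector reused at the second loop level: the
     * vectorizer references it with normal floating-point loads so
     * it stays in the data cache.  a[] is consumed once: pipelined
     * floating-point loads stream it from memory without displacing
     * y[] from the cache. */
    void update(int n, double c, const double *a, double *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = y[i] + c * a[i];   /* y: cached; a: streamed */
    }

With y read and written on chip, only one external operand per add/multiply pair crosses the bus, matching the one-operand-per-pair bus budget described above.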
Figure 10. The CPU performs four read cycles to fill a cache line. [Timing diagram of an i860 CPU DRAM read: CLK, ADS#, NENE#, W/R#, CPU address, NA#, data, READY#, the multiplexed DRAM address (row x, then columns x, x+1, x+2, x+3), RAS#, CAS#, and DRAM data.]
… if the data for the current read cycle is cachable. Address space that is used for input and output can be decoded to deassert KEN# during I/O accesses. Software can also mark areas of memory as noncachable on a page-by-page basis. If the software has not disabled caching of the page, and KEN# is asserted for a read cycle, three additional 64-bit bus cycles will be generated to fill the 32-byte cache block.

Interfacing to a DRAM system

Figure 10 shows the processor performing four read cycles, as it would do to fill a cache line. Also shown in the figure is the NA# signal returned to the processor, which indicates that the system can accept the next bus cycle. Two NA#s are returned before any of the cycles are completed. To complete a read cycle, the memory system provides the data on the bus and returns READY# to the processor. Once fully pipelined, the memory system provides data and READY# on every other clock cycle. Important for high performance, this data rate can be provided by ordinary static-column DRAMs. The processor also provides the control signal NENE# to optimize DRAM control.

The memory system in Figure 11 consists of an address buffer; an address latch; eight latching data buffers; and a 64-bit-wide, static column-mode DRAM array (256K x 4). This arrangement allows the memory size to be increased in increments of two megabytes. Using 256K x 4 memories also has advantages in reducing power and signal-drive requirements. To support the two levels of pipelining, the processor latches both address and data.
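The overlap Figure 10 shows can be visualized with a toy model. The C sketch below tracks up to two outstanding bus cycles: a new address may start (NA#) while older cycles still await their data (READY#). The counters stand in for the real signals, and the timing is simplified.

    #include <stdio.h>

    int main(void)
    {
        int outstanding = 0;                       /* bus cycles in flight */
        for (int clk = 0; clk < 12; clk++) {
            if (outstanding < 2) {                 /* system asserts NA#   */
                outstanding++;
                printf("clock %2d: new cycle started\n", clk);
            }
            if (clk % 2 == 1 && outstanding > 0) { /* data + READY# on     */
                outstanding--;                     /* every other clock    */
                printf("clock %2d: oldest cycle completes\n", clk);
            }
        }
        return 0;
    }

In the steady state one cycle starts and one completes every two clocks, with two outstanding, which is how the bus sustains a 64-bit transfer every other clock despite a longer total access time.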
Figure 11. A DRAM system for the i860 microprocessor requires little "glue logic." [Schematic: a 74F827 row-address buffer and a column-address latch drive the multiplexed DRAM address bus for the 256K x 4 static column-mode DRAM banks; the processor's ADS#, W/R#, NENE#, NA#, READY#, BE[0:7], and CLK signals feed the decode, buffer-control, and DRAM-control logic that generates RAS, CAS, WE#, and output-enable signals; data D[0:63] passes through bidirectional latching buffers.]
The address latches hold the address of the previous cycle, while the data from the cycle prior to that is held in the data buffers. Using TTL components on the address and data paths also has the advantage of isolating the memory system from the processor's pin timings.

The two address latches are used for multiplexing the row and column addresses from the processor to the DRAMs' address lines. When accesses occur within the DRAM page, only the column address needs to be supplied to the memory address lines. Most systems that use a fast-access DRAM mode need an additional hardware comparator. The i860 CPU has a comparator built into the bus unit, which supplies the NENE# signal on each bus cycle. The controller uses this signal to determine whether a fast static column-mode access can occur or a full DRAM cycle needs to take place.

The bidirectional data buffers latch the data for both reads and writes. For reads, the buffers latch data and return READY# on the following clock cycle.
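The comparator's job reduces to a single comparison. In the sketch below the boundary of the row bits is system-dependent and chosen arbitrarily; the function name is ours.

    #include <stdio.h>
    #include <stdint.h>

    /* NENE# ("next near") style check: a fast static column-mode
     * access is possible when the new address falls in the same DRAM
     * row as the previous one; otherwise a full RAS/CAS cycle runs. */
    static int same_dram_row(uint32_t addr, uint32_t prev)
    {
        return (addr >> 12) == (prev >> 12);  /* row-bit boundary is
                                                 system-dependent */
    }

    int main(void)
    {
        printf("%d\n", same_dram_row(0x5004, 0x5F00)); /* 1: column-only */
        printf("%d\n", same_dram_row(0x5004, 0x7F00)); /* 0: full cycle  */
        return 0;
    }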
With the two levels of pipelining, the total access time is six cycles, while data is available every two cycles. Zero-wait-state operation does not require pipelining for write cycles. When a write occurs, the address and data latched in the buffers allow READY# to be returned to the processor. The actual write cycle occurs after READY# returns to the processor. This delayed write operation allows processor execution to continue even though the write has not fully completed.

Using 85-ns static column-mode DRAMs, the 33-MHz i860 microprocessor can operate at zero wait states for accesses within the DRAM page. The two-level pipelining and two-clock transfer rate allow the processor to sustain performance without the need for an external cache memory system.

Software support

Both internal development teams and independent vendors provide a full complement of software development tools and operating systems for the i860. Figure 12 shows the software development tools available: C and Fortran compilers, an assembler/linker, a simulator/debugger, a Fortran vectorizer, plus mathematical, vector primitive, and 3D graphics libraries. To support the initial development environments, both Unix System V running on a 386 microprocessor and OS/2 host the cross-compilers. The optimizations used in the compilers include coloring for register allocation, register-based parameter passing for calls, interblock common subexpression and loop invariant elimination, constant propagation, strength reduction, extensive peephole optimizations, and instruction scheduling.

Scientific-application support includes a Fortran vectorizing precompiler. Vectorization occurs in DO and IF loops, outer loops, and forward-branching conditional operations. The precompiler recognizes these structures and generates calls to a set of preprogrammed procedures. The preprogrammed procedures are optimized for the processor's instruction set and for managing the data cache as a vector register set. Additionally, other high-level languages can call these procedures. We plan to further increase the degree of parallelism that high-level languages can use in the processor. We also provide a library of assembly-language routines for scalar mathematics.
[Figure 12. Software development tools: Fortran source passes through the vectorizer to the Fortran compiler, C source through the C compiler; both produce ASM source for the assembler and linker, supported by the vector primitive library and math library, targeting the processor.]
The first 3D visualization tool ported to the i860 CPU is Ardent Computer's Dore. This tool supports both real-time, interactive 3D modeling and higher quality static images. Several windowing environments and other 3D tools and libraries are also being ported.

Application software can be run on either a software simulator or an add-in application accelerator. Both share a common debugging interface. The simulator allows the user to model different memory systems and measure their effects on performance. Either Unix V/386 or OS/2 hosts the application accelerator, which includes a runtime operating environment that maps I/O requests back to the host processor.

A multiprocessing version of Unix System V Release 4.0 is under development for the i860 CPU. This is a joint effort by AT&T, Convergent Technologies, Intel, Olivetti, Prime Computer, and others. We plan to maintain source-code compatibility with the high-level languages between the 386, i486, and i860 microprocessors. Specifications for an applications binary interface (ABI) standard will allow portability of application software across multiple vendors' system implementations.

The i860 microprocessor begins the second generation of 32-bit RISC processors. By using a 64-bit architecture, the i860 delivers balanced MIPS, Mflops, and graphics performance. The million-transistor budget lets us integrate the RISC core and provide dedicated, fast floating-point hardware, graphics capabilities, and cache memories on one chip. The design allows maximum parallelism between the functional units while achieving a balance between computation speed and data bandwidth. Mainframe and supercomputer architectural concepts let the processor offer a complete solution to the requirements of high-computation applications.

References

1. ANSI/IEEE Standard 754-1985 for Binary Floating-Point Arithmetic, IEEE Computer Society Press, Los Alamitos, Calif., 1985.
Les Kohn is a chief architect for high-performance processors at Intel Corporation of Santa Clara, California, where he has worked on various 32- and 64-bit microprocessor design projects. Before joining the company, he worked as a software manager and architect for the NS32000 family at National Semiconductor. His interests include computer architectures, compilers, and electronic synthesizers. Kohn received his BS degree in physics from the California Institute of Technology in Pasadena.

Neal Margulis is a senior engineer for high-performance processors at Intel. His interests include processor architecture and system design. Margulis received his degree in electrical engineering from the University of Vermont in Burlington. He is a member of the IEEE Computer Society and Tau Beta Pi.

Questions concerning this article may be directed to the authors through Michael Sullivan at Intel Corporation, SC4-42, 2625 Walsh Avenue, Santa Clara, CA 95051.