EE108B  Prof. Kozyrakis
Spring 2005-2006

EE108B Lab Assignment #4


Caches
Due: Tuesday, June 6, 2006

1. Introduction
At this point you have created a single-cycle microprocessor and added pipelining to it.
Up to now we have modeled memory accesses simply as a direct access to the BRAM
without a hierarchy, and you have been working with relatively small data sets. In Lab 4,
we will run programs that use large datasets, and will split up the BRAM to emulate
different layers of the memory hierarchy. Your task in lab 4 is to implement a data cache
and coordinate the data transfer between the “main memory” and “cache” parts of the
BRAM. The main memory part of the BRAM has already been declared; access delays
have been added to emulate the slower main memory layer, and it has an access
protocol to which your implementation must adhere for correct functioning. It is your
task to direct reads and writes to the cache to maximize the performance of your
microprocessor.

2. Requirements
You must implement a write-through data cache in your microprocessor, along with
performance counters that tabulate relevant statistics about the performance of your
cache. You do not need to implement a new instruction cache, however. The program to
be executed is not large and the BRAM is sufficient for storage of the instructions.

3. Implementation Details
Cache: The cache data will be stored in BRAM. It is up to you to decide the
associativity and replacement policy of the cache. As you make these decisions you must
consider that the cache will need to contain tags and possibly valid bits depending on
your implementation. So, as always, there are tradeoffs between speed and size in
hardware.
The cache you implement may not exceed 2 kB (2048 bytes) in size. This is inclusive of
all tags, valid bits, use bits, data, etc. You are also not permitted to remove the tcgrom
and/or framebuffer to make more space.
Because you will need to store tags, valid bits, etc. in your cache, each access to the
cache will involve more than just the 32 bits of a data word. Thus, the memory
accesses to the BRAM will not be 32-bit aligned; the exact width will depend on the
design of your cache.
However, reads and writes to the “main memory” must be word aligned. In main
memory there are no tags or other bits—all you are storing is data, and it should be
aligned by the word. Also, the main memory has a capacity of 256 kB, so your effective
address size is 16 bits (256 kB = 2^18 bytes = 2^16 words, word-addressed). You can
use this fact to shorten the length of the tags.
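To make the arithmetic concrete, here is one possible address breakdown in Verilog, assuming a direct-mapped cache with 1-word blocks and 256 sets. These parameters are illustrative assumptions, not a required configuration; pick your own split to fit the 2 kB budget.

```verilog
// Hypothetical address split for a direct-mapped cache:
// 256 sets, 1-word blocks, 16-bit word addresses.
module cache_addr_split (
    input  [15:0] word_addr,  // word-aligned address derived from ALUResult
    output [7:0]  index,      // selects one of the 256 cache sets
    output [7:0]  tag         // upper bits, stored alongside the data
);
    assign index = word_addr[7:0];
    assign tag   = word_addr[15:8];
endmodule
// Each entry is then {valid, tag, data} = 1 + 8 + 32 = 41 bits, and
// 256 entries x 41 bits = 10496 bits = 1312 bytes, within the 2 kB limit.
```

With a larger block size the low bits become a block offset instead, which shortens the tag further.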
Main Memory Access: The data cache will interact with the emulated main memory
when necessary, both for reads and writes. Main memory access times are slower than
cache access times, so you must use the signaling protocol shown here:

Figure 1. Signal protocol for a write to main memory


To describe the signaling in words, here is the sequence of events that should occur on a
cache write; the instruction occurs over 5 mipsclk cycles, as indicated by the 2 vertical
lines. Since you will be implementing a write-through cache, the main memory will be
accessed on every write.
1. The pc is changed to that of the new instruction on the positive edge of the mipsclk;
the new instruction is a memory write.
2. 1 memclk cycle later the write instruction is fetched from irom.
3. Upon receiving the new instruction, the MemWrite signal is set high slightly after the
positive memclk edge if it is detected that the instruction is a memory write.
4. When the MemWrite signal goes high, the stall signal goes high with it so that no new
instructions will be executed while a data transaction is still in progress.
5. The MemWrite signal is an input to your cache module, which should detect the write
and assert the cache_busy and mainmem_access signals exactly 1 memclk cycle after
the MemWrite signal is set.
6. The mainmem_access signal is your indication to the main memory that you wish to
access it (either read or write). Since both the MemWrite and RtData signals are fed
into the main memory directly, the main memory will automatically write RtData into
the address indicated by ALUResult.
7. When the main memory senses that a memory access request has been placed (i.e. the
mainmem_access signal is 1), it asserts the mainmem_busy signal (shown as dram_busy
in the diagram above) after 1 memclk cycle.

8. The mainmem_busy signal will be high for 3 memclk cycles before it becomes
de-asserted, at which point the data will have been written to the main memory. Your
mainmem_access signal must be set to 0 exactly 1 memclk cycle after dram_busy
becomes low. Your cache_busy signal can be set high for longer if necessary.
9. Your cache_busy signal is monitored by the MemStage module, so when your
cache_busy signal is set low, the stall signal is automatically set low. After this the
processor continues as usual.
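The write handshake in steps 5-9 can be sketched as a small state machine. The module, state names, and reset signal below are illustrative assumptions; only MemWrite, cache_busy, mainmem_access, and mainmem_busy come from the handout, and you should verify the exact cycle alignment against Figure 1 in simulation.

```verilog
// One possible FSM for the write-through handshake described above.
module write_fsm (
    input      memclk,
    input      reset,
    input      MemWrite,       // from MemStage, high for a store
    input      mainmem_busy,   // high while main memory is working
    output reg cache_busy,
    output reg mainmem_access
);
    localparam IDLE = 2'd0, REQUEST = 2'd1, WAIT = 2'd2;
    reg [1:0] state;

    always @(posedge memclk) begin
        if (reset) begin
            state <= IDLE; cache_busy <= 0; mainmem_access <= 0;
        end else case (state)
            IDLE: if (MemWrite) begin
                // step 5: assert both signals 1 memclk after MemWrite
                cache_busy <= 1; mainmem_access <= 1; state <= REQUEST;
            end
            REQUEST: if (mainmem_busy) state <= WAIT;   // step 7
            WAIT: if (!mainmem_busy) begin
                // step 8: drop the signals once dram_busy falls
                cache_busy <= 0; mainmem_access <= 0; state <= IDLE;
            end
            default: state <= IDLE;
        endcase
    end
endmodule
```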
The following diagrams describe what happens on a read.

Figure 2. Signal protocol for a read cache hit

Figure 3. Signal protocol for a read from main memory (read miss, 1-word block)

Figure 4. Signal protocol for a read from main memory (read miss, 4-word block)
1. The pc is changed to that of the new instruction on the positive edge of the mipsclk;
the new instruction is a memory read. 1 memclk cycle later the read instruction is
fetched from irom.
2. Upon receiving the new instruction, the MemRead signal is set high slightly after the
positive memclk edge if it is detected that the instruction is a memory read.
3. In order to stall the processor in time in case of a read miss, the stall signal is always
set high 1 clk cycle after the positive edge of MemRead regardless of whether the
read would result in a cache miss or hit. Since the MemRead signal is an input to the
cache module, you should detect whether the read is a hit or a miss AND set the
cache_busy signal appropriately within 2 memclk cycles of the MemRead signal
going high.
4. If the read is a cache hit (Figure 2), your cache_busy signal should remain low; this is
monitored by the MemStage module, which will in turn de-assert the stall signal.
Since the stall signal always goes high on a read (at least temporarily), any memory
read will always take at least 2 mipsclk cycles.
5. If the read is a read miss (Figure 3 and Figure 4), you need to fetch the data from
main memory. In this case, you should assert the cache_busy signal exactly 2 memclk
cycles after the positive edge of the MemRead signal. You should also assert the
mainmem_access signal at this point to initiate a main memory read. The stall signal
will remain high for as long as your cache_busy signal is high, and will go low 1 clk
cycle after you de-assert your cache_busy signal.
6. If you want the main memory to read just 1 word per access (Figure 3.), then the
read_block signal in the cache module should be set to 0. The first word can be read
on the negative edge of the start_read signal, which comes at 3 memclk cycles after
the dram_busy becomes high. Since only 1 word is being read, the dram_busy signal
is de-asserted at the same time as start_read. Your mainmem_access signal must be
set to 0 exactly 1 memclk cycle after dram_busy becomes low. Your cache_busy
signal can be set high for longer if necessary.

7. However if you want the main memory to read 4 words per access (Figure 4.), then
the read_block signal should be set to 1. The first word can be read on the negative
edge of the start_read signal, which also comes at 3 memclk cycles after the
dram_busy becomes high. The subsequent 3 words can be read on each positive edge
of the memclk, with the last one being read at the same time that the dram_busy
signal is de-asserted. Your mainmem_access signal must be set to 0 exactly 1
memclk cycle after dram_busy becomes low. Your cache_busy signal can be set high
for longer if necessary.
8. Your cache_busy signal is monitored by the MemStage module, so when your
cache_busy signal is set low, the stall signal is automatically set low. After this the
processor continues as usual.
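The read side (steps 1-8) can be sketched the same way. The hit input, state encoding, and module boundaries below are illustrative assumptions; the hit/miss decision itself (tag compare plus valid bit) lives in your cache lookup logic, and the cycle alignment should be checked against Figures 2-4.

```verilog
// One possible FSM for the read-side handshake described above.
module read_ctl (
    input      memclk,
    input      reset,
    input      MemRead,
    input      hit,            // tag match && valid, from your cache lookup
    input      mainmem_busy,
    output reg cache_busy,
    output reg mainmem_access
);
    localparam IDLE = 2'd0, CHECK = 2'd1, FETCH = 2'd2, DRAIN = 2'd3;
    reg [1:0] state;

    always @(posedge memclk) begin
        if (reset) begin
            state <= IDLE; cache_busy <= 0; mainmem_access <= 0;
        end else case (state)
            IDLE:  if (MemRead) state <= CHECK;   // step 2: read detected
            CHECK: if (hit)
                       state <= IDLE;             // step 4: cache_busy stays low
                   else begin                     // step 5: miss, go to memory
                       cache_busy <= 1; mainmem_access <= 1; state <= FETCH;
                   end
            FETCH: if (mainmem_busy) state <= DRAIN;
            DRAIN: if (!mainmem_busy) begin       // steps 6/7: transfer done
                       cache_busy <= 0; mainmem_access <= 0; state <= IDLE;
                   end
            default: state <= IDLE;
        endcase
    end
endmodule
```

On a miss you would also capture the returned word(s) on the edges described in steps 6 and 7 and write them into the cache data array along with the new tag and valid bit.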
Please ask the TAs if you have any questions regarding the signaling protocol.
Design Considerations: When designing the cache it is important to consider both the
associativity of the cache and the replacement policy that will be used. Both the lecture
notes and the book discuss the relative advantages of different associativities and
replacement policies.
It is recommended that you implement state machines in your design to direct memory
accesses to the main memory and to coordinate the various control signals. Basically
these state machines will consist of D flip-flops and combinational logic that determines
the next state.
Overall, your cache implementation should behave as follows:

• Read hit: return data to CPU.

• Read miss: fetch block from main memory and return data to CPU.

• Write hit: update both cache block and main memory.

• Write miss: if your cache block size is 1 word, then you should update both the
cache block and main memory. If your cache block size is greater than 1 word,
then you should update only the main memory; of course, you are free to
implement write allocate on miss if you wish.
If you use 4-word main memory reads and cache blocks greater than 1 word, you should
note that for any given address the main memory will always first return the word at the
address specified, and then the 3 words immediately following it in main memory. This
means that reads from the main memory will in general not line up with your cache
block. In order to align your main memory reads to your cache blocks, you should
calculate the aligned memory address in your cache, pass it out of MemStage.v, and
use it to replace the ALUResult input of the MainMem_ctl.v module.
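The alignment itself is just masking off the block-offset bits of the word address. A minimal sketch, assuming 4-word blocks and the 16-bit word address from the handout (the signal names and the byte-to-word slice of ALUResult are assumptions; adjust them to your datapath):

```verilog
// Align a main memory read to a 4-word cache block by zeroing the
// two block-offset bits of the word address.
wire [15:0] word_addr    = ALUResult[17:2];          // byte -> word address
wire [15:0] aligned_addr = {word_addr[15:2], 2'b00}; // start of 4-word block
// e.g. word_addr 16'h0036 and 16'h0037 both map to aligned_addr 16'h0034
```

If you instead track which of the 4 returned words is the one the CPU requested, remember that with an aligned fetch it is no longer always the first word returned.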
For simulation we recommend writing simple test programs in MIPS and running through
the functionality of the cache – it is much easier to isolate and spot an error in a small
predictable piece of code than otherwise. Important: due to the way signals are
initialized in the test bench for ModelSim, you should place at least 1 nop at the very
beginning of any MIPS assembly code you run.
Lastly, note that although both performance and complexity of your cache will be part of
your evaluation, a correctly functioning 1-word block direct-mapped cache will
nonetheless earn you full credit. More sophisticated and correctly functioning cache
implementations (e.g. set-associative caches, LRU, sub-blocking, write-allocate on miss,
etc.) will earn you extra credit; thus high performance/complexity will be viewed as an
extension.
Performance Counters: You must implement performance counters in your design that
count the number of clock cycles that elapse while the program is executing – code is
given in Memstage.v that may help you implement these counters. These counters
should start upon execution and halt upon completion. They should run continuously
during the entire time that the program runs; that is to say, they should not stall when the
processor stalls. This will give us an idea of the efficiency of your cache design.
To stop the performance counters once the program is finished, use a memory-mapped
processor command. The idea here is similar to the memory-mapped input/output that
the processor uses for the VGA monitor and Sega GamePad. Those commands use
locations 0xff and 0xfd in memory, so for the processor command use 0xfe (perform a
sw $rx, 0xfe($zero)). When a store word command is given to this location, the
performance counters on the processor should stop. If this memory-mapped instruction is
correctly implemented in the processor, you should be able to stop the counters at the end
of the program by inserting a store word instruction that writes to 0xfe.
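A minimal sketch of such counters, assuming a running flag that starts high and is cleared by a store to address 0xfe. All signal names other than MemWrite and mipsclk are assumptions; wire them to whatever your cache and MemStage actually expose.

```verilog
// Performance counters that run every cycle (including stalls) until a
// memory-mapped store to 0xfe stops them.
module perf_counters (
    input             mipsclk,
    input             reset,
    input             MemWrite,
    input      [31:0] mem_addr,      // store address seen by the cache
    input             read_access, read_hit,
    input             write_access, write_hit,
    output reg [31:0] cycles, reads, read_hits, writes, write_hits
);
    reg running;
    always @(posedge mipsclk) begin
        if (reset) begin
            running <= 1; cycles <= 0; reads <= 0; read_hits <= 0;
            writes <= 0; write_hits <= 0;
        end else if (running) begin
            cycles <= cycles + 1;    // counts every cycle, stalls included
            if (read_access)  reads      <= reads + 1;
            if (read_hit)     read_hits  <= read_hits + 1;
            if (write_access) writes     <= writes + 1;
            if (write_hit)    write_hits <= write_hits + 1;
            // memory-mapped stop: sw $rx, 0xfe($zero)
            if (MemWrite && mem_addr == 32'hfe) running <= 0;
        end
    end
endmodule
```

Counting the number of instructions executed works the same way; increment only on cycles where the processor is not stalled.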
In sum the performance counters should record the following data:

• total number of CPU cycles passed during execution

• total number of instructions executed

• total number of reads and read hits

• total number of writes and write hits


If you comment out the code for the performance registers near the bottom of the MemStage
module, you should already be able to run the processor in ModelSim; the cache given
always misses.
Clocking: The Virtex-II Pro system clock is 100 MHz; this clock is divided down to 50 MHz
(memclk) for the memory blocks, i.e. your cache, irom, etc., and to 25 MHz for the MIPS
processor (mipsclk). Data at an address fed into a memory block on a positive edge of the
memclk will be available on the next positive edge of the memclk.
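That one-cycle read latency is the usual registered-read behavior of block RAM, which a sketch like the following models (the 41-bit entry width is an assumption carried over from the earlier direct-mapped example):

```verilog
// Minimal model of the BRAM timing described above: an address presented
// at one posedge of memclk yields its data at the next posedge.
module cache_ram (
    input             memclk,
    input             we,
    input      [7:0]  addr,
    input      [40:0] din,    // e.g. {valid, tag, data}
    output reg [40:0] dout
);
    reg [40:0] mem [0:255];
    always @(posedge memclk) begin
        if (we) mem[addr] <= din;
        dout <= mem[addr];    // registered read: data one memclk later
    end
endmodule
```

Keep this latency in mind when counting memclk edges against the protocol figures: a tag lookup started on one edge cannot produce a hit/miss decision until the next.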
Demo: Your working design will be demonstrated in ModelSim for this lab. You will
need to run the test program and show the TAs the performance results. As an extension,
you may attempt to download the design to hardware and run it on the Virtex-II Pro
FPGAs, but this is not required.

Starter Files: You are given the code for a single-cycle implementation of a
microprocessor. Your modifications should be made to the Cache.v file and possibly to
MemStage.v. In the starter files a cache that always misses is emulated in Cache.v, while
logic has been placed inside the MemStage module to monitor the output signals from
Cache.v; thus all signal control logic should be placed inside the Cache.v file. You should
note that since the given cache is only a placeholder, where its behavior deviates from the
signaling protocol shown above you should conform to this handout and not to the cache.
ModelSim Libraries: As with your previous labs, please type
vmap unisims_ver c:/modeltech_5.7a/xilinx_libs/unisims_ver
vmap xilinxcorelib_ver c:/modeltech_5.7a/xilinx_libs/XilinxCoreLib_ver
on the ModelSim command line to include the necessary libraries.

4. Writeup
As always, you are expected to follow the protocol listed in the lab report guidelines on
the class website. In addition, we would like you to include the performance results that
you were able to attain with your cache, along with a paragraph or two explaining how
you arrived at the cache implementation that you chose. For example, how did you arrive
on the associativity that you chose? Also, how did you decide on a replacement policy?
Please be sure to report your results of the performance counters for the test program we
provide.
