Introduction

This is a follow-up to my original post.

My FPGA computer uses SDRAM as operating memory. It has static RAM too, but most of it is used as dual-port RAM for the VGA video subsystem. The SDRAM is a 32MB part with a 16-bit data bus, and a read usually takes about six clock cycles, as does a write. The clock runs at 100MHz. With all of that in mind, it was about time to do some performance measurement:

I wrote a simple program that counts from 1 to 10 000 000. If that program is loaded into SDRAM, it takes about 15 seconds to finish. If I load it into static RAM instead, it takes about 6 seconds. So, there was an obvious motivation to try to implement a cache controller. You can look at the Verilog code here:
https://github.com/milanvidakovic/FPGAComputer32/blob/master/cpu.v

Implementation

I haven't used all of the static RAM in my FPGA computer, so I was able to make about 8KB of L1 cache. Here are the details:
- I have 4096 cache lines, each having two bytes. That is 8KB of cache.
- For each cache line, I have added a 12-bit tag, used for the direct mapping of the cache line. That consumes an additional 6144 bytes of static RAM (4096 × 12 bits).
- I have implemented a write-through policy, since I didn't have enough resources to implement a write-back policy. I will try to make write-back, but it requires a complete rework of the cache controller, so, perhaps later...
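
The storage budget for this layout can be checked with a few lines of arithmetic. This is only a sketch in Python (not part of the Verilog design), and it assumes the tags are packed at exactly 12 bits per line with no padding; the constant names are made up for illustration.

```python
# Storage budget for a direct-mapped cache with one 16-bit word per line.
# Assumption: tags are packed at exactly TAG_BITS bits per line.
LINES = 4096      # number of cache lines
LINE_BITS = 16    # each line holds two bytes
TAG_BITS = 12     # width of the tag stored next to each line

data_bytes = LINES * LINE_BITS // 8
tag_bytes = LINES * TAG_BITS // 8

print("data :", data_bytes, "bytes")   # 8192
print("tags :", tag_bytes, "bytes")    # 6144
print("total:", data_bytes + tag_bytes, "bytes")
```

With only two bytes per line, the tags cost three quarters as much memory as the data they describe. Wider cache lines would lower that ratio, at the price of a more complex fill from SDRAM.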

How does this work in practice? First of all, here is the Verilog code:

// cache TAG
reg [11:0] tag[4095:0];
// cache line
reg [15:0] cl[4095:0];

Each cache line (a row in the cl array) holds two bytes of data. Whenever the CPU wants to do a read, the address from the address bus goes into the cache controller:

if (tag[addr[11:0]] == addr[23:12]) begin
    // cache hit (required data is in cache)
    data_r <= cl[addr[11:0]];
    state <= next_state;
end
else begin
    // cache miss -> we need to read from SDRAM
    rd_enable_o <= 1'b1;
    if (busy_i) begin
        state <= READ_WAIT;
    end
end

The lower 12 bits of the address (addr[11:0]) select the cache line. To check whether the wanted data is in the cache, the tag is used. The same lower 12 bits select the tag assigned to that cache line. If the upper 12 bits of the address (addr[23:12]) match the stored tag, we have a cache hit and the data can be returned directly from the cache.
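
The address split can be tried out in software. The sketch below uses Python rather than Verilog so it can be run without a simulator; split_address is a hypothetical helper name, and the two sample addresses are arbitrary.

```python
INDEX_BITS = 12   # addr[11:0] selects the cache line
TAG_BITS = 12     # addr[23:12] is compared against the stored tag

def split_address(addr):
    """Return (tag, index) for a 24-bit address."""
    index = addr & ((1 << INDEX_BITS) - 1)
    tag = (addr >> INDEX_BITS) & ((1 << TAG_BITS) - 1)
    return tag, index

# Two arbitrary addresses that share the lower 12 bits.
for addr in (0x001234, 0x7F5234):
    tag, index = split_address(addr)
    print(f"addr=0x{addr:06X} tag=0x{tag:03X} index=0x{index:03X}")
```

Both addresses land on line 0x234 but carry different tags (0x001 and 0x7F5), so in a direct-mapped cache they compete for the same line and evict each other.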

If that is not the case, we need to perform a read from the SDRAM and wait until the data is ready:

rd_enable_o <= 1'b0;
if (rd_ready_i) begin
    data_r <= rd_data_i;
    // we store the fetched data into the cache
    cl[addr[11:0]] <= rd_data_i;
    // write tag
    tag[addr[11:0]] <= addr[23:12];
    state <= next_state;
end

When we finally obtain the data from the SDRAM, we return that data to the CPU, but we also store that same data in the cache line and update the tag associated with that cache line with the upper 12 bits of the address.
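
The read path (hit, or miss followed by a fill) can be modelled in a few lines of Python. This is a behavioural sketch, not the controller itself: ReadCache is a made-up class, a dict stands in for the SDRAM, and None marks a line that was never filled, which is an assumption because the Verilog above has no explicit valid bit.

```python
class ReadCache:
    """Behavioural model of the direct-mapped read path."""

    def __init__(self, backing):
        self.backing = backing        # stand-in for SDRAM: address -> 16-bit word
        self.tag = [None] * 4096      # None = never filled (assumption)
        self.cl = [0] * 4096          # one 16-bit word per cache line
        self.hits = 0
        self.misses = 0

    def read(self, addr):
        index = addr & 0xFFF
        tag = (addr >> 12) & 0xFFF
        if self.tag[index] == tag:
            self.hits += 1            # cache hit: data comes from the line
            return self.cl[index]
        self.misses += 1              # cache miss: fetch, then fill line and tag
        data = self.backing.get(addr, 0)
        self.cl[index] = data
        self.tag[index] = tag
        return data

sdram = {0x000010: 0xBEEF, 0x001010: 0xCAFE}
cache = ReadCache(sdram)
first = cache.read(0x000010)    # miss, fills line 0x010
second = cache.read(0x000010)   # hit
third = cache.read(0x001010)    # same line, different tag: miss, replaces the line
fourth = cache.read(0x000010)   # miss again, the line was replaced
print(cache.hits, cache.misses)
```

The last two reads show the usual weakness of direct mapping: two addresses that share the lower 12 bits keep evicting each other, even when the rest of the cache is unused.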

That was the read cycle. Let's see how a write works. When the CPU wants to write data, the data is saved into the SDRAM and into the cache as well:

// Write-through, meaning that we save data in both SDRAM and cache
wr_data_o <= data_to_write;
// store the same data in the cache line
cl[addr[11:0]] <= data_to_write;
// write tag
tag[addr[11:0]] <= addr[23:12];
wr_enable_o <= 1'b1;
if (busy_i) begin
    state <= WRITE_WAIT;
end

As we can see, data is saved in both SDRAM and cache. Then we wait until the SDRAM is no longer busy and move on to the next state:

wr_enable_o <= 1'b0;
if (~busy_i) begin
    state <= next_state;
end
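
A small software model makes the write-through property easy to see: every write reaches the SDRAM, so the SDRAM never holds stale data. The sketch below is in Python and only illustrates the policy. WriteThroughCache is a hypothetical class, a dict stands in for the SDRAM, and None marks a never-filled line (an assumption, since the Verilog has no valid bit).

```python
class WriteThroughCache:
    """Behavioural model: writes update both the SDRAM and the cache line."""

    def __init__(self):
        self.sdram = {}               # stand-in for SDRAM: address -> 16-bit word
        self.tag = [None] * 4096      # None = never filled (assumption)
        self.cl = [0] * 4096
        self.sdram_writes = 0

    def write(self, addr, data):
        index = addr & 0xFFF
        self.sdram[addr] = data & 0xFFFF      # always goes to SDRAM
        self.sdram_writes += 1
        self.cl[index] = data & 0xFFFF        # and to the cache line
        self.tag[index] = (addr >> 12) & 0xFFF

    def read(self, addr):
        """Return (data, hit)."""
        index = addr & 0xFFF
        tag = (addr >> 12) & 0xFFF
        if self.tag[index] == tag:
            return self.cl[index], True
        data = self.sdram.get(addr, 0)
        self.cl[index] = data
        self.tag[index] = tag
        return data, False

cache = WriteThroughCache()
cache.write(0x000020, 1234)
value, hit = cache.read(0x000020)      # hit: the write also filled the line
cache.write(0x005020, 5678)            # same line, different tag
value2, hit2 = cache.read(0x000020)    # miss, but SDRAM still has the right data
print(value, hit, value2, hit2, cache.sdram_writes)
```

Because the line that gets replaced is never newer than the SDRAM copy, replacing it needs no extra work. That is what keeps the write-through controller simple.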

Performance

The cache controller works like a charm! The same counting example now runs (almost) as fast as it did from static RAM (about 6 seconds to count from 1 to 10 million).
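
To put the measurements in perspective, the arithmetic below converts them into clock cycles per loop iteration. It is a rough Python calculation, not a measurement. It assumes the CPU runs from the same 100MHz clock mentioned in the introduction, and it uses the rounded timings (about 15 and about 6 seconds), so the results are approximate.

```python
CLOCK_HZ = 100_000_000     # assumption: CPU clock equals the 100MHz clock above
ITERATIONS = 10_000_000    # the program counts from 1 to 10 million

def cycles_per_iteration(seconds):
    """Average number of clock cycles spent on one loop iteration."""
    return seconds * CLOCK_HZ / ITERATIONS

sdram_cycles = cycles_per_iteration(15)   # running from SDRAM, no cache
fast_cycles = cycles_per_iteration(6)     # running from static RAM, or with cache
speedup = 15 / 6

print(sdram_cycles, fast_cycles, speedup)   # 150.0 60.0 2.5
```

Under these assumptions the cache saves roughly 90 cycles per iteration, a speedup of about 2.5 times.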

Conclusion

A write-through implementation is simpler than write-back and keeps the SDRAM in sync with the cache. However, it is slower, because the CPU needs to wait for the data to be saved in SDRAM, instead of doing a fast save into the cache only. Write-back is faster, since we don't have to wait for the slow SDRAM save, but the cache goes out of sync with the SDRAM (since we saved data in the cache only). With write-back, when a cache line holding modified data has to be reused for another address, we first need to write the old content of that line into the SDRAM, and only then store the new content in the cache.
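
For comparison, here is a minimal sketch of what write-back would involve, as a Python model rather than Verilog. It is not the author's design: WriteBackCache and its dirty array are hypothetical, and a real controller would also need extra states to wait for the SDRAM during eviction. The dirty bit per line marks data that exists only in the cache.

```python
class WriteBackCache:
    """Behavioural model: writes stay in the cache until the line is replaced."""

    def __init__(self):
        self.sdram = {}                  # stand-in for SDRAM: address -> 16-bit word
        self.tag = [None] * 4096         # None = never filled (assumption)
        self.cl = [0] * 4096
        self.dirty = [False] * 4096      # True = line is newer than SDRAM
        self.sdram_writes = 0

    def _evict(self, index):
        # Write the old content back only if it was modified.
        if self.dirty[index]:
            old_addr = (self.tag[index] << 12) | index
            self.sdram[old_addr] = self.cl[index]
            self.sdram_writes += 1
            self.dirty[index] = False

    def write(self, addr, data):
        index, tag = addr & 0xFFF, (addr >> 12) & 0xFFF
        if self.tag[index] != tag:
            self._evict(index)           # line belongs to another address
        self.cl[index] = data & 0xFFFF   # fast path: cache only
        self.tag[index] = tag
        self.dirty[index] = True

    def read(self, addr):
        index, tag = addr & 0xFFF, (addr >> 12) & 0xFFF
        if self.tag[index] != tag:
            self._evict(index)
            self.cl[index] = self.sdram.get(addr, 0)
            self.tag[index] = tag
        return self.cl[index]

cache = WriteBackCache()
for i in range(1000):
    cache.write(0x000040, i)             # repeated writes never touch SDRAM
writes_before = cache.sdram_writes
cache.write(0x003040, 7)                 # conflicting address forces a write-back
writes_after = cache.sdram_writes
value = cache.read(0x000040)             # evicts the new line, reloads the old one
print(writes_before, writes_after, value)
```

In this model a thousand writes to one address cost a single SDRAM write, which is where the speed advantage comes from. The price is one dirty bit per line (4096 bits, or 512 bytes, for this cache) plus the eviction logic.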

The write cycle could be implemented as write-back, but with this setup, I cannot do that (not enough resources on the FPGA chip). I will investigate that in the future.