Introduction
        
This is a follow-up to my original post.
        
My FPGA computer uses SDRAM as operating memory. It has static RAM too, but most of it is used as dual-port RAM for the VGA video subsystem. The SDRAM is 32MB with a 16-bit data bus, and it usually takes about six clock cycles for a read and the same number of cycles for a write. The clock is 100MHz, so each access costs roughly 60ns. Knowing all of this, it was about time to do some performance measurement:
 
        
I have made a simple program that counts from 1 to 10 000 000. If that program is loaded into SDRAM, it takes about 15 seconds to finish. However, if I load it into static RAM, it takes about 6 seconds. So there was an obvious motivation to try to implement a cache controller. You can look at the Verilog code here:
https://github.com/milanvidakovic/FPGAComputer32/blob/master/cpu.v
         
        Implementation
I haven't used all of the static RAM in my FPGA computer, so I was able to make about 8KB of L1 cache. Here are the details:
- I have 4096 cache lines, each holding two bytes. That is 8KB of cache.
- For each cache line, I have added a 12-bit TAG, used for direct mapping of the cache line. That consumes an additional 6144 bytes (4096 x 12 bits) of static RAM.
- I have implemented a write-through policy, since I didn't have enough resources to implement write-back. I will try to make write-back, but it requires a complete rework of the cache controller, so, perhaps later...
        
        
        
How does this thing work in practice? First of all, here is the Verilog code:
// cache TAG
reg [11:0] tag[4095:0];
// cache line
reg [15:0] cl[4095:0];
        
Each cache line (a row in the cl variable) holds two bytes of data. Whenever the CPU wants to do a read, the address from the address bus goes into the cache controller:
        
        
        
if (tag[addr[11:0]] == addr[23:12]) begin
  // cache hit (required data is in the cache)
  data_r <= cl[addr[11:0]];
  state <= next_state;
end
else begin
  // cache miss -> we need to read from SDRAM
  rd_enable_o <= 1'b1;
  if (busy_i) begin
    state <= READ_WAIT;
  end
end
         
        
The 12 lower bits of the address (addr[11:0]) are used to address the cache line. To check whether the wanted data is in the cache, the tag is used: the same 12 lower bits address the tag assigned to that cache line. If the upper 12 bits of the address (addr[23:12]) match the stored tag, then we have a cache hit and the data can be returned directly from the cache.
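To make the split concrete, here is a tiny illustrative module (the names index and tag_bits are mine, not taken from the actual controller):

    // Illustrative only: how a 24-bit address decomposes for the lookup
    module addr_split_example (
        input  wire [23:0] addr,
        output wire [11:0] index,     // selects one of the 4096 cache lines
        output wire [11:0] tag_bits   // compared against the stored tag
    );
        assign index    = addr[11:0];
        assign tag_bits = addr[23:12];
        // e.g. addr = 24'hABC123 -> index = 12'h123, tag_bits = 12'hABC
    endmodule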
        
        If that is not the case, then we need to perform a read from the SDRAM, and then:
        
        
rd_enable_o <= 1'b0;
if (rd_ready_i) begin
  data_r <= rd_data_i;
  // we store the fetched data into the cache
  cl[addr[11:0]] <= rd_data_i;
  // write the tag
  tag[addr[11:0]] <= addr[23:12];
  state <= next_state;
end
         
        
When we finally obtain the data from the SDRAM, we return it to the CPU, but we also write that same data into the cache line and update the tag associated with that cache line using the upper 12 bits of the address.
        
That was the read cycle. Let's see how a write works. When the CPU wants to write data, it is saved into the SDRAM and into the cache as well:
        
        
// Write-through, meaning that we save data in both SDRAM and cache
wr_data_o <= data_to_write;
// store the same data into the cache as well
cl[addr[11:0]] <= data_to_write;
// write the tag
tag[addr[11:0]] <= addr[23:12];
wr_enable_o <= 1'b1;
if (busy_i) begin
  state <= WRITE_WAIT;
end
         
        
As we can see, the data is saved in both the SDRAM and the cache, and then we just return:
        
        
wr_enable_o <= 1'b0;
if (~busy_i) begin
  state <= next_state;
end
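To show how these four fragments could hang together, here is a minimal, self-contained sketch of the controller as one state machine. This is not the actual code from cpu.v: the module boundary, the IDLE handshake and the state encoding are assumptions I am making for illustration, while the port names follow the snippets above.

    // A minimal sketch, NOT the real controller from cpu.v; module
    // boundary, IDLE handshake and state encoding are assumed here.
    module cache_sketch (
        input  wire        clk,
        input  wire [23:0] addr,          // CPU address
        input  wire        rd_req,        // CPU read request
        input  wire        wr_req,        // CPU write request
        input  wire [15:0] data_to_write, // data from the CPU
        input  wire        busy_i,        // SDRAM controller busy
        input  wire        rd_ready_i,    // SDRAM read data valid
        input  wire [15:0] rd_data_i,     // data from SDRAM
        output reg         rd_enable_o,   // SDRAM read strobe
        output reg         wr_enable_o,   // SDRAM write strobe
        output reg  [15:0] wr_data_o,     // data to SDRAM
        output reg  [15:0] data_r         // data returned to the CPU
    );

        // direct-mapped storage: 4096 lines x 16 bits, plus 12-bit tags
        // (like the snippets above, this sketch has no valid bits)
        reg [11:0] tag [4095:0];
        reg [15:0] cl  [4095:0];

        localparam IDLE       = 3'd0,
                   READ       = 3'd1,
                   READ_WAIT  = 3'd2,
                   WRITE      = 3'd3,
                   WRITE_WAIT = 3'd4;
        reg [2:0] state      = IDLE;
        reg [2:0] next_state = IDLE;  // where to go after the access

        always @(posedge clk) begin
            case (state)
                IDLE: begin
                    next_state <= IDLE;  // simplification: always return here
                    if (rd_req)      state <= READ;
                    else if (wr_req) state <= WRITE;
                end
                READ: begin
                    if (tag[addr[11:0]] == addr[23:12]) begin
                        // cache hit: serve the CPU directly from the cache
                        data_r <= cl[addr[11:0]];
                        state  <= next_state;
                    end
                    else begin
                        // cache miss: start an SDRAM read
                        rd_enable_o <= 1'b1;
                        if (busy_i) state <= READ_WAIT;
                    end
                end
                READ_WAIT: begin
                    rd_enable_o <= 1'b0;
                    if (rd_ready_i) begin
                        data_r          <= rd_data_i;   // return to the CPU
                        cl[addr[11:0]]  <= rd_data_i;   // fill the line
                        tag[addr[11:0]] <= addr[23:12]; // update the tag
                        state           <= next_state;
                    end
                end
                WRITE: begin
                    // write-through: SDRAM and cache are both updated
                    wr_data_o       <= data_to_write;
                    cl[addr[11:0]]  <= data_to_write;
                    tag[addr[11:0]] <= addr[23:12];
                    wr_enable_o     <= 1'b1;
                    if (busy_i) state <= WRITE_WAIT;
                end
                WRITE_WAIT: begin
                    wr_enable_o <= 1'b0;
                    if (~busy_i) state <= next_state;
                end
            endcase
        end

    endmodule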
         
        Performance
The cache controller works like a charm! The same counting example now runs (almost) as fast as when it was executed from static RAM (about 6 seconds to count from 1 to 10 million).
        Conclusion
The write-through implementation is simpler than write-back and keeps the SDRAM in sync with the cache. However, it is slower, because the CPU needs to wait for the data to be saved into the SDRAM instead of doing a fast save into the cache only. Write-back is faster, since we don't have to wait for the slow SDRAM save, but the cache goes out of sync with the SDRAM (since we saved the data in the cache only). With write-back, when a new address maps to an occupied, modified cache line, we first need to flush that line by writing its contents back into the SDRAM, and only then write the new content into the cache.
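As an illustration, the extra bookkeeping for write-back could look roughly like this (a hypothetical sketch only; none of it exists in the current controller):

    // hypothetical write-back additions (not in the current controller)
    reg dirty [4095:0];  // one dirty bit per cache line

    // write hit: update the cache only and mark the line dirty
    // (no SDRAM access, so the CPU doesn't wait):
    //   cl[addr[11:0]]    <= data_to_write;
    //   tag[addr[11:0]]   <= addr[23:12];
    //   dirty[addr[11:0]] <= 1'b1;

    // replacing a dirty line: the old SDRAM address can be recovered
    // from the stored tag and the index, {tag[addr[11:0]], addr[11:0]};
    // write the old line back first, then fill it and clear the dirty bit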
        
The write cycle could be implemented as write-back, but with this setup I cannot do that (not enough resources on the FPGA chip). I will investigate that in the future.