DMA on FPGA computer

Milan Vidaković

This is a follow-up to my original post.

Until now, my FPGA computer had no DMA (Direct Memory Access). There were two places on my computer where DMA would fit: the SD card interface and the Ethernet interface. I decided to implement a simple DMA controller, first for the SD card interface and later for the Ethernet adapter.

Let's look at the current SD card implementation. I have written about it here. In short, the SD card adapter is connected to my internal SPI module, so all communication with the SD card goes via SPI. In particular, when reading a sector from the SD card, my driver sends 0xFF to the SPI and then reads a byte from it, and it repeats this for all 512 bytes of the sector. The SPI read function was implemented using interrupts: when a byte arrives from the SD card, the SPI controller triggers an interrupt and the CPU has to handle that single byte. This can be quite slow, because the CPU transfers all 512 bytes one by one, with the SPI interrupt firing 512 times per sector.

The initial code for receiving count bytes from the SD card without DMA is quite simple. As stated above, it receives count bytes from the SPI port of the SD card, one at a time, and stores them in the buffer. It looks like this:

for (uint16_t i = 0; i < count; i++) {
        dst[i] = spiRec();
}

The spiRec() function sends 0xFF to the SPI port, and then waits for the byte to arrive from the SPI. It looks like this:

uint8_t spiRec(void) {
        send_spi(SPI0, 0xFF);
        return read_spi(SPI0);
}

The read_spi() function waits in a loop until the SPI interrupt signals that the byte has arrived (the received_from_slave variable):

int read_spi(int port)
{
        if (port == SPI0) {
                while (!received_from_slave || *PORT_SPI_OUT_BUSY)
                {
                }
                return received_byte;
        }
        else if (port == SPI1) {
                while (!received_from_slave1 || *PORT_SPI1_OUT_BUSY)
                {
                }
                return received_byte1;
        }
        return -1; // unknown port: nothing to read
}
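The listing above does not show how received_from_slave is declared. Because the flag is written by the interrupt handler and polled by the main loop, a C compiler generally needs it to be volatile; otherwise the loop may be optimized into an endless spin. Below is a minimal host-side sketch of that handshake. The names spi_isr_model and spi_try_read are hypothetical, and the handler is called as an ordinary function instead of being triggered by hardware:

```c
#include <stdint.h>

/* Shared between the (simulated) interrupt handler and the main code.
   volatile forces the compiler to re-read them on every access. */
static volatile uint8_t received_flag = 0;
static volatile uint8_t received_data = 0;

/* Hypothetical stand-in for the SPI interrupt handler. */
void spi_isr_model(uint8_t byte)
{
    received_data = byte;
    received_flag = 1;
}

/* Non-blocking read: returns the byte (0..255), or -1 if nothing arrived yet. */
int spi_try_read(void)
{
    if (!received_flag) {
        return -1;
    }
    received_flag = 0;   /* consume the byte */
    return received_data;
}
```

The blocking read_spi() from the driver corresponds to calling spi_try_read() in a loop until it stops returning -1.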

How can we make this better? First of all, we will no longer read all 512 bytes one by one using the CPU and interrupts. Instead, we will build a DMA controller that transfers all 512 bytes from the SPI to memory without CPU intervention:

unsigned int *PORT_DMA_ADDR_1                   = (unsigned int *)(0x80000000 + 0x578); // DMA channel 1 start address port
unsigned int *PORT_DMA_COUNT_1                  = (unsigned int *)(0x80000000 + 0x58C); // DMA channel 1 counter port
unsigned short *PORT_DMA_START_RCV_1            = (unsigned short *)(0x80000000 + 0x5BE); // DMA channel 1 start receiving data

uint8_t dma_receive(uint8_t *dst, uint16_t count)
{
        finished_dma_read_1 = 0;
        *PORT_DMA_ADDR_1 = (unsigned int)dst;
        *PORT_DMA_COUNT_1 = count;
        *PORT_DMA_START_RCV_1 = 1;
        while (!finished_dma_read_1) {
        }
        return true;
}

We can see that all we need to do is set the start address (the PORT_DMA_ADDR_1 memory location, since we use memory-mapped IO) and the number of bytes (the PORT_DMA_COUNT_1 location). As soon as we start the transfer (by writing 1 to the PORT_DMA_START_RCV_1 memory location), the CPU does not need to do anything for the actual transfer. Only when all bytes have been transferred is the finished_dma_read_1 variable set to 1, and then the function returns. Who sets that variable? The DMA controller triggers the DMA interrupt at the end of a transfer, and the interrupt handler sets the variable:
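Note that dma_receive() waits forever if the DMA interrupt never arrives, for example if the card is removed in the middle of a transfer. One possible hardening, which is not part of the original driver, is a bounded wait. The sketch below assumes the flag is a volatile byte; wait_for_flag and max_spins are hypothetical names, and a suitable spin limit would have to be found by experiment:

```c
#include <stdint.h>

/* Spins until *flag becomes non-zero or max_spins iterations have passed.
   Returns 1 if the flag was seen, 0 on timeout. */
int wait_for_flag(const volatile uint8_t *flag, uint32_t max_spins)
{
    while (max_spins-- > 0) {
        if (*flag) {
            return 1;
        }
    }
    return 0;
}
```

With such a helper, dma_receive() could return false on timeout instead of hanging, and the caller could retry or report an error.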

short int *DMA_1_HANDLER_INSTR  = (short int *)80; // address of the IRQ#14 handler address first instruction (DMA channel 1 handler)
int *DMA_1_HANDLER_ADDR         = (int *)82; // address of the IRQ#14 handler address (DMA Channel 1 handler)
...
*DMA_1_HANDLER_INSTR    = 1;
*DMA_1_HANDLER_ADDR     = (int)&dma_1_irq_triggered;

This is the dma_1_irq_triggered() interrupt handler:

void dma_1_irq_triggered()
{
        asm
        (
        "push r0\npush r1\npush r2\npush r3\npush r4\npush r5\npush r6\npush r7\npush r8\npush r9\npush r10\npush r11\npush r12\npush r13\n"
        );

        finished_dma_read_1 = 1;

        asm
        (
        "pop r13\npop r12\npop r11\npop r10\npop r9\npop r8\npop r7\npop r6\npop r5\npop r4\npop r3\npop r2\npop r1\npop r0\nmov.w sp,r13\npop r13\niret"
        );
}

Now we can reconstruct the whole sequence. First we give the start address and the number of bytes to the DMA controller. For each byte to be received, the controller sends 0xFF to the SPI and then waits for the byte to come back. It stores that byte at the given address and increments its internal counter. When the counter reaches the requested number of bytes, the DMA interrupt is triggered. All we have to do is write the DMA interrupt handler that signals the end of the transfer (the dma_1_irq_triggered() function).
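The sequence above can be sketched as a small software model that runs on a PC. This is only an illustration, not the actual controller: the dma_model struct, sd_exchange() and dma_run() are hypothetical names, the SD card is replaced by an array, and the model stores bytes one at a time while the real hardware writes 16-bit words.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of one DMA channel; not the Verilog itself. */
typedef struct {
    uint8_t *addr;        /* start address (models PORT_DMA_ADDR_1)  */
    uint32_t count;       /* bytes to transfer (PORT_DMA_COUNT_1)    */
    uint32_t current;     /* internal byte counter                   */
    uint32_t ff_sent;     /* how many 0xFF bytes went out on the SPI */
    uint32_t irq_fired;   /* how many DMA interrupts were raised     */
} dma_model;

/* Models the SD card: every 0xFF sent yields the next sector byte. */
static uint8_t sd_exchange(const uint8_t *sector, uint32_t index, uint8_t out)
{
    (void)out;            /* the card ignores the dummy 0xFF */
    return sector[index];
}

/* Runs a whole transfer and returns the number of bytes stored. */
uint32_t dma_run(dma_model *d, const uint8_t *sector)
{
    while (d->current < d->count) {
        uint8_t in = sd_exchange(sector, d->current, 0xFF);
        d->ff_sent++;
        d->addr[d->current] = in;   /* store at addr + current */
        d->current++;
    }
    uint32_t stored = d->current;
    if (d->count != 0) {
        d->irq_fired++;             /* one interrupt per transfer */
        d->current = 0;
        d->count = 0;
    }
    return stored;
}
```

For a 512-byte sector, the model sends 0xFF 512 times but raises only one interrupt, which is the whole point of the DMA approach.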

How is the DMA controller implemented?

The DMA controller is implemented, as usual, in Verilog, in the CPU.v file. That file implements the CPU, the cache, and the DMA. It is a kind of SoC (System on a Chip).

The starting address for the transfer is set by writing it to the corresponding memory location (memory-mapped IO), 0x80000578. That places the address into the 32-bit dma_addr_1 register. The same goes for the number of bytes: writing that number to location 0x8000058C places it into the 32-bit dma_count_1 register. Writing 1 to address 0x800005BE initiates the DMA transfer by setting the 1-bit dma_start_rcv_1 register.
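To make the register map concrete, here is a small host-side sketch of how writes to those three addresses could be decoded. The base address and offsets are taken from the driver code above; dma_regs_model and io_write_model are hypothetical names that only mirror the Verilog registers and are not the actual decoder:

```c
#include <stdint.h>

#define IO_BASE           0x80000000u  /* start of memory-mapped IO   */
#define DMA_ADDR_1_OFF    0x578u       /* 32-bit start address        */
#define DMA_COUNT_1_OFF   0x58Cu       /* 32-bit byte count           */
#define DMA_START_1_OFF   0x5BEu       /* write 1 to start receiving  */

/* Absolute address of a DMA register, as seen by the CPU. */
static inline uint32_t dma_reg_addr(uint32_t offset)
{
    return IO_BASE + offset;
}

/* Hypothetical mirror of the three Verilog registers. */
typedef struct {
    uint32_t dma_addr_1;
    uint32_t dma_count_1;
    uint8_t  dma_start_rcv_1;   /* 1 bit in hardware */
} dma_regs_model;

/* Models a CPU write to the IO space; returns 1 if a DMA register matched. */
int io_write_model(dma_regs_model *r, uint32_t address, uint32_t value)
{
    if (address == dma_reg_addr(DMA_ADDR_1_OFF)) {
        r->dma_addr_1 = value;
        return 1;
    }
    if (address == dma_reg_addr(DMA_COUNT_1_OFF)) {
        r->dma_count_1 = value;
        return 1;
    }
    if (address == dma_reg_addr(DMA_START_1_OFF)) {
        r->dma_start_rcv_1 = (uint8_t)(value & 1u);
        return 1;
    }
    return 0;
}
```

The start flag is written last, after the address and the count, so that the controller never begins a transfer with stale parameters.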

So, at this point we have three registers filled with the data: starting address, the number of bytes and a flag that we need DMA to start working.

The next step happens at the beginning of each instruction fetch (the FETCH state of the CPU):

if (dma_start_rcv_1 && spi_ready && !spi_sent_ff_1 && dma_spi_received_1) begin
        spi_out <= 255; // send FF to initiate spi read
        spi_start <= 1;
        spi_sent_ff_1 <= 1;
        dma_spi_received_1 <= 0;
end

This code works as follows: if a DMA read has been initiated, the SPI port is ready for a transfer, we have not already sent 0xFF to the SPI (in order to "tell" the SD card that we want to receive the next byte of the sector), and the previous byte, if any, has already been handled (dma_spi_received_1 is set), then we send 0xFF to the SPI. We set the spi_out register to 0xFF and the spi_start register to 1, which sends 0xFF to the SPI (and to the SD card). The SD card in turn sends one byte of the sector back to the computer. That byte is received by the SPI controller, which triggers the SPI interrupt. BUT, this time, the CPU will not handle that interrupt! The DMA controller fetches the byte from the SPI, saves it to the given memory location, and increments the internal counter (the dma_current_1 register):

else if (irq_r[IRQ_SPI]) begin
        if (dma_count_1 && dma_start_rcv_1) begin
                // DMA will handle the byte received from the SPI
                ...
        end
        else begin
                // regular IRQ handling
                irq_r[IRQ_SPI] <= 0;
                pc <= 16'd56;
                addr <= 16'd28;
                irq_state <= 0;
                state <= FETCH;
                ir <= 0;
        end
end

How does the DMA controller handle the received byte? It saves the byte to the given address (dma_addr_1 + dma_current_1) and increments the counter (dma_current_1):

if (dma_current_1[0] == 0) begin
        // even address
        dma_byte_1 <= spi_in_r;
        dma_current_1 <= dma_current_1 + 1;

        irq_state <= 7;
end
else begin
        // odd address
        addr <= (dma_addr_1 + dma_current_1) >> 1;
        data_to_write <= (dma_byte_1 << 8) | spi_in_r;
        next_state <= CHECK_IRQ;
        state <= WRITE_DATA;

        dma_current_1 <= dma_current_1 + 1;

        irq_state <= 7;
end
...
7: begin
        // special case for DMA transfer
        irq_r[IRQ_SPI] <= 0;
        dma_spi_received_1 <= 1;
        not_received_counter <= 0;
        irq_state <= 0;
        state <= FETCH;
        ir <= 0;
end

Why the distinction between even and odd bytes? The data bus is 16 bits wide, so it makes sense to collect two consecutive bytes and write them to memory together. Writing a single byte would waste time, especially at odd addresses, since we would have to read the byte at the even address, combine it with the odd-address byte, and then write both back. Instead, the controller keeps each even-address byte in the dma_byte_1 register, and when the following odd-address byte arrives, it composes one 16-bit word from the two bytes (the even-address byte in the upper half) and writes it to memory.
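The even/odd logic can be sketched in C as follows. This is a model of the Verilog above and not code from the project: pack_words is a hypothetical name, and the sketch assumes the transfer starts at an even address (here, word index 0).

```c
#include <stddef.h>
#include <stdint.h>

/* Packs a byte stream into 16-bit words, mirroring the even/odd logic:
   the even-offset byte is held back, and the odd-offset byte completes
   the word with the held byte in the upper 8 bits.
   Returns the number of words written. */
size_t pack_words(const uint8_t *bytes, size_t count, uint16_t *words)
{
    uint8_t held = 0;
    size_t written = 0;

    for (size_t current = 0; current < count; current++) {
        if ((current & 1u) == 0) {
            held = bytes[current];            /* even offset: just remember it */
        } else {
            words[current >> 1] = (uint16_t)((held << 8) | bytes[current]);
            written++;                        /* odd offset: one memory write  */
        }
    }
    return written;   /* with an odd count, the last byte stays in 'held' */
}
```

For a 512-byte sector this gives 256 memory writes instead of 512. In this model, an odd byte count leaves the final byte unwritten, so the sketch is only meant for even transfer sizes such as a full sector.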

We can see in the code above that the DMA controller takes over all the handling of a byte received over SPI. When all the bytes have been received, the DMA controller triggers a new interrupt, IRQ_DMA_1:

if (dma_count_1 && (dma_current_1 == dma_count_1)) begin
        irq[IRQ_DMA_1] <= 1;
        dma_current_1 <= 0;
        dma_count_1 <= 0;
        dma_start_rcv_1 <= 0;
end

Conclusion

The main motivation for implementing DMA was curiosity: I wanted to see whether I could build it. I also expected the DMA implementation to run faster than the IRQ-driven one. The DMA-driven file transfer was at least two times faster than the IRQ-driven one, as can be seen in this clip.
