This is a follow-up to my original post.
    
The SPI interface is something of a standard when it comes to connecting
        various peripherals to a computer (or at least to a microcontroller). There is also the I2C interface, but I will
        focus on SPI in this post.
SPI stands for Serial Peripheral Interface. It is organized as
        master-slave communication. If we take our FPGA computer to be the master, then the peripheral is the
        slave.
It usually has four important pins:
1. MISO (Master In Slave Out) - a wire which is used to
        transport data from slave to the master device,
2. MOSI (Master Out Slave In) - a wire which is used to
        transport data from master to the slave device,
3. SCK - clock (all the data transport is synchronized using
        this clock line), and
4. SS (Slave Select) - when active, the slave is selected (sometimes it is called CS -
        chip select). With this wire, it is possible to connect several peripherals to the same three mentioned wires
        (MISO, MOSI and SCK) and to have separate SS wires to each peripheral.
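To make the roles of the four wires concrete, here is a minimal sketch in C that models one byte exchange in software. It is only an illustration: the function name spi_exchange_model is hypothetical, and it assumes the usual MSB-first bit order. The point it shows is that SPI is full duplex, so on every clock pulse one bit travels on MOSI and one bit travels on MISO at the same time.

```c
#include <stdint.h>

/* Hypothetical software model of one SPI byte exchange (MSB first).
 * master_tx: byte the master shifts out on MOSI
 * slave_tx:  byte the slave shifts out on MISO
 * slave_rx:  receives the byte the slave saw on MOSI
 * returns:   the byte the master saw on MISO */
uint8_t spi_exchange_model(uint8_t master_tx, uint8_t slave_tx, uint8_t *slave_rx)
{
    uint8_t master_rx = 0;
    uint8_t s_rx = 0;

    for (int bit = 7; bit >= 0; bit--) {
        /* Each side drives its own data line with the current bit. */
        uint8_t mosi = (uint8_t)((master_tx >> bit) & 1u);
        uint8_t miso = (uint8_t)((slave_tx >> bit) & 1u);

        /* On the sampling clock edge each side reads the other's line. */
        master_rx = (uint8_t)((master_rx << 1) | miso);
        s_rx = (uint8_t)((s_rx << 1) | mosi);
    }

    *slave_rx = s_rx;
    return master_rx;
}
```

Because a byte is always received while one is sent, a master that only wants to read still has to send a dummy byte.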
        Why did I choose to use SPI on my computer? First of all, SD cards have
            SPI built in. This means that every SD card is actually an SPI slave device. Next, I use the ENC28J60
            Ethernet module for Ethernet connectivity on my Arduino/ESP32/Raspberry Pi Zero devices. That module has
            an SPI interface, too.
        
        
        
How did I integrate SPI into my FPGA
        computer? I found a very nice implementation in Verilog here:
https://github.com/nandland/spi-master
BTW, its author
        has an excellent YouTube channel here: https://www.youtube.com/channel/UCsdA-aNqtMA1_2T15aXePWw
        Next, I had to integrate that module into my FPGA computer. I decided to
            allocate an interrupt for the incoming data from the SPI and to ignore the module-controlled SS pin (I will
            activate the SS signal manually from code, instead of leaving that job to the SPI module):
// ####################################
// SPI Master instance
// ####################################
wire spi_start;
wire [7:0] spi_in;
reg [7:0] spi_out;
wire spi_ready;
wire spi_received;
reg [7:0] spi_in_r;
reg fake_CS;
SPI_Master_With_Single_CS spi0 (
    .i_Clk(clk100),
    .i_Rst_L(KEY[0]),
    .i_TX_Count(1),
    .i_TX_DV(spi_start),
    .o_RX_Byte(spi_in),
    .i_TX_Byte(spi_out),
    .o_RX_DV(spi_received),
    .o_TX_Ready(spi_ready),

    .o_SPI_MOSI(gpio0[32]),
    .i_SPI_MISO(gpio0[30]),
    .o_SPI_Clk(gpio0[28]),
    .o_SPI_CS_n(fake_CS)
);
        The code above creates an SPI module instance named spi0 and connects it to a set of
            wires and registers. Next, in the main interrupt part, when the spi_received wire goes high (a byte has
            arrived on SPI), the received byte is latched into spi_in_r and the IRQ_SPI interrupt is triggered:
// ##################### IRQ3 - SPI Master #####################
if (spi_received) begin
    spi_in_r <= spi_in;
    // if we have received a byte from the MISO,
    // we will trigger the IRQ#3
    irq[IRQ_SPI] <= 1'b1;
end
else begin
    irq[IRQ_SPI] <= 1'b0;
end
In the CPU module, the IRQ_SPI interrupt
        causes the processor to jump to the predefined interrupt handler routine at address 56:
else if (irq_r[IRQ_SPI]) begin
    // SPI byte received
    pc <= 16'd56;
    addr <= 16'd28;
    irq_r[IRQ_SPI] <= 0;
end
        
        All you have to do is put some code at address 56 and return from the interrupt handler
            routine using the IRET assembly instruction:
        
        
            spi_irq_triggered:
                    push r0
                    ld.w    r0, [PORT_SPI_IN]   # PORT_SPI_IN.5_1, PORT_SPI_IN
                    ld.s    r0, [r0]    # _2, *PORT_SPI_IN.5_1
                    zex.s   r0, r0  # _3, _2
                    st.w    [received_byte], r0 # received_byte, _3
                    mov.w   r0, 1   # tmp29,
                    st.w    [received_from_slave], r0   # received_from_slave, tmp29
                    pop r0
                    iret
         
        
        Now that I have the C compiler, the SPI interrupt handler routine can be written in C:
        
            
                void init_spi()
                {
                    *SPI_HANDLER_INSTR  = 1;
                    *SPI_HANDLER_ADDR   = (int)&spi_irq_triggered;
                }
             
            
                
                    void spi_irq_triggered()
                    {
                        received_byte = *PORT_SPI_IN;
                        received_from_slave = 1;
                        asm 
                        (
                            "mov.w sp,r13\npop r13\niret"
                        );
                    }
                 
             
         
        
        In order to read the received byte and to send a byte to the SPI, we need
            to implement some IO operations. As usual, I have done that in both the direct and the memory-mapped way. Here is
            the direct way, using the IN and OUT assembly instructions:
        
             // OUT [xx], reg
             4'b0100: begin
                 `ifdef DEBUG
                 $display("%2x: OUT [%4d], r%-d", ir[3:0], data_r, (ir[15:12]));
                 `endif
                 case (mc_count)
                     0: begin
                         // get the xx
                         addr <= (pc + 2) >> 1;
                         pc <= pc + 2;
                         mc_count <= 1;
                         next_state <= EXECUTE;
                         state <= READ_DATA;
                     end
                     1: begin
                         mbr <= data_r;
                         mc_count <= 2;
                     end
                     2: begin
                         case (mbr)
                             ...
                             PORT_SPI_OUT: begin
                                 spi_out <= regs[ir[15:12]];
                                 spi_start <= 1'b1;
                             end
                             ...
                             default: begin
                             end
                         endcase  // end of case (data)
                         mc_count <= 3;
                     end
                     3: begin
                         tx_send <= 1'b0;
                         spi_start <= 1'b0;
                         spi_start1 <= 1'b0;
                         state <= CHECK_IRQ;
                         pc <= pc + 2;
                     end
                     default: begin
                     end
                 endcase
             end // end of OUT [xx], reg
         
        
        What happens above? The OUT instruction occupies four bytes in memory.
            The first two bytes are the opcode of the instruction, and the next two bytes hold the port number (limiting the
            total number of available ports to 65536, but I think that is enough).
        
        In the first cycle (step 0) of the OUT instruction, the CPU sets the read address
            to the two bytes that follow the two opcode bytes. Then the CPU waits for those two bytes to
            arrive (step 1).
        
        Then the CPU checks which IO port number has been read from memory. If the
            port number is PORT_SPI_OUT, it means that we are trying to send a byte to the SPI, so the CPU copies the
            register value to spi_out and raises spi_start (step 2). In step 3 the CPU lowers spi_start again and sets
            the next CPU state to be the IRQ check.
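The four steps can also be followed in a small software model. The C sketch below is only an illustration of the sequence described above, not the actual CPU: the struct, the function name exec_out_model and the port number PORT_SPI_OUT_MODEL are all hypothetical, and memory is assumed to be word-addressed because the Verilog computes the address as (pc + 2) >> 1.

```c
#include <stdint.h>

#define PORT_SPI_OUT_MODEL 0x0100u  /* hypothetical port number, for illustration only */

typedef struct {
    uint16_t pc;               /* byte address of the current instruction */
    uint16_t ir;               /* opcode word; bits 15:12 select the source register */
    uint16_t regs[16];
    uint8_t  spi_out;          /* byte handed to the SPI master */
    int      spi_start_pulses; /* how many times spi_start was pulsed */
} cpu_model_t;

/* Executes one OUT [xx], reg instruction against word-addressed memory. */
void exec_out_model(cpu_model_t *cpu, const uint16_t *mem)
{
    /* Steps 0 and 1: read the port number from the word after the opcode. */
    uint16_t port = mem[(cpu->pc + 2) >> 1];
    cpu->pc += 2;

    /* Step 2: dispatch on the port number. */
    if (port == PORT_SPI_OUT_MODEL) {
        /* spi_out is 8 bits wide, so only the low byte of the register is kept. */
        cpu->spi_out = (uint8_t)(cpu->regs[(cpu->ir >> 12) & 0xFu] & 0xFFu);
        cpu->spi_start_pulses++;
    }

    /* Step 3: spi_start is lowered again and pc moves past the port word. */
    cpu->pc += 2;
}
```

In this model, spi_start is represented by a counter, since the real signal is only high between step 2 and step 3.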
        
        And, here is the memory-mapped IO way:
        
             // Memory mapped IO
             case (addr & 32'h3FFFFFFF)
                 ...
                 PORT_SPI_OUT/2: begin
                     spi_out <= data_to_write;
                     spi_start <= 1'b1;
                 end
                 ...
             endcase
        
        Memory-mapped is a bit simpler, but does the same job of sending a byte to the SPI.
        
        OK, now that we have a working SPI interface, how can we use it to work with
            the SD card? I have put together Frankenstein-like code that merges the original Arduino SD card code (written in C++)
            with some other pieces of code from GitHub, so that I now have some elementary support for SD
            cards. For example:
        
        
            
                
                    
                        uint8_t sdcard_init(){
                          writeCRC_ = errorCode_ = inBlock_ = partialBlockRead_ = type_ = 0;
                          // 16-bit init start time allows over a minute
                          uint32_t t0 = (uint32_t)get_millis();
                          uint32_t arg;
                           // must supply min of 74 clock cycles with CS high.
                        
                          for (uint8_t i = 0; i < 10; i++) spiSend(0XFF);
                        
                          chipSelectLow();
                          // command to go idle in SPI mode
                          while ((status_ = cardCommand(CMD0, 0)) != R1_IDLE_STATE) {
                            if (((uint32_t)get_millis() - t0) > SD_INIT_TIMEOUT) {
                              error(SD_CARD_ERROR_CMD0);
                              goto fail;
                            }
                          }
                         
                          // check SD version
                          if ((cardCommand(CMD8, 0x1AA) & R1_ILLEGAL_COMMAND)) {
                            type(SD_CARD_TYPE_SD1);
                          } else {
                            // only need last byte of r7 response
                            for (uint8_t i = 0; i < 4; i++) status_ = spiRec();
                            if (status_ != 0XAA) {
                              error(SD_CARD_ERROR_CMD8);
                              goto fail;
                            }
                            type(SD_CARD_TYPE_SD2);
                          }
                     
                 
                  ...}
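The CMD8 step above decides between an older (version 1) and a newer (version 2) card. As a rough sketch of that decision in isolation, the C function below takes the R1 status byte and the four trailing bytes of the R7 response. The enum and the function name classify_cmd8 are hypothetical; the bit positions (bit 0 for idle state, bit 2 for illegal command) follow the R1 response format used in SD SPI mode, and the 0xAA check pattern is the low byte of the 0x1AA argument sent with CMD8.

```c
#include <stdint.h>

#define R1_IDLE_STATE_BIT      0x01u  /* R1 bit 0: card is in idle state */
#define R1_ILLEGAL_COMMAND_BIT 0x04u  /* R1 bit 2: command not recognized */

/* Hypothetical result type for this sketch. */
typedef enum { CARD_V1, CARD_V2, CARD_ERROR } card_version_t;

/* r1:      status byte returned for CMD8
 * r7_tail: the four bytes that follow r1 (only read for version 2 cards) */
card_version_t classify_cmd8(uint8_t r1, const uint8_t r7_tail[4])
{
    if (r1 & R1_ILLEGAL_COMMAND_BIT) {
        /* Old cards do not know CMD8 at all. */
        return CARD_V1;
    }
    /* A version 2 card echoes the check pattern in the last byte. */
    return (r7_tail[3] == 0xAA) ? CARD_V2 : CARD_ERROR;
}
```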
             
         
        
        In the code above, we see that there are some spi-related functions, like spiSend() or
            spiRec(). Here are those:
        
        
            
                void spiSend(int b)
                {
                    received_from_slave = 0;
                    unsigned short int busy;

                    // wait until the SPI module is ready to accept a byte
                    do
                    {
                        busy = *PORT_SPI_OUT_BUSY;
                    } while (busy);
                    *PORT_SPI_OUT = b; //send the byte to the SPI

                    // wait until the byte has been shifted out
                    do
                    {
                        busy = *PORT_SPI_OUT_BUSY;
                    } while (busy);
                }
                
                    
                        
                        uint8_t spiRec(void) {
                            // send a dummy byte to clock the slave, then read what came back
                            spiSend(0xFF);
                            return read_spi();
                        }
                     
                 
             
         
        
            
                
                    
                        int read_spi()
                        
                        {
                            while (!received_from_slave || *PORT_SPI_OUT_BUSY) 
                            {
                            }
                            return received_byte;
                        }
                     
                 
             
         
        
        Now, when we look at the spi_irq_triggered() function, we see that
            whenever that interrupt routine is triggered by the incoming byte from the SPI, that byte is stored in the
            received_byte variable. That byte is returned from the read_spi() function to the
            spiRec() function, and from that to the caller function.
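The handshake between the interrupt routine and the polling code can be tried out on a desktop machine with a small model. The sketch below is an assumption-laden illustration, not the code running on the FPGA computer: all names are hypothetical, the "slave" is a stub that answers with the previously sent byte plus one, and the "interrupt" is simply called at the end of the send. Declaring the shared variables volatile is my addition here; it is the usual way to tell a C compiler that a variable polled in a loop can change behind its back.

```c
#include <stdint.h>

/* Shared between the "interrupt" and the main code. */
static volatile uint8_t model_rx_byte;
static volatile int     model_rx_ready;

/* Hypothetical slave stub: the byte it will shift out on the next transfer. */
static uint8_t fake_slave_reply = 0xFF;

/* Stands in for the SPI interrupt handler. */
static void model_spi_isr(uint8_t from_miso)
{
    model_rx_byte = from_miso;
    model_rx_ready = 1;
}

/* Stands in for spiSend(): clear the flag, then start the transfer. */
void model_send(uint8_t b)
{
    model_rx_ready = 0;
    /* On real hardware the transfer runs in the background and the
     * interrupt fires when it completes; here it completes at once. */
    uint8_t reply = fake_slave_reply;
    fake_slave_reply = (uint8_t)(b + 1);
    model_spi_isr(reply);
}

/* Stands in for spiRec(): clock out a dummy byte and wait for the flag. */
uint8_t model_rec(void)
{
    model_send(0xFF);
    while (!model_rx_ready) {
        /* busy-wait until the interrupt has stored the byte */
    }
    return model_rx_byte;
}
```

Clearing the flag before the transfer starts is what keeps a stale byte from an earlier transfer from being returned.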
        
        OK, what next? How is this used? All of the interaction with the SD card is done by sending card commands
            and reading and writing 512 bytes of data, in so-called blocks:
        
            
                
                    
                        uint8_t cardCommand(uint8_t cmd, uint32_t arg) {
                        
                          // end read if in partialBlockRead mode
                          readEnd();
                          // select card
                          chipSelectLow();
                          // wait up to 300 ms if busy
                          waitNotBusy(300);
                          // send command
                          spiSend(cmd | 0x40);
                          // send argument
                          for (int8_t s = 24; s >= 0; s -= 8) spiSend(arg >> s);
                          // send CRC
                          uint8_t crc = 0XFF;
                          if (cmd == CMD0) crc = 0X95;  // correct crc for CMD0 with arg 0
                          if (cmd == CMD8) crc = 0X87;  // correct crc for CMD8 with arg 0X1AA
                          spiSend(crc);
                          // wait for response
                          for (uint8_t i = 0; ((status_ = spiRec()) & 0X80) && i != 0XFF; i++);
                          return status_;
                        }
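The command frame that cardCommand() sends is always six bytes: the command index with bit 6 set, the 32-bit argument with the most significant byte first, and a final byte that carries a 7-bit CRC followed by a stop bit. The sketch below builds such a frame and computes the CRC instead of hard-coding it. The function names are hypothetical; the CRC uses the generator polynomial x^7 + x^3 + 1, and it reproduces the two constants from the code above (0x95 for CMD0 and 0x87 for CMD8 with argument 0x1AA).

```c
#include <stdint.h>
#include <stddef.h>

/* CRC-7 over len bytes, returned already shifted left with the stop bit set,
 * i.e. in the form of the last byte of an SD command frame. */
uint8_t sd_crc7_byte(const uint8_t *data, size_t len)
{
    uint8_t crc = 0;

    for (size_t i = 0; i < len; i++) {
        uint8_t d = data[i];
        for (int bit = 0; bit < 8; bit++) {
            crc = (uint8_t)(crc << 1);
            /* XOR the incoming data bit with the bit leaving the 7-bit register. */
            if ((d & 0x80u) ^ (crc & 0x80u)) {
                crc ^= 0x09u;
            }
            d = (uint8_t)(d << 1);
        }
    }
    return (uint8_t)((crc << 1) | 1u);
}

/* Fills frame[0..5] for the given command index and argument. */
void sd_build_frame(uint8_t cmd, uint32_t arg, uint8_t frame[6])
{
    frame[0] = (uint8_t)(cmd | 0x40u);
    for (int i = 0; i < 4; i++) {
        frame[1 + i] = (uint8_t)(arg >> (24 - 8 * i));
    }
    frame[5] = sd_crc7_byte(frame, 5);
}
```

In SPI mode the card checks the CRC only for the first commands unless CRC checking is switched on, which is why the code above can send 0xFF as the CRC for every other command.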
                        
                     
                 
             
         
        
            
                
                    
                        uint8_t readData(uint32_t block,
                        
                                uint16_t offset, uint16_t count, uint8_t* dst) {
                        
                          uint16_t n;
                          if (count == 0) return true;
                          if ((count + offset) > 512) {
                            goto fail;
                          }
                          #ifdef FAT_DEBUG
                          printf("block: %d, offset: %d, count: %d\n", block, offset, count);
                          #endif
                        
                          if (!inBlock_ || block != block_ || offset < offset_) {
                        
                            block_ = block;
                            // use address if not SDHC card
                            if (get_type()!= SD_CARD_TYPE_SDHC) block <<= 9;
                            if (cardCommand(CMD17, block)) {
                              error(SD_CARD_ERROR_CMD17);
                              goto fail;
                            }
                            if (!waitStartBlock()) {
                              goto fail;
                            }
                            offset_ = 0;
                            inBlock_ = 1;
                          }
                          // skip data before offset
                          for (;offset_ < offset; offset_++) {
                            spiRec();
                          }
                          // transfer data
                          for (uint16_t i = 0; i < count; i++) {
                            dst[i] = spiRec();
                        //    printf("%x ", dst[i]);
                          }
                          offset_ += count;
                          if (!partialBlockRead_ || offset_ >= 512) {
                            // read rest of data, checksum and set chip select high
                        
                            readEnd();
                          }
                          return true;
                        
                         fail:
                          chipSelectHigh();
                          #if FAT_DEBUG
                          printf("read data error code: %d\n", errorCode_);
                        
                          #endif
                          return false;
                        
                        }
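Two small details of readData() are easy to miss: standard-capacity cards expect a byte address in CMD17 while SDHC cards expect a block number, and a partial read must not run past the end of the 512-byte block. A minimal sketch of both checks, with hypothetical helper names, could look like this:

```c
#include <stdint.h>

#define SD_BLOCK_SIZE 512u

/* Argument for a read or write command.
 * block_addressed is nonzero for SDHC-style cards, which take the block
 * number as is; older cards take a byte address (block * 512). */
uint32_t sd_command_arg(uint32_t block, int block_addressed)
{
    return block_addressed ? block : (block << 9);
}

/* Returns nonzero when [offset, offset + count) fits inside one block. */
int sd_range_in_block(uint16_t offset, uint16_t count)
{
    return ((uint32_t)offset + (uint32_t)count) <= SD_BLOCK_SIZE;
}
```

The sum is computed in 32 bits so that large 16-bit arguments cannot wrap around on a platform with 16-bit integers.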
                     
                 
                
                    
                        
                            
                                uint8_t writeData(uint8_t token, const uint8_t* src) {
                                  spiSend(token);
                                  for (uint16_t i = 0; i < 512; i++) {
                                    spiSend(src[i]);
                                  }
                                  spiSend(0xff);  // dummy crc
                                  spiSend(0xff);  // dummy crc
                                  status_ = spiRec();
                                  if ((status_ & DATA_RES_MASK) != DATA_RES_ACCEPTED) {
                                
                                    error(SD_CARD_ERROR_WRITE);
                                    chipSelectHigh();
                                    return false;
                                  }
                                  return true;
                                }
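The byte that writeData() reads after the two CRC bytes is the card's data response token. Only its low five bits carry information: a 0 bit, a 3-bit status and a 1 bit. The sketch below decodes the three status values defined for SPI mode. The enum and function name are hypothetical, and I am assuming DATA_RES_MASK is 0x1F and DATA_RES_ACCEPTED is 0x05, as in the Arduino SD library that the code was adapted from.

```c
#include <stdint.h>

/* Hypothetical result type for this sketch. */
typedef enum {
    WRITE_ACCEPTED,
    WRITE_CRC_ERROR,
    WRITE_ERROR,
    WRITE_UNKNOWN
} write_result_t;

/* Decodes the data response token sent by the card after a data block. */
write_result_t decode_data_response(uint8_t token)
{
    switch (token & 0x1Fu) {
    case 0x05: return WRITE_ACCEPTED;   /* status 010: data accepted */
    case 0x0B: return WRITE_CRC_ERROR;  /* status 101: rejected, CRC error */
    case 0x0D: return WRITE_ERROR;      /* status 110: rejected, write error */
    default:   return WRITE_UNKNOWN;
    }
}
```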
                             
                         
                        
                            
                                uint8_t writeBlock(uint32_t blockNumber, const uint8_t* src, uint8_t blocking) {
                                  #if FAT_DEBUG
                                  printf("Write block number: %d\n", blockNumber);
                                  #endif
                                
                                //  return true;
                                  // don't allow write to first block
                                  if (blockNumber == 0) {
                                    error(SD_CARD_ERROR_WRITE_BLOCK_ZERO);
                                
                                    goto fail;
                                  }
                                  // use address if not SDHC card
                                  if (get_type() != SD_CARD_TYPE_SDHC) {
                                    blockNumber <<= 9;
                                  }
                                  if (cardCommand(CMD24, blockNumber)) {
                                    error(SD_CARD_ERROR_CMD24);
                                    goto fail;
                                  }
                                  if (!writeData(DATA_START_BLOCK, src)) {
                                    goto fail;
                                  }
                                  if (blocking) {
                                    // wait for flash programming to complete
                                
                                    if (!waitNotBusy(SD_WRITE_TIMEOUT)) {
                                      error(SD_CARD_ERROR_WRITE_TIMEOUT);
                                
                                      goto fail;
                                    }
                                    // response is r2 so get and check two bytes for nonzero
                                
                                    if (cardCommand(CMD13, 0) || spiRec()) {
                                      error(SD_CARD_ERROR_WRITE_PROGRAMMING);
                                
                                      goto fail;
                                    }
                                  }
                                  chipSelectHigh();
                                  return true;
                                fail:
                                  chipSelectHigh();
                                  return false;
                                }
                             
                         
                     
                 
             
         
        
        Now that we are able to read and write 512-byte blocks, we need to figure out
            how the data is organized on SD cards. Well, the format is FAT32. It is an old format from Microsoft,
            but it is quite simple and is supported almost everywhere.
        
        
        
        So, if we want, for example, to list all files in the root folder, here is the code:
        
            
                
                    file_descriptor_t fd;
                    int next = 0;
                    while ((next = getDirEntry(&fd, next)) != 0)
                    {
                        printf("%s %d bytes, cluster: %d (%d)\n", fd.dir_entry.filename, fd.dir_entry.filesize, fd.curr_cluster, fd.dir_entry.first_cluster);
                    }
                 
             
         
        
        The key code is in the getDirEntry() function:
        
            
                
                    
                        uint32_t getDirEntry(file_descriptor_t* fd, uint32_t index)
                        {
                          int i,j;
                          uint32_t cluster;  // 32 bits, so the high cluster word at offset 0x14 is not lost
                          uint32_t file_size;
                          uint8_t b;
                          uint8_t *buf = g_block_buf;
                          char filename_upper[12];
                        
                          uint32_t counter = 0;
                          for (i = 0; i < (dataStartBlock_ - rootDirStart_); i++)
                          {
                            b = readBlock(rootDirStart_ + i, g_block_buf);
                        
                            for(j = 0; j < 16; j++)
                        
                            {
                              if (*(buf + j*32)==0 || *(buf + j*32)==0x2e || *(buf + j*32)==0xe5 || *(buf + j*32 + 0x0b) == 0xf)
                              {
                                // skip free or deleted entries, "." and ".." entries, and long-name entries
                                continue;
                              }
                              
                              if(counter == index)
                              {
                                file_size = *(buf + j*32 + 0x1c);
                                file_size += *(buf + j*32 + 0x1c + 1)<<8;
                                file_size += *(buf + j*32 + 0x1c + 2)<<16;
                                file_size += *(buf + j*32 + 0x1c + 3)<<24;
                                cluster = *(buf + j*32 + 0x1a);
                                cluster += *(buf + j*32 + 0x1a + 1) << 8;
                                cluster += *(buf + j*32 + 0x14 + 0) << 16;
                        
                                cluster += *(buf + j*32 + 0x14 + 1) << 24;
                        
                                strncpy(filename_upper, (char*)(buf+j*32), 11);
                                filename_upper[11] = '\0';
                                // fill in dir_entry
                                memmove(fd->dir_entry.filename, filename_upper, 12);
                                fd->dir_entry.attributes = *(buf + j*32 + 0x0b);
                                memmove(fd->dir_entry.unused_attr, buf + j*32 + 0x0c, 14);
                                fd->dir_entry.filesize = file_size;
                                fd->dir_entry.block = rootDirStart_ + i;
                                fd->dir_entry.slot = j;
                        
                                fd->dir_entry.first_cluster = cluster;
                                fd->curr_cluster = cluster;
                                return counter + 1;
                              } else if (counter > index) {
                                return 0;
                        
                              }
                              counter++;
                            }
                          }
                          return 0;
                        }
                     
                 
             
         
        
        The code above loads chunks of 512 bytes from the root directory start block,
            and then tries to iterate through the directory structure until it finds the right entry, given by its
            index. The directory structure is this:
        
            
                
                    typedef struct
                    {
                      char filename[12];  /** The file's name and extension, total 11 chars padded with spaces. */
                    
                      uint8_t attributes;  /** The file's attributes. Mask of the FAT_ATTRIB_* constants. */
                    
                      uint8_t unused_attr[14]; /** Attributes in directory which are unused or unsupported */
                    
                      uint16_t first_cluster;     /** The cluster in which the file's first byte resides. */
                    
                      uint32_t filesize;   /** The file's size. */
                      uint32_t block; /** The number of a block from the rootDirStart_ where this entry resides. */
                    
                      uint32_t slot; /** The number of the slot in the block where this entry resides. Each slot is 32 bytes large. */
                    
                    } dir_entry_t;
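Each 32-byte slot in a directory block has to be classified before it can be used. The sketch below pulls the skip test from getDirEntry() into a function of its own. The enum and function name are hypothetical. The marker values are the standard FAT ones: a first byte of 0x00 means the slot is free and no used entries follow, 0xE5 marks a deleted entry, 0x2E ('.') starts the "." and ".." entries, and an attribute byte of 0x0F marks a long-file-name entry.

```c
#include <stdint.h>

/* Hypothetical classification for this sketch. */
typedef enum {
    ENTRY_END,        /* first byte 0x00: free, nothing used after it */
    ENTRY_DELETED,    /* first byte 0xE5 */
    ENTRY_LONG_NAME,  /* attribute byte 0x0F */
    ENTRY_DOT,        /* "." or ".." */
    ENTRY_NORMAL      /* a regular file or folder entry */
} entry_kind_t;

/* entry points at one 32-byte directory slot. */
entry_kind_t classify_dir_entry(const uint8_t entry[32])
{
    if (entry[0] == 0x00) return ENTRY_END;
    if (entry[0] == 0xE5) return ENTRY_DELETED;
    if (entry[0x0B] == 0x0F) return ENTRY_LONG_NAME;
    if (entry[0] == 0x2E) return ENTRY_DOT;
    return ENTRY_NORMAL;
}
```

A directory scan could stop at the first ENTRY_END instead of reading the remaining blocks, which is one possible optimization over treating it like any other skipped entry.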
                 
             
         
        
        Since my FPGA computer is big endian, while FAT stores multi-byte fields in little-endian order, I couldn't just
            read the file size and cluster number as whole words. Instead, I had to assemble those numbers byte by byte.
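Assembling the fields byte by byte has the nice property that it works on any CPU, whatever its endianness. Here is a minimal sketch of that idea, with hypothetical helper names. The offsets are the ones used in getDirEntry(): the file size at 0x1C, the low word of the first cluster at 0x1A and the high word at 0x14.

```c
#include <stdint.h>

/* Read a 16-bit little-endian value from a byte buffer. */
static uint16_t rd_le16(const uint8_t *p)
{
    return (uint16_t)((uint16_t)p[0] | ((uint16_t)p[1] << 8));
}

/* Read a 32-bit little-endian value from a byte buffer. */
static uint32_t rd_le32(const uint8_t *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* File size field of a 32-byte directory entry. */
uint32_t entry_file_size(const uint8_t entry[32])
{
    return rd_le32(entry + 0x1C);
}

/* First cluster, combined from its high and low 16-bit halves. */
uint32_t entry_first_cluster(const uint8_t entry[32])
{
    return ((uint32_t)rd_le16(entry + 0x14) << 16) | rd_le16(entry + 0x1A);
}
```

Casting each byte to a 32-bit type before shifting matters on CPUs where int is only 16 bits wide, because otherwise the shifts by 16 and 24 would overflow.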
        Conclusion
        The initial implementation of the SPI was simple enough. What matters is what you can do with it. I was able
            to use the SPI to integrate an SD card into my FPGA computer. That way, I no longer need the Arduino/ESP32
            to play the role of SD card reader, as I used to.