Snakes!

The first game on my FPGA platform

This is a follow-up of the FPGA computer post. 

I have decided to make a game for the FPGA Computer. My friend gave me his Pascal implementation of the Snakes game and I have ported it into the FPGA Assembler. Not an easy task, but it works now:


The game was made to work in the video text mode. The frame is constructed using '-', '+', and '|' characters. The head of the snake is '@', the body is 'O', the star is '*', and when the snake hits the wall or its tail, we get the 'X' character at the crash scene.

In the emulator it works the same way:

The game is placed on the github.

Milliseconds counter register

During the game development, it occurred to me that I needed a milliseconds counter register in order to implement the delay function. In the cpu.v file, I have created two additional registers:

reg [N-1:0] millis_counter;

reg [15:0] clock_counter;


The millis_counter register will hold the number of milliseconds counted so far. Incrementing this register is done whenever clock_counter reaches 50000 (clock is 50MHz, which means that when the clock_counter reaches 50000, one millisecond has elapsed):

always @ (posedge CLOCK_50) begin
if (clock_counter < 50000) begin
clock_counter <= clock_counter + 1'b1;
end
else begin
clock_counter <= 0;
millis_counter <= millis_counter + 1'b1;
end
...
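The divider arithmetic can be checked with a small Python model. This is only a sketch and not part of the project: it replays the if/else above one clock edge at a time and counts the edges between two increments of millis_counter. The function name is made up for the illustration. Because the comparison is "< 50000", the counter spends one extra cycle at the value 50000, so the period is 50001 cycles. That is an error of about 20 parts per million, which does not matter for a game delay.

```python
def cycles_per_millisecond(limit):
    """Count clock edges between two millis_counter increments."""
    clock_counter = 0
    cycles = 0
    while True:
        cycles += 1                  # one rising edge of CLOCK_50
        if clock_counter < limit:
            clock_counter += 1       # same as clock_counter <= clock_counter + 1'b1
        else:
            return cycles            # this edge resets clock_counter and bumps millis_counter


print(cycles_per_millisecond(50000))   # comparison as written above: 50001
print(cycles_per_millisecond(49999))   # hypothetical variant: exactly 50000
```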

To read the millis_counter register, I have introduced another port address for the IN instruction:

4'b0011: begin
  // IN reg, [xx]
  `ifdef DEBUG
  $display("%2x: IN r%-d, [%4d]",ir[3:0], (ir[11:8]), data);
  `endif
  case (mc_count)
0: begin
mbr <= data;  // remember the address of IO port
mc_count <= 1;
pc <= pc + 2'd2;  // move to the next instruction
end
1: begin
    case (mbr)
64: begin    // UART RX DATA
regs[ir[11:8]] <= {8'b0, rx_data_r};
end
65: begin   // UART TX BUSY
regs[ir[11:8]] <= tx_busy;
end
68: begin    // keyboard data
regs[ir[11:8]] <= {8'b0, ps2_data_r};
end
69: begin // milliseconds counted so far
regs[ir[11:8]] <= millis_counter;
end
  endcase // end of case(mbr)
  ir <= 0;      // initiate fetch
  addr <= pc >> 1;
end
default: begin
end
  endcase  // end of case (mc_count)
end // end of IN reg, [xx]

The example of the usage can be found in the snakes.asm file:

; ################################################################
; function delay(r0)
; waits for the r0 milliseconds
; ################################################################
delay:
push r1
push r2
delay_loop2:
in r1, [PORT_MILLIS] ; port 69
delay_loop1:
in r2, [PORT_MILLIS] ; port 69
sub r2, r1
jz delay_loop1 ; one millisecond elapsed here
dec r0
jnz delay_loop2
pop r2
pop r1
ret
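One nice property of this loop is that it only tests whether the counter has changed, so it keeps working when the counter wraps around. Here is a rough Python model of the same logic. FakeMillisPort is a hypothetical stand-in for port 69, and the 16-bit counter width is an assumption (the post only says N bits).

```python
class FakeMillisPort:
    """Hypothetical stand-in for port 69: the counter ticks once every few reads."""

    def __init__(self, start, reads_per_ms):
        self.value = start
        self.reads_per_ms = reads_per_ms
        self.reads = 0

    def read(self):
        self.reads += 1
        if self.reads % self.reads_per_ms == 0:
            self.value = (self.value + 1) & 0xFFFF   # assumed 16-bit wrap-around
        return self.value


def delay(port, ms):
    """Model of the delay routine: wait for ms changes of the counter."""
    while ms != 0:
        r1 = port.read()                           # in r1, [PORT_MILLIS]
        while ((port.read() - r1) & 0xFFFF) == 0:  # in r2 / sub r2, r1 / jz
            pass
        ms -= 1                                    # dec r0 / jnz delay_loop2


port = FakeMillisPort(start=0xFFFE, reads_per_ms=3)
delay(port, 5)
print(port.value)   # the counter wrapped past 0xFFFF and the loop still finished
```

Note that the model returns immediately for ms = 0, while the assembler routine decrements first, so calling it with r0 = 0 would wait much longer.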

Conclusion

This was the first game made for the FPGA Computer. The game is quite simple and uses video text mode. I had to implement a milliseconds counter register and the corresponding IN port to read it. It was used to implement the delay function. 




PS/2 keyboard and FPGA Computer

Added PS/2 keyboard to the FPGA Computer

This is a follow-up of the FPGA computer post. 

I have added a keyboard port to the FPGA Computer. The port is PS/2 because it is easier to work with the PS/2 than with the USB HID protocol. The final look is here (you will recognize the purple PS/2 keyboard connector):

The hardware part of this project is simple - add four resistors and a PS/2 connector:
Now the board has three connectors: PS/2, VGA and UART.

PS/2 connector is connected to the GPIO ports of the DE0-NANO board:
- Data is connected to the GPIO31 (PIN_D11) port
- Clock is connected to the GPIO33 (PIN_B12) port.

The communication between the keyboard and the computer is clocked serial. Clock pulses appear on the Clock pin, while data is on the Data pin, synchronized with the Clock on the falling edge. Each byte is sent as a frame of one start bit, eight data bits, one parity bit and one stop bit. Here are oscilloscope snapshots of the "A" key being pressed (and then released):

The waveform below is the make code of the "A" key (1C hex)


The waveform below is the first byte of the "A" break code (F0 hex)

The waveform below is the second byte of the "A" break code (1C hex)
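To make the frame format concrete, here is a minimal Python sketch that decodes one frame sampled on the falling clock edges. It is not the ps2_read Verilog module from the project, and the function names are made up for the illustration. It relies on the standard PS/2 framing: start bit 0, eight data bits with the least significant bit first, an odd parity bit, and stop bit 1.

```python
def encode_ps2_frame(byte):
    """Build the 11 bits the keyboard sends for one byte, in arrival order."""
    data_bits = [(byte >> i) & 1 for i in range(8)]    # LSB first
    parity = 0 if sum(data_bits) % 2 == 1 else 1       # odd parity
    return [0] + data_bits + [parity, 1]               # start, data, parity, stop


def decode_ps2_frame(bits):
    """Decode one 11-bit frame; raise ValueError if the framing is wrong."""
    if len(bits) != 11:
        raise ValueError("a PS/2 frame has 11 bits")
    start, data_bits, parity, stop = bits[0], bits[1:9], bits[9], bits[10]
    if start != 0 or stop != 1:
        raise ValueError("bad start or stop bit")
    if (sum(data_bits) + parity) % 2 != 1:
        raise ValueError("parity error")
    return sum(bit << i for i, bit in enumerate(data_bits))


frame = encode_ps2_frame(0x1C)          # make code of the "A" key
print(frame)                            # [0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1]
print(hex(decode_ps2_frame(frame)))     # 0x1c
```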

Keyboards work by sending the make and the break codes for each key. Make code is sent when the key is pressed, while the break code is sent when the key is released. For example, when we press and then release the "A" key, we get the following sequence:
1C F0 1C
This could be interpreted as: A pressed (1C), A released (F0 1C)

Unfortunately, it is not all that simple. First of all, if you quickly press A and S, one after another, you will get the following sequence:
1C 1B F0 1B F0 1C
This could be interpreted as: make code of "A", make code of "S", break code of "S" and break code of "A".

When you press Shift + A, you will get the following sequence:
12 1C F0 1C F0 12
Shift pressed, A pressed, A released, Shift released

When you press A for a long time (autorepeat will occur):
1C 1C 1C 1C 1C F0 1C
A pressed, A pressed, A pressed, A pressed, A pressed, A released (F0 1C)

To make things more complicated, extended key codes (both make and break) have been introduced, for some keys. For example, the Cursor Down (Arrow Down) key produces the following sequence:
E0 72 E0 F0 72
Cursor down pressed (E0 72), Cursor down released (E0 F0 72).

And so on... All this makes parsing a bit complicated, but eventually you will be able to figure it out.
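The core of the parsing can be sketched in a few lines of Python. This is only an illustration of the idea, not the assembler code from the project: it keeps two flags (extended prefix E0 seen, break prefix F0 seen) and emits one event for every key code byte. The function name and the event tuples are made up, and special cases such as the Pause key are not handled.

```python
def parse_scancodes(stream):
    """Turn raw PS/2 bytes into (event, code, extended) tuples."""
    events = []
    extended = False   # E0 prefix seen
    released = False   # F0 prefix seen
    for byte in stream:
        if byte == 0xE0:
            extended = True
        elif byte == 0xF0:
            released = True
        else:
            events.append(("release" if released else "press", byte, extended))
            extended = False
            released = False
    return events


# Shift + A, then Cursor Down
for event in parse_scancodes([0x12, 0x1C, 0xF0, 0x1C, 0xF0, 0x12,
                              0xE0, 0x72, 0xE0, 0xF0, 0x72]):
    print(event)
```

Autorepeat simply shows up as several "press" events in a row for the same code, so the caller can decide whether to ignore them.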

The next step was to add the support for the keyboard within the FPGA Computer.

Introducing the keyboard interrupt

I have introduced a new interrupt for the keyboard - the IRQ#2. This IRQ is triggered when a byte from PS/2 keyboard arrives. The CPU then jumps to the address of 24 decimal, where the raw PS/2 keyboard handling routine should be. Actually, at that address should be one JUMP instruction which will jump to the handling routine.

In the main computer module, I have instantiated the PS/2 module:
// ####################################
// PS/2 keyboard instance
// ####################################
wire [7:0] ps2_data;
wire ps2_received;
reg [7:0] ps2_data_r;

ps2_read ps2(
  CLOCK_50,
  reset,
  gpio0[31], // Input pin - PS/2 data line
  gpio0[33], // Input pin - PS/2 clock line
  ps2_data,  // here we will receive a character
  ps2_received  // if something came from PS/2, this goes high
); 

Then I have detected the byte being received from the PS/2 module and triggered the IRQ:



always @ (posedge CLOCK_50) begin
// ######### IRQ2 - keyboard ######
if (ps2_received) begin
ps2_data_r <= ps2_data;
// if we have received a byte from 
// the keyboard, we will trigger the IRQ#2
irq[2] <= 1'b1;
end 
...



In the cpu.v module, I have added support for the new interrupt:
if (irq_r[2]) begin
`ifdef DEBUG
LED[7] <= 1;
$display("3.1 JUMP TO IRQ #2 SERVICE");
`endif
pc <= 16'd24;
addr <= 16'd12;
end

So, to receive bytes from the PS/2 keyboard, a programmer must register the IRQ#2 handler:
; set the IRQ handler for keyboard to our own IRQ handler
mov r0, 1 ; JUMP instruction opcode
mov r1, IRQ2_ADDR ; IRQ#2 vector address
st [r1], r0
mov r0, irq_triggered
mov r1, IRQ2_ADDR + 2   
st [r1], r0

Since this is raw PS/2 handling, the programmer must write the complete make/break code handling. I have done that in this example.

Unfortunately, the code is quite long since it has to deal with the raw PS/2 protocol. The code demonstrates parsing the raw PS/2 protocol and it looks like those vintage screen editors:

How to use the keyboard? First of all, two callbacks should be registered - one for the key pressed, and the other one for the key released:
mov r0, 1 ; JUMP instruction opcode
mov r1, KEY_PRESSED_HANDLER_ADDR
st [r1], r0
mov r0, pressed ; key pressed routine address
mov r1, KEY_PRESSED_HANDLER_ADDR + 2
st [r1], r0

mov r0, 1 ; JUMP instruction opcode
mov r1, KEY_RELEASED_HANDLER_ADDR
st [r1], r0
mov r0, released ; key released routine address
mov r1, KEY_RELEASED_HANDLER_ADDR + 2
st [r1], r0

Both callbacks will then need to obtain the virtual key code of the key pressed (or released) by reading from the location 48 (VIRTUAL_KEY_ADDR):

pressed:
ld r0, [VIRTUAL_KEY_ADDR]
cmp r0, VK_F1
...

released:
ld r1, [VIRTUAL_KEY_ADDR]
...

What is the Virtual Key Code? It is a number assigned to each key, so all the programs would get the same number when a key is pressed, or released. In the code above, VK_F1 is the constant assigned to the F1 key, so the programmer can determine if the F1 was pressed by writing cmp r0, VK_F1.

Then, if needed, the programmer can call the vk_to_char function, which translates a virtual key to the actual character, if possible (not all keys produce characters; the F1 key does not produce a character, for example):

; ###############################
; r1 = function vk_to_char(r1)
; translates virtual key to character
; if shift is pressed, does the uppercase
; ###############################
vk_to_char:
push r0
push r2
...

Conclusion

Most examples for keyboard support on the net use PS/2 keyboards, since the USB HID protocol is quite complex and PS/2 isn't. I went down the same path. I have a couple of spare keyboards, some of them PS/2, so I soldered the PS/2 female connector and those four resistors from the schematics above. From that point on, everything was programming - a little bit of Verilog programming, and much more assembler programming.

UART Loader

FPGA Computer UART Loader

This is a follow-up of the FPGA computer post. 

I have developed the UART Loader for the FPGA Computer to be able to send programs to it. It is based on the UART module developed in Verilog, for the FPGA Computer. This module provides both sending and receiving bytes, using 115200 bauds, 8 bits, 1 start, 1 stop bit, no parity. The serial port of the FPGA computer is connected to the TTL SerialToUSB dongle, which is then connected to the USB port of the computer:

When I initially created the FPGA Computer, I was able to store just one program in it, by hardcoding it in the RAM memory. Here is the part of the RAM.v Verilog module that includes the program in the RAM:

// Declare the RAM variable
reg [N-1:0] ram[32767:0];

initial
begin
  $readmemh("program.hex", ram);
end

The problem with this approach is that it is very slow. This program has to be embedded into the computer during the building of the computer, which can last several minutes. That is why I have devised the Loader. It is hardcoded in the RAM module, and when the computer powers on, it jumps to the address 0x0000, where I have placed a JUMP instruction to go to the Loader:

; ########################################################
; RESET CODE (4 bytes max)
; ########################################################
#addr 0x0000
j start

When started, Loader sends an initialisation sequence of bytes to the PC, via UART:

; send raspbootin boot char sequence
mov r0, 77 ; "M" character
call uart_send
mov r0, 13 ; \r character (carriage return)
call uart_send
mov r0, 10 ; \n character (line feed)
call uart_send
mov r0, 3
call uart_send
mov r0, 3
call uart_send
mov r0, 3
call uart_send

This sequence is inherited from the original Raspbootin protocol for which I have made a Java implementation. This version is similar, but I have added a checksum at the end (more about this below).

The Loader then fetches the number of bytes to be received:

first_byte:
in r1, [64] ; get the char from the uart
st [size], r1 ; store the lowest byte to the size variable
inc [state] ; next state -> 1 (second byte)
j skip ; return from interrupt
second_byte:
in r1, [64] ; get the char from the uart (8 upper bits)
ld r2, [size] ; get the lower 8 bits (received earlier)
shl r1, 8 ; shift the received byte 8 bits
or r1, r2 ; put together lower and upper 8 bits
st [size], r1 ; store the calculated size
inc [state] ; next state 
j skip ; return from interrupt

After that, the Loader returns back the received size (just to make sure that it received the correct number of bytes):

; this is 16-bit cpu, so we don't load code bigger than 65535 bytes
; send confirmation that the code has been loaded
ld r0, [size]
and r0, 255
call uart_send
ld r0, [size]
shr r0, 8
call uart_send
inc [state] ; next state ->  (code arrives)

After that, all incoming bytes are loaded into the memory, starting from the 0x400 address:

in r1, [64] ; get the byte from the uart into r1

mov r2, r1
ld r0, [sum_all]
add r0, r2
st [sum_all], r0 ; primitive checksum - sum of all bytes
; at this moment, r1 holds the received byte
ld r2, [current_addr]
st.b [r2], r1 ; store the received byte into the memory
inc r2 ; move to the next location in memory
st [current_addr], r2   ; save the incremented value of the address

ld r2, [current_size]   ; increment the byte counter
inc r2
st [current_size], r2
cmp r2, [size] ; did we receive all?
jz all_arrived
j skip

When all bytes are received, the Loader sends back the primitive checksum, so the PC can check if everything is OK:

all_arrived:
; send the sum of all bytes
ld r0, [sum_all]
and r0, 255
call uart_send
ld r0, [sum_all]
shr r0, 8
call uart_send

mov r0, 1; signal to the main program ->loader has received all
st [loaded], r0

After that, the Loader jumps to the 0x400 address:

not_loaded:
ld r0, [loaded]
cmp r0, 1
jz 0x400
nop
j not_loaded

For the PC, I have modified the Raspbootin Loader, originally used in the Raspberry Pi bare metal programming, and it is also stored on the github.
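To summarize the whole exchange from the PC side, here is a rough Python sketch of the protocol. It is not the Java loader from the repository: FakeLoader is a hypothetical in-memory model of the FPGA side (so the sketch runs without a serial port), send_program is a made-up name, and waiting for the complete six-byte boot sequence before sending anything is my assumption about the host behavior.

```python
BOOT_SEQUENCE = b"M\r\n\x03\x03\x03"


class FakeLoader:
    """Hypothetical model of the Loader: size, echo, data, checksum."""

    def __init__(self):
        self.to_host = bytearray(BOOT_SEQUENCE)   # sent right after power-on
        self.state = 0
        self.size = 0
        self.sum_all = 0
        self.memory = bytearray()                 # stands for RAM from 0x400 on

    def write(self, data):
        for byte in data:
            if self.state == 0:                   # low byte of the size
                self.size = byte
                self.state = 1
            elif self.state == 1:                 # high byte, then echo the size
                self.size |= byte << 8
                self.to_host += bytes([self.size & 255, self.size >> 8])
                self.state = 2
            elif self.state == 2:                 # program bytes
                self.memory.append(byte)
                self.sum_all = (self.sum_all + byte) & 0xFFFF
                if len(self.memory) == self.size:
                    self.to_host += bytes([self.sum_all & 255, self.sum_all >> 8])
                    self.state = 3

    def read(self, count):
        chunk = bytes(self.to_host[:count])
        del self.to_host[:count]
        return chunk


def send_program(port, program):
    """Send program over port (anything with read/write) and verify the checksum."""
    if not 0 < len(program) <= 0xFFFF:
        raise ValueError("program must be 1..65535 bytes long")
    if port.read(len(BOOT_SEQUENCE)) != BOOT_SEQUENCE:
        raise RuntimeError("loader did not announce itself")
    port.write(bytes([len(program) & 255, len(program) >> 8]))   # low byte first
    echoed = port.read(2)
    if echoed[0] | (echoed[1] << 8) != len(program):
        raise RuntimeError("size echo mismatch")
    port.write(program)
    reply = port.read(2)
    checksum = reply[0] | (reply[1] << 8)
    if checksum != sum(program) & 0xFFFF:                        # 16-bit sum of all bytes
        raise RuntimeError("checksum mismatch")
    return checksum


loader = FakeLoader()
print(send_program(loader, bytes([1, 2, 3, 250, 250])))   # 506
```

With a real board, the port object would be a serial port opened at 115200 bauds, 8 data bits, no parity, 1 stop bit.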

Conclusion

When I tried Raspberry Pi bare metal programming, I immediately had the problem of transferring programs from the PC to the RPI. Usually, there is no network (it is bare metal platform with almost none of the I/O libraries) and the only other way is by transferring programs via micro SD cards (card dance). You would cross-compile the program on the PC, save it to the SD card, eject it, put it in the RPI, and reset the RPI. And then again, and again...

That was a motivation for the programmers to develop some kind of a loader for the RPI. One of those loaders is the Raspbootin. It is fairly simple. I re-used it for exactly the same purpose - to load programs on my FPGA Computer from the PC. The only problematic part of this development was debugging the Loader. It could be only done on the FPGA, with those couple-of-minutes compiling. When I survived that, I was able to cross-assemble programs on my PC and send them to the board via Loader.


Text mode in the FPGA computer

How text mode works


This is a follow-up of the FPGA computer post. 

In this post I will give more details about the text mode of the FPGA computer. The text mode is the default mode: it is active when the computer powers up.


Text mode is 80x60 characters, occupying 4800 words, or 9600 bytes, starting from the address of 2400.

Lower byte is the ASCII code of a character, while the upper byte contains the attributes:

7 6 5 4 3 2 1 0
x r g b x r g b

Bits 6-4 hold the foreground color (inverted), and bits 2-0 hold the background color. Bits 7 and 3 (x) are not used.

The foreground color is inverted so zero values (default) would mean white color. That way, you don't need to set the foreground color to white, and by default (0, 0, 0), it is white. The default background color is black (0, 0, 0). This means that if the upper (Attribute) byte is zero (0x00), the background color is black, and the foreground color is white.
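As an illustration of the attribute layout, here is a small Python helper that builds the 16-bit word for one character cell. It is only a sketch: the function name and the (r, g, b) tuples are made up, and the bit positions are taken from the Verilog code further below in this post (foreground in bits 6-4 of the attribute byte, background in bits 2-0).

```python
def text_cell(char, fg=(1, 1, 1), bg=(0, 0, 0)):
    """Return the 16-bit word for one cell: attribute in the upper byte,
    ASCII code in the lower byte. Colors are (r, g, b) tuples of 0/1."""
    fg_r, fg_g, fg_b = fg
    bg_r, bg_g, bg_b = bg
    attribute = (
        ((fg_r ^ 1) << 6) | ((fg_g ^ 1) << 5) | ((fg_b ^ 1) << 4)   # foreground, inverted
        | (bg_r << 2) | (bg_g << 1) | bg_b                           # background
    )
    return (attribute << 8) | (ord(char) & 0xFF)


print(hex(text_cell("A")))                                # 0x41   white on black
print(hex(text_cell("A", fg=(1, 0, 0))))                  # 0x3041 red on black
print(hex(text_cell("A", fg=(0, 0, 0), bg=(1, 1, 1))))    # 0x7741 black on white
```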

I have used Ken Shirriff's blog post FizzBuzz a lot for this implementation. I highly recommend his posts!

Verilog implementation relies on the character ROM. Character ROM is implemented as a separate Verilog module, and is used like this:
// Character generator
chars chars_1(
  .char(curr_char[7:0]),
  .rownum(y[2:0]),
  .pixels(pixels)
); 

Current character (which is read from the address of 2400, up to the 2400+9600) is received in the curr_char register. This register is wired to the chars module, together with two additional parameters: rownum (wired to the y register - the y coordinate) and the pixels output register (this register will hold the pixels of the current character, for the current y coordinate).

The chars module itself is a giant switch statement:
always @(*)
  case ({char, rownum})

    11'b00110000000: pixels = 8'b01111100; //  XXXXX  
    11'b00110000001: pixels = 8'b11000110; // XX   XX 
    11'b00110000010: pixels = 8'b11001110; // XX  XXX 
    11'b00110000011: pixels = 8'b11011110; // XX XXXX 
    11'b00110000100: pixels = 8'b11110110; // XXXX XX 
    11'b00110000101: pixels = 8'b11100110; // XXX  XX 
    11'b00110000110: pixels = 8'b01111100; //  XXXXX  
    11'b00110000111: pixels = 8'b00000000; //         

    11'b00110001000: pixels = 8'b00110000; //   XX    
    11'b00110001001: pixels = 8'b01110000; //  XXX    
    11'b00110001010: pixels = 8'b00110000; //   XX    
    11'b00110001011: pixels = 8'b00110000; //   XX    
    11'b00110001100: pixels = 8'b00110000; //   XX    
    11'b00110001101: pixels = 8'b00110000; //   XX    
    11'b00110001110: pixels = 8'b11111100; // XXXXXX  
    11'b00110001111: pixels = 8'b00000000; //       

As you can see, the input character and the y coordinate are concatenated to determine which row of pixels will be returned to the vga text module.
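The same lookup can be modeled in Python to see how the 11-bit key is formed. This sketch contains only the rows of the "0" glyph shown above, and returning 0 for missing entries is my assumption, not necessarily what the chars module does. The names are made up for the illustration.

```python
# Rows of the "0" character (ASCII 0x30), copied from the listing above
GLYPH_ZERO = [0b01111100, 0b11000110, 0b11001110, 0b11011110,
              0b11110110, 0b11100110, 0b01111100, 0b00000000]

# Key is {char, rownum}: 8 bits of character code followed by 3 bits of row number
ROM = {(0x30 << 3) | row: pixels for row, pixels in enumerate(GLYPH_ZERO)}


def rom_lookup(char, rownum):
    """Return the 8 pixels of one row of a character (0 if not in this tiny ROM)."""
    return ROM.get(((char & 0xFF) << 3) | (rownum & 7), 0)


def render(char):
    """Return the glyph as 8 strings, using X for a set pixel (bit 7 is leftmost)."""
    return ["".join("X" if rom_lookup(char, row) & (0x80 >> col) else " "
                    for col in range(8))
            for row in range(8)]


print(bin((0x30 << 3) | 0))   # 0b110000000, i.e. 11'b00110000000
for line in render(0x30):
    print(line)
```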

How is the curr_char obtained? There are three distinct situations in which it is fetched:
1. during the visible scanline processing. During this case, we wait for the last column (pixel) of the current character to be displayed, and then we fetch the next character:
else if (x < 640 && !mem_read) begin
 if ((x & 7) == 7) begin
  // when we are finishing current character, 
  // we need to fetch in advance 
  // the next character (x+1, y)
  // (at the last pixel of the current character, let's fetch next)
  rd <= 1'b1;
  wr <= 1'b0;
  addr <= VIDEO_MEM_ADDR + ((x >> 3) + (y >> 3)*80 + 1);
  mem_read <= 1'b1;
 end

end 
2. during the horizontal blanking. During this case, we need to obtain either the current character (we haven't finished the current row yet), or the next character in the next row:


else if ((x >= 640) && ((y & 7) < 7)) begin
// when we start the horizontal blanking, 
// and still displaying character in the current row,
// we need to fetch in advance 
// the first character in the current row (0, row)
rd <= 1'b1;
wr <= 1'b0;
addr <= VIDEO_MEM_ADDR + ((y >> 3)*80);
mem_read <= 1'b1;
end
else if ((x >= 640) && ((y & 7) == 7)) begin
// when we start the horizontal blanking, 
// and we need to go to the next line, 
// we need to fetch in advance the first character in next row (0, row+1)
rd <= 1'b1;
wr <= 1'b0;
addr <= VIDEO_MEM_ADDR + (((y >> 3) + 1)*80);
mem_read <= 1'b1;
end



3. during the vertical blanking. In this case, we need to fetch the first character, at the beginning of the frame buffer:

if ((x >= 640) && (y >= 480)) begin
// when we start the vertical blanking, 
// we need to fetch in advance the first character (0, 0)
rd <= 1'b1;
wr <= 1'b0;
addr <= VIDEO_MEM_ADDR + 0;
mem_read <= 1'b1;
end

The code above sets the address bus and control lines. The character is then fetched from the data bus:
if (mem_read) begin
curr_char <= data;
rd <= 1'bz;
wr <= 1'bz;
mem_read <= 1'b0;
end

The character is wired to the character ROM, and the output is placed in the pixels register. From that point, the pixels are shifted bit by bit to the r, g, and b wires of VGA connector:
if (valid) begin
r <= pixels[7 - (x & 7)] ? !curr_char[6+8] : curr_char[2+8];
g <= pixels[7 - (x & 7)] ? !curr_char[5+8] : curr_char[1+8];
b <= pixels[7 - (x & 7)] ? !curr_char[4+8] : curr_char[0+8];
end 
else begin
// blanking -> no pixels
r <= 1'b0;
g <= 1'b0;
b <= 1'b0;
end

It is interesting how horizontal and vertical sync pulses are generated:
assign hs = x < (640 + 16) || x >= (640 + 16 + 96);
assign vs = y < (480 + 10) || y >= (480 + 10 + 2);
assign valid = (x < 640) && (y < 480);

Just by wiring the hs and vs signals to the VGA connector and by assigning the expressions above to them, horizontal and vertical sync pulses are generated according to the current state of the x and y counters. When the x counter reaches 640 + 16, the visible part of the scanline and the front porch are over, and the hs pulse goes low (inverted logic) for the next 96 pixels. Similarly, the vs pulse goes low when the y counter (the line counter) reaches 480 + 10, and it stays low for two lines.

If you look at the value range of the x and y registers, you will see that the x goes from 0 to 799, while y goes from 0 to 524. This means that the actual resolution of the VGA 640x480 mode is 800x525. However, during the horizontal and vertical blanking some of those pixels (and also lines) are not visible, so the actual visible pixels are from the 640x480 range. That is detected in the "assign valid =..." line of the code above. 
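These numbers are easy to verify with a short Python model of the three assign expressions. This is just a sanity check, not part of the project. The 25 MHz pixel clock is my assumption (CLOCK_50 divided by two); the standard 640x480 mode uses 25.175 MHz, which gives a refresh rate slightly closer to 60 Hz.

```python
H_VISIBLE, H_FRONT, H_SYNC, H_TOTAL = 640, 16, 96, 800
V_VISIBLE, V_FRONT, V_SYNC, V_TOTAL = 480, 10, 2, 525
PIXEL_CLOCK_HZ = 25_000_000   # assumption: CLOCK_50 divided by two


def hs(x):
    return x < H_VISIBLE + H_FRONT or x >= H_VISIBLE + H_FRONT + H_SYNC


def vs(y):
    return y < V_VISIBLE + V_FRONT or y >= V_VISIBLE + V_FRONT + V_SYNC


def valid(x, y):
    return x < H_VISIBLE and y < V_VISIBLE


hs_low = [x for x in range(H_TOTAL) if not hs(x)]
vs_low = [y for y in range(V_TOTAL) if not vs(y)]
visible = sum(valid(x, y) for y in range(V_TOTAL) for x in range(H_TOTAL))

print(hs_low[0], hs_low[-1], len(hs_low))                # 656 751 96
print(vs_low[0], vs_low[-1], len(vs_low))                # 490 491 2
print(visible)                                           # 307200
print(round(PIXEL_CLOCK_HZ / (H_TOTAL * V_TOTAL), 2))    # 59.52 frames per second
```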

Programming in assembler

Assembler examples can be found here.

Following assembler code writes a string with color attributes.
mov r1, hello  ; r1 holds the address of the "Hello World!" string
mov r2, 0      ; r2 is the index
mov r3, 0      ; r3 has the attribute
again:
ld.b r0, [r1]  ; load r0 with the content of the memory location to which r1 points (current character)
cmp r0, 0      ; if the current character is 0 (string terminator),
jz end         ; go out of this loop 
st.b [r2 + VIDEO_1], r3; store the attribute
inc r2        ; move to the character location
st.b [r2 + VIDEO_1], r0; store the character at the VIDEO_1 + r2 
inc r1        ; move to the next character in the string
inc r2         ; move to the next location (attribute) in video memory
inc r3         ; change the attribute of the current character
j again        ; continue with the loop
end:
halt


hello:

#str "Hello World!\0"



The result is on the image below.


In the emulator, it looks like this:


Conclusion

The text mode is implemented by reading the character from the framebuffer, and then by obtaining its pixels from the character ROM. When those pixels are obtained, they are shifted one by one to the VGA connector, to the corresponding r, g and b wires. That way, the character is shown on the screen. I have implemented the text mode first, and then I have implemented the graphics mode. Both modes are surprisingly simple to implement in Verilog.

Text mode module is on the github.


Added graphics mode in the FPGA computer

Added graphics mode to the FPGA computer

This is another follow-up of the original FPGA Computer post.

I have added the graphics mode to the FPGA computer - 320x240 pixels, 8 colors for each pixel. Framebuffer starts at the same address as the text mode one (2400 decimal), but it now displays pixels, instead of characters.

Each pixel can have one of eight colors. There are two pixels per byte in the framebuffer:

7 6 5 4 3 2 1 0
x r g b x r g b

For example, if you want to draw two red pixels in the top left corner, at the coordinates (0, 0) and (1, 0), you need to put the following byte into the location 2400:

7 6 5 4 3 2 1 0
0 1 0 0 0 1 0 0

Or, 0x44 in hex.
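The address calculation for a single pixel can be sketched in Python like this. It is not the pixel subroutine from the assembler examples, just a model of the layout: the framebuffer is a bytearray whose index 0 corresponds to the location 2400, and I assume that the pixel with the even x coordinate sits in the upper four bits of the byte (with two pixels of the same color, as in the example above, this assumption does not matter).

```python
WIDTH, HEIGHT = 320, 240
BYTES_PER_ROW = WIDTH // 2          # two pixels per byte


def set_pixel(framebuffer, x, y, color):
    """Store a 3-bit color (r, g, b bits) at (x, y) without touching the neighbor."""
    offset = y * BYTES_PER_ROW + x // 2
    if x % 2 == 0:
        # assumed: even x goes to bits 6-4
        framebuffer[offset] = (framebuffer[offset] & 0x0F) | ((color & 7) << 4)
    else:
        # assumed: odd x goes to bits 2-0
        framebuffer[offset] = (framebuffer[offset] & 0xF0) | (color & 7)


RED = 0b100
framebuffer = bytearray(BYTES_PER_ROW * HEIGHT)
set_pixel(framebuffer, 0, 0, RED)
set_pixel(framebuffer, 1, 0, RED)
print(hex(framebuffer[0]), len(framebuffer))   # 0x44 38400
```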

Since the default mode is text mode, I have devised a system to switch video modes:
mov r1, 1
out [128], r1

This code will switch to the graphics mode of 320x240. To switch back to the text mode, you need to execute the following code:
mov r1, 0
out [128], r1



So, number 1 at the port 128 sets the video mode to 320x240, while 0 sets to the text mode.

The implementation in Verilog was not complicated compared to the text mode. First of all, I had to decide which resolution to implement. I have chosen 320x240 with 8 colors, because it consumes 38400 bytes, which is the least amount of memory with the decent resolution and number of colors. I could not have more pixels, since that would consume more RAM than the computer has (64KB).

Even this mode consumes more than half of the available memory, so I can always make other modes not so demanding in memory (reduce the number of colors). For example, having the same 320x240 black and white framebuffer would consume 9600 bytes, or approx. 9KB.
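The memory figures above follow from simple arithmetic, shown here as a tiny Python sketch (the function name is made up). The last line shows why a full 640x480 mode with the same color depth does not fit into 64KB.

```python
def framebuffer_bytes(width, height, bits_per_pixel):
    """Size of a packed framebuffer in bytes."""
    return width * height * bits_per_pixel // 8


print(framebuffer_bytes(320, 240, 4))   # 38400, two pixels per byte
print(framebuffer_bytes(320, 240, 1))   # 9600, black and white
print(framebuffer_bytes(640, 480, 4))   # 153600, more than the 65536 bytes of RAM
```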

Next, the implementation has two pixels per byte of the framebuffer. So, I have recycled the text mode vga module and used the same two variables: x and y to go through all the pixels of the screen. Then I had to fetch in advance the next word (two bytes - remember, memory is organized in 32KW, having data bus 16 bits wide) containing next four pixels. So the algorithm was simple:


- during the visible scanline processing, the video module fetches the next four pixels while displaying the last pixel of the current word in a row
else if (x < 640 && !mem_read) begin
if ((x & 7) == 7)  begin
// when we are finishing current word, 
// containing four pixels, 
// we need to fetch in advance 
// the next word (x+1, y)
// (at the last pixel of the current character,
// let's fetch next)
rd <= 1'b1;
wr <= 1'b0;
addr <= VIDEO_MEM_ADDR + ((xx >> 2)+(yy * 80) + 1);
mem_read <= 1'b1;
end

end 
- during the horizontal blanking, the video module fetches first four pixels at the beginning of the next row
else if ((x >= 640) && (y < 480)) begin
// when we start the horizontal blanking, 
// and we need to go to the next line, 
// we need to fetch in advance the first word 
// in the next line (0, y+1)
rd <= 1'b1;
wr <= 1'b0;
mem_read <= 1'b1;
if ((y & 1) == 1) begin
addr <= VIDEO_MEM_ADDR + ((yy + 1) * 80);
end
else begin
addr <= VIDEO_MEM_ADDR + ((yy) * 80);
end
end
- during the vertical blanking, the video module fetches the first four pixels at the top left corner of the screen (start of the video memory - address 2400 decimal).
if ((x >= 640) && (y >= 480)) begin
// when we start the vertical blanking, 
// we need to fetch in advance the first word at (0, 0)
rd <= 1'b1;
wr <= 1'b0;
mem_read <= 1'b1;
addr <= VIDEO_MEM_ADDR + 0;
end


Once the address bus is set to the address of the word (containing pixels) to be fetched, we receive that word using the following code:
if (mem_read) begin
pixels <= data;
rd <= 1'bz;
wr <= 1'bz;
mem_read <= 1'b0;
end

Received pixels are stored in the pixels register.

The actual output of the pixels register to the r, g, and b wires of the vga connector is then simple:
if (valid) begin
r <= pixels[12 - ((xx & 3) << 2) + 0] == 1'b1;
g <= pixels[12 - ((xx & 3) << 2) + 1] == 1'b1;
b <= pixels[12 - ((xx & 3) << 2) + 2] == 1'b1;
end
else begin
// blanking -> no pixels
r <= 1'b0;
g <= 1'b0;
b <= 1'b0;
end

The xx and yy variables contain actual x and y positions divided by two. x and y iterate in the 640x480 range, while xx and yy iterate in the range of 320x240:
assign xx = x >> 1;
assign yy = y >> 1;

Assembler example

Assembler examples can be found here.

Here is the assembler example which draws two pixels, three lines and a circle on the screen:

mov r0, 1
out [128], r0  ; set the video mode to graphics

mov r0, 0 ; x = 0
mov r1, 100         ; y = 100
mov r2, 7 ; white color (0111)
call pixel
inc r0 ; x = 1
mov r2, 4 ; red color (0100)
call pixel

mov r2, 4 ; red color (0100)
mov r0, 50 ; A.x = 50
mov r1, 50 ; A.y = 50
mov r3, 150 ; B.x = 150
mov r4, 150 ; B.y = 150
call line

mov r2, 2 ; green color (0010)
mov r0, 50 ; A.x = 50
mov r1, 50 ; A.y = 50
mov r3, 150 ; B.x = 150
mov r4, 50 ; B.y = 50
call line

mov r2, 1 ; blue color (0001)
mov r0, 150 ; A.x = 150
mov r1, 50 ; A.y = 50
mov r3, 150 ; B.x = 150
mov r4, 150 ; B.y = 150
call line

mov r2, 7 ; white color (0111)
mov r0, 150 ; x = 150
mov r1, 150 ; y = 150
mov r3, 50 ; r = 50
call circle

First we switch to the graphics mode (out instruction). Then we draw two pixels. The pixel subroutine has three parameters: r0 (x-coordinate), r1 (y-coordinate) and r2 (color). The color is determined by the content that is put in the r2 register. For the first pixel it is 0x7, which means that all three color bits of the pixel are set to 1, giving the white color. The second pixel uses 0x4, which is red.

Three lines are drawn using the line subroutine. It has five parameters: r0 (x1-coordinate), r1 (y1-coordinate), r2 (color), r3 (x2-coordinate), and r4 (y2-coordinate). The circle subroutine has four parameters: r0 (x-coordinate), r1 (y-coordinate), r2 (color), and r3 (radius).

Lines and circles are created using Bresenham's line and circle algorithms
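For readers who have not seen it, here is the line algorithm in its common integer-only form, written in Python. This is a generic textbook version, not a translation of the line subroutine from the assembler examples, and it returns the points instead of drawing them.

```python
def bresenham_line(x0, y0, x1, y1):
    """Return all integer points on the line from (x0, y0) to (x1, y1)."""
    points = []
    dx = abs(x1 - x0)
    dy = -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy                      # error term, only additions and comparisons
    while True:
        points.append((x0, y0))
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:                   # step in x
            err += dy
            x0 += sx
        if e2 <= dx:                   # step in y
            err += dx
            y0 += sy
    return points


# The red line from the example above: A = (50, 50), B = (150, 150)
line = bresenham_line(50, 50, 150, 150)
print(len(line), line[0], line[-1])    # 101 (50, 50) (150, 150)
```

The algorithm needs no multiplication or division inside the loop, which makes it a good fit for a simple CPU.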



Here is the snapshot of the emulator:


Conclusion

Adding graphics mode was not that complicated. I have decided to have the 320x240 resolution having each pixel independent of the other (no attributes). That approach consumed quite a lot of memory, but this is not important since this computer will be comparable to the vintage platforms of the 70s and 80s.

The graphics module is on the github.


Raspberry PI stuff

Various stuff about Raspberry Pi



Installation

You need to download the OS image from the official Raspberry PI site:


I prefer Raspbian with desktop.

Then you need to download the Etcher software for writing the OS image to the micro SD card:


Put the micro SD card in your computer, start the Etcher, choose the image file and write.

When everything is done, remove the micro SD card safely from the PC, put it in the Raspberry PI, connect HDMI cable (in case of Zero, mini HDMI cable) from RPI to the monitor (or TV), and connect the keyboard to one of the USB ports (in case of RPI Zero, you need to connect your USB keyboard via adapter to the micro USB port). Connect the power cable. RPI will boot for the first time.

Default username/password is pi/raspberry.

Upon login, start the raspi-config by typing:

sudo raspi-config

This will start the configuration utility for the RPI. I use it to set up the new password, host name of the RPI and to turn on almost all interfacing options. When setting the interfacing options, I turn on the SSH, I2C, SPI and 1-wire. 

When exiting, the raspi-config will reboot the machine.

I prefer to set up the static IP to my RPIs, so here are some combinations:
1. Set up RPI 3 with the static IP on Ethernet,
2. Set up RPI Zero with the static IP on wireless,
3. Set up RPI Zero with the Ethernet support (needs additional ENC28J60 module to be connected to the RPI Zero).

Setting up RPI 3 with the static IP on Ethernet (and WiFi)

Before booting, connect the Ethernet cable from your router to the RPI 3, and connect the power. You can then log on. From that moment, you can set up the static IP address. Before that, you can check if the networking works. First of all, you can type:

ifconfig

This will write your IP address, which your RPI obtained from the router (via DHCP). If the IP address of the RPI begins, for example, with 192.168.1, then the static IP address will need to start the same way (remember first three numbers of the IP address). 
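If you are not sure whether the address you picked matches your router, you can check it with a few lines of Python, which comes preinstalled with Raspbian. This is only a convenience sketch with a made-up function name. It assumes the usual home network mask of 255.255.255.0 (prefix /24), which is what "the first three numbers must be the same" means.

```python
import ipaddress


def same_subnet(static_ip, router_ip, prefix=24):
    """Return True if static_ip belongs to the router's network."""
    network = ipaddress.ip_network(f"{router_ip}/{prefix}", strict=False)
    return ipaddress.ip_address(static_ip) in network


print(same_subnet("192.168.1.207", "192.168.1.1"))   # True
print(same_subnet("192.168.0.207", "192.168.1.1"))   # False
```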

Here we have two branches:
1. Stretch, Buster, and newer builds of Raspbian,
2. builds before Stretch.

Stretch, buster, and newer builds

To set up the static IP address, you need to edit the /etc/network/interfaces file:

sudo nano /etc/network/interfaces

The nano editor will open the interfaces file. You can then put the following content:

# interfaces(5) file used by ifup(8) and ifdown(8)

# Please note that this file is written to be used with dhcpcd
# For static IP, consult /etc/dhcpcd.conf and 'man dhcpcd.conf'

# Include files from /etc/network/interfaces.d:
source-directory /etc/network/interfaces.d

auto lo
iface lo inet loopback

auto eth0
allow-hotplug eth0
iface eth0 inet manual

auto wlan0
allow-hotplug wlan0
iface wlan0 inet manual
wpa-conf /etc/wpa_supplicant/wpa_supplicant.conf

Both eth0 and wlan0 (I have decided to assign my wlan0 static address, too) are set to manual. In case of wlan0, you need to edit the /etc/wpa_supplicant/wpa_supplicant.conf file to the basic content:

ctrl_interface=DIR=/var/run/wpa_supplicant GROUP=netdev
update_config=1
network={
   ssid="xxxx"
   psk="yyyy"
}

Then you need to add the following code to the end of the /etc/dhcpcd.conf file:

# Static eth0 IP configuration
interface eth0
static ip_address=192.168.1.207/24
static routers=192.168.1.1
static domain_name_servers=192.168.1.1 8.8.8.8
# Static wlan0 IP configuration
interface wlan0
static ip_address=192.168.1.217/24
static routers=192.168.1.1
static domain_name_servers=192.168.1.1 8.8.8.8

Before stretch (or buster) builds

To set up the static IP address, you need to edit the /etc/network/interfaces file:

sudo nano /etc/network/interfaces

The nano editor will open the interfaces file. You can then put the following content:

# interfaces(5) file used by ifup(8) and ifdown(8)

# Please note that this file is written to be used with dhcpcd
# For static IP, consult /etc/dhcpcd.conf and 'man dhcpcd.conf'

# Include files from /etc/network/interfaces.d:
source-directory /etc/network/interfaces.d

auto lo
iface lo inet loopback

allow-hotplug eth0
iface eth0 inet static
address 192.168.1.200
netmask 255.255.255.0
gateway 192.168.1.1

The address set in this example is 192.168.1.200. After that, you can restart the networking by typing:

sudo service networking restart

Or, you can reboot the RPI by typing:

sudo reboot


Setting up RPI Zero with the static IP on Wireless

RPI Zero W has built-in wireless, while RPI Zero does not. If you have the RPI Zero, you need a WiFi dongle and an adapter to connect it to the micro USB port. After that, the procedure is the same for both RPI Zero W and RPI Zero.

Here too, we have two branches:
1. stretch, buster, and newer builds,
2. builds before stretch.

Stretch, buster, and newer builds

The procedure is the same as in the "Stretch, buster, and newer builds" section for the RPI 3 above.

Before stretch (or buster) builds

You need to edit the /etc/network/interfaces by typing:

sudo nano /etc/network/interfaces

In the nano editor, change the interfaces file to:

# interfaces(5) file used by ifup(8) and ifdown(8)

# Please note that this file is written to be used with dhcpcd
# For static IP, consult /etc/dhcpcd.conf and 'man dhcpcd.conf'

# Include files from /etc/network/interfaces.d:
source-directory /etc/network/interfaces.d

auto lo
iface lo inet loopback

allow-hotplug wlan0
iface wlan0 inet static
#    wpa-conf /etc/wpa_supplicant/wpa_supplicant.conf
        wpa-ssid "MySSID"
        wpa-psk "xxxxxx"
address 192.168.1.201
netmask 255.255.255.0
gateway 192.168.1.1

The address set in this example is 192.168.1.201. The MySSID is the SSID of your WiFi network. You must enter the SSID and the password with the quotes (").


Setting up RPI Zero for the Ethernet support

RPI Zero supports the ENC28J60 Ethernet module out of the box.

ENC28J60 Ethernet module

This module needs to be connected to the RPI Zero via SPI interface. Don't forget to enable the SPI from the raspi-config tool (look above). After that, you need to do the following:

1. Connect the ENC28J60 module to the RPI using the following pin scheme:

Pi            Pin  ENC28J60
---------------------------
+3V3          17   VCC
GPIO10/MOSI   19   SI
GPIO9/MISO    21   SO
GPIO11/SCLK   23   SCK
GND           20   GND
GPIO25        22   INT
CE0#/GPIO8    24   CS

2. Enable the ENC28J60 module by adding its overlay to the end of your /boot/config.txt file. Open the file by typing:

sudo nano /boot/config.txt

This will open the nano editor. Go to the end of the file and enter the following text:

dtoverlay=enc28j60

3. Reboot (sudo reboot)

From this moment on, you can work with the Ethernet as eth0 device.


Having static IP on both Ethernet and WiFi

The text below is for the pre-stretch/buster builds. For having both ethernet and WiFi static, look above, at the "Setting up RPI 3 with the static IP on Ethernet (and WiFi)" title.

If you want to have the static IP on both Ethernet port and WiFi, you need to edit the /etc/network/interfaces file and put the following text:

# interfaces(5) file used by ifup(8) and ifdown(8)

# Please note that this file is written to be used with dhcpcd
# For static IP, consult /etc/dhcpcd.conf and 'man dhcpcd.conf'

# Include files from /etc/network/interfaces.d:
source-directory /etc/network/interfaces.d

auto lo
iface lo inet loopback

#allow-hotplug eth0
iface eth0 inet static
address 192.168.1.202
netmask 255.255.255.0
gateway 192.168.1.1

auto wlan0
#allow-hotplug wlan0
iface wlan0 inet static
#    wpa-conf /etc/wpa_supplicant/wpa_supplicant.conf
        wpa-ssid "MySSID"
        wpa-psk "xxxxxxx"
address 192.168.1.212
netmask 255.255.255.0
gateway 192.168.1.1

The address set in this example for the Ethernet is 192.168.1.202 and for the WiFi is 192.168.1.212. 

Installing Java8 on your RPI

Type the following in your console:

sudo aptitude install oracle-java8-jdk

This installs the Java 8 installer package, which then runs the installer.


Samba support

Samba allows you to share a part of your RPI disk to the network, for other machines and users. It also allows you to access other samba shares on the network. We will focus on the sharing of our disk on the network.

Install Samba via apt-get:

sudo apt-get install samba samba-common-bin

Edit the smb.conf file using nano:

sudo nano /etc/samba/smb.conf

Find the entries for workgroup and wins support, and set them up as follows:

workgroup = your_workgroup_name
wins support = yes

You also need to add the following section to the end of smb.conf to define the share:

[pihome]
   comment= Pi Home
   path=/home/pi
   browseable=Yes
   writeable=Yes
   only guest=no
   create mask=0777
   directory mask=0777
   public=no

This will add the Samba share named "pihome" on your RPI, so it will be accessible from other machines.

At the end, we need to add the current user to the Samba:

sudo smbpasswd -a pi

After that, just restart the smbd daemon:

sudo systemctl restart smbd


FPGA Computer Assembler

This is the second follow-up of my initial text about the FPGA Computer.

I use a fork of the customasm project for my FPGA-based CPU. It is on the github here:

https://github.com/milanvidakovic/FPGAcustomasm

This 16-bit CPU has 8 general-purpose registers (r0 – r7), pc (program counter), sp (stack pointer), ir (instruction register), and h (higher word when multiplying, or remainder when dividing). Each register is 16-bits wide.

The address bus is 16 bits wide, addressing 65536 addresses. The data bus is also 16 bits wide, but memory is byte-addressed: every address refers to a single byte.

There are eleven groups of instructions:


Group 0: NOP/MOV/IN/OUT/PUSH/POP/RET/IRET/HALT/SWAP
  Members:
    nop; mov reg, xx; mov reg, reg; in reg, [xx]; out [xx], reg
    push reg; push xx; pop reg; ret; iret; swap; halt
  Description: The most general group. Deals with putting values into registers, exchanging values between registers, I/O operations, stack operations, returning from subroutines, and register content swapping. NOP and HALT are also in this group.

Group 1: JUMP
  Members:
    j xx; jc xx; jnc xx; jz xx; jnz xx; jo xx; jno xx
    jp xx; jnp xx; jg xx; jge xx; js xx; jse xx
  Description: Jump to the given location.

Group 2: CALL
  Members:
    call xx; callc xx; callnc xx; callz xx; callnz xx; callo xx; callno xx
    callp xx; callnp xx; callg xx; callge xx; calls xx; callse xx
  Description: Calls a subroutine. Puts the return address on the stack before jumping to the subroutine. The subroutine must execute RET to return.

Group 3: LOAD/STORE
  Members:
    ld reg, [xx]; ld reg, [reg]; ld reg, [reg + xx]
    ld.b reg, [xx]; ld.b reg, [reg]; ld.b reg, [reg + xx]
    st [xx], reg; st [reg], reg; st [reg + xx], reg
    st.b [xx], reg; st.b [reg], reg; st.b [reg + xx], reg
  Description: Load from memory into the register (destination: register; source: memory address given by the number, or by the register, or by the register+number). Store the given register into the memory location (destination: memory location given by the number, or by the register, or by the register+number).

Group 4: ADD/SUB
  Members:
    add reg, reg; add reg, xx; add reg, [reg]; add reg, [xx]; add reg, [reg + xx]
    add.b reg, [reg]; add.b reg, [xx]; add.b reg, [reg + xx]
    sub reg, reg; sub reg, xx; sub reg, [reg]; sub reg, [xx]; sub reg, [reg + xx]
    sub.b reg, [reg]; sub.b reg, [xx]; sub.b reg, [reg + xx]
  Description: Add and sub group.

Group 5: AND/OR
  Members:
    and reg, reg; and reg, xx; and reg, [reg]; and reg, [xx]; and reg, [reg + xx]
    and.b reg, [reg]; and.b reg, [xx]; and.b reg, [reg + xx]
    or reg, reg; or reg, xx; or reg, [reg]; or reg, [xx]; or reg, [reg + xx]
    or.b reg, [reg]; or.b reg, [xx]; or.b reg, [reg + xx]
  Description: And / or group.

Group 6: XOR
  Members:
    xor reg, reg; xor reg, xx; xor reg, [reg]; xor reg, [xx]; xor reg, [reg + xx]
    xor.b reg, [reg]; xor.b reg, [xx]; xor.b reg, [reg + xx]
  Description: Xor group.

Group 7: SHL/SHR
  Members:
    shl reg, reg; shl reg, xx; shl reg, [reg]; shl reg, [xx]; shl reg, [reg + xx]
    shl.b reg, [reg]; shl.b reg, [xx]; shl.b reg, [reg + xx]
    shr reg, reg; shr reg, xx; shr reg, [reg]; shr reg, [xx]; shr reg, [reg + xx]
    shr.b reg, [reg]; shr.b reg, [xx]; shr.b reg, [reg + xx]
  Description: Shift group.

Group 8: MUL/DIV
  Members:
    mul reg, reg; mul reg, xx; mul reg, [reg]; mul reg, [xx]; mul reg, [reg + xx]
    mul.b reg, [reg]; mul.b reg, [xx]; mul.b reg, [reg + xx]
    div reg, reg; div reg, xx; div reg, [reg]; div reg, [xx]; div reg, [reg + xx]
    div.b reg, [reg]; div.b reg, [xx]; div.b reg, [reg + xx]
  Description: Multiply / divide group.

Group 9: INC/DEC
  Members:
    inc reg; inc [reg]; inc [xx]; inc [reg + xx]
    inc.b [reg]; inc.b [xx]; inc.b [reg + xx]
    dec reg; dec [reg]; dec [xx]; dec [reg + xx]
    dec.b [reg]; dec.b [xx]; dec.b [reg + xx]
  Description: Increment and decrement group.

Group 10: CMP/NEG
  Members:
    cmp reg, reg; cmp reg, xx; cmp reg, [reg]; cmp reg, [xx]; cmp reg, [reg + xx]
    cmp.b reg, [reg]; cmp.b reg, [xx]; cmp.b reg, [reg + xx]
    neg reg; neg [reg]; neg [xx]; neg [reg + xx]
    neg.b [reg]; neg.b [xx]; neg.b [reg + xx]
  Description: Compare / negate group.

All the instructions are two or four bytes long. Since the data bus is 16-bits wide, the complete instruction is fetched in either one or two memory reads. This means that, since the SRAM is used, the complete instruction is fetched, decoded, and executed in three or more clock cycles.

All the instructions have the similar format:


Field   Bits   Values
---------------------------------------
from    bbbb   0-7: r0-r7, 8: sp, 9: h
to      bbbb   0-7: r0-r7, 8: sp, 9: h
what    0000   0 => mov regx, regy
group   0000

The lower four bits of the first byte designate the destination register (to), while the upper four bits identify the source register (from). The lower four bits of the second byte identify the instruction group (group), and the upper four bits give the type of the instruction within that group (what).

For example, the  mov r2, r1  instruction is encoded as:
binary: 0001 0010 0000 0000
hex: 12 00

The source is r1 (0001), the destination is r2 (0010), the group is 0 (0000), and the type is mov regx, regy (0000).

Second example is the  mov r1, 0x0f  instruction:
binary: 0000 0001 0010 0000, 0000 0000 0000 1111
hex: 01 20, 00 0f
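
The two encodings above can be reproduced with a small Python sketch. It only models the two mov forms shown in the examples. The type value 2 for mov reg, xx is inferred from the second example (0010), the big-endian order of the immediate word follows the "00 0f" bytes shown there, and the function names are made up for this illustration.

```python
# Register numbers as given in the format table: r0-r7, sp = 8, h = 9
REGS = {f"r{i}": i for i in range(8)}
REGS.update({"sp": 8, "h": 9})

GROUP_0 = 0x0        # NOP/MOV/... group
WHAT_MOV_REG = 0x0   # mov regx, regy
WHAT_MOV_IMM = 0x2   # mov reg, xx (inferred from the second example)

def encode_mov_reg(dest, src):
    """Encode mov dest, src as two bytes: from/to nibbles, then what/group."""
    first = (REGS[src] << 4) | REGS[dest]
    second = (WHAT_MOV_REG << 4) | GROUP_0
    return bytes([first, second])

def encode_mov_imm(dest, value):
    """Encode mov dest, value as four bytes; the 16-bit value follows the opcode."""
    first = REGS[dest]  # the "from" nibble is not used and stays 0
    second = (WHAT_MOV_IMM << 4) | GROUP_0
    return bytes([first, second]) + (value & 0xFFFF).to_bytes(2, "big")

print(encode_mov_reg("r2", "r1").hex(" "))  # 12 00
print(encode_mov_imm("r1", 0x0F).hex(" "))  # 01 20 00 0f
```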


The Load instructions are used to load the value from the memory into the register. The Store instructions store the value of the register into the given memory location. Memory location is given as number (ld  r1, [0x0a] - load the content of the 0x0a location into the r1 register), or as a value of a register (ld  r1, [r2] - load the content of the memory location to which r2 points), or as a sum of number and register (ld  r1, [0x0f + r2]). 

ld r1, [0x0a] loads two bytes from the 0x0a location. The address (0x0a) must be even if we work with 16-bit values.

If we want to load a byte from a location, we need to use the ".b" suffix:
ld.b r1, [0x0a]

The code above will load a byte from the 0x0a location into the r1 register.

Hello World example


Let's look at the Hello World example:

; this program will print HELLO WORLD
#addr 0x400
VIDEO_0 = 2400 ; beginning of the text frame buffer

mov r2, 0      ; r2 is the index
mov r1, hello  ; r1 holds the address of the "HELLO WORLD" string

again:
ld.b r0, [r1]          ; load r0 with the content of the memory location to which r1 points (current character)
cmp r0, 0              ; if the current character is 0 (string terminator),
jz end                 ; go out of this loop 
st [r2 + VIDEO_0], r0  ; store the character at the VIDEO_0 + r2 
inc r1                 ; move to the next character
add r2, 2              ; move to the next location in the video memory
j again                ; continue with the loop

end:
halt
hello:

#str "HELLO WORLD!\0"

First we define the constant VIDEO_0 with the value of 2400. This is the address of the text-based frame buffer. It points to the first character in the video memory.

Then we set the r2 to 0 and r1 to the address of the hello string. Note that the mov instruction is used to move the number into the register (for example, mov r2, 0), or to move a value of the source register to the destination register (for example, mov r1, r2).

Next, we enter the loop. The loop starts with the again label, and in the loop we load the byte value from the current address (starts with the first character of the hello string), then we compare that byte with zero (checking the end of the string), and then we store that byte at the current address of the video memory. After that, we move to the next character and to the next video memory location (two bytes further), and jump back to the again label.

When all the characters are printed on the screen, the CPU halts (halt instruction).
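
To make the data flow of the loop easier to follow, here is a rough Python model of it. It is only a sketch under stated assumptions: memory is modeled as a flat bytearray, the word store is big endian (upper byte at the even address), and the function names are invented for this example. It does not emulate the real CPU.

```python
VIDEO_0 = 2400               # beginning of the text frame buffer
memory = bytearray(65536)    # assumed flat 64KB memory model

def store_word(addr, value):
    """Model of st [addr], reg: upper byte at the even address, lower byte next."""
    memory[addr] = (value >> 8) & 0xFF
    memory[addr + 1] = value & 0xFF

def print_string(text):
    """Model of the again loop; returns the final value of r2."""
    r2 = 0                            # mov r2, 0
    for ch in text:                   # ld.b r0, [r1] ... inc r1
        if ch == 0:                   # cmp r0, 0 / jz end
            break
        store_word(VIDEO_0 + r2, ch)  # st [r2 + VIDEO_0], r0
        r2 += 2                       # add r2, 2
    return r2

used = print_string(b"HELLO WORLD!\0")
print(used)  # 24 bytes of video memory for 12 characters
```

Each character occupies one 16-bit cell in the frame buffer, which is why the index advances by two while the string pointer advances by one.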


Interrupts


Let's look at the UART echo demo. This demo waits for a character to arrive via serial UART (115200 baud, one start bit, one stop bit, no parity), then prints that character on the screen, and finally echoes that character back to the UART:

#addr 0x400
; ########################################################
; REAL START OF THE PROGRAM
; ########################################################
mov sp, 1000

mov r0, 14
st [cursor], r0

; set the IRQ handler for UART to our own IRQ handler
mov r0, 1
mov r1, 16
st [r1], r0
mov r0, irq_triggered
mov r1, 18
st [r1], r0

halt

The code above sets the interrupt handling routine (irq_triggered) for the UART. This is IRQ1, and its handler is at the address 16 (0x0010). This means that whenever the serial UART subsystem receives a byte, the CPU will jump to the 0x0010 address. At that address, we have placed the JUMP instruction (j irq_triggered): the address 0x0010 holds the value 0x0001 (the JUMP instruction opcode), and the address 0x0012 holds the address of the irq_triggered routine (mov r0, irq_triggered followed by st [r1], r0).

That way, we have prepared the UART interrupt routine and the main program halts. The rest of the program is in the interrupt routine. Let's look at the interrupt routine:

; ##################################################################
; Subroutine which is called whenever some byte arrives at the UART
; ##################################################################
irq_triggered:
push r0
push r1
push r2   
push r5
push r6

in r1, [64] ; r1 holds now received byte from the UART (address 64 decimal)
ld r6, [cursor]
st [r6 + VIDEO_0], r1    ; store the UART character at the VIDEO_0 + r6
add r6, 2       ; move to the next location in the video memory
st [cursor], r6

loop2:
in r5, [65]   ; tx busy in r5
cmp r5, 0     
jz not_busy   ; if not busy, send back the received character 
j loop2
not_busy:
out [66], r1  ; send the received character to the UART
skip:
pop r6
pop r5
pop r2
pop r1                 
pop r0
iret

When the interrupt happens, the irq_triggered routine first pushes some registers on the stack, obtains the received byte from the UART (in r1, [64]), prints it on the screen, and then sends that character back through the UART (out [66], r1). Before sending, it polls the TX busy flag: in r5, [65] sets r5 to 1 while the UART is still busy sending a character, and to 0 otherwise. Finally, the routine pops the registers from the stack and returns (iret instruction).

The difference between iret and ret is that ret pops the return address from the stack and jumps to the obtained address (return from the call subroutine), while iret pops the return address, pops the flags, and then jumps to the obtained address (the interrupt routine might have changed the flags, so they are saved before the interrupt routine is invoked and restored during the iret execution).
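
The stack discipline behind ret and iret can be illustrated with a toy Python model. This is a sketch, not the real CPU: the class and its fields are invented for the example, and the push order on interrupt entry (flags first, then the return address) is assumed to be the mirror image of the pop order described above.

```python
class TinyCpu:
    """Toy model of call/ret and interrupt/iret stack handling."""

    def __init__(self):
        self.pc = 0
        self.flags = 0
        self.stack = []

    def call(self, target):
        self.stack.append(self.pc)      # push the return address
        self.pc = target

    def ret(self):
        self.pc = self.stack.pop()      # pop the return address only

    def interrupt(self, handler):
        self.stack.append(self.flags)   # push the flags first (assumed order)
        self.stack.append(self.pc)      # then push the return address
        self.pc = handler

    def iret(self):
        self.pc = self.stack.pop()      # pop the return address
        self.flags = self.stack.pop()   # then restore the flags

cpu = TinyCpu()
cpu.pc, cpu.flags = 0x0420, 0b0101
cpu.interrupt(0x0010)   # IRQ1 arrives
cpu.flags = 0           # the handler changes the flags (e.g. via cmp)
cpu.iret()
print(hex(cpu.pc), bin(cpu.flags))
```

After iret, both the program counter and the flags are back to their values from before the interrupt, which is exactly what a plain ret would not guarantee.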

All the examples are stored in the FPGACustomasm project on the github:
https://github.com/milanvidakovic/FPGAcustomasm/tree/master/examples/FPGA/raspbootin


Adding byte-related instructions

Adding byte-oriented instructions

This is a follow-up of my previous post about the FPGA Computer.

When I initially committed the FPGA Computer, the CPU was 16 bits wide in both address and data bus. Also, all the instructions were word-oriented, working with 16 bits. Even the memory was word-oriented, having 64KWords, not 64KB. At first, that looked promising, having double the amount of RAM memory compared to the usual 8-bit platforms (64KW compared to 64KB).

However, all the instructions were word oriented, making byte-oriented programs complicated. For example, the UART loader receives bytes, not words, since the UART is byte-oriented. That causes a problem when the loader has to receive the code from the UART:

in r1, [64] ; get the byte from the uart into r1

ld r2, [flip]
cmp r2, 0
jz do_flip       ; we have received the even byte
; at this moment, r1 holds the received byte
neg [flip] ; we have received the odd byte - time to complete the word out of those two bytes (even and odd)
ld r0, [current_byte] ; get the even byte from the memory (stored earlier)
shl r0, 8 ; shift it 8 bits to the left
or r0, r1 ; complete the word
ld r2, [current_addr] ; r2 holds the current pointer in memory to store the received byte
st [r2], r0 ; store the completed word into the memory
inc r2 ; move to the next location in memory
st [current_addr], r2  ; save the incremented value of the current address
ld r2, [current_size]  ; increment the byte counter
inc r2
st [current_size], r2
cmp r2, [size] ; did we receive all?
jz all_arrived
j skip

do_flip:
neg [flip]
st [current_byte], r1 ; we need to receive two bytes to form the word, so we are saving this byte before receiving the other
ld r2, [current_size] 
inc r2 ; increment the byte counter
st [current_size], r2

cmp r2, [size] ; did we receive all?
jz all_arrived_even

j skip ; return and wait for the next byte

all_arrived_even:
; at this moment, r1 holds the received byte
shl r1, 8 ; the upper byte is for the odd bytes
ld r2, [current_addr] ; r2 holds the current pointer in memory to store the received byte
st [r2], r1 ; store the incomplete word into the memory
all_arrived:

As you can see, the problem is with the word-oriented instructions and memory locations. Whenever a byte comes to the computer, it must be saved, then combined with the next byte that would come, and that combination then stored in memory as a 16-bit value.
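
The bookkeeping that the word-oriented loader has to do can be summarized in a few lines of Python. This is only an illustration of the packing rule used in the assembler code above (even byte shifted left by 8, then combined with the odd byte); the function name is made up.

```python
def pack_bytes_into_words(data):
    """Combine a byte stream into 16-bit words, even byte in the upper half."""
    words = []
    for i in range(0, len(data), 2):
        even = data[i]
        # If the stream ends on an even byte, the lower half stays zero,
        # like the all_arrived_even branch in the loader.
        odd = data[i + 1] if i + 1 < len(data) else 0
        words.append((even << 8) | odd)
    return words

print([hex(w) for w in pack_bytes_into_words(b"\x12\x34\x56")])
```

For the three bytes 0x12, 0x34, 0x56 this yields the words 0x1234 and 0x5600.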

That was the reason for the redesign. I have introduced the ".b" suffix. If the instruction has the ".b" suffix, it is byte-oriented. This also caused the change in the addressing. The data bus is still 16-bit wide, and all the memory operations are 16-bit, but the address range covers 64KB now, instead of 64KW. That way, all the addresses in the assembler are byte-oriented, not word-oriented.

This means that if the instruction does not have the ".b" suffix, it will work with the word-oriented memory location, aiming at the word at the given address. If that is the case, the address must be aligned to 16-bits (even).

For example, this instruction is word-oriented:

ld r0, [1000]

It loads the 16-bit content of the address 1000 (two bytes, one byte from the 1001 and the other from 1000) and stores that 16-bit value in the r0 register. The address must be even.

If the instruction has the ".b" suffix, then it is byte-oriented. The address in byte-oriented instructions can be both even and odd. This instruction is byte-oriented:

ld.b r0, [1001]

It loads the 8-bit value (one byte) from the address 1001 into the r0 register.

If a 16-bit word is stored in memory, it is stored as big endian, with the upper byte at the even address and the lower byte at the odd address. For example, the number 0x1234 stored at the 1000 address looks like this:

address  content
----------------
1000     0x12
1001     0x34
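
The same layout can be checked with a short Python sketch that models word and byte accesses on a flat byte array. The memory model and the helper names are assumptions made for this example; only the big-endian layout and the even-address rule come from the text.

```python
memory = bytearray(65536)  # assumed flat model of the 64KB address space

def store_word(addr, value):
    """Model of st [addr], reg: 16-bit big-endian store at an even address."""
    if addr % 2:
        raise ValueError("word access needs an even address")
    memory[addr:addr + 2] = value.to_bytes(2, "big")

def load_word(addr):
    """Model of ld reg, [addr]: 16-bit big-endian load from an even address."""
    if addr % 2:
        raise ValueError("word access needs an even address")
    return int.from_bytes(memory[addr:addr + 2], "big")

def load_byte(addr):
    """Model of ld.b reg, [addr]: any address, even or odd."""
    return memory[addr]

store_word(1000, 0x1234)
print(hex(load_byte(1000)), hex(load_byte(1001)), hex(load_word(1000)))
```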

Now let's look at the same UART loader code, having byte-oriented instructions:

in r1, [64] ; get the byte from the uart into r1

; at this moment, r1 holds the received byte
; r2 holds the current pointer in memory to store the received byte
ld r2, [current_addr]
st.b [r2], r1 ; store the received byte into the memory
inc r2 ; move to the next location in memory
st [current_addr], r2 ; save the incremented value of the current address
ld r2, [current_size] ; increment the byte counter
inc r2
st [current_size], r2
cmp r2, [size] ; did we receive all?
jz all_arrived
j skip

all_arrived:

As you can see, the code is shorter and easier to understand.

The same idea can be applied to strings. Now that we have the byte-oriented instructions, dealing with byte-oriented strings is easy. This code prints the hello string on the screen:

VIDEO_0 = 2400 ; beginning of the text frame buffer
mov r2, 0 ; r2 is the index
mov r1, hello ; r1 holds the address of the "HELLO WORLD" string
again:
; load r0 with the content of the memory location to which r1 points
ld.b r0, [r1]          
cmp r0, 0 ; if the current character is 0 (string terminator),
jz end ; go out of this loop 
st.b [r2 + VIDEO_0], r0 ; store the character at the VIDEO_0 + r2 
inc r1 ; move to the next character
add r2, 2 ; move to the next location in the video memory
j again ; continue with the loop
end:
halt
hello:
#str "HELLO WORLD\0"

Conclusion

This change in the design of the CPU led to much better assembler code. I have kept all the word-oriented instructions, and I have gained a whole bunch of byte-oriented instructions. I did lose 64KB of memory, but my FPGA didn't have 128KB of SRAM memory anyway.

Even if we try to make whole code word-oriented, we cannot skip 8-bit strings and protocols. That is why I have done this refactoring.

Here are github links:
- FPGAComputer
- FPGA Custom Assembler
- FPGA UART Loader (Raspbootin-like)
- FPGA Emulator

16-bit computer made using FPGA

16-bit computer on a FPGA

There are follow-ups of this topic:
- 32-bit FPGA computer,
- hardware sprites,
- the first game,
- PS/2 keyboard support,
- UART Loader,
- text mode,
- graphics mode,
- assembler,
- byte-oriented instructions.

I have made a 16-bit computer using my DE0-NANO FPGA board. The computer has 16-bit CPU, 64K 16-bit words, UART (115200 bps), and VGA (640x480, text-based frame buffer, 80x60 characters).

The Verilog code is on the github.


At first I used a breadboard


Then I wired the board as a shield for the DE0-NANO

The schematic of the current version is shown above

The 16-bit CPU has 8 general-purpose registers (r0 – r7), pc (program counter), sp (stack pointer), ir (instruction register), mbr (memory buffer register), h (higher word when multiplying, or remainder when dividing).
The address bus is 16 bits wide, addressing 65536 memory locations (words). The data bus is also 16 bits wide, and each memory location is 16 bits wide. This gives 65536 16-bit words, or 128KB.
Video output is VGA, 640x480. Text mode has 80x60 characters, each character being 8x8  pixels in dimensions. Video framebuffer in text mode has 4800 16-bit words (80x60 characters). The lower byte has the ASCII character, while the upper byte has the attributes (3 bits for the background color, 3 bits for the foreground color, inverted, and the last two bits unused).
It has two interrupts: IRQ0 and IRQ1. IRQ0 is connected to the KEY2 of the DE0-NANO, while IRQ1 is connected to the UART. Whenever a byte comes to the UART, it generates an IRQ1. An interrupt causes the CPU to push the flags to the stack, then to push the PC to the stack, and then to jump to the location designated for that interrupt:
  • for the IRQ0, it is 0x0004, and
  • for the IRQ1, it is 0x0008.
It is up to the programmer to put the code in those locations. Usually, it is a JUMP instruction. To return from the interrupt routine, it is necessary to put the IRET instruction. It pops the return address, and then pops the flags register, and then goes back into the interrupted program.
KEY1 of the DE0-NANO is used as the reset key. When pressed, it forces CPU to go to the 0x0000 address. Usually there is a JUMP instruction to go to the main program.

VGA text mode

Text mode is 80x60 characters, occupying 4800 words. Lower byte is the ASCII code of a character, while the upper byte is the attributes:
Bit     7  6  5  4  3  2  1  0
------------------------------
Color   -  -  r  g  b  r  g  b

Bits 7-6: unused
Bits 5-3: foreground color, inverted
Bits 2-0: background color

The foreground color is inverted so zero values (default) would mean white color. That way, you don't need to set the foreground color to white, and by default (0, 0, 0), it is white. The default background color is black (0, 0, 0). This means that if the upper (Attribute) byte is zero (0x00), the background color is black, and the foreground color is white.
Attributes provide 8 foreground and 8 background colors
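
As an illustration of the inverted foreground bits, here is a small Python sketch that builds the attribute byte and a full 16-bit character cell. The helper names are invented, and the bit order inside each 3-bit color field (r in the highest bit, b in the lowest) is read off the table above, so it should be checked against the Verilog source before relying on it.

```python
def make_attribute(fg=(1, 1, 1), bg=(0, 0, 0)):
    """Build the attribute byte from (r, g, b) tuples of 0/1 values."""
    fg_r, fg_g, fg_b = fg
    bg_r, bg_g, bg_b = bg
    # Foreground bits are stored inverted, so white (1, 1, 1) becomes 000
    fg_bits = ((fg_r ^ 1) << 2) | ((fg_g ^ 1) << 1) | (fg_b ^ 1)
    bg_bits = (bg_r << 2) | (bg_g << 1) | bg_b
    return (fg_bits << 3) | bg_bits

def make_cell(char, fg=(1, 1, 1), bg=(0, 0, 0)):
    """Build a frame buffer word: attributes in the upper byte, ASCII below."""
    return (make_attribute(fg, bg) << 8) | ord(char)

print(hex(make_cell("A")))                               # white on black
print(hex(make_cell("A", fg=(1, 0, 0), bg=(0, 0, 1))))   # red on blue
```

With the default arguments the attribute byte is zero, which matches the rule that a zero attribute means white text on a black background.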

VGA female connector is connected via resistors to the GPIO-0 expansion header of the DE0-NANO board:

  • GPIO_R (pin 2, GPIO_00, PIN_A3) -> 68Ohm -> VGA_R,
  • GPIO_G (pin 4, GPIO_01, PIN_C3) -> 68Ohm -> VGA_G,
  • GPIO_B (pin 6, GPIO_03, PIN_D3) -> 68Ohm -> VGA_B,
  • GPIO_HS (pin 8, GPIO_05, PIN_B4) -> 470Ohm -> VGA_HORIZONTAL_SYNC,
  • GPIO_VS (pin 10, GPIO_07, PIN_B5) -> 470Ohm -> VGA_VERTICAL_SYNC.

UART interface

UART interface provides TTL serial communication at 115200 bps. It uses one start bit, eight data bits, one stop bit, no parity, and no handshake.
UART is connected to the GPIO-0 expansion header of the DE0-NANO board:

  • TX (pin 32, GPIO_025, PIN_D9) should be connected to the RX pin of the PC,
  • RX (pin 34, GPIO_027, PIN_E10) should be connected to the TX pin of the PC.
UART is used within the CPU via the IN and OUT instructions. RX triggers the IRQ1, which means that whenever a byte is received via UART, the IRQ1 will be triggered, forcing the CPU to jump to the 0x0008 address. There you should place the JUMP instruction to your UART interrupt routine.
Inside the UART interrupt routine, you can get the received byte by using the IN instruction:

in r1, [64]; r1 holds now received byte from the UART 

To send a byte, you first need to check if the UART TX is free. You can do it by using the IN instruction:

loop:
      in r5, [65]   ; tx busy in r5
      cmp r5, 0    
      jz not_busy   ; if not busy, send back the received character
      j loop
not_busy:
      out [66], r1  ; send the character to the UART

Addresses used by the UART are in the following table:
Address  Description
--------------------------------------------------------------------------------
64       Received byte from the RX part of the UART (use the IN instruction).
65       0 if the TX part of the UART is free to send a byte, 1 if TX part is busy.
66       Byte to be sent must be placed here using the OUT instruction.
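
To get a feeling for how long the TX busy flag stays set, the frame timing can be worked out in a few lines of Python. The 10-bit frame follows from the UART settings above (one start bit, eight data bits, one stop bit); the 50 MHz figure is the DE0-NANO board clock, and how the UART module actually divides it is an assumption not taken from the Verilog source.

```python
BAUD = 115200                 # bits per second
BITS_PER_FRAME = 1 + 8 + 1    # start bit + data bits + stop bit
CLOCK_HZ = 50_000_000         # DE0-NANO board clock (assumed UART clock source)

bytes_per_second = BAUD / BITS_PER_FRAME
byte_time_us = 1e6 * BITS_PER_FRAME / BAUD
clocks_per_bit = CLOCK_HZ / BAUD

print(f"{bytes_per_second:.0f} bytes/s")
print(f"{byte_time_us:.1f} us per byte")
print(f"{clocks_per_bit:.2f} clock cycles per bit")
```

At roughly 87 microseconds per byte, the busy-wait loop around in r5, [65] spins for a few thousand clock cycles whenever a byte is still being sent.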

Assembler

The assembler for the CPU is again a fork of the customasm. This is a universal assembler, which can be used to generate machine code for any CPU. All you need is the instructions definition file. That file contains the list of instructions and the resulting machine code. For example, here are several instructions for my CPU:

#cpudef
{
#bits 16

#tokendef reg
{
r0 = 0
r1 = 1
r2 = 2
r3 = 3
r4 = 4
r5 = 5
r6 = 6
r7 = 7
sp = 8
h  = 9
}
nop  -> 16'0x0000
halt -> 16'0xfff0

mov {dest: reg}, {src: reg} -> src[3:0] @ dest[3:0] @ 4'0x0 @ 4'0x1
mov {dest: reg}, {value}    ->    4'0x0 @ dest[3:0] @ 4'0x1 @ 4'0x1 @ value[15:0]
...
}


The assembler is on the github.

UART loader

I have hardcoded the UART loader in the RAM memory. This means that whenever I start the computer, RAM contains the loader and after the reset, the loader is started. The loader initially sends the identification sequence to the PC, and the PC then sends the program to be executed. All this is done via serial port (UART). On my PC, I use the USB-to-serial (TTL) dongle. This way, I can develop programs on my PC and then load them on the board after the reset. The protocol for the loader is a modification of the Raspbootin protocol, which I have used for my bare metal programming.

Java client for uploading images is on the github.

Emulator

I have made an emulator for this computer. It is written in Java and it supports full-speed execution, break points, and step-by-step execution (both step-into and step-over). It has a separate window to display characters in the VGA framebuffer, and on that window, when you press a key, it will generate an IRQ1, emulating the UART reception of a byte ("received" byte is a pressed key).

Java-based emulator is on the github.

Conclusion

FPGA programming is fun, but it can also be painful, especially when you need to wait a couple of minutes for the design to be compiled and placed on your board. Also, as I said earlier, the learning curve is very steep. You can run into a lot of problems at the beginning. First of all, you need to figure out that all the lines in the Verilog are "executed" in parallel. You actually need to imagine the resulting hardware (or at least to have an idea what will be created) when programming in Verilog.
Next, you need to read warnings! Some of those warnings will actually tell you what you have done wrong in your design. I have also run into problems with the operator precedence in Verilog. I had to place brackets to make an explicit order of evaluation, because it was different from my usual experience (in other programming languages).
I used my oscilloscope extensively during the VGA signal generation and UART development. Without it I would not have been able to make them work.
Horizontal sync pulses are shown on the oscilloscope

Also, I have used Icarus Verilog to run the simulations. It is way faster than the Altera ModelSim tool, and much easier to work with. Unfortunately, it cannot detect timing problems, which can occur and make your design fail. If that is the case, read those warnings again!
I will expand this text and write more texts about this computer and my experience with the FPGA programming.



Generating VGA video signals using Raspberry Pi and then FPGA

I have recently found a couple of posts (for example, this and this) on the net about generating VGA signals using just a CPU. All those posts talk about generating 640x480 VGA video signals using an Arduino.
I have decided to try to generate VGA signals using Raspberry PI 3. I have read that you can actually get the VGA output from the RPI without much trouble, since that feature is built into the Broadcom SoC. However, I didn't want to use the built-in feature; I wanted to generate the signals myself.
It proved to be quite difficult. First I tried the lowest possible level library, under the Raspbian OS:

// Set GPIO pins to output
INP_GPIO(VIDEO); // must use INP_GPIO before we can use OUT_GPIO
OUT_GPIO(VIDEO);
INP_GPIO(H_SYNC); // must use INP_GPIO before we can use OUT_GPIO
OUT_GPIO(H_SYNC);
INP_GPIO(V_SYNC); // must use INP_GPIO before we can use OUT_GPIO
OUT_GPIO(V_SYNC);

vSyncLow();
hSyncLow();
vgaOff();
nsleep(99999999);
while(1) {
line = 0;
vSyncLow();

while(line < 600){
    //2.2uS Back Porch
    delayMicroseconds(2);  
    
    //20uS Color Data
    vgaOn();  //High
    delayMicroseconds(10); // 10uS
    //Red Color Low
    vgaOff();  //Low
    delayMicroseconds(10); // 10uS
    
    //1uS Front Porch
    delayMicroseconds(1); // 1uS 
    line++;
    
    //3.2uS Horizontal Sync
    hSyncHigh();  //HSYNC High
    delayMicroseconds(3);
    hSyncLow();  //HSYNC Low
    
    //26.4uS Total
  }
  //Clear the counter
  line=0; 
  //VSYNC High
  vSyncHigh();
  //4 Lines Of VSYNC   
  while(line < 4){         
    //2.2uS Back Porch    
    delayMicroseconds(2);
    
    //20 uS Of Color Data
    delayMicroseconds(20);// 20uS
    
    //1uS Front Porch
    delayMicroseconds(1); // 1uS
    line++;
    
    //HSYNC for 3.2uS
    hSyncHigh();  //High
    delayMicroseconds(3);
    hSyncLow();  //Low  
    
    //26.4uS Total
  }
  
  //Clear the counter
  line = 0;
  //VSYNC Low
  vSyncLow();
  //22 Lines Of Vertical Back Porch + 1 Line Of Front Porch
  while(line < 22){
      //2.2uS Back Porch
      delayMicroseconds(2);

      //20uS Color Data
      delayMicroseconds(20);// 20uS
        
      //1uS Front Porch
      delayMicroseconds(1); // 1uS
      line++;
      
      //HSYNC for 3.2uS
      hSyncHigh();  //High
      delayMicroseconds(3);
      hSyncLow();  //Low  

      //26.4uS Total
  }     
}

The result was this:


The monitor has some physical damage, but that is not the problem. The problem is the bad synchronization. The timing is critical: since the pixel frequency is approx. 25MHz, the time for a single pixel is 40 nanoseconds. In the picture, it is obvious that the lines do not start at the same time, which means that the horizontal sync pulses do not fire at the precise time. They miss their time by a couple of hundred nanoseconds.
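
The arithmetic behind that observation is short enough to write down in Python. The 25 MHz pixel clock is the approximate value from the text, and the 200 ns jitter is just an example figure for "a couple of hundred nanoseconds".

```python
PIXEL_CLOCK_HZ = 25_000_000              # approximate pixel clock
pixel_time_ns = 1e9 / PIXEL_CLOCK_HZ     # duration of one pixel

def jitter_in_pixels(jitter_ns):
    """How many pixels a line shifts when HSYNC fires jitter_ns late."""
    return jitter_ns / pixel_time_ns

print(pixel_time_ns)            # 40.0
print(jitter_in_pixels(200))    # 5.0
```

A sync pulse that is only 200 ns late already shifts the whole line by five pixels, which is why the lines in the picture look ragged.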

OK, this could be due to multitasking in the Linux OS. So, I moved to bare metal programming. I have found a nice bare metal library on GitHub, here:


The author did a great job of making nice, readable examples. I have modified one of his examples and made a VGA signal generator using the built-in timer interrupt, which is triggered every 3 microseconds:

// clear pending irq
*ARMTIMER_ACQ = 1;
//*TIMER_BASE = 2;
//printf("Inside dbg_main, counter: %d ", counter);
if (counter == 0) {
GPIO_SET(V_SYNC);
GPIO_SET(H_SYNC);
  GPIO_SET(VIDEO);
} else  
if (counter == 1) {
GPIO_CLR(H_SYNC);
  GPIO_CLR(VIDEO);
} else 

if ((counter % 8) == 0) {
GPIO_SET(H_SYNC);
  GPIO_SET(VIDEO);
} else 
if ((counter % 8) == 1) {
  GPIO_CLR(VIDEO);
GPIO_CLR(H_SYNC);
}

if (counter == 17) {
GPIO_CLR(V_SYNC);
}

counter++;
if (counter == 4700) {
counter = 0;
}

However, the picture was not much better:


The timing is a bit better, but the horizontal sync pulses still miss the right time to fire.

I did all of this while waiting for the DE0-NANO FPGA board, which I chose to play with in order to generate VGA signals. When it finally arrived, I was able to properly generate VGA signals:


Then I added color and some text (top left corner):


The damage to the LCD is visible here, but it is good enough for development.

Here is the Altera DE0-NANO FPGA board:


I have purchased a female VGA connector and connected GPIO pins from the board to the connector:
  • GPIO_R -> 68 Ohm -> VGA_R
  • GPIO_G -> 68 Ohm -> VGA_G
  • GPIO_B -> 68 Ohm -> VGA_B
  • GPIO_HS -> 470 Ohm -> VGA_HORIZONTAL_SYNC
  • GPIO_VS -> 470 Ohm -> VGA_VERTICAL_SYNC

I have found various values for the resistors on various sites (from direct connections, to 68 Ohms, 100 Ohms, 500 Ohms, etc.), but this schematic works for me.

The Verilog code is quite simple. I have recycled the FizzBuzz example made by Ken Shirriff:

module vga(
//////////// CLOCK //////////
input CLOCK_50,  // this is 50MHz clock
//////////// KEY //////////
input KEY,             // reset key (one of two onboard keys)
//////////// GPIO //////////
output reg r,
output reg g,
output reg b,
output wire hs,
output wire vs
);

//=======================================================
//  REG/WIRE declarations
//=======================================================
reg clk25; // 25MHz signal (clk divided by 2)
reg newframe;
reg newline;

reg [9:0] x;
reg [9:0] y;
wire valid;

reg [7:0] xx;
reg [7:0] yy;

reg [7:0] framebuffer [9:0]; // 10  bytes text-based framebuffer
wire [6:0] counter;
wire [7:0] pixels; // Pixels making up one row of the character

//=======================================================
//  Structural coding
//=======================================================
initial begin
framebuffer[0] = "0";
framebuffer[1] = "1";
framebuffer[2] = "2";
framebuffer[3] = "3";
framebuffer[4] = "A";
framebuffer[5] = "a";
framebuffer[6] = "B";
framebuffer[7] = "b";
framebuffer[8] = "8";
framebuffer[9] = "9";
end
// Character generator

chars chars_1(
  .char(framebuffer[counter]),
  .rownum(y[2:0]),
  .pixels(pixels)
  );

assign hs = x < (640 + 16) || x >= (640 + 16 + 96);
assign vs = y < (480 + 10) || y >= (480 + 10 + 2);
assign valid = (x < 640) && (y < 480);
assign counter = (valid)?(x >> 3):0;

always @(posedge CLOCK_50) begin
newframe <= 0;
newline <= 0;
if (!KEY) begin
x <= 10'b0;
y <= 10'b0;
clk25 <= 1'b0;
newframe <= 1;
newline <= 1;
end
else begin
clk25 <= ~clk25;
if (clk25 == 1'b1) begin
if (x < 10'd799) begin
x <= x + 1'b1;
end
else begin
x <= 10'b0;
newline <= 1;
if (y < 10'd524) begin
y <= y + 1'b1;
end
else begin
y <= 10'b0;
newframe <= 1;
end
end
end
end

if (valid) begin

if (x < 80 && y < 8) begin
r <= pixels[7 - (x & 7)];
g <= pixels[7 - (x & 7)];
b <= pixels[7 - (x & 7)];
end
else begin
r <= (x < 213) ? 1 : 0;
g <= (x >= 213 && x < 426) ? 1 : 0;
b <= (x >= 426) ? 1 : 0;
end
end
else begin
// blanking -> no pixels
r <= 0;
g <= 0;
b <= 0;
end
end
endmodule

Verilog programming is not simple, and it has a steep learning curve. Another problem can be the long compile time in the Quartus II IDE: for code only a bit more complex than this VGA project, the compile time easily exceeds a couple of minutes. I have solved this problem by installing Icarus Verilog, which compiles the Verilog code in a fraction of a second. This is because Icarus Verilog is not intended to deploy the Verilog code to the actual hardware; it is intended for simulation only. This way, I can get the code running and correct quickly, then copy it into the Quartus II IDE, build the project, and deploy it to the real hardware.