Wednesday, 10 January 2018

Bringing the internal 3.5" floppy drive to life - part 1

While it will take a while longer before it can read or write disks, the next step for the internal floppy drive is to make it mirror what the SD-card floppy emulation thinks the drive should be doing: for example, when the motor should spin, when the light should come on, and when the head should step in and out across tracks. I also want to have debug registers that allow me to directly control and read the floppy drive state, so that I can work towards being able to read and write real disks.

So, first step is to plumb the floppy status and control signals into the VHDL module that handles F011 emulation and SD card access, and provide a debug register for those.  This is already done, and $D6A0 is the register.  Writing sets control lines, and reading reads the status lines. As this is a debug register, you have to remember the state of the control lines yourself.  The control lines are:

7              f_density - 0=1.44MB, 1=720K
6              f_motor - 0 = motor on, 1 = motor off
5              f_select - 0 = drive selected (and LED on), 1 = drive not selected
4              f_stepdir  - 0 = step inwards, 1 = step outwards
3              f_step - 0 = generate step pulse (two required per track)
2              f_wdata - bit to write
1              f_wgate - 0 = turn write head on
0              f_side1 - 0 = head on side 1 selected, 1 = head on side 0 selected

The status lines are:

7              f_index - 0 when passing over index hole
6              f_track0 - 0 when head is over track 0
5              f_writeprotect - 0 when disk is write protected
4              f_rdata - data bit read from head
3              f_diskchanged - 0 when disk has been changed
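
The two bit layouts above can be written down as a small helper, which is handy for working out what to write to $D6A0 and for reading back what the drive reports. This is only a sketch: the names `CONTROL_BITS`, `STATUS_BITS`, `control_byte` and `status_lines` are made up for illustration and do not come from the MEGA65 sources. Only the bit positions are taken from the lists above.

```python
# Bit positions of the $D6A0 debug register, as listed above.
# All names are illustrative; only the bit layout comes from the text.
CONTROL_BITS = {
    "density": 7, "motor": 6, "select": 5, "stepdir": 4,
    "step": 3, "wdata": 2, "wgate": 1, "side1": 0,
}
STATUS_BITS = {
    "index": 7, "track0": 6, "writeprotect": 5, "rdata": 4, "diskchanged": 3,
}

def control_byte(**lines):
    """Build a value to write. Lines that are not named are left at 1."""
    value = 0xFF
    for name, level in lines.items():
        bit = 1 << CONTROL_BITS[name]
        value = (value | bit) if level else (value & ~bit)
    return value & 0xFF

def status_lines(value):
    """Split a value read from the register into named line levels (0 or 1)."""
    return {name: (value >> bit) & 1 for name, bit in STATUS_BITS.items()}

# Motor on and drive selected (both active low), everything else left high.
print(hex(control_byte(motor=0, select=0)))  # 0x9f
print(status_lines(0x1F))
```

Because the register does not remember what was written, a real test program would keep the last control value in a variable and change one bit at a time.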

I have already confirmed that I can make the motor and LED turn on and off. Next step is to write a little test program that lets me test all of these functions.

To do this, I need to make a set of pull-up resistors for the floppy interface, as the revision 1 PCB lacks pull-ups on the status lines, so once they go to 0, they never go back to 1.  The pull-up resistors gently pull the lines high again, so that when the floppy stops driving them low, they return high, and thus back to a 1.  As previously mentioned, it is rather annoying that the 34-pin floppy cable has no +5V line, which makes it a nuisance to make a pull-up kit that can fit on the cable. Fortunately, however, there are some lines that we control, and can drive high.  For current testing, the density select line can stay +5V, since we don't need to do any 1.44MB disk access just yet. Here is my little home-made pull-up kit:


With this plugged in, I can easily see when the track 0 sensor etc. change state, so all is good. I can also see data being read from the test disks, so that answers that question. Writing is more complex, so can't be immediately tested.

So, with that working, now I want to get reading data from the floppy working.  The first step is to acquire a decent slab of data from one of the tracks on the floppy.  The trick is how to capture it at a decent rate, since floppy data pulses come at approximately 500kHz, but the pulses are only 150ns - 800ns wide.  The easiest solution is to DMA read the register, as this will read every two clock cycles (the alternate clock cycle will be writing the value that had been read into memory), for a sample time of 40ns. This creates a data capture problem, because the MEGA65 has only a limited amount of RAM. A single DMA can read 56K samples (we could push it to 64K, but that would require fiddling with memory a bit more). At 40ns per sample, that equates to 57,344 x 40ns = 2,293,760ns = 2.29 milliseconds.  Given that a floppy spins at 300 RPM = 5 revolutions per second, a single rotation is 200 ms, so we are sampling only ~1% of a track this way, admittedly at very high resolution.  This is not really enough to capture even a single sector for decoding. However, what it is useful for is to let me see just how long the pulses are on this floppy.  Here is a sample of the captured data:

:0012310 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F
 :0012320 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F
 :0012330 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F
 :0012340 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F
 :0012350 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F
 :0012360 1F 1F 0F 0F 0F 0F 0F 0F 0F 0F 1F 1F 1F 1F 1F 1F
 :0012370 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F
 :0012380 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F
 :0012390 1F 1F 1F 1F 1F 0F 0F 0F 0F 0F 0F 0F 0F 1F 1F 1F
 :00123A0 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F
 :00123B0 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F
 :00123C0 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F
 :00123D0 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F

We know from the table of status lines above that it is only the upper 5 bits that matter.  These samples all have bit 7 (index hole sense), bit 6 (track 0) and bit 5 (write-protect) equal to 0, which means asserted. That is, we are reading track 0 on a write-protected disk, while the index hole is passing.  Bit 4 is the data bit itself, and we see that it is mostly high, with a couple of low pulses lasting 8 samples each, i.e., 8 x 40 ns = 320 ns in duration.  The length of the pulses is within the specified range, so that looks good.  In this case, the pair of pulses are 51 samples, i.e., 51 x 40 = 2040 ns = ~2 usec apart.  Further pulses appear throughout the capture at varying intervals, as we expect.
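
The pulse measurements above can be reproduced mechanically. The following is a minimal sketch that scans a capture for runs where bit 4 is low and reports their width and spacing. The function name `find_pulses` is illustrative, and the `capture` bytes are simply the dump above rebuilt from its run lengths.

```python
SAMPLE_NS = 40  # one DMA sample every 40 ns, as described above

def find_pulses(samples, bit=4):
    """Return (start_index, width_in_samples) for each run where `bit` is 0."""
    pulses = []
    start = None
    for i, s in enumerate(samples):
        low = ((s >> bit) & 1) == 0
        if low and start is None:
            start = i                       # falling edge: a pulse begins
        elif not low and start is not None:
            pulses.append((start, i - start))
            start = None                    # rising edge: the pulse has ended
    if start is not None:                   # capture ended in the middle of a pulse
        pulses.append((start, len(samples) - start))
    return pulses

# The 208 bytes of the dump above ($12310-$123DF), rebuilt from run lengths.
capture = bytes([0x1F] * 82 + [0x0F] * 8 + [0x1F] * 43 + [0x0F] * 8 + [0x1F] * 67)

pulses = find_pulses(capture)
for start, width in pulses:
    print(f"pulse at sample {start}: {width * SAMPLE_NS} ns wide")
gap = pulses[1][0] - pulses[0][0]
print(f"start-to-start gap: {gap * SAMPLE_NS} ns")
```

Run over a whole 56K capture, the list of start-to-start gaps is exactly the input that MFM decoding needs.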

Without going into the gory detail of how MFM works, a summary is that the gaps between the pulses vary to encode the information.  For a given data rate, gaps of 1, 1.5 and 2 time units are possible, corresponding to the reception of the following bit sequences:

+-----------------+--------+--------+
|Last received bit|Interval|New Bits|
+-----------------+--------+--------+
|      NONE       |   1.0  |11 or 00|
|      NONE       |   1.5  |   01   |
|      NONE       |   2.0  |  101   |
|        0        |   1.0  |    0   |
|        1        |   1.0  |    1   |
|        0        |   1.5  |    1   |
|        1        |   1.5  |   00   |
|        X        |   2.0  |   01   |
+-----------------+--------+--------+
There are some exceptions to this for synchronisation marks, where the pattern can be different.  In particular, the common "A1" sync mark consists of intervals of 2.0, 1.5, 2.0 and 1.5.  Encoding $A1 using MFM would normally result in gaps of 2.0 (101), 1.5 (00), 1.0 (0), 1.0 (0), 1.5 (1), where values in brackets are the bits of the byte $A1 (= 10100001 in binary) being encoded. The ambiguity at the start of a byte is solved by using these sync bytes to first synchronise the decoder at the start of a string of bytes.
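
The table above translates almost directly into a lookup. The following is a rough sketch of that idea rather than a finished decoder: `unit_ns` (the basic time interval) is a parameter the caller has to supply, the names are illustrative, and sync mark detection is left out. One choice made here that the table does not dictate: an interval of 1.0 with no previous bit is ambiguous (11 or 00), so it is skipped until an unambiguous interval arrives.

```python
ALLOWED = (1.0, 1.5, 2.0)

def quantise(gap_ns, unit_ns):
    """Round a measured gap to the nearest of 1.0, 1.5 or 2.0 time units."""
    ratio = gap_ns / unit_ns
    return min(ALLOWED, key=lambda q: abs(q - ratio))

# (last received bit, interval) -> new bits, transcribed from the table above.
TABLE = {
    (None, 1.5): "01",
    (None, 2.0): "101",
    ("0", 1.0): "0",
    ("1", 1.0): "1",
    ("0", 1.5): "1",
    ("1", 1.5): "00",
    ("0", 2.0): "01",
    ("1", 2.0): "01",
}

def decode_gaps(gaps_ns, unit_ns):
    """Turn a list of pulse-to-pulse gaps into a string of data bits."""
    bits = ""
    last = None
    for gap in gaps_ns:
        new = TABLE.get((last, quantise(gap, unit_ns)))
        if new is None:      # (None, 1.0): cannot tell 11 from 00 yet
            continue
        bits += new
        last = bits[-1]
    return bits

# The normal (non-sync) MFM encoding of $A1: gaps of 2.0, 1.5, 1.0, 1.0, 1.5,
# here with a little jitter added and a hypothetical 2000 ns time unit.
bits = decode_gaps([4100, 2900, 2100, 1950, 3050], unit_ns=2000)
print(bits, hex(int(bits, 2)))
```

Rounding each gap independently is the whole of the error handling in this sketch, so it tolerates jitter of up to a quarter of a time unit per gap and nothing more.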

We should therefore expect to see disk data consisting of long runs of intervals that are 1.0, 1.5 and 2.0 times the basic time interval.  The first time I captured data, I was seeing all sorts of crazy intervals, which had me thinking that something was terribly wrong.  However, second time around, the stream looks much better, with intervals all within a 1:2 range of size ratios, as we expect.

So now I need to write a little program that tries to find the sync marks in the data stream, and then begin decoding the data from there, to see if it looks like what it should be. Here is how one of the traces decoded:

$e4 $e4 $e4 $e4 $ce $4e $4e $4e $4e $4e $4e $4e $4e $4e $4e $4e
 $4e $4e $4e $4e $4e $4e $4e $4e $4e $4e $4e $4e $4e $4e $4e $4e $4e
 $4e $4e $4e $00 $00 $00 $00 $00 $00 $00 $00 $00 $00 $00 $00
Sync $A1
Sync $A1
Sync $A1
 $fe $01 $01 $01 $02 $8b $eb $4e $4e $4e $4e $4e $4e $4e $4e $4e
 $4e $4e $4e $4e $4e $4e $4e $4e $4e $4e $4e $4e $4e
$00 $00 $00 $00
 $00 $00 $00 $00 $00 $00 $00 $00
Sync $A1
Sync $A1
Sync $A1
 $fb $00 $00 $00 $00 $00 $00 $00 $00 $00 $00


The 1581 service manual describes the byte sequences that we should expect to see, as:

12 bytes of 00
3 bytes of Hex A1 (Data Hex A1, Clock Hex 0A) <- i.e., $A1 Sync
1 byte of FE (ID Address Mark)
1 byte (Track number)
1 byte (Side number)
1 byte (Sector number)
1 byte (Sector length. 02 for 512 byte sectors)
2 bytes CRC (cyclic redundancy check)
22 bytes of Hex 22
12 bytes of 00
3 bytes of Hex A1 (Data Hex A1, Clock Hex 0A) <- i.e., $A1 Sync
1 byte of Hex FB (Data Address Mark)
512 bytes of data
2 bytes CRC (cyclic redundancy check)
38 bytes of Hex 4E

If we try to match this with what we saw, it is pretty close.  There is some junk at the beginning, that looks like the tail end of the 38 bytes of $4E, allowing for lack of synchronisation, then we see the start of sector 1, track 1, side 1 (the $fe $01 $01 $01 $02 header after the first three syncs), following the prescribed format exactly, with the exception that the 22 gap bytes are hex $4E, not $22. It is possible that $22 is a typo in the 1581 service guide, given that there are 22 bytes stipulated.  Indeed, we find evidence that this is the case from the C65 specifications guide, which describes the on-disk format as:


  quan      data/clock      description
  ----      ----------      -----------
    12      00              gap 3*
    3       A1/FB           Marks
            FE              Header mark
            (track)         Track number
            (side)          Side number
            (sector)        Sector number
            (length)        Sector Length (0=128,1=256,2=512,3=1024)

    2       (crc)           CRC bytes
    23      4E              gap 2
    12      00              gap 2
    3       A1/FB           Marks
            FB              Data mark
    128,
    256,
    512, or
    1024    00              Data bytes (consistent with length)
    2       (crc)           CRC bytes
    24      4E              gap 3*

    * you may reduce the size of gap 3 to increase diskette capacity,
      however the sizes shown are suggested.

Here we have the byte value $4E specified for gap 2. However, it suggests that the gap should be 23 bytes long, not the 22 that we have observed and that the 1581 service manual stipulates.
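
To make the header layout concrete, here is a small sketch that picks apart the bytes following the three $A1 sync marks, using the field order from the 1581 list and the length codes from the C65 table. The function name `parse_id_field` is illustrative, and the CRC bytes are only carried along, not checked.

```python
# Sector length codes from the C65 table above.
SECTOR_SIZES = {0: 128, 1: 256, 2: 512, 3: 1024}

def parse_id_field(data):
    """Parse the 7 bytes that follow the three $A1 syncs of a sector header."""
    if len(data) < 7 or data[0] != 0xFE:
        raise ValueError("not an ID address mark")
    track, side, sector, length_code = data[1:5]
    return {
        "track": track,
        "side": side,
        "sector": sector,
        "size": SECTOR_SIZES[length_code],
        "crc": bytes(data[5:7]),   # kept as raw bytes; not verified here
    }

# The header bytes from the decoded trace above.
header = bytes.fromhex("fe 01 01 01 02 8b eb")
print(parse_id_field(header))
```

A higher-level decoder could use the returned track and sector to decide whether the data field that follows is the one being looked for.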

Nonetheless, we have proven that we can read sensible data from the disk, and use a simple table of relative gap size to drive decoding of the MFM data.  What we have not discussed here is how to deal with the variable (and varying) disk rotation speed, and the errors that it can introduce.  A simplistic perspective on this is that we have to have something approximating a phase-locked loop that recaptures the clock on each transition that is encountered.  There are various ways to do this.  The C65 floppy controller supports three such algorithms, none of which exactly corresponds to the method I have described here, of considering the gap intervals, rather than the arrival time of the transitions.  The error correction in my scheme relies on quantising the gap intervals to be either 1.0, 1.5 or 2.0, with values in between being rounded to the nearest of those.  It remains to be seen how well (or badly) my scheme works in practice. I have also not yet covered generating or checking the CRC of sectors.

However, we now have enough information to be able to create a VHDL implementation that takes the raw input, extracts gap intervals, quantises the gap intervals, detects sync marks, and extracts the sync and byte stream, and can feed this into a higher-level decoder that can check the track and sector that has been found, and extract the sector data.  Indeed, I can use the captured sequence as a test vector into this.  In short, we are well on our way to being able to read 3.5" floppies using the internal drive.

Testing Ethernet on the r1 PCB

There are now only a couple of interfaces left to test on the r1 PCB: HDMI output and the 100 Mbit ethernet port.  Ethernet is the next on the list, as it should, in principle, be easy to test, as we already had the same ethernet hardware on the Nexys4 DDR boards.  Thus, it really comes down to verifying that the pin assignments are correct.

However, it has been ages since we used the ethernet interface in earnest, in part because there is still a bug in my VHDL ethernet controller when transmitting (bits get corrupted, most probably due to a timing problem).

Thus, the first step was to get back to a working setup on the Nexys4 DDR board, where I could verify that I had a working test procedure.

The setup was quite simple.  The etherload program is a tiny program that listens for incoming ethernet packets on the MEGA65's ethernet interface, and if they are UDP packets on port 4510, it executes the contents of the packet in memory. This is used by a companion program on a computer connected via ethernet to send 1KB pieces of a program to be loaded, together with the little routine to copy it into place.  This scheme allows the ethernet loading program to be <256 bytes in length, including the ability to respond to ARP requests (although with the ethernet transmission problem, this is currently not very useful).
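
The kind of filter described above (IPv4, UDP, destination port 4510) can be sketched in a few lines. This is not the etherload code, which is 6502 assembly and may check fewer fields. The function name is illustrative, and the test frame is the header portion of one of the etherload packets captured later in this post, without the two length bytes that the MEGA65 puts in front of it.

```python
import struct

ETHERLOAD_PORT = 4510  # UDP port named in the text

def is_etherload_candidate(frame):
    """True if an ethernet frame carries IPv4/UDP addressed to port 4510."""
    if len(frame) < 14 + 20 + 8:
        return False
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != 0x0800:                      # not IPv4
        return False
    if frame[14] >> 4 != 4 or frame[14 + 9] != 17:
        return False                             # not version 4, or not UDP
    ihl = (frame[14] & 0x0F) * 4                 # IPv4 header length in bytes
    if len(frame) < 14 + ihl + 8:
        return False
    (dst_port,) = struct.unpack_from("!H", frame, 14 + ihl + 2)
    return dst_port == ETHERLOAD_PORT

frame = bytes.fromhex(
    "ffffffffffff 1005019ffcfd 0800"                      # ethernet header
    "4500009cb9794000 4011fc85 c0a80102 c0a801ff"         # IPv4 header
    "ce1f119e 0088a155"                                   # UDP header
)
print(is_etherload_candidate(frame))
```

Note that the destination MAC in that frame is the broadcast address, which is why the sender does not need ARP to work.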

So, I loaded and ran the etherload program on the MEGA65 on a Nexys4DDR board, connected an ethernet cable, and then ran the etherload program on the Linux laptop at the other end of the ethernet cable. Without ARP, the IP address to send to must be a valid broadcast address on the ethernet interface.  I used a command like:

etherload 192.168.1.255 ../c64/games/gyrrus.prg

When etherload is running on the MEGA65 and waiting for packets, it looks like this:



(Note that etherload is so old, that it doesn't explicitly set the CPU to 50MHz, so I had to POKE0,65 before running it to do this. Otherwise it is too slow, and won't capture the packets coming in on the 100 Mbit/sec link.)

Then, when it is finished, it drops back to the ready prompt, like this:



The squiggly characters are drawn one per packet loaded, with the position matching the address of the packet loaded, so that you can see if there are any gaps, which would indicate missed packets. None here, so I could happily run Gyrrus, which worked fine.

So, at this point, I have a test procedure that I can attempt on the r1 PCB.

Trying this on the MEGA65, I see the ethernet link light come on when the ethernet is plugged in, and the ethernet LED blinks on receiving the packets, but the etherload program shows no sign of having seen the packets.  Time to investigate.  Pausing the CPU, and looking at $D6E1 to see if the ethernet controller thinks that any packets have been received shows no signs of life. 

As I have had to debug this once before on the Nexys boards, there is a debug register at $D6E0 that shows the current status of the ethernet receive lines.  Thus I can write a little routine that continually draws the contents of that register on the screen, and try sending it a packet to see if we see signs of life. 

This initially showed no signs of life, so I wrote a program to talk to the ethernet controller via the MIIM / MDIO interface, a two-wire interface that can be used to check the current connection and settings, and to set various link parameters.

After some trial and error, I was able to talk to the MIIM interface, and read out the various registers, which showed the link autonegotiating and coming up when a cable was connected.  So I tried again to write a little routine that shows the state of the ethernet interface registers. This time, I wrote the routine to increment a location on screen based on the contents of $D6E0, as a more robust way of seeing what is happening.  This showed that the RX lines were toggling, and that the RX valid line was also changing state when packets were flying on the ethernet connection.  However, etherload still failed to see any packets.

Back when I first implemented ethernet for the Nexys4 boards, I added a feature to allow reading the values arriving on the ethernet RX lines into a buffer to help debug the implementation.  That same function is now helpful for trying to work out what is going on here.  It confirms that the data bits are being received, and that they, in general, look right.  Digging deeper, I can see that packet data is being received, but no packet reception is reported. This most likely means that the CRC is invalid.  Fortunately, when a packet is rejected due to the CRC, it still gets written into the packet buffer.  Here is what I saw after receiving a 500 byte ping packet:

 :FFDE800 00 80 BD 00 5E 00 00 FB 10 05 01 9F FC FD 08 00
 :FFDE810 45 00 00 A9 9A 5F 40 00 FF 11 3E 3E C0 A8 01 02
 :FFDE820 E0 00 00 FB 14 E9 14 E9 00 95 A6 AF 00 00 00 00
 :FFDE830 00 09 00 00 00 00 00 00 05 5F 69 70 70 73 04 5F
 :FFDE840 74 63 70 05 6C 6F 63 61 6C 00 00 0C 00 01 04 5F
 :FFDE850 66 74 70 C0 12 00 0C 00 01 07 5F 77 65 62 64 61
 :FFDE860 76 C0 12 00 0C 00 01 08 5F 77 65 62 64 61 76 73
 :FFDE870 C0 12 00 0C 00 01 09 5F 73 66 74 70 2D 73 73 68
 :FFDE880 C0 12 00 0C 00 01 04 5F 73 6D 62 C0 12 00 0C 00
 :FFDE890 01 0B 5F 61 66 70 6F 76 65 72 74 63 70 C0 12 00
 :FFDE8A0 0C 00 01 04 5F 6E 66 73 C0 12 00 0C 00 01 04 5F
 :FFDE8B0 69 70 70 C0 12 00 0C 00 BD 8D 8E 8F 90 91 92 93
 :FFDE8C0 94 95 96 97 98 99 9A 9B 9C 9D 9E 9F A0 A1 A2 A3
 :FFDE8D0 A4 A5 A6 A7 A8 A9 AA AB AC AD AE AF B0 B1 B2 B3
 :FFDE8E0 B4 B5 B6 B7 B8 B9 BA BB BC BD BE BF C0 C1 C2 C3
 :FFDE8F0 C4 C5 C6 C7 C8 C9 CA CB CC CD CE CF D0 D1 D2 D3
 :FFDE900 D4 D5 D6 D7 D8 D9 DA DB DC DD DE DF E0 E1 E2 E3
 :FFDE910 E4 E5 E6 E7 E8 E9 EA EB EC ED EE EF F0 F1 F2 F3
 :FFDE920 F4 F5 F6 F7 F8 F9 FA FB FC FD FE FF 00 01 02 03
 :FFDE930 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F 10 11 12 13
 :FFDE940 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F 20 21 22 23
 :FFDE950 24 25 26 27 28 29 2A 2B 2C 2D 2E 2F 30 31 32 33
 :FFDE960 34 35 36 37 38 39 3A 3B 3C 3D 3E 3F 40 41 42 43
 :FFDE970 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F 50 51 52 53
 :FFDE980 54 55 56 57 58 59 5A 5B 5C 5D 5E 5F 60 61 62 63
 :FFDE990 64 65 66 67 68 69 6A 6B 6C 6D 6E 6F 70 71 72 73
 :FFDE9A0 74 75 76 77 78 79 7A 7B 7C 7D 7E 7F 80 81 82 83
 :FFDE9B0 84 85 86 87 88 89 8A 8B 8C 8D 8E 8F 90 91 92 93
 :FFDE9C0 94 95 96 97 98 99 9A 9B 9C 9D 9E 9F A0 A1 A2 A3
 :FFDE9D0 A4 A5 A6 A7 A8 A9 AA AB AC AD AE AF B0 B1 B2 B3
 :FFDE9E0 B4 B5 B6 B7 B8 B9 BA BB BC BD BE BF C0 C1 C2 C3
 :FFDE9F0 C4 C5 C6 C7 C8 C9 CA CB CC CD CE CF D0 D1 D2 D3

 :FFDEA00 D4 D5 D6 D7 D8 D9 DA DB DC DD DE DF E0 E1 E2 E3
 :FFDEA10 E4 E5 E6 E7 E8 E9 EA EB EC ED EE EF F0 F1 F2 BD
 

The first two bytes are supposed to indicate the length of the packet, low-order byte first, and with the MSB of the second byte indicating if a CRC error has occurred. If a CRC error occurs, then no packet received interrupt is triggered, and the controller will keep trying to receive a valid packet, instead of marking the receive buffer full (the MEGA65 ethernet controller has two receive buffers, so that one can be processed while the other is receiving a packet).

The byte $BD at the end of the packet is written by the ethernet controller as a handy marker so that if you have been receiving multiple packets, and want to see where the latest one ends, you can.  So, this tells us that the packet was indeed correctly received as being $A1F - $800 - (2 bytes length header) = $21D bytes long.  However, the length header in the first two bytes of the packet says that it is zero bytes long, and that there was a CRC error.  That the length header is wrong tells me that there is something fishy going on.  I am resynthesising with an option to ignore CRC errors, and to try to investigate a little deeper the writing of the length field.
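
The two ways of measuring the packet length can be compared side by side. This is a sketch with illustrative function names. It assumes, beyond what is stated above, that the low 7 bits of the second header byte are the upper bits of the length.

```python
def parse_rx_header(buf):
    """Decode the two length bytes at the start of a receive buffer."""
    low, high = buf[0], buf[1]
    crc_error = bool(high & 0x80)           # MSB of the second byte
    # Assumption: the other 7 bits of the second byte are the high length bits.
    length = low | ((high & 0x7F) << 8)
    return length, crc_error

def length_from_marker(marker_offset):
    """Length implied by the buffer offset of the $BD end marker.

    The marker sits directly after the last received byte, and the first
    two bytes of the buffer are the length header.
    """
    return marker_offset - 2

# The ping packet above: header bytes 00 80, marker at $A1F - $800 = $21F.
print(parse_rx_header(bytes([0x00, 0x80])), hex(length_from_marker(0x21F)))
```

When the two numbers disagree, as they do here (0 against $21D), the length field itself is suspect.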

So, synthesis has finally finished an hour and a half later, so I can try etherload again, this time with the ethernet CRC check disabled, and ... it works.  Moreover, there is no sign of the packets having any errors, as I can load a game, and the game runs fine.  This leaves me wondering what is going on, or more specifically, how an incorrect ethernet CRC is getting calculated on what seem to be perfectly correct packets.  To try to solve this riddle, I took a look at the last packet sent by etherload as received by a Nexys4 DDR board and by the MEGA65 r1 PCB. Here is the one from the Nexys4 board:

 :FFDE800 AE 00 FF FF FF FF FF FF 10 05 01 9F FC FD 08 00
 :FFDE810 45 00 00 9C B9 79 40 00 40 11 FC 85 C0 A8 01 02
 :FFDE820 C0 A8 01 FF CE 1F 11 9E 00 88 A1 55 A9 00 EA EA
 :FFDE830 EA EA EA EA A2 00 BD 44 68 9D 40 03 E8 E0 40 D0
 :FFDE840 F5 4C 40 03 A9 47 8D 2F D0 A9 53 8D 2F D0 A9 00
 :FFDE850 A2 0F A0 00 A3 00 5C EA A9 00 A2 00 A0 00 A3 00
 :FFDE860 5C EA 68 68 60 00 00 00 00 00 00 00 00 00 00 00
 :FFDE870 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 :FFDE880 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 :FFDE890 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 :FFDE8A0 00 00 00 00 00 00 00 00 00 00 00 00 41 0B 3F 4D
 :FFDE8B0 BD


Here we see our $BD end of frame marker, and just before it, four bytes that are the CRC.  So, everything is fine there, as we know it is, since etherload works fine on that board with CRC checking enabled.

Now, the same packet received by the MEGA65 r1 PCB:

 :FFDE800 A9 80 FF FF FF FF FF FF 10 05 01 9F FC FD 08 00
 :FFDE810 45 00 00 9C 52 C0 40 00 40 11 63 3F C0 A8 01 02
 :FFDE820 C0 A8 01 FF E8 8C 11 9E 00 88 86 E8 A9 00 EA EA
 :FFDE830 EA EA EA EA A2 00 BD 44 68 9D 40 03 E8 E0 40 D0
 :FFDE840 F5 4C 40 03 A9 47 8D 2F D0 A9 53 8D 2F D0 A9 00
 :FFDE850 A2 0F A0 00 A3 00 5C EA A9 00 A2 00 A0 00 A3 00
 :FFDE860 5C EA 68 68 60 00 00 00 00 00 00 00 00 00 00 00
 :FFDE870 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 :FFDE880 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 :FFDE890 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 :FFDE8A0 00 00 00 00 00 00 00 00 00 00 00 BD


Note that apart from some different values in the IP and UDP header fields, the ethernet frames are identical, except for the lack of a CRC field.  While this is rather confusing, as I have never seen ethernet frames lacking a CRC field before, it at least does explain the behaviour I am seeing.  I also confirmed that if I use my Mac instead of the Linux laptop, the same behaviour is seen on the receiving side.

The MEGA65 r1 PCB does use a different ethernet receiver IC.  Is it possible that this IC does automatic CRC checking, and simply trims the CRC field from the end of the packet?  If so, I can find no mention of this feature in the datasheet for it.  There is a way that this can be tested, however: Connect two MEGA65s back to back via ethernet, send a frame from one to the other, and see if the CRC that the one sent is received by the other.  The MEGA65's ethernet controller I have written in VHDL always sends a CRC, so this eliminates that question.  This is also a good idea because I want to test the sending of ethernet frames, where there is a problem that I suspect is due to timing of the TX bits compared to the 50MHz ethernet clock.

To do this, I wrote a little program that simply copies a sample ethernet frame to the TX buffer and sends the packet, whenever a key is pressed.  First time trying this, I can see that a packet is sent from that side, and received by the other, with the CRC missing.  However, it also showed up a problem with memory mapping, because while I can read from the packet RX buffer when I had used the MAP instruction to make it visible at $6800-$6FFF, I can't write to it. Instead writes are going to colour RAM. Using the serial monitor causes the same problem. Time for another synthesis run to fix that (found the wrong 2-bit constant in the CPU source code that was causing it)...

So, having fixed that memory mapping error, I can now send packets from the Nexys4DDR board to the MEGA65 r1 board, but no CRC is visible. Also, I discovered that the packet length must be set to one more than the number of bytes in the packet. Now, what about in the other direction, from the MEGA65 r1 PCB to the Nexys4 board?

Here we have some interesting things.   First, the data coming through is corrupted, specifically, it looks like the bits that have been transmitted in one cycle are actually often used in the following cycle, i.e., I am presenting the data on the opposite side of the ethernet TX clock compared to when I should be.  Here is the hexadecimal version of the packet as received at the other end:

 :0000428 47 80 FF FF FF FF FF FF 47 45 45 45 45 45 29 00
 :0000438 3F 50 55 5A 5F 54 55 5E 5F 78 7D 7A 7F 7C 7D 7E
 :0000448 7F A0 A5 AA AF B4 B5 BE BF A8 AD AA AF BC BD BE
 :0000458 BF F0 2F FA FF F4 00 05 28 15 3C 3C 3F A0 5F 3F
 :0000468 5A 3C 14 A5 A0 40 7F 5F F4 BD 00 00 00 00 00 00


The first two bytes are the length ($47) + CRC error flag, then we have the usual ethernet fields.  Clues that the bits are being read one cycle late are that the ethernet source address field is 474545454545, when it should be 414141414141.  The $47 has the upper two bits from the $FF of the last ethernet destination address rotated in, and then the $41 rotated left.  $41 = 01000001, so rotating it left by two bits and pulling in the 11 bits, we get 00000111, which isn't quite right. However, if we assume that each bit pair is the logical OR of the previous bit pair and the bit pair being sent now, then it makes sense: 01000001 OR 00000111 = 01000111 = $47.  This says to me quite strongly that it is this marginal timing issue.  Basically, by presenting the bits and clock at the same time, there isn't enough time for them to stabilise and replace the old values before the ethernet controller samples them.
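
The OR hypothesis is easy to check against more than one byte. The sketch below models every received bit pair as the OR of the bit pair being sent and the one sent on the previous cycle, with bit pairs going out least significant first. The function name `smear` is illustrative, and the `sent` bytes are assumed to be the start of the test frame as it appears in the clean captures later in this post.

```python
def smear(frame, before=0xD5):
    """Model each received bit pair as (current OR previous) bit pair.

    `before` is the byte on the wire ahead of the frame; $D5 is the
    ethernet start-of-frame delimiter.
    """
    out = bytearray()
    prev = before
    for cur in frame:
        # The same bit stream delayed by one bit pair (two bits).
        delayed = ((cur << 2) | (prev >> 6)) & 0xFF
        out.append(cur | delayed)
        prev = cur
    return bytes(out)

sent = bytes.fromhex("ffffffffffff" "414141414141" "0800" "0f101112")
print(smear(sent).hex(" "))
# ff ff ff ff ff ff 47 45 45 45 45 45 29 00 3f 50 55 5a
```

The output matches the first 18 frame bytes of the corrupted capture, including the $29 and $3F that follow the source address, which supports the one-cycle-late explanation.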

Despite the difficulty that this glitch provides in determining if the CRC field is there, by repeatedly sending slightly different frames, I can see that the last four bytes of the frame before the $BD end of frame marker (which looks like a reverse = sign on the screen display) change each time. The only other byte that changes is one byte in the frame that I am changing on the transmit side.  You can play spot the differences with me in these shots: There is a single byte different in the body of the packets shown, so that the CRCs would be different, and then the CRC fields themselves:




So, these problems shouldn't be too hard to fix. The out-by-one length error I can very easily fix. The timing error will be a little more work, but not particularly hard. What I will probably do is use a 200MHz clock to drive the TX lines, and have a register that allows me to adjust the phase of the TX data bits with relation to the ethernet clock. That way I will only need to resynthesise once to be able to find the correct settings, which can then be baked into the next synthesis run after that.


So, adding the phase delay on the ethernet TX data lines has fixed the data corruption we were seeing. Here is how it looks now, sent from the MEGA65 r1 PCB to the Nexys 4 DDR board:

 :7776800 47 80 FF FF FF FF FF FF 41 41 41 41 41 41 08 00
 :7776810 0F 10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E
 :7776820 1F 20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 2E
 :7776830 2F 30 0F 32 33 34 00 01 08 05 0C 0C 0F 20 17 0F
 :7776840 12 0C 04 21 20 91 51 83 E2


Now we see the MAC address being correctly formed, and all the bytes look correct. Also, as this was received by the Nexys4 DDR board, we see the ethernet CRC field.  However, it still thinks the CRC is wrong.

In the reverse direction, we still don't see the CRC field, so we see packets like this:

 :7776800 42 80 FF FF FF FF FF FF 41 41 41 41 41 41 08 00
 :7776810 0F 10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E
 :7776820 1F 20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 2E
 :7776830 2F 30 EE 32 33 34 00 01 08 05 0C 0C 0F 20 17 0F
 :7776840 12 0C 04 21


What is nice is that the same TX line phase delay works in both directions, so we don't need to make that a setting specific to the type of board.

We also see that the number of bytes sent differs between them by one, that is, the MEGA65 r1 is sending one more byte than the Nexys4 DDR board is. This probably explains why the Nexys board sees an incorrect CRC, and is more of a concern.

What I think I will do next, is to send a frame to the r1 PCB, and use the debug mode on the ethernet controller to see the raw data lines, and see if we see the CRC bits arriving.  Here is what I see:

 :7776800 80 80 80 80 80 80 80 80 80 80 80 80 81 81 81 81
 :7776810 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81
 :7776820 81 81 81 81 81 81 81 81 81 81 81 83 83 83 83 83
 :7776830 83 83 83 83 83 83 83 83 83 83 83 83 83 83 83 83
 :7776840 83 83 83 83 81 80 80 81 81 80 80 81 81 80 80 81
 :7776850 81 80 80 81 81 80 80 81 81 80 80 81 80 82 80 80
 :7776860 80 80 80 80 83 83 80 80 80 80 81 80 81 80 81 80
 :7776870 82 80 81 80 83 80 81 80 80 81 81 80 81 81 81 80
 :7776880 82 81 81 80 83 81 81 80 80 82 81 80 81 82 81 80
 :7776890 82 82 81 80 83 82 81 80 80 83 81 80 81 83 81 80
 :77768A0 82 83 81 80 83 83 81 80 80 80 82 80 81 80 82 80
 :77768B0 82 80 82 80 83 80 82 80 80 81 82 80 81 81 82 80
 :77768C0 82 81 82 80 83 81 82 80 80 82 82 80 81 82 82 80
 :77768D0 82 82 82 80 83 82 82 80 80 83 82 80 81 83 82 80
 :77768E0 82 83 82 80 83 83 82 80 80 80 83 80 82 81 80 83
 :77768F0 82 80 83 80 83 80 83 80 80 81 83 80 80 80 80 80
 :7776900 81 80 80 80 80 82 80 80 81 81 80 80 80 83 80 80
 :7776910 80 83 80 80 83 83 80 80 80 80 82 80 83 81 81 80
 :7776920 83 83 80 80 82 80 81 80 80 83 80 80 80 81 80 80
 :7776930 81 80 82 80 80 80 02 80 03 82 00 83 03 81 00 80
 :7776940 02 81 02 80 00 81 02 83 00 00 00 00 00 00 00 00


Each byte in this capture is one 20ns time step on the ethernet interface.  Bit 7 is the "data valid" signal, and bits 0 and 1 are the data being read. Four of these makes one byte of actual data. So, let's decode it. The long train of 81's followed by 83's is the ethernet preamble.  So we need to start from the second 83.  We then have the following 4 time steps making the following bytes:

$0000 : 83 83 83 83 = %11111111 = $FF
$0001 : 83 83 83 83 = %11111111 = $FF
$0002 : 83 83 83 83 = %11111111 = $FF
$0003 : 83 83 83 83 = %11111111 = $FF
$0004 : 83 83 83 83 = %11111111 = $FF
$0005 : 83 83 83 83 = %11111111 = $FF
$0006 : 81 80 80 81 = %01000001 = $41
$0007 : 81 80 80 81 = %01000001 = $41
$0008 : 81 80 80 81 = %01000001 = $41
$0009 : 81 80 80 81 = %01000001 = $41
$000a : 81 80 80 81 = %01000001 = $41
$000b : 81 80 80 81 = %01000001 = $41
$000c : 80 82 80 80 = 001000 = $08
$000d : 80 80 80 80 = 000000 = $00
$000e : 83 83 80 80 = 001111 = $0F
$000f : 80 80 81 80 = 010000 = $10
$0010 : 81 80 81 80 = 010001 = $11
$0011 : 82 80 81 80 = 010010 = $12
$0012 : 83 80 81 80 = 010011 = $13
$0013 : 80 81 81 80 = 010100 = $14
$0014 : 81 81 81 80 = 010101 = $15
$0015 : 82 81 81 80 = 010110 = $16
$0016 : 83 81 81 80 = 010111 = $17
$0017 : 80 82 81 80 = 011000 = $18
$0018 : 81 82 81 80 = 011001 = $19
$0019 : 82 82 81 80 = 011010 = $1A
$001a : 83 82 81 80 = 011011 = $1B
$001b : 80 83 81 80 = 011100 = $1C
$001c : 81 83 81 80 = 011101 = $1D
$001d : 82 83 81 80 = 011110 = $1E
$001e : 83 83 81 80 = 011111 = $1F
$001f : 80 80 82 80 = 100000 = $20
$0020 : 81 80 82 80 = 100001 = $21
$0021 : 82 80 82 80 = 100010 = $22
$0022 : 83 80 82 80 = 100011 = $23
$0023 : 80 81 82 80 = 100100 = $24
$0024 : 81 81 82 80 = 100101 = $25
$0025 : 82 81 82 80 = 100110 = $26
$0026 : 83 81 82 80 = 100111 = $27
$0027 : 80 82 82 80 = 101000 = $28
$0028 : 81 82 82 80 = 101001 = $29
$0029 : 82 82 82 80 = 101010 = $2A
$002a : 83 82 82 80 = 101011 = $2B
$002b : 80 83 82 80 = 101100 = $2C
$002c : 81 83 82 80 = 101101 = $2D
$002d : 82 83 82 80 = 101110 = $2E
$002e : 83 83 82 80 = 101111 = $2F
$002f : 80 80 83 80 = 110000 = $30
$0030 : 82 81 80 83 = %11000110 = $C6
$0031 : 82 80 83 80 = 110010 = $32
$0032 : 83 80 83 80 = 110011 = $33
$0033 : 80 81 83 80 = 110100 = $34
$0034 : 80 80 80 80 = 000000 = $00
$0035 : 81 80 80 80 = 000001 = $01
$0036 : 80 82 80 80 = 001000 = $08
$0037 : 81 81 80 80 = 000101 = $05
$0038 : 80 83 80 80 = 001100 = $0C
$0039 : 80 83 80 80 = 001100 = $0C
$003a : 83 83 80 80 = 001111 = $0F
$003b : 80 80 82 80 = 100000 = $20
$003c : 83 81 81 80 = 010111 = $17
$003d : 83 83 80 80 = 001111 = $0F
$003e : 82 80 81 80 = 010010 = $12
$003f : 80 83 80 80 = 001100 = $0C
$0040 : 80 81 80 80 = 000100 = $04
$0041 : 81 80 82 80 = 100001 = $21
$0042 : 80 80 02 80 = 100000 = $20 (some bits missing data valid)
$0043 : 03 82 00 83 = %11001011 = $CB (some bits missing data valid)
$0044 : 03 81 00 80 = 000111 = $07 (some bits missing data valid)
$0045 : 02 81 02 80 = 100110 = $26 (some bits missing data valid)
$0046 : 00 81 02 83 = %11100100 = $E4 (some bits missing data valid)


So, this is VERY interesting.  The ethernet controller isn't filtering out the CRC, but is rather claiming that those bits are not data valid.  Given the very specific pattern, with one di-bit missing the data valid for the last data byte of the packet, and then two di-bits missing the data-valid signal for the CRC, and the same two each byte, I suspect that the ethernet controller might be signalling the end of the frame.  This would mean that it must be buffering at least five bytes worth of received data, but that is not impossible.  Anyway, it explains where the CRC has gone. So, digging around a bit, I have found that the RX data valid signal is multiplexed with carrier sense on some PHY chips.  This looks like exactly what could be happening here (although the PHY receiver on the Nexys4 doesn't do this, as I have just re-confirmed), thus providing an explanation for what we are seeing.
So, time to resynthesise again, and see if this gets us CRCs received on the MEGA65 r1 PCB.  That should just leave the CRC checksum problems, if they are still occurring after that change (which I expect that they will).
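
To make the dump above easier to follow, here is a minimal Python sketch of how each line can be decoded. It assumes, as the dump suggests, that each captured value carries the data-valid flag in bit 7 and the di-bit in bits 0 and 1, and that the least significant di-bit arrives first. The function name is made up for illustration; this is not code from the MEGA65 sources.

```python
def decode_dibits(samples):
    """Combine four captured di-bit samples into (byte, all_valid).

    Assumed layout of each sample (inferred from the dump, not confirmed):
    bit 7 = data valid, bits 1..0 = the di-bit, first sample = lowest bits.
    """
    if len(samples) != 4:
        raise ValueError("expected exactly four di-bit samples")
    value = 0
    all_valid = True
    for position, sample in enumerate(samples):
        value |= (sample & 0x03) << (2 * position)
        if not sample & 0x80:
            all_valid = False
    return value, all_valid


if __name__ == "__main__":
    for row in ([0x81, 0x80, 0x80, 0x81], [0x03, 0x82, 0x00, 0x83]):
        byte, valid = decode_dibits(row)
        note = "" if valid else " (some bits missing data valid)"
        print(f"${byte:02X}{note}")
```

Ignoring the data-valid flag when assembling the byte, and only reporting it, is what lets the CRC bytes show up in the dump at all.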

Indeed, success! I can now receive the last byte and CRC of a packet on the MEGA65 r1 PCB.  However, I still have a problem with CRCs.  I know that the CRC problem is on the sending side, because I can receive packets sent from my laptop without difficulty -- it is only packets sent from the MEGA65 that have this problem.

My planned approach to investigating this was to capture some good packets, find or write an ethernet CRC checking program, confirm it works for those, and then see how the MEGA65-originated packets fare -- and whether there is some mutation of the packet data that will make the CRC correct. However, I then decided to take a closer look at the CRC generation code in the ethernet controller, and get that to provide me with the list of bytes that it thought it was CRCing, to make sure that there was nothing strange going on. In the process of that, I found that the data valid input to the CRC calculator was remaining high while clocking the CRC out at the end of a packet.  Thus, only the first two bits of the CRC would be correct, and the rest would be wrong. So, off to synthesis again, to see if this fixes the problem.
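
For reference, the kind of CRC checking program mentioned above can be very small. The following Python sketch is not the tool I used; it assumes the captured frame starts at the destination MAC address (no preamble) and ends with the 4-byte frame check sequence, and it relies on the ethernet FCS being the same CRC-32 that zlib computes, sent least significant byte first. The function names are my own.

```python
import zlib


def ethernet_fcs(frame: bytes) -> bytes:
    """Return the 4 FCS bytes for a frame, in the order they appear on the wire."""
    return (zlib.crc32(frame) & 0xFFFFFFFF).to_bytes(4, "little")


def fcs_is_valid(frame_with_fcs: bytes) -> bool:
    """Check a captured frame whose last four bytes are the FCS."""
    if len(frame_with_fcs) < 4:
        return False
    body, fcs = frame_with_fcs[:-4], frame_with_fcs[-4:]
    return ethernet_fcs(body) == fcs


if __name__ == "__main__":
    payload = b"123456789"  # stand-in data, not a real ethernet frame
    good = payload + ethernet_fcs(payload)
    print(fcs_is_valid(good))            # a frame with a matching FCS
    print(fcs_is_valid(good[:-1] + b"\x00"))  # same frame with a damaged FCS
```

Running the same check over mutated copies of a bad packet (bits swapped, bytes dropped, and so on) is one way to hunt for the transformation that makes the CRC come out right.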

Testing with this fix, it still wasn't working.  So I took a known good packet sent by my laptop to the MEGA65, and got the other MEGA65 to send it, so that I could compare the CRC with that of the good packet, to try to get some handle on what was still going wrong. I was really quite frustrated at this point, because I had gone through the relevant code carefully, and thought I had understood what was going on, and with the help of simulation, confirmed that it was doing the right thing.  So I was somewhat relieved when I realised what the problem with the CRC was.  Here is the good and the bad CRC:

GOOD: $C2F7B15F = binary 11000010 11110111 10110001 01011111
 BAD: $C1FB72AF = binary 11000001 11111011 01110010 10101111
Looking at the hex, I could see that there were strong similarities, much more so than if the CRC was just plain wrong. But it took me a little while to realise that the two bits in each pair were simply swapped: The routine that copies the CRC bits out, two at a time, for transmission was putting them onto the wrong TX lines. So, it's off for a few hours of synthesis again to fix this up... and after 10,238 seconds of synthesis, we finally have ethernet transmission with working CRC generation.
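
The relationship between the two CRC values is easy to check in a few lines of Python. This is just a sketch to confirm the observation, with a function name of my own choosing: it swaps the two bits within every di-bit of a 32-bit value.

```python
def swap_dibit_halves(value: int) -> int:
    """Swap the two bits inside each 2-bit pair of a 32-bit value."""
    upper_bits = value & 0xAAAAAAAA  # the high bit of every pair
    lower_bits = value & 0x55555555  # the low bit of every pair
    return (upper_bits >> 1) | (lower_bits << 1)


GOOD = 0xC2F7B15F
BAD = 0xC1FB72AF

if __name__ == "__main__":
    print(f"${swap_dibit_halves(GOOD):08X}")
    print(swap_dibit_halves(GOOD) == BAD)
```

Applying the swap twice gets you back to the original value, so the same function maps the bad CRC back to the good one.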

The only program I have that does any ethernet transmission at the moment is the etherload program, that in theory listens for ARP requests. It would be nice if it also listened for PING packets and replied, but you can't have everything. However, pinging its IP address from my laptop does now result in ARP succeeding, with the very uncreative MAC address hardwired into etherload:

paul@F96NG92-L:~/Projects/mega65/mega65-core$ ping 192.168.1.65
PING 192.168.1.65 (192.168.1.65) 56(84) bytes of data.
^C
--- 192.168.1.65 ping statistics ---
7 packets transmitted, 0 received, 100% packet loss, time 6140ms

paul@F96NG92-L:~/Projects/mega65/mega65-core$ arp -na

...
? (192.168.1.65) auf 40:40:40:40:40:40 [ether] auf enp0s31f6...
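
For anyone curious what the reply behind that ARP table entry contains, here is a Python sketch of the 28-byte ARP reply payload in the standard layout for IPv4 over ethernet. It is not taken from etherload; the laptop's MAC and IP address in the example are made-up placeholders, and the helper names are my own.

```python
import struct


def mac_bytes(text: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    return bytes(int(part, 16) for part in text.split(":"))


def ip_bytes(text: str) -> bytes:
    """Convert dotted-quad IPv4 text into 4 raw bytes."""
    return bytes(int(part) for part in text.split("."))


def build_arp_reply(sender_mac, sender_ip, target_mac, target_ip) -> bytes:
    """Build the ARP payload only (no ethernet header, no FCS)."""
    return struct.pack(
        "!HHBBH6s4s6s4s",
        1,        # hardware type: ethernet
        0x0800,   # protocol type: IPv4
        6,        # hardware address length
        4,        # protocol address length
        2,        # operation: reply
        mac_bytes(sender_mac), ip_bytes(sender_ip),
        mac_bytes(target_mac), ip_bytes(target_ip),
    )


if __name__ == "__main__":
    reply = build_arp_reply(
        "40:40:40:40:40:40", "192.168.1.65",  # values from the log above
        "02:00:00:00:00:01", "192.168.1.10",  # hypothetical laptop addresses
    )
    print(reply.hex())
```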

What would be nice, would be if etherload responded to pings, and also read out the MAC address it should use from the MIIM, now that I know how to use it. In fact, it would be nice if the ethernet controller provided simple register based access to the MIIM registers, and had an option to automatically populate frames with its MAC address.  I'll add these to the queue.  But for now, I am happy to finally have ethernet working in a solid way, for both transmit and receive, and will move on to testing the HDMI port, and now that I remember, the last aspects of the 3.5" floppy drive interface.

Tuesday, 9 January 2018

Repairing my 1581

I would still like to format a fresh 3.5" floppy to test in the MEGA65. However, my 1581 has gone senile, apparently because the 8KB SRAM in it has gone faulty.  The trouble is, the Fujitsu 8264A SRAM is not the easiest thing to lay one's hands on.  However, I have a stash of old ICs given to me by a second cousin a long time ago.  Many of those ICs go back to the late 70s and early 80s.  So I wondered if, perchance, I happened to have such an SRAM, or at least a compatible one.

It didn't take much digging to find a 6264, that looked to me like it might fit the bill. A bit of comparison of pinouts (here and here), says that this should in fact be a fine replacement.  You can play spot the SRAM here:



Also, I have the spare 1581 replacement board that Simon Scott kindly sent me.  However, when I tried to use that, the Amiga drive mechanism in it steps back and forth, instead of progressively across the disk.  Something is fishy there.

So, before taking the soldering iron to my 1581 PCB, I tried the mechanism and cable from my 1581 with the PCB he sent me -- and success! It formats and reads a disk fine.  Was it the drive or the cables? That's a fairly easy experiment to run, and it turns out it is the drive mechanism that is the problem.

It thus looks like I can frankenstein the two units to make one good one, and if I am feeling excited, I can de-solder the SRAM on my old PCB, and try to resurrect that, and then hunt a new drive mechanism for that. In fact, I have plenty of PC floppy drive mechanisms lying about -- if I can find the instructions on using one of those with the 1581 PCB.

But for now, I would have to switch the LED leads on the board with the faulty mechanism, so I decided I should just go the whole hog, and resurrect my old board with the spare RAM chip I found. After re-acquainting myself with my aged solder-sucker (which is perhaps only marginally younger than the SRAM I was replacing), I removed the old IC by chopping it off at its knees (all 28 of them), then put a socket in, and seated the "new" SRAM in place:


Then, it was very pleasing to discover that the drive works perfectly again, at least so far as I have tested it, which is to format a new 720K disk, and read the directory from it.

So, after all that diversion, I now have disks that I can safely use to test the MEGA65 floppy interface, and when I get to the point of writing data to disk in it, a drive in which I can verify that the written data is readable.

Sunday, 7 January 2018

Fixing a variety of sprite rendering bugs

Today's rambling blog post is following my (mis-)adventures tracking down and fixing a variety of sprite rendering bugs on the MEGA65.


I have been making small incremental fixes (or what were intended to be fixes, but in some cases were worse) in the sprite rendering module.  However, the long synthesis times are a pain for making progress that way.  Simulation is much faster, and also allows more insight as to why something is going wrong.

So, today I set up a test harness that allows me to render a single sprite, and work out what is going on.  But before we go further, some of the known problems that I am trying to solve with sprites are:

1. Sprites "shimmer" with the data being shown on a physical raster line sometimes varying between what is on the raster above, and the raster below.

2. Sprites display a line (or more) of junk at the bottom. This is actually one extra VGA raster of data, as though the sprite were one VGA pixel taller.

3. Vertical expansion of sprites was broken.

4. 16-colour sprite mode was missing (this is a new MEGA65 sprite mode)

5. Tiling mode didn't work properly in 640H mode, but stopped at pixel 419.

6. Non-tiled sprites would stop in the right border, instead of wrapping around to the left of the next raster line.

There are probably a few others in there, as well.

I have started by refining the logic somewhat so that the Y advance logic and start of sprite rendering is now much cleaner and simpler.  This results in the sprite thinking it is drawing the correct data on each raster line, and stopping immediately at the correct point.

I have also found some bugs with the sprite collision checking, which I have tried to fix.

Now the trick is to get the sprites showing the correct data on every line.  This is a little bit complicated, because the sprite needs to know what data it needs for the next raster line, to tell the VIC-IV to fetch it. Otherwise, it will end up one VIC-IV raster late, and if the VIC-IV provides the updated data in the middle of a raster line, we can get that shimmer effect I mentioned in the bug list.  What is nice is that realising all this has helped to explain to me the kinds of bugs that I am seeing, and gives me confidence that they can be fixed.

Now, to fix the sprite fetching, we need to understand how the sprites work on the MEGA65.

First, each sprite has a 64 bit raster buffer that contains the 8 bytes of sprite data for one row of pixels.  On the VIC-II and VIC-III, this is only 3 bytes, because sprites are only 24 pixels wide. However, on the MEGA65, you can ask for sprites to be 64 pixels wide instead (or 16 pixels wide if in 16-colour mode, where four bits control the colour of each pixel, but without becoming wider like they do in multi-colour sprite mode).
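
The widths quoted above all fall out of the same division, which this small Python sketch spells out. The function name is mine, and it only counts logical pixels per row; it says nothing about how wide each pixel is drawn on screen.

```python
def pixels_per_row(bytes_per_row: int, bits_per_pixel: int) -> int:
    """Number of logical sprite pixels held in one row of sprite data."""
    total_bits = bytes_per_row * 8
    if total_bits % bits_per_pixel:
        raise ValueError("row does not hold a whole number of pixels")
    return total_bits // bits_per_pixel


if __name__ == "__main__":
    print(pixels_per_row(3, 1))  # classic hi-res sprite
    print(pixels_per_row(8, 1))  # MEGA65 extended-width sprite
    print(pixels_per_row(8, 4))  # MEGA65 16-colour sprite
```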

To get data into the sprite raster buffers, there is a ring bus where the sprites put their current offset with respect to the start of the sprite data block onto the bus, and the VIC-IV receives this as the ring bus rotates the data, and the VIC-IV then provides the 8 bytes of data relative to that address once per raster line. The VIC-IV fetches all 8 bytes for all 8 sprites (and also the 8 bitplanes) every raster line, regardless of whether they are to be displayed or not. This simplifies things somewhat, and is possible because of the high pixel clock of the MEGA65 (currently 100MHz) compared to the screen resolution.

This 64 bit buffer is continuously expanded into a 128 bit version for both multi-colour mode and hires mode, so that the sprite drawing logic can just always take two pixels. The only exception to this is 16-colour mode, where the 128 bit version is prepared as for hi-res mode, but the shift register is shifted four times per pixel on subsequent cycles (which requires that the VGA pixel clock be at least 4x the sprite pixel clock.  For now, this means that 16-colour sprites cannot be used in 640H mode. I hope to relax this in future, by allowing the barrel shifter to shift either 1x2=2 or 4x2=8 bits each time.)  In all cases, the sprite copies the 128 bits of expanded data into the shift register for drawing when the X position matches the left edge of the sprite -- which is why the shimmer can vary based on the sprite number and X position.

What we need to do instead is to latch the 128 bits of drawing data at the start of a raster line, to eliminate the shimmer. Then, to make sure the correct data is drawn each raster, we need to have the sprite request the data for the next row of pixels once it has received the data for the current row of pixels, and latch the new data just as the next row of pixels should begin to be drawn.

Because sprite data fetches happen every single raster line, we can do the fetch for the first row of pixels on a continuous basis, and latch it as soon as we begin drawing that row of pixels, and then begin asking for the next, and latch it in the same way.

Okay, so all of that is synthesising now.

While I wait for that, I want to try to fix a problem with bad-lines.  The specific problem I want to solve is that in Gyrrus, it displays the wrong characters for the inter-level announcements.  What I am not currently sure about is whether the problem is that badlines happen at the wrong time on the MEGA65, or whether the problem is that the raster numbers for raster interrupts are out by one.  The latter would make sense, because a lot of games have the top pixel of text following a raster split missing. So, this says that we need the VIC-II raster interrupts to happen one raster line earlier than they currently do.

So, a bit of investigation using Gyrrus as an example was in order. First, this is how it looks by default.  Here is the problem, we see the yellow row of funny characters instead of the copyright message (we can also see some rubbish under the sprites from the previously described bug, as well):


Fortunately I already have a debug register that lets me move the VIC-II raster numbers up in comparison with the actual screen.  If I shift it up by two pixels, it looks mostly right, but with some raster timing problems:


I have seen a funny problem where raster routines don't seem to run quite as fast as they should when the MEGA65 is at 1MHz, so I also tried putting the CPU to 50MHz.  This made the shift of two pixels to be too much, but reducing it to a one pixel delta with 50MHz resulted in a perfect display (give or take the sprite rendering bug):


So, are we not getting enough CPU cycles per raster line?  This is fairly easy to compute, because we know the pixel clock is 100MHz, and that there are 3,196 cycles per VGA raster = 63.92 microseconds per two raster lines (since we are effectively running double-scan). Given the CPU clock at 1MHz for PAL in the MEGA65 is calculated as (64569/65536) * 1MHz, this gives 63.92 * 64569/65536 = 62.9768 cycles per raster. That is, about once every 43 raster lines, a cycle will be missing. To fix this, we need to make the 1MHz PAL clock run at 64593/65536 = 985.61 kHz, instead of the true PAL clock speed of 985.25 kHz. That is, we are talking about roughly one part in 2,700.  This is most likely an acceptable error.
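
The arithmetic above is easy to reproduce. The following Python sketch just redoes the calculation with the figures quoted in this post; the variable names are mine, and the target of 63 cycles per raster is the usual PAL C64 figure.

```python
PIXEL_CLOCK_HZ = 100_000_000
CYCLES_PER_VGA_RASTER = 3196
VGA_RASTERS_PER_VIC_RASTER = 2      # double-scan
CLOCK_DIVIDER = 65536
CURRENT_NUMERATOR = 64569           # current PAL "1MHz" setting
TARGET_CYCLES_PER_RASTER = 63

# Duration of one VIC-II raster line in microseconds.
raster_us = (CYCLES_PER_VGA_RASTER * VGA_RASTERS_PER_VIC_RASTER
             / PIXEL_CLOCK_HZ * 1_000_000)

# CPU cycles that fit into one raster line at the current clock setting.
cpu_cycles_per_raster = raster_us * CURRENT_NUMERATOR / CLOCK_DIVIDER

# How many raster lines pass before a whole cycle has gone missing.
rasters_per_lost_cycle = 1 / (TARGET_CYCLES_PER_RASTER - cpu_cycles_per_raster)

# Numerator needed to get 63 cycles into each raster line.
needed_numerator = round(TARGET_CYCLES_PER_RASTER / raster_us * CLOCK_DIVIDER)

if __name__ == "__main__":
    print(f"raster line: {raster_us:.2f} us")
    print(f"cycles per raster: {cpu_cycles_per_raster:.4f}")
    print(f"one cycle lost every {rasters_per_lost_cycle:.1f} rasters")
    print(f"needed numerator: {needed_numerator}")
```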

However, it is best to actually test things like this rather than just trusting in them.  I was able to confirm that raster lines do appear to have only somewhere between 62 and 63 cycles each.

Then I thought I should try NTSC. Loading Gyrrus in NTSC and adjusting the raster number by two yielded a perfect result at 1MHz -- no 50MHz boost required. Is it possible that Gyrrus was designed for NTSC instead of PAL? Anyway, it suggests that we should move the VIC-II raster numbers up relative to the physical screen dimensions.  However, causing raster lines 0 and 1 to cease to exist is likely to cause other problems, e.g., for raster interrupts set to trigger on those raster lines.  Thus there is an easier solution: Simply move the borders and character generator down two pixels. Then everything will line up naturally. We'll try that.

Now, back to the sprite rendering problems, Gyrrus has shown that we still have the junk at the bottom, and also that the top row of pixels in sprites are still only one pixel high. The row of junk is a true sprite pixel high, i.e., it expands with the Y-expand flag.  What is helpful is that now with my test harness, I can simulate the drawing of a sprite, either expanded or not, and see what is going on.  This image here is of an X and Y expanded sprite. You might need to enlarge the image to see what is going on:


The data pattern for each row of pixels is for each byte to be $80 + row number. So row 0 of pixels is $80 $80 $80 = 100000001000000010000000, where 1 makes a blue pixel, and 0 a green background pixel.  The next row is $81, so 100000011000000110000001. There should be rows 0 to 20 for the 21 vertical pixels of a sprite.
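
The test pattern just described can be generated with a couple of lines of Python. This is only a sketch of the pattern, not the actual VHDL test harness, and the function name is made up.

```python
SPRITE_ROWS = 21


def sprite_row_bits(row: int, bytes_per_row: int = 3) -> str:
    """Bit string for one row of the test sprite: every byte is $80 + row."""
    byte = (0x80 + row) & 0xFF
    return format(byte, "08b") * bytes_per_row


if __name__ == "__main__":
    for row in range(SPRITE_ROWS):
        # '#' for a set (blue) pixel, '.' for background (green)
        print(sprite_row_bits(row).replace("1", "#").replace("0", "."))
```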

So, we can see a few problems here:

1. The first row of pixels is too short.

2. The missing height of the first row seems to be made up for with pixels from row 21, i.e., a 22nd row of pixels that should not be present.


That is, it looks like there should be just the one problem to fix here. Basically what is happening is that it is pulling the new data in on the off-raster for the Y-expanded sprite. After a bit of poking around, I sorted this out, so now the output looks like:


Okay, so let's make sure we haven't messed things up for non-expanded sprites:


Hmm, here we have a similar problem, except that it is the top row of pixels being drawn too many times, instead of too few.  This was fixed by setting the sprite drawing flag at the start of the raster line on which the sprite appears, instead of when the first pixel is due to be drawn, so that the data offset can be set in time.  I still have some misgivings about this, as the VIC-IV may not be given enough notice from when the data offset changes to when the bytes are required.  This I will have to verify. But at least now in simulation non-expanded sprites are drawing correctly:

Super! A quick check back to Y-expanded sprites to verify that it hasn't stuffed that up again, and this patch is off for synthesis.

So, after synthesising, we have some improvements.  The text messages in Gyrrus are now visible (still glitchy in PAL, but perfect in NTSC), thanks to the bad lines occurring at the right time, and the position of the rasters properly matching up. However, vertically expanded sprites are still messing up in a new worse way, although they are mostly fine when not vertically expanded.  That said, the top row of pixels is still too short for the non-vertically expanded sprites.  This can all be seen in this screenshot from Gyrrus:


In short, it feels a bit like two steps forward, one step back, or sideways or something.  Anyway, let's start analysing what is going on with those vertically expanded sprites.  Instead of just shimmering or being vertically rotated, they are a bit of a mess.  What I suspect is happening is that the 64 pixel shift register is only being rotated by the 24 pixels being displayed, and thus isn't properly synchronised for use on the next raster line.  If this theory is correct, enabling 64-pixel wide mode should fix the problem. And indeed, it does. In this screenshot I have just set the 64-pixel wide mode bit for the "G" sprite. Of course, now it is completely unreadable, but it is stable.


The frustrating part here is that I have not yet been able to reproduce the same behaviour in the simulation, which of course makes it more difficult to try to fix.
I did in fact try a fix where the pixel shift register gets reset at the right edge of sprites, but that didn't work.  Looking at that fix again this morning, I can see, however, that it was only being activated some of the time.  So, we will see if that fixes that problem.

In the meantime, there is that problem of the top pixel of a non Y-expanded sprite being trimmed.  Again, this doesn't show up in simulation.  But I did see something like it at one stage in simulation, when the sprite data fetch was happening slightly wrongly.  Thus, I am suspecting it is due to an unhelpful interaction of the sprite data latching and when the VIC-IV fetches the sprite data. 

The problem in simulation wasn't quite the same, as it was the first row of pixels getting stretched instead of compressed, however, as it is a similar type of error, it is of interest.  In simulation I was only fetching sprite data every VIC-II raster, not every VGA raster, i.e., only once every two rasters. Thus if the sprite changed its data pointer during the "off" raster, it would not get updated until one more raster line had passed, and thus would come one raster line late.

Thus I suspect it is some interaction with the VIC-IV sprite data fetch timing, presumably fetching the data before the sprite expects it. I can see if I can reproduce that sort of situation by making my sprite test harness fetch sprite data continuously, and see if that makes it appear.  Unfortunately, it refuses to do so.  So, back to testing on the running machine.  The short first pixel only appears when a sprite is 24-pixels wide, but not when 64-pixels wide, again suggesting that the sprite pixel buffer is likely to blame in some way. So it is possible that the fix I have already committed to improve the behaviour of that may fix this problem as well. That would be nice.  Now I just have to wait for synthesis to complete.

Hmm.. That didn't work. So this time I have added an explicit buffer for the sprites, so that the correct data gets kept and re-used for as long as it is required. This avoids all the shift register synchronisation problems.  This time we have some marked improvement, particularly because the shift-register problems are being avoided:


From a distance, that even looks mostly correct.  However, if you look carefully, you can see that some rows of pixels in the vertically expanded sprites are wrong. In effect, it looks as though the order of each pair of pixel rows is inverted, or more correctly, every other row of pixels is drawn from the row of pixels that follows, instead of the one it should be drawn from. 

Now, it is rather odd that this doesn't happen in simulation.  What occurs to me is that it could be a case of order of assignment differing between GHDL, which I am using for simulation, and the Xilinx ISE tools that perform the synthesis.  That is, I might have multiple code paths that try in the same cycle to set the sprite pixel buffer, and GHDL happens to interpret them in the correct order, while ISE does not.  It is in principle easy to find out if this is the case, by putting debug statements on the assignments, and seeing if any of them happen at the same time. There is no sign of this happening, unfortunately. That said, it could well be other signals that are assigned in such a way.

Thinking about it from another perspective, what we can say is that the newly fetched data is used one VGA raster earlier than it should be, before reverting to using the previously fetched row.  From this we can deduce that sprite_pixel_bits_last has the previous row of pixels, since it is later retrieved. However, there is simply no code path that allows a new row of pixels to be used, without replacing the contents of sprite_pixel_bits_last.

However, I noticed something as I was looking at the simulation output: Every other row of pixels, the sprite data is not actually reloaded -- the shift-register is left with what it had after being rotated by 24 pixels.  The reason I wasn't seeing the effects of this, is that my test harness generated 8 of the same bytes for each sprite row, and because the buffer of 8 bytes was long enough for the two rows of pixels, each requiring only 3 bytes.  However, by making the bytes have a bit field that changes across as well as down, suddenly the problem becomes apparent:

In each byte it is bits 4, 5 and 6 that vary across, and it is precisely these bits that change on the alternate raster lines.  The keen observer can pick out that it cycles through 000, 001, 010, 011, 100 and 101, i.e., 2x3 rows of values.  So, we at least now have a clear understanding of what is wrong, and can see it in simulation.  So, now I can easily fix it. But first, there is still a missing row of pixels from the first row of pixels in non Y-expanded mode.  I am curious to see if the same problem might be responsible for that. Indeed, it looks like it might well be:


Remember that for a real sprite, the 4th to 6th bytes of a data fetch are actually the 2nd row of pixels, which thus gives the appearance of the top row being shorter. It also explains why 3 bytes of rubbish still appear at the bottom: It is bytes 4 to 6 of the fetch for the last row of pixels.

The Lesson? Think carefully about your test vectors!
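
To illustrate why the original test vectors hid the bug, here is a small Python model. It is a simplification rather than the real VHDL: it treats the sprite row as a 64-bit value that has been rotated by the 24 displayed pixels and not reloaded, and compares what would be drawn next. The varied pattern (column number in bits 4 to 6) is my own stand-in for the improved test data.

```python
MASK64 = (1 << 64) - 1


def pack_row(row_bytes) -> int:
    """Pack 8 bytes into a 64-bit value, first byte in the top bits."""
    value = 0
    for byte in row_bytes:
        value = (value << 8) | (byte & 0xFF)
    return value


def rotate_left_64(value: int, bits: int) -> int:
    """Rotate a 64-bit value left, as a shift register that wraps around."""
    bits %= 64
    return ((value << bits) | (value >> (64 - bits))) & MASK64


def first_24_pixels(value: int) -> int:
    """The 24 bits that a normal-width sprite would draw next."""
    return value >> 40


# Old style test data: every byte in the row is identical.
uniform = pack_row([0x81] * 8)
# Hypothetical improved data: bits 4-6 carry the byte's column number.
varied = pack_row([0x80 | (column << 4) | 0x01 for column in range(8)])

if __name__ == "__main__":
    for name, row in (("uniform", uniform), ("varied", varied)):
        fresh = first_24_pixels(row)
        stale = first_24_pixels(rotate_left_64(row, 24))
        print(f"{name}: fresh=${fresh:06X} stale=${stale:06X} "
              f"bug visible: {fresh != stale}")
```

With identical bytes the stale, un-reloaded register draws exactly the same pixels as a freshly loaded one, so the simulation looks fine even though the logic is wrong.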

Now that the problem has been clearly identified, the fix was simple: Copy the cached row of pixels at the start of VGA rasters that are not also the start of a new VIC-II raster.  And the result: Simulation now looks correct, for both un-expanded and expanded sprites:



And after synthesis, Gyrrus looks like it should, with expanded sprites looking correct, and the half-height first pixel row of the earth also fixed:


Indeed, using Impossible Mission as a test for the extra row of sprite junk, since it was obvious there as well, I was also able to confirm that sprites are drawing correctly there as well:

In fact, Impossible Mission now seems to work except for some glitching when going up and down the elevator, and a rather deadly (quite literally in the case of the player) sprite collision bug, that I am now exploring.  But for now, it is job done on fixing sprite rendering bugs.

What I have yet to report on is the 16-colour sprite mode, as well as some of the other more esoteric sprite render problems. However, seeing as this post has got a bit long, I'll cover those in separate posts later on.

Saturday, 6 January 2018

1351 Mouse and paddles now work on the MEGA65 rev1 PCB

I had previously written about how I had figured out how to drive the mouse.  I have now merged those changes into the MEGA65 source code, and synthesised a bit-stream that happily works with it, as you can see in this video:


This proved not too hard to do, especially since I had already proven it to work in my VHDL test harness. The main complications I faced were in plumbing the signals through to the SIDs, and getting the CIA multiplexing of the POT lines working correctly.  With two SIDs, the MEGA65 lets you use all four POT lines without fiddling with the multiplexor: Two on each SID. The CIA multiplexor lines just switch which SID handles the POT lines from which joystick port.

The only known caveat at the moment, is if you use external SIDs, then the POT lines won't work, because the POT lines are read through the VHDL SIDs.  There are a couple of ways around this at the moment:

1. Have some kind of crazy adaptor on any external SID cartridge, to route the lines. This might require joystick pass-through adapters, similar to what I am using on this rev1 PCB.

2. Patch software to use the MEGA65 direct POT registers at $D620-$D623, which allow all four POT lines to be read, without involving the SIDs.

However, we recognise that this is not ideal, and I will have a look at having an option to have SIDs external, but still making the POT lines be read from the internal VHDL-SIDs. That way, things like the MSSIAH cartridge combined with an external SID cartridge would still be able to work, which we see as a potential use-case for some folks, and that we would like to support.

Otherwise, the only remaining thing to fix on the POT lines is to try out the 1.0nF capacitors, to see if that gets us full range of movement on the paddles working correctly.  I had a couple of errands to run this morning, so I pedalled past Jaycar on the way and invested the necessary $0.70 in the two capacitors, and also a few dollars for an HDMI cable, so that I can begin working on HDMI output.

After replacing the capacitors, the mouse still works, and I didn't even accidentally exchange the X and Y axes when I had the appropriate wires de-soldered, which I was very happy about :)

Also, changing the capacitors fixed the jumping problem in the mouse test program, as the values being presented by the mouse are now in the correct range expected from a real mouse. However, the paddles are still not perfect: They only cover the range from $00 to about $C0 (range of 192).  So it might be that we need a slightly larger value capacitor, perhaps a 1.2nF in order to get full range. However, that said, real C64 paddles typically have a similar usable range, but not identical.  So, the best solution: Test it! The easiest option open to me was to simply plug in Pinball Spectacular, and try to see if I could get full travel on the game. This worked totally fine, with ample spare travel at each end.


So, I think that wraps up the POT lines on the MEGA65 for now.  I can now work through the other tasks on my list.

Fixing various niggly problems

So, nothing hugely exciting to see here, but important progress on a number of fronts that have been causing niggly problems.

First, the cartridge port clock is now correctly and accurately at 8MHz for the dotclock and 1MHz for the CPU clock.  It was variously about half that previously due to various little bugs.

Second, the mapping of several register blocks had some subtle, and not so subtle bugs preventing their access. This has now been fixed.

Third, in trying to track down the IO address collision bugs, I have moved just about every IO device that was not using a chip-select line to use one.

Fourth, the Nexys4DDR boards thought that they always had an 8KB cartridge plugged in if switch 8 was not set.  This was a debug thing I put in when getting the cartridge interface to work some time ago, and of course then promptly forgot about, resulting in a few hours of lost time trying to figure out why on earth I was seeing 30719 BYTES FREE, but no actual data in the mapped cartridge.

Fifth, in making the ethernet controller use a chip-select line, I stripped out all of the half-baked RR-NET emulation code, as it was seriously complicating the ethernet controller addressing. I suspect that this might well have been the source of the IO address collisions.  Also, the RR-NET sits at $DE00, which now that we have a real cartridge port, causes even more problems, as we have to be able to work out which $DE00 to look at.  If anyone desperately needs an RR-NET based ethernet instead of using the internal 100Mbit ethernet of the MEGA65, they can just plug a cartridge with it in.  The MEGA65 ethernet controller is much simpler to use, with a memory-mapped TX and RX buffer, and no bizarre 16-bit access semantics for the registers.  I expect that over time software that uses ethernet will be patched to use the MEGA65 native ethernet in most cases.  Binary patching will be possible because of the simpler interface allowing for smaller code, and there are not that many programs that use RR-NET ethernet, in any case.  Is this a perfect solution? No, but it is a reasonable trade-off.


Sixth, sprite vertical-expansion has been fixed, so the players in International Soccer have returned to their full stature.  There are still some remaining display glitches with sprites that are on my list to tackle in the next couple of days.  But here is International Soccer looking like it should:


Friday, 5 January 2018

Curious problems and that Super Games banked cartridge

While testing cartridges I am sometimes hitting a rather strange set of symptoms if I have a cartridge plugged in.  What makes it especially weird is that it only happens if the cartridge is an Ultimax cartridge, and that the problem persists after the cartridge has been removed, until the board has been powered off for a few seconds.

1. Kickstart gets stuck loading the C65 ROM with all sorts of SD card read-errors. Or it could be that Kickstart is not running correctly in some way.

2. Kickstart thinks that some of the Nexys 4's switches are set, even though this is on a MEGA65 r1 PCB, which has those input lines wired to ground in the VHDL, i.e., it is not possible to set those lines at all on this machine.


I am not sure if it is related, but the USB serial interface also stops working, or stops sending one particular hexadecimal digit, typically one that has several one bits in a row in its ASCII representation, presumably because of a timing error.

The serial interface problems suggest to me that the problem might be timing related.  The SD card problems could also be timing related, for example, with the FPGA generating clock speeds that are too high or too low, or just plain jittery. The FPGA design meets total timing closure, so timing closure in the design can be eliminated as a potential cause.

That this should be caused by the presence of a particular type of C64 cartridge is rather odd.

Then sometimes I see the board thinking the switches are set while booting, but when I check them after exiting the Hypervisor, they have cleared.  This has happened specifically with the Super Games cartridge. My gut feeling is that this is a useful piece of information: if I can figure out when the switch inputs go back to all zeros (even though, as I say, they ought to be fed by inputs that are specifically tied to ground in the VHDL), it should tell me a lot.

Ah, now I have something interesting:  The incorrect reading of these two switch registers (and several other nearby registers) happens only when in the Hypervisor (indicated by the H- at the end of the register information lines), as the little log of serial monitor interaction below shows. This would also explain the SD card errors, since the problem seems to be strange values being read from the SD card registers, rather than actual problems with the SD card.

First, we start with the MEGA65 in the hypervisor. By writing to $D67F, we trigger exit from the hypervisor back to normal operation (now with -- at the end of the register information lines). Examining the contents of the switch registers ($D6F0 and $D6F1) we see that they are correctly holding zero:


PC   A  X  Y  Z  B  SP   MAPL MAPH LAST-OP     P  P-FLAS   RP uS IO
80FC 11 22 33 00 BF BEFF 4000 3F00 AD 7C D6 24 00 ..E..I... ..P 11 -FF H-
.

PC   A  X  Y  Z  B  SP   MAPL MAPH LAST-OP     P  P-FLAS   RP uS IO
8200 11 22 33 00 BF BEFF 4000 3F00 4C 00 82 24 00 ..E..I... ..P 11 R01 H-
.Sd67f 1

.r

PC   A  X  Y  Z  B  SP   MAPL MAPH LAST-OP     P  P-FLAS   RP uS IO
8100 11 22 33 00 00 01FF 4000 8F00 4C 00 82 24 00 ..E..I... ..P 11 -FF --
.dd6f0
 :777D6F0 00 00 1F E1 FF A0 00 00 00 02 00 7F 7F 80 00 C0


We then switch back to the hypervisor by writing to $D67F again. This time we have to provide the full address, as the Hypervisor trap registers are not visible normally. Now we get a clue: The serial monitor hit a timeout waiting for the CPU to respond.  The CPU is either stuck doing something, or takes much, much longer to respond than it should: 65,535 or more cycles instead of no more than about 100 (2x 1MHz cycles), even in the worst-case scenario where the memory access is to the cartridge interface and has just missed the last rising edge of the 1MHz clock.

PC   A  X  Y  Z  B  SP   MAPL MAPH LAST-OP     P  P-FLAS   RP uS IO
8102 11 22 33 00 00 01FF 4000 8F00 44 45    26 00 ..E..IZ.. ..P 11 -00 --
.sffd367f 1

?REQUEST TIMEOUT  ERROR
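
To put the timeout in perspective, here is a small back-of-the-envelope sketch of the figures quoted above. The value of 50 CPU cycles per 1MHz cycle is an assumption inferred from "about 100 (2x 1MHz cycles)", and the constant and function names are made up for illustration; they are not taken from the MEGA65 sources.

```python
# Assumption: roughly 50 fast CPU cycles fit in one 1MHz cycle, inferred
# from the post's "about 100 (2x 1MHz cycles)" figure.
CPU_CYCLES_PER_SLOW_CYCLE = 50

# Figure quoted in the post for how long the serial monitor waited.
MONITOR_TIMEOUT_CYCLES = 65_535


def worst_case_wait(slow_cycles=2):
    """CPU cycles spent waiting if an access has to sit out `slow_cycles`
    cycles of the 1MHz clock (just missed one edge, then one full cycle)."""
    return slow_cycles * CPU_CYCLES_PER_SLOW_CYCLE


expected = worst_case_wait()
ratio = MONITOR_TIMEOUT_CYCLES // expected
print(f"worst case: {expected} cycles, timeout is at least {ratio}x longer")
```

Under these assumptions, the stall is hundreds of times longer than even the slowest legitimate cartridge access, so it cannot be explained by ordinary 1MHz bus waiting.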






However, we do enter the Hypervisor with this, but then we see that our problem of non-zero values in these registers returns. Most curiously, it takes a little while for the problem to fully recur. We also see the 04's turning up in other registers where they shouldn't.

.r

PC   A  X  Y  Z  B  SP   MAPL MAPH LAST-OP     P  P-FLAS   RP uS IO
80FC 11 22 33 00 BF BEFF 4000 3F00 44 45    26 00 ..E..IZ.. ..P 11 -FF H-
.dd6f0
 :777D6F0 00 04 1F E5 FF A4 04 04 00 04 00 7F 7F 80 00 C4
.dd6f0                                                         
 :777D6F0 04 04 1F E5 FF A4 04 04 00 06 00 7F 7F 80 00 C4
.dd6f0                                                         
 :777D6F0 04 04 1F E5 FF A4 04 04 04 04 04 7F 7F 80 00 C4
.dd6f0                                                         
 :777D6F0 04 04 1F E5 FF A4 04 04 00 06 00 7F 7F 80 00 C4
 

This all makes me think that there is something funny in the CPU's IO reading logic, which is presumably tickled in some way by the presence of a cartridge. As Hypervisor mode changes the behaviour, it must be some logic that is differentially handled between Hypervisor mode and normal mode.
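
A quick way to see exactly what differs between the dumps above is to XOR them byte by byte. This is just a minimal sketch using two of the dump lines copied from the logs; the helper names `parse_dump` and `changed_bits` are made up for illustration and are not part of any MEGA65 tool.

```python
def parse_dump(line):
    """Parse a monitor dump line such as ' :777D6F0 00 04 ...'
    into (address, list of byte values)."""
    addr_text, *byte_texts = line.strip().lstrip(":").split()
    return int(addr_text, 16), [int(b, 16) for b in byte_texts]


def changed_bits(before, after):
    """Return, for each byte position, a mask of the bits that differ."""
    return [a ^ b for a, b in zip(before, after)]


# Dump taken in normal mode, and one of the dumps taken in the Hypervisor.
NORMAL = " :777D6F0 00 00 1F E1 FF A0 00 00 00 02 00 7F 7F 80 00 C0"
HYPERVISOR = " :777D6F0 04 04 1F E5 FF A4 04 04 04 04 04 7F 7F 80 00 C4"

base, normal = parse_dump(NORMAL)
_, hyper = parse_dump(HYPERVISOR)
diff = changed_bits(normal, hyper)

for offset, mask in enumerate(diff):
    if mask:
        print(f"+{offset:X}: {normal[offset]:02X} -> {hyper[offset]:02X} "
              f"(changed bits {mask:02X})")
```

Every byte that changes does so in bit 2 (the $04 bit), which is what makes the stray 04's stand out in the dumps.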

I tried removing the cartridge, so that the cartridge control lines would float back high, and thus tell the MEGA65 that it doesn't have a cartridge inserted anymore. However, that didn't help.  It did show me that my cartridge line probing function fails to properly charge the lines high before re-scanning, which could cause problems for things like the Action Replay, which fiddle with those lines to change cartridge modes mid-stream.

Looking through the CPU's address resolution logic, there isn't any obvious smoking gun for the cause of this problem. What I need to do next is to see if I can reproduce the problem in simulation.  I also have a suspicion that when the problem is happening, the CPU is reading from the cartridge port when accessing these IO locations, or at least thinks it is in some kind of half-hearted way. However, I have been able to confirm that the CPU doesn't take longer in either mode than the other, so there is no reason to suspect that the CPU thinks it is accessing the cartridge port in these cases.

What is also really curious, is that this problem survives FPGA reconfigurations, which are supposed to clear all internal state in the FPGA.  This is even when I have removed the cartridge before triggering the reconfiguration, so the newly loaded FPGA bitstream has no idea that a cartridge had even been inserted.

A brief power off (~0.5 seconds) also doesn't fix it. In fact, removal of power for at least 5 seconds seems to be required!  So where is the state that is being preserved? Even if Xilinx ISE were generating a bad bitstream, it would presumably not be able to cause the keeping of state in this kind of way. And yet there aren't really any external components that seem likely candidates for holding such state for so long.

Anyway, the next step is indeed to see if I can't get this problem to show up in simulation in some way.

My best guess otherwise is that one of the expansion pins is the cause, through the lack of pull-up resistors causing signals to be interpreted as asserted, when they are not supposed to be.

Meanwhile, some earlier tweaks had been synthesising, including using only phi2 for CPU access to the cartridge. When I tried the result with the Super Games cartridge, after a couple of false starts, it worked in its entirety.  At first, only Silicon Syborgs would start, but after reinserting the cartridge a couple of times, all games were working.  So that one might have been a case of the contacts still being a bit dodgy after cleaning with alcohol. This is quite possible, since alcohol won't remove corrosion, and this is the cartridge that was coated in a goodly layer of what looks like Outback sand.  It is rather nice to have a memory-banked cartridge working on the MEGA65. This is quite a milestone, as it involves both IO and memory, and in a dynamic manner.   Here are some screen shots:






Also, all the Ultimax cartridges are working again now. But again, I don't know if this is a freak of synthesis, or if the changes I made actually fixed it. The uncertainty is the really annoying part. What I have done, and am currently synthesising, is a feature where I can disable the chip-select lines on most of the internal devices on the IO bus of the MEGA65, so that if the problem is simultaneous driving of the bus, it will get picked up. As part of this I have also converted a couple of the IO devices that were checking the address internally to use a chip-select line instead, so that the whole thing is a bit simpler, and a bit more regular.  Hopefully I can find and fix the problem fairly quickly.
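
The idea behind disabling chip-selects can be illustrated with a toy software model. This is only a sketch, not the actual VHDL: the device names, address ranges and values are all hypothetical, and combining multiple drivers with a bitwise OR is an assumption made purely so that the collision is visible in the result.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    base: int              # first address the device answers to (hypothetical)
    size: int              # number of addresses it answers to
    value: int             # byte the device would put on the bus
    enabled: bool = True   # models the chip-select enable switch

    def drives(self, addr):
        return self.enabled and self.base <= addr < self.base + self.size


def drivers(devices, addr):
    """Names of all devices that try to drive the bus for this address."""
    return [d.name for d in devices if d.drives(addr)]


def read_bus(devices, addr):
    """Toy bus read: assume simultaneous drivers get OR-ed together."""
    value = 0
    for d in devices:
        if d.drives(addr):
            value |= d.value
    return value


def find_collisions(devices, addresses):
    """Map each address claimed by more than one device to the claimants."""
    collisions = {}
    for addr in addresses:
        names = drivers(devices, addr)
        if len(names) > 1:
            collisions[addr] = names
    return collisions


# Hypothetical example: a second device wrongly decodes the same address.
switches = Device("switches", 0xD6F0, 2, 0x00)
rogue = Device("rogue", 0xD6F1, 1, 0x04)
bus = [switches, rogue]

print(find_collisions(bus, range(0xD6F0, 0xD6F2)))
print(hex(read_bus(bus, 0xD6F1)))   # both devices driving
rogue.enabled = False               # "disable its chip-select"
print(hex(read_bus(bus, 0xD6F1)))   # only the intended device driving
```

If a read changes when a device that should not be answering at that address is switched off, that device is a collision suspect. Whether the real problem is a bus collision at all is exactly what the test bitstream is meant to find out.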