Saturday, January 20, 2018

Improving the DMAgic controller interface

The C65 has a DMA controller, the "DMAgic". Well, actually, each C65 has either the A or B revision of the F018 DMAgic IC.  This means we already have some magic to support different revisions of the C65 ROM that assume either one or the other revision of the F018.

Add to that the fact that the MEGA65 already has extensions to DMAgic to support memory access outside of the 1st mega-byte of memory. Until now, those were implemented as memory mapped registers, as that was the most convenient way to implement them.  However, as those extensions have grown, having more and more memory mapped registers that affect the way that a DMA job is interpreted was getting a bit hairy, as it meant that each caller of a DMAgic job needed to be aware of those registers, or alternatively, the previous caller had to be well behaved and clear all the extra registers out. Neither seemed a really satisfactory option.

So instead, I have removed all those extra registers, leaving only two additional registers, that are required for issuing DMA jobs, plus the register that allows selection between F018A and F018B mode for normal DMA jobs:

$D703 - Select F018A/B mode.
$D704 - Sets the upper 8 bits of the 28-bit address where the DMA list is to be loaded from, i.e., which mega-byte of memory the DMA list lives in.
$D705 - Like $D700, it sets the bottom 8 bits of the DMA list address, and triggers the start of a DMA job, but unlike $D700, it triggers an enhanced DMA job.

All the extra options, like which MB of RAM is being copied from and to, and DMA stepping rates are now specified in a set of variable-length options prefixed to the front of the DMA list.

For example, the DMA job to clear the screen in the hypervisor now looks like:

        ; Set bottom 22 bits of DMA list address as for C65
        ; (8MB address range)
        ;
        lda #$ff
        sta $d702

        ; Kickstart ROM is at $FFFE000 - $FFFFFFF, so
        ; we need to tell DMAgic that DMA list is in $FFxxxxx.
        ; this has to be done AFTER writing to $d702, as $d702
        ; clears bits 27 - 22 of the DMA list address to help with
        ; compatibility.
        ;
        lda #$ff
        sta $d704

        lda #>erasescreendmalist
        sta $d701

        ; set bottom 8 bits of address and trigger DMA.
        ;
        lda #<erasescreendmalist
        sta $d705


erasescreendmalist:
        ; Clear screen RAM
        ;
        ; MEGA65 enhanced DMA options
        .byte $0A      ; Request format is F018A
        .byte $00 ; end of options marker
       
; F018A DMA list
        .byte $04   ; COPY + chained request
        .word 1996  ; 40x25x2-4 = 1996
        .word $0400 ; copy from start of screen at $0400
        .byte $00   ; source bank 00
        .word $0404 ; ... to screen at $0404
        .byte $00   ; screen is in bank $00
        .word $0000 ; modulo (unused)



Two things differ from the old method of calling such a job: we write to $D705 instead of $D700 to initiate the DMA job, and the DMA list now begins with option bytes, in this case two of them. The first ($0A) tells the DMAgic that the DMA list that follows will be in F018A format, and the second ($00) is the end of options marker.  Now there is no confusion for a particular list as to whether it expects an F018A or B, and any further extensions that we add can be safely ignored, because they are all disabled by default for each job.  Also, the job setup code is shorter, because it doesn't need to set or clear any DMA options, and there is no longer any need for DMA cleanup code, to put options back to how a naive caller might expect them.

The current list of supported options is:

$00 = End of options
$06 = Use $86 $xx transparency value (don't write source bytes to destination, if byte value matches $xx)
$07 = Disable $86 $xx transparency value.
$0A = Use F018A list format
$0B = Use F018B list format
$80 $xx = Set MB of source address
$81 $xx = Set MB of destination address
$82 $xx = Set source skip rate (/256ths of bytes)
$83 $xx = Set source skip rate (whole bytes)
$84 $xx = Set destination skip rate (/256ths of bytes)
$85 $xx = Set destination skip rate (whole bytes)
$86 $xx = Don't write to destination if byte value = $xx, and option $06 enabled


$00 and $0A we have already met.
$0B is the opposite of $0A, and tells the DMAgic to expect an F018B format DMA list.
$06 and $07 allow enabling/disabling of a "transparent value", that is, a value that is not written during a DMA copy. For example, if you were copying an image with a transparent colour, you can now tell the DMAgic what colour that is, and it will copy all the bytes that don't have that value. The value is set via the $86 option.
Then $80 and $81 allow setting of the upper 8 bits of the 28-bit source and destination addresses, i.e., which mega-byte of memory to copy to/from.
$82 - $85 allow setting the stepping rate of the DMA. This allows memory copies that smear out or squish up the source, say, for example, if you wanted to scale a texture when drawing it.
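To make the layout concrete, here is a minimal Python sketch (not MEGA65 code) that assembles the same bytes as erasescreendmalist and then walks the option prefix. The helper names build_f018a_job and split_options are hypothetical, and the rule that option bytes of $80 and above carry exactly one argument byte is an assumption read off the option table above.

```python
import struct

END_OF_OPTIONS = 0x00
USE_F018A = 0x0A

def build_f018a_job(command, count, src, src_bank, dst, dst_bank, modulo=0):
    """Pack one 11-byte F018A list entry, in the field order of the listing above."""
    return struct.pack("<BHHBHBH", command, count, src, src_bank,
                       dst, dst_bank, modulo)

def split_options(dma_list):
    """Return (options, remaining_bytes) for an enhanced DMA list.

    Assumption: options below $80 stand alone, options from $80 up are
    followed by one argument byte, and $00 ends the option prefix.
    """
    options = []
    i = 0
    while True:
        opt = dma_list[i]
        i += 1
        if opt == END_OF_OPTIONS:
            return options, dma_list[i:]
        if opt >= 0x80:
            options.append((opt, dma_list[i]))
            i += 1
        else:
            options.append((opt, None))

# Same job as erasescreendmalist: copy $0400 onto $0404, 1996 bytes.
erase_screen = bytes([USE_F018A, END_OF_OPTIONS]) + build_f018a_job(
    command=0x04, count=1996, src=0x0400, src_bank=0x00,
    dst=0x0404, dst_bank=0x00)

options, job = split_options(erase_screen)
print(options, job.hex())
```

A list that also set the source mega-byte would simply start with $80 $xx before the $0A, and the same walker would return it as an (option, argument) pair.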

Anyway, the net result is a nicely extensible architecture for the DMAgic in the MEGA65, and one that results in increased compatibility with the C65 when faced with lazy programmers, as well as saving bytes  for the typical case where most of the options are not required.  It also makes it easier to freeze and resume a MEGA65 program, because there are now fewer registers to save and restore.

Wednesday, January 17, 2018

Repairing an Amiga Mouse, and then using it on the MEGA65

As anticipated earlier, I have added transparent support for Amiga mouses to the MEGA65, so that people don't have to find a 1351, and can use the existing USB to Amiga mouse adapters, to allow use of newer mouses.

To get that working, I put out an appeal on Facebook for anyone who had an Amiga mouse to spare (not necessarily 100% working, just enough so that I could test with), and a kind person from Tasmania sent me one with a problem with the X axis.  Nonetheless, it was enough for me to do what I needed to do, and as a result the MEGA65 works very nicely indeed when an Amiga mouse is plugged in, and can automatically detect an Amiga mouse, 1351 mouse, paddles and joystick -- in real time -- and switch modes on the joystick port as required, so that no user fiddling is required.

This all works by looking at what the digital and analog lines are doing on each joystick port, to see whether the behaviour matches one or the other type of device.  It all turned out to be relatively simple in the end, requiring only a modest amount of fiddling to get it stable.

However, as I mentioned, the Amiga mouse that was donated was a bit sick, and I would really like it fully working, as I can use it with the MEGA65 r1 PCB without any funny adapters to route the POT lines, as I currently do for the 1351 mouse (this will of course be fixed on the r2 PCB).  So I started poking around in the mouse today to find out what is wrong, exactly.

It didn't take long with the multi-meter to work out that the voltage output of one of the infra-red light sensors on the mouse was a bit lower than it should be. The mouse shows signs of having been physically repaired around that part of the assembly, so it isn't too surprising, really.  After a bit of looking at Amiga mouse schematics, I could see that the mouse sensors are fed through an LM339 type comparator to produce a nice 5V square-wave output from the fuzzy light sensor readings.

Each light sensor is paired with a reference voltage that is used to determine whether to output 5V or 0V, depending on whether the sensor voltage exceeds the reference voltage or not.  Since the problem was that the sensor voltage was a bit low, I figured the easy solution was to put a bit of extra bias towards 0V on the reference voltage, so that the weakened sensor voltage would again cross it.  It looked like there was about 1.5K resistance between the reference voltage and ground, so I thought I would start by adding a 2K Ohm resistor between the reference voltage and ground, like this:

Only I realised when testing that I had put the bias on the sensor voltage instead, i.e., making the problem worse, rather than better.  So then I thought, well, why don't I just put the bias towards 5V on the sensor voltage, like this:


Only it turns out that that doesn't work. I didn't bother exploring why.  So, back to Plan A, only this time with the correct pins:


The astute observer will notice that the above picture shows a 1K not 2K resistor. This was because in testing I found a 1K resistor provided the correct bias correction.

Then it was time to test on the MEGA65 using the Mouse Test 2 program, which as we can see below I was able to use to move the 1351 mouse indicator, with the others staying put -- even though I was using an Amiga mouse:

However, that is all a bit dull to look at, so I made a short video showing it in action, with me turning the 1351 emulation mode on and off in real-time:


So now the mouse all works very nicely, and I can work on software that uses the mouse.

Tuesday, January 16, 2018

Rebuilding SD card access

In my enthusiasm reworking the SD card and F011 code, it turns out I got a bit overzealous, and stripped out the code that provides the memory-mapped sector buffer for SD card access. Clearly not optimal. So I had better fix that up.

Before the rework, there were three separate sector buffers:
1. One for the CPU to read as a memory mapped sector buffer;
2. One for the CPU to write to, and the SD controller to read from; and
3. One for the F011 emulation.

At the moment, only the third one is still there.

This means that SD card access is currently only working for mounted disk images -- but there is no way to mount a disk image, as the Hypervisor can't even read the partition table of the SD card.

We could go back to having three buffers, but that seems a waste of precious BRAMs.  The question is whether we can do everything we need with just the one buffer, or whether we need two.  The answer to that is that we do need a second buffer if we want to have memory mapped access to the sector buffer.  That second buffer would be connected to the MEGA65's FastIO bus for reading, so that the CPU can read from it when it wants to. This buffer would be written to when the SD card controller reads data from the SD card.  To allow the CPU to write to the buffer, we would trap writes to the sector buffer location, and pass them to the SD controller to actually perform the write operation.

The complication is that when the SD card controller is asked to write a sector, it has no way to read this buffer, as only the CPU can cause it to be read.  We can solve this by making the F011 sector buffer 1K instead of 512 bytes (since BRAMs are 4K, this doesn't cost us anything extra), and whenever the CPU writes to the SD card sector buffer, we also write to the second 512 bytes of the F011 buffer.  When the SD card controller is asked to write a sector to the SD card, it uses this second copy of the data, which it does have read access to, in order to perform the write.

Another wrinkle is virtualised F011 mode, where the hypervisor gets a trap whenever a sector read or write is attempted.  This is used with monitor_load to allow feeding a disk image via the serial monitor interface, instead of using the SD card (handy for some software development tasks, and if your SD card interface is broken, as it is on Falk's r1 PCB).  So I need to preserve that.

Probably the best solution here is to have the two buffers, each with 2 x 512 bytes, with the lower half the F011 buffer, and the upper half the SD card buffer, and have a register bit that allows selecting which one is being accessed.

After a bit of fiddling about, this is all done now and working nicely, and the saved BRAM is also a nice result.

Switching back and forth between the SD card and floppy drive works, but it seems that it is possible for the track numbers to get out of sync, so it is necessary to seek to track 0 after switching from SD card, to make sure that everything matches up.   Ideally we would have some program to allow switching back and forth in a really easy way. Initially modifying the disk chooser menu program is probably the right way to do this, so that there is an option to select "internal drive" as one of the disk choices.

After that, the next step on the floppy drive now is to get writing working, including ideally formatting disks which requires unbuffered writing.

Improving my bench MEGA65 prototype hardware

After the last several posts focussing on VHDL implementation of various interfaces and things, here is a much shorter read with more pictures, following the improvement of my bench-test MEGA65 revision 1 PCB. 

Here it is before improvement:

 

The main problems I wanted to solve were, in no particular order:

1. The floppy drive had to live externally and loose, and be powered by an adapter on the joystick port. I want it internal, and firmly held in place.

2. No keyboard. Is further explanation really necessary?

3. The headphone jack and FPGA programming port are very close, which I can't change, but the hole in the case for them was too small to allow both to be plugged in at the same time.  Very annoying when the kids want to play games, and I have to keep unplugging the sound to plug in the data interface and vice-versa.

4. The hole for the joystick ports was very tight, and needed a little enlarging.

5. The hole for the HDMI port was also too small.

6. The cartridge port hole was also too small.

7. I wanted to install pull-ups for the IEC bus, so that it would just work, without having to plug anything strange in (and so that it wouldn't lock the C65  ROM during boot-up).
 
8. I wanted all the improvements to result in a reasonably physically robust arrangement, that wouldn't be at risk of falling down and damaging itself or shorting itself out when used, whether by myself, or by the kids.

The clear plastic case is big enough for everything, so I figured I would just enlarge holes, and in the case of the floppy drive, make some new holes.  It would be nice if I had the correct tools for cutting holes in plastic. Instead, I have a power drill and some 3mm wood drill bits.  Making the hole for the floppy drive consisted of drilling perforations around the outline, and then joining them up using the drill as a kind of power saw.  Not ideal at all, but it worked. Here it is with most of the holes drilled:


Then with all the holes joined up, and the piece knocked out, but yet to be filed into a nice rectangle. Sorry the shot is a bit blurry. Cameras don't like photographing nearly invisible objects very much.


 We'll come back to that a bit later.

Then it was time to think about how to attach my genuine C65 keyboard (without printed key caps) to the top of the box, such that it couldn't fall off, fall in, or snag the fragile ribbon cable that connects it to the mother board.

I had a piece of acrylic the right size to sit on top of the box, so I traced out around where I wanted the keyboard to be mounted on it:


Then fitted a couple of scrap plastic blocks to the underside, so that when it rests in the top of the plastic case, it can't move in any direction.  Here is the arrangement from underneath:


With those in place, and a couple of extra holes to fix the keyboard to the top (using the existing two holes in the keyboard, which presumably were designed with a similar purpose in mind), I had a keyboard sitting nicely and securely on top of the box:


Here you can see it from the side, with the green plastic thingoes you put in walls before you put screws in.  If only I could remember their name. Anyway, even nameless they work just fine for this job:


Then it was time to fix a few issues with the PCB, adding a floppy power connector, and pull-ups for the IEC serial bus, so that that can work without any special cable attached. By permanently fitting the pull-ups, it means I can't use the IEC port to drive the POT lines on the joystick, as I did while implementing 1351 mouse support, however as that is done, and since I can use an Amiga mouse transparently (a blog post on that coming up soon), I figured this was no great loss. The expansion/cartridge port was the easiest place to find power:


So after doing that, and having finished making the hole for the floppy drive (which is held in place with four screws on the underside, like in an old PC), I had all the electronics inside. By this time I had also already enlarged the various holes in the case.


Connecting the ribbon cable for the keyboard was fairly straightforward:


Here it is all together. The keyboard ribbon pokes up a bit, which I don't really like, as it is still at some risk of damage like that. But it is not snagging on anything or under any tension, so it will have to do for now:
And the view from the left side with the power and joystick ports etc:


 And set up in my office, ready for use:
So while it is clearly a bench prototype, it is now all assembled and functional, without a mess of cables and having to plug and unplug things all the time.

Monday, January 15, 2018

Bringing the internal 3.5" floppy drive to life - part 2

Again, a longer post documenting the process of making the real 3.5" floppy drive work.  What I thought would be hard, the MFM decoding and sector reading from the real disk, turned out to be quite easy -- taking only a day or so.  But what I thought would be easy, plumbing this into my existing F011 implementation, turned out to be a long string of strange bugs to track down, and took a week or so to work through.  Anyway, here it is.

So, yesterday, I managed to successfully decode MFM pulses from the 3.5" floppy drive, but only in a little C test program on my laptop.  Today, I want to move that into VHDL, make sure I can decode MFM data there, and then correctly parse the sector markers etc, to find a requested sector, and push the sector bytes out, and do all the other little bits and pieces, like CRC checking, required to plumb it into my F011 floppy controller implementation in the MEGA65. The result should be that the MEGA65 can read data from a real 1581 floppy disk in the 3.5" floppy drive.

Yesterday I had broken down the MFM parsing into a set of clearly defined steps: measure pulse gaps, quantise pulse gaps, turn pulse gaps into bits/sync markers, turn bits into bytes.  Turning those things into VHDL was rather easy, give or take the odd spot of debugging.  Similarly, implementing a parser that looks for sync marks and captures the track/sector numbers and compares them with requested track/sector, and works out whether the following sector data should be read out was also fairly easy.  Then came the CRC checks.
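The back half of that pipeline can be sketched in a few lines of Python. This is not the VHDL, only an illustration under the standard MFM rules: each pulse gap is assumed to be already quantised to 2, 3 or 4 half-bit-cells, a gap of n half-cells contributes n-1 zeros and then a one to the raw bit stream, and every second raw bit is a data bit. All function names are made up for the example, a matching encoder is included so the round trip can be checked, and sync marks (which are written with a deliberately missing clock pulse) are not handled.

```python
def mfm_encode(data, prev_bit=0):
    """Raw MFM bit list (clock, data, clock, data, ...) for plain data bytes."""
    raw = []
    for byte in data:
        for shift in range(7, -1, -1):
            bit = (byte >> shift) & 1
            # A clock pulse is only written between two zero data bits.
            clock = 1 if (prev_bit == 0 and bit == 0) else 0
            raw += [clock, bit]
            prev_bit = bit
    return raw

def raw_to_gaps(raw):
    """Index of the first pulse, then pulse-to-pulse distances in half-cells."""
    ones = [i for i, b in enumerate(raw) if b]
    return ones[0], [b - a for a, b in zip(ones, ones[1:])]

def gaps_to_raw(first_pulse, gaps):
    """Rebuild the raw bit stream from quantised gaps (2, 3 or 4 half-cells)."""
    raw = [0] * first_pulse + [1]
    for gap in gaps:
        if gap not in (2, 3, 4):
            raise ValueError(f"invalid MFM gap: {gap}")
        raw += [0] * (gap - 1) + [1]
    return raw

def raw_to_bytes(raw, first_data_index=1):
    """Keep every second raw bit (the data bits) and pack them MSB first."""
    data_bits = raw[first_data_index::2]
    out = bytearray()
    for i in range(0, len(data_bits) - 7, 8):
        byte = 0
        for bit in data_bits[i:i + 8]:
            byte = (byte << 1) | bit
        out.append(byte)
    return bytes(out)

data = bytes([0x4E, 0xFE, 0x00, 0x01])
first_pulse, gaps = raw_to_gaps(mfm_encode(data))
decoded = raw_to_bytes(gaps_to_raw(first_pulse, gaps))
print(gaps[:5], decoded.hex())
```

Zero cells after the last pulse cannot be recovered from gaps alone, which is why the sample data ends in a 1 bit. A real decoder also has to find the bit alignment itself, which is what the sync marks are for.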

The 1581 uses a CRC check that is similar to, but not exactly like the CCITT CRC16 algorithm.  The C65 specifications manual provides an explanation and example routine:


Generating the CRC

     The  CRC  is a sixteen bit value that must be generated serially,
one  bit  at  a  time.  Think of it as a 16 bit shift register that is
broken in two places. To CRC a byte of data, you must do the following
eight  times,  (once  for each bit) beginning with the MSB or bit 7 of
the input byte.

     1. Take the exclusive OR of the MSB of the input byte and CRC
        bit 15. Call this INBIT.
     2. Shift the entire 16 bit CRC left (toward MSB) 1 bit position,
        shifting a 0 into CRC bit 0.
     3. If INBIT is a 1, toggle CRC bits 0, 5, and 12.

     To  Generate a CRC value for a header,  or for a data field,  you
must  first  initialize the CRC to all 1's (FFFF hex).  Be sure to CRC
all bytes of the header or data field, beginning with the first of the
three  A1  marks,  and ending with the before the two CRC bytes.  Then
output  the  most  significant CRC byte (bits 8-15) and then the least
significant CRC byte  (bits 7-0).  You may also CRC the two CRC bytes.
If you do, the final CRC value should be 0.

     Shown below is an example of code required to CRC bytes of data.

;
; CRC a byte. Assuming byte to CRC in accumulator and cumulative
;             CRC value in CRC (lsb) and CRC+1 (msb).

        CRCBYTE LDX  #8          ; CRC eight bits
                STA  TEMP
        CRCLOOP ASL  TEMP        ; shift bit into carry
                JSR  CRCBIT      ; CRC it
                DEX
                BNE  CRCLOOP
                RTS

;
; CRC a bit. Assuming bit to CRC in carry, and cumulative CRC
;            value in CRC (lsb) and CRC+1 (msb).

       CRCBIT   ROR
                EOR CRC+1       ; MSB contains INBIT
                PHP
                ASL CRC
                ROL CRC+1       ; shift CRC word
                PLP
                BPL RTS
                LDA CRC         ; toggle bits 0, 5, and 12 if INBIT is 1.
                EOR #$21
                STA CRC
                LDA CRC+1
                EOR #$10
                STA CRC+1
       RTS      RTS

It is super helpful to have an example implementation, as well as explanation. Nonetheless, it took me about 2 hours to actually get the CRC calculating correctly, as CRC routines are notoriously hard to get exactly right: they require considerable attention to detail and very sound comprehension of the algorithm.
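For readers who want to experiment, the three numbered steps from the manual translate almost line for line into Python. This is a sketch of the algorithm as described in the quoted text, not the VHDL used in the MEGA65, and the function names are invented for the example. The header bytes are taken from the capture shown later in this post.

```python
def crc_bit(crc, bit):
    """Feed one input bit into the 16-bit CRC, following steps 1 to 3."""
    inbit = bit ^ (crc >> 15)        # 1. input bit XOR CRC bit 15
    crc = (crc << 1) & 0xFFFF        # 2. shift left, 0 into bit 0
    if inbit:
        crc ^= 0x1021                # 3. toggle bits 0, 5 and 12
    return crc

def crc_byte(crc, byte):
    """Feed one byte, most significant bit first."""
    for shift in range(7, -1, -1):
        crc = crc_bit(crc, (byte >> shift) & 1)
    return crc

def crc_field(data):
    """CRC of a header or data field, starting from all ones."""
    crc = 0xFFFF
    for byte in data:
        crc = crc_byte(crc, byte)
    return crc

# Three $A1 marks, then $FE and the four header bytes from the capture.
header = bytes([0xA1, 0xA1, 0xA1, 0xFE, 0x00, 0x01, 0x01, 0x02])
print(f"{crc_field(header):04x}")
```

Running the CRC over the field plus its two CRC bytes (most significant byte first) gives zero, as the manual says, which makes for a convenient pass/fail check.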

Another interesting problem was how to test this in simulation.  I already have a debug register on the MEGA65 that allows me to read the FDC data read line. However, as some signals can be as narrow as 120ns, this requires sampling at at least 5MHz.  Using DMA I could sample it at around 20MHz, however, this meant being able to capture only a part of a sector.  And even at 50MHz, the CPU is fractionally too slow, unless I completely unrolled the data capture loop, in which case I would still only be able to capture a relatively few samples, as 64KB of data capture loop would only be able to cover about 10K samples.  What I realised after, was that I should just write a C program that MFM encodes a set of sectors, and feed that in.  If my existing C program for decoding MFM data can read it, then it should make fine test input data for the VHDL. Writing such a program would also be the first step towards being able to write data to floppies, so it is work that needs to happen, anyway.  If I have problems with the current work, I will certainly follow this path.

Then it was a case of debugging some out-by-one errors and the like on sector read lengths, and making sure the right flags are asserted at the right times.  Finally, the whole new MFM assembly had to be plumbed into the existing VHDL, and some new registers added to the SDcard (and now, floppy drive) controller, so that it is possible to select the SD card or the floppy drive as the source for C65 F011 floppy controller accesses.

At this stage, all the machinery is now in place for this to work, assuming that there are no bugs in my VHDL.  Because synthesis takes a long time, I have added some debug registers that will allow me to interrogate more closely what the floppy drive and MFM decoder are doing.  While it would be nice to think that I won't need them, and the test-benching of the MFM decoder with real captured data helps reduce the risks, I suspect I will be getting familiar with them.  We will see how that pans out in a few hours.

So, synthesis has finished, and the selection register to switch to using the real floppy drive seems to work, and the read request line gets asserted to the MFM decoder, but there is no sign of bytes being read from the drive.  So, I will do what I should have done before synthesising the first time, and add debug registers to see right down to the lowest level of the MFM decoder.

Okay, next resynthesis, and I can see that it is finding gaps, and decoding bytes, but it never finds a sync byte. This is likely due to the register I set up for the number of cycles per MFM bit being limited to too low a value, thus preventing the gaps from being properly detected.  I rather confused myself here for a while, because I couldn't remember definitively the data rate for a 720K drive. Was it 250K, 500K or 1Mbit?  It took a long time to actually find a list. It is 250K, so 4 usec per bit. At 50MHz, this means we should be using 200 as the cycles per interval. So I should be seeing gaps of around 200, 300 and 400 cycles.
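The arithmetic, and one plausible way of quantising a measured gap, looks like this in Python. The round-to-nearest rule and the decision to reject anything outside 2 to 4 half-bit-cells are assumptions for illustration; the thresholds in the actual VHDL may well differ.

```python
CLOCK_HZ = 50_000_000       # MEGA65 clock used for counting
DATA_RATE_BPS = 250_000     # 720K drive, MFM

cycles_per_bit = CLOCK_HZ // DATA_RATE_BPS   # 4 usec per bit at 50MHz

def quantise_gap(cycles, cycles_per_bit=cycles_per_bit):
    """Return the gap in half-bit-cells (2, 3 or 4), or None if invalid.

    Assumption: round to the nearest half-cell, so nominal gaps of
    1.0, 1.5 and 2.0 bit times map to 2, 3 and 4.
    """
    half_cell = cycles_per_bit // 2
    units = (cycles + half_cell // 2) // half_cell
    return units if units in (2, 3, 4) else None

for measured in (200, 300, 400, 1000):
    print(measured, quantise_gap(measured))
```

With these numbers a gap of 1000 cycles is rejected outright, which is the kind of value that should never appear on a healthy track.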

So, in terms of gaps, I am seeing plenty of them around 200 cycles, corresponding to runs of identical bits. This makes sense.  I am also seeing a reasonable number around 400, which also makes sense, as well as some around 300. However, I am also seeing a lot of seemingly randomly distributed gaps, anywhere up to at least 1000 cycles, which is more than 2x the maximum we should see.  There is something clearly wrong here.  However, looking at the source code for the gap collecting code, it is outrageously simple, and of course, it works fine in simulation.  What is particularly curious is that the failure mode appears to be missing pulses, rather than seeing spurious pulses, as would happen if there were high frequency glitching on the floppy read-data line.  Yet to be missing pulses is very strange, since the debug register that allows direct reading of the floppy read-data line shows no sign of these random-length pulse intervals. This makes me a little worried about intermittent glitches I have seen, where a couple of registers in particular are read with the wrong values.  As much as I would like to blame the synthesis tools, it is quite possible I have a subtle timing bug that might just be tickling things here.  My immediate next step is to resynthesise with the mfm_gaps module outputting a history of the values it has seen on the f_rdata line, so that I can see if it is indeed missing pulses.

And things get stranger.  While waiting for synthesis of the above debug register, I tried again, this time simply having the M65 try to boot the C65 ROM using the real floppy drive.  In this situation, the C65 ROM will try to load an auto-start file from the floppy.  And all of a sudden, I am seeing better distribution of gaps. I think that what is happening here is the ROM steps the head, which results in it being definitively on a track, whereas if the drive was powered off, the head might have been able to move off track a little (or outside tracks 0 - 79).  I'm still not really sure, but it seems to be something mechanical.

To avoid the long resynthesis times, I have imported my MFM controller into my joystick controlled test bench.  That way I can iterate after only a few minutes. The down side is I have very limited input and output, and can't capture and stream signal values.  This is what I used to work out that stepping the head helps.  However, even after stepping, while I see quantised gaps being detected, with essentially no invalid gap lengths, it is still not detecting any sync bytes.
So, I added yet another debug option (and had another 8,000 second synthesis) to log the quantised pulse gaps, and wrote a little program to interpret a capture of those signals.  That is, I tested the function of the pulse detection and gap quantisation.  And the results are good. Here is what I saw from my captured trace after decoding:

 $42 $42 $42 $42 $42 $72 $72 $72 $72 $72 $72 $72 $72 $72 $72 $72
 $72 $72 $72 $72 $72 $72 $72 $72 $72 $72 $72 $72 $72 $72 $72 $72 $72
 $72 $72 $72 $70 $00 $00 $00 $00 $00 $00 $00 $00 $00 $00 $00 $05
Sync $A1
Sync $A1
Sync $A1
 $fe $00 $01 $01 $02 $fd $5f $4e $4e $4e $4e $4e $4e $4e $4e $4e
 $4e $4e $4e $4e $4e $4e $4e $4e $4e $4e $4e $4e $4e $00 $00 $00 $00
 $00 $00 $00 $00 $00 $00 $00 $00
Sync $A1
Sync $A1
Sync $A1
 $fb $00 $00 $00 $00 $00 $00 $00 $00 $00 $00 $00 $00 $00 $00 $00
 $00 $00 $00 $00 $00 $00 $00 $00 $00 $00 $00 $00 $00 $00 $00 $00 $00
 $00 $00 $00 $00 $00 $00 $00 $00 $00 $00 $00 $00 $00 $00 $00 $00 $00
...

$00 $00 $00 $00 $00 $00 $00 $00 $00 $00 $00 $00 $00 $00 $00 $00 $00
 $00 $00 $00 $00 $00 $00 $00 $00 $00 $00 $00 $00 $00 $00 $00 $00 $00
 $00 $00 $00 $00 $da $6e $4e $4e $4e $4e $4e $4e $4e $4e $4e $4e $4e
 $4e $4e $4e $4e $4e $4e $4e $4e $4e $4e $4e $4e $4e $4e $4e $4e $4e
 $4e $4e $4e $4e $4e $4e $4e $00 $00 $00 $00 $00 $00 $00 $00 $00 $00
 $00 $00
Sync $A1
Sync $A1
Sync $A1
 $fe $00 $01 $02 $02 $a8 $0c $4e $4e $4e $4e $4e $4e $4e $4e $4e
 $4e $4e $4e $4e $4e $4e $4e $4e $4e $4e $4e


First, we can see that Sync $A1 bytes are correctly detected.  Second, we can see that a complete sector is observed, with the $FE header ID byte and track, side, sector and sector size bytes. Then there is a $FB ID byte following the next burst of Sync bytes, indicating the sector data, and a completely empty sector, followed by two CRC bytes ($da and $6e), and the $4e and $00 gap bytes, before the header of another sector is visible. I did notice an error in checking this: I was pulling the header bytes out as Track, Sector, then Side, not Track, Side and then Sector. That's easy enough to fix.
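As a small illustration of that field order, here is a Python sketch that pulls a header apart in Track, Side, Sector order. The SectorHeader type and parse_header function are hypothetical names for this example, and the sector size is expanded using the usual 128 << code convention for this disk format, which gives 512 bytes for the size code $02 seen in the trace.

```python
from typing import NamedTuple

class SectorHeader(NamedTuple):
    track: int
    side: int
    sector: int
    size_bytes: int
    crc: int

def parse_header(field):
    """Parse the 7 bytes that follow the three $A1 sync marks of a header."""
    if len(field) != 7 or field[0] != 0xFE:
        raise ValueError("not a sector header")
    # Order on disk is track, side, sector, size code (not track, sector, side).
    track, side, sector, size_code = field[1:5]
    crc = (field[5] << 8) | field[6]      # most significant CRC byte first
    return SectorHeader(track, side, sector, 128 << size_code, crc)

# First header from the captured trace above.
print(parse_header(bytes([0xFE, 0x00, 0x01, 0x01, 0x02, 0xFD, 0x5F])))
```

The second header in the trace differs only in the sector byte ($02) and its CRC, which fits the sectors going past the head one after another.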

So now we know that the intervals are being correctly quantised, and the stream of intervals that should be detected as a Sync mark are being generated.  Attention thus turns to the mfm_gaps_to_bits module, where the sync bytes are generated. Assuming that the problem must be in the collection and assembly of the strings of gaps into bytes, I am modifying my VHDL test rig so that I can sample the most recent four gaps at any point in time, and see that sample held.  I am wondering if the gap_valid signal is being asserted for more than one clock cycle, for example, which is causing a gap to be registered more than once.  It is also possible that gaps are being missed, but that will be harder to determine.

The logged sets of recent gaps seem to be okay, and I even managed to see a sync mark. So I added a counter for the number of times the sync marks are found. I also added a safety catch in case the strobes to indicate a new pulse has been seen were sticking around for more than one clock cycle.  Whether that was the issue or not, I am seeing of the order of 256 sync marks per second.  Given that there should be 5 rotations per second, and 10 physical sectors per rotation, each with 2 x 3 sync marks, this means I should be seeing 5 x 10 x 2 x 3 = 300 sync marks per second.  This basically matches my eye-ball investigation of ~256 per second, watching the highest order bit of an 8-bit counter blinking about twice per second.  In short, it seems reasonable to conclude that sync marks are now being reliably recognised.  The question is whether that tweak to the strobe line handling fixed anything, or not. This is at least easy to test: Change one line of VHDL to remove the extra check I put in place, and then resynthesise the VHDL test rig. Hmm. Seems that it wasn't necessary.

So, it seems that the problem must be higher up the processing chain.  So I took a closer look at the top level mfm_decoder module, and noticed a couple of errors in CRC checking of sectors, and when the found_track/sector/side variables are set.  I then realised that my test bench tested that the sectors were found on the disk, but not that this information was communicated back up to the caller.  Needless to say I have fixed those now, and am resynthesising.

While waiting for synthesis of the whole MEGA65, I also made the same change to my VHDL test rig.  Finally, that is now finding sectors! I can also see which track I am on, from the reported track number, as well as watching the sector numbers cycle through.  Now I am impatiently waiting for synthesis to complete, so that I can test it in the MEGA65 again...

Resynthesis has finished, so now trying it out.  I have hit the problem that sometimes happens where some registers for the SD card interface and the non-existent input switches misbehave. This means booting up without an SD card, and hence, without the F011 disk present flag being set by mounting a D81 image.  It turns out that this is the only source of the disk present flag at the moment. So I need to plumb the disk ready line.

However, I have just hit the exact same problem that causes Amiga 600 and 1200 disk drives to tick constantly when there is no disk inserted: The /DISKCHANGE line does double duty as the disk present signal.  As it doesn't seem right for an 8-bit computer that is likely to only sometimes have a disk in the drive to tick like a bomb, I'll just lie and tell the F011 that it always has a disk inserted when it is talking to a real floppy drive.

I also found a similar problem in the logic that checks if a disk is available before dispatching a sector read or write job via the F011.  I have now fixed that logic ready for next synthesis. This was again because the existing F011 code assumed that the SD card was the only source of floppy disks.  It also meant that I could work around it by setting the disk image enable flag for floppy drive 0, so that it would look like a disk is available to the internal logic.

With those two fixes, the C65 BASIC no longer reports DRIVE NOT READY when attempting to run a DIR command.  Instead it hangs... because I don't assert the Request Not Found (RNF) flag if a requested sector is not found.  Reading the C65 specifications, the RNF flag should be set if the requested sector has not been found before six index pulses, i.e., within about a second.
This is implemented now, ready for the next synthesis run.
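
As a rough illustration of that timeout rule, here is a minimal Python sketch (it is not the actual VHDL). The class and method names are hypothetical; only the limit of six index pulses comes from the C65 specifications as described above.

```python
class SectorSearch:
    """Toy model of a sector search that gives up after too many index pulses."""

    def __init__(self, max_index_pulses=6):
        self.max_index_pulses = max_index_pulses
        self.index_pulses = 0
        self.found = False
        self.rnf = False  # Request Not Found flag

    def on_sector_header(self, track, sector, side, wanted):
        # Called for every sector header seen; `wanted` is (track, sector, side).
        if not self.rnf and (track, sector, side) == wanted:
            self.found = True

    def on_index_pulse(self):
        # Called once per rotation, when the index hole passes the sensor.
        if self.found or self.rnf:
            return
        self.index_pulses += 1
        if self.index_pulses >= self.max_index_pulses:
            self.rnf = True
```

At 5 rotations per second, six index pulses corresponds to roughly 1.2 seconds, which is the "about a second" mentioned above.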

The hanging, I think, was because I had stepped the disk drive to a different track, which confused it, as the sector headers would not have matched, and so it would have kept on searching.  Now with the drive on the correct track, the DIR command returns a broken directory, that looks to me as though it is reading all zeroes from disk, or some similar problem.

Taking a look at the sector buffer for the F011, it looks like it is finding all zeroes.  Ah, that would be because I set the "match any sector" flag.  This is actually encouraging, as it means that it is reading the empty sectors on the disk.  With that flag turned off, DOS is back to hanging. Perhaps I have the side 1/2 head select flag wrong?  The request that it is currently stuck on is track 39 (this is floppy track numbering, the C65 is talking about track 40, the directory track on a 1581 disk), sector 1, side 0.  Checking the sectors that the MFM decoder is seeing (via the debug registers I added at $D6A3-5), I can see that it is in fact correctly finding the sectors for that track and side.  More specifically, I can see that it finds track 39, sector 1, side 0.  Yet it never seems to indicate to the F011 that it has read the sector, and the busy flag stays asserted, which is why DOS is hanging.

At this point, I could try to work my way through the C65 DOS to figure out everything that is going on. However, as the regular SD card mode of operation works fine, I am instead going to write a little test program that tries to drive the F011 to read a sector, and see if that works. If not, then I will know where the process is failing.  If it does work, then I can take a look at the DOS after all.

This reveals that reading a sector doesn't seem to properly complete. So I finally got around to writing a program that produces a whole sector's worth of MFM data to input into simulation, to see if that sheds any light on the situation. Found one more bug: sector_found was being reset the same cycle as the sector_end flag was being asserted, and the MEGA65 was not handling that correctly.  So I have added a fix for that, which is now synthesising.  I have also added some extra debug registers so that I can see if the MEGA65 thinks it is accepting sector bytes from the MFM decoder.  Synthesis time again.

Following synthesis of the above, I have worked out that the SIDE flag from the F011 registers is being inverted in sense, so I have pushed a fix for that, which will require another synthesis.  The next mystery is that the Request Not Found flag is still being set, even when the track and side are correct. Setting the match any sector flag solves this, by accepting any sector. So this suggests that there is still something faulty with the track/sector/side match logic.  Indeed, I can see that the found_track/sector/side matches what is being asked for from the F011.
This problem was caused by comparing unsigned values, instead of first casting them to integers.  This is one of many really annoying "features" of VHDL.

After various other little fixes, I can now see the correct sector is being found, and the various signals are being asserted that should be telling the MEGA65 that bytes are ready for writing to the sector buffer as they are read from the floppy drive. However, still no bytes are read.  To see which combinations of signals occur, I use the debug register $D6A1 as the source of my information, and increment a byte of RAM corresponding to each value read, using a routine like LDX $D6A1, INC $0400,X. In this way the contents of $0400-$04FF form a histogram, which I can see on screen as it is gathered, and yet the sampling rate is still able to be ~1MHz.  This will miss some instances of short-lived signals; however, by repeatedly sampling in this way, those will still tend to show up over time.  So here are the combinations of signals I am seeing:

fdc_sector_end - Very frequent. This is pulsed each time the end of a sector is reached, whether or not it is the sector we are looking for.

fdc_sector_found - Is held for a period of time, ~5 times per second, i.e., each time the sector we are searching for passes under the head.  This is encouraging, as it means we are finding our sector quite reliably.

fdc_sector_found | fdc_sector_data_gap - As above. This indicates that we have passed the sector header, and are now in the gap between the sector header, and the data of the sector itself. Again, a very healthy sign.

fdc_sector_found | fdc_byte_valid - This also pulses a number of times each rotation, indicating that the MFM decoder has the bytes off the sector to serve.

fdc_sector_found | fdc_byte_valid | fdc_first_byte - This gets indicated only once per reading of the sector, to indicate the first byte of the sector is being presented. Again, a very healthy sign.
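
The LDX $D6A1, INC $0400,X sampling loop described above amounts to building a 256-bin histogram of register values. A minimal Python sketch of the same idea follows; the function names are hypothetical, and any sample values fed to it are made up rather than real captures.

```python
def histogram(samples):
    """Count how often each 8-bit register value was seen.

    Mirrors LDX $D6A1 / INC $0400,X: each bin is a single byte, so it
    wraps around after 255 rather than growing without bound.
    """
    bins = [0] * 256
    for value in samples:
        index = value & 0xFF
        bins[index] = (bins[index] + 1) & 0xFF
    return bins


def seen_values(bins):
    """Return the values with a non-zero count.

    A bin that has wrapped exactly back to zero looks unseen, which is
    why watching the screen while the histogram is gathered is useful.
    """
    return [value for value, count in enumerate(bins) if count]
```

Each distinct combination of signal bits lands in its own bin, so even a combination that is only asserted briefly will eventually leave a mark if the loop runs for long enough.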

So, we have clear evidence that the sequence of signals is more or less correct when the MFM decoder is doing its job.  It should be noted that these are generated automatically, whether or not the F011 has asked for a sector to be read or not.  When a read sector command is given, there is an extra signal that is asserted by the F011 so that it knows that it should accept the bytes presented by the MFM decoder, fdc_read_request.  By running a bunch of read requests while my histogram is collecting, I can verify if all of these combinations also occur while fdc_read_request is asserted. 

Because the histogram is collecting continuously, I have to manually issue the read command via the serial monitor, so the number of samples is much smaller.  As a result I didn't detect any instances of the rather rare combination of fdc_sector_found | fdc_byte_valid | fdc_first_byte, however the fact that all of the others are showing up gives me confidence that it should be showing up.

For a byte to be written, fdc_read_request, at least one of fdc_sector_found and fdc_sector_end, and fdc_byte_valid must be simultaneously asserted. From my histogram, I can see that fdc_read_request | fdc_byte_valid | fdc_sector_found happens quite frequently, as one would expect. That is, the conditions are satisfied for writing the bytes read from the floppy into the buffer. However, there is not so much as a single byte changed in the buffer. The logic that makes this decision consists of a couple of nested if statements, so I might put something in one of the outer if statements, so that I can see if it is getting there.  I'll also go through the synthesis warnings to see if there is anything fishy there that looks like it could be causing this problem.
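
The write condition stated at the start of the paragraph above can be written down as a small predicate. This is only a Python restatement of the rule as described, not the VHDL itself, and the function name is hypothetical.

```python
from itertools import product


def should_store_byte(read_request, sector_found, sector_end, byte_valid):
    """True when a byte from the MFM decoder should go into the sector buffer."""
    return bool(read_request and (sector_found or sector_end) and byte_valid)


# Enumerate every combination of the four signals that permits a write.
WRITABLE = [
    combo for combo in product([False, True], repeat=4)
    if should_store_byte(*combo)
]
```

Of the sixteen possible combinations, only three allow a write, and all of them need both fdc_read_request and fdc_byte_valid.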

Nothing fishy turned up in the synthesis warnings, however instrumenting those if statements also showed that it never gets to the appropriate test.  A little further searching, and it looks like the fdc_sector_end signal was not being handled correctly: Whenever it occurred -- whether it was the end of the sector that was being searched or not -- it would cancel the current read request.  I have now added this extra check, and am resynthesising. 

As I was thinking about this bug, I initially thought that this should mean I have a 1 in (sectors per track) chance of reading a sector. However, it instead required that the read request be initiated in the gap between the end of the sector just before the one desired and the start of the sector to be read.  This was quite a comforting revelation, as it explained why I was never seeing a sector being read -- because the probability was probably less than one in a thousand that this condition would be met. Even when set to accept any sector, the inter-sector gap is only a small percentage of the length of a sector.  Now we await synthesis to see if this theory is correct.
 
Indeed, with that fix, the sector data thinks it is being read into the sector buffer. I say thinks, because the sector buffer data doesn't change.  I lost several hours to this problem, until Falk reminded me that we have two copies of the sector buffer, one where the CPU controls access to it, and the other where the SD card writes directly into it, because the SD card can't be held off while the CPU is looking at the sector buffer, and vice-versa, when the CPU is reading from the buffer, there isn't a mechanism to add an extra wait-state if the SD card (or now also the floppy controller) is writing to the buffer.  There is special logic that allows each side to cause writes to happen to the other copy of the sector buffer, and I had forgotten to implement that for floppy reads. That is now synthesising.

Although the data itself is not being read into the sector buffer, in theory, the status signalling should be mostly complete, and I should be able to boot the C65 ROM with the real floppy drive enabled, so I tried doing this. This revealed that there is a problem with the track stepping: The DOS gets stuck on boot trying to read 1581 track 40, sector 0, because no data ever arrives, which is because the floppy drive has stepped only to track 38, not track 40. I'm not quite sure what the cause for this is.  If I manually step the drive forward to the correct track, the read can then complete, although I think I am seeing a problem with the buffer empty/full flag being erroneously set to indicate the buffer is empty, rather than full. This is used by the C65 ROM to work out if a sector has been read or not.  This might just be because of how I am probing the registers to debug, as the EQ flag inhibit when reading gets cleared if $D087 (the port register to the sector buffer) is read. Since I am reading 16 registers at $D080-$D08F as a batch, this would cause that problem.  So I expect that this is a non-problem. So it really just leaves the track stepping problem.

The C65 DOS steps tracks assuming that this works without problem, and always lands on the correct track. The routine for seeking to a track is at $9B98, and looks something like:

findtrack:
   phy
   ldy #$00
   ldx $1FE8    ; presumably the current drive ID
   lda $d084    ; The target track
waitforready:
   jsr $9A88    ; Wait for F011 busy flag to clear
   cmp $010F,x ; compare with the track we think we are on?
   beq foundtrack
   bcc lowertrack

   inc $010f,x   ; increase the track we think we are on?
   ldy #$18
   bra issuestepcommand

lowertrack:
   dec $010f,x      ; decrease the track we think we are on?
   ldy #$10

issuestepcommand:
   sty $D081     ; Tell F011 to step
   bra waitforready

foundtrack:
   tya
   beq done      ; if we didn't step, skip head settling time
   jsr waitforheadsettle

done:
   ply
   rts

Single-stepping through this routine, it seems to do what it should. However, after that, $14 gets written to the command register in the waitforheadsettle routine. And that's where the problem is: I was interpreting that as a head step command, because the high nybble is $1.  Again, a couple of lines to fix it, and probably a couple of hours of waiting for synthesis to run.  Then I found that the F011 disk-side select line was inverted, and I wasn't updating the pointer to the location in the second copy of the sector buffer, i.e., the one that the CPU sees.  So, it's off to synthesis again, but very much with the feeling now that it is the last few barriers this time.
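
To make the command decoding bug concrete, here is a small Python sketch contrasting the two ways of recognising a step command. The constant and function names are hypothetical, and matching the exact command bytes is just one plausible form of the fix; the actual VHDL change is not shown here. The values $10 and $18 are the ones the seek routine above loads into Y before storing to $D081.

```python
STEP_TO_LOWER_TRACK = 0x10   # written when the target track is below the current one
STEP_TO_HIGHER_TRACK = 0x18  # written when the target track is above the current one


def is_step_command_buggy(command):
    """Original behaviour: anything with a high nybble of $1 counts as a step."""
    return (command >> 4) == 0x1


def is_step_command_fixed(command):
    """Stricter check: only the two command values the seek routine issues."""
    return command in (STEP_TO_LOWER_TRACK, STEP_TO_HIGHER_TRACK)
```

With the looser check, the $14 written by the head settle routine is counted as one more step, which is the kind of error that would leave the head on the wrong track.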

It turns out there were still a number of other little niggly bugs to track down with filling the sector buffer, and, frustratingly, solving them still hasn't quite got it working.  Sectors do now get read, and loading a directory goes through the motions of working -- but it is reading the wrong sectors for some reason.  Or perhaps looking at the wrong halves of the sectors, since the 1581 uses standard 512 byte sectors as the on-disk format, which each contain two logical 256 byte sectors, presumably because the PC-style floppy controller IC it uses doesn't support 256 byte sectors, or because using 256 byte sectors would have resulted in a slightly lower disk capacity.  Whatever the cause, the question is how SD card and floppy can behave differently, when all the buffer pointer management is the same regardless of whether a sector is read from the SD card or from the floppy drive -- and yet the problem only occurs when reading from the floppy drive.  Thus there must still be some subtle difference.

To try to figure out what the difference might be, I tried loading the same sector from both SD card and from the real floppy drive.  Lo and behold, the sector from the floppy drive was rotated by one byte through the sector buffer.  Looking through the source code, I can see that the buffer pointer in question was not being zeroed when the read sector job was accepted, i.e., the part common to both SD card and floppy drive access, but instead in the setup stage for SD card access. It is satisfying that this makes sense. In retrospect, I saw that the buffer write addresses were out by one in simulation, but the consequences didn't fully occur to me at the time.  Anyway, time for synthesis again...
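
The effect of a write pointer that is not reset can be shown with a toy model of a wrapping sector buffer. This is a Python sketch only; the function name is hypothetical, and the stale pointer value of 1 is an assumption chosen to match the one-byte rotation seen above.

```python
def fill_sector_buffer(data, start_pointer, size=512):
    """Write `data` into a circular buffer, starting at `start_pointer`."""
    buffer = [0] * size
    pointer = start_pointer % size
    for byte in data:
        buffer[pointer] = byte
        pointer = (pointer + 1) % size  # wrap around at the end of the buffer
    return buffer


sector = list(range(8))                       # stand-in for a full sector
correct = fill_sector_buffer(sector, 0, 8)    # pointer zeroed when the job is accepted
rotated = fill_sector_buffer(sector, 1, 8)    # pointer left over from an earlier job
```

Here `correct` matches the input, while `rotated` has every byte shifted along by one with the last byte wrapped around to the front, which is what the floppy read looked like next to the SD card read.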

Okay, so now the sector buffer rotation bug is fixed, and yet DIR still shows gibberish, as though it is reading the sector data incorrectly.  The C65 DOS uses unbuffered reading, where it accepts each byte as it is supplied by the F011 floppy controller, rather than waiting for it all to arrive. So I wondered if there was some sort of bug in that handling -- although, again, for SD card reads, it works without problem.  So I wrote a test program to make sure that unbuffered reading works correctly, and the correct data is read out, and in order. All seems to be correct there.  However, when trying to load the directory of a disk, it still looks very much like it is accessing the wrong half of the sector.

Unlike for the sector write offset, I can see no difference in the way the read pointer into the sector buffer is handled between the FDC and SD card data paths. Yet, like the last time I said this, there must be some difference, or else it would be working.

The F011 has a SWAP bit, that allows the halves of the sector buffer to be switched from the CPU's perspective, and the handling of this is the prime suspect as the root cause for this bug.  Single stepping through the C65 DOS, it turns out that this is not the problem.  Rather, the problem is that the floppy drive reads data at a slower rate than the CPU can read it, and the EQ flag that indicates if the buffer is empty claims not-empty as soon as the sector starts being read, regardless of whether the data has been read far enough or not.  As a result, the problem is actually that the C65 DOS is reading the contents of the buffer, before it has been filled from the floppy drive.  So the SWAP flag can stay as it is, but the EQ flag requires correction.

In the end, I looked at the whole handling of the EQ flag, and with the distance of time from when I first implemented it, it seemed rather over-complicated to me.  So I re-worked the entire sector buffer handling code, so that it now works much more like the real F011, and the duplicate copies of the sector buffer to solve bus contention have been reduced to a single buffer, with a bit of clever pre-fetching and write buffering. The result is a whole lot simpler, and, it works.  FINALLY I can stick a 1581 disk in the real floppy drive, type DIR from C65 BASIC, and get a directory listing. Here is one I specially formatted the other day using my resurrected 1581:



And here is an old disk from the mid-90s when I was using the C65 I had at the time as my main computer:


I tried a bunch of my other old 1581 disks, but only a couple would work.  What I don't know is whether that means those disks have simply rotted away, which is entirely possible after 20+ years, or that my error correction for MFM decoding could do with a bit more work.  I guess I will pull them out again at some point, and do some data captures from them, and from there work out whether my error correction could be improved. But adding write-support to the floppy drive is a higher priority than that.

Wednesday, January 10, 2018

Bringing the internal 3.5" floppy drive to life - part 1

While it will take a while longer before it can read or write disks, the next step for the internal floppy drive is to make it mirror what the SD-card floppy emulation thinks the drive should be doing, for example, when the motor should spin, the light come on, and when the head should step in and out tracks. I also want to have debug registers that allow me to directly control and read the floppy drive state, so that I can work towards being able to read and write real disks.

So, first step is to plumb the floppy status and control signals into the VHDL module that handles F011 emulation and SD card access, and provide a debug register for those.  This is already done, and $D6A0 is the register.  Writing sets control lines, and reading reads the status lines. As this is a debug register, you have to remember the state of the control lines yourself.  The control lines are:

7              f_density - 0=1.44MB, 1=720K
6              f_motor - 0 = motor on, 1 = motor off
5              f_select - 0 = drive selected (and LED on), 1 = drive not selected
4              f_stepdir  - 0 = step inwards, 1 = step outwards
3              f_step - 0 = generate step pulse (two required per track)
2              f_wdata - bit to write
1              f_wgate - 0 = turn write head on
0              f_side1 - 0 = head on side 1 selected, 1 = head on side 0 selected

The status lines are:

7              f_index - 0 when passing over index hole
6              f_track0 - 0 when head is over track 0
5              f_writeprotect - 0 when disk is write protected
4              f_rdata - data bit read from head
3              f_diskchanged - 0 when disk has been changed
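
For reference, the two bit tables above can be expressed as masks. The following Python sketch uses hypothetical helper names and is only meant to show how a value written to or read from $D6A0 breaks down into individual lines.

```python
# Control bits, written to $D6A0
F_DENSITY = 0x80   # 0 = 1.44MB, 1 = 720K
F_MOTOR = 0x40     # 0 = motor on
F_SELECT = 0x20    # 0 = drive selected (and LED on)
F_STEPDIR = 0x10   # 0 = step inwards, 1 = step outwards
F_STEP = 0x08      # 0 = generate step pulse
F_WDATA = 0x04     # bit to write
F_WGATE = 0x02     # 0 = turn write head on
F_SIDE1 = 0x01     # 0 = side 1 selected, 1 = side 0 selected

# Status bits, read from $D6A0
F_INDEX = 0x80
F_TRACK0 = 0x40
F_WRITEPROTECT = 0x20
F_RDATA = 0x10
F_DISKCHANGED = 0x08


def control_value(motor_on, selected):
    """Build a control byte with every other line left high."""
    value = 0xFF
    if motor_on:
        value &= ~F_MOTOR
    if selected:
        value &= ~F_SELECT
    return value & 0xFF


def decode_status(value):
    """Turn a status byte into named fields (the flag lines are active low)."""
    return {
        "index_hole": (value & F_INDEX) == 0,
        "track0": (value & F_TRACK0) == 0,
        "write_protected": (value & F_WRITEPROTECT) == 0,
        "rdata": 1 if value & F_RDATA else 0,
        "disk_changed": (value & F_DISKCHANGED) == 0,
    }
```

For example, `control_value(True, True)` gives $9F: motor on and drive selected, with the write gate off and no step pulse being generated.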

I have already confirmed that I can make the motor and LED turn on and off. Next step is to write a little test program that lets me test all of these functions.

To do this, I need to make a set of pull-up resistors for the floppy interface, as the revision 1 PCB lacks pull-ups on the status lines, so once they go to 0, they never go back to 1.  The pull-up resistors gently pull the lines high again, so that when the floppy stops driving them low, they return high, and thus back to a 1.  As previously mentioned, it is rather annoying that the 34-pin floppy cable has no +5V line, which makes it a nuisance to make a pull-up kit that can fit on the cable. Fortunately, however, there are some lines that we control, and can drive high.  For current testing, the density select line can stay +5V, since we don't need to do any 1.44MB disk access just yet. Here is my little home-made pull-up kit:


With this plugged in, I can easily see when the track 0 sensor etc. are asserted, so all is good. I can also see data being read from the test disks, so that answers that question. Writing is more complex, so can't be immediately tested.

XXX - Drive stepping, LED and motor tracking SD card working (short video).

So, with that working, now I want to get reading data from the floppy working.  The first step is to acquire a decent slab of data from one of the tracks on the floppy.  The trick is how to capture it at a decent rate, since floppy data pulses come at approximately 500kHz, but the pulses are only 150ns - 800ns wide.  The easiest solution is to DMA read the register, as this will read every two clock cycles (the alternate clock cycle will be writing the value that had been read into memory), for a sample time of 40ns. This creates a data capture problem, because the MEGA65 has only a limited amount of RAM. A single DMA can read 56K samples (we could push it to 64K, but that would require fiddling with memory a bit more). At 40ns per sample, that equates to 2,293,760ns = 2.29 milliseconds.  Given that a floppy spins at 300 RPM = 5 revolutions per second, a single rotation is 200 ms, so we are sampling only ~1% of a track this way. Admittedly at very high resolution.  This is not really enough to capture even a single sector for decoding. However, what it is useful for is to let me see just how long the pulses are on this floppy.  Here is a sample of the captured data:

:0012310 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F
 :0012320 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F
 :0012330 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F
 :0012340 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F
 :0012350 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F
 :0012360 1F 1F 0F 0F 0F 0F 0F 0F 0F 0F 1F 1F 1F 1F 1F 1F
 :0012370 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F
 :0012380 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F
 :0012390 1F 1F 1F 1F 1F 0F 0F 0F 0F 0F 0F 0F 0F 1F 1F 1F
 :00123A0 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F
 :00123B0 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F
 :00123C0 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F
 :00123D0 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F

We know from the table of input signals above, that it is only the upper 5 bits that matter.  These samples all have bit 7 (index hole sense), bit 6 (track 0) and bit 5 (write-protect) equal to 0, which means asserted. That is, we are reading track 0 on a write-protected disk, while the index hole is passing.  Bit 4 is the data bit itself, and we see that it is mostly high, with a couple of pulses lasting 8 samples each, i.e., 8 x 40 ns = 320 ns in duration.  The length of the pulses is within the specified range, so that looks good.  In this case, the pair of pulses are 51 samples, i.e., 51 x 40 = 2040 ns = ~2 usec apart.  Then pulses appear throughout the capture at varying intervals, as we expect.
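
The pulse measurements above can be reproduced with a few lines of Python. This is a sketch with hypothetical function names, run against the four rows of the dump from $0012360 to $001239F; the 40 ns sample period is the figure given earlier.

```python
SAMPLE_NS = 40    # one sample every 40 ns
F_RDATA = 0x10    # bit 4 is the data bit read from the head


def find_pulses(samples):
    """Return (start_index, length) for each run where the data bit is low."""
    pulses = []
    start = None
    for i, value in enumerate(samples):
        low = (value & F_RDATA) == 0
        if low and start is None:
            start = i
        elif not low and start is not None:
            pulses.append((start, i - start))
            start = None
    if start is not None:
        pulses.append((start, len(samples) - start))
    return pulses


def gaps_ns(pulses):
    """Time between the starts of consecutive pulses, in nanoseconds."""
    return [(b[0] - a[0]) * SAMPLE_NS for a, b in zip(pulses, pulses[1:])]


# Rows $0012360 to $001239F of the capture, written out as run lengths.
samples = ([0x1F] * 2 + [0x0F] * 8 + [0x1F] * 6    # $0012360
           + [0x1F] * 32                            # $0012370, $0012380
           + [0x1F] * 5 + [0x0F] * 8 + [0x1F] * 3)  # $0012390

pulses = find_pulses(samples)   # two pulses, each 8 samples long
gaps = gaps_ns(pulses)          # one gap of 51 samples = 2040 ns
```

Measuring start-to-start rather than end-to-start keeps the gap independent of the pulse width, which matters because the pulse width itself can vary between 150 ns and 800 ns.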

Without going into the gory detail of how MFM works, a summary is that the gaps between the pulses vary to encode the information.  For a given data rate, gaps of 1, 1.5 and 2 time units are possible, corresponding to the reception of the following bit sequences:

+-----------------+--------+--------+
|Last received bit|Interval|New Bits|
+-----------------+--------+--------+
|      NONE       |   1.0  |11 or 00|
|      NONE       |   1.5  |   01   |
|      NONE       |   2.0  |  101   |
|        0        |   1.0  |    0   |
|        1        |   1.0  |    1   |
|        0        |   1.5  |    1   |
|        1        |   1.5  |   00   |
|        X        |   2.0  |   01   |
+-----------------+--------+--------+

There are some exceptions to this for synchronisation marks, where the pattern can be different.  In particular, the common "A1" sync mark consists of intervals of 2.0, 1.5, 2.0 and 1.5.  Encoding $A1 using MFM would normally result in gaps of 2.0 (101), 1.5 (00), 1.0 (0), 1.0 (0), 1.5 (1), where values in brackets are the bits of the byte $A1 (= 10100001 in binary) being encoded. The ambiguity at the start of a byte is solved by using these sync bytes to first synchronise the decoder at the start of a string of bytes.
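
The table above translates almost directly into code. The following Python sketch (the function name is hypothetical) walks a list of already-quantised gaps and emits bits, tracking the last received bit as the table requires. It deliberately refuses the ambiguous case of a 1.0 gap with no previous bit, and it does not handle sync marks.

```python
def decode_gaps(gaps, last_bit=None):
    """Decode quantised MFM gaps (1.0, 1.5 or 2.0) into a string of bits."""
    bits = []
    for gap in gaps:
        if gap == 2.0:
            new = "101" if last_bit is None else "01"
        elif gap == 1.5:
            if last_bit is None:
                new = "01"
            elif last_bit == "0":
                new = "1"
            else:
                new = "00"
        elif gap == 1.0:
            if last_bit is None:
                # Could be 11 or 00; a sync mark is needed to resolve this.
                raise ValueError("ambiguous 1.0 gap with no previous bit")
            new = last_bit
        else:
            raise ValueError("gap must be 1.0, 1.5 or 2.0")
        bits.append(new)
        last_bit = new[-1]
    return "".join(bits)


# The normal (non-sync) encoding of $A1 given in the text.
a1_bits = decode_gaps([2.0, 1.5, 1.0, 1.0, 1.5])
```

Feeding in the gap sequence for a normally encoded $A1 yields the eight bits 10100001, matching the worked example in the paragraph above.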

We should therefore expect to see disk data consisting of long runs of intervals that are 1.0, 1.5 and 2.0 times the basic time interval.  The first time I captured data, I was seeing all sorts of crazy intervals, which had me thinking that something was terribly wrong.  However, second time around, the stream looks much better, with intervals all within a 1:2 range of size ratios, as we expect.

So now I need to write a little program that tries to find the sync marks in the data stream, and then begin decoding the data from there, to see if it looks like what it should be. Here is how one of the traces decoded:

$e4 $e4 $e4 $e4 $ce $4e $4e $4e $4e $4e $4e $4e $4e $4e $4e $4e
 $4e $4e $4e $4e $4e $4e $4e $4e $4e $4e $4e $4e $4e $4e $4e $4e $4e
 $4e $4e $4e $00 $00 $00 $00 $00 $00 $00 $00 $00 $00 $00 $00
Sync $A1
Sync $A1
Sync $A1
 $fe $01 $01 $01 $02 $8b $eb $4e $4e $4e $4e $4e $4e $4e $4e $4e
 $4e $4e $4e $4e $4e $4e $4e $4e $4e $4e $4e $4e $4e
$00 $00 $00 $00
 $00 $00 $00 $00 $00 $00 $00 $00
Sync $A1
Sync $A1
Sync $A1
 $fb $00 $00 $00 $00 $00 $00 $00 $00 $00 $00


The 1581 service manual describes the byte sequences that we should expect to see, as:

12 bytes of 00
3 bytes of Hex A1 (Data Hex A1, Clock Hex 0A) <- i.e., $A1 Sync
1 byte of FE (ID Address Mark)
1 byte (Track number)
1 byte (Side number)
1 byte (Sector number)
1 byte (Sector length. 02 for 512 byte sectors)
2 bytes CRC (cyclic redundancy check)
22 bytes of Hex 22
12 bytes of 00
3 bytes of Hex A1 (Data Hex A1, Clock Hex 0A) <- i.e., $A1 Sync
1 byte of Hex FB (Data Address Mark)
512 bytes of data
2 bytes CRC (cyclic redundancy check)
38 bytes of Hex 4E

If we try to match this with what we saw, it is pretty close.  There is some junk at the beginning, that looks like the tail end of the 38 bytes of $4E, allowing for lack of synchronisation, then we see the start of sector 1, track 1, side 1 (the $fe $01 $01 $01 $02 header bytes after the first run of sync marks), following the prescribed format exactly, with the exception that the 22 gap bytes are hex $4E, not $22. It is possible that $22 is a typo in the 1581 service guide, given that there are 22 bytes stipulated.  Indeed, we find evidence that this is the case from the C65 specifications guide, which describes the on-disk format as:


  quan      data/clock      description
  ----      ----------      -----------
    12      00              gap 3*
    3       A1/FB           Marks
            FE              Header mark
            (track)         Track number
            (side)          Side number
            (sector)        Sector number
            (length)        Sector Length (0=128,1=256,2=512,3=1024)

    2       (crc)           CRC bytes
    23      4E              gap 2
    12      00              gap 2
    3       A1/FB           Marks
            FB              Data mark
    128,
    256,
    512, or
    1024    00              Data bytes (consistent with length)
    2       (crc)           CRC bytes
    24      4E              gap 3*

    * you may reduce the size of gap 3 to increase diskette capacity,
      however the sizes shown are suggested.

Here we have the byte value $4E specified for gap 2, however, it suggests that the gap should be 23 bytes long, not the 22 we have observed, or that is stipulated in the 1581 service manual.
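
Putting the two format descriptions together, the ID field that follows the three $A1 sync marks can be picked apart as below. This is a Python sketch with a hypothetical function name; it takes the field order from the 1581 service manual and the length codes from the C65 table, and it does not attempt to verify the CRC.

```python
SECTOR_SIZES = {0: 128, 1: 256, 2: 512, 3: 1024}  # length codes from the C65 table


def parse_id_field(data):
    """Parse the bytes after the sync marks: FE, track, side, sector, length, CRC."""
    if len(data) < 7:
        raise ValueError("ID field needs at least 7 bytes")
    if data[0] != 0xFE:
        raise ValueError("missing $FE ID address mark")
    track, side, sector, length_code = data[1:5]
    return {
        "track": track,
        "side": side,
        "sector": sector,
        "sector_size": SECTOR_SIZES[length_code],
        "crc_bytes": bytes(data[5:7]),  # kept as-is, not checked here
    }


# The header bytes from the decoded trace above.
header = parse_id_field(bytes([0xFE, 0x01, 0x01, 0x01, 0x02, 0x8B, 0xEB]))
```

Applied to the seven bytes after the first sync run in the trace, the length code of 2 comes out as 512 byte sectors, as expected for a 1581 disk.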

Nonetheless, we have proven that we can read sensible data from the disk, and use a simple table of relative gap size to drive decoding of the MFM data.  What we have not discussed here, is how to deal with the variable (and varying) disk rotation speed, and the errors that it can introduce.  A simplistic perspective on this is that we have to have something approximating a phase-locked loop that recaptures the clock on each transition that is encountered.  There are various ways to do this.  The C65 floppy controller supports three such algorithms, none of which exactly correspond to the method I have described here, of considering the gap intervals, rather than the arrival time of the transitions.  The error correction in my scheme relies on quantising the gap intervals to be either 1.0, 1.5 or 2.0, with values in between being rounded to the nearest of those.  It remains to be seen how well (or badly) my scheme works in practice. I have also not yet covered generating or checking the CRC of sectors.
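
The rounding step described above is small enough to sketch directly. In this Python example the function name is hypothetical, and the 2000 ns base interval used in the sample calls is an assumption for illustration, loosely based on the ~2 usec gap measured earlier; it does no clock tracking at all.

```python
NOMINAL_GAPS = (1.0, 1.5, 2.0)


def quantise_gap(gap_ns, unit_ns):
    """Round a measured gap to the nearest of 1.0, 1.5 or 2.0 base intervals."""
    ratio = gap_ns / unit_ns
    return min(NOMINAL_GAPS, key=lambda nominal: abs(ratio - nominal))


# Example measurements against an assumed 2000 ns base interval.
examples = [quantise_gap(ns, 2000) for ns in (2040, 3100, 3900)]
```

A gap exactly half way between two nominal values is rounded down here, simply because of the order of the tuple; a real decoder would probably also want to reject gaps that fall far outside the 1.0 to 2.0 range.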

However, we now have enough information to be able to create a VHDL implementation that takes the raw input, extracts gap intervals, quantises the gap intervals, detects sync marks, and extracts the sync and byte stream, and can feed this into a higher-level decoder that can check the track and sector that has been found, and extract the sector data.  Indeed, I can use the captured sequence as a test vector into this.  In short, we are well on our way to being able to read 3.5" floppies using the internal drive.

Tuesday, January 9, 2018

Testing Ethernet on the r1 PCB

There are now only a couple of interfaces left to test on the r1 PCB: HDMI output and the 100 Mbit ethernet port.  Ethernet is the next on the list, as it should, in principle, be easy to test, as we already had the same ethernet hardware on the Nexys4 DDR boards.  Thus, it really comes down to verifying that the pin assignments are correct.

However, it has been ages since we used the ethernet interface in earnest, in part because there is still a bug in my VHDL ethernet controller when transmitting (bits get corrupted, most probably due to a timing problem).

Thus, the first step was to get back to a working setup on the Nexys4 DDR board, where I could verify that I had a working test procedure.

The setup was quite simple.  The etherload program is a tiny program that listens for incoming ethernet packets on the MEGA65's ethernet interface, and if they are UDP packets on port 4510, it executes the contents of the packet in memory. This is used by a companion program on a computer connected via ethernet to send 1KB pieces of a program to be loaded, together with the little routine to copy it into place.  This scheme allows the ethernet loading program to be <256 bytes in length, including the ability to respond to ARP requests (although with the ethernet transmission problem, this is currently not very useful).

So, I loaded and ran the etherload program on the MEGA65 on a Nexys4 DDR board, connected an ethernet cable, and then ran the etherload program on the Linux laptop at the other end of the ethernet cable. Without ARP, the IP address to send to must be a valid broadcast address on the ethernet interface.  I used a command like:

etherload 192.168.1.255 ../c64/games/gyrrus.prg
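
To give a feel for what the sending side has to do, here is a Python sketch of just the splitting step: cutting a .prg file into 1KB pieces, each tagged with the address it should end up at. The function name is hypothetical, and this is not the real etherload source; the actual packet layout, including the copy routine that travels with each piece, is not described in this post and is not reproduced here.

```python
def split_prg(prg, chunk_size=1024):
    """Split a .prg image into (target_address, payload) pieces."""
    if len(prg) < 2:
        raise ValueError("a .prg file starts with a 2-byte load address")
    # .prg files begin with the load address, low byte first.
    load_address = prg[0] | (prg[1] << 8)
    body = prg[2:]
    return [
        (load_address + offset, body[offset:offset + chunk_size])
        for offset in range(0, len(body), chunk_size)
    ]


# A made-up 2500-byte program loading at $0801.
pieces = split_prg(bytes([0x01, 0x08]) + bytes(2500))
```

Each piece would then be sent as its own UDP packet to port 4510 on the broadcast address, so a missed packet shows up as a 1KB hole in the loaded program.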

When etherload is running on the MEGA65 and waiting for packets, it looks like this:



(Note that etherload is so old, that it doesn't explicitly set the CPU to 50MHz, so I had to POKE0,65 before running it to do this. Otherwise it is too slow, and won't capture the packets coming in on the 100 Mbit/sec link.)

Then, when it is finished, it drops back to the ready prompt, like this:



The squiggly characters are drawn one per packet loaded, with the position matching the address of the packet loaded, so that you can see if there are any gaps, which would indicate missed packets. None here, so I could happily run Gyrrus, which worked fine.

So, at this point, I have a test procedure that I can attempt on the r1 PCB.

Trying this on the MEGA65, I see the ethernet link light come on when the ethernet is plugged in, and the ethernet LED blinks on receiving the packets, but the etherload program shows no sign of having seen the packets.  Time to investigate.  Pausing the CPU, and looking at $D6E1 to see if the ethernet controller thinks that any packets have been received shows no signs of life. 

As I have had to debug this once before on the Nexys boards, there is a debug register at $D6E0 that shows the current status of the ethernet receive lines.  Thus I can write a little routine that continually draws the contents of that register on the screen, and try sending it a packet to see if we see signs of life. 

This initially showed no signs of life, so I wrote a program to talk to the ethernet controller via the MIIM / MDIO interface, a two-wire interface that can be used to check the current connection and settings, and to set various link parameters.

After some trial and error, I was able to talk to the MIIM interface, and read out the various registers, which showed the link autonegotiating and coming up when a cable was connected.  So I tried again to write a little routine that shows the state of the ethernet interface registers. This time, I wrote the routine to increment a location on screen based on the contents of $D6E0, as a more robust way of seeing what is happening.  This showed that the RX lines were toggling, and that the RX valid line was also changing state when packets were flying on the ethernet connection.  However, etherload still failed to see any packets.

Back when I first implemented ethernet for the Nexys4 boards, I added a feature to allow reading the values arriving on the ethernet RX lines into a buffer to help debug the implementation.  That same function is now helpful for trying to work out what is going on here.  It confirms that the data bits are being received, and that they, in general, look right.  Digging deeper, I can see that packet data is being received, but no packet reception is reported. This most likely means that the CRC is invalid.  Fortunately, when a packet is rejected due to the CRC, it still gets written into the packet buffer.  Here is what I saw after receiving a 500 byte ping packet:

 :FFDE800 00 80 BD 00 5E 00 00 FB 10 05 01 9F FC FD 08 00
 :FFDE810 45 00 00 A9 9A 5F 40 00 FF 11 3E 3E C0 A8 01 02
 :FFDE820 E0 00 00 FB 14 E9 14 E9 00 95 A6 AF 00 00 00 00
 :FFDE830 00 09 00 00 00 00 00 00 05 5F 69 70 70 73 04 5F
 :FFDE840 74 63 70 05 6C 6F 63 61 6C 00 00 0C 00 01 04 5F
 :FFDE850 66 74 70 C0 12 00 0C 00 01 07 5F 77 65 62 64 61
 :FFDE860 76 C0 12 00 0C 00 01 08 5F 77 65 62 64 61 76 73
 :FFDE870 C0 12 00 0C 00 01 09 5F 73 66 74 70 2D 73 73 68
 :FFDE880 C0 12 00 0C 00 01 04 5F 73 6D 62 C0 12 00 0C 00
 :FFDE890 01 0B 5F 61 66 70 6F 76 65 72 74 63 70 C0 12 00
 :FFDE8A0 0C 00 01 04 5F 6E 66 73 C0 12 00 0C 00 01 04 5F
 :FFDE8B0 69 70 70 C0 12 00 0C 00 BD 8D 8E 8F 90 91 92 93
 :FFDE8C0 94 95 96 97 98 99 9A 9B 9C 9D 9E 9F A0 A1 A2 A3
 :FFDE8D0 A4 A5 A6 A7 A8 A9 AA AB AC AD AE AF B0 B1 B2 B3
 :FFDE8E0 B4 B5 B6 B7 B8 B9 BA BB BC BD BE BF C0 C1 C2 C3
 :FFDE8F0 C4 C5 C6 C7 C8 C9 CA CB CC CD CE CF D0 D1 D2 D3
 :FFDE900 D4 D5 D6 D7 D8 D9 DA DB DC DD DE DF E0 E1 E2 E3
 :FFDE910 E4 E5 E6 E7 E8 E9 EA EB EC ED EE EF F0 F1 F2 F3
 :FFDE920 F4 F5 F6 F7 F8 F9 FA FB FC FD FE FF 00 01 02 03
 :FFDE930 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F 10 11 12 13
 :FFDE940 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F 20 21 22 23
 :FFDE950 24 25 26 27 28 29 2A 2B 2C 2D 2E 2F 30 31 32 33
 :FFDE960 34 35 36 37 38 39 3A 3B 3C 3D 3E 3F 40 41 42 43
 :FFDE970 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F 50 51 52 53
 :FFDE980 54 55 56 57 58 59 5A 5B 5C 5D 5E 5F 60 61 62 63
 :FFDE990 64 65 66 67 68 69 6A 6B 6C 6D 6E 6F 70 71 72 73
 :FFDE9A0 74 75 76 77 78 79 7A 7B 7C 7D 7E 7F 80 81 82 83
 :FFDE9B0 84 85 86 87 88 89 8A 8B 8C 8D 8E 8F 90 91 92 93
 :FFDE9C0 94 95 96 97 98 99 9A 9B 9C 9D 9E 9F A0 A1 A2 A3
 :FFDE9D0 A4 A5 A6 A7 A8 A9 AA AB AC AD AE AF B0 B1 B2 B3
 :FFDE9E0 B4 B5 B6 B7 B8 B9 BA BB BC BD BE BF C0 C1 C2 C3
 :FFDE9F0 C4 C5 C6 C7 C8 C9 CA CB CC CD CE CF D0 D1 D2 D3

 :FFDEA00 D4 D5 D6 D7 D8 D9 DA DB DC DD DE DF E0 E1 E2 E3
 :FFDEA10 E4 E5 E6 E7 E8 E9 EA EB EC ED EE EF F0 F1 F2 BD
 

The first two bytes are supposed to indicate the length of the packet, low-order byte first, and with the MSB of the second byte indicating if a CRC error has occurred. If a CRC error occurs, then no packet received interrupt is triggered, and the controller will keep trying to receive a valid packet, instead of marking the receive buffer full (the MEGA65 ethernet controller has two receive buffers, so that one can be processed while the other is receiving a packet).

The byte $BD at the end of the packet is written by the ethernet controller as a handy marker so that if you have been receiving multiple packets, and want to see where the latest one ends, you can.  So, this tells us that the packet was indeed correctly received as being $A1F - $800 - (2 bytes length header) = $21D bytes long.  However, the length header in the first two bytes of the packet says that it is zero bytes long, and that there was a CRC error.  That the length header is wrong tells me that there is something fishy going on.  I am resynthesising with an option to ignore CRC errors, and will try to investigate the writing of the length field a little more deeply.
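The two checks made above can be written down as a small Python sketch. The function names are made up for illustration. It also assumes that the low seven bits of the second header byte hold the upper bits of the length, which the description implies but does not state outright.

```python
def parse_rx_header(buf):
    """Decode the two-byte header at the start of the receive buffer.

    Returns (length, crc_error). Length is stored low byte first, and the
    MSB of the second byte is the CRC error flag. Treating the remaining
    seven bits as the high bits of the length is an assumption.
    """
    length = buf[0] | ((buf[1] & 0x7F) << 8)
    crc_error = bool(buf[1] & 0x80)
    return length, crc_error


def length_from_marker(buffer_base, marker_addr):
    """Length implied by the position of the $BD end marker."""
    return marker_addr - buffer_base - 2   # skip the 2-byte length header


# Header of the dump above: claims zero bytes, with a CRC error.
print(parse_rx_header(bytes([0x00, 0x80])))
# The $BD marker at $A1F tells a different story.
print(hex(length_from_marker(0x800, 0xA1F)))
```

When the header and the marker agree, the receive path is consistent. Here they disagree (0 versus $21D), which is the fishy part.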

Synthesis finally finished an hour and a half later, so I can try etherload again, this time with the ethernet CRC check disabled, and ... it works.  Moreover, there is no sign of the packets having any errors, as I can load a game, and the game runs fine.  This leaves me wondering what is going on, or more specifically, how an incorrect ethernet CRC is getting calculated on what seem to be perfectly correct packets.  To try to solve this riddle, I took a look at the last packet sent by etherload as received by a Nexys4 DDR board and by the MEGA65 r1 PCB. Here is the one from the Nexys4 board:

 :FFDE800 AE 00 FF FF FF FF FF FF 10 05 01 9F FC FD 08 00
 :FFDE810 45 00 00 9C B9 79 40 00 40 11 FC 85 C0 A8 01 02
 :FFDE820 C0 A8 01 FF CE 1F 11 9E 00 88 A1 55 A9 00 EA EA
 :FFDE830 EA EA EA EA A2 00 BD 44 68 9D 40 03 E8 E0 40 D0
 :FFDE840 F5 4C 40 03 A9 47 8D 2F D0 A9 53 8D 2F D0 A9 00
 :FFDE850 A2 0F A0 00 A3 00 5C EA A9 00 A2 00 A0 00 A3 00
 :FFDE860 5C EA 68 68 60 00 00 00 00 00 00 00 00 00 00 00
 :FFDE870 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 :FFDE880 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 :FFDE890 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 :FFDE8A0 00 00 00 00 00 00 00 00 00 00 00 00 41 0B 3F 4D
 :FFDE8B0 BD


Here we see our $BD end of frame marker, and just before it, four bytes that are the CRC.  So, everything is fine there, as we know it is, since etherload works fine on that board with CRC checking enabled.

Now, the same packet received by the MEGA65 r1 PCB:

 :FFDE800 A9 80 FF FF FF FF FF FF 10 05 01 9F FC FD 08 00
 :FFDE810 45 00 00 9C 52 C0 40 00 40 11 63 3F C0 A8 01 02
 :FFDE820 C0 A8 01 FF E8 8C 11 9E 00 88 86 E8 A9 00 EA EA
 :FFDE830 EA EA EA EA A2 00 BD 44 68 9D 40 03 E8 E0 40 D0
 :FFDE840 F5 4C 40 03 A9 47 8D 2F D0 A9 53 8D 2F D0 A9 00
 :FFDE850 A2 0F A0 00 A3 00 5C EA A9 00 A2 00 A0 00 A3 00
 :FFDE860 5C EA 68 68 60 00 00 00 00 00 00 00 00 00 00 00
 :FFDE870 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 :FFDE880 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 :FFDE890 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 :FFDE8A0 00 00 00 00 00 00 00 00 00 00 00 BD


Note that apart from some different values in the IP and UDP header fields, the ethernet frames are identical, except for the lack of a CRC field.  While this is rather confusing, as I have never seen ethernet frames lacking a CRC field before, it at least does explain the behaviour I am seeing.  I also confirmed that if I use my Mac instead of the Linux laptop, the same behaviour is seen on the receiving side.

The MEGA65 r1 PCB does use a different ethernet receiver IC.  Is it possible that this IC does automatic CRC checking, and simply trims the CRC field from the end of the packet?  If so, I can find no mention of this feature in the datasheet for it.  There is a way that this can be tested, however: Connect two MEGA65's back to back via ethernet, send a frame from one to the other, and see if the CRC that the one sent is received by the other.  The MEGA65's ethernet controller I have written in VHDL always sends a CRC, so this eliminates that question.  This is also a good idea because I want to test the sending of ethernet frames anyway: there is a problem with that, which I suspect is due to the timing of the TX bits relative to the 50MHz ethernet clock.

To do this, I wrote a little program that simply copies a sample ethernet frame to the TX buffer and sends the packet whenever a key is pressed.  First time trying this, I can see that a packet is sent from that side, and received by the other, with the CRC missing.  However, it also showed up a problem with memory mapping: while I can read from the packet RX buffer after using the MAP instruction to make it visible at $6800-$6FFF, I can't write to it. Instead, writes are going to colour RAM. Using the serial monitor causes the same problem. Time for another synthesis run to fix that (I found the wrong 2-bit constant in the CPU source code that was causing it)...

So, having fixed that memory mapping error, I can now send packets from the Nexys4DDR board to the MEGA65 r1 board, but no CRC is visible. Also, I discovered that the packet length must be set to one more than the number of bytes in the packet. Now, what about in the other direction, from the MEGA65 r1 PCB to the Nexys4 board?

Here we have some interesting things.   First, the data coming through is corrupted, specifically, it looks like the bits that have been transmitted in one cycle are actually often used in the following cycle, i.e., I am presenting the data on the opposite side of the ethernet TX clock compared to when I should be.  Here is the hexadecimal version of the packet as received at the other end:

 :0000428 47 80 FF FF FF FF FF FF 47 45 45 45 45 45 29 00
 :0000438 3F 50 55 5A 5F 54 55 5E 5F 78 7D 7A 7F 7C 7D 7E
 :0000448 7F A0 A5 AA AF B4 B5 BE BF A8 AD AA AF BC BD BE
 :0000458 BF F0 2F FA FF F4 00 05 28 15 3C 3C 3F A0 5F 3F
 :0000468 5A 3C 14 A5 A0 40 7F 5F F4 BD 00 00 00 00 00 00


The first two bytes are the length ($47) + CRC error flag, then we have the usual ethernet fields.  Clues that the bits are being read one cycle late are that the ethernet source address field is 474545454545, when it should be 414141414141.  The $47 has the upper two bits from the $FF of the last ethernet destination address rotated in, and then the $41 rotated left.  $41 = 01000001, so rotating it left and pulling in the 11 bits, we get 00000111, which isn't quite right. However, if we assume that each received bit pair is the logical OR of the bit pair sent in the previous cycle and the bit pair being sent now, then it makes sense: 01000001 OR 00000111 = 01000111 = $47.  This says to me quite strongly that it is this marginal timing issue.  Basically, by presenting the bits and clock at the same time, there isn't enough time for them to stabilise and replace the old values before the ethernet controller samples them.
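The OR hypothesis is easy to play with in software. The following Python sketch is only a model of the suspected fault, not anything from the controller itself: it assumes bytes go out two bits at a time, lowest bit pair first, and that every received bit pair is the OR of the pair being sent and the pair sent one cycle earlier.

```python
def to_dibits(data):
    """Yield each byte as four bit pairs, lowest pair first."""
    for byte in data:
        for shift in (0, 2, 4, 6):
            yield (byte >> shift) & 0x03


def smear(data):
    """Model of the suspected timing fault.

    Every received bit pair is taken to be the OR of the pair currently on
    the wire and the pair that was sent one cycle earlier.
    """
    dibits = list(to_dibits(data))
    out = bytearray()
    prev = 0
    for i in range(0, len(dibits), 4):
        value = 0
        for n in range(4):
            cur = dibits[i + n]
            value |= (cur | prev) << (2 * n)
            prev = cur
        out.append(value)
    return bytes(out)


# End of the destination address, source address, then the start of the payload.
sent = bytes([0xFF, 0xFF, 0x41, 0x41, 0x08, 0x00, 0x0F, 0x10, 0x11])
print(smear(sent).hex(" "))   # ff ff 47 45 29 00 3f 50 55
```

Under these assumptions the model reproduces the 47 45 ... 29 00 3F 50 55 sequence seen in the capture above, which fits the idea of old values lingering on the TX lines.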

Despite the difficulty that this glitch provides in determining if the CRC field is there, by repeatedly sending slightly different frames, I can see that the last four bytes of the frame before the $BD end of frame marker (which looks like a reverse = sign on the screen display) change each time. The only other byte that changes is one byte in the frame that I am changing on the transmit side.  You can play spot the differences with me in these shots: There is a single byte different in the body of the packets shown, so that the CRCs would be different, and then the CRC fields themselves:




So, these problems shouldn't be too hard to fix. The out-by-one length error I can very easily fix. The timing error will be a little more work, but not particularly hard. What I will probably do is use a 200MHz clock to drive the TX lines, and have a register that allows me to adjust the phase of the TX data bits with relation to the ethernet clock. That way I will only need to resynthesise once to be able to find the correct settings, which can then be baked into the next synthesis run after that.


So, adding the phase delay on the ethernet TX data lines has fixed the data corruption we were seeing. Here is how it looks now, sent from the MEGA65 r1 PCB to the Nexys 4 DDR board:

 :7776800 47 80 FF FF FF FF FF FF 41 41 41 41 41 41 08 00
 :7776810 0F 10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E
 :7776820 1F 20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 2E
 :7776830 2F 30 0F 32 33 34 00 01 08 05 0C 0C 0F 20 17 0F
 :7776840 12 0C 04 21 20 91 51 83 E2


Now we see the MAC address being correctly formed, and all the bytes look correct. Also, as this was received by the Nexys4 DDR board, we see the ethernet CRC field.  However, it still thinks the CRC is wrong.

In the reverse direction, we still don't see the CRC field, so we see packets like this:

 :7776800 42 80 FF FF FF FF FF FF 41 41 41 41 41 41 08 00
 :7776810 0F 10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E
 :7776820 1F 20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 2E
 :7776830 2F 30 EE 32 33 34 00 01 08 05 0C 0C 0F 20 17 0F
 :7776840 12 0C 04 21


What is nice is that the same TX line phase delay works in both directions, so we don't need to make that a setting specific to the type of board.

We also see that the number of bytes sent differs between them by one, that is, the MEGA65 r1 is sending one more byte than the Nexys4 DDR board is. This probably explains why the Nexys board sees an incorrect CRC, and is more of a concern.

What I think I will do next, is to send a frame to the r1 PCB, and use the debug mode on the ethernet controller to see the raw data lines, and see if we see the CRC bits arriving.  Here is what I see:

 :7776800 80 80 80 80 80 80 80 80 80 80 80 80 81 81 81 81
 :7776810 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81
 :7776820 81 81 81 81 81 81 81 81 81 81 81 83 83 83 83 83
 :7776830 83 83 83 83 83 83 83 83 83 83 83 83 83 83 83 83
 :7776840 83 83 83 83 81 80 80 81 81 80 80 81 81 80 80 81
 :7776850 81 80 80 81 81 80 80 81 81 80 80 81 80 82 80 80
 :7776860 80 80 80 80 83 83 80 80 80 80 81 80 81 80 81 80
 :7776870 82 80 81 80 83 80 81 80 80 81 81 80 81 81 81 80
 :7776880 82 81 81 80 83 81 81 80 80 82 81 80 81 82 81 80
 :7776890 82 82 81 80 83 82 81 80 80 83 81 80 81 83 81 80
 :77768A0 82 83 81 80 83 83 81 80 80 80 82 80 81 80 82 80
 :77768B0 82 80 82 80 83 80 82 80 80 81 82 80 81 81 82 80
 :77768C0 82 81 82 80 83 81 82 80 80 82 82 80 81 82 82 80
 :77768D0 82 82 82 80 83 82 82 80 80 83 82 80 81 83 82 80
 :77768E0 82 83 82 80 83 83 82 80 80 80 83 80 82 81 80 83
 :77768F0 82 80 83 80 83 80 83 80 80 81 83 80 80 80 80 80
 :7776900 81 80 80 80 80 82 80 80 81 81 80 80 80 83 80 80
 :7776910 80 83 80 80 83 83 80 80 80 80 82 80 83 81 81 80
 :7776920 83 83 80 80 82 80 81 80 80 83 80 80 80 81 80 80
 :7776930 81 80 82 80 80 80 02 80 03 82 00 83 03 81 00 80
 :7776940 02 81 02 80 00 81 02 83 00 00 00 00 00 00 00 00


Each byte in this capture is one 20ns time step on the ethernet interface.  Bit 7 is the "data valid" signal, and bits 0 and 1 are the data being read. Four of these make one byte of actual data, with the lowest two bits of the byte arriving first. So, let's decode it. The long train of 81's is the ethernet preamble, and the first 83 marks the end of the start-of-frame delimiter; the 83's that follow are already the $FF bytes of the destination address.  So we need to start from the second 83.  We then have the following 4 time steps making the following bytes:

$0000 : 83 83 83 83 = %11111111 = $FF
$0001 : 83 83 83 83 = %11111111 = $FF
$0002 : 83 83 83 83 = %11111111 = $FF
$0003 : 83 83 83 83 = %11111111 = $FF
$0004 : 83 83 83 83 = %11111111 = $FF
$0005 : 83 83 83 83 = %11111111 = $FF
$0006 : 81 80 80 81 = %01000001 = $41
$0007 : 81 80 80 81 = %01000001 = $41
$0008 : 81 80 80 81 = %01000001 = $41
$0009 : 81 80 80 81 = %01000001 = $41
$000a : 81 80 80 81 = %01000001 = $41
$000b : 81 80 80 81 = %01000001 = $41
$000c : 80 82 80 80 = %00001000 = $08
$000d : 80 80 80 80 = %00000000 = $00
$000e : 83 83 80 80 = %00001111 = $0F
$000f : 80 80 81 80 = %00010000 = $10
$0010 : 81 80 81 80 = %00010001 = $11
$0011 : 82 80 81 80 = %00010010 = $12
$0012 : 83 80 81 80 = %00010011 = $13
$0013 : 80 81 81 80 = %00010100 = $14
$0014 : 81 81 81 80 = %00010101 = $15
$0015 : 82 81 81 80 = %00010110 = $16
$0016 : 83 81 81 80 = %00010111 = $17
$0017 : 80 82 81 80 = %00011000 = $18
$0018 : 81 82 81 80 = %00011001 = $19
$0019 : 82 82 81 80 = %00011010 = $1A
$001a : 83 82 81 80 = %00011011 = $1B
$001b : 80 83 81 80 = %00011100 = $1C
$001c : 81 83 81 80 = %00011101 = $1D
$001d : 82 83 81 80 = %00011110 = $1E
$001e : 83 83 81 80 = %00011111 = $1F
$001f : 80 80 82 80 = %00100000 = $20
$0020 : 81 80 82 80 = %00100001 = $21
$0021 : 82 80 82 80 = %00100010 = $22
$0022 : 83 80 82 80 = %00100011 = $23
$0023 : 80 81 82 80 = %00100100 = $24
$0024 : 81 81 82 80 = %00100101 = $25
$0025 : 82 81 82 80 = %00100110 = $26
$0026 : 83 81 82 80 = %00100111 = $27
$0027 : 80 82 82 80 = %00101000 = $28
$0028 : 81 82 82 80 = %00101001 = $29
$0029 : 82 82 82 80 = %00101010 = $2A
$002a : 83 82 82 80 = %00101011 = $2B
$002b : 80 83 82 80 = %00101100 = $2C
$002c : 81 83 82 80 = %00101101 = $2D
$002d : 82 83 82 80 = %00101110 = $2E
$002e : 83 83 82 80 = %00101111 = $2F
$002f : 80 80 83 80 = %00110000 = $30
$0030 : 82 81 80 83 = %11000110 = $C6
$0031 : 82 80 83 80 = %00110010 = $32
$0032 : 83 80 83 80 = %00110011 = $33
$0033 : 80 81 83 80 = %00110100 = $34
$0034 : 80 80 80 80 = %00000000 = $00
$0035 : 81 80 80 80 = %00000001 = $01
$0036 : 80 82 80 80 = %00001000 = $08
$0037 : 81 81 80 80 = %00000101 = $05
$0038 : 80 83 80 80 = %00001100 = $0C
$0039 : 80 83 80 80 = %00001100 = $0C
$003a : 83 83 80 80 = %00001111 = $0F
$003b : 80 80 82 80 = %00100000 = $20
$003c : 83 81 81 80 = %00010111 = $17
$003d : 83 83 80 80 = %00001111 = $0F
$003e : 82 80 81 80 = %00010010 = $12
$003f : 80 83 80 80 = %00001100 = $0C
$0040 : 80 81 80 80 = %00000100 = $04
$0041 : 81 80 82 80 = %00100001 = $21
$0042 : 80 80 02 80 = %00100000 = $20 (some bits missing data valid)
$0043 : 03 82 00 83 = %11001011 = $CB (some bits missing data valid)
$0044 : 03 81 00 80 = %00000111 = $07 (some bits missing data valid)
$0045 : 02 81 02 80 = %00100110 = $26 (some bits missing data valid)
$0046 : 00 81 02 83 = %11100100 = $E4 (some bits missing data valid)
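The hand decoding above is mechanical enough to sketch in a few lines of Python. The function name is just for illustration. It assumes the capture has already been trimmed so that it starts at the first data di-bit, i.e. at the second 83.

```python
def decode_capture(samples):
    """Turn debug capture bytes into (value, all_valid) tuples.

    Each sample is one 20ns step: bit 7 is data valid, bits 0-1 are the
    two data bits. Four samples make one byte, lowest bit pair first.
    """
    decoded = []
    for i in range(0, len(samples) - 3, 4):
        group = samples[i:i + 4]
        value = 0
        for n, sample in enumerate(group):
            value |= (sample & 0x03) << (2 * n)
        all_valid = all(sample & 0x80 for sample in group)
        decoded.append((value, all_valid))
    return decoded


rows = [0x81, 0x80, 0x80, 0x81,   # $41
        0x80, 0x82, 0x80, 0x80,   # $08
        0x03, 0x82, 0x00, 0x83]   # $CB, with data valid missing on two steps
for value, all_valid in decode_capture(rows):
    print(f"${value:02X}", "" if all_valid else "(some bits missing data valid)")
```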


So, this is VERY interesting.  The ethernet controller isn't filtering out the CRC, but is rather claiming that those bits are not data valid.  Given the very specific pattern, with one di-bit missing the data valid for the last data byte of the packet, and then two di-bits missing the data-valid signal in each byte of the CRC, always the same two, I suspect that the ethernet controller might be signalling the end of the frame.  This would mean that it must be buffering at least five bytes worth of received data, but that is not impossible.  Anyway, it explains where the CRC has gone. So, digging around a bit, I have found that the RX data valid signal is multiplexed with carrier sense on some PHY chips.  This looks like exactly what could be happening here (although the PHY receiver on the Nexys4 doesn't do this, as I have just re-confirmed), thus providing an explanation for what we are seeing.

So, time to resynthesise again, and see if this gets us CRCs received on the MEGA65 r1 PCB.  That should just leave the CRC checksum problems, if they are still occurring after that change (which I expect that they will).
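One way a receiver could cope with a data valid line that flickers like this is sketched below in Python. This is only a guess at a workable rule, not a description of what the MEGA65 controller does in VHDL: it assumes that a single step with data valid low is still data, and that the frame has really ended only once data valid is low for two steps in a row.

```python
def collect_frame(samples):
    """Collect bytes until data valid (bit 7) is low on two steps in a row.

    Assumption: one isolated low step is carrier sense showing through,
    and the data bits on that step are still good.
    """
    dibits = []
    prev_valid = True
    for sample in samples:
        valid = bool(sample & 0x80)
        if not valid and not prev_valid:
            dibits.pop()          # the previous low step was not data either
            break
        dibits.append(sample & 0x03)
        prev_valid = valid
    out = bytearray()
    for i in range(0, len(dibits) - 3, 4):
        d0, d1, d2, d3 = dibits[i:i + 4]
        out.append(d0 | (d1 << 2) | (d2 << 4) | (d3 << 6))
    return bytes(out)


# Tail of the capture above: $21, the last data byte, four CRC bytes, then idle.
tail = [0x81, 0x80, 0x82, 0x80, 0x80, 0x80, 0x02, 0x80,
        0x03, 0x82, 0x00, 0x83, 0x03, 0x81, 0x00, 0x80,
        0x02, 0x81, 0x02, 0x80, 0x00, 0x81, 0x02, 0x83,
        0x00, 0x00, 0x00]
print(collect_frame(tail).hex(" "))   # 21 20 cb 07 26 e4
```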

Indeed, success! I can now receive the last byte and CRC of a packet on the MEGA65 r1 PCB.  However, I still have a problem with CRCs.  I know that the CRC problem is on the sending side, because I can receive packets sent from my laptop without difficulty -- it is only packets sent from the MEGA65 that have this problem.

My planned approach to investigating this was to capture some good packets, find or write an ethernet CRC checking program, confirm it works for those, and then see how the MEGA65-originated packets fare -- and if there is some mutation of the packet data that will make the CRC correct. However, then I decided to take a closer look at the CRC generation code in the ethernet controller, and get that to provide me with the list of bytes that it thought it was CRCing, to make sure that there was nothing strange going on. In the process of that, I found that the data valid input to the CRC calculator was remaining high while clocking the CRC out at the end of a packet.  Thus, only the first two bits of the CRC would be correct, and the rest would be wrong. So, off to synthesis again, to see if this fixes the problem.
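For reference, a CRC checking program of the kind mentioned above can be very small. This Python sketch leans on the fact that the ethernet frame check sequence is the same CRC-32 that zlib computes, sent least significant byte first. The function names are invented for illustration, and the frame passed in is assumed to start at the destination MAC address, i.e. without the two-byte length header or the $BD marker from the receive buffer.

```python
import struct
import zlib


def ethernet_fcs(frame):
    """Return the four FCS bytes in the order they appear on the wire."""
    return struct.pack("<I", zlib.crc32(frame) & 0xFFFFFFFF)


def fcs_ok(frame_with_fcs):
    """True if the last four bytes are the correct FCS for the rest."""
    if len(frame_with_fcs) < 4:
        return False
    body, fcs = frame_with_fcs[:-4], frame_with_fcs[-4:]
    return ethernet_fcs(body) == fcs


# Stand-in frame: broadcast destination, 41:41:41:41:41:41 source, dummy payload.
frame = bytes([0xFF] * 6 + [0x41] * 6 + [0x08, 0x00]) + bytes(46)
print(fcs_ok(frame + ethernet_fcs(frame)))   # True
```

Feeding a captured good packet through `fcs_ok` first is a cheap way to confirm that the byte range and byte order are right before trusting a verdict on a suspect packet.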

Testing with this fix, it still wasn't working.  So I took a known good packet sent by my laptop to the MEGA65, and got the other MEGA65 to send it, so that I could compare the CRC with that of the good packet, to try to get some handle on what was still going wrong. I was really quite frustrated at this point, because I had gone through the relevant code carefully, and thought I had understood what was going on, and with the help of simulation, confirmed that it was doing the right thing.  So I was somewhat relieved when I realised what the problem with the CRC was.  Here is the good and the bad CRC:

GOOD: $C2F7B15F = binary 11000010 11110111 10110001 01011111
 BAD: $C1FB72AF = binary 11000001 11111011 01110010 10101111

Looking at the hex, I could see that there were strong similarities, much more so than if the CRC was just plain wrong. But it took me a little while to realise that the two bits in each pair were simply swapped: The routine that copies the CRC bits out, two at a time, for transmission was putting them onto the wrong TX lines. So, it's off for a few hours of synthesis again to fix this up... and after 10,238 seconds of synthesis, we finally have ethernet transmission with working CRC generation.
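The swap is easy to confirm with a couple of bit masks. This is just a Python check of the two values above, not code from the controller:

```python
def swap_bit_pairs(value):
    """Swap the two bits inside every bit pair of a 32-bit value."""
    odd = value & 0xAAAAAAAA    # upper bit of each pair
    even = value & 0x55555555   # lower bit of each pair
    return ((odd >> 1) | (even << 1)) & 0xFFFFFFFF


GOOD = 0xC2F7B15F
BAD = 0xC1FB72AF
print(hex(swap_bit_pairs(GOOD)))   # 0xc1fb72af
```

Swapping the pairs of the good CRC gives exactly the bad one, and swapping again gives the good one back.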

The only program I have that does any ethernet transmission at the moment is the etherload program, which in theory listens for ARP requests. It would be nice if it also listened for PING packets and replied, but you can't have everything. However, pinging its IP address from my laptop does now result in ARP succeeding, with the very uncreative MAC address hardwired into etherload:

paul@F96NG92-L:~/Projects/mega65/mega65-core$ ping 192.168.1.65
PING 192.168.1.65 (192.168.1.65) 56(84) bytes of data.
^C
--- 192.168.1.65 ping statistics ---
7 packets transmitted, 0 received, 100% packet loss, time 6140ms

paul@F96NG92-L:~/Projects/mega65/mega65-core$ arp -na

...
? (192.168.1.65) auf 40:40:40:40:40:40 [ether] auf enp0s31f6...

What would be nice is if etherload responded to pings, and also read out the MAC address it should use from the MIIM, now that I know how to use it. In fact, it would be nice if the ethernet controller provided simple register based access to the MIIM registers, and had an option to automatically populate frames with its MAC address.  I'll add these to the queue.  But for now, I am happy to finally have ethernet working in a solid way, for both transmit and receive, and will move on to testing the HDMI port, and now that I remember, the last aspects of the 3.5" floppy drive interface.