Saturday, 18 June 2022

TEI0004 JTAG adaptors can now be ordered

So in the last blog post I designed a simple adaptor that allows the (at time of writing) still-in-stock TEI0004 JTAG adaptor to be used with the MEGA65, in place of the currently out-of-stock TE0790 JTAG adaptors.

Well, I have made up the first dozen of them, and they work. I can communicate with the serial monitor interface of the MEGA65, and I can push bitstreams to the MEGA65 over it as well. 

You can even stack them like LEGO (tm etc)!

This means it can be used to support development workflows that use the serial monitor interface. It can also be used to de-brick your MEGA65 if you mess up flash slot 0.

Note that these adaptors are _not_ supported by Vivado, but they _are_ supported by the MEGA65 m65 and m65connect tools on Linux and Mac, and for the serial monitor interface on Windows. 

The JTAG interface also technically works on Windows, but is not currently well supported by m65 and m65connect, so in practice it may not work in that mode. It might work for you, but it might not. We are not promising that it does right now, let's put it that way. 

But if you are on Windows and need the JTAG interface, e.g., for pushing a bitstream to de-brick a MEGA65, or just to try out other bitstreams without having to flash them, the adaptors will work 100% in a Linux VM.

My 11 year old son and I have decided to use this as a good opportunity for him to learn about the process of producing and selling a product, including working out all the costs, setting an appropriate price etc, so he is offering a batch of 50 of these for sale.

So if you would like to order one, they will be AU$15.95 or 9,99€ plus postage from Australia, depending on your preferred currency.  Payments will be accepted via EFT transfer from Australian bank accounts in Australian dollars or via IBAN/Direct Transfer/Sofortüberweisung in Euros to a German bank account -- i.e., no currency exchange fees will be involved.

Postage within Australia is $9.30, or $12.50 if you would like it Express Post.  Postage to most of Europe, including Germany and the UK, is 10€ for slow-and-cheap or 18€ for express via Australia Post/EMS. Postage should be constant for up to 3 or 4 adaptors in a single shipment. If you would like to order more, let me know, and I can get a more exact quote.

Finally, if you would like to order one (or more to save postage with friends), please visit the MEGA65 Discord Server and send me a private message.

But don't forget to check stock and order a TEI0004 from Trenz Electronic first, so that you know you will be able to make use of the adaptor!

Of course, if you would prefer to make your own, I'm happy to share the design with anyone who would like it.

Sunday, 5 June 2022

Using a TEI0004 JTAG adapter in place of a TE0790 JTAG adapter

The MEGA65 has three FPGAs: One from Lattice in the keyboard, one from Xilinx as the main FPGA on the motherboard, and one from Altera/Intel also on the motherboard.  The Altera FPGA is programmed using a Trenz TEI0004-02 JTAG adapter on J17, while the Xilinx one is flashed using a TE0790 on JB1.  The reason for these two adapters is a bit complex and historical, and influenced by which adapter works easily with the Xilinx and the Altera tools.  But regardless of how we got here, it's our starting position.

Why this matters is that some MEGA65 owners would really like to get a JTAG adapter connected to their MEGA65 for flashing and interacting with the main Xilinx FPGA.  And this is where the problem comes in: Because of Chipageddon, Trenz will likely not have any more of the TE0790 adapters available for more than a year.  BUT they do have the TEI0004-02's in stock.  

Thus we would like to find a way to use the TEI0004-02 to control the Xilinx FPGA. We know that it won't work from within the Xilinx Vivado software, but it will work from the m65 command line tool, and also from m65connect on OSX and Linux.  Windows may be a bit tricky, because we don't have a rock-solid way to talk to the USB JTAG adapters from within Windows, except to use Vivado.  But even if Windows folks have to spin up a virtual machine running Linux, that will still in many cases be better than waiting a year or more. So, in short, there is good reason to do this little project.

First up, we need to get the pinout of JB1 and J17 from the MEGA65 schematics, and also make reference to the datasheets for the TEI0004 and TE0790 adapters.

Let's start with the TEI0004's pinout for its 10-pin header:

Pin 1 - JTAG TCK (output from adapter)
Pin 2 - GND
Pin 3 - TDO (input to adapter)
Pin 4 - Reference I/O-voltage from target board for JTAG and UART
Pin 5 - TMS (output from adapter)
Pin 6 - Reserved Output (May be used as Processor Reset in future software releases)
Pin 7 - UART RX (input to adapter)
Pin 8 - UART TX (output from adapter)
Pin 9 - TDI (output from adapter)
Pin 10 - GND

The TE0790 is a bit more complex, because it has a little CPLD that allows the pinout to be reassigned dynamically. So we need to know which profile it uses normally.  For this, it's easiest to work backwards from the MEGA65 schematics, where the pin assignments will be fixed.

Pin 1 - GND
Pin 2 - GND
Pin 3 - UART RX
Pin 4 - TCK
Pin 5 - 3.3V (but not connected, as I discover later...)
Pin 6 - 3.3V
Pin 7 - UART TX
Pin 8 - TDO
Pin 9 - Not Connected
Pin 10 - TDI
Pin 11 - Not Connected
Pin 12 - TMS

So, as we already knew should be the case, all the required signals are there: We just need to connect them to one another.  For testing, I will just use a packet of header jumpers.  If that works, we can design ourselves a little bitty PCB that will do the adaptation.

So what we want is:

TEI0004     TE0790
      1 TCK 4
      2 GND 2
      3 TDO 8
      4 VCC 6
      5 TMS 12
      6 ---
      7 RX  3
      8 TX  7
      9 TDI 10
     10 GND 1

So that all looks good.  I have an 8-pin header cable ready.  To make my life easier, I will record which colour I am using for which signal:

TEI0004     TE0790
      1 TCK 4         BROWN
      2 GND 2         RED
      3 TDO 8         ORANGE
      4 VCC 6         YELLOW
      5 TMS 12        GREEN
      7 RX  3         BLUE
      8 TX  7         PURPLE
      9 TDI 10        GREY

I have removed the not-connected pin, and the duplicate GND lines in the process.

Testing that out, it looks like the adapter powers up, and I can even see the output of the serial monitor interface from the MEGA65, which is good.  But JTAG communications don't work, nor does writing to the UART interface, so there is something odd going on.

In fact, if I even try to talk JTAG over it, then the machine goes all weird, resetting itself, and seeming to put random junk in memory.

This all suggests to me that I have likely got one of the UART and JTAG lines crossed... Now to figure out which.  Perhaps the easiest approach here is to find the UART line, since it is the UART line from the host computer to the adapter, so I should be able to type some characters into a terminal, and see the correct line waggle. We are expecting it to be pin 7 or 8 on the TEI0004, on the blue and purple lines. 

Without even going that far, I can see that the purple wire (UART TX to the MEGA65) is near GND, rather than floating at 3.3V. This is being caused by the TEI0004. So I am guessing I have the wrong pin... but double-checking everything seems to indicate that I do have the correct pin.  The MEGA65 main-board has a 10K pull-down on the UART TX line, but that should not be so strong that the TEI0004 can't drive the line high.

I think I have figured it out: I was using pin 5 on JB1 for VCC, but this is not connected on the MEGA65. VCC is instead on pin 6. With it on pin 5, the VIO reference voltage would have been near zero volts, and thus we saw the problems we were seeing.  

(Now, the astute reader might have noticed that I have pin 6 and not pin 5 in the table above. That is because I purposely changed it there after writing this, so that anyone who happened to glance at the table and use it as a reference wouldn't accidentally use the wrong wiring.)

... and funnily enough, with the correct voltage, both the UART and JTAG interfaces are now working: I can even do m65 -b mega65.bit to push a new bitstream to the MEGA65, or use m65 -S to take a screen-shot.  In short, we have it all working :)

So now let's design a simple little adapter PCB, so that folks can do this, without having to have a bodge cable.

Done! And sent off to PCBWay for prototyping!

So it should look like the following images when done:

From underneath, we have the 12-pin connector that goes onto the MEGA65's TE0790 header JB1. The big hole is supposed to line up with the 3mm hole in the MEGA65's PCB, so that if you want to secure it in place semi-permanently, this is possible.  It also provides a nice visual indicator that you have it the right way around.

From above, we have the 10-pin header to accept the TEI0004.

And a couple of different angles, just so you get an idea of the whole thing.  It is quite obviously very simple, hence why I was able to design it in about an hour, even though I had forgotten how to drive KiCad:

The particularly observant among you might have noticed that the TE0790 header is not lined up properly with the other two in the images above. This is a side-effect of the footprint I was using in KiCad having the pin numbers in the wrong order if I put it on the correct side of the board. So I designed it looking like this, i.e., with both on the same side, but with the pin positions actually correct. Of course, for assembly it doesn't make any difference at all:

 



Now it's the annoying wait... PCBWay do have nice updates though as it goes through. Because the board is so small, I chose the largest batch size that would still give a 24 hour build time (presumably because it fits on a single panel).  So right now, it is already showing:


So while we wait for the boards and the connectors to arrive, thanks to AmokPhase101 we have mock-up images that will help you to see how it will fit together, and so that if you get one (or make your own), you can easily fit it into place:

First, we have the upper right area of the MEGA65's mainboard, with the JB1 connector where the TE0790 JTAG adapter would normally sit. This is the 2x6 pin male header just to the left of the screw, and just below the microSD card slot, and above and right of the word "TEAM" in "MOULDS TEAM".

Second, we can see what it would look like with the adaptor in place: We see the pins of the adapter poking through where JB1 is beneath it, and the 2x5 pin male header that can accept the TEI0004:
Finally, we can put the TEI0004 onto that connector, with the micro USB connector facing towards the front of the computer:

So that's everything, really, until the PCBs and connectors turn up, probably in a week's time, and I can test one out.  I'm hoping that with it being so simple, I haven't screwed up one of the tracks or flipped pin numbers around or done anything else equally stupid (all of which are so easy to do with PCB prototyping).  But those (mis)adventures will have to wait until the parts arrive...



Monday, 2 May 2022

The "Chipageddon" parts shortage is weird

As many of you are aware, the MEGA65 has been delayed because of the whole supply chain disruption stuff that we have all come to know with COVID, and not at all helped by Russia deciding to invade its next-door neighbour.

We managed to source all the electronic parts for the first batch of MEGA65 months ago, and thought that we were all in the clear.

Then we found out that the power supplies were delayed for a bit. But not as much as the cardboard for the boxes.

Well, the cardboard turned up at the factory and was busy getting printed, but was delayed for a few more days because of problems getting a truck driver and the wooden pallets on which to stack the printed boxes.

The truck driver thing I knew was an issue already, with Germany going backwards by about 15,000 truck drivers a year, in part because they stopped teaching people to drive trucks in their year of national service, and not many people are willing to fork out around 10,000€ to get their Class-E driver's license to qualify them for a job that is pretty poorly paid and has notoriously bad working conditions.  As a result there are about 10,000 fewer qualified drivers created each year, which amounts to a good 2/3 of the shortfall.

I take a bit of interest in the truck-driver thing, because I have the Australian equivalent of a Class-E license (A "Heavy Combination" license for the curious, which means I can drive semi-trailers, but not b-doubles or road-trains), and am mindful of truck driving as a Career Of Last Resort(tm), which is at least handy to have as a short-term solution should the need ever arise. And the key points there are short-term and last-resort.  It's fun to drive 80 tonnes of truck around for a few days, but to have to do it day-in, day-out, and not see your family for days at a time, all the time, is not hugely appealing to most folks. So why would you invest 10K in getting qualified for such an unappealing job?

But anyway, the truck driver wasn't as much of a problem as the packing pallets are turning out to be. This is because the highly specialised and efficient German pallet-making machines are designed to use a specific high-grade nail made in ... you guessed it, Russia. The other options for importing finished pallets to Germany are typically Russia, Ukraine and Belarus. Oh dear.  

Meanwhile, we are also seeing other less obvious supply chain issues.  For example, I needed to order some parts for the prototyping of the next iteration of the hand-held version of the MEGA65, and had to search all over the place to find a supplier with the combined microSD / SIM card holders we are using to save space. I did eventually find a wholesaler in Japan who had them, so that I wouldn't have to wait the 72 week lead time (lead times of up to 100 weeks are now becoming quite common).  The catch was that the minimum order quantity (MoQ) was 244, when I only needed 8 pieces.  Fortunately they are less than 0,20€ each, so even multiplied out, this would mean about 50€.

I placed my order, and got an email a few days later saying that the MoQ for this scarce part had increased from 244, up to a complete reel of 1,200.  Needless to say this struck me as rather odd: Because of a parts shortage, I was being forced to buy hundreds of times more parts than I needed.  It also happened with some other parts like resistors and capacitors, where we had to order a reel of 2,000, when we only needed 20 or 50 pieces. So even though I only needed this many:


 

I ended up with this many:

 

So what is going on here? I think there are a few factors at play: One, the wholesalers are simply too busy to be bothered cutting a few parts off a reel, like they might have done before. Two, they don't want the risk of being stranded with most of a reel. This is especially true for weird parts like the ones I was ordering.  I think there is a good chance that they just wanted to get rid of them, and free up their warehouse space for something that people are more likely to need.

Anyway, back to the MEGA65: we got our pallets, and I believe that the cartons should be arriving at Trenz more or less any day now, and the shipping of the first batch of ~400 units should be starting perhaps as early as the end of this week. Exciting times, despite all of the dysfunction in supply chains at the moment.

Thursday, 24 March 2022

Solving the last of the digital video output glitching, at least I think so

First, as many will have noticed, we are now in March 2022, the month when we hope that the first 400 MEGA65s will ship. At the time of writing, we are still waiting to hear that the cardboard packaging and related materials have arrived at Trenz to go with the assembled PCBs, keyboards and cases, to make the complete units.  We still don't know exactly when they will arrive, but are waiting expectantly to hear news on this any day now -- as we are sure many of you are!

MEGA65 R3A Production board under test, touched up using the "my crappy phone's auto-focus is about as reliable as its auto-correct feature" filter.

But in the meantime...

As past readers will know, we have access to an old N5998A HDMI protocol analyser that we use to debug the HDMI and DVI compatible audio-video output on the MEGA65.  Recently we have seen some of the production boards having glitchy digital video output, and have set about fixing this in the VHDL.  However, in the process, we have found that in many cases the HDMI protocol analyser fails to recognise the video output, even though it displays fine on a monitor.

Also, in some cases where the protocol analyser does recognise the display, it reports "pixel errors", which is the problem we are trying to fix. However, it doesn't give us any information about what those pixel errors are -- and that information would be tremendously helpful for us in debugging the signal. For example, are some of the DVI data words being shifted, are there stuck bits, or are some words being reversed or replaced with the lyrics to a lost Shakespearean sonnet?  We really want to know, and we know that the N5998A's capture files seem to be raw bit-level captures of the three data plus one clock channel, and thus should contain the information that we need.  The trick is figuring out the file format.

The files seem to be in fact a completely raw capture with no header at all. The data looks like this:

00000000: 3f c4 00 3f c1 03 00 00 80 0f fc 80 01 0c 00 00    ?D@?AC@@@O|@AL@@
00000010: 3f c4 00 3f c1 03 00 00 80 04 00 80 01 05 00 00    ?D@?AC@@@D@@AE@@
00000020: 3f cf fc 3f c1 0a 00 00 3f c4 00 3f c1 03 00 00    ?O|?AJ@@?D@?AC@@
00000030: 80 0f fc 80 01 0c 00 00 3f c4 00 3f c1 03 00 00    @O|@AL@@?D@?AC@@
00000040: 80 0f fc 80 01 0c 00 00 3f c4 00 3f c1 03 00 00    @O|@AL@@?D@?AC@@
00000050: 80 0f fc 80 01 0c 00 00 3f c4 00 3f c1 03 00 00    @O|@AL@@?D@?AC@@
00000060: 3f c4 00 3f c1 03 00 00 80 0f fc 80 01 0c 00 00    ?D@?AC@@@O|@AL@@
00000070: 3f c4 00 3f c1 03 00 00 80 0f fc 80 01 0c 00 00    ?D@?AC@@@O|@AL@@
00000080: 3f c4 00 3f c1 03 00 00 80 0f fc 80 01 0c 00 00    ?D@?AC@@@O|@AL@@
00000090: 3f c4 00 3f c1 03 00 00 3f cf fc 3f c1 0a 00 00    ?D@?AC@@?O|?AJ@@
000000a0: 80 04 00 80 01 05 00 00 3f c4 00 3f c1 03 00 00    @D@@AE@@?D@?AC@@
000000b0: 80 0f fc 80 01 0c 00 00 3f c4 00 3f c1 03 00 00    @O|@AL@@?D@?AC@@
000000c0: 80 0f fc 80 01 0c 00 00 3f c4 00 3f c1 03 00 00    @O|@AL@@?D@?AC@@

We can see that there seem to be six bytes of data, followed by two zero bytes, in a repeating pattern. 

We know how HDMI/DVI pixel encoding works, e.g., from the description here:

https://en.wikipedia.org/wiki/Transition-minimized_differential_signaling

The relevant part being:

The method is a form of 8b/10b encoding but using a code-set that differs from the original IBM form. A two-stage process converts an input of 8 bits into a 10 bit code with particular desirable properties. In the first stage, the first bit is untransformed and each subsequent bit is either XOR or XNOR transformed against the previous bit. The encoder chooses between XOR and XNOR by determining which will result in the fewest transitions; the ninth bit encodes which operation was used. In the second stage, the first eight bits are optionally inverted to even out the balance of ones and zeros and therefore the sustained average DC level; the tenth bit encodes whether this inversion took place.  

There are only 460 such unique codes that are used.
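
To make that description a bit more concrete, here is a minimal C sketch of the two-stage encoding. Note that the running-disparity bookkeeping in the second stage is simplified here (it just tracks the disparity of the 10-bit words it emits, rather than following the exact formulas in the DVI specification), so the choice between the inverted and non-inverted word can occasionally differ from real hardware, but the structure -- XOR/XNOR chain, ninth bit recording which was used, optional inversion recorded in the tenth bit -- is as described above:

#include <stdio.h>
#include <stdint.h>

/* Count the 1 bits in the low 'bits' bits of v. */
static int ones(unsigned v, int bits)
{
  int n = 0;
  for (int i = 0; i < bits; i++) n += (v >> i) & 1;
  return n;
}

static int disparity = 0;   /* running (#1s - #0s) of words emitted so far */

/* Encode one 8-bit pixel byte into a 10-bit TMDS data word (bits 9..0). */
static unsigned tmds_encode(unsigned d)
{
  /* Stage 1: XOR or XNOR chain, chosen to minimise transitions;
     bit 8 records which operation was used (1 = XOR). */
  int use_xnor = (ones(d, 8) > 4) || (ones(d, 8) == 4 && !(d & 1));
  unsigned qm = d & 1;
  for (int i = 1; i < 8; i++) {
    int prev = (qm >> (i - 1)) & 1;
    int bit  = (d  >> i) & 1;
    qm |= (unsigned)(use_xnor ? !(prev ^ bit) : (prev ^ bit)) << i;
  }
  if (!use_xnor) qm |= 1u << 8;

  /* Stage 2: optionally invert bits 0-7 to keep the running DC level
     near zero; bit 9 records whether the inversion happened. */
  int n1 = ones(qm & 0xff, 8), n0 = 8 - n1;
  int invert;
  if (disparity == 0 || n1 == n0) invert = !((qm >> 8) & 1);
  else invert = (disparity > 0) == (n1 > n0);

  unsigned out = invert ? ((qm & 0x100) | (~qm & 0xff) | (1u << 9)) : qm;
  disparity += 2 * ones(out, 10) - 10;   /* simplified disparity update */
  return out;
}

int main(void)
{
  for (unsigned d = 0; d < 256; d++)
    printf("%02x -> %03x\n", d, tmds_encode(d));
  return 0;
}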

Also of relevance are the four special code words used to indicate H and V sync:

0010101011
0010101010
1101010100
1101010101

Those are interesting in that one of them, the one that encodes horizontal sync, should occur every raster line, i.e. every ~864 ticks of the 27MHz pixel clock, a pattern that we should be able to fairly readily find.

There are also 16 more that can be used for HDMI data islands.

So a quick histogram of the frequency of each two-byte combination should give us a hint as to whether each pair of bytes encodes one of the R, G and B channels respectively, as I suspect that they might.

I cooked up a quick program to count the number of unique tokens:

$ make hist && ./hist 0.cap
gcc -Wall -g -o hist hist.c
Read 65536000 8-byte rows.
Saw 116 unique 2-byte tokens.
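
The real hist.c isn't reproduced here, but it does little more than the following sketch: read the capture as 8-byte records, and count how many distinct 16-bit values turn up across the four 2-byte fields (the byte order within each token is an arbitrary choice at this point):

#include <stdio.h>
#include <stdint.h>

int main(int argc, char **argv)
{
  if (argc != 2) { fprintf(stderr, "usage: %s file.cap\n", argv[0]); return 1; }
  FILE *f = fopen(argv[1], "rb");
  if (!f) { perror("fopen"); return 1; }

  static uint8_t seen[65536];
  uint8_t rec[8];
  long rows = 0;
  while (fread(rec, 1, 8, f) == 8) {
    rows++;
    /* Treat each pair of bytes as one 16-bit token. */
    for (int i = 0; i < 8; i += 2) {
      uint16_t tok = rec[i] | (rec[i + 1] << 8);
      seen[tok] = 1;
    }
  }
  fclose(f);

  int unique = 0;
  for (int t = 0; t < 65536; t++) unique += seen[t];
  printf("Read %ld 8-byte rows.\n", rows);
  printf("Saw %d unique 2-byte tokens.\n", unique);
  return 0;
}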

Given that there can be 65536 unique values, the highly constrained set of values gives me hope that they might be raw HDMI/DVI 8/10 bit data words.

In fact, if I separate each of the 2 byte tokens and treat them as separate channels, we see that they each have quite different populations:

(The binary values are least significant bit first, i.e., the most significant bit is on the right hand side.)

CHANNEL 0:
Read 65536000 8-byte rows.
  $0255 : 1010101001000000 : 16634
  $0440 : 0000001000100000 : 70011
  $0480 : 0000000100100000 : 11761792
  $0567 : 1110011010100000 : 46
  $0667 : 1110011001100000 : 46
  $0967 : 1110011010010000 : 1308
  $09B9 : 1001110110010000 : 116
  $0A4F : 1111001001010000 : 2156
  $0A67 : 1110011001010000 : 72541
  $0AB9 : 1001110101010000 : 2213
  $0B67 : 1110011011010000 : 5417
  $0D55 : 1010101010110000 : 9649169
  $0DD5 : 1010101110110000 : 483395
  $0F80 : 0000000111110000 : 9801503
  $85B1 : 1000110110100001 : 510
  $86B1 : 1000110101100001 : 508
  $89B1 : 1000110110010001 : 1455
  $8A63 : 1100011001010001 : 173
  $8AB1 : 1000110101010001 : 572465
  $8AB8 : 0001110101010001 : 24457
  $8BB1 : 1000110111010001 : 12142
  $C2AA : 0101010101000011 : 197898
  $C43F : 1111110000100011 : 16172474
  $C458 : 0001101000100011 : 12130
  $C4B0 : 0000110100100011 : 92075
  $C558 : 0001101010100011 : 91
  $C5B0 : 0000110110100011 : 1021
  $C658 : 0001101001100011 : 46
  $C6B0 : 0000110101100011 : 508
  $C958 : 0001101010010011 : 560
  $C9B0 : 0000110110010011 : 965
  $CA58 : 0001101001010011 : 24081
  $CAB0 : 0000110101010011 : 188469
  $CB58 : 0001101011010011 : 174
  $CBB0 : 0000110111010011 : 1986
  $CD2A : 0101010010110011 : 38720
  $CDAA : 0101010110110011 : 3724825
  $CF3F : 1111110011110011 : 12601920
Saw 38 unique 2-byte tokens.
CHANNEL 1:
Read 65536000 8-byte rows.
  $2AAC : 0011010101010100 : 214532
  $3F00 : 0000000011111100 : 16172474
  $3FFC : 0011111111111100 : 12601920
  $4000 : 0000000000000010 : 70011
  $4CCC : 0011001100110010 : 104205
  $5C38 : 0001110000111010 : 1108
  $5C90 : 0000100100111010 : 556
  $63C4 : 0010001111000110 : 1113
  $8000 : 0000000000000001 : 11761792
  $80FC : 0011111100000001 : 9801503
  $9870 : 0000111000011001 : 1108
  $988C : 0011000100011001 : 2818
  $A70C : 0011000011100101 : 11453
  $A770 : 0000111011100101 : 872912
  $A78C : 0011000111100101 : 1586
  $A790 : 0000100111100101 : 1082
  $B00C : 0011000000001101 : 2297
  $B070 : 0000111000001101 : 11452
  $B970 : 0000111010011101 : 1083
  $B990 : 0000100110011101 : 4331
  $B9C4 : 0010001110011101 : 555
  $D550 : 0000101010101011 : 13896109
Saw 22 unique 2-byte tokens.
CHANNEL 2:
$06C1 : 1000001101100000 : 328
  $0801 : 1000000000010000 : 483395
  $0AC1 : 1000001101010000 : 12601920
  $0C01 : 1000000000110000 : 9801503
  $0C41 : 1000001000110000 : 508
  $1481 : 1000000100101000 : 46
  $1CC1 : 1000001100111000 : 6520
  $1D01 : 1000000010111000 : 38720
  $1D81 : 1000000110111000 : 512
  $1FC1 : 1000001111111000 : 971
  $2601 : 1000000001100100 : 4933
  $3A01 : 1000000001011100 : 23719
  $3AC1 : 1000001101011100 : 348
  $4241 : 1000001001000010 : 46
  $43C1 : 1000001111000010 : 484
  $4B41 : 1000001011010010 : 508
  $4C01 : 1000000000110010 : 86
  $4DC1 : 1000001110110010 : 92075
  $52C1 : 1000001101001010 : 4932
  $5301 : 1000000011001010 : 565945
  $5501 : 1000000010101010 : 232
  $55C1 : 1000001110101010 : 438
  $5A01 : 1000000001011010 : 24457
  $5EC1 : 1000001101111010 : 965
  $6941 : 1000001010010110 : 510
  $6D01 : 1000000010110110 : 174
  $6E01 : 1000000001110110 : 484
  $7101 : 1000000010001110 : 2156
  $7401 : 1000000000101110 : 995
  $8501 : 1000000010100001 : 70011
  $8601 : 1000000001100001 : 1998
  $8801 : 1000000000010001 : 9649169
  $8901 : 1000000010010001 : 67261
  $9201 : 1000000001001001 : 186988
  $9B01 : 1000000011011001 : 46
  $9D01 : 1000000010111001 : 3724825
  $9F41 : 1000001011111001 : 46
  $A401 : 1000000000100101 : 1867
  $A7C1 : 1000001111100101 : 116
  $AA01 : 1000000001010101 : 87
  $B9C1 : 1000001110011101 : 2119
  $BC01 : 1000000000111101 : 173
  $C501 : 1000000010100011 : 1986
  $C581 : 1000000110100011 : 45
  $DB01 : 1000000011011011 : 2213
  $DE81 : 1000000101111011 : 510
  $EBC1 : 1000001111010111 : 276
  $EEC1 : 1000001101110111 : 16634
  $EFC1 : 1000001111110111 : 178
  $F001 : 1000000000001111 : 6520
  $F301 : 1000000011001111 : 509
  $F341 : 1000001011001111 : 46
  $F5C1 : 1000001110101111 : 12130
Saw 55 unique 2-byte tokens.
CHANNEL 3:
Read 65536000 8-byte rows.
  $0000 : 0000000000000000 : 65536000
Saw 1 unique 2-byte tokens.

So we see that each of the four two-byte tokens is in fact drawn from a completely separate population, with the total number of unique tokens equalling the sum of the number of unique tokens within each of the four channels.  This is a good hint that we are indeed looking at four channels of data.

Note also that I have picked the byte order and bit order within the bytes quite arbitrarily -- and either or both might be wrong.

Channel 2 is particularly interesting, as there are six bits that have a fixed value, leaving exactly 10 bits that take differing values... and an HDMI/DVI data word is 10 bits long.

Now, looking at more of the file, I can see a general structure that repeats every 863 x 8 bytes. This is significant, because the raster lines in this video mode are 863 cycles in duration.  This is further circumstantial evidence that each 8 byte block corresponds to one cycle of time, and thus should contain the 3x10 bits of data plus 10 bits of clock. Hmmm... That adds up to 40 bits = 5 bytes, so within the six non-zero bytes we have one extra byte of something, presumably, or more if the clock is not recorded explicitly. The clock is interesting, because it should be 5 bits of 1 followed by 5 bits of 0, which should be easy to spot.

Looking further, I can see that there are 64 repetitions of:

aa cd 50 d5 01 9d 00 00

every raster line. That's the duration of the horizontal sync pulse, so we should expect at least the DVI channel 0 to have the HSYNC control word in it, i.e., one of those values that has lots of 1010101 in it.  Plus the clock should have 1111100000, so let's turn those bytes into binary, and see what we can see, recognising that we might need to reverse the byte and/or bit orders:

10101010 11001101 01010000 11010101 00000001 10011101

Well, that's reassuring that we can see plenty of 1010101 looking stuff in there, e.g., in the first and fourth bytes, in particular. What I can't see, though, is any way to have the 1111100000 clock pattern, so perhaps it isn't recorded.  This would make sense, as it would kind of be consumed during the decoding in the analyser, and implied.  That then leaves the mystery of why we have 48 bits being used to represent 3 x 10 bit values, when 30 would have been enough.

We then see 67 lots of:

55 0d 50 d5 01 88 00 00

01010101 00001101 01010000 11010101 00000001 10001000 

This looks to me like 12 bits are reserved for each of the 10-bit words, and that they are laid out quite naturally, as we can see by lining those up on top of each other:

10101010 11001101 01010000 11010101 00000001 10011101
01010101 00001101 01010000 11010101 00000001 10001000

We see the different control word in channel 0 in the left-most two bytes, while those in the other two channels remain unchanged after the end of the HSYNC period.  The values in the channel 0 control word are only valid if the bit order is reversed over the 10 bits, yielding:

1101010101 xx 0010101011 xx 0010101011 xx ????????????

I would like to be really sure that this pattern holds, which I can check by testing for any of the xx bits being non-zero anywhere in the file. If none are, then we can assume that they aren't used -- and indeed that holds true.  

So what on earth is in those last 2 bytes, then, if they contain only 2 bits of channel data?

It holds to a specific pattern, where only the last 8 bits actually change.  My guess is that it is some kind of checksum or CRC, and I think I am right: Remember those two 8 byte vectors we were looking at?

55 0d 50 d5 01 88 00 00
aa cd 50 d5 01 9d 00 00

0x55+0x0d+0x50+0xd5=0x187

0xaa+0xcd+0x50+0xd5=0x29c

If we assume that the fifth byte is simply required to have "000001" in its lower bits, then that leaves the last byte as a simple 8-bit checksum minus 0x100, with internal carry, i.e., the sum 0x29c becomes 0x19c becomes 0x9c + 1 = 0x9d.

I can try to verify that this algorithm is correct, by testing it on every single 8-byte vector. Well, it's almost right, in that it comes out to within 1 of the correct value every time. The internal carry might be wrong, perhaps, and there might be some other fancy adjustment being made.
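
A tiny test program makes it easy to check candidate rules like this over a whole capture. The sketch below tests a slightly simpler reading of the same idea -- that the sixth byte is just the low 8 bits of the sum of the first five bytes -- which happens to hold exactly for every example record quoted in this post:

#include <stdio.h>
#include <stdint.h>

int main(int argc, char **argv)
{
  if (argc != 2) { fprintf(stderr, "usage: %s file.cap\n", argv[0]); return 1; }
  FILE *f = fopen(argv[1], "rb");
  if (!f) { perror("fopen"); return 1; }

  uint8_t rec[8];
  long rows = 0, mismatches = 0;
  while (fread(rec, 1, 8, f) == 8) {
    rows++;
    /* Candidate rule: byte 5 (0-indexed) is the low byte of the sum of bytes 0-4. */
    uint8_t sum = (uint8_t)(rec[0] + rec[1] + rec[2] + rec[3] + rec[4]);
    if (sum != rec[5]) mismatches++;
  }
  fclose(f);
  printf("%ld records, %ld checksum mismatches\n", rows, mismatches);
  return 0;
}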

Anyway, I'm not going to get too worried about it, since it is pretty clear that it is some kind of checksum.  This all means that we have a pretty clear idea of the format of the file, and can move on to making a simple decoder for it.    

It's the weekend again, and I have made some more progress, by beginning to create a little program that accepts one of these capture files, and begins probing and testing the captured log.  The reason for this is to debug an issue with glitchy/absent DVI/HDMI video on some of the R3A production boards, which we know is due to manufacturing variation of the FPGA part itself, as previously discussed.

So my plan is to make a capture from an R3 board that has good video output on a given bitstream, and then use the same bitstream on an R3A board that doesn't show video in the same circumstances. I will then implement the various tests -- starting from the low-level signalling, which is where I suspect the problem will be, and working my way up.

One nuisance of the capture format of the HDMI analyser is that it doesn't include an indication of the clock frequency, so there isn't a hard and fast way to automatically recognise the video mode, or be sure that it is at the right pixel clock. To work around this, I have already implemented logic that figures out the HSYNC and VSYNC timing, and then compares that to the list of official HDTV modes, and computes a least-squares error to find out which mode it most probably is.  This allows for some slop in the mode implementation, which is not uncommon (the MEGA65 in PAL for example uses 863 cycles per raster instead of 864 cycles per raster, so that it is divisible by 3).
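
A minimal sketch of that matching step, under the assumption that the error is simply the sum of squared differences of the basic timing parameters (the mode table here is cut down to just the two modes that appear in the logs below; a real table would carry all the official modes):

#include <stdio.h>

struct video_mode {
  const char *name;
  int h_total, v_total, hsync_len, vsync_len;
};

/* Abbreviated mode table, with parameters taken from the logs below. */
static const struct video_mode modes[] = {
  { "720x576 50Hz non-interlaced", 864, 625, 64, 5 },
  { "720x480 60Hz interlaced",     858, 524, 62, 6 },
};

static long mode_error(const struct video_mode *m,
                       int h_total, int v_total, int hsync_len, int vsync_len)
{
  long e = 0;
  e += (long)(m->h_total - h_total) * (m->h_total - h_total);
  e += (long)(m->v_total - v_total) * (m->v_total - v_total);
  e += (long)(m->hsync_len - hsync_len) * (m->hsync_len - hsync_len);
  e += (long)(m->vsync_len - vsync_len) * (m->vsync_len - vsync_len);
  return e;
}

int main(void)
{
  /* Measured values as reported for the good R3 capture below. */
  int h_total = 863, v_total = 625, hsync_len = 64, vsync_len = 5;
  int n = (int)(sizeof(modes) / sizeof(modes[0]));
  long best = -1; int best_i = 0;
  for (int i = 0; i < n; i++) {
    long e = mode_error(&modes[i], h_total, v_total, hsync_len, vsync_len);
    if (best < 0 || e < best) { best = e; best_i = i; }
  }
  printf("Mode most closely matches %s (mode error = %ld)\n",
         modes[best_i].name, best);
  return 0;
}

Reassuringly, the numbers in the logs below are consistent with exactly this kind of measure: the good capture's "mode error = 1" is just the one-cycle h_total difference squared, and the bad capture's 274641 decomposes as 5^2 + 524^2 + 2^2 + 6^2 over the four reported differences.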

In trying to remember exactly how the DVI/HDMI SYNC marking works, I was reminded of this series of blogs by someone else who made a basic HDMI analyser from an FPGA:

https://warmcat.com/2015/10/20/hdmi-capture-and-analysis-fpga-project.html

While not directly applicable to what we are trying to do here (as we already have a box that grabs the HDMI/DVI signals), it is an interesting read through, and has a hint for something that I will try to fix on the MEGA65 video out -- the HSYNC and VSYNC pulses should be synchronised, which they are not currently on the MEGA65.  That can be fixed easily enough in the VHDL.

But back to the HDMI analyser software, so that we can compare good and bad output from different boards: I'll now implement the SYNC detection by tracking the SYNC values, and updating them when we see differing SYNC control word values on the blue channel.  

That now works, and I can see the video mode parameters being correctly inferred:

$ make hist && ./hist good-r3.cap
make: „hist“ ist bereits aktuell.
DEBUG: 459 unique pixel values
DEBUG: Read 13107200 records.
ERROR: 2603320 invalid DVI 10-bit words observed:
       10790x $013 (0000010011)
       22150x $043 (0001000011)
       25085x $20f (1000001111)
       45970x $270 (1001110000)
       40350x $2bc (1010111100)
       2458975x $2ec (1011101100)
DEBUG: Saw control word counts: 29122 144 15490 120
DEBUG: Saw most frequent lengths of: 14(x14013) 799(x96) 64(x14625) 64(x120)
DEBUG: HSYNC+ intervals = 863 863 863
DEBUG: HSYNC- intervals = 863 863 863
DEBUG: VSYNC+ intervals = 539375 539375 539375
DEBUG: VSYNC- intervals = 539375 539375 539375
INFO: Raster lines are 863 cycles long.
INFO: Frames 539375 cycles long.
INFO: Frames consist of 625 raster lines
DEBUG: Most frequent VSYNC low/high length = 65535(x24) 4315(x24)
DEBUG: VSYNC low -> 65535 / 863 = 75.94,  VSYNC high -> 4315 / 863 = 5.00
INFO: VSYNC pulse lasts 5 rasters, polarity is POSITIVE
INFO: File contains 24 frames (first is possibly partial, and a last partial frame may also be present)
DEBUG: Most frequent HSYNC low/high length = 799(x15166) 64(x15167)
INFO: HSYNC duration is 64 cycles
INFO: Mode most closely matches 720x576 50Hz non-interlaced (mode error = 1)
INFO: The mode differs from the expected mode in the following ways:
      h_total: saw 863, expected 864

You can also see in the 2nd-last INFO line that the video mode detection magic has correctly inferred the video mode, using that least-squares error algorithm I mentioned. The error is only 1, because the only divergence between the actual and model video modes is the duration of the raster lines differing by 1 cycle.  

The algorithm isn't perfect if a mode differs in multiple ways from a standard mode, but it should in most cases give a good indication, and it can act as a sanity check if you have fed in a different mode than expected -- something that the fancy HDMI analyser software was blind to, and would instead just complain about every divergence between what was seen and what was expected, without suggesting that you check that the source was set to the correct mode.

We can also see that there are some unexpected control words (the invalid 10-bit words listed in the ERROR lines above).  Interestingly, this is on a board that is producing a valid picture. So I need to figure out what is going on there: Are we producing nonsense codes systematically, or is my decoder doing something odd?  The fact that there are very large numbers of them suggests to me that it is some systematic problem. For example, am I incorrectly calculating the codes for the different pixel values in either the VHDL, or in this analyser program -- either could cause what we are seeing. Or is it a problem with the transmission of the codes over the digital serial interface that is DVI/HDMI?

In the end I solved that problem by finding a complete list of the codes from one of Hamster (Mike Field)'s great FPGA projects, and slurping those in.

With that in place, I now get no errors of bad TMDS code words, which is great. 

Even better, when I ran the revised version of the program over the capture from one of the boards that was showing glitchy video, it showed that the video mode was not the PAL video mode that it should have been:

$ make && ./hist bad-r3a.cap
make: Für das Ziel „all“ ist nichts zu tun.
DEBUG: 460 unique pixel values
DEBUG: Flip test passed.
DEBUG: Read 13107200 records.
DEBUG: Saw control word counts: 1938 14476 162 28505
DEBUG: Saw most frequent lengths of: 176(x637) 64(x12843) 799(x75) 14(x12358)
DEBUG: HSYNC+ intervals = 863 863 863
DEBUG: HSYNC- intervals = 863 863 863
DEBUG: VSYNC+ intervals = 1771 12900 4423
DEBUG: VSYNC- intervals = 2133 12694 4640
INFO: Raster lines are 863 cycles long.
ERROR: Frames are not of consistent length.
INFO: Frames consist of 0 raster lines
ERROR: Frames are not an integer number of rasters long.
DEBUG: Most frequent VSYNC low/high length = 176(x63) 504(x5)
DEBUG: VSYNC low -> 176 / 863 = 0.20,  VSYNC high -> 504 / 863 = 0.58
ERROR: VSYNC pulses don't seem to be a multiple of the raster length.
INFO: VSYNC pulse lasts 0 rasters, polarity is UNKNOWN
INFO: File contains 63 frames (first is possibly partial, and a last partial frame may also be present)
DEBUG: Most frequent HSYNC low/high length = 64(x13335) 799(x12467)
INFO: HSYNC duration is 64 cycles
INFO: Mode most closely matches 720x480 60Hz interlaced (mode error = 274641)
INFO: The mode differs from the expected mode in the following ways:
      h_total: saw 863, expected 858
      v_total: saw 0, expected 524
      hsync_len: saw 64, expected 62
      vsync_len: saw 0, expected 6

Basically it thinks the screen is zero raster lines long. Something very odd is going on. Well, we knew that, because we weren't seeing a picture. But now we are getting some info on what is going on.

This led me down a fruitful path of investigation that revealed that the RESET line was glitching, with the glitch coming from the MAX10 2nd FPGA on the MEGA65 main board.  Further looking revealed that the communication from the MAX10 to the main FPGA was basically getting totally confused.  The MAX10 is also responsible for reading the dip switches -- one of which toggles between DVI and HDMI.     

All that means that the HDMI encoder was thinking it had to switch between HDMI and DVI all the time, which was confusing it no end. Especially since this could happen in the middle of a pixel.

So now I have something concrete to fix.

For now, I will just disable the MAX10 communications, and see if I can't get something stable.

The HDMI test bitstream with MAX10 disconnected is stable, provided all the HDMI features are turned off, i.e., sending pure DVI signalling.  So I will turn on one of the packets, and then try to play spot the difference between the two.

So, in fact, I don't need to do that, because I have confirmed the glitching on the reset line is the entire problem.  It took a bit of fiddling to fix the glitching, because the MAX10 is using its internal oscillator which drifts with temperature to anywhere between 55MHz and 116MHz.  

Essentially the problem I had was that I was counting how long the incoming 41MHz clock from the Xilinx to the MAX10 FPGA on the communications path holds each level, to detect the sync (which is a long hold) among the clock pulses (short holds).  41MHz < 55MHz, so it should be fine, right? Nope, because each _half_ of the clock is effectively an 81MHz signal -- which is more than 55MHz, so if the MAX10's internal clock was running slow, it could sample the same half of the clock multiple times in a row, and thus mistakenly believe that it is in a sync pulse.  

To solve this, I halve the 41MHz signal in the MAX10 FPGA, and then check that. With a bit of fiddling of the constants, that had it fixed. I could have fixed it in the Xilinx FPGA, but then we would have had to modify our release bitstream and test it again, which I still really wanted to avoid -- and fortunately I was able to avoid it. In fact, I solved these last problems during the live stream I did on this topic:





Thursday, 10 February 2022

MEGA65 PCB Population

As we get closer to the first 400 units being sent out, we thought you might like to see how our manufacturing partner in Germany, Trenz Electronic GmbH, went about populating the PCBs of the MEGA65.

Trenz designed the MEGA65's PCB and are uniquely placed to do the manufacturing and assembly -- and as with the rest of the key parts of the MEGA65, these are all manufactured in the German-speaking parts of Europe.  The keyboards are also made in Germany by GMK, and the cases in Austria by Hintsteiner. As an example of this partnership, you can read about it on Hintsteiner's blog (in German, sorry, but Google Translate is pretty good these days).

It's worth stressing just how fantastic it is to work with these companies. They all place an extremely high priority on quality, and really take pride in their work. Their close proximity to the core MEGA65 team in Germany is also very helpful, and the lack of any language or cultural barrier also just makes work quicker, easier and more precise. Without their generous support, the MEGA65 would not be where it is today -- which is of course not very far off from being shipped out (more on that in a bit)!

In fact, it has turned out to be critical to the project that we have taken this quality-first approach, because this pesky pandemic has made it very hard for small projects to get the attention of the low-cost Chinese manufacturers, and even to get the components required to build boards.  We really feel for some of the other retro computing projects out there at the moment that are struggling to get the FPGAs and even components like resistors that are in very short supply right now.

But back to the PCB: the MEGA65's mainboard is a multilayer PCB with the big fat FPGA at the heart of the machine.  Just designing a board that can fan out many hundreds of pins from a chip to the various places on the board where they are required is non-trivial. Then to manufacture it, you have to be able to place the solder paste for the reflow precisely, and then place the components in their correct locations using a pick-and-place machine, often called a "chicken-picker" for the way that they alternately "peck" at the component rolls and mainboard. Here you can see the head over the mainboard:

These chicken-picker machines are quite large and expensive, especially the ones that are able to place hundreds or thousands of components per hour. The parts themselves come from tape reels, which automatically rotate as each component is taken, allowing the whole process to be driven automatically. You can see the reels on the left of the machine here, as the shuttle moves to collect the components there, before moving back to the right where the board is hidden behind the structure from this angle:


The chicken-picker machines also typically have a camera that helps them to visually check that the "chicken head" is in the absolute correct location, and that everything is going right, which you can see here dwelling on some components near the MEGA65 logo on the board:

 

You can see all of this in action in the following video that we have made with Trenz to show how the PCBs get processed in the chicken-picker:

Otherwise, behind the scenes we continue to work on a bunch of improvements to the MEGA65 core, including improving our already pretty good SID to sound even better, improving international character input with the keyboard, and lots of other little improvements of various kinds, as well as continuing to improve the BASIC65 ROM and KERNAL to support nice things like direct access to the SD card and D64 disk image mounting, and fixing 40 year old bugs in Commodore's DOS along the way. To join the fun in real-time, maybe join our Discord server.

Saturday, 1 January 2022

More work on HD floppies, RLL encoding, disk density auto-detection and other fun

[Gadzooks! This blog post has sat idle for ages while life and various other things happened.  But finally it is here!]

In the previous blog post, I described in perhaps excessive detail some of my adventures getting very high data density on normal 1.44MB floppies in the MEGA65's normal 1.44MB floppy drive.  We seem to be able to fairly easily get more than 2MB, which is pretty nice, but further boosts are most likely going to hinge on whether I can get RLL encoding to work, which in the best case will get us an extra 50% storage, thus allowing 3MB or so, i.e., more than a standard 2.88MB ED floppy.  But we have a number of barriers to address to get to that point:

1. We need to implement RLL encoding.

2. We need to fix some of the regressions with accessing DD disks that have been introduced, presumably because we are writing at the wrong density or on the wrong track.

3. Write pre-compensation is likely to be even more entertaining for RLL coding than for MFM, as we have to worry about six different peaks in the histogram being cleanly resolved from one another, compared with MFM's three.

I'll start with implementing the RLL encoder and decoder, as I am already part way there, and can't resynthesise until I get it mostly done, and in turn, I have remapped a few of the low-level floppy registers to make space for RLL encoder selection, which means that my test program is now out of step with the older bitstreams. So it's all a bit of a mess that I need to work through. So let's dig straight into it!

For RLL encoding, I am implementing an RLL encoder and decoder that are parallel to the MFM ones, and selection logic to decide which one is connected to the disk drive at any point in time.  That way we can easily switch between the two.

One of the challenges with the RLL encoder is that we have variable length bit groups that get encoded together, which can be 2, 3 or 4 bits long.  It is thus quite common for encoding to continue across byte boundaries. This is not a problem in itself, but it is a problem for the way that I have been writing the synchronisation marks, as in MFM it was possible to assume that byte boundaries would be respected, and thus the sync byte would get properly detected.  But in RLL this doesn't happen.

To solve this, I allow any partial bit pattern to run out when the next byte is a sync mark, basically pulling bits out of thin air to use to fill in the missing extra bits needed to complete the last symbol.  That, and fixing an error in my transcription of the RLL table into the VHDL, I now seem to be able to encode bytes and sync marks alike.  The next step is to make a decoder, and test it under simulation, to make sure we detect sector headers and data etc, as we should.

So now to make the decoder, which is also a bit interesting, because of the way the codes get emitted.  For the MFM coder, we could just look at the previous bit and the length of the incoming gap.  But for the RLL coder things are fiddlier.  For example, if we wrote the byte $42, i.e., %01000010, this would be emitted as:


010      000100
000      100100
10       0100 

So the total bit sequence will be:  0001001001000100

The trouble is that there could be either 100 or 1000 from the previous byte on the front, and 1, 01, 001, 0001 or 00001 at the end, depending on the next byte.

In any case, the first step is to build the gap quantiser, which converts the incoming pulses into a series of pulse lengths, i.e., the distance between each magnetic inversion (represented by a 1 in these sequences). So for the above, this should come out as 4, 3, 3, 4, 2 + something not yet known.
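
In software terms the quantiser amounts to very little; here is a sketch of the idea (the 24-tick time base is just illustrative, matching the divisor value that turns up later in this post):

#include <stdio.h>

/* Given the counter values at which magnetic inversions (1 bits) were seen,
   express each gap between successive inversions as a whole number of
   channel-bit times. */
void quantise_gaps(const long *pulse_times, int n, double time_base)
{
  for (int i = 1; i < n; i++) {
    long gap = pulse_times[i] - pulse_times[i - 1];
    int q = (int)(gap / time_base + 0.5);   /* round to nearest multiple */
    printf("%d ", q);
  }
  printf("\n");
}

int main(void)
{
  /* The $42 example above: with a time base of 24 ticks, the 1 bits of
     0001001001000100 (plus an assumed final 1 of the preceding byte at
     t = 0) fall at these counter values, giving gaps of 4, 3, 3 and 4. */
  long pulses[] = { 0, 4 * 24, 7 * 24, 10 * 24, 14 * 24 };
  quantise_gaps(pulses, 5, 24.0);           /* prints: 4 3 3 4 */
  return 0;
}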

I now have the gap quantiser working under simulation, and can simulate formatting a track, which results in plenty of predictable data to decode, and I can see that said data is being generated.  So now it's time to figure out how to decode RLL2,7 data from the series of gaps.

I'm thinking that the best way will be to turn the gaps back into a series of bits, and then chop the head off for any bit sequences that we can extract. When we hit a sync mark, we then just need to remember to discard all waiting bits, and discard the first two bits of the next gap as well, because of the trailing 00 at the end of the sync mark.

With the above approach, we need only a few IF statements in the VHDL to decode the bits, and some glue to make the bits emit one by one from the variable-length RLL bit blocks. So we should be ready to synthesise a bitstream, and see if we can see some RLL encoded data. Like this, for example:

 

Notice that there are more than 3 peaks, because with RLL2,7, we should have peaks at 2x, 3x, 4x, 5x, 6x, and 7x the time base.  The 6x and 7x are too long to show up here.  The noise at low rates might in fact be funny artefacts from the higher intervals wrapping around, or alternatively, that the drive electronics don't like the gaps being that long.  If we speed up the data rate from DD to the HD rate, then we see a much better result:

We have the peaks, and they are looking very nicely resolved, and without any funny noise at the low end.  Note that the 7x peak is really short -- this is because a gap of 7 zeroes requires fairly specific bit patterns, which are not that common. This is also a good thing, because drives are much happier with shorter intervals than that.

It is possible to increase the data rate some more, but I will need to add in write pre-compensation as well, because the two shortest interval peaks do start to bleed into one another, which is, of course, bad.  I'll have a look and a think about that, because it might end up being better to make a custom RLL code that merges the two shortest interval peaks, in order to be able to increase the data rate higher than would otherwise be possible.

Here is about the best data-rate that is possible right now, with a divisor of $18 = 24 = 1.6875MHz:


We can see the left-shift of the 3x pulse, just like we saw with the MFM encoding with the 1.5x pulse, which is also the 2nd in the sequence. To give an idea of how much better this RLL encoding is at this frequency compared with MFM, here is the MFM equivalent, with the best write pre-compensation I have been able to do:

I.e., MFM encoding at that data rate is utter rubbish, and is probably not usable below a divisor of $1E = 30.  So we are getting a divisor of 24 with RLL, which means 30/24 = 5/4 = 125% of the MFM capacity with this RLL encoding, assuming that it actually works, which I will soon be able to test.  

Interestingly, it looks like write pre-compensation will be rather unlikely to make much difference with the RLL encoding, at least on track 0. Shifting the 2nd peak to the right a bit might get us one divisor lower, to /23 instead of /24, but it feels like it's really diminishing returns. Trying other tracks, I can see the split peaks again, so implementing write pre-compensation might help to preserve higher density onto more tracks, and might well still be worth doing.

More generally, I'm also curious to know how well perfect write pre-compensation would be able to help.  To test that, I could write a track with a constant stream of equally spaced pulses, and see how wide the spread of the peak is from that, as that should only have the non-data-dependent timing jitter.

It's the weekend again, and I have had a bit of time to work on this all again, including another twitch stream (sorry, the video isn't up yet, I'll update this with a link, when it is), where I tracked down some problems with the RLL encoding: Specifically, I am facing a problem now where the CRC unit takes a few cycles to update the CRC, but the RLL encoder buffers 2 bytes instead of 1, which means that it is trying to inject the CRC before it has been fully calculated.  Also, earlier on, this double-byte buffering itself had some problems.

There is a way to calculate the CRC in a single cycle, but it needs a larger look-up table which would take up space in the FPGA, and we don't really need the speed, we'd just like it. All that I need to do is to make the FDC encoder wait for the CRC to finish calculating before feeding bytes.  I have a fix for that synthesising now.

Meanwhile, I am still curious about how the RLL2,7 code was generated, and whether we can generate RLL3,x or RLL2,x codes, with x larger than 7, that the floppy drive can still read, but that would allow for higher densities.  RLL3,x codes are particularly interesting here, because with an RLL3,x code, there would be at least three 0s between every 1, which means that we can -- in theory at least -- double the data rate vs MFM, although other effects will likely prevent us reaching that limit. But for that to be effective, we need a nice short RLL3,x table, built along the same lines as the very clever RLL2,7 tables.

What is interesting about the construction of the RLL2,7 table is that it is very efficient, because it uses unique prefixes to reduce the overall code length. Without that trick, it would require more like 3 ticks per bit, instead of 2 ticks per bit. I know, because I made a brute-force table generator that works out the numbers of combinations of codes of given lengths that satisfy a given RLLx,y rule.  RLL2,7 requires at least two 0s between every one, so the table is based around that, by looking at all possible codes of a given short length that satisfy that rule, and attaching two 0s to the end, to make sure that they are guaranteed to obey the RLL2,7 rule when joined together in any combination.  
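
For the curious, a brute-force counter along those lines only takes a few lines of C. This sketch (not the actual generator) counts, for each length, the patterns whose adjacent 1s are separated by at least two 0s, so that with two 0s appended they can be concatenated freely without breaking the minimum-run half of the RLL2,7 rule; the maximum-run side (no more than seven 0s) would still need to be checked across concatenations:

#include <stdio.h>

/* Return 1 if every pair of adjacent 1 bits in the low 'len' bits of
   'pattern' has at least 'min_zeros' zeros between them. */
static int min_gap_ok(unsigned pattern, int len, int min_zeros)
{
  int last_one = -1;
  for (int i = 0; i < len; i++) {
    if ((pattern >> i) & 1) {
      if (last_one >= 0 && (i - last_one - 1) < min_zeros) return 0;
      last_one = i;
    }
  }
  return 1;
}

int main(void)
{
  for (int len = 2; len <= 8; len++) {
    int count = 0;
    for (unsigned p = 0; p < (1u << len); p++)
      if (min_gap_ok(p, len, 2)) count++;
    printf("length %d: %d candidate codes\n", len, count);
  }
  return 0;
}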

So maybe we can do something like that for RLL3,10, which I think is probably about the sweet spot. We'll initially try for something that is only as efficient as the RLL2,7 code, but using units of 3 bits, instead of 2:

001000

010000

100000

That gets us 3 combinations using 6 bits, which is a fair start, since the idea is that we want to use no more than 3n bits. So let's add some more 3-bit units that can't be confused for any of those three above, of which there are the following six:

100010000

100001000

010001000

000100000

000010000

000001000

This still keeps us within RLL3,10, but only just, as two of those together can result in exactly 10 0s between adjacent 1s.

With 9 combinations, we could encode 3 bits using those, but we would like constant length encoding, like RLL2,7 has, because data-dependent encoding length is a real pain to work with.  But unfortunately the number of cases here doesn't work for that.  So let's start with a 5 bit prefix, and try to encode more bits:

10000000

01000000

00100000

00010000

00001000

10001000

Ok, so we have 6 combinations using 8 ticks.  To make it at least as good as RLL2,7, we need to use less than 3 ticks per bit on average, but to keep the length constant, we need to find a multiple of 3 that is a power of two if we do it this way... of which there are none. So I'll just have to think on all this a bit more. So let's go back to that bitstream...

With the new bitstream, we can now write a valid TIB and RLL2,7-formatted sector headers, because I can verify their correctness from the written data. However, it refuses to parse even the TIBs, let alone the RLL sectors.  My best guess is that the CRC feeding on the decoder side is also having problems, although it doesn't really make sense why the TIBs refuse to read. 

Also, digging through the source of sdcardio.vhdl and mfm_decoder.vhdl, the last sector seen under the head should still be updating, even if the CRCs were bad -- which they are not. It's almost as though the RDATA line from the floppy is not connected -- but it is, because I can display histograms, which means that the gap counting logic is reading the RDATA line.  The data rate for the TIB decoder is locked at divide by 81, the rate for DD disks, and to MFM encoding, so neither of those should be blocking TIB detection.  

So where is the signal getting to, and where is it getting stuck? We have some debug tools in this bitstream still to help. $D6AF returns mfm_quantised_gap from the mfm_decoder. It actually contains different information to the mfm_quantised gap, but that's okay. The main thing is that it is changing value continuously, showing that gaps are being detected and quantised.  Perhaps it's that we are not having the synchronisation marks passed through correctly? That could stop everything from working.  

The sync signal is selected from either the MFM or RLL27 decoder, and so it is possible with all this fiddling about that it has been messed up.  Although even that would be a bit odd, given that it simulates fine.  I'll modify the bitstream so that we can see if either the RLL and/or MFM synchronisation marks are detected -- just in case it is some funny problem with synthesis of the design.

And I might just have found the problem: the sensitivity list for the process in the VHDL file where the selection between RLL and MFM signals happens was missing all those signals, so it might well have all been optimised out during synthesis, which would result in exactly what we are seeing.

Nope, that wasn't it, either. So let's work out the commit where it stopped working, and figure out the differences from that.

Bitstream from commit 9071b54 works, but the bitstream from commit 2c32a9e  doesn't work. After digging through the differences, I have found the problem: The CRC of the Track Info Blocks is incorrectly calculated when reading them, and thus they aren't accepted.  The good thing is that this shows up under simulation, so can be fixed. Similar problems probably occur when hitting the sector header blocks, which is why sectors aren't being identified.  

Specifically, it looks like the first of the three sync marks is not being fed into the CRC calculation. This is because we assert crc_reset at the same time as crc_feed, when we encounter the first sync byte.  Now synthesising a fix for that, after confirming in simulation that it works.

Totally unrelated to the floppy, also included in this synthesis is support for switching between 6581 and 8580 SIDs, using the updated SID code from here. More on that in another blog post later, most likely.

Back to the RLL floppy handling: that fix has now got sector headers being detected, and I am able to cram up to 40 sectors per track in Amiga-style track-at-once at a divisor of $17 = 23, or 36 sectors per track if sector gaps are included. The peaks are still quite well resolved at that divisor, although the double-peaking that is symptomatic of the need for write pre-compensation is clearly visible:

[Gap histogram at divisor $17, no write pre-compensation, showing the double-peaking]

Note that this is without any write pre-compensation. It's also without testing whether we can actually read the sectors back reliably. So it might well be a fair appraisal of what will be reliable once I implement write pre-compensation for the RLL2,7 encoding.  For comparison, the best stable divisor for MFM encoding with write pre-compensation is around $1C = 28, which gets 30 sectors per track with gaps, or 33 without gaps.  So that's 40/33 ≈ 1.21, i.e., about a 21% improvement over MFM, before we implement write pre-compensation.  Not the full 50% (1.5x) that RLL2,7 should theoretically be able to obtain, if it is only the shortest gaps that are a problem, but still a welcome boost.  

Over a whole disk, we can probably count on averaging 34 sectors per track with gaps, or 38 without. This means that we can now contemplate disks with capacities of around (there's a quick check of the arithmetic just after the list):

With Gaps, 80 tracks = 34x80x2x512 = 2.72MB

With Gaps, 84 tracks = 2.86MB

Without Gaps, 80 tracks = 38x80x2x512 = 3.04MB

Without Gaps, 84 tracks = 3.19MB
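Those figures are just sectors x tracks x 2 sides x 512 bytes, quoted in the usual "floppy megabytes" of 1000x1024 bytes (the convention that makes 1440KiB a "1.44MB" disk), which is what the numbers above work out to. A quick throw-away check:

#include <stdio.h>

int main(void)
{
  /* Capacity = sectors/track x tracks x 2 sides x 512 bytes, in "floppy MB"
     of 1000 x 1024 bytes (the 1.44MB convention). */
  int spt[]    = {34, 34, 38, 38};   /* with gaps, with gaps, without, without */
  int tracks[] = {80, 84, 80, 84};
  for (int i = 0; i < 4; i++) {
    long bytes = (long)spt[i] * tracks[i] * 2 * 512;
    printf("%d spt x %d tracks = %.2f MB\n", spt[i], tracks[i],
           bytes / (1000.0 * 1024.0));
  }
  return 0;
}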

In short, we have reached the point where we are able to cram more data on an HD disk than an ED disk officially holds... provided that I can read the sectors back reliably.  

Which I can't yet. In fact, I can't read any of them back. I am suspecting that I have another CRC related problem, so will see if a CRC error shows up in simulation of reading a sector body. And, yes, the sector body CRCs are wrong.  Bug found and fixed, and synthesising afresh. Again.

And that has sectors reading! Now to tweak the floppytest program to format RLL disks correctly, and to test reading all of the sectors, and see how far we can safely dial things up.  That showed a bug in the TIB extraction, where the sector count was being read as $00 always, which I am now fixing -- and is now fixed. 

Now it's time to think about write pre-compensation, so that I can maximise the reliable data rate: right now, when reading sectors, a data rate divisor of 25 (= 1.62MHz) is not quite reliable on track 0, and certainly falls apart by about track 35 or so.

As part of this, I might follow up on something that occurred to me on the ride to work this morning:  My PLL that tracks the floppy data clock when reading from real disks is probably sub-optimal: it synchronises exactly when it sees a pulse, rather than working out whether the pulse is a bit early or late versus the expected time, and taking that into account. This will be causing gaps to seem too large or too small when the magnetic phenomena occur that push short gaps together and spread out longer gaps between them. If I can fix this up, it should sharpen the peaks in the histograms quite a bit, and allow higher data rates, which would be nice.

So I might take a brief excursion into analysing the behaviour of the gaps and think about PLL algorithms that will not cause this double-addition of gap variance, but rather, ideally, be able to cancel some of it out.  The first step is to take a raw track capture, and take a look at the gaps in it, and try simulating various PLL algorithms on it, to see what we can do. 

Part of why I think this has considerable potential is that the histogram peaks for gaps are really quite tight until the data rate is increased beyond a certain point, after which they start getting quite a bit wider. I think that is a direct result of measuring each gap from the time of the previous pulse, which will of course tend to be shifted at the higher data rates, but not at the lower data rates, because the shifting only happens in any noticeable amount above some threshold data rate.

The trick is to make a simple algorithm for tracking the pulses and error between expected and actual time of appearance.  And the methods need to be simple enough for me to fairly easily implement.  

Some low-hanging fruit is to look at the error on the previous pulse interval, and assume that all (or half) of that error is due to the most recent pulse being displaced. We then adjust the nominal start time of the next interval to the modelled arrival time of the pulse, had it not been displaced in the time domain.  This will stop pulses that are squashed together from having both ends of the squish deducted from their apparent duration.  That is, we carry forward the error in arrival time, so that we can compensate for it. The trick then is that we need to allow for true error, i.e., drift in the phase. We can do this by checking how often a pulse arrives late versus early: whenever there are too many lates or earlies, we adjust the phase by one tick in the appropriate direction. In other words, we track the average frequency over time, and attempt to correct individual pulse intervals based on recent deviations from the average frequency.
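Here is a rough C model of that idea (a sketch of the concept only, not the actual implementation):

/* Rough model of the carry-forward idea: measure each interval from where
   the PREVIOUS pulse was expected rather than where it actually arrived, so
   a displaced pulse is not charged against both of its neighbouring gaps.
   Persistently late or early pulses nudge the phase by one tick, so genuine
   drift is still tracked. */
typedef struct {
  int cell;    /* nominal ticks per cell at the current data rate          */
  int carry;   /* arrival-time error carried over from the previous pulse  */
  int late, early;
} pll_t;

int pll_quantise(pll_t *p, int raw_ticks)
{
  int t = raw_ticks + p->carry;             /* measure from the expected time */
  int cells = (t + p->cell / 2) / p->cell;  /* round to whole cells           */
  if (cells < 1) cells = 1;
  int err = t - cells * p->cell;            /* how displaced this pulse was   */
  p->carry = err;                           /* carry it into the next gap     */

  /* Allow for true drift: too many lates or earlies shifts the phase a tick. */
  if (err > 0) p->late++; else if (err < 0) p->early++;
  if (p->late  > p->early + 8) { p->carry--; p->late = p->early = 0; }
  if (p->early > p->late  + 8) { p->carry++; p->late = p->early = 0; }

  return cells;                             /* quantised gap length in cells  */
}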

Initial testing on the captured RLL encoded track data suggests that it should help. In particular, it seems to be reducing the maximum magnitude of the error in arrival time, which is exactly what we need if we want to narrow those peaks.  This should operate nicely in addition to any write pre-compensation, because it is all about fixing the residual error that the write pre-compensation didn't fix.  

In the process of all this, it is clear that these ALPS drives keep the motor speed pretty accurate, probably to within +/-1% of the nominal speed, which encourages me that we should be able to get higher data rates out of them, once I have implemented the improved PLL (which will help both MFM and RLL), and then implemented write pre-compensation for the RLL encoding.

I also realised that the encoder for both MFM and RLL is calculating the data rate differently during the read and write phases: During reading, it is exactly correct, but during writing I am counting from the divisor down to 0, i.e., each interval is one cycle longer than it should be.  This will be causing problems during decoding at higher data rates, because all the bits will be slightly shifted in time, and errors will be more likely.  This is quite easy to fix, and should help (a little) with the maximum achievable data density as well.
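The off-by-one is the classic down-counter mistake. In C terms it is the difference between these two loops (just an illustration of the counting error, not the actual VHDL):

/* Counting from DIVISOR down to 0 inclusive spends DIVISOR+1 cycles per
   interval -- one cycle too many: */
int ticks_buggy(int divisor)
{
  int n = 0;
  for (int counter = divisor; counter >= 0; counter--) n++;
  return n;            /* divisor + 1 */
}

/* Reloading when the counter reaches 1 (equivalently, starting from
   divisor-1) gives exactly DIVISOR cycles, matching the read side: */
int ticks_fixed(int divisor)
{
  int n = 0;
  for (int counter = divisor; counter >= 1; counter--) n++;
  return n;            /* divisor */
}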

I think the next step is probably for me to find a way to simulate reading from a real track of data, so that I can work on the PLL improvements and check that they are really functioning.

I have in the meantime been working on the PLL algorithm some more, and now have it checking against the last 8 pulses: if the current pulse is exactly timed to an integer number of gaps (+/- one clock cycle) relative to them, those count as votes, and the gap size that gets the most votes is accepted as exact. That squelches a large fraction of the errors.  Where this is not possible, I take the average of the gap as base-lined against those last 8 pulses, which tends to help reduce the error a little -- but rarely completely.  This will all be very easy to implement in hardware, as it is just addition, subtraction and 3 right shifts.
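In very rough C terms -- and with the details here being a sketch of the idea rather than the real decoder code -- the voting step looks something like this:

/* t[] holds the arrival times (in clock ticks) of the last 8 pulses, oldest
   first; span[] holds the already-accepted length, in cells, from each of
   those pulses to the most recent one; cell is the cell length in ticks. */
enum { HISTORY = 8 };

int vote_gap(const long t[HISTORY], const int span[HISTORY],
             long t_now, int cell)
{
  int votes[16] = {0};
  int sum = 0;

  for (int i = 0; i < HISTORY; i++) {
    long d     = t_now - t[i];                 /* distance back to old pulse */
    long cells = (d + cell / 2) / cell;        /* nearest whole cell count   */
    long err   = d - cells * cell;
    int  gap   = (int)cells - span[i];         /* implied current gap        */
    if (gap < 1)  gap = 1;
    if (gap > 15) gap = 15;
    if (err >= -1 && err <= 1)                 /* exactly timed -> a vote    */
      votes[gap]++;
    sum += gap;
  }

  int best = 1, best_votes = 0;
  for (int g = 1; g < 16; g++)
    if (votes[g] > best_votes) { best_votes = votes[g]; best = g; }
  if (best_votes > 0)
    return best;                               /* majority wins              */

  return sum >> 3;                             /* else average of 8 guesses  */
  /* (that >>3, plus the additions and subtractions above, is why this is
     cheap to do in hardware) */
}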

The other thing I have noticed, is that the shortest gaps almost always get stretched, rather than compressed for some reason.  The longer gaps tend to have much less error on them, when compared with the shortest n=2 gaps.  So it probably makes sense to allow a bit more slop on the shortest gaps, and in return slightly narrow the lower bound for n=3 gaps.  Maybe allow up to 2.625 instead of 2.5 gaps to be counted as a n=2.

This is all with non write pre-compensated RLL2,7 data.  Some of these residual errors can likely be reduced if I implement the write pre-compensation.  But this "read post-compensation" certainly seems to be much more robust than my previous naive approach, and I'm hopeful that it will allow higher data rates -- which I can test by writing such a track, and then testing against it.  

I can also use this test program against MFM data, so I can test my theory about write pre-compensation working well with it. But that will have to wait until the weekend, as will testing higher data rates -- which I am excited to see if it will handle.  If I go much faster, the resolution of my captures at a sample rate of 13.3MHz will actually start to be the limiting factor, which is pretty cool.

Another week has gone by, and I have write pre-compensation working for RLL2,7, which has helped, but I haven't gotten around to implementing the read post-compensation stuff.  I'm going to defer that for now, as I want to just close-out the RLL and variable data rate and Track Info Block stuff, and merge it back into the development branch.

So first up, the good news is that write pre-compensation with RLL2,7 encoding works quite nicely. At the really high data rates, it looks like having the write pre-compensation dialled up high for the 2nd coefficient works better. That's the case where the difference in gaps is the greatest, and it makes sense that it works that way. But making sense is not something I take for granted with floppy magnetics.

So let's take a look at the impact of write pre-compensation with RLL2,7 at high data rates.  First, let's see what divisor $17 and $15 look like without any write pre-comp:

Basically we can see that divisor $17 = 23 (40.5MHz/23 ≈ 1.76MHz) is a bit marginal, while divisor $15 = 21 (≈ 1.93MHz) is basically rubbish. So let's now turn on some write pre-compensation:

That cleans up divisor $17 quite nicely, and that's probably quite usable like that. If we then move to divisor $15, i.e., the faster data rate = higher data density, then it is kind of okay, but the clear area between the peaks is noticeably narrowed:

So that's probably not usable.  However, when we increase the 2nd pre-comp coefficient from 5 = ~125ns to 8 = ~200ns, then it cleans up quite a bit, to the point where it looks probably about as good as divisor $17 did with the more conservative pre-comp values:


These pre-comp values seem to be fine for HD and faster data rates (divisors $28 and smaller). At slower data rates (larger divisors), they spread the peaks rather than sharpen them, but at those rates the distance between the gaps is pretty extreme anyway.
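For anyone wanting to sanity-check the divisor and pre-comp numbers being thrown around here, they all fall straight out of the 40.5MHz clock. A quick throw-away C check (not part of any of the real tools):

#include <stdio.h>

int main(void)
{
  const double clock_mhz = 40.5;
  int divisors[] = {0x17, 0x15, 0x51};        /* 23, 21, and the DD rate 81 */
  for (int i = 0; i < 3; i++)
    printf("divisor $%02X = %2d -> %.2f MHz\n", divisors[i], divisors[i],
           clock_mhz / divisors[i]);

  int precomp[] = {5, 8};                     /* pre-comp coefficients      */
  for (int i = 0; i < 2; i++)
    printf("pre-comp %d cycles -> %.0f ns\n", precomp[i],
           precomp[i] * 1000.0 / clock_mhz);
  return 0;
}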

Well, time is really flying at the moment, for a variety of reasons. It's now a couple of weeks later, and I have done another couple of live streams: https://www.youtube.com/watch?v=MOEPXSAW08g and https://www.youtube.com/watch?v=n5sfdv7K8Zw

To summarise the progress in those streams, I tried a pile of floppy drives I have here -- including a 2.88MB ED drive (admittedly in HD mode, because I have no ED disks or ED-drilled HD disks).  Of the drives that worked (a bunch were duds), almost all could read the same data densities, and had similar performance for RLL encoding as the ALPS drive in my MEGA65, thus giving me confidence that this crazy high-capacity RLL storage encoding is likely to work on people's machines -- certainly the production machines, which will have ALPS drives, but also the vast majority of the 100 DevKits that have random PC floppy drives in them.

One discovery during this process was that one of several Panasonic drives from my pile was a lot better than the rest, and seemed to be reliable at data rates around 10% or more faster than the other drives could handle. It will be interesting to investigate this later on to find out why. 

But back to finishing the RLL HD stuff, I then started testing behaviour at DD as well as HD, to check for regressions. And there is some problem where sectors are not being written back properly.  To debug this, I really need to be able to read whole tracks of data in an exact manner.  

This is a bit tricky, because the pulses we have to log can occur at >2MHz, so doing this in a tight loop is a bit of a pain.  So what I have decided to do is to make a mode for the MEGA65's DMA controller that reads the raw flux inversion intervals and writes them to memory.  This has the added benefit that it will make it quite easy to write software for the MEGA65 that can read arbitrary other disk formats -- like Amiga disks, for example.

This _should_ be quite simple to implement, but it keeps causing me problems for some reason. I have it counting the number of 40.5MHz clock cycles between each inversion.  But my MFM/RLL decoder test program is failing to find the 3x SYNC sequences.  The Track Info Block is always written at divisor $51 = 81 decimal, and thus we should see gaps approximately 160, 160, 120 and 160 = $A0, $A0, $78, $A0 long for the TIB's SYNC marks.  The start of a captured track that is written using RLL and a divisor of $16 looks like:


00000000: 00 ef 6f 83 6d 6d 85 6a 6f 82 6c 6f 83 6c 70 80    @ooCmmEjoBloClp@
00000010: 6f 6b f0 70 82 ff 6a 6e 84 6e 6d 4e 37 84 a1 a1    okppB~jnDnmN7Daa
00000020: a1 a2 ff a0 a2 ff a2 a1 a2 a0 a2 a0 a2 a0 a2 a2    ab~`b~bab`b`b`bb
00000030: a2 9e a3 9f a3 a0 a3 9f a3 9f a3 a0 a2 a0 ff a3    b^c_c`c_c_c`b`~c
00000040: a0 a2 ff a2 a0 a3 9f a3 9f a3 a0 a2 a0 a2 a0 a2    `b~b`c_c_c`b`b`b
00000050: a0 a3 a0 a2 9f a3 a0 a2 9f a4 9f a2 a0 a3 9f a3    `c`b_c`b_d_b`c_c
00000060: 9f ff a0 a2 a0 a2 a1 a1 a1 a2 ff a0 a2 a0 a2 a0    _~`b`baaab~`b`b`
00000070: ff a1 a0 a3 a0 ff 9f a2 a1 a2 9f a3 9c f7 ff f0    ~a`c`~_bab_c\w~p
00000080: ff f8 91 ff ed ff 94 ff f0 ff f0 ff 98 f8 ef ff    ~xQ~m~T~p~p~Xxo~
00000090: f8 98 a3 a0 a2 9f a3 a0 a2 9c f7 ff 97 f7 f6 f4    xXc`b_c`b\w~Wwvt
000000a0: 9b a0 9f f8 ff a2 a0 a2 9c ff 98 9e ff f9 93 ff    [`_x~b`b\~X^~yS~
000000b0: ff d8 9b ff 97 fc 9b a0 a2 ff ff a1 a1 a2 a0 a2    ~X[~W|[`b~~aab`b
000000c0: 68 47 41 41 c5 41 82 41 42 40 42 3f 42 f3 6d 84    hGAAEABAB@B?BsmD
000000d0: 6b 6c 43 b1 3f 40 b1 40 f3 40 41 58 56 6c 43 40    klCq?@q@s@AXVlC@
000000e0: 41 41 41 41 41 41 41 41 41 41 3f 41 73 42 3d 41    AAAAAAAAAA?AsB=A
000000f0: b1 c5 44 54 6d f3 57 96 6e 6d 84 6b ff 6d 85 6b    qEDTmsWVnmDk~mEk

The first 12 lines show values around $A0, which is suspiciously close to the long gap (2x the nominal signal rate) in MFM. In the last four lines it switches to much more variable values, with a floor near $40, i.e., around $16 x 3 = $42, which correlates with the switch to RLL at that divisor.
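A little helper like this (hypothetical -- not part of the real mfm-decode.c) shows how those captured bytes map onto cell counts:

#include <stdio.h>

/* Hypothetical helper: express a captured interval byte as a multiple of the
   cell length for a given divisor. */
double cells(unsigned char interval, int divisor)
{
  return (double)interval / divisor;
}

int main(void)
{
  printf("$A0 at divisor $51 = %.2f cells\n", cells(0xA0, 0x51)); /* ~2 = MFM long gap       */
  printf("$42 at divisor $16 = %.2f cells\n", cells(0x42, 0x16)); /* ~3 = shortest RLL2,7 gap */
  return 0;
}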

So why can't we see our MFM sync marks in the first part?  I know that they must be there, because the floppytest.c program is able to read them.  

Fiddling about, I am seeing a funny thing, though, where I can't reliably read data from the drive until I have stepped the head.  It can even be stepped backwards from track 0. But without it, I don't reliably read the track: I can still see all the peaks for the gaps, but something is wonky, and the MEGA65's floppy controller doesn't see the sectors.  

Ah, found the problem: the histogram display of the tracks that I was using doesn't enable automatically setting the floppy controller's data rate to the speed indicated by the TIB.  With that enabled, the sector headers get found fine.  But that's on the running machine, not on my captured data, which is still refusing to read correctly -- and which, being raw flux, doesn't care what the floppy controller thinks anyway.

Okay, I think I know what the problem is: The MFM gaps are effectively 2x to 4x the divisor long, not 1x to 2x, because of the way that MFM is described. Thus at a divisor of $51 = 81, this means that the gaps will be between 81x2 = 162 and 81x4 = 324.  But 324 is >255, so we can't reliably pick them out with our 8-bit values.  To fix that, I need to shift the gap durations one bit to the right, which I am synthesising now.  Hopefully that is the problem there.
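So the capture path needs something like this (a sketch of the idea in C, not the actual VHDL):

#include <stdint.h>

/* At divisor $51 = 81, MFM gaps run from 2 x 81 = 162 up to 4 x 81 = 324
   ticks, and 324 doesn't fit in a byte.  Halving the count before logging it
   keeps the longest DD gap at 162, comfortably within 8 bits, at the cost of
   one bit of resolution. */
uint8_t log_gap(unsigned ticks)
{
  unsigned halved = ticks >> 1;       /* the "shift right one bit" */
  return (halved > 255) ? 255 : (uint8_t)halved;
}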

Meanwhile, while that is synthesising, I also looked at adding support for DMA jobs >64KB long, since there can be more than 64K flux transitions per track. This will be needed for reading HD Amiga disks, for example, as well as for letting me debug whole tracks of data. 

The divide by two on the counts has helped, in that I can now see some sync marks, but there are still things being borked up, such that I don't see all 3 sync marks in a row. This is what it looks like now:

00000000: 00 81 48 80 48 a9 49 c6 51 50 50 51 50 51 a2 50    @AH@HiIFQPPQPQbP
00000010: 51 50 51 50 a2 50 51 50 50 51 50 51 50 51 50 51    QPQPbPQPPQPQPQPQ
00000020: 50 51 a1 a2 50 51 50 51 50 50 51 51 50 51 4f 51    PQabPQPQPPQQPQOQ
00000030: 50 51 50 51 50 51 50 51 50 51 50 51 50 51 50 51    PQPQPQPQPQPQPQPQ
00000040: 4f 51 50 51 50 51 50 52 4f 51 50 51 50 51 50 51    OQPQPQPROQPQPQPQ
00000050: 50 a1 52 4f 51 50 51 4f 51 50 51 4f 51 50 51 50    PaROQPQOQPQOQPQP
00000060: a1 52 4f 51 50 51 4d 1f 77 a4 7c 49 a8 75 a5 7a    aROQPQM_wd|Ihuez
00000070: 4a a7 76 a4 77 a5 4b 7e 76 a4 7c 4c 51 4f 51 50    JgvdweK~vd|LQOQP
00000080: 51 4d 7d a2 7d 47 7c 21 4a 51 a2 4e 80 4b 52 4c    QM}b}G|!JQbN@KRL
00000090: a8 4c f6 4e ce 49 7d a3 79 7b 4c 4f 81 4a 51 51    hLvNNI}cy{LOAJQQ
000000a0: 51 50 51 50 51 50 51 50 51 a2 50 51 50 51 50 51    QPQPQPQPQbPQPQPQ
000000b0: 50 52 4f 51 50 51 4f 52 50 51 4d 7f 79 7b 4c 4e    PROQPQORPQMy{LN
000000c0: 7d 7a 78 7c 4c 4d a5 a5 1b 7b 48 a8 76 a4 7c 48    }zx|LMee[{Hhvd|H
000000d0: a8 75 a5 7b 4d 50 50 50 a2 4e 7f 4d 50 51 50 51    hue{MPPPbNMPQPQ
000000e0: 50 51 f1 52 51 50 4d 7a 4c 51 51 50 cd 7c 4b 51    PQqRQPMzLQQPM|KQ
000000f0: 50 4f 7b a6 4d 50 50 50 4c a6 a1 a6 4b 51 51 4d    PO{fMPPPLfafKQQM

All the values are halved from before, as mentioned. So we now see lots of ~$51 values, which we expect. There should be 12 bytes x 8 bits = 96 = $60 of them before we write the sync marks.  The sync marks should look like $A0 $78 $A0 $78, or thereabouts, and we can see some things that look like this in the above, marked in bold (note that this is from a DD formatted track, so we continue to see similar numbers below).

The underlined part looks to me like it _wanted_ to be a sync mark, but the first gap got chopped up into little pieces. So something is still really rogered up.

Weirdly, if I instead read the $D6A0 debug register, which lets me read the same RDATA line from the floppy drive that we are using to calculate these gaps, then the sync marks get properly detected.  What I am fairly consistently seeing is that we get a sequence something like a8 76 1f 49 ef, instead of the A0 78 A0 78 50 type sequence (the extra 50 at the end is a single gap after each sync that gets written).  So something is getting really confused, in that the gap boundaries are being read at the wrong locations. And it's the last 3 of those that do it:

A0 78 50 versus

20 50 F0

But as I say, if I read _the same signal_ via a different register, I am getting a different result.

I'm just going to have to sleep on this, I think.

Problem found and fixed!  It was a cross-domain clocking problem. Which is fancy FPGA programmer speak for "the signal I was reading was produced by something on a different clock, and if you don't treat it nicely enough, you get all manner of weird glitching".  The read signal from the floppy is not generated using the same clock as the MEGA65's CPU, and thus this problem happens. The solution is to add extra flip-flops to latch the signal more cleanly.  In reality cross-domain clocking is _much_ more complicated than I have described here, but you get the general idea.

So I can now happily spot the sync marks again, decode the TIB, etc. But I am getting CRC errors reported in the TIB among other places, and it looks like we are skipping bytes, which I think is related.  I was fiddling about with fancy post-read timing corrections in my mfm-decode.c program, so it's possible that I have just messed that up.

Certainly something in the decoder is messed up, because where it misses bytes, it is indeed processing lots of gaps, more than it should.  I'll have to see how that is happening.  Found the problem: I had an error in my RLL decoding table.  With that fixed, I can now read the saved data, and all the sectors are being returned with no CRC errors.  

With a 64KB transfer, it is returning 7 sectors worth of data.  This is why I added the >64KB DMA job support.  With 128KB, we will be able to read an entire DD track, which will let me debug exactly what I am botching when writing a sector. Thinking about it more, I can actually do a longer fetch, if I do it into the 8MB HyperRAM, ensuring that even at the highest data rates, I can capture a whole track. 

So now I should be able to format a disk at normal DD 800KB, and then try saving something to it, and seeing how it gets messed up at that point.  My best guess is that the data rate or encoding mode will be incorrectly set for some reason.

Formatting from the latest patched ROM seems to be using the new formatting routines, but sets the track number to 39 in the Track Info Block written for every track, which then causes problems reading back saved files -- although the directory and BAM update fine.  So it is possible that an update to the ROM will fix that.  I might even be able to find the part of the ROM where it does the format and patch it myself.

In the meantime, I am formatting the disks myself using the floppytest.c program, which does set up the TIB data correctly (well, after I fixed a problem with the sectors-per-track value).  However, oddly, reading sectors from the 2nd side of the disk is having problems: the sector headers are visible, but the read hangs forever. 

Fortunately I can use my new track reader code to read that whole track, and check if I think the data is valid or not. It is showing incorrect data in the sectors, and failing sector body CRCs. I thought I would compare it with the first side of the disk, which reads fine, but that is also showing similar errors.  So there is something weird going on here.

Hmm... If I use the ALPS drive in my MEGA65, it has this problem, but if I use the nice Panasonic drive, it doesn't. It _looks_ like there are extra magnetic inversions being inserted where they shouldn't be during writing.  Reading doesn't seem to be affected.  I had added an extra write path for the floppy under CPU DMA control, so I am reversing that out, in case it's the problem. But the problem is manifesting even with older bitstreams. So I am wondering if I haven't half-fried one of the level converters for the floppy interface on my MEGA65, or something odd like that.  I'll wait for that bitstream to build, and maybe see if I can get a DevKit owner or two to try it out for me as well, and work out my next steps from there.

Overnight it occurred to me that there is a simple test I could do, to work out if it is the writing side or the reading side that is the problem:  Write the disk using the ALPS, and then try to read it using the nice Panasonic drive, and vice-versa, if required.  If the ALPS can't read what the ALPS wrote, but the Panasonic _can_ read what the ALPS wrote, then the write was fine, and the difference lies on the _read_ side of the two drives.

And the result is that the Panasonic _can_ read the ALPS written disk that the ALPS drive is refusing to read, although after about track 35 it is having some trouble, but that could just be that some of those tracks on the disk are a bit worn from my relentless testing of it.  

But anyway, this says that whatever the weird problem that has crept in is, it has something to do with the read behaviour. Maybe the RDATA pulses are longer or shorter on the different drives, for example.  Or maybe the voltages of them are different.  The good thing is that those are all quite testable things, as I can use the oscilloscope to probe both, and get to the bottom of this annoying little mystery.

Right, so the plot thickens: I just switched floppy cables to one with dual connectors for drives A: and B: on PCs, as it is convenient to poke oscilloscope leads into the 5.25" drive connector on it. But now it can't read _anything_, even with the good drive.  The cable is a bit longer. Maybe the issue is something with the level converter, and the extra length of cable is just enough to cause it grief. BUT that longer cable _does_ work with the ALPS drive (although still refusing to read, I can see the gap histogram, which I couldn't with the Panasonic on the same cable). And now after fiddling things again, the long cable is working with the Panasonic drive.  Maybe something was just loose.

So now to get the oscilloscope hooked up...

Both drives produce similar looking pulse shapes for the RDATA line. The main difference I can see is that the good Panasonic drive seems to have much more stable timing, perhaps suggesting that it just has better motor speed control.

Next step is to record what we write to a track as raw flux, and then compare that back with what we have written, and try to spot the kinds of errors that are occurring.

Advance two months...

Well, the rest of the year flew by pretty quickly, and it's now almost New Year's Eve, and I have finally had the chance to get back to this.

First up, I have found that there was a problem with writing after all: I had tried writing at the start of an index hole mark, rather than the end, and the floppy drive was considering this illegal, and not writing anything to the track at all. Thus I was reading back whatever was already on the track, not the fresh data.

I've fixed that, but I am still seeing very short gaps on the odd occasion.  The TIB is also not being written to the track at all, but we will deal with that separately.  My best guess is that the very fast FPGA is seeing glitching on the RDATA line from the floppy, which is causing some long gaps to get broken up by mistake.  If this theory is correct, then de-glitching the RDATA line by requiring it to go low for several CPU cycles will fix it. If that doesn't fix it, then the short gaps are most likely being created during the writing process.

According to floppy drive data sheets, the RDATA pulses should be >= 0.15usec = ~6 cycles at 40.5MHz.  If we require RDATA to go low for 4 cycles, that should be a safe amount, and still squelch any glitching from cross-domain or other effects.
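In C-model terms (just to show the idea -- the real logic is a few lines of VHDL), the de-glitcher is nothing more than a small counter:

/* Rough model of the de-glitching idea: only accept a new RDATA pulse once
   the line has been sampled low for 4 consecutive 40.5MHz cycles --
   comfortably less than the ~6-cycle minimum pulse width the data sheets
   promise, but enough to squelch single-cycle glitches. */
typedef struct { int low_count; int pulse_seen; } deglitch_t;

int deglitch_sample(deglitch_t *d, int rdata_level)
{
  if (rdata_level == 0) {
    if (d->low_count < 4) d->low_count++;
    if (d->low_count == 4 && !d->pulse_seen) {
      d->pulse_seen = 1;
      return 1;                    /* report one clean pulse            */
    }
  } else {
    d->low_count = 0;
    d->pulse_seen = 0;             /* re-arm once the line returns high */
  }
  return 0;
}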

Okay, that has been done, and the problem persists, and I just checked with a Known Good Disk, and it doesn't exhibit the short gaps, so I am still suspecting that it is a write-based problem. So I have instrumented the VHDL to keep track of the actual WDATA interval time so that we can see if we are producing the problem ourselves.

The only other difference I can see is that the problem is happening on HD disks, but not on DD disks. That could just be coincidence, or it could be that the DD disks (or drives in DD mode) simply don't have the flux resolution to write the glitches. So time for another synthesis run...

That's done, and I can now verify that no F_WDATA pulse interval is shorter than 80 clock cycles at the DD 720K data rate -- which is correct. If I tune the data rate up and down, then the minimum gap length adjusts accordingly, so I am confident that I am reading it correctly.

... and yet I am still seeing short gaps being written.  Is it actually an artefact of writing DD data on an HD disk?  If so, then writing at HD rate should remove the problem. Or writing to a DD disk.  So, writing at HD data rate seems to work without any of these short gaps, adding some support to my theory.... and formatting a DD disk at DD data rate also works just fine, too (although some of the Track Info Blocks are not able to be read back).

So there _is_ a problem with writing to HD disks at DD data rate. I have since confirmed this 100% by covering the hole on an HD disk that refuses to make a nice format at DD rate without the hole covered, and it suddenly reads back reliably. This affirms my theory that the drive electronics is being switched to a different read recovery mode between DD and HD, and that this is messing things up at DD data rates.

Speaking with others from the team on discord, we have come up with a sequence of events that results in corrupted DD disks. 

DD DISK OR FAKE DD DISK USED

1. FORMAT in floppytest (option 3)
2. READ TEST (all green) in floppy test (option 2)
3. RESET with disk still inserted
4. HEADER"TEST",IDK
5. DIR
6. NEW
7. 0?
8. SAVE"TEST"
9. DIR     -looking good
10. SAVE"TEST2"
11. DIR      -this time still looking good
12. SAVE"TEST3"
13. DIR      -this time still looking good
14. RESET
15. DIR      -this time still looking good
16. NEW
17. 0?
18. SAVE"TEST4"
19. DIR      -this time still looking good
20. POWER OFF
21. REMOVE AND REINSERT DISK
22. POWER ON
23. DIR        -this time still looking good
24. NEW
25. 0?
26. SAVE"TEST5"    -this time rattling noise
27. DIR            -broken, mostly zeroed directory:

   0 "           "
   1 "
   0
   0
8704 DEL
3155 BLOCKS FREE

28. SAVE"1"
29. DIR         -file "1" is first entry in above broken directory!
30. SAVE"2"
31. SAVE"3"
32. DIR        -files 1, 2 and 3 appear on top:

   0 "              "
   1 "1"              PRG
   1 "2"              PRG
   1 "3"              PRG
   1 "
   0
   0
8704 DEL
3152 BLOCKS FREE

33. POWER OFF, DISK OUT, DISK IN, POWER ON
34. DIR              -magic! old files appear again:

   0 "TEST           " DK 1D
   1 "TEST"            PRG
   1 "TEST2"           PRG
   1 "TEST3"           PRG
   1 "TEST4"           PRG
   0 "TEST5"          *PRG
3156 BLOCKS FREE
 

One of the first things I found is that this doesn't seem to happen if the disk was formatted using floppytest and then given just a simple HEADER "TEST" command, instead of a HEADER "TEST",IDK command from BASIC.  The difference is that with the ,Ixx option, the disk is low-level formatted again, which means that the C65 ROM is doing the disk format, instead of the floppytest program.  This might be because of the C65 ROM format routine, or it might be from something else.  

To help understand why, I am creating a derivative of the m65 command line tool that can remotely read the contents of a real floppy, without having to interrupt the machine.  This means it is possible to type a BASIC SAVE command, for example, and then see what actually changed on the disk, versus what should have changed.

It looks like the problem might be that the C65 ROM has every track labelled as track 39 -- this would explain the track chattering symptom, as well as the rest of the problems. I'll have to ask @bitshifter to patch that, so that we can test again. If that fixes it, then that means we have DD disk behaviour all working, and I can (finally) finish fixing the HD disk stuff.

Speaking with @bitshifter, the C65 ROM still uses the old un-buffered method of formatting tracks. But we are seeing a Track Info Block being written, which only happens via the hardware assisted formatting.  If we assume that hardware assisted formatting is somehow being triggered, this would also explain why all tracks are marked as track 39, as the C65 ROM doesn't know to update the track number. We can confirm this theory by modifying the track number in $D084 during a C65 ROM format, and it should result in changed TIB and sector header track numbers: And, yes, this is exactly what is happening.

So the question is how is automatic track formatting being triggered? The relevant track format code in the C65 ROM is pretty straight-forward:

993c a9 a1        lda #wtt+1       ;Erase track (fill with $4E gap bytes)
993e 20 58 9b     jsr CommandReg   ;necessary due to simulated index pulse!
9941 8d 81 d0     sta command      ;Begin formatting

9944 a0 10    10$ ldy #16          ;write post index gap 12 sync
9946 b9 7a 9a 15$ lda secdat-1,y
9949 be 8a 9a     ldx secclk-1,y
994c 2c 82 d0 20$ bit stata
994f 10 dd        bpl wtabort      ;oops
9951 50 f9        bvc 20$
9953 8d 87 d0     sta data1        ;Always write data before clock
9956 8e 88 d0     stx clock
9959 88           dey
995a d0 ea        bne 15$

995c a0 04        ldy #4           ;Write 4 header bytes
995e a2 ff        ldx #$ff
9960 b9 6d 01 25$ lda header-1,y
9963 2c 82 d0 30$ bit stata
9966 10 c6        bpl wtabort      ;oops
9968 50 f9        bvc 30$
996a 8d 87 d0     sta data1
996d 8e 88 d0     stx clock
9970 88           dey
9971 d0 ed        bne 25$

We can see that it is putting #$A1 into the command register.  The VHDL that looks at the commands looks like this:

              temp_cmd := fastio_wdata(7 downto 2) & "00";
              report "F011 command $" & to_hstring(temp_cmd) & " issued.";
              case temp_cmd is

                when x"A0" | x"A4" | x"A8" | x"AC" =>
                  -- Format a track (completely automatically)
                  -- At high data rates, it is problematic to feed the data
                  -- fast enough to avoid failures, especially when using
                  -- code written using CC65, as I am using to test things.
                  -- So this command just attempts to format the whole track
                  -- with all empty sectors, and everything nicely built.

                  f_wgate <= '1';
                  f011_busy <= '1';

                  -- $A4 = enable write precomp
                  -- $A8 = no gaps, i.e., Amiga-style track-at-once
                  -- $A0 = with inter-sector gaps, i.e., 1581 / PC 1.44MB style
                  -- that can be written to using DOS
                  format_no_gaps <= temp_cmd(3);

                  -- Only allow formatting when real drive is used
                  if (use_real_floppy0='1' and virtualise_f011_drive0='0' and f011_ds="000") or
                    (use_real_floppy2='1' and virtualise_f011_drive1='0' and f011_ds="001") then
                    report "FLOPPY: Real drive selected, so starting track format";
                    sd_state <= FDCAutoFormatTrackSyncWait;
                  else
                    report "FLOPPY: Ignoring track format, due to using D81 image";
                  end if;
                  
                when x"A1" | x"A5" =>
                  -- Track write: Unbuffered
                  -- It doesn't matter if you enable buffering or not, for
                  -- track write, as we just enforce unbuffered operation,
                  -- since it is the only way that it is used on the C65, and
                  -- thus the MEGA65.
                  -- (Conversely, when we get to that point, we will probably only
                  -- support buffered mode for sector writes).

                  -- Clear the LOST and DRQ flags at the beginning.
                  f011_lost <= '0';
                  f011_drq <= '0';

                  -- We clear the write gate until we hit a sync pulse, and
                  -- only then begin writing.  The write gate will be closed
                  -- again at the next sync pulse.
                  f_wgate <= '1';

                  -- Mark drive busy, as we should
                  -- C65 DOS also relies on this.
                  f011_busy <= '1';

                  report "FLOPPY: Asked for track format";
                  
                  -- Only allow formatting when real drive is used
                  if (use_real_floppy0='1' and virtualise_f011_drive0='0' and f011_ds="000") or
                    (use_real_floppy2='1' and virtualise_f011_drive1='0' and f011_ds="001") then
                    report "FLOPPY: Real drive selected, so starting track format";
                    sd_state <= FDCFormatTrackSyncWait;
                  else
                    report "FLOPPY: Ignoring track format, due to using D81 image";
                  end if;

We can see that $A1 leads to the non-automatic track formatting. But I also just spotted the bug :) Write a comment if you can spot it, too!

And with that, the main problem was solved. I still need to think about how I make the TIBs reliable at DD data rate on HD disks because of the drive electronics limitation, but I have ideas on how to do that.  But that can wait a bit, as we have bigger fish to fry prior to the release of the MEGA65! And besides, this blog post has already taken way too long. So with that, I wish you all a Happy New Year, and lots of retro fun.