Saturday 1 January 2022

More work on HD floppies, RLL encoding, disk density auto-detection and other fun

[Gadzooks! This blog post has sat idle for ages while life and various other things happened.  But finally it is here!]

In the previous blog post, I described in perhaps excess detail some of my adventures getting very high data density on normal 1.44MB floppies in the MEGA65's normal 1.44MB floppy drive.  We seem to be able to fairly easily get more than 2MB, which is pretty nice, but further boosts are most likely going to hinge on whether I can get RLL encoding to work, which in the best case, will get us an extra 50% storage, thus allowing 3MB or so, i.e., more than a standard 2.88MB ED floppy.  But we have a number of barriers to address to get to that point:

1. We need to implement RLL encoding.

2. We need to fix some of the regressions with accessing DD disks that have been introduced, presumably because we are writing at the wrong density or on the wrong track.

3. Write pre-compensation is likely to be even more entertaining for RLL coding, than for MFM, as we have to worry about six different peaks in the histogram being cleanly resolved from one another, compared with MFM's three.

I'll start with implementing the RLL encoder and decoder, as I am already part way there, and can't resynthesise until I get it mostly done. In turn, I have remapped a few of the low-level floppy registers to make space for RLL encoder selection, which means that my test program is now out of step with the older bitstreams. So it's all a bit of a mess that I need to work through. So let's dig straight into it!

For RLL encoding, I am implementing an RLL encoder and decoder that are parallel to the MFM ones, and selection logic to decide which one is connected to the disk drive at any point in time.  That way we can easily switch between the two.

One of the challenges with the RLL encoder, is that we have variable length bit groups that get encoded together, which can be 2, 3 or 4 bits long.  It is thus quite common for encoding to continue across byte boundaries. This is not a problem in itself, but it is a problem for the way that I have been writing the synchronisation marks, as in MFM it was possible to assume that byte boundaries would be respected, and thus the sync byte would get properly detected.  But in RLL this doesn't happen.

To solve this, I allow any partial bit pattern to run out when the next byte is a sync mark, basically pulling bits out of thin air to use to fill in the missing extra bits needed to complete the last symbol.  With that, and having fixed an error in my transcription of the RLL table into the VHDL, I now seem to be able to encode bytes and sync marks alike.  The next step is to make a decoder, and test it under simulation, to make sure we detect sector headers and data etc, as we should.

So now to make the decoder, which is also a bit interesting, because of the way the codes get emitted.  For the MFM coder, we could just look at the previous bit and the length of the incoming gap.  But for the RLL coder things are fiddlier.  For example, if we wrote the byte $42, i.e., %01000010, this would be emitted as:


010      000100
000      100100
10       0100 

So the total bit sequence will be:  0001001001000100

The trouble is that there could be either 100 or 1000 from the previous byte on the front, and 1, 01, 001, 0001 or 00001 at the end, depending on the next byte.

In any case, the first step is to build the gap quantiser, that converts the incoming pulses into a series of pulse lengths, i.e., the distance between each magnetic inversion, represented by a 1 in these sequences. So for the above, this should come out as 4, 3, 3, 4, 2 + something not yet known.
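To make that counting concrete, here is a minimal Python sketch of what an idealised gap quantiser should report for the $42 example. It works on a perfect bit string rather than real pulse timings, and the function name is just something I made up for illustration. Each interval counts the ticks up to and including a 1, which is how the 4, 3, 3, 4 above comes about (the first value would really depend on the tail of the previous byte).

```python
def bits_to_gaps(bits):
    """Split an encoded bit string into pulse intervals.

    Each interval counts the ticks up to and including a 1 (a flux
    inversion). Returns (complete_gaps, trailing_ticks), where
    trailing_ticks is the number of 0s left over after the last 1.
    """
    gaps = []
    run = 0
    for bit in bits:
        run += 1
        if bit == "1":
            gaps.append(run)
            run = 0
    return gaps, run


gaps, trailing = bits_to_gaps("0001001001000100")
print(gaps, trailing)  # [4, 3, 3, 4] 2
```

The trailing 2 is the "2 + something not yet known": it only becomes a complete gap once the next byte supplies its first 1.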

I now have the gap quantiser working under simulation, and can simulate formatting a track, which results in plenty of predictable data to decode, and I can see that said data is being generated.  So now it's time to figure out how to decode RLL2,7 data from the series of gaps.

I'm thinking that the best way will be to turn the gaps back into a series of bits, and then chop the head off for any bit sequences that we can extract. When we hit a sync mark, we then just need to remember to discard all waiting bits, and discard the first two bits of the next gap as well, because of the trailing 00 at the end of the sync mark.
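As a rough software model of that idea (not the VHDL), the sketch below turns gaps back into bits and then keeps chopping recognised code words off the head of the buffer. The first three table rows are the ones from the $42 example above; the other four rows are an assumption, filled in from the commonly published RLL2,7 table, and may not match my VHDL table exactly. Sync mark handling is left out entirely.

```python
# Data bits -> channel bits. The first three rows come from the $42 example;
# the remaining rows are ASSUMED from the commonly published RLL(2,7) table.
RLL27 = {
    "10": "0100",
    "010": "000100",
    "000": "100100",
    "11": "1000",
    "011": "001000",
    "0010": "00100100",
    "0011": "00001000",
}
DECODE = {code: data for data, code in RLL27.items()}


def gaps_to_bits(gaps):
    """A gap of n ticks is n-1 zeros followed by a 1."""
    return "".join("0" * (g - 1) + "1" for g in gaps)


def decode(channel_bits):
    """Chop recognised code words off the head of the buffer.

    Returns (data_bits, leftover), where leftover is the undecoded tail
    that has to wait for more gaps to arrive.
    """
    data = []
    while True:
        for length in (4, 6, 8):
            word = channel_bits[:length]
            if len(word) == length and word in DECODE:
                data.append(DECODE[word])
                channel_bits = channel_bits[length:]
                break
        else:
            # No complete code word at the head: stop and keep the tail.
            return "".join(data), channel_bits


bits = gaps_to_bits([4, 3, 3, 4]) + "00"  # the $42 example, trailing 00 added
print(decode(bits))  # ('01000010', '')
```

Because no code word is a prefix of another, trying the lengths 4, 6 and 8 in order is enough; there is never more than one way to chop the head off.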

With the above approach, we need only a few IF statements in the VHDL to decode the bits, and some glue to make the bits emit one by one from the variable-length RLL bit blocks. So we should be ready to synthesise a bitstream, and see if we can see some RLL encoded data. Like this, for example:

 

Notice that there are more than 3 peaks, because with RLL2,7, we should have peaks at 2x, 3x, 4x, 5x, 6x, and 7x the time base.  The 6x and 7x are too long to show up here.  The noise at low rates might in fact be funny artefacts from the higher intervals wrapping around, or alternatively, that the drive electronics don't like the gaps being that long.  If we speed up the data rate from DD to the HD rate, then we see a much better result:

We have the peaks, and they are looking very nicely resolved, and without any funny noise at the low end.  Note that the 7x peak is really short -- this is because a gap of 7 zeroes requires fairly specific bit patterns, which are not that common. This is also a good thing, because drives are much happier with shorter intervals than that.

It is possible to increase the data rate some more, but I will need to add in write-precompensation as well, because the two shortest interval peaks do start to bleed into one another, which is, of course, bad.  I'll have a look and a think about that, because it might end up being better to make a custom RLL code that merges the two shortest interval peaks, in order to be able to increase the data rate higher than would otherwise be possible.

Here is about the best data-rate that is possible right now, with a divisor of $18 = 24, i.e., 40.5MHz/24 = 1.6875MHz:
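For reference, the divisor-to-rate arithmetic is just the 40.5MHz clock divided by the divisor. A tiny Python helper (the name is mine, purely for illustration) makes it easy to compare the divisors that come up in this post:

```python
CLOCK_MHZ = 40.5  # the clock that the floppy data rate divisor is applied to


def data_rate_mhz(divisor):
    """Data clock in MHz for a given divisor."""
    return CLOCK_MHZ / divisor


for divisor in (0x51, 0x28, 0x18):
    print(f"${divisor:02X} = {divisor:3d} -> {data_rate_mhz(divisor):.4f} MHz")
```

Smaller divisors mean faster data rates, so $51 is the DD rate, $28 the HD rate, and $18 is the fast rate discussed here.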


We can see the left-shift of the 3x pulse, just like we saw with the MFM encoding with the 1.5x pulse, that is also the 2nd in the sequence. To give an idea of how much better this RLL encoding is at this frequency compared with MFM, here is the MFM equivalent, with the best write pre-compensation I have been able to do:

I.e., MFM encoding at that data rate is utter rubbish, and is probably not usable below a divisor of $1E = 30.  So we are getting a divisor of 24 with RLL, which means 30/24 = 5/4 = 125% of the MFM capacity with this RLL encoding, assuming that it actually works, which I will soon be able to test.  

Interestingly, it looks like write pre-compensation will be rather unlikely to make much difference with the RLL encoding, at least on track 0. Shifting the 2nd peak to the right a bit might get us one divisor lower to /23 instead of /24, but it feels like it's really diminishing returns. Trying other tracks, I can see the split peaks again, so implementing write pre-compensation might help to preserve higher density onto more tracks, so it might well still be worth implementing.

More generally, I'm also curious to know how well perfect write pre-compensation would be able to help.  To test that, I could write a track with a constant stream of equally spaced pulses, and see how wide the spread of the peak is from that, as that should only have the non-data-dependent timing jitter.

It's the weekend again, and I have had a bit of time to work on this all again, including another twitch stream (sorry, the video isn't up yet, I'll update this with a link, when it is), where I tracked down some problems with the RLL encoding: Specifically, I am facing a problem now where the CRC unit takes a few cycles to update the CRC, but the RLL encoder buffers 2 bytes instead of 1, which means that it is trying to inject the CRC before it has been fully calculated.  Also, earlier on, this double-byte buffering itself had some problems.

There is a way to calculate the CRC in a single cycle, but it needs a larger look-up table which would take up space in the FPGA, and we don't really need the speed, we'd just like it. All that I need to do is to make the FDC encoder wait for the CRC to finish calculating before feeding bytes.  I have a fix for that synthesising now.

Meanwhile, I am still curious about how the RLL2,7 code was generated, and whether we can generate RLL3,x or RLL2,x codes, with x larger than 7, that the floppy drive can still read, but that would allow for higher densities.  RLL3,x codes are particularly interesting here, because with an RLL3,x code, there would be at least three 0s between every 1, which means that we can -- in theory at least -- double the data rate vs MFM, although other effects will likely prevent us reaching that limit. But for that to be effective, we need a nice short RLL3,x table, built along the same lines as the very clever RLL2,7 tables.

What is interesting about the construction of the RLL2,7 table, is that it is very efficient, because it uses unique prefixes to reduce the overall code length. Without that trick, it would require more like 3 ticks per bit, instead of 2 ticks per bit. I know, because I made a brute-force table generator that works out the numbers of combinations of codes of given lengths that satisfy a given RLLx,y rule.  RLL2,7 requires at least two 0s between every one, so the table is based around that, by looking at all possible codes that satisfy that of a given short length, and attaching two 0s to the end, to make sure that they are guaranteed to obey the RLL2,7 rule when joined together in any combination.  

So maybe we can do something like that for RLL3,10, which I think is probably about the sweet spot. We'll initially try for something that is only as efficient as the RLL2,7 code, but using units of 3 bits, instead of 2:

001000

010000

100000

That gets us 3 combinations using 6 bits, which is a fair start, since the idea is that we want to use no more than 3n bits. So let's add some more 3-bit units, that can't be confused for any of those three above, of which there are the following six:

100010000

100001000

010001000

000100000

000010000

000001000

This still keeps us within RLL3,10, but only just, as two of those together can result in exactly 10 0s between adjacent 1s.

With 9 combinations, we could encode 3 bits using those, but we would like constant length encoding, like RLL2,7 does, because data-dependent encoding length is a real pain to work with.  But unfortunately the number of cases here doesn't work for that.  So let's start with a 5 bit prefix, and try to encode more bits:

10000000

01000000

00100000

00010000

00001000

10001000

Ok, so we have 6 combinations using 8 ticks.  To make it at least as good as RLL2,7, we need to use less than 3 ticks per bit on average, but to keep the length constant, we need to find a multiple of 3 that is a power of two if we do it this way... of which there are none. So I'll just have to think on all this a bit more. So let's go back to that bitstream...
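The candidate lists above can be reproduced with a few lines of brute force. This is a simplified Python sketch in the spirit of my table generator, not the generator itself: it enumerates every non-zero prefix that keeps at least d 0s between 1s, appends d 0s so that any two codes can be joined safely, and optionally drops words that begin with an already chosen shorter code.

```python
from itertools import product


def rll_prefixes(length, d):
    """All non-zero bit strings of `length` with at least d 0s between 1s."""
    out = []
    for bits in product("01", repeat=length):
        word = "".join(bits)
        ones = [i for i, b in enumerate(word) if b == "1"]
        if ones and all(b - a > d for a, b in zip(ones, ones[1:])):
            out.append(word)
    return out


def candidate_codes(length, d, taken=()):
    """Append d 0s to each prefix, dropping words that start with a code
    in `taken`, so that the combined set stays uniquely decodable."""
    codes = [p + "0" * d for p in rll_prefixes(length, d)]
    return [c for c in codes if not any(c.startswith(t) for t in taken)]


short = candidate_codes(3, 3)          # the three 6-tick codes
longer = candidate_codes(6, 3, short)  # the six 9-tick codes
eight = candidate_codes(5, 3)          # the six 8-tick codes
print(short, len(longer), len(eight))
```

This only checks the minimum run length; the upper limit (the 10 in RLL3,10) still has to be checked separately by looking at how the codes join up.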

With the new bitstream, we can now write a valid TIB and RLL2,7 formatted sector headers, because I can verify their correctness from the written data. However it refuses to parse even the TIBs, let alone the RLL sectors.  My best guess is that the CRC feeding on the decoder side is also having problems, although it doesn't really make sense why the TIBs refuse to read. 

Also, digging through the source of sdcardio.vhdl and mfm_decoder.vhdl, the last sector seen under the head should still be updating, even if the CRCs were bad -- which they are not. It's almost as though the RDATA line from the floppy is not connected -- but it is, because I can display histograms, which means that the gap counting logic is reading the RDATA line.  The data rate for the TIB decoder is locked at divide by 81, the rate for DD disks, and to MFM encoding, so neither of those should be blocking TIB detection.  

So where is the signal getting to, and where is it getting stuck? We have some debug tools in this bitstream still to help. $D6AF returns mfm_quantised_gap from the mfm_decoder. It actually contains different information to the mfm_quantised gap, but that's okay. The main thing is that it is changing value continuously, showing that gaps are being detected and quantised.  Perhaps it's that we are not having the synchronisation marks passed through correctly? That could stop everything from working.  

The sync signal is selected from either the MFM or RLL27 decoder, and so it is possible with all this fiddling about that it has been messed up.  Although even that would be a bit odd, given that it simulates fine.  I'll modify the bitstream so that we can see if either the RLL and/or MFM synchronisation marks are detected -- just in case it is some funny problem with synthesis of the design.

And I might just have found the problem: the sensitivity list for the process in the VHDL file where the selection between RLL and MFM signals happens was missing all those signals, so it might well have all been optimised out during synthesis, which would result in exactly what we are seeing.

Nope, that wasn't it, either. So let's work out the commit at which it stopped working, and figure out the differences from that.

Bitstream from commit 9071b54 works, but the bitstream from commit 2c32a9e  doesn't work. After digging through the differences, I have found the problem: The CRC of the Track Info Blocks is incorrectly calculated when reading them, and thus they aren't accepted.  The good thing is that this shows up under simulation, so can be fixed. Similar problems probably occur when hitting the sector header blocks, which is why sectors aren't being identified.  

Specifically, it looks like the first of the three sync marks is not being fed into the CRC calculation. This is because we assert crc_reset at the same time as crc_feed, when we encounter the first sync byte.  Now synthesising a fix for that, after confirming in simulation that it works.
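To show why losing that one byte is fatal, here is a small Python model of the situation. It assumes the conventional floppy CRC (CRC-16/CCITT, polynomial $1021, preset to $FFFF) and $A1 sync bytes as in standard MFM formats, and it assumes that when reset and feed coincide, the reset wins and the byte is dropped. Those details are illustrative assumptions, not taken from the VHDL, and the function names are made up.

```python
def crc16_update(crc, byte):
    """Bit-serial CRC-16/CCITT step (polynomial $1021) for one byte."""
    crc ^= byte << 8
    for _ in range(8):
        crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
        crc &= 0xFFFF
    return crc


def crc_step(crc, byte, reset, feed, reset_swallows_feed):
    """One decoder event, with reset and feed possibly asserted together."""
    if reset:
        crc = 0xFFFF
        if reset_swallows_feed:
            return crc  # assumed bug: byte offered in the same cycle is lost
    if feed:
        crc = crc16_update(crc, byte)
    return crc


def crc_after_three_syncs(reset_swallows_feed):
    crc = 0
    for i in range(3):
        # Reset is asserted only on the first sync byte; all three are fed.
        crc = crc_step(crc, 0xA1, reset=(i == 0), feed=True,
                       reset_swallows_feed=reset_swallows_feed)
    return crc


print(hex(crc_after_three_syncs(False)))  # 0xcdb4
print(hex(crc_after_three_syncs(True)))   # 0x968b
```

The second value is simply the CRC of two sync bytes instead of three, so everything fed afterwards ends up with a CRC that can never match what was written.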

Totally unrelated to the floppy, also included in this synthesis is support for switching between 6581 and 8580 SIDs, using the updated SID code from here. More on that in another blog post later, most likely.

Back to the RLL floppy handling, that fix has now got sector headers being detected, and I am able to cram up to 40 sectors per track in Amiga-style track-at-once at a divisor of $17 = 23, or 36 sectors per track if sector gaps are included, and the peaks are quite well resolved still at that divisor, although the double-peaking that is symptomatic of the need for write pre-compensation is clearly visible:

 

Note that this is without any write pre-compensation. It's also without testing whether we can actually read the sectors reliably, either. So it might well be a fair appraisal of what will be reliable after I implement write pre-compensation for the RLL2,7 encoding.  For comparison, the best stable divisor for MFM encoding with write pre-compensation is around $1C = 28, which gets 30 sectors per track with gaps, or 33 without gaps.  So it's 40/33 = about 121% of the MFM capacity, before we implement write pre-compensation.  Not the full 150% that we should theoretically be able to obtain, if it is only the shortest gaps that are a problem, but still a welcome boost.  

Over a whole disk, we can probably count on averaging 34 sectors per track with gaps, or 38 without. This means that we can now contemplate disks with capacities of around:

With Gaps, 80 tracks = 34x80x2x512 = 2.72MB

With Gaps, 84 tracks = 2.86MB

Without Gaps, 80 tracks = 38x80x2x512 = 3.04MB

Without Gaps, 84 tracks = 3.19MB
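Those figures use the usual "floppy megabyte" of 1000 x 1024 bytes, i.e., the same unit in which a 1,474,560 byte disk gets called 1.44MB. A quick Python check of the arithmetic (the helper name is mine):

```python
def capacity_mb(sectors_per_track, tracks, sides=2, sector_bytes=512):
    """Capacity in 'floppy megabytes' of 1000 x 1024 bytes."""
    total_bytes = sectors_per_track * tracks * sides * sector_bytes
    return total_bytes / (1000 * 1024)


for label, sectors in (("With gaps", 34), ("Without gaps", 38)):
    for tracks in (80, 84):
        print(f"{label}, {tracks} tracks = {capacity_mb(sectors, tracks):.2f}MB")
```

The same function gives 1.44 for a standard 18 sectors x 80 tracks HD disk, which is a handy sanity check on the unit.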

In short, we have reached the point where we are able to cram more data on an HD disk than an ED disk officially holds... provided that I can read the sectors back reliably.  

Which I can't yet. In fact, I can't read any of them back. I am suspecting that I have another CRC related problem, so will see if a CRC error shows up in simulation of reading a sector body. And, yes, the sector body CRCs are wrong.  Bug found and fixed, and synthesising afresh. Again.

And that has sectors reading! Now to tweak the floppytest program to format RLL disks correctly, and to test reading all of the sectors, and see how far we can safely dial things up.  That showed a bug in the TIB extraction, where the sector count was being read as $00 always, which I am now fixing -- and is now fixed. 

Now it's time to think about write pre-compensation, so that I can maximise the reliable data rate, as right now when reading sectors, a data rate divisor of 25 = 1.62MHz is not quite reliable on track 0, and certainly falls apart by about track 35 or so.

As part of this, I might follow up on something that occurred to me on the ride to work this morning:  My PLL that tracks the floppy data clock when reading from real disks is probably sub-optimal: It synchronises exactly when it sees a pulse, rather than working out if the pulse is a bit early or late vs the expected time, and taking that into account. This will be causing gaps to seem over large and too small, when the magnetic phenomena occur that push short gaps together and spread out longer gaps between them. If I can fix this up, it should sharpen the peaks in the histograms quite a bit, and allow higher data rates, which would be nice.

So I might take a brief excursion into analysing the behaviour of the gaps and think about PLL algorithms that will not cause this double-addition of gap variance, but rather, ideally, be able to cancel some of it out.  The first step is to take a raw track capture, and take a look at the gaps in it, and try simulating various PLL algorithms on it, to see what we can do. 

Part of why I think that this has considerable potential is that the histogram peaks for gaps are really quite tight until the data rate is increased beyond a certain point, after which they start getting quite a bit wider. I think this is a direct result of measuring each gap from the time of the previous pulse, which will tend to be shifted at the higher data rates, but not at the lower data rates, because the shifting only happens in any noticeable amount above some threshold data rate.

The trick is to make a simple algorithm for tracking the pulses and error between expected and actual time of appearance.  And the methods need to be simple enough for me to fairly easily implement.  

Some low-hanging fruit is to look at the error on the previous pulse interval, and assume that all (or half) of that error is due to the most recent pulse being displaced. We then adjust the nominal start time of the next interval to the modelled arrival time of the pulse, if it had not been displaced in the time domain.  This will stop pulses that are squashed together from having both ends of the squish deducted from their apparent duration.  That is, we carry forward the error in arrival time, so that we can compensate for it. The trick then, is that we need to allow for true error, i.e., drift in the phase. We can do this by checking how often a pulse arrives late versus early: Whenever there are too many lates or earlies, we adjust the phase by one tick in the appropriate direction. In other words, we track the average frequency over time, and attempt to correct individual pulse intervals based on recent deviations from the average frequency.
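Here is a minimal Python sketch of that carry-forward idea, next to the naive "resynchronise on every pulse" approach. The parameter names, the drift limit of 4 and the toy pulse times are all made up for illustration; this is a model of the idea described above, not the code I ran on the captured track.

```python
import math


def quantise_naive(pulse_times, tick):
    """Resynchronise exactly on every pulse, then round each interval."""
    return [math.floor((b - a) / tick + 0.5)
            for a, b in zip(pulse_times, pulse_times[1:])]


def quantise_carry(pulse_times, tick, share=1.0, drift_limit=4):
    """Measure each gap from where the previous pulse *should* have been.

    share is the fraction of the previous interval's error that is blamed
    on displacement of the most recent pulse (1.0 = all, 0.5 = half).
    """
    gaps = []
    reference = pulse_times[0]
    imbalance = 0  # number of late pulses minus number of early pulses
    for t in pulse_times[1:]:
        raw = t - reference
        n = math.floor(raw / tick + 0.5)
        error = raw - n * tick  # > 0: pulse was late, < 0: pulse was early
        gaps.append(n)
        reference = t - share * error  # modelled undisplaced arrival time
        imbalance += (error > 0) - (error < 0)
        if abs(imbalance) > drift_limit:
            # A persistent bias is treated as real phase drift: nudge by one.
            reference += 1 if imbalance > 0 else -1
            imbalance = 0
    return gaps


# Ideal pulses at 0, 30, 60, 90 (tick = 10), with the middle two squashed
# together by 3 sample clocks each.
times = [0, 33, 57, 90]
print(quantise_naive(times, 10))  # [3, 2, 3]
print(quantise_carry(times, 10))  # [3, 3, 3]
```

In the naive version the squashed interval loses 6 sample clocks and falls into the wrong bin, whereas with the error carried forward each pulse is only ever 3 away from where it was expected.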

Initial testing on the captured RLL encoded track data suggests that it should help. In particular, it seems to be reducing the maximum magnitude of the error in arrival time, which is exactly what we need if we want to narrow those peaks.  This should operate nicely in addition to any write pre-compensation, because it is all about fixing the residual error that the write pre-compensation didn't fix.  

In the process of all this, it is clear that these ALPS drives keep the motor speed pretty accurate, probably to within +/-1% of the nominal speed, which encourages me that we should be able to get higher data rates out of them, once I have implemented the improved PLL (which will help both MFM and RLL), and then implemented write pre-compensation for the RLL encoding.

I also realised that the encoder for both MFM and RLL is calculating the data rate differently during the read and write phases: During reading, it is exactly correct, but during writing I am counting from the divisor down to 0, i.e., each interval is one cycle longer than it should be.  This will be causing problems during decoding at higher data rates, because all the bits will be slightly shifted in time, and errors will be more likely.  This is quite easy to fix, and should help (a little) with the maximum achievable data density as well.
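The off-by-one is easy to see with a toy model of a down-counter. This Python sketch assumes one decrement per clock cycle and a reload when the stop value is reached; the real VHDL may be structured differently, so treat it as an illustration only.

```python
def cycles_per_interval(divisor, stop_at):
    """Clock cycles from loading the counter until it reaches stop_at."""
    counter, cycles = divisor, 1
    while counter != stop_at:
        counter -= 1
        cycles += 1
    return cycles


divisor = 0x18
print(cycles_per_interval(divisor, 0))  # 25: counts divisor..0 inclusive
print(cycles_per_interval(divisor, 1))  # 24: the intended interval
```

At a divisor of 24 that is a 25-cycle interval, i.e., data written about 4% slower than the decoder expects, and the relative error grows as the divisor gets smaller.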

I think the next step is probably for me to find a way to simulate reading from a real track of data, so that I can work on the PLL improvements and check that they are really functioning.

I have in the meantime been working on the PLL algorithm some more, and now have it checking against the last 8 pulses, and if the current pulse is exactly timed to an integer number of gaps (+/- one clock cycle), then it counts those up, and the gap size that has the most votes is accepted as exact. That squelches a large fraction of the errors.  For cases where this is not possible, then I take the average of the gap as base-lined against those last 8 pulses, which tends to help reduce the error a little -- but rarely completely.  This algorithm will all be very easy to implement in hardware, as it is just addition, subtraction and 3 right shifts.

The other thing I have noticed, is that the shortest gaps almost always get stretched, rather than compressed for some reason.  The longer gaps tend to have much less error on them, when compared with the shortest n=2 gaps.  So it probably makes sense to allow a bit more slop on the shortest gaps, and in return slightly narrow the lower bound for n=3 gaps.  Maybe allow up to 2.625 instead of 2.5 gaps to be counted as an n=2.
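In code, that just means moving the n=2 / n=3 boundary while leaving the other boundaries at the usual half-way points. A hypothetical Python sketch, with the gap length expressed in units of the time base:

```python
import math


def classify_gap(ticks, short_limit=2.5):
    """Return the gap class n for a measured gap length.

    Anything below short_limit counts as the shortest gap (n=2); longer
    gaps are rounded to the nearest integer as usual.
    """
    if ticks < short_limit:
        return 2
    return max(3, math.floor(ticks + 0.5))


print(classify_gap(2.6))         # 3 with the symmetric boundary
print(classify_gap(2.6, 2.625))  # 2 with the extra slop for short gaps
```

2.625 is 2 + 5/8, so the comparison stays cheap to do in hardware.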

This is all with non write pre-compensated RLL2,7 data.  Some of these residual errors can likely be reduced if I implement the write pre-compensation.  But this "read post-compensation" certainly seems to be much more robust than my previous naive approach, and I'm hopeful that it will allow higher data rates -- which I can test by writing such a track, and then testing against it.  

I can also use this test program against MFM data, so I can test my theory about write pre-compensation working well with it. But that will have to wait until the weekend, as will testing higher data rates -- which I am excited to see if it will handle.  If I go much faster, the resolution of my captures at a sample rate of 13.3MHz will actually start to be the limiting factor, which is pretty cool.

Another week has gone by, and I have write pre-compensation working for RLL2,7, which has helped, but I haven't gotten around to implementing the read post-compensation stuff.  I'm going to defer that for now, as I want to just close-out the RLL and variable data rate and Track Info Block stuff, and merge it back into the development branch.

So first up, the good news is that write pre-compensation with RLL2,7 encoding works quite nicely. At the really high data rates, it looks like having the write pre-compensation dialled up high for the 2nd coefficient works better. That's the case where the difference in gaps is the greatest, and it makes sense that it works that way. But making sense is not something I take for granted with floppy magnetics.

So let's take a look at the impact of write pre-compensation with RLL2,7 at high data rates.  First, let's see what divisor $17 and $15 look like without any write pre-comp:

Basically we can see that divisor $17 = 23 = 40.5MHz/23 = 1.76MHz is a bit marginal, while divisor $15 = 21 = 1.93MHz is basically rubbish. So let's now turn on some write pre-compensation:

That cleans up divisor $17 quite nicely, and that's probably quite usable like that. If we then move to divisor $15, i.e., the faster data rate = higher data density, then it is kind of okay, but the clear area between the peaks is noticeably narrowed:

So that's probably not usable.  However, when we increase the 2nd pre-comp coefficient from 5 = ~125ns to 8 = ~200ns, then it cleans up quite a bit, to the point where it looks probably about as good as divisor $17 did with the more conservative pre-comp values:


These pre-comp values seem to be fine for HD and above data rates (divisors $28 and smaller). At larger divisors, i.e., slower data rates, they spread the peaks rather than sharpen them. But in that case the distance between the peaks is pretty extreme anyway.

Well, time is really flying at the moment, for a variety of reasons. It's now a couple of weeks later, and I have done another couple of live streams: https://www.youtube.com/watch?v=MOEPXSAW08g and https://www.youtube.com/watch?v=n5sfdv7K8Zw

To summarise the progress in those streams, I tried a pile of floppy drives I have here -- including a 2.88MB ED drive (admittedly in HD mode, because I have no ED disks or ED-drilled HD disks).  Of the drives that worked (a bunch were duds), almost all could read the same data densities, and had similar performance for RLL encoding as the ALPS drive in my MEGA65, thus giving me confidence that this crazy high-capacity RLL storage encoding is likely to work on people's machines -- certainly the production machines which will have ALPS drives, but also the vast majority of the 100 DevKits that have random PC floppy drives in them.

One discovery during this process was that one of several Panasonic drives from my pile was a lot better than the rest, and seemed to be reliable at data rates around 10% or more faster than the other drives could handle. It will be interesting to investigate this later on to find out why. 

But back to finishing the RLL HD stuff, I then started testing behaviour at DD as well as HD, to check for regressions. And there is some problem, where sectors are not being written back properly.  To debug this, I really need to be able to read whole tracks of data in an exact manner.  

This is a bit tricky, because the pulses we have to log can occur at >2MHz, so doing this in a tight loop is a bit of a pain.  So what I have decided to do, is to make a mode for the MEGA65's DMA controller, that reads the raw flux inversion intervals, and writes them to memory.  This has the added benefit that it will make it quite easy to write software for the MEGA65 that can read arbitrary other disk formats -- like Amiga disks, for example.

This _should_ be quite simple to implement, but it keeps causing me problems for some reason. I have it counting the number of 40.5MHz clock cycles between each inversion.  But my MFM/RLL decoder test program is failing to find the 3x SYNC sequences.  The Track Info Block is always written at divisor $51 = 81 decimal, and thus we should see gaps approximately 160, 160, 120 and 160 = $A0, $A0, $78, $A0 long for the TIB's SYNC marks.  The start of a captured track that is written using RLL and a divisor of $16 looks like:


00000000: 00 ef 6f 83 6d 6d 85 6a 6f 82 6c 6f 83 6c 70 80    @ooCmmEjoBloClp@
00000010: 6f 6b f0 70 82 ff 6a 6e 84 6e 6d 4e 37 84 a1 a1    okppB~jnDnmN7Daa
00000020: a1 a2 ff a0 a2 ff a2 a1 a2 a0 a2 a0 a2 a0 a2 a2    ab~`b~bab`b`b`bb
00000030: a2 9e a3 9f a3 a0 a3 9f a3 9f a3 a0 a2 a0 ff a3    b^c_c`c_c_c`b`~c
00000040: a0 a2 ff a2 a0 a3 9f a3 9f a3 a0 a2 a0 a2 a0 a2    `b~b`c_c_c`b`b`b
00000050: a0 a3 a0 a2 9f a3 a0 a2 9f a4 9f a2 a0 a3 9f a3    `c`b_c`b_d_b`c_c
00000060: 9f ff a0 a2 a0 a2 a1 a1 a1 a2 ff a0 a2 a0 a2 a0    _~`b`baaab~`b`b`
00000070: ff a1 a0 a3 a0 ff 9f a2 a1 a2 9f a3 9c f7 ff f0    ~a`c`~_bab_c\w~p
00000080: ff f8 91 ff ed ff 94 ff f0 ff f0 ff 98 f8 ef ff    ~xQ~m~T~p~p~Xxo~
00000090: f8 98 a3 a0 a2 9f a3 a0 a2 9c f7 ff 97 f7 f6 f4    xXc`b_c`b\w~Wwvt
000000a0: 9b a0 9f f8 ff a2 a0 a2 9c ff 98 9e ff f9 93 ff    [`_x~b`b\~X^~yS~
000000b0: ff d8 9b ff 97 fc 9b a0 a2 ff ff a1 a1 a2 a0 a2    ~X[~W|[`b~~aab`b
000000c0: 68 47 41 41 c5 41 82 41 42 40 42 3f 42 f3 6d 84    hGAAEABAB@B?BsmD
000000d0: 6b 6c 43 b1 3f 40 b1 40 f3 40 41 58 56 6c 43 40    klCq?@q@s@AXVlC@
000000e0: 41 41 41 41 41 41 41 41 41 41 3f 41 73 42 3d 41    AAAAAAAAAA?AsB=A
000000f0: b1 c5 44 54 6d f3 57 96 6e 6d 84 6b ff 6d 85 6b    qEDTmsWVnmDk~mEk

The first 12 lines show values around $A0 which is suspiciously close to the long gap (=2x nominal signal rate) in MFM, before in the last four lines it switches to much more variable values, with a floor near $40, i.e., around $16 x 3 (=$42), which correlates with the switch to RLL at that divisor.

So why can't we see our MFM sync marks in the first part?  I know that they must be there, because the floppytest.c program is able to read them.  

Fiddling about, I am seeing a funny thing, though, where I can't reliably read data from the drive until I have stepped the head.  It can even be stepped backwards from track 0. But without it, I don't reliably read the track: I can still see all the peaks for the gaps, but something is wonky, and the MEGA65's floppy controller doesn't see the sectors.  

Ah, found the problem: The histogram display of the tracks that I was using, doesn't enable automatically setting the floppy controller's data rate to the speed indicated by the TIB.  With that enabled, the sector headers get found fine.  But that's on the running machine, not on my captured data, which is still refusing to read correctly.  And which is reading the raw flux, so doesn't care what the floppy controller thinks.

Okay, I think I know what the problem is: The MFM gaps are effectively 2x to 4x the divisor long, not 1x to 2x, because of the way that MFM is described. Thus at a divisor of $51 = 81, this means that the gaps will be between 81x2 = 162 and 81x4 = 324.  But 324 is >255, so we can't reliably pick them out with our 8-bit values.  To fix that, I need to shift the gap durations one bit to the right, which I am synthesising now.  Hopefully that is the problem there.
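A quick Python model of why the longest MFM gap can't be distinguished in a byte, and what the one-bit shift does about it. Saturating at $FF is my assumption about how an over-long count shows up in the capture (the dump above is full of ff values, which fits), and the function name is made up.

```python
def capture_byte(cycles, shift=0):
    """Log one flux interval as a byte, saturating at $FF (assumed)."""
    return min(cycles >> shift, 0xFF)


DIVISOR = 0x51
for multiple in (2, 3, 4):
    cycles = DIVISOR * multiple
    print(multiple, cycles,
          f"${capture_byte(cycles):02X}", f"${capture_byte(cycles, 1):02X}")
```

Without the shift the 4x gap is pinned at $FF; with it, the three gap lengths come out at about $51, $79 and $A2, all comfortably inside 8 bits.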

Meanwhile, while that is synthesising, I also looked at adding support for DMA jobs >64KB long, since there can be more than 64K flux transitions per track. This will be needed for reading HD Amiga disks, for example, as well as for letting me debug whole tracks of data. 

The divide by two on the counts has helped, in that I can now see some sync marks, but there are still things being borked up, such that I don't see all 3 sync marks in a row. This is what it looks like now:

00000000: 00 81 48 80 48 a9 49 c6 51 50 50 51 50 51 a2 50    @AH@HiIFQPPQPQbP
00000010: 51 50 51 50 a2 50 51 50 50 51 50 51 50 51 50 51    QPQPbPQPPQPQPQPQ
00000020: 50 51 a1 a2 50 51 50 51 50 50 51 51 50 51 4f 51    PQabPQPQPPQQPQOQ
00000030: 50 51 50 51 50 51 50 51 50 51 50 51 50 51 50 51    PQPQPQPQPQPQPQPQ
00000040: 4f 51 50 51 50 51 50 52 4f 51 50 51 50 51 50 51    OQPQPQPROQPQPQPQ
00000050: 50 a1 52 4f 51 50 51 4f 51 50 51 4f 51 50 51 50    PaROQPQOQPQOQPQP
00000060: a1 52 4f 51 50 51 4d 1f 77 a4 7c 49 a8 75 a5 7a    aROQPQM_wd|Ihuez
00000070: 4a a7 76 a4 77 a5 4b 7e 76 a4 7c 4c 51 4f 51 50    JgvdweK~vd|LQOQP
00000080: 51 4d 7d a2 7d 47 7c 21 4a 51 a2 4e 80 4b 52 4c    QM}b}G|!JQbN@KRL
00000090: a8 4c f6 4e ce 49 7d a3 79 7b 4c 4f 81 4a 51 51    hLvNNI}cy{LOAJQQ
000000a0: 51 50 51 50 51 50 51 50 51 a2 50 51 50 51 50 51    QPQPQPQPQbPQPQPQ
000000b0: 50 52 4f 51 50 51 4f 52 50 51 4d 7f 79 7b 4c 4e    PROQPQORPQMy{LN
000000c0: 7d 7a 78 7c 4c 4d a5 a5 1b 7b 48 a8 76 a4 7c 48    }zx|LMee[{Hhvd|H
000000d0: a8 75 a5 7b 4d 50 50 50 a2 4e 7f 4d 50 51 50 51    hue{MPPPbNMPQPQ
000000e0: 50 51 f1 52 51 50 4d 7a 4c 51 51 50 cd 7c 4b 51    PQqRQPMzLQQPM|KQ
000000f0: 50 4f 7b a6 4d 50 50 50 4c a6 a1 a6 4b 51 51 4d    PO{fMPPPLfafKQQM

All the values are halved from before, as mentioned. So we now see lots of ~$51 values, which we expect. There should be 12 bytes x 8 bits = 96 = $60 of them before we write the sync marks.  The sync marks should look like $A0 $78 $A0 $78, or thereabouts, and we can see some things that look like this in the above, marked in bold (note that this is from a DD formatted track, so we continue to see similar numbers below).

The underlined part looks to me like it _wanted_ to be a sync mark, but the first gap got chopped up into little pieces. So something is still really rogered up.
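The hunt for sync marks in a dump like the one above can be sketched in a few lines of Python. This is only an illustration, not the logic in the FPGA or in mfm-decode.c: the function names, the nominal short gap of $50 and the 0.2 tolerance are all assumptions made for the example. Each gap is classified as 1.0, 1.5 or 2.0 times the short gap, and the $A0 $78 $A0 $78 pattern is then searched for as 2.0, 1.5, 2.0, 1.5.

```python
# Nominal gap lengths, as multiples of the shortest MFM gap.
NOMINAL = (1.0, 1.5, 2.0)
# The four gaps of a sync mark as described above: $A0 $78 $A0 $78.
SYNC_PATTERN = [2.0, 1.5, 2.0, 1.5]


def classify_gap(gap, unit=0x50, tolerance=0.2):
    """Return the nominal length nearest to gap/unit, or None if too far off.

    unit and tolerance are assumed values chosen for this example.
    """
    ratio = gap / unit
    nearest = min(NOMINAL, key=lambda n: abs(ratio - n))
    return nearest if abs(ratio - nearest) <= tolerance else None


def find_sync_marks(gaps, unit=0x50, tolerance=0.2):
    """Return the offsets at which the four-gap sync pattern starts."""
    classes = [classify_gap(g, unit, tolerance) for g in gaps]
    n = len(SYNC_PATTERN)
    return [i for i in range(len(classes) - n + 1)
            if classes[i:i + n] == SYNC_PATTERN]


# The row at offset $60 of the dump above.
row = bytes.fromhex("a1 52 4f 51 50 51 4d 1f 77 a4 7c 49 a8 75 a5 7a")
print(find_sync_marks(row))
```

On that one row it reports a single match at offset 12, the a8 75 a5 7a run. The $1f gap just before it does not classify as any valid gap length, which is the sort of chopped-up gap that breaks the neighbouring sync mark.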

Weirdly, if I instead read the $D6A0 debug register, which lets me read the same RDATA line from the floppy drive that we are using to calculate these gaps, then the sync marks get properly detected.  What I am fairly consistently seeing is a sequence something like a8 76 1f 49 ef, instead of the A0 78 A0 78 50 type sequence (the extra 50 at the end is a single gap after each sync that gets written).  So something is getting really confused, in that the gap boundaries are being read at the wrong locations. And it's the last 3 of those that do it:

A0 78 50 versus

20 50 F0

But as I say, if I read _the same signal_ via a different register, I am getting a different result.

I'm just going to have to sleep on this, I think.

Problem found and fixed!  It was a cross-domain clocking problem. Which is fancy FPGA programmer speak for "the signal I was reading was produced by something on a different clock, and if you don't treat it nicely enough, you get all manner of weird glitching".  The read signal from the floppy is not generated using the same clock as the MEGA65's CPU, and thus this problem happens. The solution is to add extra flip-flops to latch the signal more cleanly.  In reality cross-domain clocking is _much_ more complicated than I have described here, but you get the general idea.
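The "extra flip-flops" idea can be modelled in software, with the caveat that a model like this cannot show the glitching itself, only the structure of the fix. The sketch below assumes a chain of two flip-flops, both clocked from the reading side; the post does not say how many stages were actually added, and the class name is made up for the example.

```python
class TwoFlopSynchroniser:
    """Software model of two flip-flops in series, both clocked by the reader."""

    def __init__(self, idle=1):
        # RDATA idles high, so start both stages high.
        self.stage1 = idle
        self.stage2 = idle

    def clock(self, async_in):
        """Advance one clock edge and return the synchronised output."""
        # Both flops update on the same edge: stage2 takes the old stage1,
        # while stage1 samples the signal from the other clock domain.
        self.stage2, self.stage1 = self.stage1, async_in
        return self.stage2


sync = TwoFlopSynchroniser()
print([sync.clock(bit) for bit in [1, 0, 0, 1, 1]])
```

The output is the input delayed through the chain, so the downstream logic only ever sees a value that has had a full clock period to settle in the first stage.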

So I can now happily spot the sync marks again, decode the TIB, etc. But I am getting CRC errors reported in the TIB among other places, and it looks like we are skipping bytes, which I think is related.  I was fiddling about with fancy post-read timing corrections in my mfm-decode.c program, so it's possible that I have just messed that up.

Certainly something in the decoder is messed up, because where it misses bytes, it is indeed processing lots of gaps, more than it should.  I'll have to see how that is happening.  Found the problem: I had an error in my RLL decoding table.  With that fixed, I can now read the saved data, and all the sectors are being returned with no CRC errors.  

With a 64KB transfer, it is returning 7 sectors worth of data.  This is why I added the >64KB DMA job support.  With 128KB, we will be able to read an entire DD track, which will let me debug exactly what I am botching when writing a sector. Thinking about it more, I can actually do a longer fetch, if I do it into the 8MB HyperRAM, ensuring that even at the highest data rates, I can capture a whole track. 

So now I should be able to format a disk at normal DD 800KB, and then try saving something to it, and seeing how it gets messed up at that point.  My best guess is that the data rate or encoding mode will be incorrectly set for some reason.

Formatting from the latest patched ROM seems to be using the new formatting routines, but it sets the track number for all tracks to 39 in the Track Info Block that gets written, which then causes problems reading back saved files -- although the directory and BAM update fine.  So it is possible that an update to the ROM will fix that.  I might even be able to find the part of the ROM where it does the format and patch it myself.

In the meantime, I am formatting the disks myself using the floppytest.c program, which does set up the TIB data correctly (well, after I fixed a problem with the sectors per track).  Oddly, however, reading sectors from the 2nd side of the disk is having problems.  The sector headers are visible, but it hangs forever.

Fortunately I can use my new track reader code to read that whole track, and check if I think the data is valid or not. It is showing incorrect data in the sectors, and failing sector body CRCs. I thought I would compare it with the first side of the disk, which reads fine, but that is also showing similar errors.  So there is something weird going on here.

Hmm... If I use the ALPS drive in my MEGA65, it has this problem, but if I use the nice Panasonic drive, it doesn't. It _looks_ like there are extra magnetic inversions being inserted where they shouldn't be during writing.  Reading doesn't seem to be affected.  I had added an extra write path for the floppy under CPU DMA control, so I am reversing that out, in case it's the problem. But the problem is manifesting even with older bitstreams. So I am wondering if I haven't half-fried one of the level converters for the floppy interface on my MEGA65, or something odd like that.  I'll wait for that bitstream to build, and maybe see if I can get a DevKit owner or two to try it out for me as well, and work out my next steps from there.

Overnight it occurred to me that there is a simple test I could do, to work out if it is the writing side or the reading side that is the problem:  Write the disk using the ALPS, and then try to read it using the nice Panasonic drive, and vice-versa, if required.  If the ALPS can't read what the ALPS wrote, but the Panasonic _can_ read what the ALPS wrote, then there is a difference on the _read_ side of the Panasonic.

And the result is that the Panasonic _can_ read the ALPS written disk that the ALPS drive is refusing to read, although after about track 35 it is having some trouble, but that could just be that some of those tracks on the disk are a bit worn from my relentless testing of it.  

But anyway, this says that whatever the weird problem that has crept in is, it has something to do with the read behaviour. Maybe the RDATA pulses are longer or shorter on the different drives, for example.  Or maybe the voltages of them are different.  The good thing is that those are all quite testable things, as I can use the oscilloscope to probe both, and get to the bottom of this annoying little mystery.

Right, so the plot thickens: I just switched floppy cables to one with dual connectors for drives A: and B: on PCs, as it is convenient to poke oscilloscope leads into the 5.25" drive connector on it. But now it can't read _anything_, even with the good drive.  The cable is a bit longer. Maybe the issue is something with the level converter, and the extra length of cable is just enough to cause it grief. BUT that longer cable _does_ work with the ALPS drive (although still refusing to read, I can see the gap histogram, which I couldn't with the Panasonic on the same cable). And now after fiddling things again, the long cable is working with the Panasonic drive.  Maybe something was just loose.

So now to get the oscilloscope hooked up...

Both drives produce similar looking pulse shapes for the RDATA line. The main difference I can see is that the good Panasonic drive seems to have much more stable timing, perhaps suggesting that it just has better motor speed control.

Next step is to record what we write to a track as raw flux, and then compare that back with what we have written, and try to spot the kinds of errors that are occurring.

Advance two months...

Well, the rest of the year flew by pretty quickly, and it's now almost New Year's Eve, and I have finally had the chance to get back to this.

First up, I have found that there was a problem with writing after all: I had tried writing at the start of an index hole mark, rather than the end, and the floppy drive was considering this illegal, and not writing anything to the track at all. Thus I was reading back whatever was already on the track, not the fresh data.

I've fixed that, but I am still seeing very short gaps on the odd occasion.  The TIB is also not being written to the track at all, but we will deal with that separately.  My best guess is that the very fast FPGA is seeing glitching on the RDATA line from the floppy, which is causing some long gaps to get broken up by mistake.  If this theory is correct, then de-glitching the RDATA line by requiring it to go low for several CPU cycles will fix it. If that doesn't fix it, then the short gaps are most likely being created during the writing process.

According to floppy drive data sheets, the RDATA pulses should be >= 0.15usec = ~6 cycles at 40.5MHz.  If we require RDATA to go low for 4 cycles, that should be a safe amount, and still squelch any glitching from cross-domain or other effects.
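As a rough software illustration of that rule (the real thing lives in the VHDL, which is not shown here), the sketch below first repeats the cycle arithmetic and then applies a "must be low for 4 consecutive samples" filter to a list of RDATA samples taken once per clock. The function names and the sample list are made up for the example.

```python
CLOCK_MHZ = 40.5       # clock rate quoted above
MIN_PULSE_US = 0.15    # minimum RDATA pulse width from the data sheets


def pulse_cycles(width_us=MIN_PULSE_US, clock_mhz=CLOCK_MHZ):
    """Pulse width in clock cycles: microseconds times cycles per microsecond."""
    return width_us * clock_mhz


def deglitch(samples, min_low=4):
    """Report low only once the input has been low for min_low samples in a row."""
    out = []
    low_run = 0
    for s in samples:
        # Count consecutive low samples; any high sample resets the count.
        low_run = low_run + 1 if s == 0 else 0
        out.append(0 if low_run >= min_low else 1)
    return out


# A one-sample glitch followed by a genuine six-sample pulse.
samples = [1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1]
print(round(pulse_cycles(), 3))
print(deglitch(samples))
```

The one-sample dip disappears entirely, while the six-sample pulse still gets through, shortened by the three samples spent waiting for confirmation. That is why a threshold of 4 leaves some margin against a roughly 6 cycle minimum pulse.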

Okay, that has been done, and the problem persists, and I just checked with a Known Good Disk, and it doesn't exhibit the short gaps, so I am still suspecting that it is a write-based problem. So I have instrumented the VHDL to keep track of the actual WDATA interval time so that we can see if we are producing the problem ourselves.

The only other difference I can see is that the problem is happening on HD disks, but not on DD disks. That could just be coincidence, or it could be that the DD disks (or drives in DD mode) simply don't have the flux resolution to write the glitches. So time for another synthesis run...

That's done, and I can now verify that no F_WDATA pulse interval is shorter than 80 clock cycles at the DD 720K data rate -- which is correct. If I tune the data rate up and down, then the minimum gap length adjusts accordingly, so I am confident that I am reading it correctly.

... and yet I am still seeing short gaps being written.  Is it actually an artefact of writing DD data on an HD disk?  If so, then writing at HD rate should remove the problem. Or writing to a DD disk.  So, writing at HD data rate seems to work without any of these short gaps, adding some support to my theory.... and formatting a DD disk at DD data rate also works just fine, too (although some of the Track Info Blocks are not able to be read back).

So there _is_ a problem with writing to HD disks at DD data rate. I have since confirmed this 100% by taking an HD disk that refuses to make a nice format at DD rate and covering the hole on it: with the hole covered, it suddenly reads back reliably. This affirms my theory that the drive electronics are being switched to a different read recovery mode between DD and HD, and that this is messing things up at DD data rates.

Speaking with others from the team on Discord, we have come up with a sequence of events that results in corrupted DD disks.

DD DISK OR FAKE DD DISK USED

1. FORMAT in floppytest (option 3)
2. READ TEST (all green) in floppytest (option 2)
3. RESET with disk still inserted
4. HEADER"TEST",IDK
5. DIR
6. NEW
7. 0?
8. SAVE"TEST"
9. DIR     -looking good
10. SAVE"TEST2"
11. DIR      -this time still looking good
12. SAVE"TEST3"
13. DIR      -this time still looking good
14. RESET
15. DIR      -this time still looking good
16. NEW
17. 0?
18. SAVE"TEST4"
19. DIR      -this time still looking good
20. POWER OFF
21. REMOVE AND REINSERT DISK
22. POWER ON
23. DIR        -this time still looking good
24. NEW
25. 0?
26. SAVE"TEST5"    -this time rattling noise
27. DIR            -broken, mostly zeroed directory:

   0 "           "
   1 "
   0
   0
8704 DEL
3155 BLOCKS FREE

28. SAVE"1"
29. DIR         -file "1" is first entry in above broken directory!
30. SAVE"2"
31. SAVE"3"
32. DIR        -files 1, 2 and 3 appear on top:

   0 "              "
   1 "1"              PRG
   1 "2"              PRG
   1 "3"              PRG
   1 "
   0
   0
8704 DEL
3152 BLOCKS FREE

33. POWER OFF, DISK OUT, DISK IN, POWER ON
34. DIR              -magic! old files appear again:

   0 "TEST           " DK 1D
   1 "TEST"            PRG
   1 "TEST2"           PRG
   1 "TEST3"           PRG
   1 "TEST4"           PRG
   0 "TEST5"          *PRG
3156 BLOCKS FREE
 

One of the first things I found is that this doesn't seem to happen if the disk was formatted using floppytest and then given just a simple HEADER "TEST" command, instead of a HEADER "TEST",IDK command from BASIC.  The difference is that with the ,Ixx option, the disk is low-level formatted again, which means that the C65 ROM is doing the disk format, instead of the floppytest program.  This might be because of the C65 ROM format routine, or it might be from something else.

To help understand why, I am creating a derivative of the m65 command line tool that can remotely read the contents of a real floppy, without having to interrupt the machine.  This means it is possible to type a BASIC SAVE command, for example, and then see what actually changed on the disk, versus what should have changed.

It looks like the problem might be the C65 ROM has every track labelled as track 39 -- this would explain the track chattering symptom, as well as the rest of the problems. I'll have to ask @bitshifter to patch that, so that we can test again. If that fixes it, then that means we have DD disk behaviour all working, and I can (finally) finish fixing the HD disk stuff.

Speaking with @bitshifter, the C65 ROM still uses the old un-buffered method of formatting tracks. But we are seeing a Track Info Block being written, which only happens via the hardware assisted formatting.  If we assume that hardware assisted formatting is somehow being triggered, this would also explain why all tracks are marked as track 39, as the C65 ROM doesn't know to update the track number. We can confirm this theory by modifying the track number in $D084 during a C65 ROM format, which should result in changed TIB and sector header track numbers. And, yes, this is exactly what is happening.

So the question is how is automatic track formatting being triggered? The relevant track format code in the C65 ROM is pretty straight-forward:

993c a9 a1        lda #wtt+1       ;Erase track (fill with $4E gap bytes)
993e 20 58 9b     jsr CommandReg   ;necessary due to simulated index pulse!
9941 8d 81 d0     sta command      ;Begin formatting

9944 a0 10    10$ ldy #16          ;write post index gap 12 sync
9946 b9 7a 9a 15$ lda secdat-1,y
9949 be 8a 9a     ldx secclk-1,y
994c 2c 82 d0 20$ bit stata
994f 10 dd        bpl wtabort      ;oops
9951 50 f9        bvc 20$
9953 8d 87 d0     sta data1        ;Always write data before clock
9956 8e 88 d0     stx clock
9959 88           dey
995a d0 ea        bne 15$

995c a0 04        ldy #4           ;Write 4 header bytes
995e a2 ff        ldx #$ff
9960 b9 6d 01 25$ lda header-1,y
9963 2c 82 d0 30$ bit stata
9966 10 c6        bpl wtabort      ;oops
9968 50 f9        bvc 30$
996a 8d 87 d0     sta data1
996d 8e 88 d0     stx clock
9970 88           dey
9971 d0 ed        bne 25$

We can see that it is putting #$A1 into the command register.  The VHDL that looks at the commands looks like this:

              temp_cmd := fastio_wdata(7 downto 2) & "00";
              report "F011 command $" & to_hstring(temp_cmd) & " issued.";
              case temp_cmd is

                when x"A0" | x"A4" | x"A8" | x"AC" =>
                  -- Format a track (completely automatically)
                  -- At high data rates, it is problematic to feed the data
                  -- fast enough to avoid failures, especially when using
                  -- code written using CC65, as I am using to test things.
                  -- So this command just attempts to format the whole track
                  -- with all empty sectors, and everything nicely built.

                  f_wgate <= '1';
                  f011_busy <= '1';

                  -- $A4 = enable write precomp
                  -- $A8 = no gaps, i.e., Amiga-style track-at-once
                  -- $A0 = with inter-sector gaps, i.e., 1581 / PC 1.44MB style
                  -- that can be written to using DOS
                  format_no_gaps <= temp_cmd(3);

                  -- Only allow formatting when real drive is used
                  if (use_real_floppy0='1' and virtualise_f011_drive0='0' and f011_ds="000") or
                    (use_real_floppy2='1' and virtualise_f011_drive1='0' and f011_ds="001") then
                    report "FLOPPY: Real drive selected, so starting track format";
                    sd_state <= FDCAutoFormatTrackSyncWait;
                  else
                    report "FLOPPY: Ignoring track format, due to using D81 image";
                  end if;
                  
                when x"A1" | x"A5" =>
                  -- Track write: Unbuffered
                  -- It doesn't matter if you enable buffering or not, for
                  -- track write, as we just enforce unbuffered operation,
                  -- since it is the only way that it is used on the C65, and
                  -- thus the MEGA65.
                  -- (Conversely, when we get to that point, we will probably only
                  -- support buffered mode for sector writes).

                  -- Clear the LOST and DRQ flags at the beginning.
                  f011_lost <= '0';
                  f011_drq <= '0';

                  -- We clear the write gate until we hit a sync pulse, and
                  -- only then begin writing.  The write gate will be closed
                  -- again at the next sync pulse.
                  f_wgate <= '1';

                  -- Mark drive busy, as we should
                  -- C65 DOS also relies on this.
                  f011_busy <= '1';

                  report "FLOPPY: Asked for track format";
                  
                  -- Only allow formatting when real drive is used
                  if (use_real_floppy0='1' and virtualise_f011_drive0='0' and f011_ds="000") or
                    (use_real_floppy2='1' and virtualise_f011_drive1='0' and f011_ds="001") then
                    report "FLOPPY: Real drive selected, so starting track format";
                    sd_state <= FDCFormatTrackSyncWait;
                  else
                    report "FLOPPY: Ignoring track format, due to using D81 image";
                  end if;

We can see that $A1 leads to the non-automatic track formatting. But I also just spotted the bug :) Write a comment if you can spot it, too!
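One way to test a guess without waiting for a synthesis run is to model the decode in software. The sketch below mirrors only the `temp_cmd` assignment and the two `when` branches quoted above; the function name and the returned strings are made up for the example, and everything else the VHDL does is left out.

```python
def decode_f011_command(value):
    """Model which quoted branch an 8-bit write to the command register reaches."""
    # Mirrors: temp_cmd := fastio_wdata(7 downto 2) & "00";
    temp_cmd = value & 0xFC

    # Mirrors: when x"A0" | x"A4" | x"A8" | x"AC" =>
    if temp_cmd in (0xA0, 0xA4, 0xA8, 0xAC):
        return "automatic track format"

    # Mirrors: when x"A1" | x"A5" =>
    if temp_cmd in (0xA1, 0xA5):
        return "unbuffered track write"

    return "other"


# The value the C65 ROM writes at $993c.
print(decode_f011_command(0xA1))

# Which of the 256 possible writes can ever reach the unbuffered branch?
print([hex(v) for v in range(256)
       if decode_f011_command(v) == "unbuffered track write"])
```

Running it shows where the ROM's #$A1 actually ends up, and how many command values are able to reach the second branch.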

And with that, the main problem was solved. I still need to think about how I make the TIBs reliable at DD data rate on HD disks because of the drive electronics limitation, but I have ideas on how to do that.  But that can wait a bit, as we have bigger fish to fry prior to the release of the MEGA65! And besides, this blog post has already taken way too long. So with that, I wish you all a Happy New Year, and lots of retro fun.