Saturday, 1 January 2022

More work on HD floppies, RLL encoding, disk density auto-detection and other fun

[Gadzooks! This blog post has sat idle for ages while life and various other things happened.  But finally it is here!]

In the previous blog post, I described in perhaps excessive detail some of my adventures getting very high data density on normal 1.44MB floppies in the MEGA65's normal 1.44MB floppy drive.  We seem to be able to fairly easily get more than 2MB, which is pretty nice, but further boosts will most likely hinge on whether I can get RLL encoding to work, which in the best case will get us an extra 50% storage, thus allowing 3MB or so, i.e., more than a standard 2.88MB ED floppy.  But we have a number of barriers to address to get to that point:

1. We need to implement RLL encoding.

2. We need to fix some regressions that have been introduced in accessing DD disks, presumably because we are writing at the wrong density or on the wrong track.

3. Write pre-compensation is likely to be even more entertaining for RLL coding than for MFM, as we have to worry about six different peaks in the histogram being cleanly resolved from one another, compared with MFM's three.

I'll start by implementing the RLL encoder and decoder, as I am already part way there, and can't resynthesise until I get it mostly done. In addition, I have remapped a few of the low-level floppy registers to make space for RLL encoder selection, which means that my test program is now out of step with the older bitstreams. So it's all a bit of a mess that I need to work through. Let's dig straight into it!

For RLL encoding, I am implementing an RLL encoder and decoder that are parallel to the MFM ones, and selection logic to decide which one is connected to the disk drive at any point in time.  That way we can easily switch between the two.

One of the challenges with the RLL encoder is that we have variable-length bit groups that get encoded together, which can be 2, 3 or 4 bits long.  It is thus quite common for encoding to continue across byte boundaries. This is not a problem in itself, but it is a problem for the way that I have been writing the synchronisation marks: in MFM it was possible to assume that byte boundaries would be respected, and thus that the sync byte would get properly detected.  But in RLL this doesn't happen.

To solve this, I allow any partial bit pattern to run out when the next byte is a sync mark, basically pulling bits out of thin air to fill in the missing extra bits needed to complete the last symbol.  With that, plus fixing an error in my transcription of the RLL table into the VHDL, I now seem to be able to encode bytes and sync marks alike.  The next step is to make a decoder, and test it under simulation, to make sure we detect sector headers and data etc., as we should.
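
That flush-before-sync idea can be sketched in a few lines of Python.  This is a toy model, not the VHDL: it uses a hypothetical three-entry subset of the RLL2,7 table (the same entries as the worked $42 example below), and the order in which filler bits are tried is my own choice.

```python
from itertools import product

# Hypothetical subset of the RLL2,7 table (data bits -> channel bits).
RLL_SUBSET = {"010": "000100", "000": "100100", "10": "0100"}

def flush_before_sync(pending, table=RLL_SUBSET):
    """Complete a partial data group before a sync mark by appending
    made-up filler bits until it matches a table entry."""
    for n in range(4):  # try 0..3 filler bits, shortest first
        for pad in ("".join(p) for p in product("01", repeat=n)):
            if pending + pad in table:
                return table[pending + pad]
    raise ValueError("cannot complete group with this partial table")

print(flush_before_sync("01"))  # "01" + filler "0" -> "010" -> "000100"
```

The filler bits are thrown away on the read side, because the sync mark that follows resets the decoder anyway.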

So now to make the decoder, which is also a bit interesting, because of the way the codes get emitted.  For the MFM coder, we could just look at the previous bit and the length of the incoming gap.  But for the RLL coder things are fiddlier.  For example, if we wrote the byte $42, i.e., %01000010, this would be emitted as:


010      000100
000      100100
10       0100 

So the total bit sequence will be:  0001001001000100

The trouble is that there could be either 100 or 1000 from the previous byte on the front, and 1, 01, 001, 0001 or 00001 at the end, depending on the next byte.
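
The encoding walk-through above can be sketched in Python, using just the three table entries quoted (the real RLL2,7 table also has entries for the remaining bit groups, which are omitted here; the greedy prefix matching is safe because the groups are prefix-free):

```python
# The three RLL2,7 table entries from the $42 example above.
RLL_SUBSET = {"010": "000100", "000": "100100", "10": "0100"}

def rll_encode(bits):
    """Greedily match data-bit groups against the table and
    concatenate the corresponding channel codes."""
    out = []
    while bits:
        for group, code in RLL_SUBSET.items():
            if bits.startswith(group):
                out.append(code)
                bits = bits[len(group):]
                break
        else:
            raise ValueError("bit group not covered by this partial table")
    return "".join(out)

print(rll_encode("01000010"))  # $42 -> "0001001001000100"
```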

In any case, the first step is to build the gap quantiser, that converts the incoming pulses into a series of pulse lengths, i.e., the distance between each magnetic inversion, represented by a 1 in these sequences. So for the above, this should come out as 4, 3, 3, 4, 2 + something not yet known.
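
The gap quantiser's job on this example can be sketched like so (a toy model: the real quantiser works on counts of 40.5MHz clock ticks between inversions, not on bit strings):

```python
def gaps_from_bits(bits):
    """Distance between successive 1s (magnetic inversions) in the
    channel bit stream, plus the still-open gap after the last one."""
    gaps, run = [], 0
    for b in bits:
        run += 1
        if b == "1":
            gaps.append(run)
            run = 0
    return gaps, run

print(gaps_from_bits("0001001001000100"))  # ([4, 3, 3, 4], 2)
```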

I now have the gap quantiser working under simulation, and can simulate formatting a track, which results in plenty of predictable data to decode, and I can see that said data is being generated.  So now it's time to figure out how to decode RLL2,7 data from the series of gaps.

I'm thinking that the best way will be to turn the gaps back into a series of bits, and then chop the head off for any bit sequences that we can extract. When we hit a sync mark, we then just need to remember to discard all waiting bits, and discard the first two bits of the next gap as well, because of the trailing 00 at the end of the sync mark.
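
That chop-the-head-off approach can be sketched as follows, again using only the three table entries from the $42 example; the streaming return value (decoded bits plus still-pending channel bits) is my own framing of it:

```python
# Inverse of the three RLL2,7 entries used in the $42 example.
INV = {"000100": "010", "100100": "000", "0100": "10"}

def rll_decode(gaps):
    """Turn quantised gaps back into channel bits, then chop matching
    code words off the head; leftover bits wait for the next gap."""
    bits = "".join("0" * (g - 1) + "1" for g in gaps)
    out = []
    while True:
        for code, group in INV.items():
            if bits.startswith(code):
                out.append(group)
                bits = bits[len(code):]
                break
        else:
            return "".join(out), bits

print(rll_decode([4, 3, 3, 4]))  # ('010000', '01'): the final "10" of $42
                                 # resolves once the next gap arrives
```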

With the above approach, we need only a few IF statements in the VHDL to decode the bits, and some glue to make the bits emit one by one from the variable-length RLL bit blocks. So we should be ready to synthesise a bitstream, and see if we can see some RLL encoded data. Like this, for example:

 

Notice that there are more than 3 peaks: with RLL2,7, we should have peaks at 2x, 3x, 4x, 5x, 6x and 7x the time base.  The 6x and 7x peaks are too long to show up here.  The noise at low rates might in fact be funny artefacts from the higher intervals wrapping around, or alternatively a sign that the drive electronics don't like the gaps being that long.  If we speed up the data rate from DD to the HD rate, then we see a much better result:

We have the peaks, and they are looking very nicely resolved, and without any funny noise at the low end.  Note that the 7x peak is really short -- this is because a gap of 7 zeroes requires fairly specific bit patterns, which are not that common. This is also a good thing, because drives are much happier with shorter intervals than that.

It is possible to increase the data rate some more, but I will need to add in write pre-compensation as well, because the two shortest interval peaks do start to bleed into one another, which is, of course, bad.  I'll have a look and a think about that, because it might end up being better to make a custom RLL code that merges the two shortest interval peaks, in order to be able to increase the data rate higher than would otherwise be possible.

Here is about the best data-rate that is possible right now, with a divisor of $18 = 24 = 1.6875MHz:


We can see the left-shift of the 3x pulse, just like we saw with the MFM encoding with the 1.5x pulse, that is also the 2nd in the sequence. To give an idea of how much better this RLL encoding is at this frequency compared with MFM, here is the MFM equivalent, with the best write pre-compensation I have been able to do:

I.e., MFM encoding at that data rate is utter rubbish, and is probably not usable below a divisor of $1E = 30.  So we are getting a divisor of 24 with RLL, which means 30/24 = 5/4 = 125% of the MFM capacity with this RLL encoding, assuming that it actually works, which I will soon be able to test.  

Interestingly, it looks like write pre-compensation will be rather unlikely to make much difference with the RLL encoding, at least on track 0. Shifting the 2nd peak to the right a bit might get us one divisor lower, to /23 instead of /24, but it feels like it's really diminishing returns. Trying other tracks, I can see the split peaks again, so implementing write pre-compensation might help to preserve higher density onto more tracks, so it might well still be worth implementing.

More generally, I'm also curious to know how well perfect write pre-compensation would be able to help.  To test that, I could write a track with a constant stream of equally spaced pulses, and see how wide the spread of the peak is from that, as that should only have the non-data-dependent timing jitter.

It's the weekend again, and I have had a bit of time to work on this all again, including another twitch stream (sorry, the video isn't up yet, I'll update this with a link when it is), where I tracked down some problems with the RLL encoding. Specifically, I am facing a problem where the CRC unit takes a few cycles to update the CRC, but the RLL encoder buffers 2 bytes instead of 1, which means that it is trying to inject the CRC before it has been fully calculated.  Also, earlier on, this double-byte buffering itself had some problems.

There is a way to calculate the CRC in a single cycle, but it needs a larger look-up table which would take up space in the FPGA, and we don't really need the speed, we'd just like it. All that I need to do is to make the FDC encoder wait for the CRC to finish calculating before feeding bytes.  I have a fix for that synthesising now.

Meanwhile, I am still curious about how the RLL2,7 code was generated, and whether we can generate RLL3,x or RLL2,x codes, with x larger than 7, that the floppy drive can still read, but that would allow for higher densities.  RLL3,x codes are particularly interesting here, because with an RLL3,x code, there would be at least three 0s between every pair of 1s, which means that we can -- in theory at least -- double the data rate vs MFM, although other effects will likely prevent us reaching that limit. But for that to be effective, we need a nice short RLL3,x table, built along the same lines as the very clever RLL2,7 tables.

What is interesting about the construction of the RLL2,7 table is that it is very efficient, because it uses unique prefixes to reduce the overall code length. Without that trick, it would require more like 3 ticks per bit, instead of 2 ticks per bit. I know, because I made a brute-force table generator that works out the number of combinations of codes of given lengths that satisfy a given RLLx,y rule.  RLL2,7 requires at least two 0s between every 1, so the table is based around that: it looks at all possible codes of a given short length that satisfy that rule, and attaches two 0s to the end, to make sure that they are guaranteed to obey the RLL2,7 rule when joined together in any combination.  

So maybe we can do something like that for RLL3,10, which I think is probably about the sweet spot. We'll initially try for something that is only as efficient as the RLL2,7 code, but using units of 3 bits, instead of 2:

001000

010000

100000

That gets us 3 combinations using 6 bits, which is a fair start, since the idea is that we want to use no more than 3n bits. So let's add some more 3-bit units that can't be confused with any of those three above, of which there are the following six:

100010000

100001000

010001000

000100000

000010000

000001000

This still keeps us within RLL3,10, but only just, as two of those together can result in exactly 10 0s between adjacent 1s.
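
This enumeration is easy to mechanise. Here is a sketch of a brute-force generator along the lines described above for RLL2,7 (my own code, not the generator mentioned earlier); for RLL3,x it reproduces both the three 6-bit units and the six longer units listed above:

```python
from itertools import product

def min_zeros_ok(s, d):
    """True if every pair of 1s in s has at least d zeros between them."""
    last = None
    for i, b in enumerate(s):
        if b == "1":
            if last is not None and i - last - 1 < d:
                return False
            last = i
    return True

def free_codes(length, d):
    """Codes of `length` bits, containing at least one 1, obeying the
    minimum-zeros rule, and ending in d zeros so that any concatenation
    of them still obeys it."""
    return sorted("".join(p) + "0" * d
                  for p in product("01", repeat=length - d)
                  if "1" in p and min_zeros_ok("".join(p) + "0" * d, d))

six = free_codes(6, 3)
print(six)  # ['001000', '010000', '100000'] -- the 3 units above
# Longer codes whose head can't be confused with a 6-bit code:
nine = [c for c in free_codes(9, 3) if c[:6] not in six]
print(len(nine))  # 6 -- the six longer units listed above
```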

With 9 combinations, we could encode 3 bits using those, but we would like constant-length encoding, like RLL2,7 does, because data-dependent encoding length is a real pain to work with.  But unfortunately the number of cases here doesn't work for that.  So let's start with a 5-bit prefix, and try to encode more bits:

10000000

01000000

00100000

00010000

00001000

10001000

Ok, so we have 6 combinations using 8 ticks.  To make it at least as good as RLL2,7, we need to use fewer than 3 ticks per bit on average, but to keep the length constant, we need to find a multiple of 3 that is a power of two if we do it this way... of which there are none, since a power of two is never divisible by 3. So I'll just have to think on all this a bit more. So let's go back to that bitstream...

With the new bitstream, we can now write a valid TIB and RLL2,7 formatted sector headers, because I can verify their correctness from the written data. However, it refuses to parse even the TIBs, let alone the RLL sectors.  My best guess is that the CRC feeding on the decoder side is also having problems, although that doesn't really explain why the TIBs refuse to read. 

Also, digging through the source of sdcardio.vhdl and mfm_decoder.vhdl, the last sector seen under the head should still be updating, even if the CRCs were bad -- which they are not. It's almost as though the RDATA line from the floppy is not connected -- but it is, because I can display histograms, which means that the gap counting logic is reading the RDATA line.  The data rate for the TIB decoder is locked at divide by 81, the rate for DD disks, and to MFM encoding, so neither of those should be blocking TIB detection.  

So where is the signal getting to, and where is it getting stuck? We have some debug tools in this bitstream still to help. $D6AF returns mfm_quantised_gap from the mfm_decoder. It turns out it actually contains different information from the mfm_quantised_gap, but that's okay. The main thing is that it is changing value continuously, showing that gaps are being detected and quantised.  Perhaps it's that the synchronisation marks are not being passed through correctly? That could stop everything from working.  

The sync signal is selected from either the MFM or RLL27 decoder, and so it is possible with all this fiddling about that it has been messed up.  Although even that would be a bit odd, given that it simulates fine.  I'll modify the bitstream so that we can see if either the RLL and/or MFM synchronisation marks are detected -- just in case it is some funny problem with synthesis of the design.

And I might just have found the problem: the sensitivity list for the process in the VHDL file where the selection between RLL and MFM signals happens was missing all those signals, so it might well have all been optimised out during synthesis, which would result in exactly what we are seeing.

Nope, that wasn't it, either. So let's work out the commit where it stopped working, and figure out the differences from that.

Bitstream from commit 9071b54 works, but the bitstream from commit 2c32a9e  doesn't work. After digging through the differences, I have found the problem: The CRC of the Track Info Blocks is incorrectly calculated when reading them, and thus they aren't accepted.  The good thing is that this shows up under simulation, so can be fixed. Similar problems probably occur when hitting the sector header blocks, which is why sectors aren't being identified.  

Specifically, it looks like the first of the three sync marks is not being fed into the CRC calculation. This is because we assert crc_reset at the same time as crc_feed, when we encounter the first sync byte.  Now synthesising a fix for that, after confirming in simulation that it works.

Totally unrelated to the floppy, also included in this synthesis is support for switching between 6581 and 8580 SIDs, using the updated SID code from here. More on that in another blog post later, most likely.

Back to the RLL floppy handling, that fix has now got sector headers being detected, and I am able to cram up to 40 sectors per track in Amiga-style track-at-once at a divisor of $17 = 23, or 36 sectors per track if sector gaps are included, and the peaks are quite well resolved still at that divisor, although the double-peaking that is symptomatic of the need for write pre-compensation is clearly visible:

 

Note that this is without any write pre-compensation. It's also without testing whether we can actually read the sectors reliably. So it might well be a fair appraisal of what will be reliable after I implement write pre-compensation for the RLL2,7 encoding.  For comparison, the best stable divisor for MFM encoding with write pre-compensation is around $1C = 28, which gets 30 sectors per track with gaps, or 33 without gaps.  So it's an improvement of 40/33 ≈ 121% over MFM, before we implement write pre-compensation.  Not the full 150% that we should theoretically be able to obtain, if it is only the shortest gaps that are a problem, but still a welcome boost.  

Over a whole disk, we can probably count on averaging 34 sectors per track with gaps, or 38 without. This means that we can now contemplate disks with capacities of around:

With Gaps, 80 tracks = 34x80x2x512 = 2.72MB

With Gaps, 84 tracks = 2.86MB

Without Gaps, 80 tracks = 38x80x2x512 = 3.04MB

Without Gaps, 84 tracks = 3.19MB
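
These figures can be sanity-checked with a few lines of Python. The MB here is the floppy-marketing megabyte of 1000 x 1024 bytes -- the convention under which a 1,474,560-byte disk is "1.44MB":

```python
SECTOR_BYTES = 512
SIDES = 2

def capacity_mb(sectors_per_track, tracks):
    """Disk capacity in 'floppy marketing' MB (1 MB = 1000 * 1024 bytes)."""
    return sectors_per_track * tracks * SIDES * SECTOR_BYTES / (1000 * 1024)

print(round(capacity_mb(34, 80), 2))  # 2.72 (with gaps, 80 tracks)
print(round(capacity_mb(38, 84), 2))  # 3.19 (without gaps, 84 tracks)
```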

In short, we have reached the point where we are able to cram more data on an HD disk than an ED disk officially holds... provided that I can read the sectors back reliably.  

Which I can't yet. In fact, I can't read any of them back. I am suspecting that I have another CRC related problem, so will see if a CRC error shows up in simulation of reading a sector body. And, yes, the sector body CRCs are wrong.  Bug found and fixed, and synthesising afresh. Again.

And that has sectors reading! Now to tweak the floppytest program to format RLL disks correctly, and to test reading all of the sectors, and see how far we can safely dial things up.  That showed a bug in the TIB extraction, where the sector count was always being read as $00, which I am now fixing -- and is now fixed. 

Now it's time to think about write pre-compensation, so that I can maximise the reliable data rate: right now when reading sectors, a data rate divisor of 25 = 1.62MHz is not quite reliable on track 0, and certainly falls apart by about track 35 or so.

As part of this, I might follow up on something that occurred to me on the ride to work this morning: my PLL that tracks the floppy data clock when reading from real disks is probably sub-optimal. It synchronises exactly when it sees a pulse, rather than working out if the pulse is a bit early or late vs the expected time, and taking that into account. This will be causing gaps to seem over-large or too small, when the magnetic phenomena occur that push short gaps together and spread out the longer gaps between them. If I can fix this up, it should sharpen the peaks in the histograms quite a bit, and allow higher data rates, which would be nice.

So I might take a brief excursion into analysing the behaviour of the gaps and think about PLL algorithms that will not cause this double-addition of gap variance, but rather, ideally, be able to cancel some of it out.  The first step is to take a raw track capture, and take a look at the gaps in it, and try simulating various PLL algorithms on it, to see what we can do. 

Part of why I think this has considerable potential is that the histogram peaks for the gaps are really quite tight until the data rate is increased beyond a certain point, after which they start getting quite a bit wider. I think this is a direct result of measuring each gap from the time of the previous pulse, which will itself tend to be shifted at the higher data rates, but not at the lower data rates, because the shifting only becomes noticeable above some threshold data rate.

The trick is to make a simple algorithm for tracking the pulses and error between expected and actual time of appearance.  And the methods need to be simple enough for me to fairly easily implement.  

Some low-hanging fruit is to look at the error on the previous pulse interval, and assume that all (or half) of that error is due to the most recent pulse being displaced. We then adjust the nominal start time of the next interval to the modelled arrival time of the pulse, if it had not been displaced in the time domain.  This will stop pulses that are squashed together from having both ends of the squish deducted from their apparent duration.  That is, we carry forward the error in arrival time, so that we can compensate for it. The trick then is that we need to allow for true error, i.e., drift in the phase. We can do this by checking how often a pulse arrives late versus early: whenever there are too many lates or earlies, we adjust the phase by one tick in the appropriate direction. In other words, we track the average frequency over time, and attempt to correct individual pulse intervals based on recent deviations from the average frequency.
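
The carry-forward part of that can be sketched as a toy model (the function name and the half-error carry fraction are my own choices, and this omits the long-term phase-drift tracking):

```python
def quantise(gaps, nominal, carry_frac=0.5):
    """Quantise measured gap lengths to whole numbers of clock cells,
    carrying part of each pulse's timing error into the next interval."""
    out, correction = [], 0.0
    for g in gaps:
        adjusted = g + correction          # undo the previous pulse's shift
        n = round(adjusted / nominal)      # nearest whole number of cells
        error = adjusted - n * nominal     # how late (+) or early (-) it was
        correction = carry_frac * error    # assume part of that is displacement
        out.append(n)
    return out

# A pulse arriving 2 ticks late makes one gap read long and the next short:
print(quantise([22, 28, 31], 10))  # [2, 3, 3]
```

The point is that a late pulse stretches one measured gap and squashes the next; carrying the error forward stops the same displacement being counted against both intervals.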

Initial testing on the captured RLL encoded track data suggests that it should help. In particular, it seems to be reducing the maximum magnitude of the error in arrival time, which is exactly what we need if we want to narrow those peaks.  This should operate nicely in addition to any write pre-compensation, because it is all about fixing the residual error that the write pre-compensation didn't fix.  

In the process of all this, it is clear that these ALPS drives keep the motor speed pretty accurate, probably to within +/-1% of the nominal speed, which encourages me that we should be able to get higher data rates out of them, once I have implemented the improved PLL (which will help both MFM and RLL), and then implemented write pre-compensation for the RLL encoding.

I also realised that the encoder for both MFM and RLL is calculating the data rate differently during the read and write phases: During reading, it is exactly correct, but during writing I am counting from the divisor down to 0, i.e., each interval is one cycle longer than it should be.  This will be causing problems during decoding at higher data rates, because all the bits will be slightly shifted in time, and errors will be more likely.  This is quite easy to fix, and should help (a little) with the maximum achievable data density as well.

I think the next step is probably for me to find a way to simulate reading from a real track of data, so that I can work on the PLL improvements and check that they are really functioning.

I have in the meantime been working on the PLL algorithm some more, and now have it checking against the last 8 pulses, and if the current pulse is exactly timed to an integer number of gaps (+/- one clock cycle), then it counts those up, and the gap size that has the most votes is accepted as exact. That squelches a large fraction of the errors.  For cases where this is not possible, then I take the average of the gap as base-lined against those last 8 pulses, which tends to help reduce the error a little -- but rarely completely.  This algorithm will all be very easy to implement in hardware, as it is just addition, subtraction and 3 right shifts.

The other thing I have noticed is that the shortest gaps almost always get stretched, rather than compressed, for some reason.  The longer gaps tend to have much less error on them, when compared with the shortest n=2 gaps.  So it probably makes sense to allow a bit more slop on the shortest gaps, and in return slightly narrow the lower bound for n=3 gaps.  Maybe allow up to 2.625 instead of 2.5 time units to be counted as an n=2.

This is all with non write pre-compensated RLL2,7 data.  Some of these residual errors can likely be reduced if I implement the write pre-compensation.  But this "read post-compensation" certainly seems to be much more robust than my previous naive approach, and I'm hopeful that it will allow higher data rates -- which I can test by writing such a track, and then testing against it.  

I can also use this test program against MFM data, so I can test my theory about write pre-compensation working well with it. But that will have to wait until the weekend, as will testing higher data rates -- which I am excited to see if it will handle.  If I go much faster, the resolution of my captures at a sample rate of 13.3MHz will actually start to be the limiting factor, which is pretty cool.

Another week has gone by, and I have write pre-compensation working for RLL2,7, which has helped, but I haven't gotten around to implementing the read post-compensation stuff.  I'm going to defer that for now, as I want to just close-out the RLL and variable data rate and Track Info Block stuff, and merge it back into the development branch.

So first up, the good news is that write pre-compensation with RLL2,7 encoding works quite nicely. At the really high data rates, it looks like having the write pre-compensation dialled up high for the 2nd coefficient works better. That's the case where the difference in gaps is the greatest, and it makes sense that it works that way. But making sense is not something I take for granted with floppy magnetics.

So let's take a look at the impact of write pre-compensation with RLL2,7 at high data rates.  First, let's see what divisor $17 and $15 look like without any write pre-comp:

Basically we can see that divisor $17 = 23 = 40.5MHz/23 = 1.76MHz is a bit marginal, while divisor $15 = 21 = 1.93MHz is basically rubbish. So let's now turn on some write pre-compensation:

That cleans up divisor $17 quite nicely, and that's probably quite usable like that. If we then move to divisor $15, i.e., the faster data rate = higher data density, then it is kind of okay, but the clear area between the peaks is noticeably narrowed:

So that's probably not usable.  However, when we increase the 2nd pre-comp coefficient from 5 = ~125ns to 8 = ~200ns, then it cleans up quite a bit, to the point where it looks probably about as good as divisor $17 did with the more conservative pre-comp values:


These pre-comp values seem to be fine for HD and above data rates (divisors $28 and smaller). Above that, they spread the peaks rather than sharpen them. But at those slower rates the separation between the peaks is pretty extreme anyway.

Well, time is really flying at the moment, for a variety of reasons. It's now a couple of weeks later, and I have done another couple of live streams: https://www.youtube.com/watch?v=MOEPXSAW08g and https://www.youtube.com/watch?v=n5sfdv7K8Zw

To summarise the progress in those streams, I tried a pile of floppy drives I have here -- including a 2.88MB ED drive (admittedly in HD mode, because I have no ED disks or ED-drilled HD disks).  Of the drives that worked (a bunch were duds), almost all could read the same data densities, and had similar performance for RLL encoding as the ALPS drive in my MEGA65, thus giving me confidence that this crazy high-capacity RLL storage encoding is likely to work on people's machines -- certainly the production machines, which will have ALPS drives, but also the vast majority of the 100 DevKits that have random PC floppy drives in them.

One discovery during this process was that one of several Panasonic drives from my pile was a lot better than the rest, and seemed to be reliable at data rates around 10% or more faster than the other drives could handle. It will be interesting to investigate this later on to find out why. 

But back to finishing the RLL HD stuff: I then started testing behaviour at DD as well as HD, to check for regressions. And there is some problem where sectors are not being written back properly.  To debug this, I really need to be able to read whole tracks of data in an exact manner.  

This is a bit tricky, because the pulses we have to log can occur at >2MHz, so doing this in a tight loop is a bit of a pain.  So what I have decided to do, is to make a mode for the MEGA65's DMA controller, that reads the raw flux inversion intervals, and writes them to memory.  This has the added benefit that it will make it quite easy to write software for the MEGA65 that can read arbitrary other disk formats -- like Amiga disks, for example.

This _should_ be quite simple to implement, but it keeps causing me problems for some reason. I have it counting the number of 40.5MHz clock cycles between each inversion.  But my MFM/RLL decoder test program is failing to find the 3x SYNC sequences.  The Track Info Block is always written at divisor $51 = 81 decimal, and thus we should see gaps approximately 160, 160, 120 and 160 = $A0, $A0, $78, $A0 long for the TIB's SYNC marks.  The start of a captured track that is written using RLL and a divisor of $16 looks like:


00000000: 00 ef 6f 83 6d 6d 85 6a 6f 82 6c 6f 83 6c 70 80    @ooCmmEjoBloClp@
00000010: 6f 6b f0 70 82 ff 6a 6e 84 6e 6d 4e 37 84 a1 a1    okppB~jnDnmN7Daa
00000020: a1 a2 ff a0 a2 ff a2 a1 a2 a0 a2 a0 a2 a0 a2 a2    ab~`b~bab`b`b`bb
00000030: a2 9e a3 9f a3 a0 a3 9f a3 9f a3 a0 a2 a0 ff a3    b^c_c`c_c_c`b`~c
00000040: a0 a2 ff a2 a0 a3 9f a3 9f a3 a0 a2 a0 a2 a0 a2    `b~b`c_c_c`b`b`b
00000050: a0 a3 a0 a2 9f a3 a0 a2 9f a4 9f a2 a0 a3 9f a3    `c`b_c`b_d_b`c_c
00000060: 9f ff a0 a2 a0 a2 a1 a1 a1 a2 ff a0 a2 a0 a2 a0    _~`b`baaab~`b`b`
00000070: ff a1 a0 a3 a0 ff 9f a2 a1 a2 9f a3 9c f7 ff f0    ~a`c`~_bab_c\w~p
00000080: ff f8 91 ff ed ff 94 ff f0 ff f0 ff 98 f8 ef ff    ~xQ~m~T~p~p~Xxo~
00000090: f8 98 a3 a0 a2 9f a3 a0 a2 9c f7 ff 97 f7 f6 f4    xXc`b_c`b\w~Wwvt
000000a0: 9b a0 9f f8 ff a2 a0 a2 9c ff 98 9e ff f9 93 ff    [`_x~b`b\~X^~yS~
000000b0: ff d8 9b ff 97 fc 9b a0 a2 ff ff a1 a1 a2 a0 a2    ~X[~W|[`b~~aab`b
000000c0: 68 47 41 41 c5 41 82 41 42 40 42 3f 42 f3 6d 84    hGAAEABAB@B?BsmD
000000d0: 6b 6c 43 b1 3f 40 b1 40 f3 40 41 58 56 6c 43 40    klCq?@q@s@AXVlC@
000000e0: 41 41 41 41 41 41 41 41 41 41 3f 41 73 42 3d 41    AAAAAAAAAA?AsB=A
000000f0: b1 c5 44 54 6d f3 57 96 6e 6d 84 6b ff 6d 85 6b    qEDTmsWVnmDk~mEk

The first 12 lines show values around $A0, which is suspiciously close to the long gap (= 2x nominal signal rate) in MFM, before in the last four lines it switches to much more variable values, with a floor near $40, i.e., around $16 x 3 (= $42), which correlates with the switch to RLL at that divisor.

So why can't we see our MFM sync marks in the first part?  I know that they must be there, because the floppytest.c program is able to read them.  

Fiddling about, I am seeing a funny thing, though, where I can't reliably read data from the drive until I have stepped the head.  It can even be stepped backwards from track 0. But without it, I don't reliably read the track: I can still see all the peaks for the gaps, but something is wonky, and the MEGA65's floppy controller doesn't see the sectors.  

Ah, found the problem: The histogram display of the tracks that I was using, doesn't enable automatically setting the floppy controller's data rate to the speed indicated by the TIB.  With that enabled, the sector headers get found fine.  But that's on the running machine, not on my captured data, which is still refusing to read correctly.  And which is reading the raw flux, so doesn't care what the floppy controller thinks.

Okay, I think I know what the problem is: The MFM gaps are effectively 2x to 4x the divisor long, not 1x to 2x, because of the way that MFM is described. Thus at a divisor of $51 = 81, this means that the gaps will be between 81x2 = 162 and 81x4 = 324.  But 324 is >255, so we can't reliably pick them out with our 8-bit values.  To fix that, I need to shift the gap durations one bit to the right, which I am synthesising now.  Hopefully that is the problem there.
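
The overflow is easy to confirm with a couple of lines of arithmetic (a quick sanity check, not MEGA65 code):

```python
DD_DIVISOR = 0x51                    # 81 decimal: the fixed TIB/DD MFM divisor
gap_min, gap_max = 2 * DD_DIVISOR, 4 * DD_DIVISOR
print(gap_min, gap_max)              # 162 324: 324 overflows an 8-bit count
print(gap_min >> 1, gap_max >> 1)    # 81 162: halved, everything fits in a byte
```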

Meanwhile, while that is synthesising, I also looked at adding support for DMA jobs >64KB long, since there can be more than 64K flux transitions per track. This will be needed for reading HD Amiga disks, for example, as well as for letting me debug whole tracks of data. 

The divide by two on the counts has helped, in that I can now see some sync marks, but there are still things being borked up, such that I don't see all 3 sync marks in a row. This is what it looks like now:

00000000: 00 81 48 80 48 a9 49 c6 51 50 50 51 50 51 a2 50    @AH@HiIFQPPQPQbP
00000010: 51 50 51 50 a2 50 51 50 50 51 50 51 50 51 50 51    QPQPbPQPPQPQPQPQ
00000020: 50 51 a1 a2 50 51 50 51 50 50 51 51 50 51 4f 51    PQabPQPQPPQQPQOQ
00000030: 50 51 50 51 50 51 50 51 50 51 50 51 50 51 50 51    PQPQPQPQPQPQPQPQ
00000040: 4f 51 50 51 50 51 50 52 4f 51 50 51 50 51 50 51    OQPQPQPROQPQPQPQ
00000050: 50 a1 52 4f 51 50 51 4f 51 50 51 4f 51 50 51 50    PaROQPQOQPQOQPQP
00000060: a1 52 4f 51 50 51 4d 1f 77 a4 7c 49 a8 75 a5 7a    aROQPQM_wd|Ihuez
00000070: 4a a7 76 a4 77 a5 4b 7e 76 a4 7c 4c 51 4f 51 50    JgvdweK~vd|LQOQP
00000080: 51 4d 7d a2 7d 47 7c 21 4a 51 a2 4e 80 4b 52 4c    QM}b}G|!JQbN@KRL
00000090: a8 4c f6 4e ce 49 7d a3 79 7b 4c 4f 81 4a 51 51    hLvNNI}cy{LOAJQQ
000000a0: 51 50 51 50 51 50 51 50 51 a2 50 51 50 51 50 51    QPQPQPQPQbPQPQPQ
000000b0: 50 52 4f 51 50 51 4f 52 50 51 4d 7f 79 7b 4c 4e    PROQPQORPQMy{LN
000000c0: 7d 7a 78 7c 4c 4d a5 a5 1b 7b 48 a8 76 a4 7c 48    }zx|LMee[{Hhvd|H
000000d0: a8 75 a5 7b 4d 50 50 50 a2 4e 7f 4d 50 51 50 51    hue{MPPPbNMPQPQ
000000e0: 50 51 f1 52 51 50 4d 7a 4c 51 51 50 cd 7c 4b 51    PQqRQPMzLQQPM|KQ
000000f0: 50 4f 7b a6 4d 50 50 50 4c a6 a1 a6 4b 51 51 4d    PO{fMPPPLfafKQQM

All the values are halved from before, as mentioned. So we now see lots of ~$51 values, which we expect. There should be 12 bytes x 8 bits = 96 = $60 of them before we write the sync marks.  The sync marks should look like $A0 $78 $A0 $78, or thereabouts, and we can see some things that look like this in the above, marked in bold (note that this is from a DD formatted track, so we continue to see similar numbers below).

The underlined part looks to me like it _wanted_ to be a sync mark, but the first gap got chopped up into little pieces. So something is still really rogered up.
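For anyone wanting to scan dumps like the one above programmatically, here is a sketch in C of matching the $A0 $78 $A0 $78 sync signature in a buffer of gap-duration bytes, where ~$50 is the shortest legal MFM interval. The tolerance value is my own guess at reasonable timing slop, not something from the actual decoder:

```c
#include <assert.h>
#include <stdlib.h>

/* Look for the MFM sync mark signature in a buffer of gap-duration
   bytes.  With ~$50 as the shortest MFM interval, a sync mark reads as
   roughly $A0 $78 $A0 $78 (2.0x, 1.5x, 2.0x, 1.5x).  The +/- tolerance
   below is an assumption, not taken from the real decoder. */
static int find_sync(const unsigned char *gaps, int n)
{
  const int expect[4] = {0xA0, 0x78, 0xA0, 0x78};
  const int tol = 8; /* assumed timing slop */
  int i, j;
  for (i = 0; i + 4 <= n; i++) {
    for (j = 0; j < 4; j++)
      if (abs((int)gaps[i + j] - expect[j]) > tol) break;
    if (j == 4) return i; /* offset of first sync gap */
  }
  return -1; /* no sync signature found */
}
```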

Weirdly, if I instead read the $D6A0 debug register, which lets me read the same RDATA line from the floppy drive that we are using to calculate these gaps, then the sync marks get properly detected.  What I am fairly consistently seeing is a sequence something like a8 76 1f 49 ef, instead of the A0 78 A0 78 50 type sequence (the extra 50 at the end is a single gap that gets written after each sync).  So something is getting really confused, in that the gap boundaries are being read at the wrong locations. And it's the last 3 of those that do it:

A0 78 50 versus

20 50 F0

But as I say, if I read _the same signal_ via a different register, I am getting a different result.

I'm just going to have to sleep on this, I think.

Problem found and fixed!  It was a clock domain crossing problem. Which is fancy FPGA programmer speak for "the signal I was reading was produced by something on a different clock, and if you don't treat it nicely enough, you get all manner of weird glitching".  The read signal from the floppy is not generated using the same clock as the MEGA65's CPU, and thus this problem happens. The solution is to add extra flip-flops to latch the signal more cleanly.  In reality clock domain crossing is _much_ more complicated than I have described here, but you get the general idea.
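As a software analogy of the fix (an illustration of the idea only, not the actual MEGA65 VHDL), the extra flip-flops behave like this: the downstream logic only ever sees a value that has been registered twice in its own clock domain, so it always samples something stable:

```c
#include <assert.h>

/* Toy model of a two flip-flop synchroniser.  The asynchronous input
   is passed through two registers clocked in the destination domain,
   so the consuming logic never sees a half-settled value. */
struct synchroniser { int stage1, stage2; };

static int sync_tick(struct synchroniser *s, int async_in)
{
  int out = s->stage2;   /* logic consumes the doubly-registered value */
  s->stage2 = s->stage1; /* second flip-flop */
  s->stage1 = async_in;  /* first flip-flop samples the async signal */
  return out;
}
```

A change on the input only appears at the output two clock ticks later, which is the price paid for the clean sampling.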

So I can now happily spot the sync marks again, decode the TIB, etc. But I am getting CRC errors reported in the TIB among other places, and it looks like we are skipping bytes, which I think is related.  I was fiddling about with fancy post-read timing corrections in my mfm-decode.c program, so it's possible that I have just messed that up.

Certainly something in the decoder is messed up, because where it misses bytes, it is indeed processing lots of gaps, more than it should.  I'll have to see how that is happening.  Found the problem: I had an error in my RLL decoding table.  With that fixed, I can now read the saved data, and all the sectors are being returned with no CRC errors.  
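The post doesn't show the actual table used, but for reference, the textbook RLL(2,7) code has exactly the variable-length 2-, 3- and 4-bit data groups described earlier, each mapping to twice as many channel bits. A sketch of a greedy group encoder over that standard table (the MEGA65's own table may differ):

```c
#include <assert.h>
#include <string.h>

/* The textbook RLL(2,7) code table: 2, 3 and 4 bit data groups map to
   4, 6 and 8 channel bits respectively, which is why encoding runs
   across byte boundaries. */
struct rll27_entry { const char *data; const char *channel; };

static const struct rll27_entry rll27_table[] = {
  { "10",   "0100"     },
  { "11",   "1000"     },
  { "000",  "000100"   },
  { "010",  "100100"   },
  { "011",  "001000"   },
  { "0010", "00100100" },
  { "0011", "00001000" },
};

/* Match the next data group at the front of a bit-string and return
   its channel bits.  The data groups form a prefix-free set, so at
   most one entry can match. */
static const char *rll27_encode_group(const char *bits, int *consumed)
{
  int i;
  for (i = 0; i < 7; i++) {
    size_t len = strlen(rll27_table[i].data);
    if (strncmp(bits, rll27_table[i].data, len) == 0) {
      *consumed = (int)len;
      return rll27_table[i].channel;
    }
  }
  return 0; /* invalid input bit pattern */
}
```

The channel bit-streams all keep at least 2 and at most 7 zeroes between ones, which is where the extra density over MFM comes from.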

With a 64KB transfer, it is returning 7 sectors worth of data.  This is why I added the >64KB DMA job support.  With 128KB, we will be able to read an entire DD track, which will let me debug exactly what I am botching when writing a sector. Thinking about it more, I can actually do a longer fetch, if I do it into the 8MB HyperRAM, ensuring that even at the highest data rates, I can capture a whole track. 

So now I should be able to format a disk at normal DD 800KB, and then try saving something to it, and seeing how it gets messed up at that point.  My best guess is that the data rate or encoding mode will be incorrectly set for some reason.

Formatting from the latest patched ROM seems to be using the new formatting routines, but sets the track number to 39 in the Track Info Block of every track, which then causes problems reading back saved files -- although the directory and BAM update fine.  So it is possible that an update to the ROM will fix that.  I might even be able to find the part of the ROM where it does the format and patch it myself.

In the meantime, I am formatting the disks myself using the floppytest.c program, which does set up the TIB data correctly (well, after I fixed a problem with the sectors per track).  However, oddly, reading sectors from the 2nd side of the disk is having problems.  The sector headers are visible, but it hangs forever.

Fortunately I can use my new track reader code to read that whole track, and check if I think the data is valid or not. It is showing incorrect data in the sectors, and failing sector body CRCs. I thought I would compare it with the first side of the disk, which reads fine, but that is also showing similar errors.  So there is something weird going on here.

Hmm... If I use the ALPS drive in my MEGA65, it has this problem, but if I use the nice Panasonic drive, it doesn't. It _looks_ like there are extra magnetic inversions being inserted where they shouldn't be during writing.  Reading doesn't seem to be affected.  I had added an extra write path for the floppy under CPU DMA control, so I am reversing that out, in case it's the problem. But the problem is manifesting even with older bitstreams. So I am wondering if I haven't half-fried one of the level converters for the floppy interface on my MEGA65, or something odd like that.  I'll wait for that bitstream to build, and maybe see if I can get a DevKit owner or two to try it out for me as well, and work out my next steps from there.

Overnight it occurred to me that there is a simple test I could do, to work out if it is the writing side or the reading side that is the problem:  Write the disk using the ALPS, and then try to read it using the nice Panasonic drive, and vice-versa, if required.  If the ALPS can't read what the ALPS wrote, but the Panasonic _can_ read what the ALPS wrote, then the difference between the drives is on the _read_ side, not the write side.

And the result is that the Panasonic _can_ read the ALPS-written disk that the ALPS drive is refusing to read, although it has some trouble after about track 35. That could just be that some of those tracks on the disk are a bit worn from my relentless testing of it.

But anyway, this says that whatever the weird problem that has crept in is, it has something to do with the read behaviour. Maybe the RDATA pulses are longer or shorter on the different drives, for example.  Or maybe the voltages of them are different.  The good thing is that those are all quite testable things, as I can use the oscilloscope to probe both, and get to the bottom of this annoying little mystery.

Right, so the plot thickens: I just switched floppy cables to one with dual connectors for drives A: and B: on PCs, as it is convenient to poke oscilloscope leads into the 5.25" drive connector on it. But now it can't read _anything_, even with the good drive.  The cable is a bit longer, so maybe the issue is something with the level converter, and the extra length of cable is just enough to cause it grief. BUT that longer cable _does_ work with the ALPS drive (although it still refuses to read, I can at least see the gap histogram, which I couldn't with the Panasonic on the same cable). And now, after fiddling with things again, the long cable is working with the Panasonic drive.  Maybe something was just loose.

So now to get the oscilloscope hooked up...

Both drives produce similar looking pulse shapes for the RDATA line. The main difference I can see is that the good Panasonic drive seems to have much more stable timing, perhaps suggesting that it just has better motor speed control.

Next step is to record what we write to a track as raw flux, and then compare that back with what we have written, and try to spot the kinds of errors that are occurring.

Advance two months...

Well, the rest of the year flew by pretty quickly, and it's now almost New Year's Eve, and I have finally had the chance to get back to this.

First up, I have found that there was a problem with writing after all: I had tried writing at the start of an index hole mark, rather than the end, and the floppy drive was considering this illegal, and not writing anything to the track at all. Thus I was reading back whatever was already on the track, not the fresh data.

I've fixed that, but I am still seeing very short gaps on the odd occasion.  The TIB is also not being written to the track at all, but we will deal with that separately.  My best guess is that the very fast FPGA is seeing glitching on the RDATA line from the floppy, which is causing some long gaps to get broken up by mistake.  If this theory is correct, then de-glitching the RDATA line by requiring it to go low for several CPU cycles will fix it. If that doesn't fix it, then the short gaps are most likely being created during the writing process.

According to floppy drive data sheets, the RDATA pulses should be >= 0.15usec = ~6 cycles at 40.5MHz.  If we require RDATA to go low for 4 cycles, that should be a safe amount, and still squelch any glitching from cross-domain or other effects.
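A sketch of that de-glitching rule in C. The VHDL does this with a counter on the 40.5MHz clock; this is just a model of the logic, not the actual implementation:

```c
#include <assert.h>

/* De-glitching sketch: only accept an RDATA pulse (active low) once
   the line has been low for at least 4 consecutive 40.5MHz samples.
   A real pulse is >= 0.15usec = ~6 cycles, so 4 is safely below that,
   while still squelching 1-2 cycle glitches. */
static int count_pulses(const int *rdata, int n)
{
  int i, low_run = 0, pulses = 0, armed = 1;
  for (i = 0; i < n; i++) {
    if (rdata[i] == 0) {
      low_run++;
      if (low_run >= 4 && armed) { pulses++; armed = 0; }
    } else {
      low_run = 0;
      armed = 1; /* line returned high; ready for the next pulse */
    }
  }
  return pulses;
}
```

A 2-cycle glitch never reaches the 4-cycle threshold and is ignored, while a genuine 6-cycle pulse is counted exactly once.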

Okay, that has been done, and the problem persists, and I just checked with a Known Good Disk, and it doesn't exhibit the short gaps, so I am still suspecting that it is a write-based problem. So I have instrumented the VHDL to keep track of the actual WDATA interval time so that we can see if we are producing the problem ourselves.

The only other difference I can see is that the problem is happening on HD disks, but not on DD disks. That could just be coincidence, or it could be that the DD disks (or drives in DD mode) simply don't have the flux resolution to write the glitches. So time for another synthesis run...

That's done, and I can now verify that no F_WDATA pulse interval is shorter than 80 clock cycles at the DD 720K data rate -- which is correct. If I tune the data rate up and down, then the minimum gap length adjusts accordingly, so I am confident that I am reading it correctly.

... and yet I am still seeing short gaps being written.  Is it actually an artefact of writing DD data on an HD disk?  If so, then writing at HD rate should remove the problem. Or writing to a DD disk.  So, writing at HD data rate seems to work without any of these short gaps, adding some support to my theory.... and formatting a DD disk at DD data rate also works just fine, too (although some of the Track Info Blocks are not able to be read back).

So there _is_ a problem with writing to HD disks at the DD data rate. I have since confirmed this 100% by covering the hole on an HD disk that refused to format nicely at the DD rate with the hole open: with the hole covered, it suddenly reads back reliably. This affirms my theory that the drive electronics are being switched to a different read recovery mode between DD and HD, and that this is messing things up at DD data rates.

Speaking with others from the team on Discord, we have come up with a sequence of events that results in corrupted DD disks.

DD DISK OR FAKE DD DISK USED

1. FORMAT in floppytest (option 3)
2. READ TEST (all green) in floppy test (option 2)
3. RESET with disk still inserted
4. HEADER"TEST",IDK
5. DIR
6. NEW
7. 0?
8. SAVE"TEST"
9. DIR     -looking good
10. SAVE"TEST2"
11. DIR      -this time still looking good
12. SAVE"TEST3"
13. DIR      -this time still looking good
14. RESET
15. DIR      -this time still looking good
16. NEW
17. 0?
18. SAVE"TEST4"
19. DIR      -this time still looking good
20. POWER OFF
21. REMOVE AND REINSERT DISK
22. POWER ON
23. DIR        -this time still looking good
24. NEW
25. 0?
26. SAVE"TEST5"    -this time rattling noise
27. DIR            -broken, mostly zeroed directory:

   0 "           "
   1 "
   0
   0
8704 DEL
3155 BLOCKS FREE

28. SAVE"1"
29. DIR         -file "1" is first entry in above broken directory!
30. SAVE"2"
31. SAVE"3"
32. DIR        -files 1, 2 and 3 appear on top:

   0 "              "
   1 "1"              PRG
   1 "2"              PRG
   1 "3"              PRG
   1 "
   0
   0
8704 DEL
3152 BLOCKS FREE

33. POWER OFF, DISK OUT, DISK IN, POWER ON
34. DIR              -magic! old files appear again:

   0 "TEST           " DK 1D
   1 "TEST"            PRG
   1 "TEST2"           PRG
   1 "TEST3"           PRG
   1 "TEST4"           PRG
   0 "TEST5"          *PRG
3156 BLOCKS FREE
 

One of the first things I found is that this doesn't seem to happen if the disk was formatted using floppytest and then just a simple HEADER "TEST" command is used from BASIC, instead of HEADER "TEST",IDK.  The difference is that with the ,Ixx option, the disk is low-level formatted again, which means that the C65 ROM is doing the disk format instead of the floppytest program.  This might be because of the C65 ROM format routine, or it might be something else.

To help understand why, I am creating a derivative of the m65 command line tool that can remotely read the contents of a real floppy, without having to interrupt the machine.  This means it is possible to type a BASIC SAVE command, for example, and then see what actually changed on the disk, versus what should have changed.

It looks like the problem might be that the C65 ROM labels every track as track 39 -- this would explain the track chattering symptom, as well as the rest of the problems. I'll have to ask @bitshifter to patch that, so that we can test again. If that fixes it, then we have DD disk behaviour all working, and I can (finally) finish fixing the HD disk stuff.

Speaking with @bitshifter, the C65 ROM still uses the old un-buffered method of formatting tracks. But we are seeing a Track Info Block being written, which only happens via the hardware-assisted formatting.  If we assume that hardware-assisted formatting is somehow being triggered, this would also explain why all tracks are marked as track 39, as the C65 ROM doesn't know to update the track number. We can confirm this theory by modifying the track number in $D084 during a C65 ROM format, which should result in changed TIB and sector header track numbers. And, yes, this is exactly what is happening.

So the question is how is automatic track formatting being triggered? The relevant track format code in the C65 ROM is pretty straight-forward:

993c a9 a1        lda #wtt+1       ;Erase track (fill with $4E gap bytes)
993e 20 58 9b     jsr CommandReg   ;necessary due to simulated index pulse!
9941 8d 81 d0     sta command      ;Begin formatting

9944 a0 10    10$ ldy #16          ;write post index gap 12 sync
9946 b9 7a 9a 15$ lda secdat-1,y
9949 be 8a 9a     ldx secclk-1,y
994c 2c 82 d0 20$ bit stata
994f 10 dd        bpl wtabort      ;oops
9951 50 f9        bvc 20$
9953 8d 87 d0     sta data1        ;Always write data before clock
9956 8e 88 d0     stx clock
9959 88           dey
995a d0 ea        bne 15$

995c a0 04        ldy #4           ;Write 4 header bytes
995e a2 ff        ldx #$ff
9960 b9 6d 01 25$ lda header-1,y
9963 2c 82 d0 30$ bit stata
9966 10 c6        bpl wtabort      ;oops
9968 50 f9        bvc 30$
996a 8d 87 d0     sta data1
996d 8e 88 d0     stx clock
9970 88           dey
9971 d0 ed        bne 25$

We can see that it is putting #$A1 into the command register.  The VHDL that looks at the commands looks like this:

              temp_cmd := fastio_wdata(7 downto 2) & "00";
              report "F011 command $" & to_hstring(temp_cmd) & " issued.";
              case temp_cmd is

                when x"A0" | x"A4" | x"A8" | x"AC" =>
                  -- Format a track (completely automatically)
                  -- At high data rates, it is problematic to feed the data
                  -- fast enough to avoid failures, especially when using
                  -- code written using CC65, as I am using to test things.
                  -- So this command just attempts to format the whole track
                  -- with all empty sectors, and everything nicely built.

                  f_wgate <= '1';
                  f011_busy <= '1';

                  -- $A4 = enable write precomp
                  -- $A8 = no gaps, i.e., Amiga-style track-at-once
                  -- $A0 = with inter-sector gaps, i.e., 1581 / PC 1.44MB style
                  -- that can be written to using DOS
                  format_no_gaps <= temp_cmd(3);

                  -- Only allow formatting when real drive is used
                  if (use_real_floppy0='1' and virtualise_f011_drive0='0' and f011_ds="000") or
                    (use_real_floppy2='1' and virtualise_f011_drive1='0' and f011_ds="001") then
                    report "FLOPPY: Real drive selected, so starting track format";
                    sd_state <= FDCAutoFormatTrackSyncWait;
                  else
                    report "FLOPPY: Ignoring track format, due to using D81 image";
                  end if;
                  
                when x"A1" | x"A5" =>
                  -- Track write: Unbuffered
                  -- It doesn't matter if you enable buffering or not, for
                  -- track write, as we just enforce unbuffered operation,
                  -- since it is the only way that it is used on the C65, and
                  -- thus the MEGA65.
                  -- (Conversely, when we get to that point, we will probably only
                  -- support buffered mode for sector writes).

                  -- Clear the LOST and DRQ flags at the beginning.
                  f011_lost <= '0';
                  f011_drq <= '0';

                  -- We clear the write gate until we hit a sync pulse, and
                  -- only then begin writing.  The write gate will be closed
                  -- again at the next sync pulse.
                  f_wgate <= '1';

                  -- Mark drive busy, as we should
                  -- C65 DOS also relies on this.
                  f011_busy <= '1';

                  report "FLOPPY: Asked for track format";
                  
                  -- Only allow formatting when real drive is used
                  if (use_real_floppy0='1' and virtualise_f011_drive0='0' and f011_ds="000") or
                    (use_real_floppy2='1' and virtualise_f011_drive1='0' and f011_ds="001") then
                    report "FLOPPY: Real drive selected, so starting track format";
                    sd_state <= FDCFormatTrackSyncWait;
                  else
                    report "FLOPPY: Ignoring track format, due to using D81 image";
                  end if;

We can see that $A1 leads to the non-automatic track formatting. But I also just spotted the bug :) Write a comment if you can spot it, too!

And with that, the main problem was solved. I still need to think about how to make the TIBs reliable at the DD data rate on HD disks, given the drive electronics limitation, but I have ideas on how to do that.  But that can wait a bit, as we have bigger fish to fry prior to the release of the MEGA65! And besides, this blog post has already taken way too long. So with that, I wish you all a Happy New Year, and lots of retro fun.

Sunday, 7 November 2021

Creating a simple internal drive fast-loader for the MEGA65

For a while now I have been thinking about making a simple fast-loader for the MEGA65 that bypasses the C65 DOS, and directly accesses the floppy controller.  It's a topic that comes up from time to time for developers who want to load large files from disk, for example. So I spent a couple of hours yesterday writing a proof-of-concept version.  My design criteria were:

1. Must be able to be run from an IRQ, so that it can be used in games or demos to load in the background while other activity goes on. The C65 DOS cannot be sensibly used for this, because when it runs, it blocks all interrupts for arbitrary periods of time, which can exceed 200ms(!!!).

2. Must allow loading to any address in memory.

3. Must be small enough that it can be easily incorporated into other programs.

(1) and (3) meant that it had to be written in assembly.

So here's what I created. It is still missing a few things: it doesn't save and restore the DMA list address registers (in case you were composing a DMA job in real-time, just as the IRQ triggered), and it doesn't support specifying how much of a file to load, which would allow progressively streaming a file in. Both would be fairly easy to implement.  But back to what we do have, an annotated walk-through of the source:

First up, to demonstrate it, we have a simple BASIC header (I am running it from C64 mode, but you could almost as easily run it from C65 mode):

   
basic_header

    !byte 0x10,0x08,<2021,>2021,0x9e
    !pet "2061"
    !byte 0x00,0x00,0x00
 

Then we have the start of the demo program that is using the fast-loader.  The actual fast-loader code will come a bit later. We do the usual things of making sure we have MEGA65 IO enabled and the CPU at full speed, as well as some boilerplate to clear the screen and set the screen colours etc:

program_start:   

    ;; Select MEGA65 IO mode
    lda #$47
    sta $d02f
    lda #$53
    sta $d02f

    ;; Select 40MHz mode
    lda #65
    sta $0

    lda #$00
    sta $d020
    sta $d021

    lda #$01
    sta $0286
    jsr $e544
    

Next it is time to set up our raster interrupt. This should all be very familiar to C64 coders:
    ;; Install our raster IRQ with our fastloader
    sei

    lda #$7f
    sta $dc0d
    sta $dd0d
    lda #$40
    sta $d012
    lda #$1b
    sta $d011
    lda #$01
    sta $d01a
    dec $d019

    lda #$16
    sta $d018
    
    lda #<irq_handler
    sta $0314
    lda #>irq_handler
    sta $0315
    cli

We'll get to the IRQ handler in a moment, but we will finish looking at the real-time part of the program first.  The fast-loader uses a single byte state/status variable to keep track of what it is doing. If it is $00, then the loader is idle.  If you want to ask it to load something, you set up the filename and load address, and then write $01 into the variable.  It will go back to $00 when it's done, or have bit 7 set if there is some kind of error. This means you can check the status with BEQ and BMI.  The load address will progressively update to show how far it has loaded, if that's important for you to track. In the example, we load the game GYRRUS into bank 4 at $00040000:
    ;; Example for using the fast loader
    
    ;; copy filename from start of screen
    ;; Expected to be PETSCII and $A0 padded at end, and exactly 16 chars
    ldx #$0f
    lda #$a0
clearfilename:
    sta fastload_filename,x
    dex
    bpl clearfilename
    ldx #$ff
filenamecopyloop:
    inx
    cpx #$10
    beq endofname
    lda filename,x
    beq endofname
    sta fastload_filename,x
    bne filenamecopyloop
endofname:   
    inx
    stx fastload_filename_len
    
    ;; Set load address (32-bit)
    ;; = $40000 = BANK 4
    lda #$00
    sta fastload_address+0
    lda #$00
    sta fastload_address+1
    lda #$04
    sta fastload_address+2
    lda #$00
    sta fastload_address+3
Remember what I said about the status variable? We need to make sure it is $00 before we submit our load request.  This is important because when the fast-loader initialises, it doesn't know what track the drive is on, and so it seeks back to track 0 first. So we make sure that completes before we submit our job. If we didn't do this, reading any sector from the disk on a real drive would hang, because the head would be on the wrong track.
    ;; Give the fastload time to get itself sorted
    ;; (largely seeking to track 0)
wait_for_fastload:   
    lda fastload_request
    bne wait_for_fastload
Finally the fast-loader is ready, so we can then submit our job. It really is this simple:
    ;; Request fastload job
    lda #$01
    sta fastload_request
We can then go off and do whatever we want in real-time, knowing that the raster interrupt will be calling the fast-loader, and allowing it to progress in the background. For simplicity, in our demo we just wait for the fast-load to complete, and indicate if an error occurred, or if it loaded ok.
    ;; Then just wait for the request byte to
    ;; go back to $00, or to report an error by having the MSB
    ;; set. The request value will continually update based on the
    ;; state of the loading.
waiting
    lda fastload_request
    bmi error
    bne waiting
    beq done
    
error
    inc $042f
    jmp error

done
    inc $d020
    jmp done

That's over and done with for the real-time part, so now let's look at our raster interrupt.  This is also quite simple: acknowledge the IRQ source, set the border colour to white, call the fastload_irq routine, then return the border colour to black, before returning via the well-known $EA81 interrupt exit handler in the C64 KERNAL. You can of course do whatever you want, but this shows just how simple it can be. The border colour stuff is of course optional, but lets us see just how little raster time this loader uses.
irq_handler:
    ;; Here is our nice minimalistic IRQ handler that calls the fastload IRQ
    
    dec $d019

    ;; Call fastload and show raster time used in the loader
    lda #$01
    sta $d020
    jsr fastload_irq
    lda #$00
    sta $d020

    ;; Chain to KERNAL IRQ exit
    jmp $ea81

As mentioned, I set this demo up to load GYRRUS into bank 4, just because that was a file on the disk image I had active in my MEGA65 at the time.  Note that the filename has to be padded with $A0s, because the fast-load code literally compares all 16 bytes of the filename with the 16 bytes of the filename in the directory sectors. It doesn't support partitions or sub-directories on the disk image; someone could hack that in if they wanted, but I don't think it will be necessary for most use-cases.
filename:
    ;; GYRRUS for testing
    !byte $47,$59,$52,$52,$55,$53,$a0,$a0
    !byte $a0,$a0,$a0,$a0,$a0,$a0,$a0,$a0

    
;; ----------------------------------------------------------------------------
;; ----------------------------------------------------------------------------
;; ----------------------------------------------------------------------------
So that was the code for our example driver of the fast-loader. For your own programs, you can cut everything above here away, and just keep what follows.  It requires about 1.2KB, including the 512 byte sector buffer, so it's quite small in the grand scheme of things.
    ;; ------------------------------------------------------------
    ;; Actual fast-loader code
    ;; ------------------------------------------------------------
First up, we have the variables and temporary storage for the fast-loader: the filename and its length (which actually gets ignored because of the $A0 padding, and so could be removed at some point), the address where the user wants to load, and the state/status variable.  These four variables are the only ones you need to access from your code. Everything else that follows is internal to the fast-loader.

fastload_filename:   
    *=*+16
fastload_filename_len:   
    !byte 0
fastload_address:   
    !byte 0,0,0,0
fastload_request:   
    ;; Start with seeking to track 0
    !byte 4
This variable keeps track of which physical track on the disk the loader thinks the head is currently over, so that we can step to the correct track:

fl_current_track:    !byte 0

Then we have variables for the logical track and sector of the next 256 byte block of the file. These have to get translated into the physical track and sector of the drive, which like the 1581, stores two blocks in each physical sector.
fl_file_next_track:    !byte 0
fl_file_next_sector:    !byte 0
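That logical-to-physical translation can be sketched in C. This is my reading of the 1581-style layout (40 logical 256-byte blocks per track, stored as 10 physical 512-byte sectors numbered from 1 on each of the two sides); the loader itself does the equivalent arithmetic inline in assembly:

```c
#include <assert.h>

/* Sketch of the 1581-style logical-to-physical sector translation:
   logical blocks 0-19 sit on side 0, 20-39 on side 1; each physical
   512-byte sector holds two 256-byte logical blocks, with odd blocks
   in the second half.  My reading of the layout, not the loader's
   actual code. */
struct phys { int sector, side, second_half; };

static struct phys logical_to_physical(int logical_sector)
{
  struct phys p;
  p.side = logical_sector / 20;               /* blocks 20-39 on side 1 */
  p.sector = ((logical_sector % 20) / 2) + 1; /* physical sectors 1-10 */
  p.second_half = logical_sector & 1;         /* odd blocks in 2nd half */
  return p;
}
```

For example, the directory start at logical sector 3 lands in the second half of physical sector 2 on side 0, matching the `(3/2)+1` calculation in the fl_new_request code below.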
 

Then finally, we have the 512 byte sector buffer. Now, this could be optimised away by enabling mapping of the sector buffer at $DE00-$DFFF, but I couldn't be bothered remembering how to do that, and also didn't want to cause potential problems for code that also uses REU emulation or other things that might appear in the IO area. It's not that it can't be done, but rather that I just took the quick and easy path.  It would be a great exercise for the reader to change this, and reduce the total size of the loader to <1KB as a result.
fastload_sector_buffer:
    *=*+512
 

Now let's take a look at the fast-loader's IRQ handler.  It basically checks if there is an active request, and if not, does nothing. Then it checks if the floppy controller is busy doing something that we asked it to do earlier. If so, it does nothing.  But if we have an active job and the floppy controller is not busy, this means that we can ask for the next operation to occur.  The fastload_request variable doubles as the state number for the resulting simple state machine.  This approach really simplifies the code a lot, and makes it much easier to run in an interrupt.

Before going further, it is worth noting that if you run the interrupt on a normal raster IRQ, the loader will be able to load at most one block = 254 bytes of usable data per frame.  This means 254 x 50 = ~12.7KB/sec in PAL or 15.2KB/sec in NTSC.  If you are using a real 800KB 1581 disk, that's not a problem, because the drive will slow you down more than that.  But if you are using a disk image, or one of the MEGA65's HD disk formats, then this will slow things down.  
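The throughput arithmetic is simply usable-bytes-per-block times frames-per-second (each 256-byte block carries 254 usable bytes, since the first 2 bytes are the track/sector link):

```c
#include <assert.h>

/* One block loaded per video frame; each 256-byte block carries 254
   usable bytes, as the first 2 bytes are the track/sector link. */
static int loader_bytes_per_second(int usable_bytes_per_block,
                                   int frames_per_second)
{
  return usable_bytes_per_block * frames_per_second;
}
```

So loader_bytes_per_second(254, 50) gives 12700 (~12.7KB/sec, PAL) and loader_bytes_per_second(254, 60) gives 15240 (~15.2KB/sec, NTSC).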

The easy solution is to have your IRQ routine trigger multiple times per frame, or enable IRQs in the floppy controller, and have it be called on demand whenever a sector is ready. You will need to acknowledge the floppy controller interrupts, if you do that.

There is also a further ~2x speed-up possible without doing that, by modifying the loader to realise when a single physical sector contains two consecutive blocks of a file. It doesn't currently do this, which is a bit stupid.  Fixing that would also be a great exercise for the reader.

 
fastload_irq:
    ;; If the FDC is busy, do nothing, as we can't progress.
    ;; This really simplifies the state machine into a series of
    ;; sector reads
    lda fastload_request
    bne todo
    rts
todo:   
    lda $d082
    bpl fl_fdc_not_busy
    rts
fl_fdc_not_busy:   
    ;; FDC is not busy, so check what state we are in
    lda fastload_request
    bpl fl_not_in_error_state
    rts
fl_not_in_error_state:

It's worth explaining how the IRQ handler calls the various routines for the different states, because it uses a nice feature of the 65CE02: JMP indirect, X-indexed.  This instruction basically allows you to have a jump-table without the silly push-addr-minus-one to stack trick you have to use on the C64. The resulting code is quite a lot simpler and clearer as a result:
    ;; Shift state left one bit, so that we can use it as a lookup
    ;; into a jump table.
    ;; Everything else is handled by the jump table
    cmp #6
    bcc fl_job_ok
    ;; Ignore request/status codes that don't correspond to actions
    rts
fl_job_ok:   
    asl
    tax
    jmp (fl_jumptable,x)
    
fl_jumptable:
    !16 fl_idle
    !16 fl_new_request
    !16 fl_directory_scan
    !16 fl_read_file_block
    !16 fl_seek_track_0
    !16 fl_step_track

The first of those state routines is the one for when the loader is idle: Just return immediately. This can be optimised away, since there are (1) plenty of other RTS instructions we could point at; and (2) because it never gets called, because we have the short-circuit exit at the start of the IRQ handler.  If you haven't already gotten the idea by now, you can tell that I have really just hacked this together until it works, and then stopped to document it.  Lots of opportunities for you to get involved and improve it ;)
fl_idle:
    rts

The next state handler checks if we are on track 0 yet, and if not, commands a step towards track 0, which, like all other floppy controller actions, will keep the floppy controller busy until the step has completed. Again, our nice busy check at the start of the IRQ handler means that we can just keep stepping in this routine until we reach track 0. Note how it writes $00 into fastload_request when done, to indicate that the loader is idle and ready for a new job.
fl_seek_track_0:
    lda $d082
    and #$01
    bne fl_not_on_track_0
    lda #$00
    sta fastload_request
    sta fl_current_track
    rts
fl_not_on_track_0:
    ;; Step back towards track 0
    lda #$10
    sta $d081
    rts

As you saw in the demo driver code, to submit a new job, you write $01 into fastload_request. This causes the following routine to be run when the IRQ is next triggered.  It puts $02 into fastload_request, so that it knows that it has just accepted a job, and also immediately requests the reading of the first physical sector that contains a directory block, ready for us to look for the requested file.
fl_new_request:
    ;; Acknowledge fastload request
    lda #2
    sta fastload_request
    ;; Start motor
    lda #$60
    sta $d080
    ;; Request T40 S3 to start directory scan
    ;; (remember we have to do silly translation to real sectors)
    lda #40-1
    sta $d084
    lda #(3/2)+1
    sta $d085
    lda #$00
    sta $d086         ; side
    ;; Request read
    jsr fl_read_sector
    rts

The above sets fastload_request so that this routine is run on each subsequent IRQ, i.e., as each sector of the directory is loaded. We then look through the whole 512 byte sector for a matching filename, and if one is found, change state to load the file from the logical track and sector of its first block, as obtained from the directory entry. Note that we ignore the file type, including whether the file is deleted. Again, a great opportunity for someone to improve the loader.
fl_directory_scan:
    ;; Check if our filename we want is in this sector
    jsr fl_copy_sector_to_buffer

    ;; (XXX we scan the last BAM sector as well, to keep the code simple.)
    ;; filenames are at offset 5 in each 32-byte directory entry, padded at
    ;; the end with $A0
    lda #<fastload_sector_buffer
    sta fl_buffaddr+1
    lda #>fastload_sector_buffer
    sta fl_buffaddr+2

fl_check_logical_sector:
    ldx #$05
fl_filenamecheckloop:
    ldy #$00

fl_check_loop_inner:

fl_buffaddr:
    lda fastload_sector_buffer+$100,x
    
    cmp fastload_filename,y   
    bne fl_filename_differs
    inx
    iny
    cpy #$10
    bne fl_check_loop_inner
    ;; Filename matches
    txa
    sec
    sbc #$12
    tax
    lda fl_buffaddr+2
    cmp #>fastload_sector_buffer
    bne fl_file_in_2nd_logical_sector
    ;; Y=Track, A=Sector
    lda fastload_sector_buffer,x
    tay
    lda fastload_sector_buffer+1,x
    jmp fl_got_file_track_and_sector
fl_file_in_2nd_logical_sector:   
    ;; Y=Track, A=Sector
    lda fastload_sector_buffer+$100,x
    tay
    lda fastload_sector_buffer+$101,x
fl_got_file_track_and_sector:
    ;; Store track and sector of file
    sty fl_file_next_track
    sta fl_file_next_sector
    ;; Request reading of next track and sector
    jsr fl_read_next_sector
    ;; Advance to next state
    lda #3
    sta fastload_request
    rts
    
fl_filename_differs:
    ;; Skip same number of chars as though we had matched
    cpy #$10
    beq fl_end_of_name
    inx
    iny
    jmp fl_filename_differs
fl_end_of_name:
    ;; Advance to next directory entry
    txa
    clc
    adc #$10
    tax
    bcc fl_filenamecheckloop
    inc fl_buffaddr+2
    lda fl_buffaddr+2
    cmp #(>fastload_sector_buffer)+1
    bne fl_checked_both_halves
    jmp fl_check_logical_sector
fl_checked_both_halves:   
    
    ;; No matching name in this 512 byte sector.
    ;; Load the next one, or give up the search
    inc $d085
    lda $d085
    cmp #11
    bne fl_load_next_dir_sector
    ;; Ran out of sectors in directory track
    ;; (XXX only checks side 0, and assumes DD disk)

    ;; Mark load as failed
    lda #$80         ; $80 = File not found
    sta fastload_request   
    rts
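In Python, the scan over one 512-byte physical sector looks roughly like this (find_file and the buffer setup are my own names and are purely illustrative; the offsets come from the code above: the $A0-padded 16-character filename at offset 5 of each 32-byte entry, and the first data block's track/sector at offsets 3 and 4):

```python
# A Python model of the directory scan (find_file is a made-up name).
# Each 512-byte physical sector holds two 256-byte logical sectors,
# each with eight 32-byte directory entries.

def find_file(sector512, name):
    padded = name.encode("ascii").ljust(16, b"\xa0")
    for half in (0, 256):
        for entry in range(half + 5, half + 256, 32):   # filename at +5
            if sector512[entry:entry + 16] == padded:
                # track at entry offset 3, sector at offset 4
                return sector512[entry - 2], sector512[entry - 1]
    return None   # not in this sector; caller loads the next one

# Plant an entry for "HELLO" in the first slot of the second half
buf = bytearray(512)
buf[256 + 3], buf[256 + 4] = 17, 4                    # first data block T/S
buf[256 + 5:256 + 21] = b"HELLO".ljust(16, b"\xa0")
find_file(buf, "HELLO")                               # -> (17, 4)
```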

We now have several little utility routines related to reading sectors from the disk, including the conversion from 1581 logical sectors to 3.5" floppy physical sectors, and stepping the head if we aren't already on the correct track. If the code detects that it needs to step the head, it changes fastload_request to point to a handler for that, which in turn sets it back to the handler for reading blocks of the file.

Note that I haven't actually tried this on a real disk, yet. This should be done, as there will quite likely be some subtle problem that will need shaking out, most likely with the track stepping. But it shouldn't be too hard to fix, and who knows, I might have got it right the first time ;)
fl_load_next_dir_sector:   
    ;; Request read
    jsr fl_read_sector
    ;; No need to change state
    rts

fl_read_sector:
    ;; Check if we are already on the correct track/side
    ;; and if not, select/step as required
    lda #$40
    sta $d081
    rts

fl_step_track:
    lda #3
    sta fastload_request
    ;; FALL THROUGH
    
fl_read_next_sector:
    ;; Check if we reached the end of the file first
    lda fl_file_next_track
    bne fl_not_end_of_file
    rts
fl_not_end_of_file:   
    ;; Read next sector of file
    jsr fl_logical_to_physical_sector

    lda $d084
    cmp fl_current_track
    beq fl_on_correct_track
    bcc fl_step_in
fl_step_out:
    ;; We need to step first
    lda #$18
    sta $d081
    inc fl_current_track
    lda #5
    sta fastload_request
    rts
fl_step_in:
    ;; We need to step first
    lda #$10
    sta $d081
    dec fl_current_track
    lda #5
    sta fastload_request
    rts
    
fl_on_correct_track:   
    jsr fl_read_sector
    rts
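The stepping decision boils down to the following sketch (step_once is a made-up name): one step command per interrupt, with fastload_request bouncing through the step state until the track numbers agree.

```python
# Python model of the per-IRQ head-stepping decision (illustration only).

def step_once(current, target):
    if current == target:
        return current, "read"                      # fl_on_correct_track
    if target < current:
        return current - 1, "step toward 0"         # lda #$10 (fl_step_in)
    return current + 1, "step away from 0"          # lda #$18 (fl_step_out)

step_once(39, 39)   # already on the right track: just read the sector
```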


Here we have another utility routine that does the logical-to-physical track and sector conversion. Again, this basically mirrors what the 1581 does. It will need modifying before the fast-loader can be used on HD disks, because there will be more sectors on each side of the disk.
fl_logical_to_physical_sector:
    ;; Convert 1581 sector numbers to physical ones on the disk.
    ;; Track = Track - 1
    ;; Sector = 1 + (Sector/2)
    ;; Side = 0
    ;; If sector > 10, then sector=sector-10, side=1
    lda #$00         ; side 0
    sta $d086
    lda fl_file_next_track
    dec
    sta $d084
    lda fl_file_next_sector
    lsr
    inc
    cmp #11         ; sectors 1-10 stay on side 0; only 11+ wrap to side 1
    bcs fl_on_second_side
    sta $d085
    jmp fl_set_fdc_head
    
fl_on_second_side:
    sec
    sbc #10
    sta $d085
    lda #1
    sta $d086

    ;; FALL THROUGH
fl_set_fdc_head:
    ;; Select correct side of real disk drive
    lda $d086
    asl
    asl
    asl
    and #$08
    ora #$60
    sta $d080
    rts
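The conversion is easier to sanity-check in Python. Here's a sketch with the boundary condition made explicit (logical_to_physical is a hypothetical helper, not part of the loader):

```python
# The 1581 logical-to-physical mapping in Python (illustration only).
# Physical sectors are numbered 1-10 on each side, so only results
# strictly greater than 10 wrap around to side 1.

def logical_to_physical(track, sector):
    phys_track = track - 1
    phys_sector = (sector >> 1) + 1
    side = 0
    if phys_sector > 10:        # strictly greater: 10 itself stays on side 0
        phys_sector -= 10
        side = 1
    return phys_track, phys_sector, side

logical_to_physical(40, 3)      # directory start: track 39, sector 2, side 0
```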
    

This is the routine that really does the loading: it takes the physical sector that has just been read, works out which half of it contains the data we want, DMAs those bytes into the destination location in memory, and then follows the block chain to the next block of the file, detecting the end-of-file marker indicated by a next-track value of $00.
fl_read_file_block:
    ;; We have a sector from the floppy drive.
    ;; Work out which half and how many bytes,
    ;; and copy them into place.

    ;; Get sector from FDC
    jsr fl_copy_sector_to_buffer

    ;; Assume full sector initially
    lda #254
    sta fl_bytes_to_copy
    
    ;; Work out which half we care about
    lda fl_file_next_sector
    and #$01
    bne fl_read_from_second_half
fl_read_from_first_half:
    lda #(>fastload_sector_buffer)+0
    sta fl_read_dma_page
    lda fastload_sector_buffer+1
    sta fl_file_next_sector
    lda fastload_sector_buffer+0
    sta fl_file_next_track
    bne fl_1st_half_full_sector
fl_1st_half_partial_sector:
    lda fastload_sector_buffer+1
    dec             ; link byte is the offset of the last valid byte, so count is one less
    sta fl_bytes_to_copy
    ;; Mark end of loading
    lda #$00
    sta fastload_request
fl_1st_half_full_sector:
    jmp fl_dma_read_bytes
    
fl_read_from_second_half:
    lda #(>fastload_sector_buffer)+1
    sta fl_read_dma_page
    lda fastload_sector_buffer+$101
    sta fl_file_next_sector
    lda fastload_sector_buffer+$100
    sta fl_file_next_track
    bne fl_2nd_half_full_sector
fl_2nd_half_partial_sector:
    lda fastload_sector_buffer+$101
    dec             ; link byte is the offset of the last valid byte, so count is one less
    sta fl_bytes_to_copy
    ;; Mark end of loading
    lda #$00
    sta fastload_request
fl_2nd_half_full_sector:
    ;; FALLTHROUGH
fl_dma_read_bytes:

    ;; Update destination address
    lda fastload_address+3
    asl
    asl
    asl
    asl
    sta fl_data_read_dmalist+2
    lda fastload_address+2
    lsr
    lsr
    lsr
    lsr
    ora fl_data_read_dmalist+2
    sta fl_data_read_dmalist+2
    lda fastload_address+2
    and #$0f
    sta fl_data_read_dmalist+12
    lda fastload_address+1
    sta fl_data_read_dmalist+11
    lda fastload_address+0
    sta fl_data_read_dmalist+10

    ;; Copy FDC data to our buffer
    lda #$00
    sta $d704
    lda #>fl_data_read_dmalist
    sta $d701
    lda #<fl_data_read_dmalist
    sta $d705

    ;; Update load address
    lda fastload_address+0
    clc
    adc fl_bytes_to_copy
    sta fastload_address+0
    lda fastload_address+1
    adc #0
    sta fastload_address+1
    lda fastload_address+2
    adc #0
    sta fastload_address+2
    lda fastload_address+3
    adc #0
    sta fastload_address+3
    
    ;; Schedule reading of next block
    jsr fl_read_next_sector
    
    rts
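The four-shift nibble shuffle above is easier to see in hex. Here's how the 32-bit fastload_address gets carved into the enhanced-DMA destination fields, sketched in Python (dma_dest_fields is my own name for it):

```python
# How the 32-bit destination address is split into DMA list fields
# (illustration only). The "MB" byte is address bits 20-27, built from
# the top byte shifted left four and the next byte shifted right four.

def dma_dest_fields(addr32):
    b = [(addr32 >> (8 * i)) & 0xFF for i in range(4)]   # little-endian bytes
    mb = ((b[3] << 4) & 0xF0) | (b[2] >> 4)   # asl x4 of byte 3 | lsr x4 of byte 2
    bank = b[2] & 0x0F                        # low nibble: 64KB bank within the MB
    return mb, bank, b[1], b[0]               # MB, bank, addr high, addr low

dma_dest_fields(0x0FFD6C00)   # the FDC buffer address: (0xFF, 0x0D, 0x6C, 0x00)
```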

We are now almost at the end. What we have here is the DMA list for copying the read data to its final destination, together with the routine and DMA list for copying a physical sector from the FDC's buffer down to fastload_sector_buffer.  As previously noted, we could probably shrink the whole thing (and make it use less raster time) by avoiding that copy, if we instead fiddle the IO banking to make the floppy sector buffer appear at $DE00-$DFFF (there is a special bit that enables this).  But what we have here works, and isn't that much slower, as the DMA doesn't take very long.
fl_data_read_dmalist:
    !byte $0b      ; F011A type list
    !byte $81,$00      ; Destination MB
    !byte 0         ; no more options
    !byte 0            ; copy
fl_bytes_to_copy:   
    !word 0               ; size of copy
fl_read_page_word:   
fl_read_dma_page = fl_read_page_word + 1
    ;; +2 is to skip the track/sector link
    !word fastload_sector_buffer+2    ; Source address
    !byte $00        ; Source bank
    
    !word 0                 ; Dest address
    !byte $00             ; Dest bank
    
    !byte $00             ; sub-command
    !word 0                 ; modulo (unused)
    
    
fl_copy_sector_to_buffer:
    ;; Make sure FDC sector buffer is selected
    lda #$80
    trb $d689

    ;; Copy FDC data to our buffer
    lda #$00
    sta $d704
    lda #>fl_sector_read_dmalist
    sta $d701
    lda #<fl_sector_read_dmalist
    sta $d705
    rts

fl_sector_read_dmalist:
    !byte $0b      ; F011A type list
    !byte $80,$ff            ; MB of FDC sector buffer address ($FFD6C00)
    !byte 0         ; no more options
    !byte 0            ; copy
    !word 512        ; size of copy
    !word $6c00        ; low 16 bits of FDC sector buffer address
    !byte $0d        ; next 4 bits of FDC sector buffer address
    !word fastload_sector_buffer ; Dest address   
    !byte $00             ; Dest bank
    !byte $00             ; sub-command
    !word 0                 ; modulo (unused)
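For reference, here is that same list spelled out byte-by-byte in Python (fdc_copy_list is my own helper, and the destination address 0x1000 is just a placeholder, since fastload_sector_buffer's location depends on where the loader is assembled):

```python
# Byte layout of the enhanced DMA list (illustration only). These
# offsets are what the self-modifying code relies on.

def fdc_copy_list(src_mb, src, src_bank, dst, dst_bank, count=512):
    return bytes([
        0x0B,                      # F011A-format list
        0x80, src_mb,              # option: source MB (address bits 20-27)
        0x00,                      # end of options
        0x00,                      # command: copy
        count & 0xFF, count >> 8,  # size of copy
        src & 0xFF, src >> 8,      # low 16 bits of source address
        src_bank,                  # next 4 bits of source address
        dst & 0xFF, dst >> 8,      # destination address
        dst_bank,                  # destination bank
        0x00,                      # sub-command
        0x00, 0x00,                # modulo (unused)
    ])

# FDC sector buffer at $FFD6C00 -> our buffer (placeholder destination)
lst = fdc_copy_list(0xFF, 0x6C00, 0x0D, 0x1000, 0x00)
```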

And that's it.  The loader really is quite simple, especially compared with a 1541 fast-loader.  You can find the source in https://github.com/mega65/mega65-tools, just look for fastload-demo.asm.

Finally, a somewhat arbitrary screen-shot, because every blog post requires at least one, but it's kind of hard to show a fast-loader in action in a still image.