In recent posts I reported the progress on getting formatting and write support working for the internal 3.5" floppy drive. However, formatting currently had to be done at 40MHz to work reliably. The cause was that the tight little loop that the C65 DOS does when checking when to write the next byte to the floppy when formatting a track is right on the edge of the speed required at 3.5MHz, but we had some errors in cycle timing for the CPU, that was tipping it over the edge.
Let's start by looking at the DOS's loop when formatting. The same basic loop is used in a few places, and looks like this:
10$ ldy #16 ;write post index gap 12 sync
15$ lda secdat-1,y
20$ bit stata
bpl wtabort ;oops
sta data1 ;Always write data before clock
So we basically do a BIT on the stata register to check if an error has occured (bit 7 set) or the drive is ready for the next byte (bit 6 set), which is what the BVC does. So while waiting to see if we need to write the next byte, we do the BIT/BPL/BVC sequence. If we check the 65CE02 datasheet, we see that those should take 5, 2 and 2 cycles each, for an inner loop of just 9 cycles.
I was mistakenly charging 3 cycles for branches taken, and thus the BVC was taking one more cycle, meaning our inner loop was taking 10 cycles instead of 9. At 3.5MHz, this means the inner loop was taking about 2.86 microseconds. This is much faster than the time between bytes on a 720K floppy, where the effective data rate is about 250kbit/sec, or about 32 microseconds, or about 112 cycles, assuming the C65's clock is exactly 3.5MHz (which it isn't, but doesn't matter for our current purposes).
But we have to pay the cost of the LDA, LDX, STA, STX, DEY and BNE in there as well, which cost 4, 4, 4, 4, 1 and 2 cycles each, for a total extra cost of 19 cycles. So it should take 19 + 9n cycles for that loop to detect when the next byte has to be written. So really we mean that we have a jitter of about 9 cycles from when the last bit starts to be written, to when we will detect that the next byte is required, and then it costs us another 4 + 4 cycles = 8 cycles to actually write the byte.
There is also a bit of a fun race condition when providing the next byte, if the clock byte has changed since the previous byte, which means if we are a bit too late, the floppy controller might latch the wrong clock value and mess everything up. But even without that, it is conceivable that we can end up a tiny bit too late and miss the deadline for the next byte, causing everything to slip by one bit, and making the resulting part of the track unreadable. And this was happening quite a lot.
Once I worked out that this was the problem, I fixed the branch instructions to not charge the extra cycles when the CPU is at >=3.5 MHz, thus reducing our loop down to the correct 9 cycles.
This reduced the number of dud sectors being written, but it didn't actually eliminate them. So I had to investigate what was going on with this, as I was pretty sure I had the cycle times correct in the CPU. Pulling up the 65CE02 datasheet, I double-checked things, and apart from spotting a couple of unrelated timing errors (I was only charging 2 cycles for JSR due to a typo, and there was one of the ROR instructions I also had the timing wrong for, both of which I have fixed now), the BIT and branch instructions were all looking spot-on. Most strange.
Then it occurred to me that it was a bit weird that BIT $nnnn cost 5 cycles, as it only requires 3 instruction byte cycles, and then one memory read cycle, since it doesn't need to write anything back to memory after. After all, one of hte 65CE02's stated marketing claims was the removal of almost all "dead cycles", giving it about a 25% speed-up vs the 6502 at the same clock speed.
But the Commodore datasheet does indeed claim that the BIT instruction in all modes requires an extra cycle. Even the VICE emulator implements it this way. But I dug around further to see if there were any experimentally derived instruction timings for the 65CE02, and found this page, which suggested that the BIT $nnnn instruction indeed requires only 4 cycles instead of 5. I also found another page that had different timing again for a lot of instructions, such as indicating that TRB/TSB require 6 instead of 5 cycles, but again, indicated just 4 cycles for the BIT $nnnn instruction.
So I am deeply suspicious that the correct timing for the BIT instructions should indeed be one cycle less than the Commodore datasheet indicates. If fixing this results in perfect formatting at 3.5MHz, it will be very strong circumstantial evidence for this, until we can borrow a real C65 (or one of those Amiga 8-way serial cards that includes a 65CE02 on it), to check.
While I wait for that, there is one other piece of circumstantial evidence in favour of these changes, which is that I had previously established that the CPU had to be about 15% faster than 3.5MHz to reliably format disks correctly. 10 cycles for the inner format loop divided by 115% = 8.7 cycles, i.e., just below the 9 cycles that the Commodore datasheet indicates that this loop should take. So I'm crossing my fingers that this will fix the problem. I'll check in the morning when the synthesis has finished.
After some mishaps with the synthesis process, I have a bit stream now, but it is still resulting in a few dud sectors when formatting at 3.5MHz. So there must still be something a bit fruity going on. I still feel it has to be the CPU timing, as that is the only thing that can cause the sector writing to fail in this way. I know its late bytes, because I have seen the slipped bits when I examine the disks.
Maybe what I have to do, is to buffer 2 bytes instead of 1 in the track format logic, so that if one byte is a bit late, its not a big drama, and the jitter in getting the bytes delivered is very unlikely to result in two consecutive bytes being late, since its only getting ~5 late bytes when formatting an entire disk, i.e., with a probability somewhere around 5 bytes in a million means p(late) = 0.000002, so the chance of two consecutive late bytes would be something like 0.00000000004, i.e., about once every 10,000 disks.
Actually, looking at the logic, there is also the slim chance that it is a race-condition where a byte gets ignored if arrives exactly when it is needed. Fixing that may well be enough, but since I have the bonnet up, I'll add the extra byte of buffering, as well, so that it should all be 100% stable next time.
The race-condition means that I need to latch the byte_valid signal coming into the MFM encoder, in case that is exactly the cycle when we remove the byte from the buffer, as otherwise the removing the byte from the buffer was taking priority over storing the new byte in the buffer, and thus the new byte would be ignored, and its replacement byte would seem to be late, because a byte had been ignored.
It will be a bit funny if that turns out to be the problem, because it will still have caused me to find and fix some CPU instruction timing bugs that may well have been upsetting other things. In particular, BIT taking an extra cycle may have been causing C64 disk fast loaders problems, because BIT is often used to check the state of lines on the IEC bus. But anyway, the main thing is that it will hopefully just be fixed.
Well, the synthesis completed, but now it isn't writing anything to disk. A bit of poking around in simulation revealed several logic errors in my addition of an extra byte of buffer, which I have now fixed, such that simulation of MFM writing is working again. So its time for another synthesis run while I sleep. If I get up on time, I might have the chance for a quick peek to see if it has worked. Fingers crossed I have it right this time!
And it has. I'd include a screen-shot here, but for some reason it's refusing to grab one using the m65 tool, which I will have to investigate separately.
I'm going to have to investigate why the m65 tool has broken, so in the meantime, I'll include this image I took earlier, which shows what it should look like, since every blog post must contain at least one image: