Sunday 7 November 2021

Creating a simple internal drive fast-loader for the MEGA65

For a while now I have been thinking about making a simple fast-loader for the MEGA65 that bypasses the C65 DOS, and directly accesses the floppy controller.  It's a topic that comes up from time to time for developers who want to load large files from disk, for example. So I spent a couple of hours yesterday writing a proof-of-concept version.  My design criteria were:

1. Must be able to be run from an IRQ, so that it can be used in games or demos to load in the background while other activity goes on. The C65 DOS cannot be sensibly used for this, because when it runs, it blocks all interrupts for arbitrary periods of time, which can exceed 200ms(!!!).

2. Must allow loading to any address in memory.

3. Must be small enough that it can be easily incorporated into other programs.

(1) and (3) meant that it had to be written in assembly.

So here's what I created. It still is missing a few things, like it doesn't save and restore DMA list address registers (in case you were composing a DMA job in real-time, just as the IRQ triggered), and doesn't support specifying how much of a file to load, to allow progressive streaming in of a file. Both would be fairly easy to implement.  But back to what we do have, an annotated walk through the source:

First up, to demonstrate it, we have a simple BASIC header (I am running it from C64 mode, but you could almost as easily run it from C65 mode):


    !byte 0x10,0x08,<2021,>2021,0x9e
    !pet "2061"
    !byte 0x00,0x00,0x00

Then we have the start of the demo program that is using the fast-loader.  The actual fast-loader code will come a bit later. We do the usuals of making sure we have MEGA65 IO enabled, and the CPU at full-speed, as well as have some boiler plate to clear the screen and set screen colours etc:   


    ;; Select MEGA65 IO mode
    lda #$47
    sta $d02f
    lda #$53
    sta $d02f

    ;; Select 40MHz mode
    lda #65
    sta $0

    lda #$00
    sta $d020
    sta $d021

    lda #$01
    sta $0286
    jsr $e544

Next it is time to setup our raster interrupt. This should all be very familiar to C64 coders:
    ;; Install our raster IRQ with our fastloader

    lda #$7f
    sta $dc0d
    sta $dd0d
    lda #$40
    sta $d012
    lda #$1b
    sta $d011
    lda #$01
    sta $d01a
    dec $d019

    lda #$16
    sta $d018
    lda #<irq_handler
    sta $0314
    lda #>irq_handler
    sta $0315

We'll get to the IRQ handler in a moment, but we will finish looking at the real-time part of the program first.  The fast-loader uses a single byte state/status variable to keep track of what it is doing. If it is $00, then the loader is idle.  If you want to ask it to load something, you setup the filename and load address, and then write $01 into the variable.  It will go back to $00 when its done, or have bit 7 set if there is some kind of error. This means you can check status with BEQ and BMI.  The load address will progressively update to show where it is loaded to, if that's important for you to track. In the example, we load the game GYRRUS into bank 4 at $00040000:
    ;; Example for using the fast loader
    ;; copy filename from start of screen
    ;; Expected to be PETSCII and $A0 padded at end, and exactly 16 chars
    ldx #$0f
    lda #$a0
    sta fastload_filename,x
    bpl clearfilename
    ldx #$ff
    cpx #$10
    beq endofname
    lda filename,x
    beq endofname
    sta fastload_filename,x
    bne filenamecopyloop
    stx fastload_filename_len
    ;; Set load address (32-bit)
    ;; = $40000 = BANK 4
    lda #$00
    sta fastload_address+0
    lda #$00
    sta fastload_address+1
    lda #$04
    sta fastload_address+2
    lda #$00
    sta fastload_address+3
Remember what I said about the status variable? We need to make sure it is $00 before we submit our load request.  This is important because when the fast-loader initialises, it doesn't know what track the drive is on, and so it seeks back to track 0 first. So we make sure that that completes before we submit our job. If we didn't do this, reading of any sector from the disk on a real drive would hang, because the head would be on the wrong track.
    ;; Give the fastload time to get itself sorted
    ;; (largely seeking to track 0)
    lda fastload_request
    bne wait_for_fastload
Finally the fast-loader is ready, so we can then submit our job. It really is this simple:
    ;; Request fastload job
    lda #$01
    sta fastload_request
We can then go off and do whatever we want in real-time, knowing that the raster interrupt will be calling the fast-loader, and allowing it to progress in the background. For simplicity, in our demo we just wait for the fast-load to complete, and indicate if an error occurred, or if it loaded ok.
    ;; Then just wait for the request byte to
    ;; go back to $00, or to report an error by having the MSB
    ;; set. The request value will continually update based on the
    ;; state of the loading.
    lda fastload_request
    bmi error
    bne waiting
    beq done
    inc $042f
    jmp error

    inc $d020
    jmp done

That's over and done with for real-time, so now lets look at our raster interrupt.  This is also quite simple: Acknowledge the IRQ source, set border colour to white, call the fastload_irq routine, then return the border colour to black, before returning via the well known $EA81 interrupt exit handler code in the C64 KERNAL. You can of course do whatever you want, but this shows just how simple it can be. The border colour stuff is of course optional, but let's us see just how little raster time this loader uses.
    ;; Here is our nice minimalistic IRQ handler that calls the fastload IRQ
    dec $d019

    ;; Call fastload and show raster time used in the loader
    lda #$01
    sta $d020
    jsr fastload_irq
    lda #$00
    sta $d020

    ;; Chain to KERNAL IRQ exit
    jmp $ea81

As mentioned, I set this demo up to load GYRRUS into bank 4, just because that was a file on the disk image I had active in my MEGA65 at the time.  Note that the filename has to be padded with $A0s, because the fast-load code literally compares all 16 bytes of the filename with the 16 bytes of filename in the directory sectors. It doesn't support partitions or sub-directories on the disk image, but someone could hack that in if they wanted it, but I don't think it will be necessary for almost all use-cases.
    ;; GYRRUS for testing
    !byte $47,$59,$52,$52,$55,$53,$a0,$a0
    !byte $a0,$a0,$a0,$a0,$a0,$a0,$a0,$a0

;; ----------------------------------------------------------------------------
;; ----------------------------------------------------------------------------
;; ----------------------------------------------------------------------------
So that was the code for our example driver of the fast load. For your own programs, you can cut everything above here away, and just keep what follows.  It requires about 1.2KB, including the 512 byte sector buffer, so its quite small in the grand scheme of things.   
    ;; ------------------------------------------------------------
    ;; Actual fast-loader code
    ;; ------------------------------------------------------------
First up, we have the variables and temporary storage for the fast loader: The filename and length (which actually gets ignored, because of the use of $A0 padding, so can be removed at some point), the address where the user wants to load, and the state/status variable.  These four variables are the only ones you need to access from your code. Everything else that follows is internal to the fast-loader.

    !byte 0
    !byte 0,0,0,0
    ;; Start with seeking to track 0
    !byte 4
This variable keeps track of which physical track on the disk the loader thinks the head is currently over, so that we can step to the correct track:

fl_current_track:    !byte 0

Then we have variables for the logical track and sector of the next 256 byte block of the file. These have to get translated into the physical track and sector of the drive, which like the 1581, stores two blocks in each physical sector.
fl_file_next_track:    !byte 0
fl_file_next_sector:    !byte 0

Then finally, we have the 512 byte sector buffer. Now, this could be optimised away, by enabling mapping of the sector buffer at $DE00-$DFFF, but I couldn't be bothered remembering how to do that, and also didn't want to cause potential problems for code that also uses REU emulation or other things that might appear in the IO area. It's not that it can't be done, but rather that I just took the quick and easy path.  It would be a great exercises for the reader to change this, and reduce the total size of the loader to <1KB as a result.   

Now let's take a look at the fast-loader's IRQ handler.  It basically checks if there is an active request, and if not does nothing. Then it checks if the floppy controller is busy doing something that it asked it to earlier. If so, it does nothing.  But if we have an active job, and the floppy controller is not busy, this means that we can ask for the next operation to occur.  The fastload_request variable doubles as the state number for the resulting simple state-machine.  This approach really simplifies the code a lot, and makes it much easier to run in an interrupt.

Before going further, it is worth noting that if you run the interrupt on a normal raster IRQ, the loader will be able to load at most one block = 254 bytes of usable data per frame.  This means 254 x 50 = ~12.7KB/sec in PAL or 15.2KB/sec in NTSC.  If you are using a real 800KB 1581 disk, that's not a problem, because the drive will slow you down more than that.  But if you are using a disk image, or one of the MEGA65's HD disk formats, then this will slow things down.  

The easy solution is to have your IRQ routine trigger multiple times per frame, or enable IRQs in the floppy controller, and have it be called on demand whenever a sector is ready. You will need to acknowledge the floppy controller interrupts, if you do that.

There is also a further ~2x speed up without doing that which is possible by modifying the loader to realise when a single sector contains two consecutive blocks of a file. It doesn't currently do this, which is a bit stupid.  Fixing that would also be a great exercise for the reader.

    ;; If the FDC is busy, do nothing, as we can't progress.
    ;; This really simplifies the state machine into a series of
    ;; sector reads
    lda fastload_request
    bne todo
    lda $d082
    bpl fl_fdc_not_busy
    ;; FDC is not busy, so check what state we are in
    lda fastload_request
    bpl fl_not_in_error_state

It's worth explaining how the IRQ handler calls the various routines for the different states, because it uses a nice feature of the 65CE02: JMP indirect, X-indexed.  This instruction basically allows you to have a jump-table without the silly push-addr-minus-one to stack trick you have to use on the C64. The resulting code is quite a lot simpler and clearer as a result:
    ;; Shift state left one bit, so that we can use it as a lookup
    ;; into a jump table.
    ;; Everything else is handled by the jump table
    cmp #6
    bcc fl_job_ok
    ;; Ignore request/status codes that don't correspond to actions
    jmp (fl_jumptable,x)
    !16 fl_idle
    !16 fl_new_request
    !16 fl_directory_scan
    !16 fl_read_file_block
    !16 fl_seek_track_0
    !16 fl_step_track

The first of those state routines is the one for when the loader is idle: Just return immediately. This can be optimised away, since there are (1) plenty of other RTS instructions we could point at; and (2) because it never gets called, because we have the short-circuit exit at the start of the IRQ handler.  If you haven't already gotten the idea by now, you can tell that I have really just hacked this together until it works, and then stopped to document it.  Lots of opportunities for you to get involved and improve it ;)

The next state handler checks if we are on track 0 yet, and if not, commands a step towards track 0, which like all other floppy controller actions, will have the floppy controller busy until the step has completed. Again, our nice busy check in the start of the IRQ handler means that we can just keep stepping in this routine until we reach track 0. Note how it writes $00 into fastload_request when done, to indicate that the loader is idle and ready for a new job.
    lda $d082
    and #$01
    bne fl_not_on_track_0
    lda #$00
    sta fastload_request
    sta fl_current_track
    ;; Step back towards track 0
    lda #$10
    sta $d081

As you saw in the demo driver code, to submit a new job, you write $01 into fastload_request. This causes the following routine to be run when the IRQ is next triggered.  It puts $02 into fastload_request, so that it knows that it has just accepted a job, and also immediately requests the reading of the first physical sector that contains a directory block, ready for us to look for the requested file.
    ;; Acknowledge fastload request
    lda #2
    sta fastload_request
    ;; Start motor
    lda #$60
    sta $d080
    ;; Request T40 S3 to start directory scan
    ;; (remember we have to do silly translation to real sectors)
    lda #40-1
    sta $d084
    lda #(3/2)+1
    sta $d085
    lda #$00
    sta $d086         ; side
    ;; Request read
    jsr fl_read_sector

The above set fastload_request to call this routine on each IRQ, i.e., as each sector of the directory is loaded. We then look through the whole 512 byte sector for a matching filename, and if found, change state to load the file from the logical track and sector of the first block of the file as obtained from the directory listing. Note that we ignore the file type, including if the file is deleted. Again, a great opportunity for someone to improve the loader.
    ;; Check if our filename we want is in this sector
    jsr fl_copy_sector_to_buffer

    ;; (XXX we scan the last BAM sector as well, to keep the code simple.)
    ;; filenames are at offset 4 in each 32-byte directory entry, padded at
    ;; the end with $A0
    lda #<fastload_sector_buffer
    sta fl_buffaddr+1
    lda #>fastload_sector_buffer
    sta fl_buffaddr+2

    ldx #$05
    ldy #$00


    lda fastload_sector_buffer+$100,x
    cmp fastload_filename,y   
    bne fl_filename_differs
    cpy #$10
    bne fl_check_loop_inner
    ;; Filename matches
    sbc #$12
    lda fl_buffaddr+2
    cmp #>fastload_sector_buffer
    bne fl_file_in_2nd_logical_sector
    ;; Y=Track, A=Sector
    lda fastload_sector_buffer,x
    lda fastload_sector_buffer+1,x
    jmp fl_got_file_track_and_sector
    ;; Y=Track, A=Sector
    lda fastload_sector_buffer+$100,x
    lda fastload_sector_buffer+$101,x
    ;; Store track and sector of file
    sty fl_file_next_track
    sta fl_file_next_sector
    ;; Request reading of next track and sector
    jsr fl_read_next_sector
    ;; Advance to next state
    lda #3
    sta fastload_request
    ;; Skip same number of chars as though we had matched
    cpy #$10
    beq fl_end_of_name
    jmp fl_filename_differs
    ;; Advance to next directory entry
    adc #$10
    bcc fl_filenamecheckloop
    inc fl_buffaddr+2
    lda fl_buffaddr+2
    cmp #(>fastload_sector_buffer)+1
    bne fl_checked_both_halves
    jmp fl_check_logical_sector
    ;; No matching name in this 512 byte sector.
    ;; Load the next one, or give up the search
    inc $d085
    lda $d085
    cmp #11
    bne fl_load_next_dir_sector
    ;; Ran out of sectors in directory track
    ;; (XXX only checks side 0, and assumes DD disk)

    ;; Mark load as failed
    lda #$80         ; $80 = File not found
    sta fastload_request   

We now have several little utility routines related to reading sectors from the disk, including doing the conversion from 1581 logical sectors to 3.5" floppy physical sectors, and tracking the head if we aren't on the correct track already etc. If it detects that it needs to step the head, it changes fastload_request to point to a handler for that, which in turn sets it back to the handler for reading blocks of the file.

Note that I haven't actually tried this on a real disk, yet. This should be done, as there will quite likely be some subtle problem that will need shaking out, most likely with the track stepping. But it shouldn't be too hard to fix, and who knows, I might have got it right the first time ;)
    ;; Request read
    jsr fl_read_sector
    ;; No need to change state

    ;; Check if we are already on the correct track/side
    ;; and if not, select/step as required
    lda #$40
    sta $d081

    lda #3
    sta fastload_request
    ;; Check if we reached the end of the file first
    lda fl_file_next_track
    bne fl_not_end_of_file
    ;; Read next sector of file
    jsr fl_logical_to_physical_sector

    lda fl_current_track
    lda $d084
    cmp fl_current_track
    beq fl_on_correct_track
    bcc fl_step_in
    ;; We need to step first
    lda #$18
    sta $d081
    inc fl_current_track
    lda #5
    sta fastload_request
    ;; We need to step first
    lda #$10
    sta $d081
    dec fl_current_track
    lda #5
    sta fastload_request
    jsr fl_read_sector

Here we have another utility routine that does the logical-to-physical track and sector conversion. Again, this basically mirrors what the 1581 does. It will need modifying to use the fast-loader on HD disks, because there will be more sectors on each side of the disk.
    ;; Convert 1581 sector numbers to physical ones on the disk.
    ;; Track = Track - 1
    ;; Sector = 1 + (Sector/2)
    ;; Side = 0
    ;; If sector > 10, then sector=sector-10, side=1
    lda #$00         ; side 0
    sta $d086
    lda fl_file_next_track
    sta $d084
    lda fl_file_next_sector
    cmp #10
    bcs fl_on_second_side
    sta $d085
    jmp fl_set_fdc_head
    sbc #10
    sta $d085
    lda #1
    sta $d086

    ;; Select correct side of real disk drive
    lda $d086
    and #$08
    ora #$60
    sta $d080

This is the routine that really does the loading: It gets the read physical sector, works out which half of it contains the data for us, DMAs the read bytes into the destination location in memory, and then follows the block chain to the next block of the file, and detects the end-of-file marker indicated by logical track = $00.
    ;; We have a sector from the floppy drive.
    ;; Work out which half and how many bytes,
    ;; and copy them into place.

    ;; Get sector from FDC
    jsr fl_copy_sector_to_buffer

    ;; Assume full sector initially
    lda #254
    sta fl_bytes_to_copy
    ;; Work out which half we care about
    lda fl_file_next_sector
    and #$01
    bne fl_read_from_second_half
    lda #(>fastload_sector_buffer)+0
    sta fl_read_dma_page
    lda fastload_sector_buffer+1
    sta fl_file_next_sector
    lda fastload_sector_buffer+0
    sta fl_file_next_track
    bne fl_1st_half_full_sector
    lda fastload_sector_buffer+1
    sta fl_bytes_to_copy   
    ;; Mark end of loading
    lda #$00
    sta fastload_request
    jmp fl_dma_read_bytes
    lda #(>fastload_sector_buffer)+1
    sta fl_read_dma_page
    lda fastload_sector_buffer+$101
    sta fl_file_next_sector
    lda fastload_sector_buffer+$100
    sta fl_file_next_track
    bne fl_2nd_half_full_sector
    lda fastload_sector_buffer+$101
    sta fl_bytes_to_copy
    ;; Mark end of loading
    lda #$00
    sta fastload_request

    ;; Update destination address
    lda fastload_address+3
    sta fl_data_read_dmalist+2
    lda fastload_address+2
    ora fl_data_read_dmalist+2
    sta fl_data_read_dmalist+2
    lda fastload_address+2
    and #$0f
    sta fl_data_read_dmalist+12
    lda fastload_address+1
    sta fl_data_read_dmalist+11
    lda fastload_address+0
    sta fl_data_read_dmalist+10

    ;; Copy FDC data to our buffer
    lda #$00
    sta $d704
    lda #>fl_data_read_dmalist
    sta $d701
    lda #<fl_data_read_dmalist
    sta $d705

    ;; Update load address
    lda fastload_address+0
    adc fl_bytes_to_copy
    sta fastload_address+0
    lda fastload_address+1
    adc #0
    sta fastload_address+1
    lda fastload_address+2
    adc #0
    sta fastload_address+2
    lda fastload_address+3
    adc #0
    sta fastload_address+3
    ;; Schedule reading of next block
    jsr fl_read_next_sector

We are now almost at the end. What we have here is the DMA lists for copying the read data to its final destination, as well as the routine and DMA list for copying a physical sector from the FDC's buffer down to fastload_sector_buffer.  As previously noted, we can probably shrink the whole thing (and make it use less raster time) by avoiding that copy, if we instead fiddle the IO banking to make the floppy sector buffer map at $DE00-$DFFF (there is a special bit that enables this).  But what we have here works, and isn't that much slower, as the DMA doesn't take very long. 
    !byte $0b      ; F011A type list
    !byte $81,$00      ; Destination MB
    !byte 0         ; no more options
    !byte 0            ; copy
    !word 0               ; size of copy
fl_read_dma_page = fl_read_page_word + 1
    ;; +2 is to skip track/header link
    !word fastload_sector_buffer+2    ; Source address
    !byte $00        ; Source bank
    !word 0                 ; Dest address
    !byte $00             ; Dest bank
    !byte $00             ; sub-command
    !word 0                 ; modulo (unused)
    ;; Make sure FDC sector buffer is selected
    lda #$80
    trb $d689

    ;; Copy FDC data to our buffer
    lda #$00
    sta $d704
    lda #>fl_sector_read_dmalist
    sta $d701
    lda #<fl_sector_read_dmalist
    sta $d705

    !byte $0b      ; F011A type list
    !byte $80,$ff            ; MB of FDC sector buffer address ($FFD6C00)
    !byte 0         ; no more options
    !byte 0            ; copy
    !word 512        ; size of copy
    !word $6c00        ; low 16 bits of FDC sector buffer address
    !byte $0d        ; next 4 bits of FDC sector buffer address
    !word fastload_sector_buffer ; Dest address   
    !byte $00             ; Dest bank
    !byte $00             ; sub-command
    !word 0                 ; modulo (unused)

And that's it.  The loader really is quite simple, especially compared with a 1541 fast-loader.  You can find the source in, just look for fastload-demo.asm.

Finally, a somewhat arbitrary screen-shot, because every blog post requires at least one, but its kind of hard to show a fast-loader in action in a still image.

Tuesday 2 November 2021

Speeding up the MEGA65 flash menu

The MEGA65's flash menu that lets you write new cores into the flash is, shall we say, a little pedestrian in speed.  It takes close to 15 minutes to write a new core, which is really annoying.  

It's also become a bit important for another reason, because Trenz need a tool to flash the MEGA65 production boards, because Vivado is refusing to flash the new shiny 512mbit (64MB) flash chip that is going on the production machines for some unknown reason.  They can't afford to spend 15 minutes on each machine flashing them.

I just timed Vivado flashing a bitstream, and it took 165 seconds = 2 minutes, 45 seconds, so that's our goal. 

Now, what is interesting is that Vivado is much slower than what the flash can do.  In theory, we can erase at around 500KB/sec, flash at around 1MB/sec, and verify back at >1MB/sec.  For an 8MB bitstream this gives us a theoretical time of 8MB/500KB/sec + 8MB/1MB/sec + 8MB/1MB/sec = 16 + 8 + 8 = 32 seconds.  Now, that would be really nice if we can reach.  But I'll just be happy if we can get down to 165 seconds or better, like Vivado does.

To improve from our current ~15 minutes = ~900 seconds down to 165 seconds, we have quite a bit of improvement to make.  Fortunately this should be fairly easy, as the root cause of the slowness is that we are using CC65 as our compiler, which produces slow code, and then bit-bashing the QSPI communications.  So getting it much faster than now should be quite straight forward.

But first, we need an easy way to test the flash program, because currently the QSPI flash is only accessible when in hypervisor mode. So I have made it so that dip-switch 3 now enables access to the QSPI flash from any mode. This should not be normally enabled, as it can cause your QSPI flash to get trashed. But for production of machines (and testing of my flash program speed-ups), its fine.

With that out the way, it was time to start implementing the QSPI speed up stuff.  I could in theory implement a complete QSPI controller in hardware, but that's a lot of work, and not really needed, because it is just the large transfers for reading and writing the flash to verify and program it that take by far the most time -- more than 90% in fact.

So instead I am just implementing hardware acceleration of exactly those options.  The QSPI lines are routed through the SD card controller, which already has a nice buffer that I can re-use.  For some reason I use the "Q" nybl-based modes for reading from the flash, but single-bit ones for writing. IT would be faster to use the Q mode for both, as it will reduce the time to write a byte from ~8x4 = 32 cycles down to 4 cycles per byte. But as the flashing itself takes ~1usec per byte, we will still be at 50% efficiency at least.  Somewhat similarly for the reading, we could run the QSPI flash at >40MHz, but that would require more work for even less gain -- especially since we still have some logic code from CC65 slowing things down as well.

For the commands to setup those transfers, we can also inline some of the functions to help things along a bit. About 2x to 3x for some parts of things was possible there, but still not enough to get us near 165 seconds.

To help track the improvements, I have improved the progress bars in the flash program to show the speed and time remaining using the RTC to do the timing.

In the process I also did some more work on improving the detection of the flash chip's parameters.  This helped quite a bit, because there are two different erase commands and one works on all pages of the flash, but is really slow on most of them, while the one that is faster on the most of the pages hangs on the other pages.  This is because the chip we are using has 32 x 4KB pages at the start, and then 64KB pages after that.

Getting the reading of data working with hardware acceleration was pretty quick and easy. I also found a horrible bug in the erase code that meant that it would erase all pages, even if they were already empty.  Together those two improvements have had a dramatic improvement on the erase performance: It is now down to 19 seconds when erasing a typical bitstream that is about 5.5MB of the 8MB slot size, i.e. an erase speed of >300KB/sec.  That's down from several minutes, so that's the first part of our victory.

I have also implemented the hardware SPI writing acceleration, but there is some bug with it at present which means that the same byte is being written over and over again, which I need to investigate. But given that I writing the correct number of bytes, the speed should be about right.  And this is also greatly improved, now taking only 38 seconds, at around 164KB/sec. About half of that time is the actual fast SPI data write and the time for the flash to actually write to the non-volatile memory, so there is perhaps some scope for in-lining more stuff in the C code to speed it up a bit, but otherwise further improvements would require the Q mode writing. With both of those, it would probably be possible to get under 20 seconds for the writing phase, but honestly, 38 seconds is already fast enough to not feel annoying. The main thing is that the progress bar is continuously growing, and at a good speed.

So once I have the SPI writing bug fixed, we are looking at just under 1 minute to erase and write.  Verification should be at least as fast as erasing, so I'm hoping that we will be around 80 seconds -- that is, about 2x faster than Vivado, which is really nice!

Part of the bug is because I hadn't implemented 256 byte page writing, but rather only 512 byte page writing. That's fine for the 64MB flash chips on the production boards, but not for the existing 32MB flash in the R3 board I have here.  The errata for the 32MB flash said that you can write >256 bytes, but only the last 256 bytes will be written.  I have since fixed that, but without any visible improvement.

What I am seeing is that the same byte is being written to the flash over and over again.  Sometimes its $80, other times $00.  This says to me that bytes are probably being written, but that the bytes we are reading from the buffer may be wrong. So I might make a test that tries writing some known data and see how that goes.  That way I can also do it much faster, as I can erase the single page, write the known data, and then read it back.

Okay, so that confirms that we are writing exactly 256 bytes, but that all 256 bytes are being written with the same value, in this case $80.  I'll do a quick bit of simulation to check whether the SD buffer is being read out correctly to be written, as that strikes me as the most likely place to be borked.

Borkage duly found via simulation: I was reloading the byte from the buffer every bit, causing endless hilarity to ensue.  Now synthesising that, but slowed down by watching Shallan50k's twitch stream with the music competition results which was excellent (congratulations to @proton_fig for your great X Files tunes, and to the other entrants for their great tunes as well!). It's amazing just how much CPU it takes for the Twitch stream view. Basically was eating 75% or more of the CPU on my (admittedly 4 year old) i7 box.

On the up side, the simulation affirmed that the rest of the process looks to be behaving properly, so hopefully when the synthesis does complete, that it will work. Which it did after I eventually spotted and fixed some stupid bugs.

After that it was a case of fine-tuning various things, like reducing how often I update the progress bar. I also added a hand-written assembly routine for the verification step, as that is currently the slowest of all the actions, which is a bit silly given that erasing and writing have real work to do.

The end result is that writing a new core file to a slot can now be done in about 86 seconds -- i.e., about 1/2 the time that Vivado takes, as we can see in this screenshot:

Victory achieved!

Now to win the war, I need to back-port all those speed-ups and general improvements into the flash menu, and hope that it doesn't make it too big to fit in the bitstream... which I have also done.

To say that this makes the process of flashing a core file more pleasant is really an understatement. We have rather coincidentally gone from C64 datasette to disk drive loading times, and the impact feels just as profound: You can now flash a core without thinking about what you will do for the next 1/4 hour while it chugs away.

While speed further improvements are possible, it doesn't really feel like it is necessary, given that the theoretical minimum time is something like 20 seconds, and it would be a lot of effort to claw back any of that extra minute -- but it is only one extra minute.

The only further improvement I am likely to make down the track is to make a utility that will allow safe reflashing of slot 0 using this new dip-switch 3 mechanism: The program will check that the FPGA has booted from slot 1 or 2, and thus be satisfied that slot 0 can be written over without bricking the machine, and only if that is the case, will it attempt to flash.  But to make sure people don't leave switch 3 on all the time, which would allow malicious software to brick your MEGA65, I'll likely put an inter-lock into the hypervisor that requires you to press some key to continue booting if it is enabled, so that you don't forget.

So that's all that, really.

Sunday 12 September 2021

Adding transparent support for HD floppies

The internal floppy drive in the MEGA65 is actually a standard PC-compatible HD / 1.44MB floppy drive, so while the C65 DOS only understands DD disks, the hardware is capable of more.

While we could just stick to DD media, there are some good reasons to support HD media.  One of the key ones, is that the MEGA65's advanced features mean that its quite conceivable to imagine a game that would like more than 800KB of data on a disk.  In fact, this thought was triggered exactly by @Shallan50k wanting to fit 1MB maps for his kart-racer game for the MEGA65 onto a disk.

There are several challenges that I would like to attack in doing this:

1. The resulting disks should work with the C65 DOS, without modification, at least to the extent of being able to get a directory listing, and load one modest size program from it.

2. The disk format should allow creating disks that can be written to using the MEGA65's floppy controller, if that is the user's intention.

3. For mastering disks for games/software distribution, we don't care about write-ability (at least not on all tracks), but we would really like to be able to cram as much data as possible onto a disk.  By reducing the inter-sector gaps, it is possible to fit more sectors on a track. That's how the Amiga gets 880KB on a DD disk, compared to the 1581's 800KB.

4. We need a common disk image format that can be used for all variants of the above, to ease software development, and make it possible to run these images from SD card.

(1) and (2) are the easiest ones to solve. In fact, I have solved them already, by implementing a 2nd parallel MFM decoder in the MEGA65's floppy controller, that runs at 2x the data rate of the main decoder. As HD disks run at 2x the data rate, this means that we automatically are able to read (but not write) HD formatted disks, without having to modify the C65 DOS -- but the C65 DOS will only see the first 10 sectors on each track, which is totally fine for goal (1), as the directory listing can appear, and we can load upto ~800KB of files from it, by using only the DD-compatible sector numbers.

The floppytest.prg program in the repository now includes options to format and test HD-formatted disks created in this way, with 20 sectors per side instead of 10 sectors per side, giving a 1600KB fully-writeable disk. 

To be able to write to such disks, you just have to poke $28 into $D6A2, to set the floppy controller to the HD data rate, so that the MFM encoder (of which there is still only one) is looking for HD-formatted sectors.

So that's all solved.

It's with (3) is where it gets more interesting. We could just go Amiga style, and settle for 21 sectors per side per track, and thus get 1760KB per disk. But I know that we can fit more.  For a start, we can go 1541 style, and vary the data encoding rate on different tracks, and fit more sectors onto the outer tracks, since the normal data rate is good enough for the inner-most track.  But before we do that, let's talk for a bit about how a floppy works, and some of the important aspects of magnetic recording that affect us.

Magnetic grain size refers to the size of the individual magnetic domains on the floppy.  For the inner most tracks, fewer particles will pass under the head per second, so we have to use longer magnetic pulses, i.e., store less.  For the outer tracks, more magnetic domains will pass under the head per second, so we can have more data on each track out there.  I'll talk about the relative length of the inner and outer tracks shortly.

Magnetic signal strength is how strong the raw signal from the floppy drive is. It's strength scales with the square of the velocity of the magnetic transitions going past it. So on the inner tracks, the media is passing by more slowly, and the signal will be quite a bit weaker than on those lovely longer outer tracks, where that longer circumference goes past in the same time, i.e., at higher speed.

So in short, we should absolutely be able to cram more sectors on those outer tracks by increasing the data rate.  

By way of comparison, the 1541 that used only 35 tracks varies between 17 and 21 sectors per track. That is, on the outer tracks, it crams in 21/17 = 123% more data than on the inner tracks. And fully half the disk uses that maximum number of sectors per track, as it really is only the innermost few tracks that are so short as to be a problem. In fact, only the first 5 tracks on the 1541 have the minimum 17 sectors per track.  Thus compared with a naive 17 x 35 = 595 sectors, the 1541 actually fits 683 sectors, i.e., almost 115% of the constant-sector-count capacity.

So let's think about what should be possible on an 80 track 3.5" floppy disk. We know that the floppy drive can read a standard 11 sectors per inner-track format, because that's what the Amiga did. So we will scale up from there. I'll come back to how we actually produce such tracks later.

First up, we need to know that track 0 is actually the outermost, i.e., longest track, so we will fit more sectors on lower-numbered tracks, and fewer sectors on the higher-numbered tracks.

This site claims the following about 3.5" disks:

        track spacing: .0074 inches 
	drive track radius from center (inner to outer) 
                 side 0 .9719 inches to 1.5551 inches
                 side 1 .9129 inches to 1.4961 inches
	track width: .115 mm (.0045 inch) after "trim erase" on either side (not confirmed for 4MB format)
		also see erase notes

If this is true, and the difference really is 8 tracks, I find that quite amazing, as it means that we could have had an extra 8 tracks on side 1, and that side 1 should have much worse properties near the inner-most track than side 0 has -- a property that we might be able to exploit.

But for now, let's just assume the smallest benefit, which comes from the largest diameters, i.e., that of side 0: The outer-most track is 1.5551/0.9719 times longer = 160% the length of the inner most track.  I was expecting some nice benefit, but fully 160% is even more than I had anticipated.  This means that we should be able to fit 160% x 11 = 35 sectors on the outermost track, instead of the 18 that a PC fits. Even the Amiga's HD floppy's "cramming" of 22 sectors onto the track starts to look quite lame.

So let me do a bit of a calculation here as to how many sectors we can fit, using Amiga style track-at-once, and also 1581-style sector-at-once fitting. Basically we work out the relative length of the tracks versus the inner-most track, and then scale up the number of sectors to the largest integer that fits.

Let's look at how many sectors each track would fit, and the cumulative number of sectors on the disk to that point using 1581 and Amiga style track writing:

              1581-style (sector)   Amiga-style (track)
Track #00 :   32/  64 (  32 KB) :   35/  70 (  35 KB)
Track #01 :   31/ 126 (  63 KB) :   35/ 140 (  70 KB)
Track #02 :   31/ 188 (  94 KB) :   34/ 208 ( 104 KB)
Track #03 :   31/ 250 ( 125 KB) :   34/ 276 ( 138 KB)
Track #04 :   31/ 312 ( 156 KB) :   34/ 344 ( 172 KB)
Track #05 :   31/ 374 ( 187 KB) :   34/ 412 ( 206 KB)
Track #06 :   31/ 436 ( 218 KB) :   34/ 480 ( 240 KB)
Track #07 :   30/ 496 ( 248 KB) :   34/ 548 ( 274 KB)
Track #08 :   30/ 556 ( 278 KB) :   33/ 614 ( 307 KB)
Track #09 :   30/ 616 ( 308 KB) :   33/ 680 ( 340 KB)
Track #10 :   30/ 676 ( 338 KB) :   33/ 746 ( 373 KB)
Track #11 :   30/ 736 ( 368 KB) :   33/ 812 ( 406 KB)
Track #12 :   30/ 796 ( 398 KB) :   33/ 878 ( 439 KB)
Track #13 :   30/ 856 ( 428 KB) :   33/ 944 ( 472 KB)
Track #14 :   29/ 914 ( 457 KB) :   32/1008 ( 504 KB)
Track #15 :   29/ 972 ( 486 KB) :   32/1072 ( 536 KB)
Track #16 :   29/1030 ( 515 KB) :   32/1136 ( 568 KB)
Track #17 :   29/1088 ( 544 KB) :   32/1200 ( 600 KB)
Track #18 :   29/1146 ( 573 KB) :   32/1264 ( 632 KB)
Track #19 :   29/1204 ( 602 KB) :   32/1328 ( 664 KB)
Track #20 :   29/1262 ( 631 KB) :   31/1390 ( 695 KB)
Track #21 :   28/1318 ( 659 KB) :   31/1452 ( 726 KB)
Track #22 :   28/1374 ( 687 KB) :   31/1514 ( 757 KB)
Track #23 :   28/1430 ( 715 KB) :   31/1576 ( 788 KB)
Track #24 :   28/1486 ( 743 KB) :   31/1638 ( 819 KB)
Track #25 :   28/1542 ( 771 KB) :   31/1700 ( 850 KB)
Track #26 :   28/1598 ( 799 KB) :   30/1760 ( 880 KB)
Track #27 :   27/1652 ( 826 KB) :   30/1820 ( 910 KB)
Track #28 :   27/1706 ( 853 KB) :   30/1880 ( 940 KB)
Track #29 :   27/1760 ( 880 KB) :   30/1940 ( 970 KB)
Track #30 :   27/1814 ( 907 KB) :   30/2000 (1000 KB)
Track #31 :   27/1868 ( 934 KB) :   30/2060 (1030 KB)
Track #32 :   27/1922 ( 961 KB) :   29/2118 (1059 KB)
Track #33 :   27/1976 ( 988 KB) :   29/2176 (1088 KB)
Track #34 :   26/2028 (1014 KB) :   29/2234 (1117 KB)
Track #35 :   26/2080 (1040 KB) :   29/2292 (1146 KB)
Track #36 :   26/2132 (1066 KB) :   29/2350 (1175 KB)
Track #37 :   26/2184 (1092 KB) :   29/2408 (1204 KB)
Track #38 :   26/2236 (1118 KB) :   28/2464 (1232 KB)
Track #39 :   26/2288 (1144 KB) :   28/2520 (1260 KB)
Track #40 :   26/2340 (1170 KB) :   28/2576 (1288 KB)
Track #41 :   25/2390 (1195 KB) :   28/2632 (1316 KB)
Track #42 :   25/2440 (1220 KB) :   28/2688 (1344 KB)
Track #43 :   25/2490 (1245 KB) :   28/2744 (1372 KB)
Track #44 :   25/2540 (1270 KB) :   27/2798 (1399 KB)
Track #45 :   25/2590 (1295 KB) :   27/2852 (1426 KB)
Track #46 :   25/2640 (1320 KB) :   27/2906 (1453 KB)
Track #47 :   24/2688 (1344 KB) :   27/2960 (1480 KB)
Track #48 :   24/2736 (1368 KB) :   27/3014 (1507 KB)
Track #49 :   24/2784 (1392 KB) :   27/3068 (1534 KB)
Track #50 :   24/2832 (1416 KB) :   26/3120 (1560 KB)
Track #51 :   24/2880 (1440 KB) :   26/3172 (1586 KB)
Track #52 :   24/2928 (1464 KB) :   26/3224 (1612 KB)
Track #53 :   24/2976 (1488 KB) :   26/3276 (1638 KB)
Track #54 :   23/3022 (1511 KB) :   26/3328 (1664 KB)
Track #55 :   23/3068 (1534 KB) :   26/3380 (1690 KB)
Track #56 :   23/3114 (1557 KB) :   25/3430 (1715 KB)
Track #57 :   23/3160 (1580 KB) :   25/3480 (1740 KB)
Track #58 :   23/3206 (1603 KB) :   25/3530 (1765 KB)
Track #59 :   23/3252 (1626 KB) :   25/3580 (1790 KB)
Track #60 :   23/3298 (1649 KB) :   25/3630 (1815 KB)
Track #61 :   22/3342 (1671 KB) :   25/3680 (1840 KB)
Track #62 :   22/3386 (1693 KB) :   24/3728 (1864 KB)
Track #63 :   22/3430 (1715 KB) :   24/3776 (1888 KB)
Track #64 :   22/3474 (1737 KB) :   24/3824 (1912 KB)
Track #65 :   22/3518 (1759 KB) :   24/3872 (1936 KB)
Track #66 :   22/3562 (1781 KB) :   24/3920 (1960 KB)
Track #67 :   21/3604 (1802 KB) :   24/3968 (1984 KB)
Track #68 :   21/3646 (1823 KB) :   23/4014 (2007 KB)
Track #69 :   21/3688 (1844 KB) :   23/4060 (2030 KB)
Track #70 :   21/3730 (1865 KB) :   23/4106 (2053 KB)
Track #71 :   21/3772 (1886 KB) :   23/4152 (2076 KB)
Track #72 :   21/3814 (1907 KB) :   23/4198 (2099 KB)
Track #73 :   21/3856 (1928 KB) :   23/4244 (2122 KB)
Track #74 :   20/3896 (1948 KB) :   22/4288 (2144 KB)
Track #75 :   20/3936 (1968 KB) :   22/4332 (2166 KB)
Track #76 :   20/3976 (1988 KB) :   22/4376 (2188 KB)
Track #77 :   20/4016 (2008 KB) :   22/4420 (2210 KB)
Track #78 :   20/4056 (2028 KB) :   22/4464 (2232 KB)
Track #79 :   20/4096 (2048 KB) :   22/4508 (2254 KB)

So remember that this is on a nominal "1.44MB" floppy, and using only 80 tracks. With 82 or 84 tracks, we can squeeze a bit more out. But remember those high-numbered tracks are inner-tracks, so the benefit will only be quite small.

What is interesting with the 1581-style approach is that we end up with exactly 2MiB. I have no idea if that is coincidence or part of the "2MB unformatted capacity" that is touted around 3.5" HD disks.  It might well be the latter, as this capacity calculation is based on constant bits per inch.

Now, coming back to "practical land", we can see 32 sectors per track is the most with a 1581-style format, that would allow disks to be written to using normal sector operations, or for about a 10% capacity increase, we need to deal with upto 35 sectors per track.  For those not familiar with floppy formatting, the Amiga squeezes its 10% extra capacity out of disks by having much shorter gaps between the sectors, because it doesn't need to tolerate variation in rotational speed between the drive that formatted the disk, and the drive that is writing to the disk right now.

If you have been reading these blog posts recently, you will know that the C65 DOS writes 71 gap bytes in addition to the 512 data bytes for every sector.  There are also an overhead of another 13 bytes per sector that are unavoidable, to mark where the sector starts and ends, and which sector it is. So, in short, the 1581 uses 13 + 512 + 71 = 596 bytes per sector written.  The Amiga reduces the number of gap bytes, so that it can safely fit an extra sector on.  

Think about it like this: The 1581 requires 596 bytes to write a single sector, and fits 10 on a disk, so needs a track to fit 5,960 bytes on it. If the Amiga wants to fit 11 sectors, it needs to reduce that down to 5,960 / 11 = ~541 bytes.  We know we can get away with as little as 13 + 512 = 525 bytes per sector, and 525 x 11 < 5,960, so the Amiga can fit the extra sector in. But 525 x 12 = 6,300 bytes, which is a bit too much, so this is why the Amiga couldn't fit 12 sectors per track.

Now, back to our situation, if we use that refined information, and scale the 5,960 bytes on the inner-most  track, and needing 525 bytes per sector, its possible we can squeeze an extra sector per track here and there.  But its probably not worth flying quite that close to the wire. 

If we were going to do that, we could just add support for some kind of bizarre "super sector" that fills a whole track with as many bytes as we can.  The CRC16 would then probably not be strong enough anymore, and we would probably want to consider using an even higher data rate and using some appropriate error correction code to handle the kinds of errors that happen on floppy media.  I might do such a thing in the future, but for now, I think its overkill.

The real question is whether we think its worth fitting an extra 10% on a disk in return for not being able to write to it sector-at-once. Or more the point, whether I should make the corresponding disk image format allow for this or not.

I'm really tempted to stick with 2048KB, as it means 32 sectors per track, which is easy to implement in the hardware for track offset calculation, and is just a pleasant round number. 

There is also a kind of half-way house, where we could have 32 sectors on the first 20 tracks, which would be track-at-once, and then have normally writeable tracks after that, to squeeze a few extra KB out of the disk.  This wouldn't require a different disk image format, as I am planning on just making the disk image format be 80 tracks x 32 sectors per track x 2 sides x 512 bytes = 2,560KB, and when on SD card, you would have that full capacity available, and if you write it to a real disk, there is some dead space. But maybe I will just stop whinging about multiplying by 35 in hardware, and make the image format allow for, say, 40 sectors per track = 3200KB total size, in case I come up with future improvements that allow reaching that (like finding out how much we can creep the data rate up, over all ;)

This would mean that the programmer has the responsibility to know which sectors are safe to write to, but I think that's not unreasonable, since there will already need to be some mechanism for changing the data rate based on which track you are on, which leads me back to a particular problem... the directory track.

If we are going to support these disks, we still need to have some magic to make reading the directory track work, in the very least. The loader program could be required to live on that track, or more easily, on track 79, which would be at the normal data rate.

We also have to look at whether we can specify the data rate accurately enough to get all of these track sizes.  The data rate is specified in "40.5MHz cycles per magnetic interval", with normal HD disks at a setting of $28 = 40.  For the outer-most tracks we want 160% of the data rate, so we need 40 / 1.6 = 25 = $19.  So we have 15 different steps along the way, although because they are based on different divisors, they are not equally spaced.

I'll tweak my little program to work out the data rates that are possible on each track from the real ones available, and map the track fitting to those.  Hopefully it won't result in the loss of too many actual sectors on the disk format.  If it does, I could look at changing the way the floppy data rate is calculated from a simple divisor to an accumulator approach, that would allow much more accurate specification. But we will see if it is necessary, first.

So by using the rates we currently have available, we need to allow a tolerance of only 1.7% to get exactly 2048KB on a disk.  If we are strict and require 0% maximum excess data rate, then it drops to 2011KB.  If we were to allow 5% over-rate, then we could fit 2116KB, which says very much diminishing returns to me. These are all for sector-by-sector capacities.  For Amiga-style track-at-once writing, then the capacity would be 2403KB at 5%, 2328KB at 1.7% and 2287KB with strict 0% excess data rate.

If we really want to get that bit extra, then it probably makes much more sense to just use 82 or 84 tracks, which almost all drives can read, which at 1.7% speed tolerance allows up to 2,416KB on an Amiga-style track-at-once disk, and 2,128KB on a 1581-style sector-at-once disk.

So, back to specifying a convenient disk image format, I am going to go for 40 sectors x 85 tracks maximum = 3,400KiB = 3,481,600 bytes. Of course, as described above, only about 2MB of that will be usable on a real floppy, but this inefficiency is the cost of having the MEGA65's SD card floppy emulation logic being able to efficiently handle them, and at the same time, allowing for us to extract some future improved capacity out of the real floppy drive -- although I think 2MB -- 2.2MB on nominally 1.44MB media is still a pretty nice result. 

Thus I will now turn my attention to testing the feasibility of all of this by writing some code to actually master such variable data-rate disks, and making sure that the sectors fit, and that they can be reliably read back.  This will probably also require getting some other folks to do the same, to test repeatability on different drives -- although the fact that we are keeping withing 1.7% of the fairly conservative officially supported data rates of the media gives me a fair bit of confidence.

What I am more interested in finding out, is just how close we actually come to filling tracks with these extra sectors at those rates: Does it all fit at all? Is there enough spare space to try cramming an extra sector or two onto some of the tracks?

To answer these questions, I am refactoring the floppytest.c program, so that I can request the writing of individual tracks, so that I can then try reading the sectors on them back, to find out how many sectors at which data rates fit on a track, to confirm if my back-of-envelope calculations above are correct.

In the process of doing that, I hit a funny bug, which I think is in CC65, where adding a little bit of extra code was causing some unrelated stuff to crash.  In particular, as I refactored out the track formatting code, the track reading code would break, even if I never called the formatting code first.  I've seen funny things like this with CC65 compiled programs before, and don't really know the cause.  But in this case, I could at least see that the generated code was incorrect.

In the end, I worked around it by reducing the amount of code a bit, which was incidentally writing to $C0xx, which can end up on the CC65 C stack, but is at the bottom of the 4KB stack, and shouldn't change what code the generator produced for the track reading code. Anyway, its a bit of a mystery, and I might have to keep an eye on it, to see if it happens more.

Anyway, now that I have code to format a track factored out, I'll start work on code that tries various data rates and counts how many sectors it can fit -- both with gaps for sector-at-once writing, and without them, for Amiga-style track-at-once writing, and the ~10% more sectors it should allow us to fit on each track.

That was easy to get working, and I have confirmed that it is writing tracks without the gaps. I can tell this, because the track read testing program is no longer able to keep up with an interleave of 2, because the sectors come around a bit quicker.

So next step is to make a routine that tries formatting a single track multiple times at multiple data-rates, and then checks which sectors can be read back.

Thinking ahead to a denser coding, I was reminded that what I want to use is RLL2,7 coding, not GCR coding, which actually is no more efficient than MFM.  RLL2,7 coding is a bit funny, because it uses variable length codes for different bit patterns, such as the following:

Input    Encoded

11       1000
10       0100
000      100100
010      000100
011      001000
0011     00001000
0010     00100100 

I also spent a long time trying to find out what the sync mark is for RLL2,7 encoding, and eventually found out from here, that it might be:


For it to be correct, it has to be impossible to build this by concatenating the pieces above. So let's try, beginning at all possible bit offsets into the sequence:

Starting at bit 0: it would decode as: 

1000 1000 1000 1000 1000 0100 1000 1000 1000 1000

 11   11   11   11   11   10   11   11   11   11

So that decodes, which means it shouldn't be the sync mark.

I eventually found the answer in this thesis:

RLL2,7 cannot generate 100000001001, because the only sequence with four leading zeroes is 00001000, thus it is not possible to get 100000001001 -- super simple :)

Meanwhile, back in the world of MFM encoded disks, I did some more work on working out just how much I can cram on each track using MFM coding with variable data rate for each track.  In fact, I live streamed for a couple of hours this morning working on it.

So now I have a tool that will let me try different data rates on every track, with or without inter-sector gaps, and report the highest number of sectors that could be written and read back.  This takes quite a while to run, as it of course hits bad sectors which have several seconds of timeout.  Writing this, I just realised an efficiency problem with this code, where it would try all sectors, even after one had failed, which would slow things down on the later tracks where we know for sure it can't have all sectors. More the point, if we have any error on a track, we stop at that point, so there is no point continuing after the first error. So I have set that running again now.

What I am also doing now is synthesising VHDL that will allow a complete track format command, without this need to do software feeding of every byte as it goes along.  The reason for this is that I was unable to increase the data rate to the level where I think it should go, and I think the actual problem is that the CC65 compiled code is just too slow to reliably feed the bytes.  So its possible that by moving to hardware-assisted formatting that I will be able to recover those extra potential sectors.  For example, I think I should be able to fit 36 or more sectors on track 0, but can't get above 31.  So the difference is potentially quite large.

In fact, as I discovered on the stream, its possible to cram an extra couple of sectors on a track, so it is probably more like 38 sectors as the maximum compared to the 31 I am seeing now. If I can pull that off, we should be able to get 2,176KiB on a sector-by-sector disk and 2,471KiB on an Amiga-style track-at-once disk.  All of this is also equally relevant for the RLL2,7 encoding, as it still means +50% over whatever we can do on MFM, plus potentially the odd extra sector here or there.  

So a sector-by-sector disk with 3,264KiB (which is 3.34MB using floppy "marketing MB" of 1,024,000 bytes like the 1.44MB standard does), or an Amiga-style track-at-once disk of 3,742KB (= 3.83MB in floppy marketing terms) should be possible.  Thus my dream of outclassing a 2.88MB ED floppy drive with a standard HD drive and media really does look to be in reach -- assuming a lot of things yet to be proven.

But first, its sit back and wait while I synthesise the hardware formatter, and hope that I haven't got too many bugs in it, so that I can know for sure if I have formatted disks properly when testing if they can be reliably read back.

I will probably also pull out the floppy histogram display code again, too, to see how close to overlapping the peaks for the 1.0, 1.5 and 2.0 period buckets of bits are, as a further guide as to whether this will all have a chance of working reliably.

So the hardware formatter is indeed writing things to the disk, but it is mis-calculating the CRCs for the sector headers and bodies.  I'll have to have a think about the best way to test this via simulation, as its likely that one or more bytes are not being included in the CRC, or being counted twice or something.  Past experience tells me its hard to work that out from just staring at the source.  The challenge to simulating this, is that it will take a long time, and requires the sdcardio.vhdl as part of the simulation, rather than just being the MFM encoder.  

Update: I realised that I probably wasn't feeding the CRC engine the bytes I was writing, having mistakenly thought that I had it automatically plumbed, which I didn't. So I'm hopeful that this will work now.

While I wait for that to finish synthesising, let's take another aside in the history of floppy storage, and consider the SFD1001 drive from Commodore: This was a beast in its day: 1MB on DD 5.25" disks, compared to the 1541's 170KB.  Of course it was double sided, so the fairer comparison would have been 340KB.  Also, it was designed for quad-density media, so really 680KB, but DD media tended to work just fine. It achieved this capacity by doubling the number of tracks, and I presume, increasing the data rate of the standard Commodore GCR encoding.  So with ~80 tracks, they got ~500KB per side. We should be able to double that on a 3.5" HD disk -- which we are, with >1MB per side -- but not by much.  So in a sense, what we are doing now with HD 3.5" floppies is not too different to what Commodore did with the SFD1001, except that the SFD1001 officially used media with double the density.

Meanwhile, I finally have the hardware auto-formatter generating correct CRC values, and can now format disks using it. One of the nice things, is that it reduces the time between tracks, allowing the format to complete faster.

So now its time to update floppycapacity.c, so that it uses the hardware-assisted formatting, so that we can see if we can't actually get to the maximum number of sectors we think we should -- somewhere around 36 with Amiga-style track-at-once.  

Unfortunately, it looks like the floppy drive hardware refuses to behave properly if the magnetic interval is less than 30 cycles, i.e., about 40.5MHz/30 = 1.35MHz, compared to the nominal 1MHz that HD floppies use. This is probably because the filtering circuits in the floppy drive itself thinks that anything that fast is noise, not a signal, so is suppressing it.  This is a bit of a blow for our desire for maximum density on the longer tracks, as we can only get 35% more on those tracks, not the 60% that we should be able to get.

Running the test program, this is the rates and numbers of sectors that fit on each track, with sector-gaps, like an 1581:

Tracks 0 -- 12 : Rate = 30 cycles (1.34MHz) : 28 sectors

Tracks 13 -- 25 : Rate = 31 cycles (1.31MHz) : 27 sectors

Tracks 26 -- 43 : Rate = 32 cycles (1.27MHz) : 27 sectors

Tracks 44 -- 47 : Rate = 33 cycles (1.23MHz) : 26 sectors

Tracks 48 -- 68 : Rate = 32 cycles (1.27MHz) : 27 sectors

Tracks 69 -- 75 : Rate = 33 cycles (1.23MHz) : 26 sectors

Tracks 76 -- 84 : Rate = 34 cycles (1.19MHz) : 25 sectors

First up, note that there is something weird with four tracks near the middle of the disk: Its possible that those tracks are just a bit more worn out, as I can't think of any other reason for those four consecutive tracks being worse -- unless its a bit of luck as to what was written there before, but I do wipe the tracks before writing to them.

Second, notice that for some intervals, the number of sectors we can cram on a track doesn't change, even if we drop the bitrate a bit. That's because it might be that one bit rate can fit 27.8 sectors, while the next slower bitrate can fit, say, 27.1 sectors. Those differences might become important when we try again without gaps, as it might just be enough space to fit another sector in.

But if we assume we have to drop to 26 sectors per track from track 44, that gives us:

Tracks 0 -- 12 (13 tracks) @ 28 sectors per side

Tracks 13 -- 43 (31 tracks) @ 27 sectors per side

Tracks 44 -- 75 (32 tracks) @ 26 sectors per side

Tracks 76 -- 84 (9 tracks) @ 25 sectors per side

Remember that PCs put 18 sectors per track for 1.44MB, so this is quite a bit more.  And with those 5 extra tracks, that all adds up to 2,258KB, i.e., 2.26MB "storage industry megabytes", or 1.56x the PC HD standard storage. Whether we are flying too close to the wind with any of these densities, I'm not sure, and only time will tell.

So now let's try it Amiga-style, without gaps between sectors. But first, I have to fix a bug with the hardware-assisted track formatting without inter-sector gaps. I was messing up the CRC calculation again. We should get at least 10% extra, and maybe a few "leap sectors" where there was not quite enough space to put an extra sector on with sector-gaps, but the 2.5 to 2.8 saved sectors helps us cram an extra one in.  

So I'm expecting between 170KB and 255KB extra, pulling us up into the 2.4MB -- 2.5MB range -- funnily enough about where I originally expected, just not for the exact reasons expected: We are limited with our maximum data rate to 1.34MHz, instead of 1.6MHz, but we are fitting a few extra sectors per track, regardless. But let's see what the reality is, after that synthesis completes.

A quick note while that runs, as I think about RLL2,7 encoding, though: We will still be limited to the 1.35MHz maximum pulse rate, so can still only hope for 50% greater density than MFM.  But it does make me think about longer RLL codes that have longer minimum distances between pulses, e.g., RLL4,13, that would allow for doubling the MFM data rate, but will require more accurate timing of pulses. That might let us cram more on the disk.

That has also just reminded me: It is possible that the problem we are hitting at rates above 1.34MHz is not in fact the floppy electronics, but rather the need for aggresive write pre-compensation, so that the gaps come out correctly when they are placed so closely together on the media.  The way to verify this is to read a raw track after formatting it, and see how it looks in terms of raw flux.  If I do that at various speeds, and see how the various transitions are detected (or not), and how early (or late) they appear, I should be able to get some interesting intelligence on this: It might end up being possible to push towards 1.6MHz after all, which would get us a further 18% or so on top of our 2,258KB, which would get us towards 2.65MB while still keeping sector gaps.  The thought is tantalising... But first some sleep.

The quick summary is: 2,493KB without sector gaps, so an extra 235KB, which is near the upper-end of what I was hoping for.  Now for the track-zone break-down:

Tracks 0 -- 10 (11 tracks) : Rate 30 : 31 sectors per side

Tracks 11 -- 24 (14 tracks) : Rate 31 : 30 sectors per side

Tracks 25 -- 76 (52 tracks) : Rate 33 : 29 sectors per side

Tracks 77 -- 84 (8 tracks) : Rate 34 : 28 sectors per side

So now to think about how to test if lack of write-precompensation is the problem, or if it is that the floppy.  Ideally I would look at a waveform of the data when read-back, to see if all the pulses are there, and if they have moved.  If pulses are missing, then its floppy analog electronics, and if its that pulses have moved, then it is magentic physics, and write pre-compensation should be able to fix it.

The trick to viewing the waveforms is that capturing them on the MEGA65 itself is a bit tricky using a program, because we are talking about pulses that can occur every 30 cycles, which means we need a very tight loop -- tighter than our current loop.  

What we can do, is go back to using DMA to read the FDC debug register $D6A0, which lets us directly read then floppy RDATA line. The DMA will alternate between reading that, and writing to the memory buffer, resulting in a sampling rate of 40.5MHz / 3 cycles (the register takes 2 cycles to read) = ~13MHz, which is a healthy ~10x the expected pulse rate.  We will be limited to a single DMA job of 65536 samples, for ~192K cycles = ~4.9 milliseconds. But that will be more than enough data to see what is happening.  If I run this over tracks recorded at various rates, we should get a clear picture of what is going on.

After chasing my tail on a faulty DMA job definition for a while, I am now collecting some data. For track 0, with a rate of 40.5MHz/30 = 1.35MHz, and using the DMA capture method, I am seeing the data pulses being 8 samples wide.  That means 24 / 40.5MHz = 0.6 microsecond pulse duration, which sets a hard upper limit on the data rate, as otherwise the data pulse will just be continuous, presumably. That means we should be able to detect pulses at up to about 1.69MHz minus a bit, well above the 1.35MHz we have managed.

Let's try formatting a track at rate 26, which should yield 1.56MHz, and thus should have a short gap between each of the data pulses, and we seem in fact to still be getting quite large gaps. Ah: I have of course confused myself a bit here: Because we are using MFM with an MFM rate of ~1MHz, this means an actual maximum flux inversion rate of ~0.5MHz.  Thus we should be able to safely result well above 2MHz MFM rate.  Because I can see the pulses, this makes me think that it might well be the lack of write pre-compensation after all.  So back to drawing those waveforms from the data I read.

We are getting increasingly shifted data pulses, as the pulse frequency increases, i.e., the distance between the pulses reduces. In short, we need to implement write pre-compensation, if we want to support faster data rates.

I dug around on the internet for a while, but could not find any clear explanations of the write-precompensation algorithms used on MFM floppies.  I think it might be because these algorithms were held as trade-secrets by the floppy controller manufacturers, so we will need to do some reverse-engineering.  What I was able to discover, is that the write-precompensation shift can be positive or negative in time, and is dependent on the last several magnetic bits written. 

I also read somewhere that the need for write-precompensation isn't because the magnetic domains shift during writing, but rather the result of the magnetic fields on the head during reading causing an apparent shift in bit position. This probably means that the required shift depends on past and future bits to be written.

What I think I will do, is create a way to write various known bit patterns to the floppy, and then read them back, and from that, work out how to shift the pulses to be properly lined up. 

I did a bit of this on the live-stream this morning, and basically came to the conclusion that I would need a look-up table to make any sense of it.  I did eventually find the following interesting simple set of rules:

That then led me to the software for controlling a CatWeasel floppy controller, like this here:

The important bit is here:

if (len == 2 && nextlen > 2) {
   adj = -(precomp * CWHZ/1000000000.0);
} else if (len > 2 && nextlen == 2) {
   adj = (precomp * CWHZ/1000000000.0);
} else {
   adj = 0.0;    


In other words, place a pulse early if the last pulse was short, and the next one is long, or place the pulse late, if its the other way around.  That looks like a very simple rule.  Because we are doing variable data rates, we will need to vary the amount of precompensation based on the track number, but finally I have a bit of a starting point.

So I might have a quick go at modifying my model to try implementing that, and see how well it matches with the measured values.  I can make the model match the first 180 or so clocked pulses that I logged. After that it goes to rubbish for a while, because the data seems to have rubbish in it after that, and is not well aligned between the various sample logs I made at each speed.  Basically there isn't enough data there to be sure that I have properly generalised out the rules.

In the end, I have gone back to the first link and the table-based approach, but allowing for "small" and "big" corrections, depending on the difference in the duration between inversions. The code looks like this:

          case f_write_buf is
            when "0101000" =>
              -- short pulse before, long one after : pulse will be pushed
              -- early, so write it a bit late              
              f_write_time_adj <= to_integer(write_precomp_magnitude_b);
            when "1001000" =>
              -- medium pulse before, long one after : pulse will be pushed
              -- early, so write it a bit late
              f_write_time_adj <= to_integer(write_precomp_magnitude);              
            when "0001000" =>
              -- equal length pulses either side
              f_write_time_adj <= 0;
            when "0101010" =>
              -- equal length pulses either side
              f_write_time_adj <= 0;
            when "1001010" =>
              -- Medium pulse before, short one after : pulse will be pushed late,
              -- so write it a bit early
              f_write_time_adj <= - to_integer(write_precomp_magnitude);              
            when "0001010" =>
              -- Long pulse before, short one after
              f_write_time_adj <= - to_integer(write_precomp_magnitude_b);

            when "0101001" =>
              -- Short pulse before, medium after
              f_write_time_adj <= to_integer(write_precomp_magnitude);
            when "1001001" =>
              -- equal length pulses either side
              f_write_time_adj <= 0;
            when "0001001" =>
              -- Long pulse before, medium after
              f_write_time_adj <= - to_integer(write_precomp_magnitude);

            when others =>
              -- All other combinations are invalid for MFM encoding, so do no
              -- write precompensation
              f_write_time_adj <= 0;                
          end case;

We only need cases with a 1 in the middle, as it is the middle bit we are writing, where 1 means a magnetic inversion, and 0 means no magnetic inversion, and we only need precompensation if the pattern either side is asymmetric. I just hope that I have the sign on the corrections correct! If not, I should be able to test with putting negative numbers in the corrections.  We shall see after synthesis has finished, probably in about 24 hours time in reality, as I have long day with my 60km round trip bike ride to work tomorrow, which I am really looking forward to, as I didn't get to ride in earlier in the week, so its too long between rides.

Back at the desk, and had time to look further into this, and record another stream, where I got the pre-compenssation stuff all working, and was able to drop the data rate divisor from 30 to 29, but more importantly, to sustain using divisor 29 over the first 40 tracks, and divisor 30 all the way to track 60.  Compare that with the prior effort without pre-compensation, where we had to drop to divisor 32 by track 26.

What was interesting is that a constant write-precompensation of ~100ns and 200ns respectively for short and long differences in gaps between pulses was pretty much ideal across the whole range of tracks and data rates I tests.

However, I couldn't get anything below divisor 29 to work, and all rates below 32 were sometimes a bit funny.  It occurred to me during the stream that this is most likely because the start-of-track gap bytes are now too short at the higher data rates, and thus the start of the first sector is not actually getting written, thus stuffing the whole track up -- even though the peaks in the histograms indicate that going down to rate 28 or even 26 should be possible.

To resolve this, I am adding more gaps to the start of the track.  But rather than it being wasted, I am going to write a short "Track Information Block" at the start of each track together with the normal 12 start-of-track gap bytes -- but these will always be written at the DD / 720KB data rate, so that it will be a fixed length prefix to the track, thus avoiding any variability caused by differing data rates.  This Track Info Block will contain the divisor used for the rest of the track, as well as flags that indicate if it contains sector gaps, or conversely, is a track-at-once and thus read-only track. I have also reserved a flag that indicates if the encoding is MFM or RLL2,7, and for good measure, I have included the track number.  

So now I just need to get that all working... I have implemented the code to write the Track Info Block, and also to read it back, and use it to set the data rate divisor, but it isn't getting detected.  One of many good reasons to blog about code you are writing, or to do live-streams, for that matter, is that you spend more time thinking about what you are doing and getting your thinking clear.  The revelation just now, is that it might be stuffing up, because I am using multiple data rates on each track, and that the timer loop in the inner MFM encoder might mess up if the divisor is reduced mid-bit, as it might have already counted past that divisor, and will thus have to wrap around.

I can verify if this is the problem, and generally get a better idea of what is going wrong, if I make a test harness that includes the whole sdcardio.vhdl, and feeds it synthetic register writes to command it to format a sector, and then decode what it writes out, to see what it looks like.

As usual, making a test harness turns out to be a very good idea: I almost immediately spotted that I was using $FB clock bytes instead of $FF clock bytes for the Track Info Block bytes. That would certainly have caused problems.  So I have a fix for that synthesising, but while that runs, I might run further in the test harness, and see what I can see. Which I should have done, before synthesising, as I would have picked up the next bug that way, which is that I was switching writing speed before the CRC bytes had a chance to be written out, but I was tired, and had a nap while it was synthesising, instead. But this time I did check, and of course there was no further problem with it ;)

After synthesising I am still seeing a problem, that only one sector is being found, even though there are multiple sectors being written.  In simulation, I can see the various sector headers and sector data fields, and they all have CRCs and are being correctly numbered, so this one is being a bit annoying to debug -- which is a bit frustrating, because it is possibly the last barrier before I can test writing tracks with the safe constant-length lead-in.  Most frustrating!

A bit more digging, and I can read DD format disks just fine -- so its possible the problem lies with the switch to enable the use of variable rate recording, which I did "clean up" a bit before.  That was part of the problem, the part that was stopping it from seeing the sectors. But what remains is that trying to read any of the sectors still just hangs.  I'll first synthesise the fix for selecting the variable rate recording properly, and then revisit it a bit later.

And that has it back to being able to read disks that I have formatted at high density again, but unfortunately it hasn't enabled any higher data rates to work.  It is possible that I can get one or two more, by making all 1.5x pulses slightly longer, as they all seem to be shifted towards the 1.0x pulses, and away from the 2.0x pulses.  I'm not sure of the precise mechanism by which this would be occurring, nor if simply delaying all 1.5x pulse signals a bit will really help, although it seems that it should be worth a try.  Otherwise, I suspect that its time to start working on the RLL encoding, and see what we can wring out of that.

Righty oh. I have just implemented variable delay of the 1.5x pulses, to see if that doesn't help us improve peak resolution at the faster data rates. I suspect we really are now starting to get into the area where other magnetic effects and electronics filter effects in the floppies are coming into play, so this will be the last attempt to increase the data rate with MFM, after which I will switch to working on the RLL stuff, and see how much benefit that gets us in terms of increased capacity. I'm not feeling super confident about it right now, because of the various magnetic effects I am seeing at the current data rate, but I might just be being pessimistic, since the whole point of RLL is that it lowers the effective data rate in the magnetic domain for the same real data rate.  We shall see after it all synthesises.  

The other little fix I have made in this run is to include the number of sectors in the Track Info Block.  I have also rolled in fixes for the hypervisor code and freezer to support the 85 track x 64 sector = ~5.5MB ".D65" HD disk images.

Right, that's built, so now to modify floppytest.c again, to allow fiddling with the setting to allow shifting the 1.5x pulses, and see if that works. Interestingly it helps only if I shift the pulses in the opposite direction -- but the gain is very marginal. To give an idea, here is rate divisor 28 (=$1C), i.e., about 1.45MHz MFM rate, with sensible write pre-compensation settings, and no adjustment to the position of the 1.5x pulses:

We see three peaks for 1.0x, 1.5x and 2,0x MFM intervals, with the 1.5x peak in the middle somewhat nearer the 1.0x peak on the left, which is the problem I have been describing. The WPC 04/04/00 means write pre-compensation of 4 and 4 for small and big gap differences, and 1.5x pulse delay of 0.  If we now advanced the pulse delay in the expected direction, we get something like this:

We can now see the 1.5x peak has split in two, but the extra peak is now further left, not moved to the right as we hoped. So if we reverse the direction by putting a negative correction on the 1.5x pulse position, we get the following, which is quite a bit better:

It is now quite nicely spaced out.  If I move it any further to the right, not only is it worse, but all sorts of other bad effects are happening. Its interesting that there are still quite clearly double peaks in the 1.0x and 1.5x peaks, so there is something else there that can in theory be found and improved, but how much further benefit is possible is hard to tell, and it is a bit of diminishing returns at this point, although each further divisor drop we can achieve at this point, does result in progressively larger amounts of data per track. 

But I am starting to chase my tail a bit here, and conclude that (for the time being at least), that I think we are at the limit of what we can get out of this drive and media using MFM encoding, and that further work should be focused on RLL encoding, to see if that can't get us a substantial further boost for less effort.  For that, you will have to wait for the next blog-post, however.

What I do want to close out in this post though, is making sure that normal DD disk activity still works correctly.  Specifically, when the C65 DOS ROM issues a format command, we don't want the Track Info Block data overriding and causing the tracks to be written at the wrong rate.  

The logic here is a little subtle: We want write operations to the drive to switch to the TIB indicated rate and encoding settings, so that sector writes succeed. And there is something going a bit odd with this, but it might have been a problem with a disk I borked up when it was writing the wrong speed.  A HEADER command with ID supplied, which should do a full format was failing with "75, FORMAT ERROR", but a quick format would succeed.  Trying to save a file on the disk after that would result in messed up sectors, most likely from writing at the wrong data rate. But if I instead do a quick format, then a full format, the format succeeds.  So it was probably just that I had made a complete mess of the disk formatting some tracks at funny rates with incorrect TIBs etc, as I was testing things.

But even after the clean low-level re-format, if I then try to save a file on the disk, the directory is all messed up, so something is still wrong. So I'll have to fix that up, too.  But that will have to also wait for the next blog post, or you will all be stuck reading this one long after the pre-orders have sold-out, because it has got so long ;)