Sunday 7 November 2021

Creating a simple internal drive fast-loader for the MEGA65

For a while now I have been thinking about making a simple fast-loader for the MEGA65 that bypasses the C65 DOS, and directly accesses the floppy controller.  It's a topic that comes up from time to time for developers who want to load large files from disk, for example. So I spent a couple of hours yesterday writing a proof-of-concept version.  My design criteria were:

1. Must be able to be run from an IRQ, so that it can be used in games or demos to load in the background while other activity goes on. The C65 DOS cannot be sensibly used for this, because when it runs, it blocks all interrupts for arbitrary periods of time, which can exceed 200ms(!!!).

2. Must allow loading to any address in memory.

3. Must be small enough that it can be easily incorporated into other programs.

(1) and (3) meant that it had to be written in assembly.

So here's what I created. It is still missing a few things: it doesn't save and restore the DMA list address registers (in case you were composing a DMA job just as the IRQ triggered), and it doesn't support specifying how much of a file to load, which would allow progressive streaming of a file. Both would be fairly easy to implement.  But back to what we do have, an annotated walk through the source:

First up, to demonstrate it, we have a simple BASIC header (I am running it from C64 mode, but you could almost as easily run it from C65 mode):

   
basic_header

    !byte 0x10,0x08,<2021,>2021,0x9e
    !pet "2061"
    !byte 0x00,0x00,0x00
 

Then we have the start of the demo program that is using the fast-loader.  The actual fast-loader code will come a bit later. We do the usual things of making sure we have MEGA65 IO enabled and the CPU at full speed, and have some boilerplate to clear the screen, set the screen colours, etc:

program_start:   

    ;; Select MEGA65 IO mode
    lda #$47
    sta $d02f
    lda #$53
    sta $d02f

    ;; Select 40MHz mode
    lda #65
    sta $0

    lda #$00
    sta $d020
    sta $d021

    lda #$01
    sta $0286
    jsr $e544
    

Next it is time to set up our raster interrupt. This should all be very familiar to C64 coders:
    ;; Install our raster IRQ with our fastloader
    sei

    lda #$7f
    sta $dc0d
    sta $dd0d
    lda #$40
    sta $d012
    lda #$1b
    sta $d011
    lda #$01
    sta $d01a
    dec $d019

    lda #$16
    sta $d018
    
    lda #<irq_handler
    sta $0314
    lda #>irq_handler
    sta $0315
    cli

We'll get to the IRQ handler in a moment, but we will finish looking at the real-time part of the program first.  The fast-loader uses a single byte state/status variable to keep track of what it is doing. If it is $00, then the loader is idle.  If you want to ask it to load something, you set up the filename and load address, and then write $01 into the variable.  It will go back to $00 when it's done, or have bit 7 set if there is some kind of error. This means you can check status with BEQ and BMI.  The load address will progressively update to show how far the load has got, if that's important for you to track. In the example, we load the game GYRRUS into bank 4 at $00040000:
    ;; Example for using the fast loader
    
    ;; Copy the filename from the filename table further down.
    ;; Expected to be PETSCII and $A0 padded at end, and exactly 16 chars
    ldx #$0f
    lda #$a0
clearfilename:
    sta fastload_filename,x
    dex
    bpl clearfilename
    ldx #$ff
filenamecopyloop:
    inx
    cpx #$10
    beq endofname
    lda filename,x
    beq endofname
    sta fastload_filename,x
    bne filenamecopyloop
endofname:   
    ;; X already holds the number of filename bytes copied
    stx fastload_filename_len
    
    ;; Set load address (32-bit)
    ;; = $40000 = BANK 4
    lda #$00
    sta fastload_address+0
    lda #$00
    sta fastload_address+1
    lda #$04
    sta fastload_address+2
    lda #$00
    sta fastload_address+3
Remember what I said about the status variable? We need to make sure it is $00 before we submit our load request.  This is important because when the fast-loader initialises, it doesn't know what track the drive is on, and so it seeks back to track 0 first. So we make sure that that completes before we submit our job. If we didn't do this, reading of any sector from the disk on a real drive would hang, because the head would be on the wrong track.
    ;; Give the fastload time to get itself sorted
    ;; (largely seeking to track 0)
wait_for_fastload:   
    lda fastload_request
    bne wait_for_fastload
Finally the fast-loader is ready, so we can then submit our job. It really is this simple:
    ;; Request fastload job
    lda #$01
    sta fastload_request
We can then go off and do whatever we want in real-time, knowing that the raster interrupt will be calling the fast-loader, and allowing it to progress in the background. For simplicity, in our demo we just wait for the fast-load to complete, and indicate if an error occurred, or if it loaded ok.
    ;; Then just wait for the request byte to
    ;; go back to $00, or to report an error by having the MSB
    ;; set. The request value will continually update based on the
    ;; state of the loading.
waiting
    lda fastload_request
    bmi error
    bne waiting
    beq done
    
error
    inc $042f
    jmp error

done
    inc $d020
    jmp done

That's over and done with for real-time, so now let's look at our raster interrupt.  This is also quite simple: Acknowledge the IRQ source, set border colour to white, call the fastload_irq routine, then return the border colour to black, before returning via the well known $EA81 interrupt exit handler code in the C64 KERNAL. You can of course do whatever you want, but this shows just how simple it can be. The border colour stuff is of course optional, but lets us see just how little raster time this loader uses.
irq_handler:
    ;; Here is our nice minimalistic IRQ handler that calls the fastload IRQ
    
    dec $d019

    ;; Call fastload and show raster time used in the loader
    lda #$01
    sta $d020
    jsr fastload_irq
    lda #$00
    sta $d020

    ;; Chain to KERNAL IRQ exit
    jmp $ea81

As mentioned, I set this demo up to load GYRRUS into bank 4, just because that was a file on the disk image I had active in my MEGA65 at the time.  Note that the filename has to be padded with $A0s, because the fast-load code literally compares all 16 bytes of the filename with the 16 bytes of filename in the directory sectors. It doesn't support partitions or sub-directories on the disk image. Someone could hack that in if they wanted it, but I don't think it will be necessary for almost all use-cases.
filename:
    ;; GYRRUS for testing
    !byte $47,$59,$52,$52,$55,$53,$a0,$a0
    !byte $a0,$a0,$a0,$a0,$a0,$a0,$a0,$a0
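
To make the padding rule concrete, here is a small Python sketch (an illustration only, not part of the loader) that builds the 16-byte name the loader compares against. The helper name `pad_filename` is made up for this example.

```python
PAD = 0xA0       # padding byte used in the directory's filename field
NAME_LEN = 16    # the loader always compares exactly 16 bytes


def pad_filename(name: bytes) -> bytes:
    """Return `name` padded with $A0 bytes to exactly 16 bytes."""
    if len(name) > NAME_LEN:
        raise ValueError("filename longer than 16 bytes")
    return name + bytes([PAD]) * (NAME_LEN - len(name))


# Upper-case ASCII letters have the same codes as the PETSCII bytes above.
padded = pad_filename(b"GYRRUS")
print(padded.hex())
```

The result is the same 16 bytes as the two !byte lines in the listing.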

    
;; ----------------------------------------------------------------------------
;; ----------------------------------------------------------------------------
;; ----------------------------------------------------------------------------
So that was the code for our example driver of the fast load. For your own programs, you can cut everything above here away, and just keep what follows.  It requires about 1.2KB, including the 512 byte sector buffer, so it's quite small in the grand scheme of things.
    ;; ------------------------------------------------------------
    ;; Actual fast-loader code
    ;; ------------------------------------------------------------
First up, we have the variables and temporary storage for the fast loader: The filename and length (which actually gets ignored, because of the use of $A0 padding, so can be removed at some point), the address where the user wants to load, and the state/status variable.  These four variables are the only ones you need to access from your code. Everything else that follows is internal to the fast-loader.

fastload_filename:   
    *=*+16
fastload_filename_len:   
    !byte 0
fastload_address:   
    !byte 0,0,0,0
fastload_request:   
    ;; Start with seeking to track 0
    !byte 4
This variable keeps track of which physical track on the disk the loader thinks the head is currently over, so that we can step to the correct track:

fl_current_track:    !byte 0

Then we have variables for the logical track and sector of the next 256 byte block of the file. These have to get translated into the physical track and sector of the drive, which, like the 1581, stores two blocks in each physical sector.
fl_file_next_track:    !byte 0
fl_file_next_sector:    !byte 0
 

Then finally, we have the 512 byte sector buffer. Now, this could be optimised away, by enabling mapping of the sector buffer at $DE00-$DFFF, but I couldn't be bothered remembering how to do that, and also didn't want to cause potential problems for code that also uses REU emulation or other things that might appear in the IO area. It's not that it can't be done, but rather that I just took the quick and easy path.  It would be a great exercise for the reader to change this, and reduce the total size of the loader to <1KB as a result.
fastload_sector_buffer:
    *=*+512
 

Now let's take a look at the fast-loader's IRQ handler.  It basically checks if there is an active request, and if not does nothing. Then it checks if the floppy controller is busy doing something that it asked it to earlier. If so, it does nothing.  But if we have an active job, and the floppy controller is not busy, this means that we can ask for the next operation to occur.  The fastload_request variable doubles as the state number for the resulting simple state-machine.  This approach really simplifies the code a lot, and makes it much easier to run in an interrupt.

Before going further, it is worth noting that if you run the interrupt on a normal raster IRQ, the loader will be able to load at most one block = 254 bytes of usable data per frame.  This means 254 x 50 = ~12.7KB/sec in PAL or 15.2KB/sec in NTSC.  If you are using a real 800KB 1581 disk, that's not a problem, because the drive will slow you down more than that.  But if you are using a disk image, or one of the MEGA65's HD disk formats, then this will slow things down.  

The easy solution is to have your IRQ routine trigger multiple times per frame, or enable IRQs in the floppy controller, and have it be called on demand whenever a sector is ready. You will need to acknowledge the floppy controller interrupts, if you do that.
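
The arithmetic behind those figures is easy to play with. This Python sketch is only a back-of-envelope calculation: it assumes each IRQ call delivers at most one 254-byte block and ignores the drive itself, and the "four IRQs per frame" case is a made-up example.

```python
BYTES_PER_BLOCK = 254  # 256-byte block minus the 2-byte track/sector link


def load_rate(frames_per_second: int, irqs_per_frame: int = 1) -> int:
    """Upper bound in bytes/second when every IRQ call yields one block."""
    return BYTES_PER_BLOCK * frames_per_second * irqs_per_frame


pal = load_rate(50)          # PAL, one IRQ per frame
ntsc = load_rate(60)         # NTSC, one IRQ per frame
pal_x4 = load_rate(50, 4)    # PAL, hypothetical four IRQs per frame
print(pal, ntsc, pal_x4)
```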

A further ~2x speed-up is possible, even without doing that, by modifying the loader to notice when a single physical sector contains two consecutive blocks of a file. It doesn't currently do this, which is a bit stupid.  Fixing that would also be a great exercise for the reader.

 
fastload_irq:
    ;; If the FDC is busy, do nothing, as we can't progress.
    ;; This really simplifies the state machine into a series of
    ;; sector reads
    lda fastload_request
    bne todo
    rts
todo:   
    lda $d082
    bpl fl_fdc_not_busy
    rts
fl_fdc_not_busy:   
    ;; FDC is not busy, so check what state we are in
    lda fastload_request
    bpl fl_not_in_error_state
    rts
fl_not_in_error_state:

It's worth explaining how the IRQ handler calls the various routines for the different states, because it uses a nice feature of the 65CE02: JMP indirect, X-indexed.  This instruction basically allows you to have a jump-table without the silly push-addr-minus-one to stack trick you have to use on the C64. The resulting code is quite a lot simpler and clearer as a result:
    ;; Shift state left one bit, so that we can use it as a lookup
    ;; into a jump table.
    ;; Everything else is handled by the jump table
    cmp #6
    bcc fl_job_ok
    ;; Ignore request/status codes that don't correspond to actions
    rts
fl_job_ok:   
    asl
    tax
    jmp (fl_jumptable,x)
    
fl_jumptable:
    !16 fl_idle
    !16 fl_new_request
    !16 fl_directory_scan
    !16 fl_read_file_block
    !16 fl_seek_track_0
    !16 fl_step_track
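
For readers who don't read 65CE02 assembly, the checks and the jump table above can be modelled in a few lines of Python. This is only a sketch of the control flow: `dispatch` and `make_handlers` are invented for the example, and the handlers just return their own names instead of touching any hardware.

```python
def make_handlers() -> list:
    """Stand-ins for the six state routines, in jump-table order."""
    names = ["fl_idle", "fl_new_request", "fl_directory_scan",
             "fl_read_file_block", "fl_seek_track_0", "fl_step_track"]
    return [(lambda n=n: n) for n in names]


def dispatch(request: int, fdc_busy: bool, handlers: list):
    """Mirror the order of checks in fastload_irq; None means 'just return'."""
    if request == 0:               # idle, nothing to do
        return None
    if fdc_busy:                   # controller still busy with the last command
        return None
    if request & 0x80:             # bit 7 set: error state, stay put
        return None
    if request >= len(handlers):   # same guard as the cmp #6 / bcc pair
        return None
    return handlers[request]()     # jmp (fl_jumptable,x)


handlers = make_handlers()
print(dispatch(1, False, handlers))
```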

The first of those state routines is the one for when the loader is idle: Just return immediately. This can be optimised away, since there are (1) plenty of other RTS instructions we could point at; and (2) because it never gets called, because we have the short-circuit exit at the start of the IRQ handler.  If you haven't already gotten the idea by now, you can tell that I have really just hacked this together until it works, and then stopped to document it.  Lots of opportunities for you to get involved and improve it ;)
fl_idle:
    rts

The next state handler checks if we are on track 0 yet, and if not, commands a step towards track 0, which like all other floppy controller actions, will have the floppy controller busy until the step has completed. Again, our nice busy check in the start of the IRQ handler means that we can just keep stepping in this routine until we reach track 0. Note how it writes $00 into fastload_request when done, to indicate that the loader is idle and ready for a new job.
fl_seek_track_0:
    lda $d082
    and #$01
    bne fl_not_on_track_0
    lda #$00
    sta fastload_request
    sta fl_current_track
    rts
fl_not_on_track_0:
    ;; Step back towards track 0
    lda #$10
    sta $d081
    rts

As you saw in the demo driver code, to submit a new job, you write $01 into fastload_request. This causes the following routine to be run when the IRQ is next triggered.  It puts $02 into fastload_request, so that it knows that it has just accepted a job, and also immediately requests the reading of the first physical sector that contains a directory block, ready for us to look for the requested file.
fl_new_request:
    ;; Acknowledge fastload request
    lda #2
    sta fastload_request
    ;; Start motor
    lda #$60
    sta $d080
    ;; Request T40 S3 to start directory scan
    ;; (remember we have to do silly translation to real sectors)
    lda #40-1
    sta $d084
    lda #(3/2)+1
    sta $d085
    lda #$00
    sta $d086         ; side
    ;; Request read
    jsr fl_read_sector
    rts

The above set fastload_request to call this routine on each IRQ, i.e., as each sector of the directory is loaded. We then look through the whole 512 byte sector for a matching filename, and if found, change state to load the file from the logical track and sector of the first block of the file as obtained from the directory listing. Note that we ignore the file type, including if the file is deleted. Again, a great opportunity for someone to improve the loader.
fl_directory_scan:
    ;; Check if our filename we want is in this sector
    jsr fl_copy_sector_to_buffer

    ;; (XXX we scan the last BAM sector as well, to keep the code simple.)
    ;; filenames are at offset 5 in each 32-byte directory entry, padded at
    ;; the end with $A0
    lda #<fastload_sector_buffer
    sta fl_buffaddr+1
    lda #>fastload_sector_buffer
    sta fl_buffaddr+2

fl_check_logical_sector:
    ldx #$05
fl_filenamecheckloop:
    ldy #$00

fl_check_loop_inner:

fl_buffaddr:
    lda fastload_sector_buffer+$100,x
    
    cmp fastload_filename,y   
    bne fl_filename_differs
    inx
    iny
    cpy #$10
    bne fl_check_loop_inner
    ;; Filename matches
    txa
    sec
    sbc #$12
    tax
    lda fl_buffaddr+2
    cmp #>fastload_sector_buffer
    bne fl_file_in_2nd_logical_sector
    ;; Y=Track, A=Sector
    lda fastload_sector_buffer,x
    tay
    lda fastload_sector_buffer+1,x
    jmp fl_got_file_track_and_sector
fl_file_in_2nd_logical_sector:   
    ;; Y=Track, A=Sector
    lda fastload_sector_buffer+$100,x
    tay
    lda fastload_sector_buffer+$101,x
fl_got_file_track_and_sector:
    ;; Store track and sector of file
    sty fl_file_next_track
    sta fl_file_next_sector
    ;; Request reading of next track and sector
    jsr fl_read_next_sector
    ;; Advance to next state
    lda #3
    sta fastload_request
    rts
    
fl_filename_differs:
    ;; Skip same number of chars as though we had matched
    cpy #$10
    beq fl_end_of_name
    inx
    iny
    jmp fl_filename_differs
fl_end_of_name:
    ;; Advance to next directory entry
    txa
    clc
    adc #$10
    tax
    bcc fl_filenamecheckloop
    inc fl_buffaddr+2
    lda fl_buffaddr+2
    cmp #(>fastload_sector_buffer)+1
    bne fl_checked_both_halves
    jmp fl_check_logical_sector
fl_checked_both_halves:   
    
    ;; No matching name in this 512 byte sector.
    ;; Load the next one, or give up the search
    inc $d085
    lda $d085
    cmp #11
    bne fl_load_next_dir_sector
    ;; Ran out of sectors in directory track
    ;; (XXX only checks side 0, and assumes DD disk)

    ;; Mark load as failed
    lda #$80         ; $80 = File not found
    sta fastload_request   
    rts
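
The scan itself is easier to see in a high-level language. This Python sketch assumes the layout the assembly relies on: 32-byte entries, the first block's track and sector at offsets 3 and 4, and the 16-byte name at offset 5. `find_file` is a hypothetical helper, and like the loader it ignores the file type.

```python
ENTRY_SIZE = 32
TS_OFFSET = 3      # track, then sector, of the file's first block
NAME_OFFSET = 5    # 16-byte filename, padded with $A0


def find_file(sector: bytes, padded_name: bytes):
    """Return (track, sector) of the first block, or None if nothing matches."""
    if len(sector) != 512 or len(padded_name) != 16:
        raise ValueError("expected a 512-byte sector and a 16-byte name")
    for entry in range(0, len(sector), ENTRY_SIZE):
        name = sector[entry + NAME_OFFSET:entry + NAME_OFFSET + 16]
        if name == padded_name:
            return sector[entry + TS_OFFSET], sector[entry + TS_OFFSET + 1]
    return None


# Build a fake directory sector with one entry in the second 256-byte half.
wanted = b"GYRRUS" + b"\xa0" * 10
fake = bytearray(512)
entry = 9 * ENTRY_SIZE
fake[entry + TS_OFFSET] = 41
fake[entry + TS_OFFSET + 1] = 0
fake[entry + NAME_OFFSET:entry + NAME_OFFSET + 16] = wanted
print(find_file(bytes(fake), wanted))
```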

We now have several little utility routines related to reading sectors from the disk, including doing the conversion from 1581 logical sectors to 3.5" floppy physical sectors, and tracking the head if we aren't on the correct track already etc. If it detects that it needs to step the head, it changes fastload_request to point to a handler for that, which in turn sets it back to the handler for reading blocks of the file.

Note that I haven't actually tried this on a real disk, yet. This should be done, as there will quite likely be some subtle problem that will need shaking out, most likely with the track stepping. But it shouldn't be too hard to fix, and who knows, I might have got it right the first time ;)
fl_load_next_dir_sector:   
    ;; Request read
    jsr fl_read_sector
    ;; No need to change state
    rts

fl_read_sector:
    ;; Issue the read-sector command. Callers are expected to have
    ;; already selected the track, sector and side.
    lda #$40
    sta $d081
    rts

fl_step_track:
    lda #3
    sta fastload_request
    ;; FALL THROUGH
    
fl_read_next_sector:
    ;; Check if we reached the end of the file first
    lda fl_file_next_track
    bne fl_not_end_of_file
    rts
fl_not_end_of_file:   
    ;; Read next sector of file
    jsr fl_logical_to_physical_sector

    lda $d084
    cmp fl_current_track
    beq fl_on_correct_track
    bcc fl_step_in
fl_step_out:
    ;; We need to step first
    lda #$18
    sta $d081
    inc fl_current_track
    lda #5
    sta fastload_request
    rts
fl_step_in:
    ;; We need to step first
    lda #$10
    sta $d081
    dec fl_current_track
    lda #5
    sta fastload_request
    rts
    
fl_on_correct_track:   
    jsr fl_read_sector
    rts


Here we have another utility routine that does the logical-to-physical track and sector conversion. Again, this basically mirrors what the 1581 does. It will need modifying to use the fast-loader on HD disks, because there will be more sectors on each side of the disk.
fl_logical_to_physical_sector:
    ;; Convert 1581 sector numbers to physical ones on the disk.
    ;; Track = Track - 1
    ;; Sector = 1 + (Sector/2)
    ;; Side = 0
    ;; If sector > 10, then sector=sector-10, side=1
    lda #$00         ; side 0
    sta $d086
    lda fl_file_next_track
    dec
    sta $d084
    lda fl_file_next_sector
    lsr
    inc
    cmp #11          ; physical sectors 1-10 are on side 0
    bcs fl_on_second_side
    sta $d085
    jmp fl_set_fdc_head
    
fl_on_second_side:
    sec
    sbc #10
    sta $d085
    lda #1
    sta $d086

    ;; FALL THROUGH
fl_set_fdc_head:
    ;; Select correct side of real disk drive
    lda $d086
    asl
    asl
    asl
    and #$08
    ora #$60
    sta $d080
    rts
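
The mapping described in the comments at the top of this routine can be written out in Python as a sanity check. This is a sketch of the rule as stated there (ten physical sectors per side on a DD disk), not a description of what the FDC does internally, and `logical_to_physical` is a name made up for the example.

```python
def logical_to_physical(track: int, sector: int):
    """Map a 1581 logical (track, sector) to (phys_track, phys_sector, side, half)."""
    phys_track = track - 1           # logical tracks start at 1, physical at 0
    phys_sector = 1 + sector // 2    # two 256-byte blocks per 512-byte sector
    side = 0
    if phys_sector > 10:             # sectors 11-20 live on the second side
        phys_sector -= 10
        side = 1
    half = sector & 1                # 0 = first 256 bytes, 1 = second 256 bytes
    return phys_track, phys_sector, side, half


# Start of the directory scan: logical T40 S3
print(logical_to_physical(40, 3))
```

The T40 S3 case gives physical track 39, sector 2, which matches the values that fl_new_request writes into $d084 and $d085.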
    

This is the routine that really does the loading: It gets the physical sector that was just read, works out which half of it contains the data for us, DMAs those bytes into the destination location in memory, and then follows the block chain to the next block of the file, detecting the end-of-file marker indicated by logical track = $00.
fl_read_file_block:
    ;; We have a sector from the floppy drive.
    ;; Work out which half and how many bytes,
    ;; and copy them into place.

    ;; Get sector from FDC
    jsr fl_copy_sector_to_buffer

    ;; Assume full sector initially
    lda #254
    sta fl_bytes_to_copy
    
    ;; Work out which half we care about
    lda fl_file_next_sector
    and #$01
    bne fl_read_from_second_half
fl_read_from_first_half:
    lda #(>fastload_sector_buffer)+0
    sta fl_read_dma_page
    lda fastload_sector_buffer+1
    sta fl_file_next_sector
    lda fastload_sector_buffer+0
    sta fl_file_next_track
    bne fl_1st_half_full_sector
fl_1st_half_partial_sector:
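    ;; In the final block the sector byte is the offset of the last
    ;; used byte, so the number of data bytes is that value minus 1
    lda fastload_sector_buffer+1
    dec
    sta fl_bytes_to_copy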
    ;; Mark end of loading
    lda #$00
    sta fastload_request
fl_1st_half_full_sector:
    jmp fl_dma_read_bytes
    
fl_read_from_second_half:
    lda #(>fastload_sector_buffer)+1
    sta fl_read_dma_page
    lda fastload_sector_buffer+$101
    sta fl_file_next_sector
    lda fastload_sector_buffer+$100
    sta fl_file_next_track
    bne fl_2nd_half_full_sector
fl_2nd_half_partial_sector:
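    ;; In the final block the sector byte is the offset of the last
    ;; used byte, so the number of data bytes is that value minus 1
    lda fastload_sector_buffer+$101
    dec
    sta fl_bytes_to_copy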
    ;; Mark end of loading
    lda #$00
    sta fastload_request
fl_2nd_half_full_sector:
    ;; FALLTHROUGH
fl_dma_read_bytes:

    ;; Update destination address
    lda fastload_address+3
    asl
    asl
    asl
    asl
    sta fl_data_read_dmalist+2
    lda fastload_address+2
    lsr
    lsr
    lsr
    lsr
    ora fl_data_read_dmalist+2
    sta fl_data_read_dmalist+2
    lda fastload_address+2
    and #$0f
    sta fl_data_read_dmalist+12
    lda fastload_address+1
    sta fl_data_read_dmalist+11
    lda fastload_address+0
    sta fl_data_read_dmalist+10

    ;; Copy the block from our sector buffer to the load address
    lda #$00
    sta $d704
    lda #>fl_data_read_dmalist
    sta $d701
    lda #<fl_data_read_dmalist
    sta $d705

    ;; Update load address
    lda fastload_address+0
    clc
    adc fl_bytes_to_copy
    sta fastload_address+0
    lda fastload_address+1
    adc #0
    sta fastload_address+1
    lda fastload_address+2
    adc #0
    sta fastload_address+2
    lda fastload_address+3
    adc #0
    sta fastload_address+3
    
    ;; Schedule reading of next block
    jsr fl_read_next_sector
    
    rts
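
Following the block chain is the heart of the loader, so here is the same idea as a Python sketch working on an in-memory model of the disk. It assumes the usual CBM DOS convention that in the final block (link track = 0) the second byte is the offset of the last used byte. The `blocks` dictionary and `read_file` are invented for the illustration.

```python
def read_file(blocks: dict, track: int, sector: int) -> bytes:
    """Follow the track/sector links from the first block and return the data."""
    data = bytearray()
    seen = set()
    while track != 0:
        if (track, sector) in seen:
            raise ValueError("loop in block chain")
        seen.add((track, sector))
        block = blocks[(track, sector)]
        next_track, next_sector = block[0], block[1]
        if next_track == 0:
            # Final block: byte 1 is the offset of the last used byte
            data += block[2:next_sector + 1]
        else:
            data += block[2:256]      # full block: 254 data bytes
        track, sector = next_track, next_sector
    return bytes(data)


# Two-block test file: one full block, then a final block with 3 data bytes.
blocks = {
    (41, 0): bytes([41, 1]) + bytes([0x11]) * 254,
    (41, 1): bytes([0, 4, 0xAA, 0xBB, 0xCC]) + bytes(251),
}
contents = read_file(blocks, 41, 0)
print(len(contents))
```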

We are now almost at the end. What we have here is the DMA lists for copying the read data to its final destination, as well as the routine and DMA list for copying a physical sector from the FDC's buffer down to fastload_sector_buffer.  As previously noted, we can probably shrink the whole thing (and make it use less raster time) by avoiding that copy, if we instead fiddle the IO banking to make the floppy sector buffer map at $DE00-$DFFF (there is a special bit that enables this).  But what we have here works, and isn't that much slower, as the DMA doesn't take very long. 
fl_data_read_dmalist:
    !byte $0b      ; F018B format list
    !byte $81,$00      ; Destination MB
    !byte 0         ; no more options
    !byte 0            ; copy
fl_bytes_to_copy:   
    !word 0               ; size of copy
fl_read_page_word:   
fl_read_dma_page = fl_read_page_word + 1
    ;; +2 is to skip track/header link
    !word fastload_sector_buffer+2    ; Source address
    !byte $00        ; Source bank
    
    !word 0                 ; Dest address
    !byte $00             ; Dest bank
    
    !byte $00             ; sub-command
    !word 0                 ; modulo (unused)
    
    rts
    
fl_copy_sector_to_buffer:
    ;; Make sure FDC sector buffer is selected
    lda #$80
    trb $d689

    ;; Copy FDC data to our buffer
    lda #$00
    sta $d704
    lda #>fl_sector_read_dmalist
    sta $d701
    lda #<fl_sector_read_dmalist
    sta $d705
    rts

fl_sector_read_dmalist:
    !byte $0b      ; F018B format list
    !byte $80,$ff            ; MB of FDC sector buffer address ($FFD6C00)
    !byte 0         ; no more options
    !byte 0            ; copy
    !word 512        ; size of copy
    !word $6c00        ; low 16 bits of FDC sector buffer address
    !byte $0d        ; next 4 bits of FDC sector buffer address
    !word fastload_sector_buffer ; Dest address   
    !byte $00             ; Dest bank
    !byte $00             ; sub-command
    !word 0                 ; modulo (unused)
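
To show how the bytes of these DMA lists relate to a 28-bit address, here is a Python sketch that assembles a copy list from plain source and destination addresses. It sets both the source ($80) and destination ($81) megabyte options for generality, whereas each list above sets only the one it needs. `split_address` and `build_copy_list` are names made up for the example.

```python
def split_address(addr: int):
    """Split a 28-bit address into (megabyte, bank, 16-bit offset)."""
    return (addr >> 20) & 0xFF, (addr >> 16) & 0x0F, addr & 0xFFFF


def build_copy_list(src: int, dst: int, count: int) -> bytes:
    """Assemble an enhanced DMA copy list in the same layout as the lists above."""
    if not 0 < count <= 0xFFFF:
        raise ValueError("count must fit in 16 bits")
    src_mb, src_bank, src_off = split_address(src)
    dst_mb, dst_bank, dst_off = split_address(dst)
    return bytes([
        0x0B,                                     # option: F018B list layout
        0x80, src_mb,                             # option: source megabyte
        0x81, dst_mb,                             # option: destination megabyte
        0x00,                                     # end of options
        0x00,                                     # command: copy
        count & 0xFF, count >> 8,                 # size of copy
        src_off & 0xFF, src_off >> 8, src_bank,   # source address and bank
        dst_off & 0xFF, dst_off >> 8, dst_bank,   # destination address and bank
        0x00,                                     # sub-command
        0x00, 0x00,                               # modulo (unused)
    ])


# FDC sector buffer at $FFD6C00 copied to bank 4, as in the demo
dma = build_copy_list(0xFFD6C00, 0x0040000, 512)
print(dma.hex())
```

Splitting $FFD6C00 this way gives megabyte $FF, bank $0D and offset $6C00, the same values that appear in fl_sector_read_dmalist.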

And that's it.  The loader really is quite simple, especially compared with a 1541 fast-loader.  You can find the source in https://github.com/mega65/mega65-tools. Just look for fastload-demo.asm.

Finally, a somewhat arbitrary screen-shot, because every blog post requires at least one, but it's kind of hard to show a fast-loader in action in a still image.



Tuesday 2 November 2021

Speeding up the MEGA65 flash menu

The MEGA65's flash menu that lets you write new cores into the flash is, shall we say, a little pedestrian in speed.  It takes close to 15 minutes to write a new core, which is really annoying.  

It has also become a bit important for another reason: Trenz need a tool to flash the MEGA65 production boards, because, for some unknown reason, Vivado is refusing to flash the new shiny 512Mbit (64MB) flash chip that is going on the production machines.  They can't afford to spend 15 minutes on each machine flashing them.

I just timed Vivado flashing a bitstream, and it took 165 seconds = 2 minutes, 45 seconds, so that's our goal. 

Now, what is interesting is that Vivado is much slower than what the flash can do.  In theory, we can erase at around 500KB/sec, flash at around 1MB/sec, and verify back at >1MB/sec.  For an 8MB bitstream this gives us a theoretical time of 8MB/500KB/sec + 8MB/1MB/sec + 8MB/1MB/sec = 16 + 8 + 8 = 32 seconds.  That would be really nice if we can reach it.  But I'll just be happy if we can get down to 165 seconds or better, like Vivado does.
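
That estimate, and the size of the gap we need to close, can be reproduced with a few lines of Python. The rates are the round numbers quoted above (with 1MB taken as 1000KB so the sums match), not measurements.

```python
SIZE_KB = 8 * 1000        # 8MB bitstream, using round numbers
ERASE_KB_PER_S = 500
PROGRAM_KB_PER_S = 1000
VERIFY_KB_PER_S = 1000

erase_s = SIZE_KB / ERASE_KB_PER_S
program_s = SIZE_KB / PROGRAM_KB_PER_S
verify_s = SIZE_KB / VERIFY_KB_PER_S
total_s = erase_s + program_s + verify_s

current_s = 15 * 60       # roughly what the flash menu takes today
vivado_s = 165            # measured Vivado time, our target
needed_speedup = current_s / vivado_s

print(erase_s, program_s, verify_s, total_s)
print(round(needed_speedup, 1))
```

So matching Vivado needs a bit more than a 5x speed-up, while the flash itself would in theory allow well over 20x.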

To improve from our current ~15 minutes = ~900 seconds down to 165 seconds, we have quite a bit of improvement to make.  Fortunately this should be fairly easy, as the root cause of the slowness is that we are using CC65 as our compiler, which produces slow code, and then bit-bashing the QSPI communications.  So getting it much faster than now should be quite straight forward.

But first, we need an easy way to test the flash program, because currently the QSPI flash is only accessible when in hypervisor mode. So I have made it so that dip-switch 3 now enables access to the QSPI flash from any mode. This should not be normally enabled, as it can cause your QSPI flash to get trashed. But for production of machines (and testing of my flash program speed-ups), it's fine.

With that out the way, it was time to start implementing the QSPI speed up stuff.  I could in theory implement a complete QSPI controller in hardware, but that's a lot of work, and not really needed, because it is just the large transfers for reading and writing the flash to verify and program it that take by far the most time -- more than 90% in fact.

So instead I am just implementing hardware acceleration of exactly those operations.  The QSPI lines are routed through the SD card controller, which already has a nice buffer that I can re-use.  For some reason I use the "Q" nybble-based modes for reading from the flash, but single-bit ones for writing. It would be faster to use the Q mode for both, as it will reduce the time to write a byte from ~8x4 = 32 cycles down to 4 cycles per byte. But as the flashing itself takes ~1usec per byte, we will still be at 50% efficiency at least.  Somewhat similarly for the reading, we could run the QSPI flash at >40MHz, but that would require more work for even less gain -- especially since we still have some logic code from CC65 slowing things down as well.

For the commands to setup those transfers, we can also inline some of the functions to help things along a bit. About 2x to 3x for some parts of things was possible there, but still not enough to get us near 165 seconds.

To help track the improvements, I have improved the progress bars in the flash program to show the speed and time remaining using the RTC to do the timing.

In the process I also did some more work on improving the detection of the flash chip's parameters.  This helped quite a bit, because there are two different erase commands: one works on all pages of the flash, but is really slow on most of them, while the one that is faster on most of the pages hangs on the other pages.  This is because the chip we are using has 32 x 4KB pages at the start, and then 64KB pages after that.
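
To illustrate why the page layout matters, here is a Python sketch that works out which erase operations cover a given address range on a chip laid out as described (32 x 4KB pages, then 64KB pages). It only models the page sizes. The actual erase commands are not modelled, and `erase_plan` is a made-up helper.

```python
SMALL_PAGE = 4 * 1024
LARGE_PAGE = 64 * 1024
SMALL_REGION_END = 32 * SMALL_PAGE   # the first 128KB uses 4KB pages


def erase_plan(start: int, end: int) -> list:
    """Return (address, size) erase operations covering [start, end)."""
    plan = []
    addr = start
    while addr < end:
        size = SMALL_PAGE if addr < SMALL_REGION_END else LARGE_PAGE
        addr -= addr % size           # align down to the containing page
        plan.append((addr, size))
        addr += size
    return plan


plan = erase_plan(0, 256 * 1024)
print(len(plan))
```

For the first 256KB this gives 32 small erases followed by 2 large ones.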

Getting the reading of data working with hardware acceleration was pretty quick and easy. I also found a horrible bug in the erase code that meant that it would erase all pages, even if they were already empty.  Together those two improvements have dramatically improved the erase performance: It is now down to 19 seconds when erasing a typical bitstream that is about 5.5MB of the 8MB slot size, i.e. an erase speed of roughly 300KB/sec.  That's down from several minutes, so that's the first part of our victory.
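
The "don't erase what is already empty" idea is simple enough to sketch in Python. This is not the flash program's actual code, just an illustration that relies on erased NOR flash reading back as $FF bytes. `pages_needing_erase` is a hypothetical helper.

```python
ERASED = 0xFF  # an erased NOR flash byte reads back as all ones


def pages_needing_erase(contents: bytes, page_size: int) -> list:
    """Return the indices of pages holding at least one non-erased byte."""
    dirty = []
    for start in range(0, len(contents), page_size):
        page = contents[start:start + page_size]
        if page.count(ERASED) != len(page):
            dirty.append(start // page_size)
    return dirty


# Three tiny 8-byte "pages": only the middle one has been written to.
flash = bytes([0xFF] * 8 + [0xFF, 0x00] + [0xFF] * 6 + [0xFF] * 8)
print(pages_needing_erase(flash, 8))
```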

I have also implemented the hardware SPI writing acceleration, but there is some bug with it at present which means that the same byte is being written over and over again, which I need to investigate. But given that I am writing the correct number of bytes, the speed should be about right.  And this is also greatly improved, now taking only 38 seconds, at around 164KB/sec. About half of that time is the actual fast SPI data write and the time for the flash to actually write to the non-volatile memory, so there is perhaps some scope for in-lining more stuff in the C code to speed it up a bit, but otherwise further improvements would require the Q mode writing. With both of those, it would probably be possible to get under 20 seconds for the writing phase, but honestly, 38 seconds is already fast enough to not feel annoying. The main thing is that the progress bar is continuously growing, and at a good speed.

So once I have the SPI writing bug fixed, we are looking at just under 1 minute to erase and write.  Verification should be at least as fast as erasing, so I'm hoping that we will be around 80 seconds -- that is, about 2x faster than Vivado, which is really nice!

Part of the bug is because I hadn't implemented 256 byte page writing, but rather only 512 byte page writing. That's fine for the 64MB flash chips on the production boards, but not for the existing 32MB flash in the R3 board I have here.  The errata for the 32MB flash said that you can write >256 bytes, but only the last 256 bytes will be written.  I have since fixed that, but without any visible improvement.
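
Splitting a write into page-sized pieces looks something like the following Python sketch. It is an illustration rather than the flash program's code: `page_writes` is a made-up helper, and the default of 256 bytes matches the 32MB chip described above (the production chips would use 512).

```python
def page_writes(start: int, data: bytes, page_size: int = 256):
    """Yield (address, chunk) pairs so that no chunk crosses a page boundary."""
    offset = 0
    while offset < len(data):
        addr = start + offset
        room = page_size - (addr % page_size)   # bytes left in this page
        chunk = data[offset:offset + room]
        yield addr, chunk
        offset += len(chunk)


# 512 bytes from an aligned address become two 256-byte page writes.
aligned = [(addr, len(chunk)) for addr, chunk in page_writes(0, bytes(512))]
# An unaligned start first fills the remainder of its page.
unaligned = [(addr, len(chunk)) for addr, chunk in page_writes(0x80, bytes(300))]
print(aligned, unaligned)
```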

What I am seeing is that the same byte is being written to the flash over and over again.  Sometimes it's $80, other times $00.  This says to me that bytes are probably being written, but that the bytes we are reading from the buffer may be wrong. So I might make a test that tries writing some known data and see how that goes.  That way I can also do it much faster, as I can erase the single page, write the known data, and then read it back.

Okay, so that confirms that we are writing exactly 256 bytes, but that all 256 bytes are being written with the same value, in this case $80.  I'll do a quick bit of simulation to check whether the SD buffer is being read out correctly to be written, as that strikes me as the most likely place to be borked.

Borkage duly found via simulation: I was reloading the byte from the buffer every bit, causing endless hilarity to ensue.  Now synthesising that, but slowed down by watching Shallan50k's twitch stream with the music competition results which was excellent (congratulations to @proton_fig for your great X Files tunes, and to the other entrants for their great tunes as well!). It's amazing just how much CPU it takes for the Twitch stream view. Basically was eating 75% or more of the CPU on my (admittedly 4 year old) i7 box.

On the up side, the simulation affirmed that the rest of the process looks to be behaving properly, so hopefully it will work when the synthesis does complete. Which it did, after I eventually spotted and fixed some stupid bugs.

After that it was a case of fine-tuning various things, like reducing how often I update the progress bar. I also added a hand-written assembly routine for the verification step, as that is currently the slowest of all the actions, which is a bit silly given that erasing and writing have real work to do.

The end result is that writing a new core file to a slot can now be done in about 86 seconds -- i.e., about 1/2 the time that Vivado takes, as we can see in this screenshot:


Victory achieved!

Now to win the war, I need to back-port all those speed-ups and general improvements into the flash menu, and hope that it doesn't make it too big to fit in the bitstream... which I have also done.

To say that this makes the process of flashing a core file more pleasant is really an understatement. We have rather coincidentally gone from C64 datasette to disk drive loading times, and the impact feels just as profound: You can now flash a core without thinking about what you will do for the next 1/4 hour while it chugs away.

While further speed improvements are possible, they don't really feel necessary, given that the theoretical minimum time is something like 20 seconds, and it would be a lot of effort to claw back any of that extra minute -- but it is only one extra minute.

The only further improvement I am likely to make down the track is to make a utility that will allow safe reflashing of slot 0 using this new dip-switch 3 mechanism: The program will check that the FPGA has booted from slot 1 or 2, and thus be satisfied that slot 0 can be written over without bricking the machine, and only if that is the case, will it attempt to flash.  But to make sure people don't leave switch 3 on all the time, which would allow malicious software to brick your MEGA65, I'll likely put an inter-lock into the hypervisor that requires you to press some key to continue booting if it is enabled, so that you don't forget.

So that's all that, really.