For a while now I have been thinking about making a simple fast-loader for the MEGA65 that bypasses the C65 DOS, and directly accesses the floppy controller. It's a topic that comes up from time to time for developers who want to load large files from disk, for example. So I spent a couple of hours yesterday writing a proof-of-concept version. My design criteria were:
1. Must be able to be run from an IRQ, so that it can be used in games or demos to load in the background while other activity goes on. The C65 DOS cannot be sensibly used for this, because when it runs, it blocks all interrupts for arbitrary periods of time, which can exceed 200ms(!!!).
2. Must allow loading to any address in memory.
3. Must be small enough that it can be easily incorporated into other programs.
(1) and (3) meant that it had to be written in assembly.
So here's what I created. It still is missing a few things, like it doesn't save and restore DMA list address registers (in case you were composing a DMA job in real-time, just as the IRQ triggered), and doesn't support specifying how much of a file to load, to allow progressive streaming in of a file. Both would be fairly easy to implement. But back to what we do have, an annotated walk through the source:
First up, to demonstrate it, we have a simple BASIC header (I am running it from C64 mode, but you could almost as easily run it from C65 mode): 
    
basic_header
    !byte 0x10,0x08,<2021,>2021,0x9e
    !pet "2061"
    !byte 0x00,0x00,0x00
 
Then we have the start of the demo program that is using the fast-loader.  The actual fast-loader code will come a bit later. We do the usuals of making sure we have MEGA65 IO enabled, and the CPU at full-speed, as well as have some boiler plate to clear the screen and set screen colours etc:    
program_start:    
    ;; Select MEGA65 IO mode
    lda #$47
    sta $d02f
    lda #$53
    sta $d02f
    ;; Select 40MHz mode
    lda #65
    sta $0
    lda #$00
    sta $d020
    sta $d021
    lda #$01
    sta $0286
    jsr $e544
    
Next it is time to setup our raster interrupt. This should all be very familiar to C64 coders:
    ;; Install our raster IRQ with our fastloader
    sei
    lda #$7f
    sta $dc0d
    sta $dd0d
    lda #$40
    sta $d012
    lda #$1b
    sta $d011
    lda #$01
    sta $d01a
    dec $d019
    lda #$16
    sta $d018
    
    lda #<irq_handler
    sta $0314
    lda #>irq_handler
    sta $0315
    cli
We'll get to the IRQ handler in a moment, but we will finish looking at the real-time part of the program first.  The fast-loader uses a single byte state/status variable to keep track of what it is doing. If it is $00, then the loader is idle.  If you want to ask it to load something, you setup the filename and load address, and then write $01 into the variable.  It will go back to $00 when its done, or have bit 7 set if there is some kind of error. This means you can check status with BEQ and BMI.  The load address will progressively update to show where it is loaded to, if that's important for you to track. In the example, we load the game GYRRUS into bank 4 at $00040000: 
    ;; Example for using the fast loader
    
    ;; copy filename from start of screen
    ;; Expected to be PETSCII and $A0 padded at end, and exactly 16 chars
    ldx #$0f
    lda #$a0
clearfilename:
    sta fastload_filename,x
    dex
    bpl clearfilename
    ldx #$ff
filenamecopyloop:
    inx
    cpx #$10
    beq endofname
    lda filename,x
    beq endofname
    sta fastload_filename,x
    bne filenamecopyloop
endofname:    
    inx
    stx fastload_filename_len
    
    ;; Set load address (32-bit)
    ;; = $40000 = BANK 4
    lda #$00
    sta fastload_address+0
    lda #$00
    sta fastload_address+1
    lda #$04
    sta fastload_address+2
    lda #$00
    sta fastload_address+3
Remember what I said about the status variable? We need to make sure it is $00 before we submit our load request.  This is important because when the fast-loader initialises, it doesn't know what track the drive is on, and so it seeks back to track 0 first. So we make sure that that completes before we submit our job. If we didn't do this, reading of any sector from the disk on a real drive would hang, because the head would be on the wrong track.
    ;; Give the fastload time to get itself sorted
    ;; (largely seeking to track 0)
wait_for_fastload:    
    lda fastload_request
    bne wait_for_fastload
Finally the fast-loader is ready, so we can then submit our job. It really is this simple:
    ;; Request fastload job
    lda #$01
    sta fastload_request
We can then go off and do whatever we want in real-time, knowing that the raster interrupt will be calling the fast-loader, and allowing it to progress in the background. For simplicity, in our demo we just wait for the fast-load to complete, and indicate if an error occurred, or if it loaded ok.
    ;; Then just wait for the request byte to
    ;; go back to $00, or to report an error by having the MSB
    ;; set. The request value will continually update based on the
    ;; state of the loading.
waiting
    lda fastload_request
    bmi error
    bne waiting
    beq done
    
error
    inc $042f
    jmp error
done
    inc $d020
    jmp done
That's over and done with for real-time, so now lets look at our raster interrupt.  This is also quite simple: Acknowledge the IRQ source, set border colour to white, call the fastload_irq routine, then return the border colour to black, before returning via the well known $EA81 interrupt exit handler code in the C64 KERNAL. You can of course do whatever you want, but this shows just how simple it can be. The border colour stuff is of course optional, but let's us see just how little raster time this loader uses.
irq_handler:
    ;; Here is our nice minimalistic IRQ handler that calls the fastload IRQ
    
    dec $d019
    ;; Call fastload and show raster time used in the loader
    lda #$01
    sta $d020
    jsr fastload_irq
    lda #$00
    sta $d020
    ;; Chain to KERNAL IRQ exit
    jmp $ea81
As mentioned, I set this demo up to load GYRRUS into bank 4, just because that was a file on the disk image I had active in my MEGA65 at the time.  Note that the filename has to be padded with $A0s, because the fast-load code literally compares all 16 bytes of the filename with the 16 bytes of filename in the directory sectors. It doesn't support partitions or sub-directories on the disk image, but someone could hack that in if they wanted it, but I don't think it will be necessary for almost all use-cases.
filename:
    ;; GYRRUS for testing
    !byte $47,$59,$52,$52,$55,$53,$a0,$a0
    !byte $a0,$a0,$a0,$a0,$a0,$a0,$a0,$a0
    
;; ----------------------------------------------------------------------------
;; ----------------------------------------------------------------------------
;; ----------------------------------------------------------------------------
So that was the code for our example driver of the fast load. For your own programs, you can cut everything above here away, and just keep what follows.  It requires about 1.2KB, including the 512 byte sector buffer, so its quite small in the grand scheme of things.    
    ;; ------------------------------------------------------------
    ;; Actual fast-loader code
    ;; ------------------------------------------------------------
First up, we have the variables and temporary storage for the fast loader: The filename and length (which actually gets ignored, because of the use of $A0 padding, so can be removed at some point), the address where the user wants to load, and the state/status variable.  These four variables are the only ones you need to access from your code. Everything else that follows is internal to the fast-loader. 
fastload_filename:    
    *=*+16
fastload_filename_len:    
    !byte 0
fastload_address:    
    !byte 0,0,0,0
fastload_request:    
    ;; Start with seeking to track 0
    !byte 4
This variable keeps track of which physical track on the disk the loader thinks the head is currently over, so that we can step to the correct track:
fl_current_track:    !byte 0
Then we have variables for the logical track and sector of the next 256 byte block of the file. These have to get translated into the physical track and sector of the drive, which like the 1581, stores two blocks in each physical sector.
fl_file_next_track:    !byte 0
fl_file_next_sector:    !byte 0
 
Then finally, we have the 512 byte sector buffer. Now, this could be optimised away, by enabling mapping of the sector buffer at $DE00-$DFFF, but I couldn't be bothered remembering how to do that, and also didn't want to cause potential problems for code that also uses REU emulation or other things that might appear in the IO area. It's not that it can't be done, but rather that I just took the quick and easy path.  It would be a great exercises for the reader to change this, and reduce the total size of the loader to <1KB as a result.    
fastload_sector_buffer:
    *=*+512
 
Now let's take a look at the fast-loader's IRQ handler. It basically checks if there is an active request, and if not does nothing. Then it checks if the floppy controller is busy doing something that it asked it to earlier. If so, it does nothing. But if we have an active job, and the floppy controller is not busy, this means that we can ask for the next operation to occur. The fastload_request variable doubles as the state number for the resulting simple state-machine. This approach really simplifies the code a lot, and makes it much easier to run in an interrupt.
Before going further, it is worth noting that if you run the interrupt on a normal raster IRQ, the loader will be able to load at most one block = 254 bytes of usable data per frame. This means 254 x 50 = ~12.7KB/sec in PAL or 15.2KB/sec in NTSC. If you are using a real 800KB 1581 disk, that's not a problem, because the drive will slow you down more than that. But if you are using a disk image, or one of the MEGA65's HD disk formats, then this will slow things down.
The easy solution is to have your IRQ routine trigger multiple times per frame, or enable IRQs in the floppy controller, and have it be called on demand whenever a sector is ready. You will need to acknowledge the floppy controller interrupts, if you do that.
There is also a further ~2x speed up without doing that which is possible by modifying the loader to realise when a single sector contains two consecutive blocks of a file. It doesn't currently do this, which is a bit stupid.  Fixing that would also be a great exercise for the reader.
  
fastload_irq:
    ;; If the FDC is busy, do nothing, as we can't progress.
    ;; This really simplifies the state machine into a series of
    ;; sector reads
    lda fastload_request
    bne todo
    rts
todo:    
    lda $d082
    bpl fl_fdc_not_busy
    rts
fl_fdc_not_busy:    
    ;; FDC is not busy, so check what state we are in
    lda fastload_request
    bpl fl_not_in_error_state
    rts
fl_not_in_error_state:
It's worth explaining how the IRQ handler calls the various routines for the different states, because it uses a nice feature of the 65CE02: JMP indirect, X-indexed.  This instruction basically allows you to have a jump-table without the silly push-addr-minus-one to stack trick you have to use on the C64. The resulting code is quite a lot simpler and clearer as a result:
    ;; Shift state left one bit, so that we can use it as a lookup
    ;; into a jump table.
    ;; Everything else is handled by the jump table
    cmp #6
    bcc fl_job_ok
    ;; Ignore request/status codes that don't correspond to actions
    rts
fl_job_ok:    
    asl
    tax
    jmp (fl_jumptable,x)
    
fl_jumptable:
    !16 fl_idle
    !16 fl_new_request
    !16 fl_directory_scan
    !16 fl_read_file_block
    !16 fl_seek_track_0
    !16 fl_step_track
The first of those state routines is the one for when the loader is idle: Just return immediately. This can be optimised away, since there are (1) plenty of other RTS instructions we could point at; and (2) because it never gets called, because we have the short-circuit exit at the start of the IRQ handler.  If you haven't already gotten the idea by now, you can tell that I have really just hacked this together until it works, and then stopped to document it.  Lots of opportunities for you to get involved and improve it ;)
fl_idle:
    rts
The next state handler checks if we are on track 0 yet, and if not, commands a step towards track 0, which like all other floppy controller actions, will have the floppy controller busy until the step has completed. Again, our nice busy check in the start of the IRQ handler means that we can just keep stepping in this routine until we reach track 0. Note how it writes $00 into fastload_request when done, to indicate that the loader is idle and ready for a new job.
fl_seek_track_0:
    lda $d082
    and #$01
    bne fl_not_on_track_0
    lda #$00
    sta fastload_request
    sta fl_current_track
    rts
fl_not_on_track_0:
    ;; Step back towards track 0
    lda #$10
    sta $d081
    rts
As you saw in the demo driver code, to submit a new job, you write $01 into fastload_request. This causes the following routine to be run when the IRQ is next triggered.  It puts $02 into fastload_request, so that it knows that it has just accepted a job, and also immediately requests the reading of the first physical sector that contains a directory block, ready for us to look for the requested file.
fl_new_request:
    ;; Acknowledge fastload request
    lda #2
    sta fastload_request
    ;; Start motor
    lda #$60
    sta $d080
    ;; Request T40 S3 to start directory scan
    ;; (remember we have to do silly translation to real sectors)
    lda #40-1
    sta $d084
    lda #(3/2)+1
    sta $d085
    lda #$00
    sta $d086         ; side
    ;; Request read
    jsr fl_read_sector
    rts
The above set fastload_request to call this routine on each IRQ, i.e., as each sector of the directory is loaded. We then look through the whole 512 byte sector for a matching filename, and if found, change state to load the file from the logical track and sector of the first block of the file as obtained from the directory listing. Note that we ignore the file type, including if the file is deleted. Again, a great opportunity for someone to improve the loader.
fl_directory_scan:
    ;; Check if our filename we want is in this sector
    jsr fl_copy_sector_to_buffer
    ;; (XXX we scan the last BAM sector as well, to keep the code simple.)
    ;; filenames are at offset 4 in each 32-byte directory entry, padded at
    ;; the end with $A0
    lda #<fastload_sector_buffer
    sta fl_buffaddr+1
    lda #>fastload_sector_buffer
    sta fl_buffaddr+2
fl_check_logical_sector:
    ldx #$05
fl_filenamecheckloop:
    ldy #$00
fl_check_loop_inner:
fl_buffaddr:
    lda fastload_sector_buffer+$100,x
    
    cmp fastload_filename,y    
    bne fl_filename_differs
    inx
    iny
    cpy #$10
    bne fl_check_loop_inner
    ;; Filename matches
    txa
    sec
    sbc #$12
    tax
    lda fl_buffaddr+2
    cmp #>fastload_sector_buffer
    bne fl_file_in_2nd_logical_sector
    ;; Y=Track, A=Sector
    lda fastload_sector_buffer,x
    tay
    lda fastload_sector_buffer+1,x
    jmp fl_got_file_track_and_sector
fl_file_in_2nd_logical_sector:    
    ;; Y=Track, A=Sector
    lda fastload_sector_buffer+$100,x
    tay
    lda fastload_sector_buffer+$101,x
fl_got_file_track_and_sector:
    ;; Store track and sector of file
    sty fl_file_next_track
    sta fl_file_next_sector
    ;; Request reading of next track and sector
    jsr fl_read_next_sector
    ;; Advance to next state
    lda #3
    sta fastload_request
    rts
    
fl_filename_differs:
    ;; Skip same number of chars as though we had matched
    cpy #$10
    beq fl_end_of_name
    inx
    iny
    jmp fl_filename_differs
fl_end_of_name:
    ;; Advance to next directory entry
    txa
    clc
    adc #$10
    tax
    bcc fl_filenamecheckloop
    inc fl_buffaddr+2
    lda fl_buffaddr+2
    cmp #(>fastload_sector_buffer)+1
    bne fl_checked_both_halves
    jmp fl_check_logical_sector
fl_checked_both_halves:    
    
    ;; No matching name in this 512 byte sector.
    ;; Load the next one, or give up the search
    inc $d085
    lda $d085
    cmp #11
    bne fl_load_next_dir_sector
    ;; Ran out of sectors in directory track
    ;; (XXX only checks side 0, and assumes DD disk)
    ;; Mark load as failed
    lda #$80         ; $80 = File not found
    sta fastload_request    
    rts
We now have several little utility routines related to reading sectors from the disk, including doing the conversion from 1581 logical sectors to 3.5" floppy physical sectors, and tracking the head if we aren't on the correct track already etc. If it detects that it needs to step the head, it changes fastload_request to point to a handler for that, which in turn sets it back to the handler for reading blocks of the file. 
Note that I haven't actually tried this on a real disk, yet. This should be done, as there will quite likely be some subtle problem that will need shaking out, most likely with the track stepping. But it shouldn't be too hard to fix, and who knows, I might have got it right the first time ;) 
fl_load_next_dir_sector:    
    ;; Request read
    jsr fl_read_sector
    ;; No need to change state
    rts
fl_read_sector:
    ;; Check if we are already on the correct track/side
    ;; and if not, select/step as required
    lda #$40
    sta $d081
    rts
fl_step_track:
    lda #3
    sta fastload_request
    ;; FALL THROUGH
    
fl_read_next_sector:
    ;; Check if we reached the end of the file first
    lda fl_file_next_track
    bne fl_not_end_of_file
    rts
fl_not_end_of_file:    
    ;; Read next sector of file
    jsr fl_logical_to_physical_sector
    lda fl_current_track
    lda $d084
    cmp fl_current_track
    beq fl_on_correct_track
    bcc fl_step_in
fl_step_out:
    ;; We need to step first
    lda #$18
    sta $d081
    inc fl_current_track
    lda #5
    sta fastload_request
    rts
fl_step_in:
    ;; We need to step first
    lda #$10
    sta $d081
    dec fl_current_track
    lda #5
    sta fastload_request
    rts
    
fl_on_correct_track:    
    jsr fl_read_sector
    rts
Here we have another utility routine that does the logical-to-physical track and sector conversion. Again, this basically mirrors what the 1581 does. It will need modifying to use the fast-loader on HD disks, because there will be more sectors on each side of the disk. 
fl_logical_to_physical_sector:
    ;; Convert 1581 sector numbers to physical ones on the disk.
    ;; Track = Track - 1
    ;; Sector = 1 + (Sector/2)
    ;; Side = 0
    ;; If sector > 10, then sector=sector-10, side=1
    lda #$00         ; side 0
    sta $d086
    lda fl_file_next_track
    dec
    sta $d084
    lda fl_file_next_sector
    lsr
    inc
    cmp #10
    bcs fl_on_second_side
    sta $d085
    jmp fl_set_fdc_head
    
fl_on_second_side:
    sec
    sbc #10
    sta $d085
    lda #1
    sta $d086
    ;; FALL THROUGH
fl_set_fdc_head:
    ;; Select correct side of real disk drive
    lda $d086
    asl
    asl
    asl
    and #$08
    ora #$60
    sta $d080
    rts
    
This is the routine that really does the loading: It gets the read physical sector, works out which half of it contains the data for us, DMAs the read bytes into the destination location in memory, and then follows the block chain to the next block of the file, and detects the end-of-file marker indicated by logical track = $00.
fl_read_file_block:
    ;; We have a sector from the floppy drive.
    ;; Work out which half and how many bytes,
    ;; and copy them into place.
    ;; Get sector from FDC
    jsr fl_copy_sector_to_buffer
    ;; Assume full sector initially
    lda #254
    sta fl_bytes_to_copy
    
    ;; Work out which half we care about
    lda fl_file_next_sector
    and #$01
    bne fl_read_from_second_half
fl_read_from_first_half:
    lda #(>fastload_sector_buffer)+0
    sta fl_read_dma_page
    lda fastload_sector_buffer+1
    sta fl_file_next_sector
    lda fastload_sector_buffer+0
    sta fl_file_next_track
    bne fl_1st_half_full_sector
fl_1st_half_partial_sector:
    lda fastload_sector_buffer+1
    sta fl_bytes_to_copy    
    ;; Mark end of loading
    lda #$00
    sta fastload_request
fl_1st_half_full_sector:
    jmp fl_dma_read_bytes
    
fl_read_from_second_half:
    lda #(>fastload_sector_buffer)+1
    sta fl_read_dma_page
    lda fastload_sector_buffer+$101
    sta fl_file_next_sector
    lda fastload_sector_buffer+$100
    sta fl_file_next_track
    bne fl_2nd_half_full_sector
fl_2nd_half_partial_sector:
    lda fastload_sector_buffer+$101
    sta fl_bytes_to_copy
    ;; Mark end of loading
    lda #$00
    sta fastload_request
fl_2nd_half_full_sector:
    ;; FALLTHROUGH
fl_dma_read_bytes:
    ;; Update destination address
    lda fastload_address+3
    asl
    asl
    asl
    asl
    sta fl_data_read_dmalist+2
    lda fastload_address+2
    lsr
    lsr
    lsr
    lsr
    ora fl_data_read_dmalist+2
    sta fl_data_read_dmalist+2
    lda fastload_address+2
    and #$0f
    sta fl_data_read_dmalist+12
    lda fastload_address+1
    sta fl_data_read_dmalist+11
    lda fastload_address+0
    sta fl_data_read_dmalist+10
    ;; Copy FDC data to our buffer
    lda #$00
    sta $d704
    lda #>fl_data_read_dmalist
    sta $d701
    lda #<fl_data_read_dmalist
    sta $d705
    ;; Update load address
    lda fastload_address+0
    clc
    adc fl_bytes_to_copy
    sta fastload_address+0
    lda fastload_address+1
    adc #0
    sta fastload_address+1
    lda fastload_address+2
    adc #0
    sta fastload_address+2
    lda fastload_address+3
    adc #0
    sta fastload_address+3
    
    ;; Schedule reading of next block
    jsr fl_read_next_sector
    
    rts
We are now almost at the end. What we have here is the DMA lists for copying the read data to its final destination, as well as the routine and DMA list for copying a physical sector from the FDC's buffer down to fastload_sector_buffer.  As previously noted, we can probably shrink the whole thing (and make it use less raster time) by avoiding that copy, if we instead fiddle the IO banking to make the floppy sector buffer map at $DE00-$DFFF (there is a special bit that enables this).  But what we have here works, and isn't that much slower, as the DMA doesn't take very long.  
fl_data_read_dmalist:
    !byte $0b      ; F011A type list
    !byte $81,$00      ; Destination MB
    !byte 0         ; no more options
    !byte 0            ; copy
fl_bytes_to_copy:    
    !word 0               ; size of copy
fl_read_page_word:    
fl_read_dma_page = fl_read_page_word + 1
    ;; +2 is to skip track/header link
    !word fastload_sector_buffer+2    ; Source address
    !byte $00        ; Source bank
    
    !word 0                 ; Dest address
    !byte $00             ; Dest bank
    
    !byte $00             ; sub-command
    !word 0                 ; modulo (unused)
    
    rts
    
fl_copy_sector_to_buffer:
    ;; Make sure FDC sector buffer is selected
    lda #$80
    trb $d689
    ;; Copy FDC data to our buffer
    lda #$00
    sta $d704
    lda #>fl_sector_read_dmalist
    sta $d701
    lda #<fl_sector_read_dmalist
    sta $d705
    rts
fl_sector_read_dmalist:
    !byte $0b      ; F011A type list
    !byte $80,$ff            ; MB of FDC sector buffer address ($FFD6C00)
    !byte 0         ; no more options
    !byte 0            ; copy
    !word 512        ; size of copy
    !word $6c00        ; low 16 bits of FDC sector buffer address
    !byte $0d        ; next 4 bits of FDC sector buffer address
    !word fastload_sector_buffer ; Dest address    
    !byte $00             ; Dest bank
    !byte $00             ; sub-command
    !word 0                 ; modulo (unused)
And that's it. The loader really is quite simple, especially compared with a 1541 fast-loader. You can find the source in https://github.com/mega65/mega65-tools, just look for fastload-demo.asm.
Finally, a somewhat arbitrary screen-shot, because every blog post requires at least one, but its kind of hard to show a fast-loader in action in a still image.

 



