Making a C64/C65 compatible computer: Creating a simple internal drive fast-loader for the MEGA65

For a while now I have been thinking about making a simple fast-loader for the MEGA65 that bypasses the C65 DOS and accesses the floppy controller directly. The topic comes up from time to time, for example from developers who want to load large files from disk. So I spent a couple of hours yesterday writing a proof-of-concept version. My design criteria were:

1. Must be able to be run from an IRQ, so that it can be used in games or demos to load in the background while other activity goes on. The C65 DOS cannot be sensibly used for this, because when it runs, it blocks all interrupts for arbitrary periods of time, which can exceed 200ms(!!!).

2. Must allow loading to any address in memory.

3. Must be small enough that it can be easily incorporated into other programs.

(1) and (3) meant that it had to be written in assembly.

So here's what I created. It is still missing a few things: it doesn't save and restore the DMA list address registers (in case you were composing a DMA job in real-time just as the IRQ triggered), and it doesn't support specifying how much of a file to load, which would allow progressive streaming of a file. Both would be fairly easy to implement. But back to what we do have, with an annotated walk through the source:

First up, to demonstrate it, we have a simple BASIC header (I am running it from C64 mode, but you could almost as easily run it from C65 mode):

basic_header

   !byte 0x10,0x08,<2021,>2021,0x9e
   !pet "2061"
   !byte 0x00,0x00,0x00

Then we have the start of the demo program that uses the fast-loader. The actual fast-loader code will come a bit later. We do the usual things of making sure we have MEGA65 IO enabled and the CPU at full speed, and have some boilerplate to clear the screen, set screen colours etc:

program_start:

   ;; Select MEGA65 IO mode
   lda #$47
   sta $d02f
   lda #$53
   sta $d02f

   ;; Select 40MHz mode
   lda #65
   sta $0

   lda #$00
   sta $d020
   sta $d021

   lda #$01
   sta $0286
   jsr $e544

Next it is time to set up our raster interrupt. This should all be very familiar to C64 coders:
   ;; Install our raster IRQ with our fastloader
   sei

   lda #$7f
   sta $dc0d
   sta $dd0d
   lda #$40
   sta $d012
   lda #$1b
   sta $d011
   lda #$01
   sta $d01a
   dec $d019

   lda #$16
   sta $d018

   lda #<irq_handler
   sta $0314
   lda #>irq_handler
   sta $0315
   cli

We'll get to the IRQ handler in a moment, but we will finish looking at the real-time part of the program first. The fast-loader uses a single byte state/status variable to keep track of what it is doing. If it is $00, then the loader is idle. If you want to ask it to load something, you set up the filename and load address, and then write $01 into the variable. It will go back to $00 when it's done, or have bit 7 set if there is some kind of error. This means you can check status with BEQ and BMI. The load address is updated progressively to show how far loading has got, if that's important for you to track. In the example, we load the game GYRRUS into bank 4 at $00040000:
   ;; Example for using the fast loader

   ;; Copy up to 16 characters of the filename into the request buffer.
   ;; Expected to be PETSCII. The buffer is pre-filled with $A0, so a
   ;; shorter, zero-terminated name ends up $A0 padded to 16 chars
   ldx #$0f
   lda #$a0
clearfilename:
   sta fastload_filename,x
   dex
   bpl clearfilename
   ldx #$ff
filenamecopyloop:
   inx
   cpx #$10
   beq endofname
   lda filename,x
   beq endofname
   sta fastload_filename,x
   bne filenamecopyloop
endofname:
   ;; X already holds the number of characters copied
   stx fastload_filename_len

   ;; Set load address (32-bit)
   ;; = $40000 = BANK 4
   lda #$00
   sta fastload_address+0
   lda #$00
   sta fastload_address+1
   lda #$04
   sta fastload_address+2
   lda #$00
   sta fastload_address+3
Remember what I said about the status variable? We need to make sure it is $00 before we submit our load request. This is important because when the fast-loader initialises, it doesn't know what track the drive is on, and so it seeks back to track 0 first. So we make sure that that completes before we submit our job. If we didn't do this, reading of any sector from the disk on a real drive would hang, because the head would be on the wrong track.
   ;; Give the fastload time to get itself sorted
   ;; (largely seeking to track 0)
wait_for_fastload:
   lda fastload_request
   bne wait_for_fastload
Finally the fast-loader is ready, so we can then submit our job. It really is this simple:
   ;; Request fastload job
   lda #$01
   sta fastload_request
We can then go off and do whatever we want in real-time, knowing that the raster interrupt will be calling the fast-loader, and allowing it to progress in the background. For simplicity, in our demo we just wait for the fast-load to complete, and indicate if an error occurred, or if it loaded ok.
   ;; Then just wait for the request byte to
   ;; go back to $00, or to report an error by having the MSB
   ;; set. The request value will continually update based on the
   ;; state of the loading.
waiting
   lda fastload_request
   bmi error
   bne waiting
   beq done

error
   inc $042f
   jmp error

done
   inc $d020
   jmp done

That's over and done with for real-time, so now let's look at our raster interrupt. This is also quite simple: acknowledge the IRQ source, set the border colour to white, call the fastload_irq routine, then return the border colour to black, before returning via the well known $EA81 interrupt exit handler code in the C64 KERNAL. You can of course do whatever you want, but this shows just how simple it can be. The border colour stuff is of course optional, but lets us see just how little raster time this loader uses.
irq_handler:
   ;; Here is our nice minimalistic IRQ handler that calls the fastload IRQ

   dec $d019

   ;; Call fastload and show raster time used in the loader
   lda #$01
   sta $d020
   jsr fastload_irq
   lda #$00
   sta $d020

   ;; Chain to KERNAL IRQ exit
   jmp $ea81

As mentioned, I set this demo up to load GYRRUS into bank 4, just because that was a file on the disk image I had active in my MEGA65 at the time. Note that the filename has to be padded with $A0s, because the fast-load code literally compares all 16 bytes of the filename with the 16 bytes of filename in the directory sectors. It doesn't support partitions or sub-directories on the disk image. Someone could hack that in if they wanted it, but I don't think it will be necessary for almost all use-cases.
filename:
   ;; GYRRUS for testing
   !byte $47,$59,$52,$52,$55,$53,$a0,$a0
   !byte $a0,$a0,$a0,$a0,$a0,$a0,$a0,$a0
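
If you would rather generate the padded filename bytes from a build script than type them by hand, the padding rule can be sketched in Python. This is only an illustration: `pad_filename` is a hypothetical helper that is not part of the loader, and it assumes the name uses only characters whose PETSCII codes match ASCII (upper-case letters, digits, spaces).

```python
def pad_filename(name: str) -> bytes:
    """Return a 16-byte filename padded with $A0, as the loader expects.

    Assumption: the name only contains upper-case letters, digits and
    spaces, whose PETSCII codes are the same as their ASCII codes.
    """
    raw = name.encode("ascii")
    if len(raw) > 16:
        raise ValueError("filenames are at most 16 characters")
    return raw + b"\xa0" * (16 - len(raw))


# Print the bytes in a form that is easy to paste into a !byte line
print(pad_filename("GYRRUS").hex(" "))
```

The result for "GYRRUS" is the same 16 bytes as the two !byte lines above.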


;; ----------------------------------------------------------------------------
;; ----------------------------------------------------------------------------
;; ----------------------------------------------------------------------------
So that was the code for our example driver of the fast load. For your own programs, you can cut everything above here away, and just keep what follows. It requires about 1.2KB, including the 512 byte sector buffer, so it's quite small in the grand scheme of things.
   ;; ------------------------------------------------------------
   ;; Actual fast-loader code
   ;; ------------------------------------------------------------
First up, we have the variables and temporary storage for the fast loader: The filename and length (which actually gets ignored, because of the use of $A0 padding, so can be removed at some point), the address where the user wants to load, and the state/status variable. These four variables are the only ones you need to access from your code. Everything else that follows is internal to the fast-loader.

fastload_filename:
   *=*+16
fastload_filename_len:
   !byte 0
fastload_address:
   !byte 0,0,0,0
fastload_request:
   ;; Start with seeking to track 0
   !byte 4
This variable keeps track of which physical track on the disk the loader thinks the head is currently over, so that we can step to the correct track:
fl_current_track:    !byte 0

Then we have variables for the logical track and sector of the next 256 byte block of the file. These have to get translated into the physical track and sector of the drive, which, like the 1581, stores two blocks in each physical sector.
fl_file_next_track: !byte 0
fl_file_next_sector: !byte 0

Then finally, we have the 512 byte sector buffer. Now, this could be optimised away, by enabling mapping of the sector buffer at $DE00-$DFFF, but I couldn't be bothered remembering how to do that, and also didn't want to cause potential problems for code that also uses REU emulation or other things that might appear in the IO area. It's not that it can't be done, but rather that I just took the quick and easy path. It would be a great exercise for the reader to change this, and reduce the total size of the loader to <1KB as a result.
fastload_sector_buffer:
*=*+512

Now let's take a look at the fast-loader's IRQ handler. It basically checks if there is an active request, and if not does nothing. Then it checks if the floppy controller is busy doing something that it asked it to earlier. If so, it does nothing. But if we have an active job, and the floppy controller is not busy, this means that we can ask for the next operation to occur. The fastload_request variable doubles as the state number for the resulting simple state-machine. This approach really simplifies the code a lot, and makes it much easier to run in an interrupt.

Before going further, it is worth noting that if you run the interrupt on a normal raster IRQ, the loader will be able to load at most one block = 254 bytes of usable data per frame. This means 254 x 50 = ~12.7KB/sec in PAL or 15.2KB/sec in NTSC. If you are using a real 800KB 1581 disk, that's not a problem, because the drive will slow you down more than that. But if you are using a disk image, or one of the MEGA65's HD disk formats, then this will slow things down.
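
The arithmetic behind those figures can be checked with a few lines of Python. This is just a back-of-the-envelope sketch: it assumes exactly 50 or 60 IRQs per second, and `bytes_per_second` is a name made up for the illustration.

```python
BLOCK_PAYLOAD = 254  # data bytes per 256-byte block; 2 bytes are the chain link


def bytes_per_second(frames_per_second: int, blocks_per_frame: int = 1) -> int:
    """Upper bound on throughput when each IRQ call can deliver one block."""
    return BLOCK_PAYLOAD * blocks_per_frame * frames_per_second


for label, fps in (("PAL", 50), ("NTSC", 60)):
    print(f"{label}: {bytes_per_second(fps) / 1000:.1f} KB/sec")
```

Passing a larger `blocks_per_frame` shows how the ceiling scales if the loader routine is called more than once per frame.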

The easy solution is to have your IRQ routine trigger multiple times per frame, or enable IRQs in the floppy controller, and have it be called on demand whenever a sector is ready. You will need to acknowledge the floppy controller interrupts, if you do that.

There is also a further ~2x speed-up available without doing that, by modifying the loader to realise when a single physical sector contains two consecutive blocks of a file. It doesn't currently do this, which is a bit stupid. Fixing that would also be a great exercise for the reader.

fastload_irq:
   ;; If there is no active request, there is nothing to do
   lda fastload_request
   bne todo
   rts
todo:
   ;; If the FDC is busy, do nothing, as we can't progress.
   ;; This really simplifies the state machine into a series of
   ;; sector reads
   lda $d082
   bpl fl_fdc_not_busy
   rts
fl_fdc_not_busy:
   ;; FDC is not busy, so check what state we are in
   lda fastload_request
   bpl fl_not_in_error_state
   rts
fl_not_in_error_state:

It's worth explaining how the IRQ handler calls the various routines for the different states, because it uses a nice feature of the 65CE02: JMP indirect, X-indexed. This instruction basically allows you to have a jump-table without the silly push-addr-minus-one to stack trick you have to use on the C64. The resulting code is quite a lot simpler and clearer as a result:
   ;; Shift state left one bit, so that we can use it as a lookup
   ;; into a jump table.
   ;; Everything else is handled by the jump table
   cmp #6
   bcc fl_job_ok
   ;; Ignore request/status codes that don't correspond to actions
   rts
fl_job_ok:
   asl
   tax
   jmp (fl_jumptable,x)

fl_jumptable:
   !16 fl_idle
   !16 fl_new_request
   !16 fl_directory_scan
   !16 fl_read_file_block
   !16 fl_seek_track_0
   !16 fl_step_track
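
For readers who find it easier to follow the control flow in a high-level language, here is a rough Python model of the checks and the table dispatch. It is only a sketch of the logic in fastload_irq: the `dispatch` function and the handler list are invented for the illustration, and nothing here touches real hardware.

```python
# State numbers, in the same order as fl_jumptable
IDLE, NEW_REQUEST, DIRECTORY_SCAN, READ_FILE_BLOCK, SEEK_TRACK_0, STEP_TRACK = range(6)


def dispatch(request, fdc_busy, handlers):
    """Mirror the checks at the top of fastload_irq.

    Returns True if a state handler was called, False otherwise.
    """
    if request == IDLE:
        return False          # no active request
    if fdc_busy:
        return False          # previous FDC command is still running
    if request & 0x80:
        return False          # error status, e.g. $80 = file not found
    if request >= len(handlers):
        return False          # not an action code
    handlers[request]()       # the equivalent of jmp (fl_jumptable,x)
    return True


called = []
names = ["idle", "new_request", "directory_scan",
         "read_file_block", "seek_track_0", "step_track"]
handlers = [lambda n=n: called.append(n) for n in names]

dispatch(NEW_REQUEST, fdc_busy=True, handlers=handlers)    # skipped: FDC busy
dispatch(NEW_REQUEST, fdc_busy=False, handlers=handlers)   # runs new_request
dispatch(0x80, fdc_busy=False, handlers=handlers)          # skipped: error state
print(called)
```

Only the second call reaches a handler, which is the whole point of the design: every state handler can assume the floppy controller is free when it runs.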

The first of those state routines is the one for when the loader is idle: Just return immediately. This can be optimised away, since there are (1) plenty of other RTS instructions we could point at; and (2) because it never gets called, because we have the short-circuit exit at the start of the IRQ handler. If you haven't already gotten the idea by now, you can tell that I have really just hacked this together until it works, and then stopped to document it. Lots of opportunities for you to get involved and improve it ;)
fl_idle:
rts

The next state handler checks if we are on track 0 yet, and if not, commands a step towards track 0, which, like all other floppy controller actions, will keep the floppy controller busy until the step has completed. Again, our nice busy check at the start of the IRQ handler means that we can just keep stepping in this routine until we reach track 0. Note how it writes $00 into fastload_request when done, to indicate that the loader is idle and ready for a new job.
fl_seek_track_0:
   lda $d082
   and #$01
   bne fl_not_on_track_0
   lda #$00
   sta fastload_request
   sta fl_current_track
   rts
fl_not_on_track_0:
   ;; Step back towards track 0
   lda #$10
   sta $d081
   rts

As you saw in the demo driver code, to submit a new job, you write $01 into fastload_request. This causes the following routine to be run when the IRQ is next triggered. It puts $02 into fastload_request, so that it knows that it has just accepted a job, and also immediately requests the reading of the first physical sector that contains a directory block, ready for us to look for the requested file.
fl_new_request:
   ;; Acknowledge fastload request
   lda #2
   sta fastload_request
   ;; Start motor
   lda #$60
   sta $d080
   ;; Request T40 S3 to start directory scan
   ;; (remember we have to do silly translation to real sectors)
   lda #40-1
   sta $d084
   lda #(3/2)+1
   sta $d085
   lda #$00
   sta $d086         ; side
   ;; Request read
   jsr fl_read_sector
   rts

The routine above sets fastload_request so that this routine is called on each IRQ, i.e., as each sector of the directory is loaded. We then look through the whole 512 byte sector for a matching filename, and if found, change state to load the file from the logical track and sector of the first block of the file as obtained from the directory listing. Note that we ignore the file type, including if the file is deleted. Again, a great opportunity for someone to improve the loader.
fl_directory_scan:
   ;; Check if our filename we want is in this sector
   jsr fl_copy_sector_to_buffer

   ;; (XXX we scan the last BAM sector as well, to keep the code simple.)
   ;; filenames are at offset 5 in each 32-byte directory entry, padded at
   ;; the end with $A0
   lda #<fastload_sector_buffer
   sta fl_buffaddr+1
   lda #>fastload_sector_buffer
   sta fl_buffaddr+2

fl_check_logical_sector:
   ldx #$05
fl_filenamecheckloop:
   ldy #$00

fl_check_loop_inner:

fl_buffaddr:
   lda fastload_sector_buffer+$100,x

   cmp fastload_filename,y
   bne fl_filename_differs
   inx
   iny
   cpy #$10
   bne fl_check_loop_inner
   ;; Filename matches
   txa
   sec
   sbc #$12
   tax
   lda fl_buffaddr+2
   cmp #>fastload_sector_buffer
   bne fl_file_in_2nd_logical_sector
   ;; Y=Track, A=Sector
   lda fastload_sector_buffer,x
   tay
   lda fastload_sector_buffer+1,x
   jmp fl_got_file_track_and_sector
fl_file_in_2nd_logical_sector:
   ;; Y=Track, A=Sector
   lda fastload_sector_buffer+$100,x
   tay
   lda fastload_sector_buffer+$101,x
fl_got_file_track_and_sector:
   ;; Store track and sector of file
   sty fl_file_next_track
   sta fl_file_next_sector
   ;; Request reading of next track and sector
   jsr fl_read_next_sector
   ;; Advance to next state
   lda #3
   sta fastload_request
   rts

fl_filename_differs:
   ;; Skip same number of chars as though we had matched
   cpy #$10
   beq fl_end_of_name
   inx
   iny
   jmp fl_filename_differs
fl_end_of_name:
   ;; Advance to next directory entry
   txa
   clc
   adc #$10
   tax
   bcc fl_filenamecheckloop
   inc fl_buffaddr+2
   lda fl_buffaddr+2
   cmp #(>fastload_sector_buffer)+1
   bne fl_checked_both_halves
   jmp fl_check_logical_sector
fl_checked_both_halves:

   ;; No matching name in this 512 byte sector.
   ;; Load the next one, or give up the search
   inc $d085
   lda $d085
   cmp #11
   bne fl_load_next_dir_sector
   ;; Ran out of sectors in directory track
   ;; (XXX only checks side 0, and assumes DD disk)

   ;; Mark load as failed
   lda #$80         ; $80 = File not found
   sta fastload_request
   rts
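
The search itself is easier to see without the self-modifying code and index juggling. Below is a small Python sketch of the same scan over one 512 byte physical sector, using the offsets the assembly relies on (first-block track and sector at offsets 3 and 4 of each 32-byte entry, filename at offsets 5 to 20). `find_file` is a hypothetical name, and like the loader it ignores the file type byte.

```python
def find_file(sector: bytes, padded_name: bytes):
    """Return (track, sector) of the file's first block, or None.

    sector is one 512-byte physical sector, i.e. two directory blocks,
    and padded_name is the 16-byte, $A0-padded filename.
    """
    if len(sector) != 512 or len(padded_name) != 16:
        raise ValueError("need a 512-byte sector and a 16-byte name")
    for entry in range(0, 512, 32):
        if sector[entry + 5:entry + 21] == padded_name:
            return sector[entry + 3], sector[entry + 4]
    return None


# Build a fake directory sector with one entry in the second half
name = b"GYRRUS" + b"\xa0" * 10
sector = bytearray(512)
entry = 0x120
sector[entry + 3] = 41              # first block of the file: track 41
sector[entry + 4] = 0               # first block of the file: sector 0
sector[entry + 5:entry + 21] = name
print(find_file(bytes(sector), name))
```

If no entry matches, the function returns None, which corresponds to the loader moving on to the next directory sector.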

We now have several little utility routines related to reading sectors from the disk, including doing the conversion from 1581 logical sectors to 3.5" floppy physical sectors, and stepping the head if we aren't on the correct track already. If the loader detects that it needs to step the head, it changes fastload_request to point to a handler for that, which in turn sets it back to the handler for reading blocks of the file.

Note that I haven't actually tried this on a real disk, yet. This should be done, as there will quite likely be some subtle problem that will need shaking out, most likely with the track stepping. But it shouldn't be too hard to fix, and who knows, I might have got it right the first time ;)
fl_load_next_dir_sector:
   ;; Request read
   jsr fl_read_sector
   ;; No need to change state
   rts

fl_read_sector:
   ;; Issue the read-sector command. The caller must already have
   ;; set track, sector and side, and stepped the head if required
   lda #$40
   sta $d081
   rts

fl_step_track:
   lda #3
   sta fastload_request
   ;; FALL THROUGH

fl_read_next_sector:
   ;; Check if we reached the end of the file first
   lda fl_file_next_track
   bne fl_not_end_of_file
   rts
fl_not_end_of_file:
   ;; Read next sector of file
   jsr fl_logical_to_physical_sector

   lda $d084
   cmp fl_current_track
   beq fl_on_correct_track
   bcc fl_step_in
fl_step_out:
   ;; Target track is higher than the current one, so step one
   ;; track away from track 0 before reading
   lda #$18
   sta $d081
   inc fl_current_track
   lda #5
   sta fastload_request
   rts
fl_step_in:
   ;; Target track is lower than the current one, so step one
   ;; track towards track 0 before reading
   lda #$10
   sta $d081
   dec fl_current_track
   lda #5
   sta fastload_request
   rts

fl_on_correct_track:
   jsr fl_read_sector
   rts

Here we have another utility routine that does the logical-to-physical track and sector conversion. Again, this basically mirrors what the 1581 does. It will need modifying to use the fast-loader on HD disks, because there will be more sectors on each side of the disk.
fl_logical_to_physical_sector:
   ;; Convert 1581 sector numbers to physical ones on the disk.
   ;; Track = Track - 1
   ;; Sector = 1 + (Sector/2)
   ;; Side = 0
   ;; If sector > 10, then sector=sector-10, side=1
   lda #$00         ; side 0
   sta $d086
   lda fl_file_next_track
   dec
   sta $d084
   lda fl_file_next_sector
   lsr
   inc
   ;; Physical sectors 1-10 are on side 0, 11-20 on side 1
   cmp #11
   bcs fl_on_second_side
   sta $d085
   jmp fl_set_fdc_head

fl_on_second_side:
   sec
   sbc #10
   sta $d085
   lda #1
   sta $d086

   ;; FALL THROUGH
fl_set_fdc_head:
   ;; Select correct side of real disk drive
   lda $d086
   asl
   asl
   asl
   and #$08
   ora #$60
   sta $d080
   rts
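
To make the mapping concrete, here is the rule from the routine's comments written out in Python. It is a sketch that assumes a DD disk with 10 physical sectors per side; `logical_to_physical` is a name chosen for the illustration.

```python
def logical_to_physical(track: int, sector: int):
    """Map a 1581 logical track/sector to (physical track, sector, side).

    Assumes a DD disk: 40 logical 256-byte sectors per track, stored as
    10 physical 512-byte sectors on each of the two sides.
    """
    if not (1 <= track <= 80 and 0 <= sector <= 39):
        raise ValueError("logical track/sector out of range")
    phys_track = track - 1
    phys_sector = 1 + sector // 2
    side = 0
    if phys_sector > 10:
        phys_sector -= 10
        side = 1
    return phys_track, phys_sector, side


# The directory starts at logical track 40, sector 3
print(logical_to_physical(40, 3))
```

The directory example gives physical track 39, sector 2, side 0, which are the values fl_new_request writes to $d084, $d085 and $d086.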

This is the routine that really does the loading: it takes the physical sector that was just read, works out which half of it contains the data for us, DMAs the data bytes to the destination location in memory, and then follows the block chain to the next block of the file. It detects the end of the file by the logical track in the link being $00.
fl_read_file_block:
   ;; We have a sector from the floppy drive.
   ;; Work out which half and how many bytes,
   ;; and copy them into place.

   ;; Get sector from FDC
   jsr fl_copy_sector_to_buffer

   ;; Assume full sector initially
   lda #254
   sta fl_bytes_to_copy

   ;; Work out which half we care about
   lda fl_file_next_sector
   and #$01
   bne fl_read_from_second_half
fl_read_from_first_half:
   lda #(>fastload_sector_buffer)+0
   sta fl_read_dma_page
   lda fastload_sector_buffer+1
   sta fl_file_next_sector
   lda fastload_sector_buffer+0
   sta fl_file_next_track
   bne fl_1st_half_full_sector
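fl_1st_half_partial_sector:
   ;; In the last block, the sector byte of the link holds the offset
   ;; of the last used byte. Data starts at offset 2, so the number
   ;; of data bytes is that value minus one
   lda fastload_sector_buffer+1
   dec
   sta fl_bytes_to_copy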
   ;; Mark end of loading
   lda #$00
   sta fastload_request
fl_1st_half_full_sector:
   jmp fl_dma_read_bytes

fl_read_from_second_half:
   lda #(>fastload_sector_buffer)+1
   sta fl_read_dma_page
   lda fastload_sector_buffer+$101
   sta fl_file_next_sector
   lda fastload_sector_buffer+$100
   sta fl_file_next_track
   bne fl_2nd_half_full_sector
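fl_2nd_half_partial_sector:
   ;; In the last block, the sector byte of the link holds the offset
   ;; of the last used byte. Data starts at offset 2, so the number
   ;; of data bytes is that value minus one
   lda fastload_sector_buffer+$101
   dec
   sta fl_bytes_to_copy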
   ;; Mark end of loading
   lda #$00
   sta fastload_request
fl_2nd_half_full_sector:
   ;; FALLTHROUGH
fl_dma_read_bytes:

   ;; Update destination address
   lda fastload_address+3
   asl
   asl
   asl
   asl
   sta fl_data_read_dmalist+2
   lda fastload_address+2
   lsr
   lsr
   lsr
   lsr
   ora fl_data_read_dmalist+2
   sta fl_data_read_dmalist+2
   lda fastload_address+2
   and #$0f
   sta fl_data_read_dmalist+12
   lda fastload_address+1
   sta fl_data_read_dmalist+11
   lda fastload_address+0
   sta fl_data_read_dmalist+10

   ;; Copy the block's data bytes from our buffer to the load address
   lda #$00
   sta $d704
   lda #>fl_data_read_dmalist
   sta $d701
   lda #<fl_data_read_dmalist
   sta $d705

   ;; Update load address
   lda fastload_address+0
   clc
   adc fl_bytes_to_copy
   sta fastload_address+0
   lda fastload_address+1
   adc #0
   sta fastload_address+1
   lda fastload_address+2
   adc #0
   sta fastload_address+2
   lda fastload_address+3
   adc #0
   sta fastload_address+3

   ;; Schedule reading of next block
   jsr fl_read_next_sector

   rts
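
If you want to check what the loader should end up with in memory, the same chain-following logic can be run against a disk image on a PC. The Python below is a sketch under two assumptions: the image uses the plain D81 layout (80 tracks of 40 blocks of 256 bytes, stored in order), and the last block follows the usual CBM DOS convention where the link's sector byte holds the offset of the last used byte. `block_offset` and `read_file` are names made up for the example.

```python
def block_offset(track: int, sector: int) -> int:
    """Byte offset of a logical 256-byte block in a D81 image."""
    return ((track - 1) * 40 + sector) * 256


def read_file(image: bytes, track: int, sector: int) -> bytes:
    """Follow the block chain from track/sector and return the file data."""
    data = bytearray()
    for _ in range(80 * 40 + 1):            # guard against a cyclic chain
        if track == 0:                      # end-of-file marker reached
            return bytes(data)
        off = block_offset(track, sector)
        block = image[off:off + 256]
        if len(block) != 256:
            raise ValueError("block lies outside the image")
        next_track, next_sector = block[0], block[1]
        if next_track == 0:
            # Last block: next_sector is the offset of the last used byte
            data += block[2:next_sector + 1]
        else:
            data += block[2:]               # full block: 254 data bytes
        track, sector = next_track, next_sector
    raise ValueError("block chain does not terminate")


# Tiny fake image: block 1/0 is full and links to 1/1, which holds 3 bytes
image = bytearray(4 * 256)
image[0:2] = bytes([1, 1])
image[2:256] = b"\x11" * 254
image[256:258] = bytes([0, 4])
image[258:261] = b"\x22" * 3
print(len(read_file(bytes(image), 1, 0)))
```

Comparing the output of something like this with the memory at the load address is a cheap way to verify the loader on a known disk image.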

We are now almost at the end. What we have here is the DMA lists for copying the read data to its final destination, as well as the routine and DMA list for copying a physical sector from the FDC's buffer down to fastload_sector_buffer. As previously noted, we can probably shrink the whole thing (and make it use less raster time) by avoiding that copy, if we instead fiddle the IO banking to make the floppy sector buffer map at $DE00-$DFFF (there is a special bit that enables this). But what we have here works, and isn't that much slower, as the DMA doesn't take very long.
fl_data_read_dmalist:
   !byte $0b    ; Use F018B format list
   !byte $81,$00    ; Destination MB
   !byte 0         ; no more options
   !byte 0            ; copy
fl_bytes_to_copy:
   !word 0               ; size of copy
fl_read_page_word:
fl_read_dma_page = fl_read_page_word + 1
   ;; +2 is to skip track/header link
   !word fastload_sector_buffer+2    ; Source address
   !byte $00        ; Source bank

   !word 0                 ; Dest address
   !byte $00             ; Dest bank

   !byte $00             ; sub-command
   !word 0                 ; modulo (unused)


fl_copy_sector_to_buffer:
   ;; Make sure FDC sector buffer is selected
   lda #$80
   trb $d689

   ;; Copy FDC data to our buffer
   lda #$00
   sta $d704
   lda #>fl_sector_read_dmalist
   sta $d701
   lda #<fl_sector_read_dmalist
   sta $d705
   rts

fl_sector_read_dmalist:
   !byte $0b    ; Use F018B format list
   !byte $80,$ff            ; MB of FDC sector buffer address ($FFD6C00)
   !byte 0         ; no more options
   !byte 0            ; copy
   !word 512        ; size of copy
   !word $6c00        ; low 16 bits of FDC sector buffer address
   !byte $0d        ; next 4 bits of FDC sector buffer address
   !word fastload_sector_buffer ; Dest address
   !byte $00             ; Dest bank
   !byte $00             ; sub-command
   !word 0                 ; modulo (unused)
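
Both DMA lists split a 28-bit address into three parts: a megabyte byte given as an option, a 4-bit bank, and a 16-bit address. The Python sketch below shows that split as it is used by the lists above. `split_address` is a hypothetical helper for illustration only.

```python
def split_address(addr: int):
    """Split a 28-bit address into (megabyte, bank, 16-bit address)."""
    if not 0 <= addr < (1 << 28):
        raise ValueError("address must fit in 28 bits")
    megabyte = (addr >> 20) & 0xFF   # goes in the $80 / $81 option byte
    bank = (addr >> 16) & 0x0F       # goes in the source / dest bank byte
    low16 = addr & 0xFFFF            # goes in the 16-bit address word
    return megabyte, bank, low16


for addr in (0x0040000, 0xFFD6C00):
    mb, bank, low16 = split_address(addr)
    print(f"${addr:07X} -> MB ${mb:02X}, bank ${bank:02X}, address ${low16:04X}")
```

The second address is the FDC sector buffer, and the split reproduces the $ff, $0d and $6c00 values in fl_sector_read_dmalist.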

And that's it. The loader really is quite simple, especially compared with a 1541 fast-loader. You can find the source in https://github.com/mega65/mega65-tools, just look for fastload-demo.asm.

Finally, a somewhat arbitrary screen-shot, because every blog post requires at least one, but it's kind of hard to show a fast-loader in action in a still image.


Sunday, 7 November 2021
