Monday 12 March 2018

Creating files on a FAT32 filesystem, and making it simpler to implement

The MEGA65 uses a VFAT32 file system as the native file system on the SD Card, both for convenience (pretty much any computer can read/write it), and also because it is a fairly simple on-disk format.  So far, however, the MEGA65 hypervisor DOS code can only read files, not create or write to them. This, naturally, is not a great situation, and thus this post describes progress towards being able to create files, and write to them.

One of the challenges is that the MEGA65 Hypervisor is only 16KB, including boot-up code, file system, freezing and task swapping, and a bunch of other miscellaneous other little things.  This means we have to be very careful about what we put in there.  Thus, for now at least, FAT file creation will also require the specification of the file length.  This means that we can have a call that allows writing data at any offset of a file, without having to worry about allocating space at that point in time.  It also means that we can fairly easily enforce that all files we create will have contiguous on-disk storage, which is important for D81 disk images, because the hardware support for disk images requires that they exist as a single contiguous slab on the disk, so that the hardware doesn't need to know anything about the FAT file system.

So, the first routine we need is one that can find a contiguous set of clusters that are free, and that are a candidate for hosting a file.  On the one hand, this is a very simple routine: All it has to do is to iterate through the cluster list, counting the number of consecutive empties, and resetting that count if it finds one already allocated, and continue until it reaches either the end of the partition, or finds the required number of consecutive unallocated clusters.  Of course, when you are manipulating 32-bit values on an 8-bit computer, everything ends up more complicated than you would like.  Here is my first go at implementing this (which I won't be able to easily test, until I have the rest of the routines in place):

dos_find_contiguous_free_space:
    ; Find a piece of free space in the specified file system
    ; (dos_default_disk), and return the first cluster number
    ; if it can be found.
    ;
    ; INPUT: dos_dirent_length = # of clusters required
    ; OUTPUT: dos_opendir_cluster = first cluster
    ; C flag set on success, clear on failure.
    ;
    ; FAT32 file systems have the expected first cluster free
    ; stored in the 2nd sector of the file system at offset
    ; $1EC (actually, this may point to the last allocated cluster).
    ; This field is only a suggestion however, to accelerate allocation,
    ; and thus should not be relied upon, but rather to allow quick
    ; skipping of the already allocated area of disk.
    ;
    ; The number of clusters we need to allocate is to be provided in
    ; dos_opendir_cluster as a 32-bit value, thus allowing for files
    ; upto 2^32 * cluster size bytes long to be created.
    ; (Note that in practice files are limited to 2GiB - 1 bytes for
    ; total compatibility, and 4GiB -1 for fairly decent compatibility,
    ; and 256GiB - 1 if we implement the FAT+ specification
    ; (http://www.fdos.org/kernel/fatplus.txt.1) in the appropriate places
    ; at some point, which we will likely do, as it is really very simple.
    ; Mostly we just have to use 40 bit offsets for file lengths.

    ; Let's start with a completely naive algorithm:
    ; 1. Begin at cluster 2. Reset # of contiguous clusters found to zero.
    ;    Remember this cluster as candidatee starting point.
    ; 2. If current cluster not free, advance to next cluster number, reset
    ;    contiguous free cluster count.  Remember the newly advanced cluster
    ;    number as candidate starting point.
    ; 3. If current cluster is free, increase contiguous free clusters found.
    ;    If equal to desired number, return candidate cluster number.
    ; 4. Repeat 2 and 3 above until end of file system is reached (and return
    ;    an error), or until step 3 has returned success.
    ;
    ; This algorithm can be made more efficient by using the last allocated
    ; cluster number field as an alternative starting point, and only if that
    ; fails, retrying beginning at cluster 2.

    ; So the state we need to keep track of is:
    ; dos_opendir_cluster = current cluster we are considering as candidate starting point.
    ; dos_file_loadaddress = current cluster being tested for vacancy.
    ; dos_dirent_length = number of clusters required, in total.
    ; dos_dirent_cluster = number of clusters required, from this point
    ; (i.e., the total minus the number we have already found contiguously free).
    ; current_disk[fs_fat32_cluster_count] = number of clusters in disk
    ; (and thus the end point of our search).
    ;
    ; Other than that, we just need to advance our way linearly through the FAT sectors.
    ; This is straight forward, as we can just read the first FAT sector, and then progress
    ; our way through, until we reach the end.

    ; First, make sure we can see the sector buffer at $DE00-$DFFF
   
JSR   sd_map_sectorbuffer

    ; 1. Start at cluster 2
   
LDA  #$02
    STA  dos_opendir_cluster+0
    STA  dos_file_loadaddress+0
    LDA  #$00
    STA  dos_opendir_cluster+1
    STA  dos_opendir_cluster+2
    STA  dos_opendir_cluster+3
    STA  dos_file_loadaddress+1
    STA  dos_file_loadaddress+2
    STA  dos_file_loadaddress+3

@tryNewStartingPointCandidate:

    ;    Reset # of clusters still required
   
LDX  #$03
@ll74:    LDA  dos_dirent_length, X
    STA  dos_dirent_cluster, X
    DEX 
    BPL  @ll74

@testIfClusterEmptyAfterReadingFATSector:
    ; Read the appropriate sector of the FAT
    ; To do this, we copy the target cluster number to
    ; dos_current_cluster, and call dos_cluster_to_fat_sector.
    ; This leaves the absolute sector number in
    ; dos_current_cluster.
   
LDX    #$03
@ll78:    LDA    dos_file_loadaddress, X
    STA    dos_current_cluster, X
    DEX
    BPL    @ll78
    JSR    dos_cluster_to_fat_sector
    ; Now we have the sector # in dos_current_cluster
    ; Copy to the SD card sector register, and do the read
   
LDX    #$03
@ll83:    LDA    dos_current_cluster, X
    STA    $D681, X
    DEX
    BPL    @ll83
    JSR    sd_fix_sectornumber
    ; Finally, do the read
   
LDA    #dos_errorcode_read_error
    STA    dos_error_code
    JSR    sd_readsector
    ; Fail on error
   
BCS       @ll93
    RTS
@ll93:

@testIfClusterEmpty:
    ; Here we have the sector read, and do the test on the contents of the cluster
    ; entry.

    ; But first, check that the cluster number is valid:
    ; 1. Get dos_disk_table_offset pointing correctly
   
LDX   dos_disk_current_disk
    JSR   dos_set_current_disk
    LDX   dos_disk_table_offset
    ; 2. Compare cluster count to current cluster
   
LDY   #$00
@ll128:    LDA   [dos_disk_table + fs_fat32_cluster_count + 0], X
    CMP   dos_file_loadaddress, Y
    BNE   @notLastClusterOfFileSystem
    INX
    INY
    CPY    #$04
    BNE    @ll128

    ; Return error due to lack of space
   
LDA    #dos_errorcode_no_space
    STA    dos_error_code
    CLC
    RTS

@notLastClusterOfFileSystem:

    ; The offset in the sector is computed from the bottom
    ; 7 bits of the cluster number * 4, to give an offset
    ; in the 512 byte sector. Once we have the offset, OR the
    ; four bytes of the cluster number together to test if = 0,
    ; and thus empty.
   
LDA      dos_file_loadaddress+0
    ASL
    ASL
    TAY
    LDA    dos_file_loadaddress+0
    AND    #$40
    BEQ    @lowMoby
    LDA    $DF00, Y
    ORA    $DF01, Y
    ORA    $DF02, Y
    ORA    $DF03, Y
    BRA    @ll120
@lowMoby:
    LDA    $DE00, Y
    ORA    $DE01, Y
    ORA    $DE02, Y
    ORA    $DE03, Y
@ll120:
    ; Remember result of free-test
   
TAX

    ; Increment next cluster number we will look at
   
LDA    dos_file_loadaddress+0
    CLC
    ADC    #$01
    STA    dos_file_loadaddress+0
    LDA    dos_file_loadaddress+1
    ADC    #$00
    LDA    dos_file_loadaddress+1
    LDA    dos_file_loadaddress+2
    ADC    #$00
    LDA    dos_file_loadaddress+2
    LDA    dos_file_loadaddress+3
    ADC    #$00
    LDA    dos_file_loadaddress+3

    ; If the cluster was not free, then reset search point
   
CPX       #$00
    BEQ     @thisClusterWasFree
    LDX     #$03   
@ll160:    LDA     dos_file_loadaddress, X
    STA     dos_opendir_cluster, X
    DEX
    BPL     @ll160   
    JMP     @tryNewStartingPointCandidate

@thisClusterWasFree:
    ; Decrement # of clusters still required
   
LDA    dos_dirent_cluster+0
    SEC
    SBC    #$01
    STA    dos_dirent_cluster+0
    LDA    dos_dirent_cluster+1
    SBC    #$00
    STA    dos_dirent_cluster+1
    LDA    dos_dirent_cluster+2
    SBC    #$00
    STA    dos_dirent_cluster+2
    LDA    dos_dirent_cluster+3
    SBC    #$00
    STA    dos_dirent_cluster+3
    ; Now see if zero
    ORA    dos_dirent_cluster+2
    ORA    dos_dirent_cluster+1
    ORA    dos_dirent_cluster+0
    BEQ    @foundFreeSpace
   
    ; Nope, we still need more, so continue the search

    ; Then check if this next cluster is unallocated?
    ; (If the cluster entry in the FAT will be in the same
    ;  sector as the last, then don't waste time recomputing
    ;  and reading the FAT sector number).
   
LDA    dos_file_loadaddress+0
    AND    #$7F
    BNE    @sameSector
    JMP     @testIfClusterEmptyAfterReadingFATSector
@sameSector:
    JMP    @testIfClusterEmpty

@foundFreeSpace:
    ; Found the requested space
   
SEC
    RTS


It isn't too hard to find the sections of code where there is considerable repetition due to the need to update 32-bit = 4 byte long values, using only 8-bit CPU operations.  This is both a nuisance from a code-size perspective (remember that the Hypervisor is only 16KB in total, and so every wasted byte is potentially important), and also a correctness/security perspective, because it is simply hard to make sure that such code is functionally correct.

If the criteria were only security and performance, we would of course just use a simple 32-bit processor, however, the whole point of the MEGA65 is that it is an 8-bit computer.  That said, the 6502 (and 4510) already both have a number of non-8-bit features: Most obviously is that the program counter is 16-bits.  But also the zero-page indirection instructions operate on 16-bit pointers, the stack can be 16-bit addressed on the 4510, the 4510 even has increment/decrement/shift word instructions, which really start to blur the lines on where the boundary between 8 and 16 bit processors lie.  I think my take on the situation is that while the registers are 8-bit, and the majority of operations are 8-bit, that there probably isn't a problem.

The question then really is whether it makes sense to add some simple 32-bit operations, for example increment and decrement, which are the most common kinds of operations, and which also are the most annoying to do now, because in reality you need to unroll them, or do very strange things with saving the processor flags between iterations of the loop, so that the carry flag is not corrupted.  These could be implemented using a prefix to the INW and DEW (increment and decrement word) instruction, similar to how the MEGA65's 32-bit zero-page indirect operations work.  So to increment a 32-bit value, we might end up using something like:

NOP
NOP
INW $address


That is, we can replace quite a lot of goo with just 5 bytes, if we allow INW and DEW to operate on 32-bits of data instead of 16. Unfortunately, INW and DEW (and the other 16-bit read-modify-write instructions) are all zero-page only, which is a bit limiting.  That they are zero-page only is a bit silly in retrospect, because these instructions already require 2 cycles for the instruction, 2 more to read the 16-bit value, and 2 more to write the result back = 6 cycles. For a seventh cycle, they would have been much more general purpose.  I could in theory make the instructions behave as absolute addressing, i.e., non-zero-page, when used with the double-NOP prefix, but they we might fall into the same trap as the 65816, where you have to know the CPU state, in order to be able to disassemble code.  This is not good on a variety of fronts.

To add to my frustration, the ROW and and ASW instructions, the ones that shift 16-bit values do use absolute addressing mode.  Even the PHW instructions take 16-bit addresses/values as operands, only not the INW/DEW instructions, that are the most useful in this situation.

So looking at our available resources, we have PHW #$nnnn, PHW $nnnn, ASW $nnnn and ROW $nnnn as the four instructions that we could potentially overload with a prefix to do something more useful, and INW/DEW that can also be logically extended, but with the annoying limitation on their source of operand.  There is some intrinsic value to having 32-bit ROW and ASW operations, to allow bit-shifting of wider values.  The stack operations are more vexed, because there are no matching 16-bit stack pop operations (assuming that we don't count RTS, although it technically does pop 16 bits from the stack, it is just that any instruction that puts its result into the program counter is perhaps not ideally suited to being bent towards arithmetic operations.


For the existing 32-bit flat addressing extensions, i.e., where the data remains 8-bits wide, but the address is 32-bits long, we use NOP + NOP as the prefix. Note that this is different to what we are talking about here, where the data is 32-bit, but the adress remains 16-bit.  We really don't want to confuse those two, and there are situations where it probably makes sense to use them both at the same time*, so we will need a different prefix. 

[ * The main problem with using them at the same time, is that the 32-bit addressing extensions use the indirect zero-page instructions like LDA ($12), Z -- which uses one of the registers in the address computation. This is fine for LDA, but not for ADC and friends, and definitely not for STA, where the value you are storing, including all the 8-bit register values as it does, would thus affect the address you are writing to.  To solve this for now, if you use STA in this mode, the value of the Z register is NOT added to the 32-bit address.  This whole scheme will need some further fine-tuning, but at least for now is minimally useful: It is possible in theory to load a 32-bit value from, or write one to, a 32-bit flat address. ]


After having stared at the instruction set many times before, and not really seen many good candidates for prefix instructions, since, after all, the whole point of instructions that are not called NOP is to do something, it occurred to me that the NEG instruction would in fact be a pretty good one.  Yes, it will mess up the Z and N flags -- but that's all it will do if you call it twice, since it negates, i.e., inverts the sign of the value of the accumulator.  Thus, calling it twice puts the value of the accumulator back to what it was to begin with, and potentially even restores the values of the Z and N flags.  Anyway, for lack of a better alternative, this is what we will use.

Anyway, by using the NEG + NEG prefix on any LDA, STA, ADC, SBC, EOR, AND, ORA, INC, DEC, ASL, ROL, ROR or LSR instruction, we can make it instead do a 32-bit read (and or write).  The simplest way to implement this, and which is what I have done, is to duplicate the operand read and/or write after the address has been computed.  In this way, you can still use all the usual addressing modes (although the indexed ones will require some care, because many of these 32-bit operations will of course modify the contents of the index registers). This leaves the question of where do we find a 32-bit register in the 6502 architecture that we can use.  That, and related questions we will soon return to. But first, lets get back to the routine for finding where to put a newly created file, there are the following 32-bit operations:

2x write 32-bit value to memory
4x copy 32-bit value to another location
1x compare two 32-bit values for equality
3x test 32-bit value if equal to zero
1x increment 32-bit value
1x decrement 32-bit value

Of these, the increments, decrements and zero-tests could be directly accelerated, accounting for 5 of the 13 operations.  Helpful, but not ideal. 

What simple changes can we make that would give a much better result? One of the requirements for "simple" in my mind, is that they should not result in the creation of more registers or other CPU state that needs to be saved/restored during a context switch or Hypervisor trap.  We do have four general purpose(ish) registers in the 4510, however: A, X, Y and Z  - just enough for holding 32-bits.  Thus we could have a prefix on LDA / STA / ADC / SBC / CMP that cause A, X, Y and Z to all be used together, so for example:

NEG
NEG
LDA $1234
CLC
NEG
NEG
ADC $1238
NEG
NEG
STA $123C

could be used to read the 32-bit value from $1234-$1237 into A, X, Y and Z, and then add to this the 32-bits of data stored at $1238-$123B, and finally write the results to the four bytes at $123C-$123F.  The same routine in normal 6502 would be:

LDA $1234
CLC
ADC $1238
STA $123C
LDA $1235
ADC $1239
STA $123D
LDA $1236
ADC $123A
STA $123E
LDA $1237
ADC $123B
STA $123F

Not only does this require 37 bytes and 50 cycles on a regular 6502, it is also prone to copy-paste errors (hopefully I didn't make any here).  It is possible to slightly optimise this, e.g.:

LDX #$00
CLC
PHP
loop:
PLP
LDA $1234,X
ADC $1238,X
STA $123C,X
PHP
INX
CPX #$04
BNE loop
PLP

This is shorter than the unrolled version, requiring only 21 bytes, however the program flow is rather more complicated to understand (and requires a byte off stack space).  Also, it requires 99 cycles, making it twice as slow as the unrolled version.

If we implement our CPU extension to allow our nice clear 15 byte version, it should take only 30 cycles* on the MEGA65, making it superior on all three fronts: size, speed and simplicity. 

* NOP and CLC take 1 cycle on the MEGA65, and load instructions have a one cycle increased cost, compared to the 6502.  We also have to allow four cycles for reading/writing the 32-bit values in this mode, so my current estimation of the cycle timing would be:

= 1 (NEG)
+ 1 (NEG)
+ 8 (LDA, $34, $12, wait state, read $1234, read $1235, read $1236, read $1237)
+ 1 (CLC)
= 1 (NEG)
+ 1 (NeG)
+ 8 (ADC, $38, $12, wait state, read $1238, read $1239, read $123A, read $123B)
+ 1 (NEG)
+ 1 (NEG)
+ 7 (STA, $3C, $12, write $123C, write $123D, write $123E, write $123F)

= 30 cycles.

Also, it stands to reason that there are lots of other situations where being able to read multiple bytes into the registers, or write all the registers out at the same time will probably be useful. Anyway, back to our routine, let's look at how it would look if we had these extensions:

    ; First, make sure we can see the sector buffer at $DE00-$DFFF
   
JSR   sd_map_sectorbuffer

    ; 1. Start at cluster = $00000002

    ; (Use new extensions to store AXYZ with the values)
   
LDA  #$00

    TAX
    TAY
    TAZ
    LDA  #$02
    NEG
    NEG
    STA  dos_opendir_cluster

    NEG
    NEG
    STA  dos_file_loadaddress
 
@tryNewStartingPointCandidate:

    ;    Reset # of clusters still required
   
NEG

    NEG
    LDA dos_dirent_length
    NEG
    NEG
    STA dos_dirent_cluster

@testIfClusterEmptyAfterReadingFATSector:
    ; Read the appropriate sector of the FAT
    ; To do this, we copy the target cluster number to
    ; dos_current_cluster, and call dos_cluster_to_fat_sector.
    ; This leaves the absolute sector number in
    ; dos_current_cluster.
   
NEG

    NEG
    LDA dos_file_loadaddress
    NEG
    NEG
    STA dos_current_cluster 
    JSR    dos_cluster_to_fat_sector
    ; Now we have the sector # in dos_current_cluster
    ; Copy to the SD card sector register, and do the read
   
NEG

    NEG
    LDA dos_current_cluster
    NEG
    NEG
    STA $D681
    JSR    sd_fix_sectornumber
    ; Finally, do the read
   
LDA    #dos_errorcode_read_error
    STA    dos_error_code
    JSR    sd_readsector
    ; Fail on error
   
BCS       @ll93
    RTS
@ll93:

@testIfClusterEmpty:
    ; Here we have the sector read, and do the test on the contents of the cluster
    ; entry.

    ; But first, check that the cluster number is valid:
    ; 1. Get dos_disk_table_offset pointing correctly
   
LDX    dos_disk_current_disk
    JSR    dos_set_current_disk
    LDX    dos_disk_table_offset
    ; 2. Compare cluster count to current cluster
    ; (Note that here we can use the previous value of X still as an index,
    ;  because it doesn't get overwritten by the 32-bit load until after the 
    ;  address has been computed. Of course, this doesn't work for a store
    ;  operation, because in that case the index registers are forming part of
    ;  the 32-bit value to be written).
    NEG
    NEG
    LDA    [dos_disk_table + fs_fat32_cluster_count + 0], X

    NEG
    NEG
    CMP    dos_file_loadaddress

    BNE    @notLastClusterOfFileSystem

    ; Return error due to lack of space
   
LDA    #dos_errorcode_no_space
    STA    dos_error_code
    CLC
    RTS

@notLastClusterOfFileSystem:

    ; The offset in the sector is computed from the bottom
    ; 7 bits of the cluster number * 4, to give an offset
    ; in the 512 byte sector. Once we have the offset, OR the
    ; four bytes of the cluster number together to test if = 0,
    ; and thus empty.
   
LDA    dos_file_loadaddress+0
    ASL
    ASL
    TAY
    LDA    dos_file_loadaddress+0
    AND    #$40
    BEQ    @lowMoby

    NEG
    NEG
    BIT    $DF00, Y
    BRA @ll120
@lowMoby:
    NEG
    NEG
    BIT    $DE00, Y
@ll120:
    ; Remember result of free-test
   
PHP

    ; Increment next cluster number we will look at
   
NEG

    NEG
    INC    dos_file_loadaddress

    ; If the cluster was not free, then reset search point
   
PLP
    BEQ    @thisClusterWasFree



    NEG
    NEG
    LDA    dos_file_loadaddress
    NEG
    NEG
    STA    dos_opendir_cluster
 

    JMP    @tryNewStartingPointCandidate

@thisClusterWasFree:
    ; Decrement # of clusters still required
   
NEG

    NEG
    DEC    dos_dirent_cluster
    ; Now see if zero
    BEQ    @foundFreeSpace
   
    ; Nope, we still need more, so continue the search

    ; Then check if this next cluster is unallocated?
    ; (If the cluster entry in the FAT will be in the same
    ;  sector as the last, then don't waste time recomputing
    ;  and reading the FAT sector number).
   
LDA    dos_file_loadaddress+0
    AND    #$7F
    BNE    @sameSector
    JMP     @testIfClusterEmptyAfterReadingFATSector
@sameSector:
    JMP    @testIfClusterEmpty

@foundFreeSpace:
    ; Found the requested space
   
SEC
    RTS


That's quite a bit shorter and simpler, and would be even simpler again if we were to teach the assembler about the new extensions, and have LDA32, STA32, ADC32, INC32 etc generate the NEG, NEG prefix:

    ; First, make sure we can see the sector buffer at $DE00-$DFFF
   
JSR   sd_map_sectorbuffer

    ; 1. Start at cluster = $00000002

    ; (Use new extensions to store AXYZ with the values)
   
LDA  #$00

    TAX
    TAY
    TAZ
    LDA  #$02
    STA32  dos_opendir_cluster
    STA32  dos_file_loadaddress
 
@tryNewStartingPointCandidate:

    ;    Reset # of clusters still required
   
LDA32  dos_dirent_length

    STA32  dos_dirent_cluster

@testIfClusterEmptyAfterReadingFATSector:
    ; Read the appropriate sector of the FAT
    ; To do this, we copy the target cluster number to
    ; dos_current_cluster, and call dos_cluster_to_fat_sector.
    ; This leaves the absolute sector number in
    ; dos_current_cluster.
   
LDA32 dos_file_loadaddress

    STA32 dos_current_cluster 
    JSR    dos_cluster_to_fat_sector
    ; Now we have the sector # in dos_current_cluster
    ; Copy to the SD card sector register, and do the read
   
LDA32  dos_current_cluster

    STA32  $D681
    JSR    sd_fix_sectornumber
    ; Finally, do the read
   
LDA    #dos_errorcode_read_error
    STA    dos_error_code
    JSR    sd_readsector
    ; Fail on error
   
BCS       @ll93
    RTS
@ll93:

@testIfClusterEmpty:
    ; Here we have the sector read, and do the test on the contents of the cluster
    ; entry.

    ; But first, check that the cluster number is valid:
    ; 1. Get dos_disk_table_offset pointing correctly
   
LDX    dos_disk_current_disk
    JSR    dos_set_current_disk
    LDX    dos_disk_table_offset
    ; 2. Compare cluster count to current cluster

    ; (Note that here we can use the previous value of X still as an index,
    ;  because it doesn't get overwritten by the 32-bit load until after the 
    ;  address has been computed. Of course, this doesn't work for a store
    ;  operation, because in that case the index registers are forming part of
    ;  the 32-bit value to be written).
   
LDA32  [dos_disk_table + fs_fat32_cluster_count + 0], X

    CMP32  dos_file_loadaddress
    BNE    @notLastClusterOfFileSystem

    ; Return error due to lack of space
   
LDA    #dos_errorcode_no_space
    STA    dos_error_code
    CLC
    RTS

@notLastClusterOfFileSystem:

    ; The offset in the sector is computed from the bottom
    ; 7 bits of the cluster number * 4, to give an offset
    ; in the 512 byte sector. Once we have the offset, OR the
    ; four bytes of the cluster number together to test if = 0,
    ; and thus empty.
   
LDA    dos_file_loadaddress+0
    ASL
    ASL
    TAY
    LDA    dos_file_loadaddress+0
    AND    #$40
    BEQ    @lowMoby

    BIT32    $DF00, Y
    BRA @ll120
@lowMoby:
    BIT32    $DE00, Y
@ll120:
    ; Remember result of free-test
   
PHP

    ; Increment next cluster number we will look at
 
  INC32    dos_file_loadaddress


    ; If the cluster was not free, then reset search point
   
PLP
    BEQ    @thisClusterWasFree



    LDA32    dos_file_loadaddress
    STA32    dos_opendir_cluster

    JMP    @tryNewStartingPointCandidate

@thisClusterWasFree:
    ; Decrement # of clusters still required    DEC32    dos_dirent_cluster
    ; Now see if zero
    BEQ    @foundFreeSpace
   
    ; Nope, we still need more, so continue the search

    ; Then check if this next cluster is unallocated?
    ; (If the cluster entry in the FAT will be in the same
    ;  sector as the last, then don't waste time recomputing
    ;  and reading the FAT sector number).
   
LDA    dos_file_loadaddress+0
    AND    #$7F
    BNE    @sameSector
    JMP    @testIfClusterEmptyAfterReadingFATSector
@sameSector:
    JMP    @testIfClusterEmpty

@foundFreeSpace:
    ; Found the requested space
   
SEC
    RTS


The end result is that the routine can be reduced to 162 bytes instead of 215 (assuming I have counted correctly).  This is really quite a nice improvement -- especially since as we have mentioned, the reduction in size comes with a reduction in execution time and complexity. In short, it will be smaller, faster, simpler -- exactly what we want.

The next step is to test all this, and make sure that it actually works -- both the routine, and the CPU extensions.