The MEGA65 uses a VFAT32 file system as the native file system on the SD Card, both for convenience (pretty much any computer can read/write it), and also because it is a fairly simple on-disk format. So far, however, the MEGA65 hypervisor DOS code can only read files, not create or write to them. This, naturally, is not a great situation, and thus this post describes progress towards being able to create files, and write to them.
One of the challenges is that the MEGA65 Hypervisor is only 16KB, including boot-up code, file system, freezing and task swapping, and a bunch of other miscellaneous other little things. This means we have to be very careful about what we put in there. Thus, for now at least, FAT file creation will also require the specification of the file length. This means that we can have a call that allows writing data at any offset of a file, without having to worry about allocating space at that point in time. It also means that we can fairly easily enforce that all files we create will have contiguous on-disk storage, which is important for D81 disk images, because the hardware support for disk images requires that they exist as a single contiguous slab on the disk, so that the hardware doesn't need to know anything about the FAT file system.
So, the first routine we need is one that can find a contiguous set of clusters that are free, and that are a candidate for hosting a file. On the one hand, this is a very simple routine: All it has to do is to iterate through the cluster list, counting the number of consecutive empties, and resetting that count if it finds one already allocated, and continue until it reaches either the end of the partition, or finds the required number of consecutive unallocated clusters. Of course, when you are manipulating 32-bit values on an 8-bit computer, everything ends up more complicated than you would like. Here is my first go at implementing this (which I won't be able to easily test, until I have the rest of the routines in place):
dos_find_contiguous_free_space:
; Find a piece of free space in the specified file system
; (dos_default_disk), and return the first cluster number
; if it can be found.
;
; INPUT: dos_dirent_length = # of clusters required
; OUTPUT: dos_opendir_cluster = first cluster
; C flag set on success, clear on failure.
;
; FAT32 file systems have the expected first cluster free
; stored in the 2nd sector of the file system at offset
; $1EC (actually, this may point to the last allocated cluster).
; This field is only a suggestion however, to accelerate allocation,
; and thus should not be relied upon, but rather to allow quick
; skipping of the already allocated area of disk.
;
; The number of clusters we need to allocate is to be provided in
; dos_opendir_cluster as a 32-bit value, thus allowing for files
; upto 2^32 * cluster size bytes long to be created.
; (Note that in practice files are limited to 2GiB - 1 bytes for
; total compatibility, and 4GiB -1 for fairly decent compatibility,
; and 256GiB - 1 if we implement the FAT+ specification
; (http://www.fdos.org/kernel/fatplus.txt.1) in the appropriate places
; at some point, which we will likely do, as it is really very simple.
; Mostly we just have to use 40 bit offsets for file lengths.
; Let's start with a completely naive algorithm:
; 1. Begin at cluster 2. Reset # of contiguous clusters found to zero.
; Remember this cluster as candidatee starting point.
; 2. If current cluster not free, advance to next cluster number, reset
; contiguous free cluster count. Remember the newly advanced cluster
; number as candidate starting point.
; 3. If current cluster is free, increase contiguous free clusters found.
; If equal to desired number, return candidate cluster number.
; 4. Repeat 2 and 3 above until end of file system is reached (and return
; an error), or until step 3 has returned success.
;
; This algorithm can be made more efficient by using the last allocated
; cluster number field as an alternative starting point, and only if that
; fails, retrying beginning at cluster 2.
; So the state we need to keep track of is:
; dos_opendir_cluster = current cluster we are considering as candidate starting point.
; dos_file_loadaddress = current cluster being tested for vacancy.
; dos_dirent_length = number of clusters required, in total.
; dos_dirent_cluster = number of clusters required, from this point
; (i.e., the total minus the number we have already found contiguously free).
; current_disk[fs_fat32_cluster_count] = number of clusters in disk
; (and thus the end point of our search).
;
; Other than that, we just need to advance our way linearly through the FAT sectors.
; This is straight forward, as we can just read the first FAT sector, and then progress
; our way through, until we reach the end.
; First, make sure we can see the sector buffer at $DE00-$DFFF
JSR sd_map_sectorbuffer
; 1. Start at cluster 2
LDA #$02
STA dos_opendir_cluster+0
STA dos_file_loadaddress+0
LDA #$00
STA dos_opendir_cluster+1
STA dos_opendir_cluster+2
STA dos_opendir_cluster+3
STA dos_file_loadaddress+1
STA dos_file_loadaddress+2
STA dos_file_loadaddress+3
@tryNewStartingPointCandidate:
; Reset # of clusters still required
LDX #$03
@ll74: LDA dos_dirent_length, X
STA dos_dirent_cluster, X
DEX
BPL @ll74
@testIfClusterEmptyAfterReadingFATSector:
; Read the appropriate sector of the FAT
; To do this, we copy the target cluster number to
; dos_current_cluster, and call dos_cluster_to_fat_sector.
; This leaves the absolute sector number in
; dos_current_cluster.
LDX #$03
@ll78: LDA dos_file_loadaddress, X
STA dos_current_cluster, X
DEX
BPL @ll78
JSR dos_cluster_to_fat_sector
; Now we have the sector # in dos_current_cluster
; Copy to the SD card sector register, and do the read
LDX #$03
@ll83: LDA dos_current_cluster, X
STA $D681, X
DEX
BPL @ll83
JSR sd_fix_sectornumber
; Finally, do the read
LDA #dos_errorcode_read_error
STA dos_error_code
JSR sd_readsector
; Fail on error
BCS @ll93
RTS
@ll93:
@testIfClusterEmpty:
; Here we have the sector read, and do the test on the contents of the cluster
; entry.
; But first, check that the cluster number is valid:
; 1. Get dos_disk_table_offset pointing correctly
LDX dos_disk_current_disk
JSR dos_set_current_disk
LDX dos_disk_table_offset
; 2. Compare cluster count to current cluster
LDY #$00
@ll128: LDA [dos_disk_table + fs_fat32_cluster_count + 0], X
CMP dos_file_loadaddress, Y
BNE @notLastClusterOfFileSystem
INX
INY
CPY #$04
BNE @ll128
; Return error due to lack of space
LDA #dos_errorcode_no_space
STA dos_error_code
CLC
RTS
@notLastClusterOfFileSystem:
; The offset in the sector is computed from the bottom
; 7 bits of the cluster number * 4, to give an offset
; in the 512 byte sector. Once we have the offset, OR the
; four bytes of the cluster number together to test if = 0,
; and thus empty.
LDA dos_file_loadaddress+0
ASL
ASL
TAY
LDA dos_file_loadaddress+0
AND #$40
BEQ @lowMoby
LDA $DF00, Y
ORA $DF01, Y
ORA $DF02, Y
ORA $DF03, Y
BRA @ll120
@lowMoby:
LDA $DE00, Y
ORA $DE01, Y
ORA $DE02, Y
ORA $DE03, Y
@ll120:
; Remember result of free-test
TAX
; Increment next cluster number we will look at
LDA dos_file_loadaddress+0
CLC
ADC #$01
STA dos_file_loadaddress+0
LDA dos_file_loadaddress+1
ADC #$00
LDA dos_file_loadaddress+1
LDA dos_file_loadaddress+2
ADC #$00
LDA dos_file_loadaddress+2
LDA dos_file_loadaddress+3
ADC #$00
LDA dos_file_loadaddress+3
; If the cluster was not free, then reset search point
CPX #$00
BEQ @thisClusterWasFree
LDX #$03
@ll160: LDA dos_file_loadaddress, X
STA dos_opendir_cluster, X
DEX
BPL @ll160
JMP @tryNewStartingPointCandidate
@thisClusterWasFree:
; Decrement # of clusters still required
LDA dos_dirent_cluster+0
SEC
SBC #$01
STA dos_dirent_cluster+0
LDA dos_dirent_cluster+1
SBC #$00
STA dos_dirent_cluster+1
LDA dos_dirent_cluster+2
SBC #$00
STA dos_dirent_cluster+2
LDA dos_dirent_cluster+3
SBC #$00
STA dos_dirent_cluster+3
; Now see if zero
ORA dos_dirent_cluster+2
ORA dos_dirent_cluster+1
ORA dos_dirent_cluster+0
BEQ @foundFreeSpace
; Nope, we still need more, so continue the search
; Then check if this next cluster is unallocated?
; (If the cluster entry in the FAT will be in the same
; sector as the last, then don't waste time recomputing
; and reading the FAT sector number).
LDA dos_file_loadaddress+0
AND #$7F
BNE @sameSector
JMP @testIfClusterEmptyAfterReadingFATSector
@sameSector:
JMP @testIfClusterEmpty
@foundFreeSpace:
; Found the requested space
SEC
RTS
It isn't too hard to find the sections of code where there is considerable repetition due to the need to update 32-bit = 4 byte long values, using only 8-bit CPU operations. This is both a nuisance from a code-size perspective (remember that the Hypervisor is only 16KB in total, and so every wasted byte is potentially important), and also a correctness/security perspective, because it is simply hard to make sure that such code is functionally correct.
If the criteria were only security and performance, we would of course just use a simple 32-bit processor, however, the whole point of the MEGA65 is that it is an 8-bit computer. That said, the 6502 (and 4510) already both have a number of non-8-bit features: Most obviously is that the program counter is 16-bits. But also the zero-page indirection instructions operate on 16-bit pointers, the stack can be 16-bit addressed on the 4510, the 4510 even has increment/decrement/shift word instructions, which really start to blur the lines on where the boundary between 8 and 16 bit processors lie. I think my take on the situation is that while the registers are 8-bit, and the majority of operations are 8-bit, that there probably isn't a problem.
The question then really is whether it makes sense to add some simple 32-bit operations, for example increment and decrement, which are the most common kinds of operations, and which also are the most annoying to do now, because in reality you need to unroll them, or do very strange things with saving the processor flags between iterations of the loop, so that the carry flag is not corrupted. These could be implemented using a prefix to the INW and DEW (increment and decrement word) instruction, similar to how the MEGA65's 32-bit zero-page indirect operations work. So to increment a 32-bit value, we might end up using something like:
NOP
NOP
INW $address
That is, we can replace quite a lot of goo with just 5 bytes, if we allow INW and DEW to operate on 32-bits of data instead of 16. Unfortunately, INW and DEW (and the other 16-bit read-modify-write instructions) are all zero-page only, which is a bit limiting. That they are zero-page only is a bit silly in retrospect, because these instructions already require 2 cycles for the instruction, 2 more to read the 16-bit value, and 2 more to write the result back = 6 cycles. For a seventh cycle, they would have been much more general purpose. I could in theory make the instructions behave as absolute addressing, i.e., non-zero-page, when used with the double-NOP prefix, but they we might fall into the same trap as the 65816, where you have to know the CPU state, in order to be able to disassemble code. This is not good on a variety of fronts.
To add to my frustration, the ROW and and ASW instructions, the ones that shift 16-bit values do use absolute addressing mode. Even the PHW instructions take 16-bit addresses/values as operands, only not the INW/DEW instructions, that are the most useful in this situation.
So looking at our available resources, we have PHW #$nnnn, PHW $nnnn, ASW $nnnn and ROW $nnnn as the four instructions that we could potentially overload with a prefix to do something more useful, and INW/DEW that can also be logically extended, but with the annoying limitation on their source of operand. There is some intrinsic value to having 32-bit ROW and ASW operations, to allow bit-shifting of wider values. The stack operations are more vexed, because there are no matching 16-bit stack pop operations (assuming that we don't count RTS, although it technically does pop 16 bits from the stack, it is just that any instruction that puts its result into the program counter is perhaps not ideally suited to being bent towards arithmetic operations.
For the existing 32-bit flat addressing extensions, i.e., where the data remains 8-bits wide, but the address is 32-bits long, we use NOP + NOP as the prefix. Note that this is different to what we are talking about here, where the data is 32-bit, but the adress remains 16-bit. We really don't want to confuse those two, and there are situations where it probably makes sense to use them both at the same time*, so we will need a different prefix.
[ * The main problem with using them at the same time, is that the 32-bit addressing extensions use the indirect zero-page instructions like LDA ($12), Z -- which uses one of the registers in the address computation. This is fine for LDA, but not for ADC and friends, and definitely not for STA, where the value you are storing, including all the 8-bit register values as it does, would thus affect the address you are writing to. To solve this for now, if you use STA in this mode, the value of the Z register is NOT added to the 32-bit address. This whole scheme will need some further fine-tuning, but at least for now is minimally useful: It is possible in theory to load a 32-bit value from, or write one to, a 32-bit flat address. ]
After having stared at the instruction set many times before, and not
really seen many good candidates for prefix instructions, since, after
all, the whole point of instructions that are not called NOP is to do
something, it occurred to me that the NEG instruction would in fact be a
pretty good one. Yes, it will mess up the Z and N flags -- but that's
all it will do if you call it twice, since it negates, i.e., inverts the
sign of the value of the accumulator. Thus, calling it twice puts the
value of the accumulator back to what it was to begin with, and
potentially even restores the values of the Z and N flags. Anyway, for
lack of a better alternative, this is what we will use.
Anyway, by using the NEG + NEG prefix on any LDA, STA, ADC, SBC, EOR, AND, ORA, INC, DEC, ASL, ROL, ROR or LSR instruction, we can make it instead do a 32-bit read (and or write). The simplest way to implement this, and which is what I have done, is to duplicate the operand read and/or write after the address has been computed. In this way, you can still use all the usual addressing modes (although the indexed ones will require some care, because many of these 32-bit operations will of course modify the contents of the index registers). This leaves the question of where do we find a 32-bit register in the 6502 architecture that we can use. That, and related questions we will soon return to. But first, lets get back to the routine for finding where to put a newly created file, there are the following 32-bit operations:
2x write 32-bit value to memory
4x copy 32-bit value to another location
1x compare two 32-bit values for equality
3x test 32-bit value if equal to zero
1x increment 32-bit value
1x decrement 32-bit value
Of these, the increments, decrements and zero-tests could be directly accelerated, accounting for 5 of the 13 operations. Helpful, but not ideal.
What simple changes can we make that would give a much better result? One of the requirements for "simple" in my mind, is that they should not result in the creation of more registers or other CPU state that needs to be saved/restored during a context switch or Hypervisor trap. We do have four general purpose(ish) registers in the 4510, however: A, X, Y and Z - just enough for holding 32-bits. Thus we could have a prefix on LDA / STA / ADC / SBC / CMP that cause A, X, Y and Z to all be used together, so for example:
NEG
NEG
LDA $1234
CLC
NEG
NEG
ADC $1238
NEG
NEG
STA $123C
could be used to read the 32-bit value from $1234-$1237 into A, X, Y and Z, and then add to this the 32-bits of data stored at $1238-$123B, and finally write the results to the four bytes at $123C-$123F. The same routine in normal 6502 would be:
LDA $1234
CLC
ADC $1238
STA $123C
LDA $1235
ADC $1239
STA $123D
LDA $1236
ADC $123A
STA $123E
LDA $1237
ADC $123B
STA $123F
Not only does this require 37 bytes and 50 cycles on a regular 6502, it is also prone to copy-paste errors (hopefully I didn't make any here). It is possible to slightly optimise this, e.g.:
LDX #$00
CLC
PHP
loop:
PLP
LDA $1234,X
ADC $1238,X
STA $123C,X
PHP
INX
CPX #$04
BNE loop
PLP
This is shorter than the unrolled version, requiring only 21 bytes, however the program flow is rather more complicated to understand (and requires a byte off stack space). Also, it requires 99 cycles, making it twice as slow as the unrolled version.
If we implement our CPU extension to allow our nice clear 15 byte version, it should take only 30 cycles* on the MEGA65, making it superior on all three fronts: size, speed and simplicity.
* NOP and CLC take 1 cycle on the MEGA65, and load instructions have a one cycle increased cost, compared to the 6502. We also have to allow four cycles for reading/writing the 32-bit values in this mode, so my current estimation of the cycle timing would be:
= 1 (NEG)
+ 1 (NEG)
+ 8 (LDA, $34, $12, wait state, read $1234, read $1235, read $1236, read $1237)
+ 1 (CLC)
= 1 (NEG)
+ 1 (NeG)
+ 8 (ADC, $38, $12, wait state, read $1238, read $1239, read $123A, read $123B)
+ 1 (NEG)
+ 1 (NEG)
+ 7 (STA, $3C, $12, write $123C, write $123D, write $123E, write $123F)
= 30 cycles.
Also, it stands to reason that there are lots of other situations where being able to read multiple bytes into the registers, or write all the registers out at the same time will probably be useful. Anyway, back to our routine, let's look at how it would look if we had these extensions:
; First, make sure we can see the sector buffer at $DE00-$DFFF
JSR sd_map_sectorbuffer
; 1. Start at cluster = $00000002
; (Use new extensions to store AXYZ with the values)
LDA #$00
TAX
TAY
TAZ
LDA #$02
NEG
NEG
STA dos_opendir_cluster
NEG
NEG
STA dos_file_loadaddress
@tryNewStartingPointCandidate:
; Reset # of clusters still required
NEG
NEG
LDA dos_dirent_length
NEG
NEG
STA dos_dirent_cluster
@testIfClusterEmptyAfterReadingFATSector:
; Read the appropriate sector of the FAT
; To do this, we copy the target cluster number to
; dos_current_cluster, and call dos_cluster_to_fat_sector.
; This leaves the absolute sector number in
; dos_current_cluster.
NEG
NEG
LDA dos_file_loadaddress
NEG
NEG
STA dos_current_cluster
JSR dos_cluster_to_fat_sector
; Now we have the sector # in dos_current_cluster
; Copy to the SD card sector register, and do the read
NEG
NEG
LDA dos_current_cluster
NEG
NEG
STA $D681
JSR sd_fix_sectornumber
; Finally, do the read
LDA #dos_errorcode_read_error
STA dos_error_code
JSR sd_readsector
; Fail on error
BCS @ll93
RTS
@ll93:
@testIfClusterEmpty:
; Here we have the sector read, and do the test on the contents of the cluster
; entry.
; But first, check that the cluster number is valid:
; 1. Get dos_disk_table_offset pointing correctly
LDX dos_disk_current_disk
JSR dos_set_current_disk
LDX dos_disk_table_offset
; 2. Compare cluster count to current cluster ; (Note that here we can use the previous value of X still as an index,
; because it doesn't get overwritten by the 32-bit load until after the
; address has been computed. Of course, this doesn't work for a store
; operation, because in that case the index registers are forming part of
; the 32-bit value to be written).
NEG
NEG
LDA [dos_disk_table + fs_fat32_cluster_count + 0], X
NEG
NEG
CMP dos_file_loadaddress
BNE @notLastClusterOfFileSystem
; Return error due to lack of space
LDA #dos_errorcode_no_space
STA dos_error_code
CLC
RTS
@notLastClusterOfFileSystem:
; The offset in the sector is computed from the bottom
; 7 bits of the cluster number * 4, to give an offset
; in the 512 byte sector. Once we have the offset, OR the
; four bytes of the cluster number together to test if = 0,
; and thus empty.
LDA dos_file_loadaddress+0
ASL
ASL
TAY
LDA dos_file_loadaddress+0
AND #$40
BEQ @lowMoby
NEG
NEG
BIT $DF00, Y
BRA @ll120
@lowMoby:
NEG
NEG
BIT $DE00, Y
@ll120:
; Remember result of free-test
PHP
; Increment next cluster number we will look at
NEG
NEG
INC dos_file_loadaddress
; If the cluster was not free, then reset search point
PLP
BEQ @thisClusterWasFree
NEG
NEG
LDA dos_file_loadaddress
NEG
NEG
STA dos_opendir_cluster
JMP @tryNewStartingPointCandidate
@thisClusterWasFree:
; Decrement # of clusters still required
NEG
NEG
DEC dos_dirent_cluster
; Now see if zero
BEQ @foundFreeSpace
; Nope, we still need more, so continue the search
; Then check if this next cluster is unallocated?
; (If the cluster entry in the FAT will be in the same
; sector as the last, then don't waste time recomputing
; and reading the FAT sector number).
LDA dos_file_loadaddress+0
AND #$7F
BNE @sameSector
JMP @testIfClusterEmptyAfterReadingFATSector
@sameSector:
JMP @testIfClusterEmpty
@foundFreeSpace:
; Found the requested space
SEC
RTS
That's quite a bit shorter and simpler, and would be even simpler again if we were to teach the assembler about the new extensions, and have LDA32, STA32, ADC32, INC32 etc generate the NEG, NEG prefix:
; First, make sure we can see the sector buffer at $DE00-$DFFF
JSR sd_map_sectorbuffer
; 1. Start at cluster = $00000002
; (Use new extensions to store AXYZ with the values)
LDA #$00
TAX
TAY
TAZ
LDA #$02
STA32 dos_opendir_cluster
STA32 dos_file_loadaddress
@tryNewStartingPointCandidate:
; Reset # of clusters still required
LDA32 dos_dirent_length
STA32 dos_dirent_cluster
@testIfClusterEmptyAfterReadingFATSector:
; Read the appropriate sector of the FAT
; To do this, we copy the target cluster number to
; dos_current_cluster, and call dos_cluster_to_fat_sector.
; This leaves the absolute sector number in
; dos_current_cluster.
LDA32 dos_file_loadaddress
STA32 dos_current_cluster
JSR dos_cluster_to_fat_sector
; Now we have the sector # in dos_current_cluster
; Copy to the SD card sector register, and do the read
LDA32 dos_current_cluster
STA32 $D681
JSR sd_fix_sectornumber
; Finally, do the read
LDA #dos_errorcode_read_error
STA dos_error_code
JSR sd_readsector
; Fail on error
BCS @ll93
RTS
@ll93:
@testIfClusterEmpty:
; Here we have the sector read, and do the test on the contents of the cluster
; entry.
; But first, check that the cluster number is valid:
; 1. Get dos_disk_table_offset pointing correctly
LDX dos_disk_current_disk
JSR dos_set_current_disk
LDX dos_disk_table_offset
; 2. Compare cluster count to current cluster
; (Note that here we can use the previous value of X still as an index,
; because it doesn't get overwritten by the 32-bit load until after the
; address has been computed. Of course, this doesn't work for a store
; operation, because in that case the index registers are forming part of
; the 32-bit value to be written).
LDA32 [dos_disk_table + fs_fat32_cluster_count + 0], X
CMP32 dos_file_loadaddress
BNE @notLastClusterOfFileSystem
; Return error due to lack of space
LDA #dos_errorcode_no_space
STA dos_error_code
CLC
RTS
@notLastClusterOfFileSystem:
; The offset in the sector is computed from the bottom
; 7 bits of the cluster number * 4, to give an offset
; in the 512 byte sector. Once we have the offset, OR the
; four bytes of the cluster number together to test if = 0,
; and thus empty.
LDA dos_file_loadaddress+0
ASL
ASL
TAY
LDA dos_file_loadaddress+0
AND #$40
BEQ @lowMoby
BIT32 $DF00, Y
BRA @ll120
@lowMoby:
BIT32 $DE00, Y
@ll120:
; Remember result of free-test
PHP
; Increment next cluster number we will look at
INC32 dos_file_loadaddress
; If the cluster was not free, then reset search point
PLP
BEQ @thisClusterWasFree
LDA32 dos_file_loadaddress
STA32 dos_opendir_cluster
JMP @tryNewStartingPointCandidate
@thisClusterWasFree:
; Decrement # of clusters still required DEC32 dos_dirent_cluster
; Now see if zero
BEQ @foundFreeSpace
; Nope, we still need more, so continue the search
; Then check if this next cluster is unallocated?
; (If the cluster entry in the FAT will be in the same
; sector as the last, then don't waste time recomputing
; and reading the FAT sector number).
LDA dos_file_loadaddress+0
AND #$7F
BNE @sameSector
JMP @testIfClusterEmptyAfterReadingFATSector
@sameSector:
JMP @testIfClusterEmpty
@foundFreeSpace:
; Found the requested space
SEC
RTS
The end result is that the routine can be reduced to 162 bytes instead of 215 (assuming I have counted correctly). This is really quite a nice improvement -- especially since as we have mentioned, the reduction in size comes with a reduction in execution time and complexity. In short, it will be smaller, faster, simpler -- exactly what we want.
The next step is to test all this, and make sure that it actually works -- both the routine, and the CPU extensions.