Tuesday, April 10, 2018

GOSUB variable in C64 BASIC

While writing a program with a student, we wanted to make an efficient dynamic jump table under C64 BASIC 2, where could have an array of line numbers that represent event handlers for particular events.  The idea is that we would then GOSUB JT%(EVENT) to dispatch to the appropriate handler, and be able to freely update the jump table as we go.

However, C64 BASIC doesn't allow this.  I am not really sure why they made the GOTO command (which is the heart of GOSUB also) unable to resolve variables. It would have meant that the ON ... GOTO command could have been left out, saving ROM space, for example.

Anyway, a bit of poking around found that other people have tried to do this, but none of their solutions really seemed particularly nice.

So, I thought, is it possible to do this just using BASIC 2, and without any assembly routines?

The answer is YES! Using the rather under utilised technique of self modifying BASIC programs.  Is this a horror beyond comprehension?  Probably.  But it also works a treat :)

So, here is how it works:

1. Have a GOSUB command somewhere in your basic program that you can reliably find in memory. I made it GOSUB,21456 in my program.  The number doesn't matter, provided it is 5 digits long, so that we have space to overwrite with any valid line number we might encounter.  The comma is there because GOSUB+COMMA = SYNTAX ERROR, and thus should not exist in any sensible program*. 

2000 REM THIS IS THE ROUTINE THAT WILL GET MODIFIED DURING EXECUTION
2010 GOSUB,21456
2020 RETURN

2. Find out where in memory that line lives, but searching for the GOSUB token (141) and the comma:

1000 FOR JA = 2048 TO 40959: IF PEEK(JA-1)<>141 OR PEEK(JA)<>44 THEN NEXT: RETURN


(By doing the comparison in the negative sense, the loop continues until it finds the location, and then aborts as soon as it finds it).

3. To GOSUB to any line, have a routine like this:

1100 REM GOSUB TO LINE IN VARIABLE LN
1110 LN$=STR(LN): REM GET STRING VERSION OF LINE NUMBER
1120 REM REPLACE ,21456 WITH CORRECT LINE NUMBER
1130 FORI=0TO5:POKEJA+I,32: REM FIRST RUB OUT WITH SPACES IN CASE LINE NUMBER IS SHORT
1140 FORI=0TOLEN(LN$)-1:POKEJA+I,ASC(RIGHT$(LEFT$(LN$,I),1)):NEXT
1150 GOSUB 2000
1160 POKEJA,44: REM PUT THE COMMA BACK READY FOR NEXT TIME
1170 RETURN

(The astute reader will realise that they can merge steps 1 and 3 to give something like:

1100 REM GOSUB TO LINE IN VARIABLE LN
1110 LN$=STR(LN): REM GET STRING VERSION OF LINE NUMBER
1120 REM REPLACE ,21456 WITH CORRECT LINE NUMBER
1130 FORI=0TO5:POKEJA+I,32: REM FIRST RUB OUT WITH SPACES IN CASE LINE NUMBER IS SHORT
1140 FORI=0TOLEN(LN$)-1:POKEJA+I,ASC(RIGHT$(LEFT$(LN$,I),1)):NEXT
1150 GOSUB,00000
1160 POKEJA,44: REM PUT THE COMMA BACK IN CASE WE WANT TO RUN AGAIN
1170 RETURN

Now you can GOSUB to any line you like with a simple program like:

10 GOSUB 1000: REM ONE-TIME ONLY LOOKUP PATCH ADDRESS
20 LN=12345: GOSUB 1100: REM GOSUB TO LINE 12345
999 END

Here is a complete program, and an example of it running. It is short enough to fit on a single screen, even with some comments!




* I don't claim that this program is sensible. It is also possible for it to show up in a string that has shift-M followed by a comma, but I can avoid that easily enough.

Sunday, April 1, 2018

Optimising infinite loops with VHDL

It's been a while since I have tried to improve the CPU performance of the MEGA65, so I thought I would take a look at an optimisation I hadn't tackled yet: Accelerating infinite loops.  This sort of loop isn't normally accelerated because they occur so rarely in programs on modern computers. 

However, on 8-bit systems it wasn't uncommon to have an infinite loop occupy the CPU, and the rest of the work occurring in interrupt routines.  This means that there is a surprising amount of CPU time wasted infinite loops that we can try to reduce.  We can get an idea of just how much potential benefit we can obtain by applying Amdahl's Law

Basically if 1/10th of the CPU time is spend on a particular task, this means that even if that task can be made to take no time at all, that we can only reduce the run time to 100% - 1/10, i.e., it will still take 90% of the original time.  This is where accelerating infinite loops has such great potential:  If the task would have taken infinite time, and we can reduce that to finite time, then we have gained a massive advantage.  Even better, if we can reduce the infinite time to practically zero, then we can gain an infinite speed up.  That is because, if we speed up the infinite part to take practically zero time, we have speed up the overwhelming majority of the task.  For any finite remaining run time, the ratio of speed up will be infinity / remaining run time, and since infinity divided by a finite quantity still equals infinity, the result is infinite speed up.

This all sounds great, but how can we go about achieving this in practice?  For a start, there are all sorts of infinite loops that we can have to contend with.  Again, a fortunate situation is that on 6502 systems the vast majority of infinite loops take the simple form of a JMP instruction that jumps to itself, such as:

2000 JMP $2000

The first trick is how to efficiently detect these operations. Fortunately this is simple: If the same JMP instruction is encountered two instructions in a row, it means that the code path has become invariant, and an infinite loop will eventuate. In fact, this method is robust enough that it can be used to detect a surprisingly wide range of infinite loops. The second trick is to know how to achieve the effect of the infinite loop. Fortunately JMP instructions don't perform any other computation, or cause unpredictable changes to the processor flags, so we can, in fact, simply ignore the instruction on the second iteration, falling through to the next instruction in memory -- thus effectively executing an infinite loop in a about 12 clock cycles, i.e., 12 x 20ns = 240ns.  That is, we can execute an infinite loop in about 1/4 of a single CPU cycle on the C64, or slightly faster than a single CPU cycle on a C65.  While there is scope to further improve on this, it will be good enough for now.

So, first step is to modify the CPU to detect back-to-back jumps to the same address, and to abort the jump when this occurs.  This turned out to be trivial. Here is the complete patch:

diff --git a/src/vhdl/gs4510.vhdl b/src/vhdl/gs4510.vhdl
index e2036f7..87b4852 100644
--- a/src/vhdl/gs4510.vhdl
+++ b/src/vhdl/gs4510.vhdl
@@ -1173,6 +1173,9 @@ architecture Behavioural of gs4510 is
   signal reg_math_cycle_counter_plus_one : unsigned(31 downto 0) := to_unsigned(0,32);
   -- # of math cycles to trigger end of job / math interrupt
   signal reg_math_cycle_compare : unsigned(31 downto 0) := to_unsigned(0,32);
+
+  signal reg_last_jump : unsigned(15 downto 0) := to_unsigned(0,16);
+  signal last_was_jump : std_logic := '0';
  
 begin

@@ -6136,8 +6139,16 @@ begin
               end if;
              
               if reg_microcode.mcJump='1' then
-                report "Setting PC: mcJump=1";
-                reg_pc <= reg_addr;
+                if reg_last_jump /= reg_addr or last_was_jump='0' then
+                  report "Setting PC: mcJump=1";
+                  reg_pc <= reg_addr;
+                else
+                  report "Accelerating infinite loop";
+                end if;

As can be seen, all we have had to do, was to add a couple of signals to remember if the last instruction was a jump, and where the jump went. We then only take the jump if the immediately preceeding instruction was not an identical jump.

A quick test using GHDL to verify that it works under simulation was the next item on the list.  For this, I added an infinite loop at the reset entry for the MEGA65 Kickstart. This would have normally caused the MEGA65 to take an infinite amount of time to proceed with the boot process.  But now, we see the following simulation output:

@5190ns:(report note): Setting PC to $80xx/8100 on hypervisor entry@5370ns:(report note): Setting PC: mcJump=1@5450ns:(report note): $8100 4C 88 A1  jmp  $A188
@5490ns:(report note): $A188 78        sei
@5610ns:(report note): Setting PC: mcJump=1@5690ns:(report note): $A189 4C 89 A1  jmp  $A189
@5810ns:(report note): Accelerating infinite loop@5890ns:(report note): $A189 4C 89 A1  jmp  $A189

@6010ns:(report note): Setting PC in JSR/BSR (fast dispatch)
@6050ns:(report note): $A18C 20 D8 A0  jsr  $0104
@6090ns:(report note): $0104 78        sei


I have highlighted the important lines above. As we can see, the infinite loop is entered at 5,610ns, and is detected at 5,810ns, and the following instruction is executed at 5,890ns.  Thus the total cost of the "infinite" loop was a mere 280ns -- a little slower than I predicted, but still quite acceptable.  It will certainly do for now, until I have time to track down where the other 40ns has gone.

Monday, March 12, 2018

Creating files on a FAT32 filesystem, and making it simpler to implement

The MEGA65 uses a VFAT32 file system as the native file system on the SD Card, both for convenience (pretty much any computer can read/write it), and also because it is a fairly simple on-disk format.  So far, however, the MEGA65 hypervisor DOS code can only read files, not create or write to them. This, naturally, is not a great situation, and thus this post describes progress towards being able to create files, and write to them.

One of the challenges is that the MEGA65 Hypervisor is only 16KB, including boot-up code, file system, freezing and task swapping, and a bunch of other miscellaneous other little things.  This means we have to be very careful about what we put in there.  Thus, for now at least, FAT file creation will also require the specification of the file length.  This means that we can have a call that allows writing data at any offset of a file, without having to worry about allocating space at that point in time.  It also means that we can fairly easily enforce that all files we create will have contiguous on-disk storage, which is important for D81 disk images, because the hardware support for disk images requires that they exist as a single contiguous slab on the disk, so that the hardware doesn't need to know anything about the FAT file system.

So, the first routine we need is one that can find a contiguous set of clusters that are free, and that are a candidate for hosting a file.  On the one hand, this is a very simple routine: All it has to do is to iterate through the cluster list, counting the number of consecutive empties, and resetting that count if it finds one already allocated, and continue until it reaches either the end of the partition, or finds the required number of consecutive unallocated clusters.  Of course, when you are manipulating 32-bit values on an 8-bit computer, everything ends up more complicated than you would like.  Here is my first go at implementing this (which I won't be able to easily test, until I have the rest of the routines in place):

dos_find_contiguous_free_space:
    ; Find a piece of free space in the specified file system
    ; (dos_default_disk), and return the first cluster number
    ; if it can be found.
    ;
    ; INPUT: dos_dirent_length = # of clusters required
    ; OUTPUT: dos_opendir_cluster = first cluster
    ; C flag set on success, clear on failure.
    ;
    ; FAT32 file systems have the expected first cluster free
    ; stored in the 2nd sector of the file system at offset
    ; $1EC (actually, this may point to the last allocated cluster).
    ; This field is only a suggestion however, to accelerate allocation,
    ; and thus should not be relied upon, but rather to allow quick
    ; skipping of the already allocated area of disk.
    ;
    ; The number of clusters we need to allocate is to be provided in
    ; dos_opendir_cluster as a 32-bit value, thus allowing for files
    ; upto 2^32 * cluster size bytes long to be created.
    ; (Note that in practice files are limited to 2GiB - 1 bytes for
    ; total compatibility, and 4GiB -1 for fairly decent compatibility,
    ; and 256GiB - 1 if we implement the FAT+ specification
    ; (http://www.fdos.org/kernel/fatplus.txt.1) in the appropriate places
    ; at some point, which we will likely do, as it is really very simple.
    ; Mostly we just have to use 40 bit offsets for file lengths.

    ; Let's start with a completely naive algorithm:
    ; 1. Begin at cluster 2. Reset # of contiguous clusters found to zero.
    ;    Remember this cluster as candidatee starting point.
    ; 2. If current cluster not free, advance to next cluster number, reset
    ;    contiguous free cluster count.  Remember the newly advanced cluster
    ;    number as candidate starting point.
    ; 3. If current cluster is free, increase contiguous free clusters found.
    ;    If equal to desired number, return candidate cluster number.
    ; 4. Repeat 2 and 3 above until end of file system is reached (and return
    ;    an error), or until step 3 has returned success.
    ;
    ; This algorithm can be made more efficient by using the last allocated
    ; cluster number field as an alternative starting point, and only if that
    ; fails, retrying beginning at cluster 2.

    ; So the state we need to keep track of is:
    ; dos_opendir_cluster = current cluster we are considering as candidate starting point.
    ; dos_file_loadaddress = current cluster being tested for vacancy.
    ; dos_dirent_length = number of clusters required, in total.
    ; dos_dirent_cluster = number of clusters required, from this point
    ; (i.e., the total minus the number we have already found contiguously free).
    ; current_disk[fs_fat32_cluster_count] = number of clusters in disk
    ; (and thus the end point of our search).
    ;
    ; Other than that, we just need to advance our way linearly through the FAT sectors.
    ; This is straight forward, as we can just read the first FAT sector, and then progress
    ; our way through, until we reach the end.

    ; First, make sure we can see the sector buffer at $DE00-$DFFF
   
JSR   sd_map_sectorbuffer

    ; 1. Start at cluster 2
   
LDA  #$02
    STA  dos_opendir_cluster+0
    STA  dos_file_loadaddress+0
    LDA  #$00
    STA  dos_opendir_cluster+1
    STA  dos_opendir_cluster+2
    STA  dos_opendir_cluster+3
    STA  dos_file_loadaddress+1
    STA  dos_file_loadaddress+2
    STA  dos_file_loadaddress+3

@tryNewStartingPointCandidate:

    ;    Reset # of clusters still required
   
LDX  #$03
@ll74:    LDA  dos_dirent_length, X
    STA  dos_dirent_cluster, X
    DEX 
    BPL  @ll74

@testIfClusterEmptyAfterReadingFATSector:
    ; Read the appropriate sector of the FAT
    ; To do this, we copy the target cluster number to
    ; dos_current_cluster, and call dos_cluster_to_fat_sector.
    ; This leaves the absolute sector number in
    ; dos_current_cluster.
   
LDX    #$03
@ll78:    LDA    dos_file_loadaddress, X
    STA    dos_current_cluster, X
    DEX
    BPL    @ll78
    JSR    dos_cluster_to_fat_sector
    ; Now we have the sector # in dos_current_cluster
    ; Copy to the SD card sector register, and do the read
   
LDX    #$03
@ll83:    LDA    dos_current_cluster, X
    STA    $D681, X
    DEX
    BPL    @ll83
    JSR    sd_fix_sectornumber
    ; Finally, do the read
   
LDA    #dos_errorcode_read_error
    STA    dos_error_code
    JSR    sd_readsector
    ; Fail on error
   
BCS       @ll93
    RTS
@ll93:

@testIfClusterEmpty:
    ; Here we have the sector read, and do the test on the contents of the cluster
    ; entry.

    ; But first, check that the cluster number is valid:
    ; 1. Get dos_disk_table_offset pointing correctly
   
LDX   dos_disk_current_disk
    JSR   dos_set_current_disk
    LDX   dos_disk_table_offset
    ; 2. Compare cluster count to current cluster
   
LDY   #$00
@ll128:    LDA   [dos_disk_table + fs_fat32_cluster_count + 0], X
    CMP   dos_file_loadaddress, Y
    BNE   @notLastClusterOfFileSystem
    INX
    INY
    CPY    #$04
    BNE    @ll128

    ; Return error due to lack of space
   
LDA    #dos_errorcode_no_space
    STA    dos_error_code
    CLC
    RTS

@notLastClusterOfFileSystem:

    ; The offset in the sector is computed from the bottom
    ; 7 bits of the cluster number * 4, to give an offset
    ; in the 512 byte sector. Once we have the offset, OR the
    ; four bytes of the cluster number together to test if = 0,
    ; and thus empty.
   
LDA      dos_file_loadaddress+0
    ASL
    ASL
    TAY
    LDA    dos_file_loadaddress+0
    AND    #$40
    BEQ    @lowMoby
    LDA    $DF00, Y
    ORA    $DF01, Y
    ORA    $DF02, Y
    ORA    $DF03, Y
    BRA    @ll120
@lowMoby:
    LDA    $DE00, Y
    ORA    $DE01, Y
    ORA    $DE02, Y
    ORA    $DE03, Y
@ll120:
    ; Remember result of free-test
   
TAX

    ; Increment next cluster number we will look at
   
LDA    dos_file_loadaddress+0
    CLC
    ADC    #$01
    STA    dos_file_loadaddress+0
    LDA    dos_file_loadaddress+1
    ADC    #$00
    LDA    dos_file_loadaddress+1
    LDA    dos_file_loadaddress+2
    ADC    #$00
    LDA    dos_file_loadaddress+2
    LDA    dos_file_loadaddress+3
    ADC    #$00
    LDA    dos_file_loadaddress+3

    ; If the cluster was not free, then reset search point
   
CPX       #$00
    BEQ     @thisClusterWasFree
    LDX     #$03   
@ll160:    LDA     dos_file_loadaddress, X
    STA     dos_opendir_cluster, X
    DEX
    BPL     @ll160   
    JMP     @tryNewStartingPointCandidate

@thisClusterWasFree:
    ; Decrement # of clusters still required
   
LDA    dos_dirent_cluster+0
    SEC
    SBC    #$01
    STA    dos_dirent_cluster+0
    LDA    dos_dirent_cluster+1
    SBC    #$00
    STA    dos_dirent_cluster+1
    LDA    dos_dirent_cluster+2
    SBC    #$00
    STA    dos_dirent_cluster+2
    LDA    dos_dirent_cluster+3
    SBC    #$00
    STA    dos_dirent_cluster+3
    ; Now see if zero
    ORA    dos_dirent_cluster+2
    ORA    dos_dirent_cluster+1
    ORA    dos_dirent_cluster+0
    BEQ    @foundFreeSpace
   
    ; Nope, we still need more, so continue the search

    ; Then check if this next cluster is unallocated?
    ; (If the cluster entry in the FAT will be in the same
    ;  sector as the last, then don't waste time recomputing
    ;  and reading the FAT sector number).
   
LDA    dos_file_loadaddress+0
    AND    #$7F
    BNE    @sameSector
    JMP     @testIfClusterEmptyAfterReadingFATSector
@sameSector:
    JMP    @testIfClusterEmpty

@foundFreeSpace:
    ; Found the requested space
   
SEC
    RTS


It isn't too hard to find the sections of code where there is considerable repetition due to the need to update 32-bit = 4 byte long values, using only 8-bit CPU operations.  This is both a nuisance from a code-size perspective (remember that the Hypervisor is only 16KB in total, and so every wasted byte is potentially important), and also a correctness/security perspective, because it is simply hard to make sure that such code is functionally correct.

If the criteria were only security and performance, we would of course just use a simple 32-bit processor, however, the whole point of the MEGA65 is that it is an 8-bit computer.  That said, the 6502 (and 4510) already both have a number of non-8-bit features: Most obviously is that the program counter is 16-bits.  But also the zero-page indirection instructions operate on 16-bit pointers, the stack can be 16-bit addressed on the 4510, the 4510 even has increment/decrement/shift word instructions, which really start to blur the lines on where the boundary between 8 and 16 bit processors lie.  I think my take on the situation is that while the registers are 8-bit, and the majority of operations are 8-bit, that there probably isn't a problem.

The question then really is whether it makes sense to add some simple 32-bit operations, for example increment and decrement, which are the most common kinds of operations, and which also are the most annoying to do now, because in reality you need to unroll them, or do very strange things with saving the processor flags between iterations of the loop, so that the carry flag is not corrupted.  These could be implemented using a prefix to the INW and DEW (increment and decrement word) instruction, similar to how the MEGA65's 32-bit zero-page indirect operations work.  So to increment a 32-bit value, we might end up using something like:

NOP
NOP
INW $address


That is, we can replace quite a lot of goo with just 5 bytes, if we allow INW and DEW to operate on 32-bits of data instead of 16. Unfortunately, INW and DEW (and the other 16-bit read-modify-write instructions) are all zero-page only, which is a bit limiting.  That they are zero-page only is a bit silly in retrospect, because these instructions already require 2 cycles for the instruction, 2 more to read the 16-bit value, and 2 more to write the result back = 6 cycles. For a seventh cycle, they would have been much more general purpose.  I could in theory make the instructions behave as absolute addressing, i.e., non-zero-page, when used with the double-NOP prefix, but they we might fall into the same trap as the 65816, where you have to know the CPU state, in order to be able to disassemble code.  This is not good on a variety of fronts.

To add to my frustration, the ROW and and ASW instructions, the ones that shift 16-bit values do use absolute addressing mode.  Even the PHW instructions take 16-bit addresses/values as operands, only not the INW/DEW instructions, that are the most useful in this situation.

So looking at our available resources, we have PHW #$nnnn, PHW $nnnn, ASW $nnnn and ROW $nnnn as the four instructions that we could potentially overload with a prefix to do something more useful, and INW/DEW that can also be logically extended, but with the annoying limitation on their source of operand.  There is some intrinsic value to having 32-bit ROW and ASW operations, to allow bit-shifting of wider values.  The stack operations are more vexed, because there are no matching 16-bit stack pop operations (assuming that we don't count RTS, although it technically does pop 16 bits from the stack, it is just that any instruction that puts its result into the program counter is perhaps not ideally suited to being bent towards arithmetic operations.


For the existing 32-bit flat addressing extensions, i.e., where the data remains 8-bits wide, but the address is 32-bits long, we use NOP + NOP as the prefix. Note that this is different to what we are talking about here, where the data is 32-bit, but the adress remains 16-bit.  We really don't want to confuse those two, and there are situations where it probably makes sense to use them both at the same time*, so we will need a different prefix. 

[ * The main problem with using them at the same time, is that the 32-bit addressing extensions use the indirect zero-page instructions like LDA ($12), Z -- which uses one of the registers in the address computation. This is fine for LDA, but not for ADC and friends, and definitely not for STA, where the value you are storing, including all the 8-bit register values as it does, would thus affect the address you are writing to.  To solve this for now, if you use STA in this mode, the value of the Z register is NOT added to the 32-bit address.  This whole scheme will need some further fine-tuning, but at least for now is minimally useful: It is possible in theory to load a 32-bit value from, or write one to, a 32-bit flat address. ]


After having stared at the instruction set many times before, and not really seen many good candidates for prefix instructions, since, after all, the whole point of instructions that are not called NOP is to do something, it occurred to me that the NEG instruction would in fact be a pretty good one.  Yes, it will mess up the Z and N flags -- but that's all it will do if you call it twice, since it negates, i.e., inverts the sign of the value of the accumulator.  Thus, calling it twice puts the value of the accumulator back to what it was to begin with, and potentially even restores the values of the Z and N flags.  Anyway, for lack of a better alternative, this is what we will use.

Anyway, by using the NEG + NEG prefix on any LDA, STA, ADC, SBC, EOR, AND, ORA, INC, DEC, ASL, ROL, ROR or LSR instruction, we can make it instead do a 32-bit read (and or write).  The simplest way to implement this, and which is what I have done, is to duplicate the operand read and/or write after the address has been computed.  In this way, you can still use all the usual addressing modes (although the indexed ones will require some care, because many of these 32-bit operations will of course modify the contents of the index registers). This leaves the question of where do we find a 32-bit register in the 6502 architecture that we can use.  That, and related questions we will soon return to. But first, lets get back to the routine for finding where to put a newly created file, there are the following 32-bit operations:

2x write 32-bit value to memory
4x copy 32-bit value to another location
1x compare two 32-bit values for equality
3x test 32-bit value if equal to zero
1x increment 32-bit value
1x decrement 32-bit value

Of these, the increments, decrements and zero-tests could be directly accelerated, accounting for 5 of the 13 operations.  Helpful, but not ideal. 

What simple changes can we make that would give a much better result? One of the requirements for "simple" in my mind, is that they should not result in the creation of more registers or other CPU state that needs to be saved/restored during a context switch or Hypervisor trap.  We do have four general purpose(ish) registers in the 4510, however: A, X, Y and Z  - just enough for holding 32-bits.  Thus we could have a prefix on LDA / STA / ADC / SBC / CMP that cause A, X, Y and Z to all be used together, so for example:

NEG
NEG
LDA $1234
CLC
NEG
NEG
ADC $1238
NEG
NEG
STA $123C

could be used to read the 32-bit value from $1234-$1237 into A, X, Y and Z, and then add to this the 32-bits of data stored at $1238-$123B, and finally write the results to the four bytes at $123C-$123F.  The same routine in normal 6502 would be:

LDA $1234
CLC
ADC $1238
STA $123C
LDA $1235
ADC $1239
STA $123D
LDA $1236
ADC $123A
STA $123E
LDA $1237
ADC $123B
STA $123F

Not only does this require 37 bytes and 50 cycles on a regular 6502, it is also prone to copy-paste errors (hopefully I didn't make any here).  It is possible to slightly optimise this, e.g.:

LDX #$00
CLC
PHP
loop:
PLP
LDA $1234,X
ADC $1238,X
STA $123C,X
PHP
INX
CPX #$04
BNE loop
PLP

This is shorter than the unrolled version, requiring only 21 bytes, however the program flow is rather more complicated to understand (and requires a byte off stack space).  Also, it requires 99 cycles, making it twice as slow as the unrolled version.

If we implement our CPU extension to allow our nice clear 15 byte version, it should take only 30 cycles* on the MEGA65, making it superior on all three fronts: size, speed and simplicity. 

* NOP and CLC take 1 cycle on the MEGA65, and load instructions have a one cycle increased cost, compared to the 6502.  We also have to allow four cycles for reading/writing the 32-bit values in this mode, so my current estimation of the cycle timing would be:

= 1 (NEG)
+ 1 (NEG)
+ 8 (LDA, $34, $12, wait state, read $1234, read $1235, read $1236, read $1237)
+ 1 (CLC)
= 1 (NEG)
+ 1 (NeG)
+ 8 (ADC, $38, $12, wait state, read $1238, read $1239, read $123A, read $123B)
+ 1 (NEG)
+ 1 (NEG)
+ 7 (STA, $3C, $12, write $123C, write $123D, write $123E, write $123F)

= 30 cycles.

Also, it stands to reason that there are lots of other situations where being able to read multiple bytes into the registers, or write all the registers out at the same time will probably be useful. Anyway, back to our routine, let's look at how it would look if we had these extensions:

    ; First, make sure we can see the sector buffer at $DE00-$DFFF
   
JSR   sd_map_sectorbuffer

    ; 1. Start at cluster = $00000002

    ; (Use new extensions to store AXYZ with the values)
   
LDA  #$00

    TAX
    TAY
    TAZ
    LDA  #$02
    NEG
    NEG
    STA  dos_opendir_cluster

    NEG
    NEG
    STA  dos_file_loadaddress
 
@tryNewStartingPointCandidate:

    ;    Reset # of clusters still required
   
NEG

    NEG
    LDA dos_dirent_length
    NEG
    NEG
    STA dos_dirent_cluster

@testIfClusterEmptyAfterReadingFATSector:
    ; Read the appropriate sector of the FAT
    ; To do this, we copy the target cluster number to
    ; dos_current_cluster, and call dos_cluster_to_fat_sector.
    ; This leaves the absolute sector number in
    ; dos_current_cluster.
   
NEG

    NEG
    LDA dos_file_loadaddress
    NEG
    NEG
    STA dos_current_cluster 
    JSR    dos_cluster_to_fat_sector
    ; Now we have the sector # in dos_current_cluster
    ; Copy to the SD card sector register, and do the read
   
NEG

    NEG
    LDA dos_current_cluster
    NEG
    NEG
    STA $D681
    JSR    sd_fix_sectornumber
    ; Finally, do the read
   
LDA    #dos_errorcode_read_error
    STA    dos_error_code
    JSR    sd_readsector
    ; Fail on error
   
BCS       @ll93
    RTS
@ll93:

@testIfClusterEmpty:
    ; Here we have the sector read, and do the test on the contents of the cluster
    ; entry.

    ; But first, check that the cluster number is valid:
    ; 1. Get dos_disk_table_offset pointing correctly
   
LDX    dos_disk_current_disk
    JSR    dos_set_current_disk
    LDX    dos_disk_table_offset
    ; 2. Compare cluster count to current cluster
    ; (Note that here we can use the previous value of X still as an index,
    ;  because it doesn't get overwritten by the 32-bit load until after the 
    ;  address has been computed. Of course, this doesn't work for a store
    ;  operation, because in that case the index registers are forming part of
    ;  the 32-bit value to be written).
    NEG
    NEG
    LDA    [dos_disk_table + fs_fat32_cluster_count + 0], X

    NEG
    NEG
    CMP    dos_file_loadaddress

    BNE    @notLastClusterOfFileSystem

    ; Return error due to lack of space
   
LDA    #dos_errorcode_no_space
    STA    dos_error_code
    CLC
    RTS

@notLastClusterOfFileSystem:

    ; The offset in the sector is computed from the bottom
    ; 7 bits of the cluster number * 4, to give an offset
    ; in the 512 byte sector. Once we have the offset, OR the
    ; four bytes of the cluster number together to test if = 0,
    ; and thus empty.
   
LDA    dos_file_loadaddress+0
    ASL
    ASL
    TAY
    LDA    dos_file_loadaddress+0
    AND    #$40
    BEQ    @lowMoby

    NEG
    NEG
    BIT    $DF00, Y
    BRA @ll120
@lowMoby:
    NEG
    NEG
    BIT    $DE00, Y
@ll120:
    ; Remember result of free-test
   
PHP

    ; Increment next cluster number we will look at
   
NEG

    NEG
    INC    dos_file_loadaddress

    ; If the cluster was not free, then reset search point
   
PLP
    BEQ    @thisClusterWasFree



    NEG
    NEG
    LDA    dos_file_loadaddress
    NEG
    NEG
    STA    dos_opendir_cluster
 

    JMP    @tryNewStartingPointCandidate

@thisClusterWasFree:
    ; Decrement # of clusters still required
   
NEG

    NEG
    DEC    dos_dirent_cluster
    ; Now see if zero
    BEQ    @foundFreeSpace
   
    ; Nope, we still need more, so continue the search

    ; Then check if this next cluster is unallocated?
    ; (If the cluster entry in the FAT will be in the same
    ;  sector as the last, then don't waste time recomputing
    ;  and reading the FAT sector number).
   
LDA    dos_file_loadaddress+0
    AND    #$7F
    BNE    @sameSector
    JMP     @testIfClusterEmptyAfterReadingFATSector
@sameSector:
    JMP    @testIfClusterEmpty

@foundFreeSpace:
    ; Found the requested space
   
SEC
    RTS


That's quite a bit shorter and simpler, and would be even simpler again if we were to teach the assembler about the new extensions, and have LDA32, STA32, ADC32, INC32 etc generate the NEG, NEG prefix:

    ; First, make sure we can see the sector buffer at $DE00-$DFFF
   
JSR   sd_map_sectorbuffer

    ; 1. Start at cluster = $00000002

    ; (Use new extensions to store AXYZ with the values)
   
LDA  #$00

    TAX
    TAY
    TAZ
    LDA  #$02
    STA32  dos_opendir_cluster
    STA32  dos_file_loadaddress
 
@tryNewStartingPointCandidate:

    ;    Reset # of clusters still required
   
LDA32  dos_dirent_length

    STA32  dos_dirent_cluster

@testIfClusterEmptyAfterReadingFATSector:
    ; Read the appropriate sector of the FAT
    ; To do this, we copy the target cluster number to
    ; dos_current_cluster, and call dos_cluster_to_fat_sector.
    ; This leaves the absolute sector number in
    ; dos_current_cluster.
   
LDA32 dos_file_loadaddress

    STA32 dos_current_cluster 
    JSR    dos_cluster_to_fat_sector
    ; Now we have the sector # in dos_current_cluster
    ; Copy to the SD card sector register, and do the read
   
LDA32  dos_current_cluster

    STA32  $D681
    JSR    sd_fix_sectornumber
    ; Finally, do the read
   
LDA    #dos_errorcode_read_error
    STA    dos_error_code
    JSR    sd_readsector
    ; Fail on error
   
BCS       @ll93
    RTS
@ll93:

@testIfClusterEmpty:
    ; Here we have the sector read, and do the test on the contents of the cluster
    ; entry.

    ; But first, check that the cluster number is valid:
    ; 1. Get dos_disk_table_offset pointing correctly
   
LDX    dos_disk_current_disk
    JSR    dos_set_current_disk
    LDX    dos_disk_table_offset
    ; 2. Compare cluster count to current cluster

    ; (Note that here we can use the previous value of X still as an index,
    ;  because it doesn't get overwritten by the 32-bit load until after the 
    ;  address has been computed. Of course, this doesn't work for a store
    ;  operation, because in that case the index registers are forming part of
    ;  the 32-bit value to be written).
   
LDA32  [dos_disk_table + fs_fat32_cluster_count + 0], X

    CMP32  dos_file_loadaddress
    BNE    @notLastClusterOfFileSystem

    ; Return error due to lack of space
   
LDA    #dos_errorcode_no_space
    STA    dos_error_code
    CLC
    RTS

@notLastClusterOfFileSystem:

    ; The offset in the sector is computed from the bottom
    ; 7 bits of the cluster number * 4, to give an offset
    ; in the 512 byte sector. Once we have the offset, OR the
    ; four bytes of the cluster number together to test if = 0,
    ; and thus empty.
   
LDA    dos_file_loadaddress+0
    ASL
    ASL
    TAY
    LDA    dos_file_loadaddress+0
    AND    #$40
    BEQ    @lowMoby

    BIT32    $DF00, Y
    BRA @ll120
@lowMoby:
    BIT32    $DE00, Y
@ll120:
    ; Remember result of free-test
   
PHP

    ; Increment next cluster number we will look at
 
  INC32    dos_file_loadaddress


    ; If the cluster was not free, then reset search point
   
PLP
    BEQ    @thisClusterWasFree



    LDA32    dos_file_loadaddress
    STA32    dos_opendir_cluster

    JMP    @tryNewStartingPointCandidate

@thisClusterWasFree:
    ; Decrement # of clusters still required    DEC32    dos_dirent_cluster
    ; Now see if zero
    BEQ    @foundFreeSpace
   
    ; Nope, we still need more, so continue the search

    ; Then check if this next cluster is unallocated?
    ; (If the cluster entry in the FAT will be in the same
    ;  sector as the last, then don't waste time recomputing
    ;  and reading the FAT sector number).
   
LDA    dos_file_loadaddress+0
    AND    #$7F
    BNE    @sameSector
    JMP    @testIfClusterEmptyAfterReadingFATSector
@sameSector:
    JMP    @testIfClusterEmpty

@foundFreeSpace:
    ; Found the requested space
   
SEC
    RTS


The end result is that the routine can be reduced to 162 bytes instead of 215 (assuming I have counted correctly).  This is really quite a nice improvement -- especially since as we have mentioned, the reduction in size comes with a reduction in execution time and complexity. In short, it will be smaller, faster, simpler -- exactly what we want.

The next step is to test all this, and make sure that it actually works -- both the routine, and the CPU extensions.

Monday, February 12, 2018

Extending The Commodore 64 BASIC with 1st-class tokens

There are a number of nice tutorials that talk about extending the C64's BASIC. However, all the ones I could find talk about having a singl-character prefix, e.g., @, and then making single-character commands that follow that. This is the usual approach, because it is much easier, as you don't need to make the LIST command able to interpret the new commands.  However, it really makes for an ugly result, and hard to read code. For the new MEGA BASIC on the MEGA65, I wanted to avoid this, and have "real" BASIC keywords for the extensions, so that the resulting programs would be easier to write, and easier to read, and much easier to learn how to do things by reading other peoples programs.  So, I had to find out how to make a proper BASIC extension myself.

There are three main functions that you have to modify for this: (1) tokenisation (the part where words get turned into single-byte values when stored in a program; (2) detokenisation (the part where those get turned back into readable words, which really only matters for the LIST command); and (3) executing tokens. In this post, I'll show you how I implemented each one. This post will go into quite low-level detail of this, which might not be interesting to some, so please accept my apologies in advance if that is the case. However, for others who wish to implement an extension for the C64 BASIC, it would seem to be the only publicly released breakdown of how to make one, so I think it is warranted.

Before I dive into the tokenisation and detokenisation routines, it is worth mentioning a common problem to them both: The existing list of BASIC keywords as stored in the BASIC ROM is 254 bytes long.  By keeping it under 256 bytes, the tokenisation and detokenisation routines are able to be somewhat simpler, by using only index registers to scan through the list.  This has to change if we want to allow more keywords. As I don't anticipate making more keywords than 2*256 bytes, I was able to make changes that only worried about checking which half of the list is being read.


My token list is simply formed by taking the standard 255 bytes from the BASIC ROM at $A09E, and appending my new ones to the end:


tokenlist:
        ;; Reserve space for C64 BASIC token list, less the end $00 marker

        ;; (I copy these in place with a little loop elsewhere)
        .res ($A19C - $A09E + 1), $00
        ;; End of list marker (remove to enable new tokens)
                ;.byte $00
        ;; Now we have our new tokens
        ;; extra_token_count must be correctly set to the number of tokens
        ;; (This lists only the number of tokens that are good for direct use.
        ;; Keywords found only within statements are not in this tally.)
        extra_token_count = 5
        token_fast = $CC + 0
        .byte "FAS",'T'+$80
        token_slow = $CC + 1
        .byte "SLO",'W'+$80

        token_canvas = $CC + 2
        .byte "CANVA",'S'+$80
        token_colour = $CC + 3
        .byte "COLOU",'R'+$80
        token_tile = $CC + 4
        .byte "TIL",'E'+$80

        token_first_sub_command = token_tile + 1
       
        ;; These tokens are keywords used within other
        ;; commands, not as executable commands. These
        ;; will all generate syntax errors.
        token_text = token_first_sub_command + 0
        .byte "TEX",'T'+$80
        token_sprite = token_first_sub_command + 1
        .byte "SPRIT",'E'+$80
        token_screen = token_first_sub_command + 2
        .byte "SCREE",'N'+$80
        token_border = token_first_sub_command + 3
        .byte "BORDE",'R'+$80
        token_set = token_first_sub_command + 4
        .byte "SE",'T'+$80
        token_delete = token_first_sub_command + 5
        .byte "DELET",'E'+$80
        token_stamp = token_first_sub_command + 6
        .byte "STAM",'P'+$80
        token_at = token_first_sub_command + 7
        .byte "A",'T'+$80
        token_from = token_first_sub_command + 8
        .byte "FRO",'M'+$80
        ;; And the end byte
        .byte $00       


Another key issue is that most of the possible token values are already used.  Only token values between $CD and $FE are available, i.e., only 49 of them.  If I want to avoid double-byte tokens (which I really do), then I need to keep my new keywords to less than that number.  We'll see if I manage to do that, but for now, we will try. On both these points, I suspect that BASIC 10 on the C65 probably exceeds these limits, thus making the tokeniser and detokeniser more complex than they are in BASIC 2 on the C64. 

The tokeniser

The tokeniser is the routine that scans a line of BASIC text, and changes keywords into single-byte tokens.  It has to take account of various quirks, such as the PI character officially being a token with the special value $FF, and know whether parts of the line are inside quotation marks or follow a REM statement. This all makes the logic of the tokeniser routine rather more subtle and complex than one might first imagine.  As a result, I figured it was safer to essentially replicate the functionality of the standard tokeniser routine as exactly as possible, adding only the ability to scan a keyword list that is upto 511 bytes long, instead of 255 bytes long.

This routine is effectively the same as in the C64 BASIC ROM at $A57C, but with the changes to allow the token list to be two pages long. I have also tried to more fully document the routine in the comments and labels. Some of the labels have address suffixes on them, that indicate where in the original C64 BASIC ROM routine they were located, to make it easier to reference between the two of them. 

The algorithm is basically one of seeing if the current character might be the start of a keyword that should be tokenised, and if so, doing a string match against it.  Tokens are stored in the token list with bit 7 set on the last character, so, for example RUN is stored as $52 $55 $CE, where $CE = $4E + $80 to represent the terminal N of the word.  This trick is also how the short-cuts for keywords using shifted letters works: If the result of comparing the input character and the keyword letter is $80, then it is either the terminal character of the keyword, or a shifted letter in the input being tokenised.  This is also why typing something like RU(shift-N) won't work.

It also occurs to me that this allows for a rather subtle bug, where if you typed a keyword with the last character shifted, and then characters for the following token, it will match against the first token, so typing RU(shift-N)I(shift-F)RESTORE would actually be interpreted and tokenised as though you had just typed RUN. Quite what use you might make of this, I have no idea, but I have confirmed that it does indeed work.  I haven't bothered to try to fix this little problem in my extension, but just report it here in case it is of curiosity to anyone.  I'd be interested to know if anyone else had previously discovered this quirk.

Back to the task at hand, for the longer token list, I have added a "hi page flag" that is set when accessing the upper of the two pages of tokens, and clear for the lower.  In places this is initialised to $FF when set, so that INC token_hi_page_flag causes it to be cleared.  This is used in particular near the start following tokeniseNextChar, so that the routine can continue to pre-increment the pointer into the token list (which is normally just held in the Y register).

This means that the logic for advancing and, retreating the token pointer is no longer as simple as INY or DEY. To keep things understandable, I have broken out the code to do this into tokenListReadByte, tokenListAdvancePointer etc.

Beyond that, there is not really much more to say about it, without having to dive deep into explaining every little subtly about the C64 BASIC tokenise routine, other than to repeat that this tokeniser supports only single-byte tokens. To allow more tokens, we would need to maintain a 16-bit counter for the token value, and if it were >253, come up with a double-byte token, e.g., $FE <second byte>.  i.e., token 254 would be the first token to require two bytes, because token $FF is already used for PI, and we need to have at least one token reserved as a kind of escape code, which above I have suggested $FE as the logical candidate for this role. If I end up using too many tokens, that is probably the method that I would use.

megabasic_tokenise:

        ;; Get the basic execute pointer low byte
        LDX    $7A
        ;; Set the save index
        LDY    #$04
        ;; Clear the quote/data flag
        STY    $0F

@tokeniseNextChar:
        ;; Get hi page flag for tokenlist scanning, so that if we INC it, it will
        ;; point back to the first page.  As we start with offset = $FF, the first
        ;; increment will do this. Since offsets are pre-incremented, this means
        ;; that it will switch to the low page at the outset, and won't switch again
        ;; until a full page has been stepped through.
        PHA
        LDA     #$FF
        STA    token_hi_page_flag
        PLA
       
        ;; Read a byte from the input buffer
        LDA    $0200,X
        ;; If bit 7 is clear, try to tokenise
        BPL    @tryTokenise
        ;; Now check for PI (char $FF)
        CMP    #$FF     ; = PI
        BEQ    @gotToken_a5c9
        ;; Not PI, but bit 7 is set, so just skip over it, and don't store
        INX
        BNE    @tokeniseNextChar
@tryTokenise:
        ;; Now look for some common things
        ;; Is it a space?
        CMP    #$20    ; space
        BEQ    @gotToken_a5c9
        ;; Not space, so save byte as search character
        STA    $08
        CMP    #$22    ; quote marks
        BEQ    @foundQuotes_a5ee
        BIT    $0F    ; Check quote/data mode
        BVS    @gotToken_a5c9 ; If data mode, accept as is
        CMP    #$3F           ; Is it a "?" (short cut for PRINT)
        BNE    @notQuestionMark
        LDA    #$99    ; Token for PRINT
        BNE    @gotToken_a5c9 ; Accept the print token (branch always taken, because $99 != $00)
@notQuestionMark:
        ;; Check for 0-9, : or ;
        CMP     #$30
        BCC    @notADigit
        CMP    #$3C
        BCC    @gotToken_a5c9
@notADigit:
        ;; Remember where we are upto in the BASIC line of text
        STY    $71
        ;; Now reset the pointer into tokenlist
        LDY    #$00
        ;; And the token number minus $80 we are currently considering.
        ;; We start with token #0, since we search from the beginning.
        STY    $0B
        ;; Decrement Y from $00 to $FF, because the inner loop increments before processing
        ;; (Y here represents the offset in the tokenlist)
        DEY
        ;; Save BASIC execute pointer
        STX    $7A
        ;; Decrement X also, because the inner loop pre-increments
        DEX
@compareNextChar_a5b6:
        ;; Advance pointer in tokenlist
        jsr tokenListAdvancePointer
        ;; Advance pointer in BASIC text
        INX
@compareProgramTextAndToken:
        ;; Read byte of basic program
        LDA    $0200, X
        ;; Now subtract the byte from the token list.
        ;; If the character matches, we will get $00 as result.
        ;; If the character matches, but was ORd with $80, then $80 will be the
        ;; result.  This allows efficient detection of whether we have found the
        ;; end of a keyword.
        bit     token_hi_page_flag
        bmi    @useTokenListHighPage
        SEC
        SBC    tokenlist, Y
        jmp    @dontUseHighPage
@useTokenListHighPage:
        SEC
        SBC    tokenlist+$100,Y
@dontUseHighPage:
        ;; If zero, then compare the next character
        BEQ    @compareNextChar_a5b6
        ;; If $80, then it is the end of the token, and we have matched the token
        CMP    #$80
        BNE    @tokenDoesntMatch
        ;; A = $80, so if we add the token number stored in $0B, we get the actual
        ;; token number
        ORA    $0B
@tokeniseNextProgramCharacter:
        ;; Restore the saved index into the BASIC program line
        LDY    $71
@gotToken_a5c9:
        ;; We have worked out the token, so record it.
        INX
        INY
        STA    $0200 - 5, Y
        ;; Now check for end of line (token == $00)
        LDA    $0200 - 5, Y
        BEQ @tokeniseEndOfLine_a609

        ;; Now think about what we have to do with the token
        SEC
        SBC    #$3A
        BEQ    @tokenIsColon_a5dc
        CMP    #($83 - $3A) ; (=$49) Was it the token for DATA?
        BNE    @tokenMightBeREM_a5de
@tokenIsColon_a5dc:
        ;; Token was DATA
        STA    $0F    ; Store token - $3A (why?)
@tokenMightBeREM_a5de:
        SEC
        SBC    #($8F - $3A) ; (=$55) Was it the token for REM?
        BNE    @tokeniseNextChar
        ;; Was REM, so say we are searching for end of line (== $00)
        ;; (which is conveniently in A now)
        STA    $08   
@label_a5e5:
        ;; Read the next BASIC program byte
        LDA    $0200, X
        BEQ    @gotToken_a5c9
        ;; Does the next character match what we are searching for?
        CMP    $08
        ;; Yes, it matches, so indicate we have the token
        BEQ    @gotToken_a5c9

@foundQuotes_a5ee:
        ;; Not a match yet, so advance index for tokenised output
        INY
        ;; And write token to output
        STA    $0200 - 5, Y
        ;; Increment read index of basic program
        INX
        ;; Read the next BASIC byte (X should never be zero)
        BNE    @label_a5e5

@tokenDoesntMatch:
        ;; Restore BASIC execute pointer to start of the token we are looking at,
        ;; so that we can see if the next token matches
        LDX    $7A
        ;; Increase the token ID number, since the last one didn't match
        INC    $0B
        ;; Advance pointer in tokenlist from the end of the last token to the start
        ;; of the next token, ready to compare the BASIC program text with this token.
@advanceToNextTokenLoop:
        jsr     tokenListAdvancePointer
        jsr     tokenListReadByteMinus1
        BPL    @advanceToNextTokenLoop
        ;; Check if we have reached the end of the token list
        jsr    tokenListReadByte
        ;; If not, see if the program text matches this token
        BNE    @compareProgramTextAndToken

        ;; We reached the end of the token list without a match,
        ;; so copy this character to the output, and
        LDA    $0200, X
        ;; Then advance to the next character of the BASIC text
        ;; (BPL acts as unconditional branch, because only bytes with bit 7
        ;; cleared can get here).
        BPL    @tokeniseNextProgramCharacter
@tokeniseEndOfLine_a609:
        ;; Write end of line marker (== $00), which is conveniently in A already
        STA    $0200 - 3, Y
        ;; Decrement BASIC execute pointer high byte
        DEC    $7B
        ;; ... and set low byte to $FF
        LDA    #$FF
        STA    $7A
        RTS

tokenListAdvancePointer:   
        INY
        BNE    @dontAdvanceTokenListPage
        PHP
        PHA
        LDA    token_hi_page_flag
        EOR    #$FF
        STA    token_hi_page_flag
        ;; XXX Why on earth do we need these three NOPs here to correctly parse the extra
        ;; tokens? If you remove one, then the first token no longer parses, and the later
        ;; ones get parsed with token number one less than it should be!
        NOP
        NOP
        NOP
        PLA
        PLP
@dontAdvanceTokenListPage:
        PHP
        PHX
        PHA
        tya
        tax
        bit    token_hi_page_flag
        bmi    @page2
        jmp    @done
@page2:       
        @done:
       
        PLA
        PLX
        PLP
        RTS

tokenListReadByte:   
        bit     token_hi_page_flag
        bmi    @useTokenListHighPage
        LDA    tokenlist, Y
        RTS
@useTokenListHighPage:
        LDA    tokenlist+$100,Y
        RTS       

tokenListReadByteMinus1:   
        bit     token_hi_page_flag
        bmi    @useTokenListHighPage
        LDA    tokenlist - 1, Y
        RTS
@useTokenListHighPage:
        LDA    tokenlist - 1 + $100,Y
        RTS       


The detokenisation routine also heavily draws on the original C64 BASIC ROM's routine (located at $A71A).  This routine is rather simpler, because it doesn't need to do any string matching, or handling of shorted tokens using SHIFT+letter:

megabasic_detokenise:
        ;; The C64 detokenise routine lives at $A71A-$A741.
        ;; The routine is quite simple, reading through the token list,
        ;; decrementing the token number each time the end of at token is
        ;; found.  The only complications for us, is that we need to change
        ;; the parts where the token bytes are read from the list to allow
        ;; the list to be two pages long.

        ;; Print non-tokens directly
        bpl     jump_to_a6f3
        ;; Print PI directly
        cmp    #$ff
        beq    jump_to_a6f3
        ;; If in quote mode, print directly
        bit    $0f
        bmi     jump_to_a6f3

        ;; At this point, we know it to be a token

        ;; Tokens are $80-$FE, so subtract #$7F, to renormalise them
        ;; to the range $01-$7F
        SEC
        SBC    #$7F
        ;; Put the normalised token number into the X register, so that
        ;; we can easily count down
        TAX
        STY    $49     ; and store it somewhere necessary, apparently

        ;; Now get ready to find the string and output it.
        ;; Y is used as the offset in the token list, and gets pre-incremented
        ;; so we start with it equal to $00 - $01 = $FF
        LDY    #$FF
        ;; Set token_hi_page_flag to $FF, so that when Y increments for the first
        ;; time, it increments token_hi_page_flag, making it $00 for the first page of
        ;; the token list.
        STY    token_hi_page_flag

       
@detokeniseSearchLoop:
        ;; Decrement token index by 1
        DEX
        ;; If X = 0, this is the token, so read the bytes out
        beq    @thisIsTheToken
        ;; Since it is not this token, we need to skip over it
@detokeniseSkipLoop:
        jsr tokenListAdvancePointer
        jsr tokenListReadByte
        BPL    @detokeniseSkipLoop
        ;; Found end of token, loop to see if the next token is it
        BMI    @detokeniseSearchLoop
@thisIsTheToken:
        jsr tokenListAdvancePointer
        jsr tokenListReadByte
        ;; If it is the last byte of the token, return control to the LIST
        ;; command routine from the BASIC ROM
        BMI    jump_list_command_finish_printing_token_a6ef
        ;; As it is not the end of the token, print it out
        JSR    $AB47
        BNE    @thisIsTheToken

        ;; This can only be reached if the next byte in the token list is $00
        ;; This could only happen in C64 BASIC if the token ID following the
        ;; last is attempted to be detokenised.
        ;; This is the source of the REM SHIFT+L bug, as SHIFT+L gives the
        ;; character code $CC, which is exactly the token ID required, and
        ;; the C64 BASIC ROM code here simply fell through the FOR routine.
        ;; Actually, understanding this, makes it possible to write a program
        ;; that when LISTed, actually causes code to be executed!
        ;; However, this vulnerability appears not possible to be exploited,
        ;; because $0201, the next byte to be read from the input buffer during
        ;; the process, always has $00 in it when the FOR routine is run,
        ;; causing a failure when attempting to execute the FOR command.
        ;; Were this not the case, REM (SHIFT+L)I=1TO10:GOTO100, when listed
        ;; would actually cause GOTO100 to be run, thus allowing LIST to
        ;; actually run code. While still not a very strong form of source
        ;; protection, it could have been a rather fun thing to try.

        ;; Instead of having this error, we will just cause the character to
        ;; be printed normally.
        LDY    $49
jump_to_a6f3:   
        JMP     $A6F3
jump_list_command_finish_printing_token_a6ef:
        JMP    $A6EF


The one piece of interest, is that this routine is responsible for the REM SHIFT+L bug when listing a program on the C64. This bug *almost* creates a "code injection" bug in the C64's BASIC, as described in the comments above.  The bug is caused by detokenising SHIFT-L causes a fall-through at the end of the detokenisation routine into the start of the FOR command's routine in the ROM. However, as I describe in the comments above, the contents of the next byte of input when detokenising seems to always be $00 or unusable, which will cause a syntax error in the FOR routine, when it tries to parse the variable it should be using as the loop iterator.  Actually, digging a bit deeper, the next token byte it reads is whatever followed the LIST command.  However, it is only really possible to have a colon or $00 there, so I think it is still not exploitable.
I'd be very interested to hear if anyone can think of a way to exploit this, so that LISTing a newly loaded program causes code of your choosing to be executed.

Speaking of execution, that just leaves our token execution routine to present:

megabasic_execute:       
        JSR    $0073
        ;; Is it a MEGA BASIC primary keyword?
        CMP    #$CC
        BCC    @basic2_token
        CMP    #token_first_sub_command
        BCC    megabasic_execute_token
        ;; Handle PI
        CMP    #$FF
        BEQ    @basic2_token
        ;; Else, it must be a MEGA BASIC secondary keyword
        ;; You can't use those alone, so ILLEGAL DIRECT ERROR
        jmp megabasic_perform_illegal_direct_error
@basic2_token:
        ;; $A7E7 expects Z flag set if ==$00, so update it
        CMP    #$00
        JMP    $A7E7

megabasic_execute_token:
        ;; Normalise index of new token
        SEC
        SBC     #$CC
        ASL
        ;; Clip it to make sure we don't have any overflow of the jump table
        AND    #$0E
        TAX
        PHX
        ;; Get next token/character ready
        JSR    $0073
        PLX
        JMP     (newtoken_jumptable,X)

        ;; Tokens are $CC-$FE, so to be safe, we need to have a jump
newtoken_jumptable:
        .word     megabasic_perform_fast
        .word     megabasic_perform_slow
        .word    megabasic_perform_canvas ; canvas operations, including copy/stamping, clearing, creating new
        .word    megabasic_perform_colour ; set colours
        .word    megabasic_perform_tile ; "TILE" command, used for TILESET and other purposes
        .word    megabasic_perform_syntax_error ; "SET" SYNTAXERROR: Used only with TILE to make TILESET
        .word     megabasic_perform_syntax_error
        .word     megabasic_perform_syntax_error
        .word     megabasic_perform_syntax_error

        basic2_main_loop     =    $A7AE

This routine is really quite simple: Check if the token is not one of ours, in which case jump to the normal BASIC 2 routine, else work out which one, and jump to the appropriate routine.  I have cheated slightly, and used a C65/M65 new addressing mode for JMP, that makes jump tables much easier to implement.

Each of the routines then does what is required of it, before jumping to an error routine, or back to the BASIC 2 main loop. Here is the FAST command, which is nice and simple:


megabasic_perform_fast:
        jsr    enable_viciv
        LDA    #$40
        TSB    $D054
        TSB    d054_bits
        JMP    basic2_main_loop


As you can see, there is really nothing to it.  If you want to read arguments and the like, then you have to call the same routines that the BASIC ROM would use to do the same.  The best way to discover these is to look at an existing routine that implements a BASIC command, and see how it does it.  If you want to see how we have used them, or indeed to look at how these routines are strung together to make a fully-functioning BASIC extension, the source code for MEGA BASIC (in its currently unfinished state) is all here on github.

So there you have it.  I'll hopefully post an update soon that shows some more of the progress that has been made on MEGA BASIC. In particular, the CANVAS command and its variations are now almost complete, with the exception of the commands to get and set individual tiles in a canvas.