There are three main functions that you have to modify for this: (1) tokenisation (the part where words get turned into single-byte values when stored in a program; (2) detokenisation (the part where those get turned back into readable words, which really only matters for the LIST command); and (3) executing tokens. In this post, I'll show you how I implemented each one. This post will go into quite low-level detail of this, which might not be interesting to some, so please accept my apologies in advance if that is the case. However, for others who wish to implement an extension for the C64 BASIC, it would seem to be the only publicly released breakdown of how to make one, so I think it is warranted.
Before I dive into the tokenisation and detokenisation routines, it is worth mentioning a common problem to them both: The existing list of BASIC keywords as stored in the BASIC ROM is 254 bytes long. By keeping it under 256 bytes, the tokenisation and detokenisation routines are able to be somewhat simpler, by using only index registers to scan through the list. This has to change if we want to allow more keywords. As I don't anticipate making more keywords than 2*256 bytes, I was able to make changes that only worried about checking which half of the list is being read.
My token list is simply formed by taking the standard 255 bytes from the BASIC ROM at $A09E, and appending my new ones to the end:
;; Reserve space for C64 BASIC token list, less the end $00 marker
;; (I copy these in place with a little loop elsewhere)
.res ($A19C - $A09E + 1), $00
;; End of list marker (remove to enable new tokens)
;; Now we have our new tokens
;; extra_token_count must be correctly set to the number of tokens
;; (This lists only the number of tokens that are good for direct use.
;; Keywords found only within statements are not in this tally.)
extra_token_count = 5
token_fast = $CC + 0
token_slow = $CC + 1
token_canvas = $CC + 2
token_colour = $CC + 3
token_tile = $CC + 4
token_first_sub_command = token_tile + 1
;; These tokens are keywords used within other
;; commands, not as executable commands. These
;; will all generate syntax errors.
token_text = token_first_sub_command + 0
token_sprite = token_first_sub_command + 1
token_screen = token_first_sub_command + 2
token_border = token_first_sub_command + 3
token_set = token_first_sub_command + 4
token_delete = token_first_sub_command + 5
token_stamp = token_first_sub_command + 6
token_at = token_first_sub_command + 7
token_from = token_first_sub_command + 8
;; And the end byte
Another key issue is that most of the possible token values are already used. Only token values between $CD and $FE are available, i.e., only 49 of them. If I want to avoid double-byte tokens (which I really do), then I need to keep my new keywords to less than that number. We'll see if I manage to do that, but for now, we will try. On both these points, I suspect that BASIC 10 on the C65 probably exceeds these limits, thus making the tokeniser and detokeniser more complex than they are in BASIC 2 on the C64.
The tokeniserThe tokeniser is the routine that scans a line of BASIC text, and changes keywords into single-byte tokens. It has to take account of various quirks, such as the PI character officially being a token with the special value $FF, and know whether parts of the line are inside quotation marks or follow a REM statement. This all makes the logic of the tokeniser routine rather more subtle and complex than one might first imagine. As a result, I figured it was safer to essentially replicate the functionality of the standard tokeniser routine as exactly as possible, adding only the ability to scan a keyword list that is upto 511 bytes long, instead of 255 bytes long.
This routine is effectively the same as in the C64 BASIC ROM at $A57C, but with the changes to allow the token list to be two pages long. I have also tried to more fully document the routine in the comments and labels. Some of the labels have address suffixes on them, that indicate where in the original C64 BASIC ROM routine they were located, to make it easier to reference between the two of them.
The algorithm is basically one of seeing if the current character might be the start of a keyword that should be tokenised, and if so, doing a string match against it. Tokens are stored in the token list with bit 7 set on the last character, so, for example RUN is stored as $52 $55 $CE, where $CE = $4E + $80 to represent the terminal N of the word. This trick is also how the short-cuts for keywords using shifted letters works: If the result of comparing the input character and the keyword letter is $80, then it is either the terminal character of the keyword, or a shifted letter in the input being tokenised. This is also why typing something like RU(shift-N) won't work.
It also occurs to me that this allows for a rather subtle bug, where if you typed a keyword with the last character shifted, and then characters for the following token, it will match against the first token, so typing RU(shift-N)I(shift-F)RESTORE would actually be interpreted and tokenised as though you had just typed RUN. Quite what use you might make of this, I have no idea, but I have confirmed that it does indeed work. I haven't bothered to try to fix this little problem in my extension, but just report it here in case it is of curiosity to anyone. I'd be interested to know if anyone else had previously discovered this quirk.
Back to the task at hand, for the longer token list, I have added a "hi page flag" that is set when accessing the upper of the two pages of tokens, and clear for the lower. In places this is initialised to $FF when set, so that INC token_hi_page_flag causes it to be cleared. This is used in particular near the start following tokeniseNextChar, so that the routine can continue to pre-increment the pointer into the token list (which is normally just held in the Y register).
This means that the logic for advancing and, retreating the token pointer is no longer as simple as INY or DEY. To keep things understandable, I have broken out the code to do this into tokenListReadByte, tokenListAdvancePointer etc.
Beyond that, there is not really much more to say about it, without having to dive deep into explaining every little subtly about the C64 BASIC tokenise routine, other than to repeat that this tokeniser supports only single-byte tokens. To allow more tokens, we would need to maintain a 16-bit counter for the token value, and if it were >253, come up with a double-byte token, e.g., $FE <second byte>. i.e., token 254 would be the first token to require two bytes, because token $FF is already used for PI, and we need to have at least one token reserved as a kind of escape code, which above I have suggested $FE as the logical candidate for this role. If I end up using too many tokens, that is probably the method that I would use.
;; Get the basic execute pointer low byte
;; Set the save index
;; Clear the quote/data flag
;; Get hi page flag for tokenlist scanning, so that if we INC it, it will
;; point back to the first page. As we start with offset = $FF, the first
;; increment will do this. Since offsets are pre-incremented, this means
;; that it will switch to the low page at the outset, and won't switch again
;; until a full page has been stepped through.
;; Read a byte from the input buffer
;; If bit 7 is clear, try to tokenise
;; Now check for PI (char $FF)
CMP #$FF ; = PI
;; Not PI, but bit 7 is set, so just skip over it, and don't store
;; Now look for some common things
;; Is it a space?
CMP #$20 ; space
;; Not space, so save byte as search character
CMP #$22 ; quote marks
BIT $0F ; Check quote/data mode
BVS @gotToken_a5c9 ; If data mode, accept as is
CMP #$3F ; Is it a "?" (short cut for PRINT)
LDA #$99 ; Token for PRINT
BNE @gotToken_a5c9 ; Accept the print token (branch always taken, because $99 != $00)
;; Check for 0-9, : or ;
;; Remember where we are upto in the BASIC line of text
;; Now reset the pointer into tokenlist
;; And the token number minus $80 we are currently considering.
;; We start with token #0, since we search from the beginning.
;; Decrement Y from $00 to $FF, because the inner loop increments before processing
;; (Y here represents the offset in the tokenlist)
;; Save BASIC execute pointer
;; Decrement X also, because the inner loop pre-increments
;; Advance pointer in tokenlist
;; Advance pointer in BASIC text
;; Read byte of basic program
LDA $0200, X
;; Now subtract the byte from the token list.
;; If the character matches, we will get $00 as result.
;; If the character matches, but was ORd with $80, then $80 will be the
;; result. This allows efficient detection of whether we have found the
;; end of a keyword.
SBC tokenlist, Y
;; If zero, then compare the next character
;; If $80, then it is the end of the token, and we have matched the token
;; A = $80, so if we add the token number stored in $0B, we get the actual
;; token number
;; Restore the saved index into the BASIC program line
;; We have worked out the token, so record it.
STA $0200 - 5, Y
;; Now check for end of line (token == $00)
LDA $0200 - 5, Y
;; Now think about what we have to do with the token
CMP #($83 - $3A) ; (=$49) Was it the token for DATA?
;; Token was DATA
STA $0F ; Store token - $3A (why?)
SBC #($8F - $3A) ; (=$55) Was it the token for REM?
;; Was REM, so say we are searching for end of line (== $00)
;; (which is conveniently in A now)
;; Read the next BASIC program byte
LDA $0200, X
;; Does the next character match what we are searching for?
;; Yes, it matches, so indicate we have the token
;; Not a match yet, so advance index for tokenised output
;; And write token to output
STA $0200 - 5, Y
;; Increment read index of basic program
;; Read the next BASIC byte (X should never be zero)
;; Restore BASIC execute pointer to start of the token we are looking at,
;; so that we can see if the next token matches
;; Increase the token ID number, since the last one didn't match
;; Advance pointer in tokenlist from the end of the last token to the start
;; of the next token, ready to compare the BASIC program text with this token.
;; Check if we have reached the end of the token list
;; If not, see if the program text matches this token
;; We reached the end of the token list without a match,
;; so copy this character to the output, and
LDA $0200, X
;; Then advance to the next character of the BASIC text
;; (BPL acts as unconditional branch, because only bytes with bit 7
;; cleared can get here).
;; Write end of line marker (== $00), which is conveniently in A already
STA $0200 - 3, Y
;; Decrement BASIC execute pointer high byte
;; ... and set low byte to $FF
;; XXX Why on earth do we need these three NOPs here to correctly parse the extra
;; tokens? If you remove one, then the first token no longer parses, and the later
;; ones get parsed with token number one less than it should be!
LDA tokenlist, Y
LDA tokenlist - 1, Y
LDA tokenlist - 1 + $100,Y
The detokenisation routine also heavily draws on the original C64 BASIC ROM's routine (located at $A71A). This routine is rather simpler, because it doesn't need to do any string matching, or handling of shorted tokens using SHIFT+letter:
;; The C64 detokenise routine lives at $A71A-$A741.
;; The routine is quite simple, reading through the token list,
;; decrementing the token number each time the end of at token is
;; found. The only complications for us, is that we need to change
;; the parts where the token bytes are read from the list to allow
;; the list to be two pages long.
;; Print non-tokens directly
;; Print PI directly
;; If in quote mode, print directly
;; At this point, we know it to be a token
;; Tokens are $80-$FE, so subtract #$7F, to renormalise them
;; to the range $01-$7F
;; Put the normalised token number into the X register, so that
;; we can easily count down
STY $49 ; and store it somewhere necessary, apparently
;; Now get ready to find the string and output it.
;; Y is used as the offset in the token list, and gets pre-incremented
;; so we start with it equal to $00 - $01 = $FF
;; Set token_hi_page_flag to $FF, so that when Y increments for the first
;; time, it increments token_hi_page_flag, making it $00 for the first page of
;; the token list.
;; Decrement token index by 1
;; If X = 0, this is the token, so read the bytes out
;; Since it is not this token, we need to skip over it
;; Found end of token, loop to see if the next token is it
;; If it is the last byte of the token, return control to the LIST
;; command routine from the BASIC ROM
;; As it is not the end of the token, print it out
;; This can only be reached if the next byte in the token list is $00
;; This could only happen in C64 BASIC if the token ID following the
;; last is attempted to be detokenised.
;; This is the source of the REM SHIFT+L bug, as SHIFT+L gives the
;; character code $CC, which is exactly the token ID required, and
;; the C64 BASIC ROM code here simply fell through the FOR routine.
;; Actually, understanding this, makes it possible to write a program
;; that when LISTed, actually causes code to be executed!
;; However, this vulnerability appears not possible to be exploited,
;; because $0201, the next byte to be read from the input buffer during
;; the process, always has $00 in it when the FOR routine is run,
;; causing a failure when attempting to execute the FOR command.
;; Were this not the case, REM (SHIFT+L)I=1TO10:GOTO100, when listed
;; would actually cause GOTO100 to be run, thus allowing LIST to
;; actually run code. While still not a very strong form of source
;; protection, it could have been a rather fun thing to try.
;; Instead of having this error, we will just cause the character to
;; be printed normally.
The one piece of interest, is that this routine is responsible for the REM SHIFT+L bug when listing a program on the C64. This bug *almost* creates a "code injection" bug in the C64's BASIC, as described in the comments above. The bug is caused by detokenising SHIFT-L causes a fall-through at the end of the detokenisation routine into the start of the FOR command's routine in the ROM. However, as I describe in the comments above, the contents of the next byte of input when detokenising seems to always be $00 or unusable, which will cause a syntax error in the FOR routine, when it tries to parse the variable it should be using as the loop iterator. Actually, digging a bit deeper, the next token byte it reads is whatever followed the LIST command. However, it is only really possible to have a colon or $00 there, so I think it is still not exploitable.
I'd be very interested to hear if anyone can think of a way to exploit this, so that LISTing a newly loaded program causes code of your choosing to be executed.
Speaking of execution, that just leaves our token execution routine to present:
;; Is it a MEGA BASIC primary keyword?
;; Handle PI
;; Else, it must be a MEGA BASIC secondary keyword
;; You can't use those alone, so ILLEGAL DIRECT ERROR
;; $A7E7 expects Z flag set if ==$00, so update it
;; Normalise index of new token
;; Clip it to make sure we don't have any overflow of the jump table
;; Get next token/character ready
;; Tokens are $CC-$FE, so to be safe, we need to have a jump
.word megabasic_perform_canvas ; canvas operations, including copy/stamping, clearing, creating new
.word megabasic_perform_colour ; set colours
.word megabasic_perform_tile ; "TILE" command, used for TILESET and other purposes
.word megabasic_perform_syntax_error ; "SET" SYNTAXERROR: Used only with TILE to make TILESET
basic2_main_loop = $A7AE
This routine is really quite simple: Check if the token is not one of ours, in which case jump to the normal BASIC 2 routine, else work out which one, and jump to the appropriate routine. I have cheated slightly, and used a C65/M65 new addressing mode for JMP, that makes jump tables much easier to implement.
Each of the routines then does what is required of it, before jumping to an error routine, or back to the BASIC 2 main loop. Here is the FAST command, which is nice and simple:
As you can see, there is really nothing to it. If you want to read arguments and the like, then you have to call the same routines that the BASIC ROM would use to do the same. The best way to discover these is to look at an existing routine that implements a BASIC command, and see how it does it. If you want to see how we have used them, or indeed to look at how these routines are strung together to make a fully-functioning BASIC extension, the source code for MEGA BASIC (in its currently unfinished state) is all here on github.
So there you have it. I'll hopefully post an update soon that shows some more of the progress that has been made on MEGA BASIC. In particular, the CANVAS command and its variations are now almost complete, with the exception of the commands to get and set individual tiles in a canvas.