There are three main functions that you have to modify for this: (1) tokenisation (the part where words get turned into single-byte values when stored in a program; (2) detokenisation (the part where those get turned back into readable words, which really only matters for the LIST command); and (3) executing tokens. In this post, I'll show you how I implemented each one. This post will go into quite low-level detail of this, which might not be interesting to some, so please accept my apologies in advance if that is the case. However, for others who wish to implement an extension for the C64 BASIC, it would seem to be the only publicly released breakdown of how to make one, so I think it is warranted.
Before I dive into the tokenisation and detokenisation routines, it is worth mentioning a common problem to them both: The existing list of BASIC keywords as stored in the BASIC ROM is 254 bytes long. By keeping it under 256 bytes, the tokenisation and detokenisation routines are able to be somewhat simpler, by using only index registers to scan through the list. This has to change if we want to allow more keywords. As I don't anticipate making more keywords than 2*256 bytes, I was able to make changes that only worried about checking which half of the list is being read.
My token list is simply formed by taking the standard 255 bytes from the BASIC ROM at $A09E, and appending my new ones to the end:
tokenlist:
;; Reserve space for C64 BASIC token list, less the end $00 marker
;; (I copy these in place with a little loop elsewhere)
.res ($A19C - $A09E + 1), $00
;; End of list marker (remove to enable new tokens)
;.byte $00
;; Now we have our new tokens
;; extra_token_count must be correctly set to the number of tokens
;; (This lists only the number of tokens that are good for direct use.
;; Keywords found only within statements are not in this tally.)
extra_token_count = 5
token_fast = $CC + 0
.byte "FAS",'T'+$80
token_slow = $CC + 1
.byte "SLO",'W'+$80
token_canvas = $CC + 2
.byte "CANVA",'S'+$80
token_colour = $CC + 3
.byte "COLOU",'R'+$80
token_tile = $CC + 4
.byte "TIL",'E'+$80
token_first_sub_command = token_tile + 1
;; These tokens are keywords used within other
;; commands, not as executable commands. These
;; will all generate syntax errors.
token_text = token_first_sub_command + 0
.byte "TEX",'T'+$80
token_sprite = token_first_sub_command + 1
.byte "SPRIT",'E'+$80
token_screen = token_first_sub_command + 2
.byte "SCREE",'N'+$80
token_border = token_first_sub_command + 3
.byte "BORDE",'R'+$80
token_set = token_first_sub_command + 4
.byte "SE",'T'+$80
token_delete = token_first_sub_command + 5
.byte "DELET",'E'+$80
token_stamp = token_first_sub_command + 6
.byte "STAM",'P'+$80
token_at = token_first_sub_command + 7
.byte "A",'T'+$80
token_from = token_first_sub_command + 8
.byte "FRO",'M'+$80
;; And the end byte
.byte $00
Another key issue is that most of the possible token values are already used. Only token values between $CD and $FE are available, i.e., only 49 of them. If I want to avoid double-byte tokens (which I really do), then I need to keep my new keywords to less than that number. We'll see if I manage to do that, but for now, we will try. On both these points, I suspect that BASIC 10 on the C65 probably exceeds these limits, thus making the tokeniser and detokeniser more complex than they are in BASIC 2 on the C64.
The tokeniser
The tokeniser is the routine that scans a line of BASIC text, and changes keywords into single-byte tokens. It has to take account of various quirks, such as the PI character officially being a token with the special value $FF, and know whether parts of the line are inside quotation marks or follow a REM statement. This all makes the logic of the tokeniser routine rather more subtle and complex than one might first imagine. As a result, I figured it was safer to essentially replicate the functionality of the standard tokeniser routine as exactly as possible, adding only the ability to scan a keyword list that is upto 511 bytes long, instead of 255 bytes long.This routine is effectively the same as in the C64 BASIC ROM at $A57C, but with the changes to allow the token list to be two pages long. I have also tried to more fully document the routine in the comments and labels. Some of the labels have address suffixes on them, that indicate where in the original C64 BASIC ROM routine they were located, to make it easier to reference between the two of them.
The algorithm is basically one of seeing if the current character might be the start of a keyword that should be tokenised, and if so, doing a string match against it. Tokens are stored in the token list with bit 7 set on the last character, so, for example RUN is stored as $52 $55 $CE, where $CE = $4E + $80 to represent the terminal N of the word. This trick is also how the short-cuts for keywords using shifted letters works: If the result of comparing the input character and the keyword letter is $80, then it is either the terminal character of the keyword, or a shifted letter in the input being tokenised. This is also why typing something like RU(shift-N) won't work.
It also occurs to me that this allows for a rather subtle bug, where if you typed a keyword with the last character shifted, and then characters for the following token, it will match against the first token, so typing RU(shift-N)I(shift-F)RESTORE would actually be interpreted and tokenised as though you had just typed RUN. Quite what use you might make of this, I have no idea, but I have confirmed that it does indeed work. I haven't bothered to try to fix this little problem in my extension, but just report it here in case it is of curiosity to anyone. I'd be interested to know if anyone else had previously discovered this quirk.
Back to the task at hand, for the longer token list, I have added a "hi page flag" that is set when accessing the upper of the two pages of tokens, and clear for the lower. In places this is initialised to $FF when set, so that INC token_hi_page_flag causes it to be cleared. This is used in particular near the start following tokeniseNextChar, so that the routine can continue to pre-increment the pointer into the token list (which is normally just held in the Y register).
This means that the logic for advancing and, retreating the token pointer is no longer as simple as INY or DEY. To keep things understandable, I have broken out the code to do this into tokenListReadByte, tokenListAdvancePointer etc.
Beyond that, there is not really much more to say about it, without having to dive deep into explaining every little subtly about the C64 BASIC tokenise routine, other than to repeat that this tokeniser supports only single-byte tokens. To allow more tokens, we would need to maintain a 16-bit counter for the token value, and if it were >253, come up with a double-byte token, e.g., $FE <second byte>. i.e., token 254 would be the first token to require two bytes, because token $FF is already used for PI, and we need to have at least one token reserved as a kind of escape code, which above I have suggested $FE as the logical candidate for this role. If I end up using too many tokens, that is probably the method that I would use.
megabasic_tokenise:
;; Get the basic execute pointer low byte
LDX $7A
;; Set the save index
LDY #$04
;; Clear the quote/data flag
STY $0F
@tokeniseNextChar:
;; Get hi page flag for tokenlist scanning, so that if we INC it, it will
;; point back to the first page. As we start with offset = $FF, the first
;; increment will do this. Since offsets are pre-incremented, this means
;; that it will switch to the low page at the outset, and won't switch again
;; until a full page has been stepped through.
PHA
LDA #$FF
STA token_hi_page_flag
PLA
;; Read a byte from the input buffer
LDA $0200,X
;; If bit 7 is clear, try to tokenise
BPL @tryTokenise
;; Now check for PI (char $FF)
CMP #$FF ; = PI
BEQ @gotToken_a5c9
;; Not PI, but bit 7 is set, so just skip over it, and don't store
INX
BNE @tokeniseNextChar
@tryTokenise:
;; Now look for some common things
;; Is it a space?
CMP #$20 ; space
BEQ @gotToken_a5c9
;; Not space, so save byte as search character
STA $08
CMP #$22 ; quote marks
BEQ @foundQuotes_a5ee
BIT $0F ; Check quote/data mode
BVS @gotToken_a5c9 ; If data mode, accept as is
CMP #$3F ; Is it a "?" (short cut for PRINT)
BNE @notQuestionMark
LDA #$99 ; Token for PRINT
BNE @gotToken_a5c9 ; Accept the print token (branch always taken, because $99 != $00)
@notQuestionMark:
;; Check for 0-9, : or ;
CMP #$30
BCC @notADigit
CMP #$3C
BCC @gotToken_a5c9
@notADigit:
;; Remember where we are upto in the BASIC line of text
STY $71
;; Now reset the pointer into tokenlist
LDY #$00
;; And the token number minus $80 we are currently considering.
;; We start with token #0, since we search from the beginning.
STY $0B
;; Decrement Y from $00 to $FF, because the inner loop increments before processing
;; (Y here represents the offset in the tokenlist)
DEY
;; Save BASIC execute pointer
STX $7A
;; Decrement X also, because the inner loop pre-increments
DEX
@compareNextChar_a5b6:
;; Advance pointer in tokenlist
jsr tokenListAdvancePointer
;; Advance pointer in BASIC text
INX
@compareProgramTextAndToken:
;; Read byte of basic program
LDA $0200, X
;; Now subtract the byte from the token list.
;; If the character matches, we will get $00 as result.
;; If the character matches, but was ORd with $80, then $80 will be the
;; result. This allows efficient detection of whether we have found the
;; end of a keyword.
bit token_hi_page_flag
bmi @useTokenListHighPage
SEC
SBC tokenlist, Y
jmp @dontUseHighPage
@useTokenListHighPage:
SEC
SBC tokenlist+$100,Y
@dontUseHighPage:
;; If zero, then compare the next character
BEQ @compareNextChar_a5b6
;; If $80, then it is the end of the token, and we have matched the token
CMP #$80
BNE @tokenDoesntMatch
;; A = $80, so if we add the token number stored in $0B, we get the actual
;; token number
ORA $0B
@tokeniseNextProgramCharacter:
;; Restore the saved index into the BASIC program line
LDY $71
@gotToken_a5c9:
;; We have worked out the token, so record it.
INX
INY
STA $0200 - 5, Y
;; Now check for end of line (token == $00)
LDA $0200 - 5, Y
BEQ @tokeniseEndOfLine_a609
;; Now think about what we have to do with the token
SEC
SBC #$3A
BEQ @tokenIsColon_a5dc
CMP #($83 - $3A) ; (=$49) Was it the token for DATA?
BNE @tokenMightBeREM_a5de
@tokenIsColon_a5dc:
;; Token was DATA
STA $0F ; Store token - $3A (why?)
@tokenMightBeREM_a5de:
SEC
SBC #($8F - $3A) ; (=$55) Was it the token for REM?
BNE @tokeniseNextChar
;; Was REM, so say we are searching for end of line (== $00)
;; (which is conveniently in A now)
STA $08
@label_a5e5:
;; Read the next BASIC program byte
LDA $0200, X
BEQ @gotToken_a5c9
;; Does the next character match what we are searching for?
CMP $08
;; Yes, it matches, so indicate we have the token
BEQ @gotToken_a5c9
@foundQuotes_a5ee:
;; Not a match yet, so advance index for tokenised output
INY
;; And write token to output
STA $0200 - 5, Y
;; Increment read index of basic program
INX
;; Read the next BASIC byte (X should never be zero)
BNE @label_a5e5
@tokenDoesntMatch:
;; Restore BASIC execute pointer to start of the token we are looking at,
;; so that we can see if the next token matches
LDX $7A
;; Increase the token ID number, since the last one didn't match
INC $0B
;; Advance pointer in tokenlist from the end of the last token to the start
;; of the next token, ready to compare the BASIC program text with this token.
@advanceToNextTokenLoop:
jsr tokenListAdvancePointer
jsr tokenListReadByteMinus1
BPL @advanceToNextTokenLoop
;; Check if we have reached the end of the token list
jsr tokenListReadByte
;; If not, see if the program text matches this token
BNE @compareProgramTextAndToken
;; We reached the end of the token list without a match,
;; so copy this character to the output, and
LDA $0200, X
;; Then advance to the next character of the BASIC text
;; (BPL acts as unconditional branch, because only bytes with bit 7
;; cleared can get here).
BPL @tokeniseNextProgramCharacter
@tokeniseEndOfLine_a609:
;; Write end of line marker (== $00), which is conveniently in A already
STA $0200 - 3, Y
;; Decrement BASIC execute pointer high byte
DEC $7B
;; ... and set low byte to $FF
LDA #$FF
STA $7A
RTS
tokenListAdvancePointer:
INY
BNE @dontAdvanceTokenListPage
PHP
PHA
LDA token_hi_page_flag
EOR #$FF
STA token_hi_page_flag
;; XXX Why on earth do we need these three NOPs here to correctly parse the extra
;; tokens? If you remove one, then the first token no longer parses, and the later
;; ones get parsed with token number one less than it should be!
NOP
NOP
NOP
PLA
PLP
@dontAdvanceTokenListPage:
PHP
PHX
PHA
tya
tax
bit token_hi_page_flag
bmi @page2
jmp @done
@page2:
@done:
PLA
PLX
PLP
RTS
tokenListReadByte:
bit token_hi_page_flag
bmi @useTokenListHighPage
LDA tokenlist, Y
RTS
@useTokenListHighPage:
LDA tokenlist+$100,Y
RTS
tokenListReadByteMinus1:
bit token_hi_page_flag
bmi @useTokenListHighPage
LDA tokenlist - 1, Y
RTS
@useTokenListHighPage:
LDA tokenlist - 1 + $100,Y
RTS
The detokenisation routine also heavily draws on the original C64 BASIC ROM's routine (located at $A71A). This routine is rather simpler, because it doesn't need to do any string matching, or handling of shorted tokens using SHIFT+letter:
megabasic_detokenise:
;; The C64 detokenise routine lives at $A71A-$A741.
;; The routine is quite simple, reading through the token list,
;; decrementing the token number each time the end of at token is
;; found. The only complications for us, is that we need to change
;; the parts where the token bytes are read from the list to allow
;; the list to be two pages long.
;; Print non-tokens directly
bpl jump_to_a6f3
;; Print PI directly
cmp #$ff
beq jump_to_a6f3
;; If in quote mode, print directly
bit $0f
bmi jump_to_a6f3
;; At this point, we know it to be a token
;; Tokens are $80-$FE, so subtract #$7F, to renormalise them
;; to the range $01-$7F
SEC
SBC #$7F
;; Put the normalised token number into the X register, so that
;; we can easily count down
TAX
STY $49 ; and store it somewhere necessary, apparently
;; Now get ready to find the string and output it.
;; Y is used as the offset in the token list, and gets pre-incremented
;; so we start with it equal to $00 - $01 = $FF
LDY #$FF
;; Set token_hi_page_flag to $FF, so that when Y increments for the first
;; time, it increments token_hi_page_flag, making it $00 for the first page of
;; the token list.
STY token_hi_page_flag
@detokeniseSearchLoop:
;; Decrement token index by 1
DEX
;; If X = 0, this is the token, so read the bytes out
beq @thisIsTheToken
;; Since it is not this token, we need to skip over it
@detokeniseSkipLoop:
jsr tokenListAdvancePointer
jsr tokenListReadByte
BPL @detokeniseSkipLoop
;; Found end of token, loop to see if the next token is it
BMI @detokeniseSearchLoop
@thisIsTheToken:
jsr tokenListAdvancePointer
jsr tokenListReadByte
;; If it is the last byte of the token, return control to the LIST
;; command routine from the BASIC ROM
BMI jump_list_command_finish_printing_token_a6ef
;; As it is not the end of the token, print it out
JSR $AB47
BNE @thisIsTheToken
;; This can only be reached if the next byte in the token list is $00
;; This could only happen in C64 BASIC if the token ID following the
;; last is attempted to be detokenised.
;; This is the source of the REM SHIFT+L bug, as SHIFT+L gives the
;; character code $CC, which is exactly the token ID required, and
;; the C64 BASIC ROM code here simply fell through the FOR routine.
;; Actually, understanding this, makes it possible to write a program
;; that when LISTed, actually causes code to be executed!
;; However, this vulnerability appears not possible to be exploited,
;; because $0201, the next byte to be read from the input buffer during
;; the process, always has $00 in it when the FOR routine is run,
;; causing a failure when attempting to execute the FOR command.
;; Were this not the case, REM (SHIFT+L)I=1TO10:GOTO100, when listed
;; would actually cause GOTO100 to be run, thus allowing LIST to
;; actually run code. While still not a very strong form of source
;; protection, it could have been a rather fun thing to try.
;; Instead of having this error, we will just cause the character to
;; be printed normally.
LDY $49
jump_to_a6f3:
JMP $A6F3
jump_list_command_finish_printing_token_a6ef:
JMP $A6EF
The one piece of interest, is that this routine is responsible for the REM SHIFT+L bug when listing a program on the C64. This bug *almost* creates a "code injection" bug in the C64's BASIC, as described in the comments above. The bug is caused by detokenising SHIFT-L causes a fall-through at the end of the detokenisation routine into the start of the FOR command's routine in the ROM. However, as I describe in the comments above, the contents of the next byte of input when detokenising seems to always be $00 or unusable, which will cause a syntax error in the FOR routine, when it tries to parse the variable it should be using as the loop iterator. Actually, digging a bit deeper, the next token byte it reads is whatever followed the LIST command. However, it is only really possible to have a colon or $00 there, so I think it is still not exploitable.
I'd be very interested to hear if anyone can think of a way to exploit this, so that LISTing a newly loaded program causes code of your choosing to be executed.
Speaking of execution, that just leaves our token execution routine to present:
megabasic_execute:
JSR $0073
;; Is it a MEGA BASIC primary keyword?
CMP #$CC
BCC @basic2_token
CMP #token_first_sub_command
BCC megabasic_execute_token
;; Handle PI
CMP #$FF
BEQ @basic2_token
;; Else, it must be a MEGA BASIC secondary keyword
;; You can't use those alone, so ILLEGAL DIRECT ERROR
jmp megabasic_perform_illegal_direct_error
@basic2_token:
;; $A7E7 expects Z flag set if ==$00, so update it
CMP #$00
JMP $A7E7
megabasic_execute_token:
;; Normalise index of new token
SEC
SBC #$CC
ASL
;; Clip it to make sure we don't have any overflow of the jump table
AND #$0E
TAX
PHX
;; Get next token/character ready
JSR $0073
PLX
JMP (newtoken_jumptable,X)
;; Tokens are $CC-$FE, so to be safe, we need to have a jump
newtoken_jumptable:
.word megabasic_perform_fast
.word megabasic_perform_slow
.word megabasic_perform_canvas ; canvas operations, including copy/stamping, clearing, creating new
.word megabasic_perform_colour ; set colours
.word megabasic_perform_tile ; "TILE" command, used for TILESET and other purposes
.word megabasic_perform_syntax_error ; "SET" SYNTAXERROR: Used only with TILE to make TILESET
.word megabasic_perform_syntax_error
.word megabasic_perform_syntax_error
.word megabasic_perform_syntax_error
basic2_main_loop = $A7AE
This routine is really quite simple: Check if the token is not one of ours, in which case jump to the normal BASIC 2 routine, else work out which one, and jump to the appropriate routine. I have cheated slightly, and used a C65/M65 new addressing mode for JMP, that makes jump tables much easier to implement.
Each of the routines then does what is required of it, before jumping to an error routine, or back to the BASIC 2 main loop. Here is the FAST command, which is nice and simple:
megabasic_perform_fast:
jsr enable_viciv
LDA #$40
TSB $D054
TSB d054_bits
JMP basic2_main_loop
As you can see, there is really nothing to it. If you want to read arguments and the like, then you have to call the same routines that the BASIC ROM would use to do the same. The best way to discover these is to look at an existing routine that implements a BASIC command, and see how it does it. If you want to see how we have used them, or indeed to look at how these routines are strung together to make a fully-functioning BASIC extension, the source code for MEGA BASIC (in its currently unfinished state) is all here on github.
So there you have it. I'll hopefully post an update soon that shows some more of the progress that has been made on MEGA BASIC. In particular, the CANVAS command and its variations are now almost complete, with the exception of the commands to get and set individual tiles in a canvas.