Monday, February 12, 2018

Extending The Commodore 64 BASIC with 1st-class tokens

There are a number of nice tutorials that talk about extending the C64's BASIC. However, all the ones I could find talk about having a singl-character prefix, e.g., @, and then making single-character commands that follow that. This is the usual approach, because it is much easier, as you don't need to make the LIST command able to interpret the new commands.  However, it really makes for an ugly result, and hard to read code. For the new MEGA BASIC on the MEGA65, I wanted to avoid this, and have "real" BASIC keywords for the extensions, so that the resulting programs would be easier to write, and easier to read, and much easier to learn how to do things by reading other peoples programs.  So, I had to find out how to make a proper BASIC extension myself.

There are three main functions that you have to modify for this: (1) tokenisation (the part where words get turned into single-byte values when stored in a program; (2) detokenisation (the part where those get turned back into readable words, which really only matters for the LIST command); and (3) executing tokens. In this post, I'll show you how I implemented each one. This post will go into quite low-level detail of this, which might not be interesting to some, so please accept my apologies in advance if that is the case. However, for others who wish to implement an extension for the C64 BASIC, it would seem to be the only publicly released breakdown of how to make one, so I think it is warranted.

Before I dive into the tokenisation and detokenisation routines, it is worth mentioning a common problem to them both: The existing list of BASIC keywords as stored in the BASIC ROM is 254 bytes long.  By keeping it under 256 bytes, the tokenisation and detokenisation routines are able to be somewhat simpler, by using only index registers to scan through the list.  This has to change if we want to allow more keywords. As I don't anticipate making more keywords than 2*256 bytes, I was able to make changes that only worried about checking which half of the list is being read.


My token list is simply formed by taking the standard 255 bytes from the BASIC ROM at $A09E, and appending my new ones to the end:


tokenlist:
        ;; Reserve space for C64 BASIC token list, less the end $00 marker

        ;; (I copy these in place with a little loop elsewhere)
        .res ($A19C - $A09E + 1), $00
        ;; End of list marker (remove to enable new tokens)
                ;.byte $00
        ;; Now we have our new tokens
        ;; extra_token_count must be correctly set to the number of tokens
        ;; (This lists only the number of tokens that are good for direct use.
        ;; Keywords found only within statements are not in this tally.)
        extra_token_count = 5
        token_fast = $CC + 0
        .byte "FAS",'T'+$80
        token_slow = $CC + 1
        .byte "SLO",'W'+$80

        token_canvas = $CC + 2
        .byte "CANVA",'S'+$80
        token_colour = $CC + 3
        .byte "COLOU",'R'+$80
        token_tile = $CC + 4
        .byte "TIL",'E'+$80

        token_first_sub_command = token_tile + 1
       
        ;; These tokens are keywords used within other
        ;; commands, not as executable commands. These
        ;; will all generate syntax errors.
        token_text = token_first_sub_command + 0
        .byte "TEX",'T'+$80
        token_sprite = token_first_sub_command + 1
        .byte "SPRIT",'E'+$80
        token_screen = token_first_sub_command + 2
        .byte "SCREE",'N'+$80
        token_border = token_first_sub_command + 3
        .byte "BORDE",'R'+$80
        token_set = token_first_sub_command + 4
        .byte "SE",'T'+$80
        token_delete = token_first_sub_command + 5
        .byte "DELET",'E'+$80
        token_stamp = token_first_sub_command + 6
        .byte "STAM",'P'+$80
        token_at = token_first_sub_command + 7
        .byte "A",'T'+$80
        token_from = token_first_sub_command + 8
        .byte "FRO",'M'+$80
        ;; And the end byte
        .byte $00       


Another key issue is that most of the possible token values are already used.  Only token values between $CD and $FE are available, i.e., only 49 of them.  If I want to avoid double-byte tokens (which I really do), then I need to keep my new keywords to less than that number.  We'll see if I manage to do that, but for now, we will try. On both these points, I suspect that BASIC 10 on the C65 probably exceeds these limits, thus making the tokeniser and detokeniser more complex than they are in BASIC 2 on the C64. 

The tokeniser

The tokeniser is the routine that scans a line of BASIC text, and changes keywords into single-byte tokens.  It has to take account of various quirks, such as the PI character officially being a token with the special value $FF, and know whether parts of the line are inside quotation marks or follow a REM statement. This all makes the logic of the tokeniser routine rather more subtle and complex than one might first imagine.  As a result, I figured it was safer to essentially replicate the functionality of the standard tokeniser routine as exactly as possible, adding only the ability to scan a keyword list that is upto 511 bytes long, instead of 255 bytes long.

This routine is effectively the same as in the C64 BASIC ROM at $A57C, but with the changes to allow the token list to be two pages long. I have also tried to more fully document the routine in the comments and labels. Some of the labels have address suffixes on them, that indicate where in the original C64 BASIC ROM routine they were located, to make it easier to reference between the two of them. 

The algorithm is basically one of seeing if the current character might be the start of a keyword that should be tokenised, and if so, doing a string match against it.  Tokens are stored in the token list with bit 7 set on the last character, so, for example RUN is stored as $52 $55 $CE, where $CE = $4E + $80 to represent the terminal N of the word.  This trick is also how the short-cuts for keywords using shifted letters works: If the result of comparing the input character and the keyword letter is $80, then it is either the terminal character of the keyword, or a shifted letter in the input being tokenised.  This is also why typing something like RU(shift-N) won't work.

It also occurs to me that this allows for a rather subtle bug, where if you typed a keyword with the last character shifted, and then characters for the following token, it will match against the first token, so typing RU(shift-N)I(shift-F)RESTORE would actually be interpreted and tokenised as though you had just typed RUN. Quite what use you might make of this, I have no idea, but I have confirmed that it does indeed work.  I haven't bothered to try to fix this little problem in my extension, but just report it here in case it is of curiosity to anyone.  I'd be interested to know if anyone else had previously discovered this quirk.

Back to the task at hand, for the longer token list, I have added a "hi page flag" that is set when accessing the upper of the two pages of tokens, and clear for the lower.  In places this is initialised to $FF when set, so that INC token_hi_page_flag causes it to be cleared.  This is used in particular near the start following tokeniseNextChar, so that the routine can continue to pre-increment the pointer into the token list (which is normally just held in the Y register).

This means that the logic for advancing and, retreating the token pointer is no longer as simple as INY or DEY. To keep things understandable, I have broken out the code to do this into tokenListReadByte, tokenListAdvancePointer etc.

Beyond that, there is not really much more to say about it, without having to dive deep into explaining every little subtly about the C64 BASIC tokenise routine, other than to repeat that this tokeniser supports only single-byte tokens. To allow more tokens, we would need to maintain a 16-bit counter for the token value, and if it were >253, come up with a double-byte token, e.g., $FE <second byte>.  i.e., token 254 would be the first token to require two bytes, because token $FF is already used for PI, and we need to have at least one token reserved as a kind of escape code, which above I have suggested $FE as the logical candidate for this role. If I end up using too many tokens, that is probably the method that I would use.

megabasic_tokenise:

        ;; Get the basic execute pointer low byte
        LDX    $7A
        ;; Set the save index
        LDY    #$04
        ;; Clear the quote/data flag
        STY    $0F

@tokeniseNextChar:
        ;; Get hi page flag for tokenlist scanning, so that if we INC it, it will
        ;; point back to the first page.  As we start with offset = $FF, the first
        ;; increment will do this. Since offsets are pre-incremented, this means
        ;; that it will switch to the low page at the outset, and won't switch again
        ;; until a full page has been stepped through.
        PHA
        LDA     #$FF
        STA    token_hi_page_flag
        PLA
       
        ;; Read a byte from the input buffer
        LDA    $0200,X
        ;; If bit 7 is clear, try to tokenise
        BPL    @tryTokenise
        ;; Now check for PI (char $FF)
        CMP    #$FF     ; = PI
        BEQ    @gotToken_a5c9
        ;; Not PI, but bit 7 is set, so just skip over it, and don't store
        INX
        BNE    @tokeniseNextChar
@tryTokenise:
        ;; Now look for some common things
        ;; Is it a space?
        CMP    #$20    ; space
        BEQ    @gotToken_a5c9
        ;; Not space, so save byte as search character
        STA    $08
        CMP    #$22    ; quote marks
        BEQ    @foundQuotes_a5ee
        BIT    $0F    ; Check quote/data mode
        BVS    @gotToken_a5c9 ; If data mode, accept as is
        CMP    #$3F           ; Is it a "?" (short cut for PRINT)
        BNE    @notQuestionMark
        LDA    #$99    ; Token for PRINT
        BNE    @gotToken_a5c9 ; Accept the print token (branch always taken, because $99 != $00)
@notQuestionMark:
        ;; Check for 0-9, : or ;
        CMP     #$30
        BCC    @notADigit
        CMP    #$3C
        BCC    @gotToken_a5c9
@notADigit:
        ;; Remember where we are upto in the BASIC line of text
        STY    $71
        ;; Now reset the pointer into tokenlist
        LDY    #$00
        ;; And the token number minus $80 we are currently considering.
        ;; We start with token #0, since we search from the beginning.
        STY    $0B
        ;; Decrement Y from $00 to $FF, because the inner loop increments before processing
        ;; (Y here represents the offset in the tokenlist)
        DEY
        ;; Save BASIC execute pointer
        STX    $7A
        ;; Decrement X also, because the inner loop pre-increments
        DEX
@compareNextChar_a5b6:
        ;; Advance pointer in tokenlist
        jsr tokenListAdvancePointer
        ;; Advance pointer in BASIC text
        INX
@compareProgramTextAndToken:
        ;; Read byte of basic program
        LDA    $0200, X
        ;; Now subtract the byte from the token list.
        ;; If the character matches, we will get $00 as result.
        ;; If the character matches, but was ORd with $80, then $80 will be the
        ;; result.  This allows efficient detection of whether we have found the
        ;; end of a keyword.
        bit     token_hi_page_flag
        bmi    @useTokenListHighPage
        SEC
        SBC    tokenlist, Y
        jmp    @dontUseHighPage
@useTokenListHighPage:
        SEC
        SBC    tokenlist+$100,Y
@dontUseHighPage:
        ;; If zero, then compare the next character
        BEQ    @compareNextChar_a5b6
        ;; If $80, then it is the end of the token, and we have matched the token
        CMP    #$80
        BNE    @tokenDoesntMatch
        ;; A = $80, so if we add the token number stored in $0B, we get the actual
        ;; token number
        ORA    $0B
@tokeniseNextProgramCharacter:
        ;; Restore the saved index into the BASIC program line
        LDY    $71
@gotToken_a5c9:
        ;; We have worked out the token, so record it.
        INX
        INY
        STA    $0200 - 5, Y
        ;; Now check for end of line (token == $00)
        LDA    $0200 - 5, Y
        BEQ @tokeniseEndOfLine_a609

        ;; Now think about what we have to do with the token
        SEC
        SBC    #$3A
        BEQ    @tokenIsColon_a5dc
        CMP    #($83 - $3A) ; (=$49) Was it the token for DATA?
        BNE    @tokenMightBeREM_a5de
@tokenIsColon_a5dc:
        ;; Token was DATA
        STA    $0F    ; Store token - $3A (why?)
@tokenMightBeREM_a5de:
        SEC
        SBC    #($8F - $3A) ; (=$55) Was it the token for REM?
        BNE    @tokeniseNextChar
        ;; Was REM, so say we are searching for end of line (== $00)
        ;; (which is conveniently in A now)
        STA    $08   
@label_a5e5:
        ;; Read the next BASIC program byte
        LDA    $0200, X
        BEQ    @gotToken_a5c9
        ;; Does the next character match what we are searching for?
        CMP    $08
        ;; Yes, it matches, so indicate we have the token
        BEQ    @gotToken_a5c9

@foundQuotes_a5ee:
        ;; Not a match yet, so advance index for tokenised output
        INY
        ;; And write token to output
        STA    $0200 - 5, Y
        ;; Increment read index of basic program
        INX
        ;; Read the next BASIC byte (X should never be zero)
        BNE    @label_a5e5

@tokenDoesntMatch:
        ;; Restore BASIC execute pointer to start of the token we are looking at,
        ;; so that we can see if the next token matches
        LDX    $7A
        ;; Increase the token ID number, since the last one didn't match
        INC    $0B
        ;; Advance pointer in tokenlist from the end of the last token to the start
        ;; of the next token, ready to compare the BASIC program text with this token.
@advanceToNextTokenLoop:
        jsr     tokenListAdvancePointer
        jsr     tokenListReadByteMinus1
        BPL    @advanceToNextTokenLoop
        ;; Check if we have reached the end of the token list
        jsr    tokenListReadByte
        ;; If not, see if the program text matches this token
        BNE    @compareProgramTextAndToken

        ;; We reached the end of the token list without a match,
        ;; so copy this character to the output, and
        LDA    $0200, X
        ;; Then advance to the next character of the BASIC text
        ;; (BPL acts as unconditional branch, because only bytes with bit 7
        ;; cleared can get here).
        BPL    @tokeniseNextProgramCharacter
@tokeniseEndOfLine_a609:
        ;; Write end of line marker (== $00), which is conveniently in A already
        STA    $0200 - 3, Y
        ;; Decrement BASIC execute pointer high byte
        DEC    $7B
        ;; ... and set low byte to $FF
        LDA    #$FF
        STA    $7A
        RTS

tokenListAdvancePointer:   
        INY
        BNE    @dontAdvanceTokenListPage
        PHP
        PHA
        LDA    token_hi_page_flag
        EOR    #$FF
        STA    token_hi_page_flag
        ;; XXX Why on earth do we need these three NOPs here to correctly parse the extra
        ;; tokens? If you remove one, then the first token no longer parses, and the later
        ;; ones get parsed with token number one less than it should be!
        NOP
        NOP
        NOP
        PLA
        PLP
@dontAdvanceTokenListPage:
        PHP
        PHX
        PHA
        tya
        tax
        bit    token_hi_page_flag
        bmi    @page2
        jmp    @done
@page2:       
        @done:
       
        PLA
        PLX
        PLP
        RTS

tokenListReadByte:   
        bit     token_hi_page_flag
        bmi    @useTokenListHighPage
        LDA    tokenlist, Y
        RTS
@useTokenListHighPage:
        LDA    tokenlist+$100,Y
        RTS       

tokenListReadByteMinus1:   
        bit     token_hi_page_flag
        bmi    @useTokenListHighPage
        LDA    tokenlist - 1, Y
        RTS
@useTokenListHighPage:
        LDA    tokenlist - 1 + $100,Y
        RTS       


The detokenisation routine also heavily draws on the original C64 BASIC ROM's routine (located at $A71A).  This routine is rather simpler, because it doesn't need to do any string matching, or handling of shorted tokens using SHIFT+letter:

megabasic_detokenise:
        ;; The C64 detokenise routine lives at $A71A-$A741.
        ;; The routine is quite simple, reading through the token list,
        ;; decrementing the token number each time the end of at token is
        ;; found.  The only complications for us, is that we need to change
        ;; the parts where the token bytes are read from the list to allow
        ;; the list to be two pages long.

        ;; Print non-tokens directly
        bpl     jump_to_a6f3
        ;; Print PI directly
        cmp    #$ff
        beq    jump_to_a6f3
        ;; If in quote mode, print directly
        bit    $0f
        bmi     jump_to_a6f3

        ;; At this point, we know it to be a token

        ;; Tokens are $80-$FE, so subtract #$7F, to renormalise them
        ;; to the range $01-$7F
        SEC
        SBC    #$7F
        ;; Put the normalised token number into the X register, so that
        ;; we can easily count down
        TAX
        STY    $49     ; and store it somewhere necessary, apparently

        ;; Now get ready to find the string and output it.
        ;; Y is used as the offset in the token list, and gets pre-incremented
        ;; so we start with it equal to $00 - $01 = $FF
        LDY    #$FF
        ;; Set token_hi_page_flag to $FF, so that when Y increments for the first
        ;; time, it increments token_hi_page_flag, making it $00 for the first page of
        ;; the token list.
        STY    token_hi_page_flag

       
@detokeniseSearchLoop:
        ;; Decrement token index by 1
        DEX
        ;; If X = 0, this is the token, so read the bytes out
        beq    @thisIsTheToken
        ;; Since it is not this token, we need to skip over it
@detokeniseSkipLoop:
        jsr tokenListAdvancePointer
        jsr tokenListReadByte
        BPL    @detokeniseSkipLoop
        ;; Found end of token, loop to see if the next token is it
        BMI    @detokeniseSearchLoop
@thisIsTheToken:
        jsr tokenListAdvancePointer
        jsr tokenListReadByte
        ;; If it is the last byte of the token, return control to the LIST
        ;; command routine from the BASIC ROM
        BMI    jump_list_command_finish_printing_token_a6ef
        ;; As it is not the end of the token, print it out
        JSR    $AB47
        BNE    @thisIsTheToken

        ;; This can only be reached if the next byte in the token list is $00
        ;; This could only happen in C64 BASIC if the token ID following the
        ;; last is attempted to be detokenised.
        ;; This is the source of the REM SHIFT+L bug, as SHIFT+L gives the
        ;; character code $CC, which is exactly the token ID required, and
        ;; the C64 BASIC ROM code here simply fell through the FOR routine.
        ;; Actually, understanding this, makes it possible to write a program
        ;; that when LISTed, actually causes code to be executed!
        ;; However, this vulnerability appears not possible to be exploited,
        ;; because $0201, the next byte to be read from the input buffer during
        ;; the process, always has $00 in it when the FOR routine is run,
        ;; causing a failure when attempting to execute the FOR command.
        ;; Were this not the case, REM (SHIFT+L)I=1TO10:GOTO100, when listed
        ;; would actually cause GOTO100 to be run, thus allowing LIST to
        ;; actually run code. While still not a very strong form of source
        ;; protection, it could have been a rather fun thing to try.

        ;; Instead of having this error, we will just cause the character to
        ;; be printed normally.
        LDY    $49
jump_to_a6f3:   
        JMP     $A6F3
jump_list_command_finish_printing_token_a6ef:
        JMP    $A6EF


The one piece of interest, is that this routine is responsible for the REM SHIFT+L bug when listing a program on the C64. This bug *almost* creates a "code injection" bug in the C64's BASIC, as described in the comments above.  The bug is caused by detokenising SHIFT-L causes a fall-through at the end of the detokenisation routine into the start of the FOR command's routine in the ROM. However, as I describe in the comments above, the contents of the next byte of input when detokenising seems to always be $00 or unusable, which will cause a syntax error in the FOR routine, when it tries to parse the variable it should be using as the loop iterator.  Actually, digging a bit deeper, the next token byte it reads is whatever followed the LIST command.  However, it is only really possible to have a colon or $00 there, so I think it is still not exploitable.
I'd be very interested to hear if anyone can think of a way to exploit this, so that LISTing a newly loaded program causes code of your choosing to be executed.

Speaking of execution, that just leaves our token execution routine to present:

megabasic_execute:       
        JSR    $0073
        ;; Is it a MEGA BASIC primary keyword?
        CMP    #$CC
        BCC    @basic2_token
        CMP    #token_first_sub_command
        BCC    megabasic_execute_token
        ;; Handle PI
        CMP    #$FF
        BEQ    @basic2_token
        ;; Else, it must be a MEGA BASIC secondary keyword
        ;; You can't use those alone, so ILLEGAL DIRECT ERROR
        jmp megabasic_perform_illegal_direct_error
@basic2_token:
        ;; $A7E7 expects Z flag set if ==$00, so update it
        CMP    #$00
        JMP    $A7E7

megabasic_execute_token:
        ;; Normalise index of new token
        SEC
        SBC     #$CC
        ASL
        ;; Clip it to make sure we don't have any overflow of the jump table
        AND    #$0E
        TAX
        PHX
        ;; Get next token/character ready
        JSR    $0073
        PLX
        JMP     (newtoken_jumptable,X)

        ;; Tokens are $CC-$FE, so to be safe, we need to have a jump
newtoken_jumptable:
        .word     megabasic_perform_fast
        .word     megabasic_perform_slow
        .word    megabasic_perform_canvas ; canvas operations, including copy/stamping, clearing, creating new
        .word    megabasic_perform_colour ; set colours
        .word    megabasic_perform_tile ; "TILE" command, used for TILESET and other purposes
        .word    megabasic_perform_syntax_error ; "SET" SYNTAXERROR: Used only with TILE to make TILESET
        .word     megabasic_perform_syntax_error
        .word     megabasic_perform_syntax_error
        .word     megabasic_perform_syntax_error

        basic2_main_loop     =    $A7AE

This routine is really quite simple: Check if the token is not one of ours, in which case jump to the normal BASIC 2 routine, else work out which one, and jump to the appropriate routine.  I have cheated slightly, and used a C65/M65 new addressing mode for JMP, that makes jump tables much easier to implement.

Each of the routines then does what is required of it, before jumping to an error routine, or back to the BASIC 2 main loop. Here is the FAST command, which is nice and simple:


megabasic_perform_fast:
        jsr    enable_viciv
        LDA    #$40
        TSB    $D054
        TSB    d054_bits
        JMP    basic2_main_loop


As you can see, there is really nothing to it.  If you want to read arguments and the like, then you have to call the same routines that the BASIC ROM would use to do the same.  The best way to discover these is to look at an existing routine that implements a BASIC command, and see how it does it.  If you want to see how we have used them, or indeed to look at how these routines are strung together to make a fully-functioning BASIC extension, the source code for MEGA BASIC (in its currently unfinished state) is all here on github.

So there you have it.  I'll hopefully post an update soon that shows some more of the progress that has been made on MEGA BASIC. In particular, the CANVAS command and its variations are now almost complete, with the exception of the commands to get and set individual tiles in a canvas.

6 comments:

  1. Question: Why are you extending the C64 mode basic instead of the C65 one?

    ReplyDelete
    Replies
    1. Hi Bil,

      There are a couple of reasons why we are using BASIC 2 as the starting point. This will likely change over time, however.

      1. BASIC 2 is finished and stable. We have yet to confirm (or brew) a "finished" C65 BASIC 10 ROM.

      2. The internals of BASIC 2 are (currently) better documented.

      3. The MEGA BASIC extension needs a reasonable amount of RAM for holding the graphics and sound assets, so in the short term the easiest solution was to start with a BASIC that doesn't know about (and therefore cannot accidentally use) the 2nd 64KB of RAM.

      4. I haven't thought through exactly how the new graphics and sound commands will fit in with the existing ones in BASIC 10.

      5. I haven't surveyed exactly how much ROM space is free (although I recall it to be at least 16KB).

      In short, it was quicker to use BASIC 2 as a starting point, but that is quite likely not going to be the end point.

      Paul.

      Delete
  2. There was already an evolution of BASIC dialects at CBM.
    BASIC 2-4 on the PET/CBM line, BASIC 2 on VIC-20 and C64,
    BASIC 3.5 on PLUS/4, C16,C116, and BASIC 7 on C128.
    All these were version backward compatible.
    So I would prefer the BASIC 7 from the C128 as a stable starting point. Tis would be much better, than extending the BASIC 2 of the C64 in a way that is incompatible to the already existing dialects 3.5 and 7.
    At least you should make sure, that you are not using token values for your extensions, that have a different purpose in the BASIC 7 or BASIC 10 interpreter.

    ReplyDelete
    Replies
    1. Yes, we plan to make it as backward compatible as possible, and will look at how we deal with avoiding reuse of tokens. But that can happen once we have the functionality of the new commands under control.

      Paul.

      Delete
  3. And BASIC 3.6 on Commodore LCD :) http://commodore-lcd.lgb.hu/jsemu/

    But, I think the important point is the functionality, at first at least. It's another question what BASIC version will be used to add this feature as the final form, with what kind of BASIC tokens etc etc. I had to realize this too, since my first thought was also "why BASIC 2, if it's not even a C64, not even C65, but M65". However I think, Paul felt, that it's easier to develop the code functionality with the most simple CBM BASIC, but it's may not the final form of it, anyway. And it seems to be a good reason to have "something" to develop/test "ASAP", even if it's not the final solution.

    ReplyDelete
    Replies
    1. Exactly. I also figure that if I do the M65-specific parts that would be hard for others to figure out at this stage, that rebasing it onto another version of CBM BASIC can potentially be done by someone else later, if necessary. I just have to be careful to invest my limited spare time into the parts of the project that can't be easily delegated.

      Paul.

      Delete