Tuesday 13 February 2018

Extending The Commodore 64 BASIC with 1st-class tokens

There are a number of nice tutorials that talk about extending the C64's BASIC. However, all the ones I could find talk about having a singl-character prefix, e.g., @, and then making single-character commands that follow that. This is the usual approach, because it is much easier, as you don't need to make the LIST command able to interpret the new commands.  However, it really makes for an ugly result, and hard to read code. For the new MEGA BASIC on the MEGA65, I wanted to avoid this, and have "real" BASIC keywords for the extensions, so that the resulting programs would be easier to write, and easier to read, and much easier to learn how to do things by reading other peoples programs.  So, I had to find out how to make a proper BASIC extension myself.

There are three main functions that you have to modify for this: (1) tokenisation (the part where words get turned into single-byte values when stored in a program; (2) detokenisation (the part where those get turned back into readable words, which really only matters for the LIST command); and (3) executing tokens. In this post, I'll show you how I implemented each one. This post will go into quite low-level detail of this, which might not be interesting to some, so please accept my apologies in advance if that is the case. However, for others who wish to implement an extension for the C64 BASIC, it would seem to be the only publicly released breakdown of how to make one, so I think it is warranted.

Before I dive into the tokenisation and detokenisation routines, it is worth mentioning a common problem to them both: The existing list of BASIC keywords as stored in the BASIC ROM is 254 bytes long.  By keeping it under 256 bytes, the tokenisation and detokenisation routines are able to be somewhat simpler, by using only index registers to scan through the list.  This has to change if we want to allow more keywords. As I don't anticipate making more keywords than 2*256 bytes, I was able to make changes that only worried about checking which half of the list is being read.

My token list is simply formed by taking the standard 255 bytes from the BASIC ROM at $A09E, and appending my new ones to the end:

        ;; Reserve space for C64 BASIC token list, less the end $00 marker

        ;; (I copy these in place with a little loop elsewhere)
        .res ($A19C - $A09E + 1), $00
        ;; End of list marker (remove to enable new tokens)
                ;.byte $00
        ;; Now we have our new tokens
        ;; extra_token_count must be correctly set to the number of tokens
        ;; (This lists only the number of tokens that are good for direct use.
        ;; Keywords found only within statements are not in this tally.)
        extra_token_count = 5
        token_fast = $CC + 0
        .byte "FAS",'T'+$80
        token_slow = $CC + 1
        .byte "SLO",'W'+$80

        token_canvas = $CC + 2
        .byte "CANVA",'S'+$80
        token_colour = $CC + 3
        .byte "COLOU",'R'+$80
        token_tile = $CC + 4
        .byte "TIL",'E'+$80

        token_first_sub_command = token_tile + 1
        ;; These tokens are keywords used within other
        ;; commands, not as executable commands. These
        ;; will all generate syntax errors.
        token_text = token_first_sub_command + 0
        .byte "TEX",'T'+$80
        token_sprite = token_first_sub_command + 1
        .byte "SPRIT",'E'+$80
        token_screen = token_first_sub_command + 2
        .byte "SCREE",'N'+$80
        token_border = token_first_sub_command + 3
        .byte "BORDE",'R'+$80
        token_set = token_first_sub_command + 4
        .byte "SE",'T'+$80
        token_delete = token_first_sub_command + 5
        .byte "DELET",'E'+$80
        token_stamp = token_first_sub_command + 6
        .byte "STAM",'P'+$80
        token_at = token_first_sub_command + 7
        .byte "A",'T'+$80
        token_from = token_first_sub_command + 8
        .byte "FRO",'M'+$80
        ;; And the end byte
        .byte $00       

Another key issue is that most of the possible token values are already used.  Only token values between $CD and $FE are available, i.e., only 49 of them.  If I want to avoid double-byte tokens (which I really do), then I need to keep my new keywords to less than that number.  We'll see if I manage to do that, but for now, we will try. On both these points, I suspect that BASIC 10 on the C65 probably exceeds these limits, thus making the tokeniser and detokeniser more complex than they are in BASIC 2 on the C64. 

The tokeniser

The tokeniser is the routine that scans a line of BASIC text, and changes keywords into single-byte tokens.  It has to take account of various quirks, such as the PI character officially being a token with the special value $FF, and know whether parts of the line are inside quotation marks or follow a REM statement. This all makes the logic of the tokeniser routine rather more subtle and complex than one might first imagine.  As a result, I figured it was safer to essentially replicate the functionality of the standard tokeniser routine as exactly as possible, adding only the ability to scan a keyword list that is upto 511 bytes long, instead of 255 bytes long.

This routine is effectively the same as in the C64 BASIC ROM at $A57C, but with the changes to allow the token list to be two pages long. I have also tried to more fully document the routine in the comments and labels. Some of the labels have address suffixes on them, that indicate where in the original C64 BASIC ROM routine they were located, to make it easier to reference between the two of them. 

The algorithm is basically one of seeing if the current character might be the start of a keyword that should be tokenised, and if so, doing a string match against it.  Tokens are stored in the token list with bit 7 set on the last character, so, for example RUN is stored as $52 $55 $CE, where $CE = $4E + $80 to represent the terminal N of the word.  This trick is also how the short-cuts for keywords using shifted letters works: If the result of comparing the input character and the keyword letter is $80, then it is either the terminal character of the keyword, or a shifted letter in the input being tokenised.  This is also why typing something like RU(shift-N) won't work.

It also occurs to me that this allows for a rather subtle bug, where if you typed a keyword with the last character shifted, and then characters for the following token, it will match against the first token, so typing RU(shift-N)I(shift-F)RESTORE would actually be interpreted and tokenised as though you had just typed RUN. Quite what use you might make of this, I have no idea, but I have confirmed that it does indeed work.  I haven't bothered to try to fix this little problem in my extension, but just report it here in case it is of curiosity to anyone.  I'd be interested to know if anyone else had previously discovered this quirk.

Back to the task at hand, for the longer token list, I have added a "hi page flag" that is set when accessing the upper of the two pages of tokens, and clear for the lower.  In places this is initialised to $FF when set, so that INC token_hi_page_flag causes it to be cleared.  This is used in particular near the start following tokeniseNextChar, so that the routine can continue to pre-increment the pointer into the token list (which is normally just held in the Y register).

This means that the logic for advancing and, retreating the token pointer is no longer as simple as INY or DEY. To keep things understandable, I have broken out the code to do this into tokenListReadByte, tokenListAdvancePointer etc.

Beyond that, there is not really much more to say about it, without having to dive deep into explaining every little subtly about the C64 BASIC tokenise routine, other than to repeat that this tokeniser supports only single-byte tokens. To allow more tokens, we would need to maintain a 16-bit counter for the token value, and if it were >253, come up with a double-byte token, e.g., $FE <second byte>.  i.e., token 254 would be the first token to require two bytes, because token $FF is already used for PI, and we need to have at least one token reserved as a kind of escape code, which above I have suggested $FE as the logical candidate for this role. If I end up using too many tokens, that is probably the method that I would use.


        ;; Get the basic execute pointer low byte
        LDX    $7A
        ;; Set the save index
        LDY    #$04
        ;; Clear the quote/data flag
        STY    $0F

        ;; Get hi page flag for tokenlist scanning, so that if we INC it, it will
        ;; point back to the first page.  As we start with offset = $FF, the first
        ;; increment will do this. Since offsets are pre-incremented, this means
        ;; that it will switch to the low page at the outset, and won't switch again
        ;; until a full page has been stepped through.
        LDA     #$FF
        STA    token_hi_page_flag
        ;; Read a byte from the input buffer
        LDA    $0200,X
        ;; If bit 7 is clear, try to tokenise
        BPL    @tryTokenise
        ;; Now check for PI (char $FF)
        CMP    #$FF     ; = PI
        BEQ    @gotToken_a5c9
        ;; Not PI, but bit 7 is set, so just skip over it, and don't store
        BNE    @tokeniseNextChar
        ;; Now look for some common things
        ;; Is it a space?
        CMP    #$20    ; space
        BEQ    @gotToken_a5c9
        ;; Not space, so save byte as search character
        STA    $08
        CMP    #$22    ; quote marks
        BEQ    @foundQuotes_a5ee
        BIT    $0F    ; Check quote/data mode
        BVS    @gotToken_a5c9 ; If data mode, accept as is
        CMP    #$3F           ; Is it a "?" (short cut for PRINT)
        BNE    @notQuestionMark
        LDA    #$99    ; Token for PRINT
        BNE    @gotToken_a5c9 ; Accept the print token (branch always taken, because $99 != $00)
        ;; Check for 0-9, : or ;
        CMP     #$30
        BCC    @notADigit
        CMP    #$3C
        BCC    @gotToken_a5c9
        ;; Remember where we are upto in the BASIC line of text
        STY    $71
        ;; Now reset the pointer into tokenlist
        LDY    #$00
        ;; And the token number minus $80 we are currently considering.
        ;; We start with token #0, since we search from the beginning.
        STY    $0B
        ;; Decrement Y from $00 to $FF, because the inner loop increments before processing
        ;; (Y here represents the offset in the tokenlist)
        ;; Save BASIC execute pointer
        STX    $7A
        ;; Decrement X also, because the inner loop pre-increments
        ;; Advance pointer in tokenlist
        jsr tokenListAdvancePointer
        ;; Advance pointer in BASIC text
        ;; Read byte of basic program
        LDA    $0200, X
        ;; Now subtract the byte from the token list.
        ;; If the character matches, we will get $00 as result.
        ;; If the character matches, but was ORd with $80, then $80 will be the
        ;; result.  This allows efficient detection of whether we have found the
        ;; end of a keyword.
        bit     token_hi_page_flag
        bmi    @useTokenListHighPage
        SBC    tokenlist, Y
        jmp    @dontUseHighPage
        SBC    tokenlist+$100,Y
        ;; If zero, then compare the next character
        BEQ    @compareNextChar_a5b6
        ;; If $80, then it is the end of the token, and we have matched the token
        CMP    #$80
        BNE    @tokenDoesntMatch
        ;; A = $80, so if we add the token number stored in $0B, we get the actual
        ;; token number
        ORA    $0B
        ;; Restore the saved index into the BASIC program line
        LDY    $71
        ;; We have worked out the token, so record it.
        STA    $0200 - 5, Y
        ;; Now check for end of line (token == $00)
        LDA    $0200 - 5, Y
        BEQ @tokeniseEndOfLine_a609

        ;; Now think about what we have to do with the token
        SBC    #$3A
        BEQ    @tokenIsColon_a5dc
        CMP    #($83 - $3A) ; (=$49) Was it the token for DATA?
        BNE    @tokenMightBeREM_a5de
        ;; Token was DATA
        STA    $0F    ; Store token - $3A (why?)
        SBC    #($8F - $3A) ; (=$55) Was it the token for REM?
        BNE    @tokeniseNextChar
        ;; Was REM, so say we are searching for end of line (== $00)
        ;; (which is conveniently in A now)
        STA    $08   
        ;; Read the next BASIC program byte
        LDA    $0200, X
        BEQ    @gotToken_a5c9
        ;; Does the next character match what we are searching for?
        CMP    $08
        ;; Yes, it matches, so indicate we have the token
        BEQ    @gotToken_a5c9

        ;; Not a match yet, so advance index for tokenised output
        ;; And write token to output
        STA    $0200 - 5, Y
        ;; Increment read index of basic program
        ;; Read the next BASIC byte (X should never be zero)
        BNE    @label_a5e5

        ;; Restore BASIC execute pointer to start of the token we are looking at,
        ;; so that we can see if the next token matches
        LDX    $7A
        ;; Increase the token ID number, since the last one didn't match
        INC    $0B
        ;; Advance pointer in tokenlist from the end of the last token to the start
        ;; of the next token, ready to compare the BASIC program text with this token.
        jsr     tokenListAdvancePointer
        jsr     tokenListReadByteMinus1
        BPL    @advanceToNextTokenLoop
        ;; Check if we have reached the end of the token list
        jsr    tokenListReadByte
        ;; If not, see if the program text matches this token
        BNE    @compareProgramTextAndToken

        ;; We reached the end of the token list without a match,
        ;; so copy this character to the output, and
        LDA    $0200, X
        ;; Then advance to the next character of the BASIC text
        ;; (BPL acts as unconditional branch, because only bytes with bit 7
        ;; cleared can get here).
        BPL    @tokeniseNextProgramCharacter
        ;; Write end of line marker (== $00), which is conveniently in A already
        STA    $0200 - 3, Y
        ;; Decrement BASIC execute pointer high byte
        DEC    $7B
        ;; ... and set low byte to $FF
        LDA    #$FF
        STA    $7A

        BNE    @dontAdvanceTokenListPage
        LDA    token_hi_page_flag
        EOR    #$FF
        STA    token_hi_page_flag
        ;; XXX Why on earth do we need these three NOPs here to correctly parse the extra
        ;; tokens? If you remove one, then the first token no longer parses, and the later
        ;; ones get parsed with token number one less than it should be!
        bit    token_hi_page_flag
        bmi    @page2
        jmp    @done

        bit     token_hi_page_flag
        bmi    @useTokenListHighPage
        LDA    tokenlist, Y
        LDA    tokenlist+$100,Y

        bit     token_hi_page_flag
        bmi    @useTokenListHighPage
        LDA    tokenlist - 1, Y
        LDA    tokenlist - 1 + $100,Y

The detokenisation routine also heavily draws on the original C64 BASIC ROM's routine (located at $A71A).  This routine is rather simpler, because it doesn't need to do any string matching, or handling of shorted tokens using SHIFT+letter:

        ;; The C64 detokenise routine lives at $A71A-$A741.
        ;; The routine is quite simple, reading through the token list,
        ;; decrementing the token number each time the end of at token is
        ;; found.  The only complications for us, is that we need to change
        ;; the parts where the token bytes are read from the list to allow
        ;; the list to be two pages long.

        ;; Print non-tokens directly
        bpl     jump_to_a6f3
        ;; Print PI directly
        cmp    #$ff
        beq    jump_to_a6f3
        ;; If in quote mode, print directly
        bit    $0f
        bmi     jump_to_a6f3

        ;; At this point, we know it to be a token

        ;; Tokens are $80-$FE, so subtract #$7F, to renormalise them
        ;; to the range $01-$7F
        SBC    #$7F
        ;; Put the normalised token number into the X register, so that
        ;; we can easily count down
        STY    $49     ; and store it somewhere necessary, apparently

        ;; Now get ready to find the string and output it.
        ;; Y is used as the offset in the token list, and gets pre-incremented
        ;; so we start with it equal to $00 - $01 = $FF
        LDY    #$FF
        ;; Set token_hi_page_flag to $FF, so that when Y increments for the first
        ;; time, it increments token_hi_page_flag, making it $00 for the first page of
        ;; the token list.
        STY    token_hi_page_flag

        ;; Decrement token index by 1
        ;; If X = 0, this is the token, so read the bytes out
        beq    @thisIsTheToken
        ;; Since it is not this token, we need to skip over it
        jsr tokenListAdvancePointer
        jsr tokenListReadByte
        BPL    @detokeniseSkipLoop
        ;; Found end of token, loop to see if the next token is it
        BMI    @detokeniseSearchLoop
        jsr tokenListAdvancePointer
        jsr tokenListReadByte
        ;; If it is the last byte of the token, return control to the LIST
        ;; command routine from the BASIC ROM
        BMI    jump_list_command_finish_printing_token_a6ef
        ;; As it is not the end of the token, print it out
        JSR    $AB47
        BNE    @thisIsTheToken

        ;; This can only be reached if the next byte in the token list is $00
        ;; This could only happen in C64 BASIC if the token ID following the
        ;; last is attempted to be detokenised.
        ;; This is the source of the REM SHIFT+L bug, as SHIFT+L gives the
        ;; character code $CC, which is exactly the token ID required, and
        ;; the C64 BASIC ROM code here simply fell through the FOR routine.
        ;; Actually, understanding this, makes it possible to write a program
        ;; that when LISTed, actually causes code to be executed!
        ;; However, this vulnerability appears not possible to be exploited,
        ;; because $0201, the next byte to be read from the input buffer during
        ;; the process, always has $00 in it when the FOR routine is run,
        ;; causing a failure when attempting to execute the FOR command.
        ;; Were this not the case, REM (SHIFT+L)I=1TO10:GOTO100, when listed
        ;; would actually cause GOTO100 to be run, thus allowing LIST to
        ;; actually run code. While still not a very strong form of source
        ;; protection, it could have been a rather fun thing to try.

        ;; Instead of having this error, we will just cause the character to
        ;; be printed normally.
        LDY    $49
        JMP     $A6F3
        JMP    $A6EF

The one piece of interest, is that this routine is responsible for the REM SHIFT+L bug when listing a program on the C64. This bug *almost* creates a "code injection" bug in the C64's BASIC, as described in the comments above.  The bug is caused by detokenising SHIFT-L causes a fall-through at the end of the detokenisation routine into the start of the FOR command's routine in the ROM. However, as I describe in the comments above, the contents of the next byte of input when detokenising seems to always be $00 or unusable, which will cause a syntax error in the FOR routine, when it tries to parse the variable it should be using as the loop iterator.  Actually, digging a bit deeper, the next token byte it reads is whatever followed the LIST command.  However, it is only really possible to have a colon or $00 there, so I think it is still not exploitable.
I'd be very interested to hear if anyone can think of a way to exploit this, so that LISTing a newly loaded program causes code of your choosing to be executed.

Speaking of execution, that just leaves our token execution routine to present:

        JSR    $0073
        ;; Is it a MEGA BASIC primary keyword?
        CMP    #$CC
        BCC    @basic2_token
        CMP    #token_first_sub_command
        BCC    megabasic_execute_token
        ;; Handle PI
        CMP    #$FF
        BEQ    @basic2_token
        ;; Else, it must be a MEGA BASIC secondary keyword
        ;; You can't use those alone, so ILLEGAL DIRECT ERROR
        jmp megabasic_perform_illegal_direct_error
        ;; $A7E7 expects Z flag set if ==$00, so update it
        CMP    #$00
        JMP    $A7E7

        ;; Normalise index of new token
        SBC     #$CC
        ;; Clip it to make sure we don't have any overflow of the jump table
        AND    #$0E
        ;; Get next token/character ready
        JSR    $0073
        JMP     (newtoken_jumptable,X)

        ;; Tokens are $CC-$FE, so to be safe, we need to have a jump
        .word     megabasic_perform_fast
        .word     megabasic_perform_slow
        .word    megabasic_perform_canvas ; canvas operations, including copy/stamping, clearing, creating new
        .word    megabasic_perform_colour ; set colours
        .word    megabasic_perform_tile ; "TILE" command, used for TILESET and other purposes
        .word    megabasic_perform_syntax_error ; "SET" SYNTAXERROR: Used only with TILE to make TILESET
        .word     megabasic_perform_syntax_error
        .word     megabasic_perform_syntax_error
        .word     megabasic_perform_syntax_error

        basic2_main_loop     =    $A7AE

This routine is really quite simple: Check if the token is not one of ours, in which case jump to the normal BASIC 2 routine, else work out which one, and jump to the appropriate routine.  I have cheated slightly, and used a C65/M65 new addressing mode for JMP, that makes jump tables much easier to implement.

Each of the routines then does what is required of it, before jumping to an error routine, or back to the BASIC 2 main loop. Here is the FAST command, which is nice and simple:

        jsr    enable_viciv
        LDA    #$40
        TSB    $D054
        TSB    d054_bits
        JMP    basic2_main_loop

As you can see, there is really nothing to it.  If you want to read arguments and the like, then you have to call the same routines that the BASIC ROM would use to do the same.  The best way to discover these is to look at an existing routine that implements a BASIC command, and see how it does it.  If you want to see how we have used them, or indeed to look at how these routines are strung together to make a fully-functioning BASIC extension, the source code for MEGA BASIC (in its currently unfinished state) is all here on github.

So there you have it.  I'll hopefully post an update soon that shows some more of the progress that has been made on MEGA BASIC. In particular, the CANVAS command and its variations are now almost complete, with the exception of the commands to get and set individual tiles in a canvas.

Monday 5 February 2018

First steps towards MEGA BASIC for the MEGA65

It's got late, so just a quick post with some fun screen shots and a video.  I'll do another post another day talking about how I have extended the C64 BASIC tokeniser with proper keywords, as there doesn't seem to be many examples of this that I can find on line.

But for now, we have a demo of the new CANVAS STAMP command.

CANVASes in MEGA BASIC are things you can put graphics on, more precisely, they are screen-like things that you can cover with characters. However, as these are MEGA65 characters, they can be 8x8 blocks where each pixel is specified by an 8-bit palette entry, i.e., you can freely use any of 256 colours in each of these 8x8 blocks. 

Each CANVAS has a size, up to 255x255, and you can have upto 255 of them (you will run out of RAM before then in any case, most likely).

There is also the special CANVAS 0, which is the screen you see. If you STAMP something onto a visible part of CANVAS 0, then it will appear on the screen, e.g., if I do:


This tells MEGA BASIC that I want to draw whatever is in CANVAS 1 onto the screen, beginning at the top left.  You can also STAMP just a part of a CANVAS, but that is too advanced for today.

Notice that I haven't said anything about how to turn on "graphics mode" or pick the right mode etc. That's because you don't need to do that in MEGA BASIC.  The text mode screen is automatically overlain onto CANVAS 0 in real time. There is a raster interrupt that uses a few % of the CPU time to do this each frame, so you can even POKE onto the screen, and it will still magically update and appear together with the graphics.

However, that's all a bit abstract. What we need is to see this in action.

So first we start at the READY prompt for MEGA BASIC. It really is a V0.1 at the moment, with lots of bugs, and most things are not yet implemented.

So lets do that CANVAS STAMP command I mentioned above, to draw the contents of CANVAS 1 (a picture of the MEGA65 logo) onto the screen (CANVAS0):

Oh dear, we can't see part of it, because it is hidden behind the text.  That's okay, because we can just do shift-HOME to rub the text out, and then we can see the full logo behind:

 We could also get a bit more excited and draw three logos, instead of 1:
This ability to seamlessly work with text and graphics using simple commands is very much the philosophy of MEGA BASIC. There will also be commands for easy manipulation of sprites, and most likely also using SID tunes in the standard format that loads at $1000-$1FFF, perhaps with commands like TUNE LOAD "some.sid",8 , TUNE 1 and TUNE OFF.  There will also be CANVAS and SPRITE editors on the MEGA65, as well as cross-platform tools, to suit various needs. Basically if you want to make a nice little game or graphically rich program, it will be super easy and super fun to do.

So lets make something a bit more sophisticated. Right now we don't have those nice editors etc, and it is VERY late, so we will just draw MEGA65 logos randomly on the screen. To do this, we need only four lines of quite understandable BASIC:

And then, pronto, we have logos being drawn all over the screen!
It should also me mentioned that because it is based on characters, the drawing is really quite fast.  You can paste quite a few CANVASes per second, even at 1MHz. And with the FAST command to select 50MHz, you can do even more. Here is a little video (sorry for the bad audio, my microphone is quite sick at the moment) to see it all in action, including stamping many of these big logos on the screen per second: