Tuesday 30 March 2021

Guest Post from Bitshifter: Fixing the Oldest and Nastiest Bug in Commodore BASIC

We again are able to enjoy another guest-post from Bitshifter, as he fixes and enhances the C65 ROMs for use on the MEGA65:

---  

The probably oldest and nastiest bug in Commodore BASIC

A huge part of working on the ROM for the MEGA65, which contains the resident part of the operating system and BASIC interpreter, is Debugging. Not counting the character sets, the source code, written in 45GS02 assembler, has about 30000 lines, separated into the modules kernel, editor, DOS, BASIC and graphics.

The debugging is necessary mainly for two reasons:
1) Programmers, developers, software engineers are humans.
2) A source code, that is not your own, is full of traps, side effects and assumptions, that one is not aware of.
Of course my own code is full of traps and side effects too, but I know them (at least for the next few months).

While working on the ROM, fixing Commodore bugs, optimising code to get free space for extensions and introducing new features (and sometimes new bugs), I write often some hundred lines of assembly code and make changes on existing code. Sometimes these changes seem to work perfectly, some result in crashes and freezes, because of errors and some only seem to work fine in my own test, but fail if a developer named „ubik“ demonstrates situations, which the code cannot handle.

Well, I’ll not tell a long story, how I debug, but come directly to the bug mentioned in the title. I tracked his existence down to BASIC 2.0 as used in the VIC-20, C64 and the early PET/CBM series and it seems, that it was never detected, documented or fixed.

It is related to temporary strings, the stack of descriptors for temporary strings, that has a size of 3, and the so called „garbage collection“, which in reality doesn’t collect garbage, but does a defragmentation of string storage.

Let’s look at an example:

10 V1 = 12345: V2 = 6789

20 A$ = RIGHT$(MID$(STR$(V1),2)) + RIGHT$(MID$(STR$(V2),2))


and let’s look into string memory, after the execution of these two statements:



The interpreter needed 6 temporary strings, these are the strings, which are followed by the two byte „garbage“ word in magenta, to create the resulting permanent string, marked here with the „back link“ word in yellow.


Each time, where a string is modified or new assigned, it gets a new allocation in string memory and the old string is flagged as „garbage“ by replacing its back link, which points to the string descriptor, by the garbage word, which is the length of the string, followed by $ff (The high byte $ff can never appear as part of a back link, because the highest address in string memory is $F6FF, therefore it is safe, to use this value as flag).


In programs with much string activity, the occupied string space, which grows from top ($f6ff) to bottom (top of array space), will be filled very fast and if there is no more space left for creating a new string, the call of the defragmentation routine is triggered. This starts at top of screen memory and copies all strings, which do not carry the garbage flag, to adjacent addresses, skipping the garbage strings. Doing this it is necessary, to update the string pointers of the descriptors. That’s the purpose of the backlink: After copying a string to the new address the back link is used to find the descriptor, and the two pointer bytes of the descriptor get the new address. Te descriptor can reside in scalar space, e.g. for a variable like AB$, or in array space, e.g. XY$(I,J) or it can be a temporary descriptor on the descriptor stack in the zero page. And this is, where the bug lurks.

The descriptor stack is 9 bytes long in the zero page and has therefore room for 3 descriptors. Additionally we have a stackpointer, which has four valid values for 0,1,2 or 3 descriptors in use. And we have a LASTPT called variable, which points to the last temporary descriptor, which was used to allocate a string. This is an optimisation tool. When a string is no more needed and it is the last string, that was allocated, the pointer to the used string memory can be updated by the length (plus 2 for the back link) instead of just flagging the string as garbage. This method can slow down the speed of filling the string memory somewhat and let’s the defragmentation happen less often.

This LASTPT is initialised at the execution start, alas only the high byte!




The low byte therefore has the value, that was either initialised at power up or has the value from a previous use. This use can for example be the loading of a program, because this involves string handling for the filename and therefore the use of the descriptor stack.


So it can happen, that the routine, that pops a value from the descriptor stack compares the current pointer with LASTPT, sees equality and decides not to flag the string as garbage, but to update the pointer to free string memory instead. This is an error, if the value of LASTPT is from a previous usage and was not set in the current statement. Alas, it will have no consequences in 99.9 % of the cases, because at the end of a statement all temporary descriptors are freed anyway.

The bug can drive mischief if following conditions meet:


1) A program with heavy usage of strings, which triggers the garbage collection / defragmentation frequently.


2) Statements with complex string operations, which need more than one temporary descriptor.


Then a really rare event can happen:

The garbage collection is triggered in the middle of a complex string formula with temporary descriptors on the stack. The garbage collection is aware of temporary descriptors and copies their strings too and updates the descriptors, but if one of the descriptors freed his string due to a false LASTPT comparison, his string is outside of the garbage collection area and will not be updated. And this causes strange things to be happen. The descripror has now a pointer, that does not point to a valid string. So the whole system of links and back links is corrupt. To make the bug hunting more interesting, this corrupt pointer often does no obvious harm until the next garbage collection, or the collection after it, but each will make the error worse, until back links appear in screen memory or strings dissappear or the computer freezes.


And all this because the setting of LASTPNT was not done with:

STA LASTPT

STA LASTPT+1


instead the first part, setting the low byte, was forgotten.

BTW, the highy byte is always zero, because the descriptor stack resides in the zero page, so this usage of the high byte is redundant.

I found this bug in all versions of Commodore BASIC, that I investigated, VIC-20, C64, C128, C65, MEGA65.

But it needed the „11 BASIC“ preprocessor/compiler of UBIK, to let the bug appear.

And it was very difficult to detect, because only huge programs with heavy string activity could activate it.


So the old programmer’s talk is true:

You can only find the penultimate error in a program!
There is always the ultimate error, that remained undetected.


5 comments:

  1. When I was a teen in the 80's I wrote and sold a BBS system for the C64 in BASIC (and the Blitz optimizer). It relied heavily on strings (of course), and used a lot of memory, so garbage collection was frequent. I can't recall the specifics, but there were sometimes weird instances of string corruption and crashes that I just couldn't track down. Very rare, but when your code is operating 24/7 the rare becomes inevitable. Now I wonder if this bug had something to do with it! Very much looking forward to official production of the MEGA65, by the way.

    ReplyDelete
  2. > 20 A$ = RIGHT$(MID$(STR$(V1),2)) + RIGHT$(MID$(STR$(V2),2))

    I have never tried to invoke RIGHT$() with only a string parameter and not a length parameter. Nor MID$() with only a string and a start point but not length. I would have thought that this would result in a syntax error. Does the behavior change if the code includes those length parameters? For example:

    20 A$ = RIGHT$(MID$(STR$(V1),2,2),2) + RIGHT$(MID$(STR$(V2),2,2),2)

    ReplyDelete
  3. Thank you - this was a great read. Amazing that you found a bug that slept there for decades :-)

    ReplyDelete
  4. Hello Bitshifter,

    very interesting analysis!

    Two small details: the program you posted is not syntactically correct.
    MIDS$ and RIGHT$ are missing some arguments. Something like this works better to show how memory is used by string operations:


    10 v1 = 12345: v2 = 6789
    20 a$ = right$(mid$(str$(v1),2,9),4) + right$(mid$(str$(v2),2,3),4)
    30 print a$
    ready.

    Second detail: the backlinks you're describing were introduced by BASIC 4, first seen in the PET 8032/4032. The VIC-20 and C=64 don't have them. That makes garbage collection a lot slower...

    Here are two memory dumps after running the above program, for the 64 and the 8032.


    C64:

    >C:9f90 00 00 ff ff ff ff 00 00 00 00 ff ff ff ff 00 00 ..¿¿¿¿....¿¿¿¿..
    >C:9fa0 00 00 ff ff ff ff 00 00 00 00 ff ff ff ff 00 00 ..¿¿¿¿....¿¿¿¿..
    >C:9fb0 00 00 ff ff ff ff 00 00 00 00 ff ff ff ff 00 00 ..¿¿¿¿....¿¿¿¿..
    >C:9fc0 00 00 ff ff ff ff 00 00 00 00 ff ff ff ff 00 00 ..¿¿¿¿....¿¿¿¿..
    >C:9fd0 00 00 ff ff ff ff 00 00 00 00 ff ff ff ff 00 32 ..¿¿¿¿....¿¿¿¿.2
    >C:9fe0 33 34 35 36 37 38 36 37 38 36 37 38 20 36 37 38 345678678678 678
    >C:9ff0 39 32 33 34 35 31 32 33 34 35 20 31 32 33 34 35 9234512345 12345

    PET 8032 (Basic 4):

    >C:7f90 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa ¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿
    >C:7fa0 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa ¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿
    >C:7fb0 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa ¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿
    >C:7fc0 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa ¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿
    >C:7fd0 aa 32 33 34 35 36 37 38 63 04 36 37 38 03 ff 36 ¿2345678c.678.¿6
    >C:7fe0 37 38 03 ff 20 36 37 38 39 05 ff 32 33 34 35 04 78.¿ 6789.¿2345.
    >C:7ff0 ff 31 32 33 34 35 05 ff 20 31 32 33 34 35 06 ff ¿12345.¿ 12345.¿

    ReplyDelete
  5. This guest post is a fascinating read!

    The software environment within which the bug can manifest, in itself, is a non-trivial thing to describe; but the article lays it out in an extremely clear and easy-to-understand manner, which I can say from experience is much harder to do than you'd think. The garbage collector's flawed actions — and what it _should_ have been doing instead — is described about as straightforwardly as it ever could be, but for the author to have untangled its behaviour enough to understand it that deeply must have been incredibly tedious.

    In short, this is some most impressive work by Bitshifter. Bravo!

    The discovery documented here does raise an important question, though. There are a number of different reasons a Commodore 8-bit enthusiast might want to gather information on ROM bugs. Where would you post about this, to be confident that interested parties would easily find it?

    ReplyDelete