Wednesday, January 23, 2019

The MEGA65 at Linux Conf AU

This week I am at Linux Conf AU, where I have, among other things, given a presentation about the MEGA65, using the MEGA65:

The cover slide was produced using a PNG to MEGA65 full-colour image conversion utility I wrote, and the main slides using MegaWAT, which I wrote with Lucas, a student, over the past few months in preparation for this, and in general, for being able to use the MEGA65 to introduce itself.  The source is at

Wednesday, January 16, 2019

Livening CPU speed, video mode and monitor in freeze menu

Continuing the work on the freeze menu, I have been able to make quite a bit of progress lately. In the last post, I had the freezer and freeze menu working in a minimalistic kind of way. However, there were a lot of things not there, and still plenty of bugs to cause trouble. 

For a start, I wasn't saving the state of the CIAs.  This turned out to be a bit more of a pain than I first expected.  Fortunately, I was able to read through the information on C64 freezing by Gideon, as well as get a bunch of useful tips from Groepaz and others.  This opened a whole can of worms, and basically reminded me of a fact I had either previously forgotten, or had failed to fully grasp:  Perfect freezing is practically impossible, and requires at the very least hardware support if you are to have any chance at getting it right. 

The CIAs are a big part of the reason behind this.  The problem is that the CIAs have timers that keep on running while you are trying to freeze them, and there is certain internal state that you can't read from the registers, but requires that you try to work out how far the timers have run down, wait for them to run out, and then read out the latched information.  Of course, one or more of the four timers of the 2 CIAs will probably run out before you spot it, and the accuracy you can achieve when trying to watch four timers with a single 1MHz CPU is rather limited -- probably of the order of 50 or 100 cycles.

My solution to deal with the CIAs was to make the CIAs effectively freeze themselves when in Hypervisor mode: The CIA timers stop ticking, and acknowledging interrupts via $DC0D etc has no effect.  I also added 16 extra registers to the CIA, visible in hypervisor mode, that allow us to directly read out the latched and current values of the timers, as well as the current and alarm times of the time-of-day clocks. It is still not perfect, but it pretty much works.

As I worked on the freeze menu, I wanted to provide an indication of which ROM you are running, since there are a variety of C65 ROMs available, and people will probably want to try some out.  So I wrote a little routine that makes an educated guess as to which ROM you are running. For the C65 ROMs, this is quite simple: they have a fairly reliable version string near the start of the ROM, and so if I find one of those, I indicate it is a C65 ROM and show the relevant version.  It can also detect a wide range of C64 and even PET 4064 ROMs. I did this by making a utility that reads in all the ROMs I could find, and looks for unique bytes in the KERNAL ROM part, and makes its decision on that basis. It turns out if you do the tests in the right order, a single PEEK is all you need to test for each known ROM:

  // Check for C65 ROM via version string
  if ((freeze_peek(0x20016L)=='V')
      &&(freeze_peek(0x20017L)=='9')) {
    c65_rom_name[0]=' ';
    c65_rom_name[4]=' ';
    return c65_rom_name;
  }

  if (freeze_peek(0x2e47dL)=='J') {
    // Probably jiffy dos
    if (freeze_peek(0x2e535L)==0x06)
      return "sx64 jiffy ";
    return "c64 jiffy  ";
  }
  // Else guess using detection routines from detect_roms.c
  // These were built using a combination of the ROMs from,
  // the RetroReplay ROM collection, and the JiffyDOS ROMs
  if (freeze_peek(0x2e449L)==0x2e) return "C64GS      ";
  if (freeze_peek(0x2e119L)==0xc9) return "C64 REV1   ";
  if (freeze_peek(0x2e67dL)==0xb0) return "C64 REV2 JP";
  if (freeze_peek(0x2ebaeL)==0x5b) return "C64 REV3 DK";
  if (freeze_peek(0x2e0efL)==0x28) return "C64 SCAND  ";
  if (freeze_peek(0x2ebf3L)==0x40) return "C64 SWEDEN ";
  if (freeze_peek(0x2e461L)==0x20) return "CYCLONE 1.0";
  if (freeze_peek(0x2e4a4L)==0x41) return "DOLPHIN 1.0";
  if (freeze_peek(0x2e47fL)==0x52) return "DOLPHIN 2AU";
  if (freeze_peek(0x2eed7L)==0x2c) return "DOLPHIN 2P1";
  if (freeze_peek(0x2e7d2L)==0x6b) return "DOLPHIN 2P2";
  if (freeze_peek(0x2e4a6L)==0x32) return "DOLPHIN 2P3";
  if (freeze_peek(0x2e0f9L)==0xaa) return "DOLPHIN 3.0";
  if (freeze_peek(0x2e462L)==0x45) return "DOSROM V1.2";
  if (freeze_peek(0x2e472L)==0x20) return "MERCRY3 PAL";
  if (freeze_peek(0x2e16dL)==0x84) return "MERCRY NTSC";
  if (freeze_peek(0x2e42dL)==0x4c) return "PET 4064   ";
  if (freeze_peek(0x2e1d9L)==0xa6) return "SX64 CROACH";
  if (freeze_peek(0x2eba9L)==0x2d) return "SX64 SCAND ";
  if (freeze_peek(0x2e476L)==0x2a) return "TRBOACS 2.6";
  if (freeze_peek(0x2e535L)==0x07) return "TRBOACS 3P1";
  if (freeze_peek(0x2e176L)==0x8d) return "TRBOASC 3P2";
  if (freeze_peek(0x2e42aL)==0x72) return "TRBOPROC US";
  if (freeze_peek(0x2e4acL)==0x81) return "C64C 251913";
  if (freeze_peek(0x2e479L)==0x2a) return "C64 REV2   ";
  if (freeze_peek(0x2e535L)==0x06) return "SX64 REV4  ";
  return "UNKNOWN ROM";
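Because each test is just an address, an expected byte and a name, the same idea can also be written as a table walked in order. The following is only a sketch of that alternative, not how the freeze menu does it: the struct, the table and guess_rom() are hypothetical names, only three of the entries above are repeated, and the ROM area is assumed to be available as a plain 128KB buffer covering $20000-$3FFFF rather than being read via freeze_peek().

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical signature table: (address, expected byte, name) per ROM.
   Only three entries from the listing above are repeated here. */
struct rom_signature {
  uint32_t address;   /* address in the frozen program's memory */
  uint8_t value;      /* byte that identifies the ROM */
  const char *name;   /* 11-character name for the freeze menu */
};

static const struct rom_signature rom_signatures[] = {
  { 0x2e449UL, 0x2e, "C64GS      " },
  { 0x2e119UL, 0xc9, "C64 REV1   " },
  { 0x2e67dUL, 0xb0, "C64 REV2 JP" },
};

/* rom_area is assumed to point at a copy of the 128KB ROM area,
   i.e. addresses $20000-$3FFFF of the frozen program. */
const char *guess_rom(const uint8_t *rom_area)
{
  size_t i;
  for (i = 0; i < sizeof(rom_signatures) / sizeof(rom_signatures[0]); i++) {
    /* The order of the table matters: the first match wins */
    if (rom_area[rom_signatures[i].address - 0x20000UL] == rom_signatures[i].value)
      return rom_signatures[i].name;
  }
  return "UNKNOWN ROM";
}
```

Keeping the entries in the same order as the if-chain preserves the property that a single byte per ROM is enough.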

This routine is written in C, because the whole freeze menu is written in C using the CC65 compiler.  This makes it quite easy to change, which is how we want it: If you want to make your own replacement or modified freeze menu, then it should be possible to do.   We are progressively building up a library of functions that are helpful, such as the freeze_peek() routine, which the freezer uses to retrieve a byte of memory from the frozen program.  To do this, it needs to know the layout of the freeze slots as they are saved on the SD card, because on the MEGA65 freezing always happens with the result written to the SD card, rather than being squished around in memory like with the original freeze cartridges.  The freeze_peek() routine itself is fairly simple, basically consisting of working out where the byte lives on the SD card, then reading the relevant sector, and returning the appropriate byte of data:

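unsigned char freeze_peek(uint32_t addr)
{
  // Find sector
  uint32_t freeze_slot_offset=address_to_freeze_slot_offset(addr);
  unsigned short offset;

  if (freeze_slot_offset==0xFFFFFFFFL) {
    // Invalid / unfrozen memory
    return 0x55;
  }

  // Split the byte position into a sector number and an offset within the sector
  offset=freeze_slot_offset&0x1ff;
  freeze_slot_offset=freeze_slot_offset>>9;

  // Read the sector
  // (the SD card read of sector freeze_slot_offset of the slot into sector_buffer goes here)

  // Return the byte
  return sector_buffer[offset&0x1ff];
}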


Working out where the byte lives is a little more complicated, as we have to ask the Hypervisor for the layout of the freeze regions, and then iterate through those regions to work out if the requested address falls in one of those regions.  It is possible that it doesn't, because the MEGA65 has a lot more address space (256MB) than it has populated with actual memory.

/* Convert a requested address to a location in the freeze slot,
   or to 0xFFFFFFFF if the address is not present.
*/
uint32_t address_to_freeze_slot_offset(uint32_t address)
{
  uint32_t freeze_slot_offset=1;  // Skip the initial saved SD sector at the beginning of each slot
  uint32_t relative_address=0;
  uint32_t region_length=0;
  char skip,i;

  for(i=0;i<freeze_region_count;i++) {
    skip=0;
    if (address<freeze_region_list[i].address_base) skip=1;
    relative_address=address-freeze_region_list[i].address_base;
    // NOTE: the name of the length field is assumed here
    region_length=freeze_region_list[i].region_length;
    if (freeze_region_list[i].address_base==0x1000L) {
      // Thumbnail region: Treat specially so that we can examine it
      // We give the fictional mapping of $FF54xxx
      if ((address&0xFFFF000L)==0xFF54000L) {
        relative_address=address&0xFFF;
        freeze_slot_offset+=relative_address>>9;
        freeze_slot_offset=(freeze_slot_offset<<9)+(relative_address&0x1ff);
        return freeze_slot_offset;
      }
    }
    if (relative_address>=region_length) skip=1;
    if (skip) {
      // Skip this region if our address is not in it
      freeze_slot_offset+=region_length>>9;
      // If region is not an integer number of sectors long, don't forget to count the partial sector
      if (region_length&0x1ff) freeze_slot_offset++;
    } else {
      // The address is in this region.

      // First add the number of sectors to get to the one with the content we want
      freeze_slot_offset+=relative_address>>9;

      // Now multiply it by the length of a sector (512 bytes), and add the offset in the sector
      // This gives us the absolute byte position in the slot of the address we want.
      freeze_slot_offset=(freeze_slot_offset<<9)+(relative_address&0x1ff);
      return freeze_slot_offset;
    }
  }
  return 0xFFFFFFFFL;
}
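To make the sector arithmetic concrete, here is a cut-down, stand-alone sketch of the same walk that can be compiled on any machine. It is not the freeze menu's code: struct region, sectors_for() and slot_offset() are hypothetical names, the thumbnail special case is left out, and the region list is passed in by the caller instead of being fetched from the Hypervisor.

```c
#include <stdint.h>

#define SECTOR_SIZE 512UL
#define NOT_FROZEN 0xFFFFFFFFUL

/* Hypothetical, simplified region descriptor for this sketch */
struct region {
  uint32_t base;    /* first address covered by the region */
  uint32_t length;  /* length of the region in bytes */
};

/* Number of sectors a region occupies, counting a trailing partial sector */
static uint32_t sectors_for(uint32_t length)
{
  return (uint32_t)((length + SECTOR_SIZE - 1) / SECTOR_SIZE);
}

/* Return the byte offset of address within the slot, or NOT_FROZEN
   if no region in the list contains it. */
uint32_t slot_offset(const struct region *list, int count, uint32_t address)
{
  uint32_t sector = 1;  /* skip the initial saved sector of the slot */
  int i;
  for (i = 0; i < count; i++) {
    if (address >= list[i].base && address - list[i].base < list[i].length) {
      uint32_t rel = address - list[i].base;
      /* sectors before the hit, times 512, plus the offset in the sector */
      return (uint32_t)((sector + rel / SECTOR_SIZE) * SECTOR_SIZE
                        + rel % SECTOR_SIZE);
    }
    sector += sectors_for(list[i].length);
  }
  return NOT_FROZEN;
}
```

With a list holding only the 384KB of RAM at $0000000 followed by the 32KB of colour RAM at $FF80000, address $0400 lands at byte 1536 of the slot (sector 3, offset 0), while $70000 falls in neither region and yields 0xFFFFFFFF.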

Anyway, back to the ROM auto-detection: The result of the ROM auto-detection is displayed in the freeze menu, as shown in this screenshot:

While we are here, let's walk through the various elements, starting with the most obvious part: the picture of the computer itself!  Here the results of the ROM auto-detection are used a second time, to pick the correct surround to show around the thumbnail image of the frozen program: If it is a C65 ROM, you see a C65 and 1084S monitor depicted, otherwise a C64 and 1702:

These surrounding images can be drawn as a 152x96 pixel PNG file, and converted to the correct format using the thumbnail-surround-formatter from the MEGA65 Freeze Menu repository, so if you want to make it look different, perhaps like Gus the snail, you are totally free to do so! The only limitation is that the screen position for the thumbnail is fixed, and the colour palette is a 256-colour colour cube, with the first 16 entries replaced by the C64's standard palette. I might do something to improve the palette at some point, because it is a bit annoying.  It might be possible, for example, to have a separate palette for the surrounding image versus the thumbnail by selecting the alternate palette bit in the thumbnail graphics tiles. But that will have to wait for another day.

The thumbnails are 80x50 images generated automatically by the MEGA65 hardware, specifically for the freeze menu.  I wrote about the thumbnail generator hardware a long time ago, and only now is it finally being used for its intended purpose.  The only real change to the thumbnail generator was to move it to $D640 instead of $D630 to avoid a recurring address resolution problem in the VHDL, which I think was due to glitching of the chip select line.

While on the one hand just a bit of a fun cosmetic touch, the thumbnails serve the very useful purpose of making it easier for you to find previously frozen programs.  From within the freeze menu you can use the cursor keys to navigate through the set of freeze slots configured in the MEGA65 system partition.  The MEGA65 FDISK+FORMAT utility will normally allocate up to 2GB for the system partition, with about half of that being freeze slots, giving a total of just over 2,000 freeze slots under the current design. 

That should be more than enough to have games and programs you like to use pre-frozen ready for an instant meal whenever you want! It also means if you need to interrupt a favourite game at a critical point, it should be easy enough to find a free slot and save it for reheating later.  In fact, with ~2,000 slots, searching through them linearly will be sure to become a chore, so I will add some kind of search facility in the future.  This is a great thing about writing the freeze menu as a separate program in C: it is pretty easy to work on and extend.

Going further through the freeze menu, we find the bank of six quick configuration settings: CPU mode, ROM, CPU frequency (speed), write-protection of the ROM area, cartridge enable/disable and PAL/NTSC select.  Each of these settings can be changed by pressing the indicated letter on the screen.  The only exception is R for ROM selection, which is not yet implemented.

Below that we have a set of typical Freeze utilities: A monitor to inspect and modify the memory of a frozen program, a mechanism to enter poke codes, e.g., for game cheats, a disk image chooser to let you switch disks while running a program (or to get ready to run one), a sprite viewer, poke finder and sprite collision killer (also for cheating in games).  At the moment only the monitor and disk select options are implemented. 

The monitor uses 80-column mode, and works like your typical machine code monitor on the C64.  At the moment only M to inspect memory and S to set contents of memory are supported. The syntax of these mirrors that of the MEGA65's hardware monitor, except that with the S command you can use " and ' to indicate either ASCII strings or screen poke codes.  So for example S400 'HELLO would put the word HELLO at the top left of the screen in C64 mode, by writing $08 $05 $0C $0C $0F to locations $0400 - $0404.
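The translation from typed text to screen poke codes in that example can be sketched as follows. This is only an illustration of the mapping, not the monitor's actual code: both function names are hypothetical, and only the simple ranges needed for plain text are handled.

```c
#include <stddef.h>
#include <stdint.h>

/* Convert one ASCII character to a C64 screen code. ASCII $40-$5F
   ('@', 'A'-'Z', ...) occupies screen codes $00-$1F, although a few of
   the punctuation glyphs in that range differ from ASCII. */
uint8_t ascii_to_screen_code(char c)
{
  if (c >= '@' && c <= '_') return (uint8_t)(c - 0x40);
  if (c >= 'a' && c <= 'z') return (uint8_t)(c - 0x60);  /* fold to upper case */
  return (uint8_t)c;  /* space, digits and punctuation in $20-$3F map directly */
}

/* Convert a string, writing at most out_len codes. Returns the count written. */
size_t string_to_screen_codes(const char *s, uint8_t *out, size_t out_len)
{
  size_t n = 0;
  while (s[n] != '\0' && n < out_len) {
    out[n] = ascii_to_screen_code(s[n]);
    n++;
  }
  return n;
}
```

Feeding it "HELLO" gives $08 $05 $0C $0C $0F, the same bytes as in the S400 example.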

The disk selector scans the SD card for D81 files, and then displays a list of up to 1,000 of them, and lets you choose one to use.  Again, at some point I will add some kind of sorting and/or searching function to make browsing through long lists of disk images easier.  But what I have already implemented, is that if you highlight a disk image, and then don't press any keys for a second or more, it will retrieve the directory from the disk image.  In this way you can more easily find the disk you are looking for.  Here is an example of this in action:

This menu will also refuse to let you select a disk image if it can't be mounted for some reason. The most common cause at the moment is if the .D81 file is fragmented on the SD card.  The reason this isn't allowed is that the hardware support for D81 image access requires the D81 file to exist as a single linear 800KB block, so as to avoid the hardware needing to know about FAT file systems.  What I will likely do at some point is add support for automatically de-fragmenting image files, so that the user doesn't need to think about it.
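The fragmentation test itself is conceptually simple: a file is usable only if its clusters form one unbroken ascending run. A minimal sketch of such a check follows, assuming the cluster chain has already been read out of the FAT into an array. The function name and that representation are assumptions for illustration, not the Hypervisor's actual code.

```c
#include <stddef.h>
#include <stdint.h>

/* Return 1 if the cluster chain is one unbroken ascending run
   (i.e. the file is stored as a single linear block), else 0. */
int chain_is_contiguous(const uint32_t *clusters, size_t count)
{
  size_t i;
  for (i = 1; i < count; i++) {
    if (clusters[i] != clusters[i - 1] + 1) return 0;
  }
  return 1;
}
```

A chain of 10, 11, 12 passes, while 10, 11, 13 is rejected as fragmented.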

Talking about disk images, whenever you load a frozen program, it goes through the disk image attachment process again, so if you have changed the disk image, e.g., by creating a new file in a different session, then the changes will show up.  There are some hazards with this, for example, if the frozen program isn't clever enough to notice the disk-change line, but this will hopefully not be a big problem. Also, of course you could have deleted the disk image.  In that case, when you resume the frozen program there will be no disk image attached.

That's the bulk of the function of the freeze menu for now.  I have skipped over a whole pile of little niggly problems I have had to solve along the way, and there are still a number of bugs and missing features to be solved, and the freeze/unfreeze process still sometimes messes up the frozen program in some way causing it to crash, but it is already functional enough to be very useful.  In short, you can now easily use the MEGA65, including switching disks, CPU speed and other things, without having to do anything awkward.  The following video shows me using the freeze menu to do various things.  I particularly like how cute the little thumbnails of the frozen programs look :)

Thursday, January 3, 2019

More work on the MEGA65 built-in freezer

Yesterday I posted the progress on the built-in freezer for the MEGA65, and explained a bit how it works.  However, at that point in time, the freezer was not really functional -- it could save and restore some memory and IO registers, but not without problems, and thus it wasn't possible to actually resume a program after freezing.  That has changed today!  After quite a bit of fiddling, the freeze and unfreeze routines are now much better, and generally work.

The main progress is that I am able to save the main memory, the colour RAM, the VIC-IV registers (including the colour palettes), the MEGA65 Hypervisor saved state (which is really the saved state of the program being frozen, since it was saved on entry to the Hypervisor, which is what is actually doing the freezing), along with most of the new MEGA65 registers, e.g., those at $D7xx.

The result is that the program gets fairly convincingly frozen. But this is no good, if the program can't be unfrozen after.  But this also works just fine now, as the following video of me playing Krakout and freezing and resuming it multiple times shows.  (Apologies for the shaky video, I don't have my good camera and tripod here at home.  Similarly the general lack of audio due to the Zoom recorder also not being here.)

What is clear is that we can freeze and unfreeze a real game, and it resumes without any noticeable problems.  Even doing so multiple times is not a problem. It also works fine to freeze BASIC, as the following freezing, frozen and un-frozen images show:

Just to prove that it was still alive after, I typed some rubbish:

(Note the fun feature of the later C65 ROMs of showing error messages in red, regardless of what the cursor colour was before).

While I would like the freeze and unfreeze time to be a little faster, it is already quite acceptable.  Once we have the 8MB expansion RAM in the MEGA65 working, we will be able to freeze to expansion RAM instead of the SD card in the first instance, which should make freezing and unfreezing several times faster.

In fact, the main limitations at the moment are relatively few:

1. Like most C64 freezers, we can't really freeze the state of the SIDs, because of all those SID registers being write-only, and even if they were readable, they would only show what you wrote, not the current ADSR state of the voices etc.  I'll likely add some support for saving and resuming the internal state of the SIDs, so that freezing doesn't mess up music.

2. The CIAs are not currently backed up.  This is really just a little oversight, and should be quite trivial to fix.

3. The Hypervisor doesn't sanity-check the state of any previously mounted disk image(s), and re-mount them if still available.  Similarly, it doesn't check any other bits and pieces in the process descriptor block after loading it back in.

4. I noticed that blindly restoring the VIC-IV registers is bad if the freeze occurred at a high raster line, because it is possible for the raster compare register to be programmed to an impossibly high raster number.  This would cause the program to effectively not resume after unfreezing, unless you manually modified $D011 to clear the high bit of the raster compare register. Thus I should probably AND $D011 with $7F after restoring the machine state.
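The fix proposed in point 4 is a single mask. As a sketch (the function name is hypothetical, and the real fix would live in the Hypervisor's assembly rather than in C):

```c
#include <stdint.h>

/* Bit 7 of $D011 is bit 8 of the raster compare value. Clearing it keeps
   the compare line below 256, so the raster interrupt can still trigger. */
uint8_t sanitise_d011(uint8_t saved_d011)
{
  return (uint8_t)(saved_d011 & 0x7F);
}
```

For example, a saved value of $9B becomes $1B, and all other bits are left untouched.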

Wednesday, January 2, 2019

Working on the MEGA65 Freeze Menu

For a long time, the planned primary interface for controlling the MEGA65 has been a kind of "freeze menu".  While this will be easy for folks to change, our rationale for this is that it allows the machine to boot to BASIC as expected, but still have all the features you commonly want to use, e.g., mounting disk images, loading programs from a menu etc, a single button press away.

A while back, I mentioned that we were planning on having a double-tap of RESTORE trigger this.  This has evolved a bit into a long-press of RESTORE (anywhere from ~0.5 seconds to 5 seconds).  Holding it longer than that will reset the machine instead, which we might remove to avoid accidents, especially since the M65 will come with a reset button.

Quite a lot of work has gone on in the background to actually get to the point of having a freeze menu appear and be useful.  While it isn't quite there yet, it is now getting much closer.  A lot of that work has been on getting functional(ish) freeze and unfreeze routines working, as well as the hypervisor hooks to actually trigger the freeze and load the freeze menu itself.

So let's walk through how this all pulls together, beginning with pressing the RESTORE key, and detecting if it is a normal press of the RESTORE key, a long-press that should trigger the Hypervisor trap that launches the freeze process, or whether it should reset the CPU.  This is all in src/vhdl/keymapper.vhdl

  -- 0= restore down (pressed), 1 = restore up (not-pressed)
        if restore_state='0' and last_restore_state='1' then
          -- Restore has just been pressed, do nothing special.
          -- (Events happen on rising edge)
        elsif restore_state='1' and last_restore_state='0' then
          -- Restore has just been released
          if restore_down_ticks < 8 then
            -- <0.25 seconds = quick tap = trigger NMI
            restore_out <= '0';
          elsif restore_down_ticks < 32 then
            -- 0.25 - ~ 1 second hold = trigger hypervisor trap
            hyper_trap <= '0';

            hyper_trap_count <= hyper_trap_count_internal + 1;
            hyper_trap_count_internal <= hyper_trap_count_internal + 1;
          elsif restore_down_ticks < 128 then
            -- Long hold = do RESET instead of NMI
            -- But holding it down for >4 seconds does nothing,
            -- in case someone holds it by mistake, and wants to abort doing a reset.
            reset_drive <= '0';
            report "asserting reset via RESTORE key";
          end if;
          hyper_trap <= '1';
          restore_out <= '1';
          reset_drive <= '1';
        end if;
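The decision made on release of the key boils down to comparing the hold time against three thresholds. Restated in C purely for illustration (the enum and function names are hypothetical; the thresholds are the tick counts from the VHDL above, whose comments suggest roughly 32 ticks per second):

```c
/* What releasing RESTORE should do, given how long it was held */
enum restore_action {
  RESTORE_NMI,          /* quick tap: normal NMI */
  RESTORE_FREEZE_TRAP,  /* medium hold: trap to the Hypervisor freezer */
  RESTORE_RESET,        /* long hold: reset the machine */
  RESTORE_NONE          /* very long hold: treated as a change of mind */
};

enum restore_action classify_restore_release(unsigned down_ticks)
{
  if (down_ticks < 8) return RESTORE_NMI;
  if (down_ticks < 32) return RESTORE_FREEZE_TRAP;
  if (down_ticks < 128) return RESTORE_RESET;
  return RESTORE_NONE;
}
```

Note that nothing happens while the key is still held: the action is chosen only once the key comes back up, which is what makes the abort-by-holding behaviour possible.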

When hyper_trap goes to zero, this tells the CPU to trigger the freezer Hypervisor trap.  This really just means that the CPU enters Hypervisor mode after saving register state, and then jumps to a certain location in the Hypervisor programme.  To make writing the freeze menu easy, after saving the state of the machine to freeze slot #0, the hypervisor loads in the standard C64 character set and a C65 ROM, and assumes that the freeze menu is a program made for C64 mode with entry point at SYS 2061.  This means we can write the freeze menu using CC65, the C compiler for the C64, for example.

In the following snippet from kickstart_task.a65 we can see that the Hypervisor already implements a bunch of very handy routines that make it easy to load the ROM files, and then the freeze menu itself.  Loading the freeze menu is performed by setting the name of the file we want to load from the SD card ("FREEZER.M65"), and then providing the 32-bit load address. We load it to $07FF instead of $0800 or $0801 as you might have otherwise expected, because we expect the program to have a normal C64-style $01 $08 header on it, and thus we need to pretend it loads at $07FF so that the first real byte of data is placed at $0801.

Otherwise, there is nothing too surprising here. We set the C64 memory map to make life easier for the program, and we also provide a dummy NMI vector, as we have seen race conditions where an NMI can be triggered before a proper NMI vector has been installed. Since we don't enter via the C64/C65 ROM's normal entry point, the NMI vector at $0318 won't get set up automatically, thus requiring this precaution.  Finally we set the value of the PC on exit from the Hypervisor, and actually exit the Hypervisor itself:


    ; Freeze to slot 0
    ldx #<$0000
    ldy #>$0000
    jsr freeze_to_slot

    ; Load freeze program
    jsr attempt_loadcharrom
    jsr attempt_loadc65rom

    ldx #<txt_FREEZER
    ldy #>txt_FREEZER
    jsr dos_setname

    ; Prepare 32-bit pointer for loading freezer program ($000007FF)
    ; (i.e. $0801 - 2 byte header, so we can use a normal PRG file)
    lda #$00
    sta <dos_file_loadaddress+2
    sta <dos_file_loadaddress+3
    lda #$07
    sta <dos_file_loadaddress+1
    lda #$ff
    sta <dos_file_loadaddress+0

    jsr dos_readfileintomemory
    jsr task_set_c64_memorymap
    jsr task_dummy_nmi_vector
    ; set entry point and memory config
    lda #<2061
    sta hypervisor_pcl
    lda #>2061
    sta hypervisor_pch

    ; return from hypervisor, causing freeze menu to start
    sta hypervisor_enterexit_trigger
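The $07FF load address follows directly from the PRG file format: the first two bytes of the file are a little-endian load address, and loading the whole file two bytes early puts the payload where that header says it belongs. A small sketch of the arithmetic, with hypothetical function names:

```c
#include <stdint.h>

/* A PRG file begins with a two-byte little-endian load address */
uint16_t prg_header_address(const uint8_t *prg)
{
  return (uint16_t)(prg[0] | (prg[1] << 8));
}

/* Address at which to place the whole file, header included, so that the
   first payload byte ends up at the address named in the header */
uint16_t raw_load_address(const uint8_t *prg)
{
  return (uint16_t)(prg_header_address(prg) - 2);
}
```

For a file starting with $01 $08 this gives $0801 - 2 = $07FF. The entry point 2061 is $080D, which is where machine code starts after a typical 12-byte BASIC "SYS 2061" stub at $0801.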

The actual freezing happens in the Hypervisor in the freeze_to_slot routine, rather than in the freeze menu. Similarly, unfreezing happens in the Hypervisor as well.  This actually solves a lot of problems all at the same time. First, the freeze menu doesn't need to know about changing on-SD formats for the freeze slots.  Second, it makes sure that there is a single freeze and a single unfreeze routine used in all situations. Third, it allows use of the extra memory of the Hypervisor, to allow for near-perfect freezing, without corrupting the stack or any other memory.  It also means that we can provide a nice simple abstracted interface to allow one program to get itself replaced by another in memory, similar to exec() on UNIX-like systems.

The freeze and unfreeze routines are naturally very similar. They basically consist of a loop that iterates through a range of memory areas that have to be loaded or saved, with an optional pre-save or post-load hook.  This allows us to define pseudo regions that save some tricky bits of machine state that we can't just DMA to the SD card.  It also makes it quite easy to modify what gets saved.  Here is the definition of the list of regions to be saved as they currently stand.  We know there are some missing bits, and we have removed some bits to make this easier to read.

    ; start address (4 bytes), length (3 bytes),
    ; preparatory action required before reading/writing (1 byte)
    ; Each segment will live in its own sector (or sectors if
    ; >512 bytes) when frozen. So we should avoid excessive
    ; numbers of blocks.

    ; SDcard sector buffer + SD card registers
    ; We have to save this before anything much else, because
    ; we need it for freezing.
    .dword $ffd6000
    .word $0290
    .byte 0
    .byte freeze_prep_stash_sd_buffer_and_regs

    ; 384KB RAM (includes the 128KB "ROM" area)
    .dword $0000000
    .word $0000     
    .byte 6          ; =6x64K blocks = 384KB
    .byte freeze_prep_none   

    ; 32KB colour RAM
    .dword $ff80000
    .word $8000
    .byte $00
    .byte freeze_prep_none

    ; VIC-IV palette block 0
    .dword $ffd3100
    .word $0400
    .byte 0
    .byte freeze_prep_palette0

    ; VIC-IV palette block 1
    .dword $ffd3100
    .word $0400
    .byte 0
    .byte freeze_prep_palette1

    ; VIC-IV palette block 2
    .dword $ffd3100
    .word $0400
    .byte 0
    .byte freeze_prep_palette2

    ; VIC-IV palette block 3
    .dword $ffd3100
    .word $0400
    .byte 0
    .byte freeze_prep_palette3   

    ; Process scratch space
    .dword currenttask_block
    .word $0100
    .byte 0
    .byte freeze_prep_none

    ; $D640-$D67E hypervisor state registers
    ; XXX - These can't be read by DMA, so we need to have a
    ; prep routine that copies them out first?
    .dword $ffd3640
    .word $003F
    .byte 0
    .byte freeze_prep_none

    ; VIC-IV, F011 $D000-$D0FF
    .dword $ffd3000
    .word $0100
    .byte 0
    .byte freeze_prep_none

    ; $D700-$D7FF CPU registers

    .dword $ffd3700
    .word $0100
    .byte 0
    .byte freeze_prep_none

    ; XXX - Other IO chips!

    ; End of list
    .dword $FFFFFFFF
    .word $FFFF
    .byte $FF
    .byte $FF
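Each entry in this list is eight bytes: a 4-byte start address, a 3-byte length (the .word followed by the .byte) and a 1-byte prep action. As a rough sketch of how such a list could be walked, for example to work out how many SD card sectors a freeze slot needs, consider the following. The struct and function names are hypothetical, and the real walk is done in the Hypervisor's assembly.

```c
#include <stdint.h>

/* One entry of the region list, mirroring the .dword/.word/.byte/.byte layout */
struct freeze_region_entry {
  uint32_t start;      /* .dword: start address */
  uint16_t length_lo;  /* .word: low 16 bits of the length */
  uint8_t length_hi;   /* .byte: bits 16-23 of the length */
  uint8_t prep;        /* .byte: preparatory action */
};

/* Combine the word and byte into the 24-bit length in bytes */
uint32_t region_length(const struct freeze_region_entry *e)
{
  return ((uint32_t)e->length_hi << 16) | e->length_lo;
}

/* Each region starts on a fresh sector, so round up to whole sectors */
uint32_t region_sectors(const struct freeze_region_entry *e)
{
  return (region_length(e) + 511) / 512;
}

/* Walk the list up to the $FFFFFFFF terminator and total the sectors */
uint32_t total_sectors(const struct freeze_region_entry *list)
{
  uint32_t total = 0;
  while (list->start != 0xFFFFFFFFUL) {
    total += region_sectors(list);
    list++;
  }
  return total;
}
```

The 384KB RAM entry, for instance, has a length word of $0000 and a length byte of 6, giving $060000 bytes or 768 sectors, while the $0290-byte SD buffer entry rounds up to 2 sectors.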

There are four lots of the VIC-IV palette, because the MEGA65 has four palette banks that can be dynamically selected, but are mapped to the same region of memory. The freeze_prep_paletteN routines therefore make sure the correct one is mapped before the area is saved/loaded. These routines are typically quite simple, e.g.:

    ; We do the same memory map setup during freeze and unfreeze
    ; X = 6, 8, 10 or 12
    ; Use this to pick which of the four palette banks
    ; is visible at $D100-$D3FF
    sbc #freeze_prep_palette0
    ora #$3f  ; keep displaying the default palette
    sta $d070

Now if we turn our attention to the freeze menu, this basically consists of a normal program that can do whatever we want.  The current version just displays a simple set of options (most of which aren't yet implemented), and selects one of them based on key input.  Key input is done using the MEGA65's super-easy ASCII keyboard input abstraction layer, where you can basically just read $D610 to get the next key from the keyboard, with all modifiers like SHIFT and CONTROL already applied.  Function keys map to $F1 - $FE, making life super simple for menus.  Here is the important bit of freezer.c:

  // Flush input buffer
  while (PEEK(0xD610U)) POKE(0xD610U,0);

  // Main keyboard input loop
  while(1) {
    //    POKE(0xD020U,PEEK(0xD020U)+1);
    if (PEEK(0xD610U)) {
      // Process char
      switch(PEEK(0xD610U)) {
      case 0xf1: // F1 = backup
        break;
      case 0xf3: // F3 = resume
        // Load memory from freeze slot $0000, i.e., the temporary save space
        // This implicitly restarts the frozen program
        __asm__("LDX #<$0000");
        __asm__("LDY #>$0000");
        __asm__("LDA #$12");
        __asm__("STA $D642");
        __asm__("NOP");  // single-byte padding required after a Hypervisor trap
        break;
      case 0xf7: // F7 = show screen of frozen program
        // XXX for now just show we read the key
        break;
      }
      // Flush char from input buffer
      POKE(0xD610U,0);
    }
  }

The inline assembly in the F3 case makes a Hypervisor call asking for whatever currently lives in freeze slot 0 to be loaded back into memory.  This by definition will replace the freeze menu in memory, so there is nothing more to be done.  We have gone to quite some effort to make calling the Hypervisor really painless, which I think shows here:  All you have to do is prepare the register values for the call, where the accumulator usually indicates the sub-function of the Hypervisor call, and then write to the correct Hypervisor trap address between $D640-$D67F.  It doesn't matter what you write, or from which register, as the act of asking the CPU to write to these registers tells it you want to trap to the Hypervisor.  The Hypervisor automatically (in just one clock cycle!) saves all processor flags, registers and memory mapping settings, and switches to the Hypervisor memory context.  This makes Hypervisor calls very simple and efficient.  The only gotcha at the moment is the need to put a NOP or other single-byte junk instruction after the write that triggers the Hypervisor call.  This is to work around a bug where sometimes the PC value on exit from the Hypervisor call is incremented by one.

But enough theory already. We want pictures!

Here is the MEGA65 mid-freeze, with border colour action telling you something is happening:

After a couple of seconds, this is replaced with the freeze menu, which is currently rather spartan. You can probably tell I used to use an Action Replay as my preferred freeze cartridge ;) This program will get a thorough pimping as time goes on.

Finally, here is the view after resuming:

If you want to see it as moving pictures:

There are a few obvious things to point out here:

1. We can clearly trigger loading of the freeze menu program.
2. We can (at least partly) save and restore memory contents and IO registers, as shown by how we manage to restore the C65 BASIC boot screen on un-freeze, complete with switching back to 80 column mode, and restoring colour RAM (so that the bars are different colours etc).
3. The palette is seriously messed up.  It turns out I have a bug in the DMAgic implementation when reading the palette, where it gets it one byte late.  It might be that we need to have an extra wait-state on reading the palette memory.
4. The frozen program doesn't actually resume after being unfrozen.  I'll have to look at the saved registers etc, and see why they aren't getting restored correctly. Actually, it looks like the unfreeze process never quite completes, but is instead stuck loading a sector from the SD card. I'll have to investigate that.

Anyway, that's where things are up to right now.  Hopefully it shouldn't be too much longer before we can correctly unfreeze with the right colours, and with a running program after.

Thursday, December 27, 2018

Super simple Protovision-Compatible Joystick Expander for MEGA65

This post isn't a feature that I had originally planned for the MEGA65, but came about from playing games on the MEGA65 with the kids.  Two kids plus one Dad = 3 players, but of course only two joystick ports, a recipe for problems.  Fortunately, there are some fun games that support 3 or 4 players, mostly using the excellent Protovision 4-port joystick adapter, but it requires a user port, which the MEGA65 doesn't have.

We thought about including a user port on the MEGA65, but the extra cost from making the PCB quite a lot bigger was prohibitive, especially since we don't expect that many people to need the userport -- with the possible exception of playing four player games.

However, rather than say there shall forever be no user port, we decided instead that we would create a cartridge that turns the expansion port into a user port, in a way that is totally transparent to software.  Thus, it just means you can use either the expansion port or the user port at any point in time, which probably solves the needs of almost all users, especially since the MEGA65 has built in freezer, RAM expansion, ethernet, microSD card and so on -- in short, there shouldn't be too much that you need to plug into the expansion port, unless you want to use a game cartridge, but then none of those (that I am aware of) require the user port at the same time.  The only device I can think of that would be a problem would be the combined cartridge + user port EEPROM burner I have stashed somewhere, but even then the cartridge ROM could be copied and run from RAM.  Let's just say that for the very few cases where you might want both ports there are probably work-arounds, or you can pull out your real C64.

Anyway, since I had no user port, and wanted extra joystick ports, I thought I might solve both problems at the same time by creating a simplified version of what will eventually become the MEGA65 user port cartridge, that just provides two extra joystick ports.  I also want it to be as simple and cheap to build as possible, ideally using only passive components, perhaps even only wires to the joystick connectors and between pins on the expansion port.   I would also like the cartridge to not damage a C64 or C128 if it is inserted, which means it has to play nicely with the existing use of the expansion port.

What I came up with, was to make a cartridge that ties the /DMA line to ground, and then directly connect the joystick lines to the lower data and address bits.  This requires no active components at all.  Also, because pulling /DMA low causes the CPU of the C64 or C128 to pause, and thus release the address and data lines, it should be safe from that side.  For the VIC-II when connected to a real C64/C128, it isn't quite as simple, because the VIC-II can cause memory accesses, and can drive the address lines itself. This is all only a problem if you actually use a joystick, because if the joysticks are idle, then none of the lines are pulled low.  Thus, while the cartridge should never be inserted into a C64 on purpose (because it won't work), it shouldn't break anything. 

The safety of this could probably be improved by tying R/_W to GND, so that even if the VIC-II did ask for a memory access, the RAM will stay off the bus. Better yet, the cartridge should first detect the presence of a MEGA65 in some way.  The trick is how to do this passively.  Perhaps the most sensible approach is to find some signal we can drive to GND from the MEGA65 side, and that is normally high on the C64 or C128's expansion port.  This signal can then be used as the source for the GND line on the joysticks. The /RESET line is probably a good choice for this, as it is only very transiently low, and when it is low, the contents of the C64/C128 it is connected to should, by definition, all be silent and buses in tri-state conditions. Indeed, with this approach the only opportunity to cause grief would be if you tried to use the joystick while /RESET was active, which would require some degree of intentionality, and would be limited to a few microseconds, which is probably much too short to cause any damage to anything. 

In short, this sounds to me like quite a good and simple solution -- provided that the MEGA65's /RESET line on the expansion port can sink enough current.  If this proves to be a problem, the addition of a single drive transistor on the cartridge would certainly solve the problem. Thus, I think we are good to go.
EDIT: For my joysticks with lights, the current is too low, which causes the computer to think fire is being pressed when any direction is selected, because the current draw of the LED is already reducing the voltage enough to be marginal.  But as mentioned, this is easily fixed with the addition of a driving transistor.

All that was left was to implement the VHDL that detects the /DMA line being pulled down from start up.  As soon as /DMA goes high (which would be immediately on any normal cartridge), the expansion port goes into normal operation mode. However, if /DMA remains held low, then the expansion port disables itself, and instead sets up the data and address lines as inputs, pulls /RESET low, reads the joystick state from the data and address lines, and feeds it to the CIA that normally handles the user port.  The result acts exactly like the Protovision joystick expander, so that software does not need to be modified at all.
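
To make the detection rule concrete, here is a small C model of it. This is only a sketch: the real logic is VHDL in the MEGA65 core, and the names `port_state_t`, `port_step` and the `DMA_LOW_THRESHOLD` value are hypothetical, chosen purely for illustration. It also assumes that the decision is latched once made.

```c
#include <stdbool.h>

/* Hypothetical model of the expansion port mode detection.
   The real implementation is VHDL; names and threshold are assumptions. */
typedef enum { PORT_PROBING, PORT_NORMAL, PORT_JOYSTICKS } port_mode_t;

typedef struct {
    port_mode_t mode;
    unsigned low_cycles;   /* consecutive cycles /DMA has been seen low */
} port_state_t;

/* Assumed number of cycles /DMA must stay low before we believe
   a joystick cartridge is holding it down. */
#define DMA_LOW_THRESHOLD 1000u

void port_init(port_state_t *s)
{
    s->mode = PORT_PROBING;
    s->low_cycles = 0;
}

/* Call once per clock cycle with the sampled level of /DMA. */
void port_step(port_state_t *s, bool dma_n_high)
{
    if (s->mode != PORT_PROBING) return;      /* decision already latched */
    if (dma_n_high) {
        s->mode = PORT_NORMAL;                /* any normal cartridge */
        return;
    }
    if (++s->low_cycles >= DMA_LOW_THRESHOLD)
        s->mode = PORT_JOYSTICKS;             /* /DMA held low since start up */
}
```

In joystick mode the core would then stop driving the bus and sample the data and address lines as joystick inputs, as described above.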

I had an old broken MACH5 cartridge floating around, whose ROM was lost long ago, which I decided to use as the PCB, since all I needed to do was tack on the wires for the joystick, and connect /DMA to GND.  Here is the result:

The eagle-eyed among you will notice that: (1) this is from before I came up with using /RESET as the GND for the joystick.  I'll be fixing that momentarily. (2) There is only one joystick port connected. That's because I have only two children, and thus only needed 3 joysticks at the same time.

Here it is plugged into the MEGA65 prototype on my desk:

Now, as mentioned, I needed three joysticks. I had already made two arcade style joysticks some time back, but still needed an extra one. The two with red buttons were the ones I had made already.  So I went to Jaycar Electronics and bought parts for a third, although I accidentally got a larger box than I used for the other two.  Not a problem, we now have the two kid-sized joysticks and one "papa joystick", which is also identified by the green light.  So now I had three joysticks, and we could try some multiplayer games:

That's Shotgun that we are playing there, which is indeed MUCH more fun with more than two players.

We also had a go at Frogs, which like Shotgun is a free game from Dr. Wuro.  It is very nice that he has made some fun free games. We can now confirm that both work perfectly on the MEGA65, including with the extra joystick ports.

So now the kids and I can play games together on the MEGA65 all at once. We might even make a simple free multiplayer game of our own to celebrate.

M.E.G.A 6502 benchmark for checking cycles per instruction

Recently I was tweaking the 1MHz instruction timing and debugging some related glitching I was seeing, and realised that I didn't have any really good way to check whether the MEGA65 was really running every instruction with the correct number of cycles.  The existing well-known C64 benchmarks, SynthMark64 and BoulderMark, don't really give this kind of direct break-down.  So I wrote one that does:

Basically it works by going through all the legal 6502 opcodes, and timing how many cycles they take.  Actually, it runs each instruction 256 times, so that if the CPU is fast, we can accurately measure fractional cycles taken.  This is of course quite important when running on a MEGA65 at 40MHz!

To achieve this, the test program works fairly similarly to SynthMark64 (and is also written primarily using the CC65 C compiler):  It constructs a test routine for each instruction that sets up a CIA counter, runs the instruction, stops the counter, subtracts the overhead of fiddling with the timer, and then divides the result by 256 to get the number of cycles per instruction.
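
The arithmetic at the end of that process can be sketched as follows. This is not the benchmark's actual code: the function name and the choice to report tenths of a cycle are assumptions made for illustration. The only hardware fact it relies on is that CIA timers count down.

```c
#include <stdint.h>

#define ITERATIONS 256u

/* Hypothetical helper: returns cycles per instruction in tenths of a
   cycle, given the CIA timer value before and after the test run and
   the measured overhead of starting and stopping the timer. */
uint32_t cycles_per_instruction_x10(uint16_t timer_start,
                                    uint16_t timer_end,
                                    uint16_t overhead_ticks)
{
    /* CIA timers count down, so elapsed = start - end */
    uint16_t elapsed = (uint16_t)(timer_start - timer_end);
    uint32_t net;

    if (elapsed < overhead_ticks) return 0;   /* jitter: treat as zero */
    net = (uint32_t)elapsed - overhead_ticks;

    /* Scale to tenths before dividing by the iteration count,
       rounding to the nearest tenth. */
    return (net * 10u + ITERATIONS / 2u) / ITERATIONS;
}
```

For a NOP at 1MHz, 256 iterations consume 512 ticks after overhead, giving 20 tenths, i.e. 2.0 cycles. On a very fast CPU the same 256 iterations may consume only a handful of ticks, and the result rounds down to 0.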

In reality, it is a bit more complex, as there are a bunch of instructions where we have to get a bit clever-pants. For example, for PHA, we have to pop something off the stack after stopping the timer, while for PLA we have to first put something onto the stack.

Some instructions like RTS require that we push a return address that exactly matches the next instruction in the stream to run.  This turns out to not be that hard to implement. First, insert the instructions to push something to the stack.

 if (!strcmp(instruction_descriptions[opcode],"RTS")) {
    // Push two bytes to the stack for return address
    // (the actual return address will be re-written later)
    test_routine[offset++]=0xA9; // LDA #$nn
    test_routine[offset++]=0x00; //   operand: high byte placeholder
    test_routine[offset++]=0x48; // PHA
    test_routine[offset++]=0xA9; // LDA #$nn
    test_routine[offset++]=0x00; //   operand: low byte placeholder
    test_routine[offset++]=0x48; // PHA
 }

Then once we know the address we should return to, replace that:

 // If instruction was RTS or RTI, rewrite the pushed PC to correctly point here
 if (!strcmp(instruction_descriptions[opcode],"RTS")) {
    addr=(unsigned short)(&test_routine[offset-1]);
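
Putting the two steps together, a self-contained sketch of the idea might look like the following. It is not the benchmark's real code: `emit_rts_test` and its `base` parameter (the 6502 address where the buffer will live) are made up for illustration. The key fact is that RTS pulls the low byte first and then adds one to the pulled address, so the high byte is pushed first and the value to push is the address of the RTS opcode itself.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: emit "LDA #hi, PHA, LDA #lo, PHA, RTS" into buf,
   with the pushed address patched so RTS falls through to the byte
   immediately after itself. base is the 6502 address of buf[0]. */
size_t emit_rts_test(uint8_t *buf, uint16_t base)
{
    size_t offset = 0;
    size_t hi_operand, lo_operand;
    uint16_t addr;

    buf[offset++] = 0xA9;          /* LDA #$nn (high byte, patched below) */
    hi_operand = offset;
    buf[offset++] = 0x00;
    buf[offset++] = 0x48;          /* PHA */
    buf[offset++] = 0xA9;          /* LDA #$nn (low byte, patched below) */
    lo_operand = offset;
    buf[offset++] = 0x00;
    buf[offset++] = 0x48;          /* PHA */
    buf[offset++] = 0x60;          /* RTS */

    /* RTS resumes at pulled address + 1, so push the address of the
       RTS opcode itself, which is at offset-1. */
    addr = (uint16_t)(base + offset - 1);
    buf[hi_operand] = (uint8_t)(addr >> 8);
    buf[lo_operand] = (uint8_t)(addr & 0xFF);

    return offset;                 /* number of bytes emitted */
}
```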

RTS is also one of the few instructions where we can't actually count 256 executions in one loop, because we would need a bigger stack than exists. Basically any instruction that touches the stack falls in this category. As a result, those instructions will seem a bit faster when running at full speed than they really are.  However, the effect is somewhat less than what SynthMark64 suffers in similar circumstances, where it doesn't correctly adjust for the reduced cost of the overhead of fiddling with the timers.  It is that problem in SynthMark64 that results in impossibly fast reported speeds for NOP instructions on the MEGA65.

Anyway, as can be seen above, the MEGA65 now uses exactly the correct number of cycles per instruction when simulating 1MHz mode.

The three speed figures at the bottom report how fast the machine seems to be, compared with a stock PAL C64.  These figures are based on different weightings for each opcode.  FLAT uses an equal weighting, i.e., it assumes that LDA and RTI both are executed equally frequently, which is clearly not realistic. It does, however, give a good general idea of the machine's speed.  For the other two work-loads, C64 BASIC and BLDERDASH, I used the real-time CPU instruction capture program, ethermon, that I wrote about in the last post to gather statistics.  In fact, I modified it, so that if you give it the -f option for instruction frequencies, it outputs the table in exactly the correct format for me to include in the source for this benchmark program :)
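
One plausible way to turn per-opcode timings and weightings into a single speed figure is sketched below. Treat it as an assumption, not as the formula the benchmark actually uses: it simply compares the weighted total of stock C64 cycle counts against the weighted total of measured cycle counts. The function name and the fixed-point scaling are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: returns the speed relative to a stock C64,
   multiplied by 100 (so 100 means 1.00x).
   weight[i]       = how often opcode i occurs in the workload
   stock_x10[i]    = stock cycle count of opcode i, in tenths
   measured_x10[i] = measured cycle count of opcode i, in tenths */
uint32_t weighted_speed_x100(const uint32_t *weight,
                             const uint32_t *stock_x10,
                             const uint32_t *measured_x10,
                             size_t n)
{
    uint64_t stock_total = 0, measured_total = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        stock_total    += (uint64_t)weight[i] * stock_x10[i];
        measured_total += (uint64_t)weight[i] * measured_x10[i];
    }
    if (measured_total == 0) return 0;   /* avoid dividing by zero */
    return (uint32_t)(stock_total * 100u / measured_total);
}
```

With this formulation, FLAT corresponds to setting every weight to 1, while the other workloads would use the frequency tables produced by ethermon.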

If we then enable full speed on the MEGA65 and run it again, we get something like the following:

First, ignore the information claiming that BRK runs at 255 cycles per iteration. BRK is the single instruction that I haven't implemented testing for, although it should be possible. I got a little lazy, because BRK is not an instruction that tends to get used a lot.  While the test is running you can use the cursor keys to move the cursor around the test field, and it will show you information for the selected opcode. Thus, if you see an instruction that is green or red, you can fairly easily select it, and see exactly the detected difference in speed.  There are still a few wrinkles in this (including that it sometimes displays the line twice, as above), but it mostly works.

A bigger problem at the moment is that the reported speeds can sometimes come out lower than they should when running on very fast CPUs. This is because the 1MHz ticks of the CIA do not occur every CPU cycle.  Thus it is possible for the number of consumed 1MHz ticks to sometimes be one more than it should be, if the phase difference means that a 1MHz cycle boundary is crossed.

Similarly, the overhead calculation can vary by 1 cycle as well, between zero and one cycle.  I can't think of a really good solution to this jitter, that is machine independent. It really is just a side-effect of trying to measure something that can be only a tiny fraction of a cycle with a clock that counts only whole cycles.  I could try to average the overhead over many tests, and use the average figure, but that would also require some care in case you changed the CPU speed while the test is running.  But at the end of the day, the contribution of the jitter on the overhead is relatively small, and should be mostly self-cancelling, since it will be rather random for each opcode, and thus overall should end up being near the correct average value, and this seems to be played out in practice.

A bigger problem was that, in implementing this, I accidentally introduced a systematic bias: the jitter was only being counted in the direction of slower measured speeds, and this was compounded by running the stack-based instructions only once.  Originally the overhead was calculated and removed for each iteration separately, and the cases where the overhead was one cycle but the instruction run was measured at zero cycles, i.e., giving an apparent run time of -1 cycles, were discarded.  This was solved by summing the overhead of each of the 256 iterations of the stack-based instructions, as well as the time taken for each instruction, and then deducting the entire overhead at the end.  Summed over all 256 iterations, the -1 cases are counted and cancel against the occasions where a full 1MHz cycle was claimed to be consumed, so that on average the result approximates the correct time consumed.
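
The difference between the two approaches is easy to show with a toy example. The following is only an illustration with hypothetical function names: `raw[i]` stands for the ticks counted in iteration i, and `ovh[i]` for the overhead counted in that same iteration.

```c
#include <stddef.h>

/* Biased approach: deduct the overhead per iteration and throw away
   any iteration that comes out negative. Jitter can then only ever
   push the total upwards. */
long total_discarding_negatives(const int *raw, const int *ovh, size_t n)
{
    long total = 0;
    size_t i;
    for (i = 0; i < n; i++) {
        int d = raw[i] - ovh[i];
        if (d >= 0) total += d;
    }
    return total;
}

/* Summed approach: add up everything, then deduct the whole overhead
   once at the end, so -1 and +1 jitter cases cancel each other. */
long total_summed(const int *raw, const int *ovh, size_t n)
{
    long raw_total = 0, ovh_total = 0;
    size_t i;
    for (i = 0; i < n; i++) {
        raw_total += raw[i];
        ovh_total += ovh[i];
    }
    return raw_total - ovh_total;
}
```

For example, with `raw = {1, 0, 0, 1}` and `ovh = {0, 1, 0, 1}` the per-iteration differences are +1, -1, 0 and 0. Discarding the negative case reports 1 tick, while summing reports 0.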

We can now see all the cycle times are fractional, and for the faster instructions are reported as .0 cycles (I kept the output fields only 2 chars wide so that it can all fit on the screen at the same time. As a result it is a bit squashed.  I could also make an 80-column C65/MEGA65 only version, but I wanted it to also run on a stock C64).  They also show up in green because they are faster than on a stock C64. If they were slower, they would show up in yellow (because red on blue on a stock C64 is a known disaster colour combination).

By using the cursor keys you can even select an instruction to give more details on, which is indicated by the reverse-video cursor, and the info given below the instruction table:

So we see that the MEGA65 is close to 40x stock C64 speed, depending on the work load, but with considerable variation among the instructions. This makes sense, since its CPU is 40x faster, and has broadly similar native instruction cycle counts, with some faster and some slower.  The reason some take more native cycles than on a 6502 is because the MEGA65's CPU cannot read from an address immediately after computing it from instruction arguments, as 25ns isn't enough time for the data from one memory cell to make its way into the CPU, be computed on (eg, adding an index register), and then be sent out to the RAM address lines.  I keep meaning to use these latent cycles to pre-fetch the opcode of the next instruction, but given the CPU is already so fast, it has never really become a priority.

The careful observer will also see that the 4th workload in the mix is rather unexpected:  The folks on the #c-64 IRC channel pointed me to the C64 bitcoin mining program (yes, someone actually wrote one!).  It is of course so slow as to be useless, even when running at 40x original speed.  But I guess I now have the record for the fastest bitcoin mining C64, at a rate of 100 nonces in a mere 6 seconds!

Talking earlier about the real-time ethernet-based CPU instruction logger, I mentioned that the 100mbit ethernet isn't fast enough to log instructions at full speed at 40MHz.  But we can now easily check just how fast we can do it:

Not bad! We can even run slightly faster than a stock C65, while logging every single instruction and every single video badline / raster line advance event!

As this is a new benchmark (although it will likely evolve a bit, in particular, because the reported speeds are still a bit jittery, and it runs a bit slowly, testing only one instruction per frame, so as to ensure the test is running in the vertical borders to avoid the effects of badlines), it would be great if folks were able to test it on some different systems.  Groepaz has already kindly run it on his Turbo Chameleon, which reported speeds of: FLAT=16.4x , C64 BASIC=17.7x, Boulder Dash = 17.54x and Bitcoin 64 = 15.47x (although these were produced before I fixed the jitter, so actual results might vary somewhat). 

EDIT: I have now uploaded it to CSDB so you can try it out on your own hardware if you wish.  I'd love to receive speed reports for different systems.  Note for now you will need to manually activate acceleration, as the program itself does not do this for you.

Real-time CPU tracing

We already have the means to perform low-speed tracing of the MEGA65's CPU via the serial monitor interface.  This has been invaluable for all sorts of things so far.  However, I have been investigating a number of gritty little bugs that only show up when the CPU is running at full speed.  A good example is when Ghosts'n'Goblins sometimes does this:

The underlying problem is that the C65/MEGA65 register at $D031 gets written to.  Actually, it looks like $D030 - $D03F (at least) get written with the pattern $78 $50.  However, single-stepping, the problem never shows up, perhaps because it requires an IRQ at a critical time, or is due to some other dynamic effect.

Thus we need a way to debug the machine at full speed. The serial monitor interface is limited to 4mbit/sec in practice, which even with an efficient encoding for instructions would limit us to well below 1 million instructions per second, probably significantly less in practice. Also, the ROM for the serial monitor is already rather full, so there isn't really space to implement such an encoding, and its memory access interface to the CPU would change the timing of the instructions, which might well hide the problems we are trying to track down.  Also, it would not be effective at all if we want to debug bus accesses instead of instructions, as those happen at 40MHz (the current speed of the MEGA65's CPU).  So we need a higher bandwidth option.

The logical choice is to use the 100mbit/sec ethernet adapter.  Not only does it have 25x the bandwidth, but the video streaming code already has most of the infrastructure needed -- all that is required is to add the code to pull the CPU instructions in, pack them in a buffer, and then push them out in packets.

The first challenge is that the MEGA65's CPU can issue instructions on back-to-back cycles, as implied mode instructions typically take only one cycle. This means we have to be able to write the entire instruction debug information vector in a single cycle.  Fortunately the BRAMs on Xilinx FPGAs can support funny arrangements where you can write 64 bits wide, and read 8 bits wide.  There is a bit of magic to making this work, as you have to get Vivado to infer the correct BRAM mode.  Xilinx have some good documentation on example VHDL templates that will achieve this, which I was able to adapt quite easily.  So now I could write 8 bytes at a time into a FIFO, and read it out a byte at a time when sending ethernet frames.

It is a bit annoying that we are limited to 64 bits, as it would be nice to have the program counter (16 bits), instruction (up to 24 bits), flags (8 bits), and stack pointer (16 bits), as well as the A, B, X, Y and Z registers (40 bits more).  That makes a total of 104 bits.  So in the end I have had to leave the B, X, Y and Z registers out, and keep only the low byte of the stack pointer, which brings the record down to exactly 64 bits.  I could get all clever-pants later and provide less frequent updates of those registers by providing them in cycles between instructions, however, for now I don't really have a need for that.
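
To show how the fields that survive fit into 64 bits, here is a sketch of packing and unpacking such a record in C. The field order is purely an assumption for illustration (the real layout is whatever the VHDL and ethermon agree on), as are the names `trace_record_t`, `trace_pack` and `trace_unpack`. The set of fields is taken from the trace output shown later in this post, which displays the flags, the low byte of the stack pointer, A, the PC and up to three instruction bytes.

```c
#include <stdint.h>

/* Hypothetical 64-bit trace record: 16 + 24 + 8 + 8 + 8 = 64 bits. */
typedef struct {
    uint16_t pc;         /* program counter */
    uint8_t  opcode[3];  /* up to three instruction bytes */
    uint8_t  flags;      /* processor flags */
    uint8_t  sp_low;     /* low byte of the stack pointer */
    uint8_t  a;          /* accumulator */
} trace_record_t;

/* Assumed bit layout, most significant field first:
   pc | opcode[0] | opcode[1] | opcode[2] | flags | sp_low | a */
uint64_t trace_pack(const trace_record_t *r)
{
    return ((uint64_t)r->pc        << 48) |
           ((uint64_t)r->opcode[0] << 40) |
           ((uint64_t)r->opcode[1] << 32) |
           ((uint64_t)r->opcode[2] << 24) |
           ((uint64_t)r->flags     << 16) |
           ((uint64_t)r->sp_low    <<  8) |
            (uint64_t)r->a;
}

trace_record_t trace_unpack(uint64_t w)
{
    trace_record_t r;
    r.pc        = (uint16_t)(w >> 48);
    r.opcode[0] = (uint8_t)(w >> 40);
    r.opcode[1] = (uint8_t)(w >> 32);
    r.opcode[2] = (uint8_t)(w >> 24);
    r.flags     = (uint8_t)(w >> 16);
    r.sp_low    = (uint8_t)(w >> 8);
    r.a         = (uint8_t)w;
    return r;
}
```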

So now I could produce 2KB ethernet frames packed with 512 x 8 byte instruction records.  Next, I had to get the logic right to pause the CPU when the buffer was filling, but ideally not have to pause the CPU every time we have a packet.  Using 2KB packets and a 4KB BRAM, this means I can have one packet being sent, while preparing the next one, and only pausing the CPU if we approach the end of the packet we are preparing, and the ethernet is still sending the other.  The logic turned out to be almost elegant in its simplicity here:

        if dumpram_waddr(7 downto 0) = "11110000" then
          -- Check if we are about to run out of buffer space
          if eth_tx_idle_cpuside = '0' then
            cpu_arrest_internal <= '1';
            report "ETHERDUMP: Arresting CPU";
          end if;
        end if; 
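
For readers who don't speak VHDL, the same condition can be written as a C predicate. This is just a restatement of the snippet above for illustration; the function name is made up, and the parameter names mirror the VHDL signals.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true when the CPU should be paused: the write address is
   close to the end of the packet being prepared (low 8 bits equal
   binary 11110000), and the ethernet transmitter is still busy
   sending the other packet. */
bool should_arrest_cpu(uint16_t dumpram_waddr, bool eth_tx_idle)
{
    bool near_end_of_packet = (dumpram_waddr & 0xFF) == 0xF0;
    return near_end_of_packet && !eth_tx_idle;
}
```

If the transmitter is already idle when we get near the end of the packet, the CPU is never paused at all.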

I just reused the existing compressed video sending logic in the ethernet controller, so there is (currently) no way to tell the packet types apart, other than looking closely at the content. However, as these are tools that are designed for debugging, this is a reasonable solution.

I then started writing a new program, ethermon, that listens for these packets and decodes them for display.  The result is really quite nice:

--E---ZC($23), SP=$xxF6, A=$17 : $084C : AD 52 D0   LDA $D052 
--E----C($21), SP=$xxF6, A=$17 : $084F : 85 FD      STA $FD   
--E----C($21), SP=$xxF6, A=$02 : $0851 : AD 53 D0   LDA $D053 
--E----C($21), SP=$xxF6, A=$02 : $0854 : 29 03      AND #$03  
--E----C($21), SP=$xxF6, A=$02 : $0856 : 18         CLC       
--E-----($20), SP=$xxF6, A=$07 : $0857 : 69 05      ADC #$05  
--E-----($20), SP=$xxF6, A=$07 : $0859 : 85 FE      STA $FE   
--E-----($20), SP=$xxF6, A=$07 : $085B : A0 00      LDY #$00  
--E---Z-($22), SP=$xxF6, A=$2A : $085D : A9 2A      LDA #$2A  
--E-----($20), SP=$xxF6, A=$2A : $085F : 91 FD      STA ($FD),Y
--E-----($20), SP=$xxF4, A=$2A : $0861 : 20 E4 FF   JSR $FFE4 
--E-----($20), SP=$xxF4, A=$2A : $FFE4 : 6C 2A 03   JMP ($032A)
--E-----($20), SP=$xxF4, A=$00 : $F13E : A5 99      LDA $99   
--E---Z-($22), SP=$xxF4, A=$00 : $F140 : D0 08      BNE $F14A 
--E---Z-($22), SP=$xxF4, A=$00 : $F143 : A5 C6      LDA $C6   
--E---Z-($22), SP=$xxF4, A=$00 : $F144 : F0 0F      BEQ $F155 
--E---Z-($22), SP=$xxF4, A=$00 : $F155 : 18         CLC       
--E---Z-($22), SP=$xxF6, A=$00 : $F156 : 60         RTS       
--E---Z-($22), SP=$xxF6, A=$00 : $0864 : C9 00      CMP #$00  
--E---ZC($23), SP=$xxF6, A=$00 : $0866 : F0 E4      BEQ $084C 
--E---ZC($23), SP=$xxF6, A=$1D : $084C : AD 52 D0   LDA $D052 
--E----C($21), SP=$xxF6, A=$1D : $084F : 85 FD      STA $FD   
--E----C($21), SP=$xxF6, A=$02 : $0851 : AD 53 D0   LDA $D053 
--E----C($21), SP=$xxF6, A=$02 : $0854 : 29 03      AND #$03 

We have the CPU flags on the left, followed by the hex version of the same. Then comes the bottom 8-bits of the stack pointer (remember on the 4510 you can make the stack pointer 16-bit), then the contents of the accumulator, and finally, the program counter, instruction bytes, and the disassembled version of the instruction.
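
The flag column is simple to reproduce. The sketch below is not ethermon's actual code, and the function name is hypothetical, but it produces the same strings as the trace above: one letter per flag bit from bit 7 down to bit 0, in the order N, V, E, B, D, I, Z, C, with a dash for every clear bit.

```c
#include <stdint.h>

/* Hypothetical helper: render the flags byte in the style of the
   trace above. out must have room for 9 chars (8 flags plus NUL). */
char *flags_to_string(uint8_t flags, char out[9])
{
    static const char names[8] = { 'N', 'V', 'E', 'B', 'D', 'I', 'Z', 'C' };
    int i;

    for (i = 0; i < 8; i++)
        out[i] = (flags & (0x80 >> i)) ? names[i] : '-';
    out[8] = '\0';
    return out;
}
```

So $23 renders as `--E---ZC` and $20 as `--E-----`, matching the first and sixth lines of the trace.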

There was a bit of fun to get the program counter correct, as some instructions pre-increment the program counter.  This just meant a bit of fiddling to get the out-by-one errors squashed in the display.

What is really nice is that, at 8 bytes per instruction on 100Mbit ethernet, we can log perhaps a little over one million instructions per second.  That's enough to log a C65 at full 3.5MHz, unless you are running almost exclusively single-byte instructions, which is implausible in practice.  Certainly there is no noticeable slow-down of the machine when logging to ethernet like this at 1MHz, 2MHz or 3.5MHz. It is only when running the CPU at full speed that it limits things.
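
As a back-of-envelope check on that figure, the upper bound can be worked out as below. The per-frame overhead of 38 bytes (preamble and start delimiter, header, frame check sequence and inter-frame gap) is standard for ethernet, but treating the 2KB as pure payload is my assumption, so take the result as an estimate only.

```c
#include <stdint.h>

#define LINK_BITS_PER_SEC    100000000ull  /* 100Mbit ethernet */
#define RECORD_BYTES         8ull          /* one instruction record */
#define FRAME_PAYLOAD_BYTES  2048ull       /* assumed: 2KB of records */
#define FRAME_OVERHEAD_BYTES 38ull         /* 8 preamble+SFD, 14 header,
                                              4 FCS, 12 inter-frame gap */

/* Upper bound on instruction records per second over the link. */
uint64_t max_records_per_second(void)
{
    uint64_t bytes_per_sec = LINK_BITS_PER_SEC / 8;
    uint64_t payload_per_sec = bytes_per_sec * FRAME_PAYLOAD_BYTES /
                               (FRAME_PAYLOAD_BYTES + FRAME_OVERHEAD_BYTES);
    return payload_per_sec / RECORD_BYTES;
}
```

This comes out at roughly 1.5 million records per second, which is consistent with being able to keep up with a 3.5MHz CPU whose instructions mostly take two or more cycles.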

Again, it depends a bit on the instruction mix, but typically it can log at an effective CPU speed of about 5MHz -- certainly fast enough for live use.  Also, for timing sensitive bugs, the fact that it runs at full speed for 512 instructions, and then pauses for a while before running the next 512 instructions at full speed means that it is still very likely to catch bugs, as most instructions are executed completely unaffected in terms of timing or bus pausing.

In fact, the real problem is that logging millions of instructions per second results in hundreds of megabytes of log for every few seconds.  This means that I need to think about careful instruction filtering strategies, so that it will be easier to debug problems around particular pieces of code.  I have already added an option to capture a fixed number of instructions, and to capture all instructions for exactly one frame.

Anyway, back to trying to figure out what is going wrong with Ghosts'n'Goblins using this now, I have been rather frustrated!  Even after adding a mode that logs every single bus access, no access to $D031 is visible -- and yet, instrumenting the VIC-IV to toggle a line every time $D031 is written to, I can see that it is indeed getting written to.  I pulled my hair out for quite some time on this, because I could see the glitch happening within a few seconds.  Then finally the penny dropped: The problem happens exactly when the screen swaps from the high-scores to the credits screen with the sprite Ghosts'n'Goblins display near the top.

Moreover, using the ethernet CPU trace, I have seen that the values $78 and $50 that erroneously end up in $D031 - $D03F are the values written to $D00E and $D00F when setting the sprite locations for that screen.  So, now I have a really interesting clue, and I just have to figure out where the code is for this, and then get a trace of it.

In the process I also discovered that the glitch only occurs at 1MHz. If the CPU is set to 4MHz, then it doesn't happen.  Also at 2MHz ("C128 fast" mode) it doesn't happen, nor at full CPU speed. Thus we have some kind of dynamic bus bug.

Anyway, having a clue as to the cause, I used monitor_save to dump out memory from the running machine, and then searched for routines that wrote to $D000 and $D001.  There are two such routines, one at around $3332, and the other at around $33B0.  The routine at $33B0 looks like:

,000033B0  BD 62 34  LDA   $3462,X
,000033B3  99 01 D0  STA   $D001,Y
,000033B6  BD 52 34  LDA   $3452,X
,000033B9  99 00 D0  STA   $D000,Y

If I replace that with all NOPs, then suddenly the problem goes away.  In fact, it is the instruction at $33B9 that causes the problem, it seems.

Changing the STA $D000,Y to just STA $D000 or even STA $D000,X also stops it.
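
As an aside, the search of the memory dump for writes to $D000 and $D001 mentioned above can be done with a very simple scan. The sketch below is not the tool I used, and the function name is hypothetical. It looks for the three absolute-mode STA opcodes ($8D for STA $xxxx, $9D for STA $xxxx,X and $99 for STA $xxxx,Y) followed by the target address in little-endian order. Being a naive byte scan, it can report false positives where data or operand bytes happen to look like an instruction.

```c
#include <stddef.h>
#include <stdint.h>

/* Scan a memory dump for STA abs / STA abs,X / STA abs,Y instructions
   whose operand equals target. base is the address of mem[0].
   Up to max_hits addresses are stored in hits; the total number of
   matches is returned. */
size_t find_sta_writes(const uint8_t *mem, size_t len, uint16_t base,
                       uint16_t target, uint16_t *hits, size_t max_hits)
{
    size_t found = 0;
    size_t i;

    for (i = 0; i + 2 < len; i++) {
        uint8_t op = mem[i];
        uint16_t operand;

        if (op != 0x8D && op != 0x9D && op != 0x99) continue;
        operand = (uint16_t)(mem[i + 1] | (mem[i + 2] << 8));
        if (operand != target) continue;
        if (found < max_hits) hits[found] = (uint16_t)(base + i);
        found++;
    }
    return found;
}
```

Run over the twelve bytes of the routine listed above with a base of $33B0, it finds the write to $D001 at $33B3 and the write to $D000 at $33B9.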

So is the problem with the STA $xxxx, Y instruction?

I also wondered about the addresses involved.  The correct write address will be $FFD000x, while the erroneous writes are happening to $FFD303x.  Thus there is a bit of a pattern there where 0 nybbles are becoming 3's.  But one is a low nybble and the other a high nybble, so it still doesn't really make any sense. So let's look at the STA $xxxx,Y instruction in the CPU, and see if there are any clues there.  Nope.  The $xxxx,X and $xxxx,Y addressing modes are carbon copies of each other, just with the different index register supplied.

This left me with glitching due to not making timing closure as the most likely cause.  Basically I will have to investigate it further when I get the chance, as for now, I am just driving myself insane, even if I have made a nice new tool for tracing CPU activity.

Fast forward a couple of weeks, and I have fixed the problem, although I have also forgotten exactly what it was that I fixed!  I have a recollection that it was something in the way the bus is held when running at 1MHz, and making sure that I clean out any old accesses that were sitting on the IO bus.  The main thing is that it is now conclusively fixed.