Thursday, December 27, 2018

Real-time CPU tracing

We already have the means to perform low-speed tracing of the MEGA65's CPU via the serial monitor interface.  This has been invaluable for all sorts of things so far.  However, I have been investigating a number of gritty little bugs that only show up when the CPU is running a full speed.  A good example is when Ghosts'n'Goblins sometimes does this:

The underlying problem is that the C65/MEGA65 register at $D031 gets written to.  Actually, it looks like $D030 - $D03F (at least) get written with the pattern $78 $50.  However, single-stepping, the problem never shows up, perhaps because it requires an IRQ at a critical time, or is due to some other dynamic effect.

Thus we need a way to debug the machine at full speed. The serial monitor interface is limited to 4mbit/sec in practice, which even with an efficient encoding for instructions would limit us to well below 1 million instructions per second, probably significantly less in practice. Also, the ROM for the serial monitor is already rather full, so there isn't really space to implement such an encoding, and its memory access interface to the CPU would change the timing of the instructions, which might well hide the problems we are trying to track down.  Also, it would not be effective at all if we want to debug bus accesses instead of instructions, as those happen at 40MHz (the current speed of the MEGA65's CPU).  So we need a higher bandwidth option.

The logical choice is to use the 100mbit/sec ethernet adapter.  Not only does it have 25x the bandwidth, but the video streaming code already has most of the infrastructure needed -- all that is required is to add the code to pull the CPU instructions in, pack them in a buffer, and then push them out in packets.

The first challenge is that the MEGA65's CPU can issue instructions on back-to-back cycles, as implied mode instructions typically take only one cycle. This means have to be able to write the entire instruction debug information vector in a single cycle.  Fortunately the BRAMs on Xilinx FPGAs can support funny arrangements where you can write 64 bits wide, and read 8 bits wide.  There is a bit of magic to making this work, as you have to get Vivado to infer the correct BRAM mode.  Xilinx have some good documentation on example VHDL templates that will achieve this, which I was able to adapt quite easily.  So now I could write 8 bytes at a time into a FIFO, and read it out byte at a time when sending ethernet frames.

It is a bit annoying that we are limited to 64 bits, as it would be nice to have the program counter (16 bits), instruction (up to 24 bits), flags (8 bits), and stack pointer (16 bits), as well as the A, B, X, Y and Z registers (40 bits more).  That makes a total of 104 bits.  So in the end I have had to leave the B, X, Y and Z registers out.  I could get all clever-pants later and provide less frequent updates of those registers by providing them in cycles between instructions, however, for now I don't really have a need for that.

So now I could produce 2KB ethernet frames packed with 512 x 8 byte instruction records.  Next, I had to get the logic right to pause the CPU when the buffer was filling, but ideally not have to pause the CPU every time we have a packet.  Using 2KB packets and a 4KB BRAM, this means I can have one packet being sent, while preparing the next one, and only pausing the CPU if we approach the end of the packet we are preparing, and the ethernet is still sending the other.  The logic turned out to be almost elegant in its simplicity here:

        if dumpram_waddr(7 downto 0) = "11110000" then
          -- Check if we are about to run out of buffer space
          if eth_tx_idle_cpuside = '0' then
            cpu_arrest_internal <= '1';
            report "ETHERDUMP: Arresting CPU";
          end if;
        end if; 

I just reused the existing compressed video sending logic in the ethernet controller, so there is (currently) no way to tell the packet types apart, other than looking closely at the content. However, as these are tools that are design for debugging, this is a reasonable solution.

I then started writing a new program, ethermon, that listens for these packets and decodes them for display.  The result is really quite nice:

--E---ZC($23), SP=$xxF6, A=$17 : $084C : AD 52 D0   LDA $D052 
--E----C($21), SP=$xxF6, A=$17 : $084F : 85 FD      STA $FD   
--E----C($21), SP=$xxF6, A=$02 : $0851 : AD 53 D0   LDA $D053 
--E----C($21), SP=$xxF6, A=$02 : $0854 : 29 03      AND #$03  
--E----C($21), SP=$xxF6, A=$02 : $0856 : 18         CLC       
--E-----($20), SP=$xxF6, A=$07 : $0857 : 69 05      ADC #$05  
--E-----($20), SP=$xxF6, A=$07 : $0859 : 85 FE      STA $FE   
--E-----($20), SP=$xxF6, A=$07 : $085B : A0 00      LDY #$00  
--E---Z-($22), SP=$xxF6, A=$2A : $085D : A9 2A      LDA #$2A  
--E-----($20), SP=$xxF6, A=$2A : $085F : 91 FD      STA ($FD),Y
--E-----($20), SP=$xxF4, A=$2A : $0861 : 20 E4 FF   JSR $FFE4 
--E-----($20), SP=$xxF4, A=$2A : $FFE4 : 6C 2A 03   JMP ($032A)
--E-----($20), SP=$xxF4, A=$00 : $F13E : A5 99      LDA $99   
--E---Z-($22), SP=$xxF4, A=$00 : $F140 : D0 08      BNE $F14A 
--E---Z-($22), SP=$xxF4, A=$00 : $F143 : A5 C6      LDA $C6   
--E---Z-($22), SP=$xxF4, A=$00 : $F144 : F0 0F      BEQ $F155 
--E---Z-($22), SP=$xxF4, A=$00 : $F155 : 18         CLC       
--E---Z-($22), SP=$xxF6, A=$00 : $F156 : 60         RTS       
--E---Z-($22), SP=$xxF6, A=$00 : $0864 : C9 00      CMP #$00  
--E---ZC($23), SP=$xxF6, A=$00 : $0866 : F0 E4      BEQ $084C 
--E---ZC($23), SP=$xxF6, A=$1D : $084C : AD 52 D0   LDA $D052 
--E----C($21), SP=$xxF6, A=$1D : $084F : 85 FD      STA $FD   
--E----C($21), SP=$xxF6, A=$02 : $0851 : AD 53 D0   LDA $D053 
--E----C($21), SP=$xxF6, A=$02 : $0854 : 29 03      AND #$03 

We have the CPU flags on the left, followed by the hex version of the same. Then comes the bottom 8-bits of the stack pointer (remember on the 4510 you can make the stack pointer 16-bit), then the contents of the accumulator, and finally, the program counter, instruction bytes, and the disassembled version of the instruction.

There was a bit of fun to get the program counter correct, as some instructions pre-increment the program counter.  This just meant a bit of fiddling to get the out-by-one errors squashed in the display.

What is really nice is that at 8 bytes per instruction on 100Mbit ethernet, is that we can log perhaps a little over one million instructions per second.  That's enough to log a C65 at full 3.5MHz, unless you are running almost exclusively single-byte instructions, which is implausible in practice.  Certainly there is no noticeable slow-down of the machine when logging to ethernet like this at 1MHz, 2MHz or 3.5MHz. It is only when running the CPU at full speed that it limits things.

 Again, it depends a bit on the instruction mix, but typically it can log at an effective CPU speed of about 5MHz -- certainly fast enough for live use.  Also, for timing sensitive bugs, the fact that it runs at full speed for 512 instructions, and then pauses for a while before running the next 512 instructions at full speed means that it is still very likely to catch bugs, as most instruction are executed completely unaffected in terms of timing or bus pausing.

In fact, the real problem is that logging millions of instructions per seconds results in hundreds of megabytes of log for every few seconds.  This means that I need to think about careful instruction filtering strategies, so that it will be easier to debug problems around particular pieces of code.  I have already added an option to capture a fixed number of instructions, and to capture all instructions for exactly one frame.

Anyway, back to trying to figure out what is going wrong with Ghosts'n'goblins using this now, I have been rather frustrated!  Even after adding a mode that logs every single bus access, no access to $D031 is visible -- and yet, instrumenting the VIC-IV to toggle a line every time $D031 is written to, I can see that it is indeed getting written to.  I pulled my hair out for quite some time on this, because I could see the glitch happening within a few seconds.  Then I finally a penny dropped: The problem happens exactly when the screen swaps from the high-scores to the credits screen with the sprite ghosts'n'goblins display near the top.

Moreover, using the ethernet CPU trace, I have seen that the values $78 and $50 that erroneously end up in $D031 - $D03F are the values written to $D00E and $D00F when setting the sprite locations for that screen.  So, now I have a really interesting clue, and I just have to figure out where the code is for this, and then get a trace of it.

In the process I also discovered that the glitch only occurs at 1MHz. If the CPU is set to 4MHz, then it doesn't happen.  Also at 2MHz ("C128 fast" mode) it doesn't happen, nor at full CPU speed. Thus we have some kind of dynamic bus bug.

Anyway, have a clue as to the cause, I used monitor_save to dump out memory from the running machine, and then searched for routines that wrote to $D000 and $D001.  There are two such routines, one at around $3332, and the other at around $33B0.  The routine at $33B0 looks like:

,000033B0  BD 62 34  LDA   $3462,X
,000033B3  99 01 D0  STA   $D001,Y
,000033B6  BD 52 34  LDA   $3452,X
,000033B9  99 00 D0  STA   $D000,Y

If I replace that with all NOPs, then suddenly the problem goes away.  In fact, it is the instruction at $33B9 that causes the problem, it seems.

Changing the STA $D000,Y to just STA $D000 or even STA $D000,X also stops it.

So is the problem with the STA $xxxx, Y instruction?

I also wondered about the addresses involved.  The correct write address will be $FFD000x, while the erroneous writes are happening to $FFD303x.  Thus there is a bit of a pattern there where 0 nybls are becoming 3's.  But one is a low nybl and the other a high nybl, so it still doesn't really make any sense. So lets look at the STA $xxxx,Y instruction in the CPU, and see if there are any clues there.  Nope.  The $xxxx,X and $xxxx,Y addressing modes are carbon copies of each other, just with the different index register supplied.

This left me with glitching due to not making timing closure as the most likely cause.  Basically I will have to investigate it further when I get the chance, as for now, I am just driving myself insane, even if I have made a nice new tool for tracing CPU activity.

Fast forward a couple of weeks, and I have fixed the problem, although I have also forgotten exactly what it was that I fixed!  I have a recollection that it was something in the way the bus is held when running at 1MHz, and making sure that I clean out any old accesses that were sitting on the IO bus.  The main thing is that it is now conclusively fixed.