Tuesday, 26 December 2023

Tracking down and fixing the thumbnail images in the freezer

After some recent fixes for some other bugs in the MEGA65 core, the hardware-generated thumbnails are no longer working in the freeze menu, instead they just show up black:

 

This issue is being tracked here.

This is sad, because having the little thumbnail of whatever was running when you froze is super handy, and just a really nice visual touch.  So we will of course fix this.

The problem stems from a fix to deconflict some address resolution stuff, that has meant that the thumbnail generator is not being decoded properly now, meaning that the freeze menu just reads all zeroes, thus giving the black display.

I'm going to fix it by moving the thumbnail interface to its own 4KB memory-mapped slab, so that we don't have to do anything fancy to read it out in the freezer any more. This will save some code in the HYPPO hypervisor that previously had to do this when preparing to freeze.

The big clue for this came from the following conversation on Discord:

From @johnwayner in discord:
It appears that the addressing changes in https://builder.mega65.org/job/mega65-core/job/development/17/ broke the ability of the freeze code to read from the thumbnail generator addresses at $FFD2640 and $FFD2641. Thus the freezer is just reading some memory somewhere else. My VHDL is very weak. <@824166878284349461> might be able to help with this as he was involved with these changes.

The first step, was for me to change the mapping, so that it is now visible at $FFD4xxx (see .  This gets us a little further, in that the thumbnail is now all the colour of one part of the screen, in this case, blue.

It looks like the mapping is fine, but that it is either reading the same address in all cases, or that it is writing to the same address in all cases when producing the thumbnail. As the thumbnail generation logic has not changed, I'm strongly suspecting that the problem is on the read-address side of things.

The best way to tackle this will be to simulate the MEGA65, and see what is going wrong. Getting simulation of the whole MEGA65 going is always a bit of a trial, as we don't tend to need to do it often, and there are almost always a bunch of weird regressions from the perspective of the GHDL simulation software. 

After an hour or so, I can now run simulation again. So let's add some simple code that tries to read the thumbnail data, and see if we are reading the same address repeatedly or not.  This is a touch annoying to craft, because $FFD4xxx is not mapped anywhere convenient. An inline DMA job is probably the simplest approach here.

Inline DMA jobs on the MEGA65 work by writing any value to the magic address $D707. The DMA controller then reads the bytes following the program counter as the DMA list, and then resumes the CPU at the next byte following.  This means we can read from all 4KB of the thumbnail buffer with something like this:


               sta $d707
        ;; MEGA65 Enhanced DMA options
        !8 $0A  ;; Request format is F018A
        !8 $80,$ff ;; Source is $00xxxxx
        !8 $81,$00 ;; Destination is $00xxxxx
        !8 $00  ;; No more options
        ;; copy $FFD4000-$FFD4FFF to $4000-$4FFF
        ;; F018A DMA list
        ;; (MB offsets get set in routine)
        !8 $00 ;; copy + last request in chain
        !16 $1000 ;; size of copy is 4KB
        !16 $4000 ;; starting at $4000
        !8 $0D   ;; of bank $0
        !16 $4000 ;; destination address is $4000
        !8 $00   ;; of bank $0
        !16 $0000 ;; modulo (unused)

Simulation indicates that it seems to be reading from successive addresses. So that's not the problem. Adding further debug output into the simulation reveals that it does indeed seem to be reading from successive addresses and returning the resulting data as output to the bus.  Quite mysterious.  More the point, quite annoying: This kind of "doesn't-show-up-in-simulation" type bug is quite frustrating in VHDL. 

Sometimes there are clues in warnings in the output of Vivado. But nothing there really seems to give me any clues this time.

Hmmm....

The plumbing from the buffer for the thumbnail to the bus is a little convoluted, so I'll try inserting dummy data at the mid-point of the plumbing, and see if that appears, or not. And it fails! Yay! I'm happy because that gives me a clue as to what is going on, or rather, not going on.  The output from the thumbnail buffer is simply not being used when working out what is on the bus.

Okay, so I found that it should now be testable under simulation: The previous test only checked that the data could be read out, but didn't check what the DMA controller saw.  Now doing that, it looks like there is a plumbing problem.

It seems that thumbnail_cs and fastio_read are both only asserted for the first one or two bytes of the read. Poking around some more, mostly just adding debug statements, it now seems to be working correctly under simulation. So let's see how it goes synthesised...

Well, still the same problem: The colour of the border shows up in _all_ the bytes of the thumbnail data.  I'm pretty sure it is being read out correctly now.  And I know that something is being written, because I can change what value gets read out by changing the border colour.

My current theory is that the logic that writes the thumbnail is also broken. I'm reworking it to have much simpler and more robust logic to determine the correct thumbnail pixel coordinate based on the current screen pixel coordinate, rather than having a pile of magic that works out if it is a pixel row that we are currently needing to sample into the thumbnail buffer, and if so, which pixels along that line.

Hmm... That changed things, but didn't fix it: Now it seems that no thumbnail pixels are being written at all.  The most logical explanations for this, are that the hypervisor mode check is messed up, or that the pixel coordinate calculations are still broken. That none of the thumbnail pixels have been painted suggests to me that the hypervisor mode check might be the problem.

I'm now resynthesising with the hypervisor mode check disabled. I also found and fixed a bug where the thumbnail address would step by 2 at a time, instead of 1 at a time for the row and column tracking.

Hmm... it still looks like nothing gets written to the thumbnail buffer.  Confirmed this by changing the default contents of the thumbnail buffer.  So how can this be? Oddly when the CPU is in hypervisor mode, I can see other contents rather than the default contents.  This also makes me suspect that there might be some problem on the reading side, where it is still just reading the same address all the time.

I'll try debugging that by adding some debug registers that let me see the number of writes that have happened to the thumbnail buffer, as well as the lower bits of the last address written to, like this:

    if fastio_read='1' and (thumbnail_cs='1') then
      if fastio_addr(11 downto 0) = x"000" then
        fastio_rdata <= thumbnail_write_address_int(7 downto 0);
      elsif fastio_addr(11 downto 0) = x"001" then
        fastio_rdata <= thumbnail_write_count;
      else
        fastio_rdata <= thumbnail_rdata;
        report "THUMB: Exporting read data $" & to_hexstring(thumbnail_rdata);
--      fastio_rdata <= fastio_addr(7 downto 0);
--      report "THUMB: Spoofing read data $" & to_hexstring(fastio_addr(7 downto 0));
      end if;
    else
      fastio_rdata <= (others => 'Z');
    end if;

This way, I will be able to tell whether it thinks it is writing lots of pixels or not, or if it thinks it is stuck writing to the same address etc.  I.e., I'm hoping it will be quite diagnostic.

... and of course, both are working quite nicely. Also, they are both being decoded, which also means that the address decoding isn't the problem, either. The way I am checking for pixel writes is also robust, in that it reads the same signal that is used as the write enable for the thumbnail buffer.

Thus I'm starting to suspect I might be tickling a Vivado bug. I've seen similar issues before where BRAM inferencing goes wonky, and it feels a bit like that here. Changing the BRAM definition to one of the "officially blessed" configurations tends to fix it. We'll see if that helps here.

And, yes, that does seem to have fixed the thumbnail buffer writing issue: When I DMA the contents out now, it has plausible looking data in there.  Now, whether it is correct data is another question. It looks like it is capturing only the top-left part of the screen, but in too high resolution. i.e., my stepping for pixel rows and columns is not striding over enough pixels.  

Also, for some reason, the freeze menu is still not reading the data, and showing a blank blue thumbnail. So there is still some problem there somewhere. The freeze monitor does work to view the thumbnail, so I'm guessing the thumbnail renderer in the freezer is broken. It might be fetching from a fixed offset in the freeze slot, which has shifted since the internal virtual 1541 ROM and RAM contents were added to the memory freeze list. Ah! It was because previously the thumbnail claimed to be frozen at address $1000, rather than $FFD4000.  So that should fix that. That can be fixed without resynthesis, since it is just in FREEZER.M65.

With that fix, we can now see a thumbnail again:


But it clearly has problems.  There are black pixels in the border, and white rubbish in the main image.  More the point, it is not showing a thumbnail of what was last on the screen!

Magically, if I run a program in BASIC to DMA the thumbnail to the screen, it suddenly updates, but only to what was DMAd -- or at least seems that way. Am I still reading the wrong memory in the freezer?  Nope, it is really what is in the thumbnail buffer.

Incidentally, BASIC65 makes it easy to read the thumbnail data and draw it (in a non-decoded way) on the screen with something like this:

EDMA 0,$800, $FFD4000, $0800

Also, it looks like my worry about the pixels only being of the top corner of the screen is not quite right: It's still probably the upper part of the screen only, but it's certainly showing the full width.

I had disabled the suppression of thumbnail update in hypervisor mode, which might have been partially messing things up. I've put that back, and also tried to fix the vertical positioning, so will see how that goes in a new bitstream.

Hmm... when I turn the hypervisor mode check back on, the thumbnail never gets written -- even though I have added debug code that let's me confirm that the hypervisor mode flag is there, and correctly readable by the thumbnail generator logic.  What on earth is going on here?

It looks like some writes happen, but not all.  Yet the hypervisor mode bit stays clear.  We have no timing violations in the log, and the logic is all in a single clock domain.  Ah! I think I might have found the answer to this and related mysteries of it not updating properly, or only updating when I am reading from the thumbnail: The BRAM instantiation I am using uses the same chip-select line for reading and writing. We need separate ones.

Yay :) That fixes most of the remaining problems. Now to fix the vertical positioning, and then we are good to go. It can still do with a little more tweaking, but that is something that is now simple enough, that someone else can likely tackle it, so I'm going to leave it at this point, with the thumbnails working again:











No comments:

Post a Comment