Saturday 21 October 2023

Debugging SDRAM and HyperRAM cache and other related things

The R4 board has both the 8MB HyperRAM found on the R3/R3A boards and a new 64MB SDRAM.  For the R4 board, I have made the bitstream allow run-time selection between the two, with the 8MB HyperRAM active by default, so that people don't accidentally write software that won't work on an R3/R3A board.

The problem is that in the process of doing this, it seems that I have tickled a bug in the HyperRAM access that is causing both test_585 and test_710 to fail.  I can't find the source for test_710, but that's okay, as I have the source for test_585 here, and can confirm that it fails.

First, a bit of background about how HyperRAM and SDRAM accesses work:

Unlike the 384KB of fast chip RAM in the MEGA65, the HyperRAM and SDRAM are external to the FPGA and the processor.  They are in fact connected via the same mechanism that is used to access the cartridge port.  This adds some extra cycles to each access, making it quite slow. To combat this slowness, the expansion RAM interface presents an 8-byte cache line to the CPU, allowing it to read successive addresses much more quickly, with just a single wait-state, giving 20MB/sec for linear reads.

Like all caches, things are never quite as easy as one would like.  Writes to the expansion RAM bypass the cache, and go directly to the expansion RAM.  This leaves the problem that the cache line is not updated, and still contains the old value. Thus if you read, write and then read again from addresses that are all in the same cache line (i.e., within the same 8-byte-aligned range of 8 bytes), the second read would return the original value.

Well, except that this shouldn't happen, because when the expansion RAM is written to, the cache line is marked as invalid. And indeed, that does work fine.

However, we do see some other odd effects.

If we write $00, $FF, $55 and $AA in turn to successive bytes of the expansion RAM, we get some rather strange effects:
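The test is a little program along these lines (a sketch of the shape of it, rather than the exact listing; $8000000 is the start of the expansion RAM):

5 REM SKETCH: WRITE $00,$FF,$55,$AA IN TURN TO EACH OF THE FIRST 8 BYTES, READING BACK AFTER EACH WRITE
10 FOR I = 0 TO 7
20 RESTORE : FOR J = 1 TO 4 : READ V
30 POKE $8000000+I,V
40 PRINT I;" ";HEX$(V);" ";HEX$(PEEK($8000000+I))
50 NEXT J : NEXT I
60 DATA 0,255,85,170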

Notice how only the first byte in the block of 8 holds its value.  The others seem to magically get the $AA value, even before it has been written. And that's actually a clue:  It turns out that this particular issue is caused by the SDRAM having different timing than was expected: it has one more cycle of delay before a value is read.

We can see this more clearly if I make a program that writes all zeroes to a block of 8 addresses, and then sets exactly one of them to something else, to see where it gets read back:
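Such a program, trying each of the 8 positions in turn, looks roughly like this (again just a sketch; $99 is the marker value):

100 REM SKETCH: CLEAR 8 BYTES, SET ONE OF THEM TO $99, THEN READ ALL 8 BACK
110 FOR N = 0 TO 7
120 FOR I = 0 TO 7 : POKE $8000000+I,0 : NEXT
130 POKE $8000000+N,$99
140 FOR I = 0 TO 7 : PRINT HEX$(PEEK($8000000+I));" "; : NEXT
150 PRINT
160 NEXT N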

Notice that in all cases the first 2 bytes are read back as $99, no matter where the $99 is written within the 8-byte block, and that the $99 that has really been written appears two addresses late.  As the SDRAM reads 16 bits per cycle, this corresponds to the single cycle delay I mentioned before.

That one cycle delay is something that I already know about, so I have made this switchable in the bitstream by clearing bit 5 in $FFD37FE.  If we do this, we now see the writes being read back at the correct addresses:
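For reference, clearing that bit from BASIC is a one-liner along these lines (this assumes the register reads back its current value, and that its other bits should be left alone):

POKE $FFD37FE, PEEK($FFD37FE) AND $DF : REM CLEAR BIT 5, PRESERVE THE REST (ASSUMES THE REGISTER IS READABLE)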

Note that this is all with the SDRAM, rather than the HyperRAM. That's okay, as we will work through the SDRAM, and then move to the HyperRAM once the issues with SDRAM access are resolved. Most likely the issues are common to the two anyway, so fixing them for one will fix them for the other.

Okay, so let's try our first program again, the one that writes $00, $FF, $55 and $AA to each of the first 8 addresses.

Well, that's an improvement: It now works fine for the even addresses. It's just the odd addresses that fail.  Remember I said before that the SDRAM has a 16-bit wide bus? I'm finding it quite suspicious that all the failing bytes are in the same half of those 16 bits.

But we know that writing a value to one of the addresses only after having just cleared the rest of them doesn't exhibit this problem, as we saw when we last ran the program starting at line 100.  So it's not as simple as the odd addresses not writing or reading properly from the SDRAM.

Our next clue is that the output from running the program changes based on whether we have just run the program at line 100 or not:

This, I think, is the cache playing up. I can test this theory by forcing the cache line to invalidate by reading from an address not within the cache line.
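For example, adding something like this to the first test program, between the write and the read-back, should do it ($8000100 is an arbitrary choice, it just has to be outside the 8-byte block being tested):

35 A = PEEK($8000100) : REM ANY ADDRESS OUTSIDE THE CACHE LINE WILL DO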

Aha! Now it's fine, because I force the cache to invalidate with line 35.

Ah, it's not about odd or even addresses, it's about the 2nd consecutive address within a cache line, as the following shows:

So let's see if I can reduce the failing case to something as simple as possible, so that I can more easily test it under simulation of the VHDL.

This is as simple as it gets, while still reproducing the problem:
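In outline it is this (a sketch; the two PEEKs on lines 320 and 340 are the ones discussed below):

300 REM SKETCH OF THE MINIMAL FAILING CASE
310 POKE $8000000,123
320 A = PEEK($8000000)
330 POKE $8000000,1
340 PRINT PEEK($8000000)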

This should print 1, not 123, but because the cache line for $8000000-$8000007 has been read by the PEEK in line 320, when we PEEK again in line 340, we see the old data. Note that the new value has been written to the SDRAM; it's just that the cache line has not been refreshed.

Playing around with things, this incorrect behaviour happens only when slow_prefetch_enable is enabled in the CPU.  Enabling or disabling the cache line in the CPU doesn't have any effect. That is, it isn't the super-fast short-cut logic for the external RAM specifically that is playing up, but rather the logic for prefetching data from expansion devices generally. This also means that this bug is not specific to the SDRAM, but applies also to HyperRAM.

So, POKE $FFD37FE,$14 fixes that problem for the SDRAM, and then the SDRAM seems to work perfectly for me.

HyperRAM still has problems, essentially not working when selected. I'm strongly suspecting a plumbing problem with the HyperRAM connection into the rest of the design.

Yup: Found one problem there: hyperram_data_ready_toggle isn't plumbed into the HyperRAM module.  That will require a synthesis run, so we will investigate that more tomorrow.

Okay, with that, I am now getting data out of the HyperRAM again.  However, it is shifted by one byte when reading back, as we can see here:

This bug is proving a bit more interesting, in that it doesn't have an obvious cause.  The HyperRAM controller has some debug registers at $BFFFFFx and $B0000xx that let us get some idea of what is going on.

Those registers suggest that the HyperRAM is getting the correct data written to it, and that the cache lines as read in the HyperRAM controller are being read correctly.  So it's presumably happening in the way that the bytes are being fed out to the expansion RAM controller and the CPU.

Here's an interesting little program I wrote to look into this behaviour:
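The idea is essentially this (a sketch, not the exact listing):

10 REM SKETCH: FILL 8 BYTES WITH KNOWN VALUES, WRITE $77 ELSEWHERE, THEN READ THE 8 BYTES BACK
20 FOR I = 0 TO 7 : POKE $8000000+I,I : NEXT
30 POKE $8111111,$77
40 FOR I = 0 TO 7 : PRINT HEX$(PEEK($8000000+I));" "; : NEXT
50 PRINT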

The first big fat clue is that the value written to $8111111 is what gets read from $8000000. This says to me that there is some funny thing going on with the data presentation from the HyperRAM to the CPU when reading byte by byte.  

Note that this is all with the HyperRAM cache line logic disabled, i.e., we should just be getting slowish single-byte transactions from the CPU, with the results fed back via the expansion RAM controller. Thus we can exclude the cache from being part of the problem. Something is causing it to return the byte values one transaction late, it would seem.

We can confirm this theory by shifting the range of bytes read up a few bytes, so that they aren't even aligned to an 8-byte cache line boundary, and observing whether we see the exact same effect, which we do:

This all also suggests that if we read an address twice in a row, it should magically look right. Let's try that:
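That just means changing the read loop in the sketch above so that each address is read twice, throwing the first result away, something like:

40 FOR I = 0 TO 7 : D = PEEK($8000000+I) : PRINT HEX$(PEEK($8000000+I));" "; : NEXT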

And look at that -- it's magically fixed :) Well, rather, we have worked around the problem with this little trick. Clearly we need to fix this at the root cause.  

So we now know that the symptom is that the byte from the previous read request is the byte value that is returned.  Now to figure out why this happens.

This is the relevant snippet of VHDL:

      if data_ready_strobe_hold = '0' then
        if fake_data_ready_strobe='1' then
          report "asserting data_ready_strobe via fake_data_ready_strobe";
        end if;
        data_ready_strobe <= fake_data_ready_strobe;
        if fake_data_ready_strobe='1' then
          report "DISPATCH: holding data_ready_strobe via fake data = $" & to_hstring(fake_rdata);
          rdata <= fake_rdata;
          rdata_hi <= fake_rdata_hi;
        end if;
      else
        report "data_ready_strobe delayed by 1 cycle to ensure data setup";
        report "asserting data_ready_strobe";
        data_ready_strobe <= '1';
        data_ready_toggle_out <= not data_ready_toggle_int;
        data_ready_toggle_int <= not data_ready_toggle_int;
      end if;
      data_ready_strobe_hold <= '0';


All that stuff about fake_ prefixed signals is there because, to meet timing closure, the HyperRAM controller needed an extra drive cycle to push to rdata and the strobe line.  This was done at the time by "faking" the output with the fake_ signals, and then presenting them a cycle later, with the strobe signal being delayed by a further cycle, to make sure that the data lines had had time to settle in the previous cycle. This is required because the HyperRAM runs at 2x the clock rate of the CPU, and so we need to make sure that everything looks consistent from the slower perspective of the CPU.

Anyway, I'm highly suspicious that the problem lies here somewhere, especially since the logic doesn't seem to make a great deal of sense to me right now!

So what do we really need to do, whenever we have new data to present?

1. Present the data to rdata and rdata_hi.

2. On the following cycle, assert the data_ready_strobe signal for 2 clock cycles, to make sure it gets accepted by the 1/2 clock speed receiver. Also toggle data_ready_toggle.

Ok, so I have done that, and resynthesised.  Now the behaviour of the HyperRAM and the SDRAM is the same when caching is disabled.  However, it's still incorrect: The data from the previous transaction is still being read, instead of the correct data.

Let's warm up the unit tests I wrote for the SDRAM controller back when I was implementing it, and see what those can tell us.

==== Summary ========================================================================================
pass lib.tb_sdram_controller.SDRAM starts busy, and becomes ready                 (16.3 seconds)
pass lib.tb_sdram_controller.Can read config registers                            (13.3 seconds)
fail lib.tb_sdram_controller.Write and read back single bytes                     (12.5 seconds)
fail lib.tb_sdram_controller.Write and read back single bytes from different rows (12.6 seconds)
fail lib.tb_sdram_controller.Cache line correctly populated                       (12.8 seconds)
=====================================================================================================
pass 2 of 5
fail 3 of 5


Well, that's certainly confirmation that something is wrong.  Wrong in simulation is good, because it can be iterated quickly to find out the cause.

Only the first 6 bytes of the 8-byte cache line are being read.   After a bit of hunting around, I found the cause: The byte mask bits for the SDRAM were not being set for every cycle during the read of a cache line.  This doesn't explain some of the other issues we are seeing, but it is certainly a real issue that we need to fix. I am synthesising a fix for it now to test.

So now that the old tests are all working, it's time for me to try implementing, in a test, the access sequence that results in reading data from the previous transaction.

In our latest test program, I did:

POKE $8111111,$77
A = PEEK($8123456)
B = PEEK($8111111)
C = PEEK($8111111)

The values of B and C should both be $77.

To be more complete, we can make it like this:

POKE $8111111,$77
POKE $8123456,$99
A = PEEK($8123456)
B = PEEK($8111111)
C = PEEK($8111111)

What we now see is that B = $99 and C = $77.

So now to make a VHDL test that can reproduce this effect. Well, we already have read and write tests that operate on just the SDRAM controller, and those pass.  So the problem must be in either the slow_devices or gs4510 modules.

I've patched up the test_hyperram.vhdl and simulatehyperram2 Makefile targets to work again, updating GHDL to version 4.0.0 in the process.     

Found the first part of the problem: The timeout logic in slow_devices.vhdl was broken, causing all reads to time out.  This explains the behaviour of seeing the result of the previous read, as that would be what the connected device was still presenting from last time.

Now the problem is that it gets into an infinite loop retrying the read.  Fixed that, and now it is back to timing out, but only after the allowed 64 clock cycles.

Now to find out why it doesn't realise that the value has been read.  

The hyperram.vhdl publishes the memory read  value with 27 cycles to go, but it never gets detected by slow_devices.vhdl.  I'm guessing the data ready toggle is not being properly plumbed through.

Well, that looks promising: With it correctly plumbed, it seems to be working now. The test runs to completion, with no read errors.  So time to cook a bitstream while I sleep...

Meanwhile, I got a fix for the GHDL bug from their great maintainer, Tristan, so I can try simulation again to track it down.

To do this, I have put some test code in the reset entry point of the hypervisor:

 

reset_machine_state:
        ;; get CPU state sensible
        sei
        cld
        see

    ;; XXX - Debug HyperRAM read bug
    ;; The following in BASIC tickles it:
    ;;     POKE $8111111,$77
    ;;     POKE $8123456,$99
    ;;     A = PEEK($8123456)
    ;;     B = PEEK($8111111)
    ;;     C = PEEK($8111111)
    ;; For simplicity, we will use two different addresses in the same
    ;; page of HyperRAM, but in different cache rows.
    lda #$00
    sta $00
    sta $01
    sta $02
    lda #$80
    sta $03
    ldz #$00
    lda #$77
    sta [$00],z
    lda #$99
    ldz #$08
    sta [$00],z
    lda [$00],z
    tax
    ldz #$00
    lda [$00],z
    tay
    lda [$00],z
    taz
    foopl:  jmp foopl
 

In theory, this should do just what the BASIC code does, and give equivalent results.  I don't have hardware with me right now, or I would make a test program with exactly these instructions in it that I could run on real hardware to check the behaviour.

At 3285ns I can see the slow_devices.vhdl module accepting the initial write request to address $8000000.

And then I found an interesting clue: the busy flag from the HyperRAM is staying asserted.  It looks like it is continuously doing cache pre-fetches or something. This would explain the behaviour of the bug: The HyperRAM gets around to servicing the read or write request some time later, after the slow_devices module has given up waiting. In that situation, the slow_devices module returns the last byte that the HyperRAM presented. That is, we would consistently see the value from the previous read transaction.

So next step will be for me to figure out why the HyperRAM is staying busy.

Well, it might be a red herring after all, as the HyperRAM requires a 150usec start up delay to get itself organised. 

I disabled the 150usec delay in the HyperRAM, and now can simulate it... except that the bug doesn't show up.  

So I ported the program to a C64-mode program and ran it there, and it also didn't tickle the bug. How annoying. But also a bit of a relief, because if the bug couldn't be reproduced in simulation, then it would make my life much harder to track it down.

So what do we need to do to make the bug show up on real hardware, using a program that we can also run under simulation?

Is it that we need a big fat time delay between the reads? That doesn't seem to make a difference.

I've gone back to looking at the test_585 code, which does still fail reliably.  I now want to make a super-minimal assembly program from it that fails on real hardware.

So, that is failing for me. But after some hair pulling as I pulled that apart, it's now also failing for the simple BASIC test.

This is really frustrating, because it makes it feel like something is unstable, and depending on glitches, which means it might not be detectable under simulation.

The overall symptom remains one where reading from expansion RAM -- whether HyperRAM or SDRAM -- results in the CPU being given the value from the previous read, rather than the current read.

Now, I did make a fix for one potential cause of that issue, which I thought was already in the bitstream I am testing. But it's possible that the commit ID on the bitstream is wrong, if I made those changes to the source code while the bitstream was building.  The way to test this is to build a fresh bitstream, and see how that behaves...

I'm hopeful that will fix it, because a couple of things about the behaviour right now suggest that it is a bug in slow_devices, not in the HyperRAM or SDRAM controllers:

1. The error is a single-byte shift. Both HyperRAM and SDRAM effectively have 16-bit interfaces, so the expected shift from an error there would be 2 bytes.

2. The error happens with _both_ HyperRAM and SDRAM, and looks identical on both.  It seems fairly unlikely that I'd have functionally identical bugs in both controllers. Not impossible, but unlikely.

Anyway, we will know in an hour or two...

Bitstream built, and I now have a new clue:  On cold boot, it behaves properly, but after running the test, it gets into a state where the read is consistently one byte out, as before.  So I'm guessing some internal state somewhere is getting screwed up.  This explains the intermittent behaviour I have been seeing. So now to make a sequence that works from cold boot to trigger it exactly, so that I can investigate it under simulation...

Using test_585, the bug is typically triggered and then occurs persistently after somewhere between 20 and 400 loop iterations. So the probability of the event on any given read is quite low, which is a bit annoying.

My best guess is that it is sensitive to the expansion RAM delivering a byte at some critical time.

Hmm.. I just had a stupid bug in my test code, where for some reason the sta [$nn],Z was being done as a sta ($nn),Z, so writing over the pointer, rather than to the expansion RAM. That caused me endless "fun" until I realised what I had done.

Now, digging more, I have found that I can trigger the problem by repeatedly reading and writing to a single address, provided that the bottom 3 bits are in the range 5 - 7.  If they are in the range 0 - 4, then the problem never shows up.
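A trigger loop of roughly this shape does the job (the address and iteration count are arbitrary choices; the failure shows up as a read returning the previously read value instead of the value just written):

10 AD = $8000007 : REM ANY ADDRESS WITH BOTTOM 3 BITS IN THE RANGE 5-7
20 FOR I = 1 TO 10000
30 V = I AND 255
40 POKE AD,V
50 IF PEEK(AD) <> V THEN PRINT "MISMATCH AT ITERATION";I : END
60 NEXT
70 PRINT "NO ERROR THIS RUN"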

But it still takes some variable number of iterations with a "bad" address before the problem is triggered.  However, that number of iterations is now typically much smaller. It should now trigger fast enough that I can test whether I can reproduce it under simulation.

I'm wondering if one of the data ready toggles is being handled incorrectly, and the expected and actual sense of the line are getting out of sync. This would cause exactly the kind of issue we are seeing. This theory should be testable: after the problem occurs, see whether it can be flipped back off again by continuing to attempt to trigger it.

Well, after looking through the code for probable causes, I ended up making a new bitstream with the toggle handling reworked.  The toggle was only being updated when the slow_devices module was actively waiting for RAM reads. Thus if a read finished after the slow_devices module had given up waiting, then as soon as it scheduled the next read, it would think the result was immediately valid, and would return the result of the previous read.  In other words, this was indeed a bug of the type I had suspected.

With that change, some errors still occur, but when they do, it goes back to reading values correctly afterwards. The read timeouts were set quite low -- possibly too low for the HyperRAM, and possibly even for the SDRAM as well. So I have increased those from 24 to 128 cycles at 81MHz, which should be plenty.

And it looks like that combination of fixes has done it!


Now, the SDRAM still has a problem with unreliable initialisation and clock/data line relative timing, which I will need to tackle.  However, that can wait for another day, as the SDRAM itself works fine, and is not required for the MEGA65 core -- I implemented it really only to test it.

