Friday, 30 June 2023

MEGA65R4 Bring-Up

Well, we have the first samples of the R4 of the MEGA65 PCB:


This will be the model of PCB that finds its way into the next batch, plus or minus any last-minute changes that might have to be made.

The R4 includes a number of nice improvements over the R3A that was used in the previous batch, including:

1. Digital video back-power protection through the addition of a digital video driver IC.

2. Removal of the Intel MAX10 2nd FPGA to simplify the design. This means that the Arrow-compatible JTAG programmer would also not be required with the R4 board, since it was only ever required for updating the MAX10 FPGA.

3. Improvement of the 3.5mm audio jack quality with the addition of a nicer DAC setup.

4. Addition of a 64MiB SDRAM, which will make it possible to port more MiSTer cores to the MEGA65.

5. Joystick ports will be bi-directional, allowing use of the ProtoPad and similar devices.

6. The RTC chip has been replaced with a different part that should give us none of the problems of the previous one.

7. The RTC now takes a CR2032 battery and has a super-cap for battery-less time-keeping, at least for a while at a time.

As with all changes to boards, we have to test all the changes and get things working. As the R4 has a lot of changes, there are a lot of FPGA pins that have been reassigned, so the first job was to go through and update the XDC file to reflect this.

With that done, VGA output, SD card access and keyboard input were all quickly confirmed working, allowing the MEGA65 R4 board to run in a minimal configuration.

The digital video output requires a slight tweak, as we attempted to build the boards without some allegedly optional pull-up resistors. However, due to the exact design we used, we do in fact need those resistors. Those are getting retro-fitted onto my R4 sample board in the near future.  In the meantime, I will keep working on other parts of the board.

Next stop is the RTC: The IC is completely different to the previous one, so I need to change the I2C address in the code, add address decode logic for the R4 I2C peripherals generally, and then remap the RTC registers between the two ICs, so that it is backwards compatible.  Otherwise the Configure program, BASIC65 ROMs, and anything else that uses the RTC would need patching.

The register maps of the two ICs are as follows.

For the new one, the important registers are:

For the old one, they were:

What we care about for backwards compatibility is that the current time and date registers match up.  Everything else can remain different, as only very specialised software will need to touch those. For those registers we find:

Register 0 = Seconds = Register 1 on the new one
Register 1 = Minutes = Register 2 on the new one
Register 2 = Hours (with bit 7 indicating if 24 hour time) = Register 3 on the new one (which has no 24 hour time flag, as it is always in 24 hour clock mode)
Register 3 = Day of month = Register 5 on the new one
Register 4 = Month = Register 6 on the new one
Register 5 = Year-2000 = Register 7 on the new one
Register 6 = Day of Week = Register 4 on the new one

The Grove external RTC support already has a mechanism for the rearrangement of RTC registers, so it should be possible to re-use that.
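
As a rough sketch of the idea (the entity and signal names here are made up for illustration; the real implementation lives in the MEGA65's I2C/RTC plumbing), the remap is just a small lookup from the legacy register numbers onto the new IC's layout:

library ieee;
use ieee.std_logic_1164.all;

-- Hedged sketch: combinational remap of legacy (R3-era) RTC register
-- numbers onto the new IC's layout, per the table above.
entity rtc_reg_remap is
  port (legacy_reg : in  std_logic_vector(7 downto 0);
        new_reg    : out std_logic_vector(7 downto 0));
end entity;

architecture rtl of rtc_reg_remap is
begin
  process(legacy_reg)
  begin
    case legacy_reg is
      when x"00"  => new_reg <= x"01"; -- seconds
      when x"01"  => new_reg <= x"02"; -- minutes
      when x"02"  => new_reg <= x"03"; -- hours (new IC is always 24-hour)
      when x"03"  => new_reg <= x"05"; -- day of month
      when x"04"  => new_reg <= x"06"; -- month
      when x"05"  => new_reg <= x"07"; -- year minus 2000
      when x"06"  => new_reg <= x"04"; -- day of week
      when others => new_reg <= legacy_reg;
    end case;
  end process;
end architecture;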

The new RTC also offers a 100ths-of-a-second register in register 0, which it would be good to expose.

I have the RTC working reasonably now, so will turn my attention to the 64MiB SDRAM next.  To avoid long synthesis runs between each little fix, I want to use unit tests to test the SDRAM controller.  Ideally I would have a good VHDL model of the SDRAM itself, which I don't.  But I can probably make one that will be good enough.  

I'd also like to have a program that can take a VHDL file and auto-generate a VUnit VHDL test case template from it, to avoid having to write all the annoying boiler-plate VHDL by hand.  I've written such tools before, but none that is available open-source, so I'll do it from scratch again.  As an experiment, I'm going to use ChatGPT 4 to help write the tokeniser, as I always find that part boring and error-prone.  Using ChatGPT probably was faster, but just as error-prone -- although the process was certainly less boring!

With the tokeniser, I can search VHDL source files for the entity definitions, generate the lists of ports and generics required to instantiate each entity in the test case, and from that auto-generate the boiler-plate for the entity (or entities) that I want to include in the VUnit test case, and then generate the VUnit test case boiler-plate containing all that.  This will come in handy later for making unit tests that cover other parts of the MEGA65's VHDL.  The skeleton below shows the sort of output I'm aiming for.
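
For reference, a minimal VUnit test bench skeleton of the kind such a tool would emit looks roughly like this (the entity name and test name are illustrative):

library ieee;
use ieee.std_logic_1164.all;

library vunit_lib;
context vunit_lib.vunit_context;

entity tb_sdram_controller is
  generic (runner_cfg : string);
end entity;

architecture tb of tb_sdram_controller is
  -- instantiations of the entity (or entities) under test go here
begin
  main : process
  begin
    test_runner_setup(runner, runner_cfg);
    while test_suite loop
      if run("init_sequence_is_issued_correctly") then
        -- drive reset, then let the SDRAM model check the init sequence
        wait for 200 us;
      end if;
    end loop;
    test_runner_cleanup(runner);
  end process;
end architecture;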

I could probably just ask ChatGPT to generate a VUnit test case for a given entity directly. Maybe that's the easier approach here. I've used up my 25 ChatGPT 4 questions for the next 3 hours, so I tried doing it with ChatGPT 3.5, which was a fairly unmitigated disaster for making a complete test. However, it did a fine job when I pasted in the two entity definitions for the SDRAM and SDRAM controller and asked it to instantiate them and connect them up appropriately. Well, by that I mean it compiles. It may well have connected them in a completely useless or insane manner.

At first glance, it might have done ok. We'll see. Actually, no. So now I am trying to use it to help me write a VHDL model for the SDRAM.  This is also producing VHDL code that kind of looks like it should be an SDRAM, but not quite. But it probably is still saving me some overall typing.

Well, that was an interesting experiment. In any case, I have now worked things to the point where I have the first unit tests with VUnit working, which check that the SDRAM initialisation sequence is issued and received correctly using my model of the SDRAM and my SDRAM controller.  Now I am working on some tests for simple memory accesses. Once those are working, I can look at synthesising a fresh bitstream with it enabled, and see if the SDRAM talks to it.

Ok, bitstream has synthesised, but no sign of life when reading the SDRAM.  I'll go back to the simulation unit testing, and have a look at the waveforms generated when doing the reads and writes, and make sure that they look correct, in case I implemented the SDRAM model wrongly in a way that I haven't noticed yet.

Meanwhile, I am checking a couple of other things on the board:

First, I have confirmed Ethernet is still working, so that's one more sub-system ticked off.

Second, I want to get the bi-directional joystick behaviour working.  This should allow the use of the ProtoPad and similar fancy joysticks that have lots of buttons.  The MEGA65 R4 has some open-collector inverters that can be used to pull the lines low from the MEGA65 side when required. I have these set up to be activated when the DDR for a joystick line is set to output, and the value being written to $DD00 or $DD01 is a zero -- i.e., matching the CIA behaviour for output.  The read-side remains the same, as it does on the CIA, because it's just the case of an output driver pulling the line low, which the input port of the CIA (on the same pin) correctly sees as a dropped voltage.
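
Conceptually, the drive logic for each joystick line is a single gate. A hedged sketch, with made-up names rather than the actual MEGA65 signals:

library ieee;
use ieee.std_logic_1164.all;

-- Hedged sketch: one joystick line on the R4. The open-collector
-- inverter pulls the 5V line low when its input is driven high, so
-- we assert it only when the CIA DDR bit says "output" and the bit
-- written to $DD00/$DD01 is 0.
entity joy_line_drive is
  port (
    ddr_bit  : in  std_logic;  -- 1 = this line configured as output
    port_bit : in  std_logic;  -- value written to the port register
    pull_low : out std_logic   -- to the open-collector inverter input
  );
end entity;

architecture rtl of joy_line_drive is
begin
  pull_low <= '1' when (ddr_bit = '1' and port_bit = '0') else '0';
end architecture;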

I seem to now have the joystick ports working for output as well as input, although some more extensive testing will be required to make sure I have the bits correctly ordered.

The next big annoying problem, while I wait to get the pull-up resistors onto the R4 board for the digital video output, is that BASIC65 doesn't get to READY, as it seems to get stuck trying to read from the IEC bus.  It gets stuck in a loop checking for bit 6 of $DD00 to go high (the CLK pin), from what I can tell.  I've just double-checked the XDC file, to make sure I haven't messed up any of the IEC pin assignments on the FPGA, and they all look fine.  Time to look more closely at the plumbing for the IEC CLK line.

For each of the main IEC pins, we have 3 FPGA pins: One that reads the current voltage on the pin, one that enables the output driver, and one that sets the signal that is fed to the output driver.  This is technically a bit of overkill, as it allows us to drive high, instead of just drive low, or go tri-state for reading.  Anyway, it means that for the signal on an IEC pin to be low, both the output driver has to be enabled, and the input to the output driver set to select a low voltage.
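
As a hedged sketch of that arrangement (drive_low and clk_seen are names I've made up; the F_SER_CLK_* names follow the schematic), it looks something like this -- noting that the enable polarity depends on the external driver circuit, which turns out to matter below:

library ieee;
use ieee.std_logic_1164.all;

-- Hedged sketch of the three-pin arrangement for one IEC line.
entity iec_clk_pin is
  port (
    drive_low    : in  std_logic;  -- core wants the bus line pulled low
    f_ser_clk_i  : in  std_logic;  -- level-shifted read-back of the pin
    f_ser_clk_en : out std_logic;  -- enable for the external driver
    f_ser_clk_o  : out std_logic;  -- level fed to the external driver
    clk_seen     : out std_logic   -- what the CIA emulation reads
  );
end entity;

architecture rtl of iec_clk_pin is
begin
  -- To pull the bus low: enable the driver and select a low level.
  -- Otherwise: disable the driver and let the 4.7K pull-up win.
  f_ser_clk_en <= '1' when drive_low = '1' else '0';  -- polarity per board!
  f_ser_clk_o  <= '0' when drive_low = '1' else '1';
  clk_seen     <= f_ser_clk_i;
end architecture;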

Now, looking at the R3 vs R4 schematics, the circuit for this has changed a little. It is now:

U20 takes the input from the IEC CLK pin (as SER_CLK), which is a 5V signal, and level-converts it to a 3.3V signal, F_SER_CLK_I.  The Output Enable (OE) signal that controls whether the F_SER_CLK_I pin is valid is active high, so that side looks fine.  We also have a 4.7K pull-up to 5V, so it shouldn't read 0V unless U15A at the bottom is driving it low.  If F_SER_CLK_EN=0 and F_SER_CLK_O=1, then we would expect SER_CLK to be driven low.  To allow it to float, F_SER_CLK_EN=1 should do the trick, since there is an inverter on the input of this line to U15A.  So why are we seeing the CLK line at 0V when I probe the physical port?

Probably the best thing to do here is to make sure that the line can go high.  Resetting the MEGA65 does cause it to go high for a time, so things are ok at least partially. Ah, I think I see the problem. On the R3 board, this is the output logic for this line:


Note that the F_SER_CLK_EN line is using a non-inverted output enable line.  So I should just be able to invert the logic for F_SER_CLK_EN, and its equivalent for the other IEC lines, and it should fix the problem -- which it did.

Now to work on the improved 3.5mm analog audio output.  This is handled on the R4 by an AK4432VT DAC chip instead of a simple 1-bit digital output plus filter on the R3. This DAC chip uses I2S audio signalling, the same as the DAC used on the MEGAphone prototypes, so I'm hoping I can re-use most of what we have already made for that.

Working my way through the datasheet, and how we have it hooked up on the board:

1. Configuration mode is strapped to serial, rather than parallel (not yet sure what that means).

2. We want a sample rate <= 48KHz, so DFS0 and DFS1 must be zero, which is the default.

3. BICK frequency should be 2.8224MHz for a 44.1KHz or 3.072MHz for a 48KHz sample rate. 

4. MCLK frequency should be 11.2896MHz for a 44.1KHz or 12.2880MHz for a 48KHz sample rate.

5. For the 512fs configuration, the 512fs clock should be 22.5792MHz for a 44.1KHz or 24.576MHz for a 48KHz sample rate.  256fs is not available at these sample rates, but it is for 88.2KHz and 96KHz sample rates.  Maybe we can use those sample rates instead of 44.1KHz or 48KHz?

6. Audio data is signed, MSB first, latched on rising edge of BICK.

7. There is an I2C interface that can be used to set left and right gain between +12dB and -infinity dB (= mute). $00 = loudest, $FF = mute. This interface has a "soft transition" feature, so gain changes shouldn't produce pops or clicks.

8. There is a hardware mute pin available, which has similar effect. This is connected on the MEGA65 R4 board, so we better not activate it by mistake! This line is active high, so we need to drive it low normally.

9. The power down line (connected to audio_powerdown_n) holds the DAC in reset while low.  The datasheet says that it should be brought low once and then released to ensure correct operation. It has to be held low for at least 800ns.

10. To avoid a click on power-up in (9), the mute pin can be applied before releasing the power-down pin. That's probably worth doing (see the sketch after this list).

11. To enable the I2C interface, $DEADDA7A has to be written as four separate transactions with /CS being released between each (see page 33 of the datasheet). We'll have to do that, as the DAC is strapped for serial configuration.

12. The software mute pin is also the /CS line for the I2C interface.

13. For the I2C interface, 400KHz normal-speed operation is probably best, given the conflicting information in the datasheet about 7MHz, 1MHz and 400KHz modes.

14. There are 6 I2C registers ($00 -- $05), but the register field is 16 bits long. Thus each write transaction consists of at least 32 bits, plus the usual ACKs between bytes that I2C uses. All six registers can be written as one sequential transaction.

15. The six registers are listed on page 38 of the datasheet. From those we can see certain default settings:

  15.1 DFS0/1 = 0, selecting either 44.1KHz or 48KHz, as previously described.

  15.2 ACKS (automatic clock recovery) is disabled.

  15.3 DIF2-0 = 110, selecting 32-bit MSB aligned format. This means "Mode 6" timing, which is described in figure 15 on page 23 of the datasheet. This basically means that BICK needs to be 64x the sample rate, to allow for 2x32bit (left and right) samples per sample period.  For 44.1KHz, this means 44.1KHz x 64 = 2.8224MHz, and for 48KHz it is 48KHz x 64 = 3.072MHz. This matches (3) above, which explains how those are calculated.

  15.4 TDM1-0 = 00, selecting stereo mode

  15.5 SDS1-0 = 00, selecting normal L1, R1 channels.

  15.6 SYNCE = 1, enabling clock synchronisation

  15.7 SMUTE is disabled, allowing audio to be produced

  15.8 ATS, DASD, DASL (various filter related things) have sensible default values.

  15.9 ATTL/ATTR = volume levels for left and right default to $18 = 0dB, which is sensible.

And that's all the registers: So if we don't want to adjust the default volume level, we can actually ignore the I2C configuration interface completely, it would seem... and after synthesising the design with all the necessary plumbing in place, that is indeed the case: we can completely ignore the I2C configuration.
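
For illustration, here is a hedged sketch of the power-up sequencing from (9) and (10); the clock rate, counter thresholds and signal names are my assumptions, not the actual MEGA65 code:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hedged sketch of the AK4432 power-up sequence: mute first, hold
-- PDN low for >800ns, then release PDN, and only un-mute once the
-- DAC has been out of reset for a while.
entity ak4432_powerup is
  port (
    clk   : in  std_logic;   -- assume ~40MHz
    pdn_n : out std_logic;   -- audio_powerdown_n
    mute  : out std_logic    -- active-high hardware mute
  );
end entity;

architecture rtl of ak4432_powerup is
  signal count : unsigned(7 downto 0) := (others => '0');
begin
  process(clk)
  begin
    if rising_edge(clk) then
      if count < 255 then
        count <= count + 1;
      end if;
      mute  <= '1';          -- muted by default during power-up
      pdn_n <= '0';          -- held in reset initially
      if count > 40 then     -- >800ns at 40MHz (~33 cycles), with margin
        pdn_n <= '1';        -- release reset
      end if;
      if count > 200 then
        mute <= '0';         -- un-mute once the DAC is settled
      end if;
    end if;
  end process;
end architecture;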

I must say, the audio quality of this DAC is _way_ better than the old method. It sounds really nice and crisp and clear to my wooden ears at least. Listening to MOD files on the MEGA65 now sounds really nice  -- not that it was bad before. Enough so that I'm currently uploading a pile of MOD files onto my MEGA65's SD card to listen to, while I debug the remaining stuff. We really need one of the MEGA65 MOD players to support play-lists soon...

I've also spent some time debugging the bi-directional joystick functionality, and that's mostly working, although some strange thing has come up that is stopping Solitaire for the MEGA65 from being able to read mouse clicks, which are just the same as fire button presses on a joystick.  This will require some investigation to figure out the root cause. I have the source for Solitaire on the MEGA65, as I wrote the 1351 mouse driver for it, so I might have a poke and a fiddle into this.

Ok, so the problem in Solitaire is actually the high speed of the MEGA65's CPU vs the response time of 5V circuits: It just takes a few clock cycles at 40MHz before the DDR change to input allows the joystick inputs to float back high. In the case of Solitaire, just one extra clock cycle was required. I might be able to claw that back by making the DDR effects asynchronous, rather than waiting for the next clock-cycle edge.  Apart from that, the issue is considered closed.

So now we are back to figuring out what I have wrong in the SDRAM controller.  I have just added some debug registers to the SDRAM controller, so that we can confirm that data from the controller is making its way through the MEGA65 to the expansion device controller, and from there, into the CPU. If not, then that needs to be fixed before I can tell whether the SDRAM controller is actually working or not.  It seems to be at least partly working, because it is issuing data strobes to the expansion controller. I can tell this because, before I plumbed it all together, the expansion device controller would time out waiting for a response, freezing the CPU for millions of cycles before giving up.

And nothing is visible. Looking into the expansion device controller, it looks like it requires the presentation of whole cache lines from the expansion RAM. Nope, it can be configured to need that, but wasn't. So time to find where the plumbing is broken. We'll start by disconnecting the read data from the SDRAM controller, and feeding in a fixed value to the rdata signal, to see if that gets read or not.  That will tell us whether the problem is up-stream or down-stream of the SDRAM controller.

Ok, so when I export a constant value, instead of real memory accesses, it is visible.

So now let's see whether, if we pretend to read a constant value from the RAM, that is also visible. This will check if the RAM value export logic is working right, in particular, whether the timing hand-over from the 162MHz to the 81MHz clock domain is wonky or not. Well, in the process of doing that, I found two important things: 1. I hadn't actually connected the clock to the SDRAM in the R4 target. That will certainly not help ;) and 2. the rdata lines from the SDRAM controller were being set tri-state outside of a process, quite possibly overriding where they were being set within the controller process.  Both of those have now been fixed, and I'm synthesising the bitstream to test them.

Okay, so fixing those has made the debug registers exported from the SDRAM controller visible, but accessing the SDRAM itself is still not working. But I can at least see whether the controller thinks it is scheduling reads and writes via those debug registers now.

First up, I can see that the processor reads whatever was last read from the SDRAM controller, rather than the actually requested byte. This means that the data shows up "one read late".  For example, the SDRAM registers have the word "SDRAM" = 53 44 52 41 4D at $C000000. But when I read it, we get:

:0C000000:42534452414D00054242424242424242

i.e., one of those 42s from the end is read at the start (because I read this address block before), and our values are present, but not in the correct place. I'll have to figure out the cause of this, as it clearly won't be helping. That said, it shouldn't stop us being able to see written values being read back.

What we can also see in the above is the "0005" part:

:0C000000:42534452414D00054242424242424242 

These are two debug registers that hold the low 8 bits of the counts of reads and writes. It is showing $00 reads and $05 writes, because I tried to write 01 02 03 04 05 at $8000000 earlier.

So let's now read 16 bytes (=$10 bytes), and see what happens:

.m8000000
:08000000:42000000000000000000000000000000

.mc000000
:0C000000:53534452414D10054242424242424242

Two things to notice here: First, we see the $42 from the end of the register read showing up in the first position again. Second, we can see that the number of reads has jumped to $10, indicating that the SDRAM controller did see all the read requests.

So, given that this "out by one" read problem might be enough to cause real reads from the SDRAM to be missed, depending on how the buffering works, I should work to fix that.

Looking at how the old HyperRAM controller works, it holds the data strobe line for an extra cycle.  I don't really see how that would impact things in terms of reading the old value.  It feels more like the data ready strobe needs to be delayed by one cycle, so that the value can be assured to be available when required. I'll try adding that delay, and see how it goes.

Well, that's resulted in some progress.  It is now clear that some bytes are being written to and correctly read back from the SDRAM. However, now the debug registers in the SDRAM controller can't be read.  Also, the addresses being written and read aren't lining up, and only even numbered bytes are being read back correctly -- but whether this is because odd numbered bytes aren't being written or aren't being correctly read I can't yet tell. But at least I have signs of life from the SDRAM!

I need to check that I haven't accidentally got cross-domain clock relaxations between the SDRAM and expansion controller as a side-effect of clock names having changed, as that would screw things up real bad. It's quite tricky to tell what is being done in this regard in the Vivado logs, as the clock names are not trivial to map.  Anyway, I've removed them all except for the Ethernet to CPU clock and 12.288MHz (audio sample clock) to CPU clock domains, as those should be the only required ones. If synthesis takes forever, I can look at what the remaining clocks are that need relaxation between them.

So, the good news is that removing all those other clock relaxations didn't result in longer synthesis time, or a broken bitstream.  Unfortunately, it still hasn't got the $C000000 registers readable -- those are still timing out. Which is odd, because that part of the logic shouldn't have changed.  I have made a simulation unit test case for that, so let's see if it is now failing -- which it is. So let's see what's happened there.

Well, quite how it was working before is a bit of a mystery, because I found a really silly bug in the $C000000 register access code.  The test now passes, so I'm resynthesising...

While that is running, I'm thinking about the other bugs I have seen in the SDRAM controller: Primarily that the bytes are read out one word later in memory from where they should be, and that writing odd numbered bytes seems to fail, or at best, write zeroes.

For the odd-bytes-are-written-as-zero bug, I am seeing some interesting things:

s8000007 87

.m8000000
:08000000:87008000820084008700F07010204080

.s8000008 88

.m8000000
:08000000:88008000820084008800887010204080

.s8000009 89

.m8000000
:08000000:89008000820084008900880010204080

This sequence is giving me some more clues as to what might be going on. To see it more clearly, I wrote this little test program:

It clears out the first 16 bytes of SDRAM, and then writes values in progressively, to see what we read back after each:


We see again that the odd bytes are not written at all, and that the even bytes are pushed two to the right, and the first byte of each 8-byte block is whatever was written last.

So what I think is happening here, is that my SDRAM controller is using one too few cycles of latency when reading, thus reading (semi-)rubbish on the first word (remember the SDRAM has a 16-bit wide bus).  Then separately, the write mask for the upper byte is messed up, or the data being presented in the upper byte is being written as zeroes instead of the correct data -- one or the other.

The odd-byte-write bug I can see clearly: We assume 16-bit wide writes from the expansion controller, instead of checking whether it's an odd-address 8-bit write.  I've now made a fix for that, which I will test, and then synthesise.
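
For illustration, the kind of byte-lane masking required looks something like this hedged sketch (the lane-to-DQM mapping and all the names are assumptions; DQM is active high, i.e., 1 = don't write that byte):

library ieee;
use ieee.std_logic_1164.all;

-- Hedged sketch: per-byte write masking for the 16-bit SDRAM bus.
entity sdram_byte_mask is
  port (
    addr0         : in  std_logic;  -- lowest address bit
    write_is_8bit : in  std_logic;
    wdata8        : in  std_logic_vector(7 downto 0);
    wdata16       : in  std_logic_vector(15 downto 0);
    dq_out        : out std_logic_vector(15 downto 0);
    dqm           : out std_logic_vector(1 downto 0)
  );
end entity;

architecture rtl of sdram_byte_mask is
begin
  process(addr0, write_is_8bit, wdata8, wdata16)
  begin
    if write_is_8bit = '1' then
      dq_out <= wdata8 & wdata8;    -- replicate byte on both lanes
      if addr0 = '1' then
        dqm <= "01";                -- odd address: write upper byte only
      else
        dqm <= "10";                -- even address: write lower byte only
      end if;
    else
      dq_out <= wdata16;
      dqm    <= "00";               -- 16-bit write: both bytes enabled
    end if;
  end process;
end architecture;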

While possible fixes for that are synthesising, I think I'll work on the failing back-to-back write test of my SDRAM controller, and then on the simple cached reading.  Without any caching, reading from SDRAM to chip RAM is around 4.5MiB/sec, implying it is taking 9 clock cycles to perform a random read from the SDRAM. That sounds fairly plausible.

Ok, synthesis run has finished, and now all SDRAM reads are timing out. BUT it looks like the reads and writes are happening, more or less, and are visible in the correct addresses. So that's some positive progress. Upper 2 bits of data seem to be getting chopped, though. Let's see what happens in simulation with my unit tests, to see if we can't find and fix the problem there.

The bit trimming doesn't happen in simulation, so far as I can tell. After the latest bitstream synthesis my cache line stuff turns out to be working nicely, but the reads are still timing out. But with the cache, this now means reads happen in blocks of 8 (the size of one cache line) before freezing for a while until timeout occurs.

I think that might be caused by the data ready strobe being only 1x 162MHz cycle long, when it has to be caught by the 81MHz clocked expansion device bus. So I've now stretched it to last 2 cycles, and will see if that helps. Meanwhile, I'm going to attack writing more tests for the SDRAM controller, including the cache I implemented, to make sure that it is not doing anything else detectably stupid.
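
The stretching itself is simple, since the 81MHz clock is derived from the same source as the 162MHz clock. A hedged sketch with illustrative names:

library ieee;
use ieee.std_logic_1164.all;

-- Hedged sketch: stretch a single-cycle 162MHz strobe over two fast
-- cycles, so that at least one 81MHz rising edge must see it high.
entity strobe_stretch is
  port (
    clk162    : in  std_logic;
    strobe    : in  std_logic;   -- single 162MHz-cycle data-ready pulse
    stretched : out std_logic    -- held for two 162MHz cycles
  );
end entity;

architecture rtl of strobe_stretch is
  signal strobe_last : std_logic := '0';
begin
  process(clk162)
  begin
    if rising_edge(clk162) then
      strobe_last <= strobe;
    end if;
  end process;
  -- OR of the current and one-cycle-delayed pulse spans two cycles.
  stretched <= strobe or strobe_last;
end architecture;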

Thanks to a pile of help from Adam, one of the great folks behind the C64 and other cores for the MEGA65, I have managed to get the SDRAM basically working now.  A lot of it was tackling the fairly tricky timing to get the SDRAM working at 162MHz with the FPGA.  There are some remaining timing closure issues, but it is already working stably. That is, once I remembered to implement refresh of the SDRAM! 

The remaining problems now are all related to the slow devices module that interfaces the CPU to the expansion RAM. Basically, reading sometimes returns stale data, rather than fetching the correctly written data. Reading from a different 8-byte region of the SDRAM fixes this problem.  The issue isn't the SDRAM bugging out reading the 8-byte blocks, but rather seems to be the CPU and slow devices modules messing up somewhere.  

To debug this I am adding a pile of debug registers to the CPU to see when it is reading from the SDRAM, and when it is using the new direct cache line read mode I have implemented, that should make linear reads from the SDRAM much faster -- but for unknown reasons causes a speed _drop_ of some 500x.  So something is clearly going wrong with the signalling there.  But the other problems still occur, even when that feature is disabled, so whatever that problem is, it is presumably separate from whatever is causing the reading of incorrect stale data.

A bit of poking about shows that the following kind of sequence can occur:

First, we can read an address, and see a value in it:

m8000808
:08000808:18191A1B1C1D1E1F0000000000000000

Then we write a different value into a different address, that is the same modulo 8: 

.s8000800 22

We can now read that address, and see that it is updated:


.m8000800
:08000800:221112131415161718191A1B1C1D1E1F

So far, so good.  Note that $8000808 contains $18 (the ninth byte in the read above).

Now the weirdness comes, if we try to read from $8000808 directly:


.m8000808
:08000808:22191A1B1C1D1E1F0000000000000000

Note that we end up reading the most recently written value, rather than the correct value.

If we now ask for it again from the modified address, we read it correctly:


.m8000800
:08000800:221112131415161718191A1B1C1D1E1F

But asking again for $8000808 reads back the most recently written value again:


.m8000808
:08000808:22191A1B1C1D1E1F0000000000000000

And I think I finally found the cause:  When making any access to the expansion RAM, the last-written address is updated, instead of only when a write is actually occurring.
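
In other words, the guard was missing. A hedged sketch of the fix, with illustrative names:

library ieee;
use ieee.std_logic_1164.all;

-- Hedged sketch: the record of the last written address/value must
-- only update on actual writes, not on every expansion RAM access.
entity last_write_tracker is
  port (
    clk        : in  std_logic;
    request    : in  std_logic;
    is_write   : in  std_logic;
    addr       : in  std_logic_vector(26 downto 0);
    wdata      : in  std_logic_vector(7 downto 0);
    last_addr  : out std_logic_vector(26 downto 0);
    last_value : out std_logic_vector(7 downto 0)
  );
end entity;

architecture rtl of last_write_tracker is
begin
  process(clk)
  begin
    if rising_edge(clk) then
      -- The bug: this update happened whenever request='1'.
      -- The fix: qualify it with is_write as well.
      if request = '1' and is_write = '1' then
        last_addr  <= addr;
        last_value <= wdata;
      end if;
    end if;
  end process;
end architecture;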

I'm fixing this now, and am confident that it should be fixed (and also fix similar problems with the HyperRAM on the R3/R3A boards). And following synthesis, I can confirm that this is the case: All cache consistency issues are now fixed. This has resulted in the apparent speed of copying from slow to chip RAM dropping, because it was previously erroneously using a cached result when it shouldn't have been.  It's now more like 4.6MB/sec, or about 7MB/sec if the single pre-fetch mode is enabled -- but that _does_ still result in cache consistency errors.

So that just leaves the problem of the SDRAM going hundreds of times slower when enabling the cache line reading directly by the CPU.  This is totally weird, as it can only save time for the CPU, not cost more time.

I can only guess that the CPU is getting confused thinking that it has to wait for the slow device ready toggle to flip before it can proceed. So I have made sure that when reading from the cache line, that the accessing_slowram signal is cleared.  However, if that fixes it, then it is revealing that there is some other deeper problem, where wait_states_non_zero is being asserted, even though I clear it in the case of reading via the cache line. In fact, I copied the perfectly working code from the case where we read the single pre-fetched byte from the slow devices unit, which effectively is just an indirect way of doing the same thing (and thus has higher latency).

Well, a lot has happened since I wrote the above, including tracking down a fascinating bug that was caused by accidentally activating writes with prefetch during the prefetch after a write.  That caused a most baffling problem with random corruption of an entire DRAM row.  This is interesting on a couple of fronts. 

First, it was quite an adventure and process of elimination to work out what was going on.  Because entire DRAM rows were being corrupted, I figured it had to be something to do with the DRAM row activation and/or precharging when closing a row. Initially, I thought it was not having enough latency cycles during row activation or precharging at the end. Then I was very happy, because my SDRAM model for my simulation tests helped to find the problem really quickly: I wasn't clearing the WRITE+PRECHARGE command during the latency cycles. My simulation model of the SDRAM is purposely quite paranoid, and aborts the simulation if it detects you doing anything that might be invalid -- and it flagged the issue straight away.
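
The essence of the bug class, as a hedged sketch (the command encodings are the standard {CS,RAS,CAS,WE} values; everything else is illustrative): the command bus must fall back to NOP by default on every cycle, so a command is only ever issued for exactly one clock.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hedged sketch: issue WRITE for one cycle, then hold NOP during the
-- latency cycles, rather than re-issuing the command every clock.
entity write_cmd_sequencer is
  port (
    clk162   : in  std_logic;
    go_write : in  std_logic;
    cmd      : out std_logic_vector(3 downto 0)  -- {CS,RAS,CAS,WE}
  );
end entity;

architecture rtl of write_cmd_sequencer is
  constant CMD_NOP   : std_logic_vector(3 downto 0) := "0111";
  constant CMD_WRITE : std_logic_vector(3 downto 0) := "0100";
  type state_t is (IDLE, ISSUE, LATENCY);
  signal state      : state_t := IDLE;
  signal wait_count : unsigned(2 downto 0) := (others => '0');
begin
  process(clk162)
  begin
    if rising_edge(clk162) then
      cmd <= CMD_NOP;  -- the fix: NOP by default, every cycle
      case state is
        when IDLE =>
          if go_write = '1' then
            state <= ISSUE;
          end if;
        when ISSUE =>
          cmd        <= CMD_WRITE;            -- asserted for one cycle only
          wait_count <= to_unsigned(4, 3);    -- wait cycles: illustrative
          state      <= LATENCY;
        when LATENCY =>
          if wait_count = 0 then
            state <= IDLE;
          else
            wait_count <= wait_count - 1;
          end if;
      end case;
    end if;
  end process;
end architecture;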

Second, I had only just been reading a paper about how DRAM can be used as a true random number generator. The authors' method is to purposely not allow enough latency when opening a row, which triggers modest numbers of errors in the row when used.  What I was doing was causing lots of errors, possibly more than they reported in the paper.  I just dropped one of the authors a line to let them know what I discovered, in case it's of interest to them.

Anyway, fixing that got it working.

The next step is to improve the performance by not closing the DRAM row after every access, so that we can avoid the row latency for linear reads.  This will also be kinder to the SDRAM, rather than basically implementing ROWHAMMER as the default mode of operation ;)  It would actually be interesting to see if it is prone to ROWHAMMER. 
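
A hedged sketch of the row tracking this requires (names and widths are illustrative; note that a refresh must also invalidate the remembered row, since AUTO REFRESH closes all rows):

library ieee;
use ieee.std_logic_1164.all;

-- Hedged sketch: remember which row is open, and only pay the
-- PRECHARGE+ACTIVATE penalty when the requested row differs.
entity row_tracker is
  port (
    clk162    : in  std_logic;
    refresh   : in  std_logic;   -- AUTO REFRESH closes all rows
    req       : in  std_logic;
    row_addr  : in  std_logic_vector(12 downto 0);
    need_open : out std_logic    -- '1' => precharge + activate first
  );
end entity;

architecture rtl of row_tracker is
  signal open_row  : std_logic_vector(12 downto 0) := (others => '0');
  signal row_valid : std_logic := '0';
begin
  process(clk162)
  begin
    if rising_edge(clk162) then
      need_open <= '0';
      if refresh = '1' then
        row_valid <= '0';
      elsif req = '1' then
        if row_valid = '0' or open_row /= row_addr then
          need_open <= '1';        -- row miss: close old row, open new
          open_row  <= row_addr;
          row_valid <= '1';
        end if;
      end if;
    end if;
  end process;
end architecture;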

Okay, that didn't take too much effort to implement.  Now the only remaining bug is if the SDRAM read cache-line feature is enabled (which is disabled by default because of this), occasionally a read request doesn't respond, causing a timeout of 2^23 clock cycles. I've reduced that to 32 clock cycles, so that when this happens, it doesn't cause the machine to freeze for a fraction of a second. 

I've also made selection between HyperRAM and SDRAM on the R4 selectable at synthesis time. So let's compare the performance of the two. Note that the HyperRAM implementation benefits from the fixing of the pre-fetch bug, thus HyperRAM performance is also improved over what we have seen previously on the R3:

First, with the HyperRAM. Note that there is actually no Trapdoor Slow RAM in this machine, so it is showing some slightly random numbers for that:

 
And now with the SDRAM:
 

We can see that it is faster across all measures -- especially copying between regions of the expansion RAM, which is now more than twice as fast, because the latency of the SDRAM is quite a bit lower than that of the HyperRAM.

If I ever get the SDRAM cache-line bug fixed, then the speed of copying from slow to chip RAM will approximately double again, to around 16MB/sec, only 20% slower than copying from chip RAM to chip RAM -- but I've run out of puff right now to work on that.  My goal was really just to get the MEGA65 R4 board working, and confirm it has no known problems vs the R3A board, which I have achieved. 

This means that Trenz Electronic can now move forward to the next steps (of which there are still a few to go) of having the next batch of MEGA65s produced, which will be based on this new R4 PCB -- and which we are hoping will come out later in 2023, probably Q3.