Monday, 4 September 2023

Debricking MEGA65 R3/R3A Boards with De-programmed MAX10 FPGAs

We had one machine previously that got into this state, and now have encountered a 2nd.  

The first happened when someone was updating the MAX10 bitstream to deal with the HDMI flicker problem, and the flashing of the MAX10 failed because there was a microSD card in the external slot.

The situation behind this second one I'm not sure of. In the end, it doesn't matter. We just need a way to easily fix the problem.

To understand what the situation looks like, we need to understand a little of how the MEGA65 R3/R3A works with two FPGAs.

The main FPGA of the MEGA65 is the Artix A200T Xilinx FPGA that actually runs the MEGA65 core.  That's the one that most folks are familiar with.

But there is a 2nd FPGA on the board, an Intel (previously Altera) MAX10 FPGA that does some helper tasks. It was added, somewhat ironically as it turns out, to make it impossible to brick the MEGA65, by making it possible to have the MAX10 FPGA load and flash a new bitstream to the Xilinx FPGA via the external SD card slot.

That functionality was never implemented in the MAX10 FPGA, because it turns out to be quite complex, and was never really needed, as there are a bunch of other ways to solve that problem, usually involving a JTAG adaptor on one of the MEGA65's internal JTAG headers.  However, it left a lingering problem, which has now popped up twice.

That problem is that the MAX10 FPGA is configured to disable its own JTAG interface if the main FPGA is running a valid bitstream.  I don't really recall the reasoning behind this, except perhaps that it was intended to operate in the opposite sense: Make sure JTAG is enabled on the MAX10 if the main FPGA is not running a valid program.  But it is the disabling of the MAX10's JTAG that is the root of this problem:  If flashing the MAX10 fails part way through, but the Xilinx FPGA has a valid bitstream stored in flash, then the MAX10 will not "boot", and can't be reprogrammed via JTAG because the Xilinx FPGA is running a valid bitstream.

If that were all there was to the problem, it would be easy to fix, because you could just use the utility menu on the MEGA65 to erase slot 0, or even just use JTAG to start loading a bitstream onto it, and then abort it, so that it remains "unprogrammed".  However, there are some nasty side-effects to the MAX10 being deprogrammed that make the situation much worse. 

Primarily, the MAX10 relays the JTAG interface of the Xilinx FPGA, so that it could do the backup programming that, as mentioned, was never implemented.  But to make things worse, the keyboard and even the power enable for the joystick and cartridge ports also go via the MAX10 (so that the MAX10 recovery system could read the keyboard, making it easier to use).

So the net result is the MEGA65 will look like it is fine, but the keyboard and joystick ports won't work.

Fortunately, the SD card interface still works, and the default HYPPO hypervisor code on the MEGA65 allows updating itself from the SD card. This allows us to effectively run arbitrary code in Hypervisor context, even without working keyboard or joystick ports.  When the first instance of this occurred, we used this feature to load a special version of the flashing program for the MEGA65 that could be controlled using IO lines on the 34pin internal floppy interface.  It was cumbersome, but worked.

However, in trying to do it a second time, we have discovered we didn't really document it very well, and also, on reflection, it's a bit over-complicated. All we need is a HYPPO program that causes the Xilinx FPGA to deprogram for long enough for flashing the MAX10 FPGA to occur.

This can probably be done by asking the Xilinx FPGA to switch to an empty flash slot, or some other similar technique. This would reduce the problem to just putting the special HICKUP.M65 file on the SD card, and inserting it.

I've tried doing this now, making a special "BRICKUP.M65" file that can be copied onto the SD card as HICKUP.M65, and causes the FPGA to deconfigure for a few seconds, before it finds the next valid slot.  This works by trying to boot from part way into slot 7.  The FPGA reads the flash at some speed, until it eventually wraps around to the start of the flash, and finds a valid slot. 

For some reason, in my case it ends up booting from slot 3, rather than slot 0, which makes me suspect that I have the math for the slot address wrong.  Each slot is 8MiB long, so slot 7 should start at 7 x 8MiB = 56MiB = $3800000. I am telling the FPGA to start at $3810000, so it looks like it should be fine. It might just be some funny behaviour of the FPGA partially loading the other slots, and accidentally skipping over the valid slots in slots 0, 1 and 2.
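As a quick check of the slot arithmetic, here is a minimal Python sketch. It assumes, as described above, that the slots are 8MiB each and laid out back to back from address 0; the function names are only for illustration.

```python
SLOT_SIZE = 8 * 1024 * 1024  # 8 MiB per slot, as described above


def slot_start(slot: int) -> int:
    """Return the flash byte address at which the given slot begins."""
    return slot * SLOT_SIZE


def slot_of(address: int) -> int:
    """Return the number of the slot that contains the given flash address."""
    return address // SLOT_SIZE


print(f"slot 7 starts at ${slot_start(7):07X}")             # $3800000
print(f"$3810000 lies in slot {slot_of(0x3810000)}")         # slot 7
print(f"offset into slot: ${0x3810000 - slot_start(7):X}")   # $10000
```

So the address itself checks out: $3810000 is 64KiB into slot 7.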

Anyway, it keeps the Xilinx FPGA deconfigured for long enough for the keyboard to show seven sequences of the "ambulance light flashing", i.e., about 5 or 6 seconds.  Is that long enough?  Can we make it take longer?

Well, it turns out that it is enough, because we can program the MAX10 in ~1sec via JTAG.

So the sequence is as follows (with instructions for Linux primarily, but also info on how to do it under Windows).

If you prefer a visual explanation, the following video shows me doing it under Linux. We don't currently have a visual record of doing it under Windows, sorry.

Stage 1: Preparation

1. Copy BRICKUP.M65 onto the SDcard, and rename it to HICKUP.M65

2. Power off the MEGA65 and turn it back on. Make sure that after briefly showing the MEGA65 boot messages, the monitor then goes blank for at least a few seconds.

3. Plug a TEI0004 or compatible JTAG adaptor onto J17 of the MEGA65, and connect the USB cable to a computer running Linux.

4. Make sure you have a MAX10 bitstream built, using the https://github.com/mega65/mega65-r2-max10 repository.

5. Using the program.sh file from that repository, make sure that the JTAG adapter is detected. It will, for now, say that no devices are connected to it. That's ok. You should see something like this:

mega65-r2-max10$ ./program.sh
1) Arrow-USB-Blaster [USB0]
  Unable to read device chain - JTAG chain broken

Error (213019): Can't scan JTAG chain. Error code 87.

Okay, you are now all set for the initial recovery phase, where we load a valid bitstream into the MAX10, but not yet into its flash. This will let you interact with the MEGA65 via JTAG, thus allowing us to deprogram the Xilinx FPGA for a longer period of time, which will then make it _way_ easier to reflash the MAX10.

If you are using Linux, then use the following versions of steps 6 and 7 (step 9 also differs). Use the green-background instructions for Linux, and the blue-background instructions for Windows:

6. Turn the MEGA65 off, and on the Linux computer with the MAX10 firmware, type this command, but don't yet hit enter on it: sleep 3 ; ./program.sh

7. Hit enter at the same time as you turn the MEGA65 on. Wait up to 15 seconds for everything to happen.

Repeat (6) and (7) until the keyboard power light comes on, and you see output like this from program.sh:

mega65-r2-max10$ sleep 3; ./program.sh
1) Arrow-USB-Blaster [USB0]
  031820DD   10M08SA(.|ES)/10M08SC

Info: *******************************************************************
Info: Running Quartus Prime Programmer
    Info: Version 18.1.0 Build 625 09/12/2018 SJ Lite Edition
    Info: Copyright (C) 2018  Intel Corporation. All rights reserved.
    Info: Your use of Intel Corporation's design tools, logic functions
    Info: and other software and tools, and its AMPP partner logic
    Info: functions, and any output files from any of the foregoing
    Info: (including device programming or simulation files), and any
    Info: associated documentation or information are expressly subject
    Info: to the terms and conditions of the Intel Program License
    Info: Subscription Agreement, the Intel Quartus Prime License Agreement,
    Info: the Intel FPGA IP License Agreement, or other applicable license
    Info: agreement, including, without limitation, that your use is for
    Info: the sole purpose of programming logic devices manufactured by
    Info: Intel and sold by Intel or its authorized distributors.  Please
    Info: refer to the applicable agreement for further details.
    Info: Processing started: Sat Sep  2 13:48:19 2023
Info: Command: quartus_pgm -m jtag -o p;output_files/mega65-r2-max10.sof
Info (213045): Using programming cable "Arrow-USB-Blaster [USB0]"
Info (213011): Using programming file output_files/mega65-r2-max10.sof with checksum 0x0014479C for device 10M08SAU169@1
Info (209060): Started Programmer operation at Sat Sep  2 13:48:20 2023
Info (209016): Configuring device index 1
Info (209017): Device 1 contains JTAG ID code 0x031820DD
Info (209007): Configuration succeeded -- 1 device(s) configured
Info (209011): Successfully performed operation(s)
Info (209061): Ended Programmer operation at Sat Sep  2 13:48:20 2023
Info: Quartus Prime Programmer was successful. 0 errors, 0 warnings
    Info: Peak virtual memory: 427 megabytes
    Info: Processing ended: Sat Sep  2 13:48:20 2023
    Info: Elapsed time: 00:00:01
    Info: Total CPU time (on all processors): 00:00:00
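Because the timing is fiddly, the attempt often has to be repeated. The following is only a sketch of how the result of each attempt could be checked automatically. It assumes that the "Quartus Prime Programmer was successful" line shown in the log above is a reliable success marker, and the helper names are hypothetical. You would still need to power the MEGA65 on by hand at the right moment.

```python
from typing import Callable

SUCCESS_MARKER = "Quartus Prime Programmer was successful"
FAILURE_MARKER = "JTAG chain broken"


def programming_succeeded(log_text: str) -> bool:
    """True if the captured program.sh output reports a successful run."""
    return SUCCESS_MARKER in log_text and FAILURE_MARKER not in log_text


def retry_until_programmed(run_attempt: Callable[[], str], max_tries: int = 10) -> int:
    """Call run_attempt() until its output indicates success.

    run_attempt is expected to perform one attempt (for example, run
    "sleep 3 ; ./program.sh" while you power the machine on) and return
    the captured output. Returns the number of the attempt that
    succeeded, or 0 if none did.
    """
    for attempt in range(1, max_tries + 1):
        if programming_succeeded(run_attempt()):
            return attempt
    return 0
```

Passing the attempt in as a function keeps the checking logic separate from however you choose to launch program.sh.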

If you are using a Windows based computer instead, use the following versions of steps 6 and 7:

6. Turn the MEGA65 off, and on the Windows computer with the MAX10 firmware, open the Quartus hardware manager, and select the .SOF file for the MAX10 firmware, similar to the following image:

7. Start mashing the start button near the top left of the display, while simultaneously turning the MEGA65 on. You might see a happy run indicated by Quartus, but probably not, because you are continuing to mash the start button. You will know when it has worked because the red LED in the lower-right corner of the MEGA65 will start pulsing, and the ambulance lights on the keyboard will stop, assuming that the MEGA65 eventually loads a valid core after some seconds.

 

NOTE: For reasons I don't fully understand, sometimes you have to try this process a few times before it will work, whether on Linux or Windows.  Leaving the MEGA65 off for a few seconds between attempts is probably a good idea. As might be disconnecting the HDMI when powered off, to prevent HDMI back-powering issues keeping either of the FPGAs partially powered.

8. Okay, so at this point, we have the MAX10 temporarily programmed. Now we can de-configure the Xilinx FPGA via a TE0790 JTAG connected to port JB1 of the MEGA65, with a command like this from a connected computer running Linux, with the MEGA65 tools installed:

$ m65 -q brickup.bit

The brickup.bit file is a special bitstream that has been purposely constructed to be invalid. This causes the Xilinx FPGA to never complete configuring, thus leaving the MAX10 JTAG interface active. The easy way to make your own brickup.bit file is to take any valid bitstream, e.g., mega65r3.bit, and then use a command like this:

$ dd if=mega65r3.bit of=brickup.bit bs=65536 count=1
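If dd is not to hand, the same truncation can be sketched in Python. This is an assumed equivalent of the dd command above, keeping only the first 65536 bytes of the file; the function name is hypothetical, and the file names are the ones used above.

```python
def make_truncated_copy(src_path: str, dst_path: str, keep_bytes: int = 65536) -> int:
    """Copy only the first keep_bytes bytes of src_path to dst_path.

    Mirrors "dd bs=65536 count=1": the result is a deliberately
    incomplete bitstream. Returns the number of bytes written.
    """
    with open(src_path, "rb") as src:
        head = src.read(keep_bytes)
    with open(dst_path, "wb") as dst:
        dst.write(head)
    return len(head)


# Example (file names as used in the text):
# make_truncated_copy("mega65r3.bit", "brickup.bit")
```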

You should now be back with ambulance lights on the keyboard. You can now do the final steps:

9. If on Linux, run the flash.sh program from the mega65-r2-max10 repository:

mega65-r2-max10$ ./flash.sh
1) Arrow-USB-Blaster [USB0]
  031820DD   10M08SA(.|ES)/10M08SC

Info: *******************************************************************
Info: Running Quartus Prime Programmer
    Info: Version 18.1.0 Build 625 09/12/2018 SJ Lite Edition
    Info: Copyright (C) 2018  Intel Corporation. All rights reserved.
    Info: Your use of Intel Corporation's design tools, logic functions
    Info: and other software and tools, and its AMPP partner logic
    Info: functions, and any output files from any of the foregoing
    Info: (including device programming or simulation files), and any
    Info: associated documentation or information are expressly subject
    Info: to the terms and conditions of the Intel Program License
    Info: Subscription Agreement, the Intel Quartus Prime License Agreement,
    Info: the Intel FPGA IP License Agreement, or other applicable license
    Info: agreement, including, without limitation, that your use is for
    Info: the sole purpose of programming logic devices manufactured by
    Info: Intel and sold by Intel or its authorized distributors.  Please
    Info: refer to the applicable agreement for further details.
    Info: Processing started: Sat Sep  2 13:51:45 2023
Info: Command: quartus_pgm -m jtag -o p;output_files/mega65-r2-max10.pof
Info (213045): Using programming cable "Arrow-USB-Blaster [USB0]"
Info (213011): Using programming file output_files/mega65-r2-max10.pof with checksum 0x027283AE for device 10M08SAU169@1
Info (209060): Started Programmer operation at Sat Sep  2 13:51:46 2023
Info (209017): Device 1 contains JTAG ID code 0x031820DD
Info (209060): Started Programmer operation at Sat Sep  2 13:51:46 2023
Info (209016): Configuring device index 1
Info (209017): Device 1 contains JTAG ID code 0x031820DD
Info (209007): Configuration succeeded -- 1 device(s) configured
Info (209011): Successfully performed operation(s)
Info (209061): Ended Programmer operation at Sat Sep  2 13:51:47 2023
Info (209024): Programming device 1
Info (209011): Successfully performed operation(s)
Info (209061): Ended Programmer operation at Sat Sep  2 13:51:56 2023
Info: Quartus Prime Programmer was successful. 0 errors, 0 warnings
    Info: Peak virtual memory: 427 megabytes
    Info: Processing ended: Sat Sep  2 13:51:56 2023
    Info: Elapsed time: 00:00:11
    Info: Total CPU time (on all processors): 00:00:01


9. If on Windows, use Quartus to flash the .POF file now.

10. Remove the SD card, and power the MEGA65 off and on again, and confirm that the keyboard power light comes on, and that you get no more ambulance lights. If not, you might need to repeat from step (6).

11. Delete the HICKUP.M65 file from the SD card, and reinsert it, and turn your MEGA65 off and on again.

Your MEGA65 should now be all happy and healthy again.

Wednesday, 19 July 2023

Last minute changes to the R4 board

Well, we thought we had the R4 board all settled, and then the ghost of supply chains past came back to visit, requiring some changes to the R4 board to accommodate the lack of availability of one of the voltage regulators.

This is giving us the opportunity to make some further minor refinements based on feedback from our community, to make the cartridge port compatible with more cartridges.  This is because in the previous PCB revisions, I didn't realise that the /RESET, /GAME, /EXROM, /IRQ, /NMI and potentially also /ROML and /ROMH lines need to be bi-directional.

This requires one additional FPGA pin for each of these lines, but there are currently no spare pins, so we need a way to save 7 pins.

We can do this by replacing the 4 FPGA pins that are used to identify board revisions (which were only added on the R4 board), and the 4 FPGA pins used to connect to the DIP switches, with an I2C IO expander. This would require only 2 pins (and allow 8 DIP switches instead of 4, which would be handy), thus saving 6 pins. We can then re-use DBG10 for the 7th one, thus meeting our needs.

To make this change we would need to:

0. The revised board shall be called R5, because it will not be bitstream compatible with R4, and is not feature identical, because of the cartridge port enhancements that will result from these changes.
1. Remove the CHS-04TA dipswitch, freeing FPGA pins N18, P19, T16 and U16.
2. Remove the REV_BIT0 - REV_BIT3 assignments, freeing FPGA pins L15, M16, F20 and T21.
3. Add a PCA9555 or similar 16-bit I2C IO expander with internal pull-up resistors (or alternatively, a PCA9535 or similar that lacks pull-up resistors, and then add external pull-up resistors as required to the signals described in (4) - (6)).
4. Assign REV_BIT0 -- REV_BIT3 to 4 IO pins on the IO expander, to indicate REV5, i.e., with BIT1,3 tied to GND, and BIT0,2 unconnected to float to VCC.
5. Create new SUBREV_BIT0 -- SUBREV_BIT3 to indicate the sub-revision of the board, to indicate sub-revision 0 = "R5" without suffix letter, i.e., BIT0,1,2,3 tied to GND.
6. Add an 8-position dipswitch (or 2x8 pin 0.1" male header such as this to save cost), connected to the other 8 IO pins of the I2C IO expander.
7. Connect the I2C interface of IO expander to FPGA pins N18 and P19.
8. Disconnect F_C64_ROMH, F_C64_ROML, C64_ROMH and C64_ROML from U8.
9. Add an NC7SZ126P5X IC (similar to U30) to input and level shift C64_ROMH to F_C64_ROMH, but with pin 1 (OE, active high) connected to new signal F_C64_ROMH_DIR.
10. Add an NC7SZ126P5X IC (similar to U30) to input and level shift C64_ROML to F_C64_ROML, but with pin 1 (OE, active high) connected to new signal F_C64_ROML_DIR.
11. Add a 74AHCT1G125DBV gate to allow driving C64_ROMH low when new signal F_C64_ROMH_DIR is low, with pin 2 of the gate tied to existing signal F_C64_ROMH, to create a tri-stateable output driver, pulled high by the existing C64_ROMH pull-up resistor R99.
12. Assign the new signal F_C64_ROMH_DIR to FPGA pin T16.
13. Add a 74AHCT1G125DBV gate to allow driving C64_ROML low controlled by the new signal F_C64_ROML_DIR and existing signal F_C64_ROML, similar to (11).
14. Assign the new signal F_C64_ROML_DIR to FPGA pin U16.
15. Disconnect F_C64_RESET and C64_RESET from U9.
16. Add an NC7SZ126P5X IC (similar to U30) to input and level shift C64_RESET to F_C64_RESET, but with pin 1 (OE, active high) connected to new signal F_C64_RESET_EN.
17. Add a 74AHCT1G125DBV gate to allow driving C64_RESET low controlled by the new signal F_C64_RESET_EN, similar to U30.
18. Assign the new signal F_C64_RESET_EN to FPGA pin T21.
19. Add a 74AHCT1G125DBV gate to allow driving C64_GAME low controlled by the new signal F_C64_GAME_EN and existing signal F_C64_GAME, similar to U30.
20. Assign the new signal F_C64_GAME_EN to FPGA pin L15.
21. Add a 74AHCT1G125DBV gate to allow driving C64_EXROM low controlled by the new signal F_C64_EXROM_EN and existing signal F_C64_EXROM, similar to U30.
22. Assign the new signal F_C64_EXROM_EN to FPGA pin M16.
23. Add a 74AHCT1G125DBV gate to allow driving C64_NMI low independently, controlled by the new signal F_C64_NMI_EN and existing signal F_C64_NMI, similar to U30.
24. Assign the new signal F_C64_NMI_EN to FPGA pin F20.
25. Add a 74AHCT1G125DBV gate to allow driving C64_IRQ low independently, controlled by the signal DBG10 on pin 1 of the gate, and existing signal F_C64_IRQ on pin 2 of the gate, similar to U30.
26. Replace R41 and R42 with 3.3K resistors instead of the current 4.7K resistors. This will then exactly match the value of those on the C64.

That list looks long, but all it really does is fix those seven signals to be independently bidirectionally controllable at high-speed. When I say high-speed, I mean at full cartridge speed of ~2MHz.  The routing on the PCB can, however, be as byzantine as is expedient, because they are still very low speed in the grand scheme of things.
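To double-check the pin budget and the revision strapping described in the list, here is a small Python sketch. The pin and signal names are taken from the list; the assumption that a grounded bit reads as 0 and an unconnected (pulled-up) bit reads as 1 follows the description in (4) and (5).

```python
# FPGA pins freed by removing the DIP switch and the REV_BIT inputs.
freed = ["N18", "P19", "T16", "U16",   # former DIP switch pins
         "L15", "M16", "F20", "T21"]   # former REV_BIT0..3 pins

# New uses for those pins, as assigned in the list above.
reused = {
    "N18": "I2C (IO expander)", "P19": "I2C (IO expander)",
    "T16": "F_C64_ROMH_DIR",    "U16": "F_C64_ROML_DIR",
    "T21": "F_C64_RESET_EN",    "L15": "F_C64_GAME_EN",
    "M16": "F_C64_EXROM_EN",    "F20": "F_C64_NMI_EN",
}
control_signals = [s for s in reused.values() if not s.startswith("I2C")]
control_signals.append("DBG10 (IRQ enable)")  # the 7th line re-uses DBG10


def strap_value(grounded_bits, width=4):
    """Value read from strapping pins: grounded bits read 0, others float to 1."""
    return sum(1 << b for b in range(width) if b not in grounded_bits)


print(len(control_signals))         # 7 direction/enable signals
print(strap_value({1, 3}))          # REV bits: 5
print(strap_value({0, 1, 2, 3}))    # SUBREV bits: 0
```

All eight freed pins are accounted for: two for I2C and six for the new control signals, with DBG10 supplying the seventh.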

The next step is to submit this to Trenz Electronic, so that they can sanity check it for us, and confirm if the changes are feasible.

Sunday, 9 July 2023

Enabling the Super-Cap on the MEGA65 R4 PCB

The MEGA65 R4 PCB uses a different RTC chip than the old R3A board.  The R4 uses a newer chip, the RTC-RV-3032-C7. This chip uses less power, so a battery will last longer.  Also, it supports the use of a "super capacitor", if you don't have a battery, and the MEGA65 R4 board is designed to use such an arrangement (it also has a socket for a CR2032 battery).


 

We already have the new RTC chip working on the R4, and you can set the time and date on it, without problem. But what we have not yet implemented is enabling the super capacitor.  This requires setting an EEPROM register that enables the charge pump to the super capacitor, and selecting the voltage and current for this.

This is all done using the Power Management Unit (PMU), which is configured in register $C0.  We have this accessible via $FFD71D0 on the R4 board.  The only problem is, writing to it doesn't seem to have any effect. So I need to investigate this.

The datasheet can be found here.

Page 72 of the datasheet explains the correct process for changing this register:

Edit the Configuration settings (example, when write protection is enabled (EEPWE = 255)):
1. Enter the correct password PW (PW = EEPW) to unlock write protection
2. Disable automatic refresh by setting EERD = 1
3. Edit Configuration settings in registers C0h to C5h (RAM)
4. Update EEPROM (all Configuration RAM → EEPROM) by setting EECMD = 11h
5. Enable automatic refresh by setting EERD = 0
6. Enter an incorrect password PW (PW ≠ EEPW) to lock the device

We don't have write protection enabled, so far as I am aware. So we need to set the EERD bit, which is in bit 2 of register $10, which is mapped at $FFD7120, update the value, then write $11 into the EECMD register, which is register $3F, which is mapped at $FFD714F.
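The address arithmetic here is simple but easy to get wrong, so here is a minimal Python sketch. It assumes, based on the addresses quoted in this post, that RTC register $00 is mapped at $FFD7110 and that the registers follow contiguously; the dictionary keys are just labels for this example.

```python
RTC_BASE = 0xFFD7110  # assumed: where RTC register $00 appears on the R4

REGISTERS = {
    "EERD control (bit 2)": 0x10,
    "EECMD": 0x3F,
    "PMU": 0xC0,
}


def rtc_address(register: int) -> int:
    """Map an RTC register number to its MEGA65 memory address."""
    return RTC_BASE + register


for name, reg in REGISTERS.items():
    print(f"{name}: register ${reg:02X} -> ${rtc_address(reg):07X}")
```

This reproduces the three addresses mentioned so far: $FFD7120, $FFD714F and $FFD71D0.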

So let's try that.

Hmm... Setting $FFD7120 doesn't seem to work.  I've just double-checked the VHDL, and it looks like all the registers should be writeable. But now that I am testing it, I am finding that none of the registers are writeable. But I'm pretty sure I had them writeable in the past -- at least the ones for the time and date.  I did have to adjust the register writing code when I added support for writing to higher-numbered registers, so it's possible I have messed something up.

Now I need to go back in time and find where this problem came in, so that I can confirm this theory. This requires doing some POKEs and PEEKs using about 100 different bitstream versions.  So I really want to automate this.  So here's the crazy shell script I wrote to do this for me:

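#!/bin/tcsh -fx

# Try each R4 bitstream in turn, newest first
foreach bit ( `ls -1t bin/mega65r4*.bit` )
  echo $bit
  # Load the bitstream, then run the RTC test program
  m65 -b $bit
  sleep 4
  m65 -r rtctest.prg
  sleep 2
  # Save the result bytes from $0800-$0802 and keep just the hex values
  monitor_save -a 0800:0802 foop
  hexdump -C foop | cut -c11-18 > foop2
  # 27 is the compressed size when all three values are identical
  set len=`gzip < foop2 | wc -c`
  if ( $len != 27 ) then
    echo "Bingo! $bit"
    # Pause until the user presses enter
    set userinput=$<
  endif
end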

This basically loads each bitstream, then loads a little BASIC program that tries to set and read back the seconds value of the RTC in $FFD7110.  It copies the initial value from $FFD7110 to $0800, and then the PEEK of $FFD7110 after writing two different values to it to $0801 and $0802.  If the RTC register writing is working, then at least two of those results will be different.  I couldn't be bothered writing a program to compare those three numbers, so I just used a fun hack I know: check whether the compressed size of the dump differs from the size it has when it contains 3 identical numbers.
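For comparison, here is a sketch of the more direct check that the compression trick stands in for. It is not what the script above does. It assumes the three saved values are available as a Python bytes object; how they would be extracted from the foop file is left open here.

```python
def rtc_write_worked(samples: bytes) -> bool:
    """Return True if the three sampled values are not all identical.

    samples holds the bytes saved from $0800-$0802: the initial RTC
    value, then the two values read back after each write.
    """
    if len(samples) != 3:
        raise ValueError("expected exactly three bytes")
    return len(set(samples)) > 1


print(rtc_write_worked(bytes([0x12, 0x12, 0x12])))  # False: writes had no effect
print(rtc_write_worked(bytes([0x12, 0x34, 0x56])))  # True: register changed
```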

So with this ugly little script, I have it automatically testing a bunch of bitstreams while I write this blog.  Hopefully it will pick one up soon. If it doesn't, then I have to assume that writing to the RTC has never worked -- or that it is write-protected or something like that.

Hmm.. it's tried bitstreams all the way back to mid-March (it's late-June now), with no luck. So I am beginning to suspect that I might have enabled some write-protect kind of thing.

The process of unlocking the write-protect is listed here on page 114 of the datasheet:

4.22.1. ENABLE/DISABLE WRITE PROTECTION
If the write protection function is enabled by writing 255 in register EEPWE (EEPROM CAh), it remains possible to
read all the registers except the EEPROM registers. The EEPROM registers cannot be read because it cannot be
written to the EE Address and EE Command registers. If the function is not enabled, read and write are possible for
all corresponding registers.
If the write protection function is enabled, it is necessary to first write the correct 32-Bit Password PW (PW = EEPW)
(Unlock), before any attempt to write in the RAM registers and to read and write in the EEPROM registers.
Once the user is finished with the write access and subsequently the write protection is still enabled or enabled again
(by writing 255 in EEPROM register EEPWE), it is necessary to write an incorrect password (PW ≠ EEPW) into the
Password PW registers in order to write-protect (Lock) the registers. See program sequences below and
FLOWCHART.
Enable write protection:
1. Initial state (POR): WP-Registers are Not write-protected (EEPWE ≠ 255)
Reference password is stored in the RAM mirror of EEPW (addrs C6h to C9h)
2. Disable automatic refresh by setting EERD = 1
3. Enable password function by entering EEPWE = 255 (RAM mirror address CAh)
4. Enter the correct password PW (PW = EEPW) to unlock write protection (RAM addresses 39h to 3Ch)
5. Update EEPROM (all Configuration RAM → EEPROM) by writing 11h to EECMD
6. Enable automatic refresh by setting EERD = 0
7. Enter an incorrect password PW (PW ≠ EEPW) to lock the device (RAM addresses 39h to 3Ch)
8. Final state: WP-Registers are Write-protected by password (EEPWE = 255)

Disable write protection:
1. Initial state (POR): WP-Registers are Write-protected by password (EEPWE = 255)
Reference password is stored in the RAM mirror of EEPW (addrs C6h to C9h)
2. Enter the correct password PW (PW = EEPW) to unlock write protection (RAM addresses 39h to 3Ch)
3. Disable automatic refresh by setting EERD = 1
4. Disable password function by entering EEPWE ≠ 255 (RAM mirror address CAh)
5. Update EEPROM (all Configuration RAM → EEPROM) by writing 11h to EECMD
6. Enable automatic refresh by setting EERD = 0
7. Final state: WP-Registers are Not write-protected (EEPWE ≠ 255)

This sounds quite feasible to be the issue here.

After rebuilding a bitstream that fixed an unrelated bug preventing writing to registers on the RTC, I have been able to confirm that this procedure for un-write-protecting the RTC registers works:

10 A=$FFD7110
20 POKE A + $39, ASC("E")
30 POKE A + $3A, ASC("E")
40 POKE A + $3B, ASC("P")
50 POKE A + $3C, ASC("W")

This does indeed disable write-protection of the registers temporarily, allowing me to set the RTC time and date registers. (I believe I probably accidentally enabled write protection at some point.)

So next step is to update the EEPROM configuration registers. After some fiddling about, the following program seems to be able to set the PMU register for me:

   5 W=0.01
   10 A=$FFD7110
   15 SLEEP W
   20 POKEA+$39,ASC("E")
   25 SLEEP W
   30 POKEA+$3A,ASC("E")
   35 SLEEP W
   40 POKEA+$3B,ASC("P")
   45 SLEEP W
   50 POKEA+$3C,ASC("W")
   55 SLEEP W
   60 POKEA+$10,4:REM DISABLE EEPROM SHADOW RAM AUTO-REFRESH
   65 SLEEP W
   70 POKEA+$CA,0:REM DISABLE WRITE-PROTECT
   75 SLEEP W
   80 POKEA+$C0,$13:REM ENABLE TRICKLE-CHARGE OF SUPER CAP
   85 SLEEP W
   90 POKEA+$3F,$11: REM UPDATE EEPROM FROM SHADOW RAM
   95 SLEEP W
  110 POKEA+$10,0:REM RE-ENABLE EEPROM SHADOW RAM AUTO-REFRESH
  115 SLEEP W
  120 IFPEEK(A+$C0)=$13THEN PRINT "SUPER CAP CHARGING ENABLED"
  130 IFPEEK(A+$C0)<>$13THEN PRINT "SUPER CAP NOT ENABLED"

The SLEEP statements are to ensure the I2C memory mapping system of the MEGA65 has time to commit each write, before attempting the next. So if you ignore those, we see that I enter the default RTC password of EEPW, then stop the automatic refresh of the EEPROM shadow RAM, so that I can modify the shadow RAM contents for the EEPROM registers. Then I set the trickle-charge settings in the PMU register ($13 into register $C0), write it to the EEPROM, and then tidy up after myself.  Finally, I read from the EEPROM shadow RAM to confirm that the setting has taken effect.

Running it takes only a second or so, and then we see something like this:


Great -- so assuming I have correctly understood the settings to charge the super-capacitor, and that it all actually works, my super capacitor should now be being charged. This means, I should be able to turn the power off to the MEGA65 R4 board for a while (I purposely have no battery installed), and it should still retain the time and date. 

What I don't know, is how long it will take for the super-cap to charge up enough to work. Also, I don't know if I have really put the right settings into the PMU register to use the super-cap.  So while I give the super-cap some time to charge up, let's take a look at the PMU register bits and related things.

First, you can tell if the RTC lost power on boot, by checking bit 1 of $FFD711D. If set, then the RTC didn't have power. This can be cleared by writing to it, e.g., POKE $FFD711D,0.  

Okay, so I have done that. Let's power cycle, and see if it gets asserted still. But first, I think I'll flash this fixed bitstream for the MEGA65 R4 board, so that I don't have to keep loading it on power on.

Done. I've also read through the datasheet for the RTC, and it looks like putting either $13 or $23 into the PMU register should work. The difference between the two values affects only the way that it detects when to switch to backup power (level switching or direct switching modes). So far, neither seems to be effective, which makes me think that something is not allowing the capacitor to charge. I might need to pull the board out of the case, so that I can get to the pins of the super capacitor, and see what the voltage there is doing. 
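To see which bits distinguish the two candidate values, a quick Python check helps. This only shows the bit difference; which field those bits belong to should be confirmed against the datasheet.

```python
def differing_bits(a: int, b: int) -> list:
    """Return the bit positions (0 = least significant) where a and b differ."""
    diff = a ^ b
    return [bit for bit in range(8) if diff & (1 << bit)]


print(f"{0x13:08b}")               # 00010011
print(f"{0x23:08b}")               # 00100011
print(differing_bits(0x13, 0x23))  # [4, 5]
```

The two values differ only in bits 4 and 5, which fits the observation that only the backup switch-over mode changes between them; all the other bits are identical.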

Looking at the schematic, there are no components between the super capacitor and the RTC, so there doesn't seem to be any possibility for something outside of the RTC or super capacitor to be the problem. The RTC is ticking, so I am assuming either the super capacitor is dead (unlikely) or that I am somehow not programming it correctly (more likely).  So let's see if it is charging, then...

I've pulled the board out, and see 0.39V on the super capacitor. After a minute or so, it climbed to 0.40V. It's climbing by about 0.02V per minute. So this means that I probably do have it all set up correctly, and that it will just take a lot longer than I had originally anticipated for the super cap to get to a usable voltage (about 2V).  It should take about another 1 - 2 hours to climb to that level, so I'll leave it charging for a while and do some other stuff, and then come back to it and see if it has charged enough to keep time.

Yes, that was all it needed -- enough time to charge the capacitor enough to reach the voltage required by the RTC chip.  Based on this, I'd recommend leaving a MEGA65 R4 on all day or over-night to put enough initial charge into the super capacitor to allow the RTC to keep time.  

The next logical question is how long it can hold the time for when powered off.  To get an idea of this, I've just turned the machine off, and recorded 2.54V on the super capacitor at 14:15.  I'll check the voltage again in a couple of hours, and see how much it has dropped. From that, we should be able to come up with a fairly decent estimate of the RTC retention time when using just the super capacitor. After 121 minutes, it is down to 2.32V. This means ~(2.54 - 2.32) / 2 V / hour, i.e., about 0.11V / hour. Assuming fully charged will be around 4.5V, and the minimum viable voltage is 2V, 2.5V / 0.11V/hour = ~20 hours -- assuming that the charge consumption is constant regardless of voltage, which may not be the case.

Anyway, I'll set it charging overnight, so that I can see what the actual maximum voltage is, and then see how long that voltage takes to decay. The datasheet suggests "days to weeks" when using a super capacitor -- but without indicating the capacity of super capacitor that this range corresponds to.  My gut feeling is that the super capacitor will provide at most a few days, based on the data so far -- but we will see when I have fully charged it ...

So overnight it charged up to 4.40V. I am now going to let it discharge for a while, and see if it still drops at 0.11V / hour from full, or if it's a bit slower.  Power turned off to the MEGA65 R4 board at 12:41 on Sunday. I'll take some measurements whenever I come by my office and remember:

Sunday 13:21 - 4.39V

Sunday 14:31 - 4.38V

Sunday 15:36 - 4.37V

Sunday 16:54 - 4.35V

Monday 05:48 - 4.27V

Monday 17:23 - 4.22V

Tuesday 17:21 - 4.14V

At this rate, 0.08V/day, and a critical voltage of 2V, this suggests it might last 30 days.  Will keep an eye on it over coming days...

Friday 21:02 -- 3.94V

Well, it's still continuing at 0.08V per day, which is pretty good.  It's now close to a week since I turned it off, and it has a lot more time to go before it goes flat.

Tuesday 20:29 -- 3.74V

Yup, continuing at about that rate. After 9.5 days, it's gone from 4.40V to 3.74V = ~0.07V/day average over the full time.  I'll leave it running a few more days yet...

Sunday 19:20 - 3.55V

So that's just over 15 days, with a drop of 4.40V - 3.55V = 0.85V, or about 0.06V / day on average. Given the critical voltage of ~2V, this is pointing to a maximum retention without power of (4.4V - 2V) / (0.06V / day) = 2.4V / (0.06V / day) = 40 days. I'd allow some safety margin on that, and say that 3 to 4 weeks should be a reasonable bet.  
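The same estimate can be written as a small calculation. This is only a sketch of the arithmetic, assuming a constant (linear) discharge rate, which the measurements suggest is roughly but not exactly true; the function names are made up for illustration.

```python
def discharge_rate(v_start, v_end, elapsed_days):
    """Average voltage drop in volts per day between two readings."""
    return (v_start - v_end) / elapsed_days


def retention_days(v_full, v_min, rate_per_day):
    """Days until v_min is reached, assuming a constant discharge rate."""
    return (v_full - v_min) / rate_per_day


# Readings from the log above: 4.40V at power-off, 3.55V roughly 15 days later.
rate = discharge_rate(4.40, 3.55, 15.0)
# 2.0V is the approximate minimum voltage the RTC needs, as assumed in the text.
estimate = retention_days(4.40, 2.0, rate)
print(f"{rate:.3f} V/day, about {estimate:.0f} days to reach 2V")
```

Without rounding the rate to 0.06V / day the figure comes out a little above 40 days, which does not change the 3 to 4 week recommendation.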

Anyway, that's more than enough to confirm that the supercap works.

Friday, 30 June 2023

MEGA65R4 Bring-Up

Well, we have the first samples of the R4 of the MEGA65 PCB:


This will be the model of PCB that finds its way into the next batch, plus or minus any last-minute changes that might have to be made.

The R4 includes a number of nice improvements over the R3A that was used in the previous batch, including:

1. Digital video back-power protection through the addition of a digital video driver IC.

2. Removal of the Intel MAX10 2nd FPGA to simplify the design. This means that the Arrow-compatible JTAG programmer would also not be required with the R4 board, since it was only ever required for updating the MAX10 FPGA.

3. Improvement of the 3.5mm audio jack quality with the addition of a nicer DAC setup.

4. Addition of a 64MiB SDRAM, that will mean that more MiSTer cores can be ported to the MEGA65.

5. Joystick ports will be bi-directional, allowing use of the Protopad and similar devices.

6. RTC chip has been replaced by another that should give us none of the problems of the previous one.

7. The RTC now takes a CR2032 battery and has a super-cap for battery-less time-keeping, at least for a while at a time.

As with all changes to boards, we have to test all the changes and get things working. As the R4 has a lot of changes, there are a lot of FPGA pins that have been reassigned, so the first job was to go through and update the XDC file to reflect this.

With that done, VGA output, SD card, keyboard input were all quickly confirmed working, allowing the MEGA65 R4 board to work in a minimalistic manner.  

The digital video output requires a slight tweak, as we attempted to build the boards without some allegedly optional pull-up resistors. However, due to the exact design we used, we do in fact need those resistors. Those are getting retro-fitted onto my R4 sample board in the near future.  In the meantime, I will keep working on other parts of the board.

Next stop is the RTC: The IC is completely different to the previous one, so I need to change the I2C address in the code, add address decode logic for the R4 I2C peripherals generally, and then remap the RTC registers between the two ICs, so that it is backwards compatible.  Otherwise the Configure program, BASIC65 ROMs, and anything else that uses the RTC would need patching.

The register maps of the two ICs are as follows.

For the new one, the important registers are:

For the old one, it was:

 

What we care about for backwards compatibility is that the current time and date registers match up.  Everything else can remain different, as only very specialised software will need to touch those. For those registers we find:

Register 0 = Seconds = Register 1 on the new one
Register 1 = Minutes = Register 2 on the new one
Register 2 = Hours (with bit 7 indicating if 24 hour time) = Register 3 on the new one (which has no 24 hour time flag, as it is always in 24 hour clock mode)
Register 3 = Day of month = Register 5 on the new one
Register 4 = Month = Register 6 on the new one
Register 5 = Year-2000 = Register 7 on the new one
Register 6 = Day of Week = Register 4 on the new one
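To make the remapping concrete, here is a minimal Python sketch of the table above. It is not the VHDL used in the MEGA65 core, just an illustration of the lookup; the names are hypothetical, and the handling of the old chip's 24-hour flag in bit 7 of the hours register is deliberately left out.

```python
# Old RTC register number -> new RTC register number, as listed above.
OLD_TO_NEW_RTC_REGISTER = {
    0: 1,  # seconds
    1: 2,  # minutes
    2: 3,  # hours (the new chip is always in 24 hour mode)
    3: 5,  # day of month
    4: 6,  # month
    5: 7,  # year - 2000
    6: 4,  # day of week
}


def old_layout_view(new_registers):
    """Present the new chip's registers in the old chip's register order.

    new_registers is any sequence indexed by the new chip's register number.
    """
    return [new_registers[OLD_TO_NEW_RTC_REGISTER[old]] for old in range(7)]


# Example with made-up BCD values: register 0 of the new chip is the
# 100ths of a second counter, which the old layout has no slot for.
new_chip = [0x99, 0x30, 0x15, 0x12, 0x05, 0x30, 0x06, 0x23]
print([f"{value:02X}" for value in old_layout_view(new_chip)])
```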

The Grove external RTC support already has a mechanism for the rearrangement of RTC registers, so it should be possible to re-use that.

The new RTC also offers a 100ths of a second register in register 0, which it would be good to expose.

I have the RTC working reasonably now, so will turn my attention to the 64MiB SDRAM next.  To avoid long synthesis runs between each little fix, I want to use unit tests to test the SDRAM controller.  Ideally I would have a good VHDL model of the SDRAM itself, which I don't.  But I can probably make one that will be good enough.  

I'd also like to have a program that can take a VHDL file and auto-generate a VUnit VHDL test case template from that, to avoid having to do all the annoying boiler plate VHDL generation.  I've written these before, but none that are available open-source, so I'll do it from scratch again.  As an experiment, I'm going to use ChatGPT 4 to help write the tokeniser, as I always find that part boring and error-prone.  Using ChatGPT probably was faster, but was just as error prone, although the process was probably less boring!  

With the tokeniser, I can search VHDL source files for the entity definitions, and generate the lists of parameters required to instantiate the entity in the test case, and from that, auto-generate the boiler-plate for the entity (or entities) that I want to include in the VUnit test case, and then generate the VUnit test case boiler plate containing all that.  This will come in handy later for making unit tests that cover other parts of the MEGA65's VHDL.

I could probably just ask ChatGPT to generate a VUnit test case for a given entity, directly.  Maybe that's the easier approach here. I've used my 25 questions to ChatGPT 4 for the next 3 hours, so I tried doing it with ChatGPT 3.5, which was a fairly unmitigated disaster for making a complete test. However, it did a fine job when I pasted in two entity definitions for the SDRAM and SDRAM controller and asked it to instantiate them and connect them up appropriately. Well, by that I mean it compiles. It may well have connected them in a completely useless or insane manner.

At first glance, it might have done ok. We'll see. Actually, no. So now I am trying to use it to help me write a VHDL model for the SDRAM.  This is also producing VHDL code that kind of looks like it should be an SDRAM, but not quite. But it probably is still saving me some overall typing.

Well, that was an interesting experiment. In any case, I have now worked things to the point where I have the first unit tests with VUnit working, that check if the SDRAM initialisation sequence is given and received correctly using my model of the SDRAM and my SDRAM controller.  Now I am working on some tests for simple memory accesses. Once those are working, I can look at synthesising a fresh bitstream with it enabled, and see if the SDRAM talks to it.

Ok, bitstream has synthesised, but no sign of life when reading the SDRAM.  I'll go back to the simulation unit testing, and have a look at the waveforms generated when doing the reads and writes, and make sure that they look correct, in case I implemented the SDRAM model wrongly in a way that I haven't noticed yet.

Meanwhile, I am checking a couple of other things on the board:

First, I have confirmed Ethernet is still working, so that's one more sub-system ticked off.

Second, I want to get the bi-directional joystick behaviour working.  This should allow the use of the ProtoPad and similar fancy joysticks that have lots of buttons.  The MEGA65 R4 has some open-collector inverters that can be used to pull the lines low from the MEGA65 side when required. I have these set up to be activated when the DDR for a joystick line is set to output, and the value being written to $DD00 or $DD01 is a zero -- i.e., matching the CIA behaviour for output.  The read-side remains the same, as it does on the CIA, because it's just the case of an output driver pulling the line low, which the input port of the CIA (on the same pin) correctly sees as a dropped voltage.
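The rule for when a line gets pulled low can be sketched in a few lines of Python. This models only the logic described above (DDR bit set to output and data bit zero), not the actual VHDL; the function names and the 8-bit width are assumptions for illustration.

```python
def pull_low_mask(ddr, value, width=8):
    """Bits where the MEGA65 side pulls the joystick line low.

    A line is pulled low only when its DDR bit is 1 (output) and the
    value written to the port has a 0 in that bit.
    """
    all_bits = (1 << width) - 1
    return ddr & ~value & all_bits


def read_back(ddr, value, external_low_mask, width=8):
    """What the port reads: a line is low if either side pulls it low,
    otherwise the pull-up leaves it high."""
    all_bits = (1 << width) - 1
    low = pull_low_mask(ddr, value, width) | external_low_mask
    return ~low & all_bits


# Lower four lines are outputs, writing %0101: bits 1 and 3 get pulled low.
print(f"{pull_low_mask(0b00001111, 0b00000101):08b}")
# A joystick switch also pulling bit 4 low shows up on the read side.
print(f"{read_back(0b00001111, 0b00000101, 0b00010000):08b}")
```

Because the drivers are open-collector, setting the DDR bit back to input simply stops pulling, and the line floats back high through the pull-up.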

I seem to now have the joystick ports working for output as well as input, although some more extensive testing will be required to make sure I have the bits correctly ordered.

The next big annoying problem while I wait to get the pull-up resistors on the R4 board for the digital video output, is that BASIC 65 doesn't get to READY, as it seems to get stuck trying to read from the IEC bus.  It gets stuck in a loop checking for bit 6 of $DD00 to go high (the CLK pin), from what I can tell.  I've just double-checked the XDC file, to make sure I haven't messed up any of the IEC pin assignments on the FPGA, and they all look fine.  Time to look more closely at the plumbing for the IEC CLK line.

For each of the main IEC pins, we have 3 FPGA pins: One that reads the current voltage on the pin, one that enables the output driver, and one that sets the signal that is fed to the output driver.  This is technically a bit of an overkill, as it allows us to drive high, instead of just drive low, or go tri-state for reading.  Anyway, it means that for the signal on an IEC pin to be low, both the output driver has to be enabled, and the input to the output driver set to select a low voltage.

Now, looking at the R3 vs R4 schematics, the circuit for this has changed a little. It is now:

U20 takes the input from the IEC CLK pin (as SER_CLK) which is a 5V signal, and level-converts it to a 3.3V signal, F_SER_CLK_I.  The Output Enable (OE) signal that controls whether the F_SER_CLK_I pin is valid is active high, so that all looks fine. i.e., it seems like we should be ok there.  We also have a 4.7K pull-up to 5V, so it shouldn't read 0V if U15A at the bottom is not driving it low.  If F_SER_CLK_EN=0, F_SER_CLK_O=1, then we would expect SER_CLK to be driven low.  To allow it to float, F_SER_CLK_EN=1 should do the trick, since there is an inverter on the input of this line to U15A.  So why are we seeing the CLK line at 0V when I probe the physical port? 

Probably the best thing to do here is to make sure that the line can go high.  Resetting the MEGA65 does cause it to go high for a time, so things are ok at least partially. Ah, I think I see the problem. On the R3 board, this is the output logic for this line:


Note that the F_SER_CLK_EN line is using a non-inverted output enable line.  So I should just be able to invert the logic for F_SER_CLK_EN, and its equivalent for the other IEC lines, and it should fix the problem -- which it did.

Now to work on the improved 3.5mm analog audio output.  This is handled on the R4 by an AK4432VT DAC chip instead of a simple 1-bit digital output plus filter on the R3. This DAC chip uses I2S audio signalling, the same as the DAC used on the MEGAphone prototypes, so I'm hoping I can re-use most of what we have already made for that.

Working my way through the datasheet, and how we have it hooked up on the board:

1. Configuration mode is strapped to serial, rather than parallel (not yet sure what that means).

2. We want a sample rate <= 48KHz, so DFS0 and DFS1 must be zero, which is the default.

3. BICK frequency should be 2.8224MHz for 44.1KHz or 3.072MHz for 48KHz sample rate. 

4. MCLK frequency should be 11.2896MHz for 44.1KHz or 12.2880MHz for 48KHz sample rate.

5. For 512fs configuration, the 512fs clock should be 22.5792MHz for 44.1KHz or 24.576MHz for 48KHz sample rate.  256fs is not available at these sample rates, but it is for 88.2KHz and 96KHz sample rates.  Maybe we can use those sample rates instead of 44.1KHz or 48KHz?

6. Audio data is signed, MSB first, latched on rising edge of BICK.

7. There is an I2C interface that can be used to set left and right gain between +12dB and -infinity dB (= mute). $00 = loudest, $FF = mute. This interface has a "soft transition" feature, so gain changes shouldn't produce pops or clicks.

8. There is a hardware mute pin available, which has similar effect. This is connected on the MEGA65 R4 board, so we better not activate it by mistake! This line is active high, so we need to drive it low normally.

9. The power down line (connected to audio_powerdown_n) holds the DAC in reset while low.  Datasheet says that it should be brought low once and then released to ensure correct operation. It has to be held low for at least 800ns.

10. To avoid click on power up in (9), the mute pin can be applied before releasing the powerdown pin. That's probably worth doing.

11. To enable the I2C interface, $DEADDA7A has to be written as four separate transactions with /CS being released between each (see page 33 of the datasheet). We'll have to do that, as the DAC is strapped for serial configuration.

12. The software mute pin is also the /CS line for the I2C interface.

13. For the I2C interface, normal-speed 400KHz I2C is probably the best choice, given the conflicting information about 7MHz, 1MHz and 400KHz modes.

14. There are 6 I2C registers ($00 -- $05), but the register field is 16 bits long. Thus each write transaction consists of at least 32 bits, plus the usual ACKs between bytes that I2C uses. All 6 registers can be written as one sequential transaction.

15. The six registers are listed on page 38 of the datasheet. From those we can see certain default settings:

  15.1 DFS0/1 = 0, selecting either 44.1KHz or 48KHz, as previously described.

  15.2 ACKS (automatic clock recovery) is disabled.

  15.3 DIF2-0 = 110, selecting 32-bit MSB aligned format. This means "Mode 6" timing, which is described in figure 15 on page 23 of the datasheet. This basically means that BICK needs to be 64x the sample rate, to allow for 2x32bit (left and right) samples per sample period.  For 44.1KHz, this means 44.1KHz x 64 = 2.8224MHz, and for 48KHz it is 48KHz x 64 = 3.072MHz. This matches (3) above, which explains how those are calculated.

  15.4 TDM1-0 = 00, selecting stereo mode

  15.5 SDS1-0 = 00, selecting normal L1, R1 channels.

  15.6 SYNCE = 1, enabling clock synchronisation

  15.7 SMUTE is disabled, allowing audio to be produced

  15.8 ATS, DASD, DASL (various filter related things) have sensible default values.

  15.9 ATTL/ATTR = volume levels for left and right default to $18 = 0dB, which is sensible.

And that's all the registers: So if we don't want to adjust the default volume level, we can actually ignore the I2C configuration interface completely, it would seem... which after synthesising the design up with all the necessary plumbing in place is indeed the case: We can completely ignore the I2C configuration.
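As a quick cross-check of the clock figures in the list above, the frequencies are all simple multiples of the sample rate. The 64x factor for BICK comes from item 15.3; the 256x factor for MCLK is only inferred from the numbers in item 4 (11.2896MHz / 44.1KHz = 256), so treat it as an observation rather than something taken from the datasheet. The function name is made up.

```python
def dac_clocks(sample_rate_hz):
    """Clock frequencies in Hz for a given sample rate, as multiples of fs."""
    return {
        "BICK": 64 * sample_rate_hz,    # 2 channels x 32 bits per sample period
        "MCLK": 256 * sample_rate_hz,   # multiplier inferred from item 4 above
        "512fs": 512 * sample_rate_hz,
    }


for fs in (44_100, 48_000):
    in_mhz = {name: hz / 1e6 for name, hz in dac_clocks(fs).items()}
    print(fs, in_mhz)
```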

I must say, the audio quality of this DAC is _way_ better than the old method. It sounds really nice and crisp and clear to my wooden ears at least. Listening to MOD files on the MEGA65 now sounds really nice  -- not that it was bad before. Enough so that I'm currently uploading a pile of MOD files onto my MEGA65's SD card to listen to, while I debug the remaining stuff. We really need one of the MEGA65 MOD players to support play-lists soon...

I've also spent some time debugging the bi-directional joystick functionality, and that's mostly working, although some strange thing has come up that is stopping Solitaire for the MEGA65 from being able to read mouse clicks, which are just the same as fire button presses on a joystick.  This will require some investigation to figure out the root cause. I have the source for Solitaire on the MEGA65, as I wrote the 1351 mouse driver for it, so I might have a poke and a fiddle into this.

Ok, so the problem in Solitaire is actually the high speed of the MEGA65's CPU vs the response time of 5V circuits: It just takes a few clock cycles at 40MHz before the DDR change to input allows the joystick inputs to float back high. In the case of Solitaire, just one extra clock cycle was required. I might be able to claw that back by making the DDR effects asynchronous, rather than waiting for the next clock-cycle edge.  Apart from that, the issue is considered closed.

So now we are back to figuring out what I have wrong in the SDRAM controller.  I have just added some debug registers to the SDRAM controller, so that we can confirm that data from the controller is making its way through the MEGA65 to the expansion device controller, and from there, into the CPU. If not, then that needs to be fixed, before I can tell whether the SDRAM controller is actually working or not.  It seems to be at least partly working, because it is issuing data strobes to the expansion controller. I can tell because otherwise, as was the case before I plumbed it all together, the expansion device controller would time out waiting for a response, freezing the CPU for millions of cycles before giving up.

And nothing is visible. Looking into the expansion device controller, it looks like it requires the presentation of whole cache lines from the expansion RAM. Nope, it can be configured to need that, but wasn't. So time to find where the plumbing is broken. We'll start by disconnecting the read data from the SDRAM controller, and feeding in a fixed value to the rdata signal, to see if that gets read or not.  That will tell us whether the problem is up-stream or down-stream of the SDRAM controller.

Ok, so when I export a constant value, instead of real memory accesses, it is visible.

So now let's see whether a constant value that we pretend to read from the RAM is also visible. This will check if the RAM value export logic is working right, in particular, if the timing hand-over from the 162MHz to 81MHz clock domains is wonky or not. Well, in the process of doing that, I found two important things: 1. I hadn't actually connected the clock to the SDRAM in the R4 target. That will certainly not help ;) and 2. the rdata lines from the SDRAM controller were being set tri-state outside of a process, quite possibly overriding where they were being set within the controller process.  Both of those have now been fixed, and I'm synthesising the bitstream to test them.

Okay, so fixing those has my debug registers exported from the SDRAM controller visible, but accessing the SDRAM itself is still not working. But I can at least see if the controller thinks it is scheduling reads and writes via those debug registers now.

First up, I can see that the processor reads whatever was last read from the SDRAM controller, rather than the actually requested byte. This means that the data shows up "one read late".  For example, the SDRAM registers have the word "SDRAM" = 53 44 52 41 4D at $C000000. But when I read it, we get:

:0C000000:42534452414D00054242424242424242

i.e., one of those 42s from the end is read at the start (because I read this address block before), and our values are present, but not in the correct place. I'll have to figure out the cause of this, as it clearly won't be helping. That said, it shouldn't stop us being able to see written values being read back.

What we can also see in the above, is the "0005" part:

:0C000000:42534452414D00054242424242424242 

This is 2 debug registers that count the lower 8-bits of the number of reads and writes. It is showing $00 reads and $05 writes, because I tried to write 01 02 03 04 05 at $8000000 earlier.

So let's now read 16 bytes (=$10 bytes), and see what happens:

.m8000000
:08000000:42000000000000000000000000000000

.mc000000
:0C000000:53534452414D10054242424242424242

Two things to notice here: First, we see the $42 from the end of the register read showing up in the first position again. Second, we can see that the number of reads has jumped to $10, indicating that the SDRAM controller did see all the read requests.
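For anyone wanting to pick these dumps apart programmatically, here is a small Python sketch. It assumes the byte positions exactly as they appear in the dumps above, i.e., including the "one read late" shift, so the stale byte is first, followed by "SDRAM" and then the read and write counters. The function and key names are hypothetical.

```python
def parse_dump_line(line):
    """Split a ':ADDRESS:HEXBYTES' monitor line into (address, bytes)."""
    _, address, data = line.strip().split(":")
    return int(address, 16), bytes.fromhex(data)


def decode_sdram_debug(line):
    """Pull the debug fields out of a dump of $C000000, as displayed."""
    _, data = parse_dump_line(line)
    return {
        "stale_first_byte": data[0],              # left over from the previous read
        "signature": data[1:6].decode("ascii"),   # should read "SDRAM"
        "reads": data[6],                         # low 8 bits of the read count
        "writes": data[7],                        # low 8 bits of the write count
    }


before = decode_sdram_debug(":0C000000:42534452414D00054242424242424242")
after = decode_sdram_debug(":0C000000:53534452414D10054242424242424242")
print(before)
print(after)
```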

So, given that this "out by one" read problem might be enough to cause real reads from the SDRAM to be missed, depending how the buffering works, I should work to fix that.

Looking at how the old HyperRAM controller works, that holds the data strobe line for an extra cycle.  I don't really see how that would impact things in terms of reading the old value.  It feels more like the data ready strobe needs to be delayed by one cycle, so that the value can be assured to be available when required. I'll try adding that delay, and see how it goes.

Well, that's resulted in some progress.  It is now clear that some bytes are being written to and correctly read back from the SDRAM. However, now the debug registers in the SDRAM controller can't be read.  Also, the addresses being written and read aren't lining up, and only even numbered bytes are being read back correctly -- but whether this is because odd numbered bytes aren't being written or aren't being correctly read I can't yet tell. But at least I have signs of life from the SDRAM!

I need to check that I haven't accidentally got cross-domain clock relaxation between the SDRAM and expansion controller, as an effect of clock names having changed, as that would screw things up real bad. It's quite tricky to tell what is being done in this regard in the Vivado logs, as the clock names are not trivial to map.  Anyway, I've removed them all except for the Ethernet to CPU clock and 12.288MHz (audio sample clock) to CPU clock domains, as those should be the only required ones. If synthesis takes forever, I can look at what the remaining clocks are that need relaxation between them.

So, the good news is that removing all those other clock relaxations didn't result in longer synthesis time, or a broken bitstream.  Unfortunately, it still hasn't got the $C000000 registers readable -- those are still timing out. Which is odd, because that part of the logic shouldn't have changed.  I have made a simulation unit test case for that, so let's see if it is now failing -- which it is. So let's see what's happened there.

Well, quite how it was working before is a bit of a mystery, because I found a really silly bug in the $C000000 register access code.  The test now passes, so I'm resynthesising...

While that is running, I'm thinking about the other bugs I have seen in the SDRAM controller: Primarily that the bytes are read out one word later in memory from where they should be, and that writing odd numbered bytes seems to fail, or at best, write zeroes.

For the odd-bytes-are-written-as-zero bug, I am seeing some interesting things:

.s8000007 87

.m8000000
:08000000:87008000820084008700F07010204080

.s8000008 88

.m8000000
:08000000:88008000820084008800887010204080

.s8000009 89

.m8000000
:08000000:89008000820084008900880010204080

This sequence is giving me some more clues as to what might be going on. To see it more clearly, I wrote this little test program:

It clears out the first 16 bytes of SDRAM, and then writes values in progressively, to see what we read back after each:


We see again that the odd bytes are not written at all, and that the even bytes are pushed two to the right, and the first byte of each 8-byte block is whatever was written last.

So what I think is happening here, is that my SDRAM controller is using one too few cycles of latency when reading, thus reading (semi-)rubbish on the first word (remember the SDRAM has a 16-bit wide bus).  Then separately, the write mask for the upper byte is messed up, or the data being presented in the upper byte is being written as zeroes instead of the correct data -- one or the other.

The odd-byte-write bug I can see clearly: We assume 16-bit wide writes from the expansion controller, instead of checking if it's an odd-address 8-bit write.  I've now made a fix for that, which I will test, and then synthesise.
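To illustrate the idea behind the fix, here is a Python sketch of turning an 8-bit write into a 16-bit SDRAM word write with per-byte enables. It assumes that odd addresses land in the upper byte of the word, which is what the symptoms above suggest, and uses a made-up "1 = write this byte" convention for the enables; the real controller's signal names and polarity may well differ.

```python
LOWER_BYTE = 0b01  # hypothetical enable bit for data bits 7..0
UPPER_BYTE = 0b10  # hypothetical enable bit for data bits 15..8


def byte_write_to_word(address, value):
    """Map an 8-bit write to (word_address, 16-bit data, byte enables)."""
    word_address = address >> 1
    value &= 0xFF
    if address & 1:
        # Odd address: place the byte in the upper lane, leave the lower alone.
        return word_address, value << 8, UPPER_BYTE
    # Even address: place the byte in the lower lane, leave the upper alone.
    return word_address, value, LOWER_BYTE


for address, value in ((0x8000007, 0x87), (0x8000008, 0x88)):
    word_address, data, enables = byte_write_to_word(address, value)
    print(f"{address:07X} -> word {word_address:07X} data {data:04X} enables {enables:02b}")
```

Treating every write as a full 16-bit write is equivalent to always enabling both lanes, which clobbers the neighbouring byte.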

While possible fixes for that are synthesising, I think I'll work on the failing back-to-back write test of my SDRAM controller, and then on the simple cached reading.  Without any caching, reading from SDRAM to chip RAM is around 4.5MiB/sec, implying it is taking 9 clock cycles to perform a random read from the SDRAM. That sounds fairly plausible.

Ok, synthesis run has finished, and now all SDRAM reads are timing out. BUT it looks like the reads and writes are happening, more or less, and are visible in the correct addresses. So that's some positive progress. Upper 2 bits of data seem to be getting chopped, though. Let's see what happens in simulation with my unit tests, to see if we can't find and fix the problem there.

The bit trimming doesn't happen in simulation, so far as I can tell. After the latest bitstream synthesis my cache line stuff turns out to be working nicely, but the reads are still timing out. But with the cache, this now means reads happen in blocks of 8 (the size of one cache line) before freezing for a while until timeout occurs.

I think that might be caused by the data ready strobe being only 1x 162MHz cycle long, but it has to be caught by the 81MHz clocked expansion device bus. So I've now stretched it to last 2 cycles long, and will see if that helps. Meanwhile, I'm going to attack writing more tests for the SDRAM controller, including the cache I implemented, to make sure that it is not doing anything else detectably stupid.
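The reasoning behind stretching the strobe can be checked with a toy model. This is only a sketch: it assumes the 81MHz domain samples on every second 162MHz cycle with an unknown phase, and ignores setup and hold times and any synchroniser stages. All names are made up.

```python
def strobe_is_caught(start, width, slow_phase):
    """True if a strobe is high on at least one slow-clock sample point.

    Fast (162MHz) cycles are numbered 0, 1, 2, ... and the slow (81MHz)
    clock samples on the fast cycles where cycle % 2 == slow_phase.
    """
    return any(cycle % 2 == slow_phase for cycle in range(start, start + width))


def always_caught(width, starts=range(16)):
    """True if a strobe of this width is seen for every start cycle and phase."""
    return all(
        strobe_is_caught(start, width, phase)
        for start in starts
        for phase in (0, 1)
    )


print("1 fast cycle wide:", always_caught(1))
print("2 fast cycles wide:", always_caught(2))
```

A strobe that is one fast cycle wide is missed whenever it falls between two slow sample points, while one that is two fast cycles wide always covers exactly one of them.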

Thanks to a pile of help from Adam, one of the great folks behind the C64 and other cores for the MEGA65, I have managed to get the SDRAM basically working now.  A lot of it was tackling the fairly tricky timing to get the SDRAM working at 162MHz with the FPGA.  There are some remaining timing closure issues, but it is already working stably. That is, once I remembered to implement refresh of the SDRAM! 

The remaining problems now are all related to the slow devices module that interfaces the CPU to the expansion RAM. Basically, reading sometimes returns stale data, rather than fetching the correctly written data. Reading from a different 8-byte region of the SDRAM fixes this problem.  The issue isn't the SDRAM bugging out reading the 8-byte blocks, but rather seems to be the CPU and slow devices modules messing up somewhere.  

To debug this I am adding a pile of debug registers to the CPU to see when it is reading from the SDRAM, and when it is using the new direct cache line read mode I have implemented, that should make linear reads from the SDRAM much faster -- but for unknown reasons causes a speed _drop_ of some 500x.  So something is clearly going wrong with the signalling there.  But the other problems still occur, even when that feature is disabled, so whatever that problem is, it is presumably separate from whatever is causing the reading of incorrect stale data.

A bit of poking about shows that the following kind of sequence can occur:

First, we can read an address, and see a value in it:

.m8000808
:08000808:18191A1B1C1D1E1F0000000000000000

Then we write a different value into a different address, that is the same modulo 8: 

.s8000800 22

We can now read that address, and see that it is updated:


.m8000800
:08000800:221112131415161718191A1B1C1D1E1F

So far, so good.  Note that $8000808 contains $18 (the ninth byte in the dump above).

Now the weirdness comes, if we try to read from $8000808 directly:


.m8000808
:08000808:22191A1B1C1D1E1F0000000000000000

Note that we end up reading the most recently written value, rather than the correct value.

If we now ask for it again from the modified address, we read it correctly:


.m8000800
:08000800:221112131415161718191A1B1C1D1E1F

But asking again for $8000808 reads back the most recently written value again:


.m8000808
:08000808:22191A1B1C1D1E1F0000000000000000

And I think I finally found the cause:  When making any access to the expansion RAM, the last written address is updated, instead of only when a write is occurring.

I'm fixing this now, and am confident that it should be fixed (and also fix similar problems with the HyperRAM on the R3/R3A boards). And following synthesis, I can confirm that this is the case: All cache consistency issues are now fixed. This has resulted in the apparent speed of copying from slow to chip RAM dropping, because it was previously erroneously using a cached result when it shouldn't have been.  It's now more like 4.6MB/sec, or about 7MB/sec if the single pre-fetch mode is enabled -- but that _does_ still result in cache consistency errors.

So that just leaves the problem of the SDRAM going hundreds of times slower when enabling the cache line reading directly by the CPU.  This is totally weird, as it can only save time for the CPU, not cost more time.

I can only guess that the CPU is getting confused thinking that it has to wait for the slow device ready toggle to flip before it can proceed. So I have made sure that when reading from the cache line, that the accessing_slowram signal is cleared.  However, if that fixes it, then it is revealing that there is some other deeper problem, where wait_states_non_zero is being asserted, even though I clear it in the case of reading via the cache line. In fact, I copied the perfectly working code from the case where we read the single pre-fetched byte from the slow devices unit, which effectively is just an indirect way of doing the same thing (and thus has higher latency).

Well, a lot has happened since I wrote the above, including tracking down a fascinating bug that was caused by accidentally activating writes with prefetch during the prefetch after a write.  That caused the most fascinating problem with random corruption of an entire DRAM row.  This is interesting on a couple of fronts. 

First, it was quite an adventure and process of elimination to work out what was going on.  Because entire DRAM rows were being corrupted, I figured it had to be something to do with the DRAM row activation and/or precharging when closing a row. Initially, I thought it was not having enough latency cycles during row activation or precharging at the end. Then I was very happy, because my SDRAM model for my simulation tests helped to find the problem really quickly: I wasn't clearing the WRITE+PRECHARGE command during the latency cycles. My simulation model of the SDRAM is purposely quite paranoid, and aborts the simulation if it detects you are doing anything that might be invalid -- and it flagged the issue straight away.

Second, I had only just been reading about how DRAM can be used as a true random number generator. The method is to purposely not allow enough latency when opening a row, which triggers modest numbers of errors in that row.  What I was doing was causing lots of errors, possibly more than was reported in the paper.  I just dropped one of the authors a line to let them know what I discovered, in case it's of interest to them.

Anyway, with that fixed, that got it working.

The next step is to improve the performance by not closing the DRAM row after every access, so that we can avoid the row latency for linear reads.  This will also be kinder to the SDRAM, rather than basically implementing ROWHAMMER as the default mode of operation ;)  It would actually be interesting to see if it is prone to ROWHAMMER. 

Okay, that didn't take too much effort to implement.  Now the only remaining bug is if the SDRAM read cache-line feature is enabled (which is disabled by default because of this), occasionally a read request doesn't respond, causing a timeout of 2^23 clock cycles. I've reduced that to 32 clock cycles, so that when this happens, it doesn't cause the machine to freeze for a fraction of a second. 
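To put those timeout figures in perspective, here is the arithmetic as a sketch. The text doesn't say which clock the timeout counter runs on, so both the 81MHz expansion bus clock and an assumed 40.5MHz CPU clock are shown; either way 2^23 cycles is a noticeable fraction of a second, while 32 cycles is well under a microsecond.

```python
def timeout_seconds(cycles, clock_hz):
    """Duration of a timeout counted in clock cycles."""
    return cycles / clock_hz


# Which clock the counter uses is an assumption, so try both candidates.
for clock_hz in (40_500_000, 81_000_000):
    old = timeout_seconds(2**23, clock_hz)
    new = timeout_seconds(32, clock_hz)
    print(f"{clock_hz / 1e6:.1f}MHz: old {old * 1e3:.1f}ms, new {new * 1e9:.0f}ns")
```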

I've also made the choice between HyperRAM and SDRAM on the R4 selectable at synthesis time. So let's compare the performance of the two. Note that the HyperRAM implementation benefits from the fixing of the pre-fetch bug, thus HyperRAM performance is also improved over what we have seen previously on the R3:

First, with the HyperRAM. Note that there is actually no Trapdoor Slow RAM in this machine, so it is showing some slightly random numbers for that:

 
And now with the SDRAM:
 

We can see that it is faster across all measures -- especially copying between regions of the expansion RAM is now more than twice as fast, because the latency of the SDRAM is quite a bit lower than of the HyperRAM.

If I ever get the SDRAM cache-line bug fixed, then the speed of copying from slow to chip RAM will approximately double again, to around 16MB/sec, only 20% slower than copying from chip RAM to chip RAM -- but I've run out of puff right now to work on that.  My goal was really just to get the MEGA65 R4 board working, and confirm it has no known problems vs the R3A board, which I have achieved. 

This means that Trenz Electronic can now move forward to the next steps (of which there are still a few to go) of having the next batch of MEGA65s produced, which will be based on this new R4 PCB -- which we are hoping will come out later in 2023, probably Q3.