Monday 4 September 2023

Debricking MEGA65 R3/R3A Boards with De-programmed MAX10 FPGAs

We previously had one machine get into this state, and have now encountered a second.

The first happened when someone was updating the MAX10 bitstream to deal with the HDMI flicker problem, and the flashing of the MAX10 failed because there was a microSD card in the external slot.

The situation behind this second one I'm not sure of. In the end, it doesn't matter. We just need a way to easily fix the problem.

To understand what the situation looks like, we need to understand a little of how the MEGA65 R3/R3A works with two FPGAs.

The main FPGA of the MEGA65 is the Artix A200T Xilinx FPGA that actually runs the MEGA65 core.  That's the one that most folks are familiar with.

But there is a 2nd FPGA on the board, an Intel (previously Altera) MAX10 FPGA that does some helper tasks. It was added, somewhat ironically as it turns out, to make it impossible to brick the MEGA65, by allowing the MAX10 FPGA to load and flash a new bitstream to the Xilinx FPGA via the external SD card slot.

That functionality was never implemented in the MAX10 FPGA, because it turns out to be quite complex, and was never really needed, as there are a bunch of other ways to solve that problem, usually involving a JTAG adaptor on one of the MEGA65's internal JTAG headers.  However, it left a lingering problem, which has now popped up twice.

That problem is that the MAX10 FPGA is configured to disable its own JTAG interface if the main FPGA is running a valid bitstream.  I don't really recall the reasoning behind this, except perhaps that it was intended to operate in the opposite sense: Make sure JTAG is enabled on the MAX10 if the main FPGA is not running a valid program.  But it is the disabling of the MAX10's JTAG that is the root of this problem:  If flashing the MAX10 fails part way through, but the Xilinx FPGA has a valid bitstream stored in flash, then the MAX10 will not "boot", and can't be reprogrammed via JTAG because the Xilinx FPGA is running a valid bitstream.

If that were all there was to the problem, it would be easy to fix, because you could just use the utility menu on the MEGA65 to erase slot 0, or even just use JTAG to start loading a bitstream onto it, and then abort it, so that it remains "unprogrammed".  However, there are some nasty side-effects to the MAX10 being deprogrammed that make the situation much worse. 

Primarily, the MAX10 relays the JTAG interface of the Xilinx FPGA, so that it could do the backup programming that, as mentioned, was never implemented.  To make things worse, the keyboard and even the power enable for the joystick and cartridge ports also go via the MAX10 (so that the MAX10 recovery system could read the keyboard, making it easier to use).

So the net result is the MEGA65 will look like it is fine, but the keyboard and joystick ports won't work.

Fortunately, the SD card interface still works, and the default HYPPO hypervisor code on the MEGA65 allows updating itself from the SD card. This allows us to effectively run arbitrary code in Hypervisor context, even without working keyboard or joystick ports.  When the first instance of this occurred, we used this feature to load a special version of the flashing program for the MEGA65 that could be controlled using IO lines on the 34-pin internal floppy interface.  It was cumbersome, but it worked.

However, in trying to do it a second time, we have discovered we didn't really document it very well, and also, on reflection, it's a bit overcomplicated. All we need is a HYPPO program that causes the Xilinx FPGA to deprogram for long enough for flashing the MAX10 FPGA to occur.

This can probably be done by asking the Xilinx FPGA to switch to an empty flash slot, or some other similar technique. This would reduce the problem to just putting the special HICKUP.M65 file on the SD card, and inserting it.

I've tried doing this now, making a special "BRICKUP.M65" file that can be copied onto the SD card as HICKUP.M65, and causes the FPGA to deconfigure for a few seconds, before it finds the next valid slot.  This works by trying to boot from part way into slot 7.  The FPGA reads the flash at some speed, until it eventually wraps around to the start of the flash, and finds a valid slot. 

For some reason, in my case it ends up booting from slot 3, rather than slot 0, which makes me suspect that I have the math for the slot address wrong.  Each slot is 8 MiB long, so slot 7 should start at 7 x 8 MiB = 56 MiB = $3800000. I am telling the FPGA to start at $3810000, i.e., 64 KiB into slot 7, so it looks like it should be fine. It might just be some funny behaviour of the FPGA partially loading the other slots, and accidentally skipping over the valid slots in slots 0, 1 and 2.
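The slot arithmetic above can be double-checked from a shell. This is only a sketch: the variable names are made up for illustration, and it assumes nothing beyond the 8 MiB slot size and the $3810000 start address mentioned above.

```shell
#!/bin/sh
# Sanity check of the flash slot arithmetic (variable names are illustrative).
slot_size=$((8 * 1024 * 1024))        # each slot is 8 MiB
slot=7
slot_start=$((slot * slot_size))      # byte offset of the start of slot 7
boot_addr=$((slot_start + 0x10000))   # 64 KiB into slot 7

printf 'slot %d starts at $%X\n' "$slot" "$slot_start"
printf 'boot address      $%X\n' "$boot_addr"
```

So the address itself looks right, which points towards the FPGA's behaviour while scanning the flash rather than the math.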

Anyway, it keeps the Xilinx FPGA deconfigured for long enough for the keyboard to show seven sequences of the "ambulance light flashing", i.e., about 5 or 6 seconds.  Is that long enough?  Can we make it take longer?

Well, it turns out that it is enough, because we can program the MAX10 in ~1sec via JTAG.

So the sequence is as follows (with instructions for Linux primarily, but also info on how to do it under Windows).

If you prefer a visual explanation, the following video shows me doing it under Linux. We don't currently have a visual record of doing it under Windows, sorry.

Stage 1: Preparation

1. Copy BRICKUP.M65 onto the SD card, and rename it to HICKUP.M65

2. Power the MEGA65 off and turn it back on. Make sure that after briefly showing the MEGA65 boot messages, the monitor then goes blank for at least a few seconds.

3. Plug a TEI0004 or compatible JTAG adaptor onto J17 of the MEGA65, and connect the USB cable to a computer running Linux.

4. Make sure you have a MAX10 bitstream built, using the https://github.com/mega65/mega65-r2-max10 repository.

5. Using the program.sh file from that repository, make sure that the JTAG adapter is detected. It will, for now, say that no devices are connected to it. That's ok. You should see something like this:

mega65-r2-max10$ ./program.sh
1) Arrow-USB-Blaster [USB0]
  Unable to read device chain - JTAG chain broken

Error (213019): Can't scan JTAG chain. Error code 87.

Okay, you are now all set for the initial recovery phase, where we load a valid bitstream into the MAX10, but not yet into its flash. This will let you interact with the MEGA65 via JTAG, thus allowing us to deprogram the Xilinx FPGA for a longer period of time, which will then make it _way_ easier to reflash the MAX10.
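For step 1 of the preparation, a small helper along these lines might save a typo or two on Linux. It is a sketch only: the function name and its arguments are hypothetical, and the mount point of the SD card depends entirely on your system.

```shell
#!/bin/sh
# install_brickup SRC SD_MOUNT   (hypothetical helper, not part of any MEGA65 tool)
# Copies SRC (the BRICKUP.M65 file) to SD_MOUNT/HICKUP.M65.
# SD_MOUNT is wherever your system mounted the SD card.
install_brickup() {
    src=$1
    sd_mount=$2
    [ -f "$src" ] || { echo "missing $src" >&2; return 1; }
    [ -d "$sd_mount" ] || { echo "SD card not mounted at $sd_mount" >&2; return 1; }
    cp "$src" "$sd_mount/HICKUP.M65" || return 1
    sync   # flush writes before the card is removed
}
```

It would be called as install_brickup BRICKUP.M65 followed by the directory where the SD card is mounted.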

If you are using Linux, then use the following versions of steps 6 and 7 (step 9 also differs). Use the green-background instructions for Linux, and the blue-background instructions for Windows:

6. Turn the MEGA65 off, and on the Linux computer with the MAX10 firmware, type this command, but don't yet hit enter on it: sleep 3 ; ./program.sh

7. Hit enter at the same time as you turn the MEGA65 on. Wait up to 15 seconds for everything to happen.

Repeat (6) and (7) until the keyboard power light comes on, and you see output like this from program.sh:

mega65-r2-max10$ sleep 3; ./program.sh
1) Arrow-USB-Blaster [USB0]
  031820DD   10M08SA(.|ES)/10M08SC

Info: *******************************************************************
Info: Running Quartus Prime Programmer
    Info: Version 18.1.0 Build 625 09/12/2018 SJ Lite Edition
    Info: Copyright (C) 2018  Intel Corporation. All rights reserved.
    Info: Your use of Intel Corporation's design tools, logic functions
    Info: and other software and tools, and its AMPP partner logic
    Info: functions, and any output files from any of the foregoing
    Info: (including device programming or simulation files), and any
    Info: associated documentation or information are expressly subject
    Info: to the terms and conditions of the Intel Program License
    Info: Subscription Agreement, the Intel Quartus Prime License Agreement,
    Info: the Intel FPGA IP License Agreement, or other applicable license
    Info: agreement, including, without limitation, that your use is for
    Info: the sole purpose of programming logic devices manufactured by
    Info: Intel and sold by Intel or its authorized distributors.  Please
    Info: refer to the applicable agreement for further details.
    Info: Processing started: Sat Sep  2 13:48:19 2023
Info: Command: quartus_pgm -m jtag -o p;output_files/mega65-r2-max10.sof
Info (213045): Using programming cable "Arrow-USB-Blaster [USB0]"
Info (213011): Using programming file output_files/mega65-r2-max10.sof with checksum 0x0014479C for device 10M08SAU169@1
Info (209060): Started Programmer operation at Sat Sep  2 13:48:20 2023
Info (209016): Configuring device index 1
Info (209017): Device 1 contains JTAG ID code 0x031820DD
Info (209007): Configuration succeeded -- 1 device(s) configured
Info (209011): Successfully performed operation(s)
Info (209061): Ended Programmer operation at Sat Sep  2 13:48:20 2023
Info: Quartus Prime Programmer was successful. 0 errors, 0 warnings
    Info: Peak virtual memory: 427 megabytes
    Info: Processing ended: Sat Sep  2 13:48:20 2023
    Info: Elapsed time: 00:00:01
    Info: Total CPU time (on all processors): 00:00:00
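Since steps 6 and 7 often need repeating on Linux, the retry loop could be wrapped in a small script. The following is a rough sketch, not something from the mega65-r2-max10 repository: the function name, its three parameters and the default retry count are all assumptions, and success is detected simply by looking for the "Configuration succeeded" line that program.sh printed in the output above.

```shell
#!/bin/sh
# try_program_max10 [PROGRAM_CMD] [MAX_TRIES] [DELAY]
# Repeats steps 6 and 7 until the programmer reports success.
# The function and all three parameters are made up for this sketch.
try_program_max10() {
    program_cmd=${1:-./program.sh}
    max_tries=${2:-10}
    delay=${3:-3}
    attempt=1
    while [ "$attempt" -le "$max_tries" ]; do
        echo "Attempt $attempt: turn the MEGA65 off, then press Enter as you turn it back on."
        read -r _reply || true
        sleep "$delay"
        # Success is detected from the programmer's own log line.
        if "$program_cmd" 2>&1 | grep 'Configuration succeeded' >/dev/null; then
            echo "MAX10 configured on attempt $attempt."
            return 0
        fi
        attempt=$((attempt + 1))
    done
    echo "Gave up after $max_tries attempts." >&2
    return 1
}
```

Run from inside the repository directory, calling try_program_max10 with no arguments would use ./program.sh with a 3 second delay, matching the manual command from step 6. You still have to operate the power switch yourself on each attempt.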

If you are using a Windows based computer instead, use the following versions of steps 6 and 7:

6. Turn the MEGA65 off, and on the Windows computer with the MAX10 firmware, open the Quartus hardware manager, and select the .SOF file for the MAX10 firmware, similar to the following image:

7. Start mashing the start button near the top left of the display, while simultaneously turning the MEGA65 on. You might see a happy run indicated by Quartus, but probably not, because you are continuing to mash the start button. You will know when it has worked because the red LED in the lower-right corner of the MEGA65 will start pulsing, and the ambulance lights on the keyboard will stop, assuming that the MEGA65 eventually loads a valid core after some seconds.

 

NOTE: For reasons I don't fully understand, sometimes you have to try this process a few times before it will work, whether on Linux or Windows.  Leaving the MEGA65 off for a few seconds between attempts is probably a good idea. It may also help to disconnect the HDMI cable while the machine is powered off, to prevent HDMI back-powering from keeping either of the FPGAs partially powered.

8. Okay, so at this point, we have the MAX10 temporarily programmed. Now we can de-configure the Xilinx FPGA via a TE0790 JTAG adaptor connected to port JB1 of the MEGA65, with a command like this from a connected computer running Linux, with the MEGA65 tools installed:

$ m65 -q brickup.bit

The brickup.bit file is a special bitstream that has been purposely constructed to be invalid. This causes the Xilinx FPGA to never complete configuring, thus leaving the MAX10 JTAG interface active. The easy way to make your own brickup.bit file is to take any valid bitstream, e.g., mega65r3.bit, and then use a command like this:

$ dd if=mega65r3.bit of=brickup.bit bs=65536 count=1
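If you want a little more safety around that dd command, it can be wrapped so that it refuses to run on a file that is already 64 KiB or smaller (which would be copied whole, rather than truncated). This is a sketch under that assumption; the make_brickup name is hypothetical.

```shell
#!/bin/sh
# make_brickup SRC DST   (hypothetical helper)
# Writes the first 64 KiB of a valid bitstream SRC to DST, using the same
# dd invocation as above, and checks that the result really is truncated.
make_brickup() {
    src=$1
    dst=$2
    keep=65536
    [ -f "$src" ] || { echo "missing $src" >&2; return 1; }
    src_size=$(( $(wc -c < "$src") ))
    if [ "$src_size" -le "$keep" ]; then
        echo "$src is only $src_size bytes; nothing would be cut off" >&2
        return 1
    fi
    dd if="$src" of="$dst" bs="$keep" count=1 2>/dev/null || return 1
    # The output should be exactly 64 KiB long.
    [ "$(( $(wc -c < "$dst") ))" -eq "$keep" ]
}
```

For example, make_brickup mega65r3.bit brickup.bit gives the same file as the dd command above, ready to be pushed with m65 -q.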

You should now be back with ambulance lights on the keyboard. You can now do the final steps:

9. If on Linux, run the flash.sh program from the mega65-r2-max10 repository:

mega65-r2-max10$ ./flash.sh
1) Arrow-USB-Blaster [USB0]
  031820DD   10M08SA(.|ES)/10M08SC

Info: *******************************************************************
Info: Running Quartus Prime Programmer
    Info: Version 18.1.0 Build 625 09/12/2018 SJ Lite Edition
    Info: Copyright (C) 2018  Intel Corporation. All rights reserved.
    Info: Your use of Intel Corporation's design tools, logic functions
    Info: and other software and tools, and its AMPP partner logic
    Info: functions, and any output files from any of the foregoing
    Info: (including device programming or simulation files), and any
    Info: associated documentation or information are expressly subject
    Info: to the terms and conditions of the Intel Program License
    Info: Subscription Agreement, the Intel Quartus Prime License Agreement,
    Info: the Intel FPGA IP License Agreement, or other applicable license
    Info: agreement, including, without limitation, that your use is for
    Info: the sole purpose of programming logic devices manufactured by
    Info: Intel and sold by Intel or its authorized distributors.  Please
    Info: refer to the applicable agreement for further details.
    Info: Processing started: Sat Sep  2 13:51:45 2023
Info: Command: quartus_pgm -m jtag -o p;output_files/mega65-r2-max10.pof
Info (213045): Using programming cable "Arrow-USB-Blaster [USB0]"
Info (213011): Using programming file output_files/mega65-r2-max10.pof with checksum 0x027283AE for device 10M08SAU169@1
Info (209060): Started Programmer operation at Sat Sep  2 13:51:46 2023
Info (209017): Device 1 contains JTAG ID code 0x031820DD
Info (209060): Started Programmer operation at Sat Sep  2 13:51:46 2023
Info (209016): Configuring device index 1
Info (209017): Device 1 contains JTAG ID code 0x031820DD
Info (209007): Configuration succeeded -- 1 device(s) configured
Info (209011): Successfully performed operation(s)
Info (209061): Ended Programmer operation at Sat Sep  2 13:51:47 2023
Info (209024): Programming device 1
Info (209011): Successfully performed operation(s)
Info (209061): Ended Programmer operation at Sat Sep  2 13:51:56 2023
Info: Quartus Prime Programmer was successful. 0 errors, 0 warnings
    Info: Peak virtual memory: 427 megabytes
    Info: Processing ended: Sat Sep  2 13:51:56 2023
    Info: Elapsed time: 00:00:11
    Info: Total CPU time (on all processors): 00:00:01


9. If on Windows, use Quartus to flash the .POF file now.

10. Remove the SD card, and power the MEGA65 off and on again, and confirm that the keyboard power light comes on, and that you get no more ambulance lights. If not, you might need to repeat from step (6).

11. Delete the HICKUP.M65 file from the SD card, and reinsert it, and turn your MEGA65 off and on again.

Your MEGA65 should now be all happy and healthy again.