Tuesday 2 November 2021

Speeding up the MEGA65 flash menu

The MEGA65's flash menu that lets you write new cores into the flash is, shall we say, a little pedestrian in speed.  It takes close to 15 minutes to write a new core, which is really annoying.  

It has also become important for another reason: Trenz need a tool to flash the MEGA65 production boards, because for some unknown reason Vivado is refusing to flash the shiny new 512Mbit (64MB) flash chip that is going on the production machines.  They can't afford to spend 15 minutes flashing each machine.

I just timed Vivado flashing a bitstream, and it took 165 seconds = 2 minutes, 45 seconds, so that's our goal. 

Now, what is interesting is that Vivado is much slower than what the flash can do.  In theory, we can erase at around 500KB/sec, flash at around 1MB/sec, and verify back at >1MB/sec.  For an 8MB bitstream this gives us a theoretical time of 8MB/500KB/sec + 8MB/1MB/sec + 8MB/1MB/sec = 16 + 8 + 8 = 32 seconds.  That would be really nice if we could reach it.  But I'll be happy if we can just get down to 165 seconds or better, like Vivado does.
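As a quick sanity check of that arithmetic, here is a minimal sketch in C. The function and macro names are made up for illustration, and it assumes decimal units (1KB = 1000 bytes) so that the numbers divide as cleanly as they do in the estimate above.

```c
/* Rates quoted in the text, in bytes per second.
   Assumption: decimal units, i.e. 1KB = 1000 bytes, 1MB = 1000000 bytes. */
#define ERASE_BYTES_PER_SEC   500000UL
#define WRITE_BYTES_PER_SEC  1000000UL
#define VERIFY_BYTES_PER_SEC 1000000UL

/* Whole seconds needed to push `bytes` through one phase. */
unsigned long phase_seconds(unsigned long bytes, unsigned long bytes_per_sec)
{
    return bytes / bytes_per_sec;
}

/* Best-case erase + write + verify time for a slot of `slot_bytes`. */
unsigned long flash_time_estimate(unsigned long slot_bytes)
{
    return phase_seconds(slot_bytes, ERASE_BYTES_PER_SEC)
         + phase_seconds(slot_bytes, WRITE_BYTES_PER_SEC)
         + phase_seconds(slot_bytes, VERIFY_BYTES_PER_SEC);
}
```

Calling `flash_time_estimate(8000000UL)` reproduces the 16 + 8 + 8 figure for an 8MB slot.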

To improve from our current ~15 minutes = ~900 seconds down to 165 seconds, we have quite a bit of improvement to make.  Fortunately this should be fairly easy, as the root cause of the slowness is that we are using CC65 as our compiler, which produces slow code, and then bit-bashing the QSPI communications.  So getting it much faster than now should be quite straightforward.

But first, we need an easy way to test the flash program, because currently the QSPI flash is only accessible when in hypervisor mode. So I have made it so that dip-switch 3 now enables access to the QSPI flash from any mode. This should not normally be enabled, as it can cause your QSPI flash to get trashed. But for production of machines (and testing of my flash program speed-ups), it's fine.

With that out of the way, it was time to start implementing the QSPI speed up stuff.  I could in theory implement a complete QSPI controller in hardware, but that's a lot of work, and not really needed, because it is just the large transfers for reading and writing the flash to verify and program it that take by far the most time -- more than 90% in fact.

So instead I am just implementing hardware acceleration of exactly those operations.  The QSPI lines are routed through the SD card controller, which already has a nice buffer that I can re-use.  For some reason I use the "Q" nybble-based modes for reading from the flash, but single-bit ones for writing. It would be faster to use the Q mode for both, as it will reduce the time to write a byte from ~8x4 = 32 cycles down to 4 cycles per byte. But as the flashing itself takes ~1usec per byte, we will still be at 50% efficiency at least.  Somewhat similarly for the reading, we could run the QSPI flash at >40MHz, but that would require more work for even less gain -- especially since we still have some logic code from CC65 slowing things down as well.

For the commands to set up those transfers, we can also inline some of the functions to help things along a bit. A speed-up of about 2x to 3x was possible for some parts there, but that is still not enough to get us near 165 seconds.

To help track the improvements, I have improved the progress bars in the flash program to show the speed and time remaining using the RTC to do the timing.
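The speed and time-remaining figures boil down to a couple of divisions. The following is only a sketch of that calculation, not the code in the flash program: the function names are hypothetical, and it assumes the caller has already worked out the elapsed whole seconds from the RTC.

```c
/* Average throughput so far, in KB/sec (1KB = 1024 bytes here).
   Returns 0 until at least one second has elapsed, to avoid dividing by zero. */
unsigned long speed_kb_per_sec(unsigned long bytes_done, unsigned long elapsed_sec)
{
    if (elapsed_sec == 0) return 0;
    return (bytes_done / elapsed_sec) / 1024UL;
}

/* Estimated seconds remaining, assuming the average rate so far continues.
   Dividing by the rate (rather than multiplying remaining * elapsed) keeps
   the intermediate values inside 32 bits, which matters on a compiler like
   CC65 where unsigned long is 32 bits wide. */
unsigned long eta_seconds(unsigned long bytes_done, unsigned long bytes_total,
                          unsigned long elapsed_sec)
{
    unsigned long rate;
    if (elapsed_sec == 0 || bytes_done == 0) return 0;
    if (bytes_done >= bytes_total) return 0;
    rate = bytes_done / elapsed_sec;   /* bytes per second */
    if (rate == 0) return 0;
    return (bytes_total - bytes_done) / rate;
}
```

Because the estimate is based on the average rate since the start, it settles down quickly and does not jump around when a single page happens to be slow.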

In the process I also did some more work on improving the detection of the flash chip's parameters.  This helped quite a bit, because there are two different erase commands: one works on all pages of the flash but is really slow on most of them, while the one that is faster on most of the pages hangs on the others.  This is because the chip we are using has 32 x 4KB pages at the start, and then 64KB pages after that.
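To make that layout concrete, here is a rough sketch of picking the erase granularity from the address. The constants come straight from the "32 x 4KB, then 64KB" description above; the function names are invented for this example, and the actual erase opcodes are deliberately left out.

```c
/* Layout described in the text: 32 small pages of 4KB at the start of the
   flash, followed by 64KB pages. */
#define SMALL_PAGE_SIZE   0x1000UL
#define SMALL_PAGE_COUNT  32UL
#define LARGE_PAGE_SIZE   0x10000UL
#define SMALL_REGION_END  (SMALL_PAGE_SIZE * SMALL_PAGE_COUNT)  /* 0x20000 */

/* Size of the erase unit that contains `addr`. */
unsigned long erase_size_at(unsigned long addr)
{
    return (addr < SMALL_REGION_END) ? SMALL_PAGE_SIZE : LARGE_PAGE_SIZE;
}

/* Number of erase commands needed to cover [start, end).
   Each step jumps to the start of the next erase unit, so an unaligned
   start address does not cause drift. */
unsigned long count_erase_ops(unsigned long start, unsigned long end)
{
    unsigned long addr = start;
    unsigned long ops = 0;
    while (addr < end) {
        unsigned long size = erase_size_at(addr);
        addr = (addr & ~(size - 1UL)) + size;
        ops++;
    }
    return ops;
}
```

For an 8MB region starting at address 0, this gives 32 small erases plus 126 large ones.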

Getting the reading of data working with hardware acceleration was pretty quick and easy. I also found a horrible bug in the erase code that meant that it would erase all pages, even if they were already empty.  Together those two improvements have had a dramatic improvement on the erase performance: It is now down to 19 seconds when erasing a typical bitstream that is about 5.5MB of the 8MB slot size, i.e. an erase speed of >300KB/sec.  That's down from several minutes, so that's the first part of our victory.
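The "don't erase what is already empty" check is simple: erased NOR flash reads back as all $FF bytes, so a page that reads as nothing but $FF can be skipped. A minimal sketch, with a hypothetical function name, and assuming the page has already been read into a RAM buffer:

```c
#include <stddef.h>

/* Returns 1 if every byte in the buffer is 0xFF (i.e. the page it was read
   from is already in the erased state), otherwise 0. */
int region_is_erased(const unsigned char *buf, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++) {
        if (buf[i] != 0xFF) return 0;   /* found programmed data */
    }
    return 1;
}
```

The erase loop would then only issue an erase command for pages where this returns 0.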

I have also implemented the hardware SPI writing acceleration, but there is some bug with it at present which means that the same byte is being written over and over again, which I need to investigate. But given that I am writing the correct number of bytes, the speed should be about right.  And this is also greatly improved, now taking only 38 seconds, at around 164KB/sec. About half of that time is the actual fast SPI data write and the time for the flash to actually write to the non-volatile memory, so there is perhaps some scope for in-lining more stuff in the C code to speed it up a bit, but otherwise further improvements would require the Q mode writing. With both of those, it would probably be possible to get under 20 seconds for the writing phase, but honestly, 38 seconds is already fast enough to not feel annoying. The main thing is that the progress bar is continuously growing, and at a good speed.

So once I have the SPI writing bug fixed, we are looking at just under 1 minute to erase and write.  Verification should be at least as fast as erasing, so I'm hoping that we will be around 80 seconds -- that is, about 2x faster than Vivado, which is really nice!

Part of the bug is because I hadn't implemented 256 byte page writing, but rather only 512 byte page writing. That's fine for the 64MB flash chips on the production boards, but not for the existing 32MB flash in the R3 board I have here.  The errata for the 32MB flash said that you can write >256 bytes, but only the last 256 bytes will be written.  I have since fixed that, but without any visible improvement.
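Supporting both page sizes mostly comes down to never letting a single program command cross a page boundary. Here is a rough sketch of that chunking logic; the function names are hypothetical, and it assumes the page size is a power of two (as 256 and 512 are).

```c
/* Length of the next program command starting at `addr`, so that it never
   runs past the end of the flash page containing `addr`.
   Assumption: page_size is a power of two (e.g. 256 or 512). */
unsigned int next_chunk_len(unsigned long addr, unsigned long remaining,
                            unsigned int page_size)
{
    unsigned int room = page_size - (unsigned int)(addr & (page_size - 1UL));
    return (remaining < room) ? (unsigned int)remaining : room;
}

/* Number of program commands needed to write `len` bytes from `addr`. */
unsigned long count_page_programs(unsigned long addr, unsigned long len,
                                  unsigned int page_size)
{
    unsigned long ops = 0;
    while (len > 0) {
        unsigned int n = next_chunk_len(addr, len, page_size);
        addr += n;
        len -= n;
        ops++;
    }
    return ops;
}
```

With a 512-byte buffer, a 512-byte page size gives one program command, while a 256-byte page size gives two, which is the case the 32MB flash needs.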

What I am seeing is that the same byte is being written to the flash over and over again.  Sometimes it's $80, other times $00.  This says to me that bytes are probably being written, but that the bytes we are reading from the buffer may be wrong. So I might make a test that tries writing some known data and see how that goes.  That way I can also do it much faster, as I can erase the single page, write the known data, and then read it back.

Okay, so that confirms that we are writing exactly 256 bytes, but that all 256 bytes are being written with the same value, in this case $80.  I'll do a quick bit of simulation to check whether the SD buffer is being read out correctly to be written, as that strikes me as the most likely place to be borked.

Borkage duly found via simulation: I was reloading the byte from the buffer every bit, causing endless hilarity to ensue.  Now synthesising that, but slowed down by watching Shallan50k's Twitch stream with the music competition results, which was excellent (congratulations to @proton_fig for your great X Files tunes, and to the other entrants for their great tunes as well!). It's amazing just how much CPU it takes for the Twitch stream view. It was basically eating 75% or more of the CPU on my (admittedly 4 year old) i7 box.

On the up side, the simulation affirmed that the rest of the process looks to be behaving properly, so hopefully when the synthesis does complete, it will work. Which it did, after I eventually spotted and fixed some stupid bugs.

After that it was a case of fine-tuning various things, like reducing how often I update the progress bar. I also added a hand-written assembly routine for the verification step, as that is currently the slowest of all the actions, which is a bit silly given that erasing and writing have real work to do.

The end result is that writing a new core file to a slot can now be done in about 86 seconds -- i.e., about 1/2 the time that Vivado takes, as we can see in this screenshot:


Victory achieved!

Now to win the war, I need to back-port all those speed-ups and general improvements into the flash menu, and hope that it doesn't make it too big to fit in the bitstream... which I have also done.

To say that this makes the process of flashing a core file more pleasant is really an understatement. We have rather coincidentally gone from C64 datasette to disk drive loading times, and the impact feels just as profound: You can now flash a core without thinking about what you will do for the next 1/4 hour while it chugs away.

While further speed improvements are possible, they don't really feel necessary, given that the theoretical minimum time is something like 20 seconds, and it would be a lot of effort to claw back any of that extra minute -- but it is only one extra minute.

The only further improvement I am likely to make down the track is a utility that will allow safe reflashing of slot 0 using this new dip-switch 3 mechanism: The program will check that the FPGA has booted from slot 1 or 2, and thus be satisfied that slot 0 can be written over without bricking the machine, and only if that is the case will it attempt to flash.  But to make sure people don't leave switch 3 on all the time, which would allow malicious software to brick your MEGA65, I'll likely put an interlock into the hypervisor that requires you to press some key to continue booting if it is enabled, so that you don't forget.

So that's all that, really.
