Friday, 17 January 2020

Adventures in talking to the QSPI flash

I am getting closer to being able to communicate with the QSPI flash, so that we can have the MEGA65 update its own bitstreams in the field. To recap the current situation:

1. Most of the QSPI flash's signals are easy to connect to, except the clock, which is normally driven by the FPGA's configuration logic.

2. The FPGA has a facility, the STARTUPE2 component, that allows the running bitstream to take control of this signal.

3. I have managed to achieve (2) in a test bitstream, as confirmed by my new JTAG boundary scan setup.

4. But I haven't got it working for a real bitstream.

To get to this point, from the last blog post, I discovered that the STARTUPE2 component *must* be in the top level of a design.

The question is now why in the real bitstream, it still isn't working, even though I have moved it to the top level.

Basically it works in the pixeltext test target, which lacks an M65 computer, but not in the nexys4ddr-widget target. Weirder still, when I removed the M65 computer component from this second target, it still wasn't working.

This makes me suspect that there might be some kind of target setup in the Vivado project that is to blame. There is a "persist" flag that can be used, which causes the configuration clock to remain active on the QSPI clock pin.  That could be the problem -- but then I would still be expecting to see the line waggle, which it doesn't seem to.

However, digging further, I did manage to control the line with the M65 computer component taken out of the real bitstream.  Now trying to put it back in, but with a dedicated 1Hz clock on the pin, so that I can eliminate internal problems in the plumbing of the line to the register I had it hooked up to.  Basically I can keep pushing the connection deeper down into the design, until it is in the component where I was controlling it.

Ok, so with the full machine core, and the 1Hz clock in the outer layer, I can control the clock line. Next step is from in the sdcardio.vhdl file where it gets connected to, to see if I can toggle it there under automatic control.  If that works, then I must have some subtle bug in the register plumbing. If not, then the plumbing problem must be between sdcardio.vhdl and the outer layer of the design. Either way, I will be able to considerably narrow down where the problem can be hiding.

So, the clock toggles, meaning the problem is probably in sdcardio.vhdl somewhere...

Okay.... So, this is one of those funny bug fixes that I really hate. It could well be that I have done something really stupid, but if so, I am ignorant of what it is.  But the solution was to create a 2nd register to control the QSPI clock at $D6CD.  With that implemented, magically $D6CC works to control the clock.  I've had this kind of problem before with VHDL, where possibly something is incorrectly optimising out the ability to write to some signal.  Anyway, it is solved for now.

Then I started trying to investigate things, and came to the rapid conclusion that my life would be so much nicer, if I could make my new JTAG boundary scanner produce industry-standard VCD files that I could view in gtkwave, to get a more effective understanding of what is going on.  So I did. It wasn't too hard, and now I can produce pretty pictures like this:

Which is helpfully showing me that I can waggle the clock line, and also control the CS (chip select) line, but that the data lines are seemingly not doing anything.  But I know from prior experimentation that I can indeed control these lines, so this is probably an example of me having an error in my test program.  But how nice it is to be able to determine that in just a few seconds :)
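For anyone who hasn't met VCD before: it is a plain-text format, so a logger only has to write a short header and then one line per signal change. Here is a minimal Python sketch of the idea. It is not the code in monitor_load; the function name and the layout of the samples are made up for illustration, and it only handles 1-bit signals.

```python
def write_vcd(signal_names, samples, timescale="1 us"):
    """Return VCD text for a set of 1-bit signals.

    samples is a list of (time, [bit, bit, ...]) tuples, one bit per signal,
    in the same order as signal_names. Times are non-decreasing integers.
    """
    # VCD identifier codes are printable ASCII characters, starting at '!'
    ids = [chr(33 + i) for i in range(len(signal_names))]
    out = [f"$timescale {timescale} $end", "$scope module pins $end"]
    for ident, name in zip(ids, signal_names):
        out.append(f"$var wire 1 {ident} {name} $end")
    out += ["$upscope $end", "$enddefinitions $end"]

    previous = [None] * len(signal_names)
    for time, bits in samples:
        # Only signals whose value changed since the last sample are written
        changes = [f"{bit}{ident}"
                   for ident, bit, old in zip(ids, bits, previous)
                   if bit != old]
        if changes:
            out.append(f"#{time}")
            out.extend(changes)
        previous = list(bits)
    return "\n".join(out) + "\n"


# Three samples of a clock and a chip-select line
vcd = write_vcd(["qspisck", "qspicsn"],
                [(0, [0, 1]), (1, [1, 1]), (2, [0, 0])])
```

Writing the returned text to a file gives something gtkwave can open directly.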

Digging through this, I fixed the initial problem, but also found I had the SO and SI lines switched around from the way they should be, so that will need a resynthesis...  Well, then I wasn't so sure, so I made it so that the four data lines are open-collector with internal pull-ups in the FPGA. This means that the lines can be either driven low, or float high.  This means I can fiddle with which line is which etc, without having to resynthesise each time.

However, I am seeing some quite weird things with the data lines when I look at the JTAG traces:

So let me explain what we have here.  Because I was seeing weird things, I made a test program that tries every possible value on the four data lines, CS and clock pins to the QSPI flash.  The open-collector operation means that the direction pins (the .ctl pins in the lower half) basically indicate what we *should* be seeing on the actual pins (in the top half).  This holds true for QspiDB[2], QspiDB[3], QspiCSn and the clock, but not for QspiDB[1] and QspiDB[0]: These two pins switch a short time later.  This would only make real sense, if the QSPI flash was pulling those lines down (remember, open-collector outputs "float" high, so any device connected to them can pull them down to ground), or there is something really fishy going on with the FPGA control of those pins.  I now need to try to solve this riddle.
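To make the open-collector reasoning concrete, here is a tiny Python model. It is purely illustrative (the function name is invented): the line reads low if anybody pulls it low, and only floats high when every device has let go.

```python
def open_collector_level(*pulling_low):
    """Level seen on an open-collector line that has a pull-up.

    Each argument is True if that device is actively pulling the line low.
    """
    return 0 if any(pulling_low) else 1


# (FPGA pulls low?, flash pulls low?) -> level seen on the pin
for fpga, flash in [(False, False), (True, False), (False, True)]:
    print(f"FPGA pulls={fpga!s:5} flash pulls={flash!s:5} "
          f"-> pin reads {open_collector_level(fpga, flash)}")
```

The last case is the interesting one: the FPGA has released the line, but the pin still reads low because another device is holding it down.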

Let's look first at FPGA control of the pins as a potential cause. As the other pins don't exhibit this strange behaviour, and the four DB pins are all controlled in an identical manner, I find it hard to believe that the problem is there.  That leaves the QSPI flash as the current primary suspect.

First stop: Check the schematics.  Nothing sinister here on the Nexys4DDR boards: the QSPI flash is directly connected to the FPGA, with only some external pull-up resistors, which can't cause this funny problem I am seeing.
So that suggests it is most likely just the way that I am communicating with the QSPI flash.

Poking around, it seems that DB0 only changes (or is only changeable) when  CS is high. This makes sense, as when CS is high, the QSPI flash is not active, and so shouldn't be trying to drive any lines. When it is low, then DB1 stays tied low.  This makes me 99% sure that DB1 is the line from the QSPI to the FPGA, and DB0 is the command line from the FPGA to the QSPI.

This means, in theory at least, that I should be able to talk to the QSPI flash, if I drive the correct waveform. However, so far at least, there are no signs of active response from the QSPI flash.  And looking at the trace, here we see this weird problem again: The DB0 signal stays low for one clock tick longer than it is being pulled low:

This is really weird. I can slow the clock down even more (it's currently less than 1kHz, anyway) to the point where it looks much better, but this feels altogether wrong: The FPGA can read out its bitstream from this QSPI interface at 66MHz, so ~660Hz should be absolutely no problem!  The 1.8kOhm pull-ups should be able to pull these lines high in <1 microsecond, but we are seeing rise (or delay) times of >1 millisecond -- a thousand times slower.
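As a back-of-the-envelope check on that "<1 microsecond" figure, here is the RC arithmetic in Python. The 1.8 kOhm value is from the schematic; the 50 pF line capacitance is purely an assumption for illustration, since I have not measured it.

```python
import math


def rise_time_10_90(r_ohms, c_farads):
    """10%-90% rise time of a simple RC node: t = R * C * ln(9), about 2.2 RC."""
    return r_ohms * c_farads * math.log(9)


R_PULLUP = 1.8e3   # 1.8 kOhm pull-up, from the schematic
C_LINE = 50e-12    # ASSUMPTION: 50 pF of pin plus trace capacitance

expected = rise_time_10_90(R_PULLUP, C_LINE)
observed = 1e-3    # "more than 1 millisecond", as seen in the trace

print(f"expected rise time: {expected * 1e9:.0f} ns")
print(f"observed is at least {observed / expected:.0f} times slower")
```

Even with a generous capacitance, a 1.8 kOhm pull-up lands well under a microsecond, so a millisecond-scale delay cannot be explained by that resistor alone.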

This bizarre delay occurs whether the QSPI flash is selected via the CS line, or not.  This would seem to suggest that it is not the QSPI flash to blame -- unless it is in some strange mode following the FPGA configuration process. 

Ok, looking again at the schematic, there are indeed 1.8K pull-ups on the DB2 and DB3 lines, but not on DB0 or DB1. This means that it is possible that running these lines open-collector might not be practicable. So I resynthesised with the ability to push those lines actively high, as well as pull them low, or tri-state them, as before.  Now by actively pushing them, they respond immediately, as expected. So now I can send a byte via the SPI interface, and it all looks right:

Of course, it still isn't working. But that could be because I just realised I am sending the bits least-significant-bit first, instead of most-significant-bit first. And indeed, that suddenly gets it responding to me!
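The bit-order fix is easy to show in a few lines. This is a Python sketch rather than the actual test program, and `set_pins` is a hypothetical callback standing in for the register writes that drive the clock and data-out pins. It assumes the common SPI arrangement where the flash samples data on the rising clock edge.

```python
def spi_bits_msb_first(byte):
    """Yield the 8 bits of byte in the order the flash expects: bit 7 first."""
    for bit in range(7, -1, -1):
        yield (byte >> bit) & 1


def spi_send_byte(byte, set_pins):
    """Bit-bang one byte. set_pins(clock, data) is a HYPOTHETICAL callback."""
    for bit in spi_bits_msb_first(byte):
        set_pins(0, bit)   # present the bit while the clock is low
        set_pins(1, bit)   # rising edge: the flash samples the bit
    set_pins(0, 0)         # leave the clock idle low


# Record what would appear on the pins when sending 0x9F (the JEDEC RDID command)
trace = []
spi_send_byte(0x9F, lambda clk, data: trace.append((clk, data)))
sampled = [data for clk, data in trace if clk == 1]
```

Counting from 0 up to 7 in the first loop instead would reproduce the least-significant-bit-first mistake.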

Now we're finally getting somewhere :)  Again, I am so glad I implemented this VCD logger and JTAG boundary scan stuff.

Of course I could have just figured out how to do it from in Vivado, but it's so much nicer to have a little light-weight and open-source tool.  Also, by having it integrated in monitor_load, I can do multiple things all in one quick action.  Here is how I run the test program, and then ask monitor_load to sample those pins -- all in one single command:

make src/tests/qspitest.prg && src/tools/monitor_load -F -4 -r src/tests/qspitest.prg -V log.vcd -J src/vhdl/nexys4ddr-widget.xdc,${HOME}/build/artix7/public/bsdl/xc7a100tl_csg324.bsd,qspisck,qspicsn,qspidb[3],qspidb[2],qspidb[1],qspidb[0]

Okay, so it's a bit of a long command, but that's what pressing the up arrow in a shell is all about, so you can just use it again and again, without having to re-type it.

When that command has logged the pins for long enough, I just hit control-C, and then launch gtkwave on the resulting log.vcd file, with a little tiny script that tells it to automatically show all signals:

gtkwave -S allsigs.tcl log.vcd

So the whole work-flow is now super easy and efficient.

But anyway, back to figuring out why the test program doesn't read the data from the SPI response correctly... It's currently reading all ones, i.e., not noticing when the DB1 line goes low. Adding a short delay fixes this. Not entirely sure why. But with that, I can finally read some useful things out of the chip, and display them:

QSPI DEVICE ID = $2018                 
RDID BYTE COUNT = 77                   
TORS WITH 64KB SECTORS.                
PART FAMILY IS 8000                    
 01 80 30 30 80 FF FF FF               
 FF FF FF FF 51 52 59 02               
 00 40 00 53 46 51 00 27               
 36 00 00 06 08 08 0F 02               
 02 03 03 18 02 01 08 00               
 02 1F 00 10 00 FD 00 00               
 01 FF FF FF FF FF FF FF               
 FF FF FF FF 50 52 49 31               

I confirmed with the data sheet that these data are broadly sensible.  So the next step will be to extract all the relevant data out, e.g., the information I need to programme the device, and after that, to implement simple block read, erase and write functions... Which turned out to be remarkably painless, if rather boring internally.  The more exciting part will be in the next post, where I (hopefully) actually implement writing of bitstreams to the QSPI flash.
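As an example of pulling one useful field out of that dump, here is a Python sketch that finds the CFI "QRY" marker and reads the device size. It assumes the usual CFI layout, where "QRY" sits at offset $10 of the ID space and the byte at offset $27 is the device size as a power of two. The offsets are my reading of the data sheet, so treat them as assumptions.

```python
# The 64 bytes shown above, row by row
rows = [
    "01 80 30 30 80 FF FF FF",
    "FF FF FF FF 51 52 59 02",
    "00 40 00 53 46 51 00 27",
    "36 00 00 06 08 08 0F 02",
    "02 03 03 18 02 01 08 00",
    "02 1F 00 10 00 FD 00 00",
    "01 FF FF FF FF FF FF FF",
    "FF FF FF FF 50 52 49 31",
]
dump = bytes.fromhex(" ".join(rows))

CFI_QRY_OFFSET = 0x10    # ASSUMPTION: standard CFI position of "QRY"
CFI_SIZE_OFFSET = 0x27   # ASSUMPTION: device size = 2**N bytes

qry = dump.find(b"QRY")  # position of the marker within this dump
size_exponent = dump[qry + (CFI_SIZE_OFFSET - CFI_QRY_OFFSET)]
flash_bytes = 1 << size_exponent

print(f"QRY found at dump offset {qry}")
print(f"device size: 2^{size_exponent} = {flash_bytes // (1024 * 1024)} MiB")
```

Searching for the marker rather than hard-coding its position means the sketch does not care how many leading ID bytes the dump skipped.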

Thursday, 9 January 2020

Programming the Bitstream Boot Flash and all things JTAG

So, in the last post, I implemented the ability to tell the MEGA65 to switch to a different bitstream. The next challenge is to make it possible for the MEGA65 to be able to re-program the contents of the flash memory, so that we can supply people with an  updated bitstream, and make it super-easy to upgrade the MEGA65.

First piece of detective work was to realise that we can take a .bit file, remove the 120 byte header, and write it directly to the flash somewhere, and it should Just Work (tm).
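In Python, the idea looks something like the sketch below. The 120-byte default is just the figure from above and may well differ between .bit files, so the function also checks that the configuration sync word (AA995566) shows up near the start of what is left. The function name and the 256-byte search window are my own inventions.

```python
SYNC_WORD = bytes.fromhex("AA995566")


def strip_bit_header(bit_file, header_len=120):
    """Drop the .bit header and return the raw configuration data.

    header_len defaults to the 120 bytes mentioned above; that is an
    ASSUMPTION and may not hold for every .bit file.
    """
    body = bit_file[header_len:]
    # Sanity check: the sync word should appear near the start of the body
    if SYNC_WORD not in body[:256]:
        raise ValueError("no sync word near the start; wrong header length?")
    return body


# Fake file: 120 header bytes, some 0xFF padding, the sync word, then data
fake = bytes(120) + b"\xff" * 16 + SYNC_WORD + b"\x20\x00\x00\x00"
raw = strip_bit_header(fake)
```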

So now I need to be able to talk to the SPI boot flash. This is a bit tricky, because the FPGA boot process controls the clock line to this device. Fortunately, there is a way to put this back under control of the VHDL:  Basically you use this slightly magic STARTUPE2 thing, and feed it a clock:

    STARTUPE2_inst: STARTUPE2
    generic map(PROG_USR=>"FALSE", --Activate program event security feature.
                                   --Requires encrypted bitstreams.
                SIM_CCLK_FREQ=>0.0 --Set the Configuration Clock Frequency(ns) for simulation.
    )
    port map(CFGCLK=>CFGCLK,--1-bit output: Configuration main clock output
             CFGMCLK=>CFGMCLK,--1-bit output: Configuration internal oscillator
                              --clock output
             EOS=>EOS,--1-bit output: Active high output signal indicating the
                      --End Of Startup.
             PREQ=>PREQ,--1-bit output: PROGRAM request to fabric output
             CLK=>CLK,--1-bit input: User start-up clock input
             GSR=>GSR,--1-bit input: Global Set/Reset input (GSR cannot be used
                      --for the port name)
             GTS=>GTS,--1-bit input: Global 3-state input (GTS cannot be used
                      --for the port name)
             KEYCLEARB=>KEYCLEARB,--1-bit input: Clear AES Decrypter Key input
                                  --from Battery-Backed RAM (BBRAM)
             PACK=>PACK,--1-bit input: PROGRAM acknowledge input
             USRCCLKO=>spi_clock,--1-bit input: User CCLK input
             USRCCLKTS=>USRCCLKTS,--1-bit input: User CCLK 3-state enable input
             USRDONEO=>USRDONEO,--1-bit input: User DONE pin output control
             USRDONETS=>USRDONETS--1-bit input: User DONE 3-state enable output
    );

The important bits here are the USRCCLK and USRDONE signals.  Basically the first pair of signals let us control the clock to the SPI flash, while the second lets us control the DONE signal, which the FPGA normally outputs high when it is configured.  We just have to keep that one behaving normally, since the MAX10 FPGA depends on it.

When I first attempted to implement this, the system failed to come up.  After a lot of poking around and inadequate documentation from Xilinx, I found this project, that actually showed a working instantiation. From there it wasn't long, before I at least had a working bitstream.

It's actually likely to be helpful for the rest of this part, as well, because it actually does everything that I want: i.e., it allows programming of a connected QSPI flash memory. I'm glad to have finally found some source code that I can look at when I get stuck, to see how others have solved the same problems.

So in theory at this point, I have a bitstream with working ICAPE2 for bitstream switching, and now, a bit-bashing interface that *should* allow me to talk to the QSPI flash.  So I started writing a little test program for that, that basically tries to read some device information from the QSPI chip.

So, not entirely surprisingly, the test program doesn't work, in that it doesn't return the device ID.  If the pins for the QSPI flash chip were exposed on the PCB, I'd be able to stick my oscilloscope on them, and waggle them in software, and make sure that everything is correct.  However, as both the FPGA and QSPI flash are BGA parts with no exposed pins, there is no such possibility.

It should be possible, however, to use JTAG debugging tools to read the pin status of every pin on the FPGA.  The trick is how to do this easily from command line on linux.

The UrJTAG package provides the jtag command that *should* be able to do this.  After some hunting for info, the following should work to detect a MEGA65 connected via the USB debug cable:

jtag> cable FT2232 vid=0x0403 pid=0x6010

Then the detect command should show something connected, like this:

jtag> detect
IR length: 6
Chain length: 1
Device Id: 00010011011000110001000010010011 (0x13631093)
  Manufacturer: Xilinx (0x093)
  Unknown part! (0011011000110001) (/usr/share/urjtag/xilinx/PARTS)

That's looking good, except that the Artix7 FPGA is not in the part list.

There is, however, a newer version of UrJTAG, that has been patched to support the Artix7 series, and even has a boundary scan file for at least one version of the chip -- that should allow us to map the JTAG output to actual pins, which will be very helpful for us. Unfortunately, the pre-built package for Ubuntu lacks this, so I need to build it from scratch.

Building UrJTAG is proving interesting, because it needs ftd2xx.h, which I can't figure out which package on Linux provides. It looks like it might come from here: You have to copy the include files from the release/ directory into the build directory for UrJTAG, and then it seems to build.

So, building UrJTAG is a bit of a pain. The "make install" script basically doesn't work, so you have to do all that yourself.  With the new jtag binary, I now get this:

jtag> cable FT2232 vid=0x0403 pid=0x6010
Connected to libftd2xx driver.
jtag> detect
IR length: 6
Chain length: 1
Device Id: 00010011011000110001000010010011 (0x13631093)
  Manufacturer: Xilinx (0x093)
  Part(0):      xc7a100t (0x3631)
error: Unable to open file '/usr/local/share/urjtag/xilinx/xc7a100t/STEPPINGS'
  Unknown stepping! (0001) (/usr/local/share/urjtag/xilinx/xc7a100t/STEPPINGS)

So, that's a step forward, but I have no idea yet where to get this STEPPINGS file from, or if it really is necessary. Ah, that was also just a problem with the install script not working. After manually copying the data/ directory's contents into /usr/local/share/urjtag, it works:

jtag> detect
IR length: 6
Chain length: 1
Device Id: 00010011011000110001000010010011 (0x13631093)
  Manufacturer: Xilinx (0x093)
  Part(0):      xc7a100t (0x3631)
  Stepping:     1
  Filename:     /usr/local/share/urjtag/xilinx/xc7a100t/xc7a100t-csg324

This is all very nice, except that it thinks it is the 324 pin part, not the 484 pin part that is actually in the MEGA65 R2 PCB.  It seems that UrJTAG might not support multiple variants of the same part, which is a bit annoying.

The first step, though, is to find the information required to actually even make the file. This seems to be available behind the license-wall at:   Using my account there, I downloaded the zip archive of BSDL files, and it seems that they are indeed the source material that I need. The PIN_MAP_STRING in each file seems to be the reverse-order of what appears in the UrJTAG file.  The syntax of BSDL is a bit weird, being VHDL derived, so where there are multiple pins defined on a single line, I'll have to work out how to parse those.

It turns out that UrJTAG has a parser utility for doing this:

bsdl2jtag xc7a100t_csg324.{bsd,jtag}
error: -E- error: In Package STD_1149_6_2003, Line 375, Error in User-Defined Package declarations.
error: -E- error: BSDL file 'xc7a100t_csg324.bsd' contains errors in VHDL stage, stopping
error: system error: Success Cannot open file STD_1149_6_2003 or /usr/local/share/urjtag/bsdl/STD_1149_6_2003

But as we can see, it is missing some files.  I suspect the install target of the Makefile might again be the problem here. Nope, apparently it just doesn't support STD_1149_6_2003. But someone has implemented the missing file. Unfortunately, it gives some error about user defined packages.  Someone else just took to modifying the BSDL files to remove the need for STD_1149_6_2003.  I might try that next.

Meanwhile, as I am out and about this morning, I took a Nexys4DDR board with me, which does have the exact chip that UrJTAG already supports, since I figured that should "just work", and I should be able to poke around with it while waiting for appointments.  Well, I don't get the error I described above, but I do instead get:

jtag> cable FT2232 vid=0x0403 pid=0x6010
Connected to libftd2xx driver.
jtag> detect
warning: TDO seems to be stuck at 1

What I don't know, is whether this is further along or not as far along as the other. I am guessing it is not as far along, since if the JTAG bus is stuck, it won't enumerate, and indeed, we are seeing a lack of enumeration. Fortunately I am not the only person with this problem.  Let's try some of their proposed solutions...

Unfortunately none of the suggestions on that page work. I'd suspect that my FPGA board is broken, except the fpgajtag command I use to send bitstreams to the FPGA via JTAG works perfectly.  So the JTAG interface *does* work, and my computer *can* communicate with it.  Most frustrating.

I also took a look at OpenOCD, an open-source JTAG tool for Linux etc.  This is an excellent project in many ways, but was never designed with simple FPGA boundary scans in mind.  As a result, it still isn't in any way trivial to do them with it. I am sure if I invested enough time and energy I could figure out how to do it, but I really don't want to have to do that, if I don't have to.

I did take a quick look at the internals of the fpgajtag command, to see if I could easily adapt it.  It looks reasonably well-structured, but for someone who doesn't know that much about JTAG (although I am learning), it isn't immediately obvious what I would need to change.

So then I started looking at Vivado to see if the hardware manager in there can easily do a boundary scan.  I am sure it can, but even after a pile of Googling, I can't actually figure out how to do it.  There is a lot of talk about needing a debug bitstream or some debug core in the project.  This strikes me as incomplete information at best, since the JTAG interface on an FPGA, if not disabled, can ALWAYS do a boundary scan, if I understand things correctly.  Also, my workstation this morning doesn't have mains power, so I don't want to kill my battery before the kids' swimming lessons finish for the morning.

The best thing I have found so far is this:
While whatever the JTAG library is that the example source code was written for isn't immediately clear, it does show how to go through the process of performing the boundary scan at a low level.  It might thus be enough information, together with the fpgajtag source, to cook something up that can work.  I have found the Xilinx BSDL files for the FPGAs I care about already, so in theory, I have all the information I need.

It also gives me hope of being able to take control of pins on the FPGA, so that I can more quickly test and develop things like this QSPI interface, as I can potentially avoid having to synthesise every change, but instead be able to bit-bash over JTAG.  But of course, I have to succeed in actually getting SOMETHING to work, before I can get that excited.

Well, at least integrating fpgajtag into monitor_load was relatively easy: The only slightly tricky part was re-doing the command line interface parse stuff. But I do want to extend it a little further, so that the fpgajtag stuff which correctly works out which USB serial port to talk to, can also be used to automatically find the correct serial port for the normal monitor_load communications.  This was also not too hard, once I found out that I could map the /dev/ttyUSBx paths to the entries in /sys/bus/usb-serial/devices, and look at the destination of those symlinks to check that the USB bus and port match.
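The symlink trick can be sketched in a few lines of Python (monitor_load itself is C, so this is only an illustration of the idea). It assumes the symlink targets contain a path component such as 1-2 or 1-2.4 that names the USB bus and port chain; the function name is made up, and the directory is a parameter so it can be pointed at a test tree.

```python
import os
import re


def usb_location_of_serial_ports(sysfs_dir="/sys/bus/usb-serial/devices"):
    """Map /dev/ttyUSBx names to the USB 'bus-port' string in their symlink target."""
    result = {}
    for name in sorted(os.listdir(sysfs_dir)):
        target = os.path.realpath(os.path.join(sysfs_dir, name))
        # Components like '1-2' or '1-2.4' name the bus and port chain.
        # Interface components such as '1-2:1.1' do not match and are skipped.
        locations = [part for part in target.split(os.sep)
                     if re.fullmatch(r"\d+-\d+(\.\d+)*", part)]
        if locations:
            result["/dev/" + name] = locations[-1]
    return result


if os.path.isdir("/sys/bus/usb-serial/devices"):
    print(usb_location_of_serial_ports())
```

Comparing the returned location with the bus and port of the JTAG adapter is then enough to pick the matching serial port.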

So now, in theory, I have all necessary ingredients to adapt to be able to run a boundary scan from within monitor_load, so long as I can figure out how the  fpgajtag code does the JTAG communications.  But this is not proving as simple as I would like, as fpgajtag has what seems to be a quite clever mechanism for abstracting the low-level JTAG operations.

Unfortunately, there is little documentation in the source, and I am struggling to understand how to adapt it.  I'm pulling my hair out enough that I have logged an issue on the fpgajtag github repository asking for some help in understanding their code.  Within a few hours, I had received some pointers to documentation for the FTDI serial adapters, which gave me enough information, with quite a lot of trial and error, to work out how to control the JTAG interface.  This will also come in handy in the future, when we get to implementing updating the keyboard CPLD from the MEGA65 itself as well, as I will need to implement a JTAG interface for that.

Anyway, back to the point, I now seem to be able to read some JTAG boundary scan data from the FPGA.  It seems to be shifted by a few bits, and I don't yet capture it all, but I am able to see bits toggle as I flip the switches on a Nexys4 DDR board, and in roughly the right place in the boundary scan register.  I suspect the bit order of the bytes might be flipped, and that I need to ignore the first 6 or so bits, to make up for the bits of the boundary scan command itself being shifted out.  But the important thing is that I can now read boundary scan data.  The changes I made to the read_idcode() function to tell it to switch to boundary scan mode ended up being quite simple:

    write_bit(0, 0, 0xff, 0);     // Select first device on bus
    write_bit(0, 5, IRREG_SAMPLE, 0);     // Send SAMPLE command

(Checkout if you would like to see it all together.)

This switches the JTAG interface from Reset to Idle, then to IR-Capture, sends the JTAG SAMPLE command so that it ends up in the IR register, and then returns to Idle state, ready for the usual logic to shift bits in and out.  The boundary data is then in the data shifted back in.  All quite simple, once I had worked it out!
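For anyone who wants to follow those state changes, here is the standard JTAG TAP state machine as a small Python table, plus the TMS sequence for loading an instruction. This is a model for following along on paper, not the fpgajtag code, and the helper names are invented. The IR length of 6 is the one reported by the detect command earlier.

```python
# state -> (next state if TMS=0, next state if TMS=1)
TAP_NEXT = {
    "TEST_LOGIC_RESET": ("RUN_TEST_IDLE", "TEST_LOGIC_RESET"),
    "RUN_TEST_IDLE":    ("RUN_TEST_IDLE", "SELECT_DR_SCAN"),
    "SELECT_DR_SCAN":   ("CAPTURE_DR", "SELECT_IR_SCAN"),
    "CAPTURE_DR":       ("SHIFT_DR", "EXIT1_DR"),
    "SHIFT_DR":         ("SHIFT_DR", "EXIT1_DR"),
    "EXIT1_DR":         ("PAUSE_DR", "UPDATE_DR"),
    "PAUSE_DR":         ("PAUSE_DR", "EXIT2_DR"),
    "EXIT2_DR":         ("SHIFT_DR", "UPDATE_DR"),
    "UPDATE_DR":        ("RUN_TEST_IDLE", "SELECT_DR_SCAN"),
    "SELECT_IR_SCAN":   ("CAPTURE_IR", "TEST_LOGIC_RESET"),
    "CAPTURE_IR":       ("SHIFT_IR", "EXIT1_IR"),
    "SHIFT_IR":         ("SHIFT_IR", "EXIT1_IR"),
    "EXIT1_IR":         ("PAUSE_IR", "UPDATE_IR"),
    "PAUSE_IR":         ("PAUSE_IR", "EXIT2_IR"),
    "EXIT2_IR":         ("SHIFT_IR", "UPDATE_IR"),
    "UPDATE_IR":        ("RUN_TEST_IDLE", "SELECT_DR_SCAN"),
}


def walk(state, tms_bits):
    """Follow a sequence of TMS values, one per TCK, and return the final state."""
    for tms in tms_bits:
        state = TAP_NEXT[state][tms]
    return state


def tms_for_ir_load(ir_length):
    """TMS sequence that loads an instruction, starting and ending in Idle."""
    return ([1, 1, 0, 0]              # Idle -> Select-DR -> Select-IR -> Capture-IR -> Shift-IR
            + [0] * (ir_length - 1)   # shift all but the last instruction bit
            + [1]                     # last bit is shifted while leaving Shift-IR
            + [1, 0])                 # Exit1-IR -> Update-IR -> Idle


print(walk("TEST_LOGIC_RESET", [0]))
print(walk("RUN_TEST_IDLE", tms_for_ir_load(6)))
```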

With a bit more work, I have now implemented an amazingly quick and dirty scanner for both the XDC and BSDL file formats.  XDC files indicate the pins used by a project, while BSDL files have the information about the FPGA itself, importantly including the JTAG boundary scan information.  With these parsers, and a bit of glue, I can not only show the status of each FPGA pin, but also the name of the pin in the project.  While there is plenty of room to improve this, the result is already really nice.  Here is a little sample of the output on a Nexys4DDR board:

monitor_load -J src/vhdl/nexys4ddr-widget.xdc,${HOME}/build/artix7/public/bsdl/xc7a100tl_csg324.bsd
make: 'src/tools/monitor_load' is up to date.
fpgajtag: Digilent:Digilent USB Device:210292645477; bcd:700; IDCODE:  3631093
Auto-detected serial port '/dev/ttyUSB1'
FPGA is assumed to be a XC7A100TL_CSG324, with 989 bits of boundary scan data.
bit#2 : CCLK_E9 (pin E9, signal {QspiSCK}) = 1
bit#3 : M0_P12 (pin P12, signal <unknown>) = 1
bit#4 : M1_P13 (pin P13, signal <unknown>) = 0
bit#5 : M2_P11 (pin P11, signal <unknown>) = 1
bit#6 : CFGBVS_P8 (pin P8, signal <unknown>) = 1
bit#10 : INIT_B_P7 (pin P7, signal <unknown>) = 1
bit#13 : DONE_P10 (pin P10, signal <unknown>) = 0
bit#53 : IO_U8 (pin U8, signal {sw[9]}) = 0
bit#56 : IO_T8 (pin T8, signal {sw[8]}) = 1


First, we have filtered out all the bits that are not marked "input" in the BSDL file, which dramatically shortens the list of output.

Second, we see the nice mapping of the BSDL bit names to FPGA pins and project signals.  sw[9] and sw[8] are two of the slide switches on the Nexys board, and I can happily twiddle those, and re-run the scan, and see the changing values.  So I am confident overall that it's working, and that I can finally go back to what I was trying to do at the beginning: Check whether I am correctly controlling the QSPI interface pins, in particular the CCLK pin.
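To give a feel for how little is needed on the XDC side, here is a quick and dirty Python sketch of the pin-to-signal mapping. It is not the parser in monitor_load. It assumes each constraint keeps PACKAGE_PIN and a braced get_ports on one line, and the sample lines are written for illustration rather than copied from the real nexys4ddr-widget.xdc.

```python
import re

PIN_RE = re.compile(r"PACKAGE_PIN\s+(\w+)")
# ASSUMPTION: the port name is always wrapped in braces, e.g. { sw[9] }
PORT_RE = re.compile(r"get_ports\s+\{\s*([^\s{}]+)\s*\}")


def parse_xdc(text):
    """Return a dict mapping package pin (e.g. 'U8') to signal name (e.g. 'sw[9]')."""
    pins = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0]   # drop comments
        pin = PIN_RE.search(line)
        port = PORT_RE.search(line)
        if pin and port:
            pins[pin.group(1)] = port.group(1)
    return pins


# Illustrative constraint lines in the two common styles
example = """
# slide switches
set_property -dict { PACKAGE_PIN U8 IOSTANDARD LVCMOS33 } [get_ports { sw[9] }];
set_property -dict { PACKAGE_PIN T8 IOSTANDARD LVCMOS33 } [get_ports { sw[8] }];
set_property PACKAGE_PIN E9 [get_ports {QspiSCK}]
"""
pin_to_signal = parse_xdc(example)
```

Joining this dictionary with the pin names from the BSDL file is what lets each boundary scan bit be labelled with its project signal.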

So let's actually fire up a bitstream, and see if we can control the pin... and indeed I have confirmed that everything except the pesky clock pin is controllable.  This is what I had most suspected would be the problem, but now I don't have to suspect -- I can inspect!  But solving that will have to wait for the next blog post.

Meanwhile, if you would like to support me, I've setup a ko-fi page at

Wednesday, 1 January 2020

Running multiple bitstreams

The MEGA65 is based on an FPGA.  FPGAs are like a blank canvas that you load a hardware design into, with that design being typically stored in flash memory.  Generally you don't notice this, because the whole process of loading the design into the FPGA and starting it, takes only about 0.3 seconds.  This is why the MEGA65 can boot much faster than, say, a THE C64, which has to boot a Linux operating system and fire up an emulator.

It's one of the many advantages of FPGAs, if you have the time and sanity to spare to implement a retro-computer that way, instead of using software emulation.  But there is a potential down-side to this: With software emulation, it's really easy to change the program you are running. So, for example, emulator-based systems typically let you run not only C64, but also VIC-20, Amiga, Spectrum, Apple ][ and a whole pile of other systems.

So how can we have a framework for "swapping programs" like this on an FPGA?  Fortunately, this is a question that lots of big-spending customers of FPGAs asked a very long time ago, and so Xilinx and the other major vendors all have various ways of doing this.  In this blog post, I will document my learning process, as I explore the Xilinx documentation, to work out how to do this on the MEGA65, so that we can potentially have different machine cores down the track, but also, so that we can more easily have updates for the MEGA65's main core, without having the risk of bricking the machine if an update fails part way through.

So the starting point is Xilinx's documentation for configuring their FPGAs. Configuration is Xilinx's name for "loading the design into an FPGA and setting it running".  You can just think of it as being like loading a programme on a regular computer.  Anyway, Xilinx's documentation lives here.  We're particularly interested in Chapter 7 "Reconfiguration and Multiboot", since what Xilinx calls "multiboot" is exactly what we want.

Xilinx's Multiboot facility basically allows one bitstream (the FPGA program) to indicate where the FPGA should look in the flash memory for a different FPGA program, and then tell the FPGA to pretend it has just been turned on, so that it will load the new bitstream instead.  This means two lots of the approximately 0.3 seconds of boot time, if you want to have the first bitstream load the second one.  Actually, it can be a bit quicker, if the first bitstream, which Xilinx calls the "Golden Bitstream," is a really simple design, and thus will compress well.

My current thinking is that our Golden Bitstream will just be a known-working release of the normal MEGA65 core.  At least to begin with.  What I'm thinking of doing, is adding the necessary extra bits to the bitstream to allow the triggering of reconfiguration, together with a little bit of code in the Hypervisor, that checks if any of the number keys from 1 to 9 are being held down. If one of them is, then it will calculate an address in the flash memory based on the number pressed, and then trigger reconfiguration.  This will allow the use of the standard MEGA65 core, as well as up to 9 other cores, subject to them all fitting in the flash memory.

We also want to be able to support having updates to the MEGA65 core itself, which I am currently thinking will be implemented by having the Hypervisor try to load an updated bitstream from a specific part of the flash memory, if none of those number keys are pressed.  If 0 is held down, then I will have it not do this, so if you need to "downgrade", this will be possible. For example, if some bitstream update doesn't work for some particular reason.
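The key handling described in the last two paragraphs could be sketched like this in Python. The real thing would live in the Hypervisor, and nothing about the flash layout has been decided yet, so the slot size and the update address below are hypothetical placeholders that ignore how big the flash actually is.

```python
SLOT_SIZE = 4 * 1024 * 1024      # HYPOTHETICAL: space reserved per bitstream
UPDATE_ADDRESS = 10 * SLOT_SIZE  # HYPOTHETICAL: where an updated core would live


def choose_boot_action(key_held):
    """Decide what the Golden Bitstream should do at boot.

    key_held is one of '0'..'9', or None if no number key is held down.
    Returns (action, flash_address).
    """
    if key_held == "0":
        # "Downgrade": stay in the known-working core, skip the update
        return ("stay_in_golden", None)
    if key_held in set("123456789"):
        # One of the alternative cores, at an address derived from the key
        return ("reconfigure", int(key_held) * SLOT_SIZE)
    # No key held: try the updated MEGA65 core
    return ("try_update", UPDATE_ADDRESS)


for key in (None, "0", "3"):
    print(key, choose_boot_action(key))
```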

The Xilinx FPGAs are also capable of a nice trick: If you try to load a bitstream from somewhere else in the flash memory and it fails, the FPGA will reload the Golden Bitstream, but this time with a special flag set to say that it has fallen back to the Golden Bitstream. That way, we can even have the MEGA65 display some kind of message on first boot, if the updated bitstream doesn't work for whatever reason.

All up, this should give us a good basis on which to build a nice update mechanism for the bitstream on the MEGA65.  All I need to do now, is actually extract the information I need from Xilinx's documentation, and then actually implement it.  This could be the fun part, as this is a feature that is notoriously under-documented...

First step: Find out how to instantiate the ICAPE2 thingy (Dingsbums for the Germans reading along), that allows access to the whole configuration system.  This seems to be available here on page 178.  What worries me, is that it looks to be a bit minimalistic:

Library UNISIM;
use UNISIM.vcomponents.all;
-- ICAPE2: Internal Configuration Access Port
-- 7Series
-- Xilinx HDL Libraries Guide, version 2012.2

ICAPE2_inst: ICAPE2 
generic map(
   DEVICE_ID => X"3651093",    -- Specifies the pre-programmed
                               -- Device ID value to be used for
                               -- simulation purposes.
   ICAP_WIDTH => "X32",        -- Specifies the input and output
                               -- data width.
   SIM_CFG_FILE_NAME => "NONE" -- Specifies the Raw Bitstream (RBT)
                               -- file to be parsed by the
                               -- simulation model.
)
port map(
   O => O,        -- 32-bit output : Configuration data output bus
   CLK => CLK,    -- 1-bit input   : Clock Input 
   CSIB => CSIB,  -- 1-bit input   : Active-Low ICAP Enable
   I => I,        -- 32-bit input  : Configuration data input bus
   RDWRB => RDWRB -- 1-bit input   : Read/Write Select input
);
-- End of ICAPE2_inst instantiation

So now I need to figure out what each of those does.
The DEVICE_ID and SIM_CFG_FILE_NAME are apparently only used for simulation, so that the fake configuration register values can be read out, so we can ignore those, I think.

ICAP_WIDTH, O and I also seem to be pretty logical, defining the width and the input and output buses.  The fact that it allows the width to be varied makes it tempting to try making the interface 8-bit, but I have a gut feeling that that would just Lead To Trouble.  But I'll have a think about it as I keep exploring.

So that just leaves CLK, which should be straight-forward, CSIB and RDWRB, which I am not yet totally sure about.

Reading page 148 of this suggests that we have to write a series of 32-bit values that are basically a pretend tiny bitstream.  This would explain why the interface has only Read/Write select and Chip Select (CS) signals to go with the data: We just have to write the correct series of values. It also suggests that the 8-bit interface mode might just work, too, which would be nice -- if I can get the byte order correct.

Xilinx's recommended set of values to send is:

FFFFFFFF - Dummy word
AA995566 - Sync word 
20000000 - Type 1 NOOP
30020001 - Type 1 write to WBSTAR
00000000 - Warm-boot start address
30008001 - Type 1 write words to CMD
0000000F - IPROG word
20000000 - Type 1 NOOP

Let's try to go through those to understand what is going on.
The dummy word probably doesn't require much explanation. The sync word, I think, helps the FPGA work out the bit/byte order / endian-ness. Might also work to help get 8-bit mode right.  We'll investigate that later.
Then we have some "Type 1 NOOP"s in there. Those we can generally ignore for now, as well.
Then we have the interesting part, where we write to the WBSTAR register.  This sets the upper bits of the flash memory address that the FPGA will configure itself from.  The lower 8 bits are undefined, so apparently the bitstream should be pre-padded with 256 FF bytes, to make sure.
Then we have the write of the IPROG word to the CMD register. This is apparently what tells the FPGA to reset and reconfigure, while keeping the just-set WBSTAR value.
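To make the "Type 1" words a little less magic, here is a minimal host-side Python sketch (not part of the MEGA65 source) that splits such a header word into its fields. The field positions are my reading of the 7-series Type 1 packet layout and should be checked against Xilinx's configuration guide; the function name decode_type1 is just an illustrative choice, and the register table only lists the two registers used in this post.

```python
# Field layout assumed here (an assumption to verify against the Xilinx guide):
#   bits 31..29  header type (001 for Type 1)
#   bits 28..27  opcode (00 = NOOP, 01 = read, 10 = write)
#   bits 17..13  register address (low 5 bits of the address field)
#   bits 10..0   word count
OPCODES = {0: "NOOP", 1: "read", 2: "write"}

# Only the two registers that appear in the command sequence of this post.
REGISTERS = {0x04: "CMD", 0x10: "WBSTAR"}


def decode_type1(word):
    """Split a 32-bit Type 1 packet header into opcode, register and word count."""
    header_type = (word >> 29) & 0x7
    if header_type != 1:
        raise ValueError("not a Type 1 packet header: %08X" % word)
    opcode = (word >> 27) & 0x3
    address = (word >> 13) & 0x1F
    word_count = word & 0x7FF
    return {
        "opcode": OPCODES.get(opcode, "reserved"),
        "register": REGISTERS.get(address, "reg %d" % address),
        "word_count": word_count,
    }


if __name__ == "__main__":
    for w in (0x20000000, 0x30020001, 0x30008001):
        print("%08X" % w, decode_type1(w))
```

Decoded this way, 30020001 is a one-word write to register 16 (WBSTAR) and 30008001 is a one-word write to register 4 (CMD), which matches the annotations in the list above.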

So, let's cook up a bit of VHDL that embeds one of these ICAPE2 thingies, and tries to tell it to load a bitstream from a particular place, and see if we can make it work.

Along the way, I also found that the bit order within each byte of the data going through the ICAPE2 entity has to be reversed. I also found what claims to be a working implementation.
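The bit reversal is easy to get wrong, so here is a small Python sketch of what "reverse the bits in each byte" means for a 32-bit word: the byte order stays the same, only the bits inside each byte are mirrored. The function names are hypothetical helpers for illustration, not taken from the MEGA65 source.

```python
def reverse_bits_in_byte(value):
    """Mirror the 8 bits of a byte, so bit 0 becomes bit 7 and so on."""
    result = 0
    for i in range(8):
        if value & (1 << i):
            result |= 1 << (7 - i)
    return result


def bit_swap_word(word):
    """Reverse the bit order inside each byte of a 32-bit word.

    The bytes themselves stay in place; only their contents are mirrored.
    """
    swapped = 0
    for shift in (24, 16, 8, 0):
        byte = (word >> shift) & 0xFF
        swapped |= reverse_bits_in_byte(byte) << shift
    return swapped


if __name__ == "__main__":
    for w in (0xAA995566, 0x30020001, 0x0000000F):
        print("%08X -> %08X" % (w, bit_swap_word(w)))
```

For example, the sync word AA995566 becomes 5599AA66 after swapping, and applying the swap twice gives back the original word.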

Then I discovered that on the Artix 7 FPGAs, you have to allow 3 cycles, so the write sequence ends up like this:

  signal bitstream_values : reg_value_pair := (
    x"FFFFFFFF", -- Dummy word
    x"FFFFFFFF", -- Dummy word
    x"FFFFFFFF", -- Dummy word
    x"FFFFFFFF", -- Dummy word
    x"FFFFFFFF", -- Dummy word
    x"AA995566", -- Sync word
    x"20000000", -- Type 1 NOOP
    x"20000000", -- Type 1 NOOP
    x"30020001", -- Type 1 write to WBSTAR
    x"00000000", -- Warm-boot start address
    x"20000000", -- Type 1 NOOP
    x"20000000", -- Type 1 NOOP
    x"30008001", -- Type 1 write words to CMD
    x"0000000F", -- IPROG word
    x"20000000", -- Type 1 NOOP
    x"20000000", -- Type 1 NOOP
    others => x"FFFFFFFF"
    );

I then dynamically change the contents of the entry for the Warm-boot start address via some memory mapped registers:

      bitstream_values(9) <= reconfigure_address;
      cs <= '1';
      rw <= '1';
      if trigger_reconfigure = '1' then
        counter <= 0;
      end if;

Asserting trigger_reconfigure resets the counter to the start of the command stream. The words are then sent out one after another, and that triggers the reconfigure.
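As a cross-check of which slot in the table holds the warm-boot address, the same sequence can be modelled in a few lines of Python. This is only a sketch of the table above with the address patched in; the names WBSTAR_INDEX and build_reconfigure_sequence are made up for the example, and no bit-swapping is applied here.

```python
# Position of the warm-boot start address in the command table:
# 5 dummy words, sync word, 2 NOOPs, the WBSTAR write header, then the address.
WBSTAR_INDEX = 9


def build_reconfigure_sequence(start_address):
    """Return the 16 command words with the warm-boot address filled in."""
    if not 0 <= start_address <= 0xFFFFFFFF:
        raise ValueError("start address must fit in 32 bits")
    sequence = [0xFFFFFFFF] * 5 + [
        0xAA995566,  # Sync word
        0x20000000,  # Type 1 NOOP
        0x20000000,  # Type 1 NOOP
        0x30020001,  # Type 1 write to WBSTAR
        0x00000000,  # Warm-boot start address (patched below)
        0x20000000,  # Type 1 NOOP
        0x20000000,  # Type 1 NOOP
        0x30008001,  # Type 1 write words to CMD
        0x0000000F,  # IPROG word
        0x20000000,  # Type 1 NOOP
        0x20000000,  # Type 1 NOOP
    ]
    sequence[WBSTAR_INDEX] = start_address
    return sequence


if __name__ == "__main__":
    for index, word in enumerate(build_reconfigure_sequence(0x00000000)):
        print("%2d: %08X" % (index, word))
```

Counting from zero, the address lands in entry 9, directly after the WBSTAR write header in entry 8, which is why the VHDL patches bitstream_values(9).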

Then it's just a case of memory-mapping access to those registers:

            when x"C8" =>
              -- @IO:GS $D6C8-B - Address of bitstream in boot flash for reconfiguration
              reconfigure_address(7 downto 0) <= fastio_wdata;
            when x"C9" =>
              reconfigure_address(15 downto 8) <= fastio_wdata;
            when x"CA" =>
              reconfigure_address(23 downto 16) <= fastio_wdata;
            when x"CB" =>
              reconfigure_address(31 downto 24) <= fastio_wdata;
            when x"CF" =>
              -- @IO:GS $D6CF - Write $42 to Trigger FPGA reconfiguration to switch to alternate bitstream.
              if fastio_wdata = x"42" then
                trigger_reconfigure <= '1';
              end if;

I got this all together, but then still had a problem: When I tested it, it wouldn't work.  I suspected that this was because I was loading the bitstream via JTAG, rather than from the SPI flash that contains the usual bitstream.  This means that the FPGA hasn't been set up for SPI flash configuration, and thus trying to load a subsequent bitstream will fail.  To test this, I had to reflash the SPI flash to contain this new bitstream, and then try it from there... And it worked without problem!

To be a bit more specific: If you set $D6C8-$D6CB to the value $00000000, and then write $42 to $D6CF, it will reload itself, since it is at $00000000.  Thus it works like a kind of Very Hard Reset Indeed for the MEGA65.  Better, if you put some other value in there, where no valid bitstream exists, the FPGA has a watchdog timer that gets tripped when the FPGA fails to configure, and thus after a few seconds it falls back to the original bitstream at $00000000!  This means that if something goes wrong you get the "police lights" on the keyboard for a few seconds, before the machine boots normally.
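For anyone scripting this from the outside, the register protocol boils down to five byte writes. The Python sketch below only computes the list of (address, value) pairs in the order they would be written; how the bytes actually get poked into the machine is left open, and the function name reconfigure_writes and the example address are hypothetical.

```python
RECONFIG_ADDRESS_BASE = 0xD6C8  # $D6C8-$D6CB: flash address, low byte first
RECONFIG_TRIGGER = 0xD6CF       # $D6CF: write $42 here to trigger
TRIGGER_MAGIC = 0x42


def reconfigure_writes(flash_address):
    """Return the (register, byte) writes that request a reconfigure."""
    if not 0 <= flash_address <= 0xFFFFFFFF:
        raise ValueError("flash address must fit in 32 bits")
    writes = []
    for i in range(4):
        # $D6C8 takes bits 7..0, $D6C9 bits 15..8, and so on.
        writes.append((RECONFIG_ADDRESS_BASE + i, (flash_address >> (8 * i)) & 0xFF))
    # The trigger goes last, once all four address bytes are in place.
    writes.append((RECONFIG_TRIGGER, TRIGGER_MAGIC))
    return writes


if __name__ == "__main__":
    # 0x00400000 is just a made-up example address, not a real slot location.
    for register, value in reconfigure_writes(0x00400000):
        print("$%04X <- $%02X" % (register, value))
```

With an address of $00000000 this produces four zero writes followed by the $42 trigger, which is the Very Hard Reset case described above.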

Now, apparently there is a way to work out if this has happened, so that you can avoid an infinite loop of trying to start a broken bitstream, which I'll look into in due course.  Similarly, I need to work out how to write the extra bitstreams into the flash, so that we can actually use the multi-boot facility. Those who want to follow along, or see all the code, hop over to

But for now, I think it's time to open those fireworks we bought for New Year's Eve / Silvester that are sitting in our MEGA65 Hack Session New Year's Kit* and celebrate!

* Limited stocks. Some items not available in some countries.  Contents may vary from above image. Contains small parts. Not suitable for children under 3 years of age or lactating rhinoceroses, except under medical supervision.