Tuesday, 3 March 2020

Starting work on a libc for the MEGA65, and further work on QSPI flash updating

I'm continuing to work on creating a tool to write updated bitstreams (or "cores", or whatever you like to call the things that turn an FPGA into an interesting computer :) Here's what it looks like after I got it working:

But let's go back to the beginning...

In a recent post, I confirmed that I am able to access the QSPI flash on the MEGA65, reading it, writing to it, and erasing sectors of it, as required.  So the next step is to turn this into some kind of functioning utility that lets you actually pick out a bitstream from the SD card, and then write it into one of the slots in the flash memory.  The MEGA65 R2 board has 32MB of flash memory, and each core needs just under 4MB, so I will arrange things using 4MB slots. The little bit of spare space might well get used to allow including icons and descriptions for the bitstreams, so that we can have a more visually interactive means of switching cores.
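
The slot arithmetic described above can be sketched in a few lines of C. This is only an illustration: the names FLASH_SIZE, SLOT_SIZE, slot_start() and slot_end() are made up for this sketch and are not taken from the actual utility.

```c
#include <assert.h>
#include <stdint.h>

#define FLASH_SIZE  (32UL * 1024UL * 1024UL)  /* 32MB of QSPI flash on the R2 board */
#define SLOT_SIZE   (4UL * 1024UL * 1024UL)   /* one 4MB slot per core */
#define SLOT_COUNT  (FLASH_SIZE / SLOT_SIZE)  /* works out to 8 slots */

/* First flash address belonging to a slot (hypothetical helper). */
uint32_t slot_start(uint8_t slot)
{
    assert(slot < SLOT_COUNT);
    return (uint32_t)(slot * SLOT_SIZE);
}

/* One past the last flash address belonging to a slot (hypothetical helper). */
uint32_t slot_end(uint8_t slot)
{
    return slot_start(slot) + (uint32_t)SLOT_SIZE;
}
```

With these definitions, slot 1 covers $400000 up to (but not including) $800000, and the eighth slot ends exactly at the 32MB mark.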

But first, we need to be able to read a bitstream file to write to the flash.  To do that, we need to use the hypervisor calls for accessing the SD card's FAT32 file system, since the bitstreams are too big to fit in a D81 disk image.  Also, the FAT32 file system is MUCH faster (up to around 1MB/sec) compared with the C65 DOS on D81 disk images (typically <20KB/sec, even with the CPU at 40MHz).

Fortunately, I already have working Hypervisor calls implemented for traversing directories, and even opening and reading files.  What I don't have, though, is a collection of nice wrapper routines that I can call from a C program written using CC65, i.e., a kind of MEGA65 standard library or libc.  Eventually, this should go into CC65 as part of a MEGA65 target (and we would welcome a volunteer to work on that). But for now, I have created a repository where those routines can be collected: https://github.com/mega65/mega65-libc.

Many of the routines that I have already put in there are duplicated across mega65-fdisk, mega65-freezer and so on, so this was really a job that needed to be done.  At some point, I will adapt those programmes to use the new library as well. Probably after I have confirmed that the library is working for the megaflash utility...

Argument and return value passing in CC65 is a little bit fiddly, but I have done it before, so I am hoping to not have any great problems there.

It didn't take too long to pull together a simple programme that can display the contents of the different areas of the QSPI flash: It considers each 4MB section as a "slot" for a bitstream, and allows a name (and soon, an icon) to be stored in each, followed by the bitstream itself:

  

To test this programme, I then hooked up keyboard input, so that pressing the numbers 1 - 8 will cause the FPGA to load the corresponding bitstream... and then it refused to reconfigure the FPGA.  This was unexpected and annoying, as I had gotten the FPGA reconfiguration stuff working only a couple of weeks earlier, and it was rock solid.

Then I eventually remembered that there is a funny aspect of the FPGA reconfiguration stuff: You must have started the first bitstream from the QSPI flash, if you want it to reconfigure from the QSPI flash.  This is because there is some magic at the start of the bitstream that tells it to use QSPI flash, what the clock frequency for that should be, along with a few other parameters.

Once I had remembered this, everything was good again, but it is still annoying, as I have to write the bitstream to flash to be able to test it, which takes a couple of minutes, instead of just being able to use the monitor_load command, which lets you load a bitstream in a couple of seconds.  This has me wondering whether it is possible to figure out the magic that tells the FPGA how to configure from the QSPI flash, so that I can enable that myself as part of the bitstreams I build.  That would just help speed up the remaining development of this feature.

The problem is that this seems to have me back into Poorly Documented Features of FPGAs territory.  I might just have to try experimenting building bitstreams with different frequencies and/or bus widths, to see if I can spot the differences.  Hardly ideal, especially when the whole point was to save time, not waste it :/ I'll try switching the SPI width from 4 to 1, and see if I see anything obvious looking, and maybe the frequency as well, but if it takes too long, I'll just give up, I think.  Project X-Ray does have some information as well, but I can't immediately figure out where to find what I need.

The first difference I have found is this:

30 03 E0 01 00 00 02 6C for 4 bit QSPI
30 03 E0 01 00 00 00 0C for 1 bit SPI

The 30 03 E0 01 is an instruction to write $001 words to register $3E/2 = $1F = 0b11111 of the FPGA, i.e., write a value to a specific register.  From the Xilinx 7 Series Configuration Guide, I have figured out that this is most likely the BPI/SPI Configuration Options Register described in Table 5-41.  Indeed, the changed bits indicate a change in configuration bus width and SPI read command (SPI chips have different read commands for 1-bit, 2-bit and 4-bit wide reads). Setting the frequency to 1MHz changed another register, register 0b01001, which has a field for the oscillator frequency, which seems to be conveniently directly encoded in MHz.
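
To make the "$3E/2 = $1F" shorthand a bit more concrete, here is a small sketch that pulls a header word like 30 03 E0 01 apart. It assumes the Type 1 packet layout from the 7 Series configuration guide (packet type in bits 31..29, opcode in bits 28..27, register address in bits 26..13, word count in bits 10..0). The struct and function names are invented for this example.

```c
#include <stdint.h>

/* Fields of a configuration packet header (names are hypothetical). */
struct cfg_packet {
    uint8_t  type;       /* bits 31..29: 1 for a Type 1 packet */
    uint8_t  opcode;     /* bits 28..27: 0 = NOP, 1 = read, 2 = write */
    uint16_t reg;        /* bits 26..13: register address */
    uint16_t word_count; /* bits 10..0: number of 32-bit data words that follow */
};

/* Split a 32-bit header word into its fields, assuming the Type 1 layout. */
struct cfg_packet decode_type1(uint32_t header)
{
    struct cfg_packet p;
    p.type       = (uint8_t)((header >> 29) & 0x7);
    p.opcode     = (uint8_t)((header >> 27) & 0x3);
    p.reg        = (uint16_t)((header >> 13) & 0x3FFF);
    p.word_count = (uint16_t)(header & 0x7FF);
    return p;
}
```

Feeding it 0x3003E001 gives a Type 1 write of one word to register $1F, which matches the reading above. The data word that follows (00 00 02 6C or 00 00 00 0C) is then the new register value.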

I also found other information about the FPGA register setup, and tried it out, but it still didn't work.  So rather than wasting more time, I think I will just concentrate on doing it the way I can -- just at the cost of a few minutes of extra delay when trying a new bitstream.

So, anyway, back to writing the programme...

Today I copied the disk image chooser code from the freeze menu in, and adapted that to allow selecting a .BIT file from the SD card.  That required a bit of fiddling about, because the freeze menu uses 16-bit text mode, while the flash utility uses the normal VIC-II text mode. This uses the opendir() and readdir() functions in the library I have been writing.  So if you hold the CONTROL key when pressing the digit for a flash slot, it will show you the list of .BIT files that are available:


Since I already have the code working to erase, write and read flash memory pages, that just leaves the code to read the contents of the BIT files.  I have already implemented the open() and close() functions in the library, as well as a read512() function, that reads one sector of data into a buffer.  Those were quite easy to hook up to test, but it looks like they aren't working for some reason.  My guess is that the problem lies somewhere in the library functions, so I'll start having a dig through that next.

To investigate the open() and read512() functions, I want to know if they are calling the correct Hypervisor traps, and if those are working. To do that, I'll add checkpoint message code to those Hypervisor functions temporarily, so that I can know if the calls are succeeding.  My gut feeling is that open() is fine, but that the C parameter handling for read512() might be messed up, as that is a bit fiddly.  So I might also add some debug output in the read512() function to show what it thinks are in the passed parameters.

So, first up, I have confirmed that open() correctly gets the filename from the passed parameters.  Now to check that it really opens the file.  Hmm... It seems to not enter the Hypervisor open() call for some reason.  That I fixed by making the library force VIC-IV IO mode, to make sure the Hypervisor trap registers remain visible.

After that, I wasted a few hours chasing my tail around various problems.  The main one that I have worked around, but not really completely resolved, is that the assembly routines I have for calling from the CC65 programme were doing weird things with parameter passing.  For example, both the _open() and _read512() functions take a single string argument.  I'd expect both functions to be treated the same way by CC65, but one gets the pointer to the string passed on the stack, while the other gets its pointer passed in the A and X registers. At some point I'll figure out why, but for now, I have it working.

That got me to the point where I can now select a bitstream, and read the contents of the file from the SD card.  Together with the flash access routines, everything is now known to have worked at least once.  But tying it all together will have to wait until we get settled back in at Arkaroola, after being rained out for a week.

Okay, so we are still rained out, and our trailer has thrown an axle, so we're still stuck here.  I don't have a screen or keyboard to connect to the Nexys4DDR board, so I have to work blind. Well, almost blind, because I implemented a crude text screen grab function in monitor_load a while back.

As I had all the support functions for accessing the flash, drawing the progress bar and reading the bitstream file from the FAT32 file system, I figured I could at least write the block of code to erase and then write to the flash:

First step is to erase the flash area we are using.  These flash chips can have funny size flash sectors, which can make working out which sectors to erase a bit tricky. Fortunately, they also allow you to specify the page to erase by giving an address in the flash.  The trick is, you don't know how much of the flash has been erased, because you don't know the size of page it has just erased.  My solution to this is to read the flash region progressively, and if it needs erasing, ask for that piece of the flash to be erased.  I then verify that it has been erased, and continue.  This seems to work fine. Here is that block of code:

  // Do a smart erase: read blocks, and only erase pages if they are
  // not all $FF.  Later we can make it even smarter, and only clear
  // pages where bits need clearing.
  // Also, we will assume the BIT files contain the 4KB header we want
  // so we will just write up to 4MB of stuff in one go.
  progress=0; progress_acc=0;
  for(addr=(4L*1024L*1024L)*slot;addr<(4L*1024L*1024L)*(slot+1);addr+=512) {
    progress_acc+=512;
    if (progress_acc>26214) {
      progress_acc-=26214;
      progress++;
      progress_bar(progress);
    }
    read_data(addr);
    for(i=0;i<512;i++) if (data_buffer[i]!=0xff) break;
    if (i<512) {
      erase_sector(addr);
      // Wait a while for erasing to finish
      for(i=0;i<100;i++) usleep(10000);
      // Then verify that the sector has been erased
      read_data(addr);
      for(i=0;i<512;i++) if (data_buffer[i]!=0xff) break;
      if (i<512) {
        printf("\n! Failed to erase flash page at $%llx\n",addr);
        printf("  byte %d = $%x instead of $FF\n",i,data_buffer[i]);
        while(1) continue;
      }
    }
  }


The progress_acc>26214 stuff is to get the progress bar to work nicely.  4MB / 160 positions in the progress bar = 26,214 bytes (rounded down). That is, the progress bar needs to grow a little after each 26,214 bytes. Otherwise it is fairly logical.  Erased bytes of flash should contain 0xFF.  The delay after erasing before reading is to allow the flash chip time to complete erasing the sector.  I'm probably being a bit conservative with this, and it can certainly be optimised, but it works for now.
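
As a quick sanity check of that arithmetic, the accumulator logic can be run on its own, without touching any flash. This is just a sketch: the function name and the macros are invented here, and only the 512 byte step and the 26214 threshold come from the loop above.

```c
#include <stdint.h>

#define SLOT_SIZE      (4UL * 1024UL * 1024UL)       /* bytes in one flash slot */
#define BAR_POSITIONS  160UL                         /* positions in the progress bar */
#define BYTES_PER_STEP (SLOT_SIZE / BAR_POSITIONS)   /* 26214, rounded down from 26214.4 */

/* Walk a whole slot in 512 byte blocks and count how often the bar advances. */
uint16_t progress_steps_for_slot(void)
{
    uint32_t addr;
    uint32_t progress_acc = 0;
    uint16_t progress = 0;

    for (addr = 0; addr < SLOT_SIZE; addr += 512) {
        progress_acc += 512;
        if (progress_acc > BYTES_PER_STEP) {
            /* Carry the remainder over, so that rounding errors don't pile up */
            progress_acc -= BYTES_PER_STEP;
            progress++;
        }
    }
    return progress;
}
```

Because the remainder is carried over rather than thrown away, the bar advances exactly 160 times over the 4MB slot, with the last step landing on the final block.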

Once that is done, I can then try to write to the flash.  Here is what I have cooked up so far:

  // Read the flash file and write it to the flash
  printf("Writing bitstream to flash...\n");
  progress=0; progress_acc=0;
  for(addr=(4L*1024L*1024L)*slot;addr<(4L*1024L*1024L)*(slot+1);addr+=512) {
    progress_acc+=512;
    if (progress_acc>26214) {
      progress_acc-=26214;
      progress++;
      progress_bar(progress);
    }

    bytes_returned=read512(buffer);
   
    if (!bytes_returned) break;

    // Programming works on 256 byte pages, so we have to write two of them.
    lcopy((unsigned long)&buffer[0],(unsigned long)data_buffer,256);
    program_page(addr);
    for(i=0;i<100;i++) usleep(10000);
    lcopy((unsigned long)&buffer[256],(unsigned long)data_buffer,256);
    program_page(addr+256);
    for(i=0;i<100;i++) usleep(10000);

    // Verify
    read_data(addr);
    for(i=0;i<512;i++) if (data_buffer[i]!=buffer[i]) break;
    if (i<512)
      {
        // Failed to verify. Try once more, then give up.

        if (i<256) {
          // Programming works on 256 byte pages, so only re-write the half that failed.
          lcopy((unsigned long)&buffer[0],(unsigned long)data_buffer,256);
          program_page(addr);
          for(i=0;i<100;i++) usleep(10000);
        } else {
          lcopy((unsigned long)&buffer[256],(unsigned long)data_buffer,256);
          program_page(addr+256);
          for(i=0;i<100;i++) usleep(10000);
        }

        // Verify
        read_data(addr);
        for(i=0;i<512;i++) if (data_buffer[i]!=buffer[i]) break;
        // Second attempt verified OK, so carry on with the next 512 byte block
        // (a break here would abandon the rest of the bitstream)
        if (i==512) continue;

        printf("Verification error at address $%llx:\n",
               addr+i);
        printf("Read back $%x instead of $%x\n",
               data_buffer[i],buffer[i]);
        while(1) continue;
      }
   
   
  }


For writing, the loop is somewhat similar to the erasing. The main difference is that here we read the 512 bytes of file data that we need to write, and then try to write it.  Writing (at the moment) happens in 256 byte blocks, so we have to split the data in halves.  That's the theory. But at the moment, it successfully writes only the first 256 bytes.  So I get output like the following (this is using the crude screen grab feature, so some of it is a bit messed up in appearance):

?RASING FLASH SLOT...                  
ACTIVATING WRITE ENABLE...             
CLEARING STATUS REGISTER...            
ERASING SECTOR...                      
                                       
?RITING BITSTREAM TO FLASH...          
ACTIVATING WRITE ENABLE...             
CLEARING STATUS REGISTER...            
WRITING 256 BYTES OF DATA...           
DATA AT $00400000 WRITTEN.             
ACTIVATING WRITE ENABLE...             
CLEARING STATUS REGISTER...            
WRITING 256 BYTES OF DATA...           
DATA AT $00400100 WRITTEN.             
ACTIVATING WRITE ENABLE...             
CLEARING STATUS REGISTER...            
WRITING 256 BYTES OF DATA...           
DATA AT $00400000 WRITTEN.             
?ERIFICATION ERROR AT ADDRESS $400000: 
?EAD BACK $FF INSTEAD OF $0            


On this particular run, we see it trying to write to $00400000 and then $00400100, which is the first 512 bytes of the flash slot.  It tries to write to $00400000 again, because the verification process detects that the write didn't happen correctly.  Sometimes it will do the same, but the verification error happens in the $00400100 write, instead of the $00400000 one.

This is all a bit annoying, as I had previously tested the write routines, and they worked fine.  I am presuming that the problem is something to do with the sequencing of the writes after each other, or the write after the erase + read or something like that.

Okay, so working some more, it turns out that the flash chip has some sort of read buffer, and it can return the contents of that instead of what has just been written (or erased).  I don't know how to invalidate the read buffer, other than to just do all the erase or write operations, and then check the whole thing over again after.  On the upside, I have managed to write a bitstream, and start it.

Erasing and programming is quite slow, several times slower than using the Vivado tools. Much of this, I am sure, is because I have an 8-bit CPU bit-bashing the SPI interface, and it isn't even using quad SPI when programming, which slows the transfer down even more.  It actually shouldn't be that hard to get QSPI writing working.  Indeed, I probably should, since it will make the rest of the development team's life easier as they test things, as well as just making everyone happier who ever uses a MEGA65, by not having to wait as long when applying an update.

It's now several days later.  The intermittent flash writing problems I was experiencing above turned out to be a trivial fix: I was checking the wrong bits in the status register of the QSPI flash when checking if it had finished writing.  As a result, it would get in some funny state when I asked it to write another block of data while the first one was still writing. From there, it was fairly downhill running, and I had reliable writing and erasing of the flash.
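
For the curious, the kind of check involved looks roughly like the sketch below. It rests on an assumption: that the busy flag is bit 0 of the status register, as it is on most SPI NOR flash parts. The post doesn't say which bits were involved, so check the datasheet for your chip. The function names are made up, and the "flash" here is a simulated one that reports busy for a few polls.

```c
#include <stdint.h>

#define STATUS_WIP 0x01  /* ASSUMPTION: bit 0 = write/erase in progress */

/* Hypothetical stand-in for whatever routine reads the flash status register. */
typedef uint8_t (*read_status_fn)(void);

/* Poll until the busy bit clears. Returns 1 when ready, 0 if we gave up. */
int wait_until_ready(read_status_fn read_status, uint32_t max_polls)
{
    while (max_polls--) {
        if (!(read_status() & STATUS_WIP)) return 1;
    }
    return 0;
}

/* Simulated flash: busy for three polls, with an unrelated bit (bit 1)
   set the whole time to show why testing the wrong bit goes wrong. */
static uint8_t fake_busy_polls = 3;

uint8_t fake_read_status(void)
{
    if (fake_busy_polls) { fake_busy_polls--; return 0x03; }
    return 0x02;
}
```

Calling wait_until_ready(fake_read_status, 10) returns 1 once the simulated chip stops reporting busy. A loop that tested bit 1 instead would never see it clear here, and one that tested a bit that is always clear would charge ahead while the previous write was still in progress.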

So then the next step was to tie this all together, so that the flash menu could be invoked on boot up. Actually, more to the point, the flash menu MUST get invoked on every boot, to see if it should reconfigure to an upgraded core, if you have installed one.  Again, this is so that when you install an upgrade core, you never actually remove the original "factory" core, which can then remain there as a safe fall-back, meaning you can't brick the machine, no matter how hard you try when installing an upgraded bitstream/core.

This means that the hypervisor has to copy the flash menu into the right place in memory, and briefly transfer control to it.  However, we don't want to wait for the SD card to finish resetting, partly as it would make booting an upgraded bitstream too slow, and partly because it would mean if you messed up your SD card, you could get into a position where you couldn't even enter the flash menu.  That would be a bit too fragile, so like the FDISK utility for preparing a new SD card, we prefer to have this all pre-loaded into the BRAM in the bitstream.

The trick with this is that we have written the flash menu using the CC65 C compiler targeting a C64 for ease of software development. This means that it makes use of various KERNAL calls, e.g., to implement printf().  So now we need to have a ROM available to us as well.  Fortunately, we started the OpenROM project to create open-source C64/C65 ROM sets.

As the flash menu needs to access the QSPI flash, which I don't intend to leave available from outside the hypervisor (so that programmes can't trash the QSPI flash and thus brick the machine), I don't intend to "boot" the ROM in the normal way.  Instead, all I want to do is to map the KERNAL in, initialise the screen and the indirect vector table, so that the program can run.

This all sounds great, but I hit a snag: The OpenROM for the C65/MEGA65 has had a number of improvements, and one of those required me to make a small change to the CPU.  In short, Roman had come up with a really nice trick: he has moved some of the rarely used initialisation routines in the KERNAL to a second 8KB part of the ROM that was previously unused on the C65 ROM.  This frees up space for more interesting improvements, like support for turbo tape loading without a wedge etc.

However, it also caused problems for calling the KERNAL initialisation functions from within the hypervisor, because Roman used the MAP instruction to map that extra 8KB in. But of course the ROM doesn't expect to be in hypervisor mode, so it doesn't make any effort to preserve the existing memory map.  And in fact, it couldn't, even if it tried, because of an annoying flaw with the MAP instruction: You can't determine the current memory map with it, rather you can only destroy and replace it.

The work around was, however, rather simple: When the CPU is in Hypervisor Mode, it refuses to remove the Hypervisor from memory.  Here is the little bit of VHDL that does that check:

-- Lock the upper 32KB memory map when in hypervisor mode, so that nothing
-- can accidentally de-map it.  This will hopefully also fix using OpenROMs
-- with megaflash menu during boot (issue #156)
if hypervisor_mode='0' then
    reg_offset_high <= reg_z(3 downto 0) & reg_y;
    reg_map_high <= std_logic_vector(reg_z(7 downto 4));

end if;

With that fixed, the Hypervisor could then map the KERNAL in, and set everything up, and then call the flash menu. This has a few subtleties to it, that I will explain as we go along. So here goes...

First up, we need to tell the flash menu where to re-enter the Hypervisor boot code, so that booting can resume.  We do this by writing the return address into $CF80/$CF81:

launch_flash_menu:
   
    // Store where the flash menu should jump to if it doesn't need to do anything.
    lda #<return_from_flashmenu
    sta $cf80
    lda #>return_from_flashmenu
    sta $cf81
    // Then actually start it.
    // NOTE: Flash menu runs in hypervisor mode, so can't use memory beyond $7FFF etc.




Then we have to copy the flash menu program into place, which we do via DMA for speed and simplicity:

    // Run the flash menu which is pre-loaded into memory on first boot
    // (in the FPGA BRAM).

        lda #$ff
        sta $d702
        lda #$ff
        sta $d704  // dma list is in top MB of address space
        lda #>flashmenu_dmalist
        sta $d701
        // Trigger enhanced DMA
        lda #<flashmenu_dmalist
        sta $d705



Then we need to make the memory map look a bit like a C64, with the KERNAL at $E000-$FFFF.  IO is already mapped in, so no problem there. We have to do a little trick of writing in absolute mode to $0001 instead of $01, because Zero Page is mapped elsewhere by default in the Hypervisor. Of course, writing this now, I can see that I can save a byte and the hassle, by just remapping Zero page first, and then doing this, but anyway. We also work around the assembler not knowing the TAB and TYS opcodes.


    // Bank in KERNAL ROM space so megaflash can run
    // Writing to $01 when ZP is relocated is a bit tricky, as
    // we have to mess about with the Base Register, or force
    // the assembler to do an absolute write.
    lda #$37
    .byte $8d,$01,$00 // ABS STA $0001

    // XXX Move Stack and ZP to normal places, before letting C64 KERNAL loose on
    // Hypervisor memory map!
    lda #$00
    .byte $5B // tab
    ldy #$01
    .byte $2B // tys


Then we make sure we are in a VIC-II video mode, and call the minimum set of KERNAL initialisation routines required to enable the flash menu to not crash:
   
    // We should also reset video mode to normal
    lda #$40
    sta $d054
   

    // Tell KERNAL screen is at $0400
    lda #>$0400
    sta $0288
    // Now ask KERNAL to setup vectors
    jsr $fd15
    // And clear screen, setup screen editor
    jsr $e518


Ah yes, for some reason that I do not in the slightest recall, we have 8KB of the 8MB RAM expansion mapped to $4000-$5FFF in the Hypervisor memory map. Maybe I thought the Hypervisor needed some scratch space.  Of course, it is moot for now, because the expansion RAM doesn't yet work, although I am getting much closer.  But anyway, we remove it from the memory map, so that the flash menu program can use up to 30KB from $0800 - $7FFF:

    // Clear memory map at $4000-5FFF
    // (Why on earth do we even map some of the HyperRAM there, anyway???)
    lda #0
    tax
    tay
    ldz #$3f
    map
    eom

Then we simply jump into the entry point for the flash menu program:
   
    // Actually launch freeze menu
    jmp $080d





The DMA lists for setting everything up are here. Basically we copy the flash menu program down from $50000 to $07FF (so that the two load-address bytes don't displace things), and then save the screen RAM that the Hypervisor was using, so that we can restore it on exit, since the KERNAL initialisation routines that make it possible for the flash menu to run actually also clear the C64-mode screen, which overlaps with the Hypervisor's screen memory.

flashmenu_dmalist:
        // copy $50000-$577FF to $00007FF-$0007FFF

        // MEGA65 Enhanced DMA options
        .byte $0A      // Request format is F018A
        .byte $80,$00  // Copy from $00xxxxx
        .byte $81,$00  // Copy to $00xxxxx

    // Copy screen from $0400-$0BFF to $00009000
        .byte $00 // no more options
        // F018A DMA list
        .byte $04 // copy + chained
        .word $0800 // size of copy
        .word $0400 // starting addr
        .byte $00   // of bank $0
        .word $9000 // destination address is $9000
        .byte $00   // of bank $0
        .word $0000 // modulo (unused)

    // Copy program down
        .byte $00 // no more options
    // F018A DMA list
        .byte $00 // copy + not chained request
        .word $77FF // size of copy
        .word $0000 // starting addr
        .byte $05   // of bank $5
        .word $07FF // destination address is $0801 - 2
        .byte $00   // of bank $0
        .word $0000 // modulo (unused)
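
If you would rather build such a list from C than type the bytes by hand, the 11 byte job that follows the option bytes can be packed as in the sketch below. It simply mirrors the byte order of the listing above (command, 16-bit count, 16-bit source, source bank, 16-bit destination, destination bank, 16-bit modulo, with low bytes first). The function name pack_dma_job() is invented for this example and is not part of mega65-libc.

```c
#include <stdint.h>

/* Pack one DMA job in the byte order used by the lists above.
   Returns the number of bytes written. Hypothetical helper. */
uint8_t pack_dma_job(uint8_t *out, uint8_t command, uint16_t count,
                     uint16_t src, uint8_t src_bank,
                     uint16_t dst, uint8_t dst_bank)
{
    out[0]  = command;                  /* $00 = copy, $04 = copy + chained */
    out[1]  = (uint8_t)(count & 0xff);  /* size of copy, low byte first */
    out[2]  = (uint8_t)(count >> 8);
    out[3]  = (uint8_t)(src & 0xff);    /* source address within the bank */
    out[4]  = (uint8_t)(src >> 8);
    out[5]  = src_bank;
    out[6]  = (uint8_t)(dst & 0xff);    /* destination address within the bank */
    out[7]  = (uint8_t)(dst >> 8);
    out[8]  = dst_bank;
    out[9]  = 0x00;                     /* modulo (unused) */
    out[10] = 0x00;
    return 11;
}
```

For example, pack_dma_job(job, 0x00, 0x77FF, 0x0000, 0x05, 0x07FF, 0x00) produces the same 11 bytes as the "copy program down" job in the listing.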

Now we get back to what happens when the flash menu returns control to the Hypervisor (I'll explain how it does that in a moment).  Basically we just rearrange the furniture back to how the Hypervisor had everything, including restoring the screen:

return_from_flashmenu:   

    // Here we have been given control back from the flash menu program.
    // So we have to put some things back to continue the kickstart boot process.

    // Put ZP and stack back where they belong
    lda #$bf
    .byte $5B // tab
    ldy #$be
    .byte $2B // tys
    ldx #$ff
    txs
   
        lda #$ff
        sta $d702
        lda #$ff
        sta $d704  // dma list is in top MB of address space

    // Don't forget to reset colour RAM also
    lda #$01
    tsb $d030
        lda #>erasescreendmalist
        sta $d701
        // set bottom 8 bits of address and trigger DMA.
        //
        lda #<erasescreendmalist
        sta $d705
    lda #$01
    trb $d030
   
    // And finally, the screen data
        lda #>screenrestore_dmalist
        sta $d701
        // Trigger enhanced DMA
        lda #<screenrestore_dmalist
        sta $d705

    jsr resetdisplay
       
    jmp dont_launch_flash_menu
 

Otherwise, we have a couple of little niceties in the Hypervisor that check if you are trying to launch the flash menu after having already booted. If you try to do that, then it tells you that you need to power off and on first.  This is because the Hypervisor can't be sure that the flash menu is still intact in RAM after it has let you, the user, use the machine ;)
   
flash_menu_missing:
        ldx #<msg_flashmenumissing
        ldy #>msg_flashmenumissing
        jsr printmessage

dont_launch_flash_menu:

    // Check for the TAB key being pressed, indicating that the user wants
    // to enter the flash menu
    lda $d610
    cmp #$09
    bne fpga_has_been_reconfigured

    // Tell user what to do if they can't access the flash menu
noflash_menu:
        ldx #<msg_noflashmenu
        ldy #>msg_noflashmenu
        jsr printmessage
    inc $d020
nfm1:
    jmp nfm1

... boot normally

So, now let's look at how the flash menu program works out what to do, and passes control back to the hypervisor when required. The most important thing is this little line of code:

  if (PEEK(0xD610)!=0x09) {

It simply checks if you don't have the TAB key held down.  If it isn't held down, then it goes through and checks if the FPGA has already been reconfigured once since power on, which would mean that the flash menu has already done its job of switching to an upgraded bitstream.  If this isn't the case, then it looks to see if you have a valid, invalid or empty flash slot #1.  If it's valid, it switches to that bitstream. If it's empty, it just returns. If it has invalid contents, then it shows you a message, before entering the flash menu.  We also see here how the flash menu uses the return address provided by the Hypervisor to jump back into the Hypervisor, if the flash menu has nothing to do:

    if (PEEK(0xD6C5)&0x01) {
      // FPGA has been reconfigured, so assume that we should boot
      // normally, unless magic keys are being pressed.
      if ((PEEK(0xD610)==0x09)||(!(PEEK(0xDC00)&0x10))||(!(PEEK(0xDC01)&0x10)))
        {
          // Magic key pressed, so proceed to flash menu after flushing keyboard input buffer
          while(PEEK(0xD610)) POKE(0xD610,0);
        }
      else {
        // We should actually jump ($CF80) to resume hypervisor booting
        // (see src/hyppo/main.asm launch_flash_menu routine for more info)
        POKE(0xCF7f,0x4C);
        asm (" jmp $cf7f ");
      }
    } else {
      // FPGA has NOT been reconfigured
      // So if we have a valid upgrade bitstream in slot 1, then run it.
      // Else, just show the menu.
      // XXX - For now, we just always show the menu

      // Check valid flag and empty state of the slot before launching it.
      read_data(4*1048576+0*256);
      y=0xff;
      valid=1;
      for(x=0;x<256;x++) y&=data_buffer[x];
      for(x=0;x<16;x++) if (data_buffer[x]!=bitstream_magic[x]) { valid=0; break; }
      // Check 512 bytes in total, because sometimes >256 bytes of FF are at the start of a bitstream.
      if (y==0xff) {
        read_data(4*1048576+1*256);
        for(x=0;x<256;x++) y&=data_buffer[x];
      } else {
        //      for(i=0;i<255;i++) printf("%02x",data_buffer[i]);
        //      printf("\n");
        printf("(First sector not empty. Code $%02x)\n",y);
      }

      if (valid) {
        // Valid bitstream -- so start it
        reconfig_fpga(1*(4*1048576)+4096);
      } else if (y==0xff) {
        // Empty slot -- ignore and resume
        POKE(0xCF7f,0x4C);
        asm (" jmp $cf7f ");
      } else {
        printf("WARNING: Flash slot 1 seems to be\n"
               "messed up (code $%02X).\n",y);
        printf("To avoid seeing this message every time, either "
               "erase or re-flash the slot.\n");
        printf("\nPress almost any key to continue...\n");
        while(PEEK(0xD610)) POKE(0xD610,0);
        // Ignore TAB, since they might still be holding it
        while((!PEEK(0xD610))||(PEEK(0xD610)==0x09)) {
          if (PEEK(0xD610)==0x09) POKE(0xD610,0);
          continue;
        }
        while(PEEK(0xD610)) POKE(0xD610,0);
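
Stripped of the keyboard and screen handling, the slot check in the listing above is a three-way classification of the first 512 bytes of the slot. Here is a stand-alone sketch of just that decision. The enum and function names are invented for illustration, and the 16 byte magic is passed in as a parameter, since the real bitstream_magic value isn't shown in this post.

```c
#include <stdint.h>
#include <string.h>

enum slot_state { SLOT_VALID, SLOT_EMPTY, SLOT_INVALID };

/* header: the first 512 bytes read from the slot.
   magic:  the 16 byte signature expected at the start of a core.
   Both names are hypothetical stand-ins for data_buffer and bitstream_magic. */
enum slot_state classify_slot(const uint8_t *header, const uint8_t *magic)
{
    uint16_t i;
    uint8_t all_and = 0xff;

    /* A matching signature means there is a core we can start */
    if (memcmp(header, magic, 16) == 0) return SLOT_VALID;

    /* Erased flash reads back as $FF, so AND-ing every byte together
       only stays at $FF if the whole region is empty */
    for (i = 0; i < 512; i++) all_and &= header[i];

    return (all_and == 0xff) ? SLOT_EMPTY : SLOT_INVALID;
}
```

A valid slot gets started, an empty one is silently skipped, and anything else produces the warning message before the menu is shown.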


But of course don't worry if you can't follow how it works. All you need to know is that you hold the TAB key down while turning the computer off and on, if you want to enter the flash menu.  Otherwise, the MEGA65 will just boot normally, including switching to any upgraded bitstream/core that you have installed via the flash menu.  This process of launching the flash menu to check if it needs to switch to an upgrade bitstream/core, and all the rest, takes less than 0.5 seconds, keeping the MEGA65's boot time faster than most monitors can latch to the video signal.

Now, if you do use the TAB key to force the flash menu to appear, you get a display like this:


If you then press CONTROL and 1 through 7, this will let you choose which core file you want to write to that slot, or alternatively, to erase the slot.  I was lazy and hadn't put any on my SD card this time around, so we just see the "erase slot" option:


If I then hit enter, it will proceed to erase it, and show me a nice old-school progress bar while it does it:

Finally, we have the updated messages in the Hypervisor boot process, that tell us how to get into the flash menu, and of course also into the general utility menu:

 But of course if you have already booted the machine without turning it off first, then the flash menu can't be started, and it will tell you this, and what you should do:

Whew! That took a while. So then I set about creating a little utility called bit2core, that takes a bitstream file, and adds the correct header to it to make it into a COR file for the flash menu to program into the flash.

That all went well, until, that is, I tried to flash a COR file. Then the flash menu refused to find any files... Then I suddenly remembered that the flash menu is now running from within the hypervisor context. This means it can't use the normal Hypervisor Trap mechanism.  Probably the easiest solution here is to implement the basic FAT file system access stuff in the flash menu. Fortunately I can copy that from the fdisk program, which has all of that there.  I just hope it doesn't cause the flash menu to become too big, as we can only use 30KB in this context, and it is already about 23KB...

Fortunately I managed to make that all fit, and I can now use the flash menu to update the contents of flash slots.  I even fixed up the stuff to show the name and version of the bitstream/core that has been installed:


I also tidied up a few loose ends, like making it so that you can't accidentally try to flash over the factory bitstream in slot 0, while still making it possible for a determined user to do it. Basically there is a secret key to press, and then you have to answer a series of increasingly difficult questions.
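For anyone wondering how the slots map onto the flash, the arithmetic is simple. Here is a minimal sketch in C, based only on the 32MB flash and 4MB slot sizes mentioned at the start of this post. The function names are hypothetical and are not taken from the actual flash menu source; the real tool asks its series of questions before setting anything like the `factory_override` flag used here.

```c
#include <stdint.h>

#define SLOT_SIZE   (4UL * 1024UL * 1024UL)   /* 4MB per slot */
#define FLASH_SIZE  (32UL * 1024UL * 1024UL)  /* 32MB flash on the R2 board */
#define SLOT_COUNT  (FLASH_SIZE / SLOT_SIZE)  /* 8 slots, numbered 0 to 7 */

/* Hypothetical helper: byte offset in the flash where a slot begins. */
uint32_t slot_base_address(uint8_t slot)
{
    return (uint32_t)slot * SLOT_SIZE;
}

/* Hypothetical guard: slot 0 (the factory bitstream) may only be written
   when the caller has explicitly overridden the protection.
   Returns 1 if writing is allowed, 0 otherwise. */
int slot_write_allowed(uint8_t slot, int factory_override)
{
    if (slot >= SLOT_COUNT) return 0;            /* no such slot */
    if (slot == 0) return factory_override ? 1 : 0;
    return 1;
}
```

With these numbers, slot 1 starts at $400000 and slot 7 at $1C00000.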

So now it is pretty much all working, certainly well enough for us to share internally, and provided we don't discover any new problems with it, to include as the default bitstream on the MEGA65 DevKits once they are available -- which we hope won't be very far away now.

And if you'd like to see it all in action, here is a video of me installing a core update on the MEGA65:


It covers the full process, so feel free to skip over the boring 5 minutes in the middle :)

Thursday, 20 February 2020

Making the Real-Time Clock Tick

The Real-Time Clock (RTC) chip that Trenz chose for the MEGA65 main board is different to the one we have on the MEGAphone.  While the one on the MEGAphone proved easy to set, the one in the MEGA65 is resisting my efforts to be able to set the time.

The datasheet explains that you have to set the WRTC (Write Real-Time Clock) bit before the time can be set.  The only problem is that this doesn't actually work. My best guess is that the RTC registers all have to be set in a single write for the write to take effect, whereas my SPI controller writes only one byte per transaction.

Although, it might not be required after all.  This Linux driver for the RTC chip writes one byte at a time, and seems to indicate that writing to any of the RTC clock registers should start it running.  However, nothing I have tried actually works to trigger this to occur.

Reading through the data sheet again, I was reminded that the RTC chip is very picky about writing to the clock registers:  It requires that an I2C STOP signal occurs immediately after writing to the clock registers, so that partial writes will be ignored to help protect the integrity of the clock.

This led me through quite an adventure of tracking down and fixing a bunch of I2C management errors in mega65_i2c.vhdl, which collectively caused that problem, along with having the potential to cause some other I2C glitches.   What it boiled down to, was that I was not allowing the I2C bus to go idle between the loop that reads all register values for displaying in the memory map, and when it writes a new value when requested by the CPU -- and vice versa.

With that fixed, I was finally seeing the STOP condition, which consists of the SDA line going high, while SCL stays high, as indicated by the vertical red-ish line in this simulation trace:


With that fixed and synthesised, I was then finally able to write to the RTC clock, and set it ticking (it doesn't start ticking until it has been initially set).
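When staring at a simulation trace or logic analyser dump, it can help to classify the bus transitions mechanically. The following is a small sketch in C of how START and STOP conditions can be recognised from two consecutive samples of SCL and SDA. It is purely illustrative and has nothing to do with how mega65_i2c.vhdl is written; the names are made up for this example.

```c
#include <stdint.h>

enum i2c_event { I2C_NONE, I2C_START, I2C_STOP };

/* Classify the change between two consecutive samples of the bus.
   START: SDA falls while SCL stays high.
   STOP:  SDA rises while SCL stays high.
   Any SDA change while SCL is low is just ordinary data setup. */
enum i2c_event i2c_classify(uint8_t prev_scl, uint8_t prev_sda,
                            uint8_t scl, uint8_t sda)
{
    if (prev_scl && scl) {
        if (prev_sda && !sda) return I2C_START;
        if (!prev_sda && sda) return I2C_STOP;
    }
    return I2C_NONE;
}
```

Feeding each pair of adjacent samples through `i2c_classify()` makes a missing STOP (the bus never going idle between transactions) easy to spot.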

To make it easier for people to use, I have added getrtc() and setrtc() functions to the MEGA65 libc that I am writing. I have also added some initial documentation of the registers to the MEGA65 Book.  I also updated the i2cstatus test program to show the current RTC (and other target specific information).  It also allows editing of the RTC value:






Thanks to the libc functions that I wrote, the code to read and display the current time and date is quite simple:

void show_rtc(void)
{
    /* tm is declared elsewhere in the program (its declaration is not shown here) */
    getrtc(&tm);

    printf("Real-time clock: %02d:%02d.%02d",
           tm.tm_hour,tm.tm_min,tm.tm_sec);
    printf("\n");

    printf("Date:            %02d-",tm.tm_mday+1);
    switch(tm.tm_mon) {
    case 1: printf("jan"); break;
    case 2: printf("feb"); break;
    case 3: printf("mar"); break;
    case 4: printf("apr"); break;
    case 5: printf("may"); break;
    case 6: printf("jun"); break;
    case 7: printf("jul"); break;
    case 8: printf("aug"); break;
    case 9: printf("sep"); break;
    case 10: printf("oct"); break;
    case 11: printf("nov"); break;
    case 12: printf("dec"); break;
    default: printf("invalid month"); break;
    }
    printf("-%04d\n",tm.tm_year+1900);

}


The main trick there is that we need to use a MEGA65 Enhanced DMA operation to fetch the RTC registers, because the RTC registers sit above the 1MB barrier, which is the limit of the C65's normal DMA operations.  The easiest way to do this is to construct a little DMA list in memory somewhere, and make an assembly language routine that uses it.  Something like this (using BASIC 10 in C65 mode):

10 RESTORE 110:FORI=0TO43:READA$:POKE1024+I,DEC(A$):NEXT:BANK 128:SYS1042
20 S=PEEK(1024):M=PEEK(1025):H=PEEK(1026)
30 D=PEEK(1027):MM=PEEK(1028):Y=PEEK(1029)+DEC("2000")
40 IF H AND 128 GOTO 80
50 PRINT "THE TIME IS ";RIGHT$(HEX$(H AND 63),1);":";RIGHT$(HEX$(M),2);".";RIGHT$(HEX$(S),2)
60 IF H AND 32 THEN PRINT "PM": ELSE PRINT "AM"
70 GOTO 90
80 PRINT "THE TIME IS ";RIGHT$(HEX$(H AND 63),1);":";RIGHT$(HEX$(M),2);".";RIGHT$(HEX$(S),2)
90 PRINT "THE DATE IS";RIGHT$(HEX$(D),2);".";RIGHT$(HEX$(MM),2);".";HEX$(Y)
100 END
110 DATA 0B,80,FF,81,00,00,00,08,00,10,71,0D,20,04,00,00,00,00
120 DATA A9,47,8D,2F,D0,A9,53,8D,2F,D0,A9,00,8D,02,D7,A9
130 DATA 04,8D,01,D7,A9,00,8D,05,D7,60

This program works by setting up a DMA list in memory at $0400 (unused normally on the C65), followed by a routine at $0412 ( = 1,042 in decimal) which ensures we have MEGA65 registers unhidden, and then sets the DMA controller registers appropriately to trigger the DMA job, and then returns.  The rest of the BASIC code PEEKs out the RTC registers that the DMA job copied to $0400 -- $0407, and interprets them appropriately to print the time.
The curious can use the MONITOR command, and then D0412 to see the routine.
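The BASIC listing gets away with printing the registers via HEX$ because the values are stored as packed BCD. For C programs, a rough sketch of the same decoding might look like the following. The flag bits are assumptions read off the BASIC listing above (bit 7 of the hour register selecting 24-hour mode, bit 5 being the PM flag otherwise), not something checked against the RTC data sheet, and the function names are made up.

```c
#include <stdint.h>

/* Convert one packed-BCD byte (e.g. 0x59) to its decimal value (59). */
uint8_t bcd_to_dec(uint8_t bcd)
{
    return (uint8_t)(((bcd >> 4) * 10) + (bcd & 0x0F));
}

/* Decode the raw hour register into an hour in the range 0..23.
   Assumption (taken from the BASIC listing): bit 7 set means 24-hour
   mode; otherwise bit 5 is the PM flag and the low 5 bits hold 1..12. */
uint8_t rtc_hour_24(uint8_t reg)
{
    uint8_t h;

    if (reg & 0x80) return bcd_to_dec(reg & 0x3F);

    h = bcd_to_dec(reg & 0x1F) % 12;   /* 12 o'clock wraps to 0 */
    if (reg & 0x20) h += 12;           /* PM: add twelve hours */
    return h;
}
```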

If you want a running clock, you could replace line 100 with GOTO 10.  Doing that, you will get a result something like the following:



If you first POKE0,65 to set the CPU to full speed, the whole program can run many times per second. There is an occasional glitch, if the RTC registers are read while being updated by the machine, so we really should de-bounce the values by reading the time a couple of times in succession, and if the values aren't the same both times, then repeat the process until they are. This is left as an exercise for the reader.
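For those who would rather not do the exercise, here is one possible shape of the answer, sketched in C rather than BASIC. It reads the registers repeatedly until two consecutive reads agree. The `rtc_read_fn` callback is hypothetical: on real hardware it would wrap whatever fetches the eight register bytes (for example the DMA job above). The `fake_rtc_read()` function only exists so the routine can be tried on a PC.

```c
#include <stdint.h>
#include <string.h>

#define RTC_REG_COUNT 8   /* the DMA job above copies 8 bytes */

/* Hypothetical callback: fills buf with RTC_REG_COUNT raw register bytes. */
typedef void (*rtc_read_fn)(uint8_t *buf);

/* Keep reading until two consecutive reads are identical.
   Returns 1 and fills out on success, 0 if no stable pair was
   seen within max_tries extra reads. */
int rtc_read_stable(rtc_read_fn read_regs, uint8_t *out, int max_tries)
{
    uint8_t a[RTC_REG_COUNT], b[RTC_REG_COUNT];
    int i;

    read_regs(a);
    for (i = 0; i < max_tries; i++) {
        read_regs(b);
        if (memcmp(a, b, RTC_REG_COUNT) == 0) {
            memcpy(out, a, RTC_REG_COUNT);
            return 1;
        }
        memcpy(a, b, RTC_REG_COUNT);   /* compare against the newer read next time */
    }
    return 0;
}

/* Simulated register source for experimenting on a PC: the first read is
   "torn" (minutes already rolled over, seconds not yet), later reads agree. */
int fake_calls = 0;

void fake_rtc_read(uint8_t *buf)
{
    memset(buf, 0, RTC_REG_COUNT);
    buf[0] = (fake_calls == 0) ? 0x59 : 0x00;   /* seconds, BCD */
    buf[1] = 0x01;                              /* minutes, BCD */
    fake_calls++;
}
```

With the simulated source, the torn first read is discarded and the routine settles after three reads in total.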

Finally, I updated the auto-documentation in the VHDL source, so that the new registers will be automatically documented in the MEGA65 Book, which is already getting close to 500 pages.  I'm expecting that the MEGA65 Book will be close to 1,000 pages when complete -- not that this should be scary, but rather reflect the depth of documentation that we want to provide potential users and developers of the machine.

Sunday, 16 February 2020

Getting the external microSD card slot working

The internal SD card slot has been working for ages, but we never managed to get the external one working when we had the sprint to bring the MEGA65 R2 board up last year.  So I am trying to finally fix this.

It's been an interesting problem, with a few unexpected things.

First, I had to understand how the microSD card interface and MAX10 JTAG interface share the microSD connector.  Trenz designed it this way, so that if you hold the reset button in on the side of the MEGA65, you can connect a JTAG breakout into the microSD slot in order to programme the MAX10 FPGA.

Initially I thought that I had to manually direct the pins, or control the JTAG enable pin.  However, it turned out a bit simpler: The MAX10 JTAG pins and the main Xilinx FPGA microSD card pins are just both connected in parallel on the microSD connector. This means if one device tri-states the lines, the other can control it.

However, it wasn't exactly that simple, as it still didn't work.  Looking closer, a couple of the microSD slot lines are connected not to the MAX10's JTAG interface, but to some other IO lines.  This meant that the MAX10 was trying to drive those lines, in particular the Card Select line, high, regardless of what the Xilinx FPGA was doing.  I realised that this had to be the case when I found that increasing the drive strength of the Xilinx FPGA on this pin to 24mA enabled it to be controlled. But this is not a great solution, as it means that almost 100mW of extra power is being dissipated through the cross-driving, and it could eventually damage something.

So this meant adjusting the program for the MAX10 FPGA to tri-state the pin in question.  The change was only one line, to set the line to input.  However, it has been ages since I have programmed the MAX10 FPGA, and it took me a few hours of trying things to remember that I had to run jtagd as root before running the program.sh or flash.sh scripts in https://github.com/MEGA65/mega65-r2-max10.  I've now added those commands into the scripts, so that I don't have to remember this in future.

I also got tripped up by the fact that the Xilinx FPGA tells the MAX10 to reconfigure during its startup process, which meant that my updated bitstream was getting thrown away without me noticing.  This was fine, once I realised it, as all I had to do was flash the new bitstream into place.

At this point, I could again control the Card Select line without having to overdrive the line. However, the SD card interface was still not working.  Probing with the oscilloscope revealed that the clock pin was not being driven.  Digging through, it looks like I made the SD card interface switching logic to assume a common clock for both SD card busses, but never actually tied them together.  So I fixed this by making the clock on the correct bus toggle, instead of having them both toggle all the time.  Again the changes were relatively small.

So now it's the waiting game again, while I resynthesise the design, in the hope that it will all work.  But even if it doesn't work immediately, I am feeling much more confident, as I have a good understanding of all of the relevant parts.

Two more problems remained: The Chip Select line wasn't being driven, and there were some related "signal plumbing" problems, that prevented things from working.

Now, finally, it is possible to run a MEGA65 from the external microSD card slot! :)

Obligatory photos showing the MEGA65 booting from the external microSD card:





Sunday, 26 January 2020

Almost 65K€ in $65 days!

As many reading will know, the German arm of the project has been running a fund raiser since mid-October last year, with the goal of funding the manufacture of the injection-moulding tooling to build the cases for the MEGA65.

The main concern that people raised about this was that it was rather unusual, and some felt that we would be unlikely to get even close to the amount required in any reasonable time period, because people would be reluctant to simply donate without getting any traditional perk, or a discount off the end price of a machine.
Nonetheless, in the first three months, the community donated around 20,000€, which is by any measure an amazingly generous result, and for which we are thankful to all of those who contributed directly, and/or who promoted the project.

But now as we approach the 101st day of this campaign, or better, the $65th day, if we count in hexadecimal, we suddenly fund ourselves on the cusp of having enough money, thanks to 40,000€ coming in over the weekend!  Here is our happy little graph of progress:


The plateau has turned into a hockey stick :)

This means that we now have over 60,000 of the 65,000€ needed for the injection moulding tooling to be produced for us by Hintsteiner in Austria, and certainly enough for us to proceed with the next steps with them, to get the moulds produced.

From here to having the first injection-moulded cases in our hands will likely be close to 6 months, as the whole injection mould tooling process is not something that can be rushed.  The actual production of the tools takes a number of weeks, as the block of tool-steel is etched out using electrical sparks. 

But before we get to that point, we have to be really sure that we have the case design exactly right, because it is really, really expensive to fix later. Good industrial design and fabrication companies, like Hintsteiner, manage this risk by producing prototypes, and using trusted engineering partners throughout the process, meaning that you get what you want the first time.  This is one of the many reasons why we have chosen to go with Hintsteiner: We want the MEGA65 to be of the best possible quality.  Also, we want the manufacture process for the MEGA65 to all be easily reachable, manageable and controllable. Thus we have a cluster of partners all relatively nearby in Germany and Austria, so that we can produce batches over time, without having to worry about some disengaged partner on the other side of the world melting down our injection moulding tools because they thought we weren't going to order any more parts (or even just not enough parts to keep them interested).

This all requires close engagement with Hintsteiner over the coming months, which will be a focus for many of us on the team, as we look to get everything right the first time.  On that note, we are already thinking about the final set of port holes on the back of the case.  We are currently leaning towards having a cut-out for a user-port and cassette port -- even though these will require expansion boards of some sort that will connect in some way to the MEGA65 motherboard.  We'd be interested in hearing if there are any other holes for ports that people think we have forgotten -- because now is the time to pin this down, and it will be thereafter forever set in stone (well, in tool-steel ;).

Hardware is, as they say, hard, so we expect we will have more little challenges to solve along the way (such as having found out that there are some bigger up-front costs for the beautiful mechanical keyboards from GMK), but these are much smaller and more manageable, now that we have the funding for the injection moulding tools for the case mostly complete (there is still at the time of writing a few thousand Euro until the 66,000€ total is reached, for those who are interested in securing a position in the top 8 backers, in order to receive the benefits that come with that.  See mega65.org for more information.)

But in the meantime, let's all celebrate that this massive funding hurdle for the project has been dealt with, and that the path is now much clearer for us to proceed.

Friday, 17 January 2020

Adventures in talking to the QSPI flash

I am getting closer to being able to communicate with the QSPI flash, so that we can have the MEGA65 update its own bitstreams in the field. To recap the current situation:

1. Most of the signals to the QSPI flash are easy to connect to, except the clock, which is normally driven by the FPGA's configuration logic.

2. The FPGA has a facility, the STARTUPE2 component, that allows the running bitstream to take control of this signal.

3. I have managed to achieve (2) in a test bitstream, as confirmed by my new JTAG boundary scan setup.

4. But I haven't got it working for a real bitstream.

To get to this point, from the last blog post, I discovered that the STARTUPE2 component *must* be in the top level of a design.

The question is now why in the real bitstream, it still isn't working, even though I have moved it to the top level.

Basically it works in the pixeltext test target, that lacks a M65 computer, but not in the nexys4ddr-widget target. More weird, when I removed the M65 computer component out of this second target, it still isn't working.

This makes me suspect that there might be some kind of target setup in the Vivado project that is to blame. There is a "persist" flag that can be used, which causes the configuration clock to remain active on the QSPI clock pin.  That could be the problem -- but then I would still be expecting to see the line waggle, which it doesn't seem to.

However, digging further, I did manage to control the line with the M65 computer component taken out of the real bitstream.  Now I am trying to put it back in, but with a dedicated 1Hz clock on the pin, so that I can eliminate internal problems in the plumbing of the line to the register I had it hooked up to.  Basically I can keep pushing the connection deeper down into the design, until it is in the component where I was controlling it.

Ok, so with the full machine core, and the 1Hz clock in the outer layer, I can control the clock line. Next step is from in the sdcardio.vhdl file where it gets connected to, to see if I can toggle it there under automatic control.  If that works, then I must have some subtle bug in the register plumbing. If not, then the plumbing problem must be between sdcardio.vhdl and the outer layer of the design. Either way, I will be able to considerably narrow down where the problem can be hiding.

So, the clock toggles, meaning the problem is probably in sdcardio.vhdl somewhere...

Okay.... So, this is one of those funny bug fixes that I really hate. It could well be that I have done something really stupid, but if so, I am ignorant to what it is.  But the solution was to create a 2nd register to control the QSPI clock at $D6CD.  With that implemented, magically $D6CC works to control the clock.  I've had this kind of problem before with VHDL, where possibly something is incorrectly optimising out the ability to write to some signal.  Anyway, it is solved for now.

Then I started trying to investigate things, and came to the rapid conclusion that my life would be so much nicer, if I could make my new JTAG boundary scanner produce industry-standard VCD files that I could view in gtkwave, to get a more effective understanding of what is going on.  So I did. It wasn't too hard, and now I can produce pretty pictures like this:

Which is helpfully showing me that I can waggle the clock line, and also control the CS (chip select) line, but that the data lines are seemingly not doing anything.  But I know from prior experimentation that I can indeed control these lines, so this is probably an example of me having an error in my test program.  But how nice it is to be able to determine that in just a few seconds :)
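For readers who have not met VCD files before, the format is simple enough that a minimal writer fits in a few dozen lines. The sketch below is not the code in monitor_load; it is just an illustration in C of the general idea, with made-up names, a 1us timescale picked arbitrarily, and a limit of eight single-bit signals. It writes the header, then only the value changes, and returns how many changes it emitted.

```c
#include <stdio.h>
#include <stdint.h>

/* One sample: a timestamp plus up to 8 pin states (bit i = signal i). */
struct pin_sample {
    unsigned long time;
    uint8_t pins;
};

/* Write the samples to out as a VCD document. Each signal gets a
   single-character identifier code starting at '!'.
   Returns the number of value changes written, or -1 on bad arguments. */
int vcd_write(FILE *out, const char *const *names, int nsignals,
              const struct pin_sample *samples, int nsamples)
{
    int i, s, changes = 0;
    uint8_t mask, changed, prev = 0;

    if (nsignals < 1 || nsignals > 8) return -1;
    mask = (uint8_t)((1u << nsignals) - 1u);

    /* Header: timescale, one scope, one $var line per signal */
    fprintf(out, "$timescale 1us $end\n");
    fprintf(out, "$scope module pins $end\n");
    for (i = 0; i < nsignals; i++)
        fprintf(out, "$var wire 1 %c %s $end\n", '!' + i, names[i]);
    fprintf(out, "$upscope $end\n");
    fprintf(out, "$enddefinitions $end\n");

    /* Body: dump every signal at the first sample, then only changes */
    for (s = 0; s < nsamples; s++) {
        changed = (s == 0) ? mask
                           : (uint8_t)((samples[s].pins ^ prev) & mask);
        if (changed) {
            fprintf(out, "#%lu\n", samples[s].time);
            for (i = 0; i < nsignals; i++) {
                if (changed & (1u << i)) {
                    fprintf(out, "%d%c\n",
                            (samples[s].pins >> i) & 1, '!' + i);
                    changes++;
                }
            }
        }
        prev = samples[s].pins;
    }
    return changes;
}
```

Pointing `out` at a file opened with `fopen()` gives something gtkwave should be able to load directly.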

Digging through this, I fixed the initial problem, but also found I had the SO and SI lines switched around from the way they should be, so that will need a resynthesis...  Well, then I wasn't so sure, so I made it so that the four data lines are open-collector with internal pull-ups in the FPGA. This means that the lines can be either driven low, or float high.  This means I can fiddle with which line is which etc, without having to resynthesise each time.

However, I am seeing some quite weird things with the data lines when I look at the JTAG traces:

So let me explain what we have here.  Because I was seeing weird things, I made a test program that tries every possible value on the four data lines, CS and clock pins to the QSPI flash.  The open-collector operation means that the direction pins (the .ctl pins in the lower half) basically indicate what we *should* be seeing on the actual pins (in the top half).  This holds true for QspiDB[2], QspiDB[3], QspiCSn and the clock, but not for QspiDB[1] and QspiDB[0]: These two pins switch a short time later.  This would only make real sense, if the QSPI flash was pulling those lines down (remember, open-collector outputs "float" high, so any device connected to them can pull them down to ground), or there is something really fishy going on with the FPGA control of those pins.  I now need to try to solve this riddle.
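The open-collector rule is easy to state as code: the line is low if any connected device pulls it low, and is only high if nobody does and something is pulling it up. Here is a tiny model of that in C. It is a thinking aid rather than anything from the MEGA65 sources, and the names are invented.

```c
/* Possible states of a shared open-collector line. */
enum line_level { LINE_LOW = 0, LINE_HIGH = 1, LINE_FLOATING = 2 };

/* pulls_low[i] is non-zero if device i is currently pulling the line to
   ground. If nobody pulls low, the line is high only when a pull-up
   resistor is present; otherwise it is left floating. */
enum line_level open_collector_level(const int *pulls_low, int ndevices,
                                     int has_pullup)
{
    int i;

    for (i = 0; i < ndevices; i++)
        if (pulls_low[i]) return LINE_LOW;

    return has_pullup ? LINE_HIGH : LINE_FLOATING;
}
```

In this model, an FPGA that has released the line will still read it low whenever the flash is pulling it down, which is one of the two explanations considered above.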

Let's look first at FPGA control of the pins as a potential cause. As the other pins don't exhibit this strange behaviour, and the four DB pins are all controlled in an identical manner, I find it hard to believe that the problem is there.  That leaves the QSPI flash as the current primary suspect.

First stop: Check the schematics.  Nothing sinister here on the Nexys4DDR boards: the QSPI flash is directly connected to the FPGA, with only some external pull-up resistors, which can't cause this funny problem I am seeing.
So that suggests it is most likely just the way that I am communicating with the QSPI flash.

Poking around, it seems that DB0 only changes (or is only changeable) when  CS is high. This makes sense, as when CS is high, the QSPI flash is not active, and so shouldn't be trying to drive any lines. When it is low, then DB1 stays tied low.  This makes me 99% sure that DB1 is the line from the QSPI to the FPGA, and DB0 is the command line from the FPGA to the QSPI.

This means, in theory at least, that I should be able to talk to the QSPI flash, if I drive the correct waveform. However, so far at least, there are no signs of active response from the QSPI flash.  And looking at the trace, here we see this weird problem again: The DB0 signal stays low for one clock tick longer than it is being pulled low:

This is really weird. I can slow the clock down even more (it's currently less than 1KHz, anyway) to the point where it looks much better, but this feels altogether wrong: The FPGA can read out its bitstream from this QSPI interface at 66MHz, so ~660Hz should be absolutely no problem!  The 1.8KOhm pull ups should be able to pull these lines high in <1 micro second, but we are seeing rise (or delay) times of >1 milli second -- a thousand times slower.
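To put rough numbers on that expectation, here is a back-of-the-envelope calculation written as C (meant to be run on a PC, since it uses floating point). The 50pF line capacitance is purely an assumed, typical-looking figure for a short PCB trace plus a couple of pins; it was not measured on the board.

```c
/* RC time constant in seconds for a pull-up resistor charging a line. */
double rc_time_constant(double r_ohms, double c_farads)
{
    return r_ohms * c_farads;
}

/* A line is essentially settled (about 99%) after five time constants. */
double rc_settle_time(double r_ohms, double c_farads)
{
    return 5.0 * rc_time_constant(r_ohms, c_farads);
}

/* Capacitance that would be needed for a given time constant. */
double implied_capacitance(double tau_seconds, double r_ohms)
{
    return tau_seconds / r_ohms;
}
```

With 1.8KOhm and the assumed 50pF, `rc_settle_time(1800.0, 50e-12)` comes out at about 450 nanoseconds, comfortably under a microsecond. Turning it around, `implied_capacitance(1e-3, 1800.0)` is roughly 0.56 microfarads, which is far more than any PCB trace could plausibly have, so something other than a 1.8KOhm pull-up must be setting the rise time.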

This bizarre delay occurs whether the QSPI flash is selected via the CS line, or not.  This would seem to suggest that it is not the QSPI flash to blame -- unless it is in some strange mode following the FPGA configuration process. 

Ok, looking again at the schematic, there are indeed 1.8K pull-ups on the DB2 and DB3 lines, but not on DB0 or DB1. This means that it is possible that running these lines open-collector might not be practicable. So I resynthesised with the ability to push those lines actively high, as well as pull them low, or tri-state them, as before.  Now by actively pushing them, they respond immediately, as expected. So now I can send a byte via the SPI interface, and it all looks right:


Of course, it still isn't working. But that could be because I just realised I am sending the bits least-significant-bit first, instead of most-significant-bit first. And indeed, that suddenly gets it responding to me!
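The bit-order mistake is an easy one to make, so here is what most-significant-bit-first sending looks like as a minimal C sketch. The `spi_bit_fn` hook is hypothetical: in a real bit-bashing routine it would set the data line and pulse the clock, whereas the `record_bit()` helper here only logs the bits so the order can be checked on a PC. None of this is the actual qspitest source.

```c
#include <stdint.h>

/* Hypothetical pin hook: called once per bit with the level (0 or 1)
   to present on the data line. A real version would also toggle the clock. */
typedef void (*spi_bit_fn)(uint8_t level);

/* Send one byte, most significant bit first. */
void spi_send_byte_msb_first(uint8_t value, spi_bit_fn put_bit)
{
    uint8_t i;

    for (i = 0; i < 8; i++) {
        put_bit((value & 0x80) ? 1 : 0);   /* top bit goes out first */
        value <<= 1;                       /* move the next bit into place */
    }
}

/* Recorder for trying this on a PC: stores bits in the order sent. */
uint8_t sent_bits[8];
uint8_t sent_count = 0;

void record_bit(uint8_t level)
{
    if (sent_count < 8) sent_bits[sent_count++] = level;
}
```

Sending $9F this way puts 1,0,0,1,1,1,1,1 on the wire; the least-significant-bit-first version would have sent the mirror image, which the flash would see as a different byte.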

Now we're finally getting somewhere :)  Again, I am so glad I implemented this VCD logger and JTAG boundary scan stuff.

Of course I could have just figured out how to do it from in Vivado, but it's so much nicer to have a little light-weight and open-source tool.  Also, by having it integrated in monitor_load, I can do multiple things all in one quick action.  Here is how I run the test program, and then ask monitor_load to sample those pins -- all in one single command:

make src/tests/qspitest.prg && src/tools/monitor_load -F -4 -r src/tests/qspitest.prg -V log.vcd -J src/vhdl/nexys4ddr-widget.xdc,${HOME}/build/artix7/public/bsdl/xc7a100tl_csg324.bsd,qspisck,qspicsn,qspidb[3],qspidb[2],qspidb[1],qspidb[0]

Okay, so it's a bit of a long command, but that's what pressing the up arrow in a shell is all about, so you can just use it again and again, without having to re-type it.

When that command has logged the pins for long enough, I just hit control-C, and then launch gtkwave on the resulting log.vcd file, with a little tiny script that tells it to automatically show all signals:

gtkwave -S allsigs.tcl log.vcd

So the whole work-flow is now super easy and efficient.

But anyway, back to figuring out why the test program doesn't read the data from the SPI response correctly... It's currently reading all ones, i.e., not noticing when the DB1 line goes low. Adding a short delay fixes this. Not entirely sure why. But with that, I can finally read some useful things out of the chip, and display them:

QSPI FLASH MANUFACTURER = $01          
QSPI DEVICE ID = $2018                 
RDID BYTE COUNT = 77                   
SECTOR ARCHITECTURE IS 4KB PARAMETER SEC
TORS WITH 64KB SECTORS.                
PART FAMILY IS 8000                    
256/512 BYTE PROGRAM TYPICAL TIME IS 2^8
 MICROSECONDS.                         
ERASE TYPICAL TIME IS 2^8 MILLISECONDS.
 01 80 30 30 80 FF FF FF               
 FF FF FF FF 51 52 59 02               
 00 40 00 53 46 51 00 27               
 36 00 00 06 08 08 0F 02               
 02 03 03 18 02 01 08 00               
 02 1F 00 10 00 FD 00 00               
 01 FF FF FF FF FF FF FF               
 FF FF FF FF 50 52 49 31               
                                       
READY.                                 

I confirmed with the data sheet that these data are broadly sensible.  So the next step will be to extract all the relevant data out, e.g., the information I need to programme the device, and after that, to implement simple block read, erase and write functions... Which turned out to be remarkably painless, if rather boring internally.  The more exciting part will be in the next post, where I (hopefully) actually implement writing of bitstreams to the QSPI flash.
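As a taste of that extraction step, here is a hedged sketch in C of pulling the first few fields out of the identification bytes. It assumes the conventional JEDEC read-ID layout (one manufacturer byte followed by a two-byte device ID, high byte first), which matches the $01 and $2018 shown above, and it treats the "typical time" fields as plain power-of-two exponents, as the printout suggests. The struct and function names are invented for the example.

```c
#include <stdint.h>

struct flash_id {
    uint8_t  manufacturer;   /* e.g. $01 */
    uint16_t device_id;      /* e.g. $2018 */
};

/* Fill in id from the first three identification bytes.
   Assumed layout: byte 0 = manufacturer, bytes 1-2 = device ID (MSB first). */
void parse_rdid(const uint8_t *rdid, struct flash_id *id)
{
    id->manufacturer = rdid[0];
    id->device_id = (uint16_t)(((uint16_t)rdid[1] << 8) | rdid[2]);
}

/* Typical times are reported as 2^n units; turn the exponent into a count.
   Returns 0 for exponents that would not fit in 32 bits. */
uint32_t typical_time_from_exponent(uint8_t n)
{
    if (n > 31) return 0;
    return (uint32_t)1 << n;
}
```

So an exponent of 8 means 256 microseconds for a typical page program, and 256 milliseconds for a typical erase.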

Thursday, 9 January 2020

Programming the Bitstream Boot Flash and all things JTAG

So, in the last post, I implemented the ability to tell the MEGA65 to switch to a different bitstream. The next challenge is to make it possible for the MEGA65 to be able to re-program the contents of the flash memory, so that we can supply people with an  updated bitstream, and make it super-easy to upgrade the MEGA65.

First piece of detective work was to realise that we can take a .bit file, remove the 120 byte header, and write it directly to the flash somewhere, and it should Just Work (tm).
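In code terms, that amounts to skipping a fixed number of bytes before writing the rest to flash. A minimal sketch in C follows, assuming the whole file has already been loaded into a buffer. The 120-byte figure is simply the one quoted above; the header of a given .bit file may well be a different length, so a more robust tool would probably locate the end of the header rather than hard-code it. The names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define BIT_HEADER_LEN 120u   /* figure quoted in the text; may vary per file */

/* Return a pointer to the raw bitstream inside a loaded .bit file and
   store its length in *payload_len. Returns NULL if the file is too
   short to contain anything after the header. */
const uint8_t *bit_payload(const uint8_t *file, size_t file_len,
                           size_t *payload_len)
{
    if (file_len <= BIT_HEADER_LEN) return NULL;

    *payload_len = file_len - BIT_HEADER_LEN;
    return file + BIT_HEADER_LEN;
}
```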

So now I need to be able to talk to the SPI boot flash. This is a bit tricky, because the FPGA boot process controls the clock line to this device. Fortunately, there is a way to put this back under control of the VHDL:  Basically you use this slightly magic STARTUPE2 thing, and feed it a clock:

  STARTUPE2_inst: STARTUPE2
    generic map(PROG_USR=>"FALSE", --Activate program event security feature.
                                   --Requires encrypted bitstreams.
                SIM_CCLK_FREQ=>0.0 --Set the Configuration Clock Frequency(ns) for simulation.
    )
    port map(CFGCLK=>CFGCLK,--1-bit output: Configuration main clock output
             CFGMCLK=>CFGMCLK,--1-bit output: Configuration internal oscillator
                              --clock output
             EOS=>EOS,--1-bit output: Active high output signal indicating the
                      --End Of Startup.
             PREQ=>PREQ,--1-bit output: PROGRAM request to fabric output
             CLK=>CLK,--1-bit input: User start-up clock input
             GSR=>GSR,--1-bit input: Global Set/Reset input (GSR cannot be used
                      --for the port name)
             GTS=>GTS,--1-bit input: Global 3-state input (GTS cannot be used
                      --for the port name)
             KEYCLEARB=>KEYCLEARB,--1-bit input: Clear AES Decrypter Key input
                                  --from Battery-Backed RAM (BBRAM)
             PACK=>PACK,--1-bit input: PROGRAM acknowledge input
             USRCCLKO=>spi_clock,--1-bit input: User CCLK input
             USRCCLKTS=>USRCCLKTS,--1-bit input: User CCLK 3-state enable input
             USRDONEO=>USRDONEO,--1-bit input: User DONE pin output control
             USRDONETS=>USRDONETS--1-bit input: User DONE 3-state enable output
             );

The important bits here are the USRCCLK and USRDONE signals.  Basically the first pair of signals let us control the clock to the SPI flash, while the second lets us control the DONE signal, which the FPGA normally outputs high when it is configured.  We just have to keep that one behaving normally, since the MAX10 FPGA depends on it.


When I first attempted to implement this, the system failed to come up.  After a lot of poking around and inadequate documentation from Xilinx, I found this project, that actually showed a working instantiation. From there it wasn't long, before I at least had a working bitstream.

It's actually likely to be helpful for the rest of this part, as well, because it actually does everything that I want: i.e., it allows programming of a connected QSPI flash memory. I'm glad to have finally found some source code that I can look at when I get stuck, to see how others have solved the same problems.

So in theory at this point, I have a bitstream with working ICAPE2 for bitstream switching, and now, a bit-bashing interface that *should* allow me to talk to the QSPI flash.  So I started writing a little test program for that, that basically tries to read some device information from the QSPI chip.

So, not entirely surprisingly, the test program doesn't work, in that it doesn't return the device ID.  If the pins for the QSPI flash chip were exposed on the PCB, I'd be able to stick my oscilloscope on them, and waggle them in software, and make sure that everything is correct.  However, as both the FPGA and QSPI flash are BGA parts with no exposed pins, there is no such possibility.

It should be possible, however, to use JTAG debugging tools to read the pin status of every pin on the FPGA.  The trick is how to do this easily from command line on linux.

The UrJTAG package provides the jtag command that *should* be able to do this.  After some hunting for info, the following should work to detect a MEGA65 connected via the USB debug cable:

jtag> cable FT2232 vid=0x0403 pid=0x6010

Then the detect command should show something connected, like this:

jtag> detect
IR length: 6
Chain length: 1
Device Id: 00010011011000110001000010010011 (0x13631093)
  Manufacturer: Xilinx (0x093)
  Unknown part! (0011011000110001) (/usr/share/urjtag/xilinx/PARTS)

That's looking good, except that the Artix7 FPGA is not in the part list.
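The Device Id line is worth a closer look, because the 32-bit JTAG IDCODE has a fixed layout: bit 0 is always 1, bits 1 to 11 are the manufacturer code, bits 12 to 27 the part number, and bits 28 to 31 the version. Here is a small C sketch that splits it up (the struct and function names are made up for illustration):

```c
#include <stdint.h>

struct jtag_idcode {
    uint8_t  version;        /* bits 31..28 */
    uint16_t part;           /* bits 27..12 */
    uint16_t manufacturer;   /* bits 11..1  */
};

/* Split a 32-bit JTAG IDCODE into its standard fields. */
void decode_idcode(uint32_t idcode, struct jtag_idcode *out)
{
    out->version      = (uint8_t)((idcode >> 28) & 0x0F);
    out->part         = (uint16_t)((idcode >> 12) & 0xFFFF);
    out->manufacturer = (uint16_t)((idcode >> 1) & 0x7FF);
}
```

For 0x13631093 this gives version 1, part 0x3631 and manufacturer 0x049. The 0x093 that UrJTAG prints is the same manufacturer field with the always-set bit 0 still attached, i.e. the low 12 bits of the IDCODE.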

There is, however, a newer version of UrJTAG, that has been patched to support the Artix7 series, and even has a boundary scan file for at least one version of the chip -- that should allow us to map the JTAG output to actual pins, which will be very helpful for us. Unfortunately, the pre-built package for Ubuntu lacks this, so I need to build it from scratch.

Building UrJTAG is proving interesting, because it needs ftd2xx.h, which I can't figure out which package on Linux provides. It looks like it might come from here: https://www.ftdichip.com/Drivers/D2XX.htm. You have to copy the include files from the release/ directory into the build directory for UrJTAG, and then it seems to build.

So, building UrJTAG is a bit of a pain. The "make install" script basically doesn't work, so you have to do all that yourself.  With the new jtag binary, I now get this:

jtag> cable FT2232 vid=0x0403 pid=0x6010
Connected to libftd2xx driver.
jtag> detect
IR length: 6
Chain length: 1
Device Id: 00010011011000110001000010010011 (0x13631093)
  Manufacturer: Xilinx (0x093)
  Part(0):      xc7a100t (0x3631)
error: Unable to open file '/usr/local/share/urjtag/xilinx/xc7a100t/STEPPINGS'
  Unknown stepping! (0001) (/usr/local/share/urjtag/xilinx/xc7a100t/STEPPINGS)

So, that's a step forward, but I have no idea yet where to get this STEPPINGS file from, or if it really is necessary. Ah, that was also just a problem with the install script not working. After manually copying the data/ directory's contents into /usr/local/share/urjtag, it works:

jtag> detect
IR length: 6
Chain length: 1
Device Id: 00010011011000110001000010010011 (0x13631093)
  Manufacturer: Xilinx (0x093)
  Part(0):      xc7a100t (0x3631)
  Stepping:     1
  Filename:     /usr/local/share/urjtag/xilinx/xc7a100t/xc7a100t-csg324

This is all very nice, except that it thinks it is the 324 pin part, not the 484 pin part that is actually in the MEGA65 R2 PCB.  It seems that UrJTAG might not support multiple variants of the same part, which is a bit annoying.

The first step, though, is to find the information required to actually even make the file. This seems to be available behind the license-wall at: https://www.xilinx.com/support/download/index.html/content/xilinx/en/downloadNav/device-models/bsdl-models/artix-series-fpgas.html.   Using my account there, I downloaded the zip archive of BSDL files, and it seems that they are indeed the source material that I need. The PIN_MAP_STRING in each file seems to be the reverse-order of what appears in the UrJTAG file.  The syntax of BSDL is a bit weird, being VHDL derived, so where there are multiple pins defined on a single line, I'll have to work out how to parse those.

It turns out that UrJTAG has a parser utility for doing this:

bsdl2jtag xc7a100t_csg324.{bsd,jtag}
error: -E- error: In Package STD_1149_6_2003, Line 375, Error in User-Defined Package declarations.
error: -E- error: BSDL file 'xc7a100t_csg324.bsd' contains errors in VHDL stage, stopping
error: system error: Success Cannot open file STD_1149_6_2003 or /usr/local/share/urjtag/bsdl/STD_1149_6_2003

But as we can see, it is missing some files.  I suspect the install target of the Makefile might again be the problem here. Nope, apparently it just doesn't support STD_1149_6_2003. But someone has implemented the missing file. Unfortunately, it gives some error about user defined packages.  Someone else just took to modifying the BSDL files to remove the need for STD_1149_6_2003.  I might try that next.

Meanwhile, as I am out and about this morning, I took a Nexys4DDR board with me, which does have the exact chip that UrJTAG already supports, since I figured that should "just work", and I should be able to poke around with it while waiting for appointments.  Well, I don't get the error I described above, but I do instead get:

jtag> cable FT2232 vid=0x0403 pid=0x6010
Connected to libftd2xx driver.
jtag> detect
warning: TDO seems to be stuck at 1

What I don't know, is whether this is further along or not as far along as the other. I am guessing it is not as far along, since if the JTAG bus is stuck, it won't enumerate, and indeed, we are seeing a lack of enumeration. Fortunately I am not the only person with this problem.  Let's try some of their proposed solutions...

Unfortunately none of the suggestions on that page work. I'd suspect that my FPGA board is broken, except the fpgajtag command I use to send bitstreams to the FPGA via JTAG works perfectly.  So the JTAG interface *does* work, and my computer *can* communicate with it.  Most frustrating.

I also took a look at OpenOCD, an open-source JTAG tool for Linux etc.  This is an excellent project in many ways, but it was never designed with simple FPGA boundary scans in mind.  As a result, it still isn't in any way trivial to do them with it. I am sure that if I invested enough time and energy I could figure out how to do it, but I really don't want to have to do that if I don't have to.

I did take a quick look at the internals of the fpgajtag command, to see if I could easily adapt it.  It looks reasonably well-structured, but for someone who doesn't know that much about JTAG (although I am learning), it isn't immediately obvious what I would need to change.

So then I started looking at Vivado to see if the hardware manager in there can easily do a boundary scan.  I am sure it can, but even after a pile of Googling, I can't actually figure out how to do it.  There is a lot of talk about needing a debug bitstream or some debug core in the project.  This strikes me as incomplete information at best, since the JTAG interface on an FPGA, if not disabled, can ALWAYS do a boundary scan, if I understand things correctly.  Also, my workstation this morning doesn't have mains power, so I don't want to kill my battery before the kids' swimming lessons finish for the morning.



The best thing I have found so far is this: https://www.fpga4fun.com/JTAG4.html
It isn't immediately clear which JTAG library the example source code was written for, but it does show how to go through the process of performing a boundary scan at a low level.  It might thus be enough information, together with the fpgajtag source, to cook something up that can work.  I have already found the Xilinx BSDL files for the FPGAs I care about, so in theory, I have all the information I need.

It also gives me hope of being able to take control of pins on the FPGA, so that I can more quickly test and develop things like this QSPI interface, as I can potentially avoid having to synthesise every change, but instead be able to bit-bash over JTAG.  But of course, I have to succeed in actually getting SOMETHING to work, before I can get that excited.

Well, at least integrating fpgajtag into monitor_load was relatively easy: The only slightly tricky part was re-doing the command-line parsing. But I do want to extend it a little further, so that the fpgajtag code, which correctly works out which USB serial port to talk to, can also be used to automatically find the correct serial port for the normal monitor_load communications.  This was also not too hard, once I found out that I could map the /dev/ttyUSBx paths to the entries in /sys/bus/usb-serial/devices, and look at the destination of those symlinks to check that the USB bus and port match.

So now, in theory, I have all the necessary ingredients to run a boundary scan from within monitor_load, so long as I can figure out how the fpgajtag code does the JTAG communications.  But this is not proving as simple as I would like, as fpgajtag has what seems to be a quite clever mechanism for abstracting the low-level JTAG operations.

Unfortunately, there is little documentation in the source, and I am struggling to understand how to adapt it.  I'm pulling my hair out enough that I have logged an issue on the fpgajtag GitHub repository asking for some help in understanding their code.  Within a few hours, I had received some pointers to documentation for the FTDI serial adapters, which gave me enough information, with quite a lot of trial and error, to work out how to control the JTAG interface.  This will also come in handy in the future, when we get to updating the keyboard CPLD from the MEGA65 itself as well, as I will need to implement a JTAG interface for that.

Anyway, back to the point, I now seem to be able to read some JTAG boundary scan data from the FPGA.  It seems to be shifted by a few bits, and I don't yet capture it all, but I am able to see bits toggle as I flip the switches on a Nexys4 DDR board, and in roughly the right place in the boundary scan register.  I suspect the bit order of the bytes might be flipped, and that I need to ignore the first 6 or so bits, to make up for the bits of the boundary scan command itself being shifted out.  But the important thing is that I can now read boundary scan data.  The changes I made to the read_idcode() function to tell it to switch to boundary scan mode ended up being quite simple:

    ENTER_TMS_STATE('I');
    ENTER_TMS_STATE('S');
    write_bit(0, 0, 0xff, 0);     // Select first device on bus
    write_bit(0, 5, IRREG_SAMPLE, 0);     // Send SAMPLE command
    ENTER_TMS_STATE('I');


(Check out https://github.com/MEGA65/mega65-core/blob/unstable/src/tools/fpgajtag/boundary_scan.c if you would like to see it all together.)

This switches the JTAG interface from Reset to Idle, then to IR-Capture, sends the JTAG SAMPLE command so that it ends up in the IR register, and then returns to Idle state, ready for the usual logic to shift bits in and out.  The boundary data is then in the data shifted back in.  All quite simple, once I had worked it out!
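For anyone else new to JTAG, it may help to see the state machine that those macros are steering. The sketch below is the standard IEEE 1149.1 TAP controller written as a lookup table, with a helper that follows a string of TMS values, one per TCK clock. It is purely illustrative: the names are my own, and it says nothing about how fpgajtag encodes its states internally.

```c
/* The 16 states of the standard JTAG TAP controller. */
enum tap_state {
    TLR, RTI,
    SEL_DR, CAP_DR, SHIFT_DR, EXIT1_DR, PAUSE_DR, EXIT2_DR, UPDATE_DR,
    SEL_IR, CAP_IR, SHIFT_IR, EXIT1_IR, PAUSE_IR, EXIT2_IR, UPDATE_IR
};

/* tap_next[state][tms] is the state entered on the next TCK edge. */
static const enum tap_state tap_next[16][2] = {
    [TLR]       = { RTI,      TLR       },
    [RTI]       = { RTI,      SEL_DR    },
    [SEL_DR]    = { CAP_DR,   SEL_IR    },
    [CAP_DR]    = { SHIFT_DR, EXIT1_DR  },
    [SHIFT_DR]  = { SHIFT_DR, EXIT1_DR  },
    [EXIT1_DR]  = { PAUSE_DR, UPDATE_DR },
    [PAUSE_DR]  = { PAUSE_DR, EXIT2_DR  },
    [EXIT2_DR]  = { SHIFT_DR, UPDATE_DR },
    [UPDATE_DR] = { RTI,      SEL_DR    },
    [SEL_IR]    = { CAP_IR,   TLR       },
    [CAP_IR]    = { SHIFT_IR, EXIT1_IR  },
    [SHIFT_IR]  = { SHIFT_IR, EXIT1_IR  },
    [EXIT1_IR]  = { PAUSE_IR, UPDATE_IR },
    [PAUSE_IR]  = { PAUSE_IR, EXIT2_IR  },
    [EXIT2_IR]  = { SHIFT_IR, UPDATE_IR },
    [UPDATE_IR] = { RTI,      SEL_DR    },
};

/* Follow a string of '0'/'1' TMS values from state s. */
static enum tap_state tap_walk(enum tap_state s, const char *tms_bits)
{
    for (; *tms_bits; tms_bits++)
        s = tap_next[s][*tms_bits == '1'];
    return s;
}
```

So from Reset, TMS=0 reaches Idle, then TMS=1,1,0,0 passes through Capture-IR into Shift-IR where the instruction bits go in, and raising TMS on the last instruction bit followed by 1,0 lands back in Idle via Update-IR. Five clocks with TMS=1 return to Reset from any state, which is why that is the usual way to start.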

With a bit more work, I have now implemented an amazingly quick and dirty scanner for both the XDC and BSDL file formats.  XDC files indicate the pins used by a project, while BSDL files have the information about the FPGA itself, importantly including the JTAG boundary scan information.  With these parsers, and a bit of glue, I can not only show the status of each FPGA pin, but also the name of the pin in the project.  While there is plenty of room to improve this, the result is already really nice.  Here is a little sample of the output on a Nexys4DDR board:

monitor_load -J src/vhdl/nexys4ddr-widget.xdc,${HOME}/build/artix7/public/bsdl/xc7a100tl_csg324.bsd
make: 'src/tools/monitor_load' is up to date.
fpgajtag: Digilent:Digilent USB Device:210292645477; bcd:700; IDCODE:  3631093
Auto-detected serial port '/dev/ttyUSB1'
FPGA is assumed to be a XC7A100TL_CSG324, with 989 bits of boundary scan data.
bit#2 : CCLK_E9 (pin E9, signal {QspiSCK}) = 1
bit#3 : M0_P12 (pin P12, signal <unknown>) = 1
bit#4 : M1_P13 (pin P13, signal <unknown>) = 0
bit#5 : M2_P11 (pin P11, signal <unknown>) = 1
bit#6 : CFGBVS_P8 (pin P8, signal <unknown>) = 1
bit#10 : INIT_B_P7 (pin P7, signal <unknown>) = 1
bit#13 : DONE_P10 (pin P10, signal <unknown>) = 0
bit#53 : IO_U8 (pin U8, signal {sw[9]}) = 0
bit#56 : IO_T8 (pin T8, signal {sw[8]}) = 1

...

First, we have filtered out all the bits that are not marked "input" in the BSDL file, which dramatically shortens the list of output.
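In rough terms, the filtering and bit extraction look something like the sketch below. The struct and function names are invented for this post rather than copied from boundary_scan.c, and the bit order within each byte is left as a parameter, since that depends on how the captured bytes were shifted in.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical record for one boundary scan cell taken from the BSDL file. */
struct bscan_cell {
    unsigned bit;     /* position in the boundary scan register */
    const char *port; /* BSDL port name, e.g. "IO_U8" */
    int is_input;     /* non-zero if the BSDL marks the cell as "input" */
};

/* Fetch one bit from the captured data.  msb_first selects the bit order
 * within each byte. */
static int bscan_get_bit(const uint8_t *buf, size_t bit, int msb_first)
{
    unsigned shift = msb_first ? 7u - (unsigned)(bit & 7u) : (unsigned)(bit & 7u);
    return (buf[bit >> 3] >> shift) & 1;
}

/* Print only the input cells, and return how many were shown. */
static size_t bscan_print_inputs(const struct bscan_cell *cells, size_t ncells,
                                 const uint8_t *buf, size_t nbits, int msb_first)
{
    size_t shown = 0;
    for (size_t i = 0; i < ncells; i++) {
        if (!cells[i].is_input || cells[i].bit >= nbits)
            continue; /* skip output/control cells and out-of-range entries */
        printf("bit#%u : %s = %d\n", cells[i].bit, cells[i].port,
               bscan_get_bit(buf, cells[i].bit, msb_first));
        shown++;
    }
    return shown;
}
```

Any fixed offset in the captured data, like the few leading bits mentioned earlier, would simply be added to the cell's bit number before the lookup.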

Second, we see the nice mapping of the BSDL bit names to FPGA pins and project signals.  sw[9] and sw[8] are two of the slide switches on the Nexys board, and I can happily twiddle those, and re-run the scan, and see the changing values.  So I am confident overall that it's working, and that I can finally go back to what I was trying to do at the beginning: check whether I am correctly controlling the QSPI interface pins, in particular the CCLK pin.
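The XDC side of that mapping needs very little parsing. Here is a minimal sketch, assuming constraint lines that carry both a PACKAGE_PIN property and a get_ports clause on the same line, as in the usual Digilent-style files. The function names are hypothetical, and unlike my quick and dirty scanner it strips the braces from the signal name.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Skip leading whitespace. */
static const char *skip_spaces(const char *p)
{
    while (isspace((unsigned char)*p)) p++;
    return p;
}

/* Copy a token that ends at whitespace, end of string, or any character in
 * stops.  Returns its length, or 0 if it is empty or does not fit. */
static size_t copy_token(const char *p, const char *stops, char *out, size_t outlen)
{
    size_t n = 0;
    while (p[n] && !isspace((unsigned char)p[n]) && !strchr(stops, p[n])) n++;
    if (n == 0 || n >= outlen) return 0;
    memcpy(out, p, n);
    out[n] = '\0';
    return n;
}

/* Extract the package pin and signal name from one XDC line.
 * Returns 1 if both were found, 0 otherwise (including comment lines). */
static int xdc_parse_pin(const char *line, char *pin, size_t pinlen,
                         char *signal, size_t siglen)
{
    line = skip_spaces(line);
    if (*line == '#') return 0; /* commented-out constraint */

    const char *p = strstr(line, "PACKAGE_PIN");
    const char *g = strstr(line, "get_ports");
    if (!p || !g) return 0;

    p = skip_spaces(p + strlen("PACKAGE_PIN"));
    if (!copy_token(p, "}", pin, pinlen)) return 0;

    g = skip_spaces(g + strlen("get_ports"));
    if (*g == '{') {
        /* Braced name: may contain [n], so stop only at the closing brace. */
        g = skip_spaces(g + 1);
        return copy_token(g, "}", signal, siglen) != 0;
    }
    /* Bare name: stop at the bracket that closes the get_ports clause. */
    return copy_token(g, "]", signal, siglen) != 0;
}
```

Joining the two tables is then just a matter of matching the pin name from the XDC line (say U8) against the pin associated with each BSDL port (IO_U8).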

So let's actually fire up a bitstream, and see if we can control the pin... and indeed I have confirmed that everything except the pesky clock pin is controllable.  This is what I had most suspected would be the problem, but now I don't have to suspect -- I can inspect!  But solving that will have to wait for the next blog post.

Meanwhile, if you would like to support me, I've set up a ko-fi page at ko-fi.com/paulgs.