Sunday, 24 May 2020

Improving the MEGA65 Audio Mixer

This past week we have been working on improving some of the cross-development tooling that we use, which I will write about soon. But for now there is nothing really visible to show.  So I will instead share with you the work I did today on getting the MEGA65 Audio Mixer that is accessed from the Freeze Menu more user friendly.

We now have some nice decibel logarithmic volume scales, and a bunch of friendly controls for controlling overall volume level, as well as selecting stereo/mono, and swapping left/right channels.  You can also use the up/down cursor keys to pick a particular element and left/right cursor keys to change the values.    You can also press T and it will play a few musical notes through the four SIDs, alternating left and right, so that you can more easily verify that the settings are sensible, and not left/right swapped.

All up, it now feels like an interface that the average user could use, especially compared with the previous proof-of-concept audio mixer. Just in case you can't remember or haven't seen just how "user feindly" the previously one was, here is a reminder:

While it is technically more powerful (as you can set every coefficient in the full-cross-bar audio mixer separately), you really need to know what you are doing to avoid instant and prolonged confusion.

Wednesday, 20 May 2020

Injection Moulding Tooling Progress

A quick post just to share the exciting progress with the injection-moulding tool manufacture for the MEGA65 case.  The major pieces of the tool have now been produced, as you can see below!

First, here we see the inside of the lower case. The tools are all in negative profile. The big round holes are for the pushers for pushing the cases out of the tool. The smaller round holes are all for screw bosses -- so in addition to the bosses for the motherboard, there are plenty of extra ones for expansion boards, optional internal speaker etc. You can also see the MEGA65 team and major donors for the tooling costs here, so that they will appear as part of every case we produce. We are very proud to recognise some of the key contributors to the creation of the MEGA65 in this way, and wish we had limitless space to be able to recognise everyone who has been involved in the project.

Then we have the outside of the underside, where we can see the hole for the trap-door, and the logos for M.E.G.A. and Hintsteiner.  Note that it seems to be missing 2 of the sides. Those are the sides where all the ports go.  Those sides comea in as separate pieces that move in before the plastic is loaded, and then pull out the way before the pushers push the finished part out of the mould. 

Then for the upper half of the case we have the same inner and outer pieces, although this time quite a bit simpler. The funny sword looking shape in the middle is where the plastic comes in to distribute around, and to hold the piece straight as it comes out warm.
And of course the opposite side of that.  Here we also see the really big diameter holes where the rods go through on which the mould will slide together and apart each time a piece is moulded.  The lower half of the case will have something similar, but of course more complicated because of the pushers that come in from the sides.

Then we have the relatively simple trapdoor slot and eject button on a single little family mould. These can go on a single mould because they are similar enough in volumae that they can be balanced within the mould, to obtain good results.

 Here is the other part for this last mould still being prepared:

While I haven't had the chance to discuss it with them directly, my understanding is that the next stage is the assembling of the tools with all the big rods that hold them together etc, and making sure everything fits together properly.   Then it will be time to load it into a moulding machine at the tooling factory and test it out.  

Once they are happy at the tool factory, they will ship it to the factory where it will actually be used in the production injection moulding machines. They will then go through a commissioning process, where they will fiddle with pressures, gate settings etc to make sure the plastic flow fills every little bit reliably (a bit of a black art), and try to minimise the time per part, so that we optimise the price per case (you pay by the second that the machine is busy). All that will take another month or three, depending on how COVID19 affects everything.

We'll let you know more as soon as we know it, but it's already very exciting :)

Sunday, 10 May 2020

More work on the RAM Expansion and Timing Closure

I recently got the MEGA65's 8MB built-in expansion RAM mostly working.  I say mostly working, because it was still doing various strange things sometimes.  So I have been spending all my spare time trying to get this sorted.
While in theory the RAM controller is simple, in order for it to offer decent performance, it ends up a bit more complex.

Let's start with the simple HyperRAM itself: This can be driven by a clock of upto 100MHz, allowing for DDR throughput of upto 200MB/sec.  Except that it cant.

First, we don't really have the means to use a 100MHz clock, because it's not a clean multiple of any of the clocks we are using. So we are instead using 81.5MHZ, which is 2x the MEGA65 CPU clock and 3x the VIC-IV's pixel clock.

Second, the HyperRAM is accessed by first sending it a command to read or write something.  Regardless of the clock rate, there is a minimum of 40ns plus a bit between transactions.  In reality, it takes about 100ns -- 200ns to actually perform a random read or write. 

Third, in order to not drag the maximum clock speed of the MEGA65's CPU down, the expansion RAM is on the "slow devices" bus.  They don't really have to be slow, but the CPU requires at least 2 clock cycles to access such a device.  That bus then asks the HyperRAM for data, which takes at least another 2 clock cycles.  Thus there is something like 100ns extra latency from the memory architecture of the MEGA65.

The result is that a naive implementation will have about 300ns latency, i.e., about what a real C64 in 1982 had with its ancient DRAM chips.  This is enough for 2 - 3MB/sec -- which with a 40MHz CPU is a bit disappointing.

Thus the HyperRAM controller and related bits and pieces in the MEGA65 does a number of tricks to hide this latency, and increase the throughput for linear accesses. We are concentrating on linear access, as we figure it is most likely how the expansion RAM will be used: You will either be copying things in or out of it.  Some of those tricks are:

1. Two eight-byte write buffers, that allow most writes to happen with just 2 cycle latency, unless the buffers are already busy. Each buffer covers an 8-byte aligned region of 8-bytes, and multiple writes to the region in quick succession will result in a merged 64-bit write to the expansion RAM.  This speeds up writing a lot.

2. A similar setup of two eight-byte read buffers that you could call a tiny cache.  Again, this helps avoid the 100 -- 200ns latency of the HyperRAM, but not the latency of the slow device bus and related bits.

3. A block-read mechanism that reads and buffers upto 32 consecutive bytes from the HyperRAM, and feeds them pre-emptively to the slow device bus as an 8-byte buffer. This really helps reduce read latency.

4. A pre-emptive pre-fetch mechanism in the slow device bus that pushes the byte after the previous read one to the CPU, so that if it is doing a linear read, it can avoid all but one cycle of latency in the slow device bus, and theoretically allow reads at up to 20MB/sec.

I have also added a priority access mechanism to the HyperRAM, so that we can have automatic chipset driven DMAs like access, a little like the Amiga, so that we could, for example, feed the digital audio channels directly, without any CPU intervention.  I have yet to work out exactly what that will look like, but it is important to get the interface in now, so that it can be worked on later.

So all of this mostly works, but there are various little corner cases.  Also, the timing closure on the MEGA65 as a whole was getting a bit out of hand again, so I have spent quite a lot of time trying to flatten logic, so that everything that needs to happen in a single clock tick can in fact happen. And reliably.  This also can help synthesis to run a bit faster, because it doesn't have to try so hard to make everything fit with acceptable timing.  But its also important when debugging weird behaviours, so that I can be confident it isn't just because everything is violating timing by a bit.

I now have it mostly meeting timing, and all of the above optimisations and features at least mostly working.  Where I am now upto is that there are some weird bugs with the chipset DMA, and chaining of the 32-byte fetches and the like.  Also the optimisation in (4) is not completely implemented.

First, I'll tackle the weird bugs with the DMA. Basically the HyperRAM controller can get confused and freeze up, messing up both the chipset DMA, as well as regular CPU access.  I think much of the problem is how it handles the situation where a request comes in from the slow_device bus at around the same time as a chipset DMA. The chipset DMA is supposed to take priority, since it might be quite timing sensitive.  This means that the incoming request from the slow device bus has to be remembered for later.

This is further complicated by the fact that the HyperRAM controller has an 81.5MHz and a 163 section, and we have to be careful how we pass information between the two.  Normally the 81.5MHz side registers and processes the requests from the slow device bus. But when the HyperRAM is actually being accessed, it is the 163MHz part of the circuit that comes into play.  The trouble comes when the 163MHz side realises that it can service a newly arriving request, which it should do to reduce latency.

It should also tell the 81.5MHz that it has processed the request, so that it doesn't get processed twice. And this seems to be where it messes up at the moment: The request does in fact get processed twice.  This probably means I need to have a signal that tells the 81.5MHz side when a latches request has in fact already been processed.  Ok, implemented that, and it looks to simulate properly.

Next is to finish implementing the pre-emptive delivery of data to the CPU, so that it saves time every time it can, instead of just sometimes. Basically now it will save time at most 1/2 the time, because it supplies the next byte whenever the CPU requests a next byte, and it can.  For a linear read, that's only 3 out of every 8.  As we have the cache to work with, we have an 8-byte aligned block, and we should be able to pre-fetch 7 out of every 8 bytes in a linear read. We just need to watch for when the CPU consumes the previous pre-fetched byte (which it already tells the slow device bus) and get the next byte ready in that case. This all happens in slow_devices.vhdl.

After that, it was time to look at when we flush out the 8-byte write buffers. On the one hand, we want to not abort early if the last few byte positions don't need writing, in case a new write comes along that immediately follows in memory, and thus could be chained. But on the other hand, if all our write buffers are currently busy, such that we can't accept any new writes, then we really want to abort as quickly as possible, so that we can start processing any new requests.

I have a simple sequence of memory accesses in a test harness that I use to verify that each of these changes has not broken anything, i.e., we don't lose any writes or read stale data etc.  As I have been working on these various optimisations, the average transaction time has dropped from around 120ns to around 68ns -- so all up, this is quite a significant performance boost.  This should allow a hyperram access to, on average, occur in around 3 clock cycles, with linear reads potentially only requiring 2 clock cycles.  Thus, I'm hoping that when I run my speed test programme, it will show "Copy Slow RAM to Chip RAM" speeds of ~40/3 = ~13,000KB/sec. 

In generally, looking at the test harness output, things look pretty good now for the most part. The only exception is that a few memory accesses seem to take a REALLY long time, longer than should really be possible, up to around 450ns.  Something pathological must be happening there, such as an aborted 32-byte block read that would have got the data, but is aborted, and then a new fetch is started that does get it.

Looking into it, it might just be when there are a lot of writes banked up that need flushing.  But I did notice a weird thing, where data was being spread into both of the 8-byte read caches, instead of all going into a single one.  While that can't cause the current problem, it would still be good to squash it, as otherwise it will cause havoc with the performance of the cache when reading, as one of the cache lines will just get ignored in all likelihood.

Fixing timing closure is often about simplifying IF statements in the VHDL, so that they can be processed more quickly by the hardware.  Sometimes this can be through reducing the number of nested IF statements. Other times it is about pre-caculating expressions that appear in the IF statements, and then using the simple pre-calculated true or false result from that.

Vivado does give nice reports about signals that don't meet timing closure, like this:

Slack (VIOLATED) :        -1.245ns  (required time - arrival time)
  Source:                 hyperram0/block_valid_reg/C
                            (rising edge-triggered cell FDRE clocked by clock162  {rise@0.000ns fall@3.077ns period=6.154ns})
  Destination:            hyperram0/current_cache_line_drive_reg[3][5]/D
                            (rising edge-triggered cell FDRE clocked by clock162  {rise@0.000ns fall@3.077ns period=6.154ns})
  Path Group:             clock162
  Path Type:              Setup (Max at Slow Process Corner)
  Requirement:            6.154ns  (clock162 rise@6.154ns - clock162 rise@0.000ns)
  Data Path Delay:        7.383ns  (logic 1.200ns (16.253%)  route 6.183ns (83.747%))
  Logic Levels:           6  (LUT4=1 LUT6=5)
  Clock Path Skew:        0.027ns (DCD - SCD + CPR)
    Destination Clock Delay (DCD):    6.300ns = ( 12.454 - 6.154 )
    Source Clock Delay      (SCD):    6.600ns
    Clock Pessimism Removal (CPR):    0.327ns
  Clock Uncertainty:      0.073ns  ((TSJ^2 + DJ^2)^1/2) / 2 + PE
    Total System Jitter     (TSJ):    0.071ns
    Discrete Jitter          (DJ):    0.128ns
    Phase Error              (PE):    0.000ns

    Location             Delay type                Incr(ns)  Path(ns)    Netlist Resource(s)
  -------------------------------------------------------------------    -------------------
                         (clock clock162 rise edge)
                                                      0.000     0.000 r 
    V13                                               0.000     0.000 r  CLK_IN (IN)
                         net (fo=0)                   0.000     0.000    CLK_IN
    V13                                                               r  CLK_IN_IBUF_inst/I
    V13                  IBUF (Prop_ibuf_I_O)         1.521     1.521 r  CLK_IN_IBUF_inst/O
                         net (fo=2, routed)           1.233     2.755    clocks1/clk_in
    MMCME2_ADV_X0Y0                                                   r  clocks1/mmcm_adv0/CLKIN1
    MMCME2_ADV_X0Y0      MMCME2_ADV (Prop_mmcme2_adv_CLKIN1_CLKOUT5)
                                                      0.088     2.843 r  clocks1/mmcm_adv0/CLKOUT5
                         net (fo=1, routed)           2.018     4.861    clock162
    BUFGCTRL_X0Y4                                                     r  clock162_BUFG_inst/I
                        net (fo=1126, routed)        1.643     6.600    hyperram0/clock162_BUFG
    SLICE_X15Y57         FDRE                                         r  hyperram0/block_valid_reg/C
  -------------------------------------------------------------------    -------------------
    SLICE_X15Y57         FDRE (Prop_fdre_C_Q)         0.456     7.056 f  hyperram0/block_valid_reg/Q
                         net (fo=43, routed)          1.406     8.462    hyperram0/block_valid_reg_0
    SLICE_X21Y49                                                      f  hyperram0/current_cache_line_drive[0][7]_i_5/I3
    SLICE_X21Y49         LUT4 (Prop_lut4_I3_O)        0.124     8.586 r  hyperram0/current_cache_line_drive[0][7]_i_5/O
                         net (fo=135, routed)         1.242     9.828    hyperram0/current_cache_line_drive[0][7]_i_5_n_0
    SLICE_X11Y39                                                      r  hyperram0/current_cache_line_drive[3][5]_i_7/I5
    SLICE_X11Y39         LUT6 (Prop_lut6_I5_O)        0.124     9.952 f  hyperram0/current_cache_line_drive[3][5]_i_7/O
                         net (fo=1, routed)           1.245    11.197    hyperram0/current_cache_line_drive[3][5]_i_7_n_0
    SLICE_X15Y55                                                      f  hyperram0/current_cache_line_drive[3][5]_i_5/I0
    SLICE_X15Y55         LUT6 (Prop_lut6_I0_O)        0.124    11.321 f  hyperram0/current_cache_line_drive[3][5]_i_5/O
                         net (fo=1, routed)           0.803    12.124    hyperram0/current_cache_line_drive[3][5]_i_5_n_0
    SLICE_X9Y63                                                       f  hyperram0/current_cache_line_drive[3][5]_i_3/I1
    SLICE_X9Y63          LUT6 (Prop_lut6_I1_O)        0.124    12.248 f  hyperram0/current_cache_line_drive[3][5]_i_3/O
                         net (fo=2, routed)           0.908    13.156    hyperram0/current_cache_line_drive[3][5]_i_3_n_0
    SLICE_X5Y71                                                       f  hyperram0/current_cache_line_drive[3][5]_i_4/I4
    SLICE_X5Y71          LUT6 (Prop_lut6_I4_O)        0.124    13.280 r  hyperram0/current_cache_line_drive[3][5]_i_4/O
                         net (fo=1, routed)           0.579    13.859    hyperram0/current_cache_line_drive[3][5]_i_4_n_0
    SLICE_X3Y71                                                       r  hyperram0/current_cache_line_drive[3][5]_i_1/I4
    SLICE_X3Y71          LUT6 (Prop_lut6_I4_O)        0.124    13.983 r  hyperram0/current_cache_line_drive[3][5]_i_1/O
                         net (fo=1, routed)           0.000    13.983    hyperram0/current_cache_line_drive[3][5]_i_1_n_0
    SLICE_X3Y71          FDRE                                         r  hyperram0/current_cache_line_drive_reg[3][5]/D
  -------------------------------------------------------------------    -------------------

                         (clock clock162 rise edge)
                                                      6.154     6.154 r 
    V13                                               0.000     6.154 r  CLK_IN (IN)
                         net (fo=0)                   0.000     6.154    CLK_IN
    V13                                                               r  CLK_IN_IBUF_inst/I
    V13                  IBUF (Prop_ibuf_I_O)         1.450     7.604 r  CLK_IN_IBUF_inst/O
                         net (fo=2, routed)           1.162     8.766    clocks1/clk_in
    MMCME2_ADV_X0Y0                                                   r  clocks1/mmcm_adv0/CLKIN1
                                                     0.083     8.849 r  clocks1/mmcm_adv0/CLKOUT5
                         net (fo=1, routed)           1.923    10.772    clock162
    BUFGCTRL_X0Y4                                                     r  clock162_BUFG_inst/I
    BUFGCTRL_X0Y4        BUFG (Prop_bufg_I_O)         0.091    10.863 r  clock162_BUFG_inst/O
                         net (fo=1126, routed)        1.590    12.454    hyperram0/clock162_BUFG
    SLICE_X3Y71          FDRE                                         r  hyperram0/current_cache_line_drive_reg[3][5]/C
                         clock pessimism              0.327    12.781   
                         clock uncertainty           -0.073    12.708   
    SLICE_X3Y71          FDRE (Setup_fdre_C_D)        0.031    12.739    hyperram0/current_cache_line_drive_reg[3][5]
                         required time                         12.739   
                         arrival time                         -13.983   
                         slack                                 -1.245   

These are great in that they tell me from which signal to which signal the problem is, and how bad it is. But they are also a pain, because they don't actually tell you which lines of code is causing the path. So you have to hunt through the source trying to find the sequence of nested IF statements that apply, but which might be spread over thousands of lines of code.  It's a bit of a pain.  So I made a really quick-and-dirty tool that lets me see all the IF statements that lead up to a given signal in the source.

So in the case above, we can see:

 387        if rising_edge(pixelclock) then
 622          if (read_request or read_request_latch)='1'
 635            if (block_valid='1') and (address(26 downto 5) = block_address) then
 650              if (address(4 downto 3) = "11") and (flag_prefetch='1')
 658 >>>             ram_address <= tempaddr;
 387        if rising_edge(pixelclock) then
 622          if (read_request or read_request_latch)='1'
 826            elsif address(23 downto 4) = x"FFFFF" and address(25 downto 24) = "11" then
 883            elsif request_accepted = request_toggle then
 887 >>>           ram_address <= address;
 387        if rising_edge(pixelclock) then
 899          elsif queued_write='1' and write_collect0_dispatchable='0' and write_collect0_flushed='0'
 922          elsif (write_request or write_request_latch)='1' and busy_internal='0' then
 929            if address(23 downto 4) = x"FFFFF" and address(25 downto 24) = "11" then
 972              if cache_enabled = false then
 980 >>>             ram_address <= address;

I now have line numbers, and can see exactly what is going on.  Because it also shows me the rising_edge() statement, I can also tell under which clock it is .  So in this case, it's likely line 635 that is the problem.  It's also showing me that while this IF does depend on the block_valid signal, it also compares a 22 bit address -- that's the thin that's really going to be slow.  Thus I get more useful information than Vivado otherwise gives me.  I'm sure there are expensive professional tools that can do some of this, but it was faster to cook this up than to even find out if such tools exist.  It would be nice if the VHDL mode in emacs could display something like this, though.

Anyway, in this particular case, I can't do a great deal about it, unless I add an extra stage of latency to all hyperram reads, as pre-calculating the result of the comparison would add a cycle of delay.  This is the whole trade-off of clock speed and pipeline depth, which the Pentium 4 famously did really badly.  In this instance, I have fixed a whole lot of other timing violations, in the hope that Vivado can do better with this particular signal.  I have hope, because while the total path takes too long, only 1.2ns of that is logic -- the other ~6ns is routing overhead.  By making it easier for Vivado to route other signals, it can often enable it to get better closure on critical paths like this one.  Anyway, we will know in about half and hour when it finishes synthesising.

So, that's helped me get quite close to timing closure, but there are a few signals that are not quite meeting timing closure. My next step was to fix the clock frequencies we are using. They are approximately right, but not exact. They are in fact a little too fast. This can also cause us trouble with the HDMI output, so it makes sense to get these fixed. I recently worked out how to better control clocks. Basically the problem is that each clock generator in the FPGA can only take the input clock and multiply and divide it by a fixed range of values. Thus you can only generate a limited set of frequencies.

But you can chain the generators to take the output of another one as its input, and thus generate many more frequencies.  The trouble then, is that you need to consider a LOT of possible combinations. Billions and billions of combinations. So I made a little program that tries all the possible values, and looks for the one that can get us closest to the target frequency. Here's the important part of it:

  float best_freq=100;
  float best_diff=100-27;
  int best_factor_count=0;
  int best_factors[8]={0,0,0,0,0,0,0,0};

  int this_factors[8]={0,0,0,0,0,0,0,0};
  // Start with as few factors as possible, and then progressively search the space
  for (int max_factors=1;max_factors<4;max_factors++)
      printf("Trying %d factors...\n",max_factors);
      for(int i=0;i<max_factors;i++) this_factors[i]=0;
      while(this_factors[0]<uniq_factor_count) {
    float this_freq=27.0833333;
    for(int j=0;j<max_factors;j++) {
      //            printf(" %.3f",factors[this_factors[j]]);
    //        printf(" = %.3f MHz\n",this_freq);
    float diff=this_freq-27.00; if (diff<0) diff=-diff;
    if (0&&diff<1) {
      printf("Close freq: ");
      for(int j=0;j<max_factors;j++) printf(" %.3f",uniq_factors[this_factors[j]]);
      printf(" = %.3f MHz\n",this_freq);
    if (diff<best_diff) {
      for(int k=0;k<8;k++) best_factors[k]=this_factors[k];
      printf("New best: ");
      for(int j=0;j<max_factors;j++) printf(" %.3f",uniq_factors[this_factors[j]]);
      printf(" = %.6f MHz\n",this_freq);
      printf("          ");
      for(int j=0;j<max_factors;j++) {
        float uf=uniq_factors[this_factors[j]];
        int n=0;
        for(n=0;n<(1<<14);n++) if (uf==factors[n]) break;
        if (j) printf(" x");
        printf(" %.3f/%.3f",(1.0+((n>>0)&0x7f)/8),(1.0+((n>>7)&0x7f)/8));

    // Now try next possible value
    int j=max_factors-1;
    while((j>=1)&&(this_factors[j]>=uniq_factor_count)) {

Basically it tries all possible values, and reports each time it finds a better one.  After optimising it to only consider the unique frequency multipliers that each generator can produce, it takes less than a minute to find an exact match:

Calculating set of adjustment factors...
There are 159 unique factors.
Trying 1 factors...
New best:  1.000 = 27.083334 MHz

Trying 2 factors...
New best:  2.333 0.429 = 27.083332 MHz
           7.000/3.000 x 3.000/7.000

New best:  0.923 1.077 = 26.923079 MHz
           12.000/13.000 x 14.000/13.000

New best:  0.929 1.071 = 26.945152 MHz
           13.000/14.000 x 15.000/14.000

New best:  0.933 1.067 = 26.962965 MHz
           14.000/15.000 x 16.000/15.000

Trying 3 factors...
New best:  7.000 0.133 1.067 = 26.962967 MHz
           7.000/1.000 x 2.000/15.000 x 16.000/15.000

New best:  1.500 0.818 0.812 = 27.006392 MHz
           3.000/2.000 x 9.000/11.000 x 13.000/16.000

New best:  1.500 1.182 0.562 = 27.006390 MHz
           3.000/2.000 x 13.000/11.000 x 9.000/16.000

New best:  1.667 0.556 1.077 = 27.006176 MHz
           5.000/3.000 x 5.000/9.000 x 14.000/13.000

New best:  1.667 0.778 0.769 = 27.006174 MHz
           5.000/3.000 x 7.000/9.000 x 10.000/13.000

New best:  1.667 0.385 1.556 = 27.006172 MHz
           5.000/3.000 x 5.000/13.000 x 14.000/9.000

New best:  0.600 1.800 0.923 = 27.000002 MHz
           3.000/5.000 x 9.000/5.000 x 12.000/13.000

New best:  0.800 1.800 0.692 = 27.000000 MHz
           4.000/5.000 x 9.000/5.000 x 9.000/13.000

These factors are in addition to the existing 100MHz x 8.125 / 30 that I am already doing, which results in 27.083333 MHz.  I was a little bit surprised that it was able to find an exact match using only three factors. It isn't even just that it is correct to 6 decimal places. It really is exact:

100 x 8.125/30 x 4/5 x 9/5 x 9/13  MHz
= 100 x 65/240 x 36/25 x 9/13 MH
= 100 x 21060/78000 MHz
= 100 x 0.27 MHz
= 27 MHz

Well, that's all sounding good. So I modified clocking.vhdl to create the chain of MMCM instances to generate the exact clock frequency.  First go at synthesis, Vivado threw an error, because it is not possible to directly assemble such a long chain of MMCM units.  To solve this, I added a Global Buffer unit to take one of the intermediate MMCM outputs, so that it would be available at the next MMCM with minimal distortion or skew.  Hopefully that will solve the problem.

Next problem is that the last step multiplies the intermediate clock by 9, and ends up out of the valid range of 600 - 1200MHz.  This is because:

100 MHz x 8/10 = 800MHz(ok)/10 = 80MHz
80MHz x 9/5 = 720MHz(ok)/5 = 144MHz
144MHz x 9/13 = 1296MHz(not ok)/13

So I need to reorder these factors, so that the intermediate frequencies remain in the range of 600 - 1200MHz.

100 MHz x 9/13 = 900MHz(ok)/13 = 69.23MHz
69.23MHz x 9/5 = 623MHz(ok)/5 = 124.61MHz
124.61MHz x 8/10 = 996.92MHz(ok)/10 = 99.692MHz
(which then gets used to generate the 27MHz clock via x 8.125/30)

So, I'll just rearrange into that order and try to synthesise again...
Ok, that worked, and the bitstream runs, with HDMI output (still) working.  So it looks like my clock magic has worked just fine.

There are now just a very few timing violations of <0.2ns in the hyperram left to resolve.  That said, they are now a small percentage compared to the clock interval (~6.2ns), so SHOULDNT be a problem, except unless the FPGA voltage dipped, or the FPGA got really hot. I am seeing some weird problems still, though, so I will likely still work to completely eliminate them, so that I don't have to suspect that they aren't a problem, but rather that I can KNOW that they aren't a problem.

Digging through, I found one last sneaky slow address comparison that is likely the cause of that, so time to resynthesise again.  Hopefully that will be the last timing violation there, and I will be able to focus on the remaining bugs I am seeing.

Finally! I now have a synthesis run with full timing closure on the HyperRAM, and even the persistent timing closure problem with the CPU is less than 1ns late on a ~24ns clock, and it is related to the SP register, not the HyperRAM. Thus we can more forward confident that any problems with the HyperRAM are not related to lack of timing closure.

So, speaking of problems, we certainly still have some right now:
1. The prefetch logic has some significant problems, resulting in mis-read data.  The first two or three bytes read ok, and then the next byte read rubbish. The pattern is a bit longer, but clearly something is a bit bonkers.
2. Linear reads crossing a 32-byte boundary read rubbish after crossing that boundary, presumably because there is something wonky with the pre-fetch chaining.
3. The external hyperram seems to be even less reliable, but it has been since the start, presumably because the leads from the FPGA to it are longer.
4. When cache rows are read, the bytes are being spread between the two cache rows, instead of all going into the same row.
5. Copying from chip to hyperram is slow, and a bit variable in speed. It's less than half the speed of filling hyperram, so there is presumably something funny going on with the write scheduling.

First step: Make the pre-fetching run-time switchable, so that I can more easily assess the rest of the system.  That worked fine. My next step was to find and fix a bunch of the bugs in the prefetch logic and related areas. This looks like it might fix (1), (2) and (4) above.

I'm synthesising the fixed version now, to see if it really fixes those problems, but am currently fighting with multiple drivers.  Multiple drivers is when signals are trying to be set from multiple (unrelated) places, and the FPGA synthesis software can't figure out how to reconcile them, for example, if they are set from completely separate pieces of hardware.  They aren't too hard to fix, just a bit fiddly.  Fortunately improving the timing closure has dropped synthesis time back down to about 30 minutes instead of 45 minutes or so, so the going is a bit quicker now.

What I have done as a first step, is to get a working configuration for the HyperRAM.  It seems that using 80MHz clock for 160MB/sec throughput (via DDR) is just too much for the current PCB layout (or for how I have implemented it in the FPGA).  By leaving all stages of the transaction at 40MHz, that fixes a lot of problems, but at a modest cost to read performance, since 80MHz was only being used during reads, anyway.  I also disabled all of the CPU prefetch logic by default, and also the whole read-cache machinery.  The write cache is already rock solid, it seems, so that has been left.

The external HyperRAM still has some problems with writes, which I need to look into. Basically some writes get lost. I am suspecting that the chip select line to the HyperRAM might have slightly faster response than the data lines, and thus the last write gets ignored.  It could be something more sinister, of course, like writes being ignored in the write cache, but the fact that the internal HyperRAM is now rock solid makes me think that that is very unlikely.

The result is a known working configuration, for the internal HyperRAM at least, from which to try to debug everything else.  Everything else now largely means the read cache machinery.  The downside, in the interim at least, is that the read performance is, without the caching, quite poor, while linear writing is a lot faster, due to the functioning write caches.  So we get a result like the following (remember the HyperRAM is what is used as Slow RAM):

Fill Slow RAM: 9,287KB/sec
Copy ChipRAM to Slow RAM: 4,334KB/sec
Copy Slow to Chip RAM: 2,500KB/sec
Copy Slow RAM to Slow RAM: 1,547KB/sec

So, let's go through and find and fix the problems with the read cache.  First, we try to use simulation again, as this is MUCH faster and more informative.  With the reading set to 80MHz, simulation showed no problems. But when I drop the read speed to 40MHz, we do now get some errors.  That's a good thing, as I can investigate their causes and deal with them.

We also have another clue: The read cache works fine when there aren't also writes happening.  This means it is most likely something with the cache coherency logic. Either the read caches are taking too long to get updated when a new value is written that would hit them, or they are simply not getting updated, or getting updated incorrectly.

So, into the simulation to see what is going on there.  It looks suspiciously like chipset DMA reads are being used to reply to CPU memory requests.  That can't be the whole problem, because I have run tests with that disabled, and there are still cache consistency problems. However, I have seen evidence that the chipset accesses were stuffing things up, so its nice to have captured this in simulation.  So, that was an easy fix. Simulation then showed errors caused by a weird fix I had in place from when we were running at 80MHz, which was breaking things at 40MHz.  Having fixed that, simulation reveals no more errors. So time to synthesise that, and then see how it behaves on the real hardware.

Ok. Synthesis is now complete.  Having the cache enabled still causes problems when reading recently written locations. Its just that I can't reproduce it in simulation anymore, which means I need to use other approaches to track the problem down.  I have fortunately made a mechanism to examine the state of the cache in real-time, so that I can tell if a byte is being mistakenly put into the wrong cache row. 

To investigate this further, I have modified the hyperram test programme cache tests to reproduce this problem, and then to show me some information about what is going on. It looks like when we write $18 into $8000808, that it gets erroneously written into the first byte of the cache line for $8000800-$8000807.  Knowing that this is the case, I have been able to create a sequences of transactions that can reproduce the problem in simulation - yay! Well, except it seems to be some other error that I am seeing in simulation. No bother. It's still an error to be squashed, and that I can reproduce in simulation.  Actually, it DOES look to be the same type of error, just another instance of it.

The problem is that the cache_row0_address_matches_cache_row_update_address signal that indicates when a cache row matches the update address for the cache is delayed by a cycle (remember I mentioned this kind of thing, when I was talking about flattening logic by pipelining/pre-calculating signal values).  Probably the easiest solution for this, is to clear the signals whenever I update the cache update address, so that they can never be wrong.  It does mean that on the occassions when they would be right, we don't get the benefit of the cache, but that's livable.

Except life is never that simple. In this case, the cache_row_update_address signal and these precalculated comparison signals are generated in separate design parts, and will generate more of those annoying "multiple driver" problems if I just try to set it in both places.  Instead I'll just create a signal that blocks the use of these signals for one clock after the update address has been changed. That fixes the problem in simulation, so now to see if it fixes it when synthesised.

Well, the problem still shows up after synthesis -- but no longer in the little test-case I made, only in the larger test.  Now to try to figure out how to reproduce it with the simplest test case, so that I can hopefully reproduce it under simulation, where running thousands of memory accesses is too cumbersome and slow.

On initial inspection, it isn't actually the same problem, just a similar one: Now it reads an 8 byte row as all $00, instead of their correct values.  My suspicion is that the problem is similar: The cache row address is being compared with something a cycle after it is being updated, and thus there is an internal inconsistency. More precisely, there is a cache row being loaded with entirely incorrect data.  It looks like it might actually be the current_cache_line that gets exported to the slow_devices controller to reduce read latency. So I'll make that inspectable at run-time, too.

While I wait for that, I'll have another think about the chipset DMA stuff.  This is still being quite weird.  If these DMAs are running, the CPU can still (but less often) read wrong data. This only happens when the cache is enabled, so that's a clue that we still have some cache consistency problem.  Also, the chipset DMAs only seem to return data (or at least, interesting data) when the CPU is accessing the hyperram. That is, the CPU and chipset seem able to get data intended for each other.

I went through with a fine tooth comb, and made sure that Under No Circumstances Whatever could a chipset DMA fetch contaminate any of the cache machinery. That has the CPU now safely reading only its own data. But the chipset is still only receiving data when the CPU is accessing the HyperRAM.  This is still really annoying me, because I can't reproduce it under simulation, even when I let the HypeRAM bus otherwise remain idle. 

I ran simulation again,  to confirm that lots of HyperRAM requests can get sequenced one after the other, and that the data gets received, and that all looks fine. I also synthesised a debug mode that asserts the data ready strobe to the chipset the whole time, and that DOES cause the dummy data to be delivered. Thus I am confident that the plumbing between the two is correct. It just leaves the major mystery of why the HyperRAM seemingly gets stuck handling these chipset DMA requests.

To try to figure out what is going on, I have added a couple of extra run-time switches that will allow me to make the chipset DMA to take absolute priority on the HyperRAM bus, in case the priority managment is the issue. I also found a couple of places that, although unlikely, could be contributing to the problem, and tweaked those.  Basically they are places where it might have been possible for one of these data requests to get cancelled before it was complete. 
However, if those were the problem, then I'd expect the whole thing to freeze up, due to the end of the transaction not being properly acknowledged. So we'll see how that goes.

If this doesn't reveal any clues, then I am starting to suspect that the HyperRAM might be doing something funny with taking a VERY long time to respond. We are talking about ~1 milli-second here instead of the tens of nano-seconds that it should take to service a request.

That's improved things a bit: Some data from somewhere is now getting read for the chipset DMA, but it seems to come from the wrong address, and some CPU memory accesses to the HyperRAM end up being passed out as chipset DMA data.  Also, it still seems that sometimes the HyperRAM takes too long to supply the data.

Finding the address calculation bug was fairly easy.  Also, confirming that sometimes CPU access data gets output is helpful to know, as I can go through carefully to make sure that this can't happen.  This was happening when writing data, which I was easily able to verify by filling a large block of the HyperRAM and monitoring the chipset DMA data.  It probably also happens when reading, but I'll have to make a little programme that does reads in a big loop to make sure.  But back to the writes, this is quite odd, as the data path for feeding data to the chipset via DMA is quite specific, and should only be activated when such a request is running.

Otherwise, we still have the problem of some writes to the trap-door HyperRAM failing, mostly when crossing an 8-byte boundary, but also often enough that detection of the trap-door HyperRAM never properly succeeds. 

Anyway, this post has rambled on long enough, and despite the remaining problem with setting up support for chipset DMA and read cache optimisations, much as in fact been achieved: We now have a perfect 27MHz pixel clock and HyperRAM works sufficiently well to be useful.  Speed optimisations can come later.  Here's the current speed performance of the HyperRAM, which we are calling "Slow RAM" from here on, to avoid confusion with the Hypervisor part of the OS.

So while I am not yet satisfied with this sub-system, and frankly am frustrated at how annoying and drawn-out it has been to get to this point, it is now in good enough shape for me to move onto other more pressing issues, like getting HDMI sound working on all monitors...

Friday, 1 May 2020

Trying another approach to HDMI-compatible Audio

So, we have had HDMI-compatible video output working for some time. We even have the audio working on many monitors, but not all.  There is something fishy with the audio output of the ADV7511 driver, when using certain TVs, monitors and HDMI capture devices. The symptom is either no audio, or no picture at all when audio is enabled.

I've been pulling my hair out over this for a long time now.

To try to circuit-break this situation, we have bought an FPGA development board that doesn't use an ADV chip, but has the FPGA directly drive the video signals.  Mike Field and others have done great work on this, to demonstrate that it is quite possible with the Artix7 series FPGAs we are using. 

The board we are using for this is a Mimas A7:

This has a smaller FPGA than the MEGA65, and can't run the MEGA65, but it is big enough for me to produce some video and audio, and try it out on the Samsung TV I have here. The Mimas A7 board even has an example HDMI-output project, which I have been able to use and begin to adapt.

But I am still seeing some funny things.

Primarily, if I try to use more than 4 bits of audio resolution, that the HDMI TV fails to get any picture at all.  It doesn't seem to matter exactly which four bits. I can even sometimes use more than 4 bits, so long as no more than 4 in a row in the 16-bit sample format are used.

What is REALLY weird, is that if I mask out audio bits with dip switches on the board, that there is no picture, even if I have the audio bits masked out.  This just makes absolutely no sense to me, as what comes out the HDMI connector on this board should be completely identical when the bits are masked as when I have them simply tied low -- yet only the latter gives a picture.

Now, I am trying to use a different video mode (720x576x50Hz) to what the example project uses (640x480x60Hz), so it is possible that the HDMI info frames are confusing the TV. by saying one thing, while the display stream has something quite different in it.  What I am suspecting is that the audio clock recovery coefficients (N and CTS) will be causing problems. It still doesn't explain why the really weird problem happens, but if I can confirm that I can use all 16 bits of audio resolution in the default mode, that will still be a helpful data point.

So I am trying to switch back to the default 640x480 mode, and then try to get audio working in that mode.  However, I am hitting another really weird problem:  When I try to workout the set of flags required for that mode (these are some flags that go into the HDMI infoframe), I can make them work if I have them hardwired to the correct values, but if I try to make them settable via the dip switches, they don't change. This is despite the fact that I can read the dip switches, put them into an array of boolean flags, and check them elsewhere in the design, even have their status output continuously on a serial port where I can see them changing. 

Yet the TV steadfastly claims that there is no signal, no matter how I twiddle the dipswitches, and see the correct values on the serial output to indicate that they are being set correctly.  So then I plumbed some return signals back through, so that I can see if the dip-switch values are getting propagated down into the HDMI video generator, which they are.  This is of course very bizarre, since they are clearly being set, and getting where they need to be, but seemingly not affecting the video stream being produced.

And of course this is VHDL and FPGAs, so each "recompile" takes several minutes, which makes for a really frustrating debugging experience, especially when making only tiny changes.

So, after several more waits for resynthesis runs, I have again confirmed that if the EnhancedMode (which enables HDMI info frames) is hard-wired to false (as compared to being held low via a dip-switch mapping !!), then I get a perfectly fine image on the TV.  Recall that I have confirmed that the signal in question goes into the HDMI module and comes back out again, confirming that it should darn well be visible to the HDMI module when the dip-switch toggles.

If the dip-switch is in the correct position on power-on, there should in fact be absolutely no difference in behaviour -- despite the opposite happening in practice.  The only possible clue here is that seemingly the flags are being treated as always true, when connected to the dip-switch.

Right. So, by adding a counter into the HDMI generator, that counts the number of times that a particular area is entered, it suddenly works.
So bizarre is this, that I even looked at the Vivado logs for both versions for any warnings that might differ between the two -- but there are none. 

This does not really give me any encouragement for tracking down the audio problems, since that is also exhibiting similarly nonsensical behaviour. It might just be that I need to update to the latest version of Vivado, and see if if this is a bug that has been fixed in the meantime. In any case, if I want to log an issue with Xilinx, they will want me to try it with the latest version anyway.

Ok. So 16GB of downloads over our poor satellite later, I have Vivado 2019.2 installed in place of the old 2017.4 version, and it pleasingly fixes this weird bug with needing the counter.  I'll now carefully and progressively back out the debug stuff I added in, and see if this now also fixes the weirdness I was seeing with not being able to use more than 4 bits of the audio. I'm hopeful, as the two problems feel rather similar in their complete insanity. Indeed my hunch seems to be true: I can now get 8-bit audio (and presumably the full 16-bit available) in 720p50. 

Next step is to switch out the sample VGA frame generator for the MEGA65 one, so that I can be sure that I can get audio with that.  At that point, I will prepare a bitstream for the team to test in Germany with the monitors that they have there.  If it works with every monitor we can throw at it, then we know we have a good solution.

So far so good: I now have a bitstream where a dip-switch controls if it is PAL or NTSC mode, and I still get sound. However, I am not yet convinced that the audio quality is as good as it should be. This could just be my wooden ear, or it could also be that the 36-element Sine table I am using is not that great.  So I'll still try to get an audio sample recorded and imported as a little ROM in the test design, so that I can verify that the sound is right, or not.

After all manner of further adventures with Vivado weirdness (or possibly PEBCAC), where changing a single unrelated line of code changed the behaviouor of the whole thing, I finally have audio working.  It is possible that the Samsung monitor rejects any HDMI signal that has what it thinks is impossible sound. I'm not sure. 

But in any case, I can now push 16-bit stereo audio out the HDMI interface, which is correctly received by the Samsung TV -- which is one step up on what we have been able to achieve with the ADV7511 HDMI chip on the R2 PCB.  Indeed, I've been able to reach this point in MUCH less time than the whole fiddling with the HDMI chip took. I'd still like to know whatever it is that the ADV7511 is doing that the Samsung TVs don't like.  But we don't really have the time to investigate that right now.

So now it's time to package up that bitstream for the others in the team to try on one of these FPGA boards with all the monitors that they have on hand there.  If that all goes well, we will have a solution that is probably a few Euros cheaper per machine. That will help us to either keep the cost down, or, possibly by the magic of modern economics, to include a nice secret surprise on the production MEGA65 PCBs...

Anyway, I'm glad to finally have HDMI with audio working on the Samsung, and "ich drucke mich die Daumen" (I'm crossing my fingers) that our German half of the team will have success with the monitor testing on their side.  I'll also see what random selection of HDMI-compatible displays we have lurking around the other buildings here over the next few days, and test those, too.

Saturday, 11 April 2020

Getting the other 8MB of RAM working

Those who have been following along the MEGA65 story for a while will know that the MEGA65 has 384KB of main "fast chip" memory which is clocked at 40MHz, and is accessible by the VIC-IV, and that we also planned from the outset to have additional RAM.

The original Nexys4 boards had a kind of RAM called PSRAM, which is a DRAM dressed up to pretend to be an SRAM.  PSRAM is wonderfully simple to control, and so it was easy to support.  Then Digilent couldn't source the PSRAM chip any more, or got requests to have larger memory on the Nexys4 boards, and switched to a DDR2 DRAM.  In comparison with the PSRAM, DDR DRAM is a horror to write a reliable controller for, and after considerable effort to do so, I gave up in disgust.

Then as Trenz were designing the MEGA65 main board, they said that there is a new standard for something like PSRAM called Hyper RAM (not to be confused with a Hypervisor, which is used to virtualise hardware to allow running guest operating systems, including on the MEGA65 to run the C65 or C64 ROMs as a "guest" on the MEGA65 hardware).  Hyper RAM has become popular because it offers much higher bandwidth than PSRAM could, upto 333MB / sec versus about 10 MB / sec for PSRAM, and because it uses fewer pins (but note that we won't be getting anything like that performance just yet).

So Trenz added an 8MB Hyper RAM chip to the MEGA65 mainboard as our additional RAM. I did some initial work to write a controller for the Hyper RAM, but initially couldn't communicate with it.  This was complicated by two things: First, the Hyper RAM interface is a bit more complicated than the PSRAM interface, which is how it gets both the higher bandwidth and reduces the pin-count.  Second, with the Hyper RAM chip on the mainboard of the MEGA65, there weren't any test points to probe with the oscilloscope to see if I had it right.  Thus it got pushed to the back-burner for a while, while we addressed other problems.

I now have it working, as you can see in the following photo, but it has been a bit of an adventure to get there:

Now, speaking of adventures, as many of you know, the family and I are living in the middle of the Australian Outback this year.  So I thought I would just let you know that we are safely bunkered down in isolation here like the most of the rest of you.  The main difference is that we are lucky enough to have a very big backyard to walk around in compared with most folks: The property here is about 640 square kilometres (something like 256 square miles), so it doesn't really feel as much like quarantine as it might otherwise.

The main difference for us, is that the Arkaroola Wilderness Sanctuary normally has lots of guests this time of year, but that's of course not happening right now.  So for probably the first time in 40 years Arkaroola is very quiet at this lovely time of year.  We are well into the Southern Hemisphere Autumn right now, which at Arkaroola means temperatures averaging around 25C during the day and about 15C overnight.  It's lovely weather to camp out on our own back verandah and enjoy watching dawn break over the mountains:

Apart from that, everything here is running pretty normally.  We still get our medical attention through the amazing Royal Flying Doctor Service.  So when we all needed to get our Flu immunisations, the folks running the sanctuary gave them a call to get an appointment for all 18 people living on the sanctuary to get their jabs.  The only difference is that instead of driving to the nearest doctor's rooms or waiting around and peeking through the curtains to see if the doctor was here yet, we drove to the nearest airstrip, and knew the doctor was arriving by the familiar drone of their Pilatus PC12:

Fortunately I've not needed to actually ride in one of their planes, although I have ridden in a PC12 about 10 years ago -- actually to this very same air strip. But that's a whole other story.  They are a very fun and functional little plane, with a jet turbine actually spinning the propeller, and so for such a little plane are really fast, cruising at something like 550km/hour.

But anyway, I just wanted to reassure that you I are safely holed up here, so development of the MEGA65 can continue.  So let's get back to the Hyper RAM stuff.

The Hyper RAM interface (we'll just call it HRAM from here, to save my typing fingers) is, on the surface of it, very simple:

HR_D : 8-bit data/address bus
HR_CS0 : Chip select/attention line
HR_CLK : Clock line (some have differential clock lines, but we can safely ignore that)
HR_RESET : Active low reset line
HR_RWDS : Read/write data strobe

The tricky bits are that HR_D is used for both data and addresses, and that the HR_RWDS line behaves quite differently between read and write. Oh, and because it is really a dressed-up DRAM, and the HRAM chip does the DRAM refreshing internally, the latency for a read or write request can vary, and you have to know what the possible latencies will be by reading (or configuring) some special registers in the HRAM.

Oh, yes, and although the HRAM has an 8-bit data bus, it is really 16-bit RAM internally, and so you still have to do complete 16-bit transactions, which are DDR, that is, data is clocked on both the rising and falling edge of the clock. You can fortunately mask writes on a byte-by-byte basis though, so that you can write a single byte.  This behaviour is also controlled by the HR_RWDS line, making it something a bit like a Swiss-Army knife, with lots of different functions that you have to be a bit careful of, if you don't want to accidentally cut yourself.

So let's start at the simple end of things, which is writing the configuration registers that control the latency settings of the HRAM.  Here is what the wave-form looks like:

(You might need to click on the image to see a bigger version of it)

There's a fair bit to explain here, so let's get started.  The hr_clk_p line is the positive half of the differential clock, so just think about that as being the HR_CLK described earlier.  The hr_cs0 line is the chip-select/attention line.  This gets pulled low at the start of a new transaction.  The HRAM then immediately drives the HR_RWDS line high (bottom line), to tell us that "extra latency" will be applied, because the HRAM is internally busy with a DRAM refresh right now.  But in this case, we will ignore it, because we are writing to a configuration register (but we will explain it further down in this blog post).

The command to the HRAM is then sent over 3 complete clock cycles, with the HRAM accepting one of the six bytes on the rising and falling edges of the clock.  So in this example, the command bytes are $60 at 5,547 usec, $00 at 5,559 usec, $01 at 5,571 usec, and then $00 for the remaining three bytes.  Thus the complete command word is $600001000000.  The command is sent most-significant-bit first. The first few bits (i.e., the highest numbered bits) of the command are:

Bit 47 - Read (1) or Write (0)
Bit 46 - Access to memory (0) or Configuration registers (1)
Bit 45 - Wrapped (1) or linear (0) access
Bit 44 - 16 - Upper address bits of the access
bit 15 - 3 - Reserved
Bit 2 - 0 - The lower 3 bits of the address, counted in 16-bit words, not bytes

Thus we can deduce that this request is a write to the configuration registers, with a byte address of $00001000.  This is the address of the CR0 register which contains the settings for latency for all other transactions.  CR0 is a 16-bit register, and we write $8FE6 into it. The second $8F is only there because my state machine is wonky. The $FA is also just a debug value that my controller was producing so that I knew that it knew that it had finished the transaction.

The observant among you will notice that HR_RWDS has gone from being driven high to going tri-state just before the last byte of the command has been sent. This is caused by the HRAM stopping indicating the latency status before the bus master might need to drive the HR_D lines (or the HR_RWDS line), so as to avoid cross-driving any signals.

So, that's all pretty straight forward.  Where it gets a bit more entertaining, is when you try to access the memory of the HRAM, because now we DO need to worry about the latency.  We'll start with reading, because that case is actually a bit simpler.


So here we see a similar start, where we pull HR_CS0 low, and then send out a command, this time $A00001000004, which corresponds to reading from address $00001008.  This time, because the HRAM knows that it will be supplying data to the bus master, instead of tri-stating HR_RWDS, it keeps it low until the read latency has expired.  The bus master doesn't need to know what the latency will be, as the HRAM will keep HR_RWDS low until it has the first word of data ready. Then it pulses HR_RWDS to mark the two bytes of each word that it returns.  In this case, $00001008 contains $FF, $00001009 contains $49, and the next seven addresses all also contain $49.  So that's not too complicated.

This ability to read many bytes in a single transaction is key to HRAM's ability to deliver high throughput. If this weren't the case, then about 4 clock cycles would be wasted on every access, just fiddling with HR_CS0 and sending the command -- let alone the latency for the HRAM to dig up the required data out of its DRAM, in this case another two clock cycles.  In short, a transaction to read a single byte would require about eight clock cycles.  Thus at the maximum clock speed of 166 MHz, HRAM would only be able to deliver about 20MB/second.  However by reading lots of data at a time, it is possible to get relatively close to the theoretical maximum.

Now, in the case of the MEGA65, we can't get anywhere near that throughput, because our clock speed internally is not that high, and because the version of the HRAM chip we are using is actually only capable of a 100MHz maximum clock speed anyway. As we need to synchronise with the MEGA65's existing internal ~40MHz CPU clock, we can only use 2x that for ~80MHz as our maximum clock rate.

And for now atleast, it's actually even worse:  I haven't had the time to work out how to use the DDR signal drivers in the FPGA, so I am simulating the DDR access using a 40x4 = 160MHz clock, where four clock cycles are required for each HRAM clock, resulting in a 40MHz effective clock rate.  This is because when writing bytes to the HRAM, the data has to be set up between clock edges, but when reading data back from the HRAM, it is synchronised on the clock edges.  So the state machine alternates between ticking the clock and reading or writing data.  This could be improved in the future, but as I have the HRAM working at an acceptable speed, it has been pushed to the back burner.

What is clear in any case, is that having about ten clock cycles per access (by the time we take the HRAM internal latency wait cycles into account) will result in horrible performance, even if we were able to double the clock speed to 80MHz. This is because the CPU will still be waiting at least ten cycles (or five, if we can use the FPGA DDR resources to double the clock speed) for EVERY read.  So a three byte instruction would take at least 3 x 10 (or 3 x 5) clock cycles.

The reality is actually somewhat worse, because the HRAM logically connects to the "slow device" bus in the MEGA65's memory architecture, so that it doesn't complicate the CPU's inner loop to the point where we would have to drop the CPU speed. It costs one cycle in each direction for the CPU to ask the slow device bus, and for the slow device bus to ask the HRAM, and then another cycle each for the answers to get communicated back. Thus the real access time is something like 1 + 1 + 10 + 1 + 1 = 14 cycles.  Even if we double the HRAM clock via DDR to 80MHz, it would still be 1 + 1 + 5 + 1 + 1 = 9 cycles.  Thus while getting the DDR stuff working would still be a great idea, it would represent only a modest improvement.

What we really need is some kind of cache, so that we can take advantage of the HRAM's ability to read multiple bytes in a single transaction, and then to be able to look at the data we have recently read as often as possible, instead of having to wait for a whole HRAM transaction each time.  This has the ability to completely remove the 10 (or 5) cycles, reducing the access time from the cache to 1 + 1 + 0 +  1 + 1 = 4 cycles. That's sounding much better. In fact, we can do even better, if we make the cache accessible from the slow device bus, we can reduce this to 1 + 1 = 2 cycles.  Of course, if the cache doesn't have the data, then we will still need to do the HRAM transaction, but if we can do that less often, then we will still see a significant speed up.

When writing data, we can do things a little differently: The CPU can (and does) just hand the write request to the slow device bus, which in turn hands it to the HRAM controller to dispatch in the background. The CPU can then continue with whatever it was doing. If that doesn't immediately require another write to the HRAM, the CPU can get on with useful work. This is called "latency hiding": The latency is still there for writing to the HRAM, but we can hide it, by allowing the CPU to get on with other work.

We don't have stacks of spare RAM in the FPGA, as we are already using every single 4KB BRAM block.  Also, caches are most effective when they have high "associativity". That's just a fancy way of saying that the cache can more effectively hold bits of memory from all over the place, without having to discard any of the others when reading a new one because the too many of the bits of the addresses of the already cached data and the newly fetched data are identical.  Also, an 8-bit computer with a huge cache just wouldn't feel right to us.  The net result is that the MEGA65's HRAM controller has two cache rows of eight bytes each, with two-way associativity. Since the associativity and the number of rows is identical, this means that you can have any two areas of eight bytes.  Don't worry if that sounds tautological, as with such a tiny cache it basically is a tautology.  The bottom line is that the two cache rows can be used independently, e.g., if you were copying data from one area of the HRAM to another, or executing code in the HRAM, while also reading or writing data in it.

So let's see what that cache gets us, with a simple speed benchmark I wrote as I was implementing this:

The top half of the display just describes the HRAM's internal settings.  This was before I enabled "variable latency" and reduced the HRAM's internal latency.  The 20MB/sec is for fast chip ram to fast chip ram copies, i.e., testing the MEGA65's main memory as a comparison. The copies and fills are all using the MEGA65's DMA controller.  We then have "cache enabled" and "cache disabled" results for the HRAM in various scenarios, including when we copy between the HRAM and the MEGA65's main memory, which we expect to be a fairly common situation, e.g., when software uses the HRAM in REU or GeoRAM compatibility mode.

Let's start by looking at the "cache disabled" figures to see just how bad the situation is, if we don't do anything to hide the read or write latency.  Indeed, it is pretty horrible. Copying from Extra RAM, i.e., the HRAM, to Fast RAM is probably the most representative here for reading, as the time to write to the Fast RAM is only one cycle per byte.  Here we see only 1.6MB/sec, which given we know the clock speed is 40MHz, and only one cycle is lost to the Fast RAM write, means that the HRAM reads are taking something horrible like 23 cycles. This isn't too surprising, because with the default HRAM latency, plus the 1 + 1 + 10 + 1 + 1 transaction overhead, we quickly rack up the cycles.  A C128 can read memory faster than this!

Now, if we look at the "cache enabled" section, we can see that things are already quite a bit better: We get 5MB/sec, i.e., around three times faster. This is because with an 8 byte cache, we only need to pay the 23 cycles one out of eight times. The other seven out of eight times we only need to pay 1 + 1 + 1 + 1 = 4 cycles (the short-cut cache in the slow device controller was not implemented at this point in time).

The next step was to reduce the HRAM's internal latency settings as much as possible. This reduced the HRAM's latency from being 6x2 = 12 cycles all the time, to instead being 3 cycles most of the time, and 3x2 = 6 cycles otherwise. This helped quite a lot, even without the cache, since it is just fundamentally reducing the time taken for each transaction:

We can now do copies to/from HRAM at around 8MB/sec, and we can even write to the HRAM quite a bit faster, as it is the write latency that is most affected by this change. Speaking of writes, this is where the next bit if fun came, both to get the HRAM working, and then to speed it up.

I mentioned earlier that the HRAM is internally using 16-bit words. At first I didn't properly accommodate this, and would try to make the write transactions as fast as possible, by aborting the write after writing the single byte I needed to write. This gave the rather weird effect that writing to even addresses worked just fine, but writing to odd addresses gave, well, rather odd results.  For example, the byte would get written to the correct location, and then often to the next odd address, or to an odd address somewhere else in the memory.  Eventually I realised that this was because the HRAM internally requires every transfer to be a multiple of 16-bits, and if you don't obey this when writing, then very odd things indeed can happen.

Once I had the writes working reliably, I realised that improving the write performance would require the implementation of a "write scheduler", that could collect multiple writes, and pack them into a single transaction.  In practice, this means another structure that looks very much like the read cache, but is instead for collecting writes.  It marks each byte that needs to be written in a little bitmap, and as soon as the HRAM bus is idle, it begins writing the transaction out.  There is even a bit of logic (that could do with some further optimisation) that allows writes to be added into a transaction that has already started, provided that they don't arrive too late.  Like the read cache, this potentially allows up to 8 bytes to be written in a single transaction, and should give a similar speed up, and indeed it does:

Writing is now effectively the same speed as reading, because it is being turned into a similar number of grouped transactions.  It was actually really pleasing to get this write scheduler working, as it is can have many subtle corner-cases, but in practice, was fairly simple to get working.

Now, if I had infinite time to optimise things, I would go further, for example, by:

1. Figuring that DDR thing out, so that I can effectively halve the remaining HRAM latency.
2. Monitor reads from a cache line, and when we see linear reads getting towards the end of a cache row, scheduling the reading of the next cache row.
3. Actually read double the data that fits in a cache line, and have it waiting in the wings to do a quick cache update when reading linearly.
4. Similarly for writing, if we see that the next writes are following the current write transaction, to just extend the current write transaction.
5. Seeing if I can make the cache lines wider.
6. Use the "wrapped" feature of the HRAM to begin reads at the exact byte in a cache row, to shave a couple cycles off random reads of bytes that are near the end of a cache row.
7. Generally optimise transaction timing by removing some of the glitching at the end of transactions, that is simply wasting time.
8. Figure out if can push the clock up to 100MHz exactly, although that will involve some kind of clock-domain crossing mechanism, which are often more trouble than they are worth.

However, we don't have infinite time, and I have a bunch of other tasks to tick off for the MEGA65, so 8MB/sec will have to do for now.  In any case, it is certainly much better than the original ~1.6MB/sec.  So I'll take the 5x gain, and call it quits for the moment.  There is of course nothing stopping me (or someone else) revisiting this in the future to get nice speed improvements out of the HRAM.  With those approaches above, it should be possible to perhaps double or better the current speed.

For those interested in following the adventure of getting the HRAM to this point, it's all documented in the github issue.