Sunday, 10 May 2020

More work on the RAM Expansion and Timing Closure

I recently got the MEGA65's 8MB built-in expansion RAM mostly working.  I say mostly working, because it was still doing various strange things sometimes.  So I have been spending all my spare time trying to get this sorted.
While in theory the RAM controller is simple, in order for it to offer decent performance, it ends up a bit more complex.

Let's start with the simple HyperRAM itself: This can be driven by a clock of upto 100MHz, allowing for DDR throughput of upto 200MB/sec.  Except that it cant.

First, we don't really have the means to use a 100MHz clock, because it's not a clean multiple of any of the clocks we are using. So we are instead using 81.5MHZ, which is 2x the MEGA65 CPU clock and 3x the VIC-IV's pixel clock.

Second, the HyperRAM is accessed by first sending it a command to read or write something.  Regardless of the clock rate, there is a minimum of 40ns plus a bit between transactions.  In reality, it takes about 100ns -- 200ns to actually perform a random read or write. 

Third, in order to not drag the maximum clock speed of the MEGA65's CPU down, the expansion RAM is on the "slow devices" bus.  They don't really have to be slow, but the CPU requires at least 2 clock cycles to access such a device.  That bus then asks the HyperRAM for data, which takes at least another 2 clock cycles.  Thus there is something like 100ns extra latency from the memory architecture of the MEGA65.

The result is that a naive implementation will have about 300ns latency, i.e., about what a real C64 in 1982 had with its ancient DRAM chips.  This is enough for 2 - 3MB/sec -- which with a 40MHz CPU is a bit disappointing.

Thus the HyperRAM controller and related bits and pieces in the MEGA65 does a number of tricks to hide this latency, and increase the throughput for linear accesses. We are concentrating on linear access, as we figure it is most likely how the expansion RAM will be used: You will either be copying things in or out of it.  Some of those tricks are:

1. Two eight-byte write buffers, that allow most writes to happen with just 2 cycle latency, unless the buffers are already busy. Each buffer covers an 8-byte aligned region of 8-bytes, and multiple writes to the region in quick succession will result in a merged 64-bit write to the expansion RAM.  This speeds up writing a lot.

2. A similar setup of two eight-byte read buffers that you could call a tiny cache.  Again, this helps avoid the 100 -- 200ns latency of the HyperRAM, but not the latency of the slow device bus and related bits.

3. A block-read mechanism that reads and buffers upto 32 consecutive bytes from the HyperRAM, and feeds them pre-emptively to the slow device bus as an 8-byte buffer. This really helps reduce read latency.

4. A pre-emptive pre-fetch mechanism in the slow device bus that pushes the byte after the previous read one to the CPU, so that if it is doing a linear read, it can avoid all but one cycle of latency in the slow device bus, and theoretically allow reads at up to 20MB/sec.

I have also added a priority access mechanism to the HyperRAM, so that we can have automatic chipset driven DMAs like access, a little like the Amiga, so that we could, for example, feed the digital audio channels directly, without any CPU intervention.  I have yet to work out exactly what that will look like, but it is important to get the interface in now, so that it can be worked on later.

So all of this mostly works, but there are various little corner cases.  Also, the timing closure on the MEGA65 as a whole was getting a bit out of hand again, so I have spent quite a lot of time trying to flatten logic, so that everything that needs to happen in a single clock tick can in fact happen. And reliably.  This also can help synthesis to run a bit faster, because it doesn't have to try so hard to make everything fit with acceptable timing.  But its also important when debugging weird behaviours, so that I can be confident it isn't just because everything is violating timing by a bit.

I now have it mostly meeting timing, and all of the above optimisations and features at least mostly working.  Where I am now upto is that there are some weird bugs with the chipset DMA, and chaining of the 32-byte fetches and the like.  Also the optimisation in (4) is not completely implemented.

First, I'll tackle the weird bugs with the DMA. Basically the HyperRAM controller can get confused and freeze up, messing up both the chipset DMA, as well as regular CPU access.  I think much of the problem is how it handles the situation where a request comes in from the slow_device bus at around the same time as a chipset DMA. The chipset DMA is supposed to take priority, since it might be quite timing sensitive.  This means that the incoming request from the slow device bus has to be remembered for later.

This is further complicated by the fact that the HyperRAM controller has an 81.5MHz and a 163 section, and we have to be careful how we pass information between the two.  Normally the 81.5MHz side registers and processes the requests from the slow device bus. But when the HyperRAM is actually being accessed, it is the 163MHz part of the circuit that comes into play.  The trouble comes when the 163MHz side realises that it can service a newly arriving request, which it should do to reduce latency.

It should also tell the 81.5MHz that it has processed the request, so that it doesn't get processed twice. And this seems to be where it messes up at the moment: The request does in fact get processed twice.  This probably means I need to have a signal that tells the 81.5MHz side when a latches request has in fact already been processed.  Ok, implemented that, and it looks to simulate properly.

Next is to finish implementing the pre-emptive delivery of data to the CPU, so that it saves time every time it can, instead of just sometimes. Basically now it will save time at most 1/2 the time, because it supplies the next byte whenever the CPU requests a next byte, and it can.  For a linear read, that's only 3 out of every 8.  As we have the cache to work with, we have an 8-byte aligned block, and we should be able to pre-fetch 7 out of every 8 bytes in a linear read. We just need to watch for when the CPU consumes the previous pre-fetched byte (which it already tells the slow device bus) and get the next byte ready in that case. This all happens in slow_devices.vhdl.

After that, it was time to look at when we flush out the 8-byte write buffers. On the one hand, we want to not abort early if the last few byte positions don't need writing, in case a new write comes along that immediately follows in memory, and thus could be chained. But on the other hand, if all our write buffers are currently busy, such that we can't accept any new writes, then we really want to abort as quickly as possible, so that we can start processing any new requests.

I have a simple sequence of memory accesses in a test harness that I use to verify that each of these changes has not broken anything, i.e., we don't lose any writes or read stale data etc.  As I have been working on these various optimisations, the average transaction time has dropped from around 120ns to around 68ns -- so all up, this is quite a significant performance boost.  This should allow a hyperram access to, on average, occur in around 3 clock cycles, with linear reads potentially only requiring 2 clock cycles.  Thus, I'm hoping that when I run my speed test programme, it will show "Copy Slow RAM to Chip RAM" speeds of ~40/3 = ~13,000KB/sec. 

In generally, looking at the test harness output, things look pretty good now for the most part. The only exception is that a few memory accesses seem to take a REALLY long time, longer than should really be possible, up to around 450ns.  Something pathological must be happening there, such as an aborted 32-byte block read that would have got the data, but is aborted, and then a new fetch is started that does get it.

Looking into it, it might just be when there are a lot of writes banked up that need flushing.  But I did notice a weird thing, where data was being spread into both of the 8-byte read caches, instead of all going into a single one.  While that can't cause the current problem, it would still be good to squash it, as otherwise it will cause havoc with the performance of the cache when reading, as one of the cache lines will just get ignored in all likelihood.

Fixing timing closure is often about simplifying IF statements in the VHDL, so that they can be processed more quickly by the hardware.  Sometimes this can be through reducing the number of nested IF statements. Other times it is about pre-caculating expressions that appear in the IF statements, and then using the simple pre-calculated true or false result from that.

Vivado does give nice reports about signals that don't meet timing closure, like this:

Slack (VIOLATED) :        -1.245ns  (required time - arrival time)
  Source:                 hyperram0/block_valid_reg/C
                            (rising edge-triggered cell FDRE clocked by clock162  {rise@0.000ns fall@3.077ns period=6.154ns})
  Destination:            hyperram0/current_cache_line_drive_reg[3][5]/D
                            (rising edge-triggered cell FDRE clocked by clock162  {rise@0.000ns fall@3.077ns period=6.154ns})
  Path Group:             clock162
  Path Type:              Setup (Max at Slow Process Corner)
  Requirement:            6.154ns  (clock162 rise@6.154ns - clock162 rise@0.000ns)
  Data Path Delay:        7.383ns  (logic 1.200ns (16.253%)  route 6.183ns (83.747%))
  Logic Levels:           6  (LUT4=1 LUT6=5)
  Clock Path Skew:        0.027ns (DCD - SCD + CPR)
    Destination Clock Delay (DCD):    6.300ns = ( 12.454 - 6.154 )
    Source Clock Delay      (SCD):    6.600ns
    Clock Pessimism Removal (CPR):    0.327ns
  Clock Uncertainty:      0.073ns  ((TSJ^2 + DJ^2)^1/2) / 2 + PE
    Total System Jitter     (TSJ):    0.071ns
    Discrete Jitter          (DJ):    0.128ns
    Phase Error              (PE):    0.000ns

    Location             Delay type                Incr(ns)  Path(ns)    Netlist Resource(s)
  -------------------------------------------------------------------    -------------------
                         (clock clock162 rise edge)
                                                      0.000     0.000 r 
    V13                                               0.000     0.000 r  CLK_IN (IN)
                         net (fo=0)                   0.000     0.000    CLK_IN
    V13                                                               r  CLK_IN_IBUF_inst/I
    V13                  IBUF (Prop_ibuf_I_O)         1.521     1.521 r  CLK_IN_IBUF_inst/O
                         net (fo=2, routed)           1.233     2.755    clocks1/clk_in
    MMCME2_ADV_X0Y0                                                   r  clocks1/mmcm_adv0/CLKIN1
    MMCME2_ADV_X0Y0      MMCME2_ADV (Prop_mmcme2_adv_CLKIN1_CLKOUT5)
                                                      0.088     2.843 r  clocks1/mmcm_adv0/CLKOUT5
                         net (fo=1, routed)           2.018     4.861    clock162
    BUFGCTRL_X0Y4                                                     r  clock162_BUFG_inst/I
                        net (fo=1126, routed)        1.643     6.600    hyperram0/clock162_BUFG
    SLICE_X15Y57         FDRE                                         r  hyperram0/block_valid_reg/C
  -------------------------------------------------------------------    -------------------
    SLICE_X15Y57         FDRE (Prop_fdre_C_Q)         0.456     7.056 f  hyperram0/block_valid_reg/Q
                         net (fo=43, routed)          1.406     8.462    hyperram0/block_valid_reg_0
    SLICE_X21Y49                                                      f  hyperram0/current_cache_line_drive[0][7]_i_5/I3
    SLICE_X21Y49         LUT4 (Prop_lut4_I3_O)        0.124     8.586 r  hyperram0/current_cache_line_drive[0][7]_i_5/O
                         net (fo=135, routed)         1.242     9.828    hyperram0/current_cache_line_drive[0][7]_i_5_n_0
    SLICE_X11Y39                                                      r  hyperram0/current_cache_line_drive[3][5]_i_7/I5
    SLICE_X11Y39         LUT6 (Prop_lut6_I5_O)        0.124     9.952 f  hyperram0/current_cache_line_drive[3][5]_i_7/O
                         net (fo=1, routed)           1.245    11.197    hyperram0/current_cache_line_drive[3][5]_i_7_n_0
    SLICE_X15Y55                                                      f  hyperram0/current_cache_line_drive[3][5]_i_5/I0
    SLICE_X15Y55         LUT6 (Prop_lut6_I0_O)        0.124    11.321 f  hyperram0/current_cache_line_drive[3][5]_i_5/O
                         net (fo=1, routed)           0.803    12.124    hyperram0/current_cache_line_drive[3][5]_i_5_n_0
    SLICE_X9Y63                                                       f  hyperram0/current_cache_line_drive[3][5]_i_3/I1
    SLICE_X9Y63          LUT6 (Prop_lut6_I1_O)        0.124    12.248 f  hyperram0/current_cache_line_drive[3][5]_i_3/O
                         net (fo=2, routed)           0.908    13.156    hyperram0/current_cache_line_drive[3][5]_i_3_n_0
    SLICE_X5Y71                                                       f  hyperram0/current_cache_line_drive[3][5]_i_4/I4
    SLICE_X5Y71          LUT6 (Prop_lut6_I4_O)        0.124    13.280 r  hyperram0/current_cache_line_drive[3][5]_i_4/O
                         net (fo=1, routed)           0.579    13.859    hyperram0/current_cache_line_drive[3][5]_i_4_n_0
    SLICE_X3Y71                                                       r  hyperram0/current_cache_line_drive[3][5]_i_1/I4
    SLICE_X3Y71          LUT6 (Prop_lut6_I4_O)        0.124    13.983 r  hyperram0/current_cache_line_drive[3][5]_i_1/O
                         net (fo=1, routed)           0.000    13.983    hyperram0/current_cache_line_drive[3][5]_i_1_n_0
    SLICE_X3Y71          FDRE                                         r  hyperram0/current_cache_line_drive_reg[3][5]/D
  -------------------------------------------------------------------    -------------------

                         (clock clock162 rise edge)
                                                      6.154     6.154 r 
    V13                                               0.000     6.154 r  CLK_IN (IN)
                         net (fo=0)                   0.000     6.154    CLK_IN
    V13                                                               r  CLK_IN_IBUF_inst/I
    V13                  IBUF (Prop_ibuf_I_O)         1.450     7.604 r  CLK_IN_IBUF_inst/O
                         net (fo=2, routed)           1.162     8.766    clocks1/clk_in
    MMCME2_ADV_X0Y0                                                   r  clocks1/mmcm_adv0/CLKIN1
                                                     0.083     8.849 r  clocks1/mmcm_adv0/CLKOUT5
                         net (fo=1, routed)           1.923    10.772    clock162
    BUFGCTRL_X0Y4                                                     r  clock162_BUFG_inst/I
    BUFGCTRL_X0Y4        BUFG (Prop_bufg_I_O)         0.091    10.863 r  clock162_BUFG_inst/O
                         net (fo=1126, routed)        1.590    12.454    hyperram0/clock162_BUFG
    SLICE_X3Y71          FDRE                                         r  hyperram0/current_cache_line_drive_reg[3][5]/C
                         clock pessimism              0.327    12.781   
                         clock uncertainty           -0.073    12.708   
    SLICE_X3Y71          FDRE (Setup_fdre_C_D)        0.031    12.739    hyperram0/current_cache_line_drive_reg[3][5]
  -------------------------------------------------------------------
                         required time                         12.739   
                         arrival time                         -13.983   
  -------------------------------------------------------------------
                         slack                                 -1.245   


These are great in that they tell me from which signal to which signal the problem is, and how bad it is. But they are also a pain, because they don't actually tell you which lines of code is causing the path. So you have to hunt through the source trying to find the sequence of nested IF statements that apply, but which might be spread over thousands of lines of code.  It's a bit of a pain.  So I made a really quick-and-dirty tool that lets me see all the IF statements that lead up to a given signal in the source.

So in the case above, we can see:

---------------------------------------
 387        if rising_edge(pixelclock) then
 622          if (read_request or read_request_latch)='1'
 635            if (block_valid='1') and (address(26 downto 5) = block_address) then
 650              if (address(4 downto 3) = "11") and (flag_prefetch='1')
 658 >>>             ram_address <= tempaddr;
---------------------------------------
 387        if rising_edge(pixelclock) then
 622          if (read_request or read_request_latch)='1'
 826            elsif address(23 downto 4) = x"FFFFF" and address(25 downto 24) = "11" then
 883            elsif request_accepted = request_toggle then
 887 >>>           ram_address <= address;
---------------------------------------
 387        if rising_edge(pixelclock) then
 899          elsif queued_write='1' and write_collect0_dispatchable='0' and write_collect0_flushed='0'
 922          elsif (write_request or write_request_latch)='1' and busy_internal='0' then
 929            if address(23 downto 4) = x"FFFFF" and address(25 downto 24) = "11" then
 972              if cache_enabled = false then
 980 >>>             ram_address <= address;


I now have line numbers, and can see exactly what is going on.  Because it also shows me the rising_edge() statement, I can also tell under which clock it is .  So in this case, it's likely line 635 that is the problem.  It's also showing me that while this IF does depend on the block_valid signal, it also compares a 22 bit address -- that's the thin that's really going to be slow.  Thus I get more useful information than Vivado otherwise gives me.  I'm sure there are expensive professional tools that can do some of this, but it was faster to cook this up than to even find out if such tools exist.  It would be nice if the VHDL mode in emacs could display something like this, though.

Anyway, in this particular case, I can't do a great deal about it, unless I add an extra stage of latency to all hyperram reads, as pre-calculating the result of the comparison would add a cycle of delay.  This is the whole trade-off of clock speed and pipeline depth, which the Pentium 4 famously did really badly.  In this instance, I have fixed a whole lot of other timing violations, in the hope that Vivado can do better with this particular signal.  I have hope, because while the total path takes too long, only 1.2ns of that is logic -- the other ~6ns is routing overhead.  By making it easier for Vivado to route other signals, it can often enable it to get better closure on critical paths like this one.  Anyway, we will know in about half and hour when it finishes synthesising.

So, that's helped me get quite close to timing closure, but there are a few signals that are not quite meeting timing closure. My next step was to fix the clock frequencies we are using. They are approximately right, but not exact. They are in fact a little too fast. This can also cause us trouble with the HDMI output, so it makes sense to get these fixed. I recently worked out how to better control clocks. Basically the problem is that each clock generator in the FPGA can only take the input clock and multiply and divide it by a fixed range of values. Thus you can only generate a limited set of frequencies.

But you can chain the generators to take the output of another one as its input, and thus generate many more frequencies.  The trouble then, is that you need to consider a LOT of possible combinations. Billions and billions of combinations. So I made a little program that tries all the possible values, and looks for the one that can get us closest to the target frequency. Here's the important part of it:

  float best_freq=100;
  float best_diff=100-27;
  int best_factor_count=0;
  int best_factors[8]={0,0,0,0,0,0,0,0};

  int this_factors[8]={0,0,0,0,0,0,0,0};
 
  // Start with as few factors as possible, and then progressively search the space
  for (int max_factors=1;max_factors<4;max_factors++)
    {
      printf("Trying %d factors...\n",max_factors);
      for(int i=0;i<max_factors;i++) this_factors[i]=0;
      while(this_factors[0]<uniq_factor_count) {
    float this_freq=27.0833333;
    for(int j=0;j<max_factors;j++) {
      //            printf(" %.3f",factors[this_factors[j]]);
      this_freq*=uniq_factors[this_factors[j]];
    }
    //        printf(" = %.3f MHz\n",this_freq);
    float diff=this_freq-27.00; if (diff<0) diff=-diff;
    if (0&&diff<1) {
      printf("Close freq: ");
      for(int j=0;j<max_factors;j++) printf(" %.3f",uniq_factors[this_factors[j]]);
      printf(" = %.3f MHz\n",this_freq);
    }
    if (diff<best_diff) {
      best_diff=diff;
      best_freq=this_freq;
      for(int k=0;k<8;k++) best_factors[k]=this_factors[k];
      best_factor_count=max_factors;
      printf("New best: ");
      for(int j=0;j<max_factors;j++) printf(" %.3f",uniq_factors[this_factors[j]]);
      printf(" = %.6f MHz\n",this_freq);
      printf("          ");
      for(int j=0;j<max_factors;j++) {
        float uf=uniq_factors[this_factors[j]];
        int n=0;
        for(n=0;n<(1<<14);n++) if (uf==factors[n]) break;
        if (j) printf(" x");
        printf(" %.3f/%.3f",(1.0+((n>>0)&0x7f)/8),(1.0+((n>>7)&0x7f)/8));
      }
      printf("\n\n");     
    }


    // Now try next possible value
        this_factors[max_factors-1]++;
    int j=max_factors-1;
    while((j>=1)&&(this_factors[j]>=uniq_factor_count)) {
      this_factors[j]-=(1<<14);
      this_factors[j-1]++;
      j--;
    }
      }
    }
}


Basically it tries all possible values, and reports each time it finds a better one.  After optimising it to only consider the unique frequency multipliers that each generator can produce, it takes less than a minute to find an exact match:

Calculating set of adjustment factors...
There are 159 unique factors.
Trying 1 factors...
New best:  1.000 = 27.083334 MHz
           1.000/1.000

Trying 2 factors...
New best:  2.333 0.429 = 27.083332 MHz
           7.000/3.000 x 3.000/7.000

New best:  0.923 1.077 = 26.923079 MHz
           12.000/13.000 x 14.000/13.000

New best:  0.929 1.071 = 26.945152 MHz
           13.000/14.000 x 15.000/14.000

New best:  0.933 1.067 = 26.962965 MHz
           14.000/15.000 x 16.000/15.000

Trying 3 factors...
New best:  7.000 0.133 1.067 = 26.962967 MHz
           7.000/1.000 x 2.000/15.000 x 16.000/15.000

New best:  1.500 0.818 0.812 = 27.006392 MHz
           3.000/2.000 x 9.000/11.000 x 13.000/16.000

New best:  1.500 1.182 0.562 = 27.006390 MHz
           3.000/2.000 x 13.000/11.000 x 9.000/16.000

New best:  1.667 0.556 1.077 = 27.006176 MHz
           5.000/3.000 x 5.000/9.000 x 14.000/13.000

New best:  1.667 0.778 0.769 = 27.006174 MHz
           5.000/3.000 x 7.000/9.000 x 10.000/13.000

New best:  1.667 0.385 1.556 = 27.006172 MHz
           5.000/3.000 x 5.000/13.000 x 14.000/9.000

New best:  0.600 1.800 0.923 = 27.000002 MHz
           3.000/5.000 x 9.000/5.000 x 12.000/13.000

New best:  0.800 1.800 0.692 = 27.000000 MHz
           4.000/5.000 x 9.000/5.000 x 9.000/13.000

These factors are in addition to the existing 100MHz x 8.125 / 30 that I am already doing, which results in 27.083333 MHz.  I was a little bit surprised that it was able to find an exact match using only three factors. It isn't even just that it is correct to 6 decimal places. It really is exact:


100 x 8.125/30 x 4/5 x 9/5 x 9/13  MHz
= 100 x 65/240 x 36/25 x 9/13 MH
= 100 x 21060/78000 MHz
= 100 x 0.27 MHz
= 27 MHz

Well, that's all sounding good. So I modified clocking.vhdl to create the chain of MMCM instances to generate the exact clock frequency.  First go at synthesis, Vivado threw an error, because it is not possible to directly assemble such a long chain of MMCM units.  To solve this, I added a Global Buffer unit to take one of the intermediate MMCM outputs, so that it would be available at the next MMCM with minimal distortion or skew.  Hopefully that will solve the problem.

Next problem is that the last step multiplies the intermediate clock by 9, and ends up out of the valid range of 600 - 1200MHz.  This is because:

100 MHz x 8/10 = 800MHz(ok)/10 = 80MHz
80MHz x 9/5 = 720MHz(ok)/5 = 144MHz
144MHz x 9/13 = 1296MHz(not ok)/13

So I need to reorder these factors, so that the intermediate frequencies remain in the range of 600 - 1200MHz.

100 MHz x 9/13 = 900MHz(ok)/13 = 69.23MHz
69.23MHz x 9/5 = 623MHz(ok)/5 = 124.61MHz
124.61MHz x 8/10 = 996.92MHz(ok)/10 = 99.692MHz
(which then gets used to generate the 27MHz clock via x 8.125/30)

So, I'll just rearrange into that order and try to synthesise again...
Ok, that worked, and the bitstream runs, with HDMI output (still) working.  So it looks like my clock magic has worked just fine.

There are now just a very few timing violations of <0.2ns in the hyperram left to resolve.  That said, they are now a small percentage compared to the clock interval (~6.2ns), so SHOULDNT be a problem, except unless the FPGA voltage dipped, or the FPGA got really hot. I am seeing some weird problems still, though, so I will likely still work to completely eliminate them, so that I don't have to suspect that they aren't a problem, but rather that I can KNOW that they aren't a problem.

Digging through, I found one last sneaky slow address comparison that is likely the cause of that, so time to resynthesise again.  Hopefully that will be the last timing violation there, and I will be able to focus on the remaining bugs I am seeing.

Finally! I now have a synthesis run with full timing closure on the HyperRAM, and even the persistent timing closure problem with the CPU is less than 1ns late on a ~24ns clock, and it is related to the SP register, not the HyperRAM. Thus we can more forward confident that any problems with the HyperRAM are not related to lack of timing closure.

So, speaking of problems, we certainly still have some right now:
1. The prefetch logic has some significant problems, resulting in mis-read data.  The first two or three bytes read ok, and then the next byte read rubbish. The pattern is a bit longer, but clearly something is a bit bonkers.
2. Linear reads crossing a 32-byte boundary read rubbish after crossing that boundary, presumably because there is something wonky with the pre-fetch chaining.
3. The external hyperram seems to be even less reliable, but it has been since the start, presumably because the leads from the FPGA to it are longer.
4. When cache rows are read, the bytes are being spread between the two cache rows, instead of all going into the same row.
5. Copying from chip to hyperram is slow, and a bit variable in speed. It's less than half the speed of filling hyperram, so there is presumably something funny going on with the write scheduling.

First step: Make the pre-fetching run-time switchable, so that I can more easily assess the rest of the system.  That worked fine. My next step was to find and fix a bunch of the bugs in the prefetch logic and related areas. This looks like it might fix (1), (2) and (4) above.

I'm synthesising the fixed version now, to see if it really fixes those problems, but am currently fighting with multiple drivers.  Multiple drivers is when signals are trying to be set from multiple (unrelated) places, and the FPGA synthesis software can't figure out how to reconcile them, for example, if they are set from completely separate pieces of hardware.  They aren't too hard to fix, just a bit fiddly.  Fortunately improving the timing closure has dropped synthesis time back down to about 30 minutes instead of 45 minutes or so, so the going is a bit quicker now.

What I have done as a first step, is to get a working configuration for the HyperRAM.  It seems that using 80MHz clock for 160MB/sec throughput (via DDR) is just too much for the current PCB layout (or for how I have implemented it in the FPGA).  By leaving all stages of the transaction at 40MHz, that fixes a lot of problems, but at a modest cost to read performance, since 80MHz was only being used during reads, anyway.  I also disabled all of the CPU prefetch logic by default, and also the whole read-cache machinery.  The write cache is already rock solid, it seems, so that has been left.

The external HyperRAM still has some problems with writes, which I need to look into. Basically some writes get lost. I am suspecting that the chip select line to the HyperRAM might have slightly faster response than the data lines, and thus the last write gets ignored.  It could be something more sinister, of course, like writes being ignored in the write cache, but the fact that the internal HyperRAM is now rock solid makes me think that that is very unlikely.

The result is a known working configuration, for the internal HyperRAM at least, from which to try to debug everything else.  Everything else now largely means the read cache machinery.  The downside, in the interim at least, is that the read performance is, without the caching, quite poor, while linear writing is a lot faster, due to the functioning write caches.  So we get a result like the following (remember the HyperRAM is what is used as Slow RAM):

Fill Slow RAM: 9,287KB/sec
Copy ChipRAM to Slow RAM: 4,334KB/sec
Copy Slow to Chip RAM: 2,500KB/sec
Copy Slow RAM to Slow RAM: 1,547KB/sec

So, let's go through and find and fix the problems with the read cache.  First, we try to use simulation again, as this is MUCH faster and more informative.  With the reading set to 80MHz, simulation showed no problems. But when I drop the read speed to 40MHz, we do now get some errors.  That's a good thing, as I can investigate their causes and deal with them.

We also have another clue: The read cache works fine when there aren't also writes happening.  This means it is most likely something with the cache coherency logic. Either the read caches are taking too long to get updated when a new value is written that would hit them, or they are simply not getting updated, or getting updated incorrectly.

So, into the simulation to see what is going on there.  It looks suspiciously like chipset DMA reads are being used to reply to CPU memory requests.  That can't be the whole problem, because I have run tests with that disabled, and there are still cache consistency problems. However, I have seen evidence that the chipset accesses were stuffing things up, so its nice to have captured this in simulation.  So, that was an easy fix. Simulation then showed errors caused by a weird fix I had in place from when we were running at 80MHz, which was breaking things at 40MHz.  Having fixed that, simulation reveals no more errors. So time to synthesise that, and then see how it behaves on the real hardware.

Ok. Synthesis is now complete.  Having the cache enabled still causes problems when reading recently written locations. Its just that I can't reproduce it in simulation anymore, which means I need to use other approaches to track the problem down.  I have fortunately made a mechanism to examine the state of the cache in real-time, so that I can tell if a byte is being mistakenly put into the wrong cache row. 

To investigate this further, I have modified the hyperram test programme cache tests to reproduce this problem, and then to show me some information about what is going on. It looks like when we write $18 into $8000808, that it gets erroneously written into the first byte of the cache line for $8000800-$8000807.  Knowing that this is the case, I have been able to create a sequences of transactions that can reproduce the problem in simulation - yay! Well, except it seems to be some other error that I am seeing in simulation. No bother. It's still an error to be squashed, and that I can reproduce in simulation.  Actually, it DOES look to be the same type of error, just another instance of it.

The problem is that the cache_row0_address_matches_cache_row_update_address signal that indicates when a cache row matches the update address for the cache is delayed by a cycle (remember I mentioned this kind of thing, when I was talking about flattening logic by pipelining/pre-calculating signal values).  Probably the easiest solution for this, is to clear the signals whenever I update the cache update address, so that they can never be wrong.  It does mean that on the occassions when they would be right, we don't get the benefit of the cache, but that's livable.

Except life is never that simple. In this case, the cache_row_update_address signal and these precalculated comparison signals are generated in separate design parts, and will generate more of those annoying "multiple driver" problems if I just try to set it in both places.  Instead I'll just create a signal that blocks the use of these signals for one clock after the update address has been changed. That fixes the problem in simulation, so now to see if it fixes it when synthesised.

Well, the problem still shows up after synthesis -- but no longer in the little test-case I made, only in the larger test.  Now to try to figure out how to reproduce it with the simplest test case, so that I can hopefully reproduce it under simulation, where running thousands of memory accesses is too cumbersome and slow.

On initial inspection, it isn't actually the same problem, just a similar one: Now it reads an 8 byte row as all $00, instead of their correct values.  My suspicion is that the problem is similar: The cache row address is being compared with something a cycle after it is being updated, and thus there is an internal inconsistency. More precisely, there is a cache row being loaded with entirely incorrect data.  It looks like it might actually be the current_cache_line that gets exported to the slow_devices controller to reduce read latency. So I'll make that inspectable at run-time, too.

While I wait for that, I'll have another think about the chipset DMA stuff.  This is still being quite weird.  If these DMAs are running, the CPU can still (but less often) read wrong data. This only happens when the cache is enabled, so that's a clue that we still have some cache consistency problem.  Also, the chipset DMAs only seem to return data (or at least, interesting data) when the CPU is accessing the hyperram. That is, the CPU and chipset seem able to get data intended for each other.

I went through with a fine tooth comb, and made sure that Under No Circumstances Whatever could a chipset DMA fetch contaminate any of the cache machinery. That has the CPU now safely reading only its own data. But the chipset is still only receiving data when the CPU is accessing the HyperRAM.  This is still really annoying me, because I can't reproduce it under simulation, even when I let the HypeRAM bus otherwise remain idle. 


I ran simulation again,  to confirm that lots of HyperRAM requests can get sequenced one after the other, and that the data gets received, and that all looks fine. I also synthesised a debug mode that asserts the data ready strobe to the chipset the whole time, and that DOES cause the dummy data to be delivered. Thus I am confident that the plumbing between the two is correct. It just leaves the major mystery of why the HyperRAM seemingly gets stuck handling these chipset DMA requests.

To try to figure out what is going on, I have added a couple of extra run-time switches that will allow me to make the chipset DMA to take absolute priority on the HyperRAM bus, in case the priority managment is the issue. I also found a couple of places that, although unlikely, could be contributing to the problem, and tweaked those.  Basically they are places where it might have been possible for one of these data requests to get cancelled before it was complete. 
However, if those were the problem, then I'd expect the whole thing to freeze up, due to the end of the transaction not being properly acknowledged. So we'll see how that goes.

If this doesn't reveal any clues, then I am starting to suspect that the HyperRAM might be doing something funny with taking a VERY long time to respond. We are talking about ~1 milli-second here instead of the tens of nano-seconds that it should take to service a request.

That's improved things a bit: Some data from somewhere is now getting read for the chipset DMA, but it seems to come from the wrong address, and some CPU memory accesses to the HyperRAM end up being passed out as chipset DMA data.  Also, it still seems that sometimes the HyperRAM takes too long to supply the data.

Finding the address calculation bug was fairly easy.  Also, confirming that sometimes CPU access data gets output is helpful to know, as I can go through carefully to make sure that this can't happen.  This was happening when writing data, which I was easily able to verify by filling a large block of the HyperRAM and monitoring the chipset DMA data.  It probably also happens when reading, but I'll have to make a little programme that does reads in a big loop to make sure.  But back to the writes, this is quite odd, as the data path for feeding data to the chipset via DMA is quite specific, and should only be activated when such a request is running.

Otherwise, we still have the problem of some writes to the trap-door HyperRAM failing, mostly when crossing an 8-byte boundary, but also often enough that detection of the trap-door HyperRAM never properly succeeds. 

Anyway, this post has rambled on long enough, and despite the remaining problem with setting up support for chipset DMA and read cache optimisations, much as in fact been achieved: We now have a perfect 27MHz pixel clock and HyperRAM works sufficiently well to be useful.  Speed optimisations can come later.  Here's the current speed performance of the HyperRAM, which we are calling "Slow RAM" from here on, to avoid confusion with the Hypervisor part of the OS.



So while I am not yet satisfied with this sub-system, and frankly am frustrated at how annoying and drawn-out it has been to get to this point, it is now in good enough shape for me to move onto other more pressing issues, like getting HDMI sound working on all monitors...

3 comments:

  1. I'm a bit confused here. Trapdoor RAM? Is that an extra ramboard that hooks up to the motherboard?

    As always thanks for the update and your hard work!

    ReplyDelete
    Replies
    1. howdy,
      Yes, the Trapdoor RAM is in this case experimental support for an existing double-PMOD RAM board. It physically already fits the MEGA65's trapdoor, and is electrically compatible, so it's just a case of making the VHDL correctly support it. It's really just intended as a proof-of-concept trapdoor expansion. We expect to make an expansion board that connects via the trapdoor slot in the future, that will give userport, tape port etc.

      Delete