Sunday 24 May 2020

Improving the MEGA65 Audio Mixer

This past week we have been working on improving some of the cross-development tooling that we use, which I will write about soon. But for now there is nothing really visible to show.  So I will instead share with you the work I did today on getting the MEGA65 Audio Mixer that is accessed from the Freeze Menu more user friendly.

We now have some nice decibel logarithmic volume scales, and a bunch of friendly controls for controlling overall volume level, as well as selecting stereo/mono, and swapping left/right channels.  You can also use the up/down cursor keys to pick a particular element and left/right cursor keys to change the values.    You can also press T and it will play a few musical notes through the four SIDs, alternating left and right, so that you can more easily verify that the settings are sensible, and not left/right swapped.

All up, it now feels like an interface that the average user could use, especially compared with the previous proof-of-concept audio mixer. Just in case you can't remember or haven't seen just how "user feindly" the previously one was, here is a reminder:

While it is technically more powerful (as you can set every coefficient in the full-cross-bar audio mixer separately), you really need to know what you are doing to avoid instant and prolonged confusion.

Wednesday 20 May 2020

Injection Moulding Tooling Progress

A quick post just to share the exciting progress with the injection-moulding tool manufacture for the MEGA65 case.  The major pieces of the tool have now been produced, as you can see below!

First, here we see the inside of the lower case. The tools are all in negative profile. The big round holes are for the pushers for pushing the cases out of the tool. The smaller round holes are all for screw bosses -- so in addition to the bosses for the motherboard, there are plenty of extra ones for expansion boards, optional internal speaker etc. You can also see the MEGA65 team and major donors for the tooling costs here, so that they will appear as part of every case we produce. We are very proud to recognise some of the key contributors to the creation of the MEGA65 in this way, and wish we had limitless space to be able to recognise everyone who has been involved in the project.

Then we have the outside of the underside, where we can see the hole for the trap-door, and the logos for M.E.G.A. and Hintsteiner.  Note that it seems to be missing 2 of the sides. Those are the sides where all the ports go.  Those sides comea in as separate pieces that move in before the plastic is loaded, and then pull out the way before the pushers push the finished part out of the mould. 

Then for the upper half of the case we have the same inner and outer pieces, although this time quite a bit simpler. The funny sword looking shape in the middle is where the plastic comes in to distribute around, and to hold the piece straight as it comes out warm.
And of course the opposite side of that.  Here we also see the really big diameter holes where the rods go through on which the mould will slide together and apart each time a piece is moulded.  The lower half of the case will have something similar, but of course more complicated because of the pushers that come in from the sides.

Then we have the relatively simple trapdoor slot and eject button on a single little family mould. These can go on a single mould because they are similar enough in volumae that they can be balanced within the mould, to obtain good results.

 Here is the other part for this last mould still being prepared:

While I haven't had the chance to discuss it with them directly, my understanding is that the next stage is the assembling of the tools with all the big rods that hold them together etc, and making sure everything fits together properly.   Then it will be time to load it into a moulding machine at the tooling factory and test it out.  

Once they are happy at the tool factory, they will ship it to the factory where it will actually be used in the production injection moulding machines. They will then go through a commissioning process, where they will fiddle with pressures, gate settings etc to make sure the plastic flow fills every little bit reliably (a bit of a black art), and try to minimise the time per part, so that we optimise the price per case (you pay by the second that the machine is busy). All that will take another month or three, depending on how COVID19 affects everything.

We'll let you know more as soon as we know it, but it's already very exciting :)

Sunday 10 May 2020

More work on the RAM Expansion and Timing Closure

I recently got the MEGA65's 8MB built-in expansion RAM mostly working.  I say mostly working, because it was still doing various strange things sometimes.  So I have been spending all my spare time trying to get this sorted.
While in theory the RAM controller is simple, in order for it to offer decent performance, it ends up a bit more complex.

Let's start with the simple HyperRAM itself: This can be driven by a clock of upto 100MHz, allowing for DDR throughput of upto 200MB/sec.  Except that it cant.

First, we don't really have the means to use a 100MHz clock, because it's not a clean multiple of any of the clocks we are using. So we are instead using 81.5MHZ, which is 2x the MEGA65 CPU clock and 3x the VIC-IV's pixel clock.

Second, the HyperRAM is accessed by first sending it a command to read or write something.  Regardless of the clock rate, there is a minimum of 40ns plus a bit between transactions.  In reality, it takes about 100ns -- 200ns to actually perform a random read or write. 

Third, in order to not drag the maximum clock speed of the MEGA65's CPU down, the expansion RAM is on the "slow devices" bus.  They don't really have to be slow, but the CPU requires at least 2 clock cycles to access such a device.  That bus then asks the HyperRAM for data, which takes at least another 2 clock cycles.  Thus there is something like 100ns extra latency from the memory architecture of the MEGA65.

The result is that a naive implementation will have about 300ns latency, i.e., about what a real C64 in 1982 had with its ancient DRAM chips.  This is enough for 2 - 3MB/sec -- which with a 40MHz CPU is a bit disappointing.

Thus the HyperRAM controller and related bits and pieces in the MEGA65 does a number of tricks to hide this latency, and increase the throughput for linear accesses. We are concentrating on linear access, as we figure it is most likely how the expansion RAM will be used: You will either be copying things in or out of it.  Some of those tricks are:

1. Two eight-byte write buffers, that allow most writes to happen with just 2 cycle latency, unless the buffers are already busy. Each buffer covers an 8-byte aligned region of 8-bytes, and multiple writes to the region in quick succession will result in a merged 64-bit write to the expansion RAM.  This speeds up writing a lot.

2. A similar setup of two eight-byte read buffers that you could call a tiny cache.  Again, this helps avoid the 100 -- 200ns latency of the HyperRAM, but not the latency of the slow device bus and related bits.

3. A block-read mechanism that reads and buffers upto 32 consecutive bytes from the HyperRAM, and feeds them pre-emptively to the slow device bus as an 8-byte buffer. This really helps reduce read latency.

4. A pre-emptive pre-fetch mechanism in the slow device bus that pushes the byte after the previous read one to the CPU, so that if it is doing a linear read, it can avoid all but one cycle of latency in the slow device bus, and theoretically allow reads at up to 20MB/sec.

I have also added a priority access mechanism to the HyperRAM, so that we can have automatic chipset driven DMAs like access, a little like the Amiga, so that we could, for example, feed the digital audio channels directly, without any CPU intervention.  I have yet to work out exactly what that will look like, but it is important to get the interface in now, so that it can be worked on later.

So all of this mostly works, but there are various little corner cases.  Also, the timing closure on the MEGA65 as a whole was getting a bit out of hand again, so I have spent quite a lot of time trying to flatten logic, so that everything that needs to happen in a single clock tick can in fact happen. And reliably.  This also can help synthesis to run a bit faster, because it doesn't have to try so hard to make everything fit with acceptable timing.  But its also important when debugging weird behaviours, so that I can be confident it isn't just because everything is violating timing by a bit.

I now have it mostly meeting timing, and all of the above optimisations and features at least mostly working.  Where I am now upto is that there are some weird bugs with the chipset DMA, and chaining of the 32-byte fetches and the like.  Also the optimisation in (4) is not completely implemented.

First, I'll tackle the weird bugs with the DMA. Basically the HyperRAM controller can get confused and freeze up, messing up both the chipset DMA, as well as regular CPU access.  I think much of the problem is how it handles the situation where a request comes in from the slow_device bus at around the same time as a chipset DMA. The chipset DMA is supposed to take priority, since it might be quite timing sensitive.  This means that the incoming request from the slow device bus has to be remembered for later.

This is further complicated by the fact that the HyperRAM controller has an 81.5MHz and a 163 section, and we have to be careful how we pass information between the two.  Normally the 81.5MHz side registers and processes the requests from the slow device bus. But when the HyperRAM is actually being accessed, it is the 163MHz part of the circuit that comes into play.  The trouble comes when the 163MHz side realises that it can service a newly arriving request, which it should do to reduce latency.

It should also tell the 81.5MHz that it has processed the request, so that it doesn't get processed twice. And this seems to be where it messes up at the moment: The request does in fact get processed twice.  This probably means I need to have a signal that tells the 81.5MHz side when a latches request has in fact already been processed.  Ok, implemented that, and it looks to simulate properly.

Next is to finish implementing the pre-emptive delivery of data to the CPU, so that it saves time every time it can, instead of just sometimes. Basically now it will save time at most 1/2 the time, because it supplies the next byte whenever the CPU requests a next byte, and it can.  For a linear read, that's only 3 out of every 8.  As we have the cache to work with, we have an 8-byte aligned block, and we should be able to pre-fetch 7 out of every 8 bytes in a linear read. We just need to watch for when the CPU consumes the previous pre-fetched byte (which it already tells the slow device bus) and get the next byte ready in that case. This all happens in slow_devices.vhdl.

After that, it was time to look at when we flush out the 8-byte write buffers. On the one hand, we want to not abort early if the last few byte positions don't need writing, in case a new write comes along that immediately follows in memory, and thus could be chained. But on the other hand, if all our write buffers are currently busy, such that we can't accept any new writes, then we really want to abort as quickly as possible, so that we can start processing any new requests.

I have a simple sequence of memory accesses in a test harness that I use to verify that each of these changes has not broken anything, i.e., we don't lose any writes or read stale data etc.  As I have been working on these various optimisations, the average transaction time has dropped from around 120ns to around 68ns -- so all up, this is quite a significant performance boost.  This should allow a hyperram access to, on average, occur in around 3 clock cycles, with linear reads potentially only requiring 2 clock cycles.  Thus, I'm hoping that when I run my speed test programme, it will show "Copy Slow RAM to Chip RAM" speeds of ~40/3 = ~13,000KB/sec. 

In generally, looking at the test harness output, things look pretty good now for the most part. The only exception is that a few memory accesses seem to take a REALLY long time, longer than should really be possible, up to around 450ns.  Something pathological must be happening there, such as an aborted 32-byte block read that would have got the data, but is aborted, and then a new fetch is started that does get it.

Looking into it, it might just be when there are a lot of writes banked up that need flushing.  But I did notice a weird thing, where data was being spread into both of the 8-byte read caches, instead of all going into a single one.  While that can't cause the current problem, it would still be good to squash it, as otherwise it will cause havoc with the performance of the cache when reading, as one of the cache lines will just get ignored in all likelihood.

Fixing timing closure is often about simplifying IF statements in the VHDL, so that they can be processed more quickly by the hardware.  Sometimes this can be through reducing the number of nested IF statements. Other times it is about pre-caculating expressions that appear in the IF statements, and then using the simple pre-calculated true or false result from that.

Vivado does give nice reports about signals that don't meet timing closure, like this:

Slack (VIOLATED) :        -1.245ns  (required time - arrival time)
  Source:                 hyperram0/block_valid_reg/C
                            (rising edge-triggered cell FDRE clocked by clock162  {rise@0.000ns fall@3.077ns period=6.154ns})
  Destination:            hyperram0/current_cache_line_drive_reg[3][5]/D
                            (rising edge-triggered cell FDRE clocked by clock162  {rise@0.000ns fall@3.077ns period=6.154ns})
  Path Group:             clock162
  Path Type:              Setup (Max at Slow Process Corner)
  Requirement:            6.154ns  (clock162 rise@6.154ns - clock162 rise@0.000ns)
  Data Path Delay:        7.383ns  (logic 1.200ns (16.253%)  route 6.183ns (83.747%))
  Logic Levels:           6  (LUT4=1 LUT6=5)
  Clock Path Skew:        0.027ns (DCD - SCD + CPR)
    Destination Clock Delay (DCD):    6.300ns = ( 12.454 - 6.154 )
    Source Clock Delay      (SCD):    6.600ns
    Clock Pessimism Removal (CPR):    0.327ns
  Clock Uncertainty:      0.073ns  ((TSJ^2 + DJ^2)^1/2) / 2 + PE
    Total System Jitter     (TSJ):    0.071ns
    Discrete Jitter          (DJ):    0.128ns
    Phase Error              (PE):    0.000ns

    Location             Delay type                Incr(ns)  Path(ns)    Netlist Resource(s)
  -------------------------------------------------------------------    -------------------
                         (clock clock162 rise edge)
                                                      0.000     0.000 r 
    V13                                               0.000     0.000 r  CLK_IN (IN)
                         net (fo=0)                   0.000     0.000    CLK_IN
    V13                                                               r  CLK_IN_IBUF_inst/I
    V13                  IBUF (Prop_ibuf_I_O)         1.521     1.521 r  CLK_IN_IBUF_inst/O
                         net (fo=2, routed)           1.233     2.755    clocks1/clk_in
    MMCME2_ADV_X0Y0                                                   r  clocks1/mmcm_adv0/CLKIN1
    MMCME2_ADV_X0Y0      MMCME2_ADV (Prop_mmcme2_adv_CLKIN1_CLKOUT5)
                                                      0.088     2.843 r  clocks1/mmcm_adv0/CLKOUT5
                         net (fo=1, routed)           2.018     4.861    clock162
    BUFGCTRL_X0Y4                                                     r  clock162_BUFG_inst/I
                        net (fo=1126, routed)        1.643     6.600    hyperram0/clock162_BUFG
    SLICE_X15Y57         FDRE                                         r  hyperram0/block_valid_reg/C
  -------------------------------------------------------------------    -------------------
    SLICE_X15Y57         FDRE (Prop_fdre_C_Q)         0.456     7.056 f  hyperram0/block_valid_reg/Q
                         net (fo=43, routed)          1.406     8.462    hyperram0/block_valid_reg_0
    SLICE_X21Y49                                                      f  hyperram0/current_cache_line_drive[0][7]_i_5/I3
    SLICE_X21Y49         LUT4 (Prop_lut4_I3_O)        0.124     8.586 r  hyperram0/current_cache_line_drive[0][7]_i_5/O
                         net (fo=135, routed)         1.242     9.828    hyperram0/current_cache_line_drive[0][7]_i_5_n_0
    SLICE_X11Y39                                                      r  hyperram0/current_cache_line_drive[3][5]_i_7/I5
    SLICE_X11Y39         LUT6 (Prop_lut6_I5_O)        0.124     9.952 f  hyperram0/current_cache_line_drive[3][5]_i_7/O
                         net (fo=1, routed)           1.245    11.197    hyperram0/current_cache_line_drive[3][5]_i_7_n_0
    SLICE_X15Y55                                                      f  hyperram0/current_cache_line_drive[3][5]_i_5/I0
    SLICE_X15Y55         LUT6 (Prop_lut6_I0_O)        0.124    11.321 f  hyperram0/current_cache_line_drive[3][5]_i_5/O
                         net (fo=1, routed)           0.803    12.124    hyperram0/current_cache_line_drive[3][5]_i_5_n_0
    SLICE_X9Y63                                                       f  hyperram0/current_cache_line_drive[3][5]_i_3/I1
    SLICE_X9Y63          LUT6 (Prop_lut6_I1_O)        0.124    12.248 f  hyperram0/current_cache_line_drive[3][5]_i_3/O
                         net (fo=2, routed)           0.908    13.156    hyperram0/current_cache_line_drive[3][5]_i_3_n_0
    SLICE_X5Y71                                                       f  hyperram0/current_cache_line_drive[3][5]_i_4/I4
    SLICE_X5Y71          LUT6 (Prop_lut6_I4_O)        0.124    13.280 r  hyperram0/current_cache_line_drive[3][5]_i_4/O
                         net (fo=1, routed)           0.579    13.859    hyperram0/current_cache_line_drive[3][5]_i_4_n_0
    SLICE_X3Y71                                                       r  hyperram0/current_cache_line_drive[3][5]_i_1/I4
    SLICE_X3Y71          LUT6 (Prop_lut6_I4_O)        0.124    13.983 r  hyperram0/current_cache_line_drive[3][5]_i_1/O
                         net (fo=1, routed)           0.000    13.983    hyperram0/current_cache_line_drive[3][5]_i_1_n_0
    SLICE_X3Y71          FDRE                                         r  hyperram0/current_cache_line_drive_reg[3][5]/D
  -------------------------------------------------------------------    -------------------

                         (clock clock162 rise edge)
                                                      6.154     6.154 r 
    V13                                               0.000     6.154 r  CLK_IN (IN)
                         net (fo=0)                   0.000     6.154    CLK_IN
    V13                                                               r  CLK_IN_IBUF_inst/I
    V13                  IBUF (Prop_ibuf_I_O)         1.450     7.604 r  CLK_IN_IBUF_inst/O
                         net (fo=2, routed)           1.162     8.766    clocks1/clk_in
    MMCME2_ADV_X0Y0                                                   r  clocks1/mmcm_adv0/CLKIN1
                                                     0.083     8.849 r  clocks1/mmcm_adv0/CLKOUT5
                         net (fo=1, routed)           1.923    10.772    clock162
    BUFGCTRL_X0Y4                                                     r  clock162_BUFG_inst/I
    BUFGCTRL_X0Y4        BUFG (Prop_bufg_I_O)         0.091    10.863 r  clock162_BUFG_inst/O
                         net (fo=1126, routed)        1.590    12.454    hyperram0/clock162_BUFG
    SLICE_X3Y71          FDRE                                         r  hyperram0/current_cache_line_drive_reg[3][5]/C
                         clock pessimism              0.327    12.781   
                         clock uncertainty           -0.073    12.708   
    SLICE_X3Y71          FDRE (Setup_fdre_C_D)        0.031    12.739    hyperram0/current_cache_line_drive_reg[3][5]
                         required time                         12.739   
                         arrival time                         -13.983   
                         slack                                 -1.245   

These are great in that they tell me from which signal to which signal the problem is, and how bad it is. But they are also a pain, because they don't actually tell you which lines of code is causing the path. So you have to hunt through the source trying to find the sequence of nested IF statements that apply, but which might be spread over thousands of lines of code.  It's a bit of a pain.  So I made a really quick-and-dirty tool that lets me see all the IF statements that lead up to a given signal in the source.

So in the case above, we can see:

 387        if rising_edge(pixelclock) then
 622          if (read_request or read_request_latch)='1'
 635            if (block_valid='1') and (address(26 downto 5) = block_address) then
 650              if (address(4 downto 3) = "11") and (flag_prefetch='1')
 658 >>>             ram_address <= tempaddr;
 387        if rising_edge(pixelclock) then
 622          if (read_request or read_request_latch)='1'
 826            elsif address(23 downto 4) = x"FFFFF" and address(25 downto 24) = "11" then
 883            elsif request_accepted = request_toggle then
 887 >>>           ram_address <= address;
 387        if rising_edge(pixelclock) then
 899          elsif queued_write='1' and write_collect0_dispatchable='0' and write_collect0_flushed='0'
 922          elsif (write_request or write_request_latch)='1' and busy_internal='0' then
 929            if address(23 downto 4) = x"FFFFF" and address(25 downto 24) = "11" then
 972              if cache_enabled = false then
 980 >>>             ram_address <= address;

I now have line numbers, and can see exactly what is going on.  Because it also shows me the rising_edge() statement, I can also tell under which clock it is .  So in this case, it's likely line 635 that is the problem.  It's also showing me that while this IF does depend on the block_valid signal, it also compares a 22 bit address -- that's the thin that's really going to be slow.  Thus I get more useful information than Vivado otherwise gives me.  I'm sure there are expensive professional tools that can do some of this, but it was faster to cook this up than to even find out if such tools exist.  It would be nice if the VHDL mode in emacs could display something like this, though.

Anyway, in this particular case, I can't do a great deal about it, unless I add an extra stage of latency to all hyperram reads, as pre-calculating the result of the comparison would add a cycle of delay.  This is the whole trade-off of clock speed and pipeline depth, which the Pentium 4 famously did really badly.  In this instance, I have fixed a whole lot of other timing violations, in the hope that Vivado can do better with this particular signal.  I have hope, because while the total path takes too long, only 1.2ns of that is logic -- the other ~6ns is routing overhead.  By making it easier for Vivado to route other signals, it can often enable it to get better closure on critical paths like this one.  Anyway, we will know in about half and hour when it finishes synthesising.

So, that's helped me get quite close to timing closure, but there are a few signals that are not quite meeting timing closure. My next step was to fix the clock frequencies we are using. They are approximately right, but not exact. They are in fact a little too fast. This can also cause us trouble with the HDMI output, so it makes sense to get these fixed. I recently worked out how to better control clocks. Basically the problem is that each clock generator in the FPGA can only take the input clock and multiply and divide it by a fixed range of values. Thus you can only generate a limited set of frequencies.

But you can chain the generators to take the output of another one as its input, and thus generate many more frequencies.  The trouble then, is that you need to consider a LOT of possible combinations. Billions and billions of combinations. So I made a little program that tries all the possible values, and looks for the one that can get us closest to the target frequency. Here's the important part of it:

  float best_freq=100;
  float best_diff=100-27;
  int best_factor_count=0;
  int best_factors[8]={0,0,0,0,0,0,0,0};

  int this_factors[8]={0,0,0,0,0,0,0,0};
  // Start with as few factors as possible, and then progressively search the space
  for (int max_factors=1;max_factors<4;max_factors++)
      printf("Trying %d factors...\n",max_factors);
      for(int i=0;i<max_factors;i++) this_factors[i]=0;
      while(this_factors[0]<uniq_factor_count) {
    float this_freq=27.0833333;
    for(int j=0;j<max_factors;j++) {
      //            printf(" %.3f",factors[this_factors[j]]);
    //        printf(" = %.3f MHz\n",this_freq);
    float diff=this_freq-27.00; if (diff<0) diff=-diff;
    if (0&&diff<1) {
      printf("Close freq: ");
      for(int j=0;j<max_factors;j++) printf(" %.3f",uniq_factors[this_factors[j]]);
      printf(" = %.3f MHz\n",this_freq);
    if (diff<best_diff) {
      for(int k=0;k<8;k++) best_factors[k]=this_factors[k];
      printf("New best: ");
      for(int j=0;j<max_factors;j++) printf(" %.3f",uniq_factors[this_factors[j]]);
      printf(" = %.6f MHz\n",this_freq);
      printf("          ");
      for(int j=0;j<max_factors;j++) {
        float uf=uniq_factors[this_factors[j]];
        int n=0;
        for(n=0;n<(1<<14);n++) if (uf==factors[n]) break;
        if (j) printf(" x");
        printf(" %.3f/%.3f",(1.0+((n>>0)&0x7f)/8),(1.0+((n>>7)&0x7f)/8));

    // Now try next possible value
    int j=max_factors-1;
    while((j>=1)&&(this_factors[j]>=uniq_factor_count)) {

Basically it tries all possible values, and reports each time it finds a better one.  After optimising it to only consider the unique frequency multipliers that each generator can produce, it takes less than a minute to find an exact match:

Calculating set of adjustment factors...
There are 159 unique factors.
Trying 1 factors...
New best:  1.000 = 27.083334 MHz

Trying 2 factors...
New best:  2.333 0.429 = 27.083332 MHz
           7.000/3.000 x 3.000/7.000

New best:  0.923 1.077 = 26.923079 MHz
           12.000/13.000 x 14.000/13.000

New best:  0.929 1.071 = 26.945152 MHz
           13.000/14.000 x 15.000/14.000

New best:  0.933 1.067 = 26.962965 MHz
           14.000/15.000 x 16.000/15.000

Trying 3 factors...
New best:  7.000 0.133 1.067 = 26.962967 MHz
           7.000/1.000 x 2.000/15.000 x 16.000/15.000

New best:  1.500 0.818 0.812 = 27.006392 MHz
           3.000/2.000 x 9.000/11.000 x 13.000/16.000

New best:  1.500 1.182 0.562 = 27.006390 MHz
           3.000/2.000 x 13.000/11.000 x 9.000/16.000

New best:  1.667 0.556 1.077 = 27.006176 MHz
           5.000/3.000 x 5.000/9.000 x 14.000/13.000

New best:  1.667 0.778 0.769 = 27.006174 MHz
           5.000/3.000 x 7.000/9.000 x 10.000/13.000

New best:  1.667 0.385 1.556 = 27.006172 MHz
           5.000/3.000 x 5.000/13.000 x 14.000/9.000

New best:  0.600 1.800 0.923 = 27.000002 MHz
           3.000/5.000 x 9.000/5.000 x 12.000/13.000

New best:  0.800 1.800 0.692 = 27.000000 MHz
           4.000/5.000 x 9.000/5.000 x 9.000/13.000

These factors are in addition to the existing 100MHz x 8.125 / 30 that I am already doing, which results in 27.083333 MHz.  I was a little bit surprised that it was able to find an exact match using only three factors. It isn't even just that it is correct to 6 decimal places. It really is exact:

100 x 8.125/30 x 4/5 x 9/5 x 9/13  MHz
= 100 x 65/240 x 36/25 x 9/13 MH
= 100 x 21060/78000 MHz
= 100 x 0.27 MHz
= 27 MHz

Well, that's all sounding good. So I modified clocking.vhdl to create the chain of MMCM instances to generate the exact clock frequency.  First go at synthesis, Vivado threw an error, because it is not possible to directly assemble such a long chain of MMCM units.  To solve this, I added a Global Buffer unit to take one of the intermediate MMCM outputs, so that it would be available at the next MMCM with minimal distortion or skew.  Hopefully that will solve the problem.

Next problem is that the last step multiplies the intermediate clock by 9, and ends up out of the valid range of 600 - 1200MHz.  This is because:

100 MHz x 8/10 = 800MHz(ok)/10 = 80MHz
80MHz x 9/5 = 720MHz(ok)/5 = 144MHz
144MHz x 9/13 = 1296MHz(not ok)/13

So I need to reorder these factors, so that the intermediate frequencies remain in the range of 600 - 1200MHz.

100 MHz x 9/13 = 900MHz(ok)/13 = 69.23MHz
69.23MHz x 9/5 = 623MHz(ok)/5 = 124.61MHz
124.61MHz x 8/10 = 996.92MHz(ok)/10 = 99.692MHz
(which then gets used to generate the 27MHz clock via x 8.125/30)

So, I'll just rearrange into that order and try to synthesise again...
Ok, that worked, and the bitstream runs, with HDMI output (still) working.  So it looks like my clock magic has worked just fine.

There are now just a very few timing violations of <0.2ns in the hyperram left to resolve.  That said, they are now a small percentage compared to the clock interval (~6.2ns), so SHOULDNT be a problem, except unless the FPGA voltage dipped, or the FPGA got really hot. I am seeing some weird problems still, though, so I will likely still work to completely eliminate them, so that I don't have to suspect that they aren't a problem, but rather that I can KNOW that they aren't a problem.

Digging through, I found one last sneaky slow address comparison that is likely the cause of that, so time to resynthesise again.  Hopefully that will be the last timing violation there, and I will be able to focus on the remaining bugs I am seeing.

Finally! I now have a synthesis run with full timing closure on the HyperRAM, and even the persistent timing closure problem with the CPU is less than 1ns late on a ~24ns clock, and it is related to the SP register, not the HyperRAM. Thus we can more forward confident that any problems with the HyperRAM are not related to lack of timing closure.

So, speaking of problems, we certainly still have some right now:
1. The prefetch logic has some significant problems, resulting in mis-read data.  The first two or three bytes read ok, and then the next byte read rubbish. The pattern is a bit longer, but clearly something is a bit bonkers.
2. Linear reads crossing a 32-byte boundary read rubbish after crossing that boundary, presumably because there is something wonky with the pre-fetch chaining.
3. The external hyperram seems to be even less reliable, but it has been since the start, presumably because the leads from the FPGA to it are longer.
4. When cache rows are read, the bytes are being spread between the two cache rows, instead of all going into the same row.
5. Copying from chip to hyperram is slow, and a bit variable in speed. It's less than half the speed of filling hyperram, so there is presumably something funny going on with the write scheduling.

First step: Make the pre-fetching run-time switchable, so that I can more easily assess the rest of the system.  That worked fine. My next step was to find and fix a bunch of the bugs in the prefetch logic and related areas. This looks like it might fix (1), (2) and (4) above.

I'm synthesising the fixed version now, to see if it really fixes those problems, but am currently fighting with multiple drivers.  Multiple drivers is when signals are trying to be set from multiple (unrelated) places, and the FPGA synthesis software can't figure out how to reconcile them, for example, if they are set from completely separate pieces of hardware.  They aren't too hard to fix, just a bit fiddly.  Fortunately improving the timing closure has dropped synthesis time back down to about 30 minutes instead of 45 minutes or so, so the going is a bit quicker now.

What I have done as a first step, is to get a working configuration for the HyperRAM.  It seems that using 80MHz clock for 160MB/sec throughput (via DDR) is just too much for the current PCB layout (or for how I have implemented it in the FPGA).  By leaving all stages of the transaction at 40MHz, that fixes a lot of problems, but at a modest cost to read performance, since 80MHz was only being used during reads, anyway.  I also disabled all of the CPU prefetch logic by default, and also the whole read-cache machinery.  The write cache is already rock solid, it seems, so that has been left.

The external HyperRAM still has some problems with writes, which I need to look into. Basically some writes get lost. I am suspecting that the chip select line to the HyperRAM might have slightly faster response than the data lines, and thus the last write gets ignored.  It could be something more sinister, of course, like writes being ignored in the write cache, but the fact that the internal HyperRAM is now rock solid makes me think that that is very unlikely.

The result is a known working configuration, for the internal HyperRAM at least, from which to try to debug everything else.  Everything else now largely means the read cache machinery.  The downside, in the interim at least, is that the read performance is, without the caching, quite poor, while linear writing is a lot faster, due to the functioning write caches.  So we get a result like the following (remember the HyperRAM is what is used as Slow RAM):

Fill Slow RAM: 9,287KB/sec
Copy ChipRAM to Slow RAM: 4,334KB/sec
Copy Slow to Chip RAM: 2,500KB/sec
Copy Slow RAM to Slow RAM: 1,547KB/sec

So, let's go through and find and fix the problems with the read cache.  First, we try to use simulation again, as this is MUCH faster and more informative.  With the reading set to 80MHz, simulation showed no problems. But when I drop the read speed to 40MHz, we do now get some errors.  That's a good thing, as I can investigate their causes and deal with them.

We also have another clue: The read cache works fine when there aren't also writes happening.  This means it is most likely something with the cache coherency logic. Either the read caches are taking too long to get updated when a new value is written that would hit them, or they are simply not getting updated, or getting updated incorrectly.

So, into the simulation to see what is going on there.  It looks suspiciously like chipset DMA reads are being used to reply to CPU memory requests.  That can't be the whole problem, because I have run tests with that disabled, and there are still cache consistency problems. However, I have seen evidence that the chipset accesses were stuffing things up, so its nice to have captured this in simulation.  So, that was an easy fix. Simulation then showed errors caused by a weird fix I had in place from when we were running at 80MHz, which was breaking things at 40MHz.  Having fixed that, simulation reveals no more errors. So time to synthesise that, and then see how it behaves on the real hardware.

Ok. Synthesis is now complete.  Having the cache enabled still causes problems when reading recently written locations. Its just that I can't reproduce it in simulation anymore, which means I need to use other approaches to track the problem down.  I have fortunately made a mechanism to examine the state of the cache in real-time, so that I can tell if a byte is being mistakenly put into the wrong cache row. 

To investigate this further, I have modified the hyperram test programme cache tests to reproduce this problem, and then to show me some information about what is going on. It looks like when we write $18 into $8000808, that it gets erroneously written into the first byte of the cache line for $8000800-$8000807.  Knowing that this is the case, I have been able to create a sequences of transactions that can reproduce the problem in simulation - yay! Well, except it seems to be some other error that I am seeing in simulation. No bother. It's still an error to be squashed, and that I can reproduce in simulation.  Actually, it DOES look to be the same type of error, just another instance of it.

The problem is that the cache_row0_address_matches_cache_row_update_address signal that indicates when a cache row matches the update address for the cache is delayed by a cycle (remember I mentioned this kind of thing, when I was talking about flattening logic by pipelining/pre-calculating signal values).  Probably the easiest solution for this, is to clear the signals whenever I update the cache update address, so that they can never be wrong.  It does mean that on the occassions when they would be right, we don't get the benefit of the cache, but that's livable.

Except life is never that simple. In this case, the cache_row_update_address signal and these precalculated comparison signals are generated in separate design parts, and will generate more of those annoying "multiple driver" problems if I just try to set it in both places.  Instead I'll just create a signal that blocks the use of these signals for one clock after the update address has been changed. That fixes the problem in simulation, so now to see if it fixes it when synthesised.

Well, the problem still shows up after synthesis -- but no longer in the little test-case I made, only in the larger test.  Now to try to figure out how to reproduce it with the simplest test case, so that I can hopefully reproduce it under simulation, where running thousands of memory accesses is too cumbersome and slow.

On initial inspection, it isn't actually the same problem, just a similar one: Now it reads an 8 byte row as all $00, instead of their correct values.  My suspicion is that the problem is similar: The cache row address is being compared with something a cycle after it is being updated, and thus there is an internal inconsistency. More precisely, there is a cache row being loaded with entirely incorrect data.  It looks like it might actually be the current_cache_line that gets exported to the slow_devices controller to reduce read latency. So I'll make that inspectable at run-time, too.

While I wait for that, I'll have another think about the chipset DMA stuff.  This is still being quite weird.  If these DMAs are running, the CPU can still (but less often) read wrong data. This only happens when the cache is enabled, so that's a clue that we still have some cache consistency problem.  Also, the chipset DMAs only seem to return data (or at least, interesting data) when the CPU is accessing the hyperram. That is, the CPU and chipset seem able to get data intended for each other.

I went through with a fine tooth comb, and made sure that Under No Circumstances Whatever could a chipset DMA fetch contaminate any of the cache machinery. That has the CPU now safely reading only its own data. But the chipset is still only receiving data when the CPU is accessing the HyperRAM.  This is still really annoying me, because I can't reproduce it under simulation, even when I let the HypeRAM bus otherwise remain idle. 

I ran simulation again,  to confirm that lots of HyperRAM requests can get sequenced one after the other, and that the data gets received, and that all looks fine. I also synthesised a debug mode that asserts the data ready strobe to the chipset the whole time, and that DOES cause the dummy data to be delivered. Thus I am confident that the plumbing between the two is correct. It just leaves the major mystery of why the HyperRAM seemingly gets stuck handling these chipset DMA requests.

To try to figure out what is going on, I have added a couple of extra run-time switches that will allow me to make the chipset DMA to take absolute priority on the HyperRAM bus, in case the priority managment is the issue. I also found a couple of places that, although unlikely, could be contributing to the problem, and tweaked those.  Basically they are places where it might have been possible for one of these data requests to get cancelled before it was complete. 
However, if those were the problem, then I'd expect the whole thing to freeze up, due to the end of the transaction not being properly acknowledged. So we'll see how that goes.

If this doesn't reveal any clues, then I am starting to suspect that the HyperRAM might be doing something funny with taking a VERY long time to respond. We are talking about ~1 milli-second here instead of the tens of nano-seconds that it should take to service a request.

That's improved things a bit: Some data from somewhere is now getting read for the chipset DMA, but it seems to come from the wrong address, and some CPU memory accesses to the HyperRAM end up being passed out as chipset DMA data.  Also, it still seems that sometimes the HyperRAM takes too long to supply the data.

Finding the address calculation bug was fairly easy.  Also, confirming that sometimes CPU access data gets output is helpful to know, as I can go through carefully to make sure that this can't happen.  This was happening when writing data, which I was easily able to verify by filling a large block of the HyperRAM and monitoring the chipset DMA data.  It probably also happens when reading, but I'll have to make a little programme that does reads in a big loop to make sure.  But back to the writes, this is quite odd, as the data path for feeding data to the chipset via DMA is quite specific, and should only be activated when such a request is running.

Otherwise, we still have the problem of some writes to the trap-door HyperRAM failing, mostly when crossing an 8-byte boundary, but also often enough that detection of the trap-door HyperRAM never properly succeeds. 

Anyway, this post has rambled on long enough, and despite the remaining problem with setting up support for chipset DMA and read cache optimisations, much as in fact been achieved: We now have a perfect 27MHz pixel clock and HyperRAM works sufficiently well to be useful.  Speed optimisations can come later.  Here's the current speed performance of the HyperRAM, which we are calling "Slow RAM" from here on, to avoid confusion with the Hypervisor part of the OS.

So while I am not yet satisfied with this sub-system, and frankly am frustrated at how annoying and drawn-out it has been to get to this point, it is now in good enough shape for me to move onto other more pressing issues, like getting HDMI sound working on all monitors...

Friday 1 May 2020

Trying another approach to HDMI-compatible Audio

So, we have had HDMI-compatible video output working for some time. We even have the audio working on many monitors, but not all.  There is something fishy with the audio output of the ADV7511 driver, when using certain TVs, monitors and HDMI capture devices. The symptom is either no audio, or no picture at all when audio is enabled.

I've been pulling my hair out over this for a long time now.

To try to circuit-break this situation, we have bought an FPGA development board that doesn't use an ADV chip, but has the FPGA directly drive the video signals.  Mike Field and others have done great work on this, to demonstrate that it is quite possible with the Artix7 series FPGAs we are using. 

The board we are using for this is a Mimas A7:

This has a smaller FPGA than the MEGA65, and can't run the MEGA65, but it is big enough for me to produce some video and audio, and try it out on the Samsung TV I have here. The Mimas A7 board even has an example HDMI-output project, which I have been able to use and begin to adapt.

But I am still seeing some funny things.

Primarily, if I try to use more than 4 bits of audio resolution, that the HDMI TV fails to get any picture at all.  It doesn't seem to matter exactly which four bits. I can even sometimes use more than 4 bits, so long as no more than 4 in a row in the 16-bit sample format are used.

What is REALLY weird, is that if I mask out audio bits with dip switches on the board, that there is no picture, even if I have the audio bits masked out.  This just makes absolutely no sense to me, as what comes out the HDMI connector on this board should be completely identical when the bits are masked as when I have them simply tied low -- yet only the latter gives a picture.

Now, I am trying to use a different video mode (720x576x50Hz) to what the example project uses (640x480x60Hz), so it is possible that the HDMI info frames are confusing the TV. by saying one thing, while the display stream has something quite different in it.  What I am suspecting is that the audio clock recovery coefficients (N and CTS) will be causing problems. It still doesn't explain why the really weird problem happens, but if I can confirm that I can use all 16 bits of audio resolution in the default mode, that will still be a helpful data point.

So I am trying to switch back to the default 640x480 mode, and then try to get audio working in that mode.  However, I am hitting another really weird problem:  When I try to workout the set of flags required for that mode (these are some flags that go into the HDMI infoframe), I can make them work if I have them hardwired to the correct values, but if I try to make them settable via the dip switches, they don't change. This is despite the fact that I can read the dip switches, put them into an array of boolean flags, and check them elsewhere in the design, even have their status output continuously on a serial port where I can see them changing. 

Yet the TV steadfastly claims that there is no signal, no matter how I twiddle the dipswitches, and see the correct values on the serial output to indicate that they are being set correctly.  So then I plumbed some return signals back through, so that I can see if the dip-switch values are getting propagated down into the HDMI video generator, which they are.  This is of course very bizarre, since they are clearly being set, and getting where they need to be, but seemingly not affecting the video stream being produced.

And of course this is VHDL and FPGAs, so each "recompile" takes several minutes, which makes for a really frustrating debugging experience, especially when making only tiny changes.

So, after several more waits for resynthesis runs, I have again confirmed that if the EnhancedMode (which enables HDMI info frames) is hard-wired to false (as compared to being held low via a dip-switch mapping !!), then I get a perfectly fine image on the TV.  Recall that I have confirmed that the signal in question goes into the HDMI module and comes back out again, confirming that it should darn well be visible to the HDMI module when the dip-switch toggles.

If the dip-switch is in the correct position on power-on, there should in fact be absolutely no difference in behaviour -- despite the opposite happening in practice.  The only possible clue here is that seemingly the flags are being treated as always true, when connected to the dip-switch.

Right. So, by adding a counter into the HDMI generator, that counts the number of times that a particular area is entered, it suddenly works.
So bizarre is this, that I even looked at the Vivado logs for both versions for any warnings that might differ between the two -- but there are none. 

This does not really give me any encouragement for tracking down the audio problems, since that is also exhibiting similarly nonsensical behaviour. It might just be that I need to update to the latest version of Vivado, and see if if this is a bug that has been fixed in the meantime. In any case, if I want to log an issue with Xilinx, they will want me to try it with the latest version anyway.

Ok. So 16GB of downloads over our poor satellite later, I have Vivado 2019.2 installed in place of the old 2017.4 version, and it pleasingly fixes this weird bug with needing the counter.  I'll now carefully and progressively back out the debug stuff I added in, and see if this now also fixes the weirdness I was seeing with not being able to use more than 4 bits of the audio. I'm hopeful, as the two problems feel rather similar in their complete insanity. Indeed my hunch seems to be true: I can now get 8-bit audio (and presumably the full 16-bit available) in 720p50. 

Next step is to switch out the sample VGA frame generator for the MEGA65 one, so that I can be sure that I can get audio with that.  At that point, I will prepare a bitstream for the team to test in Germany with the monitors that they have there.  If it works with every monitor we can throw at it, then we know we have a good solution.

So far so good: I now have a bitstream where a dip-switch controls if it is PAL or NTSC mode, and I still get sound. However, I am not yet convinced that the audio quality is as good as it should be. This could just be my wooden ear, or it could also be that the 36-element Sine table I am using is not that great.  So I'll still try to get an audio sample recorded and imported as a little ROM in the test design, so that I can verify that the sound is right, or not.

After all manner of further adventures with Vivado weirdness (or possibly PEBCAC), where changing a single unrelated line of code changed the behaviouor of the whole thing, I finally have audio working.  It is possible that the Samsung monitor rejects any HDMI signal that has what it thinks is impossible sound. I'm not sure. 

But in any case, I can now push 16-bit stereo audio out the HDMI interface, which is correctly received by the Samsung TV -- which is one step up on what we have been able to achieve with the ADV7511 HDMI chip on the R2 PCB.  Indeed, I've been able to reach this point in MUCH less time than the whole fiddling with the HDMI chip took. I'd still like to know whatever it is that the ADV7511 is doing that the Samsung TVs don't like.  But we don't really have the time to investigate that right now.

So now it's time to package up that bitstream for the others in the team to try on one of these FPGA boards with all the monitors that they have on hand there.  If that all goes well, we will have a solution that is probably a few Euros cheaper per machine. That will help us to either keep the cost down, or, possibly by the magic of modern economics, to include a nice secret surprise on the production MEGA65 PCBs...

Anyway, I'm glad to finally have HDMI with audio working on the Samsung, and "ich drucke mich die Daumen" (I'm crossing my fingers) that our German half of the team will have success with the monitor testing on their side.  I'll also see what random selection of HDMI-compatible displays we have lurking around the other buildings here over the next few days, and test those, too.