Thursday, 24 March 2022

Solving the last of the digital video output glitching, at least I think so

First, as many will have noticed, we are now in March 2022, the month when we hope that the first 400 MEGA65s will ship. At the time of writing, we are still waiting to hear that the cardboard packaging and related materials have arrived at Trenz to go with the assembled PCBs, keyboards and cases, to make the complete units.  We still don't know exactly when they will arrive, but are waiting expectantly to hear news on this any day now -- as we are sure many of you are!

MEGA65 R3A Production board under test, touched up using the "my crappy phone's auto-focus is about as reliable as its auto-correct feature" filter.

But in the meantime...

As past readers will know, we have access to an old N5998A HDMI protocol analyser that we use to debug the HDMI and DVI compatible audio-video output on the MEGA65.  Recently we have seen some of the production boards having glitchy digital video output, and have set about fixing this in the VHDL.  However, in the process, we have found that in many cases the HDMI protocol analyser fails to recognise the video output, even though it displays fine on a monitor.

Also, in some cases where the protocol analyser does recognise the display, it reports "pixel errors", which is the problem we are trying to fix. However, it doesn't give us any information about what those pixel errors are, and that information would be tremendously helpful for us in debugging the signal. For example, are some of the DVI data words being shifted, are there stuck bits, or are some words being reversed or replaced with the lyrics to a lost Shakespearean sonnet?  We really want to know, and we know that the N5998A's capture files seem to be raw bit-level captures of the three data plus one clock channel, and thus should contain the information that we need.  The trick is figuring out the file format.

The files seem to be in fact a completely raw capture with no header at all. The data looks like this:

00000000: 3f c4 00 3f c1 03 00 00 80 0f fc 80 01 0c 00 00    ?D@?AC@@@O|@AL@@
00000010: 3f c4 00 3f c1 03 00 00 80 04 00 80 01 05 00 00    ?D@?AC@@@D@@AE@@
00000020: 3f cf fc 3f c1 0a 00 00 3f c4 00 3f c1 03 00 00    ?O|?AJ@@?D@?AC@@
00000030: 80 0f fc 80 01 0c 00 00 3f c4 00 3f c1 03 00 00    @O|@AL@@?D@?AC@@
00000040: 80 0f fc 80 01 0c 00 00 3f c4 00 3f c1 03 00 00    @O|@AL@@?D@?AC@@
00000050: 80 0f fc 80 01 0c 00 00 3f c4 00 3f c1 03 00 00    @O|@AL@@?D@?AC@@
00000060: 3f c4 00 3f c1 03 00 00 80 0f fc 80 01 0c 00 00    ?D@?AC@@@O|@AL@@
00000070: 3f c4 00 3f c1 03 00 00 80 0f fc 80 01 0c 00 00    ?D@?AC@@@O|@AL@@
00000080: 3f c4 00 3f c1 03 00 00 80 0f fc 80 01 0c 00 00    ?D@?AC@@@O|@AL@@
00000090: 3f c4 00 3f c1 03 00 00 3f cf fc 3f c1 0a 00 00    ?D@?AC@@?O|?AJ@@
000000a0: 80 04 00 80 01 05 00 00 3f c4 00 3f c1 03 00 00    @D@@AE@@?D@?AC@@
000000b0: 80 0f fc 80 01 0c 00 00 3f c4 00 3f c1 03 00 00    @O|@AL@@?D@?AC@@
000000c0: 80 0f fc 80 01 0c 00 00 3f c4 00 3f c1 03 00 00    @O|@AL@@?D@?AC@@

We can see that there seem to be six bytes of data, followed by two zero bytes, in a repeating pattern.

We know how HDMI/DVI pixel encoding works, e.g., from the description here:

https://en.wikipedia.org/wiki/Transition-minimized_differential_signaling

The relevant part being:

The method is a form of 8b/10b encoding but using a code-set that differs from the original IBM form. A two-stage process converts an input of 8 bits into a 10 bit code with particular desirable properties. In the first stage, the first bit is untransformed and each subsequent bit is either XOR or XNOR transformed against the previous bit. The encoder chooses between XOR and XNOR by determining which will result in the fewest transitions; the ninth bit encodes which operation was used. In the second stage, the first eight bits are optionally inverted to even out the balance of ones and zeros and therefore the sustained average DC level; the tenth bit encodes whether this inversion took place.  

There are only 460 such unique codes that are used.
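To make the two stages in that description concrete, here is a minimal C sketch of them, plus the matching decoder. It is not taken from the MEGA65 VHDL or from the analyser program; the function names are made up for illustration. A real encoder decides whether to invert in stage 2 by tracking the running DC balance, which this sketch leaves to the caller through the `invert` argument.

```c
#include <stdint.h>

/* Stage 1: chain the bits with XOR or XNOR, whichever gives fewer
 * transitions. Bit 8 records the choice (1 = XOR, 0 = XNOR). */
uint16_t tmds_stage1(uint8_t d)
{
    int ones = 0;
    for (int i = 0; i < 8; i++)
        ones += (d >> i) & 1;

    int use_xnor = (ones > 4) || (ones == 4 && (d & 1) == 0);

    uint16_t q = d & 1;               /* first bit is untransformed */
    for (int i = 1; i < 8; i++) {
        int prev = (q >> (i - 1)) & 1;
        int bit = prev ^ ((d >> i) & 1);
        if (use_xnor)
            bit ^= 1;
        q |= (uint16_t)(bit << i);
    }
    if (!use_xnor)
        q |= 1u << 8;
    return q;
}

/* Stage 2: optionally invert the low eight bits; bit 9 records it.
 * The caller chooses, standing in for the DC-balance tracking. */
uint16_t tmds_stage2(uint16_t q, int invert)
{
    if (invert)
        return (uint16_t)((q ^ 0xFF) | (1u << 9));
    return q;
}

/* Decoder: undo stage 2, then undo the XOR/XNOR chain of stage 1. */
uint8_t tmds_decode(uint16_t w)
{
    if (w & (1u << 9))
        w ^= 0xFF;
    int xor_used = (w >> 8) & 1;
    uint8_t d = w & 1;
    for (int i = 1; i < 8; i++) {
        int bit = ((w >> i) & 1) ^ ((w >> (i - 1)) & 1);
        if (!xor_used)
            bit ^= 1;
        d |= (uint8_t)(bit << i);
    }
    return d;
}
```

Decoding is the easy direction: it needs no state at all, which is handy for a tool that only has to read captured words back into pixel values.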

Also of relevance are the four special code words used to indicate H and V sync:

0010101011
0010101010
1101010100
1101010101

Those are interesting in that one of them, the one that encodes horizontal sync, should occur every raster line, i.e. every ~864 ticks of the 27MHz pixel clock, a pattern that we should be able to fairly readily find.
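Spotting those four words in a stream of 10-bit values is a simple table lookup. The sketch below treats each bit string above as an ordinary binary number with the left-most digit as the most significant bit. That is an assumption about notation only, since which end goes down the wire first still has to be worked out from the capture. The function name is hypothetical.

```c
#include <stdint.h>

/* The four sync control words listed above, read as plain binary
 * numbers (left-most digit = most significant bit; an assumption). */
static const uint16_t CONTROL_WORDS[4] = {
    0x0AB, /* 0010101011 */
    0x0AA, /* 0010101010 */
    0x354, /* 1101010100 */
    0x355, /* 1101010101 */
};

/* Returns the index (0..3) of the matching control word, or -1 if the
 * 10-bit word is not one of the four sync control words. */
int control_word_index(uint16_t word10)
{
    for (int i = 0; i < 4; i++) {
        if ((word10 & 0x3FF) == CONTROL_WORDS[i])
            return i;
    }
    return -1;
}
```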

There are also 16 more that can be used for HDMI data islands.

So a quick histogram of the frequency of each two-byte combination should give us a hint as to whether each pair of bytes encodes one of the R, G and B channels, as I suspect it might.

I cooked up a quick program to count the number of unique tokens:

$ make hist && ./hist 0.cap
gcc -Wall -g -o hist hist.c
Read 65536000 8-byte rows.
Saw 116 unique 2-byte tokens.

Given that there can be 65536 unique values, the highly constrained set of values gives me hope that they might be raw HDMI/DVI 8/10 bit data words.
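The counting itself needs very little code. The following is a sketch of the idea rather than the actual hist.c: it assumes the capture has already been read into memory as 8-byte rows, and it reads each token little-endian, which is just one of the possible byte orders. The function name and the per-channel interface are illustrative.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Count distinct 16-bit tokens in one channel (0..3) of a buffer of
 * 8-byte rows. Tokens are read little-endian, which is an arbitrary
 * choice. Returns -1 if the lookup table cannot be allocated. */
long count_unique_tokens(const uint8_t *rows, size_t nrows, int channel)
{
    uint8_t *seen = calloc(65536, 1);
    if (!seen)
        return -1;

    long unique = 0;
    for (size_t r = 0; r < nrows; r++) {
        const uint8_t *p = rows + r * 8 + (size_t)channel * 2;
        uint16_t token = (uint16_t)(p[0] | (p[1] << 8));
        if (!seen[token]) {
            seen[token] = 1;
            unique++;
        }
    }
    free(seen);
    return unique;
}
```

With the first two rows of the hex dump above, channel 0 yields the tokens $C43F and $0F80, and channel 3 only ever yields $0000.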

In fact, if I separate each of the 2 byte tokens and treat them as separate channels, we see that they each have quite different populations:

(The binary values are least significant bit first, i.e., the most significant bit is on the right hand side.)

CHANNEL 0:
Read 65536000 8-byte rows.
  $0255 : 1010101001000000 : 16634
  $0440 : 0000001000100000 : 70011
  $0480 : 0000000100100000 : 11761792
  $0567 : 1110011010100000 : 46
  $0667 : 1110011001100000 : 46
  $0967 : 1110011010010000 : 1308
  $09B9 : 1001110110010000 : 116
  $0A4F : 1111001001010000 : 2156
  $0A67 : 1110011001010000 : 72541
  $0AB9 : 1001110101010000 : 2213
  $0B67 : 1110011011010000 : 5417
  $0D55 : 1010101010110000 : 9649169
  $0DD5 : 1010101110110000 : 483395
  $0F80 : 0000000111110000 : 9801503
  $85B1 : 1000110110100001 : 510
  $86B1 : 1000110101100001 : 508
  $89B1 : 1000110110010001 : 1455
  $8A63 : 1100011001010001 : 173
  $8AB1 : 1000110101010001 : 572465
  $8AB8 : 0001110101010001 : 24457
  $8BB1 : 1000110111010001 : 12142
  $C2AA : 0101010101000011 : 197898
  $C43F : 1111110000100011 : 16172474
  $C458 : 0001101000100011 : 12130
  $C4B0 : 0000110100100011 : 92075
  $C558 : 0001101010100011 : 91
  $C5B0 : 0000110110100011 : 1021
  $C658 : 0001101001100011 : 46
  $C6B0 : 0000110101100011 : 508
  $C958 : 0001101010010011 : 560
  $C9B0 : 0000110110010011 : 965
  $CA58 : 0001101001010011 : 24081
  $CAB0 : 0000110101010011 : 188469
  $CB58 : 0001101011010011 : 174
  $CBB0 : 0000110111010011 : 1986
  $CD2A : 0101010010110011 : 38720
  $CDAA : 0101010110110011 : 3724825
  $CF3F : 1111110011110011 : 12601920
Saw 38 unique 2-byte tokens.
CHANNEL 1:
Read 65536000 8-byte rows.
  $2AAC : 0011010101010100 : 214532
  $3F00 : 0000000011111100 : 16172474
  $3FFC : 0011111111111100 : 12601920
  $4000 : 0000000000000010 : 70011
  $4CCC : 0011001100110010 : 104205
  $5C38 : 0001110000111010 : 1108
  $5C90 : 0000100100111010 : 556
  $63C4 : 0010001111000110 : 1113
  $8000 : 0000000000000001 : 11761792
  $80FC : 0011111100000001 : 9801503
  $9870 : 0000111000011001 : 1108
  $988C : 0011000100011001 : 2818
  $A70C : 0011000011100101 : 11453
  $A770 : 0000111011100101 : 872912
  $A78C : 0011000111100101 : 1586
  $A790 : 0000100111100101 : 1082
  $B00C : 0011000000001101 : 2297
  $B070 : 0000111000001101 : 11452
  $B970 : 0000111010011101 : 1083
  $B990 : 0000100110011101 : 4331
  $B9C4 : 0010001110011101 : 555
  $D550 : 0000101010101011 : 13896109
Saw 22 unique 2-byte tokens.
CHANNEL 2:
  $06C1 : 1000001101100000 : 328
  $0801 : 1000000000010000 : 483395
  $0AC1 : 1000001101010000 : 12601920
  $0C01 : 1000000000110000 : 9801503
  $0C41 : 1000001000110000 : 508
  $1481 : 1000000100101000 : 46
  $1CC1 : 1000001100111000 : 6520
  $1D01 : 1000000010111000 : 38720
  $1D81 : 1000000110111000 : 512
  $1FC1 : 1000001111111000 : 971
  $2601 : 1000000001100100 : 4933
  $3A01 : 1000000001011100 : 23719
  $3AC1 : 1000001101011100 : 348
  $4241 : 1000001001000010 : 46
  $43C1 : 1000001111000010 : 484
  $4B41 : 1000001011010010 : 508
  $4C01 : 1000000000110010 : 86
  $4DC1 : 1000001110110010 : 92075
  $52C1 : 1000001101001010 : 4932
  $5301 : 1000000011001010 : 565945
  $5501 : 1000000010101010 : 232
  $55C1 : 1000001110101010 : 438
  $5A01 : 1000000001011010 : 24457
  $5EC1 : 1000001101111010 : 965
  $6941 : 1000001010010110 : 510
  $6D01 : 1000000010110110 : 174
  $6E01 : 1000000001110110 : 484
  $7101 : 1000000010001110 : 2156
  $7401 : 1000000000101110 : 995
  $8501 : 1000000010100001 : 70011
  $8601 : 1000000001100001 : 1998
  $8801 : 1000000000010001 : 9649169
  $8901 : 1000000010010001 : 67261
  $9201 : 1000000001001001 : 186988
  $9B01 : 1000000011011001 : 46
  $9D01 : 1000000010111001 : 3724825
  $9F41 : 1000001011111001 : 46
  $A401 : 1000000000100101 : 1867
  $A7C1 : 1000001111100101 : 116
  $AA01 : 1000000001010101 : 87
  $B9C1 : 1000001110011101 : 2119
  $BC01 : 1000000000111101 : 173
  $C501 : 1000000010100011 : 1986
  $C581 : 1000000110100011 : 45
  $DB01 : 1000000011011011 : 2213
  $DE81 : 1000000101111011 : 510
  $EBC1 : 1000001111010111 : 276
  $EEC1 : 1000001101110111 : 16634
  $EFC1 : 1000001111110111 : 178
  $F001 : 1000000000001111 : 6520
  $F301 : 1000000011001111 : 509
  $F341 : 1000001011001111 : 46
  $F5C1 : 1000001110101111 : 12130
Saw 55 unique 2-byte tokens.
CHANNEL 3:
Read 65536000 8-byte rows.
  $0000 : 0000000000000000 : 65536000
Saw 1 unique 2-byte tokens.

So we see that each of the four two-byte tokens is in fact drawn from a completely separate population, with the total number of unique tokens (116) equalling the sum of the number of unique tokens within each of the four channels (38 + 22 + 55 + 1).  This is a good hint that we are in fact looking at four channels of data.

Note also that I have picked the byte order and bit order within the bytes quite arbitrarily -- and either or both might be wrong.

Channel 2 is particularly interesting, as there are six bits that have a fixed value, leaving exactly 10 bits that take differing values... and an HDMI/DVI data word is 10 bits long.

Now, looking at more of the file, I can see a general structure that repeats every 863 x 8 bytes. This is significant, because the raster lines in this video mode are 863 cycles in duration.  This is further circumstantial evidence that each 8 byte block corresponds to one cycle of time, and thus should contain the 3x10 bits of data plus 10 bits of clock. Hmmm... That adds up to 40 bits = 5 bytes, so we have one extra byte of something, presumably, or more if the clock is not recorded explicitly. The clock is interesting, because it should be 5 bits of 1 followed by 5 bits of 0, which should be easy to spot.

Also looking, I can see that there are 64 repetitions of:

aa cd 50 d5 01 9d 00 00

every raster line. That matches the length of the horizontal sync pulse in pixel clock cycles, so we should expect at least the DVI channel 0 to have the HSYNC control word in it, i.e., one of those values that has lots of 1010101 in it.  Plus the clock should have 1111100000, so let's turn those bytes into binary, and see what we can see, recognising that we might need to reverse the byte and/or bit orders:

10101010 11001101 01010000 11010101 00000001 10011101

Well, that's reassuring that we can see plenty of 1010101 looking stuff in there, e.g., in the first and fourth bytes, in particular. What I can't see, though, is any way to have the 1111100000 clock pattern, so perhaps it isn't recorded.  This would make sense, as it would kind of be consumed during the decoding in the analyser, and implied.  That then leaves the mystery of why we have 48 bits being used to represent 3 x 10 bit values, when 30 would have been enough.

We then see 67 lots of:

55 0d 50 d5 01 88 00 00

01010101 00001101 01010000 11010101 00000001 10001000 

This looks to me like 12 bits are reserved for each of the 10 bit words, and that they are laid out quite naturally, as we can see by lining those up on top of each other:

10101010 11001101 01010000 11010101 00000001 10011101
01010101 00001101 01010000 11010101 00000001 10001000

We see the different control word in channel 0 in the left most two bytes, while those in the other two channels remain unchanged after the end of the HSYNC period.  The values in the channel 0 control word are only valid if the bit order is reversed over the 10 bits, to yield:

1101010101 xx 0010101011 xx 0010101011 xx ????????????

I would like to be really sure that this pattern holds, which I can check by testing for any of the xx bits being non-zero anywhere in the file. If not, then we will assume that they aren't used -- which holds true.  
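The layout worked out above (12 bits per channel, 10 of them used and bit-reversed, 2 spare) can be unpacked in a few lines of C. This is a sketch of that reading of the format rather than the real analyser code, and the names are invented for the example. The two spare bits are returned separately so that a caller can check that they are always zero.

```c
#include <stdint.h>

/* Reverse the order of the low 10 bits of v. */
static uint16_t reverse10(uint16_t v)
{
    uint16_t out = 0;
    for (int i = 0; i < 10; i++) {
        if (v & (1u << i))
            out |= (uint16_t)(1u << (9 - i));
    }
    return out;
}

/* Split the first six bytes of an 8-byte row into four 12-bit fields.
 * For the first three fields, words[k] receives the bit-reversed top
 * 10 bits and pads[k] the two spare "xx" bits. tail receives the
 * fourth 12-bit field untouched. */
void unpack_row(const uint8_t row[8], uint16_t words[3],
                uint8_t pads[3], uint16_t *tail)
{
    uint64_t v = 0;
    for (int i = 0; i < 6; i++)
        v = (v << 8) | row[i];

    for (int k = 0; k < 3; k++) {
        uint16_t field = (uint16_t)((v >> (36 - 12 * k)) & 0xFFF);
        words[k] = reverse10(field >> 2);
        pads[k] = (uint8_t)(field & 3);
    }
    *tail = (uint16_t)(v & 0xFFF);
}
```

Feeding in `aa cd 50 d5 01 9d 00 00` gives the words 1101010101, 0010101011 and 0010101011 with all spare bits zero, which matches the hand decoding above.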

So what on earth is in those last 2 bytes, then, if they contain only 2 bits of channel data?

It holds to a specific pattern, where only the last 8 bits actually change.  My guess is that it is some kind of checksum or CRC, and I think I am right: Remember those two 8 byte vectors we were looking at?

55 0d 50 d5 01 88 00 00
aa cd 50 d5 01 9d 00 00

0x55+0x0d+0x50+0xd5=0x187

0xaa+0xcd+0x50+0xd5=0x29c

If we assume that the fifth byte is simply required to have "000001" in its lower bits, then that leaves the last byte as a simple 8-bit checksum minus 0x100, with internal carry, i.e., the sum 0x29c becomes 0x19c becomes 0x9c + 1 = 0x9d.

I can try to verify that this algorithm is correct, by testing it on every single 8-byte vector. Well, it's almost right, in that it comes out to within 1 of the correct value every time. The internal carry might be wrong, perhaps, and there might be some other fancy adjustment being made.
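One variant that might be worth trying, offered purely as a guess: for the two vectors shown above, adding up all five leading bytes (so including the 0x01) gives 0x188 and 0x29d, and the low byte of each sum equals the sixth byte exactly. Whether that holds across a whole capture file has not been checked here. The function name is hypothetical.

```c
#include <stdint.h>

/* Candidate checksum (a guess): the low 8 bits of the sum of the
 * first five bytes of a row, to be compared against row[5]. */
uint8_t checksum_candidate(const uint8_t row[8])
{
    unsigned sum = 0;
    for (int i = 0; i < 5; i++)
        sum += row[i];
    return (uint8_t)(sum & 0xFF);
}
```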

Anyway, I'm not going to get too worried about it, since it is pretty clear that it is some kind of checksum.  This all means that we have a pretty clear idea of the format of the file, and can move on to making a simple decoder for it.    

It's the weekend again, and I have made some more progress, by beginning to create a little program that takes one of these capture files and probes and tests the captured log.  The reason for this is to debug an issue with glitchy/absent DVI/HDMI video on some of the R3A production boards, which we know is due to manufacturing variation of the FPGA part itself, as previously discussed.

So my plan is to make a capture from an R3 board that has good video output on a given bitstream, and then use the same bitstream on an R3A board that doesn't show video in the same circumstances. I will then implement the various tests -- starting from the low-level signalling, which is where I suspect the problem will be, and working my way up.

One nuisance of the capture format of the HDMI analyser is that it doesn't include an indication of the clock frequency, so there isn't a hard and fast way to automatically recognise the video mode, or be sure that it is at the right pixel clock. To work around this, I have already implemented logic that figures out the HSYNC and VSYNC timing, and then compares that to the list of official HDTV modes, and computes a least-squares error to find out which mode it most probably is.  This allows for some slop in the mode implementation, which is not uncommon (the MEGA65 in PAL for example uses 863 cycles per raster instead of 864 cycles per raster, so that it is divisible by 3).

In trying to remember exactly how the DVI/HDMI SYNC marking works, I was reminded of this series of blogs by someone else who made a basic HDMI analyser from an FPGA:

https://warmcat.com/2015/10/20/hdmi-capture-and-analysis-fpga-project.html

While not directly applicable to what we are trying to do here (as we already have a box that grabs the HDMI/DVI signals), it is an interesting read through, and has a hint for something that I will try to fix on the MEGA65 video out -- the HSYNC and VSYNC pulses should be synchronised, which they are not currently on the MEGA65.  That can be fixed easily enough in the VHDL.

But back to the HDMI analyser software so that we can compare good and bad output from different boards, I'll now implement the SYNC detection by tracking the SYNC values, and updating them when we see differing SYNC control word values on the blue channel.  
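Once the sync state is known for every captured cycle, the timing falls out of measuring the gaps between edges. Here is a rough sketch of that step, assuming the sync level has already been extracted into an array of 0/1 values, one per cycle. The function name and the `max_gap` cap are inventions for this example, not part of the real program.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Return the most frequent distance (in samples) between successive
 * rising edges of a 0/1 signal. Gaps longer than max_gap are ignored.
 * Returns 0 if no gap was counted or memory cannot be allocated. */
size_t most_common_rising_interval(const uint8_t *level, size_t n,
                                   size_t max_gap)
{
    size_t *counts = calloc(max_gap + 1, sizeof *counts);
    if (!counts)
        return 0;

    size_t last_edge = 0, best = 0;
    int have_edge = 0;
    for (size_t i = 1; i < n; i++) {
        if (level[i] && !level[i - 1]) {       /* rising edge */
            if (have_edge && i - last_edge <= max_gap) {
                size_t gap = i - last_edge;
                counts[gap]++;
                if (counts[gap] > counts[best])
                    best = gap;
            }
            last_edge = i;
            have_edge = 1;
        }
    }
    free(counts);
    return best;
}
```

Taking the most frequent gap rather than the first one makes the result tolerant of a partial line at the start of the capture or the occasional glitch.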

That now works, and I can see the video mode parameters being correctly inferred:

$ make hist && ./hist good-r3.cap
make: 'hist' is up to date.
DEBUG: 459 unique pixel values
DEBUG: Read 13107200 records.
ERROR: 2603320 invalid DVI 10-bit words observed:
       10790x $013 (0000010011)
       22150x $043 (0001000011)
       25085x $20f (1000001111)
       45970x $270 (1001110000)
       40350x $2bc (1010111100)
       2458975x $2ec (1011101100)
DEBUG: Saw control word counts: 29122 144 15490 120
DEBUG: Saw most frequent lengths of: 14(x14013) 799(x96) 64(x14625) 64(x120)
DEBUG: HSYNC+ intervals = 863 863 863
DEBUG: HSYNC- intervals = 863 863 863
DEBUG: VSYNC+ intervals = 539375 539375 539375
DEBUG: VSYNC- intervals = 539375 539375 539375
INFO: Raster lines are 863 cycles long.
INFO: Frames 539375 cycles long.
INFO: Frames consist of 625 raster lines
DEBUG: Most frequent VSYNC low/high length = 65535(x24) 4315(x24)
DEBUG: VSYNC low -> 65535 / 863 = 75.94,  VSYNC high -> 4315 / 863 = 5.00
INFO: VSYNC pulse lasts 5 rasters, polarity is POSITIVE
INFO: File contains 24 frames (first is possibly partial, and a last partial frame may also be present)
DEBUG: Most frequent HSYNC low/high length = 799(x15166) 64(x15167)
INFO: HSYNC duration is 64 cycles
INFO: Mode most closely matches 720x576 50Hz non-interlaced (mode error = 1)
INFO: The mode differs from the expected mode in the following ways:
      h_total: saw 863, expected 864

You can also see in the 2nd-last INFO line, that the video mode detection magic has correctly inferred the video mode, using that least-square error algorithm I mentioned. The error is only 1, because the only divergence between the actual and model video modes is the duration of the raster lines differing by 1 cycle.  
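For the curious, a sum of squared differences over four timing parameters is enough to reproduce the error figures printed in this post. The sketch below is an illustration of that idea, not the actual code. The two table entries use the "expected" values the tool prints (the 720x480 ones appear in a later run further down), with the remaining 720x576 values inferred from the fact that only h_total was reported as differing. A real table would hold the full list of standard modes, and the actual program may weigh more parameters than these four.

```c
#include <stddef.h>

struct video_mode {
    const char *name;
    long h_total, v_total, hsync_len, vsync_len;
};

/* Two illustrative entries only, not a complete mode table. */
static const struct video_mode MODES[] = {
    { "720x576 50Hz non-interlaced", 864, 625, 64, 5 },
    { "720x480 60Hz interlaced",     858, 524, 62, 6 },
};

/* Sum of squared differences between measured and nominal timings. */
long mode_error(const struct video_mode *m, long h_total, long v_total,
                long hsync_len, long vsync_len)
{
    long dh = h_total - m->h_total, dv = v_total - m->v_total;
    long dhs = hsync_len - m->hsync_len, dvs = vsync_len - m->vsync_len;
    return dh * dh + dv * dv + dhs * dhs + dvs * dvs;
}

/* Index of the best-matching entry in MODES; *err receives its error. */
size_t closest_mode(long h_total, long v_total, long hsync_len,
                    long vsync_len, long *err)
{
    size_t best = 0;
    long best_err = mode_error(&MODES[0], h_total, v_total,
                               hsync_len, vsync_len);
    for (size_t i = 1; i < sizeof MODES / sizeof MODES[0]; i++) {
        long e = mode_error(&MODES[i], h_total, v_total,
                            hsync_len, vsync_len);
        if (e < best_err) {
            best_err = e;
            best = i;
        }
    }
    *err = best_err;
    return best;
}
```

With the measured values from the run above (863, 625, 64, 5) this picks the 720x576 entry with an error of 1, since the only difference is the single cycle in h_total.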

The algorithm isn't perfect if a mode differs in multiple ways from a standard mode, but it should in most cases give a good indication, and can act as a sanity check if you have fed in a different mode to that expected -- something that the fancy HDMI analyser software was blind to, and would instead just complain about every divergence between what was seen and expected, without suggesting that you check that the source was set to the correct mode.

We can also see that there are some unexpected 10-bit words (the ERROR lines near the top of the output).  Interestingly, this is on a board that is producing a valid picture. So I need to figure out what is going on there: Are we producing nonsense codes systematically, or is my decoder doing something odd?  The fact that there are very large numbers of them suggests to me that it is some systematic problem. For example, am I incorrectly calculating the codes for the different pixel values in either the VHDL, or in this analyser program -- either could cause what we are seeing. Or is it a problem with the transmission of the codes over the digital serial interface that is DVI/HDMI?

In the end I solved that problem by finding a complete list of the codes from one of Hamster (Mike Field)'s great FPGA projects, and slurping those in.

With that in place, I now get no errors of bad TMDS code words, which is great. 

Even better, when I ran the revised version of the program over the capture from one of the boards that was showing glitchy video, it showed that the video mode was not the PAL video mode that it should have been:

$ make && ./hist bad-r3a.cap
make: Nothing to be done for 'all'.
DEBUG: 460 unique pixel values
DEBUG: Flip test passed.
DEBUG: Read 13107200 records.
DEBUG: Saw control word counts: 1938 14476 162 28505
DEBUG: Saw most frequent lengths of: 176(x637) 64(x12843) 799(x75) 14(x12358)
DEBUG: HSYNC+ intervals = 863 863 863
DEBUG: HSYNC- intervals = 863 863 863
DEBUG: VSYNC+ intervals = 1771 12900 4423
DEBUG: VSYNC- intervals = 2133 12694 4640
INFO: Raster lines are 863 cycles long.
ERROR: Frames are not of consistent length.
INFO: Frames consist of 0 raster lines
ERROR: Frames are not an integer number of rasters long.
DEBUG: Most frequent VSYNC low/high length = 176(x63) 504(x5)
DEBUG: VSYNC low -> 176 / 863 = 0.20,  VSYNC high -> 504 / 863 = 0.58
ERROR: VSYNC pulses don't seem to be a multiple of the raster length.
INFO: VSYNC pulse lasts 0 rasters, polarity is UNKNOWN
INFO: File contains 63 frames (first is possibly partial, and a last partial frame may also be present)
DEBUG: Most frequent HSYNC low/high length = 64(x13335) 799(x12467)
INFO: HSYNC duration is 64 cycles
INFO: Mode most closely matches 720x480 60Hz interlaced (mode error = 274641)
INFO: The mode differs from the expected mode in the following ways:
      h_total: saw 863, expected 858
      v_total: saw 0, expected 524
      hsync_len: saw 64, expected 62
      vsync_len: saw 0, expected 6

Basically it thinks the screen is zero raster lines long. Something very odd is going on. Well, we knew that, because we weren't seeing a picture. But now we are getting some info on what is going on.

This led me down a fruitful path of investigation that revealed that the RESET line was glitching, with the glitch coming from the MAX10, the second FPGA on the MEGA65 main board.  Looking further revealed that the communications from the MAX10 to the main FPGA were basically getting totally confused.  The MAX10 is also responsible for reading the dip switches -- one of which toggles between DVI and HDMI.

All that means that the HDMI encoder was thinking it had to switch between HDMI and DVI all the time, which was confusing it no end. Especially since this could happen in the middle of a pixel.

So now I have something concrete to fix.

For now, I will just disable the MAX10 communications, and see if I can't get something stable.

The HDMI test bitstream with MAX10 disconnected is stable, provided all the HDMI features are turned off, i.e., sending pure DVI signalling.  So I will turn on one of the packets, and then try to play spot the difference between the two.

So, in fact, I don't need to do that, because I have confirmed the glitching on the reset line is the entire problem.  It took a bit of fiddling to fix the glitching, because the MAX10 is using its internal oscillator which drifts with temperature to anywhere between 55MHz and 116MHz.  

Essentially the problem I had was that I was counting how long the incoming 41MHz clock from the Xilinx to the MAX10 FPGA on the communications path holds, to detect the sync (which is a long hold) among the clock pulses (short holds).  41MHz < 55MHz, so it should be fine, right? Nope, because each _half_ of the clock is effectively an 81MHz signal -- which is more than 55MHz, so if the MAX10's internal clock was running slow, it could sample on the same half of the clock multiple times running, and thus mistakenly believe that it is in a sync pulse.  

To solve this, I halve the 41MHz signal in the MAX10 FPGA, and then check that. With a bit of fiddling of the constants, that had it fixed. I could have fixed it in the Xilinx FPGA, but then we would have had to modify our release bitstream and test it again, which I still really wanted to avoid. And fortunately I was able to do this. In fact, I solved these last problems during the live stream I did on this topic: