Making a C64/C65 compatible computer: January 2023

Friday 6 January 2023

Adding colour to the MEGA65's Composite Output

In the last episode, we got the composite video output of the MEGA65's expansion board working -- but only in monochrome. Through trickery, we were getting more than 16 shades of grey out of our 4-bit DAC (and we have further Cunning Plans :tm: to extract even more, but that's likely to be described in another blog post). But what we really want right now, is working colour. For both NTSC and PAL.

In this post, I manage to get PAL non-interlaced composite video working with pretty decent colour signalling, producing images like this:

First step, get those colour bursts in the raster lines where they need to go, and only in the raster lines where they should be, and with the correct phase. The book Video Demystified continues to be my able assistant here, with lots of deep technical information. The book can be easily found and purchased via online search.

The colour burst pulse should be 9 +/- 1 cycle at the appropriate frequency. I setup the mechanics for generating that in the previous blog post, but didn't actually hook it up, so I'll do that now, and see if we can't get that sorted fairly quickly. Interestingly, according to this image, the C64 had a much longer colour burst, presumably because it can only help the receiving monitor to have more cycles to lock onto the phase of. It also displayed an invisible white pixel on the left edge, presumably to help monitors calibrate the luminance more stably.

raster15khz_skip is conveniently counting down the 108 pixel ticks from the start of the HSYNC pulse. That covers a total of 12 usec, so we can just check for about the half-way point, and then add the colour burst signal.

We have to scale the colour burst down to a range of +/- 37 instead of +/- 127. I'm willing to bet that a simple divide by 4 to make it +/- 32 will work fine. We also need to synchronise the colour burst to start at $00 or $80 hexadegrees (see my previous post for a bit more discussion on the hexadegree system of 256 = $100 hexadegrees in a circle).

Okay, I have added logic to clamp the phase at the start of the colour burst, and then to selectively super-impose it onto the luma signal after the sync pulse. It looks something like this:

The wiggly lines are the colour burst. What is important, is that it starts from the centre line, rather than some random point on every raster line. This is right, so that's good. I reckon I can start the burst earlier, and have more cycles, like the C64 does. I can also slightly tweak the end time, so that it always ends on a zero-crossing, rather than having a glitch. Both of those are now done.

So in theory if we now modulate the colour components onto the colour burst signal, we should see colours. And this is where it gets fun. We will focus on PAL first. The luma signal is actually already a combination of the red, green and blue channels. This is called Y. Then to get all three, we need to have U and V, which are other combinations of the red, green and blue. This structure is based on the historical development of colour TV as an overlay onto the existing black and white TV signals, combined with some clever insight into the sensitivity of human vision to different colours.

What matters for us right now, is the equations for U and V. These should be gamma corrected first, but I'm going to leave that for now. It just means that the colour saturation curves will be a bit wrong. Gamma correction can be added in later, essentially just using a 256 entry look-up table that has the gamma corrected values, and which is pre-applied to the red, green and blue channels.

But back to those equations:

Y = 0.299R + 0.587G + 0.114B
U = – 0.147R – 0.289G + 0.436B = 0.492 (B – Y)
V = 0.615R – 0.515G – 0.100B = 0.877(R´ – Y)

Since Y is already transmitted as luma, which is mostly the green channel, with a bit of red and a bit of blue, the U and V signals are the differences between the red and blue channels and Y. Note that this means that U and/or V can be negative at any point in time.

The U and V signals are then modulated with the colour burst signal: U with the sine of the current colour burst phase, and V with the cosine of the colour burst phase. This means that they are encoded "90 degrees apart", using what is called "I/Q" or "quadrature" coding. There is a lot of clever-pants signal processing that can be done with I/Q signals to extract the I and the Q parts back out at the other end. Fortunately we don't have to separate them, but just combine them, which we just do with simple addition.

The literature indicates that the modulated U and V signals should then be "saturation limited", so that if the magnitude of the values are too great, that they get clipped, rather than wrapped. This makes sense, as it just limits the red and blue aspects of the signals.

We will need a bit of a pipeline to generate U and V, scale them, and then modulate and saturation limit them. The Y signal will need to be similarly delayed so that the components all stay in phase.

Ok, I now have the colour signal super-imposed over the luma signal. However, I think the amplitude is too large, so I'll try cutting it in half. This is what I am seeing:

The checkered pattern is the direct result of the over-amplitude chroma signal, as far as I can tell. What is interesting, and I don't really know why it is happening, i how the parts of the display that are being displayed more or less correctly, seem to be getting dithered.

Now, the quadrants of the image that are being displayed properly are those that have blue on full, and more red than green. Not yet sure what clues that gives me. Let's look at the chroma amplitude first, anyway.

This is how it looks at the moment, when we look at the raw signal:

We can see that the amplitude is much greater than in the colour burst itself. That should be fairly easy to fix.

Also, the bandwidth of the chroma signal is also way too high, as evidenced by the very jagged nature of the signal. This is a problem, because the chroma signal is only allowed to occupy ~1MHz of bandwidth. I'll likely need some kind of low-pass filter to solve this problem. I'll have a think about that while I fix the amplitude.

That's looking a lot more sensible to me. Let's hope it fixes it. Even if the bandwidth is bad now, the vertical colour bars are wide enough that they show up as areas with constant phase, and thus very narrow bandwidth, and they really should show up in colour. So I'll synthesise that, while I think more about limiting the chroma bandwidth more generally.

One of the things that has confused me about the chroma signal for PAL and NTSC, is what the actual horizontal colour resolution really is. As the carrier is ~4MHz, it can't be higher than that. In fact, it probably has to be somewhat less than that, probably of the order of 2MHz or so. Compared to our ~13.5MHz clock, that's close to one character wide.

In fact, if we remember back to the C65's composite output, that's probably close to right: Trying to display, for example, a red character on a blue background resulted in unreadable rubbish, because at the ~8MHz pixel clock of the C64, it couldn't change between red and blue every pixel, because the result would be above the colour carrier. Whole characters could be alternately coloured, however, without much trouble. So it's probably about right. It will be interesting to see if we get that kind of problem with the corrected colour amplitude, or if the monitor still refuses to show it in colour.

Still no colour, but the stippled pattern is less pronounced. My best guess at the moment is that it is Chroma Dots caused by the colour signal not being decoded, because the TV is treating it as a monochrome signal still.

Ah, there is a good clue in here. Apparently you _don't_ reset the phase of the colour carrier every raster, but leave it running, exactly as elpuri describes "despite what all those nice diagrams always show". And that has certainly helped! The monitor now clearly thinks that it is dealing with a colour image, even if things are rather broken:

The first image is of the PAL display:

The colours are clearly pretty messed up. NTSC is not quite as messed up, but it's all rather relative:

With NTSC it looks like I am encoding red as green. The upper right triangles of the middle section are actually not far off the correct blue gradient that is expected here.

Thinking about it, this test pattern has really quite chroma low bandwidth requirements, because the colours largely transition gradually. So I should be able to get it working here, without having to fluff about with low-pass filtering the chroma signal. That can be dealt with later.

I'm tempted to remove one of the U or V channels temporarily to simplify the debugging. It would also be great at this point to be able to access a "correct" composite display of the test pattern, so that I could compare it. I was hoping my VP-728 upscaler might have composite outputs, but it doesn't.

So I'll start by synthesising with just U, and no V component. I might also just try outputing the RGB as YUV values directly, since that should generate a valid image, just with messed up colour space. If that still has non-coloured sections, then I will know that there is something else messed up.

Okay, so with either U or V disabled, I am indeed getting less colouring.

Hmmm.. I've reworked the sine table lookup stuff a couple of times, and I'm now more happy that that is doing what it should. PAL colours are still messed up, and missing from some areas, especially where red dominates the colour. NTSC colour encoding is still quite incomplete, as I am basically using PAL encoding in all modes right now. But it's progress.

Now it's time to look at all those PAL phase changes, and check how I am going with that. I know that I am not inverting the phases each field, which is causing the colours to switch between two different rotations of the colour wheel. So that's the first one to fix, I think.

The first thing to tackle is that the phase of the V component is supposed to alternate every other line. With that in place, it's "less bad", but not perfect, as I still haven't implemented the inter-frame correction.

There are also still large areas that are lacking colour. Investigating this, I have found that green coloured areas are resulting in rubbish colour information. A single pure colour should generate a similarly pure sine wave with the correct phase that indicates its hue. But instead, I'm seeing choppy rubbish like this:

The nice sine wave is the colour burst at the start of the raster line. Then the choppy stuff is all supposed to be pure green (RGB = #00FF00). Pure red doesn't have this problem.

Now, one of the interesting things with the quadrature encoding method that PAL and NTSC use, is that a mixture of a pair of sine and cosine waves with the same frequency and kept in phase, will generate a signal of variable amplitude and phase, but at the same frequency. Thus for a region with constant colour, regardless of the colour, we should see a nice continuous sine wave. Not this kind of choppy rubbish.

Disabling either U or V colour information resolves it -- not surprising since it will result in either a pure sine or pure cosine component from the remaining signal.

Ah! I found the problem: When combining the U and V signals after multiplying them by the sine and cosine table entries, it is possible for the result to overflow the variable I had defined. By adding an extra bit of precision, that problem has now gone away. I'm not sure that the resulting amplitude is now high enough, but that is something fairly easy to fix later, if the colours are undersaturated (which is how that problem would manifest).

I'm synthesising a test bitstream for this now, and am quietly hopeful that it will fix a lot of the rubbish I was seeing, because a lot of it can be explained by pseudo-random patterns moving from the colour space to luma space -- which is exactly what would happen if it was saturating and then wrapping around instead of clipping.

And indeed that has helped: There are now colours that are still wrong, but there is no longer spotty rubbish caused by random looking colour subcarrier wave-forms. Here is how the PAL test pattern looks now:

Now I think the most fruitful thing to attack will be the correct handing of the phase inversion that is causing the colour to still flicker between two different hues all the time. This unfortunately can't be seen in the photo, because of the short shutter time. Figures 8.16a and 8.16b in Video Demystified has the information we need here, so I'll progressively implement it.

I'll start with fake-progressive, as that has by far the simplest arrangement.

In the process I found I was setting the PAL colour burst frequency using degrees instead of hexadegrees, which mean that it was using funny angles: 135 hexadegrees = 190 degrees and 225 hexadegrees = 316 degrees, giving an angle between them of 126 degrees, instead of 90 degrees, and likely to be causing all sorts of problems. Let's see if that improves things. Not noticeably, except that it removes the need for some fudge factors I had previously added, so that's a bonus. I also noticed I have -V on the wrong half of raster lines, so fixing that, too.

But now back to getting that sequence of suppressed colour bursts and phase inversions at the start of each field.

Hunting around the internet for tools to make the debugging of the video more efficient, I found cvbs-decode, which can decode raw PAL/NTSC video (but I can't get it to run, due to some weird python errors), and also the HackTV PAL/NTSC video generator for HackRF written in C.

The HackTV stuff is essentially an implementation of what I want to do, just in another programming language. Thus it could be a good place to mine for clues on what I am doing wrong. The code is structured very abstractly, however, so that might take me a while.

Coming back around to the parts of the test pattern that lack any colour at all, I am looking at the waveforms I am producing for those, and I can actually see that there is in fact no colour subcarrier visible for that part of the image. That's at least one thing that I can tackle with the tools I have already created, so I'll bash against that for a while.

The main area that is showing no colour is the triangle that has more red than green, and has blue at full value. If there is more green than red, then it seems to be ok. Looking closer, there is a colour signal, but the amplitude is really low. I had to reduce the colour amplitude by 2x while implementing it, to prevent over-flows. It's possible that I have over-attenuated it.

Meanwhile, to try to improve my workflow, I have improved my little trace program so that as well as a PDF with oscilloscope like traces, it now also produces an HSYNC-synchronised PNG view, effectively showing the whole video frame, including SYNC pulses (as black). The PAL colour information will appear as dot/stripe patterns, as it would on a black-and-white TV. After fixing a few bugs that it helped me to find, this is what I see for the PAL interlaced video:

You will probably need to click on the image to view full-size, as it has a fairly horrible aspect ratio.

One thing about showing the colour information directly without decoding, it means we can by eye see the colour intensity by how bright the pattern is. Here is a zoom-in of part of it:

Here we can see two regions with quite high colour saturation, and between them in the lower-right area, a section that still has some colour information, but the pattern is quite faint, meaning that the colour intensity is very low. This is the area that should be purple. This clearly means that we have some problem with the YUV generation, and explains the lack of colour on the PAL monitor, because it really isn't there, or rather, is so faint, as to be effectively invisible.

One of the colours that is having this problem is #5800FF. Let's do the YUV calculation by hand, and see how we think it should show up.

Let's apply the YUV calculation to this colour:

Y = 0.212R´ + 0.700G´ + 0.086B´ = 0.0482 + 0 + 0.086 = 0.134
U = 0.492(B´ – Y) = 0.492 x 0.866 = 0.426
V = 0.877(R´ – Y) = 0.877 x (0.345 - 0.134) = 0.185

Scale these up to 8-bit range, and we get:

Y = 34 = $22
U = 109 = $6E
V = 47 = $2F

Hmm... The values I have in the VHDL are nothing like that. Y itself is quite different, for a start. Now, there are two ways to calculate U and V: The method above, or by applying direct RGB coefficients:

Y = 0.299R´ + 0.587G´ + 0.114B´
U = – 0.147R´ – 0.289G´ + 0.436B´ = 0.492 (B´ – Y)
V = 0.615R´ – 0.515G´ – 0.100B´ = 0.877(R´ – Y)

Hmm, these coefficients are different again for Y, that I got from a different part of the document. They do seem to match what I am using in the VHDL, though, so let's start by checking those.

Y= 0.103 + 0 + 0.114 = 0.217
U = -0.051 - 0 + 0.436 = 0.385
V = 0.212 - 0 - 0.1 = 0.112

Scaled up, we get:
Y = 55 = $37
U = 98 = $62
V = 54 = $36

The VHDL is producing:

Y = $1D
U = $10
V = $04

In short, I have something very wrong going on with my YUV calculations in the VHDL, because the ratios between these values are not correct. It's a little more complicated to debug here, because the VHDL doesn't do x256 scaling, because it reserves part of the 8-bit range of the DAC for SYNC and head-room for colour coding. Thus I calculate Y with a different scaling factor compared with U and V. That said, U and V are calculated with the same scale factor, so the ratio between those two should hold. So lets look more closely at those:

U = – 0.147R´ – 0.289G´ + 0.436B´ = 0.492 (B´ – Y)
V = 0.615R´ – 0.515G´ – 0.100B´ = 0.877(R´ – Y)

For U, the VHDL does: U = -6 R - 12 G + 18 B. Those ratios look ok.

For V, the VHDL does: V = 18R - 15G - 3B. Again, those ratios look ok.

So where have I messed up? Let me double-check the direct calculations:

U = – 0.147R´ – 0.289G´ + 0.436B´
U = -0.147x88 - 0.289x0 + 0.436 x 255
U = -12.936 - 0 + 111
U = 98

So U looks ok. Now to V:

V = 0.615R´ – 0.515G´ – 0.100B´
V = 54 - 0 - 26 = 28

Ah, so I made an error in my calculations above (which I marked with the underline).

So in fact the ratio of U and V is about right.

This makes me then suspect that the colour saturation is too low, i.e., that we need to scale up the chroma signal. But it already has quite a bit of amplitude for some of the other colours. So maybe I should implement the gamma correction first, instead.

PAL should use a gamma correction of 2.8. There are plenty of online gamma curve calculators that can be used to generate a suitable table. Although the NTSC/PAL preferred curve is a little more complex:

For R, G, B < 0.018

R´ = 4.5 R
G´ = 4.5 G
B´ = 4.5 B

For R, G, B ≥ 0.018

R´ = 1.099 R^0.45 – 0.099
G´ = 1.099 G^0.45 – 0.099
B´ = 1.099 B^0.45 – 0.099

Gamma is now implemented. I'll synthesise a bitstream to see how it looks. But what I can already tell is that it hasn't fixed the lack of saturation of the purpley areas in the simulation. So there is still other problem. But it's nice to have at least one of issue eliminated.

Looking at the specifications for NTSC and PAL video again, I think I have the colour burst amplitude too high, which causes the monitor to interpret the colour saturation as being too low. Currently it has a peak to trough amplitude of 64 = $40 via 3 bit shifts of the sine table. But it is supposed to only rise or fall by 1/2 of sync voltage. With our scaling, that means it should be +/- $18, for a range of $30. So it
will be contributing to the problem, but is probably not enough to be the root cause. I'll fix it, anyway.

Meanwhile, looking at the images on the monitor instead of simulation again, the colour alternation flicker is annoying me a lot. So I am going to try a fix for that: I am suspecting that it is the phase alternation polarity should be switched every field -- although I cannot yet find clear information on this.

I've also introduced some sort of regression into the NTSC image, which I need to deal with. Ah, found that. Also affects PAL when interlace is enabled. I was a bit over-zealeous in marking the active vs non active area, and was marking some of the active area as inactive, causing it to blank.

Resynthesising again... That's fixed the regression.

I think my next step to track down the lack of chroma intensity when R > G, and B = full. I'll attack this by making a simple program in C that does the same YUV conversion, to see if I haven't just messed something up in the VHDL calculation.

Here is my hand-calculated Y, U and V panels for this pattern:

The V panel is showing much the same problem -- so this is good -- it means I have misunderstood something about the process of calculating these values. I'm guessing I'm not handling negative values properly.

U and V have ranges centred around zero, So this kind of sharp transition can appear. By correcting this in the C program, I am able to get much saner looking panels:

Much better, with no nasty lines going through. Also, it makes sense the way that U and V have peaks in particular corners, as U is basically the "missing blue" and V the "missing red". As the pattern I am rendering here has increasing red as we go down the image, it makes sense that the missing blue component will decrease a bit (U, in the middle), while the missing red component (V, on the right) will increase.

So now to figure out what I have wrong in the VHDL when handling U and V. The values I am generating are signed. So I suspect that when they go negative, the multiplication is ending up flipped, and for some reason the sign bit is not being interpreted properly.

Okay, that seems to have helped: Now when I simulate the VHDL, I am seeing a much more even colour intensity. That said, it is also quite faint, as you can see here below how the chroma oscillation pattern (the angled bars) have very low intensity:

In comparison, the colour burst that tells the monitor how bright they should be for maximum colour saturation is much more intense:

So I have one or the other not in the correct amplitude.

The correct amplitude variation is +/- half of the sync level, i.e., +/- 24.

The colour burst part is indeed too intense. The generation of this is a bit fiddly, because of how I do a pile of bit-shifts to avoid multiplications. But the luma_drive signal is generated as a 10 bit value, and the colour burst is added onto that. This means we want an amplitude of +/- 24*4 = +/- 96, since the range is 10 bit instead of 8 bit. I think I have that right now.

As for the U and V signals, those I don't think are scaled the same, so will need to be fixed.

Right, I think I have fixed the scaling of U and V, and the result looks much more promising now:

The intensity of the colour burst and of the chroma in the active area look reasonably well matched now. I'll take a look at the oscilloscope view to be sure:

If anything now, I reckon the colour burst might be slightly lower amplitude than the pixel data. But that should be fine. Let's try making a test bitstream from that, and seeing how it looks:

Okay, that is stacks better. There are still problems, but we do at least now have fairly consistent colour saturation. It's still flickering between different hues, especially in the left most panels, though, and of course the hues are not correct. The four narrower bars are supposed to be grey, red, green and blue, which they clearly aren't. That said, the green and blue ones are kinda the right hue. In fact, I would say that in general, it is red that has the most problems in the whole image.

Red in PAL comes mostly from the V signal, which is supposed to alternate phase every raster line. Thus it seems quite plausible that there is something wrong with that logic.

Ah, found a very likely cause: The colour burst signal is also required to swing by 90 degrees between normal and inverted V signal lines, as described here. I was adjusting the phase of the pixel data colour, rather than of the phase of the colour burst. This was likely causing the problems. Rearranging that now, so that it does it the right way, and hopefully we will see stable colour in PAL -- even if the red channel is still not quite right (there is a 50% chance I have it inverted).

Look like I got it the right way around:

It's not perfect, as red is still under-saturated compare with blue, which I can improve. Also green is a bit under-saturated now, too. But it is a million times better than before. It is recognisably a colour PAL display. The non-linearity in the DACs is probably now as much of the problem as anything else, and might in particular be responsible for the uneven colour saturation in places.

But, let's see how a MEGA65 bitstream looks with it:

Not, bad, I have to say :) I wasn't sure that 80 columns with colour would still be readable. That said, the camera has corrected what is a very real purple tint on the display. My crappy phone camera does a fair job at reproducing the tint, though:

While I'm running things, let's see how C64 mode / 40 columns looks:

I have to say, that this looks pretty darn good, apart from the purple tint. It is absolutely clearer than using the composite output from a C64. In fact, you can even do red text on blue background, that normally disappears into soup on a real C64:

The good camera has again desaturated the red a bit. The text is actually much more coloured. Again, out with the crappy phone camera to see how it really looks:

Well, how it looks in terms of colour, because my phone camera has horrible focus. The other camera correctly captures the sharpness, though.

So that's all good. The only problem now is that the interlaced mode of PAL now has colour instability. This is odd, as it was previously the other way around. Ah, no, I only thought it was the other way around ;) It is indeed consistent. That makes it much easier to investigate and fix.

But this blog post is now more than long enough, so I'm going to wrap it up here.

What I will say, though, is the VHDL part of this is now more or less all there, and with support, should be quite feasible for someone to contribute to, rather than myself continuing on it ahead of other competing priorities. So if you have an interest in composite video output (you don't need to know anything about how it works up-front), and would like to learn some VHDL and contribute to this feature on the MEGA65, drop in on the MEGA65 discord server (link from https://mega65.org).

Comparing the Graphics Capabilities of the MEGA65 and Amiga

This post is just a quick summary of some replies I gave on Facebook recently, when people were asking questions that made it clear that they had the mistaken belief that the MEGA65's graphic system is not powerful. This is a repeating theme I have noticed around the internet from time to time. Perhaps it is because a lot of people do things on the MEGA65 that are not graphically taxing, like playing Infocom text adventures. Anyway, someone posted a question that was along the lines of: "The MEGA65's graphics seem pretty underwhelming", and with the Amiga being given as a point of comparison. So I wrote the following explanation, that I figure will be useful for folks to refer to when trying to understand the relative graphics power of the two systems.

@retrocogs' work-in-progress game for the MEGA65

(Before diving into that, it's worth mentioning that the MEGA65 is undoubtably ahead of the Amiga on the sound front: It has 4x DMA driven audio channels like the Amiga has, but they can drive sample rates up to 2MHz, and each channel is fully stereo pannable, and support not just 8 bit samples, but also 4 bit packed, and 16 bit samples. This is in addition to the MEGA65's 4 SIDs. The MEGA65's floppy controller is also more versatile, able to read any DD or HD formatted disks, and has hardware acceleration not just for MFM encoding, but also for RLL encoding, which can give a capacity boost of another 50% or so. Finally, it doesn't actually matter which is better than the other. Both the MEGA65 and the Amiga series are very enjoyable machines, regardless of their theoretical capabilities.)

In the Amiga, the blitter was normally used on the Amiga to blit images onto one or more bitplanes, to give the effect of having more and/or larger sprites (even though we both know that they aren't sprites, but rather are "blitter objects" or bobs.) The Amiga's use of blitter objects was limited by the number of cycles available to the blitter per frame, and the fact that to move a bob, you had to first undraw it, and then re-draw it. If you have a bunch of potentially overlapping bobs, you may have to undraw many, just to be able to move the one you want, and then redraw them all again. And the greater the bit-depth of the bobs, the greater the cycle cost, because multiple bitplanes had to be undrawn and redrawn. And you had to do it all during VBLANK, or double buffer the display to avoid tearing.
Now a bit of a journey that will help you understand the MEGA65's graphic systems, and it's capabilities:
The C65 inherited the Amiga idea of bitplanes, but they are greatly hobbled, mostly because you can't scroll them in hardware like on the Amiga, and the very limited "chip RAM" of the C65 meant that it was a recipe for disappointment. However, it did get 8 bitplanes, and thus a palette size of 256 colours, with 4 bits per channel giving 4,096 colours -- just like the ECS Amiga, but with the bitplane count of the (at the time) not-yet-existing AGA Amigas.
So while the MEGA65 implements C65 bitplanes for backwards compatibility, they have not been further improved. Instead, a much more C64-inspired approach was taken, along with some general underlying improvments.
First, the VIC-IV in the MEGA65 has access to the chip RAM at 81MHz, i.e., it can suck 81 MB of data per second. For PAL, divide that by 50 and you get 1.62MB of data that can be fetch per frame. The colour palette registers are on a separate bus, and so don't subtract from that, as is the colour RAM for describing the colour and attributes of each character or 8x8 cell of bitmap data. So we really have 1.62MB per frame. On NTSC with 60Hz its a little lower, at 1.35MB. But either way, that's still a lot of bytes per frame. The MEGA65 supports "overscan" modes, i.e., 720x576 for PAL and 720x480 for NTSC. So if we multiply those out the PAL and NTSC frames contain 415 K and 346 K pixels per frame. This means the VIC-IV has about 3.9 bytes per pixel of bus transfer available to it, in both PAL and NTSC (because the extra frames of NTSC are essentially exactly offset by the reduced vertical resolution).
Compare this to the C64 that had 0.125 bytes per pixel (8MHz pixel clock and 1MHz CPU). This is why the bitmap mode of the C64 was monochrome if you wanted full horizontal resolution (1 bit per pixel = 1 byte per 1MHz CPU cycle), and had to halve the horizontal resolution to get 2 bits per pixel for multi-colour mode. The Amiga 500 with it's 3.5MHz 16-bit bus and half the vertical resolution of the MEGA65 had typically ~3.5MB/sec for video memory accesses, thus giving it about 0.875 bytes per pixel of total bandwidth (there are some simplifications I am making here, but the general result holds). This is why the Amiga was capable of 64 colour modes, as on average, those require 6 bits per pixel, which was within its bandwidth limit of about 7 bits per pixel. It's also why higher resolutions on the Amiga had lower maximum numbers of active bitplanes.
The MEGA65 with its 81MB/sec dedicated video bandwidth has ~31x more bandwidth per pixel than the C64 had, and better than 11x than the Amiga 500.

The Amiga 1200 essentially quadrupled the video bandwidth compared with the Amiga 500, giving a video memory bandwidth of about 14MB/sec, or about 3.5 bytes per pixel. That's still below the 3.9 bytes per pixel of the MEGA65, but you get the idea that the video system of the MEGA65 has the potential to be at least as powerful as an AGA Amiga. But because of the design decisions we made for the VIC-IV, the result is actually more powerful for several reasons:
1. Bitplanes eat a lot of RAM if you have repetitious content or lots of empty bits on the screen, e.g., for a typical platformer for top or side view shooter. Also drawing such data is relatively time-consuming with bitplanes, because you have to write all those pixels whenever you need to redraw the display (eg when you hit the limit of your hardware scrolling region in the bitplane). Arcade systems and many game consoles used a different approach: Layered tiled displays, effectively multiple layers of what we would call text mode on the C64. The MEGA65 also uses this approach.
2. The VIC-IV adds two new text/bitmap display types in addition to the 1-bit per pixel mono/hires and the 2-bits per pixel multi-colour/lores types. The new ones are: Nybl Colour Mode (NCM), which uses 4 bits per pixel, thus allowing a character to have each pixel selected from a 16 entry long slice of the 256-colour palette (which has also been upgraded to 8-bit colour depth for millions of colours instead of the Amiga 500's 4,096 colours). Full Colour Mode (FCM), which uses 8 bits per pixel, thus allowing a character to have each pixel freely selected from the 256-colour palette. NCM and FCM both use 64 bytes to describe a character, instead of 8 bytes. As NCM uses only 4 bits per pixel, the result is that NCM chars are 16x8. FCM chars stay 8x8. NCM and FCM can be used to very easily and efficiently (both in memory and computation) implement map-based displays, as for shooters and platformers. FCM and NCM effectively require a little over 1 byte per pixel, thus leaving the bulk of the memory bandwidth free.
3. Because there is still plenty of bandwidth available, even with FCM graphics, the VIC-IV includes another improvement: A double-buffered raster buffer of pixel data that is rendered in one raster, and then displayed the next. Special character codes can be setup that move the current position within the raster buffer, effectively allowing the "over stamping" of characters (which can be mono, multi-colour, NCM or FCM). This removes the need for bitplanes to achieve "play fields". It also means that each playfield is effectively as powerful as all 8 bitplanes on an AGA Amiga. This buffer is called the Raster Rewrite Buffer (RRB), that you had heard about Simple in design, but extremely powerful in effect.
4. If you want to have even more layers with the RRB, you can switch to half vertical resolution (i.e., the same vertical resolution as a C64 or non-interlaced Amiga), to double the "raster time" for the RRB. This can be used to create many layers of graphics, well beyond what an AGA Amiga could manage. Remember that none of this takes a single cycle from the CPU -- The CPU retains its full 40.5MB/sec bandwidth, regardless of what the VIC-IV is up to, because the chip RAM is dual-ported.
5. You probably notices that FCM is effectively a "chunky" mode. If you arrange the screen with columns of ascending character numbers you get a layout that is effectively a chunky bitmap mode, and the pixel address computations are very simple. This gives the MEGA65 a further significant advantage over the Amigas. It was very easy to write a full 320x200 256 colour wolfenstein 3D engine with >10fps, without optimising it very much at all. This is partly due to another trick the MEGA65 has up its sleeve:
6. The DMA controller in the MEGA65 can perform many of the functions of the blitter in the Amiga, plus some extras. One of those is that you can tell it your FCM chunky screen layout geometry, and it can then fill, copy and/or scale arbitrary angled lines in hardware. This can be used to draw lines at 81Mpixels / second -- enough to fill the display with lines every single frame, plus some. It can also be used to copy textures at 40.5Mpixels / second, including scaling them. This means that for the wolfenstein 3D engine that I made for the MEGA65, the actual drawing of the display is the fast part, requiring maybe 1/4 of the available raster time each frame. The slow part is the logic to work out which surface to show, because I just wrote that in C using the robust but not particularly optimising CC65 compiler. Even with that, it obtains double-digit frames per second. With optimisation, full frame rate at 50 or 60Hz is very likely possible.
7. The sprites in the VIC-IV have also been substantially updated, now featuring upto 64 pixel horizontal resolution and the VIC-II's mono and multi-colour modes, plus NCM, giving 15-colour sprites upto 16 pixels wide. The sprites can also be upto 255 pixels tall. They also have an alpha transparency mode, and a horizontal tiling mode, and be switched between hires and lores (i.e., expanded sprites) as on the C64, but also into super-hires, where the sprites are half their normal dimensions, but with double the resolution (ie the opposite of "expanded sprites" on the C64).
8. The VIC-IV has four hardware palette banks of 256 colours each. NCM and FCM characters can select from two of those palettes on a character by character basis. Which 2 of the four palettes are in use at any point in time can also be changed via a palette selection register, thus making it very easy to have 1K colours on screen at any point in time, without having to resort to copying palette entries around the place. But with the spare palette banks and 40.5MB/sec DMA copying, it is also possible to reload the palette registers multiple times per frame, to yield many more unique colours on the screen at once. This is more flexible than HAM, and doesn't produce any visual artifacts unlike HAM.
There are some other tricks up the MEGA65's sleeves, but hopefully that gives you a good introduction as to many of the improvments, and why the MEGA65 is at least a match for AGA Amigas in terms of graphics, and in other areas well ahead of it.

Sunday 1 January 2023

Working on Composite Video Output for the MEGA65 Expansion Board

The MEGA65 case has cutouts for a few ports that we didn't include on the original motherboard, to keep costs and complexity under control. Avid readers of this blog will know that I have designed a prototype expansion board for the MEGA65 that is intended to provide those missing ports. In this blog post, I am documenting my efforts on getting the composite/s-video output working.

To keep life simple to begin with, I am implementing a pure luma signal first. This is the monochrome grey-scale output, which contains the HSYNC and VSYNC information, as well as the brightness of the pixel data. The result should end up looking something like this:

Keep reading if you would like to find out how I got to that point...

The expansion board uses a very simple and cheap 4-bit resistor DAC to generate this signal, thus allowing up to 16 levels of brightness, unless I succeed in my cunning plans to allow over-sampling in one way or another. The PAL/NTSC video standards use 13.5MHz pixel clocks, and we can easily drive FPGA pins at more like 270MHz, thus providing a 20x over-sampling rate, which could be used to provide more bits of resolution to the signal. I might need to put some analog filtering components on the output to achieve this. But anyway, such optimisation is for later. First I need to get a simple working video image. So let's talk about how the luma signal for a PAL or NTSC system looks.

You can read more about this here, but what follows is how I am making use of it.

On the one hand, it is very similar to VGA, in that you have VSYNC and HSYNC signals, and then analog brightness information. Unlike VGA, this is all done on a single wire, instead of using three separate ones. The HSYNC and VSYNC are XORed together to produce a single combined sync signal. When this is active, the composite output is pulled hard to 0V.

At all other times, the voltage will be somewhere between about 0.3V and 1V, to cover the full range of brightnesses. As we have a 4-bit DAC, this gives means that of our 16 possible values, 5 of them will be used up to allow us the full dynamic range. This means we have only 11 real brightness values available to us. For C64 images this is already more than enough, as the VIC-II colours were all selected from only 5 brightnesses.

The other tricky thing that we should handle, is interlacing. The C64 didn't do this, because it didn't produce the difference between odd and even fields, which made interlaced graphics look even worse than they needed to. We will aim to fix this on the MEGA65, so that you can have real 480i or 576i 15KHz video output.

But before we get that far, let's implement a simple combined sync composite luma output in VHDL, and subject it to some automated tests, so that we can compare the resulting waveform timing to what we know is required -- especially for signalling interlaced mode, which requires having some half raster lines in the right places.

Step 1: Create luma, chroma and composite signals in pixel_driver.vhdl. I'll allow 8-bits of resolution, so that if we come up with a way to increase the bit depth, e.g., via that over-sampling idea, we can use it.

The luma signal is the important one for now, as it carries the SYNC signals, and the brightness of the video signal. While I have added plumbing for the other two signals, they will get completed later, largely because implementing the colour burst frequency is going to be quite tricky.

One of the things that we have to do to generate the luma signal, is to actually buffer the luma signal for a complete raster line. This is because the VIC-IV is generating rasters with a 27MHz pixel clock, but PAL and NTSC require 13.5MHz pixel clocks. This all relates back to interlace: On odd frames we should buffer and output the odd raster lines, and on even frames, it should be the even raster lines. Likewise, we have to also mess with the HSYNC pulses to have only half as many of them, so that they are at the correct rate, and then correct their durations, as well.

Well, a week has gone past due to a busy time at work, and I'm now getting myself back onto this. I have a couple of immediate approaches I can look at: On the one hand, I can do the 15KHz buffer right now, or on the other hand, I can get a working luma signal that drives an image on a screen, even if the image it drives will be garbled, because it will be using the double scan pixel data to feed a single-scan image.

I also need to think about how I will do the simulation, as the VCD file for gtkwave for two fields = 1 complete frame of data is about 500MB, which is a lot of unnecessary wear on my poor old SSD drive. I could redirect that to ram drive in /tmp to reduce that. Actually, ramfs is probably a better choice here, because tmpfs can get written out to disk if there is memory pressure, while ramfs is strictly in RAM. My laptop has 32GB of RAM, so I can probably spare 1GB or so without grief for this purpose. Looking closer, it looks like I'll actually need 2GB, but that's ok. So I will simply mount a ramfs over my vunit_out directory, and it should all Just Work (tm). Something like this:

mount -t ramfs -o size=2g ramfs vunit_out

Interestingly, with ramfs, it shows up with the mount command, but not if you do a df command. Most curious. Not that it matters.

Oh, yes, and don't forget to chown the newly mounted file system, so that you can actually write to it ;)

So I am now running the simulation of a whole frame again. It took about 2 minutes on a normal file system, and I am not expecting it to be much slower on ramfs. The simulation is probably more computationally intensive than the IO workload. It would be faster if I could find where a zillion "vector truncated" warnings from VHDL are coming from, as those cause about 500MB of error messages in log files, which is why I need 2GB instead of 1GB for the ramfs, and surely is slowing down the simulation. Unfortunately GHDL is notoriously difficult to compile in a way that will let you find out where those things have occurred. It's really my single biggest frustration with what is otherwise a really good open-source VHDL simulator.

Anyway, it's finished now, and only took 67 seconds, so it looks like it _is_ a bit faster in ramfs, which is nice. Ah, false alarm: I didn't ask for the VCD file for gtkwave to be generated. With that enabled, we are back to 2.5 minutes. Now let's see if suppressing those warnings helps bring it back down a bit. I can suppress them all by using:

vu.set_sim_option("ghdl.sim_flags", ["--ieee-asserts=disable"])

This is a bit more of a big hammer than I would have liked to have used, but it has made the log files much smaller. Simulation still takes 2.5 minutes or so, though. Oh well.

Anyway, now that I have the simulation stuff safely runnable without wearing my SSD out, I can again look at what I am generating now, and see if it looks like a sensible composite signal.

GTKWave displays unsigned values numerically, rather than showing a wiggly line, which makes visual checking for sync pulses etc a bit of a pain. So I need to find a solution to that.

Ah, that's right, the solution to that is to use pulseview. In theory pulseview is a great tool for this. In practice, neither the downloads nor the apt package for ubuntu 22.04 actually work for pulseview. This seems to have been the case for at least two years. You just get one of a wide variety of missing symbol errors. This is of course extremely irritating. I have tried to build it from source before, but probably before I upgraded to Ubuntu 22.04, so I am trying to compile it again, to see if that solves the problem.

I tried a bunch of things, and did eventually manage to get a nightly-build AppImage of pulseview to run, and not segfault, but it all felt very fragile. And then it complained that the VCD file had more than 64 channels in it, anyway. And it was treating the luma signal as a bunch of individual bits, rather than an analog signal.

So I think I'll just make my own VCD parser, and produce the oscilloscope view somehow myself. Maybe generating a PDF file with it in there. That way I can even make it line up based on the expected raster line time etc. After a couple of hours of mashing about, I now have something that can automatically generate PDFs with plots like this:

The colour coding shows yellow for the part of the waveform voltage that is reserved for sync pulses: We should only ever see sync pulses in that range. Then we have the green area, which is the part where the luma intensity should be, and the orangy colour at the top is the margin, to allow for overdriving of signals and colour information amplitude.

First up I can see that we have the HSYNC rate 2x what it should be, as previously mentioned, because the MEGA65 natively outputs ED, i.e., progressive rather than interlace modes. Second, I can see that some of the luma signal is sitting too low, going down into the sync part of the band. Third, the graphics signal seems to be occupying way less of the rasters than it should. It should occupy 720 / 864 = 83% of each raster line, not the about 1/3 that it is here. How nice it is to be able to quickly and easily see these problems! Now to try to fix them, probably in reverse order. Simulation plus rendering the PDF file takes less than 3 minutes in total, so the workflow will be pretty good.

First re-run with the green and blue components removed, seems to eliminate the negative excursions. So I will try again with each individual component, to see where the source is.

Meanwhile, I think I have also found where the third problem, i.e., the too-short period of graphics on each raster, is likely to originate. It looks like it might be the logic related to the "data valid" signals that HDMI and LCD panel outputs need. That said, it still doesn't really make sense, as that logic is all fairly well tested to generate the proper widths of display data.

Back to the colour channels: It looks like the green channel is responsible for it. Perhaps the multiplier logic I have is wonky, and over-flowing. Yes, and in fact they are all out, by the look of things. With that fixed, I now get nice waveforms that aren't doing weird wrapping things, which is good:

So now back to finding that problem with why the data period is much shorter than it should be. The HSYNC pulse (the trough in the yellow above) is a bit under 2.4usec in duration. Then there is about 2usec following that for the "back porch" before the graphics data starts. Based on that, we are seeing perhaps 10 usec of graphics data, when it should be about 27usec (remembering we are still currently using 576p rather than 576i timing until I fix that problem).

To investigate this, I am going to switch from the test pattern display to a fixed white display, which should show a single slab of peak voltage across the whole display period of each raster. This will then show me the display time envelope that it thinks it's using, which will let me find out if it is some problem with the test pattern rather than the video timing.

And it does seem to be the display time envelope that is the problem:

After poking around further, I think the issue is that the dataenable signal is being generated based on the 81MHz clock instead of 27MHz pixel clock. That means that the data period will be only 27/81 = 1/3 as long as it should be. The mystery is why HDMI and other video output have been working, despite this. Or perhaps I accidentally moved some stuff from the 27MHz to 81MHz clock domains during the early stages of implementing the composite output. But that doesn't seem to be the case. Most weird. Ah, no, I have the divide by 3 logic in there. So it's not that, either.

Further poking around has revealed that the problem is the 'plotting' signal, not the data enable signal. Quite what the plotting signal is even still doing, I have no idea. It is linked to a defunct read address calculation from some buffer that I presumably had long long ago. I think I'll just strip all of that out.

And look at that: We now have pixel data across the whole display width of the raster lines.

Now I just need to work on the 31KHz to 15KHz scan rate reduction. Somewhat ironically this may require the re-introduction of logic similar to what I have just removed with the plotting signal. Except that it needs to work!

At its simplest, we just need something that buffers a raster line, and then plays it back at half speed. Various pieces of the timing can just be re-used from the frame parameter, as it is just that the pixel rate is effectively dropped from 27MHz to 13.5MHz. This will cause the HSYNC etc to naturally double in width as we need.

Otherwise, we will want a signal that tells us when the left edge of the composite video raster display area has been encountered, so that the buffered raster line can be played back into it.

Hmm... it won't be quite that simple after all, though. This is in part because we need to support PAL and NTSC colour signals, that are going to require high frequency of updates, which means that halving the pixel clock will be counter-productive. This means that we need to buffer the raster line inside of the PAL and NTSC frame generators, rather than having a joint buffer that takes the output of whichever has been selected.

We already have the logic to generate the 15KHz HSYNC in the frame generator, that I did above. So that provides the means of synchronisation to play back a recorded raster. We now mostly need to do the recording.

It would be nice to be able to get away with a single raster buffer, and just time things carefully, so that it doesn't overwrite the part that is being read out and displayed. This should be possible with a bit of care. The major constraint is that we need the raw RGB data buffered for each pixel position, which means 720x (actually 800x, since the MEGAphone display is 800px wide) RGB pixels, with 8 bits per colour channel, thus requiring 24 bits per pixel. That's slightly more than 2KB, even with just 720x, so we will need a 4KB BRAM arranged as 1024 x 32 bits to buffer a single raster. In practice, we will need two such buffers, so that we can be recording a raster, while playing the previous raster back, so it will be 2048 x 32 bits.

But first, we have to get the composite VSYNC signal fixed: The duration needs to be twice as long for 15KHz video as 31KHz video, so that it is still 5 raster lines long. To achieve this, I have implemented a counter that counts up while VSYNC is active, and then counts down again to zero when VSYNC is released. If this counter is non-zero, then the 15KHz VSYNC should be active. This works nicely to double the duration of VSYNC:

We can see that the composite video VSYNC (cv_vsync) signal holds for twice as long as the original VSYNC signal (vsync_uninverted_int). We can also see that the HSYNC pulses for the 15KHz video are happening only half the rate of those for the 31KHz video to the LCD/VGA display.

What is still not quite right, is that the VSYNC is happening half-way through a 15KHz raster line. Actually, that's not strictly a problem, as this is how interlace on composite video works, by having a half raster line. It's a really simple approach when you realise how it works: Whenever a CRT display thinks it is drawing rasters, it is very slowly advancing the vertical position down the screen. This is timed so that after 1 raster has been drawn, the display is now drawing 1 raster lower on the display. Interlaced rasters should appear half-way between normal rasters, so at the top of the display, you draw half a raster, to setup the offset. This also means that raster lines aren't strictly speaking horizontal lines, but are actually sloped slightly down towards the right. It would be interesting to try to observe and measure that effect on a larger CRT, where it should be visible.

Anyway, I have made it selectable between odd and even frames now, and that fixes that problem, but I am seeing a 1-cycle glitch. This is because my VSYNC calculation is 1 cycle delayed compared with the HSYNC. So that should be fairly easy to fix. Except it turns out to not be that simple: The propagation of the HSYNC and VSYNC signals between 31KHz and 15KHz parts of the design are uneven. This means that VSYNC signal is rising 7 cycles before the HSYNC signal on the 15KHz side, as can be seen here:

i.e., we seee that cv_hsync drops 7 cycles after cv_vsync asserts. Since cv_sync is the XOR of those two, it results in this short glitch visible on the cv_sync signal, which then propagates to the cv_luma signal.

I guess I should just add in a 7 cycle delay to the cv_vsync signal. Because of how I set it up with the counter, I can just make it start when the counter gets to 7. That will make it start later. I will have to also check that it doesn't cause problems by making it finish earlier. If it does, I'll just come up with some little fiddle to deal with that, e.g., by creating an extra counter that caps at 7, and is added to the main VSYNC counter as it counts down, or something like that.

Actually the delay is 6, not 7 cycles, because I had put a one cycle delay on HSYNC when I thought that was the case. It being 6 cycles makes much more sense, as that is one pixel duration at 15KHz, and will be being caused by the logic that works out when to sample a 15KHz pixel value. This makes me happier, knowing how the difference in propagation time is being generated. Mysteries in VHDL code are usually a bad thing, i.e., will come back to bite you some time later, when it will be much harder to figure out the root cause.

Okay, a bit more fiddling, and I now have the commencement of VSYNC in 15KHz video glitch-free:

As expected, I am now seeing a glitch on the other end, because the VSYNC pulse is now slightly shorter:

Now, after adding the secondary counter to extend VSYNC by the same number of cycles as it was delayed by, we get the result we are after:

So now we have glitch-free 15KHz HSYNC and VSYNC signals. In theory, if I built a bitstream with this, the monitor should attempt to display an image. However, as we haven't yet converted the 31KHz pixel data to 15KHz, we would likely just get rubbish

Except, I just realised that during VBLANK, the sync pulses for 15KHz video are actually different: According to this, we need to put differing combinations of short and long sync pulses in each VBLANK raster for 15KHz. The pulses are also half-width, i.e., the same as for 31KHz. I think I'll add logic that works out which type of VBLANK raster it is: short, short, short long, long long or long short, and them make the means to generate each type, with the short and long pulse widths being automatically determined from the video stream. NTSC has similar requirements described here, which I'll deal with at some point. For now, I am focussing on getting PAL working, because I'm in Australia, a PAL-oriented country. NTSC will also require separate treatement when it comes to adding colour support.

Effectively we just need to know the sync sequence for the n-th raster of the VBLANK for both odd and even fields of a frame. I've got it partly working to do that now, where during the 31KHz VSYNC phase, it is putting the right things in:

But then it stops. This is because I should of course be using the 15KHz stretched version of the VSYNC signal... With that fixed, we now have a much healthier looking VBLANK sequence:

We can nicely see the 5 long sync followed by 5 short sync pulses, as required for a PAL odd field. What happens at the end of the even field, I haven't yet tested, because the the VCD file will be really big, possibly too big for my 2GB RAM drive where the logs are being written. But let's give it a try...

That looks pretty good, actually. The short/long sync pattern is shifted by half a raster-line as it should. The last SYNC pulse is a bit short, at 1.5usec instead of 2usec, though. The odd field has a similar problem. I should be able to fix both by just shortening each stage of the pulses, so that there is just enough time left at the end for a full 2usec pulse.

That fixed that, but now I am getting some glitches when VSYNC goes high. I think this will requires a bit of pragmatic bodging, until I squash them all. Which turned out to not be too hard. So finally, let's get back to the buffering of the pixel data, so that we can have a complete image!

First, I need to setup the two 1KB x 32 bit BRAMs. I could use a single 2KB one, but I don't think I have a VHDL model for that handy, so it will just be faster to setup two, and use the CS lines to select which one is being read/written at any point in time.

I have now hooked these buffers up, and got the scheduling right, so that they get written and read at the right time, so that the writing doesn't overwrite what is being read. Only just as it turns out, as the writing begins just after half way through the reading of a given buffer, and chases along the beam, only just failing to catch up in time. But that's ok. Here we can see the waveform of the write enable (_we) and read chip select (_cs) signals for the two buffers, and how this pattern repeats:

Hang on a minute. Something is not right here, because the 15KHz rasters should read for twice as long as the 31KHz rasters are written to, which means that the _we signals should only last half as long as the _cs signals, but they are the same length. I think this might actually be a red herring, as the write address gets to the end early, and stays constant after that. But I can clean that up, so that it is only active when required:

That's better. And in fact we can see that this buffering method can actually be reduced to a single buffer, because of this convenient aspect of the timing. That will actually greatly simplify things, as well as save a BRAM. It will also remove 1 raster line of delay in the composite output, which will also be good to fix.

But before I go optimising things, I want to make sure that it is actually operating as intended. And good thing I did, because it still looks like the 31KHz raster data is being written out.

Well, actually its half false-alarm, as I can see what the problem is: When the test pattern is selected, that isn't fed through the downscaler buffers. I'll fix that now. In the process, I also found and fixed some real problems where the composite image was being clipped by the data valid region of the 31KHz rasters, which would result in a vertical black bar in the composite output, if I hadn't fixed it. I also found a bug with the BRAM wrapper I was using not tri-stating the read bus when CS was low.

So with that all in order, now I can simplify things to use the single BRAM, and reduce the latency of the down conversion from 3 V400 rasters to just 1. That is, our composite conversion has a latency of only 32 usec, which I think is pretty acceptable :)

Removing the BRAM seems to have gone without problem.

So this finally menas that we are ready to try to build a bitstream that produces a composite luma on the expansion board. Initially it will just be a test pattern bitstream, not the actual MEGA65. This is because the test pattern bitstreams are much faster to biuld -- about 10 minutes instead of an hour -- which makes development more convenient.

Ok, so first bitstream generated, and I can see at least some sort of signal that seems to have HSYNC and VSYNC structure to it. Sorry about the crappy images from my phone camera. I still have to do something about that, and the lighting here at night is not great, either:

In the top one we can see something like a VSYNC structure, while the bottom shows several HSYNC pulses.

However I am not seeing any of the test pattern luma signal overlaid on this, which is odd. The test pattern is overlaid in pixel_driver.vhdl, and in this test bitstream it hard wired on, which I can confirm because it comes through on the HDMI monitor I have connected to the MEGA65. But its late now, so I will have to investigate this problem tomorrow.

It's another day, and hopefully I'll figure out the cause of the lack of pixel data in the luma signal. It looks completely flat. First step, I'll modify the output so that the read address of the raster buffer is output on the green channel. This should super-impose a sawtooth pattern on the luma signal. If it shows up, then I'll assume that the problem is in writing to the raster buffer. If it doesn't show up, then I'll assume that the problem is further down the pipeline. Either way, I'll have narrowed down the hiding places for the bug.

Okay, so the sawtooth is visible:

It's a bit non-linear and steppy, but I'll investigate that later. First step is to figure out what is going wrong with the pixel data.

Next I'll check whether writing to the raster buffer seems to be working. I'll keep the existing sawtooth on the green channel as above, and put the write address of the raster buffer on the blue channel, which will have lower amplitude than the green channel, and will also have a different slope and phase than the other, so I should be able to see them both superimposed without problem. Again, if I do see it, it will tell me that the raster buffer writing and reading is working fine, and thus that the problem must lie in grabbing the test pattern pixels, and if it doesn't show up, then the problem is in the raster buffer writing. That is, I'm again splitting the remaining bug hiding territory.

And no pixel data shows. So it seems most likely then, that the writing to the raster buffer is the problem. A bit further digging, and it seems that raster15khz_waddr is not being updated, but just stays at zero.

Now, the signals that can be influencing this are:

1. new_raster -- when it is 1, then waddr is clamped to zero

2. raster15khz_waddr_inc -- when it is 1, then waddr will be incremented

3. narrow_dataenable_internal -- when it is 1, then raster15khz_waddr_inc can be asserted.

4. hsync_uninverted_int -- when it is 1, waddr is clamped to zero.

narrow_dataenable_internal is probably fine, because otherwise the HDMI image would not be visible. So let's look at the other 3. I'll again just try superimposing them onto the luma signal.

hysnc_uninverted_int is confirmed fine through my testing. So that just leaves new_raster and raster15khz_waddr_inc as prime suspects. Either or both might be working, but not visible in my test, because they are signals that are 1 cycle in duration, and thus might not get picked up by the once per 6 cycles sampling for the luma. To help here, I will add signals that toggle whenever those others are modified, so that I can still see if they are changing. First I'll check with new_raster. new_raster is also ok.

So that leaves raster15khz_waddr_inc as the prime suspect. This is gated by buffering_31khz, which is working. Thus things are looking very much like raster15khz_waddr_inc isn't being asserted... Except that it seems to be.

Hmm... and even raster15khz_waddr seems to be working now. Weird, as I am fairly sure before that it wasn't. Anyway, I've moved on and checked that sensible looking data is being fed to the _wdata signal as well, which it is.

So if waddr and wdata are good, is the write-enable signal being asserted? Also the select line for the read side? Write-enable does indeed get set at the right time. the select line for reading looks fine, too, although it can in fact be hard-wired on, in case it is somehow the root cause... and that worked.

So the luma signal is now showing rasters of sensible looking pixels. I can also see the HSYNC and VSYNC pulses. However, between the sync pulses during the VBLANK period, the signal is not returning to the 30% point, but rather is reaching 100%. That was fairly easy to fix, by enforcing black level during VSYNC.

There is one more anomaly that would be good to fix, though: Whatever was in the last raster line of a field (ie single vertical retrace, which might be a whole frame if not using interlace, or is half a frame if using interlace), is repeated during the entire VBLANK period. This should be corrected for a variety of reasons. One, its just causing misleading information in the frame. Two, some TVs calibrate their black level based on what is happening during this period. Three, teletext and closed caption text is normally transmitted on the VBLANK lines. We don't want to accidentally trigger this. I'd also like to potentially implement at least closed captioning at some point, just for fun. But not yet.

So let's just fix the VBLANK rasters so that we prevent the retransmission of the last raster. A simple approach would be to have a signal that notes if a valid data pixel has been seen in the recent past. Another would be to count raster lines, and mask based on that. Time since last pixel is probably the simplest and most robust for use with PAL and NTSC.

The time since last pixel method works fairly well, but does result in some glitching in the raster immediately following the start of VBLANK, as the first part of the raster will still be drawn. Similarly, the first raster line after VBLANK will suddenly start part way through the raster. Both of these issues are quite livable.

Anyway, with all of that, we get a nice VBLANK arrangement looking something like this:

Here we can see, from left to right, a few jaggedly lines of pixel data, then a single raster with a white line, near the little arrow at the top. As white is bright, it has the highest voltage, thus that raster just has a signal at the top. The little spikes below are the SYNC pulses. Then to the right of that, we see some black rasters at the start of VBLANK, followed by the VSYNC pulses that are mostly 0V, but with short spikes up to black level. Then after those we have black raster lines until the start of the next frame, which begins with the white raster line on the right edge of the capture.

(As you can also see, I also managed to dig out a better camera, too :)

In short, I think we have a perfectly valid composite monochrome signal. Now to find a monitor to feed it into... and this is what I saw when I selected the composite input:

Yay! All that methodical drudgery to make sure the signal was correct paid off -- I got a working and stable composite image first go :)

Then I found the "fill to aspect ratio control":

Much nicer :)

Looking at it, I can see a few things:

1. My effort to increase the bit depth of my 4-bit resistor DACs by oversampling 10x seems to have worked, because I can make out more than 16 levels of grey -- and this image doesn't even go to full intensity with white. So that's a nice win.

2. Looking closer, I think I have the interlace the wrong way around. You can see a jagged edge on the sloped bars, and a feint detached line just above the short horizontal transition on the left:

That will be easy to fix, and also means that I have correctly implemented interlace. Another nice win :)

For reference, this is what is on the HDMI display:

Back to the composite output, the lack of linearity in the shading reflects what I was seeing on the oscilloscope, that something is messed up with the DAC. I'll look at that shortly, but first, to fix the interlace bug.

Ok, the interlace bug fix is synthesising, so in the meantime I am having a think about the non-linearity in the DAC. Actually, I just noticed something else: When the image starts from the very left, it is causing those whole raster lines to darken. This means that I have the video starting a little too early -- or similarly, that I am trying to display too many pixels on each raster. Either way, the monitor thinks that part of the raster is during what should be HBLANK.

Fixing this will be a bit trickier, because we are sending 720 horizontal pixels, to match the MEGA65's horizontal resolution. But at the 13.5MHz pixel clock for the composite video mode, this is probably taking more of the raster line than it should. Trimming pixels on the left will reduce the side-border width, however, which is not ideal, especially since many TVs will likely cut some of the raster off, anyway. The ideal solution would be to use a 720 / 640 = 13.5Mhz x 9/8 = 15.1875 MHz pixel clock -- but we can't really add any more crazy clock frequencies in, or we will have more problems. Or in the least, a lot of work, because I would have to implement some kind of horizontal sub-sampler, which could also be done at 13.5MHz to go through the raster faster.

The simplest method remains just trimming some pixels off the left (and possibly right) side of the image. I'll start with that, and we will worry about improving it from there. It might only need a few pixels trimmed to fix the problem.

16 pixels was enough to fix it, and in fact 8 is enough, too. I'm now about to try 4. Here is how it looks with only 8 trimmed. You can see the reduction in brightness part way down the pattern has disappeared:

It's entirely possible it will turn out that only 1 or 2 pixels is required, if I have erroneously slightly advanced the start of the active part of the raster. That would be a nice outcome. Once we have it working on the MEGA65 core, I will be able to check how much is being trimmed from each side, and whether any further adjustments are required.

Ok, so 4 pixels is enough. Trying 2. If that's fine, I'll leave it at that for now, and start trying to synthesise a MEGA65 core with mono composite output. It will be interesting to see how PAL and NTSC switching goes with that -- in theory it should just work, but I have only tested the PAL image with my test bitstream so far.

Trimming just 2 pixels is indeed enough, so onto the MEGA65 core synthesis.

First go at building a MEGA65 core with the monochrome composite output in resulted in a dud: I had accidentally tied luma and chroma together, and as chroma currently is just tied to ground, there wasn't any signal. So its building again now, with that fixed. But once again, the day has run out, so we will have to wait until the morning to see if I have been successful.

Hmmm..., well, I have a picture now, but it has some significant problems: Essentially the VSYNC seems to not be working, and the HSYNC seems to be out by half a raster line or so.

Just for fun, I tried switching to NTSC, and that's less bad, in that it has working VSYNC, but HSYNC still seems to be out by half a line, and there is some serious problems with the capture of the HSYNC in the upper half of the display, causing the most interesting (but unhelpful) effect:

What can't be seen so easily in these shots because of the relatively short exposure time, is that for NTSC, it is trying to switch between two different horizontal positions. I suspect that the half-rasters for the interlace control might be messing things up.

The big mystery, though, is how this is all happening, when the test pattern was rock solid. Enabling test pattern mode in the MEGA65 via $D066 bit 7 shows these same problems, so there must be something different about the video signal being produced by the MEGA65 core.

The most likely suspect I can think of, is that the genlock related logic is causing the frames to be a bit long or a bit short, and that is then confusing things... except it seems that I haven't actually plumbed that through, so it can't be the problem.

Hmm, using the oscilloscope, I can still see the HSYNC and VSYNC pulses, so I don't think that that is the problem. The amplitude of my composite signal might be too large, but then its no different to with the test bitstream. In fact, that's the whole point: When the test pattern is enabled, there is obstensibly no difference between the generation of the signals. Yet clearly something is different. Perhaps there is more jitter or more high frequency interference on the video signal from the FPGA, since it is doing a lot more stuff when the full core is running, compared with my test bitstreams.

Ah, interesting, I just tried one of the test bitstreams, and it was doing the same thing. Turn monitor off, turn monitor on, and its back to being rock-solid.

And back to the MEGA65 bitstream, and I now have something resembling a stable display. It still has several problems, though, as you can see here:

There are three problems I need to deal with:

1. In the upper part of the display, there is an oscillation in horizontal position, that clears up by about 1/2 way down the display. This also happens with the test bitstreams, so I can at least investigate that a bit faster.

2. There is vertical banding coming from somewhere. It doesn't appear when test pattern mode is enabled with the MEGA65 core, so that's a bit of a mystery.

3. As I discussed earlier, my suspicion is correct that the monitor is only displaying the first ~640 or so pixels of the 720 wide: I need to speed up the playback of the 720 pixels to make them all fit in the correct active part. If I do that, and add a "front porch", it is possible that problem 1 will disappear.

So that's probably where I should start.

The active part should apparently be 52usec long. We have a 4usec HSYNC and 8usec back porch = 12 usec, leaving 64 - 12 = 52 usec. So in theory we are allowed to use the whole remaining raster, but then there is no front porch. Elsewhere I read that the front porch should be 1.65usec, which makes more sense. On my monitor the last four or five 80-column chars are not visible, corresponding to 32 -- 40 pixels. The side border is a similar width, making for close to 80 pixels in total -- i.e., the difference between 720 and 640, thus confirming my suspicions here.

To fix that, we need to reduce the active part of the raster from 52 usec to 52 * 640 / 720 = 46.2 usec. That would leave a ~5.8 usec front porch, which should be plenty.

Now the real question is how to do this, without messing up the horizontal resolution. As I mentioned earlier, ideally I would just use a faster clock to play back the buffered data. But I can't just add any more arbitrary clocks.

Probably the easiest way will be to vary the number of 27MHz cycles between pixels. 720/640 = 9/8. So I just need to increase the playback rate by that fraction. Currently I advance a pixel every 6 clock ticks at 81 MHz to get the exact 13.5MHz pixel clock. Speeding that up by 9/8 means it should be 5 1/3 cycles instead. So I can switch to 5, and then every 3rd pixel stretch it to 6, and it will work out. Hopefully the pixel widths varying by 20% won't be visually noticeable. But there is only one way to find out!

We now have all 720 horizontal pixels display, and the 20% jitter in their widths is not immediately noticeable, which is good.

But there are two big issues still visible here:

1. Those weird vertical bars, and it probably can't quite be seen in this image, but there is actually some fainter vertical banding going on in there, too.

2. The two fields of the interlaced display are diverged by a fair amount at the top of each frame, and only line up in the lower third or so of the display. I'm fairly sure that the root cause for this is that I am not doing something quite right with the vertical sync signalling.

As a first test for the horizontal position stuff, I tried disabling interlace, and just using "fake progressive", like the C64 did. But that results in even worse divergence, with it initially syncing to a half-raster horizontal offset. This again points to some problem with the vertical sync signalling.

While I was doing that, I completely forgot that the VIC-III actually has register bits in $D031 to select interlaced video, as well as mono composite output, i.e., disabling any PAL/NTSC colour signalling. So I have hooked those up into the pixel driver. The interlace bit might end up needing an override, as some existing software that uses V400 doesn't set it, which means that the composite output would not display the alternate rasters, but rather the same 200 vertical lines all the time.

Reading Video Demystified, I can see where part of my problem is coming from: The rasters really are offset by half a raster on each successive frame, which explains why the monitor is trying to lock onto a half-shifted line. This means that the HSYNC and everything have to be shifted half a raster early. It also means that we are supposed to have a trailing half raster on those fields that are shifted by half a raster. That should actually be fairly easy to do, as it will occur quite naturally.

Hmmm... Investigating, it looks like we should already be switching the composite HSYNC by half a raster on alternate frames. But it isn't happening. So that's the first thing to investigate. The field_is_odd signal is being generated correctly, and is being passed into the frame_generator units. Those are also selecting the alternate Y lines based on that. That then directly sets the raster X position for the composite lines, which should then result in the correct positioning of the HSYNC pulse for the composite rasters -- but isn't. Or if it is, its then being magically overridden and ignored.

Ah, one minor bug to fix: When I shifted the pixels to being 5 1/3 ticks wide, I didn't update the delay from HSYNC to start of active area, so it will be a little early. I'll fix that while I'm here, but it shouldn't be affecting the real problem we are chasing here.

Okay, for the HSYNC problem, it looks like I got too clever pants: Because the 27MHz ED TV native mode of the VIC-IV uses an odd number of rasters, 625, this means that if we don't correct for which SD TV 13.5MHz field we are displaying, that the 1/2 raster shift will occur naturally. But if we try to make it do it, then it will cancel out, and not happen -- which is exactly what is happening.

Just to add to the fun: This problem with the half raster line toggle is PAL specific -- NTSC does not need it, and when I switch the display to NTSC, indeed, it looks rock solid, although those vertical bars are still visible, but fainter (also I didn't update the display geometry after switching, so the display is shifted down somewhat):

Enabling interlace for NTSC messes things up, though, because it tries to put the half-raster offset in, which I believe is not required for NTSC. I'll investigate that later. Interestingly I am seeing the same vertical banding as with PAL, but the vertical bars through the background are quite a lot fainter. Again, something to investigate after I have PAL display stable.

The best way to do this is to use simulation, so that I can look at the perfect signals being generated to look for the correct structure -- or lack thereof. Unfortunately, it takes about 6 minutes to simulate 2 fields = 1 frame in my VHDL test bed, which is a bit disruptively slow.

To solve this, I am adding a debug feature to the pixel_driver module that allows shortening the frames during simulation (but not synthesis), so that the simulation time can be greatly reduced. It was around 6 minutes, and now I have it down just under a minute, and could probably trim it further if necessary, but this will do for now.

That certainly saved a lot of time, and it didn't take too much longer to get the PAL VSYNC trains looking like they should. Or at least, I think they are right now. While my head is in this particular bucket, I'll also work on getting the NTSC VSYNC trains right, which have slightly different timing compared with PAL.

I am now trying to get the VSYNC trains exactly right for NTSC, and am pulling my hair out a bit. I think one of the root causes is that I used the convention of the HSYNC pulse being late in the raster instead of start of the raster on the MEGA65, which turns out to not be the convention. This means that VSYNC seems to start just after an HSYNC pulse, which complicates other aspects of the logic. It might be causing issues for the PAL as well -- which I'll check when get get back to that.

Meanwhile, my problem is that the HSYNC occurring just before VSYNC asserts means that I have a full-width HSYNC just before the VSYNC train, instead of a half-width one. The pragmatic solution is just to fake the rest of the raster line, and then do the proper thin HSYNCs after that.

Several hours of mucking about to get this sorted, and I think I have NTSC more or less right now. The image is stable in the horizontal direction. The only remaining issue is that I had an out-by-one vertically with the interlace that was causing problems when interlace mode was enable. With it disabled, it looks rock-solid. I also found the stupid error causing the vertical banding (old debug code writing stuff into the blue channel), and the cause of the disappearing test pattern in NTSC. PAL is still playing up, but might similarly have a vertial out-by-one that is confusing things. I'll find out after I have a nap.

Okay, several more hours of incrementally spotting and fixing timing errors, I now have PAL and NTSC working in both interlaced and fake progressive mode.

One interesting thing, is that a minor timing tweak I had to do for 50Hz digital output I have been able to remove when interlace is enabled, because interlaced PAL video actually has 1/2 a raster line more than progressive. This means that if interlaced PAL is enabled, the rasters are now back to being 864 ticks instead of the 863 I was using to match C64 timing. It also means that the MEGA65 set to 1MHz and PAL will by about 0.2% faster if fake progressive is selected instead of interlace mode, because of the selective application of this timing fix.

Meanwhile here is a quick example of the effectiveness of interlace mode. This is on this old DELL LCD monitor, so it is digitally persisting the alternate fields, allowing it to remove all flicker. I started with the MEGA65's READY prompt, and just set the V400 bit (bit 3 of $D031). Without interlace, it is only displaying every other raster, making it look pretty bad:

Now if we turn interlace on, by setting bit zero of $D031 (this means that, like on the C65, V400 and interlace are independently selectable):

It looks pretty much perfect :)

Note that even though it will flicker on a CRT, it will flicker like an Amiga, not like a C64 trying to do interlace, because the C64 uses fake progressive, which doesn't do the vertical interleaving of the lines, as I mentioned earlier.

So now onto getting colour working!

I have already prepared a 256-entry SIN table. We are clocking out pixels at 81MHz, and we need to produce colour bursts at two very specific frequencies for PAL and NTSC respectively:

PAL: 4.43361875 MHz

NTSC: 3.57561149 MHz

These have to be synthesised very accurately, so that the magic of the separation of the colour (chroma) and the brightness (luma) signals from one another can happen effectively in the TV or monitor. These frequencies weren't chosen by their creators randomly, but rather were specially designed so that harmonics from them neatly (mostly) stay away from those of the luma signal.

Our 81MHz video clock is 18.269500507 the PAL colour carrier, and 22.650278938 the NTSC colour frequency. This means that each 81MHz cycle, we need to advance by 360 / 18.269500507 = 19.704972222 degrees (PAL) or 360 / 15.893844 degrees (NTSC). Of course we are working in binary, so we will use a system where the circle has 256 "hexadegrees". In this system, the required advance is:

PAL: $E.032E43BA hexadegrees

NTSC: $B.4D62D0F8 hexadegrees

As we can see, the fractional part just carries off into the sunset, rather than ending nicely after a couple of digits. 32 bits of precision is more than enough, as the FPGA's clock is unlikely to be more accurate than about 10 parts per billion, anyway. That will get us well within 1Hz accuracy, which is plenty.

To count these accurately, we will need to have a whole and fractional part of the current signal phase in hexadegrees, and add the appropriate quantity to each every 81MHz cycle. We have to be a bit careful, because adding a 32bit value at 81MHz has the risk of having poor timing, and we probably in fact need a 40 bit value to have 32-bit fractional part plus 8 bit whole part (we don't care how many cycles).

The colour burst signal is the first part that we will implement. This consists of 9 +/- 1 full cycles of the colour burst signal during the horizontal retrace, a "short time" after the end of the HSYNC pulse. We will just time 2usec after the end of the HSYNC pulse, and put it there. For NTSC, the amplitude should be 40/140ths = ~29% of the full range of the video signal. As our sine table is based on a full amplitude of 255, this means that we have to scale it down to 73/255ths. PAL uses a slightly higher amplitude, which boils down to 78/255ths.

For NTSC, the initial phase of the colour burst signal varies for odd and even lines when interlacing, with the phase changing once every two fields, i.e., full frame. This can be implemented by inverting the phase after every odd field, using the naming of the fields that I have in the pixel_driver. Also, the colour burst phase has to be set to exactly 0 or 180 degrees ($00 or $80 hexadegrees) at the start of the colour burst on each line. For non-interlaced NTSC, the colour burst phase remains at zero degrees at the start of each burst, i.e., it does not alternate every frame.

For PAL, we also have to enable or disable the colour burst on the last and first whole raster of each field, as described in Figure 8.16a and 8.16b of Demystifying Video. I'll come back to that later. Otherwise, instead of using a 0 or 180 degree initial phase, PAL uses 135 or 135 + 90 = 225 degrees = $60 or $A0 hexadegrees, and this alternates every line within a field, as it does with NTSC.

For fake-progressive PAL, a fixed pattern is used, rather than changing each field.

That should be enough information for me to implement the colour burst signals. After that, I will have to implement the actual colour information and modulation onto the signal.

In the process of looking at this, I realised the cause of a minor issue I noticed earlier this morning: The NTSC luma was noticeably dimmer than PAL. NTSC uses a black level which is somewhat above the sync level, which I wasn't taking into account. I'm also suspicious that I had some error in the luma calculation that was causing the non-linear banding. Finally, these calculations needed to be reworked, anyway, to leave enough head-room for the modulated colour information. So I had to recalculate them, anyway. I'll generate a test bitstream with those tables, and it will be interesting to see if in addition to having more consistent brightness between PAL and NTSC, whether it has fixed the non-linear banding.

To help with testing this, and the colour representation in general, I have added grey scale and pure RGB bars to the test pattern. Eventually I should also add colour saturation bars as well, but step by step.

This is what the test pattern looks like with digital video output:

And in mono composite:

The non-linearity is still there, so it wasn't an error in the luma calculation. Oh well. But I have fixed the NTSC darkness issue -- PAL and NTSC now have very similar brightness.

And since this post is already about five miles long, I'm going to stop at this point, and start a new post, where I focus on actually adding the colour information, now that I have everything in place for it.

MEGA65 Links

Friday 6 January 2023

Adding colour to the MEGA65's Composite Output

Comparing the Graphics Capabilities of the MEGA65 and Amiga

Sunday 1 January 2023

Working on Composite Video Output for the MEGA65 Expansion Board