I have spent a bit more time tonight working on hardware support for proportional fonts.
For those coming in late, the VIC-IV already has the ability to draw skinny characters 2, 4 or 6 pixels wide as well as the usual 8. This can be used to construct large characters from one or more 8x8 character blocks to make any even number of pixels in width on screen. For a large type face, each character may be several 8x8 character blocks wide.
This means that a row of proportional text may have a variable number of characters, because if there are skinny characters, then more will fit on a line. Conversely, if there is no text on the right of the display, then it doesn't make sense to waste RAM describing empty characters. Thus I have followed Jeremy's idea of implementing a special end of line marker, so that each row can differ in length, and we can hopefully use RAM much more efficiently when faced with large high-resolution text displays.
In the previous post I described the work on skinny characters.
Now I have just about finished implementing the end of line markers, although as I write there is one remaining bug which is quite obvious in the screenshot below in the form of the vertical bars that shouldn't be there:
At first glance, this looks mostly like a normal C64 text mode display. However the entire screen is described using only about 80 bytes each of screen and colour RAM:
The screen RAM:
:0400 01 C0 02 00 03 00 04 00 05 00 FF FF 06 00 07 00
:0410 08 00 FF FF FF FF FF FF FF FF FF FF 09 00 0A 00
:0420 FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
:0430 FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
:0440 0B 00 FF FF 0C 00 FF FF 0D 00 FF FF
The colour RAM:
:D800 00 01 02 03 04 05 07 01 02 02 02 02 02 0E 0F 0E
:D810 0E 0E 0E 0E 00 00 00 00 02 03 04 05 02 08 01 02
:D820 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E
:D830 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E
:D840 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E
:D850 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E 0E
To follow what is going on, remember that in this mode ($D054 = $01), two bytes are used to describe each character. The first byte is the low 8 bits of the character number, and the low nybble of the second byte holds the extra character number bits. The top two bits of the second byte set the width of the character as it is displayed on screen.
Thus $01 $C0 at $0400 draws only the left two pixels of the letter A, which shows up as the stumpy black line in the screen shot. The rest of the row of text is now offset by 6 pixels compared to normal.
Normal characters are encoded from $0402 - $0409. This is followed by $FF $FF which tells the VIC-IV that there is an early end of line. Thus the letter F described by $06 $00 at $040C-$040D appears at the beginning of the next line, and no other characters appear to the right of the letter E.
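To make the row structure easier to follow, here is a minimal Python sketch that splits the screen RAM dump above into rows at each $FF $FF marker. It is only a model of the behaviour described in this post, not the VIC-IV logic itself: the function name `split_rows` is made up for illustration, and the sketch assumes that no row reaches the normal right-hand edge of the screen before its marker.

```python
# Screen RAM bytes copied from the dump above ($0400 onwards).
SCREEN_RAM = bytes.fromhex(
    "01 C0 02 00 03 00 04 00 05 00 FF FF 06 00 07 00 "
    "08 00 FF FF FF FF FF FF FF FF FF FF 09 00 0A 00 "
    + "FF " * 32 +
    "0B 00 FF FF 0C 00 FF FF 0D 00 FF FF"
)

END_OF_ROW = (0xFF, 0xFF)


def split_rows(screen_ram):
    """Split 16-bit text mode screen RAM into rows of (low, high) byte pairs.

    Hypothetical helper: every $FF $FF pair ends the current row, so a
    marker that directly follows another marker produces an empty row.
    """
    rows, current = [], []
    for i in range(0, len(screen_ram) - 1, 2):
        cell = (screen_ram[i], screen_ram[i + 1])
        if cell == END_OF_ROW:
            rows.append(current)
            current = []
        else:
            current.append(cell)
    if current:  # trailing characters without a final marker
        rows.append(current)
    return rows


rows = split_rows(SCREEN_RAM)
for number, row in enumerate(rows):
    print(number, " ".join(f"${low:02X}" for low, _ in row))
```

Run over the dump, this gives 25 rows from 76 bytes of screen RAM. Most of the rows are empty, which is exactly the case that triggers the remaining bug.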
Colour RAM is drawn in a somewhat strange way, that I will probably fix. Within a row, the colour RAM bytes are read one per character, and so $D801 has $01 (white), and this is applied to the letter B encoded in $0402-$0403. That seems quite reasonable. But following an end of line, the colour RAM address catches up with the screen RAM. What I intend to do is make the colour RAM address advance two bytes for each character, so that the extra byte of information can be used. I might use this to allow skinny characters to be odd widths, and also to have a kind of super-extended-background-mode, where the other bits select the background colour.
However, before I do any of that, I need to fix the bug that happens when a character row consists only of a $FF $FF end of row marker. In that case the length of the character data is incorrectly set to the maximum value, instead of zero, and so whatever rubbish was hanging around in the character raster buffer gets redrawn.
Although, as I write that, I am not entirely convinced that this is the whole story. Indeed, it seems that the three bars are the contents of $3894, $300C, and $80CB. Very strange.
Hopefully my little bug-fix will work, otherwise I will just specify that each row must have at least one character, so you would use something like $20 $00 $FF $FF to make a row empty with one space at the beginning.
Friday, 31 October 2014
Thursday, 30 October 2014
Hardware support for proportional fonts and other text mode effects
I was talking with Jeremy in the lab today, and we got talking about hardware support for proportional fonts on the C65GS. I had thought about doing this before, but had put it on the back-burner for a while, because I hadn't really come up with an elegant solution.
While we were talking today, however, we got talking about how I had 2 spare bits in the character number in 16-bit text mode, where two screen RAM bytes are used for each character. I don't remember exactly who came up with which part of it, but by the end of it, we had come up with a workable solution not only for hardware support for proportional fonts, but anti-aliased fonts, too (more on that in a future post).
Proportional fonts really just require specifying the width of each character, so that some can be narrower than others. With two spare bits, it was easy enough to allow characters to be 8, 6, 4 or 2 pixels wide. Characters more than 8 pixels wide can then be constructed with any number of 8 pixel wide characters, and one narrow character to make it easy to obtain any even number of pixels in width. It would be nice to allow odd widths, but it feels like it is a reasonable trade-off to have to round to the nearest 2 pixels.
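As a small illustration of how a wide glyph breaks down into cells, the following Python sketch splits an even pixel width into full 8 pixel cells plus at most one narrow cell. The function names are hypothetical. The mapping from width to the two width bits assumes the values step evenly (8, 6, 4, 2 for 0, 1, 2, 3), which matches the two end points shown in these posts.

```python
def cell_widths(total_width):
    """Split an even glyph width into 8 pixel cells plus one narrow cell."""
    if total_width <= 0 or total_width % 2:
        raise ValueError("width must be a positive even number of pixels")
    full_cells, remainder = divmod(total_width, 8)
    widths = [8] * full_cells
    if remainder:
        widths.append(remainder)  # 2, 4 or 6 pixels
    return widths


def width_bits(width):
    """Two-bit width field for a cell (assumed mapping: 8->0, 6->1, 4->2, 2->3)."""
    if width not in (2, 4, 6, 8):
        raise ValueError("cell width must be 2, 4, 6 or 8 pixels")
    return (8 - width) // 2


for glyph_width in (6, 14, 16, 22):
    widths = cell_widths(glyph_width)
    print(glyph_width, widths, [width_bits(w) for w in widths])
```

Placing the narrow cell last is just one choice; nothing in the scheme appears to prevent putting it anywhere in the run.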
Similarly, tall fonts can be constructed with multiple rows of text. For fonts not a multiple of 8 pixels high, a raster split could be used to skip one or more rows of pixels.
The result is conceptually very simple. The main trade-off is that skinny characters still use the same amount of RAM as full-width ones, but that seems a reasonable trade-off. In fact, it was so simple that I was able to implement it in about an hour, and the main functionality worked first time, as can be seen below:
Each row has a different width specified for the "A" at the beginning of the line.
To use this feature, you first have to enable 16-bit text mode, where two bytes of screen memory describe each character on screen. This is done by setting bit 0 in $D054.
In terms of screen RAM, the memory for the rows here looks something like:
0400 01 00 02 00 03 00 04 00 05 00
0428 01 40 02 00 03 00 04 00 05 00
0450 01 80 02 00 03 00 04 00 05 00
0478 01 c0 02 00 03 00 04 00 05 00
The 01, 02 ... 05 are the character numbers for A through E, and these stay the same. For all but the A's, the 2nd byte is 00 to indicate that no special attributes are set for the characters B through E. However, the high byte for the A characters is modified in each row to cover all possible combinations of bits 6 and 7: the higher the value, the narrower the character.
While I am fiddling with text attributes, I should explain what the other bits in the high byte mean:
bits 0 - 3 = bits 11 - 8 of the character number. i.e., there can now be 4,096 characters in a character set.
bit 4 = flip character horizontally
bit 5 = flip character vertically
The following screen shot shows the flip bits in action:
The contents of screen RAM are:
0400 01 00 02 10 03 20 04 30 05 00
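The bit layout of the high byte can be summarised in a short Python sketch that decodes the row above. The `Cell` type and `decode_cell` function are names invented for this example, and the width formula assumes the two width bits step evenly from 8 pixels down to 2.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    char: int      # 12-bit character number
    flip_h: bool   # bit 4 of the high byte
    flip_v: bool   # bit 5 of the high byte
    width: int     # pixels, from bits 6 and 7 of the high byte


def decode_cell(low, high):
    """Decode one two-byte screen RAM entry in 16-bit text mode."""
    return Cell(
        char=low | ((high & 0x0F) << 8),
        flip_h=bool(high & 0x10),
        flip_v=bool(high & 0x20),
        width=8 - 2 * (high >> 6),
    )


ROW = bytes.fromhex("01 00 02 10 03 20 04 30 05 00")
cells = [decode_cell(ROW[i], ROW[i + 1]) for i in range(0, len(ROW), 2)]
for cell in cells:
    print(cell)
```

For this row, B is flipped horizontally, C vertically, D in both directions, and A and E are drawn normally.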
The ability to flip characters is designed to be used with full-colour text mode, where (some or all) characters on the screen consist of 64 8-bit pixels, providing a graphics mode that can be quickly scrolled.
Flipping characters in such a mode allows 64-byte characters to be reused in a graphical display without too much obvious repetition, e.g., for textures in games.
Combining this with variable width characters introduces even more opportunity to reuse characters, and thus allow more interesting and complex high-resolution graphics within the limits of the 128KB of chipram.
Wednesday, 29 October 2014
My real C65 keyboard has arrived, very well wrapped
A while back I won an e-bay auction for a genuine C65 keyboard (without printing on the top of the keys). Today it arrived, as you can see below. It was nice to again touch a C65 keyboard after about four years of not having one. The lack of printing on the top of the keys is not a big problem, as the key positions are fairly easy to figure out, and the printing on the front provides strong clues as well.
I also have to commend the seller for his meticulous packing. Below you can see lots of soft packing material, which I could only get to once I had cut through quite a lot of tape that held the box very well closed.
Inside that was the following inner part, also completely mummified in tape. This keyboard had no hope of escaping in transit.
Then once I de-mummified that, it was itself a further very large bag, sealed at the mouth with further tape:
Removing that tape, the quite large bag looked to also be heat-sealed.
Finally inside that the keyboard was sitting in its original light-blue Mitsumi packing bag, and in very good condition -- better than the original one I had, on which I had to repair a track on the cable when it first arrived.
So now I have a real C65 keyboard, a recreation C65 motherboard, lacking only the custom chips, and my FPGA design. I am getting closer to being able to make a complete working unit with real hardware. I am still leaning towards making a laptop form-factor C65GS, if only I can find a nice LCD panel that can do 1920x1200 or 1920x1080 and isn't too big. About 13" would be ideal, I think.
Hardware thumbnail generation is now usable
Today I had a chance to fix a few bugs with the hardware thumbnail generation, and actually test it out by writing a small program that does a raster split, with the thumbnail being drawn in 256-colour mode in the bottom corner of the screen every frame, as you can see below:
There are still a few glitches, but it is working pretty nicely. In particular, you can see that it is being drawn in real-time, because the thumbnail contains an image of the thumbnail that contains an image of the thumbnail :)
What you can't see is that the thumbnail is a bit different every frame, I think because the counter that decides which raster to look at isn't reset at the start of each frame. This causes some weird things to happen. Also, the thumbnail just contains the value of chosen pixels. I am considering changing this so that it shows the average of the pixels in the sample area of the raster line, but at the same time, the current scheme works fairly well.
This is a nice milestone for a few reasons.
First, the hardware thumbnail generator is clearly working to some reasonable degree.
Second, full-colour text mode is working fairly well as well, such that I could write this program.
And finally, I have actually written a programme that does something (slightly) useful, using C65GS special features, and it works :)
Next stop is to fix the counter problem, and see if I am then happy enough with it, and if so, to move on to some of the other interesting things in my queue, like enhanced sprites.
Thursday, 23 October 2014
Hardware thumbnail generator for task-switcher
One of the main reasons for implementing the hypervisor is so that it will be possible to switch between different tasks running on the machine. The tasks won't be running at the same time, but rather they will be suspended while another task is running.
For a task-switcher to be nice, it would be really handy to be able to show a low-res screen-shot of the last state of each task so that the user can visually select which one they want. In other words, to have something that is not too unlike the Windows and OSX window/task switcher interfaces.
However, this is tricky on an 8-bit computer that has no frame buffer, and may be using all sorts of crazy raster effects.
Thus I need some way to have the VIC-IV update a little low-res screen shot, i.e., a thumbnail image, that the hypervisor can read out, and retain for later task-switching calls to show the user what was running in each task before they were suspended.
So I set about implementing a little 4KB thumbnail buffer which is automatically written to by the VIC-IV, and which can be read from the hypervisor. This resolution allows for 80x50, which should be sufficient to get the idea of what is on a display. Each pixel is an 8-bit RRRGGGBB colour byte.
Because the VIC-IV writes the thumbnail data directly from the pixel stream, it occurs after palette selection, sprites and all raster effects. That is, the thumbnails it generates should be "true".
After a bit of fiddling around, it is mostly working.
To test it, I wrote a little BASIC programme that reads from the one-byte access to the 4KB buffer, copying it to $4000-$4FFF. Then I used the serial monitor to grab that copy of the data, and wrote some UNIX shell scripts and a little C programme to munge it into an 80x50 Windows BMP file.
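The conversion step can be sketched in a few lines of Python. This is not the shell and C tooling mentioned above, just an illustration under stated assumptions: the buffer is taken to be row-major with the top line first and 80 bytes per line (the post does not spell out the layout), each 3-bit or 2-bit channel is scaled linearly to 8 bits, and the names `rrrgggbb_to_rgb` and `thumbnail_to_bmp` are invented for the example.

```python
import struct

WIDTH, HEIGHT = 80, 50  # 4,000 pixels, which fits in the 4KB buffer


def rrrgggbb_to_rgb(pixel):
    """Expand one RRRGGGBB byte to an (r, g, b) tuple of 8-bit values."""
    red = (pixel >> 5) & 0x07
    green = (pixel >> 2) & 0x07
    blue = pixel & 0x03
    return (red * 255 // 7, green * 255 // 7, blue * 255 // 3)


def thumbnail_to_bmp(buffer):
    """Build a 24-bit uncompressed BMP from the first 80x50 bytes of buffer."""
    if len(buffer) < WIDTH * HEIGHT:
        raise ValueError("buffer is smaller than 80x50 pixels")
    body = bytearray()
    # BMP stores rows bottom-up, and each pixel as blue, green, red.
    # 80 * 3 = 240 bytes per row is already a multiple of 4, so no padding.
    for y in range(HEIGHT - 1, -1, -1):
        for x in range(WIDTH):
            red, green, blue = rrrgggbb_to_rgb(buffer[y * WIDTH + x])
            body += bytes((blue, green, red))
    file_header = struct.pack("<2sIHHI", b"BM", 54 + len(body), 0, 0, 54)
    info_header = struct.pack(
        "<IiiHHIIiiII",
        40, WIDTH, HEIGHT, 1, 24, 0, len(body), 2835, 2835, 0, 0,
    )
    return file_header + info_header + bytes(body)


if __name__ == "__main__":
    # Stand-in data: a real run would load the bytes captured from $4000-$4FFF.
    fake_buffer = bytes(i & 0xFF for i in range(4096))
    with open("thumbnail.bmp", "wb") as handle:
        handle.write(thumbnail_to_bmp(fake_buffer))
```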
Here is how it looks, with the image rather enlarged to make it easier to see:
While not perfect, it is an improvement on the first capture, where I forgot to read from the start of the thumbnail buffer, so it was all out of whack:
I need to find out what is causing the "clouds", and also why it is writing only 77 pixels per line instead of 80 pixels per line.
But other than these problems, I am well on the way to being able to present a nice graphical display to allow for switching between tasks from the hypervisor.
A 3rd party operating system for the C65GS?
I was quite surprised today in a happy way to see that someone is considering making an operating system for the C65GS:
http://65.theace.sk/index.html
This would be a port of the yet-to-be-complete ACE128 operating system.
I understand that Miro could do with some contributors to help, whether for the 65GS or 64/128 version.
Sunday, 19 October 2014
Virtualisation and Task Switching
One of the features I have wanted to include in the C65GS from early on is some sort of task switching and rudimentary multi-tasking.
Given the memory and processor constraints, I don't see the C65GS as running lots of independent processes at the same time. Rather, I want it to be possible to easily switch between different tasks you have running.
For example, you might be using Turbo Assembler to write some code, and decide to take a break playing a game for a few minutes, but don't want to have to reload Turbo Assembler and your source code again.
Or better, with a patched version of Turbo Assembler you might want to edit in one task and have it assemble into a separate task, and switch back and forth between them as you see fit.
It would also be nice to be able to have certain types of background processing supported. For example, being able to leave IRC or a download running in the background, with it waking up whenever a packet arrives or a timeout occurs.
For all these scenarios, it also makes sense to be able to quarantine one task from another, so that they cannot write to one another's memory or IO without permission. This implies the need for some sort of memory protection, and a supervisor mode that can run a small operating system to control the tasks (and their own operating systems) running under it.
Thus, what we really want is something like VirtualBox that can run a hypervisor to virtualise the C65GS, so that it can have C64 or C65 "guest operating systems" beneath, and keep them all separate from each other.
This doesn't actually need much extra hardware to do in a simplistic manner.
First, we need the supervisor/hypervisor CPU mode that maps some extra registers. I have already implemented this with registers at $D640-$D67F.
Second, to make hypervisor calls fast, the CPU should save all CPU registers and automatically switch the memory map when entering and leaving the hypervisor. I have already implemented this, so that a call-up into the hypervisor takes just one cycle, as does returning from the hypervisor.
Third, we need to make the hypervisor programme memory visible only from within hypervisor mode. I have already implemented this. The hypervisor programme is mapped at $8000-$BFFF, with the last 1KB reserved as scratch space, relocated zero-page (using the 4510's B register), and relocated stack (using the 4510's SPH register). I am in the process of modifying kickstart so that it works as a simple hypervisor.
Fourth, we need some registers that allow us to control which address lines on the 16MB RAM are available to a given task, and what the value of the other address lines should be. This would allow us to allocate any power-of-two number of 64KB memory blocks to a task. When a task is suspended, its 128KB chipram, 64KB colour RAM and IO status can be saved into other 64KB memory blocks that are not addressable by the task when it is running. This I have yet to do.
Fifth, we need to be able to control what events result in a hypervisor trap, so that background processes can run, and also so that the hypervisor can switch tasks. The NMI line is one signal I definitely want to trap, so that pressing RESTORE can activate the hypervisor.
By finishing these things, and then writing the appropriate software for the hypervisor, it shouldn't be too hard to get task switching running on the C65GS.
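The address-line idea in the fourth point can be modelled with a mask and a base value. The Python sketch below is purely illustrative, since the registers have not been designed yet: `make_window` and `translate` are hypothetical names, and it assumes a task's window is aligned to its own size so that the fixed address lines never overlap the ones the task controls.

```python
BLOCK_SIZE = 0x10000   # 64KB
RAM_SIZE = 0x1000000   # 16MB, i.e. 24 address lines


def make_window(base, blocks):
    """Return (base, mask) for a task that owns `blocks` 64KB blocks at `base`."""
    if blocks <= 0 or blocks & (blocks - 1):
        raise ValueError("number of blocks must be a power of two")
    size = blocks * BLOCK_SIZE
    if base % size or base + size > RAM_SIZE:
        raise ValueError("base must be aligned to the window size and fit in RAM")
    return base, size - 1


def translate(task_address, window):
    """Map an address issued by the task onto the physical 16MB RAM."""
    base, mask = window
    # Address lines inside the mask come from the task;
    # all other address lines are forced to the value held in base.
    return base | (task_address & mask)


window = make_window(0x400000, 2)          # 128KB at $400000
print(hex(translate(0x012345, window)))    # inside the window
print(hex(translate(0x020000, window)))    # wraps back to the start
```

Because the upper address lines are forced, a task simply cannot form an address outside its own blocks, which is what keeps suspended tasks' saved state out of reach.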
Thursday, 16 October 2014
Sprites behind the border, and another bug discovered
Our almost-4yo went to sleep on the way home at 16h30 today, and so as a result is now up at 02h00. While I'd rather be sleeping, being up with him for a while gives me the chance to try the latest change that I left synthesising when I went to bed. That change was to make the VIC-II sprites honour the border.
My favourite way to test sprites at the moment is to run Lemmings. This confirmed that the sprites were now honouring the border. I also finally remembered the controls for Lemmings to start a game, and was pleasantly surprised to find that the game works, with little lemmings walking around the place as they should. The game is raster interrupt driven, so the speed was more or less correct as well, as you can see from the following screen shot:
I also learned two extra things:
1. Lemmings apparently uses sprites for the main display.
2. I have a bug where the bottom row of each sprite appears first.
I was also unable to see the cross-hairs, which I assume must be done with characters or bitmap data.
A quick check in VICE confirmed that this is indeed how the cross-hairs are drawn. So now I need to find out what is going wrong with this on the C65GS. I do at least now know that it is in characters $FC and $FD, and the screen is at $4000 for half the frames.
A quick bit of poking around has revealed the problem: I haven't implemented sprite background priority yet, so the sprites are hiding the cross hairs.
In theory, I should be able to use the joystick to move the cross-hairs to a blank section so that I can see them. However, for some reason joystick control isn't working. Maybe I have messed up the joystick CIA input in some way. I'll have to investigate this further, along with the sprite display problem.
Wednesday, 15 October 2014
Confirmed that I have fixed the sneaky CPU bug
This morning, after synthesis of the fix for the sneaky CPU bug, I had the chance to test it out.
Rayne's interlace test program now works, and his MUIFLI program is also closer to working, although it isn't showing the right data. But that could be due to FLI not working on the VIC-IV -- yet to be confirmed.
However, what it did also fix is BoulderMark. So I can now present the latest result for the C65GS with this benchmark:
Notice that now the sprite appears (and that the sprite sitting in the border is also visible because sprites currently sit in front of the border). Otherwise the display is just about perfect. This image was captured via the VNC server video streaming interface (search previous posts to find out more about this).
Anyway, this all equates to 94x NTSC C64 or almost exactly 100x PAL C64. Of course as I have mentioned before, BoulderMark is non-linear with fast accelerators, and so the real performance is much more likely to be the roughly 44x that SynthMark64 reports.
Tuesday, 14 October 2014
Found a sneaky CPU bug
While trying to run some graphic test programmes supplied by Rayne, I found that the CPU was mis-behaving in a way that reminded me of the bug I was seeing with BoulderMark, and probably Lemmings as well. Basically all was fine until a raster interrupt occurred, and then things would go odd or outright crash. What was extra odd was that BoulderMark would still run on the FPGA at work, but not on the one here at home, which shouldn't happen -- FPGAs shouldn't be picky like that.
Anyway, Rayne's programmes are much simpler, and offered the prospect of easily debugging what was going on.
So after a bit of poking around I discovered that the C65GS would go to lala-land after INC $D019.
This got me thinking, because $D019 is special in my CPU: RMW instructions perform the dummy write only when they touch $D019, and not for any other address. This avoids wasting a CPU cycle on the dummy write of the original value back to memory, except where it is required for C64 compatibility.
The lack of this dummy write on the C65, which on a C64 acts to clear the VIC-II interrupt, is one of the major sources of incompatibility between the C65 and C64, and stops the majority of software from running on it. Thus I had gone to special effort to make sure it wouldn't be a problem on the C65GS, but without the CPU speed penalty of doing it on every address.
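To make the role of the dummy write concrete, here is a toy Python model (a sketch only, not the CPU's VHDL). It assumes the usual write-1-to-clear behaviour of the $D019 interrupt latch; the class and function names are made up for illustration, and the handling of bit 7 is simplified.

```python
class IrqLatch:
    """Toy model of the VIC-II interrupt latch at $D019 (write 1 to clear)."""

    def __init__(self, pending):
        # Bits 0-3: raster, sprite/background, sprite/sprite, light pen.
        self.pending = pending & 0x0F

    def read(self):
        # Simplification: bit 7 is set whenever any source is pending
        # (the real chip also consults the enable register at $D01A).
        # The unused bits 4-6 read back as 1.
        irq = 0x80 if self.pending else 0x00
        return irq | 0x70 | self.pending

    def write(self, value):
        # Every 1 bit written clears the corresponding pending bit.
        self.pending &= ~value & 0x0F


def inc_d019(latch, dummy_write):
    """Model INC $D019 with or without the 6502-style dummy write."""
    old = latch.read()
    if dummy_write:
        latch.write(old)               # original value written back first
    latch.write((old + 1) & 0xFF)      # then the incremented value


with_dummy = IrqLatch(0x01)            # raster interrupt pending
inc_d019(with_dummy, dummy_write=True)

without_dummy = IrqLatch(0x01)
inc_d019(without_dummy, dummy_write=False)

print(with_dummy.pending, without_dummy.pending)
```

With the dummy write, the old value ($F1 in this model) is written back and clears the raster bit. Without it, only $F2 is written, so the raster interrupt stays pending and fires again immediately.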
However, I had messed up the dummy write state in the CPU: it was not setting the target address on the bus, and so instead was writing to the last accessed memory location, which was the first byte of the following instruction. The net result is that the old contents of $D019, usually $F0 or $F1, would get written to the next byte in the instruction stream. I confirmed this in simulation, where the dummy write and final write are the two "MEMORY writing" lines near the end of the log. Note that the dummy write is going to $F60F, not $D019!
gs4510.vhdl:1685:11:@700ns:(report note): MEMORY reading $FFFF60C = $EE
gs4510.vhdl:1004:7:@700ns:(report note): MEMORY long_address = $FFFF60D
gs4510.vhdl:1685:11:@780ns:(report note): MEMORY reading $FFFF60D = $19
gs4510.vhdl:1004:7:@780ns:(report note): MEMORY long_address = $FFFF60E
gs4510.vhdl:1685:11:@860ns:(report note): MEMORY reading $FFFF60E = $D0
gs4510.vhdl:1004:7:@860ns:(report note): MEMORY long_address = $FFFF60F
gs4510.vhdl:1685:11:@940ns:(report note): MEMORY reading $FFFF60F = $AD
gs4510.vhdl:1004:7:@940ns:(report note): MEMORY long_address = $FFD3019
gs4510.vhdl:1685:11:@1020ns:(report note): MEMORY reading $FFD3019 = $70
gs4510.vhdl:1004:7:@1020ns:(report note): MEMORY long_address = $FFFF60F
gs4510.vhdl:1685:11:@1100ns:(report note): MEMORY reading $FFFF60F = $AD
gs4510.vhdl:1304:9:@1100ns:(report note): writing to shadow RAM via chipram shadowing. addr=$000F60F
gs4510.vhdl:1689:11:@1140ns:(report note): MEMORY writing $000F60F <= $70
gs4510.vhdl:1689:11:@1180ns:(report note): MEMORY writing $FFD3019 <= $71
gs4510.vhdl:1004:7:@1180ns:(report note): MEMORY long_address = $FFFF60F
gs4510.vhdl:1685:11:@1260ns:(report note): MEMORY reading $FFFF60F = $AD
gs4510.vhdl:738:9:@1260ns:(report note): $F60C EE 19 D0 inc $D019 A:11 X:22 Y:33 Z:00 SP:01FF P:24 $01=3F MAPLO:0000 MAPHI:8F00 ..E-.I..
So a quick fix and re-run simulation and suddenly we can see that it is all fixed:
gs4510.vhdl:1685:11:@700ns:(report note): MEMORY reading $FFFF60C = $EE
gs4510.vhdl:1004:7:@700ns:(report note): MEMORY long_address = $FFFF60D
gs4510.vhdl:1685:11:@780ns:(report note): MEMORY reading $FFFF60D = $19
gs4510.vhdl:1004:7:@780ns:(report note): MEMORY long_address = $FFFF60E
gs4510.vhdl:1685:11:@860ns:(report note): MEMORY reading $FFFF60E = $D0
gs4510.vhdl:1004:7:@860ns:(report note): MEMORY long_address = $FFFF60F
gs4510.vhdl:1685:11:@940ns:(report note): MEMORY reading $FFFF60F = $AD
gs4510.vhdl:1004:7:@940ns:(report note): MEMORY long_address = $FFD3019
gs4510.vhdl:1685:11:@1020ns:(report note): MEMORY reading $FFD3019 = $70
gs4510.vhdl:1004:7:@1020ns:(report note): MEMORY long_address = $FFFF60F
gs4510.vhdl:1685:11:@1100ns:(report note): MEMORY reading $FFFF60F = $AD
gs4510.vhdl:1304:9:@1100ns:(report note): writing to shadow RAM via chipram shadowing. addr=$000D019
gs4510.vhdl:1689:11:@1140ns:(report note): MEMORY writing $000D019 <= $70
gs4510.vhdl:1689:11:@1180ns:(report note): MEMORY writing $FFD3019 <= $71
gs4510.vhdl:1004:7:@1180ns:(report note): MEMORY long_address = $FFFF60F
gs4510.vhdl:1685:11:@1260ns:(report note): MEMORY reading $FFFF60F = $AD
gs4510.vhdl:738:9:@1260ns:(report note): $F60C EE 19 D0 inc $D019 A:11 X:22 Y:33 Z:00 SP:01FF P:24 $01=3F MAPLO:0000 MAPHI:8F00 ..E-.I..
Oops.. not actually all fixed. It is now writing to $D019 in RAM, not IO. Lucky I decided to write this blog post, or I wouldn't have spotted that I still had the memory write flags slightly messed up. Specifically memory_access_resolve_address wasn't asserted, so the 16-bit address was not being translated to the physical 28-bit address. Fix that and try again:
gs4510.vhdl:1004:7:@1020ns:(report note): MEMORY long_address = $FFFF60F
gs4510.vhdl:1685:11:@1100ns:(report note): MEMORY reading $FFFF60F = $AD
gs4510.vhdl:1689:11:@1140ns:(report note): MEMORY writing $FFD3019 <= $70
gs4510.vhdl:1689:11:@1180ns:(report note): MEMORY writing $FFD3019 <= $71
gs4510.vhdl:1004:7:@1180ns:(report note): MEMORY long_address = $FFFF60F
gs4510.vhdl:1685:11:@1260ns:(report note): MEMORY reading $FFFF60F = $AD
gs4510.vhdl:738:9:@1260ns:(report note): $F60C EE 19 D0 inc $D019 A:11 X:22 Y:33 Z:00 SP:01FF P:24 $01=3F MAPLO:0000 MAPHI:8F00 ..E-.I..
Ah, that's better!
Now to resynthesise, and see if BoulderMark, Lemmings and Rayne's MUIFLI all work properly.
More work on sprites
I don't have any nice screen shots to put in here (but I might add some in later), but I have been working on VIC-II sprites.
These sprites are now displaying properly, apart from the lack of border/foreground priority and hardware collision detection. Sprite positions are now correct with regard to the text/bitmap screen.
Unfortunately, adding the extra logic to the VIC-IV memory access paths has thrown FPGA timing closure out the window for now.
The 192MHz pixel clock requires timing within about 5.1ns, but is currently sitting around 7.3ns. It is an amazing testimony to the Artix7 FPGAs that the system still seems to run flawlessly. This is partly because the FPGA is speed rated for operation at 85 degrees Centigrade and an operating voltage of 0.95 Volts instead of the nominal 1.00 Volt supply.
In any case, I want to get the timing at least close to meeting closure (i.e., being fast enough), so that I can avoid problems later, and also to make sure that everything else that I want to add will still fit.
My approach to this at the moment is to unify the VIC-II compatibility sprite data fetches so that there is only one extra data stream that has to be plugged into the chipram/fastram. I am part way through this, and have already improved timing to about 6.6ns, and it looks like it shouldn't be too hard to further improve on this.
I have also started thinking about the design for the new sprites. This is all subject to change, but here is what I am thinking about at the moment:
The basic design of the new VIC-IV sprites is that each sprite will have a dedicated 4KB memory buffer, and will be strictly one byte per pixel. This allows for sprites of up to 64x64 pixels in 256 colours, since 64 x 64 bytes is exactly 4KB.
Like with the VIC-II, one physical sprite can be used multiple times on a frame without reloading the data by altering the data offset within the 4KB block, and possibly the height and width of the sprite. I am also thinking about allowing sprites to be much wider.
Foreground/background priority will be by applying a bit mask to the character/bitmap data to decide whether it should appear in front of the sprite or behind the sprite. This will allow sprites and the background to perform many of the functions of Amiga-style bit planes, although the way it will be done will be rather different.
Bit masks are also provided to allow modification of the colours of sprites. For example, applying an AND mask of $1F and an OR mask of $80 will translate all colours to $80-$9F. This can be used to allow a common image to be used for different characters in a game, with selected colours being altered. The 256 colour sprite palette can be separated from the bitmap palette, so there is improved flexibility compared to just applying bit masks to a flat 256 colour palette shared by all on-screen elements. If I get really excited it might even be possible to use the other two 256 colour palettes for different sprites.
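As a small worked example of the mask idea, the following Python sketch applies the AND mask first and the OR mask second, which is the order implied by the $80-$9F example. The function name is hypothetical, and the transparency rule (an AND result of $00 means the pixel is not drawn) is taken from the note on the colour AND mask register in the planned register list.

```python
def remap_sprite_colour(pixel, and_mask, or_mask):
    """Return the remapped palette index, or None if the pixel is not drawn."""
    masked = pixel & and_mask
    if masked == 0:
        # Planned behaviour: sprite not visible if the AND result is $00.
        return None
    return masked | or_mask


# Colour $25 keeps its low five bits ($05) and is moved into the $80-$9F range.
print(hex(remap_sprite_colour(0x25, 0x1F, 0x80)))  # 0x85
```

Changing only the OR mask (say from $80 to $A0) would recolour the same image data for a second character without touching the 4KB sprite buffer.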
Finally, I intend to provide hardware scaling and rotation support. I thought about having simple angle and zoom factor settings, but currently I am thinking that I will simply provide a linear 2D transformation matrix per sprite so that other effects can also be used.
The registers for the VIC-IV sprites are currently planned to live at $D710-$D7FF, allowing for up to 15 of these sprites, but there may end up being fewer than that, depending on how many I can wrangle in.
All this is subject to change, as is the register map, but here is the structure I am currently looking at:
$D7x0-$D7x1 - Enhanced sprite X position in physical pixels (lower 12 bits)
$D7x1.4-7 - Enhanced sprite width (4 -- 64 pixels)
$D7x2-$D7x3 - Enhanced sprite Y position in physical pixels (lower 12 bits)
$D7x3.4-7 - Enhanced sprite height (4 -- 64 pixels)
$D7x4 - Enhanced sprite data offset in its 4KB SpriteRAM (x16 bytes)
$D7x5 - Enhanced sprite foreground mask
$D7x6 - Enhanced sprite colour AND mask (sprite not visible if result = $00)
$D7x7 - Enhanced sprite colour OR mask
$D7x8-$D7x9 - Enhanced sprite 2x2 linear transform matrix 0,0 (5.11 bits)
$D7xA-$D7xB - Enhanced sprite 2x2 linear transform matrix 0,1 (5.11 bits)
$D7xC-$D7xD - Enhanced sprite 2x2 linear transform matrix 1,0 (5.11 bits)
$D7xE-$D7xF - Enhanced sprite 2x2 linear transform matrix 1,1 (5.11 bits)
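To show how the 16 bytes for one sprite might be filled in, here is a hedged Python sketch. Several details are assumptions rather than anything stated in the plan: 16-bit registers are stored low byte first, the 4-bit size fields encode pixels/4 - 1, the Y position uses only the lower 12 bits (because the height shares the upper nibble of $D7x3), and the 5.11 matrix entries are signed two's complement. All function names are invented for the example.

```python
import math


def to_fixed_5_11(value):
    """Encode a float as a 16-bit value with 11 fractional bits (assumed signed)."""
    raw = round(value * 2048)
    if not -0x8000 <= raw <= 0x7FFF:
        raise ValueError("value does not fit in 5.11 fixed point")
    return raw & 0xFFFF


def size_code(pixels):
    """Assumed encoding of the 4-bit width/height field: 4..64 pixels in steps of 4."""
    if pixels % 4 or not 4 <= pixels <= 64:
        raise ValueError("size must be a multiple of 4 between 4 and 64")
    return pixels // 4 - 1


def pack_sprite(x, y, width, height, data_offset, angle_deg=0.0, zoom=1.0,
                fg_mask=0x00, and_mask=0xFF, or_mask=0x00):
    """Return the 16 register bytes $D7x0-$D7xF for one enhanced sprite."""
    if data_offset % 16 or not 0 <= data_offset < 4096:
        raise ValueError("data offset must be a multiple of 16 below 4096")
    c = zoom * math.cos(math.radians(angle_deg))
    s = zoom * math.sin(math.radians(angle_deg))
    matrix = [c, -s, s, c]                      # entries 0,0  0,1  1,0  1,1
    xw = (x & 0x0FFF) | (size_code(width) << 12)
    yh = (y & 0x0FFF) | (size_code(height) << 12)
    regs = [xw & 0xFF, xw >> 8, yh & 0xFF, yh >> 8,
            data_offset // 16, fg_mask, and_mask, or_mask]
    for entry in matrix:
        fixed = to_fixed_5_11(entry)
        regs += [fixed & 0xFF, fixed >> 8]
    return regs


regs = pack_sprite(0x123, 0x045, 64, 32, 0x400)
print(" ".join(f"{b:02X}" for b in regs))
```

An 8-bit offset in units of 16 bytes reaches exactly 256 x 16 = 4096 bytes, so the single offset register can address the whole 4KB buffer.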
The attentive reader will note that nowhere does this address the 4KB data blocks for each sprite. These will be direct mapped in the 28-bit address space. I am tossing around the idea of over-mapping them with the 64KB colour RAM at $FF80000 (the first 1KB of which is also available at $D800 for C64 compatibility). The reason for this is that the 4KB sprite RAM will probably be write-only to simplify the data plumbing. However, to allow for freezing (and hence multi-tasking), I really do want some way to read the sprite data. The trade-off of course is that you wouldn't be able to use all 64KB for colour RAM if it is also being used as a proxy to the sprite RAM data.
Monday, 6 October 2014
Initial work on sprites
Last night I didn't sleep solidly, so I got up and did a bit more work on implementing VIC-II sprites in the C65GS's VIC-IV.
The focus here is on implementing "normal" C64/C128/C65 sprites for existing software. As such the focus is not on adding new functionality to these sprites, in particular allowing more colours or more than 8 sprites (although I am planning to relax the 21 pixel high limitation to allow taller sprites, and if all goes well, I may also allow wider sprites).
Along with the SID chip, it is the sprites that really made the C64 stand out from its competition in the early 1980s. Therefore it is important that I get them right, and so far as possible implement all required functionality. So let's just go over what the sprites are, and how they work on the VIC-II/VIC-III (they behave identically on the C64/128 VIC-II and C65 VIC-III).
Basically the sprites are bitmap objects that are drawn either on top of or behind the background graphics in real-time as the frame is drawn raster by raster. This is done with dedicated hardware support in the VIC-II/III chips that allows the user to simply provide the X and Y coordinates at which to display each sprite, and a pointer to the start of the bitmap data. There are also some special flags to modify the priority of the sprites with regard to the rest of the display, so that they can appear "in front" or "behind" the main graphics -- and this can be controlled separately for each sprite. There is also hardware detection for sprite-to-sprite and sprite-to-foreground collision that can be used in games to detect when things touch. Altogether, this allows much more advanced games and graphics on the 1MHz CPU of a C64 compared to contemporary machines. The cost of this flexibility and power is that the sprites consume about 3/4 of the space in the VIC-II; however, history has shown that this was a great investment.
The 8 sprites have a fixed priority with respect to one another, so that lower numbered sprites will always appear in front of higher numbered sprites. This can be easily implemented by creating a pipeline of 8 identical sprite blocks, each of which draws over the output of the previous sprite.
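The priority rule can be sketched in a few lines of Python. This is only a software illustration of the idea, not the hardware: it assumes the highest numbered sprite sits at the start of the chain so that sprite 0 draws last, and it ignores the foreground/background priority flags. The function name and the use of None for a transparent pixel are choices made for the example.

```python
def composite_pixel(background, sprite_pixels):
    """Combine one background pixel with the sprites covering it.

    sprite_pixels[n] is sprite n's colour at this pixel, or None if the
    sprite is transparent (or not present) here.
    """
    colour = background
    # Each stage draws over the output of the previous one. Running the
    # stages from sprite 7 down to sprite 0 leaves sprite 0 on top.
    for n in reversed(range(len(sprite_pixels))):
        if sprite_pixels[n] is not None:
            colour = sprite_pixels[n]
    return colour


# Sprites 1 and 7 overlap here; the lower numbered sprite wins.
print(composite_pixel(6, [None, 2, None, None, None, None, None, 7]))  # 2
```

Because every stage only needs the pixel from the stage before it, the hardware equivalent has a short, local signal path per stage.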
There is some circumstantial evidence to suggest that this is exactly what the VIC-II/III does, as there is a 12 pixel latency in its video pipeline, and it is reasonable to suspect that 8 of those cycles are for the 8 sprite compositing stages. Also, by staging the sprites in a linear pipeline, it is easier to meet the timing requirements, because the sprite signals need only move to the next sprite in the pipeline, instead of all having to be gathered together in some other way, such as a tree structure (although that would also be possible). This is especially relevant for the C65GS where the video dot clock is running at 192MHz, and so I have to keep the logic depth shallow, and avoid dependencies on distant signals.
This pipeline is what I have managed to get working at present, as can be seen in the following screen shot:
There are a couple of obvious things:
1. The red sprite is visible over the top border. This is because I don't have border masking active for sprites. This will be easy enough to do, but I will defer it until I have finished the rest of the work on the sprites, as it is convenient in the meantime to see the sprites wherever they are.
2. The sprites are showing a solid block of colour. This is because I haven't implemented the fetching of the bitmap data by the VIC-IV, and feeding it into the sprite pipeline (more on this in a moment).
There are also some things not working that you can't see right now, for example foreground/background priority, and the hardware collision detection stuff.
However, what is clear is that the sprites do work, and the synthesis results show that by using the pipelined approach I described above, the timing of the design in the FPGA is no worse than before. The sprites themselves are currently consuming about 5% of the entire FPGA, which is quite acceptable. The complete design is now consuming about 42% of the FPGA.
Now, back to feeding bitmap data into the sprite pipeline. As I mentioned earlier, at 192MHz it isn't actually possible to feed data into (or extract data out of) all 8 sprites in parallel, because the logic depth and physical distance on the FPGA die becomes too great.
To get around this, I have constructed a data delivery pipeline that allows the VIC-IV to feed bitmap data to any of the 8 sprites, and it is forwarded by each sprite to the following sprite. Thus in return for a latency of 8 cycles, we can deliver bitmap data to any sprite without messing up the timing closure of the design.
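A software model of such a forwarding chain might look like the following Python sketch. The packet format (target sprite number plus one data byte) and the class name are assumptions made for illustration; the real design is in VHDL and may tag its data differently.

```python
class SpriteDataChain:
    """Toy model of a chain that forwards data from sprite to sprite."""

    def __init__(self, stages=8):
        self.slots = [None] * stages                 # packet held by each stage
        self.latched = [[] for _ in range(stages)]   # bytes each sprite accepted

    def clock(self, packet=None):
        """Advance one cycle; packet is (sprite_number, byte) or None."""
        # Every packet moves one stage along; the last one falls off the end.
        self.slots = [packet] + self.slots[:-1]
        for n, held in enumerate(self.slots):
            # A sprite keeps a copy of the data when the packet is addressed to it.
            if held is not None and held[0] == n:
                self.latched[n].append(held[1])


chain = SpriteDataChain()
chain.clock((7, 0xAA))        # data for the last sprite in the chain
for _ in range(7):
    chain.clock()
print(chain.latched[7])       # [170]: delivered on the 8th clock
```

Each stage only ever talks to its immediate neighbour, which is the property that keeps the logic depth and routing distance small.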
This allows the VIC-IV to feed data to the sprites, however, it needs to know what address to fetch the data from.
One of the rather strange tricks the VIC-II used to reduce the number of registers in the design is that a few bytes at the end of screen RAM hold the data pointers for the sprites. The Y position within each sprite is then multiplied by 3 and added to the base address from this pointer to work out which 3 bytes need to be fetched and buffered in each sprite.
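The address calculation can be written out as a short Python sketch. It uses the stock VIC-II layout, where the eight pointers occupy the last 8 bytes of the 1KB screen block (offsets $3F8-$3FF) and each pointer selects a 64-byte block within the current video bank. The function and parameter names are hypothetical.

```python
def sprite_row_addresses(screen_base, sprite, row, read_byte):
    """Return the 3 addresses holding one 24-pixel row of a VIC-II sprite.

    read_byte is any callable that returns the byte at an address within
    the current 16KB video bank.
    """
    pointer = read_byte(screen_base + 0x3F8 + sprite)  # pointer byte in screen RAM
    base = pointer * 64                                # 64-byte sprite data block
    first = base + row * 3                             # 3 bytes per sprite row
    return [first, first + 1, first + 2]


memory = bytearray(0x4000)
memory[0x07F8 + 2] = 0x0D     # sprite 2 points at block 13, i.e. $0340
print([hex(a) for a in sprite_row_addresses(0x0400, 2, 5, memory.__getitem__)])
```

For the default screen at $0400 this reads the pointer from $07FA and yields $034F-$0351 for row 5, which is why the fetch logic needs to know the row each sprite is currently drawing.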
On the VIC-IV, the sprites exist outside of the main design due to the timing issues described above. Thus there has to be a third data pipeline that allows the sprites to tell the VIC-IV the Y position they are currently drawing. The VIC-IV can then fetch the required bytes, and pass them through the data pipeline.
All of these extra paths are plumbed through the sprite pipeline. A few important pieces are not finished yet, but hopefully I will be able to get these done in the not too distant future.
After that, it will be time to implement the VIC-IV enhanced sprites, for which I have a few ideas.