"Here is what I am told in a tech support email from Digilent: 1. Nexys4 cannot be programmed with current version of Vivado. We hope problems will be resolved in the next version. Read more on http://www.xilinx.com/support/answers/57385.htm 2. Adept is not able to program Nexys4. And it won’t be. 3. Use iMPACT for programming. They couldn't be bothered to put this very relevant information into their user manual." Yay. This would have been really handy to know. To add to the intrigue and frustration, the URL they point to redirects somewhere useless now, so the answer isn't even directly accessible. Digilent have updated the user guide for the Nexys4 since I got my board, but there is no collated errata section for me to work out what has changed, and reading every word to compare two 28 page documents isn't that thrilling a prospect.
The updated reference guide is here.
Now that I know what I am looking for, I can see that they have updated the relevant text on pages 5-6 to now read:
"Quad-SPI programming can be done using the iMPACT tool included with ISE or the labtools version of Vivado."
This suggests that the latest version of Vivado should be okay, although I am at a loss to know what "labtools version" means. I hope this doesn't mean it isn't in the free version. Downloading Vivado isn't for the faint-hearted, however, as it is several GB just to download. I will need to enlarge my Linux VM to even try. Maybe I will just wipe my Linux VM and reinstall.
Anyway, it suggests that I should be able to program the thing using iMPACT in ISE. This would be great except that I run iMPACT inside a VM on my Mac, and so it can't see the USB programming cable. Some sort of dual-boot would be the other option here. I do have ISE on an older Windows machine here, but haven't succeeded in getting that to program the thing either.
Anyway, currently frustrated with the process, but will keep prodding my way around to find a solution.
This evening I had a few minutes to implement the next IPC improvement I had in mind. This one just implements simple end-of-instruction pipelining for the instructions where it is possible, the same as I have already done for single byte instructions, and the same as what the real 6502 has always done.
The result is a nice little speed up as the pictures show.
This is among the last of the speed ups that I will do before a substantial reimplementation of the CPU to make it table driven. Using a table reduces the FPGA logic consumption a lot, and also has the potential to allow the CPU speed to be increased quite a bit, hopefully to 64MHz or even 96MHz all going well. But it is really the logic reduction that matters, so that I have space to implement the missing features in the C65GS.
Reflecting on the recent benchmark results, and especially that the revision 9 Chameleon is almost exactly the same speed as the C65GS in the bouldermark benchmark, I wondered if there was any low-hanging fruit I could tackle to increase the speed of the C65GS.
The main slow-down with the C65GS is the wait-state on reading chipram. I had tried various ways to suppress the wait-state at its root cause in the FPGA dual-ported block RAM without luck. Then it occurred to me this morning that I could make a single-port shadow RAM that shadows all of chipram. So writing to chipram writes to both, and reads by the CPU would be sourced from the shadow RAM, with no wait-state.
So as a reminder of the state of affairs before today's improvements:
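As a rough illustration of the shadow RAM idea, here is a toy software model in Python. It is only a sketch and not the actual VHDL: the class and method names are made up for this example, and the single wait-state charged on the old read path is an assumption.

```python
class ShadowedChipRam:
    """Toy model: dual-ported chipram plus a single-ported shadow copy.

    All names here are hypothetical; the real design lives in VHDL.
    """

    def __init__(self, size=128 * 1024):
        self.chipram = bytearray(size)  # dual-ported, also read by the video side
        self.shadow = bytearray(size)   # single-ported copy, read only by the CPU
        self.cpu_wait_states = 0        # wait-states charged to the CPU so far

    def cpu_write(self, addr, value):
        # Every CPU write goes to both memories, so the two never diverge.
        self.chipram[addr] = value & 0xFF
        self.shadow[addr] = value & 0xFF

    def cpu_read(self, addr):
        # CPU reads are sourced from the shadow copy: no wait-state.
        return self.shadow[addr]

    def cpu_read_unshadowed(self, addr):
        # The old path, for comparison. ASSUMPTION: one wait-state per read.
        self.cpu_wait_states += 1
        return self.chipram[addr]

    def video_read(self, addr):
        # The video side keeps reading the dual-ported chipram as before.
        return self.chipram[addr]
```

The cost of the approach is a second copy of chipram in block RAM, which is traded for removing the wait-state from every CPU read.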
Removing the wait state on chipram by implementing the shadow RAM had quite a nice impact:
Function calls are about 30% faster, and RAM operations in general are all moderately improved, as might be expected. This also got bouldermark quite a bit faster.
In the process I realised what should have been obvious to me, that implied/accumulator mode single-byte instructions were still taking two cycles, and could be easily reduced to one cycle. This makes NOPs run at an amazing 71x, and pushed the overall rating up a little to 26.9x:
BoulderMark now indicates just over 55x. I am still at a loss why the machine is so much faster than a stock C64 for BoulderMark, but the same phenomenon is visible with the latest version of the Chameleon that gets a rating of around 14,000 (see http://wiki.icomp.de/wiki/C64_Benchmarks). That's a mystery that will have to remain for now.
In the meantime, I have a couple more ideas to improve performance that I will try.
Running bouldermark I realised that multicolour text and graphics modes had some major problems.
So I revisited the VHDL code for them, found that multi-colour graphics was simply not implemented at all, and multi-colour text mode had some big fat bugs.
A bit of poking around and sorting things out has it much better, although far from perfect. For example, some games seem to be largely working now:
And bouldermark, while still not perfect, has the right colours for the most part, although one of the multi-colour bit combinations is pulling the colour from the wrong place, as can be seen by the bricks being mostly invisible in the last image. But it is progress nonetheless.
Over the last couple of days with the help of Max, I have identified and fixed some PLA/MMU bugs for C64 mode. These bugs were stopping anything that touched $01 from working. With those bugs fixed, the machine is now able to run some simple software.
Turbo assembler now works 100%, which makes it easier to write and run little programs for testing other hardware features.
Also, some games, like Lemmings and Wizball, are showing signs of running, but with various graphics problems.
Bouldermark is now able to run, too, although with somewhat garbled display. So I was finally able to see how the machine compares to Chameleon and SCPU in that benchmark. This is interesting to me, since I know that SynthMark64 is not that representative.
Here is a quick clip of it running.
Note that the raster splits are happening on the wrong rasters. This is a problem common to all software on the C65GS right now because the display is 1248 pixels high, to be exactly 4x PAL. But the default resolution is 5x pixels so that the top and bottom borders aren't huge. This means that I need to make $D012 not count monotonically, so that it can advance faster in the top border, so that it can match the VIC-II raster numbers exactly once it reaches the main display. Then in the bottom border it can speed up again so that all PAL raster lines exist.
Back to the point, here is the result: 14,380. This is only slightly ahead of the latest revision of the Chameleon64, despite the C65GS being about twice as fast when measured using SynthMark64. So the C65GS remains marginally the fastest C64 option, but of course the performance is still subject to change as I finish implementing the many missing features.
With Max the work experience student we pushed ahead today and got the C65GS living inside the case of a dead C64C, and added a few missing features, most notably $D418 now works for digital audio (still no SID voices for now).
I had previously tried using a Keyrah v2 with the C65GS, but the keyboard layout was completely bananas. It turns out I must have mis-read the instructions, and had one of the jumpers wrong. With that fixed, we were able to use the Keyrah with the FPGA board.
From there it was fairly easy sailing to get the board sitting inside a C64 case, although there are cables coming out of all sorts of holes, and in some cases going back into other holes. USB goes from the Keyrah in the C64 power socket location to the FPGA's USB port to connect them together. Actual power comes in via a USB cable in the cassette port. Audio comes out via the RF hole using a 3.5mm audio jack. The micro-SD card is accessible via the cassette port. The VGA lead snakes out of the expansion port. Clearly this will all need to be tidied up over time.
With all of that together, it was mostly working. The Keyrah uses a different keyboard layout from my rather arbitrarily chosen one, so I needed to fix the usb keyboard decoder in the C65GS. Getting a Mac or Linux box to actually tell me the right scan codes for each key proved tricky. So instead I modified the FPGA config so that $D6F6 and $D6F7 contain the full 12-bit scan code for the last key pressed or released. With that I was able to walk through each key and get the right scan codes. I am now compiling those into the FPGA config.
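For what it is worth, here is a small Python sketch of how the two register values could be combined into one scan code and collected into a table while walking through the keys. The byte order is an assumption ($D6F6 holding the low 8 bits and the low nibble of $D6F7 holding the upper 4 bits), and the function names are invented for this example.

```python
def scan_code_from_registers(d6f6, d6f7):
    """Combine the two register bytes into a 12-bit scan code.

    ASSUMPTION: $D6F6 holds bits 0-7 and the low nibble of $D6F7 holds
    bits 8-11. The actual register layout may differ.
    """
    return ((d6f7 & 0x0F) << 8) | (d6f6 & 0xFF)


def build_key_table(observations):
    """Build a scan code -> key name table.

    observations is an iterable of (key_name, d6f6, d6f7) tuples, noted
    down by pressing each key in turn and reading the two registers.
    """
    return {scan_code_from_registers(lo, hi): name
            for name, lo, hi in observations}
```

The register values fed in would come from peeking $D6F6 and $D6F7 after each key press; the values in any example table are placeholders, not real Keyrah scan codes.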
But even with the slightly wonky keyboard layout in the meantime, it is possible to drive the machine to do simple things. So we wrote a little program that plays some digital audio out through the newly implemented interface. In the following video you can see the C65GS in its shell, including loading and playing the digital audio:
Because we are using the Keyrah interface (for now), there is no convenient way to reset the machine. So I am also rebuilding the FPGA config with logic that interprets a long press on RESTORE as a reset, similar to some other C64 mods and replacement main-boards that are available. We will see how that works tomorrow, all going well.
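The long-press idea boils down to a counter that runs while RESTORE is held and fires once when it reaches a threshold. Here is a minimal Python sketch of that logic; the class name, the tick-based interface and the threshold value are all assumptions for illustration, and the real thing is FPGA logic rather than software.

```python
class RestoreLongPress:
    """Toy model of long-press detection on the RESTORE key."""

    def __init__(self, threshold_ticks):
        # ASSUMPTION: the hold time is measured in clock ticks.
        self.threshold = threshold_ticks
        self.held = 0

    def tick(self, restore_down):
        """Call once per tick; returns True on the tick a reset should fire."""
        if not restore_down:
            # Releasing the key restarts the count.
            self.held = 0
            return False
        self.held += 1
        # Fire exactly once, when the hold time first reaches the threshold.
        return self.held == self.threshold
```

Comparing with `==` rather than `>=` means a key that stays held does not keep re-triggering the reset.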
We have a work experience student called Max in the lab this week, and he is interested in low-level hardware type stuff. So I asked if he would like to explore writing some tests for the C65GS.
In a previous post I had commented on a bug where scrolling the display in C65 mode results in the colour RAM going all weird. We set about starting to write some tests for the DMA controller to try to find and fix that bug.
While Max is only 15, he has a good attitude and interest, and for someone who has never written anything in any assembly language or really used an 8-bit computer before, he is doing pretty well.
After a bit of poking about and talking about the bug with Jeremy as well, I came to the realisation that the bug was not in DMAgic, but in the colour RAM. A little more explanation is required.
On the C64 the colour RAM is a separate 1KB x 4 bit memory. The C65, however, can require almost 2KB of colour RAM for 80 column mode. Also, the upper four bits of the colour RAM represent extended attributes. As a result the C65 needs 2KB of 8-bit RAM for colour RAM. Rather than require a separate part, the designers of the C65 placed its colour RAM in the top 2KB ($1F800-$1FFFF) of the 128KB main memory.
This means that a stock C65 actually has a little bit less RAM available than a C128. Actually, a stock C65 has quite a bit less available RAM, because the DOS eats another 8KB of RAM. The overall memory map in C65 looks something like:
         RAM LO                    RAM HI
$FFFF -
$FEFF -                            COLOUR RAM (2KB)
$F800 -  BASIC 10 PROGRAM TEXT
         ($2000 - $FEFF)           BASIC 10
                                   VARIABLES & STRINGS
$2000 -
         SCREENS &                 INTERNAL 3.5"
         SYSTEM DATA               DRIVE DOS
$0000 -
So all up, only 128KB - 8KB - 8KB - 2KB - 0.25KB = 128KB - 18.25KB = 109.75KB is available for BASIC 10 in C65 mode.
But when I began designing the C65GS, I had in mind that it would support much larger text modes than the C64 or C65. With a native resolution of 1920x1200, it is possible to run a 240x150 character text mode. This means we need up to 36,000 bytes (35.2KB) of colour RAM. It seemed bad enough to lose 2KB of precious chip RAM, let alone losing more than 1/4 of all RAM, just for colour data for text mode. Factoring in the actual screen RAM, this would mean that 240x150 text mode would consume more than 1/2 of the total memory. That just wasn't going to fly.
My preferred solution was to have 256KB or 512KB of chipram, but the FPGA I am using can't combine that much BRAM at a high enough clock speed. However, I was pleasantly surprised to find that I could make a 64KB x 8-bit memory as well as the 128KB chipram (in its 16KB x 64-bit form factor). So I implemented colour RAM that way, mapping it to $D800-$DBFF (or $DFFF when the right bit is set in the VIC-III), as well as at the C65GS extended address $FF80000-$FF8FFFF.
All was happy until I remembered that the C65 direct maps the colour RAM at $1F800-$1FFFF, as described above. My solution to this was to tweak the C65GS memory map, so that it also mapped at $1F800-$1FFFF, masking the 2KB of chipram there, a bit like how memory locations $0000 and $0001 are masked on the C64. In fact, like those locations on the C64, there is no way for the CPU (or DMAgic, since DMAgic on the C65GS is just the CPU in drag) to access the last 2KB of chipram on the C65GS. You would have to use some sort of reflective process to even read them, like the sprite trick that can be used to read $0000/$0001 on the C64.
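The memory budget figures quoted above are easy to check with a few lines of Python. This is just the arithmetic from the text written out; the variable names are mine, and the 0.25KB region is left unnamed because the memory map does not label it.

```python
KB = 1024
TOTAL_RAM = 128 * KB

# Regions not available to BASIC 10 in C65 mode, in the order they are
# subtracted in the text.
reserved = [
    8 * KB,          # screens and system data
    8 * KB,          # internal 3.5" drive DOS
    2 * KB,          # colour RAM
    int(0.25 * KB),  # remaining 0.25KB (not labelled in the memory map)
]
basic_available_kb = (TOTAL_RAM - sum(reserved)) / KB  # 109.75

# Colour RAM for a 240x150 text mode: one byte per character cell.
colour_bytes = 240 * 150                 # 36,000 bytes
colour_kb = colour_bytes / KB            # roughly 35.2KB
colour_fraction = colour_bytes / TOTAL_RAM

# Screen RAM needs another byte per cell on top of the colour data.
text_mode_fraction = (2 * colour_bytes) / TOTAL_RAM

print(basic_available_kb, round(colour_kb, 1))
print(colour_fraction > 0.25, text_mode_fraction > 0.5)
```

Both comparisons on the last line come out true, which matches the "more than 1/4" and "more than 1/2" claims.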
This is a bit of a hack architecturally, since it means that you can't actually use those two kilobytes of chipram as real chipram now. However, that is more or less the case on a real C65, as the colour RAM cannot be relocated elsewhere. It is possible to point bitplanes there, which would produce different results on the C65GS. Maybe one day I will have to revisit this, and make some magic that copies writes to $1F800 - $1FFFF through to the colour RAM, so that it has perfect compatibility.
Coming back to the point, it turns out that the logic to write to the colour RAM image at $1F800-$1FFFF was buggy, and would return the CPU to the wrong micro-code state. I had noticed something odd in this regard previously, in that the serial monitor interface would get upset if you tried to write to colour RAM at $1F800 - $1FFFF.
Some more poking around revealed that accessing colour RAM at $FF8xxxx worked perfectly, so I adjusted the memory read/write parts of the 45GS10 CPU to translate the addresses with a cleaner abstraction layer than previously. After an FPGA rebuild, the result was happiness, with all the weirdness going away. Scrolling now works correctly, and without DMA going bananas when writing to colour RAM, the DOS context switch state information doesn't get destroyed, so a directory listing can scroll off the bottom with the DIR command in C65 mode. See video below:
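To make the address translation idea concrete, here is a hedged Python sketch of mapping the three colour RAM windows onto a single offset into the 64KB colour RAM. It is not the real 45GS10 logic: the function name and flags are invented, it assumes every window starts at offset 0 of colour RAM, and it models the $D800 I/O window with a simple flag rather than the real banking rules.

```python
def colour_ram_offset(addr, io_visible=False, two_kb_window=False):
    """Return the offset into colour RAM for addr, or None if not colour RAM.

    ASSUMPTIONS (for illustration only):
    - all windows alias colour RAM starting at offset 0;
    - io_visible stands in for "the I/O area is banked in", so that the
      16-bit address $D800 refers to colour RAM;
    - two_kb_window stands in for the VIC-III bit that extends the I/O
      window from $DBFF to $DFFF.
    """
    # C65GS extended address: the full 64KB of colour RAM.
    if 0xFF80000 <= addr <= 0xFF8FFFF:
        return addr - 0xFF80000
    # C65-compatible image that masks the top 2KB of chipram.
    if 0x1F800 <= addr <= 0x1FFFF:
        return addr - 0x1F800
    # C64-style I/O window.
    if io_visible:
        top = 0xDFFF if two_kb_window else 0xDBFF
        if 0xD800 <= addr <= top:
            return addr - 0xD800
    return None
```

With a single translation step like this, every window ends up on the same access path that already worked for $FF8xxxx, instead of each window having its own read/write logic.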
I haven't done anything much lately on the C65GS, but it occurred to me that the last video was now rather out of date, and didn't show BASIC working in C65 mode or various other developments since then. So here is a 2 minute video showing it in action, running SynthMark64, and fiddling about in BASIC in C64 mode as well.