Thursday, December 21, 2017

Instruction-Level Timing Accuracy is here

It is summer time here in Australia, which means that it is time for holidays for me.  For those of you reading from the Northern Hemisphere, the idea of Christmas in summer may sound crazy. However, I can assure you that this is not the case. It is in fact COMPLETELY INSANE. You have all the rush of Christmas, which should be happening when the weather outside is horrible, and the days short and dark, instead happening just when you really want to be going on holidays instead. So 1/3 of our summer (seasons start on the 1st down here rather than the 21st because of the lower thermal mass of the Southern Hemisphere due to the lack of large continents) disappears into the stress of Christmas, including several days of food-induced coma caused by trying to sustain our European traditions of having a big hot meal for Christmas, even though it might be 42C outside.  Then, come late June, when our weather is at its worst (but this is Australia, so it just means 8 hours of day light, and daytime temperatures around 10C, and nights as cold as 3C. We do however get some pretty nasty wind courtesy of the same lack of large land masses in this half of the world), we have not even a single public/bank holiday for almost four months.

Anyway, we nonetheless survive, and it means I have some time to potter away on finishing off the core functionality of the MEGA65.  Thus, there will hopefully be a number of posts over the next few weeks, before I have to dive head-long back into work.

What I have been working on the last few days is getting the new video modes sorted out once and for all (this has held things up for about a year now), and get accurate CPU timing, at least at the instruction level (cycle-accurate timing within instructions will come a bit later).

So, todays screen shots are of SynthMark64 running on the latest version of the MEGA65 with the CPU at 1MHz, 2MHz (C128-style "fast" mode), C65 native 3.5MHz mode, and full-speed at 50MHz.

I had added a framework for assigning instruction cycle counts some time ago, but hadn't had the opportunity to tune it for some time.  So, while it was close-ish, it was still out by upto 25% for some instruction types. However, after spending a few hours tracking down the problems, including realising I was mistakenly added the one cycle penalty to relative branches only when the branch did not cross a memory page, instead of when it did cross a memory page, I had it pretty much right, as you can see below:

For comparison, here is the MEGA65 running at full speed, approximately 51x faster, which makes sense, since the CPU clock speed is almost exactly that much faster than a PAL C64.  However, as you can see, there is quite a bit of irregularity. 

First, NOPs are listed as a124x faster. This is plainly impossible, as at 50MHz, this would require NOPs to take about 0.8 cycles each, which would take a very special CPU to do.  What I think is going on is that the adjustment in the calculation in SynthMark that substracts the time it takes to setup the timer for the test is based on that code running at 1MHz, not 50MHz, so it makes the end result look faster.

Second, there are some instructions which are simply faster on a clock-for-clock basis on the MEGA65. In particular, function calls are faster because JMP and RTS are a cycle faster each, and register ops are quite a bit faster, because things like INX are single cycle on the 4502.

Third, there are some instructions that are slower, in particular loads and things that load from colour RAM in particular. This is because all load instructions currently take one cycle more on the MEGA65 than on a real 6502 or 4502.  While rather annoying, it isn't a super high priority for me to fix right now.  The MEGA65 is already very very fast.

Back to testing the C64 compatibility modes, we have 2MHz mode. This doesn't exist on the C65, but since enough software for the C64 tries to use C128 2MHz mode, it has long since been implemented in the MEGA65. Basically, if the M65 is in C64 mode, then it emulates $D030.  In C65 mode, $D030 is replaced by the VIC-III memory banking register. Anyway, as for 1MHz mode, we see that the result is pretty much spot on.

So then I tested 3.5MHz mode. Here, things are a bit different, because the C65 tries to match 6502 cycle timing when running at 1MHz for compatibility. This means in practice that it adds a dummy cycle to a bunch of instructions that are otherwise a bit quicker on the 4502. This means that I have to have two separate table of instruction cycle counts, one for "pseudo 6502 mode" and one for "native 4502 mode".  Note that this is independent of the true 6502 vs 4502 mode select that the MEGA65 will soon be getting. Instead, it is just about adjusting the run time of the legal 6502 instructions that are a sub-set of the 4502 instruction set.

I hadn't touched all this for ages, so I wasn't particularly surprised to see some odd things in the result. First of all, NOPs were still only 3.5x faster, not 7x faster, as we would expect if NOPs were really now single cycle, as they are on the 4502.

What we do see is that all the read-modify-write instructions are now a bit faster, because the 4502 doesn't do the dummy write of the original value, as happens on the 6510. As I have discussed previously, this was one of the big sources of incompatibility on the C65, because you could not do INC $D019, ASL $D019 or any of the other read-modify-write instructions to reset a raster interrupt.  The MEGA65 has long since treated $D019 as a special case for these instructions, and then, and only then, does it spend the extra cycle doing the dummy write.

I then took a look at the instruction cycle count table in the MEGA65's CPU, and saw that I had basically copy-pasted the 6502 one, and changed only a few instructions.  So I spent a half-hour or so with, and updated my table.  With that done, magically the instruction timings were now much healthier looking:

This reminds me of the claims that were made about the 4502, including in the link above, that code could run "up to 25% faster" because of the instruction timing improvements.  What we see in SynthMark64 more or less confirms this: We get 4.25x instead of 3.5x, which equates to about at 21% improvement. Of course, this depends on the exact instruction mix you might be running. In any case, it seems that a claim of 25% speed-up is not unreasonable.

Anyway, I am now happy that the CPU speeds are now accurate enough for initial use, e.g., for working with the IEC serial bus, including most fast-loaders.

The other piece that the pictures show, but is not immediately apparent, is that we have the new 800x600 based video modes working, in both 50Hz and 60Hz. This is changeable run-time via a register, so you can select PAL or NTSC operation, and get the correct frame, and thus music and game speed interrupt rate.  There are still a few remaining wrinkles to work through on this, but it is mostly working.  Once we have the modes completely settled, I will post about it.