Thursday, December 21, 2017

Instruction-Level Timing Accuracy is here

It is summer time here in Australia, which means that it is time for holidays for me.  For those of you reading from the Northern Hemisphere, the idea of Christmas in summer may sound crazy. However, I can assure you that this is not the case. It is in fact COMPLETELY INSANE. You have all the rush of Christmas, which should be happening when the weather outside is horrible and the days short and dark, happening just when you really want to be going on holidays instead. So 1/3 of our summer (seasons start on the 1st down here rather than the 21st because of the lower thermal mass of the Southern Hemisphere due to the lack of large continents) disappears into the stress of Christmas, including several days of food-induced coma caused by trying to sustain our European traditions of having a big hot meal for Christmas, even though it might be 42C outside.  Then, come late June, when our weather is at its worst (but this is Australia, so it just means 8 hours of daylight, daytime temperatures around 10C, and nights as cold as 3C, though we do get some pretty nasty wind courtesy of the same lack of large land masses in this half of the world), we have not even a single public/bank holiday for almost four months.

Anyway, we nonetheless survive, and it means I have some time to potter away on finishing off the core functionality of the MEGA65.  Thus, there will hopefully be a number of posts over the next few weeks, before I have to dive head-long back into work.

What I have been working on for the last few days is getting the new video modes sorted out once and for all (this has held things up for about a year now), and getting accurate CPU timing, at least at the instruction level (cycle-accurate timing within instructions will come a bit later).

So, today's screenshots are of SynthMark64 running on the latest version of the MEGA65 with the CPU at 1MHz, 2MHz (C128-style "fast" mode), C65 native 3.5MHz mode, and full-speed at 50MHz.

I had added a framework for assigning instruction cycle counts some time ago, but hadn't had the opportunity to tune it since.  So, while it was close-ish, it was still out by up to 25% for some instruction types. However, after spending a few hours tracking down the problems, including realising I was mistakenly adding the one cycle penalty to relative branches only when the branch did not cross a memory page, instead of when it did cross a memory page, I had it pretty much right, as you can see below:
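To make the branch rule concrete, here is a minimal Python sketch of the standard NMOS 6502 timing for relative branches: 2 cycles when not taken, 3 when taken, and 4 when the taken branch lands in a different page from the instruction that follows it. The function names are made up for illustration, and this is not the MEGA65's actual implementation, only the rule it has to reproduce.

```python
def branch_target(pc, offset_byte):
    """Target of a relative branch located at pc (hypothetical helper).

    The 8-bit offset is signed and is applied to the address of the
    instruction following the two-byte branch.
    """
    offset = offset_byte - 0x100 if offset_byte & 0x80 else offset_byte
    return (pc + 2 + offset) & 0xFFFF


def branch_cycles(pc, offset_byte, taken):
    """Cycle count for a relative branch using NMOS 6502 rules."""
    if not taken:
        return 2
    next_pc = (pc + 2) & 0xFFFF
    target = branch_target(pc, offset_byte)
    # The extra cycle applies only when the high byte changes,
    # i.e. when the branch DOES cross a page.
    crossed_page = (next_pc & 0xFF00) != (target & 0xFF00)
    return 4 if crossed_page else 3
```

The bug described above amounts to inverting the `crossed_page` test, which makes the common short branch one cycle too slow and the rare page-crossing branch one cycle too fast.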

For comparison, here is the MEGA65 running at full speed, approximately 51x faster, which makes sense, since the CPU clock speed is almost exactly that much faster than a PAL C64.  However, as you can see, there is quite a bit of irregularity. 

First, NOPs are listed as 124x faster. This is plainly impossible, as at 50MHz, this would require NOPs to take about 0.8 cycles each, which would take a very special CPU to do.  What I think is going on is that the adjustment in the calculation in SynthMark that subtracts the time it takes to set up the timer for the test is based on that code running at 1MHz, not 50MHz, so it makes the end result look faster.
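The 0.8 cycle figure can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes the usual PAL C64 clock of 985,248 Hz, a MEGA65 clock of exactly 50MHz, and the standard 2-cycle NOP on the 6502. The names are made up for illustration.

```python
PAL_C64_HZ = 985_248      # assumed PAL C64 CPU clock
MEGA65_HZ = 50_000_000    # assumed full-speed MEGA65 clock
NOP_CYCLES_6502 = 2       # a NOP takes 2 cycles on a 6502/6510


def implied_cycles(reported_speedup, cycles_on_c64=NOP_CYCLES_6502):
    """Cycles per instruction at 50MHz implied by a reported speed-up."""
    c64_seconds = cycles_on_c64 / PAL_C64_HZ
    fast_seconds = c64_seconds / reported_speedup
    return fast_seconds * MEGA65_HZ


# Ratio of the two clocks, a little under 51.
clock_ratio = MEGA65_HZ / PAL_C64_HZ

# A reported 124x speed-up would need a NOP to take less than one cycle.
nop_cycles_at_50mhz = implied_cycles(124)
```

Under these assumptions `nop_cycles_at_50mhz` comes out at roughly 0.82, and even a single-cycle NOP could only reach about twice `clock_ratio`, so just over 100x.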

Second, there are some instructions which are simply faster on a clock-for-clock basis on the MEGA65. In particular, function calls are faster because JMP and RTS are a cycle faster each, and register ops are quite a bit faster, because things like INX are single cycle on the 4502.

Third, there are some instructions that are slower, in particular loads, and especially loads from colour RAM. This is because all load instructions currently take one cycle more on the MEGA65 than on a real 6502 or 4502.  While rather annoying, it isn't a super high priority for me to fix right now.  The MEGA65 is already very, very fast.

Back to testing the C64 compatibility modes, we have 2MHz mode. This doesn't exist on the C65, but since enough software for the C64 tries to use C128 2MHz mode, it has long since been implemented in the MEGA65. Basically, if the M65 is in C64 mode, then it emulates $D030.  In C65 mode, $D030 is replaced by the VIC-III memory banking register. Anyway, as for 1MHz mode, we see that the result is pretty much spot on.

So then I tested 3.5MHz mode. Here, things are a bit different, because the C65 tries to match 6502 cycle timing when running at 1MHz for compatibility. This means in practice that it adds a dummy cycle to a bunch of instructions that are otherwise a bit quicker on the 4502. This means that I have to have two separate tables of instruction cycle counts, one for "pseudo 6502 mode" and one for "native 4502 mode".  Note that this is independent of the true 6502 vs 4502 mode select that the MEGA65 will soon be getting. Instead, it is just about adjusting the run time of the legal 6502 instructions that are a subset of the 4502 instruction set.
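The two-table idea can be sketched in Python as follows. The table contents are only a tiny illustrative sample: the 6502 counts are the standard ones, while the 4502 counts simply encode what this post says (single-cycle NOP and INX, and read-modify-write without the dummy write, assumed here to save one cycle). The names are hypothetical and the real tables live in the MEGA65's CPU source.

```python
# Illustrative cycle counts only, not the complete MEGA65 tables.
CYCLES_PSEUDO_6502 = {"NOP": 2, "INX": 2, "INC abs": 6}
CYCLES_NATIVE_4502 = {"NOP": 1, "INX": 1, "INC abs": 5}


def cycles_for(mnemonic, native_timing):
    """Pick the cycle count from the table matching the timing mode."""
    table = CYCLES_NATIVE_4502 if native_timing else CYCLES_PSEUDO_6502
    return table[mnemonic]


def speedup_over_1mhz(mnemonic, clock_mhz, native_timing):
    """Speed-up relative to the same instruction with 6502 timing at 1MHz."""
    baseline = CYCLES_PSEUDO_6502[mnemonic]
    return clock_mhz * baseline / cycles_for(mnemonic, native_timing)
```

With these sample values, `speedup_over_1mhz("NOP", 3.5, native_timing=True)` gives 7.0, while using the 6502 table at 3.5MHz gives only 3.5.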

I hadn't touched all this for ages, so I wasn't particularly surprised to see some odd things in the result. First of all, NOPs were still only 3.5x faster, not 7x faster, as we would expect if NOPs were really now single cycle, as they are on the 4502.

What we do see is that all the read-modify-write instructions are now a bit faster, because the 4502 doesn't do the dummy write of the original value, as happens on the 6510. As I have discussed previously, this was one of the big sources of incompatibility on the C65, because you could not do INC $D019, ASL $D019 or any of the other read-modify-write instructions to reset a raster interrupt.  The MEGA65 has long since treated $D019 as a special case for these instructions, and then, and only then, does it spend the extra cycle doing the dummy write.
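Why the dummy write matters for $D019 can be shown with a toy model in Python. It assumes only the well-known behaviour that writing a 1 to a flag bit of the VIC-II interrupt register clears that flag. The class and function names are made up, and only the four low flag bits are modelled.

```python
class IrqLatch:
    """Toy write-1-to-clear interrupt latch, loosely modelled on $D019."""

    def __init__(self, flags=0):
        self.flags = flags & 0x0F

    def read(self):
        return self.flags

    def write(self, value):
        # Each 1 bit written acknowledges (clears) that flag.
        self.flags &= ~value & 0x0F


def inc_rmw(reg, dummy_write):
    """INC on a register, with or without the 6510-style dummy write."""
    old = reg.read()
    if dummy_write:
        reg.write(old)            # 6510 writes the original value back first
    reg.write((old + 1) & 0xFF)   # then writes the modified value


with_dummy = IrqLatch(flags=0x01)     # raster interrupt pending (bit 0)
inc_rmw(with_dummy, dummy_write=True)

without_dummy = IrqLatch(flags=0x01)
inc_rmw(without_dummy, dummy_write=False)
```

In this model the dummy write is what clears the raster flag: `with_dummy.flags` ends up 0, while `without_dummy.flags` is still 1 because only the incremented value ($02) was written.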

I then took a look at the instruction cycle count table in the MEGA65's CPU, and saw that I had basically copy-pasted the 6502 one, and changed only a few instructions.  So I spent a half-hour or so updating my table.  With that done, magically the instruction timings were now much healthier looking:

This reminds me of the claims that were made about the 4502, including in the link above, that code could run "up to 25% faster" because of the instruction timing improvements.  What we see in SynthMark64 more or less confirms this: We get 4.25x instead of 3.5x, which equates to about a 21% improvement. Of course, this depends on the exact instruction mix you might be running. In any case, it seems that a claim of 25% speed-up is not unreasonable.

Anyway, I am now happy that the CPU speeds are accurate enough for initial use, e.g., for working with the IEC serial bus, including most fast-loaders.

The other piece that the pictures show, but is not immediately apparent, is that we have the new 800x600 based video modes working, in both 50Hz and 60Hz. This is changeable at run-time via a register, so you can select PAL or NTSC operation, and get the correct frame rate, and thus the correct interrupt rate for music and game speed.  There are still a few remaining wrinkles to work through on this, but it is mostly working.  Once we have the modes completely settled, I will post about it.


  1. I'm curious now how accurate the in-opcode timing is (at 1MHz). I mean, at least some "high-end tech" demos depend on the exact cycle in which the actual read/write operation is done within the opcode execution itself. This work, as far as I can see, is "only" (sorry about the wording, nothing offensive meant here) about the total execution time of the opcodes. But anyway, that is the case with Xemu too (and there not even 6502 timing, btw). Another thing that came to mind: with the 6502 (NMOS) CPU "personality", there will be a need at some point to emulate not only the illegal opcodes (and how will KILL opcodes be handled: executing a reset instead, or some kind of warning?) but also instructions that behave differently or are even buggy on a real NMOS 6502. For example, ADC in BCD mode has some flag differences even between the 65C02 (not CE ...) and the 6502, especially with "invalid" BCD data, if I remember correctly. Also, there is a bug with indirect jump if the "vector" being read crosses a page boundary: a real NMOS 6502 will read e.g. C1FF and C100 instead of C1FF and C200 if there is a JMP (C1FF). The 65C02 already corrected this (and I guess thus the 65CE02 too), however for C64 compatibility the original NMOS version's behaviour should count, I think.

    1. Hello,

      As I indicated in the post, this milestone is only getting the timing of whole instructions accurate. We will get the cycle timing within opcodes accurate later on, when we target C64 compatibility improvements. Right now, the focus is on C65 compatibility and general operation.

      As for the BCD bug etc, as there is no known C65 software that depends on the 65CE02 "fixes", we are emulating the original NMOS/C64 behaviour. BCD flags are an example of this, and also the jump indirect as well, from memory. Those sorts of things are quite easy to tweak, however, so even if we don't have them all right just now, they won't be too hard to fix.

    2. Thanks. Not to argue, but I can sense some danger here. In 65CE02, "CE" means (AFAIK) "65C02 enhanced" or something similar. Thus some may expect that 65C02 code runs the same way, and since some future M65 coders have a 65C02 background (some other micros used the 65C02, not the NMOS 6502), it can cause problems if the M65 uses the NMOS 6502 behaviour even in "native" mode. But for sure, just looking at the Mega65, it's more like a new product without too much existing software it must be compatible with. It was only a quick thought because of my personal experience, since when I "met" the 65CE02, I also used my previous 65C02 experience to understand its concept at first glance [i.e. (ZP) mode is really (ZP),Z if Z=0, etc etc].

  2. Is CIA timing also register selectable? I've been living and breathing user port stuff the last year, and since I discovered that my modem works with the C65 in C65 mode, I've been contemplating porting some apps over...

    1. As in, switching between 1MHz and 3.5MHz timing, or something else? The CIAs on the MEGA65 run at C64 speed to maximise compatibility. It would be quite easy for us to make this selectable, but I don't think that is needed for your use-case.

    2. A lot of people use timers to count cycles, so maybe a 3.5MHz mode would be handy for the same reason?

    3. The Commodore 128/1571 combination can do burst mode transfers at 2MHz. When using standard KERNAL routines this only gains a little bit of performance, since the GCR decoding routines inside the 1571 are the bottleneck.

      However, with custom loaders that use optimized GCR decoding, the 1571 2MHz mode is more than fast enough to do GCR decoding on the fly: You then have 48-64 cycles, which is enough to read a byte from the disk controller, GCR decode it and write it to the CIA. In that case, the 1571 can transfer a track to the C128 in just one disk revolution, which is extremely fast. The best ever done on a C64/1541 was with the Epyx Vorpal custom format (easier to decode), and that needed 2 revolutions.

      It is not that relevant for the Mega65, but since you are implementing a C128 feature and the CIA speed came up in discussion, I thought I'd mention it.

    4. There is actually some guy who managed single revolution on a 1541:

      So do the CIAs clock at 2MHz for timers in 2MHz mode on a C128?

    5. He can decode the sector in one revolution, but not transmit it over the serial bus. He probably still needs 3 revolutions, but what he did is still impressive and very useful, because it had not been achieved with a standard disk format until now.

      The burst mode of the 1571/C128 also allows you to transmit a decoded byte immediately, because you just need to "sta" it to the CIA. There is no need for manual synchronization of the serial bus, and you don't need to split it into pairs of 2 bits; the CIA does the work for you. This makes a single revolution really practical.

      Indeed the CIAs clock at 2MHz. For example, the timers need to be reprogrammed after a switch to maintain a 50Hz/60Hz interrupt rate.

  3. I was actually thinking about PAL/NTSC timings. I just figured it would be something easily overlooked.

    1. Ah, gotcha. Yes, a single Phi clock is generated for the CPU timing as well as the CIAs, so they should adjust speed according to NTSC/PAL mode selection.
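
As an aside on the indirect jump bug raised in the first comment, the difference between the NMOS 6502 and the corrected 65C02 behaviour can be sketched in a few lines of Python. This is only an illustration with made-up names, using the C1FF example from that comment; it says nothing about how the MEGA65 implements it.

```python
def jmp_indirect_target(memory, pointer, nmos_bug=True):
    """Resolve JMP (pointer) with or without the NMOS page-wrap bug."""
    lo = memory[pointer]
    if nmos_bug:
        # NMOS 6502: only the low byte of the pointer is incremented,
        # so the high byte is fetched from the start of the SAME page.
        hi_addr = (pointer & 0xFF00) | ((pointer + 1) & 0x00FF)
    else:
        # 65C02-style fix: the full 16-bit pointer is incremented.
        hi_addr = (pointer + 1) & 0xFFFF
    hi = memory[hi_addr]
    return (hi << 8) | lo


memory = bytearray(0x10000)
memory[0xC1FF] = 0x34   # low byte of the vector
memory[0xC100] = 0x12   # high byte as fetched by an NMOS 6502
memory[0xC200] = 0x56   # high byte as fetched by a 65C02

nmos_target = jmp_indirect_target(memory, 0xC1FF, nmos_bug=True)
fixed_target = jmp_indirect_target(memory, 0xC1FF, nmos_bug=False)
```

Here `nmos_target` is $1234 and `fixed_target` is $5634, so the two behaviours send JMP ($C1FF) to different addresses.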