Friday 28 December 2018

M.E.G.A 6502 benchmark for checking cycles per instruction

Recently I was tweaking the 1MHz instruction timing and debugging some related glitching I was seeing, and realised that I didn't have any really good way to check whether the MEGA65 was really running every instruction with the correct number of cycles.  The existing well-known C64 benchmarks, SynthMark64 and BoulderMark, don't really give this kind of direct break-down.  So I wrote one that does:


Basically it works by going through all the legal 6502 opcodes, and timing how many cycles they take.  Actually, it runs each instruction 256 times, so that if the CPU is fast, we can accurately measure fractional cycles taken.  This is of course quite important when running on a MEGA65 at 40MHz!

To achieve this, the test program works fairly similarly to SynthMark64 (and is also written primarily using the CC65 C compiler):  It constructs a test routine for each instruction that sets up a CIA counter, runs the instruction, stops the counter, subtracts the overhead of fiddling with the timer, and then divides the result by 256 to get the number of cycles per instruction.

In reality, it is a bit more complex, as there are a bunch of instructions where we have to get a bit clever-pants. For example, for PHA, we have to pop something off the stack after stopping the timer, while for PLA we have to first put something onto the stack.

Some instructions like RTS require that we push a return address that exactly matches the next instruction in the stream to run.  This turns out to not be that hard to implement. First, insert the instructions to push something to the stack.

 if (!strcmp(instruction_descriptions[opcode],"RTS")) {
    // Push two bytes to the stack for return address
    // (the actual return address will be re-written later
    test_routine[offset++]=0xA9; // LDA #$nn
    rts_hi_offset=offset++;
    test_routine[offset++]=0x48; // PHA
    test_routine[offset++]=0xA9; // LDA #$nn
    rts_lo_offset=offset++;
    test_routine[offset++]=0x48; // PHA
 }

Then once we know the address we should return to, replace that:

 // If instruction was RTS or RTI, rewrite the pushed PC to correctly point here
 if (!strcmp(instruction_descriptions[opcode],"RTS")) {
    addr=(unsigned short)(&test_routine[offset-1]);
    test_routine[rts_lo_offset]=addr&0xff;
    test_routine[rts_hi_offset]=addr>>8;
 }

RTS is also one of the few instructions where we can't actually count 256 executions in one loop, because we would need a bigger stack than exists. Basically any instruction that touches the stack falls in this category. As a result, those instructions will seem a bit faster when running at full speed than they really are.  However, the effect is somewhat less than what SynthMark64 suffers in similar circumstances, where it doesn't correctly adjust for the reduced cost of the overhead of fiddling with the timers.  It is that problem in SynthMark64 that results in impossibly fast reported speeds for NOP instructions on the MEGA65.

Anyway, as can be seen above, the MEGA65 now uses exactly the correct number of cycles per instruction when simulating 1MHz mode.

The three speed figures at the bottom report how fast the machine seems to be, compared with a stock PAL C64.  These figures are based on different weightings for each opcode.  FLAT uses an equal weighting, i.e., it assumes that LDA and RTI both are executed equally frequently, which is clearly not realistic. It does, however, give a good general idea of the machine's speed.  For the other two work-loads, C64 BASIC and BLDERDASH, I used the real-time CPU instruction capture program, ethermon, that I wrote about in the last post to gather statistics.  In fact, I modified it, so that if you give it the -f option for instruction frequencies, it outputs the table in exactly the correct format for me to include in the source for this benchmark program :)

If we then enable full speed on the MEGA65 and run it again, we get something like the following:


First, ignore the information claiming that BRK runs at 255 cycles per iteration. BRK is the single instruction that I haven't implemented testing for, although it should be possible. I got a little lazy, because BRK is not an instruction that tends to get used a lot.  While the test is running you can use the cursor keys to move the cursor around the test field, and it will show you information for the selected opcode. Thus, if you see an instruction that is green or red, you can fairly easily select it, and see exactly the detected difference in speed.  There are still a few wrinkles in this (including that it sometimes displays the line twice, as above, but it mostly works.

A bigger problem at the moment is that the reported speeds can report lower than correct sometimes when running on very fast CPUs. This is because the 1MHz ticks of the CIA do not occur every CPU cycle.  Thus it is possible for the number of consumed 1MHz ticks to sometimes be one more than it should be, if the phase difference means that a 1MHz cycle boundary is crossed.

Similarly, the overhead calculation can vary by 1 cycle as well, between zero and one cycle.  I can't think of a really good solution to this jitter, that is machine independent. It really is just a side-effect of trying to measure something that can be only a tiny fraction of a cycle with a clock that counts only whole cycles.  I could try to average the overhead over many tests, and use the average figure, but that would also require some care in case you changed the CPU speed while the test is running.  But at the end of the day, the contribution of the jitter on the overhead is relatively small, and should be mostly self-cancelling, since it will be rather random for each opcode, and thus overall should end up being near the correct average value, and this seems to be played out in practice.

A bigger problem was in implementing this, I accidentally introduced a systemic bias where the jitter was only being counted in the direction of slower measured speeds, combined with running the stack-based instructions only once.  This was solved by summing the overhead of each of 256 iterations of the stack-based instructions as well as the time taken for each instruction, and then deducting the entire overhead at the end, instead of calculating it for each iteration and removing the cases where the overhead was one cycle but the instruction run was measured at zero cycles, i.e., giving an apparent run time of -1 cycles.  By instead summing these over all 256 iterations, these were counted and cancelled with the occasions where a full 1MHz cycle was claimed to be consumed, and on average, approximating the correct time consumed.





We can now see all the cycle times are fractional, and for the faster instructions are reported as .0 cycles (I kept the output fields only 2 chars wide so that it can all fit on the screen at the same time. As a result it is a bit squashed.  I could also make an 80-column C65/MEGA65 only version, but I wanted it to also run on a stock C64).  They also show up in green because they are faster than on a stock C64. If they were slower, they would show up in yellow (because red on blue on a stock C64 is a known disaster colour combination).

By using the cursor keys you can even select an instruction to give more details on, which is indicated by the reverse-video cursor, and the info given below the instruction table:



So we see that the MEGA65 is close to 40x stock C64 speed, depending on the work load, but with considerable variation among the instructions. This makes sense, since its CPU is 40x faster, and has broadly similar native instruction cycle counts, with some faster and some slower.  The reason some take more native cycles than on a 6502 is because the MEGA65's CPU cannot read from an address immediately after computing it from instruction arguments, as 25ns isn't enough time for the data from one memory cell to make its way into the CPU, be computed on (eg, adding an index register), and then be sent out to the RAM address lines.  I keep meaning to use these latent cycles to pre-fetch the opcode of the next instruction, but given the CPU is already so fast, it has never really become a priority.

The careful observer will also see that the 4th workload to the mix is rather unexpected:  The folks on the #c-64 IRC channel pointed me to the C64 bitcoin mining program (yes, someone actually wrote one!).  It is of course so slow as to be useless, even when running at 40x original speed.  But I guess I now have the record for the fastest bitcoin mining C64, at a rate of 100 nonces in a mere 6 seconds!

Talking earlier about the real-time ethernet-based CPU instruction logger, I mentioned that the 100mbit ethernet isn't fast enough to log instructions at full speed at 40MHz.  But we can now easily check just how fast we can do it:



Not bad! We can even run slightly faster than a stock C65, while logging every single instruction and every single video badline / raster line advance event!

As this is a new benchmark (although it will likely evolve a bit, in particular, because the reported speeds are still a bit jittery, and it runs a bit slowly, testing only one instruction per frame, so as to ensure the test is running in the vertical borders to avoid the effects of badlines), it would be great if folks were able to test it on some different systems.  Groepaz has already kindly run it on his Turbo Chameleon, which reported speeds of: FLAT=16.4x , C64 BASIC=17.7x, Boulder Dash = 17.54x and Bitcoin 64 = 15.47x (although these were produced before I fixed the jitter, so actual results might vary somewhat). 

EDIT: I have now uploaded it to CSDB so you can try it out on your own hardware if you wish.  I'd love to receive speed reports for different systems.  Note for now you will need to manually activate acceleration, as the program itself does not do this for you.

No comments:

Post a Comment