Making a C64/C65 compatible computer: On cycle count predictability and related things

Friday 22 April 2016

Some folks have expressed concern that this CPU redesign takes away from the genuine 8-bit computer feel of the MEGA65. My feeling is that it doesn't, as I will explain below, but at the same time I don't want to be dismissive of anyone's concerns. Our goal remains to make something that is authentic and enjoyable for a wide range of people to use and program. So please poke me, either in the comments or elsewhere, if you wish.

But for now, I will take a little time to explain how the CPU looks from the user's perspective, to hopefully provide some assurance that it is not really a great departure from what we already had. Indeed, from what I understand, what we are doing here is not greatly different from how the Chameleon's CPU operates, i.e., some more modern CPU construction techniques are used behind the scenes to provide what is very much (in their case) a 6502.

The main difference is that we are being transparent about how we are making the CPU behind the scenes, so that the end result is a 6502 and 4502 compatible CPU. We're sorry if that spoils the "magic trick" for some, but we strongly believe that transparency is always best in the long run.

The out-of-order instruction retirement is just a fancy way of saying that the CPU takes and executes the instructions in order, but some can take longer to complete than others, for example if they need to read from or write to memory.

What doesn't change is that if an instruction requires a value read from memory, it can't be completed until the thing it depends on is complete. That is, it still behaves exactly as one expects a 6502 to behave, for any given program. This is similar in many ways to the way that the SuperCPU has a 1-byte write-through "cache." We are just using a different mechanism (register renaming, or reservation slots, depending on how you want to look at it) to achieve much the same goal.
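As a rough illustration of the idea (this is not the MEGA65's actual implementation, and the latencies are made up), the Python sketch below issues one instruction per cycle in program order, but lets each instruction complete only once the registers it reads are ready. The names Instr and simulate, and the 3-cycle memory read latency, are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    name: str
    reads: tuple    # registers whose values this instruction needs
    writes: tuple   # registers this instruction produces
    latency: int    # hypothetical cycles from start to completion

def simulate(program):
    """Issue one instruction per cycle, in order; completion may be delayed."""
    ready = {}      # register -> cycle at which its newest value is available
    results = []
    for issue_cycle, ins in enumerate(program):
        # An instruction cannot start before it is issued, nor before the
        # values it depends on exist. It sees the readiness of the values
        # produced by EARLIER instructions only, which is the effect that
        # register renaming provides.
        start = max([issue_cycle] + [ready.get(r, 0) for r in ins.reads])
        done = start + ins.latency
        for r in ins.writes:
            ready[r] = done
        results.append((ins.name, issue_cycle, done))
    return results

# Hypothetical latencies: 3 cycles for the memory read, 1 for everything else.
program = [
    Instr("LDA $1000,x", reads=("X",), writes=("A",), latency=3),
    Instr("STA $2000,x", reads=("A", "X"), writes=(), latency=1),
    Instr("INX", reads=("X",), writes=("X",), latency=1),
]

results = simulate(program)
# Order of completion (ties broken by program order).
retire_order = [name for name, _, _ in sorted(results, key=lambda r: r[2])]

for name, issued, done in results:
    print(f"{name:12s} issued @ cycle {issued}, complete @ cycle {done}")
```

With these made-up latencies, INX finishes before the STA that precedes it, because STA has to wait for the loaded value of A. The program still sees the same register and memory contents as it would on a strictly sequential 6502.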

So if we look at a simple loop:

l1: lda $1000,x
sta $2000,x
inx
bne l1

The simulation of this loop on the new CPU (in its current unfinished form, so there might be some changes) below shows how a couple of loop iterations go through. Note that the register contents shown are those BEFORE the instruction is executed, just because of how the simulation outputs stuff, i.e., it shows the CPU state just before it executes the instruction, instead of just after.

-- LDA / STA / INX / BNE instructions all execute on consecutive cycles, taking a total of only 20ns
@450ns: PC $8104 A:00 X:01 Y:00 Z:00 B:00 SP:01FF --E--I-- : BD 00 10
@455ns: PC $8107 A:00 X:01 Y:00 Z:00 B:00 SP:01FF --E--IZ- : 9D 00 20
@460ns: PC $810A A:00 X:01 Y:00 Z:00 B:00 SP:01FF --E--IZ- : E8 D0 F7
@465ns: PC $810B A:00 X:02 Y:00 Z:00 B:00 SP:01FF --E--I-- : D0 F7 4C
-- 60ns (= 12 CPU cycles) elapse between the branch and the next instruction
@525ns: PC $8104 A:00 X:02 Y:00 Z:00 B:00 SP:01FF --E--I-- : BD 00 10
@530ns: PC $8107 A:00 X:02 Y:00 Z:00 B:00 SP:01FF --E--IZ- : 9D 00 20
@535ns: PC $810A A:00 X:02 Y:00 Z:00 B:00 SP:01FF --E--IZ- : E8 D0 F7
@540ns: PC $810B A:00 X:03 Y:00 Z:00 B:00 SP:01FF --E--I-- : D0 F7 4C
-- 60ns (= 12 CPU cycles) elapse between the branch and the next instruction
@600ns: PC $8104 A:00 X:03 Y:00 Z:00 B:00 SP:01FF --E--I-- : BD 00 10

What can basically be seen above is that the non-branching instructions all take one cycle to run, whether or not they need a memory access, because the out-of-order retirement and register renaming hide that. The result is that the timing is actually somewhat simpler and easier to predict, for the most part, than on a real 6502. Note that we will still have ~1MHz, ~2MHz and ~3.5MHz speed settings, where we will emulate the normal 6502 and 4502 timing of all instructions and, when we get time to do it, make the memory access cycles also match those of a 6502 exactly, and naturally the same for the 3.5MHz 4502 mode. (One of the key reasons for reimplementing the CPU this time is actually to make sure it has two "personalities": in 6502 mode, all illegal opcodes work properly, and in 4502 mode, all 4502 opcodes work properly, with timing that can match exactly, so that we can have a real C64 mode and a real C65 mode, both of which are as compatible as possible.)
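To make the comparison concrete, here is a small Python sketch that reads the issue times off the trace above and sets them beside the standard documented 6502 cycle counts for the same loop. The 5ns cycle length is taken from the trace (60ns = 12 cycles), the 6502 figures assume no page crossings (true here, since $1000 and $2000 are page-aligned and the branch stays within one page), and the 1MHz clock is a nominal round figure. The dictionary and function names are just for illustration.

```python
# Issue times in ns, copied from the first loop iteration of the trace.
TRACE_NS = {"LDA": 450, "STA": 455, "INX": 460, "BNE": 465, "next LDA": 525}
CYCLE_NS = 5  # one step in the trace; 60ns is described as 12 CPU cycles

# Standard NMOS 6502 cycle counts for these addressing modes,
# assuming no page boundary is crossed and the branch is taken.
CYCLES_6502 = {"LDA abs,X": 4, "STA abs,X": 5, "INX": 2, "BNE taken": 3}

def iteration_cycles_new_cpu():
    """Cycles from one LDA to the next LDA in the simulated trace."""
    return (TRACE_NS["next LDA"] - TRACE_NS["LDA"]) // CYCLE_NS

def iteration_cycles_6502():
    """Cycles for one loop iteration on a stock 6502."""
    return sum(CYCLES_6502.values())

new_ns = iteration_cycles_new_cpu() * CYCLE_NS
old_ns = iteration_cycles_6502() * 1000  # nominal 1MHz: 1000ns per cycle

print("new CPU:", iteration_cycles_new_cpu(), "cycles,", new_ns, "ns per iteration")
print("6502   :", iteration_cycles_6502(), "cycles,", old_ns, "ns per iteration")
```

Per iteration, the new design spends 3 cycles on the three non-branching instructions and 12 on the taken branch, so the cycle count is similar to a 6502's 14, while each cycle is far shorter.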

The other obvious thing is that the branch instruction suffers a pretty big penalty, which is because the pipeline takes a bit of time to start feeding the new instructions. However, because the clock speed is 4x, and the main pipeline is 4-stage, the end result is that the branch actually takes exactly the same amount of time as on our previous 48MHz CPU design.
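The arithmetic behind that claim can be sketched in a few lines of Python. This assumes that "4x" means exactly four times 48MHz (192MHz), and that a taken branch therefore corresponds to 12 / 4 = 3 cycles on the previous design. The 3-cycle figure is inferred from the statement above rather than taken from any documentation of the old CPU, and the trace itself uses 5ns steps (200MHz).

```python
def cycles_to_ns(cycles, clock_mhz):
    """Duration in nanoseconds of a number of cycles at a given clock."""
    return cycles * 1000.0 / clock_mhz

OLD_CLOCK_MHZ = 48.0
NEW_CLOCK_MHZ = 4 * OLD_CLOCK_MHZ      # assumption: "4x" is exactly 4 * 48MHz
BRANCH_CYCLES_NEW = 12                 # from the simulation trace
BRANCH_CYCLES_OLD = BRANCH_CYCLES_NEW / 4  # inferred, not documented

new_ns = cycles_to_ns(BRANCH_CYCLES_NEW, NEW_CLOCK_MHZ)
old_ns = cycles_to_ns(BRANCH_CYCLES_OLD, OLD_CLOCK_MHZ)
trace_ns = cycles_to_ns(BRANCH_CYCLES_NEW, 200.0)  # the trace's 5ns cycle

print(f"12 cycles @ 192MHz: {new_ns} ns")
print(f" 3 cycles @  48MHz: {old_ns} ns")
print(f"12 cycles @ 200MHz: {trace_ns} ns")
```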

It's also worth mentioning that most of the sources of timing uncertainty in modern PC processors etc. don't actually come from the pipeline and the other features that we are talking about here. (In fact, the 6502 already pipelines between instructions a little.) They come from the cache, from virtual memory, and from the operating system that is hiding behind your process, pre-empting it all the time and filling the cache with rubbish as a result. We don't have any of that stuff in the MEGA65: What you get is a 6502 or 4502 processor that behaves how you expect. We have just implemented it using some lessons learned over the past few decades of CPU implementation.

Otherwise, I think that this work has got some folks thinking about what makes a machine have character, instead of just being an 8-bit version of an otherwise soulless kind of PC, or an FPGA-centric thing that people build. For us, there are some key things, of which the following are a few. Of course, we are thinking about many other things, such as C64 compatibility, but we simply take these for granted.

First, the video generation MUST be rasterised, without a frame-buffer, just like on a real C64 or C65. That is, the video chip needs to be deciding, cycle by cycle, what colour the next pixel will be, and allow the programmer to do horrible things to it that were never intended. It is already possible, for example, on the VIC-IV in the MEGA65 to cause a monitor to totally lose sync, because you can trick it into moving the HSYNC pulses on a raster line. I get back to this again below, but it is really a very important point. In fact, I would say that what really makes the C64 interesting is the VIC-II and the SID. The CPU, while still important, is really secondary in many ways. It is the custom chips and overall combination that really define the "character" and "personality" of the C64. The MEGA65 will of course have its own personality, and we feel that it will be a very strong one.

Second, it has to still be a simple bare-metal machine, where you have effectively full access to all the hardware when you are running on it. The only piece we have outside that is the Hypervisor, which is best understood as an integrated freeze cartridge, so that you can easily load, save and switch what you are doing.

Third, the machine must still have fundamental limitations that provide opportunity for programmers to try to stretch what the machine can do. This is why we have the combination of CPU and resolution improvements together, for example, so that CPU performance and the number of bits on screen at a time remain in reasonable relation. The C64 has 64000 pixels drawn from not more than 64KB = 512kbit on screen at a time, and 1x10^6 cycles or 3x10^5 instructions per second, so that there is approximately one instruction per bit of displayed graphics per second. The MEGA65 has about 2x10^6 pixels, and is expected to have some multiple of 10^7 instructions per second. Thus the instructions per pixel-bit is increased by an order of magnitude over the C64, so that it offers a nice bit of extra freedom, but without removing the limitation completely. (Compare that with a modern PC, which instead has about 10^10 instructions per second, not counting the 10^12 or more GPU instructions per second.) Moreover, the number of bits per pixel available from RAM is still in proportion: A C64 has about 8 bits per pixel available (64KB / 64000 pixels). The C65 actually has less, because while it has 128KB of RAM, it can do, for example, 640x400 or 1280x400 resolutions. The MEGA65 goes further: it has the same RAM as the C65 but many more pixels, so much more creativity will be required to find solutions for full-screen, full-colour displays, just as this presents special challenges and opportunities for ingenuity on the C64 (and C65). My point here is really that while the boundaries of what is "possible" on the MEGA65 are naturally different to those of the C64, we have retained this sense of a limited computer, so that it still has character, and will still require years of careful thought and experimentation to find its limits.
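For anyone who wants to check the back-of-envelope figures, here is a small Python sketch of the same arithmetic. It uses the round numbers from the paragraph above. The MEGA65 instruction rate of 1x10^7 per second is only the lower end of "some multiple of 10^7", and treating the whole 128KB as displayable bits is a simplifying assumption. The function names are just for illustration.

```python
KB = 1024

def bits_per_pixel(ram_bytes, pixels):
    """Bits of RAM available per displayed pixel."""
    return ram_bytes * 8 / pixels

def instructions_per_bit(instr_per_second, ram_bytes):
    """Instructions per second for each bit of RAM that could be on screen."""
    return instr_per_second / (ram_bytes * 8)

# Bits per pixel available from RAM.
c64_bpp = bits_per_pixel(64 * KB, 64000)
c65_bpp_640 = bits_per_pixel(128 * KB, 640 * 400)
c65_bpp_1280 = bits_per_pixel(128 * KB, 1280 * 400)
mega65_bpp = bits_per_pixel(128 * KB, 2_000_000)

# Instructions per second per bit of displayable RAM.
c64_ipb = instructions_per_bit(3e5, 64 * KB)
mega65_ipb = instructions_per_bit(1e7, 128 * KB)  # assumed lower bound

print(f"C64    : {c64_bpp:.2f} bits/pixel, {c64_ipb:.2f} instr/s per bit")
print(f"C65    : {c65_bpp_640:.2f} bits/pixel @ 640x400, "
      f"{c65_bpp_1280:.2f} @ 1280x400")
print(f"MEGA65 : {mega65_bpp:.2f} bits/pixel, {mega65_ipb:.2f} instr/s per bit")
print(f"MEGA65 / C64 instr-per-bit ratio: {mega65_ipb / c64_ipb:.1f}")
```

Under these assumptions the C64 comes out at a bit over half an instruction per displayed bit per second, and the MEGA65 at roughly an order of magnitude more, with the exact ratio depending on the final CPU speed.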

Finally, the specification has to be fixed for the long term, like the C64's, so that people can program it with confidence, knowing that their code will "just work" on MEGA65s for years and decades to come, because otherwise the limits of the machine are not real. This is actually why we want to get this CPU matter sorted out sooner rather than later, so that we can say with authority, "This is the CPU of the MEGA65. It shall be no faster." Similarly, we want to pin down the last few points on the VIC-IV.

Anyway, as I have said, we want this machine to be fun for the community, and something with a stable and fixed specification once we release it, so that it can have a long life, and so that stuff that you write with cycle-by-cycle timing will keep on working. So please don't hesitate to let us know if you have concerns about our approach, or suggestions for how we can do it better. This is one of the great things about an open-source project: people can look, provide feedback, and help to make sure that the end result is as good as possible. We can't guarantee that we can take everyone's requests and include them (partly because some of you ask for opposite things ;), but we do listen and think carefully about them all.

22 comments:

Unknown, 22 April 2016 at 09:23
I think it is the best way to make as many components as possible be optional (or modular) - and let it be up to the actual user if he turns them on or off from the application currently. These pipelining and caching (etc.) features seem slightly similar to that of the DTV (and the little bit of SuperCPU). Thus, make the 1, 2 and 3.5 MHz modes just exactly cycle-accurate with maximal backwards compatibility for the existing old software made (or new ones being made) for C64, C128 and C65 computers. Plus add your special MEGA65 (formerly 48 MHz) mode. (These have all already been there of course.)

Now, on the top of that, add yet another switch to also activate the extra features with pipelining and caching (in any mode, not depending on the speed). If you make this switch to be activated by the same instruction sequence which the DTV does, then you also gain another backwards compatibility above the others: partially for the DTV (more than nothing, at least). Perhaps you might also add yet another switch in the manner of the SuperCPU... (But only for altering the speed - not the instruction set or other things, of course.)

And finally, to also make it simple at the same time for the user, that POKE0,65 - which turns on ALL of these at once. (And that POKE0,64 which turns off all of these at once, too.)

The 1 and 2 MHz modes should automatically use the 6510, while every other mode the 4510 instruction set.

And then so will everybody be happy.

Froy, 23 April 2016 at 02:07
Saying Mega65 has its own personality and a strong one gave me the greatest smile ever!!! MY EVER NEED TO OWN one...ever so lingers in my veins!

Unknown, 24 April 2016 at 08:34
It's called a CSG 4510 VICTOR (NOT 4502; there is not, never was, and never will be a 4502).

And the C64, and thus the Chameleon, uses a 6510 (not a 6502: that is the CPU of the PET, VIC20 and 1541).

Reading that gives me a headache.

rob, 27 April 2016 at 07:07
You also mentioned that the CPU would be "smaller"... even while it sounds bigger (to me). Is it that you're leveraging capabilities that are provided "for free" in the FPGA's environment, or is there something else that I don't know about CPU design (and there's a lot I don't know about it!)?

Solei, 28 April 2016 at 02:09
Looking forward to see how the new cpu will perform.

Do you plan to further enhance the VicV - and perhaps increase the amount of graphics ram beyond 128kb, or will that break compatibility?

Unknown, 5 May 2016 at 06:09
Hello Paul,

I'm in favor of the new pipelined architecture. I'm wondering if there is a way to expose either values being evaluated in interim stages or status flags to indicate what stage an instruction is in.

Just as an example case, for an INC $add instruction, you have to fetch memory, add, and write it back out - if the value stuck around while in the add phase of the pipeline, maybe we could transfer the value pre-add or post-add into the accumulator?

If status flags were available, it might be helpful when profiling code. The big difference with the existing 6502 is that you can do the math in your head about how many cycles an instruction is going to take, with branching and page boundaries. But with a pipeline, it depends on which stage the pipe is in, whether you can do some math on a different register while fetching from memory, etc. This level of complexity requires profiling tools to optimize code. I remember a simple command-line tool that Hitachi provided for the SH-4 that would visually print out the stages of the pipeline for a given set of instructions.

- Gary

Unknown, 15 May 2016 at 16:49
I'm wondering about interrupt handling/sharing amongst the CPU cores. Will there be a mechanism for determining which sources will go where? Will there be a way of generating interrupts on a core from another one?

Also, will memory mapping be available for each of the cores, independently? Will only one of the cores be able to function as the 6510/4510 with the $00, $01 IO ports? How will you handle the situation where the drive 6502s mustn't have those ports?

Daniel.