Thursday, April 21, 2016

On cycle count predictability and related things

Some folks have expressed their concern that this CPU redesign takes away from the genuine 8-bit computer feel of the MEGA65.  My feeling is that it doesn't which I will explain below, but at the same time I don't want to be dismissive of anyone's concerns. Our goal remains to make something that is authentic and enjoyable for a wide range of people to use and program.  So please poke me, either in the comments or elsewhere if you wish.

But for now, I will take a little time to explain how the CPU looks from the user-perspective, to hopefully provide some assurance that it is not really a great departure from what we already had.  Indeed, from what I understand, what we doing here is not greatly different from how the Chameleon's CPU operates, i.e., some more modern CPU construction techniques are used behind the scenes, to provide what is very much (in their case) a 6502.

The main difference is that we are being transparent how we are making the CPU behind the scenes, so that it gives the end result of being a 6502 and 4502 compatible CPU.  We're sorry if that spoils the "magic trick" for some, but we strongly believe that transparency is always best in the long run.

The out-of-order instruction retirement is just a fancy way of saying that the CPU takes and executes the instructions in order, but some can take longer to complete, for example if they need to read or write from memory.

What doesn't change, is if an instruction requires the value read from memory, that it can't be completed until the thing it depends on is complete.  That is, it still behaves exactly as one expects a 6502 to behave, for any given program.  This is quite similar in many ways to the way that the SuperCPU has a 1-byte write-through "cache."  We are just using a different mechanism (register renaming, or reservation slots, depending on how you want to look at it), but to achieve much the same goal.

So if we look at a simple loop:

l1: lda $1000,x
sta $2000,x 
bne l1         

The simulation of this loop for the new CPU (in its current unfinished form, so there might be some changes) below shows how a couple of loop iterations go through. Note that register contents are BEFORE the instruction is executed, just because of how the simulation outputs stuff.  i.e., it shows the CPU state just before it executes the instruction, instead of just after.

-- LDA / STA / INX / BNE instructions all execute on consecutive cycles, taking
-- a total of only 20ns
@450ns: PC $8104 A:00 X:01 Y:00 Z:00 B:00 SP:01FF --E--I--  :  BD 00 10
@455ns: PC $8107 A:00 X:01 Y:00 Z:00 B:00 SP:01FF --E--IZ-  :  9D 00 20
@460ns: PC $810A A:00 X:01 Y:00 Z:00 B:00 SP:01FF --E--IZ-  :  E8 D0 F7
@465ns: PC $810B A:00 X:02 Y:00 Z:00 B:00 SP:01FF --E--I--  :  D0 F7 4C
-- 60ns ( = 12 CPU cycles) elapse between the branch and the next instruction
@525ns: PC $8104 A:00 X:02 Y:00 Z:00 B:00 SP:01FF --E--I--  :  BD 00 10
@530ns: PC $8107 A:00 X:02 Y:00 Z:00 B:00 SP:01FF --E--IZ-  :  9D 00 20
@535ns: PC $810A A:00 X:02 Y:00 Z:00 B:00 SP:01FF --E--IZ-  :  E8 D0 F7
@540ns: PC $810B A:00 X:03 Y:00 Z:00 B:00 SP:01FF --E--I--  :  D0 F7 4C
-- 60ns ( = 12 CPU cycles) elapse between the branch and the next instruction
@600ns: PC $8104 A:00 X:03 Y:00 Z:00 B:00 SP:01FF --E--I--  :  BD 00 10

What can basically be seen above is that the non-branching instructions all take one cycle to run, whether or not they need a memory access, because all the out-of-order retirement and register renaming hides that. The result is that the timing is actually somewhat simpler and easier to predict for the most part than on a real 6502.  Note that we will still have a ~1MHz, ~2MHz and ~3.5MHz speed settings, where we will emulate the normal 6502 and 4502 timing of all instructions, and when we get time to do it, to make the memory access cycles also match that of a 6502 exactly, and naturally also the same for 3.5MHz 4502 mode. (One of the key reasons for reimplementing the CPU this time, is actually to make sure it has two "personalities", where in 6502 mode, all illegal opcodes work properly, and when in 4502 mode, all 4502 opcodes work properly, and can match the timing exactly -- so that we can have a real C64 mode and a real C65 mode, both of which are as compatible as possible.

The other obvious thing is that the branch instruction suffers a pretty big penalty, which is because the pipeline takes a bit of time to start feeding the new instructions.  However, because the clock speed is 4x, and the main pipeline is 4-stage, the end result is that the branch actually takes exactly the same amount of time as on our previous 48MHz CPU design.

It's also worth mentioning that most of the sources of timing uncertainty in modern PC processors etc don't actually come from the pipeline and other features that we are talking about here.  (In fact, the 6502 already pipelines between instructions a little). They come from the cache, from virtual memory, from the operating system that is hiding behind and pre-empting your process all the time and filling the cache with rubbish as a result. We are not having any of that stuff in the MEGA65: What you get is a 6502 or 4502 processor, that behaves how you expect.  We have just implemented it using some lessons learned over the past few decades of CPU implementation.

Otherwise, I think that this work has got some folks thinking about what makes a machine have character, instead of just being an 8-bit version of another wise soulless kind of PC, or FPGA-centric thing that people build.  For us, there are some key things, of which the following are a few.  Of course, we are thinking about many other things, such as C64 compatibility, but we take these simply for granted. 

First, the video generation MUST be rasterised, without a frame-buffer, just like on a real C64 or C65.  That is, the video chip needs to be deciding, cycle by cycle, what colour the next pixel will be, and allow the programmer to do horrible things to it that were never intended.  It is already possible, for example, on the VIC-IV in the MEGA65 to cause a monitor to totally lose sync, because you can trick it into moving the HSYNC pulses on a raster line.  I get back to this again below, but it is really a very important point.  In fact, I would say that what really makes the C64 interesting is the VIC-II and the SID. The CPU, while still important, is really secondary in many ways. It is the custom chips and overall combination that really define the "character" and "personality" of the C64.  The MEGA65 will of course have a its own personality, but we still feel that it will indeed have a personality, and that it will be a very strong one.

Second, it has to still be a simple bare-metal machine, where you have effectively full access to all the hardware when you are running on it.  The only piece we have outside that is the Hypervisor, which is best understood as an integrated freeze cartridge, so that you can easily load, save and switch what you are doing.

Third, the machine must still have fundamental limitations, that provide opportunity for programmers to try to stretch what the machine can do.  This is why we have the combination of CPU and resolution improvements together, for example, so that the relationship of CPU performance and the number of bits on screen at a time remain in reasonable relation.  The C64 has 64000 pixels from not more than 64KB = 512kbit on screen at a time, and 1x10^6 cycles or 3x10^5 instructions per second, so that there is approximately one instruction per bit of displayed graphics per second.  The MEGA65 has about 2x10^6 pixels, and is expected to have some multiple of 10^7 instructions per second.  Thus the instructions per pixel-bit is increased by an order of magnitude over the C64, so that it offers a nice bit of extra freedom, but without removing the limitation completely.  (Compare that with a modern PC, which instead has about 10^10 instructions per second, not counting the 10^12 or more GPU instructions per second). Moreover, the number of bits per pixel available from RAM is still in proportion: A C64 has about 8 bits per pixel available (64KB / 64000 pixels). The C65 actually has less, because while it has 128KB of RAM, it can do, for example 640x400 or 1280x400 resolutions. The MEGA65 goes further, having the same RAM as the C65, but with many more pixels, much more creativity will be required to find solutions to having full screen full-colour displays -- just as this presents special challenges and opportunities for ingenuity on the C64 (and C65). My point here is really that while the boundaries of what is "possible" on the MEGA65 are naturally different to those of the C64, we have retained this sense of a limited computer, so that it still has character, and will still require years of careful thought and experimentation to find its limits.

Finally, the specification has to be fixed for the long-term, like the C64's, so that people can program it with confidence, knowing that their code will "just work" on MEGA65s for years and decades to come, because otherwise the limits of the machine are not real.  This is actually why we want to get this CPU matter sorted out sooner rather than later, so that we can say with authority, "This is the CPU of the MEGA65. It shall be no faster."  Similarly, we want to pin down the last few points on the VIC-IV

Anyway, as I have said, we want this machine to be fun for the community, and something with a stable and fixed specification once we release it, so that it can have a long life, including so that stuff that you write with cycle-by-cycle timing will keep on working.  So please don't hesitate to let us know if you have concerns about our approach, or suggestions how we can do it better. This is one of the great things about an open-source project, that people can look and provide feedback and help to make sure that the end result is as good as possible. We can't guarantee that we can take everyone's requests and include them (partly because some of you ask for opposite things ;), but we do listen and think carefully about them all.