Tuesday, April 19, 2016

Overview of the new CPU design (as of today)

This is totally subject to change without notice, but the following gives an over-view of the design of the new CPU:

The gs4502b will, at this stage, be a pipelined, triple-core, out-of-order instruction retirement, register-renaming processor with parallel instruction pre-fetch buffer and self-modifying code hazard avoidance.  

We also plan for it to run at 192MHz.

Okay, that's all fairly technical gobbledey-gook, so lets break it down:

First, for the most part, the features of the processor will increase the rate at which instructions can be processed, i.e., decrease the number of cycles that many instructions will take. The main exception is the fact that it is pipelined, means that in some cases instructions may take more cycles, in particular branches or self-modifying code, because the pipeline will have to flush.  However, the increase in clock speed means that it will be extremely hard to craft a set of instructions that run slower on this processor than on our existing 48MHz one.

Now moving to the specific features:

A pipelined processor is one that breaks instructions down into separate little bits, like reading the instruction, decoding the instruction, actually doing the instruction, and writing back any results to memory, and so on.  One instruction can be doing each of those things at any point in time, so while an individual instruction takes longer, the number of instructions per cycle doesn't have to drop.  The big advantage is that a pipeline usually allows the clock speed to be increased -- this is exactly why we are employing one on the new 4502b.

The down side to a pipeline is that if the pipeline has to be flushed, it takes a while for it to start executing instructions again.  This was the problem with the Pentium 4 processors that used crazy pipelines to push the clock-speed way high, but didn't have enough cache memory to sustain the pipeline, meaning that actual performance was often quite poor. However, on the MEGA65, the CPU is effectively operating from cache the whole time, as the BRAM we use for the main memory is internal to the FPGA, and can be accessed as fast as the cache on a typical processor.

The 4502b will also be triple core. The first core will be the "CPU", and the 2nd and 3rd cores will be primarily for floppy drive emulation.  However, when you don't want or need to emulate a floppy drive, they will be available for use by the programmer.  Also, at this stage, the cores will be able to be set in two different performance modes: In the one mode, the primary core gets priority, so that it can run as fast as possible.  In the other mode, all cores will share the memory bus more fairly, and so while the first core will likely still run fast, it won't be as fast as in the first mode, but this will be offset in most cases by the increased performance of the 2nd and 3rd cores. Of course, this will require software that is designed to take advantage of the extra cores, of which none currently exists -- although I did write some dual-processor 6502 code back in the 1990s, but that's a story for another day.

The CPU will also support out-of-order instruction retirement.  This means that while instructions will start executing in the correct order, quick instructions will be allowed to finish while slower ones will continue in the background.  This will allow more instructions to be processed per unit time, by reducing the amount of the time the CPU sits blocked waiting for memory accesses to complete. In particular, memory reads and writes will continue in the back-ground, without blocking the CPU from executing new instructions, unless the new instructions depend on the results of the old instructions, for example if we have LDA $1234  followed by ADC $3456, the ADC would normally need to wait for the LDA to finish so that we can have the result ready to use as input to the ADC instruction.  However, even then, it will sometimes be possible to continue processing, where we can easily predict where the result will come from, as in this example, by using register renaming.

Register renaming is a fancy trick, where we can have multiple versions of a register at the same time. Using the example from above, we can say that one version of the accumulator register will get its value from location $1234.  Then when we want to use ADC to calculate based on that renamed register, we can tell the appropriate part of the CPU that the input to ADC is in fact the output from the previous instruction, by giving it the name that the result will have.  If that all sounds crazily complicated, don't worry too much. Just understand that it helps the CPU to go a lot faster, especially when there are a lot of memory accesses.  For those interested, wikipedea has a good page on this.

The CPU will also have a parallel instruction pre-fetch buffer. This is really a simple little thing that holds the up-coming 16 instruction bytes, and allows an entire instruction to be dispatched every single cycle, unless the buffer is empty, or the CPU pipeline stalls.  This means that instructions that used to take upto 7 cycles on a regular 6502 can sometimes be executed in just one cycle*!

Of course the asterisk is there, because there can be a lot of reasons why this might not happen in practice.  But in theory, the new CPU will be able to execute 192 million instructions per second.  This compares with the approximately 10 - 20 million instructions per second that the existing 48MHz CPU can achieve, and of course looks quite absurd next to the ~250,000 - 300,000 instructions that a real C64 could execute per second.  And that is using just one core on the 4502b.  The theoretical peak performance will be 576 million instructions per second, although as anyone who knows CPU benchmarks will know, that the reality might be only 10% - 50% of that figure.  Nonetheless, that is still very, very fast for an 8-bit CPU.

Finally, to make sure that all existing software can run on it, the CPU will include self-modifying code hazard avoidance. This is just a fancy way of saying that the CPU will realise when a program modifies itself, and flush the pipeline whenever it needs to.  The only trade-off is that code that modifies itself might suffer a penalty of about 10 - 20 cycles each time it modifies itself.  Of course, at 192MHz, that is still less than 105 nano-seconds.  That is, a worse-case pipeline stall on this processor will stall the CPU for only about 1/10th of a cycle when compared to a 1MHz 6502.

So anyway, that's the current thinking on this processor, and when I get the chance, I will provide an update on how far along the implementation currently is, and give some tentative simulation results to give an idea of how fast the processor might end up in practice.


  1. That's an ambitious project! That means crazy raw power for an 8 bit machine, pretty cool.
    Won't the slow SDRAM be a problem with that speed up?

    1. We aren't using SDRAM for the main memory. Instead we are using the BRAM that is inside the FPGA. Code run from any other RAM will naturally run much slower, based on the delays introduced by the memory.

    2. Does that means critical routine could be run in BRAM and less critical code run from SDRAM?
      I remember something like this on the ARM7TDMI of the GBA where you could run fast 32 bits code on a small, specific part of RAM (WRAM IIRC) and the rest was slower 16 bits code (Thumb code).
      Here that would be 8 bit code everywhere but with different speed I guess.

    3. All of the RAM of the C64 and C65 on the MEGA65 is in BRAM, so it all runs fast. If we add support for a big DDR RAM, then that will either be acccessible by DMA only, or if it is direct mapped, then it will likely be slower.

    4. Either, go fast with little RAM or go slow with lots of RAM. Future devs will need to get creative I guess :)
      DMA only access seems more complicated to use but more inline with the C65 spirit of things hehe

  2. Do you anticipate that there will be self-modifying code?? (I don't program "on the bare metal" so I really don't know!)

    1. Yes. In fact, C64 BASIC will not work without self-modifying code support.

  3. The decision to implement an out-of-order architecture for an accumulator-store ISA is unique in my opinion. How does the logic effort compare to the earlier in-order design I was read about some time ago ?

    1. Well, the out-of-order aspect of the new CPU is relatively simple -- instructions still dispatch in-order, but are allowed to retire out of order depending on when the resource(s) they require become available. Typically, however, this just means that memory access instructions will be delayed in their completion, while subsequent register and branch operations can begin executing.

      Thus, the logic complexity is not much worse (at this stage). In fact, because the last CPU was quite messy because of how it was accreted together, the new CPU is likely to require less logic overall, and certainly less logic per core, compared to if I had had 3 of the old core.