Thursday, April 21, 2016

On cycle count predictability and related things

Some folks have expressed their concern that this CPU redesign takes away from the genuine 8-bit computer feel of the MEGA65.  My feeling is that it doesn't, as I will explain below, but at the same time I don't want to be dismissive of anyone's concerns. Our goal remains to make something that is authentic and enjoyable for a wide range of people to use and program.  So please poke me, either in the comments or elsewhere, if you wish.

But for now, I will take a little time to explain how the CPU looks from the user's perspective, to hopefully provide some assurance that it is not really a great departure from what we already had.  Indeed, from what I understand, what we are doing here is not greatly different from how the Chameleon's CPU operates, i.e., some more modern CPU construction techniques are used behind the scenes to provide what is very much (in their case) a 6502.

The main difference is that we are being transparent about how we are building the CPU behind the scenes, so that the end result is a 6502 and 4502 compatible CPU.  We're sorry if that spoils the "magic trick" for some, but we strongly believe that transparency is always best in the long run.

The out-of-order instruction retirement is just a fancy way of saying that the CPU takes and executes the instructions in order, but some can take longer to complete, for example if they need to read from or write to memory.

What doesn't change is that if an instruction requires a value read from memory, it can't be completed until the thing it depends on is complete.  That is, it still behaves exactly as one expects a 6502 to behave, for any given program.  This is quite similar in many ways to the way that the SuperCPU has a 1-byte write-through "cache."  We are just using a different mechanism (register renaming, or reservation slots, depending on how you want to look at it) to achieve much the same goal.
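To make the idea concrete, here is a toy Python model of in-order issue with out-of-order completion. It is only a sketch: the `Instr` class, the `simulate` function, and the latencies (3 cycles for a memory access, 1 for a register operation) are made-up assumptions for illustration, not the MEGA65's real pipeline or timings.

```python
# Toy model only: the latencies and the scheduling rule are illustrative
# assumptions, not the MEGA65's actual pipeline.
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    latency: int          # assumed cycles from start of execution to result
    reads: tuple = ()     # registers whose values are needed
    writes: tuple = ()    # registers this instruction produces


def simulate(program):
    """Issue one instruction per cycle, in order; return (name, issued, done)."""
    ready_at = {}         # register -> cycle at which its newest value exists
    timeline = []
    for cycle, ins in enumerate(program):
        # An instruction cannot start before the values it reads exist.
        # It captures the register versions visible at issue time, which is
        # roughly what register renaming provides.
        start = max([cycle] + [ready_at.get(r, 0) for r in ins.reads])
        done = start + ins.latency
        for r in ins.writes:
            ready_at[r] = done
        timeline.append((ins.name, cycle, done))
    return timeline


LOOP_BODY = [
    Instr("lda $1000,x", latency=3, reads=("X",), writes=("A",)),
    Instr("sta $2000,x", latency=3, reads=("A", "X")),
    Instr("inx", latency=1, reads=("X",), writes=("X",)),
]

timeline = simulate(LOOP_BODY)
for name, issued, done in timeline:
    print(f"{name:12s} issued @ cycle {issued}, complete @ cycle {done}")
```

In this toy run the three instructions are issued on consecutive cycles, the `sta` cannot finish until the `lda` has delivered its value, and the `inx` finishes before the `sta` even though it was issued after it. The program's visible result is the same as if each instruction had run to completion in turn.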

So if we look at a simple loop:

l1: lda $1000,x
sta $2000,x 
inx              
bne l1         

The simulation of this loop for the new CPU (in its current unfinished form, so there might be some changes) below shows how a couple of loop iterations go through. Note that register contents are BEFORE the instruction is executed, just because of how the simulation outputs stuff.  i.e., it shows the CPU state just before it executes the instruction, instead of just after.

-- LDA / STA / INX / BNE instructions all execute on consecutive cycles, taking
-- a total of only 20ns
@450ns: PC $8104 A:00 X:01 Y:00 Z:00 B:00 SP:01FF --E--I--  :  BD 00 10
@455ns: PC $8107 A:00 X:01 Y:00 Z:00 B:00 SP:01FF --E--IZ-  :  9D 00 20
@460ns: PC $810A A:00 X:01 Y:00 Z:00 B:00 SP:01FF --E--IZ-  :  E8 D0 F7
@465ns: PC $810B A:00 X:02 Y:00 Z:00 B:00 SP:01FF --E--I--  :  D0 F7 4C
-- 60ns ( = 12 CPU cycles) elapse between the branch and the next instruction
@525ns: PC $8104 A:00 X:02 Y:00 Z:00 B:00 SP:01FF --E--I--  :  BD 00 10
@530ns: PC $8107 A:00 X:02 Y:00 Z:00 B:00 SP:01FF --E--IZ-  :  9D 00 20
@535ns: PC $810A A:00 X:02 Y:00 Z:00 B:00 SP:01FF --E--IZ-  :  E8 D0 F7
@540ns: PC $810B A:00 X:03 Y:00 Z:00 B:00 SP:01FF --E--I--  :  D0 F7 4C
-- 60ns ( = 12 CPU cycles) elapse between the branch and the next instruction
@600ns: PC $8104 A:00 X:03 Y:00 Z:00 B:00 SP:01FF --E--I--  :  BD 00 10

What can basically be seen above is that the non-branching instructions all take one cycle to run, whether or not they need a memory access, because all the out-of-order retirement and register renaming hides that. The result is that the timing is actually somewhat simpler and easier to predict for the most part than on a real 6502.  Note that we will still have ~1MHz, ~2MHz and ~3.5MHz speed settings, where we will emulate the normal 6502 and 4502 timing of all instructions. When we get time to do it, we will also make the memory access cycles match those of a 6502 exactly, and naturally do the same for the 3.5MHz 4502 mode. (One of the key reasons for reimplementing the CPU this time is actually to make sure it has two "personalities", where in 6502 mode all illegal opcodes work properly, and in 4502 mode all 4502 opcodes work properly, and both can match the timing exactly -- so that we can have a real C64 mode and a real C65 mode, both of which are as compatible as possible.)

The other obvious thing is that the branch instruction suffers a pretty big penalty, which is because the pipeline takes a bit of time to start feeding the new instructions.  However, because the clock speed is 4x, and the main pipeline is 4-stage, the end result is that the branch actually takes exactly the same amount of time as on our previous 48MHz CPU design.
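The cycle counts can be read straight off the trace. The short Python sketch below does that arithmetic using the timestamps from the first loop iteration shown above; the 5 ns cycle length is an assumption taken from the trace's own "60ns ( = 12 CPU cycles)" note, and the variable names are just for illustration.

```python
# Issue timestamps (ns) for one iteration, copied from the trace above,
# followed by the first instruction of the next iteration.
TRACE_NS = [("lda", 450), ("sta", 455), ("inx", 460), ("bne", 465), ("lda", 525)]
CYCLE_NS = 5  # assumption: one CPU cycle per 5 ns, as the trace's notes imply

# Time from each instruction's issue to the next instruction's issue.
gaps_ns = [later[1] - earlier[1] for earlier, later in zip(TRACE_NS, TRACE_NS[1:])]
gaps_cycles = [gap // CYCLE_NS for gap in gaps_ns]

iteration_ns = sum(gaps_ns)
iteration_cycles = iteration_ns // CYCLE_NS

for (name, _), cyc in zip(TRACE_NS, gaps_cycles):
    print(f"{name}: {cyc} cycle(s) until the next instruction issues")
print(f"one loop iteration: {iteration_ns} ns = {iteration_cycles} cycles")
```

So in this (unfinished) simulation, one pass through the loop costs 1 + 1 + 1 + 12 = 15 cycles, and the taken branch accounts for most of it. These figures may change as the design is finalised.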

It's also worth mentioning that most of the sources of timing uncertainty in modern PC processors etc don't actually come from the pipeline and other features that we are talking about here.  (In fact, the 6502 already pipelines between instructions a little). They come from the cache, from virtual memory, from the operating system that is hiding behind and pre-empting your process all the time and filling the cache with rubbish as a result. We are not having any of that stuff in the MEGA65: What you get is a 6502 or 4502 processor, that behaves how you expect.  We have just implemented it using some lessons learned over the past few decades of CPU implementation.

Otherwise, I think that this work has got some folks thinking about what makes a machine have character, instead of just being an 8-bit version of an otherwise soulless kind of PC, or FPGA-centric thing that people build.  For us, there are some key things, of which the following are a few.  Of course, we are thinking about many other things, such as C64 compatibility, but we simply take these for granted.

First, the video generation MUST be rasterised, without a frame-buffer, just like on a real C64 or C65.  That is, the video chip needs to be deciding, cycle by cycle, what colour the next pixel will be, and allow the programmer to do horrible things to it that were never intended.  It is already possible, for example, on the VIC-IV in the MEGA65 to cause a monitor to totally lose sync, because you can trick it into moving the HSYNC pulses on a raster line.  I come back to this again below, but it is really a very important point.  In fact, I would say that what really makes the C64 interesting is the VIC-II and the SID. The CPU, while still important, is really secondary in many ways. It is the custom chips and overall combination that really define the "character" and "personality" of the C64.  The MEGA65 will of course have its own personality, but we still feel that it will indeed have a personality, and that it will be a very strong one.

Second, it has to still be a simple bare-metal machine, where you have effectively full access to all the hardware when you are running on it.  The only piece we have outside that is the Hypervisor, which is best understood as an integrated freeze cartridge, so that you can easily load, save and switch what you are doing.

Third, the machine must still have fundamental limitations that give programmers the opportunity to stretch what the machine can do.  This is why we have combined the CPU and resolution improvements, for example, so that CPU performance and the number of bits on screen at a time remain in reasonable relation.  The C64 has 64000 pixels drawn from not more than 64KB = 512kbit on screen at a time, and 1x10^6 cycles or 3x10^5 instructions per second, so that there is approximately one instruction per bit of displayed graphics per second.  The MEGA65 has about 2x10^6 pixels, and is expected to have some multiple of 10^7 instructions per second.  Thus the instructions per pixel-bit is increased by an order of magnitude over the C64, which offers a nice bit of extra freedom, but without removing the limitation completely.  (Compare that with a modern PC, which instead has about 10^10 instructions per second, not counting the 10^12 or more GPU instructions per second.)

Moreover, the number of bits per pixel available from RAM is still in proportion: A C64 has about 8 bits per pixel available (64KB / 64000 pixels). The C65 actually has less, because while it has 128KB of RAM, it can do, for example, 640x400 or 1280x400 resolutions. The MEGA65 goes further: it has the same RAM as the C65 but many more pixels, so much more creativity will be required to find solutions to having full-screen, full-colour displays -- just as this presents special challenges and opportunities for ingenuity on the C64 (and C65).

My point here is really that while the boundaries of what is "possible" on the MEGA65 are naturally different to those of the C64, we have retained this sense of a limited computer, so that it still has character, and will still require years of careful thought and experimentation to find its limits.
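For anyone who wants to check the ratios, the following Python sketch redoes the arithmetic using only the figures quoted above (64KB and 64000 pixels for the C64, 128KB for the C65 and MEGA65, roughly 2x10^6 pixels for the MEGA65). The helper name `bits_per_pixel` and the dictionary layout are just for illustration, and the MEGA65 pixel count is the rough figure from the text, not a final specification.

```python
KB = 1024


def bits_per_pixel(ram_bytes, pixels):
    """Bits of RAM available per displayed pixel."""
    return ram_bytes * 8 / pixels


# C64: roughly 3x10^5 instructions per second against 64KB of displayable bits.
c64_bits = 64 * KB * 8
c64_instr_per_bit = 3e5 / c64_bits

ratios = {
    "C64 (64KB, 64000 pixels)": bits_per_pixel(64 * KB, 64_000),
    "C65 (128KB, 640x400)": bits_per_pixel(128 * KB, 640 * 400),
    "C65 (128KB, 1280x400)": bits_per_pixel(128 * KB, 1280 * 400),
    "MEGA65 (128KB, ~2x10^6 pixels)": bits_per_pixel(128 * KB, 2_000_000),
}

print(f"C64 instructions per displayed bit per second: {c64_instr_per_bit:.2f}")
for machine, bpp in ratios.items():
    print(f"{machine}: {bpp:.2f} bits of RAM per pixel")
```

The C64 comes out at a little over 8 bits of RAM per pixel, the C65 at about 4 or 2 depending on resolution, and the MEGA65 at roughly half a bit per pixel, which is why full-screen, full-colour displays will take some ingenuity.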

Finally, the specification has to be fixed for the long-term, like the C64's, so that people can program it with confidence, knowing that their code will "just work" on MEGA65s for years and decades to come, because otherwise the limits of the machine are not real.  This is actually why we want to get this CPU matter sorted out sooner rather than later, so that we can say with authority, "This is the CPU of the MEGA65. It shall be no faster."  Similarly, we want to pin down the last few points on the VIC-IV.

Anyway, as I have said, we want this machine to be fun for the community, and something with a stable and fixed specification once we release it, so that it can have a long life, including so that stuff that you write with cycle-by-cycle timing will keep on working.  So please don't hesitate to let us know if you have concerns about our approach, or suggestions on how we can do it better. This is one of the great things about an open-source project: people can look, provide feedback, and help to make sure that the end result is as good as possible. We can't guarantee that we can take everyone's requests and include them (partly because some of you ask for opposite things ;), but we do listen and think carefully about them all.

22 comments:

  1. I think it is the best way to make as many components as possible be optional (or modular) - and let it be up to the actual user if he turns them on or off from the application currently. These pipelining and caching (etc.) features seem slightly similar to that of the DTV (and the little bit of SuperCPU). Thus, make the 1, 2 and 3.5 MHz modes just exactly cycle-accurate with maximal backwards compatibility for the existing old software made (or new ones being made) for C64, C128 and C65 computers. Plus add your special MEGA65 (formerly 48 MHz) mode. (These have all already been there of course.)

    Now, on the top of that, add yet another switch to also activate the extra features with pipelining and caching (in any mode, not depending on the speed). If you make this switch to be activated by the same instruction sequence which the DTV does, then you also gain another backwards compatibility above the others: partially for the DTV (more than nothing, at least). Perhaps you might also add yet another switch in the manner of the SuperCPU... (But only for altering the speed - not the instruction set or other things, of course.)

    And finally, to also make it simple at the same time for the user, that POKE0,65 - which turns on ALL of these at once. (And that POKE0,64 which turns off all of these at once, too.)

    The 1 and 2 MHz modes should automatically use the 6510, while every other mode the 4510 instruction set.

    And then so will everybody be happy.

    Replies
    1. Hello,

      As you are hinting at, this is more or less what we are doing anyway. We already support C128 and C65 speed switch sequences. Adding the DTV one shouldn't be too hard, although I haven't researched exactly how it works. The problem is if things check whether it works, and then try to do DTV-specific things.

      As for 6502/4502 switching, it is unfortunately not that simple, because the C64-mode kernal on the C65 actually uses some 4502 instructions for the DOS, but these get run at 1MHz, and in an otherwise completely C64 context.

      Paul.

    2. For the C64 mode, you should rather use your own custom modified one instead of the original Kernal, thus getting rid of the 4510 code (and perhaps optionally adding JiffyDOS protocol for external peripherals, too).

    3. Hello,

      We really want to keep it compatible with the original ROM. We will probably do something dynamic that can work out which mode it should be in. It might even be that we make it so that it is in 4502 mode only when in the kernal, but 6502 mode otherwise, if the machine is otherwise in C64 mode.

      As for JiffyDOS, this will be up to people to do themselves. We might be able to make an even faster loader, however, that uses custom code, and takes advantage of the fact that the MEGA65 has much faster I/O, and so could potentially lock to the true 1MHz clock of the floppy drive, and allow the drive to write to the port as fast as it can, and purely on the basis of timing, read the bits correctly on the receive side. We might even be able to do the GCR decoding on the receive side, and effectively just stream the GCR read register over the serial line. This should certainly work for 1571 and 1581 drives where we can use the shift register, but I suspect it might even be possible on a 1541, if we use illegal opcodes, e.g., with something along the lines of LAX gcr / AND #$03 / STA ioport / TXA / LSR / LSR / TAX / STA ioport / TXA / LSR / LSR / TAX / AND #$03 / STA ioport / TXA / LSR / LSR / STA ioport -- but this is just off the top of my head, there might be problems with doing this, or not.

      Paul.

  2. Saying the MEGA65 has its own personality, and a strong one, gave me the greatest smile ever!!! MY EVER NEED TO OWN one... ever so lingers in my veins!

  3. It's called a CSG 4510 VICTOR (NOT 4502, there is and never was and never will be a 4502).

    And the C64, and thus the Chameleon, uses a 6510 (not a 6502: that is the PET, VIC20 and 1541's CPU).

    Reading that gives me headache.

    Replies
    1. Hello,

      So the 4510 has a 4502 CPU core in it, plus two 6526 CIAs (or are they 4526s?). I use 4502 to be specific about the fact that I am not talking about the CIAs. As for the Chameleon, its 6510 is a 6502 + 6 pin IO port, just like the 6510 in a real C64. While my terminology might not be perfect, there is nonetheless method in my madness.

      Paul.

  4. You also mentioned that the CPU would be "smaller"... even while it sounds bigger (to me). Is it that you're leveraging capabilities that are provided "for free" in the FPGA's environment, or is there something else that I don't know about CPU design (and there's a lot I don't know about it!)?

    ReplyDelete
    Replies
    1. Well, it is partly because the implementation of the previous CPU happened somewhat organically, and so it isn't as efficient as possible. Also, with the new CPU, sharing the memory controller among the three cores that we need generates some significant space savings. However, there is nothing about the new design that makes use of any special FPGA resources.

      Paul.

    2. Thanks for your responses, Paul. I'm excited about the C65GS/MEGA65/whatever its current project name is.

    3. Hello,

      Thanks :) It is always nice to hear encouraging words.

      The project is called the MEGA65 these days. The blog pre-dates that change, which is why it still appears as "C65GS" in the URL.

      Paul.

  5. Looking forward to see how the new cpu will perform.

    Do you plan to further enhance the VicV - and perhaps increase the amount of graphics ram beyond 128kb, or will that break compatibility?

    Replies
    1. Hello,

      Thanks. I am looking forward to seeing what the result will be as well. I'm hoping for at least 100x in SynthMark, but I won't know until I get it all done.

      For the VIC-IV, which is our enhanced version of the VIC-III, I don't yet know if it will be possible to increase the graphics RAM -- it will depend on available resources in the FPGA, which I won't know until we have everything else settled.

      Paul.

  6. Hello Paul,

    I'm in favor of the new pipelined architecture. I'm wondering if there is a way to expose either values being evaluated in interim stages or status flags to indicate what stage an instruction is in.

    Just as an example case, for an INC $add instruction, you have to fetch memory, add, and write it back out - if the value stuck around while in the add phase of the pipeline, maybe we could transfer the value pre-add or post-add into the accumulator?

    If status flags were available, it might be helpful when profiling code. The big difference with the existing 6502 is that you can do the math in your head about how many cycles an instruction is going to take, with branching and page boundaries. But with a pipeline, it depends on which stage the pipe is in, whether you can do some math on a different register while fetching from memory, etc. This level of complexity requires profiling tools to optimize code. I remember a simple command-line tool that Hitachi provided for the SH-4 that would visually print out the stages of the pipeline for a given set of instructions.

    - Gary

    Replies
    1. Hello,

      Unfortunately exposing such information would add considerably to the complexity of the CPU. It is also complicated by the lack of any unused opcodes.

      That said, the pipeline is actually relatively simple and predictable in terms of cycle timing, so it will almost certainly be possible to still do cycle counting in your head as you write code. In fact, it will likely be simpler than on a regular 6502, because instructions will typically take either 1, 2 or 10 cycles (exact values to be finalised).

      Paul.

  7. I'm wondering about interrupt handling/sharing amongst the CPU cores. Will there be a mechanism for determining which sources will go where? Will there be a way of generating interrupts on a core from another one?

    Also, will memory mapping be available for each of the cores, independently? Will only one of the cores be able to function as the 6510/4510 with the $00, $01 IO ports? How will you handle the situation where the drive 6502s mustn't have those ports?


    Daniel.

    Replies
    1. Hello,

      I might add a little interrupt switch that allows each source to be set to any core, or I might just have the sources fixed to particular cores. Not entirely decided yet. Likewise for inter-core communications.

      For memory maps, they will all "see" the same 28-bit address space, but I may disable memory remapping from the auxiliary cores. For floppy emulation, they will simply be set to a memory map that contains what they need and nothing else. As on a real C65, $01 and $00 are actually $00001 and $00000, so if you map other memory, they aren't visible, so the problem of other cores accessing them is elegantly avoided, without adding any complexity.

    2. I'm thinking of the situation where we have a multi-tasking operating system using multiple cores. I'd like to reserve the "primary core" for "ring 0" or privileged tasks and the others for user tasks. I think some means of interrupting the process on the auxiliary cores is going to be necessary for simplifying the task switching process. Otherwise, we need to be able to halt/stall and resume them and preserve and restore the PC from the "primary" core.

      Using interrupts, this is almost done for us so long as we reserve some of the address space for the system, I'm thinking the top $F000-$FFFF for interfacing as well as being able to switch in/out $0000-$0FFF for application specific state/system info. If there are no tasks to run, it could be a little ugly, though.

      What are your thoughts?

    3. Hello,

      This is more or less in line with my thinking. The primary CPU is the only one that would be able to access the hypervisor, and would certainly not be able to be stopped by the other CPUs. By limiting access to the hypervisor, all these other things should mostly fall out.

      It might be, for example, that if one of the other cores asks for the hypervisor through a trap, that the calling core is suspended until the main core gets around to responding to it.

      Anyway, this is still all a bit fluid while I get the new CPU design actually working, and then work through what is easier and harder to implement. The end result is that I do intend it to be possible to have such hierarchical coordination between the cores, if only because we want to use this machine for teaching computer architecture.

    4. Hmm... I'm hoping that the whole address space (less the reserved area) would be open to user tasks so that no I/O appears in the address space. Wouldn't this mean that in order for the traps to work, a call into the system (in the reserved space) would need to map in that I/O?

      At any rate, the only way I can think of to handle system calls from a user task requires that the auxiliary CPUs be able to perform memory mapping, even if it's just to map in the appropriate system "ROM". I don't really know anything about the hypervisor, however.

      I really wouldn't want the I/O area to be present for user tasks. Perhaps instead of the hypervisor for the auxiliary CPUs, they can do things like signal and wait? But, as I said, I think that they would need to be able to perform some memory mapping, even if it's simple ($00, $01?).

      This leads to some issues because it would have to be done via an instruction if not $00 and $01 and a 6510 "personality" would need one reclaimed for the job (and even for the 6502). This is not too much of a concern for me since I personally think the auxiliary CPUs should be 4502s or 6502s anyway. I'm not sure how the 4502 MAP instruction works or what you would want.

    5. The simplest solution here is to make the 4502 MAP instruction on the secondary cores instead be a hypervisor trap. The hypervisor can then do the memory mapping. But as I say, I haven't really come to a final position -- it will depend on how the final CPU implementation looks, as to how I can best implement memory protection of some kind, while allowing the secondary cores as much latitude as possible. It might be, for example, that IO can be enabled/disabled for each chip for each core, e.g., you can choose which core can see CIA1 or the VIC-IV etc. This would need to be in addition to protection for the RAM, so that other cores can't (without permission) scribble over the RAM another core or process is using. It would be faster if most allowed remapping can be done by the core itself, so that hypervisor traps can be avoided. It also gets a bit interesting with the 32-bit indirect ZP addressing mode, since that doesn't see the effects of any memory mapping, and so I'll need to come up with some interesting protection scheme for that as well. As I say, there is still some thinking to do on this.

      Paul.
