Thursday, April 16, 2015

Making the SD access routines available from outside the hypervisor

I have had working FAT file system reading code for some time now, because it is needed to load the ROM into the running machine.

The code was, however, rather specialised: it only allowed a single partition, and was not really made to be callable from outside of the hypervisor.  So I am starting to refactor it so that it can be called from outside, and to set up a logical mechanism for doing so.

The first step is setting up a suitable calling mechanism from a running machine into the hypervisor.

For this I have taken some inspiration from modern CPUs that have nice ways to call into the operating system.  However, unlike CPUs like the MIPS and x86 CPUs, which can dedicate a separate instruction to this process, the 4502 had already allocated all opcodes.  So we need another way.

What I have implemented is a block of IO registers at $D640-$D67F that, if written to, cause the CPU to switch into the hypervisor.  Depending on which register is written to, the hypervisor enters at a different address. In other words, these 64 registers correspond to a jump table in the hypervisor.

When the hypervisor is all done, it writes to $D67F, causing it to exit back to the caller.
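To make the register-to-slot mapping concrete, here is a minimal Python sketch of it. Only the address range and the exit register come from the description above; the function and constant names are hypothetical, and the sketch says nothing about how far apart the hypervisor entry points are.

```python
# Model of the trap register block at $D640-$D67F (addresses from the text).
TRAP_BASE = 0xD640      # first trap register
TRAP_LAST = 0xD67F      # last trap register
EXIT_REGISTER = 0xD67F  # written by the hypervisor to return to the caller
TRAP_COUNT = TRAP_LAST - TRAP_BASE + 1


def trap_index(addr):
    """Return the jump-table slot (0..63) selected by writing to `addr`."""
    if not TRAP_BASE <= addr <= TRAP_LAST:
        raise ValueError(f"${addr:04X} is not a trap register")
    return addr - TRAP_BASE


print(TRAP_COUNT, trap_index(0xD640), trap_index(0xD67F))
```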

Saving registers is a time-consuming process, so I wanted this to be as fast as possible. So the GS4510 has about 30 dedicated shadow registers that save various aspects of the processor state simultaneously on trapping to the hypervisor. This means it takes only one clock cycle -- about 20 nanoseconds -- to trap into the hypervisor.  The contents of A, X, Y and Z are passed into the hypervisor, as well as being saved in the shadow registers, so the hypervisor doesn't need to load them on entry. The shadow registers all get restored on exit from the hypervisor, restoring the CPU state, also in a single cycle.  This also means that when we enter the hypervisor, we can set a specific memory configuration, so that the hypervisor can get right to work.

Let's think about how fast this can actually be in practice.

From the user process, you must write to one of the trap registers, e.g., with:

STA $D640

We don't have to set A to anything first, because the trap process ignores all register values (although the hypervisor might look at them once trapped in).

STA absolute takes 5 cycles on the GS4510. Add 1 cycle for the trap process, and we are in the hypervisor. Now consider a minimal trap that just returns to the caller without doing anything: it requires a write to $D67F, so another 5 cycles, and then 1 more cycle to exit the trap.

Thus the total overhead is 12 cycles, or about 240ns.  That is, you could do an empty trap like this around 4 million times per second. In this regard, the GS4510 is much closer to the performance of much faster processors.
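The arithmetic behind that figure can be written out as a small Python sketch. The cycle counts and the 20 ns cycle time are the approximate values quoted above; the variable names are just for illustration.

```python
# Cost of an empty trap, using the cycle counts quoted in the text.
CYCLE_NS = 20        # approximate length of one clock cycle
STA_ABSOLUTE = 5     # cycles for STA to a 16-bit address
TRAP_ENTRY = 1       # cycles to switch into the hypervisor
TRAP_EXIT = 1        # cycles to switch back to the caller

# STA $D640 + entry, then STA $D67F + exit.
empty_trap_cycles = STA_ABSOLUTE + TRAP_ENTRY + STA_ABSOLUTE + TRAP_EXIT
empty_trap_ns = empty_trap_cycles * CYCLE_NS
traps_per_second = 1_000_000_000 // empty_trap_ns

print(empty_trap_cycles, empty_trap_ns, traps_per_second)
# prints: 12 240 4166666
```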

With that in place, we can start implementing a useful system call facility.  We'll focus on the disk access (DOS) calls for now.

First, we don't want to use up all 64 system call addresses for one major function, so we will use a register to indicate a sub-function.  We will use $D640, which traps to $8000 in the hypervisor, and have that jump to our call dispatch routine:

dos_and_process_trap:
        ; Sub-function is selected by X.
        ; Bits 6-1 are the only ones used.
        ; Mask out bit 0 so that indirect jmp's are valid.
        txa
        and #$FE
        ; to save memory we only allow this table to be 128 bytes long,
        ; thus we have to check that bit 7 is clear.
        bmi invalid_subfunction
        tax
        jmp (dos_and_process_trap_table,x)

dos_and_process_trap_table:
        ; $00 - $0E
        .word trap_dos_getversion
        .word trap_dos_getdefaultdrive
        .word trap_dos_selectdrive
        .word trap_dos_getdisksize
        .word trap_dos_getcwd
        .word trap_dos_chdir
        .word trap_dos_mkdir
        .word trap_dos_rmdir
        ...

The first part of this routine uses the X register to pick which routine to call.  The 4502 has a nice JMP indirect indexed mode that is made for jump tables, which we will use.

However, there is a bit of an unfortunate design decision in that instruction, in that it doesn't double X before indexing, so odd values of X will cause it to jump to an address which consists of one byte each from two neighbouring vectors in the jump table.  That would be bad, as it would cause it to jump into strange places in the hypervisor, probably messing memory up and never returning, or at least posing a security risk.  So we clear the lowest bit of X before doing the lookup, which we have to do by using the accumulator. This costs us 4 cycles just to do the TXA / AND #$FE / TAX, but more importantly means that if we want to inspect the value of A in the syscall, we have to load it from the relevant shadow register, which costs another 5 cycles.  So there is an almost 200ns penalty due to this one little thing!  I am thinking I might change the behaviour of this instruction when the CPU is in hypervisor mode to make the X index be doubled when used in this instruction so that the traps can be much faster.

To save memory in the hypervisor (it is limited to 16KB) I also check that the upper bit of X is clear, so that we have only 64 vectors available in this system call.  This costs another 60 - 80 ns, plus the 6 cycles for the indirect jump (another 120ns).
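The effect of an odd index is easier to see with real bytes. The following Python sketch models the table lookup and the masking done by the dispatch routine; the two handler addresses are made up purely so the bytes are easy to follow, and the function names are hypothetical.

```python
def build_table(vectors):
    """Lay out 16-bit vectors as little-endian byte pairs, as .word does."""
    table = bytearray()
    for vector in vectors:
        table += bytes((vector & 0xFF, vector >> 8))
    return bytes(table)


def jmp_indirect_x(table, x):
    """Target of JMP (table,X): the two bytes starting at offset X."""
    return table[x] | (table[x + 1] << 8)


def dispatch(table, x):
    """Mirror TXA / AND #$FE / BMI invalid / TAX / JMP (table,X)."""
    a = x & 0xFE          # clear bit 0 so we always land on a whole vector
    if a & 0x80:          # BMI: bit 7 set means the index is past the table
        return None       # stands in for invalid_subfunction
    return jmp_indirect_x(table, a)


# Hypothetical handler addresses; the table bytes are 00 81 34 82.
TABLE = build_table([0x8100, 0x8234])

print(hex(jmp_indirect_x(TABLE, 1)))  # unmasked odd index: 0x3481
print(hex(dispatch(TABLE, 1)))        # masked down to vector 0: 0x8100
print(dispatch(TABLE, 0x80))          # rejected: None
```

With X = 1 and no masking, the CPU would take $81 (the high byte of the first vector) as the low byte and $34 (the low byte of the second vector) as the high byte, and jump to $3481, an address that is in neither vector.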

Let's look at the one system call that I have implemented so far:

; Return OS and DOS version.
; A/X = OS Version major/minor
; Z/Y = DOS Version major/minor
trap_dos_getversion:
        lda #<os_version
        sta hypervisor_x
        lda #>os_version
        sta hypervisor_a
        lda #<dos_version
        sta hypervisor_z
        lda #>dos_version
        sta hypervisor_y
        jmp return_from_trap_with_success

This basically just sets A, X, Y and Z to contain version information.  Each LDA / STA takes 7 cycles, so this block of code takes 28 cycles = 560 ns.
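As a rough illustration of what the caller gets back, this Python sketch follows the stores in the listing: the `<` operator picks the low byte of a 16-bit constant and `>` picks the high byte. The version values passed in at the bottom are placeholders, not the real os_version and dos_version constants, and the function names are hypothetical.

```python
def low_byte(value):
    """Equivalent of the assembler's < operator."""
    return value & 0xFF


def high_byte(value):
    """Equivalent of the assembler's > operator."""
    return (value >> 8) & 0xFF


def getversion_registers(os_version, dos_version):
    """Register values the caller sees, following the stores in the listing."""
    return {
        "A": high_byte(os_version),
        "X": low_byte(os_version),
        "Y": high_byte(dos_version),
        "Z": low_byte(dos_version),
    }


# Four LDA / STA pairs at 7 cycles each, 20 ns per cycle.
body_cycles = 4 * 7
body_ns = body_cycles * 20

# Placeholder version numbers, for illustration only.
print(getversion_registers(0x0102, 0x0304), body_cycles, body_ns)
```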

Then we jump to return_from_trap_with_success which sets the carry flag (our convention for success) and exits from the hypervisor:

; Return from trap with C flag set to indicate success
return_from_trap_with_success:
        ; set C flag for caller to indicate success
        lda hypervisor_flags
        ora #$01   ; C flag is bit 0
        sta hypervisor_flags
        ; return from hypervisor
        sta hypervisor_enterexit_trigger

This takes about 17 cycles, so another 340ns.

So all up we have 200ns overhead for the trap, then about 260ns for the dispatch, 560ns for the useful work, and another 340ns to return.  In total, our system call to get the OS and DOS version requires about 1360ns, or roughly 1.4 microseconds, allowing for better than 700,000 requests per second for a call of this complexity.
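As a cross-check, the four per-stage figures quoted above can simply be added up. This Python sketch does only that; the stage labels are mine, and the nanosecond values are the approximate ones from the text.

```python
# Per-stage costs in nanoseconds, as quoted in the text.
stage_ns = {
    "trap entry and exit": 200,
    "dispatch": 260,
    "getversion body": 560,
    "return path": 340,
}

total_ns = sum(stage_ns.values())
calls_per_second = 1_000_000_000 // total_ns

print(total_ns, calls_per_second)
# prints: 1360 735294
```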

Now to work on making more useful disk functions available via this interface.

2 comments:

  1. 1) Given that you reserve eight entry points to the hv, is 64 functions per entry point not a bit overdone? It may very well not be, but you're not saying how you intend to use the eight entry points.

    2) Given the limited information that you can store in the registers, the more complex calls (say, access a file, read a sector or whatnot) will probably need some kind of memory structure that contains the details of the call. Given that, there may be no need to store anything in A, so you could store the sub-function number in A, then shift it left and store the result in X.

    3) I find it odd that you mask out the lower bit, but trigger an error on the upper bit. You could also mask out the upper bit and be done with it?

    1. Hello,

      1) There are actually 64 entry points to the hypervisor, not eight. This does allow for a lot of entry points. It may be that I am being too careful in allowing for this many, given that it will allow for 4,096 different system calls. However, I'd rather be careful at this early stage. That said, I have nominal allocations for more than 32 sub-functions for disk and process control already, so it may be that I am not being too careful. I might even be able to provide a mechanism where a running process can ask for some commonly used traps to be promoted into some of the direct entry points to speed things up a little. As it is, I will probably make the byte-by-byte file read/write calls direct entries to speed up those common use-cases.

      2) Yes, this may well make sense. What I am also thinking about is that entry to the hypervisor could save the real value of X in the shadow register, and then shift X left one bit (and optionally mask the top bit), so that the dispatch can just be the JMP (table,X). This would speed things up by 10% or so in exchange for a few transistors. It seems like a pretty good deal on that basis.

      3) Yes, you are probably right. Of course, in some cases we might want 128 sub-functions. Also, if I do my hardware-assistance idea, then it becomes moot.

      Paul.
