Sunday 2 November 2014

Loading data via ethernet

As I have been working on the proportional font support for the VIC-IV, I have come to the realisation that I need some fairly comprehensive and reproducible test programmes due to the complexity of the VIC-IV text mode.

I can load programmes onto D81 images, put those on the SD card, and mount them from the simple menu I made. However, that requires physically unplugging and replugging the SD card, and is a general nuisance.

The other approach that I have at the moment is to use the serial monitor to load data, but that is VERY slow and unreliable. This is basically because the serial monitor doesn't have a FIFO on the serial input, and so it loses characters quite easily.

I do have a working 100Mbit ethernet adapter, however. Ideally, I would just use the RR-NET emulation to enable me to run the RR-NET version of 64NET. However, there are still enough bugs in the RR-NET emulation to stop that from working. Those bugs could also be debugged more easily if I could quickly and easily load test programmes onto the C65GS.

So I have finally gotten around to writing a little program that looks for UDP packets coming in via ethernet, and then provides a way to load them into memory somewhere.

Because the C65GS ethernet buffer is directly memory mapped, I can use a nice trick: having the main loading routine actually inside the packets. This means that the ethernet load programme on the C65GS can be <128 bytes, and yet support very flexible features, since the sending side can send whatever code it likes. It is only about 100 lines of 4502 assembler, so I'll just include the whole thing here.

.org $CF80
First, we need to turn on C65GS enhanced IO so that we can access the ethernet controller:
lda #$47
sta $d02f
lda #$53
sta $D02f
Then we need to map the ethernet controller's read buffer.  This lives at $FFDE800-$FFDEFFF.  We will map it at $6800-$6FFF.  The 4502 MAP instruction works on 8KB pieces, so we will actually map $6000-$7FFF to $FFDE000-$FFDFFFF.  Since this is above $00FFFFF, we need to set the C65GS mega-byte number to $FF for the memory mapper before mapping the memory.  We only need to do this for the bottom half of memory, so we leave Y and Z zeroed out so that the mega-byte number for the top half is not changed.
lda #$ff
ldx #$0f
ldy #$00
ldz #$00
; set the mega-byte number for the bottom half of memory
map
Now looking at the $DE800 address within the mega-byte, we use the normal 4502/C65 MAP instruction semantics.  A holds bits 8 to 15 of the mapping offset, and the bottom four bits of X hold bits 16 to 19.  The top four bits of X select whether mapping happens at $0000, $2000, $4000 and/or $6000.  We want to map only at $6000, so we only set bit 7 of X.  The offset needs to be $D8000, so A is $80 and X is $8D.  We use $D8000, and not $DE000, because it is an offset, and $D8000 + $6000 = $DE000, our target address.  It's all a bit weird until you get used to it.
lda #$80
ldx #$8d
ldy #$00
ldz #$00
; apply the mapping, then end the mapping sequence
map
eom
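The offset arithmetic is easy to get wrong, so it can help to check it off the machine. The Python sketch below assumes the usual C65 MAP register layout for the lower 32KB (A = offset bits 8 to 15, low nibble of X = offset bits 16 to 19, high nibble of X = one enable bit per 8KB block). The function name is made up for illustration and is not part of any C65GS tool.

```python
def lower_map_registers(target, window):
    """Return (A, X) for MAP so that CPU address `window` shows `target`.

    `target` is the 20-bit address within the selected mega-byte and
    `window` must be the start of an 8KB block in $0000-$7FFF.
    Assumes the standard C65 MAP layout described in the lead-in.
    """
    if window % 0x2000 or not 0 <= window < 0x8000:
        raise ValueError("window must be $0000, $2000, $4000 or $6000")
    # MAP adds the offset to the CPU address, so subtract the window base.
    offset = (target - window) & 0xFFFFF
    if offset & 0xFF:
        raise ValueError("offset must be a multiple of 256 bytes")
    enable = 0x10 << (window // 0x2000)   # bits 4-7 of X, one per 8KB block
    a = (offset >> 8) & 0xFF              # offset bits 8-15
    x = enable | ((offset >> 16) & 0x0F)  # enable bits plus offset bits 16-19
    return a, x


a, x = lower_map_registers(0xDE000, 0x6000)
print(f"lda #${a:02x} / ldx #${x:02x}")
```

For the buffer block at $DE000 seen through $6000, this gives the same A and X values as the listing.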
Now we are ready to make sure that the ethernet controller is running:
lda #$01
sta $d6e1
Finally we get to the interesting part, where we loop waiting for packets.  Basically we wait until the packet RX flag (bit 5 of $D6E1) is set:
loop:
waitingforpacket: lda $d6e1
and #$20
beq waitingforpacket
So a packet has arrived.  Bit 2 has the buffer number that the packet was read into (0 or 1), and so we shift that down to bit 1, which selects which buffer is currently visible.  Then we write this to $D6E1, which also has the effect of clearing the ethernet IRQ if it is pending.
lda $d6e1
and #$04
; shift the buffer number from bit 2 down to bit 1
lsr
ora #$01
sta $d6e1
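The bit shuffling in this step can be written out as a small Python sketch. The bit meanings are taken from the description above; the reading of bit 0 as "keep the controller running" is an assumption based on the earlier write of $01 to $D6E1, and the names are invented for illustration.

```python
RX_FLAG = 0x20    # bit 5: a packet has been received
RX_BUFFER = 0x04  # bit 2: buffer the packet was read into (0 or 1)
ENABLE = 0x01     # bit 0: assumed to keep the controller running


def control_value(status):
    """Value to write back to $D6E1 for a given status byte."""
    # Move the buffer number from bit 2 to bit 1, which selects the
    # buffer that is visible to the CPU, and keep bit 0 set.
    return ((status & RX_BUFFER) >> 1) | ENABLE
```

With the packet in buffer 1 the value written is $03, and with the packet in buffer 0 it is $01.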
Now we check that it is an IPv4 UDP packet addressed to port 4510
; is it IPv4?
lda $6810
cmp #$45
bne waitingforpacket
; is it UDP?
lda $6819
cmp #$11
bne waitingforpacket
; UDP port #4510
lda $6826
cmp #>4510
bne waitingforpacket
lda $6827
cmp #<4510
bne waitingforpacket
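The addresses used in these checks follow from the standard header sizes. The Python sketch below derives them and applies the same tests to a raw copy of the receive buffer, plus the $A9 check that the loader makes a little later. It assumes that the frame is preceded by two bytes in the buffer (inferred from the IPv4 header appearing at $6810 rather than $680E) and that the IPv4 header has no options, which is what the $45 check implies. The function name is hypothetical.

```python
import struct

LENGTH_PREFIX = 2   # assumed: two bytes ahead of the frame in the RX buffer
ETH_HEADER = 14     # destination MAC, source MAC, EtherType
IP_HEADER = 20      # IPv4 header without options (first byte $45)
UDP_HEADER = 8      # source port, destination port, length, checksum

IP_START = LENGTH_PREFIX + ETH_HEADER       # $10 -> $6810
UDP_START = IP_START + IP_HEADER            # $24 -> $6824
PAYLOAD_START = UDP_START + UDP_HEADER      # $2C -> $682C


def is_loader_packet(buf, port=4510):
    """Apply the loader's checks to a raw copy of the RX buffer."""
    if len(buf) <= PAYLOAD_START:
        return False
    if buf[IP_START] != 0x45:               # IPv4 with a 20-byte header
        return False
    if buf[IP_START + 9] != 0x11:           # protocol 17 = UDP
        return False
    (dest_port,) = struct.unpack_from(">H", buf, UDP_START + 2)
    if dest_port != port:                   # big-endian port at $6826/$6827
        return False
    return buf[PAYLOAD_START] == 0xA9       # payload starts with LDA #imm
```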
If it is, we give some visual indication that stuff is happening. I'll take this out once I have the whole thing debugged, because it wastes a lot of time to copy 512 bytes this way, since I am not even using the DMAgic to do it efficiently.  In fact, this takes more time than actually loading a 1KB packet of data.
; write ethernet status to $0427
lda $d6e1
sta $0427

; Let's copy 512 bytes of packet to the screen repeatedly
ldx #$00
loop1: lda $6800,x
sta $0428,x
lda $6900,x
sta $0528,x
; step to the next byte; X wraps to zero after 256 iterations
inx
bne loop1
The final check we do on the packet is to see that the first data byte is $A9 for LDA immediate mode.  If so, we assume it is a packet that contains code we can run, and we then JSR to it:
lda $682c
cmp #$a9
bne loop
jsr $682C
Then we just go looking for the next packet:
jmp loop

As you can see, the whole program is really simple, especially once it hits the loop.  This is really due to the hardware design: the combination of DMA and memory mapping avoids insane fiddling to move data around, and an ethernet frame can be executed as code while it still sits in the buffer.

The code in the ethernet frame just executes a DMAgic job to copy the payload into the correct memory location.  Thus the complete processing of a 1024 byte ethernet frame takes somewhere between 2,048 and 4,096 clock cycles -- fast enough that the routine can load at least 12MB/sec, i.e., match the wire speed of 100Mbit ethernet.
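The throughput estimate can be reproduced with a few lines of Python. The 48MHz CPU clock used here is an assumption: it is not stated in this post, but it is the clock rate at which 4,096 cycles per 1024 byte frame works out to 12MB/sec.

```python
def bytes_per_second(clock_hz, cycles_per_packet, packet_bytes=1024):
    """Sustained load rate if every packet costs `cycles_per_packet`."""
    return clock_hz / cycles_per_packet * packet_bytes


CLOCK_HZ = 48_000_000           # assumed CPU clock, not stated in the post
WIRE_SPEED = 100_000_000 / 8    # 100Mbit ethernet in bytes per second

for cycles in (2048, 4096):
    rate = bytes_per_second(CLOCK_HZ, cycles)
    print(f"{cycles} cycles/packet: {rate / 1e6:.1f} MB/sec "
          f"({rate / WIRE_SPEED:.0%} of wire speed)")
```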

On the server side, I wrote a little server program that sends out the UDP packets as it reads through a .PRG file.  Due to a bug in the ethernet controller buffer selection on the C65GS it currently has to send every packet twice, effectively halving the maximum speed to a little under 6MB/sec.  That bug should be easy to fix, allowing the load speed to be restored to ~10MB/sec.  (Note that at the moment the protocol is completely unidirectional, but that this could be changed by sending packets that download code that is able to send packets.)
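A sender along these lines can be sketched in Python. This is not the actual server program, only an illustration under stated assumptions: `make_payload` is a hypothetical caller-supplied function that must return the 4502 loader code (starting with $A9) followed by the data, the broadcast address is an assumption, and `repeats=2` mirrors the send-everything-twice workaround described above.

```python
import socket
import struct


def split_prg(prg, chunk_size=1024):
    """Yield (load_address, chunk) pairs from the contents of a .PRG file.

    The first two bytes of a .PRG are the little-endian load address.
    """
    if len(prg) < 2:
        raise ValueError("a .PRG file needs a two-byte load address")
    (address,) = struct.unpack_from("<H", prg, 0)
    body = prg[2:]
    for start in range(0, len(body), chunk_size):
        yield address + start, body[start:start + chunk_size]


def send_prg(path, make_payload, host="255.255.255.255", port=4510, repeats=2):
    """Send a .PRG file as UDP packets (host and make_payload are assumptions)."""
    with open(path, "rb") as f:
        prg = f.read()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        for address, chunk in split_prg(prg):
            payload = make_payload(address, chunk)
            if not payload or payload[0] != 0xA9:
                raise ValueError("payload must start with LDA #imm ($A9)")
            for _ in range(repeats):
                sock.sendto(payload, (host, port))
```

Keeping the chunking separate from the socket code means the per-packet load addresses can be checked without any network hardware.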

When the server reaches the end of the file, the server sends a packet with a little routine that pops the return address from the JSR to the packet from the stack, and then returns, thus effectively returning to BASIC -- although it does seem to mess up sometimes, which I need to look into.
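One way such an end-of-file packet could look is sketched below in Python. The exact bytes that the real server sends are not shown in this post, so this is only a guess at the shape: a dummy LDA immediate so that the loader's $A9 check passes, two PLA instructions to discard the return address pushed by the JSR, and an RTS.

```python
LDA_IMM = 0xA9  # LDA #imm, required as the first byte by the loader
PLA = 0x68      # pull one byte from the stack into A
RTS = 0x60      # return to the caller of the loader


def end_of_file_payload():
    """Hypothetical final packet payload: drop the JSR return address, then RTS."""
    return bytes([LDA_IMM, 0x00, PLA, PLA, RTS])
```

Because the two stacked bytes are discarded, the RTS returns to whatever called the loader rather than back into the packet loop.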

It would be nice to test the actual speed of the resulting system, but I haven't really got a setup for this yet in a robust way.

However, one can get some sort of idea of the speed by timing the server program.

When loading SynthMark64, which is about 8KB, it takes between 0.01 and 0.02 seconds, which equates to 400KB/sec - 800KB/sec.  Loading a larger programme of about 43KB takes 0.03 seconds, giving a speed of about 1.5MB/sec, which is a bit more respectable. [EDIT: I have now fixed the ethernet buffer bug, and so loading speed is easily exceeding 2MB/sec, and will likely go higher with some tuning.]
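These figures are just size divided by elapsed time, as the short Python check below shows. The sizes and times are the approximate ones quoted above, so the results are only as precise as those measurements.

```python
def rate_kb_per_sec(size_kb, seconds):
    """Average load rate in KB/sec for a transfer of `size_kb` kilobytes."""
    return size_kb / seconds


for name, size_kb, seconds in [
    ("SynthMark64, slow run", 8, 0.02),
    ("SynthMark64, fast run", 8, 0.01),
    ("larger programme", 43, 0.03),
]:
    print(f"{name}: about {rate_kb_per_sec(size_kb, seconds):.0f} KB/sec")
```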

All that remains to make this useful is to fix the bugs that stop it from returning to BASIC properly, and to add the little 128 byte programme into the kickstart ROM so that it is available on boot up, just like the disk image mounting menu.


  1. I don't understand half of what's going on here, but am absolutely enthralled by what you're doing. It even got me using Feedly to keep abreast of what's going on. Keep it up. Loving your work!

  2. Dr. Paul, this is very neat stuff. I'm saving my pennies for a Nexys4, and (again) thinking about getting serious about VHDL. Now I wonder if I can do any meta-level work on this in Perl...

    1. Hello,
      There almost certainly are things that you can do without an FPGA, depending on your particular skill-set. If you can write 6502 assembler, then there are plenty of options for helping to write the hypervisor and related support software. If you know C, Java or another programming language then there are some font preparation tools and other things that would be useful. What would be even better would be if you would like to start writing some documentation, e.g., of all the IO registers, and how to do various things. iomap.txt will give you a good starting place for a memory map, and you can search for the register numbers in the VHDL source code to find out more about each to provide fuller documentation. Having something like that, which grows to include examples of how to do common tasks, such as send an ethernet packet, would make it much easier for others to get started.


  3. Something that occurred to me, is that your UDP sniffer could look for packets making up a D81 file, and *that* could be dumped into a temporary disk image on the SD card and "mounted". Thus you have a simple way to transmit D81s to the system.

    Although... once you're writing UDP code, it seems like a short step to fully-fledged TFTP bliss...

    1. Yes, now that loading via ethernet works it is possible to do all sorts of interesting things. Executing packets, horribly insecure as it is, means that it would be possible, without changing the C65GS end of the protocol, to include D81 sectors in packets, and have the packet contain code to write the sectors to the temporarily mounted disk image.

      TFTP is also possible, although I would be inclined to just go straight for HTTP, so that you can LOAD from a URL, e.g., LOAD"" (but with some nice pleasantries to shorten the URL).

      What I intend to do sooner rather than later is get 64NET/2 working on it. Then we have remote access to all sorts of disk images (including CMD drive ones).

      Of course it all comes down to time, so if anyone is willing to help patch the 64NET/2 RR-NET modified C64 kernal, then this could happen much sooner. Packet reception is the only part that needs changing IIRC, as RR-NET emulation works sufficiently for packet transmission using the existing code.


  4. Okay, I thought of one more as I was reading up on the KERNAL at Wikipedia. I started wondering if there is any utility in mapping some of your new toys to virtual channels? I quote Wikipedia:

    "...programmers could intercept system calls to implement virtual devices with any address in the range of [32,256). Conceivably, one can load a device driver binary into memory, patch the KERNAL I/O vectors, and from that moment forward, a new (virtual) device could be addressed. So far, this capability has never been utilized, presumably for two reasons: (1) The KERNAL provides no means for dynamically allocating device IDs, and (2) the KERNAL provides no means for loading a relocatable binary image. Thus, the burden of collisions both in I/O space and in memory space falls upon the user, while platform compatibility across a wide range of machines falls upon the software author. Nonetheless, support software for these functions could easily be implemented if desired."

    1. Hello,
      Yes -- having some sort of pluggable device driver facility is a really important thing. I have already been thinking about some parts of this, but not in great detail yet. The Hypervisor may well include device drivers that can be used by each running task, or they could live in the separate tasks. Hypervisor has the advantage that they won't consume address space in the running task.


    2. That's an interesting concept. I admit I'll have to read up on Virtual Machines and hypervisors a bit. Heck, I programmed 6502 a bit back in the 80s but never understood until today that the Accumulator works with the Stack Pointer...

    3. The Accumulator works with the stack pointer? I think one of us might be confused.