Sunday, November 2, 2014

Loading data via ethernet

As I have been working on the proportional font support for the VIC-IV, I have come to the realisation that I need some fairly comprehensive and reproducible test programmes due to the complexity of the VIC-IV text mode.

I can load programmes onto D81 images and put those on the SD card, and mount them from the simple menu I made. However, it requires physical plugging and unplugging, and is a general nuisance.

The other approach that I have at the moment is that I can use the serial monitor to load data, but VERY slowly and unreliably.  This is basically because the serial monitor doesn't have a FIFO on the serial input, and so it looses characters quite easily.

I do have a working 100mbit ethernet adapter, however.   Ideally, I would just use the RR-NET emulation to enable me to run the RR-NET version of 64NET.  However, there are still enough bugs in the RR-NET emulation to stop that from working.  This also could be more easily debugged if I could quickly and easily load test programmes onto the C65GS.

So I have finally gotten around to writing a little program that looks for UDP packets coming in via ethernet, and then providing a way to load them into memory somewhere.

Because the C65GS ethernet buffer is direct memory mapped, I can use a nice trick, of having the main loading routine actually inside the packets.  This means that the ethernet load programme on the C65GS can be <128 bytes, and yet support very flexible features, since the sending side can send whatever code it likes.  It is only about 100 lines of 4502 assembler, so I'll just include the whole thing here.

.org $CF80
First, we need to turn on C65GS enhanced IO so that we can access the ethernet controller:
lda #$47
sta $d02f
lda #$53
sta $D02f
Then we need to map the ethernet controller's read buffer.  This lives at $FFDE800-$FFDEFFF.  We will map it at $6800-$6FFF.  The 4502 MAP instruction works on 8KB pieces, so we will actually map $6000-$7FFF to $FFDE000-$FFDFFFF.  Since this is above $00FFFFF, we need to set the C65GS mega-byte number to $FF for the memory mapper before mapping the memory.  We only need to do this for the bottom-half of memory, so we will leave Y and Z zeroed out so that we don't change that one.
lda #$ff
ldx #$0f
ldy #$00
ldz #$00
map
eom
Now looking at the $DE800 address within the mega-byte, we use the normal 4502/C65 MAP instruction semantic.  A contains four bits to select whether mapping happens at $0000, $2000, $4000 and/or $6000. We want to map only at $6000, so we only set bit 7.  The bottom four bits of A are bits 8 to 11 of the mapping offset, which in this case is zero.  X has bits 12 to 19, which needs to be $D8, so that the offset all together is $D8000.  We use this value, and not $DE000, because it is an offset, and $D8000 + $6000 = $DE000, our target address.  It's all a bit weird until you get used to it.
lda #$80
ldx #$8d
ldy #$00
ldz #$00
map
eom
Now we are ready to make sure that the ethernet controller is running:
lda #$01
sta $d6e1
Finally we get to the interesting part, where we loop waiting for packets.  Basically we wait until the packet RX flag is set
loop:

waitingforpacket:
lda $d6e1
and #$20
beq waitingforpacket
So a packet has arrived.  Bit 2 has the buffer number that the packet was read into (0 or 1), and so we shift that down to bit 1, which selects which buffer is currently visible.  Then we write this to $D6E1, which also has the effect of clearing the ethernet IRQ if it is pending.
lda $d6e1
and #$04
lsr
ora #$01
sta $d6e1
Now we check that it is an IPv4 UDP packet addressed to port 4510
; is it IPv4?
lda $6810
cmp #$45
bne waitingforpacket
; is it UDP?
lda $6819
cmp #$11
bne waitingforpacket
; UDP port #4510
lda $6826
cmp #>4510
bne waitingforpacket
lda $6827
cmp #<4510
bne waitingforpacket
If it is, we give some visual indication that stuff is happening. I'll take this out once I have the whole thing debugged, because it wastes a lot of time to copy 512 bytes this way, since I am not even using the DMAgic to do it efficiently.  In fact, this takes more time than actually loading a 1KB packet of data.
; write ethernet status to $0427
lda $d6e1
sta $0427

; Let's copy 512 bytes of packet to the screen repeatedly
ldx #$00
loop1: lda $6800,x
sta $0428,x
lda $6900,x
sta $0528,x
inx
bne loop1
The final check we do on the packet is to see that the first data byte is $A9 for LDA immediate mode.  If so, we assume it is a packet that contains code we can run, and we then JSR to it:
lda $682c
cmp #$a9
bne loop
jsr $682C
Then we just go looking for the next packet:
jmp loop

As you can see, the whole program is really simple, especially once it hits the loop.  This is really due to the hardware design, which with the combination of DMA and memory mapping avoids insane fiddling to move data around, particularly the ability to execute an ethernet frame as code while it sits in the buffer.

The code in the ethernet frame just executes a DMAgic job to copy the payload into the correct memory location.  Thus the complete processing of a 1024 byte ethernet frame takes somewhere between 2,048 and 4,096 clock cycles -- fast enough that the routine can load at least 12MB/sec, i.e., match the wire speed of 100mbit ethernet.

On the server side, I wrote a little server program that sends out the UDP packets as it reads through a .PRG file.  Due to a bug in the ethernet controller buffer selection on the C65GS it currently has to send every packet twice, effectively halving the maximum speed to a little under 6MB/sec.  That bug should be easy to fix, allowing the load speed to be restored to ~10MB/sec.  (Note that at the moment the protocol is completely unidirectional, but that this could be changed by sending packets that download code that is able to send packets.)

When the server reaches the end of the file, the server sends a packet with a little routine that pops the return address from the JSR to the packet from the stack, and then returns, thus effectively returning to BASIC -- although it does seem to mess up sometimes, which I need to look into.

It would be nice to test the actual speed of the resulting system, but I haven't really got a setup for this yet in a robust way.

However, one can get some sort of idea of the speed by timing the server program.

When loading SynthMark64, which is about 8KB, it takes between 0.01 and 0.02 seconds, which equates to 400kb/sec - 800kb/sec.  Loading a larger programme of about 43KB takes 0.03 seconds, giving a speed of about 1.5MB/sec, which is a bit more respectable. [EDIT: I have now fixed the ethernet buffer bug, and so loading speed is easily exceeding 2MB/sec, and will likely go higher with some tuning.]

All that remains to make this useful is to fix the bugs that stop it from returning to BASIC properly, and to add the little 128 byte programme into the kickstart ROM so that it is available on boot up, just like the disk image mounting menu.