Saturday, 22 November 2014

Working on the anti-aliaser

I finally got around to plumbing the alpha-blender together to begin testing it for use with anti-aliasing text.

Not surprisingly, there are some things that aren't quite right. Primarily, Xilinx's reference implementation of an alpha blender doesn't quite seem up to the 192MHz pixel clock.  In particular, the blue channel is failing to meet timing closure, and seems to be delayed by a few pixels, as can be seen in the screen shots here:  


The displacement is a fixed number of physical pixels, as can be seen if I change the horizontal scaling of the text generator:



The timing failure isn't too surprising, given that the alpha blender uses a double-clock rate to multiplex the DSP operations to save a bit of silicon.  However, it means that the blender is effectively running at 192MHz x 2 = 384MHz. 

I think I will have to modify the blender so that it doesn't need to multiplex DSP blocks, and can hopefully then meet timing, and avoid the colour planes separating like this.

Wednesday, 19 November 2014

Proportional font glitches fixed

It is almost a week since I fixed some underlying faults in the hardware support for drawing proportional fonts, together with some bugs in my font file generator that was trimming the right-most pixels from some glyphs.

However, it is only tonight that I have managed to get the display working, because the character generator logic in the VIC-IV would seize up when trying to display variable-width full-colour characters. I eventually tracked the bug down to the paint_ready flag never being cleared if the VIC-IV was asked to draw a one-pixel wide 256-colour character.

This came about because the logic previously assumed that there would always be a 2nd-to-last pixel in a character, which is clearly not true when a character is trimmed to be just a single pixel wide.

Now that I have that fixed, I can finally draw variable-width anti-aliased* fonts.  The following two screen shots are of the same Unicode string being rendered using 10 point and 20 point versions of the same font. You can get an idea of the size of characters from the grid of junk at the bottom of each.



Again, the matrix-like tint is due to the colour-cube approximation of the VNC viewer I used to capture the image.  The characters look pure white on the VGA display.

Obviously the 20 point version (top) looks nicer than the 10 point version.  That said, to my wooden eye, they both look pretty decent.  The kerning looks acceptable, and both ascenders and descenders on characters are drawing fine.

I haven't optimised the unicode printing routine code at all yet, so it is still fairly slow to draw. Nonetheless, I can draw those several words in well under a frame, as can be seen in the following screenshot where I set the border to white while the drawing happens:


Earlier I said that I can now draw anti-aliased characters.  This is currently true only if I want to consume the entire 256-colour palette on fake alpha-values.  As discussed in earlier posts, I am part-way through implementing a real alpha-blender that will allow each character to be a different colour, and appear on a different background.

I also need to get around to implementing variable height characters, so that there are no large blank regions between lines when the upper rows of pixels in the top-most row of characters of a font are unused.

The other thing I hope to work on soon is to update the little programme I wrote to display the unicode text so that it can access fonts located outside of the first 64KB of RAM.  This will use the new 32-bit indirect zero-page addressing modes.  Then I will be able to display text from different fonts on the same display.



Sunday, 16 November 2014

First steps towards a C65GS laptop

From the outset I had thought about what form-factor the C65GS would have as a computer.

Basically it was a question of whether to make it something that looks and feels like the original C65, or whether to make it into a laptop.

Nostalgia said that something that looked like the original would be nice.  However, the realisation that being able to create 3D-printed or injection moulded replicas of the C65 case would be extremely expensive put a dampner on that idea.

Also, as I got working on the C65GS I found myself increasingly desiring a portable computer - both for taking places to show and so that it could be used for some light general productivity.

So I have been thinking about how to do this, and taking a look at what some other very clever people have done in the past.  The most interesting one is this laptop commodore 64, made using original parts.  This is interesting to me, because it was created using a real C64 keyboard, and he fabricated the case parts himself.  While I don't plan to use a real C64 keyboard for mine, I do intend to use the real C65 keyboard I bought on ebay recently.  Fortunately, the C65 keyboard already has the function keys moved to the top, and so is narrow enough to be used unmodified in a 15" laptop.

More of how Ben fabricates custom cases like this is covered in another post of his.

However, before I get to that point, I need to work out exactly what I intend to cram into the thing.

Obviously the Nexys4 FPGA board needs to go in.

I have already sourced 4 x 30Wh LiFePO4 rechargeable batteries, sufficient to skirt the air-line limits in several countries, but also enough to run a bright screen for a few hours at least.  I'll need to find a charger circuit for them, perhaps something like this that should allow me to put 15.6V in and charge the batteries.  The output from the batteries then can be fed into one of these to provide enough 12V for a 15" LCD panel, and a very similar one of these to provide ample 5V for the FPGA board and related circuitry.

I had originally looked for a 13" 1920x1200 panel, but those are rare as hens teeth, so have settled for a 15.4" one.  After a bit of hunting it looks like this panel and this LCD driver kit should work together, and will accept the 1920x1200 60Hz VGA signal from the FPGA board.

This leaves the joystick/keyboard/IEC serial and other ports, plus speaker driver PCB as the great unknown in terms of internal components. However, even without that PCB, I am starting to have a think about how things would be arranged in the eventual case.

I will probably use a similar approach to Ben's portable C64, and have an external lead that connects the panel in the top half of the case to the lower half.  Because I am using VGA connections, I am contemplating making the cable easily removable, at least from one end, so that it is easy to connect to an external monitor, or indeed use the 15" panel as a VGA display for something else when required.

I might buy a panel and LCD driver kit just to get my head around getting those working together and with the FPGA board, and then incrementally work on everything else as I get the opportunity.

Friday, 14 November 2014

Migrating to Vivado (Part 3)

Vivado is still refusing to synthesise the various BRAM blocks correctly.

I have, however, had a response to my post on the Xilinx Community Forums asking for more information about the problem, which I have provided.

Let's hope that they can suggest a solution.

Thursday, 13 November 2014

Migrating to Vivado (Part 2)

The title says it all.

I have managed to massage the C65GS source so that it synthesis in Vivado.

This involved some silly things that Vivado should really support, like casting ports in a module. Don't worry about what that means. Just understand that it is something really simple, and Vivado chokes on it.

However, that dwarfs into insignificance with the real problem: Vivado fails to detect when to use BRAM (the little RAM blocks in the FPGA), and instead tries to implement them using logic.  This means that Vivado thinks I need an FPGA 5x bigger than I have, even though ISE can synthesise it to fit in less than 1/2 the current one.

Oddly, it does realise that two small memories should be BRAMs, as the following output from Vivado shows (you might need to make your display really wide to see the gory detail).

Start RAM, DSP and Shift Register Reporting
---------------------------------------------------------------------------------

Block RAM:
+------------+-------------------------+------------------------+---+---+------------------------+---+---+--------------+--------+--------+------------------------+
|Module Name | RTL Object              | PORT A (depth X width) | W | R | PORT B (depth X width) | W | R | OUT_REG      | RAMB18 | RAMB36 | Hierarchical Name      | 
+------------+-------------------------+------------------------+---+---+------------------------+---+---+--------------+--------+--------+------------------------+
|framepacker | thumnailbuffer0/ram_reg | 4 K X 8(READ_FIRST)    | W |   | 4 K X 8(WRITE_FIRST)   |   | R | Port A and B | 0      | 1      | framepacker/extram__26 | 
|viciv       | rasterbuffer1/ram_reg   | 2 K X 9(READ_FIRST)    | W |   | 2 K X 9(WRITE_FIRST)   |   | R | Port A and B | 1      | 0      | viciv/extram__42       | 
+------------+-------------------------+------------------------+---+---+------------------------+---+---+--------------+--------+--------+------------------------+

Note: The table shows RAMs generated at current stage. Some RAM generation could be reversed due to later optimizations. Multiple instantiated RAMs are reported only once. "Hierarch
ical Name" reflects the hierarchical modules names of the RAM and only part of it is displayed.
Distributed RAM: 
+--------------+--------------------------+--------------------+----------------------+--------------------------------+----------------------+
|Module Name   | RTL Object               | Inference Criteria | Size (depth X width) | Primitives                     | Hierarchical Name    | 
+--------------+--------------------------+--------------------+----------------------+--------------------------------+----------------------+
|gs4510        | shadowram0/ram_reg       | Implied            | 128 K X 8            | RAM256X1S x 4096               | gs4510/ram__25       | 
|iomapper__GC0 | kickstartrom/ram_reg     | Implied            | 16 K X 8             | RAM256X1S x 512                | ram__26              | 
|framepacker   | videobuffer0/ram_reg     | Implied            | 4 K X 8              | RAM64X1D x 128  RAM64M x 128   | framepacker/ram__28  | 
|ethernet      | rxbuffer0/ram_reg        | Implied            | 4 K X 8              | RAM64X1D x 128  RAM64M x 128   | ethernet/ram__29     | 
|ethernet      | rrnet_rxbuffer/ram_reg   | Implied            | 4 K X 8              | RAM64X1D x 128  RAM64M x 128   | ethernet/ram__30     | 
|ethernet      | txbuffer0/ram_reg        | Implied            | 4 K X 8              | RAM64X1D x 128  RAM64M x 128   | ethernet/ram__31     | 
|sdcardio      | sdsectorbuffer0/ram_reg  | Implied            | 512 X 8              | RAM64X1D x 16  RAM64M x 16     | sdcardio/ram__32     | 
|sdcardio      | sdsectorbuffer1/ram_reg  | Implied            | 512 X 8              | RAM64X1D x 16  RAM64M x 16     | sdcardio/ram__33     | 
|sdcardio      | f011sectorbuffer/ram_reg | Implied            | 512 X 8              | RAM64X1D x 16  RAM64M x 16     | sdcardio/ram__34     | 
|uart_monitor  | membuf_reg               | Implied            | 16 X 8               | RAM16X1S x 8                   | uart_monitor/ram__35 | 
|uart_monitor  | historyram0/ram_reg      | Implied            | 1 K X 177            | RAM256X1S x 708                | uart_monitor/ram__37 | 
|viciv         | chipram0/ram_reg         | Implied            | 128 K X 9            | RAM128X1D x 9216               | viciv/ram__38        | 
|viciv         | colourram1/ram_reg       | Implied            | 64 K X 8             | RAM128X1D x 4096               | viciv/ram__39        | 
|viciv         | paletteram0/ram_reg      | Implied            | 1 K X 32             | RAM256X1S x 128                | viciv/ram__40        | 
|viciv         | charrom1/ram_reg         | Implied            | 4 K X 8              | RAM256X1S x 128                | viciv/ram__41        | 
|viciv         | sprite_x_reg             | Implied            | 8 X 8                | RAM16X1S x 8                   | viciv/ram__42        | 
|viciv         | sprite_data_offsets_reg  | Implied            | 8 X 6                | RAM32M x 1                     | viciv/ram__43        | 
|viciv         | sprite_colours_reg       | Implied            | 8 X 8                | RAM16X1S x 8                   | viciv/ram__44        | 
|viciv         | sprite_y_reg             | Implied            | 8 X 8                | RAM16X1S x 8                   | viciv/ram__45        | 
|viciv         | buffer1/ram_reg          | Implied            | 512 X 8              | RAM64X1D x 16  RAM64M x 16     | viciv/ram__46        | 
+--------------+--------------------------+--------------------+----------------------+--------------------------------+----------------------+

All the rest are being implemented as various types of fabric-based RAM.

I have asked for help on the Xilinx Community Forums to see if anyone knows whats going wrong, and how I can work around it.

But for now, it looks like I am using ISE for a bit longer still.

Monday, 10 November 2014

Migrating from ISE to Vivado (Part 1)

I am in the process of attempting to migrate C65GS development from Xilinx's old ISE development environment to their new Vivado.  The promise is a much better synthesis tool chain, with shorter synthesis times, better timing, and it should be easier to describe timing constraints which are a black art in ISE.

Vivado is only a couple of years old, which means that it is still quite unstable as these things go.

Just getting it to install can be interesting.

After downloading the 4.8GB (!) tarball, extracting it and running the xsetup program, it got stuck showing just the splash startup window.

Fortunately I was able to find the solution in the following thread:

http://forums.xilinx.com/t5/Installation-and-Licensing/Vivado-hangs-forever-with-white-window-during-startup-Linux/td-p/479058

Basically it boils down to setting the following shell variable, ideally in your .bashrc or similar ( don't forget to log out and in again or source the file for it to take effect after you have changed it).

export _JAVA_AWT_WM_NONREPARENTING=1

I am unable to comment on why this works, but simply confirm that it lets me at least see the installer.

Now to see if installation works ...

Even more progress on proportional fonts

Hot on the heals of the earlier fixes of the evening, descenders (the bits of characters that hang below the base-line) and bits poking out the top are now correctly draw as can be seen in the screen-shot below.

Along the way I have fixed bugs in my unicode printing programme, and also a couple of bugs in my true-type font rasteriser (the top pixel of characters could be chopped off, and character tiles were being written to the file in the wrong order).

As with the other screen-shots, the Matrix-like green tint is an artefact of the image capture process I am using, and the bad kerning is pending a patch to the FPGA so that it stops ignoring the least significant bit of the kerning field of each character.

Once the horizontal kerning has been fixed, that just leaves the vertical spacing to fix.  This will probably require both FPGA and more font-file format tweaking.


The above image is captured the same way as the one in the previous post. Then I realised that my VNC viewer was scaling the image, so below is the same image really without any scaling:




Just for further fun and to prove that it works for larger font sizes, I produced a 24 point version of the same font to try out:


This revealed a few new glitches, like why is the "t" in jupiter mostly missing.  There are also some other issues, like the width of the right vertical of "H", Alef (the right-most Hebrew character) that I will have to look into.

The disappearing "t" turned out to be a quick fix, so the final result for the night is as follows:


 Note also that with a big font the suppression of the spare blank raster lines becomes less important.  I still intend to fix it, however.

Sunday, 9 November 2014

More work on proportional fonts

I have done a bit more work on the unicode text display programme.

The following screen-shot is completely generated by the programme, in contrast to the screen-shot in the previous post where I manually populated the screen memory with the tiles and kerning adjustments.

The Unicode string used to feed the program is:

unicodestring:
       .word 'H,'e,'l,'l,'o,$20,'W,'o,'r,'l,'d
       .word $000d
       .word $05d4,$05d3,$05d2,$05d1,$05d0
       .word $000d
       .word 'g,'a,'r,'y,$20,'j,'u,'p,'i,'t,'e,'r
       .word $000d       
       .word 0

Character $20 is of course space, and the $05xx characters are some Hebrew glyphs I threw in just to show that we are not limited to latin characters.

So, how did it turn out?



The screen shot is actual size so that there is hopefully no strange things going on in terms of smoothing of the zoomed image.

The slight green sheen is an artefact of the C65GS VNC server which sends pixels using a 3x3x2 colour cube, and so suffers some colour distortion.  This doesn't need fixing since it is purely a VNC artefact.

We still have the kerning glitch, because the VIC-IV isn't honouring the least significant bit of the kerning field, so H and W are followed by one more blank pixel than they should.  This is on my list to fix in the FPGA programme.  I'll likely rearrange the bits so that the kerning bits are all adjacent, instead of spread between two different bytes (one of which is clearly not being read properly).

The programme has, however, faithfully rendered the first line of Latin and Hebrew Unicode characters, including H and W that are two tiles wide.

It also handles the carriage-returns, however, it doesn't correctly calculate how many rows of characters are needed, nor does it allow for reducing the height of character rows to provide correct line spacing.  This will require implementing the remaining hardware support for this in the VIC-IV.

Also, in the character rows that don't contain active non-blank glyphs, it isn't putting a blank tile there, nor is it adjusting the width of each to kern them to the correct width, so there is rubbish which is wider than the actual text on the 2nd and 3rd rows of each line of output.  This won't be too hard to fix, as it is just a software issue.

Finally, the third row is really messed up, because the line of characters that are drawing the under-hanging pixels are being written over the main row, instead of being written into the next row down.  This shouldn't be too hard to fix either, as it is just a software issue.

Also, it would be nice if the routine cleared the remainder of the screen, but that's really just icing on the cake.

So for now we have a nice bit of progress, and I might take a look at fixing the kerning and rows of junk bugs next.

Saturday, 8 November 2014

Easily Accessing All of Memory

As I have begun work on writing a general-purpose Unicode string renderer for the C65GS, it has got me right at the coal face of developing software to run on this machine, as compared to just making the machine. Creator and user have rather different needs and experiences.
Until now, I had simply assumed that I would use DMA and memory banking to provide access to all of memory.  Technically it provided everything that one could need.  Once I started to write software for the machine, it became immediately apparent that these methods were fine for accessing slabs of memory, but would be rather inconvenient for the normal use case of reading or writing some random piece of memory somewhere.
What I found was that if I had a pointer to memory and wanted to PEEK or POKE through that pointer, it was going to be a herculean task, and one that would waste many bytes of code and cycles of CPU to accomplish -- not good for a task that is the mainstay of software.
I was also reminded that the ($nn),Y operations of the 6502 are essentially pointer-dereference operations.  So I thought, why don't I just allow the pointers to grow from 16-bits to 32-bits.  Then one could just use ($nn),Y or ($nn),Z operations to act directly on distant pieces of memory.
Slight problem with this is that the 4502 has all 256 opcodes occupied, so I couldn't just assign a new one.  I would need some sort of flag to indicate what size pointers should be.  This had to be done in a way that would not break existing 6502 or 4502 code.
The experience of the 65816 led me to think that a global flag was not a good idea, because it makes it really hard to work out what is going on just by looking at a piece of code, especially where instruction lengths change.
So I decided to go for a bit of an ugly hack: If an instruction that uses the ($nn),Z addressing mode immediately follows and EOM instruction (which is what NOP is called on the 4502), then the pointer would be 32-bits instead of 16-bits.
While ugly, it seems to me that it should be safe, because no 6502 code uses ($nn),Z, because it doesn't exist. Similarly, there is so little C65 software that it is unlikely that any even uses ($nn),Z, and even less of it should have an EOM just before such an instruction.  
In fact, in the process of implementing 32-bit pointers, I discovered that ($nn),Z on the 4510 was actually doing ($nn),Y, among other bugs.  So clearly the C65 ROM mustn't have even been using the addressing mode at all!
Here is the summary of how this new addressing mode works in practise.  The text below is as it appears in the C65GS System Notes which is being developed with the help of the community.

32-bit Memory Addresses using 32-bit indirect zero-page indexed addressing


The ($nn),Z addressing mode is normally identical in behaviour to ($nn),Y other than that the indexing is by the Z register instead of the Y register.  That is, two bytes of zero-page memory are used to form a 16-bit pointer to the address to be accessed. However, if an instruction using the ($nn),Z addressing mode is immediately preceded by an EOM instruction, then it uses four bytes of zero-page address to form a 32-bit address.  So for example:

zppointer: .byte $11,$22,$33,$04

ldz #$05
eom
lda ($nn),Z

Would load the contents of memory location $4332216 into the accumulator.

LDA, STA, EOR, AND, ORA, ADC and SBC are all available with this addressing mode.

Memory accesses made using 32-bit indirect zero-page indexed addressing require three extra cycles compared to 16-bit indirect zero-page indexed addressing: one for the EOM, and two for the extra pointer value fetches.

This makes it fairly easy to access any byte of memory in the full 28-bit address space.  The upper four bits should be zeroes for now, so that in future we can expand the C65GS to 4GB address space.

Wednesday, 5 November 2014

Starting to write tests for 16-bit text mode, including proportional fonts

I have been bashing away at a little test program that demonstrates printing unicode strings using a proportional font.

This involves looking up the glyph in the font, then getting its tile map, and then building the screen lines to draw and several other steps.  While none of these steps are too complicated, it does make for about 1KB of code, including switching screen modes and setting the palette to a grey gradient since I don't have the alpha-blender working yet.

I could have settled for just a simple hard-coded test, but I think it is worth exercising the whole idea of how I intend to draw proportional fonts to make sure that it is feasible.

Anyway, it's got late, and I haven't got the code working yet, but I did spend a few minutes hand picking the necessary tiles and setting the kerning values to narrow the characters down so that it looks half-decent.


The main visual glitch is that single pixel resolution of the kerning is not working for some reason, so the gap between "H" and "e" is one pixel too wide.

Also, the colour cube that the VNC viewer uses butchers the gradient, and so some of the shading looks weird.  On the real screen this looks quite a bit better. Nonetheless, the result is pretty reasonable for a first go at it, and will certainly look nicer once I figure out why it is ignoring the least significant kerning bit so that there is no big gap between "H" and "e".

Tuesday, 4 November 2014

More work on proportional fonts

I am edging my way towards getting hardware proportional fonts working.

To recap progress to date, text mode can be switched to 16-bit mode, where two screen RAM bytes and two colour RAM bytes describe each character instead of one of each.  Some of those extra bits can be used to specify the width of a character, between 1 and 8 pixels wide.  Thus proportional fonts can be constructed using some full-width characters and some narrowed characters.

I have now written a little programme that takes a true-type font, and produces a font file composed of 8x8 character blocks.  It isn't perfect, but it does create a simple file with a list of unicode points, tile maps that say which tiles go to make each glyph, and then the array of 8x8 tiles, 64 bytes each.  You can see the source at:

http://github.com/gardners/c65gs-font-rasteriser

It doesn't do any compression of the 64-byte blocks, so the fonts are quite a bit bigger than they need to be.  However, it should be very easy to write some 6502 assembler that given a 16-bit unicode string, can setup a 16-bit text mode screen to display the text.

Simultaneously, I have been working on the anti-alias renderer for 8x8 tiles. It isn't done yet, but the alpha blender is in the design now.  Assuming that it works, it shouldn't be too hard to plumb it in, and start displaying alpha-blended 8x8 character tiles.

Sunday, 2 November 2014

Loading data via ethernet

As I have been working on the proportional font support for the VIC-IV, I have come to the realisation that I need some fairly comprehensive and reproducible test programmes due to the complexity of the VIC-IV text mode.

I can load programmes onto D81 images and put those on the SD card, and mount them from the simple menu I made. However, it requires physical plugging and unplugging, and is a general nuisance.

The other approach that I have at the moment is that I can use the serial monitor to load data, but VERY slowly and unreliably.  This is basically because the serial monitor doesn't have a FIFO on the serial input, and so it looses characters quite easily.

I do have a working 100mbit ethernet adapter, however.   Ideally, I would just use the RR-NET emulation to enable me to run the RR-NET version of 64NET.  However, there are still enough bugs in the RR-NET emulation to stop that from working.  This also could be more easily debugged if I could quickly and easily load test programmes onto the C65GS.

So I have finally gotten around to writing a little program that looks for UDP packets coming in via ethernet, and then providing a way to load them into memory somewhere.

Because the C65GS ethernet buffer is direct memory mapped, I can use a nice trick, of having the main loading routine actually inside the packets.  This means that the ethernet load programme on the C65GS can be <128 bytes, and yet support very flexible features, since the sending side can send whatever code it likes.  It is only about 100 lines of 4502 assembler, so I'll just include the whole thing here.

.org $CF80
First, we need to turn on C65GS enhanced IO so that we can access the ethernet controller:
lda #$47
sta $d02f
lda #$53
sta $D02f
Then we need to map the ethernet controller's read buffer.  This lives at $FFDE800-$FFDEFFF.  We will map it at $6800-$6FFF.  The 4502 MAP instruction works on 8KB pieces, so we will actually map $6000-$7FFF to $FFDE000-$FFDFFFF.  Since this is above $00FFFFF, we need to set the C65GS mega-byte number to $FF for the memory mapper before mapping the memory.  We only need to do this for the bottom-half of memory, so we will leave Y and Z zeroed out so that we don't change that one.
lda #$ff
ldx #$0f
ldy #$00
ldz #$00
map
eom
Now looking at the $DE800 address within the mega-byte, we use the normal 4502/C65 MAP instruction semantic.  A contains four bits to select whether mapping happens at $0000, $2000, $4000 and/or $6000. We want to map only at $6000, so we only set bit 7.  The bottom four bits of A are bits 8 to 11 of the mapping offset, which in this case is zero.  X has bits 12 to 19, which needs to be $D8, so that the offset all together is $D8000.  We use this value, and not $DE000, because it is an offset, and $D8000 + $6000 = $DE000, our target address.  It's all a bit weird until you get used to it.
lda #$80
ldx #$8d
ldy #$00
ldz #$00
map
eom
Now we are ready to make sure that the ethernet controller is running:
lda #$01
sta $d6e1
Finally we get to the interesting part, where we loop waiting for packets.  Basically we wait until the packet RX flag is set
loop:

waitingforpacket:
lda $d6e1
and #$20
beq waitingforpacket
So a packet has arrived.  Bit 2 has the buffer number that the packet was read into (0 or 1), and so we shift that down to bit 1, which selects which buffer is currently visible.  Then we write this to $D6E1, which also has the effect of clearing the ethernet IRQ if it is pending.
lda $d6e1
and #$04
lsr
ora #$01
sta $d6e1
Now we check that it is an IPv4 UDP packet addressed to port 4510
; is it IPv4?
lda $6810
cmp #$45
bne waitingforpacket
; is it UDP?
lda $6819
cmp #$11
bne waitingforpacket
; UDP port #4510
lda $6826
cmp #>4510
bne waitingforpacket
lda $6827
cmp #<4510
bne waitingforpacket
If it is, we give some visual indication that stuff is happening. I'll take this out once I have the whole thing debugged, because it wastes a lot of time to copy 512 bytes this way, since I am not even using the DMAgic to do it efficiently.  In fact, this takes more time than actually loading a 1KB packet of data.
; write ethernet status to $0427
lda $d6e1
sta $0427

; Let's copy 512 bytes of packet to the screen repeatedly
ldx #$00
loop1: lda $6800,x
sta $0428,x
lda $6900,x
sta $0528,x
inx
bne loop1
The final check we do on the packet is to see that the first data byte is $A9 for LDA immediate mode.  If so, we assume it is a packet that contains code we can run, and we then JSR to it:
lda $682c
cmp #$a9
bne loop
jsr $682C
Then we just go looking for the next packet:
jmp loop

As you can see, the whole program is really simple, especially once it hits the loop.  This is really due to the hardware design, which with the combination of DMA and memory mapping avoids insane fiddling to move data around, particularly the ability to execute an ethernet frame as code while it sits in the buffer.

The code in the ethernet frame just executes a DMAgic job to copy the payload into the correct memory location.  Thus the complete processing of a 1024 byte ethernet frame takes somewhere between 2,048 and 4,096 clock cycles -- fast enough that the routine can load at least 12MB/sec, i.e., match the wire speed of 100mbit ethernet.

On the server side, I wrote a little server program that sends out the UDP packets as it reads through a .PRG file.  Due to a bug in the ethernet controller buffer selection on the C65GS it currently has to send every packet twice, effectively halving the maximum speed to a little under 6MB/sec.  That bug should be easy to fix, allowing the load speed to be restored to ~10MB/sec.  (Note that at the moment the protocol is completely unidirectional, but that this could be changed by sending packets that download code that is able to send packets.)

When the server reaches the end of the file, the server sends a packet with a little routine that pops the return address from the JSR to the packet from the stack, and then returns, thus effectively returning to BASIC -- although it does seem to mess up sometimes, which I need to look into.

It would be nice to test the actual speed of the resulting system, but I haven't really got a setup for this yet in a robust way.

However, one can get some sort of idea of the speed by timing the server program.

When loading SynthMark64, which is about 8KB, it takes between 0.01 and 0.02 seconds, which equates to 400kb/sec - 800kb/sec.  Loading a larger programme of about 43KB takes 0.03 seconds, giving a speed of about 1.5MB/sec, which is a bit more respectable. [EDIT: I have now fixed the ethernet buffer bug, and so loading speed is easily exceeding 2MB/sec, and will likely go higher with some tuning.]

All that remains to make this useful is to fix the bugs that stop it from returning to BASIC properly, and to add the little 128 byte programme into the kickstart ROM so that it is available on boot up, just like the disk image mounting menu.