Tuesday 27 July 2021

Debugging TCP/IP Problems

WARNING: This is a rather long rambly post, as I rather transparently document the quagmire of bugs of my own and others doing in the TCP/IP stack we are using on the MEGA65. But if you are not deterred, and would like to join in the mystery, the chase and the hair pulling, read onwards!  

If you have read the previous blog post, you will have discovered that I have been having some problems with the TCP/IP stack with strange delays and mis-handling of received packets, resulting in corruption of the received data stream.

I started investigating it on a stream today, until the wild weather outside knocked the power out.  Fortunately, the power came back on sooner than expected, so I can take a bit more of a look.

What I started doing was to make a way to instrument the WeeIP TCP/IP stack to log all packets sent or received via the serial monitor interface, and then making a utility to convert that into a pcap capture file, so that I could investigate it using Wire Shark.

This has already been super helpful, as I was able to see that acknowledgements are taking a second or so before they get sent, instead of being sent immediately.  That will certainly be contributing to all manner of problems with TCP/IP flow control, Round-Trip Time estimation and the TCP congestion control algorithm.

Now I intend to instrument the TCP/IP stack to indicate when it thinks it has received a TCP segment, and when it has scheduled an ACK and when it has actually sent the ACK, so that we can find out where the delays are occurring.

Poking around in this for a bit, I have found the problem with the corrupt TCP streams: The problem was caused by the call-backs to digest the data being scheduled, rather than executed immediately. As a result, further data packets could be received before the call-back got called, and thus the buffer of data would be replaced with the contents of the most recently arrived data packet.  

To solve that problem, I modified the TCP/IP stack to immediately call the call-back, rather than scheduling it for later.  This also allowed increasing of the TCP window size, which when things are behaving well, for the transfer to be quite a bit faster.  It looks to be loading at several hundred KB/sec, when it is behaving properly, which is much better than the ~20 -- 40KB/sec that it was peaking at before. So that's a nice advance.

So now to focus back on the ACK problem.  I tried making the code immediately send an ACK for every received frame, but that makes it either crash, hang, or generally doesn't help. But such a change shouldn't be necessary, anyway. 

Poking around a bit more, I have discovered that the eth_task() function sometimes falls off the task list, which of course then causes the network to stop. I could just reschedule it all the time, but I'd rather work out how it gets cancelled.  I suspect that it might be related to the ACKs not happening in a timely manner, as both can be broadly considered to be scheduling bugs.

A separate contributor to this problem is that we don't resend ACKs while trying to read data. This means that if the ACK is lost, then we have to wait until the sender decides to send more data. That's a recipe for trouble with our very small TCP receive window.

While trying to find out how I might add retransmissions, I discovered that WeeIP already has a retransmission mechanism for TCP -- its just that it was disabled for some reason. While investigating that, I also discovered that WeeIP doesn't close TCP connections properly, either, as the logic in the CLOSEWAIT1 state would erroneously skip the CLOSEWAIT2 state and go directly to CLOSED.

Trying to fix that is proving a bit fun, as I am now hitting the 38KB programme size limit I was talking about earlier. I've managed to save enough space by moving the packet buffer to $C000, and shortening some messages.

Now that I can run things again, I have found the bug where it would go direct from CLOSEWAIT1 to CLOSED without going via CLOSEWAIT2.  That now works, and TCP connections seem to get properly closed, which is nice. This means I don't have stray ACK packets being sent after the MEGA65 end has closed the connection, which were polluting my packet captures, and making it harder to follow the rest of what is going on.

Doing all that, I have also seen perhaps the reason why sometimes it refuses to connect to begin with.  It looks like it never actually sends the HTTP request, due to some race condition.  I have put a catch in for that, but it happens so relatively rarely, that I haven't been able to observe it happening since I added it. So I'll return to tracking down the slow TCP transfers again.

I am noticing that the MEGA65 genuinely thinks it hasn't received packets that tcpdump on my laptop indeed shows have been sent.  Its quite reliable when it happens, with the TCP congestion window reaching two packs, and the 2nd of each pair (or the first of each pair) being lost, but the other arriving. Now to find out if the packets get properly received, and then ignored, or if they arrive corrupt, or exactly what is going on. But that blasted memory limit is really giving me grief.

Well, I feel like I am chasing my own tail here, because now it does seem to be receiving the packets, but it doesn't always ACK them straight away. So let's see if we can't get reliable ACKing happening after each packet, and see if that doesn't solve the problem.  Indeed by default if it received an in-sequence TCP segment, it was not requesting an ACK.  This is normal behaviour for normal "big" TCP stacks with large window sizes.  

We could ACK every packet, which is probably not a big problem for us, since we shouldn't be doing huge transfers on the MEGA65, where that might become an issue.  That certainly helps to reduce this problem quite a bit.  Or we could even get clever and ACK only when we see a packet in sequence following an out-of-sequence packet. But that's more code, so lets stick with the simple and reasonably effective approach instead.

Hmmm... That chasing ones own tail feeling is still there.  There seem to be ethernet frames in both directions that are not being received.

Nonetheless, I managed to find and fix a couple of TCP state machine bugs, as well as some bugs in md2h65 and the mouse position calculating code, and now when the missing packet problems don't occur, it is possible to reasonably quickly load pages, scroll around in them, and click on links to navigate between them. But those other problems do still make it a bit hit and miss.  Just getting it to start up on the index page, and then click to another page and back again is taking multiple attempts. Its frustrating, because I think it is now mostly race-conditions tickling different remaining bugs.  But I am determined to get one good sequence so that I can film it, before I go to bed.

A new day, and hopefully a fresh set of eyes to debug things.

I've added some instrumentation in to check whether the ethernet interface ever refuses to send frames. Given how slow we are sending, and that we have a true 100mbit ethernet interface, it should never block on transmission. But we are indeed seeing it block. The F's are blocked transmissions and the -'s are received packets in this screen shot:

So something is whacked out with the Ethernet controller.  Looking closer, its not the Ethernet controller that is blocking it, but rather a lack of ARP resolution of the requested IP address. But that, in turn, seems to be because when we try to transmit a packet, nothing gets received.

This again makes me wonder if we don't still have a problem with intermittent corruption of Ethernet frames when they are being sent. Specifically, we have previously had problems with the clock edge and data not being properly lined up.  We even have a register that lets us attempt to set the phase difference between them.  So its possible that we can give that a tweak to check.  

It would be nice to be able to check for corrupt Ethernet frames directly, but because I am going through the EoP adapter, that will be discarding any corrupt frames before they get passed on.  I do have Ethernet on my laptop, so I might be able to rig up a DHCP server and direct cable, and see if I get packet errors visible that way. Otherwise I might have to hook up another MEGA65 to it, so that I can really closely examine the Ethernet frames, as the MEGA65 Ethernet controller allows for the reception of frames that fail the CRC check, thus allowing me to examine the kinds of errors we are encountering.

DHCP server setup on the laptop now, and a direct ethernet connection, to see what is going on. Zero RX errors, and it seems to work pretty much fine. Certainly bucket loads faster than the poor old EoP adapter could do, and I can even browse between a couple of pages without drama, before anything odd happens. But, yes, something odd does happen: The FIN ACK process got stuck on the 3rd page load. Presumably a bit of a race condition with the opposite side.  Once we have sent our FIN ACK, we should just continue, assuming all to be well. Eventually we can implement RST to quench any TCP connections that don't get shut completely.

Okay, so that's implemented. Now I am seeing a funny situation where the DHCP offer is not being accepted.  The packet is arriving, though. It looks like the transaction ID sometimes doesn't match.  This is presumably when the DHCP reply comes just after we have sent our DHCP request, and thus we have changed the XID we are looking for.  To solve this, I have modified our DHCP code to use the same XID for all repeated requests, so that it can't mis-match.

So, back to the direct connected network. It's much faster than over the horrible EoP link, but it can still get itself messed up.  It looks like its the task scheduler now that is being the most problem, with the Ethernet task falling off the task list -- or something approximating it.

Closer examination reveals that the laptop is sending ARP requests to the MEGA65, but sometimes the MEGA65 doesn't seem to receive them. This is also tickling a problem where during the TCP connect phase, we try sending 5 packets 200ms apart, and then give up trying any more.  If the ARP resolution is outstanding during that time, and the no-ARP-packets-received bug is happening, then the TCP connection will fail.   While we need to fix the ARP problem, its clear that we also need to improve the TCP connection retries and timeouts.

Well, that helps quite a bit, in that the ARP gets attempted again after a little while, and because we are continuing to retry over a longer period of time (I am now trying every 300ms for 30 attempts = ~9 seconds), we have better success. This is fine for on the local network, but it probably makes sense to allow for even longer time, maybe with a graduated timeout interval, so that connecting over the Internet to the other side of the world can also be made a bit more robust.

Now that that is under control, I can sometimes load a page, and follow a link on that, and then a link from that back, with the pages loading quite quickly.  After that, clicking on links stops working for some reason that I need to work out.  But to make that easier, I need to make it so that pages load every time. The main problem now seems to be out-of-order TCP segment arrival being handled in a super primitive way. 

Basically the WeeIP TCP/IP stack lies and claims to have a window size which is larger than the buffer it has.  We are allocating a 2KB buffer at $C000, so it would be nice to make WeeIP use that to allow multiple segments to arrive, and to allow at least rudimentary out-of-order packet arrival.  My gut feeling is that adding a single "out of order segment" descriptor to the sockets will be a good compromise of performance and simplicity, allowing for a single missed packet to be caught up on more easily.  I'll also make it adjust the window size based on the number of remaining bytes in the buffer.  

This all assumes that there is only one active TCP/IP connection at a time, as the buffer is shared among all sockets. This will, most assuredly, come back to bite me at some point, thus my making note of it here, so that I (hopefully) don't forget.  Actually, if I get all clever pants, we can keep track of what is in the buffer, so that it will remain safe, and out-of-order received packets will be remembered, only provided that they don't cause grief.  So I'll do that. This means the data on buffer usage will now be for the buffer, rather than per-socket. That actually saves a bit of RAM and will probably make the code simpler. Ah, no, I tell a lie. WeeIP requires that a buffer be allocated per socket, after all.  So ignore all that doom and gloom. It's all perfectly safe.

Even with such a relatively simple scheme, there are several cases to handle:

1. New packet is the only thing to go in the buffer;

2. Append new packet to end of what is already in the buffer;

3. New packet comes after a gap, and there is no out-of-order segment yet;

4. New packet comes after a gap, and can be pre-pended to out-of-order segment; and

5. New packet comes after a gap, and can be appended to out-of-order segment.

Cases 1 and 2 can actually be merged, as (1) is just (2), with the special case that there are 0 bytes in the buffer.

After that its all just a pile of fiddly buffer handling and debugging until I can get it to work stably.

And its a new day, as I continue to try to find and deal with the remaining bugs.   This really consists of me running it, and waiting for some out-of-order packets to arrive, and to then see what it claims to be doing via the copious debug messages I have added.  It seems that when packets arrive in order, its all fine, so it will just be some bug in one of the out-of-order cases.

One bug down: I was incorrectly handling the merging of the received bytes in the buffer, and the out-of-order segment. Fixing that seems to have helped a bit.

Now I am hitting the problem that we don't seem to find the end of the stream any more, and thus the page never gets displayed. Essentially, the socket seems to disconnect before we get the last bit of data. Or something. Quite possible its more the "or something", since we still seem to not be correctly receiving data.

Not surprisingly, finding more bugs in my changes: I was providing the wrong message type when delivering data from a socket to the callback, so it was being ignored. That's fixed now, and it looks like the data is all being delivered now. 

Next remaining problem is that when we get to the end of the page data, it isn't correctly realising to terminate the process.  Found and fixed that problem.

Now we are back to being able to load pages and follow links, until suddenly the links stop working.  I'm going to prune out a pile of the debug output I put into the TCP out-of-order stuff while debugging that, so that I can clean up the output a bit, and also free up some memory, as I am flying right against the edge of memory usage right now.  Then I'll be able to instrument the link stuff, to figure out what is going wrong there.

Now with the debug info reduced, it looks like I might have triggered some other race condition, as the TCP three-way handshake seems to hang for ages a lot of the time now. Yes, indeed we seem to be failing to ACK the SYN packet from the server.

It looks now like the problem is that we are perhaps not properly processing the incoming SYN packets, but I'm also not totally sure of that, as there are still some funny things going on. One minor issue I can, is that we don't reply to the DHCP OFFER, and so the Linux box keeps sending DHCP offers for a while, and seems to block other traffic for the MEGA65 during that time. 

Perhaps related to that, what we then see, is that the Linux side sees the SYN packet, as confirmed by tcpdump, and then instead of responding with the SYN+ACK, it sends an ARP request... even though it just received a packet from the MEGA65, which means that it should have the MAC address in the ARP cache already, e.g.:

13:54:26.553488 IP (tos 0x8, ttl 64, id 24330, offset 0, flags [none], proto TCP (6), length 40) > Flags [S], cksum 0xc62f (correct), seq 3288990057, win 2048, length 0
13:54:26.553587 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has tell, length 28
13:54:26.840310 IP (tos 0x8, ttl 64, id 24331, offset 0, flags [none], proto TCP (6), length 40) > Flags [S], cksum 0xc630 (correct), seq 3288990056, win 2048, length 0
13:54:27.573665 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has tell, length 28

This makes me think that maybe the DHCP thing is the problem here, after all.

Well, we now have DHCP request and acknowledgements happening, and thus we are obeying the DHCP protcol completely... And the funny problems still happen with being asked to ARP repeated times, and generally it being slow to get started.

Actually the DHCP negotiation tends to end up sending multiple copies of packets in each direction, before they get responded to.  I'm not sure if this is a real problem or not.  When I run tcpdump, I see all the packets from the MEGA65 without problem, so its not TX corruption. Similarly, we see the packets all arrive on the MEGA65 side, so I don't think its an RX problem.  

We did previously have a problem where Ethernet frames would be received, but not immediately released from the Ethernet buffers, until the next frame arrives. That is, each frame would be held in a kind of purgatory until the next frame arrives.  It is possible that something like that is happening, but I'm not really convinced. The reasons being, that (1) I'm pretty sure I fixed it some time ago; and (2) it doesn't feel like things are being delayed by a single frame, which is all that that old bug could do.

To test this, I wanted to be able to just use ping to check the round-trip-time of packets, to confirm whether or not we are seeing any funny delays or not. 

As seems to be the case throughout this whole adventure with the browser, I had to fix a bug that was preventing me from doing this in some other part of it. For some very strange reason the following line in the ICMP ping reply code would crash the machine:

       *(unsigned short *)&tx_frame_buf[14+20+2] = checksum_result();

Even though

       *(unsigned short *)&tx_frame_buf[14+10] = checksum_result();

a few lines later doesn't, and the following also works just fine, which in theory does the same thing:

      tx_frame_buf[14+20+2] = (checksum_result());
      tx_frame_buf[14+20+3] = (checksum_result())>>8;

I have absolutely no idea what the problem with that was, but it was very much a real thing, and with this work around, I now generate correct ICMP checksums, and can ping the MEGA65 when it is running the network stack, and see lovely things like:

$ ping
PING ( 56(84) Bytes Daten.
64 Bytes von icmp_seq=1 ttl=64 Zeit=2339 ms
64 Bytes von icmp_seq=2 ttl=64 Zeit=1325 ms
64 Bytes von icmp_seq=3 ttl=64 Zeit=308 ms
64 Bytes von icmp_seq=4 ttl=64 Zeit=3.61 ms
64 Bytes von icmp_seq=5 ttl=64 Zeit=3.69 ms
64 Bytes von icmp_seq=6 ttl=64 Zeit=3.01 ms
64 Bytes von icmp_seq=7 ttl=64 Zeit=3.67 ms
64 Bytes von icmp_seq=8 ttl=64 Zeit=3.16 ms

The first few pings are delayed, presumably as we wait for the network stack to settle. But after that, we are seeing very nice ping times of 3 to 4 milli-seconds -- not bad, given we are using CC65 as the compiler, and the TCP stack does make copies of packets as they arrive etc.  That all looks great, but then when I ran it again a bit later:

$ ping
PING ( 56(84) Bytes Daten.
Von icmp_seq=1 Zielhost nicht erreichbar
Von icmp_seq=2 Zielhost nicht erreichbar
Von icmp_seq=3 Zielhost nicht erreichbar
64 Bytes von icmp_seq=5 ttl=64 Zeit=3075 ms
64 Bytes von icmp_seq=6 ttl=64 Zeit=3052 ms
64 Bytes von icmp_seq=7 ttl=64 Zeit=3030 ms
64 Bytes von icmp_seq=8 ttl=64 Zeit=3007 ms
64 Bytes von icmp_seq=9 ttl=64 Zeit=3006 ms

Now we have a 3 second delay.  Now, this is interesting, because most of the weird behaviours we have seen have involved seeing triplicate of packets before they seem to get acted on.  If the MEGA65 Ethernet controller is getting all confused with its buffers, such that it gets several packets behind in announcing their arrival, this would cause exactly this kind of effect.

It's also fairly reproducable, so it shouldn't be too hard to track down what is going on.  Hopefully. So let's ask the Ethernet interface what it thinks in terms of buffer status.

Well, it claims it has 3 free buffers, and no packets waiting on the receive side, but this is clearly bollocks, as we know from the fact that there are in reality 3 packets queued up, and that indeed will be exposed as new packets arrive.

We don't actually _miss_ any packets, thus I think the ring buffer setup is working just fine. The problem is that the count of free buffers gets out of whack somehow. The logic for counting the number of free receive buffers has never really made me very happy.  And it is the value that's used for determining whether or not the CPU thinks there are more packets ready to receive.  

What I am thinking of doing is just looking at which buffers are being read and written to at any point in time, and from that, statically computing the number of free buffers, and whether there are received packets waiting or not. 

This will take a resynthesis run to fix, but I wondered if it might be possible to hack it in using some debug registers I expose in the Ethernet controller, namely, which buffer the CPU is reading from, and which is the next one that the Ethernet controller would write to. These are in the bottom four bits of $D6EF, with two bits each.  By checking if the difference between them is 3 or not, I am successfully able to check if there are any waiting packets, and the ping times are now consistently down around 3 to 4ms, instead of 3,000 ms.

So lets try that now with the browser loading pages, and see if it doesn't help speed up the TCP connections and everything else, but cutting out that 3 second delay each time.

Okay, so that makes the DHCP and TCP connection all much faster, as hoped. So that's really good.  But, as you have probably now come to expect in this blog post, another weird problem is happening.  Now, we are reliably missing the "G" from the start of the "GET" in the HTTP request.  The web server logs show something like:

Serving HTTP on port 8000 ... - - [27/Jul/2021 21:23:42] code 501, message Unsupported method ('ET') - - [27/Jul/2021 21:23:42] "ET /index.h65 HTTP/1.1" 501 -

That's clearly not helpful. 

The problem here was that we were not ACKing the SYN+ACK during TCP establishment, which meant that the first character of the TCP stream was playing that role, and thus getting eaten, rather than passed to the web server.

That's easy to fix, but implementing the ACK of it explicitly, by incrementing the local-side TCP sequence number when we see the SYN+ACK.  And indeed, that allows a page to load. But clicking on a link doesn't work.

Looking at a tcpdump of such an event, it looks like the problem might be re-use of TCP ports, so the remote end can't discriminate between the sessions. Nope, its actually that the 3-way handshake is going wrong. We send SYN to the server, the server replies with SYN+ACK, and then we should reply with just ACK. But we are sending a 2nd SYN. In theory that should be fine, but for whatever reason, our 2nd SYN has its sequence number decremented by one.

Now it feels like I'm finally making progress: 

1. DHCP naturally takes a couple of seconds, because it also has to activate the Ethernet interface and wait for it to auto-negotiate with the switch.  Once that happens, the DHCP lease is obtained within a couple of seconds, and then making the HTTP connection and loading a page is really fast -- probably about 1 second for a ~100KB page. Okay, the "network" is just a cable between the MEGA65 and may laptop, but nonetheless, it is all working really nicely and smoothly, and quite reliably.

2. I figured out the "links broken after navigating two pages" bug: I wasn't resetting where on the page the browser thinks is being displayed when a new page is loaded. Thus, if you had scrolled the page, the links would be out of sync, and it would mess up. That was nice and easy to fix.

3. I wasn't clearing colour RAM between page loads, which was leaving some fluff at the bottom of pages sometimes. That's now cleaned up.

In fact, I managed to really quickly and easily navigate around a few pages, until it broken when trying to follow the link back from one of the pages I hadn't tried before.

It was so nice to get it working, I did a bit of a stream showing it off:

Next step will be to allow users to type in a URL, at which point it will be minimally functional for people to try out.

Then it will be investigating the occassional page load failures, to see if I can get to the bottom of those.

I'll then have to divert my attention back to other things, so if someone out there is interested in continuing work on the browser, I'd certainly welcome it. Just poke me on the MEGA65 Discord server (link in the nav bar at mega65.org).

Saturday 24 July 2021

Work on a Simple HTTP-based file browser

First, a sneak peak of where we have gotten to, before explaining how we got there:

We have previously improved our port of WeeIP, a tiny TCP/IP engine, so that it works fairly well on the MEGA65 with its integrated 100mbit Ethernet adapter. 

It can now be used to open an HTTP connection, and try to fetch a file.  So the next step is to try to do something useful with that.  One challenge is that we are compiling it with CC65 at the moment, which produces rather large code, and as we are for simplicity for now (until the MEGA65 port of the VBCC compiler is ready), we are stuck with compiliers that don't really support >64KB of address space.  A TCP/IP stack is already fairly large, so we end up with a minimum programme size of about 28KB, which given we are using the C64 memory map, where we have only about 38KB of contiguous RAM available to us for programme code, this is a bit annoying.

Fortunately there is a 3rd C compiler that supports the MEGA65, and that is KickC, which specialises in extreme optimisation of the produced code. Unfortunately, it isn't an ANSI-compliant C dialect. Well, really I should say that it is inconvenient, not unfortunate, because KickC does lots of clever things to make it produce much better code on 6502 targets than a traditional C compiler is able to do.  But it does have some limitations and differences that mean that porting code is not always super simple.  We had anticipated this, by having separate directories for CC65 and KickC in the mega65-libc repository, and now the time has come to start forking things off, and getting stuff working with KickC there, so that I can see how much smaller KickC is able to produce a working TCP/IP stack for me, than CC65 is able.

First step is to update KickC to the latest version, ideally from source, so that I can patch the compiler if required.

Cloning the source from https://gitlab.com/camelot/kickc is easy, as was compiling it. This makes a kickc-release.tgz file that you can then extract where you want it.  That all went smoothly, except that when I try to compile something, I get this error:

$ bin/kickc.sh -e examples/helloworld/helloworld.c
java -jar bin/../jar/kickc-0.8.3.jar bin/../jar/kickc-release.jar -F bin/../fragment -e examples/helloworld/helloworld.c -I bin/../include -L bin/../lib -P bin/../target
Unmatched argument at index 4: 'examples/helloworld/helloworld.c'

The binary release does however work, but quickly runs into problems:

$ make fetchkc.prg
git submodule init
git submodule update
../kickc/bin/kickc.sh -t mega65_c64 -a -I src/mega65-libc/kickc/include -I include -L src -L src/mega65-libc/kickc/src src/fetch.c
java -jar ../kickc/bin/../jar/kickc-0.8.5.jar -F ../kickc/bin/../fragment -t mega65_c64 -a -I src/mega65-libc/kickc/include -I include -L src -L src/mega65-libc/kickc/src src/fetch.c -I ../kickc/bin/../include -L ../kickc/bin/../lib -P ../kickc/bin/../target
//   KickC 0.8.5 BETA by Jesper Gravgaard   
Compiling src/fetch.c
line 65:8 mismatched input 'union' expecting {TYPEDEFNAME, PAR_BEGIN, 'const', 'extern', 'export', '__align', 'inline', 'volatile', 'static', '__interrupt', 'register', '__zp_reserve', '__address', '__zp', '__mem', '__ssa', '__ma', '__intrinsic', CALLINGCONVENTION, 'struct', 'enum', SIGNEDNESS, SIMPLETYPE}
/home/paul/Projects/mega65/weeIP/include/defs.h:65:9: error: Error parsing file: mismatched input 'union' expecting {TYPEDEFNAME, PAR_BEGIN, 'const', 'extern', 'export', '__align', 'inline', 'volatile', 'static', '__interrupt', 'register', '__zp_reserve', '__address', '__zp', '__mem', '__ssa', '__ma', '__intrinsic', CALLINGCONVENTION, 'struct', 'enum', SIGNEDNESS, SIMPLETYPE}
make: *** [Makefile:33: fetchkc.prg] Fehler 1

This is because KickC doesn't support unions, which is a bit of a bummer, because WeeIP makes fairly extensive use of them.

I've tagged the issue on the KickC repository to see if they might be inclined to add support for them sometime soon.

So that brings us back to working with CC65 for a bit longer, and working harder to keep everything small. enough to fit.

One help we have is that we don't have to fit all data into the first 64KB, but can strategically use the other RAM banks to stash stuff, such as the graphics display and hyperlink info.

A while back I started coming up with a really simple file format that the MEGA65 can basically just display, and that has some hyperlink info embedded, which I call H65 for Hypertext65.  I even started making a markdown to H65 converter at  * Markdown to H65 page formatter in the mega65-tools repository. Here are the comments I made about how I intended it to work:

 * H65 is a very simple (if inefficient) rich hypertext standard
 * for the MEGA65.  It allows text, pictures and hyperlinks only
 * at this stage.  It largely works by pre-formatting a MEGA65
 * screen + colour RAM, and providing the custom FCM character data
 * required to satisfy this.
 * File contains a header that is searched for by the viewer.
 * This contains the lengths of the various fields, and where they
 * should be loaded into memory, with the first 64KB of RAM being
 * reserved for the viewer. Screen data is expected at
 * $12000-$17FFF, colour RAM will load to $FF80000-$FF87FFF for
 * compatibility with MEGA65 models that have only 32KB colour RAM.
 * Hyperlinks are described in $18000-$1F7FF:
 * .word offset to list of link boundaries
 * List of null-terminated URLs.
 * List of screen-RAM offsets, hyperlink length and target link tuples.
 * (4 bytes each).
 * Tile data is allowed to be placed in banks 4 and 5.  Replacing
 * the 128KB ROM with page data is not currently allowed.
 * Header:
 * .dword "H65<FF>" (0x48, 0x36, 0x35, 0xFF) is required.
 * .byte version major, version minor
 * .byte screen line width
 * .word number of screen lines
 * .byte border colour, screen colour, initial text colour
 * <skip to offset $80>
 * [ .dword address, length
 *   .byte data .... ]
 * ...
 * .dword $0 -- end of file marker
 * Parts:
 * Copyright 2002-2010 Guillaume Cottenceau.
 * Copyright 2015-2018 Paul Gardner-Stephen.
 * This software may be freely redistributed under the terms
 * of the X11 license.


In short, we are basically making a format for loadable VIC-IV displays.  This is purposely a really unintelligent format, allowing intelligence to be built into the VIC-IV display info. For example, the page gets to provide both the VIC-IV screen RAM and colour RAM for the page. This means all VIC-IV tricks are available, and even that images can be included using FCM 256 colour or 16-colour modes.  It's also possible to provide the palette, allowing for setting the palette entries, thus allowing the images to be even prettier than if a single fixed palette were assumed. 

It also allows us to include proportional text etc on the pages, similar to how we do it in our PowerPoint clone, "MegaWAT!?", where we use the 16-bit text mode of the VIC-IV's ability to do hardware kerning and anti-aliasing. But for now, we will just work with normal text, then maybe add images, and work our way up from there.

We allow 24KB for screen RAM, which assuming that we use 80-column mode means a page can be 24*1024/80 = 307 lines long, or 153 lines long if we use 16-bit text mode. Again, as the page can provide VIC-IV register settings, this is up to the page designer.

The md2h65.c programme currently has the bare essentials to write out a valid empty .h65 file, but doesn't currently parse a .md file to get the contents of it.  So that's probably the next step, so that we have an example .h65 file that we can fetch. I'll then setup a temporary local webserver, and start testing the complete setup.

Initially I will just support plain text, and bold and maybe a simple header format, just using colours to indicate those styles, since that will be quick and easy to parse.  So something like:

# A heading

Some text with **a bit of bold**

will be allowed.

With a bit of hacking, I have, in theory at least, implemented writing out simple .h65 files generated from a .md file, allowing such formatting.  Now to modify the fetch programme to display it.  This means running a simple webserver, which is made super easy by python:

python2 -m SimpleHTTPServer 8000

in the directory where the files live.  So I can do a single test command like this:

make fetch.prg && m65 -F -4 -r fetch-unpacked.prg && python2 -m SimpleHTTPServer 8000

This will work because the WeeIP TCP/IP stack takes a few seconds to complete DHCP negotiation etc, before actually doing anything, thus giving the web server time enough to get started. 

This means that we need to be able to provide an IP rather than a hostname, since I am testing it on the local LAN. Our DNS library in WeeIP currently only supports resolving actual hostnames, so I have quickly written an IP parser that we try on the passed in hostname before falling back to DNS:

  // Check if IP address, and if so, parse directly.
  offset=0; bytes=0; value=0;
  while(hostname[offset]) {
    printf("Checking $%02x, value=%d, bytes=%d\n",hostname[offset],value,bytes);
    if (hostname[offset]=='.') {
      if (bytes>3) break;
    } else if (hostname[offset]>='0'&&hostname[offset]<='9') {
      value=value*10; value+=hostname[offset]-'0';
    } else
      // Not a digit or a period, so its not an IP address
  if (bytes==3&&(!hostname[offset])) {ip->b[3]=value; return 1; }

With the above, I am now able to have the MEGA65 attempt to fetch the page from the web server, and I see signs that it is indeed (mostly) working:

We can see that it connects to the HTTP server and fetches 536 bytes. Because the programme just prints out what it reads for the moment, we see the h65 and the cross-hatch character, which is the header of the .h65 file.  The only problem is that we can see that the content length is supposed to be 1,024 bytes, not 536.  But we can deal with that as we progress.  

What I might work on next is to instead focus on improving the UI of the programme itself. Or more correctly, actually give it a user interface, rather than just making a single fixed connection and displaying a pile of debug fluff on the screen. 

Having some kind of URL bar at the top of the screen, for example, would be a nice idea. What I will probably do, is have a display for entering a URL, and that it also shows while a page is being fetched. Then this will toggle to the page display view, after a page has been loaded.  Before doing this, our file size is currently 29,390 bytes, so we will see how much bigger it gets adding that.

Well, its now about 34KB, but it can show something like this:

The .md file is pre-converted into a .h65 before being placed on the web server. But it can then be fetched and displayed.  The video mode, colours and underline attribute are all built into the screen and colour RAM data of the .h65 file. So its just a case of loading the blocks into RAM in the right places, and the page is "rendered".  

Its now time to get ready for dinner, but after the kids are in bed, I might look at switching the font for our ASCII font, so that we don't have to munge the charset, and can use upper and lower-case chars. I might make it so that the pages can choose for themselves whether they are PETSCII or ASCII, so that we can have cool 80x50 PETSCII art rich pages.  

Once that is done, then I think it will be time to start adding support for hyper links. I have a cunning plan for implementing those in a super simple way, as well.  But whether I will start to run out of RAM before then, we will have to see.

Well, I got to spend a bit of time working on this, and even live streamed for a couple of hours as I added scrolling via keyboard and mouse, and setup my new MouSTer USB to C= mouse adapter.

This got things to the point where we can now have a page that is longer than fits on a single screen, and use cursor up and down to scroll through, and also use the mouse to scroll up and down.

So I thought this morning I might try to configure my MouSTer to enable the scroll wheel and 3rd button in C1351 mode, so that I can support them, as it is just so tempting to want to be able to use the scroll wheel.  However, I can't see how to to this. Maybe if the excellent Retrohax folks are reading along, they can drop me a comment below on how to do this.

So instead I'll attack something different for a little while.  First up, I'll call the mouse update routine in all the busy wait spots during DHCP, TCP connection etc, so that the machine doesn't feel so dead.  That makes it feel much nicer. I also modified the configuration of my MouSTer to double the mouse speed, which now also feels much more responsive, without being silly. So that's a couple of simple usability improvements for me. 

While I'm doing such things, I might also modify the mouse pointer code to change colours, so that it doesn't just sit there dully.  Ok, that looks nicer.

Next I might see if I can't make images work, as that will make it all look much fancier.  It won't support animation or anything fancy yet, just static graphics. But nonetheless, it will be much better than just plain text.  

We already have most of the PNG handling code we need from pngtoscreens.c, including palette handling and tile generation (and, I think, tile de-duplication).  I mostly just need to add the parser for the markdown for indicating an image, and then load the image. In fact, we even already have the code for writing out the tiles.  So we just need to draw the tiles on the screen RAM, and we should be good.

Well, that _almost_ works now:

Clearly the colour RAM attributes are borked. The display on the real hardware is actually a bit different, as the screen capture tool doesn't correctly handle transparency, but what is correct, is that the palette is all borked up, and VIC-III attributes are being applied to the tiles, when they shouldn't be.  The easy solution to the latter problem is to set the tile colour to a colour <16, although that does cause us some problems for colour $FF not being available. But that's the trade-off being able to use hardware underline etc for headings and things.

So let's focus on the palette for a bit, and try to get to the bottom of that. It isn't that the palette entries need nybl swapping, as I checked that, and while result is "different" its certainly not correct.  I did verify that the mode is display displaying correctly, and also check what the palette looks like by adding a routine on the P key in the fetch programme, to show the palette on 8x8 tiles at the top of the screen. The result while trying to show the well known MEGA65 logo banner is:

Which is clearly Not Correct(tm).  There should be colours matching the four colour bars.  

To try to work out what is going on, I added some debug info that shows when each colour is allocated, and saw that it was allocating colours twice-over.  I remembered that I had put the logo in the .md file twice. So just to keep life simple, I removed the 2nd instance of it, and what do you know? It suddenly displays mostly correctly:

Again, the screen-capture programme mis-renders the transparent pixels as black. But we can see its now 99% correct, except that the yellow bar isn't showing correctly.  This might be the 256th colour issue I was mentioning earlier.

So now to find out why my image loader is failing to handle the palette properly if it is asked to load a subsequent image.  It should simply remember the existing palette, and every colour should match. But clearly that's not happening, so let's find out why... And it was a buffer overflow, allowing one too many colours to be recorded, thus overwriting the number of recorded colours.  

With the 16 C64 colours already allocated, the banner image requires one too many colours, so it triggers this problem.

This actually broaches onto a broader problem, which is that all the images on a page are expected to share a single palette, and the converter programme at the moment doesn't do any down-sampling of the colours to make them fit.  

We could in theory use a 2nd palette bank for the graphics to ease this problem, or indeed use two separate palette banks for two separate images to allow 512 colours on screen, with 256 (well 256 - 16 = 240) in one and 256 potentially different colours in the other.

But maybe it will just be easier and more flexible to digest the palette requests for all the images in the page in a first pass, and then quantise the palette values to be used in a second pass.  That's probably a better approach.

So in the first pass we will log all colours, and ignore if we have too many colours to render first time around. And if there were too many colours, then we will trigger the quantisation and second pass.  There are fancy algorithms for doing this the best was possible, but for now, I just want a proof of concept. The general consensus seems to be "cluster and reduce"... But first we need a list of all the colours in the image(s)...

Code duly modified to record >256 colours and flag if we need a 2nd pass. What is curious, is that including the same image twice results in more colours being required, than if it is only included once. This should not happen.  This makes me suspect some other memory corruption problem somewhere. Nope, it turns out it was just the order of stdout and stderr messages appearing that made it look like that was the problem.

I've now implemented a very crude colour quantiser. I say crude, in that what it does is find the least frequently used colours, and replaces them with the closest remaining colour, until the palette fits.  This means the first image will have the best colours, but later images might end up with totally borked colours.  But that's okay for now -- remember that this isn't a limitation of the file format, but just me being a bit lazy in getting a proof-of-concept system up and running.

I'll soon grab a few more images, and see how it looks if I load multiple images in. But first, I want to fix the mouse scrolling: Previously it would scroll at a constant rate whenever the mouse was at the top or bottom edge. I now want to have it properly scroll proportional to the mouse's movement, so that it feels much more natural.  That feels much nicer. I'll have to make a video of it in action.

Now, back to multiple images, I have that working now, too. But we were saying just before, the palette selection stuff is really quite amazingly sub-optimal.  To get an idea of just how suboptimal it is, see the following screenshot:

Because the MEGA65 logo has large slabs of a lot of colours, it is taking precedence over the more complex (and thus fewer pixels of each colour) devkit image. 

It turns out that this quick and dirty colour quantiser really does just suck, as even if I remove the MEGA65 logos, it still looks like rubbish. But if I use a tool like convert or mogrify (or you could use gimp or something graphical like that, too), then the result is, not surprisingly, much better:

So improving the quantiser would obviously be a really good idea. But I'm going to first focus on adding hyper links, so that we can have multiple pages, and navigate between them.  We are still at only 35,600 bytes for the programme, so it should be possible to add support for hyperlinks.

My plan is to have a data structure of URLs and the regions of the page that are links to them in memory as part of the page.  Thus any click of the mouse just needs to look up the position on the page, and see if it matches anything in the list, and if so, get the address of the URL in memory from that.  

We can do his most simply by describing bounding boxes, which will also make it nice and easy to have images act as link anchors, as well.  Speaking of images with links behind them, there isn't actually a proper standard in markdown for doing that, so I'm going to use magic text in the alt text field, which we don't otherwise use.  I'll probably also allow some other attributes in there, such as horizontal alignment directives. But lets start by looking for an href= tag in the alt text.

Pottering away while I listen to the Olympic Games opening ceremony, I have implemented links for both images and text, and cooked up a few inter-linked pages for demonstrating it. And it works. I think. Except that the TCP/IP stack is giving me mountains of grief now, going super slow or corrupting the received stream most times, so I'm invetigating that a bit. 

But first, here is a nice example of a page with some links:

The image and the two hardware-underlined links are all clickable, and it even will load the target page, if the TCP stack plays nicely.  You can't see it here, but the mouse pointer changes colour based on whether you are over a link or not.

It looks like the TCP slow-start mechanism and ACKing stuff is getting all confused, as we can see in this tcpdump capture.

First, we see .60 (the MEGA65) acking a packet to the linux box, which then results in the Linux box immediately sending the next packet:

23:03:24.457335 IP > Flags [.], ack 13954, win 1536, length 0
23:03:24.457439 IP > Flags [P.], seq 13954:14490, ack 116, win 64125, length 536

... and then the TCP slow-start mechanism decides it would be a great idea to send another packet straight away, to increase throughput:

23:03:24.457502 IP > Flags [.], seq 14490:15026, ack 116, win 64125, length 536

Which should then result in the MEGA65's TCP/IP stack acknowledging both of those packets, which it doesn't do. Instead after 2 seconds it acknowledges the first of them, but not the second:

23:03:26.429488 IP > Flags [.], ack 14490, win 1536, length 0
23:03:26.429603 IP > Flags [P.], seq 15026:15562, ack 116, win 64125, length 536
23:03:28.461701 IP > Flags [.], ack 14490, win 1536, length 0

And similar disfunction continues.  Well, the good news is that I can see that the problem is clearly on the MEGA65 side, so I should be able to find what is going on, and fixing it.  

My guess is that the ack isn't sent, because the sequence number doesn't match exactly.

Well, except that I am also seeing spots where the MEGA65 ACKs a packet, and then the Linux box waits a very long time before bothering to send a packet, e.g., as here:

23:34:37.314214 IP > Flags [.], ack 6986, win 1536, length 0
23:35:49.678613 IP > Flags [P.], seq 6986:7522, ack 116, win 64125, length 536

It waits 12 seconds before sending the packet. I'm guessing the Round-Trip-Time estimation on the Linux side is getting very confused. That will probably somewhat resolve if we can fix problems with ACKing when multiple packets are sent. 

I also wonder if we are still having funny problems with the ethernet interface not immediately handing packets over.

Maybe I will make the TCP/IP stack send debug info via the serial monitor interface, so that I can time-stamp it, and determine the exact time-relationship between the various actions.  In particular, I want to know which side is delaying at which time.

Saturday 17 July 2021

EnteringThe DOS Maze

Today we have another post from Bit Shifter, as he explores the C65 DOS, and makes some improvements to it for our patched MEGA65 versions of the ROM... so over to Bit Shifter:

The source codes for the C65 firmware, which served as starting point for the 92xxxx series of the MEGA65 ROMs, consist of assembly code for the CSG4510 CPU in following  files:

b64.src            BASIC 2 for C64 mode
kernel64.src    Kernel for C64 mode

b65.src            BASIC 10
graphics.src    BASIC 10 graphics
system.src       Kernel, editor, I/O
dos.src            CBDOS

My work on the BASIC part, which was renamed to BASIC 65 for emphasising the transformation and expansion of the interpreter for the MEGA65, is mostly done.
So I moved on to the DOS (Disk Operating System) of the C65, which was named CBDOS  (Computer Based Disk Operating System) by their developers Dennis Jarvis and Fred Bowen. This name describes pretty good the most important change in Commodore’s philosophy of using intelligent storage devices, escpecially floppy drives, with builtin CPU, RAM and a firmware ROM containing the DOS. The C65 was the first Commodore 8bit computer, with a builtin (Computer Based) DOS. Some may argue, that the C128-D already was a 8bit computer with builtin disk drive, but this disk drive just shared the same case with the C128, but was otherwise a 1571 floppy drive which used the same firmware and IEC bus, as external 1571 drives, that were connected via IEC bus.

When I started to read the CBDOS source code, I was surprised by the complexity of the code, which was far more, than I had expected. Further examination revealed, that this source code is sort of a bizarre maze, leading us through the history of Commodore DOS versions. Some code parts were written in the early years of the PET/CBM dual floppy drives 2040, 3040, 4040, some were from the 1571 and 1581 disk drives with their enhancements like „burst mode“, big relative files and subdirectories and some parts were added for the benefit of the C65.

I tracked, for example, the usage of the LOAD command (see figure):

The BASIC 10/65 has six LOAD variants (plus one more in the ML monitor).
These commands use different argument parsers, some of them then call the kernel LOAD routine, while others open a sequential file and read the data bytewise.

The kernel LOAD command of the jump table is routed through a vector (ILOAD) to the routine NLOAD, which on other Commodore computers just opens and reads the file.

The C65/MEGA65 computer however tests the selected device, wether it has burst mode capabilities (1571 & 1581) or is a device connected to the internal FDC (Floppy Disk Controller). Then it branches to one of three completely different LOAD routines, either to the burst LOAD via IEC bus, the slow LOAD via IEC bus or the fast LOAD using the FDC, which has the internal floppy drive connected on the C65. 

The MEGA65 can handle two devices on the FDC, two floppy drives (connected with the same ribbon cable) or two disk images mounted from a SD-card, or a combination of one floppy drive and one image.

he situation is complicated more by the fact, that the MEGA65 CPU can be run at different speeds (1MHz, 2MHz, 3.5MHz, 40MHz) which can be too slow or too fast, dependent on the selected LOAD routine and the device (floppy drive vs. Image).

By the way, the C64 mode has its own kernel and BASIC ROM, with high compatibility to the original C64, but the tape routines in the C64 ROM are removed and the IEC routines are intercepted and call the same CBDOS routines as in native 65 mode.

So here we have a source code with a lot of legacy code from former floppy drives, burst mode routines adapted from the 1571 and 1581 DOS, new routines for the C65/MEGA65 and finally new features, for example to use the two drives, handled by the FDC either as drive 0 and 1, like a CBM dual drive like the 4040 or 8050, or emulate two single drives assigned to two different units (8 & 9 as default). This leads to bizarre situations if you read for example from unit 8 and write to unit 9, the DOS will be busy sending TALK, TALKSA signals for reading a byte and then UNTALK, LISTEN, LISTENSA for writing the byte, though every communication goes to the same device. This obsolete protocol overhead is responsible for a considerable slowdown in I/O and could have been solved better.
For example by assigning only one device/unit number to FDC and using the DOS like the DOS in the dual drive units.

The next problem is, that the development of the CBDOS was stopped abruptly, before it reached a stable status. Cite from Dennis Jarvis in the source code:
„Estimated time to complete the dos from this point would be less than 200 hours.    12/03/91 Dennis Jarvis“

At this point I want to make clear, that I don’t want to blame the developers for any issues, bugs or not so optimal solutions. They worked on this code under immense pressure from the management and used Commdore-128, Commodore PC and later DEC-VAX  as development system. I think, they did an amazing job, writing so much code, in a short time with limited ressources. They have my full respect and if I now find some better algorithms and fix bugs, it’s because I have time, no pressure and excellent development tools, that I use on a MAC computer.
Now for a selection of comments, extracted from the file dos.src:

* NOTE: I discovered this routine NEVER performed its task since the 1st   *
*  DOS was coded. This routine was to check to see if there was a drive #  *
*  prefixing the ':', and if there was NOT it was to set BIT 7 in the ACC. *
*  to denote that the DEFAULT drive was being used. We double checked this *
*  on the 2030, 2040, 4040, 8050, 8250 and found it never worked, so I     *
*  recoded it.                                                             *

lda #0   ;Blow away this file

lda temp ;Kludge to test for M-R used by

cmp #$fe ;Check for someone sniffing the irq vector

; You will only see this if the user attempts to read/write to a super relative
; file, and the super relative file capabilities of the dos have been turned off!!!

;inc $d020  ; (FLASH BORDER COLOR?????) leave in until ext. error led works

; problem with dual drives requires
; spinup delay here????

; Since we are taking a DUAL DRIVE and having it emulate two single drives
; we must have seperate ERROR MESSAGE BUFFERS.

 ldy #0   ;(fixes bug in DOS1.4 thru 2.7 excluding DOS2.6, and TED 488)

*  was installed because the users did not know HOW the thru B-R/W routines

; Note: this is testing the wrong value, but don't change it to #$41 or it will
; cause problems with existing software which relies on this "mistake".
 cmp #$40

;* NOTE: There's a bug in all CBM DOS versions that allows*
;* you to request a drive number 0 thru 255.    *
;*        1/12/1990 Dennis J. Jarvis   *

;*  NOTE:  If a one block file is appended, the directory will always *
;*  show 2 blocks in use.  This is caused by the JSR to WRTBUF, which *
;*  automatically increments NBKL (from BUTTERFIELD Transactor V3, I4). *

 ldy temp+2
;* ^^^ ^^^^^^ ;Dumb will always = 2!

 lda r0   ;save r0 (phw)
 lda r0+1  ;Sorry PHW or PHD does not work in the pc BSO version!

;  Corrects rel bug a,b,c reported many times in different mags such as Transactor...

This gives an impression.

So what can we (or I) do now?

My first idea was to strip the DOS from all not essential stuff, clean up the remaining core and use it as base for a better structured DOS.
I wanted to remove REL file support, sub directories, save by replace @ and some other (as I thought) obsolete or rarely used features. This would make the reorganisation much easier. But when I shared my thoughts at discord, there were at least two people, who were upset and demanded, that nothing should be removed, because it would make the CBDOS un-Commodorish. I thought about it and decided, that I would change the CBDOS in tiny steps, that will not affect already implemented features.

The first step is, to fix obvious bugs.
So far I fixed the COLLECTION bug, which scrambled a floppy or image, if files of type DEL and length zero were listed in the directory. These are very common among developers for creating „directory art“, which is visible in a directory listing.

The second step is to walk through the code and optimise it where possible.
This save code space, which then can be used for enhancements.

One example:
A table for allocation and freeing channels (max. 11) is needed.
The solution in CBDOS was to use the bits of two byte variables (LINUSE) for this purpose. This saves 9 bytes compared to the byte table method. 

But the code to locate and set or reset a bit in one of the LINUSE bytes was so complex, that it needed more than 100 bytes of code. I changed the allocation table from bits to bytes, using 11 bytes instead of 2. But the allocation/free routines are now only 30 bytes in size (and are 100 times faster).

There are many such optimisation possibilities, but care must be taken, not to introduce bugs or create side effects, so this work will take time.

So, I hope, I didn’t bother you and you got some insight in the work going on.
Feel free to comment on discord and make suggestions.

I found this very nice description of Commodore DOS versions, which I recommend:


Have a nice day,

Bit Shifter