Tuesday 27 July 2021

Debugging TCP/IP Problems

WARNING: This is a rather long, rambly post, as I rather transparently document the quagmire of bugs, my own and others', in the TCP/IP stack we are using on the MEGA65. But if you are not deterred, and would like to join in the mystery, the chase and the hair-pulling, read onwards!

If you have read the previous blog post, you will have discovered that I have been having some problems with the TCP/IP stack: strange delays and mis-handling of received packets, resulting in corruption of the received data stream.

I started investigating it on a stream today, until the wild weather outside knocked the power out.  Fortunately, the power came back on sooner than expected, so I can take a bit more of a look.

What I started doing was to make a way to instrument the WeeIP TCP/IP stack to log all packets sent or received via the serial monitor interface, and then to make a utility to convert that log into a pcap capture file, so that I could investigate it using Wireshark.
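Converting the log to pcap is less work than it sounds: the classic libpcap file format is just one fixed global header followed by a small record header per frame. Something like the following sketch (the names and structure are mine, and the part that actually parses frames out of the serial-monitor log is omitted):

    /* Minimal pcap writer: wraps raw Ethernet frames in a capture
       file that Wireshark can open. */
    #include <stdio.h>
    #include <stdint.h>

    struct pcap_hdr {          /* classic libpcap global header */
      uint32_t magic;          /* 0xa1b2c3d4 = microsecond timestamps */
      uint16_t ver_major, ver_minor;   /* format version 2.4 */
      int32_t  thiszone;       /* GMT offset; 0 is fine */
      uint32_t sigfigs;        /* 0 */
      uint32_t snaplen;        /* maximum captured frame length */
      uint32_t network;        /* 1 = LINKTYPE_ETHERNET */
    };

    struct pcaprec_hdr {       /* one per captured frame */
      uint32_t ts_sec, ts_usec;
      uint32_t incl_len;       /* bytes stored in this file */
      uint32_t orig_len;       /* bytes on the wire */
    };

    void pcap_begin(FILE *f) {
      struct pcap_hdr h = { 0xa1b2c3d4, 2, 4, 0, 0, 65535, 1 };
      fwrite(&h, sizeof h, 1, f);
    }

    void pcap_frame(FILE *f, uint32_t sec, uint32_t usec,
                    const uint8_t *frame, uint32_t len) {
      struct pcaprec_hdr r = { sec, usec, len, len };
      fwrite(&r, sizeof r, 1, f);
      fwrite(frame, 1, len, f);
    }

Conveniently, the magic number doubles as a byte-order marker, so writing the headers in host byte order is fine: Wireshark detects a byte-swapped magic and adjusts.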

This instrumentation has already been super helpful, as I was able to see that acknowledgements take a second or so before they get sent, instead of being sent immediately.  That will certainly be contributing to all manner of problems with TCP/IP flow control, Round-Trip Time estimation and the TCP congestion control algorithm.

Now I intend to instrument the TCP/IP stack to indicate when it thinks it has received a TCP segment, and when it has scheduled an ACK and when it has actually sent the ACK, so that we can find out where the delays are occurring.

Poking around in this for a bit, I have found the cause of the corrupt TCP streams: the call-backs that digest the data were being scheduled, rather than executed immediately. As a result, further data packets could be received before the call-back got called, and thus the buffer of data would be replaced with the contents of the most recently arrived data packet.

To solve that problem, I modified the TCP/IP stack to call the call-back immediately, rather than scheduling it for later.  This also allowed the TCP window size to be increased, which, when things are behaving well, makes the transfer quite a bit faster.  It looks to be loading at several hundred KB/sec when it is behaving properly, which is much better than the ~20 -- 40KB/sec it was peaking at before. So that's a nice advance.
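In rough sketch form (made-up names, not the actual WeeIP functions), the change amounts to this:

    /* Hypothetical sketch of the race and the fix; names are mine. */
    typedef struct { void (*callback)(void); } socket_t;

    void tcp_segment_received(socket_t *s)
    {
      /* Before: the application callback was queued as a task.  If a
         second data frame arrived before the task ran, the shared
         receive buffer was overwritten and the old payload lost. */
      /* task_add(s->callback); */

      /* After: deliver the data right now, while the buffer still
         holds this segment's payload. */
      s->callback();
    }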

So now to focus back on the ACK problem.  I tried making the code immediately send an ACK for every received frame, but that just makes it crash, hang, or otherwise doesn't help. And such a change shouldn't be necessary, anyway.

Poking around a bit more, I have discovered that the eth_task() function sometimes falls off the task list, which of course then causes the network to stop. I could just reschedule it all the time, but I'd rather work out how it gets cancelled.  I suspect that it might be related to the ACKs not happening in a timely manner, as both can be broadly considered to be scheduling bugs.

A separate contributor to this problem is that we don't resend ACKs while trying to read data. This means that if the ACK is lost, then we have to wait until the sender decides to send more data. That's a recipe for trouble with our very small TCP receive window.

While trying to find out how I might add retransmissions, I discovered that WeeIP already has a retransmission mechanism for TCP -- it's just that it was disabled for some reason. While investigating that, I also discovered that WeeIP doesn't close TCP connections properly, either: the logic in the CLOSEWAIT1 state would erroneously skip the CLOSEWAIT2 state and go directly to CLOSED.

Trying to fix that is proving a bit fun, as I am now hitting the 38KB program size limit I was talking about earlier. I've managed to save enough space by moving the packet buffer to $C000, and shortening some messages.

Now that I can run things again, I have found the bug where it would go directly from CLOSEWAIT1 to CLOSED without going via CLOSEWAIT2.  That now works, and TCP connections seem to get properly closed, which is nice. This means I no longer have stray ACK packets being sent after the MEGA65 end has closed the connection, which were polluting my packet captures and making it harder to follow the rest of what is going on.
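The shape of the fix, sketched with WeeIP's state names but my own surrounding code:

    /* Sketch of the close-sequence handling; only the transition that
       was wrong is shown.  On the ACK of our FIN we must move to
       CLOSEWAIT2 and wait for the other side's FIN, rather than
       jumping straight to CLOSED. */
    enum tcp_state { CLOSEWAIT1, CLOSEWAIT2, CLOSED /* ... */ };

    enum tcp_state close_wait_step(enum tcp_state st,
                                   unsigned char ack, unsigned char fin)
    {
      switch (st) {
      case CLOSEWAIT1:
        if (ack) st = CLOSEWAIT2;      /* was erroneously CLOSED */
        break;
      case CLOSEWAIT2:
        if (fin) st = CLOSED;          /* ACK the peer's FIN, then close */
        break;
      default: break;
      }
      return st;
    }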

Doing all that, I have also seen perhaps the reason why it sometimes refuses to connect to begin with.  It looks like it never actually sends the HTTP request, due to some race condition.  I have put a catch in for that, but it happens rarely enough that I haven't been able to observe it since I added the catch. So I'll return to tracking down the slow TCP transfers again.

I am noticing that the MEGA65 genuinely thinks it hasn't received packets that tcpdump on my laptop shows have indeed been sent.  It's quite reliable when it happens: the TCP congestion window reaches two packets, and one of each pair (sometimes the first, sometimes the second) is lost, while the other arrives. Now to find out if the packets get properly received and then ignored, or if they arrive corrupt, or exactly what is going on. But that blasted memory limit is really giving me grief.

Well, I feel like I am chasing my own tail here, because now it does seem to be receiving the packets, but it doesn't always ACK them straight away. So let's see if we can't get reliable ACKing happening after each packet, and see if that doesn't solve the problem.  Indeed, by default, if it received an in-sequence TCP segment, it was not requesting an ACK.  This is normal behaviour for "big" TCP stacks with large window sizes.

We could ACK every packet, which is probably not a big problem for us, since we shouldn't be doing huge transfers on the MEGA65, where that might become an issue.  That certainly helps to reduce this problem quite a bit.  Or we could even get clever and ACK only when we see a packet in sequence following an out-of-sequence packet. But that's more code, so let's stick with the simple and reasonably effective approach instead.
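The change itself is tiny; in spirit (hypothetical names again, not WeeIP's):

    /* Sketch: ACK every in-sequence segment, rather than relying on
       delayed-ACK behaviour that only suits stacks with big windows. */
    #include <stdint.h>

    typedef struct { uint32_t rcv_nxt; /* ... */ } socket_t;
    extern void deliver(socket_t *s, const uint8_t *p, uint16_t len);
    extern void send_ack(socket_t *s);

    void tcp_handle_data(socket_t *s, uint32_t seq,
                         const uint8_t *payload, uint16_t len)
    {
      if (seq == s->rcv_nxt) {
        deliver(s, payload, len);
        s->rcv_nxt += len;
        send_ack(s);    /* previously only out-of-sequence arrivals
                           provoked an immediate ACK */
      }
    }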

Hmmm... That chasing-one's-own-tail feeling is still there.  There seem to be Ethernet frames in both directions that are not being received.

Nonetheless, I managed to find and fix a couple of TCP state machine bugs, as well as some bugs in md2h65 and the mouse position calculation code, and now, when the missing-packet problems don't occur, it is possible to load pages reasonably quickly, scroll around in them, and click on links to navigate between them. But those other problems do still make it a bit hit and miss.  Just getting it to start up on the index page, then click through to another page and back again is taking multiple attempts. It's frustrating, because I think it is now mostly race conditions tickling the various remaining bugs.  But I am determined to get one good sequence so that I can film it before I go to bed.

A new day, and hopefully a fresh set of eyes to debug things.

I've added some instrumentation to check whether the Ethernet interface ever refuses to send frames. Given how slowly we are sending, and that we have a true 100Mbit Ethernet interface, it should never block on transmission. But we are indeed seeing it block. The F's are blocked transmissions and the -'s are received packets in this screenshot:

So something is whacked out with the Ethernet controller.  Looking closer, it's not the Ethernet controller that is blocking, but rather a lack of ARP resolution for the requested IP address. But that, in turn, seems to be because when we try to transmit a packet, nothing gets received.

This again makes me wonder if we don't still have a problem with intermittent corruption of Ethernet frames when they are being sent. Specifically, we have previously had problems with the clock edge and data not being properly lined up.  We even have a register that lets us attempt to set the phase difference between them.  So it's possible that we can give that a tweak to check.

It would be nice to be able to check for corrupt Ethernet frames directly, but because I am going through the EoP adapter, which will be discarding any corrupt frames before they get passed on, I can't see them.  I do have Ethernet on my laptop, so I might be able to rig up a DHCP server and a direct cable, and see if packet errors are visible that way. Otherwise I might have to hook up another MEGA65 to it, so that I can really closely examine the Ethernet frames, as the MEGA65 Ethernet controller allows the reception of frames that fail the CRC check, letting me examine the kinds of errors we are encountering.

DHCP server set up on the laptop now, and a direct Ethernet connection, to see what is going on. Zero RX errors, and it seems to work pretty much fine. Certainly bucketloads faster than the poor old EoP adapter could manage, and I can even browse between a couple of pages without drama before anything odd happens. But, yes, something odd does happen: the FIN ACK process got stuck on the 3rd page load. Presumably a bit of a race condition with the opposite side.  Once we have sent our FIN ACK, we should just continue, assuming all to be well. Eventually we can implement RST to quench any TCP connections that don't get shut down completely.

Okay, so that's implemented. Now I am seeing a funny situation where the DHCP offer is not being accepted, even though the packet is arriving.  It looks like the transaction ID sometimes doesn't match.  This is presumably when the DHCP reply comes just after we have sent our next DHCP request, and thus we have changed the XID we are looking for.  To solve this, I have modified our DHCP code to use the same XID for all repeated requests, so that it can't mismatch.
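In sketch form (my names, not WeeIP's):

    /* Pick the DHCP transaction ID once per negotiation and reuse it
       for every retransmission, so a slow-arriving OFFER still
       matches what we are looking for. */
    #include <stdint.h>

    extern uint32_t random32(void);   /* assumed RNG helper */

    static uint32_t dhcp_xid = 0;

    uint32_t dhcp_current_xid(void)
    {
      if (!dhcp_xid) dhcp_xid = random32();
      return dhcp_xid;                /* unchanged until we are bound */
    }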

So, back to the directly connected network. It's much faster than over the horrible EoP link, but it can still get itself messed up.  It looks like the task scheduler is now causing the most problems, with the Ethernet task falling off the task list -- or something approximating that.

Closer examination reveals that the laptop is sending ARP requests to the MEGA65, but sometimes the MEGA65 doesn't seem to receive them. This is also tickling a problem where, during the TCP connect phase, we try sending 5 packets 200ms apart, and then give up trying any more.  If the ARP resolution is outstanding during that time, and the no-ARP-packets-received bug is happening, then the TCP connection will fail.   While we need to fix the ARP problem, it's clear that we also need to improve the TCP connection retries and timeouts.

Well, that helps quite a bit: the ARP gets attempted again after a little while, and because we are continuing to retry over a longer period of time (I am now trying every 300ms for 30 attempts, i.e., ~9 seconds), we have better success. This is fine on the local network, but it probably makes sense to allow even longer, maybe with a graduated timeout interval, so that connecting over the Internet to the other side of the world can also be made a bit more robust.
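A graduated interval needn't be fancy; something like the following sketch (numbers are illustrative) would do, doubling the wait until it caps out:

    /* Hypothetical graduated retry schedule: fast on a LAN, but
       giving long-haul links progressively more time. */
    unsigned int retry_interval_ms(unsigned int attempt)
    {
      unsigned int ms = 300;                 /* first retry: 300 ms */
      while (attempt-- > 0 && ms < 5000)
        ms *= 2;                             /* 600, 1200, ... cap 5 s */
      return ms;
    }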

Now that that is under control, I can sometimes load a page, and follow a link on that, and then a link from that back, with the pages loading quite quickly.  After that, clicking on links stops working for some reason that I need to work out.  But to make that easier, I need to make it so that pages load every time. The main problem now seems to be out-of-order TCP segment arrival being handled in a super primitive way. 

Basically, the WeeIP TCP/IP stack lies, claiming a window size larger than the buffer it has.  We are allocating a 2KB buffer at $C000, so it would be nice to make WeeIP use that to allow multiple segments to arrive, and to allow at least rudimentary out-of-order packet arrival.  My gut feeling is that adding a single "out-of-order segment" descriptor to each socket will be a good compromise between performance and simplicity, allowing a single missed packet to be caught up on more easily.  I'll also make it adjust the advertised window size based on the number of remaining bytes in the buffer.

This all assumes that there is only one active TCP/IP connection at a time, as the buffer is shared among all sockets. This will, most assuredly, come back to bite me at some point, thus my making note of it here, so that I (hopefully) don't forget.  Actually, if I get all clever-pants, we can keep track of what is in the buffer, so that it will remain safe, and out-of-order received packets will be remembered only when they don't cause grief.  So I'll do that. This means the data on buffer usage will now be per-buffer, rather than per-socket. That actually saves a bit of RAM and will probably make the code simpler. Ah, no, I tell a lie: WeeIP requires that a buffer be allocated per socket, after all.  So ignore all that doom and gloom. It's all perfectly safe.

Even with such a relatively simple scheme, there are several cases to handle:

1. New packet is the only thing to go in the buffer;

2. Append new packet to end of what is already in the buffer;

3. New packet comes after a gap, and there is no out-of-order segment yet;

4. New packet comes after a gap, and can be pre-pended to out-of-order segment; and

5. New packet comes after a gap, and can be appended to out-of-order segment.

Cases 1 and 2 can actually be merged, as (1) is just (2), with the special case that there are 0 bytes in the buffer.
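To pin the cases down, here is a sketch of the whole scheme in plain C. The names and layout are my own, the bounds checks are minimal, and the real code also has to cope with overlapping segments and sequence-number wraparound, which this ignores:

    #include <string.h>
    #include <stdint.h>

    typedef struct {
      uint8_t  buf[2048];   /* in-order receive buffer ($C000 in practice) */
      uint8_t  ooo[1460];   /* the single held out-of-order segment */
      uint32_t rx_next;     /* sequence number of buf[0] */
      uint16_t used;        /* in-order bytes currently in buf */
      uint32_t ooo_seq;     /* sequence number of ooo[0] */
      uint16_t ooo_len;     /* 0 = nothing held out of order */
    } tcp_rx_state;

    /* advertise only what the buffer can actually hold */
    uint16_t tcp_window(const tcp_rx_state *s)
    {
      return (uint16_t)(sizeof(s->buf) - s->used);
    }

    static void absorb_ooo(tcp_rx_state *s)
    {
      /* if the in-order data now reaches the held segment, merge it */
      if (s->ooo_len && s->ooo_seq == s->rx_next + s->used
          && s->used + s->ooo_len <= sizeof(s->buf)) {
        memcpy(s->buf + s->used, s->ooo, s->ooo_len);
        s->used += s->ooo_len;
        s->ooo_len = 0;
      }
    }

    void tcp_rx_segment(tcp_rx_state *s, uint32_t seq,
                        const uint8_t *p, uint16_t len)
    {
      if (seq == s->rx_next + s->used) {
        /* cases 1 and 2: in order (case 1 is just used == 0) */
        if (s->used + len > sizeof(s->buf)) return;  /* no room: drop */
        memcpy(s->buf + s->used, p, len);
        s->used += len;
        absorb_ooo(s);
      } else if (seq > s->rx_next + s->used) {
        if (!s->ooo_len && len <= sizeof(s->ooo)) {
          /* case 3: nothing held yet, remember this segment */
          memcpy(s->ooo, p, len);
          s->ooo_seq = seq;
          s->ooo_len = len;
        } else if (seq + len == s->ooo_seq
                   && s->ooo_len + len <= sizeof(s->ooo)) {
          /* case 4: prepend to the held segment */
          memmove(s->ooo + len, s->ooo, s->ooo_len);
          memcpy(s->ooo, p, len);
          s->ooo_seq = seq;
          s->ooo_len += len;
        } else if (seq == s->ooo_seq + s->ooo_len
                   && s->ooo_len + len <= sizeof(s->ooo)) {
          /* case 5: append to the held segment */
          memcpy(s->ooo + s->ooo_len, p, len);
          s->ooo_len += len;
        }
        /* anything else is dropped; our duplicate ACK will trigger
           a retransmission */
      }
      /* seq below rx_next + used: duplicate of data we already have */
    }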

After that it's all just a pile of fiddly buffer handling and debugging until I can get it to work stably.

And it's a new day, as I continue to try to find and deal with the remaining bugs.   This really consists of me running it, waiting for some out-of-order packets to arrive, and then seeing what it claims to be doing via the copious debug messages I have added.  It seems that when packets arrive in order, it's all fine, so it will just be some bug in one of the out-of-order cases.

One bug down: I was incorrectly handling the merging of the received bytes in the buffer, and the out-of-order segment. Fixing that seems to have helped a bit.

Now I am hitting the problem that we don't seem to find the end of the stream any more, and thus the page never gets displayed. Essentially, the socket seems to disconnect before we get the last bit of data. Or something. Quite possibly it's more the "or something", since we still seem to not be correctly receiving data.

Not surprisingly, I am finding more bugs in my changes: I was providing the wrong message type when delivering data from a socket to the callback, so the data was being ignored. That's fixed now, and it looks like the data is all being delivered.

The next remaining problem is that when we get to the end of the page data, it isn't correctly realising that it should terminate the fetch.  Found and fixed that problem.

Now we are back to being able to load pages and follow links, until suddenly the links stop working.  I'm going to prune out a pile of the debug output I put into the TCP out-of-order stuff while debugging it, so that I can clean up the output a bit, and also free up some memory, as I am flying right against the edge of memory usage right now.  Then I'll be able to instrument the link handling, to figure out what is going wrong there.

Now with the debug info reduced, it looks like I might have triggered some other race condition, as the TCP three-way handshake seems to hang for ages a lot of the time now. Yes, indeed we seem to be failing to ACK the SYN packet from the server.

It now looks like the problem is that we are perhaps not properly processing the incoming SYN packets, but I'm not totally sure of that, as there are still some funny things going on. One minor issue I can see is that we don't reply to the DHCP OFFER, and so the Linux box keeps sending DHCP offers for a while, and seems to block other traffic for the MEGA65 during that time.

Perhaps related to that, what we then see is that the Linux side sees the SYN packet, as confirmed by tcpdump, and then, instead of responding with the SYN+ACK, it sends an ARP request... even though it just received a packet from the MEGA65, which means that it should have the MAC address in its ARP cache already, e.g.:

13:54:26.553488 IP (tos 0x8, ttl 64, id 24330, offset 0, flags [none], proto TCP (6), length 40)
    203.28.176.10.1722 > 203.28.176.1.8000: Flags [S], cksum 0xc62f (correct), seq 3288990057, win 2048, length 0
13:54:26.553587 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 203.28.176.10 tell 203.28.176.1, length 28
13:54:26.840310 IP (tos 0x8, ttl 64, id 24331, offset 0, flags [none], proto TCP (6), length 40)
    203.28.176.10.1722 > 203.28.176.1.8000: Flags [S], cksum 0xc630 (correct), seq 3288990056, win 2048, length 0
13:54:27.573665 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 203.28.176.10 tell 203.28.176.1, length 28

This makes me think that maybe the DHCP thing is the problem here, after all.

Well, we now have DHCP requests and acknowledgements happening, and thus we are obeying the DHCP protocol completely... And the funny problems still happen, with repeated ARP requests, and it generally being slow to get started.
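For the record, the complete exchange we now perform, as a sketch of the message flow (the type codes are the real ones from RFC 2132; the function names are illustrative):

    /* DHCP option 53 message types (RFC 2132) */
    #define DHCPDISCOVER 1
    #define DHCPOFFER    2
    #define DHCPREQUEST  3
    #define DHCPACK      5

    typedef enum { DHCP_SELECTING, DHCP_REQUESTING, DHCP_BOUND } dhcp_state;

    extern void send_dhcp_request(void);   /* answers the OFFER */

    void dhcp_rx(dhcp_state *st, unsigned char msg_type)
    {
      switch (*st) {
      case DHCP_SELECTING:                 /* we sent DISCOVER */
        if (msg_type == DHCPOFFER) {
          send_dhcp_request();             /* the reply we used to skip */
          *st = DHCP_REQUESTING;
        }
        break;
      case DHCP_REQUESTING:
        if (msg_type == DHCPACK) *st = DHCP_BOUND;
        break;
      default:
        break;
      }
    }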

Actually, the DHCP negotiation tends to end up sending multiple copies of packets in each direction before they get responded to.  I'm not sure if this is a real problem or not.  When I run tcpdump, I see all the packets from the MEGA65 without problem, so it's not TX corruption. Similarly, we see all the packets arrive on the MEGA65 side, so I don't think it's an RX problem.

We did previously have a problem where Ethernet frames would be received, but not immediately released from the Ethernet buffers until the next frame arrived. That is, each frame would be held in a kind of purgatory until its successor turned up.  It is possible that something like that is happening, but I'm not really convinced, because (1) I'm pretty sure I fixed that some time ago; and (2) it doesn't feel like things are being delayed by a single frame, which is all that that old bug could do.

To test this, I wanted to be able to just use ping to check the round-trip time of packets, to confirm whether or not we are seeing any funny delays.

As seems to be the case throughout this whole adventure with the browser, I had to fix a bug that was preventing me from doing this in some other part of it. For some very strange reason the following line in the ICMP ping reply code would crash the machine:

       *(unsigned short *)&tx_frame_buf[14+20+2] = checksum_result();

Even though

       *(unsigned short *)&tx_frame_buf[14+10] = checksum_result();

a few lines later doesn't, and the following also works just fine, which in theory does the same thing:

      tx_frame_buf[14+20+2] = (checksum_result());
      tx_frame_buf[14+20+3] = (checksum_result())>>8;

I have absolutely no idea what the problem with that was, but it was very much a real thing, and with this workaround I now generate correct ICMP checksums, and can ping the MEGA65 when it is running the network stack, and see lovely things like:

$ ping 203.28.176.10
PING 203.28.176.10 (203.28.176.10) 56(84) bytes of data.
64 bytes from 203.28.176.10: icmp_seq=1 ttl=64 time=2339 ms
64 bytes from 203.28.176.10: icmp_seq=2 ttl=64 time=1325 ms
64 bytes from 203.28.176.10: icmp_seq=3 ttl=64 time=308 ms
64 bytes from 203.28.176.10: icmp_seq=4 ttl=64 time=3.61 ms
64 bytes from 203.28.176.10: icmp_seq=5 ttl=64 time=3.69 ms
64 bytes from 203.28.176.10: icmp_seq=6 ttl=64 time=3.01 ms
64 bytes from 203.28.176.10: icmp_seq=7 ttl=64 time=3.67 ms
64 bytes from 203.28.176.10: icmp_seq=8 ttl=64 time=3.16 ms

The first few pings are delayed, presumably as we wait for the network stack to settle. But after that, we are seeing very nice ping times of 3 to 4 milliseconds -- not bad, given we are using CC65 as the compiler, and the TCP stack makes copies of packets as they arrive, etc.  That all looks great, but then when I ran it again a bit later:

$ ping 203.28.176.11
PING 203.28.176.11 (203.28.176.11) 56(84) bytes of data.
From 203.28.176.1 icmp_seq=1 Destination Host Unreachable
From 203.28.176.1 icmp_seq=2 Destination Host Unreachable
From 203.28.176.1 icmp_seq=3 Destination Host Unreachable
64 bytes from 203.28.176.11: icmp_seq=5 ttl=64 time=3075 ms
64 bytes from 203.28.176.11: icmp_seq=6 ttl=64 time=3052 ms
64 bytes from 203.28.176.11: icmp_seq=7 ttl=64 time=3030 ms
64 bytes from 203.28.176.11: icmp_seq=8 ttl=64 time=3007 ms
64 bytes from 203.28.176.11: icmp_seq=9 ttl=64 time=3006 ms

Now we have a 3 second delay.  This is interesting, because most of the weird behaviours we have seen have involved seeing packets in triplicate before they seem to get acted on.  If the MEGA65 Ethernet controller is getting all confused with its buffers, such that it gets several packets behind in announcing their arrival, this would cause exactly this kind of effect.

It's also fairly reproducible, so it shouldn't be too hard to track down what is going on.  Hopefully. So let's ask the Ethernet interface what it thinks in terms of buffer status.

Well, it claims it has 3 free buffers and no packets waiting on the receive side, but this is clearly bollocks, as we know that there are in reality 3 packets queued up, which indeed get exposed as new packets arrive.

We don't actually _miss_ any packets, thus I think the ring buffer setup is working just fine. The problem is that the count of free buffers gets out of whack somehow. The logic for counting the number of free receive buffers has never really made me very happy.  And it is the value that's used for determining whether or not the CPU thinks there are more packets ready to receive.  

What I am thinking of doing is just looking at which buffers are being read and written to at any point in time, and from that, statically computing the number of free buffers, and whether there are received packets waiting or not. 

This will take a resynthesis run to fix, but I wondered if it might be possible to hack it in using some debug registers I expose in the Ethernet controller, namely which buffer the CPU is reading from, and which is the next one that the Ethernet controller will write to. These are in the bottom four bits of $D6EF, two bits each.  By checking whether the difference between them is 3 or not, I am successfully able to tell if there are any waiting packets, and the ping times are now consistently down around 3 to 4ms, instead of 3,000ms.
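The check itself is a one-liner once the bits are pulled apart. Roughly (the register layout is as just described, but the helper names are mine, and the sense of the comparison simply mirrors what worked in practice):

    /* Sketch: decide whether receive buffers hold unread packets by
       comparing the two 2-bit buffer indices in $D6EF, rather than
       trusting the out-of-whack free-buffer count. */
    #define PEEK(addr) (*(volatile unsigned char *)(addr))

    unsigned char eth_rx_waiting(void)
    {
      unsigned char v = PEEK(0xD6EFU);
      unsigned char cpu_buf = v & 0x03;        /* buffer the CPU reads */
      unsigned char eth_buf = (v >> 2) & 0x03; /* next buffer ETH writes */
      /* a difference of 3 (mod 4) turned out to mean "ring empty" */
      return ((eth_buf - cpu_buf) & 0x03) != 3;
    }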

So let's try that now with the browser loading pages, and see if it doesn't help speed up the TCP connections and everything else, by cutting out that 3 second delay each time.

Okay, so that makes the DHCP and TCP connection all much faster, as hoped. So that's really good.  But, as you have probably now come to expect in this blog post, another weird problem is happening.  Now, we are reliably missing the "G" from the start of the "GET" in the HTTP request.  The web server logs show something like:

Serving HTTP on 0.0.0.0 port 8000 ...
203.28.176.11 - - [27/Jul/2021 21:23:42] code 501, message Unsupported method ('ET')
203.28.176.11 - - [27/Jul/2021 21:23:42] "ET /index.h65 HTTP/1.1" 501 -

That's clearly not helpful. 

The problem here was that we were not ACKing the SYN+ACK during TCP establishment, which meant that the first character of the TCP stream was playing that role, and thus getting eaten, rather than passed to the web server.

That's easy to fix by implementing the ACK explicitly: incrementing the local-side TCP sequence number when we see the SYN+ACK.  And indeed, that allows a page to load. But clicking on a link doesn't work.
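For reference, roughly what that fix looks like (a sketch with my own names and structures, not the actual WeeIP code; the flag bit values are TCP's real ones):

    #include <stdint.h>

    #define FL_SYN 0x02
    #define FL_ACK 0x10

    typedef struct {
      uint32_t iss;       /* our initial sequence number */
      uint32_t snd_nxt;   /* next sequence number we will send */
      uint32_t rcv_nxt;   /* next sequence number we expect */
      int      state;
    } tcb_t;

    enum { SYN_SENT, ESTABLISHED };

    extern void send_ack(tcb_t *t);

    void on_syn_ack(tcb_t *t, uint8_t flags, uint32_t remote_seq)
    {
      if (t->state == SYN_SENT
          && (flags & (FL_SYN | FL_ACK)) == (FL_SYN | FL_ACK)) {
        t->rcv_nxt = remote_seq + 1;  /* their SYN consumes one seq no. */
        t->snd_nxt = t->iss + 1;      /* ...and so did ours */
        send_ack(t);                  /* explicit third handshake step */
        t->state = ESTABLISHED;
      }
    }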

Looking at a tcpdump of one of those failed link clicks, it looked like the problem might be re-use of TCP ports, such that the remote end can't discriminate between the sessions. Nope, it's actually that the 3-way handshake is going wrong: we send a SYN to the server, the server replies with SYN+ACK, and then we should reply with just an ACK. But we are sending a 2nd SYN. In theory even that should be fine, but for whatever reason our 2nd SYN has its sequence number decremented by one.

Now it feels like I'm finally making progress: 

1. DHCP naturally takes a couple of seconds, because it also has to activate the Ethernet interface and wait for it to auto-negotiate with the switch.  Once that happens, the DHCP lease is obtained within a couple of seconds, and then making the HTTP connection and loading a page is really fast -- probably about 1 second for a ~100KB page. Okay, the "network" is just a cable between the MEGA65 and my laptop, but nonetheless, it is all working really nicely and smoothly, and quite reliably.

2. I figured out the "links broken after navigating two pages" bug: I wasn't resetting the browser's record of where in the page the view was positioned when a new page loaded. Thus, if you had scrolled the page, the links would be out of sync, and it would mess up. That was nice and easy to fix.

3. I wasn't clearing colour RAM between page loads, which was leaving some fluff at the bottom of pages sometimes. That's now cleaned up.

In fact, I managed to really quickly and easily navigate around a few pages, until it broke when trying to follow the link back from one of the pages I hadn't tried before.

It was so nice to get it working that I did a bit of a stream showing it off.


Next step will be to allow users to type in a URL, at which point it will be minimally functional for people to try out.

Then it will be investigating the occasional page-load failures, to see if I can get to the bottom of those.

I'll then have to divert my attention back to other things, so if someone out there is interested in continuing work on the browser, I'd certainly welcome it. Just poke me on the MEGA65 Discord server (link in the nav bar at mega65.org).
