Saturday, 5 September 2020

Fixing the remaining ethernet RX bugs

For a while now there has been an intermittent problem with receiving ethernet frames: the occasional byte of a received packet is corrupted.  The CRC of the received packets checks out fine, so I am assuming it is read/write contention on the receive buffer dual-port memories: basically, if you try to read from a dual-port memory at the same time as it is being written to, you can end up reading rubbish.

A secondary problem is that we have only two RX buffers, which means that the MEGA65 has a bit of a risk of missing the odd packet, if they come in quick succession. Also, the mechanism for indicating when an RX buffer is free is a bit of a pain and prone to occasionally confusing the ethernet RX engine.

My plan is to attack both problems at the same time, by switching from a single 4KB RX buffer memory divided into two logical halves to four separate 2KB RX buffers.  That should help avoid packet loss, and also the annoying RX errors.

Then, to improve the RX buffer handling, I'll make a register that asks for the next received packet to be made visible, and simultaneously frees the previously visible RX buffer.  In this way it will be super simple to use, and will also make it easy to later add more RX buffers without having to change the ethernet driver code.

First step is to replace the single 4KB buffer with 4x2KB buffers. Done.

Then we need to keep track of which buffer the CPU can see, and which the ethernet controller can see. Also easy and done.

Then we need a way for the CPU to tell the ethernet controller when an RX buffer is free.  I'll do this at the point when the CPU asks for the next received packet to be made visible.  I'll keep a counter of the number of free RX buffer slots, which can only ever be between 0 and 3, because one buffer is always held by the CPU (so that the RX buffer can't change while the CPU is reading the packet out).

To implement this, the register $D6E1 will be changed: Bits 1 and 2 were previously used to pick which of the two RX buffers was visible to the CPU, and which could be written to by the ethernet controller.  Now, reading those two bits will return the number of free RX buffers, and writing a 1 to bit 1 will cause the RX buffer currently visible to the CPU to be released, and the next received packet, if any, to be made available to the processor.
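In rough terms, the new register behaviour looks something like the following sketch. This is just an illustration of the idea, with invented signal names and a guessed bit ordering for the free-buffer count, not the actual MEGA65 VHDL:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Illustrative sketch of the new $D6E1 RX buffer handling (not the
-- actual MEGA65 source): bits 1-2 read back the number of free RX
-- buffers (0-3), and writing a 1 to bit 1 releases the buffer the CPU
-- is holding and requests the next received packet.
entity eth_rx_reg_sketch is
  port (
    clk          : in  std_logic;
    reg_write    : in  std_logic;                     -- CPU write strobe for $D6E1
    reg_wdata    : in  std_logic_vector(7 downto 0);
    reg_rdata    : out std_logic_vector(7 downto 0);
    free_buffers : in  unsigned(1 downto 0);          -- maintained by the RX engine
    accept_next  : out std_logic                      -- pulse to the RX buffer rotator
  );
end entity;

architecture rtl of eth_rx_reg_sketch is
begin
  -- Read side: present the free buffer count in bits 1-2
  -- (which of the two bits is the LSB is assumed here).
  reg_rdata <= "00000" & std_logic_vector(free_buffers) & '0';

  -- Write side: a 1 in bit 1 releases the buffer the CPU holds and
  -- asks for the next received packet to be made visible.
  process (clk)
  begin
    if rising_edge(clk) then
      accept_next <= '0';
      if reg_write = '1' and reg_wdata(1) = '1' then
        accept_next <= '1';
      end if;
    end if;
  end process;
end architecture;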

I was pleasantly surprised how few synthesis runs I needed to get this generally working: writing to $D6E1 with bit 1 set flips the RX buffer, and all the rest. However, there is a subtle problem where flipping to the next RX buffer sometimes does something weird: the RX buffer doesn't get cleared, but the count of free RX buffers does get updated.  The result is that the logic to pick the next received packet to present to the CPU can keep presenting the same packet to the CPU over and over.

The logic I have there looks right, but clearly isn't.  It is possible I am trying to be a bit too clever, by trying to find the first available buffer / filled buffer, rather than just rotating around them in a ring.  That would also make sure we process packets in the correct order.  I'm not really sure why I didn't do this to begin with, as it is clearly the most sensible algorithm for handling the packet buffers.  So time to synthesise again...
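In spirit, the ring version is nothing more than the following sketch (invented signal names, not the real source): the CPU-visible buffer and the ethernet-side buffer each just advance by one slot, modulo the number of buffers, so packets are always handed to the CPU in arrival order.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Sketch of ring-ordered RX buffer selection (illustrative only):
-- instead of searching for the first filled buffer, the CPU-visible
-- buffer advances by one (mod 4) each time the CPU accepts a packet,
-- and the ethernet-side buffer advances by one each time a frame lands.
entity rx_ring_sketch is
  port (
    clk             : in  std_logic;
    accept_next     : in  std_logic;   -- CPU wrote a 1 to $D6E1 bit 1
    packet_received : in  std_logic;   -- RX engine finished a frame
    cpu_buffer      : out unsigned(1 downto 0);
    eth_buffer      : out unsigned(1 downto 0);
    free_buffers    : out unsigned(1 downto 0)
  );
end entity;

architecture rtl of rx_ring_sketch is
  signal cpu_sel  : unsigned(1 downto 0) := "00";
  signal eth_sel  : unsigned(1 downto 0) := "01";
  signal free_cnt : unsigned(1 downto 0) := "11";  -- 3 free, 1 held by the CPU
begin
  cpu_buffer   <= cpu_sel;
  eth_buffer   <= eth_sel;
  free_buffers <= free_cnt;

  process (clk)
    variable do_accept : boolean;
    variable do_recv   : boolean;
  begin
    if rising_edge(clk) then
      -- Only accept if at least one filled buffer is waiting, and only
      -- record a reception if the ethernet side had a free slot to use.
      do_accept := accept_next = '1' and free_cnt /= 3;
      do_recv   := packet_received = '1' and free_cnt /= 0;
      if do_accept then
        cpu_sel <= cpu_sel + 1;   -- 2-bit counter wraps 3 -> 0 automatically
      end if;
      if do_recv then
        eth_sel <= eth_sel + 1;
      end if;
      if do_accept and not do_recv then
        free_cnt <= free_cnt + 1;
      elsif do_recv and not do_accept then
        free_cnt <= free_cnt - 1;
      end if;
      -- If both happen in the same cycle the free count is unchanged.
    end if;
  end process;
end architecture;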

Ok, so the ring buffer works much better, but we are still ending up with the number of free RX buffers being incorrectly calculated.  I suspect that part of the problem might be that the write to $D6E1 to accept a packet from the RX buffer queue is being presented for two successive cycles on the MEGA65, which causes the logic to be executed more than once, generally resulting in confusion.  So I am synthesising a fix for that, as well.
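The fix is simply to turn that (possibly multi-cycle) write strobe into a single one-cycle pulse, along these lines (again a sketch with invented names):

library ieee;
use ieee.std_logic_1164.all;

-- Sketch: a CPU write to $D6E1 may be asserted for two or more
-- consecutive cycles, so detect only its rising edge and produce a
-- single one-cycle pulse for the accept-packet logic (names invented).
entity write_edge_sketch is
  port (
    clk          : in  std_logic;
    reg_write    : in  std_logic;   -- raw write strobe, may stay high for >1 cycle
    accept_pulse : out std_logic    -- exactly one cycle per CPU write
  );
end entity;

architecture rtl of write_edge_sketch is
  signal reg_write_prev : std_logic := '0';
begin
  process (clk)
  begin
    if rising_edge(clk) then
      reg_write_prev <= reg_write;
      -- Only pulse on the rising edge of the write strobe.
      if reg_write = '1' and reg_write_prev = '0' then
        accept_pulse <= '1';
      else
        accept_pulse <= '0';
      end if;
    end if;
  end process;
end architecture;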

Now, the whole reason for doing this was to fix the corruption of received bytes. But I am still seeing corrupt ping reply packets, and at the same time I am not seeing signs of the packets being received incorrectly by the MEGA65, nor of receive errors on the other side, which is what we would expect if the transmit side were having signal integrity problems.

So if the problem isn't on the receive side, or the actual sending of packets, it makes me wonder if the problem isn't in the stuffing of the TX buffer. There is only one TX buffer, and if we don't try to send a packet while another is already being sent, then it shouldn't ever have read-write conflicts.

But I can at least test this theory by not merely sending multiple responses to a single ping request, but actually building the ping reply packets from scratch for each: If some of these get corrupted, but others not, then we know it is not on the receive side -- else we would have copied the same error, and the clones would all have the same fault.

First up, I am indeed seeing that some packets are being corrupted, but not others, which means that the RX buffers are not the problem.  BUT I am also seeing something weird where the entire RX buffer changes.  What I'm not sure of is whether that is because the RX buffer is somehow being advanced to the next one, or if something else weird is going on.  Similarly, I am also sometimes seeing the same packet over and over.

A bit of debugging shows that the ethernet RX buffer management is still being a bit dodgy:  Packet after packet, it can get stuck with the CPU and ethernet controller accessing the same RX buffers, even when we advance the buffer.  Looking closer, this is when it correctly thinks that there have been no more packets received. But it still shouldn't stay the same all the time; rather, it should rotate through the set of RX buffers.

I *think* the problem there was that if the ethernet register was written to for more than one consecutive cycle it could rotate through all the buffers, and generally confuse things.  In any case, I haven't seen any more packets with corrupt bytes in them.  So that's a plus.

However, there is still a few percent packet loss.  In at least one case, there were three consecutive packets that were truncated -- but still with valid CRCs. Also, if I try to construct and send multiple replies to ping packets, often most of the duplicates end up containing a completely different, incorrect packet.  This causes packet traces like this:

08:36:53.065726 IP 203.19.107.1 > 203.19.107.65: ICMP echo request, id 32126, seq 146, length 64
    0x0000:  4040 4040 4040 1005 019f fcfd 0800 4500  @@@@@@........E.
    0x0010:  0054 ffb9 4000 4001 ce85 cb13 6b01 cb13  .T..@.@.....k...
    0x0020:  6b41 0800 6d0c 7d7e 0092 8db0 3d5f 0000  kA..m.}~....=_..
    0x0030:  0000 8200 0100 0000 0000 1011 1213 1415  ................
    0x0040:  1617 1819 1a1b 1c1d 1e1f 2021 2223 2425  ...........!"#$%
    0x0050:  2627 2829 2a2b 2c2d 2e2f 3031 3233 3435  &'()*+,-./012345
    0x0060:  3637                                     67
08:36:53.066540 IP 203.19.107.65 > 203.19.107.1: ICMP echo reply, id 32126, seq 146, length 64
    0x0000:  1005 019f fcfd 4040 4040 4040 0800 4500  ......@@@@@@..E.
    0x0010:  0054 ffb9 4000 4001 ce85 cb13 6b41 cb13  .T..@.@.....kA..
    0x0020:  6b01 0000 6d0c 7d7e 0092 8db0 3d5f 0000  k...m.}~....=_..
    0x0030:  0000 8a00 0100 0000 0000 1011 1213 1415  ................
    0x0040:  1617 1819 1a1b 1c1d 1e1f 2021 2223 2425  ...........!"#$%
    0x0050:  2627 2829 2a2b 2c2d 2e2f 3031 3233 3435  &'()*+,-./012345
    0x0060:  3637 bc55 ce51                           67.U.Q
08:36:53.067141 IP6 fe80:be:1205:141::9f:fcfd.5353 > ff02::800:fb.5353: 0 PTR (QM)? _googlecast._tcp.local. (40)
    0x0000:  1005 019f fcfd 4040 4040 4040 86dd 600f  ......@@@@@@..`.
    0x0010:  6e5a 0030 1101 fe80 00be 1205 0141 0000  nZ.0.........A..
    0x0020:  0000 009f fcfd ff02 0000 0000 0000 0000  ................
    0x0030:  0000 0800 00fb 14e9 14e9 0030 7d83 0000  ...........0}...
    0x0040:  0000 0001 0000 0000 0000 0b5f 676f 6f67  ..........._goog
    0x0050:  6c65 6361 7374 045f 7463 7005 6c6f 6361  lecast._tcp.loca
    0x0060:  6c00 000c 0001 4df4 072b                 l.....M..+
08:36:53.067452 IP6 fe80:be:1205:141::9f:fcfd.5353 > ff02::800:fb.5353: 0 PTR (QM)? _googlecast._tcp.local. (40)
    0x0000:  1005 019f fcfd 4040 4040 4040 86dd 600f  ......@@@@@@..`.
    0x0010:  6e5a 0030 1101 fe80 00be 1205 0141 0000  nZ.0.........A..
    0x0020:  0000 009f fcfd ff02 0000 0000 0000 0000  ................
    0x0030:  0000 0800 00fb 14e9 14e9 0030 7d83 0000  ...........0}...
    0x0040:  0000 0001 0000 0000 0000 0b5f 676f 6f67  ..........._goog
    0x0050:  6c65 6361 7374 045f 7463 7005 6c6f 6361  lecast._tcp.loca
    0x0060:  6c00 000c 0001 4df4 072b                 l.....M..+
08:36:53.067744 IP6 fe80:be:1205:141::9f:fcfd.5353 > ff02::800:fb.5353: 0 PTR (QM)? _googlecast._tcp.local. (40)
    0x0000:  1005 019f fcfd 4040 4040 4040 86dd 600f  ......@@@@@@..`.
    0x0010:  6e5a 0030 1101 fe80 00be 1205 0141 0000  nZ.0.........A..
    0x0020:  0000 009f fcfd ff02 0000 0000 0000 0000  ................
    0x0030:  0000 0800 00fb 14e9 14e9 0030 7d83 0000  ...........0}...
    0x0040:  0000 0001 0000 0000 0000 0b5f 676f 6f67  ..........._goog
    0x0050:  6c65 6361 7374 045f 7463 7005 6c6f 6361  lecast._tcp.loca
    0x0060:  6c00 000c 0001 4df4 072b                 l.....M..+
08:36:53.068047 IP6 fe80:be:1205:141::9f:fcfd.5353 > ff02::800:fb.5353: 0 PTR (QM)? _googlecast._tcp.local. (40)
    0x0000:  1005 019f fcfd 4040 4040 4040 86dd 600f  ......@@@@@@..`.
    0x0010:  6e5a 0030 1101 fe80 00be 1205 0141 0000  nZ.0.........A..
    0x0020:  0000 009f fcfd ff02 0000 0000 0000 0000  ................
    0x0030:  0000 0800 00fb 14e9 14e9 0030 7d83 0000  ...........0}...
    0x0040:  0000 0001 0000 0000 0000 0b5f 676f 6f67  ..........._goog
    0x0050:  6c65 6361 7374 045f 7463 7005 6c6f 6361  lecast._tcp.loca
    0x0060:  6c00 000c 0001 4df4 072b                 l.....M..+
08:36:53.068351 IP6 fe80:be:1205:141::9f:fcfd.5353 > ff02::800:fb.5353: 0 PTR (QM)? _googlecast._tcp.local. (40)
    0x0000:  1005 019f fcfd 4040 4040 4040 86dd 600f  ......@@@@@@..`.
    0x0010:  6e5a 0030 1101 fe80 00be 1205 0141 0000  nZ.0.........A..
    0x0020:  0000 009f fcfd ff02 0000 0000 0000 0000  ................
    0x0030:  0000 0800 00fb 14e9 14e9 0030 7d83 0000  ...........0}...
    0x0040:  0000 0001 0000 0000 0000 0b5f 676f 6f67  ..........._goog
    0x0050:  6c65 6361 7374 045f 7463 7005 6c6f 6361  lecast._tcp.loca
    0x0060:  6c00 000c 0001 4df4 072b                 l.....M..+
08:36:53.068658 IP6 fe80:be:1205:141::9f:fcfd.5353 > ff02::800:fb.5353: 0 PTR (QM)? _googlecast._tcp.local. (40)
    0x0000:  1005 019f fcfd 4040 4040 4040 86dd 600f  ......@@@@@@..`.
    0x0010:  6e5a 0030 1101 fe80 00be 1205 0141 0000  nZ.0.........A..
    0x0020:  0000 009f fcfd ff02 0000 0000 0000 0000  ................
    0x0030:  0000 0800 00fb 14e9 14e9 0030 7d83 0000  ...........0}...
    0x0040:  0000 0001 0000 0000 0000 0b5f 676f 6f67  ..........._goog
    0x0050:  6c65 6361 7374 045f 7463 7005 6c6f 6361  lecast._tcp.loca
    0x0060:  6c00 000c 0001 4df4 072b                 l.....M..+
08:36:53.068962 IP6 fe80:be:1205:141::9f:fcfd.5353 > ff02::800:fb.5353: 0 PTR (QM)? _googlecast._tcp.local. (40)
    0x0000:  1005 019f fcfd 4040 4040 4040 86dd 600f  ......@@@@@@..`.
    0x0010:  6e5a 0030 1101 fe80 00be 1205 0141 0000  nZ.0.........A..
    0x0020:  0000 009f fcfd ff02 0000 0000 0000 0000  ................
    0x0030:  0000 0800 00fb 14e9 14e9 0030 7d83 0000  ...........0}...
    0x0040:  0000 0001 0000 0000 0000 0b5f 676f 6f67  ..........._goog
    0x0050:  6c65 6361 7374 045f 7463 7005 6c6f 6361  lecast._tcp.loca
    0x0060:  6c00 000c 0001 4df4 072b                 l.....M..+

That is, we see the ICMP PING request, followed by one valid reply, and then 9 of these strange other packets, each of which is identical.  They have the source MAC address of the MEGA65 (40:40:40:40:40:40), and they are being sent a fraction of a millisecond apart, which matches the code that I am using to try to send multiple replies:

isicmpping:
    ; It's a ping packet.

    jsr send_icmp_reply
    jsr send_icmp_reply
    jsr send_icmp_reply
    jsr send_icmp_reply
    jsr send_icmp_reply
    jsr send_icmp_reply
    jsr send_icmp_reply
    jsr send_icmp_reply
    jsr send_icmp_reply

Nowhere in this routine does the RX buffer get rotated. And more to the point, I have no idea where this other packet data is coming from.  Certainly the MEGA65 doesn't support GoogleCast, so I can only assume that it is some other packet that has ended up in a buffer somewhere.

Ok, that's interesting: When I instrument the RX buffer selection between sending these packets, the RX buffers are in fact being rotated after sending the first packet back, but not after that.  Before sending the first reply, I am seeing, for example, that the CPU-side RX buffer is #1, and the ethernet side is #2. Then after sending the first reply, the CPU-side RX buffer is #0 and the ethernet side is #1.  Ah, that I recognise: that combination is the default buffer allocation after the ethernet state machine has been reset. The cause was that I had code to unjam the TX state machine before sending a packet, left over from dealing with previous bugs.  So I'll remove that, and see how we go.

Ok, so that fixes that problem: I am now getting an ample quantity of identical ping replies -- and still without any corrupt packets, which is also nice. So we are making headway.  Now the remaining problem is that we are still getting some packet loss.  What's not yet clear is whether the MEGA65 is missing packets, or whether the replies are being ignored for some reason.

Hmm.... I just removed the duplicate reply sending code, and saw the first corrupted packet in a while. Weirder still, each PING is getting a reply immediately, and then another after 1 second, i.e., when the next PING packet comes in.   Thus the packet log looks like this:

09:00:33.205017 IP 203.19.107.1 > 203.19.107.65: ICMP echo request, id 5052, seq 176, length 64

09:00:33.205830 IP 203.19.107.65 > 203.19.107.1: ICMP echo reply, id 5052, seq 175, length 64

09:00:33.206449 IP 203.19.107.65 > 203.19.107.1: ICMP echo reply, id 5052, seq 176, length 64

And I just saw another of those corrupted packets, in duplicate because of this bug.  But both duplicates were identical, indicating that the corruption happened when the MEGA65 originally received the request packet.

Time to add some diagnostics to see if the MEGA65 thinks it is receiving each packet twice... Which it does.  Now to find out why...  I can see that the ethernet-side RX buffer advances as the new packet comes in, but then for some reason we process the old packet again first.  Found it: I was being oversensitive to packet reception, a left-over from dealing with bugs in the ethernet controller that I have since fixed. Basically I was noticing when a packet had started to be received, as well as when it had actually been received.

I've also looked more into the few remaining packet errors that I am seeing:  These are now exclusively the last byte of the packet being changed to $BD.  This is a strong hint as to what is going on: Almost certainly this is the end-of-packet marker that is supposed to be written to the receive buffer immediately following the packet.  Something is causing it to sometimes be written over the last byte of the packet.  

It might be that we detect the end of the RX Valid signal from the ethernet controller very slightly early, e.g., after we have received only 6 of the 8 bits of the last byte.  It must be marginal timing if this is the cause, as the ethernet CRC is still being correctly verified.

I'll try fixing this by requiring the RX Valid signal to be deasserted for 3 cycles instead of 2.  I'll also change the byte that gets written at the end to another value, so that I can confirm conclusively that this is the cause of the incorrect values.  The choice of $BD was quite arbitrary. I'll make the equally arbitrary decision that the new value will be $EA.
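Conceptually the change is just a slightly longer deassert counter before declaring end-of-frame, roughly like this (a sketch only; the names, structure and surrounding control are not the actual MEGA65 code):

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Sketch: require eth_rxdv to stay low for 3 consecutive cycles before
-- declaring end-of-frame (and writing the end-of-packet marker), so the
-- marker cannot land on top of the final byte of the packet.
entity rx_end_of_frame_sketch is
  port (
    clk50mhz     : in  std_logic;
    eth_rxdv     : in  std_logic;   -- RX Valid from the ethernet controller
    frame_active : in  std_logic;   -- cleared by the surrounding logic after end_of_frame
    end_of_frame : out std_logic
  );
end entity;

architecture rtl of rx_end_of_frame_sketch is
  signal rxdv_low_count : unsigned(1 downto 0) := "00";
begin
  process (clk50mhz)
  begin
    if rising_edge(clk50mhz) then
      end_of_frame <= '0';
      if eth_rxdv = '1' then
        rxdv_low_count <= "00";
      elsif frame_active = '1' and rxdv_low_count /= 3 then
        rxdv_low_count <= rxdv_low_count + 1;
        if rxdv_low_count = 2 then
          -- RX Valid has now been low for 3 consecutive cycles:
          -- safe to end the frame and write the $EA marker after it.
          end_of_frame <= '1';
        end if;
      end if;
    end if;
  end process;
end architecture;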

Now testing this fix.  Watching a ping run for a long time, I am seeing that some of the remaining packet loss is being caused by packets truncated by some non-trivial amount, often 12 bytes or 34 bytes.  I have instrumented the receive programme and found that these packets are received already truncated by the MEGA65, so the problem is on the receive side.  The question is what the source of that might be.

On the receive side, these short packets are not triggering a CRC error, which makes me rather suspicious.  Also, sometimes these packets end up with the "is multicast" flag in our receiver being asserted, rather than the "is broadcast" flag.  Even so, some of those packets result in valid replies to the PING.

I wondered if I had a dodgy ethernet cable, but switching to another made no difference to the kinds of errors I am seeing.

So to summarise, in terms of missed ping replies there seem to be several causes:

1. MEGA65 receives some packets truncated.

2. MEGA65 receives a valid packet, but the connected computer receives no reply packet, despite all evidence that the MEGA65 has sent a reply.

and also:

3. MEGA65 sometimes thinks broadcast ICMP PING packets are multicast packets instead, but then replies to them without problem.

4. The connected computer does not indicate any RX frame errors, suggesting that the MEGA65 is not sending any invalid frames at any point.

Now, what I haven't tested yet is whether it is my poor laptop that has the problem with its ethernet, since that is the other common piece here. So time to pull the Mac out and try with that.

First thing, the Mac is complaining that the first data byte of the ping replies is incorrect.  Apparently I had the location of the ICMP CRC field in the packet wrong, and so was updating the first data byte, not the ICMP CRC.

The second thing that the Mac revealed is that at least some of the packets are being rejected because the ICMP CRC is incorrect, being out by 1.  I'm not sure if the problem is that the packets are being generated with the wrong CRC.  But it probably suggests that it's worth calculating the ICMP CRC ourselves, rather than trying to just patch the CRC of the PING.

I've added the code to calculate the ICMP checksum directly, which has reduced, but not eliminated, the number of bad packets.  More to the point, the types of errors seem to change as I make these fixes and eliminate potential sources of the problem -- but somewhere between 3% and 5% of packets continue to be lost. AND the problem continues to be worse with larger packets: with 900 byte ICMP PING packets only a very small fraction actually succeed.

This brings me back to the ethernet RX and TX lines -- in particular the TX lines, as we seem to reliably receive even very large packets.  However adjusting the TX phase seems to make no difference at all.  Modifying the TX routine to send the same TX buffer twice reveals why: Exactly the same checksum error occurs on both copies -- so the problem is actually in the TX buffer.  So hopefully if I can fix that, I can finally fix the packet errors.

Ok, found one quite likely source of the problem: I was clocking the TX buffer's write side on the 50MHz clock, instead of on the CPU clock.  This would cause exactly the kind of random corruption of bytes during writing to the buffer that we have been seeing. This also explains why the ethernet frames were not being sent with corrupt CRCs, because the ethernet controller computes the ethernet CRC in real-time as it sends the frame, and then appends it at the end.
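The corrected arrangement is essentially a dual-port, dual-clock buffer with the write port in the CPU clock domain and the read port in the ethernet clock domain, along the lines of this sketch (illustrative names and sizes, not the actual MEGA65 source):

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Sketch of the corrected TX buffer clocking: the write port must be
-- clocked by the CPU clock that the write strobes are synchronous to,
-- while the read port runs on the 50MHz ethernet clock.  Clocking the
-- write port on the 50MHz clock instead is what was randomly
-- corrupting bytes as they were written into the buffer.
entity tx_buffer_sketch is
  port (
    cpuclock   : in  std_logic;                    -- ~40.5MHz CPU side
    clock50mhz : in  std_logic;                    -- ethernet side
    we         : in  std_logic;
    waddr      : in  unsigned(10 downto 0);
    wdata      : in  std_logic_vector(7 downto 0);
    raddr      : in  unsigned(10 downto 0);
    rdata      : out std_logic_vector(7 downto 0)
  );
end entity;

architecture rtl of tx_buffer_sketch is
  type ram_t is array (0 to 2047) of std_logic_vector(7 downto 0);
  signal ram : ram_t := (others => (others => '0'));
begin
  -- Write port: CPU clock domain.
  process (cpuclock)
  begin
    if rising_edge(cpuclock) then
      if we = '1' then
        ram(to_integer(waddr)) <= wdata;
      end if;
    end if;
  end process;

  -- Read port: ethernet clock domain.
  process (clock50mhz)
  begin
    if rising_edge(clock50mhz) then
      rdata <= ram(to_integer(raddr));
    end if;
  end process;
end architecture;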

Fixed that, and that's got rid of the byte corruption in transmitted packets.  I can now reliably respond to ping packets both big and small.  The packet error rate is now down to about 2%. The remaining packet errors are short packets, i.e., with valid ethernet checksum, but truncated for some reason.  The question is whether this is due to some problem with transmission, or the transmit buffer length setup.  This can be answered by going back to sending two copies of each packet, and seeing whether both or only one gets truncated when this problem occurs.

Ok, that didn't take long: And as I hoped, the problem occurs for both packets:

14:58:55.768602 IP 203.19.107.1 > 203.19.107.65: ICMP echo request, id 1676, seq 10, length 958
14:58:55.770340 IP 203.19.107.65 > 203.19.107.1: ICMP echo reply, id 1676, seq 10, length 958
14:58:55.770404 IP 203.19.107.65 > 203.19.107.1: ICMP echo reply, id 1676, seq 10, length 958
14:58:56.769145 IP 203.19.107.1 > 203.19.107.65: ICMP echo request, id 1676, seq 11, length 958
14:58:56.770697 IP truncated-ip - 886 bytes missing! 203.19.107.65 > 203.19.107.1: ICMP echo reply, id 1676, seq 11, length 958
14:58:56.770718 IP truncated-ip - 886 bytes missing! 203.19.107.65 > 203.19.107.1: ICMP echo reply, id 1676, seq 11, length 958

What is interesting is that the ethernet frame is the correct length: It is only the contents of the packet that are wrong.  I need to look at the contents of one of these truncated packets to see exactly where the problem is... Except I just also saw one where only one of the packets was messed up... And I just saw a packet with incorrect data in it.

All this points to some subtle problem with the packet reception side of things again.

And from there it has all gone downhill a bit again. The intermittent problem I have previously seen, where the TX side of the finite state machine locks up, is now happening pretty much all of the time.  This means that no packets get sent at all, while the program waits in vain for the TX side to indicate readiness. However, it IS ready, in that if I go ahead and send a packet anyway, it will get sent.

The symptoms are that the eth_tx_complete line gets stuck high.  The only place that asserts that is the SentFrame state.  As soon as eth_tx_trigger goes low, it will leave that state, and go to state IdleWait. The instrumentation I have on $D6E4 lets me see both the eth_tx_complete and eth_tx_trigger lines in real-time.  And eth_tx_complete is high, and eth_tx_trigger is low. This means that it can't be stuck in the SentFrame state.  The IdleWait state just counts 50 cycles, and then proceeds to InterPacketGap. As soon as the inter-packet gap timeout has been served, eth_tx_complete gets cleared, and the state moves to Idle. It is not possible for eth_tx_complete to get asserted again in that cycle, but only after sending a complete frame.
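Reconstructed from that description (this is a sketch from the prose above, not the actual MEGA65 source, and the inter-packet gap length is assumed), the tail end of the TX state machine looks like this, which is why eth_tx_complete high with eth_tx_trigger low should only ever be a transient condition:

library ieee;
use ieee.std_logic_1164.all;

-- Sketch of the SentFrame -> IdleWait -> InterPacketGap -> Idle path
-- described above. eth_tx_complete is only asserted in SentFrame and
-- is cleared once the inter-packet gap has been served, so it should
-- never remain stuck high while eth_tx_trigger is low.
entity tx_tail_sketch is
  port (
    clock50mhz      : in  std_logic;
    eth_tx_trigger  : in  std_logic;
    eth_tx_complete : out std_logic := '0'
  );
end entity;

architecture rtl of tx_tail_sketch is
  type tx_state_t is (Idle, SentFrame, IdleWait, InterPacketGap);
  signal state    : tx_state_t := Idle;
  signal wait_cnt : integer range 0 to 63 := 0;
begin
  process (clock50mhz)
  begin
    if rising_edge(clock50mhz) then
      case state is
        when Idle =>
          if eth_tx_trigger = '1' then
            -- (preamble and frame-sending states elided in this sketch)
            state <= SentFrame;
          end if;
        when SentFrame =>
          -- The only state that asserts eth_tx_complete.
          eth_tx_complete <= '1';
          if eth_tx_trigger = '0' then
            state    <= IdleWait;
            wait_cnt <= 0;
          end if;
        when IdleWait =>
          -- Just count 50 cycles, then serve the inter-packet gap.
          if wait_cnt = 49 then
            state    <= InterPacketGap;
            wait_cnt <= 0;
          else
            wait_cnt <= wait_cnt + 1;
          end if;
        when InterPacketGap =>
          -- Gap length assumed: 96 bit times = 48 dibit cycles at 50MHz.
          if wait_cnt = 47 then
            eth_tx_complete <= '0';
            state <= Idle;
          else
            wait_cnt <= wait_cnt + 1;
          end if;
      end case;
    end if;
  end process;
end architecture;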

Watching the values in the debug flags of $D6E4, I see the value sits on $34, unless I send a packet.  The $3x part means that the eth_txd bits are set to "11". The $x4 part indicates that eth_tx_complete is high, while eth_tx_trigger, eth_tx_commenced and eth_txen_int are all low: That is, we are not sending a packet, nor have we been triggered to do so.
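For reference, that debug byte is presumably composed of something like the following (the bit positions here are my guess from the $34 value just described, not the actual layout):

library ieee;
use ieee.std_logic_1164.all;

-- Sketch of a $D6E4-style debug register. Bit positions are guessed
-- from the decode above ($34 = txd "11" in bits 5:4, eth_tx_complete
-- in bit 2, trigger/commenced/txen low) and are not the real layout.
entity tx_debug_reg_sketch is
  port (
    eth_txd_int      : in  std_logic_vector(1 downto 0);
    eth_tx_complete  : in  std_logic;
    eth_tx_trigger   : in  std_logic;
    eth_tx_commenced : in  std_logic;
    eth_txen_int     : in  std_logic;
    debug_rdata      : out std_logic_vector(7 downto 0)
  );
end entity;

architecture rtl of tx_debug_reg_sketch is
begin
  -- With txd = "11", eth_tx_complete = '1' and everything else low,
  -- this reads back as $34 = 0011 0100.
  debug_rdata <= "00" & eth_txd_int
                 & eth_tx_trigger & eth_tx_complete
                 & eth_tx_commenced & eth_txen_int;
end architecture;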

So let's look back at the Idle state. There are three conditions under which it can leave Idle:

1. eth_tx_trigger = '1' , i.e., the user has asked for a packet to be sent. This is the highest priority option.

2. activity_dump='1' and activity_dump_ready_toggle has toggled. 

3. eth_videostream = '1' , activity_dump='0', and buffer_moby_toggle has toggled.

eth_videostream is accessible via bit 3 of $D6E1, and activity_dump can be set by writing to bit 2 of $D6E1 (but can't be read). So in theory we can clear those by writing $03 into $D6E1.   

Clearing them made no difference, but setting bit 3 (eth_videostream) DID temporarily clear the problem.  Hmm...but if I set eth_videostream, then reset the ethernet controller, then it seems to come good.

Also trying to TFTP fetch a file causes the problem to happen almost immediately, which is convenient for debugging.

Ok, so after some poking around, I found I had already made a good solution for this problem, but just wasn't using it: Bit 7 of $D6E0 indicates whether the TX state machine is ready to send a packet. This means a simple:

ethtxwait: BIT $D6E0 

           BPL ethtxwait

can be used to wait until the ethernet controller is ready to send the next packet.

So that fixes that, but now it still goes funny after a while, refusing to answer ARP requests.  This seems to be the return of another niggly bug, where the controller goes nuts sending some sort of frame, possibly the video or CPU activity dump frames we used to support. However, disabling those functions doesn't stop it. The only cure is to reset the whole ethernet controller.  I'm currently synthesising a bitstream that actually disables the dispatch of those packet types, to see if it cures the problem.  What is frustrating here is that I can't see for the life of me how it gets into this state, and more so, how it can stay in such a loop.  The FSM is not that complex, and once a frame starts sending, it's pretty inevitable that it progresses through to completion.

Disabling those special frame modes seems to have solved that part of the problem: The TX FSM no longer locks up, and I can even use my dodgy work-in-progress TFTP server to fetch a file fairly quickly:

$ tftp 203.19.107.65
tftp> get MEGA65.D81
Received 819200 bytes in 3.2 seconds
tftp> get MEGA65.D81
Received 819200 bytes in 2.5 seconds
tftp> get MEGA65.D81
Received 819200 bytes in 2.5 seconds
tftp> get MEGA65.D81
Error code 1: ? FILE NOT FOUND ERROR

The TFTP server even has nice 8-bit style error messages, as you can see above :) The reason it gets a file not found error here is that there is a bug in HYPPO, the MEGA65 Hypervisor's FAT32 file descriptor code, which gets a touch confused when too many files have been opened.

Now, while that works, it's only behaving nicely because I am sending every TFTP packet several times.  Also, every time one of those multiple sendings triggers an ACK from the receiver, that triggers yet more retransmissions. This nicely covers up the odd corrupted packet that we are still encountering.

After doing all the TFTP fetches, the RX side seems to have gotten a bit out of sync, with one more packet received than it has released, causing all ping replies to be one second late.  Resetting the ethernet controller fixes that, but it still shouldn't happen. I'll have to make another debug bitstream that shows the RX buffer allocations again, so that I can see what on earth is going on.

The errors I am seeing now are a combination of:

1. Bad ethernet frames received by the PC, indicating that our TX side is doing something wrong.

2. Truncated ethernet frames received by the PC, but with valid checksums, indicating that probably our RX side has messed up.

For both of these, and even for some of the prior problems, I am wondering if there isn't a problem with the clocks for the ethernet. The ethernet clocks were not being driven through a BUFG global buffer, and thus might have been skewing around within the ethernet controller. I've added BUFGs to all the ethernet clocks, to see if that doesn't solve the problem.
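For anyone unfamiliar with them, a BUFG is just a Xilinx global clock buffer primitive from the UNISIM library, instantiated something like this (the clock names here are illustrative):

library ieee;
use ieee.std_logic_1164.all;

library unisim;
use unisim.vcomponents.all;

-- Minimal sketch of putting an ethernet clock onto a global clock
-- buffer. Driving the clock through a BUFG keeps skew low across the
-- ethernet controller, instead of letting the clock fan out over
-- ordinary routing resources.
entity eth_clock_buffer_sketch is
  port (
    clock50mhz_raw : in  std_logic;   -- e.g. straight from the clock generator
    clock50mhz     : out std_logic    -- buffered copy used by the ethernet logic
  );
end entity;

architecture rtl of eth_clock_buffer_sketch is
begin
  eth_clk_bufg : BUFG
    port map (
      I => clock50mhz_raw,
      O => clock50mhz
    );
end architecture;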

Well, that didn't solve the problem either: We are still stuck with this rather stubborn 3% -- 5% packet loss, truncation or corruption.  That said, it's now flat at that level regardless of the size of the packets, which it wasn't previously, and the contents of the received packets seem to get corrupted much less frequently, so significant progress has still been made.

But... Looking at the timing report, it looks like adding the BUFGs has changed the names of the clocks, so timing closure was quite badly out.  So I guess I'll resynthesise with that corrected, and try it out.  Resynthesised: timing closure is now good, but the problem remains.

Next step is to try producing a series of identical packets from my Linux machine, and send them to the MEGA65, to see if packets are getting corrupted while being sent to the MEGA65.

I'm pulling my hair out on this, so I'll go and attack some other problems for a while, then come back to this in a week or two.

It's now a couple of weeks later, and I am revisiting this.  I found and fixed one problem with the TX state machine getting stuck, which fixed the problem where it would stop sending packets after a while. However, we are still seeing wrong packet lengths on received packets, and also that some packets are not being sent.

To try to understand why some packets are not being sent, I have been testing by sending a series of identical packets, and have indeed observed that sometimes the packets don't get sent.  The Linux box reports no RX frame errors, which means that we are either sending good frames, or not sending a frame at all.

This makes me think that the ethernet chip on the MEGA65 itself is blocking the transmission of the frames, OR that my TX finite state machine is getting stuck in a way that prevents a single packet from being sent, OR that the frame preamble is too short to be detected by the receiver. Let's look at each of the three, to see if we can figure out which it is, and how to fix it.


1. I've looked through the datasheet, and I can't find any mechanism for suppressing the transmission of frames (some chipsets have jabber prevention on the TX side, but this one doesn't seem to). So it probably isn't that.

2. I've added a counter that counts each time a TX frame gets triggered, which should help detect this.  It would also be great to probe the TXEN pin on the ethernet chip.  Unfortunately it is on the underside of the PCB, and I think it's one of those little surface-mount jobs that have no exposed pins, which would make that difficult.  But the counter should tell us, anyway.

3. If it's the preamble being a bit too short, then extending it a few symbols to allow time for TXEN to take effect should be a fairly simple intervention.  It needs to be 56 bits = 28 dibits. I was using 29, presumably to allow for the first to be lost.  I've now increased it to 32.

Those changes are now synthesising, so hopefully we will get some answers. If it suddenly is fixed, then we know that the preamble length was the problem.  If not, we can check the TX counter to make sure that it increases after each packet is sent.

Well, that's interesting: Increasing the TX preamble has now reduced the percentage of packets that get accepted to less than 10%.  I was able to confirm that the TX state machine seems to be working fine, though, using the debug instrumentation I added for this.

Now it looks like the preamble length is the problem. So I will make this run-time adjustable, to speed up the debugging of this. That still requires another bitstream to be built, but hopefully once built, I can nail this problem.

I found it interesting that making the preamble longer than required caused problems. I would have thought that ethernet cards would accept any valid preamble sequence, and thus a long preamble would be more reliably received.  But maybe it's the other way around: An excessively long preamble gets ignored, on the assumption that it is a jabbering network card, while perhaps shorter preambles are accepted, with the receiver assuming it might have missed the first few bits.

One curious thing with this: I was sending 29 dibits = 58 bits = 7.25 bytes, then the %11 dibit that marks the end of the preamble. Thus I was 4 bits short of the correct length.  Anyway, once this bitstream builds, I can work out how the preamble length affects reception of packets in a more thorough way.

Bitstream built, and the results are a bit annoying:  The random packet loss still continues, but I have also seen the TX state machine lock up. Admittedly just the once, but that's one too many.

What I have discovered is that the preamble MUST be an odd number of dibits long. This does suggest one possible mechanism for the problem: If there is a race condition between setting the TX dibit and asserting TXEN, then it is possible that sometimes the first dibit is NOT correctly sent.  This is not so implausible: I have seen that if I set it to use an even number of dibits, it isn't that no packets are received, but rather that only a very small fraction of packets are received... a suspiciously similar ratio to the number of packets that get lost when I use an odd number of dibits for the preamble.

The trick is to see what I can do to avoid this problem. I already have a TX phase adjustment mechanism, where I can vary the phase of the TX signals compared with the 50MHz reference clock. However, I don't have a mechanism to vary the phase between these two signals. That said, according to what I can see in the datasheet for the ethernet controller, this shouldn't matter: All that is important is that these signals be stable on each rising edge of the 50MHz reference clock.

So it should be possible to adjust the TX phase to make sure of this -- provided that the path skew on eth_txd and eth_txen is stable and approximately equal.  So let's check the four possible TX delay phase values, by checking the packet loss rate when sending ~100 ping reply packets, 5 times each to get an average:

%00 - 93, 87, 85, 90, 93 = average 89.6% packets delivered = 10.4% packet loss

%01 - 92, 95, 89, 92, 84 = average 90.4% packets delivered = 9.6% packet loss

%10 - 88, 91, 88, 92, 90 = average 89.8% packets delivered = 10.2% packet loss

%11 - 94, 88, 85, 84, 87 = average 87.6% packets delivered = 12.4% packet loss

All are broadly similar, to within the statistical variation of the tests.  For example, running subsequent tests with %11, I see results of between 8% and 13% packet loss. So I am fairly confident in the conclusion that the TX phase is not having any effect.

This is a bit surprising, as I would expect shifting the phase should allow at least one setting where the eth reference clock and data are sufficiently out of phase as to cause problems. Indeed, I have seen it doing this previously. It is sufficiently weird as to make me suspect that the TX phase control is not working for some reason. However, every indication from the source code is that it is working.

I have also checked that I have a few output drive stages, so there shouldn't be any big problems with skew or glitching.

I'm now going through the timing report for the design, to look for any problems there.  There are no timing violations for the eth_txd signals, but there is a large latency from eth_txd_out[1] to the actual eth_txd[1] pin of the FPGA of 8.913 ns, and for eth_txd_out[0] to eth_txd[0] a delay of 8.880 ns.  The eth_txen output delay is 8.586 ns, which should be close enough, given the 50MHz reference clock has a period of 20 ns, and we are talking about a skew between the txen and txd signals of less than 0.5 ns.  Thus because the difference in delay between them is small, any problem should be solvable by tweaking the TX phase setting to get them to line up nicely with the ethernet clock... which is clearly not happening.

The slew rate on the physical pins was set to slow, which I am now changing to fast, to see if that doesn't help tidy things up.

But it's also looking like I will need to pull the mainboard out of the MEGA65 so that I can flip it over to get to the ethernet chip, and measure things at its pins, to see if there is any weird problem with eth_txen being differentially delayed by different PCB trace lengths or some other effect.

I'll get to that in a moment, but in the meantime I also resynthesised with a change that sets the eth_txd signal to %01 when idly waiting before transmitting a packet.  My goal here was to make sure that it couldn't latch an incorrect dibit value, thus causing the preamble to be sent with an even instead of odd number of dibits.  The results of this for the four different TX phase values, as before, are:

%00 - 964 out of 1,000 packets received = 3.6% packet loss

%01 - 962 out of 1,000 packets received = 3.8% packet loss

%10 - 974 out of 1,000 packets received = 2.6% packet loss

%11 - 971 out of 1,000 packets received = 2.9% packet loss

Again, these are all within statistical noise of one another, but are all also substantially better than without this little change.  Sending 900 byte ping packets has the same packet loss, so we can at least be confident that the bit frame is fine.

It is also still holding true that if I change the preamble length to be even, I end up with packet loss roughly the opposite of the odd-length case, i.e., ~96% packet loss instead of ~4%.  Thus I still think that the TXEN line is asserting at not quite the right time.  So time to flip the MEGA65 main board over, and investigate...

Well, that's interesting... At first I thought the problem was slow ramping of the eth_txen line. But that's not the problem: For starters, I can rotate it around the eth reference clock using the TX phase control, and it makes no difference.  So I stuck the oscilloscope probes on eth_txen and eth_txd, so that I could see the delay from eth_txen being asserted to the start of the packet body.  When I do that, and have the oscilloscope set to trigger on eth_txen rising edge, I can see that the preamble is not always the same length. It sometimes is 20ns = one dibit longer. 

Confirming this required a bit of rather fiddly work. Fortunately I have a really nice Zeiss microscope available to me that has HDMI out.  This thing really is a dream for fiddly work:


So I set that up with OBS to capture the HDMI stream (my spare HDMI monitor is broken, so I couldn't just display it directly). Then I could easily see the tiny 5mm square chip, and find traces that led to the pins I cared about, and stick the oscilloscope probes in the vias on those traces:

To give an idea of how nice the microscope is, the above was taken at x0.8 magnification, without any real fiddling.  The digital camera is quite high resolution, and I can easily zoom in to see features a fraction of a millimetre wide, like the traces below the chip:

... and that's without switching the front optics to a higher gain.  But I digress...

Then it was a bit of gymnastics to hold both probes in the tiny vias, which were so close to each other that I had to hold them at just the right angles, and then hold my phone in the other hand to take repeated photos of the oscilloscope display until I got one that showed both good and bad traces. But I managed it in the end:


Well, this is all a welcome discovery, as it explains all of what I have been seeing lately, including that switching to an even preamble length roughly inverts the packet loss rate.

The only problem is, I can't for the life of me see how this can be happening.  The part of the state machine responsible for sending the preamble is rather simple:

    if rising_edge(clock50mhz) then
        case eth_tx_state is

...

        when Idle =>
          if eth_tx_trigger = '0' then
            eth_tx_complete <= '0';
          end if;

          -- XXX Try having TXD be ready for preamble, to see if that
          -- fixes the weird problem with packet loss due to wrong preamble length.
          eth_txd_int <= "01";
         
          if eth_tx_trigger = '1' then

            -- reset frame padding state
            eth_tx_padding <= '0';
            if to_integer(eth_tx_size)<60 then
              eth_tx_size_padded <= to_unsigned(60,12);
            else
              eth_tx_size_padded <= eth_tx_size;
            end if;
            -- begin transmission
            tx_frame_count <= tx_frame_count + 1;
            eth_tx_commenced <= '1';
            eth_tx_complete <= '0';
            -- 56 bits of preamble = 28 dibits.
            -- We add a few extra just to make sure.
            tx_preamble_count <= tx_preamble_length;
            eth_txen_int <= '1';
            eth_txd_int <= "01";
            eth_tx_state <= WaitBeforeTX;
            eth_tx_viciv <= '0';
            eth_tx_dump <= '0';
...

          end if;

        when WaitBeforeTX =>
          if eth_tx_packet_count /= "111111" then
            eth_tx_packet_count <= eth_tx_packet_count + 1;
          else
            eth_tx_packet_count <= "000000";
          end if;          

          txbuffer_readaddress <= 0;
          eth_tx_state <= SendingPreamble;
          report "Reseting TX CRC";
          tx_fcs_crc_load_init <= '1';
          report "TX CRC init not announcing data";
          tx_fcs_crc_d_valid <= '0';
          tx_fcs_crc_calc_en <= '0';
        when SendingPreamble =>
          if tx_preamble_count = 0 then
            eth_txd_int <= "11";
            eth_tx_state <= SendingFrame;
            eth_tx_bit_count <= 0;
            txbuffer_readaddress <= txbuffer_readaddress + 1;
            tx_fcs_crc_load_init <= '0';
            report "Releasing TX CRC init";
            tx_fcs_crc_d_valid <= '1';
            tx_fcs_crc_calc_en <= '1';
            report "TX CRC announcing input";
            if eth_tx_viciv='0' and eth_tx_dump='0' then
              eth_tx_bits <= txbuffer_rdata;
              tx_fcs_crc_data_in <= std_logic_vector(txbuffer_rdata);
              report "Feeding TX CRC $" & to_hstring(txbuffer_rdata);
            else
              eth_tx_bits <= x"ff";
              tx_fcs_crc_data_in <= x"ff";
            end if;
          else
            eth_txd_int <= "01";
            tx_preamble_count <= tx_preamble_count - 1;
          end if;

...

The key lines show that we work on the 50MHz clock, and send a preamble of tx_preamble_length dibits (which is a constant register), before moving on to sending the frame.

There is simply no place for a single extra dibit to be added to the preamble.  The ethernet clock is fed through a BUFG global buffer, so it shouldn't suffer from excessive jitter or skew.

It is niggling at me that the state machine seems able to get confused in some way that I am not considering, given that the transmit state machine generally seems able to get itself into a mess.  There are only two kinds of signals that can influence the TX state machine from outside the state machine itself: eth_tx_trigger and the reset lines.  The reset lines are stable, so should not pose a problem. However, eth_tx_trigger does get directly set by the 40.5MHz CPU side, and might be able to glitch, rather than just asserting stably.

This is easy enough to test: Add a drive stage on the 40.5MHz side of the signal, and then add an extra flip-flop to latch it across to the 50MHz ethernet side.
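In VHDL terms, the fix is the classic two-flip-flop synchroniser, something like this sketch (signal names are illustrative, not the actual MEGA65 source):

library ieee;
use ieee.std_logic_1164.all;

-- Sketch of the proposed fix: register eth_tx_trigger on the 40.5MHz
-- CPU side (the drive stage), then take it through two flip-flops in
-- the 50MHz ethernet clock domain, so the TX state machine only ever
-- sees a clean, stable level rather than a glitch.
entity tx_trigger_sync_sketch is
  port (
    cpuclock          : in  std_logic;   -- ~40.5MHz CPU clock
    clock50mhz        : in  std_logic;   -- 50MHz ethernet reference clock
    tx_trigger_cpu    : in  std_logic;   -- as written by the CPU
    eth_tx_trigger_50 : out std_logic    -- safe to use in the TX state machine
  );
end entity;

architecture rtl of tx_trigger_sync_sketch is
  signal trigger_drive : std_logic := '0';
  signal trigger_meta  : std_logic := '0';
  signal trigger_sync  : std_logic := '0';
begin
  -- Drive stage on the CPU clock side.
  process (cpuclock)
  begin
    if rising_edge(cpuclock) then
      trigger_drive <= tx_trigger_cpu;
    end if;
  end process;

  -- Two-stage synchroniser on the ethernet clock side.
  process (clock50mhz)
  begin
    if rising_edge(clock50mhz) then
      trigger_meta <= trigger_drive;
      trigger_sync <= trigger_meta;
    end if;
  end process;

  eth_tx_trigger_50 <= trigger_sync;
end architecture;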

I won't be surprised if this works, as back when the ethernet stuff used to work perfectly, the CPU was probably running at 50MHz, and thus there was no need for any cross-domain clocking in the ethernet controller.  The only question will be whether there are other signals crossing between the two sides that I need to handle.

However, as seems to perpetually be the case with sorting this ethernet adapter out, it still doesn't work.  BUT it has fixed the random insertion of an extra dibit into the preamble. It hasn't fixed the TX state machine locking up, however. So there are two pieces of evidence that something fishy is still going on.

BUT we have progress: The dropped packet rate HAS dropped, and all of the dropped packets are actually due to truncated packets, i.e., the problem we discovered earlier where the length of received packets is incorrectly recorded. So maybe we ARE on the home run after all...

To find out more about the truncated packet problem, I tried a little experiment. Basically I suspect that the length field of the wrong receive buffer is being updated, either all of the time or at least some of the time.  So I ran two separate ping sessions, first with the same packet size as each other, and then again with different sized packets.  If the problem is recording packet lengths in the wrong receive buffers, then we should see a difference in the number of truncated packets.

So, let's try that, first with 900 byte ping packets on both sessions: 2% and 3% packet loss in the two sessions, i.e., about what we see doing a single session.

Now let's do it again with 900 byte ping packets and 100 byte ping packets: 30% and 12% packet loss, and bizarrely, one duplicate.

The packet loss is clearly higher with different sized packets, and is higher for the longer of the two lengths.  Both point to the wrong packet length being used: for the shorter packets, the PING reply packet can still be properly constructed even with too much data; it's only with truncated data that it causes problems.  I can see further confirmation of this when I instrument the test programme on the MEGA65 to display the reported frame length when doing the double-ping with different packet sizes: the wrong length is reported for some frames.

What is mysterious is how the wrong length can get written.  eth_frame_len is used to write the bytes of the ethernet frame as it is received.  This is then the value that gets written into the first 2 bytes of the RX buffer, so that the receiver knows the length of the received frame.  The end of frame is not indicated to the RX buffer rotator until 15 cycles after the last byte gets written, which should be stacks of time (especially since during ethernet RX one byte gets written every 4 cycles, and that seems to work just fine).

Starting to dig around in this: One part of the problem is that resetting the TX finite state machine actually also resets the RX buffer allocations, which causes some problems.  So we really need to fix the TX state machine lockup stuff properly.

Then what I am seeing with two pings running is that it tends to go fine until some other packet is received, which presumably causes a pattern of RX buffer usage that tickles the problem, as the following output from my test programme shows. Here, presumably, two packets of 64 bytes each arrive, causing messed-up packet lengths to be reported:

 346 34 ping!                  ca @a  
 146 43 ping!                  @b ab  
 346 35 ping!                  ac bc  
 146 44 ping!                  b@ c@  
 346 36 ping!                  ca @a  
 64 45 ping!                   @b ab  
 346 37 ping!                  ac bc  
 146 46 ping!                  b@ c@  
 346 38 ping!                  ca @a  
 64 47 ping!                   @b ab  
 146 39 ping!                  ac bc  
 

Each line begins with the size of the packet and the sequence ID for the ping, followed by the buffer allocation information as two letters, shown both before and after processing the packet.  @ means RX buffer 0, a = RX buffer 1, ... c = RX buffer 3.

Thus the first means that a 346 byte ping packet with sequence number 34 was received. Before the packet begins to be processed, the CPU can see RX buffer 3 and the ethernet controller will write into buffer 1 next, as indicated by the ca string. After the packet has been processed, the CPU is now looking at RX buffer 0, and the ethernet controller still at buffer 1, as indicated by the @a string, since no new packets have been received.

Next a 146 byte ping packet with sequence number 43 (from the second ping session) is received, and we see the RX buffers advance one position.

This repeats until we see that apparently we have received a 64 byte ping packet with sequence 45, when it should be 146 bytes long.

So we can see that the buffer lengths really are being mixed up. 

The more I look at it, the more it looks like the length of the received packet simply doesn't get written some of the time, as compared to being written into the wrong buffer.  This is rather odd, since the contents of packets get written without error, so it seems rather unlikely that this logic is faulty.  Even when we reject a frame due to CRC error or it not being addressed to our MAC address, the logic writes these length bytes.

It looks like we only successfully write the length of a frame around half to three-quarters of the time.  So this is quite a frequently occurring problem.  It seems like it should therefore have a pretty obvious cause in the VHDL...

The only thing that looks like it could have any impact is the logic for doing the cross-domain clock transfer of the write requests.  However, the fact that it works reliably for the bytes in each frame makes it seem rather unlikely that this is to blame. 

Nonetheless, I have tweaked the logic for detecting the toggle that triggers a write to the RX buffers, to see if that solves the problem.  The problem still persists, however, with packet lengths not being reliably written.  But it does seem to happen less often.  I'm not sure if I am happy about that or not.  On the one hand, it's nice that it is getting closer, and has helped. But on the other hand, it makes no sense that it shouldn't work for writing the frame length bytes when it works very reliably for writing the bodies of packets, without a single data error.

Okay: I think I might have found it after all that: There was no delay being added between writing the 1st and 2nd bytes of the frame length, when there should have been. That would cause writing the low byte of the frame length to fail sometimes, so it sounds quite plausible as a mechanism.  The tidying up of the latching would also have helped reduce, but not eliminate, the problem, so it is all consistent.  Let's see if it really fixes it...

... and it does! I can now run two ping sessions with different sized packets, and have zero packet loss.
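For the record, the shape of the fix is roughly the following sketch: write the two length bytes one at a time, with a short gap between them so the cross-domain write of the first byte has landed before the second is requested. The names, the 4-cycle gap and the byte order are all illustrative, not the actual MEGA65 values:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Sketch: at end of frame, write the frame length into the first two
-- bytes of the RX buffer as two separate writes, with a delay between
-- them so the second write request cannot clobber the cross-domain
-- transfer of the first.
entity rx_length_write_sketch is
  port (
    clk           : in  std_logic;
    end_of_frame  : in  std_logic;               -- one-cycle pulse
    eth_frame_len : in  unsigned(11 downto 0);   -- bytes received in this frame
    buf_write     : out std_logic;               -- RX buffer write request
    buf_addr      : out unsigned(10 downto 0);
    buf_wdata     : out std_logic_vector(7 downto 0)
  );
end entity;

architecture rtl of rx_length_write_sketch is
  type state_t is (idle, write_low, wait_gap, write_high);
  signal state    : state_t := idle;
  signal gap_cnt  : unsigned(2 downto 0) := (others => '0');
  signal len_hold : unsigned(11 downto 0) := (others => '0');
begin
  process (clk)
  begin
    if rising_edge(clk) then
      buf_write <= '0';
      case state is
        when idle =>
          if end_of_frame = '1' then
            len_hold <= eth_frame_len;
            state    <= write_low;
          end if;
        when write_low =>
          -- Low byte of the length into the first byte of the buffer
          -- (byte order assumed).
          buf_addr  <= to_unsigned(0, 11);
          buf_wdata <= std_logic_vector(len_hold(7 downto 0));
          buf_write <= '1';
          gap_cnt   <= (others => '0');
          state     <= wait_gap;
        when wait_gap =>
          -- Give the first write time to complete before the second.
          if gap_cnt = 3 then
            state <= write_high;
          else
            gap_cnt <= gap_cnt + 1;
          end if;
        when write_high =>
          buf_addr  <= to_unsigned(1, 11);
          buf_wdata <= "0000" & std_logic_vector(len_hold(11 downto 8));
          buf_write <= '1';
          state     <= idle;
      end case;
    end if;
  end process;
end architecture;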

Now the only problem is that the RX buffer management is still being a bit odd, with the last packet received not being available until the following packet arrives.  This only happens after some condition has been satisfied, however, as resetting the ethernet controller clears this behaviour.  Of course it only happened once, and I can't reproduce it, to see what is going on...

Ah, that's got it! It seems to be happening after the RX buffers have all been filled. I've fixed that, but now it looks like the ethernet RX IRQ is a bit sticky, which means that we sometimes see the same packet 2 or 3 times.  But that I can live with for now, since the actual function of the ethernet controller is now basically as it should be, with zero packet loss, even if I run a ping flood at it from Linux with 10,000 packets.  So I think I'll leave it there for today.
