• Degradation of TCP connection

    From justin.pearson@gmail.com@21:1/5 to nois...@gmail.com on Mon Dec 19 07:01:34 2016
    Revisiting this 8 years later (Dec 2016), I realize I never posted the Exciting Conclusion. Our "solution" was to switch the network connection from TCP to UDP. We concluded that there was a bug in the VxWorks TCP stack, but we couldn't reproduce the
    problem reliably. Our application didn't need TCP's guarantees (reliable, in-order delivery of the data stream), and it could tolerate a few dropped packets. We switched to UDP and the problem went away. We were under intense schedule pressure, so we
    notified Wind River about our fix and moved on :).
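
One way a receiver can tolerate "a few dropped packets" over UDP is to tag each datagram with a sequence number and count gaps instead of stalling on them. The sketch below is illustrative only: the post does not describe the actual packet format, so the 4-byte sequence-number header and the names `HEADER`, `LossTracker`, and `accept` are assumptions.

```python
import struct

# Assumed (hypothetical) wire format: 4-byte big-endian sequence number, then payload.
HEADER = struct.Struct("!I")


class LossTracker:
    """Counts skipped datagrams and discards late or duplicate ones.

    Sequence-number wraparound is ignored for brevity.
    """

    def __init__(self):
        self.expected = None   # next sequence number we expect to see
        self.dropped = 0       # sequence numbers skipped over so far

    def accept(self, datagram):
        """Return the payload, or None if the datagram is malformed, late, or a duplicate."""
        if len(datagram) < HEADER.size:
            return None
        (seq,) = HEADER.unpack_from(datagram)
        if self.expected is not None and seq < self.expected:
            return None                            # late or duplicate: ignore it
        if self.expected is not None:
            self.dropped += seq - self.expected    # a gap means datagrams went missing
        self.expected = seq + 1
        return datagram[HEADER.size:]
```

In a real receiver, each datagram returned by `recvfrom()` on a `socket.SOCK_DGRAM` socket would be passed to `accept`, and `dropped` could be logged to confirm that losses stay within what the application tolerates.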

    Bill & James: Thanks so much for your help 8 years ago. That was a really tough problem, especially for a new engineer. I was so grateful for your time and attention.

    Best,
    Justin


    On Thursday, August 7, 2008 at 1:56:40 PM UTC-7, nois...@gmail.com wrote:
    On Aug 7, 3:13 am, James Cunnane <james.cunnane+ag...@gmail.com>
    wrote:
    On Tue, 5 Aug 2008 16:07:34 -0700 (PDT), justin.pear...@gmail.com
    wrote:

    Oh, and I just remembered another piece of the puzzle: The VxWorks
    machine is also exchanging data with another box on the network over
    UDP. We have timers in the VxWorks app that make it panic if it stops
    receiving UDP packets. It appears that during each of these anomalies,
    the VxWorks box continues to receive UDP packets just fine. That is,
    it appears as though it stops hearing from the TCP stream, but
    continues to receive UDP packets as normal.

    Perhaps your ARP cache has become corrupt. I had a system which, after
    about 26 days of continuous connection, would respond to ping but not
    to telnet; it turned out that the ARP cache had been corrupted by a
    nanosecond timer overflow. The mechanism of corruption is probably
    not timer-related in your case, but the end result seems similar. Can
    you devise ARP diagnostics that can run periodically on the sending
    device, both before and after the TCP failure?
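
A periodic ARP diagnostic of the kind suggested here could boil down to snapshotting the ARP table and comparing snapshots taken before and after the failure. The sketch below covers only the comparison step and is an assumption about how one might do it, not something from the thread. How a snapshot is captured is platform-specific (on a desktop host, for instance, by parsing the output of `arp -a`), so the snapshots are taken here to be plain `{ip: mac}` dictionaries. The name `diff_arp` and all addresses are hypothetical.

```python
def diff_arp(before, after):
    """Compare two ARP snapshots given as {ip: mac} dictionaries.

    Returns the IPs that were added, removed, or whose MAC address changed.
    """
    added = {ip: after[ip] for ip in after.keys() - before.keys()}
    removed = {ip: before[ip] for ip in before.keys() - after.keys()}
    changed = {
        ip: (before[ip], after[ip])
        for ip in before.keys() & after.keys()
        if before[ip].lower() != after[ip].lower()   # MAC case is not significant
    }
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical snapshots; the addresses are made up for illustration.
before = {"192.168.0.10": "00:11:22:33:44:55", "192.168.0.20": "00:aa:bb:cc:dd:ee"}
after = {"192.168.0.10": "00:11:22:33:44:99", "192.168.0.30": "00:de:ad:be:ef:00"}
report = diff_arp(before, after)
```

A changed MAC for the peer's IP right around the time of the failure would support the corrupted-cache theory; an unchanged entry would argue against it.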

    Hmm... In your case you said the system would respond to ping, but
    not telnet. It's hard to classify that as a problem with the ARP
    cache, _if_ you tried to ping the target from the same host that you
    also tried to telnet to it from. If you can ping target A from host
    B, then ARP resolution between A and B is working (or at least, the
    ARP entries haven't timed out yet). Ping (ICMP over IP) and telnet
    (TCP over IP) both rely on ARP, so if it worked for one, it should
    have worked for the other.

    However, if you tried to ping target A from host B, and that worked,
    but trying to telnet to target A from host C did not work, that could
    be an ARP problem. (The target still had an unexpired ARP entry for
    host B, but was unable to perform ARP resolution for the previously
    unknown host C.)
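
The argument in the two paragraphs above can be captured in a toy model: a reply reaches a peer only if the target already holds an unexpired ARP entry for it or can still resolve new addresses. This is a deliberate simplification for illustration, not how VxWorks or any real stack is implemented, and the class name, method name, and addresses are all hypothetical.

```python
class ToyArpCache:
    """Toy model of a target's ARP cache; models the argument only."""

    def __init__(self, entries, resolution_works):
        self.entries = dict(entries)              # ip -> mac, unexpired entries
        self.resolution_works = resolution_works  # can new ARP lookups succeed?

    def can_reply_to(self, ip):
        """A reply can be sent only if the peer's MAC is cached or resolvable."""
        return ip in self.entries or self.resolution_works


# Hypothetical addresses: host B is already cached, host C is not.
HOST_B, HOST_C = "10.0.0.2", "10.0.0.3"
target = ToyArpCache({HOST_B: "00:00:00:00:00:0b"}, resolution_works=False)

ping_from_b = target.can_reply_to(HOST_B)     # ICMP reply uses the cached entry
telnet_from_b = target.can_reply_to(HOST_B)   # TCP reply uses the same entry
telnet_from_c = target.can_reply_to(HOST_C)   # no entry, and resolution is broken
```

In this model ping and telnet from the same host always succeed or fail together, because the lookup does not depend on the protocol carried above IP. Only a different, uncached host can see a different result.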

    In Justin's case, he said once his app got into its error state, he
    could see the target still sending TCP segments to his Windows host
    using Wireshark (but not responding to ACKs from the Windows host).
    This implies the target's ARP entry for the Windows host was still
    valid (otherwise it would have started sending ARP "who has" requests instead).

    -Bill

    Regards

    James Cunnane
