• [tao-users] Stale connections with BiDirGIOP

    From Milan Cvetkovic@21:1/5 to tao-users on Mon Oct 12 11:13:45 2015
    TAO VERSION: 2.2.1
    ACE VERSION: 6.2.1

    HOST MACHINE and OPERATING SYSTEM: Debian wheezy on x86_64

    THE $ACE_ROOT/ace/config.h FILE: config-linux.h

    THE $ACE_ROOT/include/makeinclude/platform_macros.GNU FILE:
    c++11 = 1
    ssl = 1
    include ${ACE_ROOT}/include/makeinclude/platform_linux.GNU

    AREA/CLASS/EXAMPLE AFFECTED:
    BiDirGIOP / Transport_Cache_Manager_T / SSLIOP
    DOES THE PROBLEM AFFECT:
    EXECUTION: YES

    SYNOPSIS: After loss of network connection from a client, server is
    no longer able to invoke callback RPCs, even after client reconnected,
    and resubmitted its callback IOR.

    DESCRIPTION:

    I have a BiDirGIOP setup over SSLIOP. The client is behind a firewall router
    on the 192.168.12.x network. The client incarnates a callback object,
    listening on 192.168.12.113:7770 and on port 7771 for SSL. The client
    contacts the server over the internet and sends it the IOR of the callback
    object above. The server later uses the callback object to send various
    notifications.

    Everything works as desired until the client loses connectivity to the
    server. When the client re-registers, the server adds the new Transport to
    the Transport cache manager; however, in some scenarios it does not remove
    the old transport and keeps using it for callbacks, which then fail with
    CORBA::TIMEOUT.

    My understanding is that Transport_Cache_Manager keeps a hash map of all
    connections. These connections have the same key, being issued from the
    same IP:port every time (in the example above, 192.168.12.113:7771). In
    some cases, the server does not replace the existing transport entry, but
    adds the new one with an increased index, and keeps using index:0 for
    making callbacks.

    I am attaching the portions of TAO logs. Note that second registration
    binds with index :1. The stale transport is kept with index :0.

    How do I control the content of Transport_Cache_Manager_T? I removed the
    references to the callback objects from the server, but the transport is
    still cached.

    Thanks, Milan.

    connection #1:
    (25685|140022328551168) Listening port [7771] on [192.168.12.113]
    TAO (25685|140022328551168) - Transport[32]::purge_entry, entry is 0xab79f0
    TAO (25685|140022328551168) - Cache_IntId_T::Cache_IntId_T, this=0x7f597d274970 Transport[32] is connected
    TAO (25685|140022328551168) - Cache_IntId_T::recycle_state, ENTRY_UNKNOWN->ENTRY_IDLE_AND_PURGABLE Transport[32] IntId=0x7f597d274970
    TAO (25685|140022328551168) - Transport_Cache_Manager_T::bind_i, Transport[32] @ hash:index{-1062713049:0}
    TAO (25685|140022328551168) - Transport_Cache_Manager_T::bind_i: Success Transport[32] @ hash:index{-1062713049:0}. Cache size is [2]

    connection #2 (client reconnects/re-registers after losing connectivity #1):

    TAO (25685|140022328551168) SSLIOP connection from client <68.179.97.149:33251> on [35]
    TAO (25685|140022328551168) SSLIOP connection accepted from server <192.168.108.103:11001> on [35]
    TAO (25685|140022328551168) - Transport::post_open, tport id changed from 12183552 to 35
    TAO (25685|140022328551168) - Transport[35]::post_open, cache_map_entry_ is 0
    TAO (25685|140022328551168) - TAO_LF_CH_Event[35]::state_changed_i, state LFS_CONNECTION_WAIT->LFS_SUCCESS
    TAO (25685|140022328551168) - Cache_IntId_T::Cache_IntId_T, this=0x7f597d275800 Transport[35] is connected
    TAO (25685|140022328551168) - Cache_IntId_T::recycle_state, ENTRY_UNKNOWN->ENTRY_IDLE_AND_PURGABLE Transport[35] IntId=0x7f597d275800
    TAO (25685|140022328551168) - Transport_Cache_Manager_T::bind_i, Transport[35] @ hash:index{1152673115:0}
    TAO (25685|140022328551168) - Transport_Cache_Manager_T::bind_i: Success Transport[35] @ hash:index{1152673115:0}. Cache size is [3]
    ...
    (25685|140022328551168) Listening port [7771] on [192.168.12.113]
    TAO (25685|140022328551168) - Transport[35]::purge_entry, entry is 0xbbe6c0
    TAO (25685|140022328551168) - Cache_IntId_T::Cache_IntId_T, this=0x7f597d274970 Transport[35] is connected
    TAO (25685|140022328551168) - Cache_IntId_T::recycle_state, ENTRY_UNKNOWN->ENTRY_IDLE_AND_PURGABLE Transport[35] IntId=0x7f597d274970
    TAO (25685|140022328551168) - Transport_Cache_Manager_T::bind_i, Transport[35] @ hash:index{-1062713049:0}
    TAO (25685|140022328551168) - Transport_Cache_Manager_T::bind_i, Unable to bind Transport[35] @ hash:index{-1062713048:1}. Trying with a new index
    TAO (25685|140022328551168) - Transport_Cache_Manager_T::bind_i: Success Transport[35] @ hash:index{-1062713048:1}. Cache size is [3]

    callback invocation, uses connection #1:

    TAO (25685|140022328551168) - Transport_Cache_Manager_T::find_i, Found available Transport[32] @hash:index {-1062713049:0}
    (this invocation fails with CORBA::TIMEOUT)
    .....
    =================

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Milan Cvetkovic@21:1/5 to tao-users on Sat Oct 17 21:11:55 2015
    Copy: tao-bugs@list.isis.vanderbilt.edu

    OK, answering my own question to some extent...

    I narrowed the problem to Transport_Cache_Manager_T, and its use of Cache_ExtId::index_.

    First, how I see it working:
    ============================

    Transport_Cache_Manager_T uses ACE_Hash_Map_Manager to keep a mapping
    between Cache_ExtId and Cache_IntId. In the case of IIOP (in my case it
    really was SSLIOP, but I doubt there is a difference there), Cache_ExtId
    represents the IP-ADDR:PORT/index triple for a connection. IP-ADDR:PORT is
    the address that the Transport connects to, and index is used to allow
    multiple connections to the same IP/port address. All three values
    (address, port, index) are used to calculate the hash when the entry is
    stored in ACE_Hash_Map_Manager.

    When a new Transport is created, it is registered with the cache manager,
    which creates an entry using ip:port:index(0). When a transport is needed
    again, Transport_Cache_Manager_T::find_i looks up an existing connection,
    and uses it if it is found and idle.
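
    The keying scheme described above can be modelled roughly as follows. This
    is only a toy sketch using std::unordered_map instead of
    ACE_Hash_Map_Manager; ExtId, ExtIdHash and ToyCache are hypothetical names
    and do not correspond to TAO's actual classes or signatures.

```cpp
// Toy model only: this is NOT TAO source code. ExtId, ExtIdHash and
// ToyCache are hypothetical names chosen for illustration.
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>

// External id: the address:port/index triple used as the map key.
struct ExtId {
    std::string host;
    std::uint16_t port;
    unsigned index;
    bool operator==(const ExtId& o) const {
        return host == o.host && port == o.port && index == o.index;
    }
};

// All three fields feed the hash, so the same endpoint with a different
// index lands under a different key.
struct ExtIdHash {
    std::size_t operator()(const ExtId& k) const {
        std::size_t h = std::hash<std::string>{}(k.host);
        h = h * 31 + k.port;
        h = h * 31 + k.index;
        return h;
    }
};

class ToyCache {
public:
    // Bind a transport id under host:port, probing index 0, 1, 2, ...
    // until an unused key is found. Returns the index that was used.
    unsigned bind(const std::string& host, std::uint16_t port, int transport_id) {
        ExtId key{host, port, 0};
        while (!map_.emplace(key, transport_id).second) {
            ++key.index;  // key already taken: try the next index
        }
        return key.index;
    }

    std::size_t size() const { return map_.size(); }

private:
    std::unordered_map<ExtId, int, ExtIdHash> map_;
};
```

    In this model, binding the same endpoint twice yields index 0 and then
    index 1, which matches the "Trying with a new index" step in the logs
    of the original report.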

    The problem:
    ============
    Transport_Cache_Manager_T::find_i assumes that the indexes of existing
    connections are all consecutive numbers starting with 0. It will try to
    look up a Transport with index=1 *only* if an index=0 entry for the same
    IP:port exists and is busy. If the IP:port:index=0 entry was previously
    purged from the cache, Transport_Cache_Manager_T::find_i will never try
    to use index=1 (or any other index in the cache).

    This scenario is exactly what happens with BiDirGIOP when the client
    disappears from the network and later reconnects (and re-registers its
    callback with the same IP:PORT value):
    - server caches the first callback with IP:port:index=0
    - client reconnects/re-registers
    - server caches the second callback with IP:port:index=1
    - eventually, server cleans up the cache entry with IP:port:index=0
    - but it is never able to access the entry with IP:port:index=1
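
    The sequence above can be reproduced in a small toy model. This is a
    sketch under the assumption that find_i probes indexes from 0 upwards and
    gives up at the first missing one, as described earlier; ToyCache, Key
    and Entry are hypothetical names, not TAO code.

```cpp
// Toy model only, NOT TAO source. All names here are hypothetical.
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <tuple>

using Key = std::tuple<std::string, int, unsigned>;  // host, port, index

struct Entry {
    int transport_id;
    bool idle;
};

class ToyCache {
public:
    // Bind under the first unused index for host:port; returns that index.
    unsigned bind(const std::string& host, int port, int transport_id) {
        unsigned index = 0;
        while (!map_.emplace(Key{host, port, index},
                             Entry{transport_id, true}).second) {
            ++index;
        }
        return index;
    }

    // Remove exactly one entry, leaving any higher indexes where they are.
    void unbind(const std::string& host, int port, unsigned index) {
        map_.erase(Key{host, port, index});
    }

    // Probe index 0, 1, 2, ... and stop at the first index that is missing.
    // Returns the transport id of the first idle entry, or -1 if none.
    int find(const std::string& host, int port) const {
        for (unsigned index = 0;; ++index) {
            auto it = map_.find(Key{host, port, index});
            if (it == map_.end()) return -1;  // gap: give up here
            if (it->second.idle) return it->second.transport_id;
        }
    }

    std::size_t size() const { return map_.size(); }

private:
    std::map<Key, Entry> map_;
};
```

    With two binds for 192.168.12.113:7771, find returns the first (stale)
    transport; once index 0 is unbound, find returns -1 even though the
    index 1 entry is still stored in the map.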

    I am not too sure about the impact on regular TAO clients, since I didn't
    try it, but I would assume that:
    - if the index=0 entry is busy, a second transport is created
    - if the index=0 entry's transport is closed, the index=0 entry is
      purged from the cache, and the index=1 entry is no longer reachable
      until an index=0 entry for the same IP:PORT is created.

    Potential solutions:
    ====================
    - I could fix Transport_Cache_Manager_T::unbind_i so that it makes sure
      the assumption made in find_i holds: if the cache has M entries for an
      IP:port, then after removing the entry at index=N (where N is in
      [0,M)), all remaining entries for the same IP:port should have
      consecutive indexes in the range [0,M-1).
    - Alternatively, Transport_Cache_Manager_T can be rewritten to actually
      use a multi-hashmap. The existing implementation with a hash map and
      indexes seems inappropriate and sub-optimal. Or perhaps there is a
      good reason not to use a multi-hashmap that I am not aware of...
      It seems that this would touch more files in TAO though.
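
    To make the two options concrete, here is a rough sketch of both in a toy
    model. It is not a TAO patch: CompactingCache and MultiCache are
    hypothetical names, busy/idle state is left out, and moving the
    highest-index entry into the hole is just one possible way to keep the
    indexes consecutive.

```cpp
// Toy sketch, NOT a TAO patch. All names here are hypothetical.
#include <cassert>
#include <map>
#include <string>
#include <tuple>
#include <unordered_map>

using Key = std::tuple<std::string, int, unsigned>;  // host, port, index

// Option 1: unbind keeps the indexes for a host:port consecutive.
class CompactingCache {
public:
    unsigned bind(const std::string& host, int port, int transport_id) {
        unsigned index = 0;
        while (!map_.emplace(Key{host, port, index}, transport_id).second) ++index;
        return index;
    }

    // Remove host:port:index, then move the highest-index entry for the
    // same host:port into the hole so no gap is left behind.
    void unbind(const std::string& host, int port, unsigned index) {
        if (map_.erase(Key{host, port, index}) == 0) return;
        unsigned last = index;
        while (map_.count(Key{host, port, last + 1}) != 0) ++last;
        if (last == index) return;  // removed entry had the highest index
        const int moved = map_.at(Key{host, port, last});
        map_.erase(Key{host, port, last});
        map_.emplace(Key{host, port, index}, moved);
    }

    // Simplified lookup: only index 0 is probed; returns -1 if absent.
    int find(const std::string& host, int port) const {
        auto it = map_.find(Key{host, port, 0});
        return it == map_.end() ? -1 : it->second;
    }

private:
    std::map<Key, int> map_;
};

// Option 2: a multimap keyed by host:port only, with no index at all.
class MultiCache {
public:
    void bind(const std::string& host, int port, int transport_id) {
        map_.emplace(endpoint(host, port), transport_id);
    }

    // Erase the one entry for host:port that holds transport_id.
    void unbind(const std::string& host, int port, int transport_id) {
        auto range = map_.equal_range(endpoint(host, port));
        for (auto it = range.first; it != range.second; ++it) {
            if (it->second == transport_id) { map_.erase(it); return; }
        }
    }

    // Returns any transport for host:port, or -1 if there is none.
    int find(const std::string& host, int port) const {
        auto it = map_.find(endpoint(host, port));
        return it == map_.end() ? -1 : it->second;
    }

private:
    static std::string endpoint(const std::string& host, int port) {
        return host + ":" + std::to_string(port);
    }
    std::unordered_multimap<std::string, int> map_;
};
```

    In both variants, removing the stale first transport leaves the second
    one reachable through find, which is the behaviour the existing cache
    lacks in the reconnect scenario.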

    I would like to contribute this patch. I would appreciate it if someone
    could advise me on which direction I should take.

    Thanks, Milan.
