• Concurrency issues with FILE ccache

    From Osipov, Michael (LDA IT PLM)@21:1/5 to All on Tue Apr 6 17:48:37 2021
    Hi,

    we do experience some weird concurrency issues with FILE: based
    credential caches.
    One Python application uses tens (mostly 16 to 24) of concurrent threads
    to access resources via py-requests and py-requests-gssapi on top of
    Debian 10 with MIT Kerberos 1.17 (GitLab Runner) and FreeBSD 12-STABLE
    with MIT Kerberos 1.19.1 (my dev box). GSS context is maintained per thread/request rather than using Request's Session object.
    The initiator is a service keytab from KRB5_CLIENT_KTNAME, for testing
    purposes I do use kinit with my personal AD account, but the outcome is
    the same. On both platforms I see output from klist like this:
    Ticketzwischenspeicher: FILE:/tmp/krb5cc_722
    Standard-Principal: osipovmi@EXAMPLE.COM

    Valid starting Expires Service principal
    06.04.2021 17:18:38 07.04.2021 03:18:38 krbtgt/EXAMPLE.COM@EXAMPLE.COM
    erneuern bis 07.04.2021 17:18:35
    06.04.2021 17:18:42 07.04.2021 03:18:38 HTTP/hostname.EXAMPLE.COM@EXAMPLE.COM
    06.04.2021 17:18:42 07.04.2021 03:18:38 HTTP/hostname.EXAMPLE.COM@EXAMPLE.COM
    06.04.2021 17:18:42 07.04.2021 03:18:38 HTTP/hostname.EXAMPLE.COM@EXAMPLE.COM
    06.04.2021 17:18:42 07.04.2021 03:18:38 HTTP/hostname.EXAMPLE.COM@EXAMPLE.COM
    06.04.2021 17:18:42 07.04.2021 03:18:38 HTTP/hostname.EXAMPLE.COM@EXAMPLE.COM

    On Debian even output like:

    gssapi.raw.misc.GSSError: Major (851968): Unspecified GSS failure. Minor code may provide more information, Minor (100001): Failed to store credentials: Internal credentials cache error (filename: /tmp/krb5cc_1000)

    Which leads me to the fact that TGT and service ticket acquisition or
    the read/write to the FILE cache is not R/W locked.
    For testing purposes I have modified py-requests-gssapi in such a way
    that threading.Lock() is used around the first SecurityContext step call because here is the cache set up and I do see in klist output only *one* service ticket and the rest goes on smoothly.
    I have also considered to use
    gssapi.raw.acquire_cred_from(store={b"ccache":b"FILE:/tmp/<app>_<threadname>", b"client_keytab":b"/path/to/service.keytab"}, usage='initiate')
    but I'd like to avoid adding even more code to fix a symptom not the cause.

    I have also compared KRB5_TRACE output from the original and patched
    version of py-requests-gssapi and one can clearly see that in the former
    -- due to race conditions .. every thread tries to retrieve a TGT and a
    service ticket while the latter (patched) does it only once.

    What is the general advise here? Is any of the caches threadsafe because
    none is documented either way?

    Regards,

    Michael

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Greg Hudson@21:1/5 to Osipov, Michael (LDA IT PLM) on Tue Apr 6 13:28:39 2021
    To: kerberos@mit.edu

    On 4/6/21 11:48 AM, Osipov, Michael (LDA IT PLM) wrote:
    gssapi.raw.misc.GSSError: Major (851968): Unspecified GSS failure. Minor code may provide more information, Minor (100001): Failed to store credentials: Internal credentials cache error (filename: /tmp/krb5cc_1000)

    This is not expected, and bears investigation. It suggests an EINVAL,
    EEXIST, EFAULT, EBADF, or EWOULDBLOCK error from one of the I/O
    operations performed by fcc_store(), none of which are expected. If
    you're building libkrb5, you could try modifying interpret_error() to
    pass those error codes through in order to find out which one is happening.

    Getting multiple cache entries for a service is normal when multiple
    threads or processes initiate contexts to the same (new) service within
    a short window.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Osipov, Michael (LDA IT PLM)@21:1/5 to Greg Hudson on Tue Apr 6 20:35:22 2021
    To: kerberos@mit.edu

    Am 2021-04-06 um 19:28 schrieb Greg Hudson:
    On 4/6/21 11:48 AM, Osipov, Michael (LDA IT PLM) wrote:
    gssapi.raw.misc.GSSError: Major (851968): Unspecified GSS failure. Minor code may provide more information, Minor (100001): Failed to store credentials: Internal credentials cache error (filename: /tmp/krb5cc_1000)

    This is not expected, and bears investigation. It suggests an EINVAL, EEXIST, EFAULT, EBADF, or EWOULDBLOCK error from one of the I/O
    operations performed by fcc_store(), none of which are expected. If
    you're building libkrb5, you could try modifying interpret_error() to
    pass those error codes through in order to find out which one is happening.

    Getting multiple cache entries for a service is normal when multiple
    threads or processes initiate contexts to the same (new) service within
    a short window.

    Note that this is only on MIT Kerberos 1.17 on Debian. I will first try
    to compile 1.19.1 and test that. Let me get back to you in a couple of days. Would it be sufficient to printf()
    ret = interpret_errno(context, errno);
    reat and errno to std stream?

    Using acquire_cred_from() seems to work on Debian, but I also see a
    little overhead (< 5 %).

    Do you know of the top of your head whether there have been any
    concurrency fixes in this regard after 1.17?

    M

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Osipov, Michael (LDA IT PLM)@21:1/5 to Greg Hudson on Fri Apr 9 17:35:26 2021
    To: kerberos@mit.edu

    Am 2021-04-06 um 19:28 schrieb Greg Hudson:
    On 4/6/21 11:48 AM, Osipov, Michael (LDA IT PLM) wrote:
    gssapi.raw.misc.GSSError: Major (851968): Unspecified GSS failure. Minor code may provide more information, Minor (100001): Failed to store credentials: Internal credentials cache error (filename: /tmp/krb5cc_1000)

    This is not expected, and bears investigation. It suggests an EINVAL, EEXIST, EFAULT, EBADF, or EWOULDBLOCK error from one of the I/O
    operations performed by fcc_store(), none of which are expected. If
    you're building libkrb5, you could try modifying interpret_error() to
    pass those error codes through in order to find out which one is happening.

    Getting multiple cache entries for a service is normal when multiple
    threads or processes initiate contexts to the same (new) service within
    a short window.


    Hi Greg,

    so I was able to properly compile and install 1.19.1 in the GitLab
    Runner and verified that py-gssapi picks it up from LD_LIBRARY_PATH. Unfortunately, 1.19.1 still suffers from the same problem as 1.17. I
    tried to narrow it down with strace, but that changes the runtime
    behavior of the application and the error disappears. I did patch the fcc_store() funtion:
    $ git diff
    diff --git a/src/lib/krb5/ccache/cc_file.c b/src/lib/krb5/ccache/cc_file.c index 9a9b45a6e..7f604c0f4 100644
    --- a/src/lib/krb5/ccache/cc_file.c
    +++ b/src/lib/krb5/ccache/cc_file.c
    @@ -1000,8 +1000,9 @@ fcc_store(krb5_context context, krb5_ccache id, krb5_creds *creds)
    if (ret)
    goto cleanup;
    nwritten = write(fileno(fp), buf.data, buf.len);
    - if (nwritten == -1)
    + if (nwritten == -1) {
    ret = interpret_errno(context, errno);
    + printf("errno: %d, ret: %d\n", errno, ret); }
    if ((size_t)nwritten != buf.len)
    ret = KRB5_CC_IO;

    but the output did not appear. Then I patched the interpret_errno()
    dirctly for the internal error:
    @@ -1293,6 +1294,7 @@ interpret_errno(krb5_context context, int errnum)
    case EWOULDBLOCK:
    #endif
    ret = KRB5_FCC_INTERNAL;
    + printf("errnum: %d, ret: %d\n", errnum, ret);
    break;
    /*
    * The rest all map to KRB5_CC_IO. These errnos are listed to
    I had exactly one faiure in the job and received exactly this:
    errnum: 17, ret: -1765328188
    which maps to EEXIST

    I am quite sure that this is a race condition where stat() is performed,
    file does not exist, open() with write is performed, in parallel it is
    already created and the later call returns in EEXIST.
    I assumed it to be fcc_initialize() and added a printf():
    fcc_initialize()
    errnum: 17, ret: -1765328188
    fcc_initialize()
    errnum: 17, ret: -1765328188

    What now?

    Michael

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Greg Hudson@21:1/5 to Osipov, Michael (LDA IT PLM) on Fri Apr 9 14:24:04 2021
    To: kerberos@mit.edu

    On 4/9/21 11:35 AM, Osipov, Michael (LDA IT PLM) wrote:
    I am quite sure that this is a race condition where stat() is performed,
    file does not exist, open() with write is performed, in parallel it is already created and the later call returns in EEXIST.

    I agree, except I think it's just unlink() and open(O_CREAT|O_EXCL)
    calls with no stat(). I had erroneously assumed that the unexpected
    error was happening inside fcc_store() because of "Failed to store
    credentials" in the message, but that string turns out to be from
    get_in_tkt.c in a block of code that also calls krb5_cc_initialize().

    The fcc_initialize() EEXIST self-race has existed since 1.0. I'd
    speculate that the original developers' assumption was that lots of
    processes might be competing to use a file ccache, but that creating
    ccaches would be a rare and one-at-a-time affair (happening at login or
    when a user runs "kinit"). With client keytab support, that is no
    longer the case; it's easy to have multiple threads or processes
    competing to create or refresh a cache as part of gss_acquire_cred() or gss_init_sec_context().

    Just fixing the fcc_initialize() race wouldn't really solve the problem;
    there would still be a window between krb5_cc_initialize() and krb5_cc_store_cred() where other threads (or processes) would see an initialized cache with no TGT in it, and would fail the
    gss_init_sec_context() call. This ticket describes that problem and
    some possible solutions:

    https://krbdev.mit.edu/rt/Ticket/Display.html?id=7707

    Heimdal has implemented option 5. I'm not wild about it and it won't
    work with other ccache types, but it's a working stopgap and it can
    always be backed out in favor of a different solution later.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Osipov, Michael (LDA IT PLM)@21:1/5 to Greg Hudson on Fri Apr 16 13:39:03 2021
    To: kerberos@mit.edu

    Am 2021-04-09 um 20:24 schrieb Greg Hudson:
    On 4/9/21 11:35 AM, Osipov, Michael (LDA IT PLM) wrote:
    I am quite sure that this is a race condition where stat() is performed,
    file does not exist, open() with write is performed, in parallel it is
    already created and the later call returns in EEXIST.

    I agree, except I think it's just unlink() and open(O_CREAT|O_EXCL)
    calls with no stat(). I had erroneously assumed that the unexpected
    error was happening inside fcc_store() because of "Failed to store credentials" in the message, but that string turns out to be from get_in_tkt.c in a block of code that also calls krb5_cc_initialize().

    The fcc_initialize() EEXIST self-race has existed since 1.0. I'd
    speculate that the original developers' assumption was that lots of
    processes might be competing to use a file ccache, but that creating
    ccaches would be a rare and one-at-a-time affair (happening at login or
    when a user runs "kinit"). With client keytab support, that is no
    longer the case; it's easy to have multiple threads or processes
    competing to create or refresh a cache as part of gss_acquire_cred() or gss_init_sec_context().

    Just fixing the fcc_initialize() race wouldn't really solve the problem; there would still be a window between krb5_cc_initialize() and krb5_cc_store_cred() where other threads (or processes) would see an initialized cache with no TGT in it, and would fail the gss_init_sec_context() call.

    Re-reading the code and your analysis, I agree that it won't work w/o
    external synchronization.

    This ticket describes that problem and
    some possible solutions:

    https://krbdev.mit.edu/rt/Ticket/Display.html?id=7707

    Heimdal has implemented option 5. I'm not wild about it and it won't
    work with other ccache types, but it's a working stopgap and it can
    always be backed out in favor of a different solution later.

    While I don't understand all of them, option 2 seems to be the most
    obvious (idiotproof) solution for the FILE cache, isn't it? I can't tell
    for the ccache formats.

    So for now, the only workaorounds I see are:
    1. Initiate the cache in the main thread and then spawn worker threads.
    For long running apps (10 h+) refresh cache although there is no 'kinit
    -R' in GSS-API.
    2. Use a per-thread cache to avoid race conditions:
    spnego = gssapi.OID.from_int_seq("1.3.6.1.5.5.2")
    if keytab_location:
    store = {}
    store[b"client_keytab"] = keytab_location.encode(sys.getdefaultencoding())
    store[b"ccache"] = ("/tmp/krb5cc_%d_%s" % (os.getpid(), threading.get_ident())).encode(sys.getdefaultencoding())
    creds = (gssapi.raw.acquire_cred_from(store=store, mechs=[spnego], usage="initiate")).creds

    It'd be nice if this limitation would be documented here: https://web.mit.edu/kerberos/krb5-1.19/doc/basic/ccache_def.html

    Could have spared me quite some time.

    Regards,

    Michael

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)