Forum: >>> Magnum BBS <<<

Concurrency issues with FILE ccache

From Osipov, Michael (LDA IT PLM)@21:1/5 to All on Tue Apr 6 17:48:37 2021

Hi,

we do experience some weird concurrency issues with FILE: based
credential caches.
One Python application uses tens (mostly 16 to 24) of concurrent threads
to access resources via py-requests and py-requests-gssapi on top of
Debian 10 with MIT Kerberos 1.17 (GitLab Runner) and FreeBSD 12-STABLE
with MIT Kerberos 1.19.1 (my dev box). GSS context is maintained per thread/request rather than using Request's Session object.
The initiator is a service keytab from KRB5_CLIENT_KTNAME, for testing
purposes I do use kinit with my personal AD account, but the outcome is
the same. On both platforms I see output from klist like this:

Ticketzwischenspeicher: FILE:/tmp/krb5cc_722
Standard-Principal: osipovmi@EXAMPLE.COM

Valid starting Expires Service principal
06.04.2021 17:18:38 07.04.2021 03:18:38 krbtgt/EXAMPLE.COM@EXAMPLE.COM
erneuern bis 07.04.2021 17:18:35
06.04.2021 17:18:42 07.04.2021 03:18:38 HTTP/hostname.EXAMPLE.COM@EXAMPLE.COM
06.04.2021 17:18:42 07.04.2021 03:18:38 HTTP/hostname.EXAMPLE.COM@EXAMPLE.COM
06.04.2021 17:18:42 07.04.2021 03:18:38 HTTP/hostname.EXAMPLE.COM@EXAMPLE.COM
06.04.2021 17:18:42 07.04.2021 03:18:38 HTTP/hostname.EXAMPLE.COM@EXAMPLE.COM
06.04.2021 17:18:42 07.04.2021 03:18:38 HTTP/hostname.EXAMPLE.COM@EXAMPLE.COM

On Debian even output like:

gssapi.raw.misc.GSSError: Major (851968): Unspecified GSS failure. Minor code may provide more information, Minor (100001): Failed to store credentials: Internal credentials cache error (filename: /tmp/krb5cc_1000)

Which leads me to the fact that TGT and service ticket acquisition or
the read/write to the FILE cache is not R/W locked.
For testing purposes I have modified py-requests-gssapi in such a way
that threading.Lock() is used around the first SecurityContext step call because here is the cache set up and I do see in klist output only *one* service ticket and the rest goes on smoothly.
I have also considered to use

gssapi.raw.acquire_cred_from(store={b"ccache":b"FILE:/tmp/<app>_<threadname>", b"client_keytab":b"/path/to/service.keytab"}, usage='initiate')

but I'd like to avoid adding even more code to fix a symptom not the cause.

I have also compared KRB5_TRACE output from the original and patched
version of py-requests-gssapi and one can clearly see that in the former
-- due to race conditions .. every thread tries to retrieve a TGT and a
service ticket while the latter (patched) does it only once.

What is the general advise here? Is any of the caches threadsafe because
none is documented either way?

Regards,

Michael

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Greg Hudson@21:1/5 to Osipov, Michael (LDA IT PLM) on Tue Apr 6 13:28:39 2021

To: kerberos@mit.edu

On 4/6/21 11:48 AM, Osipov, Michael (LDA IT PLM) wrote:

gssapi.raw.misc.GSSError: Major (851968): Unspecified GSS failure. Minor code may provide more information, Minor (100001): Failed to store credentials: Internal credentials cache error (filename: /tmp/krb5cc_1000)

This is not expected, and bears investigation. It suggests an EINVAL,
EEXIST, EFAULT, EBADF, or EWOULDBLOCK error from one of the I/O
operations performed by fcc_store(), none of which are expected. If
you're building libkrb5, you could try modifying interpret_error() to
pass those error codes through in order to find out which one is happening.

Getting multiple cache entries for a service is normal when multiple
threads or processes initiate contexts to the same (new) service within
a short window.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Osipov, Michael (LDA IT PLM)@21:1/5 to Greg Hudson on Tue Apr 6 20:35:22 2021

To: kerberos@mit.edu

Am 2021-04-06 um 19:28 schrieb Greg Hudson:

On 4/6/21 11:48 AM, Osipov, Michael (LDA IT PLM) wrote:

gssapi.raw.misc.GSSError: Major (851968): Unspecified GSS failure. Minor code may provide more information, Minor (100001): Failed to store credentials: Internal credentials cache error (filename: /tmp/krb5cc_1000)

This is not expected, and bears investigation. It suggests an EINVAL, EEXIST, EFAULT, EBADF, or EWOULDBLOCK error from one of the I/O
operations performed by fcc_store(), none of which are expected. If
you're building libkrb5, you could try modifying interpret_error() to
pass those error codes through in order to find out which one is happening.

Getting multiple cache entries for a service is normal when multiple
threads or processes initiate contexts to the same (new) service within
a short window.

Note that this is only on MIT Kerberos 1.17 on Debian. I will first try
to compile 1.19.1 and test that. Let me get back to you in a couple of days. Would it be sufficient to printf()

ret = interpret_errno(context, errno);

reat and errno to std stream?

Using acquire_cred_from() seems to work on Debian, but I also see a
little overhead (< 5 %).

Do you know of the top of your head whether there have been any
concurrency fixes in this regard after 1.17?

M

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Osipov, Michael (LDA IT PLM)@21:1/5 to Greg Hudson on Fri Apr 9 17:35:26 2021

To: kerberos@mit.edu

Am 2021-04-06 um 19:28 schrieb Greg Hudson:

On 4/6/21 11:48 AM, Osipov, Michael (LDA IT PLM) wrote:

gssapi.raw.misc.GSSError: Major (851968): Unspecified GSS failure. Minor code may provide more information, Minor (100001): Failed to store credentials: Internal credentials cache error (filename: /tmp/krb5cc_1000)

This is not expected, and bears investigation. It suggests an EINVAL, EEXIST, EFAULT, EBADF, or EWOULDBLOCK error from one of the I/O
operations performed by fcc_store(), none of which are expected. If
you're building libkrb5, you could try modifying interpret_error() to
pass those error codes through in order to find out which one is happening.

Getting multiple cache entries for a service is normal when multiple
threads or processes initiate contexts to the same (new) service within
a short window.

Hi Greg,

so I was able to properly compile and install 1.19.1 in the GitLab
Runner and verified that py-gssapi picks it up from LD_LIBRARY_PATH. Unfortunately, 1.19.1 still suffers from the same problem as 1.17. I
tried to narrow it down with strace, but that changes the runtime
behavior of the application and the error disappears. I did patch the fcc_store() funtion:

$ git diff
diff --git a/src/lib/krb5/ccache/cc_file.c b/src/lib/krb5/ccache/cc_file.c index 9a9b45a6e..7f604c0f4 100644
--- a/src/lib/krb5/ccache/cc_file.c
+++ b/src/lib/krb5/ccache/cc_file.c
@@ -1000,8 +1000,9 @@ fcc_store(krb5_context context, krb5_ccache id, krb5_creds *creds)
if (ret)
goto cleanup;
nwritten = write(fileno(fp), buf.data, buf.len);
- if (nwritten == -1)
+ if (nwritten == -1) {
ret = interpret_errno(context, errno);
+ printf("errno: %d, ret: %d\n", errno, ret); }
if ((size_t)nwritten != buf.len)
ret = KRB5_CC_IO;

but the output did not appear. Then I patched the interpret_errno()
dirctly for the internal error:

@@ -1293,6 +1294,7 @@ interpret_errno(krb5_context context, int errnum)
case EWOULDBLOCK:
#endif
ret = KRB5_FCC_INTERNAL;
+ printf("errnum: %d, ret: %d\n", errnum, ret);
break;
/*
* The rest all map to KRB5_CC_IO. These errnos are listed to

I had exactly one faiure in the job and received exactly this:

errnum: 17, ret: -1765328188

which maps to EEXIST

I am quite sure that this is a race condition where stat() is performed,
file does not exist, open() with write is performed, in parallel it is
already created and the later call returns in EEXIST.
I assumed it to be fcc_initialize() and added a printf():

fcc_initialize()
errnum: 17, ret: -1765328188
fcc_initialize()
errnum: 17, ret: -1765328188

What now?

Michael

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Greg Hudson@21:1/5 to Osipov, Michael (LDA IT PLM) on Fri Apr 9 14:24:04 2021

To: kerberos@mit.edu

On 4/9/21 11:35 AM, Osipov, Michael (LDA IT PLM) wrote:

I am quite sure that this is a race condition where stat() is performed,
file does not exist, open() with write is performed, in parallel it is already created and the later call returns in EEXIST.

I agree, except I think it's just unlink() and open(O_CREAT|O_EXCL)
calls with no stat(). I had erroneously assumed that the unexpected
error was happening inside fcc_store() because of "Failed to store
credentials" in the message, but that string turns out to be from
get_in_tkt.c in a block of code that also calls krb5_cc_initialize().

The fcc_initialize() EEXIST self-race has existed since 1.0. I'd
speculate that the original developers' assumption was that lots of
processes might be competing to use a file ccache, but that creating
ccaches would be a rare and one-at-a-time affair (happening at login or
when a user runs "kinit"). With client keytab support, that is no
longer the case; it's easy to have multiple threads or processes
competing to create or refresh a cache as part of gss_acquire_cred() or gss_init_sec_context().

Just fixing the fcc_initialize() race wouldn't really solve the problem;
there would still be a window between krb5_cc_initialize() and krb5_cc_store_cred() where other threads (or processes) would see an initialized cache with no TGT in it, and would fail the
gss_init_sec_context() call. This ticket describes that problem and
some possible solutions:

https://krbdev.mit.edu/rt/Ticket/Display.html?id=7707

Heimdal has implemented option 5. I'm not wild about it and it won't
work with other ccache types, but it's a working stopgap and it can
always be backed out in favor of a different solution later.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Osipov, Michael (LDA IT PLM)@21:1/5 to Greg Hudson on Fri Apr 16 13:39:03 2021

To: kerberos@mit.edu

Am 2021-04-09 um 20:24 schrieb Greg Hudson:

On 4/9/21 11:35 AM, Osipov, Michael (LDA IT PLM) wrote:

I am quite sure that this is a race condition where stat() is performed,
file does not exist, open() with write is performed, in parallel it is
already created and the later call returns in EEXIST.

I agree, except I think it's just unlink() and open(O_CREAT|O_EXCL)
calls with no stat(). I had erroneously assumed that the unexpected
error was happening inside fcc_store() because of "Failed to store credentials" in the message, but that string turns out to be from get_in_tkt.c in a block of code that also calls krb5_cc_initialize().

The fcc_initialize() EEXIST self-race has existed since 1.0. I'd
speculate that the original developers' assumption was that lots of
processes might be competing to use a file ccache, but that creating
ccaches would be a rare and one-at-a-time affair (happening at login or
when a user runs "kinit"). With client keytab support, that is no
longer the case; it's easy to have multiple threads or processes
competing to create or refresh a cache as part of gss_acquire_cred() or gss_init_sec_context().

Just fixing the fcc_initialize() race wouldn't really solve the problem; there would still be a window between krb5_cc_initialize() and krb5_cc_store_cred() where other threads (or processes) would see an initialized cache with no TGT in it, and would fail the gss_init_sec_context() call.

Re-reading the code and your analysis, I agree that it won't work w/o
external synchronization.

This ticket describes that problem and
some possible solutions:

https://krbdev.mit.edu/rt/Ticket/Display.html?id=7707

Heimdal has implemented option 5. I'm not wild about it and it won't
work with other ccache types, but it's a working stopgap and it can
always be backed out in favor of a different solution later.

While I don't understand all of them, option 2 seems to be the most
obvious (idiotproof) solution for the FILE cache, isn't it? I can't tell
for the ccache formats.

So for now, the only workaorounds I see are:
1. Initiate the cache in the main thread and then spawn worker threads.
For long running apps (10 h+) refresh cache although there is no 'kinit
-R' in GSS-API.
2. Use a per-thread cache to avoid race conditions:

spnego = gssapi.OID.from_int_seq("1.3.6.1.5.5.2")
if keytab_location:
store = {}
store[b"client_keytab"] = keytab_location.encode(sys.getdefaultencoding())
store[b"ccache"] = ("/tmp/krb5cc_%d_%s" % (os.getpid(), threading.get_ident())).encode(sys.getdefaultencoding())
creds = (gssapi.raw.acquire_cred_from(store=store, mechs=[spnego], usage="initiate")).creds

It'd be nice if this limitation would be documented here: https://web.mit.edu/kerberos/krb5-1.19/doc/basic/ccache_def.html

Could have spared me quite some time.

Regards,

Michael

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

Who's Online

System Info

Sysop:	Keyop
Location:	Huddersfield, West Yorkshire, UK
Users:	293
Nodes:	16 (2 / 14)
Uptime:	237:24:07
Calls:	6,624
Files:	12,172
Messages:	5,319,873

Concurrency issues with FILE ccache

Who's Online

System Info