server (the results (log files) of the autopkgtests are transferred to
the main server). Our ppc64el hosts are also located at Marist, so I
would expect commonality here, but also ppc64el isn't performing great,
so maybe part of the problem is common.
I have munin [1], but as said, I'm not a trained sysadmin. I don't know
what I'm looking for if you ask "statistics on the network".
Hi Phil,
On 13-02-2023 08:57, Philipp Kern wrote:
On 12.02.23 22:38, Paul Gevers wrote:
I have munin [1], but as said, I'm not a trained sysadmin. I don't
know what I'm looking for if you ask "statistics on the network".
This is more of a software development / devops question than a sysadmin question, but alas.
I acknowledge that my outreach was broad and didn't only cover s390x.
What I am interested in is *application-level* logging on reconnects. Presumably the connection to RabbitMQ is outbound?
Our configuration can be seen here: https://salsa.debian.org/ci-team/debian-ci-config/-/blob/master/cookbooks/rabbitmq/templates/rabbitmq.conf.erb
Is it tunneled? Does your application log somewhere when a reconnect happens? Does it say when it successfully connected?
I'd expect good software to log something like this:
[10:00:00] Connecting to broker "rabbitmq.debci.debian.net:12345"...
[10:00:05] Connected to broker "rabbitmq.debci.debian.net:12345".
And also:
[10:00:00] Connecting to broker "rabbitmq.debci.debian.net:12345"...
[10:00:01] Connection to broker "rabbitmq.debci.debian.net:12345" failed: Connection refused
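To make the kind of logging described above concrete, here is a minimal Python sketch. It is not the debci client; the function name `connect_with_logging`, its parameters, and the retry policy are assumptions for illustration only. It opens a plain TCP connection and logs each attempt, each failure, and the eventual success in the format shown above.

```python
import socket
import time
from datetime import datetime


def _stamp():
    # Wall-clock time in the [HH:MM:SS] style used in the example log lines.
    return datetime.now().strftime("%H:%M:%S")


def connect_with_logging(host, port, attempts=3, delay=1.0,
                         connect=socket.create_connection,
                         log=print, sleep=time.sleep):
    """Hypothetical helper: try to connect, logging every attempt and outcome.

    Returns the connection object, or None if all attempts failed.
    """
    target = f'"{host}:{port}"'
    for attempt in range(1, attempts + 1):
        log(f"[{_stamp()}] Connecting to broker {target}...")
        try:
            conn = connect((host, port))
        except OSError as exc:
            # ConnectionRefusedError, timeouts and TLS errors are all OSError subclasses.
            log(f"[{_stamp()}] Connection to broker {target} failed: {exc}")
            if attempt < attempts:
                sleep(delay)
            continue
        log(f"[{_stamp()}] Connected to broker {target}.")
        return conn
    return None
```

With lines like these in the worker's journal, the time between "Connecting" and "Connected" and the number of failed attempts per hour become directly visible.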
@terceiro: I haven't seen this kind of log on the worker hosts. Do you
know whether such logs exist or whether we can generate them?
I think I'm seeing something on the main host.
admin@ci-master:/var/log/rabbitmq$ sudo grep 148.100.88.163 rabbit@ci-master.log | grep -v '\[info\]' | grep -v '\[warning\]'
2023-02-14 00:00:37.522 [error] <0.30951.85> closing AMQP connection <0.30951.85> (148.100.88.163:49540 -> 10.1.14.198:5671):
2023-02-14 02:27:56.050 [error] <0.15184.87> closing AMQP connection <0.15184.87> (148.100.88.163:49988 -> 10.1.14.198:5671):
2023-02-14 02:36:05.496 [error] <0.17479.87> closing AMQP connection <0.17479.87> (148.100.88.163:57098 -> 10.1.14.198:5671):
2023-02-14 04:06:13.869 [error] <0.16105.88> closing AMQP connection <0.16105.88> (148.100.88.163:42984 -> 10.1.14.198:5671):
2023-02-14 04:15:27.696 [error] <0.19038.88> closing AMQP connection <0.19038.88> (148.100.88.163:56650 -> 10.1.14.198:5671):
2023-02-14 20:05:38.702 [error] <0.23586.97> closing AMQP connection <0.23586.97> (148.100.88.163:34278 -> 10.1.14.198:5671):
and a lot more warnings (220 times in 20 hours) as well; like:
2023-02-14 20:05:09.011 [warning] <0.20860.97> closing AMQP connection <0.20860.97> (148.100.88.163:45624 -> 10.1.14.198:5671, vhost: '/', user: 'guest'):
And many occurrences (around 544) of the following (obviously I don't know whether that is limited to, or even includes, the s390x host):
client unexpectedly closed TCP connection
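The grep pipeline above can also be expressed as a small tally per log level, which makes the error and warning counts for one peer easier to compare. This is only a sketch: the function name `count_levels` is hypothetical, and the regular expression assumes the timestamp and `[level]` layout seen in the excerpts above.

```python
import re
from collections import Counter

# Assumed line layout: "2023-02-14 00:00:37.522 [error] <pid> message..."
LEVEL_RE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+ \[(\w+)\]")


def count_levels(lines, peer):
    """Count log lines per level, restricted to lines mentioning `peer`."""
    counts = Counter()
    for line in lines:
        if peer not in line:
            continue
        match = LEVEL_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

It could be fed with `open("rabbit@ci-master.log")` and the worker's address, for example `count_levels(fh, "148.100.88.163")`.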
So there is for sure something wrong with the client-server connection
there. Reworking the client for robustness has been on my TODO list for
a while.
Feb 14 08:39:50 ci-worker-s390x-01 debci[1355790]: bacula testing/s390x tmpfail
Feb 14 08:56:25 ci-worker-s390x-01 debci[1155941]: waiting for header frame: a SSL error occurred
Feb 14 00:45:12 ci-worker-s390x-01 debci[2652291]: libgd2 testing/s390x fail
Feb 14 01:01:48 ci-worker-s390x-01 debci[546227]: waiting for header frame: a SSL error occurred
Feb 14 02:45:30 ci-worker-s390x-01 debci[1209706]: mmdebstrap testing/s390x pass
Feb 14 03:02:05 ci-worker-s390x-01 debci[3642098]: waiting for header frame: a SSL error occurred
Feb 14 04:40:10 ci-worker-s390x-01 debci[12655]: cacti testing/s390x tmpfail
Feb 14 04:56:51 ci-worker-s390x-01 debci[3015158]: waiting for header frame: a SSL error occurred
Feb 17 01:07:17 ci-worker-s390x-01 debci[1149352]: waiting for header frame: a SSL error occurred
Feb 17 01:13:46 ci-worker-s390x-01 debci[552417]: waiting for header frame: a SSL error occurred
Feb 17 01:16:19 ci-worker-s390x-01 debci[1261598]: waiting for header frame: a SSL error occurred
Feb 17 01:21:02 ci-worker-s390x-01 debci[1487252]: waiting for header frame: a SSL error occurred
Feb 17 01:53:30 ci-worker-s390x-01 debci[3589185]: waiting for header frame: a SSL error occurred
Feb 17 02:03:24 ci-worker-s390x-01 debci[4184831]: waiting for header frame: a SSL error occurred
Feb 17 02:18:31 ci-worker-s390x-01 debci[3986861]: waiting for header frame: a SSL error occurred
Feb 17 02:41:11 ci-worker-s390x-01 debci[4167140]: waiting for header frame: a SSL error occurred
Feb 17 05:44:55 ci-worker-s390x-01 debci[1543385]: waiting for header frame: a SSL error occurred
Feb 17 05:47:10 ci-worker-s390x-01 debci[2598734]: waiting for header frame: a SSL error occurred
Feb 17 06:24:39 ci-worker-s390x-01 debci[1275755]: waiting for header frame: a SSL error occurred
Feb 17 06:50:05 ci-worker-s390x-01 debci[3680449]: waiting for header frame: a SSL error occurred
Feb 17 07:33:09 ci-worker-s390x-01 debci[107515]: waiting for header frame: a SSL error occurred
Feb 17 07:48:04 ci-worker-s390x-01 debci[2816244]: waiting for header frame: a SSL error occurred
Feb 17 07:54:07 ci-worker-s390x-01 debci[2284573]: waiting for header frame: a SSL error occurred
Feb 17 12:40:38 ci-worker-s390x-01 debci[4069122]: waiting for header frame: a SSL error occurred
Feb 17 15:39:40 ci-worker-s390x-01 debci[3343838]: waiting for header frame: a SSL error occurred
Feb 17 20:23:33 ci-worker-s390x-01 debci[3531969]: waiting for header frame: a SSL error occurred
Feb 17 21:21:28 ci-worker-s390x-01 debci[1815008]: waiting for header frame: a SSL error occurred
Feb 17 23:28:02 ci-worker-s390x-01 debci[2830093]: waiting for header frame: a SSL error occurred
Feb 18 01:38:13 ci-worker-s390x-01 debci[3999976]: waiting for header frame: a SSL error occurred
Feb 18 04:21:49 ci-worker-s390x-01 debci[1774710]: waiting for header frame: a SSL error occurred
Feb 18 04:21:53 ci-worker-s390x-01 debci[1530267]: waiting for header frame: a SSL error occurred
Feb 18 04:43:09 ci-worker-s390x-01 debci[2484158]: waiting for header frame: a SSL error occurred
Feb 18 04:54:21 ci-worker-s390x-01 debci[3870455]: waiting for header frame: a SSL error occurred
Feb 18 06:46:27 ci-worker-s390x-01 debci[632005]: waiting for header frame: a SSL error occurred
Feb 18 06:52:56 ci-worker-s390x-01 debci[516286]: waiting for header frame: a SSL error occurred
Feb 18 09:41:23 ci-worker-s390x-01 debci[57375]: waiting for header frame: a SSL error occurred
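To see how often the SSL error shows up on the worker, the journal lines above can be grouped by day. The following is a rough sketch; the function name `ssl_errors_per_day` is hypothetical, and it assumes the syslog-style "month day time host" prefix visible in the excerpt.

```python
from collections import Counter

# Exact message text as it appears in the worker log above.
MARKER = "waiting for header frame: a SSL error occurred"


def ssl_errors_per_day(lines):
    """Count lines containing the SSL error, keyed by "Mon DD"."""
    per_day = Counter()
    for line in lines:
        if MARKER not in line:
            continue
        parts = line.split()
        # The first two fields of the syslog-style prefix are month and day.
        per_day[f"{parts[0]} {parts[1]}"] += 1
    return per_day
```

Comparing these daily counts with the munin graphs of processed packages might show whether the errors coincide with the periods in which the host sits idle.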
As you can see e.g. here [1,2], it comes and goes (albeit sometimes the
queue was empty). I don't think it's very different; I just never got
out of the s390x host what I was expecting. For a long time I blamed it
on the "stealing" that happens on a shared host, but I think there's more.
https://ci.debian.net/munin/ci-worker-s390x-01/ci-worker-s390x-01/debci_packages_processed.html
James Addison suggested in [3] to increase a prefetch counter in amqp (although it's the same on all hosts); I have done so on the s390x host and at least initially it seems to help keep the host busier.
Thanks for applying that - I was hoping that the change might also
result in reductions in the debci queue size for s390x, but that
doesn't appear to have happened, going by https://ci.debian.net/munin/debian.net/ci-master.debian.net/debci_queue_size.html
[3] https://salsa.debian.org/ci-team/debci/-/issues/92#note_381306
I can provide logging from the host, but I'll need detailed instructions
of what people find useful to look at. Recently Antonio taught me a
trick to provide temporary access to a lxc container on any of our
hosts, so if it helps to be on the host (but inside lxc) we can provide
for that.