Forum: >>> Magnum BBS <<<

Unstable system

From Grimble@2:250/1 to All on Tue Jun 8 11:28:15 2021

This new PC is showing serious instability. It has just thrown up
hundreds (yes, hundreds) of kernel warnings like this:
------------[ cut here ]------------
Jun 8 10:39:11 haydn kernel: [61432.829401] WARNING: CPU: 2 PID: 2664 at kernel/futex.c:1453 __unqueue_futex+0x2f/0x40
Jun 8 10:39:11 haydn kernel: [61432.829405] Modules linked in: udp_diag
< snip list of modules >
sha1_generic usb_common wmi
Jun 8 10:39:11 haydn kernel: [61432.829517] CPU: 2 PID: 2664 Comm: kwin_x11 Tainted: P W O 5.10.41-desktop-1.mga8 #1
Jun 8 10:39:11 haydn kernel: [61432.829519] Hardware name: System manufacturer System Product Name/PRIME A320M-E, BIOS 5602 07/14/2020
Jun 8 10:39:11 haydn kernel: [61432.829521] RIP: 0010:__unqueue_futex+0x2f/0x40
Jun 8 10:39:11 haydn kernel: [61432.829523] Code: 53 48 8b 5f 30 48 85 db 74 1c 48 8b 57 18 48 8d 47 18 48 39 c2 74 13 48 8d 73 04 e8 bb 53 4b 00 f0 ff 4b fc 5b c3 0f 0b 5b c3 <0f> 0b 5b c3 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 0f 1f 44 00 00
Jun 8 10:39:11 haydn kernel: [61432.829525] RSP: 0018:ffffc1e0c162fc90 EFLAGS: 00010246
Jun 8 10:39:11 haydn kernel: [61432.829527] RAX: ffffc1e0c162fd20 RBX: ffff9e0d40d44844 RCX: ffffc1e0c162fd08
Jun 8 10:39:11 haydn kernel: [61432.829528] RDX: ffffc1e0c162fd20 RSI: 0000000000000064 RDI: ffffc1e0c162fd08
Jun 8 10:39:11 haydn kernel: [61432.829530] RBP: 0000000000000000 R08: ffffc1e0c162fd08 R09: ffff9e0d40d44848
Jun 8 10:39:11 haydn kernel: [61432.829531] R10: ffffc1e0c162fd20 R11: ffffc1e0c162fd10 R12: 0000000000000000
Jun 8 10:39:11 haydn kernel: [61432.829532] R13: 0000000000000000 R14: ffff9e0d40d44844 R15: 00007ffc2b432348
Jun 8 10:39:11 haydn kernel: [61432.829533] FS: 00007fba75aeb840(0000) GS:ffff9e1c1e880000(0000) knlGS:0000000000000000
Jun 8 10:39:11 haydn kernel: [61432.829534] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 8 10:39:11 haydn kernel: [61432.829536] CR2: 00007f7f8e9b8000 CR3: 000000015294e000 CR4: 0000000000350ee0
Jun 8 10:39:11 haydn kernel: [61432.829536] Call Trace:
Jun 8 10:39:11 haydn kernel: [61432.829538] futex_wait+0x118/0x230
Jun 8 10:39:11 haydn kernel: [61432.829541] do_futex+0x16f/0xba0
Jun 8 10:39:11 haydn kernel: [61432.829543] ? do_iter_write+0x17a/0x1b0
Jun 8 10:39:11 haydn kernel: [61432.829544] ? vfs_writev+0xc1/0x140
Jun 8 10:39:11 haydn kernel: [61432.829546] __x64_sys_futex+0x146/0x1c0
Jun 8 10:39:11 haydn kernel: [61432.829548] ? do_writev+0xfb/0x110
Jun 8 10:39:11 haydn kernel: [61432.829550] do_syscall_64+0x33/0x80
Jun 8 10:39:11 haydn kernel: [61432.829553] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jun 8 10:39:11 haydn kernel: [61432.829553] RIP: 0033:0x7fba78dfd86a
Jun 8 10:39:11 haydn kernel: [61432.829555] Code: 24 60 44 89 64 24 68 e8 34 30 00 00 e8 ef 33 00 00 44 89 e6 45 31 d2 31 d2 41 89 c0 40 80 f6 80 4c 89 f7 b8 ca 00 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 d2 00 00 00 44 89 c7 e8 42 34 00 00 31 f6
Jun 8 10:39:11 haydn kernel: [61432.829555] RSP: 002b:00007ffc2b432150 EFLAGS: 00000282 ORIG_RAX: 00000000000000ca
Jun 8 10:39:11 haydn kernel: [61432.829557] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fba78dfd86a
Jun 8 10:39:11 haydn kernel: [61432.829557] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 00007ffc2b432348
Jun 8 10:39:11 haydn kernel: [61432.829558] RBP: 00007ffc2b432320 R08: 0000000000000000 R09: 0000000000000000
Jun 8 10:39:11 haydn kernel: [61432.829559] R10: 0000000000000000 R11: 0000000000000282 R12: 0000000000000000
Jun 8 10:39:11 haydn kernel: [61432.829560] R13: 0000000001f3a3d8 R14: 00007ffc2b432348 R15: 00007ffc2b432180
Jun 8 10:39:11 haydn kernel: [61432.829561] ---[ end trace 1f829f9362f32cdd ]---
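
(A note on reading the trace: the "Tainted: P W O" field is a set of one-letter flags. Going by the kernel's tainted-kernels documentation, P means a proprietary module was loaded, W means the kernel had already issued a warning, and O means an externally-built module was loaded. Below is a minimal Python sketch that decodes such a field. The table covers only a subset of the flags, and the names TAINT_FLAGS and decode_taint are made up for illustration.)

```python
# Meanings of some single-letter taint flags, transcribed from the kernel's
# "Tainted kernels" admin-guide page. This is a subset, not the full list.
TAINT_FLAGS = {
    "P": "proprietary module was loaded",
    "F": "module was force loaded",
    "D": "kernel died recently (an OOPS or BUG)",
    "W": "kernel issued a warning earlier",
    "C": "staging driver was loaded",
    "O": "externally-built (out-of-tree) module was loaded",
    "E": "unsigned module was loaded",
}


def decode_taint(field):
    """Return a description for each flag letter in a 'Tainted:' field."""
    return [
        TAINT_FLAGS.get(ch, f"unlisted flag {ch!r}")
        for ch in field
        if not ch.isspace()
    ]


# The field as it appears in the trace above.
for meaning in decode_taint("P W O"):
    print(meaning)
```

The P and O flags only say that such modules were loaded, not that they caused the warning, but they are probably worth mentioning in any bug report.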

This is a 6-core AMD Ryzen CPU, and all six cores are mentioned in the warnings.
The system has been halting for no obvious reason (no error messages in Syslog, and temperatures are not excessive, currently 40 - 43 degs C). I
have been in conversation with the makers, who are more oriented to
Windows problem solving. They gave me a program called Aida64Extreme to
stress test the CPUs, but it ran for 6 hours without failing, with very
steady temperatures for CPU and MB. But as soon as I run Mageia 8 with 3
cores running BOINC, temperatures go as high as 85 degs.
I've run memtest-86 twice, and it stopped without displaying errors
after about 9 minutes (heat problems?).
What could be the reason for all those futex problems, and how do I
convince the supplier that there is a serious problem?
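
To put a number on "hundreds" and to check which cores are involved, the warnings could be tallied by source location and CPU. This is a minimal Python sketch, assuming each warning header sits on one line in the log file itself (of the form "WARNING: CPU: n PID: n at file:line function+offset"). The names WARNING_RE and tally_warnings are illustrative only.

```python
import re
from collections import Counter

# Matches a kernel warning header such as
# "WARNING: CPU: 2 PID: 2664 at kernel/futex.c:1453 __unqueue_futex+0x2f/0x40".
WARNING_RE = re.compile(
    r"WARNING: CPU: (?P<cpu>\d+) PID: (?P<pid>\d+)\s+"
    r"at (?P<where>\S+:\d+) (?P<func>[^\s+]+)"
)


def tally_warnings(lines):
    """Count kernel warnings per (source location, function) and per CPU."""
    by_location = Counter()
    by_cpu = Counter()
    for line in lines:
        m = WARNING_RE.search(line)
        if m:
            by_location[(m["where"], m["func"])] += 1
            by_cpu[int(m["cpu"])] += 1
    return by_location, by_cpu


# Small demonstration using the warning header from this post.
sample = [
    "Jun 8 10:39:11 haydn kernel: [61432.829401] WARNING: CPU: 2 PID: 2664 "
    "at kernel/futex.c:1453 __unqueue_futex+0x2f/0x40",
    "Jun 8 10:39:11 haydn kernel: [61432.829536] Call Trace:",
]
by_location, by_cpu = tally_warnings(sample)
for (where, func), count in by_location.most_common():
    print(f"{count:4d}  {where}  {func}")
print("CPUs seen:", sorted(by_cpu))
```

For a real log, the lines of a saved journal file (any file name would do) can be passed in place of the sample list. A summary like this may be easier to show a supplier than the raw traces.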

I tried to search Syslog just now, and it bombed with a SEGV error. Aaaaaggh
--
Grimble
Machine 'Haydn' running Plasma 5.20.4 on 5.10.41-desktop-1.mga8 kernel.
Mageia release 8 (Official) for x86_64

--- MBSE BBS v1.0.7.22 (GNU/Linux-x86_64)
* Origin: A noiseless patient Spider (2:250/1@fidonet)

From David W. Hodgins@2:250/1 to All on Tue Jun 8 17:39:54 2021

On Tue, 08 Jun 2021 06:28:15 -0400, Grimble <grimble@nomail.afraid.org> wrote:

> This new PC is showing serious instability. It has just thrown up
> hundreds (yes, hundreds) of kernel warnings like this:

Please open a bug report so it can be referred to people more experienced with debugging kernel problems.

In the bug report, include the output of inxi -MSGxx and attach the output of journalctl --no-hostname -b>journal.txt, with everything before the failure deleted.
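
Trimming journal.txt by hand is tedious when it is large. As a rough sketch only, assuming the failure of interest starts at the first "cut here" line (as in the warning quoted in the original post), a Python function along these lines keeps everything from that line onwards. The function name and the choice of marker are assumptions, not anything the bug tracker requires.

```python
def trim_before_failure(lines, marker="------------[ cut here ]------------"):
    """Return the lines from the first one containing `marker` onwards.

    If the marker never appears, the input is returned unchanged so that
    nothing is silently lost.
    """
    lines = list(lines)
    for index, line in enumerate(lines):
        if marker in line:
            return lines[index:]
    return lines


# Tiny demonstration with made-up log lines.
demo = [
    "Jun 8 10:00:00 kernel: booting",
    "Jun 8 10:39:11 kernel: ------------[ cut here ]------------",
    "Jun 8 10:39:11 kernel: WARNING: ...",
]
print(len(trim_before_failure(demo)))
```

The result could be written back to a new file before attaching it to the report. If the halts are not preceded by a "cut here" line, a different marker would be needed.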

My less than expert experience suggests it may be a gpu (not cpu) heat problem or a problem with the gpu video driver/firmware. I haven't used boinc in the last year or two. IIRC it has an option for whether or not to use the gpu. Try disabling that so it only uses the cpu.

Regards, Dave Hodgins

--
Change dwhodgins@nomail.afraid.org to davidwhodgins@teksavvy.com for
email replies.

From Doug Laidlaw@2:250/1 to All on Sat Jun 12 12:50:12 2021

On 9/6/21 2:39 am, David W. Hodgins wrote:

> On Tue, 08 Jun 2021 06:28:15 -0400, Grimble <grimble@nomail.afraid.org> wrote:
>
>> This new PC is showing serious instability. It has just thrown up
>> hundreds (yes, hundreds) of kernel warnings like this:
>
> Please open a bug report so it can be referred to people more experienced with
> debugging kernel problems.
>
> In the bug report, include the output of inxi -MSGxx and attach the output of
> journalctl --no-hostname -b>journal.txt, with everything before the failure deleted.
>
> My less than expert experience suggests it may be a gpu (not cpu) heat problem
> or a problem with the gpu video driver/firmware. I haven't used boinc in the
> last year or two. IIRC it has an option for whether or not to use the gpu. Try
> disabling that so it only uses the cpu.
>
> Regards, Dave Hodgins

I am still running BOINC, but disabled it due to high CPU usage. I will
watch Grimble's progress with interest. The only other problem I had
was a "WONTFIX" bug. It keeps generating lines in the journal every
second, similar to the following:
[CODE]
Jun 12 21:44:23 dougshost.douglaidlaw.net boinc[2081]: No protocol specified
[/CODE]

This seems to have been around for a while, and nobody at BOINC knows
what to do about it.

--- MBSE BBS v1.0.7.22 (GNU/Linux-x86_64)
* Origin: Aioe.org NNTP Server (2:250/1@fidonet)

From Grimble@2:250/1 to All on Tue Jun 15 12:28:13 2021

On 12/06/2021 12:50, Doug Laidlaw wrote:

> On 9/6/21 2:39 am, David W. Hodgins wrote:
>
>> On Tue, 08 Jun 2021 06:28:15 -0400, Grimble <grimble@nomail.afraid.org> wrote:
>>
>>> This new PC is showing serious instability. It has just thrown up
>>> hundreds (yes, hundreds) of kernel warnings like this:
>>
>> Please open a bug report so it can be referred to people more experienced with
>> debugging kernel problems.
>>
>> In the bug report, include the output of inxi -MSGxx and attach the output of
>> journalctl --no-hostname -b>journal.txt, with everything before the failure deleted.
>>
>> My less than expert experience suggests it may be a gpu (not cpu) heat problem
>> or a problem with the gpu video driver/firmware. I haven't used boinc in the
>> last year or two. IIRC it has an option for whether or not to use the gpu. Try
>> disabling that so it only uses the cpu.
>>
>> Regards, Dave Hodgins
>
> I am still running BOINC, but disabled it due to high CPU usage. I will
> watch Grimble's progress with interest. The only other problem I had
> was a "WONTFIX" bug. It keeps generating lines in the journal every
> second, similar to the following:
> [CODE]
> Jun 12 21:44:23 dougshost.douglaidlaw.net boinc[2081]: No protocol specified
> [/CODE]
>
> This seems to have been around for a while, and nobody at BOINC knows
> what to do about it.

Doug, I've noticed the same problem. Googling around, it seems to be
because the boinc client can't communicate with the boinc manager.

--
Grimble
Registered Linux User #450547
Machine 'Bach' running Plasma 5.20.4 on 5.10.41-desktop-1.mga8 kernel.
Mageia release 8 (Official) for x86_64

From Grimble@2:250/1 to All on Fri Jun 18 13:37:16 2021

On 08/06/2021 11:28, Grimble wrote:
Just noticed a kernel warning involving "iommu":
Jun 18 11:17:13 haydn kernel: [ 1577.941885] WARNING: CPU: 2 PID: 4772 at drivers/iommu/dma-iommu.c:473 __iommu_dma_unmap+0xe8/0x100
Jun 18 11:17:13 haydn kernel: [ 1577.941886] Modules linked in: rpcsec_gss_krb5 nfsv4 nfs fscache md4 ip_set iptable_filter cmac
<snip list of modules>
usbcore crypto_simd cryptd ccp sha1_generic usb_common wmi
Jun 18 11:17:13 haydn kernel: [ 1577.941955] CPU: 2 PID: 4772 Comm: ip Tainted: P O 5.10.41-desktop-1.mga8 #1
Jun 18 11:17:13 haydn kernel: [ 1577.941956] Hardware name: System manufacturer System Product Name/PRIME A320M-E, BIOS 5602 07/14/2020
Jun 18 11:17:13 haydn kernel: [ 1577.941958] RIP: 0010:__iommu_dma_unmap+0xe8/0x100

Googled "Mageia 8 iommu" and this popped up first: https://forums.mageia.org/en/viewtopic.php?f=41&t=6936
which describes errors that are very similar to the ones I have been experiencing. (Reminder: this is a Ryzen 6 core processor) BIOS had
IOMMU = Automatic. Changed to "Enabled", so let's see what happens.
The post and a linked post also suggest adding "iommu=soft" or
"iommu=pt" to the boot menu. I would welcome some informed opinion as to
which to use.
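
Whichever value ends up being tried, it seems worth confirming after the reboot that the option really reached the kernel. A minimal Python sketch of that check follows; it only parses a command-line string, and the example string and the function names are made up. On a running system the real command line can be read from /proc/cmdline.

```python
def kernel_params(cmdline):
    """Split a kernel command line into a {name: value} dict.

    Bare flags such as "quiet" map to None. Quoted values containing
    spaces are not handled in this sketch.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def iommu_setting(cmdline):
    """Return the value given to iommu= on the command line, or None."""
    return kernel_params(cmdline).get("iommu")


# Example with a made-up command line, not taken from this machine.
example = "BOOT_IMAGE=/boot/vmlinuz root=UUID=xxxx ro quiet iommu=soft"
print(iommu_setting(example))  # prints: soft
```

This says nothing about which of the two values is the better choice. It only shows whether the one entered in the boot menu took effect.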
--
Grimble
Machine 'Haydn' running Plasma 5.20.4 on 5.10.41-desktop-1.mga8 kernel.
Mageia release 8 (Official) for x86_64

From David W. Hodgins@2:250/1 to All on Fri Jun 18 18:21:42 2021

On Fri, 18 Jun 2021 08:37:16 -0400, Grimble <grimble@nomail.afraid.org> wrote:

> Googled "Mageia 8 iommu" and this popped up first: https://forums.mageia.org/en/viewtopic.php?f=41&t=6936
> which describes errors that are very similar to the ones I have been experiencing. (Reminder: this is a Ryzen 6 core processor) BIOS had
> IOMMU = Automatic. Changed to "Enabled", so let's see what happens.
> The post and a linked post also suggest adding "iommu=soft" or
> "iommu=pt" to the boot menu. I would welcome some informed opinion as to which to use.

Thanks for the info. The /usr/share/doc/kernel-doc/admin-guide/kernel-parameters.txt
file from the kernel-doc package only lists the various possible settings for iommu,
with no details on what they do, or what they are abbreviations of.

In general, when a device parameter is set to soft, I expect that means the kernel
processes the interrupts etc. in the kernel software rather than relying on the
processor built into the memory management hardware chip. I have no idea what pt
would refer to.

Keep a watch out for any bios/uefi firmware updates from the motherboard manufacturer
that fix the iommu issues.

Regards, Dave Hodgins

--
Change dwhodgins@nomail.afraid.org to davidwhodgins@teksavvy.com for
email replies.

From William Unruh@2:250/1 to All on Fri Jun 18 21:12:08 2021

On 2021-06-18, David W. Hodgins <dwhodgins@nomail.afraid.org> wrote:

> On Fri, 18 Jun 2021 08:37:16 -0400, Grimble <grimble@nomail.afraid.org> wrote:
>
>> Googled "Mageia 8 iommu" and this popped up first:
>> https://forums.mageia.org/en/viewtopic.php?f=41&t=6936
>> which describes errors that are very similar to the ones I have been
>> experiencing. (Reminder: this is a Ryzen 6 core processor) BIOS had
>> IOMMU = Automatic. Changed to "Enabled", so let's see what happens.
>> The post and a linked post also suggest adding "iommu=soft" or
>> "iommu=pt" to the boot menu. I would welcome some informed opinion as to
>> which to use.
>
> Thanks for the info. The /usr/share/doc/kernel-doc/admin-guide/kernel-parameters.txt
> file from the kernel-doc package only lists the various possible settings for iommu,
> with no details on what they do, or what they are abbreviations of.
>
> In general, when a device parameter is set to soft, I expect that means the kernel
> processes the interrupts etc. in the kernel software rather than relying on the
> processor built into the memory management hardware chip. I have no idea what pt
> would refer to.
>
> Keep a watch out for any bios/uefi firmware updates from the motherboard manufacturer
> that fix the iommu issues.
>
> Regards, Dave Hodgins

https://unix.stackexchange.com/questions/592538/what-are-the-implication-of-using-iommu-force-in-the-boot-kernel-options
seems to have some very brief explanations.
I did a Google search for "linux iommu".

From David W. Hodgins@2:250/1 to All on Fri Jun 18 21:28:52 2021

On Fri, 18 Jun 2021 16:12:08 -0400, William Unruh <unruh@invalid.ca> wrote:

> On 2021-06-18, David W. Hodgins <dwhodgins@nomail.afraid.org> wrote:
>
>> On Fri, 18 Jun 2021 08:37:16 -0400, Grimble <grimble@nomail.afraid.org> wrote:
>>
>>> Googled "Mageia 8 iommu" and this popped up first:
>>> https://forums.mageia.org/en/viewtopic.php?f=41&t=6936
>>> which describes errors that are very similar to the ones I have been
>>> experiencing. (Reminder: this is a Ryzen 6 core processor) BIOS had
>>> IOMMU = Automatic. Changed to "Enabled", so let's see what happens.
>>> The post and a linked post also suggest adding "iommu=soft" or
>>> "iommu=pt" to the boot menu. I would welcome some informed opinion as to
>>> which to use.
>>
>> Thanks for the info. The /usr/share/doc/kernel-doc/admin-guide/kernel-parameters.txt
>> file from the kernel-doc package only lists the various possible settings for iommu,
>> with no details on what they do, or what they are abbreviations of.
>>
>> In general, when a device parameter is set to soft, I expect that means the kernel
>> processes the interrupts etc. in the kernel software rather than relying on the
>> processor built into the memory management hardware chip. I have no idea what pt
>> would refer to.
>>
>> Keep a watch out for any bios/uefi firmware updates from the motherboard manufacturer
>> that fix the iommu issues.
>>
>> Regards, Dave Hodgins
>
> https://unix.stackexchange.com/questions/592538/what-are-the-implication-of-using-iommu-force-in-the-boot-kernel-options
> seems to have some very brief explanations.
> I did a Google search for "linux iommu".

Thanks. None of the pages I'd checked (I didn't check all of the ones found) explained
iommu=pt. The links from the stackexchange page confirm that iommu=soft uses software
instead of hardware. Only one of the links from that page explains iommu=pt with ...

From https://community.mellanox.com/s/article/understanding-the-iommu-linux-grub-file-configuration
This post discusses the iommu and intel_iommu Linux grub parameters for SR-IOV pass-through (pt) mode. When working in an SR-IOV environment, we need to make sure that kernel enables SR-IOV and that we get good performance.
To enable SR-IOV in the kernel, configure intel_iommu=on in the grub file. To get the best performance, add iommu=pt (pass-through) to the grub file when using SR-IOV. When in pass-through mode, the adapter does not need to use DMA translation to the memory, and this improves the performance. iommu=pt is needed mainly with hypervisor performance is needed.

So it's mainly of use if you will be running under a hypervisor such as xen or qemu.

Regards, Dave Hodgins

--
Change dwhodgins@nomail.afraid.org to davidwhodgins@teksavvy.com for
email replies.

