• Unstable system

    From Grimble@2:250/1 to All on Tue Jun 8 11:28:15 2021
    This new PC is showing serious instability. It has just thrown up
    hundreds (yes, hundreds) of kernel warnings like this:
    ------------[ cut here ]------------
    Jun 8 10:39:11 haydn kernel: [61432.829401] WARNING: CPU: 2 PID: 2664 at kernel/futex.c:1453 __unqueue_futex+0x2f/0x40
    Jun 8 10:39:11 haydn kernel: [61432.829405] Modules linked in: udp_diag
    < snip list of modules >
    sha1_generic usb_common wmi
    Jun 8 10:39:11 haydn kernel: [61432.829517] CPU: 2 PID: 2664 Comm: kwin_x11 Tainted: P W O 5.10.41-desktop-1.mga8 #1
    Jun 8 10:39:11 haydn kernel: [61432.829519] Hardware name: System manufacturer System Product Name/PRIME A320M-E, BIOS 5602 07/14/2020
    Jun 8 10:39:11 haydn kernel: [61432.829521] RIP: 0010:__unqueue_futex+0x2f/0x40
    Jun 8 10:39:11 haydn kernel: [61432.829523] Code: 53 48 8b 5f 30 48 85 db 74 1c 48 8b 57 18 48 8d 47 18 48 39 c2 74 13 48 8d 73 04 e8 bb 53 4b 00 f0 ff 4b fc 5b c3 0f 0b 5b c3 <0f> 0b 5b c3 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 0f 1f 44 00 00
    Jun 8 10:39:11 haydn kernel: [61432.829525] RSP: 0018:ffffc1e0c162fc90 EFLAGS: 00010246
    Jun 8 10:39:11 haydn kernel: [61432.829527] RAX: ffffc1e0c162fd20 RBX: ffff9e0d40d44844 RCX: ffffc1e0c162fd08
    Jun 8 10:39:11 haydn kernel: [61432.829528] RDX: ffffc1e0c162fd20 RSI: 0000000000000064 RDI: ffffc1e0c162fd08
    Jun 8 10:39:11 haydn kernel: [61432.829530] RBP: 0000000000000000 R08: ffffc1e0c162fd08 R09: ffff9e0d40d44848
    Jun 8 10:39:11 haydn kernel: [61432.829531] R10: ffffc1e0c162fd20 R11: ffffc1e0c162fd10 R12: 0000000000000000
    Jun 8 10:39:11 haydn kernel: [61432.829532] R13: 0000000000000000 R14: ffff9e0d40d44844 R15: 00007ffc2b432348
    Jun 8 10:39:11 haydn kernel: [61432.829533] FS: 00007fba75aeb840(0000) GS:ffff9e1c1e880000(0000) knlGS:0000000000000000
    Jun 8 10:39:11 haydn kernel: [61432.829534] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Jun 8 10:39:11 haydn kernel: [61432.829536] CR2: 00007f7f8e9b8000 CR3: 000000015294e000 CR4: 0000000000350ee0
    Jun 8 10:39:11 haydn kernel: [61432.829536] Call Trace:
    Jun 8 10:39:11 haydn kernel: [61432.829538] futex_wait+0x118/0x230
    Jun 8 10:39:11 haydn kernel: [61432.829541] do_futex+0x16f/0xba0
    Jun 8 10:39:11 haydn kernel: [61432.829543] ? do_iter_write+0x17a/0x1b0
    Jun 8 10:39:11 haydn kernel: [61432.829544] ? vfs_writev+0xc1/0x140
    Jun 8 10:39:11 haydn kernel: [61432.829546] __x64_sys_futex+0x146/0x1c0
    Jun 8 10:39:11 haydn kernel: [61432.829548] ? do_writev+0xfb/0x110
    Jun 8 10:39:11 haydn kernel: [61432.829550] do_syscall_64+0x33/0x80
    Jun 8 10:39:11 haydn kernel: [61432.829553] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    Jun 8 10:39:11 haydn kernel: [61432.829553] RIP: 0033:0x7fba78dfd86a
    Jun 8 10:39:11 haydn kernel: [61432.829555] Code: 24 60 44 89 64 24 68 e8 34 30 00 00 e8 ef 33 00 00 44 89 e6 45 31 d2 31 d2 41 89 c0 40 80 f6 80 4c 89 f7 b8 ca 00 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 d2 00 00 00 44 89 c7 e8 42 34 00 00 31 f6
    Jun 8 10:39:11 haydn kernel: [61432.829555] RSP: 002b:00007ffc2b432150 EFLAGS: 00000282 ORIG_RAX: 00000000000000ca
    Jun 8 10:39:11 haydn kernel: [61432.829557] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fba78dfd86a
    Jun 8 10:39:11 haydn kernel: [61432.829557] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 00007ffc2b432348
    Jun 8 10:39:11 haydn kernel: [61432.829558] RBP: 00007ffc2b432320 R08: 0000000000000000 R09: 0000000000000000
    Jun 8 10:39:11 haydn kernel: [61432.829559] R10: 0000000000000000 R11: 0000000000000282 R12: 0000000000000000
    Jun 8 10:39:11 haydn kernel: [61432.829560] R13: 0000000001f3a3d8 R14: 00007ffc2b432348 R15: 00007ffc2b432180
    Jun 8 10:39:11 haydn kernel: [61432.829561] ---[ end trace 1f829f9362f32cdd ]---

    This is a 6-core AMD Ryzen CPU, and all six cores are mentioned.
    The system has been halting without obvious reason (no error messages in Syslog, and temperatures are not excessive, currently 40 - 43 degs C). I
    have been in conversation with the makers, who are more oriented to
    Windows problem solving. They gave me a program called Aida64Extreme to
    stress test the CPUs, but it ran for 6 hours without fail, and very
    steady temperatures for CPU and MB. But as soon as I run Mageia 8 with 3
    cores running BOINC, temperatures go as high as 85 degs.
    I've run memtest-86 twice, and it stopped without displaying errors
    after about 9 minutes (heat problems?).
    What could be the reason for all those futex problems, and how do I
    convince the supplier that there is a serious problem?
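
    One way to put a number on "hundreds" of warnings, and to see which CPUs and source locations they involve, is to tally the WARNING lines in a saved copy of the log. The following is a minimal Python sketch and not part of the original post; the function name tally_warnings and the regular expression are assumptions based on the log format shown above.

```python
import re
from collections import Counter

# Matches the first line of a kernel warning as it appears in the trace above,
# e.g. "WARNING: CPU: 2 PID: 2664 at kernel/futex.c:1453 __unqueue_futex+0x2f/0x40".
# \s+ also matches a line break, so a warning wrapped over two lines still counts.
WARNING_RE = re.compile(
    r"WARNING: CPU: (?P<cpu>\d+) PID: (?P<pid>\d+)\s+at\s+"
    r"(?P<where>\S+:\d+)\s+(?P<func>[^\s+]+)"
)


def tally_warnings(log_text):
    """Count kernel warnings per (source location, function) and per CPU."""
    by_site = Counter()
    by_cpu = Counter()
    for match in WARNING_RE.finditer(log_text):
        by_site[(match.group("where"), match.group("func"))] += 1
        by_cpu[int(match.group("cpu"))] += 1
    return by_site, by_cpu
```

    It could be called as tally_warnings(open("syslog.txt").read()), where syslog.txt is a hypothetical file name for the saved log. A count per source location may be easier to show a supplier than pages of raw traces.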

    I tried to search Syslog just now, and it bombed with a SEGV error. Aaaaaggh
    --
    Grimble
    Machine 'Haydn' running Plasma 5.20.4 on 5.10.41-desktop-1.mga8 kernel.
    Mageia release 8 (Official) for x86_64

    --- MBSE BBS v1.0.7.22 (GNU/Linux-x86_64)
    * Origin: A noiseless patient Spider (2:250/1@fidonet)
  • From David W. Hodgins@2:250/1 to All on Tue Jun 8 17:39:54 2021
    On Tue, 08 Jun 2021 06:28:15 -0400, Grimble <grimble@nomail.afraid.org> wrote:

    This new PC is showing serious instability. It has just thrown up
    hundreds (yes, hundreds) of kernel warnings like this:

    Please open a bug report so it can be referred to people more experienced with debugging kernel problems.

    In the bug report, include the output of inxi -MSGxx and attach the output of journalctl --no-hostname -b>journal.txt, with everything before the failure deleted.
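
    Deleting "everything before the failure" can be done by hand in an editor. As an alternative, here is a minimal Python sketch, not from the original post; the function name trim_before_failure and the marker strings are assumptions taken from the kernel trace quoted earlier in this thread.

```python
def trim_before_failure(lines, markers=("cut here", "WARNING: CPU:")):
    """Return the lines from the first one containing any marker onward.

    If no marker is found, the input is returned unchanged so nothing is lost.
    """
    lines = list(lines)
    for index, line in enumerate(lines):
        if any(marker in line for marker in markers):
            return lines[index:]
    return lines
```

    For example, trim_before_failure(open("journal.txt")) would return the part of journal.txt that starts at the first warning, which could then be written to a new file for the bug report.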

    My less than expert experience suggests it may be a gpu (not cpu) heat problem or a problem with the gpu video driver/firmware. I haven't used boinc in the last year or two. IIRC it has an option for whether or not to use the gpu. Try disabling that so it only uses the cpu.

    Regards, Dave Hodgins

    --
    Change dwhodgins@nomail.afraid.org to davidwhodgins@teksavvy.com for
    email replies.

  • From Doug Laidlaw@2:250/1 to All on Sat Jun 12 12:50:12 2021
    On 9/6/21 2:39 am, David W. Hodgins wrote:
    On Tue, 08 Jun 2021 06:28:15 -0400, Grimble <grimble@nomail.afraid.org> wrote:

    This new PC is showing serious instability. It has just thrown up
    hundreds (yes, hundreds) of kernel warnings like this:

    Please open a bug report so it can be referred to people more experienced with debugging kernel problems.

    In the bug report, include the output of inxi -MSGxx and attach the output of journalctl --no-hostname -b>journal.txt, with everything before the failure deleted.

    My less than expert experience suggests it may be a gpu (not cpu) heat problem or a problem with the gpu video driver/firmware. I haven't used boinc in the last year or two. IIRC it has an option for whether or not to use the gpu. Try disabling that so it only uses the cpu.

    Regards, Dave Hodgins

    I am still running BOINC, but disabled it due to high CPU usage. I will
    watch Grimble's progress with interest. The only other problem I had
    was a "WONTFIX" bug. It keeps generating lines in the journal every
    second, similar to the following:
    [CODE]
    Jun 12 21:44:23 dougshost.douglaidlaw.net boinc[2081]: No protocol specified
    [/CODE]

    This seems to have been around for a while, and nobody at BOINC knows
    what to do about it.

    --- MBSE BBS v1.0.7.22 (GNU/Linux-x86_64)
    * Origin: Aioe.org NNTP Server (2:250/1@fidonet)
  • From Grimble@2:250/1 to All on Tue Jun 15 12:28:13 2021
    On 12/06/2021 12:50, Doug Laidlaw wrote:
    On 9/6/21 2:39 am, David W. Hodgins wrote:
    On Tue, 08 Jun 2021 06:28:15 -0400, Grimble
    <grimble@nomail.afraid.org> wrote:

    This new PC is showing serious instability. It has just thrown up
    hundreds (yes, hundreds) of kernel warnings like this:

    Please open a bug report so it can be referred to people more experienced with debugging kernel problems.

    In the bug report, include the output of inxi -MSGxx and attach the output of journalctl --no-hostname -b>journal.txt, with everything before the failure deleted.

    My less than expert experience suggests it may be a gpu (not cpu) heat problem or a problem with the gpu video driver/firmware. I haven't used boinc in the last year or two. IIRC it has an option for whether or not to use the gpu. Try disabling that so it only uses the cpu.

    Regards, Dave Hodgins

    I am still running BOINC, but disabled it due to high CPU usage.  I will watch Grimble's progress with interest.  The only other problem I had
    was a "WONTFIX" bug.  It keeps generating lines in the journal every second, similar to the following:
    [CODE]
    Jun 12 21:44:23 dougshost.douglaidlaw.net boinc[2081]: No protocol specified
    [/CODE]

    This seems to have been around for a while, and nobody at BOINC knows
    what to do about it.

    Doug, I've noticed the same problem. Googling around, it seems to be
    because the BOINC client can't communicate with the BOINC manager.

    --
    Grimble
    Registered Linux User #450547
    Machine 'Bach' running Plasma 5.20.4 on 5.10.41-desktop-1.mga8 kernel.
    Mageia release 8 (Official) for x86_64

  • From Grimble@2:250/1 to All on Fri Jun 18 13:37:16 2021
    On 08/06/2021 11:28, Grimble wrote:
    Just noticed a kernel warning involving "iommu":
    Jun 18 11:17:13 haydn kernel: [ 1577.941885] WARNING: CPU: 2 PID: 4772 at drivers/iommu/dma-iommu.c:473 __iommu_dma_unmap+0xe8/0x100
    Jun 18 11:17:13 haydn kernel: [ 1577.941886] Modules linked in: rpcsec_gss_krb5 nfsv4 nfs fscache md4 ip_set iptable_filter cmac
    <snip list of modules>
    usbcore crypto_simd cryptd ccp sha1_generic usb_common wmi
    Jun 18 11:17:13 haydn kernel: [ 1577.941955] CPU: 2 PID: 4772 Comm: ip Tainted: P O 5.10.41-desktop-1.mga8 #1
    Jun 18 11:17:13 haydn kernel: [ 1577.941956] Hardware name: System manufacturer System Product Name/PRIME A320M-E, BIOS 5602 07/14/2020
    Jun 18 11:17:13 haydn kernel: [ 1577.941958] RIP: 0010:__iommu_dma_unmap+0xe8/0x100

    Googled "Mageia 8 iommu" and this popped up first: https://forums.mageia.org/en/viewtopic.php?f=41&t=6936
    which describes errors that are very similar to the ones I have been experiencing. (Reminder: this is a Ryzen 6 core processor) BIOS had
    IOMMU = Automatic. Changed to "Enabled", so let's see what happens.
    The post and a linked post also suggest adding "iommu=soft" or
    "iommu=pt" to the boot menu. I would welcome some informed opinion as to
    which to use.
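
    Whichever option is tried, it helps to confirm after a reboot that the kernel actually received it. Below is a minimal Python sketch, not from the original post, that picks the IOMMU-related parameters out of a kernel command line; the function name iommu_options is hypothetical, and the list of parameter names checked is an assumption.

```python
def iommu_options(cmdline):
    """Return the values of every IOMMU-related parameter on a kernel command line."""
    found = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        # Only key=value tokens whose key is one of the IOMMU parameters.
        if sep and key in ("iommu", "amd_iommu", "intel_iommu"):
            found.setdefault(key, []).append(value)
    return found
```

    On a running system the command line can be read from /proc/cmdline, for example iommu_options(open("/proc/cmdline").read()). An empty result would mean no such parameter was passed at boot.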
    --
    Grimble
    Machine 'Haydn' running Plasma 5.20.4 on 5.10.41-desktop-1.mga8 kernel.
    Mageia release 8 (Official) for x86_64

  • From David W. Hodgins@2:250/1 to All on Fri Jun 18 18:21:42 2021
    On Fri, 18 Jun 2021 08:37:16 -0400, Grimble <grimble@nomail.afraid.org> wrote:
    Googled "Mageia 8 iommu" and this popped up first: https://forums.mageia.org/en/viewtopic.php?f=41&t=6936
    which describes errors that are very similar to the ones I have been experiencing. (Reminder: this is a Ryzen 6 core processor) BIOS had
    IOMMU = Automatic. Changed to "Enabled", so lets see what happens.
    The post and a linked post also suggests adding "iommu=soft" or
    "iommu=pt" to the boot menu. I would welcome some informed opinion as to which to use.

    Thanks for the info. The /usr/share/doc/kernel-doc/admin-guide/kernel-parameters.txt
    file from the kernel-doc package only lists the various possible settings for iommu,
    with no details on what they do, or what they are abbreviations of.

    In general, when a device parameter is set to soft, I expect that means the kernel
    processes the interrupts etc. in the kernel software rather than relying on the
    processor built into the memory management hardware chip. I have no idea what pt
    would refer to.

    Keep a watch out for any bios/uefi firmware updates from the motherboard manufacturer
    that fix the iommu issues.

    Regards, Dave Hodgins

    --
    Change dwhodgins@nomail.afraid.org to davidwhodgins@teksavvy.com for
    email replies.

  • From William Unruh@2:250/1 to All on Fri Jun 18 21:12:08 2021
    On 2021-06-18, David W. Hodgins <dwhodgins@nomail.afraid.org> wrote:
    On Fri, 18 Jun 2021 08:37:16 -0400, Grimble <grimble@nomail.afraid.org> wrote:
    Googled "Mageia 8 iommu" and this popped up first:
    https://forums.mageia.org/en/viewtopic.php?f=41&t=6936
    which describes errors that are very similar to the ones I have been
    experiencing. (Reminder: this is a Ryzen 6 core processor) BIOS had
    IOMMU = Automatic. Changed to "Enabled", so lets see what happens.
    The post and a linked post also suggests adding "iommu=soft" or
    "iommu=pt" to the boot menu. I would welcome some informed opinion as to
    which to use.

    Thanks for the info. The /usr/share/doc/kernel-doc/admin-guide/kernel-parameters.txt
    file from the kernel-doc package only lists the various possible settings for iommu,
    with no details on what they do, or what they are abbreviations of.

    In general, when a device parameter is set to soft, I expect that means the kernel
    processes the interrupts etc. in the kernel software rather then relying on the
    processor built into the memory management hardware chip. I have no idea what pt
    would refer to.

    Keep a watch out for any bios/uefi firmware updates from the motherboard manufacturer
    that fix the iommu issues.

    Regards, Dave Hodgins


    https://unix.stackexchange.com/questions/592538/what-are-the-implication-of-using-iommu-force-in-the-boot-kernel-options
    seems to have some very brief explanations.
    I found it with a Google search for "linux iommu".

  • From David W. Hodgins@2:250/1 to All on Fri Jun 18 21:28:52 2021
    On Fri, 18 Jun 2021 16:12:08 -0400, William Unruh <unruh@invalid.ca> wrote:

    On 2021-06-18, David W. Hodgins <dwhodgins@nomail.afraid.org> wrote:
    On Fri, 18 Jun 2021 08:37:16 -0400, Grimble <grimble@nomail.afraid.org> wrote:
    Googled "Mageia 8 iommu" and this popped up first:
    https://forums.mageia.org/en/viewtopic.php?f=41&t=6936
    which describes errors that are very similar to the ones I have been
    experiencing. (Reminder: this is a Ryzen 6 core processor) BIOS had
    IOMMU = Automatic. Changed to "Enabled", so lets see what happens.
    The post and a linked post also suggests adding "iommu=soft" or
    "iommu=pt" to the boot menu. I would welcome some informed opinion as to
    which to use.

    Thanks for the info. The /usr/share/doc/kernel-doc/admin-guide/kernel-parameters.txt
    file from the kernel-doc package only lists the various possible settings for iommu,
    with no details on what they do, or what they are abbreviations of.

    In general, when a device parameter is set to soft, I expect that means the kernel
    processes the interrupts etc. in the kernel software rather then relying on the
    processor built into the memory management hardware chip. I have no idea what pt
    would refer to.

    Keep a watch out for any bios/uefi firmware updates from the motherboard manufacturer
    that fix the iommu issues.

    Regards, Dave Hodgins


    https://unix.stackexchange.com/questions/592538/what-are-the-implication-of-using-iommu-force-in-the-boot-kernel-options
    seems to have some very brief explanations.
    I did a google linux iommu search

    Thanks. None of the pages I'd checked (I didn't check all of the ones found) explained
    iommu=pt. The links from the stackexchange page confirm that iommu=soft uses software
    instead of hardware. Only one of the links from that page explains iommu=pt with ...

    From https://community.mellanox.com/s/article/understanding-the-iommu-linux-grub-file-configuration
    This post discusses the iommu and intel_iommu Linux grub parameters for SR-IOV pass-through (pt) mode. When working in an SR-IOV environment, we need to make sure that kernel enables SR-IOV and that we get good performance.
    To enable SR-IOV in the kernel, configure intel_iommu=on in the grub file. To get the best performance, add iommu=pt (pass-through) to the grub file when using SR-IOV. When in pass-through mode, the adapter does not need to use DMA translation to the memory, and this improves the performance. iommu=pt is needed mainly with hypervisor performance is needed.

    So it's mainly of use if you will be running under a hypervisor such as xen or qemu.
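
    For anyone who does want to try one of these options permanently, the usual place on a GRUB 2 system is the GRUB_CMDLINE_LINUX_DEFAULT line in /etc/default/grub. The following is a minimal Python sketch, not from the quoted article, that appends a parameter to that line in a copy of the file's text; the function name add_kernel_param is hypothetical, and it is an assumption that the line is a single quoted string.

```python
import re

# Assumes a single-line, quoted assignment such as
# GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
CMDLINE_RE = re.compile(
    r'^(GRUB_CMDLINE_LINUX_DEFAULT=)(["\'])(.*)\2[ \t]*$', re.MULTILINE
)


def add_kernel_param(grub_text, param):
    """Append param to GRUB_CMDLINE_LINUX_DEFAULT if it is not already there."""

    def patch(match):
        values = match.group(3).split()
        if param not in values:
            values.append(param)
        quote = match.group(2)
        return f"{match.group(1)}{quote}{' '.join(values)}{quote}"

    return CMDLINE_RE.sub(patch, grub_text)
```

    The function only returns the modified text; it does not write the file. After saving a change to /etc/default/grub, the GRUB configuration still has to be regenerated with the distribution's own tool before the parameter takes effect.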

    Regards, Dave Hodgins

    --
    Change dwhodgins@nomail.afraid.org to davidwhodgins@teksavvy.com for
    email replies.
