• Re: Linux kernel: powerpc: KVM guest can trigger host crash on Power8

    From John Paul Adrian Glaubitz@21:1/5 to All on Tue Oct 26 10:50:02 2021
    Hi Michael!

    The Linux kernel for powerpc since v5.2 has a bug which allows a
    malicious KVM guest to crash the host, when the host is running on
    Power8.

    Only machines using Linux as the hypervisor, aka. KVM, powernv or bare
    metal, are affected by the bug. Machines running PowerVM are not
    affected.

    The bug was introduced in:

    10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")

    Which was first released in v5.2.

    The upstream fix is:

    cdeb5d7d890e ("KVM: PPC: Book3S HV: Make idle_kvm_start_guest() return 0 if it went to guest")
    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cdeb5d7d890e14f3b70e8087e745c4a6a7d9f337

    Which will be included in the v5.16 release.
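
    (As a side note for anyone wanting to test the fix on a stable tree before v5.16: a minimal sketch, assuming the commit cherry-picks cleanly onto v5.14.x; the branch and remote names are illustrative.)

    $ git clone https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
    $ cd linux && git checkout -b kvm-fix v5.14.13
    $ git remote add torvalds https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
    $ git fetch torvalds
    $ git cherry-pick cdeb5d7d890e14f3b70e8087e745c4a6a7d9f337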

    I have tested these patches against 5.14 but it seems the problem [1] still remains for me
    for big-endian guests. I built a patched kernel yesterday, rebooted the KVM server and let
    the build daemons do their work over night.

    When I got up this morning, I noticed the machine was down, so I checked the serial console
    via IPMI and saw the same messages again as reported in [1]:

    [41483.963562] watchdog: BUG: soft lockup - CPU#104 stuck for 25521s! [migration/104:175]
    [41507.963307] watchdog: BUG: soft lockup - CPU#104 stuck for 25544s! [migration/104:175]
    [41518.311200] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
    [41518.311216] rcu: 136-...0: (135 ticks this GP) idle=242/1/0x4000000000000000 softirq=32031/32033 fqs=2729959
    [41547.962882] watchdog: BUG: soft lockup - CPU#104 stuck for 25581s! [migration/104:175]
    [41571.962627] watchdog: BUG: soft lockup - CPU#104 stuck for 25603s! [migration/104:175]
    [41581.330530] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
    [41581.330546] rcu: 136-...0: (135 ticks this GP) idle=242/1/0x4000000000000000 softirq=32031/32033 fqs=2736378
    [41611.962202] watchdog: BUG: soft lockup - CPU#104 stuck for 25641s! [migration/104:175]
    [41635.961947] watchdog: BUG: soft lockup - CPU#104 stuck for 25663s! [migration/104:175]
    [41644.349859] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
    [41644.349876] rcu: 136-...0: (135 ticks this GP) idle=242/1/0x4000000000000000 softirq=32031/32033 fqs=2742753
    [41671.961564] watchdog: BUG: soft lockup - CPU#104 stuck for 25697s! [migration/104:175]
    [41695.961309] watchdog: BUG: soft lockup - CPU#104 stuck for 25719s! [migration/104:175]
    [41707.369190] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
    [41707.369206] rcu: 136-...0: (135 ticks this GP) idle=242/1/0x4000000000000000 softirq=32031/32033 fqs=2749151
    [41735.960884] watchdog: BUG: soft lockup - CPU#104 stuck for 25756s! [migration/104:175]
    [41759.960629] watchdog: BUG: soft lockup - CPU#104 stuck for 25778s! [migration/104:175]
    [41770.388520] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
    [41770.388548] rcu: 136-...0: (135 ticks this GP) idle=242/1/0x4000000000000000 softirq=32031/32033 fqs=2755540
    [41776.076307] rcu: rcu_sched kthread timer wakeup didn't happen for 1423 jiffies! g49897 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
    [41776.076327] rcu: Possible timer handling issue on cpu=32 timer-softirq=1056014
    [41776.076336] rcu: rcu_sched kthread starved for 1424 jiffies! g49897 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=32
    [41776.076350] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
    [41776.076360] rcu: RCU grace-period kthread stack dump:
    [41776.076434] rcu: Stack dump where RCU GP kthread last ran:
    [41783.960374] watchdog: BUG: soft lockup - CPU#104 stuck for 25801s! [migration/104:175]
    [41807.960119] watchdog: BUG: soft lockup - CPU#104 stuck for 25823s! [migration/104:175]
    [41831.959864] watchdog: BUG: soft lockup - CPU#104 stuck for 25846s! [migration/104:175]
    [41833.407851] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
    [41833.407868] rcu: 136-...0: (135 ticks this GP) idle=242/1/0x4000000000000000 softirq=32031/32033 fqs=2760381
    [41863.959524] watchdog: BUG: soft lockup - CPU#104 stuck for 25875s! [migration/104:175]

    It seems that in this case, it was the testsuite of the git package [2] that triggered the bug. As you
    can see from the overview, the git package has been in the "building" state for 8 hours, which means the
    build server crashed and is no longer reporting back to the database.

    Adrian

    [1] https://bugzilla.kernel.org/show_bug.cgi?id=206669
    [2] https://buildd.debian.org/status/package.php?p=git&suite=experimental

    --
    .''`. John Paul Adrian Glaubitz
    : :' : Debian Developer - glaubitz@debian.org
    `. `' Freie Universitaet Berlin - glaubitz@physik.fu-berlin.de
    `- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael Ellerman@21:1/5 to John Paul Adrian Glaubitz on Wed Oct 27 08:00:02 2021
    John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> writes:
    Hi Michael!

    Hi Adrian,

    Thanks for testing ...

    The Linux kernel for powerpc since v5.2 has a bug which allows a
    malicious KVM guest to crash the host, when the host is running on
    Power8.

    Only machines using Linux as the hypervisor, aka. KVM, powernv or bare
    metal, are affected by the bug. Machines running PowerVM are not
    affected.

    The bug was introduced in:

    10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")

    Which was first released in v5.2.

    The upstream fix is:

    cdeb5d7d890e ("KVM: PPC: Book3S HV: Make idle_kvm_start_guest() return 0 if it went to guest")
    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cdeb5d7d890e14f3b70e8087e745c4a6a7d9f337

    Which will be included in the v5.16 release.

    I have tested these patches against 5.14 but it seems the problem [1] still remains for me
    for big-endian guests. I built a patched kernel yesterday, rebooted the KVM server and let
    the build daemons do their work over night.

    When I got up this morning, I noticed the machine was down, so I checked the serial console
    via IPMI and saw the same messages again as reported in [1]:

    [41483.963562] watchdog: BUG: soft lockup - CPU#104 stuck for 25521s! [migration/104:175]
    [41507.963307] watchdog: BUG: soft lockup - CPU#104 stuck for 25544s! [migration/104:175]
    [41518.311200] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
    [41518.311216] rcu: 136-...0: (135 ticks this GP) idle=242/1/0x4000000000000000 softirq=32031/32033 fqs=2729959
    [41547.962882] watchdog: BUG: soft lockup - CPU#104 stuck for 25581s! [migration/104:175]
    [41571.962627] watchdog: BUG: soft lockup - CPU#104 stuck for 25603s! [migration/104:175]
    [41581.330530] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
    [41581.330546] rcu: 136-...0: (135 ticks this GP) idle=242/1/0x4000000000000000 softirq=32031/32033 fqs=2736378
    [41611.962202] watchdog: BUG: soft lockup - CPU#104 stuck for 25641s! [migration/104:175]
    [41635.961947] watchdog: BUG: soft lockup - CPU#104 stuck for 25663s! [migration/104:175]
    [41644.349859] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
    [41644.349876] rcu: 136-...0: (135 ticks this GP) idle=242/1/0x4000000000000000 softirq=32031/32033 fqs=2742753
    [41671.961564] watchdog: BUG: soft lockup - CPU#104 stuck for 25697s! [migration/104:175]
    [41695.961309] watchdog: BUG: soft lockup - CPU#104 stuck for 25719s! [migration/104:175]
    [41707.369190] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
    [41707.369206] rcu: 136-...0: (135 ticks this GP) idle=242/1/0x4000000000000000 softirq=32031/32033 fqs=2749151
    [41735.960884] watchdog: BUG: soft lockup - CPU#104 stuck for 25756s! [migration/104:175]
    [41759.960629] watchdog: BUG: soft lockup - CPU#104 stuck for 25778s! [migration/104:175]
    [41770.388520] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
    [41770.388548] rcu: 136-...0: (135 ticks this GP) idle=242/1/0x4000000000000000 softirq=32031/32033 fqs=2755540
    [41776.076307] rcu: rcu_sched kthread timer wakeup didn't happen for 1423 jiffies! g49897 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
    [41776.076327] rcu: Possible timer handling issue on cpu=32 timer-softirq=1056014
    [41776.076336] rcu: rcu_sched kthread starved for 1424 jiffies! g49897 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=32
    [41776.076350] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
    [41776.076360] rcu: RCU grace-period kthread stack dump:
    [41776.076434] rcu: Stack dump where RCU GP kthread last ran:
    [41783.960374] watchdog: BUG: soft lockup - CPU#104 stuck for 25801s! [migration/104:175]
    [41807.960119] watchdog: BUG: soft lockup - CPU#104 stuck for 25823s! [migration/104:175]
    [41831.959864] watchdog: BUG: soft lockup - CPU#104 stuck for 25846s! [migration/104:175]
    [41833.407851] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
    [41833.407868] rcu: 136-...0: (135 ticks this GP) idle=242/1/0x4000000000000000 softirq=32031/32033 fqs=2760381
    [41863.959524] watchdog: BUG: soft lockup - CPU#104 stuck for 25875s! [migration/104:175]

    It seems that in this case, it was the testsuite of the git package [2] that triggered the bug. As you
    can see from the overview, the git package has been in the "building" state for 8 hours, which means the
    build server crashed and is no longer reporting back to the database.

    OK, that sucks.

    I did test the repro case you gave me before (in the bugzilla), which
    was building glibc, that passes for me with a patched host.

    I guess we have yet another bug.

    I tried the following in a debian BE VM and it completed fine:

    $ dget -u http://ftp.debian.org/debian/pool/main/g/git/git_2.33.1-1.dsc
    $ sbuild -d sid --arch=powerpc --no-arch-all git_2.33.1-1.dsc

    Same for ppc64.

    And I also tried both at once, repeatedly in a loop.

    I guess it's something more complicated.

    What exact host/guest kernel versions and configs are you running?

    cheers

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Nicholas Piggin@21:1/5 to All on Wed Oct 27 07:50:01 2021
    Excerpts from John Paul Adrian Glaubitz's message of October 26, 2021 6:48 pm:
    Hi Michael!

    The Linux kernel for powerpc since v5.2 has a bug which allows a
    malicious KVM guest to crash the host, when the host is running on
    Power8.

    Only machines using Linux as the hypervisor, aka. KVM, powernv or bare
    metal, are affected by the bug. Machines running PowerVM are not
    affected.

    The bug was introduced in:

    10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")

    Which was first released in v5.2.

    The upstream fix is:

    cdeb5d7d890e ("KVM: PPC: Book3S HV: Make idle_kvm_start_guest() return 0 if it went to guest")
    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cdeb5d7d890e14f3b70e8087e745c4a6a7d9f337

    Which will be included in the v5.16 release.

    I have tested these patches against 5.14 but it seems the problem [1] still remains for me
    for big-endian guests. I built a patched kernel yesterday, rebooted the KVM server and let
    the build daemons do their work over night.

    When I got up this morning, I noticed the machine was down, so I checked the serial console
    via IPMI and saw the same messages again as reported in [1]:

    [41483.963562] watchdog: BUG: soft lockup - CPU#104 stuck for 25521s! [migration/104:175]
    [41507.963307] watchdog: BUG: soft lockup - CPU#104 stuck for 25544s! [migration/104:175]
    [41518.311200] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
    [41518.311216] rcu: 136-...0: (135 ticks this GP) idle=242/1/0x4000000000000000 softirq=32031/32033 fqs=2729959
    [41547.962882] watchdog: BUG: soft lockup - CPU#104 stuck for 25581s! [migration/104:175]
    [41571.962627] watchdog: BUG: soft lockup - CPU#104 stuck for 25603s! [migration/104:175]
    [41581.330530] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
    [41581.330546] rcu: 136-...0: (135 ticks this GP) idle=242/1/0x4000000000000000 softirq=32031/32033 fqs=2736378
    [41611.962202] watchdog: BUG: soft lockup - CPU#104 stuck for 25641s! [migration/104:175]
    [41635.961947] watchdog: BUG: soft lockup - CPU#104 stuck for 25663s! [migration/104:175]
    [41644.349859] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
    [41644.349876] rcu: 136-...0: (135 ticks this GP) idle=242/1/0x4000000000000000 softirq=32031/32033 fqs=2742753
    [41671.961564] watchdog: BUG: soft lockup - CPU#104 stuck for 25697s! [migration/104:175]
    [41695.961309] watchdog: BUG: soft lockup - CPU#104 stuck for 25719s! [migration/104:175]
    [41707.369190] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
    [41707.369206] rcu: 136-...0: (135 ticks this GP) idle=242/1/0x4000000000000000 softirq=32031/32033 fqs=2749151
    [41735.960884] watchdog: BUG: soft lockup - CPU#104 stuck for 25756s! [migration/104:175]
    [41759.960629] watchdog: BUG: soft lockup - CPU#104 stuck for 25778s! [migration/104:175]
    [41770.388520] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
    [41770.388548] rcu: 136-...0: (135 ticks this GP) idle=242/1/0x4000000000000000 softirq=32031/32033 fqs=2755540
    [41776.076307] rcu: rcu_sched kthread timer wakeup didn't happen for 1423 jiffies! g49897 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
    [41776.076327] rcu: Possible timer handling issue on cpu=32 timer-softirq=1056014
    [41776.076336] rcu: rcu_sched kthread starved for 1424 jiffies! g49897 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=32
    [41776.076350] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
    [41776.076360] rcu: RCU grace-period kthread stack dump:
    [41776.076434] rcu: Stack dump where RCU GP kthread last ran:
    [41783.960374] watchdog: BUG: soft lockup - CPU#104 stuck for 25801s! [migration/104:175]
    [41807.960119] watchdog: BUG: soft lockup - CPU#104 stuck for 25823s! [migration/104:175]
    [41831.959864] watchdog: BUG: soft lockup - CPU#104 stuck for 25846s! [migration/104:175]
    [41833.407851] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
    [41833.407868] rcu: 136-...0: (135 ticks this GP) idle=242/1/0x4000000000000000 softirq=32031/32033 fqs=2760381
    [41863.959524] watchdog: BUG: soft lockup - CPU#104 stuck for 25875s! [migration/104:175]

    I don't suppose you were able to get any more of the log saved? (The
    first error messages that happened might be interesting)

    Thanks,
    Nick

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Paul Adrian Glaubitz@21:1/5 to Michael Ellerman on Wed Oct 27 12:10:02 2021
    Hi Michael!

    On 10/27/21 07:30, Michael Ellerman wrote:
    I did test the repro case you gave me before (in the bugzilla), which
    was building glibc, that passes for me with a patched host.

    Did you manage to crash the unpatched host? If the unpatched host crashes
    for you but the patched doesn't, I will make sure I didn't accidentally
    miss anything.

    Also, I'll try a kernel from git with Debian's config.

    I guess we have yet another bug.

    I tried the following in a debian BE VM and it completed fine:

    $ dget -u http://ftp.debian.org/debian/pool/main/g/git/git_2.33.1-1.dsc
    $ sbuild -d sid --arch=powerpc --no-arch-all git_2.33.1-1.dsc

    Same for ppc64.

    And I also tried both at once, repeatedly in a loop.

    Did you try building gcc-11 for powerpc and ppc64 both at once?

    I guess it's something more complicated.

    What exact host/guest kernel versions and configs are you running?

    Both the host and guest are running Debian's stock 5.14.12 kernel. The host has a kernel with your patches applied, the guest doesn't.

    Let me do some more testing.

    Adrian

    --
    .''`. John Paul Adrian Glaubitz
    : :' : Debian Developer - glaubitz@debian.org
    `. `' Freie Universitaet Berlin - glaubitz@physik.fu-berlin.de
    `- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Paul Adrian Glaubitz@21:1/5 to Michael Ellerman on Wed Oct 27 13:20:01 2021
    Hi Michael!

    On 10/27/21 13:06, Michael Ellerman wrote:
    John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> writes:
    Hi Michael!

    On 10/27/21 07:30, Michael Ellerman wrote:
    I did test the repro case you gave me before (in the bugzilla), which
    was building glibc, that passes for me with a patched host.

    Did you manage to crash the unpatched host?

    Yes, the parallel builds of glibc you described crashed the unpatched
    host 100% reliably for me.

    OK, that is very good news!

    I also have a standalone reproducer I'll send you.

    Thanks, that would be helpful!

    Also, I'll try a kernel from git with Debian's config.

    I guess we have yet another bug.

    I tried the following in a debian BE VM and it completed fine:

    $ dget -u http://ftp.debian.org/debian/pool/main/g/git/git_2.33.1-1.dsc
    $ sbuild -d sid --arch=powerpc --no-arch-all git_2.33.1-1.dsc

    Same for ppc64.

    And I also tried both at once, repeatedly in a loop.

    Did you try building gcc-11 for powerpc and ppc64 both at once?

    No, I will try that now.

    OK, great!

    Adrian

    --
    .''`. John Paul Adrian Glaubitz
    : :' : Debian Developer - glaubitz@debian.org
    `. `' Freie Universitaet Berlin - glaubitz@physik.fu-berlin.de
    `- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael Ellerman@21:1/5 to John Paul Adrian Glaubitz on Wed Oct 27 13:30:01 2021
    John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> writes:
    Hi Michael!

    On 10/27/21 07:30, Michael Ellerman wrote:
    I did test the repro case you gave me before (in the bugzilla), which
    was building glibc, that passes for me with a patched host.

    Did you manage to crash the unpatched host?

    Yes, the parallel builds of glibc you described crashed the unpatched
    host 100% reliably for me.

    I also have a standalone reproducer I'll send you.

    If the unpatched host crashes for you but the patched doesn't, I will
    make sure I didn't accidentally miss anything.

    OK thanks.

    Also, I'll try a kernel from git with Debian's config.

    I guess we have yet another bug.

    I tried the following in a debian BE VM and it completed fine:

    $ dget -u http://ftp.debian.org/debian/pool/main/g/git/git_2.33.1-1.dsc
    $ sbuild -d sid --arch=powerpc --no-arch-all git_2.33.1-1.dsc

    Same for ppc64.

    And I also tried both at once, repeatedly in a loop.

    Did you try building gcc-11 for powerpc and ppc64 both at once?

    No, I will try that now.

    I guess it's something more complicated.

    What exact host/guest kernel versions and configs are you running?

    Both the host and guest are running Debian's stock 5.14.12 kernel. The host has
    a kernel with your patches applied, the guest doesn't.

    OK that sounds fine.

    I tested upstream stable v5.14.13 + my patches, but there's nothing
    between 5.14.12 and 5.14.13 that should matter AFAICS.

    Let me do some more testing.

    Thanks.

    cheers

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael Ellerman@21:1/5 to John Paul Adrian Glaubitz on Thu Oct 28 09:00:02 2021
    [ Dropping oss-security from Cc]

    John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> writes:
    On 10/27/21 13:06, Michael Ellerman wrote:
    John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> writes:
    On 10/27/21 07:30, Michael Ellerman wrote:
    I did test the repro case you gave me before (in the bugzilla), which
    was building glibc, that passes for me with a patched host.

    Did you manage to crash the unpatched host?

    Yes, the parallel builds of glibc you described crashed the unpatched
    host 100% reliably for me.

    OK, that is very good news!

    I also have a standalone reproducer I'll send you.

    Thanks, that would be helpful!

    Also, I'll try a kernel from git with Debian's config.

    I guess we have yet another bug.

    I tried the following in a debian BE VM and it completed fine:

    $ dget -u http://ftp.debian.org/debian/pool/main/g/git/git_2.33.1-1.dsc
    $ sbuild -d sid --arch=powerpc --no-arch-all git_2.33.1-1.dsc

    Same for ppc64.

    And I also tried both at once, repeatedly in a loop.

    Did you try building gcc-11 for powerpc and ppc64 both at once?

    No, I will try that now.

    That completed fine on my BE VM here.

    I ran these in two tmux windows:
    $ sbuild -d sid --arch=powerpc --no-arch-all gcc-11_11.2.0-10.dsc
    $ sbuild -d sid --arch=ppc64 --no-arch-all gcc-11_11.2.0-10.dsc


    The VM has 32 CPUs, with 4 threads per core:

    $ ppc64_cpu --info
    Core 0: 0* 1* 2* 3*
    Core 1: 4* 5* 6* 7*
    Core 2: 8* 9* 10* 11*
    Core 3: 12* 13* 14* 15*
    Core 4: 16* 17* 18* 19*
    Core 5: 20* 21* 22* 23*
    Core 6: 24* 25* 26* 27*
    Core 7: 28* 29* 30* 31*


    cheers

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Paul Adrian Glaubitz@21:1/5 to Michael Ellerman on Thu Oct 28 13:30:02 2021
    Hi Michael!

    On 10/28/21 08:39, Michael Ellerman wrote:
    No, I will try that now.

    That completed fine on my BE VM here.

    I ran these in two tmux windows:
    $ sbuild -d sid --arch=powerpc --no-arch-all gcc-11_11.2.0-10.dsc
    $ sbuild -d sid --arch=ppc64 --no-arch-all gcc-11_11.2.0-10.dsc


    The VM has 32 CPUs, with 4 threads per core:

    $ ppc64_cpu --info
    Core 0: 0* 1* 2* 3*
    Core 1: 4* 5* 6* 7*
    Core 2: 8* 9* 10* 11*
    Core 3: 12* 13* 14* 15*
    Core 4: 16* 17* 18* 19*
    Core 5: 20* 21* 22* 23*
    Core 6: 24* 25* 26* 27*
    Core 7: 28* 29* 30* 31*

    It seems I can no longer reproduce the issue either, even when building the most problematic
    packages, and I think we should consider it fixed for now. I will keep monitoring the server,
    of course, and will let you know in case the problem shows up again.

    Thanks a lot again for fixing this issue!

    Adrian

    --
    .''`. John Paul Adrian Glaubitz
    : :' : Debian Developer - glaubitz@debian.org
    `. `' Freie Universitaet Berlin - glaubitz@physik.fu-berlin.de
    `- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Paul Adrian Glaubitz@21:1/5 to John Paul Adrian Glaubitz on Thu Oct 28 16:10:01 2021
    Hello!

    On 10/28/21 15:52, John Paul Adrian Glaubitz wrote:
    I am not sure what triggered my previous crash but I don't think it's related to this
    particular bug. I will keep monitoring the server in any case and open a new bug report
    in case I'm running into similar issues.

    This is very unfortunate, but just after I sent this mail, the machine crashed again.

    Sorry for the premature success report. I will now have to check what happened and get in touch with Michael.

    Adrian

    --
    .''`. John Paul Adrian Glaubitz
    : :' : Debian Developer - glaubitz@debian.org
    `. `' Freie Universitaet Berlin - glaubitz@physik.fu-berlin.de
    `- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Paul Adrian Glaubitz@21:1/5 to John Paul Adrian Glaubitz on Thu Oct 28 16:00:02 2021
    Hello!

    An update to this post with oss-security CC'ed.

    On 10/26/21 10:48, John Paul Adrian Glaubitz wrote:
    I have tested these patches against 5.14 but it seems the problem [1] still remains for me
    for big-endian guests. I built a patched kernel yesterday, rebooted the KVM server and let
    the build daemons do their work over night.

    I have done thorough testing and I'm no longer seeing the problem with the patched kernel.

    I am not sure what triggered my previous crash but I don't think it's related to this
    particular bug. I will keep monitoring the server in any case and open a new bug report
    in case I'm running into similar issues.

    Thanks,
    Adrian

    --
    .''`. John Paul Adrian Glaubitz
    : :' : Debian Developer - glaubitz@debian.org
    `. `' Freie Universitaet Berlin - glaubitz@physik.fu-berlin.de
    `- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Paul Adrian Glaubitz@21:1/5 to John Paul Adrian Glaubitz on Thu Oct 28 16:20:01 2021
    Hi!

    On 10/28/21 16:05, John Paul Adrian Glaubitz wrote:
    The following packages were being built at the same time:

    - guest 1: virtuoso-opensource and openturns
    - guest 2: llvm-toolchain-13

    I really did a lot of testing today with no issues, and just after I sent my report
    to oss-security saying that the machine seemed to be stable again, the issue showed up :(.

    Do you know whether IPMI features any sort of monitoring for capturing the output
    of the serial console non-interactively? That way I would be able to capture the
    crash beyond what I have seen above.
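
    (One possible approach, sketched here with a placeholder BMC address and credentials: keep a Serial-over-LAN session attached and tee its output to a log file.)

    $ ipmitool -I lanplus -H <bmc-address> -U <user> -P <password> sol activate | tee console.log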

    Adrian

    --
    .''`. John Paul Adrian Glaubitz
    : :' : Debian Developer - glaubitz@debian.org
    `. `' Freie Universitaet Berlin - glaubitz@physik.fu-berlin.de
    `- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Nicholas Piggin@21:1/5 to All on Fri Oct 29 03:10:01 2021
    Excerpts from John Paul Adrian Glaubitz's message of October 29, 2021 12:05 am:
    Hi Michael!

    On 10/28/21 13:20, John Paul Adrian Glaubitz wrote:
    It seems I can no longer reproduce the issue either, even when building the most problematic
    packages, and I think we should consider it fixed for now. I will keep monitoring the server,
    of course, and will let you know in case the problem shows up again.

    The host machine is stuck again but I'm not 100% sure what triggered the problem:

    [194817.984249] watchdog: BUG: soft lockup - CPU#80 stuck for 246s! [CPU 2/KVM:1836]
    [194818.012248] watchdog: BUG: soft lockup - CPU#152 stuck for 246s! [CPU 3/KVM:1837]
    [194825.960164] watchdog: BUG: soft lockup - CPU#24 stuck for 246s! [khugepaged:318]
    [194841.983991] watchdog: BUG: soft lockup - CPU#80 stuck for 268s! [CPU 2/KVM:1836]
    [194842.011991] watchdog: BUG: soft lockup - CPU#152 stuck for 268s! [CPU 3/KVM:1837]
    [194849.959906] watchdog: BUG: soft lockup - CPU#24 stuck for 269s! [khugepaged:318]
    [194865.983733] watchdog: BUG: soft lockup - CPU#80 stuck for 291s! [CPU 2/KVM:1836]
    [194866.011733] watchdog: BUG: soft lockup - CPU#152 stuck for 291s! [CPU 3/KVM:1837]
    [194873.959648] watchdog: BUG: soft lockup - CPU#24 stuck for 291s! [khugepaged:318]
    [194889.983475] watchdog: BUG: soft lockup - CPU#80 stuck for 313s! [CPU 2/KVM:1836]
    [194890.011475] watchdog: BUG: soft lockup - CPU#152 stuck for 313s! [CPU 3/KVM:1837]
    [194897.959390] watchdog: BUG: soft lockup - CPU#24 stuck for 313s! [khugepaged:318]
    [194913.983218] watchdog: BUG: soft lockup - CPU#80 stuck for 335s! [CPU 2/KVM:1836]
    [194914.011217] watchdog: BUG: soft lockup - CPU#152 stuck for 335s! [CPU 3/KVM:1837]
    [194921.959133] watchdog: BUG: soft lockup - CPU#24 stuck for 336s! [khugepaged:318]

    Soft lockup should mean it's taking timer interrupts still, just not scheduling. Do you have the hard lockup detector enabled as well? Is
    there anything stuck spinning on another CPU?

    Do you have the full dmesg / kernel log for this boot?

    Could you try a sysrq+w to get a trace of blocked tasks?

    Are you able to shut down the guests and exit qemu normally?

    Thanks,
    Nick

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Paul Adrian Glaubitz@21:1/5 to Nicholas Piggin on Fri Oct 29 14:40:01 2021
    Hi Nicholas!

    On 10/29/21 02:41, Nicholas Piggin wrote:
    Soft lockup should mean it's taking timer interrupts still, just not scheduling. Do you have the hard lockup detector enabled as well? Is
    there anything stuck spinning on another CPU?

    I haven't enabled it. But looking at the documentation [1] it seems we could use it to print a backtrace once the lockup occurs.
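
    (For reference, a minimal sketch of checking and enabling it at runtime, assuming the host kernel was built with the hard lockup detector; otherwise only the nmi_watchdog=1 boot parameter route applies.)

    $ cat /proc/sys/kernel/nmi_watchdog
    # echo 1 > /proc/sys/kernel/nmi_watchdog

    A value of 1 means the detector is armed and should print a backtrace from NMI context when a CPU stops taking interrupts.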

    Do you have the full dmesg / kernel log for this boot?

    I do; I've uploaded the messages file here: https://people.debian.org/~glaubitz/messages-kvm-lockup.gz

    Also, I noticed there is actually a backtrace:

    Oct 25 17:02:31 watson kernel: [14104.902061] (detected by 80, t=5252 jiffies, g=49897, q=37)
    Oct 25 17:02:31 watson kernel: [14104.902072] Sending NMI from CPU 80 to CPUs 136:
    Oct 25 17:02:31 watson kernel: [14108.253972] Modules linked in: dm_mod(E) vhost_net(E) vhost(E) vhost_iotlb(E) tap(E) tun(E) kvm_hv(E) kvm_pr(E) kvm(E) xt_CHECKSUM(E) xt_MASQUERADE(E) xt_conntrack(E) ipt_REJECT(E) nf_reject_ipv4(E) xt_tcpudp(E) nft_compat(E) nft_chain_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) nft_counter(E) nf_tables(E) nfnetlink(E) bridge(E) stp(E) llc(E) xfs(E) ecb(E) xts(E) sg(E) ctr(E) vmx_crypto(E) gf128mul(E) ipmi_powernv(E) powernv_rng(E) ipmi_devintf(E) rng_core(E) ipmi_msghandler(E) powernv_op_panel(E) ib_iser(E) rdma_cm(E) iw_cm(E) ib_cm(E) ib_core(E) iscsi_tcp(E) libiscsi_tcp(E) sunrpc(E) libiscsi(E) drm(E) scsi_transport_iscsi(E) fuse(E) drm_panel_orientation_quirks(E) configfs(E) ip_tables(E) x_tables(E) autofs4(E) ext4(E) crc16(E) mbcache(E) jbd2(E) sr_mod(E) sd_mod(E) ses(E) cdrom(E) enclosure(E) t10_pi(E) crc_t10dif(E) scsi_transport_sas(E) crct10dif_generic(E) crct10dif_common(E) btrfs(E) blake2b_generic(E) zstd_compress(E) raid10(E) raid456(E)
    Oct 25 17:02:31 watson kernel: [14108.254101] async_raid6_recov(E) async_memcpy(E) async_pq(E) async_xor(E) async_tx(E) xor(E) raid6_pq(E) libcrc32c(E) crc32c_generic(E) raid1(E) raid0(E) multipath(E) linear(E) md_mod(E) xhci_pci(E) xhci_hcd(E) e1000e(E) usbcore(E) ptp(E) pps_core(E) ipr(E) usb_common(E)
    Oct 25 17:02:31 watson kernel: [14108.254139] CPU: 104 PID: 175 Comm: migration/104 Tainted: G E 5.14.0-0.bpo.2-powerpc64le #1 Debian 5.14.9-2~bpo11+2
    Oct 25 17:02:31 watson kernel: [14108.254146] Stopper: multi_cpu_stop+0x0/0x240 <- migrate_swap+0xf8/0x240
    Oct 25 17:02:31 watson kernel: [14108.254160] NIP: c0000000001f6a58 LR: c00000000026b734 CTR: c00000000026b5c0
    Oct 25 17:02:31 watson kernel: [14108.254163] REGS: c000001001237970 TRAP: 0900 Tainted: G E (5.14.0-0.bpo.2-powerpc64le Debian 5.14.9-2~bpo11+2)
    Oct 25 17:02:31 watson kernel: [14108.254168] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 28002442 XER: 20000000
    Oct 25 17:02:31 watson kernel: [14108.254183] CFAR: c00000000026b730 IRQMASK: 0
    Oct 25 17:02:31 watson kernel: [14108.254183] GPR00: c00000000026b32c c000001001237c10 c00000000166ce00 c000000000d02c30
    Oct 25 17:02:31 watson kernel: [14108.254183] GPR04: c000001806433198 c000001806433198 0000000000000000 000000005687ca06
    Oct 25 17:02:31 watson kernel: [14108.254183] GPR08: c0000017fc8948a0 c0000017fc894780 0000000000000004 c00800000a80e378
    Oct 25 17:02:31 watson kernel: [14108.254183] GPR12: 0000000000000000 c0000017ffff5a00 c000000000173ec8 c00000000194c080
    Oct 25 17:02:31 watson kernel: [14108.254183] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
    Oct 25 17:02:31 watson kernel: [14108.254183] GPR20: 0000000000000000 c000001806433170 0000000000000000 0000000000000001
    Oct 25 17:02:31 watson kernel: [14108.254183] GPR24: 0000000000000002 0000000000000003 0000000000000000 c000000000d02c30
    Oct 25 17:02:31 watson kernel: [14108.254183] GPR28: 0000000000000001 c000001806433170 c000001806433194 0000000000000001
    Oct 25 17:02:31 watson kernel: [14108.254240] NIP [c0000000001f6a58] rcu_momentary_dyntick_idle+0x48/0x60
    Oct 25 17:02:31 watson kernel: [14108.254245] LR [c00000000026b734] multi_cpu_stop+0x174/0x240
    Oct 25 17:02:31 watson kernel: [14108.254251] Call Trace:
    Oct 25 17:02:31 watson kernel: [14108.254253] [c000001001237c10] [c000001001237c80] 0xc000001001237c80 (unreliable)
    Oct 25 17:02:31 watson kernel: [14108.254260] [c000001001237c80] [c00000000026b32c] cpu_stopper_thread+0x16c/0x280
    Oct 25 17:02:31 watson kernel: [14108.254267] [c000001001237d40] [c00000000017ad4c] smpboot_thread_fn+0x1ec/0x260
    Oct 25 17:02:31 watson kernel: [14108.254273] [c000001001237da0] [c00000000017403c] kthread+0x17c/0x190
    Oct 25 17:02:31 watson kernel: [14108.254280] [c000001001237e10] [c00000000000cf64] ret_from_kernel_thread+0x5c/0x64
    Oct 25 17:02:31 watson kernel: [14108.254287] Instruction dump:
    Oct 25 17:02:31 watson kernel: [14108.254289] 394a7aa4 39297980 7cc751ae e94d0030 7d295214 39090120 7c0004ac 39400004
    Oct 25 17:02:31 watson kernel: [14108.254301] 7ce04028 7cea3a14 7ce0412d 40c2fff4 <7c0004ac> 70e90002 4c820020 0fe00000
    Oct 25 17:02:31 watson kernel: [14110.585275] CPU 136 didn't respond to backtrace IPI, inspecting paca.
    Oct 25 17:02:31 watson kernel: [14110.585279] irq_soft_mask: 0x03 in_mce: 0 in_nmi: 0 current: 1813 (CPU 12/KVM)
    Oct 25 17:02:31 watson kernel: [14110.585284] Back trace of paca->saved_r1 (0xc00000180640f4c0) (possibly stale):
    Oct 25 17:02:31 watson kernel: [14110.585286] Call Trace:
    Oct 25 17:02:31 watson kernel: [14110.585378] task:rcu_sched state:R running task stack: 0 pid: 13 ppid: 2 flags:0x00000800
    Oct 25 17:02:31 watson kernel: [14110.585386] Call Trace:
    Oct 25 17:02:31 watson kernel: [14110.585388] [c00000000e0978d0] [c0000000001f71c0] rcu_implicit_dynticks_qs+0x0/0x370 (unreliable)
    Oct 25 17:02:31 watson kernel: [14110.585399] [c00000000e097ac0] [c00000000001b264] __switch_to+0x1d4/0x2e0
    Oct 25 17:02:31 watson kernel: [14110.585407] [c00000000e097b30] [c000000000cb9838] __schedule+0x2f8/0xbb0
    Oct 25 17:02:31 watson kernel: [14110.585416] [c00000000e097c00] [c000000000cba334] __cond_resched+0x64/0x90
    Oct 25 17:02:31 watson kernel: [14110.585424] [c00000000e097c30] [c0000000001f8670] force_qs_rnp+0xe0/0x2e0
    Oct 25 17:02:31 watson kernel: [14110.585433] [c00000000e097cd0] [c0000000001fc8a8] rcu_gp_kthread+0x9c8/0xc90
    Oct 25 17:02:31 watson kernel: [14110.585442] [c00000000e097da0] [c00000000017403c] kthread+0x17c/0x190
    Oct 25 17:02:31 watson kernel: [14110.585450] [c00000000e097e10] [c00000000000cf64] ret_from_kernel_thread+0x5c/0x64
    Oct 25 17:02:31 watson kernel: [14110.585462] Sending NMI from CPU 80 to CPUs 32:
    Oct 25 17:02:31 watson kernel: [14110.585469] NMI backtrace for cpu 32
    Oct 25 17:02:31 watson kernel: [14110.585473] CPU: 32 PID: 1289 Comm: in:imklog Tainted: G EL 5.14.0-0.bpo.2-powerpc64le #1 Debian 5.14.9-2~bpo11+2
    Oct 25 17:02:31 watson kernel: [14110.585477] NIP: 00007fff92bc3bbc LR: 00007fff92bc5e90 CTR: 00007fff92bc5bf0
    Oct 25 17:02:31 watson kernel: [14110.585480] REGS: c00000001c9bfe80 TRAP: 0500 Tainted: G EL (5.14.0-0.bpo.2-powerpc64le Debian 5.14.9-2~bpo11+2)
    Oct 25 17:02:31 watson kernel: [14110.585483] MSR: 900000000280f033 <SF,HV,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE> CR: 48004802 XER: 00000000
    Oct 25 17:02:31 watson kernel: [14110.585496] CFAR: 00007fff92bc3c34 IRQMASK: 0
    Oct 25 17:02:31 watson kernel: [14110.585496] GPR00: 0000000000000000 00007fff9220d940 00007fff92d37100 000000000000000c
    Oct 25 17:02:31 watson kernel: [14110.585496] GPR04: 00007fff9222f928 00007fff84000060 00007fff84097800 00007fff84000900
    Oct 25 17:02:31 watson kernel: [14110.585496] GPR08: 00007fff840008d0 00007fff84000050 00007fff8408f3a0 0000000000000007
    Oct 25 17:02:31 watson kernel: [14110.585496] GPR12: 0000000028004802 00007fff92236810 00007fff84097af0 0000000000000000
    Oct 25 17:02:31 watson kernel: [14110.585496] GPR16: 00007fff93040000 00007fff92f54478 0000000000000000 00007fff9222f160
    Oct 25 17:02:31 watson kernel: [14110.585496] GPR20: 00007fff9222f810 00007fff9220e4f0 0000000000000008 00007fff927156b0
    Oct 25 17:02:31 watson kernel: [14110.585496] GPR24: 00007fff92715638 00007fff927304f8 0000000000001fa0 0000000000000000
    Oct 25 17:02:31 watson kernel: [14110.585496] GPR28: 00007fff9220e529 000000000000006f 00007fff84000020 0000000000000030
    Oct 25 17:02:31 watson kernel: [14110.585530] NIP [00007fff92bc3bbc] 0x7fff92bc3bbc
    Oct 25 17:02:31 watson kernel: [14110.585534] LR [00007fff92bc5e90] 0x7fff92bc5e90

    Could you try a sysrq+w to get a trace of blocked tasks?

    Not sure how to send a magic SysRq over the IPMI serial console. Any idea?

    Are you able to shut down the guests and exit qemu normally?

    Not after the crash. I have to hard-reboot the whole machine.

    Adrian

    [1] https://www.kernel.org/doc/html/latest/admin-guide/lockup-watchdogs.html

    --
    .''`. John Paul Adrian Glaubitz
    : :' : Debian Developer - glaubitz@debian.org
    `. `' Freie Universitaet Berlin - glaubitz@physik.fu-berlin.de
    `- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Paul Adrian Glaubitz@21:1/5 to Michael Ellerman on Sat Oct 30 09:30:02 2021
    Hi Michael!

    On 10/28/21 08:39, Michael Ellerman wrote:
    That completed fine on my BE VM here.

    I ran these in two tmux windows:
    $ sbuild -d sid --arch=powerpc --no-arch-all gcc-11_11.2.0-10.dsc
    $ sbuild -d sid --arch=ppc64 --no-arch-all gcc-11_11.2.0-10.dsc

    Could you try gcc-10 instead? Its testsuite has crashed the host for me
    with a patched kernel twice now.

    $ dget -u https://deb.debian.org/debian/pool/main/g/gcc-10/gcc-10_10.3.0-12.dsc
    $ sbuild -d sid --arch=powerpc --no-arch-all gcc-10_10.3.0-12.dsc
    $ sbuild -d sid --arch=ppc64 --no-arch-all gcc-10_10.3.0-12.dsc

    Thanks,
    Adrian

    --
    .''`. John Paul Adrian Glaubitz
    : :' : Debian Developer - glaubitz@debian.org
    `. `' Freie Universitaet Berlin - glaubitz@physik.fu-berlin.de
    `- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael Ellerman@21:1/5 to John Paul Adrian Glaubitz on Mon Nov 1 08:20:02 2021
    John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> writes:
    Hi Michael!

    On 10/28/21 08:39, Michael Ellerman wrote:
    That completed fine on my BE VM here.

    I ran these in two tmux windows:
    $ sbuild -d sid --arch=powerpc --no-arch-all gcc-11_11.2.0-10.dsc
    $ sbuild -d sid --arch=ppc64 --no-arch-all gcc-11_11.2.0-10.dsc

    Could you try gcc-10 instead? Its testsuite has crashed the host for me
    with a patched kernel twice now.

    $ dget -u https://deb.debian.org/debian/pool/main/g/gcc-10/gcc-10_10.3.0-12.dsc
    $ sbuild -d sid --arch=powerpc --no-arch-all gcc-10_10.3.0-12.dsc
    $ sbuild -d sid --arch=ppc64 --no-arch-all gcc-10_10.3.0-12.dsc

    Sure, will give that a try.

    I was able to crash my machine over the weekend, building openjdk, but I haven't been able to reproduce it for ~24 hours now (I didn't change
    anything).


    Can you try running your guests with no SMT threads?

    I think one of your guests was using:

    -smp 32,sockets=1,dies=1,cores=8,threads=4

    Can you change that to:

    -smp 8,sockets=1,dies=1,cores=8,threads=1


    And something similar for the other guest(s).
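
    (For guests managed through libvirt/virt-manager rather than a raw qemu command line, which is an assumption here, the same change can be made with virsh; the guest name is hypothetical, and the relevant domain XML elements would look like this.)

    $ virsh edit buildd-guest
      <vcpu>8</vcpu>
      <cpu>
        <topology sockets='1' dies='1' cores='8' threads='1'/>
      </cpu>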

    If the system is stable with those settings that would be useful
    information, and would also mean you could use the system without it
    crashing semi regularly.

    cheers

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Paul Adrian Glaubitz@21:1/5 to Michael Ellerman on Mon Nov 1 08:40:01 2021
    Hi Michael!

    On 11/1/21 07:53, Michael Ellerman wrote:
    Sure, will give that a try.

    I was able to crash my machine over the weekend, building openjdk, but I haven't been able to reproduce it for ~24 hours now (I didn't change anything).

    I made another experiment and upgraded the host to 5.15-rc7, which contains your fixes, and made the guests build gcc-10. Interestingly, this time the gcc-10 build crashed the guest but didn't manage to crash the host. I will update the guest to 5.15-rc7 now as well and see how that goes.

    Can you try running your guests with no SMT threads?

    I think one of your guests was using:

    -smp 32,sockets=1,dies=1,cores=8,threads=4

    Can you change that to:

    -smp 8,sockets=1,dies=1,cores=8,threads=1


    And something similar for the other guest(s).

    Sure. I will try that later. But first I want to switch the guests to 5.15-rc7 as well.

    If the system is stable with those settings that would be useful
    information, and would also mean you could use the system without it
    crashing semi regularly.

    Gotcha.

    Adrian

    --
    .''`. John Paul Adrian Glaubitz
    : :' : Debian Developer - glaubitz@debian.org
    `. `' Freie Universitaet Berlin - glaubitz@physik.fu-berlin.de
    `- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Paul Adrian Glaubitz@21:1/5 to John Paul Adrian Glaubitz on Mon Nov 1 18:30:01 2021
    Hi Michael!

    On 11/1/21 08:37, John Paul Adrian Glaubitz wrote:
    I made another experiment and upgraded the host to 5.15-rc7, which contains your
    fixes, and made the guests build gcc-10. Interestingly, this time the gcc-10 build
    crashed the guest but didn't manage to crash the host. I will update the guest to
    5.15-rc7 now as well and see how that goes.

    OK, so I'm definitely able to crash the 5.15 kernel as well:

    [57031.404944] watchdog: BUG: soft lockup - CPU#24 stuck for 14957s! [migration/24:14]
    [57035.420898] watchdog: BUG: soft lockup - CPU#48 stuck for 14961s! [CPU 17/KVM:1815]
    [57047.456761] watchdog: BUG: soft lockup - CPU#152 stuck for 14841s! [CPU 13/KVM:1811]
    [57055.404670] watchdog: BUG: soft lockup - CPU#24 stuck for 14979s! [migration/24:14]
    [57059.420624] watchdog: BUG: soft lockup - CPU#48 stuck for 14983s! [CPU 17/KVM:1815]
    [57064.064573] rcu: INFO: rcu_sched self-detected stall on CPU
    [57064.064584] rcu: 48-....: (3338577 ticks this GP) idle=9f3/1/0x4000000000000002 softirq=77540/77540 fqs=15421
    [57064.064598] rcu: rcu_sched kthread timer wakeup didn't happen for 3988041 jiffies! g125265 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x200
    [57064.064606] rcu: Possible timer handling issue on cpu=136 timer-softirq=313650
    [57064.064611] rcu: rcu_sched kthread starved for 3988042 jiffies! g125265 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x200 ->cpu=136
    [57064.064618] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
    [57064.064624] rcu: RCU grace-period kthread stack dump:
    [57064.064665] rcu: Stack dump where RCU GP kthread last ran:
    [57071.456487] watchdog: BUG: soft lockup - CPU#152 stuck for 14863s! [CPU 13/KVM:1811]
    [57079.404396] watchdog: BUG: soft lockup - CPU#24 stuck for 15002s! [migration/24:14]

    And the gcc-10 testsuite is able to trigger the crash very reliably.

    Adrian

    --
    .''`. John Paul Adrian Glaubitz
    : :' : Debian Developer - glaubitz@debian.org
    `. `' Freie Universitaet Berlin - glaubitz@physik.fu-berlin.de
    `- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
    • From Michal Suchánek@21:1/5 to John Paul Adrian Glaubitz on Mon Nov 1 19:10:01 2021
    On Fri, Oct 29, 2021 at 02:33:12PM +0200, John Paul Adrian Glaubitz wrote:
    Hi Nicholas!

    On 10/29/21 02:41, Nicholas Piggin wrote:
    Soft lockup should mean it's taking timer interrupts still, just not scheduling. Do you have the hard lockup detector enabled as well? Is
    there anything stuck spinning on another CPU?



    Could you try a sysrq+w to get a trace of blocked tasks?

    Not sure how to send a magic SysRq over the IPMI serial console. Any idea?

    As on any serial console, sending a break should be equivalent to the magic
    SysRq key combo.

    https://tldp.org/HOWTO/Remote-Serial-Console-HOWTO/security-sysrq.html

    With ipmitool break is sent by typing ~B

    https://linux.die.net/man/1/ipmitool
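
    (If sending a break over SOL turns out to be fiddly, the same dump can also be requested from a root shell on the host, assuming the host is still responsive enough to run commands.)

    # echo 1 > /proc/sys/kernel/sysrq
    # echo w > /proc/sysrq-trigger

    The first write enables all SysRq functions; the second is equivalent to sysrq+w and dumps blocked tasks to the kernel log.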

    Thanks

    Michal

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Paul Adrian Glaubitz@21:1/5 to Michael Ellerman on Tue Jan 4 14:10:02 2022
    Hi Michael!

    Sorry for the long time without any responses. Shall we continue debugging this?

    We're currently running 5.15.x on the host system and the guests, and the testsuite
    for gcc-9 still reproducibly kills the KVM host.

    Adrian

    On 11/1/21 07:53, Michael Ellerman wrote:
    Sure, will give that a try.

    I was able to crash my machine over the weekend, building openjdk, but I haven't been able to reproduce it for ~24 hours now (I didn't change anything).


    Can you try running your guests with no SMT threads?

    I think one of your guests was using:

    -smp 32,sockets=1,dies=1,cores=8,threads=4

    Can you change that to:

    -smp 8,sockets=1,dies=1,cores=8,threads=1


    And something similar for the other guest(s).

    If the system is stable with those settings that would be useful
    information, and would also mean you could use the system without it
    crashing semi regularly.

    cheers
    --
    .''`. John Paul Adrian Glaubitz
    : :' : Debian Developer - glaubitz@debian.org
    `. `' Freie Universitaet Berlin - glaubitz@physik.fu-berlin.de
    `- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael Ellerman@21:1/5 to John Paul Adrian Glaubitz on Thu Jan 6 12:30:01 2022
    John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> writes:
    Hi Michael!

    Sorry for the long time without any responses. Shall we continue debugging this?

    Yes!

    Sorry also that I haven't been able to fix it yet; I had to stop chasing
    this bug and work on other things before the end of the year.

    We're currently running 5.15.x on the host system and the guests, and the testsuite
    for gcc-9 still reproducibly kills the KVM host.

    Have you been able to try the different -smp options I suggested?

    Can you separately test with (on the host):

    # echo 0 > /sys/module/kvm_hv/parameters/dynamic_mt_modes
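
    (The parameter is runtime-writable; reading it back confirms the change took effect.)

    $ cat /sys/module/kvm_hv/parameters/dynamic_mt_modes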


    cheers

    On 11/1/21 07:53, Michael Ellerman wrote:
    Sure, will give that a try.

    I was able to crash my machine over the weekend, building openjdk, but I
    haven't been able to reproduce it for ~24 hours now (I didn't change
    anything).


    Can you try running your guests with no SMT threads?

    I think one of your guests was using:

    -smp 32,sockets=1,dies=1,cores=8,threads=4

    Can you change that to:

    -smp 8,sockets=1,dies=1,cores=8,threads=1


    And something similar for the other guest(s).

    If the system is stable with those settings that would be useful
    information, and would also mean you could use the system without it
    crashing semi regularly.

    cheers
    --
    .''`. John Paul Adrian Glaubitz
    : :' : Debian Developer - glaubitz@debian.org
    `. `' Freie Universitaet Berlin - glaubitz@physik.fu-berlin.de
    `- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Paul Adrian Glaubitz@21:1/5 to Michael Ellerman on Fri Jan 7 12:30:01 2022
    Hi Michael!

    On 1/6/22 11:58, Michael Ellerman wrote:
    We're currently running 5.15.x on the host system and the guests, and the testsuite
    for gcc-9 still reproducibly kills the KVM host.

    Have you been able to try the different -smp options I suggested?

    Can you separately test with (on the host):

    # echo 0 > /sys/module/kvm_hv/parameters/dynamic_mt_modes

    I'm trying to turn off "dynamic_mt_modes" first and see if that makes any difference.

    I will report back.

    Adrian

    --
    .''`. John Paul Adrian Glaubitz
    : :' : Debian Developer - glaubitz@debian.org
    `. `' Freie Universitaet Berlin - glaubitz@physik.fu-berlin.de
    `- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Paul Adrian Glaubitz@21:1/5 to John Paul Adrian Glaubitz on Sun Jan 9 23:20:01 2022
    Hi Michael!

    On 1/7/22 12:20, John Paul Adrian Glaubitz wrote:
    Can you separately test with (on the host):

    # echo 0 > /sys/module/kvm_hv/parameters/dynamic_mt_modes

    I'm trying to turn off "dynamic_mt_modes" first and see if that makes any difference.

    I will report back.

    So far the machine is running stable now and the VM built gcc-9 without crashing the host. I will continue to monitor the machine and report back
    if it crashes, but it looks like this could be it.

    Adrian

    --
    .''`. John Paul Adrian Glaubitz
    : :' : Debian Developer - glaubitz@debian.org
    `. `' Freie Universitaet Berlin - glaubitz@physik.fu-berlin.de
    `- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Paul Adrian Glaubitz@21:1/5 to John Paul Adrian Glaubitz on Thu Jan 13 01:20:02 2022
    Hi Michael!

    On 1/9/22 23:17, John Paul Adrian Glaubitz wrote:
    On 1/7/22 12:20, John Paul Adrian Glaubitz wrote:
    Can you separately test with (on the host):

    # echo 0 > /sys/module/kvm_hv/parameters/dynamic_mt_modes

    I'm trying to turn off "dynamic_mt_modes" first and see if that makes any difference.

    I will report back.

    So far the machine is running stable now and the VM built gcc-9 without crashing the host. I will continue to monitor the machine and report back
    if it crashes, but it looks like this could be it.

    So, it seems that disabling "dynamic_mt_modes" actually did the trick; the host is no longer
    crashing. However, I have observed on two occasions now that the build VM is just suddenly
    off, as if someone had shut it down using the "force-off" option in the virt-manager user
    interface.

    Not sure why that happens.

    Adrian

    --
    .''`. John Paul Adrian Glaubitz
    : :' : Debian Developer - glaubitz@debian.org
    `. `' Freie Universitaet Berlin - glaubitz@physik.fu-berlin.de
    `- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Paul Adrian Glaubitz@21:1/5 to John Paul Adrian Glaubitz on Wed Jan 26 21:30:01 2022
    Hi Michael!

    On 1/13/22 01:17, John Paul Adrian Glaubitz wrote:
    On 1/9/22 23:17, John Paul Adrian Glaubitz wrote:
    On 1/7/22 12:20, John Paul Adrian Glaubitz wrote:
    Can you separately test with (on the host):

    # echo 0 > /sys/module/kvm_hv/parameters/dynamic_mt_modes

    I'm trying to turn off "dynamic_mt_modes" first and see if that makes any difference.

    I will report back.

    So far the machine is running stable now and the VM built gcc-9 without
    crashing the host. I will continue to monitor the machine and report back
    if it crashes, but it looks like this could be it.

    So, it seems that disabling "dynamic_mt_modes" actually did the trick; the host is no longer
    crashing. However, I have observed on two occasions now that the build VM is just suddenly
    off, as if someone had shut it down using the "force-off" option in the virt-manager user
    interface.

    Just as a heads-up. Ever since I set

    echo 0 > /sys/module/kvm_hv/parameters/dynamic_mt_modes

    on the host machine, I never saw the crash again. So the issue seems to be related to the
    dynamic_mt_modes feature.
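
    (A note for anyone applying the same workaround: the echo only lasts until reboot. A sketch of making it persistent, assuming kvm_hv is loaded as a module, as in Debian's stock kernels:)

    # /etc/modprobe.d/kvm-hv.conf
    options kvm_hv dynamic_mt_modes=0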

    Thanks,
    Adrian

    --
    .''`. John Paul Adrian Glaubitz
    : :' : Debian Developer - glaubitz@debian.org
    `. `' Freie Universitaet Berlin - glaubitz@physik.fu-berlin.de
    `- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Mike@21:1/5 to glaubitz@physik.fu-berlin.de on Thu Jan 27 17:00:02 2022
    I just made the huge mistake of hibernating and resuming; I'm going through
    the process of rescue and all. Thankfully I had a 2016 CD in the drive.
    I'll read up once the sheer panic settles.

    -Michael

    On Wed, Jan 26, 2022, 21:22 John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> wrote:

    Hi Michael!

    On 1/13/22 01:17, John Paul Adrian Glaubitz wrote:
    On 1/9/22 23:17, John Paul Adrian Glaubitz wrote:
    On 1/7/22 12:20, John Paul Adrian Glaubitz wrote:
    Can you separately test with (on the host):

    # echo 0 > /sys/module/kvm_hv/parameters/dynamic_mt_modes

    I'm trying to turn off "dynamic_mt_modes" first and see if that makes
    any difference.

    I will report back.

    So far the machine is running stable now and the VM built gcc-9 without
    crashing the host. I will continue to monitor the machine and report
    back
    if it crashes, but it looks like this could be it.

    So, it seems that disabling "dynamic_mt_modes" actually did the trick; the
    host is no longer crashing. However, I have observed on two occasions now
    that the build VM is just suddenly off, as if someone had shut it down using
    the "force-off" option in the virt-manager user interface.

    Just as a heads-up. Ever since I set

    echo 0 > /sys/module/kvm_hv/parameters/dynamic_mt_modes

    on the host machine, I never saw the crash again. So the issue seems to be related to the
    dynamic_mt_modes feature.

    Thanks,
    Adrian

    --
    .''`. John Paul Adrian Glaubitz
    : :' : Debian Developer - glaubitz@debian.org
    `. `' Freie Universitaet Berlin - glaubitz@physik.fu-berlin.de
    `- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913




    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)