• sparc64 kernel crashes, & using SAS/SATA drives instead of SCA/FC-AL

    From Romain Dolbeau@21:1/5 to All on Thu Jul 9 13:30:02 2020
    Hello,

    I've just recently got myself a Sun Blade 2500 Red, and while Debian
    Sparc64 installed perfectly from the May image (thanks!), I'm being
    bitten by repeated kernel crashes. Presumably this is the same bug
    that was reported in April-May by Miroslav (<https://lists.debian.org/debian-sparc/2020/03/msg00026.html>).

    So - has anyone made any progress on this, or are we still in need of a
    bisect? If the latter, is there any known way to quickly cause a crash
    to determine whether a tested kernel is good or bad?

    And related to something Meelis said (<https://lists.debian.org/debian-sparc/2020/04/msg00001.html>):
    > I might want to install Linux on E420R but do not have FC-AL disks with me.

    I have found a reasonably cheap way to work around such issues for
    Suns with an available 3.3V PCI[-X] or PCIe slot: flashing an x86
    LSI 1064/1068 (PCI) or 1068e (PCIe) card with the ROM file available
    from a Sun patch. See the rescue mailing list post at <http://www.sunhelp.org/pipermail/rescue/2020-July/142249.html> for
    details. As far as I can tell, any flashable card from eBay should
    let you boot from a SAS or even SATA drive (I have only tried SAS so far)
    instead of the harder-to-find SCA or FC-AL drives that came
    standard.

    Cordially,

    --
    Romain Dolbeau

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Romain Dolbeau@21:1/5 to All on Sun Jul 12 14:00:02 2020
    On Thu, Jul 9, 2020 at 13:26, Romain Dolbeau <romain@dolbeau.org> wrote:
    > So - has anyone made any progress on this, or are we still in need of a
    > bisect? If the latter, is there any known way to quickly cause a crash
    > to determine whether a tested kernel is good or bad?

    I wanted to have a go at bisecting, so first I recovered a 4.17
    package in the archive on my T5120 to have something that should work
    as a backup - and in so doing, removed the Debian kernels I had... maybe
    that was a mistake, as I could have been running 5.6 at the time; I'm
    not sure.

    After failing to deliberately crash my home-cross-compiled vanilla 5.2
    on the Sun Blade 2500 Red, I installed the current kernel in Sid:

    linux-image-5.7.0-1-sparc64-smp 5.7.6-1

    And managed to do a full rebuild of GCC 10.1 (starting with recent
    binutils/gmp/mpfr/mpc/isl before a complete 3-stage bootstrap), and in
    parallel do a git checkout of Linux, some package installation and a
    configure/rebuild of ZFS. It took almost a day (with a shutdown/reboot
    in the middle of stage 2) with no crash in sight, though I tried via SSH only
    (the XVR-600 isn't supported in X). The machine has been rock-solid so
    far (running from a SAS drive on a flashed 1068)...

    For those with crashes - could you try the current kernel and see if
    it fixes the problem? And if it doesn't, what kind of workload do you
    have when the kernel crashes? I've seen the crashes myself but can't
    reproduce them anymore and I don't have the archive of the 5.6 I might
    have been running at the time...

    Cordially,

    --
    Romain Dolbeau

  • From John Paul Adrian Glaubitz@21:1/5 to Romain Dolbeau on Sun Jul 12 14:40:02 2020
    On 7/12/20 1:57 PM, Romain Dolbeau wrote:
    > After failing to deliberately crash my home-cross-compiled vanilla 5.2
    > on the Sun Blade 2500 Red, I installed the current kernel in Sid:
    >
    > linux-image-5.7.0-1-sparc64-smp 5.7.6-1
    >
    > And managed to do a full rebuild of GCC 10.1 (starting with recent
    > binutils/gmp/mpfr/mpc/isl before a complete 3-stage bootstrap), and in
    > parallel do a git checkout of Linux, some package installation and a
    > configure/rebuild of ZFS. Took almost a day (with a shutdown/reboot
    > in the middle of stage 2), no crash in sight, though I tried via SSH only
    > (the XVR-600 isn't supported in X). The machine has been rock-solid so
    > far (running from a SAS drive on a flashed 1068)...
    >
    > For those with crashes - could you try the current kernel and see if
    > it fixes the problem? And if it doesn't, what kind of workload do you
    > have when the kernel crashes? I've seen the crashes myself but can't
    > reproduce them anymore and I don't have the archive of the 5.6 I might
    > have been running at the time...

    Sounds good. I will give it a try.

    Adrian

    --
    .''`. John Paul Adrian Glaubitz
    : :' : Debian Developer - glaubitz@debian.org
    `. `' Freie Universitaet Berlin - glaubitz@physik.fu-berlin.de
    `- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913

  • From John Paul Adrian Glaubitz@21:1/5 to Romain Dolbeau on Mon Jul 13 09:10:01 2020
    On 7/13/20 8:17 AM, Romain Dolbeau wrote:
    > I really don't understand; I had two crashes before but cannot induce one now...

    I switched one of the buildds which has an UltraSPARC IIIi to kernel 5.7.6 and got this:

    [ 7483.258336] Unable to handle kernel paging request at virtual address 71e76b6501458000
    [ 7483.362427] tsk->{mm,active_mm}->context = 000000000000018a
    [ 7483.435619] tsk->{mm,active_mm}->pgd = fff0000238250000
    [ 7483.504245]               \|/ ____ \|/
                                 "@'/ .. \`@"
                                 /_| \__/ |_\
                                    \__U_/
    [ 7483.504249] kworker/0:3(2209): Oops [#1]
    [ 7483.504257] CPU: 0 PID: 2209 Comm: kworker/0:3 Tainted: G E 5.7.0-1-sparc64-smp #1 Debian 5.7.6-1
    [ 7483.504274] Workqueue: memcg_kmem_cache kmemcg_workfn
    [ 7483.504281] TSTATE: 0000004480e01604 TPC: 000000000065eba0 TNPC: 000000000065eba4 Y: 00000000 Tainted: G E
    [ 7483.504292] TPC: <deactivate_slab.isra.0+0x60/0x660>
    [ 7483.504295] g0: 0000000000000000 g1: 0000000000000018 g2: 00000000000000bc g3: 71e76b6501458ab8
    [ 7483.504299] g4: fff000003e1a6b40 g5: fff000023d90a000 g6: fff0000238b38000 g7: 10772cbe72ed3bf5
    [ 7483.504302] o0: 0000000000000200 o1: 000c00000482dc60 o2: 0000000000000001 o3: f000020145801800
    [ 7483.504305] o4: 00fff00002014580 o5: 000000fff0000201 sp: fff0000238b3ac71 ret_pc: 000000000065ec18
    [ 7483.504309] RPC: <deactivate_slab.isra.0+0xd8/0x660>
    [ 7483.504313] l0: 00bc019900000000 l1: 00bc019900000000 l2: 00000000ff000000 l3: 000000000000ff00
    [ 7483.504316] l4: 0000000000ff0000 l5: 000000ff00000000 l6: 0000ff0000000000 l7: 00ff000000000000
    [ 7483.504320] i0: fff000003fc84680 i1: 000c00000482dc60 i2: 71e76b6501458aa0 i3: 69672e6403457a5f
    [ 7483.504323] i4: 71e76b6501458aa0 i5: fff0000201458000 i6: fff0000238b3adb1 i7: 000000000065f5a0
    [ 7483.504327] I7: <flush_cpu_slab+0x40/0x80>
    [ 7483.504330] Call Trace:
    [ 7483.504335] [000000000065f5a0] flush_cpu_slab+0x40/0x80
    [ 7483.504344] [000000000050deec] on_each_cpu_cond_mask+0x6c/0x80
    [ 7483.504349] [000000000050df20] on_each_cpu_cond+0x20/0x40
    [ 7483.504354] [00000000006637a0] __kmem_cache_shrink+0x20/0x2a0
    [ 7483.504359] [0000000000663a2c] __kmemcg_cache_deactivate_after_rcu+0xc/0x60
    [ 7483.504364] [000000000061080c] kmemcg_cache_deactivate_after_rcu+0xc/0x40
    [ 7483.504369] [00000000006107e0] kmemcg_workfn+0x20/0x40
    [ 7483.504379] [000000000048bc58] process_one_work+0x1b8/0x4e0
    [ 7483.504383] [000000000048c0c0] worker_thread+0x140/0x540
    [ 7483.504390] [000000000049279c] kthread+0xdc/0x120
    [ 7483.504399] [00000000004060a4] ret_from_fork+0x1c/0x2c
    [ 7483.504403] [0000000000000000] 0x0
    [ 7483.504406] Disabling lock debugging due to kernel taint
    [ 7483.504410] Caller[000000000065f5a0]: flush_cpu_slab+0x40/0x80
    [ 7483.504415] Caller[000000000050deec]: on_each_cpu_cond_mask+0x6c/0x80
    [ 7483.504419] Caller[000000000050df20]: on_each_cpu_cond+0x20/0x40
    [ 7483.504423] Caller[00000000006637a0]: __kmem_cache_shrink+0x20/0x2a0
    [ 7483.504428] Caller[0000000000663a2c]: __kmemcg_cache_deactivate_after_rcu+0xc/0x60
    [ 7483.504433] Caller[000000000061080c]: kmemcg_cache_deactivate_after_rcu+0xc/0x40
    [ 7483.504437] Caller[00000000006107e0]: kmemcg_workfn+0x20/0x40
    [ 7483.504442] Caller[000000000048bc58]: process_one_work+0x1b8/0x4e0
    [ 7483.504445] Caller[000000000048c0c0]: worker_thread+0x140/0x540
    [ 7483.504449] Caller[000000000049279c]: kthread+0xdc/0x120
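
    To compare an oops like the one above against another report, the call
    trace can be reduced to its function names. The following is only a
    minimal sketch: it assumes the line layout shown in this trace, and the
    helper names (`call_trace_symbols`, `same_crash_path`) and the "top four
    frames agree" heuristic are hypothetical, not an established tool.

```python
import re

# Matches call-trace entries such as:
#   [ 7483.504335] [000000000065f5a0] flush_cpu_slab+0x40/0x80
# "Caller[...]" lines and the terminating "0x0" entry do not match.
TRACE_RE = re.compile(
    r"\[\s*\d+\.\d+\]\s+\[(?P<addr>[0-9a-f]{16})\]\s+"
    r"(?P<sym>[A-Za-z_][\w.]*)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)


def call_trace_symbols(log_text):
    """Return the function names of the call-trace entries, innermost first."""
    return [m.group("sym") for m in TRACE_RE.finditer(log_text)]


def same_crash_path(log_a, log_b, depth=4):
    """Hypothetical heuristic: two oopses match if their top `depth` frames agree."""
    a = call_trace_symbols(log_a)[:depth]
    b = call_trace_symbols(log_b)[:depth]
    return bool(a) and a == b


# Excerpt of the trace posted in this thread.
SAMPLE = """\
[ 7483.504330] Call Trace:
[ 7483.504335] [000000000065f5a0] flush_cpu_slab+0x40/0x80
[ 7483.504344] [000000000050deec] on_each_cpu_cond_mask+0x6c/0x80
[ 7483.504349] [000000000050df20] on_each_cpu_cond+0x20/0x40
[ 7483.504354] [00000000006637a0] __kmem_cache_shrink+0x20/0x2a0
[ 7483.504403] [0000000000000000] 0x0
"""

if __name__ == "__main__":
    print(call_trace_symbols(SAMPLE))
```

    Matching on symbol names rather than addresses is deliberate, since
    addresses change between kernel builds while the function names in the
    crashing path usually do not.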

    Adrian

  • From Romain Dolbeau@21:1/5 to All on Mon Jul 13 08:20:01 2020
    On Sun, Jul 12, 2020 at 14:38, John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> wrote:
    > Sounds good. I will give it a try.

    Thanks to Connor's message about Firefox, I discovered
    snapshot.debian.org and could install the 5.6 that I couldn't find
    before.

    But I couldn't get it to crash with my GCC rebuild script; I gave up
    near the end of stage 1 (so it had compiled, checked and installed
    binutils/gmp/mpfr/mpc/isl).

    Then I went back to 5.7 and tried again from the console, thinking
    maybe the crash had to do with that. Again, no crash.

    I really don't understand; I had two crashes before but cannot induce one now...

    On Sun, Jul 12, 2020 at 15:27, Gregor Riepl <onitake@gmail.com> wrote:
    > From what I can gather on the net, the GPU on this card is a 3DLabs Wildcat

    I believe so; thanks for the links. I only mentioned the lack of X as
    it could be a factor in the crashes.
    If I really wanted to run X11 there in Linux (it should work on
    Solaris), I have a spare XVR-100 to swap in that should be OK.

    Cordially,

    --
    Romain Dolbeau

  • From John Paul Adrian Glaubitz@21:1/5 to John Paul Adrian Glaubitz on Mon Jul 13 09:30:01 2020
    On 7/13/20 9:01 AM, John Paul Adrian Glaubitz wrote:
    > On 7/13/20 8:17 AM, Romain Dolbeau wrote:
    >> I really don't understand; I had two crashes before but cannot induce one now...
    >
    > I switched one of the buildds which has an UltraSPARC IIIi to kernel 5.7.6 and
    > got this:

    And now the machine is no longer reachable.

    Please do thorough tests in the future before claiming a bug has been
    fixed. I will now have to spend several hours getting the machine working
    again, because I don't have access to the console and have to give
    instructions to its owner.

    *sigh*

  • From John Paul Adrian Glaubitz@21:1/5 to Romain Dolbeau on Mon Jul 13 09:40:02 2020
    On 7/13/20 9:33 AM, Romain Dolbeau wrote:
    > On Mon, Jul 13, 2020 at 09:01, John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> wrote:
    >> I switched one of the buildds which has an UltraSPARC IIIi to kernel 5.7.6 and
    >> got this:
    >
    > Looks like the crash I had, as far as I can remember.
    >
    > Any idea about the workload at the time?
    > Is there a lot of parallelism in the build system? Could the memory
    > have been saturated somehow?
    > On my side I was running at -j3 to force a bit of context switching
    > (the SB has dual CPUs), but there's 8 GiB in there and I don't think I got
    > close to the limit...

    Try running the gcc or glibc testsuites; these will usually kill the machine.
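
    For anyone attempting the bisect discussed in this thread, the good/bad
    decision after such a test run could be scripted. The following is only a
    minimal sketch: it assumes the console log of the test run was captured
    somewhere that survives the crash (for example over a serial console), and
    the marker strings are simply those visible in the oops posted in this
    thread. The names `OOPS_MARKERS` and `bisect_verdict` are hypothetical.
    Exit status 0 for good and 1 for bad follows the convention that
    `git bisect run` expects.

```python
import sys

# Marker strings taken from the oops quoted in this thread (an assumption:
# other crash signatures would need their own markers).
OOPS_MARKERS = (
    "Unable to handle kernel paging request",
    "Oops [#",
)

# git bisect run treats exit status 0 as good and 1..127 (except 125) as bad.
GOOD, BAD = 0, 1


def bisect_verdict(log_text):
    """Return BAD if the log contains any oops marker, otherwise GOOD."""
    return BAD if any(marker in log_text for marker in OOPS_MARKERS) else GOOD


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python3 verdict.py saved-console.log
    with open(sys.argv[1], errors="replace") as fh:
        sys.exit(bisect_verdict(fh.read()))
```

    Because the crash is intermittent, a GOOD verdict only means no oops was
    seen during that particular run; a kernel should probably survive several
    testsuite runs before being marked good.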

    Adrian

  • From Romain Dolbeau@21:1/5 to All on Mon Jul 13 09:40:02 2020
    On Mon, Jul 13, 2020 at 09:01, John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> wrote:
    > I switched one of the buildds which has an UltraSPARC IIIi to kernel 5.7.6 and
    > got this:

    Looks like the crash I had, as far as I can remember.

    Any idea about the workload at the time?
    Is there a lot of parallelism in the build system? Could the memory
    have been saturated somehow?
    On my side I was running at -j3 to force a bit of context switching
    (the SB has dual CPUs), but there's 8 GiB in there and I don't think I got
    close to the limit...

    Cordially,

    --
    Romain Dolbeau
