• 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

    From John Paul Adrian Glaubitz@21:1/5 to Riccardo Mottola on Tue Mar 9 13:40:01 2021
    Hello Riccardo!

    On 3/9/21 1:23 PM, Riccardo Mottola wrote:
    while I was able to "install" correctly using a slightly older ISO, I get not a bootable
    system. The kernel appears to crash very early during boot.

    I think this is more likely a hardware issue. We haven't seen any machines crashing that
    early. Please make sure the RAM modules in this machine are working properly.

    Adrian

    --
    .''`. John Paul Adrian Glaubitz
    : :' : Debian Developer - glaubitz@debian.org
    `. `' Freie Universitaet Berlin - glaubitz@physik.fu-berlin.de
    `- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Riccardo Mottola@21:1/5 to All on Tue Mar 9 13:30:02 2021
    Hi all,

    While I was able to "install" correctly using a slightly older ISO, I
    do not get a bootable system. The kernel appears to crash very early during
    boot.

    Does anybody else have this issue?

    Booting `Debian GNU/Linux'

    Loading Linux 5.10.0-4-sparc64-smp ...
    Loading initial ramdisk ...

    [ 26.900156] sd 2:1:0:0: [sda] No Caching mode page found
    [ 26.900336] sd 2:1:0:0: [sda] Assuming drive cache: write through
    /dev/sda2: clean, 31420/4276224 files, 659826/17089844 blocks
    [ 30.362550] Unable to handle kernel NULL pointer dereference
    [ 30.362722] tsk->{mm,active_mm}->context = 00000000000000ab
    [ 30.362818] tsk->{mm,active_mm}->pgd = ffff80000f258000
    [ 30.363585] Kernel panic - not syncing: Aiee, killing interrupt handler!
    [ 30.363740] OOPS: Bogus kernel PC [00000000000007c0] in fault handler
    [ 30.363747] OOPS: RPC [000000000042c614]
    [ 30.363766] OOPS: RPC <arch_cpu_idle+0x74/0xc0>
    [ 30.363773] OOPS: Fault was to vaddr[7c0]
    [ 30.363787] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G D E 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1
    [ 30.363792] Call Trace:
    [ 30.363808] [<0000000000c5394c>] do_sparc64_fault+0xa4c/0xa80
    [ 30.363829] [<0000000000407714>] sparc64_realfault_common+0x10/0x20
    [ 30.363839] [<00000000000007c0>] 0x7c0
    [ 30.363852] [<0000000000c519a8>] default_idle_call+0x48/0x140
    [ 30.363865] [<00000000004a7b40>] do_idle+0xe0/0x1a0
    [ 30.363878] [<00000000004a7e5c>] cpu_startup_entry+0x1c/0x80
    [ 30.363899] [<0000000000c4b278>] rest_init+0xb8/0xc8
    [ 30.363915] [<0000000000fe26a4>] arch_call_rest_init+0xc/0x1c
    [ 30.363930] [<0000000000fe2d40>] start_kernel+0x628/0x640
    [ 30.363946] [<0000000000fe532c>] start_early_boot+0x2a0/0x2b0
    [ 30.363962] [<0000000000c4b1a0>] tlb_fixup_done+0x4c/0x6c
    [ 30.363972] [<000000000016a60c>] 0x16a60c
    [ 30.363978] Unable to handle kernel NULL pointer dereference
    [ 30.363984] tsk->{mm,active_mm}->context = 00000000000000b5
    [ 30.363990] tsk->{mm,active_mm}->pgd = ffff800014594000
    [ 30.363997] \|/ ____ \|/
    [ 30.363997] "@'/ .. \`@"
    [ 30.363997] /_| \__/ |_\
    [ 30.363997] \__U_/
    [ 30.364004] swapper/0(0): Oops [#2]
    [ 30.364017] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G D E 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1
    [ 30.364027] TSTATE: 0000004480001600 TPC: 00000000000007c0 TNPC: 00000000000007c4 Y: 00000000 Tainted: G D
    [ 30.364036] TPC: <0x7c0>
    [ 30.364044] g0: 0000000040004059 g1: 0000000000000016 g2: 00000000f0200000 g3: 00000000fff78000
    [ 30.364053] g4: 0000000000005a20 g5: ffff8003fd79c000 g6: 0000000000e80000 g7: 00000000000043ba
    [ 30.364061] o0: 00000000000007c0 o1: 0000000000000000 o2: 0000000000000000 o3: 0000000000000000
    [ 30.364070] o4: 0000000000000000 o5: 0000000000000000 sp: 0000000000e831a1 ret_pc: 000000000042c614
    [ 30.364084] RPC: <arch_cpu_idle+0x74/0xc0>
    [ 30.364093] l0: 0000000000f8b7d8 l1: 000000004000407c l2: 0000000040004059 l3: 0000000000000040
    [ 30.364102] l4: 00000000f027e7f8 l5: 0000000040004128 l6: 00000000000ed000 l7: 00000000f025cfd8
    [ 30.364110] i0: 000000000000000e i1: 0000000000e80008 i2: 0000000000004000 i3: 00000000000007c0
    [ 30.364118] i4: 00000000fef42ff8 i5: 00000000fef41800 i6: 0000000000e83251 i7: 0000000000c519a8
    [ 30.364131] I7: <default_idle_call+0x48/0x140>
    [ 30.364137] Call Trace:
    [ 30.364150] [<0000000000c519a8>] default_idle_call+0x48/0x140
    [ 30.364162] [<00000000004a7b40>] do_idle+0xe0/0x1a0
    [ 30.364175] [<00000000004a7e5c>] cpu_startup_entry+0x1c/0x80
    [ 30.364191] [<0000000000c4b278>] rest_init+0xb8/0xc8
    [ 30.364207] [<0000000000fe26a4>] arch_call_rest_init+0xc/0x1c
    [ 30.364221] [<0000000000fe2d40>] start_kernel+0x628/0x640
    [ 30.364236] [<0000000000fe532c>] start_early_boot+0x2a0/0x2b0
    [ 30.364252] [<0000000000c4b1a0>] tlb_fixup_done+0x4c/0x6c
    [ 30.364262] [<000000000016a60c>] 0x16a60c
    [ 30.364276] Caller[0000000000c519a8]: default_idle_call+0x48/0x140
    [ 30.364288] Caller[00000000004a7b40]: do_idle+0xe0/0x1a0
    [ 30.364300] Caller[00000000004a7e5c]: cpu_startup_entry+0x1c/0x80
    [ 30.364315] Caller[0000000000c4b278]: rest_init+0xb8/0xc8
    [ 30.364330] Caller[0000000000fe26a4]: arch_call_rest_init+0xc/0x1c
    [ 30.364343] Caller[0000000000fe2d40]: start_kernel+0x628/0x640
    [ 30.364358] Caller[0000000000fe532c]: start_early_boot+0x2a0/0x2b0
    [ 30.364373] Caller[0000000000c4b1a0]: tlb_fixup_done+0x4c/0x6c
    [ 30.364383] Caller[000000000016a60c]: 0x16a60c
    [ 30.364387] Instruction DUMP:
    [ 30.364397] Unable to handle kernel NULL pointer dereference
    [ 30.364404] tsk->{mm,active_mm}->context = 00000000000000b5
    [ 30.364409] tsk->{mm,active_mm}->pgd = ffff800014594000
    [ 30.364416] \|/ ____ \|/
    [ 30.364416] "@'/ .. \`@"
    [ 30.364416] /_| \__/ |_\
    [ 30.364416] \__U_/
    [ 30.364422] swapper/0(0): Oops [#3]
    [ 30.364436] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G D E 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1
    [ 30.364447] TSTATE: 0000008880001604 TPC: 0000000000c418a8 TNPC: 0000000000c418ac Y: 00000000 Tainted: G D
    [ 30.364469] TPC: <die_if_kernel+0x12c/0x260>
    [ 30.364479] g0: 0000000000000004 g1: fffffffffffffff4 g2: 0000000000f29340 g3: 00000000ffffe221
    [ 30.364487] g4: 0000000000e9a680 g5: ffff8003fd79c000 g6: 0000000000e80000 g7: 000000000000000e
    [ 30.364495] o0: 0000000000d77d78 o1: 0000000000000020 o2: 000000000016a60c o3: 0000000000000020
    [ 30.364504] o4: 0000004480001600 o5: 000000000109fc00 sp: 0000000000e82df1 ret_pc: 0000000000c41838
    [ 30.364519] RPC: <die_if_kernel+0xbc/0x260>
    [ 30.364528] l0: 0000000000000214 l1: 000000004000407c l2: 000000000040770c l3: 0000000000000000

  • From John Paul Adrian Glaubitz@21:1/5 to Riccardo Mottola on Tue Mar 9 18:40:02 2021
    Hi!

    On 3/9/21 6:26 PM, Riccardo Mottola wrote:
    John Paul Adrian Glaubitz wrote:
    while I was able to "install" correctly using a slightly older ISO, I get not a bootable
    system. The kernel appears to crash very early during boot.
    I think this is more likely a hardware issue. We haven't seen any machines crashing that
    early. Please make sure the RAM modules in this machine are working properly.

    I don't think so... I think it is a Kernel issue, since with kernel 5.9.0-2-sparc64-smp #1 SMP Debian 5.9.6-1 (2020-11-08) sparc64 GNU/Linux

    the machine is performing fine with network, disk and compiler usage on all 32 CPUs.

    Then you need to bisect the kernel as I don't have any means to reproduce the issue.

    Adrian

  • From Riccardo Mottola@21:1/5 to John Paul Adrian Glaubitz on Tue Mar 9 18:30:02 2021
    Hi,

    John Paul Adrian Glaubitz wrote:
    while I was able to "install" correctly using a slightly older ISO, I get not a bootable
    system. The kernel appears to crash very early during boot.
    I think this is more likely a hardware issue. We haven't seen any machines crashing that
    early. Please make sure the RAM modules in this machine are working properly.

    I don't think so... I think it is a Kernel issue, since with kernel 5.9.0-2-sparc64-smp #1 SMP Debian 5.9.6-1 (2020-11-08) sparc64 GNU/Linux

    the machine is performing fine with network, disk and compiler usage on
    all 32 CPUs. I tried heavy load of parallel compilations, using git on
    large repositories as well as using remote X applications at the same
    time, a combination I know tends to show issues on systems, without
    problems! Not a single error in syslog.
    Machine power-up and self-tests are fine too.

    If I remember, there is a repository of various pre-compiled kernel
    versions: maybe there are some releases between the two kernels I can
    try and do some easy rough bisecting.

    So I'd say RAM, CPUs, disk and Ethernet are working quite fine.

    Riccardo

  • From Frank Scheiner@21:1/5 to John Paul Adrian Glaubitz on Tue Mar 9 21:40:01 2021
    Hi guys,

    On 09.03.21 18:31, John Paul Adrian Glaubitz wrote:
    Hi!

    On 3/9/21 6:26 PM, Riccardo Mottola wrote:
    John Paul Adrian Glaubitz wrote:
    while I was able to "install" correctly using a slightly older ISO, I get not a bootable
    system. The kernel appears to crash very early during boot.
    I think this is more likely a hardware issue. We haven't seen any machines crashing that
    early. Please make sure the RAM modules in this machine are working properly.

    I don't think so... I think it is a Kernel issue, since with kernel
    5.9.0-2-sparc64-smp #1 SMP Debian 5.9.6-1 (2020-11-08) sparc64 GNU/Linux

    the machine is performing fine with network, disk and compiler usage on all 32 CPUs.

    Then you need to bisect the kernel as I don't have any means to reproduce the issue.

    I have a T1000 with which I could try to reproduce Riccardo's issues.
    Hardware wise they should be pretty similar. As the T1000 doesn't have a
    CDROM, I'll try to netboot a few newer kernels and report my findings.
    Will take me until next week though, as the machine is in (cold) storage
    now.

    @Adrian:
    Aren't there some build servers using UltraSPARC T2 or T2+? Do they run
    with the latest kernels?

    Cheers,
    Frank

  • From John Paul Adrian Glaubitz@21:1/5 to Frank Scheiner on Tue Mar 9 22:20:01 2021
    On 3/9/21 9:38 PM, Frank Scheiner wrote:
    I have a T1000 with which I could try to reproduce Riccardo's issues. Hardware wise they should be pretty similar. As the T1000 doesn't have a CDROM, I'll try to netboot a few newer kernels and report my findings.
    Will take me until next week though, as the machine is in (cold) storage
    now.

    @Adrian:
    Aren't there some build servers using UltraSPARC T2 or T2+? Do they run
    with the latest kernels?

    The oldest buildd we are running is a T5120 and that's a T2.

    We have an older UltraSPARC IIIi that has issues with newer kernels, but usually only after longer operation and the issue might be related to the
    bug that was just fixed recently by Rob Gardner.

    Adrian

  • From John Paul Adrian Glaubitz@21:1/5 to Frank Scheiner on Tue Mar 9 23:30:01 2021
    On 3/9/21 10:18 PM, Frank Scheiner wrote:
    The oldest buildd we are running is a T5120 and that's a T2.

    And these don't show the problems Riccardo's T1 powered T2000 has?

    No, the machine runs stable.

    We have an older UltraSPARC IIIi that has issues with newer kernels, but
    usually only after longer operation and the issue might be related to the
    bug that was just fixed recently by Rob Gardner.

    Which kernel version will have this bug (which one?) fixed, 5.11.x? I
    can also check with one of my UltraSPARC IIIi powered systems, too, next week.

    I have not uploaded that kernel yet, I have it built locally, PR here [1].

    Adrian

    [1] https://salsa.debian.org/kernel-team/linux/-/merge_requests/339

  • From John Paul Adrian Glaubitz@21:1/5 to John Paul Adrian Glaubitz on Wed Mar 10 08:40:02 2021
    On 3/9/21 11:20 PM, John Paul Adrian Glaubitz wrote:
    Which kernel version will have this bug (which one?) fixed, 5.11.x? I
    can also check with one of my UltraSPARC IIIi powered systems, too, next
    week.

    I have not uploaded that kernel yet, I have it built locally, PR here [1].

    The patch is now in Linus' tree so it will be part of 5.12 [1].

    Adrian

    [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e5e8b80d352ec999d2bba3ea584f541c83f4ca3f

  • From Frank Scheiner@21:1/5 to Riccardo Mottola on Wed Mar 10 10:40:02 2021
    Hi Riccardo,

    On 10.03.21 10:17, Riccardo Mottola wrote:
    Frank Scheiner wrote:
    We have an older UltraSPARC IIIi that has issues with newer kernels, but
    usually only after longer operation and the issue might be related to the
    bug that was just fixed recently by Rob Gardner.

    Which kernel version will have this bug (which one?) fixed, 5.11.x? I
    can also check with one of my UltraSPARC IIIi powered systems, too, next
    week.

    as written in the title, I have issues with:
    5.10.0-4-sparc64-smp #1 Debian 5.10.19-1

    I know.

    If I remember there was a repository with many snapshots of different versions, already as package, which one can test quickly. That way we
    can restrict breakage range without git bisect.

    Do you have a link?

    I assume you mean "http://snapshot.debian.org" .

    Cheers,
    Frank

  • From John Paul Adrian Glaubitz@21:1/5 to Riccardo Mottola on Wed Mar 10 10:50:02 2021
    On 3/10/21 10:17 AM, Riccardo Mottola wrote:
    If I remember there was a repository with many snapshots of different versions,
    already as package, which one can test quickly. That way we can restrict breakage
    range without git bisect.

    Well, that doesn't really help you though. You want to find the commit in question,
    just the range isn't enough to solve the issue.

    If you have a fast second machine available, bisecting the problem shouldn't take
    too long.

    Adrian

  • From Riccardo Mottola@21:1/5 to Frank Scheiner on Wed Mar 10 10:20:01 2021
    Hi Frank,


    Frank Scheiner wrote:
    We have an older UltraSPARC IIIi that has issues with newer kernels, but
    usually only after longer operation and the issue might be related to the
    bug that was just fixed recently by Rob Gardner.

    Which kernel version will have this bug (which one?) fixed, 5.11.x? I
    can also check with one of my UltraSPARC IIIi powered systems, too, next week.

    as written in the title, I have issues with:
    5.10.0-4-sparc64-smp #1 Debian 5.10.19-1

    If I remember there was a repository with many snapshots of different
    versions, already as package, which one can test quickly. That way we
    can restrict breakage range without git bisect.

    Do you have a link?

    Riccardo

  • From Riccardo Mottola@21:1/5 to Frank Scheiner on Thu Mar 11 23:10:02 2021
    Hi Frank!

    I suppose the Niagara CPU gives the kernel issue

    Frank Scheiner wrote:
    If I remember there was a repository with many snapshots of different
    versions, already as package, which one can test quickly. That way we
    can restrict breakage range without git bisect.

    Do you have a link?

    I assume you mean "http://snapshot.debian.org" .

    Exactly. With this I did some more tests.

    Still Works:
    5.9.0-4-sparc64-smp #1 SMP Debian 5.9.11-1 (2020-11-27)
    5.9.0-5-sparc64-smp #1 SMP Debian 5.9.15-1 (2020-12-17)

    Broken:

    linux-image-5.10.0-trunk-sparc64-smp_5.10.2-1~exp1_sparc64.deb

    So later 5.9-series kernels continue to work, and even very early 5.10 kernels do not.
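
    The results so far can be tabulated to make the regression window explicit. A minimal sketch, assuming GNU `sort -V` (version sort) is available; the version/status pairs are the ones reported in this thread, and the helper and variable names are only illustrative:

    ```shell
    # Version/status pairs as reported in this thread
    # (upstream versions of the tested Debian packages).
    results() {
        printf '%s\n' \
            '5.9.6 good' \
            '5.9.11 good' \
            '5.9.15 good' \
            '5.10.2 bad' \
            '5.10.19 bad'
    }

    # Newest version known to boot, oldest version known to crash.
    last_good=$(results | awk '$2 == "good" { print $1 }' | sort -V | tail -n 1)
    first_bad=$(results | awk '$2 == "bad" { print $1 }' | sort -V | head -n 1)

    echo "regression window: after $last_good, up to and including $first_bad"
    ```

    This only narrows the range to released versions; finding the offending commit still needs a proper bisect.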

    Do you know if I can reset the system via the serial console?
    I tried sending a break on the serial console, but the errors just keep running.
    The break is received, since I see it as an SC Alert, but I am not put into the console. Maybe there is some further trick on these newer machines? I am
    used to old SparcStations and UltraSparc Netras, where that was sufficient.
    It is inconvenient to power-cycle at every hang, since at every power-on
    it runs a self-test which lasts minutes :)

    Riccardo

  • From Riccardo Mottola@21:1/5 to John Paul Adrian Glaubitz on Thu Mar 11 23:20:01 2021
    Hi Adrian

    John Paul Adrian Glaubitz wrote:
    Well, that doesn't really help you though. You want to find the commit in question,
    just the range isn't enough to solve the issue.

    Well, it helped a little: the breakage is something early in the 5.10 series.
    Also, I now have an apparently working kernel from the 5.9 series (who knows
    how stable under load?).

    If you have a fast second machine available, bisecting the problem shouldn't take
    too long.

    Well, this machine has plenty of RAM, disk space and a good connection;
    how fast the CPU is at compiling a kernel I don't know, but we can try.
    Power consumption is not much worse than a PC's, but it is darn loud!
    Like a vacuum cleaner... I need to stay out of the room, but I found an
    acceptable setup: I use a workstation with a serial console connected to
    it, then connect through ssh to the workstation and through that into the
    management.

    Although I have been compiling kernels on Gentoo Linux for 15 years, I
    never did on Debian. Here we have init images.


    How should I proceed? Which kernel sources?

    https://kernel-team.pages.debian.net/kernel-handbook/ch-common-tasks.html#s-common-official

    is 4.3 correct for me? 4.6 ?

    Please guide me

    Riccardo

  • From Gregor Riepl@21:1/5 to All on Thu Mar 11 23:30:02 2021
    Do you know if I can via serial-console reset the system?
    I tried sending a break on the serial console, but the errors just keep running.
    Break is received, since I see it as SC Alert, but I am not put into the console, maybe there is some further trick on these newer machine? I am
    used to old SparcStations and UltraSparc Netras, where it was sufficient.
    It is inconvenient at every hang to power-cycle, since at every turn on,
    it runs a self-test which lasts minutes :)

    According to this, you should be able to reach the system console
    through the SER MGT port: https://unixed.com/index.php/2013/06/16/accessing-the-sparc-system-console/
    NET MGT is probably easier, but you'll have to set it up first.

    Perhaps you can also attach a USB keyboard and press the break key to
    get into the system console, then type "reset" to boot the machine? Not
    sure if this works without a monitor though. And you might need to enter
    the system password first, if it's set.

  • From Frank Scheiner@21:1/5 to Riccardo Mottola on Thu Mar 11 23:50:02 2021
    Hi Riccardo,

    On 11.03.21 23:03, Riccardo Mottola wrote:
    Hi Frank!

    I suppose the Niagara CPU gives the kernel issue

    From [1] I assume T2 CPUs are not affected, but yeah, the issue could
    be that selective that it only affects the very first generation.

    [1]: https://lists.debian.org/debian-sparc/2021/03/msg00010.html


    Frank Scheiner wrote:
    If I remember there was a repository with many snapshots of different
    versions, already as package, which one can test quickly. That way we
    can restrict breakage range without git bisect.

    Do you have a link?

    I assume you mean "http://snapshot.debian.org" .

    Exactly. With this I did some more tests.

    Still Works:
    5.9.0-4-sparc64-smp #1 SMP Debian 5.9.11-1 (2020-11-27)
    5.9.0-5-sparc64-smp #1 SMP Debian 5.9.15-1 (2020-12-17)

    Broken:

    linux-image-5.10.0-trunk-sparc64-smp_5.10.2-1~exp1_sparc64.deb

    So later 5.9-series kernels continue to work, and even very early 5.10 kernels do not.

    Do you know if I can via serial-console reset the system?

    Reset from the serial console might work via the kernel with the [magic
    system request] functionality.

    [magic system request]: https://www.kernel.org/doc/html/v4.11/admin-guide/sysrq.html

    But you can always reset the system using the SC. The T1000 (and the
    T2000, too) has both serial (on T2000 right of the DB-9 ttya port,
    should work with a blue Cisco serial cable) and network port (on T2000
    above the two USB ports). The serial port of the SC automatically
    switches to the system console after some (configurable) time and you
    need to escape to the SC login prompt with a configurable key sequence
    (`#.` by default, see [2]).

    [2]: https://docs.oracle.com/cd/E19076-01/t2k.srvr/819-2549-12/ontario-consoleConfig.html#28277

    I tried sending a break on the serial console, but the errors just keep running.
    Break is received, since I see it as SC Alert, but I am not put into the console, maybe there is some further trick on these newer machine?

    So you already got access to the SC. Then you can reset the machine from
    there, too.

    I am
    used to old SparcStations and UltraSparc Netras, where it was sufficient.
    It is inconvenient at every hang to power-cycle, since at every turn on,
    it runs a self-test which lasts minutes :)

    I think depending on the SC configuration, these machines also run a
    self-test for every X resets, but this should be configurable.

    Hope that helps
    Cheers,
    Frank

  • From Gregor Riepl@21:1/5 to All on Fri Mar 12 00:00:02 2021
    How should I proceed? Which kernel sources?

    https://kernel-team.pages.debian.net/kernel-handbook/ch-common-tasks.html#s-common-official


    is 4.3 correct for me? 4.6 ?

    You should clone the upstream Git repo, otherwise bisecting will be much
    more difficult.

    I think these instructions are still valid: https://wiki.debian.org/DebianKernel/GitBisect

    You can also skip the Debian-specific stuff and simply do
    make -j8 && make modules_install && make install

    It's better to use at least a compatible kernel config, though.
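
    For anyone who has not bisected before, the mechanics can be tried on a throwaway repository first. The following is a minimal sketch, not the actual kernel procedure: it creates 20 dummy commits in a temporary directory, introduces a fake "bug" in commit 13, and lets `git bisect run` find it. The file names (`state`, `counter`) and the check command are made up for illustration. For the kernel, each step would instead be a build plus a test boot on the affected machine, followed by a manual `git bisect good` or `git bisect bad`.

    ```shell
    # Build a throwaway repository with 20 commits; commit 13 introduces the "bug".
    demo=$(mktemp -d) && cd "$demo" || exit 1
    git init -q
    git config user.email "demo@example.invalid"
    git config user.name "Bisect Demo"

    i=1
    while [ "$i" -le 20 ]; do
        if [ "$i" -ge 13 ]; then echo broken > state; else echo ok > state; fi
        echo "$i" > counter
        git add state counter
        git commit -q -m "commit $i"
        i=$((i + 1))
    done

    # Mark the newest commit as bad and the root commit as good.
    root=$(git rev-list --max-parents=0 HEAD)
    git bisect start HEAD "$root" > /dev/null

    # The check command exits 0 for a good tree and 1 for a bad one.
    git bisect run grep -q ok state > /dev/null

    # Once the bisection has finished, refs/bisect/bad points at the first bad commit.
    first_bad=$(git log -1 --format=%s refs/bisect/bad)
    echo "first bad: $first_bad"

    # Return to the original checkout and clean up the bisect state.
    git bisect reset > /dev/null 2>&1
    ```

    With 20 commits this takes four or five test runs instead of up to 19 sequential ones.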

  • From Jan Engelhardt@21:1/5 to Frank Scheiner on Fri Mar 12 10:50:02 2021
    On Thursday 2021-03-11 23:43, Frank Scheiner wrote:

    Do you know if I can via serial-console reset the system?

    Reset from the serial console might work via the kernel with the [magic system request] functionality.

    [magic system request]: https://www.kernel.org/doc/html/v4.11/admin-guide/sysrq.html

    But you can always reset the system using the SC. The T1000 (and the
    T2000, too) has both serial (on T2000 right of the DB-9 ttya port,
    should work with a blue Cisco serial cable) and network port (on T2000
    above the two USB ports). The serial port of the SC automatically
    switches to the system console after some (configurable) time

    SER MGT is a RS232-ish serial line, just with a RJ-45 connector for size.
    Once the SC has finished booting, system console is the default mode.
    Since SER has no notion of connections, it should stay in whatever mode it was left in. Maybe there is an autoswitch, but I never observed it (though I would not want to wait many minutes just to observe it).

    For NET MGT, when you start a new SSH connection, it always starts
    out in system console mode and #. is needed.

    I tried sending a break on the serial console, but the errors just keep
    running.
    Break is received, since I see it as SC Alert, but I am not put into the
    console, maybe there is some further trick on these newer machine?

    So you already got access to the SC. Then you can reset the machine from there, too.

    Because NET does not have an equivalent of the serial pin used to traditionally signal "break", a synthetic break can be issued from SC. But it's a bit awkward, because you immediately need to go back into system console mode to type the desired sysrq character.

    break
    confirm (y/n)y
    console
    confirm (y/n)y
    type <<S>>
    Linux kernel: ah yes I received SYSRQ-s

    I am
    used to old SparcStations and UltraSparc Netras, where it was sufficient.
    It is inconvenient at every hang to power-cycle, since at every turn on,
    it runs a self-test which lasts minutes :)

    I think depending on the SC configuration, these machines also run a self-test for every X resets, but this should be configurable.

    It's the first thing you want to turn off as a private user.

    diag_trigger none

    and probably

    diag_mode off

  • From Frank Scheiner@21:1/5 to Frank Scheiner on Tue Mar 16 14:10:01 2021
    Hi Riccardo, Adrian,

    so I did some testing yesterday and also see your problem on my T1000.
    Because of some kernel command line misconfiguration, my machine at
    first couldn't find its root FS as it tried to use a non-existent NIC.
    This led to a lot of kernel oopses (I assume at least one per hardware
    thread) that looked very similar to the ones you see. And this happens
    even with "working" kernels (tested 4.19.x and 5.9.x). So the actual
    result of that problem in 5.10.x seems to be that the kernel can't find
    its root FS.

    On 11.03.21 23:43, Frank Scheiner wrote:
    On 11.03.21 23:03, Riccardo Mottola wrote:
    I suppose the Niagara CPU gives the kernel issue

    From [1] I assume T2 CPUs are not affected, but yeah, the issue could
    be that selective that it only affects the very first generation.

    [1]: https://lists.debian.org/debian-sparc/2021/03/msg00010.html

    I can also indeed confirm that this problem only affects the T1 CPU, as
    my T5220 with T2 CPU works w/o problems with kernel 5.10.x.

    I didn't get any further yesterday as it took a lot of time to update
    the root FSes of my T1000 and my X4270 - my intended machine for cross compilation, not sure if it will be "fast" enough*. In addition cloning
    Linus's linux tree alone took a lot of time (about an hour).

    * it will:

    ```
    ## with config of Debian's 5.9.0-5 kernel as `.config`
    $ make ARCH=sparc64 CROSS_COMPILE=sparc64-linux-gnu- olddefconfig
    [...]
    ## with lsmod output from T1000
    $ make ARCH=sparc64 CROSS_COMPILE=sparc64-linux-gnu- LSMOD=$HOME/t1000-lsmod localmodconfig
    [...]
    $ time make -j16 ARCH=sparc64 CROSS_COMPILE=sparc64-linux-gnu- all
    [...]
    kernel: arch/sparc/boot/zImage is ready

    real 3m12.264s
    user 42m5.325s
    sys 3m27.843s
    ```

    @Adrian:
    After a first cross compile run, I can confirm that 5.10-rc1 is also
    broken on my T1000. I'll take this version (parent commit: 33def8498fdde180023444b08e12b72a9efed41d) as "bad". But taking v5.9 as
    good means more than 5000 commits in between. Linus's tree doesn't
    contain v5.9.16 or at least I didn't find it there. How can I get "good"
    closer to "bad"? I don't want to check too many good versions if I know
    that v5.9.16 most likely will be good, as v5.9.15 (5.9.0-5 on Debian) is
    good? Should I switch to the stable kernel sources from GKH?

    Cheers,
    Frank

  • From Frank Scheiner@21:1/5 to Frank Scheiner on Tue Mar 16 14:20:01 2021
    Hi again,

    On 16.03.21 14:07, Frank Scheiner wrote:
    @Adrian:
    After a first cross compile run, I can confirm that 5.10-rc1 is also
    broken on my T1000. I'll take this version (parent commit: 33def8498fdde180023444b08e12b72a9efed41d) as "bad". But taking v5.9 as
    good means more than 5000 commits in between. Linus's tree doesn't
    contain v5.9.16 or at least I didn't find it there. How can I get "good" closer to "bad"? I don't want to check too many good versions if I know
    that v5.9.16 most likely will be good, as v5.9.15 (5.9.0-5 on Debian) is good? Should I switch to the stable kernel sources from GKH?

    Forget about that, [1] shows 5000+ commits between v5.9.16 and
    v5.10-rc1, too. So no difference.

    [1]: https://github.com/gregkh/linux/compare/v5.9.16...v5.10-rc1

    Cheers,
    Frank

  • From John Paul Adrian Glaubitz@21:1/5 to Frank Scheiner on Tue Mar 16 14:30:01 2021
    Hello Frank!

    On 3/16/21 2:07 PM, Frank Scheiner wrote:
    After a first cross compile run, I can confirm that 5.10-rc1 is also
    broken on my T1000. I'll take this version (parent commit: 33def8498fdde180023444b08e12b72a9efed41d) as "bad". But taking v5.9 as
    good means more than 5000 commits in between. Linus's tree doesn't
    contain v5.9.16 or at least I didn't find it there. How can I get "good" closer to "bad"? I don't want to check too many good versions if I know
    that v5.9.16 most likely will be good, as v5.9.15 (5.9.0-5 on Debian) is good? Should I switch to the stable kernel sources from GKH?

    I'm not sure I understand your problem here. The bisecting algorithm
    has a runtime O(ln(n)), so even with 5000 commits, it will converge quite quickly.
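
    As a rough worked example (a sketch, with the hypothetical helper name `bisect_steps`): halving a range of 5000 commits until one remains takes about 13 rounds. Real `git bisect` runs on a non-linear history, so the exact count can differ by a step or two.

    ```shell
    # bisect_steps N: print how many halvings are needed to narrow
    # N candidate commits down to a single one (ceil(log2(N))).
    bisect_steps() {
        n=$1
        steps=0
        while [ "$n" -gt 1 ]; do
            n=$(( (n + 1) / 2 ))   # keep the larger half (worst case)
            steps=$((steps + 1))
        done
        echo "$steps"
    }

    bisect_steps 5000   # prints 13
    ```

    With the roughly 3m12s cross-compile time reported earlier in the thread, 13 builds come to about 42 minutes of pure compile time; the test boots are extra.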

    Just make sure you are using a fast machine when compiling the kernel
    as otherwise it won't be fun.

    Adrian

  • From Frank Scheiner@21:1/5 to John Paul Adrian Glaubitz on Tue Mar 16 15:00:02 2021
    Hi Adrian,

    On 16.03.21 14:27, John Paul Adrian Glaubitz wrote:
    Hello Frank!

    On 3/16/21 2:07 PM, Frank Scheiner wrote:
    After a first cross compile run, I can confirm that 5.10-rc1 is also
    broken on my T1000. I'll take this version (parent commit:
    33def8498fdde180023444b08e12b72a9efed41d) as "bad". But taking v5.9 as
    good means more than 5000 commits in between. Linus's tree doesn't
    contain v5.9.16 or at least I didn't find it there. How can I get "good"
    closer to "bad"? I don't want to check too many good versions if I know
    that v5.9.16 most likely will be good, as v5.9.15 (5.9.0-5 on Debian) is
    good? Should I switch to the stable kernel sources from GKH?

    I'm not sure I understand your problem here. Bisecting needs only
    O(log(n)) steps, so even with 5000 commits it will converge quite quickly.

    Yeah, you're right, I think I make this error every time I try to bisect
    the kernel - i.e. once every two years... ;-)

    Just make sure you are using a fast machine when compiling the kernel
    as otherwise it won't be fun.

    Other topic: As compiling actually takes less time than preparing the
    test boot (copying the modules over to the T1000 root FS, booting the
    T1000 with a working kernel, creating the initramfs, rebooting with the
    kernel in question and that initramfs), is there a way to create the
    initramfs (for sparc64) on the cross compile host (amd64)?

    Cheers,
    Frank

  • From John Paul Adrian Glaubitz@21:1/5 to Frank Scheiner on Wed Mar 17 13:40:01 2021
    Hi Frank!

    On 3/17/21 1:22 PM, Frank Scheiner wrote:
    Hi Adrian, Riccardo

    so I'm finished with bisecting and it points to the following commit as
    first bad commit:

    ```
    johndoe@x4270:~/git-projects/torvalds/linux$ git bisect bad
    028abd9222df0cf5855dab5014a5ebaf06f90565 is the first bad commit
    commit 028abd9222df0cf5855dab5014a5ebaf06f90565
    Author: Christoph Hellwig <hch@lst.de>
    Date: Thu Sep 17 10:22:34 2020 +0200

    fs: remove compat_sys_mount

    compat_sys_mount is identical to the regular sys_mount now, so remove it
    and use the native version everywhere.

    Did you verify this by reverting the commit or, if reverting is not possible, by testing
    the revision just before the commit? Just to be sure that you found the correct commit.

    If that has been verified, please report the issue to the sparclinux LKML and CC Christoph.

    Adrian

    --
    .''`. John Paul Adrian Glaubitz
    : :' : Debian Developer - glaubitz@debian.org
    `. `' Freie Universitaet Berlin - glaubitz@physik.fu-berlin.de
    `- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913

  • From Frank Scheiner@21:1/5 to All on Wed Mar 17 13:30:01 2021
    Hi Adrian, Riccardo

    so I'm finished with bisecting and it points to the following commit as
    first bad commit:

    ```
    johndoe@x4270:~/git-projects/torvalds/linux$ git bisect bad
    028abd9222df0cf5855dab5014a5ebaf06f90565 is the first bad commit
    commit 028abd9222df0cf5855dab5014a5ebaf06f90565
    Author: Christoph Hellwig <hch@lst.de>
    Date: Thu Sep 17 10:22:34 2020 +0200

    fs: remove compat_sys_mount

    compat_sys_mount is identical to the regular sys_mount now, so remove it
    and use the native version everywhere.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

    arch/arm64/include/asm/unistd32.h | 2 +-
    arch/mips/kernel/syscalls/syscall_n32.tbl | 2 +-
    arch/mips/kernel/syscalls/syscall_o32.tbl | 2 +-
    arch/parisc/kernel/syscalls/syscall.tbl | 2 +-
    arch/powerpc/kernel/syscalls/syscall.tbl | 2 +-
    arch/s390/kernel/syscalls/syscall.tbl | 2 +-
    arch/sparc/kernel/syscalls/syscall.tbl | 2 +-
    arch/x86/entry/syscalls/syscall_32.tbl | 2 +-
    fs/Makefile | 1 -
    fs/compat.c | 57 ----------------------
    fs/internal.h | 3 --
    fs/namespace.c | 4 +-
    include/linux/compat.h | 6 ---
    include/uapi/asm-generic/unistd.h | 2 +-
    tools/include/uapi/asm-generic/unistd.h | 2 +-
    tools/perf/arch/powerpc/entry/syscalls/syscall.tbl | 2 +-
    tools/perf/arch/s390/entry/syscalls/syscall.tbl | 2 +-
    17 files changed, 14 insertions(+), 81 deletions(-)
    delete mode 100644 fs/compat.c
    ```

    It does indeed seem to be related to mounting (the root FS). Why it only
    affects UltraSPARC T1 CPUs is another question. I don't have any other
    machines (UltraSPARC II, IIi, IIe, III or IIIi) at hand right now to
    check those.

    So what now?

    Cheers,
    Frank

    P.S.

    Here's the log for reference:

    ```
    johndoe@x4270:~/git-projects/torvalds/linux$ git bisect log
    git bisect start
    # good: [bbf5c979011a099af5dc76498918ed7df445635b] Linux 5.9
    git bisect good bbf5c979011a099af5dc76498918ed7df445635b
    # bad: [3650b228f83adda7e5ee532e2b90429c03f7b9ec] Linux 5.10-rc1
    git bisect bad 3650b228f83adda7e5ee532e2b90429c03f7b9ec
    # bad: [c48b75b7271db23c1b2d1204d6e8496d91f27711] Merge tag 'sound-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
    git bisect bad c48b75b7271db23c1b2d1204d6e8496d91f27711
    # bad: [7fafb54c7d390e9b273a1d7d377e38d9c408046e] Merge tag 'leds-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/pavel/linux-leds
    git bisect bad 7fafb54c7d390e9b273a1d7d377e38d9c408046e
    # bad: [fd5c32d80884268a381ed0e67cccef0b3d37750b] Merge tag 'media/v5.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
    git bisect bad fd5c32d80884268a381ed0e67cccef0b3d37750b
    # bad: [865c50e1d279671728c2936cb7680eb89355eeea] x86/uaccess: utilize CONFIG_CC_HAS_ASM_GOTO_OUTPUT
    git bisect bad 865c50e1d279671728c2936cb7680eb89355eeea
    # good: [13cb73490f475f8e7669f9288be0bcfa85399b1f] Merge tag 'x86-entry-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
    git bisect good 13cb73490f475f8e7669f9288be0bcfa85399b1f
    # good: [dd502a81077a5f3b3e19fa9a1accffdcab5ad5bc] Merge tag 'core-static_call-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
    git bisect good dd502a81077a5f3b3e19fa9a1accffdcab5ad5bc
    # good: [ced3a9eb3cd0d07462cdbaa8a0f3d46e5aaeadec] Merge tag 'ia64_for_5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux
    git bisect good ced3a9eb3cd0d07462cdbaa8a0f3d46e5aaeadec
    # good: [fc67d5bc876b6b224538c8848fc02e70f269ec99] Documentation/admin-guide: README & svga: remove use of "rdev"
    git bisect good fc67d5bc876b6b224538c8848fc02e70f269ec99
    # good: [c90578360c92c71189308ebc71087197080e94c3] Merge branch 'work.csum_and_copy' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
    git bisect good c90578360c92c71189308ebc71087197080e94c3
    # good: [85ed13e78dbedf9433115a62c85429922bc5035c] Merge branch 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
    git bisect good 85ed13e78dbedf9433115a62c85429922bc5035c
    # bad: [22230cd2c55bd27ee2c3a3def97c0d5577a75b82] Merge branch 'compat.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
    git bisect bad 22230cd2c55bd27ee2c3a3def97c0d5577a75b82
    # good: [e18afa5bfa4a2f0e07b0864370485df701dacbc1] Merge branch 'work.quota-compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
    git bisect good e18afa5bfa4a2f0e07b0864370485df701dacbc1
    # good: [67e306c6906137020267eb9bbdbc127034da3627] fs,nfs: lift compat nfs4 mount data handling into the nfs code
    git bisect good 67e306c6906137020267eb9bbdbc127034da3627
    # bad: [028abd9222df0cf5855dab5014a5ebaf06f90565] fs: remove compat_sys_mount
    git bisect bad 028abd9222df0cf5855dab5014a5ebaf06f90565
    # first bad commit: [028abd9222df0cf5855dab5014a5ebaf06f90565] fs: remove compat_sys_mount
    ```

  • From Frank Scheiner@21:1/5 to John Paul Adrian Glaubitz on Wed Mar 17 21:00:01 2021
    Hi Adrian,

    On 17.03.21 13:39, John Paul Adrian Glaubitz wrote:
    On 3/17/21 1:22 PM, Frank Scheiner wrote:
    ```
    johndoe@x4270:~/git-projects/torvalds/linux$ git bisect bad
    028abd9222df0cf5855dab5014a5ebaf06f90565 is the first bad commit
    [...]
    Did you verify that reverting this commit or - if reverting is not possible - testing
    out the revision just before the commit?

    I did not yet revert the bad commit in a current kernel and test it, but
    from my understanding the parent commit of the first bad one must have
    been a good one and indeed, [67e306c6906137020267eb9bbdbc127034da3627]
    is the parent of [028abd9222df0cf5855dab5014a5ebaf06f90565] and was
    working for me on my T1000:

    ```
    johndoe@x4270:~/git-projects/torvalds/linux$ git bisect log
    [...]
    # good: [67e306c6906137020267eb9bbdbc127034da3627] fs,nfs: lift compat nfs4 mount data handling into the nfs code
    git bisect good 67e306c6906137020267eb9bbdbc127034da3627
    # bad: [028abd9222df0cf5855dab5014a5ebaf06f90565] fs: remove compat_sys_mount
    git bisect bad 028abd9222df0cf5855dab5014a5ebaf06f90565
    # first bad commit: [028abd9222df0cf5855dab5014a5ebaf06f90565] fs: remove compat_sys_mount
    ```

    [67e306c6906137020267eb9bbdbc127034da3627]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=67e306c6906137020267eb9bbdbc127034da3627

    [028abd9222df0cf5855dab5014a5ebaf06f90565]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=028abd9222df0cf5855dab5014a5ebaf06f90565

    Just to be safe you found the correct commit.

    If that has been verified, please report the issue to the sparclinux LKML and CC Christoph.

    Will do that soon-ish but maybe also try to revert that commit in
    Debian's 5.10.0-4 and test it for additional assurance (then not so
    soon-ish - maybe this weekend). I'll put you and Riccardo in CC, too.
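
    A minimal sketch of such a revert test, assuming a clone of the kernel tree and some base ref to start from. The branch name `revert-test` and the helper `revert_on_branch` are made up for illustration, and the revert may not apply cleanly on a newer tree, in which case the conflicts have to be resolved by hand.

    ```sh
    # Hypothetical helper: create a throwaway branch from a base ref and
    # revert one commit on it, so the result can be built and booted.
    #   $1 = base ref (for example a release tag), $2 = commit to revert
    revert_on_branch() {
        git checkout -q -b revert-test "$1" &&
        git revert --no-edit "$2"
    }

    # Example call inside a kernel clone (not executed here; the tag is
    # only an example):
    #   revert_on_branch v5.10 028abd9222df0cf5855dab5014a5ebaf06f90565
    ```

    If the kernel built from that branch boots on the T1000, that supports the bisect result.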

    Hopefully this will be easier to fix than the kernel breakage on the
    rx2800 i2 - assuming that problem is still there ([1], [2]).

    [1]: https://marc.info/?l=linux-ia64&m=156114769908890&w=2
    [2]: https://marc.info/?l=linux-ia64&m=156144480821712&w=2

    Cheers and thanks for the pointers,
    Frank

  • From Jan Engelhardt@21:1/5 to Frank Scheiner on Tue Mar 23 16:40:02 2021
    On Tuesday 2021-03-23 16:29, Frank Scheiner wrote:

    while I was able to "install" correctly using a slightly older ISO, I
    get not a bootable system. The kernel appears to crash very early during
    boot.

    From my current testing it looks like "UltraSPARC IIIi"s are also
    affected by this problem with UltraSPARC T1s in some way:

    With the latest Linux 5.10.x (from Debian) the root FS can't be
    successfully mounted, with the latest Linux 5.9.x (also from Debian) it
    just works fine. Unfortunately the V245 doesn't fail/work for the exact
    same kernels that I tested during the bisecting for the T1000, e.g. the
    first bad commit version that didn't work on the T1000 seems to work on
    the V245 but some good versions don't with:

    ```
    [...]
    Begin: Retrying nfs mount ... [ 41.753937] NFS: mount program didn't
    pass remote address
    mount: Invalid argument

    I seem to recall that NFS is one of those filesystems that (a) makes use of
    filesystem-specific data, i.e. mount(2)'s 5th argument, and (b) uses a mount
    helper, /usr/sbin/mount.nfs.

    Now, with the change in Linux kernel commit 028abd9222df0cf5855dab5014a5ebaf06f90565,
    I am postulating the hypothesis that the fs/nfs/ code for parsing this binary
    blob is no longer aware that it is being invoked in a compat32 context.

    Since T2 systems were said to be fine and T1 not, perhaps the T1 systems in question were all on NFS mounts and the T2 one wasn't?

  • From Frank Scheiner@21:1/5 to Riccardo Mottola on Tue Mar 23 16:30:02 2021
    Hi all,

    On 09.03.21 13:23, Riccardo Mottola wrote:
    Hi all,

    while I was able to "install" correctly using a slightly older ISO, I
    get not a bootable system. The kernel appears to crash very early during boot.

    Anybody else has this issue?

      Booting `Debian GNU/Linux'

    Loading Linux 5.10.0-4-sparc64-smp ...
    Loading initial ramdisk ...


    From my current testing it looks like UltraSPARC IIIi machines are also
    affected in some way by the problem seen on UltraSPARC T1s:

    With the latest Linux 5.10.x (from Debian) the root FS can't be
    mounted successfully, while with the latest Linux 5.9.x (also from Debian)
    it just works. Unfortunately the V245 does not fail or work with exactly
    the same kernels that I tested while bisecting on the T1000: e.g. the
    first bad commit, which didn't work on the T1000, seems to work on
    the V245, but some good versions fail with:

    ```
    [...]
    Begin: Retrying nfs mount ... [ 41.753937] NFS: mount program didn't pass remote address
    mount: Invalid argument
    done.
    [...]
    ```

    I'm unsure what could go wrong here, as I always pass the remote address
    via the kernel commandline:

    ```
    [...]
    [ 2.928512] Kernel command line: BOOT_IMAGE=(tftp)/AC10027A.vmlinux root=/dev/nfs ip=172.16.2.122:172.16.0.2:172.16.0.1:255.255.0.0:v245-2:enp9s4f0:off nfsroot=172.16.0.2:/srv/nfs/v245-2/root nfsrootdebug rw
    [...]
    ```
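
    For reference, the `ip=` parameter on that command line follows the colon-separated layout documented for the kernel's NFS-root support (client IP, server IP, gateway, netmask, hostname, device, autoconf). The following is an added sketch that splits it into labelled fields; the helper name `parse_ip_param` is invented for this example.

    ```sh
    # Hypothetical helper: split the value of an ip= kernel parameter
    # into its first seven colon-separated fields.
    parse_ip_param() {
        old_ifs=$IFS
        IFS=:
        set -- $1          # word-split the argument on ':'
        IFS=$old_ifs
        echo "client=$1"
        echo "server=$2"
        echo "gateway=$3"
        echo "netmask=$4"
        echo "hostname=$5"
        echo "device=$6"
        echo "autoconf=$7"
    }

    parse_ip_param "172.16.2.122:172.16.0.2:172.16.0.1:255.255.0.0:v245-2:enp9s4f0:off"
    ```

    The server field here (172.16.0.2) matches the address given in `nfsroot=`, so the remote address is indeed present on the command line.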

    Maybe there is some breakage in the klibc based programs in the
    initramfs, but why they don't affect both UltraSPARC IIIi and T1 in the
    same way is somewhat strange.

    Cheers,
    Frank

  • From Frank Scheiner@21:1/5 to Jan Engelhardt on Tue Mar 23 16:50:01 2021
    Hi Jan,

    On 23.03.21 16:36, Jan Engelhardt wrote:
    On Tuesday 2021-03-23 16:29, Frank Scheiner wrote:
    ```
    [...]
    Begin: Retrying nfs mount ... [ 41.753937] NFS: mount program didn't
    pass remote address
    mount: Invalid argument

    I seem to recall that NFS is one of those filesystems that (a) makes use of
    filesystem-specific data, i.e. mount(2)'s 5th argument, and (b) uses a mount
    helper, /usr/sbin/mount.nfs.

    Now, with the change in Linux kernel commit 028abd9222df0cf5855dab5014a5ebaf06f90565,
    I am postulating the hypothesis that the fs/nfs/ code for parsing this binary
    blob is no longer aware that it is being invoked in a compat32 context.

    That sounds interesting. Can you perhaps post your hypothesis also in
    this thread:

    https://marc.info/?t=161644900600003&r=1&w=2

    Maybe this gives the kernel developers some ideas.

    Since T2 systems were said to be fine and T1 not, perhaps the T1 systems in question were all on NFS mounts and the T2 one wasn't?

    No, the T5220 was also running diskless, actually using the same root FS
    as the T1000 (in form of a btrfs subvolume snapshot) plus identical
    kernel and initramfs:

    ```
    root@nfs:/srv/tftp# ls -la $( host2hex t5220 )*
    lrwxrwxrwx 1 root root 35 Feb 28 2018 AC10026E -> boot/grub/sparc64-ieee1275/core.img
    lrwxrwxrwx 1 root root 38 Mar 15 18:16 AC10026E.initrd.img -> initrd.img.5.10.0-4.debian.sid.sparc64
    lrwxrwxrwx 1 root root 36 Mar 15 18:16 AC10026E.vmlinuz -> linux.mp.5.10.0-4.debian.sid.sparc64
    ```
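
    The hexadecimal file names in that listing look like the client's IPv4 address written as eight hex digits. `host2hex` is Frank's own script and its contents are not shown in the thread; the function `ip2hex` below is only a guess at the conversion step, and it takes a dotted address instead of a host name.

    ```sh
    # Hypothetical helper: print a dotted IPv4 address as eight
    # upper-case hex digits.
    ip2hex() {
        old_ifs=$IFS
        IFS=.
        set -- $1          # word-split the argument on '.'
        IFS=$old_ifs
        printf '%02X%02X%02X%02X\n' "$1" "$2" "$3" "$4"
    }

    ip2hex 172.16.2.122   # prints AC10027A
    ```

    That result matches the `AC10027A.vmlinux` boot image used for the V245 (client address 172.16.2.122) earlier in the thread; read the other way round, `AC10026E` would correspond to 172.16.2.110.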

    Cheers,
    Frank

  • From Connor McLaughlan@21:1/5 to All on Tue Mar 23 17:40:01 2021
    Hi,

    can anyone possibly give a list of known stable kernel versions for SPARC
    machines? (Is a distinction necessary between architectures, i.e. old vs.
    newer machines, sun4u/sun4v?)

    Also, does this instability show up as the machine crashing during
    high workload? (Halting? Rebooting?)

    I ask because on three different SPARC machines I have been experiencing
    a weird effect when using Debian: I would start a high compiling load for
    several days (7-10), during which the machines run fine without any
    apparent error visible in dmesg or anywhere else. Then when I power them
    off and on again, the filesystem would be corrupt and sometimes impossible
    to repair without reinstallation.

    This seems to happen only when the machines do a long run with high
    workload, and seemingly not when I just power them off for the night with
    no high workload.

    Regards,
    Connor


    On Tue, Mar 23, 2021 at 4:46 PM Frank Scheiner <frank.scheiner@web.de>
    wrote:

    Hi Jan,

    On 23.03.21 16:36, Jan Engelhardt wrote:
    On Tuesday 2021-03-23 16:29, Frank Scheiner wrote:
    ```
    [...]
    Begin: Retrying nfs mount ... [ 41.753937] NFS: mount program didn't
    pass remote address
    mount: Invalid argument

    I seem to recall that NFS is one of those filesystems that (a) makes use of
    filesystem-specific data, i.e. mount(2)'s 5th argument, and (b) uses a mount
    helper, /usr/sbin/mount.nfs.

    Now, with the change in Linux kernel commit 028abd9222df0cf5855dab5014a5ebaf06f90565,
    I am postulating the hypothesis that the fs/nfs/ code for parsing this binary
    blob is no longer aware that it is being invoked in a compat32 context.

    That sounds interesting. Can you perhaps post your hypothesis also in
    this thread:

    https://marc.info/?t=161644900600003&r=1&w=2

    Maybe this gives the kernel developers some ideas.

    Since T2 systems were said to be fine and T1 not, perhaps the T1 systems in
    question were all on NFS mounts and the T2 one wasn't?

    No, the T5220 was also running diskless, actually using the same root FS
    as the T1000 (in form of a btrfs subvolume snapshot) plus identical
    kernel and initramfs:

    ```
    root@nfs:/srv/tftp# ls -la $( host2hex t5220 )*
    lrwxrwxrwx 1 root root 35 Feb 28 2018 AC10026E -> boot/grub/sparc64-ieee1275/core.img
    lrwxrwxrwx 1 root root 38 Mar 15 18:16 AC10026E.initrd.img -> initrd.img.5.10.0-4.debian.sid.sparc64
    lrwxrwxrwx 1 root root 36 Mar 15 18:16 AC10026E.vmlinuz -> linux.mp.5.10.0-4.debian.sid.sparc64
    ```

    Cheers,
    Frank




  • From Frank Scheiner@21:1/5 to Connor McLaughlan on Tue Mar 23 18:30:01 2021
    Hi,

    On 23.03.21 17:30, Connor McLaughlan wrote:
    Hi,

    can anyone possibly give a list of known stable kernel versions for SPARC
    machines? (Is a distinction necessary between architectures, i.e. old vs.
    newer machines, sun4u/sun4v?)

    Also, does this instability show up as the machine crashing during
    high workload? (Halting? Rebooting?)

    I ask because on three different SPARC machines I have been experiencing
    a weird effect when using Debian: I would start a high compiling load for
    several days (7-10), during which the machines run fine without any
    apparent error visible in dmesg or anywhere else. Then when I power them
    off and on again, the filesystem would be corrupt and sometimes impossible
    to repair without reinstallation.

    Can you be sure that your used disks are in full working order? Maybe
    you have bad sectors on them and their EOL is nearing, manifesting in
    these FS errors? I assume the more accesses you have on your disks the
    more a problem is prone to show up. And the accesses happening during
    compile runs could be already too much for your disks. If you have
    enough RAM, you could try to run your compile jobs in a RAM disk and
    check if this makes a difference.

    This seems to only happen when the machines do a long run with high
    workload and seemingly not when i just power them off again for night
    with no high workload.

    I believe the error this thread is about is unrelated to what you
    experience on your machines, because the problem here happens early on,
    when the root FS is about to be mounted.

    Cheers,
    Frank

  • From Riccardo Mottola@21:1/5 to Gregor Riepl on Sat Mar 27 00:20:01 2021
    Hi,


    I was unable to "hack" for some days due to day-job. I have seen Frank
    and others have done a great deal.

    Still, I wanted to try my own compilation, as a first attempt and also
    to be able to build and check possible patches myself.


    On 3/11/21 11:56 PM, Gregor Riepl wrote:
    You should clone the upstream Git repo, otherwise bisecting will be much
    more difficult.

    I think these instructions are still valid: https://wiki.debian.org/DebianKernel/GitBisect

    You can also skip the Debian-specific stuff and simply do
    make -j8 && make modules_install && make install

    It's better to use at least a compatible kernel config, though.


    I cloned linux stable. It took 60 minutes...

    I took the config out of /boot/config of a good kernel, updated it with
    "make oldconfig"

    During compilation I see:

      CC      init/init_task.o
    make[1]: *** No rule to make target 'debian/certs/debian-uefi-certs.pem', needed by 'certs/x509_certificate_list'.  Stop.
    make[1]: *** Waiting for unfinished jobs....


    It took 134 minutes to build with -j32. So well, compiling is not the
    strongest point of this CPU, but not so bad either.

    real    134m55.288s
    user    4111m46.186s
    sys     145m12.479s
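
    A quick bit of arithmetic on those `time` figures (an added illustration, with seconds truncated): dividing CPU time (user + sys) by wall-clock time gives the average number of busy hardware threads. The helper names are made up for this sketch.

    ```sh
    # Hypothetical helpers for the calculation.
    # Minutes and whole seconds -> seconds.
    to_seconds() {
        echo $(( $1 * 60 + $2 ))
    }

    # Average number of busy CPUs, in tenths: $1 = wall seconds, $2 = CPU seconds.
    busy_cpus_tenths() {
        echo $(( $2 * 10 / $1 ))
    }

    real=$(to_seconds 134 55)
    cpu=$(( $(to_seconds 4111 46) + $(to_seconds 145 12) ))
    busy_cpus_tenths "$real" "$cpu"   # prints 315, i.e. about 31.5
    ```

    So on average roughly 31 of the 32 jobs requested with -j32 were running, which means the build kept the machine almost fully loaded.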

    I actually wonder whether the kernel is "overconfigured": does building
    things like nouveau make sense on SPARC? I wonder... maybe sticking a
    PCI-e card into a Netra or Fire would work?


    but I can't install:


    multix@narya:~/code/linux-stable$ sudo make modules_install
    sed: can't read modules.order: No such file or directory

    I wonder if it is related to the error above?


    Thanks,

    Riccardo

  • From Hermann.Lauer@uni-heidelberg.de@21:1/5 to Stan Johnson on Mon Mar 29 10:00:01 2021
    Hi Riccardo,

    On Sat, Mar 27, 2021 at 01:16:11PM -0600, Stan Johnson wrote:
    I took the config out of /boot/config of a good kernel, updated it with "make oldconfig"

    During compilation I see:

      CC      init/init_task.o
    make[1]: *** No rule to make target 'debian/certs/debian-uefi-certs.pem', needed by 'certs/x509_certificate_list'.  Stop.
    make[1]: *** Waiting for unfinished jobs....
    ...

    I think you need to remove all references to debian certs to compile a
    custom kernel.

    Yep, in your kernel config set:
    CONFIG_SYSTEM_TRUSTED_KEYS=""
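
    One way to apply that setting without opening an editor, sketched here with plain sed. The helper name `clear_trusted_keys` is invented for this example; the `scripts/config` tool in the kernel tree could be used instead.

    ```sh
    # Hypothetical helper: blank out CONFIG_SYSTEM_TRUSTED_KEYS in a
    # kernel config file (default: .config in the current directory).
    clear_trusted_keys() {
        config=${1:-.config}
        sed 's|^CONFIG_SYSTEM_TRUSTED_KEYS=.*|CONFIG_SYSTEM_TRUSTED_KEYS=""|' \
            "$config" > "$config.tmp" &&
        mv "$config.tmp" "$config"
    }

    # Example, run from the top of the kernel source tree (not executed here):
    #   clear_trusted_keys .config
    ```

    The rewrite goes through a temporary file so that the original config is only replaced if sed succeeded.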

    Greetings
    Hermann

    --
    Administration/Zentrale Dienste, Interdiziplinaeres
    Zentrum fuer wissenschaftliches Rechnen der Universitaet Heidelberg
    IWR; INF 205; 69120 Heidelberg; Tel: (06221)54-14405 Fax: -14427
    Email: Hermann.Lauer@iwr.uni-heidelberg.de

  • From Riccardo Mottola@21:1/5 to Connor McLaughlan on Thu Apr 1 12:00:02 2021
    Hi Connor,

    Connor McLaughlan wrote:
    can anyone possibly give a list of known stable kernel versions for SPARC
    machines? (Is a distinction necessary between architectures, i.e. old vs.
    newer machines, sun4u/sun4v?)

    Also, does this instability show up as the machine crashing during
    high workload? (Halting? Rebooting?)

    I ask because on three different SPARC machines I have been experiencing
    a weird effect when using Debian: I would start a high compiling load for
    several days (7-10), during which the machines run fine without any
    apparent error visible in dmesg or anywhere else. Then when I power them
    off and on again, the filesystem would be corrupt and sometimes impossible
    to repair without reinstallation.

    This seems to happen only when the machines do a long run with high
    workload, and seemingly not when I just power them off for the night with
    no high workload.

    I have a limited experience and can only share that the kernel I
    currently am running on this Fire T2000

    Linux narya 5.9.0-5-sparc64-smp #1 SMP Debian 5.9.15-1 (2020-12-17)
    sparc64 GNU/Linux

    is quite stable for me: I compiled under high load (e.g. compiling the
    Linux kernel on all 32 cores) and synced the git repositories of the Linux
    kernel and the ArcticFox browser. In my experience a git sync of such
    repositories is a good stress test; I have had disk drivers crash and the
    network freeze on different architectures and systems, but not in this case.
    However, I did not try compiling for several days, so I don't know whether
    it is stable over a long time.

    Riccardo

  • From Anatoly Pugachev@21:1/5 to riccardo.mottola@libero.it on Thu Apr 1 12:30:01 2021
    On Thu, Apr 1, 2021 at 12:59 PM Riccardo Mottola
    <riccardo.mottola@libero.it> wrote:
    This seems to only happen when the machines do a long run with high workload and seemingly not when i just power them off again for night
    with no high workload.

    I have a limited experience and can only share that the kernel I
    currently am running on this Fire T2000

    Linux narya 5.9.0-5-sparc64-smp #1 SMP Debian 5.9.15-1 (2020-12-17)
    sparc64 GNU/Linux

    Is quite stable for me.
    However, i did not try to run for several days compiling, so I don't
    know if it is stable for a long time.

    Riccardo,

    if you would like to check sparc64 kernel stability, you might want to run stress-ng tests, like:

    $ ./stress-ng --sequential 4 -v --timeout 3m --metrics-brief

    it still successfully kills the latest (git) kernel (5.12.0-rc5) on my
    sparc64 test LDOM running on a T5-2 hardware server.
    But please take stress-ng from git repo [1] , since it has a few
    recent fixes for sparc, not yet packaged into debian.

    Thanks.

    1. https://github.com/ColinIanKing/stress-ng/

  • From Riccardo Mottola@21:1/5 to Hermann.Lauer@uni-heidelberg.de on Thu Apr 1 13:50:02 2021
    Hi Hermann,


    Hermann.Lauer@uni-heidelberg.de wrote:
    Yep, in your kernel config set:
    CONFIG_SYSTEM_TRUSTED_KEYS=""

    thanks, that was it! Now the kernel builds.

    Do I need to do something special?

    make install
    make modules_install

    Which shows:

    multix@narya:~/code/linux-stable$ time sudo make install
    sh ./arch/sparc/boot/install.sh 5.12.0-rc5+ arch/sparc/boot/zImage \
    System.map "/boot"
    run-parts: executing /etc/kernel/postinst.d/apt-auto-removal 5.12.0-rc5+ /boot/vmlinuz-5.12.0-rc5+
    run-parts: executing /etc/kernel/postinst.d/initramfs-tools 5.12.0-rc5+ /boot/vmlinuz-5.12.0-rc5+
    update-initramfs: Generating /boot/initrd.img-5.12.0-rc5+
    run-parts: executing /etc/kernel/postinst.d/zz-update-grub 5.12.0-rc5+ /boot/vmlinuz-5.12.0-rc5+
    Generating grub configuration file ...
    Found linux image: /boot/vmlinuz-5.12.0-rc5+
    Found initrd image: /boot/initrd.img-5.12.0-rc5+
    Found linux image: /boot/vmlinuz-5.12.0-rc5+.old
    Found initrd image: /boot/initrd.img-5.12.0-rc5+
    Found linux image: /boot/vmlinux-5.10.0-4-sparc64-smp
    Found initrd image: /boot/initrd.img-5.10.0-4-sparc64-smp
    Found linux image: /boot/vmlinux-5.10.0-trunk-sparc64-smp
    Found initrd image: /boot/initrd.img-5.10.0-trunk-sparc64-smp
    Found linux image: /boot/vmlinux-5.9.0-5-sparc64-smp
    Found initrd image: /boot/initrd.img-5.9.0-5-sparc64-smp
    done

    real    33m3.954s
    user    28m18.936s
    sys     4m36.889s



    At boot:

    Loading Linux 5.12.0-rc5+ ...
    error: premature end of file /vmlinuz-5.12.0-rc5+.
    Loading initial ramdisk ...
    error: you need to load the kernel first.


    It is interesting how slow certain operations are on this system,
    since a "single" core is slow... so installing takes longer than on a...
    Celeron laptop!
    It took... 33 minutes?!


    Riccardo

  • From Anatoly Pugachev@21:1/5 to riccardo.mottola@libero.it on Thu Apr 1 15:10:01 2021
    On Thu, Apr 1, 2021 at 2:40 PM Riccardo Mottola
    <riccardo.mottola@libero.it> wrote:
    multix@narya:~/code/linux-stable$ time sudo make install
    sh ./arch/sparc/boot/install.sh 5.12.0-rc5+ arch/sparc/boot/zImage \
    System.map "/boot"
    run-parts: executing /etc/kernel/postinst.d/apt-auto-removal 5.12.0-rc5+ /boot/vmlinuz-5.12.0-rc5+
    run-parts: executing /etc/kernel/postinst.d/initramfs-tools 5.12.0-rc5+ /boot/vmlinuz-5.12.0-rc5+
    update-initramfs: Generating /boot/initrd.img-5.12.0-rc5+
    run-parts: executing /etc/kernel/postinst.d/zz-update-grub 5.12.0-rc5+ /boot/vmlinuz-5.12.0-rc5+
    Generating grub configuration file ...
    Found linux image: /boot/vmlinuz-5.12.0-rc5+
    Found initrd image: /boot/initrd.img-5.12.0-rc5+
    Found linux image: /boot/vmlinuz-5.12.0-rc5+.old
    Found initrd image: /boot/initrd.img-5.12.0-rc5+
    Found linux image: /boot/vmlinux-5.10.0-4-sparc64-smp
    Found initrd image: /boot/initrd.img-5.10.0-4-sparc64-smp
    Found linux image: /boot/vmlinux-5.10.0-trunk-sparc64-smp
    Found initrd image: /boot/initrd.img-5.10.0-trunk-sparc64-smp
    Found linux image: /boot/vmlinux-5.9.0-5-sparc64-smp
    Found initrd image: /boot/initrd.img-5.9.0-5-sparc64-smp
    done

    At boot:

    Loading Linux 5.12.0-rc5+ ...
    error: premature end of file /vmlinuz-5.12.0-rc5+.
    Loading initial ramdisk ...
    error: you need to load the kernel first.

    current grub2 version does not support compressed image kernels, do
    the following:

    gzip -dc /boot/vmlinuz-5.12.0-rc5+ > /boot/vmlinux-5.12.0-rc5+
    rm /boot/vmlinuz-5.12.0-rc5+
    update-grub

    and reboot
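
    The steps above can be wrapped in a small check so that only images which really are gzip-compressed get unpacked. This is a minimal sketch, not part of the original advice: the function names `is_gzip` and `unpack_kernel` are made up for illustration, and it assumes the image is a plain gzip stream, as `make install` produced here.

    ```shell
    # Sketch: unpack a gzip-compressed kernel image for a GRUB that cannot
    # load compressed kernels. Function names are illustrative only.

    is_gzip() {
        # gzip streams start with the magic bytes 1f 8b
        [ "$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')" = "1f8b" ]
    }

    unpack_kernel() {
        src=$1   # compressed image, e.g. /boot/vmlinuz-5.12.0-rc5+
        dst=$2   # uncompressed target, e.g. /boot/vmlinux-5.12.0-rc5+
        if is_gzip "$src"; then
            gzip -dc "$src" > "$dst"
        else
            echo "$src is not gzip-compressed, leaving it alone" >&2
            return 1
        fi
    }
    ```

    After a successful `unpack_kernel`, the compressed image would still have to be removed and `update-grub` run, as described above.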

  • From Hermann Lauer@21:1/5 to Riccardo Mottola on Thu Apr 1 14:40:01 2021
    Hi Riccardo,

    On Thu, Apr 01, 2021 at 01:43:29PM +0200, Riccardo Mottola wrote:
    Yep, in your kernel config set:
    CONFIG_SYSTEM_TRUSTED_KEYS=""

    thanks, that was it! Now the kernel builds.

    great!

    Do I need to do something special?

    make install
    make modules_install

    sorry, don't know. I'm always doing:

    make -j<core#> bindeb-pkg
    dpkg -i ../linux-image*.deb

    But that is even slower on weak hardware (e.g. BananaUltra) and the above SHOULD work. The advantage of packages shows later, when removing kernels.

    Loading Linux 5.12.0-rc5+ ...
    error: premature end of file /vmlinuz-5.12.0-rc5+.

    Somehow your vmlinuz is too short or the loader is not able to put it
    in memory.

    Good luck and greetings
    Hermann

    --
    Administration/Zentrale Dienste, Interdiziplinaeres
    Zentrum fuer wissenschaftliches Rechnen der Universitaet Heidelberg
    IWR; INF 205; 69120 Heidelberg; Tel: (06221)54-14405 Fax: -14427
    Email: Hermann.Lauer@iwr.uni-heidelberg.de

  • From Riccardo Mottola@21:1/5 to Anatoly Pugachev on Thu Apr 1 16:30:01 2021
    Hi Anatoly!

    Anatoly Pugachev wrote:
    current grub2 version does not support compressed image kernels, do
    the following:

    gzip -dc /boot/vmlinuz-5.12.0-rc5+ > /boot/vmlinux-5.12.0-rc5+
    rm /boot/vmlinuz-5.12.0-rc5+
    update-grub

    and reboot

    oh yes, that was it. Finally, I could boot my own built kernel. Which,
    of course, crashes as expected.
    At least I can confirm Frank's findings.


    Riccardo

  • From John Paul Adrian Glaubitz@21:1/5 to Riccardo Mottola on Sat Dec 11 19:00:03 2021
    On 12/11/21 18:40, Riccardo Mottola wrote:
    I remember you bisected to find the breaking commits. Has there been any progress?
    Is there a better place to report this issue than this mailing list?

    The proper place is to send an email to the author of the breaking commit and CC the sparclinux Linux kernel mailing list. Most kernel developers don't read the debian-sparc mailing list.

    Adrian

    --
    .''`. John Paul Adrian Glaubitz
    : :' : Debian Developer - glaubitz@debian.org
    `. `' Freie Universitaet Berlin - glaubitz@physik.fu-berlin.de
    `- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913

  • From Riccardo Mottola@21:1/5 to frank.scheiner@web.de on Sat Dec 11 18:50:02 2021
    Hi Frank,

    Several months have passed. New kernels have entered Debian and they still do not work for me, so let me dig this matter up again.
    I can continue using 5.9 for now, but for how long?

    On 2021-03-11 23:43:10 +0100 Frank Scheiner <frank.scheiner@web.de> wrote:

    From [1] I assume T2 CPUs are not affected, but yeah, the issue could
    be that selective that it only affects the very first generation.

    [1]: https://lists.debian.org/debian-sparc/2021/03/msg00010.html

    Did more people report this issue perhaps on other systems?

    I remember you bisected to find the breaking commits. Has there been any progress? Is there a better place to report this issue than this mailing list?

    Thank you,
    Riccardo

    --
    Sent with GNUMail running on MacOS 10.7

  • From Frank Scheiner@21:1/5 to John Paul Adrian Glaubitz on Sat Dec 11 20:40:02 2021
    Hi guys,

    On 11.12.21 18:59, John Paul Adrian Glaubitz wrote:
    On 12/11/21 18:40, Riccardo Mottola wrote:
    I remember you bisected to find the breaking commits. Has there been any progress?
    Is there a better place to report this issue than this mailing list?

    The proper place is to send an email to the author of the breaking commit and CC the sparclinux Linux kernel mailing list. Most kernel developers don't read
    the debian-sparc mailing list.

    We actually did discuss this in late March 2021 starting here:

    https://lists.debian.org/debian-sparc/2021/03/msg00045.html

    ...with Christoph Hellwig and CCed to sparclinux@vger.kernel.org and
    this list, but no solution back then.

    ****

    Back in October I did some testing on various UltraSPARC machines to
    sort out which processor generations are affected, but didn't find the
    time to make something out of it apart from notes and a conclusion.

    I couldn't get my Ultra 80 to netboot, so no result for UltraSPARC II.

    My Ultra 10 with US IIi worked though with kernel 5.14.0-3.

    My 280r with US III worked with kernel 5.9.0-5 and with 5.14.0-3 gives:

    ```
    Begin: Retrying nfs mount ... mount: Invalid argument
    done.
    ```

    ...when trying to mount the root FS.

    My V480 crashes with 5.14.0-3, but it has crashed with every kernel
    version I have tried since I got it, so that is perfectly normal. I
    don't know what the issue is, because hardware-wise the 280r (which
    works with 5.9.0-5) seems to be very similar, only with 2 processors
    instead of the V480's 4.

    My T5220 with T2 crashed once with 5.14.0-3 but worked with 5.14.0-4. It
    later also worked with 5.14.0-3. And the crash happened way before a
    mount of the root FS was tried, so possibly unrelated.

    My T1000 with T1 panics with 5.14.0-3 because it can't mount the root
    FS. Using `break=premount` in the kernel command line and issuing the
    mount command manually gives:

    ```
    (initramfs) nfsmount -o nolock "172.16.0.2:/srv/nfs/t1000/root" "$rootmnt"
    [ 641.272949] Unable to handle kernel paging request at virtual address 0000612000000000
    [ 641.273138] tsk->{mm,active_mm}->context = 000000000000038f
    [ 641.273248] tsk->{mm,active_mm}->pgd = ffff800016c1c000
    [ 641.273310] \|/ ____ \|/
    [ 641.273310] "@'/ .. \`@"
    [ 641.273310] /_| \__/ |_\
    [ 641.273310] \__U_/
    [ 641.273444] nfsmount(750): Oops [#182]
    [ 641.273497] CPU: 12 PID: 750 Comm: nfsmount Tainted: G D E 5.14.0-3-sparc64-smp #1 Debian 5.14.12-1
    [ 641.273603] TSTATE: 0000000011001607 TPC: 000000000069ce48 TNPC: 000000000069ce4c Y: 00000000 Tainted: G D E
    [ 641.273705] TPC: <kfree+0x48/0x400>
    [ 641.273775] g0: 0000000000000006 g1: 0000000400000000 g2: 0000600000000000 g3: ffff8001fda18000
    [ 641.273858] g4: ffff800013b13340 g5: ffff8001fda18000 g6: ffff800016bd0000 g7: ffff800016bd3c30
    [ 641.273942] o0: fffffffffffffffe o1: 00000000006f4c94 o2: 0000000000002000 o3: ffff8000146d3aa8
    [ 641.274024] o4: 0000000000000008 o5: 0000000000000cc0 sp: ffff800016bd34a1 ret_pc: 00000000006f4c54
    [ 641.274107] RPC: <sys_mount+0x74/0x1a0>
    [ 641.274165] l0: 0000000000f1a000 l1: 000000000111f000 l2: 0000000000422db4 l3: 0000000000201db0
    [ 641.274292] l4: 000000000000029c l5: ffff80010000c1a0 l6: ffff800016bd0000 l7: 00000000006f4be0
    [ 641.274377] i0: 0000000000000cc0 i1: 0000000000201fe0 i2: 0000000000000001 i3: ffff800016bd3dd0
    [ 641.274460] i4: 0000000000000000 i5: 0000612000000000 i6: ffff800016bd3561 i7: 00000000006f4c94
    [ 641.274542] I7: <sys_mount+0xb4/0x1a0>
    [ 641.274599] Call Trace:
    [ 641.274640] [<00000000006f4c94>] sys_mount+0xb4/0x1a0
    [ 641.274712] [<00000000006f4c54>] sys_mount+0x74/0x1a0
    [ 641.274783] [<0000000000406274>] linux_sparc_syscall+0x34/0x44
    [ 641.274866] Caller[00000000006f4c94]: sys_mount+0xb4/0x1a0
    [ 641.274939] Caller[00000000006f4c54]: sys_mount+0x74/0x1a0
    [ 641.275011] Caller[0000000000406274]: linux_sparc_syscall+0x34/0x44
    [ 641.275090] Caller[0000000000100aa8]: 0x100aa8
    [ 641.275143] Instruction DUMP:
    [ 641.275150] ba074001
    [ 641.275192] bb2f7003
    [ 641.275233] ba074002
    [ 641.275274] <c25f6008>
    [ 641.275314] 84086001
    [ 641.275355] 82007fff
    [ 641.275395] 8378841d
    [ 641.275436] ba100001
    [ 641.275525] c2586008
    [ 641.275614]
    Killed
    ```

    Doing the same on a V210 with US IIIi gives:

    ```
    (initramfs) nfsmount -o nolock "172.16.0.2:/srv/nfs/v210/root" "$rootmnt"
    mount: Invalid argument
    (initramfs) echo $?
    1
    ```

    ...so similar to 280r with US III.

    From all that, I assume UltraSPARC IIi driven machines (and most likely
    also older ones with US II) are not affected by this, and neither are
    UltraSPARC T2 driven ones and possibly machines with newer processors
    (I didn't have time to try one of my T5240s with T2+).

    UltraSPARC III, IIIi and T1 driven machines are affected and to me it
    now looks more like some of the klibc programs from the initramfs are at
    fault.

    I also tested my V210 with an on-disk root FS and although the mounting
    seemed to work for that method with 5.14.0-3 I faced multiple problems
    later on that crashed the machine.

    My next try would have been to test mounting of the root FS with
    non-klibc programs. But I'm unsure how to get these into an initramfs -
    with dracut maybe?

    Cheers,
    Frank

  • From Riccardo Mottola@21:1/5 to All on Fri Jan 14 18:00:01 2022
    Hi all,

    As Frank asked, I compiled a kernel from the latest commit he
    identified as good:
    67e306c6906137020267eb9bbdbc127034da3627

    This kernel boots, but then fails to load the initramfs.

    I don't know whether the original crash happened before or after this
    point, so is this "proof" that the commit is good, or is it
    inconclusive?

    The good news is that the latest installed kernel seems to boot and
    brings all CPUs online. I don't know how stable it is; that needs to
    be tested.

    Riccardo

    5.15.0-2-sparc64-smp #1 SMP Debian 5.15.5-2 (2021-12-18) sparc64 GNU/Linux

    multix@narya:~$ cat /proc/cpuinfo
    cpu : UltraSparc T1 (Niagara)
    fpu : UltraSparc T1 integrated FPU
    pmu : niagara
    prom : OBP 4.30.4.d 2011/07/06 14:29
    type : sun4v
    ncpus probed : 32
    ncpus active : 32
    D$ parity tl1 : 0
    I$ parity tl1 : 0
    cpucaps : flush,stbar,swap,muldiv,v9,blkinit,mul32,div32,v8plus,ASIBlkInit
    Cpu0ClkTck : 000000003b9aca00
    Cpu1ClkTck : 000000003b9aca00
    Cpu2ClkTck : 000000003b9aca00
    Cpu3ClkTck : 000000003b9aca00
    Cpu4ClkTck : 000000003b9aca00
    Cpu5ClkTck : 000000003b9aca00
    Cpu6ClkTck : 000000003b9aca00
    Cpu7ClkTck : 000000003b9aca00
    Cpu8ClkTck : 000000003b9aca00
    Cpu9ClkTck : 000000003b9aca00
    Cpu10ClkTck : 000000003b9aca00
    Cpu11ClkTck : 000000003b9aca00
    Cpu12ClkTck : 000000003b9aca00
    Cpu13ClkTck : 000000003b9aca00
    Cpu14ClkTck : 000000003b9aca00
    Cpu15ClkTck : 000000003b9aca00
    Cpu16ClkTck : 000000003b9aca00
    Cpu17ClkTck : 000000003b9aca00
    Cpu18ClkTck : 000000003b9aca00
    Cpu19ClkTck : 000000003b9aca00
    Cpu20ClkTck : 000000003b9aca00
    Cpu21ClkTck : 000000003b9aca00
    Cpu22ClkTck : 000000003b9aca00
    Cpu23ClkTck : 000000003b9aca00
    Cpu24ClkTck : 000000003b9aca00
    Cpu25ClkTck : 000000003b9aca00
    Cpu26ClkTck : 000000003b9aca00
    Cpu27ClkTck : 000000003b9aca00
    Cpu28ClkTck : 000000003b9aca00
    Cpu29ClkTck : 000000003b9aca00
    Cpu30ClkTck : 000000003b9aca00
    Cpu31ClkTck : 000000003b9aca00
    MMU Type : Hypervisor (sun4v)
    MMU PGSZs : 8K,64K,4MB,256MB
    State:
    CPU0: online
    CPU1: online
    CPU2: online
    CPU3: online
    CPU4: online
    CPU5: online
    CPU6: online
    CPU7: online
    CPU8: online
    CPU9: online
    CPU10: online
    CPU11: online
    CPU12: online
    CPU13: online
    CPU14: online
    CPU15: online
    CPU16: online
    CPU17: online
    CPU18: online
    CPU19: online
    CPU20: online
    CPU21: online
    CPU22: online
    CPU23: online
    CPU24: online
    CPU25: online
    CPU26: online
    CPU27: online
    CPU28: online
    CPU29: online
    CPU30: online
    CPU31: online
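
    A listing like the one above can also be checked mechanically. The following is only a sketch: the function name `cpus_all_online` is made up, and it assumes the sparc64 `/proc/cpuinfo` layout shown here (an `ncpus probed` line plus one `CPUn: online` line per CPU).

    ```shell
    # Sketch: compare "ncpus probed" with the number of CPUs listed as online.
    # Assumes the sparc64 /proc/cpuinfo layout quoted above.
    cpus_all_online() {
        # Usage: cpus_all_online [file]   (defaults to /proc/cpuinfo)
        awk -F: '
            /^[ \t]*ncpus probed/              { probed = $2 + 0 }
            /^[ \t]*CPU[0-9]+/ && $2 ~ /online/ { online++ }
            END {
                printf "%d/%d online\n", online, probed
                exit (online == probed ? 0 : 1)
            }
        ' "${1:-/proc/cpuinfo}"
    }
    ```

    The function prints a summary and returns non-zero when fewer CPUs are online than were probed, so it can be used directly in an `if` after booting a test kernel.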

  • From John Paul Adrian Glaubitz@21:1/5 to Riccardo Mottola on Fri Jan 14 18:30:01 2022
    Hi!

    On 1/14/22 17:58, Riccardo Mottola wrote:
    as Frank asked, I compiled myself a kernel using his latest commit
    identified as good:
    67e306c6906137020267eb9bbdbc127034da3627

    and this kernel works, but then fails to load initramfs.

    Did you forget to create an initrd? After installing the kernel, run:

    $ update-initramfs -k KERNEL_VERSION -c

    The good news is that latest kernel installed seems to boot and takes
    all CPUs online. How stable it is I don't know, it needs to be tested.

    Please run some stress tests such as stress-ng and report back.
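
    One possible way to structure such a run, sketched under assumptions: stress-ng is driven one stressor class at a time, so that a lockup can be tied to a class. The options `--class`, `--sequential`, `--timeout` and `--metrics-brief` are stress-ng options; the function name, the class list and the defaults are illustrative choices, and the `STRESS_NG` variable is a hypothetical override that allows a dry run with `echo`.

    ```shell
    # Sketch: run stress-ng class by class instead of everything at once.
    # The class list and defaults are assumptions, not a recommendation
    # from this thread.
    run_by_class() {
        workers=${1:-8}      # instances per stressor
        timeout=${2:-120s}   # time limit per stressor
        for class in cpu memory vm filesystem scheduler; do
            echo "== class: $class =="
            ${STRESS_NG:-stress-ng} --class "$class" --sequential "$workers" \
                --timeout "$timeout" --metrics-brief \
                || echo "class $class reported failures"
        done
    }
    ```

    `STRESS_NG=echo run_by_class 4 10s` only prints the command lines; without the override the real tool is invoked, ideally with a serial console attached to catch kernel messages.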

    Adrian

    --
    .''`. John Paul Adrian Glaubitz
    : :' : Debian Developer - glaubitz@debian.org
    `. `' Freie Universitaet Berlin - glaubitz@physik.fu-berlin.de
    `- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913

  • From Riccardo Mottola@21:1/5 to John Paul Adrian Glaubitz on Mon Jan 17 14:50:01 2022
    Hi Adrian,

    John Paul Adrian Glaubitz wrote:
    Did you forget to create an initrd? After installing the kernel, run:

    $ update-initramfs -k KERNEL_VERSION -c

    I did not run it this way, will do.

    I did have one, however, and a very big one:
    316M Jan 14 17:15 initrd.img-5.9.0-rc1+

    which filled up my /boot.

    I removed it and regenerated it with your command, but I get dropped
    into the initramfs shell with no modules found. Hmm..


    The good news is that latest kernel installed seems to boot and takes
    all CPUs online. How stable it is I don't know, it needs to be tested.
    Please run some stress tests such as stress-ng and report back.

    Not nice. I started compiling some stuff and the box froze. I connected
    the serial console and could not resume due to a "Fast Data Access MMU
    miss".

    I will now stress things again, keeping the serial console attached to another computer, and see.

    Up to last week, with the old 5.9 kernel, I had no issues compiling even
    large things such as the Gecko-based ArcticFox or the Linux kernel
    itself. So unless the Fire failed over the weekend, this smells like
    kernel instability.

    What should I use in stress-ng? I just tried "--all 8 --timeout 120s"

    and the machine locked up after a little and in the serial console I see:

    [ 8563.833509] current->{active_,}mm->context = 0000000000000fcb
    [ 8563.833523] current->{active_,}mm->pgd = ffff8000d35c8000
    [ 8563.846347] Unable to handle kernel NULL pointer dereference in mna handler
    [ 8563.846365] at virtual address 00000000000000e7
    [ 8563.846380] current->{active_,}mm->context = 0000000000000fcc
    [ 8563.846395] current->{active_,}mm->pgd = ffff8000d2d3c000
    [ 8563.856171] Unable to handle kernel NULL pointer dereference
    [ 8563.863274] tsk->{mm,active_mm}->context = 0000000000000fd2
    [ 8563.863294] tsk->{mm,active_mm}->pgd = ffff8000d3fc0000
    [ 8563.928911] Unable to handle kernel NULL pointer dereference in mna handler
    [ 8563.928935] at virtual address 00000000000000e7
    [ 8563.928955] current->{active_,}mm->context = 0000000000000fde
    [ 8563.928972] current->{active_,}mm->pgd = ffff8000d32e8000
    [ 8563.952221] Unable to handle kernel NULL pointer dereference in mna handler
    [ 8563.952244] at virtual address 00000000000000e7
    [ 8563.952261] current->{active_,}mm->context = 0000000000000fe3
    [ 8563.952278] current->{active_,}mm->pgd = ffff8000d2f54000
    [ 8563.954004] Unable to handle kernel NULL pointer dereference in mna handler
    [ 8563.954022] at virtual address 00000000000000e7
    [ 8563.954037] current->{active_,}mm->context = 0000000000000fe5
    [ 8563.954053] current->{active_,}mm->pgd = ffff8000d2d5c000
    [ 8563.972643] Unable to handle kernel NULL pointer dereference
    [ 8563.972660] tsk->{mm,active_mm}->context = 0000000000000fea
    [ 8563.972677] tsk->{mm,active_mm}->pgd = ffff8000d31300

    These are kernel messages, not OF, so it looks like a kernel problem

    Riccardo

  • From Riccardo Mottola@21:1/5 to All on Mon Jan 17 15:20:01 2022
    I reply to myself.

    I ran the old 5.9 kernel from Debian, which has proven quite stable.
    I ran the same tests and did indeed find one error in the console.


    [ 380.918996] Unable to handle kernel NULL pointer dereference
    [ 380.919198] tsk->{mm,active_mm}->context = 000000000000057d
    [ 380.919326] tsk->{mm,active_mm}->pgd = ffff8003f1fd4000
    [ 380.919496] \|/ ____ \|/
    "@'/ .. \`@"
    /_| \__/ |_\
    \__U_/
    [ 380.919510] stress-ng(1529): Oops [#287]
    [ 380.919536] CPU: 24 PID: 1529 Comm: stress-ng Tainted: G D E X 5.9.0-5-sparc64-smp #1 Debian 5.9.15-1
    [ 380.919557] TSTATE: 0000008811001602 TPC: 000000000042d8e0 TNPC: 000000000042d8e4 Y: 00000000 Tainted: G D E X
    [ 380.919587] TPC: <do_signal+0x440/0x560>
    [ 380.919604] g0: ffff800100ef7194 g1: 0000000000000328 g2: 0000000000000000 g3: ffff80010002c000
    [ 380.919620] g4: ffff8003cf6f6b40 g5: ffff8003fdea4000 g6: ffff8003cf9cc000 g7: 0000000000004000
    [ 380.919634] o0: 00000000000001e8 o1: 0000000000000328 o2: ffff8003cf9cc000 o3: 0000000000000007
    [ 380.919650] o4: 0000000000000007 o5: fffffffffffffff2 sp: ffff8003cf9cf451 ret_pc: 000000000042d8c4
    [ 380.919673] RPC: <do_signal+0x424/0x560>
    [ 380.919690] l0: 0208000104000004 l1: 00000044f0000226 l2: ffff800100ef7194 l3: 0000000000000000
    [ 380.919705] l4: 0000000000000000 l5: 0000000000000005 l6: ffff8003cf9cc000 l7: 0000000000698c20
    [ 380.919719] i0: 0000000000000070 i1: 0000000000000208 i2: fffffffffffffff2 i3: ffff8003cf9eff70
    [ 380.919732] i4: fffffffffffffff2 i5: 0000000000000000 i6: ffff8003cf9cf4d1 i7: 000000000042d6fc
    [ 380.919752] I7: <do_signal+0x25c/0x560>
    [ 380.919760] Call Trace:
    [ 380.919783] [<000000000042d6fc>] do_signal+0x25c/0x560
    [ 380.919806] [<000000000042e218>] do_notify_resume+0x58/0xa0
    [ 380.919828] [<0000000000404b48>] __handle_signal+0xc/0x30
    [ 380.919852] Caller[000000000042d6fc]: do_signal+0x25c/0x560
    [ 380.919874] Caller[000000000042e218]: do_notify_resume+0x58/0xa0
    [ 380.919893] Caller[0000000000404b48]: __handle_signal+0xc/0x30
    [ 380.919910] Caller[ffff800100ef716c]: 0xffff800100ef716c
    [ 380.919916] Instruction DUMP:
    [ 380.919923] c029a00d
    [ 380.919930] b4168008
    [ 380.919938] 900761e8
    [ 380.919945] <d25e2070>
    [ 380.919952] 40014fef
    [ 380.919959] b416801c
    [ 380.919965] c2592468
    [ 380.919972] b8100008
    [ 380.919979] 920126c8

    [ 380.972358] systemd-journald[66048]: File /var/log/journal/bdb2a41ce825489ba567bea53add247e/system.journal corrupted or uncleanly shut down, renaming and replacing.
    [ 407.494981] systemd[1]: Started Journal Service.


    as well as errors in the stressors:
    stress-ng: info: [12989] stress-ng-fanotify: 148 open, 41 close write, 128 close nowrite, 96 access, 27 modify
    stress-ng: info: [12908] stress-ng-fanotify: 159 open, 66 close write, 108 close nowrite, 88 access, 43 modify
    stress-ng: info: [12911] stress-ng-fanotify: 147 open, 43 close write, 122 close nowrite, 99 access, 20 modify
    stress-ng: info: [13079] stress-ng-fanotify: 159 open, 60 close write, 112 close nowrite, 97 access, 32 modify
    stress-ng: info: [12820] stress-ng-fanotify: 155 open, 46 close write, 123 close nowrite, 87 access, 27 modify
    stress-ng: info: [913] unsuccessful run completed in 282.58s (4 mins, 42.58 secs)
    stress-ng: fail: [913] chattr instance 2 corrupted bogo-ops counter, 48 vs 0
    stress-ng: fail: [913] chattr instance 2 hash error in bogo-ops counter and run flag, 1918819509 vs 0
    stress-ng: fail: [913] chattr instance 6 corrupted bogo-ops counter, 50 vs 0
    stress-ng: fail: [913] chattr instance 6 hash error in bogo-ops counter and run flag, 506138270 vs 0
    stress-ng: fail: [913] dnotify instance 4 corrupted bogo-ops counter, 224 vs 0
    info: 5 failures reached, aborting stress process
    stress-ng: fail: [913] dnotify instance 4 hash error in bogo-ops counter and run flag, 1503783545 vs 0
    stress-ng: fail: [913] dnotify instance 6 corrupted bogo-ops counter, 222 vs 0
    stress-ng: fail: [913] dnotify instance 6 hash error in bogo-ops counter and run flag, 4199465241 vs 0
    stress-ng: fail: [913] metrics-check: stressor metrics corrupted, data is compromised


    However, the machine did not crash.
    I ran exactly the same stress command again and the failures are reproducible, so I suppose the tests may not be 64-bit big-endian safe, or something like that.

  • From John Paul Adrian Glaubitz@21:1/5 to Riccardo Mottola on Mon Jan 17 16:10:02 2022
    Hi!

    On 1/17/22 14:41, Riccardo Mottola wrote:
    The good news is that latest kernel installed seems to boot and takes
    all CPUs online. How stable it is I don't know, it needs to be tested.

    Please run some stress tests such as stress-ng and report back.

    Not nice. I started compiling some stuff and the box froze, I connected serial console and could not resume due to Fast Data Access MMU miss"

    So, this crash occurs with the latest 5.15 kernel on your T2000?

    In my experience, the most stable kernels on the older SPARCs are still the 4.19 kernels. Thus, we should start bisecting to find out what commit actually made the kernel unreliable on these older SPARCs.
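
    For a manual bisection like this, each test boot has to be judged good or bad. A minimal sketch of such a check follows, assuming the serial console output of each boot is captured to a file. The function name and the log file are hypothetical; the matched strings are the oops and panic messages quoted in this thread.

    ```shell
    # Sketch: turn a captured serial-console log into a verdict for
    # "git bisect". The signatures come from the oopses quoted in this
    # thread; everything else is an assumption.
    classify_boot_log() {
        if grep -q -e 'Unable to handle kernel' -e 'Kernel panic' "$1"; then
            echo bad
        else
            echo good
        fi
    }

    # Possible use inside a kernel tree (the endpoints are examples only):
    #   git bisect start
    #   git bisect good v4.19
    #   git bisect bad v5.15
    #   ... build, install, boot, capture console.log ...
    #   git bisect "$(classify_boot_log console.log)"
    ```

    A log from a machine that hung silently contains no signature and would be classified as good, so the verdict still needs a human look before it is fed to `git bisect`.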

    Adrian

    --
    .''`. John Paul Adrian Glaubitz
    : :' : Debian Developer - glaubitz@debian.org
    `. `' Freie Universitaet Berlin - glaubitz@physik.fu-berlin.de
    `- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913

  • From Riccardo Mottola@21:1/5 to John Paul Adrian Glaubitz on Mon Jan 17 20:40:02 2022
    Hi,

    John Paul Adrian Glaubitz wrote:
    Not nice. I started compiling some stuff and the box froze, I connected
    serial console and could not resume due to Fast Data Access MMU miss"
    So, this crash occurs with the latest 5.15 kernel on your T2000?

    Exactly, the latest kernel.

    I will retest it with stress-ng as soon as I finish this email and copy
    the dmesg errors.

    In my experience, the most stable kernels on the older SPARCs are still the 4.19 kernels. Thus, we should start bisecting to find out what commit actually
    made the kernel unreliable on these older SPARCs.


    We must find a good way to test though. I stress-tested the 5.9 kernel
    further. The system sometimes seemed unresponsive, but eventually
    recovered (some errors are pasted below for more detail). Thus I would
    consider it "stable". I ran several small bursts of tests and then a
    session set to 30 minutes which, due to hiccups, lasted more like 2
    hours; afterwards the machine was still up.

    sudo stress-ng --all 10 --timeout 30m

    10 instances is more than the number of physical cores, but fewer than
    the logical CPUs (32). The system has 16 GB of RAM. I see some OOMs in
    dmesg, and I wonder whether this is due to certain stress tests
    deliberately pushing against limits.

    [16195.300448] Unable to handle kernel NULL pointer dereference in mna
    handler
    [16195.741725] 40014fef
    [16195.741793] at virtual address 00000000000000e7
    [16195.767936] b416801c
    [16195.767945] c2592468
    [16195.767990] current->{active_,}mm->context = 0000000000000bb8
    [16195.768848] b8100008
    [16195.768857] 920126c8
    [16195.769673] current->{active_,}mm->pgd = ffff800089cac000

    [16195.770413] \|/ ____ \|/
    "@'/ .. \`@"
    /_| \__/ |_\
    \__U_/
    [16196.303333] systemd-journald[219777]: /dev/kmsg buffer overrun, some messages lost.
    [16196.304235] stress-ng(234874): Oops [#864]
    [16196.304262] CPU: 8 PID: 234874 Comm: stress-ng Tainted: G D E X 5.9.0-5-sparc64-smp #1 Debian 5.9.15-1
    [16196.304281] TSTATE: 0000008811001605 TPC: 000000000042d8e0 TNPC: 000000000042d8e4 Y: 00000000 Tainted: G D E X
    [16196.304311] TPC: <do_signal+0x440/0x560>
    [16196.304327] g0: 000000000040770c g1: 000000000000032f g2: 0000000000000000 g3: ffff80010007c000
    [16196.304341] g4: ffff8003f13f9240 g5: ffff8003fdaa4000 g6: ffff800087df8000 g7: 0000000000004000
    [16196.304355] o0: 00000000000001ef o1: 000000000000032f o2: ffff800087df8000 o3: 0000000000000007
    [16196.304368] o4: 0000000000000007 o5: fffffffffffffff2 sp: ffff800087dfb451 ret_pc: 000000000042d8c4
    [16196.304390] RPC: <do_signal+0x424/0x560>
    [16196.304404] l0: 0308000103000004 l1: 00000044f0001201 l2: 000000000040770c l3: 0000000000000000
    [16196.304418] l4: 0000000000000000 l5: ffff80010007c000 l6: ffff800087df8000 l7: 0000000011001002
    [16196.304432] i0: 0000000000000077 i1: 000000000000020f i2: fffffffffffffff2 i3: ffff800187dfff70
    [16196.304445] i4: fffffffffffffff2 i5: 0000000000000007 i6: ffff800087dfb4d1 i7: 000000000042d6fc
    [16196.304472] I7: <do_signal+0x25c/0x560>
    [16205.284863] aes_sparc64: sparc64 aes opcodes not available.
    [16205.753417] Call Trace:
    [16205.753453] [<000000000042d6fc>] do_signal+0x25c/0x560
    [16205.753478] [<000000000042e218>] do_notify_resume+0x58/0xa0
    [16205.753500] [<0000000000404b48>] __handle_signal+0xc/0x30
    [16205.753525] Caller[000000000042d6fc]: do_signal+0x25c/0x560
    [16205.753546] Caller[000000000042e218]: do_notify_resume+0x58/0xa0
    [16205.753562] Caller[0000000000404b48]: __handle_signal+0xc/0x30
    [16205.753575] Caller[000001000007294c]: 0x1000007294c
    [16205.753580] Instruction DUMP:
    [16205.753587] c029a00d
    [16205.753595] b4168008
    [16205.753602] 900761e8
    [16205.753610] <d25e2070>
    [16205.753616] 40014fef
    [16205.753623] b416801c
    [16205.753629] c2592468
    [16205.753636] b8100008
    [16205.753644] 920126c8


    then also these messages. I think they explain the "slowness" and
    apparent freeze of the system - I was about to power-cycle but waited
    and it recovered:

    [16253.233924] ata1.00: qc timeout (cmd 0xa0)
    [16335.213786] PM: hibernation: Basic memory bitmaps created
    [16830.619976] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
    [16830.620193] (detected by 18, t=5252 jiffies, g=711181, q=6)
    [16830.620215] rcu: All QSes seen, last rcu_sched kthread activity 1191 (4299098242-4299097051), jiffies_till_next_fqs=1, root ->qsmask 0x0
    [16830.620491] rcu: rcu_sched kthread starved for 1191 jiffies! g711181 f0x2 RCU_GP_CLEANUP(7) ->state=0x0 ->cpu=30
    [16830.620749] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
    [16830.620844] rcu: RCU grace-period kthread stack dump:
    [16830.621069] task:rcu_sched state:R running task stack: 0 pid: 10 ppid: 2 flags:0x05000000
    [16830.621095] Call Trace:
    [16830.621128] [<0000000000bda560>] _cond_resched+0x40/0x60
    [16830.621153] [<00000000004ee1d0>] rcu_gp_kthread+0x9b0/0xe40
    [16830.621175] [<0000000000491c48>] kthread+0x108/0x120
    [16830.621205] [<00000000004060c8>] ret_from_fork+0x1c/0x2c
    [16830.621224] [<0000000000000000>] 0x0
    [16982.524373] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
    [16982.524591] (detected by 20, t=5252 jiffies, g=711637, q=15)
    [16982.524612] rcu: All QSes seen, last rcu_sched kthread activity 5247 (4299136209-4299130962), jiffies_till_next_fqs=1, root ->qsmask 0x0
    [16982.524839] rcu: rcu_sched kthread starved for 5247 jiffies! g711637 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=16
    [16982.525098] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
    [16982.525201] rcu: RCU grace-period kthread stack dump:
    [16982.525377] task:rcu_sched state:R running task stack: 0 pid: 10 ppid: 2 flags:0x06000000
    [16982.525404] Call Trace:
    [16982.525435] [<0000000000bda3d4>] schedule+0x54/0x100
    [16982.525464] [<0000000000bddc50>] schedule_timeout+0x70/0x140
    [16982.525489] [<00000000004edeb4>] rcu_gp_kthread+0x694/0xe40
    [16982.525511] [<0000000000491c48>] kthread+0x108/0x120
    [16982.525540] [<00000000004060c8>] ret_from_fork+0x1c/0x2c
    [16982.525558] [<0000000000000000>] 0x0
    [17596.494910] sched: RT throttling activated
    [17664.665608] PM: hibernation: Basic memory bitmaps freed
    [17664.838884] audit: type=1400 audit(1642442424.829:817): apparmor="STATUS" info="failed to unpack policydb" error=-86 profile="unconfined" name="/usr/bin/pulseaudio-eg" pid=234012 comm="stress-ng" name="/usr/bin/pulseaudio-eg" offset=2536
    [17665.077468] aes_sparc64: sparc64 aes opcodes not available.
    [17665.685823] aes_sparc64: sparc64 aes opcodes not available.
    [17686.297683] systemd[1]: systemd-journald.service: Main process exited, code=killed, status=6/ABRT
    [17686.300569] systemd[1]: systemd-journald.service: Failed with result 'watchdog'.
    [17686.733029] systemd[1]: systemd-journald.service: Consumed 53.065s CPU time.
    [17686.938707] systemd[1]: systemd-journald.service: Scheduled restart job, restart counter is at 3.
    [17687.012114] systemd[1]: Stopped Journal Service.
    [17687.020312] systemd[1]: systemd-journald.service: Consumed 53.065s CPU time.
    [17690.324815] systemd[1]: Starting Journal Service...
    [17690.831298] systemd-journald[258852]: File /var/log/journal/bdb2a41ce825489ba567bea53add247e/system.journal corrupted or uncleanly shut down, renaming and replacing.
    [17709.718653] systemd[1]: Started Journal Service.



    Perhaps we can at least understand these errors and restrict the run
    to specific tests? That could give us a better test procedure, and
    Frank could also try to run the same tests on his systems.
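
    One way to compare such runs between machines is to tally which kernel symbol each oops faulted in. The following is a sketch only: the function name is made up, and it assumes a saved `dmesg` or console log in the sparc64 oops format quoted above, where the faulting symbol appears on a `TPC: <symbol+offset/size>` line.

    ```shell
    # Sketch: count oopses per faulting symbol in a saved kernel log.
    # Only "TPC: <...>" lines are considered; the TSTATE line, which has
    # "TPC: " followed by a raw address, is ignored.
    summarize_oopses() {
        grep -o 'TPC: <[^>]*>' "$1" \
            | sed 's/^TPC: <//; s/+.*$//' \
            | sort | uniq -c | sort -rn \
            | awk '{ print $1, $2 }'
    }
    ```

    Run against the two 5.9 traces in this thread, this would group both under `do_signal`, which makes it easier to see whether different stressors or machines hit the same spot.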

    Riccardo

  • From Riccardo Mottola@21:1/5 to Riccardo Mottola on Mon Jan 17 21:40:02 2022
    Hi,


    Riccardo Mottola wrote:
    John Paul Adrian Glaubitz wrote:
    Not nice. I started compiling some stuff and the box froze, I connected
    serial console and could not resume due to Fast Data Access MMU miss"
    So, this crash occurs with the latest 5.15 kernel on your T2000?
    exactly latest kernel.

    I will retest it with stress-ng as soon as I finish this email and copy
    the dmesg errors.



    Wow, running the test suite once or twice is enough to make the system power-cycle.

    Frank, test the latest kernel on yours :)

    Riccardo

  • From Frank Scheiner@21:1/5 to Riccardo Mottola on Wed Feb 2 15:30:01 2022
    Hi Riccardo, all,

    On 17.01.22 21:35, Riccardo Mottola wrote:
    Hi,


    Riccardo Mottola wrote:
    John Paul Adrian Glaubitz wrote:
    Not nice. I started compiling some stuff and the box froze, I connected
    serial console and could not resume due to "Fast Data Access MMU miss"
    So, this crash occurs with the latest 5.15 kernel on your T2000?
    Exactly, the latest kernel.

    I will retest it with stress-ng as soon as I finish this email and copy
    the dmesg errors.



    Wow, running the test suite just once or twice is enough to make the system power-cycle.

    Frank, please test the latest kernel on yours :)

    Yesterday I found the time to try Linux 5.15.0-3 on my T1000
    (UltraSPARC T1) and V210 (US IIIi), but the boot issue is still there,
    at least for my use case: the klibc-based tools inside the initramfs
    are unable to mount the root FS over NFS (details further below).

    But it's still good to see that mounting an on-disk root FS seems to
    work now for your T2000, though the instabilities during runtime are not reassuring.

    For me the last good Debian kernel - at least for booting, more on that
    shortly - is 5.9.0-5. Both the T1000 and the V210 boot fine with it
    (including mounting the root FS via NFS, v3 BTW). During operation,
    however (tested with `apt upgrade` on a root FS replicated multiple times
    from the same tarball), the V210 crashes with a kernel panic, while the
    T1000 does not. The V210 shows the same behaviour with 5.8.0-3. With
    kernel 4.19.0-5 on the V210, the same test runs without problems, and not
    even the messages below appear.

    The crash when running 5.9.0-5 or 5.8.0-3 is usually "announced" (or at
    least accompanied) by one or more occurrence(s) of the following messages:
    ```
    [...]
    [ 360.489852] CPU[0]: Cheetah+ D-cache parity error at TPC[00000000005b28c8]
    [ 360.580300] TPC<bpf_check+0x1f68/0x34e0>
    [...]
    ```
    ...which should be familiar to UltraSPARC IIIi users running newer kernels
    (see for example [1], which shows it for 4.16.x). According to [2] this
    error should be recoverable (otherwise it would be followed by a panic
    and "Irrecoverable Cheetah+ parity error."). Recovery does seem to work
    until at some point it no longer does, but I never see that second
    message, so something else must be going wrong.
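
To check how often these reports precede a crash, a saved console log could be scanned for them. The following is a minimal sketch, not a tested tool: the function name `find_parity_errors` is made up here, the pattern is modelled on the message quoted above (matching an I-cache variant as well is an assumption), and whitespace is allowed before `TPC[` because mail clients may wrap the line.

```python
import re

# Modelled on the quoted message; "[DI]" assumes an I-cache variant may also
# appear. "\s+" tolerates the TPC value being wrapped onto the next line.
PARITY_RE = re.compile(
    r"CPU\[(\d+)\]: Cheetah\+ ([DI])-cache parity error at\s+TPC\[([0-9a-f]+)\]"
)

# Message that, per the text above, accompanies the non-recoverable case.
FATAL_MSG = "Irrecoverable Cheetah+ parity error"


def find_parity_errors(log_text):
    """Return a (cpu, cache, tpc) tuple for every parity error report."""
    return [(int(cpu), cache, tpc) for cpu, cache, tpc in PARITY_RE.findall(log_text)]


def saw_irrecoverable(log_text):
    """Tell whether the log contains the 'Irrecoverable' follow-up message."""
    return FATAL_MSG in log_text
```

Run over a captured log, a non-empty result from `find_parity_errors` together with `saw_irrecoverable` returning `False` would match the situation described above: parity reports are present, but the panic is not attributed to them.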

    [1]: https://www.spinics.net/lists/sparclinux/msg21019.html

    [2]: https://github.com/torvalds/linux/blob/master/arch/sparc/kernel/traps_64.c#L1767..L1799

    Of course our CPUs' caches don't go pop magically. Something is broken
    in the newer kernels (> 4.19.x) for UltraSPARC IIIi (and most likely all
    the related processors, too), apart from the NFS mounting issues (see [3]
    for the processors affected by this; one update to that: US II is not
    affected either).

    [3]: https://lists.debian.org/debian-sparc/2021/12/msg00004.html

    If I find the time and mood I'll try to bisect this US IIIi specific
    issue in the hope that we will eventually get a fix for it, also still
    hoping for a fix for [4].

    [4]: https://lists.debian.org/debian-sparc/2021/03/msg00045.html

    Cheers,
    Frank

    ****

    ## T1000 ##

    ```
    [...]
    [ 0.000116] Linux version 5.15.0-3-sparc64-smp (debian-kernel@lists.debian.org) (gcc-11 (Debian 11.2.0-14) 11.2.0, GNU ld (GNU Binutils for Debian) 2.37.90.20220123) #1 SMP Debian 5.15.15-2 (2022-01-30)
    [...]
    [ 12.484314] tg3 0001:03:04.0 enP1p3s4f0: Link is up at 1000 Mbps, full duplex
    [ 12.484520] tg3 0001:03:04.0 enP1p3s4f0: Flow control is on for TX and on for RX
    [ 12.484689] IPv6: ADDRCONF(NETDEV_CHANGE): enP1p3s4f0: link becomes ready
    [ 16.765173] Unable to handle kernel paging request at virtual address 0000612000000000
    [ 16.765384] tsk->{mm,active_mm}->context = 000000000000006e
    [ 16.765493] tsk->{mm,active_mm}->pgd = ffff800014af0000
    [ 16.765650] \|/ ____ \|/
    [ 16.765650] "@'/ .. \`@"
    [ 16.765650] /_| \__/ |_\
    [ 16.765650] \__U_/
    [ 16.765975] nfsmount(374): Oops [#1]
    [ 16.766167] CPU: 2 PID: 374 Comm: nfsmount Tainted: G E 5.15.0-3-sparc64-smp #1 Debian 5.15.15-2
    [ 16.766345] TSTATE: 0000000011001607 TPC: 00000000006a5fe8 TNPC: 00000000006a5fec Y: 00000000 Tainted: G E
    [ 16.766642] TPC: <kfree+0x48/0x2c0>
    [ 16.766704] g0: ffff80000f2e7451 g1: 0000000400000000 g2: 0000600000000000 g3: ffff8001fd786000
    [ 16.766802] g4: ffff800014245e80 g5: ffff8001fd786000 g6: ffff80000f2e4000 g7: ffff80000f2e7c30
    [ 16.766983] o0: fffffffffffffffe o1: 00000000006fd714 o2: 0000000000002000 o3: ffff80000f2cbaf8
    [ 16.767209] o4: 0000000000000008 o5: 0000000000000cc0 sp: ffff80000f2e7491 ret_pc: 00000000006fd6d4
    [ 16.767292] RPC: <sys_mount+0x74/0x1a0>
    [ 16.767456] l0: ffff800014398408 l1: ffff8001fedeaa00 l2: 0000000000422db4 l3: 0000000000201e00
    [ 16.767591] l4: 000000000000029c l5: ffff80010000c1a0 l6: ffff80000f2e4000 l7: 00000000006fd660
    [ 16.767771] i0: 0000000000000cc0 i1: 0000000000201ff0 i2: 0000000000000001 i3: ffff80000f2e7dd0
    [ 16.767996] i4: 0000000000000000 i5: 0000612000000000 i6: ffff80000f2e7561 i7: 00000000006fd714
    [ 16.768079] I7: <sys_mount+0xb4/0x1a0>
    [ 16.768189] Call Trace:
    [ 16.768326] [<00000000006fd714>] sys_mount+0xb4/0x1a0
    [ 16.768456] [<00000000006fd6d4>] sys_mount+0x74/0x1a0
    [ 16.768628] [<0000000000406274>] linux_sparc_syscall+0x34/0x44
    [ 16.768856] Disabling lock debugging due to kernel taint
    [ 16.768917] Caller[00000000006fd714]: sys_mount+0xb4/0x1a0
    [ 16.769093] Caller[00000000006fd6d4]: sys_mount+0x74/0x1a0
    [ 16.769316] Caller[0000000000406274]: linux_sparc_syscall+0x34/0x44
    [ 16.769444] Caller[0000000000100a94]: 0x100a94
    [ 16.769596] Instruction DUMP:
    [ 16.769603] ba074001
    [ 16.769693] bb2f7003
    [ 16.769735] ba074002
    [ 16.769775] <c25f6008>
    [ 16.769865] 84086001
    [ 16.770037] 82007fff
    [ 16.770134] 8378841d
    [ 16.770226] ba100001
    [ 16.770315] c2586008
    [ 16.770456]
    Killed
    Begin: Retrying nfs mount ...
    [...]
    ```
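
When comparing oopses across machines and kernel versions, it can help to reduce each dump to the faulting symbol and the call trace. The sketch below shows one possible way to do that for the sparc64 format seen above. The function name `summarize_oops` and the returned dictionary layout are illustrative choices, and the patterns are derived only from this one dump, so other oops layouts may not match.

```python
import re

# "TPC: <kfree+0x48/0x2c0>" -> symbol and offset of the faulting instruction.
# The "TSTATE: ... TPC: 0000..." register line is not matched (no "<").
TPC_RE = re.compile(r"TPC: <([\w.]+)\+(0x[0-9a-f]+)/0x[0-9a-f]+>")

# "[<00000000006fd714>] sys_mount+0xb4/0x1a0" -> one call trace frame.
FRAME_RE = re.compile(r"\[<[0-9a-f]{16}>\] ([\w.]+)\+(0x[0-9a-f]+)/0x[0-9a-f]+")

# "Unable to handle kernel paging request at virtual address <hex>"
VADDR_RE = re.compile(r"paging request at virtual address ([0-9a-f]+)")


def summarize_oops(text):
    """Extract the fault address, faulting symbol and call trace from an oops."""
    tpc = TPC_RE.search(text)
    vaddr = VADDR_RE.search(text)
    return {
        "fault_address": vaddr.group(1) if vaddr else None,
        "fault_symbol": tpc.group(1) if tpc else None,
        "fault_offset": int(tpc.group(2), 16) if tpc else None,
        "call_trace": [(sym, int(off, 16)) for sym, off in FRAME_RE.findall(text)],
    }
```

Applied to the T1000 dump, this yields `kfree` as the faulting symbol, with `sys_mount` and `linux_sparc_syscall` in the call trace.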

    ## V210 ##

    ```
    [...]
    [ 0.000168] Linux version 5.15.0-3-sparc64-smp (debian-kernel@lists.debian.org) (gcc-11 (Debian 11.2.0-14) 11.2.0, GNU ld (GNU Binutils for Debian) 2.37.90.20220123) #1 SMP Debian 5.15.15-2 (2022-01-30)
    [...]
    [ 40.241993] tg3 0000:00:02.0 enp0s2f0: Link is up at 1000 Mbps, full duplex
    [ 40.333591] tg3 0000:00:02.0 enp0s2f0: Flow control is on for TX and on for RX
    [ 40.428669] IPv6: ADDRCONF(NETDEV_CHANGE): enp0s2f0: link becomes ready
    [ 44.294909] FS-Cache: Loaded
    [ 44.397657] RPC: Registered named UNIX socket transport module.
    [ 44.475650] RPC: Registered udp transport module.
    [ 44.537450] RPC: Registered tcp transport module.
    [ 44.599295] RPC: Registered tcp NFSv4.1 backchannel transport module.
    [ 44.815002] FS-Cache: Netfs 'nfs' registered for caching
    mount: Invalid argument
    Begin: Retrying nfs mount ... mount: Invalid argument
    done.
    [...]
    ```
