Forum: >>> Magnum BBS <<<

5.10.0-4-sparc64-smp #1 Debian 5.10.19-1 crashes on T2000

From John Paul Adrian Glaubitz@21:1/5 to Riccardo Mottola on Tue Mar 9 13:40:01 2021

Hello Riccardo!

On 3/9/21 1:23 PM, Riccardo Mottola wrote:

while I was able to "install" correctly using a slightly older ISO, I do not get a bootable
system. The kernel appears to crash very early during boot.

I think this is more likely a hardware issue. We haven't seen any machines crashing that
early. Please make sure the RAM modules in this machine are working properly.

Adrian

--
.''`. John Paul Adrian Glaubitz
: :' : Debian Developer - glaubitz@debian.org
`. `' Freie Universitaet Berlin - glaubitz@physik.fu-berlin.de
`- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Riccardo Mottola@21:1/5 to All on Tue Mar 9 13:30:02 2021

Hi all,

while I was able to "install" correctly using a slightly older ISO, I
do not get a bootable system. The kernel appears to crash very early during
boot.

Does anybody else have this issue?

Booting `Debian GNU/Linux'

Loading Linux 5.10.0-4-sparc64-smp ...
Loading initial ramdisk ...

[ 26.900156] sd 2:1:0:0: [sda] No Caching mode page found
[ 26.900336] sd 2:1:0:0: [sda] Assuming drive cache: write through
/dev/sda2: clean, 31420/4276224 files, 659826/17089844 blocks
[ 30.362550] Unable to handle kernel NULL pointer dereference
[ 30.362722] tsk->{mm,active_mm}->context = 00000000000000ab
[ 30.362818] tsk->{mm,active_mm}->pgd = ffff80000f258000
[ 30.363585] Kernel panic - not syncing: Aiee, killing interrupt handler!
[ 30.363740] OOPS: Bogus kernel PC [00000000000007c0] in fault handler
[ 30.363747] OOPS: RPC [000000000042c614]
[ 30.363766] OOPS: RPC <arch_cpu_idle+0x74/0xc0>
[ 30.363773] OOPS: Fault was to vaddr[7c0]
[ 30.363787] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G D E 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1
[ 30.363792] Call Trace:
[ 30.363808] [<0000000000c5394c>] do_sparc64_fault+0xa4c/0xa80
[ 30.363829] [<0000000000407714>] sparc64_realfault_common+0x10/0x20
[ 30.363839] [<00000000000007c0>] 0x7c0
[ 30.363852] [<0000000000c519a8>] default_idle_call+0x48/0x140
[ 30.363865] [<00000000004a7b40>] do_idle+0xe0/0x1a0
[ 30.363878] [<00000000004a7e5c>] cpu_startup_entry+0x1c/0x80
[ 30.363899] [<0000000000c4b278>] rest_init+0xb8/0xc8
[ 30.363915] [<0000000000fe26a4>] arch_call_rest_init+0xc/0x1c
[ 30.363930] [<0000000000fe2d40>] start_kernel+0x628/0x640
[ 30.363946] [<0000000000fe532c>] start_early_boot+0x2a0/0x2b0
[ 30.363962] [<0000000000c4b1a0>] tlb_fixup_done+0x4c/0x6c
[ 30.363972] [<000000000016a60c>] 0x16a60c
[ 30.363978] Unable to handle kernel NULL pointer dereference
[ 30.363984] tsk->{mm,active_mm}->context = 00000000000000b5
[ 30.363990] tsk->{mm,active_mm}->pgd = ffff800014594000
[ 30.363997] \|/ ____ \|/
[ 30.363997] "@'/ .. \`@"
[ 30.363997] /_| \__/ |_\
[ 30.363997] \__U_/
[ 30.364004] swapper/0(0): Oops [#2]
[ 30.364017] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G D E 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1
[ 30.364027] TSTATE: 0000004480001600 TPC: 00000000000007c0 TNPC: 00000000000007c4 Y: 00000000 Tainted: G D
[ 30.364036] TPC: <0x7c0>
[ 30.364044] g0: 0000000040004059 g1: 0000000000000016 g2: 00000000f0200000 g3: 00000000fff78000
[ 30.364053] g4: 0000000000005a20 g5: ffff8003fd79c000 g6: 0000000000e80000 g7: 00000000000043ba
[ 30.364061] o0: 00000000000007c0 o1: 0000000000000000 o2: 0000000000000000 o3: 0000000000000000
[ 30.364070] o4: 0000000000000000 o5: 0000000000000000 sp: 0000000000e831a1 ret_pc: 000000000042c614
[ 30.364084] RPC: <arch_cpu_idle+0x74/0xc0>
[ 30.364093] l0: 0000000000f8b7d8 l1: 000000004000407c l2: 0000000040004059 l3: 0000000000000040
[ 30.364102] l4: 00000000f027e7f8 l5: 0000000040004128 l6: 00000000000ed000 l7: 00000000f025cfd8
[ 30.364110] i0: 000000000000000e i1: 0000000000e80008 i2: 0000000000004000 i3: 00000000000007c0
[ 30.364118] i4: 00000000fef42ff8 i5: 00000000fef41800 i6: 0000000000e83251 i7: 0000000000c519a8
[ 30.364131] I7: <default_idle_call+0x48/0x140>
[ 30.364137] Call Trace:
[ 30.364150] [<0000000000c519a8>] default_idle_call+0x48/0x140
[ 30.364162] [<00000000004a7b40>] do_idle+0xe0/0x1a0
[ 30.364175] [<00000000004a7e5c>] cpu_startup_entry+0x1c/0x80
[ 30.364191] [<0000000000c4b278>] rest_init+0xb8/0xc8
[ 30.364207] [<0000000000fe26a4>] arch_call_rest_init+0xc/0x1c
[ 30.364221] [<0000000000fe2d40>] start_kernel+0x628/0x640
[ 30.364236] [<0000000000fe532c>] start_early_boot+0x2a0/0x2b0
[ 30.364252] [<0000000000c4b1a0>] tlb_fixup_done+0x4c/0x6c
[ 30.364262] [<000000000016a60c>] 0x16a60c
[ 30.364276] Caller[0000000000c519a8]: default_idle_call+0x48/0x140
[ 30.364288] Caller[00000000004a7b40]: do_idle+0xe0/0x1a0
[ 30.364300] Caller[00000000004a7e5c]: cpu_startup_entry+0x1c/0x80
[ 30.364315] Caller[0000000000c4b278]: rest_init+0xb8/0xc8
[ 30.364330] Caller[0000000000fe26a4]: arch_call_rest_init+0xc/0x1c
[ 30.364343] Caller[0000000000fe2d40]: start_kernel+0x628/0x640
[ 30.364358] Caller[0000000000fe532c]: start_early_boot+0x2a0/0x2b0
[ 30.364373] Caller[0000000000c4b1a0]: tlb_fixup_done+0x4c/0x6c
[ 30.364383] Caller[000000000016a60c]: 0x16a60c
[ 30.364387] Instruction DUMP:
[ 30.364397] Unable to handle kernel NULL pointer dereference
[ 30.364404] tsk->{mm,active_mm}->context = 00000000000000b5
[ 30.364409] tsk->{mm,active_mm}->pgd = ffff800014594000
[ 30.364416] \|/ ____ \|/
[ 30.364416] "@'/ .. \`@"
[ 30.364416] /_| \__/ |_\
[ 30.364416] \__U_/
[ 30.364422] swapper/0(0): Oops [#3]
[ 30.364436] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G D E 5.10.0-4-sparc64-smp #1 Debian 5.10.19-1
[ 30.364447] TSTATE: 0000008880001604 TPC: 0000000000c418a8 TNPC: 0000000000c418ac Y: 00000000 Tainted: G D
[ 30.364469] TPC: <die_if_kernel+0x12c/0x260>
[ 30.364479] g0: 0000000000000004 g1: fffffffffffffff4 g2: 0000000000f29340 g3: 00000000ffffe221
[ 30.364487] g4: 0000000000e9a680 g5: ffff8003fd79c000 g6: 0000000000e80000 g7: 000000000000000e
[ 30.364495] o0: 0000000000d77d78 o1: 0000000000000020 o2: 000000000016a60c o3: 0000000000000020
[ 30.364504] o4: 0000004480001600 o5: 000000000109fc00 sp: 0000000000e82df1 ret_pc: 0000000000c41838
[ 30.364519] RPC: <die_if_kernel+0xbc/0x260>
[ 30.364528] l0: 0000000000000214 l1: 000000004000407c l2: 000000000040770c l3: 0000000000000000


From John Paul Adrian Glaubitz@21:1/5 to Riccardo Mottola on Tue Mar 9 18:40:02 2021

Hi!

On 3/9/21 6:26 PM, Riccardo Mottola wrote:

John Paul Adrian Glaubitz wrote:

while I was able to "install" correctly using a slightly older ISO, I do not get a bootable
system. The kernel appears to crash very early during boot.

I think this is more likely a hardware issue. We haven't seen any machines crashing that
early. Please make sure the RAM modules in this machine are working properly.

I don't think so... I think it is a Kernel issue, since with kernel 5.9.0-2-sparc64-smp #1 SMP Debian 5.9.6-1 (2020-11-08) sparc64 GNU/Linux

the machine is performing fine with network, disk and compiler usage on all 32 CPUs.

Then you need to bisect the kernel as I don't have any means to reproduce the issue.

Adrian

--
.''`. John Paul Adrian Glaubitz
: :' : Debian Developer - glaubitz@debian.org
`. `' Freie Universitaet Berlin - glaubitz@physik.fu-berlin.de
`- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913


From Riccardo Mottola@21:1/5 to John Paul Adrian Glaubitz on Tue Mar 9 18:30:02 2021

Hi,

John Paul Adrian Glaubitz wrote:

while I was able to "install" correctly using a slightly older ISO, I do not get a bootable
system. The kernel appears to crash very early during boot.

I think this is more likely a hardware issue. We haven't seen any machines crashing that
early. Please make sure the RAM modules in this machine are working properly.

I don't think so... I think it is a Kernel issue, since with kernel 5.9.0-2-sparc64-smp #1 SMP Debian 5.9.6-1 (2020-11-08) sparc64 GNU/Linux

the machine is performing fine with network, disk and compiler usage on
all 32 CPUs. I tried a heavy load of parallel compilations, using git on
large repositories as well as using remote X applications at the same
time, a combination I know tends to expose issues on systems, without
problems! Not a single error in syslog.
Machine power-up and self-tests are fine too.

If I remember correctly, there is a repository of various pre-compiled kernel
versions: maybe there are some releases between the two kernels that I can
try in order to do some easy, rough bisecting.

so I'd say RAM, CPUs, Disk and Ethernet are working quite fine

Riccardo


From Frank Scheiner@21:1/5 to John Paul Adrian Glaubitz on Tue Mar 9 21:40:01 2021

Hi guys,

On 09.03.21 18:31, John Paul Adrian Glaubitz wrote:

Hi!

On 3/9/21 6:26 PM, Riccardo Mottola wrote:

John Paul Adrian Glaubitz wrote:

while I was able to "install" correctly using a slightly older ISO, I do not get a bootable
system. The kernel appears to crash very early during boot.

I think this is more likely a hardware issue. We haven't seen any machines crashing that
early. Please make sure the RAM modules in this machine are working properly.

I don't think so... I think it is a Kernel issue, since with kernel
5.9.0-2-sparc64-smp #1 SMP Debian 5.9.6-1 (2020-11-08) sparc64 GNU/Linux

the machine is performing fine with network, disk and compiler usage on all 32 CPUs.

Then you need to bisect the kernel as I don't have any means to reproduce the issue.

I have a T1000 with which I could try to reproduce Riccardo's issues.
Hardware wise they should be pretty similar. As the T1000 doesn't have a
CDROM, I'll try to netboot a few newer kernels and report my findings.
Will take me until next week though, as the machine is in (cold) storage
now.

@Adrian:
Aren't there some build servers using UltraSPARC T2 or T2+? Do they run
with the latest kernels?

Cheers,
Frank


From John Paul Adrian Glaubitz@21:1/5 to Frank Scheiner on Tue Mar 9 22:20:01 2021

On 3/9/21 9:38 PM, Frank Scheiner wrote:

I have a T1000 with which I could try to reproduce Riccardo's issues. Hardware wise they should be pretty similar. As the T1000 doesn't have a CDROM, I'll try to netboot a few newer kernels and report my findings.
Will take me until next week though, as the machine is in (cold) storage
now.

@Adrian:
Aren't there some build servers using UltraSPARC T2 or T2+? Do they run
with the latest kernels?

The oldest buildd we are running is a T5120 and that's a T2.

We have an older UltraSPARC IIIi that has issues with newer kernels, but usually only after longer operation and the issue might be related to the
bug that was just fixed recently by Rob Gardner.

Adrian

--
.''`. John Paul Adrian Glaubitz
: :' : Debian Developer - glaubitz@debian.org
`. `' Freie Universitaet Berlin - glaubitz@physik.fu-berlin.de
`- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913


From John Paul Adrian Glaubitz@21:1/5 to Frank Scheiner on Tue Mar 9 23:30:01 2021

On 3/9/21 10:18 PM, Frank Scheiner wrote:

The oldest buildd we are running is a T5120 and that's a T2.

And these don't show the problems Riccardo's T1 powered T2000 has?

No, the machine runs stable.

We have an older UltraSPARC IIIi that has issues with newer kernels, but
usually only after longer operation and the issue might be related to the
bug that was just fixed recently by Rob Gardner.

Which kernel version will have this bug (which one?) fixed, 5.11.x? I
can also check with one of my UltraSPARC IIIi powered systems, too, next week.

I have not uploaded that kernel yet, I have it built locally, PR here [1].

Adrian

[1] https://salsa.debian.org/kernel-team/linux/-/merge_requests/339

--
.''`. John Paul Adrian Glaubitz
: :' : Debian Developer - glaubitz@debian.org
`. `' Freie Universitaet Berlin - glaubitz@physik.fu-berlin.de
`- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913


From John Paul Adrian Glaubitz@21:1/5 to John Paul Adrian Glaubitz on Wed Mar 10 08:40:02 2021

On 3/9/21 11:20 PM, John Paul Adrian Glaubitz wrote:

Which kernel version will have this bug (which one?) fixed, 5.11.x? I
can also check with one of my UltraSPARC IIIi powered systems, too, next
week.

I have not uploaded that kernel yet, I have it built locally, PR here [1].

The patch is now in Linus' tree so it will be part of 5.12 [1].

Adrian

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e5e8b80d352ec999d2bba3ea584f541c83f4ca3f

--
.''`. John Paul Adrian Glaubitz
: :' : Debian Developer - glaubitz@debian.org
`. `' Freie Universitaet Berlin - glaubitz@physik.fu-berlin.de
`- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913


From Frank Scheiner@21:1/5 to Riccardo Mottola on Wed Mar 10 10:40:02 2021

Hi Riccardo,

On 10.03.21 10:17, Riccardo Mottola wrote:

Frank Scheiner wrote:

We have an older UltraSPARC IIIi that has issues with newer kernels, but
usually only after longer operation and the issue might be related to the
bug that was just fixed recently by Rob Gardner.

Which kernel version will have this bug (which one?) fixed, 5.11.x? I
can also check with one of my UltraSPARC IIIi powered systems, too, next
week.

as written in the title, I have issues with:
5.10.0-4-sparc64-smp #1 Debian 5.10.19-1

I know.

If I remember there was a repository with many snapshots of different versions, already as package, which one can test quickly. That way we
can restrict breakage range without git bisect.

Do you have a link?

I assume you mean "http://snapshot.debian.org" .

Cheers,
Frank


From John Paul Adrian Glaubitz@21:1/5 to Riccardo Mottola on Wed Mar 10 10:50:02 2021

On 3/10/21 10:17 AM, Riccardo Mottola wrote:

If I remember there was a repository with many snapshots of different versions,
already as package, which one can test quickly. That way we can restrict breakage
range without git bisect.

Well, that doesn't really help you though. You want to find the commit in question,
just the range isn't enough to solve the issue.

If you have a fast second machine available, bisecting the problem shouldn't take
too long.

Adrian

--
.''`. John Paul Adrian Glaubitz
: :' : Debian Developer - glaubitz@debian.org
`. `' Freie Universitaet Berlin - glaubitz@physik.fu-berlin.de
`- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913


From Riccardo Mottola@21:1/5 to Frank Scheiner on Wed Mar 10 10:20:01 2021

Hi Frank,

Frank Scheiner wrote:

We have an older UltraSPARC IIIi that has issues with newer kernels, but
usually only after longer operation and the issue might be related to the
bug that was just fixed recently by Rob Gardner.

Which kernel version will have this bug (which one?) fixed, 5.11.x? I
can also check with one of my UltraSPARC IIIi powered systems, too, next week.

as written in the title, I have issues with:
5.10.0-4-sparc64-smp #1 Debian 5.10.19-1

If I remember there was a repository with many snapshots of different
versions, already as package, which one can test quickly. That way we
can restrict breakage range without git bisect.

Do you have a link?

Riccardo


From Riccardo Mottola@21:1/5 to Frank Scheiner on Thu Mar 11 23:10:02 2021

Hi Frank!

I suppose the Niagara CPU is what triggers the kernel issue.

Frank Scheiner wrote:

If I remember there was a repository with many snapshots of different
versions, already as package, which one can test quickly. That way we
can restrict breakage range without git bisect.

Do you have a link?

I assume you mean "http://snapshot.debian.org" .

Exactly. With this I did some more tests.

Still Works:
5.9.0-4-sparc64-smp #1 SMP Debian 5.9.11-1 (2020-11-27)
5.9.0-5-sparc64-smp #1 SMP Debian 5.9.15-1 (2020-12-17)

Broken:

linux-image-5.10.0-trunk-sparc64-smp_5.10.2-1~exp1_sparc64.deb

So the later 5.9 series kernels continue to work, while even very early 5.10 kernels do not.
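
For anyone repeating this kind of package-level narrowing, here is a minimal sketch of how the tested versions could be sorted to read off the boundary. The helper name `narrow_range` and the good/bad input format are made up for illustration, and it assumes GNU `sort` with the `-V` (version sort) option. The version strings are the ones tested in this thread.

```shell
#!/bin/sh
# Hypothetical helper: reads "<version> <good|bad>" lines on stdin and
# reports the newest working and the oldest broken package version.
# Assumes GNU sort (the V key modifier requests version ordering).
narrow_range() {
    sorted=$(sort -k1,1V)
    # Remember the last version marked good in sorted order.
    last_good=$(printf '%s\n' "$sorted" | awk '$2 == "good" { v = $1 } END { print v }')
    # Stop at the first version marked bad in sorted order.
    first_bad=$(printf '%s\n' "$sorted" | awk '$2 == "bad" { print $1; exit }')
    echo "last good: $last_good / first bad: $first_bad"
}

narrow_range <<'EOF'
5.10.19-1 bad
5.9.15-1 good
5.10.2-1~exp1 bad
5.9.6-1 good
5.9.11-1 good
EOF
```

This only brackets the regression between two package uploads; it does not identify the offending commit.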

Do you know if I can reset the system via the serial console?
I tried sending a break on the serial console, but the errors just keep running.
The break is received, since I see it as an SC Alert, but I am not put into the console. Maybe there is some further trick on these newer machines? I am
used to old SparcStations and UltraSparc Netras, where it was sufficient.
It is inconvenient to power-cycle at every hang, since at every power-on
it runs a self-test which lasts minutes :)

Riccardo


From Riccardo Mottola@21:1/5 to John Paul Adrian Glaubitz on Thu Mar 11 23:20:01 2021

Hi Adrian

John Paul Adrian Glaubitz wrote:

Well, that doesn't really help you though. You want to find the commit in question,
just the range isn't enough to solve the issue.

Well, it helped a little bit: it is something early in the 5.10 series.
Also, I now have an apparently working kernel from the 5.9 series (who knows
how stable under load?).

If you have a fast second machine available, bisecting the problem shouldn't take
too long.

Well, this machine has plenty of RAM, disk space and a good connection;
how fast the CPU is at compiling a kernel I don't know, but we can try.
Power consumption is not so much worse than a PC, but it is darn loud!
Like a vacuum cleaner... I need to stay out of the room, but I found an acceptable setup. I use a workstation with a serial console connected to
it, then connect through ssh to the workstation and through that into the management.

Although I have been compiling kernels on Gentoo Linux for 15 years, I
have never done so on Debian. Here we have init images.

How should I proceed? Which kernel sources?

https://kernel-team.pages.debian.net/kernel-handbook/ch-common-tasks.html#s-common-official

Is 4.3 correct for me? Or 4.6?

Please guide me

Riccardo


From Gregor Riepl@21:1/5 to All on Thu Mar 11 23:30:02 2021

Do you know if I can reset the system via the serial console?
I tried sending a break on the serial console, but the errors just keep running.
The break is received, since I see it as an SC Alert, but I am not put into the console. Maybe there is some further trick on these newer machines? I am
used to old SparcStations and UltraSparc Netras, where it was sufficient.
It is inconvenient to power-cycle at every hang, since at every power-on
it runs a self-test which lasts minutes :)

According to this, you should be able to reach the system console
through the SER MGT port: https://unixed.com/index.php/2013/06/16/accessing-the-sparc-system-console/
NET MGT is probably easier, but you'll have to set it up first.

Perhaps you can also attach a USB keyboard and press the break key to
get into the system console, then type "reset" to boot the machine? Not
sure if this works without a monitor though. And you might need to enter
the system password first, if it's set.


From Frank Scheiner@21:1/5 to Riccardo Mottola on Thu Mar 11 23:50:02 2021

Hi Riccardo,

On 11.03.21 23:03, Riccardo Mottola wrote:

Hi Frank!

I suppose the Niagara CPU is what triggers the kernel issue.

From [1] I assume T2 CPUs are not affected, but yeah, the issue could
be so selective that it only affects the very first generation.

[1]: https://lists.debian.org/debian-sparc/2021/03/msg00010.html

Frank Scheiner wrote:

If I remember there was a repository with many snapshots of different
versions, already as package, which one can test quickly. That way we
can restrict breakage range without git bisect.

Do you have a link?

I assume you mean "http://snapshot.debian.org" .

Exactly. With this I did some more tests.

Still Works:
5.9.0-4-sparc64-smp #1 SMP Debian 5.9.11-1 (2020-11-27)
5.9.0-5-sparc64-smp #1 SMP Debian 5.9.15-1 (2020-12-17)

Broken:

linux-image-5.10.0-trunk-sparc64-smp_5.10.2-1~exp1_sparc64.deb

So the later 5.9 series kernels continue to work, while even very early 5.10 kernels do not.

Do you know if I can reset the system via the serial console?

Reset from the serial console might work via the kernel with the [magic
system request] functionality.

[magic system request]: https://www.kernel.org/doc/html/v4.11/admin-guide/sysrq.html

But you can always reset the system using the SC. The T1000 (and the
T2000, too) has both serial (on T2000 right of the DB-9 ttya port,
should work with a blue Cisco serial cable) and network port (on T2000
above the two USB ports). The serial port of the SC automatically
switches to the system console after some (configurable) time and you
need to escape to the SC login prompt with a configurable key sequence
(`#.` by default, see [2]).

[2]: https://docs.oracle.com/cd/E19076-01/t2k.srvr/819-2549-12/ontario-consoleConfig.html#28277

I tried sending a break on the serial console, but the errors just keep running.
The break is received, since I see it as an SC Alert, but I am not put into the console. Maybe there is some further trick on these newer machines?

So you already got access to the SC. Then you can reset the machine from
there, too.

I am
used to old SparcStations and UltraSparc Netras, where it was sufficient.
It is inconvenient to power-cycle at every hang, since at every power-on
it runs a self-test which lasts minutes :)

I think depending on the SC configuration, these machines also run a
self-test for every X resets, but this should be configurable.

Hope that helps
Cheers,
Frank


From Gregor Riepl@21:1/5 to All on Fri Mar 12 00:00:02 2021

How should I proceed? Which kernel sources?

https://kernel-team.pages.debian.net/kernel-handbook/ch-common-tasks.html#s-common-official

Is 4.3 correct for me? Or 4.6?

You should clone the upstream Git repo, otherwise bisecting will be much
more difficult.

I think these instructions are still valid: https://wiki.debian.org/DebianKernel/GitBisect

You can also skip the Debian-specific stuff and simply do
make -j8 && make modules_install && make install

It's better to use at least a compatible kernel config, though.


From Jan Engelhardt@21:1/5 to Frank Scheiner on Fri Mar 12 10:50:02 2021

On Thursday 2021-03-11 23:43, Frank Scheiner wrote:

Do you know if I can reset the system via the serial console?

Reset from the serial console might work via the kernel with the [magic system request] functionality.

[magic system request]: https://www.kernel.org/doc/html/v4.11/admin-guide/sysrq.html

But you can always reset the system using the SC. The T1000 (and the
T2000, too) has both serial (on T2000 right of the DB-9 ttya port,
should work with a blue Cisco serial cable) and network port (on T2000
above the two USB ports). The serial port of the SC automatically
switches to the system console after some (configurable) time

SER MGT is a RS232-ish serial line, just with a RJ-45 connector for size.
Once the SC has finished booting, system console is the default mode.
Since SER has no notion of connections, it should stay in whatever mode it was left in. Maybe there is an autoswitch, but I never observed it (and I would not want to wait many minutes just to observe it).

For NET MGT, when you start a new SSH connection, it always starts
out in system console mode and #. is needed.

I tried sending a break on the serial console, but the errors just keep
running.
The break is received, since I see it as an SC Alert, but I am not put into the
console. Maybe there is some further trick on these newer machines?

So you already got access to the SC. Then you can reset the machine from there, too.

Because NET does not have an equivalent of the serial pin used to traditionally signal "break", a synthetic break can be issued from SC. But it's a bit awkward, because you immediately need to go back into system console mode to type the desired sysrq character.

break

confirm (y/n)y

console

confirm (y/n)y
type <<S>>
Linux kernel: ah yes I received SYSRQ-s

I am
used to old SparcStations and UltraSparc Netras, where it was sufficient.
It is inconvenient to power-cycle at every hang, since at every power-on
it runs a self-test which lasts minutes :)

I think depending on the SC configuration, these machines also run a self-test for every X resets, but this should be configurable.

It's the first thing you want to turn off as a private user.

diag_trigger none

and probably

diag_mode off


From Frank Scheiner@21:1/5 to Frank Scheiner on Tue Mar 16 14:10:01 2021

Hi Riccardo, Adrian,

so I did some testing yesterday and also see your problem on my T1000.
Because of some kernel command line misconfiguration, my machine at
first couldn't find its root FS as it tried to use a non-existent NIC.
This led to a lot of kernel oopses (I assume at least one per hardware
thread) that looked very similar to the ones you see. And this happens
even with "working" kernels (tested 4.19.x and 5.9.x). So the actual
result of that problem in 5.10.x seems to be that the kernel can't find
its root FS.

On 11.03.21 23:43, Frank Scheiner wrote:

On 11.03.21 23:03, Riccardo Mottola wrote:

I suppose the Niagara CPU is what triggers the kernel issue.

From [1] I assume T2 CPUs are not affected, but yeah, the issue could
be so selective that it only affects the very first generation.

[1]: https://lists.debian.org/debian-sparc/2021/03/msg00010.html

I can also indeed confirm that this problem only affects the T1 CPU, as
my T5220 with T2 CPU works w/o problems with kernel 5.10.x.

I didn't get any further yesterday as it took a lot of time to update
the root FSes of my T1000 and my X4270 - my intended machine for cross compilation, not sure if it will be "fast" enough*. In addition cloning
Linus's linux tree alone took a lot of time (about an hour).

* it will:

```
## with config of Debian's 5.9.0-5 kernel as `.config`
$ make ARCH=sparc64 CROSS_COMPILE=sparc64-linux-gnu- olddefconfig
[...]
## with lsmod output from T1000
$ make ARCH=sparc64 CROSS_COMPILE=sparc64-linux-gnu- LSMOD=$HOME/t1000-lsmod localmodconfig
[...]
$ time make -j16 ARCH=sparc64 CROSS_COMPILE=sparc64-linux-gnu- all
[...]
kernel: arch/sparc/boot/zImage is ready

real 3m12.264s
user 42m5.325s
sys 3m27.843s
```

@Adrian:
After a first cross compile run, I can confirm that 5.10-rc1 is also
broken on my T1000. I'll take this version (parent commit: 33def8498fdde180023444b08e12b72a9efed41d) as "bad". But taking v5.9 as
good means more than 5000 commits in between. Linus's tree doesn't
contain v5.9.16 or at least I didn't find it there. How can I get "good"
closer to "bad"? I don't want to check too many good versions if I know
that v5.9.16 most likely will be good, as v5.9.15 (5.9.0-5 on Debian) is
good? Should I switch to the stable kernel sources from GKH?

Cheers,
Frank


From Frank Scheiner@21:1/5 to Frank Scheiner on Tue Mar 16 14:20:01 2021

Hi again,

On 16.03.21 14:07, Frank Scheiner wrote:

@Adrian:
After a first cross compile run, I can confirm that 5.10-rc1 is also
broken on my T1000. I'll take this version (parent commit: 33def8498fdde180023444b08e12b72a9efed41d) as "bad". But taking v5.9 as
good means more than 5000 commits in between. Linus's tree doesn't
contain v5.9.16 or at least I didn't find it there. How can I get "good" closer to "bad"? I don't want to check too many good versions if I know
that v5.9.16 most likely will be good, as v5.9.15 (5.9.0-5 on Debian) is good? Should I switch to the stable kernel sources from GKH?

Forget about that, [1] shows 5000+ commits between v5.9.16 and
v5.10-rc1, too. So no difference.

[1]: https://github.com/gregkh/linux/compare/v5.9.16...v5.10-rc1

Cheers,
Frank


From John Paul Adrian Glaubitz@21:1/5 to Frank Scheiner on Tue Mar 16 14:30:01 2021

Hello Frank!

On 3/16/21 2:07 PM, Frank Scheiner wrote:

After a first cross compile run, I can confirm that 5.10-rc1 is also
broken on my T1000. I'll take this version (parent commit: 33def8498fdde180023444b08e12b72a9efed41d) as "bad". But taking v5.9 as
good means more than 5000 commits in between. Linus's tree doesn't
contain v5.9.16 or at least I didn't find it there. How can I get "good" closer to "bad"? I don't want to check too many good versions if I know
that v5.9.16 most likely will be good, as v5.9.15 (5.9.0-5 on Debian) is good? Should I switch to the stable kernel sources from GKH?

I'm not sure I understand your problem here. The bisecting algorithm
has a runtime of O(ln(n)), so even with 5000 commits, it will converge quite quickly.

Just make sure you are using a fast machine when compiling the kernel
as otherwise it won't be fun.
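
To put a number on that, here is a small sketch that counts the halvings needed for a given number of commits. The function name `bisect_steps` is just for illustration, and git's own estimate may differ by a step.

```shell
#!/bin/sh
# Estimate how many build-and-boot rounds a bisection needs for n
# candidate commits: the smallest k such that 2^k >= n.
bisect_steps() {
    n=$1
    steps=0
    span=1
    while [ "$span" -lt "$n" ]; do
        span=$((span * 2))
        steps=$((steps + 1))
    done
    echo "$steps"
}

bisect_steps 5000   # prints 13
```

So roughly 13 test kernels cover the 5000+ commits between v5.9 and v5.10-rc1, and each further halving of the range saves only one round.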

Adrian

--
.''`. John Paul Adrian Glaubitz
: :' : Debian Developer - glaubitz@debian.org
`. `' Freie Universitaet Berlin - glaubitz@physik.fu-berlin.de
`- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913


From Frank Scheiner@21:1/5 to John Paul Adrian Glaubitz on Tue Mar 16 15:00:02 2021

Hi Adrian,

On 16.03.21 14:27, John Paul Adrian Glaubitz wrote:

Hello Frank!

On 3/16/21 2:07 PM, Frank Scheiner wrote:

After a first cross compile run, I can confirm that 5.10-rc1 is also
broken on my T1000. I'll take this version (parent commit:
33def8498fdde180023444b08e12b72a9efed41d) as "bad". But taking v5.9 as
good means more than 5000 commits in between. Linus's tree doesn't
contain v5.9.16 or at least I didn't find it there. How can I get "good"
closer to "bad"? I don't want to check too many good versions if I know
that v5.9.16 most likely will be good, as v5.9.15 (5.9.0-5 on Debian) is
good? Should I switch to the stable kernel sources from GKH?

I'm not sure I understand your problem here. The bisecting algorithm
has a runtime of O(ln(n)), so even with 5000 commits, it will converge quite quickly.

Yeah, you're right, I think I make this error every time I try to bisect
the kernel - i.e. once every two years... ;-)

Just make sure you are using a fast machine when compiling the kernel
as otherwise it won't be fun.

Other topic: As the compile times are actually taking less time than the preparation of the test boot (copy over modules to T1000 root FS, boot
T1000 with working kernel, create initramfs, reboot with kernel in
question and that initramfs), is there a way to create the initramfs
(for sparc64) on the cross compile host (amd64)?

Cheers,
Frank


From John Paul Adrian Glaubitz@21:1/5 to Frank Scheiner on Wed Mar 17 13:40:01 2021

Hi Frank!

On 3/17/21 1:22 PM, Frank Scheiner wrote:

Hi Adrian, Riccardo

so I'm finished with bisecting and it points to the following commit as
first bad commit:

```
johndoe@x4270:~/git-projects/torvalds/linux$ git bisect bad
028abd9222df0cf5855dab5014a5ebaf06f90565 is the first bad commit
commit 028abd9222df0cf5855dab5014a5ebaf06f90565
Author: Christoph Hellwig <hch@lst.de>
Date: Thu Sep 17 10:22:34 2020 +0200

fs: remove compat_sys_mount

compat_sys_mount is identical to the regular sys_mount now, so remove it
and use the native version everywhere.
```

Did you verify this by reverting the commit or, if reverting is not possible, by testing
the revision just before it? Just to be sure that you found the correct commit.

If that has been verified, please report the issue to the sparclinux mailing list and CC Christoph.

Adrian

--
.''`. John Paul Adrian Glaubitz
: :' : Debian Developer - glaubitz@debian.org
`. `' Freie Universitaet Berlin - glaubitz@physik.fu-berlin.de
`- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913

From Frank Scheiner@21:1/5 to All on Wed Mar 17 13:30:01 2021

Hi Adrian, Riccardo

so I'm finished with bisecting and it points to the following commit as
first bad commit:

```
johndoe@x4270:~/git-projects/torvalds/linux$ git bisect bad
028abd9222df0cf5855dab5014a5ebaf06f90565 is the first bad commit
commit 028abd9222df0cf5855dab5014a5ebaf06f90565
Author: Christoph Hellwig <hch@lst.de>
Date: Thu Sep 17 10:22:34 2020 +0200

fs: remove compat_sys_mount

compat_sys_mount is identical to the regular sys_mount now, so remove it
and use the native version everywhere.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

arch/arm64/include/asm/unistd32.h | 2 +-
arch/mips/kernel/syscalls/syscall_n32.tbl | 2 +-
arch/mips/kernel/syscalls/syscall_o32.tbl | 2 +-
arch/parisc/kernel/syscalls/syscall.tbl | 2 +-
arch/powerpc/kernel/syscalls/syscall.tbl | 2 +-
arch/s390/kernel/syscalls/syscall.tbl | 2 +-
arch/sparc/kernel/syscalls/syscall.tbl | 2 +-
arch/x86/entry/syscalls/syscall_32.tbl | 2 +-
fs/Makefile | 1 -
fs/compat.c | 57 ----------------------
fs/internal.h | 3 --
fs/namespace.c | 4 +-
include/linux/compat.h | 6 ---
include/uapi/asm-generic/unistd.h | 2 +-
tools/include/uapi/asm-generic/unistd.h | 2 +-
tools/perf/arch/powerpc/entry/syscalls/syscall.tbl | 2 +-
tools/perf/arch/s390/entry/syscalls/syscall.tbl | 2 +-
17 files changed, 14 insertions(+), 81 deletions(-)
delete mode 100644 fs/compat.c
```

Seems to be indeed related to mounting (the root FS). Why it only
affects UltraSPARC T1 CPUs is another question. I don't have any other
UltraSPARC II, IIi, IIe, III or IIIi driven machines at hand right now for
checking those.

So what now?

Cheers,
Frank

P.S.

Here's the log for reference:

```
johndoe@x4270:~/git-projects/torvalds/linux$ git bisect log
git bisect start
# good: [bbf5c979011a099af5dc76498918ed7df445635b] Linux 5.9
git bisect good bbf5c979011a099af5dc76498918ed7df445635b
# bad: [3650b228f83adda7e5ee532e2b90429c03f7b9ec] Linux 5.10-rc1
git bisect bad 3650b228f83adda7e5ee532e2b90429c03f7b9ec
# bad: [c48b75b7271db23c1b2d1204d6e8496d91f27711] Merge tag 'sound-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
git bisect bad c48b75b7271db23c1b2d1204d6e8496d91f27711
# bad: [7fafb54c7d390e9b273a1d7d377e38d9c408046e] Merge tag 'leds-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/pavel/linux-leds
git bisect bad 7fafb54c7d390e9b273a1d7d377e38d9c408046e
# bad: [fd5c32d80884268a381ed0e67cccef0b3d37750b] Merge tag 'media/v5.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
git bisect bad fd5c32d80884268a381ed0e67cccef0b3d37750b
# bad: [865c50e1d279671728c2936cb7680eb89355eeea] x86/uaccess: utilize CONFIG_CC_HAS_ASM_GOTO_OUTPUT
git bisect bad 865c50e1d279671728c2936cb7680eb89355eeea
# good: [13cb73490f475f8e7669f9288be0bcfa85399b1f] Merge tag 'x86-entry-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect good 13cb73490f475f8e7669f9288be0bcfa85399b1f
# good: [dd502a81077a5f3b3e19fa9a1accffdcab5ad5bc] Merge tag 'core-static_call-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect good dd502a81077a5f3b3e19fa9a1accffdcab5ad5bc
# good: [ced3a9eb3cd0d07462cdbaa8a0f3d46e5aaeadec] Merge tag 'ia64_for_5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux
git bisect good ced3a9eb3cd0d07462cdbaa8a0f3d46e5aaeadec
# good: [fc67d5bc876b6b224538c8848fc02e70f269ec99] Documentation/admin-guide: README & svga: remove use of "rdev"
git bisect good fc67d5bc876b6b224538c8848fc02e70f269ec99
# good: [c90578360c92c71189308ebc71087197080e94c3] Merge branch 'work.csum_and_copy' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
git bisect good c90578360c92c71189308ebc71087197080e94c3
# good: [85ed13e78dbedf9433115a62c85429922bc5035c] Merge branch 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
git bisect good 85ed13e78dbedf9433115a62c85429922bc5035c
# bad: [22230cd2c55bd27ee2c3a3def97c0d5577a75b82] Merge branch 'compat.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
git bisect bad 22230cd2c55bd27ee2c3a3def97c0d5577a75b82
# good: [e18afa5bfa4a2f0e07b0864370485df701dacbc1] Merge branch 'work.quota-compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
git bisect good e18afa5bfa4a2f0e07b0864370485df701dacbc1
# good: [67e306c6906137020267eb9bbdbc127034da3627] fs,nfs: lift compat nfs4 mount data handling into the nfs code
git bisect good 67e306c6906137020267eb9bbdbc127034da3627
# bad: [028abd9222df0cf5855dab5014a5ebaf06f90565] fs: remove compat_sys_mount
git bisect bad 028abd9222df0cf5855dab5014a5ebaf06f90565
# first bad commit: [028abd9222df0cf5855dab5014a5ebaf06f90565] fs: remove compat_sys_mount
```

From Frank Scheiner@21:1/5 to John Paul Adrian Glaubitz on Wed Mar 17 21:00:01 2021

Hi Adrian,

On 17.03.21 13:39, John Paul Adrian Glaubitz wrote:

On 3/17/21 1:22 PM, Frank Scheiner wrote:

```
johndoe@x4270:~/git-projects/torvalds/linux$ git bisect bad
028abd9222df0cf5855dab5014a5ebaf06f90565 is the first bad commit
[...]
```

Did you verify this by reverting the commit or, if reverting is not possible, by testing
the revision just before the commit?

I did not yet revert the bad commit in a current kernel and test it, but
from my understanding the parent commit of the first bad one must have
been a good one and indeed, [67e306c6906137020267eb9bbdbc127034da3627]
is the parent of [028abd9222df0cf5855dab5014a5ebaf06f90565] and was
working for me on my T1000:

```
johndoe@x4270:~/git-projects/torvalds/linux$ git bisect log
[...]
# good: [67e306c6906137020267eb9bbdbc127034da3627] fs,nfs: lift compat nfs4 mount data handling into the nfs code
git bisect good 67e306c6906137020267eb9bbdbc127034da3627
# bad: [028abd9222df0cf5855dab5014a5ebaf06f90565] fs: remove compat_sys_mount
git bisect bad 028abd9222df0cf5855dab5014a5ebaf06f90565
# first bad commit: [028abd9222df0cf5855dab5014a5ebaf06f90565] fs: remove compat_sys_mount
```

[67e306c6906137020267eb9bbdbc127034da3627]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=67e306c6906137020267eb9bbdbc127034da3627

[028abd9222df0cf5855dab5014a5ebaf06f90565]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=028abd9222df0cf5855dab5014a5ebaf06f90565

Just to be sure you found the correct commit.

If that has been verified, please report the issue to the sparclinux LKML and CC Christoph.

Will do that soon-ish but maybe also try to revert that commit in
Debian's 5.10.0-4 and test it for additional assurance (then not so
soon-ish - maybe this weekend). I'll put you and Riccardo in CC, too.

Hopefully this will be easier to fix than the kernel breakage on the
rx2800 i2 - assuming that problem is still there ([1], [2]).

[1]: https://marc.info/?l=linux-ia64&m=156114769908890&w=2
[2]: https://marc.info/?l=linux-ia64&m=156144480821712&w=2

Cheers and thanks for the pointers,
Frank

From Jan Engelhardt@21:1/5 to Frank Scheiner on Tue Mar 23 16:40:02 2021

On Tuesday 2021-03-23 16:29, Frank Scheiner wrote:

while I was able to "install" correctly using a slightly older ISO, I
do not get a bootable system. The kernel appears to crash very early during
boot.

From my current testing it looks like "UltraSPARC IIIi"s are also
affected by this problem with UltraSPARC T1s in some way:

With the latest Linux 5.10.x (from Debian) the root FS can't be
successfully mounted, with the latest Linux 5.9.x (also from Debian) it
just works fine. Unfortunately the V245 doesn't fail/work for the exact
same kernels that I tested during the bisecting for the T1000, e.g. the
first bad commit version that didn't work on the T1000 seems to work on
the V245 but some good versions don't with:

```
[...]
Begin: Retrying nfs mount ... [ 41.753937] NFS: mount program didn't pass remote address
mount: Invalid argument
```

I seem to recall that NFS is one of those filesystems that makes use of (a) filesystem-specific data, i.e. mount(2)'s 5th argument, and (b) a mount helper, /usr/sbin/mount.nfs.

Now, with the change in Linux kernel commit 028abd9222df0cf5855dab5014a5ebaf06f90565, my hypothesis is that the fs/nfs/ code for parsing this binary blob is no longer aware that it is being invoked in a compat32 context.

Since T2 systems were said to be fine and T1 not, perhaps the T1 systems in question were all on NFS mounts and the T2 one wasn't?

From Frank Scheiner@21:1/5 to Riccardo Mottola on Tue Mar 23 16:30:02 2021

Hi all,

On 09.03.21 13:23, Riccardo Mottola wrote:

Hi all,

while I was able to "install" correctly using a slightly older ISO, I
do not get a bootable system. The kernel appears to crash very early during boot.

Does anybody else have this issue?

Booting `Debian GNU/Linux'

Loading Linux 5.10.0-4-sparc64-smp ...
Loading initial ramdisk ...

From my current testing it looks like "UltraSPARC IIIi"s are also
affected by this problem with UltraSPARC T1s in some way:

With the latest Linux 5.10.x (from Debian) the root FS can't be
successfully mounted, with the latest Linux 5.9.x (also from Debian) it
just works fine. Unfortunately the V245 doesn't fail/work for the exact
same kernels that I tested during the bisecting for the T1000, e.g. the
first bad commit version that didn't work on the T1000 seems to work on
the V245 but some good versions don't with:

```
[...]
Begin: Retrying nfs mount ... [ 41.753937] NFS: mount program didn't pass remote address
mount: Invalid argument
done.
[...]
```

I'm unsure what could go wrong here, as I always pass the remote address
via the kernel commandline:

```
[...]
[ 2.928512] Kernel command line: BOOT_IMAGE=(tftp)/AC10027A.vmlinux root=/dev/nfs ip=172.16.2.122:172.16.0.2:172.16.0.1:255.255.0.0:v245-2:enp9s4f0:off nfsroot=172.16.0.2:/srv/nfs/v245-2/root nfsrootdebug rw
[...]
```
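
For readers unfamiliar with the `ip=` parameter, the colon-separated value can be split into named fields. This is a small sketch, assuming the field order from the kernel's nfsroot documentation (client-ip, server-ip, gw-ip, netmask, hostname, device, autoconf); the function name `parse_ip_param` is made up for the example:

```shell
#!/bin/sh
# Split the value of an "ip=" kernel parameter into its first seven fields.
# Field names follow the kernel's nfsroot documentation; the function name
# is made up for this example.
parse_ip_param() {
    # Run in a subshell so the changed IFS does not leak into the caller.
    (
        IFS=:
        set -- $1
        echo "client-ip=$1"
        echo "server-ip=$2"
        echo "gw-ip=$3"
        echo "netmask=$4"
        echo "hostname=$5"
        echo "device=$6"
        echo "autoconf=$7"
    )
}

parse_ip_param "172.16.2.122:172.16.0.2:172.16.0.1:255.255.0.0:v245-2:enp9s4f0:off"
```

Read this way, the server-ip field (172.16.0.2) matches the server given in `nfsroot=`, so the command line itself looks consistent.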

Maybe there is some breakage in the klibc based programs in the
initramfs, but why they don't affect both UltraSPARC IIIi and T1 in the
same way is somewhat strange.

Cheers,
Frank

From Frank Scheiner@21:1/5 to Jan Engelhardt on Tue Mar 23 16:50:01 2021

Hi Jan,

On 23.03.21 16:36, Jan Engelhardt wrote:

On Tuesday 2021-03-23 16:29, Frank Scheiner wrote:

```
[...]
Begin: Retrying nfs mount ... [ 41.753937] NFS: mount program didn't pass remote address
mount: Invalid argument
```

I seem to recall that NFS is one of those filesystems that makes use of (a) filesystem-specific data, i.e. mount(2)'s 5th argument, and (b) a mount helper,
/usr/sbin/mount.nfs.

Now, with the change in Linux kernel commit 028abd9222df0cf5855dab5014a5ebaf06f90565,
my hypothesis is that the fs/nfs/ code for parsing this binary blob is no longer aware that it is being invoked in a compat32 context.

That sounds interesting. Can you perhaps post your hypothesis also in
this thread:

https://marc.info/?t=161644900600003&r=1&w=2

Maybe this gives the kernel developers some ideas.

Since T2 systems were said to be fine and T1 not, perhaps the T1 systems in question were all on NFS mounts and the T2 one wasn't?

No, the T5220 was also running diskless, actually using the same root FS
as the T1000 (in form of a btrfs subvolume snapshot) plus identical
kernel and initramfs:

```
root@nfs:/srv/tftp# ls -la $( host2hex t5220 )*
lrwxrwxrwx 1 root root 35 Feb 28 2018 AC10026E -> boot/grub/sparc64-ieee1275/core.img
lrwxrwxrwx 1 root root 38 Mar 15 18:16 AC10026E.initrd.img -> initrd.img.5.10.0-4.debian.sid.sparc64
lrwxrwxrwx 1 root root 36 Mar 15 18:16 AC10026E.vmlinuz -> linux.mp.5.10.0-4.debian.sid.sparc64
```
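
The `host2hex` helper used above is not shown in the thread. Judging from the V245 command line earlier, which pairs `ip=172.16.2.122` with `BOOT_IMAGE=(tftp)/AC10027A.vmlinux`, the file names appear to be the client's IPv4 address written as eight uppercase hex digits. A minimal sketch of that conversion, with `ip_to_hex` as a hypothetical stand-in that takes an address instead of a host name:

```shell
#!/bin/sh
# Convert a dotted-quad IPv4 address to the 8-digit uppercase hex form
# used in the TFTP file names above. ip_to_hex is a hypothetical stand-in;
# the real host2hex helper is not shown in the thread.
ip_to_hex() {
    (
        IFS=.
        set -- $1
        printf '%02X%02X%02X%02X\n' "$1" "$2" "$3" "$4"
    )
}

ip_to_hex 172.16.2.122    # prints AC10027A
```

By the same mapping, AC10026E would correspond to 172.16.2.110.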

Cheers,
Frank

From Connor McLaughlan@21:1/5 to All on Tue Mar 23 17:40:01 2021

Hi,

can anyone possibly give a list of known stable kernel versions for SPARC
machines? (Is it necessary to differentiate between architectures or old vs.
newer machines, i.e. sun4u/sun4v?)

Also, does this instability manifest as the machine crashing during
high workload (halting? rebooting?)?

I ask because on three different SPARC machines I have been experiencing a
weird effect when using Debian:
I would start a heavy compile load for several days (7-10), during which the
machines run fine without any apparent error visible in dmesg or elsewhere.
Then when I power off and on again, the filesystem would be corrupt and
sometimes impossible to repair without reinstallation.

This seems to happen only when the machines do a long run with high
workload, and seemingly not when I just power them off for the night with
no high workload.

Regards,
Connor

On Tue, Mar 23, 2021 at 4:46 PM Frank Scheiner <frank.scheiner@web.de>
wrote:

Hi Jan,

On 23.03.21 16:36, Jan Engelhardt wrote:

On Tuesday 2021-03-23 16:29, Frank Scheiner wrote:

```
[...]
Begin: Retrying nfs mount ... [ 41.753937] NFS: mount program didn't pass remote address
mount: Invalid argument
```

I seem to recall that NFS is one of those filesystems that makes use of
(a) filesystem-specific data, i.e. mount(2)'s 5th argument, and (b) a mount
helper, /usr/sbin/mount.nfs.

Now, with the change in Linux kernel commit
028abd9222df0cf5855dab5014a5ebaf06f90565, my hypothesis is that the fs/nfs/
code for parsing this binary blob is no longer aware that it is being
invoked in a compat32 context.

That sounds interesting. Can you perhaps post your hypothesis also in
this thread:

https://marc.info/?t=161644900600003&r=1&w=2

Maybe this gives the kernel developers some ideas.

Since T2 systems were said to be fine and T1 not, perhaps the T1 systems in
question were all on NFS mounts and the T2 one wasn't?

No, the T5220 was also running diskless, actually using the same root FS
as the T1000 (in form of a btrfs subvolume snapshot) plus identical
kernel and initramfs:

```
root@nfs:/srv/tftp# ls -la $( host2hex t5220 )*
lrwxrwxrwx 1 root root 35 Feb 28 2018 AC10026E -> boot/grub/sparc64-ieee1275/core.img
lrwxrwxrwx 1 root root 38 Mar 15 18:16 AC10026E.initrd.img -> initrd.img.5.10.0-4.debian.sid.sparc64
lrwxrwxrwx 1 root root 36 Mar 15 18:16 AC10026E.vmlinuz -> linux.mp.5.10.0-4.debian.sid.sparc64
```

Cheers,
Frank

From Frank Scheiner@21:1/5 to Connor McLaughlan on Tue Mar 23 18:30:01 2021

Hi,

On 23.03.21 17:30, Connor McLaughlan wrote:

Hi,

can anyone possibly give a list of known stable kernel versions for
SPARC machines? (Is it necessary to differentiate between
architectures or old vs. newer machines, i.e. sun4u/sun4v?)

Also, does this instability manifest as the machine crashing during
high workload (halting? rebooting?)?

I ask because on three different SPARC machines I have been
experiencing a weird effect when using Debian:
I would start a heavy compile load for several days (7-10), during which the
machines run fine without any apparent error visible in dmesg or elsewhere.
Then when I power off and on again, the filesystem would be corrupt and
sometimes impossible to repair without reinstallation.

Are you sure that the disks you use are in full working order? Maybe
they have bad sectors and are nearing their EOL, which manifests in
these FS errors? I assume the more accesses you have on your disks, the
more likely a problem is to show up. And the accesses happening during
compile runs could already be too much for your disks. If you have
enough RAM, you could try to run your compile jobs in a RAM disk and
check if this makes a difference.

This seems to happen only when the machines do a long run with high
workload, and seemingly not when I just power them off for the night
with no high workload.

I believe the error this thread is about is unrelated to what you
experience on your machines, because the problem discussed here happens
early on, when the root FS is to be mounted.

Cheers,
Frank

From Riccardo Mottola@21:1/5 to Gregor Riepl on Sat Mar 27 00:20:01 2021

Hi,

I was unable to "hack" for some days because of my day job. I have seen that
Frank and others have done a great deal.

Still, I wanted to try my own compilation, as a first attempt and also
to be able to build and check possible patches myself.

On 3/11/21 11:56 PM, Gregor Riepl wrote:

You should clone the upstream Git repo, otherwise bisecting will be much
more difficult.

I think these instructions are still valid: https://wiki.debian.org/DebianKernel/GitBisect

You can also skip the Debian-specific stuff and simply do
make -j8 && make modules_install && make install

It's better to use at least a compatible kernel config, though.

I cloned linux stable. It took 60 minutes...

I took the config out of /boot/config of a good kernel, updated it with
"make oldconfig"

During compilation I see:

CC      init/init_task.o
make[1]: *** No rule to make target 'debian/certs/debian-uefi-certs.pem', needed by 'certs/x509_certificate_list'. Stop.
make[1]: *** Waiting for unfinished jobs....

It took 134 minutes to build with -j32. So well, compiling is not the
strongest point of this CPU, but not so bad either.

real    134m55.288s
user    4111m46.186s
sys     145m12.479s

I actually wonder whether the kernel is "overconfigured": does building
things like nouveau make sense on SPARC? Then again, maybe sticking a
PCIe card into a Netra or Fire would work?

but I can't install:

multix@narya:~/code/linux-stable$ sudo make modules_install
sed: can't read modules.order: No such file or directory

I wonder if it is related to the error above?

Thanks,

Riccardo

From Hermann.Lauer@uni-heidelberg.de@21:1/5 to Stan Johnson on Mon Mar 29 10:00:01 2021

Hi Riccardo,

On Sat, Mar 27, 2021 at 01:16:11PM -0600, Stan Johnson wrote:

I took the config out of /boot/config of a good kernel, updated it with "make oldconfig"

During compilation I see:

CC init/init_task.o
make[1]: *** No rule to make target 'debian/certs/debian-uefi-certs.pem', needed by 'certs/x509_certificate_list'. Stop.
make[1]: *** Waiting for unfinished jobs....
...

I think you need to remove all references to debian certs to compile a
custom kernel.

Yep, in your kernel config set:
CONFIG_SYSTEM_TRUSTED_KEYS=""
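
One way to apply that setting without opening an editor is a one-line sed over the config file. This is only a sketch: the function name is made up, `sed -i` assumes GNU sed, and the demo assumes the copied Debian config points `CONFIG_SYSTEM_TRUSTED_KEYS` at the certificate file named in the error message:

```shell
#!/bin/sh
# Blank out CONFIG_SYSTEM_TRUSTED_KEYS in a kernel .config so the build no
# longer looks for Debian's certificate file. The function name is made up
# for this example; "sed -i" assumes GNU sed.
clear_trusted_keys() {
    sed -i 's|^CONFIG_SYSTEM_TRUSTED_KEYS=.*|CONFIG_SYSTEM_TRUSTED_KEYS=""|' "$1"
}

# Demo on a scratch file instead of a real .config.
cfg=$(mktemp)
printf '%s\n' 'CONFIG_MODULES=y' \
    'CONFIG_SYSTEM_TRUSTED_KEYS="debian/certs/debian-uefi-certs.pem"' > "$cfg"
clear_trusted_keys "$cfg"
cat "$cfg"
```

In a real tree the argument would be the `.config` in the top-level source directory; other lines are left untouched.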

Greetings
Hermann

--
Administration/Zentrale Dienste, Interdiziplinaeres
Zentrum fuer wissenschaftliches Rechnen der Universitaet Heidelberg
IWR; INF 205; 69120 Heidelberg; Tel: (06221)54-14405 Fax: -14427
Email: Hermann.Lauer@iwr.uni-heidelberg.de

From Riccardo Mottola@21:1/5 to Connor McLaughlan on Thu Apr 1 12:00:02 2021

Hi Connor,

Connor McLaughlan wrote:

can anyone possibly give a list of known stable kernel versions for
SPARC machines? (Is it necessary to differentiate between
architectures or old vs. newer machines, i.e. sun4u/sun4v?)

Also, does this instability manifest as the machine crashing
during high workload (halting? rebooting?)?

I ask because on three different SPARC machines I have been
experiencing a weird effect when using Debian:
I would start a heavy compile load for several days (7-10), during which the
machines run fine without any apparent error visible in dmesg
or elsewhere.
Then when I power off and on again, the filesystem would be corrupt
and sometimes impossible to repair without reinstallation.

This seems to happen only when the machines do a long run with high
workload, and seemingly not when I just power them off for the night
with no high workload.

I have limited experience and can only share that the kernel I am
currently running on this Fire T2000,

Linux narya 5.9.0-5-sparc64-smp #1 SMP Debian 5.9.15-1 (2020-12-17)
sparc64 GNU/Linux

is quite stable for me: I compiled under high load (e.g. building the
Linux kernel on all 32 cores) and synced the git repositories of the Linux
kernel and the ArcticFox browser. In my experience a git sync of such
repositories is a good stress test; I have had disk drivers crash and
networks freeze on different architectures and systems, but not in this case.
However, I did not try compiling for several days, so I don't
know if it is stable over a long time.

Riccardo

From Anatoly Pugachev@21:1/5 to riccardo.mottola@libero.it on Thu Apr 1 12:30:01 2021

On Thu, Apr 1, 2021 at 12:59 PM Riccardo Mottola
<riccardo.mottola@libero.it> wrote:

This seems to happen only when the machines do a long run with high
workload, and seemingly not when I just power them off for the night
with no high workload.

I have limited experience and can only share that the kernel I am
currently running on this Fire T2000,

Linux narya 5.9.0-5-sparc64-smp #1 SMP Debian 5.9.15-1 (2020-12-17)
sparc64 GNU/Linux

is quite stable for me.
However, I did not try compiling for several days, so I don't
know if it is stable over a long time.

Riccardo,

if you would like to check sparc64 kernel stability, you might want to run stress-ng tests, like:

$ ./stress-ng --sequential 4 -v --timeout 3m --metrics-brief

it still successfully kills the latest (git) kernel (5.12.0-rc5) on my
sparc64 test LDOM running on a T5-2 hardware server.
But please take stress-ng from git repo [1] , since it has a few
recent fixes for sparc, not yet packaged into debian.

Thanks.

1. https://github.com/ColinIanKing/stress-ng/

From Riccardo Mottola@21:1/5 to Hermann.Lauer@uni-heidelberg.de on Thu Apr 1 13:50:02 2021

Hi Hermann,

Hermann.Lauer@uni-heidelberg.de wrote:

Yep, in your kernel config set:
CONFIG_SYSTEM_TRUSTED_KEYS=""

thanks, that was it! Now the kernel builds.

Do I need to do something special?

make install
make modules_install

Which shows:

multix@narya:~/code/linux-stable$ time sudo make install
sh ./arch/sparc/boot/install.sh 5.12.0-rc5+ arch/sparc/boot/zImage \
System.map "/boot"
run-parts: executing /etc/kernel/postinst.d/apt-auto-removal 5.12.0-rc5+ /boot/vmlinuz-5.12.0-rc5+
run-parts: executing /etc/kernel/postinst.d/initramfs-tools 5.12.0-rc5+ /boot/vmlinuz-5.12.0-rc5+
update-initramfs: Generating /boot/initrd.img-5.12.0-rc5+
run-parts: executing /etc/kernel/postinst.d/zz-update-grub 5.12.0-rc5+ /boot/vmlinuz-5.12.0-rc5+
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-5.12.0-rc5+
Found initrd image: /boot/initrd.img-5.12.0-rc5+
Found linux image: /boot/vmlinuz-5.12.0-rc5+.old
Found initrd image: /boot/initrd.img-5.12.0-rc5+
Found linux image: /boot/vmlinux-5.10.0-4-sparc64-smp
Found initrd image: /boot/initrd.img-5.10.0-4-sparc64-smp
Found linux image: /boot/vmlinux-5.10.0-trunk-sparc64-smp
Found initrd image: /boot/initrd.img-5.10.0-trunk-sparc64-smp
Found linux image: /boot/vmlinux-5.9.0-5-sparc64-smp
Found initrd image: /boot/initrd.img-5.9.0-5-sparc64-smp
done

real    33m3.954s
user    28m18.936s
sys     4m36.889s

At boot:

Loading Linux 5.12.0-rc5+ ...
error: premature end of file /vmlinuz-5.12.0-rc5+.
Loading initial ramdisk ...
error: you need to load the kernel first.

It is interesting how certain operations are very slow on this system,
since a single core is slow: installing takes longer than on a
Celeron laptop!
It took 33 minutes?!

Riccardo

From Anatoly Pugachev@21:1/5 to riccardo.mottola@libero.it on Thu Apr 1 15:10:01 2021

On Thu, Apr 1, 2021 at 2:40 PM Riccardo Mottola
<riccardo.mottola@libero.it> wrote:

multix@narya:~/code/linux-stable$ time sudo make install
sh ./arch/sparc/boot/install.sh 5.12.0-rc5+ arch/sparc/boot/zImage \
System.map "/boot"
run-parts: executing /etc/kernel/postinst.d/apt-auto-removal 5.12.0-rc5+ /boot/vmlinuz-5.12.0-rc5+
run-parts: executing /etc/kernel/postinst.d/initramfs-tools 5.12.0-rc5+ /boot/vmlinuz-5.12.0-rc5+
update-initramfs: Generating /boot/initrd.img-5.12.0-rc5+
run-parts: executing /etc/kernel/postinst.d/zz-update-grub 5.12.0-rc5+ /boot/vmlinuz-5.12.0-rc5+
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-5.12.0-rc5+
Found initrd image: /boot/initrd.img-5.12.0-rc5+
Found linux image: /boot/vmlinuz-5.12.0-rc5+.old
Found initrd image: /boot/initrd.img-5.12.0-rc5+
Found linux image: /boot/vmlinux-5.10.0-4-sparc64-smp
Found initrd image: /boot/initrd.img-5.10.0-4-sparc64-smp
Found linux image: /boot/vmlinux-5.10.0-trunk-sparc64-smp
Found initrd image: /boot/initrd.img-5.10.0-trunk-sparc64-smp
Found linux image: /boot/vmlinux-5.9.0-5-sparc64-smp
Found initrd image: /boot/initrd.img-5.9.0-5-sparc64-smp
done

At boot:

Loading Linux 5.12.0-rc5+ ...
error: premature end of file /vmlinuz-5.12.0-rc5+.
Loading initial ramdisk ...
error: you need to load the kernel first.

current grub2 version does not support compressed image kernels, do
the following:

gzip -dc /boot/vmlinuz-5.12.0-rc5+ > /boot/vmlinux-5.12.0-rc5+
rm /boot/vmlinuz-5.12.0-rc5+
update-grub

and reboot
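
Before decompressing, it can be worth confirming that the installed image really is gzip data. A small sketch that checks for the gzip magic bytes `1f 8b`, run here on scratch files rather than a real kernel (the helper name `is_gzip` and the demo file names are made up):

```shell
#!/bin/sh
# Return success if the file starts with the gzip magic bytes 1f 8b.
# is_gzip is a made-up helper name for this example.
is_gzip() {
    magic=$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "1f8b" ]
}

# Demo on scratch files rather than a real kernel image.
tmp=$(mktemp -d)
printf 'not a real kernel\n' > "$tmp/vmlinux-demo"
gzip -c "$tmp/vmlinux-demo" > "$tmp/vmlinuz-demo"

for f in "$tmp/vmlinuz-demo" "$tmp/vmlinux-demo"; do
    if is_gzip "$f"; then
        echo "$(basename "$f"): gzip-compressed"
    else
        echo "$(basename "$f"): not gzip"
    fi
done
```

On the real system the check would be pointed at /boot/vmlinuz-5.12.0-rc5+ before running the `gzip -dc` step.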

From Hermann Lauer@21:1/5 to Riccardo Mottola on Thu Apr 1 14:40:01 2021

Hi Riccardo,

On Thu, Apr 01, 2021 at 01:43:29PM +0200, Riccardo Mottola wrote:

Yep, in your kernel config set:
CONFIG_SYSTEM_TRUSTED_KEYS=""

thanks, that was it! Now the kernel builds.

great!

Do I need to do something special?

make install
make modules_install

sorry, don't know. I'm always doing:

make -j<core#> bindeb-pkg
dpkg -i ../linux-image*.deb

But that is even slower on weak hardware (e.g. BananaUltra) and the above
SHOULD work. The advantage comes when deleting kernels.

Loading Linux 5.12.0-rc5+ ...
error: premature end of file /vmlinuz-5.12.0-rc5+.

Somehow your vmlinuz is too short or the loader is not able to put it
in memory.

Good luck and greetings
Hermann

--
Administration/Zentrale Dienste, Interdiziplinaeres
Zentrum fuer wissenschaftliches Rechnen der Universitaet Heidelberg
IWR; INF 205; 69120 Heidelberg; Tel: (06221)54-14405 Fax: -14427
Email: Hermann.Lauer@iwr.uni-heidelberg.de

From Riccardo Mottola@21:1/5 to Anatoly Pugachev on Thu Apr 1 16:30:01 2021

Hi Anatoly!

Anatoly Pugachev wrote:

current grub2 version does not support compressed image kernels, do
the following:

gzip -dc /boot/vmlinuz-5.12.0-rc5+ > /boot/vmlinux-5.12.0-rc5+
rm /boot/vmlinuz-5.12.0-rc5+
update-grub

and reboot

oh yes, that was it. Finally, I could boot my own built kernel. Which,
of course, crashes as expected.
At least I can confirm Frank's findings.

Riccardo

From John Paul Adrian Glaubitz@21:1/5 to Riccardo Mottola on Sat Dec 11 19:00:03 2021

On 12/11/21 18:40, Riccardo Mottola wrote:

I remember you bisected to find the breaking commit. Has there been any progress?
Is there a better place to report this issue than this mailing list?

The proper place is to send an email to the author of the breaking commit and CC the sparclinux Linux kernel mailing list. Most kernel developers don't read the debian-sparc mailing list.

Adrian

--
.''`. John Paul Adrian Glaubitz
: :' : Debian Developer - glaubitz@debian.org
`. `' Freie Universitaet Berlin - glaubitz@physik.fu-berlin.de
`- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913

From Riccardo Mottola@21:1/5 to frank.scheiner@web.de on Sat Dec 11 18:50:02 2021

Hi Frank,

several months have passed, new kernels have come into Debian and they still do not work for me, so let me dig up this matter again.
I can continue using 5.9 for now, but for how long?

On 2021-03-11 23:43:10 +0100 Frank Scheiner <frank.scheiner@web.de> wrote:

From [1] I assume T2 CPUs are not affected, but yeah, the issue could
be that selective that it only affects the very first generation.

[1]: https://lists.debian.org/debian-sparc/2021/03/msg00010.html

Did more people report this issue, perhaps on other systems?

I remember you bisected to find the breaking commit. Has there been any progress? Is there a better place to report this issue than this mailing list?

Thank you,
Riccardo

--
Sent with GNUMail running on MacOS 10.7

From Frank Scheiner@21:1/5 to John Paul Adrian Glaubitz on Sat Dec 11 20:40:02 2021

Hi guys,

On 11.12.21 18:59, John Paul Adrian Glaubitz wrote:

On 12/11/21 18:40, Riccardo Mottola wrote:

I remember you bisected the breaking commits. Has there been any progress?
Is there a better place to report this issue than this mailing list?

The proper place is to send an email to the author of the breaking commit and CC the sparclinux Linux kernel mailing list. Most kernel developers don't read
the debian-sparc mailing list.

We actually did discuss this in late March 2021 starting here:

https://lists.debian.org/debian-sparc/2021/03/msg00045.html

...with Christoph Hellwig and CCed to sparclinux@vger.kernel.org and
this list, but no solution back then.

****

Back in October I did some testing on various UltraSPARC machines to
sort out which processor( generation)s are affected but didn't find the
time to make something out of it apart from notes and a conclusion.

I couldn't get my Ultra 80 to netboot, so no result for UltraSPARC II.

My Ultra 10 with US IIi worked though with kernel 5.14.0-3.

My 280r with US III worked with kernel 5.9.0-5 and with 5.14.0-3 gives:

```
Begin: Retrying nfs mount ... mount: Invalid argument
done.
```

...when trying to mount the root FS.

My v480 crashes with 5.14.0-3 but it crashed with every kernel version I
tried since I own it, so perfectly normal. I don't know what the issue
is, because hardware-wise, the - working with 5.9.0-5 - 280r seems to be
very similar though with only 2 processors instead of 4 for the V480.

My T5220 with T2 crashed once with 5.14.0-3 but worked with 5.14.0-4. It
later also worked with 5.14.0-3. And the crash happened way before a
mount of the root FS was tried, so possibly unrelated.

My T1000 with T1 panics with 5.14.0-3 because it can't mount the root
FS. Using `break=premount` in the kernel command line and issuing the
mount command manually gives:

```
(initramfs) nfsmount -o nolock "172.16.0.2:/srv/nfs/t1000/root" "$rootmnt"
[ 641.272949] Unable to handle kernel paging request at virtual address 0000612000000000
[ 641.273138] tsk->{mm,active_mm}->context = 000000000000038f
[ 641.273248] tsk->{mm,active_mm}->pgd = ffff800016c1c000
[ 641.273310] \|/ ____ \|/
[ 641.273310] "@'/ .. \`@"
[ 641.273310] /_| \__/ |_\
[ 641.273310] \__U_/
[ 641.273444] nfsmount(750): Oops [#182]
[ 641.273497] CPU: 12 PID: 750 Comm: nfsmount Tainted: G D E 5.14.0-3-sparc64-smp #1 Debian 5.14.12-1
[ 641.273603] TSTATE: 0000000011001607 TPC: 000000000069ce48 TNPC: 000000000069ce4c Y: 00000000 Tainted: G D E
[ 641.273705] TPC: <kfree+0x48/0x400>
[ 641.273775] g0: 0000000000000006 g1: 0000000400000000 g2: 0000600000000000 g3: ffff8001fda18000
[ 641.273858] g4: ffff800013b13340 g5: ffff8001fda18000 g6: ffff800016bd0000 g7: ffff800016bd3c30
[ 641.273942] o0: fffffffffffffffe o1: 00000000006f4c94 o2: 0000000000002000 o3: ffff8000146d3aa8
[ 641.274024] o4: 0000000000000008 o5: 0000000000000cc0 sp: ffff800016bd34a1 ret_pc: 00000000006f4c54
[ 641.274107] RPC: <sys_mount+0x74/0x1a0>
[ 641.274165] l0: 0000000000f1a000 l1: 000000000111f000 l2: 0000000000422db4 l3: 0000000000201db0
[ 641.274292] l4: 000000000000029c l5: ffff80010000c1a0 l6: ffff800016bd0000 l7: 00000000006f4be0
[ 641.274377] i0: 0000000000000cc0 i1: 0000000000201fe0 i2: 0000000000000001 i3: ffff800016bd3dd0
[ 641.274460] i4: 0000000000000000 i5: 0000612000000000 i6: ffff800016bd3561 i7: 00000000006f4c94
[ 641.274542] I7: <sys_mount+0xb4/0x1a0>
[ 641.274599] Call Trace:
[ 641.274640] [<00000000006f4c94>] sys_mount+0xb4/0x1a0
[ 641.274712] [<00000000006f4c54>] sys_mount+0x74/0x1a0
[ 641.274783] [<0000000000406274>] linux_sparc_syscall+0x34/0x44
[ 641.274866] Caller[00000000006f4c94]: sys_mount+0xb4/0x1a0
[ 641.274939] Caller[00000000006f4c54]: sys_mount+0x74/0x1a0
[ 641.275011] Caller[0000000000406274]: linux_sparc_syscall+0x34/0x44
[ 641.275090] Caller[0000000000100aa8]: 0x100aa8
[ 641.275143] Instruction DUMP:
[ 641.275150] ba074001
[ 641.275192] bb2f7003
[ 641.275233] ba074002
[ 641.275274] <c25f6008>
[ 641.275314] 84086001
[ 641.275355] 82007fff
[ 641.275395] 8378841d
[ 641.275436] ba100001
[ 641.275525] c2586008
[ 641.275614]
Killed
```

Doing the same on a V210 with US IIIi gives:

```
(initramfs) nfsmount -o nolock "172.16.0.2:/srv/nfs/v210/root" "$rootmnt"
mount: Invalid argument
(initramfs) echo $?
1
```

...so similar to 280r with US III.

From all that, I assume UltraSPARC IIi driven machines (and most likely
also older ones with US II) are not affected by this, as are UltraSPARC
T2 driven ones and possibly machines with newer processors (I didn't
have time to try one of my T5240s with T2+).

UltraSPARC III, IIIi and T1 driven machines are affected and to me it
now looks more like some of the klibc programs from the initramfs are at
fault.

I also tested my V210 with an on-disk root FS and although the mounting
seemed to work for that method with 5.14.0-3 I faced multiple problems
later on that crashed the machine.

My next try would have been to test mounting of the root FS with
non-klibc programs. But I'm unsure how to get these into an initramfs -
with dracut maybe?

Cheers,
Frank

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Riccardo Mottola@21:1/5 to All on Fri Jan 14 18:00:01 2022

Hi all,

as Frank asked, I compiled myself a kernel using his latest commit
identified as good:
67e306c6906137020267eb9bbdbc127034da3627

and this kernel works, but then fails to load initramfs.

I don't know whether the crash happened before or after that point, so is
this "proof" that the commit is good, or is it not conclusive?

The good news is that latest kernel installed seems to boot and takes
all CPUs online. How stable it is I don't know, it needs to be tested.

Riccardo

5.15.0-2-sparc64-smp #1 SMP Debian 5.15.5-2 (2021-12-18) sparc64 GNU/Linux

multix@narya:~$ cat /proc/cpuinfo
cpu : UltraSparc T1 (Niagara)
fpu : UltraSparc T1 integrated FPU
pmu : niagara
prom : OBP 4.30.4.d 2011/07/06 14:29
type : sun4v
ncpus probed : 32
ncpus active : 32
D$ parity tl1 : 0
I$ parity tl1 : 0
cpucaps : flush,stbar,swap,muldiv,v9,blkinit,mul32,div32,v8plus,ASIBlkInit
Cpu0ClkTck : 000000003b9aca00
Cpu1ClkTck : 000000003b9aca00
Cpu2ClkTck : 000000003b9aca00
Cpu3ClkTck : 000000003b9aca00
Cpu4ClkTck : 000000003b9aca00
Cpu5ClkTck : 000000003b9aca00
Cpu6ClkTck : 000000003b9aca00
Cpu7ClkTck : 000000003b9aca00
Cpu8ClkTck : 000000003b9aca00
Cpu9ClkTck : 000000003b9aca00
Cpu10ClkTck : 000000003b9aca00
Cpu11ClkTck : 000000003b9aca00
Cpu12ClkTck : 000000003b9aca00
Cpu13ClkTck : 000000003b9aca00
Cpu14ClkTck : 000000003b9aca00
Cpu15ClkTck : 000000003b9aca00
Cpu16ClkTck : 000000003b9aca00
Cpu17ClkTck : 000000003b9aca00
Cpu18ClkTck : 000000003b9aca00
Cpu19ClkTck : 000000003b9aca00
Cpu20ClkTck : 000000003b9aca00
Cpu21ClkTck : 000000003b9aca00
Cpu22ClkTck : 000000003b9aca00
Cpu23ClkTck : 000000003b9aca00
Cpu24ClkTck : 000000003b9aca00
Cpu25ClkTck : 000000003b9aca00
Cpu26ClkTck : 000000003b9aca00
Cpu27ClkTck : 000000003b9aca00
Cpu28ClkTck : 000000003b9aca00
Cpu29ClkTck : 000000003b9aca00
Cpu30ClkTck : 000000003b9aca00
Cpu31ClkTck : 000000003b9aca00
MMU Type : Hypervisor (sun4v)
MMU PGSZs : 8K,64K,4MB,256MB
State:
CPU0: online
CPU1: online
CPU2: online
CPU3: online
CPU4: online
CPU5: online
CPU6: online
CPU7: online
CPU8: online
CPU9: online
CPU10: online
CPU11: online
CPU12: online
CPU13: online
CPU14: online
CPU15: online
CPU16: online
CPU17: online
CPU18: online
CPU19: online
CPU20: online
CPU21: online
CPU22: online
CPU23: online
CPU24: online
CPU25: online
CPU26: online
CPU27: online
CPU28: online
CPU29: online
CPU30: online
CPU31: online
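
As a side note on reading the output above: the `CpuNClkTck` values are hexadecimal. Assuming the field holds the CPU clock in Hz (which matches a 1.0 GHz UltraSPARC T1), a minimal Python sketch for decoding it could look like this. The function names are made up for illustration.

```python
def clktck_to_hz(hex_value):
    """Convert a CpuNClkTck hex string from /proc/cpuinfo to an integer (assumed Hz)."""
    return int(hex_value, 16)


def parse_clktck(cpuinfo_text):
    """Map CPU index -> clock value for every 'CpuNClkTck : <hex>' line."""
    clocks = {}
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key.startswith("Cpu") and key.endswith("ClkTck"):
            index = int(key[len("Cpu"):-len("ClkTck")])
            clocks[index] = clktck_to_hz(value.strip())
    return clocks


# Two lines copied from the cpuinfo output above.
sample = """\
Cpu0ClkTck : 000000003b9aca00
Cpu1ClkTck : 000000003b9aca00
"""

clocks = parse_clktck(sample)
print(clocks[0] / 1e9, "GHz")  # 0x3b9aca00 == 1000000000
```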

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Paul Adrian Glaubitz@21:1/5 to Riccardo Mottola on Fri Jan 14 18:30:01 2022

Hi!

On 1/14/22 17:58, Riccardo Mottola wrote:

as Frank asked, I compiled myself a kernel using his latest commit
identified as good:
67e306c6906137020267eb9bbdbc127034da3627

and this kernel works, but then fails to load initramfs.

Did you forget to create an initrd? After installing the kernel, run:

$ update-initramfs -k KERNEL_VERSION -c

The good news is that latest kernel installed seems to boot and takes
all CPUs online. How stable it is I don't know, it needs to be tested.

Please run some stress tests such as stress-ng and report back.

Adrian

--
.''`. John Paul Adrian Glaubitz
: :' : Debian Developer - glaubitz@debian.org
`. `' Freie Universitaet Berlin - glaubitz@physik.fu-berlin.de
`- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Riccardo Mottola@21:1/5 to John Paul Adrian Glaubitz on Mon Jan 17 14:50:01 2022

Hi Adrian,

John Paul Adrian Glaubitz wrote:

Did you forget to create an initrd? After installing the kernel, run:

$ update-initramfs -k KERNEL_VERSION -c

I did not run it this way, will do.

I had it however, of a very big size:
316M Jan 14 17:15 initrd.img-5.9.0-rc1+

which filled up my /boot

I removed it, regenerated with your command, but I get dropped into
initramfs with no modules found. Hmm..

The good news is that latest kernel installed seems to boot and takes
all CPUs online. How stable it is I don't know, it needs to be tested.

Please run some stress tests such as stress-ng and report back.

Not nice. I started compiling some stuff and the box froze. I connected the
serial console and could not resume due to "Fast Data Access MMU miss".

I will now stress things again, but keeping serial console attached with another computer and see.

Up to last week, with the old 5.9 kernel, I had no issues compiling even
large things such as the Gecko-based ArcticFox or the Linux kernel itself. So if
the Fire didn't fail over the weekend... it smells like kernel instability.

What should I use in stress-ng? I just tried "--all 8 --timeout 120s"

and the machine locked up after a little and in the serial console I see:

[ 8563.833509] current->{active_,}mm->context = 0000000000000fcb
[ 8563.833523] current->{active_,}mm->pgd = ffff8000d35c8000
[ 8563.846347] Unable to handle kernel NULL pointer dereference in mna handler
[ 8563.846365] at virtual address 00000000000000e7
[ 8563.846380] current->{active_,}mm->context = 0000000000000fcc
[ 8563.846395] current->{active_,}mm->pgd = ffff8000d2d3c000
[ 8563.856171] Unable to handle kernel NULL pointer dereference
[ 8563.863274] tsk->{mm,active_mm}->context = 0000000000000fd2
[ 8563.863294] tsk->{mm,active_mm}->pgd = ffff8000d3fc0000
[ 8563.928911] Unable to handle kernel NULL pointer dereference in mna handler
[ 8563.928935] at virtual address 00000000000000e7
[ 8563.928955] current->{active_,}mm->context = 0000000000000fde
[ 8563.928972] current->{active_,}mm->pgd = ffff8000d32e8000
[ 8563.952221] Unable to handle kernel NULL pointer dereference in mna handler
[ 8563.952244] at virtual address 00000000000000e7
[ 8563.952261] current->{active_,}mm->context = 0000000000000fe3
[ 8563.952278] current->{active_,}mm->pgd = ffff8000d2f54000
[ 8563.954004] Unable to handle kernel NULL pointer dereference in mna handler
[ 8563.954022] at virtual address 00000000000000e7
[ 8563.954037] current->{active_,}mm->context = 0000000000000fe5
[ 8563.954053] current->{active_,}mm->pgd = ffff8000d2d5c000
[ 8563.972643] Unable to handle kernel NULL pointer dereference
[ 8563.972660] tsk->{mm,active_mm}->context = 0000000000000fea
[ 8563.972677] tsk->{mm,active_mm}->pgd = ffff8000d31300

These are kernel messages, not OF, so it looks like a kernel problem

Riccardo

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Riccardo Mottola@21:1/5 to All on Mon Jan 17 15:20:01 2022

I reply to myself.

I ran the old 5.9 kernel from Debian, which has proven quite stable.
I ran the same tests... and I did indeed find one error in the console.

[ 380.918996] Unable to handle kernel NULL pointer dereference
[ 380.919198] tsk->{mm,active_mm}->context = 000000000000057d
[ 380.919326] tsk->{mm,active_mm}->pgd = ffff8003f1fd4000
[ 380.919496] \|/ ____ \|/
"@'/ .. \`@"
/_| \__/ |_\
\__U_/
[ 380.919510] stress-ng(1529): Oops [#287]
[ 380.919536] CPU: 24 PID: 1529 Comm: stress-ng Tainted: G D E X 5.9.0-5-sparc64-smp #1 Debian 5.9.15-1
[ 380.919557] TSTATE: 0000008811001602 TPC: 000000000042d8e0 TNPC: 000000000042d8e4 Y: 00000000 Tainted: G D E X
[ 380.919587] TPC: <do_signal+0x440/0x560>
[ 380.919604] g0: ffff800100ef7194 g1: 0000000000000328 g2: 0000000000000000 g3: ffff80010002c000
[ 380.919620] g4: ffff8003cf6f6b40 g5: ffff8003fdea4000 g6: ffff8003cf9cc000 g7: 0000000000004000
[ 380.919634] o0: 00000000000001e8 o1: 0000000000000328 o2: ffff8003cf9cc000 o3: 0000000000000007
[ 380.919650] o4: 0000000000000007 o5: fffffffffffffff2 sp: ffff8003cf9cf451 ret_pc: 000000000042d8c4
[ 380.919673] RPC: <do_signal+0x424/0x560>
[ 380.919690] l0: 0208000104000004 l1: 00000044f0000226 l2: ffff800100ef7194 l3: 0000000000000000
[ 380.919705] l4: 0000000000000000 l5: 0000000000000005 l6: ffff8003cf9cc000 l7: 0000000000698c20
[ 380.919719] i0: 0000000000000070 i1: 0000000000000208 i2: fffffffffffffff2 i3: ffff8003cf9eff70
[ 380.919732] i4: fffffffffffffff2 i5: 0000000000000000 i6: ffff8003cf9cf4d1 i7: 000000000042d6fc
[ 380.919752] I7: <do_signal+0x25c/0x560>
[ 380.919760] Call Trace:
[ 380.919783] [<000000000042d6fc>] do_signal+0x25c/0x560
[ 380.919806] [<000000000042e218>] do_notify_resume+0x58/0xa0
[ 380.919828] [<0000000000404b48>] __handle_signal+0xc/0x30
[ 380.919852] Caller[000000000042d6fc]: do_signal+0x25c/0x560
[ 380.919874] Caller[000000000042e218]: do_notify_resume+0x58/0xa0
[ 380.919893] Caller[0000000000404b48]: __handle_signal+0xc/0x30
[ 380.919910] Caller[ffff800100ef716c]: 0xffff800100ef716c
[ 380.919916] Instruction DUMP:
[ 380.919923] c029a00d
[ 380.919930] b4168008
[ 380.919938] 900761e8
[ 380.919945] <d25e2070>
[ 380.919952] 40014fef
[ 380.919959] b416801c
[ 380.919965] c2592468
[ 380.919972] b8100008
[ 380.919979] 920126c8

[ 380.972358] systemd-journald[66048]: File /var/log/journal/bdb2a41ce825489ba567bea53add247e/system.journal corrupted or uncleanly shut down, renaming and replacing.
[ 407.494981] systemd[1]: Started Journal Service.

as well as errors in the stressors:
stress-ng: info: [12989] stress-ng-fanotify: 148 open, 41 close write, 128 close nowrite, 96 access, 27 modify
stress-ng: info: [12908] stress-ng-fanotify: 159 open, 66 close write, 108 close nowrite, 88 access, 43 modify
stress-ng: info: [12911] stress-ng-fanotify: 147 open, 43 close write, 122 close nowrite, 99 access, 20 modify
stress-ng: info: [13079] stress-ng-fanotify: 159 open, 60 close write, 112 close nowrite, 97 access, 32 modify
stress-ng: info: [12820] stress-ng-fanotify: 155 open, 46 close write, 123 close nowrite, 87 access, 27 modify
stress-ng: info: [913] unsuccessful run completed in 282.58s (4 mins, 42.58 secs)
stress-ng: fail: [913] chattr instance 2 corrupted bogo-ops counter, 48 vs 0
stress-ng: fail: [913] chattr instance 2 hash error in bogo-ops counter and run flag, 1918819509 vs 0
stress-ng: fail: [913] chattr instance 6 corrupted bogo-ops counter, 50 vs 0
stress-ng: fail: [913] chattr instance 6 hash error in bogo-ops counter and run flag, 506138270 vs 0
stress-ng: fail: [913] dnotify instance 4 corrupted bogo-ops counter, 224 vs 0
info: 5 failures reached, aborting stress process
stress-ng: fail: [913] dnotify instance 4 hash error in bogo-ops counter and run flag, 1503783545 vs 0
stress-ng: fail: [913] dnotify instance 6 corrupted bogo-ops counter, 222 vs 0
stress-ng: fail: [913] dnotify instance 6 hash error in bogo-ops counter and run flag, 4199465241 vs 0
stress-ng: fail: [913] metrics-check: stressor metrics corrupted, data is compromised

However the machine did not crash.
I did run exactly the same stress command again... and the failures are reproducible, so I suppose maybe the tests are not 64bit big endian safe
or such.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Paul Adrian Glaubitz@21:1/5 to Riccardo Mottola on Mon Jan 17 16:10:02 2022

Hi!

On 1/17/22 14:41, Riccardo Mottola wrote:

The good news is that latest kernel installed seems to boot and takes
all CPUs online. How stable it is I don't know, it needs to be tested.

Please run some stress tests such as stress-ng and report back.

Not nice. I started compiling some stuff and the box froze, I connected serial console and could not resume due to Fast Data Access MMU miss"

So, this crash occurs with the latest 5.15 kernel on your T2000?

In my experience, the most stable kernels on the older SPARCs are still the 4.19 kernels. Thus, we should start bisecting to find out what commit actually made the kernel unreliable on these older SPARCs.
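
For readers unfamiliar with bisecting: the idea is a binary search over the commit history between a known-good and a known-bad kernel, building and booting the midpoint each time. The Python sketch below only illustrates that search logic; it is not `git bisect` itself, and the commit names and the `crashes` test are hypothetical stand-ins for "build this commit, boot it, run the stress test".

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first bad commit.

    Assumes commits[0] is known good, commits[-1] is known bad, and every
    commit after the first bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # breakage is at mid or earlier
        else:
            lo = mid  # breakage is after mid
    return commits[hi]


# Hypothetical linear history c0..c15 where c11 introduced the breakage.
history = [f"c{i}" for i in range(16)]
tested = []


def crashes(commit):
    tested.append(commit)  # record how many test boots were needed
    return int(commit[1:]) >= 11


print(first_bad(history, crashes), "found after", len(tested), "test boots")
```

The search only gives a meaningful answer if each verdict is reliable. With crashes that do not show up on every boot, a single clean boot probably should not count as "good"; repeating the test several times per step is one way to reduce that risk.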

Adrian

--
.''`. John Paul Adrian Glaubitz
: :' : Debian Developer - glaubitz@debian.org
`. `' Freie Universitaet Berlin - glaubitz@physik.fu-berlin.de
`- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Riccardo Mottola@21:1/5 to John Paul Adrian Glaubitz on Mon Jan 17 20:40:02 2022

Hi,

John Paul Adrian Glaubitz wrote:

Not nice. I started compiling some stuff and the box froze, I connected
serial console and could not resume due to Fast Data Access MMU miss"

So, this crash occurs with the latest 5.15 kernel on your T2000?

exactly latest kernel.

I will retest it with stress-ng as soon as I finish this email and copy
the dmesg errors.

In my experience, the most stable kernels on the older SPARCs are still the 4.19 kernels. Thus, we should start bisecting to find out what commit actually
made the kernel unreliable on these older SPARCs.

We must find a good way to test though. I stress-tested the 5.9 kernel
further. The system sometimes seemed unresponsive, but eventually
recovered (some of the errors are pasted below). Thus I would consider
it "stable". I ran several short bursts of tests and then a session
set to 30 minutes that, due to hiccups, lasted more like 2 hours,
but afterwards the machine was still up.

sudo stress-ng --all 10 --timeout 30m

10 times means more than physical CPUs, but less than logical cores
(32). The system has 16GB of ram, I see some OOMs in dmesg, I wonder if
this is due to certain stress tests specifically going against any limit.

[16195.300448] Unable to handle kernel NULL pointer dereference in mna
handler
[16195.741725] 40014fef
[16195.741793] at virtual address 00000000000000e7
[16195.767936] b416801c
[16195.767945] c2592468
[16195.767990] current->{active_,}mm->context = 0000000000000bb8
[16195.768848] b8100008
[16195.768857] 920126c8
[16195.769673] current->{active_,}mm->pgd = ffff800089cac000

[16195.770413] \|/ ____ \|/
"@'/ .. \`@"
/_| \__/ |_\
\__U_/
[16196.303333] systemd-journald[219777]: /dev/kmsg buffer overrun, some messages lost.
[16196.304235] stress-ng(234874): Oops [#864]
[16196.304262] CPU: 8 PID: 234874 Comm: stress-ng Tainted: G D E X 5.9.0-5-sparc64-smp #1 Debian 5.9.15-1
[16196.304281] TSTATE: 0000008811001605 TPC: 000000000042d8e0 TNPC: 000000000042d8e4 Y: 00000000 Tainted: G D E X
[16196.304311] TPC: <do_signal+0x440/0x560>
[16196.304327] g0: 000000000040770c g1: 000000000000032f g2: 0000000000000000 g3: ffff80010007c000
[16196.304341] g4: ffff8003f13f9240 g5: ffff8003fdaa4000 g6: ffff800087df8000 g7: 0000000000004000
[16196.304355] o0: 00000000000001ef o1: 000000000000032f o2: ffff800087df8000 o3: 0000000000000007
[16196.304368] o4: 0000000000000007 o5: fffffffffffffff2 sp: ffff800087dfb451 ret_pc: 000000000042d8c4
[16196.304390] RPC: <do_signal+0x424/0x560>
[16196.304404] l0: 0308000103000004 l1: 00000044f0001201 l2: 000000000040770c l3: 0000000000000000
[16196.304418] l4: 0000000000000000 l5: ffff80010007c000 l6: ffff800087df8000 l7: 0000000011001002
[16196.304432] i0: 0000000000000077 i1: 000000000000020f i2: fffffffffffffff2 i3: ffff800187dfff70
[16196.304445] i4: fffffffffffffff2 i5: 0000000000000007 i6: ffff800087dfb4d1 i7: 000000000042d6fc
[16196.304472] I7: <do_signal+0x25c/0x560>
[16205.284863] aes_sparc64: sparc64 aes opcodes not available.
[16205.753417] Call Trace:
[16205.753453] [<000000000042d6fc>] do_signal+0x25c/0x560
[16205.753478] [<000000000042e218>] do_notify_resume+0x58/0xa0
[16205.753500] [<0000000000404b48>] __handle_signal+0xc/0x30
[16205.753525] Caller[000000000042d6fc]: do_signal+0x25c/0x560
[16205.753546] Caller[000000000042e218]: do_notify_resume+0x58/0xa0
[16205.753562] Caller[0000000000404b48]: __handle_signal+0xc/0x30
[16205.753575] Caller[000001000007294c]: 0x1000007294c
[16205.753580] Instruction DUMP:
[16205.753587] c029a00d
[16205.753595] b4168008
[16205.753602] 900761e8
[16205.753610] <d25e2070>
[16205.753616] 40014fef
[16205.753623] b416801c
[16205.753629] c2592468
[16205.753636] b8100008
[16205.753644] 920126c8

then also these messages. I think they explain the "slowness" and
apparent freeze of the system - I was about to power-cycle but waited
and it recovered:

[16253.233924] ata1.00: qc timeout (cmd 0xa0)
[16335.213786] PM: hibernation: Basic memory bitmaps created
[16830.619976] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[16830.620193] (detected by 18, t=5252 jiffies, g=711181, q=6)
[16830.620215] rcu: All QSes seen, last rcu_sched kthread activity 1191 (4299098242-4299097051), jiffies_till_next_fqs=1, root ->qsmask 0x0
[16830.620491] rcu: rcu_sched kthread starved for 1191 jiffies! g711181 f0x2 RCU_GP_CLEANUP(7) ->state=0x0 ->cpu=30
[16830.620749] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[16830.620844] rcu: RCU grace-period kthread stack dump:
[16830.621069] task:rcu_sched state:R running task stack: 0 pid: 10 ppid: 2 flags:0x05000000
[16830.621095] Call Trace:
[16830.621128] [<0000000000bda560>] _cond_resched+0x40/0x60
[16830.621153] [<00000000004ee1d0>] rcu_gp_kthread+0x9b0/0xe40
[16830.621175] [<0000000000491c48>] kthread+0x108/0x120
[16830.621205] [<00000000004060c8>] ret_from_fork+0x1c/0x2c
[16830.621224] [<0000000000000000>] 0x0
[16982.524373] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[16982.524591] (detected by 20, t=5252 jiffies, g=711637, q=15)
[16982.524612] rcu: All QSes seen, last rcu_sched kthread activity 5247 (4299136209-4299130962), jiffies_till_next_fqs=1, root ->qsmask 0x0
[16982.524839] rcu: rcu_sched kthread starved for 5247 jiffies! g711637 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=16
[16982.525098] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[16982.525201] rcu: RCU grace-period kthread stack dump:
[16982.525377] task:rcu_sched state:R running task stack: 0 pid: 10 ppid: 2 flags:0x06000000
[16982.525404] Call Trace:
[16982.525435] [<0000000000bda3d4>] schedule+0x54/0x100
[16982.525464] [<0000000000bddc50>] schedule_timeout+0x70/0x140
[16982.525489] [<00000000004edeb4>] rcu_gp_kthread+0x694/0xe40
[16982.525511] [<0000000000491c48>] kthread+0x108/0x120
[16982.525540] [<00000000004060c8>] ret_from_fork+0x1c/0x2c
[16982.525558] [<0000000000000000>] 0x0
[17596.494910] sched: RT throttling activated
[17664.665608] PM: hibernation: Basic memory bitmaps freed
[17664.838884] audit: type=1400 audit(1642442424.829:817): apparmor="STATUS" info="failed to unpack policydb" error=-86 profile="unconfined" name="/usr/bin/pulseaudio-eg" pid=234012 comm="stress-ng" name="/usr/bin/pulseaudio-eg" offset=2536
[17665.077468] aes_sparc64: sparc64 aes opcodes not available.
[17665.685823] aes_sparc64: sparc64 aes opcodes not available.
[17686.297683] systemd[1]: systemd-journald.service: Main process exited, code=killed, status=6/ABRT
[17686.300569] systemd[1]: systemd-journald.service: Failed with result 'watchdog'.
[17686.733029] systemd[1]: systemd-journald.service: Consumed 53.065s CPU time.
[17686.938707] systemd[1]: systemd-journald.service: Scheduled restart job, restart counter is at 3.
[17687.012114] systemd[1]: Stopped Journal Service.
[17687.020312] systemd[1]: systemd-journald.service: Consumed 53.065s CPU time.
[17690.324815] systemd[1]: Starting Journal Service...
[17690.831298] systemd-journald[258852]: File /var/log/journal/bdb2a41ce825489ba567bea53add247e/system.journal corrupted or uncleanly shut down, renaming and replacing.
[17709.718653] systemd[1]: Started Journal Service.

Perhaps we can at least understand these errors and restrict ourselves to
specific tests? This could give us better testing, and Frank could also try to
run the same tests on his systems.
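
One way to make such runs comparable across machines would be to reduce each console log to a short summary of which kernel function faulted and which process triggered it. The Python sketch below is only an illustration based on the line formats visible in the logs in this thread (`name(pid): Oops [#N]` and `TPC: <function+offset/size>`); the function name `summarize` and the returned keys are made up for this example.

```python
import re
from collections import Counter

# "stress-ng(1529): Oops [#287]" -> process name, PID, running Oops number
OOPS_RE = re.compile(r"(\S+)\((\d+)\): Oops \[#(\d+)\]")
# "TPC: <do_signal+0x440/0x560>" -> faulting function; "RPC: <...>" lines do not match
TPC_RE = re.compile(r"\bTPC: <([^+>]+)\+0x[0-9a-f]+/0x[0-9a-f]+>")


def summarize(log_text):
    """Count faulting functions and crashing processes in a kernel log."""
    functions = Counter()
    processes = Counter()
    max_oops = 0
    for line in log_text.splitlines():
        oops = OOPS_RE.search(line)
        if oops:
            processes[oops.group(1)] += 1
            max_oops = max(max_oops, int(oops.group(3)))
        tpc = TPC_RE.search(line)
        if tpc:
            functions[tpc.group(1)] += 1
    return {
        "faulting_functions": functions,
        "processes": processes,
        "max_oops": max_oops,
    }


# Lines copied from the logs posted in this thread.
sample = """\
[ 380.919510] stress-ng(1529): Oops [#287]
[ 380.919557] TSTATE: 0000008811001602 TPC: 000000000042d8e0 TNPC: 000000000042d8e4 Y: 00000000
[ 380.919587] TPC: <do_signal+0x440/0x560>
[ 380.919673] RPC: <do_signal+0x424/0x560>
[ 641.273444] nfsmount(750): Oops [#182]
[ 641.273705] TPC: <kfree+0x48/0x400>
"""

summary = summarize(sample)
for name, count in summary["faulting_functions"].most_common():
    print(name, count)
```

Feeding it the saved serial console output from two machines would show at a glance whether both fault in the same function (for example `do_signal`) or in different places.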

Riccardo

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Riccardo Mottola@21:1/5 to Riccardo Mottola on Mon Jan 17 21:40:02 2022

Hi,

Riccardo Mottola wrote:

John Paul Adrian Glaubitz wrote:

Not nice. I started compiling some stuff and the box froze, I connected
serial console and could not resume due to Fast Data Access MMU miss"

So, this crash occurs with the latest 5.15 kernel on your T2000?

exactly latest kernel.

I will retest it with stress-ng as soon as I finish this email and copy
the dmesg errors.

wow, running the test suite once or twice, I am able to have the system power-cycle... wow

Frank test latest kernel on yours :)

Riccardo

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Frank Scheiner@21:1/5 to Riccardo Mottola on Wed Feb 2 15:30:01 2022

Hi Riccardo, all,

On 17.01.22 21:35, Riccardo Mottola wrote:

Hi,

Riccardo Mottola wrote:

John Paul Adrian Glaubitz wrote:

Not nice. I started compiling some stuff and the box froze, I connected
serial console and could not resume due to Fast Data Access MMU miss"

So, this crash occurs with the latest 5.15 kernel on your T2000?

exactly latest kernel.

I will retest it with stress-ng as soon as I finish this email and copy
the dmesg errors.

wow, running the test suite once or twice, I am able to have the system power-cycle... wow

Frank test latest kernel on yours :)

Yesterday I found the time to give Linux 5.15.0-3 a try on my T1000
(UltraSPARC T1) and V210 (US IIIi), but the boot issue is still there -
at least for my use case: The klibc based tools inside of the initramfs
are not able to mount the root FS over NFS (details further below).

But it's still good to see that mounting an on-disk root FS seems to
work now for your T2000, though the instabilities during runtime are not reassuring.

For me the last good Debian kernel - at least for booting, more on that
shortly - is 5.9.0-5. Both T1000 and V210 boot fine with it (incl.
mounting the root FS via NFS(v3 BTW)). But during operation (tested with
`apt upgrade` on a root FS replicated multiple times for testing from
the same tarball) the V210 crashes (=> kernel panic), the T1000 does
not. For the V210 I also see that for 5.8.0-3. Doing the same with
kernel 4.19.0-5 running on the V210, no problems are seen, not even the messages below.

The crash when running 5.9.0-5 or 5.8.0-3 is usually "announced" (or at
least accompanied) by one or more occurrence(s) of the following messages:
```
[...]
[ 360.489852] CPU[0]: Cheetah+ D-cache parity error at TPC[00000000005b28c8]
[ 360.580300] TPC<bpf_check+0x1f68/0x34e0>
[...]
```
...which should be familiar to UltraSPARC IIIi users with newer kernels
(see for example [1], which shows it for 4.16.x). According to [2] this
error should be recoverable (otherwise it would be followed by a panic
and "Irrecoverable Cheetah+ parity error."). Recovery does seem to happen,
until at some point it no longer does, but I don't see that second message,
so something else must be going on.

[1]: https://www.spinics.net/lists/sparclinux/msg21019.html

[2]: https://github.com/torvalds/linux/blob/master/arch/sparc/kernel/traps_64.c#L1767..L1799

Of course our CPU's caches don't go pop magically. There is something
broken in the newer kernels (> 4.19.x) for UltraSPARC IIIi (and most
likely all the other related processors, too), apart from the mounting
issues for NFS (see [3] for processors affected by this; update to that:
US II is not affected either).

[3]: https://lists.debian.org/debian-sparc/2021/12/msg00004.html

If I find the time and mood I'll try to bisect this US IIIi specific
issue in the hope that we will eventually get a fix for it, also still
hoping for a fix for [4].

[4]: https://lists.debian.org/debian-sparc/2021/03/msg00045.html

Cheers,
Frank

****

## T1000 ##

```
[...]
[ 0.000116] Linux version 5.15.0-3-sparc64-smp (debian-kernel@lists.debian.org) (gcc-11 (Debian 11.2.0-14) 11.2.0, GNU ld (GNU Binutils for Debian) 2.37.90.20220123) #1 SMP Debian 5.15.15-2 (2022-01-30)
[...]
[ 12.484314] tg3 0001:03:04.0 enP1p3s4f0: Link is up at 1000 Mbps, full duplex
[ 12.484520] tg3 0001:03:04.0 enP1p3s4f0: Flow control is on for TX and on for RX
[ 12.484689] IPv6: ADDRCONF(NETDEV_CHANGE): enP1p3s4f0: link becomes ready
[ 16.765173] Unable to handle kernel paging request at virtual address 0000612000000000
[ 16.765384] tsk->{mm,active_mm}->context = 000000000000006e
[ 16.765493] tsk->{mm,active_mm}->pgd = ffff800014af0000
[ 16.765650] \|/ ____ \|/
[ 16.765650] "@'/ .. \`@"
[ 16.765650] /_| \__/ |_\
[ 16.765650] \__U_/
[ 16.765975] nfsmount(374): Oops [#1]
[ 16.766167] CPU: 2 PID: 374 Comm: nfsmount Tainted: G E 5.15.0-3-sparc64-smp #1 Debian 5.15.15-2
[ 16.766345] TSTATE: 0000000011001607 TPC: 00000000006a5fe8 TNPC: 00000000006a5fec Y: 00000000 Tainted: G E
[ 16.766642] TPC: <kfree+0x48/0x2c0>
[ 16.766704] g0: ffff80000f2e7451 g1: 0000000400000000 g2: 0000600000000000 g3: ffff8001fd786000
[ 16.766802] g4: ffff800014245e80 g5: ffff8001fd786000 g6: ffff80000f2e4000 g7: ffff80000f2e7c30
[ 16.766983] o0: fffffffffffffffe o1: 00000000006fd714 o2: 0000000000002000 o3: ffff80000f2cbaf8
[ 16.767209] o4: 0000000000000008 o5: 0000000000000cc0 sp: ffff80000f2e7491 ret_pc: 00000000006fd6d4
[ 16.767292] RPC: <sys_mount+0x74/0x1a0>
[ 16.767456] l0: ffff800014398408 l1: ffff8001fedeaa00 l2: 0000000000422db4 l3: 0000000000201e00
[ 16.767591] l4: 000000000000029c l5: ffff80010000c1a0 l6: ffff80000f2e4000 l7: 00000000006fd660
[ 16.767771] i0: 0000000000000cc0 i1: 0000000000201ff0 i2: 0000000000000001 i3: ffff80000f2e7dd0
[ 16.767996] i4: 0000000000000000 i5: 0000612000000000 i6: ffff80000f2e7561 i7: 00000000006fd714
[ 16.768079] I7: <sys_mount+0xb4/0x1a0>
[ 16.768189] Call Trace:
[ 16.768326] [<00000000006fd714>] sys_mount+0xb4/0x1a0
[ 16.768456] [<00000000006fd6d4>] sys_mount+0x74/0x1a0
[ 16.768628] [<0000000000406274>] linux_sparc_syscall+0x34/0x44
[ 16.768856] Disabling lock debugging due to kernel taint
[ 16.768917] Caller[00000000006fd714]: sys_mount+0xb4/0x1a0
[ 16.769093] Caller[00000000006fd6d4]: sys_mount+0x74/0x1a0
[ 16.769316] Caller[0000000000406274]: linux_sparc_syscall+0x34/0x44
[ 16.769444] Caller[0000000000100a94]: 0x100a94
[ 16.769596] Instruction DUMP:
[ 16.769603] ba074001
[ 16.769693] bb2f7003
[ 16.769735] ba074002
[ 16.769775] <c25f6008>
[ 16.769865] 84086001
[ 16.770037] 82007fff
[ 16.770134] 8378841d
[ 16.770226] ba100001
[ 16.770315] c2586008
[ 16.770456]
Killed
Begin: Retrying nfs mount ...
[...]
```

## V210 ##

```
[...]
[ 0.000168] Linux version 5.15.0-3-sparc64-smp (debian-kernel@lists.debian.org) (gcc-11 (Debian 11.2.0-14) 11.2.0, GNU ld (GNU Binutils for Debian) 2.37.90.20220123) #1 SMP Debian 5.15.15-2 (2022-01-30)
[...]
[ 40.241993] tg3 0000:00:02.0 enp0s2f0: Link is up at 1000 Mbps, full duplex
[ 40.333591] tg3 0000:00:02.0 enp0s2f0: Flow control is on for TX and on for RX
[ 40.428669] IPv6: ADDRCONF(NETDEV_CHANGE): enp0s2f0: link becomes ready
[ 44.294909] FS-Cache: Loaded
[ 44.397657] RPC: Registered named UNIX socket transport module.
[ 44.475650] RPC: Registered udp transport module.
[ 44.537450] RPC: Registered tcp transport module.
[ 44.599295] RPC: Registered tcp NFSv4.1 backchannel transport module.
[ 44.815002] FS-Cache: Netfs 'nfs' registered for caching
mount: Invalid argument
Begin: Retrying nfs mount ... mount: Invalid argument
done.
[...]
```

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)
