Forum: >>> Magnum BBS <<<

with kernel 6.4 update, boot fails drops into maintenance

From Riccardo Mottola@21:1/5 to All on Mon Jul 31 12:30:01 2023

Hi,

I have a Fire T2000 which has been stable for my usage for months.
Now I upgraded kernel (and userland) in debian to:

Yesterday, when hooking up a serial console, i was getting asked for
root password and dropped into maintenance mode.
Today I hooked up better through a console with minicom to be able to
copy the error, but I get these errors again and again:

[   17.134833] OOPS: RPC [000000000042ce10]
[   17.134843] OOPS: RPC <arch_cpu_idle+0x70/0xc0>
[   17.134875] OOPS: Fault was to vaddr[bc0]
[   17.134885] CPU: 25 PID: 0 Comm: swapper/25 Tainted: G D
6.4.0-1-sparc64-smp #1 Debian 6.4.4-2
[   17.134908] Call Trace:
[   17.134915] [<0000000000de9614>] dump_stack+0x8/0x18
[   17.134948] [<0000000000df547c>] do_sparc64_fault+0x4bc/0xb00
[   17.134979] [<0000000000407714>] sparc64_realfault_common+0x10/0x20
[   17.135017] [<0000000000000bc0>] 0xbc0
[   17.135034] [<0000000000dea050>] default_idle_call+0x30/0x120
[   17.135067] [<00000000004c32f8>] do_idle+0xd8/0x1a0
[   17.135095] [<00000000004c3684>] cpu_startup_entry+0x24/0x40
[   17.135123] [<0000000000440f28>] smp_callin+0xe8/0x120
[   17.135145] [<00000000010d8c14>] 0x10d8c14
[   17.135162] [<0000000000000000>] 0x0
[   17.135179] Unable to handle kernel NULL pointer dereference
[   17.135187] tsk->{mm,active_mm}->context = 0000000000000041
[   17.135197] tsk->{mm,active_mm}->pgd = ffff800017270000
[   17.135207]               \|/ ____ \|/
[   17.135207]               "@'/ .. \`@"
[   17.135207]               /_| \__/ |_\
[   17.135207]                  \__U_/
[   17.135216] swapper/25(0): Oops [#49]
[   17.135230] CPU: 25 PID: 0 Comm: swapper/25 Tainted: G D
6.4.0-1-sparc64-smp #1 Debian 6.4.4-2
[   17.135253] TSTATE: 0000004411001605

so something worsened! darn! this looks as back when all kernels failed
to boot on the Niagara.
I set up minicom to log to a file and see see if I can catch something
before the crashes.

A reboot... shows this again:
Booting `Debian GNU/Linux'

Loading Linux 6.4.0-1-sparc64-smp ...
Loading initial ramdisk ...

/dev/sda2: clean, 429184/4276224 files, 8886587/17089844 blocks
[   42.149954] /dev/sda: Can't open blockdev
[FAILED] Failed to mount boot.mount - /boot.
[DEPEND] Dependency failed for local-fs.target - Local File Systems.
You are in emergency mode. After logging in, type "journalctl -xb" to view system logs, "systemctl reboot" to reboot, or "exit"
to continue bootup.
Give root password for maintenance
(or press Control-D to continue):

dmesg shows:
[   42.149954] /dev/sda: Can't open blockdev
[   42.745098] Kernel unaligned access at TPC[850ebc] aa_dfa_unpack+0x3c/0x5a0
[   42.745167] Kernel unaligned access at TPC[850ed0] aa_dfa_unpack+0x50/0x5a0
[   42.745207] Kernel unaligned access at TPC[850f68] aa_dfa_unpack+0xe8/0x5a0
[   42.745254] Kernel unaligned access at TPC[850f68] aa_dfa_unpack+0xe8/0x5a0
[   42.745298] Kernel unaligned access at TPC[850f68] aa_dfa_unpack+0xe8/0x5a0
[   42.760418] audit: type=1400 audit(1690795782.060:2):
apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=631 comm="apparmor_parser"
[   42.760469] audit: type=1400 audit(1690795782.060:3):
apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=631 comm="apparmor_"
[   42.764485] audit: type=1400 audit(1690795782.064:4):
apparmor="STATUS" operation="profile_load" profile="unconfined" name="lsb_release" pid=630 comm="apparmor_parser"
[   42.773434] audit: type=1400 audit(1690795782.072:5):
apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/man" pid=633 comm="apparmor_parser"
[   42.773480] audit: type=1400 audit(1690795782.072:6):
apparmor="STATUS" operation="profile_load" profile="unconfined" name="man_filter" pid=633 comm="apparmor_parser"
[   42.773509] audit: type=1400 audit(1690795782.072:7):
apparmor="STATUS" operation="profile_load" profile="unconfined" name="man_groff" pid=633 comm="apparmor_parser"
[   42.904068] audit: type=1400 audit(1690795782.204:8):
apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/NetworkManager/nm-dhcp-client.action""
[   42.904121] audit: type=1400 audit(1690795782.204:9):
apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/NetworkManager/nm-dhcp-helper" pid=63"
[   42.904153] audit: type=1400 audit(1690795782.204:10):
apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/connman/scripts/dhclient-script" pid"
[   42.904183] audit: type=1400 audit(1690795782.204:11):
apparmor="STATUS" operation="profile_load" profile="unconfined" name="/{,usr/}sbin/dhclient" pid=632 comm="apparmor"
[   42.960067] RPC: Registered named UNIX socket transport module.
[   42.960093] RPC: Registered udp transport module.
[   42.960108] RPC: Registered tcp transport module.
[   42.960121] RPC: Registered tcp NFSv4.1 backchannel transport module. [   43.603086] ------------[ cut here ]------------
[   43.603111] WARNING: CPU: 4 PID: 668 at kernel/irq/msi.c:196 msi_domain_free_descs+0xf4/0x120
[   43.603170] Modules linked in: sunrpc binfmt_misc libdes sg flash drm drm_panel_orientation_quirks i2c_core dm_mod fuse loop configfs
ip_tables x_tables autofs4 ext4 crc1n
[   43.603683] CPU: 4 PID: 668 Comm: ip Not tainted 6.4.0-1-sparc64-smp
#1 Debian 6.4.4-2
[   43.603715] Call Trace:
[   43.603727] [<0000000000de9614>] dump_stack+0x8/0x18
[   43.603771] [<000000000046e208>] __warn+0xc8/0x120
[   43.603800] [<000000000046e2c0>] warn_slowpath_fmt+0x60/0x100
[   43.603827] [<00000000004f96b4>] msi_domain_free_descs+0xf4/0x120
[   43.603866] [<00000000004f9b60>]
msi_domain_free_msi_descs_range+0x20/0x40
[   43.603903] [<0000000000a1e760>] pci_msi_teardown_msi_irqs+0x40/0x60
[   43.603952] [<0000000000a1e14c>] pci_free_msi_irqs+0xc/0x40
[   43.603988] [<0000000000a1c390>] pci_disable_msi+0x50/0x80
[   43.604021] [<000000001004b800>] e1000e_open+0x420/0x600 [e1000e]
[   43.604191] [<0000000000b9262c>] __dev_open+0xec/0x1c0
[   43.604232] [<0000000000b92b58>] __dev_change_flags+0x1b8/0x260
[   43.604266] [<0000000000b92c1c>] dev_change_flags+0x1c/0x80
[   43.604298] [<0000000000ba4720>] do_setlink+0x300/0x11e0
[   43.604336] [<0000000000baadcc>] __rtnl_newlink+0x54c/0x880
[   43.604374] [<0000000000bab134>] rtnl_newlink+0x34/0x60
[   43.604407] [<0000000000ba3324>] rtnetlink_rcv_msg+0x244/0x340
[   43.604441] ---[ end trace 0000000000000000 ]---
[   45.928446] e1000e 0000:04:00.0 enp4s0f0: NIC Link is Up 1000 Mbps
Full Duplex, Flow Control: Rx/Tx
[   45.928724] IPv6: ADDRCONF(NETDEV_CHANGE): enp4s0f0: link becomes ready

a single crash, once. Doesn't seem like the upper one.

If I select my next kernel I have:
Linux narya 6.1.0-9-sparc64-smp #1 SMP Debian 6.1.27-1 (2023-05-08) sparc64

all still works perfectly fine!

Riccardo

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Paul Adrian Glaubitz@21:1/5 to Riccardo Mottola on Mon Jul 31 12:50:01 2023

Hi Riccardo!

On Mon, 2023-07-31 at 12:21 +0200, Riccardo Mottola wrote:

I have a Fire T2000 which has been stable for my usage for months.
Now I upgraded kernel (and userland) in debian to:

Yesterday, when hooking up a serial console, i was getting asked for
root password and dropped into maintenance mode.
Today I hooked up better through a console with minicom to be able to
copy the error, but I get these errors again and again:

[   17.134833] OOPS: RPC [000000000042ce10]
[   17.134843] OOPS: RPC <arch_cpu_idle+0x70/0xc0>
[   17.134875] OOPS: Fault was to vaddr[bc0]
[   17.134885] CPU: 25 PID: 0 Comm: swapper/25 Tainted: G D
6.4.0-1-sparc64-smp #1 Debian 6.4.4-2
[   17.134908] Call Trace:
[   17.134915] [<0000000000de9614>] dump_stack+0x8/0x18
[   17.134948] [<0000000000df547c>] do_sparc64_fault+0x4bc/0xb0

0

[   17.134979] [<0000000000407714>] sparc64_realfault_common+0x10/0x20 [   17.135017] [<0000000000000bc0>] 0xbc0
[   17.135034] [<0000000000dea050>] default_idle_call+0x30/0x12

0

[   17.135067] [<00000000004c32f8>] do_idle+0xd8/0x1a0
[   17.135095] [<00000000004c3684>] cpu_startup_entry+0x24/0x40
[   17.135123] [<0000000000440f28>] smp_callin+0xe8/0x120
[   17.135145] [<00000000010d8c14>] 0x10d8c14
[   17.135162] [<0000000000000000>] 0x0
[   17.135179] Unable to handle kernel NULL pointer dereference
[   17.135187] tsk->{mm,active_mm}->context = 0000000000000041
[   17.135197] tsk->{mm,active_mm}->pgd = ffff800017270000
[   17.135207]               \|/ ____ \|/
[   17.135207]               "@'/ .. \`@"
[   17.135207]               /_| \__/ |_\
[   17.135207]                  \__U_/
[   17.135216] swapper/25(0): Oops [#49]
[   17.135230] CPU: 25 PID: 0 Comm: swapper/25 Tainted: G D
6.4.0-1-sparc64-smp #1 Debian 6.4.4-2
[   17.135253] TSTATE: 0000004411001605

so something worsened! darn! this looks as back when all kernels failed
to boot on the Niagara.
I set up minicom to log to a file and see see if I can catch something before the crashes.

Kernel regressions should be reported to the sparclinux kernel mailing list [1].

Also, a bisect would be helpful to determine which commit broke the kernel.

Thanks,
Adrian

[1] http://vger.kernel.org/vger-lists.html#sparclinux

--
.''`. John Paul Adrian Glaubitz
: :' : Debian Developer
`. `' Physicist
`- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Riccardo Mottola@21:1/5 to John Paul Adrian Glaubitz on Mon Jul 31 14:50:01 2023

Hi Adrian

John Paul Adrian Glaubitz wrote:

Kernel regressions should be reported to the sparclinux kernel mailing list [1].

Also, a bisect would be helpful to determine which commit broke the kernel.

after several reboots, I could not reproduce "that" looping crash
anymore, but I always got into maintenance mode.
Is that a kernel issue too? I don't know..- I'd like to understand what
the issue is. I don't even know if it is a kernel issue, but I suppose so.
It happened once also when booting a known good 6.1 kernel... during
these tests. The sytem has been stable for long.

I tried fetching from snapshots
Linux narya 6.3.0-2-sparc64-smp #1 SMP Debian 6.3.11-1 (2023-07-01)
sparc64 GNU/Linux

And it too fails:
/dev/sda2: clean, 432715/4276224 files, 8940034/17089844 blocks
[   41.713577] /dev/sda: Can't open blockdev
[FAILED] Failed to mount boot.mount - /boot.
[DEPEND] Dependency failed for local-fs.target - Local File Systems.
You are in emergency mode. After logging in, type "journalctl -xb" to view system logs, "systemctl reboot" to reboot, or "exit"
to continue bootup.
Give root password for maintenance

this looks like failing to mount root, which is internal SCSI.

The exception I see in dmesg is:
[   43.220522] ------------[ cut here ]------------
[   43.220545] WARNING: CPU: 25 PID: 666 at kernel/irq/msi.c:196 msi_domain_free_descs+0xf4/0x120
[   43.220600] Modules linked in: sunrpc binfmt_misc libdes sg flash
loop drm drm_panel_orientation_quirks i2c_core fuse dm_mod configfs
ip_tables x_tables autofs4 ext4 crc1n
[   43.221112] CPU: 25 PID: 666 Comm: ip Not tainted 6.3.0-2-sparc64-smp
#1 Debian 6.3.11-1
[   43.221140] Call Trace:
[   43.221153] [<0000000000dda8cc>] dump_stack+0x8/0x18
[   43.221194] [<000000000046e228>] __warn+0xc8/0x120
[   43.221223] [<000000000046e2e0>] warn_slowpath_fmt+0x60/0x100
[   43.221251] [<00000000004f8194>] msi_domain_free_descs+0xf4/0x120
[   43.221287] [<00000000004f8640>]
msi_domain_free_msi_descs_range+0x20/0x40
[   43.221323] [<0000000000a13d20>] pci_msi_teardown_msi_irqs+0x40/0x60
[   43.221357] [<0000000000a1372c>] pci_free_msi_irqs+0xc/0x40
[   43.221385] [<0000000000a11950>] pci_disable_msi+0x50/0x80
[   43.221413] [<000000001012fa00>] e1000e_open+0x420/0x600 [e1000e]
[   43.221572] [<0000000000b871ec>] __dev_open+0xec/0x1c0
[   43.221601] [<0000000000b87718>] __dev_change_flags+0x1b8/0x260
[   43.221628] [<0000000000b877dc>] dev_change_flags+0x1c/0x80
[   43.221654] [<0000000000b98d84>] do_setlink+0x304/0x1220
[   43.221680] [<0000000000b9f4a8>] __rtnl_newlink+0x548/0x880
[   43.221707] [<0000000000b9f814>] rtnl_newlink+0x34/0x60
[   43.221732] [<0000000000b97884>] rtnetlink_rcv_msg+0x244/0x340
[   43.221758] ---[ end trace 0000000000000000 ]---
[   45.317910] e1000e 0000:04:00.0 enp4s0f0: NIC Link is Up 1000 Mbps
Full Duplex, Flow Control: Rx/Tx

which looks more related to ethernet, or do I misread it?

Riccardo

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Paul Adrian Glaubitz@21:1/5 to All on Mon Jul 31 17:20:01 2023

On Jul 31, 2023, at 2:46 PM, Riccardo Mottola <riccardo.mottola@libero.it> wrote:

after several reboots, I could not reproduce "that" looping crash anymore, but I always got into maintenance mode.
Is that a kernel issue too?

Yes, see the other thread which also reports a regression on older SPARCs. There are no issues on the T5220 or SPARC-T4.

Adrian

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

Who's Online
Recent Visitors
- Keyop
  Sun Apr 28 20:37:53 2024
  from Huddersfield, West Yorkshire via SSH
- Keyop
  Sun Apr 28 20:37:37 2024
  from Huddersfield, West Yorkshire via SSH
- Keyop
  Sun Apr 28 20:30:04 2024
  from Huddersfield, West Yorkshire via SSH
- Bob Worm
  Mon Apr 29 09:04:47 2024
  from Wales, Uk via Telnet

System Info

Sysop:	Keyop
Location:	Huddersfield, West Yorkshire, UK
Users:	297
Nodes:	16 (2 / 14)
Uptime:	19:50:54
Calls:	6,667
Calls today:	1
Files:	12,216
Messages:	5,337,043

with kernel 6.4 update, boot fails drops into maintenance

Who's Online

Recent Visitors

System Info