• with kernel 6.4 update, boot fails drops into maintenance

    From Riccardo Mottola@21:1/5 to All on Mon Jul 31 12:30:01 2023
    Hi,

    I have a Fire T2000 which has been stable for my usage for months.
    Now I upgraded kernel (and userland) in debian to:

    Yesterday, when hooking up a serial console, i was getting asked for
    root password and dropped into maintenance mode.
    Today I hooked up better through a console with minicom to be able to
    copy the error, but I get these errors again and again:


    [   17.134833] OOPS: RPC [000000000042ce10]
    [   17.134843] OOPS: RPC <arch_cpu_idle+0x70/0xc0>
    [   17.134875] OOPS: Fault was to vaddr[bc0]
    [   17.134885] CPU: 25 PID: 0 Comm: swapper/25 Tainted: G D           
    6.4.0-1-sparc64-smp #1  Debian 6.4.4-2
    [   17.134908] Call Trace:
    [   17.134915] [<0000000000de9614>] dump_stack+0x8/0x18
    [   17.134948] [<0000000000df547c>] do_sparc64_fault+0x4bc/0xb00
    [   17.134979] [<0000000000407714>] sparc64_realfault_common+0x10/0x20
    [   17.135017] [<0000000000000bc0>] 0xbc0
    [   17.135034] [<0000000000dea050>] default_idle_call+0x30/0x120
    [   17.135067] [<00000000004c32f8>] do_idle+0xd8/0x1a0
    [   17.135095] [<00000000004c3684>] cpu_startup_entry+0x24/0x40
    [   17.135123] [<0000000000440f28>] smp_callin+0xe8/0x120
    [   17.135145] [<00000000010d8c14>] 0x10d8c14
    [   17.135162] [<0000000000000000>] 0x0
    [   17.135179] Unable to handle kernel NULL pointer dereference
    [   17.135187] tsk->{mm,active_mm}->context = 0000000000000041
    [   17.135197] tsk->{mm,active_mm}->pgd = ffff800017270000
    [   17.135207]               \|/ ____ \|/
    [   17.135207]               "@'/ .. \`@"
    [   17.135207]               /_| \__/ |_\
    [   17.135207]                  \__U_/
    [   17.135216] swapper/25(0): Oops [#49]
    [   17.135230] CPU: 25 PID: 0 Comm: swapper/25 Tainted: G D           
    6.4.0-1-sparc64-smp #1  Debian 6.4.4-2
    [   17.135253] TSTATE: 0000004411001605


    so something worsened! darn! this looks as back when all kernels failed
    to boot on the Niagara.
    I set up minicom to log to a file and see see if I can catch something
    before the crashes.

    A reboot... shows this again:
      Booting `Debian GNU/Linux'

    Loading Linux 6.4.0-1-sparc64-smp ...
    Loading initial ramdisk ...

    /dev/sda2: clean, 429184/4276224 files, 8886587/17089844 blocks
    [   42.149954] /dev/sda: Can't open blockdev
    [FAILED] Failed to mount boot.mount - /boot.
    [DEPEND] Dependency failed for local-fs.target - Local File Systems.
    You are in emergency mode. After logging in, type "journalctl -xb" to view system logs, "systemctl reboot" to reboot, or "exit"
    to continue bootup.
    Give root password for maintenance
    (or press Control-D to continue):

    dmesg shows:
    [   42.149954] /dev/sda: Can't open blockdev
    [   42.745098] Kernel unaligned access at TPC[850ebc] aa_dfa_unpack+0x3c/0x5a0
    [   42.745167] Kernel unaligned access at TPC[850ed0] aa_dfa_unpack+0x50/0x5a0
    [   42.745207] Kernel unaligned access at TPC[850f68] aa_dfa_unpack+0xe8/0x5a0
    [   42.745254] Kernel unaligned access at TPC[850f68] aa_dfa_unpack+0xe8/0x5a0
    [   42.745298] Kernel unaligned access at TPC[850f68] aa_dfa_unpack+0xe8/0x5a0
    [   42.760418] audit: type=1400 audit(1690795782.060:2):
    apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=631 comm="apparmor_parser"
    [   42.760469] audit: type=1400 audit(1690795782.060:3):
    apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=631 comm="apparmor_"
    [   42.764485] audit: type=1400 audit(1690795782.064:4):
    apparmor="STATUS" operation="profile_load" profile="unconfined" name="lsb_release" pid=630 comm="apparmor_parser"
    [   42.773434] audit: type=1400 audit(1690795782.072:5):
    apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/man" pid=633 comm="apparmor_parser"
    [   42.773480] audit: type=1400 audit(1690795782.072:6):
    apparmor="STATUS" operation="profile_load" profile="unconfined" name="man_filter" pid=633 comm="apparmor_parser"
    [   42.773509] audit: type=1400 audit(1690795782.072:7):
    apparmor="STATUS" operation="profile_load" profile="unconfined" name="man_groff" pid=633 comm="apparmor_parser"
    [   42.904068] audit: type=1400 audit(1690795782.204:8):
    apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/NetworkManager/nm-dhcp-client.action""
    [   42.904121] audit: type=1400 audit(1690795782.204:9):
    apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/NetworkManager/nm-dhcp-helper" pid=63"
    [   42.904153] audit: type=1400 audit(1690795782.204:10):
    apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/connman/scripts/dhclient-script" pid"
    [   42.904183] audit: type=1400 audit(1690795782.204:11):
    apparmor="STATUS" operation="profile_load" profile="unconfined" name="/{,usr/}sbin/dhclient" pid=632 comm="apparmor"
    [   42.960067] RPC: Registered named UNIX socket transport module.
    [   42.960093] RPC: Registered udp transport module.
    [   42.960108] RPC: Registered tcp transport module.
    [   42.960121] RPC: Registered tcp NFSv4.1 backchannel transport module. [   43.603086] ------------[ cut here ]------------
    [   43.603111] WARNING: CPU: 4 PID: 668 at kernel/irq/msi.c:196 msi_domain_free_descs+0xf4/0x120
    [   43.603170] Modules linked in: sunrpc binfmt_misc libdes sg flash drm drm_panel_orientation_quirks i2c_core dm_mod fuse loop configfs
    ip_tables x_tables autofs4 ext4 crc1n
    [   43.603683] CPU: 4 PID: 668 Comm: ip Not tainted 6.4.0-1-sparc64-smp
    #1  Debian 6.4.4-2
    [   43.603715] Call Trace:
    [   43.603727] [<0000000000de9614>] dump_stack+0x8/0x18
    [   43.603771] [<000000000046e208>] __warn+0xc8/0x120
    [   43.603800] [<000000000046e2c0>] warn_slowpath_fmt+0x60/0x100
    [   43.603827] [<00000000004f96b4>] msi_domain_free_descs+0xf4/0x120
    [   43.603866] [<00000000004f9b60>]
    msi_domain_free_msi_descs_range+0x20/0x40
    [   43.603903] [<0000000000a1e760>] pci_msi_teardown_msi_irqs+0x40/0x60
    [   43.603952] [<0000000000a1e14c>] pci_free_msi_irqs+0xc/0x40
    [   43.603988] [<0000000000a1c390>] pci_disable_msi+0x50/0x80
    [   43.604021] [<000000001004b800>] e1000e_open+0x420/0x600 [e1000e]
    [   43.604191] [<0000000000b9262c>] __dev_open+0xec/0x1c0
    [   43.604232] [<0000000000b92b58>] __dev_change_flags+0x1b8/0x260
    [   43.604266] [<0000000000b92c1c>] dev_change_flags+0x1c/0x80
    [   43.604298] [<0000000000ba4720>] do_setlink+0x300/0x11e0
    [   43.604336] [<0000000000baadcc>] __rtnl_newlink+0x54c/0x880
    [   43.604374] [<0000000000bab134>] rtnl_newlink+0x34/0x60
    [   43.604407] [<0000000000ba3324>] rtnetlink_rcv_msg+0x244/0x340
    [   43.604441] ---[ end trace 0000000000000000 ]---
    [   45.928446] e1000e 0000:04:00.0 enp4s0f0: NIC Link is Up 1000 Mbps
    Full Duplex, Flow Control: Rx/Tx
    [   45.928724] IPv6: ADDRCONF(NETDEV_CHANGE): enp4s0f0: link becomes ready


    a single crash, once. Doesn't seem like the upper one.

    If I select my next kernel I have:
    Linux narya 6.1.0-9-sparc64-smp #1 SMP Debian 6.1.27-1 (2023-05-08) sparc64

    all still works perfectly fine!


    Riccardo

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Paul Adrian Glaubitz@21:1/5 to Riccardo Mottola on Mon Jul 31 12:50:01 2023
    Hi Riccardo!

    On Mon, 2023-07-31 at 12:21 +0200, Riccardo Mottola wrote:
    I have a Fire T2000 which has been stable for my usage for months.
    Now I upgraded kernel (and userland) in debian to:

    Yesterday, when hooking up a serial console, i was getting asked for
    root password and dropped into maintenance mode.
    Today I hooked up better through a console with minicom to be able to
    copy the error, but I get these errors again and again:


    [   17.134833] OOPS: RPC [000000000042ce10]
    [   17.134843] OOPS: RPC <arch_cpu_idle+0x70/0xc0>
    [   17.134875] OOPS: Fault was to vaddr[bc0]
    [   17.134885] CPU: 25 PID: 0 Comm: swapper/25 Tainted: G D           
    6.4.0-1-sparc64-smp #1  Debian 6.4.4-2
    [   17.134908] Call Trace:
    [   17.134915] [<0000000000de9614>] dump_stack+0x8/0x18
    [   17.134948] [<0000000000df547c>] do_sparc64_fault+0x4bc/0xb0
    0
    [   17.134979] [<0000000000407714>] sparc64_realfault_common+0x10/0x20 [   17.135017] [<0000000000000bc0>] 0xbc0
    [   17.135034] [<0000000000dea050>] default_idle_call+0x30/0x12
    0
    [   17.135067] [<00000000004c32f8>] do_idle+0xd8/0x1a0
    [   17.135095] [<00000000004c3684>] cpu_startup_entry+0x24/0x40
    [   17.135123] [<0000000000440f28>] smp_callin+0xe8/0x120
    [   17.135145] [<00000000010d8c14>] 0x10d8c14
    [   17.135162] [<0000000000000000>] 0x0
    [   17.135179] Unable to handle kernel NULL pointer dereference
    [   17.135187] tsk->{mm,active_mm}->context = 0000000000000041
    [   17.135197] tsk->{mm,active_mm}->pgd = ffff800017270000
    [   17.135207]               \|/ ____ \|/
    [   17.135207]               "@'/ .. \`@"
    [   17.135207]               /_| \__/ |_\
    [   17.135207]                  \__U_/
    [   17.135216] swapper/25(0): Oops [#49]
    [   17.135230] CPU: 25 PID: 0 Comm: swapper/25 Tainted: G D           
    6.4.0-1-sparc64-smp #1  Debian 6.4.4-2
    [   17.135253] TSTATE: 0000004411001605


    so something worsened! darn! this looks as back when all kernels failed
    to boot on the Niagara.
    I set up minicom to log to a file and see see if I can catch something before the crashes.

    Kernel regressions should be reported to the sparclinux kernel mailing list [1].

    Also, a bisect would be helpful to determine which commit broke the kernel.

    Thanks,
    Adrian

    [1] http://vger.kernel.org/vger-lists.html#sparclinux

    --
    .''`. John Paul Adrian Glaubitz
    : :' : Debian Developer
    `. `' Physicist
    `- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Riccardo Mottola@21:1/5 to John Paul Adrian Glaubitz on Mon Jul 31 14:50:01 2023
    Hi Adrian


    John Paul Adrian Glaubitz wrote:
    Kernel regressions should be reported to the sparclinux kernel mailing list [1].

    Also, a bisect would be helpful to determine which commit broke the kernel.

    after several reboots, I could not reproduce "that" looping crash
    anymore, but I always got into maintenance mode.
    Is that a kernel issue too? I don't know..- I'd like to understand what
    the issue is. I don't even know if it is a kernel issue, but I suppose so.
    It happened once also when booting a known good 6.1 kernel... during
    these tests. The sytem has been stable for long.

    I tried fetching from snapshots
    Linux narya 6.3.0-2-sparc64-smp #1 SMP Debian 6.3.11-1 (2023-07-01)
    sparc64 GNU/Linux

    And it too fails:
    /dev/sda2: clean, 432715/4276224 files, 8940034/17089844 blocks
    [   41.713577] /dev/sda: Can't open blockdev
    [FAILED] Failed to mount boot.mount - /boot.
    [DEPEND] Dependency failed for local-fs.target - Local File Systems.
    You are in emergency mode. After logging in, type "journalctl -xb" to view system logs, "systemctl reboot" to reboot, or "exit"
    to continue bootup.
    Give root password for maintenance


    this looks like failing to mount root, which is internal SCSI.

    The exception I see in dmesg is:
    [   43.220522] ------------[ cut here ]------------
    [   43.220545] WARNING: CPU: 25 PID: 666 at kernel/irq/msi.c:196 msi_domain_free_descs+0xf4/0x120
    [   43.220600] Modules linked in: sunrpc binfmt_misc libdes sg flash
    loop drm drm_panel_orientation_quirks i2c_core fuse dm_mod configfs
    ip_tables x_tables autofs4 ext4 crc1n
    [   43.221112] CPU: 25 PID: 666 Comm: ip Not tainted 6.3.0-2-sparc64-smp
    #1  Debian 6.3.11-1
    [   43.221140] Call Trace:
    [   43.221153] [<0000000000dda8cc>] dump_stack+0x8/0x18
    [   43.221194] [<000000000046e228>] __warn+0xc8/0x120
    [   43.221223] [<000000000046e2e0>] warn_slowpath_fmt+0x60/0x100
    [   43.221251] [<00000000004f8194>] msi_domain_free_descs+0xf4/0x120
    [   43.221287] [<00000000004f8640>]
    msi_domain_free_msi_descs_range+0x20/0x40
    [   43.221323] [<0000000000a13d20>] pci_msi_teardown_msi_irqs+0x40/0x60
    [   43.221357] [<0000000000a1372c>] pci_free_msi_irqs+0xc/0x40
    [   43.221385] [<0000000000a11950>] pci_disable_msi+0x50/0x80
    [   43.221413] [<000000001012fa00>] e1000e_open+0x420/0x600 [e1000e]
    [   43.221572] [<0000000000b871ec>] __dev_open+0xec/0x1c0
    [   43.221601] [<0000000000b87718>] __dev_change_flags+0x1b8/0x260
    [   43.221628] [<0000000000b877dc>] dev_change_flags+0x1c/0x80
    [   43.221654] [<0000000000b98d84>] do_setlink+0x304/0x1220
    [   43.221680] [<0000000000b9f4a8>] __rtnl_newlink+0x548/0x880
    [   43.221707] [<0000000000b9f814>] rtnl_newlink+0x34/0x60
    [   43.221732] [<0000000000b97884>] rtnetlink_rcv_msg+0x244/0x340
    [   43.221758] ---[ end trace 0000000000000000 ]---
    [   45.317910] e1000e 0000:04:00.0 enp4s0f0: NIC Link is Up 1000 Mbps
    Full Duplex, Flow Control: Rx/Tx

    which looks more related to ethernet, or do I misread it?

    Riccardo

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Paul Adrian Glaubitz@21:1/5 to All on Mon Jul 31 17:20:01 2023
    On Jul 31, 2023, at 2:46 PM, Riccardo Mottola <riccardo.mottola@libero.it> wrote:

    after several reboots, I could not reproduce "that" looping crash anymore, but I always got into maintenance mode.
    Is that a kernel issue too?

    Yes, see the other thread which also reports a regression on older SPARCs. There are no issues on the T5220 or SPARC-T4.

    Adrian

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)