• Kernel Panics / Re: GRUB testers on SPARC needed

    From John Paul Adrian Glaubitz@21:1/5 to Robin Cremer on Mon May 17 13:00:01 2021
    On 5/17/21 12:36 PM, Robin Cremer wrote:
    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/arch/sparc?id=e5e8b80d352ec999d2bba3ea584f541c83f4ca3f
    I'm using the latest version from the repositories:
    5.10.0-6-sparc64-smp #1 SMP Debian 5.10.28-1 (2021-04-09) sparc64 GNU/Linux
    The commit you mention seems to be in 5.12 and 5.13-rc2?
    Is there a pre-built SMP-image for this or do I have to set up building myself?

    The commit has been backported to the 5.10.x series which is an LTS kernel:

    https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/arch/sparc?h=linux-5.10.y&id=f2b38f03a3f71c30c77a4516b26c8bea13cc08ce
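
    To check whether a running kernel is at or past a given stable point release, the upstream
    version can be pulled out of the `uname -a` string quoted above. The Python sketch below is
    only an illustration: `upstream_version` and `at_least` are made-up helper names, and the
    threshold `(5, 10, 999)` is a deliberately fake placeholder, because the thread does not say
    which 5.10.x release first carried the backport. Look that up in the stable tree before
    relying on such a check.

```python
import re
from typing import Tuple

# Matches the Debian package version embedded in `uname -a` output,
# e.g. "... Debian 5.10.28-1 (2021-04-09) ..." yields 5, 10, 28.
_DEBIAN_VER = re.compile(r"Debian (\d+)\.(\d+)\.(\d+)-")


def upstream_version(uname_line: str) -> Tuple[int, ...]:
    """Return the upstream kernel version as a tuple of integers."""
    match = _DEBIAN_VER.search(uname_line)
    if match is None:
        raise ValueError("no Debian package version found in uname output")
    return tuple(int(part) for part in match.groups())


def at_least(uname_line: str, minimum: Tuple[int, ...]) -> bool:
    """True if the running kernel is at or past `minimum` (tuple comparison)."""
    return upstream_version(uname_line) >= minimum


# The uname string is taken from the thread; the threshold is a placeholder.
UNAME = "5.10.0-6-sparc64-smp #1 SMP Debian 5.10.28-1 (2021-04-09) sparc64 GNU/Linux"
PLACEHOLDER_FIXED_IN = (5, 10, 999)  # hypothetical, NOT the real release

print(upstream_version(UNAME))                 # (5, 10, 28)
print(at_least(UNAME, PLACEHOLDER_FIXED_IN))   # False
```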

    Running on the v215 was a real nightmare yesterday. They will be stable for hours, but
    certain actions (in one occurrence executing dmesg (!), apt-get installing nfs-common,
    some mounts systemd tries) crashed both of the boxes with various errors, most of them the
    dreaded line with "p

    Try to use a 4.19 kernel from snapshot.debian.org, these are known to run more stably on the
    older machines. Any machine older than T2 seems to have some issues with the more recent kernels.

    If you have the possibility to bisect the issue, that would be great. I do have such older machines
    myself but currently no possibility to set them up for bisecting.

    On the newer CPUs (>= SPARC T2), the kernel runs stable. Dave Miller (the Linux SPARC maintainer)
    uses a T5 himself for testing.

    Retrying the dpkg-reconfigure as well. After the systemd-unit for - I think - rpcbind was activated
    during package configuration, both boxes crashed about 4-6 times, I had to reset from OBP.
    After a few tries, installation went through, then. Now I can mount nfs...
    The hours I spent in the rescue mode of the current installation CD without any trouble made me suspect
    that non-SMP-kernels are "more stable". I'm currently running the SMP variant with "maxcpus=1", that
    seems stable so far... But as with any other sporadic issue, that's hard to tell without a way to reliably
    trigger the errors...

    I think these issues started to show up sometime after 4.19, but only on the older machines. So chances
    are you can bisect the issue.

    On a not entirely unrelated note:
    Is there any news on functioning netboot images? The last post I could find points to images from April '17
    on your webspace, which were, according to the ML, not bootable because of the size.

    I don't have the time and resources at the moment to work on netboot. I'm not just maintaining the sparc64
    port in Debian but also many of the other unofficial ports such as 32-bit PowerPC. So far, netboot has not
    been a top priority so I haven't worked on it yet.

    Additional support is always welcome.

    At least I can't boot them either.
    If there is no more recent version, I'll try to build something myself - are there any pointers on how to go
    about this? Minimal OS or the netinstaller in an .img would be preferred.

    You just have to build the debian-installer package on sparc64 using sbuild and as a result, you will get
    the tarball containing the netboot and cdrom images.

    I think that would help in quick testing, as I have multiple other systems with Cheetah (UltraSPARC III, III CU
    and IIIi) I'd like to try provoking the panics on. Also, some older (UltraSPARC IIi and IIe+) systems are waiting
    for recent Debian :-)

    For these machines, I would recommend installing a regular release, then downgrading the kernel to 4.19, then
    bisecting with a cross-compiled kernel if you have a working reproducer.
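
    Because the panics are sporadic, a single clean boot does not prove a kernel is good. The
    Python sketch below illustrates the idea under stated assumptions: if a bad kernel panics
    with probability p per attempt, it estimates how many clean attempts are needed before
    calling a kernel "good", and then runs a plain binary search over an ordered list. The
    version labels and the `is_bad_once` callback are hypothetical stand-ins; a real bisect
    would use `git bisect` on commits and boot each candidate on the machine.

```python
import math
from typing import Callable, Sequence


def runs_needed(p_fail_per_run: float, confidence: float = 0.95) -> int:
    """Clean runs required so that a bad kernel would slip through with
    probability at most (1 - confidence): solve (1 - p)^n <= 1 - confidence."""
    if not 0.0 < p_fail_per_run < 1.0:
        raise ValueError("p_fail_per_run must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail_per_run))


def bisect_first_bad(versions: Sequence[str],
                     is_bad_once: Callable[[str], bool],
                     runs: int) -> str:
    """Binary search for the first bad entry.

    Assumes versions[0] is known good and versions[-1] is known bad.
    A candidate counts as bad if any of `runs` attempts fails.
    """
    lo, hi = 0, len(versions) - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if any(is_bad_once(versions[mid]) for _ in range(runs)):
            hi = mid   # failure seen: first bad entry is at or before mid
        else:
            lo = mid   # all attempts clean: treat mid as good
    return versions[hi]


# Hypothetical labels and a deterministic fake reproducer, for illustration only.
candidates = ["4.19", "5.0", "5.4", "5.10"]
fake_bad = {"5.4", "5.10"}
print(runs_needed(0.5))   # 5
print(bisect_first_bad(candidates, lambda v: v in fake_bad, runs=runs_needed(0.5)))  # 5.4
```

    The lower the per-attempt panic rate, the more attempts each step needs, so a reproducer
    that triggers the panic quickly shortens the whole bisect considerably.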

    Adrian

    --
    .''`. John Paul Adrian Glaubitz
    : :' : Debian Developer - glaubitz@debian.org
    `. `' Freie Universitaet Berlin - glaubitz@physik.fu-berlin.de
    `- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Robin Cremer@21:1/5 to All on Mon May 17 12:40:01 2021
    Hi Adrian,
    for the sake of visibility, here is the response to the kernel trouble:

    Am 17.05.2021 um 10:23 schrieb John Paul Adrian Glaubitz:
    Installing on two SunFire v215 went reasonably well

    /- (apart from recurring Kernel Panics with "Unable to handle kernel paging request in mna handler",
    most often triggered on boot immediately after the systemd binfmt service tries to start. This seems
    to have been mentioned in /2020/04/msg00020.html but never pinpointed and fixed?) -/
    What kernel version are you running? There have actually been some fixes in this regard, in particular
    this fix:

    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/arch/sparc?id=e5e8b80d352ec999d2bba3ea584f541c83f4ca3f
    I'm using the latest version from the repositories:
    5.10.0-6-sparc64-smp #1 SMP Debian 5.10.28-1 (2021-04-09) sparc64
    GNU/Linux
    The commit you mention seems to be in 5.12 and 5.13-rc2?
    Is there a pre-built SMP-image for this or do I have to set up building
    myself?


    Running on the v215 was a real nightmare yesterday. They will be stable
    for hours, but certain actions (in one occurrence executing dmesg (!),
    apt-get installing nfs-common, some mounts systemd tries) crashed both
    of the boxes with various errors, most of them the dreaded line with "p

    Retrying the dpkg-reconfigure as well. After the systemd-unit for - I
    think - rpcbind was activated during package configuration, both boxes
    crashed about 4-6 times, I had to reset from OBP.
    After a few tries, installation went through, then. Now I can mount nfs...
    The hours I spent in the rescue mode of the current installation CD
    without any trouble made me suspect that non-SMP-kernels are "more
    stable". I'm currently running the SMP variant with "maxcpus=1", that
    seems stable so far... But as with any other sporadic issue, that's hard
    to tell without a way to reliably trigger the errors...

    The worst offender so far seems to be xfs, though...
    I initially installed both v215 with ext3 /boot and xfs for /.
    I'm not sure if the problem is related, but xfs seems to frequently
    encounter
    [   35.325122] XFS (md1): Metadata corruption detected at xfs_dinode_verify.part.0+0x358/0x6c0 [xfs], inode 0x402c4d0 dinode
    [   35.469639] XFS (md1): Unmount and run xfs_repair
    on both machines. xfs_repair doesn't do anything, though. Either these
    inodes were the last ones written during kernel panics, or the
    underlying issue of the panics leads to checksum mismatches in memory?
    The latter seems more likely, because during dpkg-installs the
    following popped up a few times as well:
    [  195.360257] XFS (md1): Corruption of in-memory data detected. Shutting down filesystem
    (after that, obviously, the system is unusable despite not panicking, as
    root is missing entirely...)

    Some faults:
    [  281.304119] WARNING: CPU: 1 PID: 11 at kernel/smp.c:633 smp_call_function_many_cond+0x3bc/0x3e0
    [  281.418696] Modules linked in: ext4(E) crc16(E) mbcache(E) jbd2(E) sr_mod(E) cdrom(E) ata_generic(E) tg3(E) libphy(E) ptp(E) ohci_pci(E) sg(E) pata_ali(E) ehci_pci(E) ohci_hcd(E) ehci_hcd(E) libata(E) pps_core(E) usbcore(E) usb_common(E) flash(E) drm(E) drm_panel_orientation_quirks(E) i2c_core(E) fuse(E) configfs(E) ip_tables(E) x_tables(E) autofs4(E) xfs(E) raid10(E) raid456(E) async_raid6_recov(E) async_memcpy(E) async_pq(E) raid6_pq(E) async_xor(E) xor(E) async_tx(E) libcrc32c(E) crc32c_generic(E) raid0(E) multipath(E) linear(E) raid1(E) md_mod(E) sd_mod(E) t10_pi(E) crc_t10dif(E) crct10dif_generic(E) crct10dif_common(E) mptsas(E) scsi_transport_sas(E) mptscsih(E) mptbase(E) scsi_mod(E)
    [  282.224447] CPU: 1 PID: 11 Comm: ksoftirqd/1 Tainted: G D     E     5.10.0-6-sparc64-smp #1 Debian 5.10.28-1
    [  282.359710] Call Trace:
    [  282.391788] [<000000000046c67c>] __warn+0xbc/0x120
    [  282.454810] [<0000000000c450f8>] warn_slowpath_fmt+0x34/0x74
    [  282.529285] [<0000000000517b5c>] smp_call_function_many_cond+0x3bc/0x3e0
    [  282.617512] [<0000000000517be4>] smp_call_function+0x24/0x40
    [  282.691989] [<0000000000441828>] smp_send_stop+0x28/0x120
    [  282.763028] [<0000000000c44e84>] panic+0x110/0x350
    [  282.826047] [<0000000000472ad0>] do_exit+0xad0/0xb20
    [  282.891357] [<0000000000c43ab0>] die_if_kernel+0x1f4/0x260
    [  282.963543] [<0000000000c5501c>] unhandled_fault+0x88/0xac
    [  283.035728] [<0000000000c553c8>] do_sparc64_fault+0x388/0xa80
    [  283.111354] [<0000000000407714>] sparc64_realfault_common+0x10/0x20
    [  283.193850] [<00000000005a1e64>] __bpf_prog_put_rcu+0x24/0x60
    [  283.269470] [<00000000004f5c20>] rcu_core+0x240/0x620
    [  283.335926] [<00000000004f600c>] rcu_core_si+0xc/0x20
    [  283.402383] [<0000000000c5602c>] __do_softirq+0x10c/0x3a0
    [  283.473423] [<0000000000473b14>] run_ksoftirqd+0x34/0x60
    [  283.543315] ---[ end trace 9f0a29fcdf85be47 ]---

    [  124.914048] CPU[1]: Cheetah+ D-cache parity error at TPC[00000000005bc2b0]
    [  125.004638] TPC<bpf_check+0x1cd0/0x32e0>
    nfs-utils.service is a disabled or a static unit, not starting it.
    [  125.528183] Kernel unaligned access at TPC[8ffba4] atomic64_sub_return+0x4/0x54
    [  125.624591] Unable to handle kernel paging request in mna handler
    [  125.624595]  at virtual address 6f430c861b2ffaab
    [  125.765686] current->{active_,}mm->context = 00000000000000c1
    [  125.841410] current->{active_,}mm->pgd = fff0000001b94000
    [  125.912544]               \|/ ____ \|/
    [  125.912544]               "@'/ .. \`@"
    [  125.912544]               /_| \__/ |_\
    [  125.912544]                  \__U_/
    [  126.106299] systemd(1): Oops [#1]

    Especially the "Unable to handle kernel paging request in mna handler"
    is interesting. It's nearly identical to the issue posted about a year
    ago, which was seemingly introduced somewhere around kernel 5.0.
    It's not always accompanied by the "Cheetah+ D-cache parity error at
    TPC[00000000005bc2b0]" error. And while CPU L1 cache parity errors
    "seem" like a hardware issue, I highly suspect it is not. Both machines
    show these errors sporadically (often without panicking!) and I found
    mention of these errors in other contexts...
    Also, both systems were highly stable in the past, exceeding 2-3 years
    of uptime, although on Solaris 10 :-(
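
    Since the faults come in several flavours and do not always appear together, it may help
    to tally them per boot when comparing kernels or machines. The Python sketch below scans
    console or dmesg text for the message strings quoted in this post and counts them. The
    signature names on the left (`mna_paging` and so on) are made-up labels, and the parser
    only assumes the usual `[ seconds.micros ]` timestamp prefix seen in the logs above.

```python
import re
from collections import Counter

# Label -> substring to look for. The substrings are copied from the logs in
# this thread; the labels themselves are arbitrary names for this sketch.
SIGNATURES = {
    "mna_paging": "Unable to handle kernel paging request in mna handler",
    "dcache_parity": "D-cache parity error",
    "unaligned_access": "Kernel unaligned access",
    "xfs_metadata": "Metadata corruption detected",
    "xfs_in_memory": "Corruption of in-memory data detected",
}

# "[  125.624591] message" -> timestamp and message; other lines are skipped.
LINE_RE = re.compile(r"^\s*\[\s*(\d+\.\d+)\]\s*(.*)$")


def classify(log_text: str) -> Counter:
    """Count how often each known fault signature appears in kernel log text."""
    counts: Counter = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # not a timestamped kernel line (e.g. systemd output)
        message = match.group(2)
        for label, needle in SIGNATURES.items():
            if needle in message:
                counts[label] += 1
    return counts


SAMPLE = """\
[  124.914048] CPU[1]: Cheetah+ D-cache parity error at TPC[00000000005bc2b0]
nfs-utils.service is a disabled or a static unit, not starting it.
[  125.528183] Kernel unaligned access at TPC[8ffba4] atomic64_sub_return+0x4/0x54
[  125.624591] Unable to handle kernel paging request in mna handler
"""

for label, count in sorted(classify(SAMPLE).items()):
    print(label, count)
```

    Feeding it one captured log per boot gives a rough picture of whether, for example, the
    parity errors really do show up without an accompanying panic.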


    On a not entirely unrelated note:
    Is there any news on functioning netboot images? The last post I could
    find points to images from April '17 on your webspace, which were,
    according to the ML, not bootable because of the size.
    At least I can't boot them either.
    If there is no more recent version, I'll try to build something myself -
    are there any pointers on how to go about this? Minimal OS or the
    netinstaller in an .img would be preferred.

    I think that would help in quick testing, as I have multiple other
    systems with Cheetah (UltraSPARC III, III CU and IIIi) I'd like to try provoking the panics on.
    Also, some older (UltraSPARC IIi and IIe+) systems are waiting for
    recent Debian :-)


    Thanks,

    - Robin

  • From Frank Scheiner@21:1/5 to Robin Cremer on Mon May 17 13:30:01 2021
    Hi Robin,

    On 17.05.21 12:36, Robin Cremer wrote:
    [...]
    On a not entirely unrelated note:
    Is there any news on functioning netboot images? The last post I could
    find points to images from April '17 on your webspace, which were,
    according to the ML, not bootable because of the size.
    At least I can't boot them either.

    Can't help you with any netboot images, though I assume the existing
    ones in the archives ([1], images are in `./installer-sparc64/20210415/images/netboot/netboot.tar.gz`) could
    actually work if they are loaded from GRUB and not from the OBP.

    [1]: http://ftp.ports.debian.org/debian-ports/pool-sparc64/main/d/debian-installer/

    If there is no more recent version, I'll try to build something myself -
    are there any pointers on how to go about this? Minimal OS or the netinstaller in an .img would be preferred.

    I described a possible setup to netboot with GRUB on [2]. The version
    of GRUB I use is old (`sparc64-ieee1275-2.02+dfsg1-18`) but works.
    This setup works in a similar or identical way on most of my machinery
    (ia64, amd64, powerpc, ppc64, sparc64) and for me GRUB is able to load
    "large" images (> 100 MiB, tested during my investigation on [3]) over
    the network w/o an issue. With a working sparc64 Debian installation it's
    even easier to set up. You can host everything needed (I recommend:
    dnsmasq (only for DNS), tftpd-hpa, isc-dhcp-server and possibly rarpd)
    on a Raspberry Pi for example. I assume you're already familiar with
    these services but just ask if you need some help with configuration.

    [2]: https://wiki.debian.org/Sparc64#Netbooting_with_GRUB2

    [3]: https://lists.debian.org/debian-sparc/2021/03/msg00045.html

    I think that would help in quick testing, as I have multiple other
    systems with Cheetah (UltraSPARC III, III CU and IIIi) I'd like to try provoking the panics on.
    Also, some older (UltraSPARC IIi and IIe+) systems are waiting for
    recent Debian :-)

    For testing multiple systems for anomalies I'd recommend netbooting with
    a Debian root FS provided by NFS, which saves you the time of installing
    Debian on every machine. There were two posts on the debian-sparc list
    recently ([4]) by Anatoly that mentioned the `stress-ng` tool, which
    might be helpful in provoking panics and possibly identifying the source
    of a panic.

    [4]: https://lists.debian.org/cgi-bin/search?P=stress-ng&DEFAULTOP=or&B=Gdebian-sparc&SORT=&HITSPERPAGE=10

    Cheers,
    Frank

  • From Robin Cremer@21:1/5 to All on Wed May 26 01:20:02 2021
    thanks for the information. I'm still battling a few issues, but the
    most prominent problem on my way to netboot-install is that I can't seem
    to get the apt/sources.list right for fetching sources (?):
    You just have to build the debian-installer package on sparc64 using sbuild and as a result, you will get
    the tarball containing the netboot and cdrom images.
    ...trying to get the build-deps for debian-installer complains that I
    should specify at least one deb-src line in my sources.list...
    I have the settings the installer configured ("deb-src http://deb.debian.org/debian-ports/ sid main"),
    but sid doesn't have a /main/source on debian-ports... "unreleased" does have a source folder,
    but apt seems to ignore the "unreleased" distribution for sources (maybe
    that's what the comment "unreleased does not support sources yet" means?)
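
    To see at a glance which source entries apt is actually being given, the one-line
    sources.list format can be parsed with a few lines of Python. This is only a sketch:
    `parse_sources` and `Entry` are made-up names, it handles only the classic one-line style
    (`deb-src [options] uri suite component...`), and it does not check whether the archive
    really serves a source index for the listed suite, which is the open question here.

```python
import re
from typing import List, NamedTuple, Tuple


class Entry(NamedTuple):
    kind: str                    # "deb" or "deb-src"
    uri: str
    suite: str
    components: Tuple[str, ...]


# type, optional "[options]" block, URI, suite, then zero or more components
ENTRY_RE = re.compile(r"^(deb-src|deb)\s+(?:\[[^\]]*\]\s+)?(\S+)\s+(\S+)\s*(.*)$")


def parse_sources(text: str) -> List[Entry]:
    """Parse one-line-style sources.list text, ignoring comments and blanks."""
    entries = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if not line:
            continue
        match = ENTRY_RE.match(line)
        if match:
            kind, uri, suite, comps = match.groups()
            entries.append(Entry(kind, uri, suite, tuple(comps.split())))
    return entries


# The active line is the one quoted in this post; the commented line is a
# hypothetical example to show that comments are skipped.
SAMPLE = """\
# deb-src http://example.invalid/ unreleased main
deb-src http://deb.debian.org/debian-ports/ sid main
"""

for entry in parse_sources(SAMPLE):
    if entry.kind == "deb-src":
        print(entry.uri, entry.suite, " ".join(entry.components))
```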

    Well... long story short: Can you point me in the right direction to get
    the sources the debian-ports use? Or is the normal way to pull sources
    from "normal" debian repositories?

    Thanks,
    Robin
