• ia64 port needs help in order to be saved

    From John Paul Adrian Glaubitz@21:1/5 to All on Sun Aug 2 08:10:01 2020
    Hello!

    A few days ago, I had to switch off titanium because the blade was taking up too much space in a blade center at the university which was needed by the admins now.

    Currently, we have only one build server, namely an RX2600, which is building packages. To resolve this, Anatoly (CC'ed) bought a used HP
    BL870 i4 blade with enough CPU power.

    Unfortunately, this blade has the same architecture as the RX2800 and won't properly boot due to this bug in the DMA code [1] for which there is no fix yet, see also [2].

    I spent a lot of time getting Anatoly's new blade installed yesterday but eventually gave up. While I was able to install Debian Squeeze (Wheezy
    wouldn't work either, kernel just reboots akin to the ia64 gcc bug), the installed system won't boot as the hpsa driver wouldn't load.

    Also, in order to successfully boot the Squeeze kernel, I had to pass "intel_iommu=off" on the command line.
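Whether a boot parameter such as `intel_iommu=off` actually reached the kernel can be checked on the running system. The following is a minimal Python sketch, assuming the usual space-separated `key=value` layout of the kernel command line (as exposed in `/proc/cmdline` on Linux). The helper name `kernel_param` and the sample command line are made up for illustration, and quoted values containing spaces are not handled.

```python
def kernel_param(cmdline, key):
    """Return the value of `key` on a kernel command line, or None if absent.

    A bare flag without '=' yields an empty string. If the key appears
    more than once, this sketch returns the last occurrence.
    """
    value = None
    for token in cmdline.split():
        name, _, val = token.partition("=")
        if name == key:
            value = val
    return value


# Hypothetical command line for illustration; on a live system the string
# would be read from /proc/cmdline instead.
sample = "BOOT_IMAGE=/vmlinuz root=/dev/sda2 ro intel_iommu=off"
print(kernel_param(sample, "intel_iommu"))
```

Keeping the parser separate from the file read makes it easy to try against command lines copied from a boot loader configuration.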

    The issue with the hpsa driver might be this bug [3] which I was able to
    fix on an RX2600 we have at SUSE by upgrading the firmware of the RAID controller.

    We might be able to boot the system with the latest LTS kernel of the 4.14 series which should not have Christoph's patch [1] yet but maybe already the fix for the hpsa driver [3].

    Upstream kernel 4.14.0 had the problem from [1] as well (so maybe the bisect was wrong?), however kernel 4.14.83 from Gentoo works (which is upstream version
    4.14.83 with some patches) although the binary kernel that Frank gave me is missing
    the SCSI modules for block devices - although the hpsa module was present.

    In any case, I spent too much time on this issue the past two days, so I need to step back to keep my sanity. If someone else wants to have a go at it,
    there should now be enough information to do so.

    Thanks,
    Adrian

    [1] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=543cea9accd9804307541cb93d3ed7ec94b07237
    [2] https://marc.info/?l=linux-ia64&m=156144480821712&w=2
    [3] https://bugzilla.redhat.com/show_bug.cgi?id=1557655

    --
    .''`. John Paul Adrian Glaubitz
    : :' : Debian Developer - glaubitz@debian.org
    `. `' Freie Universitaet Berlin - glaubitz@physik.fu-berlin.de
    `- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Paul Adrian Glaubitz@21:1/5 to John Paul Adrian Glaubitz on Tue Aug 4 11:00:01 2020
    Hi!

    On 8/2/20 8:05 AM, John Paul Adrian Glaubitz wrote:
    Unfortunately, this blade has the same architecture as the RX2800 and won't properly boot due to this bug in the DMA code [1] for which there is no fix yet, see also [2].

    I spent a lot of time getting Anatoly's new blade installed yesterday but eventually gave up. While I was able to install Debian Squeeze (Wheezy wouldn't work either, kernel just reboots akin to the ia64 gcc bug), the installed system won't boot as the hpsa driver wouldn't load.

    Finally got it working with the 4.14.83 kernel plus the Gentoo ptrace patch:

    Linux lenz 4.14.83-00001-g6ef2496425e7 #1 SMP Sun Aug 2 19:31:30 CEST 2020 ia64

    The programs included with the Debian GNU/Linux system are free software;
    the exact distribution terms for each program are described in the
    individual files in /usr/share/doc/*/copyright.

    Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
    permitted by applicable law.
    root@lenz:~# uname -a
    Linux lenz 4.14.83-00001-g6ef2496425e7 #1 SMP Sun Aug 2 19:31:30 CEST 2020 ia64 GNU/Linux
    root@lenz:~#

    The machine is running Squeeze now. I will install unstable using debootstrap on the second disk later today and then the buildd is finally up and running again.

    Adrian

  • From Frank Scheiner@21:1/5 to John Paul Adrian Glaubitz on Tue Aug 4 12:50:02 2020
    On 04.08.20 10:53, John Paul Adrian Glaubitz wrote:
    Hi!

    On 8/2/20 8:05 AM, John Paul Adrian Glaubitz wrote:
    Unfortunately, this blade has the same architecture as the RX2800 and won't properly boot due to this bug in the DMA code [1] for which there is no fix yet, see also [2].

    I spent a lot of time getting Anatoly's new blade installed yesterday but
    eventually gave up. While I was able to install Debian Squeeze (Wheezy
    wouldn't work either, kernel just reboots akin to the ia64 gcc bug), the
    installed system won't boot as the hpsa driver wouldn't load.

    Finally got it working with the 4.14.83 kernel plus the Gentoo ptrace patch:

    Linux lenz 4.14.83-00001-g6ef2496425e7 #1 SMP Sun Aug 2 19:31:30 CEST 2020 ia64

    The programs included with the Debian GNU/Linux system are free software;
    the exact distribution terms for each program are described in the
    individual files in /usr/share/doc/*/copyright.

    Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
    permitted by applicable law.
    root@lenz:~# uname -a
    Linux lenz 4.14.83-00001-g6ef2496425e7 #1 SMP Sun Aug 2 19:31:30 CEST 2020 ia64 GNU/Linux
    root@lenz:~#

    The machine is running Squeeze now. I will install unstable using debootstrap on the second disk later today and then the buildd is finally up and running again.

    Great work and great news! So did you also enable CONFIG_ZONE_DMA32
    for this kernel, or does it just have the Gentoo ptrace patch?

    Cheers,
    Frank

  • From John Paul Adrian Glaubitz@21:1/5 to Frank Scheiner on Tue Aug 4 13:30:02 2020
    Hello!

    On 8/4/20 12:40 PM, Frank Scheiner wrote:
    The machine is running Squeeze now. I will install unstable using debootstrap
    on the second disk later today and then the buildd is finally up and running again.

    Great work and great news! So did you also enable CONFIG_ZONE_DMA32
    for this kernel, or does it just have the Gentoo ptrace patch?

    There is actually no CONFIG_ZONE_DMA32 for ia64, just CONFIG_ZONE_DMA and
    that is set.

    The kernel has this configuration from Sergei [1] and the ptrace patch from
    the Gentoo kernel, otherwise it's a vanilla upstream kernel 4.14.83.

    I cannot say yet which particular change fixes the problem, but I'm confident
    I will be able to figure that out. If you want to test yourself, you may
    try 4.14.83 from the stable branch [2] with Sergei's configuration but without the ptrace patch.

    I will get the machine up and running first, so that we can resume building packages. Anatoly said the blade is a dual blade, so I might be able to
    perform the kernel tests on the second blade.

    My wild guess is that it's the ptrace patch that fixes the problem.

    Adrian

    [1] https://dev.gentoo.org/~slyfox/config-4.19.86-gentoo
    [2] git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git

  • From Frank Scheiner@21:1/5 to John Paul Adrian Glaubitz on Tue Aug 4 15:20:01 2020
    Hi,

    On 04.08.20 13:20, John Paul Adrian Glaubitz wrote:
    On 8/4/20 12:40 PM, Frank Scheiner wrote:
    The machine is running Squeeze now. I will install unstable using debootstrap
    on the second disk later today and then the buildd is finally up and running
    again.

    Great work and great news! So did you also enable CONFIG_ZONE_DMA32
    for this kernel, or does it just have the Gentoo ptrace patch?

    There is actually no CONFIG_ZONE_DMA32 for ia64, just CONFIG_ZONE_DMA and that is set.

    [3] says it's CONFIG_ZONE_DMA for ia64 until Linux 4.15 and
    CONFIG_ZONE_DMA32 for later kernel versions. I couldn't find a control
    from `menuconfig` where I could disable that, not sure why, maybe it's implicit.

    [3]: https://cateee.net/lkddb/web-lkddb/ZONE_DMA.html

    I also saw that CONFIG_ZONE_DMA32 actually is enabled in Linux 4.19.37
    from Debian and also the current 5.7.x from Debian for ia64. So in this configuration option Gentoo and Debian kernels seem not to differ. I
    couldn't find a 4.14.x ia64 kernel image.
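Comparing two kernel configurations on a single option, as described above, can be done by reading the `.config` files directly (Debian installs them as `/boot/config-<version>`). A minimal Python sketch, assuming the standard `.config` layout in which enabled options appear as `CONFIG_NAME=value` and disabled ones as `# CONFIG_NAME is not set` comments. The function name `enabled_symbols` and the sample fragment are illustrative and not taken from any of the kernels discussed here.

```python
def enabled_symbols(config_text, names):
    """Map each name in `names` to its value in a kernel .config, or None.

    Lines such as '# CONFIG_FOO is not set' are comments and therefore
    count as unset.
    """
    found = {}
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep:
            found[name] = value
    return {name: found.get(name) for name in names}


# Made-up fragment for illustration, not copied from a real kernel config.
sample = """\
CONFIG_ZONE_DMA=y
# CONFIG_ZONE_DMA32 is not set
"""
print(enabled_symbols(sample, ["CONFIG_ZONE_DMA", "CONFIG_ZONE_DMA32"]))
```

Running the same function over the Gentoo and Debian configuration files would show whether they differ on the two options in question.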

    The kernel has this configuration from Sergei [1] and the ptrace patch from the Gentoo kernel, otherwise it's a vanilla upstream kernel 4.14.83.

    I cannot say yet which particular change fixes the problem, but I'm confident I will be able to figure that out. If you want to test yourself, you may
    try 4.14.83 from the stable branch [2] with Sergei's configuration but without
    the ptrace patch.

    Yeah, I'll give that a try.

    I will get the machine up and running first, so that we can resume building packages. Anatoly said the blade is a dual blade, so I might be able to perform the kernel tests on the second blade.

    Oh, that would be cool if a dual blade could also be used as two
    separate machines.

    My wild guess is that it's the ptrace patch that fixes the problem.

    Yeah, I think so, too, but will check that now, to be sure.

    Cheers,
    Frank


    Adrian

    [1] https://dev.gentoo.org/~slyfox/config-4.19.86-gentoo
    [2] git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git


  • From Frank Scheiner@21:1/5 to Frank Scheiner on Tue Aug 4 21:10:02 2020
    On 04.08.20 15:12, Frank Scheiner wrote:
    On 04.08.20 13:20, John Paul Adrian Glaubitz wrote:
    The kernel has this configuration from Sergei [1] and the ptrace patch from
    the Gentoo kernel, otherwise it's a vanilla upstream kernel 4.14.83.

    I cannot say yet which particular change fixes the problem, but I'm confident
    I will be able to figure that out. If you want to test yourself, you may
    try 4.14.83 from the stable branch [2] with Sergei's configuration but without
    the ptrace patch.

    Yeah, I'll give that a try.

    Ok, I have a few results:

    All kernels used Sergei's configuration adapted to enable network boot.

    * 4.14.83 vanilla w/o ptrace patch and compiled with "gcc (Gentoo
    7.3.0-r3 p1.4) 7.3.0" leads to a reboot after kernel and initramfs are
    loaded:

    ```
    ELILO v3.16 for EFI/IA-64
    ..
    Uncompressing Linux... done
    Loading file AC10027B.initrd.img...done
    1,0,0,0 5400006301E10000 0000000000000000 EVN_BOOT_START

    ***********************************************************
    * ROM Version : 01.93
    * ROM Date : Wed Sep 12 22:10:03 PDT 2012
    ***********************************************************
    ```

    Which makes sense, as [4] explicitly speaks of issues with gcc 6.4.0,
    7.2.0 and the latest gcc 8 at that time which are worked around by the
    patch.

    [4]: https://lore.kernel.org/patchwork/comment/1079097/

    * 4.14.83 vanilla w/o ptrace patch and compiled with "gcc (Debian
    10.2.0-3) 10.2.0" works and boots Debian GNU/Linux Sid successfully to
    the login prompt on my rx2800 i2 and also runs stable enough to
    recompile itself:

    ```
    ELILO v3.16 for EFI/IA-64
    ..
    Uncompressing Linux... done
    Loading file AC10027B.initrd.img...done
    [ 0.000000] Linux version 4.14.83vanilla (root@rx2800-i2) (gcc
    version 10.2.0 (Debian 10.2.0-3)) #1 SMP Tue Aug 4 15:56:40 UTC 2020
    [...]
    Debian GNU/Linux bullseye/sid rx2800-i2 ttyS1

    rx2800-i2 login:
    ```

    * The same holds true for 4.14.191 vanilla w/o ptrace patch and compiled
    with the same gcc.

    So latest 4.14.x kernels should work w/o the patch on a rx2800 i2 with
    Itanium 9300 (Tukwila) series processors - if they are compiled with gcc 10.
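Since the outcome here depends on which compiler built the kernel, it helps to read the compiler version straight from the boot banner. A minimal Python sketch, assuming the banner layout shown in the boot log above (`Linux version <release> (<builder>) (gcc version <x.y.z> ...)`); other kernel versions may format the compiler string differently, in which case the function returns `None`. The name `kernel_and_gcc` is made up for illustration.

```python
import re

# Matches the release string and the gcc version in a boot banner of the
# form quoted in the log above.
BANNER_RE = re.compile(r"Linux version (\S+) .*?gcc version (\d+(?:\.\d+)*)")


def kernel_and_gcc(banner):
    """Return (kernel_release, gcc_version) from a boot banner, or None."""
    match = BANNER_RE.search(banner)
    if match is None:
        return None
    return match.group(1), match.group(2)


# Banner taken from the boot log above, joined into a single line.
banner = (
    "Linux version 4.14.83vanilla (root@rx2800-i2) "
    "(gcc version 10.2.0 (Debian 10.2.0-3)) #1 SMP Tue Aug 4 15:56:40 UTC 2020"
)
print(kernel_and_gcc(banner))
```

On a running system the same banner is available from `/proc/version`.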

    My wild guess is that it's the ptrace patch that fixes the problem.

    Yeah, I think so, too, but will check that now, to be sure.

    I haven't tested anything newer than the latest 4.14.x and also no
    kernels with Debian patches, but maybe the newer gcc will also make a difference for later kernel versions.

    Cheers,
    Frank

  • From John Paul Adrian Glaubitz@21:1/5 to Frank Scheiner on Wed Aug 5 15:00:02 2020
    Hello Frank!

    On 8/4/20 9:04 PM, Frank Scheiner wrote:
    All kernels used Sergei's configuration adapted to enable network boot.

    * 4.14.83 vanilla w/o ptrace patch and compiled with "gcc (Gentoo
    7.3.0-r3 p1.4) 7.3.0" leads to a reboot after kernel and initramfs are loaded:

    ```
    ELILO v3.16 for EFI/IA-64
    ..
    Uncompressing Linux... done
    Loading file AC10027B.initrd.img...done
    1,0,0,0 5400006301E10000 0000000000000000 EVN_BOOT_START
    ```

    I have noticed that such an immediate reboot is also observed with certain versions of elilo. For example, the elilo version in Wheezy causes an
    immediate reboot while the version from Squeeze works.

    But we're using GRUB anyway. And since unpatched versions of gcc are
    known to produce a buggy kernel, we don't need to test that either.

    * 4.14.83 vanilla w/o ptrace patch and compiled with "gcc (Debian
    10.2.0-3) 10.2.0" works and boots Debian GNU/Linux Sid successfully to
    the login prompt on my rx2800 i2 and also runs stable enough to
    recompile itself:

    ```
    ELILO v3.16 for EFI/IA-64
    ..
    Uncompressing Linux... done
    Loading file AC10027B.initrd.img...done
    [    0.000000] Linux version 4.14.83vanilla (root@rx2800-i2) (gcc
    version 10.2.0 (Debian 10.2.0-3)) #1 SMP Tue Aug 4 15:56:40 UTC 2020
    [...]
    Debian GNU/Linux bullseye/sid rx2800-i2 ttyS1

    rx2800-i2 login:
    ```

    * The same holds true for 4.14.191 vanilla w/o ptrace patch and compiled
    with the same gcc.

    So latest 4.14.x kernels should work w/o the patch on a rx2800 i2 with Itanium 9300 (Tukwila) series processors - if they are compiled with gcc 10.

    I'm currently running a 4.14.192 kernel with the ptrace patch. While it doesn't crash, I'm seeing a kworker thread with rather high load.

    Do you observe that as well?

    Adrian

  • From John Paul Adrian Glaubitz@21:1/5 to John Paul Adrian Glaubitz on Wed Aug 5 15:40:01 2020
    Hello!

    On 8/5/20 2:51 PM, John Paul Adrian Glaubitz wrote:
    So latest 4.14.x kernels should work w/o the patch on a rx2800 i2 with
    Itanium 9300 (Tukwila) series processors - if they are compiled with gcc 10.

    I'm currently running a 4.14.192 kernel with the ptrace patch. While it doesn't
    crash, I'm seeing a kworker thread with rather high load.

    Do you observe that as well?

    I have tried 4.19.137 now with the same configuration and Sergei's ptrace patch and sure enough it crashes when loading the first modules.

    Adrian

  • From Frank Scheiner@21:1/5 to John Paul Adrian Glaubitz on Wed Aug 5 16:00:01 2020
    On 05.08.20 14:51, John Paul Adrian Glaubitz wrote:
    Hello Frank!

    On 8/4/20 9:04 PM, Frank Scheiner wrote:
    All kernels used Sergei's configuration adapted to enable network boot.

    * 4.14.83 vanilla w/o ptrace patch and compiled with "gcc (Gentoo
    7.3.0-r3 p1.4) 7.3.0" leads to a reboot after kernel and initramfs are
    loaded:

    ```
    ELILO v3.16 for EFI/IA-64
    ..
    Uncompressing Linux... done
    Loading file AC10027B.initrd.img...done
    1,0,0,0 5400006301E10000 0000000000000000 EVN_BOOT_START
    ```

    I have noticed that such an immediate reboot is also observed with certain versions of elilo. For example, the elilo version in Wheezy causes an immediate reboot while the version from Squeeze works.

    Sure, but this version of elilo works for all my Itanium machines (even
    with newer kernels on other machines than the rx2800 i2) - I haven't yet evaluated GRUB for these machines and hence am conservative or reluctant
    to change the boot loader for testing a specific kernel. I don't want to
    change too many things.

    And the behavior is similar to what Sergei wrote in [1] I think, so I'm confident that this is not coming from the boot loader, but from using
    gcc 7.3.0 w/o ptrace patch.

    [1]: https://lore.kernel.org/patchwork/comment/1081244/

    But we're using GRUB anyway. And since unpatched versions of gcc are
    known to produce a buggy kernel, we don't need to test that either.

    Ok, I just used what I had available on Gentoo. Emerging a new gcc would
    take a lot of time and could also have repercussions for the rest of
    the installed software.

    For Debian it was relatively easy to test a newer gcc and that really
    makes a difference for 4.14.x kernels.


    * 4.14.83 vanilla w/o ptrace patch and compiled with "gcc (Debian
    10.2.0-3) 10.2.0" works and boots Debian GNU/Linux Sid successfully to
    the login prompt on my rx2800 i2 and also runs stable enough to
    recompile itself:

    ```
    ELILO v3.16 for EFI/IA-64
    ..
    Uncompressing Linux... done
    Loading file AC10027B.initrd.img...done
    [    0.000000] Linux version 4.14.83vanilla (root@rx2800-i2) (gcc
    version 10.2.0 (Debian 10.2.0-3)) #1 SMP Tue Aug 4 15:56:40 UTC 2020
    [...]
    Debian GNU/Linux bullseye/sid rx2800-i2 ttyS1

    rx2800-i2 login:
    ```

    * The same holds true for 4.14.191 vanilla w/o ptrace patch and compiled
    with the same gcc.

    So latest 4.14.x kernels should work w/o the patch on a rx2800 i2 with
    Itanium 9300 (Tukwila) series processors - if they are compiled with gcc 10.

    I'm currently running a 4.14.192 kernel with the ptrace patch. While it doesn't
    crash, I'm seeing a kworker thread with rather high load.

    Do you observe that as well?

    Not that I know of, but I didn't check activity the whole time. I'll
    observe activity next time I fire the machine up.

    What I saw on Debian were unaligned memory accesses. But I think that is
    common on non-x86 arches.

    Cheers.
    Frank

  • From Frank Scheiner@21:1/5 to John Paul Adrian Glaubitz on Wed Aug 5 16:10:02 2020
    On 05.08.20 15:30, John Paul Adrian Glaubitz wrote:
    Hello!

    On 8/5/20 2:51 PM, John Paul Adrian Glaubitz wrote:
    So latest 4.14.x kernels should work w/o the patch on a rx2800 i2 with
    Itanium 9300 (Tukwila) series processors - if they are compiled with gcc 10.

    I'm currently running a 4.14.192 kernel with the ptrace patch. While it doesn't
    crash, I'm seeing a kworker thread with rather high load.

    Do you observe that as well?

    I have tried 4.19.137 now with the same configuration and Sergei's ptrace patch
    and sure enough it crashes when loading the first modules.

    I think this is due to:

    ```
    rx2800-i2 /usr/src/linux-on-ramdisk # git bisect good
    543cea9accd9804307541cb93d3ed7ec94b07237 is the first bad commit
    commit 543cea9accd9804307541cb93d3ed7ec94b07237
    Author: Christoph Hellwig <hch@lst.de>
    Date: Sun Dec 24 15:10:07 2017 +0100

    ia64: use generic swiotlb_ops

    These are identical to the ia64 ops, and would also support CMA
    if enabled on ia64.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Acked-by: Christian König <christian.koenig@amd.com>

    :040000 040000 5fe9ea16dd24746410a88e8e57d5722eabf99650 3cdc996c27f5b5f1cb626dcfe246146e09cb804c M arch
    ```

    The bisecting happened between 4.15.18 ([1]) - which still ran on my
    rx2800 i2 - and 4.16-rc1 ([2]) - which crashes it during OS bootup.

    [1]: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v4.15.18&id=a8ec862fda39d9adb88469eb8b9125daccc1c8335

    [2]: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v4.16-rc1&id=7928b2cbe55b2a410a0f5c1f154610059c57b1b2
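The bisection between a known-good and a known-bad kernel described above is a binary search over the commit history. The following Python sketch shows the idea under simplifying assumptions: the history is treated as a linear list (real git history is a graph, which `git bisect` handles itself), the first entry is known good, the last is known bad, and every revision after the first bad one is also bad. All names and the placeholder revisions are hypothetical, not real commits.

```python
def first_bad(revisions, is_bad):
    """Return the first revision for which is_bad() is true.

    Assumes revisions[0] is known good, revisions[-1] is known bad, and
    every revision after the first bad one is bad as well.
    """
    lo, hi = 0, len(revisions) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(revisions[mid]):
            hi = mid  # the first bad revision is at mid or earlier
        else:
            lo = mid  # the first bad revision is after mid
    return revisions[hi]


# Hypothetical linear history with placeholder names.
history = ["good-end"] + [f"c{i}" for i in range(1, 10)] + ["bad-end"]
culprit = "c6"
tested = []


def is_bad(rev):
    """Stand-in for 'build this revision, boot it, see whether it crashes'."""
    tested.append(rev)
    return history.index(rev) >= history.index(culprit)


print(first_bad(history, is_bad), "found after testing", tested)
```

Each test halves the remaining range, which is why a bisect across a full kernel release cycle needs only a modest number of build-and-boot rounds.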

    Maybe you could try applying Christoph Hellwig's patch from [3] (whole thread
    at [4])? If it works then, the crash could be related.

    [3]: https://marc.info/?l=linux-ia64&m=156145878523498&w=2

    [4]: https://marc.info/?l=linux-ia64&m=156144480821712&w=2

    Cheers,
    Frank

  • From Frank Scheiner@21:1/5 to John Paul Adrian Glaubitz on Wed Aug 5 16:20:02 2020
    On 05.08.20 16:02, John Paul Adrian Glaubitz wrote:
    On 8/5/20 3:56 PM, Frank Scheiner wrote:
    And the behavior is similar to what Sergei wrote in [1] I think, so I'm
    confident that this is not coming from the boot loader, but from using
    gcc 7.3.0 w/o ptrace patch.

    [1]: https://lore.kernel.org/patchwork/comment/1081244/

    I don't see any comments regarding the RX2800 in this discussion.

    Sergei and Gentoo don't have one available (I think they have or had an
    rx3600 at that time IIRC), so they couldn't have known.


    And, FWIW, the broken version of elilo crashes the exact same way as
    the kernel without the ptrace fix which is why it took me a while
    to figure that out.

    It's just meant as a heads-up since GRUB is known to work very well
    and I don't expect any such surprises when using GRUB instead of
    elilo.

    Yeah, it's already on my todo list to switch to GRUB (actually since
    2019). I just didn't want to introduce another variable into the
    testing. :-)

    Do you observe that as well?

    Not that I know of, but I didn't check activity the whole time. I'll
    observe activity next time I fire the machine up.

    What I saw on Debian were unaligned memory accesses. But I think that is
    common on non-x86 arches.

    I think the unaligned access might be what keeps the kworker thread busy.

    Then this seems not to be due to the kernel.

    Cheers,
    Frank
