• 5.17.0 boot issue on Miata

    From Bob Tracy@21:1/5 to All on Fri Mar 25 04:30:01 2022
    When I attempt to boot a 5.17.0 kernel built from the kernel.org
    sources, I see disk sector errors on my "sda" device, and the boot
    process hangs at the point where "systemd-udevd.service" starts.

    Rebooting on 5.16.0 works with no disk I/O errors of any kind.

    Assuming the 5.17.0 kernel or its associated initrd had bad sectors, I
    rebuilt both and saw no I/O errors during the build nor afterward when
    copying the new kernel into place under "/boot".

    Even tried a cross-compile build of a 5.17.0 alpha kernel on my x86_64
    platform to save build time (34 hours for a native build on a PWS 433au
    vs. 2 hours on the x86_64 platform). That build produced identical
    results when I tried booting on it.

    If anyone else is seeing this and can get a head-start on bisecting,
    that would be very much appreciated. I won't be able to get to it for
    about a week and a half :-(. 5.16.0 works. 5.17.0 doesn't. Might get
    lucky and find that the offending changes happened in the first 5.17.0
    release candidate.

    As always, sincere thanks in advance.

    --Bob

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael Cree@21:1/5 to Bob Tracy on Sat Mar 26 23:30:02 2022
    On Thu, Mar 24, 2022 at 09:54:15PM -0500, Bob Tracy wrote:
    When I attempt to boot a 5.17.0 kernel built from the kernel.org
    sources, I see disk sector errors on my "sda" device, and the boot
    process hangs at the point where "systemd-udevd.service" starts.

    Rebooting on 5.16.0 works with no disk I/O errors of any kind.

    Oh, you can run a 5.16.y kernel on Alpha? I have had problems
    with everything since 5.9.y with rare, random, corruptions in
    memory in user space (exhibiting as glibc detected memory
    corruptions or segfaults).

    This is why I am still running a 5.8.y kernel on the Debian
    Ports buildd.

    I just compiled up a 5.16.y kernel and the problem is still there.
    It did take a bit to trigger the bug (about 10 hours of testing
    before it happened).

    I had done a bisection between 5.8.0 and 5.9.0 last year but I
    think it went astray (as testing is difficult and not foolproof).
    Your email has prompted me to go back to it and see if I can nail
    down the offending commit. We really want to get it fixed.

    Cheers,
    Michael.

  • From Bob Tracy@21:1/5 to Michael Cree on Tue Apr 5 03:50:01 2022
    On Sun, Mar 27, 2022 at 11:21:57AM +1300, Michael Cree wrote:
    Oh, you can run a 5.16.y kernel on Alpha? I have had problems
    with everything since 5.9.y with rare, random, corruptions in
    memory in user space (exhibiting as glibc detected memory
    corruptions or segfaults).

    Did we have this painted into the "SMP vs. not-SMP" corner at one point?
    Miata is an automatic not-SMP case for hand-built kernels for that
    architecture, which might explain why I'm not seeing the problems with
    user space memory corruption.

    Just successfully booted on v5.17-rc1 a little while ago. Moving on to
    "-rc2".

    --Bob

  • From Michael Cree@21:1/5 to Bob Tracy on Tue Apr 5 07:10:01 2022
    On Mon, Apr 04, 2022 at 08:42:38PM -0500, Bob Tracy wrote:
    Did we have this painted into the "SMP vs. not-SMP" corner at one point?

    No, this affects both ES45 (with 3 cpus) and XP1000 (one cpu).

    The problem is rare. I often have to run tests for 12 hours on
    the XP1000 before I see a problem. On miata it might occur even
    less often.

    I hope I am getting close to the bad commit, but it is taking
    time when I run testing for a whole day before I feel confident
    enough to mark the kernel as good. And I have been wrong on
    that one a couple of times now, having to repeat part of the
    bisection.

    Cheers,
    Michael.
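
A rough illustration of why a clean day-long run can still mislead a bisection of an intermittent bug. This is only a sketch under stated assumptions: failures are modelled as a Poisson process (exponential waiting time), and the 12-hour mean time to failure is a guess loosely based on the figure mentioned above, not a measurement. The function names are hypothetical.

```python
import math


def p_false_good(test_hours: float, mean_hours_to_failure: float) -> float:
    """Probability that a bad kernel shows no failure in test_hours.

    Assumes exponentially distributed time to first failure (an assumption,
    not something established in the thread).
    """
    return math.exp(-test_hours / mean_hours_to_failure)


def p_any_false_good(p_single: float, bad_steps: int) -> float:
    """Probability that at least one of bad_steps bad kernels is marked good."""
    return 1.0 - (1.0 - p_single) ** bad_steps


# Assumed numbers: 24 hours of testing per step, 12 hours mean time to failure.
p24 = p_false_good(24, 12)           # exp(-2), roughly 0.135
print(f"one bad kernel passes a 24 h test: {p24:.3f}")

# Assumed: 5 of the bisection steps land on a bad kernel.
print(f"at least one wrong 'good' in 5 bad steps: {p_any_false_good(p24, 5):.2f}")
```

Under these assumed numbers a single wrong "good" verdict is not unlikely over a full bisection, which is consistent with having to repeat part of it. Longer test runs per step reduce the risk exponentially, at the cost of wall-clock time.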

  • From Bob Tracy@21:1/5 to Michael Cree on Tue Apr 5 16:00:01 2022
    On Tue, Apr 05, 2022 at 05:01:25PM +1200, Michael Cree wrote:
    No, this affects both ES45 (with 3 cpus) and XP1000 (one cpu).

    The problem is rare. I often have to run tests for 12 hours on
    the XP1000 before I see a problem. On miata it might occur even
    less often.

    I hope I am getting close to the bad commit, but it is taking
    time when I run testing for a whole day before I feel confident
    enough to mark the kernel as good. And I have been wrong on
    that one a couple of times now, having to repeat part of the
    bisection.

    Stupid question, but possibly related to what I'm seeing in v5.17-final.
    Beginning with "-rc3" there's a new FRAMEBUFFER_CONSOLE_LEGACY_ACCELERATION
    configuration option. Do I need this enabled on Miata if I normally
    boot in a video mode that displays a logo? I'll try "no" for the "-rc3"
    build if/when "-rc2" boots properly.

    --Bob

  • From Helge Deller@21:1/5 to Bob Tracy on Tue Apr 5 20:30:01 2022
    On 4/5/22 15:55, Bob Tracy wrote:
    Stupid question, but possibly related to what I'm seeing in v5.17-final.
    Beginning with "-rc3" there's a new FRAMEBUFFER_CONSOLE_LEGACY_ACCELERATION
    configuration option. Do I need this enabled on Miata if I normally
    boot in a video mode that displays a logo? I'll try "no" for the "-rc3"
    build if/when "-rc2" boots properly.

    You don't need to enable it, but for alpha it's probably beneficial to enable it.
    When enabled, you will see a big speed improvement when logging in to a graphics text
    console and printing info. E.g. try "time dmesg" with and without that option...
    The "dmesg" will scroll the screen, and that's what it accelerates (only if the driver
    has such hardware bitblt-support).

    Helge
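
The "time dmesg" comparison suggested above can be scripted. The following is a minimal sketch, not a benchmark tool: `time_command` is a hypothetical helper name, and it deliberately leaves the child's output attached to the terminal, since the scrolling of the console is the thing being measured.

```python
import subprocess
import sys
import time


def time_command(argv):
    """Run argv and return the elapsed wall-clock time in seconds.

    stdout/stderr are inherited rather than captured, so on a framebuffer
    console the measured time includes the cost of scrolling the screen.
    """
    start = time.perf_counter()
    subprocess.run(argv, check=True)
    return time.perf_counter() - start


# Only runs when a command is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    print(f"{time_command(sys.argv[1:]):.3f} s", file=sys.stderr)
```

Saved under a name of your choosing and invoked with `dmesg` as its argument on the graphics text console, once per kernel configuration, it gives the two numbers to compare. Running it over ssh or in a terminal emulator would not exercise the framebuffer scrolling path.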

  • From Bob Tracy@21:1/5 to Helge Deller on Thu Apr 7 00:50:01 2022
    On Tue, Apr 05, 2022 at 08:22:48PM +0200, Helge Deller wrote:
    You don't need to enable it, but for alpha it's probably beneficial to enable it.
    When enabled, you will see a big speed improvement when logging in to a graphics text
    console and printing info. E.g. try "time dmesg" with and without that option...
    The "dmesg" will scroll the screen, and that's what it accelerates (only if the driver
    has such hardware bitblt-support).

    v5.17-rc2 ok. v5.17-rc3 I get the disk sector errors and hang that I
    reported in the first message in this thread.

    (Unrelated, I *did* enable the framebuffer option, and that part of the
    boot worked just fine.)

    I'm going to try a native build of '-rc3' just to rule out any
    cross-compiler strangeness. Should have something to report in another
    34 hours or so :-(.

    --Bob

  • From Bob Tracy@21:1/5 to Bob Tracy on Fri Apr 8 15:10:01 2022
    On Wed, Apr 06, 2022 at 05:44:01PM -0500, Bob Tracy wrote:
    v5.17-rc2 ok. v5.17-rc3 I get the disk sector errors and hang that I
    reported in the first message in this thread.

    I'm going to try a native build of '-rc3' just to rule out any
    cross-compiler strangeness. Should have something to report in another
    34 hours or so :-(.

    Confirmed: the native build was just as broken as the cross build. The bug
    was introduced somewhere between v5.17-rc2 and v5.17-rc3. But at least I
    have a bit more confidence in the integrity of what the cross tools build.

    Interesting aside: the cross build's vmlinux.gz was approx. 200k larger.
    That might be due to gcc version differences (native toolchain is 11.2,
    and the cross toolchain is 11.1).

    I'll start the actual bisection process today. If I don't finish today,
    it will be at least another week before I can get back to this, so
    apologies in advance.

    --Bob
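
For a sense of what the bisection between two release candidates costs, here is a minimal sketch of the binary search that a bisection performs. Assumptions, all of them simplifications: the history is treated as a linear list (real kernel history is a graph with merges), the boot test is deterministic, and the commit count of 1001 and the regression position are made-up placeholders. Only the 2-hour cross-build and 34-hour native-build figures come from this thread.

```python
import math
from typing import Callable, Sequence, Tuple


def first_bad(commits: Sequence[str],
              is_bad: Callable[[str], bool]) -> Tuple[str, int]:
    """Return (first bad commit, number of builds tested).

    Assumes commits[0] is known good, commits[-1] is known bad, and every
    commit after the first bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1     # lo: known good, hi: known bad
    builds = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        builds += 1                  # one kernel build and boot test
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi], builds


# Hypothetical history: 1001 placeholder commits, regression at index 637.
commits = [f"c{i}" for i in range(1001)]
culprit, builds = first_bad(commits, lambda c: int(c[1:]) >= 637)
worst_case = math.ceil(math.log2(len(commits) - 1))

print("first bad:", culprit, "after", builds, "builds; worst case", worst_case)
print("cross-build hours: ", builds * 2)    # 2 h per cross build
print("native-build hours:", builds * 34)   # 34 h per native build
```

The number of builds grows only with the logarithm of the commit count, so the per-build time dominates. That is why confirming that the cross toolchain reproduces the failure matters before starting.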

  • From Bob Tracy@21:1/5 to Bob Tracy on Tue Apr 19 07:00:01 2022
    XPost: linux.kernel

    (Adding linux-scsi and linux-kernel, now that bisection is complete.)

    On Wed, Apr 06, 2022 at 05:44:01PM -0500, Bob Tracy wrote:
    v5.17-rc2 ok. v5.17-rc3 I get the disk sector errors and hang that I
    reported in the first message in this thread.

    This is on an Alpha Miata platform (PWS 433au) with QLogic ISP1020 controller.

    Here's the implicated commit:

    edb854a3680bacc9ef9b91ec0c5ff6105886f6f3 is the first bad commit
    commit edb854a3680bacc9ef9b91ec0c5ff6105886f6f3
    Author: Ming Lei <ming.lei@redhat.com>
    Date: Thu Jan 27 23:37:33 2022 +0800

    scsi: core: Reallocate device's budget map on queue depth change

    We currently use ->cmd_per_lun as initial queue depth for setting up the
    budget_map. Martin Wilck reported that it is common for the queue_depth to
    be subsequently updated in slave_configure() based on detected hardware
    characteristics.

    As a result, for some drivers, the static host template settings for
    cmd_per_lun and can_queue won't actually get used in practice. And if the
    default values are used to allocate the budget_map, memory may be consumed
    unnecessarily.

    Fix the issue by reallocating the budget_map after ->slave_configure()
    returns. At that time the device queue_depth should accurately reflect what
    the hardware needs.

    Link: https://lore.kernel.org/r/20220127153733.409132-1-ming.lei@redhat.com
    Cc: Bart Van Assche <bvanassche@acm.org>
    Reported-by: Martin Wilck <martin.wilck@suse.com>
    Suggested-by: Martin Wilck <martin.wilck@suse.com>
    Tested-by: Martin Wilck <mwilck@suse.com>
    Reviewed-by: Martin Wilck <mwilck@suse.com>
    Reviewed-by: Bart Van Assche <bvanassche@acm.org>
    Signed-off-by: Ming Lei <ming.lei@redhat.com>
    Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

    drivers/scsi/scsi_scan.c | 55 +++++++++++++++++++++++++++++++++++++++++++-----
    1 file changed, 50 insertions(+), 5 deletions(-)

    Respectfully,
    --Bob

  • From John Garry@21:1/5 to Bob Tracy on Mon Apr 25 12:10:01 2022
    XPost: linux.kernel

    On 19/04/2022 05:53, Bob Tracy wrote:
    (Adding linux-scsi and linux-kernel, now that bisection is complete.)

    On Wed, Apr 06, 2022 at 05:44:01PM -0500, Bob Tracy wrote:
    v5.17-rc2 ok. v5.17-rc3 I get the disk sector errors and hang that I
    reported in the first message in this thread.
    This is on an Alpha Miata platform (PWS 433au) with QLogic ISP1020 controller.

    Here's the implicated commit:

    edb854a3680bacc9ef9b91ec0c5ff6105886f6f3 is the first bad commit
    commit edb854a3680bacc9ef9b91ec0c5ff6105886f6f3
    Author: Ming Lei <ming.lei@redhat.com>
    Date: Thu Jan 27 23:37:33 2022 +0800

    scsi: core: Reallocate device's budget map on queue depth change


    Please try v5.18-rc2 as it should have a fix in commit eaba83b5b850

    Thanks,
    John

  • From Bob Tracy@21:1/5 to John Garry on Tue May 3 18:00:01 2022
    XPost: linux.kernel

    On Mon, Apr 25, 2022 at 10:26:46AM +0100, John Garry wrote:
    Please try v5.18-rc2 as it should have a fix in commit eaba83b5b850

    Up and running on v5.18-rc5 as I type this. Fix confirmed. Thanks!

    --Bob
