• kernel behaviour, was Re: dash behaviour

    From Finn Thain@21:1/5 to Michael Schmitz on Mon Apr 10 12:00:01 2023
    On Mon, 10 Apr 2023, Michael Schmitz wrote:


    So I guess this bug has more to do with timing and little to do with
    state, contrary to my guesswork above. And no doubt I will have to

    What may still vary is physical mapping - I remember you had used some
    tool before to parse /proc/<pid>/pagemap to determine the physical
    addresses for task stack areas? Or am I misremembering that from some
    other bug?


    You're right, back in September 2021 when I was chasing a different bug we
    did discuss tools to look at physical mappings. I don't think that would
    help here though. We know the failure is not bad RAM because multiple Macs
    fail in the same way. Also, there's no DMA taking place on these
    particular machines.
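For readers unfamiliar with the pagemap approach mentioned above, here is a minimal sketch of how such a tool might translate a virtual address into a physical one. It assumes the documented /proc/<pid>/pagemap layout: one native-endian 64-bit entry per virtual page, with the PFN in bits 0-54, bit 62 meaning "swapped" and bit 63 meaning "present". The function names are hypothetical and this is not the tool that was discussed in 2021. Note that on recent kernels the PFN field reads as zero unless the caller has CAP_SYS_ADMIN.

```python
import os
import struct

PAGEMAP_ENTRY_SIZE = 8          # one 64-bit entry per virtual page
PFN_MASK = (1 << 55) - 1        # bits 0-54 hold the page frame number
SWAPPED_BIT = 1 << 62
PRESENT_BIT = 1 << 63


def decode_pagemap_entry(raw):
    """Decode one 8-byte pagemap entry into (present, swapped, pfn).

    pfn is None unless the page is present in RAM.
    """
    (entry,) = struct.unpack("=Q", raw)   # "=" selects native byte order
    present = bool(entry & PRESENT_BIT)
    swapped = bool(entry & SWAPPED_BIT)
    pfn = (entry & PFN_MASK) if present else None
    return present, swapped, pfn


def virt_to_phys(pid, vaddr):
    """Return the physical address backing vaddr in process pid, or None.

    None is returned when the page is not present, or when the kernel
    hides the PFN (it reads as 0 without CAP_SYS_ADMIN).
    """
    page_size = os.sysconf("SC_PAGE_SIZE")
    offset = (vaddr // page_size) * PAGEMAP_ENTRY_SIZE
    with open(f"/proc/{pid}/pagemap", "rb") as pagemap:
        pagemap.seek(offset)
        raw = pagemap.read(PAGEMAP_ENTRY_SIZE)
    if len(raw) != PAGEMAP_ENTRY_SIZE:
        return None
    _present, _swapped, pfn = decode_pagemap_entry(raw)
    if not pfn:
        return None
    return pfn * page_size + (vaddr % page_size)
```

The virtual address ranges to feed in (for instance the "[stack]" line) would come from /proc/<pid>/maps.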

    contradict myself again if/when it turns out that uninitialized memory
    is a factor :-/

    I haven't found a config option to initialize memory returned by the
    kernel page allocators, so not sure how to test that ...


    I was able to find some command line options (init_on_alloc, init_on_free)
    and the related Kconfig symbols (CONFIG_INIT_ON_ALLOC_DEFAULT_ON,
    CONFIG_INIT_ON_FREE_DEFAULT_ON).

    Given the compiler supports -fzero-call-used-regs=used-gpr there's also
    CONFIG_ZERO_CALL_USED_REGS. Also CONFIG_INIT_STACK_ALL_ZERO
    (-ftrivial-auto-var-init=zero).

    The problem with these options is that they may produce a large effect on
    the timing of events but they should still have no effect on the behaviour
    of a correct userspace program.

    Since we are dealing with a suspect userspace program, what could we learn
    from such a test? E.g. if the crashing stopped one could simply attribute
    that to the timing change. I suppose, if the crashing became more
    frequent, perhaps that would help debug the userspace program. So maybe
    it's worth a try...
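Before running such a test it may help to confirm which of the options named above are actually in effect. The following is only a sketch: it scans the text of a kernel .config for the four Kconfig symbols from this message and picks the init_on_alloc= / init_on_free= settings out of a kernel command line. The helper names are hypothetical, and where the .config text comes from (build tree, /proc/config.gz, etc.) is left to the caller.

```python
# Kconfig symbols mentioned in the thread.
INIT_SYMBOLS = (
    "CONFIG_INIT_ON_ALLOC_DEFAULT_ON",
    "CONFIG_INIT_ON_FREE_DEFAULT_ON",
    "CONFIG_ZERO_CALL_USED_REGS",
    "CONFIG_INIT_STACK_ALL_ZERO",
)


def config_status(config_text):
    """Map each symbol of interest to True if the .config sets it to 'y'."""
    enabled = set()
    for line in config_text.splitlines():
        name, sep, value = line.strip().partition("=")
        if sep and value == "y":
            enabled.add(name)
    return {symbol: symbol in enabled for symbol in INIT_SYMBOLS}


def cmdline_overrides(cmdline):
    """Extract init_on_alloc= / init_on_free= from a kernel command line."""
    found = {}
    for word in cmdline.split():
        key, sep, value = word.partition("=")
        if sep and key in ("init_on_alloc", "init_on_free"):
            found[key] = value
    return found
```

On a running system the command line could be read from /proc/cmdline and passed to cmdline_overrides().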

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Finn Thain@21:1/5 to Michael Schmitz on Tue Apr 11 07:10:01 2023
    On Tue, 11 Apr 2023, Michael Schmitz wrote:


    I was able to find some command line options (init_on_alloc,
    init_on_free) and the related Kconfig symbols (CONFIG_INIT_ON_ALLOC_DEFAULT_ON, CONFIG_INIT_ON_FREE_DEFAULT_ON).

    Right - not sure how I managed to miss those.

    init_on_free might delay the boot process a while! But I would guess init_on_alloc should be OK in the first instance.


    Given the compiler supports -fzero-call-used-regs=used-gpr there's
    also CONFIG_ZERO_CALL_USED_REGS. Also CONFIG_INIT_STACK_ALL_ZERO (-ftrivial-auto-var-init=zero).


    With all of those options enabled I still see dash crash sometimes. I
    don't think I've learned anything new about the bug from that test.
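Since "sometimes" is hard to compare between kernel configurations, one way to put a number on it is to run the reproducer many times and count how often it dies from a signal. The sketch below assumes nothing about the actual reproducer used in this thread: argv is a placeholder for whatever command triggers the crash, and count_crashes is a hypothetical helper name.

```python
import subprocess


def count_crashes(argv, runs):
    """Run argv `runs` times; return {signal_number: count} for signal deaths.

    subprocess reports a child killed by signal N as returncode -N, so
    normal exits (returncode >= 0) are not counted.
    """
    crashes = {}
    for _ in range(runs):
        result = subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode < 0:
            signum = -result.returncode
            crashes[signum] = crashes.get(signum, 0) + 1
    return crashes
```

Comparing the counts from, say, a few hundred runs with and without the init options would at least show whether the crash rate moved, even if that alone cannot separate a timing effect from a state effect.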

    The problem with these options is that they may produce a large effect
    on the timing of events but they should still have no effect on the behaviour of a correct userspace program.

    Since we are dealing with a suspect userspace program, what could we
    learn from such a test? E.g. if the crashing stopped one could simply attribute

    We don't know for definite that we deal with a suspect user space
    program - it might just be a change in a previously fine program that
    now exposes a subtle kernel bug (undetected for quite a long time, but
    we've seen a few of those now...)?


    That's right -- the kernel is also suspect. As is glibc. I will keep
    looking for some way to narrow down the search.

  • From Michael Schmitz@21:1/5 to All on Tue Apr 11 06:30:02 2023
    Hi Finn,

    On 10.04.2023 at 21:39, Finn Thain wrote:
    On Mon, 10 Apr 2023, Michael Schmitz wrote:


    So I guess this bug has more to do with timing and little to do with
    state, contrary to my guesswork above. And no doubt I will have to

    What may still vary is physical mapping - I remember you had used some
    tool before to parse /proc/<pid>/pagemap to determine the physical
    addresses for task stack areas? Or am I misremembering that from some
    other bug?


    You're right, back in September 2021 when I was chasing a different bug we did discuss tools to look at physical mappings. I don't think that would
    help here though. We know the failure is not bad RAM because multiple Macs fail in the same way. Also, there's no DMA taking place on these
    particular machines.

    contradict myself again if/when it turns out that uninitialized memory
    is a factor :-/

    I haven't found a config option to initialize memory returned by the
    kernel page allocators, so not sure how to test that ...


    I was able to find some command line options (init_on_alloc, init_on_free) and the related Kconfig symbols (CONFIG_INIT_ON_ALLOC_DEFAULT_ON, CONFIG_INIT_ON_FREE_DEFAULT_ON).

    Right - not sure how I managed to miss those.

    init_on_free might delay the boot process a while! But I would guess init_on_alloc should be OK in the first instance.


    Given the compiler supports -fzero-call-used-regs=used-gpr there's also CONFIG_ZERO_CALL_USED_REGS. Also CONFIG_INIT_STACK_ALL_ZERO (-ftrivial-auto-var-init=zero).

    The problem with these options is that they may produce a large effect on
    the timing of events but they should still have no effect on the behaviour
    of a correct userspace program.

    Since we are dealing with a suspect userspace program, what could we learn from such a test? E.g. if the crashing stopped one could simply attribute

    We don't know for definite that we deal with a suspect user space
    program - it might just be a change in a previously fine program that
    now exposes a subtle kernel bug (undetected for quite a long time, but
    we've seen a few of those now...)?

    that to the timing change. I suppose, if the crashing became more
    frequent, perhaps that would help debug the userspace program. So maybe
    it's worth a try...

    We'd then have to try and minimize the impact on timing, by instead initializing a 'shadow' page reserved for that purpose. Though I suspect
    the loop over the pages might be optimized away in that case. See include/linux/highmem.h:clear_highpage_kasan_tagged() and mm/page_alloc.c:kernel_init_pages() ...

    Cheers,

    Michael

  • From Geert Uytterhoeven@21:1/5 to fthain@linux-m68k.org on Tue Apr 11 09:30:01 2023
    Hi Finn,

    On Tue, Apr 11, 2023 at 6:59 AM Finn Thain <fthain@linux-m68k.org> wrote:
    On Tue, 11 Apr 2023, Michael Schmitz wrote:
    We don't know for definite that we deal with a suspect user space
    program - it might just be a change in a previously fine program that
    now exposes a subtle kernel bug (undetected for quite a long time, but we've seen a few of those now...)?


    That's right -- the kernel is also suspect. As is glibc. I will keep
    looking for some way to narrow down the search.

    Or the compiler...

    https://lore.kernel.org/all/CABVgOSmgpkktiLkU-ic0xGitDOhep+3sb5X91hb8RNEzFauhAA@mail.gmail.com

    Gr{oetje,eeting}s,

    Geert


    --
    Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

    In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that.
    -- Linus Torvalds
