• armhf SIGILL, Illegal Instruction

    From Ash Hughes@21:1/5 to All on Wed Sep 29 22:10:02 2021
    Hi,

    I've been getting some programs terminated with SIGILL today, and I'm
    trying to find out if this is a package issue or if Debian (Bullseye) is
    no longer compatible with my ARM machine. I first got an error with
    onedrive, with gdb output:

    #0  0xb6948ca8 in gc.impl.conservative.gc.Gcx.fullcollect(bool) ()
       from /usr/lib/arm-linux-gnueabihf/libdruntime-ldc-shared.so.94

    which is "vldr    d18, [pc, #216] ;".

    I then tried to run ldc2, and I got something similar:

    Core was generated by `ldc2 -c --output-o -conf= -w -mattr=-neon -O3
    -release -relocation-model=pic -d'.
    Program terminated with signal SIGILL, Illegal instruction.
    #0  0x0089e15c in dmd.parse.Parser!(dmd.astcodegen.ASTCodegen).Parser.parsePrimaryExp() ()

    which is also a vldr instruction ("vldr    d16, [r6, #80]  ; 0x50")

    Finally, I tried to compile ldc2 myself and running it I got:

    #0  0xb4a6eabc in ?? () from /usr/lib/arm-linux-gnueabihf/libLLVM-11.so.1

    also vldr ("vldr        d16, [sp, #8]")

    It looks like the vldr instruction is being used in several LLVM
    packages, in a way my CPU doesn't like. Here's my cpuinfo:

    processor       : 0
    model name      : ARMv7 Processor rev 1 (v7l)
    BogoMIPS        : 37.39
    Features        : half thumb fastmult vfp edsp thumbee vfpv3 vfpv3d16 tls idivt
    CPU implementer : 0x56
    CPU architecture: 7
    CPU variant     : 0x1
    CPU part        : 0x581
    CPU revision    : 1

    Hardware        : Marvell Armada 370/XP (Device Tree)
    Revision        : 0000
    Serial          : 0000000000000000

    I don't have neon, although I think armhf doesn't require it, unless
    this has changed for Bullseye? If neon isn't required for Debian armhf,
    does this mean some LLVM related packages could be built differently to
    improve compatibility?
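
    A minimal sketch of how the Features line above can be read, assuming (this is not stated in the thread) that the kernel lists "vfpv3d16" only when the FPU has 16 double registers and lists "neon" only when NEON is present. The function name and dictionary keys are made up for illustration.

```python
def fpu_summary(features_line):
    """Summarise FPU capabilities from a /proc/cpuinfo 'Features' line."""
    # Everything after the first colon is a whitespace-separated flag list.
    flags = set(features_line.split(":", 1)[1].split())
    return {
        "has_vfpv3": "vfpv3" in flags,
        # Assumption: 'vfpv3d16' means only d0-d15 are implemented.
        "d16_only": "vfpv3d16" in flags,
        "has_neon": "neon" in flags,
    }

# The Features line from the cpuinfo output quoted in this message.
features = "Features        : half thumb fastmult vfp edsp thumbee vfpv3 vfpv3d16 tls idivt"
summary = fpu_summary(features)
print(summary)
```

    On a live system the same function could be fed the matching line from /proc/cpuinfo instead of the hard-coded string.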

    Thanks,

    Ash

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Paul Adrian Glaubitz@21:1/5 to Jeffrey Walton on Wed Sep 29 23:00:01 2021
    Hi Jeffrey!

    On 9/29/21 22:28, Jeffrey Walton wrote:
    I think John Paul Adrian Glaubitz (with the help of others) on the
    PowerPC mailing list determined that Autotools is the problem. Autotools
    is using an M4 macro that is selecting the wrong platform or features.
    It is new behavior.

    Also see Bug #995223: libffi: SIGILL on powerpc and ppc64 systems
    since libffi8, https://lists.debian.org/debian-powerpc/2021/09/msg00051.html. In particular, from a followup at https://lists.debian.org/debian-powerpc/2021/09/msg00077.html:

    It looks like a different bug, as the SIGILL faults that Ash is seeing are not occurring inside libffi.so.8. I think it's more likely an issue with LLVM in this case, as can be seen from the backtrace.

    But I would have to look into the details to figure out who the culprit is.

    Adrian

    --
    .''`. John Paul Adrian Glaubitz
    : :' : Debian Developer - glaubitz@debian.org
    `. `' Freie Universitaet Berlin - glaubitz@physik.fu-berlin.de
    `- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913

  • From Jeffrey Walton@21:1/5 to sehguh.hsa@gmail.com on Wed Sep 29 22:30:01 2021
    I think John Paul Adrian Glaubitz (with the help of others) on the
    PowerPC mailing list determined that Autotools is the problem. Autotools
    is using an M4 macro that is selecting the wrong platform or features.
    It is new behavior.

    Also see Bug #995223: libffi: SIGILL on powerpc and ppc64 systems
    since libffi8, https://lists.debian.org/debian-powerpc/2021/09/msg00051.html. In particular, from a followup at https://lists.debian.org/debian-powerpc/2021/09/msg00077.html:

    <QUOTE>
    It turns out that m4/ax_gcc_archflag.m4 contains code to detect the
    baseline of the host system and sets the GCC architecture accordingly.

    Thus, a libffi compiled on a POWER8 machine will not work on a POWER5
    machine as the compiler is emitting POWER8 instructions in this case.

    Since the m4 script contains such a host environment detection for aarch64
    as well [1], this bug can potentially affect arm64 which is a release architecture.

    We should therefore pass "--enable-portable-binary" in debian/rules.

    [1] https://github.com/libffi/libffi/blob/master/m4/ax_gcc_archflag.m4#L209
    </QUOTE>

    This is also of interest: https://lists.debian.org/debian-powerpc/2021/09/msg00048.html. There's
    a lot of back-and-forth, but it is where the problem is revealed.

    I could be mistaken, so take it with a grain of salt.

    Jeff

  • From peter green@21:1/5 to Ash Hughes on Wed Sep 29 23:10:03 2021
    As I understand it, there are two variants of "VFPv3": a version with 32 double registers (d0 to d31) and a version with only 16 double registers (d0 to d15).
    The former is referred to by gcc as "vfpv3" while the latter is referred to by gcc as "vfpv3-d16".

    Debian is supposed to support vfpv3-d16, but because there is relatively little hardware out there that doesn't support the extra registers, bugs may take a while
    to get noticed.

    So IMO this is a bug in the compiler that is generating that code. What I'm not so sure about is whether selecting the correct compilation settings is the
    responsibility of the frontend (ldc) or the backend (llvm).
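
    To make the register argument concrete, here is a small sketch that checks whether a disassembled instruction names a double register outside the 16-register set. The three instructions are the faulting ones reported in the original post; the helper name and the regular expression are illustrative only, and the check does not decode machine code.

```python
import re

# Matches D-register operands such as "d16" or "d18" in a disassembly line.
D_REG = re.compile(r"\bd(\d{1,2})\b")

def needs_d32(instruction):
    """Return True if the instruction names a double register above d15."""
    return any(int(n) > 15 for n in D_REG.findall(instruction))

# Faulting instructions quoted from the gdb sessions in the original post.
faulting = [
    "vldr d18, [pc, #216]",
    "vldr d16, [r6, #80]",
    "vldr d16, [sp, #8]",
]
for insn in faulting:
    verdict = "needs d16-d31" if needs_d32(insn) else "fits in d0-d15"
    print(insn, "->", verdict)
```

    All three reported instructions touch d16 or d18, which would be consistent with code generated for the 32-register variant running on a 16-register FPU.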

  • From Jeffrey Walton@21:1/5 to plugwash@p10link.net on Thu Sep 30 00:50:02 2021
    On Wed, Sep 29, 2021 at 5:05 PM peter green <plugwash@p10link.net> wrote:

    As I understand it, there are two variants of "VFPv3": a version with 32 double registers (d0 to d31) and a version with only 16 double registers (d0 to d15).
    The former is referred to by gcc as "vfpv3" while the latter is referred to by gcc as "vfpv3-d16".

    Debian is supposed to support vfpv3-d16, but because there is relatively little hardware out there that doesn't support the extra registers, bugs may take a while
    to get noticed.

    So IMO this is a bug in the compiler that is generating that code. What I'm not so sure about is whether selecting the correct compilation settings is the
    responsibility of the frontend (ldc) or the backend (llvm).

    Shouldn't that show up in the build logs? You should see 'gcc
    -march=armv7-a -mfpu=vfpv3-d16 ...'? Also see https://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html .

    I'm used to building with -mfpu=neon, so I'm not too familiar with an
    FPU that does not do NEON. But I seem to recall we needed something
    similar for early Android devices.

    ( I also have never used ldc, so my [limited] knowledge must really be old...).

    Jeff

  • From peter green@21:1/5 to Jeffrey Walton on Thu Sep 30 02:20:01 2021
    On 29/09/2021 23:39, Jeffrey Walton wrote:
    On Wed, Sep 29, 2021 at 5:05 PM peter green <plugwash@p10link.net> wrote:

    As I understand it, there are two variants of "VFPv3": a version with 32 double registers (d0 to d31) and a version with only 16 double registers (d0 to d15).
    The former is referred to by gcc as "vfpv3" while the latter is referred to by gcc as "vfpv3-d16".

    Debian is supposed to support vfpv3-d16, but because there is relatively little hardware out there that doesn't support the extra registers, bugs may take a while
    to get noticed.

    So IMO this is a bug in the compiler that is generating that code. What I'm not so sure about is whether selecting the correct compilation settings is the
    responsibility of the frontend (ldc) or the backend (llvm).

    Shouldn't that show up in the build logs?

    It will only show up in build logs if the build process is overriding the built-in defaults of the compiler.

    Normal practice in Debian is that when invoked without specific architecture flags compilers should generate
    code that will run on the baseline CPU of the port. If they don't then that is a bug in the compiler.
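
    As a rough illustration of that point, the sketch below pulls explicit target flags out of a GCC-style command line. GCC spells these options -march, -mfpu and -mfloat-abi; the two command lines are hypothetical examples, not taken from any real Debian build log. An empty result means the build relied on the compiler's built-in defaults, which is exactly the case where nothing appears in the log.

```python
import shlex

def explicit_target_flags(command):
    """Return the -march/-mfpu/-mfloat-abi values given on a GCC-style command line."""
    found = {}
    for arg in shlex.split(command):
        for name in ("-march", "-mfpu", "-mfloat-abi"):
            if arg.startswith(name + "="):
                found[name] = arg.split("=", 1)[1]
    return found

# Hypothetical log lines, made up for illustration.
overridden = "gcc -march=armv7-a -mfpu=vfpv3-d16 -O2 -c foo.c"
defaulted = "gcc -O2 -c foo.c"

print(explicit_target_flags(overridden))
print(explicit_target_flags(defaulted))
```

    The same idea would need adapting for ldc2, whose options (such as the -mattr seen in the core dump above) are spelled differently.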
