• kernel 4.15.7/64bit, C3600 is unstable during heavy I/O on PCI

    From Carlo Pisani@21:1/5 to All on Thu Mar 15 17:00:02 2018
    hi
    I was copying 500Mbyte of files in a shell_A, while I was also copying
    other files in shell_B. I was connected to the machine through ssh.
    And I was working on a SATA harddrive connected to VT6421 PCI_SATA
    controller

    The kernel is configured without -mlong-calls, and it's

    C3600 ~ # uname -r
    4.15.7-Fearless-Coyote-c3600-64bit

    no SMP, but it's 64bit

    it happened the machine crashed without showing a line on the kernel's
    console (serial line0)

    I saw strange messages on the LCD in the front of the machine, see
    pics in attachment

    errors on the LCD:

    FLT CBFC: SYS BD
    OS HPMC br err

    FLT 500B: SYS BD
    bus timeout

    FLT CBF4: SYS BD
    bad OS HPMC chsm

    FLT CBF4: SYS BD
    bad OS HPMC cksm


    C3600 ~ # lspci
    00:0c.0 Ethernet controller: Digital Equipment Corporation DECchip
    21142/43 (rev 41)
    00:0d.0 Multimedia audio controller: Analog Devices AD1889 sound chip
    00:0e.0 IDE interface: National Semiconductor Corporation 87415/87560
    IDE (rev 03)
    00:0e.1 Bridge: National Semiconductor Corporation 87560 Legacy I/O (rev 01) 00:0e.2 USB Controller: National Semiconductor Corporation USB
    Controller (rev 02)
    00:0f.0 SCSI storage controller: LSI Logic / Symbios Logic 53C896/897 (rev 07) 00:0f.1 SCSI storage controller: LSI Logic / Symbios Logic 53C896/897 (rev 07) 01:05.0 RAID bus controller: VIA Technologies, Inc. VT6421 IDE RAID
    Controller (rev 50)



    pics of the errors on the LCD

    https://s13.postimg.org/qmrfj0lev/hpc3600-error-msg-1.jpg https://s13.postimg.org/gpgepyliv/hpc3600-error-msg-2.jpg https://s13.postimg.org/exnfv29vr/hpc3600-error-msg-3.jpg https://s13.postimg.org/faeu18zvb/hpc3600-error-msg-4.jpg

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Carlo Pisani@21:1/5 to All on Fri Mar 16 13:50:02 2018
    hi
    I have created and applied the following patch
    testing the kernel with heavy I/O seems now stable

    my C3600 is still under testing, moving chunks of 500Mbyte and compiling gcc-v6


    http://93.55.217.0//wonderland/chunk_of/user/ivelegacy/happa-dev/hppa2_0001_HPMC_fix_my_v1.patch
    http://93.55.217.0//wonderland/chunk_of/user/ivelegacy/happa-dev/kernel-4.15.7-c3600-64bit--mini-all-modules__config

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Helge Deller@21:1/5 to All on Sat Mar 17 12:40:02 2018
    * Carlo Pisani <carlojpisani@gmail.com>:
    I have created and applied the following patch
    testing the kernel with heavy I/O seems now stable

    my C3600 is still under testing, moving chunks of 500Mbyte and compiling gcc-v6


    http://93.55.217.0//wonderland/chunk_of/user/ivelegacy/happa-dev/hppa2_0001_HPMC_fix_my_v1.patch

    --- drivers/parisc/lba_pci.c 2018-01-28 22:20:33.000000000 +0100
    +++ drivers/parisc/lba_pci.c 2018-03-15 12:26:44.839894952 +0100
    @@ -1405,7 +1405,7 @@

    /* Set HF mode as the default (vs. -1 mode). */
    stat = READ_REG32(d->hba.base_addr + LBA_STAT_CTL);
    - WRITE_REG32(stat | HF_ENABLE, d->hba.base_addr + LBA_STAT_CTL);
    + WRITE_REG32(stat & ~HF_ENABLE, d->hba.base_addr + LBA_STAT_CTL);

    /*
    ** Writing a zero to STAT_CTL.rf (bit 0) will clear reset signal

    That's the patch from Kyle: https://www.spinics.net/lists/linux-parisc/msg01027.html

    which comes out of this mail thread: https://www.spinics.net/lists/linux-parisc/msg01024.html
    specifically with those notes: https://www.spinics.net/lists/linux-parisc/msg01026.html

    Citing here:
    "bus timeout" usually means we tried to read an address that doesn't
    respond. that is, nothing on the bus accepted the transaction for it,
    so it timed out and HPMC'd the box.

    what you really need is the IIR, and the address it tried to access
    (
  • From John David Anglin@21:1/5 to Helge Deller on Sat Mar 17 15:10:02 2018
    On 2018-03-17 7:36 AM, Helge Deller wrote:
    Can you maybe try to localize where in the drivers/ata/sata_via.c driver triggers the HPMC ?


    --- arch/parisc/kernel/hpmc.S 2018-01-28 22:20:33.000000000 +0100
    +++ arch/parisc/kernel/hpmc.S 2018-03-15 14:13:46.611969815 +0100
    @@ -308,4 +290,5 @@
    .align 4
    .export os_hpmc_size
    os_hpmc_size:
    - .word .os_hpmc_end-.os_hpmc
    + /* .word .os_hpmc_end-.os_hpmc */
    + .word (.os_hpmc_end - .os_hpmc) * 4 /* sizeof(u32) */

    This one seems wrong.
    I think you just didn't hit a HPMC with your first patch, and as such
    this patch has no influence...
    Helge, did you check that os_hpmc is correctly entered after you added
    the ".align 4" statement
    (e.g., trigger hpmc by accessing page 0 or such)?  I looked at one of my builds and the size is
    correct as is.  Is it possible that the length needs to be a multiple of
    16?  Current length is 0x194.

    There are a couple of minor issues with assembly code.  There are no
    .type and .size directives for
    os_hpmc_size.

    I think we should investigate further as I have never seen hpmc handler
    entered on rp3440 or c8000.
    On rp3440, there are messages about the branch failing in the SL log.  However, this might have been
    fixed by your alignment fix.  I did see a similar issue to what Carlos indicated on c8000.  It would lock
    up and require a power cycle when the DRM driver was included in
    kernel.  However, it seems to be
    running fine with v4.15.10 with cache patch and full "Debian" config.  I
    tried a couple of boots last night
    to try to generate the lock up but it didn't happen.

    Dave

    --
    John David Anglin dave.anglin@bell.net

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Grant Grundler@21:1/5 to Helge Deller on Sat Mar 17 17:40:01 2018
    Hi Helge,
    just a nit on PCI terminology...

    On Sat, Mar 17, 2018 at 12:36 PM, Helge Deller <deller@gmx.de> wrote:
    * Carlo Pisani <carlojpisani@gmail.com>:
    I have created and applied the following patch
    testing the kernel with heavy I/O seems now stable

    my C3600 is still under testing, moving chunks of 500Mbyte and compiling gcc-v6


    http://93.55.217.0//wonderland/chunk_of/user/ivelegacy/happa-dev/hppa2_0001_HPMC_fix_my_v1.patch

    --- drivers/parisc/lba_pci.c 2018-01-28 22:20:33.000000000 +0100
    +++ drivers/parisc/lba_pci.c 2018-03-15 12:26:44.839894952 +0100
    @@ -1405,7 +1405,7 @@

    /* Set HF mode as the default (vs. -1 mode). */
    stat = READ_REG32(d->hba.base_addr + LBA_STAT_CTL);
    - WRITE_REG32(stat | HF_ENABLE, d->hba.base_addr + LBA_STAT_CTL);
    + WRITE_REG32(stat & ~HF_ENABLE, d->hba.base_addr + LBA_STAT_CTL);

    /*
    ** Writing a zero to STAT_CTL.rf (bit 0) will clear reset signal

    That's the patch from Kyle: https://www.spinics.net/lists/linux-parisc/msg01027.html

    which comes out of this mail thread: https://www.spinics.net/lists/linux-parisc/msg01024.html
    specifically with those notes: https://www.spinics.net/lists/linux-parisc/msg01026.html

    Citing here:
    "bus timeout" usually means we tried to read an address that doesn't
    respond. that is, nothing on the bus accepted the transaction for it,
    so it timed out and HPMC'd the box.

    HF= Hard Fail on PCI "Master Abort". "Master Abort" means the MMIO
    transaction timed out - usually due to the device not responding to an
    MMIO read. We normally want HF to be enabled as you noted below in
    order to get state of the CPU and which IO device it was trying to
    access.

    It's possible the "~0L" returned in SoftFail mode is being handled by
    the driver OR the particular read that fails just doesn't matter.
    Would have to see dmesg output if the driver ever complains about
    invalid MMIO read data (~0L).

    cheers,
    grant

    what you really need is the IIR, and the address it tried to access
    (both the kernel vaddr which will be in the register, and the "system requester address" from the hpmc dump which will be the physical address mapped.

    not sure why the hpmc handler is getting skipped, that's a little weird.

    you can try hacking elroy to set softfail mode on that bus, which will
    result in a timeout on the pci bus to return -1 (like what x86 and most
    other architectures do) rather than hang the box, but it really likely
    means a driver bug.



    So, you change LBA to return -1 instead of faulting via HPMC which is
    of course one work-around to avoid the HPMC.

    But could you try to check the driver instead?

    You run this SATA controller:
    01:05.0 RAID bus controller: VIA Technologies, Inc. VT6421 IDE RAID

    Can you maybe try to localize where in the drivers/ata/sata_via.c driver triggers the HPMC ?


    --- arch/parisc/kernel/hpmc.S 2018-01-28 22:20:33.000000000 +0100
    +++ arch/parisc/kernel/hpmc.S 2018-03-15 14:13:46.611969815 +0100
    @@ -308,4 +290,5 @@
    .align 4
    .export os_hpmc_size
    os_hpmc_size:
    - .word .os_hpmc_end-.os_hpmc
    + /* .word .os_hpmc_end-.os_hpmc */
    + .word (.os_hpmc_end - .os_hpmc) * 4 /* sizeof(u32) */

    This one seems wrong.
    I think you just didn't hit a HPMC with your first patch, and as such
    this patch has no influence...

    Helge


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Grant Grundler@21:1/5 to Carlo Pisani on Sat Mar 17 18:30:02 2018
    [adding lists back to CC since this is a public discussion]

    On Sat, Mar 17, 2018 at 5:36 PM, Carlo Pisani <carlojpisani@gmail.com> wrote: >> Would have to see dmesg output if the driver ever complains about
    invalid MMIO read data (~0L).

    I repeat AGAIN

    I have tested different sATA controllers
    - via6421
    - SIL24
    - Adaptex 2410

    Ah sorry - I missed that.

    none of them has never complained in dmesg ( i have the console
    redirected the serial port )

    Thanks for confirming Carlo!

    Having worked on mv7042 and SIL3124 (IIRC) driver support for almost 4
    years, I can tell you most SATA drivers suck. Helge is most likely
    correct that this is due to a SATA driver bug. Since Intel x86 (32 or
    64bit) systems ONLY support "SoftFail" mode, many driver issues are
    never exposed until something gets corrupted or HW totally wedges.

    and all of them showed the same behavior under heavy I/O

    While the symptom (HPMC) looks the same, the details are likely
    different for each card. It's been over 10 years but ISTR that "SER
    PIM" command when entered at boot prompt will dump those details. Ah,
    yes, that's correct:
    https://parisc.wiki.kernel.org/index.php/How_to_report_a_parisc-linux_kernel_problem#HPMC


    BTW, HPMCs can also be due to devices DMAing to an invalid address
    since the IOMMU is "strict" and requires all DMA be "mapped". IIRC,
    these DMA failures are "imprecise events" meaning SER PIM is the only
    way to determine what the target DMA address was. This is non-trivial
    stuff to debug since it requires pretty deep understanding of the SATA controller operations and how the driver works.

    the C3600 machine stops to work, and I see errors in the LCD
    without a line from the kernel on the serial console

    Yes, HPMC due to IO errors (including DMA failures and PCI Master
    Abort) will lock up all other IO including the serial console. SER PIM
    is the only output that will lead to diagnosing the underlying driver
    bug.

    But in SoftFail mode, I was hoping one of the drivers might complain
    about "invalid value" - but that is likely way too optimistic on my
    part. The only drivers that look for ~0L on MMIO read are the ones
    that support PCI hotplug (e.g. PCMCIA devices and some higher end PCI
    IO devices which target data center/server market).

    cheers,
    grant

    ps. I haven't looked at mvsata or sil3124 SATA controller code since
    2010. But if you could find a MV7042 PCI (or PCI-X?) SATA controller,
    I have more faith in that driver/HW combination partly because I
    tested it pretty thoroughly at the time.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John David Anglin@21:1/5 to Grant Grundler on Sat Mar 17 18:50:01 2018
    Hi Grant,

    On 2018-03-17 12:12 PM, Grant Grundler wrote:
    "Master Abort" means the MMIO
    transaction timed out - usually due to the device not responding to an
    MMIO read.
    In lba_pci.c and sba_iommu.c, it says "BE WARNED: register writes are
    posted" and need
    to be followed by a read.  It seems there are a some routines in these
    modules that have
    writes that aren't followed by a read.  One is lba_wr_cfg(). Another
    might be the macro
    LBA_CFG_RESTORE().  Are these okay?

    It seems probable that the problem that Carlo is having is a conflict
    between devices.

    Dave

    --
    John David Anglin dave.anglin@bell.net

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Grant Grundler@21:1/5 to Carlo Pisani on Mon Mar 19 21:00:02 2018
    [adding public lists back to CC since I can't sort this by myself any more]

    On Sat, Mar 17, 2018 at 6:45 PM, Carlo Pisani <carlojpisani@gmail.com> wrote: >> While the symptom (HPMC) looks the same, the details are likely
    different for each card. It's been over 10 years but ISTR that "SER
    PIM" command when entered at boot prompt will dump those details.

    this is an old log http://93.55.217.0/wonderland/chunk_of/user/ivelegacy/hppa-dev/ser-prim-2018-03-12--sata-crash

    Awesome - thank you!

    Can you tell me which SATA controller was installed when this crash was taken?

    my C3600 is now busy, compiling 4.16.*, a couple of branches with two different configurations, so I can't force a crash to have a fresh log
    at the moment, but I will with several SATA card!

    No worries - start with one. :)

    what about SAS kernel modules? are them better than SATA?

    In general, I expect they are better supported but I never worked with
    SAS controllers.

    and what about SCSI (adaptec 29320, PCI-X)?

    Sorry: I don't have any experience with adaptec - only the LSI SCSI controllers, Marvel 7042 SATA controller, and Silicon Image 3124
    (IIRC). The SI3124 was a PoS. MV7042 worked well and was reliable "at
    scale".

    I am waiting for an array of disks, I can test the SCSI card under
    heavy I/O to see
    - performance
    - stability

    it will be delivered here for the end of the next week

    Ok! Sounds like fun! :D

    cheers,
    grant

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Grant Grundler@21:1/5 to Grant Grundler on Mon Mar 19 21:00:02 2018
    On Mon, Mar 19, 2018 at 8:37 PM, Grant Grundler <grantgrundler@gmail.com> wrote:
    ...
    this is an old log
    http://93.55.217.0/wonderland/chunk_of/user/ivelegacy/hppa-dev/ser-prim-2018-03-12--sata-crash

    BTW, I should point out one needs to give "CLEARPIM" command at PDC
    command line in order to record a new HPMC.

    So once you reproduce the crash, collect HPMC dump with "SER PIM",
    then "CLEAR PIM" again. :)

    cheers,
    grant

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Grant Grundler@21:1/5 to John David Anglin on Mon Mar 19 20:50:01 2018
    Hi John!

    On Sat, Mar 17, 2018 at 6:47 PM, John David Anglin <dave.anglin@bell.net> wrote:
    Hi Grant,

    On 2018-03-17 12:12 PM, Grant Grundler wrote:

    "Master Abort" means the MMIO
    transaction timed out - usually due to the device not responding to an
    MMIO read.

    In lba_pci.c and sba_iommu.c, it says "BE WARNED: register writes are
    posted" and need to be followed by a read. It seems there are a some routines in these modules that have writes that aren't followed by a read. One is lba_wr_cfg(). Another might be the macro
    LBA_CFG_RESTORE(). Are these okay?

    I looked through the two examples you point out and I *think* both are
    ok. lba_wr_cfg() issues an mmio write and immediately after calls LBA_CFG_MASTER_ABORT_CHECK() which performs an MMIO read from the same
    base address.

    The LBA_CFG_RESTORE() is "lazy" - the next MMIO read will flush those
    three writes and (I believe) any following MMIO writes will still be
    issued in order.

    Typically, the problem with posted MMIO writes is DMA or other events
    don't start until the MMIO write is "seen" by the device. This is
    important when specific timing between MMIO transactions is required
    OR some magic (e.g. device reset, updates Frame Buffer, etc) happens.

    It seems probable that the problem that Carlo is having is a conflict
    between devices.

    Hrm. I don't know. I haven't yet looked at the latest dump that Carlo
    helpfully provided as I'm still traveling. Why do you suspect this?

    I'm skeptical about "conflict between devices" (due to lba_wr_cfg())
    for two reasons:
    1) configuration space accesses are usually not part of normal IO
    device transaction processing.
    2) I've nearly always found that PCI Master Aborts (on MMIO reads) are
    usually just a symptom of something else going wrong and not the root
    cause.

    Typically, the issues I recall running into are around the drivers
    hitting a corner case where the device is still performing DMA to an
    address that gets unmapped by the driver. This will wedge the IOMMU
    (sba) and then following MMIO reads will generate an HPMC.

    The hard part is to determine what the corner case is based on a DMA
    address (as reported in SER PIM output). It requires deeper
    understanding of the DMA programming for the given SATA controller
    (driver directing HW what to do), how transaction completions are
    reported (SATA controller HW) and handled (driver operation).

    In the past, I've sorted several of these issues out for tg3 and tulip
    NIC drivers and I can with confidence say that some issues still
    remain in the tulip driver shutdown path. But I gave up on trying to
    fix those and lost interest later.

    cheers,
    grant

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John David Anglin@21:1/5 to Carlo Pisani on Mon Mar 19 21:50:01 2018
    This is info about PCI devices that Carlo sent me.  It would help if we
    lnew where 0x4046b840
    was in the kernel. From the IIR, 0xf3c001c, it seems the timeout
    occurred on a byte operation.

    Dave

    On 2018-03-17 2:58 PM, Carlo Pisani wrote:
    Probably, it would be useful to have verbose output from lspci posted to
    list as well.

    c3600 ~ $ lspci -v
    00:0c.0 Ethernet controller: Digital Equipment Corporation DECchip
    21142/43 (rev 41)
    Subsystem: Hewlett-Packard Company DECchip 21142/43
    Flags: bus master, medium devsel, latency 255, IRQ 17
    I/O ports at 1000 [size=128]
    Memory at fffffffff4008000 (32-bit, non-prefetchable) [size=1K]
    Expansion ROM at fffffffff4040000 [disabled] [size=256K]
    Kernel driver in use: tulip
    lspci: Unable to load libkmod resources: error -12

    00:0d.0 Multimedia audio controller: Analog Devices Device 1889
    Subsystem: Analog Devices Device 1889
    Flags: bus master, medium devsel, latency 255, IRQ 18
    Memory at fffffffff400c000 (32-bit, prefetchable) [size=512]
    Memory at fffffffff400b000 (32-bit, prefetchable) [size=16]
    Memory at fffffffff400a000 (32-bit, prefetchable) [size=16]
    Memory at fffffffff4009000 (32-bit, prefetchable) [size=16]
    Capabilities: [dc] Power Management version 1
    Kernel driver in use: snd_ad1889

    00:0e.0 IDE interface: National Semiconductor Corporation 87415/87560
    IDE (rev 03) (prog-if 8f [Master SecP SecO PriP PriO])
    Flags: bus master, medium devsel, latency 255, IRQ 7
    I/O ports at 0f00 [size=8]
    I/O ports at 0e00 [size=4]
    I/O ports at 0d00 [size=8]
    I/O ports at 0b00 [size=4]
    I/O ports at 0a00 [size=16]
    Kernel driver in use: NS87415_IDE

    00:0e.1 Bridge: National Semiconductor Corporation 87560 Legacy I/O (rev 01)
    Flags: bus master, medium devsel, latency 255, IRQ 19
    Kernel driver in use: SuperIO

    00:0e.2 USB controller: National Semiconductor Corporation USB
    Controller (rev 02) (prog-if 10 [OHCI])
    Flags: bus master, medium devsel, latency 240, IRQ 1
    Memory at fffffffff4007000 (32-bit, non-prefetchable) [size=4K]
    Memory at fffffffff4006000 (32-bit, non-prefetchable) [size=4K]
    Kernel driver in use: ohci-pci

    00:0f.0 SCSI storage controller: LSI Logic / Symbios Logic 53C896/897 (rev 07)
    Subsystem: LSI Logic / Symbios Logic LSI53C896/7 PCI to Dual
    Channel Ultra2 SCSI Multifunction Controller
    Flags: bus master, medium devsel, latency 255, IRQ 20
    I/O ports at 0900 [size=256]
    Memory at fffffffff4005000 (64-bit, non-prefetchable) [size=1K]
    Memory at fffffffff4002000 (64-bit, non-prefetchable) [size=8K]
    Capabilities: [40] Power Management version 2
    Kernel driver in use: sym53c8xx

    00:0f.1 SCSI storage controller: LSI Logic / Symbios Logic 53C896/897 (rev 07)
    Subsystem: LSI Logic / Symbios Logic LSI53C896/7 PCI to Dual
    Channel Ultra2 SCSI Multifunction Controller
    Flags: bus master, medium devsel, latency 255, IRQ 20
    I/O ports at 0800 [size=256]
    Memory at fffffffff4004000 (64-bit, non-prefetchable) [size=1K]
    Memory at fffffffff4000000 (64-bit, non-prefetchable) [size=8K]
    Capabilities: [40] Power Management version 2
    Kernel driver in use: sym53c8xx

    01:05.0 RAID bus controller: VIA Technologies, Inc. VT6421 IDE/SATA Controller (rev 50)
    Subsystem: VIA Technologies, Inc. VT6421 IDE/SATA Controller
    Flags: bus master, medium devsel, latency 240, IRQ 21
    I/O ports at 12500 [size=16]
    I/O ports at 12400 [size=16]
    I/O ports at 12300 [size=16]
    I/O ports at 12200 [size=16]
    I/O ports at 12100 [size=32]
    I/O ports at 12000 [size=256]
    Expansion ROM at fffffffff4800000 [disabled] [size=64K]
    Capabilities: [e0] Power Management version 2
    Kernel driver in use: sata_via


    The requesting device was 10/1/5/0.

    --
    John David Anglin dave.anglin@bell.net

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Helge Deller@21:1/5 to All on Mon Mar 19 23:30:01 2018
    * John David Anglin <dave.anglin@bell.net>:
    On 2018-03-17 7:36 AM, Helge Deller wrote:
    Can you maybe try to localize where in the drivers/ata/sata_via.c driver triggers the HPMC ?


    --- arch/parisc/kernel/hpmc.S 2018-01-28 22:20:33.000000000 +0100
    +++ arch/parisc/kernel/hpmc.S 2018-03-15 14:13:46.611969815 +0100
    @@ -308,4 +290,5 @@
    .align 4
    .export os_hpmc_size
    os_hpmc_size:
    - .word .os_hpmc_end-.os_hpmc
    + /* .word .os_hpmc_end-.os_hpmc */
    + .word (.os_hpmc_end - .os_hpmc) * 4 /* sizeof(u32) */

    This one seems wrong.
    I think you just didn't hit a HPMC with your first patch, and as such
    this patch has no influence...

    Helge, did you check that os_hpmc is correctly entered after you added
    the ".align 4" statement (e.g., trigger hpmc by accessing page 0 or
    such)?  I looked at one of my builds and the size is correct as is. 
    Is it possible that the length needs to be a multiple of 16?  Current
    length is 0x194.

    Good point. I changed that in the patch below.

    There are a couple of minor issues with assembly code.  There are no .type and .size directives for os_hpmc_size.

    Fixed as well.

    I think we should investigate further as I have never seen hpmc
    handler entered on rp3440 or c8000. On rp3440, there are messages
    about the branch failing in the SL log.  However, this might have been
    fixed by your alignment fix. 

    No, the problem still exists in existing code.

    Below patch is build-tested only.


    [PATCH] Fix HPMC handler by increasing size to multiple of 16 bytes

    Make sure that the HPMC handler is 16-byte aligned and that it's length
    in the IVT is a multiple of 16 bytes. Otherwise PDC may decide not
    to jump to HPMC handler.

    Signed-off-by: Helge Deller <deller@gmx.de>

    diff --git a/arch/parisc/kernel/hpmc.S b/arch/parisc/kernel/hpmc.S
    index 8d072c44f300..f09f54f8042e 100644
    --- a/arch/parisc/kernel/hpmc.S
    +++ b/arch/parisc/kernel/hpmc.S
    @@ -84,6 +84,7 @@ END(hpmc_pim_data)
    .text

    .import intr_save, code
    + .align 16
    ENTRY_CFI(os_hpmc)
    .os_hpmc:

    @@ -305,7 +306,9 @@ ENDPROC_CFI(os_hpmc)


    __INITRODATA
    +.globl os_hpmc_size
    .align 4
    - .export os_hpmc_size
    + .type os_hpmc_size, @object
    + .size os_hpmc_size, 4
    os_hpmc_size:
    .word .os_hpmc_end-.os_hpmc
    diff --git a/arch/parisc/kernel/traps.c b/arch/parisc/kernel/traps.c
    index c919e6c0a687..ca6e8722fffb 100644
    --- a/arch/parisc/kernel/traps.c
    +++ b/arch/parisc/kernel/traps.c
    @@ -837,8 +837,8 @@ void __init initialize_ivt(const void *iva)
    ivap[0] = instr;

    /* Compute Checksum for HPMC handler */
    - length = os_hpmc_size;
    - ivap[7] = length;
    + length = ALIGN(os_hpmc_size, 16);
    + ivap[7] = length; /* length needs to be multiple of 16. */

    hpmcp = (u32 *)os_hpmc;

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John David Anglin@21:1/5 to Helge Deller on Tue Mar 20 00:00:01 2018
    On 2018-03-19 6:23 PM, Helge Deller wrote:
    - length = os_hpmc_size;
    - ivap[7] = length;
    + length = ALIGN(os_hpmc_size, 16);
    + ivap[7] = length; /* length needs to be multiple of 16. */
    If this is true, should we not add three nops to the end of os_hpmc so
    it has a length which is a multiple of 16?

    Dave

    --
    John David Anglin dave.anglin@bell.net

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Helge Deller@21:1/5 to All on Sat Mar 24 21:20:02 2018
    Make sure that the HPMC (High Priority Machine Check) handler is 16-byte aligned and that it's length in the IVT is a multiple of 16 bytes.
    Otherwise PDC may decide not to call the HPMC crash handler.

    Signed-off-by: Helge Deller <deller@gmx.de>
    Cc: stable@vger.kernel.org

    diff --git a/arch/parisc/kernel/hpmc.S b/arch/parisc/kernel/hpmc.S
    index 8d072c44f300..781c3b9a3e46 100644
    --- a/arch/parisc/kernel/hpmc.S
    +++ b/arch/parisc/kernel/hpmc.S
    @@ -84,6 +84,7 @@ END(hpmc_pim_data)
    .text

    .import intr_save, code
    + .align 16
    ENTRY_CFI(os_hpmc)
    .os_hpmc:

    @@ -300,12 +301,15 @@ os_hpmc_6:

    b .
    nop
    + .align 16 /* make function length multiple of 16 bytes */
    ENDPROC_CFI(os_hpmc)
    .os_hpmc_end:


    __INITRODATA
    +.globl os_hpmc_size
    .align 4
    - .export os_hpmc_size
    + .type os_hpmc_size, @object
    + .size os_hpmc_size, 4
    os_hpmc_size:
    .word .os_hpmc_end-.os_hpmc

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)