• DRAM accommodations

    From Don Y@21:1/5 to All on Thu Sep 5 15:54:43 2024
    Given the high rate of memory errors in DRAM, what steps
    are folks taking to mitigate the effects of these?

    Or, is ignorance truly bliss? <frown>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From john larkin@21:1/5 to All on Thu Sep 5 16:54:39 2024
    On Thu, 5 Sep 2024 15:54:43 -0700, Don Y <blockedofcourse@foo.invalid>
    wrote:

    Given the high rate of memory errors in DRAM, what steps
    are folks taking to mitigate the effects of these?

    Or, is ignorance truly bliss? <frown>

    Lots of bad block swapping, and lots of error-correction codes.

    The multi-level storage, many bits per element, is really scary.

  • From Bill Sloman@21:1/5 to Don Y on Fri Sep 6 16:15:23 2024
    On 6/09/2024 8:54 am, Don Y wrote:
    Given the high rate of memory errors in DRAM, what steps
    are folks taking to mitigate the effects of these?

    Or, is ignorance truly bliss?  <frown>

    Back in 1985 I was specifying 72-bit words to accommodate 64-bit data
    words and eight bits of error-detection and correction data/checksum.

    That was built into every memory access - it didn't slow down the memory
    much, but it made it a lot more reliable.
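
    The 72/64 scheme is an extended Hamming SEC-DED code. The same idea,
    scaled down to 8 data bits so it fits in a few lines, can be sketched
    as follows (an illustrative Python model, not the 1985 hardware):

```python
# Extended-Hamming SEC-DED, scaled down to 8 data bits for illustration.
# Parity bits sit at power-of-two positions; an extra overall-parity bit
# distinguishes single-bit (correctable) from double-bit (detect-only).

def encode(data8):
    code, i, pos = {}, 0, 1
    while i < 8:
        if pos & (pos - 1):            # not a power of two: data position
            code[pos] = (data8 >> i) & 1
            i += 1
        else:                          # power of two: parity position
            code[pos] = 0
        pos += 1
    for p in (1, 2, 4, 8):             # set each parity so covered bits XOR to 0
        par = 0
        for q, b in code.items():
            if q & p:
                par ^= b
        code[p] = par
    overall = 0
    for b in code.values():            # extra bit turns SEC into SEC-DED
        overall ^= b
    return code, overall

def decode(code, overall):
    syndrome = 0
    for p in (1, 2, 4, 8):
        par = 0
        for q, b in code.items():
            if q & p:
                par ^= b
        if par:
            syndrome |= p              # failing checks spell out the bad position
    whole = overall
    for b in code.values():
        whole ^= b                     # 1 iff an odd number of bits flipped
    if syndrome and whole:
        code[syndrome] ^= 1            # single-bit error: correct in place
        status = "corrected"
    elif syndrome:
        status = "uncorrectable"       # double-bit error: detected, not fixed
    elif whole:
        status = "corrected"           # the overall parity bit itself flipped
    else:
        status = "ok"
    data, i = 0, 0
    for pos in sorted(code):
        if pos & (pos - 1):
            data |= code[pos] << i
            i += 1
    return data, status

code, ovp = encode(0xA5)
code[6] ^= 1                           # inject a single-bit fault
data, status = decode(code, ovp)
assert data == 0xA5 and status == "corrected"

code, ovp = encode(0xA5)
code[3] ^= 1; code[5] ^= 1             # two faults: the syndrome alone
_, status = decode(code, ovp)          # would point at an innocent bit;
assert status == "uncorrectable"       # the overall parity prevents that
```

    Without the overall-parity bit, the double-error case above would be
    silently miscorrected (its syndrome points at an innocent position);
    one of the eight check bits in a 72/64 code plays exactly this role.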

    --
    Bill Sloman, Sydney

  • From Don Y@21:1/5 to Don Y on Fri Sep 6 12:31:48 2024
    On 9/5/2024 3:54 PM, Don Y wrote:
    Given the high rate of memory errors in DRAM, what steps
    are folks taking to mitigate the effects of these?

    Or, is ignorance truly bliss?  <frown>

    From discussions with colleagues, apparently, adding (external) ECC to
    most MCUs is simply not possible; too much of the memory and DRAM
    controllers are in-built (unlike older multi-chip microprocessors).
    There's no easy way to generate a bus fault to rerun the bus cycle
    or delay for the write-after-read correction.

    And, among those devices that *do* support ECC, it's just a conventional
    SECDEC implementation. So, a fair number of UCEs will plague any
    design with an appreciable amount of DRAM (can you even BUY *small*
    amounts of DRAM??)

    For devices with PMMUs, it's possible to address the UCEs -- sort of.
    But, this places an additional burden on the software and raises
    the problem of "If you are getting UCEs, how sure are you that
    undetected CEs aren't slipping through??" (again, you can only
    detect the UCEs via an explicit effort so you pay the fee and take
    your chances!)

    For devices without PMMUs, you have to rely on POST or BIST. And,
    *hope* that everything works in the periods between (restart often! :> )

    Back of the napkin figures suggest many errors are (silently!) encountered
    in an 8-hour shift. For XIP implementations, it's mainly data that is at
    risk (though that can also include control flow information from, e.g.,
    the pushdown stack). For implementations that load their application
    into DRAM, then the code is suspect as well as the data!

    [Which is likely to cause more detectable/undetectable problems?]

  • From Don Y@21:1/5 to Don Y on Fri Sep 6 17:43:53 2024
    On 9/6/2024 12:31 PM, Don Y wrote:
    And, among those devices that *do* support ECC, it's just a conventional
    SECDEC implementation. So, a fair number of UCEs will plague any
    design with an appreciable amount of DRAM (can you even BUY *small*
    amounts of DRAM??)

    Grrrr.... s/SECDEC/SECDED/

  • From Bill Sloman@21:1/5 to Don Y on Sat Sep 7 15:07:29 2024
    On 7/09/2024 5:31 am, Don Y wrote:
    On 9/5/2024 3:54 PM, Don Y wrote:
    Given the high rate of memory errors in DRAM, what steps
    are folks taking to mitigate the effects of these?

    Or, is ignorance truly bliss?  <frown>

    From discussions with colleagues, apparently, adding (external) ECC to
    most MCUs is simply not possible; too much of the memory and DRAM
    controllers are in-built (unlike older multi-chip microprocessors).
    There's no easy way to generate a bus fault to rerun the bus cycle
    or delay for the write-after-read correction.

    And, among those devices that *do* support ECC, it's just a conventional
    SECDEC implementation. So, a fair number of UCEs will plague any
    design with an appreciable amount of DRAM (can you even BUY *small*
    amounts of DRAM??)

    For devices with PMMUs, it's possible to address the UCEs -- sort of.
    But, this places an additional burden on the software and raises
    the problem of "If you are getting UCEs, how sure are you that
    undetected CEs aren't slipping through??"  (again, you can only
    detect the UCEs via an explicit effort so you pay the fee and take
    your chances!)

    For devices without PMMUs, you have to rely on POST or BIST.  And,
    *hope* that everything works in the periods between (restart often!  :> )

    Back of the napkin figures suggest many errors are (silently!) encountered
    in an 8-hour shift. For XIP implementations, it's mainly data that is at
    risk (though that can also include control flow information from, e.g.,
    the pushdown stack).  For implementations that load their application
    into DRAM, then the code is suspect as well as the data!

    [Which is likely to cause more detectable/undetectable problems?]

    Typical software reaction. You design error detection and error
    correction into the hardware, and the extra hardware can both correct
    most errors (when they can be corrected) and report all of them --
    both those corrected and those that couldn't be.

    Data transmission systems can re-transmit damaged packets of data, and
    tend to go for checksums that merely detect errors in much longer
    words/packets, rejecting the affected packets.
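
    That division of labor can be sketched in a few lines -- a detect-only
    checksum (CRC-32 here, purely as an example) with the receiver
    rejecting damaged packets rather than repairing them:

```python
# Detect-and-reject, as used on transmission links: a CRC spots damage
# over a whole packet, and the receiver drops the packet (triggering a
# resend) instead of trying to repair it.
import zlib

def frame(payload: bytes) -> bytes:
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def accept(packet: bytes):
    payload, crc = packet[:-4], int.from_bytes(packet[-4:], "big")
    return payload if zlib.crc32(payload) == crc else None   # None -> resend

pkt = frame(b"some payload")
assert accept(pkt) == b"some payload"

damaged = bytes([pkt[0] ^ 0x01]) + pkt[1:]   # flip one bit in transit
assert accept(damaged) is None               # detected and rejected
```

    Over a long packet, a few check bits buy strong detection but no
    correction -- acceptable on a link, where the remedy is simply to
    send the packet again.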

    --
    Bill Sloman, Sydney

  • From albert@spenarnc.xs4all.nl@21:1/5 to blockedofcourse@foo.invalid on Sat Sep 7 11:56:47 2024
    In article <vbflbe$tlhp$7@dont-email.me>,
    Don Y <blockedofcourse@foo.invalid> wrote:
    On 9/5/2024 3:54 PM, Don Y wrote:
    Given the high rate of memory errors in DRAM, what steps
    are folks taking to mitigate the effects of these?

    Or, is ignorance truly bliss?  <frown>

    From discussions with colleagues, apparently, adding (external) ECC to
    most MCUs is simply not possible; too much of the memory and DRAM
    controllers are in-built (unlike older multi-chip microprocessors).
    There's no easy way to generate a bus fault to rerun the bus cycle
    or delay for the write-after-read correction.

    And, among those devices that *do* support ECC, it's just a conventional
    SECDEC implementation. So, a fair number of UCEs will plague any
    design with an appreciable amount of DRAM (can you even BUY *small*
    amounts of DRAM??)

    For devices with PMMUs, it's possible to address the UCEs -- sort of.
    But, this places an additional burden on the software and raises
    the problem of "If you are getting UCEs, how sure are you that
    undetected CEs aren't slipping through??" (again, you can only
    detect the UCEs via an explicit effort so you pay the fee and take
    your chances!)

    For devices without PMMUs, you have to rely on POST or BIST. And,
    *hope* that everything works in the periods between (restart often! :> )

    Back of the napkin figures suggest many errors are (silently!) encountered
    in an 8-hour shift. For XIP implementations, it's mainly data that is at
    risk (though that can also include control flow information from, e.g.,
    the pushdown stack). For implementations that load their application
    into DRAM, then the code is suspect as well as the data!

    [Which is likely to cause more detectable/undetectable problems?]



    Running many day-long computations, e.g. for Project Euler, involving
    gigabyte memories and requiring precise (not one-off) results,
    I have not encountered any wrong results caused by RAM failures.

    Groetjes Albert
    --
    Temu exploits Christians: (Disclaimer, only 10 apostles)
    Last Supper Acrylic Suncatcher - 15Cm Round Stained Glass- Style Wall
    Art For Home, Office And Garden Decor - Perfect For Windows, Bars,
    And Gifts For Friends Family And Colleagues.

  • From Don Y@21:1/5 to albert@spenarnc.xs4all.nl on Sat Sep 7 04:04:37 2024
    On 9/7/2024 2:56 AM, albert@spenarnc.xs4all.nl wrote:
    Back of the napkin figures suggest many errors are (silently!) encountered
    in an 8-hour shift. For XIP implementations, it's mainly data that is at
    risk (though that can also include control flow information from, e.g.,
    the pushdown stack). For implementations that load their application
    into DRAM, then the code is suspect as well as the data!

    [Which is likely to cause more detectable/undetectable problems?]

    Running many day long computations for e.g. the euler project,
    involving giga byte memories, and require precise (not one off
    results),
    I have not encountered any wrong results caused by RAM failures.

    Do you KNOW that there haven't been any errors that your ECC *has*
    silently corrected for you? Or, that uncorrected errors haven't
    tickled vulnerabilities in your data/code?

    [I've seen at least one study that deliberately injected memory faults
    into running OSs in an attempt to determine how "resilient" they were
    to such faults. If a conditional jump is replaced by an unconditional
    jump (because of a corrupted opcode), you wouldn't notice a difference
    /if the condition was true/! Likewise, data can be altered in ways that
    are masked by the operations/tests performed on them. I.e., without
    actively monitoring the ECC hardware, you are largely clueless as to
    what is really happening in the memory]

    Real-world studies show FITs of 1,000 - 70,000 / Mb. So, 64,000,000 / GB.
    For a small 4GB machine, that's a failure every 4 hours.
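
    Spelling the units out (one FIT = one failure per 10^9 device-hours)
    makes this kind of estimate easy to check; a quick sketch, taking the
    1,000 - 70,000 FIT/Mbit range at face value:

```python
# One FIT = one failure per 1e9 device-hours. Scale the quoted
# 1,000 - 70,000 FIT per Mbit range up to a 4 GB machine.

MBIT_PER_GB = 8 * 1024                 # 8,192 Mbit per gigabyte
gigabytes = 4

for fit_per_mbit in (1_000, 70_000):
    total_fit = fit_per_mbit * MBIT_PER_GB * gigabytes
    hours = 1e9 / total_fit            # mean hours between errors
    print(f"{fit_per_mbit:>6} FIT/Mbit -> one error every {hours:.2f} h")
```

    That works out to roughly one error per 30 hours at the low end of the
    range, and roughly one every 26 minutes at the high end, for 4 GB.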

    Chances are, you are reloading an application and, thus, refreshing the
    values in the DRAM -- TEXT and DATA. Let your application run for a week
    or two (with const data) and see if it *provably* exhibits no errors --
    correctable or uncorrectable.

    My machines are up 24/7 for months at a time. *But*, the contents of DRAM
    are continuously being altered/updated. This effectively amounts to a
    scrubbing operation. So, the chance of a datum being noticeably corrupt
    is reduced.

    OTOH, loading code into DRAM (e.g., from NAND FLASH) and letting it sit
    there /as if/ it were ROM leaves it vulnerable to bit-rot unless ALL of
    the code is continuously reread (re-executed). And, single-bit errors
    degrade to multiple-bit errors -- which SECDED won't address.

    Do you *know* what happens in your OS when an ECC error is detected/corrected?

  • From Don Y@21:1/5 to Don Y on Sat Sep 7 04:08:40 2024
    On 9/7/2024 4:06 AM, Don Y wrote:
    On 9/7/2024 4:04 AM, Don Y wrote:
    Real-world studies show FITs of 1,000 - 70,000 / Mb.  So, 64,000,000 / GB.
    For a small 4GB machine, that's a failure every 4 hours.

    AT LEAST every 4 hours (for a FIT of 1000)

    <https://arstechnica.com/information-technology/2009/10/dram-study-turns-assumptions-about-errors-upside-down/>

    There have been numerous similar studies that are more current.

  • From Don Y@21:1/5 to Don Y on Sat Sep 7 04:06:46 2024
    On 9/7/2024 4:04 AM, Don Y wrote:
    Real-world studies show FITs of 1,000 - 70,000 / Mb.  So, 64,000,000 / GB.
    For a small 4GB machine, that's a failure every 4 hours.

    AT LEAST every 4 hours (for a FIT of 1000)

  • From Don Y@21:1/5 to Don Y on Sat Sep 7 05:01:40 2024
    On 9/7/2024 4:04 AM, Don Y wrote:
    Real-world studies show FITs of 1,000 - 70,000 / Mb.  So, 64,000,000 / GB.
    For a small 4GB machine, that's a failure every 4 hours.

    Ugh! I've got to stop doing math in my head! :<

  • From john larkin @21:1/5 to All on Sat Sep 7 07:35:11 2024
    On Sat, 7 Sep 2024 04:04:37 -0700, Don Y <blockedofcourse@foo.invalid>
    wrote:

    On 9/7/2024 2:56 AM, albert@spenarnc.xs4all.nl wrote:
    Back of the napkin figures suggest many errors are (silently!) encountered
    in an 8-hour shift. For XIP implementations, it's mainly data that is at
    risk (though that can also include control flow information from, e.g.,
    the pushdown stack). For implementations that load their application
    into DRAM, then the code is suspect as well as the data!

    [Which is likely to cause more detectable/undetectable problems?]

    Running many day long computations for e.g. the euler project,
    involving giga byte memories, and require precise (not one off
    results),
    I have not encountered any wrong results caused by RAM failures.

    Do you KNOW that there haven't been any errors that your ECC *has*
    silently corrected for you?

    ECC is there so they can increase DRAM density. Increasing density and
    using multiple levels per bit cell makes DRAM and flash cheaper and
    increases the raw error rate; adding ECC bits makes it tolerably
    reliable, and that tradeoff is cost-optimized.

    It seems to work pretty well.

    My PC here has 32 Gbytes of DRAM and a terabyte of SSD, and seems fine.
    The combination runs Spice fast and the results work.

  • From Waldek Hebisch@21:1/5 to Don Y on Tue Sep 17 00:33:31 2024
    Don Y <blockedofcourse@foo.invalid> wrote:
    On 9/5/2024 3:54 PM, Don Y wrote:
    Given the high rate of memory errors in DRAM, what steps
    are folks taking to mitigate the effects of these?

    Or, is ignorance truly bliss?  <frown>

    From discussions with colleagues, apparently, adding (external) ECC to
    most MCUs is simply not possible; too much of the memory and DRAM
    controllers are in-built (unlike older multi-chip microprocessors).
    There's no easy way to generate a bus fault to rerun the bus cycle
    or delay for the write-after-read correction.

    Are you really writing about MCU-s? My impression was that MCU-s
    which allow external memory (most do not) typically have a WAIT signal
    so that external memory can insert extra wait states. OTOH, access
    to external memory is much slower than to internal memory.

    People frequently use chips designed for gadgets like smartphones
    or TVs; those tend to have integrated DRAM controllers and no
    support for ECC.

    And, among those devices that *do* support ECC, it's just a conventional
    SECDEC implementation. So, a fair number of UCEs will plague any
    design with an appreciable amount of DRAM (can you even BUY *small*
    amounts of DRAM??)

    IIUC, if you need a small amount of memory you should use SRAM...

    For devices with PMMUs, it's possible to address the UCEs -- sort of.
    But, this places an additional burden on the software and raises
    the problem of "If you are getting UCEs, how sure are you that
    undetected CEs aren't slipping through??" (again, you can only
    detect the UCEs via an explicit effort so you pay the fee and take
    your chances!)

    For devices without PMMUs, you have to rely on POST or BIST. And,
    *hope* that everything works in the periods between (restart often! :> )

    Back of the napkin figures suggest many errors are (silently!) encountered
    in an 8-hour shift. For XIP implementations, it's mainly data that is at
    risk (though that can also include control flow information from, e.g.,
    the pushdown stack). For implementations that load their application
    into DRAM, then the code is suspect as well as the data!

    [Which is likely to cause more detectable/undetectable problems?]

    I think that this estimate is somewhat pessimistic. The last case
    I remember that could be explained by a memory error was about 25 years
    ago: I unpacked the source of a newer gcc version and tried to compile
    it. Compilation failed; I tracked the trouble to a flipped bit in one
    source file. Unpacking the sources again gave the correct value of the
    affected bit (and no other bit changed). And compiling the second copy
    worked OK.

    Earlier I saw segfaults that could be cured by exchanging DRAM
    modules or downclocking the machine.

    But it seems that machines got more reliable and I do not remember
    any recent problem like this. And I regularly do large compiles,
    where an error in the sources is very unlikely to go unnoticed. I did
    large computations where any error had a nontrivial chance to propagate
    to the final result. Some computations that I do are naturally error
    tolerant, but the error-tolerant part used a tiny amount of data, while
    most data was "fragile" (an error there was likely to be detected).

    Concerning doing something about memory errors: on the hardware side,
    the devices with DRAM that I use are COTS devices. So the only thing
    I can do is look at the reputation of the vendor, and general reputation
    says nothing about DRAM errors. In other words, there is nothing
    I can realistically do. On the software side I could try to add
    some extra error tolerance. But I have various consistency checks
    and they indicate various troubles. I deal with detected troubles;
    DRAM errors are not in this category.

    --
    Waldek Hebisch

  • From Don Y@21:1/5 to Waldek Hebisch on Mon Sep 16 19:05:24 2024
    On 9/16/2024 5:33 PM, Waldek Hebisch wrote:
    Don Y <blockedofcourse@foo.invalid> wrote:
    On 9/5/2024 3:54 PM, Don Y wrote:
    Given the high rate of memory errors in DRAM, what steps
    are folks taking to mitigate the effects of these?

    Or, is ignorance truly bliss?  <frown>

    From discussions with colleagues, apparently, adding (external) ECC to
    most MCUs is simply not possible; too much of the memory and DRAM
    controllers are in-built (unlike older multi-chip microprocessors).
    There's no easy way to generate a bus fault to rerun the bus cycle
    or delay for the write-after-read correction.

    Are you really writing about MCU-s?

    Yes. In particular, the Cortex A series.

    My impression was that MCU-s
    which allow external memory (most do not) typically have a WAIT signal
    so that external memory can insert extra wait states. OTOH, access
    to external memory is much slower than to internal memory.

    You typically use external memory when there isn't enough
    internal memory for your needs. I'm looking at 1-2GB / device
    (~500GB - 2TB per system)

    People frequently use chips designed for gadgets like smartphones
    or TVs; those tend to have integrated DRAM controllers and no
    support for ECC.

    Exactly. As the DRAM controller is in-built, adding ECC isn't
    an option. Unless the syndrome logic is ALSO in-built (it is
    on some offerings, but not all).

    And, among those devices that *do* support ECC, it's just a conventional
    SECDEC implementation. So, a fair number of UCEs will plague any
    design with an appreciable amount of DRAM (can you even BUY *small*
    amounts of DRAM??)

    IIUC, if you need a small amount of memory you should use SRAM...

    As above, 1-2GB isn't small, even by today's standards.
    And, SRAM isn't without its own data integrity issues.

    For devices with PMMUs, it's possible to address the UCEs -- sort of.
    But, this places an additional burden on the software and raises
    the problem of "If you are getting UCEs, how sure are you that
    undetected CEs aren't slipping through??" (again, you can only
    detect the UCEs via an explicit effort so you pay the fee and take
    your chances!)

    For devices without PMMUs, you have to rely on POST or BIST. And,
    *hope* that everything works in the periods between (restart often! :> )

    Back of the napkin figures suggest many errors are (silently!) encountered
    in an 8-hour shift. For XIP implementations, it's mainly data that is at
    risk (though that can also include control flow information from, e.g.,
    the pushdown stack). For implementations that load their application
    into DRAM, then the code is suspect as well as the data!

    [Which is likely to cause more detectable/undetectable problems?]

    I think that this estimate is somewhat pessimistic. The last case
    I remember that could be explained by a memory error was about 25 years
    ago: I unpacked the source of a newer gcc version and tried to compile
    it. Compilation failed; I tracked the trouble to a flipped bit in one
    source file. Unpacking the sources again gave the correct value of the
    affected bit (and no other bit changed). And compiling the second copy
    worked OK.

    You were likely doing this on a PC/workstation, right? Nowadays,
    having ECC *in* the PC is commonplace.

    "We made two key observations: First, although most personal-computer
    users blame system failures on software problems (quirks of the
    operating system, browser, and so forth) or maybe on malware infections,
    hardware was the main culprit. At Los Alamos, for instance, more than
    60 percent of machine outages came from hardware issues. Digging further,
    we found that the most common hardware problem was faulty DRAM. This
    meshes with the experience of people operating big data centers, DRAM
    modules being among the most frequently replaced components."

    Earlier I saw segfaults that could be cured by exchanging DRAM
    modules or downclocking the machine.

    ... suggesting a memory integrity issue.

    But it seems that machines got more reliable and I do not remember
    any recent problem like this. And I regularly do large compiles,
    where an error in the sources is very unlikely to go unnoticed. I did

    But, again, the use case, in a workstation, is entirely different
    than in a "device" where the code is unpacked (from NAND flash
    or from a remote file server) into RAM and then *left* there to
    execute for the duration of the device's operation (hours, days,
    weeks, etc.). I.e., the DRAM is used to emulate EPROM.

    In a PC, when your application is "done", the DRAM effectively is
    scrubbed by the loading of NEW data/text. This tends not to be
    the case for appliances/devices; such a reload only tends to happen
    when the power is cycled and the contents of (volatile) DRAM are
    obviously lost.

    Even without ECC, not all errors are consequential. If a bit
    flips and is never accessed, <shrug>. If a bit flips and
    it alters one opcode into another that is equivalent /in the
    current program state/, <shrug>. Ditto for data.

    large computations where any error had a nontrivial chance to propagate
    to the final result. Some computations that I do are naturally error
    tolerant, but the error-tolerant part used a tiny amount of data, while
    most data was "fragile" (an error there was likely to be detected).

    If you are reaccessing the memory (data or code) frequently, you
    give the ECC a new chance to "refresh" the intended contents of
    that memory (assuming your ECC hardware is configured for write-back
    operation). So, the possibility of a second fault coming along
    while the first fault is still in place (and uncorrected) is small.

    OTOH, if the memory just sits there with the expectation that it
    will retain its intended contents without a chance to be corrected
    (by the ECC hardware), then bit-rot can continue increasing the
    possibility of a second bit failing while the first is still failed.

    Remember, you don't even consider this possibility when you are
    executing out of EPROM or NOR FLASH... you just assume bits
    retain their state even when you aren't "looking at them".

    Concerning doing something about memory errors: on hardwares side
    devices with DRAM that I use are COTS devices. So the only thing
    I have is to look at reputation of the vendor, and general reputation
    says nothing about DRAM errors. In other words, there is nothing
    I can realistically do.

    You can select MCUs that *do* have support for ECC instead of just
    "hoping" the (inevitable) memory errors won't bite you. Even so,
    your best case scenario is just SECDED protection.

    On the software side I could try to add
    some extra error tolerance. But I have various consistency checks
    and they indicate various troubles. I deal with detected troubles;
    DRAM errors are not in this category.

    I assume you *test* memory on POST? But, if your device runs
    24/7/365, it can be a very long time between POSTs! OTOH, you
    could force a test cycle (either by deliberately restarting
    your device -- "nightly maintenance") or, you could test memory
    WHILE the application is running.
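
    An in-service test might look like a moving-inversions (March-style)
    pass run over one borrowed region at a time. A sketch of the access
    pattern only, with a plain Python buffer standing in for physical
    memory:

```python
# Sketch of a March-style moving-inversions test: write a background
# pattern ascending, verify-and-invert ascending, then verify-and-invert
# descending. Returns the index of the first faulty cell, or None.

def march_test(mem, bg=0x00, inv=0xFF):
    n = len(mem)
    for i in range(n):                 # M0: ascending, write background
        mem[i] = bg
    for i in range(n):                 # M1: ascending, verify then invert
        if mem[i] != bg:
            return i
        mem[i] = inv
    for i in reversed(range(n)):       # M2: descending, verify then invert
        if mem[i] != inv:
            return i
        mem[i] = bg
    return None                        # no stuck or coupled cell seen

page = bytearray(4096)                 # stand-in for a borrowed page
assert march_test(page) is None
```

    A real implementation would steal one physical page at a time (saving
    and restoring its contents), defeat the caches so the DRAM is actually
    exercised, and vary the background pattern; the read/write ordering is
    the part that matters.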

    And, what specific criteria do you use to get alarmed at the results??

    <https://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf>
    <https://www.cse.cuhk.edu.hk/~pclee/www/pubs/srds22.pdf>
    <https://core.ac.uk/download/pdf/81577401.pdf>

    [Lots more where those come from. Seeing *observed* data instead
    of theoretical is eye-opening! Is a server environment more
    pandering of DRAM? Or, *less* than, for example, a device that
    is controlling a large electro-mechanical load and the
    switching/mechanical transients that it might induce?]

    Of course, only systems with lots of memory available to test
    are good candidates for such testing -- hence the nature of these
    studies.

    Do systems with soldered down memory behave better than those
    with socketed devices? (How many insertion cycles are your
    sockets rated for? How many have been *used*???) Is an
    80F degree cold aisle with lots of forced air cooling better
    or worse than a 100F ambient with convection cooling? Is
    a server secured to an equipment rack a more hospitable
    environment than one in which the device is feeling the
    effects of 10T shocks at 200Hz?

    But, error *rates* can be extrapolated to an arbitrary
    memory size. E.g., the google study (old, now) noted
    FITs of 25,000-75,000 per Mb. This is on the order of
    "many errors per GB per day"! (their 20FIT/Mb *uncorrectable*
    error observation gives some hope for ECC systems /with
    properly implemented AND PATROLLED ECC/)

    "I'm only using a GB!"
    "Yeah, but how many UNITS do you sell? I.e., what percentage
    of your customers are experiencing problems /that you can't
    track down/?"

    Ask your "hardware guy" what sort of error rate he expects from
    "his" design (Sea level? Or, at altitude?) And, what steps HE
    recommends to combat those issues... And, what conditions he
    would consider to employ some other remedy /after the sale/!
    Hasn't considered it? Hmmm...

    [How often do you have a maintenance/support team to keep
    YOUR device running properly over the course of its lifetime?
    And, just how long *is* that expected lifetime??]

  • From Don Y@21:1/5 to Don Y on Mon Sep 16 19:31:35 2024
    On 9/16/2024 7:05 PM, Don Y wrote:
    OTOH, if the memory just sits there with the expectation that it
    will retain its intended contents without a chance to be corrected
    (by the ECC hardware), then bit-rot can continue increasing the
    possibility of a second bit failing while the first is still failed.

    This is the fallacy of (disk) RAID; unless you do patrol reads
    and actively scrub the media, you won't know about bit-rot
    until it has had a chance to progress to a point where you
    are vulnerable to it.

    [That's why you scrub DRAM!]
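
    The effect of patrol reads can be shown with a toy model (hypothetical
    rates, Python rather than firmware): flips accumulate per word, a scrub
    pass corrects any word holding a single flipped bit, and only words
    that collect two flips between scrubs become uncorrectable:

```python
# Toy model of patrol scrubbing. Bit flips land in random words; a scrub
# pass reads every word, letting SEC-DED fix single-bit errors. Words
# that accumulate two flips between scrubs are uncorrectable.
import random

def simulate(words, flips, scrub_interval, seed=0):
    rng = random.Random(seed)
    errs = [0] * words                 # flipped-bit count per word
    uncorrectable = 0
    for t in range(1, flips + 1):
        errs[rng.randrange(words)] += 1
        if scrub_interval and t % scrub_interval == 0:
            for i, e in enumerate(errs):
                if e == 1:
                    errs[i] = 0        # SEC: read, correct, write back
                elif e >= 2:
                    uncorrectable += 1 # DED: detected too late
                    errs[i] = 0        # counted; reset to keep the model simple
    uncorrectable += sum(1 for e in errs if e >= 2)
    return uncorrectable

never = simulate(10_000, 5_000, scrub_interval=0)    # no patrol reads
often = simulate(10_000, 5_000, scrub_interval=100)  # frequent scrubs
assert often < never
```

    The exact counts depend on the made-up rates, but the direction
    doesn't: frequent patrol reads keep faults in the correctable,
    single-bit regime.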

  • From Bill Sloman@21:1/5 to Chris Jones on Wed Sep 18 00:04:09 2024
    On 17/09/2024 11:47 pm, Chris Jones wrote:
    On 6/09/2024 8:54 am, Don Y wrote:
    Given the high rate of memory errors in DRAM, what steps
    are folks taking to mitigate the effects of these?

    Or, is ignorance truly bliss?  <frown>


    Do we know whether DRAM chips implement ECC internally? It seems an
    obvious thing for them to do. Of course it wouldn't help with bad solder joints on the DIMM, but it would help with many kinds of faults on the
    chip.

    It seems unlikely. The only ECC coding that I got close enough to plan
    to implement used a 72-bit word to protect 64 bits of word content.

    I suppose there might be serial access DRAM chips that would spit out
    64-bit words, and they could offer ECC protection inside the chip by
    adding the 8-bit checksum when the data went in, and using it to correct
    the output whenever the data was read out.

    --
    Bill Sloman, Sydney

  • From Chris Jones@21:1/5 to Don Y on Tue Sep 17 23:47:32 2024
    On 6/09/2024 8:54 am, Don Y wrote:
    Given the high rate of memory errors in DRAM, what steps
    are folks taking to mitigate the effects of these?

    Or, is ignorance truly bliss?  <frown>


    Do we know whether DRAM chips implement ECC internally? It seems an
    obvious thing for them to do. Of course it wouldn't help with bad solder
    joints on the DIMM, but it would help with many kinds of faults on the chip.

  • From Don Y@21:1/5 to Chris Jones on Tue Sep 17 07:57:39 2024
    On 9/17/2024 6:47 AM, Chris Jones wrote:
    On 6/09/2024 8:54 am, Don Y wrote:
    Given the high rate of memory errors in DRAM, what steps
    are folks taking to mitigate the effects of these?

    Or, is ignorance truly bliss?  <frown>

    Do we know whether DRAM chips implement ECC internally?

    Some do (some processors implement ECC on internal data pathways!).
    But, I've never seen any details of the mechanism(s) employed and
    it's not likely that manufacturers would be eager to release those
    details (competitive advantage, leaks information about how good their
    process is, how close to their technological capacity they are
    operating, etc.).

    It seems an obvious
    thing for them to do. Of course it wouldn't help with bad solder joints on the
    DIMM, but it would help with many kinds of faults on the chip.

    It also won't help with transfers between CPU and memory device,
    subtle timing errors in the implementation, etc.

    But, you have to remember: ECC isn't a panacea.
    - It doesn't correct *all* errors (e.g., original SECDED just
    corrected a single bit error)
    - It can MIScorrect errors
    - It doesn't DETECT all errors (e.g., it only reliably detects TWO
    errors; for k-bit data, there will be 2^k code words that appear
    "correct" -- a number identical to the actual number of code words
    that *are* correct! -- yet have UNDETECTABLE errors), etc.
    There is also often a cost to the ECC operation in terms of time,
    power consumption, etc.

    And, if you hide the functioning of the ECC inside the memory device,
    then the application has no way of gauging how well the memory is
    performing with/without the ECC functionality! You never know if the
    ECC is only occasionally fixing stored data OR if it is fixing EVERY
    access! (in the latter case, one should be wary of the number of
    mistakes it is possibly making as well as the number of undetectable
    errors that are slipping past it!)

    Needless to say, there is a lot of research into alternative ECC
    schemes that try to address different aspects of DRAM faults and
    failures. But, naively expecting DRAM to store what you write to
    it is a fairy tale. So, you should have, in place, a strategy to
    address those likely failures in your product design (or, just
    blame it on "the software" :> )

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From antispam@fricas.org@21:1/5 to Don Y on Tue Sep 17 16:12:56 2024
    Don Y <blockedofcourse@foo.invalid> wrote:
    On 9/16/2024 5:33 PM, Waldek Hebisch wrote:
    Don Y <blockedofcourse@foo.invalid> wrote:
    On 9/5/2024 3:54 PM, Don Y wrote:
    Given the high rate of memory errors in DRAM, what steps
    are folks taking to mitigate the effects of these?

    Or, is ignorance truly bliss?  <frown>

    From discussions with colleagues, apparently, adding (external) ECC to
    most MCUs is simply not possible; too much of the memory and DRAM
    controllers are in-built (unlike older multi-chip microprocessors).
    There's no easy way to generate a bus fault to rerun the bus cycle
    or delay for the write-after-read correction.

    Are you really writing about MCU-s?

    Yes. In particular, the Cortex A series.

    I do not think that Cortex A is an MCU. More precisely, a Cortex A
    core is not intended to be used in MCU-s and I think that most
    chips using it are not MCU-s. The distinguishing feature here is whether
    a chip is intended to be a complete system (MCU) or you are expected
    to _always_ use it with external chips (typical Cortex A based SOC).

    My impression was that MCU-s
    which allow external memory (most do not) typically have a WAIT signal
    so that external memory can insert extra wait states. OTOH access
    to external memory is much slower than to internal memory.

    You typically use external memory when there isn't enough
    internal memory for your needs. I'm looking at 1-2GB / device
    (~500GB - 2TB per system)

    People frequently use chips designed for gadgets like smartphones
    or TV-s; those tend to have an integrated DRAM controller and no
    support for ECC.

    Exactly. As the DRAM controller is in-built, adding ECC isn't
    an option. Unless the syndrome logic is ALSO in-built (it is
    on some offerings, but not all).

    And, among those devices that *do* support ECC, it's just a conventional
    SECDED implementation. So, a fair number of UCEs will plague any
    design with an appreciable amount of DRAM (can you even BUY *small*
    amounts of DRAM??)

    IIUC, if you need small amount of memory you should use SRAM...

    As above, 1-2GB isn't small, even by today's standards.
    And, SRAM isn't without its own data integrity issues.

    For devices with PMMUs, it's possible to address the UCEs -- sort of.
    But, this places an additional burden on the software and raises
    the problem of "If you are getting UCEs, how sure are you that
    undetected CEs aren't slipping through??" (again, you can only
    detect the UCEs via an explicit effort so you pay the fee and take
    your chances!)

    For devices without PMMUs, you have to rely on POST or BIST. And,
    *hope* that everything works in the periods between (restart often! :> )

    Back of the napkin figures suggest many errors are (silently!) encountered
    in an 8-hour shift. For XIP implementations, it's mainly data that is at
    risk (though that can also include control flow information from, e.g.,
    the pushdown stack). For implementations that load their application
    into DRAM, then the code is suspect as well as the data!

    [Which is likely to cause more detectable/undetectable problems?]

    I think that this estimate is somewhat pessimistic. The last case
    I remember that could be explained by a memory error was about 25 years
    ago: I unpacked the source of a newer gcc version and tried to compile it.
    Compilation failed; I tracked the trouble to a flipped bit in one source
    file. Unpacking the sources again gave the correct value of the affected
    bit (and no other bit changed). And compiling the second copy worked OK.

    You were likely doing this on a PC/workstation, right? Nowadays,
    having ECC *in* the PC is commonplace.

    PC-s, yes (small SBC-s too, but those were not used for really heavy
    computations). Concerning ECC, most PC-s that I used came without ECC.
    IME ECC used to roughly double the price of the PC compared to a
    non-ECC one. So, important servers got ECC, other PC-s were non-ECC.

    "We made two key observations: First, although most personal-computer
    users blame system failures on software problems (quirks of the
    operating system, browser, and so forth) or maybe on malware infections,
    hardware was the main culprit. At Los Alamos, for instance, more than
    60 percent of machine outages came from hardware issues. Digging further,
    we found that the most common hardware problem was faulty DRAM. This
    meshes with the experience of people operating big data centers, DRAM
    modules being among the most frequently replaced components."

    I remember a memory corruption study several years ago that said that
    software was a significant issue. In particular, bugs in the BIOS and
    Linux kernel led to random-looking memory corruption. Hopefully, the
    issues that they noticed are fixed now. Exact percentages probably do
    not matter much, because both hardware and software are changing.
    The point is that there are both hardware errors and software errors,
    and without deep investigation the software ones are hard to
    distinguish from the hardware ones.

    Earlier I saw segfaults that could be cured by exchanging DRAM
    modules or downclocking the machine.

    .. suggesting a memory integrity issue.

    Exactly.

    But it seems that machines got more reliable and I do not remember
    any recent problem like this. And I regularly do large compiles,
    where an error in the sources is very unlikely to go unnoticed. I did

    But, again, the use case, in a workstation, is entirely different
    than in a "device" where the code is unpacked (from NAND flash
    or from a remote file server) into RAM and then *left* there to
    execute for the duration of the device's operation (hours, days,
    weeks, etc.). I.e., the DRAM is used to emulate EPROM.

    In a PC, when your application is "done", the DRAM effectively is
    scrubbed by the loading of NEW data/text. This tends not to be
    the case for appliances/devices; such a reload only tends to happen
    when the power is cycled and the contents of (volatile) DRAM are
    obviously lost.

    Well, most of my computations were on machines without ECC. And
    I also had some machines sitting idle for a long time. They had
    cached data in RAM and would use it when given some work to do.

    Even without ECC, not all errors are consequential. If a bit
    flips and is never accessed, <shrug>. If a bit flips and
    it alters one opcode into another that is equivalent /in the
    current program state/, <shrug> Ditto for data.

    Yes. In numerics using Newton style iteration a small number of
    bit flips normally means that it needs to iterate longer, but will
    still converge to the correct result.

    large computations where any error had a nontrivial chance to propagate
    to the final result. Some computations that I do are naturally error
    tolerant, but the error tolerant part used a tiny amount of data, while
    most data was "fragile" (an error there was likely to be detected).

    If you are reaccessing the memory (data or code) frequently, you
    give the ECC a new chance to "refresh" the intended contents of
    that memory (assuming your ECC hardware is configured for write-back operation).

    As I wrote, most computations were on non-ECC machines.

    So, the possibility of a second fault coming along
    while the first fault is still in place (and uncorrected) is small.

    OTOH, if the memory just sits there with the expectation that it
    will retain its intended contents without a chance to be corrected
    (by the ECC hardware), then bit-rot can continue increasing the
    possibility of a second bit failing while the first is still failed.

    Well, if you are concerned you can implement a low priority process
    that will read RAM, possibly doing some extra work (like detecting
    unexpected changes).
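    For what it's worth, such a scrub pass over known-invariant regions
    (e.g., code loaded once into DRAM) might be sketched like this; the
    bytearray-plus-CRC model is illustrative only, not any particular OS API:

```python
# Software "scrub" sketch: checksum invariant regions once after loading,
# then periodically re-checksum them from a low-priority task and report
# any region whose contents have silently changed.
import zlib

def record_baseline(memory, regions):
    """CRC each (start, length) region once, right after loading."""
    return {r: zlib.crc32(memory[r[0]:r[0] + r[1]]) for r in regions}

def scrub_pass(memory, regions, baseline):
    """Re-CRC each region; return the list of regions that changed."""
    return [r for r in regions
            if zlib.crc32(memory[r[0]:r[0] + r[1]]) != baseline[r]]
```

    On real hardware the re-read itself also gives write-back ECC a chance
    to repair correctable errors before a second bit fails in the same word.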

    Remember, you don't even consider this possibility when you are
    executing out of EPROM or NOR FLASH... you just assume bits
    retain their state even when you aren't "looking at them".

    Concerning doing something about memory errors: on the hardware side,
    the devices with DRAM that I use are COTS devices. So the only thing
    I can do is look at the reputation of the vendor, and general reputation
    says nothing about DRAM errors. In other words, there is nothing
    I can realistically do.

    You can select MCUs that *do* have support for ECC instead of just
    "hoping" the (inevitable) memory errors won't bite you. Even so,
    your best case scenario is just SECDED protection.

    When working with MCU-s I have internal SRAM instead of DRAM.
    And in several cases manufacturers claim parity or ECC protection
    for SRAM. But in case of PC-s hardware and OS are commodity.
    Due to price reasons I mostly deal with non-ECC hardware.

    On software side I could try to add
    some extra error tolerance. But I have various consistency checks
    and they indicate various troubles. I deal with detected troubles;
    DRAM errors are not in this category.

    I assume you *test* memory on POST?

    On PC-s that is the BIOS and OS, which I did not write. And AFAIK BIOS
    POST just detects memory size and does a few sanity checks
    to detect gross troubles. But I am not sure if I would call them
    "memory tests"; better memory tests tend to run for days.

    But, if your device runs
    24/7/365, it can be a very long time between POSTs! OTOH, you
    could force a test cycle (either by deliberately restarting
    your device -- "nightly maintenance") or, you could test memory
    WHILE the application is running.

    And, what specific criteria do you use to get alarmed at the results??

    As I wrote, I am doing computations and the criteria are problem
    specific. For example, I have two multidigit numbers and one is supposed
    to exactly divide the other. If not, the software signals an error.
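    That style of end-to-end check can be sketched as follows (the primes
    below are made up for illustration): flipping the low bit of an odd
    divisor makes it even, so it can no longer divide the odd product, and
    the check trips instead of silently propagating a wrong answer.

```python
# Sketch of the exact-division consistency check described above.
# Invariant: b was constructed as a factor of a, so a % b must be 0.

def checked_divide(a, b):
    """Divide a by b, signaling an error unless b exactly divides a."""
    if a % b != 0:
        raise ValueError("consistency check failed: possible memory error")
    return a // b
```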

    I read several papers about DRAM errors and I take seriously the
    possibility that they can happen. But simply at my scale they
    do not seem to matter.

    BTW: It seems that currently a large fraction of errors (both software
    and hardware ones) appear semi-randomly. So, to estimate
    reliability one should use statistical methods. But if you aim
    at high reliability, then the needed sample size may be impractically
    large. You may be able to add mitigations for rare problems that
    you can predict/guess, but you will be left with unpredictable
    ones. In other words, it makes sense to concentrate on problems
    that you see ("you" including your customers). AFAIK some big
    companies have worldwide automatic error reporting systems.
    If you set up such a thing, you may get useful info.

    You can try "error injection", that is, run tests with an extra
    component that simulates memory errors. Then you will have
    some info about the effects:
    - do memory errors cause incorrect operation?
    - is incorrect operation detected?
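    A toy version of that injection experiment, purely illustrative: the
    consistency check deliberately covers only half the buffer, so the
    harness exhibits both outcomes -- detected and silent corruption.

```python
# Minimal fault-injection harness: flip one random bit per trial and see
# whether an application-level consistency check notices.
import random

def inject_bit_flip(buf, rng):
    """Flip one randomly chosen bit in a bytearray, as an injector would."""
    i = rng.randrange(len(buf))
    buf[i] ^= 1 << rng.randrange(8)

def injection_trials(n, size=256, seed=1):
    """Run n trials; the check covers only half the buffer, so roughly
    half of the injected faults should escape detection."""
    rng = random.Random(seed)
    detected = silent = 0
    for _ in range(n):
        buf = bytearray(rng.randrange(256) for _ in range(size))
        checksum = sum(buf[:size // 2])       # check covers half the data
        inject_bit_flip(buf, rng)
        if sum(buf[:size // 2]) != checksum:
            detected += 1                     # check caught the corruption
        else:
            silent += 1                       # corruption escaped the check
    return detected, silent
```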

    --
    Waldek Hebisch

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don Y@21:1/5 to antispam@fricas.org on Tue Sep 17 11:57:56 2024
    On 9/17/2024 9:12 AM, antispam@fricas.org wrote:
    Are you really writing about MCU-s?

    Yes. In particular, the Cortex A series.

    I do not think that Cortex A is an MCU. More precisely, a Cortex A
    core is not intended to be used in MCU-s and I think that most
    chips using it are not MCU-s. The distinguishing feature here is whether
    a chip is intended to be a complete system (MCU) or you are expected
    to _always_ use it with external chips (typical Cortex A based SOC).

    An MCU integrates peripherals into the same package as the CPU.
    By contrast, earlier devices (MPUs) had a family of peripherals
    often assembled around the CPU/MPU.

    An 8051 is an MCU. Yet, it can support external (static) memory.

    The A-series devices support LCD displays, ethernet, serial ports, counter/timers/PWM, internal static memory and FLASH as well as
    controllers for external static and dynamic memory. One can build a
    product with or without the addition of external memory (e.g., the
    SAMA5D34 has 160KB of ROM, 128KB of SRAM, 32K of i/d cache as well
    as an assortment of peripherals that would typically be EXTERNAL
    to an MPU-based design).

    SoCs differ in that they are *intended* to be complete systems
    (yet typically *still* off-board the physical SDRAM).

    At the other end of the spectrum are discrete CPUs -- requiring
    external support chips to implement *anything* of use.

    I think that this estimate is somewhat pessimistic. The last case
    I remember that could be explained by a memory error was about 25 years
    ago: I unpacked the source of a newer gcc version and tried to compile it.
    Compilation failed; I tracked the trouble to a flipped bit in one source
    file. Unpacking the sources again gave the correct value of the affected
    bit (and no other bit changed). And compiling the second copy worked OK.

    You were likely doing this on a PC/workstation, right? Nowadays,
    having ECC *in* the PC is commonplace.

    PC-s, yes (small SBC-s too, but those were not used for really heavy
    computations). Concerning ECC, most PC-s that I used came without ECC.
    IME ECC used to roughly double the price of the PC compared to a
    non-ECC one. So, important servers got ECC, other PC-s were non-ECC.

    If you don't have ECC memory, hardware AND OS, you can never tell if
    you are having memory errors.

    "We made two key observations: First, although most personal-computer
    users blame system failures on software problems (quirks of the
    operating system, browser, and so forth) or maybe on malware infections,
    hardware was the main culprit. At Los Alamos, for instance, more than
    60 percent of machine outages came from hardware issues. Digging further,
    we found that the most common hardware problem was faulty DRAM. This
    meshes with the experience of people operating big data centers, DRAM
    modules being among the most frequently replaced components."

    I remember a memory corruption study several years ago that said that
    software was a significant issue. In particular, bugs in the BIOS and
    Linux kernel led to random-looking memory corruption. Hopefully, the
    issues that they noticed are fixed now. Exact percentages probably do
    not matter much, because both hardware and software are changing.
    The point is that there are both hardware errors and software errors,
    and without deep investigation the software ones are hard to
    distinguish from the hardware ones.

    Systems with ECC make it (relatively) easy to determine if you are having memory errors. Those without it leave you simply guessing as to the
    nature of any "problems" you may experience.

    So, the possibility of a second fault coming along
    while the first fault is still in place (and uncorrected) is small.

    OTOH, if the memory just sits there with the expectation that it
    will retain its intended contents without a chance to be corrected
    (by the ECC hardware), then bit-rot can continue increasing the
    possibility of a second bit failing while the first is still failed.

    Well, if you are concerned you can implement a low priority process
    that will read RAM, possibly doing some extra work (like detecting
    unexpected changes).

    You have to design your system (particularly the OS) with this capability
    in mind. You ("it") needs to know which regions of memory are invariant
    over which intervals.

    OTOH, simple periodic *testing* of memory can reveal hard errors
    (google claimed hard errors to be more prevalent than soft ones).

    But, this just gives you information; you still need mechanisms in
    place that let you "retire" bad sections of memory (up to and
    including ALL memory).
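    One plausible shape for that "retire" mechanism, sketched as a toy
    frame allocator (all names here are hypothetical):

```python
# Frame-retirement bookkeeping sketch: once a frame fails a memory test,
# take it out of the allocator's free pool permanently.

class FramePool:
    def __init__(self, nframes):
        self.free = set(range(nframes))
        self.retired = set()

    def alloc(self):
        """Hand out a free frame, or fail if none remain."""
        if not self.free:
            raise MemoryError("no frames left")
        return self.free.pop()

    def release(self, frame):
        """Return a frame to the pool -- unless it has been retired."""
        if frame in self.retired:
            return                     # never hand a bad frame out again
        self.free.add(frame)

    def retire(self, frame):
        """Permanently remove a frame that failed testing."""
        self.free.discard(frame)
        self.retired.add(frame)
```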

    You can select MCUs that *do* have support for ECC instead of just
    "hoping" the (inevitable) memory errors won't bite you. Even so,
    your best case scenario is just SECDED protection.

    When working with MCU-s I have internal SRAM instead of DRAM.

    SRAM is not without its own problems. As geometries shrink and
    capacities increase, the reliability of memory suffers.

    And in several cases manufacturers claim parity or ECC protection
    for SRAM. But in case of PC-s hardware and OS are commodity.
    Due to price reasons I mostly deal with non-ECC hardware.

    On software side I could try to add
    some extra error tolerance. But I have various consistency checks
    and they indicate various troubles. I deal with detected troubles;
    DRAM errors are not in this category.

    I assume you *test* memory on POST?

    On PC-s that is the BIOS and OS, which I did not write. And AFAIK BIOS
    POST just detects memory size and does a few sanity checks
    to detect gross troubles. But I am not sure if I would call them
    "memory tests"; better memory tests tend to run for days.

    Many machines have "extended diagnostics" (often on a hidden disk
    partition) that will let you run such. Or, install an app to do so.

    You have to know the types of errors you are attempting to detect
    and design your tests to address those. E.g., in the days of 1K and
    4K DRAM, it was not uncommon to verify *refresh* was working properly
    (as one often had to build the DRAM controller out of discrete logic
    so you needed assurance that the refresh aspect was operating as
    intended)
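    For illustration, a march-style pass (roughly March C-) over a
    simulated memory, with a faked stuck-at-1 cell to show the kind of hard
    fault such a pattern catches; real tests of course walk the physical
    address space with patterns matched to the DRAM organization.

```python
# March-style memory test sketch: ascending write/read passes followed by
# descending ones, catching stuck-at and many coupling faults.

def march_test(mem):
    """Return True if the memory passes; False on any mismatch."""
    n = len(mem)
    for i in range(n):                 # up: write all-zeros
        mem[i] = 0x00
    for i in range(n):                 # up: read 0, write 1s
        if mem[i] != 0x00:
            return False
        mem[i] = 0xFF
    for i in range(n):                 # up: read 1s, write 0
        if mem[i] != 0xFF:
            return False
        mem[i] = 0x00
    for i in reversed(range(n)):       # down: read 0, write 1s
        if mem[i] != 0x00:
            return False
        mem[i] = 0xFF
    for i in reversed(range(n)):       # down: read 1s
        if mem[i] != 0xFF:
            return False
    return True

class StuckBitMemory(bytearray):
    """Simulated memory with bit 0 of cell 10 stuck at 1."""
    def __setitem__(self, i, v):
        if i == 10:
            v |= 0x01
        super().__setitem__(i, v)
```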

    But, if your device runs
    24/7/365, it can be a very long time between POSTs! OTOH, you
    could force a test cycle (either by deliberately restarting
    your device -- "nightly maintenance") or, you could test memory
    WHILE the application is running.

    And, what specific criteria do you use to get alarmed at the results??

    As I wrote, I am doing computations and the criteria are problem
    specific. For example, I have two multidigit numbers and one is supposed
    to exactly divide the other. If not, the software signals an error.

    But you still don't know *why* the error came about.

    I read several papers about DRAM errors and I take seriously the
    possibility that they can happen. But simply at my scale they
    do not seem to matter.

    BTW: It seems that currently a large fraction of errors (both software
    and hardware ones) appear semi-randomly. So, to estimate
    reliability one should use statistical methods. But if you aim
    at high reliability, then the needed sample size may be impractically
    large. You may be able to add mitigations for rare problems that
    you can predict/guess, but you will be left with unpredictable

    You can only ever address the known unknowns. Discovery of unknown
    unknowns is serendipitous.

    PC applications (the whole PC environment) are likely more tolerant
    of memory errors; new applications are always loaded over the previous
    one, the user can always rerun an application/calculation that is
    suspect, you can reboot, etc.

    The more demanding scenario is using DRAM as "EPROM" for load once
    applications (appliances).

    ones. In other words, it makes sense to concentrate on problems
    that you see ("you" including your customers). AFAIK some big
    companies have worldwide automatic error reporting systems.
    If you set up such a thing, you may get useful info.

    You can try "error injection", that is, run tests with an extra
    component that simulates memory errors. Then you will have
    some info about the effects:
    - do memory errors cause incorrect operation?
    - is incorrect operation detected?

    There have been studies trying to qualify the "resiliency"
    of different operating systems to memory errors. These
    deliberately corrupt portions of memory ("at random")
    and see if the corruption affects the resulting operation.

    But, there are myriad possible ways memory can fail to yield
    correct data; trying to identify specific cases borders on
    foolhardy.

    Instead, take steps to verify the memory is performing as
    you would expect it to perform -- that it REMEMBERS the
    data entrusted to it. Or, ensure you can rely on it by other
    design choices.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)