• Predictive failures

  • From Don Y@21:1/5 to All on Mon Apr 15 10:13:02 2024
    Is there a general rule of thumb for signalling the likelihood of
    an "imminent" (for some value of "imminent") hardware failure?

    I suspect most would involve *relative* changes that would be
    suggestive of changing conditions in the components (and not
    directly related to environmental influences).

    So, perhaps, a good strategy is to just "watch" everything and
    notice the sorts of changes you "typically" encounter in the hope
    that something of greater magnitude would be a harbinger...

  • From Martin Rid@21:1/5 to Don Y on Mon Apr 15 13:32:41 2024
    Don Y <blockedofcourse@foo.invalid> Wrote in message:
    Is there a general rule of thumb for signalling the likelihood of
    an "imminent" (for some value of "imminent") hardware failure?
    I suspect most would involve *relative* changes that would be
    suggestive of changing conditions in the components (and not
    directly related to environmental influences).
    So, perhaps, a good strategy is to just "watch" everything and
    notice the sorts of changes you "typically" encounter in the hope
    that something of greater magnitude would be a harbinger...

    Current and voltages outside of normal operation?

    Cheers
    --


    ----Android NewsGroup Reader---- https://piaohong.s3-us-west-2.amazonaws.com/usenet/index.html

  • From john larkin@21:1/5 to blockedofcourse@foo.invalid on Mon Apr 15 11:28:13 2024
    On Mon, 15 Apr 2024 10:13:02 -0700, Don Y
    <blockedofcourse@foo.invalid> wrote:

    Is there a general rule of thumb for signalling the likelihood of
    an "imminent" (for some value of "imminent") hardware failure?

    I suspect most would involve *relative* changes that would be
    suggestive of changing conditions in the components (and not
    directly related to environmental influences).

    So, perhaps, a good strategy is to just "watch" everything and
    notice the sorts of changes you "typically" encounter in the hope
    that something of greater magnitude would be a harbinger...

    Checking temperatures is good. An overload or a fan failure can be bad
    news.

    We put temp sensors on most products. Some parts, like ADCs and FPGAs,
    have free built-in temp sensors.

    I have tried various ideas to put an air flow sensor on boards, but so
    far none have worked very well. We do check fan tachs to be sure they
    are still spinning.

    Blocking air flow generally makes fan speed *increase*.
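
    As a rough illustration of that cross-check (a sketch, not anything from the
    post above): a supervisor that already reads a fan tach and a nearby board
    temperature sensor can compare the two trends. Helper names and thresholds
    below are made up.

        # A hedged sketch: flag likely blocked airflow when fan RPM rises
        # noticeably while the board temperature also climbs. Readings come
        # from whatever the platform provides.
        from collections import deque

        WINDOW = 60        # samples kept as the running baseline
        RPM_RISE = 1.10    # 10% above baseline RPM (illustrative)
        TEMP_RISE = 5.0    # degrees C above baseline (illustrative)

        rpm_hist = deque(maxlen=WINDOW)
        temp_hist = deque(maxlen=WINDOW)

        def airflow_suspect(rpm, temp_c):
            """True if the new sample looks like blocked airflow."""
            suspect = False
            if len(rpm_hist) == WINDOW:
                rpm_base = sum(rpm_hist) / WINDOW
                temp_base = sum(temp_hist) / WINDOW
                suspect = rpm > RPM_RISE * rpm_base and temp_c > temp_base + TEMP_RISE
            rpm_hist.append(rpm)
            temp_hist.append(temp_c)
            return suspect

    A fan that speeds up while its sensor keeps warming is more consistent with
    blocked airflow than with a normal load increase.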

  • From Joe Gwinn@21:1/5 to blockedofcourse@foo.invalid on Mon Apr 15 15:41:57 2024
    On Mon, 15 Apr 2024 10:13:02 -0700, Don Y
    <blockedofcourse@foo.invalid> wrote:

    Is there a general rule of thumb for signalling the likelihood of
    an "imminent" (for some value of "imminent") hardware failure?

    I suspect most would involve *relative* changes that would be
    suggestive of changing conditions in the components (and not
    directly related to environmental influences).

    So, perhaps, a good strategy is to just "watch" everything and
    notice the sorts of changes you "typically" encounter in the hope
    that something of greater magnitude would be a harbinger...

    There is a standard approach that may work: Measure the level and
    trend of very low frequency (around a tenth of a Hertz) flicker noise.
    When connections (perhaps within a package) start to fail, the flicker
    level rises. The actual frequency monitored isn't all that critical.

    Joe Gwinn

  • From john larkin@21:1/5 to All on Mon Apr 15 13:05:40 2024
    On Mon, 15 Apr 2024 15:41:57 -0400, Joe Gwinn <joegwinn@comcast.net>
    wrote:

    On Mon, 15 Apr 2024 10:13:02 -0700, Don Y
    <blockedofcourse@foo.invalid> wrote:

    Is there a general rule of thumb for signalling the likelihood of
    an "imminent" (for some value of "imminent") hardware failure?

    I suspect most would involve *relative* changes that would be
    suggestive of changing conditions in the components (and not
    directly related to environmental influences).

    So, perhaps, a good strategy is to just "watch" everything and
    notice the sorts of changes you "typically" encounter in the hope
    that something of greater magnitude would be a harbinger...

    There is a standard approach that may work: Measure the level and
    trend of very low frequency (around a tenth of a Hertz) flicker noise.
    When connections (perhaps within a package) start to fail, the flicker
    level rises. The actual frequency monitored isn't all that critical.

    Joe Gwinn

    Do connections "start to fail" ?

    I don't think I've ever owned a piece of electronic equipment that
    warned me of an impending failure.

    Cars do, for some failure modes, like low oil level.

    Don, what does the thing do?

  • From Edward Rawde@21:1/5 to Don Y on Mon Apr 15 16:32:17 2024
    "Don Y" <blockedofcourse@foo.invalid> wrote in message news:uvjn74$d54b$1@dont-email.me...
    Is there a general rule of thumb for signalling the likelihood of
    an "imminent" (for some value of "imminent") hardware failure?

    My conclusion would be no.
    Some of my reasons are given below.

    It always puzzled me how HAL could know that the AE-35 would fail in the
    near future, but maybe HAL had a motive for lying.

    Back in that era I was doing a lot of repair work when I should have been
    doing my homework.
    So I knew that there were many unrelated kinds of hardware failure.

    A component could fail suddenly, such as a short circuit diode, and
    everything would work fine after replacing it.
    The cause could perhaps have been a manufacturing defect, such as
    insufficient cooling due to poor quality assembly, but the exact real cause would never be known.

    A component could fail suddenly as a side effect of another failure.
    One short circuit output transistor and several other components could also burn up.

    A component could fail slowly and only become apparent when it got to the
    stage of causing an audible or visible effect.
    It would often be easy to locate the dried up electrolytic due to it having already let go of some of its contents.

    So I concluded that if I wanted to be sure that I could always watch my favourite TV show, we would have to have at least two TVs in the house.

    If it's not possible to have the equivalent of two TVs then you will want to
    be in a position to get the existing TV repaired or replaced as quickly as possible.

    My home wireless Internet system doesn't care if one access point fails, and
    I would not expect to be able to do anything to predict a time of failure. Experience says a dead unit has power supply issues. Usually external but
    could be internal.

    I don't think it would be possible to "watch" everything because it's rare
    that you can properly test a component while it's part of a working system.

    These days I would expect to have fun with management asking for software to
    be able to diagnose and report any hardware failure.
    Not very easy if the power supply has died.


    I suspect most would involve *relative* changes that would be
    suggestive of changing conditions in the components (and not
    directly related to environmental influences).

    So, perhaps, a good strategy is to just "watch" everything and
    notice the sorts of changes you "typically" encounter in the hope
    that something of greater magnitude would be a harbinger...


  • From Joe Gwinn@21:1/5 to john larkin on Mon Apr 15 18:03:23 2024
    On Mon, 15 Apr 2024 13:05:40 -0700, john larkin <jl@650pot.com> wrote:

    On Mon, 15 Apr 2024 15:41:57 -0400, Joe Gwinn <joegwinn@comcast.net>
    wrote:

    On Mon, 15 Apr 2024 10:13:02 -0700, Don Y
    <blockedofcourse@foo.invalid> wrote:

    Is there a general rule of thumb for signalling the likelihood of
    an "imminent" (for some value of "imminent") hardware failure?

    I suspect most would involve *relative* changes that would be
    suggestive of changing conditions in the components (and not
    directly related to environmental influences).

    So, perhaps, a good strategy is to just "watch" everything and
    notice the sorts of changes you "typically" encounter in the hope
    that something of greater magnitude would be a harbinger...

    There is a standard approach that may work: Measure the level and
    trend of very low frequency (around a tenth of a Hertz) flicker noise.
    When connections (perhaps within a package) start to fail, the flicker
    level rises. The actual frequency monitored isn't all that critical.

    Joe Gwinn

    Do connections "start to fail" ?

    Yes, they do, in things like vias. I went through a big drama where a
    critical bit of radar logic circuitry would slowly go nuts.

    It turned out that the copper plating on the walls of the vias was
    suffering from low-cycle fatigue during temperature cycling and slowly breaking, one little crack at a time, until it went open. If you
    measured the resistance to parts per million (6.5 digit DMM), sampling
    at 1 Hz, you could see the 1/f noise at 0.1 Hz rising. It's useful to
    also measure a copper line, and divide the via-chain resistance by the
    no-via resistance, to correct for temperature changes.
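
    A minimal sketch of that measurement chain, assuming 1 Hz resistance samples
    of the via chain and of a no-via copper reference; the band edges and alarm
    factor below are illustrative, not part of the method described above.

        # Hedged sketch: take the via/copper resistance ratio to cancel
        # temperature drift, then track the spectral level around 0.1 Hz.
        import numpy as np

        def flicker_level(r_via, r_copper, fs=1.0, f_lo=0.05, f_hi=0.2):
            """Mean periodogram level of the resistance ratio near 0.1 Hz.
            Needs a long record (many minutes of 1 Hz samples) to resolve it."""
            ratio = np.asarray(r_via) / np.asarray(r_copper)
            ratio = ratio - ratio.mean()                 # remove the DC term
            psd = np.abs(np.fft.rfft(ratio)) ** 2 / (fs * len(ratio))
            f = np.fft.rfftfreq(len(ratio), d=1.0 / fs)
            band = (f >= f_lo) & (f <= f_hi)
            return psd[band].mean()

        def flicker_rising(levels, factor=10.0):
            """Crude trend test: latest level well above the history's median."""
            return len(levels) > 10 and levels[-1] > factor * np.median(levels[:-1])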

    The solution was to redesign the vias, mainly to increase the critical
    volume of copper. And modern SMD designs have less and less copper
    volume.

    I bet precision resistors can also be measured this way.


    I don't think I've ever owned a piece of electronic equipment that
    warned me of an impending failure.

    Onset of smoke emission is a common sign.


    Cars do, for some failure modes, like low oil level.

    The industrial method for big stuff is accelerometers attached near
    the bearings, and listen for excessive rotation-correlated (not
    necessarily harmonic) noise.
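
    One way that screening might be sketched, assuming a block of accelerometer
    samples plus an independently measured shaft rate; the order list and
    bandwidth are placeholders, not a standard.

        # Hedged sketch: how much of the vibration power sits in narrow bands
        # tied to the shaft rate (including a sub-multiple, since the noise
        # need not be harmonic).
        import numpy as np

        def rotation_correlated_fraction(accel, fs, shaft_hz,
                                         orders=(0.5, 1, 2, 3), bw=0.5):
            """Fraction of total vibration power near multiples of shaft_hz."""
            x = np.asarray(accel, dtype=float)
            x = x - x.mean()
            psd = np.abs(np.fft.rfft(x)) ** 2
            f = np.fft.rfftfreq(len(x), d=1.0 / fs)
            total = psd.sum()
            near = sum(psd[(f > k * shaft_hz - bw) & (f < k * shaft_hz + bw)].sum()
                       for k in orders)
            return near / total if total else 0.0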


    Don, what does the thing do?

    Good question.

    Joe Gwinn

  • From john larkin@21:1/5 to All on Mon Apr 15 16:26:35 2024
    On Mon, 15 Apr 2024 18:03:23 -0400, Joe Gwinn <joegwinn@comcast.net>
    wrote:

    On Mon, 15 Apr 2024 13:05:40 -0700, john larkin <jl@650pot.com> wrote:

    On Mon, 15 Apr 2024 15:41:57 -0400, Joe Gwinn <joegwinn@comcast.net>
    wrote:

    On Mon, 15 Apr 2024 10:13:02 -0700, Don Y
    <blockedofcourse@foo.invalid> wrote:

    Is there a general rule of thumb for signalling the likelihood of
    an "imminent" (for some value of "imminent") hardware failure?

    I suspect most would involve *relative* changes that would be
    suggestive of changing conditions in the components (and not
    directly related to environmental influences).

    So, perhaps, a good strategy is to just "watch" everything and
    notice the sorts of changes you "typically" encounter in the hope
    that something of greater magnitude would be a harbinger...

    There is a standard approach that may work: Measure the level and
    trend of very low frequency (around a tenth of a Hertz) flicker noise.
    When connections (perhaps within a package) start to fail, the flicker
    level rises. The actual frequency monitored isn't all that critical.

    Joe Gwinn

    Do connections "start to fail" ?

    Yes, they do, in things like vias. I went through a big drama where a
    critical bit of radar logic circuitry would slowly go nuts.

    It turned out that the copper plating on the walls of the vias was
    suffering from low-cycle fatigue during temperature cycling and slowly
    breaking, one little crack at a time, until it went open. If you
    measured the resistance to parts per million (6.5 digit DMM), sampling
    at 1 Hz, you could see the 1/f noise at 0.1 Hz rising. It's useful to
    also measure a copper line, and divide the via-chain resistance by the
    no-via resistance, to correct for temperature changes.

    But nobody is going to monitor every via on a PCB, even if it were
    possible.

    One could instrument a PCB fab test board, I guess. But DC tests would
    be fine.

    We have one board with over 4000 vias, but they are mostly in
    parallel.



    The solution was to redesign the vias, mainly to increase the critical
    volume of copper. And modern SMD designs have less and less copper
    volume.

    I bet precision resistors can also be measured this way.


    I don't think I've ever owned a piece of electronic equipment that
    warned me of an impending failure.

    Onset of smoke emission is a common sign.


    Cars do, for some failure modes, like low oil level.

    The industrial method for big stuff is accelerometers attached near
    the bearings, and listen for excessive rotation-correlated (not
    necessarily harmonic) noise.

    Big ships that I've worked on have a long propeller shaft in the shaft
    alley, a long tunnel where nobody often goes. They have magnetic shaft
    runout sensors and shaft bearing temperature monitors.

    They measure shaft torque and SHP too, from the shaft twist.

    I liked hiding out in the shaft alley. It was private and cool, that
    giant shaft slowly rotating.

  • From Phil Hobbs@21:1/5 to Joe Gwinn on Tue Apr 16 00:51:03 2024
    Joe Gwinn <joegwinn@comcast.net> wrote:
    On Mon, 15 Apr 2024 13:05:40 -0700, john larkin <jl@650pot.com> wrote:

    On Mon, 15 Apr 2024 15:41:57 -0400, Joe Gwinn <joegwinn@comcast.net>
    wrote:

    On Mon, 15 Apr 2024 10:13:02 -0700, Don Y
    <blockedofcourse@foo.invalid> wrote:

    Is there a general rule of thumb for signalling the likelihood of
    an "imminent" (for some value of "imminent") hardware failure?

    I suspect most would involve *relative* changes that would be
    suggestive of changing conditions in the components (and not
    directly related to environmental influences).

    So, perhaps, a good strategy is to just "watch" everything and
    notice the sorts of changes you "typically" encounter in the hope
    that something of greater magnitude would be a harbinger...

    There is a standard approach that may work: Measure the level and
    trend of very low frequency (around a tenth of a Hertz) flicker noise.
    When connections (perhaps within a package) start to fail, the flicker
    level rises. The actual frequency monitored isn't all that critical.

    Joe Gwinn

    Do connections "start to fail" ?

    Yes, they do, in things like vias. I went through a big drama where a critical bit of radar logic circuitry would slowly go nuts.

    It turned out that the copper plating on the walls of the vias was
    suffering from low-cycle fatigue during temperature cycling and slowly breaking, one little crack at a time, until it went open. If you
    measured the resistance to parts per million (6.5 digit DMM), sampling
    at 1 Hz, you could see the 1/f noise at 0.1 Hz rising. It's useful to
    also measure a copper line, and divide the via-chain resistance by the
    no-via resistance, to correct for temperature changes.

    The solution was to redesign the vias, mainly to increase the critical
    volume of copper. And modern SMD designs have less and less copper
    volume.

    I bet precision resistors can also be measured this way.


    I don't think I've ever owned a piece of electronic equipment that
    warned me of an impending failure.

    Onset of smoke emission is a common sign.


    Cars do, for some failure modes, like low oil level.

    The industrial method for big stuff is accelerometers attached near
    the bearings, and listen for excessive rotation-correlated (not
    necessarily harmonic) noise.

    There are a number of instruments available that look for metal particles
    in the lubricating oil.

    Cheers

    Phil Hobbs




    --
    Dr Philip C D Hobbs
    Principal Consultant
    ElectroOptical Innovations LLC / Hobbs ElectroOptics
    Optics, Electro-optics, Photonics, Analog Electronics

  • From John Larkin@21:1/5 to pcdhSpamMeSenseless@electrooptical. on Mon Apr 15 19:17:39 2024
    On Tue, 16 Apr 2024 00:51:03 -0000 (UTC), Phil Hobbs <pcdhSpamMeSenseless@electrooptical.net> wrote:

    Joe Gwinn <joegwinn@comcast.net> wrote:
    On Mon, 15 Apr 2024 13:05:40 -0700, john larkin <jl@650pot.com> wrote:

    On Mon, 15 Apr 2024 15:41:57 -0400, Joe Gwinn <joegwinn@comcast.net>
    wrote:

    On Mon, 15 Apr 2024 10:13:02 -0700, Don Y
    <blockedofcourse@foo.invalid> wrote:

    Is there a general rule of thumb for signalling the likelihood of
    an "imminent" (for some value of "imminent") hardware failure?

    I suspect most would involve *relative* changes that would be
    suggestive of changing conditions in the components (and not
    directly related to environmental influences).

    So, perhaps, a good strategy is to just "watch" everything and
    notice the sorts of changes you "typically" encounter in the hope
    that something of greater magnitude would be a harbinger...

    There is a standard approach that may work: Measure the level and
    trend of very low frequency (around a tenth of a Hertz) flicker noise.
    When connections (perhaps within a package) start to fail, the flicker
    level rises. The actual frequency monitored isn't all that critical.

    Joe Gwinn

    Do connections "start to fail" ?

    Yes, they do, in things like vias. I went through a big drama where a
    critical bit of radar logic circuitry would slowly go nuts.

    It turned out that the copper plating on the walls of the vias was
    suffering from low-cycle fatigue during temperature cycling and slowly
    breaking, one little crack at a time, until it went open. If you
    measured the resistance to parts per million (6.5 digit DMM), sampling
    at 1 Hz, you could see the 1/f noise at 0.1 Hz rising. It's useful to
    also measure a copper line, and divide the via-chain resistance by the
    no-via resistance, to correct for temperature changes.

    The solution was to redesign the vias, mainly to increase the critical
    volume of copper. And modern SMD designs have less and less copper
    volume.

    I bet precision resistors can also be measured this way.


    I don't think I've ever owned a piece of electronic equipment that
    warned me of an impending failure.

    Onset of smoke emission is a common sign.


    Cars do, for some failure modes, like low oil level.

    The industrial method for big stuff is accelerometers attached near
    the bearings, and listen for excessive rotation-correlated (not
    necessarily harmonic) noise.

    There are a number of instruments available that look for metal particles
    in the lubricating oil.

    Cheers

    Phil Hobbs



    And water. Some of our capacitor simulators include a parallel
    resistance component.

    One customer used to glue bits of metal onto a string and pull it
    through the magnetic sensor. We did a simulator for that too.

    Jet engines have magnetic eddy-current blade-tip sensors. For
    efficiency, they want a tiny clearance between fan blades and the
    casing, but not too tiny.

  • From Don Y@21:1/5 to Martin Rid on Mon Apr 15 19:19:06 2024
    On 4/15/2024 10:32 AM, Martin Rid wrote:
    Don Y <blockedofcourse@foo.invalid> Wrote in message:
    Is there a general rule of thumb for signalling the likelihood of
    an "imminent" (for some value of "imminent") hardware failure?
    I suspect most would involve *relative* changes that would be
    suggestive of changing conditions in the components (and not
    directly related to environmental influences).
    So, perhaps, a good strategy is to just "watch" everything and
    notice the sorts of changes you "typically" encounter in the hope
    that something of greater magnitude would be a harbinger...

    Current and voltages outside of normal operation?

    I think "outside" is (often) likely indicative of
    "something is (already) broken".

    But, perhaps TRENDS in either/both can be predictive.

    E.g., if a (sub)circuit has always been consuming X (which
    is nominal for the design) and, over time, starts to consume
    1.1X, is that suggestive that something is in the process of
    failing?
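
    A minimal sketch of that kind of relative-trend watch, assuming a
    periodically sampled supply current; the smoothing constants and the 10%
    figure are taken from the example above only as illustration.

        # Hedged sketch: a slow average stands in for early-life behaviour,
        # a fast average for recent behaviour, and a sustained ~10% gap
        # between the two raises the flag.
        class DriftWatch:
            def __init__(self, alpha_slow=0.001, alpha_fast=0.05, ratio=1.10):
                self.slow = None                # long-term baseline
                self.fast = None                # recent behaviour
                self.alpha_slow = alpha_slow
                self.alpha_fast = alpha_fast
                self.ratio = ratio

            def update(self, current_a):
                """Feed one current sample; True means 'trending high'."""
                if self.slow is None:
                    self.slow = self.fast = current_a
                self.slow += self.alpha_slow * (current_a - self.slow)
                self.fast += self.alpha_fast * (current_a - self.fast)
                return self.fast > self.ratio * self.slow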

    Note that the goal is not to troubleshoot the particular design
    or its components but, rather, act as an early warning that
    maintenance may be required (or, that performance may not be
    what you are expecting/have become accustomed to).

    You can include mechanisms to verify outputs are what you
    *intended* them to be (in case the output drivers have shit
    the bed).

    You can, also, do sanity checks that ensure values are never
    what they SHOULDN'T be (this is commonly done within software
    products -- if something "can't happen" then noticing that
    it IS happening is a sure-fire indication that something
    is broken!)

    [Limit switches on mechanisms are there to ensure the impossible
    is not possible -- like driving a mechanism beyond its extents]

    And, where possible, notice second-hand effects of your actions
    (e.g., if you switched on a load, you should see an increase
    in supplied current).

    But, again, these are more helpful in detecting FAILED items.
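
    For the switched-load example above, a sketch of the side-effect check might
    look like this; switch_load() and read_supply_current() are hypothetical
    hardware hooks, and the expected step and tolerance are per-design numbers.

        # Hedged sketch: confirm an actuation by its effect on supply current.
        import time

        def verify_load_switch(switch_load, read_supply_current,
                               expected_step_a, tol=0.3, settle_s=0.05):
            """Turn the load on and check the current rises by roughly the
            expected amount. True means the side effect was observed."""
            before = read_supply_current()
            switch_load(True)
            time.sleep(settle_s)                 # let the rail settle
            delta = read_supply_current() - before
            return abs(delta - expected_step_a) <= tol * expected_step_a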

  • From Edward Rawde@21:1/5 to Don Y on Mon Apr 15 23:33:34 2024
    "Don Y" <blockedofcourse@foo.invalid> wrote in message news:uvkn71$ngqi$2@dont-email.me...
    On 4/15/2024 10:32 AM, Martin Rid wrote:
    Don Y <blockedofcourse@foo.invalid> Wrote in message:
    Is there a general rule of thumb for signalling the likelihood of
    an "imminent" (for some value of "imminent") hardware failure?
    I suspect most would involve *relative* changes that would be
    suggestive of changing conditions in the components (and not
    directly related to environmental influences).
    So, perhaps, a good strategy is to just "watch" everything and
    notice the sorts of changes you "typically" encounter in the hope
    that something of greater magnitude would be a harbinger...

    Current and voltages outside of normal operation?

    I think "outside" is (often) likely indicative of
    "something is (already) broken".

    But, perhaps TRENDS in either/both can be predictive.

    E.g., if a (sub)circuit has always been consuming X (which
    is nominal for the design) and, over time, starts to consume
    1.1X, is that suggestive that something is in the process of
    failing?

    That depends on many other unknown factors.
    Temperature sensors are common in electronics.
    So is current sensing. Voltage sensing too.


    Note that the goal is not to troubleshoot the particular design
    or its components but, rather, act as an early warning that
    maintenance may be required (or, that performance may not be
    what you are expecting/have become accustomed to).

    If the system is electronic then you can detect whether currents and/or
    voltages are within expected ranges.
    If they are just a little out of expected range then you might turn on a warning LED.
    If they are way out of range then you might tell the power supply to turn
    off quick.
    By all means tell the software what has happened, but don't put software between the current sensor and the emergency turn off.
    Be aware that components in monitoring circuits can fail too.


    You can include mechanisms to verify outputs are what you
    *intended* them to be (in case the output drivers have shit
    the bed).

    You can, also, do sanity checks that ensure values are never
    what they SHOULDN'T be (this is commonly done within software
    products -- if something "can't happen" then noticing that
    it IS happening is a sure-fire indication that something
    is broken!)

    [Limit switches on mechanisms are there to ensure the impossible
    is not possible -- like driving a mechanism beyond its extents]

    And, where possible, notice second-hand effects of your actions
    (e.g., if you switched on a load, you should see an increase
    in supplied current).

    But, again, these are more helpful in detecting FAILED items.

    What system would you like to have early warnings for?
    Are the warnings needed to indicate operation out of expected limits or to indicate that maintenance is required, or both?
    Without detailed knowledge of the specific system, only speculative answers
    can be given.




  • From Don Y@21:1/5 to Edward Rawde on Mon Apr 15 20:20:55 2024
    On 4/15/2024 1:32 PM, Edward Rawde wrote:
    "Don Y" <blockedofcourse@foo.invalid> wrote in message news:uvjn74$d54b$1@dont-email.me...
    Is there a general rule of thumb for signalling the likelihood of
    an "imminent" (for some value of "imminent") hardware failure?

    My conclusion would be no.
    Some of my reasons are given below.

    It always puzzled me how HAL could know that the AE-35 would fail in the
    near future, but maybe HAL had a motive for lying.

    Why does your PC retry failed disk operations? If I ask the drive to give
    me LBA 1234, shouldn't it ALWAYS give me LBA 1234? Without any data corruption (CRC error) AND within the normal access time limits defined by the location
    of those magnetic domains on the rotating medium?

    Why should it attempt to retry this MORE than once?

    Now, if you knew your disk drive was repeatedly retrying operations,
    would your confidence in it be unchanged from times when it did not
    exhibit such behavior?
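
    Drives do expose some of this history through SMART counters. A hedged
    sketch of polling a few of them with smartmontools follows; attribute names
    vary by vendor, and smartctl usually needs elevated privileges.

        # Hedged sketch: flag any upward movement in retry/relocation counters.
        import subprocess

        WATCH = ("Reallocated_Sector_Ct", "Current_Pending_Sector",
                 "UDMA_CRC_Error_Count")

        def smart_raw_values(dev="/dev/sda"):
            out = subprocess.run(["smartctl", "-A", dev],
                                 capture_output=True, text=True).stdout
            vals = {}
            for line in out.splitlines():
                parts = line.split()
                if len(parts) >= 10 and parts[1] in WATCH:
                    vals[parts[1]] = int(parts[9])   # RAW_VALUE column
            return vals

        def worsened(prev, cur):
            """Names of watched counters that have increased since last poll."""
            return [k for k in WATCH if cur.get(k, 0) > prev.get(k, 0)]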

    Assuming you have properly configured an EIA232 interface, why would you
    ever get a parity error? (OVERRUN errors can be the result of an i/f
    that is running too fast for the system on the receiving end) How would
    you even KNOW this was happening?

    I suspect everyone who has owned a DVD/CD drive has encountered a
    "slow tray" as the mechanism aged. Or, a tray that wouldn't
    open (of its own accord) as soon/quickly as it used to.

    The controller COULD be watching this (cuz it knows when it
    initiated the operation and there is an "end-of-stroke"
    sensor available) and KNOW that the drive belt was stretching
    to the point where it was impacting operation.

    [And, that a stretched belt wasn't going to suddenly decide to
    unstretch to fix the problem!]

    Back in that era I was doing a lot of repair work when I should have been doing my homework.
    So I knew that there were many unrelated kinds of hardware failure.

    The goal isn't to predict ALL failures but, rather, to anticipate
    LIKELY failures and treat them before they become an inconvenience
    (or worse).

    One morning, the (gas) furnace repeatedly tried to light as the
    thermostat called for heat. Then, a few moments later, the
    safeties would kick in and shut down the gas flow. This attracted my
    attention as the LIT furnace should STAY LIT!

    The furnace was too stupid to notice its behavior so would repeat
    this cycle, endlessly.

    I stepped in and overrode the thermostat to eliminate the call
    for heat as this behavior couldn't be productive (if something
    truly IS wrong, then why let it continue? and, if there is nothing
    wrong with the controls/mechanism, then clearly it is unable to meet
    my needs so why let it persist in trying?)

    [Turns out, there was a city-wide gas shortage so there was enough
    gas available to light the furnace but not enough to bring it up to
    temperature as quickly as the designers had expected]

    A component could fail suddenly, such as a short circuit diode, and everything would work fine after replacing it.
    The cause could perhaps have been a manufacturing defect, such as insufficient cooling due to poor quality assembly, but the exact real cause would never be known.

    You don't care about the real cause. Or, even the failure mode.
    You (as user) just don't want to be inconvenienced by the sudden
    loss of the functionality/convenience that the device provided.

    A component could fail suddenly as a side effect of another failure.
    One short circuit output transistor and several other components could also burn up.

    So, if you could predict the OTHER failure...
    Or, that such a failure might occur and lead to the followup failure...

    A component could fail slowly and only become apparent when it got to the stage of causing an audible or visible effect.

    But, likely, there was something observable *in* the circuit that
    just hadn't made it to the level of human perception.

    It would often be easy to locate the dried up electrolytic due to it having already let go of some of its contents.

    So I concluded that if I wanted to be sure that I could always watch my favourite TV show, we would have to have at least two TVs in the house.

    If it's not possible to have the equivalent of two TVs then you will want to be in a position to get the existing TV repaired or replaced as quickly as possible.

    Two TVs are affordable. Consider two controllers for a wire-EDM machine.

    Or, the cost of having that wire-EDM machine *idle* (because you didn't
    have a spare controller!)

    My home wireless Internet system doesn't care if one access point fails, and I would not expect to be able to do anything to predict a time of failure. Experience says a dead unit has power supply issues. Usually external but could be internal.

    Again, the goal isn't to predict "time of failure". But, rather, to be
    able to know that "this isn't going to end well" -- with some advance notice that allows for preemptive action to be taken (and not TOO much advance
    notice that the user ends up replacing items prematurely).

    I don't think it would be possible to "watch" everything because it's rare that you can properly test a component while it's part of a working system.

    You don't have to -- as long as you can observe its effects on other
    parts of the system. E.g., there's no easy/inexpensive way to
    check to see how much the belt on that CD/DVD player has stretched.
    But, you can notice that it HAS stretched (or, some less likely
    change has occurred that similarly interferes with the tray's actions)
    by noting how the activity that it is used for has changed.

    These days I would expect to have fun with management asking for software to be able to diagnose and report any hardware failure.
    Not very easy if the power supply has died.

    What if the power supply HASN'T died? What if you are diagnosing the
    likely upcoming failure *of* the power supply?

    You have ECC memory in most (larger) machines. Do you silently
    expect it to just fix all the errors? Does it have a way of telling you
    how many such errors it HAS corrected? Can you infer the number of
    errors that it *hasn't*?

    [Why have ECC at all?]
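
    Where the operating system exposes the ECC statistics (Linux EDAC does,
    under /sys/devices/system/edac/mc/), the corrected-error rate can be
    watched rather than silently discarded; a rough sketch, with the path and
    polling interval as assumptions:

        # Hedged sketch: poll the corrected-error count and trend its rate.
        import glob
        import time

        def corrected_errors():
            total = 0
            for path in glob.glob("/sys/devices/system/edac/mc/mc*/ce_count"):
                with open(path) as f:
                    total += int(f.read().strip())
            return total

        def ce_rate(interval_s=3600):
            """Corrected errors per second over one polling interval."""
            before = corrected_errors()
            time.sleep(interval_s)
            return (corrected_errors() - before) / interval_s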

    There are (and have been) many efforts to *predict* lifetimes of
    components (and, systems). And, some work to examine the state
    of systems /in situ/ with an eye towards anticipating their
    likelihood of future failure.

    [The former has met with poor results -- predicting the future
    without a position in its past is difficult. And, knowing how
    a device is "stored" when not powered on also plays a role
    in its future survival! (is there some reason YOUR devices
    can't power themselves on, periodically; notice the environmental
    conditions; log them and then power back off)]

    The question is one of a practical nature; how much does it cost
    you to add this capability to a device and how accurately can it
    make those predictions (thus avoiding some future cost/inconvenience).

    For small manufacturers, the research required is likely not cost-effective; just take your best stab at it and let the customer "buy a replacement"
    when the time comes (hopefully, outside of your warranty window).

    But, anything you can do to minimize this TCO issue gives your product
    an edge over competitors. Given that most devices are smart, nowadays,
    it seems obvious that they should undertake as much of this task as
    they can (conveniently) afford.

    <https://www.sciencedirect.com/science/article/abs/pii/S0026271409003667>

    <https://www.researchgate.net/publication/3430090_In_Situ_Temperature_Measurement_of_a_Notebook_Computer-A_Case_Study_in_Health_and_Usage_Monitoring_of_Electronics>

    <https://www.tandfonline.com/doi/abs/10.1080/16843703.2007.11673148>

    <https://www.prognostics.umd.edu/calcepapers/02_V.Shetty_remaingLifeAssesShuttleRemotemanipulatorSystem_22ndSpaceSimulationConf.pdf>

    <https://ieeexplore.ieee.org/document/1656125>

    <https://journals.sagepub.com/doi/10.1177/0142331208092031>

    [Sorry, I can't publish links to the full articles]

  • From Edward Rawde@21:1/5 to Don Y on Tue Apr 16 00:14:13 2024
    "Don Y" <blockedofcourse@foo.invalid> wrote in message news:uvkqqu$o5co$1@dont-email.me...
    On 4/15/2024 1:32 PM, Edward Rawde wrote:
    "Don Y" <blockedofcourse@foo.invalid> wrote in message
    news:uvjn74$d54b$1@dont-email.me...
    Is there a general rule of thumb for signalling the likelihood of
    an "imminent" (for some value of "imminent") hardware failure?

    My conclusion would be no.
    Some of my reasons are given below.

    It always puzzled me how HAL could know that the AE-35 would fail in the
    near future, but maybe HAL had a motive for lying.

    Why does your PC retry failed disk operations?

    Because the software designer didn't understand hardware.
    The correct approach is to mark that part of the disk as unusable and, if possible, move any data from it elsewhere quick.

    If I ask the drive to give me LBA 1234, shouldn't it ALWAYS give me
    LBA 1234? Without any data corruption (CRC error) AND within the normal
    access time limits defined by the location of those magnetic domains on
    the rotating medium?

    Why should it attempt to retry this MORE than once?

    Now, if you knew your disk drive was repeatedly retrying operations,
    would your confidence in it be unchanged from times when it did not
    exhibit such behavior?

    I'd have put an SSD in by now, along with an off site backup of the same
    data :)


    Assuming you have properly configured an EIA232 interface, why would you
    ever get a parity error? (OVERRUN errors can be the result of an i/f
    that is running too fast for the system on the receiving end) How would
    you even KNOW this was happening?

    I suspect everyone who has owned a DVD/CD drive has encountered a
    "slow tray" as the mechanism aged. Or, a tray that wouldn't
    open (of its own accord) as soon/quickly as it used to.

    If it hasn't been used for some time then I'm ready with a tiny screwdriver blade to help it open.
    But I forget when I last used an optical drive.


    The controller COULD be watching this (cuz it knows when it
    initiated the operation and there is an "end-of-stroke"
    sensor available) and KNOW that the drive belt was stretching
    to the point where it was impacting operation.

    [And, that a stretched belt wasn't going to suddenly decide to
    unstretch to fix the problem!]

    Back in that era I was doing a lot of repair work when I should have been
    doing my homework.
    So I knew that there were many unrelated kinds of hardware failure.

    The goal isn't to predict ALL failures but, rather, to anticipate
    LIKELY failures and treat them before they become an inconvenience
    (or worse).

    One morning, the (gas) furnace repeatedly tried to light as the
    thermostat called for heat. Then, a few moments later, the
    safeties would kick in and shut down the gas flow. This attracted my attention as the LIT furnace should STAY LIT!

    The furnace was too stupid to notice its behavior so would repeat
    this cycle, endlessly.

    I stepped in and overrode the thermostat to eliminate the call
    for heat as this behavior couldn't be productive (if something
    truly IS wrong, then why let it continue? and, if there is nothing
    wrong with the controls/mechanism, then clearly it is unable to meet
    my needs so why let it persist in trying?)

    [Turns out, there was a city-wide gas shortage so there was enough
    gas available to light the furnace but not enough to bring it up to temperature as quickly as the designers had expected]

    That's why the furnace designers couldn't have anticipated it.
    They did not know that such a condition might occur so never tested for it.


    A component could fail suddenly, such as a short circuit diode, and
    everything would work fine after replacing it.
    The cause could perhaps have been a manufacturing defect, such as
    insufficient cooling due to poor quality assembly, but the exact real
    cause
    would never be known.

    You don't care about the real cause. Or, even the failure mode.
    You (as user) just don't want to be inconvenienced by the sudden
    loss of the functionality/convenience that the device provided.

    There will always be sudden unexpected loss of functionality for reasons
    which could not easily be predicted.
    People who service lawn mowers in the area where I live are very busy right now.


    A component could fail suddenly as a side effect of another failure.
    One short circuit output transistor and several other components could
    also
    burn up.

    So, if you could predict the OTHER failure...
    Or, that such a failure might occur and lead to the followup failure...

    A component could fail slowly and only become apparent when it got to the
    stage of causing an audible or visible effect.

    But, likely, there was something observable *in* the circuit that
    just hadn't made it to the level of human perception.

    Yes a power supply ripple detection circuit could have turned on a warning
    LED but that never happened for at least two reasons.
    1. The detection circuit would have increased the cost of the equipment and thus diminished the profit of the manufacturer.
    2. The user would not have understood and would have ignored the warning anyway.


    It would often be easy to locate the dried up electrolytic due to it
    having
    already let go of some of its contents.

    So I concluded that if I wanted to be sure that I could always watch my
    favourite TV show, we would have to have at least two TVs in the house.

    If it's not possible to have the equivalent of two TVs then you will want
    to
    be in a position to get the existing TV repaired or replaced as quickly as
    possible.

    Two TVs are affordable. Consider two controllers for a wire-EDM machine.

    Or, the cost of having that wire-EDM machine *idle* (because you didn't
    have a spare controller!)

    My home wireless Internet system doesn't care if one access point fails,
    and
    I would not expect to be able to do anything to predict a time of
    failure.
    Experience says a dead unit has power supply issues. Usually external but
    could be internal.

    Again, the goal isn't to predict "time of failure". But, rather, to be
    able to know that "this isn't going to end well" -- with some advance
    notice
    that allows for preemptive action to be taken (and not TOO much advance notice that the user ends up replacing items prematurely).

    Get feedback from the people who use your equipment.


    I don't think it would be possible to "watch" everything because it's
    rare
    that you can properly test a component while it's part of a working
    system.

    You don't have to -- as long as you can observe its effects on other
    parts of the system. E.g., there's no easy/inexpensive way to
    check to see how much the belt on that CD/DVD player has stretched.
    But, you can notice that it HAS stretched (or, some less likely
    change has occurred that similarly interferes with the tray's actions)
    by noting how the activity that it is used for has changed.

    Sure but you have to be the operator for that.
    So you can be ready to help the tray open when needed.


    These days I would expect to have fun with management asking for software
    to
    be able to diagnose and report any hardware failure.
    Not very easy if the power supply has died.

    What if the power supply HASN'T died? What if you are diagnosing the
    likely upcoming failure *of* the power supply?

    Then I probably can't, because the power supply may be just a bought in
    power supply which was never designed with upcoming failure detection in
    mind.


    You have ECC memory in most (larger) machines. Do you silently
    expect it to just fix all the errors? Does it have a way of telling you
    how many such errors it HAS corrected? Can you infer the number of
    errors that it *hasn't*?

    [Why have ECC at all?]

    Things are sometimes done the way they've always been done.
    I used to notice a missing chip in the 9th position but now you mention it
    the RAM I just looked at has 9 chips each side.


    There are (and have been) many efforts to *predict* lifetimes of
    components (and, systems). And, some work to examine the state
    of systems /in situ/ with an eye towards anticipating their
    likelihood of future failure.

    I'm sure that's true.


    [The former has met with poor results -- predicting the future
    without a position in its past is difficult. And, knowing how
    a device is "stored" when not powered on also plays a role
    in its future survival! (is there some reason YOUR devices
    can't power themselves on, periodically; notice the environmental
    conditions; log them and then power back off)]

    The question is one of a practical nature; how much does it cost
    you to add this capability to a device and how accurately can it
    make those predictions (thus avoiding some future cost/inconvenience).

    For small manufacturers, the research required is likely not
    cost-effective;
    just take your best stab at it and let the customer "buy a replacement"
    when the time comes (hopefully, outside of your warranty window).

    But, anything you can do to minimize this TCO issue gives your product
    an edge over competitors. Given that most devices are smart, nowadays,
    it seems obvious that they should undertake as much of this task as
    they can (conveniently) afford.

    <https://www.sciencedirect.com/science/article/abs/pii/S0026271409003667>

    <https://www.researchgate.net/publication/3430090_In_Situ_Temperature_Measurement_of_a_Notebook_Computer-A_Case_Study_in_Health_and_Usage_Monitoring_of_Electronics>

    <https://www.tandfonline.com/doi/abs/10.1080/16843703.2007.11673148>

    <https://www.prognostics.umd.edu/calcepapers/02_V.Shetty_remaingLifeAssesShuttleRemotemanipulatorSystem_22ndSpaceSimulationConf.pdf>

    <https://ieeexplore.ieee.org/document/1656125>

    <https://journals.sagepub.com/doi/10.1177/0142331208092031>

    [Sorry, I can't publish links to the full articles]


  • From Don Y@21:1/5 to Edward Rawde on Mon Apr 15 22:32:04 2024
    On 4/15/2024 8:33 PM, Edward Rawde wrote:

    [Shouldn't that be Edwar D rawdE?]

    Current and voltages outside of normal operation?

    I think "outside" is (often) likely indicative of
    "something is (already) broken".

    But, perhaps TRENDS in either/both can be predictive.

    E.g., if a (sub)circuit has always been consuming X (which
    is nominal for the design) and, over time, starts to consume
    1.1X, is that suggestive that something is in the process of
    failing?

    That depends on many other unknown factors.
    Temperature sensors are common in electronics.
    So is current sensing. Voltage sensing too.

    Sensors cost money. And, HAVING data but not knowing how to
    USE it is a wasted activity (and cost).

    Why not monitor every node in the schematic and compare
    them (with dedicated hardware -- that is ALSO monitored??)
    with expected operational limits?

    Then, design some network to weight the individual
    observations to make the prediction?
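
    A toy sketch of such a weighting network, assuming per-sensor statistics
    learned from known-good operation; the weights themselves are the hard part
    and are placeholders here.

        # Hedged sketch: each reading becomes a z-score against its own
        # known-good statistics, and the scores are combined with
        # per-sensor weights into a single health number.
        def health_score(readings, stats, weights):
            """readings/stats/weights are dicts keyed by sensor name;
            stats holds (mean, std) learned during known-good operation."""
            if not readings:
                return 0.0
            score = 0.0
            wsum = 0.0
            for name, value in readings.items():
                mean, std = stats[name]
                z = abs(value - mean) / std if std else 0.0
                w = weights.get(name, 1.0)
                score += w * z
                wsum += w
            return score / wsum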

    Note that the goal is not to troubleshoot the particular design
    or its components but, rather, act as an early warning that
    maintenance may be required (or, that performance may not be
    what you are expecting/have become accustomed to).

    If the system is electronic then you can detect whether currents and/or voltages are within expected ranges.
    If they are just a little out of expected range then you might turn on a warning LED.
    If they are way out of range then you might tell the power supply to turn
    off quick.
    By all means tell the software what has happened, but don't put software between the current sensor and the emergency turn off.

    Again, the goal is to be an EARLY warning, not an "Oh, Shit! Kill the power!!"

    As such, software is invaluable as designing PREDICTIVE hardware is
    harder than designing predictive software (algorithms).

    You don't want to tell the user "The battery in your smoke detector
    is NOW dead (leaving you vulnerable)" but, rather, "The battery in
    your smoke detector WILL cease to be able to provide the power necessary
    for the smoke detector to provide the level of protection that you
    desire."

    And, the WAY that you inform the user has to be "productive/useful".
    A smoke detector beeping every minute is likely to find itself unplugged, leading to exactly the situation that the alert was trying to avoid!

    A smoke detector that beeps once a day risks not being heard
    (what if the occupant "works nights"?). A smoke detector
    that beeps a month in advance of the anticipated failure (and
    requires acknowledgement) risks being forgotten -- until
    it is forced to beep more persistently (see above).

    Be aware that components in monitoring circuits can fail too.

    Which is why hardware interlocks are physical switches -- yet
    can only be used to protect against certain types of faults
    (those that are most costly -- injury or loss of life)

    But, again, these are more helpful in detecting FAILED items.

    What system would you like to have early warnings for?
    Are the warnings needed to indicate operation out of expected limits or to indicate that maintenance is required, or both?
    Without detailed knowledge of the specific system, only speculative answers can be given.

    I'm not looking for speculation. I'm looking for folks who have DONE
    such things (designing to speculation is more expensive than just letting
    the devices fail when they need to fail!)

    E.g., when making tablets, it is possible that a bit of air will
    get trapped in the granulation during compression. This is dependent
    on a lot of factors -- tablet dimensions, location in the die
    where the compression event is happening, characteristics of the
    granulation, geometry/condition of the tooling, etc.

    But, if this happens, some tens of milliseconds later, the top will "pop"
    off the tablet. It now is cosmetically damaged as well as likely out
    of specification (amount of "active" present in the dose). You want
    to either be able to detect this (100% of the time on 100% of the tablets)
    and dynamically discard those units (and only those units!). *OR*,
    identify the characteristics of the process that most affect this condition
    and *monitor* for them to AVOID the problem.

    If that means replacing your tooling more frequently (expensive!),
    it can save money in the long run (imagine having to "sort" through
    a million tablets each hour to determine if any have popped like this?)
    Or, throttling down the press so the compression events are "slower"
    (more gradual). Or, moving the event up in the die to provide
    a better egress for the trapped air. Or...

    TELLING the user that this is happening (or likely to happen, soon)
    has real $$$ value. Even better if your device can LEARN which
    tablets and conditions will likely lead to this -- and when!

  • From Don Y@21:1/5 to Edward Rawde on Mon Apr 15 22:40:29 2024
    On 4/15/2024 9:14 PM, Edward Rawde wrote:
    It always puzzled me how HAL could know that the AE-35 would fail in the
    near future, but maybe HAL had a motive for lying.

    Why does your PC retry failed disk operations?

    Because the software designer didn't understand hardware.

    Actually, he DID understand the hardware which is why he retried
    it instead of ASSUMING every operation would proceed correctly.

    [Why bother testing the result code if you never expect a failure?]

    The correct approach is to mark that part of the disk as unusable and, if possible, move any data from it elsewhere quick.

    That only makes sense if the error is *persistent*. "Shit
    happens" and you can get an occasional failed operation when
    nothing is truly "broken".

    (how do you know the HBA isn't the culprit?)

    If I ask the drive to give me LBA 1234, shouldn't it ALWAYS give me
    LBA 1234? Without any data corruption (CRC error) AND within the normal
    access time limits defined by the location of those magnetic domains on
    the rotating medium?

    Why should it attempt to retry this MORE than once?

    Now, if you knew your disk drive was repeatedly retrying operations,
    would your confidence in it be unchanged from times when it did not
    exhibit such behavior?

    I'd have put an SSD in by now, along with an off site backup of the same
    data :)

    So, any problems you have with your SSD, today, should be solved by using the technology that will be invented 10 years hence! Ah, that's a sound strategy!

    Assuming you have properly configured an EIA232 interface, why would you
    ever get a parity error? (OVERRUN errors can be the result of an i/f
    that is running too fast for the system on the receiving end) How would
    you even KNOW this was happening?

    I suspect everyone who has owned a DVD/CD drive has encountered a
    "slow tray" as the mechanism aged. Or, a tray that wouldn't
    open (of its own accord) as soon/quickly as it used to.

    If it hasn't been used for some time then I'm ready with a tiny screwdriver blade to help it open.

    Why don't they ship such drives with tiny screwdrivers to make it
    easier for EVERY customer to address this problem?

    But I forget when I last used an optical drive.

    When the firmware in your SSD corrupts your data, what remedy will
    you use?

    You're missing the forest for the trees.

    [Turns out, there was a city-wide gas shortage so there was enough
    gas available to light the furnace but not enough to bring it up to
    temperature as quickly as the designers had expected]

    That's why the furnace designers couldn't have anticipated it.

    Really? You can't anticipate the "gas shutoff" not being in the ON
    position? (which would yield the same endless retry cycle)

    They did not know that such a condition might occur so never tested for it.

    If they planned on ENDLESSLY retrying, then they must have imagined
    some condition COULD occur that would lead to such an outcome.
    Else, why not just retry *once* and then give up? Or, not
    retry at all?

    A component could fail suddenly, such as a short circuit diode, and
    everything would work fine after replacing it.
    The cause could perhaps have been a manufacturing defect, such as
    insufficient cooling due to poor quality assembly, but the exact real
    cause
    would never be known.

    You don't care about the real cause. Or, even the failure mode.
    You (as user) just don't want to be inconvenienced by the sudden
    loss of the functionality/convenience that the device provided.

    There will always be sudden unexpected loss of functionality for reasons which could not easily be predicted.

    And if they CAN'T be predicted, then they aren't germane to this
    discussion, eh?

    My concern is for the set of failure modes that can realistically
    be anticipated.

    I *know* the inverters in my monitors are going to fail. It
    would be nice if I knew before I was actively using one when
    it went dark!

    [But, most users would only use this indication to tell them
    to purchase another monitor; "You have been warned!"]

    People who service lawn mowers in the area where I live are very busy right now.

    A component could fail suddenly as a side effect of another failure.
    One short circuit output transistor and several other components could
    also
    burn up.

    So, if you could predict the OTHER failure...
    Or, that such a failure might occur and lead to the followup failure...

    A component could fail slowly and only become apparent when it got to the
    stage of causing an audible or visible effect.

    But, likely, there was something observable *in* the circuit that
    just hadn't made it to the level of human perception.

    Yes a power supply ripple detection circuit could have turned on a warning LED but that never happened for at least two reasons.
    1. The detection circuit would have increased the cost of the equipment and thus diminished the profit of the manufacturer.

    That would depend on the market, right? Most of my computers have redundant "smart" (i.e., internal monitoring and reporting) power supplies. Because
    they were marketed to folks who wanted that sort of reliability. Because
    a manufacturer who didn't provide that level of AVAILABILITY would quickly
    lose market share. The cost of the added components and "handling" is
    small compared to the cost of lost opportunity (sales).

    2. The user would not have understood and would have ignored the warning anyway.

    That makes assumptions about the market AND the user.

    If one of my machines signals a fault, I look to see what it is complaining about: is it a power supply failure (in which case, I'm now reliant on
    a single power supply)? is it a memory failure (in which case, a bank
    of memory may have been disabled which means the machine will thrash
    more and throughput will drop)? is it a link aggregation error (and
    network traffic will suffer)?

    If I can't understand these errors, then I either don't buy a product
    with that level of reliability *or* have someone on hand who CAN
    understand the errors and provide remedies/advice.

    Consumers will replace a PC because of malware, trashed registry,
    creeping cruft, etc. That's a problem with the consumer buying the
    "wrong" sort of computing equipment for his likely method of use.
    (buy a Mac?)

    My home wireless Internet system doesn't care if one access point fails, and
    I would not expect to be able to do anything to predict a time of failure.
    Experience says a dead unit has power supply issues. Usually external but
    could be internal.

    Again, the goal isn't to predict "time of failure". But, rather, to be
    able to know that "this isn't going to end well" -- with some advance
    notice
    that allows for preemptive action to be taken (and not TOO much advance
    notice that the user ends up replacing items prematurely).

    Get feedback from the people who use your equipment.

    Users often don't understand when a device is malfunctioning.
    Or, how to report the conditions and symptoms in a meaningful way.

    I recall a woman I worked with ~45 years ago sitting, patiently,
    waiting for her computer to boot. As I walked past, she asked me how
    long it takes for that to happen (floppy based systems). Alarmed
    (I had designed the workstations), I asked "How long have you been
    waiting?"

    Turns out, she had inserted the (8") floppy rotated 90 degrees from
    its proper orientation.

    How much longer would she have waited had I not walked past?

    I don't think it would be possible to "watch" everything because it's
    rare
    that you can properly test a component while it's part of a working
    system.

    You don't have to -- as long as you can observe its effects on other
    parts of the system. E.g., there's no easy/inexpensive way to
    check to see how much the belt on that CD/DVD player has stretched.
    But, you can notice that it HAS stretched (or, some less likely
    change has occurred that similarly interferes with the tray's actions)
    by noting how the activity that it is used for has changed.

    Sure but you have to be the operator for that.
    So you can be ready to help the tray open when needed.

    One wouldn't bother with a CD/DVD player -- they are too disposable
    and reporting errors won't help the user (even though you have a
    big ATTACHED display at your disposal!)

    "For your continued video enjoyment, replace me, now!"

    OTOH, if a CNC machine tries to "home" a mechanism and doesn't
    get (electronic) confirmation of that event having been completed,
    would you expect *it* to just sit there endlessly waiting?
    Possibly causing damage to itself in the process?

    Would you expect it to "notice" if the drive motor APPEARED to
    be connected and was drawing the EXPECTED amount of current?

    Or, would you expect an electrician to come along and start
    troubleshooting (taking the machine out of production in the process)?

    These days I would expect to have fun with management asking for software
    to be able to diagnose and report any hardware failure.
    Not very easy if the power supply has died.

    What if the power supply HASN'T died? What if you are diagnosing the
    likely upcoming failure *of* the power supply?

    Then I probably can't, because the power supply may be just a bought in
    power supply which was never designed with upcoming failure detection in mind.

    You wouldn't pick such a power supply if that was an important
    failure mode to guard against! (that's why smart power supplies
    are so common -- and redundant!)

    You have ECC memory in most (larger) machines. Do you silently
    expect it to just fix all the errors? Does it have a way of telling you
    how many such errors it HAS corrected? Can you infer the number of
    errors that it *hasn't*?

    [Why have ECC at all?]
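
    For what it's worth, on a Linux host the kernel's EDAC driver does expose
    those counts in sysfs. A minimal sketch of polling them -- the paths are
    the stock EDAC ones, but the hourly cadence and the "any corrected-error
    growth is an advisory" policy are just assumptions:

        # Minimal sketch: poll the Linux EDAC corrected/uncorrected counters.
        # Paths follow the standard /sys/devices/system/edac/mc/mc*/ layout.
        import glob, time

        def read_counts():
            counts = {}
            for mc in glob.glob("/sys/devices/system/edac/mc/mc*"):
                with open(mc + "/ce_count") as f:   # errors ECC has corrected
                    ce = int(f.read())
                with open(mc + "/ue_count") as f:   # errors it could NOT correct
                    ue = int(f.read())
                counts[mc] = (ce, ue)
            return counts

        baseline = read_counts()
        while True:
            time.sleep(3600)
            for mc, (ce, ue) in read_counts().items():
                if ce > baseline[mc][0]:
                    print(f"{mc}: {ce - baseline[mc][0]} corrected errors this hour")
                if ue > baseline[mc][1]:
                    print(f"{mc}: UNcorrected errors -- data already lost")
            baseline = read_counts()

    A rising corrected-error rate is exactly the "hasn't failed yet, but..."
    signal; a rising uncorrected count means the advisory came too late.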

    Things are sometimes done the way they've always been done.

    Then, we should all be using machines with MEGAbytes of memory...

    I used to notice a missing chip in the 9th position but now you mention it the RAM I just looked at has 9 chips each side.

    Much consumer kit has non-ECC RAM. I'd wager many of the
    devices designed by folks *here* use non-ECC RAM (because
    support for ECC in embedded products is less common).

    Is this ignorance? Or, willful naiveté?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Martin Brown@21:1/5 to Don Y on Tue Apr 16 09:45:34 2024
    On 15/04/2024 18:13, Don Y wrote:
    Is there a general rule of thumb for signalling the likelihood of
    an "imminent" (for some value of "imminent") hardware failure?

    You have to be very careful that the additional complexity doesn't
    itself introduce new annoying failure modes. My previous car had
    filament bulb failure sensors (new one is LED) of which the one for the
    parking light had itself failed - the parking light still worked.
    However, the car would greet me with "parking light failure" every time
    I started the engine and the main dealer refused to cancel it.

    Repair of parking light sensor failure required swapping out the
    *entire* front light assembly since it was built in one time hot glue.
    That would be a very expensive "repair" for a trivial fault.

    The parking light is not even a required feature.

    I suspect most would involve *relative* changes that would be
    suggestive of changing conditions in the components (and not
    directly related to environmental influences).

    So, perhaps, a good strategy is to just "watch" everything and
    notice the sorts of changes you "typically" encounter in the hope
    that something of greater magnitude would be a harbinger...

    Monitoring temperature, voltage supply and current consumption isn't a
    bad idea. If they get unexpectedly out of line something is wrong.
    Likewise with power on self tests you can catch some latent failures
    before they actually affect normal operation.

    --
    Martin Brown

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Martin Brown@21:1/5 to Edward Rawde on Tue Apr 16 11:46:20 2024
    On 16/04/2024 05:14, Edward Rawde wrote:
    "Don Y" <blockedofcourse@foo.invalid> wrote in message news:uvkqqu$o5co$1@dont-email.me...
    On 4/15/2024 1:32 PM, Edward Rawde wrote:
    "Don Y" <blockedofcourse@foo.invalid> wrote in message
    news:uvjn74$d54b$1@dont-email.me...
    Is there a general rule of thumb for signalling the likelihood of
    an "imminent" (for some value of "imminent") hardware failure?

    My conclusion would be no.
    Some of my reasons are given below.

    It always puzzled me how HAL could know that the AE-35 would fail in the
    near future, but maybe HAL had a motive for lying.

    Why does your PC retry failed disk operations?

    Because the software designer didn't understand hardware.
    The correct approach is to mark that part of the disk as unusable and, if possible, move any data from it elsewhere quick.

    If I ask the drive to give
    me LBA 1234, shouldn't it ALWAYS give me LBA 1234? Without any data
    corruption
    (CRC error) AND within the normal access time limits defined by the
    location
    of those magnetic domains on the rotating medium?

    Why should it attempt to retry this MORE than once?

    Now, if you knew your disk drive was repeatedly retrying operations,
    would your confidence in it be unchanged from times when it did not
    exhibit such behavior?

    I'd have put an SSD in by now, along with an off site backup of the same
    data :)
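
    The drive does keep score for anyone who asks: SMART counters such as
    reallocated/pending sectors reveal exactly that silent retrying. A rough
    sketch using smartmontools -- attribute names vary by vendor, and the
    "anything non-zero is worth a look" rule is only an assumption:

        # Rough sketch: read a drive's SMART attributes via smartctl and flag
        # the ones that suggest it is quietly retrying/remapping sectors.
        import subprocess

        WATCH = ("Reallocated_Sector_Ct", "Current_Pending_Sector",
                 "Offline_Uncorrectable", "UDMA_CRC_Error_Count")

        def smart_warnings(dev="/dev/sda"):
            out = subprocess.run(["smartctl", "-A", dev],
                                 capture_output=True, text=True).stdout
            flagged = []
            for line in out.splitlines():
                fields = line.split()
                if len(fields) >= 10 and fields[1] in WATCH:
                    try:
                        raw = int(fields[9])    # RAW_VALUE column
                    except ValueError:
                        continue
                    if raw > 0:
                        flagged.append((fields[1], raw))
            return flagged

        for name, raw in smart_warnings():
            print(f"{name} = {raw} -- the drive is already papering over trouble")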


    Assuming you have properly configured an EIA-232 interface, why would you
    ever get a parity error? (OVERRUN errors can be the result of an i/f
    that is running too fast for the system on the receiving end) How would
    you even KNOW this was happening?

    I suspect everyone who has owned a DVD/CD drive has encountered a
    "slow tray" as the mechanism aged. Or, a tray that wouldn't
    open (of its own accord) as soon/quickly as it used to.

    If it hasn't been used for some time then I'm ready with a tiny screwdriver blade to help it open.
    But I forget when I last used an optical drive.


    The controller COULD be watching this (cuz it knows when it
    initiated the operation and there is an "end-of-stroke"
    sensor available) and KNOW that the drive belt was stretching
    to the point where it was impacting operation.

    [And, that a stretched belt wasn't going to suddenly decide to
    unstretch to fix the problem!]

    Back in that era I was doing a lot of repair work when I should have been
    doing my homework.
    So I knew that there were many unrelated kinds of hardware failure.

    The goal isn't to predict ALL failures but, rather, to anticipate
    LIKELY failures and treat them before they become an inconvenience
    (or worse).

    One morning, the (gas) furnace repeatedly tried to light as the
    thermostat called for heat. Then, a few moments later, the
    safeties would kick in and shut down the gas flow. This attracted my
    attention as the LIT furnace should STAY LIT!

    The furnace was too stupid to notice its behavior so would repeat
    this cycle, endlessly.

    I stepped in and overrode the thermostat to eliminate the call
    for heat as this behavior couldn't be productive (if something
    truly IS wrong, then why let it continue? and, if there is nothing
    wrong with the controls/mechanism, then clearly it is unable to meet
    my needs so why let it persist in trying?)

    [Turns out, there was a city-wide gas shortage so there was enough
    gas available to light the furnace but not enough to bring it up to
    temperature as quickly as the designers had expected]

    That's why the furnace designers couldn't have anticipated it.
    They did not know that such a condition might occur so never tested for it.


    A component could fail suddenly, such as a short circuit diode, and
    everything would work fine after replacing it.
    The cause could perhaps have been a manufacturing defect, such as
    insufficient cooling due to poor quality assembly, but the exact real
    cause
    would never be known.

    You don't care about the real cause. Or, even the failure mode.
    You (as user) just don't want to be inconvenienced by the sudden
    loss of the functionality/convenience that the device provided.

    There will always be sudden unexpected loss of functionality for reasons which could not easily be predicted.
    People who service lawn mowers in the area where I live are very busy right now.


    A component could fail suddenly as a side effect of another failure.
    One short circuit output transistor and several other components could
    also burn up.

    So, if you could predict the OTHER failure...
    Or, that such a failure might occur and lead to the followup failure...

    A component could fail slowly and only become apparent when it got to the
    stage of causing an audible or visible effect.

    But, likely, there was something observable *in* the circuit that
    just hadn't made it to the level of human perception.

    Yes a power supply ripple detection circuit could have turned on a warning LED but that never happened for at least two reasons.
    1. The detection circuit would have increased the cost of the equipment and thus diminished the profit of the manufacturer.
    2. The user would not have understood and would have ignored the warning anyway.


    It would often be easy to locate the dried up electrolytic due to it
    having
    already let go of some of its contents.

    So I concluded that if I wanted to be sure that I could always watch my
    favourite TV show, we would have to have at least two TVs in the house.

    If it's not possible to have the equivalent of two TVs then you will want
    to be in a position to get the existing TV repaired or replaced as quickly
    as possible.

    Two TVs are affordable. Consider two controllers for a wire-EDM machine.

    Or, the cost of having that wire-EDM machine *idle* (because you didn't
    have a spare controller!)

    My home wireless Internet system doesn't care if one access point fails, and
    I would not expect to be able to do anything to predict a time of failure.
    Experience says a dead unit has power supply issues. Usually external but
    could be internal.

    Again, the goal isn't to predict "time of failure". But, rather, to be
    able to know that "this isn't going to end well" -- with some advance
    notice
    that allows for preemptive action to be taken (and not TOO much advance
    notice that the user ends up replacing items prematurely).

    Get feedback from the people who use your equipment.


    I don't think it would be possible to "watch" everything because it's
    rare
    that you can properly test a component while it's part of a working
    system.

    You don't have to -- as long as you can observe its effects on other
    parts of the system. E.g., there's no easy/inexpensive way to
    check to see how much the belt on that CD/DVD player has stretched.
    But, you can notice that it HAS stretched (or, some less likely
    change has occurred that similarly interferes with the tray's actions)
    by noting how the activity that it is used for has changed.

    Sure but you have to be the operator for that.
    So you can be ready to help the tray open when needed.


    These days I would expect to have fun with management asking for software
    to be able to diagnose and report any hardware failure.
    Not very easy if the power supply has died.

    What if the power supply HASN'T died? What if you are diagnosing the
    likely upcoming failure *of* the power supply?

    Then I probably can't, because the power supply may be just a bought in
    power supply which was never designed with upcoming failure detection in mind.


    You have ECC memory in most (larger) machines. Do you silently
    expect it to just fix all the errors? Does it have a way of telling you
    how many such errors it HAS corrected? Can you infer the number of
    errors that it *hasn't*?

    [Why have ECC at all?]

    Things are sometimes done the way they've always been done.
    I used to notice a missing chip in the 9th position but now you mention it the RAM I just looked at has 9 chips each side.


    There are (and have been) many efforts to *predict* lifetimes of
    components (and, systems). And, some work to examine the state
    of systems /in situ/ with an eye towards anticipating their
    likelihood of future failure.

    I'm sure that's true.


    [The former has met with poor results -- predicting the future
    without a position in its past is difficult. And, knowing how
    a device is "stored" when not powered on also plays a role
    in its future survival! (is there some reason YOUR devices
    can't power themselves on, periodically; notice the environmental
    conditions; log them and then power back off)]

    The question is one of a practical nature; how much does it cost
    you to add this capability to a device and how accurately can it
    make those predictions (thus avoiding some future cost/inconvenience).

    For small manufacturers, the research required is likely not
    cost-effective;
    just take your best stab at it and let the customer "buy a replacement"
    when the time comes (hopefully, outside of your warranty window).

    But, anything you can do to minimize this TCO issue gives your product
    an edge over competitors. Given that most devices are smart, nowadays,
    it seems obvious that they should undertake as much of this task as
    they can (conveniently) afford.

    <https://www.sciencedirect.com/science/article/abs/pii/S0026271409003667>

    <https://www.researchgate.net/publication/3430090_In_Situ_Temperature_Measurement_of_a_Notebook_Computer-A_Case_Study_in_Health_and_Usage_Monitoring_of_Electronics>

    <https://www.tandfonline.com/doi/abs/10.1080/16843703.2007.11673148>

    <https://www.prognostics.umd.edu/calcepapers/02_V.Shetty_remaingLifeAssesShuttleRemotemanipulatorSystem_22ndSpaceSimulationConf.pdf>

    <https://ieeexplore.ieee.org/document/1656125>

    <https://journals.sagepub.com/doi/10.1177/0142331208092031>

    [Sorry, I can't publish links to the full articles]




    --
    Martin Brown

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don Y@21:1/5 to Martin Brown on Tue Apr 16 04:26:28 2024
    On 4/16/2024 1:45 AM, Martin Brown wrote:
    On 15/04/2024 18:13, Don Y wrote:
    Is there a general rule of thumb for signalling the likelihood of
    an "imminent" (for some value of "imminent") hardware failure?

    You have to be very careful that the additional complexity doesn't itself introduce new annoying failure modes.

    *Or*, decrease the reliability of the device, in general.

    My previous car had filament bulb failure
    sensors (new one is LED) of which the one for the parking light had itself failed - the parking light still worked. However, the car would greet me with "parking light failure" every time I started the engine and the main dealer refused to cancel it.

    My goal is to provide *advisories*. You don't want to constrain the
    user.

    Smoke detectors that pester you with "replace battery" alerts are nags.
    A car that refuses to start unless the seat belts are fastened is a nag.

    You shouldn't require a third party to enable you to ignore an
    advisory. But, it's OK to require the user to acknowledge that
    advisory!

    Repair of parking light sensor failure required swapping out the *entire* front
    light assembly since it was built in one time hot glue. That would be a very expensive "repair" for a trivial fault.

    The parking light is not even a required feature.

    I suspect most would involve *relative* changes that would be
    suggestive of changing conditions in the components (and not
    directly related to environmental influences).

    So, perhaps, a good strategy is to just "watch" everything and
    notice the sorts of changes you "typically" encounter in the hope
    that something of greater magnitude would be a harbinger...

    Monitoring temperature, voltage supply and current consumption isn't a bad idea. If they get unexpectedly out of line something is wrong.

    Extremes are easy to detect -- but, by then, they usually indicate
    failures that have already happened.
    E.g., a short, an open.

    The problem is sorting out what magnitude changes are significant
    and which are normal variation.

    I think being able to track history gives you a leg up in that
    it gives you a better idea of what MIGHT be normal instead of
    just looking at an instant in time.
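
    One crude way to turn that history into a yes/no is to keep a running
    mean and spread for each monitored quantity and flag readings that sit
    well outside what the unit itself has established as "normal". A sketch --
    the smoothing factor, warm-up count, 4-sigma limit and the data-source/
    logging names are all invented:

        # Sketch: per-channel running baseline (exponentially weighted mean
        # and variance); flags a reading far outside the unit's own history.
        class Baseline:
            def __init__(self, alpha=0.01, sigmas=4.0):
                self.alpha, self.sigmas = alpha, sigmas
                self.mean, self.var, self.n = 0.0, 0.0, 0

            def update(self, x):
                self.n += 1
                if self.n == 1:
                    self.mean = x
                    return False                 # nothing to compare against yet
                d = x - self.mean
                self.mean += self.alpha * d
                self.var = (1 - self.alpha) * (self.var + self.alpha * d * d)
                return self.n > 100 and abs(d) > self.sigmas * self.var ** 0.5

        ripple = Baseline()
        for mv in stream_of_ripple_readings():   # hypothetical data source
            if ripple.update(mv):
                log_advisory(f"ripple {mv} mV is well outside this unit's history")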

    Likewise with
    power on self tests you can catch some latent failures before they actually affect normal operation.

    POST is seldom executed as devices tend to run 24/7/365.
    So, I have to design runtime BIST support that can, hopefully,
    coax this information from a *running* system without interfering
    with that operation.

    This puts constraints on how you operate the hardware
    (unless you want to add lots of EXTRA hardware to
    extract these observations).

    E.g., if you can control N loads, then individually (sequentially)
    activating them and noticing the delta power consumption reveals
    more than just enabling ALL that need to be enabled and only seeing
    the aggregate of those loads.

    This can also simplify gross failure detection if part of the
    normal control strategy.
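
    A sketch of that staggered-enable idea -- set_load(), read_supply_current(),
    the table of nominal currents and the 25% tolerance are all stand-ins for
    whatever the real hardware provides:

        # Sketch of "stagger the loads and watch the deltas": enable each load
        # in turn and compare the step in supply current to what that load
        # should draw.
        import time

        NOMINAL_A = {"heater": 2.5, "pump": 1.2, "valve": 0.3}
        TOLERANCE = 0.25

        def energize_with_checks(loads):
            notes = []
            for name in loads:
                before = read_supply_current()
                set_load(name, on=True)
                time.sleep(0.1)                  # let it settle
                delta = read_supply_current() - before
                nominal = NOMINAL_A[name]
                if abs(delta - nominal) > TOLERANCE * nominal:
                    notes.append(f"{name}: drew {delta:.2f} A, expected ~{nominal} A")
            return notes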

    E.g., I designed a medical instrument many years ago that had an
    external "sensor array". As that could be unplugged at any time,
    I had to continually monitor for its disconnection. At the same
    time, individual sensors in the array could be "spoiled" by
    spilled reagents. Yet, the other sensors shouldn't be compromised
    or voided just because of the failure of certain ones.

    Recognizing that this sort of thing COULD happen in normal use
    was the biggest part of the design; the hardware and software
    to actually handle these exceptions was then straightforward.

    Note that some failures may not be possible to recover from
    without adding significant cost (and other failure modes).
    So, it's a value decision as to what you support and what
    you "tolerate".

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Joe Gwinn@21:1/5 to pcdhSpamMeSenseless@electrooptical. on Tue Apr 16 09:54:40 2024
    On Tue, 16 Apr 2024 00:51:03 -0000 (UTC), Phil Hobbs <pcdhSpamMeSenseless@electrooptical.net> wrote:

    Joe Gwinn <joegwinn@comcast.net> wrote:
    On Mon, 15 Apr 2024 13:05:40 -0700, john larkin <jl@650pot.com> wrote:

    On Mon, 15 Apr 2024 15:41:57 -0400, Joe Gwinn <joegwinn@comcast.net>
    wrote:

    On Mon, 15 Apr 2024 10:13:02 -0700, Don Y
    <blockedofcourse@foo.invalid> wrote:

    Is there a general rule of thumb for signalling the likelihood of
    an "imminent" (for some value of "imminent") hardware failure?

    I suspect most would involve *relative* changes that would be
    suggestive of changing conditions in the components (and not
    directly related to environmental influences).

    So, perhaps, a good strategy is to just "watch" everything and
    notice the sorts of changes you "typically" encounter in the hope
    that something of greater magnitude would be a harbinger...

    There is a standard approach that may work: Measure the level and
    trend of very low frequency (around a tenth of a Hertz) flicker noise.
    When connections (perhaps within a package) start to fail, the flicker
    level rises. The actual frequency monitored isn't all that critical.

    Joe Gwinn

    Do connections "start to fail" ?

    Yes, they do, in things like vias. I went through a big drama where a
    critical bit of radar logic circuitry would slowly go nuts.

    It turned out that the copper plating on the walls of the vias was
    suffering from low-cycle fatigue during temperature cycling and slowly
    breaking, one little crack at a time, until it went open. If you
    measured the resistance to parts per million (6.5 digit DMM), sampling
    at 1 Hz, you could see the 1/f noise at 0.1 Hz rising. It's useful to
    also measure a copper line, and divide the via-chain resistance by the
    no-via resistance, to correct for temperature changes.
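
    A sketch of that bookkeeping, assuming 1 Hz samples of both paths are
    already on hand; it uses scipy's Welch estimator for the band around
    0.1 Hz, and the factor-of-three alarm threshold is arbitrary:

        # Sketch: trend the ~0.1 Hz noise in the via-chain / no-via resistance
        # ratio.  Assumes r_chain and r_ref are equal-length arrays of 1 Hz
        # samples.
        import numpy as np
        from scipy.signal import welch

        def flicker_level(r_chain, r_ref, fs=1.0):
            ratio = np.asarray(r_chain) / np.asarray(r_ref)  # cancels temperature
            f, psd = welch(ratio - ratio.mean(), fs=fs, nperseg=1024)
            band = (f > 0.05) & (f < 0.2)                    # around 0.1 Hz
            return psd[band].mean()

        def trending_up(history, new_level, factor=3.0):
            # 'history' holds earlier flicker_level() results for this board
            return len(history) > 10 and new_level > factor * np.median(history)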

    The solution was to redesign the vias, mainly to increase the critical
    volume of copper. And modern SMD designs have less and less copper
    volume.

    I bet precision resistors can also be measured this way.


    I don't think I've ever owned a piece of electronic equipment that
    warned me of an impending failure.

    Onset of smoke emission is a common sign.


    Cars do, for some failure modes, like low oil level.

    The industrial method for big stuff is accelerometers attached near
    the bearings, and listen for excessive rotation-correlated (not
    necessarily harmonic) noise.

    There are a number of instruments available that look for metal particles
    in the lubricating oil.

    Yes.

    The old-school version was a magnetic drain plug, which one inspected
    for clinging iron chips or dust, also serving to trap those chips. The newer-school version was to send a sample of the dirty oil to the lab
    for microscope and chemical analysis. There are companies that will
    take your old lubrication oil and reprocess it, yielding new oil.

    If there was an oil filter, inspect the filter surface.

    And when one was replacing the oil in the gear case, wipe the bottom
    with a white rag, and look at the rag.

    Nobody did electronic testing until very recently, because even
    expensive electronics were far too unreliable and fragile.

    Joe Gwinn

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don@21:1/5 to Don Y on Tue Apr 16 13:25:07 2024
    Don Y wrote:
    Is there a general rule of thumb for signalling the likelihood of
    an "imminent" (for some value of "imminent") hardware failure?

    I suspect most would involve *relative* changes that would be
    suggestive of changing conditions in the components (and not
    directly related to environmental influences).

    So, perhaps, a good strategy is to just "watch" everything and
    notice the sorts of changes you "typically" encounter in the hope
    that something of greater magnitude would be a harbinger...

    A singular speculative spitball - the capacitive marker:

    In-situ Prognostic Method of Power MOSFET Based on Miller Effect

    ... This paper presents a new in-situ prognosis method for
    MOSFET based on miller effect. According to the theory
    analysis, simulation and experiment results, the miller
    platform voltage is identified as a new degradation
    precursor ...

    (10.1109/PHM.2017.8079139)

    Danke,

    --
    Don, KB7RPU, https://www.qsl.net/kb7rpu
    There was a young lady named Bright Whose speed was far faster than light;
    She set out one day In a relative way And returned on the previous night.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Edward Rawde@21:1/5 to Don Y on Tue Apr 16 11:07:27 2024
    "Don Y" <blockedofcourse@foo.invalid> wrote in message news:uvl30j$phap$3@dont-email.me...
    On 4/15/2024 9:14 PM, Edward Rawde wrote:
    It always puzzled me how HAL could know that the AE-35 would fail in
    the
    near future, but maybe HAL had a motive for lying.

    Why does your PC retry failed disk operations?

    Because the software designer didn't understand hardware.

    Actually, he DID understand the hardware which is why he retried
    it instead of ASSUMING every operation would proceed correctly.

    ....

    When the firmware in your SSD corrupts your data, what remedy will
    you use?

    Replace drive and restore backup.
    It's happened a few times, and a friend had one of those 16 GB but looks
    like 1 TB to the OS SSDs from Amazon.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Joe Gwinn@21:1/5 to john larkin on Tue Apr 16 10:19:00 2024
    On Mon, 15 Apr 2024 16:26:35 -0700, john larkin <jl@650pot.com> wrote:

    On Mon, 15 Apr 2024 18:03:23 -0400, Joe Gwinn <joegwinn@comcast.net>
    wrote:

    On Mon, 15 Apr 2024 13:05:40 -0700, john larkin <jl@650pot.com> wrote:

    On Mon, 15 Apr 2024 15:41:57 -0400, Joe Gwinn <joegwinn@comcast.net>
    wrote:

    On Mon, 15 Apr 2024 10:13:02 -0700, Don Y
    <blockedofcourse@foo.invalid> wrote:

    Is there a general rule of thumb for signalling the likelihood of
    an "imminent" (for some value of "imminent") hardware failure?

    I suspect most would involve *relative* changes that would be
    suggestive of changing conditions in the components (and not
    directly related to environmental influences).

    So, perhaps, a good strategy is to just "watch" everything and
    notice the sorts of changes you "typically" encounter in the hope
    that something of greater magnitude would be a harbinger...

    There is a standard approach that may work: Measure the level and
    trend of very low frequency (around a tenth of a Hertz) flicker noise.
    When connections (perhaps within a package) start to fail, the flicker
    level rises. The actual frequency monitored isn't all that critical.

    Joe Gwinn

    Do connections "start to fail" ?

    Yes, they do, in things like vias. I went through a big drama where a
    critical bit of radar logic circuitry would slowly go nuts.

    It turned out that the copper plating on the walls of the vias was
    suffering from low-cycle fatigue during temperature cycling and slowly
    breaking, one little crack at a time, until it went open. If you
    measured the resistance to parts per million (6.5 digit DMM), sampling
    at 1 Hz, you could see the 1/f noise at 0.1 Hz rising. It's useful to
    also measure a copper line, and divide the via-chain resistance by the
    no-via resistance, to correct for temperature changes.

    But nobody is going to monitor every via on a PCB, even if it were
    possible.

    It was not possible to test the vias on the failing logic board, but
    we knew from metallurgical cut, polish, and inspect studies of failed
    boards that it was the vias that were failing.


    One could instrument a PCB fab test board, I guess. But DC tests would
    be fine.

    What was being tested was a fab test board that had both the series
    via chain path and the no-via path of roughly the same DC resistance,
    set up so we could do 4-wire Kelvin resistance measurements of each
    path independent of the other path.


    We have one board with over 4000 vias, but they are mostly in
    parallel.

    This can also be tested, but using a 6.5-digit DMM intended for
    measuring very low resistance values. A change of one part in 4,000
    is huge to a 6.5-digit instrument. The conductivity will decline
    linearly as vias fail one by one.


    The solution was to redesign the vias, mainly to increase the critical
    volume of copper. And modern SMD designs have less and less copper
    volume.

    I bet precision resistors can also be measured this way.


    I don't think I've ever owned a piece of electronic equipment that
    warned me of an impending failure.

    Onset of smoke emission is a common sign.


    Cars do, for some failure modes, like low oil level.

    The industrial method for big stuff is accelerometers attached near
    the bearings, and listen for excessive rotation-correlated (not
    necessarily harmonic) noise.

    Big ships that I've worked on have a long propeller shaft in the shaft
    alley, a long tunnel where nobody often goes. They have magnetic shaft
    runout sensors and shaft bearing temperature monitors.

    They measure shaft torque and SHP too, from the shaft twist.

    Yep. And these kinds of things fail slowly. At first.


    I liked hiding out in the shaft alley. It was private and cool, that
    giant shaft slowly rotating.

    Probably had a calming flowing water sound as well.

    Joe Gwinn

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Edward Rawde@21:1/5 to Don Y on Tue Apr 16 11:10:40 2024
    "Don Y" <blockedofcourse@foo.invalid> wrote in message news:uvl2gr$phap$2@dont-email.me...
    On 4/15/2024 8:33 PM, Edward Rawde wrote:

    [Shouldn't that be Edwar D rawdE?]


    I don't mind how you pronounce it.


    ...

    A smoke detector that beeps once a day risks not being heard

    Reminds me of a tenant who just removed the battery to stop the annoying beeping.


    ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Joe Gwinn@21:1/5 to invalid@invalid.invalid on Tue Apr 16 11:21:00 2024
    On Tue, 16 Apr 2024 11:10:40 -0400, "Edward Rawde"
    <invalid@invalid.invalid> wrote:

    "Don Y" <blockedofcourse@foo.invalid> wrote in message >news:uvl2gr$phap$2@dont-email.me...
    On 4/15/2024 8:33 PM, Edward Rawde wrote:

    [Shouldn't that be Edwar D rawdE?]


    I don't mind how you pronounce it.


    ...

    A smoke detector that beeps once a day risks not being heard

    Reminds me of a tenant who just removed the battery to stop the annoying
    beeping.

    My experience has been the smoke detectors too close (as the smoke
    travels) to the kitchen tend to suffer mysterious disablement.
    Relocation of the smoke detector usually solves the problem.

    Joe Gwinn

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Edward Rawde@21:1/5 to Don on Tue Apr 16 11:37:16 2024
    "Don" <g@crcomp.net> wrote in message news:20240416a@crcomp.net...
    Don Y wrote:
    Is there a general rule of thumb for signalling the likelihood of
    an "imminent" (for some value of "imminent") hardware failure?

    I suspect most would involve *relative* changes that would be
    suggestive of changing conditions in the components (and not
    directly related to environmental influences).

    So, perhaps, a good strategy is to just "watch" everything and
    notice the sorts of changes you "typically" encounter in the hope
    that something of greater magnitude would be a harbinger...

    A singular speculative spitball - the capacitive marker:

    In-situ Prognostic Method of Power MOSFET Based on Miller Effect

    ... This paper presents a new in-situ prognosis method for
    MOSFET based on miller effect. According to the theory
    analysis, simulation and experiment results, the miller
    platform voltage is identified as a new degradation
    precursor ...

    (10.1109/PHM.2017.8079139)

    Very interesting but are there any products out there which make use of this
    or other prognostic methods to provide information on remaining useful life?


    Danke,

    --
    Don, KB7RPU, https://www.qsl.net/kb7rpu
    There was a young lady named Bright Whose speed was far faster than light; She set out one day In a relative way And returned on the previous night.


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Larkin@21:1/5 to '''newspam'''@nonad.co.uk on Tue Apr 16 08:22:17 2024
    On Tue, 16 Apr 2024 09:45:34 +0100, Martin Brown
    <'''newspam'''@nonad.co.uk> wrote:

    On 15/04/2024 18:13, Don Y wrote:
    Is there a general rule of thumb for signalling the likelihood of
    an "imminent" (for some value of "imminent") hardware failure?

    You have to be very careful that the additional complexity doesn't
    itself introduce new annoying failure modes. My previous car had
    filament bulb failure sensors (new one is LED) of which the one for the
    parking light had itself failed - the parking light still worked.
    However, the car would greet me with "parking light failure" every time
    I started the engine and the main dealer refused to cancel it.

    Repair of parking light sensor failure required swapping out the
    *entire* front light assembly since it was built in one time hot glue.
    That would be a very expensive "repair" for a trivial fault.

    The parking light is not even a required feature.

    I suspect most would involve *relative* changes that would be
    suggestive of changing conditions in the components (and not
    directly related to environmental influences).

    So, perhaps, a good strategy is to just "watch" everything and
    notice the sorts of changes you "typically" encounter in the hope
    that something of greater magnitude would be a harbinger...

    Monitoring temperature, voltage supply and current consumption isn't a
    bad idea. If they get unexpectedly out of line something is wrong.
    Likewise with power on self tests you can catch some latent failures
    before they actually affect normal operation.

    The real way to reduce failure rates is by designing carefully.

    Sometimes BIST can help ensure that small failures won't become
    board-burning failures, but an RMA will happen anyhow.

    I just added a soft-start feature to a couple of boards. Apply a current-limited 48 volts to the power stages before the real thing is
    switched on hard.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Larkin@21:1/5 to All on Tue Apr 16 08:16:04 2024
    On Tue, 16 Apr 2024 10:19:00 -0400, Joe Gwinn <joegwinn@comcast.net>
    wrote:

    On Mon, 15 Apr 2024 16:26:35 -0700, john larkin <jl@650pot.com> wrote:

    On Mon, 15 Apr 2024 18:03:23 -0400, Joe Gwinn <joegwinn@comcast.net>
    wrote:

    On Mon, 15 Apr 2024 13:05:40 -0700, john larkin <jl@650pot.com> wrote:

    On Mon, 15 Apr 2024 15:41:57 -0400, Joe Gwinn <joegwinn@comcast.net>
    wrote:

    On Mon, 15 Apr 2024 10:13:02 -0700, Don Y
    <blockedofcourse@foo.invalid> wrote:

    Is there a general rule of thumb for signalling the likelihood of
    an "imminent" (for some value of "imminent") hardware failure?

    I suspect most would involve *relative* changes that would be
    suggestive of changing conditions in the components (and not
    directly related to environmental influences).

    So, perhaps, a good strategy is to just "watch" everything and
    notice the sorts of changes you "typically" encounter in the hope
    that something of greater magnitude would be a harbinger...

    There is a standard approach that may work: Measure the level and
    trend of very low frequency (around a tenth of a Hertz) flicker noise.
    When connections (perhaps within a package) start to fail, the flicker
    level rises. The actual frequency monitored isn't all that critical.
    Joe Gwinn

    Do connections "start to fail" ?

    Yes, they do, in things like vias. I went through a big drama where a
    critical bit of radar logic circuitry would slowly go nuts.

    It turned out that the copper plating on the walls of the vias was
    suffering from low-cycle fatigue during temperature cycling and slowly
    breaking, one little crack at a time, until it went open. If you
    measured the resistance to parts per million (6.5 digit DMM), sampling
    at 1 Hz, you could see the 1/f noise at 0.1 Hz rising. It's useful to
    also measure a copper line, and divide the via-chain resistance by the
    no-via resistance, to correct for temperature changes.

    But nobody is going to monitor every via on a PCB, even if it were
    possible.

    It was not possible to test the vias on the failing logic board, but
    we knew from metallurgical cut, polish, and inspect studies of failed
    boards that it was the vias that were failing.


    One could instrument a PCB fab test board, I guess. But DC tests would
    be fine.

    What was being tested was a fab test board that had both the series
    via chain path and the no-via path of roughly the same DC resistance,
    set up so we could do 4-wire Kelvin resistance measurements of each
    path independent of the other path.


    Yes, but the question was whether one could predict the failure of an
    operating electronic gadget. The answer is mostly NO.

    We had a visit from the quality team from a giant company that you
    have heard of. They wanted us to trend analyze all the power supplies
    on our boards and apply a complex algorithm to predict failures. It
    was total nonsense, basically predicting the future by zooming in on
    random noise with a big 1/f component, just like climate prediction.




    We have one board with over 4000 vias, but they are mostly in
    parallel.

    This can also be tested, but using a 6.5-digit DMM intended for
    measuring very low resistance values. A change of one part in 4,000
    is huge to a 6.5-digit instrument. The conductivity will decline
    linearly as vias fail one by one.



    Millikelvin temperature changes would make more signal than a failing
    via.
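
    Rough numbers for that comparison, as a back-of-envelope check: losing
    one via of 4,000 in parallel moves the conductance by about 250 ppm, and
    with copper's tempco near 0.39 %/K the same shift corresponds to a few
    tens of millikelvin of uncorrected temperature change -- which is why the
    no-via reference trace (or very tight thermal control) matters:

        # Back-of-envelope: conductance change from losing 1 via in 4,000
        # (parallel) versus the change from a temperature shift in copper.
        vias = 4000
        loss_ppm = 1e6 / vias                    # ~250 ppm
        cu_tempco_ppm_per_K = 3900               # copper, ~0.39 %/K
        equivalent_mK = 1000 * loss_ppm / cu_tempco_ppm_per_K
        print(f"one lost via   : {loss_ppm:.0f} ppm")
        print(f"same shift from: {equivalent_mK:.0f} mK of copper temperature change")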

    The solution was to redesign the vias, mainly to increase the critical
    volume of copper. And modern SMD designs have less and less copper
    volume.

    I bet precision resistors can also be measured this way.


    I don't think I've ever owned a piece of electronic equipment that
    warned me of an impending failure.

    Onset of smoke emission is a common sign.


    Cars do, for some failure modes, like low oil level.

    The industrial method for big stuff is accelerometers attached near
    the bearings, and listen for excessive rotation-correlated (not
    necessarily harmonic) noise.

    Big ships that I've worked on have a long propeller shaft in the shaft
    alley, a long tunnel where nobody often goes. They have magnetic shaft
    runout sensors and shaft bearing temperature monitors.

    They measure shaft torque and SHP too, from the shaft twist.

    Yep. And these kinds of things fail slowly. At first.

    They could repair a bearing at sea, given a heads-up about violent
    failure. A serious bearing failure on a single-screw machine means
    getting a seagoing tug.

    The main engine gearbox had padlocks on the covers.

    There was also a chem lab to analyze oil and water and such, looking
    for contaminants that might suggest something going on.




    I liked hiding out in the shaft alley. It was private and cool, that
    giant shaft slowly rotating.

    Probably had a calming flowing water sound as well.

    Yes, cool and beautiful and serene after the heat and noise and
    vibration of the engine room. A quiet 32,000 horsepower.

    It was fun being an electronic guru on sea trials of a ship full of
    big hairy Popeye types. I, skinny gawky kid, got my own stateroom when
    other tech reps slept in cots in the hold.

    Have you noticed how many lumberjack types are afraid of electricity?
    That can be funny.



    Joe Gwinn

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Edward Rawde@21:1/5 to Don Y on Tue Apr 16 12:02:43 2024
    "Don Y" <blockedofcourse@foo.invalid> wrote in message news:uvl2gr$phap$2@dont-email.me...
    On 4/15/2024 8:33 PM, Edward Rawde wrote:

    [Shouldn't that be Edwar D rawdE?]

    I don't mind how you pronounce it.



    Again, the goal is to be an EARLY warning, not an "Oh, Shit! Kill the power!!"

    As such, software is invaluable as designing PREDICTIVE hardware is
    harder than designing predictive software (algorithms).

    Two comparators can make a window detector which will tell you whether some parameter is in a specified range.
    And it doesn't need monthly updates because software is never finished.


    You don't want to tell the user "The battery in your smoke detector
    is NOW dead (leaving you vulnerable)" but, rather, "The battery in
    your smoke detector WILL cease to be able to provide the power necessary
    for the smoke detector to provide the level of protection that you
    desire."

    And, the WAY that you inform the user has to be "productive/useful".
    A smoke detector beeping every minute is likely to find itself unplugged, leading to exactly the situation that the alert was trying to avoid!

    Reminds me of a tenant who just removed the battery to stop the annoying beeping.
    Better to inform the individual who can get the replacement done when the tenant isn't even home.



    I'm not looking for speculation. I'm looking for folks who have DONE
    such things (designing to speculation is more expensive than just letting
    the devices fail when they need to fail!)

    Well I don't recall putting anything much into a design which could predict remaining life.
    The only exceptions, also drawing from other replies in this thread, might
    be temperature sensing,
    voltage sensing, current sensing, air flow sensing, noise sensing, iron in
    oil sensing,
    and any other kind of sensing which might provide information on parameters outside or getting close to outside expected range.
    Give that to some software which also knows how long the equipment has been
    in use, how often
    it has been used, what the temperature and humidity was, how long it's been since the oil was changed,
    and you might be able to give the operator useful information about when to schedule specific maintenance.
    But don't give the software too much control. I don't want to be told that I can't use the equipment because an oil change was required 5 minutes ago and
    it hasn't been done yet.



    ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Edward Rawde@21:1/5 to Joe Gwinn on Tue Apr 16 12:06:47 2024
    "Joe Gwinn" <joegwinn@comcast.net> wrote in message news:4l5t1jtefudrcq5dmpcv993l5jqsg1k8tc@4ax.com...
    On Tue, 16 Apr 2024 11:10:40 -0400, "Edward Rawde"
    <invalid@invalid.invalid> wrote:

    "Don Y" <blockedofcourse@foo.invalid> wrote in message >>news:uvl2gr$phap$2@dont-email.me...
    On 4/15/2024 8:33 PM, Edward Rawde wrote:

    [Shouldn't that be Edwar D rawdE?]


    I don't mind how you pronounce it.


    ...

    A smoke detector that beeps once a day risks not being heard

    Reminds me of a tenant who just removed the battery to stop the annoying
    beeping.

    My experience has been the smoke detectors too close (as the smoke
    travels) to the kitchen tend to suffer mysterious disablement.

    Oh yes I've had that too. I call them burnt toast detectors.

    Relocation of the smoke detector usually solves the problem.

    Joe Gwinn

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bill Sloman@21:1/5 to John Larkin on Wed Apr 17 02:58:41 2024
    On 17/04/2024 1:22 am, John Larkin wrote:
    On Tue, 16 Apr 2024 09:45:34 +0100, Martin Brown <'''newspam'''@nonad.co.uk> wrote:
    On 15/04/2024 18:13, Don Y wrote:

    <snip>

    Sometimes BIST can help ensure that small failures won't become
    board-burning failures, but an RMA will happen anyhow.

    Built-in self test is mostly auto-calibration. You can use temperature sensitive components for precise measurements if you calibrate out the temperature shift and re-calibrate if the measured temperature shifts appreciably (or every few minutes).

    It might also take out the effects of dopant drift in a hot device, but
    it wouldn't take it out forever.
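
    Something along these lines, where calibrate() and read_die_temp() stand
    in for whatever the instrument actually provides; the 2 degC and
    five-minute figures are placeholders:

        # Sketch: re-run self-calibration when the measured temperature has
        # moved appreciably, or every few minutes regardless, per the scheme
        # above.
        import time

        RECAL_DELTA_C = 2.0          # "shifts appreciably"
        RECAL_PERIOD_S = 300         # "or every few minutes"

        last_temp = None
        last_time = 0.0

        def maybe_recalibrate():
            global last_temp, last_time
            t = read_die_temp()
            stale = time.monotonic() - last_time > RECAL_PERIOD_S
            drifted = last_temp is None or abs(t - last_temp) > RECAL_DELTA_C
            if stale or drifted:
                calibrate()
                last_temp, last_time = t, time.monotonic()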

    I just added a soft-start feature to a couple of boards. Apply a current-limited 48 volts to the power stages before the real thing is switched on hard.

    Soft-start has been around forever. If you don't pay attention to what
    happens to your circuit at start-up and turn-off you can have some real disasters.

    At Cambridge Instruments I once replaced all the tail resistors in a
    bunch of class-B long-tailed-pair-based scan amplifiers with constant
    current diodes. With the resistor tails, the scan amps drew a lot of
    current when the 24V rail was being ramped up and that threw the 24V
    supply into current limit, so it didn't ramp up. The constant current
    diodes stopped this (not that I can remember how).

    This was a follow-up after I'd been brought in to stop the 24V power supply
    from blowing up (because it hadn't had a properly designed current limit).

    The problem had shown up in production - where it was known as the three
    bag problem because when things did go wrong the excursions on the 24V
    rail destroyed three bags of components.

    --
    Bill Sloman, Sydney

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don Y@21:1/5 to Edward Rawde on Tue Apr 16 09:53:42 2024
    On 4/16/2024 9:02 AM, Edward Rawde wrote:
    Again, the goal is to be an EARLY warning, not an "Oh, Shit! Kill the
    power!!"

    As such, software is invaluable as designing PREDICTIVE hardware is
    harder than designing predictive software (algorithms).

    Two comparators can make a window detector which will tell you whether some parameter is in a specified range.

    Yes, but you are limited in the relationships that you can encode
    in hardware -- because they are "hard-wired".
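
    A window check is one line of software, and the point about relationships
    is that software can also cross-check parameters against each other --
    something a fixed comparator pair can't do. All names and limits below
    are illustrative only:

        # Sketch: a software "window detector" plus a cross-parameter check
        # that a hard-wired comparator pair can't express.
        def in_window(x, lo, hi):
            return lo <= x <= hi

        def advisories(volts, amps, temp_c):
            notes = []
            if not in_window(volts, 11.5, 12.5):
                notes.append(f"supply {volts:.2f} V outside its window")
            # Relationship check: fan current should rise with temperature.
            expected = 0.10 + 0.004 * temp_c     # made-up model of a healthy fan
            if amps < 0.7 * expected:
                notes.append("fan drawing far less than its temperature suggests")
            return notes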

    And it doesn't need monthly updates because software is never finished.

    Software is finished when the design is finalized. When management
    fails to discipline itself to stop the list of "Sales would like..."
    requests, then it's hard for software to even CLAIM to be finished.

    [How many hardware products see the addition of features and new
    functionality at the rate EXPECTED of software? Why can't I drive
    my car from the back seat? It's still a "car", right? It's not like
    I'm asking for it to suddenly FLY! Why can't this wall wart deliver
    400A at 32VDC? It's still a power supply, right? It's not like I'm
    asking for it to become an ARB!]

    You don't want to tell the user "The battery in your smoke detector
    is NOW dead (leaving you vulnerable)" but, rather, "The battery in
    your smoke detector WILL cease to be able to provide the power necessary
    for the smoke detector to provide the level of protection that you
    desire."

    And, the WAY that you inform the user has to be "productive/useful".
    A smoke detector beeping every minute is likely to find itself unplugged,
    leading to exactly the situation that the alert was trying to avoid!

    Reminds me of a tenant who just removed the battery to stop the annoying beeping.

    "Dinner will be served at the sound of the beep".

    [I had a friend who would routinely trip her smoke detector while cooking. Then, wave a dishtowel in front of it (she was short) to try to "clear" it.]

    Most places have specific rules regarding the placement of smoke detectors
    to 1) ensure safety and 2) avoid nuisance alarms. (I was amused to discover that our local fire department couldn't cite the local requirements when
    I went asking!)

    Add CO and heat detectors to the mix and they get *really* confused!

    Better to inform the individual who can get the replacement done when the tenant isn't even home.

    So, a WiFi/BT link to <whatever>? Now the simple smoke detector starts suffering from feeping creaturism. "Download the app..."

    Remind the occupants once a week (requiring acknowledgement) starting
    a month prior to ANTICIPATED battery depletion. When the battery is on
    its last leg, you can be a nuisance.

    Folks will learn to remember at the first (or second or third) reminder
    in order to avoid the annoying nuisance behavior that is typical of
    most detectors. (there is not a lot of $aving$ to replacing the
    battery at the second warning instead of at the "last minute")

    [We unconditionally replace all of ours each New Year's. Modern units
    now come with sealed "10 year" batteries -- 10 years being the expected lifespan of the detector itself!]
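
    A sketch of that escalation policy -- the month/week figures are the ones
    described above; the two-day "last leg" cutoff and the hourly nag rate
    are invented:

        # Sketch of the escalation described above: weekly, acknowledgeable
        # reminders starting a month out; only become a nuisance at the end.
        def reminder_interval_hours(days_to_depletion):
            if days_to_depletion > 30:
                return None          # say nothing yet
            if days_to_depletion > 2:
                return 7 * 24        # weekly, requiring acknowledgement
            return 1                 # last legs: hourly nagging is now fair game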

    I'm not looking for speculation. I'm looking for folks who have DONE
    such things (designing to speculation is more expensive than just letting
    the devices fail when they need to fail!)

    Well I don't recall putting anything much into a design which could predict remaining life.

    Most people don't. Most people don't design for high availability
    *or* "costly" systems. When I was designing for pharma, my philosophy was
    to make it easy/quick to replace the entire control system. Let someone troubleshoot it on a bench instead of on the factory floor (which is semi-sterile).

    When you have hundreds/thousands of devices in a single installation,
    then you REALLY don't want to have to be playing wack-a-mole with whichever devices have crapped out, TODAY. If these *10* are likely to fail in
    the next month, then replace all ten of them NOW, when you can fit
    that maintenance activity into the production schedule instead of
    HAVING to replace them when they DISRUPT the production schedule.

    The only exceptions, also drawing from other replies in this thread, might
    be be temperature sensing,
    voltage sensing, current sensing, air flow sensing, noise sensing, iron in oil sensing,
    and any other kind of sensing which might provide information on parameters outside or getting close to outside expected range.

    Give that to some software which also knows how long the equipment has been in use, how often
    it has been used, what the temperature and humidity was, how long it's been since the oil was changed,
    and you might be able to give the operator useful information about when to schedule specific maintenance.

    I have all of those things -- with the exception of knowing which sensory data is most pertinent to failure prediction. I can watch to see when one device fails and use it to anticipate the next failure. After many years (of 24/7 operation) and many sites, I can learn from actual experience.

    Just like I can learn that you want your coffee pot started 15 minutes after you arise -- regardless of time of day. Or, that bar stock will be routed directly to the Gridley's. Because that's what I've *observed* you doing!

    But don't give the software too much control. I don't want to be told that I can't use the equipment because an oil change was required 5 minutes ago and it hasn't been done yet.

    Advisories are just that; advisories. They are there to help the user
    avoid the "rudeness" of a piece of equipment "suddenly" (as far as the
    user is concerned) failing. They add value by increasing availability.

    If you choose to ignore the advisory (e.g., not purchasing a spare to have
    on hand for that "imminent" failure), then that's your prerogative. If
    you can afford to have "down time" and only react to ACTUAL failures,
    then that's a policy decision that YOU make.

    OTOH, if there is no oil in the gearbox, the equipment isn't going to
    start; if the oil sensor is defective, then *it* needs to be replaced.
    But, if the gearbox is truly empty, then it needs to be refilled.
    In either case, the equipment *needs* service -- now. Operating in
    this FAILED state presumably poses some risk, hence the prohibition.

    [Cars have gas gauges to save the driver from "discovering" that he's
    run out of fuel!]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don@21:1/5 to Edward Rawde on Tue Apr 16 17:15:31 2024
    Edward Rawde wrote:
    Don wrote:
    Don Y wrote:
    Is there a general rule of thumb for signalling the likelihood of
    an "imminent" (for some value of "imminent") hardware failure?

    I suspect most would involve *relative* changes that would be
    suggestive of changing conditions in the components (and not
    directly related to environmental influences).

    So, perhaps, a good strategy is to just "watch" everything and
    notice the sorts of changes you "typically" encounter in the hope
    that something of greater magnitude would be a harbinger...

    A singular speculative spitball - the capacitive marker:

    In-situ Prognostic Method of Power MOSFET Based on Miller Effect

    ... This paper presents a new in-situ prognosis method for
    MOSFET based on miller effect. According to the theory
    analysis, simulation and experiment results, the miller
    platform voltage is identified as a new degradation
    precursor ...

    (10.1109/PHM.2017.8079139)

    Very interesting but are there any products out there which make use of this or other prognostic methods to provide information on remaining useful life?

    Perhaps this popular application "rings a bell"?

    Battery and System Health Monitoring of Battery-Powered Smart Flow
    Meters Reference Design

    <https://www.ti.com/lit/ug/tidudo5a/tidudo5a.pdf>

    Danke,

    --
    Don, KB7RPU, https://www.qsl.net/kb7rpu
    There was a young lady named Bright Whose speed was far faster than light;
    She set out one day In a relative way And returned on the previous night.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Edward Rawde@21:1/5 to Don Y on Tue Apr 16 13:25:21 2024
    "Don Y" <blockedofcourse@foo.invalid> wrote in message news:uvmaet$1231i$2@dont-email.me...
    On 4/16/2024 9:02 AM, Edward Rawde wrote:
    Again, the goal is to be an EARLY warning, not an "Oh, Shit! Kill the
    power!!"

    ...
    Add CO and heat detectors to the mix and they get *really* confused!

    Better to inform the individual who can get the replacement done when the
    tenant isn't even home.

    So, a WiFi/BT link to <whatever>? Now the simple smoke detector starts suffering from feeping creaturism. "Download the app..."

    No thanks. I have the same view of cameras.
    They won't be connecting outbound to a server anywhere in the world.
    But the average user does not know that and just wants the pictures on their phone.


    The only exceptions, also drawing from other replies in this thread,
    might
    be temperature sensing,
    voltage sensing, current sensing, air flow sensing, noise sensing, iron
    in
    oil sensing,
    and any other kind of sensing which might provide information on
    parameters
    outside or getting close to outside expected range.

    Give that to some software which also knows how long the equipment has
    been
    in use, how often
    it has been used, what the temperature and humidity was, how long it's
    been
    since the oil was changed,
    and you might be able to give the operator useful information about when
    to
    schedule specific maintenance.

    I have all of those things -- with the exception of knowing which sensory data
    is most pertinent to failure prediction.

    That's one reason why you want feedback from people who use your equipment.



    OTOH, if there is no oil in the gearbox, the equipment isn't going to
    start; if the oil sensor is defective, then *it* needs to be replaced.

    Preferably by me purchasing a new sensor and being able to replace it
    myself.

    ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Edward Rawde@21:1/5 to Bill Sloman on Tue Apr 16 13:39:07 2024
    "Bill Sloman" <bill.sloman@ieee.org> wrote in message news:uvmao8$124q1$1@dont-email.me...
    On 17/04/2024 1:22 am, John Larkin wrote:
    On Tue, 16 Apr 2024 09:45:34 +0100, Martin Brown
    <'''newspam'''@nonad.co.uk> wrote:
    On 15/04/2024 18:13, Don Y wrote:

    <snip>

    Sometimes BIST can help ensure that small failures won't become
    board-burning failures, but an RMA will happen anyhow.

    Built-in self test is mostly auto-calibration. You can use temperature sensitive components for precise measurements if you calibrate out the temperature shift and re-calibrate if the measured temperature shifts appreciably (or every few minutes).

    It might also take out the effects of dopant drift in a hot device, but it wouldn't take it out forever.

    I just added a soft-start feature to a couple of boards. Apply a
    current-limited 48 volts to the power stages before the real thing is
    switched on hard.

    Soft-start has been around forever. If you don't pay attention to what happens to your circuit at start-up and turn-off you can have some real disasters.

    Yes, I've seen that a lot.
    The power rails in the production product came up in a different order
    from those in the development lab.
    This caused all kinds of previously unseen behaviour, including an expensive flash A/D chip burning up.

    I'd have it in the test spec that any missing power rail does not cause
    issues, and that any power rail can be turned on and off at any time.
    The equipment may not work properly with a missing power rail, but it
    should not be damaged.


    At Cambridge Instruments I once replaced all the tail resistors in a bunch
    of class-B long-tailed-pair-based scan amplifiers with constant-current
    diodes. With the resistor tails, the scan amps drew a lot of current while
    the 24V rail was being ramped up, which threw the 24V supply into
    current limit, so it never finished ramping. The constant-current diodes
    stopped this (not that I can remember how).

    This was a follow-up after I'd been brought in to stop the 24V power supply
    from blowing up (because it hadn't had a properly designed current limit).

    The problem had shown up in production - where it was known as the three
    bag problem, because when things did go wrong the excursions on the 24V
    rail destroyed three bags' worth of components.

    --
    Bill Sloman, Sydney




    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Joe Gwinn@21:1/5 to jjSNIPlarkin@highNONOlandtechnology on Tue Apr 16 13:20:34 2024
    On Tue, 16 Apr 2024 08:16:04 -0700, John Larkin <jjSNIPlarkin@highNONOlandtechnology.com> wrote:

    On Tue, 16 Apr 2024 10:19:00 -0400, Joe Gwinn <joegwinn@comcast.net>
    wrote:

    On Mon, 15 Apr 2024 16:26:35 -0700, john larkin <jl@650pot.com> wrote:

    On Mon, 15 Apr 2024 18:03:23 -0400, Joe Gwinn <joegwinn@comcast.net> >>>wrote:

    On Mon, 15 Apr 2024 13:05:40 -0700, john larkin <jl@650pot.com> wrote:

    On Mon, 15 Apr 2024 15:41:57 -0400, Joe Gwinn <joegwinn@comcast.net> >>>>>wrote:

    On Mon, 15 Apr 2024 10:13:02 -0700, Don Y >>>>>><blockedofcourse@foo.invalid> wrote:

    Is there a general rule of thumb for signalling the likelihood of >>>>>>>an "imminent" (for some value of "imminent") hardware failure?

    I suspect most would involve *relative* changes that would be >>>>>>>suggestive of changing conditions in the components (and not >>>>>>>directly related to environmental influences).

    So, perhaps, a good strategy is to just "watch" everything and >>>>>>>notice the sorts of changes you "typically" encounter in the hope >>>>>>>that something of greater magnitude would be a harbinger...

    There is a standard approach that may work: Measure the level and >>>>>>trend of very low frequency (around a tenth of a Hertz) flicker noise. >>>>>>When connections (perhaps within a package) start to fail, the flicker >>>>>>level rises. The actual frequency monitored isn't all that critical. >>>>>>
    Joe Gwinn

    Do connections "start to fail" ?

    Yes, they do, in things like vias. I went through a big drama where a >>>>critical bit of radar logic circuitry would slowly go nuts.

    It turned out that the copper plating on the walls of the vias was >>>>suffering from low-cycle fatigue during temperature cycling and slowly >>>>breaking, one little crack at a time, until it went open. If you >>>>measured the resistance to parts per million (6.5 digit DMM), sampling >>>>at 1 Hz, you could see the 1/f noise at 0.1 Hz rising. It's useful to >>>>also measure a copper line, and divide the via-chain resistance by the >>>>no-via resistance, to correct for temperature changes.

    But nobody is going to monitor every via on a PCB, even if it were >>>possible.

    It was not possible to test the vias on the failing logic board, but
    we knew from metallurgical cut, polish, and inspect studies of failed boards that it was the vias that were failing.


    One could instrument a PCB fab test board, I guess. But DC tests would
    be fine.

    What was being tested was a fab test board that had both the series
    via chain path and the no-via path of roughly the same DC resistance,
    set up so we could do 4-wire Kelvin resistance measurements of each
    path independent of the other path.
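
    A minimal sketch of that measurement idea, assuming 1 Hz samples of both
    paths (the band around 0.1 Hz and the FFT-based estimate are illustrative,
    not the exact analysis that was used):

        import numpy as np

        def flicker_level(r_via_chain, r_no_via, fs=1.0, f_lo=0.05, f_hi=0.2):
            """Crude band-limited RMS of the temperature-corrected ratio,
            used as a stand-in for the 0.1 Hz flicker level."""
            ratio = np.asarray(r_via_chain) / np.asarray(r_no_via)  # cancels temperature
            ratio = ratio - ratio.mean()
            spectrum = np.fft.rfft(ratio)
            freqs = np.fft.rfftfreq(len(ratio), d=1.0 / fs)
            band = (freqs >= f_lo) & (freqs <= f_hi)
            # Parseval: RMS contributed by the selected bins only.
            return np.sqrt(2.0 * np.sum(np.abs(spectrum[band]) ** 2)) / len(ratio)

    A rising trend in that number, sampled run after run, is the early warning;
    the absolute value matters much less than the change.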


    Yes, but the question was whether one could predict the failure of an operating electronic gadget. The answer is mostly NO.

    Agree.


    We had a visit from the quality team from a giant company that you
    have heard of. They wanted us to trend analyze all the power supplies
    on our boards and apply a complex algorithm to predict failures. It
    was total nonsense, basically predicting the future by zooming in on
    random noise with a big 1/f component, just like climate prediction.

    Hmm. My first instinct was that they were using MIL-HDBK-217 (?) or
    the like, but that does not measure noise. Do you recall any more of
    what they were doing? I might know what they were up to. The
    military were big on prognostics for a while, and still talk of this,
    but it never worked all that well in the field compared to what it was
    supposed to improve on.


    We have one board with over 4000 vias, but they are mostly in
    parallel.

    This can also be tested, but using a 6.5-digit DMM intended for
    measuring very low resistance values. A change of one part in 4,000
    is huge to a 6.5-digit instrument. The conductivity will decline
    linearly as vias fail one by one.
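
    A back-of-the-envelope check of that claim (the 1 milliohm per-via figure
    is an arbitrary placeholder; only the ratio matters):

        def open_via_shift_ppm(n_vias, r_via_ohms=0.001):
            """Fractional resistance change, in ppm, when one of n parallel vias opens."""
            r_before = r_via_ohms / n_vias
            r_after = r_via_ohms / (n_vias - 1)
            return 1e6 * (r_after - r_before) / r_before

        print(open_via_shift_ppm(4000))   # ~250 ppm per failed via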



    Millikelvin temperature changes would make more signal than a failing
    via.

    Not at the currents in that logic card. Too much ambient thermal
    noise.


    The solution was to redesign the vias, mainly to increase the critical >>>>volume of copper. And modern SMD designs have less and less copper >>>>volume.

    I bet precision resistors can also be measured this way.


    I don't think I've ever owned a piece of electronic equipment that >>>>>warned me of an impending failure.

    Onset of smoke emission is a common sign.


    Cars do, for some failure modes, like low oil level.

    The industrial method for big stuff is accelerometers attached near
    the bearings, and listen for excessive rotation-correlated (not >>>>necessarily harmonic) noise.

    Big ships that I've worked on have a long propeller shaft in the shaft >>>alley, a long tunnel where nobody often goes. They have magnetic shaft >>>runout sensors and shaft bearing temperature monitors.

    They measure shaft torque and SHP too, from the shaft twist.

    Yep. And these kinds of things fail slowly. At first.

    They could repair a bearing at sea, given a heads-up about violent
    failure. A serious bearing failure on a single-screw machine means
    getting a seagoing tug.

    The main engine gearbox had padlocks on the covers.

    There was also a chem lab to analyze oil and water and such, looking
    for contaminants that might suggest something going on.




    I liked hiding out in the shaft alley. It was private and cool, that >>>giant shaft slowly rotating.

    Probably had a calming flowing water sound as well.

    Yes, cool and beautiful and serene after the heat and noise and
    vibration of the engine room. A quiet 32,000 horsepower.

    It was fun being an electronic guru on sea trials of a ship full of
    big hairy Popeye types. I, skinny gawky kid, got my own stateroom when
    other tech reps slept in cots in the hold.

    Have you noticed how many lumberjack types are afraid of electricity?
    That can be funny.

    Oh yes. And EEs frightened by a 9-v battery.

    Joe Gwinn

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Edward Rawde@21:1/5 to Don Y on Tue Apr 16 15:43:17 2024
    "Don Y" <blockedofcourse@foo.invalid> wrote in message news:uvmjmt$140d2$1@dont-email.me...
    On 4/16/2024 10:25 AM, Edward Rawde wrote:
    Better to inform the individual who can get the replacement done when
    the
    tenant isn't even home.

    So, a WiFi/BT link to <whatever>? Now the simple smoke detector starts
    suffering from feeping creaturism. "Download the app..."

    No thanks. I have the same view of cameras.
    They won't be connecting outbound to a server anywhere in the world.
    But the average user does not know that and just wants the pictures on
    their
    phone.

    There is no need for a manufacturer to interpose themselves in such
    "remote access". Having the device register with a DDNS service
    cuts out the need for the manufacturer to essentially provide THAT
    service.

    Not for most users here.
    They tried to put me on LSN/CGNAT not long ago.
    After complaining, I was given a free static IPv4.
    Most users wouldn't know DDNS from a banana, and will expect it to work
    out of the box after installing the app on their phone.


    ...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don Y@21:1/5 to Edward Rawde on Tue Apr 16 12:31:33 2024
    On 4/16/2024 10:25 AM, Edward Rawde wrote:
    Better to inform the individual who can get the replacement done when the >>> tenant isn't even home.

    So, a WiFi/BT link to <whatever>? Now the simple smoke detector starts
    suffering from feeping creaturism. "Download the app..."

    No thanks. I have the same view of cameras.
    They won't be connecting outbound to a server anywhere in the world.
    But the average user does not know that and just wants the pictures on their phone.

    There is no need for a manufacturer to interpose themselves in such
    "remote access". Having the device register with a DDNS service
    cuts out the need for the manufacturer to essentially provide THAT
    service.

    OTOH, the manufacturer wants to "keep selling toilet paper" and
    has used that business model to underwrite the cost of the "device".

    Everything, here, is wired. And, my designs have the same approach:
    partly because I can distribute power over the fabric; partly because
    it removes an attack surface (RF jamming); partly for reliability
    (no reliance on services that the user doesn't "own"); partly for
    privacy (no information leaking -- even by side channel inference).
    My wireless connections are all short range and of necessity
    (e.g., the car connects via wifi so the views from the external
    cameras can be viewed on its LCD screen as it pulls out/in).

    Give that to some software which also knows how long the equipment has
    been
    in use, how often
    it has been used, what the temperature and humidity was, how long it's
    been
    since the oil was changed,
    and you might be able to give the operator useful information about when >>> to
    schedule specific maintenance.

    I have all of those things -- with the exception of knowing which sensory
    data
    is most pertinent to failure prediction.

    That's one reason why you want feedback from people who use your equipment.

    All they know is when something BREAKS. But, my device also knows that
    (unless EVERY "thing" breaks at the same time). The devices can capture
    pertinent data to adjust their model of when those other devices are
    likely to suffer similar failures, because the failed device shared
    its observations with them (via a common knowledge base).

    OTOH, if there is no oil in the gearbox, the equipment isn't going to
    start; if the oil sensor is defective, then *it* needs to be replaced.

    Preferably by me purchasing a new sensor and being able to replace it
    myself.

    If it makes sense to do so. Replacing a temperature sensor inside an
    MCU is likely not cost effective, not something folks are capable of
    doing, nor supported as an FRU.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don Y@21:1/5 to Edward Rawde on Tue Apr 16 14:35:34 2024
    On 4/16/2024 12:43 PM, Edward Rawde wrote:
    "Don Y" <blockedofcourse@foo.invalid> wrote in message news:uvmjmt$140d2$1@dont-email.me...
    On 4/16/2024 10:25 AM, Edward Rawde wrote:
    Better to inform the individual who can get the replacement done when >>>>> the
    tenant isn't even home.

    So, a WiFi/BT link to <whatever>? Now the simple smoke detector starts >>>> suffering from feeping creaturism. "Download the app..."

    No thanks. I have the same view of cameras.
    They won't be connecting outbound to a server anywhere in the world.
    But the average user does not know that and just wants the pictures on
    their
    phone.

    There is no need for a manufacturer to interpose themselves in such
    "remote access". Having the device register with a DDNS service
    cuts out the need for the manufacturer to essentially provide THAT
    service.

    Not for most users here.
    They tried to put me on lsn/cgnat not long ago.
    After complaining I was given a free static IPv4.

    Most folks, here, effectively have static IPs -- even if not guaranteed
    as such. But, most also have AUPs that prohibit running their own servers (speaking about consumers, not businesses).

    Most users wouldn't know DDNS from a banana, and will expect it to work out of the box after installing the app on their phone.

    There's no reason the app can't rely on DDNS. Infineon used to make
    a series of "power control modules" (think BSR/X10) for consumers.
    You could talk to the "controller" -- placed on YOUR network -- directly.
    No need to go THROUGH a third party (e.g., Infineon).

    If you wanted to access those devices (through the controller) from
    a remote location, the controller -- if you provided internet access
    to it -- would register with a DDNS and you could access it through
    that URL.
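
    A minimal sketch of that register-with-DDNS step, assuming a dyndns2-style
    update endpoint (the server name and credentials below are placeholders):

        import base64
        import urllib.request

        def update_ddns(hostname, username, password, server="dyn.example.net"):
            """Tell the DDNS provider our current WAN address for `hostname`."""
            url = f"https://{server}/nic/update?hostname={hostname}"
            req = urllib.request.Request(url)
            token = base64.b64encode(f"{username}:{password}".encode()).decode()
            req.add_header("Authorization", "Basic " + token)
            with urllib.request.urlopen(req, timeout=10) as resp:
                return resp.read().decode()   # e.g. "good <ip>" or "nochg <ip>"

    The controller would call this periodically (or whenever its WAN address
    changes), after which the owner reaches it by name -- no vendor cloud in
    the loop.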

    It is only recently that vendors have been trying to bake themselves into
    their products. Effectively turning their products into "rentals".
    You can buy a smart IP camera that will recognize *people*! For $30.
    Plus $6/month -- forever! (and, if you stop paying, you have a nice high-tech-looking PAPERWEIGHT).

    I rescued another (APC) UPS, recently. I was excited that they FINALLY
    had included the NIC in the basic model (instead of as an add-in card
    as it has historically been supplied - at additional cost).

    [I use the network access to log my power consumption and control the
    power to attached devices without having to press power buttons]

    Ah, but you can't *talk* to that NIC! It exists so the UPS can talk to the vendor! Who will let you talk to them to get information about YOUR UPS.

    So, you pay for a NIC that you can't use -- unless you agree to their
    terms (I have no idea if there is a fee involved or if they just want
    to spy on your usage and sell you batteries!)

    In addition to the sleaze factor, it's also a risk. Do you know what
    the software in the device does? Are you sure it is benevolent? And,
    not snooping your INTERNAL network (it's INSIDE your firewall)? Maybe
    just trying to sort out what sorts of hardware you have (for which they
    could pitch additional products/services)? Are you sure the product
    (if benign) can't be hacked and act as a beachhead for some other
    infestation?

    And, what, exactly, am I *getting* for this risk that I couldn't get
    with the "old style" NIC?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From john larkin@21:1/5 to Don on Tue Apr 16 15:06:49 2024
    On Tue, 16 Apr 2024 13:25:07 -0000 (UTC), "Don" <g@crcomp.net> wrote:

    Don Y wrote:
    Is there a general rule of thumb for signalling the likelihood of
    an "imminent" (for some value of "imminent") hardware failure?

    I suspect most would involve *relative* changes that would be
    suggestive of changing conditions in the components (and not
    directly related to environmental influences).

    So, perhaps, a good strategy is to just "watch" everything and
    notice the sorts of changes you "typically" encounter in the hope
    that something of greater magnitude would be a harbinger...

    A singular speculative spitball - the capacitive marker:

    In-situ Prognostic Method of Power MOSFET Based on Miller Effect

    ... This paper presents a new in-situ prognosis method for
    MOSFET based on miller effect. According to the theory
    analysis, simulation and experiment results, the miller
    platform voltage is identified as a new degradation
    precursor ...

    (10.1109/PHM.2017.8079139)

    Danke,

    Sounds like they are really measuring gate threshold, or gate transfer
    curve, drift with time. That happens and is usually no big deal, in
    moderation. Ions and charges drift around. We don't build opamp
    front-ends from power mosfets.

    This doesn't sound very useful for "in-situ" diagnostics.

    GaN fets can have a lot of gate threshold and leakage change over time
    too. Drive them hard and it doesn't matter.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don Y@21:1/5 to Edward Rawde on Tue Apr 16 14:22:28 2024
    On 4/16/2024 8:37 AM, Edward Rawde wrote:
    "Don" <g@crcomp.net> wrote in message news:20240416a@crcomp.net...
    Don Y wrote:
    Is there a general rule of thumb for signalling the likelihood of
    an "imminent" (for some value of "imminent") hardware failure?

    I suspect most would involve *relative* changes that would be
    suggestive of changing conditions in the components (and not
    directly related to environmental influences).

    So, perhaps, a good strategy is to just "watch" everything and
    notice the sorts of changes you "typically" encounter in the hope
    that something of greater magnitude would be a harbinger...

    A singular speculative spitball - the capacitive marker:

    In-situ Prognostic Method of Power MOSFET Based on Miller Effect

    ... This paper presents a new in-situ prognosis method for
    MOSFET based on miller effect. According to the theory
    analysis, simulation and experiment results, the miller
    platform voltage is identified as a new degradation
    precursor ...

    (10.1109/PHM.2017.8079139)

    Very interesting but are there any products out there which make use of this or other prognostic methods to provide information on remaining useful life?

    Wanna bet there's a shitload of effort going into sorting out how to
    prolong the service life of batteries for EVs?

    It's only a matter of time before large organizations and nations start
    looking hard at "eWaste", both from the standpoint of efficient use of
    capital and resources and of environmental consequences. If recycling were
    mandated (by law), how many vendors would rethink their approach to
    product design? (Do we really need to assume the cost of retrieving
    that 75 inch TV from the customer just so we can sell him ANOTHER?
    Is there a better way to pitch improvements in *features* instead of
    pels or screen size?)

    Here, you have to PAY (typ $25) for someone to take ownership of
    your CRT-based devices. I see Gaylords full of LCD monitors discarded
    each week. And, a 20 ft roll-off of "flat screen TVs" monthly.

    Most businesses discard EVERY workstation in their fleet on a
    2-3 yr basis. The software update cycle coerces hardware developers
    to design for a similarly (artificially) limited lifecycle.

    [Most people are clueless at the volume of eWaste that their communities generate, regularly.]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Edward Rawde@21:1/5 to Don Y on Tue Apr 16 18:19:32 2024
    "Don Y" <blockedofcourse@foo.invalid> wrote in message news:uvmqve$15hl7$2@dont-email.me...
    On 4/16/2024 12:43 PM, Edward Rawde wrote:
    "Don Y" <blockedofcourse@foo.invalid> wrote in message
    news:uvmjmt$140d2$1@dont-email.me...
    On 4/16/2024 10:25 AM, Edward Rawde wrote:
    Better to inform the individual who can get the replacement done when >>>>>> the
    tenant isn't even home.

    So, a WiFi/BT link to <whatever>? Now the simple smoke detector
    starts
    suffering from feeping creaturism. "Download the app..."

    But vendors know that most people want it easy so the push towards
    subscription services and products which phone home isn't going to change.

    Most people don't know or care what their products are sending to the
    vendor.

    I like to see what is connecting to what with https://www.pfsense.org/
    But I might be the only person in a 100-mile radius doing so.

    I can also remote desktop from anywhere of my choice, with the rest of the world unable to connect.

    Pretty much all of my online services are either restricted to specific IPs
    (cameras, remote desktop and similar), or have one or more countries and
    other problem IPs blocked (web sites and email services).

    None of that is possible when the vendor is in control because users will
    want their camera pictures available anywhere.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don Y@21:1/5 to Don on Tue Apr 16 15:21:25 2024
    On 4/16/2024 6:25 AM, Don wrote:
    Don Y wrote:
    Is there a general rule of thumb for signalling the likelihood of
    an "imminent" (for some value of "imminent") hardware failure?

    I suspect most would involve *relative* changes that would be
    suggestive of changing conditions in the components (and not
    directly related to environmental influences).

    So, perhaps, a good strategy is to just "watch" everything and
    notice the sorts of changes you "typically" encounter in the hope
    that something of greater magnitude would be a harbinger...

    A singular speculative spitball - the capacitive marker:

    In-situ Prognostic Method of Power MOSFET Based on Miller Effect

    ... This paper presents a new in-situ prognosis method for
    MOSFET based on miller effect. According to the theory
    analysis, simulation and experiment results, the miller
    platform voltage is identified as a new degradation
    precursor ...

    With the levels of integration we now routinely encounter, this
    is likely more of interest to a component vendor than an end designer.
    I.e., sell a device that provides this sort of information in a
    friendlier form.

    Most consumers/users don't care about which component failed.
    They just see the DEVICE as having failed.

    The Reliability Engineer likely has more of an interest -- but,
    only if he gets a chance to examine the failed device (how many
    "broken" devices actually get returned to their manufacturer
    for such analysis? Even when covered IN warranty??)

    When I see an LCD monitor indicating signs of imminent failure,
    I know I have to have a replacement on-hand. (I keep a shitload).
    I happen to know that this particular type of monitor (make/model)
    *tends* to fail in one of N (for small values of N) ways. So,
    when I get around to dismantling it and troubleshooting, I know
    where to start instead of having to wander through an undocumented
    design -- AGAIN.

    [I've standardized on three different (sized) models to make this
    process pretty simple; I don't want to spend more than a few minutes *repairing* a monitor!]

    If the swamp (evaporative) cooler cycles on, I can monitor the rate
    of water consumption compared to "nominal". Using this, I can infer
    the level of calcification of the water valve *in* the cooler.
    To some extent, I can compensate for obstruction by running the
    blower at a reduced speed (assuming the cooler can meet the needs
    of the house in this condition). With a VFD, I could find the sweet
    spot! :>

    So, I can alert the occupants of an impending problem that they might
    want to address before the cooler can't meet their needs (when the
    pads are insufficiently wetted, you're just pushing hot, dry air into
    the house/office/business).
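
    A minimal sketch of that inference (the nominal flow and the warning
    threshold are placeholders):

        def cooler_health(measured_lph, nominal_lph, warn_fraction=0.7):
            """Compare pad-wetting water flow against nominal to infer valve scaling."""
            fraction = measured_lph / nominal_lph
            if fraction < warn_fraction:
                return ("warn", f"water flow at {fraction:.0%} of nominal; "
                                "valve likely scaling up -- schedule descaling")
            return ("ok", f"water flow at {fraction:.0%} of nominal")

        print(cooler_health(6.5, 11.0))   # early warning well before the pads run dry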

    A "dumb" controller just looks at indoor temperature and cycles
    the system on/off based on whether or not it is above or below
    the desired setpoint (which means it can actually make the house
    warmer, the harder it tries to close the loop!)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Glen Walpert@21:1/5 to John Larkin on Tue Apr 16 23:41:48 2024
    On Tue, 16 Apr 2024 08:16:04 -0700, John Larkin wrote:

    <clip>

    The main engine gearbox had padlocks on the covers.

    Padlocks went on every reduction gearbox in the USN in the summer of 1972, after CV60's departure to Vietnam was delayed by 3 days due to a bucket of bolts being dumped into #3 Main Machinery Room reduction gear. The locks
    were custom made for the application and not otherwise available, to serve
    as a tamper-evident seal. You could easily cut one off but you couldn't
    get a replacement. #3 main gear was cleaned up and large burrs filed off,
    but still made a thump with every revolution for the entire 'Nam cruise.
    (This followed a 3 fatality fire in 3-Main which delayed departure by 3
    weeks, done skillfully enough to be deemed an accident.)

    I was assigned to 3-Main on CV60 for the shipyard overhaul following the
    'Nam cruise and heard the stories from those who were there. The
    reduction gear thump went away entirely after a full power run following overhaul, something rarely done except for testing on account of the
    ~million gallon a day fuel consumption.

    I liked hiding out in the shaft alley. It was private and cool, that >>>giant shaft slowly rotating.

    Probably had a calming flowing water sound as well.

    Yes, cool and beautiful and serene after the heat and noise and
    vibration of the engine room. A quiet 32,000 horsepower.

    So: not claustrophobic, and you like cool and quiet - you would have made a good submariner.

    Glen

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don@21:1/5 to john larkin on Tue Apr 16 23:25:30 2024
    john larkin wrote:
    Don wrote:
    Don Y wrote:
    Is there a general rule of thumb for signalling the likelihood of
    an "imminent" (for some value of "imminent") hardware failure?

    I suspect most would involve *relative* changes that would be
    suggestive of changing conditions in the components (and not
    directly related to environmental influences).

    So, perhaps, a good strategy is to just "watch" everything and
    notice the sorts of changes you "typically" encounter in the hope
    that something of greater magnitude would be a harbinger...

    A singular speculative spitball - the capacitive marker:

    In-situ Prognostic Method of Power MOSFET Based on Miller Effect

    ... This paper presents a new in-situ prognosis method for
    MOSFET based on miller effect. According to the theory
    analysis, simulation and experiment results, the miller
    platform voltage is identified as a new degradation
    precursor ...

    (10.1109/PHM.2017.8079139)

    Sounds like they are really measuring gate threshold, or gate transfer
    curve, drift with time. That happens and is usually no big deal, in moderation. Ions and charges drift around. We don't build opamp
    front-ends from power mosfets.

    This doesn't sound very useful for "in-situ" diagnostics.

    GaN fets can have a lot of gate threshold and leakage change over time
    too. Drive them hard and it doesn't matter.

    Threshold voltage measurement is indeed one of two parameters. The
    second parameter is the Miller plateau voltage measurement.
    The Miller plateau is directly related to the gate-drain
    capacitance, Cgd. It's why "capacitive marker" appears in my
    original followup.
    Long story short, the Miller plateau length provides a way to
    estimate Tj without a sensor. Some may find this useful.
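
    A first-order sketch of those quantities, using textbook gate-charge
    relations rather than the paper's exact method (all component values are
    illustrative):

        def miller_plateau(v_th, g_fs, i_d, v_drive, r_gate, q_gd):
            """Estimate Miller plateau voltage and length for a hard-switched MOSFET."""
            v_plateau = v_th + i_d / g_fs            # plateau voltage
            i_gate = (v_drive - v_plateau) / r_gate  # gate current during the plateau
            t_plateau = q_gd / i_gate                # plateau length
            return v_plateau, t_plateau

        # Trending these switch-by-switch is the proposed precursor: both move
        # as Vth, Cgd and junction temperature drift with degradation.
        print(miller_plateau(v_th=3.0, g_fs=20.0, i_d=10.0,
                             v_drive=12.0, r_gate=10.0, q_gd=20e-9))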

    Danke,

    --
    Don, KB7RPU, https://www.qsl.net/kb7rpu
    There was a young lady named Bright
    Whose speed was far faster than light;
    She set out one day
    In a relative way
    And returned on the previous night.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Larkin@21:1/5 to invalid@invalid.invalid on Tue Apr 16 17:58:47 2024
    On Tue, 16 Apr 2024 13:39:07 -0400, "Edward Rawde"
    <invalid@invalid.invalid> wrote:


    On 17/04/2024 1:22 am, John Larkin wrote:
    On Tue, 16 Apr 2024 09:45:34 +0100, Martin Brown
    <'''newspam'''@nonad.co.uk> wrote:
    On 15/04/2024 18:13, Don Y wrote:


    Yes I've seen that a lot.
    The power rails in the production product came up in a different order to >those in the development lab.
    This caused all kinds of previously unseen behaviour including an expensive >flash a/d chip burning up.

    I'd have it in the test spec that any missing power rail does not cause >issues.
    And any power rail can be turned on and off any time.
    The equipment may not work properly with a missing power rail but it should >not be damaged.


    Some FPGAs require supply sequencing, as many as four rails.

    LM3880 is a dedicated power-up sequencer, most cool.

    https://www.dropbox.com/scl/fi/gwrimefrgm729k8enqrir/28S662D_sh_19.pdf?rlkey=qvyip7rjqfy6i9yegqrt57n23&dl=0
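
    A minimal sketch of the ordering/abort logic such a part enforces in
    hardware (rail names, delays, and the enable/power-good hooks are
    placeholders):

        import time

        RAIL_ORDER = ["VCCINT_0V9", "VCCAUX_1V8", "VCCO_3V3", "VCC_IO_2V5"]

        def power_up(enable, power_good, settle_s=0.005, timeout_s=0.05):
            """Enable rails in order; shut everything down, in reverse, on any failure."""
            brought_up = []
            for rail in RAIL_ORDER:
                enable(rail, True)
                deadline = time.monotonic() + timeout_s
                while not power_good(rail):
                    if time.monotonic() > deadline:
                        for r in reversed(brought_up + [rail]):
                            enable(r, False)
                        raise RuntimeError(f"{rail} failed to come up")
                    time.sleep(0.001)
                brought_up.append(rail)
                time.sleep(settle_s)   # dwell before enabling the next rail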

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Larkin@21:1/5 to Don on Tue Apr 16 18:02:24 2024
    On Tue, 16 Apr 2024 23:25:30 -0000 (UTC), "Don" <g@crcomp.net> wrote:

    john larkin wrote:
    Don wrote:
    Don Y wrote:
    Is there a general rule of thumb for signalling the likelihood of
    an "imminent" (for some value of "imminent") hardware failure?

    I suspect most would involve *relative* changes that would be
    suggestive of changing conditions in the components (and not
    directly related to environmental influences).

    So, perhaps, a good strategy is to just "watch" everything and
    notice the sorts of changes you "typically" encounter in the hope
    that something of greater magnitude would be a harbinger...

    A singular speculative spitball - the capacitive marker:

    In-situ Prognostic Method of Power MOSFET Based on Miller Effect

    ... This paper presents a new in-situ prognosis method for
    MOSFET based on miller effect. According to the theory
    analysis, simulation and experiment results, the miller
    platform voltage is identified as a new degradation
    precursor ...

    (10.1109/PHM.2017.8079139)

    Sounds like they are really measuring gate threshold, or gate transfer
    curve, drift with time. That happens and is usually no big deal, in
    moderation. Ions and charges drift around. We don't build opamp
    front-ends from power mosfets.

    This doesn't sound very useful for "in-situ" diagnostics.

    GaN fets can have a lot of gate threshold and leakage change over time
    too. Drive them hard and it doesn't matter.

    Threshold voltage measurement is indeed one of two parameters. The
    second parameter is Miller platform voltage measurement.
    The Miller plateau is directly related to the gate-drain
    capacitance, Cgd. It's why "capacitive marker" appears in my
    original followup.
    Long story short, the Miller Plateau length provides a metric
    principle to measure Tj without a sensor. Some may find this useful.

    Danke,

    When we want to measure actual junction temperature of a mosfet, we
    use the substrate diode. Or get lazy and thermal image the top of the
    package.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Edward Rawde@21:1/5 to John Larkin on Tue Apr 16 21:04:40 2024
    "John Larkin" <jjSNIPlarkin@highNONOlandtechnology.com> wrote in message news:jr6u1j9vmo3a6tpl1evgrvmu1993slepno@4ax.com...
    On Tue, 16 Apr 2024 13:20:34 -0400, Joe Gwinn <joegwinn@comcast.net>
    wrote:

    On Tue, 16 Apr 2024 08:16:04 -0700, John Larkin >><jjSNIPlarkin@highNONOlandtechnology.com> wrote:

    On Tue, 16 Apr 2024 10:19:00 -0400, Joe Gwinn <joegwinn@comcast.net> >>>wrote:

    On Mon, 15 Apr 2024 16:26:35 -0700, john larkin <jl@650pot.com> wrote:

    On Mon, 15 Apr 2024 18:03:23 -0400, Joe Gwinn <joegwinn@comcast.net> >>>>>wrote:

    On Mon, 15 Apr 2024 13:05:40 -0700, john larkin <jl@650pot.com> wrote: >>>>>>
    On Mon, 15 Apr 2024 15:41:57 -0400, Joe Gwinn <joegwinn@comcast.net> >>>>>>>wrote:

    On Mon, 15 Apr 2024 10:13:02 -0700, Don Y >>>>>>>><blockedofcourse@foo.invalid> wrote:

    Is there a general rule of thumb for signalling the likelihood of >>>>>>>>>an "imminent" (for some value of "imminent") hardware failure? >>>>>>>>>
    ....

    I liked hiding out in the shaft alley. It was private and cool, that >>>>>giant shaft slowly rotating.

    Probably had a calming flowing water sound as well.

    Yes, cool and beautiful and serene after the heat and noise and
    vibration of the engine room. A quiet 32,000 horsepower.

    It was fun being an electronic guru on sea trials of a ship full of
    big hairy Popeye types. I, skinny gawky kid, got my own stateroom when >>>other tech reps slept in cots in the hold.

    Have you noticed how many lumberjack types are afraid of electricity? >>>That can be funny.

    Oh yes. And EEs frightened by a 9-v battery.

    Joe Gwinn

    I had an intern, an EE senior, who was afraid of 3.3 volts.

    I told him to touch an FPGA to see how warm it was getting, and he
    refused.


    That's what happens when they grow up having never accidentally touched the
    top cap of a 40KG6A/PL519.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Larkin@21:1/5 to All on Tue Apr 16 17:48:19 2024
    On Tue, 16 Apr 2024 13:20:34 -0400, Joe Gwinn <joegwinn@comcast.net>
    wrote:

    On Tue, 16 Apr 2024 08:16:04 -0700, John Larkin ><jjSNIPlarkin@highNONOlandtechnology.com> wrote:

    On Tue, 16 Apr 2024 10:19:00 -0400, Joe Gwinn <joegwinn@comcast.net>
    wrote:

    On Mon, 15 Apr 2024 16:26:35 -0700, john larkin <jl@650pot.com> wrote:

    On Mon, 15 Apr 2024 18:03:23 -0400, Joe Gwinn <joegwinn@comcast.net> >>>>wrote:

    On Mon, 15 Apr 2024 13:05:40 -0700, john larkin <jl@650pot.com> wrote: >>>>>
    On Mon, 15 Apr 2024 15:41:57 -0400, Joe Gwinn <joegwinn@comcast.net> >>>>>>wrote:

    On Mon, 15 Apr 2024 10:13:02 -0700, Don Y >>>>>>><blockedofcourse@foo.invalid> wrote:

    Is there a general rule of thumb for signalling the likelihood of >>>>>>>>an "imminent" (for some value of "imminent") hardware failure?

    I suspect most would involve *relative* changes that would be >>>>>>>>suggestive of changing conditions in the components (and not >>>>>>>>directly related to environmental influences).

    So, perhaps, a good strategy is to just "watch" everything and >>>>>>>>notice the sorts of changes you "typically" encounter in the hope >>>>>>>>that something of greater magnitude would be a harbinger...

    There is a standard approach that may work: Measure the level and >>>>>>>trend of very low frequency (around a tenth of a Hertz) flicker noise. >>>>>>>When connections (perhaps within a package) start to fail, the flicker >>>>>>>level rises. The actual frequency monitored isn't all that critical. >>>>>>>
    Joe Gwinn

    Do connections "start to fail" ?

    Yes, they do, in things like vias. I went through a big drama where a >>>>>critical bit of radar logic circuitry would slowly go nuts.

    It turned out that the copper plating on the walls of the vias was >>>>>suffering from low-cycle fatigue during temperature cycling and slowly >>>>>breaking, one little crack at a time, until it went open. If you >>>>>measured the resistance to parts per million (6.5 digit DMM), sampling >>>>>at 1 Hz, you could see the 1/f noise at 0.1 Hz rising. It's useful to >>>>>also measure a copper line, and divide the via-chain resistance by the >>>>>no-via resistance, to correct for temperature changes.

    But nobody is going to monitor every via on a PCB, even if it were >>>>possible.

    It was not possible to test the vias on the failing logic board, but
    we knew from metallurgical cut, polish, and inspect studies of failed >>>boards that it was the vias that were failing.


    One could instrument a PCB fab test board, I guess. But DC tests would >>>>be fine.

    What was being tested was a fab test board that had both the series
    via chain path and the no-via path of roughly the same DC resistance,
    set up so we could do 4-wire Kelvin resistance measurements of each
    path independent of the other path.


    Yes, but the question was whether one could predict the failure of an >>operating electronic gadget. The answer is mostly NO.

    Agree.


    We had a visit from the quality team from a giant company that you
    have heard of. They wanted us to trend analyze all the power supplies
    on our boards and apply a complex algotithm to predict failures. It
    was total nonsense, basically predicting the future by zooming in on
    random noise with a big 1/f component, just like climate prediction.

    Hmm. My first instinct was that they were using MIL-HNBK-317 (?) or
    the like, but that does not measure noise. Do you recall any more of
    what they were doing? I might know what they were up to. The
    military were big on prognostics for a while, and still talk of this,
    but it never worked all that well in the field compared to what it was >supposed to improve on.


    We have one board with over 4000 vias, but they are mostly in
    parallel.

    This can also be tested , but using a 6.5-digit DMM intended for >>>measuring very low resistance values. A change of one part in 4,000
    is huge to a 6.5-digit instrument. The conductivity will decline >>>linearly as vias fail one by one.



    Millikelvin temperature changes would make more signal than a failing
    via.

    Not at the currents in that logic card. Too much ambient thermal
    noise.


    The solution was to redesign the vias, mainly to increase the critical >>>>>volume of copper. And modern SMD designs have less and less copper >>>>>volume.

    I bet precision resistors can also be measured this way.


    I don't think I've ever owned a piece of electronic equipment that >>>>>>warned me of an impending failure.

    Onset of smoke emission is a common sign.


    Cars do, for some failure modes, like low oil level.

    The industrial method for big stuff is accelerometers attached near >>>>>the bearings, and listen for excessive rotation-correlated (not >>>>>necessarily harmonic) noise.

    Big ships that I've worked on have a long propeller shaft in the shaft >>>>alley, a long tunnel where nobody often goes. They have magnetic shaft >>>>runout sensors and shaft bearing temperature monitors.

    They measure shaft torque and SHP too, from the shaft twist.

    Yep. And these kinds of things fail slowly. At first.

    They could repair a bearing at sea, given a heads-up about violent
    failure. A serious bearing failure on a single-screw machine means
    getting a seagoing tug.

    The main engine gearbox had padlocks on the covers.

    There was also a chem lab to analyze oil and water and such, looking
    for contaminamts that might suggest something going on.




    I liked hiding out in the shaft alley. It was private and cool, that >>>>giant shaft slowly rotating.

    Probably had a calming flowing water sound as well.

    Yes, cool and beautiful and serene after the heat and noise and
    vibration of the engine room. A quiet 32,000 horsepower.

    It was fun being an electronic guru on sea trials of a ship full of
    big hairy Popeye types. I, skinny gawky kid, got my own stateroom when >>other tech reps slept in cots in the hold.

    Have you noticed how many lumberjack types are afraid of electricity?
    That can be funny.

    Oh yes. And EEs frightened by a 9-v battery.

    Joe Gwinn

    I had an intern, an EE senior, who was afraid of 3.3 volts.

    I told him to touch an FPGA to see how warm it was getting, and he
    refused.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don Y@21:1/5 to Edward Rawde on Tue Apr 16 18:12:14 2024
    On 4/16/2024 3:19 PM, Edward Rawde wrote:
    But vendors know that most people want it easy so the push towards subscription services and products which phone home isn't going to change.

    Until it does. There is nothing inherent in the design of any
    of these products that requires another "service" to provide the
    ADVERTISED functionality.

    Our stove and refrigerator have WiFi "apps" -- that rely on a tie-in
    to the manufacturer's site (no charge but still, why do they need
    access to my stovetop?).

    Simple solution: router has no radio! Even if the appliances wanted
    to connect (ignoring their "disable WiFi access" setting), there's
    nothing they can connect *to*.

    Most people don't know or care what their products are sending to the
    vendor.

    I think that is a generational issue. My neighbor just bought a
    camera and, when she realized it had to make an OUTBOUND connection
    to the vendor, she just opted to return it. So, a lost sale AND the cost
    of a return.

    Young people seem to find nothing odd about RENTING -- anything!
    Wanna listen to some music? You can RENT it, one song at a time!
    Wanna access the internet using free WiFi *inside* a business?
    The idea that they are leaking information never crosses their
    mind. They *deserve* to discover that some actuary has noted
    a correlation between people who shop at XYZCo and alcoholism.
    Or, inability to pay their debts. Or, cannabis use. Or...
    whatever the Big Data tells *them*.

    Like the driver who complained that his CAR was revealing his
    driving behavior through OnStar to credit agencies, whose
    subscribers (insurance companies) were using that to determine
    the risk he represented.

    I like to see what is connecting to what with https://www.pfsense.org/
    But I might be the only person in 100 mile radius doing so.

    I can also remote desktop from anywhere of my choice, with the rest of the world unable to connect.

    Pretty much all of my online services are either restricted to specific IPs (cameras, remote desktop and similar).
    Or they have one or more countries and other problem IPs blocked. (web sites and email services).

    But IP and MAC masquerading are trivial exercises. And, don't require
    a human participant to interact with the target (i.e., they can be automated).

    I have voice access to the services in my home. I don't rely on the
    CID information provided as it can be forged. But, I *do* require
    the *voice* match one of a few known voiceprints -- along with other
    conditions for access (e.g., if I am known to be HOME, then anyone
    calling with my voice is obviously an imposter; likewise, if
    someone "authorized" calls and passes the authentication procedure,
    they are limited in what they can do -- like, maybe close my garage
    door if I happened to leave it open and it is now after midnight).
    And, recording a phrase (uttered by that person) only works if you
    know what I am going to ASK you; anything that relies on your own
    personal knowledge can't be emulated, even by an AI!

    No need for apps or appliances -- you could technically use a "payphone"
    (if such things still existed) or an office phone in some business.

    I have a "cordless phone" in the car that lets me talk to the house from
    a range of 1/2 mile, without relying on cell phone service. I can't
    send video over the link -- but, I can ask "Did I remember to close
    the garage door?" Or, "Did I forget to turn off the tea kettle?"
    as I drive away.

    None of that is possible when the vendor is in control because users will want their camera pictures available anywhere.

    No, you just have to rely on other mechanisms for authentication.

    I have a friend who manages a datafarm at a large multinational bank.
    When he is here, he uses my internet connection -- which is "foreign"
    as far as the financial institution is concerned -- with no problems.
    But, he carries a time-varying "token" with him that ensures he
    has the correct credentials for any ~2 minute slice of time!
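
    A minimal RFC 6238-style sketch of such a token (the shared secret and the
    120-second step are illustrative; a real deployment uses the issuer's
    parameters):

        import hashlib
        import hmac
        import struct
        import time

        def totp(secret: bytes, step_s=120, digits=6, t=None):
            """Time-based one-time password: same secret + same clock = same code."""
            counter = int((time.time() if t is None else t) // step_s)
            mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
            offset = mac[-1] & 0x0F
            code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
            return str(code % 10 ** digits).zfill(digits)

        print(totp(b"shared-secret-from-the-bank"))   # changes every ~2 minutes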

    I rely on biometrics, backed with "shared secrets" ("Hi Jane!
    How's Tom doing?" "Hmmm, I don't know anyone by the name of Tom")
    because I don't want to have to carry a physical key (and
    don't want the other folks with access to have to do so, either)

    And, most folks don't really need remote access to the things
    that are offering that access. Why do I need to check the state
    of my oven/stove WHEN I AM NOT AT HOME? (Why the hell would
    I leave it ON when the house is empty???) There are refrigerators
    that take a photo of the contents of the frig each time you close
    the door. Do I care if the photo on my phone is of the state of the refrigerator when I was last IN PROXIMITY OF IT vs. it's most recent
    state? Do I need to access my thermostat "online" vs. via SMS?
    Or voice?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Edward Rawde@21:1/5 to John Larkin on Tue Apr 16 21:16:45 2024
    "John Larkin" <jjSNIPlarkin@highNONOlandtechnology.com> wrote in message news:p47u1j1tg35ctb3tcta5qevsfnhgnpcrsg@4ax.com...
    On Tue, 16 Apr 2024 13:39:07 -0400, "Edward Rawde"
    <invalid@invalid.invalid> wrote:


    On 17/04/2024 1:22 am, John Larkin wrote:
    On Tue, 16 Apr 2024 09:45:34 +0100, Martin Brown
    <'''newspam'''@nonad.co.uk> wrote:
    On 15/04/2024 18:13, Don Y wrote:


    Yes I've seen that a lot.
    The power rails in the production product came up in a different order to >>those in the development lab.
    This caused all kinds of previously unseen behaviour including an
    expensive
    flash a/d chip burning up.

    I'd have it in the test spec that any missing power rail does not cause >>issues.
    And any power rail can be turned on and off any time.
    The equipment may not work properly with a missing power rail but it
    should
    not be damaged.


    Some FPGAs require supply sequencing, as may as four.

    LM3880 is a dedicated powerup sequencer, most cool.

    https://www.dropbox.com/scl/fi/gwrimefrgm729k8enqrir/28S662D_sh_19.pdf?rlkey=qvyip7rjqfy6i9yegqrt57n23&dl=0


    OK, that doesn't surprise me.
    I'd want to be sure that the requirement is always met even when the 12V
    connector is in a position where it isn't sure whether it's connected or
    not, and that rapid and repeated connect/disconnect of 12V doesn't cause
    any issue.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Edward Rawde@21:1/5 to Don Y on Tue Apr 16 21:38:29 2024
    "Don Y" <blockedofcourse@foo.invalid> wrote in message news:uvn7lm$17so6$2@dont-email.me...
    On 4/16/2024 3:19 PM, Edward Rawde wrote:
    But vendors know that most people want it easy so the push towards
    subscription services and products which phone home isn't going to
    change.

    Until it does. There is nothing inherent in the design of any
    of these products that requires another "service" to provide the
    ADVERTISED functionality.

    Our stove and refrigerator have WiFi "apps" -- that rely on a tie-in
    to the manufacturer's site (no charge but still, why do they need
    access to my stovetop?).

    Simple solution: router has no radio! Even if the appliances wanted
    to connect (ignoring their "disable WiFi access" setting), there's
    nothing they can connect *to*.

    I'd have trouble here with no wifi access.
    I can restrict outbound with a firewall as necessary.


    Most people don't know or care what their products are sending to the
    vendor.

    I think that is a generational issue. My neighbor just bought
    camera and, when she realized it had to connect to their OUTBOUND
    wifi, she just opted to return it. So, a lost sale AND the cost
    of a return.

    Young people seem to find nothing odd about RENTING -- anything!
    Wanna listen to some music? You can RENT it, one song at a time!
    Wanna access the internet using free WiFi *inside* a business?
    The idea that they are leaking information never crosses their
    mind. They *deserve* to discover that some actuary has noted
    a correlation between people who shop at XYZCo and alcoholism.
    Or, inability to pay their debts. Or, cannabis use. Or...
    whatever the Big Data tells *them*.

    Like the driver who complained that his CAR was revealing his
    driving behavior through OnStar to credit agencies and their
    subscribers (insurance companies) were using that to determine
    the risk he represented.

    I like to see what is connecting to what with https://www.pfsense.org/
    But I might be the only person in 100 mile radius doing so.

    I can also remote desktop from anywhere of my choice, with the rest of
    the
    world unable to connect.

    Pretty much all of my online services are either restricted to specific
    IPs
    (cameras, remote desktop and similar).
    Or they have one or more countries and other problem IPs blocked. (web
    sites
    and email services).

    But IP and MAC masquerading are trivial exercises. And, don't require
    a human participant to interact with the target (i.e., they can be automated).

    That's why most Tor exit nodes and home-user VPN services are blocked.
    I don't allow unauthenticated access to anything (except web sites).
    I prefer to keep authentication simple and drop packets from countries and places which have no business connecting.
    Granted, a multinational bank may need a different approach, since their customers could be anywhere.
    If I were a multinational bank I'd be employing people to watch where the packets come from and decide which ones the firewall should drop.
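
    A minimal sketch of that drop-by-origin policy (the CIDR list is a
    placeholder; a real firewall would use country/ASN feeds):

        import ipaddress

        BLOCKED_NETS = [ipaddress.ip_network(n)
                        for n in ("203.0.113.0/24", "198.51.100.0/24")]

        def should_drop(src_ip: str) -> bool:
            """True if the source address falls in any blocked network."""
            addr = ipaddress.ip_address(src_ip)
            return any(addr in net for net in BLOCKED_NETS)

        print(should_drop("203.0.113.45"))   # True -> drop before it reaches auth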


    I have voice access to the services in my home. I don't rely on the
    CID information provided as it can be forged. But, I *do* require
    the *voice* match one of a few known voiceprints -- along with other conditions for access (e.g., if I am known to be HOME, then anyone
    calling with my voice is obviously an imposter; likewise, if
    someone "authorized" calls and passes the authentication procedure,
    they are limited in what they can do -- like, maybe close my garage
    door if I happened to leave it open and it is now after midnight).
    And, recording a phrase (uttered by that person) only works if you
    know what I am going to ASK you; anything that relies on your own
    personal knowledge can't be emulated, even by an AI!

    No need for apps or appliances -- you could technically use a "payphone"
    (if such things still existed) or an office phone in some business.

    I have a "cordless phone" in the car that lets me talk to the house from
    a range of 1/2 mile, without relying on cell phone service. I can't
    send video over the link -- but, I can ask "Did I remember to close
    the garage door?" Or, "Did I forget to turn off the tea kettle?"
    as I drive away.

    None of that is possible when the vendor is in control because users will
    want their camera pictures available anywhere.

    No, you just have to rely on other mechanisms for authentication.

    I have a friend who manages a datafarm at a large multinational bank.
    When he is here, he uses my internet connection -- which is "foreign"
    as far as the financial institution is concerned -- with no problems.
    But, he carries a time-varying "token" with him that ensures he
    has the correct credentials for any ~2 minute slice of time!

    I rely on biometrics, backed with "shared secrets" ("Hi Jane!
    How's Tom doing?" "Hmmm, I don't know anyone by the name of Tom")
    because I don't want to have to carry a physical key (and
    don't want the other folks with access to have to do so, either)

    And, most folks don't really need remote access to the things
    that are offering that access. Why do I need to check the state
    of my oven/stove WHEN I AM NOT AT HOME? (Why the hell would
    I leave it ON when the house is empty???) There are refrigerators
    that take a photo of the contents of the frig each time you close
    the door. Do I care if the photo on my phone is of the state of the refrigerator when I was last IN PROXIMITY OF IT vs. it's most recent
    state? Do I need to access my thermostat "online" vs. via SMS?
    Or voice?


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don Y@21:1/5 to Edward Rawde on Tue Apr 16 20:17:12 2024
    On 4/16/2024 6:38 PM, Edward Rawde wrote:
    Simple solution: router has no radio! Even if the appliances wanted
    to connect (ignoring their "disable WiFi access" setting), there's
    nothing they can connect *to*.

    I'd have trouble here with no wifi access.
    I can restrict outbound with a firewall as necessary.

    I have 25 general purpose drops, here. So, you can be in any room,
    front/back porch -- even the ROOF -- and get connected.

    When I *need* wifi, I have to turn on one of the radios in
    the ceiling, temporarily. (they are there as convenience
    features for visiting guests; they are blocked from all of
    the wired connections in the house)

    But IP and MAC masquerading are trivial exercises. And, don't require
    a human participant to interact with the target (i.e., they can be
    automated).

    That's why most tor exit nodes and home user vpn services are blocked.
    I don't allow unauthenticated access to anything (except web sites).
    I prefer to keep authentication simple and drop packets from countries and places who have no business connecting.
    Granted a multinational bank may need a different approach since their customers could be anywhere.
    If I were a multinational bank I'd be employing people to watch where the packets come from and decide which ones the firewall should drop.

    The internal network isn't routed. So, the only machines to worry about are this one (used only for email/news/web) and a laptop that is only used
    for ecommerce.

    I have an out-facing server that operates in stealth mode and won't appear
    on probes (only used to source my work to colleagues). The goal is not to
    look "interesting".

    The structure of the house's fabric allows me to treat any individual
    node as being directly connected to the ISP while isolating the
    rest of the nodes. I.e., if you bring a laptop loaded with malware into
    the house, you can't infect anything (or even know that there are other
    hosts, here); it's as if you had a dedicated connection to the Internet
    with no other devices "nearby".

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don@21:1/5 to John Larkin on Wed Apr 17 03:46:02 2024
    John Larkin wrote:
    Don wrote:
    john larkin wrote:
    Don wrote:
    Don Y wrote:
    Is there a general rule of thumb for signalling the likelihood of
    an "imminent" (for some value of "imminent") hardware failure?

    I suspect most would involve *relative* changes that would be
    suggestive of changing conditions in the components (and not
    directly related to environmental influences).

    So, perhaps, a good strategy is to just "watch" everything and
    notice the sorts of changes you "typically" encounter in the hope
    that something of greater magnitude would be a harbinger...

    A singular speculative spitball - the capacitive marker:

    In-situ Prognostic Method of Power MOSFET Based on Miller Effect

    ... This paper presents a new in-situ prognosis method for
    MOSFET based on miller effect. According to the theory
    analysis, simulation and experiment results, the miller
    platform voltage is identified as a new degradation
    precursor ...

    (10.1109/PHM.2017.8079139)

    Sounds like they are really measuring gate threshold, or gate transfer
    curve, drift with time. That happens and is usually no big deal, in
    moderation. Ions and charges drift around. We don't build opamp
    front-ends from power mosfets.

    This doesn't sound very useful for "in-situ" diagnostics.

    GaN fets can have a lot of gate threshold and leakage change over time
    too. Drive them hard and it doesn't matter.

    Threshold voltage measurement is indeed one of two parameters. The
    second is the Miller plateau (the paper's "miller platform") voltage.
    The Miller plateau is directly related to the gate-drain
    capacitance, Cgd, which is why "capacitive marker" appears in my
    original followup.
    Long story short, the Miller plateau length provides a way to
    estimate Tj without a dedicated sensor. Some may find this useful.
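    For a rough sense of the quantities involved (illustrative numbers only, not
    the paper's data or algorithm), the standard gate-charge relations can be
    sketched in a few lines of Python:

        # Vgp ~= Vth + Id/gfs; plateau length ~= Qgd / ((Vdrv - Vgp)/Rg); and a
        # calibrated tempco turns a plateau-voltage shift into a crude Tj estimate.
        # All parameter values below are made up but plausible for a mid-size MOSFET.

        def plateau_voltage(vth, i_d, gfs):
            return vth + i_d / gfs

        def plateau_duration(qgd, vdrv, vgp, rg):
            return qgd / ((vdrv - vgp) / rg)

        def estimate_tj(vgp_meas, vgp_cal, t_cal_c, tempco_v_per_c):
            return t_cal_c + (vgp_cal - vgp_meas) / tempco_v_per_c

        vgp = plateau_voltage(vth=4.0, i_d=10.0, gfs=20.0)                # ~4.5 V
        t_gp = plateau_duration(qgd=30e-9, vdrv=12.0, vgp=vgp, rg=10.0)   # ~40 ns
        tj = estimate_tj(vgp_meas=4.42, vgp_cal=4.50, t_cal_c=25.0,
                         tempco_v_per_c=0.005)                            # ~41 C
        print(f"Vgp={vgp:.2f} V  plateau={t_gp*1e9:.0f} ns  Tj~{tj:.0f} C")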

    When we want to measure actual junction temperature of a mosfet, we
    use the substrate diode. Or get lazy and thermal image the top of the package.

    My son asked me to explain how Government works. So I told him. They
    hire a guy, give him a FLIR, and bundle both with their product as an
    in-situ prognostic solution.

    Danke,

    --
    Don, KB7RPU, https://www.qsl.net/kb7rpu
    There was a young lady named Bright
    Whose speed was far faster than light;
    She set out one day
    In a relative way
    And returned on the previous night.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Larkin@21:1/5 to invalid@invalid.invalid on Tue Apr 16 20:23:46 2024
    On Tue, 16 Apr 2024 21:16:45 -0400, "Edward Rawde"
    <invalid@invalid.invalid> wrote:

    "John Larkin" <jjSNIPlarkin@highNONOlandtechnology.com> wrote in message >news:p47u1j1tg35ctb3tcta5qevsfnhgnpcrsg@4ax.com...
    On Tue, 16 Apr 2024 13:39:07 -0400, "Edward Rawde"
    <invalid@invalid.invalid> wrote:


    On 17/04/2024 1:22 am, John Larkin wrote:
    On Tue, 16 Apr 2024 09:45:34 +0100, Martin Brown
    <'''newspam'''@nonad.co.uk> wrote:
    On 15/04/2024 18:13, Don Y wrote:


    Yes I've seen that a lot.
    The power rails in the production product came up in a different order to
    those in the development lab.
    This caused all kinds of previously unseen behaviour including an expensive
    flash a/d chip burning up.

    I'd have it in the test spec that any missing power rail does not cause issues.
    And any power rail can be turned on and off any time.
    The equipment may not work properly with a missing power rail but it should
    not be damaged.


    Some FPGAs require supply sequencing, as many as four.

    LM3880 is a dedicated powerup sequencer, most cool.

    https://www.dropbox.com/scl/fi/gwrimefrgm729k8enqrir/28S662D_sh_19.pdf?rlkey=qvyip7rjqfy6i9yegqrt57n23&dl=0


    Ok that doesn't surprise me.
    I'd want to be sure that the requirement is always met even when the 12V
    connector is in a position where it isn't sure whether it's connected or not.
    Or rapid and repeated connect/disconnect of 12V doesn't cause any issue.


    We considered the brownout case. The MAX809 handles that.

    This supply will also tolerate +24v input, in case someone grabs the
    wrong wart. Or connects the power backwards.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Larkin@21:1/5 to invalid@invalid.invalid on Tue Apr 16 20:19:19 2024
    On Tue, 16 Apr 2024 21:04:40 -0400, "Edward Rawde"
    <invalid@invalid.invalid> wrote:

    "John Larkin" <jjSNIPlarkin@highNONOlandtechnology.com> wrote in message >news:jr6u1j9vmo3a6tpl1evgrvmu1993slepno@4ax.com...
    On Tue, 16 Apr 2024 13:20:34 -0400, Joe Gwinn <joegwinn@comcast.net>
    wrote:

    On Tue, 16 Apr 2024 08:16:04 -0700, John Larkin >>><jjSNIPlarkin@highNONOlandtechnology.com> wrote:

    On Tue, 16 Apr 2024 10:19:00 -0400, Joe Gwinn <joegwinn@comcast.net> >>>>wrote:

    On Mon, 15 Apr 2024 16:26:35 -0700, john larkin <jl@650pot.com> wrote: >>>>>
    On Mon, 15 Apr 2024 18:03:23 -0400, Joe Gwinn <joegwinn@comcast.net> >>>>>>wrote:

    On Mon, 15 Apr 2024 13:05:40 -0700, john larkin <jl@650pot.com> wrote: >>>>>>>
    On Mon, 15 Apr 2024 15:41:57 -0400, Joe Gwinn <joegwinn@comcast.net> >>>>>>>>wrote:

    On Mon, 15 Apr 2024 10:13:02 -0700, Don Y >>>>>>>>><blockedofcourse@foo.invalid> wrote:

    Is there a general rule of thumb for signalling the likelihood of >>>>>>>>>>an "imminent" (for some value of "imminent") hardware failure? >>>>>>>>>>
    ....

    I liked hiding out in the shaft alley. It was private and cool, that >>>>>>giant shaft slowly rotating.

    Probably had a calming flowing water sound as well.

    Yes, cool and beautiful and serene after the heat and noise and >>>>vibration of the engine room. A quiet 32,000 horsepower.

    It was fun being an electronic guru on sea trials of a ship full of
    big hairy Popeye types. I, skinny gawky kid, got my own stateroom when >>>>other tech reps slept in cots in the hold.

    Have you noticed how many lumberjack types are afraid of electricity? >>>>That can be funny.

    Oh yes. And EEs frightened by a 9-v battery.

    Joe Gwinn

    I had an intern, an EE senior, who was afraid of 3.3 volts.

    I told him to touch an FPGA to see how warm it was getting, and he
    refused.


    That's what happens when they grow up having never accidentally touched the top cap of a 40KG6A/PL519


    They can type code. Rust is supposed to be safe.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Edward Rawde@21:1/5 to John Larkin on Tue Apr 16 23:50:19 2024
    "John Larkin" <jjSNIPlarkin@highNONOlandtechnology.com> wrote in message news:prfu1j589dupr032431n5laq6cqist2fka@4ax.com...
    On Tue, 16 Apr 2024 21:04:40 -0400, "Edward Rawde"
    <invalid@invalid.invalid> wrote:

    "John Larkin" <jjSNIPlarkin@highNONOlandtechnology.com> wrote in message >>news:jr6u1j9vmo3a6tpl1evgrvmu1993slepno@4ax.com...
    On Tue, 16 Apr 2024 13:20:34 -0400, Joe Gwinn <joegwinn@comcast.net>
    wrote:

    On Tue, 16 Apr 2024 08:16:04 -0700, John Larkin >>>><jjSNIPlarkin@highNONOlandtechnology.com> wrote:

    On Tue, 16 Apr 2024 10:19:00 -0400, Joe Gwinn <joegwinn@comcast.net> >>>>>wrote:

    On Mon, 15 Apr 2024 16:26:35 -0700, john larkin <jl@650pot.com> wrote: >>>>>>
    On Mon, 15 Apr 2024 18:03:23 -0400, Joe Gwinn <joegwinn@comcast.net> >>>>>>>wrote:

    On Mon, 15 Apr 2024 13:05:40 -0700, john larkin <jl@650pot.com> >>>>>>>>wrote:

    On Mon, 15 Apr 2024 15:41:57 -0400, Joe Gwinn >>>>>>>>><joegwinn@comcast.net>
    wrote:

    On Mon, 15 Apr 2024 10:13:02 -0700, Don Y >>>>>>>>>><blockedofcourse@foo.invalid> wrote:

    Is there a general rule of thumb for signalling the likelihood of >>>>>>>>>>>an "imminent" (for some value of "imminent") hardware failure? >>>>>>>>>>>
    ....

    I liked hiding out in the shaft alley. It was private and cool, that >>>>>>>giant shaft slowly rotating.

    Probably had a calming flowing water sound as well.

    Yes, cool and beautiful and serene after the heat and noise and >>>>>vibration of the engine room. A quiet 32,000 horsepower.

    It was fun being an electronic guru on sea trials of a ship full of >>>>>big hairy Popeye types. I, skinny gawky kid, got my own stateroom when >>>>>other tech reps slept in cots in the hold.

    Have you noticed how many lumberjack types are afraid of electricity? >>>>>That can be funny.

    Oh yes. And EEs frightened by a 9-v battery.

    Joe Gwinn

    I had an intern, an EE senior, who was afraid of 3.3 volts.

    I told him to touch an FPGA to see how warm it was getting, and he
    refused.


    That's what happens when they grow up having never accidentally touched
    the
    top cap of a 40KG6A/PL519


    They can type code. Rust is supposed to be safe.

    I doubt it's safe from the programmer who implemented my humidifier like
    this:

    if humidity < setting {
        fan_on();
    } else {
        fan_off();
    }
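    Presumably the objection is the missing hysteresis: a bare compare like that
    short-cycles the fan every time the reading crosses the setpoint. A minimal
    Python sketch of the same loop with a deadband (function names and the 3%
    figure are illustrative, not the appliance's firmware):

        HYSTERESIS = 3.0  # percent RH; illustrative

        def fan_should_run(humidity, setting, fan_running):
            """One pass through the control loop, returning the new fan state."""
            if humidity < setting - HYSTERESIS:
                return True            # well below target: run
            if humidity > setting + HYSTERESIS:
                return False           # well above target: stop
            return fan_running         # inside the deadband: leave it alone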






    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Edward Rawde@21:1/5 to Don Y on Wed Apr 17 00:21:24 2024
    "Don Y" <blockedofcourse@foo.invalid> wrote in message news:uvnf00$1cu2a$1@dont-email.me...
    On 4/16/2024 6:38 PM, Edward Rawde wrote:
    Simple solution: router has no radio! Even if the appliances wanted
    to connect (ignoring their "disable WiFi access" setting), there's
    nothing they can connect *to*.

    I'd have trouble here with no wifi access.
    I can restrict outbound with a firewall as necessary.

    I have 25 general purpose drops, here. So, you can be in any room, front/back porch -- even the ROOF -- and get connected.

    I have wired LAN to every room too, but it's not only me who uses wifi, so
    wifi can't be turned off.


    The internal network isn't routed. So, the only machines to worry about
    are
    this one (used only for email/news/web) and a laptop that is only used
    for ecommerce.

    My LAN is more like a small/medium size business with all workstations,
    servers and devices behind a firewall and able to communicate both with each other and online as necessary.
    I wouldn't want to give online security advice to others without doing it myself.


    I have an out-facing server that operates in stealth mode and won't appear
    on probes (only used to source my work to colleagues). The goal is not to look "interesting".

    Not sure what you mean by that.
    Given what gets thrown at my firewall I think you could maybe look more interesting than you think.


    The structure of the house's fabric allows me to treat any individual
    node as being directly connected to the ISP while isolating the
    rest of the nodes. I.e., if you bring a laptop loaded with malware into
    the house, you can't infect anything (or even know that there are other hosts, here); it's as if you had a dedicated connection to the Internet
    with no other devices "nearby".

    I wouldn't bother. I'd just not connect it to wifi or wired if I thought
    there was a risk.
    It's been a while since I had to clean a malware infested PC.




    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don Y@21:1/5 to Edward Rawde on Tue Apr 16 22:14:06 2024
    On 4/16/2024 9:21 PM, Edward Rawde wrote:
    The internal network isn't routed. So, the only machines to worry about
    are
    this one (used only for email/news/web) and a laptop that is only used
    for ecommerce.

    My LAN is more like a small/medium size business with all workstations, servers and devices behind a firewall and able to communicate both with each other and online as necessary.

    I have 72 drops in the office and 240 throughout the rest of the house
    (though the vast majority of those are for dedicated "appliances")...
    about 2.5 miles of CAT5.

    I have no desire to waste any time installing the latest OS & AV updates, keeping an IDS operationally effective, etc. My business is designing
    devices so my uses reflect that -- and nothing else.

    "Patch Tuesday?" What's that?? Why would I *want* to play that game?

    I wouldn't want to give online security advice to others without doing it myself.

    The advice I give to others is to only leave "exposed" what you absolutely
    MUST leave exposed. Most of my colleagues have adopted similar strategies
    to keep their intellectual property secure; it's a small inconvenience
    to (physically) move to a routed workstation when one needs to check email
    or chase down a resource online.

    I have an out-facing server that operates in stealth mode and won't appear >> on probes (only used to source my work to colleagues). The goal is not to >> look "interesting".

    Not sure what you mean by that.
    Given what gets thrown at my firewall I think you could maybe look more interesting than you think.

    Nothing on my side "answers" connection attempts. To the rest of the world,
    it looks like a cable dangling in air...

    The structure of the house's fabric allows me to treat any individual
    node as being directly connected to the ISP while isolating the
    rest of the nodes. I.e., if you bring a laptop loaded with malware into
    the house, you can't infect anything (or even know that there are other
    hosts, here); it's as if you had a dedicated connection to the Internet
    with no other devices "nearby".

    I wouldn't bother. I'd just not connect it to wifi or wired if I thought there was a risk.

    So, you'd have to *police* all such connections. What do you do with hundreds of drops on a factory floor? Or, scattered throughout a business? Can
    you prevent any "foreign" devices from being connected -- even if IN PLACE OF
    a legitimate device? (after all, it is a trivial matter to unplug a network cable from one "approved" PC and plug it into a "foreign import")

    It's been a while since I had to clean a malware infested PC.

    My current project relies heavily on internetworking for interprocessor communication. So, has to be designed to tolerate (and survive) a
    hostile actor being directly connected TO that fabric -- because that
    is a likely occurrence, "in the wild".

    Imagine someone being able to open your PC and alter the internals...
    and be expected to continue to operate as if this had not occurred!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Jasen Betts@21:1/5 to Don Y on Wed Apr 17 05:12:55 2024
    On 2024-04-16, Don Y <blockedofcourse@foo.invalid> wrote:
    On 4/16/2024 10:25 AM, Edward Rawde wrote:
    Better to inform the individual who can get the replacement done when the >>>> tenant isn't even home.

    So, a WiFi/BT link to <whatever>? Now the simple smoke detector starts
    suffering from feeping creaturism. "Download the app..."

    No thanks. I have the same view of cameras.
    They won't be connecting outbound to a server anywhere in the world.
    But the average user does not know that and just wants the pictures on their >> phone.

    There is no need for a manufacturer to interpose themselves in such
    "remote access". Having the device register with a DDNS service
    cuts out the need for the manufacturer to essentially provide THAT
    service.

    Someone still needs to provide DDNS.

    Yes, UPnP has been a thing for several generations of routers now,
    but browsers have become fussier about port numbers too. Also, some
    customers are on Carrier Grade NAT, and I don't think that UPnP can traverse
    that. IPv6, however, can avoid the CGNAT problem.

    It's an ease of use vs quality of service problem.

    --
    Jasen.
    🇺🇦 Слава Україні (Glory to Ukraine)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don Y@21:1/5 to Jasen Betts on Tue Apr 16 22:51:37 2024
    On 4/16/2024 10:12 PM, Jasen Betts wrote:
    There is no need for a manufacturer to interpose themselves in such
    "remote access". Having the device register with a DDNS service
    cuts out the need for the manufacturer to essentially provide THAT
    service.

    Someone still needs to provide DDNS.

    Yes, but ALL they are providing is name resolution. They aren't
    processing your data stream or "adding any value", there.
    So, point your DNS at an IP that maps to the DDNS service
    of your choice when the device "registers" with it!

    Manufacturer can abandon a product line and your hardware STILL WORKS!
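    As a sketch of what that registration amounts to (placeholder provider URL,
    hostname and credentials -- not any particular service's API), the device just
    has to refresh one DNS record periodically, dyndns2-style:

        import requests, time

        UPDATE_URL = "https://ddns.example.net/nic/update"   # placeholder provider
        CHECKIP_URL = "https://ddns.example.net/checkip"     # placeholder
        HOSTNAME = "mydevice.example.net"                    # placeholder record
        AUTH = ("user", "secret")                            # placeholder credentials

        def update_record():
            ip = requests.get(CHECKIP_URL, timeout=10).text.strip()
            r = requests.get(UPDATE_URL,
                             params={"hostname": HOSTNAME, "myip": ip},
                             auth=AUTH, timeout=10)
            return r.text   # dyndns2-style replies look like "good <ip>" / "nochg <ip>"

        if __name__ == "__main__":
            while True:
                print(update_record())
                time.sleep(600)    # refresh every ten minutes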

    Yes, UPnP has been a thing for several generations of routers now,
    but browsers have become fussier about port numbers too. Also, some
    customers are on Carrier Grade NAT, and I don't think that UPnP can traverse that. IPv6, however, can avoid the CGNAT problem.

    It's an ease of use vs quality of service problem.


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Edward Rawde@21:1/5 to Don Y on Wed Apr 17 01:39:51 2024
    "Don Y" <blockedofcourse@foo.invalid> wrote in message news:uvnlr6$1e3fi$1@dont-email.me...
    On 4/16/2024 9:21 PM, Edward Rawde wrote:
    The internal network isn't routed. So, the only machines to worry about >>> are
    this one (used only for email/news/web) and a laptop that is only used
    for ecommerce.

    My LAN is more like a small/medium size business with all workstations,
    servers and devices behind a firewall and able to communicate both with
    each
    other and online as necessary.

    I have 72 drops in the office and 240 throughout the rest of the house (though the vast majority of those are for dedicated "appliances")...
    about 2.5 miles of CAT5.

    Must be a big house.


    ...
    I have an out-facing server that operates in stealth mode and won't
    appear
    on probes (only used to source my work to colleagues). The goal is not
    to
    look "interesting".

    Not sure what you mean by that.
    Given what gets thrown at my firewall I think you could maybe look more
    interesting than you think.

    Nothing on my side "answers" connection attempts. To the rest of the
    world,
    it looks like a cable dangling in air...

    You could ping me if you knew my IP address.


    The structure of the house's fabric allows me to treat any individual
    node as being directly connected to the ISP while isolating the
    rest of the nodes. I.e., if you bring a laptop loaded with malware into >>> the house, you can't infect anything (or even know that there are other
    hosts, here); it's as if you had a dedicated connection to the Internet
    with no other devices "nearby".

    I wouldn't bother. I'd just not connect it to wifi or wired if I thought
    there was a risk.

    What I mean by that is I'd clean it without it being connected.
    The Avira boot CD used to be useful but I forget how many years ago.


    So, you'd have to *police* all such connections. What do you do with hundreds
    of drops on a factory floor? Or, scattered throughout a business? Can
    you prevent any "foreign" devices from being connected -- even if IN PLACE
    OF
    a legitimate device? (after all, it is a trivial matter to unplug a
    network
    cable from one "approved" PC and plug it into a "foreign import")

    Devices on a LAN should be secure just like Internet facing devices.


    It's been a while since I had to clean a malware infested PC.

    My current project relies heavily on internetworking for interprocessor communication. So, has to be designed to tolerate (and survive) a
    hostile actor being directly connected TO that fabric -- because that
    is a likely occurrence, "in the wild".

    Imagine someone being able to open your PC and alter the internals...
    and be expected to continue to operate as if this had not occurred!


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don Y@21:1/5 to Edward Rawde on Tue Apr 16 23:33:11 2024
    On 4/16/2024 10:39 PM, Edward Rawde wrote:
    "Don Y" <blockedofcourse@foo.invalid> wrote in message news:uvnlr6$1e3fi$1@dont-email.me...
    On 4/16/2024 9:21 PM, Edward Rawde wrote:
    The internal network isn't routed. So, the only machines to worry about >>>> are
    this one (used only for email/news/web) and a laptop that is only used >>>> for ecommerce.

    My LAN is more like a small/medium size business with all workstations,
    servers and devices behind a firewall and able to communicate both with
    each
    other and online as necessary.

    I have 72 drops in the office and 240 throughout the rest of the house
    (though the vast majority of those are for dedicated "appliances")...
    about 2.5 miles of CAT5.

    Must be a big house.

    The office is ~150 sq ft. Three sets of dual workstations each sharing a
    set of monitors and a tablet (for music) -- 7 drops for each such set.
    Eight drops for my "prototyping platform". Twelve UPSs. Four scanners
    (two B size, one A-size w/ADF and a film scanner). An SB2000 and Voyager
    (for cross development testing; I'm discarding a T5220 tomorrow).
    Four "toy" NASs (for sharing files between myself and SWMBO, documents
    dropped by the scanners, etc.). Four 12-bay NASs, two 16 bay. Four
    8-bay ESXi servers. Two 1U servers. Two 2U servers. My DBMS server.
    A "general services" appliance (DNS, NTP, PXE, FTP, TFTP, font, etc.
    services). Three media front ends. One media tank. Two 12 bay
    (and one 24 bay) iSCSI SAN devices.

    [It's amazing how much stuff you can cram into a small space when you
    try hard! :> To be completely honest, the scanners are located in
    my adjoining bedroom]

    The house is a bit under 2000 sq ft. But, the drops go to places that "people"
    don't normally access -- with the notable exception of the 25 "uncommitted drops": 2 in each bedroom, 2 on kitchen counters, 4 in living room,
    3 in family room, 2 in dining room, front hall, back porch, front porch,
    etc.

    E.g., there are 4 in the kitchen ceiling -- for four "network speakers" (controller, amplifier, network interface). Four more in the family room
    (same use). And two on the back porch.

    There's one on the roof *in* the evaporative cooler (to control the
    evaporative cooler, of course). Another for a weather station
    (to sort out how best to use the HVAC options available). Another
    in the furnace/ACbrrr.

    One for a genset out by the load center. Another for a solar installation.
    One to monitor utility power consumption. Another for municipal water.
    And natural gas. One for the irrigation system. One for water
    "treatment".

    One for the garage (door opener and "parking assistant"). Another for the water heater. Washer. Dryer. Stove/oven. Refrigerator. Dishwasher.

    One for each skylight (to allow for automatic venting, shading and environmental sensing). One for each window (automate window coverings).

    Three "control panels". One "privileged port" (used to "introduce" new
    devices to the system, securely).

    Two cameras on each corner of the house. A camera looking at the front
    door. Another looking away from it. One more looking at the potential
    guest standing AT the door. One on the roof (for the wildlife that
    invariably end up there)

    One for the alarm system. Phone system. CATV. CATV modem. 2 OTA TV receivers. 2 SDRs.

    10 BT "beacons" in the ceiling to track the location of occupants.
    2 WiFi APs (also in the ceiling).

    Etc. Processors are cheap. As is CAT5 to talk to them and power them.

    You'll *see* the cameras, speaker grills, etc. But, the kit controlling
    each of them is hidden -- in the devices, walls, ceilings, etc. (each "controller" is about the size/shape/volume of a US electrical receptacle)

    I have an out-facing server that operates in stealth mode and won't
    appear
    on probes (only used to source my work to colleagues). The goal is not >>>> to
    look "interesting".

    Not sure what you mean by that.
    Given what gets thrown at my firewall I think you could maybe look more
    interesting than you think.

    Nothing on my side "answers" connection attempts. To the rest of the
    world,
    it looks like a cable dangling in air...

    You could ping me if you knew my IP address.

    You can't see me, at all. You have to know the right sequence of packets (connection attempts) to throw at me before I will "wake up" and respond
    to the *final*/correct one. And, while doing so, will continue to
    ignore *other* attempts to contact me. So, even if you could see that
    I had started to respond, you couldn't "get my attention".
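    What's being described is essentially port knocking. A minimal Python sketch
    of the client side (host, ports and service are placeholders; the server-side
    rule that watches for the sequence isn't shown):

        import socket

        HOST = "server.example.net"           # placeholder
        KNOCK_SEQUENCE = [7000, 8001, 9002]   # placeholder secret sequence
        SERVICE_PORT = 2222                   # placeholder port that opens afterwards

        def knock(host, ports, timeout=0.5):
            for port in ports:
                s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
                s.settimeout(timeout)
                try:
                    s.connect((host, port))   # expected to fail; the SYN is the point
                except OSError:
                    pass
                finally:
                    s.close()

        if __name__ == "__main__":
            knock(HOST, KNOCK_SEQUENCE)
            with socket.create_connection((HOST, SERVICE_PORT), timeout=5) as conn:
                conn.sendall(b"hello\n")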

    The structure of the house's fabric allows me to treat any individual
    node as being directly connected to the ISP while isolating the
    rest of the nodes. I.e., if you bring a laptop loaded with malware into >>>> the house, you can't infect anything (or even know that there are other >>>> hosts, here); it's as if you had a dedicated connection to the Internet >>>> with no other devices "nearby".

    I wouldn't bother. I'd just not connect it to wifi or wired if I thought >>> there was a risk.

    What I mean by that is I'd clean it without it being connected.
    The Avira boot CD used to be useful but I forget how many years ago.

    If you were to unplug any of the above mentioned ("house") drops,
    you'd find nothing at the other end. Each physical link is an
    encrypted tunnel that similarly "hides" until (and unless) properly
    tickled. As a result, eavesdropping on the connection doesn't
    "give" you anything (because it's immune from replay attacks and
    its content is opaque to you)

    So, you'd have to *police* all such connections. What do you do with
    hundreds
    of drops on a factory floor? Or, scattered throughout a business? Can
    you prevent any "foreign" devices from being connected -- even if IN PLACE >> OF
    a legitimate device? (after all, it is a trivial matter to unplug a
    network
    cable from one "approved" PC and plug it into a "foreign import")

    Devices on a LAN should be secure just like Internet facing devices.

    They should be secure from the threats they are LIKELY TO FACE.
    If the only access to my devices is by gaining physical entry
    to the premises, then why waste CPU cycles and man-hours protecting
    against a threat that can't manifest? Each box has a password...
    pasted on the outer skin of the box (for any intruder to read).

    Do I *care* about the latest MS release? (ANS: No)
    Do I care about the security patches for it? (No)
    Can I still do MY work with MY tools? (Yes)

    I have to activate an iPhone, tonight. So, drag out a laptop
    (I have 7 of them), install the latest iTunes. Do the required
    song and dance to get the phone running. Wipe the laptop's
    disk and reinstall the image that was present, there, minutes
    earlier (so, I don't care WHICH laptop I use!)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Liz Tuddenham@21:1/5 to Don Y on Wed Apr 17 08:56:55 2024
    Don Y <blockedofcourse@foo.invalid> wrote:

    ...When I was designing for pharma, my philosophy was
    to make it easy/quick to replace the entire control system. Let someone troubleshoot it on a bench instead of on the factory floor (which is semi-sterile).

    That's fine if the failure is clearly in the equipment itself, but what
    if it is in the way it interacts with something outside it, some
    unpredictable or unrecognised input condition? It works perfectly on the
    bench, only to fail when put into service ...again and again.


    --
    ~ Liz Tuddenham ~
    (Remove the ".invalid"s and add ".co.uk" to reply)
    www.poppyrecords.co.uk

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don Y@21:1/5 to Liz Tuddenham on Wed Apr 17 04:05:37 2024
    On 4/17/2024 12:56 AM, Liz Tuddenham wrote:
    Don Y <blockedofcourse@foo.invalid> wrote:

    ...When I was designing for pharma, my philosophy was
    to make it easy/quick to replace the entire control system. Let someone
    troubleshoot it on a bench instead of on the factory floor (which is
    semi-sterile).

    That's fine if the failure is clearly in the equipment itself, but what
    if it is in the way it interacts with something outside it, some unpredictable or unrecognised input codition? It works perfectly on the bench, only to fail when put into service ...again and again.

    Then the *replacement* -- now installed in the system -- would have
    the same faulty behavior as the "pulled" unit. Lending credibility
    to the pulled unit NOT being at fault.

    When the control system is a 7 ft tall, 24 inch rack, bolted to
    the floor, your only option is to troubleshoot the system there,
    taking the system out of production while doing so.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From legg@21:1/5 to invalid@invalid.invalid on Wed Apr 17 08:11:33 2024
    On Tue, 16 Apr 2024 11:10:40 -0400, "Edward Rawde"
    <invalid@invalid.invalid> wrote:

    "Don Y" <blockedofcourse@foo.invalid> wrote in message >news:uvl2gr$phap$2@dont-email.me...
    On 4/15/2024 8:33 PM, Edward Rawde wrote:

    [Shouldn't that be Edwar D rawdE?]


    I don't mind how you pronounce it.


    ...

    A smoke detector that beeps once a day risks not being heard

    Reminds me of a tenant who just removed the battery to stop the annoying >beeping.

    The occasional beeping is a low-battery alert.

    RL

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Chris Jones@21:1/5 to All on Wed Apr 17 22:18:59 2024
    On a vaguely related rant, shamelessly hijacking your thread:

    Why do recent mechanical hard drives have an "Annualised Workload Rate"
    limit saying that you are only supposed to write, say, 55TB/year?

    What is the wearout mechanism, or is it just bullshit to discourage
    enterprise customers from buying the cheapest drives?

    It seems odd to me that they would all do it, if it really is just made
    up bullshit. It also seems odd to express it in terms of TB
    read+written. I can't see why that would be more likely to wear it out
    than some number of hours of spindle rotation, or seek operations, or
    spindle starts, or head load/unload cycles. I could imagine they might
    want to use a very high current density in the windings of the write
    head that might place an electromigration limit on the time spent
    writing, but they apply the limit to reads as well. Is there something
    that wears out when the servo loop is keeping the head on a track?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don Y@21:1/5 to Don Y on Wed Apr 17 09:05:56 2024
    On 4/17/2024 8:42 AM, Don Y wrote:
    a typical run-of-the-mill disk performs at about 60MB/s. So, ~350MB/min
    or 21GB/hr. That's ~500GB/day or 180TB/yr.

    Assuming 24/7/365 use.

    In a 9-to-5 environment (8 of 24 hours), that would be (8/24)*180TB = 60TB, times (5/7) to account for idle weekends, or ~40TB/yr.

    Said another way, I'd expect a 55TB/yr drive to run at about (55/40)*60MB/s or ~80MB/s. A drive that runs at 100MB/s (not uncommon) would be ~100TB/yr.

    That number doesn't look right. 100/80 = 1.25 so that 55 should probably be about 70TB/yr (not 100!).

    I guess a calculator would be handy... :>
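    In that spirit, a couple of lines of throwaway arithmetic (1 TB taken as 1e12
    bytes, no vendor data involved) for converting between an annualised-workload
    figure and an average transfer rate:

        SECONDS_PER_YEAR = 365 * 24 * 3600

        def rating_to_avg_mb_per_s(tb_per_year):
            return tb_per_year * 1e12 / SECONDS_PER_YEAR / 1e6

        def sustained_to_tb_per_year(mb_per_s, duty_cycle=1.0):
            return mb_per_s * 1e6 * SECONDS_PER_YEAR * duty_cycle / 1e12

        print(rating_to_avg_mb_per_s(55))           # ~1.7 MB/s averaged over a year
        print(sustained_to_tb_per_year(60, 0.03))   # 60 MB/s at 3% duty ~ 57 TB/yr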

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Larkin@21:1/5 to jjSNIPlarkin@highNONOlandtechnology on Wed Apr 17 08:17:10 2024
    On Tue, 16 Apr 2024 20:23:46 -0700, John Larkin <jjSNIPlarkin@highNONOlandtechnology.com> wrote:

    On Tue, 16 Apr 2024 21:16:45 -0400, "Edward Rawde"
    <invalid@invalid.invalid> wrote:

    "John Larkin" <jjSNIPlarkin@highNONOlandtechnology.com> wrote in message >>news:p47u1j1tg35ctb3tcta5qevsfnhgnpcrsg@4ax.com...
    On Tue, 16 Apr 2024 13:39:07 -0400, "Edward Rawde"
    <invalid@invalid.invalid> wrote:


    On 17/04/2024 1:22 am, John Larkin wrote:
    On Tue, 16 Apr 2024 09:45:34 +0100, Martin Brown
    <'''newspam'''@nonad.co.uk> wrote:
    On 15/04/2024 18:13, Don Y wrote:


    Yes I've seen that a lot.
    The power rails in the production product came up in a different order to >>>>those in the development lab.
    This caused all kinds of previously unseen behaviour including an >>>>expensive
    flash a/d chip burning up.

    I'd have it in the test spec that any missing power rail does not cause >>>>issues.
    And any power rail can be turned on and off any time.
    The equipment may not work properly with a missing power rail but it >>>>should
    not be damaged.


    Some FPGAs require supply sequencing, as may as four.

    LM3880 is a dedicated powerup sequencer, most cool.

    https://www.dropbox.com/scl/fi/gwrimefrgm729k8enqrir/28S662D_sh_19.pdf?rlkey=qvyip7rjqfy6i9yegqrt57n23&dl=0


    Ok that doesn't surprise me.
    I'd want to be sure that the requirement is always met even when the 12V >>connector is in a position where it isn't sure whether it's connected or >>not.
    Or rapid and repeated connect/disconnect of 12V doesn't cause any issue.


    We considered the brownout case. The MAX809 handles that.

    This supply will also tolerate +24v input, in case someone grabs the
    wrong wart. Or connects the power backwards.



    Another hazard/failure mode happens when things like opamps use pos
    and neg supply rails. A positive regulator, for example, can latch up
    if its output is pulled negative, through ground, at startup. Brownout
    dippies can trigger that too.

    Add schottky diodes to ground.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don Y@21:1/5 to Chris Jones on Wed Apr 17 08:42:03 2024
    On 4/17/2024 5:18 AM, Chris Jones wrote:
    On a vaguely related rant, shamelessly hijacking your thread:

    Why do recent mechanical hard drives have a "Annualised Workload Rate" limit saying that you are only supposed to write say 55TB/year?

    Are you sure they aren't giving you a *recommendation*? I.e., "this
    device will give acceptable performance (not durability) in applications
    with a workload of X TB/yr"?

    I built a box to sanitize and characterize disks for recycling. It seems
    a typical run-of-the-mill disk performs at about 60MB/s. So, ~350MB/min
    or 21GB/hr. That's ~500GB/day or 180TB/yr.

    Assuming 24/7/365 use.

    In a 9-to-5 environment (8 of 24 hours), that would be (8/24)*180TB = 60TB,
    times (5/7) to account for idle weekends, or ~40TB/yr.

    Said another way, I'd expect a 55TB/yr drive to run at about (55/40)*60MB/s
    or ~80MB/s. A drive that runs at 100MB/s (not uncommon) would be ~100TB/yr.

    What is the wearout mechanism, or is it just bullshit to discourage enterprise
    customers from buying the cheapest drives?

    It seems odd to me that they would all do it, if it really is just made up bullshit. It also seems odd to express it in terms of TB read+written. I can't

    As this seems to be a relatively new "expression", it may be a side-effect of SSD ratings (in which *wear* is a real issue). It would allow for a rough comparison of the durability of the media in a synthetic workload.

    see why that would be more likely to wear it out than some number of hours of spindle rotation, or seek operations, or spindle starts, or head load/unload cycles. I could imagine they might want to use a very high current density in the windings of the write head that might place an electromigration limit on the time spent writing, but they apply the limit to reads as well. Is there something that wears out when the servo loop is keeping the head on a track?

    I've encountered drives with 50K PoH that still report no SMART issues.
    I assume they truly are running 24/7/365 (based on the number of power cycles reported) so that's *6* years spinning on its axis! (I wonder how many
    miles it would have traveled if it was a "wheel"?)

    Most nearline drives pulled from DASs seem to be discarded (upgraded?)
    at about 20K PoH, FWIW. Plenty of useful life remaining!

    [FWIW, I've only lost two drives in my life -- one a laptop drive installed
    in an application that spun it up and down almost continuously and another
    that magically lost access to its boot sector. OTOH, I've heard horror stories of folks having issues with SSDs (firmware). So, just put all the rescued SSDs I come across in a box thinking "someday" I will play with them]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Joe Gwinn@21:1/5 to jjSNIPlarkin@highNONOlandtechnology on Wed Apr 17 11:47:53 2024
    On Tue, 16 Apr 2024 17:48:19 -0700, John Larkin <jjSNIPlarkin@highNONOlandtechnology.com> wrote:

    On Tue, 16 Apr 2024 13:20:34 -0400, Joe Gwinn <joegwinn@comcast.net>
    wrote:

    On Tue, 16 Apr 2024 08:16:04 -0700, John Larkin >><jjSNIPlarkin@highNONOlandtechnology.com> wrote:

    On Tue, 16 Apr 2024 10:19:00 -0400, Joe Gwinn <joegwinn@comcast.net> >>>wrote:

    On Mon, 15 Apr 2024 16:26:35 -0700, john larkin <jl@650pot.com> wrote:

    On Mon, 15 Apr 2024 18:03:23 -0400, Joe Gwinn <joegwinn@comcast.net> >>>>>wrote:

    On Mon, 15 Apr 2024 13:05:40 -0700, john larkin <jl@650pot.com> wrote: >>>>>>
    On Mon, 15 Apr 2024 15:41:57 -0400, Joe Gwinn <joegwinn@comcast.net> >>>>>>>wrote:

    On Mon, 15 Apr 2024 10:13:02 -0700, Don Y >>>>>>>><blockedofcourse@foo.invalid> wrote:

    Is there a general rule of thumb for signalling the likelihood of >>>>>>>>>an "imminent" (for some value of "imminent") hardware failure? >>>>>>>>>
    I suspect most would involve *relative* changes that would be >>>>>>>>>suggestive of changing conditions in the components (and not >>>>>>>>>directly related to environmental influences).

    So, perhaps, a good strategy is to just "watch" everything and >>>>>>>>>notice the sorts of changes you "typically" encounter in the hope >>>>>>>>>that something of greater magnitude would be a harbinger...

    There is a standard approach that may work: Measure the level and >>>>>>>>trend of very low frequency (around a tenth of a Hertz) flicker noise. >>>>>>>>When connections (perhaps within a package) start to fail, the flicker >>>>>>>>level rises. The actual frequency monitored isn't all that critical. >>>>>>>>
    Joe Gwinn

    Do connections "start to fail" ?

    Yes, they do, in things like vias. I went through a big drama where a >>>>>>critical bit of radar logic circuitry would slowly go nuts.

    It turned out that the copper plating on the walls of the vias was >>>>>>suffering from low-cycle fatigue during temperature cycling and slowly >>>>>>breaking, one little crack at a time, until it went open. If you >>>>>>measured the resistance to parts per million (6.5 digit DMM), sampling >>>>>>at 1 Hz, you could see the 1/f noise at 0.1 Hz rising. It's useful to >>>>>>also measure a copper line, and divide the via-chain resistance by the >>>>>>no-via resistance, to correct for temperature changes.

    But nobody is going to monitor every via on a PCB, even if it were >>>>>possible.

    It was not possible to test the vias on the failing logic board, but
    we knew from metallurgical cut, polish, and inspect studies of failed >>>>boards that it was the vias that were failing.


    One could instrument a PCB fab test board, I guess. But DC tests would >>>>>be fine.

    What was being tested was a fab test board that had both the series
    via chain path and the no-via path of roughly the same DC resistance, >>>>set up so we could do 4-wire Kelvin resistance measurements of each >>>>path independent of the other path.


    Yes, but the question was whether one could predict the failure of an >>>operating electronic gadget. The answer is mostly NO.

    Agree.


    We had a visit from the quality team from a giant company that you
    have heard of. They wanted us to trend analyze all the power supplies
    on our boards and apply a complex algotithm to predict failures. It
    was total nonsense, basically predicting the future by zooming in on >>>random noise with a big 1/f component, just like climate prediction.

    Hmm. My first instinct was that they were using MIL-HDBK-317 (?) or
    the like, but that does not measure noise. Do you recall any more of
    what they were doing? I might know what they were up to. The
    military were big on prognostics for a while, and still talk of this,
    but it never worked all that well in the field compared to what it was >>supposed to improve on.


    We have one board with over 4000 vias, but they are mostly in >>>>>parallel.

    This can also be tested , but using a 6.5-digit DMM intended for >>>>measuring very low resistance values. A change of one part in 4,000
    is huge to a 6.5-digit instrument. The conductivity will decline >>>>linearly as vias fail one by one.



    Millikelvin temperature changes would make more signal than a failing >>>via.

    Not at the currents in that logic card. Too much ambient thermal
    noise.


    The solution was to redesign the vias, mainly to increase the critical >>>>>>volume of copper. And modern SMD designs have less and less copper >>>>>>volume.

    I bet precision resistors can also be measured this way.


    I don't think I've ever owned a piece of electronic equipment that >>>>>>>warned me of an impending failure.

    Onset of smoke emission is a common sign.


    Cars do, for some failure modes, like low oil level.

    The industrial method for big stuff is accelerometers attached near >>>>>>the bearings, and listen for excessive rotation-correlated (not >>>>>>necessarily harmonic) noise.

    Big ships that I've worked on have a long propeller shaft in the shaft >>>>>alley, a long tunnel where nobody often goes. They have magnetic shaft >>>>>runout sensors and shaft bearing temperature monitors.

    They measure shaft torque and SHP too, from the shaft twist.

    Yep. And these kinds of things fail slowly. At first.

    They could repair a bearing at sea, given a heads-up about violent >>>failure. A serious bearing failure on a single-screw machine means >>>getting a seagoing tug.

    The main engine gearbox had padlocks on the covers.

    There was also a chem lab to analyze oil and water and such, looking
    for contaminamts that might suggest something going on.




    I liked hiding out in the shaft alley. It was private and cool, that >>>>>giant shaft slowly rotating.

    Probably had a calming flowing water sound as well.

    Yes, cool and beautiful and serene after the heat and noise and
    vibration of the engine room. A quiet 32,000 horsepower.

    It was fun being an electronic guru on sea trials of a ship full of
    big hairy Popeye types. I, skinny gawky kid, got my own stateroom when >>>other tech reps slept in cots in the hold.

    Have you noticed how many lumberjack types are afraid of electricity? >>>That can be funny.

    Oh yes. And EEs frightened by a 9-v battery.

    Joe Gwinn

    I had an intern, an EE senior, who was afraid of 3.3 volts.

    I told him to touch an FPGA to see how warm it was getting, and he
    refused.

    Yeah.

    Not quite as dramatic, but in the last year I have been involved in
    some full-scale vibration tests, where a relay rack packed full of
    equipment is shaken and resulting phase noise is measured. People are
    afraid to touch the vibrating equipment, but I tell people to put a
    hand on a convenient place.

    It's amazing how much one can tell by feel. There is some
    low-frequency spectral analysis capability there, and one can detect
    for instance a resonance. It's a very good cross-check on the fancy instrumentation.

    Joe Gwinn

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Edward Rawde@21:1/5 to Don Y on Wed Apr 17 13:49:38 2024
    "Don Y" <blockedofcourse@foo.invalid> wrote in message news:uvnqg7$1f0pl$1@dont-email.me...
    On 4/16/2024 10:39 PM, Edward Rawde wrote:
    "Don Y" <blockedofcourse@foo.invalid> wrote in message
    news:uvnlr6$1e3fi$1@dont-email.me...
    On 4/16/2024 9:21 PM, Edward Rawde wrote:
    The internal network isn't routed. So, the only machines to worry
    about
    are
    this one (used only for email/news/web) and a laptop that is only used >>>>> for ecommerce.

    My LAN is more like a small/medium size business with all workstations, >>>> servers and devices behind a firewall and able to communicate both with >>>> each
    other and online as necessary.

    I have 72 drops in the office and 240 throughout the rest of the house
    (though the vast majority of those are for dedicated "appliances")...
    about 2.5 miles of CAT5.

    Must be a big house.

    The office is ~150 sq ft. Three sets of dual workstations each sharing a
    set of monitors and a tablet (for music) -- 7 drops for each such set.
    Eight drops for my "prototyping platform". Twelve UPSs. Four scanners
    (two B size, one A-size w/ADF and a film scanner). An SB2000 and Voyager (for cross development testing; I'm discarding a T5220 tomorrow).
    Four "toy" NASs (for sharing files between myself and SWMBO, documents dropped by the scanners, etc.). Four 12-bay NASs, two 16 bay. Four
    8-bay ESXi servers. Two 1U servers. Two 2U servers. My DBMS server.
    A "general services" appliance (DNS, NTP, PXE, FTP, TFTP, font, etc. services). Three media front ends. One media tank. Two 12 bay
    (and one 24 bay) iSCSI SAN devices.
    ....

    I have an out-facing server that operates in stealth mode and won't
    appear
    on probes (only used to source my work to colleagues). The goal is
    not
    to
    look "interesting".

    Not sure what you mean by that.
    Given what gets thrown at my firewall I think you could maybe look more >>>> interesting than you think.

    Nothing on my side "answers" connection attempts. To the rest of the
    world,
    it looks like a cable dangling in air...

    You could ping me if you knew my IP address.

    You can't see me, at all. You have to know the right sequence of packets (connection attempts) to throw at me before I will "wake up" and respond
    to the *final*/correct one. And, while doing so, will continue to
    ignore *other* attempts to contact me. So, even if you could see that
    I had started to respond, you couldn't "get my attention".

    I've never bothered with port knocking.
    Those of us with inbound connectable web servers, database servers, email servers etc have to be connectable by more conventional means.

    ....

    I wouldn't bother. I'd just not connect it to wifi or wired if I
    thought
    there was a risk.

    What I mean by that is I'd clean it without it being connected.
    The Avira boot CD used to be useful but I forget how many years ago.

    If you were to unplug any of the above mentioned ("house") drops,
    you'd find nothing at the other end. Each physical link is an
    encrypted tunnel that similarly "hides" until (and unless) properly
    tickled. As a result, eavesdropping on the connection doesn't
    "give" you anything (because it's immune from replay attacks and
    it's content is opaque to you)

    I'm surprised you get anything done with all the tickle processes you must
    need before anything works.


    So, you'd have to *police* all such connections. What do you do with
    hundreds
    of drops on a factory floor? Or, scattered throughout a business? Can
    you prevent any "foreign" devices from being connected -- even if IN
    PLACE
    OF
    a legitimate device? (after all, it is a trivial matter to unplug a
    network
    cable from one "approved" PC and plug it into a "foreign import")

    Devices on a LAN should be secure just like Internet facing devices.

    They should be secure from the threats they are LIKELY TO FACE.
    If the only access to my devices is by gaining physical entry
    to the premises, then why waste CPU cycles and man-hours protecting
    against a threat that can't manifest? Each box has a password...
    pasted on the outer skin of the box (for any intruder to read).

    Sounds like you are the only user of your devices.
    Consider a small business.
    Here you want a minimum of either two LANs or VLANs so that guest access to wireless can't connect to your own LAN devices.
    Your own LAN should have devices which are patched and have proper identification, so that even if you do get a compromised device on your own
    LAN it's not likely to spread to other devices.
    You might also want a firewall which is monitored remotely by someone who
    knows how to spot anything unusual.
    I have much written in Python which tells me whether I want a closer look at
    the firewall log or not.
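    Not the actual code, but the kind of triage being described fits in a dozen
    lines: count dropped connections per source address and flag the noisy ones.
    The "timestamp DROP src_ip dst_port" log format here is made up; real router
    or iptables logs need their own parsing.

        from collections import Counter

        THRESHOLD = 50   # arbitrary: this many drops from one address is worth a look

        def noisy_sources(log_path):
            drops = Counter()
            with open(log_path) as log:
                for line in log:
                    fields = line.split()
                    if len(fields) >= 3 and fields[1] == "DROP":
                        drops[fields[2]] += 1
            return [(ip, n) for ip, n in drops.most_common() if n >= THRESHOLD]

        if __name__ == "__main__":
            for ip, count in noisy_sources("/var/log/firewall.log"):
                print(f"{ip}: {count} dropped attempts")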


    Do I *care* about the latest MS release? (ANS: No)
    Do I care about the security patches for it? (No)
    Can I still do MY work with MY tools? (Yes)

    But only for your situation.
    If I advised a small business to run like that they'd get someone else to do it.


    I have to activate an iPhone, tonight. So, drag out a laptop
    (I have 7 of them), install the latest iTunes. Do the required
    song and dance to get the phone running. Wipe the laptop's
    disk and reinstall the image that was present, there, minutes
    earlier (so, I don't care WHICH laptop I use!)

    You'll have to excuse me for laughing at that.
    Cybersecurity is certainly a very interesting subject, and thanks for the discussion.
    If I open one of the wordy cybersecurity books I have (pdf) at a random page
    I get this.
    "Once the attacker has gained access to a system, they will want to gain administrator-level access to the current resource, as well as additional resources on the network."
    Well duh. You mean like once the bank robber has gained access to the bank
    they will want to find out where the money is?




    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Edward Rawde@21:1/5 to John Larkin on Wed Apr 17 14:31:13 2024
    "John Larkin" <jjSNIPlarkin@highNONOlandtechnology.com> wrote in message news:uhpv1jd4b0cd20haudbsb3dneut5dap4ka@4ax.com...
    On Tue, 16 Apr 2024 20:23:46 -0700, John Larkin <jjSNIPlarkin@highNONOlandtechnology.com> wrote:

    On Tue, 16 Apr 2024 21:16:45 -0400, "Edward Rawde" >><invalid@invalid.invalid> wrote:

    "John Larkin" <jjSNIPlarkin@highNONOlandtechnology.com> wrote in message >>>news:p47u1j1tg35ctb3tcta5qevsfnhgnpcrsg@4ax.com...
    On Tue, 16 Apr 2024 13:39:07 -0400, "Edward Rawde"
    <invalid@invalid.invalid> wrote:


    On 17/04/2024 1:22 am, John Larkin wrote:
    On Tue, 16 Apr 2024 09:45:34 +0100, Martin Brown
    <'''newspam'''@nonad.co.uk> wrote:
    On 15/04/2024 18:13, Don Y wrote:


    Yes I've seen that a lot.
    The power rails in the production product came up in a different order >>>>>to
    those in the development lab.
    This caused all kinds of previously unseen behaviour including an >>>>>expensive
    flash a/d chip burning up.

    I'd have it in the test spec that any missing power rail does not cause >>>>>issues.
    And any power rail can be turned on and off any time.
    The equipment may not work properly with a missing power rail but it >>>>>should
    not be damaged.


    Some FPGAs require supply sequencing, as may as four.

    LM3880 is a dedicated powerup sequencer, most cool.

    https://www.dropbox.com/scl/fi/gwrimefrgm729k8enqrir/28S662D_sh_19.pdf?rlkey=qvyip7rjqfy6i9yegqrt57n23&dl=0


    Ok that doesn't surprise me.
    I'd want to be sure that the requirement is always met even when the 12V >>>connector is in a position where it isn't sure whether it's connected or >>>not.
    Or rapid and repeated connect/disconnect of 12V doesn't cause any issue.


    We considered the brownout case. The MAX809 handles that.

    This supply will also tolerate +24v input, in case someone grabs the
    wrong wart. Or connects the power backwards.



    Another hazard/failure mode happens when things like opamps use pos
    and neg supply rails. A positive regulator, for example, can latch up
    if its output is pulled negative, though ground, at startup. Brownout
    dippies can trigger that too.

    Add schottky diodes to ground.

    I've seen many a circuit with pos and neg supply rails for op amps when a single rail would have been fine.
    In one case the output went negative during startup and the following device
    (a VCO) didn't like that and refused to start.
    A series diode was the easiest solution in that case.



    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don Y@21:1/5 to Edward Rawde on Wed Apr 17 11:50:07 2024
    On 4/17/2024 10:49 AM, Edward Rawde wrote:
    You could ping me if you knew my IP address.

    You can't see me, at all. You have to know the right sequence of packets
    (connection attempts) to throw at me before I will "wake up" and respond
    to the *final*/correct one. And, while doing so, will continue to
    ignore *other* attempts to contact me. So, even if you could see that
    I had started to respond, you couldn't "get my attention".

    I've never bothered with port knocking.
    Those of us with inbound connectable web servers, database servers, email servers etc have to be connectable by more conventional means.

    As with installing updates and other "maintenance issues", I have
    no desire to add to my workload. I want to spend my time *designing*
    things.

    I run the server to save me time handling requests from colleagues for
    source code releases. This lets them access the repository and
    pull whatever versions they want without me having to get them and
    send them. Otherwise, they gripe about my weird working hours, etc.
    (and I gripe about their poorly timed requests for STATIC resources)

    There is some overhead to their initial connection to the server
    as the script has to take into account that packets aren't delivered
    instantly and retransmissions can cause a connection attempt to be
    delayed -- so, *I* might not see it when they think I am.

    But, once the connection is allowed, there is no additional
    overhead or special protocols required.

    I wouldn't bother. I'd just not connect it to wifi or wired if I
    thought
    there was a risk.

    What I mean by that is I'd clean it without it being connected.
    The Avira boot CD used to be useful but I forget how many years ago.

    If you were to unplug any of the above mentioned ("house") drops,
    you'd find nothing at the other end. Each physical link is an
    encrypted tunnel that similarly "hides" until (and unless) properly
    tickled. As a result, eavesdropping on the connection doesn't
    "give" you anything (because it's immune from replay attacks and
    it's content is opaque to you)

    I'm surprised you get anything done with all the tickle processes you must need before anything works.

    I wouldn't "unplug any of the above mentioned drops". I'd let them connect using their native protocols. This is already baked into the code so "costs" nothing.

    The hiding prevents an adversary from cutting an exposed (e.g., outside the house) cable and trying to interfere with the system. Just like an adversary on a factory floor could find a convenient, out-of-the-way place to access the fabric with malevolent intent. Or, a guest in a hotel. Or, a passenger
    on an aircraft/ship. Or, a CAN node in an automobile (!).

    They should be secure from the threats they are LIKELY TO FACE.
    If the only access to my devices is by gaining physical entry
    to the premises, then why waste CPU cycles and man-hours protecting
    against a threat that can't manifest? Each box has a password...
    pasted on the outer skin of the box (for any intruder to read).

    Sounds like you are the only user of your devices.

    I'm a "development lab". I want to spend my time using my tools to
    create new products. I don't want to bear the overhead of trying to
    keep up with patches for 0-day exploits just to be able to USE those
    tools. I am more than willing to trade the hassle of walking
    down the hall to another computer (this one) to access my email.
    And, if I DL a research paper, copying it onto a thumb drive to
    SneakerNet it back to my office. To me, that's a HUGE productivity
    increase!

    Consider a small business.
    Here you want a minimum of either two LANs or VLANs so that guest access to wireless can't connect to your own LAN devices.
    Your own LAN should have devices which are patched and have proper identification so that even if you do get a compromised device on your own LAN it's not likely to spread to other devices.

    The house network effectively implements a VLAN per drop. My OS only lets "things" talk to other things that they've been preconfigured to talk to.
    So, I can configure the drop in the guest bedroom to access the ISP.
    Or, one of the radios in the ceiling to do similarly. If I later decide that
    I want to plug a TV into that guest bedroom drop, then the ISP access is "unwired" from that drop and access to the media server wired in its place.

    And, KNOW that there is no way that any of the traffic on either of those tunnels can *see* (or access) any of the other traffic flowing through the switch. The switch is the source of all physical security as you have
    to be able to convince it to allow your traffic to go *anywhere* (and WHERE).

    [So, the switch is in a protected location AND has the hardware
    mechanisms that let me add new devices to the fabric -- by installing site-specific "secrets" over a secure connection]

    Because a factory floor would need the ability to "dial out" from a
    drop ON the floor (or WiFi) without risking compromise to any of
    the machines that are concurrently using that same fabric.

    Imagine having a firewall ENCASING that connection so it can't see *anything* besides the ISP. (and, imagine that firewall not needing any particular rules governing the traffic that it allows as it's an encrypted tunnel letting NOTHING through)

    You might also want a firewall which is monitored remotely by someone who knows how to spot anything unusual.
    I have much written in python which tells me whether I want a closer look at the firewall log or not.
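
    [Not the script in question -- just a minimal sketch of that kind of log
    triage, assuming an iptables-style text log at a made-up path; the port
    list and thresholds are invented:]

        # Rough triage pass over a firewall log (hypothetical path/format:
        # drop lines containing SRC= and DPT= fields, as iptables logging
        # produces). Flags anything beyond routine background noise.
        import re
        from collections import Counter

        LOG = "/var/log/firewall.log"          # hypothetical location
        BORING_PORTS = {23, 445, 1433, 3389}   # scanned constantly; not news

        hits = Counter()
        with open(LOG, errors="replace") as f:
            for line in f:
                m = re.search(r"SRC=(\S+).*DPT=(\d+)", line)
                if m:
                    src, dport = m.group(1), int(m.group(2))
                    if dport not in BORING_PORTS:
                        hits[(src, dport)] += 1

        # Only worth a closer look if one source is hammering an unusual port.
        for (src, dport), n in hits.most_common(10):
            if n > 20:
                print(f"look closer: {src} -> port {dport} ({n} drops)")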

    Yet another activity I don't have to worry about. Sit in the guest bedroom
    and you're effectively directly connected to The Internet. If your machine
    is vulnerable (because of measures YOU failed to take), then YOUR machine
    is at risk. Not any of the other devices sharing that fabric. You can get infected while sitting there and I'm still safe.

    My "labor costs" are fixed and don't increase, regardless of the number
    of devices and threats that I may encounter. No need for IT staff to handle the "exposed" guests -- that's THEIR problem.

    Do I *care* about the latest MS release? (ANS: No)
    Do I care about the security patches for it? (No)
    Can I still do MY work with MY tools? (Yes)

    But only for your situation.
    If I advised a small business to run like that they'd get someone else to do it.

    And they would forever be "TAXED" for their choice. Folks are starting
    to notice that updates often don't give them anything that is worth the risk/cost of the update. Especially if that requires/entices them to have
    that host routed!

    My colleagues have begrudgingly adopted a similar "unrouted development network" for their shops. The savings in IT-related activities are
    enormous. And, they sleep more soundly knowing the only threats
    they have to worry about are physical break-in and equipment
    failure.

    You want to check your email? Take your phone out of your pocket...
    Need to do some on-line work (e.g., chasing down research papers
    or browsing a remote repository)? Then move to an "exposed"
    workstation FOR THAT TASK.

    [Imagine if businesses required their employees to move to such
    a workstation to browse YouTube videos or check their facebook
    page! "Gee, you're spending an awful lot of time 'on-line',
    today, Bob... Have you finished that DESIGN, yet?"]

    I have to activate an iPhone, tonight. So, drag out a laptop
    (I have 7 of them), install the latest iTunes. Do the required
    song and dance to get the phone running. Wipe the laptop's
    disk and reinstall the image that was present, there, minutes
    earlier (so, I don't care WHICH laptop I use!)

    You'll have to excuse me for laughing at that.
    Cybersecurity is certainly a very interesting subject, and thanks for the discussion.
    If I open one of the wordy cybersecurity books I have (pdf) at a random page I get this.
    "Once the attacker has gained access to a system, they will want to gain administrator-level access to the current resource, as well as additional resources on the network."

    Hence the reason for NOT letting anything "talk" to anything that it shouldn't. E.g., the oven has no need to talk to the front door lock. Or, the garage
    door opener. Or, the HVAC controller. So, even if compromised, an adversary can only do what those items could normally do. There is no *path* to
    the items that it has no (designed) need to access!
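
    [A minimal sketch of that "no path unless wired" idea; the device names and
    the table are invented for illustration, and the real mechanism would live
    in the switch/OS rather than in application code:]

        # Sketch of "wired" connectivity: a device can only reach peers it was
        # explicitly configured to talk to. Everything else is unreachable,
        # not merely "denied". (Device names are invented.)
        ALLOWED = {
            "irrigation_controller": {"rose_bush_valve", "garden_hose_valve"},
            "thermostat":            {"hvac_controller"},
            "oven":                  set(),          # talks to nobody
        }

        def deliver(src: str, dst: str, payload: bytes) -> bool:
            """Forward payload only if the (src, dst) pair was pre-wired."""
            if dst not in ALLOWED.get(src, set()):
                # Nothing leaks back: from src's point of view, dst
                # simply doesn't exist.
                return False
            # ... hand payload to dst's transport here ...
            return True

        assert deliver("irrigation_controller", "rose_bush_valve", b"open")
        assert not deliver("oven", "front_door_lock", b"unlock")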

    With a conventional fabric, anything that establishes a beachhead on ANY
    device can start poking around EVERYWHERE. You have to monitor traffic
    INSIDE your firewall to verify nothing untoward is happening (IDS -- yet another cost to install and maintain and police!)

    Well duh. You mean like once the bank robber has gained access to the bank they will want to find out where the money is?

    Banks keep the money in well-known places. Most commercial (and free) OSs
    are similarly unimaginative. So, *looking* for it is relatively easy. Especially in OSs that use a unified file system as a naming mechanism
    for everything in the system ("Gee, let's go have a peek at passwd(5)...")

    In my approach, an actor only knows about the items that he SHOULD know about. So, you may *SUSPECT* that there is a "front door" but the only things you
    have access to are named "rose bush" and "garden hose" (if you are
    an irrigation controller).

    In a conventional (50 year old design!) system, you would *see* the names
    of all of the devices in the system and HOPE that someone had implemented
    one of them incorrectly. Your task (pen-test) would be to figure out
    which one and how best to exploit it.

    Had the designers, instead, adhered to the notions of information hiding, encapsulation, principle of least privilege, etc., there'd be less attack surface exposed to the "outside" AND to devices on the *inside*! (But,
    you need to approach the design of the OS entirely differently instead of hoping to layer protections onto some legacy codebase)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Edward Rawde@21:1/5 to Don Y on Wed Apr 17 15:28:49 2024
    "Don Y" <blockedofcourse@foo.invalid> wrote in message news:uvp5l9$1ojua$1@dont-email.me...
    On 4/17/2024 10:49 AM, Edward Rawde wrote:
    You could ping me if you knew my IP address.


    So let's take a step back before the posts get so big that my newsreader crashes.

    All networks are different.
    All businesses have different online/offline needs.
    All businesses have different processes and device needs.
    All businesses have different people with different ideas about how their network and devices should be secured.
    Businesses which design or manufacture technology may have different requirements when compared with businesses who just use it.
    People find security inconvenient. "Don't give your password to anyone else"
    is likely to fall on deaf ears.

    There is no one-size-fits-all cybersecurity solution.
    Any solution requires a detailed analysis of the network, the devices, and
    how the people and/or their guests use it.

    Few people know what is going in/out of the connection to their Internet provider.
    Few people care until it's too late.

    Human behaviour is a major factor.

    I had one manager do the equivalent of bursting into the operating theatre while the heart surgeon was busy with a delicate and complicated operation.
    He wanted to know all the details of the operation and why this part was connected to that part etc.
    It turned out that his reasoning was that after getting this information he could do it himself instead of paying "cybersecurity" people.

    Unskilled and unaware of it comes to mind. Search engine it if you need to.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Buzz McCool@21:1/5 to Don Y on Thu Apr 18 10:18:08 2024
    On 4/15/2024 10:13 AM, Don Y wrote:
    Is there a general rule of thumb for signalling the likelihood of
    an "imminent" (for some value of "imminent") hardware failure?

    This reminded me of some past efforts in this area. It was never
    demonstrated to me (given ample opportunity) that this technology
    actually worked on intermittently failing hardware I had, so be cautious
    in applying it in any future endeavors.

    https://radlab.cs.berkeley.edu/classes/cs444a/KGross_CSTH_Stanford.pdf

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don Y@21:1/5 to Buzz McCool on Thu Apr 18 15:05:07 2024
    On 4/18/2024 10:18 AM, Buzz McCool wrote:
    On 4/15/2024 10:13 AM, Don Y wrote:
    Is there a general rule of thumb for signalling the likelihood of
    an "imminent" (for some value of "imminent") hardware failure?

    This reminded me of some past efforts in this area. It was never demonstrated to me (given ample opportunity) that this technology actually worked on intermittently failing hardware I had, so be cautious in applying it in any future endeavors.

    Intermittent failures are the bane of all designers. Until something
    is reliably observable, trying to address the problem is largely
    whack-a-mole.

    https://radlab.cs.berkeley.edu/classes/cs444a/KGross_CSTH_Stanford.pdf

    Thanks for that. I didn't find it in my collection so its addition will
    be welcome.

    Sun has historically been aggressive in trying to increase availability, especially on big iron. In fact, such a "prediction" led me to discard
    a small server, yesterday (no time to dick with failing hardware!).

    I am now seeing similar features in Dell servers. But, the *actual* implementation details are always shrouded in mystery.

    But, it is obvious (for "always on" systems) that there are many things
    that can silently fail that will only manifest some time later -- if at
    all and possibly complicated by other failures that may have been
    precipitated by it.

    Sorting out WHAT to monitor is the tricky part. Then, having the
    ability to watch for trends can give you an inkling that something is
    headed in the wrong direction -- before it actually exceeds some
    baked in "hard limit".
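
    [As a rough illustration -- the window size and thresholds are invented,
    not tuned values -- trend detection can be as simple as comparing a new
    reading's drift against the historical noise of that parameter, long
    before any hard limit is reached:]

        # Sketch: flag a parameter whose *trend* is heading the wrong way,
        # well before it crosses the baked-in alarm limit.
        from collections import deque
        from statistics import mean, stdev

        class TrendWatch:
            def __init__(self, window=100, drift_sigmas=4.0, hard_limit=85.0):
                self.baseline = deque(maxlen=window)   # "typical" readings
                self.drift_sigmas = drift_sigmas
                self.hard_limit = hard_limit

            def update(self, reading):
                if reading >= self.hard_limit:
                    return "ALARM"                     # classic hard limit
                if len(self.baseline) >= 30:
                    mu, sigma = mean(self.baseline), stdev(self.baseline)
                    if sigma > 0 and (reading - mu) > self.drift_sigmas * sigma:
                        return "TREND"                 # heading the wrong way
                self.baseline.append(reading)
                return "OK"

        w = TrendWatch()
        readings = [42.0, 42.1, 41.9] * 20 + [42.5, 43.2, 44.1, 45.3]
        states = [w.update(r) for r in readings]
        print(states[-1])   # "TREND" long before the 85-degree hard limit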

    E.g., only the memory that you actively REFERENCE in a product is ever
    checked for errors! Bit rot may not be detected until some time after it
    has occurred -- when you eventually access that memory (and the memory controller throws an error).

    This is paradoxically amusing; code to HANDLE errors is likely the least accessed code in a product. So, bit rot IN that code is more likely
    to go unnoticed -- until it is referenced (by some error condition)
    and the error event complicated by the attendant error in the handler!
    The more reliable your code (fewer faults), the more uncertain you
    will be of the handlers' abilities to address faults that DO manifest!
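
    [One hedge against that -- a sketch, not a prescription, with the rate
    and fallback invented -- is deliberate fault injection: occasionally
    force the error path to run so rot in the *handlers* is noticed during
    routine operation rather than during a real emergency:]

        # Sketch of crude fault injection: make an operation fail on purpose,
        # rarely, so the error handler itself gets exercised (and any rot in
        # it gets noticed) long before a genuine fault needs it.
        import random

        FAULT_RATE = 0.001   # ~1 call in 1000 takes the error path on purpose

        def read_sensor(inject_faults=True):
            if inject_faults and random.random() < FAULT_RATE:
                raise IOError("injected fault (exercising the handler)")
            return 42.0      # stand-in for the real read

        def sample(last_good=41.9):
            try:
                return read_sensor()
            except IOError:
                # The handler that would otherwise almost never run:
                # fall back to the last known-good value and log it.
                return last_good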

    The same applies to secondary storage media. How will you know if some-rarely-accessed-file is intact and ready to be referenced
    WHEN NEEDED -- if you aren't doing patrol reads/scrubbing to
    verify that it is intact, NOW?

    [One common flaw with RAID implementations and naive reliance on that technology]
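
    [A sketch of such a patrol read, assuming a side file of SHA-256 digests
    recorded when the files were stored; the paths and manifest format are
    invented:]

        # Patrol read: re-hash every catalogued file under an archive root and
        # compare against the digest recorded at store time, so latent rot is
        # found NOW rather than when the file is finally needed.
        import hashlib, json, pathlib

        ROOT = pathlib.Path("/archive")
        MANIFEST = ROOT / "manifest.json"    # {"relative/path": "hex digest"}

        def sha256(path):
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            return h.hexdigest()

        expected = json.loads(MANIFEST.read_text())
        for rel, digest in expected.items():
            p = ROOT / rel
            if not p.exists():
                print("MISSING:", rel)
            elif sha256(p) != digest:
                print("ROT:", rel)           # restore from another copy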

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Glen Walpert@21:1/5 to Don Y on Fri Apr 19 01:27:11 2024
    On Thu, 18 Apr 2024 15:05:07 -0700, Don Y wrote:

    On 4/18/2024 10:18 AM, Buzz McCool wrote:
    On 4/15/2024 10:13 AM, Don Y wrote:
    Is there a general rule of thumb for signalling the likelihood of an
    "imminent" (for some value of "imminent") hardware failure?

    This reminded me of some past efforts in this area. It was never
    demonstrated to me (given ample opportunity) that this technology
    actually worked on intermittently failing hardware I had, so be
    cautious in applying it in any future endeavors.

    Intermittent failures are the bane of all designers. Until something is reliably observable, trying to address the problem is largely
    whack-a-mole.

    https://radlab.cs.berkeley.edu/classes/cs444a/KGross_CSTH_Stanford.pdf

    Thanks for that. I didn't find it in my collection so its addition
    will be welcome.

    Sun has historically been aggressive in trying to increase availability, especially on big iron. In fact, such a "prediction" led me to discard
    a small server, yesterday (no time to dick with failing hardware!).

    I am now seeing similar features in Dell servers. But, the *actual* implementation details are always shrouded in mystery.

    But, it is obvious (for "always on" systems) that there are many things
    that can silently fail that will only manifest some time later -- if at
    all and possibly complicated by other failures that may have been precipitated by it.

    Sorting out WHAT to monitor is the tricky part. Then, having the
    ability to watch for trends can give you an inkling that something is
    headed in the wrong direction -- before it actually exceeds some baked
    in "hard limit".

    E.g., only the memory that you actively REFERENCE in a product is ever checked for errors! Bit rot may not be detected until some time after
    it has occurred -- when you eventually access that memory (and the
    memory controller throws an error).

    This is paradoxically amusing; code to HANDLE errors is likely the least accessed code in a product. So, bit rot IN that code is more likely to
    go unnoticed -- until it is referenced (by some error condition)
    and the error event complicated by the attendant error in the handler!
    The more reliable your code (fewer faults), the more uncertain you will
    be of the handlers' abilities to address faults that DO manifest!

    The same applies to secondary storage media. How will you know if some-rarely-accessed-file is intact and ready to be referenced WHEN
    NEEDED -- if you aren't doing patrol reads/scrubbing to verify that it
    is intact, NOW?

    [One common flaw with RAID implementations and naive reliance on that technology]

    RAID, even with backups, is unsuited to high reliability storage of large databases. Distributed storage can be of much higher reliability:

    https://telnyx.com/resources/what-is-distributed-storage

    <https://towardsdatascience.com/introduction-to-distributed-data-storage-2ee03e02a11d>

    This requires successful retrieval of any n of m data files, normally from different locations, where n can be arbitrarily smaller than m depending
    on your needs. Overkill for small databases but required for high
    reliability storage of very large databases.
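
    [As a minimal illustration of the any-n-of-m idea, using the simplest
    possible case: k data fragments plus one XOR parity fragment, so any k of
    the k+1 pieces suffice. Real distributed stores use proper erasure codes
    (e.g., Reed-Solomon) to tolerate more than one missing piece:]

        # Split data into k fragments plus an XOR parity fragment; rebuild
        # from any k of the k+1 pieces.
        from functools import reduce

        def split(data, k=4):
            pad = (-len(data)) % k
            data += b"\x00" * pad
            size = len(data) // k
            frags = [data[i * size:(i + 1) * size] for i in range(k)]
            parity = bytes(reduce(lambda x, y: x ^ y, col) for col in zip(*frags))
            return frags + [parity], pad

        def rebuild(pieces, pad, k=4):
            """pieces: list of (index, bytes) -- any k of the indices 0..k."""
            have = dict(pieces)
            if not all(i in have for i in range(k)):
                missing = next(i for i in range(k) if i not in have)
                acc = bytearray(len(next(iter(have.values()))))
                for frag in have.values():        # XOR of the k pieces we have
                    for j, byte in enumerate(frag):
                        acc[j] ^= byte
                have[missing] = bytes(acc)
            data = b"".join(have[i] for i in range(k))
            return data[:len(data) - pad] if pad else data

        frags, pad = split(b"archival payload", k=4)
        # lose fragment 2; recover from pieces 0, 1, 3 and the parity (index 4)
        print(rebuild([(0, frags[0]), (1, frags[1]), (3, frags[3]), (4, frags[4])], pad))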

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don Y@21:1/5 to Glen Walpert on Thu Apr 18 20:08:17 2024
    On 4/18/2024 6:27 PM, Glen Walpert wrote:
    On Thu, 18 Apr 2024 15:05:07 -0700, Don Y wrote:

    The same applies to secondary storage media. How will you know if
    some-rarely-accessed-file is intact and ready to be referenced WHEN
    NEEDED -- if you aren't doing patrol reads/scrubbing to verify that it
    is intact, NOW?

    [One common flaw with RAID implementations and naive reliance on that
    technology]

    RAID, even with backups, is unsuited to high reliability storage of large databases. Distributed storage can be of much higher reliability:

    https://telnyx.com/resources/what-is-distributed-storage

    <https://towardsdatascience.com/introduction-to-distributed-data-storage-2ee03e02a11d>

    This requires successful retrieval of any n of m data files, normally from different locations, where n can be arbitrarily smaller than m depending
    on your needs. Overkill for small databases but required for high reliability storage of very large databases.

    This is effectively how I maintain my archive. Except that the
    media are all "offline", requiring a human operator (me) to
    fetch the required volumes in order to locate the desired files.

    Unlike mirroring (or other RAID technologies), my scheme places
    no constraints as to the "containers" holding the data. E.g.,

    DISK43 /somewhere/in/filesystem/ fileofinterest
    DISK21 >some>other>place anothernameforfile
    CDROM77 \yet\another\place archive.type /where/in/archive foo

    Can all yield the same "content" (as verified by their prestored signatures). Knowing the hash of each object means you can verify its contents from a
    single instance instead of looking for confirmation via other instance(s)

    [Hashes take up considerably less space than a duplicate copy would]

    This makes it easy to create multiple instances of particular "content"
    without imposing constraints on how it is named, stored, located, etc.
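
    [A sketch of such a catalog, with invented volume labels and paths: each
    entry maps a content hash to the places a copy is known to live, so any
    single intact instance can be verified on its own:]

        # Content catalog: sha256 -> [(volume label, path-on-that-volume), ...]
        # One intact copy can be checked against the prestored hash alone,
        # without comparing it byte-for-byte to another copy.
        import hashlib
        from collections import defaultdict

        catalog = defaultdict(list)

        def add_instance(volume, path_on_volume, local_file):
            """Record that path_on_volume (on `volume`) holds this content."""
            digest = hashlib.sha256(open(local_file, "rb").read()).hexdigest()
            catalog[digest].append((volume, path_on_volume))
            return digest

        def verify_instance(expected_digest, mounted_copy):
            """Check one mounted copy against the catalog's prestored hash."""
            actual = hashlib.sha256(open(mounted_copy, "rb").read()).hexdigest()
            return actual == expected_digest

        # e.g., the same content living in two very different "containers":
        #   key = add_instance("DISK43", "/somewhere/in/filesystem/fileofinterest", "paper.pdf")
        #   add_instance("CDROM77", r"\yet\another\place\archive.type", "paper.pdf")
        #   verify_instance(key, "/mnt/disk43/somewhere/in/filesystem/fileofinterest")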

    I.e., pull a disk out of a system, catalog its contents, slap an adhesive
    label on it (to be human-readable) and add it to your store.

    (If I could mount all of the volumes -- because I wouldn't know which volume might be needed -- then access wouldn't require a human operator, regardless
    of where the volumes were actually mounted or the peculiarities of the
    systems on which they are mounted! But, you can have a daemon that watches to see WHICH volumes are presently accessible and have it initiate a patrol
    read of their contents while the media are being accessed "for whatever OTHER reason" -- and track the time/date of last "verification" so you know which volumes haven't been checked, recently)

    The inconvenience of requiring human intervention is offset by the lack of
    wear on the media (as well as BTUs to keep it accessible) and the ease of creating NEW content/copies. NOT useful for data that needs to be accessed frequently but excellent for "archives"/repositories -- that can be mounted, accessed and DUPLICATED to online/nearline storage for normal use.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From boB@21:1/5 to blockedofcourse@foo.invalid on Fri Apr 19 11:16:02 2024
    On Thu, 18 Apr 2024 15:05:07 -0700, Don Y
    <blockedofcourse@foo.invalid> wrote:

    On 4/18/2024 10:18 AM, Buzz McCool wrote:
    On 4/15/2024 10:13 AM, Don Y wrote:
    Is there a general rule of thumb for signalling the likelihood of
    an "imminent" (for some value of "imminent") hardware failure?

    This reminded me of some past efforts in this area. It was never demonstrated
    to me (given ample opportunity) that this technology actually worked on
    intermittently failing hardware I had, so be cautious in applying it in any future endeavors.

    Intermittent failures are the bane of all designers. Until something
    is reliably observable, trying to address the problem is largely
    whack-a-mole.


    The problem I have with troubleshooting intermittent failures is that
    they are only intermittent sometimes.


    https://radlab.cs.berkeley.edu/classes/cs444a/KGross_CSTH_Stanford.pdf

    Thanks for that. I didn't find it in my collection so its addition will
    be welcome.

    Yes, neat paper.

    boB



    Sun has historically been aggressive in trying to increase availability, especially on big iron. In fact, such a "prediction" led me to discard
    a small server, yesterday (no time to dick with failing hardware!).

    I am now seeing similar features in Dell servers. But, the *actual* implementation details are always shrouded in mystery.

    But, it is obvious (for "always on" systems) that there are many things
    that can silently fail that will only manifest some time later -- if at
    all and possibly complicated by other failures that may have been precipitated by it.

    Sorting out WHAT to monitor is the tricky part. Then, having the
    ability to watch for trends can give you an inkling that something is
    headed in the wrong direction -- before it actually exceeds some
    baked in "hard limit".

    E.g., only the memory that you actively REFERENCE in a product is ever checked for errors! Bit rot may not be detected until some time after it
    has occurred -- when you eventually access that memory (and the memory controller throws an error).

    This is paradoxically amusing; code to HANDLE errors is likely the least accessed code in a product. So, bit rot IN that code is more likely
    to go unnoticed -- until it is referenced (by some error condition)
    and the error event complicated by the attendant error in the handler!
    The more reliable your code (fewer faults), the more uncertain you
    will be of the handlers' abilities to address faults that DO manifest!

    The same applies to secondary storage media. How will you know if some-rarely-accessed-file is intact and ready to be referenced
    WHEN NEEDED -- if you aren't doing patrol reads/scrubbing to
    verify that it is intact, NOW?

    [One common flaw with RAID implementations and naive reliance on that technology]

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don Y@21:1/5 to boB on Fri Apr 19 12:10:22 2024
    On 4/19/2024 11:16 AM, boB wrote:
    Intermittent failures are the bane of all designers. Until something
    is reliably observable, trying to address the problem is largely
    whack-a-mole.

    The problem I have with troubleshooting intermittent failures is that
    they are only intermittent sometimes.

    My pet peeve is folks (developers) who OBSERVE FIRST HAND a particular failure/fault but, because reproducing it is "hard", just pretend it
    never happened! Really? Do you think the circuit/code is self-healing???

    You're going to "bless" a product that you, personally, know has a fault...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From boB@21:1/5 to blockedofcourse@foo.invalid on Sun Apr 21 12:37:58 2024
    On Fri, 19 Apr 2024 12:10:22 -0700, Don Y
    <blockedofcourse@foo.invalid> wrote:

    On 4/19/2024 11:16 AM, boB wrote:
    Intermittent failures are the bane of all designers. Until something
    is reliably observable, trying to address the problem is largely
    whack-a-mole.

    The problem I have with troubleshooting intermittent failures is that
    they are only intermittent sometimes.

    My pet peeve is folks (developers) who OBSERVE FIRST HAND a particular failure/fault but, because reproducing it is "hard", just pretend it
    never happened! Really? Do you think the circuit/code is self-healing???

    You're going to "bless" a product that you, personally, know has a fault...


    Yes, it may be hard to replicate but you just have to try and try
    again sometimes. Or create something that exercises the unit or
    software to make it happen and automatically catch it in the act.
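
    [A sketch of that kind of exerciser -- the unit under test here is a
    stand-in: run the operation in a loop with reproducible inputs, and stop
    the instant the anomaly shows up so the offending case is captured
    rather than half-remembered:]

        # Hammer the operation with recorded, reproducible inputs and stop the
        # moment it misbehaves, so the exact failing case is in hand.
        import random, traceback

        def unit_under_test(x):              # stand-in for the real operation
            return x * 2

        for iteration in range(1_000_000):
            seed = random.randrange(2**32)
            rng = random.Random(seed)        # per-iteration reproducible inputs
            x = rng.randrange(10_000)
            try:
                result = unit_under_test(x)
                assert result == x * 2, f"bad result {result} for input {x}"
            except Exception:
                print(f"caught it: iteration={iteration} seed={seed} input={x}")
                traceback.print_exc()
                break                        # leave the scene intact for inspection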

    I don't care to have to do that very often. When I do, I just try to
    make it a challenge.

    boB

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Don Y@21:1/5 to boB on Sun Apr 21 14:23:32 2024
    On 4/21/2024 12:37 PM, boB wrote:
    On Fri, 19 Apr 2024 12:10:22 -0700, Don Y
    <blockedofcourse@foo.invalid> wrote:

    On 4/19/2024 11:16 AM, boB wrote:
    Intermittent failures are the bane of all designers. Until something
    is reliably observable, trying to address the problem is largely
    whack-a-mole.

    The problem I have with troubleshooting intermittent failures is that
    they are only intermittent sometimes.

    My pet peeve is folks (developers) who OBSERVE FIRST HAND a particular
    failure/fault but, because reproducing it is "hard", just pretend it
    never happened! Really? Do you think the circuit/code is self-healing???
    You're going to "bless" a product that you, personally, know has a fault...

    Yes, it may be hard to replicate but you just have to try and try
    again sometimes. Or create something that exercises the unit or
    software to make it happen and automatically catch it in the act.

    I think this was the perfect application for Google Glass! It
    seems a given that whenever you stumble on one of these "events",
    you aren't concentrating on how you GOT there; you didn't expect
    the failure to manifest so weren't keeping track of your actions.

    If, instead, you could "rewind" a recording of everything that you
    had done up to that point, it would likely go a long way towards
    helping you recreate the problem!
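
    [The software equivalent is cheap -- a sketch of a "flight recorder"
    that keeps only the last N actions in a ring buffer and dumps them when
    the anomaly finally shows up; the depth and event strings are invented:]

        # Keep the last N user/system actions in a ring buffer so that when a
        # rare fault finally manifests, the steps leading up to it can be
        # replayed instead of reconstructed from memory.
        import time
        from collections import deque

        class FlightRecorder:
            def __init__(self, depth=200):
                self.events = deque(maxlen=depth)   # old entries fall off

            def note(self, event):
                self.events.append((time.time(), event))

            def dump(self):
                return [f"{t:.3f}  {e}" for t, e in self.events]

        recorder = FlightRecorder()
        recorder.note("opened settings page")
        recorder.note("changed units to metric")
        recorder.note("pressed START")
        # ...on detecting the anomaly:
        print("\n".join(recorder.dump()))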

    When you get a "report" of someone encountering some anomalous
    behavior, it's easy to shrug it off because they are often very
    imprecise in describing their actions; details (crucial) are
    often missing or a subject of "fantasy". Is the person sure
    that the machine wasn't doing exactly what it SHOULD in that
    SPECIFIC situation??

    OTOH, when it happens to YOU, you know that the report isn't
    a fluke. But, you are just as weak on the details as those third-party
    reporters!

    I don't care to have to do that very often. When I do, I just try to
    make it a challenge.

    Being able to break a design into small pieces goes a long way to
    improving its quality. Taking "contractual design" to its extreme
    lets you build small, validatable modules that stand a greater
    chance of working in concert.
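
    [A sketch of what that looks like in code -- explicit pre/postconditions
    on each small module so a violation surfaces at the module boundary, not
    three layers downstream; the example function and limits are invented:]

        # Design-by-contract in miniature: each module states its pre- and
        # postconditions, so the party that broke the contract is identified
        # at the boundary where it happened.
        def require(cond, msg):              # precondition: caller's obligation
            if not cond:
                raise ValueError("precondition violated: " + msg)

        def ensure(cond, msg):               # postcondition: module's obligation
            if not cond:
                raise AssertionError("postcondition violated: " + msg)

        def scale_reading(raw, gain):
            require(0 <= raw <= 4095, "raw must be a 12-bit ADC code")
            require(gain > 0, "gain must be positive")
            result = raw * gain
            ensure(result >= 0, "scaled reading cannot be negative")
            return result

        scale_reading(2048, 0.00122)         # fine
        # scale_reading(70000, 0.00122)      # fails loudly, at the caller's door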

    Unfortunately, few have the discipline for such detail, hoping,
    instead, to test bigger units (if they do ANY formal testing at all!)

    Think of how little formal testing goes into a hardware design.
    Aside from imposing inputs and outputs at their extremes, what
    *really* happens before a design is released to manufacturing?
    (I haven't seen a firm that does a rigorous shake-n-bake in
    more than 40 years!)

    And, how much less goes into software -- where it is relatively easy to
    build test scaffolding and implement regression tests to ensure new
    releases don't reintroduce old bugs...
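
    [A sketch of that scaffolding -- the unit under test is a stand-in here:
    every bug that gets fixed earns a test named after it, and the whole
    suite runs on every build so the fix can't quietly be undone:]

        import unittest

        def scale_reading(raw, gain):        # stand-in for the real unit
            return raw * gain

        class RegressionTests(unittest.TestCase):
            def test_bug_142_zero_raw_reading(self):
                # Bug 142: a zero ADC code once raised instead of returning 0.0
                self.assertEqual(scale_reading(0, 0.00122), 0.0)

            def test_bug_187_full_scale_reading(self):
                # Bug 187: a full-scale code once overflowed a 16-bit intermediate
                self.assertAlmostEqual(scale_reading(4095, 0.00122), 4.9959, places=3)

        if __name__ == "__main__":
            unittest.main()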

    When the emphasis (Management) is getting product out the door,
    it's easy to see engineering (and manufacturing) disciplines suffer.

    :<

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)