Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
I suspect most would involve *relative* changes that would be
suggestive of changing conditions in the components (and not
directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and
notice the sorts of changes you "typically" encounter in the hope
that something of greater magnitude would be a harbinger...
On Mon, 15 Apr 2024 10:13:02 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
I suspect most would involve *relative* changes that would be
suggestive of changing conditions in the components (and not
directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and
notice the sorts of changes you "typically" encounter in the hope
that something of greater magnitude would be a harbinger...
There is a standard approach that may work: Measure the level and
trend of very low frequency (around a tenth of a Hertz) flicker noise.
When connections (perhaps within a package) start to fail, the flicker
level rises. The actual frequency monitored isn't all that critical.
Joe Gwinn
On Mon, 15 Apr 2024 15:41:57 -0400, Joe Gwinn <joegwinn@comcast.net>
wrote:
On Mon, 15 Apr 2024 10:13:02 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
I suspect most would involve *relative* changes that would be
suggestive of changing conditions in the components (and not
directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and
notice the sorts of changes you "typically" encounter in the hope
that something of greater magnitude would be a harbinger...
There is a standard approach that may work: Measure the level and
trend of very low frequency (around a tenth of a Hertz) flicker noise.
When connections (perhaps within a package) start to fail, the flicker
level rises. The actual frequency monitored isn't all that critical.
Joe Gwinn
Do connections "start to fail" ?
I don't think I've ever owned a piece of electronic equipment that
warned me of an impending failure.
Cars do, for some failure modes, like low oil level.
Don, what does the thing do?
On Mon, 15 Apr 2024 13:05:40 -0700, john larkin <jl@650pot.com> wrote:
On Mon, 15 Apr 2024 15:41:57 -0400, Joe Gwinn <joegwinn@comcast.net>
wrote:
On Mon, 15 Apr 2024 10:13:02 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
I suspect most would involve *relative* changes that would be
suggestive of changing conditions in the components (and not
directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and
notice the sorts of changes you "typically" encounter in the hope
that something of greater magnitude would be a harbinger...
There is a standard approach that may work: Measure the level and
trend of very low frequency (around a tenth of a Hertz) flicker noise.
When connections (perhaps within a package) start to fail, the flicker
level rises. The actual frequency monitored isn't all that critical.
Joe Gwinn
Do connections "start to fail" ?
Yes, they do, in things like vias. I went through a big drama where a critical bit of radar logic circuitry would slowly go nuts.
It turned out that the copper plating on the walls of the vias was
suffering from low-cycle fatigue during temperature cycling and slowly breaking, one little crack at a time, until it went open. If you
measured the resistance to parts per million (6.5 digit DMM), sampling
at 1 Hz, you could see the 1/f noise at 0.1 Hz rising. It's useful to
also measure a copper line, and divide the via-chain resistance by the
no-via resistance, to correct for temperature changes.
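A rough Python sketch of that measurement, for illustration only: sample the
temperature-corrected ratio at 1 Hz, estimate the noise power in a band
around 0.1 Hz, and trend it against a known-good baseline. The band edges
and the factor-of-three alarm are assumptions, not values from this thread.

import numpy as np
from scipy.signal import welch

def flicker_band_power(ratio, fs=1.0, f_lo=0.05, f_hi=0.2):
    # Average PSD of the via-chain / no-via resistance ratio near 0.1 Hz.
    x = np.asarray(ratio, dtype=float)
    x = x - x.mean()                      # remove DC; only the fluctuation matters
    f, pxx = welch(x, fs=fs, nperseg=min(len(x), 1024))
    band = (f >= f_lo) & (f <= f_hi)
    return pxx[band].mean()

# Compare today's band power against a baseline taken when the board was
# known-good; a sustained rise suggests cracking interconnect.
rng = np.random.default_rng(0)
baseline = flicker_band_power(1.0 + 1e-6 * rng.standard_normal(4096))
today    = flicker_band_power(1.0 + 5e-6 * rng.standard_normal(4096))
if today > 3.0 * baseline:                # factor of 3 is an arbitrary illustration
    print("flicker level rising -- schedule inspection")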
The solution was to redesign the vias, mainly to increase the critical
volume of copper. And modern SMD designs have less and less copper
volume.
I bet precision resistors can also be measured this way.
I don't think I've ever owned a piece of electronic equipment that
warned me of an impending failure.
Onset of smoke emission is a common sign.
Cars do, for some failure modes, like low oil level.
The industrial method for big stuff is accelerometers attached near
the bearings, and listen for excessive rotation-correlated (not
necessarily harmonic) noise.
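A hedged Python sketch of that bearing check: compare the accelerometer
energy near multiples of the shaft rotation rate against a known-good
baseline. Sample rate, shaft speed and the factor-of-five threshold are
made-up illustrative values.

import numpy as np

def rotation_correlated_rms(accel, fs, shaft_hz, n_harmonics=10, bw=0.5):
    # RMS of spectral content within +/- bw Hz of the first few multiples
    # of the rotation frequency (rotation-related, not assumed to be a
    # single clean harmonic series).
    x = np.asarray(accel, float) - np.mean(accel)
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    mask = np.zeros_like(freqs, dtype=bool)
    for k in range(1, n_harmonics + 1):
        mask |= np.abs(freqs - k * shaft_hz) <= bw
    power = np.sum(np.abs(spec[mask]) ** 2) / (len(x) ** 2)
    return np.sqrt(2.0 * power)           # approximate RMS contribution of the band

fs, shaft_hz = 10_000.0, 29.5             # assumed values
t = np.arange(0, 10, 1 / fs)
good = 0.01 * np.random.randn(t.size)
worn = good + 0.05 * np.sin(2 * np.pi * 3 * shaft_hz * t)   # simulated defect tone
baseline = rotation_correlated_rms(good, fs, shaft_hz)
if rotation_correlated_rms(worn, fs, shaft_hz) > 5 * baseline:
    print("bearing noise trending up -- plan maintenance")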
Joe Gwinn <joegwinn@comcast.net> wrote:
On Mon, 15 Apr 2024 13:05:40 -0700, john larkin <jl@650pot.com> wrote:
On Mon, 15 Apr 2024 15:41:57 -0400, Joe Gwinn <joegwinn@comcast.net>
wrote:
On Mon, 15 Apr 2024 10:13:02 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
I suspect most would involve *relative* changes that would be
suggestive of changing conditions in the components (and not
directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and
notice the sorts of changes you "typically" encounter in the hope
that something of greater magnitude would be a harbinger...
There is a standard approach that may work: Measure the level and
trend of very low frequency (around a tenth of a Hertz) flicker noise.
When connections (perhaps within a package) start to fail, the flicker
level rises. The actual frequency monitored isn't all that critical.
Joe Gwinn
Do connections "start to fail" ?
Yes, they do, in things like vias. I went through a big drama where a
critical bit of radar logic circuitry would slowly go nuts.
It turned out that the copper plating on the walls of the vias was
suffering from low-cycle fatigue during temperature cycling and slowly
breaking, one little crack at a time, until it went open. If you
measured the resistance to parts per million (6.5 digit DMM), sampling
at 1 Hz, you could see the 1/f noise at 0.1 Hz rising. It's useful to
also measure a copper line, and divide the via-chain resistance by the
no-via resistance, to correct for temperature changes.
The solution was to redesign the vias, mainly to increase the critical
volume of copper. And modern SMD designs have less and less copper
volume.
I bet precision resistors can also be measured this way.
I don't think I've ever owned a piece of electronic equipment that
warned me of an impending failure.
Onset of smoke emission is a common sign.
Cars do, for some failure modes, like low oil level.
The industrial method for big stuff is accelerometers attached near
the bearings, and listen for excessive rotation-correlated (not
necessarily harmonic) noise.
There are a number of instruments available that look for metal particles
in the lubricating oil.
Cheers
Phil Hobbs
Don Y <blockedofcourse@foo.invalid> Wrote in message:
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
I suspect most would involve *relative* changes that would be
suggestive of changing conditions in the components (and not
directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and
notice the sorts of changes you "typically" encounter in the hope
that something of greater magnitude would be a harbinger...
Current and voltages outside of normal operation?
On 4/15/2024 10:32 AM, Martin Rid wrote:
Don Y <blockedofcourse@foo.invalid> Wrote in message:
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
I suspect most would involve *relative* changes that would be
suggestive of changing conditions in the components (and not
directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and
notice the sorts of changes you "typically" encounter in the hope
that something of greater magnitude would be a harbinger...
Current and voltages outside of normal operation?
I think "outside" is (often) likely indicative of
"something is (already) broken".
But, perhaps TRENDS in either/both can be predictive.
E.g., if a (sub)circuit has always been consuming X (which
is nominal for the design) and, over time, starts to consume
1.1X, is that suggestive that something is in the process of
failing?
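A minimal sketch of that kind of trend watch in Python; the nominal draw,
the smoothing constant and the 10% figure are assumptions for illustration,
not anything from a real design.

class DriftWatch:
    def __init__(self, nominal, alpha=0.001, warn_ratio=1.10):
        self.baseline = nominal      # long-term average, seeded with design nominal
        self.alpha = alpha           # small alpha => the baseline moves slowly
        self.warn_ratio = warn_ratio
        self.nominal = nominal

    def update(self, sample):
        # exponential moving average; warnings only appear after a *sustained* rise
        self.baseline += self.alpha * (sample - self.baseline)
        return self.baseline / self.nominal   # 1.10 means "10% above nominal"

watch = DriftWatch(nominal=0.250)             # 250 mA nominal draw (assumed)
for current in (0.251, 0.253, 0.262, 0.275):  # periodic measurements
    if watch.update(current) > watch.warn_ratio:
        print("sustained rise above nominal -- possible degradation")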
Note that the goal is not to troubleshoot the particular design
or its components but, rather, act as an early warning that
maintenance may be required (or, that performance may not be
what you are expecting/have become accustomed to).
You can include mechanisms to verify outputs are what you
*intended* them to be (in case the output drivers have shit
the bed).
You can, also, do sanity checks that ensure values are never
what they SHOULDN'T be (this is commonly done within software
products -- if something "can't happen" then noticing that
it IS happening is a sure-fire indication that something
is broken!)
[Limit switches on mechanisms are there to ensure the impossible
is not possible -- like driving a mechanism beyond its extents]
And, where possible, notice second-hand effects of your actions
(e.g., if you switched on a load, you should see an increase
in supplied current).
But, again, these are more helpful in detecting FAILED items.
"Don Y" <blockedofcourse@foo.invalid> wrote in message news:uvjn74$d54b$1@dont-email.me...
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
My conclusion would be no.
Some of my reasons are given below.
It always puzzled me how HAL could know that the AE-35 would fail in the
near future, but maybe HAL had a motive for lying.
Back in that era I was doing a lot of repair work when I should have been doing my homework.
So I knew that there were many unrelated kinds of hardware failure.
A component could fail suddenly, such as a short circuit diode, and everything would work fine after replacing it.
The cause could perhaps have been a manufacturing defect, such as insufficient cooling due to poor quality assembly, but the exact real cause would never be known.
A component could fail suddenly as a side effect of another failure.
One short circuit output transistor and several other components could also burn up.
A component could fail slowly and only become apparent when it got to the stage of causing an audible or visible effect.
It would often be easy to locate the dried up electrolytic due to it having already let go of some of its contents.
So I concluded that if I wanted to be sure that I could always watch my favourite TV show, we would have to have at least two TVs in the house.
If it's not possible to have the equivalent of two TVs then you will want to be in a position to get the existing TV repaired or replaced as quickly as possible.
My home wireless Internet system doesn't care if one access point fails, and I would not expect to be able to do anything to predict a time of failure. Experience says a dead unit has power supply issues. Usually external but could be internal.
I don't think it would be possible to "watch" everything because it's rare that you can properly test a component while it's part of a working system.
These days I would expect to have fun with management asking for software to be able to diagnose and report any hardware failure.
Not very easy if the power supply has died.
On 4/15/2024 1:32 PM, Edward Rawde wrote:
"Don Y" <blockedofcourse@foo.invalid> wrote in message
news:uvjn74$d54b$1@dont-email.me...
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
My conclusion would be no.
Some of my reasons are given below.
It always puzzled me how HAL could know that the AE-35 would fail in the
near future, but maybe HAL had a motive for lying.
Why does your PC retry failed disk operations?
If I ask the drive to give
me LBA 1234, shouldn't it ALWAYS give me LBA1234? Without any data corruption
(CRC error) AND within the normal access time limits defined by the
location
of those magnetic domains on the rotating medium?
Why should it attempt to retry this MORE than once?
Now, if you knew your disk drive was repeatedly retrying operations,
would your confidence in it be unchanged from times when it did not
exhibit such behavior?
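The drive does keep those statistics itself (SMART). A hedged sketch that
assumes smartmontools is installed and an ATA-style attribute table; the
attribute names below are common but not universal, and the parsing is
deliberately crude.

import subprocess

WATCHED = {"Reallocated_Sector_Ct", "Current_Pending_Sector",
           "Offline_Uncorrectable", "UDMA_CRC_Error_Count"}

def smart_raw_counts(device="/dev/sda"):
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True, check=False).stdout
    counts = {}
    for line in out.splitlines():
        parts = line.split()
        if len(parts) >= 10 and parts[1] in WATCHED:
            try:
                counts[parts[1]] = int(parts[9])   # RAW_VALUE column
            except ValueError:
                pass
    return counts

for name, value in smart_raw_counts().items():
    if value > 0:
        print(f"{name} = {value}: the drive is quietly working around trouble")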
Assuming you have properly configured an EIA232 interface, why would you
ever get a parity error? (OVERRUN errors can be the result of an i/f
that is running too fast for the system on the receiving end) How would
you even KNOW this was happening?
I suspect everyone who has owned a DVD/CD drive has encountered a
"slow tray" as the mechanism aged. Or, a tray that wouldn't
open (of its own accord) as soon/quickly as it used to.
The controller COULD be watching this (cuz it knows when it
initiated the operation and there is an "end-of-stroke"
sensor available) and KNOW that the drive belt was stretching
to the point where it was impacting operation.
[And, that a stretched belt wasn't going to suddenly decide to
unstretch to fix the problem!]
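A toy Python sketch of that bookkeeping; the motor/sensor hooks and the
baseline stroke time are placeholders for illustration, not any real
drive's firmware API.

import time

class TrayHealth:
    def __init__(self, new_unit_seconds=0.25, warn_factor=1.5):
        self.baseline = new_unit_seconds      # stroke time when the belt was new
        self.warn_factor = warn_factor

    def record_stroke(self, seconds):
        if seconds > self.warn_factor * self.baseline:
            return "belt likely stretched -- service soon"
        return "ok"

def open_tray(start_motor, end_of_stroke_reached, health, timeout=5.0):
    t0 = time.monotonic()
    start_motor()
    while not end_of_stroke_reached():
        if time.monotonic() - t0 > timeout:
            return "tray failed to open"
        time.sleep(0.01)
    return health.record_stroke(time.monotonic() - t0)

# toy usage: a "motor" that takes 0.5 s to reach the end-of-stroke switch
opened_at = time.monotonic() + 0.5
print(open_tray(lambda: None, lambda: time.monotonic() >= opened_at, TrayHealth()))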
Back in that era I was doing a lot of repair work when I should have been
doing my homework.
So I knew that there were many unrelated kinds of hardware failure.
The goal isn't to predict ALL failures but, rather, to anticipate
LIKELY failures and treat them before they become an inconvenience
(or worse).
One morning, the (gas) furnace repeatedly tried to light as the
thermostat called for heat. Then, a few moments later, the
safeties would kick in and shut down the gas flow. This attracted my attention as the LIT furnace should STAY LIT!
The furnace was too stupid to notice its behavior so would repeat
this cycle, endlessly.
I stepped in and overrode the thermostat to eliminate the call
for heat as this behavior couldn't be productive (if something
truly IS wrong, then why let it continue? and, if there is nothing
wrong with the controls/mechanism, then clearly it is unable to meet
my needs so why let it persist in trying?)
[Turns out, there was a city-wide gas shortage so there was enough
gas available to light the furnace but not enough to bring it up to temperature as quickly as the designers had expected]
A component could fail suddenly, such as a short circuit diode, and
everything would work fine after replacing it.
The cause could perhaps have been a manufacturing defect, such as
insufficient cooling due to poor quality assembly, but the exact real
cause
would never be known.
You don't care about the real cause. Or, even the failure mode.
You (as user) just don't want to be inconvenienced by the sudden
loss of the functionality/convenience that the device provided.
A component could fail suddenly as a side effect of another failure.
One short circuit output transistor and several other components could
also
burn up.
So, if you could predict the OTHER failure...
Or, that such a failure might occur and lead to the followup failure...
A component could fail slowly and only become apparent when it got to the
stage of causing an audible or visible effect.
But, likely, there was something observable *in* the circuit that
just hadn't made it to the level of human perception.
It would often be easy to locate the dried up electrolytic due to it
having
already let go of some of its contents.
So I concluded that if I wanted to be sure that I could always watch my
favourite TV show, we would have to have at least two TVs in the house.
If it's not possible to have the equivalent of two TVs then you will want
to
be in a position to get the existing TV repaired or replaced as quickly as
possible.
Two TVs are affordable. Consider two controllers for a wire-EDM machine.
Or, the cost of having that wire-EDM machine *idle* (because you didn't
have a spare controller!)
My home wireless Internet system doesn't care if one access point fails,
and
I would not expect to be able to do anything to predict a time of
failure.
Experience says a dead unit has power supply issues. Usually external but
could be internal.
Again, the goal isn't to predict "time of failure". But, rather, to be
able to know that "this isn't going to end well" -- with some advance
notice
that allows for preemptive action to be taken (and not TOO much advance notice that the user ends up replacing items prematurely).
I don't think it would be possible to "watch" everything because it's
rare
that you can properly test a component while it's part of a working
system.
You don't have to -- as long as you can observe its effects on other
parts of the system. E.g., there's no easy/inexpensive way to
check to see how much the belt on that CD/DVD player has stretched.
But, you can notice that it HAS stretched (or, some less likely
change has occurred that similarly interferes with the tray's actions)
by noting how the activity that it is used for has changed.
These days I would expect to have fun with management asking for software
to
be able to diagnose and report any hardware failure.
Not very easy if the power supply has died.
What if the power supply HASN'T died? What if you are diagnosing the
likely upcoming failure *of* the power supply?
You have ECC memory in most (larger) machines. Do you silently
expect it to just fix all the errors? Does it have a way of telling you
how many such errors it HAS corrected? Can you infer the number of
errors that it *hasn't*?
[Why have ECC at all?]
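On Linux, for instance, EDAC exposes exactly those counts in sysfs. A
sketch, assuming the EDAC driver is loaded; paths and granularity differ
by platform.

from pathlib import Path

def ecc_counts():
    for mc in sorted(Path("/sys/devices/system/edac/mc").glob("mc*")):
        try:
            ce = int((mc / "ce_count").read_text())   # corrected (silently fixed)
            ue = int((mc / "ue_count").read_text())   # uncorrected
        except (FileNotFoundError, ValueError):
            continue
        yield mc.name, ce, ue

for name, ce, ue in ecc_counts():
    if ce:
        print(f"{name}: {ce} corrected errors -- memory degrading?")
    if ue:
        print(f"{name}: {ue} UNcorrected errors -- act now")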
There are (and have been) many efforts to *predict* lifetimes of
components (and, systems). And, some work to examine the state
of systems /in situ/ with an eye towards anticipating their
likelihood of future failure.
[The former has met with poor results -- predicting the future
without a position in its past is difficult. And, knowing how
a device is "stored" when not powered on also plays a role
in its future survival! (is there some reason YOUR devices
can't power themselves on, periodically; notice the environmental
conditions; log them and then power back off)]
The question is one of a practical nature; how much does it cost
you to add this capability to a device and how accurately can it
make those predictions (thus avoiding some future cost/inconvenience).
For small manufacturers, the research required is likely not
cost-effective;
just take your best stab at it and let the customer "buy a replacement"
when the time comes (hopefully, outside of your warranty window).
But, anything you can do to minimize this TCO issue gives your product
an edge over competitors. Given that most devices are smart, nowadays,
it seems obvious that they should undertake as much of this task as
they can (conveniently) afford.
<https://www.sciencedirect.com/science/article/abs/pii/S0026271409003667>
<https://www.researchgate.net/publication/3430090_In_Situ_Temperature_Measurement_of_a_Notebook_Computer-A_Case_Study_in_Health_and_Usage_Monitoring_of_Electronics>
<https://www.tandfonline.com/doi/abs/10.1080/16843703.2007.11673148>
<https://www.prognostics.umd.edu/calcepapers/02_V.Shetty_remaingLifeAssesShuttleRemotemanipulatorSystem_22ndSpaceSimulationConf.pdf>
<https://ieeexplore.ieee.org/document/1656125>
<https://journals.sagepub.com/doi/10.1177/0142331208092031>
[Sorry, I can't publish links to the full articles]
Current and voltages outside of normal operation?
I think "outside" is (often) likely indicative of
"something is (already) broken".
But, perhaps TRENDS in either/both can be predictive.
E.g., if a (sub)circuit has always been consuming X (which
is nominal for the design) and, over time, starts to consume
1.1X, is that suggestive that something is in the process of
failing?
That depends on many other unknown factors.
Temperature sensors are common in electronics.
So is current sensing. Voltage sensing too.
Note that the goal is not to troubleshoot the particular design
or its components but, rather, act as an early warning that
maintenance may be required (or, that performance may not be
what you are expecting/have become accustomed to).
If the system is electronic then you can detect whether currents and/or voltages are within expected ranges.
If they are a just a little out of expected range then you might turn on a warning LED.
If they are way out of range then you might tell the power supply to turn
off quick.
By all means tell the software what has happened, but don't put software between the current sensor and the emergency turn off.
Be aware that components in monitoring circuits can fail too.
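A minimal sketch of that split, for illustration: software only reports
the "slightly out of range" trend, while the fast hard trip stays in the
hardware (comparator/supervisor), never behind this code. The limits are
assumptions.

WARN = {"rail_5v": (4.85, 5.15), "load_amps": (0.0, 2.0)}   # soft windows

def check(readings):
    notes = []
    for name, value in readings.items():
        lo, hi = WARN[name]
        if not (lo <= value <= hi):
            notes.append(f"{name}={value} outside soft window ({lo}..{hi})")
    return notes

print(check({"rail_5v": 5.21, "load_amps": 1.1}))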
But, again, these are more helpful in detecting FAILED items.
What system would you like to have early warnings for?
Are the warnings needed to indicate operation out of expected limits or to indicate that maintenance is required, or both?
Without detailed knowledge of the specific system, only speculative answers can be given.
"Don Y" <blockedofcourse@foo.invalid> wrote in message news:uvkqqu$o5co$1@dont-email.me...
On 4/15/2024 1:32 PM, Edward Rawde wrote:
"Don Y" <blockedofcourse@foo.invalid> wrote in message
news:uvjn74$d54b$1@dont-email.me...
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
My conclusion would be no.
Some of my reasons are given below.
It always puzzled me how HAL could know that the AE-35 would fail in the near future, but maybe HAL had a motive for lying.
Why does your PC retry failed disk operations?
Because the software designer didn't understand hardware.
The correct approach is to mark that part of the disk as unusable and, if possible, move any data from it elsewhere quick.
If I ask the drive to give
me LBA 1234, shouldn't it ALWAYS give me LBA1234? Without any data
corruption
(CRC error) AND within the normal access time limits defined by the
location
of those magnetic domains on the rotating medium?
Why should it attempt to retry this MORE than once?
Now, if you knew your disk drive was repeatedly retrying operations,
would your confidence in it be unchanged from times when it did not
exhibit such behavior?
I'd have put an SSD in by now, along with an off site backup of the same
data :)
Assuming you have properly configured an EIA232 interface, why would you
ever get a parity error? (OVERRUN errors can be the result of an i/f
that is running too fast for the system on the receiving end) How would
you even KNOW this was happening?
I suspect everyone who has owned a DVD/CD drive has encountered a
"slow tray" as the mechanism aged. Or, a tray that wouldn't
open (of its own accord) as soon/quickly as it used to.
If it hasn't been used for some time then I'm ready with a tiny screwdriver blade to help it open.
But I forget when I last used an optical drive.
The controller COULD be watching this (cuz it knows when it
initiated the operation and there is an "end-of-stroke"
sensor available) and KNOW that the drive belt was stretching
to the point where it was impacting operation.
[And, that a stretched belt wasn't going to suddenly decide to
unstretch to fix the problem!]
Back in that era I was doing a lot of repair work when I should have been doing my homework.
So I knew that there were many unrelated kinds of hardware failure.
The goal isn't to predict ALL failures but, rather, to anticipate
LIKELY failures and treat them before they become an inconvenience
(or worse).
One morning, the (gas) furnace repeatedly tried to light as the
thermostat called for heat. Then, a few moments later, the
safeties would kick in and shut down the gas flow. This attracted my
attention as the LIT furnace should STAY LIT!
The furnace was too stupid to notice its behavior so would repeat
this cycle, endlessly.
I stepped in and overrode the thermostat to eliminate the call
for heat as this behavior couldn't be productive (if something
truly IS wrong, then why let it continue? and, if there is nothing
wrong with the controls/mechanism, then clearly it is unable to meet
my needs so why let it persist in trying?)
[Turns out, there was a city-wide gas shortage so there was enough
gas available to light the furnace but not enough to bring it up to
temperature as quickly as the designers had expected]
That's why the furnace designers couldn't have anticipated it.
They did not know that such a condition might occur so never tested for it.
A component could fail suddenly, such as a short circuit diode, and
everything would work fine after replacing it.
The cause could perhaps have been a manufacturing defect, such as
insufficient cooling due to poor quality assembly, but the exact real
cause
would never be known.
You don't care about the real cause. Or, even the failure mode.
You (as user) just don't want to be inconvenienced by the sudden
loss of the functionality/convenience that the device provided.
There will always be sudden unexpected loss of functionality for reasons which could not easily be predicted.
People who service lawn mowers in the area where I live are very busy right now.
A component could fail suddenly as a side effect of another failure.
One short circuit output transistor and several other components could
also
burn up.
So, if you could predict the OTHER failure...
Or, that such a failure might occur and lead to the followup failure...
A component could fail slowly and only become apparent when it got to the stage of causing an audible or visible effect.
But, likely, there was something observable *in* the circuit that
just hadn't made it to the level of human perception.
Yes a power supply ripple detection circuit could have turned on a warning LED but that never happened for at least two reasons.
1. The detection circuit would have increased the cost of the equipment and thus diminished the profit of the manufacturer.
2. The user would not have understood and would have ignored the warning anyway.
It would often be easy to locate the dried up electrolytic due to it
having
already let go of some of its contents.
So I concluded that if I wanted to be sure that I could always watch my
favourite TV show, we would have to have at least two TVs in the house.
If it's not possible to have the equivalent of two TVs then you will want to
be in a position to get the existing TV repaired or replaced as quickly as possible.
Two TVs are affordable. Consider two controllers for a wire-EDM machine.
Or, the cost of having that wire-EDM machine *idle* (because you didn't
have a spare controller!)
My home wireless Internet system doesn't care if one access point fails, and
I would not expect to be able to do anything to predict a time of
failure.
Experience says a dead unit has power supply issues. Usually external but could be internal.
Again, the goal isn't to predict "time of failure". But, rather, to be
able to know that "this isn't going to end well" -- with some advance
notice
that allows for preemptive action to be taken (and not TOO much advance
notice that the user ends up replacing items prematurely).
Get feedback from the people who use your equipment.
I don't think it would be possible to "watch" everything because it's
rare
that you can properly test a component while it's part of a working
system.
You don't have to -- as long as you can observe its effects on other
parts of the system. E.g., there's no easy/inexpensive way to
check to see how much the belt on that CD/DVD player has stretched.
But, you can notice that it HAS stretched (or, some less likely
change has occurred that similarly interferes with the tray's actions)
by noting how the activity that it is used for has changed.
Sure but you have to be the operator for that.
So you can be ready to help the tray open when needed.
These days I would expect to have fun with management asking for software to
be able to diagnose and report any hardware failure.
Not very easy if the power supply has died.
What if the power supply HASN'T died? What if you are diagnosing the
likely upcoming failure *of* the power supply?
Then I probably can't, because the power supply may be just a bought in
power supply which was never designed with upcoming failure detection in mind.
You have ECC memory in most (larger) machines. Do you silently
expect it to just fix all the errors? Does it have a way of telling you
how many such errors it HAS corrected? Can you infer the number of
errors that it *hasn't*?
[Why have ECC at all?]
Things are sometimes done the way they've always been done.
I used to notice a missing chip in the 9th position, but now you mention it, the RAM I just looked at has 9 chips on each side.
There are (and have been) many efforts to *predict* lifetimes of
components (and, systems). And, some work to examine the state
of systems /in situ/ with an eye towards anticipating their
likelihood of future failure.
I'm sure that's true.
[The former has met with poor results -- predicting the future
without a position in its past is difficult. And, knowing how
a device is "stored" when not powered on also plays a role
in its future survival! (is there some reason YOUR devices
can't power themselves on, periodically; notice the environmental
conditions; log them and then power back off)]
The question is one of a practical nature; how much does it cost
you to add this capability to a device and how accurately can it
make those predictions (thus avoiding some future cost/inconvenience).
For small manufacturers, the research required is likely not
cost-effective;
just take your best stab at it and let the customer "buy a replacement"
when the time comes (hopefully, outside of your warranty window).
But, anything you can do to minimize this TCO issue gives your product
an edge over competitors. Given that most devices are smart, nowadays,
it seems obvious that they should undertake as much of this task as
they can (conveniently) afford.
<https://www.sciencedirect.com/science/article/abs/pii/S0026271409003667>
<https://www.researchgate.net/publication/3430090_In_Situ_Temperature_Measurement_of_a_Notebook_Computer-A_Case_Study_in_Health_and_Usage_Monitoring_of_Electronics>
<https://www.tandfonline.com/doi/abs/10.1080/16843703.2007.11673148>
<https://www.prognostics.umd.edu/calcepapers/02_V.Shetty_remaingLifeAssesShuttleRemotemanipulatorSystem_22ndSpaceSimulationConf.pdf>
<https://ieeexplore.ieee.org/document/1656125>
<https://journals.sagepub.com/doi/10.1177/0142331208092031>
[Sorry, I can't publish links to the full articles]
On 15/04/2024 18:13, Don Y wrote:
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
You have to be very careful that the additional complexity doesn't itself introduce new annoying failure modes.
My previous car had filament bulb failure
sensors (the new one is LED) of which the one for the parking light had
itself failed - the parking light still worked. However, the car would
greet me with "parking light failure" every time I started the engine
and the main dealer refused to cancel it.
Repair of the parking light sensor failure required swapping out the
*entire* front light assembly, since it was put together with one-time
hot glue. That would be a very expensive "repair" for a trivial fault.
The parking light is not even a required feature.
I suspect most would involve *relative* changes that would be
suggestive of changing conditions in the components (and not
directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and
notice the sorts of changes you "typically" encounter in the hope
that something of greater magnitude would be a harbinger...
Monitoring temperature, voltage supply and current consumption isn't a
bad idea. If they get unexpectedly out of line something is wrong.
Likewise with power on self tests you can catch some latent failures
before they actually affect normal operation.
On 4/15/2024 9:14 PM, Edward Rawde wrote:
It always puzzled me how HAL could know that the AE-35 would fail in
the
near future, but maybe HAL had a motive for lying.
Why does your PC retry failed disk operations?
Because the software designer didn't understand hardware.
Actually, he DID understand the hardware which is why he retried
it instead of ASSUMING every operation would proceed correctly.
....
When the firmware in your SSD corrupts your data, what remedy will
you use?
On Mon, 15 Apr 2024 18:03:23 -0400, Joe Gwinn <joegwinn@comcast.net>
wrote:
On Mon, 15 Apr 2024 13:05:40 -0700, john larkin <jl@650pot.com> wrote:
On Mon, 15 Apr 2024 15:41:57 -0400, Joe Gwinn <joegwinn@comcast.net> wrote:
On Mon, 15 Apr 2024 10:13:02 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
I suspect most would involve *relative* changes that would be
suggestive of changing conditions in the components (and not
directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and
notice the sorts of changes you "typically" encounter in the hope
that something of greater magnitude would be a harbinger...
There is a standard approach that may work: Measure the level and
trend of very low frequency (around a tenth of a Hertz) flicker noise.
When connections (perhaps within a package) start to fail, the flicker
level rises. The actual frequency monitored isn't all that critical.
Joe Gwinn
Do connections "start to fail" ?
Yes, they do, in things like vias. I went through a big drama where a
critical bit of radar logic circuitry would slowly go nuts.
It turned out that the copper plating on the walls of the vias was
suffering from low-cycle fatigue during temperature cycling and slowly
breaking, one little crack at a time, until it went open. If you
measured the resistance to parts per million (6.5 digit DMM), sampling
at 1 Hz, you could see the 1/f noise at 0.1 Hz rising. It's useful to
also measure a copper line, and divide the via-chain resistance by the
no-via resistance, to correct for temperature changes.
But nobody is going to monitor every via on a PCB, even if it were
possible.
One could instrument a PCB fab test board, I guess. But DC tests would
be fine.
We have one board with over 4000 vias, but they are mostly in
parallel.
The solution was to redesign the vias, mainly to increase the critical
volume of copper. And modern SMD designs have less and less copper
volume.
I bet precision resistors can also be measured this way.
I don't think I've ever owned a piece of electronic equipment that
warned me of an impending failure.
Onset of smoke emission is a common sign.
Cars do, for some failure modes, like low oil level.
The industrial method for big stuff is accelerometers attached near
the bearings, and listen for excessive rotation-correlated (not
necessarily harmonic) noise.
Big ships that I've worked on have a long propeller shaft in the shaft
alley, a long tunnel where nobody often goes. They have magnetic shaft
runout sensors and shaft bearing temperature monitors.
They measure shaft torque and SHP too, from the shaft twist.
I liked hiding out in the shaft alley. It was private and cool, that
giant shaft slowly rotating.
On 4/15/2024 8:33 PM, Edward Rawde wrote:
[Shouldn't that be Edwar D rawdE?]
A smoke detector that beeps once a day risks not being heard
"Don Y" <blockedofcourse@foo.invalid> wrote in message >news:uvl2gr$phap$2@dont-email.me...
On 4/15/2024 8:33 PM, Edward Rawde wrote:
[Shouldn't that be Edwar D rawdE?]
I don't mind how you pronounce it.
...
A smoke detector that beeps once a day risks not being heard
Reminds me of a tenant who just removed the battery to stop the annoying beeping.
Don Y wrote:
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
I suspect most would involve *relative* changes that would be
suggestive of changing conditions in the components (and not
directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and
notice the sorts of changes you "typically" encounter in the hope
that something of greater magnitude would be a harbinger...
A singular speculative spitball - the capacitive marker:
In-situ Prognostic Method of Power MOSFET Based on Miller Effect
... This paper presents a new in-situ prognosis method for
MOSFET based on miller effect. According to the theory
analysis, simulation and experiment results, the miller
platform voltage is identified as a new degradation
precursor ...
(10.1109/PHM.2017.8079139)
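A speculative Python sketch of how that precursor might be trended: pull a
plateau estimate out of a digitized gate-voltage turn-on edge and log it
over the device's life. The waveform here is synthetic; nothing below comes
from the paper itself.

import numpy as np

def miller_plateau(vgs, dt):
    # Mean gate voltage over the flattest part of the turn-on edge
    # (where |dVgs/dt| is smallest) -- a crude plateau estimate.
    dv = np.abs(np.gradient(vgs, dt))
    rising = (vgs > 0.2 * vgs.max()) & (vgs < 0.9 * vgs.max())
    flat = rising & (dv < 0.1 * dv[rising].max())
    return float(vgs[flat].mean()) if flat.any() else float("nan")

# synthetic 0 -> 10 V gate edge with a plateau near 4.2 V
dt = 1e-9
t = np.arange(0, 400e-9, dt)
vgs = np.clip(t / 50e-9 * 4.2, 0, 4.2)
vgs[t > 200e-9] = np.clip(4.2 + (t[t > 200e-9] - 200e-9) / 50e-9 * 5.8, 4.2, 10)
print(f"estimated plateau ~ {miller_plateau(vgs, dt):.2f} V")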
Danke,
--
Don, KB7RPU, https://www.qsl.net/kb7rpu
There was a young lady named Bright
Whose speed was far faster than light;
She set out one day
In a relative way
And returned on the previous night.
On Mon, 15 Apr 2024 16:26:35 -0700, john larkin <jl@650pot.com> wrote:
On Mon, 15 Apr 2024 18:03:23 -0400, Joe Gwinn <joegwinn@comcast.net>
wrote:
On Mon, 15 Apr 2024 13:05:40 -0700, john larkin <jl@650pot.com> wrote:
On Mon, 15 Apr 2024 15:41:57 -0400, Joe Gwinn <joegwinn@comcast.net> wrote:
On Mon, 15 Apr 2024 10:13:02 -0700, Don Y <blockedofcourse@foo.invalid> wrote:
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
I suspect most would involve *relative* changes that would be
suggestive of changing conditions in the components (and not
directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and
notice the sorts of changes you "typically" encounter in the hope
that something of greater magnitude would be a harbinger...
There is a standard approach that may work: Measure the level and
trend of very low frequency (around a tenth of a Hertz) flicker noise.
When connections (perhaps within a package) start to fail, the flicker
level rises. The actual frequency monitored isn't all that critical.
Joe Gwinn
Do connections "start to fail" ?
Yes, they do, in things like vias. I went through a big drama where a
critical bit of radar logic circuitry would slowly go nuts.
It turned out that the copper plating on the walls of the vias was
suffering from low-cycle fatigue during temperature cycling and slowly
breaking, one little crack at a time, until it went open. If you
measured the resistance to parts per million (6.5 digit DMM), sampling
at 1 Hz, you could see the 1/f noise at 0.1 Hz rising. It's useful to
also measure a copper line, and divide the via-chain resistance by the
no-via resistance, to correct for temperature changes.
But nobody is going to monitor every via on a PCB, even if it were
possible.
It was not possible to test the vias on the failing logic board, but
we knew from metallurgical cut, polish, and inspect studies of failed
boards that it was the vias that were failing.
One could instrument a PCB fab test board, I guess. But DC tests would
be fine.
What was being tested was a fab test board that had both the series
via chain path and the no-via path of roughly the same DC resistance,
set up so we could do 4-wire Kelvin resistance measurements of each
path independent of the other path.
We have one board with over 4000 vias, but they are mostly in
parallel.
This can also be tested, but using a 6.5-digit DMM intended for
measuring very low resistance values. A change of one part in 4,000
is huge to a 6.5-digit instrument. The conductivity will decline
linearly as vias fail one by one.
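A back-of-envelope sketch of that arithmetic (values are illustrative): if
N nominally identical vias start at a measured R0 = r/N, then k open vias
raise the reading to r/(N-k), so k = N*(1 - R0/Rnow).

def failed_vias(n_vias, r0_ohms, r_now_ohms):
    k = n_vias * (1.0 - r0_ohms / r_now_ohms)
    return max(0, round(k))

# 4000 vias, 2.0 mohm initially; a 2.0015 mohm reading implies ~3 open vias
print(failed_vias(4000, 2.000e-3, 2.0015e-3))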
The solution was to redesign the vias, mainly to increase the critical
volume of copper. And modern SMD designs have less and less copper
volume.
I bet precision resistors can also be measured this way.
I don't think I've ever owned a piece of electronic equipment that
warned me of an impending failure.
Onset of smoke emission is a common sign.
Cars do, for some failure modes, like low oil level.
The industrial method for big stuff is accelerometers attached near
the bearings, and listen for excessive rotation-correlated (not
necessarily harmonic) noise.
Big ships that I've worked on have a long propeller shaft in the shaft
alley, a long tunnel where nobody often goes. They have magnetic shaft
runout sensors and shaft bearing temperature monitors.
They measure shaft torque and SHP too, from the shaft twist.
Yep. And these kinds of things fail slowly. At first.
I liked hiding out in the shaft alley. It was private and cool, that
giant shaft slowly rotating.
Probably had a calming flowing water sound as well.
Joe Gwinn
On 4/15/2024 8:33 PM, Edward Rawde wrote:
[Shouldn't that be Edwar D rawdE?]
Again, the goal is to be an EARLY warning, not an "Oh, Shit! Kill the power!!"
As such, software is invaluable as designing PREDICTIVE hardware is
harder than designing predictive software (algorithms).
You don't want to tell the user "The battery in your smoke detector
is NOW dead (leaving you vulnerable)" but, rather, "The battery in
your smoke detector WILL cease to be able to provide the power necessary
for the smoke detector to provide the level of protection that you
desire."
And, the WAY that you inform the user has to be "productive/useful".
A smoke detector beeping every minute is likely to find itself unplugged, leading to exactly the situation that the alert was trying to avoid!
I'm not looking for speculation. I'm looking for folks who have DONE
such things (designing to speculation is more expensive than just letting
the devices fail when they need to fail!)
On Tue, 16 Apr 2024 11:10:40 -0400, "Edward Rawde"
<invalid@invalid.invalid> wrote:
"Don Y" <blockedofcourse@foo.invalid> wrote in message >>news:uvl2gr$phap$2@dont-email.me...
On 4/15/2024 8:33 PM, Edward Rawde wrote:
[Shouldn't that be Edwar D rawdE?]
I don't mind how you pronounce it.
...
A smoke detector that beeps once a day risks not being heard
Reminds me of a tenant who just removed the battery to stop the annoying beeping.
My experience has been the smoke detectors too close (as the smoke
travels) to the kitchen tend to suffer mysterious disablement.
Relocation of the smoke detector usually solves the problem.
Joe Gwinn
On Tue, 16 Apr 2024 09:45:34 +0100, Martin Brown <'''newspam'''@nonad.co.uk> wrote:
On 15/04/2024 18:13, Don Y wrote:
Sometimes BIST can help ensure that small failures won't become
board-burning failures, but an RMA will happen anyhow.
I just added a soft-start feature to a couple of boards. Apply a current-limited 48 volts to the power stages before the real thing is switched on hard.
Again, the goal is to be an EARLY warning, not an "Oh, Shit! Kill the
power!!"
As such, software is invaluable as designing PREDICTIVE hardware is
harder than designing predictive software (algorithms).
Two comparators can make a window detector which will tell you whether some parameter is in a specified range.
And it doesn't need monthly updates because software is never finished.
You don't want to tell the user "The battery in your smoke detector
is NOW dead (leaving you vulnerable)" but, rather, "The battery in
your smoke detector WILL cease to be able to provide the power necessary
for the smoke detector to provide the level of protection that you
desire."
And, the WAY that you inform the user has to be "productive/useful".
A smoke detector beeping every minute is likely to find itself unplugged,
leading to exactly the situation that the alert was trying to avoid!
Reminds me of a tenant who just removed the battery to stop the annoying beeping.
Better to inform the individual who can get the replacement done when the tenant isn't even home.
I'm not looking for speculation. I'm looking for folks who have DONE
such things (designing to speculation is more expensive than just letting
the devices fail when they need to fail!)
Well I don't recall putting anything much into a design which could predict remaining life.
The only exceptions, also drawing from other replies in this thread, might be temperature sensing, voltage sensing, current sensing, air flow sensing, noise sensing, iron-in-oil sensing, and any other kind of sensing which might provide information on parameters outside, or getting close to outside, the expected range.
Give that to some software which also knows how long the equipment has been in use, how often
it has been used, what the temperature and humidity was, how long it's been since the oil was changed,
and you might be able to give the operator useful information about when to schedule specific maintenance.
But don't give the software too much control. I don't want to be told that I can't use the equipment because an oil change was required 5 minutes ago and it hasn't been done yet.
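A sketch of that "advise, don't lock out" idea, assuming you log run hours and a few sensed extremes; every threshold below is a made-up placeholder and nothing in it can refuse to start the machine:

from dataclasses import dataclass

@dataclass
class UsageLog:
    run_hours: float
    hours_since_oil_change: float
    max_temp_seen_c: float

def maintenance_advice(log: UsageLog):
    """Return human-readable advisories only -- never a lockout."""
    notes = []
    if log.hours_since_oil_change > 450:    # nominal 500 h interval (placeholder)
        notes.append("Oil change due within ~50 run hours.")
    if log.max_temp_seen_c > 85:            # placeholder derating limit
        notes.append("Equipment has run hot; check cooling at next service.")
    if log.run_hours > 9000:                # placeholder bearing life
        notes.append("Approaching nominal bearing life; schedule inspection.")
    return notes or ["No maintenance currently indicated."]

# The operator sees the advice; the decision (and the ability to keep
# running) stays with the operator.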
Don wrote:
Don Y wrote:
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
I suspect most would involve *relative* changes that would be
suggestive of changing conditions in the components (and not
directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and
notice the sorts of changes you "typically" encounter in the hope
that something of greater magnitude would be a harbinger...
A singular speculative spitball - the capacitive marker:
In-situ Prognostic Method of Power MOSFET Based on Miller Effect
... This paper presents a new in-situ prognosis method for
MOSFET based on miller effect. According to the theory
analysis, simulation and experiment results, the miller
platform voltage is identified as a new degradation
precursor ...
(10.1109/PHM.2017.8079139)
Very interesting but are there any products out there which make use of this or other prognostic methods to provide information on remaining useful life?
On 4/16/2024 9:02 AM, Edward Rawde wrote:...
Again, the goal is to be an EARLY warning, not an "Oh, Shit! Kill the
power!!"
Add CO and heat detectors to the mix and they get *really* confused!
Better to inform the individual who can get the replacement done when the
tenant isn't even home.
So, a WiFi/BT link to <whatever>? Now the simple smoke detector starts suffering from feeping creaturism. "Download the app..."
The only exceptions, also drawing from other replies in this thread, might be temperature sensing, voltage sensing, current sensing, air flow sensing, noise sensing, iron-in-oil sensing, and any other kind of sensing which might provide information on parameters outside, or getting close to outside, the expected range.
Give that to some software which also knows how long the equipment has been in use, how often it has been used, what the temperature and humidity was, how long it's been since the oil was changed, and you might be able to give the operator useful information about when to schedule specific maintenance.
I have all of those things -- with the exception of knowing which sensory data
is most pertinent to failure prediction.
OTOH, if there is no oil in the gearbox, the equipment isn't going to
start; if the oil sensor is defective, then *it* needs to be replaced.
...
On 17/04/2024 1:22 am, John Larkin wrote:
On Tue, 16 Apr 2024 09:45:34 +0100, Martin Brown
<'''newspam'''@nonad.co.uk> wrote:
On 15/04/2024 18:13, Don Y wrote:
<snip>
Sometimes BIST can help ensure that small failures won't become
board-burning failures, but an RMA will happen anyhow.
Built-in self test is mostly auto-calibration. You can use temperature sensitive components for precise measurements if you calibrate out the temperature shift and re-calibrate if the measured temperature shifts appreciably (or every few minutes).
It might also take out the effects of dopant drift in a hot device, but it wouldn't take it out forever.
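In firmware that usually boils down to something like the following sketch: remember the temperature at the last calibration and re-run the cal whenever the measured temperature has moved by more than some band, or the cal is simply old. The 2 deg C band and 5-minute age below are placeholders.

import time

class AutoCal:
    def __init__(self, read_temp_c, run_calibration,
                 temp_band_c=2.0, max_age_s=300.0):
        self.read_temp_c = read_temp_c          # callable returning board temp
        self.run_calibration = run_calibration  # callable doing the real cal
        self.temp_band_c = temp_band_c
        self.max_age_s = max_age_s
        self.cal_temp = None
        self.cal_time = 0.0

    def maybe_recalibrate(self):
        t = self.read_temp_c()
        stale = (time.monotonic() - self.cal_time) > self.max_age_s
        drifted = self.cal_temp is None or abs(t - self.cal_temp) > self.temp_band_c
        if stale or drifted:
            self.run_calibration()
            self.cal_temp = t
            self.cal_time = time.monotonic()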
I just added a soft-start feature to a couple of boards. Apply a
current-limited 48 volts to the power stages before the real thing is
switched on hard.
Soft-start has been around forever. If you don't pay attention to what happens to your circuit at start-up and turn-off you can have some real disasters.
At Cambridge Instruments I once replaced all the tail resistors in a bunch of class-B long-tailed-pair-based scan amplifiers with constant current diodes. With the resistor tails, the scan amps drew a lot of current when the 24V rail was being ramped up and that threw the 24V supply into current limit, so it didn't ramp up. The constant current diodes stopped this (not that I can remember how).
This was a follow-up after I'd been brought in to stop the 24V power supply from blowing up (because it hadn't had a properly designed current limit). The problem had shown up in production - where it was known as the three-bag problem because when things did go wrong the excursions on the 24V rail destroyed three bags of components.
--
Bill Sloman, Sydney
On Tue, 16 Apr 2024 10:19:00 -0400, Joe Gwinn <joegwinn@comcast.net>
wrote:
On Mon, 15 Apr 2024 16:26:35 -0700, john larkin <jl@650pot.com> wrote:
On Mon, 15 Apr 2024 18:03:23 -0400, Joe Gwinn <joegwinn@comcast.net> >>>wrote:
On Mon, 15 Apr 2024 13:05:40 -0700, john larkin <jl@650pot.com> wrote:
On Mon, 15 Apr 2024 15:41:57 -0400, Joe Gwinn <joegwinn@comcast.net> >>>>>wrote:
On Mon, 15 Apr 2024 10:13:02 -0700, Don Y >>>>>><blockedofcourse@foo.invalid> wrote:
Is there a general rule of thumb for signalling the likelihood of >>>>>>>an "imminent" (for some value of "imminent") hardware failure?
I suspect most would involve *relative* changes that would be >>>>>>>suggestive of changing conditions in the components (and not >>>>>>>directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and >>>>>>>notice the sorts of changes you "typically" encounter in the hope >>>>>>>that something of greater magnitude would be a harbinger...
There is a standard approach that may work: Measure the level and >>>>>>trend of very low frequency (around a tenth of a Hertz) flicker noise. >>>>>>When connections (perhaps within a package) start to fail, the flicker >>>>>>level rises. The actual frequency monitored isn't all that critical. >>>>>>
Joe Gwinn
Do connections "start to fail" ?
Yes, they do, in things like vias. I went through a big drama where a critical bit of radar logic circuitry would slowly go nuts.
It turned out that the copper plating on the walls of the vias was suffering from low-cycle fatigue during temperature cycling and slowly breaking, one little crack at a time, until it went open. If you measured the resistance to parts per million (6.5-digit DMM), sampling at 1 Hz, you could see the 1/f noise at 0.1 Hz rising. It's useful to also measure a copper line, and divide the via-chain resistance by the no-via resistance, to correct for temperature changes.
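For anyone wanting to reproduce the analysis half of that, a rough sketch (assuming the two 4-wire resistance channels are already being logged at 1 Hz): form the via-chain/no-via ratio to cancel temperature, then integrate the power spectral density around 0.1 Hz and trend it.

import numpy as np
from scipy.signal import welch

def flicker_level(r_via_chain, r_no_via, fs=1.0, f_lo=0.05, f_hi=0.2):
    """PSD of the temperature-corrected resistance ratio, integrated around
    0.1 Hz. Trend this number; a sustained rise suggests via cracking."""
    ratio = np.asarray(r_via_chain) / np.asarray(r_no_via)
    ratio = ratio / ratio.mean() - 1.0              # fractional fluctuation
    freqs, psd = welch(ratio, fs=fs, nperseg=min(len(ratio), 1024))
    band = (freqs >= f_lo) & (freqs <= f_hi)
    return np.trapz(psd[band], freqs[band])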
But nobody is going to monitor every via on a PCB, even if it were possible.
It was not possible to test the vias on the failing logic board, but we knew from metallurgical cut, polish, and inspect studies of failed boards that it was the vias that were failing.
One could instrument a PCB fab test board, I guess. But DC tests would
be fine.
What was being tested was a fab test board that had both the series
via chain path and the no-via path of roughly the same DC resistance,
set up so we could do 4-wire Kelvin resistance measurements of each
path independent of the other path.
Yes, but the question was whether one could predict the failure of an >operating electronic gadget. The answer is mostly NO.
We had a visit from the quality team from a giant company that you have heard of. They wanted us to trend-analyze all the power supplies on our boards and apply a complex algorithm to predict failures. It was total nonsense, basically predicting the future by zooming in on random noise with a big 1/f component, just like climate prediction.
We have one board with over 4000 vias, but they are mostly in
parallel.
This can also be tested, but using a 6.5-digit DMM intended for
measuring very low resistance values. A change of one part in 4,000
is huge to a 6.5-digit instrument. The conductivity will decline
linearly as vias fail one by one.
Millikelvin temperature changes would make more signal than a failing
via.
The solution was to redesign the vias, mainly to increase the critical volume of copper. And modern SMD designs have less and less copper volume.
I bet precision resistors can also be measured this way.
I don't think I've ever owned a piece of electronic equipment that warned me of an impending failure.
Onset of smoke emission is a common sign.
Cars do, for some failure modes, like low oil level.
The industrial method for big stuff is accelerometers attached near the bearings, listening for excessive rotation-correlated (not necessarily harmonic) noise.
Big ships that I've worked on have a long propeller shaft in the shaft alley, a long tunnel where nobody often goes. They have magnetic shaft runout sensors and shaft bearing temperature monitors.
They measure shaft torque and SHP too, from the shaft twist.
Yep. And these kinds of things fail slowly. At first.
They could repair a bearing at sea, given a heads-up about violent
failure. A serious bearing failure on a single-screw machine means
getting a seagoing tug.
The main engine gearbox had padlocks on the covers.
There was also a chem lab to analyze oil and water and such, looking for contaminants that might suggest something going on.
I liked hiding out in the shaft alley. It was private and cool, that giant shaft slowly rotating.
Probably had a calming flowing water sound as well.
Yes, cool and beautiful and serene after the heat and noise and
vibration of the engine room. A quiet 32,000 horsepower.
It was fun being an electronic guru on sea trials of a ship full of
big hairy Popeye types. I, skinny gawky kid, got my own stateroom when
other tech reps slept in cots in the hold.
Have you noticed how many lumberjack types are afraid of electricity?
That can be funny.
On 4/16/2024 10:25 AM, Edward Rawde wrote:
Better to inform the individual who can get the replacement done when
the
tenant isn't even home.
So, a WiFi/BT link to <whatever>? Now the simple smoke detector starts
suffering from feeping creaturism. "Download the app..."
No thanks. I have the same view of cameras.
They won't be connecting outbound to a server anywhere in the world.
But the average user does not know that and just wants the pictures on
their
phone.
There is no need for a manufacturer to interpose themselves in such
"remote access". Having the device register with a DDNS service
cuts out the need for the manufacturer to essentially provide THAT
service.
...
Better to inform the individual who can get the replacement done when the >>> tenant isn't even home.
So, a WiFi/BT link to <whatever>? Now the simple smoke detector starts
suffering from feeping creaturism. "Download the app..."
No thanks. I have the same view of cameras.
They won't be connecting outbound to a server anywhere in the world.
But the average user does not know that and just wants the pictures on their phone.
Give that to some software which also knows how long the equipment has
been
in use, how often
it has been used, what the temperature and humidity was, how long it's
been
since the oil was changed,
and you might be able to give the operator useful information about when >>> to
schedule specific maintenance.
I have all of those things -- with the exception of knowing which sensory
data
is most pertinent to failure prediction.
That's one reason why you want feedback from people who use your equipment.
OTOH, if there is no oil in the gearbox, the equipment isn't going to
start; if the oil sensor is defective, then *it* needs to be replaced.
Preferably by me purchasing a new sensor and being able to replace it
myself.
"Don Y" <blockedofcourse@foo.invalid> wrote in message news:uvmjmt$140d2$1@dont-email.me...
On 4/16/2024 10:25 AM, Edward Rawde wrote:
Better to inform the individual who can get the replacement done when >>>>> the
tenant isn't even home.
So, a WiFi/BT link to <whatever>? Now the simple smoke detector starts >>>> suffering from feeping creaturism. "Download the app..."
No thanks. I have the same view of cameras.
They won't be connecting outbound to a server anywhere in the world.
But the average user does not know that and just wants the pictures on
their
phone.
There is no need for a manufacturer to interpose themselves in such
"remote access". Having the device register with a DDNS service
cuts out the need for the manufacturer to essentially provide THAT
service.
Not for most users here.
They tried to put me on lsn/cgnat not long ago.
After complaining I was given a free static IPv4.
Most users wouldn't know DDNS from a banana, and will expect it to work out of the box after installing the app on their phone.
Don Y wrote:
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
I suspect most would involve *relative* changes that would be
suggestive of changing conditions in the components (and not
directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and
notice the sorts of changes you "typically" encounter in the hope
that something of greater magnitude would be a harbinger...
A singular speculative spitball - the capacitive marker:
In-situ Prognostic Method of Power MOSFET Based on Miller Effect
... This paper presents a new in-situ prognosis method for
MOSFET based on miller effect. According to the theory
analysis, simulation and experiment results, the miller
platform voltage is identified as a new degradation
precursor ...
(10.1109/PHM.2017.8079139)
Danke,
"Don" <g@crcomp.net> wrote in message news:20240416a@crcomp.net...
Don Y wrote:
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
I suspect most would involve *relative* changes that would be
suggestive of changing conditions in the components (and not
directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and
notice the sorts of changes you "typically" encounter in the hope
that something of greater magnitude would be a harbinger...
A singular speculative spitball - the capacitive marker:
In-situ Prognostic Method of Power MOSFET Based on Miller Effect
... This paper presents a new in-situ prognosis method for
MOSFET based on miller effect. According to the theory
analysis, simulation and experiment results, the miller
platform voltage is identified as a new degradation
precursor ...
(10.1109/PHM.2017.8079139)
Very interesting but are there any products out there which make use of this or other prognostic methods to provide information on remaining useful life?
On 4/16/2024 12:43 PM, Edward Rawde wrote:
But vendors know that most people want it easy so the push towards
"Don Y" <blockedofcourse@foo.invalid> wrote in message
news:uvmjmt$140d2$1@dont-email.me...
On 4/16/2024 10:25 AM, Edward Rawde wrote:
Better to inform the individual who can get the replacement done when >>>>>> the
tenant isn't even home.
So, a WiFi/BT link to <whatever>? Now the simple smoke detector
starts
suffering from feeping creaturism. "Download the app..."
Don Y wrote:
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
I suspect most would involve *relative* changes that would be
suggestive of changing conditions in the components (and not
directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and
notice the sorts of changes you "typically" encounter in the hope
that something of greater magnitude would be a harbinger...
A singular speculative spitball - the capacitive marker:
In-situ Prognostic Method of Power MOSFET Based on Miller Effect
... This paper presents a new in-situ prognosis method for
MOSFET based on miller effect. According to the theory
analysis, simulation and experiment results, the miller
platform voltage is identified as a new degradation
precursor ...
The main engine gearbox had padlocks on the covers.
I liked hiding out in the shaft alley. It was private and cool, that >>>giant shaft slowly rotating.
Probably had a calming flowing water sound as well.
Yes, cool and beautiful and serene after the heat and noise and
vibration of the engine room. A quiet 32,000 horsepower.
Don wrote:
Don Y wrote:
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
I suspect most would involve *relative* changes that would be
suggestive of changing conditions in the components (and not
directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and
notice the sorts of changes you "typically" encounter in the hope
that something of greater magnitude would be a harbinger...
A singular speculative spitball - the capacitive marker:
In-situ Prognostic Method of Power MOSFET Based on Miller Effect
... This paper presents a new in-situ prognosis method for
MOSFET based on miller effect. According to the theory
analysis, simulation and experiment results, the miller
platform voltage is identified as a new degradation
precursor ...
(10.1109/PHM.2017.8079139)
Sounds like they are really measuring gate threshold, or gate transfer
curve, drift with time. That happens and is usually no big deal, in moderation. Ions and charges drift around. We don't build opamp
front-ends from power mosfets.
This doesn't sound very useful for "in-situ" diagnostics.
GaN fets can have a lot of gate threshold and leakage change over time
too. Drive them hard and it doesn't matter.
On 17/04/2024 1:22 am, John Larkin wrote:
On Tue, 16 Apr 2024 09:45:34 +0100, Martin Brown
<'''newspam'''@nonad.co.uk> wrote:
On 15/04/2024 18:13, Don Y wrote:
Yes I've seen that a lot.
The power rails in the production product came up in a different order to those in the development lab.
This caused all kinds of previously unseen behaviour including an expensive flash a/d chip burning up.
I'd have it in the test spec that any missing power rail does not cause issues.
And any power rail can be turned on and off any time.
The equipment may not work properly with a missing power rail but it should not be damaged.
john larkin wrote:
Don wrote:
Don Y wrote:
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
I suspect most would involve *relative* changes that would be
suggestive of changing conditions in the components (and not
directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and
notice the sorts of changes you "typically" encounter in the hope
that something of greater magnitude would be a harbinger...
A singular speculative spitball - the capacitive marker:
In-situ Prognostic Method of Power MOSFET Based on Miller Effect
... This paper presents a new in-situ prognosis method for
MOSFET based on miller effect. According to the theory
analysis, simulation and experiment results, the miller
platform voltage is identified as a new degradation
precursor ...
(10.1109/PHM.2017.8079139)
Sounds like they are really measuring gate threshold, or gate transfer
curve, drift with time. That happens and is usually no big deal, in
moderation. Ions and charges drift around. We don't build opamp
front-ends from power mosfets.
This doesn't sound very useful for "in-situ" diagnostics.
GaN fets can have a lot of gate threshold and leakage change over time
too. Drive them hard and it doesn't matter.
Threshold voltage measurement is indeed one of two parameters. The
second parameter is Miller platform voltage measurement.
The Miller plateau is directly related to the gate-drain
capacitance, Cgd. It's why "capacitive marker" appears in my
original followup.
Long story short, the Miller Plateau length provides a metric
principle to measure Tj without a sensor. Some may find this useful.
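For the curious, extracting that precursor from a scope capture is straightforward; a rough sketch (not the paper's exact method): digitize Vgs during a turn-on edge and find the flat region where dVgs/dt collapses. Its mean voltage and its duration are the two quantities to trend; the slope threshold and amplitude window are placeholders you would tune to your gate drive.

import numpy as np

def miller_plateau(t, vgs, slope_frac=0.1):
    """Locate the Miller plateau in a sampled turn-on gate waveform.
    Returns (plateau_voltage, plateau_length_s), or (None, None) if not found."""
    t = np.asarray(t, dtype=float)
    vgs = np.asarray(vgs, dtype=float)
    dvdt = np.gradient(vgs, t)
    threshold = slope_frac * np.abs(dvdt).max()     # "flat" = well below peak slope
    flat = (np.abs(dvdt) < threshold) & (vgs > 0.3 * vgs.max()) & (vgs < 0.9 * vgs.max())
    idx = np.flatnonzero(flat)
    if len(idx) < 3:
        return None, None
    return float(vgs[idx].mean()), float(t[idx[-1]] - t[idx[0]])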
Danke,
On Tue, 16 Apr 2024 13:20:34 -0400, Joe Gwinn <joegwinn@comcast.net>....
wrote:
On Tue, 16 Apr 2024 08:16:04 -0700, John Larkin >><jjSNIPlarkin@highNONOlandtechnology.com> wrote:
On Tue, 16 Apr 2024 10:19:00 -0400, Joe Gwinn <joegwinn@comcast.net> >>>wrote:
On Mon, 15 Apr 2024 16:26:35 -0700, john larkin <jl@650pot.com> wrote:
On Mon, 15 Apr 2024 18:03:23 -0400, Joe Gwinn <joegwinn@comcast.net> >>>>>wrote:
On Mon, 15 Apr 2024 13:05:40 -0700, john larkin <jl@650pot.com> wrote: >>>>>>
On Mon, 15 Apr 2024 15:41:57 -0400, Joe Gwinn <joegwinn@comcast.net> >>>>>>>wrote:
On Mon, 15 Apr 2024 10:13:02 -0700, Don Y >>>>>>>><blockedofcourse@foo.invalid> wrote:
Is there a general rule of thumb for signalling the likelihood of >>>>>>>>>an "imminent" (for some value of "imminent") hardware failure? >>>>>>>>>
I liked hiding out in the shaft alley. It was private and cool, that >>>>>giant shaft slowly rotating.
Probably had a calming flowing water sound as well.
Yes, cool and beautiful and serene after the heat and noise and
vibration of the engine room. A quiet 32,000 horsepower.
It was fun being an electronic guru on sea trials of a ship full of
big hairy Popeye types. I, skinny gawky kid, got my own stateroom when >>>other tech reps slept in cots in the hold.
Have you noticed how many lumberjack types are afraid of electricity? >>>That can be funny.
Oh yes. And EEs frightened by a 9-v battery.
Joe Gwinn
I had an intern, an EE senior, who was afraid of 3.3 volts.
I told him to touch an FPGA to see how warm it was getting, and he
refused.
On Tue, 16 Apr 2024 08:16:04 -0700, John Larkin ><jjSNIPlarkin@highNONOlandtechnology.com> wrote:
On Tue, 16 Apr 2024 10:19:00 -0400, Joe Gwinn <joegwinn@comcast.net>
wrote:
On Mon, 15 Apr 2024 16:26:35 -0700, john larkin <jl@650pot.com> wrote:
On Mon, 15 Apr 2024 18:03:23 -0400, Joe Gwinn <joegwinn@comcast.net> >>>>wrote:
On Mon, 15 Apr 2024 13:05:40 -0700, john larkin <jl@650pot.com> wrote: >>>>>
On Mon, 15 Apr 2024 15:41:57 -0400, Joe Gwinn <joegwinn@comcast.net> >>>>>>wrote:
On Mon, 15 Apr 2024 10:13:02 -0700, Don Y >>>>>>><blockedofcourse@foo.invalid> wrote:
Is there a general rule of thumb for signalling the likelihood of >>>>>>>>an "imminent" (for some value of "imminent") hardware failure?
I suspect most would involve *relative* changes that would be >>>>>>>>suggestive of changing conditions in the components (and not >>>>>>>>directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and >>>>>>>>notice the sorts of changes you "typically" encounter in the hope >>>>>>>>that something of greater magnitude would be a harbinger...
There is a standard approach that may work: Measure the level and >>>>>>>trend of very low frequency (around a tenth of a Hertz) flicker noise. >>>>>>>When connections (perhaps within a package) start to fail, the flicker >>>>>>>level rises. The actual frequency monitored isn't all that critical. >>>>>>>
Joe Gwinn
Do connections "start to fail" ?
Yes, they do, in things like vias. I went through a big drama where a >>>>>critical bit of radar logic circuitry would slowly go nuts.
It turned out that the copper plating on the walls of the vias was >>>>>suffering from low-cycle fatigue during temperature cycling and slowly >>>>>breaking, one little crack at a time, until it went open. If you >>>>>measured the resistance to parts per million (6.5 digit DMM), sampling >>>>>at 1 Hz, you could see the 1/f noise at 0.1 Hz rising. It's useful to >>>>>also measure a copper line, and divide the via-chain resistance by the >>>>>no-via resistance, to correct for temperature changes.
But nobody is going to monitor every via on a PCB, even if it were >>>>possible.
It was not possible to test the vias on the failing logic board, but
we knew from metallurgical cut, polish, and inspect studies of failed >>>boards that it was the vias that were failing.
One could instrument a PCB fab test board, I guess. But DC tests would >>>>be fine.
What was being tested was a fab test board that had both the series
via chain path and the no-via path of roughly the same DC resistance,
set up so we could do 4-wire Kelvin resistance measurements of each
path independent of the other path.
Yes, but the question was whether one could predict the failure of an >>operating electronic gadget. The answer is mostly NO.
Agree.
We had a visit from the quality team from a giant company that you
have heard of. They wanted us to trend analyze all the power supplies
on our boards and apply a complex algorithm to predict failures. It
was total nonsense, basically predicting the future by zooming in on
random noise with a big 1/f component, just like climate prediction.
Hmm. My first instinct was that they were using MIL-HDBK-217 (?) or
the like, but that does not measure noise. Do you recall any more of
what they were doing? I might know what they were up to. The
military were big on prognostics for a while, and still talk of this,
but it never worked all that well in the field compared to what it was >supposed to improve on.
We have one board with over 4000 vias, but they are mostly in
parallel.
This can also be tested , but using a 6.5-digit DMM intended for >>>measuring very low resistance values. A change of one part in 4,000
is huge to a 6.5-digit instrument. The conductivity will decline >>>linearly as vias fail one by one.
Millikelvin temperature changes would make more signal than a failing
via.
Not at the currents in that logic card. Too much ambient thermal
noise.
The solution was to redesign the vias, mainly to increase the critical >>>>>volume of copper. And modern SMD designs have less and less copper >>>>>volume.
I bet precision resistors can also be measured this way.
I don't think I've ever owned a piece of electronic equipment that >>>>>>warned me of an impending failure.
Onset of smoke emission is a common sign.
Cars do, for some failure modes, like low oil level.
The industrial method for big stuff is accelerometers attached near >>>>>the bearings, and listen for excessive rotation-correlated (not >>>>>necessarily harmonic) noise.
Big ships that I've worked on have a long propeller shaft in the shaft >>>>alley, a long tunnel where nobody often goes. They have magnetic shaft >>>>runout sensors and shaft bearing temperature monitors.
They measure shaft torque and SHP too, from the shaft twist.
Yep. And these kinds of things fail slowly. At first.
They could repair a bearing at sea, given a heads-up about violent
failure. A serious bearing failure on a single-screw machine means
getting a seagoing tug.
The main engine gearbox had padlocks on the covers.
There was also a chem lab to analyze oil and water and such, looking
for contaminants that might suggest something going on.
I liked hiding out in the shaft alley. It was private and cool, that >>>>giant shaft slowly rotating.
Probably had a calming flowing water sound as well.
Yes, cool and beautiful and serene after the heat and noise and
vibration of the engine room. A quiet 32,000 horsepower.
It was fun being an electronic guru on sea trials of a ship full of
big hairy Popeye types. I, skinny gawky kid, got my own stateroom when >>other tech reps slept in cots in the hold.
Have you noticed how many lumberjack types are afraid of electricity?
That can be funny.
Oh yes. And EEs frightened by a 9-v battery.
Joe Gwinn
But vendors know that most people want it easy so the push towards subscription services and products which phone home isn't going to change.
Most people don't know or care what their products are sending to the
vendor.
I like to see what is connecting to what with https://www.pfsense.org/
But I might be the only person in 100 mile radius doing so.
I can also remote desktop from anywhere of my choice, with the rest of the world unable to connect.
Pretty much all of my online services are either restricted to specific IPs (cameras, remote desktop and similar).
Or they have one or more countries and other problem IPs blocked. (web sites and email services).
None of that is possible when the vendor is in control because users will want their camera pictures available anywhere.
On Tue, 16 Apr 2024 13:39:07 -0400, "Edward Rawde"
<invalid@invalid.invalid> wrote:
On 17/04/2024 1:22 am, John Larkin wrote:
On Tue, 16 Apr 2024 09:45:34 +0100, Martin Brown
<'''newspam'''@nonad.co.uk> wrote:
On 15/04/2024 18:13, Don Y wrote:
Yes I've seen that a lot.
The power rails in the production product came up in a different order to >>those in the development lab.
This caused all kinds of previously unseen behaviour including an
expensive
flash a/d chip burning up.
I'd have it in the test spec that any missing power rail does not cause >>issues.
And any power rail can be turned on and off any time.
The equipment may not work properly with a missing power rail but it
should
not be damaged.
Some FPGAs require supply sequencing, as many as four.
LM3880 is a dedicated powerup sequencer, most cool.
https://www.dropbox.com/scl/fi/gwrimefrgm729k8enqrir/28S662D_sh_19.pdf?rlkey=qvyip7rjqfy6i9yegqrt57n23&dl=0
On 4/16/2024 3:19 PM, Edward Rawde wrote:
But vendors know that most people want it easy so the push towards
subscription services and products which phone home isn't going to
change.
Until it does. There is nothing inherent in the design of any
of these products that requires another "service" to provide the
ADVERTISED functionality.
Our stove and refrigerator have WiFi "apps" -- that rely on a tie-in
to the manufacturer's site (no charge but still, why do they need
access to my stovetop?).
Simple solution: router has no radio! Even if the appliances wanted
to connect (ignoring their "disable WiFi access" setting), there's
nothing they can connect *to*.
Most people don't know or care what their products are sending to the
vendor.
I think that is a generational issue. My neighbor just bought a camera and, when she realized it had to make an OUTBOUND connection through her WiFi, she just opted to return it. So, a lost sale AND the cost of a return.
Young people seem to find nothing odd about RENTING -- anything!
Wanna listen to some music? You can RENT it, one song at a time!
Wanna access the internet using free WiFi *inside* a business?
The idea that they are leaking information never crosses their
mind. They *deserve* to discover that some actuary has noted
a correlation between people who shop at XYZCo and alcoholism.
Or, inability to pay their debts. Or, cannabis use. Or...
whatever the Big Data tells *them*.
Like the driver who complained that his CAR was revealing his
driving behavior through OnStar to credit agencies and their
subscribers (insurance companies) were using that to determine
the risk he represented.
I like to see what is connecting to what with https://www.pfsense.org/
But I might be the only person in 100 mile radius doing so.
I can also remote desktop from anywhere of my choice, with the rest of
the
world unable to connect.
Pretty much all of my online services are either restricted to specific IPs (cameras, remote desktop and similar).
Or they have one or more countries and other problem IPs blocked (web sites and email services).
But IP and MAC masquerading are trivial exercises. And, don't require
a human participant to interact with the target (i.e., they can be automated).
I have voice access to the services in my home. I don't rely on the
CID information provided as it can be forged. But, I *do* require
the *voice* match one of a few known voiceprints -- along with other conditions for access (e.g., if I am known to be HOME, then anyone
calling with my voice is obviously an imposter; likewise, if
someone "authorized" calls and passes the authentication procedure,
they are limited in what they can do -- like, maybe close my garage
door if I happened to leave it open and it is now after midnight).
And, recording a phrase (uttered by that person) only works if you
know what I am going to ASK you; anything that relies on your own
personal knowledge can't be emulated, even by an AI!
No need for apps or appliances -- you could technically use a "payphone"
(if such things still existed) or an office phone in some business.
I have a "cordless phone" in the car that lets me talk to the house from
a range of 1/2 mile, without relying on cell phone service. I can't
send video over the link -- but, I can ask "Did I remember to close
the garage door?" Or, "Did I forget to turn off the tea kettle?"
as I drive away.
None of that is possible when the vendor is in control because users will
want their camera pictures available anywhere.
No, you just have to rely on other mechanisms for authentication.
I have a friend who manages a datafarm at a large multinational bank.
When he is here, he uses my internet connection -- which is "foreign"
as far as the financial institution is concerned -- with no problems.
But, he carries a time-varying "token" with him that ensures he
has the correct credentials for any ~2 minute slice of time!
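Those tokens are usually nothing more than a shared secret hashed with the current time slice. A minimal TOTP-style sketch (standard HMAC-SHA1 construction; the 120-second step is assumed to match the "~2 minute slice" above):

import hmac, hashlib, struct, time

def totp(secret: bytes, step_s: int = 120, digits: int = 6, now=None) -> str:
    """Time-based one-time code: HMAC-SHA1 of the current time-step counter,
    dynamically truncated to a short decimal code (RFC 6238 style)."""
    counter = int((now if now is not None else time.time()) // step_s)
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

# Both ends share 'secret'; the server accepts the current (and usually the
# adjacent) time step, so a little clock skew doesn't lock anyone out.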
I rely on biometrics, backed with "shared secrets" ("Hi Jane!
How's Tom doing?" "Hmmm, I don't know anyone by the name of Tom")
because I don't want to have to carry a physical key (and
don't want the other folks with access to have to do so, either)
And, most folks don't really need remote access to the things
that are offering that access. Why do I need to check the state
of my oven/stove WHEN I AM NOT AT HOME? (Why the hell would
I leave it ON when the house is empty???) There are refrigerators
that take a photo of the contents of the fridge each time you close
the door. Do I care if the photo on my phone is of the state of the refrigerator when I was last IN PROXIMITY OF IT vs. its most recent
state? Do I need to access my thermostat "online" vs. via SMS?
Or voice?
Simple solution: router has no radio! Even if the appliances wanted
to connect (ignoring their "disable WiFi access" setting), there's
nothing they can connect *to*.
I'd have trouble here with no wifi access.
I can restrict outbound with a firewall as necessary.
But IP and MAC masquerading are trivial exercises. And, don't require
a human participant to interact with the target (i.e., they can be
automated).
That's why most tor exit nodes and home user vpn services are blocked.
I don't allow unauthenticated access to anything (except web sites).
I prefer to keep authentication simple and drop packets from countries and places who have no business connecting.
Granted a multinational bank may need a different approach since their customers could be anywhere.
If I were a multinational bank I'd be employing people to watch where the packets come from and decide which ones the firewall should drop.
Don wrote:
john larkin wrote:
Don wrote:
Don Y wrote:
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
I suspect most would involve *relative* changes that would be
suggestive of changing conditions in the components (and not
directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and
notice the sorts of changes you "typically" encounter in the hope
that something of greater magnitude would be a harbinger...
A singular speculative spitball - the capacitive marker:
In-situ Prognostic Method of Power MOSFET Based on Miller Effect
... This paper presents a new in-situ prognosis method for
MOSFET based on miller effect. According to the theory
analysis, simulation and experiment results, the miller
platform voltage is identified as a new degradation
precursor ...
(10.1109/PHM.2017.8079139)
Sounds like they are really measuring gate threshold, or gate transfer
curve, drift with time. That happens and is usually no big deal, in
moderation. Ions and charges drift around. We don't build opamp
front-ends from power mosfets.
This doesn't sound very useful for "in-situ" diagnostics.
GaN fets can have a lot of gate threshold and leakage change over time
too. Drive them hard and it doesn't matter.
Threshold voltage measurement is indeed one of two parameters. The
second parameter is Miller platform voltage measurement.
The Miller plateau is directly related to the gate-drain
capacitance, Cgd. It's why "capacitive marker" appears in my
original followup.
Long story short, the Miller Plateau length provides a metric
principle to measure Tj without a sensor. Some may find this useful.
When we want to measure actual junction temperature of a mosfet, we
use the substrate diode. Or get lazy and thermal image the top of the package.
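For reference, the substrate-diode trick in a few lines, assuming you've calibrated the body-diode forward drop at a known temperature and a small fixed sense current; the -2 mV/deg C tempco is a typical silicon figure, not a datasheet value for any particular part:

def junction_temp_c(vf_now, vf_cal, t_cal_c=25.0, tempco_v_per_c=-2.0e-3):
    """Estimate Tj from the body-diode forward voltage at a fixed sense current.
    Vf falls roughly linearly with temperature, so:
        Tj = Tcal + (Vf_now - Vf_cal) / tempco
    """
    return t_cal_c + (vf_now - vf_cal) / tempco_v_per_c

# Example: calibrated 0.620 V at 25 C; measuring 0.540 V now
# -> junction_temp_c(0.540, 0.620) ~= 65 C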
"John Larkin" <jjSNIPlarkin@highNONOlandtechnology.com> wrote in message >news:p47u1j1tg35ctb3tcta5qevsfnhgnpcrsg@4ax.com...
On Tue, 16 Apr 2024 13:39:07 -0400, "Edward Rawde"
<invalid@invalid.invalid> wrote:
On 17/04/2024 1:22 am, John Larkin wrote:
On Tue, 16 Apr 2024 09:45:34 +0100, Martin Brown
<'''newspam'''@nonad.co.uk> wrote:
On 15/04/2024 18:13, Don Y wrote:
Yes I've seen that a lot.
The power rails in the production product came up in a different order to >>>those in the development lab.
This caused all kinds of previously unseen behaviour including an >>>expensive
flash a/d chip burning up.
I'd have it in the test spec that any missing power rail does not cause >>>issues.
And any power rail can be turned on and off any time.
The equipment may not work properly with a missing power rail but it >>>should
not be damaged.
Some FPGAs require supply sequencing, as many as four.
LM3880 is a dedicated powerup sequencer, most cool.
https://www.dropbox.com/scl/fi/gwrimefrgm729k8enqrir/28S662D_sh_19.pdf?rlkey=qvyip7rjqfy6i9yegqrt57n23&dl=0
Ok that doesn't surprise me.
I'd want to be sure that the requirement is always met even when the 12V connector is in a position where it isn't sure whether it's connected or not.
Or rapid and repeated connect/disconnect of 12V doesn't cause any issue.
"John Larkin" <jjSNIPlarkin@highNONOlandtechnology.com> wrote in message >news:jr6u1j9vmo3a6tpl1evgrvmu1993slepno@4ax.com...
On Tue, 16 Apr 2024 13:20:34 -0400, Joe Gwinn <joegwinn@comcast.net>....
wrote:
On Tue, 16 Apr 2024 08:16:04 -0700, John Larkin >>><jjSNIPlarkin@highNONOlandtechnology.com> wrote:
On Tue, 16 Apr 2024 10:19:00 -0400, Joe Gwinn <joegwinn@comcast.net> >>>>wrote:
On Mon, 15 Apr 2024 16:26:35 -0700, john larkin <jl@650pot.com> wrote: >>>>>
On Mon, 15 Apr 2024 18:03:23 -0400, Joe Gwinn <joegwinn@comcast.net> >>>>>>wrote:
On Mon, 15 Apr 2024 13:05:40 -0700, john larkin <jl@650pot.com> wrote: >>>>>>>
On Mon, 15 Apr 2024 15:41:57 -0400, Joe Gwinn <joegwinn@comcast.net> >>>>>>>>wrote:
On Mon, 15 Apr 2024 10:13:02 -0700, Don Y >>>>>>>>><blockedofcourse@foo.invalid> wrote:
Is there a general rule of thumb for signalling the likelihood of >>>>>>>>>>an "imminent" (for some value of "imminent") hardware failure? >>>>>>>>>>
I liked hiding out in the shaft alley. It was private and cool, that >>>>>>giant shaft slowly rotating.
Probably had a calming flowing water sound as well.
Yes, cool and beautiful and serene after the heat and noise and >>>>vibration of the engine room. A quiet 32,000 horsepower.
It was fun being an electronic guru on sea trials of a ship full of
big hairy Popeye types. I, skinny gawky kid, got my own stateroom when >>>>other tech reps slept in cots in the hold.
Have you noticed how many lumberjack types are afraid of electricity? >>>>That can be funny.
Oh yes. And EEs frightened by a 9-v battery.
Joe Gwinn
I had an intern, an EE senior, who was afraid of 3.3 volts.
I told him to touch an FPGA to see how warm it was getting, and he
refused.
That's what happens when they grow up having never accidentally touched the top cap of a 40KG6A/PL519
On Tue, 16 Apr 2024 21:04:40 -0400, "Edward Rawde"
<invalid@invalid.invalid> wrote:
"John Larkin" <jjSNIPlarkin@highNONOlandtechnology.com> wrote in message >>news:jr6u1j9vmo3a6tpl1evgrvmu1993slepno@4ax.com...
On Tue, 16 Apr 2024 13:20:34 -0400, Joe Gwinn <joegwinn@comcast.net>....
wrote:
On Tue, 16 Apr 2024 08:16:04 -0700, John Larkin >>>><jjSNIPlarkin@highNONOlandtechnology.com> wrote:
On Tue, 16 Apr 2024 10:19:00 -0400, Joe Gwinn <joegwinn@comcast.net> >>>>>wrote:
On Mon, 15 Apr 2024 16:26:35 -0700, john larkin <jl@650pot.com> wrote: >>>>>>
On Mon, 15 Apr 2024 18:03:23 -0400, Joe Gwinn <joegwinn@comcast.net> >>>>>>>wrote:
On Mon, 15 Apr 2024 13:05:40 -0700, john larkin <jl@650pot.com> >>>>>>>>wrote:
On Mon, 15 Apr 2024 15:41:57 -0400, Joe Gwinn >>>>>>>>><joegwinn@comcast.net>
wrote:
On Mon, 15 Apr 2024 10:13:02 -0700, Don Y >>>>>>>>>><blockedofcourse@foo.invalid> wrote:
Is there a general rule of thumb for signalling the likelihood of >>>>>>>>>>>an "imminent" (for some value of "imminent") hardware failure? >>>>>>>>>>>
I liked hiding out in the shaft alley. It was private and cool, that >>>>>>>giant shaft slowly rotating.
Probably had a calming flowing water sound as well.
Yes, cool and beautiful and serene after the heat and noise and >>>>>vibration of the engine room. A quiet 32,000 horsepower.
It was fun being an electronic guru on sea trials of a ship full of >>>>>big hairy Popeye types. I, skinny gawky kid, got my own stateroom when >>>>>other tech reps slept in cots in the hold.
Have you noticed how many lumberjack types are afraid of electricity? >>>>>That can be funny.
Oh yes. And EEs frightened by a 9-v battery.
Joe Gwinn
I had an intern, an EE senior, who was afraid of 3.3 volts.
I told him to touch an FPGA to see how warm it was getting, and he
refused.
That's what happens when they grow up having never accidentally touched
the
top cap of a 40KG6A/PL519
They can type code. Rust is supposed to be safe.
On 4/16/2024 6:38 PM, Edward Rawde wrote:
Simple solution: router has no radio! Even if the appliances wanted
to connect (ignoring their "disable WiFi access" setting), there's
nothing they can connect *to*.
I'd have trouble here with no wifi access.
I can restrict outbound with a firewall as necessary.
I have 25 general purpose drops, here. So, you can be in any room, front/back porch -- even the ROOF -- and get connected.
The internal network isn't routed. So, the only machines to worry about are this one (used only for email/news/web) and a laptop that is only used for ecommerce.
I have an out-facing server that operates in stealth mode and won't appear
on probes (only used to source my work to colleagues). The goal is not to look "interesting".
The structure of the house's fabric allows me to treat any individual
node as being directly connected to the ISP while isolating the
rest of the nodes. I.e., if you bring a laptop loaded with malware into
the house, you can't infect anything (or even know that there are other hosts, here); it's as if you had a dedicated connection to the Internet
with no other devices "nearby".
The internal network isn't routed. So, the only machines to worry about are this one (used only for email/news/web) and a laptop that is only used for ecommerce.
My LAN is more like a small/medium size business with all workstations, servers and devices behind a firewall and able to communicate both with each other and online as necessary.
I wouldn't want to give online security advice to others without doing it myself.
I have an out-facing server that operates in stealth mode and won't appear on probes (only used to source my work to colleagues). The goal is not to look "interesting".
Not sure what you mean by that.
Given what gets thrown at my firewall I think you could maybe look more interesting than you think.
The structure of the house's fabric allows me to treat any individual
node as being directly connected to the ISP while isolating the
rest of the nodes. I.e., if you bring a laptop loaded with malware into
the house, you can't infect anything (or even know that there are other
hosts, here); it's as if you had a dedicated connection to the Internet
with no other devices "nearby".
I wouldn't bother. I'd just not connect it to wifi or wired if I thought there was a risk.
It's been a while since I had to clean a malware infested PC.
On 4/16/2024 10:25 AM, Edward Rawde wrote:
Better to inform the individual who can get the replacement done when the >>>> tenant isn't even home.
So, a WiFi/BT link to <whatever>? Now the simple smoke detector starts
suffering from feeping creaturism. "Download the app..."
No thanks. I have the same view of cameras.
They won't be connecting outbound to a server anywhere in the world.
But the average user does not know that and just wants the pictures on their >> phone.
There is no need for a manufacturer to interpose themselves in such
"remote access". Having the device register with a DDNS service
cuts out the need for the manufacturer to essentially provide THAT
service.
There is no need for a manufacturer to interpose themselves in such
"remote access". Having the device register with a DDNS service
cuts out the need for the manufacturer to essentially provide THAT
service.
Someone still needs to provide DDNS.
Yes, UPnP has been a thing for several generations of routers now, but browsers have become fussier about port numbers too. Also, some customers are on Carrier-Grade NAT, and I don't think that UPnP can traverse that. IPv6, however, can avoid the CGNAT problem.
It's an ease of use vs quality of service problem.
On 4/16/2024 9:21 PM, Edward Rawde wrote:
The internal network isn't routed. So, the only machines to worry about >>> are
this one (used only for email/news/web) and a laptop that is only used
for ecommerce.
My LAN is more like a small/medium size business with all workstations,
servers and devices behind a firewall and able to communicate both with
each
other and online as necessary.
I have 72 drops in the office and 240 throughout the rest of the house (though the vast majority of those are for dedicated "appliances")...
about 2.5 miles of CAT5.
...
I have an out-facing server that operates in stealth mode and won't
appear
on probes (only used to source my work to colleagues). The goal is not
to
look "interesting".
Not sure what you mean by that.
Given what gets thrown at my firewall I think you could maybe look more
interesting than you think.
Nothing on my side "answers" connection attempts. To the rest of the
world,
it looks like a cable dangling in air...
The structure of the house's fabric allows me to treat any individual
node as being directly connected to the ISP while isolating the
rest of the nodes. I.e., if you bring a laptop loaded with malware into >>> the house, you can't infect anything (or even know that there are other
hosts, here); it's as if you had a dedicated connection to the Internet
with no other devices "nearby".
I wouldn't bother. I'd just not connect it to wifi or wired if I thought
there was a risk.
So, you'd have to *police* all such connections. What do you do with hundreds of drops on a factory floor? Or, scattered throughout a business? Can you prevent any "foreign" devices from being connected -- even if IN PLACE OF a legitimate device? (after all, it is a trivial matter to unplug a network cable from one "approved" PC and plug it into a "foreign import")
It's been a while since I had to clean a malware infested PC.
My current project relies heavily on internetworking for interprocessor communication. So, has to be designed to tolerate (and survive) a
hostile actor being directly connected TO that fabric -- because that
is a likely occurrence, "in the wild".
Imagine someone being able to open your PC and alter the internals...
and be expected to continue to operate as if this had not occurred!
"Don Y" <blockedofcourse@foo.invalid> wrote in message news:uvnlr6$1e3fi$1@dont-email.me...
On 4/16/2024 9:21 PM, Edward Rawde wrote:
The internal network isn't routed. So, the only machines to worry about >>>> are
this one (used only for email/news/web) and a laptop that is only used >>>> for ecommerce.
My LAN is more like a small/medium size business with all workstations,
servers and devices behind a firewall and able to communicate both with
each
other and online as necessary.
I have 72 drops in the office and 240 throughout the rest of the house
(though the vast majority of those are for dedicated "appliances")...
about 2.5 miles of CAT5.
Must be a big house.
I have an out-facing server that operates in stealth mode and won't
appear
on probes (only used to source my work to colleagues). The goal is not >>>> to
look "interesting".
Not sure what you mean by that.
Given what gets thrown at my firewall I think you could maybe look more
interesting than you think.
Nothing on my side "answers" connection attempts. To the rest of the
world,
it looks like a cable dangling in air...
You could ping me if you knew my IP address.
The structure of the house's fabric allows me to treat any individual
node as being directly connected to the ISP while isolating the
rest of the nodes. I.e., if you bring a laptop loaded with malware into >>>> the house, you can't infect anything (or even know that there are other >>>> hosts, here); it's as if you had a dedicated connection to the Internet >>>> with no other devices "nearby".
I wouldn't bother. I'd just not connect it to wifi or wired if I thought >>> there was a risk.
What I mean by that is I'd clean it without it being connected.
The Avira boot CD used to be useful but I forget how many years ago.
So, you'd have to *police* all such connections. What do you do with
hundreds
of drops on a factory floor? Or, scattered throughout a business? Can
you prevent any "foreign" devices from being connected -- even if IN PLACE >> OF
a legitimate device? (after all, it is a trivial matter to unplug a
network
cable from one "approved" PC and plug it into a "foreign import")
Devices on a LAN should be secure just like Internet facing devices.
...When I was designing for pharma, my philosophy was
to make it easy/quick to replace the entire control system. Let someone troubleshoot it on a bench instead of on the factory floor (which is semi-sterile).
Don Y <blockedofcourse@foo.invalid> wrote:
...When I was designing for pharma, my philosophy was
to make it easy/quick to replace the entire control system. Let someone
troubleshoot it on a bench instead of on the factory floor (which is
semi-sterile).
That's fine if the failure is clearly in the equipment itself, but what if it is in the way it interacts with something outside it, some unpredictable or unrecognised input condition? It works perfectly on the bench, only to fail when put into service ...again and again.
"Don Y" <blockedofcourse@foo.invalid> wrote in message >news:uvl2gr$phap$2@dont-email.me...
On 4/15/2024 8:33 PM, Edward Rawde wrote:
[Shouldn't that be Edwar D rawdE?]
I don't mind how you pronounce it.
...
A smoke detector that beeps once a day risks not being heard
Reminds me of a tenant who just removed the battery to stop the annoying beeping.
a typical run-of-the-mill disk performs at about 60MB/s sustained. So, ~3.6GB/min or ~216GB/hr. That's ~5TB/day or ~1.9PB/yr.
Assuming 24/7/365 use.
In a 9-to-5 environment (8 hours a day, 5 days a week, to account for idle time nights and weekends) the same rate would still move roughly 450TB/yr.
Said another way, staying inside a 55TB/yr rating means averaging only about (55/450)*60MB/s, i.e. ~7MB/s over the working day -- a small fraction of what the drive can actually sustain. A drive streaming at 100MB/s (not uncommon) would hit 55TB in under a week.
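The arithmetic in one place, so the assumptions (sustained rate and duty cycle) are explicit:

def annual_tb(rate_mb_s, hours_per_day=24.0, days_per_year=365.0):
    """Terabytes moved per year at a given sustained rate and duty cycle."""
    return rate_mb_s * 3600.0 * hours_per_day * days_per_year / 1.0e6

def avg_rate_mb_s(tb_per_year, hours_per_day=24.0, days_per_year=365.0):
    """Average transfer rate implied by an annual workload rating."""
    return tb_per_year * 1.0e6 / (3600.0 * hours_per_day * days_per_year)

# annual_tb(60)              -> ~1890 TB/yr streaming 24/7
# annual_tb(60, 8, 261)      -> ~450 TB/yr in a 9-to-5 shop
# avg_rate_mb_s(55, 8, 261)  -> ~7 MB/s average to stay inside 55 TB/yr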
On Tue, 16 Apr 2024 21:16:45 -0400, "Edward Rawde"
<invalid@invalid.invalid> wrote:
"John Larkin" <jjSNIPlarkin@highNONOlandtechnology.com> wrote in message >>news:p47u1j1tg35ctb3tcta5qevsfnhgnpcrsg@4ax.com...
On Tue, 16 Apr 2024 13:39:07 -0400, "Edward Rawde"
<invalid@invalid.invalid> wrote:
On 17/04/2024 1:22 am, John Larkin wrote:
On Tue, 16 Apr 2024 09:45:34 +0100, Martin Brown
<'''newspam'''@nonad.co.uk> wrote:
On 15/04/2024 18:13, Don Y wrote:
Yes I've seen that a lot.
The power rails in the production product came up in a different order to >>>>those in the development lab.
This caused all kinds of previously unseen behaviour including an >>>>expensive
flash a/d chip burning up.
I'd have it in the test spec that any missing power rail does not cause >>>>issues.
And any power rail can be turned on and off any time.
The equipment may not work properly with a missing power rail but it >>>>should
not be damaged.
Some FPGAs require supply sequencing, as many as four.
LM3880 is a dedicated powerup sequencer, most cool.
https://www.dropbox.com/scl/fi/gwrimefrgm729k8enqrir/28S662D_sh_19.pdf?rlkey=qvyip7rjqfy6i9yegqrt57n23&dl=0
Ok that doesn't surprise me.
I'd want to be sure that the requirement is always met even when the 12V >>connector is in a position where it isn't sure whether it's connected or >>not.
Or rapid and repeated connect/disconnect of 12V doesn't cause any issue.
We considered the brownout case. The MAX809 handles that.
This supply will also tolerate +24v input, in case someone grabs the
wrong wart. Or connects the power backwards.
On a vaguely related rant, shamelessly hijacking your thread:
Why do recent mechanical hard drives have an "Annualised Workload Rate" limit saying that you are only supposed to write say 55TB/year?
What is the wearout mechanism, or is it just bullshit to discourage enterprise
customers from buying the cheapest drives?
It seems odd to me that they would all do it, if it really is just made up bullshit. It also seems odd to express it in terms of TB read+written. I can't
see why that would be more likely to wear it out than some number of hours of spindle rotation, or seek operations, or spindle starts, or head load/unload cycles. I could imagine they might want to use a very high current density in the windings of the write head that might place an electromigration limit on the time spent writing, but they apply the limit to reads as well. Is there something that wears out when the servo loop is keeping the head on a track?
On Tue, 16 Apr 2024 13:20:34 -0400, Joe Gwinn <joegwinn@comcast.net>
wrote:
On Tue, 16 Apr 2024 08:16:04 -0700, John Larkin >><jjSNIPlarkin@highNONOlandtechnology.com> wrote:
On Tue, 16 Apr 2024 10:19:00 -0400, Joe Gwinn <joegwinn@comcast.net> >>>wrote:
On Mon, 15 Apr 2024 16:26:35 -0700, john larkin <jl@650pot.com> wrote:
On Mon, 15 Apr 2024 18:03:23 -0400, Joe Gwinn <joegwinn@comcast.net> >>>>>wrote:
On Mon, 15 Apr 2024 13:05:40 -0700, john larkin <jl@650pot.com> wrote: >>>>>>
On Mon, 15 Apr 2024 15:41:57 -0400, Joe Gwinn <joegwinn@comcast.net> >>>>>>>wrote:
On Mon, 15 Apr 2024 10:13:02 -0700, Don Y >>>>>>>><blockedofcourse@foo.invalid> wrote:
Is there a general rule of thumb for signalling the likelihood of >>>>>>>>>an "imminent" (for some value of "imminent") hardware failure? >>>>>>>>>
I suspect most would involve *relative* changes that would be >>>>>>>>>suggestive of changing conditions in the components (and not >>>>>>>>>directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and >>>>>>>>>notice the sorts of changes you "typically" encounter in the hope >>>>>>>>>that something of greater magnitude would be a harbinger...
There is a standard approach that may work: Measure the level and >>>>>>>>trend of very low frequency (around a tenth of a Hertz) flicker noise. >>>>>>>>When connections (perhaps within a package) start to fail, the flicker >>>>>>>>level rises. The actual frequency monitored isn't all that critical. >>>>>>>>
Joe Gwinn
Do connections "start to fail" ?
Yes, they do, in things like vias. I went through a big drama where a >>>>>>critical bit of radar logic circuitry would slowly go nuts.
It turned out that the copper plating on the walls of the vias was >>>>>>suffering from low-cycle fatigue during temperature cycling and slowly >>>>>>breaking, one little crack at a time, until it went open. If you >>>>>>measured the resistance to parts per million (6.5 digit DMM), sampling >>>>>>at 1 Hz, you could see the 1/f noise at 0.1 Hz rising. It's useful to >>>>>>also measure a copper line, and divide the via-chain resistance by the >>>>>>no-via resistance, to correct for temperature changes.
But nobody is going to monitor every via on a PCB, even if it were possible.
It was not possible to test the vias on the failing logic board, but
we knew from metallurgical cut, polish, and inspect studies of failed boards that it was the vias that were failing.
One could instrument a PCB fab test board, I guess. But DC tests would be fine.
What was being tested was a fab test board that had both the series
via chain path and the no-via path of roughly the same DC resistance, set up so we could do 4-wire Kelvin resistance measurements of each path independent of the other path.
Yes, but the question was whether one could predict the failure of an operating electronic gadget. The answer is mostly NO.
Agree.
We had a visit from the quality team from a giant company that you
have heard of. They wanted us to trend analyze all the power supplies
on our boards and apply a complex algorithm to predict failures. It
was total nonsense, basically predicting the future by zooming in on random noise with a big 1/f component, just like climate prediction.
Hmm. My first instinct was that they were using MIL-HDBK-217 (?) or
the like, but that does not measure noise. Do you recall any more of
what they were doing? I might know what they were up to. The
military were big on prognostics for a while, and still talk of this,
but it never worked all that well in the field compared to what it was supposed to improve on.
We have one board with over 4000 vias, but they are mostly in parallel.
This can also be tested, but using a 6.5-digit DMM intended for measuring very low resistance values. A change of one part in 4,000
is huge to a 6.5-digit instrument. The conductivity will decline linearly as vias fail one by one.
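Quick sanity check on that arithmetic, assuming identical vias and ignoring any series trace resistance:

# N identical vias of resistance r in parallel give R = r/N; losing one
# raises R by a factor N/(N-1), i.e. a fractional change of 1/(N-1).
N = 4000
fractional_change = 1 / (N - 1)
print(f"{fractional_change:.6f}  (~{fractional_change * 1e6:.0f} ppm)")  # ~250 ppm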
Millikelvin temperature changes would make more signal than a failing via.
Not at the currents in that logic card. Too much ambient thermal
noise.
The solution was to redesign the vias, mainly to increase the critical volume of copper. And modern SMD designs have less and less copper volume.
I bet precision resistors can also be measured this way.
I don't think I've ever owned a piece of electronic equipment that warned me of an impending failure.
Onset of smoke emission is a common sign.
Cars do, for some failure modes, like low oil level.
The industrial method for big stuff is accelerometers attached near the bearings, and listen for excessive rotation-correlated (not necessarily harmonic) noise.
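A loose sketch of what "rotation-correlated" monitoring amounts to: compare vibration energy near the shaft rate and its low multiples against the broadband total, and trend that ratio. Real condition-monitoring gear (order analysis, envelope detection) does far more; all the numbers here are placeholders:

import numpy as np

def rotation_correlated_fraction(accel, fs, shaft_hz, orders=8, bw_hz=0.5):
    """Fraction of vibration power within +/-bw_hz of shaft_hz * 1..orders."""
    accel = np.asarray(accel, dtype=float)
    spec = np.abs(np.fft.rfft(accel * np.hanning(len(accel)))) ** 2
    freqs = np.fft.rfftfreq(len(accel), d=1.0 / fs)
    mask = np.zeros_like(freqs, dtype=bool)
    for k in range(1, orders + 1):
        mask |= np.abs(freqs - k * shaft_hz) <= bw_hz
    return spec[mask].sum() / spec.sum()

# Trend this fraction per shift; a steady climb (not necessarily at exact
# harmonics, hence the generous bandwidth) is the "listen for" condition.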
Big ships that I've worked on have a long propeller shaft in the shaft alley, a long tunnel where nobody often goes. They have magnetic shaft runout sensors and shaft bearing temperature monitors.
They measure shaft torque and SHP too, from the shaft twist.
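The twist-to-power conversion, for the curious, is just the textbook torsion formula; the material and geometry below are illustrative, not the ship's actual figures:

import math

def shaft_horsepower(twist_deg, gauge_len_m, shaft_dia_m, rpm,
                     shear_modulus_pa=79e9):     # roughly steel
    J = math.pi * shaft_dia_m ** 4 / 32          # polar moment, solid shaft
    torque = shear_modulus_pa * J * math.radians(twist_deg) / gauge_len_m
    omega = 2 * math.pi * rpm / 60
    return torque * omega / 745.7                # watts -> horsepower

# Plausible-looking numbers give tens of thousands of horsepower:
print(round(shaft_horsepower(twist_deg=0.5, gauge_len_m=3.0,
                             shaft_dia_m=0.5, rpm=100)))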
Yep. And these kinds of things fail slowly. At first.
They could repair a bearing at sea, given a heads-up about violent failure. A serious bearing failure on a single-screw machine means getting a seagoing tug.
The main engine gearbox had padlocks on the covers.
There was also a chem lab to analyze oil and water and such, looking
for contaminants that might suggest something going on.
I liked hiding out in the shaft alley. It was private and cool, that giant shaft slowly rotating.
Probably had a calming flowing water sound as well.
Yes, cool and beautiful and serene after the heat and noise and
vibration of the engine room. A quiet 32,000 horsepower.
It was fun being an electronic guru on sea trials of a ship full of
big hairy Popeye types. I, skinny gawky kid, got my own stateroom when other tech reps slept in cots in the hold.
Have you noticed how many lumberjack types are afraid of electricity? That can be funny.
Oh yes. And EEs frightened by a 9-v battery.
Joe Gwinn
I had an intern, an EE senior, who was afraid of 3.3 volts.
I told him to touch an FPGA to see how warm it was getting, and he
refused.
On 4/16/2024 10:39 PM, Edward Rawde wrote:
"Don Y" <blockedofcourse@foo.invalid> wrote in message
news:uvnlr6$1e3fi$1@dont-email.me...
On 4/16/2024 9:21 PM, Edward Rawde wrote:
The internal network isn't routed. So, the only machines to worry about are
this one (used only for email/news/web) and a laptop that is only used
for ecommerce.
My LAN is more like a small/medium size business with all workstations,
servers and devices behind a firewall and able to communicate both with each
other and online as necessary.
I have 72 drops in the office and 240 throughout the rest of the house
(though the vast majority of those are for dedicated "appliances")...
about 2.5 miles of CAT5.
Must be a big house.
The office is ~150 sq ft. Three sets of dual workstations each sharing a
set of monitors and a tablet (for music) -- 7 drops for each such set.
Eight drops for my "prototyping platform". Twelve UPSs. Four scanners
(two B size, one A-size w/ADF and a film scanner). An SB2000 and Voyager (for cross development testing; I'm discarding a T5220 tomorrow).
Four "toy" NASs (for sharing files between myself and SWMBO, documents dropped by the scanners, etc.). Four 12-bay NASs, two 16 bay. Four
8-bay ESXi servers. Two 1U servers. Two 2U servers. My DBMS server.
A "general services" appliance (DNS, NTP, PXE, FTP, TFTP, font, etc. services). Three media front ends. One media tank. Two 12 bay
(and one 24 bay) iSCSI SAN devices.
....
I have an out-facing server that operates in stealth mode and won't appear
on probes (only used to source my work to colleagues). The goal is not
to look "interesting".
Not sure what you mean by that.
Given what gets thrown at my firewall I think you could maybe look more interesting than you think.
Nothing on my side "answers" connection attempts. To the rest of the world,
it looks like a cable dangling in air...
You could ping me if you knew my IP address.
You can't see me, at all. You have to know the right sequence of packets (connection attempts) to throw at me before I will "wake up" and respond
to the *final*/correct one. And, while doing so, will continue to
ignore *other* attempts to contact me. So, even if you could see that
I had started to respond, you couldn't "get my attention".
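What's being described is essentially port knocking. A minimal client-side sketch, with a hypothetical knock sequence and a documentation-range address; the actual scheme alluded to here is presumably more elaborate and replay-resistant:

import socket
import time

KNOCK_HOST = "198.51.100.7"          # documentation address, not a real box
KNOCK_SEQUENCE = [7000, 8000, 9000]  # hypothetical "right sequence of packets"

def knock(host, ports, delay=0.3):
    for port in ports:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.settimeout(0.5)
        try:
            s.connect((host, port))  # expected to fail/time out; the SYN is the knock
        except OSError:
            pass
        finally:
            s.close()
        time.sleep(delay)

knock(KNOCK_HOST, KNOCK_SEQUENCE)
# Only after the final, correct knock would a connection to the real service
# port succeed; anything else is silently ignored.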
I wouldn't bother. I'd just not connect it to wifi or wired if I thought
there was a risk.
What I mean by that is I'd clean it without it being connected.
The Avira boot CD used to be useful but I forget how many years ago.
If you were to unplug any of the above mentioned ("house") drops,
you'd find nothing at the other end. Each physical link is an
encrypted tunnel that similarly "hides" until (and unless) properly
tickled. As a result, eavesdropping on the connection doesn't
"give" you anything (because it's immune from replay attacks and
its content is opaque to you)
So, you'd have to *police* all such connections. What do you do with hundreds
of drops on a factory floor? Or, scattered throughout a business? Can
you prevent any "foreign" devices from being connected -- even if IN PLACE OF
a legitimate device? (after all, it is a trivial matter to unplug a network
cable from one "approved" PC and plug it into a "foreign import")
Devices on a LAN should be secure just like Internet facing devices.
They should be secure from the threats they are LIKELY TO FACE.
If the only access to my devices is by gaining physical entry
to the premises, then why waste CPU cycles and man-hours protecting
against a threat that can't manifest? Each box has a password...
pasted on the outer skin of the box (for any intruder to read).
Do I *care* about the latest MS release? (ANS: No)
Do I care about the security patches for it? (No)
Can I still do MY work with MY tools? (Yes)
I have to activate an iPhone, tonight. So, drag out a laptop
(I have 7 of them), install the latest iTunes. Do the required
song and dance to get the phone running. Wipe the laptop's
disk and reinstall the image that was present, there, minutes
earlier (so, I don't care WHICH laptop I use!)
On Tue, 16 Apr 2024 20:23:46 -0700, John Larkin <jjSNIPlarkin@highNONOlandtechnology.com> wrote:
On Tue, 16 Apr 2024 21:16:45 -0400, "Edward Rawde" <invalid@invalid.invalid> wrote:
"John Larkin" <jjSNIPlarkin@highNONOlandtechnology.com> wrote in message news:p47u1j1tg35ctb3tcta5qevsfnhgnpcrsg@4ax.com...
On Tue, 16 Apr 2024 13:39:07 -0400, "Edward Rawde"
<invalid@invalid.invalid> wrote:
On 17/04/2024 1:22 am, John Larkin wrote:
On Tue, 16 Apr 2024 09:45:34 +0100, Martin Brown
<'''newspam'''@nonad.co.uk> wrote:
On 15/04/2024 18:13, Don Y wrote:
Yes I've seen that a lot.
The power rails in the production product came up in a different order to
those in the development lab.
This caused all kinds of previously unseen behaviour including an expensive
flash a/d chip burning up.
I'd have it in the test spec that any missing power rail does not cause issues.
And any power rail can be turned on and off any time.
The equipment may not work properly with a missing power rail but it should
not be damaged.
Some FPGAs require supply sequencing, as many as four.
LM3880 is a dedicated powerup sequencer, most cool.
https://www.dropbox.com/scl/fi/gwrimefrgm729k8enqrir/28S662D_sh_19.pdf?rlkey=qvyip7rjqfy6i9yegqrt57n23&dl=0
Ok that doesn't surprise me.
I'd want to be sure that the requirement is always met even when the 12V connector is in a position where it isn't sure whether it's connected or not.
Or rapid and repeated connect/disconnect of 12V doesn't cause any issue.
We considered the brownout case. The MAX809 handles that.
This supply will also tolerate +24v input, in case someone grabs the
wrong wart. Or connects the power backwards.
Another hazard/failure mode happens when things like opamps use pos
and neg supply rails. A positive regulator, for example, can latch up
if its output is pulled negative, through ground, at startup. Brownout
dippies can trigger that too.
Add schottky diodes to ground.
You could ping me if you knew my IP address.
You can't see me, at all. You have to know the right sequence of packets
(connection attempts) to throw at me before I will "wake up" and respond
to the *final*/correct one. And, while doing so, will continue to
ignore *other* attempts to contact me. So, even if you could see that
I had started to respond, you couldn't "get my attention".
I've never bothered with port knocking.
Those of us with inbound connectable web servers, database servers, email servers etc have to be connectable by more conventional means.
I wouldn't bother. I'd just not connect it to wifi or wired if I thought
there was a risk.
What I mean by that is I'd clean it without it being connected.
The Avira boot CD used to be useful but I forget how many years ago.
If you were to unplug any of the above mentioned ("house") drops,
you'd find nothing at the other end. Each physical link is an
encrypted tunnel that similarly "hides" until (and unless) properly
tickled. As a result, eavesdropping on the connection doesn't
"give" you anything (because it's immune from replay attacks and
its content is opaque to you)
I'm surprised you get anything done with all the tickle processes you must need before anything works.
They should be secure from the threats they are LIKELY TO FACE.
If the only access to my devices is by gaining physical entry
to the premises, then why waste CPU cycles and man-hours protecting
against a threat that can't manifest? Each box has a password...
pasted on the outer skin of the box (for any intruder to read).
Sounds like you are the only user of your devices.
Consider a small business.
Here you want a minimum of either two LANs or VLANs so that guest access to wireless can't connect to your own LAN devices.
Your own LAN should have devices which are patched and have proper identification so that even if you do get a compromised device on your own LAN it's not likely to spread to other devices.
You might also want a firewall which is monitored remotely by someone who knows how to spot anything unusual.
I have much written in python which tells me whether I want a closer look at the firewall log or not.
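Not that script, obviously, but the general shape of such a filter might be: tally firewall denies per source and flag anything well above the usual background noise. The log format and threshold below are hypothetical:

import re
from collections import Counter

DENY_RE = re.compile(r"DENY .* SRC=(?P<src>\d+\.\d+\.\d+\.\d+)")

def noisy_sources(log_lines, threshold=50):
    """Return {src_ip: hits} for sources with more denies than 'threshold'."""
    hits = Counter()
    for line in log_lines:
        m = DENY_RE.search(line)
        if m:
            hits[m.group("src")] += 1
    return {ip: n for ip, n in hits.items() if n > threshold}

# Usage (path is a placeholder):
#   with open("/var/log/firewall.log") as f:
#       print(noisy_sources(f))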
Do I *care* about the latest MS release? (ANS: No)
Do I care about the security patches for it? (No)
Can I still do MY work with MY tools? (Yes)
But only for your situation.
If I advised a small business to run like that they'd get someone else to do it.
I have to activate an iPhone, tonight. So, drag out a laptop
(I have 7 of them), install the latest iTunes. Do the required
song and dance to get the phone running. Wipe the laptop's
disk and reinstall the image that was present, there, minutes
earlier (so, I don't care WHICH laptop I use!)
You'll have to excuse me for laughing at that.
Cybersecurity is certainly a very interesting subject, and thanks for the discussion.
If I open one of the wordy cybersecurity books I have (pdf) at a random page I get this.
"Once the attacker has gained access to a system, they will want to gain administrator-level access to the current resource, as well as additional resources on the network."
Well duh. You mean like once the bank robber has gained access to the bank they will want to find out where the money is?
On 4/17/2024 10:49 AM, Edward Rawde wrote:
You could ping me if you knew my IP address.
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
On 4/15/2024 10:13 AM, Don Y wrote:
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
This reminded me of some past efforts in this area. It was never demonstrated to me (given ample opportunity) that this technology actually worked on intermittently failing hardware I had, so be cautious in applying it in any future endeavors.
https://radlab.cs.berkeley.edu/classes/cs444a/KGross_CSTH_Stanford.pdf
On 4/18/2024 10:18 AM, Buzz McCool wrote:
On 4/15/2024 10:13 AM, Don Y wrote:
Is there a general rule of thumb for signalling the likelihood of an
"imminent" (for some value of "imminent") hardware failure?
This reminded me of some past efforts in this area. It was never
demonstrated to me (given ample opportunity) that this technology
actually worked on intermittently failing hardware I had, so be
cautious in applying it in any future endeavors.
Intermittent failures are the bane of all designers. Until something is reliably observable, trying to address the problem is largely
wack-a-mole.
https://radlab.cs.berkeley.edu/classes/cs444a/KGross_CSTH_Stanford.pdf
Thanks for that. I didn't find it in my collection so its addition
will be welcome.
Sun has historically been aggressive in trying to increase availability, especially on big iron. In fact, such a "prediction" led me to discard
a small server, yesterday (no time to dick with failing hardware!).
I am now seeing similar features in Dell servers. But, the *actual* implementation details are always shrouded in mystery.
But, it is obvious (for "always on" systems) that there are many things
that can silently fail that will only manifest some time later -- if at
all and possibly complicated by other failures that may have been precipitated by it.
Sorting out WHAT to monitor is the tricky part. Then, having the
ability to watch for trends can give you an inkling that something is
headed in the wrong direction -- before it actually exceeds some baked
in "hard limit".
E.g., only the memory that you actively REFERENCE in a product is ever checked for errors! Bit rot may not be detected until some time after
it has occurred -- when you eventually access that memory (and the
memory controller throws an error).
This is paradoxically amusing; code to HANDLE errors is likely the least accessed code in a product. So, bit rot IN that code is more likely to
go unnoticed -- until it is referenced (by some error condition)
and the error event complicated by the attendant error in the handler!
The more reliable your code (fewer faults), the more uncertain you will
be of the handlers' abilities to address faults that DO manifest!
The same applies to secondary storage media. How will you know if some-rarely-accessed-file is intact and ready to be referenced WHEN
NEEDED -- if you aren't doing patrol reads/scrubbing to verify that it
is intact, NOW?
[One common flaw with RAID implementations and naive reliance on that technology]
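A bare-bones patrol read might look like this: keep a manifest of content hashes and periodically re-read everything, so silent corruption is caught on a schedule rather than at the moment the file is finally needed. Paths and the manifest location are placeholders:

import hashlib
import json
import os

MANIFEST = "scrub_manifest.json"

def sha256_of(path, bufsize=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

def scrub(root):
    known = json.load(open(MANIFEST)) if os.path.exists(MANIFEST) else {}
    suspect = []
    for dirpath, _, files in os.walk(root):
        for name in files:
            p = os.path.join(dirpath, name)
            digest = sha256_of(p)
            if p in known and known[p] != digest:
                suspect.append(p)      # rotted -- or legitimately modified
            known[p] = digest
    json.dump(known, open(MANIFEST, "w"), indent=2)
    return suspect

# Run from cron; anything in the returned list needs restoring from a replica
# *now*, while a good copy still exists elsewhere.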
On Thu, 18 Apr 2024 15:05:07 -0700, Don Y wrote:
The same applies to secondary storage media. How will you know if
some-rarely-accessed-file is intact and ready to be referenced WHEN
NEEDED -- if you aren't doing patrol reads/scrubbing to verify that it
is intact, NOW?
[One common flaw with RAID implementations and naive reliance on that
technology]
RAID, even with backups, is unsuited to high reliability storage of large databases. Distributed storage can be of much higher reliability:
https://telnyx.com/resources/what-is-distributed-storage
<https://towardsdatascience.com/introduction-to-distributed-data-storage-2ee03e02a11d>
This requires successful retrieval of any n of m data files, normally from different locations, where n can be arbitrarily smaller than m depending
on your needs. Overkill for small databases but required for high reliability storage of very large databases.
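The simplest possible "any n of m" illustration is n data shards plus one XOR parity shard (m = n + 1), so any single lost shard is recoverable; real distributed stores use Reed-Solomon or similar so n can be much smaller than m:

from functools import reduce

def make_shards(data: bytes, n: int):
    """Split data into n equal shards plus one XOR parity shard."""
    data = data.ljust(-(-len(data) // n) * n, b"\0")   # pad to a multiple of n
    size = len(data) // n
    shards = [data[i * size:(i + 1) * size] for i in range(n)]
    parity = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), shards)
    return shards + [parity]

def recover_missing(shards_present):
    """Given all shards but one (data or parity), rebuild the missing one."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)),
                  shards_present)

shards = make_shards(b"important payload", n=4)
lost = shards.pop(2)                     # lose any one of the five shards
assert recover_missing(shards) == lost   # ...and it comes back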
On 4/18/2024 10:18 AM, Buzz McCool wrote:
On 4/15/2024 10:13 AM, Don Y wrote:
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
This reminded me of some past efforts in this area. It was never demonstrated
to me (given ample opportunity) that this technology actually worked on
intermittently failing hardware I had, so be cautious in applying it in any future endeavors.
Intermittent failures are the bane of all designers. Until something
is reliably observable, trying to address the problem is largely
wack-a-mole.
https://radlab.cs.berkeley.edu/classes/cs444a/KGross_CSTH_Stanford.pdf
Thanks for that. I didn't find it in my collection so its addition will
be welcome.
Sun has historically been aggressive in trying to increase availability, especially on big iron. In fact, such a "prediction" led me to discard
a small server, yesterday (no time to dick with failing hardware!).
I am now seeing similar features in Dell servers. But, the *actual* implementation details are always shrouded in mystery.
But, it is obvious (for "always on" systems) that there are many things
that can silently fail that will only manifest some time later -- if at
all and possibly complicated by other failures that may have been precipitated by it.
Sorting out WHAT to monitor is the tricky part. Then, having the
ability to watch for trends can give you an inkling that something is
headed in the wrong direction -- before it actually exceeds some
baked in "hard limit".
E.g., only the memory that you actively REFERENCE in a product is ever checked for errors! Bit rot may not be detected until some time after it
has occurred -- when you eventually access that memory (and the memory controller throws an error).
This is paradoxically amusing; code to HANDLE errors is likely the least accessed code in a product. So, bit rot IN that code is more likely
to go unnoticed -- until it is referenced (by some error condition)
and the error event complicated by the attendant error in the handler!
The more reliable your code (fewer faults), the more uncertain you
will be of the handlers' abilities to address faults that DO manifest!
The same applies to secondary storage media. How will you know if some-rarely-accessed-file is intact and ready to be referenced
WHEN NEEDED -- if you aren't doing patrol reads/scrubbing to
verify that it is intact, NOW?
[One common flaw with RAID implementations and naive reliance on that technology]
Intermittent failures are the bane of all designers. Until something
is reliably observable, trying to address the problem is largely
wack-a-mole.
The problem I have with troubleshooting intermittent failures is that
they are only intermittent sometimes.
On 4/19/2024 11:16 AM, boB wrote:
Intermittent failures are the bane of all designers. Until something
is reliably observable, trying to address the problem is largely
wack-a-mole.
The problem I have with troubleshooting intermittent failures is that
they are only intermittent sometimes.
My pet peeve is folks (developers) who OBSERVE FIRST HAND a particular failure/fault but, because reproducing it is "hard", just pretend it
never happened! Really? Do you think the circuit/code is self-healing???
You're going to "bless" a product that you, personally, know has a fault...
On Fri, 19 Apr 2024 12:10:22 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:
On 4/19/2024 11:16 AM, boB wrote:
Intermittent failures are the bane of all designers. Until something
is reliably observable, trying to address the problem is largely
wack-a-mole.
The problem I have with troubleshooting intermittent failures is that
they are only intermittent sometimes.
My pet peeve is folks (developers) who OBSERVE FIRST HAND a particular
failure/fault but, because reproducing it is "hard", just pretend it
never happened! Really? Do you think the circuit/code is self-healing???
You're going to "bless" a product that you, personally, know has a fault...
Yes, it may be hard to replicate but you just have to try and try
again sometimes. Or create something that exercises the unit or
software to make it happen and automatically catch it in the act.
I don't care to have to do that very often. When I do, I just try to
make it a challenge.