The problem here is that Crowdstrike pushed out an evidently broken
kernel driver that locked whatever system that installed it in a
permanent boot loop. The system would start loading Windows, encounter
a fatal error, and reboot. And reboot. Again and again. It, in
essence, rendered those machines useless.
A meditation on the Antithesis of the VMS Ethos, and the DEC way.
A freshly minted neologism: "CloudStrucked" (six ways Sundays, and
then some)
https://www.wheresyoured.at/crowdstruck-2/
On 7/21/24 4:41 AM, Subcommandante XDelta wrote:
The problem here is that Crowdstrike pushed out an evidently broken
kernel driver that locked whatever system that installed it in a
permanent boot loop. The system would start loading Windows, encounter
a fatal error, and reboot. And reboot. Again and again. It, in
essence, rendered those machines useless.
It was not a kernel driver. It was a bad configuration file that
normally gets updated several times a day:
https://www.crowdstrike.com/blog/falcon-update-for-windows-hosts-technical-details/
The bad file was only in the wild for about an hour and a half. Folks
in the US who powered off Thursday evening and didn't get up too early
Friday would've been fine. Of course Europe was well into their work
day, and a lot of computers stay on overnight.
The boot loop may or may not be permanent -- lots of systems have
eventually managed to get the corrected file by doing nothing other than repeated reboots. No, that doesn't always work.
The update was "designed to target newly observed, malicious named pipes being used by common C2 frameworks in cyberattacks."
Most likely what makes CrowdStrike popular is that they are continuously updating countermeasures as threats are observed, but that flies in the
face of normal deployment practices where you don't bet the farm on a
single update that affects all systems all at once. For example, in Microsoft Azure, you can set up redundancy for your PaaS and SaaS
offerings so that if an update breaks all the servers in one data
center, your services are still up and running in another. Most
enterprises will have similar planning for private data centers.
CrowdStrike thought updating the entire world in an instant was a good
idea. While no one wants to sit there vulnerable to a known threat for
any length of time, I suspect that idea will get revisited. If they had simply staggered the update over a few hours, the catastrophe would have
been much smaller. Customers will likely be asking for more control
over when they get updates, and, for example, wanting to set up
different update channels for servers and PCs.
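Craig's staggered approach is essentially a ring-based rollout: push the
update to a tiny canary group first, watch telemetry, and only then widen
the blast radius. A minimal sketch of the idea (all names here are
hypothetical -- CrowdStrike's actual deployment pipeline is not public):

```python
from dataclasses import dataclass

@dataclass
class Ring:
    name: str
    hosts: list            # host identifiers in this ring
    deployed: bool = False

def rollout(rings, is_healthy):
    """Deploy ring by ring; halt as soon as a ring looks unhealthy.

    is_healthy(hosts) stands in for real post-deploy telemetry
    (crash reports, heartbeats) gathered after each ring updates.
    """
    for ring in rings:
        ring.deployed = True
        if not is_healthy(ring.hosts):
            # Blast radius is limited to the rings deployed so far.
            return f"halted after {ring.name}"
    return "complete"

rings = [
    Ring("canary", ["host-001"]),
    Ring("early",  [f"host-{i:03d}" for i in range(2, 50)]),
    Ring("broad",  [f"host-{i:03d}" for i in range(50, 1000)]),
]

# Simulate a bad update: every host that loads it crashes.
result = rollout(rings, is_healthy=lambda hosts: False)
print(result)   # halted after canary -- one host hit, not ~1000
```

With the July update there was effectively a single ring containing
every Windows host in the world, so the first unhealthy signal arrived
only after everyone already had the file.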
On 7/21/2024 8:55 AM, Craig A. Berry wrote:
On 7/21/24 4:41 AM, Subcommandante XDelta wrote:
It was not a kernel driver. It was a bad configuration file that
normally gets updated several times a day:
https://www.crowdstrike.com/blog/falcon-update-for-windows-hosts-technical-details/
So not a driver.
But I will not blame anyone for assuming that a .SYS file under C:\Windows\System32\drivers was a driver.
CrowdStrike thought updating the entire world in an instant was a good
idea. While no one wants to sit there vulnerable to a known threat for
any length of time, I suspect that idea will get revisited.
I have already seen speculation that IT security will decrease because
patch deployment speed will slow down.
Arne
PS: I don't like the product!
I have already seen speculation that IT security will decrease because
patch deployment speed will slow down.
As we speak, millions -- or even hundreds of millions -- of different Windows-based computers are now stuck in a doom-loop ...
On Sun, 21 Jul 2024 09:50:36 -0400, Arne Vajhøj wrote:
I have already seen speculation that IT security will decrease because
patch deployment speed will slow down.
Consider that non-CrowdStrike customers, and even non-Windows-using CrowdStrike customers, were not affected.
Therefore, would not a more logical conclusion be: “don’t put all your eggs in one basket”? Spread your Windows systems around different security providers, and perhaps make more use of non-Windows systems?
But do be aware that a few months ago Crowdstrike bricked a bunch of
Linux and Mac boxes.
On Mon, 22 Jul 2024 15:38:37 +1000, Gary R. Schmidt wrote:
But do be aware that a few months ago Crowdstrike bricked a bunch of
Linux and Mac boxes.
Any details?
On 2024-07-21, Craig A. Berry <craigberry@nospam.mac.com> wrote:
On 7/21/24 4:41 AM, Subcommandante XDelta wrote:
The problem here is that Crowdstrike pushed out an evidently broken
kernel driver that locked whatever system that installed it in a
permanent boot loop. The system would start loading Windows, encounter
a fatal error, and reboot. And reboot. Again and again. It, in
essence, rendered those machines useless.
It was not a kernel driver. It was a bad configuration file that
normally gets updated several times a day:
https://www.crowdstrike.com/blog/falcon-update-for-windows-hosts-technical-details/
If it's something that can stop the system from booting, then it _should_
be treated as if it _was_ a kernel driver.
IOW, what on earth happened to the concept of a Last Known Good boot to
automatically recover from such screwups? Windows 2000, over 2 decades
ago, had an early version of the LKG boot concept, for goodness sake.
What _should_ have happened, and what should have been built into Windows
years ago as part of the standard procedure for updating system
components, is that the original versions of the files used during the
last good boot are preserved in a backup until the next successful boot.
After that, the preserved files are overwritten with the updated
versions. OTOH, if the next boot fails, the last known good configuration
is restored and another reboot is done, but exactly _once_ only. (If the
LKG boot also fails, then it's probably a hardware failure or some other
external factor.)
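The scheme above amounts to a small piece of bookkeeping around each
boot attempt. A sketch in Python (hypothetical names throughout -- this
is not how Windows actually implements LKG, just the logic being
proposed):

```python
# Keep the files from the last good boot in a backup until the next boot
# succeeds; on a failed boot, restore that backup exactly once.

class BootGuard:
    def __init__(self):
        self.current = {}           # files the next boot will use
        self.backup = None          # files from the last good boot
        self.restored_once = False

    def apply_update(self, files):
        """Stage an update; the pre-update files remain backed up."""
        self.backup = dict(self.current)
        self.current = dict(files)
        self.restored_once = False

    def boot(self, boots_ok):
        """boots_ok(files) stands in for actually attempting the boot."""
        if boots_ok(self.current):
            self.backup = None      # success: commit the update
            return "booted"
        if self.backup is not None and not self.restored_once:
            self.current = self.backup   # fall back to last known good...
            self.restored_once = True    # ...but exactly once
            return self.boot(boots_ok)
        return "boot failure"       # LKG also failed: likely hardware

guard = BootGuard()
guard.current = {"sensor.sys": "good"}
boots_ok = lambda files: all(v == "good" for v in files.values())
guard.boot(boots_ok)                       # establish a good boot

guard.apply_update({"sensor.sys": "bad"})  # the broken push arrives
print(guard.boot(boots_ok))                # prints "booted"
```

One automatic rollback instead of an endless boot loop; the single-retry
limit is what distinguishes a recoverable bad update from genuinely
broken hardware.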
On 22/07/2024 16:03, Lawrence D'Oliveiro wrote:
On Mon, 22 Jul 2024 15:38:37 +1000, Gary R. Schmidt wrote:
But do be aware that a few months ago Crowdstrike bricked a bunch of
Linux and Mac boxes.
Any details?
https://www.theregister.com/2024/07/21/crowdstrike_linux_crashes_restoration_tools/
https://www.neowin.net/news/crowdstrike-broke-debian-and-rocky-linux-months-ago-but-no-one-noticed/
On Sun, 21 Jul 2024 19:41:06 +1000, Subcommandante XDelta wrote:
As we speak, millions -- or even hundreds of millions -- of different
Windows-based computers are now stuck in a doom-loop ...
Microsoft’s count was 8.5 million. Not that huge a number, really.
Maybe it’s a sign that not as many people depend on Windows for such mission-critical systems as you might expect.
On Sun, 21 Jul 2024 19:41:06 +1000, Subcommandante XDelta wrote:
As we speak, millions -- or even hundreds of millions -- of different
Windows-based computers are now stuck in a doom-loop ...
Microsoft’s count was 8.5 million. Not that huge a number, really.
Maybe it’s a sign that not as many people depend on Windows for such mission-critical systems as you might expect.
BTW, I found this while trying to find out more about the company and
I wonder if they are planning to update it anytime soon to tone it down:
https://www.crowdstrike.com/careers/diversity-equity-and-inclusion/
They talk a lot about how they make people feel good about themselves,
but nothing about how they cultivate people to produce robust reliable software.
That page above seems seriously OTT, so I just hope their development processes are engineering-based instead of feeling-based, given how
critical a company they have become.
BTW, I found this while trying to find out more about the company and
I wonder if they are planning to update it anytime soon to tone it down:
https://www.crowdstrike.com/careers/diversity-equity-and-inclusion/
They talk a lot about how they make people feel good about themselves,
but nothing about how they cultivate people to produce robust reliable software.
But if you combine "the big problem", "the Linux problem",
and "the Windows CPU usage problem", which are three big problems
within a few months, then I would say that they have
"room for improvement in software quality".
:-)
On 7/22/24 7:48 AM, Simon Clubley wrote:
BTW, I found this while trying to find out more about the company and
I wonder if they are planning to update it anytime soon to tone it down:
https://www.crowdstrike.com/careers/diversity-equity-and-inclusion/
They talk a lot about how they make people feel good about themselves,
but nothing about how they cultivate people to produce robust reliable
software.
I think it's the other way around. They have a bad reputation for how
they treat their employees so have made efforts to correct that image.
None of which is relevant to policies around testing a new configuration before deploying to the entire world all at once.
https://www.neowin.net/news/crowdstrike-broke-debian-and-rocky-linux-months-ago-but-no-one-noticed/
It was config for and impacting behavior of kernel code.
On Mon, 22 Jul 2024 08:54:36 -0400, Arne Vajhøj wrote:
It was config for and impacting behavior of kernel code.
And it was not subject to the configuration option for turning off
automatic updates. Updates for these files were forced through anyway.
A meditation on the Antithesis of the VMS Ethos, and the DEC way.
... with occasionally-intractable results. Such as trying to stuff a
modern and robust password hash into an eight-byte field.
As for the referenced mess, CrowdStrike was basically testing in
production, and seemingly lacked any sort of continuous integration ...