Crowdstrike issues

Some thoughts

I was asked about today’s Crowdstrike issues on Windows.

Naturally I have some thoughts…

What went wrong? What I know.

Crowdstrike is an EDR (Endpoint Detection and Response) tool. (Well, they claim “XDR”, but that’s marketing). It has an agent component and a set of rule sets (called “channel files”). The agent has both user space and kernel space components to give better visibility into what is happening on the machine, and to be able to block bad things. It’s not an AV solution (“Oh, this is a known bad file; we’ll block it”); instead it intercepts calls at runtime and determines whether the behaviour is legitimate or not.

So Crowdstrike can act both as a visibility tool (it can tell you every process ever executed on your machines) and a defense tool (block bad stuff).
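To make that split concrete, here’s a toy sketch in Python. It is purely illustrative (this is not how Crowdstrike’s agent is actually written, and the rule names are made up); the point is the separation between the agent, which watches runtime events, and the channel files, which are just data describing what to flag or block and get updated far more often than the agent itself.

```python
# Conceptual sketch only -- not CrowdStrike's real engine. It illustrates
# behavioural rules shipped as data (the "channel file" role) evaluated
# against runtime events, rather than signature matching on known-bad files.
from dataclasses import dataclass

@dataclass
class ProcessEvent:
    image: str    # executable that raised the event
    parent: str   # parent process
    action: str   # what it tried to do, e.g. "write_lsass_memory"

# Hypothetical behavioural rules, analogous to channel-file content.
RULES = [
    {"id": "R-001", "action": "write_lsass_memory", "verdict": "block"},
    {"id": "R-002", "action": "spawn_shell_from_office_app", "verdict": "block"},
]

def evaluate(event: ProcessEvent) -> str:
    """Record every event (visibility) and block anything a rule flags (defence)."""
    print(f"telemetry: {event.parent} -> {event.image} ({event.action})")
    for rule in RULES:
        if event.action == rule["action"]:
            return f"blocked by {rule['id']}"
    return "allowed"

print(evaluate(ProcessEvent("word.exe", "explorer.exe", "spawn_shell_from_office_app")))
```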

Now, in common with a lot of tools like this, we need to keep both the agent and the channel files up to date; the agent needs to handle platform changes (e.g. a change in drivers or kernel structures) and the channel files need to be refreshed to handle new threats and attacks.

Many organisations have requirements around the roll-out of new agents (e.g. test in a lab and then do a risk-based deployment: to dev, then qa/cat, and finally to prod) to minimise production outages.

In the case of Crowdstrike the channel files are automatically updated, and this is where the problem occurred. Something in the channel files deployed at 04:09 UTC broke Windows, causing it to BSOD.

Crowdstrike deployed a new set of files at 05:27 UTC, so any machine off the network in that window should not have been impacted. Hopefully!

Complications

This isn’t a one-off BSOD; it also crashes the machine early in the reboot process, potentially too early for the agent to pick up the fixed update. (I’ve seen a tweet where someone claimed multiple reboots did eventually resolve it, but that was luck, and a one-off.) So, essentially, this will require human intervention on every affected machine.

Crowdstrike have provided instructions on how to fix this by deleting the bad file.
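For illustration, the cleanup boils down to something like this rough Python sketch. It assumes the widely reported C-00000291*.sys channel-file pattern, and in reality the work is done by hand from safe mode or the recovery console; follow the vendor’s actual instructions, not this.

```python
# Rough illustration of the published workaround, not an official tool: boot the
# machine into safe mode / the recovery environment, then remove the bad channel
# file. The C-00000291*.sys pattern is the one widely reported at the time --
# treat it as an assumption and defer to the vendor advisory for the real steps.
from pathlib import Path

DRIVER_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")

def remove_bad_channel_files(pattern: str = "C-00000291*.sys") -> None:
    for f in DRIVER_DIR.glob(pattern):
        print(f"deleting {f}")
        f.unlink()

if __name__ == "__main__":
    remove_bad_channel_files()
```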

OK, that’s painful but it’s not impossible. Especially on servers.

More worrying could be the end user devices (laptops, VDIs, etc) that are impacted. These instructions are not the sort of thing you’d expect your typical end user to perform; they may not even have the necessary knowledge or permissions (no local admin access). A further complication is that end user devices are more likely to have BitLocker on them! Devices in the office can have help desk people run around and do the work, but what about those working from home? Or salesmen in the field?

Because of when the bad files were pushed, we can hope that a lot of these machines were shut down or in standby mode and so missed the update! Well, in the EMEA and AMER regions, anyway. APAC will have been in the middle of the work day.

Haven’t we seen this before?

Yes! In 2019 a Symantec SEP update did almost exactly the same thing. AV tools have, in the past, also tried quarantining Windows components because of bad signature files.

So we shouldn’t auto-download new rules?

Agents should definitely have a planned deployment. But what about the rules? This is where we need to look at the risks.

So what do up-to-date rules give us? Protection against the latest attacks; you may be defended before you’re even aware of the threat, because companies like Crowdstrike have lots of threat intelligence streams, see indicators of compromise, and create rules to defend against them very early! It may not defend you against a novel zero-day attack, but it should prevent a day-1 attack.

The question you need to ask yourself is how many of these attacks have been defended against, and how much it would have cost to recover (dollars, reputation, lost current/future customers…). How quickly would a compromise spread? Do you have secondary controls that would minimise impact?

You do have a measure of that, right? You are monitoring your EDR logs to see what has been prevented, and which of those were day-1 attacks? Many potential attacks may already be prevented by other tools (e.g. mail gateways blocking malware), so the day-1 risk might not be as high as it could be. But it’s not zero, otherwise we wouldn’t need tools like this!
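Getting that measure doesn’t have to be fancy. Something like the sketch below gives you a number to reason with; the export format and field names here are assumptions, so adapt it to whatever your tooling actually emits.

```python
# Back-of-the-envelope measurement: count what the EDR actually prevented over a
# period. The JSON-lines format and field names are assumptions.
import json
from collections import Counter

def summarise_detections(path: str) -> Counter:
    counts: Counter = Counter()
    with open(path) as fh:
        for line in fh:
            event = json.loads(line)
            if event.get("outcome") == "prevented":
                counts[event.get("technique", "unknown")] += 1
    return counts

# Example: how many preventions, and of what kind, in last quarter's export?
# print(summarise_detections("detections-2024Q2.jsonl").most_common(10))
```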

You also need to look into how quickly your team would be able to validate new channel files. Are you able to run a test suite daily? Or would new rules be delayed a week? How much can you automate this? Basically, how big would your vulnerability window be?
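The validation itself can be largely automated. A minimal sketch, with placeholder host names and a stand-in health check (swap in whatever smoke tests actually exercise your estate):

```python
# Sketch of the automation that keeps the "vulnerability window" small: let new
# rules land on a handful of canary machines, run a smoke-test suite against
# them, and record how long validation took. Hosts and the check command are
# placeholders, not real infrastructure.
import subprocess
import time

CANARIES = ["canary-01", "canary-02"]   # representative test machines
SMOKE_TEST = ["ping", "-n", "1"]        # stand-in for a real health check

def validate_new_rules() -> bool:
    start = time.time()
    for host in CANARIES:
        result = subprocess.run(SMOKE_TEST + [host], capture_output=True)
        if result.returncode != 0:
            print(f"{host} failed validation -- hold the rollout")
            return False
    print(f"all canaries healthy; validation took {time.time() - start:.0f}s")
    return True
```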

EDITED TO ADD based on feedback elsewhere: we also need to consider the cost of remediation. In this type of outage you have a level of comfort that no sensitive data was accessed. (Unless it’s a SolarWinds-type issue…) An exploit that wasn’t prevented due to having out-of-date rules has additional monetary costs (patching, investigation, regulatory reporting, law enforcement, etc) that may go on long after the event was detected.

These three factors (window, number of attacks prevented, cost) all weigh into the risk.

On the other side we have the risk of outages like this. The impact is massive, but since it seems almost everybody is impacted (yeah, an exaggeration, but not much of one), the chance of it causing you reputational damage and customer loss is a lot less.

What I do believe is that companies that provide rule sets like this (EDR, AV, etc) should give their customers the option to decide for themselves. Perhaps they could run their own “day 0” and “day 1 deferred” channels; a company could have most machines on “day 1” with a set of test machines on “day 0”; if a test machine blows up they can hit the big red switch to halt production deployments.
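From the customer side, the policy could be as simple as this entirely hypothetical sketch; the vendor would, of course, have to offer the deferred channels for real.

```python
# One way the "day 0" / "day 1 deferred" idea could look: most hosts on a
# deferred channel, a small day-0 test ring, and a big red switch that halts
# promotion if a day-0 machine blows up. Host names are placeholders.
RING_ASSIGNMENT = {
    "day0": ["test-win-01", "test-win-02"],   # take new rules immediately
    "day1": ["prod-web-01", "prod-db-01"],    # take them a day later
}

halt_production_rollout = False               # the big red switch

def on_day0_failure(host: str) -> None:
    """Called when a day-0 test machine crashes after a rule update."""
    global halt_production_rollout
    halt_production_rollout = True
    print(f"{host} died on the new channel files -- day-1 promotion halted")

def may_promote_to_day1() -> bool:
    return not halt_production_rollout
```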

So if we should keep doing updates, how do we mitigate this?

Traditional active-active or even warm-standby DR systems would have picked up the new rules and died at the same time as production. Oops! So if your vendor’s product does have staggered “day 0” and “day 1” type channels then it’s worth thinking about having the prod and DR systems on different schedules. It may even be worth putting DR ahead of production, so that if it dies it won’t cause an outage.

We also need to think about disk snapshots; could we roll back the machine to a state before it got the bad files? On a VM this is perfectly possible, but we need to be aware that any changes since the snapshot will be lost: audit logs, activity logs, etc. These might be critical (regulatory), so get those logs off the machine ASAP (stream them to a remote store; put them on a NAS share; something…). Try to keep what changes on a machine as small as possible.
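As a sketch of the “get the logs off the box as soon as they’re written” idea: the collector URL and log path below are placeholders, and in practice you’d use an existing log shipper (syslog forwarder, cloud logging agent, etc.) rather than rolling your own.

```python
# Minimal sketch: tail a log that must survive a snapshot rollback and forward
# each new line to a remote store. Fire-and-forget; a real shipper would add
# batching, retries, and backpressure.
import time
import urllib.request

LOG_FILE = r"C:\app\logs\activity.log"               # whatever must survive a rollback
COLLECTOR = "https://logs.example.internal/ingest"   # hypothetical remote store

def ship_new_lines() -> None:
    with open(LOG_FILE, "r") as fh:
        fh.seek(0, 2)                                # start at end of file
        while True:
            line = fh.readline()
            if not line:
                time.sleep(1)
                continue
            req = urllib.request.Request(COLLECTOR, data=line.encode(), method="POST")
            urllib.request.urlopen(req)              # forward the line off-machine
```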

Could we automate deployments? How quickly can you spin up new infrastructure and deploy your apps? If you’re in a modern shop with effective CI/CD pipelines this may already be part of your deployment process (especially in the cloud); recovery is “deploy new!”

Or, similarly, how much of the app needs to be deployed locally, vs stored on a NAS? If a new VM can be spun up and connects to a P: drive (P for aPP!) and auto-starts…

What about old-school cold-standby instances that just get sync’d once a day?

What about “dual boot” scenarios, similar to how appliances get handled? A TiVo (for example) has two OS partitions: an active one and an inactive one. When an update comes in it gets written to the inactive partition, validated, and then the “active” flag gets switched. We can’t quite do this with Windows, but we could perhaps have a second partition that’s sync’d from the active one on a nightly basis, with a boot loader that allows booting from that second partition.
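As a thought experiment, the A/B logic itself is tiny; this toy model just shows the flow (install to the inactive slot, validate, flip, or fall back), not anything Windows actually supports out of the box.

```python
# Toy model of the TiVo-style A/B scheme: write the update to the inactive slot,
# validate it, and only flip the "active" flag if validation passes; otherwise
# re-sync the inactive slot from the known-good active one.
slots = {"A": "known-good build", "B": "known-good build"}
active = "A"

def apply_update(new_build: str, validate) -> str:
    """Install to the inactive slot and flip only if validation passes."""
    global active
    inactive = "B" if active == "A" else "A"
    slots[inactive] = new_build
    if validate(slots[inactive]):
        active = inactive                    # flip the active flag
    else:
        slots[inactive] = slots[active]      # roll back: re-sync from active
    return active

# Example: a build that fails validation never becomes the boot partition.
print(apply_update("build-with-bad-channel-file", validate=lambda b: False))  # stays "A"
```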

End users also need to be thought about; if we make their devices as close to zero-permanence as possible (all data files stored on OneDrive or Google Drive or a NAS; nothing on C:) then, if a device dies, we can spin up new VDIs (even cloudy ones like Windows 365 or Amazon Workspaces) and get people working again. At least email and chat and the common stuff…

I’m just spit-balling ideas here!

There will always be edge cases, but if we can standardize and make the endpoint as “irrelevant” as possible (basically just provide compute power) then recovery becomes quicker.

Summary

It’s easy to point fingers at Crowdstrike and say “you fucked up”. Well, they did! It will be interesting to see what went wrong, and what controls they put in place to prevent a similar occurrence. Other companies in the same areas (EDR, AV, etc) also need to learn these lessons so they don’t suffer the same problems in the future.

But as a customer you don’t want to get nervous and change your processes in a knee-jerk reaction. You need to evaluate the risk. It’s very possible this once-in-a-blue-moon event is less damaging than other courses of action.

Get the data, evaluate your environment, determine what mitigations you can put in place and then come up with a plan. Don’t just take the easy path of deferring rule updates for a day or two (or seven!) because that might expose you to greater risk than getting regular updates pushed live to prod!

Having the option would be nice, though.