The CrowdStrike Outage: What Actually Happened Inside 8.5 Million Machines

On 19 July 2024 at 04:09 UTC (Coordinated Universal Time), 8.5 million Windows machines crashed. Airlines grounded flights, hospitals diverted patients, banks locked customers out, and emergency dispatch systems went dark. CrowdStrike identified the cause and reverted the file within 78 minutes. But every affected machine was already stuck in a boot loop, crashing before it could reach the network to download the fix. Each one needed someone to physically sit at it, boot into Safe Mode, and delete a single file.
The fix took 78 minutes, but the recovery took 10 days.
Microsoft reported that less than 1% of all Windows machines were affected. That 1% included enough critical infrastructure to make global headlines for a week.
What everyone thinks happened
The media version was clean: CrowdStrike pushed a bad software update that crashed millions of computers. Some outlets went further and called it a cyberattack. Others blamed Microsoft for allowing third-party kernel drivers at all. On security forums, people decided it was a buffer overflow that could be weaponised for remote code execution.
All of those are wrong, and the distinction matters. What actually happened involves a five-month-old dormant bug, a configuration file that isn't even executable code, and a kernel architecture that every endpoint security product on the market uses. Understanding why it happened means understanding how Windows separates trusted code from everything else, and why your EDR (endpoint detection and response) vendor has more access to your machine than you do.
How Windows separates trusted code from everything else
Windows uses privilege rings to separate trusted code from untrusted code. Ring 0, at the centre, is kernel mode, where the operating system kernel and device drivers run with unrestricted access to hardware, memory, and every process on the machine.
Ring 3, at the outer edge, is user mode. Applications live here: browsers, email clients, Office. Ring 3 code can't directly touch hardware or other processes' memory. It requests access through the system call interface, and the kernel decides whether to grant it.
When a user-mode application crashes, Windows terminates that process while everything else keeps running. You lose the app and maybe some unsaved work, then restart it.
When kernel-mode code crashes, there is no safety net above it because the kernel is the safety net. If it tears, Windows can't guarantee data integrity, security boundaries, or predictable behaviour. The memory that was accessed could belong to another driver, to the filesystem, or to the kernel itself. Continuing to run risks silent data corruption or security boundary violations, so the only responsible action is to halt the entire system immediately.
That halt is a Blue Screen of Death (BSOD), and it is not a failure of error handling but the correct response to an unrecoverable situation. That distinction explains why 8.5 million machines didn't just crash and recover. They crashed, tried to reboot, and crashed again, because the faulty component loaded before anything that could intervene.
Why CrowdStrike runs at kernel level
CrowdStrike's Falcon sensor runs at ring 0 through its kernel driver, CSAgent.sys, which is classified as a boot-start driver using the ELAM (Early Launch Anti-Malware) architecture and loads before user-mode processes start.
That is by design because the threats Falcon is built to catch operate at that level. Bootkits modify the boot process before Windows fully loads. Rootkits hide inside the kernel to intercept system calls and conceal their presence from user-mode tools. If your security product only runs in user mode, it starts after these threats are already active. You can't detect kernel-level threats from user mode, which is why every EDR product on the market runs at kernel level.
SentinelOne, Microsoft Defender for Endpoint, Carbon Black, and Sophos all run kernel drivers in the same way. CrowdStrike uses Microsoft's documented APIs: Filter Manager for file system monitoring, Registry Filtering for registry changes, and Process Manager Callbacks for tracking what executes. CSAgent.sys has passed WHQL (Windows Hardware Quality Labs) certification, is compatible with HVCI (Hypervisor-Protected Code Integrity), and CrowdStrike is a Microsoft Virus Initiative member.
This is standard architecture for endpoint security and the only way to do the job properly. But it means that if CSAgent.sys crashes during boot, the machine can't get past it. The driver loads before the network stack, before user-mode services, and before anything that could download a fix.
What Channel File 291 actually is
The thing that triggered the crash was not a software update but a channel file.
Channel File 291 contains IPC (Inter-Process Communication) Template Instances: detection patterns that tell the Falcon sensor what to look for when monitoring named pipes. Named pipes are an inter-process communication mechanism on Windows, and specific C2 (command and control) frameworks create distinctive named pipe patterns. Those patterns are what these templates detect.
Despite the .sys extension, channel files are not executable code and can't allocate memory, access arbitrary memory locations, or create instructions. They are structured data, essentially a rulebook that tells the sensor what to look for, and the rulebook doesn't run anything itself.
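To make the data-not-code distinction concrete, a channel file can be pictured as a list of match rules that a separate engine interprets. This Python sketch is an analogy only: the real channel file format is proprietary and binary, and the rule fields and pipe patterns below are invented for illustration.

```python
import fnmatch

# Hypothetical rulebook: detection patterns as pure data, not executable code.
# The rules describe what to look for; they cannot run anything themselves.
RULES = [
    {"id": 1, "pipe_pattern": r"\\.\pipe\msagent_*"},  # invented C2-style pattern
    {"id": 2, "pipe_pattern": r"\\.\pipe\postex_*"},   # invented C2-style pattern
]

def match_pipe(pipe_name: str) -> list[int]:
    """Return the ids of rules whose pattern matches the named pipe."""
    return [r["id"] for r in RULES if fnmatch.fnmatch(pipe_name, r["pipe_pattern"])]
```

The engine that reads this data is where any bug lives; the data itself only describes patterns, which is why reverting the file rather than patching the sensor was the immediate fix.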
CrowdStrike protects the delivery of these files with TLS certificate pinning on the download channel, SHA256 checksums to verify integrity, and admin ACLs (Access Control Lists) to prevent local modification. The channel file was not tampered with and was entirely legitimate, but it exposed a bug that had been sitting dormant for five months.
The IPC Template Type in Channel File 291 defined 21 input parameter fields, but the sensor's Content Interpreter was coded to supply only 20 values. That single field-count mismatch is the entire incident.
The five-month dormant bug
That field-count mismatch had existed since February 2024, five months before the outage. It didn't crash anything because every template instance deployed in that period used wildcard matching on field 21. A wildcard doesn't need to read the value because it matches anything, so the Content Interpreter never tried to access the 21st element and the bug sat dormant.
On 19 July at 04:09 UTC, CrowdStrike deployed two new template instances that used a specific, non-wildcard match on field 21. For the first time, the Content Interpreter tried to read value number 21 from an array that only held 20 elements.
The result was an out-of-bounds memory read, where the code reached past the end of its own data. In user mode, that kills the process and nothing else, but in kernel mode at ring 0 it triggers an immediate Blue Screen with no warning, no error dialogue, and no graceful shutdown.
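The dormancy mechanism can be shown with a user-mode analogy, using invented names: the interpreter holds 20 values, a template defines 21 fields, and only a non-wildcard match on field 21 forces a read past the end of the array. In Python that read raises an IndexError; at ring 0 it is an unrecoverable memory access violation.

```python
WILDCARD = "*"

def evaluate(template_fields: list[str], input_values: list[str]) -> bool:
    """Compare each template field against its input value.

    A wildcard field matches without reading the value, which is why a
    20-element input array survived 21-field templates for five months.
    """
    for i, field in enumerate(template_fields):
        if field == WILDCARD:
            continue                    # value never read: bug stays dormant
        if input_values[i] != field:    # field 21 -> index 20 -> IndexError
            return False
    return True

values = [f"v{i}" for i in range(20)]            # interpreter supplies 20 values
template_feb = ["*"] * 21                        # wildcard on field 21: harmless
template_jul = ["*"] * 20 + ["specific-match"]   # reads index 20: out of bounds
```

Evaluating `template_feb` returns normally; evaluating `template_jul` raises an IndexError on the 21st field, the analogue of the kernel-mode crash.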
And because this is a boot-start driver, the machine reboots, CSAgent.sys loads again before user mode, reads the same cached channel file, hits the same out-of-bounds read, and crashes again. That is the boot loop, and that is why 78 minutes of exposure turned into 10 days of recovery.
Why the server-side fix didn't help
CrowdStrike reverted Channel File 291 on their servers at 05:27 UTC, 78 minutes after the faulty deployment. Any machine that was powered off or hadn't pulled the update in that window was fine.
But for machines already in a boot loop, the server-side revert made no difference. CSAgent.sys crashes before the machine gets far enough in the boot process to establish a network connection. It can't download the fix because it can't reach the network. The local cache still holds the faulty channel file, and the sensor reads from the cache every boot.
The manual fix was straightforward in theory: boot into Safe Mode, navigate to C:\Windows\System32\drivers\CrowdStrike\, delete the files matching C-00000291*.sys, and reboot normally.
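The deletion step can be sketched as a script, assuming the default install path quoted above. In practice it had to be performed by hand in Safe Mode on each machine, and the exact path can vary per deployment.

```python
from pathlib import Path

# Default Falcon driver directory from CrowdStrike's remediation guidance.
DRIVER_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")

def remove_channel_file_291(driver_dir: Path = DRIVER_DIR) -> list[str]:
    """Delete every cached copy of Channel File 291 and report what was removed."""
    removed = []
    for f in driver_dir.glob("C-00000291*.sys"):
        f.unlink()
        removed.append(f.name)
    return removed
```

The glob matters: there can be several versioned copies of the file, and all of them must go, while every other channel file must be left intact.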
In practice, BitLocker created a circular dependency. If the drive is encrypted (and it should be for any business taking security seriously), you can't access Safe Mode without the BitLocker recovery key. That key is typically stored in Active Directory or Azure AD, and those systems were running on machines that were also stuck in the same boot loop.
The system you need to access the recovery keys was itself down because of the same bug, and nobody planned for that because nobody expected every machine to fail at once.
All 8.5 million affected machines needed physical access or pre-staged recovery keys. Microsoft deployed hundreds of engineers directly to affected customers and released a USB recovery tool for offline repair. Only Windows 10 and later running Falcon sensor v7.11 or above were affected, while Linux and macOS were entirely unaffected. But for the machines that were affected, the only path was manual intervention. It took 10 days to get 99% of them back online.
Myth vs fact
Myth: CrowdStrike pushed a bad software update.
Channel File 291 is configuration data rather than a software update, and the Falcon sensor binary didn't change at all. CrowdStrike deployed new detection patterns, new rules for the sensor to follow when monitoring named pipes. Those rules exposed a pre-existing bug in the Content Interpreter that had been dormant since February 2024. The distinction between a code change and a configuration change is not pedantic. A software update goes through a different testing pipeline than a configuration change. The fact that this was "just" configuration data is actually part of the problem. It suggests the validation for configuration changes wasn't as rigorous as the validation for code changes. That is a process gap, and it is the kind of gap that shows up in pen test findings regularly.
Myth: It was a buffer overflow that hackers could exploit.
It was an out-of-bounds read, not a write, meaning the code tried to read from an address beyond the array boundary rather than writing data past a buffer. CrowdStrike and independent third-party security vendors analysed the bug and confirmed that the out-of-bounds read cannot be leveraged to escalate privileges or achieve remote code execution. You can't turn this particular type of read violation into arbitrary code execution; it crashes the system rather than handing an attacker control.
Myth: It was a cyberattack.
There was no adversary and no malware involved. The Content Configuration System didn't verify that template instances matched the field count defined by the template type, which makes this a testing gap rather than an attack. But threat actors did exploit the chaos. CISA issued an advisory warning about phishing campaigns targeting affected organisations during the outage window. Malicious ZIP (compressed archive) files circulated pretending to be CrowdStrike fixes, and fraudulent domains impersonating CrowdStrike appeared within hours. A major outage creates confusion, urgency, and people looking for quick fixes. Those are exactly the conditions that make phishing effective.
Myth: Microsoft was responsible.
The crash occurred inside CrowdStrike's kernel driver rather than in Windows code, and Windows didn't push the channel file, write the Content Interpreter code, or deploy the template instances. Microsoft's architecture allows third-party kernel drivers because that is how every EDR product works, how hardware drivers work, and how the entire ecosystem of security tools operates. Microsoft's actual response was to deploy hundreds of engineers and release a USB recovery tool.
What would have stopped this
Three things would have caught this before it reached production environments.
Field-count validation in the Content Configuration System. The template type defined 21 fields. The Content Interpreter supplied 20. A compile-time or deployment-time check that field counts match would have caught the mismatch when it was introduced in February 2024, five months before it caused damage. This is basic input validation, the kind of thing that appears in OWASP guidelines for web applications. The same principle applies to kernel-level configuration systems.
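The missing check is simple to state: at build or deployment time, a template type must not declare more fields than the interpreter supplies values for. A minimal sketch, with invented names and the constant 20 taken from the incident details above:

```python
INTERPRETER_VALUE_COUNT = 20   # values the sensor's Content Interpreter supplies

def validate_template_type(name: str, declared_fields: int,
                           supplied_values: int = INTERPRETER_VALUE_COUNT) -> None:
    """Fail deployment if a template type declares more fields than the
    interpreter supplies values for -- the check that was missing."""
    if declared_fields > supplied_values:
        raise ValueError(
            f"{name}: declares {declared_fields} fields but the interpreter "
            f"supplies only {supplied_values} values"
        )
```

Run against the February 2024 template type, this check fails loudly at deployment time instead of five months later at ring 0.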
Staged rollouts for configuration changes. If the channel file had been deployed to 1% of sensors first, the boot loops would have appeared on a small number of machines. CrowdStrike could have reverted before the file reached the full fleet. Before this incident, channel file updates went to the entire fleet simultaneously. That approach assumed configuration changes couldn't cause crashes, and this incident proved otherwise.
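Staged rollout reduces blast radius mechanically: ship to a deterministic slice of the fleet, watch crash telemetry, then widen. A simplified sketch; the hashing scheme, sensor ids, and percentages are invented for illustration.

```python
import hashlib

def in_canary(sensor_id: str, percent: float) -> bool:
    """Deterministically place a sensor in the first `percent` of the fleet
    by hashing its id, so the same sensors canary every rollout stage."""
    h = int(hashlib.sha256(sensor_id.encode()).hexdigest(), 16)
    return (h % 10_000) < percent * 100

def rollout_targets(sensor_ids: list[str], percent: float) -> list[str]:
    """Sensors that should receive the update at this rollout stage."""
    return [s for s in sensor_ids if in_canary(s, percent)]
```

Because the hash is stable, widening from 1% to 10% to 100% is strictly cumulative: a fault that boot-loops the 1% canary slice never reaches the rest of the fleet.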
Independent recovery infrastructure. BitLocker recovery keys stored on systems that depend on the same software that crashed created a circular dependency. Recovery infrastructure needs to be independent of the infrastructure it recovers. If your BitLocker keys are in Active Directory and your domain controllers are affected by the same outage, you can't unlock the drives you need to fix. That's a disaster recovery gap that most organisations hadn't tested because simultaneous failure of the entire estate seemed implausible.
What changed after
CrowdStrike published a full root cause analysis on 6 August 2024. They committed to improving content validation to catch field-count mismatches before deployment and to staged rollouts for all channel file updates. A faulty configuration now affects a small percentage of sensors before reaching the full fleet. That is a significant process change that addresses the core validation gap.
The incident accelerated an industry-wide discussion about whether security tools need full kernel access for every function. Microsoft opened discussions about providing more user-mode capabilities to security vendors, reducing the blast radius when things go wrong. That is a long-term architectural shift, not a quick fix, but the conversation is happening because of this outage.
For businesses, the practical question is straightforward: where are your BitLocker recovery keys stored, and can you access them if Active Directory is down? Do you know how your EDR vendor validates configuration changes before pushing them to your estate? Do you have a documented process for what happens if a vendor update crashes everything at once? CrowdStrike's recovery took 10 days, and it is worth asking how long yours would take.