What in the world happened to the Internet yesterday?
July 19th went down in history as the day of the largest IT outage, which affected 8.5 million computers. So what happened, and how can we prevent another?
Yesterday, millions of Windows computers crashed. Even today, many work-from-home devices and businesses are still dealing with the effects. The outage took down airlines, 911 services, healthcare practices, and local businesses indiscriminately. So, what happened? Could this have been prevented? How can we move forward?
Clearing Misconceptions
First and foremost, this was not an internet outage, nor was it a Microsoft outage or bug. Major news outlets quickly mislabeled the situation. This was a bug in CrowdStrike’s endpoint protection software, CrowdStrike Falcon. CrowdStrike is not a Microsoft company. It is a cybersecurity company whose endpoint protection is deployed on millions of computers. CrowdStrike is widely known for two reasons. First, its products fulfill numerous compliance requirements for healthcare, government, and banking regulations, which has made it one of the top recommended endpoint protection solutions for these industries. Second, it has built a strong reputation for trust and reliability, having prevented many cyber-attacks through its products and research.
The update that broke the internet
Between roughly 11pm and 12:30am CDT on the night of July 18th to 19th, CrowdStrike pushed an update to all Windows computers running Falcon sensor version 7.11 and above. The problem began manifesting shortly after this update was pushed, suggesting that the new code or configuration it introduced was directly responsible. It’s important to note that without an official statement from CrowdStrike, some of these details are based on observations and reports from affected users and IT professionals.
Preliminary investigations point to a null pointer error in the CrowdStrike Falcon driver. A null pointer error happens when code attempts to read data through a pointer that is null, meaning it does not refer to any valid memory. Checks are typically implemented to ensure that a pointer is not null before the code accesses the data behind it. In this case, it seems that check was skipped, leading to the infamous “Blue Screen of Death” (BSOD) and forcing the system to crash and restart. Because the error kept recurring, affected computers restarted repeatedly throughout the day.
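To make the missing check described above concrete, here is a minimal sketch in Python. It is an analogy only: Python has no raw pointers, so None stands in for a null pointer, and the function names and the rules data are hypothetical. This is not CrowdStrike's code. In a kernel driver, the unchecked version would crash the whole operating system instead of raising an exception.

```python
from typing import Optional


def read_rule_unchecked(rules: list, index: int) -> str:
    # No check: if the entry is None, subscripting it raises TypeError,
    # the rough Python analogue of dereferencing a null pointer.
    return rules[index]["pattern"]


def read_rule_checked(rules: list, index: int) -> Optional[str]:
    # Guard the index and the entry before touching the data.
    if index < 0 or index >= len(rules):
        return None
    entry = rules[index]
    if entry is None:
        return None
    return entry.get("pattern")


# Hypothetical data: the second slot was never filled in.
rules = [{"pattern": "abc"}, None]

try:
    read_rule_unchecked(rules, 1)
except TypeError as err:
    print("unchecked read failed:", err)

print("checked read returned:", read_rule_checked(rules, 1))
```

The checked version gives the caller a chance to skip or report the bad entry instead of failing outright, which is the kind of guard the paragraph above refers to.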
Could this have been prevented?
Absolutely, but there is nothing IT departments or Managed IT Service Providers could have done.
First, it could’ve been prevented at CrowdStrike. It looks like a portion of the update was untested, which would be a simple human error. A developer could’ve skipped the proper procedure and pushed untested code at the last second, or QA may not have completed proper testing of the update. For that answer, we need to wait for CrowdStrike’s official root cause announcement. CrowdStrike also could’ve had a stronger phased rollout policy, so that smaller groups received the update over the course of several days. There still would’ve been some outages, but they could’ve been much more limited in scope.
Second, there is a chance that Microsoft could’ve stopped this issue within its own operating system. This portion gets a little more technical. The x86 CPU architecture defines four “rings” with different levels of permission. The rings are numbered 0 through 3, with Ring 0 having the most permissions and Ring 3 the least. Windows only uses Ring 0 and Ring 3, which means software can only choose between full access and very limited access. CrowdStrike Falcon runs at Ring 0.
Many Linux distributions add another layer of protection in their OS. They use Mandatory Access Control (MAC) for applications installed on the system. One example is AppArmor on Debian and Ubuntu distributions. AppArmor is a security module in the Linux kernel that limits the capabilities of applications, with the goal of preventing a compromised application from compromising or crashing the entire system. While Windows 10 and 11 have Mandatory Integrity Control, which controls access to securable objects, they still lack Mandatory Access Controls like AppArmor. We can speculate that such a control could’ve prevented the CrowdStrike bug from taking down computers. However likely that is, we won’t know for sure until something is in place to test it.
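The phased rollout idea raised earlier in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of how CrowdStrike distributes updates: the host IDs, the function names, and the 1/10/50/100 percent schedule are all hypothetical. Each host is hashed into a stable bucket from 0 to 99, and a host receives the update only once the rollout percentage passes its bucket.

```python
import hashlib


def rollout_bucket(host_id: str, buckets: int = 100) -> int:
    # Hash the host ID so each machine lands in a stable bucket 0..buckets-1.
    digest = hashlib.sha256(host_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def should_receive(host_id: str, percent_enabled: int) -> bool:
    # A host gets the update once the rollout percentage passes its bucket.
    return rollout_bucket(host_id) < percent_enabled


# Hypothetical schedule: widen the audience only if earlier stages stay healthy.
ROLLOUT_SCHEDULE = [1, 10, 50, 100]

hosts = [f"host-{n}" for n in range(1000)]
for percent in ROLLOUT_SCHEDULE:
    count = sum(should_receive(h, percent) for h in hosts)
    print(f"{percent}% stage: {count} of {len(hosts)} hosts")
```

Because the bucket depends only on the host ID, a machine that received the update at an early stage stays included at every later stage, and a faulty update caught at the first stage would reach only a small share of hosts.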
Why is this not my IT department’s fault?
There has been a lot of talk about testing updates. That is 100% true: your IT department should test updates. There is a problem with that, though. You can’t test updates you don’t control. According to reports, this was a config change outside of the normal agent patching, so even if you had automated patching turned off, you were impacted. Pushing updates without the knowledge or control of partners and customers is a major breach of trust.
How can we move forward?
Yesterday’s incident serves as a reminder that no vendor is immune to mistakes, and every technology solution carries inherent risks. Be extremely cautious of threat actors taking advantage of the situation. We have linked a CrowdStrike article in the sources below describing some of the malicious websites used to target customers. Additionally, here are some ideas on what we can do to be more prepared for a similar event.
- Improve incident response plans: We saw communication quickly break down once computers stopped working. Businesses need to review and update their incident response plans annually. The plans should include scenarios involving widespread outages caused by a third-party vendor, and they should confirm that clear communication channels and backup procedures are in place.
- Regularly assess vendor risk: Conduct thorough and ongoing assessments of all critical vendors, including their update processes, security practices, and transparency in communicating issues.
The CrowdStrike incident was disruptive and very concerning. We need to take it as a reminder and an opportunity for improvement. In today’s world of SaaS and Cloud, your business is easily affected by others’ mistakes. Carefully review those you bring into your environment.
Sources:
https://www.crowdstrike.com/blog/falcon-update-for-windows-hosts-technical-details/
https://www.crowdstrike.com/blog/falcon-sensor-issue-use-to-target-crowdstrike-customers/
https://en.wikipedia.org/wiki/Protection_ring
https://ubuntu.com/server/docs/apparmor
https://learn.microsoft.com/en-us/windows/win32/secauthz/mandatory-integrity-control