Software's HUGE Impact On The World | Crowdstrike Global IT Outage

  • Published 5 Sep 2024
  • Today a security update to a widely used cybersecurity product from a specialist company called CrowdStrike caused systems running Microsoft Windows to fail, preventing services of all kinds around the world from being delivered
    -
    ⭐ PATREON:
    Join the Continuous Delivery community and access extra perks & content! ➡️ bit.ly/Continu...
    🎥 Join Us On TikTok ➡️ / modern.s.engineering
    -
    👕 T-SHIRTS:
    A fan of the T-shirts I wear in my videos? Grab your own, at reduced prices EXCLUSIVE TO CONTINUOUS DELIVERY FOLLOWERS! Get money off the already reasonably priced t-shirts!
    🔗 Check out their collection HERE: ➡️ bit.ly/3Uby9iA
    🚨 DON'T FORGET TO USE THIS DISCOUNT CODE: ContinuousDelivery
    -
    BOOKS:
    📖 Dave’s NEW BOOK "Modern Software Engineering" is available as paperback, or kindle here ➡️ amzn.to/3DwdwT3
    and NOW as an AUDIOBOOK available on iTunes, Amazon and Audible.
    📖 The original, award-winning "Continuous Delivery" book by Dave Farley and Jez Humble ➡️ amzn.to/2WxRYmx
    📖 "Continuous Delivery Pipelines" by Dave Farley
    Paperback ➡️ amzn.to/3gIULlA
    ebook version ➡️ leanpub.com/cd...
    NOTE: If you click on one of the Amazon Affiliate links and buy the book, Continuous Delivery Ltd. will get a small fee for the recommendation with NO increase in cost to you.
    -
    CHANNEL SPONSORS:
    Equal Experts is a product software development consultancy with a network of over 1,000 experienced technology consultants globally. They increase the pace of innovation by using modern software engineering practices that embrace Continuous Delivery, Security, and Operability from the outset ➡️ bit.ly/3ASy8n0
    TransFICC provides low-latency connectivity, automated trading workflows and e-trading systems for Fixed Income and Derivatives. TransFICC resolves the issue of market fragmentation by providing banks and asset managers with a unified low-latency, robust and scalable API, which provides connectivity to multiple trading venues while supporting numerous complex workflows across asset classes such as Rates and Credit Bonds, Repos, Mortgage-Backed Securities and Interest Rate Swaps ➡️ transficc.com
    Semaphore is a CI/CD platform that allows you to confidently and quickly ship quality code. Trusted by leading global engineering teams at Confluent, BetterUp, and Indeed, Semaphore sets new benchmarks in technological productivity and excellence. Find out more ➡️ bit.ly/CDSemap...
    #softwareengineer #developer #crowdstrike #microsoft

Comments • 400

  • @DeagleNZ
    @DeagleNZ Před měsícem +54

    I think that the Post Incident Review for this will be the most anticipated PIR in technology ever!

  • @kermitzforg
    @kermitzforg Před měsícem +54

    Poorly tested security software is just as dangerous as the problems it purports to solve!

    • @CallousCoder
      @CallousCoder Před měsícem +2

      I would immediately dump this crap. Hell, I wouldn't even have bought it, because I don't trust Microsoft, let alone a third-party supplier, to push some crap in a kernel driver. If you do that then you need to be really, really sure that the code using it is highly secure, because it could literally be the key to the castle.

    • @kaasmeester5903
      @kaasmeester5903 Před měsícem +5

      @@CallousCoder It's very odd that this wasn't caught by testers at Crowdstrike, but... back in the day, the company I worked for did not trust these automatic patches, and anything coming out of Microsoft or 3rd party suppliers would be installed on active test environments first, and only after a few days observation, be pushed to production. The rollout was automatic but controlled by the company, not the supplier. And that was fairly common practice back then.

    • @doryan08
      @doryan08 Před měsícem +1

      This is the most accurate comment here.

    • @CallousCoder
      @CallousCoder Před měsícem +1

      @@kaasmeester5903 Indeed, don't trust the shit. That was the common stance back when mankind was still normal and thinking critically. People now allow critical systems to be patched automatically. We would never have allowed that in the past! And you shouldn't, especially with software that nests itself in the kernel; you want to know it doesn't throw off things with your specific workloads.

    • @grokitall
      @grokitall Před měsícem +2

      the fact is that mandatory automatic updates are in breach of any sensible security, stability or resilience policy.
      they should be able to say there is an update available, and you should be able to say that it does not get installed until it has been triggered by you.

  • @philipfisher8853
    @philipfisher8853 Před měsícem +102

    Lads asked gpt how to release code

    • @Meritumas
      @Meritumas Před měsícem +3

      wouldn't be surprised if a good chunk of their software is "crafted" by some sort of "gpt"

    • @flamfloz
      @flamfloz Před měsícem +1

      This gives me an idea... "Chat GPT, setup a Crowdstrike competitor but WITHOUT the outages!". I will soon be very rich, you suckers!

    • @everlynevins
      @everlynevins Před měsícem +1

      @flamfloz Wait until I gpt a competitor of yours. And it gives out free pizza, too!

    • @Meritumas
      @Meritumas Před měsícem

      @@everlynevins would prefer free ribeye or sirloin tbh

  • @farquoi
    @farquoi Před měsícem +54

    5:00 They did revert the patch very quickly. But the affected machines are stuck on a blue screen and need manual intervention.

    • @aureltrouts
      @aureltrouts Před měsícem +7

      Then it's not a rollback, it was just a "we prevented new devices from facing the same issue" strategy. Rollback means going back to a state known to be working and stable, without further operations, like nothing happened. This incident shows the importance of a rollback procedure (tested in a lower environment) in every single production deployment, even for the most anodyne ones.

    • @zomgoose
      @zomgoose Před měsícem +7

      It's a low-level driver. It caused a BSOD, requiring Recovery Mode to roll back. This requires remote access to the BIOS or physical presence. It's further complicated if BitLocker is implemented.

  • @RenaudRwemalika
    @RenaudRwemalika Před měsícem +28

    The craziest thing for me is on the client side. It means that all these big companies just update all their software without doing any check on what is coming into their systems. We are talking about highly critical systems like banks, airports, hospitals...

    • @GackFinder
      @GackFinder Před měsícem +3

      Very good point. Plenty of blame to go around.

    • @subapeter
      @subapeter Před měsícem +7

      This is my biggest question as someone who advises businesses on their IT processes. A vendor is just a vendor. At a certain risk level, you could allow them to make direct changes, but above that level, if they could impact your core business operation, you have to be willing to pay for your mitigation in some way (e.g. pay your internal IT to add some oversight and control). That's assuming they are consciously assessing that risk in the first place.

    • @LukasLiesis
      @LukasLiesis Před měsícem +1

      I wonder if that's not part of SOC 2 and/or similar requirements. Many huge corps auto-update things, whereas previously machines would otherwise sit without updates for years, which is usually riskier than auto-updating. Though slower rolling updates will probably start to happen after all this.

    • @anthonychurch1567
      @anthonychurch1567 Před měsícem +2

      It wasn't that they updated it. It was an automatic update. The big thing I think here is that for example we have duplication of servers, network devices, power backups etc. but all these companies have not even thought about something going wrong with security software. Microsoft could have provided an alternative security software to split the servers so they can failover to the ones using a different provider. However, for all these massive companies I wonder why they didn't even bother to have a multi-cloud when money is no object except for the NHS. Airports have no excuse to put all their eggs in the Azure basket.

    • @animanaut
      @animanaut Před měsícem

      I've seen IT departments previewing updates before letting them pass to their users. But I've also seen devs in the same company willy-nilly approving Dependabot PRs, so 🤷‍♂

  • @PaulHewsonPhD
    @PaulHewsonPhD Před měsícem +85

    The CEO of CrowdStrike was CTO of McAfee when its big outage happened in 2010. You'd have thought he had more incentive than anyone to insist on good engineering practice.

    • @staubsauger2305
      @staubsauger2305 Před měsícem

      Crowdstrike also said that the 2016 DNC hack was due to external agents rather than an internal whistleblower (Seth Rich) despite the data transfer rate clearly being at USB speeds rather than ethernet speeds. This contributed to the (now proven/debunked) 'Russian Collusion' hoax as a political dirty trick. The leadership of Crowdstrike are shady - if one does due diligence and cares to dig past the headlines.

    • @nielsenaaa
      @nielsenaaa Před měsícem +3

      really ? haha

    • @nd5Ip3p0Mu
      @nd5Ip3p0Mu Před měsícem +6

      Most CTOs have a background in marketing/banking/management, and don't have a deep understanding of software engineering, its processes, or the pros and cons thereof.
      It's a shame really, because a technical leader of that nature should be setting the bar 😂

    • @zomgoose
      @zomgoose Před měsícem +3

      Shareholder Profits are more important. Look at Boeing as an example.

    • @leonardomangano6861
      @leonardomangano6861 Před měsícem

      I read that he sold stocks some days before this new outage

  • @TimothyWhiteheadzm
    @TimothyWhiteheadzm Před měsícem +37

    My first thought when I heard about it was 'haven't they heard of staged rollouts?' (canary releasing). It's always been funny to me that I tend to expect large companies to know all the best practices and follow them, while simultaneously, I know full well that every large company I have personally worked for or with has NOT followed best practices.
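
    As a concrete illustration of the staged-rollout idea in the comment above, here is a minimal sketch (the deploy and health-reporting hooks are assumed parameters, not any real CrowdStrike tooling) of a wave-based rollout that only widens while the already-updated cohort stays healthy:

      import time
      from typing import Callable, List, Sequence

      def staged_rollout(
          fleet: Sequence[str],
          deploy: Callable[[Sequence[str]], None],             # pushes the update to hosts
          healthy_fraction: Callable[[Sequence[str]], float],  # post-update heartbeat ratio
          waves=(0.001, 0.01, 0.1, 1.0),                       # fraction of the fleet per wave
          soak_seconds=1800,                                   # watch time before widening
          min_healthy=0.99,                                    # abort threshold
      ) -> List[str]:
          released: List[str] = []
          for fraction in waves:
              cohort = list(fleet[len(released):int(len(fleet) * fraction)])
              deploy(cohort)
              released += cohort
              time.sleep(soak_seconds)        # give boot loops and crashes time to surface
              if healthy_fraction(released) < min_healthy:
                  raise RuntimeError("halting rollout: updated cohort is unhealthy")
          return released

    With parameters like these, a fault that breaks every machine it touches is caught after the first, smallest wave instead of after the whole fleet.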

    • @jasonracey9600
      @jasonracey9600 Před měsícem +8

      It was likely due to being forced by senior leadership to meet a deadline, not a lack of awareness of the pattern.

    • @Ash_18037
      @Ash_18037 Před měsícem

      Verifying a global impact change on a small group beforehand is not "a best practice". It's just basic competence. Best practice is a corporate BS term that lets people turn off their brains instead of thinking.

    • @TimothyWhiteheadzm
      @TimothyWhiteheadzm Před měsícem

      @@jasonracey9600 CrowdStrike, as far as I know, has one main product. They have been in existence for 12 years. They provide that product to businesses and that software is updated using normal update mechanisms. This particular issue was not some special one-off update that was frantically pushed out through abnormal channels by stressed developers due to some senior leadership deadline. They clearly do not have a system of canary releasing in place. They are either unaware of the idea or chose not to implement it, and that decision was made many years ago. They have been lucky to date and this is the first major f#*% up they have made; now they will hopefully implement a canary deployment approach. Yes, it is almost certainly a failing of senior leadership, as they were probably told about the idea many times and chose not to prioritize it. I frequently provide advice to my own senior leadership and it is almost never taken seriously, as they tend to take the attitude that if it is working now, then let's work on features, not preventing possible catastrophic issues.

    • @Muddy283
      @Muddy283 Před měsícem

      @@jasonracey9600 Well if so, senior leadership must now be feeling quite chuffed with themselves!
      My AV company (nowhere near the size of Crowdstrike) had a similar experience some years back. Their fingers got badly burnt, and now they strictly conduct their update releases through staged ("Canary") rollouts.

    • @thomasf.9869
      @thomasf.9869 Před měsícem

      In large companies it is politics, adroit talk and blind luck that gets you promoted, not technical competence. The result is technically incompetent management and a lack of adherence to best practices.

  • @aratilishehe
    @aratilishehe Před měsícem +87

    The most baffling thing about all this is why they released the update globally at the same time.
    Is this how they've always rolled out their updates in the past? If so, how has it not raised any red flags for their customers?
    Smart people have already figured out how to avoid disasters like this. None of this is new! It's infuriating that CrowdStrike even managed to mess this up despite the wealth of knowledge available to us now.

    • @SteveBurnap
      @SteveBurnap Před měsícem

      @@malcolmstonebridge7933 This incident shows why this idea is a mistake. They almost certainly did far more damage than a delay of a few hours in a security patch could have done.

    • @LightningMcCream
      @LightningMcCream Před měsícem

      @@malcolmstonebridge7933 Proper testing requires paying engineers to write tests.
      Paying your employees is bad for the Shareholder Value. Therefore, don't write tests; it's always in your financial best interest up until the exact moment it isn't.

    • @snorman1911
      @snorman1911 Před měsícem +10

      They were practicing Continuous Delivery.

    • @hackmedia7755
      @hackmedia7755 Před měsícem +4

      chatgpt assured them it would be fine lol

    • @PiotrMarkiewicz
      @PiotrMarkiewicz Před měsícem

      it was a Microsoft patch that broke something, wasn't it? Why blame CrowdStrike for MS mistakes? (like firing most of their testers because automated tests on VMs are always perfect)

  • @Ian_Carolan
    @Ian_Carolan Před měsícem +8

    The more these incidents come to light the more I see idiocracy becoming the norm in all aspects of society.

  • @AxWarhawk
    @AxWarhawk Před měsícem +32

    2:46 It is a security incident. Availability is one of the 3 pillars of security. The others are confidentiality and integrity.

    • @icequark1568
      @icequark1568 Před měsícem +1

      CISSP FTW

    • @haxi52
      @haxi52 Před měsícem +7

      I think what Dave means here is that the incident wasn't malicious, an attack or security breach. Someone f'd up.

    • @mikeflowerdew7877
      @mikeflowerdew7877 Před měsícem +7

      ​@@haxi52Yes, though those aren't Dave's words - he was reading a statement from crowdstrike, which I recognised from other news reports. It's obviously intended for a very general audience who would understandably be worried about ransomware, data breaches and the like

    • @CallousCoder
      @CallousCoder Před měsícem +1

      @@haxi52 Availability issues are, in 99.9% of cases, non-malicious human error.

    • @dieglhix
      @dieglhix Před měsícem

      operational incident*

  • @PlanetJeroen
    @PlanetJeroen Před měsícem +48

    you missed the most important one .. what the hell were they thinking doing this on a friday?!

    • @sajjanm01
      @sajjanm01 Před měsícem +1

      I thought the same, but I guess in their mind it was Thursday evening when they did it.

    • @qj0n
      @qj0n Před měsícem

      Security software needs to be ready and safe to deploy on friday as well as you never know when some serious zero-day will be published

    • @PlanetJeroen
      @PlanetJeroen Před měsícem

      @@qj0n that sentiment won't help get people into the office if it doesn't roll out as expected.

    • @Fitzrovialitter
      @Fitzrovialitter Před měsícem

      I despair.
      CrowdStrike updated on a Friday.
      Microsoft accepted random code (in the form of configuration files) into their (Ring 0) kernel.
      CrowdStrike clients just rolled-out this garbage without a second thought.

    • @qj0n
      @qj0n Před měsícem

      @@Fitzrovialitter the update was on Thursday tbh
      MS approved the driver, which itself worked OK, though not every wrong config was handled gracefully. But they don't review the code; all they have are WHQL logs.

  • @FlyFisher-xd6je
    @FlyFisher-xd6je Před měsícem +28

    I love the shirt Dave!

    • @hackmedia7755
      @hackmedia7755 Před měsícem +1

      I want that shirt it's so funny

    • @hhappy
      @hhappy Před měsícem +1

      I only come here for the apparel

  • @akeenlewis3052
    @akeenlewis3052 Před měsícem +7

    Been watching for about 4 years and I have to say, your t shirt collection never disappoints 😂

  • @llucos100
    @llucos100 Před měsícem +13

    But what is also baffling, from what I have heard, is that CS sits at the kernel level… so how CS and system admins are permitting automatic push updates is beyond me.

    • @stocothedude
      @stocothedude Před měsícem +2

      exactly this is so dumb

    • @ZeonX69
      @ZeonX69 Před měsícem +1

      So you want to deal with approving a 0-day update that might impact your system on a Friday/weekend?
      Sure, you'd probably deploy it anyway, as it's a risk.
      CS should have tested it more before pushing, but I think (massive assumption here) it's more like a definition file that hit a bug in their kernel driver, versus it being an "update" as such that goes through full QA.

    • @peterdz9573
      @peterdz9573 Před měsícem +3

      ​@ZeonX69 but what is stopping them from deploying this new update file to a test environment and running some tests? This could be fully automated, and seeing how widespread the problem is (all Windows versions), they would detect it easily. Just a plain CI pipeline.

    • @ZeonX69
      @ZeonX69 Před měsícem

      @@peterdz9573 agreed on the testing, this is a massive CS failure.
      Requiring approval for every threat update doesn't make sense - automatic response is their whole sales proposition. Not saying it's perfect, but that's how the world works and is incentivised.

    • @peterdz9573
      @peterdz9573 Před měsícem +2

      @ZeonX69 Why approval? You can test and deploy automatically. A slight delay for huge benefits.
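
    To make the "just a plain CI pipeline" suggestion in this thread concrete, here is a minimal pytest-style sketch; boot_vm_with_update() and the image names are assumptions standing in for whatever virtualisation tooling the vendor actually has:

      import pytest

      # Illustrative image names; use whatever your hypervisor or cloud calls them.
      WINDOWS_IMAGES = ["win10-22h2", "win11-23h2", "server-2019", "server-2022"]

      def boot_vm_with_update(image: str, update_path: str, timeout_s: int = 600) -> bool:
          """Placeholder: boot a disposable VM from `image`, install the candidate
          content/config update, reboot, and return True only if the OS comes back
          healthy (e.g. the agent phones home) within `timeout_s`."""
          raise NotImplementedError("wire this to your virtualisation API")

      @pytest.mark.parametrize("image", WINDOWS_IMAGES)
      def test_update_survives_cold_boot(image):
          # The one assertion that mattered in this incident: the machine still boots.
          assert boot_vm_with_update(image, "candidate/channel-file.bin"), (
              f"{image} failed to boot cleanly with the candidate update"
          )

    Run for every candidate content file across the supported Windows images, a job like this fails loudly the moment any test VM stops coming back from a reboot.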

  • @ToadalChaos
    @ToadalChaos Před měsícem +8

    4:50 "Why didn't they roll back the change?"
    Apparently it was a kernel-level piece of software that caused machines to get stuck in a boot loop. As a result, it is impossible to fix an affected machine remotely. IT guys worldwide are having to physically access each machine to remove the offending patch.

  • @yapdog
    @yapdog Před měsícem +16

    It would be inconceivable that CrowdStrike doesn't at least have some form of canary testing capability. This seems to be a procedure problem; no new changes--EVEN THE FIXES--should be allowed to go wide without having first successfully passed the early stages of testing. This should be built into any and every platform.

    • @CallousCoder
      @CallousCoder Před měsícem +2

      Who's to say that India wasn't the canary in the coal mine?
      But what I think happened is that they had it built in Debug mode and tested it as such, and it ran, because it would be easy to connect a debugger or profiler to the kernel and debug/test. It probably ran fine, and they then built a release version with optimizations, and that is not the same binary you tested. It's still unforgivable, because the only valid test is to run a release version from a cold boot and a warm boot and let it run for weeks on a representative system before you qualify it even as a canary release. That simply wasn't done, otherwise this would've been found.

  • @Tom-sl7ni
    @Tom-sl7ni Před měsícem +3

    We were waiting for this video from you the whole day. Thanks for being balanced when reviewing this failure. I hope to see another video as the cause of the problem becomes clear.

  • @animanaut
    @animanaut Před měsícem +7

    their idea of canary releases was prob based on timezones. jokes aside i do hope they make it very transparent in an open postmortem so we all can learn from it.

  • @raymitchell9736
    @raymitchell9736 Před měsícem +10

    There is another question businesses will ask themselves... Can we trust a cloud-based solution, and what do we have to do to maintain reliability or resiliency should this happen again? And one final question: I do believe this could have been an honest mistake, but what if next time it's intentional, i.e. a cyber strike or a bunch of hackers? I have a friend who works in an auto parts store on the East Coast and he said that they couldn't open the stores or make sales. This seems to me to be a house of cards we've built our technology on.

    • @JanosFeher
      @JanosFeher Před měsícem +3

      The better question is: "can we trust a closed-source solution?"

    • @raymitchell9736
      @raymitchell9736 Před měsícem +1

      @@JanosFeher Yes, I think it does open up a can of worms, and yes, those questions should be asked. Perhaps sometimes the answer will be yes, other times no... but updates are so commonplace, I worry more and more what is being updated on my phone or computers... And why so frequently?

  • @brownlearner2164
    @brownlearner2164 Před měsícem +5

    They were milking interns to save money; now they are finding out.

  • @kevinmcnamee6006
    @kevinmcnamee6006 Před měsícem +6

    Crowdstrike Falcon is an end-point security system. It monitors the system operation for malicious activities and has hooks into the OS kernel to enable it to do this. It was likely one of these kernel hooks that had the problem, causing the Windows OS to crash with a blue screen. With the OS dead, it's very difficult to back out the update. This illustrates the dangers of automated-updates that are a key part to any continuous delivery system. Obviously this problem should have been picked up in testing or some sort of "canary" update process, but it wasn't. It unfortunately killed the OS making any sort of automated recovery impossible. Given the damage it has caused, it will be interesting to see if Crowdstrike has to pay for it. That would certainly incentivise software vendors to strengthen their testing and update processes.

    • @gppsoftware
      @gppsoftware Před měsícem +2

      And to employ properly qualified and experienced people.

    • @grokitall
      @grokitall Před měsícem +2

      third party mandatory automated updates that you cannot turn off have no place in any security policy for corporate use.
      these types of updates are basically saying that your system is so unimportant that you can let someone you don't know decide that the system can be shutdown now, and it does not matter if it needs a complete reinstall to fix it.
      that is why every half way competent engineer here is saying wtf. it has been known bad practice for systems which matter for decades.

  • @davetoms1
    @davetoms1 Před měsícem +1

    I appreciate that you're asking questions sincerely wanting to know the answers, not to make accusations. Far too many people jump to conclusions but you're demonstrating the best way to embrace continuous improvement: Sincere and enthusiastic curiosity. Cheers.

  • @AxWarhawk
    @AxWarhawk Před měsícem +20

    5:00 Because the update resulted in a BSOD at boot, at which point no further update mechanism could fix the issue and manual administration is required. Some machines may have a low-level management interface (like iLo) which could be used, but that's not available on most of the affected machines.

    • @Max-wk7cg
      @Max-wk7cg Před měsícem +2

      This was such a pointless video for him to make. OF COURSE they would have rolled it back if they could... "why did you decide to push a bug in the first place? :) CONTINUOUS DELIVERY HEE HEE"

    • @leerothman2715
      @leerothman2715 Před měsícem +1

      @@Max-wk7cg If they had used canary deployments then they would have limited the impact to the first batch of deployments rather than this mess.

    • @subapeter
      @subapeter Před měsícem

      @@leerothman2715 Maybe they did and this was their canary set of environments? I certainly don't know the size of their customer base; it might have even been larger. Granted, that would be a stupid size for a canary environment. Either way, if what I read is true, that this was only an impact because of a coinciding MS patch, then simply having a canary release strategy would not have been sufficient. There would also have had to be a sizeable delay between the canary deployments and the subsequent ones to allow for the MS patch coincidence to emerge (i.e. wait time for MS deployments to occur + time to detect the issue occurring). Which, if CrowdStrike deemed this a critical update because of actively occurring malware (for example), they likely would not have wanted to delay deployments for. I am more interested in an analysis as to why businesses seemingly allowed these updates to get to their systems without adequate filtering/control at a corporate level.

    • @Max-wk7cg
      @Max-wk7cg Před měsícem

      @@leerothman2715 Stunning display of wisdom and insight there! They definitely must not have heard about such a clever stratagem. It probably has nothing to do with the fact that they have reasons to not do it. No wayyyyyyyyyyyyyy!

    • @peterdz9573
      @peterdz9573 Před měsícem +3

      ​@@Max-wk7cgreasons not do it? Do you really think that there is some master plan? How long have you been working in corporate environment? (I am assuming not much)
      The truth is always below expectations. Every Windows version is affected, so a simple CI pipeline would have detected it. What reason is there not to use one? Also, a canary release with a simple call-home to confirm the update worked could be automated too. But this is extra effort, and you need to have people in the organisation who want to make it. And why would they make that extra effort? The money is cashed in and there is no responsibility. The CEO will bear no repercussions, along with higher-level management. As always...

  • @CallousCoder
    @CallousCoder Před měsícem +2

    As a former systems developer, I find it absolutely unthinkable that you did not test a cold boot -- my guess is that they may have run in debug mode, with all sorts of profiler crap and test crap attached to the kernel process, and that just worked. But testing with a debug build in a runtime environment with test crap is not the same as running a release build on a cold, clean system.

  • @logiciananimal
    @logiciananimal Před měsícem +6

    I would like to know also why the product uses a kernel level device driver (still). Microsoft created the antimalware APIs a while ago so that antimalware products don't use the same techniques as "actual" malware to stay in memory, survive integrity attacks, etc.

    • @kevinmcnamee6006
      @kevinmcnamee6006 Před měsícem +3

      It monitors the kernel for suspicious activity using hooks in the kernel.

  • @PostMasterNick
    @PostMasterNick Před měsícem +3

    Thanks Dave. There are too many people making assumptions, conjecture and hyperbole and coming to their own conclusions and remediation/retribution judgments. Your video is the right way to go about looking at this (or any) problem, via solid principles and level-headed questions.

  • @Tom-sl7ni
    @Tom-sl7ni Před měsícem +2

    I'm eagerly waiting for them to say "It's not a bug, it's a feature!" as if it should have shut down the PCs in case where security might be at risk. That would be fun

    • @eeaotly
      @eeaotly Před měsícem +1

      Your comment makes me recall something... :-)

  • @AxWarhawk
    @AxWarhawk Před měsícem +8

    IMHO a more important question is, why was CrowdStrike able to bring down so much critical infrastructure? Why are there no fail-safes in supply chains that guard against faulty (or rogue) updates?

    • @CallousCoder
      @CallousCoder Před měsícem +4

      Even worse, why don't companies have contingency plans and processes for when IT is down, so that things keep working (albeit slower, but not grinding to a halt)? We used to always think about that in the 80s and 90s. Manual processes were as much part of the design and testing as systems development. I always say it's not if you have an outage but when. And the SPOF factors have only grown in the last 30 years.

    • @AK-vx4dy
      @AK-vx4dy Před měsícem

      They mostly affected clients not infrastructure....

    • @CallousCoder
      @CallousCoder Před měsícem +1

      @@AK-vx4dy Clients are a part of your IT infrastructure. And it affected Windows that could be client and server. Without the clients critical business processes (hence critical infrastructure) couldn't be executed.

    • @AK-vx4dy
      @AK-vx4dy Před měsícem

      @@CallousCoder Maybe I'm too attached to the telephone or network analogy; for me infrastructure means telephone switches and cables, or routers, servers and fibers 😉

    • @grokitall
      @grokitall Před měsícem +1

      ​@@AK-vx4dyinfrastructure is any person, device or service without which your business stops trading and starts losing money.
      resilience planning requires you to ask what parts of your business does this cover, and what plans do you need to put in place to cope with it.
      chaos engineering then goes one step further, and deliberately blocks that infrastructure to see if your plans are good enough.

  • @mosesdaniel7045
    @mosesdaniel7045 Před měsícem +1

    The real reason is the lack of real professional software test engineers...

  • @juanmacias5922
    @juanmacias5922 Před měsícem +1

    I agree with all your points. It really seems there was no testing whatsoever.

  • @retagainez
    @retagainez Před měsícem +5

    Canary releasing is immediately the first overall strategy that I thought could've heavily mitigated many of the global effects of the outage.
    I'm speculating entirely from this point on, but I believe it couldn't be rolled back since CrowdStrike's software interacts with Windows at a low level and caused the OS to become inaccessible once the software deployment went live. I believe they have to delete a file through some relatively straightforward procedure, but it requires manual intervention to some degree, and so it is painstakingly slow to recover everything when the downstream IT organizations don't have great remote controls over each individual instance of Windows.
    Indeed, it is most curious how and why early detection of such a blatant issue was not present here. Perhaps, some testing was neglected?

    • @meelooxavier6502
      @meelooxavier6502 Před měsícem

      Canary? Testing? REAL men test straight in production :))

    • @pilotboba
      @pilotboba Před měsícem

      Seems to me they don't dogfood, at least not on windows.

    • @grokitall
      @grokitall Před měsícem

      at the very least, there should have been some sort of smoke test to see if it even got as far as completing booting, and disabled itself if that failed.

  • @soberhippie
    @soberhippie Před měsícem +8

    Has anyone observed the "Observability" caption in this video? Could it have been caught by testing or a canary release?
    I broke production at work this week, but at least I did not write _that_ antivirus patch
    On a more serious note, they do say that it was a combination of them releasing a patch and MS releasing an update. I worked for a company where we had to check our software against a set of operating systems, but we didn't get pre-release patches from MS, so I can't imagine how that could have been prevented

    • @leerothman2715
      @leerothman2715 Před měsícem

      It could have, yes. The initial deployment should have gone to a small number of customers. It's easy to set up monitoring so an external system checks something like a health-check endpoint. If it doesn't get a successful response then something has gone wrong, and all further deployments stop.
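
    A minimal sketch of that external watchdog, assuming each updated host exposes an HTTP health endpoint and that pause_deployments() is whatever stop switch the release tooling provides (both are assumptions, not a description of any real product):

      import time
      import urllib.request

      def endpoint_healthy(url: str, timeout_s: float = 5.0) -> bool:
          """True if the host's health endpoint answers with HTTP 200."""
          try:
              with urllib.request.urlopen(url, timeout=timeout_s) as resp:
                  return resp.status == 200
          except OSError:
              return False

      def watch(hosts, pause_deployments, interval_s=60, max_failed_ratio=0.02):
          """Poll every updated host; pause the rollout if too many stop answering."""
          while True:
              failed = sum(not endpoint_healthy(f"http://{h}/health") for h in hosts)
              if failed / max(len(hosts), 1) > max_failed_ratio:
                  pause_deployments()      # stop pushing the update any wider
                  return
              time.sleep(interval_s)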

    • @subapeter
      @subapeter Před měsícem +2

      You do that by having a release strategy where, if you are running the risk of certain updates causing issues that can't be centrally and automatically rolled back (e.g. in this case, system-level changes that can cause boot issues), you a) deploy to a test environment first and/or b) delay updates that aren't critical to your environment (you assess this for your circumstances, not just assume that a vendor's designation of "critical" applies in your use case) by a certain amount to see if those updates cause issues elsewhere. In any case, if you as a business allow a 3rd party to directly impact your business-critical environment without you having any control over it via techniques such as the ones I mention above, that too means you have (also) failed at your risk assessment practices. So in your case, you don't just deploy that patch to production straight away; you take the production patch and apply it in your test environment before you allow it to go through to your live estate. You do not need a pre-release from the vendor (MS in this case), but treat their release patch as a pre-release for your specific environment.

    • @peterdz9573
      @peterdz9573 Před měsícem

      ​@@subapetervery well said. But on the other hand, there is no incentive to make risk assessment. Why would they? CEO will cash the money, along with higher management. If the money stops flowing, the first to be fired will be bottom floor workers. There is a crisis of responsibility and this will be just more frequent

    • @soberhippie
      @soberhippie Před měsícem

      @@subapeter I can imagine a situation when you roll out a release bit by bit (pun not intended), and when you reach 100% audience, MS rolls out their release, which kills machines so completely that they don't send back the crash data, thus they don't notice the problem. (Well, MS could see that their machines start going offline, probably)

    • @subapeter
      @subapeter Před měsícem

      @@soberhippie No, MS is another vendor just like CS. In your release strategy as a business, you would apply the same testing to the MS patch as you do for CS updates. So as a business you'd catch your scenario because your build would crash in the test environment when you apply the MS patch (after your build already integrated the CS update) so you would not roll that out to your corporate environment.

  • @delamar6199
    @delamar6199 Před měsícem +2

    Unit testing, integration testing and release testing pipelines with different OS Versions. A company such as this must have a coverage of all those stages. However, in an enterprise world each and every machine is automatically kept up to date by their respective IT departments. We are not talking about old machines hiding in the back of the library somewhere.... That this obvious issue wasn't caught in the first place is beyond me tbh.

  • @ChristianWagner888
    @ChristianWagner888 Před měsícem +2

    How to prevent this? Automated testing on typical real PC systems by CrowdStrike (Did they only test in VMs?). Limited rollout with telemetry (did CS really update all 29000 customers with millions of computers simultaneously?).
    Corporations deploying these updates should have their own limited rollouts, with automated testing before letting everyone have the BSOD update.
    Each computer system (including client PCs) in essential industries should have the capability of an immediate full roll back via System Restore and Disk images, possibly with remote management. Systems could have been restored within minutes to a bootable state.

    • @grokitall
      @grokitall Před měsícem +1

      sorry, but with broken kernel drivers for any operating system for locked down corporate use you either need automated roll back after a failed reboot, or you need automated bad driver detection and isolation.
      nobody implements the latter, and windows safe mode does not qualify as the former.

    • @ChristianWagner888
      @ChristianWagner888 Před měsícem

      @@grokitall Something like "Automated bad driver detection and isolation" is implemented by macOS and Linux, even though these can still kernel panic as well. The faulty update was apparently pushed by CS to macOS and Linux as well without causing crashes. I think moving most non-essential drivers into user space would be better.
      Best and fastest automated OS level roll back is done by ChromeOS. Takes less than a minute with a reboot. AFAIK Windows does not have anything nearly as effective built-in and even the quite limited System Restore seems to have been disabled by default on many Windows 10/11 systems. There are a few 3rd party solutions (Acronis, RollbackRX,...), but those don't seem to be used as a general standard.

    • @grokitall
      @grokitall Před měsícem

      @@ChristianWagner888 i was not aware that the efforts in mac os and linux were further progressed than capabilities and potentialities. thanks for the information.
      you cannot stop bugs in kernel level code from taking down the kernel. the issue here is how the os responds to it after a reboot. boot looping until someone comes along and manually fixes it shows that in this case the answer is not well.
      having as little code in kernel space as possible is obviously the best answer, which is why the idea of microkernels was so popular, but in practice there are some things you need to do which must be done in kernel space. the key there is to do as little as possible, and expect things to break, planning to mitigate the side effects.

  • @dexterBlanket
    @dexterBlanket Před měsícem +1

    An excellent analysis of what caused the crash (and prevented affected systems from being fixed remotely) can be found on the "Dave's Garage" channel.
    In short - don't put code that skips input validation into a kernel driver.

    • @leerothman2715
      @leerothman2715 Před měsícem

      I just caught up with his update, definitely worth a watch to understand what’s going on with software running on the low level.

  • @sheko4515
    @sheko4515 Před měsícem +1

    It is easy to speak when you're on the other side!!!! I am pretty sure if you were in charge we would have had the same issue!!

  • @ThomasScottD
    @ThomasScottD Před měsícem

    The fix was rolled out within an hour of the issue occurring. The problem is it is hard to fix machines that are stuck in a BSoD loop, because they don't have the opportunity to download the updated file.

  • @varunsharma1889
    @varunsharma1889 Před měsícem

    I think the reason they can't remote fix it is because in order for the PC to download the fix and apply it, it needs to be able to boot and connect to internet etc. but it can't even boot. So the fix is to start in Safe Mode and then delete the file manually and then restart.

  • @nmilutinovic
    @nmilutinovic Před měsícem

    The problem was in a faulty SYS file for Windows, which is loaded as part of the boot process. This update basically made Windows devices unbootable.
    Booting into Safe Mode does skip the faulty driver, but disables Wi-Fi and networking. That means the fix needs to be distributed via USB drives - on foot.
    For machines with BitLocker, and most enterprise installations do use it, it becomes even more fun. They need to extract the BitLocker recovery key from Active Directory and then apply it in the boot process. Of course, the AD servers are also running Windows and the same CrowdStrike software, so you need to fix them first. With USB. In the data center. On foot.
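
    To show why that scales so badly, here is a hedged sketch of the kind of per-machine runbook generator an IT team might improvise; get_bitlocker_recovery_key() is a hypothetical stand-in for the real Active Directory/MBAM lookup, and the file pattern is the one widely reported at the time, not something confirmed here:

      def get_bitlocker_recovery_key(hostname: str) -> str:
          """Hypothetical lookup against Active Directory / MBAM; not implemented here."""
          raise NotImplementedError

      REMEDIATION_STEPS = """\
      Host: {host}
       1. Reboot into Safe Mode or the Windows Recovery Environment.
       2. If prompted, enter the BitLocker recovery key: {key}
       3. Delete the offending channel file (reported at the time to match
          C:\\Windows\\System32\\drivers\\CrowdStrike\\C-00000291*.sys).
       4. Reboot normally and confirm the host checks in again.
      """

      def build_runbook(hosts):
          # One page of instructions per affected machine -- hence "with USB, on foot".
          return "\n".join(
              REMEDIATION_STEPS.format(host=h, key=get_bitlocker_recovery_key(h))
              for h in hosts
          )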

  • @yesnickcarter
    @yesnickcarter Před měsícem +1

    the problem gives our hosts the blue screen of death. it's hard to automate pushing out an update when the problem puts the host in that state.

    • @ajward137
      @ajward137 Před měsícem

      That's what staged updates are for. Release to a small, chosen subset; insist that they report back before extending the rollout. It feels like a triumph of C-suite "get it done" versus caution born of experience.

  • @svijaikumar
    @svijaikumar Před měsícem +1

    Testing and canary release could have avoided this worldwide outage

  • @tattooineste
    @tattooineste Před měsícem

    Serious stuff yes Dave, but, that has to be the funniest T-shirt I've ever seen and surely the most apt for this particular subject matter.

  • @DrKaoliN
    @DrKaoliN Před měsícem

    4:51 They did that, but with a crashed OS, their local app couldn't receive it.
    I think they should have tested before delivery and used canary deployments.

  • @AirborneInsightsUK
    @AirborneInsightsUK Před měsícem

    As a former software engineer in real-time, safety-significant aircraft systems, I think the organisation in question failed to completely understand that their product was working at the kernel level in the most used operating system, running the most important services across the globe. IMO the problem was multiple failures to imagine the bad things their product could do to the services relying on the systems it was installed on.

  • @tomerza
    @tomerza Před měsícem

    The 1st idea that was crazy in my eyes is exactly what u said... I see crazy stuff -> RollBack!

  • @AnselRobateau
    @AnselRobateau Před měsícem +1

    Thank you for your reasoned and educated analysis.

  • @ClaymorePT
    @ClaymorePT Před měsícem

    This was not just poorly tested. It was also rushed into production without going through the proper WHQL channels.
    Running dynamically loaded code at kernel level? S+it of epic proportions...

  • @staticrealm61
    @staticrealm61 Před měsícem

    What I don't understand is how, collectively, companies have been so complicit in allowing their resources to consume updates en masse. There should have been responsibility on both sides.

  • @oliversteffe8526
    @oliversteffe8526 Před měsícem

    Great thoughts. Having followed you for some time and read your books, I had the same questions. 😊

  • @RickTheClipper
    @RickTheClipper Před měsícem +5

    Lesson #1 of becoming a software engineer:
    NEVER, NEVER, EVER release untested code

    • @RickTheClipper
      @RickTheClipper Před měsícem

      @@malcolmstonebridge7933 I have been in the industry for 30+ years, am still mentally stable, and I learned the hard way that the only thing better than testing is more testing.

    • @peterdz9573
      @peterdz9573 Před měsícem +1

      #2 never release on friday

  • @qj0n
    @qj0n Před měsícem +1

    Why does everyone insist on connecting this failure to Microsoft? It's purely CrowdStrike, and they already crashed Debian and Rocky Linux machines several months ago, so it's not much related to MS.

  • @peterbradley6580
    @peterbradley6580 Před měsícem

    Spot on about canary releases - that would be a very simple strategy to mitigate risk.

  • @janbrittenson210
    @janbrittenson210 Před měsícem +2

    Any update, or maybe even installer, for drivers and other kernel components, and perhaps services that may interfere with system functionality, should always create a restore point; then, if the system doesn't reboot cleanly it should automatically back out the patch by unwinding to the restore point. All the technologies for this are already in place; it's just amazing that it isn't done. Also, while automatically updating software is convenient for a home PC, production systems absolutely can't do this; in production environments any updates need to be vetted and carefully rolled out, and limited to critical fixes - which admittedly fixes to bugs in security software might be. But only bug fixes, not feature (minor version) changes. No corporate IT department worth a squat should be setting up production systems to auto-update! Ever!!!
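
    A minimal sketch of that guarded-update pattern, with create_restore_point(), apply_update() and the restore hook all assumed rather than tied to any real OS API:

      import os

      PENDING_MARKER = "update.pending"    # illustrative marker file, not a real product path

      def guarded_update(create_restore_point, apply_update, request_reboot):
          """Apply an update only after recording a snapshot we can unwind to."""
          snapshot_id = create_restore_point()     # e.g. OS restore point or disk image
          with open(PENDING_MARKER, "w") as f:
              f.write(snapshot_id)
          apply_update()
          request_reboot()

      def on_successful_boot():
          """Called late in boot, once the system is demonstrably healthy."""
          if os.path.exists(PENDING_MARKER):
              os.remove(PENDING_MARKER)            # the update survived a clean boot

      def on_failed_boot(restore_to):
          """Called by early-boot/recovery logic if the previous boot never completed."""
          if os.path.exists(PENDING_MARKER):
              with open(PENDING_MARKER) as f:
                  restore_to(f.read())             # unwind to the pre-update snapshot
              os.remove(PENDING_MARKER)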

    • @GackFinder
      @GackFinder Před měsícem +1

      Yes but we're talking about Windows System Restore here, a component that was introduced in Windows XP and that has never actually worked properly.

    • @bart2019
      @bart2019 Před měsícem

      So you're putting the onus onto Microsoft!? How dare you!

  • @conceptrat
    @conceptrat Před měsícem

    They seem to need this to be considered a 'defect'. Can't imagine why? Accountability and insurance?

  • @tonymerrett
    @tonymerrett Před měsícem

    This is a quality control incident at heart, I think: use of best practices wasn't checked, in the sense that continual testing wasn't done and monitored so that the practice was followed.

  • @RudhinMenon
    @RudhinMenon Před měsícem

    This is the dark side of software development, where people throw untested code over the wall, hoping it will work.

  • @MonochromeWench
    @MonochromeWench Před měsícem

    It was a kernel-mode driver crash that caused affected systems to blue screen. Once this happens there is no coming back without the manual intervention of an on-site technician. The next time Windows starts, the problem driver gets loaded again and causes another blue screen. If Windows even manages to start, the time before the blue screen happens again is too short for most affected systems to receive an update fixing the problem (some systems could run for just long enough to get an update, but most couldn't). Deploying a fix was mostly useless, as the systems couldn't function for long enough to get it. This was a huge failure of testing. A problem this widespread should have been caught if there was any testing at all.

  • @nickbarton3191
    @nickbarton3191 Před měsícem

    Didn't notice, but I'm working a 4 day week and today was my day off.

  • @vk3fbab
    @vk3fbab Před měsícem

    I think there is a broader lesson. If you have software with kernel level drivers deployed you are at the mercy of the vendor of that driver to get it right. Doesn't matter which OS. A few months ago i was watching an old talk from Mark Russinovich of sysinternals fame talking through how to analyse kernel driver issues and how third party vendors are usually to blame. This one was very bad because it was a system critical driver that caused the boot loop. That said it seems almost impossible to build an EDR tool without kernel level access.

  • @mattshen1207
    @mattshen1207 Před měsícem +1

    If the code bug kills the server it is not easy to just undo

    • @grokitall
      @grokitall Před měsícem

      yes it is. have a modular kernel, flag up which modules have changed on shutdown, when rebooted clear the list, and if the reboot did not complete, block those modules on next reboot.

  • @heypaisan9384
    @heypaisan9384 Před měsícem +1

    Does anyone know why antivirus software are implemented as kernel device drivers? Any kernel code can take down the whole system as we've seen. You would think Microsoft would add specific system calls to their kernel specifically for antivirus software to avoid this situation. A bigger question would be, why are these systems using Windows at all?!?

    • @joachimfrank4134
      @joachimfrank4134 Před měsícem +1

      Windows has special system calls for anti-virus software.

    • @kevinmcnamee6006
      @kevinmcnamee6006 Před měsícem +1

      Crowdstrike is more than an anti-virus product. It monitors the kernel for malicious activity by attaching code to the kernel.

  • @rao180677
    @rao180677 Před měsícem

    Well, in order to roll back we need a properly running OS where the change can be rolled back. But if the update blocked the OS from running, then even if CS rolls back, the blocked OS is not able to recover without manual intervention. This raises two types of questions: one, what could CS have done to prevent this? Two, what can customers do to mitigate such possible events? From the producer's POV I don't think we have enough info, and probably never will, but from a customer's POV OS updates need to be done in a much more controlled way, not just update everywhere and hope for the best. SW consumers need to adopt redundant OS environments so that when an OS update fails it can be switched to a similar OS on the same version that has not updated. And this is completely neglected in all companies.

  • @edhodapp6465
    @edhodapp6465 Před měsícem

    When you kill the kernel, there is no rollback, at least not in the systems killed. They did rollback, but that only saved the systems that had not updated.

  • @vivek.80807
    @vivek.80807 Před měsícem

    Thanks, learning a lot from your videos as usual.

  • @jsonkody
    @jsonkody Před měsícem +1

    someone just wanted to get to the weekend already... "nah, it's all right, I don't have time to test it... Jimmy! Push it to production! BYEEE"

  • @CallousCoder
    @CallousCoder Před měsícem +1

    An unforgivable, inept blunder that should've and could've been prevented!
    But also unforgivable is the fact that all their users obviously don't have any manual backup processes in place for when tech goes down! That's something we were always aware of in the 90s, and planned for and tested for: "What if something goes down?" It can be as trivial as a 48-hour power outage (they do happen - not often, but they do happen, especially when a chopper flies into the high-power cables, like happened here in the early 2000s).

    • @stocothedude
      @stocothedude Před měsícem +1

      exactly! so many companies without fallback strategies is outrageous

  • @NicodemusT
    @NicodemusT Před měsícem

    CrowdStrike had 100% test coverage, just an FYI from the CEO of CrowdStrike.

  • @petersuvara
    @petersuvara Před měsícem

    Was waiting for your response to this.

  • @scottsnelson1
    @scottsnelson1 Před měsícem

    Could not have picked a better shirt for this episode

  • @BigWhoopZH
    @BigWhoopZH Před měsícem +20

    Once the faulty update was installed, the systems went into a boot loop. Thus an automated rollback isn't possible.

    • @jimpo1234
      @jimpo1234 Před měsícem +2

      Yeah....smart questions "why didn't they roll back the change, why didn't they observe that update was broken?"

    • @vitalyl1327
      @vitalyl1327 Před měsícem +4

      Which is a Windows design fault. Even a puny teeny little U-Boot on IoT devices can roll back. And frequent reboots are a fault condition indeed.

    • @Ildorion09
      @Ildorion09 Před měsícem +1

      @@jimpo1234 I mean, they are responsible for delivering a faulty update, but once it was out of the box they couldn't roll it back.

    • @BigWhoopZH
      @BigWhoopZH Před měsícem

      @@vitalyl1327 cannot do that with a general use operating system. How could you tell what to roll back and how far? The update could have been installed days before the reboot.

    • @vitalyl1327
      @vitalyl1327 Před měsícem +3

      @@BigWhoopZH you roll back the system image to the exact state before the last reboot. This is how it's done in all the properly designed systems.

  • @lokedemus7184
    @lokedemus7184 Před měsícem +2

    Observability? How would they observe systems failing to boot? That probably happened before any network calls were made. The only thing they could have seen is a decrease in requests to their services. I assume that their software phoned home for various reasons.

    • @SteveBurnap
      @SteveBurnap Před měsícem +1

      Have telemetry that sends an event when the files are installed and a second event when the machine comes back from a reboot. If there is a wide disparity between the numbers of each event, it is a strong indication that something very bad has happened.
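
    A sketch of that disparity check, assuming the vendor already collects an "installed" event and a "healthy after reboot" event per machine; the thresholds are illustrative:

      def rollout_looks_healthy(installed_count: int, rebooted_ok_count: int,
                                min_sample: int = 500, min_ratio: float = 0.95) -> bool:
          """Compare 'update installed' events with 'came back after reboot' events.

          A large gap between the two is a strong signal that updated machines are
          dying before they can phone home -- exactly the boot-loop failure mode."""
          if installed_count < min_sample:
              return True                          # not enough data yet to judge
          return rebooted_ok_count / installed_count >= min_ratio

      # Example: 10,000 installs but only 1,200 machines reported back => halt the rollout.
      assert not rollout_looks_healthy(10_000, 1_200)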

    • @leerothman2715
      @leerothman2715 Před měsícem

      @@SteveBurnap Yup, agreed. Have monitoring call every x seconds to check that it's getting a successful response.

    • @grokitall
      @grokitall Před měsícem

      ​@@SteveBurnaphave fun with flags.
      set a flag when you download the update.
      if the flag is not 'updated' or 'booted', leave the driver out and have it disable itself.
      if it is 'updated', change the flag to say it started.
      if it has started, do the minimum work to show it boots, then change the flag to say it booted.
      at this point, either it worked, or it got out of the way on the next reboot.
      suddenly the problem does not occur in production, and everything else is solved by testing and canary releasing.
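
    That flag sequence is concrete enough to write down. A minimal sketch, assuming the agent can persist a one-word state file and choose which driver version to load (hypothetical hooks, not CrowdStrike's actual mechanism):

      from pathlib import Path

      STATE = Path("agent_update.state")    # illustrative location for the flag

      def read_state() -> str:
          return STATE.read_text().strip() if STATE.exists() else "booted"

      def on_update_downloaded():
          STATE.write_text("updated")       # set a flag when you download the update

      def on_driver_load(load_new_driver, load_previous_driver):
          state = read_state()
          if state == "updated":
              STATE.write_text("started")   # first attempt with the new driver
              load_new_driver()
          elif state == "started":
              STATE.write_text("disabled")  # last boot never finished: presume the new driver guilty
              load_previous_driver()
          elif state == "disabled":
              load_previous_driver()        # stay on the known-good driver
          else:                             # "booted": the last update proved itself
              load_new_driver()

      def on_boot_complete():
          if read_state() == "started":
              STATE.write_text("booted")    # minimum proof that the machine boots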

  • @ddunn99
    @ddunn99 Před měsícem

    It's a blue screen cycle that needs you to boot in safe mode - not easy in a corporate environment with bitlocker for example. Another question for me is, why use a kernel mode driver?

  • @Sim-rh4tj
    @Sim-rh4tj Před měsícem

    Low Level Learning has a good explanation of what went wrong.

  • @YossiZinger
    @YossiZinger Před měsícem

    Since the endpoints got BSOD due to this faulty driver, no service on the machine is available to accept a rollback or a fix. it has to be patched manually, on each endpoint separately.

    • @grokitall
      @grokitall Před měsícem

      but as the systems could be rebooted, all you need is for the os vendor to block drivers which were updated prior to the reboot, and then the os boots and can start recovery procedures.

  • @paulyaron2410
    @paulyaron2410 Před měsícem

    Automation is great until it’s not. A family member informed me of issues with Heart monitors at work from this event. Perhaps a future discussion on software for critical systems managing human life.

    • @grokitall
      @grokitall Před měsícem

      we have known how to do man rated systems since the moon landings, and this is not how.
      the last time the nhs in the uk was brought down, it was embedded medical devices using windows xp years after end of life mixing with current machines.

  • @jasonc3589
    @jasonc3589 Před měsícem

    You can't hack our systems if there are no systems!
    Well played Crowdstrike, well played 👏👏👏

  • @llucos100
    @llucos100 Před měsícem +1

    Cue the… “You had ONE JOB” meme!

  • @kashgarinn
    @kashgarinn Před měsícem

    It isn’t only a problem with crowdstrike, it’s an inherent problem with Windows that a single program can BSOD the whole system.

    • @grokitall
      @grokitall Před měsícem

      any os kernel module can bsod a system.
      there are only two ways to stop it.
      1. require every module to be submitted to the os vendor for intensive testing prior to release. microsoft tried it and nobody wanted to pay them thousands to sign every minor patch.
      2. have the os catch bad drivers, and automatically block them on the next reboot. nobody does this. a poor approximation is windows safe mode, which does not work in a corporate environment.
      so every kernel driver update can hose your system with a need for a full reinstall.

  • @user-oc5of4tm2p
    @user-oc5of4tm2p Před měsícem

    Definition of a cyber attack is to cause major disruption and loss of functionality and even loss of assets. This incident by Crowdstrike was deliberate and meets that definition.

  • @youtux2
    @youtux2 Před měsícem

    Shouldn't big orgs such as banks and airports be doing canary releases as well for vendor updates as critical as these ones?
    How feasible do you think it is?

  • @RickTimmis
    @RickTimmis Před měsícem

    Love your content Dave, thanks for sharing

  • @SubTroppo
    @SubTroppo Před měsícem

    As stock buy-backs [formerly illegal] are part of "engineering" companies' activities now, It would be interesting to know of the financial engineering activities of the company involved as that is seemingly the prime indicator of competence nowadays [ref. Boeing].

  • @rakeshbhat5102
    @rakeshbhat5102 Před měsícem +1

    It's absurd: software is released without testing the change, impacting services globally. Why are very basic SDLC principles avoided by enterprise-level services/companies?

    • @peterdz9573
      @peterdz9573 Před měsícem

      Simple answer: because higher-level management bears no responsibility. The money will be cashed in anyway. If a company gets into trouble, the first to go will be the bottom-level workers.
      Look at what is happening. Healthcare systems get hacked and leak confidential data. Planes fall to the ground because of a badly designed feature. And what are the repercussions? CEOs get hefty paychecks; court cases are dismissed.
      Why would anyone make an effort when there are no consequences?
      The only people feeling the pain are the bottom-floor workers, who (oh, the irony) often do not have enough decision-making power to influence the company.

  • @calkelpdiver
    @calkelpdiver Před měsícem

    I love that T-shirt. Where can I get one. Perfect for a Software Test person like me. After all, Sh!t's Broke!

  • @marcom.
    @marcom. Před měsícem

    "Why not rollback" should be obvious if the result of the defective deployment is a blue screen. You can't communicate with a host that can't boot anymore.

  • @Fred-yq3fs
    @Fred-yq3fs Před měsícem +1

    The CEO -when asked on Today- said smth like "we deploy in real time so our product stays ahead of threats" = no canary. The CEO literally choked before giving his obviously standard/marketing answer. Could be a lack of water, could be the very marketing of the product torn to shreds by non tech journos. He felt totally taken aback. He probably realized and got an adrenalin rush.

  • @landmanland
    @landmanland Před měsícem

    A reverse rollout was not possible because the very computers you wanted to roll out to could not boot and thus could never be fixed remotely. This means that a physical, local fix has to be applied. This is going to take weeks, perhaps months.

  • @notthere83
    @notthere83 Před měsícem +1

    I mean... most tech companies these days are managed in an incredibly incompetent manner?
    They'll either hire engineers who don't bother with the processes you describe or put so much pressure on them that they just can't do all of these reliability processes.
    That's what I find the scariest about the near future - having seen so many corners being cut, I wonder when it'll all come crumbling down. This incident just being a fun little preview.
    Just think military platforms, virus engineering labs, etc.

  • @georgebeierberkeley
    @georgebeierberkeley Před měsícem +1

    Why didn't they install the update on a staging system, test it, and then promote to production. That's CS 101.

    • @GackFinder
      @GackFinder Před měsícem

      I doubt they have a staging system.

    • @joachimfrank4134
      @joachimfrank4134 Před měsícem

      Some other commenters thought that the error was caused by the combination of the software update and a Windows update. If this is true, it would have worked without problems when testing. Even canary releases would have been without issues. After the Windows update, the systems would react erroneously to the software update.

    • @joachimfrank4134
      @joachimfrank4134 Před měsícem

      @@georgebeierberkeley I've just seen an analysis of the failure by Dave's Garage, a retired developer from Microsoft. The gist of the analysis was that the code itself was good, but it changes its behaviour based on a config file, and the cause of the error was within the config file. So it wasn't an error located in one place in the code, but a conceptual error of not handling broken config files well enough.
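
    That kind of conceptual gap (trusting the delivered content file) is what a defensive loader guards against. A minimal sketch, using an invented JSON schema rather than CrowdStrike's real channel-file format, of validating a new config and falling back to the last known-good copy:

      import json
      import shutil
      from pathlib import Path

      REQUIRED_KEYS = {"version", "rules"}          # invented schema for illustration

      def load_config(candidate: Path, last_known_good: Path) -> dict:
          """Parse and sanity-check a freshly delivered config; a broken file must
          never take the component down, so fall back to the previous good copy."""
          try:
              data = json.loads(candidate.read_text())     # JSONDecodeError is a ValueError
              if not REQUIRED_KEYS <= data.keys():
                  raise ValueError(f"missing keys: {REQUIRED_KEYS - data.keys()}")
              if not isinstance(data["rules"], list) or not data["rules"]:
                  raise ValueError("rules must be a non-empty list")
              shutil.copyfile(candidate, last_known_good)  # promote to last-known-good
              return data
          except (OSError, ValueError):
              return json.loads(last_known_good.read_text())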

  • @profbrashears8332
    @profbrashears8332 Před měsícem

    Thanks!

  • @Fitzrovialitter
    @Fitzrovialitter Před měsícem

    3:23 Change "us" to "our" in the caption to make sense.

  • @Immudzen
    @Immudzen Před měsícem

    I think we need some trust-busting. The companies in this market have merged with or destroyed each other, and this has made the system more fragile. We would have a more resilient system if we had a wider range of vendors.

  • @SM-cs3nt
    @SM-cs3nt Před měsícem

    To be honest, this incident is kind of an argument against continuous delivery - a separate QA stage gate before deployment would have prevented it.

    • @grokitall
      @grokitall Před měsícem +1

      you mean an argument for continuous delivery. ci and cd are designed to block deployments of broken systems.

  • @whelangc
    @whelangc Před měsícem

    Yeah - Management probably announced 'the release of our patch was a resounding success' while the Engineering folk were saying 'but, but, but, we are still building it' - 'Ah, just chuck it in Production, it will be fine' - After all, success is about meeting a delivery date, right?

  • @markd.9538
    @markd.9538 Před měsícem

    Have to ask the hard questions… Was the continuous delivery model, or approach to software delivery a contributing cause?
    I ask as it posits an opportunity to reflect and improve, not to point fingers.

  • @emmaatkinson4334
    @emmaatkinson4334 Před měsícem

    It looks as though they cannot roll back because the target systems are looping through blue screens - i.e. they cannot complete a reboot.

  • @rodrigo2112-
    @rodrigo2112- Před měsícem

    I see the t-shirt, well chosen for the occasion 🤣

  • @edwardallen3428
    @edwardallen3428 Před měsícem

    And how do you go about releasing a kernel patch that is a file full of zeroes?

  • @florinpandele5205
    @florinpandele5205 Před měsícem

    Let's assume this happened:
    CrowdStrike releases a canary version; everything is fine for a week.
    Microsoft releases a patch for the OS that silently breaks the CS update.
    The CrowdStrike update gets released for everyone... and crashes because of a conflict with the previous Microsoft update.
    Moral:
    Avoid using the damn cloud. Manage your own systems; don't "externalize" everything.