Microsoft Cloud is down- is the future in the cloud?

Started by Josquius, July 19, 2024, 03:19:38 AM

Josquius

https://www.bbc.com/news/articles/cv2g5lvwkl2o

Quote: Mass IT outage affects airlines, media and banks

A raft of global institutions - including major banks, media outlets and airlines - have reported suffering a mass IT outage.

The US state of Alaska has warned its emergency services are affected, while several of the country's airlines have grounded their flights around the globe.
Australia - which has been particularly hard hit - has seen broadcast networks scrambling on air as systems failed and supermarkets crippled. Sky News UK went completely off air as a result of the issues.

The cause of the outage is unclear, but many of those impacted have linked it to Microsoft PC operating systems.

An official Microsoft 365 service update posted to X earlier in the day said "we're investigating an issue impacting users ability to access various Microsoft 365 apps and services".
However, a Microsoft spokesperson told the BBC on Friday that "the majority of services were recovered" hours earlier.

A spokesperson for Australia's Home Affairs Minister said the outage appears to be related to an issue at global cybersecurity firm Crowdstrike, and the country's cybersecurity watchdog said there is no information to suggest it is an attack.

"Our current information is this outage relates to a technical issue with a third-party software platform employed by affected companies," they said in a statement.

Alaskan officials said many 911 and non-emergency call centres are not working properly.
United, Delta and American Airlines - which are all based in the United States - have issued a "global ground stop" on all of their flights. And in Australia, carriers Virgin Australia and Jetstar have also had to delay or cancel flights.

Australian telecom firm Telstra has said triple-0 call centres - the main emergency contact in the country - are not affected, but that it is working with other state emergency services providers to implement backup processes.

Social media users have reported queues at Australian stores like Woolworths, with payment systems downed, and trouble accessing financial institutions like the National Australia Bank.


Apparently this is causing quite a fuss in a lot of places. Airports screwed up and that sort of thing.

I find it interesting as the whole discussion about data centres and the move to the cloud pops up all the time.

In theory it all makes sense. Instead of investing tonnes of money into building your own data services, your own security, and all this other really specialised stuff that is completely outside of your company's domain, you instead hire a company that is actually a specialist to do it.
It's a question of whether you build your own vault or just put your money in the bank.

But...a few years ago I was working at a large (non-tech) multinational who were investing big money to construct a big on-premise server centre at their global HQ. This was after the mass move to cloud was well underway and they chose a different path.

What we seem to be seeing here is the danger of a single point of failure. What seems to be a pretty minor technical fault at Microsoft is causing big issues globally.

An interesting argument I heard the other day about why AI is just a fad touched on Google Stadia: that cloud gaming is completely backwards thinking. Computers are cheap and plentiful. It's reliable, ultra-fast internet that is a rare commodity in the world. For most people it will never make more sense to play a game on a computer hundreds of miles away rather than just hooking one up to your TV.

Thoughts?

Tamas

I am not saying this invalidates your point or it's not worthy of discussion, but the current mess is because of CrowdStrike which is -apparently- a very popular security software. Seems like everyone who was using it got nuked by a fucked up update.

viper37

Quote from: Tamas on July 19, 2024, 03:59:41 AM
I am not saying this invalidates your point or it's not worthy of discussion, but the current mess is because of CrowdStrike which is -apparently- a very popular security software. Seems like everyone who was using it got nuked by a fucked up update.

I don't know why my router was fucked up for the last two days. I think it was because of the DNS server I use. Probably linked to CrowdStrike.
I don't do meditation.  I drink alcohol to relax, like normal people.

If Microsoft Excel decided to stop working overnight, the world would practically end.

Sheilbh

Yeah sounds like it was the CrowdStrike thing - there was an issue with Microsoft but it was fixable (patch a few weeks ago apparently). But obviously some organisations with poor patching are experiencing both.

Separately can't help but think somewhere someone is sitting who pressed go on that software update. Obviously it's not their fault - it's a system issue - but they're watching this knowing they pressed the button and just have that sinking feeling of the most massive work fuck up :lol: :(

Interesting to see some journalists (not tech journalists) being absolutely baffled by this - stuff like "how can a software fault from a third party basically shut down airports all around the world?" I feel like this might be a moment (a bit like the supply chain issues post-covid) where reporters in unrelated areas, including politics, start realising that the world we live in is a little different than they'd understood and these areas (all of them in a way supply chains - digital and physical) are really important.

Edit: Eg this timelapse of flights over the US:
https://x.com/US_Stormwatch/status/1814268813879206397
Let's bomb Russia!

DGuller

If you're using a Microsoft product and don't expect it to crash, then it's on you.

Zanza

The CrowdStrike issue today was not related to cloud computing. Instead it caused a blue screen of death on your local Windows client. Non-Windows clients (iOS/macOS, Android, Linux...) and non-Windows servers (i.e. most of "the Cloud") weren't affected.

Jacob

Wondering what will happen when I turn on my work laptop in a little while....

Zanza

Quote from: Jacob on July 19, 2024, 10:46:57 AM
Wondering what will happen when I turn on my work laptop in a little while....

In my workplace it was gone by 8am CET, only machines running before that were affected.

Baron von Schtinkenbutt

Quote from: Josquius on July 19, 2024, 03:19:38 AM
I find it interesting as the whole discussion about data centres and the move to the cloud pops up all the time.

In theory it all makes sense. Instead of investing tonnes of money into building your own data services, your own security, and all this other really specialised stuff that is completely outside of your company's domain, you instead hire a company that is actually a specialist to do it.
It's a question of whether you build your own vault or just put your money in the bank.

But...a few years ago I was working at a large (non-tech) multinational who were investing big money to construct a big on-premise server centre at their global HQ. This was after the mass move to cloud was well underway and they chose a different path.

As others have noted, this problem is related to Windows and some particular Microsoft software-as-a-service (SaaS) offerings.  Azure, Microsoft's cloud infrastructure offering, is unaffected (as evidenced by Languish still being up).  I could go on a tangential rant about cloud migrations and the false dichotomy between "run it all In The Cloud™" and "build your own datacenter from scratch", but I won't.

Quote: What we seem to be seeing here is the danger of a single point of failure. What seems to be a pretty minor technical fault at Microsoft is causing big issues globally.

Technology single points of failure have been an endemic problem in business for decades. The root cause isn't a particular technology, it's the attitude of the business. Businesses, like people, can't deal with very unlikely but existential risk. If a particular cloud vendor, SaaS provider, or sole supplier looks to be very low risk, it's very difficult to get the business to care about building in redundancy or developing contingency plans. Those things cost money, both directly and indirectly through slowing down the rate at which money-making value can be delivered to customers.

Of course, this is a multi-layered problem.  The customers of these businesses also generally don't appreciate the very low but existential risk that could ripple down the chain to them, and so don't value businesses that account for this enough to pay extra for the product or service.  The result is a fragile system where one key supplier having a problem cascades through a web of interconnected businesses, causing something like what the article describes.  This time it happened to include a SaaS product, but there is nothing unique in such products that would cause this.
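
To make the cascade concrete, here is a rough Python sketch. Everything in it is made up for illustration: the `DEPENDS_ON` map, the business names, and the `affected_by` helper are assumptions, not data from the outage. It just walks a who-depends-on-whom graph to show how one supplier failing can reach businesses that never dealt with it directly.

```python
from collections import deque

# Hypothetical dependency map: each key depends on the suppliers listed.
DEPENDS_ON = {
    "airline_checkin": ["endpoint_security_vendor"],
    "airport_ops": ["airline_checkin"],
    "payment_processor": ["endpoint_security_vendor"],
    "supermarket_tills": ["payment_processor"],
    "local_bakery": ["flour_mill"],
}


def affected_by(failed, depends_on):
    """Return every node that directly or indirectly depends on `failed`."""
    # Invert the map so we can look up a supplier's direct customers.
    customers = {}
    for node, suppliers in depends_on.items():
        for supplier in suppliers:
            customers.setdefault(supplier, []).append(node)

    # Breadth-first walk outward from the failed supplier.
    hit = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for customer in customers.get(current, []):
            if customer not in hit:
                hit.add(customer)
                queue.append(customer)
    return hit


impacted = affected_by("endpoint_security_vendor", DEPENDS_ON)
print(sorted(impacted))
```

In this toy map the unrelated `local_bakery` stays up, while `airport_ops` goes down despite having no direct relationship with the vendor.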

Quote: An interesting argument I heard the other day about why AI is just a fad touched on Google Stadia: that cloud gaming is completely backwards thinking. Computers are cheap and plentiful. It's reliable, ultra-fast internet that is a rare commodity in the world. For most people it will never make more sense to play a game on a computer hundreds of miles away rather than just hooking one up to your TV.

Stadia had two promises. One was giving the customer much better and more frequently updated hardware than they may have at home on which to play games; the other was giving the customer an extensive rental library of games so that they don't have to buy as many. I don't think it was a bad idea per se, but I think Google significantly overestimated the addressable market. While I generally agree with you about preference for local hardware, there is a market for those who can't afford the latest console or a tricked-out gaming PC, or those who travel a lot. It's just not a large enough market for such an expensive-to-run service, especially since there are services like Microsoft Game Pass that provide the second value proposition (and work on existing hardware, for those who have it).

Caliga

Been working on this since 5 am... it caused mass chaos across all of our systems
0 Ed Anger Disapproval Points

Iormlund

It's the second (and by far the biggest) fuckup of Crowdstrike in two weeks.

Two Fridays ago everything slowed to a crawl at my work because of another failed update (simply opening a file took like a minute).

Grey Fox

Palo Alto Networks about to make bank! (another EDR provider)
Colonel Caliga is Awesome.

viper37

Baron von Schtinkenbutt


Iormlund

I, for one, welcome our new silicon-based overlord.