Microsoft Blames Software ‘Code Problem’ for Office 365 Disruption




Microsoft blames a software “code problem” for an outage that affected Microsoft 365 services for five hours on Monday night.

“A code issue caused a portion of our infrastructure to experience delays in processing authentication requests, preventing users from accessing various M365 services,” Microsoft said in an email update to administrators affected by the outage.

Microsoft said it is currently “reviewing our code” to understand what caused the code to “stop processing authentication requests in a timely manner.” Microsoft promised a post-incident report within five business days.

Microsoft said the software code issue affected users on September 28 from 5:25 pm EST to 10:25 pm EST.

Microsoft customers began reporting on Downdetector.com at 5:21 pm Monday that they could not access Office 365; within an hour, more than 18,000 posts documenting those issues had flooded the website, which tracks cloud outages.

Microsoft told administrators that users may have been unable to access various Microsoft 365 services that leveraged Azure Active Directory, including Outlook, Microsoft Teams, and Teams Live Events, as well as Office.com.

Microsoft said that the Power Platform and Dynamics 365 properties were also affected by the outage.

Separately, Microsoft said in a public Azure status update last night that a “subset of customers in the Azure Public and Azure Government clouds may have encountered errors while performing authentication operations for various Microsoft or Azure services, including access to Azure portals.” Microsoft said the Azure issue lasted from 5:25 PM EST Monday to 8:23 PM EST Monday.

Microsoft attributed the Azure service outage to a “recent configuration change that affected a backend storage layer, causing latency in authentication requests.”

Microsoft said the configuration change was rolled back to “mitigate the problem.”

Regarding the Azure issue, Microsoft said that services “still experiencing residual impact will receive separate portal communications.” The company promised a full post-incident report on that issue within the next 72 hours.

A senior executive at one of Microsoft’s top partners, who declined to be named, said it appears that a Microsoft software developer made a software code change that brought down Office 365 and Azure.

“It is surprising to me that a change in the code could cause a platform as large as Azure to go down,” said the executive. “Looks like someone wrote code that was merged into a production environment and broke authentication. That’s ridiculous. If you can’t access email or documents for five hours, that’s bad enough.”

The senior executive said Microsoft will need to do a deep analysis to determine how someone could implement a software code change that causes a five-hour outage.

“Everyone expects outages. A setback here or there is understandable,” said the executive. “But this appears to be a faulty source control policy issue. Presumably they would be in a source control/DevOps environment that should have prevented this. With billions and billions of dollars invested in Azure, how could a developer write code, release it to production, and bring it all down? It seems that somehow someone got around the continuous integration cycle.”

An outage like the one Microsoft just experienced has a ripple effect in the sales trenches, the executive said, noting that large companies with mission-critical applications often cite such outages as a reason not to move to the public cloud.

“It is a difficult scenario for sales representatives,” said the executive. “There are a lot of frozen intermediate accounts that cling to an issue like this and trigger another three-year review cycle. In industries like oil and gas and financial services, they hold onto something like this. It has a snowball effect.”

Tony Safoian, president and CEO of SADA, one of Google’s leading cloud partners, said he sees Google Cloud as “the most robust and reliable.” At the same time, he said, outages are expected “from time to time” with hyperscaler cloud providers.

Larry Cannell, senior research director at market researcher Gartner who focuses on Microsoft Teams and the digital workplace, said in an email that an outage “is not a good appearance for any cloud service.” That said, he applauded Microsoft for doing “a good job of keeping everyone informed about the actions they were taking.”

Additional reporting by O’Ryan Johnson.
