Microsoft hopes to improve the resiliency of its cloud services by extending a “failure mode” for Azure Active Directory to cover web as well as desktop applications.
Azure Active Directory (AAD) is Microsoft’s cloud directory that manages authentication for Office 365 and can be linked to on-premises Active Directory. Additionally, developers can write applications that use the service. However, if something goes wrong, customers experience several failures, including being unable to access the Azure portal to manage other cloud services.
In December of last year, Microsoft updated its SLA (Service Level Agreement) for AAD to 99.99% uptime, up from 99.9%, but with a certain sleight of hand as it also removed “administrative functions” from its definition of availability.
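To put the two SLA figures in perspective, here is a rough sketch of the downtime each would permit. It assumes a 30-day month as the measurement period, which is our simplification rather than Microsoft's stated SLA formula, and the function name is ours.

```python
from fractions import Fraction

# Assumption: a 30-day month as the measurement period (not taken from the SLA text).
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes


def allowed_downtime_minutes(uptime_percent: str) -> float:
    """Return the downtime, in minutes per 30-day month, that an uptime target permits."""
    # Fraction parses the decimal string exactly, avoiding floating-point drift.
    unavailable = 1 - Fraction(uptime_percent) / 100
    return float(unavailable * MINUTES_PER_30_DAY_MONTH)


if __name__ == "__main__":
    for target in ("99.9", "99.99"):
        print(f"{target}% uptime allows {allowed_downtime_minutes(target)} minutes of downtime")
```

On those assumptions, the move from 99.9% to 99.99% cuts the permitted downtime from roughly 43 minutes a month to a little over four.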
Now the company has given more details of its efforts, focusing on a backup authentication service that replicates authentication data during normal operations and then, if the primary service fails, goes into “crash mode”, where it is able to verify requests and provide tokens to clients.
Diagram from Microsoft showing how AAD backup works
According to Microsoft, this has been working for Outlook Web Access and SharePoint Online since 2019, although we did note that during the September 2020 outage, Outlook and SharePoint were affected. The reason given at the time was that “a recent change in configuration impacted a primary storage layer,” an issue that was compounded by another issue caused by “a change put in place to mitigate the impact”. So it seems that the backup service was not sufficient in this case.
There is also a limitation that authentications are only processed by the backup service if the user has already accessed an “application or resource” in the past three days, described as the “storage window”. The company found this to be acceptable for most users who “access their most important apps from a consistent device on a daily basis,” but it’s easy to think of cases where users will be locked out, for example if they buy a new device.
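The three-day window can be illustrated with a minimal sketch. Microsoft has not published the actual logic, so the names `STORAGE_WINDOW` and `backup_can_serve` and the way the last access is recorded are hypothetical, for illustration only.

```python
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical constant: the three-day "storage window" described by Microsoft.
STORAGE_WINDOW = timedelta(days=3)


def backup_can_serve(last_access: Optional[datetime], now: datetime) -> bool:
    """Return True if a backup service could plausibly handle this sign-in.

    last_access is when this user last reached this app or resource via the
    primary service, or None if there is no record (for example, a new device).
    """
    if last_access is None:
        # Nothing was replicated for this user/app pair, so the backup has no data.
        return False
    return now - last_access <= STORAGE_WINDOW


if __name__ == "__main__":
    now = datetime(2021, 11, 26, 9, 0)
    print(backup_can_serve(now - timedelta(days=1), now))  # daily user
    print(backup_can_serve(now - timedelta(days=5), now))  # back from a week off
    print(backup_can_serve(None, now))                     # brand-new device
```

In this sketch, a user who signed in yesterday is covered, while one returning after five days, or signing in from a new device with no recorded access, is not.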
It’s better than nothing though, and Microsoft has been working to expand its applicability. Earlier this year, support for desktop and mobile apps was added, and next year more web apps, including Teams Online and the rest of Office 365, will be added as well. Client applications using OpenID Connect will follow shortly.
More questions than answers
In some ways, Microsoft’s latest post raises more questions than answers. A quick glance at the Azure status page shows “Azure Active Directory – Problems trying to authenticate”, although possibly limited to customers using Azure Active Directory external identities. The root cause is attributed to “outgoing port depletion”, although where this sits on the company’s architecture diagram is not clear.
In March of this year, there was an extended AAD outage caused by the erroneous deletion of a key used for cryptographic signing. Microsoft referred to the backup service at the time and said, “Unfortunately, it didn’t help in this case as it provided cover for the token issuance but did not provide cover for the token validation as it depended on the affected metadata endpoint.”
It is therefore obvious that extending the backup service will not solve all the problems that may impact AAD even if it is beneficial.
In August of this year, analysts at Gartner reported that customers “remain concerned about the real impacts” of Azure reliability even though its performance is not bad in an absolute sense. Gartner considers some Azure regions to be less resilient than they should be, possibly due to capacity issues, but notes that the pandemic has caused increased demand for all cloud providers.
Microsoft also has questions to answer regarding the Cosmos DB vulnerability described by Wiz security researchers earlier this month. The vulnerability has been fixed, but the researchers identified what look like extraordinary architectural errors. Firewall rules were in place to prevent escalation of a breach, but, the researchers said, “these firewall rules were configured locally on the container where we were currently running as root. So we just deleted the rules (by issuing iptables -F), paving the way for those banned IPs and even more interesting discoveries.”
It’s a good thing when Azure CTO Mark Russinovich appears to talk to us, along with colleagues, about Azure reliability improvements, and the extended AAD backup service is welcome even if it isn’t always effective, but we would like to know more about these other pressing situations. ®