Outages and Service Health Notifications for Azure

My experiences with Microsoft Azure support have been pretty poor. Some of that is due to the nature of the intermediate party that Microsoft contracts to handle “professional support”. They are called Mindtree (or CSS) and are one level of indirection away from Microsoft.

However, even the so-called “paid” support (called “unified support”) isn’t necessarily the solution. Depending on the Azure product group (PG) involved, “unified support” might be equally bad. The limiting factor is often the PG itself. If the PG is unmotivated to help its customers solve a problem, then the “unified support” engineers won’t be of much help either, since they are downstream of the PG.

Over the past few years I have raised a number of cases in response to some obvious outages in Azure.

A recent case with ADF went especially poorly. Microsoft has a variety of stalling tactics and, after the most recent outage, they used all of them to delay my support case. The main tactic is to start a “collab” with another team, then another, then another. After a while I had ADF, Synapse, Network, and Cosmos all participating in the support case to help explain my outage. This “collab” was probably a pretext, since I’m convinced the ADF team is well aware of the reason for their outages (typically networking issues). In fact, the ADF team has so many networking issues that I have seen them start to shift blame to other Microsoft PGs. In this particular case they wanted to blame the Cosmos PaaS for a thirty-minute outage! (This is a total fabrication, IMHO.)

NOTE: One of the primary deliverables I needed out of this support case was an outage notification in the service health portal. This is especially important for products like ADF and Synapse, which are constantly wetting the bed. The outage notification allows me to communicate with my own downstream users. It also serves a documentation purpose (record-keeping). Finally, it allows me to show my Microsoft account rep how much trouble we are experiencing by hosting workloads on certain Azure platforms. Microsoft account reps are rarely well informed about the troubles that their customers are facing in Azure (nor do they want to be).

After a month of chasing down their rabbit holes, I was finally told by my unified support engineer (for ADF) that they won’t send my outage notification to the service health portal. They say it is because the outage was already declared “resolved”. Here is the relevant part of the response:

Microsoft Unified Support Engineering Manager (from Mr. P.E.):

Thanks for reaching out. I am sorry that you are unhappy with what has been provided. However, we have mentioned multiple times we would be unable to provide the information via the portal after the event. We have provided the outage information in the previous emails for the October 11 outage which is the same communication that was sent to the portal of identified impacted subscriptions. We have provided the outage information that is available. We are unable to send outage information after the fact to your impacted subscription via the portal after the outage is declared resolved.

This response was disappointing, but I have had many other cases where I was able to get my outage notification. My success rate at getting Microsoft to send outage notifications to the portal is roughly 50/50. Notifications are a small thing to ask for, since the service health portal is basically a point-to-point communication channel for a specific tenant. However, as small as the request is, Microsoft will still try to wiggle out of sending the notifications. The moral of the story is that if you have an outage, you must not allow the Microsoft teams to delay for multiple weeks, or you may miss your chance to receive the necessary communications.

Don’t count on Microsoft to send notifications proactively. They aren’t happy to admit to their outages. My understanding is that these notifications are generally triggered by a customer who calls for support, after which Microsoft does an analysis to discover what other customers may have been impacted. At the end of that, they send notifications to the smallest subset of customers that they can (e.g., only the ones that happened to have workloads running at a certain moment in time on a certain platform in a certain region).
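Since the portal notification can’t be relied on, one option is to keep your own outage log for record-keeping and for chasing the support case before it drags on for weeks. The following is only a minimal sketch of that idea in Python. It does not call any Azure API, and every name in it (`OutageRecord`, `days_waiting`, `needs_escalation`, the 14-day limit) is a hypothetical choice of mine for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class OutageRecord:
    """A locally kept record of one outage (hypothetical structure)."""
    service: str                  # e.g. "ADF" or "Synapse"
    started: date                 # day the outage was observed
    duration_minutes: int         # observed length of the outage
    case_opened: date             # day the support case was raised
    notification_received: Optional[date] = None  # portal notification date, if any

    def days_waiting(self, today: date) -> int:
        """Days between opening the case and the notification (or today)."""
        end = self.notification_received or today
        return (end - self.case_opened).days

    def needs_escalation(self, today: date, limit_days: int = 14) -> bool:
        """True if no notification has arrived and the wait exceeds the limit.

        The 14-day default is an arbitrary assumption, not a Microsoft policy.
        """
        return (self.notification_received is None
                and self.days_waiting(today) > limit_days)

    def summary(self) -> str:
        """One-line summary suitable for downstream users or an account rep."""
        status = "received" if self.notification_received else "NOT received"
        return (f"{self.service}: {self.duration_minutes}-minute outage on "
                f"{self.started.isoformat()}; portal notification {status}")


if __name__ == "__main__":
    # Illustrative values only; these dates are made up.
    record = OutageRecord("ADF", date(2000, 1, 1), 30, date(2000, 1, 2))
    print(record.summary())
    print("Escalate:", record.needs_escalation(date(2000, 1, 20)))
```

A log like this won’t replace the official notification, but it gives you something to show your downstream users and your account rep when the portal stays silent.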