Service Health
Microsoft
Redesigned the Service Health experience to help users find critical system insights faster, doubling monthly active users and increasing engagement by 173%.
Partnered with Brain PMs, developers, and researchers to validate new workflows that reduced Time to Mitigation (TTM) during incidents and outages.
Client
Microsoft Inc., Azure
Role
Senior User Experience Designer
Location
On-site, remote
Scope
Design updates to an internal Microsoft tool to help users quickly identify service outages while expanding its appeal to more user types.
Challenge
Service Health is a critical component of Microsoft’s internal monitoring platform, built for Service Owners, Azure Incident Managers (AIMs), DRIs, and executive leadership to manage the health of in-house applications. The product provided a quick view of overall service status with a regional health pivot.
Originally part of the AI-driven “Brain” toolset, “Brain Cloud Health” aimed to quickly surface failing services and identify likely causes. But it was hard to find, buried under a sub-menu, and its top-down UX concept overlooked DRIs and other key target users. A bloated feature set further limited adoption, with most users relying only on basic functions.
Users could also click the chevron next to a service name to see which regions contributed to the service’s health (“Level 2”), but this interaction wasn’t always obvious. That posed a problem, since clicking a region name was the only way to open a context pane featuring deeper data insights.
Content on the “Level 3” context pane caused confusion: users didn’t understand how metrics like “Overall Health”, “SLI Health”, and “Brain Aware Impact” were calculated, leaving many unsure of the data’s usefulness. Additionally, the SLI Details interaction was clunky: users didn’t want a comparison of SLIs, yet the UX forced them to select a single SLI from the main chart before the supporting charts would populate with data.
Action
I began the redesign process by working closely with Senior and Junior PMs and developers from Brain to focus the main view (“Level 1”) on core Service Health, which meant we could eliminate the regional pivot and all of the tabs on the secondary left navigation to simplify the layout. With these changes in place, I recommended placing the newly named “Service Health” at the top of the table of contents menu, making it more visible and easier to access.
Customer calls with DRIs revealed that the Recovery Time column wasn’t helpful, so we replaced it with the higher-priority “Outages” data. I also added a simple star-based “favorites” selection interaction, plus a new Favorites tab to view only those services.
I added a collapsible legend area to show details about how Brain defined healthy/unhealthy conditions, Outages, and Health Accuracy. I also included an icon to indicate which Services were best tuned for Health Accuracy, with learning links to explain why.
I added a shaded red area on the timeline as a visual indication of outages, which users could click to open a pop-up with detailed data. The pop-up also featured links for quick follow-up with Azure Incident Managers and the owning team, plus a direct link to see incident details in the external Incident Management tool.
I designed a new “Level 2” dashboard for services selected on the “Level 1” page. It displays detailed service data, including a Regions card with a link to all regions. At the top, a prominent card highlights AI-generated insights to pinpoint likely causes of failures or outages, saving users time when searching for root causes.
Based on user feedback, I redesigned the UX to group SLIs into “Healthy” and “Unhealthy,” with the Unhealthy section auto-expanded to highlight failures. I again included the AI summary card, but this time tuned to just the selected region.
Clicking on the chevron next to an SLI’s name reveals the Metrics, Unhealthy Resource Count, and Burn Rate, with the option to expand each chart to column width.
Result
The redesigned UX helped users find critical system insights faster, doubling monthly active users and increasing engagement by 173%. Telemetry also validated a reduced Time to Mitigation (TTM) during incidents and outages.