How available is your app, really?
Uptime lies and browntime hurts
Part of my job is keeping an e-commerce site healthy.
I want my tech to just work: every day, 100% of the time; zero defects; zero complaints. I want to delight every visitor.
Web admins and site reliability engineers around the planet want those things too. They’ll never get them. Websites and apps always crap out eventually. Unavoidable failure modes include:
- Hardware failure. Cloud infrastructure is not perfect. Networks go down. Humans are clumsy in datacentres. Rodents chew through fibre optic bundles.
- Change failure. 90% of system outages are caused by software updates. This is true even with the latest and coolest testing and deployment strategies. It’s mathematically impossible to completely test any non-trivial software, or to predict every possible user behaviour.
- Environmental failure. Modern websites connect with numerous third-party APIs, scripts, platforms, brokers, affiliates, frameworks, and packages. A bug or unplanned change in any of them could give your SRE a bad day.
The Trouble with Uptime
So, how can we know when a site is healthy enough? How can we know we’re investing enough in reliability, and how can we demonstrate an adequate return? There’s an optimal spend. Beyond a certain point, the effort and cost involved in getting from 99.9% reliability to 99.99% or 99.999% will outweigh the business advantage. We’ll want to stay close to the point of maximum yield.
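For a sense of scale, here’s what each of those targets allows in downtime per year. This is plain arithmetic, not a claim about any particular system:
// Maximum downtime per year permitted by each availability target
const minutesPerYear = 365 * 24 * 60;                            // 525,600
const allowedDowntimeMinutes = (target) => minutesPerYear * (1 - target / 100);
allowedDowntimeMinutes(99.9);    // ~526 minutes, roughly 8.8 hours per year
allowedDowntimeMinutes(99.99);   // ~53 minutes per year
allowedDowntimeMinutes(99.999);  // ~5 minutes per year
Each extra nine cuts the allowance by a factor of ten, and the cost of buying it tends to climb even faster, so the point of maximum yield sits somewhere between the extremes.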
To achieve that, we need to know the impact of interrupted service on business outcomes. And, we need to know how often the service degrades.
It’s not as easy as it sounds.
The most basic (and popular) measure of system health is uptime. It’s just the percentage of time that the system was responsive. For example, you might cURL your homepage once per minute. If you get a 200 OK response, the site is ‘up’. Any other response (or no response at all) means it’s ‘down’.
// 1440 minutes per day
var dailyUptimePercentage = 100 * (1440 - countOfMinutesDown) / 1440;
var dailyDowntimePercentage = 100 * countOfMinutesDown / 1440;
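The probe itself isn’t shown above; here’s a minimal sketch of an equivalent once-a-minute check, assuming Node 18+ (global fetch and AbortController). The 10-second timeout and the 200-only rule are illustrative choices, not the author’s actual monitor.
// Hypothetical per-minute probe: anything other than a timely 200 OK counts as 'down'
async function siteIsUp(url) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), 10_000);   // give up after 10 seconds
  try {
    const response = await fetch(url, { signal: controller.signal });
    return response.status === 200;                             // 'up' only on 200 OK
  } catch (err) {
    return false;                                               // timeout, DNS failure, connection refused: 'down'
  } finally {
    clearTimeout(timer);
  }
}
A scheduler (cron, or a plain setInterval) would call this once a minute and add to countOfMinutesDown whenever it returns false.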
Uptime measures a system’s aliveness, not its health.
An SRE being satisfied with uptime is like a doctor thinking the patient is totally healthy because the ECG isn’t flatlined.
The same considerations apply to other reliability metrics like MTBF (mean time between failures) and CFR (change failure rate). If we define ‘failure’ to include only catastrophes, we’re deluding ourselves. If an e-commerce site goes down completely for 3 hours, once per year, it might lose, say, 100 sales. If it never goes down fully, but 5% of users constantly get an error screen that stops them buying, it might lose 100,000 sales.
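To make that concrete, here’s the back-of-envelope arithmetic with made-up traffic and conversion numbers; the exact figures don’t matter, only the gap between the two scenarios.
// Purely illustrative numbers: 4,000 visits/hour, 2% conversion rate
const visitsPerHour = 4000;
const conversionRate = 0.02;
const salesPerHour = visitsPerHour * conversionRate;            // 80 sales per hour

// Scenario A: one total outage of 3 hours per year
const fullOutageLoss = salesPerHour * 3;                        // ~240 lost sales per year

// Scenario B: 5% of would-be buyers blocked by an error screen, all year round
const partialFailureLoss = salesPerHour * 0.05 * 24 * 365;      // ~35,000 lost sales per year
Whatever traffic you plug in, the always-on partial failure dwarfs the occasional total outage.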
Track Your Browntime
Uptime says nothing about user experience. The site could be ‘up’, but the pages might fail to render in a browser, or the payment gateway might be broken, or the site might be too slow to use, etc. The modes of partial failure are innumerable.
I use an analogy with power grids to explain this. A power blackout is very obvious. A power brownout or sag (a voltage drop across the network) is harder to identify, yet it can be just as disruptive, precisely because it takes longer to find and fix. Computer application downtime is the equivalent of a blackout. Incomplete failures are like brownouts.
If we care about availability, we must consider “browntime” as much as “downtime”.
True Availability Measurement
Here’s what we did to get a truer measure of our site’s reliability.
First, we set the bar really high. We defined our own standard of Availability: the site only counts as ‘available’ during a time interval if nearly every customer has a good experience in that interval, with no errors or slowdowns.
Then, we took a multi-factor approach to measuring and improving this Availability:
- Continued to invest in the fundamentals: blue/green and canary deployments; auto-scaling; test automation; continuous improvement of telemetry, observability and APM.
- Strengthened our synthetic monitoring tools. We already had bots simulating real user journeys with headless browsers and logging errors. We refined and extended them (one such journey is sketched after this list).
- Developed a formula to calculate the Availability metric for each 4-minute period of the day, based on a weighted sum of 14 different measures drawn from our application telemetry and synthetic monitoring (the shape of that calculation is also sketched below). We ran it ‘silently’ for 3 months, tuning the weights to maximise the correlation between our Availability score and known impacts on site visitors.
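The post doesn’t say which tooling the bots use, so here’s a minimal sketch of one synthetic journey, assuming Puppeteer. The URL, selectors and timeouts are invented for illustration.
// Hypothetical synthetic check: load the home page, run a search, and record
// whether each step succeeded and how long the whole journey took
const puppeteer = require('puppeteer');

async function runJourney() {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  const result = { ok: true, startedAt: Date.now(), steps: [] };
  try {
    const response = await page.goto('https://shop.example.test/', { waitUntil: 'networkidle2', timeout: 15_000 });
    result.steps.push({ step: 'home', status: response.status() });

    await page.type('#search-box', 'running shoes');                  // invented selector
    await page.keyboard.press('Enter');
    await page.waitForSelector('.product-card', { timeout: 10_000 }); // invented selector
    result.steps.push({ step: 'search', ok: true });
  } catch (err) {
    result.ok = false;
    result.error = err.message;                                       // logged as a browntime signal
  } finally {
    result.durationMs = Date.now() - result.startedAt;
    await browser.close();
  }
  return result;
}
Something like this would run on a schedule, with its results feeding the telemetry that the Availability calculation reads.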
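And here’s the rough shape of that per-window calculation. The measure names, weights and cut-off below are placeholders; the real formula combines 14 measures whose weights were tuned over 3 months.
// Sketch of the per-4-minute Availability score: a weighted sum of health
// measures (each normalised to 0..1, where 1 = perfect), compared to a cut-off.
// All names, weights and the 0.95 threshold are invented for illustration.
const weights = {
  syntheticJourneySuccessRate: 0.30,
  http5xxFreeRate:             0.25,
  checkoutSuccessRate:         0.25,
  fastPageLoadRate:            0.20,   // share of page loads inside the latency budget
};

function availabilityScore(measures) {
  return Object.entries(weights)
    .reduce((score, [name, weight]) => score + weight * (measures[name] ?? 0), 0);
}

function windowIsAvailable(measures) {
  return availabilityScore(measures) >= 0.95;   // 'available' only if nearly everyone had a good experience
}

// Example window: healthy overall, apart from some slow page loads
windowIsAvailable({
  syntheticJourneySuccessRate: 1.0,
  http5xxFreeRate: 0.99,
  checkoutSuccessRate: 1.0,
  fastPageLoadRate: 0.90,
}); // => true (score ≈ 0.98)
Daily Availability can then be reported as the share of 4-minute windows that pass, a much stricter bar than the uptime formula earlier in this post.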
We’ve been tracking the custom metric for almost a year now. We’ve almost got enough data to analyse its association with our business KPIs — and, hopefully, reach data-driven and accurate conclusions about how to target investment in DevOps, SRE and our dev lifecycle. It’s exciting. I’ll let you know how it goes.