In my previous post, I wrote about capturing user happiness in metric form with SLIs and SLOs, with the end goal of providing the optimal level of software reliability that maximises user happiness.

In this post, I’ll look at how Error Budgets can be used to negotiate in a data-driven way, the trade-offs between innovation and reliability and between risk and stability.

What makes systems unreliable

Systems can become unreliable (and unavailable) because of Business-As-Usual (BAU). Just as humans generate dust simply by living, engineers generate bugs simply by coding. Other BAU events are hardware/power/network failures caused by cute, furry or feathered friends or failures in third-party dependencies.

That’s what Error Budgets are for: to measure the acceptable level of system unreliability. There is such a thing as ‘too much reliability’ and it can be bad for business: as a rule of thumb, each additional nine of availability costs roughly ten times more.

The need for an Error Budget

Everything is a trade-off. Product performance is evaluated using velocity while platform performance is evaluated using reliability.

The structural conflict is between pace of innovation and product stability. “The error budget stems from the observation that 100% is the wrong reliability target for basically everything (pacemakers and anti-lock brakes being notable exceptions). If 100% is the wrong reliability target for a system, what, then, is the right reliability target for the system? This actually isn’t a technical question at all — it’s a product question, which should take the following considerations into account: What level of availability will the users be happy with, given how they use the product? What alternatives are available to users who are dissatisfied with the product’s availability? What happens to users’ usage of the product at different availability levels?”

Google SRE book

Product/Engineering and Business have to constantly negotiate the balance between the value added by new features and the value lost through bugs, outages, tech debt etc.

An Error Budget is a data-driven way to convince leadership to invest in long-term development velocity, meaning it tells us:

  • When to prioritise bugs and post-mortem actions in the next planning cycle.
  • When to implement automation, monitoring, observability.

Just like with a household budget, if we have the money we can spend it on new features; if not, we have to cut back on our innovation expenses.

Tracking the rate at which the Error Budget is being exhausted (the burn rate) is as useful as keeping an eye on overspending.

If the rate > 1, we’re consuming the budget faster than we should and we’re getting into debt.
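
As a back-of-the-envelope sketch (all numbers below are made up for illustration), the burn rate can be computed as the fraction of budget consumed divided by the fraction of the SLO window elapsed:

```
# Back-of-the-envelope sketch: all numbers are assumptions for illustration.
slo_target = 0.999                  # 99.9% availability SLO
window_minutes = 30 * 24 * 60       # 30-day rolling window
budget_minutes = (1 - slo_target) * window_minutes   # ~43.2 minutes of downtime allowed

downtime_so_far = 30                # minutes of downtime observed so far
elapsed_minutes = 10 * 24 * 60      # 10 days into the window

# Burn rate: fraction of budget consumed vs fraction of window elapsed
burn_rate = (downtime_so_far / budget_minutes) / (elapsed_minutes / window_minutes)
print(f"burn rate: {burn_rate:.2f}")  # ~2.08 here: spending twice as fast as we should
```

A burn rate of exactly 1 means the budget will last precisely until the end of the window; anything above 1 means it will run out early.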

We can also have special types of Error Budgets, but those are usually a bad sign and should warrant a post-mortem as to why we had to use them.

  • A Rainy Day Fund for unexpected events.
  • Silver Bullets for “critical” new features.

The Error Budget equation

Time-To-Detect (TTD)

The time it takes from the moment a user is impacted by an issue until someone is informed of it.

Time-To-Resolution (TTR)

The time it takes from someone being informed of an issue until the issue is resolved.

Time-To-Failure/Time-Between-Failures (TTF/TBF)

How frequently a particular failure occurs, measured as the time between occurrences.

When they come with an ‘M’ in front, aka ‘Mean Time to X’, they are averaged.

Error Budget = 1 - Availability Target
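
To make the equation concrete, here is a minimal sketch (assuming a 30-day SLO window) that converts a few common availability targets into allowed downtime:

```
# Sketch: turn an availability target into an error budget and allowed
# downtime, assuming a 30-day window.
window_minutes = 30 * 24 * 60

for target in (0.99, 0.999, 0.9999):
    error_budget = 1 - target
    allowed_downtime = error_budget * window_minutes
    print(f"{target:.2%} target -> {error_budget:.2%} budget "
          f"-> {allowed_downtime:.1f} minutes of downtime per 30 days")
```

It also shows why ‘too much reliability’ gets expensive: every extra nine shrinks the allowed downtime by a factor of ten.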

The Error Budget equation tells us how to decrease unavailability, and consequently increase availability: each incident burns roughly (TTD + TTR) × impact of the budget, and incidents recur roughly once every TTF/TBF. That gives us four levers:

1. Decrease Time-To-Detect
  • Monitoring and alerting catch outages faster.
2. Decrease Time-To-Resolution
  • Make it quicker to troubleshoot with good developer Runbooks.
  • Improve logs for fire fighting.
  • Add traces.
  • Automate fail-overs like redirecting traffic or backups.
3. Decrease the impact
  • Limit the number of users affected with a gradual roll-out.
  • Increase reversibility with feature flags.
  • Implement graceful degradation, e.g. Circuit Breaker Pattern, throttle requests, limit retry calls with exponential backoff, set client timeouts and limit queues (see the retry sketch after this list).
4. Increase Time-To-Failure
  • Analyse and understand the cause of failure.
  • Do proactive maintenance work.
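
As noted in the graceful degradation bullet above, limiting retries with exponential backoff and client timeouts is one way to reduce the impact of an incident. Here is a minimal sketch (the request_fn callable and the specific limits are assumptions for illustration):

```
import random
import time

def call_with_backoff(request_fn, max_retries=3, base_delay=0.1, max_delay=5.0, timeout=2.0):
    """Retry a flaky call with capped, jittered exponential backoff.
    request_fn is a hypothetical callable that honours a client timeout."""
    for attempt in range(max_retries + 1):
        try:
            return request_fn(timeout=timeout)
        except Exception:
            if attempt == max_retries:
                raise                        # retries exhausted: fail fast, don't pile on
            delay = min(base_delay * 2 ** attempt, max_delay)
            time.sleep(delay * random.uniform(0.5, 1.5))  # jitter avoids retry storms
```

The jitter stops many clients from retrying in lockstep and amplifying the original failure.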

The properties of a good Error Budget policy

Since missed SLOs indicate that users are unhappy, it’s in the interest of the business to have a mechanism that enforces investment in engineering work to improve reliability.

Such a mechanism is provided by an Error Budget Policy, which outlines the trade-offs between reliability and feature work. Implementing and following an Error Budget Policy not only results in increased reliability and customer happiness but also in decreased firefighting and finger-pointing within teams. It’s a win-win situation.

An Error Budget Policy is likely to apply to multiple services and teams across the organisation. It’s best kept and maintained in a highly visible place and stored as metadata next to the SLO definition (for example as a link).

A good Error Budget Policy has seven properties:

If the Error Budget is exhausted or threatened, the Policy should be able to enforce a shift in engineering effort, re-prioritising reliability work over new features.

The policy should clarify when this reprioritisation takes effect, for example when the budget is close to exhaustion.

It describes how teams will prioritise reliability work. For example, if the budget is threatened but not exhausted, one or two developers are allocated to fix all priority issues from the relevant post-mortems. On the other hand, if the budget has been exhausted for months in a row, maybe the entire dev team should focus solely on reliability work until the budget is replenished to a comfortable level.

For a policy to be enforceable, it has to spell out the consequences and risks of the reliability work not happening. At the end of the day, this work is needed to keep customers happy; if it doesn’t happen, the business is ultimately failing at its core value.

The policy should be consistently applied across teams and throughout the year. There might be one or two exceptions, like Silver Bullets because of a potential breach of contract or an unexpected marketing opportunity. However, Silver Bullets should be treated as extraordinary circumstances and followed by a post-mortem explaining how they can be avoided in the future.

The policy needs a final owner and decision-maker because disagreements between parties (e.g. Product and Engineering or different dev teams) will always happen.

It’s difficult for people to adhere to a policy they dislike. However, once all the parties involved (product managers, developers, SREs, executives etc.) provide feedback that is analysed and incorporated into the policy, everyone should commit to following the policy for actual results to show.

Example Error Budget Policy

Example of Error Budget Policy scenarios and escalation

Google’s CRE Life Lessons — Applying the Escalation Policy has four scenarios illustrating how to apply the policy thresholds for a service that targets “three nines” availability but burns half of its error budget on background errors.

Example of CRE Risk Analysis template

A risk matrix is useful for calculating the level of risk by considering the probability (likelihood, frequency) and severity (impact) of an event. Its purpose is to increase risk visibility and assist management decision-making.

Know thy enemy: How to prioritise and communicate risks—CRE life lessons

The shortcomings of a general risk matrix applied to SRE quickly become apparent:

  • Are a few minutes of downtime ‘catastrophic’? It depends: most likely yes at four nines of expected availability, but not so much at two nines.
  • What is more manageable: having a catastrophic but extremely rare outage or frequent but minimal ones?

When we apply the concepts above, we get a clearer picture of risk, in the context of an actual SLO:

  • The likelihood is measured by Mean-Time-Between-Failures (MTBF)
  • The impact is measured by Mean-Time-To-Recover (MTTR)
  • The acceptable risk is set by the Error Budget
  • The target availability is set by the SLO

With these in hand, we can create a catalogue of risks by estimating the loss in minutes over a period of time. We use past data and our intuition to assign acceptable values for MTBF (days) and MTTR (minutes counted against the SLO). We can use a traffic light system to rank risks visually (a rough calculation is sketched after the list below):

  • Red risks are unacceptable. We need to invest engineering effort immediately, as these are above the Error Budget for a single risk and can have a major impact on reliability in a single event.
  • Amber risks need to be addressed urgently because, although not critical, they are a big consumer of our Error Budget. However, they can be tolerated if there is enough budget.
  • Green risks are acceptable because they are not major consumers of our Error Budget and even in aggregate, do not go over the Error Budget. We might want to address them when we want to ‘buy back’ some budget to accept amber risks that are harder to eliminate.
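
To make the traffic-light ranking concrete, here is a rough sketch of how a single risk could be scored against the yearly budget. The formula and all numbers are illustrative assumptions, not the exact columns used in Google’s template (which works with ETTD, ETTR, ETTF and impact):

```
# Rough sketch, not the exact formula from Google's template:
# expected loss ~= frequency x (detection + recovery time) x share of users affected.

def expected_bad_minutes_per_year(mtbf_days, ttd_minutes, ttr_minutes, impact_fraction):
    incidents_per_year = 365 / mtbf_days
    return incidents_per_year * (ttd_minutes + ttr_minutes) * impact_fraction

slo_target = 0.999
yearly_budget_minutes = (1 - slo_target) * 365 * 24 * 60   # ~526 minutes

# Hypothetical risk: a bad config push every ~90 days, detected in 15 minutes,
# resolved in 60 minutes, affecting 20% of users.
risk_cost = expected_bad_minutes_per_year(90, 15, 60, 0.2)
print(f"~{risk_cost:.0f} bad minutes/year against a budget of ~{yearly_budget_minutes:.0f}")
```

A risk whose expected bad minutes exceed the yearly budget on its own would be red; one that consumes a large share of it would be amber; the rest are green.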

Finally, a spreadsheet template for this sort of CRE Risk Analysis is provided below for further assistance. You can make a copy for your own purposes. You can play with the numbers by using the strategies above to decrease the ETTD, ETTR, ETTF and impact in the ‘Risk Catalog’ sheet and/or changing the ‘Target Availability’ under ‘Risk Stack Rank’.

***

If you or your CTO / technology lead would benefit from any of the services offered by the CTO Craft community, use the Contact Us button at the top or email us and we’ll be in touch!


Subscribe to Tech Manager Weekly for a free weekly dose of tech culture, hiring, development, process and more.