Companies want to capture user happiness in metric form so that they can provide the level of reliability that actually maximises it. In this series of posts, I’m writing about using Service Level Indicators (SLIs) and Service Level Objectives (SLOs) in data-driven negotiations between Engineering, Product and Business to achieve this goal.

The Customer Reliability Engineering Principle says that it’s the user experience that determines the reliability of our services. Since we can’t measure user happiness directly, SLIs are proxies that help answer the question: “Is our service working as our users expect it to?”. The closer to the user we measure our system’s performance, the more accurately our SLIs will reflect user happiness.

Why social media is not great at measuring user happiness

Social media channels are not good indicators of users’ (un)happiness. We want indicators to be quantifiable and predictable, ideally in a linear relationship with user happiness. Predictability is key and good indicators show long-term trends clearly. 

“Social media metrics have several drawbacks. The data isn’t timely, is often dubious and is sometimes outright malicious: competitors can spam Downdetector during your public launches, and newspapers pick up on it. It’s powered by crowdsourced user reports of ‘problems’, but those reports aren’t targeted at specific areas of the site; they only tell you that there are problems. Monitoring is faster at detecting incidents and measuring their resolution. Synthetics are a more targeted and reliable alternative.”

Ben Cordero, Staff Software Engineer, SRE, Deliveroo

The anatomy of a good SLI

The main challenge with choosing “good” SLIs is system complexity. Having lots of SLIs is unhelpful because they become a sea of noise.

Commonly chosen types of SLIs that aren’t actually good

❌ System metrics: Tempting because a sharp change is often associated with outages. But most users don’t care about “CPU being 100%”; they care about the system being slow:

  • Load average
  • CPU utilisation
  • Memory usage
  • Bandwidth

❌ Internal state: The data is noisy; there are too many reasons why large changes can occur, and the changes can be temporary while the system scales up or down. None of these metrics has a predictable relationship with the happiness of our users.

  • Thread pool fullness
  • Request queue length
  • Outages

The SLI equation

The proportion of all valid events that were good.
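
In formula form, that is simply:

  SLI = (good events / valid events) × 100%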

Why we care about valid events

Some events recorded by monitoring tools, for example bot requests or health checks, need to be excluded so that they don’t consume the Error Budget (more about that later). Restricting the SLI to “valid” events makes this possible.

  • For HTTP requests, validity is often determined by request params, e.g. hostname or requested path.
  • For data processing systems, validity is determined by the selection of inputs.
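
As a minimal sketch of how this filtering might look for an HTTP service (the field names, bot signatures and the definition of a “good” event below are illustrative assumptions, not a prescription):

  # Hypothetical structured request records with 'path', 'user_agent' and 'status'.
  HEALTH_CHECK_PATHS = {"/healthz", "/readyz"}
  BOT_SIGNATURES = ("Googlebot", "bingbot", "UptimeRobot")

  def is_valid(request: dict) -> bool:
      """Exclude events that shouldn't consume the Error Budget."""
      if request["path"] in HEALTH_CHECK_PATHS:
          return False
      user_agent = request.get("user_agent", "")
      return not any(bot in user_agent for bot in BOT_SIGNATURES)

  def is_good(request: dict) -> bool:
      """Treat any non-5xx response as good (an assumption for illustration)."""
      return request["status"] < 500

  def sli(requests: list[dict]) -> float:
      """The proportion of all valid events that were good, as a percentage."""
      valid = [r for r in requests if is_valid(r)]
      good = [r for r in valid if is_good(r)]
      return 100.0 * len(good) / len(valid) if valid else 100.0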

SLIs are best aggregated over a reasonably long time window to smooth out the noise in the underlying data. SLIs need to support a clear definition of good and bad events, and it’s much harder to set a meaningful threshold for metrics with high variance and poor correlation with user experience.

In the example above, the good metric has a noticeable dip that matches the time span of an outage. 

  • It has less noise because the data has been smoothed over a time window.
  • It has a narrower range of values during normal operation that is noticeably different from the outage range during the outage. This makes it easier to set thresholds for.
  • It tracks the performance of the service against the user expectations accurately and predictably.

The bad metric forces us either to set a tight threshold and risk false positives, or to set a loose threshold and risk false negatives. Worse still, choosing the middle ground means accepting both risks.
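
To illustrate the smoothing mentioned above, here is a minimal sketch that aggregates per-minute counts of good and valid events over a rolling one-hour window (the file and column names are made up for the example):

  import pandas as pd

  # Per-minute counts of good and valid events, e.g. exported from monitoring.
  df = pd.read_csv("sli_per_minute.csv", parse_dates=["timestamp"])
  df = df.set_index("timestamp").sort_index()

  # Sum over a one-hour rolling window before taking the ratio,
  # so short-lived spikes don't dominate the SLI.
  rolled = df[["good", "valid"]].rolling("1h").sum()
  df["sli"] = 100.0 * rolled["good"] / rolled["valid"]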

Five ways to measure SLIs and their trade-offs

The closer to the user we measure, the better approximation of their happiness we’ll have. The options below are listed in increasing proximity to users.

1. Server-side logs

Server logs are one of the few ways to track the reliability of complex user journeys with many request-response interactions during a long-running session. 

Pros

  • Even if we haven’t measured it previously, we can still process request logs retroactively to backfill the SLI data and get an idea of the historical performance. 
  • If the SLI needs convoluted logic to determine what events were good, this could be written into the code of the logs and processing jobs and exported as a much simpler ‘good events’ counter. 

Cons

  • The engineering effort to process logs is significant. 
  • Reconstructing user sessions requires an even bigger effort.
  • Ingestion and processing add significant latency between an event occurring and being observed in the SLI, making log-based SLIs unsuitable for triggering an emergency response.
  • Requests that don’t make it to the application servers can’t be observed by log-based SLIs at all.

2. Application-level metrics

Application-level metrics capture the performance of individual requests.

Pros

  • Easy to add.
  • They don’t have the same measurement latency as log processing.

Cons

  • Can’t easily measure complex multi-request user journeys by exporting metrics from stateless servers.
  • There’s a conflict of interest between generating the response and exporting metrics related to the response content.

3. Cloud provider’s front-end load balancer

Pros

  • The cloud’s load balancer has detailed metrics and historical data.
  • The engineering effort to get started is smaller.

Cons

  • Most load balancers are stateless and can’t track sessions, so they don’t have insight into the response data. 
  • They rely on setting correct metadata in the response envelope to determine if responses were good.

4. Synthetic clients

Synthetic clients can emulate a user’s interaction with the service to confirm if a full journey has been successful and verify if the responses were good, outside of our infrastructure.

Pros

  • Can monitor a new area of the website or application before getting real traffic, so there’s time to remedy availability and performance issues. 
  • Easy to simulate a user in a certain geography.
  • Helpful to assess the reliability of third parties like payment processors, recommendation engines, business intelligence tools etc.

Cons

  • A synthetic client is only an approximation of user behaviour. Users are human, so they do unexpected things. Synthetics might not be enough as the sole measurement strategy.
  • Covering all the edge-cases of the user journey with a synthetic client is a huge engineering effort that usually devolves into integration testing.

5. Client-side telemetry

Another option is to instrument the client using Real User Monitoring (RUM) tags to provide telemetry for the SLI.

Pros

  • A far more accurate measure of the user experience. 
  • Helpful to assess the reliability of third-party systems involved in the user journey.

Cons

  • Telemetry from clients can incur significant measurement latency, especially for mobile clients. Waking up the device every few seconds is detrimental to battery life and user trust. 
  • It’s unsuitable for emergency responses.
  • It captures many factors outside of our direct control, lowering the signal to noise ratio of SLIs. For example, mobile clients could suffer from poor latency and high error rates, but we can’t do much about it, so we have to relax our SLOs to accommodate these situations.

The SLI buffet

To create a good SLI, we need a specification and an implementation:

  • The specification is the desired outcome from a user perspective. 
  • The implementation is the specification plus a way to measure it.

1. Request/Response

Example: an HTTP service where the user interacts with the browser or a mobile app to send API requests and receive responses.

1.1. Availability

There are two ways to measure availability: time-based and aggregate (event-based).

Time-based availability

How long the service was unavailable during a given period of time.

Aggregate availability

The proportion of valid requests served successfully.

Implementation
  1. Which of the requests the system serves are valid for the SLI?
  2. What makes a response successful?

Aggregate availability is a more reasonable approximation of unplanned downtime from a user perspective because most systems are at least partially up all the time. It also provides a consistent metric for systems that don’t have to run all the time, like batch processing.
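
As a rough illustration of the difference (the numbers are made up): over a 30-day window of 43,200 minutes, a service that was completely down for 21.6 minutes has a time-based availability of 99.95%, while a service that successfully served 999,500 of 1,000,000 valid requests has an aggregate availability of 99.95% even if it was never entirely down at any point.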

When considering the availability of an entire user journey, we also need to account for users who voluntarily exit before completing the journey.

1.2. Latency

The proportion of valid requests served faster than a threshold.

Implementation
  1. Which of the requests the system serves are valid for the SLI?
  2. When does the timer for measuring latency start and stop?

When setting a single latency threshold, we need to consider the long tail of requests: typically 95% or 99% of requests must respond faster than the threshold for users to be happy. The relationship between user happiness and latency follows an S-curve, so it’s also good to set thresholds at the 75th to 90th percentiles to describe it with more nuance.
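
A minimal sketch of measuring the latency SLI against several thresholds at once (the thresholds and file name are illustrative, not recommendations):

  import numpy as np

  # One latency measurement per valid request, in milliseconds.
  latencies_ms = np.loadtxt("request_latencies_ms.txt")

  # Report the proportion of valid requests faster than each candidate threshold.
  for threshold_ms in (200, 500, 2000):
      proportion = 100.0 * np.mean(latencies_ms <= threshold_ms)
      print(f"{proportion:.2f}% of valid requests served within {threshold_ms} ms")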

Things that affect latency:

  • Pre-fetching
  • Caching
  • Load spikes

RobinHood: Tail Latency-Aware Caching lists several strategies to maintain low request tail latency, such as load balancing, auto-scaling, caching and prefetching. The difficulty lies in user journeys with multiple requests across multiple backends, where the latency of the slowest request defines the latency of the journey. Even when requests can be parallelised among backends and all backends have low tail latency, the resulting tail latency can still be high. The Tail at Scale offers interesting techniques to tolerate latency variability in large-scale web services.

Latency and batch processing

Latency is equally important to track for data processing or asynchronous queue tasks. For example, if we have a batch processing pipeline that runs daily, that pipeline shouldn’t take more than a day to run.

We must be careful when reporting the latency of long-running operations only on their eventual success or failure. For example, if the threshold for operational latency is 30 minutes, but the latency is only reported after the process fails two hours later, there’s a 90-minute window where the operation has missed expectations without being measured.

1.3. Quality

The proportion of valid requests served without degraded quality.

Implementation
  1. Which of the requests the system serves are valid for the SLI?
  2. How to determine whether the response was served without degraded quality?

Sometimes we trade off the quality of the user response against CPU or memory utilisation. We need to track this graceful degradation of service with a quality SLI.

Users might not be aware of the degradation until it becomes severe. However, degradation can still impact the bottom line. For example, degraded quality could mean serving fewer ads to users, resulting in lower click-through rates.

It’s easier to express this SLI in terms of bad events rather than good ones. The mechanism used by the system to degrade response quality should also mark the responses as degraded and increment metrics to count them.

As with latency, response degradation falls along a spectrum with multiple thresholds. For example, consider a service that fans out incoming requests to ten backends, each with a 99.9% availability target and the ability to reject requests when overloaded. We might choose to serve 99% of responses with no missing backend responses and 99.9% with no more than one missing backend response.
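
A minimal sketch of that marking-and-counting mechanism, using the prometheus_client library (the metric names and the missed_backends signal are assumptions based on the fan-out example above):

  from prometheus_client import Counter

  # Exposed as responses_total and responses_degraded_total.
  RESPONSES = Counter("responses", "All valid responses served")
  RESPONSES_DEGRADED = Counter("responses_degraded",
                               "Responses served with degraded quality")

  def record_response_quality(missed_backends: int) -> None:
      """Called by the same code path that assembled (and possibly degraded)
      the response, so degradation is counted where it happens."""
      RESPONSES.inc()
      if missed_backends > 0:
          RESPONSES_DEGRADED.inc()

The quality SLI is then one minus the ratio of degraded responses to all valid responses.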

2. Data Processing

Examples
  • A video service that converts from one format to another.
  • A system that processes logs and generates reports.
  • A storage system that accepts data and makes it available for retrieval later on.

2.1. Freshness

The proportion of valid data updated more recently than a threshold.

Implementation
  1. What data is valid for the SLI?
  2. When does the timer measuring data freshness start and stop?

For a batch processing system, freshness can be approximated as the time since the completion of the last successful run. More accurate measurements require the processing system to track generation and source age timestamps.

For streaming processing systems, we can measure freshness with a watermark that tracks the age of the most recent record that has been fully processed.

Serving stale data is a common way for response quality to be degraded without the system making an active choice. If we don’t track it and no user accesses the stale data, we can miss freshness expectations.

The system that generates the data must also produce a generation timestamp so that the infrastructure can check against the freshness threshold when it reads the data.
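
A minimal sketch of that check at read time (the field name and threshold are illustrative):

  import time

  FRESHNESS_THRESHOLD_S = 15 * 60  # e.g. data must be under 15 minutes old

  def is_fresh(record: dict) -> bool:
      """Compare the generation timestamp against the freshness threshold
      at the moment the data is read."""
      age_s = time.time() - record["generation_timestamp"]
      return age_s <= FRESHNESS_THRESHOLD_S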

2.2. Correctness

The proportion of valid data producing correct output.

Implementation
  1. What data is valid for the SLI?
  2. How to determine the correctness of output records?

The methods for determining correctness need to be independent of the methods used to generate the output; otherwise, bugs introduced during generation are likely to affect validation too.

To estimate overall correctness, the input data must be sufficiently representative of real user data and exercise most of the processing system’s code paths.
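
A minimal sketch, assuming a set of “golden” input records whose correct outputs are produced independently of the pipeline being validated:

  def correctness_sli(pipeline_output: dict, golden_output: dict) -> float:
      """Proportion of golden records for which the pipeline produced
      exactly the independently computed output."""
      checked = [key for key in golden_output if key in pipeline_output]
      correct = [key for key in checked
                 if pipeline_output[key] == golden_output[key]]
      return 100.0 * len(correct) / len(checked) if checked else 100.0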

2.3. Coverage

The proportion of valid data processed successfully.

Implementation
  1. What data is valid for the SLI?
  2. How to determine whether the processing of data was successful?

The data processing system should determine whether a record that began processing has finished, and whether the outcome was a success or a failure.

The challenge is with records that should have been processed but were missed for some reason. To solve this, we need to determine the number of valid records outside the data processing system itself, directly in the data source. 

For batch processing, we can measure the proportion of jobs that processed data above a threshold amount. 

For streaming processing, we can measure the proportion of incoming records that were successfully processed within a time window.

2.4. Throughput

The proportion of time where the data processing rate is faster than a threshold.

Implementation
  1. The units of measurement of the data processing rate, e.g. bytes per second.

How does this differ from latency? Throughput is the rate of events over time. As with latency and quality, throughput rates are a spectrum.
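
A minimal sketch, assuming per-minute byte counts from the pipeline and an illustrative threshold:

  def throughput_sli(bytes_per_minute: list[int],
                     threshold_bytes_per_s: float = 50_000_000) -> float:
      """Proportion of one-minute windows whose processing rate met the threshold."""
      if not bytes_per_minute:
          return 100.0
      good = [b for b in bytes_per_minute if b / 60 >= threshold_bytes_per_s]
      return 100.0 * len(good) / len(bytes_per_minute)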

Managing SLI Complexity

A small number of SLIs for the most critical user journeys

We should have one to three SLIs for each user journey, even if they are relatively complex. Why?

  1. Not all metrics make good SLIs.
  2. Not all user journeys are equally important. We should prioritise those where reliability has a significant impact on business outcomes.
  3. The more SLIs we have, the more cognitive load for the team to learn and understand the signals needed to respond to outages.
  4. Too many SLIs increase the probability of conflicting signals, which will drive up the Time-To-Resolution because the team will be chasing down “red herrings”.

Monitoring and observability

Having SLIs alone is not enough. We need monitoring and observability. Why?

  1. The deterioration of an SLI is an indication that something is wrong.
  2. When the deterioration becomes bad enough to provoke an incident response, we need other systems like monitoring and observability to identify what is wrong.

Manage complexity with aggregation

Let’s take the example of a typical e-commerce website, where a user lands on the home page, searches or browses a specific category of products and then goes into the product details. To simplify, we can group these events into a single “browsing journey” and then sum up the valid and good events into overall browse availability and latency SLIs.

The problem with summing events is that it treats all of them equally, even though some might be more important than others and request rates can differ significantly: summing hides low-traffic events in the noise of high-traffic ones. One solution is to weight each event type’s contribution to the SLI, whether by traffic rate or by the importance of the user journey.
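
A minimal sketch of that weighting (the journey names, counts and weights are made up):

  # (good events, valid events, weight) per event type in the browsing journey.
  journeys = {
      "home_page":    (995_000, 1_000_000, 1.0),
      "search":       (489_000,   500_000, 2.0),  # more important to users
      "product_page": (  9_900,    10_000, 3.0),  # low traffic, high intent
  }

  weighted_good = sum(good * weight for good, _, weight in journeys.values())
  weighted_valid = sum(valid * weight for _, valid, weight in journeys.values())
  browse_sli = 100.0 * weighted_good / weighted_valid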

Manage complexity with bucketing

Another source of complexity is choosing different thresholds for different SLOs. To reduce it, we can shrink the set of thresholds and give them consistent, easily recognisable and comparable labels, for example by grouping requests into a few discrete response buckets.

Bucket 1: Interactive requests

The first step is to identify when a human user is actively waiting for a response. This is important because requests could also come from bots and mobile devices pre-fetching data overnight on WiFi and AC power. However, we care about the human user experience.

Bucket 2: Write requests

The second step is to categorise which requests mutate state in the system. This is important because writes and reads, especially in distributed systems, have different characteristics. For example, users are already accustomed to waiting a bit more after clicking ‘Submit’ than when seeing static information on a page.

Bucket 3: Read requests

The third step is choosing which requests should have the strictest latencies. Choosing a spectrum of thresholds is a good idea:

  1. Annoying requests: 50–75% of requests are faster than this threshold.
  2. Painful requests (long-tail): 90% of requests are faster than this threshold.

Bucket 4 (optional): Third-party dependent requests

When we have third-party dependencies like payment providers, we can’t make their requests faster because they’re not within our control. A solution is to make the responsibility boundaries explicit to the user, for example by making it visible at which point in the user journey the third-party dependency kicks in.

We could also bucket by customer tier: enterprise customers have tighter SLOs than self-serve ones.
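
A minimal sketch of assigning requests to these buckets (the field names and rules are assumptions for illustration):

  def bucket_for(request: dict) -> str:
      """Label each request with a bucket that carries its own latency threshold."""
      if request.get("is_bot") or request.get("is_prefetch"):
          return "non-interactive"   # excluded from interactive latency SLIs
      if request.get("third_party_dependency"):
          return "third-party"       # bucket 4 (optional)
      if request["method"] in ("POST", "PUT", "DELETE"):
          return "write"             # bucket 2
      return "interactive-read"      # buckets 1 and 3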

Achievable vs aspirational SLOs

Once we’ve chosen SLIs with a close, predictable relationship to user happiness, the next step is to choose good-enough reliability targets.

Users’ expectations are strongly tied to past performance. The best way to arrive at a reasonable target is to use historical monitoring data to tell us what’s achievable. If that’s missing, or the business needs change, the solution is to gather data and iterate towards achievable and aspirational targets.

Achievable SLOs are based on historical data, when there’s enough information to set targets that meet users’ expectations in most cases. The downside of achievable SLOs is that the underlying assumption, that users are happy with past and current performance, is impossible to validate from monitoring data alone.

  • What if our feature is completely new? 
  • What if our users only stick with us because the competition is far worse?
  • What if our users are so happy with our performance that we could relax some SLOs to increase our margins?

As stated in Disciplined Entrepreneurship: 24 Steps to a Successful Startup, aspirational SLOs are based on business needs. Like OKRs, they are set higher than the achievable ones. Since they start from assumptions about the users’ happiness, it’s totally reasonable to not hit them at first. That’s why it’s more important to set a reasonable target than to set the right target.

The first thing to do when achievable and aspirational SLOs diverge is to understand why.

Why are the users sad even if we’re within an SLO?

To answer that, we need two things:

  1. Tracking signals that act as proxies for user happiness outside monitoring systems, for example NPS or Customer Support requests.
  2. Time. We don’t have to wait an entire year before setting some reasonable targets because a lot can happen in one year: the business can pivot or scale 10x. At the same time, we also don’t want to panic every week and change the targets based on fear or hope.

We should iterate to check if the assumptions have changed through continuous learning and improvement.

Four steps to arrive at good SLOs

1. Choose an SLI specification from the SLI buffet.

Questions:

  • What does the user expect this service to do?
  • What would make the user unhappy with this service?
  • Do different types of failures have different effects on the service?

Output: SLIs for request/response and data processing.

2. Refine the specification into a detailed SLI implementation.

Questions:

  • What does the SLI measure?
  • Where is the SLI measured?
  • What metrics should be included and excluded?

Output: A detailed-enough SLI implementation that can be added to a monitoring system.

3. Walk through the user journey and look for coverage gaps.

Questions:

  • What edge cases doesn’t the SLI specification cover?
  • How risky are those edge cases?

Output: A documented list of edge cases and/or augmenting measurement strategies.

4. Set aspirational SLO targets based on business needs.

Questions:

  • What historical performance data can we use to set the initial targets?
  • What other user happiness signals can we use to estimate targets?
  • If there are competitors on the market, what levels of service do they offer?
  • What is the profile of the user: self-serve or enterprise?
  • What is the cost for the next order of magnitude in the SLO target?
  • What is worse for users: a constant rate of low failures or an occasional full-site outage?

Output: SLO targets on a spectrum

(The SLO decision matrix from Google SRE book / Example SLO document)

In part two, I’ll dive into the need for Error Budgets, the seven properties of a good Error Budget Policy, and an example of a CRE Risk Analysis template.
