
Error Budget

What is Error Budget?

Error Budget is one of the most important concepts in SRE, serving as an objective metric to resolve conflicts between development teams (feature development speed) and SRE teams (system reliability).

The "Embracing Risk" chapter of the Google SRE book argues that 100% reliability is neither achievable nor desirable. Reliability beyond what users can notice carries an opportunity cost, because the effort spent on it could have gone into feature development.

Definition

Error Budget is defined as 100% minus the Service Level Objective (SLO).

Error Budget = 100% - SLO

For example, if the quarterly SLO is 99.9%, the Error Budget is 0.1%. This means that failure (downtime or errors) is acceptable up to 0.1% in that quarter.
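As a rough illustration of the definition, the sketch below converts an SLO into an error budget and, for a time-based availability SLO, into an allowed amount of downtime. The function names and the 90-day window are assumptions made for this example, not part of any standard tooling.

```python
from datetime import timedelta


def error_budget_fraction(slo_percent: float) -> float:
    """Error Budget = 100% - SLO, returned as a fraction (0.001 for 99.9%)."""
    if not 0.0 <= slo_percent <= 100.0:
        raise ValueError("SLO must be between 0 and 100 percent")
    return (100.0 - slo_percent) / 100.0


def allowed_downtime(slo_percent: float, window_days: int = 90) -> timedelta:
    """Downtime the budget permits over the window (assumed 90-day quarter)."""
    return timedelta(days=window_days) * error_budget_fraction(slo_percent)


if __name__ == "__main__":
    # Quarterly SLO of 99.9% leaves a 0.1% budget.
    print(allowed_downtime(99.9))
```

Under these assumptions, a 99.9% quarterly SLO allows a little over two hours of total downtime across the 90 days.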

Purpose of Error Budget

The main purpose of Error Budget is to decide the balance between release speed and reliability based on data.

  • If budget remains:

    • The development team can actively release new features.
    • Experimental attempts and risky changes can be made.
    • Simplifying tests to prioritize speed is also acceptable.
  • If budget is exhausted:

    • Pause feature releases.
    • Focus engineering resources on "improving reliability" (enhancing tests, fixing bugs, improving infrastructure, etc.).
    • Refrain from risky operations until the budget recovers (or the next period begins).

This eliminates political negotiations and emotional conflicts between "development wanting to release" and "operations wanting stability," allowing decisions to be made based on the objective fact of "Is there budget remaining?".
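The decision rule described above can be sketched as a small function. This is a minimal illustration, assuming the budget is tracked as a count of allowed and observed failures; the `ReleasePolicy` enum and `release_policy` function are hypothetical names, and a real policy would likely have more states (for example, a warning threshold before the budget is fully spent).

```python
from enum import Enum


class ReleasePolicy(Enum):
    # Hypothetical policy states mirroring the two cases in the list above.
    SHIP_FEATURES = "budget remains: feature releases and riskier changes allowed"
    FREEZE_AND_FIX = "budget exhausted: pause feature releases, improve reliability"


def release_policy(allowed_failures: int, observed_failures: int) -> ReleasePolicy:
    """Pick a policy from the remaining budget in the current period."""
    remaining = allowed_failures - observed_failures
    if remaining > 0:
        return ReleasePolicy.SHIP_FEATURES
    return ReleasePolicy.FREEZE_AND_FIX


if __name__ == "__main__":
    print(release_policy(100_000, 25_000).value)
    print(release_policy(100_000, 100_000).value)
```

The point of encoding the rule is that both teams agree on it in advance, so the outcome depends only on the measured numbers.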

Calculation Example

The following example assumes the SLO is defined as a request success rate.

  • Period: 1 quarter (about 90 days)
  • Total Requests: 1 billion
  • SLO: 99.99%

The Error Budget (allowable number of failures) in this case is as follows.

1,000,000,000 × (1 - 0.9999) = 1,000,000,000 × 0.0001 = 100,000 failed requests

In other words, up to 100,000 errors in this quarter are treated as "within acceptable range".
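The same arithmetic can be reproduced in code. The sketch below uses `decimal.Decimal` rather than binary floating point, because `1 - 0.9999` is not exactly `0.0001` as a float and truncating the product could yield 99,999 instead of 100,000. The function names and the observed failure count in the usage lines are assumptions for illustration.

```python
from decimal import Decimal


def allowed_failures(total_requests: int, slo_percent: str) -> int:
    """Number of failed requests the budget permits; SLO given as a string such as "99.99"."""
    budget_fraction = (Decimal(100) - Decimal(slo_percent)) / Decimal(100)
    return int(total_requests * budget_fraction)


def budget_consumed(observed_failures: int, allowed: int) -> float:
    """Fraction of the budget already spent (1.0 or more means exhausted)."""
    if allowed <= 0:
        raise ValueError("allowed failures must be positive")
    return observed_failures / allowed


if __name__ == "__main__":
    allowed = allowed_failures(1_000_000_000, "99.99")
    print(allowed)                           # allowed failures for the quarter
    print(budget_consumed(25_000, allowed))  # hypothetical 25,000 failures so far
```

Note that the total request count is only known at the end of the period, so in practice this calculation would be run against a forecast or a running total.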

Case Study: YouTube's Availability Target

The Google SRE book emphasizes setting availability targets (SLOs) that fit the nature of the service, and it presents YouTube as an example.

Background

When Google acquired YouTube in 2006, it had to decide what availability target was appropriate for the service.

Decision: Lower the Target

Google set a lower availability target for YouTube than for its enterprise products, such as Gmail and Google Calendar (Google Apps for Work).

Reason

  • Difference in Business Phase: YouTube at the time was a rapidly growing consumer service, and the speed of feature development was paramount.
  • Difference in User Base: Downtime for enterprise users (companies) leads directly to business stoppage, but users of consumer video services did not demand such strict availability.
  • Priority on Innovation: Aiming for excessively high reliability would slow down the speed of change and potentially hinder YouTube's growth.

Lesson

This case shows that higher reliability is not always better. Setting an SLO that matches the service's phase and its users' expectations, and then spending the resulting slack (the Error Budget) on feature development speed, serves the business as a whole.

References