SRE Principles

Introduction

Software engineering has something in common with having children: the labor before birth is hard, but the labor after birth accounts for most of the effort. An estimated 40-90% of a system's total cost is incurred after its birth.

One way to create an impetus for improving reliability is to formally recognize reliability work and to promote or reward the people who do it.

Development Team vs Operations Team

Conflicts often arise between development teams and operations teams.

Team | Purpose/Motivation
Development Team | Wants to launch new features whenever they want, without interruption.
Operations Team | Once a system is working, they don't want to change anything for stability.

This conflict arises because the two teams have very different backgrounds, skill sets, and motivations. The indirect costs of the split between development and operations are often greater to the organization than the direct costs.

Google chose a different approach. Its Site Reliability Engineering (SRE) teams focus on hiring software engineers to operate services and to build systems that perform the tasks system administrators would traditionally have done by hand.

Google's Approach to Service Management

Google's SREs quickly get bored with doing tasks by hand. The resulting team is made up of engineers who have the skills to write software that replaces manual work, even when that work is complex. SREs also share an academic and intellectual background with the development organization.

Therefore, SRE has the following characteristics:

  • Engineers with software expertise do the work that an operations team has traditionally done.
  • Engineers have the ability to design and implement software that automates manual management, and are willing to do so.

Focus on Engineering

For the SRE team, focusing on engineering is extremely important.

If the team is not constantly doing engineering, the operational load will keep increasing, and the team will need more people just to keep up with it. Eventually, like a traditional operations-focused group, the team will grow in proportion to the size of the service.

Google's rule to avoid this is clear.

The team responsible for service management tasks must write code.

To make time for this, Google caps the total "operations" work of an SRE team, such as ticket handling, on-call, and manual tasks, at 50% of its time. The system Google wants is not merely automated but automatic: it should run without human intervention.

How SRE Team Uses Time

The SRE team must spend the remaining 50% on actual development. Enforcing this requires measuring how the team spends its time. When operations work exceeds the cap, part of the burden may be pushed back to the development team.
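Measuring how the team spends its time can be as simple as summing logged hours by category. The sketch below is a minimal illustration, assuming a hypothetical time log of (category, hours) entries. The category names and the `ops_fraction` and `over_cap` helpers are invented for this example and do not describe any Google tooling.

```python
from collections import defaultdict

# Hypothetical category names for the "operations" work named in the text:
# ticket handling, on-call, and manual work.
OPS_CATEGORIES = {"tickets", "on_call", "manual"}
OPS_CAP = 0.50  # the 50% cap on operations work


def ops_fraction(entries):
    """Return the share of logged hours spent on operations work.

    entries: iterable of (category, hours) tuples.
    """
    totals = defaultdict(float)
    for category, hours in entries:
        totals[category] += hours
    total_hours = sum(totals.values())
    if total_hours == 0:
        return 0.0
    ops_hours = sum(h for c, h in totals.items() if c in OPS_CATEGORIES)
    return ops_hours / total_hours


def over_cap(entries, cap=OPS_CAP):
    """True when operations work exceeds the cap and should be pushed back."""
    return ops_fraction(entries) > cap


# Assumed example week: 24 of 40 logged hours are operations work.
week = [("tickets", 10), ("on_call", 8), ("manual", 6), ("project", 16)]
print(f"ops share: {ops_fraction(week):.0%}, over cap: {over_cap(week)}")
```

In this assumed week the operations share is 60%, so the team would be over the cap and the excess would be a candidate for pushing back to the development team.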

Maintaining the work balance between operations and development creates the following benefits:

  1. Ensures SREs have room to engage in creative and autonomous engineering.
  2. Maintains insights gathered from the operational side of running the service.

SRE teams are characterized by their readiness to embrace rapid innovation and change. Supporting the same service with an operations-oriented team would require far more people. With the SRE approach, the number of people needed to run, maintain, and improve a system does not grow in proportion to the system's size.

SRE Tenets

1. Ensuring Continuous Focus on Engineering

  • Time spent on operations is capped at 50%.
  • The remaining time is devoted to project work using coding skills.
  • Operational work exceeding the cap is pushed back to the product development team.

2. Pursuing Maximization of Change Velocity Without Dropping Below Service SLO

To resolve the structural conflict between the product development team and the SRE team (speed of innovation vs. product stability), the Error Budget was introduced.

The Error Budget is based on the observation that "a 100% reliability target is the wrong target in basically every case".

  • The slight difference between 100% and 99.99% availability gets lost in the instability of the user environment (Wi-Fi or device).
  • The enormous effort needed to gain the last 0.01% brings no benefit to the user.

Therefore, to set the right target, ask the following questions:

  • What level of availability are users satisfied with?
  • What alternatives are there for users who are not satisfied with the product's availability?
  • What happens to users' usage of the product if the availability level changes?

Leveraging Error Budget: The Error Budget is 1 - Availability Target. For example, a 99.99% availability target leaves an Error Budget of 0.01%. The budget can be spent on anything as long as it is not exceeded. Ideally, the development team spends it on launching new features or making risky changes. This way, the SRE's goal is no longer "zero failures", and the interests of development and operations align.
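The definition above is easy to turn into numbers. The following is a minimal sketch, assuming a 30-day window and a request-based way of counting failures. Both are illustrative choices, and the function names are hypothetical.

```python
def error_budget(availability_target):
    """Error budget as a fraction: 1 - availability target."""
    return 1.0 - availability_target


def allowed_downtime_minutes(availability_target, window_days=30):
    """Downtime the budget permits over a window (30 days is an assumption)."""
    return error_budget(availability_target) * window_days * 24 * 60


def budget_remaining(availability_target, total_requests, failed_requests):
    """Fraction of a request-based budget still unspent (negative if overspent)."""
    allowed_failures = error_budget(availability_target) * total_requests
    if allowed_failures == 0:
        return 0.0  # a 100% target leaves no budget to spend
    return 1.0 - failed_requests / allowed_failures


for target in (0.999, 0.9999):
    minutes = round(allowed_downtime_minutes(target), 2)
    print(f"target {target}: {minutes} minutes of downtime per 30 days")
```

Under these assumptions, a 99.9% target allows about 43 minutes of downtime in 30 days and a 99.99% target about 4.3 minutes. A release process could consult something like `budget_remaining` to decide whether a risky launch still fits within the budget.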

3. Monitoring

A system that requires a human to read emails and decide whether a response is needed is fundamentally flawed. Alerting should not depend on human interpretation. Software should do the interpreting, and humans should be notified only when they need to take action.

Type | Description
Alert | Something that requires immediate human action.
Ticket | Requires human response, but not immediately.
Logging | Records that do not need to be seen unless something happens.
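The three output types can be expressed as a small routing rule. This is a minimal sketch, assuming each monitoring event has already been reduced by software to two booleans. The `Output` enum and `classify` function are hypothetical names, not part of any monitoring product.

```python
from enum import Enum


class Output(Enum):
    ALERT = "alert"    # immediate human action
    TICKET = "ticket"  # human response, but not immediately
    LOG = "log"        # recorded only, read when something happens


def classify(needs_human_action, is_urgent):
    """Map a monitoring event to one of the three output types."""
    if needs_human_action and is_urgent:
        return Output.ALERT
    if needs_human_action:
        return Output.TICKET
    return Output.LOG


print(classify(needs_human_action=True, is_urgent=True))
print(classify(needs_human_action=True, is_urgent=False))
print(classify(needs_human_action=False, is_urgent=False))
```

The point of the design is that the decision happens in software: a human is paged only for the first case and never has to read a message just to decide that nothing needs doing.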

4. Emergency Response

Human involvement adds latency (delay). A system that can avoid emergencies requiring human intervention will have higher availability than one that cannot.

When human involvement is necessary, the best practice is to prepare a Playbook in advance. Doing so yields roughly a 3x improvement in MTTR (Mean Time To Repair) compared with improvising a response.

  • Importance of Playbooks: Even excellent engineers can solve problems faster with a playbook.
  • Training: Google SRE conducts failure response training like "Wheel of Misfortune" to prepare engineers to deal with events during on-call.
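To see why a shorter repair time matters, one common steady-state approximation is availability = MTTF / (MTTF + MTTR), where MTTF is the mean time to failure. That formula and the figures below are assumptions for illustration and do not come from the text above. Only the 3x ratio does.

```python
def availability(mttf_hours, mttr_hours):
    """Steady-state availability: uptime divided by uptime plus repair time."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Assumed figures: one failure every 30 days (720 h), 3 h to repair
# without a playbook.
MTTF_HOURS = 720.0
mttr_without_playbook = 3.0
mttr_with_playbook = mttr_without_playbook / 3  # the roughly 3x improvement

before = availability(MTTF_HOURS, mttr_without_playbook)
after = availability(MTTF_HOURS, mttr_with_playbook)
print(f"without playbook: {before:.4%}")
print(f"with playbook:    {after:.4%}")
```

With these assumed numbers, the failure rate is unchanged, yet the fraction of time the service is unavailable falls by close to a factor of three.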

5. Change Management

Approximately 70% of service failures are caused by changes to the running system. The best practice is to automate the following:

  • Progressive Rollout: Apply the change to a gradually widening scope.
  • Fast and Accurate Problem Detection: Detect problems quickly and accurately.
  • Safe Rollback: Immediately revert when a problem occurs.

Taking humans out of the loop for these practices avoids human-specific problems such as fatigue, habituation, and carelessness.
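The three practices fit together as a single loop: widen the rollout one stage at a time, check health after each stage, and revert at the first failed check. The sketch below assumes caller-supplied `deploy`, `is_healthy`, and `rollback` hooks and an arbitrary set of stage fractions. All of these are hypothetical and stand in for whatever a real deployment system provides.

```python
def progressive_rollout(deploy, is_healthy, rollback,
                        stages=(0.01, 0.10, 0.50, 1.0)):
    """Expand a change stage by stage; revert at the first unhealthy check.

    deploy(fraction), is_healthy(), and rollback() are hypothetical hooks
    supplied by the caller, not a real deployment API.
    Returns True if the change reached the final stage, False if rolled back.
    """
    for fraction in stages:
        deploy(fraction)
        if not is_healthy():
            rollback()
            return False
    return True


# Simulated run: the health check fails once the change reaches 50%.
calls = []
health = iter([True, True, False])
ok = progressive_rollout(
    deploy=lambda f: calls.append(("deploy", f)),
    is_healthy=lambda: next(health),
    rollback=lambda: calls.append(("rollback", None)),
)
print(ok, calls)
```

In the simulated run the change is deployed to 1%, 10%, and 50%, fails its health check at 50%, and is rolled back without ever reaching the remaining traffic.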

6. Demand Forecasting and Capacity Planning

Ensure there is sufficient capacity and redundancy to serve expected future demand.

  • Natural Growth: Demand that rises as the customer base grows.
  • Spike Growth: Rapid increase due to feature launches or marketing campaigns.

Necessary Steps:

  1. Forecast natural demand accurately, looking beyond the lead time needed to secure resources.
  2. Incorporate sources of spike demand into forecasts.
  3. Conduct regular load tests to understand the relationship between resources and service capacity.

Cost Optimization

Capacity directly affects cost. The SRE team contributes to cost optimization by quantitatively showing the "minimum necessary cost for maintaining availability".
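The forecasting steps in this section can be sketched as a short calculation. This is a minimal sketch, assuming compound monthly growth for natural demand, a fixed additive spike, and a per-instance capacity figure of the kind a load test would measure. The function names and all numbers are hypothetical.

```python
import math


def forecast_demand(current_qps, monthly_growth, lead_time_months, spike_qps=0.0):
    """Project demand at the point when newly ordered capacity would arrive."""
    natural = current_qps * (1 + monthly_growth) ** lead_time_months
    return natural + spike_qps


def instances_needed(demand_qps, qps_per_instance, redundancy=1):
    """Instances required to serve demand, plus spares for redundancy.

    qps_per_instance is the capacity a load test would measure.
    """
    return math.ceil(demand_qps / qps_per_instance) + redundancy


# Assumed inputs: 1000 QPS today, 5% monthly growth, 3-month lead time,
# and a launch expected to add 500 QPS.
demand = forecast_demand(1000, 0.05, 3, spike_qps=500)
count = instances_needed(demand, qps_per_instance=200, redundancy=1)
print(f"forecast demand: {demand:.1f} QPS, instances needed: {count}")
```

Because the instance count follows directly from the forecast and the load-test figure, the same calculation gives a quantitative basis for the minimum cost of maintaining availability.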

7. Provisioning

Provisioning combines change management and capacity planning.

Info: In modern cloud environments, this corresponds to appropriate allocation and scaling settings of cloud resources rather than physical data center expansion.

8. Efficiency and Performance

Because SREs control provisioning, they are also involved in work on system utilization. By paying attention to a service's provisioning policy and utilization rate, they can strongly influence the total cost of the service.

SREs and product developers need to keep monitoring and modifying the service to improve its performance and efficiency.

Reference Pages