Toil Definition and Management
If a human operator needs to touch your system during normal operations, you have a bug. The definition of normal changes as your systems grow.
— Carla Geisser, Google SRE
SREs aim to spend more time on long-term engineering projects than on operational tasks. However, the term "operational tasks" can be misleading, so we use the term toil instead.
Definition of Toil
Toil is not simply "work you don't want to do." It cannot be simply rephrased as administrative chores or dirty work. Some people even enjoy repetitive manual tasks. There are administrative chores that must be done, but that is overhead, not toil. Boring work can sometimes have long-term value, and if so, it is not toil.
So what is toil?
Toil is work related to running a production service that has the following characteristics:
- Manual and repetitive
- Can be automated
- Tactical in nature
- Has no long-term value
- Scales linearly with service growth
A task does not need to have every one of these attributes. The more closely it matches the following descriptions, the more likely it is to be toil.
Characteristics of Toil
| Characteristic | Description |
|---|---|
| Manual | This includes tasks like manually running a script that automates some task. Running the script is quicker than manually executing each step in it, but the time a human spends running the script is still toil time. |
| Repetitive | If you're doing a task for the first or second time, it's not toil. Toil is something that is done over and over again. |
| Automatable | If a task can be performed by a machine in the same way a human would, or if a system can be created that eliminates the need for the task, then that work is toil. If the task essentially requires human judgment, it's not really toil, but you need to carefully consider whether it truly requires human judgment and whether a better design could address it. |
| Tactical | Toil is interrupt-driven and reactive work performed in response to something that has happened, such as a problem. Responding to alerts is toil. While this kind of work cannot be completely eliminated, it should be continuously minimized. |
| No Long-term Value | If the service remains in the same state after completing a task, that task is probably toil. Conversely, if the task makes a permanent improvement to the service, it is not toil, even if it includes boring work like diving into and cleaning up old code or configurations. |
| O(n) with Service Growth | If the work includes tasks that scale proportionally with service size, traffic volume, number of users, etc., those tasks are probably toil. An ideally managed and designed service should be able to grow by at least an order of magnitude without additional work beyond the one-time task of adding resources. |
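
As a rough illustration of how the characteristics in the table might be applied, the sketch below counts how many of them a task matches. The `Task` fields, the `toil_score` helper, and both example tasks are hypothetical. The source defines no scoring formula, so a plain count is only an assumption made for this example.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Task:
    """Hypothetical checklist: one boolean per toil characteristic."""
    manual: bool
    repetitive: bool
    automatable: bool
    tactical: bool
    no_long_term_value: bool
    scales_with_service: bool


def toil_score(task: Task) -> int:
    """Count how many of the six characteristics the task matches."""
    return sum(getattr(task, f.name) for f in fields(task))


# Restarting a stuck job by hand every week: matches every characteristic.
weekly_restart = Task(True, True, True, True, True, True)
# Designing a new capacity model: a one-off task that needs human judgment.
capacity_design = Task(False, False, False, False, False, False)

print(toil_score(weekly_restart))   # 6
print(toil_score(capacity_design))  # 0
```

A higher count suggests a task is a stronger candidate for automation, but the count is a conversation starter rather than a verdict.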
Why Less Toil is Better
Google's SRE organization has a goal of keeping toil to 50% or less of each person's work time. At least 50% of each SRE's work time should be spent on engineering projects that reduce future toil or add features to the service.
Importance of the 50% Rule
The reason for sharing this 50% goal is that toil tends to expand if left unchecked and can rapidly consume 100% of everyone's time.
The work of reducing toil and scaling services is what "engineering" means in SRE. Through engineering work, the size of the SRE organization becomes independent of service size, enabling more efficient service management compared to pure development or operations teams.
Requirements for Being Engineering Work
Engineering work is about doing new things and essentially requires human judgment. Engineering work makes permanent improvements to services and is strategy-driven. This work is often creative and innovative, taking a design-driven approach to problem-solving, and the more general the solution, the better. Engineering work helps teams and SRE organizations handle larger services or more services with the same staffing level.
Classification of Typical SRE Activities
| Category | Description |
|---|---|
| Software Engineering | Work involving writing or modifying code, as well as related design and documentation work. Examples: Creating automation scripts, building tools and frameworks, adding scalability and reliability features to services, improving infrastructure robustness through code modifications, etc. |
| Systems Engineering | Includes configuring production systems, changing configurations, and documenting systems in ways that provide lasting improvements from a single effort. Examples: Setting up and updating monitoring, configuring load balancing, server configuration, parameter tuning, etc. Systems engineering also includes consulting with development teams on architecture, design, and production environment fit. |
| Toil | Work directly related to keeping services running that is repetitive or manual. |
| Overhead | Administrative work not directly related to keeping services running. Examples: Recruiting, HR administrative work, meetings, organizing bug queues, writing snippets (weekly work summaries), peer reviews, self-evaluations, training courses, etc. |
Toil tends to be spiky, so holding a steady 50% of time for engineering may not be realistic for some SRE teams, and they may dip below the target in some quarters. However, if the average percentage of time spent on projects is significantly below 50% over the long term, the team must step back and examine what the problem is.
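
One way to make "significantly below 50% over the long term" concrete is to average the engineering share across several quarters. The sketch below does that. The quarterly figures, the `needs_review` helper, and the 10-point `margin` are assumptions for illustration, since the text does not say how far below 50% counts as significant.

```python
from statistics import mean

ENGINEERING_TARGET = 0.50  # the 50% goal described in the text


def needs_review(shares, target=ENGINEERING_TARGET, margin=0.10):
    """True when the long-run average falls more than `margin` below target.

    `margin` is an assumed threshold for "significantly below".
    """
    return mean(shares) < target - margin


# Hypothetical engineering share per quarter for two teams.
team_with_one_bad_quarter = [0.55, 0.35, 0.60, 0.50]
team_stuck_in_toil = [0.30, 0.35, 0.40, 0.30]

print(needs_review(team_with_one_bad_quarter))  # False
print(needs_review(team_stuck_in_toil))         # True
```

Averaging over several quarters keeps one bad quarter from triggering a review while still catching a sustained shortfall.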
Calculating Toil
If we seek to cap the time an SRE spends on toil to 50%, how is that time spent?
There's a floor on the amount of toil any SRE has to handle if they are on-call. A typical SRE has one week of primary on-call and one week of secondary on-call in each cycle. It follows that in a 6-person rotation, at least 2 of every 6 weeks are dedicated to on-call shifts and interrupt handling, which means the lower bound on potential toil is 2/6 = 33% of an SRE's time. In an 8-person rotation, the lower bound is 2/8 = 25%.
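
The arithmetic above can be written as a small function. This is a minimal sketch: the name `oncall_toil_floor` and its parameters are made up for this example, and it assumes, as the text does, that each person is on call for two weeks (one primary, one secondary) per cycle and that the cycle length in weeks equals the rotation size.

```python
from fractions import Fraction


def oncall_toil_floor(rotation_size: int, oncall_weeks_per_cycle: int = 2) -> Fraction:
    """Share of time spent on call when each person covers
    `oncall_weeks_per_cycle` weeks out of every `rotation_size` weeks."""
    if rotation_size < oncall_weeks_per_cycle:
        raise ValueError("rotation is too small for the on-call weeks per cycle")
    return Fraction(oncall_weeks_per_cycle, rotation_size)


for size in (6, 8):
    floor = oncall_toil_floor(size)
    print(f"{size}-person rotation: {floor} = {float(floor):.0%}")
# 6-person rotation: 1/3 = 33%
# 8-person rotation: 1/4 = 25%
```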
Real Data from Google SRE
Consistent with the on-call floor calculated above, SREs report that their top source of toil is interrupts (non-urgent service-related messages and emails). The next leading source is on-call (urgent) response, followed by releases and pushes. Even though release and push processes are usually handled with a fair amount of automation, there's still plenty of room for improvement in this area.
Quarterly surveys of Google's SREs show that the average time spent toiling is about 33%, so we do much better than our overall target of 50%. However, the average doesn't capture outliers: some SREs claim 0% toil (pure development projects with no on-call work) and others claim 80% toil. When individual SREs report excessive toil, it often indicates a need for managers to spread the toil load more evenly across the team and to encourage those SREs to find satisfying engineering projects.
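
To see how an acceptable average can hide outliers, consider the sketch below. The survey responses are invented for illustration and are not Google's data. They are chosen only so that the team average lands at 33% while one person reports 80%.

```python
from statistics import mean

TOIL_TARGET = 0.50  # the 50% cap described in the text

# Hypothetical survey responses: share of time each SRE reports as toil.
responses = {
    "sre_a": 0.00,
    "sre_b": 0.25,
    "sre_c": 0.30,
    "sre_d": 0.30,
    "sre_e": 0.80,
}

average = mean(list(responses.values()))
over_target = sorted(name for name, share in responses.items() if share > TOIL_TARGET)

print(f"team average: {average:.0%}")  # team average: 33%
print(f"over target: {over_target}")   # over target: ['sre_e']
```

Reporting the per-person list alongside the average is what surfaces the case where a manager needs to rebalance the load.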
Practical Examples of Measuring and Reporting Toil
There are several ways to measure and report the time each SRE engineer spends on toil, depending on the team's culture and tools.
1. Labeling and Time Tracking via Ticket Systems
If you use a ticket management system like Jira or GitHub Issues, assign labels or categories such as "Toil" or "Engineering" to tasks.
- Method: Record the time spent upon task completion and apply the appropriate label (e.g., `type:toil`).
- Pros: Quantitative data can be automatically aggregated and easily visualized.
- Cons: Overhead arises from creating tickets for every task (especially small interruptions).
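
The aggregation step for labeled tickets might look like the following sketch. It works on a plain list of `(label, hours)` pairs rather than any real ticket-system API. The `type:engineering` and `type:overhead` labels, the sample hours, and both helper functions are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical export from a ticket system: (label, hours spent).
tickets = [
    ("type:toil", 3.0),
    ("type:engineering", 10.0),
    ("type:toil", 5.0),
    ("type:overhead", 2.0),
]


def hours_by_label(rows):
    """Sum recorded hours per label."""
    totals = defaultdict(float)
    for label, hours in rows:
        totals[label] += hours
    return dict(totals)


def toil_share(rows, toil_label="type:toil"):
    """Fraction of all recorded hours that carry the toil label."""
    totals = hours_by_label(rows)
    grand_total = sum(totals.values())
    return totals.get(toil_label, 0.0) / grand_total if grand_total else 0.0


print(hours_by_label(tickets))
print(f"toil share: {toil_share(tickets):.0%}")  # 8 of 20 hours -> 40%
```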
2. Self-Reporting via Snippets (Weekly Reports)
Utilize weekly work summaries, like the "snippets" used at Google.
- Method: At the end of each week, list the main activities performed and estimate the percentage of toil vs. engineering (e.g., Toil 40%, Engineering 60%) to share with the team.
- Pros: Low burden of recording, and qualitative information (e.g., what was painful) can also be shared.
- Cons: Tends to be subjective estimates and may lack precision.
3. Quarterly Surveys
Conduct regular surveys of team members to understand the state of toil.
- Method: Ask questions like "What percentage of your time was spent on toil in the last 3 months?" and "What were the main sources of toil?".
- Pros: Easy to grasp long-term trends and member sentiment (e.g., feelings of burnout).
- Cons: Not real-time data, so immediate countermeasures may be delayed.
4. Analysis of On-Call Shifts
Analyze activity logs during on-call shifts.
- Method: Aggregate alert counts and response times from tools like PagerDuty and summarize them in on-call reports.
- Pros: Accurately captures "interrupt" toil, which often has the highest load.
- Cons: Toil outside of on-call (e.g., manual script execution, application processing) may be missed.
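
An on-call report along these lines could be built from exported alert records. The sketch below does not call PagerDuty or any other real API. It assumes the records have already been exported as `(service, triggered_at, acknowledged_at)` tuples, and both the sample data and the `summarize` helper are hypothetical.

```python
from datetime import datetime
from statistics import median

# Hypothetical alert records: (service, triggered_at, acknowledged_at).
alerts = [
    ("api", datetime(2024, 1, 8, 2, 0), datetime(2024, 1, 8, 2, 4)),
    ("api", datetime(2024, 1, 9, 14, 30), datetime(2024, 1, 9, 14, 32)),
    ("db", datetime(2024, 1, 10, 23, 15), datetime(2024, 1, 10, 23, 25)),
]


def summarize(records):
    """Return alert counts per service and the median seconds to acknowledge."""
    counts = {}
    delays = []
    for service, triggered, acked in records:
        counts[service] = counts.get(service, 0) + 1
        delays.append((acked - triggered).total_seconds())
    return counts, median(delays)


counts, median_ack_seconds = summarize(alerts)
print(counts)              # {'api': 2, 'db': 1}
print(median_ack_seconds)  # 240.0
```

Counts per service point at where automation would pay off first. The median is used here instead of the mean so that a single slow acknowledgment does not dominate the summary.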
It is important to use these methods individually or in combination to understand toil trends across the team.
Is Toil Always Bad?
Toil doesn't necessarily make everyone unhappy. This is especially true when the amount is small. Predictable repetitive tasks can be quite calming. Some people even enjoy this kind of work.
It must be clearly recognized that toil is not always equally bad, and that some toil is unavoidable for SREs. Small amounts don't need to be a concern. Toil becomes harmful when it must be handled in large quantities.
Why Too Much Toil is Bad
Impact on Individuals
- Career Stagnation: When too little time is spent on projects, career advancement slows down or comes to a halt. You cannot build a career from menial work.
- Low Morale: Too much toil leads to burnout, boredom, and dissatisfaction.
Impact on Organizations
When too much time that should be spent on engineering is spent on toil, the SRE organization suffers the following damages:
| Impact | Description |
|---|---|
| Confusion | Within the SRE organization and among those working with SREs, we try to ensure everyone understands that SRE is an engineering organization. When individuals or teams within SRE focus too much on toil, others may develop misconceptions about the SRE role. |
| Slower Progress | Excessive toil reduces team productivity. When SRE teams are too busy with manual work and firefighting to rapidly roll out new features, product feature development slows down. |
| Conditioning | If SREs are too eager to take on toil, their development counterparts are encouraged to send even more toil to SRE, sometimes including operational tasks that should be handled by the development side. |
| Attrition | Even if you personally don't mind toil, current or future teammates may not like it. Building too much toil into the team's workflow encourages the best engineers on the team to look for more rewarding work elsewhere. |
| Breach of Faith | People who were recruited or transferred to SRE with the promise of project work will feel deceived. This negatively affects morale. |
Importance of Continuous Improvement
If we all commit to eliminate a bit of toil each week with some good engineering, we'll steadily clean up our services, and we can shift our collective efforts to engineering for scale, architecting the next generation of services, and building cross-SRE toolchains.
Let's invent more, and toil less.
Summary
- Toil is manual, repetitive, automatable work with no long-term value
- SREs should spend at least 50% of their work time on engineering projects
- Real data from Google SRE shows average toil time is about 33% (within the target of 50% or less)
- Considering on-call rotations, the theoretical lower bound of toil is around 25-33%
- While small amounts of toil are unavoidable, excessive toil negatively affects both individuals and organizations
- By reducing toil and focusing on engineering work, SRE teams can achieve more scalable service operations
- Continuous commitment to eliminating a bit of toil each week is crucial