Postmortem (Incident Analysis)

What is a Postmortem?

A postmortem is a blameless analysis conducted after an incident or outage. The term comes from the medical word for an autopsy. In SRE it names the process of objectively analyzing what happened, why it happened, and how to prevent recurrence, so that the organization learns and improves.

Purpose

The main purposes of a postmortem are:

  1. Detailed incident documentation: Accurately documenting what happened
  2. Root cause identification: Revealing fundamental causes rather than superficial ones
  3. Recurrence prevention measures: Defining specific and actionable improvement actions
  4. Promoting organizational learning: Sharing insights and enhancing team-wide resilience
  5. System improvement: Learning from incidents to build more robust systems

Importance of Postmortems

The Google SRE book treats failure as a learning opportunity. Through postmortems, organizations can avoid repeating the same mistakes and keep improving.

Blameless Culture

The most important principle of a postmortem is that it is blameless: it does not assign fault to individuals.

Why Not Blame?

Everyone makes mistakes. Blaming individuals has the following negative effects:

  • Encourages cover-ups: Engineers start trying to hide failures
  • Reduces psychological safety: Team members become hesitant and unable to take risks
  • Overlooks root causes: Focuses on human error and misses systemic problems
  • Loses learning opportunities: The organization loses the chance to learn from failures

Practicing Blameless Culture

In a blameless culture, we ask:

  • Not "Who is at fault?" but "What went wrong?"
  • Not "Why did they make a mistake?" but "Why didn't the system prevent the mistake?"
  • Not "Who will take responsibility?" but "How can we prevent recurrence?"

Humans Are Part of the System

Human error is a signal of system design flaws. With proper guardrails, automation, and checking mechanisms, many human errors can be prevented.

When to Conduct Postmortems

Not all incidents require a postmortem. Conduct one when an incident meets any of these criteria:

Required Cases

  • Outages affecting users (downtime, performance degradation, etc.)
  • Data loss or corruption
  • Security incidents
  • SLO violations
  • Incidents requiring manual emergency response
  • Issues affecting other teams or stakeholders
  • Incidents providing valuable learning opportunities
  • Near misses (no actual impact but potential risk existed)
  • Cases revealing inadequate procedures or tools
  • Cases where multiple small problems occurred simultaneously

Don't Overlook Small Incidents

Small incidents that recur in a pattern can be symptoms of a larger problem, so postmortems for minor incidents are also worthwhile.
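
As a rough illustration, the criteria above could be encoded as a simple check in incident tooling. The sketch below is Python; the `Incident` fields and the `postmortem_reasons` function are hypothetical names chosen for this example, not part of any real tool.

```python
from dataclasses import dataclass, fields


@dataclass
class Incident:
    """Hypothetical incident flags mirroring the criteria listed above."""
    user_facing_outage: bool = False
    data_loss_or_corruption: bool = False
    security_incident: bool = False
    slo_violation: bool = False
    manual_emergency_response: bool = False
    affected_other_teams: bool = False
    near_miss: bool = False
    exposed_inadequate_tooling: bool = False


def postmortem_reasons(incident: Incident) -> list[str]:
    """Return the names of all criteria the incident meets, in field order."""
    return [f.name for f in fields(incident) if getattr(incident, f.name)]


incident = Incident(slo_violation=True, near_miss=True)
reasons = postmortem_reasons(incident)
if reasons:
    print("Postmortem needed:", ", ".join(reasons))
```

Returning the matched criteria rather than a bare yes/no keeps the reason for the postmortem visible when the incident is triaged.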

Postmortem Process

1. Timing

Hold the postmortem as soon as possible after the incident is resolved, typically within 24-48 hours, so that details are recorded while memories are fresh.

2. Participants

The following people should take part:

  • Incident responders: Engineers who actually responded
  • Service owners: Those responsible for the affected service
  • Related team members: Other team members affected or related
  • Facilitator: Role to neutrally guide discussion (optional)

3. Implementation Steps

Postmortem Template

The following are elements of the postmortem template recommended in the Google SRE book.

Basic Information

| Item | Content |
| :--- | :--- |
| Title | Brief description of the incident |
| Date/Time | Incident occurrence and resolution times |
| Author | Postmortem author |
| Status | Draft / In Review / Finalized |
| Severity | Critical / High / Medium / Low |

Section Structure

1. Executive Summary

Summarize the incident in 2-3 sentences.

## Executive Summary

From 10:30 JST to 12:00 JST on December 11, 2024,
the payment API response time degraded from an average of 500ms to 5 seconds,
and approximately 15% of transactions timed out.
The root cause was database connection pool exhaustion,
and we recovered by urgently expanding the pool size.

2. Impact

  • Affected services: Which services or features were affected
  • Impact scope: Number of users, regions, customer segments, etc.
  • Business impact: Monetary losses, reputational impact, etc.
  • Downtime: Total downtime or degradation time

## Impact

- **Affected services**: Payment API (payment-service)
- **Impact scope**: Users in all regions (approximately 100,000 active users)
- **Failed transactions**: Approximately 1,500 (15% of total)
- **Downtime**: No complete outage, performance degradation for 90 minutes
- **Business impact**: Estimated revenue loss approximately $5,000
- **SLO violation**: Consumed part of the 0.1% monthly error budget implied by the 99.9% availability SLO
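
How much of the error budget an incident like this consumes can be estimated with simple arithmetic. The Python sketch below assumes a time-based budget over a 30-day window and weights the degraded minutes by the fraction of failed requests. Both choices, and the function name `error_budget_consumed`, are assumptions for illustration; a request-based SLO would count failed requests against total monthly requests instead.

```python
def error_budget_consumed(
    slo: float,
    degraded_minutes: float,
    failure_ratio: float,
    window_days: int = 30,
) -> float:
    """Estimate the fraction of the error budget used by one incident.

    slo            availability target, e.g. 0.999 for 99.9%
    failure_ratio  share of requests that failed while degraded (0..1)
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be between 0 and 1 (exclusive)")
    window_minutes = window_days * 24 * 60
    budget_minutes = window_minutes * (1.0 - slo)    # allowed "bad" minutes
    bad_minutes = degraded_minutes * failure_ratio   # assumption: weight by failures
    return bad_minutes / budget_minutes


# Figures from the sample Impact section: 90 minutes degraded, 15% failures.
consumed = error_budget_consumed(slo=0.999, degraded_minutes=90, failure_ratio=0.15)
print(f"Estimated error budget consumed: {consumed:.0%}")
```

Under these assumptions the sample incident uses roughly 31% of the monthly budget (13.5 weighted minutes out of about 43.2 allowed).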

3. Timeline

Record the incident progression chronologically. Specify timezone for all times.

## Timeline

| Time (JST) | Event |
| :--- | :--- |
| 10:30 | Datadog alert: Payment API response time exceeded threshold |
| 10:32 | On-call engineer (Tanaka) began investigation |
| 10:35 | Confirmed DB connection wait in payment service via APM |
| 10:40 | Identified database connection pool reached limit |
| 10:45 | Urgently expanded pool size from 50 to 100 (configuration change) |
| 10:50 | Began rolling restart of payment service |
| 11:00 | Response time began returning to normal range |
| 11:30 | Confirmed all instances operating normally |
| 12:00 | Incident closed |
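
A timeline like this also yields response metrics directly. The following Python sketch computes elapsed minutes from the first alert to later milestones; the milestone labels are assumptions layered on the sample rows, and the helper only handles times within a single day.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Minutes from start to end, both given as 'HH:MM' on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Key moments taken from the sample timeline; the labels are assumptions.
timeline = {
    "alert": "10:30",
    "investigation_started": "10:32",
    "mitigation_applied": "10:45",
    "recovery_confirmed": "11:30",
    "closed": "12:00",
}

for label, clock_time in timeline.items():
    elapsed = minutes_between(timeline["alert"], clock_time)
    print(f"{label}: +{elapsed} min")
```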

4. Root Cause

Identify the root cause using a method such as the 5 Whys.

## Root Cause

### 5 Whys Analysis

1. **Why did response time degrade?**
→ Because database connection wait time was long

2. **Why was connection wait time long?**
→ Because connection pool was exhausted and new connections couldn't be acquired

3. **Why was connection pool exhausted?**
→ Because traffic doubled from normal and pool size (50) was insufficient

4. **Why did traffic increase?**
→ Unexpected traffic surge from marketing campaign

5. **Why couldn't we predict campaign traffic increase?**
→ Lack of communication between marketing and engineering teams

### Root Cause Summary

**Direct cause**: Insufficient database connection pool size

**Systemic causes**:
- Lack of auto-scaling mechanism for traffic surges
- No load testing before campaign implementation
- Deficient cross-department communication processes

5. Resolution and Response

Record the emergency measures taken during incident response.

## Resolution and Response

### Emergency Response

1. **Expanded connection pool size** (10:45)
   - Configuration change: `max_pool_size: 50 → 100`
   - Rolling restart of payment service

2. **Temporarily stopped unnecessary batch processing** (10:50)
   - Paused report generation to reduce DB load

3. **Enhanced monitoring** (11:00)
   - Created monitoring dashboard for connection pool utilization

6. Prevention Measures

Define specific and actionable action items. Set assignees and deadlines for each item.

## Prevention Measures

### Short-term measures (within 1 week)

- [ ] **Implement connection pool auto-scaling**
  - Assignee: Tanaka
  - Deadline: December 18, 2024
  - Details: Automatically adjust pool size during traffic increases

- [ ] **Set connection pool utilization alerts**
  - Assignee: Suzuki
  - Deadline: December 15, 2024
  - Details: Warning at 80% utilization, Critical at 90%

### Medium-term measures (within 1 month)

- [ ] **Establish marketing campaign advance notification process**
  - Assignee: Sato (Marketing), Tanaka (SRE)
  - Deadline: January 10, 2025
  - Details: Notify 2 weeks before campaign and conduct load testing

- [ ] **Introduce automated load testing**
  - Assignee: Yamada
  - Deadline: January 15, 2025
  - Details: Integrate production-level load testing into CI/CD pipeline

### Long-term measures (within 3 months)

- [ ] **Implement circuit breaker for database connections**
  - Assignee: Takahashi
  - Deadline: March 1, 2025
  - Details: Enable partial service provision during failures

- [ ] **Establish capacity planning process**
  - Assignee: Entire SRE team
  - Deadline: March 15, 2025
  - Details: Quarterly capacity reviews and forecasting
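
The circuit breaker item in the sample plan could look roughly like the following. This is a minimal, single-threaded Python sketch of the general pattern, not the payment service's implementation; `CircuitBreaker`, `CircuitOpenError`, and the threshold and timeout values are assumptions made for illustration.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is rejected immediately."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock          # injectable so the breaker can be tested
        self._failures = 0
        self._opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: let one trial call through (half-open state).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()   # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


def query_database():
    """Hypothetical stand-in for a real database call."""
    return "row"


breaker = CircuitBreaker(failure_threshold=3, reset_timeout=30.0)
print(breaker.call(query_database))
```

Failing fast while the circuit is open lets callers return a degraded response instead of queueing on an exhausted pool, which is the "partial service provision" the action item describes.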

7. Lessons Learned

  • What went well: Quick cause identification, effective escalation procedures, etc.
  • What could be improved: Monitoring, alerting, runbooks, communication, etc.
  • What was lucky: Reasons it didn't become worse

## Lessons Learned

### What Went Well ✅

- On-call engineer responded quickly (within 2 minutes)
- APM tool (Datadog) enabled rapid cause identification
- Safely recovered following runbook restart procedures

### What Could Be Improved 🔧

- Connection pool utilization monitoring was insufficient
- Couldn't anticipate traffic increase from campaign
- Auto-scaling mechanism was not implemented

### What Was Lucky 🍀

- Occurred outside peak hours, limiting impact
- Degradation rather than complete downtime
- No data loss or corruption occurred

8. References

Include links to related logs, metrics, graphs, and documents.

## References

- [Datadog Dashboard (during incident)](https://app.datadoghq.com/dashboard/xxx)
- [Slack Incident Channel](https://slack.com/archives/C12345678)
- [PagerDuty Incident](https://example.pagerduty.com/incidents/P123456)
- [Related Code Change (PR #1234)](https://github.com/org/repo/pull/1234)

Root Cause Analysis Methods

5 Whys

A simple and widely used method. Dig from superficial causes down to root causes by repeatedly asking "Why?"

Key points:

  • Ask "Why?" about five times (three or seven may be appropriate depending on the situation)
  • Confirm each answer is logically connected
  • Focus on systems and processes, not people

Fishbone Diagram (Ishikawa Diagram)

Effective for complex incidents involving multiple factors.

Timeline Analysis

Organize events chronologically and visualize what happened at each point.

Postmortem Best Practices

1. Conduct Promptly

Hold the postmortem as soon as possible (within 24-48 hours) after incident resolution.

2. Fact-Based

  • Avoid speculation and guesswork
  • Write based on logs, metrics, and evidence
  • Clearly mark uncertainties as "unknown"

3. Define Specific and Actionable Actions

Bad examples (vague):

  • "Improve monitoring"
  • "Strengthen communication"

Good examples (specific):

  • "Set warning alert at 80% connection pool utilization (Assignee: Tanaka, Deadline: 12/15)"
  • "Establish operational flow for 2-week advance notification of marketing campaigns (Assignee: Sato, Deadline: 1/10)"
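
An action item this specific translates almost directly into a check. The Python sketch below classifies connection pool utilization using the 80% warning threshold from the example and the 90% critical threshold from the sample prevention measures; the function name and return values are assumptions, and a real alert would normally be defined in the monitoring tool rather than in application code.

```python
def pool_alert_level(in_use: int, max_size: int,
                     warning: float = 0.80, critical: float = 0.90) -> str:
    """Classify connection pool utilization as 'ok', 'warning', or 'critical'."""
    if max_size <= 0:
        raise ValueError("max_size must be positive")
    utilization = in_use / max_size
    if utilization >= critical:
        return "critical"
    if utilization >= warning:
        return "warning"
    return "ok"


# With the original pool size of 50, 40 busy connections is already a warning.
print(pool_alert_level(in_use=40, max_size=50))
```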

4. Create Inclusive Environment

  • Ensure psychological safety
  • Respect all opinions
  • Have facilitator guide neutrally

5. Share Widely

  • Postmortems are organizational learning assets
  • Publish in internal wiki or document repository
  • Regularly review past postmortems

6. Track Action Items

  • Set assignee and deadline for each item
  • Check progress regularly
  • Always close when completed

Tracking Action Items

The biggest postmortem failure is leaving action items unexecuted. Use a tracking tool such as Jira or GitHub Issues and follow each item through to completion.
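
Whatever tool is used, the core check is the same: find open items whose deadline has passed. A minimal Python sketch, assuming a hypothetical `ActionItem` record rather than any real tracker's API, using two items from the sample prevention measures:

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    title: str
    assignee: str
    deadline: date
    done: bool = False


def overdue_items(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Return open items whose deadline has already passed."""
    return [item for item in items if not item.done and item.deadline < today]


items = [
    ActionItem("Set connection pool utilization alerts", "Suzuki", date(2024, 12, 15)),
    ActionItem("Implement connection pool auto-scaling", "Tanaka", date(2024, 12, 18)),
]

# The review date is an illustrative assumption.
for item in overdue_items(items, today=date(2024, 12, 16)):
    print(f"OVERDUE: {item.title} ({item.assignee}, due {item.deadline})")
```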

7. Celebrate Successes

  • Document what went well
  • Recognize team's quick response
  • Visualize improvement results

Postmortem Storage and Utilization

Document Management

  • Centralized management: Store in accessible locations like Confluence, Notion, GitHub Wiki
  • Searchable: Classify with tags and categories for easy searching
  • Version control: Maintain update history

Regular Reviews

  • Review past postmortems quarterly
  • Analyze patterns and trends
  • Connect to organization-level improvements

Metrics Tracking

  • Number of postmortems created
  • Action item completion rate
  • MTTR (Mean Time To Repair) trends
  • Recurrence rate of similar incidents
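
These metrics are simple ratios and averages. The Python sketch below shows one possible way to compute three of them; the function names, the idea of tagging each incident with a cause category, and the sample numbers are all assumptions for illustration.

```python
from statistics import mean


def completion_rate(completed: int, total: int) -> float:
    """Share of action items that have been closed."""
    return completed / total if total else 0.0


def mttr_minutes(repair_minutes: list[float]) -> float:
    """Mean time to repair across incidents (raises on an empty list)."""
    return mean(repair_minutes)


def recurrence_rate(cause_categories: list[str]) -> float:
    """Share of incidents whose cause category was already seen earlier."""
    seen: set[str] = set()
    repeats = 0
    for cause in cause_categories:
        if cause in seen:
            repeats += 1
        seen.add(cause)
    return repeats / len(cause_categories) if cause_categories else 0.0


# Hypothetical quarter: 4 incidents, 8 action items of which 6 are closed.
print(completion_rate(6, 8))
print(mttr_minutes([60.0, 30.0, 90.0, 20.0]))
print(recurrence_rate(["pool-exhaustion", "bad-deploy", "pool-exhaustion", "dns"]))
```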

Practice Example: Google's Postmortem Culture

At Google, postmortems are an important process conducted routinely.

Characteristics

  • Public by default: Published organization-wide, accessible to everyone
  • Award system: Excellent postmortems are recognized internally
  • Learning resource: Also used as training material for new employees
  • Data-driven: Drive improvements with data from postmortems

Effects

  • Reduced incident response times
  • Prevented recurrence of similar problems
  • Improved team-wide technical capabilities
  • Strengthened organizational resilience

Google's "Non-Abstract Large System Design (NALSD)" Interview

In Google interviews, lessons learned from past postmortems can be applied to system design questions. Postmortems contribute to individual career growth.

Common Failure Patterns and Countermeasures

Failure Pattern 1: Blaming Individuals

Problem: "Because Mr./Ms. XX executed the wrong command"

Countermeasure: Treat it as a system flaw that allowed an inappropriate command to be executed, and add guardrails or review processes

Failure Pattern 2: Vague Action Items

Problem: "Be more careful next time" "Pay more attention"

Countermeasure: Define specific technical improvements (automation, add alerts, revise runbooks, etc.)

Failure Pattern 3: Action Items Not Implemented

Problem: The team is satisfied with having written the postmortem, and the improvements are never implemented

Countermeasure: Manage with tracking tools, review progress regularly

Failure Pattern 4: Taking Too Much Time

Problem: Taking weeks trying to write a perfect postmortem

Countermeasure: A draft is fine at first. What matters is conducting the postmortem promptly and acting on it

Failure Pattern 5: Not Sharing

Problem: The postmortem stays within one team and is not shared organization-wide

Countermeasure: Publish in internal wiki and notify related teams

Checklist

Use this checklist to verify postmortem quality.

  • Basic information (date/time, author, status) documented
  • Executive summary briefly explains overview
  • Impact (scope, downtime, business impact) is clear
  • Detailed timeline recorded
  • Root cause analysis (5 Whys, etc.) conducted
  • Specific and actionable action items exist
  • Each action has assignee and deadline
  • No language blaming individuals (Blameless)
  • Lessons learned (what went well, improvements) documented
  • References (links to logs, metrics, dashboards) included
  • Reviewed and approved
  • Shared organization-wide
  • Action items registered in tracking tool
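
Parts of this checklist can be checked mechanically. As a sketch, the Python snippet below reports which of the template's sections are missing from a Markdown draft. It assumes sections are written as `## ` headings with the names used in the template above; `REQUIRED_SECTIONS` and `missing_sections` are hypothetical names. Items such as blameless wording or review approval still need a human reader.

```python
import re

# Section names follow the template described earlier in this document.
REQUIRED_SECTIONS = [
    "Executive Summary",
    "Impact",
    "Timeline",
    "Root Cause",
    "Resolution and Response",
    "Prevention Measures",
    "Lessons Learned",
    "References",
]


def missing_sections(markdown_text: str) -> list[str]:
    """Return required sections with no matching level-2 heading."""
    found = {
        match.group(1).strip().lower()
        for match in re.finditer(r"^##[ \t]+(.+)$", markdown_text, flags=re.MULTILINE)
    }
    return [name for name in REQUIRED_SECTIONS if name.lower() not in found]


draft = """## Executive Summary
Payment API degraded for 90 minutes.

## Impact
Approximately 15% of transactions timed out.

## Timeline
| Time (JST) | Event |
"""

print(missing_sections(draft))
```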

Summary

Postmortems are one of the most important practices in SRE.

Key Points

  1. Blameless culture: Focus on system problems, not individuals
  2. Timely implementation: Conduct as soon as possible after incident resolution
  3. Specific actions: Define actionable and trackable improvements
  4. Organizational learning: Share insights and enhance team-wide resilience
  5. Continuous improvement: Track action items to completion

Through postmortems, organizations can learn from failures and build more robust and reliable systems. Cultivating a culture in which failure is seen not as the end but as the beginning of learning is key to SRE success.

References