Postmortem (Incident Analysis)

What is a Postmortem?

A postmortem is a blameless analysis conducted after an incident or outage. The term comes from the medical word for an autopsy. In SRE, it is the process of objectively analyzing "what happened," "why it happened," and "how to prevent recurrence," and of turning those findings into organizational learning and improvement.

Purpose

The main purposes of a postmortem are:

Detailed incident documentation: Accurately documenting what happened
Root cause identification: Revealing fundamental causes rather than superficial ones
Recurrence prevention measures: Defining specific and actionable improvement actions
Promoting organizational learning: Sharing insights and enhancing team-wide resilience
System improvement: Learning from incidents to build more robust systems

Importance of Postmortems

The Google SRE book emphasizes the idea that "failure is a learning opportunity." Through postmortems, organizations can avoid repeating the same mistakes and continuously improve.

Blameless Culture

The most important principle in postmortems is being blameless - not blaming individuals.

Why Not Blame?

Everyone makes mistakes. Blaming individuals has the following negative effects:

Encourages cover-ups: Engineers start trying to hide failures
Reduces psychological safety: Team members become hesitant and unable to take risks
Overlooks root causes: Focuses on human error and misses systemic problems
Loses learning opportunities: The organization loses the chance to learn from failures

Practicing Blameless Culture

In a blameless culture, we ask:

Not "Who is at fault?" but "What went wrong?"
Not "Why did they make a mistake?" but "Why didn't the system prevent the mistake?"
Not "Who will take responsibility?" but "How can we prevent recurrence?"

Humans Are Part of the System

Human error is a signal of system design flaws. With proper guardrails, automation, and checking mechanisms, many human errors can be prevented.

When to Conduct Postmortems

Not all incidents require a postmortem. Conduct one when an incident meets any of these criteria:

Required Cases

Outages affecting users (downtime, performance degradation, etc.)
Data loss or corruption
Security incidents
SLO violations
Incidents requiring manual emergency response
Issues affecting other teams or stakeholders
Incidents providing valuable learning opportunities

Recommended Cases

Near misses (no actual impact but potential risk existed)
Cases revealing inadequate procedures or tools
Cases where multiple small problems occurred simultaneously

Don't Overlook Small Incidents

Even small-scale incidents can be symptoms of larger problems if patterns repeat. Postmortems for minor incidents also have value.

Postmortem Process

1. Timing

Conduct as soon as possible after incident resolution (typically within 24-48 hours). It's important to record details while memories are fresh.

2. Participants

The following members are recommended to participate:

Incident responders: Engineers who actually responded
Service owners: Those responsible for the affected service
Related team members: Other team members affected or related
Facilitator: Role to neutrally guide discussion (optional)

3. Implementation Steps

Postmortem Template

The following are elements of the postmortem template recommended in the Google SRE book.

Basic Information

Item	Content
Title	Brief description of the incident
Date/Time	Incident occurrence and resolution times
Author	Postmortem author
Status	Draft / In Review / Finalized
Severity	Critical / High / Medium / Low

Section Structure

1. Executive Summary

Briefly explain the incident overview in 2-3 sentences.

## Executive Summary

From 10:30 JST to 12:00 JST on December 11, 2024,
the payment API response time degraded from an average of 500ms to 5 seconds,
and approximately 15% of transactions timed out.
The root cause was database connection pool exhaustion,
and we recovered by urgently expanding the pool size.

2. Impact

Affected services: Which services or features were affected
Impact scope: Number of users, regions, customer segments, etc.
Business impact: Monetary losses, reputational impact, etc.
Downtime: Total downtime or degradation time

## Impact

- **Affected services**: Payment API (payment-service)
- **Impact scope**: Users in all regions (approximately 100,000 active users)
- **Failed transactions**: Approximately 1,500 (15% of total)
- **Downtime**: No complete outage, performance degradation for 90 minutes
- **Business impact**: Estimated revenue loss approximately $5,000
- **SLO violation**: Consumed 0.1% of error budget against monthly 99.9% availability SLO
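Error budget consumption for a request-based availability SLO can be estimated from the failed request count. The following is a minimal sketch, not the method used in the sample above: the function name `error_budget_consumed` and the 5,000,000 monthly request volume are assumptions made up for illustration, so the printed percentage is not the incident's actual figure.

```python
def error_budget_consumed(failed: int, total_in_window: int, slo: float) -> float:
    """Return the fraction of the error budget used by `failed` requests.

    The budget is the number of requests allowed to fail in the SLO window:
    (1 - slo) * total_in_window.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be between 0 and 1, exclusive")
    if total_in_window <= 0:
        raise ValueError("total_in_window must be positive")
    allowed_failures = (1.0 - slo) * total_in_window
    return failed / allowed_failures


# Hypothetical month: 5,000,000 requests against a 99.9% SLO,
# so about 5,000 requests may fail. 1,500 failures use about 30% of that.
consumed = error_budget_consumed(failed=1_500, total_in_window=5_000_000, slo=0.999)
print(f"{consumed:.0%} of the monthly error budget consumed")
```

Stating both the failed count and the window total in the Impact section lets readers check the error budget figure themselves.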

3. Timeline

Record the incident progression chronologically. Specify timezone for all times.

## Timeline

| Time (JST) | Event |
| :--- | :--- |
| 10:30 | Datadog alert: Payment API response time exceeded threshold |
| 10:32 | On-call engineer (Tanaka) began investigation |
| 10:35 | Confirmed DB connection wait in payment service via APM |
| 10:40 | Identified database connection pool reached limit |
| 10:45 | Urgently expanded pool size from 50 to 100 (configuration change) |
| 10:50 | Began rolling restart of payment service |
| 11:00 | Response time began returning to normal range |
| 11:30 | Confirmed all instances operating normally |
| 12:00 | Incident closed |
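A timeline with consistent timestamps also makes response durations easy to derive. The sketch below is illustrative only: it copies five entries from the sample timeline, and the dictionary keys and the three duration names are labels chosen here, not terms from the template.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%H:%M"


def minutes_between(start: str, end: str) -> int:
    """Minutes from `start` to `end`, both "HH:MM" on the same day and timezone."""
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    if delta < timedelta(0):
        raise ValueError("end is earlier than start")
    return int(delta.total_seconds() // 60)


# Entries copied from the sample timeline (all JST); key names are hypothetical.
timeline = {
    "alert": "10:30",
    "investigation_started": "10:32",
    "cause_identified": "10:40",
    "mitigation_applied": "10:45",
    "recovery_confirmed": "11:30",
}

time_to_acknowledge = minutes_between(timeline["alert"], timeline["investigation_started"])
time_to_identify = minutes_between(timeline["alert"], timeline["cause_identified"])
time_to_recover = minutes_between(timeline["alert"], timeline["recovery_confirmed"])

print(f"acknowledge: {time_to_acknowledge} min")  # 2 min
print(f"identify:    {time_to_identify} min")     # 10 min
print(f"recover:     {time_to_recover} min")      # 60 min
```

This helper assumes every entry falls on the same day. An incident that crosses midnight would need full dates in the timeline.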

4. Root Cause

Identify the root cause using a method such as the 5 Whys.

## Root Cause

### 5 Whys Analysis

1. **Why did response time degrade?**
   → Because database connection wait time was long

2. **Why was connection wait time long?**
   → Because connection pool was exhausted and new connections couldn't be acquired

3. **Why was connection pool exhausted?**
   → Because traffic doubled from normal and pool size (50) was insufficient

4. **Why did traffic increase?**
   → Unexpected traffic surge from marketing campaign

5. **Why couldn't we predict campaign traffic increase?**
   → Lack of communication between marketing and engineering teams

### Root Cause Summary

**Direct cause**: Insufficient database connection pool size

**Systemic causes**: 
- Lack of auto-scaling mechanism for traffic surges
- No load testing before campaign implementation
- Deficient cross-department communication processes

5. Resolution and Response

Record emergency measures implemented during incident response.

## Resolution and Response

### Emergency Response

1. **Expanded connection pool size** (10:45)
   - Configuration change: `max_pool_size: 50 → 100`
   - Rolling restart of payment service

2. **Temporarily stopped unnecessary batch processing** (10:50)
   - Paused report generation to reduce DB load

3. **Enhanced monitoring** (11:00)
   - Created monitoring dashboard for connection pool utilization

6. Prevention Measures

Define specific and actionable action items. Set assignees and deadlines for each item.

## Prevention Measures

### Short-term measures (within 1 week)

- [ ] **Implement connection pool auto-scaling** 
  - Assignee: Tanaka
  - Deadline: December 18, 2024
  - Details: Automatically adjust pool size during traffic increases

- [ ] **Set connection pool utilization alerts**
  - Assignee: Suzuki
  - Deadline: December 15, 2024
  - Details: Warning at 80% utilization, Critical at 90%

### Medium-term measures (within 1 month)

- [ ] **Establish marketing campaign advance notification process**
  - Assignee: Sato (Marketing), Tanaka (SRE)
  - Deadline: January 10, 2025
  - Details: Notify 2 weeks before campaign and conduct load testing

- [ ] **Introduce automated load testing**
  - Assignee: Yamada
  - Deadline: January 15, 2025
  - Details: Integrate production-level load testing into CI/CD pipeline

### Long-term measures (within 3 months)

- [ ] **Implement circuit breaker for database connections**
  - Assignee: Takahashi
  - Deadline: March 1, 2025
  - Details: Enable partial service provision during failures

- [ ] **Establish capacity planning process**
  - Assignee: Entire SRE team
  - Deadline: March 15, 2025
  - Details: Quarterly capacity reviews and forecasting
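The alert thresholds in the short-term measures (warning at 80% utilization, critical at 90%) can be written down as a small classification rule. This is a sketch under assumptions: the function name `pool_alert_level` and the returned labels are hypothetical, and a real deployment would usually configure the thresholds in the monitoring tool rather than in application code.

```python
WARNING_THRESHOLD = 0.80   # from the sample action item: warning at 80%
CRITICAL_THRESHOLD = 0.90  # from the sample action item: critical at 90%


def pool_alert_level(in_use: int, max_pool_size: int) -> str:
    """Classify connection pool utilization as "ok", "warning", or "critical"."""
    if max_pool_size <= 0:
        raise ValueError("max_pool_size must be positive")
    if in_use < 0:
        raise ValueError("in_use must not be negative")
    utilization = in_use / max_pool_size
    if utilization >= CRITICAL_THRESHOLD:
        return "critical"
    if utilization >= WARNING_THRESHOLD:
        return "warning"
    return "ok"


# With the original pool size of 50, the warning fires at 40 connections in use.
for in_use in (30, 40, 45, 50):
    print(in_use, pool_alert_level(in_use, max_pool_size=50))
```

Had these thresholds been in place, the pool would have raised a warning before reaching its limit of 50.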

7. Lessons Learned

What went well: Quick cause identification, effective escalation procedures, etc.
What could be improved: Monitoring, alerting, runbooks, communication, etc.
What was lucky: Reasons it didn't become worse

## Lessons Learned

### What Went Well ✅

- On-call engineer responded quickly (within 2 minutes)
- APM tool (Datadog) enabled rapid cause identification
- Safely recovered following runbook restart procedures

### What Could Be Improved 🔧

- Connection pool utilization monitoring was insufficient
- Couldn't anticipate traffic increase from campaign
- Auto-scaling mechanism was not implemented

### What Was Lucky 🍀

- Occurred outside peak hours, limiting impact
- Degradation rather than complete downtime
- No data loss or corruption occurred

8. References

Include links to related logs, metrics, graphs, and documents.

## References

- [Datadog Dashboard (during incident)](https://app.datadoghq.com/dashboard/xxx)
- [Slack Incident Channel](https://slack.com/archives/C12345678)
- [PagerDuty Incident](https://example.pagerduty.com/incidents/P123456)
- [Related Code Change (PR #1234)](https://github.com/org/repo/pull/1234)

Root Cause Analysis Methods

5 Whys

The simplest and most effective method. Dig from superficial causes to root causes by repeatedly asking "Why?"

Key points:

Ask "Why?" at least 5 times (can be 3 or 7 depending on situation)
Confirm each answer is logically connected
Focus on systems and processes, not people

Fishbone Diagram (Ishikawa Diagram)

Effective for complex incidents involving multiple factors.

Timeline Analysis

Organize events chronologically and visualize what happened at each point.

Postmortem Best Practices

1. Conduct Timely

Conduct the postmortem as soon as possible (within 24-48 hours) after incident resolution.

2. Fact-Based

Avoid speculation and guesswork
Write based on logs, metrics, and evidence
Clearly mark uncertainties as "unknown"

3. Define Specific and Actionable Actions

Bad examples (vague):

"Improve monitoring"
"Strengthen communication"

Good examples (specific):

"Set warning alert at 80% connection pool utilization (Assignee: Tanaka, Deadline: 12/15)"
"Establish operational flow for 2-week advance notification of marketing campaigns (Assignee: Sato, Deadline: 1/10)"

4. Create Inclusive Environment

Ensure psychological safety
Respect all opinions
Have facilitator guide neutrally

5. Share Widely

Postmortems are organizational learning assets
Publish in internal wiki or document repository
Regularly review past postmortems

6. Track Action Items

Set assignee and deadline for each item
Check progress regularly
Always close when completed

Tracking Action Items

The most serious postmortem failure is leaving action items unexecuted. Use a tracking tool such as Jira or GitHub Issues and track every item to completion.
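The two checks described here, that every item has an assignee and a deadline and that open items are followed up, can be sketched in a few lines. Everything below is illustrative: the `ActionItem` class, the helper names, the closed/open states, and the review date of December 20, 2024 are assumptions, and a tracker such as the ones named above would normally provide this through its own queries.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    title: str
    assignee: Optional[str] = None
    deadline: Optional[date] = None
    closed: bool = False


def missing_fields(item: ActionItem) -> List[str]:
    """Names of the required tracking fields that are not set on `item`."""
    missing = []
    if not item.assignee:
        missing.append("assignee")
    if item.deadline is None:
        missing.append("deadline")
    return missing


def overdue(items: List[ActionItem], today: date) -> List[ActionItem]:
    """Open items whose deadline has passed, earliest deadline first."""
    late = [
        i for i in items
        if not i.closed and i.deadline is not None and i.deadline < today
    ]
    return sorted(late, key=lambda i: i.deadline)


# Titles, assignees, and deadlines follow the sample; the states are hypothetical.
items = [
    ActionItem("Set connection pool utilization alerts", "Suzuki", date(2024, 12, 15), closed=True),
    ActionItem("Implement connection pool auto-scaling", "Tanaka", date(2024, 12, 18)),
    ActionItem("Improve monitoring"),  # vague item with no owner or date
]

for item in items:
    gaps = missing_fields(item)
    if gaps:
        print(f"{item.title}: missing {', '.join(gaps)}")

for item in overdue(items, today=date(2024, 12, 20)):
    print(f"{item.title}: overdue since {item.deadline.isoformat()}")
```

Running a check like this at each progress review surfaces both untrackable items and items that have slipped past their deadline.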

7. Celebrate Successes

Document what went well
Recognize team's quick response
Visualize improvement results

Postmortem Storage and Utilization

Document Management

Centralized management: Store in accessible locations like Confluence, Notion, GitHub Wiki
Searchable: Classify with tags and categories for easy searching
Version control: Maintain update history

Regular Reviews

Review past postmortems quarterly
Analyze patterns and trends
Connect to organization-level improvements

Metrics Tracking

Number of postmortems created
Action item completion rate
MTTR (Mean Time To Repair) trends
Recurrence rate of similar incidents
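These four metrics can be computed from a simple record kept per postmortem. The sketch below assumes a hypothetical `PostmortemRecord` shape and invented sample values; it is one possible way to aggregate the numbers, not a prescribed format.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PostmortemRecord:
    minutes_to_repair: float      # incident start to recovery
    action_items_total: int
    action_items_closed: int
    is_recurrence: bool           # True if a similar incident happened before


def summarize(records: List[PostmortemRecord]) -> Dict[str, float]:
    """Aggregate the tracking metrics over a set of postmortems."""
    count = len(records)
    if count == 0:
        raise ValueError("at least one postmortem record is required")
    total_items = sum(r.action_items_total for r in records)
    closed_items = sum(r.action_items_closed for r in records)
    return {
        "postmortems": count,
        "action_item_completion_rate": closed_items / total_items if total_items else 0.0,
        "mttr_minutes": sum(r.minutes_to_repair for r in records) / count,
        "recurrence_rate": sum(1 for r in records if r.is_recurrence) / count,
    }


# Invented sample data for one quarter.
quarter = [
    PostmortemRecord(minutes_to_repair=60, action_items_total=6, action_items_closed=3, is_recurrence=False),
    PostmortemRecord(minutes_to_repair=30, action_items_total=4, action_items_closed=4, is_recurrence=True),
]

for name, value in summarize(quarter).items():
    print(f"{name}: {value}")
```

Comparing these summaries quarter over quarter shows whether MTTR and the recurrence rate are trending down.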

Practice Example: Google's Postmortem Culture

At Google, postmortems are an important process conducted routinely.

Characteristics

Public by default: Published organization-wide, accessible to everyone
Award system: Excellent postmortems are recognized internally
Learning resource: Also used as training material for new employees
Data-driven: Drive improvements with data from postmortems

Effects

Reduced incident response times
Prevented recurrence of similar problems
Improved team-wide technical capabilities
Strengthened organizational resilience

Google's "Non-Abstract Large System Design (NALSD)" Interview

In Google interviews, lessons learned from past postmortems can be applied to system design questions. Postmortems contribute to individual career growth.

Common Failure Patterns and Countermeasures

Failure Pattern 1: Blaming Individuals

Problem: "Because Mr./Ms. XX executed the wrong command"

Countermeasure: View as "system flaw that allowed inappropriate command execution" and add guardrails or review processes

Failure Pattern 2: Vague Action Items

Problem: "Be more careful next time" "Pay more attention"

Countermeasure: Define specific technical improvements (automation, add alerts, revise runbooks, etc.)

Failure Pattern 3: Action Items Not Implemented

Problem: The team is satisfied with having written the postmortem, and the improvements are never implemented

Countermeasure: Manage with tracking tools, review progress regularly

Failure Pattern 4: Taking Too Much Time

Problem: Spending weeks trying to write a perfect postmortem

Countermeasure: Start with a draft. Timely completion and follow-up action matter more than polish

Failure Pattern 5: Not Sharing

Problem: The postmortem stays within one team and is not shared organization-wide

Countermeasure: Publish in internal wiki and notify related teams

Checklist

Use this checklist to verify postmortem quality.

Summary

Postmortems are one of the most important practices in SRE.

Key Points

Blameless culture: Focus on system problems, not individuals
Timely implementation: Conduct as soon as possible after incident resolution
Specific actions: Define actionable and trackable improvements
Organizational learning: Share insights and enhance team-wide resilience
Continuous improvement: Track action items to completion

Through postmortems, organizations can learn from failures and build more robust and reliable systems. Cultivating a culture that "failure is not the end, but the beginning of learning" is key to SRE success.
