SRE and Observability
This document summarizes observability in modern, complex distributed systems: how it differs from monitoring, what its components are, and what value it brings to SRE teams. It is based on IBM's technical documentation "What is observability?".
Overview of Observability in SRE
Observability is the ability to understand the internal state and condition of a system based solely on knowledge of its external outputs (telemetry data).
The term originates in control theory. In IT Operations (ITOps) and cloud computing, it refers to collecting, aggregating, and analyzing performance data such as logs, metrics, and traces to provide the visibility needed to identify and resolve problems in real time. The more observable a system is, the more quickly and accurately a team can trace a performance issue back to its root cause, often without additional testing or coding.
Difference Between Monitoring and Observability
Observability does not replace monitoring; it extends and evolves it.
1. Monitoring
- Covers "known unknowns": it detects abnormal states that the team already knows to watch for (e.g., high CPU usage, memory shortages).
- Excellent for visualizing "what is happening" using dashboards.
2. Observability
- Discovers "Unknown unknowns". It reveals new patterns or abnormal behaviors that the team did not anticipate.
- Focuses on the "Why" — investigating the root cause of why something is happening.
- Essential for understanding complex dependencies and the behavior of transient (short-lived) functions in microservices and distributed systems.
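The contrast between the two can be sketched in a few lines of Python. This is a minimal illustration and is not taken from the IBM documentation: the event fields (`service`, `region`, `build`, `latency_ms`), the threshold value, and the function names are all hypothetical. The first function is a monitoring-style check against a threshold chosen in advance; the second slices the same raw events by whichever field the investigator picks afterwards, which is closer to the exploratory "why" questions that observability is meant to support.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical request events; field names and values are illustrative only.
EVENTS = [
    {"service": "checkout", "region": "eu", "build": "v41", "latency_ms": 120},
    {"service": "checkout", "region": "eu", "build": "v42", "latency_ms": 950},
    {"service": "checkout", "region": "us", "build": "v41", "latency_ms": 110},
    {"service": "checkout", "region": "us", "build": "v42", "latency_ms": 990},
]

LATENCY_THRESHOLD_MS = 500  # assumed alert threshold, decided in advance


def threshold_alert(events, threshold_ms=LATENCY_THRESHOLD_MS):
    """Monitoring: a predefined check that reports *that* latency is high."""
    return mean(e["latency_ms"] for e in events) > threshold_ms


def mean_latency_by(events, field):
    """Exploratory query: average latency grouped by any chosen dimension."""
    groups = defaultdict(list)
    for event in events:
        groups[event[field]].append(event["latency_ms"])
    return {key: mean(values) for key, values in groups.items()}


if __name__ == "__main__":
    print(threshold_alert(EVENTS))            # the alert fires
    print(mean_latency_by(EVENTS, "region"))  # regions look alike
    print(mean_latency_by(EVENTS, "build"))   # one build stands out
```

In this sample data the alert only says that latency is high. Grouping by `region` shows no meaningful difference, while grouping by `build` points at a single release, a question nobody had to anticipate when the threshold was configured.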
The Three Pillars of Observability
Observability is primarily based on three types of telemetry data:
1. Logs
Granular, immutable, timestamped records of application events. They are used for troubleshooting and debugging, providing context for events.
2. Metrics
Numerical data (time-series data) indicating the health of an application or system over a specific period.
- Examples: CPU/memory usage, latency, traffic volume, etc.
- Used for identifying trends and detecting anomalies.
3. Traces
Records of the "end-to-end path" of a user request as it traverses the entire system (UI, backend, database, etc.). In distributed systems, they help identify where requests are delayed and which components are bottlenecks.
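To make the three signal types concrete, the sketch below emits a log line, a span, and a counter for one simulated request and ties them together with a shared trace ID. It is an illustration under stated assumptions, not a real telemetry SDK: the in-memory stores, the function names, and the record fields are hypothetical, and durations are passed in rather than measured so that the example is deterministic.

```python
import json
import time
import uuid
from collections import Counter

# In-memory stand-ins for telemetry backends (hypothetical, for illustration).
LOGS = []
SPANS = []
METRICS = Counter()


def emit_log(trace_id, message, **context):
    """Log: a timestamped record of one event, with context attached."""
    record = {"ts": time.time(), "trace_id": trace_id, "message": message}
    record.update(context)
    LOGS.append(json.dumps(record))


def record_span(trace_id, component, duration_ms):
    """Trace: one span for each component the request passes through."""
    SPANS.append({"trace_id": trace_id, "component": component,
                  "duration_ms": duration_ms})


def handle_request(steps):
    """Simulate a request crossing components; steps = [(name, duration_ms)]."""
    trace_id = uuid.uuid4().hex
    for component, duration_ms in steps:
        record_span(trace_id, component, duration_ms)
        emit_log(trace_id, "component finished", component=component)
    METRICS["requests_total"] += 1  # Metric: a numeric value tracked over time
    return trace_id


def slowest_component(trace_id):
    """Use the spans of one trace to find where the request spent most time."""
    spans = [s for s in SPANS if s["trace_id"] == trace_id]
    return max(spans, key=lambda s: s["duration_ms"])["component"]


if __name__ == "__main__":
    tid = handle_request([("ui", 5), ("backend", 12), ("database", 240)])
    print(slowest_component(tid))
```

Because every log line and span carries the same `trace_id`, the records for a single request can be pulled together after the fact, and the span durations show which component was the bottleneck.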
Benefits for SRE Teams
Full-stack observability brings the following benefits to SRE teams:
- Discovering "Unknown Unknowns": Identify unexpected issues that traditional monitoring misses and trace how they relate to specific performance problems.
- Problem Resolution in Early Development: A shift-left approach surfaces issues during the development phase, so they can be fixed before they affect the production environment.
- Improved User Experience: Enables application optimization based on a deeper understanding of user behavior.
- Automation and Self-Healing: Utilizing AIOps and machine learning allows for autonomous operations that predict and repair issues without human intervention.
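As a rough illustration of the detection step behind such automation, the sketch below flags metric samples that deviate sharply from a trailing window, using a simple z-score. This is a deliberately simple stand-in and not how any particular AIOps product works; the function name, window size, and threshold are assumptions chosen for the example.

```python
from statistics import mean, stdev


def find_anomalies(samples, window=5, z_threshold=3.0):
    """Return indices of samples far from the mean of the preceding window."""
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: treat any change at all as a deviation.
            if samples[i] != mu:
                anomalies.append(i)
            continue
        if abs(samples[i] - mu) / sigma > z_threshold:
            anomalies.append(i)
    return anomalies


if __name__ == "__main__":
    # Hypothetical latency samples in milliseconds with one spike.
    latency_ms = [100, 102, 98, 101, 99, 100, 103, 400, 101, 99]
    print(find_anomalies(latency_ms))
```

In a self-healing setup, a flagged index would be the trigger for an automated remediation step (for example, a restart or rollback), which is outside the scope of this sketch.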
Contribution to Incident Management and Reliability
Observability plays a central role in incident management and improving system reliability.
- Shortening MTTR (Mean Time To Repair): Accelerates the process from problem detection to root cause identification. When all relevant telemetry data (logs, metrics, traces) is correlated, SREs can quickly see "where" and "why" a problem occurred.
- Accurate Root Cause Analysis: AI-driven analysis helps separate the true signal (real problems) from the noise (alerts that do not point to a real problem), allowing SREs to focus on critical incidents.
- Strengthening DevOps/DevSecOps: Acts as a common language between development and operations, providing continuous feedback loops to enhance release reliability and security.
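MTTR itself is a plain average: the total time from detection to resolution divided by the number of incidents. The sketch below computes it from a list of incident records. The record layout, field names, and timestamps are hypothetical and exist only to show the arithmetic.

```python
from datetime import datetime, timedelta

# Hypothetical incident records; timestamps are illustrative only.
INCIDENTS = [
    {"detected": datetime(2024, 1, 3, 10, 0),
     "resolved": datetime(2024, 1, 3, 10, 45)},
    {"detected": datetime(2024, 1, 9, 22, 15),
     "resolved": datetime(2024, 1, 9, 23, 30)},
    {"detected": datetime(2024, 1, 20, 4, 0),
     "resolved": datetime(2024, 1, 20, 4, 30)},
]


def mean_time_to_repair(incidents):
    """Average of (resolved - detected) across all incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum((i["resolved"] - i["detected"] for i in incidents), timedelta())
    return total / len(incidents)


if __name__ == "__main__":
    print(mean_time_to_repair(INCIDENTS))
```

For the three sample incidents the repair times are 45, 75, and 30 minutes, so the mean is 50 minutes. Tracking this value over time is one way to check whether better telemetry correlation is actually shortening investigations.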