Kubernetes Probes (Liveness, Readiness, Startup)
Kubernetes provides Probes to monitor the health of containers and automatically perform recovery or traffic control. Properly configuring probes can significantly improve application availability and reliability.
Three Types of Probes and Their Roles
Kubernetes has three probes, each with a different purpose and a different consequence on failure, so it is important to pick the right one for each check.
1. Liveness Probe (Survival Check)
- Purpose: Checks if the application is "alive" (e.g., not in a deadlock or infinite loop).
- Action: If it fails, kubelet restarts the container.
- Use Case: Used to detect states where the application has not crashed but is unresponsive due to internal errors, and can only be recovered by a restart.
2. Readiness Probe (Readiness Check)
- Purpose: Checks if the application is "ready to accept traffic".
- Action: If it fails, the Pod's IP address is removed from the Service's load balancing targets (Endpoints). The container is not restarted.
- Use Case: Used to prevent traffic from flowing to the Pod during startup, while loading large data, or when temporarily overloaded and unable to process requests.
3. Startup Probe (Startup Check)
- Purpose: Checks if the application has "completed startup".
- Action: Liveness and Readiness Probes are disabled until this probe succeeds. If it continues to fail (exceeding the configured threshold), the container is restarted.
- Use Case: Used for legacy applications or large Java applications that take a long time to start. Use this instead of setting a long initialDelaySeconds for the Liveness Probe.
Probe Check Mechanisms
Each probe uses one of the following three methods to perform checks (recent Kubernetes versions also support a gRPC check, which is not covered here):
- HTTP GET: Sends an HTTP GET request to the specified path and port. Status codes 200-399 are considered successful.
- TCP Socket: Attempts to establish a TCP connection to the specified port. If the connection is established, it is considered successful.
- Exec: Executes a specified command inside the container. If the exit code is 0, it is considered successful.
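As a rough sketch of how each mechanism looks in a manifest, the Pod below attaches a different mechanism to each probe of a single container. The Pod name, image, port, paths, and the /tmp/healthy marker file are hypothetical placeholders, not values from a real application.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probe-mechanisms-demo   # hypothetical name
spec:
  containers:
    - name: my-app
      image: my-app:v1          # hypothetical image
      # HTTP GET: succeeds when the response status is in the 200-399 range
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
      # TCP Socket: succeeds when a connection to the port can be opened
      startupProbe:
        tcpSocket:
          port: 8080
      # Exec: succeeds when the command exits with code 0
      livenessProbe:
        exec:
          command:
            - cat
            - /tmp/healthy      # assumes the app keeps this file present while healthy
```

Exec probes start a new process inside the container on every check, so they generally cost more than HTTP or TCP checks. The exec example also assumes the image ships a cat binary.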
Configuration Example
Here is an example configuration combining all three probes.
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo
spec:
  containers:
    - name: my-app
      image: my-app:v1
      # 1. Startup Probe: Wait up to 300 seconds (10s * 30 times) for startup
      startupProbe:
        httpGet:
          path: /healthz
          port: 8080
        failureThreshold: 30
        periodSeconds: 10
      # 2. Readiness Probe: Check every 5 seconds after startup. Stop traffic after 3 consecutive failures
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        initialDelaySeconds: 5 # Wait 5 seconds after Startup Probe succeeds
        periodSeconds: 5
        failureThreshold: 3
        successThreshold: 1
      # 3. Liveness Probe: Check every 10 seconds after startup. Restart after 3 consecutive failures
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 10
        failureThreshold: 3
        timeoutSeconds: 1
Best Practices
1. Distinguish Between Liveness and Readiness
- Liveness: Should only detect fatal errors that "can be fixed by a restart".
- Readiness: Should detect temporary issues where "it's busy now but will recover if we wait" or startup states.
- Anti-Pattern: Checking database connectivity with a Liveness Probe. If the DB goes down, all Web Server Pods will restart repeatedly (CrashLoopBackOff), potentially preventing the entire system from recovering even after the DB is back up (Cascading Failure). External dependency checks should be done in Readiness Probes or handled within the application's error handling, not in Liveness Probes.
2. Leverage Startup Probe
Avoid setting an extremely long initialDelaySeconds for Liveness Probes for slow-starting apps. This causes unnecessary waiting if startup is fast, or accidental restarts if startup is slow. Startup Probe allows accurate detection of startup completion and immediate transition to Liveness monitoring.
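One way to picture this trade-off is the container fragment below. It is a sketch that reuses the /healthz endpoint and port 8080 from the configuration example, and the period and threshold values are illustrative assumptions. The startup budget is failureThreshold * periodSeconds, so a short period with a high threshold keeps the same 300-second ceiling while noticing a fast start within a few seconds.

```yaml
# Fragment of spec.containers[] (values are illustrative)
- name: my-app
  image: my-app:v1
  startupProbe:
    httpGet:
      path: /healthz
      port: 8080
    periodSeconds: 5       # poll every 5s so a fast start is detected quickly
    failureThreshold: 60   # 5s * 60 = up to 300s before the container is restarted
  livenessProbe:           # no long initialDelaySeconds needed
    httpGet:
      path: /healthz
      port: 8080
    periodSeconds: 10
    failureThreshold: 3
```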
3. Adjust Timeouts and Intervals
- timeoutSeconds (default 1s) can be too short. Consider increasing it slightly (e.g., 2-5s) to avoid false positives during high load.
- If periodSeconds (default 10s) is too short, the load of the Probe itself may become non-negligible.
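A sketch of these settings on a liveness probe follows. The values are illustrative, not recommendations for any particular workload.

```yaml
# Fragment of a container spec (values are illustrative)
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  timeoutSeconds: 3      # raised from the default of 1s to tolerate slow responses under load
  periodSeconds: 10      # default interval; shorter periods mean more probe traffic
  failureThreshold: 3    # roughly 3 * 10s = 30s of consecutive failures before a restart
```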
4. Provide Dedicated Health Check Endpoints
Instead of using / (root) as a probe, it is recommended to implement lightweight dedicated endpoints like /healthz or /ready. This prevents logs from being flooded with access logs and minimizes the overhead of check processing.
5. Interaction with Graceful Shutdown
When a Pod terminates (receives SIGTERM), Kubernetes starts removing the Pod from Endpoints regardless of the Readiness Probe state. However, since there is a time lag in propagation, implementing the /ready endpoint to return 503 during the application's shutdown process can help stop traffic more safely.