Kubernetes Probes (Liveness, Readiness, Startup)

Kubernetes provides Probes to monitor the health of containers and automatically perform recovery or traffic control. Properly configuring probes can significantly improve application availability and reliability.

Three Types of Probes and Their Roles

Kubernetes has three probes, each with a different purpose and a different consequence on failure. Choose the probe whose failure action matches what you want to happen.

1. Liveness Probe (Survival Check)

Purpose: Checks if the application is "alive" (e.g., not in a deadlock or infinite loop).
Action: If it fails failureThreshold times in a row, the kubelet restarts the container.
Use Case: Used to detect states where the application has not crashed but is unresponsive due to internal errors, and can only be recovered by a restart.

2. Readiness Probe (Readiness Check)

Purpose: Checks if the application is "ready to accept traffic".
Action: If it fails, the Pod's IP address is removed from the Service's load-balancing targets (Endpoints). The container is not restarted.
Use Case: Used to prevent traffic from flowing to the Pod during startup, while loading large data, or when temporarily overloaded and unable to process requests.

3. Startup Probe (Startup Check)

Purpose: Checks if the application has "completed startup".
Action: Liveness and Readiness Probes are disabled until this probe succeeds. If it fails failureThreshold times in a row, the kubelet kills the container, which is then restarted according to the Pod's restartPolicy.
Use Case: Used for legacy applications or large Java applications that take a long time to start. Use this instead of setting a long initialDelaySeconds for the Liveness Probe.

Probe Check Mechanisms

Each probe uses one of the following methods to perform checks:

HTTP GET: Sends an HTTP GET request to the specified path and port. Status codes 200-399 are considered successful.
TCP Socket: Attempts to establish a TCP connection to the specified port. If the connection is established, it is considered successful.
Exec: Executes a specified command inside the container. If the exit code is 0, it is considered successful.
gRPC: Calls the gRPC Health Checking Protocol on the specified port. A SERVING status is considered successful. This method is available in newer Kubernetes versions.
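The check methods can be mixed within a single container, one per probe. The following is a minimal sketch rather than a recommended setup: the Pod name and the /tmp/started marker file are hypothetical, the image is the same placeholder used in the configuration example, and it assumes the image ships a cat binary.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probe-mechanisms-demo   # hypothetical name
spec:
  containers:
  - name: my-app
    image: my-app:v1            # placeholder image
    # Exec: succeeds when the command exits with code 0.
    # Assumes the application creates /tmp/started once it has booted.
    startupProbe:
      exec:
        command: ["cat", "/tmp/started"]
      periodSeconds: 5
      failureThreshold: 12      # startup budget: 5s * 12 = 60s
    # HTTP GET: succeeds on a 200-399 status code.
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
    # TCP Socket: succeeds when the connection can be opened.
    livenessProbe:
      tcpSocket:
        port: 8080
```

A TCP check only proves that the port accepts connections, so it suits processes that expose no HTTP endpoint. An HTTP check can report more about the application's internal state.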

Configuration Example

Here is an example configuration combining all three probes.

apiVersion: v1
kind: Pod
metadata:
  name: probe-demo
spec:
  containers:
  - name: my-app
    image: my-app:v1
    
    # 1. Startup Probe: Wait up to 300 seconds (10s * 30 times) for startup
    startupProbe:
      httpGet:
        path: /healthz
        port: 8080
      failureThreshold: 30
      periodSeconds: 10
    
    # 2. Readiness Probe: Check every 5 seconds after startup. Stop traffic after 3 consecutive failures
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5 # Wait 5 seconds after Startup Probe succeeds
      periodSeconds: 5
      failureThreshold: 3
      successThreshold: 1
    
    # 3. Liveness Probe: Check every 10 seconds after startup. Restart after 3 consecutive failures
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
      failureThreshold: 3
      timeoutSeconds: 1

Best Practices

1. Distinguish Between Liveness and Readiness

Liveness: Should only detect fatal errors that "can be fixed by a restart".
Readiness: Should detect temporary issues where "it's busy now but will recover if we wait" or startup states.
Anti-Pattern: Checking database connectivity with a Liveness Probe. If the DB goes down, every web server Pod fails its Liveness Probe and is restarted repeatedly (CrashLoopBackOff), even though a restart cannot fix the database. The restart back-off can then delay recovery of the whole system after the DB is back up (a cascading failure). Check external dependencies in Readiness Probes or in the application's own error handling, not in Liveness Probes.

2. Leverage Startup Probe

Avoid setting an extremely long initialDelaySeconds on the Liveness Probe of a slow-starting app. A fixed delay leaves the container unmonitored longer than necessary when startup is fast, and causes premature restarts when startup takes longer than the delay. A Startup Probe detects the actual moment startup completes and hands over to Liveness monitoring immediately.
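As a rough illustration of this trade-off, the sketch below gives the container a 300-second startup budget without a fixed delay. The Pod name and the specific numbers are assumptions chosen for the example, and the discouraged alternative appears only as a comment.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: slow-start-demo         # hypothetical name
spec:
  containers:
  - name: my-app
    image: my-app:v1            # placeholder image
    # Discouraged alternative (comment only): a fixed delay such as
    #   livenessProbe.initialDelaySeconds: 300
    # postpones the first liveness check by 300s even if the app is up in 20s.
    startupProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 5
      failureThreshold: 60      # startup budget: 5s * 60 = 300s
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 3
```

With this layout, an application that finishes booting early passes the Startup Probe on the next 5-second check, and liveness monitoring takes over from there.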

3. Adjust Timeouts and Intervals

timeoutSeconds (default 1s) can be too short. Consider increasing it slightly (e.g., 2-5s) so that slow responses under high load are not counted as probe failures.
If periodSeconds (default 10s) is set too low, the load generated by the probes themselves can become significant.
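To make the effect of these settings concrete, here is a small sketch with a relaxed timeout. The Pod name and the chosen timeout are assumptions, and the detection time in the comment is an approximation, not a guarantee.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probe-timing-demo       # hypothetical name
spec:
  containers:
  - name: my-app
    image: my-app:v1            # placeholder image
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      timeoutSeconds: 3         # raised from the 1s default
      periodSeconds: 10         # the default
      failureThreshold: 3       # the default
      # A hung container is restarted after roughly
      # periodSeconds * failureThreshold = 10s * 3 = 30s of consecutive failures.
```

Raising timeoutSeconds tolerates slow responses. Raising periodSeconds or failureThreshold lengthens the time before a genuinely hung container is restarted.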

4. Provide Dedicated Health Check Endpoints

Instead of using / (root) as the probe path, implement lightweight dedicated endpoints such as /healthz or /ready. This makes probe requests easy to filter out of access logs and keeps the cost of each check low.

5. Interaction with Graceful Shutdown

When a Pod is deleted, Kubernetes sends SIGTERM to its containers and, in parallel, starts removing the Pod from Endpoints regardless of the Readiness Probe state. Because that removal takes time to propagate, having the /ready endpoint return 503 as soon as the application begins shutting down can help stop traffic more safely.
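A related pattern that is often combined with this is a preStop hook that pauses briefly before SIGTERM is delivered, which gives the Endpoints removal time to propagate. This is a complementary technique, not the /ready approach described above, and the sketch below is only an illustration: the Pod name, the 10-second sleep, and the 45-second grace period are assumptions, and it requires an image that ships sh and sleep.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: graceful-shutdown-demo  # hypothetical name
spec:
  # Must cover the preStop sleep plus the application's own shutdown time.
  terminationGracePeriodSeconds: 45
  containers:
  - name: my-app
    image: my-app:v1            # placeholder image
    readinessProbe:
      httpGet:
        path: /ready            # assumed to return 503 once shutdown begins
        port: 8080
      periodSeconds: 5
    lifecycle:
      preStop:
        exec:
          # Runs before SIGTERM is sent; assumes sh and sleep exist in the image.
          command: ["sh", "-c", "sleep 10"]
```

The time spent in the preStop hook counts against terminationGracePeriodSeconds, so the grace period should be sized with both phases in mind.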
