Job and CronJob

Kubernetes provides resources not only for long-running applications such as web servers, but also for tasks that run once and exit and for tasks that run periodically. These are Job and CronJob.

Job

A Job creates one or more Pods and keeps retrying them until a specified number have terminated successfully (Succeeded).

Main Use Cases (Job)

Database migrations
Batch processing (data aggregation, report generation, etc.)
One-time initialization scripts

Manifest Example (Job)

apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  template:
    spec:
      containers:
      - name: pi
        image: perl:5.34.0
        command: ["perl",  "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
  backoffLimit: 4

Key Parameters (Job)

restartPolicy: Must be OnFailure or Never for Jobs (Always is not allowed).
completions: The number of Pods that must finish successfully for the Job to be complete (default is 1).
parallelism: The maximum number of Pods that run at the same time (default is 1).
backoffLimit: The number of retries before the Job is marked as failed (default is 6).
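As a minimal sketch of how completions and parallelism combine, the following Job needs five successful Pods and runs at most two at a time. The Job name, container name, and command are hypothetical placeholders, not taken from a real workload.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-workers          # hypothetical name
spec:
  completions: 5               # the Job is complete after 5 Pods succeed
  parallelism: 2               # at most 2 Pods run at the same time
  backoffLimit: 4              # mark the Job as failed after 4 retries
  template:
    spec:
      restartPolicy: Never     # Always is not allowed for Jobs
      containers:
      - name: worker           # hypothetical container
        image: busybox:1.28
        command: ["/bin/sh", "-c", "echo processing one work item; sleep 5"]
```

With these values the Pods run in roughly three waves (2, 2, 1), assuming none of them fail.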

CronJob

A CronJob creates Jobs on a repeating schedule based on the Linux cron format.

Main Use Cases (CronJob)

Periodic backups
Periodic email delivery
Nightly data cleansing

Manifest Example (CronJob)

apiVersion: batch/v1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: hello
            image: busybox:1.28
            imagePullPolicy: IfNotPresent
            command:
            - /bin/sh
            - -c
            - date; echo Hello from the Kubernetes cluster
          restartPolicy: OnFailure

Key Parameters (CronJob)

schedule: The schedule in cron format, with five fields: minute, hour, day of month, month, and day of week (e.g., "0 0 * * *" runs every day at 00:00).
concurrencyPolicy:
- Allow: Allows concurrent runs (default).
- Forbid: Skips the new run if the previous Job has not finished yet.
- Replace: Cancels the Job that is still running and starts a new one in its place.
successfulJobsHistoryLimit: The number of successful finished jobs to retain (default 3).
failedJobsHistoryLimit: The number of failed finished jobs to retain (default 1).
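The parameters above might be combined as in the following sketch. The CronJob name, container name, and command are hypothetical; the history limits are written out at their default values only to make them visible.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-cleanup              # hypothetical name
spec:
  # Field order: minute hour day-of-month month day-of-week
  schedule: "0 0 * * *"              # every day at 00:00
  concurrencyPolicy: Forbid          # skip a run while the previous Job is still active
  successfulJobsHistoryLimit: 3      # keep the 3 most recent successful Jobs (default)
  failedJobsHistoryLimit: 1          # keep the most recent failed Job (default)
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: cleanup            # hypothetical container
            image: busybox:1.28
            command: ["/bin/sh", "-c", "echo cleaning up old data"]
```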

Best Practices

1. Timeout and Cleanup Settings

Configure settings to prevent Jobs from running indefinitely and to ensure finished Pods do not consume resources.

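activeDeadlineSeconds: The maximum duration of the Job in seconds, regardless of how many Pods it creates. When it is exceeded, all running Pods are terminated and the Job is marked as Failed with the reason DeadlineExceeded.
ttlSecondsAfterFinished: Automatically deletes the Job resource (and its associated Pods) once the specified number of seconds has passed after the Job finishes, whether it completed or failed (TTL-after-finished controller).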

spec:
  ttlSecondsAfterFinished: 100 # Delete 100 seconds after completion
  activeDeadlineSeconds: 1800  # Timeout after 30 minutes

2. Ensure Idempotency

In rare cases, such as network or node failures, a Job may run the same work more than once. Make the application idempotent, so that the result does not change even if the same process is executed multiple times.
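One possible way to make a simple batch step idempotent is sketched below: the script skips the work if the output already exists, and writes through a temporary file so that an interrupted attempt never leaves a partial result. The Job name, the /data path, and the PersistentVolumeClaim named report-data are all assumptions made for illustration; the claim would have to exist beforehand.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: report-idempotent            # hypothetical name
spec:
  backoffLimit: 4
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: report                 # hypothetical container
        image: busybox:1.28
        command:
        - /bin/sh
        - -c
        - |
          set -e
          # Skip the work if a previous attempt already finished it.
          if [ -f /data/report.txt ]; then
            echo "report already exists, nothing to do"
            exit 0
          fi
          # Write to a temporary file first, then rename it into place,
          # so a crash in the middle never leaves a partial report.
          date > /data/report.txt.tmp
          mv /data/report.txt.tmp /data/report.txt
        volumeMounts:
        - name: data
          mountPath: /data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: report-data     # hypothetical claim, assumed to exist
```

Real workloads usually need the equivalent at the data layer, for example upserts instead of inserts, but the pattern of "check, do the work, commit atomically" is the same.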

3. Resource Limits

Batch processes can be resource-intensive. Always set resources.requests and resources.limits to prevent impact on other applications.
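A sketch of requests and limits on a Job container follows. The Job name, command, and the CPU and memory figures are placeholders chosen for illustration and should be sized from measurements of the actual workload.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: aggregate                    # hypothetical name
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: aggregate              # hypothetical container
        image: busybox:1.28
        command: ["/bin/sh", "-c", "echo aggregating data"]
        resources:
          requests:                  # what the scheduler reserves on the node
            cpu: 500m
            memory: 256Mi
          limits:                    # hard ceiling enforced at runtime
            cpu: "1"
            memory: 512Mi
```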

4. Consider Concurrency Policy (CronJob)

If the job execution time might exceed the schedule interval, carefully select the concurrencyPolicy. Forbid is recommended if data consistency is critical.

5. Handling Failures

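Set backoffLimit appropriately to bound the number of retries (the default is 6), so that a persistently failing Job is marked as failed instead of being retried over and over with increasing back-off delays. Also, make sure the application writes its logs properly so that the cause of a failure can be investigated.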
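The fragment below sketches a conservative retry setup. It pairs a small backoffLimit with restartPolicy: Never, so each retry runs in a new Pod and the failed Pods remain available for inspection with kubectl logs. The value 2 is an illustrative assumption, not a recommendation from the source.

```yaml
spec:
  backoffLimit: 2            # give up after 2 retries (illustrative value)
  template:
    spec:
      restartPolicy: Never   # each retry is a new Pod; failed Pods are kept for debugging
```

With restartPolicy: OnFailure the container is restarted inside the same Pod, and that Pod is terminated once the back-off limit is reached, which can make debugging harder.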

6. Setting startingDeadlineSeconds (Important)

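.spec.startingDeadlineSeconds is the number of seconds within which a Job may still be started after its scheduled time; a run that misses this deadline is skipped and counted as missed. The field is also important for preventing a CronJob from stalling. By Kubernetes design, the controller counts how many schedules a CronJob has missed since its last scheduled run, and if more than 100 were missed, it logs an error and does not start the Job. If startingDeadlineSeconds is not set, that count covers the entire period since the last run, so a long period of suspend: true or continuous failures due to an outage can exceed the 100-count limit; the CronJob then stops generating Jobs and may have to be recreated. When the field is set, only schedules missed within the last startingDeadlineSeconds seconds are counted, which prevents unintended suspension as long as that window contains fewer than 100 scheduled times. Avoid values below 10 seconds, because the controller checks CronJobs roughly every 10 seconds and a shorter deadline can prevent Jobs from being scheduled at all.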
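As a small sketch, the fragment below sets a five-minute starting deadline on a CronJob that runs every five minutes. Both values are illustrative assumptions; the window should be chosen to match how late a run is still useful.

```yaml
spec:
  schedule: "*/5 * * * *"        # every 5 minutes
  startingDeadlineSeconds: 300   # a run may start up to 5 minutes late; only this window is checked for missed schedules
```

With these values only about one scheduled time falls inside the window, so the missed-schedule count stays far below 100 even after a long outage.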

7. Time Zone Specification

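By default, a CronJob schedule is interpreted in the time zone of the kube-controller-manager, which is typically UTC on managed clusters. To run in a specific time zone (e.g., JST), set the .spec.timeZone field to an IANA time zone name; the field is available as a stable feature in Kubernetes 1.27+ (AKS 1.27+).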

spec:
  timeZone: "Asia/Tokyo"
  schedule: "0 0 * * *" # Runs at 0:00 JST

8. Avoiding Spot VMs (Preemptible Nodes)

Spot VMs can be evicted at short notice, so long-running critical batch processes should avoid running on them (Spot Node Pools), even in clusters that use Spot VMs for cost savings. Use nodeAffinity to ensure they run on regular (On-Demand) nodes.

    spec:
      template:
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: kubernetes.azure.com/scalesetpriority
                    operator: NotIn
                    values:
                    - spot

References

Kubernetes CronJobと仲良くなりたい | Mercari Engineering Blog (Japanese)
