Job and CronJob
Kubernetes provides resources not only for long-running applications such as web servers but also for tasks that run once and exit, and for tasks that run on a schedule. These are Job and CronJob.
Job
A Job creates one or more Pods and ensures that a specified number of them successfully terminate (Succeeded).
Main Use Cases (Job)
- Database migrations
- Batch processing (data aggregation, report generation, etc.)
- One-time initialization scripts
Manifest Example (Job)
apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  template:
    spec:
      containers:
      - name: pi
        image: perl:5.34.0
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
  backoffLimit: 4
Key Parameters (Job)
- restartPolicy: Must be OnFailure or Never for Jobs (Always is not allowed).
- completions: The number of successful completions required.
- parallelism: The number of Pods to run in parallel.
- backoffLimit: The number of retries before considering the Job as failed (default is 6).
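As a minimal sketch of how these parameters combine, the manifest below asks for five successful completions with at most two Pods running at a time. The Job name batch-worker, the container name, and the busybox command are placeholders chosen for illustration; they are not part of the original example.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-worker            # hypothetical name
spec:
  completions: 5                # the Job is complete after 5 Pods succeed
  parallelism: 2                # run at most 2 Pods at the same time
  backoffLimit: 4               # mark the Job as failed after 4 retries
  template:
    spec:
      containers:
      - name: worker            # hypothetical container
        image: busybox:1.28
        command: ["sh", "-c", "echo processing one work item"]
      restartPolicy: Never      # Always is rejected for Jobs
```

When completions is omitted it defaults to 1, so a plain Job runs a single Pod to success.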
CronJob
A CronJob creates Jobs on a repeating schedule based on the Linux cron format.
Main Use Cases (CronJob)
- Periodic backups
- Periodic email delivery
- Nightly data cleansing
Manifest Example (CronJob)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: hello
            image: busybox:1.28
            imagePullPolicy: IfNotPresent
            command:
            - /bin/sh
            - -c
            - date; echo Hello from the Kubernetes cluster
          restartPolicy: OnFailure
Key Parameters (CronJob)
- schedule: Schedule in Cron format (e.g., "0 0 * * *").
- concurrencyPolicy:
  - Allow: Allows concurrent runs (default).
  - Forbid: Skips if the previous job hasn't finished.
  - Replace: Cancels the previous job and starts a new one.
- successfulJobsHistoryLimit: The number of successful finished jobs to retain (default 3).
- failedJobsHistoryLimit: The number of failed finished jobs to retain (default 1).
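The following sketch shows where these fields sit in a CronJob manifest. The name nightly-report and the container command are hypothetical, and the history limits are simply the defaults written out explicitly.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report              # hypothetical name
spec:
  # Cron fields, left to right: minute hour day-of-month month day-of-week
  schedule: "0 0 * * *"             # every day at 00:00
  concurrencyPolicy: Forbid         # skip a run while the previous Job is still active
  successfulJobsHistoryLimit: 3     # keep the 3 most recent successful Jobs
  failedJobsHistoryLimit: 1         # keep the most recent failed Job
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: report            # hypothetical container
            image: busybox:1.28
            command: ["sh", "-c", "date; echo generating report"]
          restartPolicy: OnFailure
```

Note that these settings belong to the CronJob spec itself, not to the jobTemplate beneath it.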
Best Practices
1. Timeout and Cleanup Settings
Configure settings to prevent Jobs from running indefinitely and to ensure finished Pods do not consume resources.
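- activeDeadlineSeconds: Maximum duration of the Job in seconds, counted from its start time. Once exceeded, all running Pods are terminated and the Job is marked Failed with reason DeadlineExceeded, regardless of how many retries remain.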
- ttlSecondsAfterFinished: Automatically deletes the Job resource (and associated Pods) after a specified number of seconds has passed since completion (TTL Controller).
spec:
  ttlSecondsAfterFinished: 100 # Delete 100 seconds after completion
  activeDeadlineSeconds: 1800 # Timeout after 30 minutes
2. Ensure Idempotency
In rare cases a Job can run the same work more than once, for example when a Pod is retried after a network or node failure. Design the application to be idempotent, so that the result is the same even if the same process is executed multiple times.
3. Resource Limits
Batch processes can be resource-intensive. Always set resources.requests and resources.limits to prevent impact on other applications.
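A minimal sketch of such a setting on a Job's Pod template is shown below. The container name, image, and the CPU and memory figures are illustrative assumptions only; real values should come from measuring the workload.

```yaml
spec:
  template:
    spec:
      containers:
      - name: batch                 # hypothetical container
        image: busybox:1.28
        command: ["sh", "-c", "echo running batch"]
        resources:
          requests:                 # what the scheduler reserves for the Pod
            cpu: "500m"
            memory: "256Mi"
          limits:                   # hard ceiling enforced at runtime
            cpu: "1"
            memory: "512Mi"
      restartPolicy: Never
```

A container that exceeds its memory limit is killed (OOMKilled), which counts as a failure toward backoffLimit, so the memory limit should leave headroom for peak usage.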
4. Consider Concurrency Policy (CronJob)
If the job execution time might exceed the schedule interval, carefully select the concurrencyPolicy. Forbid is recommended if data consistency is critical.
5. Handling Failures
Set backoffLimit to a value that fits the workload so that a failing Job does not keep retrying longer than is useful. Also make sure the application writes meaningful logs so that the cause of a failure can be investigated.
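One way to keep failures easy to investigate is sketched below. With restartPolicy: Never, each retry runs in a new Pod and the failed Pods are left in place, so their logs can still be read. With OnFailure, the Pod is terminated once the backoff limit is reached, which makes debugging harder. The Job name import-data and the deliberately failing command are hypothetical.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: import-data             # hypothetical name
spec:
  backoffLimit: 3               # give up after 3 retries
  template:
    spec:
      containers:
      - name: import            # hypothetical container
        image: busybox:1.28
        # Exits non-zero on purpose to demonstrate the retry behavior
        command: ["sh", "-c", "echo starting import; exit 1"]
      restartPolicy: Never      # failed Pods are kept for log inspection
```

After the Job fails, kubectl describe job import-data shows the failure events, and kubectl logs on each failed Pod shows the application output.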
6. Setting startingDeadlineSeconds (Important)
.spec.startingDeadlineSeconds is the number of seconds to allow a Job to start after its scheduled time, but it is also important for preventing CronJob suspension.
By design, the CronJob controller counts how many scheduled runs were missed since the last scheduled time. If more than 100 runs were missed, it does not start the Job and logs an error. When startingDeadlineSeconds is not set, a long period of suspend: true or an extended outage can exceed this limit, after which the CronJob stops generating Jobs and may need to be recreated.
Setting this value limits the failure count check to that specific time window, preventing unintended CronJob suspension.
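As a sketch, assuming a CronJob that runs every minute, a deadline of 200 seconds means the controller only counts runs missed within the last 200 seconds. At one run per minute that is at most about three missed runs, far below the limit of 100. The value 200 is an illustrative choice.

```yaml
spec:
  schedule: "*/1 * * * *"
  startingDeadlineSeconds: 200  # only runs missed within the last 200 seconds are counted
```

Very small values are best avoided: the controller checks CronJobs roughly every 10 seconds, so a deadline shorter than that may prevent the Job from being started at all.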
7. Time Zone Specification
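By default, a CronJob schedule is interpreted in the time zone of the kube-controller-manager, which is usually UTC on managed clusters. To run in a specific time zone (e.g., JST), you can use the .spec.timeZone field, which is stable in Kubernetes 1.27+ (AKS 1.27+).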
spec:
  timeZone: "Asia/Tokyo"
  schedule: "0 0 * * *" # Runs at 0:00 JST
8. Avoiding Spot VMs (Preemptible Nodes)
Long-running critical batch processes should avoid running on Spot VMs (Spot Node Pools) even in clusters that use them for cost savings. Use nodeAffinity to ensure they run on regular (On-Demand) nodes.
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.azure.com/scalesetpriority
                operator: NotIn
                values:
                - spot