AWS Batch combines job scheduling and job execution into one managed service. Example Batch workloads include video rendering, log file ingestion, model training, simulation, and cosmology. Under the hood, Batch provisions EC2 or Fargate instances and executes containerized jobs on them.
Users set a range of vCPUs and memory that are needed to execute the job. You can also choose specific instance types, which can be helpful for cost optimizations. For both EC2 and Fargate jobs it is possible to select on-demand or spot instances. Job execution itself can be managed with scheduling, allocation, and parallelization parameters.
Batch is free!
AWS Batch only consumes the underlying EC2 or Fargate resources, however it does not break these out in the bill. If Batch is pointed at existing EC2 instances, in other words the Batch Compute Environment is
UNMANAGED, the cost of the Batch jobs will be the elapsed time they run on the instance.
To view the cost of a job in a
MANAGED Batch environment, you must inspect the ECS tasks that are associated with it. You can view the cost of AWS batch jobs through ECS in Cost Reports.
Another technique is to add a tag to the Compute Environment that is created to run the Batch job in. This does not allow you to track costs down to the job level.
Where Jobs Should Run#
One consideration is whether Batch is even the right tool for the job, and if so what compute type is preferred. The table below shows various options for running batch workloads on AWS, and the constraints that come with each.
|Lambda||Fargate (Spot)||Fargate||EC2 (Spot)||EC2|
|Job Length||<15 mins||5 - 10 mins||5 - 10 mins||5 - 45 mins||Hours|
|Compute Limits||Lambda Runtime Only||<4 vCPUs, <30 GiB memory, no GPU||<4 vCPUS, <30 GiB mem, no GPU||None||None|
|Startup Time||<1 sec||30 - 90 secs||30 - 90 secs||5 - 15 mins||5 - 15 mins|
|Job is Fault Tolerant||No||Yes||No||Yes||No|
Batch Cost Optimization Tips#
AWS recommends a few techniques to lower costs for Batch jobs:
- The most cost effective allocation strategy for non interrupatable workloads is
BEST_FIT. This strategy is sensitive to capacity constraints however and so an entire workload may have to wait for available machines. To avoid this,
BEST_FIT_PROGRESSIVEtries to find the best instances but falls back to less cost efficient instances that will still complete the job (e.g. have the minimum required number of vCPUs). For Fargate and EC2 Spot workloads,
SPOT_CAPACITY_OPTIMIZEDuses the same auto scaling algorithm as Spot Fleets to get the best price.
- Use smaller containers and image layers. Loading each container consumes compute time for the job. Furthermore, pulling containers across NAT Gateways will rack up data transfer charges. Prefer PrivateLink for pulling containers.
- Use multiple availability zones. All things considered,
BEST_FIT_PROGRESSIVEwill find the cheapest AZ to run the workload in, so do not artificially limit yourself here.
Contribute to this page on GitHub or join the
#cloud-costs-handbook channel in the Vantage Community Slack.