EMR

Summary

Amazon EMR (formerly Amazon Elastic MapReduce) is a managed platform for running MapReduce and other big data workloads. It supports open-source frameworks like Apache Spark, projects like Hadoop, and SQL tools like Presto.

EMR runs on EC2 instances or on EKS clusters, and also has a serverless option. On EC2, EMR is available for a wide variety of instance types, which allows workloads to be tuned closely, for example by choosing a compute-optimized or a memory-optimized instance type depending on whether the workload is Spark or Hive.

To see which EC2 instances are available for EMR, you can add the On EMR and EMR Cost columns on ec2instances.info.

Pricing Dimensions

EMR is billed differently based on the underlying compute service.

Dimension	Description
Running on EC2	EMR is billed as an additional hourly charge on top of the EC2 instance price. For example, an m6g.16xlarge has an EMR cost of ~$0.60 per hour.
Running on EKS	Billed on two dimensions: vCPUs and GiB of memory, with a minimum charge of one minute.
Serverless	Serverless has Compute, Memory, and Storage dimensions.
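
As a rough illustration of the EC2 and EKS rows above, the sketch below estimates a bill from the dimensions listed in the table. The function names and every rate passed in are hypothetical placeholders, not published AWS prices (only the ~$0.60 EMR charge comes from the table), and the EKS function assumes usage is metered per second once the one-minute minimum is met.

```python
def emr_on_ec2_cost(ec2_hourly, emr_hourly, instances, hours):
    """Estimate EMR on EC2 cost.

    Each instance pays the EC2 price plus the EMR charge for every hour it runs.
    """
    return (ec2_hourly + emr_hourly) * instances * hours


def emr_on_eks_cost(vcpus, memory_gib, seconds, vcpu_hourly, gib_hourly):
    """Estimate EMR on EKS cost from vCPU and memory usage.

    Assumption: runtime is metered per second, with a one-minute minimum.
    """
    billed_seconds = max(seconds, 60)  # apply the one-minute minimum charge
    billed_hours = billed_seconds / 3600
    return billed_hours * (vcpus * vcpu_hourly + memory_gib * gib_hourly)


if __name__ == "__main__":
    # Hypothetical EC2 price of $2.50/hour; the $0.60 EMR charge is from the table.
    cluster = emr_on_ec2_cost(ec2_hourly=2.50, emr_hourly=0.60, instances=10, hours=24)
    print(f"EMR on EC2, 10 instances for 24 hours: ${cluster:.2f}")

    # Hypothetical per-vCPU and per-GiB hourly rates for a 30-second job.
    job = emr_on_eks_cost(vcpus=4, memory_gib=16, seconds=30,
                          vcpu_hourly=0.01, gib_hourly=0.001)
    print(f"EMR on EKS, 30-second job: ${job:.6f}")
```

Separating the EC2 price from the EMR charge makes it easy to see how much of a cluster's bill is the EMR surcharge rather than the underlying compute.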

EMR Optimization

Every EMR instance type above can also be run as a Spot Instance, which is likely to be appropriate for fault-tolerant workloads on EMR. As of 2023, it is also possible to use instance fleets with the price-capacity-optimized Spot allocation strategy for running EMR workloads.

Lastly, data transfer charges are likely to accumulate as your big data moves through the EMR system. You can dramatically reduce these charges, or even eliminate them, by connecting to EMR using interface VPC endpoints.
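
For the Spot option with the price-capacity-optimized allocation strategy mentioned above, the following is a minimal sketch of what a task fleet definition might look like. The dictionary shape is modeled on the `InstanceFleets` parameter of the boto3 EMR `run_job_flow` call and should be checked against the current API reference before use. The helper name, fleet name, instance types, capacity, and timeout are all illustrative assumptions. The code only builds and prints the configuration; it makes no AWS calls.

```python
import json


def build_spot_task_fleet(instance_types, target_spot_capacity, timeout_minutes=30):
    """Build a Spot task fleet definition (hypothetical helper).

    Listing several instance types gives the allocation strategy more
    capacity pools to choose from.
    """
    return {
        "Name": "task-fleet-spot",  # illustrative name
        "InstanceFleetType": "TASK",
        "TargetSpotCapacity": target_spot_capacity,
        "InstanceTypeConfigs": [
            {"InstanceType": instance_type, "WeightedCapacity": 1}
            for instance_type in instance_types
        ],
        "LaunchSpecifications": {
            "SpotSpecification": {
                "AllocationStrategy": "price-capacity-optimized",
                # Assumed fallback: switch to On-Demand if Spot capacity
                # is not provisioned within the timeout.
                "TimeoutDurationMinutes": timeout_minutes,
                "TimeoutAction": "SWITCH_TO_ON_DEMAND",
            }
        },
    }


if __name__ == "__main__":
    # Instance types here are examples only; choose ones suited to the workload.
    fleet = build_spot_task_fleet(["m5.xlarge", "m5a.xlarge", "m4.xlarge"], 8)
    print(json.dumps(fleet, indent=2))
```

Keeping Spot capacity in a task fleet, rather than in the primary or core fleet, is a common choice for fault-tolerant jobs because losing a task node does not remove cluster state or HDFS data.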

Last updated Aug 22, 2023