Amazon Elastic MapReduce (EMR) is software infrastructure for running MapReduce and other big data workloads. It supports open-source frameworks such as Apache Spark and Apache Hadoop, and SQL tools such as Presto.
EMR runs on top of EC2 instances or EKS clusters, and also has a serverless option. EMR is available on a wide variety of instance types, which allows for tight optimization of workloads, for example choosing between a compute-optimized and a memory-optimized instance depending on whether the workload is Spark or Hive.
To see which EC2 instances are available for EMR, you can add the "On EMR" and "EMR Cost" columns on ec2instances.info.
EMR is billed differently based on the underlying compute service.
| Compute service | How EMR is billed |
|---|---|
| Running on EC2 | EMR is billed as an additional cost per hour on top of the instance price. For example, an m6g.16xlarge has an EMR cost of ~$0.60 per hour. |
| Running on EKS | EMR is billed on two dimensions, vCPUs and GiB of memory, with a minimum charge of 1 minute. |
| Serverless | EMR is billed on three dimensions: compute, memory, and storage. |
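The billing models for EC2 and EKS can be sketched as simple arithmetic. The following is a rough illustration only: the function names are hypothetical, every rate passed in is a placeholder rather than a current AWS price, and the per-second proration in the EKS function is an assumption layered on top of the 1-minute minimum described in the table. The EKS function covers only the EMR charge, not the underlying cluster cost.

```python
def emr_on_ec2_cost(hours, instances, ec2_rate_per_hour, emr_rate_per_hour):
    """Estimate total cost for EMR on EC2.

    EMR is charged as an extra hourly fee on top of the EC2 instance price,
    so the two hourly rates are simply added together.
    """
    return hours * instances * (ec2_rate_per_hour + emr_rate_per_hour)


def emr_on_eks_cost(runtime_seconds, vcpus, memory_gib,
                    vcpu_hour_rate, gib_hour_rate):
    """Estimate the EMR charge for a job on EKS.

    Billing has two dimensions (vCPU and GiB of memory) and a minimum
    charge of 1 minute. Per-second proration is an assumption here.
    """
    billed_seconds = max(runtime_seconds, 60)  # enforce the 1-minute minimum
    billed_hours = billed_seconds / 3600
    return billed_hours * (vcpus * vcpu_hour_rate + memory_gib * gib_hour_rate)


if __name__ == "__main__":
    # ~$0.60/hour is the approximate EMR fee for an m6g.16xlarge from the table;
    # the EC2 rate below is a hypothetical placeholder.
    print(emr_on_ec2_cost(hours=10, instances=4,
                          ec2_rate_per_hour=2.00, emr_rate_per_hour=0.60))
    # Placeholder per-vCPU and per-GiB hourly rates.
    print(emr_on_eks_cost(runtime_seconds=30, vcpus=4, memory_gib=16,
                          vcpu_hour_rate=0.01, gib_hour_rate=0.001))
```

Keeping the EMR fee as a separate parameter makes it easy to see how much of a cluster's bill is the EMR surcharge as opposed to the underlying compute.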
Any of the EC2 instances backing EMR can also be run as a Spot Instance, which is likely to be appropriate for fault-tolerant workloads on EMR. As of 2020, it is also possible to use Spot Fleets with the capacity-optimized allocation strategy for running EMR workloads.
Lastly, moving large volumes of data through EMR is likely to incur data transfer charges. You can dramatically reduce these charges, or even eliminate them, by connecting to EMR through interface VPC endpoints.