Under Cluster Composition » Instance group configuration, select Instance fleets.
Under Network, select the VPC that you deployed using the CloudFormation template earlier in the workshop (or the default VPC if you’re running the workshop in an AWS event), and select all subnets in the VPC.
We recommend that you provide a list of subnets (Availability Zones) and instance types in instance fleets. Amazon EMR then automatically selects one optimal subnet (AZ) based on the cost and availability of the instance types.
The master node does not typically have large computational requirements. For clusters with a large number of nodes, or for clusters with applications that are specifically deployed on the master node (JupyterHub, Hue, etc.), a larger master node may be required and can help improve cluster performance. For example, consider using a General Purpose m5.xlarge instance for small clusters (50 or fewer nodes), and increasing to a larger instance type for larger clusters.
You may experience insufficient capacity when using On-Demand Instances with the allocation strategy for instance fleets. We recommend specifying a larger number of instance types for On-Demand Instances as well, to diversify and reduce the chance of running into insufficient capacity.
Under Node type » Master, click Add / remove instance types to fleet and select General Purpose instance types, for example m4.xlarge, m5.xlarge, m5a.xlarge, and m5d.xlarge. EMR will provision only one instance, but will select the cheapest On-Demand instance type for the Master node from the given instance types.
Unless your cluster is very short-lived and the runs are cost-driven, avoid running your Master node on a Spot Instance, because a Spot interruption on the Master node terminates the entire cluster. If you want to dive deeper into when to use On-Demand and Spot in your EMR clusters, click here.
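For readers who script cluster creation instead of using the console, the Master fleet choices above can be sketched as a plain Python dictionary in the shape that boto3's `run_job_flow` accepts under `Instances["InstanceFleets"]`. This is a minimal sketch, not the workshop's own automation: the fleet name and the helper function are hypothetical, and the `lowest-price` On-Demand allocation strategy is an assumption chosen to mirror the "cheapest On-Demand instance type" behavior described above.

```python
# Instance types taken from the Master fleet step above.
MASTER_INSTANCE_TYPES = ["m4.xlarge", "m5.xlarge", "m5a.xlarge", "m5d.xlarge"]


def build_master_fleet(instance_types):
    """Return a Master fleet definition (hypothetical helper, illustration only)."""
    return {
        "Name": "master-fleet",  # hypothetical name
        "InstanceFleetType": "MASTER",
        # One On-Demand instance and no Spot, as recommended for the Master node.
        "TargetOnDemandCapacity": 1,
        "TargetSpotCapacity": 0,
        "InstanceTypeConfigs": [{"InstanceType": t} for t in instance_types],
        "LaunchSpecifications": {
            # Assumption: pick the lowest-priced On-Demand type from the list.
            "OnDemandSpecification": {"AllocationStrategy": "lowest-price"}
        },
    }


master_fleet = build_master_fleet(MASTER_INSTANCE_TYPES)
print(master_fleet["InstanceFleetType"], len(master_fleet["InstanceTypeConfigs"]))
```

Building the dictionary makes no AWS calls, so it can be inspected or unit-tested before it is passed to a real cluster launch.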
When using EMR instance fleets, one core node is mandatory. Since we don’t use HDFS in this workshop, you will auto-scale the task fleet and keep only the one mandatory core node, using On-Demand Instances. Specify 4 On-Demand units to allow the single core node to run one executor and the YARN application master.
Under Node type » Core, click Add / remove instance types to fleet and select five instance types that you noted before as suitable to run an executor (given the 18G executor size), for example:
Core nodes process data and store information using HDFS, so terminating a core instance risks data loss. The YARN application master runs on one of the core nodes. For Spark applications, the Spark driver runs in the YARN application master container hosted on the core node. The Spark driver is a single point of failure in Spark applications: if the driver dies, all other linked components are discarded as well.
Task nodes run only Spark executors and no HDFS DataNodes, so they are a great fit for scaling out and increasing parallelism to achieve faster execution times.
Under Node type » Task, click Add / remove instance types to fleet and select up to 15 instance types you noted before as suitable for your executor size. Since the executor size is 4 vCores, let’s specify 32 Spot units in order to run 8 executors to start with.
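The unit arithmetic in the task fleet step can be checked with a few lines of Python. This sketch assumes that one fleet unit corresponds to one vCore, which is what the 32 units for 8 executors of 4 vCores implies. The function names are hypothetical.

```python
EXECUTOR_VCORES = 4  # executor size stated in the step above


def executors_for_units(units, executor_vcores=EXECUTOR_VCORES):
    """Whole executors that fit in `units`, assuming 1 fleet unit == 1 vCore."""
    return units // executor_vcores


def units_for_executors(executors, executor_vcores=EXECUTOR_VCORES):
    """Fleet units needed for `executors`, under the same assumption."""
    return executors * executor_vcores


print(executors_for_units(32))  # 8
print(units_for_executors(8))   # 32
```

The same helper can be used to size a different starting point: for example, `units_for_executors(12)` gives the Spot units to request for 12 executors of this size.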
While you can always manually adjust the number of core or task nodes (EC2 instances) in your Amazon EMR cluster, you can also use the power of EMR auto-scaling to automatically adjust the cluster size in response to changing workloads without any manual intervention.
Let’s enable scaling for this cluster using Amazon EMR Managed Scaling. With EMR Managed Scaling, you specify the minimum and maximum compute limits for your cluster, and Amazon EMR automatically resizes the cluster for best performance and resource utilization. EMR Managed Scaling constantly monitors key workload metrics and optimizes the cluster size accordingly.
EMR Managed Scaling is supported for Apache Spark, Apache Hive and YARN-based workloads on Amazon EMR versions 5.30.1 and above.
Managed Scaling can now also avoid scaling down instances that store intermediate shuffle data for Apache Spark. Scaling down clusters without removing the instances that store intermediate shuffle data prevents job re-attempts and re-computations, which leads to better performance and lower cost. Click here for more details.
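The minimum and maximum compute limits described above can be sketched as a Managed Scaling policy dictionary in the shape boto3 accepts for `put_managed_scaling_policy` (or the `ManagedScalingPolicy` argument of `run_job_flow`). This is an illustration under stated assumptions, not the workshop's prescribed values: the 4 core units and 32 task units come from the earlier steps, treating their sum as the minimum is an assumption, and the maximum of 68 units is purely hypothetical. The helper name is also hypothetical.

```python
CORE_ON_DEMAND_UNITS = 4  # from the core fleet step
TASK_SPOT_UNITS = 32      # from the task fleet step
MAX_UNITS = 68            # hypothetical ceiling, chosen only for illustration


def build_managed_scaling_policy(min_units, max_units,
                                 max_on_demand_units, max_core_units):
    """Return a Managed Scaling policy for instance fleets (illustrative helper)."""
    if min_units > max_units:
        raise ValueError("min_units must not exceed max_units")
    return {
        "ComputeLimits": {
            "UnitType": "InstanceFleetUnits",
            "MinimumCapacityUnits": min_units,
            "MaximumCapacityUnits": max_units,
            # Cap On-Demand and core capacity at the single core node,
            # so that all scaling happens on the Spot task fleet.
            "MaximumOnDemandCapacityUnits": max_on_demand_units,
            "MaximumCoreCapacityUnits": max_core_units,
        }
    }


policy = build_managed_scaling_policy(
    min_units=CORE_ON_DEMAND_UNITS + TASK_SPOT_UNITS,  # assumption: start size is the floor
    max_units=MAX_UNITS,
    max_on_demand_units=CORE_ON_DEMAND_UNITS,
    max_core_units=CORE_ON_DEMAND_UNITS,
)
print(policy["ComputeLimits"])
```

Keeping the core and On-Demand limits equal to the core fleet size reflects the design choice in this workshop of holding one On-Demand core node fixed and letting only the task fleet grow and shrink.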
Click Next to continue to the next steps of launching your EMR cluster.