We’re now starting to adopt Spot Instances for our EMR clusters. We’re still not sure that our jobs are fully resilient, or what would actually happen if some of the EC2 Spot Instances in our EMR clusters were interrupted when EC2 needs the capacity back for On-Demand.
In most cases, when running fault-tolerant workloads, we don’t really need to track Spot interruptions, because our applications should be built to handle them gracefully without any impact on performance or availability. However, while we’re first starting to run our EMR jobs on Spot Instances, tracking them can be useful: if Spot Instances were interrupted while a Spark job was running, our organization can use the interruption records to explain EMR job failures or prolonged execution times.
Let’s set up CloudWatch Logs to log Spot interruptions, so that if our EMR applications fail, we can check whether the failures correlate with a Spot interruption.
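EC2 publishes Spot interruption notices as EventBridge events, so a rule that forwards them to CloudWatch Logs usually matches on an event pattern like the one below. This is a minimal sketch for illustration, and the template may define its rule differently. The file name spot-interruption-pattern.json is a hypothetical name chosen for this example.

```shell
# Write an EventBridge event pattern that matches Spot interruption notices.
# The file name is hypothetical; the source and detail-type values are the
# ones EC2 uses for the two-minute Spot interruption warning.
cat > spot-interruption-pattern.json <<'EOF'
{
  "source": ["aws.ec2"],
  "detail-type": ["EC2 Spot Instance Interruption Warning"]
}
EOF

# Print the pattern so it can be reviewed.
cat spot-interruption-pattern.json
```

Matching on both the source and the detail-type keeps the rule narrow, so the log group receives only interruption warnings and not other EC2 events.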
We’ve created a CloudFormation template that includes all the resources you need to track EC2 Spot Interruptions. The stack creates the following:
You can view the CloudFormation template (cloudwatchlogs.yaml) at GitHub here. To download it, you can run the following command:
After downloading the CloudFormation template, run the following commands in a terminal:
aws cloudformation create-stack --stack-name track-spot-interruption --template-body file://cloudwatchlogs.yaml --capabilities CAPABILITY_NAMED_IAM
aws cloudformation wait stack-create-complete --stack-name track-spot-interruption
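If the wait command exits with an error, the stack status can show whether creation failed or was rolled back. The helper below is a minimal sketch. The function name stack_status is a hypothetical name for this example, and the function assumes the AWS CLI is configured with credentials for the account and Region where the stack was created.

```shell
# Hypothetical helper: print the current status of a CloudFormation stack.
# Defaults to the stack name used in this walkthrough.
stack_status() {
  aws cloudformation describe-stacks \
    --stack-name "${1:-track-spot-interruption}" \
    --query 'Stacks[0].StackStatus' \
    --output text
}

# Usage (requires AWS credentials):
#   stack_status
```

A status of CREATE_COMPLETE means the resources are ready. A status such as ROLLBACK_COMPLETE means creation failed, and the stack events in the CloudFormation console show the reason.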
You should see an event rule in the Amazon EventBridge console, like this: