Resilience with Spot Instances

In this section, you will learn about the integrations and approaches available for handling EC2 Spot Instance interruptions.

Handling Spot Interruptions

When EC2 needs capacity back in a specific capacity pool (a combination of an instance type and an Availability Zone), it can start interrupting the Spot Instances running in that pool: it sends a two-minute interruption notice and then terminates the instance. The two-minute interruption notice is delivered via the EC2 instance metadata service as well as Amazon EventBridge.
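As an illustration of the metadata service delivery path, the following is a minimal sketch of a local check for the interruption notice using IMDSv2. The function names `fetch_instance_action` and `seconds_until_action` are hypothetical helpers, not part of the workshop; the sketch assumes the notice body is JSON with `action` and `time` fields, and that a 404 response means no notice has been issued.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timezone

IMDS = "http://169.254.169.254/latest"


def fetch_instance_action(timeout=2.0):
    """Return the raw spot/instance-action body, or None if no notice is pending."""
    # IMDSv2: request a short-lived session token first.
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()

    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # assumption: 404 means no interruption notice yet
            return None
        raise


def seconds_until_action(body, now):
    """Parse a notice body and return (action, seconds left before it happens)."""
    notice = json.loads(body)
    when = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    when = when.replace(tzinfo=timezone.utc)
    return notice["action"], (when - now).total_seconds()
```

A script running on the instance could call `fetch_instance_action()` every few seconds and start draining work as soon as it returns a body.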

Since November 5, 2020, EC2 Spot Instances also receive a notification when they are at elevated risk of interruption, via a signal called EC2 Rebalance Recommendation. This signal can arrive sooner than the two-minute Spot Instance interruption notice, giving you more headroom to react to the upcoming interruption. You can decide to rebalance your workload to new or existing Spot Instances that are not at an elevated risk of interruption.

To help you automate the handling of interruptions, EC2 Auto Scaling has a feature called Capacity Rebalancing. When it is enabled and a Spot Instance in your Auto Scaling group is at an elevated risk of interruption, Auto Scaling proactively attempts to launch a replacement Spot Instance. For the feature to work as expected, it is highly recommended that you configure multiple instance types in your Auto Scaling group and use the capacity-optimized allocation strategy. Auto Scaling then launches the optimal instance type from your selection, based on spare capacity availability at launch time. Once the replacement instance is running and passing health checks, Auto Scaling terminates the Spot Instance at elevated risk of interruption. The termination triggers the configured deregistration delay from the load balancer (if you have a Target Group configured on your Auto Scaling group) and runs any termination lifecycle hooks that are set up (Auto Scaling lifecycle hooks allow you to execute actions before an instance is put in service and/or before it is terminated). You can learn more about Capacity Rebalancing here. The diagram below reflects the timeline of a Capacity Rebalancing activity:

Capacity Rebalancing Diagram

During the Auto Scaling group configuration, we already enabled Capacity Rebalancing, configured multiple instance types and Availability Zones, and selected the capacity-optimized allocation strategy, so we are all set. If you want to review these settings again, inspect the asg.json file or go to the Auto Scaling console and check the Purchase options and instance types section of your Auto Scaling group.

Auto Scaling console
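To make the three settings concrete, here is a minimal sketch of how they could be expressed as a request for the EC2 Auto Scaling `CreateAutoScalingGroup` API. It is not the workshop's asg.json: the helper name `build_asg_request`, the group sizes, and the all-Spot distribution are illustrative assumptions.

```python
def build_asg_request(name, launch_template_name, instance_types, subnet_ids):
    """Build a CreateAutoScalingGroup request with Capacity Rebalancing enabled.

    Sizes and the 0% On-Demand split below are placeholder assumptions.
    """
    return {
        "AutoScalingGroupName": name,
        "MinSize": 2,
        "MaxSize": 6,
        "DesiredCapacity": 4,
        # Proactively replace Spot Instances at elevated risk of interruption.
        "CapacityRebalance": True,
        # One subnet per Availability Zone gives Auto Scaling more capacity pools.
        "VPCZoneIdentifier": ",".join(subnet_ids),
        "MixedInstancesPolicy": {
            "LaunchTemplate": {
                "LaunchTemplateSpecification": {
                    "LaunchTemplateName": launch_template_name,
                    "Version": "$Latest",
                },
                # Multiple instance types to choose from at launch time.
                "Overrides": [{"InstanceType": t} for t in instance_types],
            },
            "InstancesDistribution": {
                "OnDemandPercentageAboveBaseCapacity": 0,
                "SpotAllocationStrategy": "capacity-optimized",
            },
        },
    }


# With boto3 installed and credentials configured, the request could be sent as:
#   boto3.client("autoscaling").create_auto_scaling_group(**request)
```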

Note that when we configured the Application Load Balancer, we set the deregistration delay of the target group to 90 seconds so that deregistration of the instance from the ALB completes within the two-minute notice. This matters because Amazon EC2 cannot always send the rebalance recommendation signal before the two-minute Spot Instance interruption notice; the rebalance recommendation can arrive at the same time as the interruption notice.
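The timing constraint can be sketched as a small guard that builds the target group attribute only when the delay fits inside the notice window. The helper name `deregistration_delay_attributes` is hypothetical; the attribute key is the one used by the Elastic Load Balancing `ModifyTargetGroupAttributes` API.

```python
SPOT_NOTICE_SECONDS = 120  # the two-minute Spot interruption notice


def deregistration_delay_attributes(delay_seconds, notice_seconds=SPOT_NOTICE_SECONDS):
    """Return target group attributes for a delay that fits within the notice."""
    if not 0 <= delay_seconds <= 3600:
        raise ValueError("deregistration delay must be between 0 and 3600 seconds")
    if delay_seconds >= notice_seconds:
        raise ValueError("deregistration delay must be shorter than the notice window")
    return [
        {"Key": "deregistration_delay.timeout_seconds", "Value": str(delay_seconds)}
    ]


# With boto3 installed, the attributes could be applied as:
#   boto3.client("elbv2").modify_target_group_attributes(
#       TargetGroupArn=target_group_arn,  # placeholder, supplied by the caller
#       Attributes=deregistration_delay_attributes(90),
#   )
```

With a 90 second delay, roughly 30 seconds of the notice remain for the instance to shut down after connection draining finishes.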

Knowledge check

How can you increase the resilience of the web application that you deployed in this workshop, when using EC2 Spot Instances?


Challenges

  • By default, a Target Group linked to an Application Load Balancer distributes requests across its registered instances using a Round Robin load balancing algorithm. Is there anything you could do to spread the load more efficiently if your backend instances come from different instance families and may differ slightly in processing power? Take a look at this article.

Optional exercise: Custom Spot interruption notification handling

In this workshop, you deployed a simple stateless web application and used Capacity Rebalancing to handle the Spot Instance lifecycle. Capacity Rebalancing works well for this use case: it automatically brings up replacement capacity when Spot Instances are at an elevated risk of interruption, and it attempts to gracefully finish in-flight requests coming from the Application Load Balancer.

For other EC2 Auto Scaling workloads, such as queue workers or similar batch processes, you may prefer to take action when the EC2 Rebalance Recommendation signal is issued (for example, stop consuming new jobs) but not terminate the instance until in-flight job processing has finished, and/or act only when the two-minute interruption notice arrives. For these cases, you can build your own handling logic using Amazon EventBridge notifications, AWS Lambda, and/or a local script that consumes the EC2 metadata service notifications. As an example, you can find a sample solution here.
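As a rough illustration of such custom logic (and not the sample solution referenced above), the sketch below shows a Lambda handler that maps the two EventBridge signals to different actions for a queue worker. The action names `stop_polling` and `checkpoint_and_release` are hypothetical; a real handler would replace the `print` with whatever mechanism tells the worker to react.

```python
REBALANCE = "EC2 Instance Rebalance Recommendation"
INTERRUPTION = "EC2 Spot Instance Interruption Warning"


def plan_action(event):
    """Map an EventBridge event to a worker action, or None if not relevant."""
    if event.get("source") != "aws.ec2":
        return None
    instance_id = event.get("detail", {}).get("instance-id")
    detail_type = event.get("detail-type")
    if detail_type == REBALANCE:
        # Early warning: stop taking new jobs, let in-flight jobs finish.
        return {"instance_id": instance_id, "action": "stop_polling"}
    if detail_type == INTERRUPTION:
        # Two minutes left: save progress and return unfinished jobs to the queue.
        return {"instance_id": instance_id, "action": "checkpoint_and_release"}
    return None


def lambda_handler(event, context):
    """Lambda entry point; printing stands in for notifying the worker."""
    decision = plan_action(event)
    if decision is not None:
        print(decision)
    return decision
```

Keeping the decision in a pure function like `plan_action` makes the handler easy to test with sample events before wiring it to an EventBridge rule.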

Click here for instructions to deploy a custom EC2 Spot Interruption Handler and simulate an interruption.