Resilience with Spot Instances

Handling Spot Interruptions

When EC2 needs the capacity back in a specific capacity pool (a combination of an instance type and an Availability Zone), it can start interrupting the Spot Instances running in that pool: it sends a two-minute interruption notice and then terminates the instance. The notice is delivered through the EC2 instance metadata as well as through CloudWatch Events.
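If you want to watch for the notice from inside the instance itself, the following Python sketch polls the instance metadata. It assumes IMDSv2 (a session token obtained with a PUT request) and the `spot/instance-action` metadata path, which answers 404 until a notice has been issued. The function names and timeout values are illustrative only, and the network calls work only when run on an EC2 instance.

```python
import json
import urllib.error
import urllib.request
from typing import Dict, Optional

# IMDSv2 endpoints; reachable only from inside an EC2 instance.
TOKEN_URL = "http://169.254.169.254/latest/api/token"
ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def parse_instance_action(body: str) -> Dict[str, str]:
    """Parse the JSON document returned by spot/instance-action."""
    doc = json.loads(body)
    return {"action": doc["action"], "time": doc["time"]}


def get_instance_action(timeout: float = 2.0) -> Optional[Dict[str, str]]:
    """Return the pending interruption notice, or None if there is none."""
    # Step 1: obtain a short-lived IMDSv2 session token.
    token_req = urllib.request.Request(
        TOKEN_URL,
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()

    # Step 2: read the instance-action document using the token.
    action_req = urllib.request.Request(
        ACTION_URL, headers={"X-aws-ec2-metadata-token": token}
    )
    try:
        with urllib.request.urlopen(action_req, timeout=timeout) as resp:
            return parse_instance_action(resp.read().decode())
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None  # no interruption scheduled yet
        raise
```

A process on the instance could call `get_instance_action()` every few seconds and start its own shutdown logic as soon as it returns a value whose `action` is `terminate`.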

Let’s deploy a Lambda function that catches the CloudWatch event for the EC2 Spot Instance Interruption Warning and automatically detaches the soon-to-be-terminated instance from the EC2 Auto Scaling group. Calling the DetachInstances API achieves two things:

  1. In the API call, you specify whether to keep the Auto Scaling group’s current desired capacity or to decrement it by the number of instances being detached. If you keep the same desired capacity, Auto Scaling immediately launches a replacement instance.

  2. If the Auto Scaling group has a Load Balancer or a Target Group attached (as in this workshop), the instance is deregistered from it. If connection draining is enabled for your Load Balancer (or Target Group), the Auto Scaling group also waits for in-flight requests to complete, up to the configured timeout (set to 120 seconds in this workshop).

    You can learn more in the Detaching EC2 Instances from an Auto Scaling group documentation.
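A minimal sketch of the DetachInstances call, written against the boto3 `autoscaling` client interface, is shown below. The helper names are hypothetical, and the client is passed in so the logic can be exercised with a stub. Setting `ShouldDecrementDesiredCapacity=False` is what keeps the desired capacity unchanged, as described in point 1 above.

```python
from typing import Any, Optional


def find_asg_name(client: Any, instance_id: str) -> Optional[str]:
    """Look up the Auto Scaling group that owns instance_id, if any."""
    resp = client.describe_auto_scaling_instances(InstanceIds=[instance_id])
    instances = resp.get("AutoScalingInstances", [])
    return instances[0]["AutoScalingGroupName"] if instances else None


def detach_keep_capacity(client: Any, instance_id: str) -> Optional[str]:
    """Detach instance_id from its group without lowering desired capacity.

    Returns the group name, or None if the instance is not in a group.
    """
    asg_name = find_asg_name(client, instance_id)
    if asg_name is None:
        return None  # not part of any Auto Scaling group; nothing to do
    # Keeping desired capacity unchanged makes the group launch a
    # replacement; the group also deregisters the instance from any
    # attached load balancer or target group.
    client.detach_instances(
        InstanceIds=[instance_id],
        AutoScalingGroupName=asg_name,
        ShouldDecrementDesiredCapacity=False,
    )
    return asg_name
```

Inside a Lambda function you would pass `boto3.client("autoscaling")` as `client`.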

To save time, we will use a CloudFormation template to deploy the Lambda function that handles EC2 Spot interruptions, along with a CloudWatch Events rule that catches Spot interruption notifications and invokes the Lambda function.
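The rule selects events by their `source` and `detail-type` fields. As a rough illustration, here is a Python sketch of an event pattern for this warning together with a deliberately simplified matcher (exact values, array fields, and nested objects only). The pattern is an assumption about what such a rule contains, not a copy of the workshop template; review the template for the real rule definition.

```python
from typing import Any, Dict

# Assumed pattern shape, as it might appear in a rule's EventPattern.
SPOT_INTERRUPTION_PATTERN: Dict[str, Any] = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Spot Instance Interruption Warning"],
}


def matches(pattern: Dict[str, Any], event: Dict[str, Any]) -> bool:
    """Simplified subset of CloudWatch Events pattern matching.

    A list in the pattern means "the field must equal one of these values";
    a nested dict recurses. Prefix, numeric, and other content filters are
    not modelled.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the corresponding event object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif isinstance(actual, list):
            # Array field: matches if any element is an allowed value.
            if not any(item in expected for item in actual):
                return False
        elif actual not in expected:
            return False
    return True
```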

  1. Take some time to review the CloudFormation template and understand what will be launched. Then, execute the following command to deploy the template:

    aws cloudformation deploy --template-file spot-interruption-handler.yaml --stack-name spotinterruptionhandler --capabilities CAPABILITY_IAM
    
  2. When the CloudFormation deployment completes (in under 2 minutes), open the AWS Lambda console and click the name of the newly deployed function.

  3. Feel free to examine the code in the Inline code editor.
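For orientation while reading the deployed code, here is a hypothetical outline of what such a handler does: read the instance ID from the event’s `detail`, find the owning Auto Scaling group, log, and detach. The hooks `find_group` and `detach` are placeholders standing in for the Auto Scaling API calls, and the log text is modelled on the message shown in step 6 of the test procedure; the function in the template may be structured differently.

```python
import logging
from typing import Any, Callable, Dict, Optional

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)


def make_handler(
    find_group: Callable[[str], Optional[str]],
    detach: Callable[[str, str], None],
) -> Callable[[Dict[str, Any], Any], Dict[str, Any]]:
    """Build a Lambda-style handler(event, context) from two hypothetical hooks."""

    def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
        # The interruption warning carries the instance ID under "detail".
        instance_id = event["detail"]["instance-id"]
        group = find_group(instance_id)
        if group is None:
            logger.info("Instance %s is not in an Auto Scaling group", instance_id)
            return {"detached": False, "instance_id": instance_id}
        logger.info(
            "Instance %s belongs to AutoScaling Group %s. Detaching instance...",
            instance_id,
            group,
        )
        detach(instance_id, group)
        return {"detached": True, "instance_id": instance_id, "group": group}

    return handler
```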

Our infrastructure is now ready to respond to Spot interruptions by detaching Spot Instances from the Auto Scaling group when they receive an interruption notice. We can’t trigger a real EC2 Spot interruption, but we can invoke the Lambda function with a mock CloudWatch event for an EC2 Spot Instance Interruption Warning and see the result.
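If you prefer to script the test instead of using the console, the sketch below builds the same mock event as in step 2 of the following procedure, filling in both instance-ID occurrences from a single value, and invokes the function synchronously through the boto3 `lambda` client’s `invoke` call. The helper names are hypothetical, and the function name is a placeholder: CloudFormation generates the real name, which you can copy from the Lambda console.

```python
import json
from typing import Any, Dict


def build_test_event(instance_id: str, region: str = "eu-west-1",
                     az: str = "eu-west-1b") -> Dict[str, Any]:
    """Return the mock interruption event with instance_id filled in."""
    return {
        "version": "0",
        "id": "92453ca5-5b23-219e-8003-ab7283ca016b",
        "detail-type": "EC2 Spot Instance Interruption Warning",
        "source": "aws.ec2",
        "account": "123456789012",
        "time": "2019-11-05T11:03:11Z",
        "region": region,
        # Both occurrences of the instance ID come from the same argument.
        "resources": [f"arn:aws:ec2:{az}:instance/{instance_id}"],
        "detail": {"instance-id": instance_id, "instance-action": "terminate"},
    }


def invoke_with_test_event(lambda_client: Any, function_name: str,
                           instance_id: str) -> Any:
    """Invoke the function synchronously and return its decoded JSON result."""
    resp = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",
        Payload=json.dumps(build_test_event(instance_id)).encode(),
    )
    return json.loads(resp["Payload"].read())
```

For example, `invoke_with_test_event(boto3.client("lambda"), "<your-function-name>", "<instance-id>")`, with both placeholders replaced by your own values.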

  1. In the top right corner of the AWS Lambda console, click the Select a test event dropdown menu and choose Configure test events.
  2. With Create a new test event selected, provide an Event name (e.g. TestSpotInterruption). In the event text box, paste the following:

    {
      "version": "0",
      "id": "92453ca5-5b23-219e-8003-ab7283ca016b",
      "detail-type": "EC2 Spot Instance Interruption Warning",
      "source": "aws.ec2",
      "account": "123456789012",
      "time": "2019-11-05T11:03:11Z",
      "region": "eu-west-1",
      "resources": [
        "arn:aws:ec2:eu-west-1b:instance/<instance-id>"
      ],
      "detail": {
        "instance-id": "<instance-id>",
        "instance-action": "terminate"
      }
    }

  3. Replace both occurrences of “<instance-id>” with the instance ID of one of the Spot Instances currently running in your EC2 Auto Scaling group (you can get an instance ID from the Instances tab in the bottom pane of the EC2 Auto Scaling groups console). You don’t need to change any other parameters in the event JSON.

  4. Click Create

  5. With your new test event (e.g. TestSpotInterruption) selected in the dropdown menu, click the Test button.

  6. The execution result should show Succeeded. Expand the details to see the log message: “Instance i-01234567890123456 belongs to AutoScaling Group runningAmazonEC2WorkloadsAtScale. Detaching instance…”

  7. Go back to the EC2 Auto Scaling groups console. Under the Activity History tab in the bottom pane, you should see a Detaching EC2 instance activity, followed shortly by a Launching a new EC2 instance activity.

  8. Go to the EC2 ELB Target Groups console and click the runningAmazonEC2WorkloadsAtScale Target Group. On the Targets tab in the bottom pane, you should see the instance in draining mode.

Great result! By leveraging the EC2 Spot Instance Interruption Warning, the Lambda function detached the instance from the Auto Scaling group and the ELB Target Group, which drains existing connections, and the Auto Scaling group launched a replacement instance before the current instance would be terminated.

In a real scenario, EC2 would terminate the instance after two minutes. In this case, however, we only mocked the interruption, so the EC2 instance keeps running outside the Auto Scaling group. Go to the EC2 console and terminate the instance that you used in the mock event.
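The cleanup can also be done with the boto3 `ec2` client. This short sketch (the helper name is hypothetical) terminates the instance and returns the state reported by the TerminateInstances response, typically `shutting-down`.

```python
from typing import Any


def terminate_mock_instance(ec2_client: Any, instance_id: str) -> str:
    """Terminate the instance used in the mock event; return its new state name."""
    resp = ec2_client.terminate_instances(InstanceIds=[instance_id])
    # One entry per requested instance; we requested exactly one.
    return resp["TerminatingInstances"][0]["CurrentState"]["Name"]
```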

Increasing the application’s resilience when using Spot Instances

In a previous step in this workshop, you learned that the EC2 Auto Scaling group is configured to launch Spot Instances from the 4 lowest-priced instance types (out of a list of 9) in each Availability Zone. Because Spot is spare EC2 capacity, its supply and demand vary. By diversifying across capacity pools (each a combination of an instance type and an Availability Zone), you increase your chances of getting the desired capacity and reduce the number of instances affected when EC2 needs the capacity back for On-Demand and interrupts Spot Instances in a pool.
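A quick calculation shows why more pools help. The sketch below assumes capacity is spread evenly across pools (an idealization of how the Auto Scaling group allocates Spot Instances) and uses hypothetical pool counts; it reports the fraction of capacity lost if the single largest pool is reclaimed.

```python
from typing import Dict, List


def spread(capacity: int, pools: List[str]) -> Dict[str, int]:
    """Distribute capacity across pools as evenly as possible (round-robin)."""
    counts = {pool: 0 for pool in pools}
    for i in range(capacity):
        counts[pools[i % len(pools)]] += 1
    return counts


def worst_case_loss(counts: Dict[str, int]) -> float:
    """Fraction of capacity lost if the largest single pool is reclaimed."""
    total = sum(counts.values())
    return max(counts.values()) / total if total else 0.0
```

With 12 instances in a single pool, one reclamation removes all of them (a loss fraction of 1.0). Spread over 12 pools, for example 4 instance types in each of 3 Availability Zones, reclaiming one pool affects only 1/12 of the capacity. With 10 instances over 4 pools, `spread` yields 3, 3, 2 and 2 instances, so the worst case is 0.3.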

Knowledge check

How can you increase the resilience of the Koel music streaming application that you deployed in this workshop, when using EC2 Spot Instances?


Challenge

What other Spot allocation strategy can you choose? Would it be suitable for this workload? If not, when would you use it?
Hint: read or skim through the following article
