Scale a Cluster with CA

Visualizing Cluster Autoscaler Logs and Actions

During this section, we recommend arranging your windows so that you can see both the Cloud9 console and kube-ops-view, and opening a new Cloud9 terminal to tail the Cluster Autoscaler logs. This will help you visualize the effect of your scaling commands.
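As a rough sketch of how both views could be opened from a Cloud9 terminal, the helpers below assume that Cluster Autoscaler runs as a deployment named cluster-autoscaler in the kube-system namespace, and that kube-ops-view is exposed through a LoadBalancer service named kube-ops-view in the current namespace. These names and the function names are assumptions; adjust them to match your cluster.

```shell
# Assumed names; change them if your installation differs.
CA_NAMESPACE="${CA_NAMESPACE:-kube-system}"
CA_DEPLOYMENT="${CA_DEPLOYMENT:-cluster-autoscaler}"
KOV_SERVICE="${KOV_SERVICE:-kube-ops-view}"

# Follow the Cluster Autoscaler logs, starting from the last 10 lines.
tail_ca_logs() {
  kubectl logs -f "deployment/${CA_DEPLOYMENT}" -n "${CA_NAMESPACE}" --tail=10
}

# Print the kube-ops-view URL built from the service's load balancer hostname.
kube_ops_view_url() {
  local host
  host="$(kubectl get svc "${KOV_SERVICE}" \
    -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')"
  echo "http://${host}"
}
```

Run tail_ca_logs in a dedicated terminal so the log stream stays visible while you issue scaling commands in another one.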


Scaling down to 0

Before we scale up our cluster, let’s explore what happens when we set the deployment to 0 replicas. Execute the following command:

kubectl scale deployment/monte-carlo-pi-service --replicas=0

Question: Can you predict the result of scaling down to 0 replicas?


Scale our ReplicaSet

OK, let’s now scale the deployment out to 3 replicas:

kubectl scale deployment/monte-carlo-pi-service --replicas=3

You can confirm the state of the pods using:

kubectl get pods --watch
NAME                                     READY   STATUS    RESTARTS   AGE
monte-carlo-pi-service-584f6ddff-fk2nj   1/1     Running   0          20m21s
monte-carlo-pi-service-584f6ddff-fs9x6   1/1     Running   0          103s
monte-carlo-pi-service-584f6ddff-jst55   1/1     Running   0          103s

You should also be able to visualize the scaling action using kube-ops-view. Kube-ops-view provides an option to highlight pods matching a regular expression. All pods in green are monte-carlo-pi-service pods.

[Figure: Scaling up to 10 replicas]

Because we started from 0 capacity in both Ec2Spot nodegroups, this scale-out should trigger a scaling event for Cluster Autoscaler. Can you predict which size (and type!) of node will be provisioned?
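One way to check your prediction once the new nodes have joined is to list the nodes together with their instance-type and zone labels. This is a minimal sketch: it assumes the nodes carry the well-known node.kubernetes.io/instance-type and topology.kubernetes.io/zone labels (older clusters may expose beta.kubernetes.io/instance-type instead), and the function name is illustrative.

```shell
# List nodes with extra columns for instance type and availability zone.
# Assumes the well-known labels below are present on the nodes.
list_nodes_with_types() {
  kubectl get nodes \
    --label-columns=node.kubernetes.io/instance-type,topology.kubernetes.io/zone
}
```

Comparing the instance types shown here with the vCPU and memory sizes of each nodegroup tells you which group Cluster Autoscaler expanded.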

Challenge

Try to answer the following questions:

  • Can you predict what should happen if we increase the number of replicas to 13?
  • How would you scale up the replicas to 13?
  • If you are expecting a new node, which size will it be: (a) 4vCPUs 16GB RAM or (b) 8vCPUs 32GB RAM?
  • Which AWS instance type would you expect to be selected?
  • How would you confirm your predictions?
  • Would you consider the selection of nodes by Cluster Autoscaler to be optimal?
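If you want to experiment while working through the questions above, the following sketch wraps the scale command already used in this section and adds a way to list pods that are still waiting for a node. The function names are ours, and the deployment is assumed to live in the current namespace.

```shell
DEPLOYMENT="${DEPLOYMENT:-monte-carlo-pi-service}"

# Scale the deployment to the requested number of replicas.
scale_replicas() {
  kubectl scale "deployment/${DEPLOYMENT}" --replicas="$1"
}

# List pods that have not been scheduled yet. Cluster Autoscaler scales up
# when pods cannot be scheduled on the existing nodes.
list_pending_pods() {
  kubectl get pods --field-selector=status.phase=Pending
}
```

For example, run scale_replicas 13 and then list_pending_pods while watching the Cluster Autoscaler logs in the other terminal.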

After you’ve completed the exercise, scale your replicas back down in preparation for configuring the Horizontal Pod Autoscaler.

kubectl scale deployment/monte-carlo-pi-service --replicas=3

We recommend using capacity-optimized as the allocation strategy for your mixed-instance EC2 Spot nodegroups. Other strategies, such as lowest-price, might still be considered for nodes that only process retriable Kubernetes Jobs.
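To see which allocation strategy your nodegroups currently use, you could query the Auto Scaling groups with the AWS CLI. The sketch below assumes the AWS CLI is installed and configured for the workshop's account and region, and that the nodegroups are backed by Auto Scaling groups with a mixed instances policy. The function name is illustrative.

```shell
# Print each Auto Scaling group name with its Spot allocation strategy.
# Groups without a mixed instances policy show an empty strategy.
show_spot_allocation_strategies() {
  aws autoscaling describe-auto-scaling-groups \
    --query 'AutoScalingGroups[].[AutoScalingGroupName, MixedInstancesPolicy.InstancesDistribution.SpotAllocationStrategy]' \
    --output table
}
```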

Optional Exercises

Some of these exercises will take time, because Cluster Autoscaler needs to scale up and down. If you are running this workshop at an AWS event or with limited time, we recommend coming back to this section once you have completed the workshop, before getting into the cleanup section.

  • At the moment, the AWS Auto Scaling groups backing the nodegroups are set up to use the capacity-optimized allocation strategy. What do you think the trade-off is when you switch to the lowest-price allocation strategy?

  • What happens when you change the Cluster Autoscaler expander configuration from random to least-waste? What happens when we increase the replicas back to 13? What happens if we increase the number of replicas to 20? Can you predict which group of nodes will be expanded in each case: (a) 4vCPUs 16GB RAM or (b) 8vCPUs 32GB RAM? What do the Cluster Autoscaler logs look like in this case?

  • How would you expect Cluster Autoscaler to scale in the cluster? How about scaling out? How long would you expect each to take?

  • How will pods be removed when scaling down? From which nodes will they be removed? What is the effect of adding a Pod Disruption Budget to this mix?

  • Scheduling in Kubernetes is the process of binding pending pods to nodes, and is performed by a component of Kubernetes called kube-scheduler. When running on Spot, the cluster is expected to be dynamic: its state changes over time, and the original scheduling decision may no longer be adequate after the state changes. Can you think of, or research, a project that could help address this? (hint_1, hint_2) If so, apply the solution and see what impact it has on scale-in operations.

  • During the workshop, we used nodegroups that span multiple AZs. There are scenarios where this might create issues. Can you think of which scenarios? (hint) Can you think of ways to mitigate the risk in those scenarios? (hint 1, hint 2)
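Two of the exercises above involve configuration changes: switching the Cluster Autoscaler expander and adding a Pod Disruption Budget. The helpers below are one possible way to make them, not the workshop's official solution. They assume that Cluster Autoscaler runs as the deployment cluster-autoscaler in kube-system with an --expander flag already present in its command, and that the monte-carlo-pi-service pods carry the label app=monte-carlo-pi-service. The function names and the budget name monte-carlo-pi-pdb are hypothetical.

```shell
# Rewrite the --expander flag in a manifest read from stdin.
# Cluster Autoscaler expanders include random, most-pods, and least-waste.
set_expander() {
  sed "s/--expander=[a-z-]*/--expander=$1/"
}

# Create a PodDisruptionBudget that keeps at least $1 pods available.
# The label selector app=monte-carlo-pi-service is an assumption.
create_pdb() {
  kubectl create poddisruptionbudget monte-carlo-pi-pdb \
    --selector=app=monte-carlo-pi-service \
    --min-available="$1"
}

# Example (not run here), under the assumptions stated above:
#   kubectl -n kube-system get deployment cluster-autoscaler -o yaml \
#     | set_expander least-waste | kubectl replace -f -
#   create_pdb 2
```

After either change, repeat the scaling steps and compare the Cluster Autoscaler logs with what you observed earlier.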