Scale a Cluster with CA

Visualizing Cluster Autoscaler Logs and Actions

During this section, we recommend arranging your windows so that you can see both the Cloud9 console and kube-ops-view, and starting a new terminal in Cloud9 to tail the Cluster Autoscaler logs. This will help you visualize the effect of your scaling commands.

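If kube-ops-view is exposed through a LoadBalancer service named kube-ops-view (an assumption based on how it is typically deployed in this workshop), one way to print its URL is:

kubectl get svc kube-ops-view | tail -n 1 | awk '{ print "Kube-ops-view URL = http://"$4 }'

Assuming Cluster Autoscaler runs as a deployment named cluster-autoscaler in the kube-system namespace, you can tail its logs with:

kubectl logs -f deployment/cluster-autoscaler -n kube-system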

Scaling down to 0

Before we scale up our cluster, let's explore what happens when we set the deployment to 0 replicas. Execute the following command:

kubectl scale deployment/monte-carlo-pi-service --replicas=0

Question: Can you predict what the result of scaling down to 0 replicas will be?

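If you'd rather observe than predict: watch the pods terminate, and keep an eye on the node list and the Cluster Autoscaler logs. Note that, by default, Cluster Autoscaler waits about 10 minutes before it considers an unneeded node for removal, so any scale-down will not be immediate.

kubectl get pods
kubectl get nodes --watch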

Scale our Deployment

OK, let's now scale the deployment out to 3 replicas:

kubectl scale deployment/monte-carlo-pi-service --replicas=3

You can confirm the state of the pods using:

kubectl get pods --watch
NAME                                     READY   STATUS    RESTARTS   AGE
monte-carlo-pi-service-584f6ddff-fk2nj   1/1     Running   0          20m21s
monte-carlo-pi-service-584f6ddff-fs9x6   1/1     Running   0          103s
monte-carlo-pi-service-584f6ddff-jst55   1/1     Running   0          103s

You should also be able to visualize the scaling action using kube-ops-view. Kube-ops-view provides an option to highlight pods matching a regular expression; all pods shown in green are monte-carlo-pi-service pods.

Scaling up to 10 replicas
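To trigger the scale-up, increase the deployment to 10 replicas, following the same pattern as before:

kubectl scale deployment/monte-carlo-pi-service --replicas=10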

Given that we started with 0 capacity in both EC2 Spot nodegroups, this should trigger a scaling event for Cluster Autoscaler. Can you predict which size (and type!) of node will be provisioned?
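You can follow the decision in the Cluster Autoscaler log terminal you opened earlier, and watch the new node register with:

kubectl get nodes --watch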

Challenge

Try to answer the following questions:

  • Can you predict what should happen if we increase the number of replicas to 13?
  • How would you scale the replicas up to 13?
  • If you are expecting a new node, which size will it be: (a) 4 vCPUs and 16GB RAM, or (b) 8 vCPUs and 32GB RAM?
  • Which AWS instance type would you expect to be selected?
  • How would you confirm your predictions?
  • Would you consider Cluster Autoscaler's selection of nodes optimal?
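One way to check your predictions empirically is to scale up, watch where the pods land, and inspect the instance types of the nodes that appear (the node.kubernetes.io/instance-type label is standard on recent Kubernetes versions; on older clusters the label is beta.kubernetes.io/instance-type):

kubectl scale deployment/monte-carlo-pi-service --replicas=13
kubectl get pods -o wide --watch
kubectl get nodes -L node.kubernetes.io/instance-type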

After you've completed the exercise, scale your replicas back down in preparation for configuring the Horizontal Pod Autoscaler.

kubectl scale deployment/monte-carlo-pi-service --replicas=3

Optional Exercises

Some of these exercises will take time for Cluster Autoscaler to scale up and down. If you are running this workshop at an AWS event or with limited time, we recommend coming back to this section once you have completed the rest of the workshop, and before getting into the cleanup section.

  • What will happen if you modify the Cluster Autoscaler expander configuration from random to least-waste (see the sketch after this list)? What happens when you increase the replicas back to 13? What happens if you increase the number of replicas to 20? Can you predict which nodegroup will be expanded in each case: (a) 4 vCPUs 16GB RAM or (b) 8 vCPUs 32GB RAM? What do the Cluster Autoscaler logs look like in each case?

  • How would you expect Cluster Autoscaler to scale in the cluster? How about scaling out? How much time would you expect each to take?

  • How will pods be removed when scaling down? From which nodes will they be removed? What is the effect of adding a Pod Disruption Budget to the mix?

  • At the moment, the AWS Auto Scaling groups backing the nodegroups are set up to use the lowest-price allocation strategy, using the 4 cheapest pools in each AZ. Can you think of a different allocation strategy that would help reduce the frequency of interruptions on EC2 Spot nodes? What would be the pros and cons of using a different allocation strategy on a front-end production system?

  • Scheduling in Kubernetes is the process of binding pending pods to nodes, and is performed by a component called kube-scheduler. When running on Spot, the cluster is expected to be dynamic: the state changes over time, and the original scheduling decision may no longer be adequate once it does. Can you think of, or research, a project that could help address this? (hint) If so, apply the solution and see what the impact is on scale-in operations.

  • During the workshop, we used nodegroups that span multiple AZs. There are scenarios where this might create issues. Can you think of which scenarios? (hint) Can you think of ways of mitigating the risk in those scenarios? (hint 1, hint 2)
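For the first optional exercise, here is a minimal sketch of how you might switch the expander, assuming Cluster Autoscaler runs as a deployment named cluster-autoscaler in the kube-system namespace and was started with the --expander=random flag:

kubectl -n kube-system edit deployment cluster-autoscaler

In the container's command arguments, change --expander=random to --expander=least-waste and save. This rolls out a new Cluster Autoscaler pod, which you can confirm with:

kubectl -n kube-system rollout status deployment cluster-autoscaler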