Autoscaling: Adapting to Demand
Autoscaling is a dynamic process that automatically adjusts system resources to align with fluctuating workloads. It ensures optimal performance, cost-efficiency, and resource utilization.
Types of Autoscaling
Horizontal Pod Autoscaling (HPA):
Increases or decreases the number of pod replicas in response to changes in CPU utilization, memory usage, or custom metrics.
Ideal for stateless applications where adding more instances can handle increased load.
Example: A web application that experiences traffic spikes during peak hours.
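HPA behavior like this can also be defined declaratively. As a sketch, an autoscaling/v2 HorizontalPodAutoscaler manifest targeting the php-apache Deployment from the hands-on section might look like this (the 50% CPU target and 1–10 replica bounds are example values):

```yaml
# Sketch: declarative HPA (autoscaling/v2) targeting the php-apache
# Deployment used in the hands-on section of this document.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: php-apache
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache
  minReplicas: 1            # example lower bound
  maxReplicas: 10           # example upper bound
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50   # scale when average CPU exceeds 50% of requests
```

Applying this manifest with kubectl apply achieves the same result as the imperative kubectl autoscale command shown later, but keeps the scaling policy under version control.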
Vertical Pod Autoscaling (VPA):
Adjusts the resource requests and limits of individual pods.
Suitable for applications with varying resource requirements within a single pod.
Example: A machine learning model that needs more resources during training than inference.
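Unlike HPA, VPA is not part of core Kubernetes; it is an add-on that installs the VerticalPodAutoscaler CRD. A minimal sketch, assuming the VPA add-on is installed and using a hypothetical Deployment name:

```yaml
# Sketch: VerticalPodAutoscaler (requires the VPA add-on; the CRD
# autoscaling.k8s.io/v1 is not part of core Kubernetes).
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa          # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app            # hypothetical Deployment to autoscale
  updatePolicy:
    updateMode: "Auto"      # VPA may evict pods to apply new resource requests
```

Note that in "Auto" mode the VPA may evict running pods to apply updated requests, which is one reason HPA and VPA are rarely pointed at the same metric on the same workload.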
Cluster Autoscaler:
Manages the number of nodes in a Kubernetes cluster based on pod resource requests.
Ensures efficient use of underlying infrastructure by adding or removing nodes as needed.
Example: A cluster handling batch jobs with unpredictable resource demands.
How Autoscaling Benefits Your Applications
Cost Optimization: Avoids over-provisioning resources during low-demand periods.
Performance Improvement: Ensures optimal performance by scaling resources in response to load changes.
Reliability: Prevents system failures due to resource exhaustion by proactively scaling.
Agility: Enables rapid scaling to meet unexpected demand spikes.
Key Differences Between HPA and VPA
| Feature | Horizontal Pod Autoscaling (HPA) | Vertical Pod Autoscaling (VPA) |
| --- | --- | --- |
| Scaling Dimension | Number of pods | Pod resources (CPU, memory) |
| Best Use Cases | Stateless applications, traffic spikes | Applications with varying resource needs within a pod |
| Impact | Adds or removes entire pods | Adjusts resources within existing pods |
In Conclusion
By effectively utilizing autoscaling, you can build highly responsive and cost-effective applications on Kubernetes. Understanding the different types of autoscaling and their use cases empowers you to make informed decisions for your specific workloads.
Hands On
We create a Deployment specifying the desired number of pod replicas and an HPA that monitors CPU utilization. Set minimum and maximum replica limits. Simulate increased load using a load generator. Observe the HPA scaling up pod count as CPU utilization rises and scaling down as load decreases. This demonstrates the HPA's ability to dynamically adjust resources based on demand.
Code for Deployment and Service
apiVersion: apps/v1
kind: Deployment
metadata:
  name: php-apache
spec:
  selector:
    matchLabels:
      run: php-apache
  template:
    metadata:
      labels:
        run: php-apache
    spec:
      containers:
      - name: php-apache
        image: registry.k8s.io/hpa-example
        ports:
        - containerPort: 80
        resources:
          limits:
            cpu: 500m
          requests:
            cpu: 200m
---
apiVersion: v1
kind: Service
metadata:
  name: php-apache
  labels:
    run: php-apache
spec:
  ports:
  - port: 80
  selector:
    run: php-apache
Imperative way of creating an HPA (note that kubectl autoscale requires a --max flag; 10 is used here as an example upper bound):
kubectl autoscale deployment php-apache --cpu-percent=50 --min=1 --max=10
Command to check the HPA status:
kubectl get hpa
Simulating load and stress testing: generate continuous requests to the pods to induce load and observe the HPA's response in scaling the pod count.
kubectl run -i --tty load-generator --rm --image=busybox:1.28 --restart=Never -- /bin/sh -c "while sleep 0.01; do wget -q -O- http://php-apache; done"
The load increases, which the HPA detects through rising CPU utilization. Watch it react:
kubectl get hpa --watch
The pod count increases in response.
Now terminate the load generator and check whether the pod count decreases.
CPU utilization drops back toward 0%, and after a stabilization period the HPA scales the Deployment back down to the minimum replica count.
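Assuming the resources were created with the names used above, the hands-on environment can be cleaned up as a final step:

```shell
# Remove the hands-on resources (names match the manifests and
# commands above).
kubectl delete hpa php-apache
kubectl delete service php-apache
kubectl delete deployment php-apache
```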