Autoscaling: Adapting to Demand
Autoscaling is a dynamic process that automatically adjusts system resources to align with fluctuating workloads. It ensures optimal performance, cost-efficiency, and resource utilization.
Types of Autoscaling
Horizontal Pod Autoscaling (HPA):
Increases or decreases the number of pod replicas in response to changes in CPU utilization, memory usage, or custom metrics.
Ideal for stateless applications where adding more instances can handle increased load.
Example: A web application that experiences traffic spikes during peak hours.
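HPA behavior like this can also be defined declaratively. As a sketch, an autoscaling/v2 HorizontalPodAutoscaler manifest targeting the php-apache Deployment from the hands-on section might look like this (the 50% CPU target and 1–10 replica bounds are example values):

```yaml
# Sketch: declarative HPA (autoscaling/v2) targeting the php-apache
# Deployment used in the hands-on section of this document.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: php-apache
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache
  minReplicas: 1            # example lower bound
  maxReplicas: 10           # example upper bound
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50   # scale when average CPU exceeds 50% of requests
```

Applying this manifest with kubectl apply achieves the same result as the imperative kubectl autoscale command shown later, but keeps the scaling policy under version control.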
Vertical Pod Autoscaling (VPA):
Adjusts the resource requests and limits of individual pods.
Suitable for applications with varying resource requirements within a single pod.
Example: A machine learning model that needs more resources during training than inference.
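Unlike HPA, VPA is not part of core Kubernetes; it is an add-on that installs the VerticalPodAutoscaler CRD. A minimal sketch, assuming the VPA add-on is installed and using a hypothetical Deployment name:

```yaml
# Sketch: VerticalPodAutoscaler (requires the VPA add-on; the CRD
# autoscaling.k8s.io/v1 is not part of core Kubernetes).
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa          # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app            # hypothetical Deployment to autoscale
  updatePolicy:
    updateMode: "Auto"      # VPA may evict pods to apply new resource requests
```

Note that in "Auto" mode the VPA may evict running pods to apply updated requests, which is one reason HPA and VPA are rarely pointed at the same metric on the same workload.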
Cluster Autoscaler:
Manages the number of nodes in a Kubernetes cluster based on pod resource requests.
Ensures efficient use of underlying infrastructure by adding or removing nodes as needed.
Example: A cluster handling batch jobs with unpredictable resource demands.
How Autoscaling Benefits Your Applications
Cost Optimization: Avoids over-provisioning resources during low-demand periods.
Performance Improvement: Ensures optimal performance by scaling resources in response to load changes.
Reliability: Prevents system failures due to resource exhaustion by proactively scaling.
Agility: Enables rapid scaling to meet unexpected demand spikes.
Key Differences Between HPA and VPA
| Feature | Horizontal Pod Autoscaling (HPA) | Vertical Pod Autoscaling (VPA) |
| --- | --- | --- |
| Scaling Dimension | Number of pods | Pod resources (CPU, memory) |
| Best Use Cases | Stateless applications, traffic spikes | Applications with varying resource needs within a pod |
| Impact | Adds or removes entire pods | Adjusts resources within existing pods |
In Conclusion
By effectively utilizing autoscaling, you can build highly responsive and cost-effective applications on Kubernetes. Understanding the different types of autoscaling and their use cases empowers you to make informed decisions for your specific workloads.
Hands On
We create a Deployment specifying the desired number of pod replicas and an HPA that monitors CPU utilization. Set minimum and maximum replica limits. Simulate increased load using a load generator. Observe the HPA scaling up pod count as CPU utilization rises and scaling down as load decreases. This demonstrates the HPA's ability to dynamically adjust resources based on demand.
Code for Deployment and Service
apiVersion: apps/v1
kind: Deployment
metadata:
  name: php-apache
spec:
  selector:
    matchLabels:
      run: php-apache
  template:
    metadata:
      labels:
        run: php-apache
    spec:
      containers:
      - name: php-apache
        image: registry.k8s.io/hpa-example
        ports:
        - containerPort: 80
        resources:
          limits:
            cpu: 500m
          requests:
            cpu: 200m
---
apiVersion: v1
kind: Service
metadata:
  name: php-apache
  labels:
    run: php-apache
spec:
  ports:
  - port: 80
  selector:
    run: php-apache
Imperative way of creating an HPA (note that kubectl autoscale requires a --max flag; 10 is used here as an example upper bound):
kubectl autoscale deployment php-apache --cpu-percent=50 --min=1 --max=10
Command to check the HPA status:
kubectl get hpa
Simulating load and stress testing: generate continuous requests to the pods to induce load and observe the HPA's response in scaling the pod count.
kubectl run -i --tty load-generator --rm --image=busybox:1.28 --restart=Never -- /bin/sh -c "while sleep 0.01; do wget -q -O- http://php-apache; done"
The load increases, which the HPA detects through rising CPU utilization. Watch it react:
kubectl get hpa --watch
The pod count increases in response.
Now terminate the load generator and check whether the pod count decreases.
CPU utilization drops back toward 0%, and after a stabilization period the HPA scales the Deployment back down to the minimum replica count.
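Assuming the resources were created with the names used above, the hands-on environment can be cleaned up as a final step:

```shell
# Remove the hands-on resources (names match the manifests and
# commands above).
kubectl delete hpa php-apache
kubectl delete service php-apache
kubectl delete deployment php-apache
```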