A real-world guide for DevOps engineers who want zero-downtime upgrades and zero post-mortems
The Story That Started This Guide
The alert came in at 9:47 PM. “API server unreachable.”
The on-call engineer opened his laptop to find that the cluster upgrade — which was “just a version bump from 1.27 to 1.28” — had gone completely sideways. The control plane was upgraded. The nodes weren’t. Half the workloads were in an unknown state. Nobody had tested on staging. Nobody checked the deprecated API removals. Nobody had a rollback plan written down.
Three hours later, the service was back. The post-mortem was long and uncomfortable.
Kubernetes upgrades look simple from the outside — just a version number going up. But inside, you’re coordinating a distributed system with multiple components, API deprecation cycles, and live workloads running on top of it the entire time. Skip one step and you’re that team writing that post-mortem.
This guide is the one I wish existed before my first production upgrade.

What You’re Actually Upgrading
Let’s clear this up first. A “Kubernetes upgrade” is not one action. It’s upgrading a stack of interconnected components — each with its own version, compatibility requirement, and failure mode.
| Component | Description | Managed By |
|---|---|---|
| Control Plane | API server, scheduler, controller manager, etcd | You (kubeadm) or cloud provider |
| Worker Nodes | kubelet, kube-proxy on every node | Always you |
| Container Runtime | containerd, CRI-O | Always you |
| CNI Plugin | Calico, Cilium, Flannel | Always you |
| Core Add-ons | CoreDNS, metrics-server | Always you |
| Managed Add-ons | VPC CNI (EKS), cloud-controller-manager | Always you — never auto-upgraded |
| Helm Charts / Manifests | Your own workload definitions | Always you |
Miss even one of these and you’ll spend an hour debugging something that had nothing to do with the upgrade itself.
The Release Cycle — What You Need to Know
Kubernetes ships a new minor version approximately every 4 months. Each minor version is supported for roughly 14 months — after that it’s end-of-life and receives no more security patches or bug fixes.
v1.26 → v1.27 → v1.28 → v1.29 → v1.30 → v1.31...
(each ~4 months apart)
The rule that catches engineers off guard:
You can only upgrade one minor version at a time.
You cannot jump from 1.26 to 1.29. It must be 1.26 → 1.27 → 1.28 → 1.29 — three separate upgrade operations, each with its own pre-checks and validation. If you’ve been skipping upgrades for a year, plan accordingly.
# First thing to do: know where you are right now
kubectl version   # --short was removed in kubectl 1.28; older clients can still add it
# Check individual node versions
kubectl get nodes
# NAME STATUS ROLES VERSION
# control-plane-1 Ready control-plane v1.27.8
# worker-node-1 Ready <none> v1.27.8
# worker-node-2 Ready <none> v1.27.8
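If you are several versions behind, it helps to write the hops down before you start. Here is a minimal sketch that prints them. `upgrade_path` is a hypothetical helper name of mine, and it assumes plain MAJOR.MINOR input such as 1.26.

```bash
# upgrade_path CURRENT TARGET
# Prints every minor version you have to pass through, one hop at a time.
# Assumes both arguments look like "1.26" (no "v" prefix, no patch number).
upgrade_path() {
  major=${1%%.*}
  from=${1#*.}
  to=${2#*.}
  if [ "$from" -gt "$to" ]; then
    echo "error: target $2 is older than current $1" >&2
    return 1
  fi
  path="$major.$from"
  while [ "$from" -lt "$to" ]; do
    from=$((from + 1))
    path="$path -> $major.$from"
  done
  echo "$path"
}

upgrade_path 1.26 1.29
# prints: 1.26 -> 1.27 -> 1.28 -> 1.29
```

Each arrow in that output is a full upgrade operation with its own pre-checks, so three arrows means three maintenance windows, not one.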
Phase 1: Pre-Upgrade Preparation
This is where most upgrade failures originate. Take your time here.
1. Read the Official Release Notes
Before touching anything, read the changelog for your target version:
https://kubernetes.io/releases/notes/
Every release documents: removed APIs, breaking behavioral changes, new features that affect how workloads run, and known issues. Spend 20 minutes here. It can save you 3 hours of debugging later.
2. Scan for Deprecated and Removed APIs with kubent
Every Kubernetes release removes APIs that were deprecated earlier. Workloads using those old APIs — in Helm charts, raw YAML, Operators, or CRDs — fail silently after the upgrade. You won’t always get an obvious error. Things just stop working.
kubent (Kubernetes No Trouble) scans your live cluster and flags exactly what needs to be fixed:
# Install kubent
sh -c "$(curl -sSfL https://git.io/install-kubent)"
# Scan your cluster
kubent
Sample output:
>>> Deprecated APIs removed in 1.25 <<<
KIND NAMESPACE NAME API_VERSION REPLACE WITH
PodSecurityPolicy - restricted policy/v1beta1 (removed)
CronJob production db-backup batch/v1beta1 batch/v1
HPA staging frontend-hpa autoscaling/v2beta1 autoscaling/v2
Fix every flagged resource. Update the API version in your manifests, redeploy on the current version, verify it works — then proceed.
Commonly hit API removals by version (the Deprecated API Migration Guide has the full list):
| Removed In | Old API (Deprecated) | Replacement |
|---|---|---|
| v1.16 | extensions/v1beta1 Deployments | apps/v1 |
| v1.22 | networking.k8s.io/v1beta1 Ingress | networking.k8s.io/v1 |
| v1.25 | policy/v1beta1 PodSecurityPolicy | Pod Security Admission |
| v1.25 | batch/v1beta1 CronJob | batch/v1 |
| v1.25 | autoscaling/v2beta1 HPA | autoscaling/v2 |
| v1.26 | flowcontrol.apiserver.k8s.io/v1beta1 | v1beta3 |
| v1.27 | storage.k8s.io/v1beta1 CSIStorageCapacity | storage.k8s.io/v1 |
| v1.29 | flowcontrol.apiserver.k8s.io/v1beta2 | v1 |
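kubent looks at the live cluster. For manifests that only live in Git, a rough text search can flag the same API versions before they are ever applied. The sketch below is an assumption-heavy shortcut, not a replacement for kubent: `find_removed_apis` is a hypothetical name, it only knows the API versions from the table above, and it will miss templated or quoted values.

```bash
# find_removed_apis DIR
# Searches *.yaml / *.yml files under DIR for apiVersion values listed in the
# removal table and prints "file:line:match" for every hit.
find_removed_apis() {
  # /dev/null is passed as an extra file so grep always prints the file name.
  find "$1" -type f \( -name '*.yaml' -o -name '*.yml' \) -exec grep -nE \
    'apiVersion:[[:space:]]*(extensions/v1beta1|networking\.k8s\.io/v1beta1|policy/v1beta1|batch/v1beta1|autoscaling/v2beta1|flowcontrol\.apiserver\.k8s\.io/v1beta[12]|storage\.k8s\.io/v1beta1)[[:space:]]*$' \
    /dev/null {} +
}

# Demo on a throwaway manifest so the sketch runs without a cluster.
demo_dir=$(mktemp -d)
cat > "$demo_dir/cronjob.yaml" <<'EOF'
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: db-backup
EOF
find_removed_apis "$demo_dir"
rm -rf "$demo_dir"
```

Point it at your manifests directory in CI and treat any output as a reason to stop and fix the file first.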
3. Back Up etcd
etcd holds the entire state of your cluster — every Deployment, Secret, ConfigMap, RBAC policy, CRD, and custom resource. If the control plane upgrade corrupts etcd and you have no backup, your cluster is essentially gone.
# Create a timestamped backup (store the path once so the verify step reads the same file)
SNAPSHOT=/backup/etcd-snapshot-$(date +%Y%m%d-%H%M).db
ETCDCTL_API=3 etcdctl snapshot save "$SNAPSHOT" \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
# ALWAYS verify the backup; never skip this step
ETCDCTL_API=3 etcdctl snapshot status "$SNAPSHOT" \
  --write-out=table
Expected output:
+----------+----------+------------+------------+
| HASH | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| a1b2c3d4 | 198432 | 1589 | 5.4 MB |
+----------+----------+------------+------------+
If total keys are 0 or file size is suspiciously small — something went wrong. Don’t proceed without a verified backup.
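The "suspiciously small" check is easy to forget at the start of a maintenance window, so I like to script it. A minimal sketch, assuming a size floor is a good enough smoke test: `snapshot_looks_sane` is a hypothetical helper, and the 1 MiB default is an arbitrary number you should tune to your own cluster.

```bash
# snapshot_looks_sane FILE [MIN_BYTES]
# Succeeds only if FILE exists and is at least MIN_BYTES long.
# MIN_BYTES defaults to 1048576 (1 MiB), an arbitrary floor.
snapshot_looks_sane() {
  file=$1
  min_bytes=${2:-1048576}
  if [ ! -f "$file" ]; then
    echo "missing snapshot: $file" >&2
    return 1
  fi
  size=$(wc -c < "$file" | tr -d ' ')
  if [ "$size" -lt "$min_bytes" ]; then
    echo "snapshot too small: $size bytes (< $min_bytes)" >&2
    return 1
  fi
  echo "snapshot ok: $size bytes"
}

# Demo with a file that is obviously not a real snapshot.
demo=$(mktemp)
printf 'not a real snapshot' > "$demo"
snapshot_looks_sane "$demo" || echo "do not proceed"
rm -f "$demo"
```

This only catches empty or truncated files. It does not replace the etcdctl snapshot status check, which is what actually reads the key count.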
For managed clusters (GKE, EKS, AKS), the cloud provider manages etcd internally. Still verify that your cluster backup/snapshot policy is active before starting.
4. Confirm All Nodes Are Healthy
kubectl get nodes
# Look for any non-Ready states (compare the STATUS column exactly, since "NotReady" contains "Ready")
kubectl get nodes --no-headers | awk '$2 != "Ready"'
# Check for resource pressure conditions
kubectl describe nodes | grep -E "DiskPressure|MemoryPressure|PIDPressure|NetworkUnavailable"
Every node must show Ready with no pressure conditions. DiskPressure: True on a node means it’s already running out of disk space — draining pods onto it during the upgrade makes it worse. Resolve unhealthy nodes first, always.
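If you want this as a hard gate in a runbook script, here is a minimal sketch. `all_nodes_ready` is a hypothetical helper that reads `kubectl get nodes --no-headers` output on stdin and compares the STATUS column exactly, which avoids the trap that "NotReady" contains the substring "Ready".

```bash
# all_nodes_ready
# Reads `kubectl get nodes --no-headers` rows on stdin.
# Prints every node whose STATUS is not exactly "Ready" and returns non-zero
# if there is at least one such node.
all_nodes_ready() {
  awk '$2 != "Ready" { print $1 " is " $2; bad = 1 } END { exit (bad ? 1 : 0) }'
}

# Demo on sample output; against a cluster you would run:
#   kubectl get nodes --no-headers | all_nodes_ready
all_nodes_ready <<'EOF' || echo "fix these nodes before upgrading"
control-plane-1   Ready                      control-plane   90d   v1.27.8
worker-node-1     NotReady                   <none>          90d   v1.27.8
worker-node-2     Ready,SchedulingDisabled   <none>          90d   v1.27.8
EOF
```

Cordoned nodes show up as well, which is deliberate: a node somebody cordoned last week and forgot about is exactly the kind of surprise you want before the upgrade, not during it.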
5. Review PodDisruptionBudgets
During node upgrades, nodes are drained. The drain process calls the Eviction API, which respects PDBs. If a PDB blocks all evictions, the node drain hangs indefinitely and your upgrade stalls completely.
kubectl get pdb -A
# NAME NAMESPACE MIN AVAILABLE ALLOWED DISRUPTIONS
# payment-service-pdb production 2 0 ← This will block the drain
# frontend-pdb production N/A 1 ← This is fine
For any PDB with ALLOWED DISRUPTIONS: 0, check why — are you already at minimum replicas? Are all pods on the same node about to be drained? Fix the root cause, or temporarily relax:
# Temporarily relax (document this change!)
kubectl patch pdb payment-service-pdb -n production \
-p '{"spec":{"minAvailable":1}}' --type=merge
# Restore immediately after upgrade completes
kubectl patch pdb payment-service-pdb -n production \
-p '{"spec":{"minAvailable":2}}' --type=merge
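On a cluster with dozens of PDBs, eyeballing the table gets old quickly. A minimal sketch for listing only the blockers: `blocking_pdbs` is a hypothetical helper that expects three columns (namespace, name, allowed disruptions) on stdin.

```bash
# blocking_pdbs
# Reads "NAMESPACE NAME ALLOWED" rows on stdin and prints the PDBs that
# currently allow zero disruptions, i.e. the ones that will stall a drain.
blocking_pdbs() {
  awk '$3 == 0 { print $1 "/" $2 }'
}

# Against a cluster, feed it exactly those three columns:
#   kubectl get pdb -A --no-headers \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed \
#     | blocking_pdbs

# Demo on sample rows:
blocking_pdbs <<'EOF'
production   payment-service-pdb   0
production   frontend-pdb          1
EOF
# prints: production/payment-service-pdb
```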
6. Verify Spare Cluster Capacity
When nodes are drained, pods need somewhere to land. If your cluster is at 90%+ utilization, evicted pods will sit Pending and your application gets partial or full downtime.
# Current node resource usage
kubectl top nodes
# Detailed breakdown of allocated vs available
kubectl describe nodes | grep -A8 "Allocated resources"
Rule of thumb: Have at least one full node’s worth of free CPU and memory before starting. For managed clusters, configure surge upgrades — the cloud provider provisions an extra node before draining old ones, meaning capacity never drops during the upgrade.
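To turn the utilization check into something you can paste into a pre-flight script, here is a rough sketch. `busy_nodes` is a hypothetical helper, and it assumes the usual `kubectl top nodes --no-headers` column order (NAME, CPU(cores), CPU%, MEMORY(bytes), MEMORY%). Keep in mind that `kubectl top` reports live usage, while the scheduler places pods by requests, so read the "Allocated resources" output too.

```bash
# busy_nodes THRESHOLD
# Reads `kubectl top nodes --no-headers` rows on stdin and prints nodes whose
# CPU% or MEMORY% is at or above THRESHOLD.
busy_nodes() {
  awk -v limit="$1" '{
    cpu = $3; mem = $5
    sub(/%/, "", cpu); sub(/%/, "", mem)
    if (cpu + 0 >= limit + 0 || mem + 0 >= limit + 0)
      print $1 " cpu=" cpu "% mem=" mem "%"
  }'
}

# Demo on sample rows; against a cluster you would run:
#   kubectl top nodes --no-headers | busy_nodes 90
busy_nodes 90 <<'EOF'
worker-node-1   3600m   91%   12000Mi   70%
worker-node-2   800m    20%   5000Mi    35%
EOF
# prints: worker-node-1 cpu=91% mem=70%
```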
7. Test the Full Upgrade on Staging First
Every single time. I know staging never perfectly mirrors production. Do it anyway.
Run the complete upgrade on staging. Validate all application endpoints. Run your integration tests. Check your dashboards. Catch the broken Helm chart, the removed API, the misconfigured add-on — on staging, where it doesn’t matter. The 45 minutes you spend there is the 3 AM page you won’t get.
8. Notify Stakeholders and Open a Change Ticket
Inform all relevant teams. Set a maintenance window. Have people on standby. Even with zero-downtime strategies, unexpected things happen — and having the right engineers aware and available cuts recovery time dramatically.
Phase 2: The Upgrade — Order, Patience, and Discipline
The Ironclad Rule: Control Plane Before Nodes — Always
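Kubernetes only supports skew in one direction: the control plane must be at the same version as the worker nodes or newer, never older. The official skew policy lets a kubelet lag the API server by up to three minor versions (two for kubelets older than 1.25), but because the control plane moves one minor version per upgrade, aim to be at most one minor version apart, and only while the upgrade is in progress.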
✅ Control plane: v1.28 | Nodes: v1.27 → Valid (transitional state during upgrade)
✅ Control plane: v1.28 | Nodes: v1.28 → Valid (fully upgraded)
❌ Control plane: v1.27 | Nodes: v1.28 → Never — this breaks everything
Always upgrade control plane first. Wait for it to stabilize completely. Then upgrade nodes one at a time.
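The rule is simple enough to encode as a guard at the top of an upgrade script. A minimal sketch, with hypothetical helper names `minor_of` and `skew_ok`: it defaults to a one-minor gap, which is the state a rolling upgrade passes through, and takes the limit as an optional third argument if your policy differs.

```bash
# minor_of VERSION
# Extracts the minor number from "1.28", "v1.28" or "v1.28.3".
minor_of() {
  v=${1#v}         # drop a leading "v"
  v=${v#*.}        # drop "MAJOR."
  echo "${v%%.*}"  # drop ".PATCH" if present
}

# skew_ok CONTROL_PLANE NODE [MAX_SKEW]
# Succeeds when the node is not newer than the control plane and lags it by
# at most MAX_SKEW minor versions (default 1).
skew_ok() {
  cp=$(minor_of "$1")
  node=$(minor_of "$2")
  max=${3:-1}
  gap=$((cp - node))
  [ "$gap" -ge 0 ] && [ "$gap" -le "$max" ]
}

skew_ok v1.28.3 v1.27.8 && echo "1.28 control plane, 1.27 nodes: ok"
skew_ok v1.27.8 v1.28.3 || echo "1.27 control plane, 1.28 nodes: not supported"
```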
Self-Managed Clusters — kubeadm
Step 1: Check the Upgrade Plan
# SSH into the control plane node first
kubeadm upgrade plan
This shows your current version, all available upgrade targets, and any configuration changes required. Read it carefully before proceeding.
Step 2: Upgrade the Control Plane
# SSH into the control plane node
# --- Upgrade kubeadm first ---
apt-mark unhold kubeadm
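# Packages from pkgs.k8s.io are published per minor version, so point the apt source at v1.28 first
apt-get update && apt-get install -y kubeadm='1.28.x-*'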
apt-mark hold kubeadm
# Confirm kubeadm version
kubeadm version
# --- Apply the control plane upgrade ---
# This upgrades: API server, scheduler, controller manager, CoreDNS, kube-proxy
kubeadm upgrade apply v1.28.x
# --- Upgrade kubelet and kubectl on the control plane node ---
apt-mark unhold kubelet kubectl
apt-get update && apt-get install -y kubelet='1.28.x-*' kubectl='1.28.x-*'
apt-mark hold kubelet kubectl
# --- Reload and restart ---
systemctl daemon-reload
systemctl restart kubelet
# --- Verify ---
kubectl get nodes
At this point, the control plane shows v1.28.x. Worker nodes still show v1.27.x. That is correct and expected. Don’t panic — you haven’t touched the workers yet.
Step 3: Upgrade Worker Nodes One at a Time
Do not rush. Complete the full sequence for one node. Verify it. Then move to the next.
# ============================================
# RUN ON YOUR WORKSTATION (not the node)
# ============================================
# 1. Cordon: stop new pods from scheduling on this node
kubectl cordon <node-name>
# 2. Drain: gracefully evict all pods off this node
#    --ignore-daemonsets      DaemonSets are managed separately, so skip them
#    --delete-emptydir-data   also evict pods using emptyDir volumes (their data is lost)
#    --grace-period=60        give apps 60s to shut down gracefully
kubectl drain <node-name> \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=60
# ============================================
# SSH INTO THE WORKER NODE
# ============================================
ssh user@<node-ip>
# 3. Upgrade kubeadm on the node
apt-mark unhold kubeadm
apt-get update && apt-get install -y kubeadm='1.28.x-*'
apt-mark hold kubeadm
# 4. Apply the node configuration from the new control plane
kubeadm upgrade node
# 5. Upgrade kubelet and kubectl
apt-mark unhold kubelet kubectl
apt-get update && apt-get install -y kubelet='1.28.x-*' kubectl='1.28.x-*'
apt-mark hold kubelet kubectl
# 6. Reload and restart kubelet
systemctl daemon-reload
systemctl restart kubelet
# ============================================
# BACK ON YOUR WORKSTATION
# ============================================
# 7. Uncordon: allow pods to schedule here again
kubectl uncordon <node-name>
# 8. WAIT and verify before touching the next node
kubectl get nodes
kubectl get pods -A -o wide | grep <node-name>
Wait until:
- Node shows Ready and v1.28.x
- Pods have rescheduled and are Running on that node
Only then move to the next worker. Rushing through nodes simultaneously is how you accidentally take down your entire application.
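"Wait and verify" is the step people shorten when they are tired, so it is worth scripting. The sketch below is one way to do it under my own assumptions: `wait_for` and `node_is_ready` are hypothetical helpers, and `node_is_ready` relies on the default `kubectl get node` column order (NAME, STATUS, ROLES, AGE, VERSION).

```bash
# wait_for TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or TIMEOUT_SECONDS have elapsed.
wait_for() {
  timeout=$1
  interval=$2
  shift 2
  elapsed=0
  until "$@"; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "timed out after ${elapsed}s: $*" >&2
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}

# node_is_ready NODE VERSION_PREFIX   (needs kubectl and cluster access)
# Succeeds when NODE is exactly "Ready" and its version starts with the prefix.
node_is_ready() {
  kubectl get node "$1" --no-headers 2>/dev/null |
    awk -v want="$2" '$2 == "Ready" && index($5, want) == 1 { ok = 1 } END { exit (ok ? 0 : 1) }'
}

# Against a cluster:
#   wait_for 300 10 node_is_ready worker-node-1 v1.28
# Demo that runs anywhere:
wait_for 5 1 true && echo "condition met"
```

If you only care about readiness and not the version, `kubectl wait --for=condition=Ready node/<node-name> --timeout=300s` does the polling for you.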
Managed Kubernetes
GKE (Google Kubernetes Engine)
# Step 1: Upgrade the control plane
gcloud container clusters upgrade my-cluster \
--master \
--cluster-version 1.28 \
--region us-central1
# Step 2: Configure surge upgrades BEFORE upgrading node pool
# This adds 1 extra node before draining old ones — capacity never drops
gcloud container node-pools update default-pool \
--cluster my-cluster \
--max-surge-upgrade 1 \
--max-unavailable-upgrade 0 \
--region us-central1
# Step 3: Upgrade the node pool
gcloud container clusters upgrade my-cluster \
--node-pool default-pool \
--cluster-version 1.28 \
--region us-central1
AKS Upgrade — Full Flow
Key concepts:
- Control plane and node pools upgrade separately
- Node pools can be max 1 minor version behind control plane
- Always set --max-surge 1 before upgrading for zero capacity loss
Step-by-step:
- Check available versions
az aks get-upgrades --resource-group my-rg --name my-cluster --output table
- Set surge upgrades on node pools first
az aks nodepool update --resource-group my-rg --cluster-name my-cluster --name nodepool1 --max-surge 1
- Upgrade control plane only
az aks upgrade --resource-group my-rg --name my-cluster --kubernetes-version 1.28 --control-plane-only --yes
- Upgrade system node pool first
az aks nodepool upgrade --resource-group my-rg --cluster-name my-cluster --name systempool --kubernetes-version 1.28
- Upgrade user node pools one at a time
az aks nodepool upgrade --resource-group my-rg --cluster-name my-cluster --name apppool --kubernetes-version 1.28
- Verify everything
az aks show --resource-group my-rg --name my-cluster --query "kubernetesVersion"
kubectl get nodes
kubectl get pods -n kube-system
kubectl get pods -A | grep -Ev "Running|Completed"
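One gap in the status filter: a pod can say Running while its containers are not ready (0/1). Here is a minimal sketch that also compares the READY column. `unhealthy_pods` is a hypothetical helper, and it assumes the `kubectl get pods -A --no-headers` column order (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE).

```bash
# unhealthy_pods
# Reads `kubectl get pods -A --no-headers` rows on stdin and prints pods that
# are neither Completed nor fully ready, including Running pods stuck at 0/1.
unhealthy_pods() {
  awk '{
    if ($4 == "Completed") next
    split($3, r, "/")
    if ($4 != "Running" || r[1] != r[2])
      print $1 "/" $2 " " $3 " " $4
  }'
}

# Demo on sample rows; against a cluster you would run:
#   kubectl get pods -A --no-headers | unhealthy_pods
unhealthy_pods <<'EOF'
kube-system   coredns-5d78c9869d-abcde   1/1   Running            0   5m
kube-system   metrics-server-7b4c-xyz    0/1   Running            3   5m
production    db-backup-28391-qwert      0/1   Completed          0   2h
production    api-6f9d-lmnop             0/1   CrashLoopBackOff   7   5m
EOF
```

In the demo, the metrics-server and api pods are printed, while the healthy CoreDNS pod and the finished backup job are skipped.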
Upgrading EKS (Amazon Elastic Kubernetes Service) — Complete Full Flow
How EKS Upgrades Work — Understand This First
AWS manages the EKS control plane — API server, etcd, scheduler, controller manager. You don’t touch those directly. But everything else is your responsibility:
- Worker Nodes (managed or self-managed)
- CoreDNS
- kube-proxy
- VPC CNI (aws-node)
- EBS CSI Driver (if used)
- EFS CSI Driver (if used)
- Any other add-ons
The #1 EKS gotcha that burns teams:
Add-ons — CoreDNS, kube-proxy, VPC CNI — do NOT auto-upgrade when you upgrade the cluster. You must manually update every single one. Skip this and you’ll get intermittent DNS failures, networking issues, and pod communication problems that are extremely confusing to debug because the cluster “looks fine.”
The correct EKS upgrade order:
Pre-checks → Control Plane → Add-ons → Node Groups → Verify
Step 0: Pre-Upgrade Checks (EKS-Specific)
# Check current cluster version
aws eks describe-cluster \
--name my-cluster \
--region us-east-1 \
--query "cluster.version" \
--output text
# Check current node group version
aws eks list-nodegroups \
--cluster-name my-cluster \
--region us-east-1
aws eks describe-nodegroup \
--cluster-name my-cluster \
--nodegroup-name default-ng \
--region us-east-1 \
--query "{Version:nodegroup.version, Status:nodegroup.status, InstanceType:nodegroup.instanceTypes}"
# Check all currently installed add-ons and their versions
aws eks list-addons \
--cluster-name my-cluster \
--region us-east-1
# Get detail on each add-on
for addon in coredns kube-proxy vpc-cni aws-ebs-csi-driver; do
echo "=== $addon ==="
aws eks describe-addon \
--cluster-name my-cluster \
--addon-name $addon \
--region us-east-1 \
--query "{Version:addon.addonVersion, Status:addon.status}" \
--output table 2>/dev/null || echo "Not installed"
done
# Check the cluster's current Kubernetes and platform versions
aws eks describe-cluster \
--name my-cluster \
--region us-east-1 \
--query "cluster.{CurrentVersion:version, PlatformVersion:platformVersion}"
Step 1: Check Compatible Add-on Versions for Target K8s Version
Before upgrading anything, find the correct add-on versions for your target Kubernetes version. Use these version numbers in the upgrade steps below.
# Check available CoreDNS versions for Kubernetes 1.28
aws eks describe-addon-versions \
--addon-name coredns \
--kubernetes-version 1.28 \
--region us-east-1 \
--query "addons[].addonVersions[].{Version:addonVersion, Default:compatibilities[0].defaultVersion}" \
--output table
# Check available kube-proxy versions
aws eks describe-addon-versions \
--addon-name kube-proxy \
--kubernetes-version 1.28 \
--region us-east-1 \
--query "addons[].addonVersions[].{Version:addonVersion, Default:compatibilities[0].defaultVersion}" \
--output table
# Check available VPC CNI versions
aws eks describe-addon-versions \
--addon-name vpc-cni \
--kubernetes-version 1.28 \
--region us-east-1 \
--query "addons[].addonVersions[].{Version:addonVersion, Default:compatibilities[0].defaultVersion}" \
--output table
# Check EBS CSI Driver versions
aws eks describe-addon-versions \
--addon-name aws-ebs-csi-driver \
--kubernetes-version 1.28 \
--region us-east-1 \
--query "addons[].addonVersions[].{Version:addonVersion, Default:compatibilities[0].defaultVersion}" \
--output table
Pick the version marked Default: true from each output — that’s the recommended version for your target Kubernetes version. Note them all down before proceeding.
Step 2: Upgrade the EKS Control Plane
# Trigger the control plane upgrade
aws eks update-cluster-version \
--name my-cluster \
--kubernetes-version 1.28 \
--region us-east-1
# The command returns an update ID — save it
# {
# "update": {
# "id": "abc-123-def-456",
# "status": "InProgress",
# ...
# }
# }
# Track the upgrade using the update ID
aws eks describe-update \
--name my-cluster \
--update-id abc-123-def-456 \
--region us-east-1 \
--query "update.{Status:status, Errors:errors}"
# Or just wait until the cluster is ACTIVE again
aws eks wait cluster-active \
--name my-cluster \
--region us-east-1
# Verify control plane version after upgrade
aws eks describe-cluster \
--name my-cluster \
--region us-east-1 \
--query "cluster.{Version:version, Status:status}" \
--output table
Control plane upgrade takes approximately 10–20 minutes. During this time:
- Your existing workloads continue running normally
- The API server may be briefly unavailable during the upgrade (seconds, not minutes)
- kubectl commands may occasionally return errors; this is expected
Do NOT start upgrading nodes or add-ons until the cluster status shows ACTIVE and version shows 1.28.
Step 3: Upgrade Add-ons (Before Nodes — Always)
This is the step most teams skip. Don’t skip it.
3a: Upgrade CoreDNS
CoreDNS handles all DNS resolution inside your cluster. Running an old CoreDNS against a new API server can cause intermittent DNS failures.
# Upgrade CoreDNS
aws eks update-addon \
--cluster-name my-cluster \
--addon-name coredns \
--addon-version v1.10.1-eksbuild.6 \
--resolve-conflicts OVERWRITE \
--region us-east-1
# Wait for CoreDNS upgrade to complete
aws eks wait addon-active \
--cluster-name my-cluster \
--addon-name coredns \
--region us-east-1
# Verify
aws eks describe-addon \
--cluster-name my-cluster \
--addon-name coredns \
--region us-east-1 \
--query "{Version:addon.addonVersion, Status:addon.status}" \
--output table
# Confirm CoreDNS pods are running
kubectl get pods -n kube-system -l k8s-app=kube-dns
3b: Upgrade kube-proxy
kube-proxy manages network rules on each node. Version mismatch with the API server can cause networking issues.
aws eks update-addon \
--cluster-name my-cluster \
--addon-name kube-proxy \
--addon-version v1.28.6-eksbuild.2 \
--resolve-conflicts OVERWRITE \
--region us-east-1
aws eks wait addon-active \
--cluster-name my-cluster \
--addon-name kube-proxy \
--region us-east-1
# Verify
aws eks describe-addon \
--cluster-name my-cluster \
--addon-name kube-proxy \
--region us-east-1 \
--query "{Version:addon.addonVersion, Status:addon.status}" \
--output table
# Confirm kube-proxy pods are running on all nodes
kubectl get pods -n kube-system -l k8s-app=kube-proxy
3c: Upgrade VPC CNI (aws-node)
VPC CNI manages pod networking and IP allocation on AWS. This is critical — a misconfigured VPC CNI breaks pod-to-pod communication.
aws eks update-addon \
--cluster-name my-cluster \
--addon-name vpc-cni \
--addon-version v1.16.0-eksbuild.1 \
--resolve-conflicts OVERWRITE \
--region us-east-1
aws eks wait addon-active \
--cluster-name my-cluster \
--addon-name vpc-cni \
--region us-east-1
# Verify
aws eks describe-addon \
--cluster-name my-cluster \
--addon-name vpc-cni \
--region us-east-1 \
--query "{Version:addon.addonVersion, Status:addon.status}" \
--output table
# Confirm aws-node DaemonSet is running on all nodes
kubectl get pods -n kube-system -l k8s-app=aws-node
3d: Upgrade EBS CSI Driver (if installed)
aws eks update-addon \
--cluster-name my-cluster \
--addon-name aws-ebs-csi-driver \
--addon-version v1.26.0-eksbuild.1 \
--resolve-conflicts OVERWRITE \
--region us-east-1
aws eks wait addon-active \
--cluster-name my-cluster \
--addon-name aws-ebs-csi-driver \
--region us-east-1
# Verify
kubectl get pods -n kube-system -l app=ebs-csi-controller
3e: Upgrade EFS CSI Driver (if installed)
aws eks update-addon \
--cluster-name my-cluster \
--addon-name aws-efs-csi-driver \
--addon-version v1.7.0-eksbuild.1 \
--resolve-conflicts OVERWRITE \
--region us-east-1
aws eks wait addon-active \
--cluster-name my-cluster \
--addon-name aws-efs-csi-driver \
--region us-east-1
Step 4: Upgrade Managed Node Groups
Now that the control plane and all add-ons are on the new version, upgrade the node groups.
4a: Upgrade Default Managed Node Group
# Trigger node group upgrade
aws eks update-nodegroup-version \
--cluster-name my-cluster \
--nodegroup-name default-ng \
--kubernetes-version 1.28 \
--region us-east-1
# Save the update ID from the response for tracking
# Track upgrade progress
aws eks describe-update \
--name my-cluster \
--update-id <update-id-from-above> \
--region us-east-1 \
--query "update.{Status:status, Errors:errors}"
# Wait until node group is fully active
aws eks wait nodegroup-active \
--cluster-name my-cluster \
--nodegroup-name default-ng \
--region us-east-1
# Verify
aws eks describe-nodegroup \
--cluster-name my-cluster \
--nodegroup-name default-ng \
--region us-east-1 \
--query "{Version:nodegroup.version, Status:nodegroup.status}" \
--output table
4b: Multiple Node Groups
If you have more than one node group, upgrade them one at a time. List them all first:
# List all node groups
aws eks list-nodegroups \
--cluster-name my-cluster \
--region us-east-1 \
--query "nodegroups" \
--output table
Then upgrade each one, wait for it to complete, verify it, then move to the next:
# Upgrade each node group — replace names with your actual node group names
for ng in ng-general ng-compute ng-memory; do
echo "=== Upgrading node group: $ng ==="
aws eks update-nodegroup-version \
--cluster-name my-cluster \
--nodegroup-name $ng \
--kubernetes-version 1.28 \
--region us-east-1
echo "Waiting for $ng to become active..."
aws eks wait nodegroup-active \
--cluster-name my-cluster \
--nodegroup-name $ng \
--region us-east-1
echo "$ng upgraded successfully"
echo ""
done
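One caveat with a plain loop: if an upgrade call fails, the loop still prints the success message and moves on to the next node group. A minimal sketch of a stricter version follows. `for_each_in_order` and `upgrade_nodegroup` are hypothetical helper names; the wrapper simply chains the two aws commands already used in this step.

```bash
# for_each_in_order COMMAND ITEM...
# Runs "COMMAND ITEM" for each item in turn and stops at the first failure,
# so a broken upgrade is never followed by another one.
for_each_in_order() {
  cmd=$1
  shift
  for item in "$@"; do
    echo "=== $item ==="
    if ! "$cmd" "$item"; then
      echo "stopping: $cmd failed for $item" >&2
      return 1
    fi
  done
}

# upgrade_nodegroup NAME   (needs the aws CLI and credentials)
upgrade_nodegroup() {
  aws eks update-nodegroup-version --cluster-name my-cluster \
    --nodegroup-name "$1" --kubernetes-version 1.28 --region us-east-1 &&
  aws eks wait nodegroup-active --cluster-name my-cluster \
    --nodegroup-name "$1" --region us-east-1
}

# Against a cluster:
#   for_each_in_order upgrade_nodegroup ng-general ng-compute ng-memory
# Demo with a stand-in that fails on the second group:
fake_upgrade() { [ "$1" != "ng-compute" ]; }
for_each_in_order fake_upgrade ng-general ng-compute ng-memory ||
  echo "fix ng-compute before continuing"
```

In the demo, ng-memory is never touched because ng-compute failed first, which is the behavior you want at 2 AM.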
Step 5: For Self-Managed Node Groups on EKS
If you’re running self-managed EC2 instances (not managed node groups), you handle the upgrade differently — you replace nodes with new EC2 instances running the updated AMI.
# Step 1: Find the latest EKS-optimized AMI for your target version
aws ssm get-parameter \
--name /aws/service/eks/optimized-ami/1.28/amazon-linux-2/recommended/image_id \
--region us-east-1 \
--query "Parameter.Value" \
--output text
# Step 2: Update your Launch Template with the new AMI ID
# Do this in the AWS Console or via CloudFormation/Terraform
# Step 3: Cordon and drain each old node manually
kubectl cordon <node-name>
kubectl drain <node-name> \
--ignore-daemonsets \
--delete-emptydir-data \
--grace-period=60
# Step 4: Terminate the old EC2 instance
# Your ASG will launch a new instance with the updated Launch Template
aws ec2 terminate-instances \
--instance-ids <old-instance-id> \
--region us-east-1
# Step 5: Verify new node joins the cluster
kubectl get nodes --watch
# New node should appear with v1.28.x and Ready status
# Repeat for each node — one at a time
Step 6: Post-Upgrade Verification for EKS
# 1. Verify cluster version
aws eks describe-cluster \
--name my-cluster \
--region us-east-1 \
--query "cluster.version"
# Expected: "1.28"
# 2. Verify all node groups are on new version
aws eks list-nodegroups \
  --cluster-name my-cluster \
  --region us-east-1 \
  --query "nodegroups[]" \
  --output text | tr '\t' '\n' | \
xargs -I {} aws eks describe-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name {} \
  --region us-east-1 \
  --query "{Name:nodegroup.nodegroupName, Version:nodegroup.version}"
# 3. Verify all nodes in kubectl
kubectl get nodes
# All nodes should show v1.28.x and Ready
# 4. Verify all add-on versions
for addon in coredns kube-proxy vpc-cni aws-ebs-csi-driver; do
echo "=== $addon ==="
aws eks describe-addon \
--cluster-name my-cluster \
--addon-name $addon \
--region us-east-1 \
--query "{Version:addon.addonVersion, Status:addon.status}" \
--output table 2>/dev/null || echo "Not installed"
done
# 5. All kube-system pods healthy?
kubectl get pods -n kube-system
# 6. Any pods in bad state?
kubectl get pods -A | grep -Ev "Running|Completed"
# 7. Any warning events?
kubectl get events -A --field-selector type=Warning
References & Further Reading
Official Kubernetes Documentation
| Topic | Link |
|---|---|
| Kubernetes Release Notes & Changelog | https://kubernetes.io/releases/notes/ |
| Kubernetes Version Skew Policy | https://kubernetes.io/releases/version-skew-policy/ |
| Upgrading kubeadm Clusters (Official Guide) | https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/ |
| Deprecated API Migration Guide | https://kubernetes.io/docs/reference/using-api/deprecation-guide/ |
| etcd Backup & Restore | https://kubernetes.io/docs/tasks/administer-cluster/configure-upgrade-etcd/ |
| Safely Drain a Node | https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/ |
| Kubernetes Release Cadence | https://kubernetes.io/releases/ |
| Pod Disruption Budgets | https://kubernetes.io/docs/concepts/workloads/pods/disruptions/ |
| Cluster Upgrade Strategies | https://kubernetes.io/docs/concepts/cluster-administration/cluster-administration-overview/ |
| Kubernetes Component Versioning | https://kubernetes.io/docs/concepts/overview/kubernetes-api/#api-versioning |
Managed Kubernetes — Cloud Provider Docs
| Platform | Topic | Link |
|---|---|---|
| GKE | Upgrading Clusters | https://cloud.google.com/kubernetes-engine/docs/how-to/upgrading-a-cluster |
| GKE | Surge Upgrades | https://cloud.google.com/kubernetes-engine/docs/concepts/node-pool-upgrade-strategies |
| EKS | Updating a Cluster | https://docs.aws.amazon.com/eks/latest/userguide/update-cluster.html |
| EKS | Managing Add-ons | https://docs.aws.amazon.com/eks/latest/userguide/managing-add-ons.html |
| EKS | Updating Node Groups | https://docs.aws.amazon.com/eks/latest/userguide/update-managed-node-group.html |
| AKS | Upgrading AKS Cluster | https://learn.microsoft.com/en-us/azure/aks/upgrade-cluster |
| AKS | Node Pool Upgrades | https://learn.microsoft.com/en-us/azure/aks/manage-node-pools |
AKS Reference Links
| Topic | Link |
|---|---|
| Upgrade AKS Cluster | https://learn.microsoft.com/en-us/azure/aks/upgrade-cluster |
| Upgrade Node Pools | https://learn.microsoft.com/en-us/azure/aks/manage-node-pools |
| Max Surge Upgrades | https://learn.microsoft.com/en-us/azure/aks/upgrade-cluster#customize-node-surge-upgrade |
| AKS Supported Versions | https://learn.microsoft.com/en-us/azure/aks/supported-kubernetes-versions |
| AKS Auto-upgrade Channels | https://learn.microsoft.com/en-us/azure/aks/auto-upgrade-cluster |
| AKS Release Notes | https://github.com/Azure/AKS/releases |
Tools Referenced in the Blog
| Tool | Purpose | Link |
|---|---|---|
| kubent | Scan cluster for deprecated API usage | https://github.com/doitintl/kube-no-trouble |
| etcdctl | etcd backup and restore tool | https://github.com/etcd-io/etcd/tree/main/etcdctl |
| kubeadm | Official cluster lifecycle management tool | https://kubernetes.io/docs/reference/setup-tools/kubeadm/ |
| kubectl | Official CLI for Kubernetes | https://kubernetes.io/docs/reference/kubectl/ |