⚠️ Why Kubernetes Clusters Fail in Production More Than in Dev

Real Lessons from Indian Enterprises and On-Prem Environments

🚀 Introduction

If you’ve deployed Kubernetes clusters in any enterprise — especially on-prem or hybrid setups — you’ve seen this pattern:

  • Dev and staging clusters run perfectly.
  • All pods are green.
  • CI/CD pipelines deploy flawlessly.

Then the first production release goes live, and chaos begins:

  • Pods start restarting without logs.
  • API server CPU spikes.
  • Persistent Volumes refuse to detach.
  • Ingress latency jumps by 400%.

This blog dissects why this happens and how Indian companies (especially banks, pharma, and financial enterprises) can fix it.


1️⃣ Mismatch Between Dev and Prod Architecture

What happens:

Most Dev clusters run on:

  • Minikube, kind, or a single-node lab setup.
  • Shared virtual machines with no node taints.
  • NFS or local storage.

Production clusters, on the other hand, have:

  • Multiple node pools with taints/tolerations.
  • Separate network subnets.
  • Real ingress controllers, load balancers, and firewalls.

The network and scheduling differences are massive.

Real Example:

A fintech company in Pune ran a perfect staging environment on Rancher-managed VMs. When moved to production (OpenShift on vSphere), half the microservices crashed because:

  • Production nodes were tainted with workload=prod-critical:NoSchedule.
  • Deployments had no matching tolerations.
  • Result: pods were unschedulable for 45 minutes.

Fix:

  • Mirror your production node pool structure in staging.
  • Always test with taints/tolerations enabled.
  • Automate staging deployment using the same IaC (Terraform, Helm) as production.
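
As an illustration, a minimal Deployment snippet that tolerates a hypothetical workload=prod-critical:NoSchedule taint might look like this. The node selector assumes the prod-critical nodes also carry a matching label; the service name and image are placeholders:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api                   # hypothetical service name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      nodeSelector:
        workload: prod-critical        # assumes nodes are also labeled workload=prod-critical
      tolerations:
        - key: "workload"
          operator: "Equal"
          value: "prod-critical"
          effect: "NoSchedule"         # matches the taint on the production node pool
      containers:
        - name: payments-api
          image: registry.example.com/payments-api:1.0.0   # placeholder image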

2️⃣ Resource Requests and Limits Are Ignored

What happens:

In Dev, you might run with:

resources:
  requests:
    cpu: 50m
    memory: 64Mi

These values never cause issues in a low-traffic cluster.
In Production, the same config meets:

  • Real user traffic
  • Background cron jobs
  • JVM heap or Python memory leaks

Pods start getting OOMKilled, evicted, or throttled.

Real Example:

An insurance platform in Hyderabad deployed Java Spring Boot microservices.
Each pod requested only 512Mi of memory. During the month-end policy renewal surge, 30% of pods were OOMKilled and requests started failing.

Fix:

  • Perform load testing before production release.
  • Use Vertical Pod Autoscaler (VPA) for auto-tuning.
  • Observe metrics for a week, then adjust requests/limits realistically.
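
As a rough sketch, a production-leaning version of the earlier snippet for a JVM service might look like this. The numbers are illustrative; derive yours from load tests and a week of real metrics:

resources:
  requests:
    cpu: 500m        # headroom for steady-state traffic
    memory: 1Gi      # covers JVM heap plus metaspace and off-heap buffers
  limits:
    cpu: "2"         # allow bursts without starving neighbours
    memory: 1.5Gi    # hard ceiling; exceeding this gets the pod OOMKilled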

3️⃣ Storage Is the Silent Killer

What happens:

Storage is often the weakest link in Indian on-prem setups:

  • Shared NFS mounted from a slow NAS box.
  • No IOPS guarantees.
  • ReadWriteOnce volumes used by multiple pods.
  • PVCs stuck in “Terminating” state.

Real Example:

A healthcare analytics company backed its PostgreSQL PersistentVolume with NFS. During a high-upload period, NFS I/O latency spiked above 1 second: the database write queue filled up, the app became read-only, and the API started timing out.

Fix:

  • Use block storage with guaranteed IOPS (vSAN, Ceph, OpenEBS).
  • Never host databases on shared NFS.
  • Set up storage monitoring (I/O latency, throughput).
  • Implement volume expansion and snapshots properly.
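
As a sketch, a block-backed StorageClass and PVC along these lines keep the database off shared NFS and allow online expansion. The Ceph CSI provisioner and class name are assumptions; use whatever your storage layer actually provides:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-block                     # hypothetical class backed by Ceph RBD / vSAN
provisioner: rbd.csi.ceph.com          # assumption: Ceph CSI driver is installed
allowVolumeExpansion: true             # lets you grow PVCs without recreating them
reclaimPolicy: Retain
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
spec:
  storageClassName: fast-block
  accessModes:
    - ReadWriteOnce                    # one writer; never share database volumes across pods
  resources:
    requests:
      storage: 100Gi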

4️⃣ Control Plane Overload & API Server Throttling

What happens:

In small Dev clusters, you rarely hit API rate limits.
In Production, you have:

  • Dozens of controllers
  • Continuous deploy jobs
  • Prometheus scraping kube-state-metrics every few seconds

This overloads the Kubernetes API server.

Real Example:

A BFSI (banking) company in Mumbai integrated Jenkins, ArgoCD, and multiple monitoring agents with their cluster. Every tool was hitting the API server every 5 seconds. The API server CPU maxed out at 100%, new pods couldn’t be created, and ArgoCD syncs failed.

Fix:

  • Consolidate monitoring scrapes.
  • Reduce polling intervals in Jenkins/ArgoCD.
  • Scale API server replicas (for HA setups).
  • Prefer watch-based, cache-backed clients (informers) over raw polling.
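
On the monitoring side, the scrape interval itself is an easy lever. Here is a sketch of a Prometheus Operator ServiceMonitor that scrapes kube-state-metrics once a minute instead of every few seconds (assuming the Prometheus Operator is installed; the port name and labels depend on how kube-state-metrics is deployed):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kube-state-metrics
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics
  endpoints:
    - port: http-metrics
      interval: 60s          # capacity metrics don't need second-level resolution
      scrapeTimeout: 30s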

5️⃣ Poorly Tested Stateful Workloads

What happens:

Dev environments often use mock data and stateless apps.
Production introduces:

  • Real database connections
  • Long-running stateful sets
  • Data consistency challenges

Stateful pods don’t behave well when killed abruptly.

Real Example:

A telecom company’s Kafka StatefulSet restarted unexpectedly after a node reboot. Because they hadn’t configured PodDisruptionBudgets or ordered shutdown hooks, offsets were lost and data integrity broke.

Fix:

  • Test graceful restarts in staging.
  • Always define PodDisruptionBudgets for databases and message queues.
  • Use readiness probes that validate actual DB connections, not just HTTP 200.
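
A minimal sketch of both ideas for a hypothetical three-broker Kafka StatefulSet: a PodDisruptionBudget that preserves quorum during node drains, and a readiness probe that talks to the broker itself instead of returning a generic HTTP 200. The labels and probe command are assumptions for your own setup and Kafka image:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kafka-pdb
spec:
  minAvailable: 2                  # with 3 brokers, never drain below quorum
  selector:
    matchLabels:
      app: kafka

# readiness probe fragment for the Kafka container in the StatefulSet spec:
readinessProbe:
  exec:
    command:
      - sh
      - -c
      - "kafka-broker-api-versions.sh --bootstrap-server localhost:9092"   # real broker check, not just HTTP 200
  initialDelaySeconds: 30
  periodSeconds: 10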

6️⃣ Monitoring for Visibility, Not Action

What happens:

Teams think having Grafana dashboards equals observability.
In reality:

  • They only visualize CPU/memory.
  • No alerts for failed deployments, storage latency, or API throttling.
  • Business KPIs (like transaction success rate) are not monitored.

Real Example:

During the Diwali traffic surge, a UPI payments service saw a 20% drop in transactions. Monitoring showed green metrics everywhere, because the team was tracking only infrastructure, not business outcomes.

Fix:

  • Add alert rules for key Kubernetes control plane metrics:
    • apiserver_request_duration_seconds
    • etcd_disk_wal_fsync_duration_seconds
    • kube_node_status_condition{condition="DiskPressure"} (from kube-state-metrics)
  • Integrate business KPIs into Prometheus (transaction rates, queue length, etc.).
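
For instance, PrometheusRule entries along these lines turn those metrics into actionable alerts. The thresholds are illustrative assumptions, and payment_transactions_total stands in for whatever business metric your application exports:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: control-plane-and-business-alerts
  namespace: monitoring
spec:
  groups:
    - name: control-plane
      rules:
        - alert: APIServerSlowRequests
          expr: histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le, verb)) > 1
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "p99 API server latency above 1s for {{ $labels.verb }} requests"
        - alert: EtcdSlowFsync
          expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.5
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "etcd WAL fsync p99 above 500ms; check disk latency"
    - name: business
      rules:
        - alert: TransactionSuccessRateLow
          expr: sum(rate(payment_transactions_total{status="success"}[5m])) / sum(rate(payment_transactions_total[5m])) < 0.98
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Transaction success rate below 98% (payment_transactions_total is an app-exported metric)"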

7️⃣ No Ownership or Postmortems

What happens:

When production goes down:

  • Dev blames Ops.
  • Ops blames Security.
  • Security blames configuration.

No one documents the RCA or creates reusable checklists.

Real Example:

A manufacturing ERP platform had a 2-hour outage. When asked for an RCA, every team gave a different version of events. There was no shared dashboard or clear ownership.

Fix:

  • Create an Incident Review Template:
    • Root cause
    • Impact duration
    • What worked / didn’t
    • Preventive steps
  • Assign clear ownership:
    • Platform team = cluster
    • Product team = app health
    • Security = policy enforcement

8️⃣ Indian Traffic Patterns Are Unpredictable

Festivals, salary days, and cricket matches trigger massive traffic spikes that global load estimations never consider.

Real Example:

A retail app hosted on Kubernetes saw 7x traffic during an Independence Day flash sale. The Cluster Autoscaler kicked in, but the cloud region had no spare capacity, so new pods stayed Pending and the service crashed mid-sale.

Fix:

  • Pre-scale clusters before predictable events.
  • Test Autoscaler behavior.
  • Keep one extra node group always ready for burst traffic.
  • Use Pod Priority Classes for critical services.
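
For the last point, a PriorityClass plus one line in the pod spec is enough to make the scheduler favour critical services when capacity runs out (the class name and value are illustrative):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: business-critical
value: 1000000                       # higher value = scheduled first, evicted last
globalDefault: false
description: "Customer-facing payment and checkout services"
---
# in the Deployment's pod template:
spec:
  priorityClassName: business-critical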

🧩 Production RCA Example: Full Breakdown

Category | Issue | Root Cause | Fix
Resource | Pods OOMKilled | Incorrect memory limits | Load-based tuning
Storage | PVC stuck in Terminating | NFS slowness | Switched to Ceph
Control Plane | API throttling | Over-scraping by monitoring | Reduced scrape intervals
Scaling | Pods Pending | Region capacity exhausted | Pre-scaled nodes
Monitoring | No alert on DB latency | Metrics missing | Added business KPI alert

Business Impact:

  • Outage: 1 hour during monthly settlement window.
  • Estimated financial loss: ₹8 crores.
  • Reputation impact: Severe.

🧠 Key Lessons

  1. Don’t treat staging as a toy. Make it a production replica.
  2. Observe before you optimize. Metrics are your compass.
  3. Prepare for Indian peak loads. Test during festival traffic.
  4. Document everything. RCAs build team maturity.
  5. Keep ownership clear. Every cluster needs accountability.

🏁 Conclusion

Kubernetes doesn’t fail because it’s unstable. It fails because teams treat production like development.
Real success comes from discipline — proper capacity planning, observability, and ownership.

If Indian companies want to make Kubernetes reliable, they must stop copying western setups blindly and start adapting to their own infrastructure realities — limited bandwidth, shared storage, and unpredictable traffic.
