Real Lessons from Indian Enterprises and On-Prem Environments
🚀 Introduction
If you’ve deployed Kubernetes clusters in any enterprise — especially on-prem or hybrid setups — you’ve seen this pattern:
- Dev and staging clusters run perfectly.
- All pods are green.
- CI/CD pipelines deploy flawlessly.
Then, the first production release goes live, and chaos begins:
- Pods start restarting without logs.
- API server CPU spikes.
- Persistent Volumes refuse to detach.
- Ingress latency jumps by 400%.
This blog dissects why this happens and how Indian companies (especially banks, pharma, and financial enterprises) can fix it.
1️⃣ Mismatch Between Dev and Prod Architecture
What happens:
Most Dev clusters run on:
- Minikube, kind, or a single-node lab setup.
- Shared virtual machines with no node taints.
- NFS or local storage.
Production clusters, on the other hand, have:
- Multiple node pools with taints/tolerations.
- Separate network subnets.
- Real ingress controllers, load balancers, and firewalls.
The network and scheduling differences are massive.
Real Example:
A fintech company in Pune ran a flawless staging environment on Rancher-managed VMs. When the workloads moved to production (OpenShift on vSphere), half the microservices failed to come up because:
- Production nodes were tainted with workload=prod-critical.
- The Deployments carried no matching tolerations.
- Result: pods sat unschedulable for 45 minutes.
Fix:
- Mirror your production node pool structure in staging.
- Always test with taints/tolerations enabled (see the Deployment sketch after this list).
- Automate staging deployment using the same IaC (Terraform, Helm) as production.
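A minimal Deployment sketch that tolerates a workload=prod-critical taint, as referenced above. The taint key, labels, and image are assumptions for illustration, not the fintech company's actual manifests:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api                      # hypothetical service name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      nodeSelector:
        workload: prod-critical           # pin to the dedicated prod node pool
      tolerations:
        - key: "workload"
          operator: "Equal"
          value: "prod-critical"
          effect: "NoSchedule"            # must match the taint's effect on the nodes
      containers:
        - name: payments-api
          image: registry.example.com/payments-api:1.0.0   # placeholder image
```

Keeping the same tolerations block in the staging values is what catches this class of failure before release.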
2️⃣ Resource Requests and Limits Are Ignored
What happens:
In Dev, you might run with:
```yaml
resources:
  requests:
    cpu: 50m
    memory: 64Mi
```
These values never cause issues in a low-traffic cluster.
In production, the same configuration runs into:
- Real user traffic
- Background cron jobs
- JVM heap or Python memory leaks
Pods start getting OOMKilled, evicted, or throttled.
Real Example:
An insurance platform in Hyderabad deployed Java Spring Boot microservices.
Each pod was given only 512Mi of memory. During the month-end policy-renewal surge, 30% of the pods were OOMKilled and requests started failing.
Fix:
- Perform load testing before production release.
- Use the Vertical Pod Autoscaler (VPA) for auto-tuning (example below).
- Observe metrics for a week, then adjust requests/limits realistically.
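A minimal VPA sketch, assuming the VPA operator is installed in the cluster; the Deployment name and memory bounds are illustrative:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: policy-service-vpa          # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: policy-service            # hypothetical target Deployment
  updatePolicy:
    updateMode: "Off"               # recommendation-only: observe first, apply manually
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          memory: 512Mi
        maxAllowed:
          memory: 2Gi
```

Running in "Off" mode for a week produces recommendations you can fold into requests/limits without surprise evictions.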
3️⃣ Storage Is the Silent Killer
What happens:
Storage is often the weakest link in Indian on-prem setups:
- Shared NFS mounted from a slow NAS box.
- No IOPS guarantees.
- ReadWriteOnce volumes used by multiple pods.
- PVCs stuck in “Terminating” state.
Real Example:
A healthcare analytics company backed its PostgreSQL PV with NFS. During a heavy upload period, NFS I/O latency spiked above 1 second: the database write queue filled up, the app fell back to read-only, and APIs started timing out.
Fix:
- Use block storage with guaranteed IOPS (vSAN, Ceph, OpenEBS); see the StorageClass sketch after this list.
- Never host databases on shared NFS.
- Set up storage monitoring (I/O latency, throughput).
- Implement volume expansion and snapshots properly.
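A StorageClass sketch for Ceph RBD via the CSI driver with expansion enabled. The cluster ID is a placeholder, and the secret/monitor parameters the Ceph CSI driver normally requires are omitted for brevity:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-rbd-fast
provisioner: rbd.csi.ceph.com            # assumes the Ceph CSI driver is deployed
parameters:
  clusterID: <ceph-cluster-id>           # placeholder; secret/monitor params omitted
  pool: k8s-rbd
  csi.storage.k8s.io/fstype: ext4
reclaimPolicy: Retain                    # keep data even if the PVC is deleted
allowVolumeExpansion: true               # allows online PVC resizing
volumeBindingMode: WaitForFirstConsumer  # bind where the pod actually lands
```

Databases then claim dedicated ReadWriteOnce block volumes instead of sharing an NFS export.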
4️⃣ Control Plane Overload & API Server Throttling
What happens:
In small Dev clusters, you rarely hit API rate limits.
In Production, you have:
- Dozens of controllers
- Continuous deploy jobs
- Prometheus scraping kube-state metrics every few seconds
This overloads the Kubernetes API server.
Real Example:
A BFSI (banking) company in Mumbai integrated Jenkins, ArgoCD, and multiple monitoring agents with their cluster. Every tool was hitting the API server every 5 seconds. The API server CPU maxed out at 100%, new pods couldn’t be created, and ArgoCD syncs failed.
Fix:
- Consolidate monitoring scrapes (see the scrape-config snippet after this list).
- Lower the polling frequency in Jenkins/ArgoCD (longer intervals, or webhooks instead of polling).
- Scale API server replicas (for HA setups).
- Favour watch/informer-based clients and caching over raw, repeated polling of the API.
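A Prometheus scrape-config sketch that backs off from second-level polling; the target address is an assumption for a typical kube-state-metrics Service:

```yaml
# prometheus.yml (snippet)
scrape_configs:
  - job_name: kube-state-metrics
    scrape_interval: 60s        # back off from aggressive per-second polling of the control plane
    scrape_timeout: 30s
    static_configs:
      - targets: ["kube-state-metrics.kube-system.svc:8080"]   # assumed Service address
```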
5️⃣ Poorly Tested Stateful Workloads
What happens:
Dev environments often use mock data and stateless apps.
Production introduces:
- Real database connections
- Long-running stateful sets
- Data consistency challenges
Stateful pods don’t behave well when killed abruptly.
Real Example:
A telecom company’s Kafka StatefulSet restarted unexpectedly after a node reboot. Because no PodDisruptionBudgets or ordered shutdown hooks were in place, topic offsets were lost and data integrity broke.
Fix:
- Test graceful restarts in staging.
- Always define PodDisruptionBudgets for databases and message queues (sketch after this list).
- Use readiness probes that validate actual DB connections, not just HTTP 200.
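A PodDisruptionBudget sketch for a three-broker Kafka StatefulSet; the label selector and replica count are assumptions:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kafka-pdb
spec:
  minAvailable: 2              # with 3 brokers, never drain below quorum voluntarily
  selector:
    matchLabels:
      app: kafka               # must match the StatefulSet's pod labels
```

For the readiness probe, an exec probe that runs a real client check (for example, pg_isready against a database) is more faithful than a bare HTTP 200.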
6️⃣ Monitoring for Visibility, Not Action
What happens:
Teams think having Grafana dashboards equals observability.
In reality:
- They only visualize CPU/memory.
- No alerts for failed deployments, storage latency, or API throttling.
- Business KPIs (like transaction success rate) are not monitored.
Real Example:
During a Diwali traffic surge, a UPI payments service saw a 20% drop in transactions. Monitoring showed green metrics everywhere, because the team was tracking only infrastructure, not business outcomes.
Fix:
- Add alert rules for key Kubernetes control plane metrics:
  - apiserver_request_duration_seconds
  - etcd_disk_wal_fsync_duration_seconds
  - kubelet_node_disk_pressure
- Integrate business KPIs into Prometheus (transaction rates, queue length, etc.); an example alert rule follows.
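An example alerting-rules sketch: apiserver_request_duration_seconds is a real control-plane metric, while the payment_transactions_* metrics and the thresholds are hypothetical business KPIs you would expose from your own services:

```yaml
groups:
  - name: control-plane-and-business
    rules:
      - alert: APIServerHighLatency
        expr: histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le, verb)) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "API server p99 latency above 1s for {{ $labels.verb }}"
      - alert: TransactionSuccessRateLow
        expr: sum(rate(payment_transactions_success_total[5m])) / sum(rate(payment_transactions_total[5m])) < 0.98
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Transaction success rate dropped below 98%"
```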
7️⃣ No Ownership or Postmortems
What happens:
When production goes down:
- Dev blames Ops.
- Ops blames Security.
- Security blames configuration.
No one documents the RCA or creates reusable checklists.
Real Example:
A manufacturing ERP platform had a 2-hour outage. When asked for RCA, every team had a different version. There was no shared dashboard or ownership.
Fix:
- Create an Incident Review Template:
- Root cause
- Impact duration
- What worked / didn’t
- Preventive steps
- Assign clear ownership:
- Platform team = cluster
- Product team = app health
- Security = policy enforcement
8️⃣ Indian Traffic Patterns Are Unpredictable
Festivals, salary days, and cricket matches trigger massive traffic spikes that global load estimates never account for.
Real Example:
A retail app hosted on Kubernetes saw 7x traffic during an Independence Day flash sale. The Cluster Autoscaler kicked in, but the cloud region had no spare capacity. Result: new pods stayed Pending and the service crashed mid-sale.
Fix:
- Pre-scale clusters before predictable events.
- Test Autoscaler behavior.
- Keep one extra node group always ready for burst traffic.
- Use Pod Priority Classes for critical services (example below).
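A PriorityClass sketch for burst-critical workloads; the name, value, and description are illustrative:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: business-critical
value: 1000000                           # higher value is scheduled first and can preempt lower-priority pods
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "Checkout/payment pods that must keep running during traffic bursts"
```

Critical Deployments then reference it with priorityClassName: business-critical in their pod spec.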
🧩 Production RCA Example: Full Breakdown
| Category | Issue | Root Cause | Fix |
|---|---|---|---|
| Resource | Pods OOMKilled | Incorrect memory limits | Load-based tuning |
| Storage | PVC Terminating | NFS slowness | Switched to Ceph |
| Control Plane | API Throttling | Over-scraping by monitoring | Reduced scrape intervals |
| Scaling | Pods Pending | Region capacity exhausted | Pre-scaled nodes |
| Monitoring | No alert on DB latency | Metrics missing | Added business KPI alert |
Business Impact:
- Outage: 1 hour during monthly settlement window.
- Estimated financial loss: ₹8 crores.
- Reputation impact: Severe.
🧠 Key Lessons
- Don’t treat staging as a toy. Make it a production replica.
- Observe before you optimize. Metrics are your compass.
- Prepare for Indian peak loads. Test during festival traffic.
- Document everything. RCAs build team maturity.
- Keep ownership clear. Every cluster needs accountability.
🏁 Conclusion
Kubernetes doesn’t fail because it’s unstable. It fails because teams treat production like development.
Real success comes from discipline — proper capacity planning, observability, and ownership.
If Indian companies want to make Kubernetes reliable, they must stop copying western setups blindly and start adapting to their own infrastructure realities — limited bandwidth, shared storage, and unpredictable traffic.