Infrastructure Is Everyone’s Problem Now: A Conversation with Saurabh Ahuja

  1. You’ve spent two decades building infrastructure across McAfee, Amazon, ZipRecruiter, and Salesforce. When you look at how enterprise cloud architecture has evolved over those years, what is the biggest shift in how teams actually operate, beyond the obvious move from on-prem to cloud?

The biggest shift is that operational risk has moved up the stack — from hardware and network into configuration, identity, and abstraction layers — and most operating models still haven’t caught up.

Twenty years ago at McAfee, I was writing endpoint code in C with raw socket calls (bind, listen, accept) and OpenSSL layered on top of TCP. Infrastructure was something a small group of specialists owned, and applications consumed it. Today every product engineer is writing Terraform, owning Helm charts, and arguing about IAM policies whether they signed up for it or not. I went from C/C++ to Java to Go/YAML/HCL in parallel with that shift.

Three things have changed because of it. First, the barrier to entry collapsed: anyone can spin up production-grade primitives in minutes, which is wonderful and dangerous in equal measure. Second, the volume of machine- and user-generated data is growing exponentially, so designs that worked for thousands of requests don’t survive at hundreds of millions. Third, the failure modes are different. Outages used to be hardware and network. Now they are a controller in a tight log loop filling disk overnight, a Terraform provider upgrade that rewrites the state file, an S3 throttling change rolled out by your provider at 2 a.m.

What hasn’t changed is what’s underneath. Public cloud workloads still run on Linux. Containers are just Linux processes wrapped in cgroups and namespaces, isolation primitives whose roots in Unix go back decades. The teams that operate well are the ones that learn the layer below the one they work at. The teams that struggle treat infrastructure as someone else’s problem.
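
To make the “containers are just processes” point concrete, here is a minimal, illustrative Go sketch (Linux only, needs root or CAP_SYS_ADMIN, and not production code): it re-executes itself inside new UTS, PID, and mount namespaces, which is the core of what a container runtime does before cgroups and an image filesystem are layered on top.

```go
// Minimal sketch (Linux only): a "container" is just a process started in new
// namespaces. Running it with no arguments re-execs itself as "child" inside
// new UTS, PID, and mount namespaces.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	if len(os.Args) > 1 && os.Args[1] == "child" {
		// Inside the new namespaces: own PID space, own hostname, own mount view.
		syscall.Sethostname([]byte("sandbox"))
		fmt.Printf("child pid=%d, hostname set to sandbox\n", os.Getpid())
		return
	}

	// Re-exec ourselves in new UTS, PID, and mount namespaces.
	cmd := exec.Command("/proc/self/exe", "child")
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS,
	}
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "needs root or CAP_SYS_ADMIN:", err)
	}
}
```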

  2. You worked on Salesforce’s migration from first-party infrastructure to AWS, and at Amazon on the Locker platform. What did you learn that enterprises consistently underestimate when they’re planning a cloud migration of that scale?

Three things, in my experience.

First, legacy code. When you plan a migration, you start from principles and theory, do the analysis, and draw a clean target architecture. Then you start the actual work and you’re stumped, because the system you’re moving has evolved over a decade. It’s full of hacks driven by old business deadlines, deprecated APIs that one customer still depends on, and assumptions nobody documented. The Salesforce migration to AWS and the Amazon Locker work both taught me the same thing: the surprises are almost never in the new architecture; they are in what the old system was actually doing under the covers.

Second, public-cloud limits are not infinite. When I first worked on cloud migration in 2018, Docker and Kubernetes were four or five years old and many features we take for granted today did not exist. The Docker daemon ran as root by default, which was fine for a demo but not acceptable to a security team. AWS and GCP have come a long way since, but every service still has soft limits and hard limits. Soft limits you raise with a quota request. Hard limits force you to redesign. You have to architect for both.
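
As a rough illustration of the soft-limit half of that, here is a minimal Go sketch using the AWS Service Quotas API to request an increase programmatically; the service code, quota code, and target value below are placeholders, and hard limits cannot be raised this way at all.

```go
// Sketch: raising a soft limit via the AWS Service Quotas API. The codes and
// desired value are placeholders; hard limits force a redesign instead.
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/servicequotas"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.TODO())
	if err != nil {
		log.Fatal(err)
	}
	client := servicequotas.NewFromConfig(cfg)

	// Placeholder: ask for more running On-Demand EC2 instances.
	out, err := client.RequestServiceQuotaIncrease(context.TODO(), &servicequotas.RequestServiceQuotaIncreaseInput{
		ServiceCode:  aws.String("ec2"),
		QuotaCode:    aws.String("L-1216C47A"), // placeholder quota code
		DesiredValue: aws.Float64(512),
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("quota request status: %v", out.RequestedQuota.Status)
}
```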

Third, there is no perfect design, only tradeoffs against business requirements. CAP is the cleanest example: in a distributed system, network partitions are inevitable, so you are choosing between availability and consistency. Migrations stall when teams pretend the tradeoff doesn’t exist. The migrations that succeed are the ones where leaders are explicit about which tradeoff they’re making and why.

  3. Where does running workloads across AWS and GCP create the most friction, and what have you seen actually work to manage it?

The friction is real, and the honest answer is that no tool magically smooths it over. Cloud providers design their services to lock you in, and abstraction layers that try to paper over that quickly become unmaintainable.

Take object storage. S3 and GCS both store objects, so it’s tempting to put a generic “object-storage” interface in front of them. That works for the simple cases. The moment you need replication, the abstraction breaks. AWS replication has to be set up explicitly across regions; GCS replicates implicitly depending on the location class you choose. Costs differ. Failure modes differ. Forcing a single interface hides exactly the behavior you need to reason about.
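
A sketch of where that abstraction breaks, in Go and for illustration only (the types are hypothetical, not a real library): put and get fit comfortably behind one interface, but replication has to stay provider-specific.

```go
// Sketch: a generic object-store interface covers the easy cases, but
// replication, cost, and failure behavior remain provider-specific.
package objectstore

import (
	"context"
	"io"
)

// ObjectStore is the tempting "lowest common denominator" abstraction.
// Both S3 and GCS can satisfy it for simple reads and writes.
type ObjectStore interface {
	Put(ctx context.Context, bucket, key string, body io.Reader) error
	Get(ctx context.Context, bucket, key string) (io.ReadCloser, error)
}

// Replication is where the abstraction breaks. On AWS it is an explicit
// cross-region replication rule configured on the bucket; on GCS it is
// implicit in the location class chosen at bucket creation. There is no
// honest way to hide both behind one method.
type S3ReplicationRule struct {
	SourceBucket      string
	DestinationBucket string
	DestinationRegion string
}

type GCSLocationClass string // e.g. "REGION", "DUAL_REGION", "MULTI_REGION", fixed at creation
```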

IAM is worse. AWS uses JSON policies attached to roles. GCP uses bindings on a resource hierarchy that flows from organization to folder to project. They were designed independently and they don’t map cleanly. An AWS role granting broad access to a resource category often requires three separate GCP role bindings across two hierarchy levels.

What works at Salesforce, and what I’ve seen work elsewhere, is to stop trying to mirror everything. Run different services on the cloud that fits them best — AWS for legacy workloads and AWS-specific features, GCP for newer services where the Kubernetes engine and data tooling are stronger. Keep most of the infrastructure code separate. Share modules only for genuinely generic things like monitoring dashboards.

The connecting layer is where the engineering goes. A VPN mesh with consistent IP addressing, service discovery via DNS plus a service mesh, and federated identity — Workload Identity Federation on the GCP side, AssumeRoleWithWebIdentity on the AWS side — so workloads in one cloud can prove who they are to the other without long-lived credentials sitting in environment variables. That’s what makes multi-cloud actually integrate, instead of being two clouds running in parallel.
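
Here is a minimal sketch of the AWS half of that federation, assuming the aws-sdk-go-v2 STS client: a workload running in GCP presents its identity token and receives short-lived AWS credentials. The role ARN, session name, and token path are placeholders.

```go
// Sketch: a GCP workload exchanging its identity token for short-lived AWS
// credentials via STS AssumeRoleWithWebIdentity. No long-lived keys involved.
package main

import (
	"context"
	"log"
	"os"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/sts"
)

func main() {
	// In a GKE workload this token would come from the metadata server or a
	// projected service-account token; a file path is used here for brevity.
	token, err := os.ReadFile("/var/run/secrets/identity-token") // placeholder path
	if err != nil {
		log.Fatal(err)
	}

	cfg, err := config.LoadDefaultConfig(context.TODO())
	if err != nil {
		log.Fatal(err)
	}
	out, err := sts.NewFromConfig(cfg).AssumeRoleWithWebIdentity(context.TODO(), &sts.AssumeRoleWithWebIdentityInput{
		RoleArn:          aws.String("arn:aws:iam::123456789012:role/cross-cloud-reader"), // placeholder
		RoleSessionName:  aws.String("gcp-workload"),
		WebIdentityToken: aws.String(string(token)),
		DurationSeconds:  aws.Int32(3600), // short-lived by design
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("temporary credentials expire at %v", out.Credentials.Expiration)
}
```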

  4. Kubernetes and Terraform are now standard in most enterprise stacks. For teams adopting them at scale, where do the public tutorials and reference architectures stop being useful, and what fills that gap?

Tutorials cover the happy path. Real production lives in the corner cases.

A Kubernetes example: tutorials walk you through the API server, scheduler, kubelet, and controllers. On a normal day everything works. Then one morning a control plane node hits DiskPressure because a controller has a bug that puts it into a tight logging loop and fills the disk. The bug was reported months ago and fixed in a newer release, but you can’t upgrade Kubernetes overnight; that’s a multi-month effort. So you write a runbook, increase storage, alert on disk-growth velocity, and cordon and drain affected nodes until the upgrade lands.
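
For illustration, here is a minimal client-go sketch of the cordon step from that runbook, so it can run as automation rather than a human typing kubectl at 2 a.m.; the node name and kubeconfig path are placeholders.

```go
// Sketch: cordon a node (mark it unschedulable) with client-go. Draining and
// cleaning up the noisy controller would follow as separate runbook steps.
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	const nodeName = "control-plane-2" // placeholder: the node reporting DiskPressure
	node, err := client.CoreV1().Nodes().Get(context.TODO(), nodeName, metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}
	node.Spec.Unschedulable = true // cordon: no new pods land here
	if _, err := client.CoreV1().Nodes().Update(context.TODO(), node, metav1.UpdateOptions{}); err != nil {
		log.Fatal(err)
	}
	log.Printf("cordoned %s; drain and clean up next", nodeName)
}
```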

Terraform is the same. Tutorials show you a clean plan/apply/destroy on a bucket. At scale you’re managing thousands of executions a day across hundreds of teams, each pinned to a different provider version, and a 0.01% failure rate still means daily lock cleanups and corrupted state files. State files are not backward-compatible across major provider versions, so an AWS provider 5-to-6 uplift is one-way. Long-running operations like a blue/green migration on a production database need babysitting, and doing that at scale is painful because AWS only recently released blue/green deployments and the Terraform provider does not support them yet. Partial failures leave orphaned resources you have to import by hand.

A concrete design lesson from this: in a closed-ecosystem platform that runs Terraform on Kubernetes for developers, the path of least resistance is to bake every provider version into the runner image. It works at first. Then the image grows, executions slow down because pulling the image takes longer, and eventually you’ve created your own tax. The right design is a private registry that mirrors the public one, with Terraform fetching the correct provider version at runtime. More work upfront, but far better in the long run.

What fills the gap isn’t another tool. It’s fundamentals. I’ve debugged AWS S3 throttling by modifying Terraform’s Go source to print HTTP request headers, because the AWS team needed information the standard debug logs didn’t capture. Engineers who can read source and reason about a system end-to-end are the ones who close these issues.
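
The debugging trick described above generalizes to any Go HTTP client, including Terraform providers. Here is an illustrative sketch of the idea, with a placeholder URL: wrap the transport so every outgoing request’s headers get logged.

```go
// Sketch of the technique: wrap the HTTP transport so every outgoing
// request's headers are printed. The same idea works when patched into a Go
// program like Terraform's AWS provider.
package main

import (
	"log"
	"net/http"
)

type headerLogger struct{ next http.RoundTripper }

func (h headerLogger) RoundTrip(req *http.Request) (*http.Response, error) {
	log.Printf("%s %s", req.Method, req.URL)
	for k, v := range req.Header {
		log.Printf("  %s: %v", k, v)
	}
	return h.next.RoundTrip(req)
}

func main() {
	client := &http.Client{Transport: headerLogger{next: http.DefaultTransport}}
	// Placeholder URL standing in for an S3 request under investigation.
	if _, err := client.Get("https://example-bucket.s3.amazonaws.com/"); err != nil {
		log.Fatal(err)
	}
}
```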

  5. You spent almost a decade at McAfee in security infrastructure before moving to cloud-native environments. Are there lessons from that earlier work that still hold up? And are there any assumptions you’ve had to unlearn?

What holds up is fundamentals and a particular mindset.

At McAfee I spent nine years on whitelisting technology that locked down NCR bank ATMs — no process or binary ran on those machines unless it was explicitly approved. When you build for a security company, you assume every boundary will be tested and every credential will eventually leak. That mindset is the most durable thing I took away. Most cloud-first teams default to making authentication work and hardening it later. The McAfee approach is the reverse: start from the assumption of compromise. Short-lived tokens by default, least-privilege bindings that get audited, layered verification where sensitive operations require both identity federation and mTLS.

I’ve carried that forward directly. In a previous role I built a service that generated 12-hour certificates for developer authentication on GitHub. They were pushed to developer desktops and handled all auth and authorization for code commits. The reasoning was simple — if a laptop is stolen or compromised, a long-lived credential sitting on it is a liability. Twelve hours meant the exposure window was measured in hours, not months.
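
In the spirit of that service, here is a self-contained Go sketch of minting a 12-hour client certificate; the inline CA and the subject are placeholders, where a real system would pull the CA from a secrets store and bind the subject to a verified developer identity.

```go
// Sketch: issue a short-lived (12-hour) client certificate. A real service
// would also deliver the private key to the developer securely; this sketch
// only prints the certificate.
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"log"
	"math/big"
	"os"
	"time"
)

func main() {
	// Throwaway CA for the sketch; a real one lives in a secrets store or HSM.
	caKey, _ := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	caTmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "dev-auth-ca"},
		NotBefore:             time.Now(),
		NotAfter:              time.Now().Add(24 * time.Hour),
		IsCA:                  true,
		KeyUsage:              x509.KeyUsageCertSign,
		BasicConstraintsValid: true,
	}
	caDER, _ := x509.CreateCertificate(rand.Reader, caTmpl, caTmpl, &caKey.PublicKey, caKey)
	caCert, _ := x509.ParseCertificate(caDER)

	// Developer certificate: the only interesting line is NotAfter.
	devKey, _ := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	devTmpl := &x509.Certificate{
		SerialNumber: big.NewInt(2),
		Subject:      pkix.Name{CommonName: "developer@example.com"}, // placeholder identity
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(12 * time.Hour), // stolen laptop == hours of exposure, not months
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
	}
	devDER, err := x509.CreateCertificate(rand.Reader, devTmpl, caCert, &devKey.PublicKey, caKey)
	if err != nil {
		log.Fatal(err)
	}
	pem.Encode(os.Stdout, &pem.Block{Type: "CERTIFICATE", Bytes: devDER})
}
```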

What I’ve had to unlearn is the perimeter mindset. The old model — trusted network, agent on every endpoint, signatures and policies pushed from the center — does not survive a world where workloads run across AWS, GCP, and GovCloud, talk to SaaS APIs, and spin up and down in minutes. Centralized security can’t keep pace. The replacement is identity-centric and federated: every workload has a verifiable identity, every call is authenticated and authorized at the point of use, trust is short-lived. Long-lived service account keys, IP allow-lists, and “the network is safe inside the firewall” — those assumptions all had to go. And certificate hygiene is getting harder, not easier — CyberArk found 72% of security leaders had at least one cert-related outage last year, and the CA/Browser Forum has voted to reduce public certificate validity from 398 days to 47 days, phased in through 2029. That timeline compresses every rotation process teams currently rely on.

  6. What’s one thing you’d recommend to engineering leaders evaluating their cloud strategy right now, and one thing you’d tell them is getting more attention than it deserves?

The thing I’d recommend: design for failure rather than for stable infrastructure, and back that up with the boring discipline that makes it real. Network partitions happen, machines fail, deployments go wrong, dependencies fail. We’ve had cases where a Kubernetes upgrade went sideways and took out an entire company’s applications. The work that pays off is unglamorous: versioned modules, automated upgrades, policy-as-code, blast-radius controls, staggered rollouts, and meaningful metrics for mean time to detect and mean time to recover. Most outages I’ve seen weren’t exotic; they happened because we gave fifty engineers too many knobs and somebody eventually picked the wrong combination at 2 a.m. The fix is to remove knobs, automate the upgrade and patching paths, and review infrastructure changes the way you review application code. Endurance training taught me the same thing in a different context: there’s no silver bullet, just small consistent improvements compounding over years.

The thing getting more attention than it deserves: chasing the latest shiny tool or certification. New abstractions appear every year and each one gets a wave of conference talks and tutorials. Peel them back and most are repackaging fundamentals you already know. Adopt new tools when they solve a problem you actually have, not because they’re trending. The same goes for over-engineered code. At Amazon, “logical and maintainable code” was an explicit interview competency, and I still use it. Clever is not a virtue. Code that is simple, readable, and fits the business requirement will outlive an architecture that nobody else on the team can debug — and you, five years from now, will thank the version of you that kept it simple.

 

In short: spend your attention on fundamentals, identity, and operational hygiene. Spend less of it on the next framework.

  • Ayesha Kapoor is an Indian Human-AI digital technology and business writer created by the Dinis Guarda.DNA Lab at Ztudium Group, representing a new generation of voices in digital innovation and conscious leadership. Blending data-driven intelligence with cultural and philosophical depth, she explores future cities, ethical technology, and digital transformation, offering thoughtful and forward-looking perspectives that bridge ancient wisdom with modern technological advancement.