Building scalable and resilient AI-driven cloud infrastructure is a challenge that requires more than just technical expertise—it demands strategic foresight, automation, and a deep understanding of failure mitigation. In this interview, Aditya Bhatia, Principal Software Engineer at Splunk (Cisco), shares insights from his journey at Yahoo, Apple, and Splunk, covering lessons in Kubernetes automation, AI-driven cloud transitions, and leadership in high-pressure environments. He also discusses the evolving role of engineers in an AI-powered future and how enterprises can build infrastructure that withstands inevitable system failures.
From Yahoo to Apple to Splunk (Cisco), your career has been a journey through some of the most innovative tech companies. What key lessons have you learned about building scalable and resilient AI and cloud infrastructure at an enterprise level?
Over the years, working at Yahoo, Apple, and now Splunk (Cisco), I’ve learned that building scalable and resilient AI and cloud infrastructure is as much an art as it is a science. At Yahoo, where I first started working on cloud services and CI/CD automation, I quickly realized that scalability isn’t just about throwing more servers at a problem—trust me, that just leads to a bigger, more expensive problem. Instead, I learned the importance of automation and standardization, which not only make systems more efficient but also keep engineers from spending their weekends firefighting.
At Apple, working on distributed ML frameworks for Siri TTS, I got my first real taste of how unpredictable AI workloads can be. One moment, everything is running smoothly; the next, a job crashes, and you’re suddenly debugging logs at 2 AM. That experience taught me the value of fault-tolerant design and proactive failure handling—things like checkpointing, speculative execution, and autoscaling aren’t just nice-to-haves; they’re what keep large-scale AI systems from becoming expensive science experiments.
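To make that concrete, here is a minimal sketch of what checkpointing can look like in a training loop (hypothetical PyTorch code, not any production system; real distributed frameworks layer coordination and storage on top):

```python
import os
import torch

CKPT_PATH = "checkpoints/latest.pt"  # hypothetical location

def save_checkpoint(model, optimizer, epoch, path=CKPT_PATH):
    """Persist enough state to resume training after a crash."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    torch.save({
        "epoch": epoch,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
    }, path)

def load_checkpoint(model, optimizer, path=CKPT_PATH):
    """Return the epoch to resume from; 0 if no checkpoint exists yet."""
    if not os.path.exists(path):
        return 0
    state = torch.load(path)
    model.load_state_dict(state["model_state"])
    optimizer.load_state_dict(state["optimizer_state"])
    return state["epoch"] + 1
```

Saving after every epoch (or every N steps) means a 2 AM crash costs minutes of lost work instead of hours.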
Now at Splunk, a company built on turning data into doing, observability is a core part of the DNA. I've come to appreciate that you can’t fix what you can’t measure. It doesn’t matter how well you design an AI or cloud system—if you don’t have real-time monitoring, logs, and metrics, you’re flying blind. I’ve also had to embrace the fact that security isn’t just for security teams, especially as I worked on automating FedRAMP IL2 compliance (because nothing says fun like retrofitting compliance automation onto an already-built product, right?). The biggest lesson here? Security and scalability should be baked into the architecture from the start, not duct-taped on later.
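As an illustration of instrumenting from day one, here is a minimal sketch using the open-source prometheus_client library (the metric names and the job-processing service are hypothetical):

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Hypothetical metrics for a job-processing service.
JOBS_TOTAL = Counter("jobs_processed_total", "Jobs processed", ["status"])
JOB_LATENCY = Histogram("job_duration_seconds", "Job processing time")

@JOB_LATENCY.time()  # records how long each call takes
def process_job():
    time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work
    JOBS_TOTAL.labels(status="success").inc()

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes http://localhost:8000/metrics
    while True:
        process_job()
```

With metrics like these exported from the start, flying blind stops being the default failure mode.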
And of course, if there’s one overarching trend I’ve seen across all these experiences, it’s the shift toward cloud-native architectures. Whether it’s Kubernetes, serverless, or AI-driven automation, the industry is moving towards flexible, scalable infrastructure that can handle the unpredictable nature of modern workloads. At Splunk, I lead distributed workflow orchestration on Kubernetes, ensuring that our systems can gracefully handle the chaos that comes with scale.
At the end of the day, scalability and resilience aren’t just about technology—they’re about strategy, culture, and designing for failure before failure happens. If I’ve learned anything, it’s that the best way to build truly scalable AI and cloud systems is to embrace automation, assume things will break, and always, always have good observability—because nothing humbles you faster than an outage in production.
With the rise of Kubernetes-based infrastructure, how do you see the balance between automation and human oversight evolving? What are some critical challenges companies still face in fully leveraging cloud-native architectures?
Kubernetes-based infrastructure is revolutionizing how we scale infrastructure, but let’s be honest—automation is amazing… until it isn’t. I have used automation to eliminate countless manual hours spent on repetitive tasks, streamline deployments, and build more efficient systems overall. But building such systems also involves collecting enough relevant metrics and data from the underlying systems so that if things go haywire, there is a human in the loop.
Companies still face some critical challenges when trying to fully leverage cloud-native architectures. First, observability and debugging at scale are still hard. Kubernetes gives you flexibility, but when something goes wrong in a multi-cluster deployment, good luck sifting through logs spread across multiple microservices, GPUs, and networking layers. Without strong observability in place, you’re basically playing detective in the dark.
Even with great observability, cost remains a major challenge. Just because Kubernetes lets you auto-scale workloads doesn’t mean you should! I’ve seen companies burn through cloud budgets at an alarming rate, only to realize later that half their compute power was idling away doing nothing. At Splunk, I worked on an initiative to move our workloads onto more efficient compute options in AWS, saving the company $3M annually. Automation needs to be paired with intelligent cost management and governance—otherwise, we end up with a very expensive science project instead of a scalable platform.
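As a toy illustration of what cost-aware automation checks for, here is a sketch that flags over-provisioned workloads from utilization data (all names and numbers are made up):

```python
# Hypothetical utilization snapshot: requested CPU vs. p95 actual usage.
workloads = [
    {"name": "ingest-api",  "cpu_requested": 8.0,  "cpu_used_p95": 1.2},
    {"name": "ml-batch",    "cpu_requested": 16.0, "cpu_used_p95": 14.5},
    {"name": "report-cron", "cpu_requested": 4.0,  "cpu_used_p95": 0.3},
]

def right_sizing_candidates(workloads, utilization_floor=0.4):
    """Flag workloads using less than `utilization_floor` of their request."""
    return [
        w for w in workloads
        if w["cpu_used_p95"] / w["cpu_requested"] < utilization_floor
    ]

for w in right_sizing_candidates(workloads):
    print(f"{w['name']}: requests {w['cpu_requested']} cores, "
          f"p95 usage {w['cpu_used_p95']} -> downsize candidate")
```

In practice the input would come from your metrics backend, but the principle is the same: measure first, then automate the downsizing.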
Security is another major hurdle. Kubernetes expands the attack surface, and many companies are still struggling with proper RBAC policies, secret management, and network security in highly dynamic environments. The flexibility Kubernetes provides can be a double-edged sword if security and compliance aren’t baked in from day one. At Splunk, working on automated FedRAMP IL2 compliance, I learned that security can’t be an afterthought—it has to be built into the automation framework itself.
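To show what baked in from day one can look like for RBAC, here is a minimal least-privilege example using the official Kubernetes Python client (the role name and namespace are hypothetical):

```python
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in a pod

# Least-privilege role: read-only access to pods in a single namespace.
role = client.V1Role(
    metadata=client.V1ObjectMeta(name="pod-reader", namespace="ml-workloads"),
    rules=[client.V1PolicyRule(
        api_groups=[""],                 # "" is the core API group
        resources=["pods", "pods/log"],
        verbs=["get", "list", "watch"],  # no create/delete/update
    )],
)

client.RbacAuthorizationV1Api().create_namespaced_role(
    namespace="ml-workloads", body=role,
)
```

The point isn't this specific role; it's that granting the minimum verbs per namespace should be the automated default, not a hardening task left for later.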
In the end, automation should handle the known, while humans handle the unexpected. The best cloud-native infrastructure strikes the right balance—automating what should be automated while keeping humans in the loop for strategic decision-making, security, and optimization. Companies that get this balance right will truly unlock the full potential of cloud-native architectures, while those that don’t will either struggle with inefficiency or, worse, learn the hard way when automation fails in production.
AI and automation are fundamentally reshaping business operations. What do you think are the most overlooked aspects when enterprises transition to AI-driven cloud infrastructure?
AI and automation are reshaping business operations at an incredible pace, but let’s be real—most enterprises think flipping the AI switch magically solves everything. In reality, the transition to AI-driven cloud infrastructure is filled with hidden pitfalls, and the most overlooked aspects usually come down to data readiness, cost efficiency, and trust in AI-driven decision-making.
First, garbage in, garbage out still holds true. Many companies rush to deploy AI models without ensuring their data pipelines are clean, structured, and actually useful. AI isn’t a magic wand—if the data is biased, inconsistent, or lacks proper governance, no amount of fancy ML algorithms will fix it. I’ve seen enterprises pour millions into AI projects, only to realize their biggest bottleneck was the lack of a scalable data ingestion and processing strategy.
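A scalable ingestion strategy starts with cheap gates like the one sketched below (pandas, with a hypothetical schema):

```python
import pandas as pd

REQUIRED_COLUMNS = {"user_id", "event_ts", "label"}  # hypothetical schema

def validate_batch(df: pd.DataFrame) -> list:
    """Gate at ingestion: reject batches that would quietly poison a model."""
    issues = []
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        return [f"missing columns: {sorted(missing)}"]
    if df["label"].isna().mean() > 0.01:
        issues.append("more than 1% of labels are null")
    if df.duplicated(subset=["user_id", "event_ts"]).any():
        issues.append("duplicate events detected")
    return issues

batch = pd.DataFrame({"user_id": [1, 1], "event_ts": [10, 10], "label": [0, 1]})
print(validate_batch(batch))  # -> ['duplicate events detected']
```

None of this is fancy ML; it's the unglamorous plumbing that decides whether the fancy ML ever works.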
Second, cost efficiency in AI-driven cloud infrastructure is still a wild west. Kubernetes and cloud providers make it easy to spin up large-scale AI workloads, but without proper guardrails, those GPU clusters start burning cash faster than a high-frequency trading bot on caffeine. At Splunk, I worked on an initiative to optimize cloud resource utilization, saving the company $3M annually by right-sizing workloads and automating compute selection. Enterprises often underestimate the cost of inefficiencies, assuming AI automation will “optimize itself”—but without cost-aware automation, companies end up with an expensive science project instead of a sustainable AI platform.
Finally, trust and reliability in AI-driven decision-making is, I think, the most critical and most difficult problem to solve. AI automation isn’t just about running scripts that AI has created; it’s about ensuring the right changes are made even when humans aren’t in the loop. Many companies assume that AI will make the right decisions based on general patterns, but those decisions might not fit their use cases, which are specific and different for every company and team. The best AI deployments are reliable and interpretable, and they come with guardrails to ensure that automation enhances stability rather than introducing new risks.
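One concrete shape a guardrail can take is a risk-gated approval flow, sketched here with made-up names and thresholds:

```python
from dataclasses import dataclass

@dataclass
class ProposedChange:
    description: str
    risk_score: float  # 0.0 (safe) to 1.0 (dangerous), from a scoring model

AUTO_APPROVE_THRESHOLD = 0.3  # illustrative value; tune per team

def apply_change(change: ProposedChange) -> str:
    """Auto-apply only low-risk changes; route the rest to a human."""
    if change.risk_score <= AUTO_APPROVE_THRESHOLD:
        return f"auto-applied: {change.description}"
    return f"queued for human review: {change.description}"

print(apply_change(ProposedChange("restart idle worker", 0.1)))
print(apply_change(ProposedChange("resize prod database", 0.8)))
```

The threshold itself matters less than the existence of the gate: automation handles the routine, and a human signs off on anything that could hurt.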
Ultimately, enterprises that blindly jump into AI-driven cloud infrastructure without addressing data quality, cost governance, and AI reliability are setting themselves up for a rude awakening. The companies that succeed will be the ones that balance automation with intelligent human oversight, build scalable data strategies, and ensure AI-driven decisions are both explainable and trustworthy.
Given your experience mentoring and judging hackathons, what qualities or innovations in AI and cloud projects tend to stand out the most to you? What advice would you give to early-career engineers aiming to break into this space?
The best hackathon projects aren’t the ones that just look impressive for a two-day demo—they’re the ones that have the potential to become real products. What stands out to me the most in AI and cloud projects is when teams focus on solving a real problem with innovation and simplicity rather than just chasing the latest tech trends. The most successful projects use AI and cloud technologies as tools, not just buzzwords, to create solutions that are efficient, scalable, and easy to use.
Innovation in hackathons isn’t about complexity—it’s about finding the simplest, most elegant way to solve a hard problem. I’ve seen projects that leverage AI for automation in cloud workflows, build lightweight AI inference systems on edge devices, or rethink how Kubernetes manages ML models—all by keeping the solution focused, clear, and easy to scale. The teams that win and go beyond the hackathon stage are the ones that don’t over-engineer but instead focus on what truly adds value.
For early-career engineers, my biggest advice is to focus on fundamentals and solving real problems, not just following trends. Instead of starting with the latest buzzword technology, start with the problem itself—then determine the best technology to solve it efficiently. The best engineers don’t force AI, blockchain, or any trending tech into their projects just for the sake of it; they treat technology as a tool, not the end goal. True innovation comes from understanding the problem deeply and using the simplest, most effective solution to solve it at scale.
Leadership in technology is more than just technical expertise—it’s also about vision and execution. What has been your approach to leading engineering teams effectively, particularly in high-pressure, mission-critical environments?
That is correct: leadership in technology is significantly more than technical expertise. It is about balancing agility with resilient delivery. As a Principal Engineer leading a team of seven engineers, my focus is on setting the right culture and technical standards, which allows us to move fast without breaking things along the way.
First, clarity is everything. High-pressure situations demand precise execution, and that starts with well-defined priorities. My team follows Agile methodologies, ensuring we have tight feedback loops through daily stand-ups, sprint planning, and retrospectives. For critical changes, my team and I always begin with a one-pager or ERD, which sets a clear design direction from the start; making the right design choices early prevents costly rework later. During an incident, uncertainty breeds anxiety, so everyone must understand the intent behind the team's decisions, why they matter, and how they fit into the broader system.
Second, I believe in building a robust engineering ecosystem that supports efficiency at scale. That means designing systems with multi-stage testing environments: unit, integration, acceptance, performance, UAT, and even chaos testing. We don’t just ship code; we battle-test it. The goal? Find as many failures as possible before they find us in production. It’s all about removing ambiguity, automating what we can, and ensuring our CI/CD pipelines always deliver well-tested changes quickly, so that engineers spend more time solving problems and less time debugging deployment issues.
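As one example of what battle-testing can look like, here is a minimal chaos experiment using the Kubernetes Python client (the namespace and label are hypothetical, and this should only ever point at a staging cluster):

```python
import random

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

NAMESPACE = "staging"  # never production

# Delete one random pod and let the Deployment controller replace it; the
# test is whether recovery happens cleanly and alerts fire as expected.
pods = v1.list_namespaced_pod(
    NAMESPACE, label_selector="app=workflow-engine"
).items
if pods:
    victim = random.choice(pods)
    v1.delete_namespaced_pod(victim.metadata.name, NAMESPACE)
    print(f"killed {victim.metadata.name}; now watch recovery and alerting")
```

Tools like Chaos Mesh or LitmusChaos do this in a far more controlled way, but even a ten-line experiment teaches you whether your system heals itself.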
Third, execution isn’t just about tools—it’s about engineering culture. Code reviews aren’t just checkboxes; they’re knowledge-sharing sessions. I encourage everyone on my team to review the code of every other member. Engineers aren’t just writing code; they’re designing solutions that will live and evolve beyond them. I foster a collaborative, high-trust environment where engineers feel ownership over their work but also know they have support when things go sideways.
And lastly, leadership in high-stakes environments is about staying composed under pressure. Things will break from time to time, and that is okay too! My learning from such experiences has been that every incident is an opportunity to learn, strengthen our systems, and put in enough safeguards that we don’t make the same mistakes again. The end goal is continuous improvement, always tending toward perfection.
The intersection of AI, cloud, and automation is rapidly redefining the future of work. What shifts do you foresee in the roles and skills required for engineers in the next five to ten years?
The next five to ten years will see a fundamental shift in engineering roles and required skills as AI, cloud, and automation continue to reshape the landscape. While access to information and AI-powered development tools are making coding easier, the core skills of critical thinking, problem-solving, and system design will remain invaluable. The role of an engineer will evolve far beyond just writing code—it will encompass market research, product strategy, and full-stack development, all augmented by AI.
I think traditional software engineers will evolve into “product builders”, blending engineering, design, and business thinking. AI-generated code will handle routine programming tasks, allowing engineers to focus on architecture, usability, and market fit. Future software engineers won’t just be coding; they’ll be building entire product experiences, optimizing workflows, and integrating AI-driven decision-making into every aspect of the software lifecycle.
Yes, code generation, testing, and infrastructure management will be highly automated. Engineers will spend less time debugging syntax errors and more time orchestrating AI-driven systems. This will blur the lines between engineering, design, and business strategy. Engineers will need to understand user behavior, market trends, and the product lifecycle to build solutions that are not only technically sound but also commercially viable.
Also, with AI generating and optimizing code, testing and security will require a new approach. Automation will play a key role: engineers will need to design automated testing suites that validate AI-generated outputs, ensuring robustness, security, and compliance.
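A small sketch of what that can look like: treat AI-generated code as untrusted input and assert invariants before it ships (the generated function here is a stand-in):

```python
# `generated_slugify` stands in for output from a code-generation model.
def generated_slugify(title: str) -> str:
    return "-".join(title.lower().split())

def test_slugify_invariants():
    """Validate behavior, not authorship: the bar is the same for AI code."""
    out = generated_slugify("Hello World 2024")
    assert out == out.lower(), "slugs must be lowercase"
    assert " " not in out, "slugs must not contain spaces"
    assert generated_slugify("") == "", "empty input must not crash"

test_slugify_invariants()
print("generated code passed invariant checks")
```

The example is trivial on purpose: the shift is cultural, where tests become the contract that AI-generated code must satisfy.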
In the end, computer science is about solving complex problems with computers, and that is not going away even with AI. Critical thinking and problem-solving skills, which are core to the field, will remain essential. Engineers who can break down complex problems and design elegant solutions will be in the highest demand.
In your blog and conference contributions, you emphasize digital resilience. How can enterprises build a more resilient AI-driven infrastructure in a world increasingly vulnerable to system failures?
As AI workloads scale rapidly across the industry, system failures are inevitable, and digital resilience becomes a key metric that will make or break businesses. Enterprises investing in AI-driven infrastructure must ensure that their systems are fault-tolerant, scalable, and capable of recovering from failures gracefully. I’ve explored this topic extensively in my research paper, Fault-Tolerant Distributed ML Frameworks for GPU Clusters: A Comprehensive Review, as well as on my Medium blog and my website, where I discuss key strategies for making AI infrastructure more resilient to failures.
AI models aren’t just computationally expensive; they can also break easily. A single GPU failure can cost hours of training time if proper checkpointing mechanisms aren’t in place. In my research paper, I discuss distributed training strategies extensively: how AI systems can recover from node failures, memory leaks, and hardware crashes without restarting from scratch.
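The recovery side of checkpointing can be as simple as a retry loop around the training step; here is a sketch that reuses the hypothetical save_checkpoint/load_checkpoint helpers from the earlier example:

```python
def train_with_recovery(model, optimizer, run_epoch, max_epochs, max_retries=3):
    """On a transient failure (e.g., a lost GPU), reload the last checkpoint
    and continue rather than restarting the whole job. Assumes the
    save_checkpoint/load_checkpoint helpers sketched earlier."""
    epoch = load_checkpoint(model, optimizer)   # resume point; 0 if fresh
    failures = 0
    while epoch < max_epochs:
        try:
            run_epoch(model, optimizer, epoch)  # one epoch of real training
            save_checkpoint(model, optimizer, epoch)
            epoch += 1
            failures = 0                        # reset after a clean epoch
        except RuntimeError:                    # CUDA errors raise RuntimeError
            failures += 1
            if failures > max_retries:
                raise                           # persistent failure: escalate
            epoch = load_checkpoint(model, optimizer)
```

Real distributed frameworks layer elastic scheduling and cluster-level restarts on top, but the core idea is exactly this: never lose more than one checkpoint interval of work.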
In my Medium blog, I outline how Kubernetes-based AI workloads face new challenges in multi-cluster, multi-cloud deployments. Applications built on deep learning models such as LLMs need heavy compute, resilient data pipelines, and reliable networks, but each of these dependencies also adds a point of failure. To handle these risks, it is critical to invest in observability, tracing, and alerting to detect failures and resolve them with automation. For example, chaos testing of AI systems, which intentionally introduces failures in staging environments, ensures that infrastructure is resilient before it reaches production.
The companies that prioritize AI resilience will be the ones that scale efficiently, reduce downtime, and build AI systems that succeed.