Private Cloud AI Platform Engineering

Given at SREday Amsterdam on

A run through what platform engineering looks like when AI is in the picture, with a private-cloud bias. Covers the new things ops and platform teams will be doing for AI, why most large organizations are running this on premises, what a real platform looks like vs a blinking cursor, and the two things to do right now: a gateway and frameworks people will actually use.

Transcript

What's actually a priority right now

What I'm going to go over is what, at least as far as I can determine, seems like we know about doing platform engineering, or operations, for AI in private cloud. This is important now because people want it, which is always a good reason to do something. A lot of you are being asked to sort it out, or you think it's super cool and you want to do it on your own - which is also a great reason for a technologist to do things, especially if you can give the mess to someone else when you lose interest.

I'm an application developer, or I used to be one, so I just lump operations, SRE, DevOps, and platform engineering all together into "stuff I use." But it's worth knowing what's actually valuable. Here's a recent Forrester survey asking management types what's important. Multi-bar charts are complicated to read, but if you collate them, security is always at the top - which is like asking what your number-one priority with food is. It should taste good. You're not going to answer "security? Nope, not a priority." Take that for granted. The interesting stuff is below it: lots of interest in AI, but what dominates is the old crap people have to deal with all the time - modernization, maintenance, just keeping things up and running. If you're picking what's most valuable to your organization for the next 12 months, it's probably upgrading your stuff. But in the meantime, why not some AI? That'd be fun.

Forrester chart of top IT priorities, with modernization, maintenance, and security ranking above AI initiatives.
Source: "Modernize Or Fall Behind: Rethinking IT Infrastructure For A Competitive Edge," Forrester, commissioned by Broadcom, n=216, June 2025, published October 2025.

The new things platform engineers will do for AI

Here's a summary of what operations and platform-engineer types will probably focus on, drawn from talking with Tanzu customers and looking at what other people have written. The question marks are things I'm not sure about. Some of these are obvious: hosting models - or, if you want to be all technical and cool, "inference" - so people can just use them. Application frameworks. Registries for plugins. If you've ever hosted a platform before you'll look at this list and go "ah yes, every single thing I've ever done in a platform." There are a few things particular to AI: notably the extension/plugin layer - whether that's Model Context Protocol or whatever ends up generalizing it. And of course you all are going to be in charge of the cost controls, observability, and the AI-FinOps-BizSRE-DevOps-Eng stuff that's going to emerge. Get ready for that.

Two-column list of new things platform engineers will likely do for AI: hosting models, gateways, application frameworks, curating models, self-service model access, AI FinOps, eval and testing, audit and compliance, data access, registries for MCP, prompts, and integrations.
New things platform engineers will likely do with and for AI.
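The cost-control item on that list can start very small. As a rough sketch, assuming the platform already records (team, model, tokens) usage somewhere, a per-team showback report is just a rollup. The model names and prices here are made-up placeholders, not real rates.

```python
from collections import defaultdict

# Hypothetical internal prices, in cents per 1,000 tokens.
PRICE_CENTS_PER_1K = {"local-llama": 20, "hosted-large": 300}


def showback(usage):
    """Sum cost in cents per team from (team, model, tokens) records."""
    totals = defaultdict(int)
    for team, model, tokens in usage:
        # Integer math keeps the report free of float rounding noise.
        totals[team] += tokens * PRICE_CENTS_PER_1K[model] // 1000
    return dict(totals)


usage = [
    ("payments", "local-llama", 5000),
    ("payments", "hosted-large", 1000),
    ("search", "local-llama", 10000),
]
report = showback(usage)
```

Whether the numbers feed real chargeback or just a dashboard is an organizational choice. The point is that the usage records have to exist first, which is one more argument for routing everything through one place.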

As for roles and responsibilities, this is a good early cut. The top is what platform teams already do. The bottom rows are where AI changes the picture. Developers develop - they use access to the models they're given, write apps around them, tune to their users. Platform/AI teams run the models, keep them updated, handle controls and costs, do the autoscaling and the integrations - just like any other middleware. Where things get up in the air is on the optimize side: who decides which models to use, who fits them to a purpose, who decides when to update them. Is that an operations person or a data scientist - whatever that means? It's a very cool term. Maybe you should rename yourself a data-scientist reliability engineer. They're the subject-matter experts, so maybe they're the ones who start to do it.

Three-column matrix - Develop, Operate, Optimize - across rows for app teams, platform teams, and AI-specific responsibilities like self-service model access, model running, and continuous model curation.
A platform treats AI like any other service: security, reliability, middleware, new models, and dev frameworks. Develop / Operate / Optimize across app teams, platform teams, and AI-specific tasks.

Why private cloud

I work at Tanzu, formerly Pivotal, formerly VMware, now Tanzu by Broadcom - almost 11 years come January. I've been lucky to work with large organizations figuring out how to do their software better, which is a handwavy way of putting it. I've written a few books and used to work at analyst firms - RedMonk and 451 - and I'm originally from Austin, where you have to serve a compulsory two-year term at Dell. I did M&A and corporate strategy there in 2010-2011, when they wanted to do software and public cloud, which was thrilling at a hardware vendor.

For homework: Nick Kuhn's "Tales from Production" at VMware Explore 2025; Manjunath Bhat's Gartner work on platform teams scaling generative-AI delivery; Patrick Debois on why AI needs a platform team and on AI platform engineering; Angie Jones from Block on operationalizing MCP at scale, which is the surprising one - they have normals using AI and they go over how they're running it; Abi Noda and Laura Tacho from DX on the evolving role of platform teams in the AI era; and Keith Townsend's "The Emperor's New GPT - Why Your 'Custom AI' Is a Demo, Not a Product." I haven't watched all of them, but the Gradle conference also has several great talks on how people are delivering AI inside their organizations, mostly for programmers.

Why the private-cloud angle? Partly because that's what we're interested in where I work, but that's not satisfying on its own. If you look at IDC's data and at what people in larger organizations are actually saying, a lot of them are thinking about running AI on their own - not just hitting publicly hosted models. Whatever "premises" means or wherever it's located, the point is they want lots of control over how AI is used, not just developers going directly to it. Back home in America no one cares about this, but the trustworthiness of America as a premises to run your stuff in has been steadily plummeting, which is fantastic for people like me. Whether you call it sovereign cloud or "not-America cloud," there's a lot of interest. There are also other reasons to have it on your own: I was talking with some banks in London who are very interested in the scenario where all of their IT shuts down because of hacks and they need to stand it up from scratch. There are all these scenarios that make private cloud interesting for AI.

IDC chart showing on-premises AI infrastructure adoption balancing innovation and security.
Source: "On-Premises AI Infrastructure Balances Innovation and Security," IDC White Paper sponsored by Broadcom, doc #US52747024, December 2024 (n=411, conducted July 2024).

What is a platform

Whenever you talk about platform engineering you have to have the famous Evan Bottcher quote from 2018: a digital platform is a foundation of self-service APIs, tools, services, knowledge and support, arranged as a compelling internal product, so that autonomous delivery teams can deliver product features at a higher pace with reduced coordination. Done. For a better visual, here's the CNCF Platforms reference architecture from a few years ago. People don't call this platform-as-a-service anymore - that's a terrible word you're not supposed to use, except when you want to. But it's basically what you'd expect: if I'm supporting people writing applications, what's all the stuff they need - services, middleware, configuration, dependency mapping, infrastructure interfaces.

CNCF Platforms reference architecture diagram of capabilities a platform provides.
Sources: CNCF Platforms White Paper, March 2023; VMware Tanzu.

What makes platform engineering different from every previous go at this - probably since Babbage - is that we should go up the stack and integrate tightly with portals, the build tools developers use, the way they look around for APIs and other projects. You can't just search Google and go to GitHub to figure out how to interact with the payments team. There's a whole discovery layer. There's a great talk from a couple of years ago at DevOpsDays Amsterdam from the bol.com people that goes over this - the Backstage-y internal-developer-portal layer, if you want to dive in.

Here's how people are thinking about a platform with AI capability running through it. This is from the people who said DevOps was dead - Gartner - giving you a cross-cut of all the activities going through a platform that involve AI: data access, an orchestrator controlling the various AI things, the close-to-developer portal/IDE layer. If you want the all-in view of the technical boxes that get added to a platform, read the articles cited at the bottom of the slide.

Gartner diagram of platform layers and AI-specific capabilities for generative-AI application delivery.
Source: "How platform teams can help scale generative AI application delivery," Manjunath Bhat, Gartner, PlatformCon 2025, June 2025.

And if you're a management type who doesn't immediately tune out when "Gartner" shows up: that talk goes over the process, strategy, and layers in the platform area. What's nice about Gartner, if you work at a large organization, is that large organizations are who they talk to multiple times a day. They get a sense of what happens in regular large organizations that do more than let you show your friends what kind of sandwich you've eaten.

Two things you should do right now: a gateway, and frameworks people will actually use

So here are two pretty obvious things you should be doing right now if you're going to run AI in a platform way. First: developers, in my experience, are a little too fun-loving. They don't mind looking stupid. They don't really care about much beyond writing their applications. If you're worried about security or runaway things or great unknowns being brought into your system - they don't really care to be told they shouldn't be doing it. So there are plenty of developers using these things already. Find the ones who complain about your virus scanners and I bet they're hooked into public AI services right now. The lesson from the past: if you don't find something developers like and start managing it - don't use the word "controlling" - you'll lose very quickly.

Step one is a gateway or broker. Before you lose track of these crazy developers, put a gateway in place and make sure they go to it. You're controlling where the gateway goes. The contract is: it won't be terrible, they'll be able to access the kinds of things they want, it'll be sanctioned, it'll be easier than DIY. There are several gateways available now. Putting one in place early is way easier than reining developers back to it later.
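To make the gateway idea concrete, here is a toy sketch of the decision it makes on each request: is this model sanctioned, does this team have budget left, and where does the call go. It is not any particular product's API. The class, model names, URLs, and budget numbers are all hypothetical, and a real gateway would also handle authentication, streaming, and retries.

```python
from dataclasses import dataclass, field


@dataclass
class Gateway:
    """Toy AI gateway: one sanctioned entry point in front of model backends."""

    routes: dict   # model name -> backend URL (hypothetical values)
    budgets: dict  # team name -> remaining token budget
    audit_log: list = field(default_factory=list)

    def route(self, team: str, model: str, est_tokens: int) -> str:
        """Return the backend URL for a request, or raise if it is not allowed."""
        if model not in self.routes:
            raise PermissionError(f"model {model!r} is not sanctioned")
        remaining = self.budgets.get(team, 0)
        if est_tokens > remaining:
            raise RuntimeError(f"team {team!r} is over its token budget")
        # Deduct the estimate and record who asked for what.
        self.budgets[team] = remaining - est_tokens
        self.audit_log.append((team, model, est_tokens))
        return self.routes[model]


gw = Gateway(
    routes={"local-llama": "http://vllm.internal:8000/v1"},
    budgets={"payments": 1000},
)
backend = gw.route("payments", "local-llama", 400)
```

Because developers only ever see the gateway's address, you can move a model from an external provider to your own hardware by changing one entry in `routes`, without touching their code.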

The same logic applies to MCP. Whatever the plugin/extension layer ends up being, run those things in severely locked-down VMs or containers - maybe with no network access except one little thing. If you've played with Claude Code's skills, that's exactly what it does, and it's kind of brilliant. It's not MCP, but it gives you an idea of how to architect securely contained execution. And then on the private-cloud point: you can host your own model serving, do your vGPUs, shard things out, all the thrilling stuff. It's available now. Worth standing up so developers don't run wild.
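For the locked-down container part, one way to picture it is as the flags you would pass to `docker run`. The sketch below only builds the command line and does not run anything. The image name is hypothetical, the resource limits are arbitrary starting points, and this illustrates the containment idea rather than how any specific MCP host or Claude Code launches things.

```python
def sandboxed_run_command(image: str, network: str = "none") -> list:
    """Build a `docker run` argv for a tightly restricted tool container.

    `network` defaults to "none" (no network at all); pass the name of a
    restricted Docker network if the tool needs that "one little thing".
    """
    return [
        "docker", "run", "--rm", "-i",
        "--network", network,                   # no network unless one is named
        "--read-only",                          # root filesystem is read-only
        "--cap-drop", "ALL",                    # drop all Linux capabilities
        "--security-opt", "no-new-privileges",  # block privilege escalation
        "--pids-limit", "64",                   # cap the number of processes
        "--memory", "256m",                     # cap memory use
        image,
    ]


cmd = sandboxed_run_command("registry.internal/mcp/ticket-search:1.0")
```

Passing `cmd` to `subprocess.run` would start the container with stdin attached, which suits tools that talk over standard input and output. Starting from "nothing allowed" and opening up one thing at a time is easier to reason about than trimming a permissive default.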

Architecture diagram showing Spring AI App, Python, and TypeScript app runtimes on the left connecting through MCP, an AI Gateway, and Tanzu AI Server on Tanzu GenAI; on the right, vLLM and Ollama running on VPAIF on top of VMs and containers with CPU and GPU pools, alongside Postgres, GemFire, and Greenplum data services and external model providers Anthropic, Gemini, and OpenAI.
Conceptual architecture: Spring AI App, Python, and TypeScript runtimes connect through MCP and an AI Gateway to Tanzu AI Server, with VPAIF hosting vLLM and Ollama on GPUs and routing to external providers (Anthropic, Gemini, OpenAI). More: "Tales from Production - Debugging LLMs and GenAI Apps on VMware Tanzu Platform," Nick Kuhn, VMware Explore 2025, No. CODEQT1641LV, August 2025.

Meet developers where they are

When you deliver infrastructure to developers - even the coolest models, inference, and APIs - you're basically giving them a blinking cursor. They have to bring their own framework, methodology, and scaffolding. So think about what you can do beyond the cursor to guide them. That's the philosophy of meeting developers where they are.

In larger organizations and private-cloud setups, be aware of which languages are pervasive. JavaScript and Python are extremely popular, but the other two languages that usually put people to sleep when you talk about them - Java and C# - are pervasively used in large organizations. So when you're considering frameworks to provide, ask which ones have easy integration and compatibility with how you want to run AI. You don't want to leave just a blinking cursor, because then developers layer all sorts of stuff on top.

That ties to platform-as-a-product, an older notion exemplified by Thomas Müller at Mercedes-Benz: "We are building this platform not for us, we are building it for Mercedes-Benz developers." He used to run their Cloud Foundry platform and now runs Kubernetes too - a thrilling comparison to be running in his head. The point: you treat developers as customers. You're making things secure, reliable, performant - and you're product-managing the capability. Are you talking to your customers frequently? Are you observing whether what you provided is useful? When you get feedback, are you tuning and improving?

Speaking of "know your customer" - which usually means something else in banking - there's a new type of customer for platforms. You still have programmers who want public code-generation tools or want to instrument their workflow with little MCP things. There are data scientists running Jupyter notebooks who need a platform too, or will very soon. There's the obvious case: writing applications that include AI features. And then there's a new category: the normals - or, if you're a Harry Potter fan, the muggles - non-technical people using something like Goose on the desktop, or that web-based one with the tragic name that doesn't tell you what it is. In a private-cloud context, people are being given "here's a chat app, hook it up to your own models." That's a new constituency for ops/platform-engineer types to support. Worth checking out the Angie Jones Block talk to dive into what that looks like.

Three-column breakdown: Chat for normals (your own ChatGPT, customer service, better search, chat-as-UI); Programming (new and old code, SDLC juicing, traditional data science, making pptx); AI in apps (sales assistants, sloppy integration, science-ing).
Three buckets of AI users a platform serves. Sources: Tanzu customers; "AI at Goldman," FT, September 14, 2025; "Leverage Generative AI to Streamline the Software Development Lifecycle," Banu Parasuraman and Andrew Berenato, VMware Explore 2025.

Avoiding centers-of-excellent-bottlenecks

Probably in your organization there's a center of excellence for anything new and excellent. We used to have cloud centers of excellence. Now there are AI centers of excellence, or roundtables of architects, or policy people. Their job - well, what ends up happening - is they become centers of excellent bottlenecks, slowing things down rather than letting them go at a pace.

If you want to speed that up, two groups should get involved at minimum. First, the security people. Even though they can be a little ornery, they're actually pretty good at this kind of process - understanding something, doing the risk modeling, making a ruling. It's what they do. Second, everyone's favorite: legal. Right after procurement on the popularity charts. There's a lot of weird ambiguity in laws around AI, so legal is going to want in. Get them involved early in the center of excellence. At several organizations, the mix of those two roles has actually sped things up more than people expected. Then you can start approving the models you use and how you want to use them.

When you don't know what you're doing, do a lot of it quickly

Remember the digital-transformation era? Airbnb was going to destroy all hotel chains, Google was going to destroy banking, Tesla was going to destroy basically every company, and tech companies in general were going to destroy the existing ones. I don't know about you, but I've been enjoying none of the companies that existed in 2015 existing anymore. I just work with the new ones - which is to say, they all adapted and did just fine.

What we learned back then: if you don't really know what you're doing, you should do a lot of it quickly. In a structured way. Most organizations aren't sure what they're going to do with generative AI, so set up a sandbox - a practice, a way to iterate quickly on ideas. Thankfully, that's what platforms are built for: a structured environment that lets developers focus on the applications and business ideas instead of the infrastructure, while giving you tremendous control. Platforms have a great role to play here.

Lucky enough to have it already?

Some people are lucky enough to have this stuff already in a data center somewhere. I was here at an internal ING event recently kicking off the VMware, Tanzu, and VCF stack as a way of doing this. If you're at ING, find that person and ask about testing it out. If you don't know them, I can tell you who they are. If you're not at ING, you can get our platform - I'm not supposed to say "free," but you can get a 90-day trial without paying for it, so I'll let you figure out how much that costs. Try it at TryTanzu.ai. It'll be fantastic.