The Local AI Option


I keep running across the same kind of post on social media: someone on a flight at 35,000 feet, screen open to a coding session, running a local LLM on their laptop with no internet. And another person posts a screenshot demoing Gemma 4 running on a phone.

A year ago, this conversation was theoretical for most people. It isn’t anymore.

The piece that pushed me from “interesting” to “worth writing about” is the recent Mistral Medium 3.5 release: “It performs strongly in real-world use, with self-hosting possible on as few as four GPUs.” That’s a single rack in a server closet. Add Gemma, LLaMA, Qwen, and the rest of the open ecosystem, and the conversation is no longer about toy models running slowly but about production-grade systems a small team could actually use to do real work.
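To make “self-hosting” concrete: from the application side, a local deployment ends up looking like any other model API, just pointed at your own hardware. Here’s a minimal sketch, assuming an inference server such as vLLM or Ollama is already running on localhost and exposing an OpenAI-compatible endpoint; the port and model name are placeholders, not anything specific to Mistral’s release.

    # Minimal sketch: calling a locally hosted open model through an
    # OpenAI-compatible endpoint. Assumes an inference server (e.g. vLLM
    # or Ollama) is already running; port and model name are placeholders.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:8000/v1",  # your hardware, not a vendor's
        api_key="not-needed",                 # local servers typically ignore this
    )

    response = client.chat.completions.create(
        model="local-model",  # whatever name your server registered
        messages=[{"role": "user", "content": "Summarize our deploy runbook."}],
    )
    print(response.choices[0].message.content)

Nothing about the calling code changes when the model moves in-house, which is part of why this is an architectural decision rather than a rewrite.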

I’m a heavy Claude Code user, and I’m not predicting a mass exodus from APIs. Most companies should stay exactly where they are. But the math is moving, and it’s time for some of them to start looking at their options.

Back in the day (yeah, I’m old), companies were racing from on-premise data centers to the cloud. That was the dominant story for the better part of a decade. Then a quiet thing started happening: some companies looked at their actual usage, did the math, and started bringing workloads back on-prem. Once they knew what they were actually consuming, the cost curve flipped for their specific situation, and the answer changed. The cloud wasn’t “wrong”; it just no longer made sense for them.

That same pattern is happening with AI right now. Cursor, Copilot, and Claude pricing has been changing, and usage limits are tightening in some cases. If you’re a company spending six or seven figures a year on AI APIs across a developer team or a product, the question of whether a local setup makes sense is suddenly worth a real spreadsheet.
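Here’s what the first tab of that spreadsheet looks like, as a minimal sketch. Every number below is a made-up placeholder; the point is the shape of the calculation, not the figures.

    # Back-of-the-envelope break-even. All figures are hypothetical --
    # plug in your own API bill and hardware quotes.
    monthly_api_spend = 60_000   # current API bill, $/month (placeholder)
    hardware_capex = 400_000     # GPUs, servers, networking, $ (placeholder)
    monthly_opex = 15_000        # power, cooling, staff time, $/month (placeholder)

    monthly_savings = monthly_api_spend - monthly_opex
    if monthly_savings <= 0:
        print("Local never pays back at these numbers; stay on the API.")
    else:
        breakeven_months = hardware_capex / monthly_savings
        print(f"Break-even in {breakeven_months:.1f} months")  # ~8.9 here

If the break-even lands inside the useful life of the hardware, the conversation is worth having; if it doesn’t, it isn’t.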

Notice I said “some companies.” Most companies shouldn’t bother. The shortlist of who this is actually for:

  • You have an IT or platform team that runs its own infrastructure today. Not a vendor managing it for you; your people.
  • You have real capital budget for a GPU rack (or racks) and the cooling, networking, and rack space that goes with it.
  • You have engineers who can deploy, monitor, and patch an inference stack without it becoming a side project nobody owns.
  • You have actual AI workloads in production, not pilots. You know what you’re using, how often, and what it’s worth.
  • You’re already spending enough on AI APIs that the capital expenditure math gets interesting.

Miss any of those, and the cloud is generally still the right call.

For the companies that check those boxes, what do you actually get on the other side?

You get control. Data stays inside your network and your compliance posture is something you can audit, instead of trusting a vendor’s SOC 2 report. Monthly costs become predictable in a way API spend isn’t. Availability doesn’t depend on someone else’s status page. And if your workloads are real and steady, the unit economics start looking very different from per-token billing.

None of this is free. The capital outlay is real, and so is the operational load: you’re trading one cost shape (a usage-based bill that scales with consumption) for another (capex plus a team that knows how to run inference at scale). That’s not always a better trade, but it’s one some companies should at least be running the numbers on.
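To see those two cost shapes side by side, here’s a sketch comparing a per-token bill against a roughly flat local cost (opex plus the capex above amortized over three years). The rates are illustrative, not real vendor pricing.

    # Two cost shapes: per-token billing scales with usage; a local
    # stack is roughly flat. All rates are illustrative placeholders.
    API_RATE = 10.0       # $ per million tokens (hypothetical blended rate)
    LOCAL_FLAT = 26_000   # $/month: opex + capex amortized over 36 months

    for mtok in (100, 1_000, 5_000, 10_000):  # millions of tokens per month
        api_cost = mtok * API_RATE
        winner = "local" if LOCAL_FLAT < api_cost else "API"
        print(f"{mtok:>6}M tok/mo: API ${api_cost:>9,.0f} vs local ${LOCAL_FLAT:,} -> {winner}")

At low or spiky usage, the per-token bill wins easily; the flat line only makes sense once the volume is real and steady, which is exactly the “actual AI workloads in production” condition above.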

The infrastructure and skills bar is high enough that for the average business, the API path is correct. What’s changed is that for the companies who do clear that bar, local AI stopped being a research project and became a real architectural option.

I don’t know how many companies will actually end up running local stacks. Probably not many, at least not for a while. But for the ones with the team and the AI spend to consider it, this is a real conversation now.
