
I Tried to Buy a Mac Studio This Week. Here's What Happened.

George Pu · 5 min read


I've been running our AI stack on a rented L4 GPU on Google Cloud.

$700 a month. Serves a Qwen 14B model. Fine for production.

Not fine for what I want to do next.

I want to self-host a 70B open-source model on my own hardware. Run it locally. Write about the whole process in public. Eventually help other founders do the same.

The plan was simple. Mac Studio. M4 Max. 128GB unified memory. Quiet enough to sit on a desk. Fast enough to run Llama 3 70B or Qwen 72B. Cheap enough to expense, not capitalize.
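Why 128GB? A quick back-of-envelope sketch shows where a 70B model lands at different quantization levels. The bytes-per-weight figures and the 20% runtime overhead are my rough assumptions, not Apple's or Meta's numbers:

```python
# Rough memory-footprint estimate for hosting a 70B-parameter model locally.
# Bytes-per-weight values are approximate quantization sizes (assumption);
# overhead covers KV cache and runtime buffers (also an assumption).
PARAMS = 70e9

def model_gb(bytes_per_weight: float, overhead: float = 1.2) -> float:
    """Estimated resident size in GB: weights plus ~20% headroom."""
    return PARAMS * bytes_per_weight * overhead / 1e9

for name, bpw in [("FP16", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    print(f"{name}: ~{model_gb(bpw):.0f} GB")
# FP16:  ~168 GB  (doesn't fit in 128GB)
# 8-bit: ~84 GB   (fits, barely, with the OS alongside)
# 4-bit: ~42 GB   (comfortable)
```

So 128GB of unified memory is the floor for running a 70B model at useful precision with room left for the OS, and the 24GB base config is a non-starter.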

I opened the Apple Store.

"Available."

"Available" Doesn't Mean What You Think

That was the trap.

"Available" meant the base config. 24GB. 12-core CPU. Ships in two to three weeks. Useless for inference on anything meaningful.

Every config above that — the 48GB, the 64GB, the 128GB that anyone doing real work would actually buy — gone. Not backordered. "Currently Unavailable." Apple won't even take your money.

The 256GB M3 Ultra config? Pulled from the store entirely in March.

I checked Canada. Same. Checked Germany. Three to four months on the 128GB. Checked China. Four to five months. Australia. Extended delays across the board.

This isn't a pre-launch inventory gap. It's a global memory supply crisis hitting the exact configurations that matter for AI.

The Datacenter Buildout Is Eating Your Hardware

The reason is structural.

The three companies that make Apple's unified memory — Samsung, SK Hynix, Micron — are the same three building the memory stacks inside every NVIDIA H100 and H200 shipping to datacenters right now.

And datacenter memory is three times harder to produce per bit than the memory in your Mac.

Microsoft, Google, and Meta are on track to spend $650 billion combined on AI infrastructure this year. All of it memory-hungry. All of it pulling from the same suppliers Apple depends on.

The memory makers picked their customer. It wasn't you.

Apple can't outbid the datacenter buildout. So the Mac Studio sits on a "Currently Unavailable" page while the stripped-down base config pretends everything is fine.

What X Told Me

I posted about it on X. The replies were consistent.

"Don't buy a Mini for this. You want a Mac Studio with 128GB minimum."

"Wait for M5 Ultra at WWDC in June."

"Everyone is buying these. Global RAM shortage. Good luck."

"I'm running 70B on a rented H100 for $2.99/hr. The Mac doesn't pay back."

That last one stuck with me. So I spent the next few days going deeper than I planned.

The M4 Ultra Doesn't Exist

The first thing I learned is that there is no M4 Ultra.

Apple confirmed at the March 2025 Mac Studio launch that the M4 Max chip doesn't have an UltraFusion interconnect.

Without that, you can't bond two chips into an Ultra.

Gurman reported in his November 2025 Power On newsletter that Apple killed it entirely — too costly, too few buyers.

The next high-end chip will be the M5 Ultra, rumored for a June 2026 Mac Studio launch.

So right now, if you want more than 128GB of unified memory on Apple silicon, your only option is the M3 Ultra — a last-gen chip that Apple has stopped accepting orders for in most configurations.

That leaves a 4x gap between the memory ceiling you can actually order and the configurations real work demands. And nothing in between.

The Older Chip Is Faster

Then I looked at the benchmarks and found something that genuinely surprised me.

The M2 Ultra with 192GB runs Llama 3 70B at roughly 12 tokens per second.

The newer M4 Max with 128GB? About 8.3 tokens per second on the same model.

The older chip is faster.


At 70B scale, inference speed is bottlenecked by memory bandwidth, not compute.

The M2 Ultra pushes 800 GB/s of memory bandwidth. The M4 Max tops out around 546 GB/s — it can't move the weights fast enough to keep up at that model size.

If you're shopping for a Mac to run large models and you're comparing chip generations, the spec sheet is lying to you.

The number that matters isn't on the marketing page.
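The bandwidth argument reduces to one division. Each generated token requires reading roughly every weight once, so decode speed is capped at bandwidth divided by model size. A sketch, assuming a ~40GB 4-bit quantization of a 70B model (my assumption) and the chips' published bandwidth figures:

```python
# Back-of-envelope decode ceiling for a memory-bandwidth-bound LLM:
#   tokens/sec ≈ memory bandwidth (GB/s) / model size (GB)
# because each token generation streams (roughly) all weights once.

MODEL_GB = 40.0  # Llama 3 70B at ~4-bit quantization (assumed size)

def peak_tokens_per_sec(bandwidth_gb_s: float, model_gb: float = MODEL_GB) -> float:
    """Theoretical upper bound on decode speed; real throughput is lower."""
    return bandwidth_gb_s / model_gb

print(peak_tokens_per_sec(800))  # M2 Ultra, 800 GB/s -> 20.0 tok/s ceiling
print(peak_tokens_per_sec(546))  # M4 Max,  546 GB/s -> ~13.7 tok/s ceiling
```

The measured numbers — roughly 12 tok/s on the M2 Ultra, 8.3 on the M4 Max — land at about 60% of those ceilings, in the same ratio as the bandwidth figures. That's what a bandwidth-bound workload looks like: compute barely matters.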

The Math on Renting vs. Owning

I also ran the math on renting versus owning.

A Mac Studio M3 Ultra with 192GB costs roughly $7,000.

Running the same workload on a rented H100 at $2.99 an hour, eight hours a day, costs about $8,700 a year.

The Mac pays for itself in under twelve months at daily use.

But that flips entirely below 40 hours a month. At low usage, renting is 20x cheaper.
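The break-even math above fits in a few lines. Prices are the ones from this essay; the calculation deliberately ignores resale value, electricity, and opportunity cost:

```python
# Rent-vs-own sketch: rented H100 vs. buying a Mac Studio outright.
# $2.99/hr and ~$7,000 are the figures quoted above; everything else
# (30-day months, no resale value) is a simplifying assumption.
HOURLY = 2.99      # H100 rental, $/hr
MAC_COST = 7000.0  # Mac Studio M3 Ultra 192GB, approx.

def breakeven_months(hours_per_month: float) -> float:
    """Months of renting at this usage that equal the Mac's sticker price."""
    return MAC_COST / (hours_per_month * HOURLY)

print(breakeven_months(8 * 30))  # ~9.8 months: at daily use, buying wins inside a year
print(breakeven_months(40))      # ~58.5 months: at 40 hrs/month, renting wins for years
```

The crossover is steep: the payback period scales inversely with usage, so halving your hours doubles the time the Mac takes to pay for itself.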

I'm not running inference eight hours a day right now. I'm experimenting. Building content around the process. I don't know what my sustained workload looks like yet.

Buying hardware I can't get, for a workload I haven't proven, doesn't make sense. Renting does.

What I'm Actually Doing

I'm renting an H100 in DigitalOcean's Toronto data center. $2.99 an hour.

Canadian jurisdiction. I bring my own model. Nothing proprietary leaves the stack.

I'll write about the experience from the rental side. Honestly. Not pretending supply chains don't exist.

If M5 Ultra ships at WWDC in June with 256GB+ configs that you can actually order, that's my buy window.

If it doesn't — or if lead times immediately stretch to four months — I keep renting.

The Sovereignty Spectrum Has Three Tiers

This experience reshaped how I think about what "own your AI" actually means in 2026.

For the last two years, the story has been simple. Buy a Mac. Download Ollama. Run your models locally. You're sovereign.

That story is broken. Not because the hardware doesn't work — it does. Because you can't buy it.

There's a tier one. People who bought their Mac Studio two years ago.

They own the hardware. They're running models, shipping content, making it look easy on YouTube. Good for them. They timed the window.

There's a tier two. People like me, trying to buy in this week. The Apple Store leads with the base config to make it look available.

The spec you actually need is gone. NVIDIA's DGX Spark is $2,999 and competitive — but it doesn't officially sell in Canada. This tier is broken.

And there's a tier three. Bare-metal rental. H100s on DigitalOcean Toronto, Lambda, Hetzner. Three dollars an hour.

Canadian data center if you want it. Bring your own model. This is the pragmatic sovereign path for anyone who can't hold inventory.

Most serious self-hosting in 2026 is going to happen on tier three. Not because people prefer renting. Because the hardware to buy doesn't exist on a shelf anywhere.

If You're Thinking About Buying a Mac for AI

Two things.

One. "Available" on the Apple Store doesn't mean what you think it means. Click through. Check the spec. The base config is almost never the one you want. The one you want is almost never the one that's available.

Two. If WWDC comes and M5 Ultra drops with shippable 256GB configs, that's the window. Move fast. If it doesn't, rent.

The path to owning your AI just got narrower than the marketing made it look.