

Why I'm fine-tuning a small model (and why it runs on your laptop)

George Pu


I'm training an AI model. It's going to run on a laptop.

Three weeks ago I would have told you I was training a 70-billion-parameter model, the kind of thing that needs a data center to breathe.

I'm not. I'm training a 4-billion-parameter model that runs on a Mac Mini. If the smaller one works, a larger companion model may follow. But the 4B is the bet.

This is the first post in a series where I'll share what I'm building, why, how it's going, and what breaks along the way.

The small-model bet

There are two stories about AI in 2026.

The first story is on every magazine cover.

Models are getting bigger, more expensive, more centralized.

Training a frontier model costs hundreds of millions.

Running one costs a data center.

The API is the product. You don't own it. You rent it.

The second story is quieter and moving faster.

Open-weight models in the 4-billion to 9-billion parameter range are now genuinely good.

Qwen 3.5, which Alibaba released in early March, scores on the Berkeley Function-Calling Leaderboard roughly where GPT-4 did in early 2024.

And it fits on a laptop.

Ollama, the tool most people use to run these models locally, is now used by millions of active developers.

Mistral open-sourced a voice model last month that runs in 3 gigabytes of RAM and beats ElevenLabs at its own job.

Every month the gap between what the frontier labs can do and what anyone can run locally shrinks.

The second story is where I'm building.

The reason is simple.

If you care about AI being something you use, not something you rent, the model has to be small enough to escape the data center.

A 70-billion-parameter model does not run on your Mac Mini. A 4-billion-parameter model does.
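A rough back-of-the-envelope shows why. Assuming 4-bit quantization (about 0.5 bytes per parameter) and ~20% overhead for the KV cache and activations, both of which are my ballpark figures, not measured numbers:

```python
# Rough memory footprint at 4-bit quantization. The 0.5 bytes/param and
# 20% overhead figures are ballpark assumptions, not measured results.
def footprint_gb(params_billions: float,
                 bytes_per_param: float = 0.5,
                 overhead: float = 1.2) -> float:
    """Approximate RAM needed to run a model, in gigabytes."""
    return params_billions * bytes_per_param * overhead

print(round(footprint_gb(4), 1))   # ~2.4 GB: fits on an 8GB machine
print(round(footprint_gb(70), 1))  # ~42 GB: out of consumer range
```

The exact constants vary by quantization scheme, but the order of magnitude is what matters: one model fits in consumer RAM with room to spare, the other doesn't.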

Big models are useful. But sovereignty is the version that runs without asking permission.

So that's the bet: small, fast, genuinely good at a few things, runs on hardware you already have.

What the model is good at

The goal isn't to beat GPT-5 on a trivia quiz.

The goal is to be excellent at three specific things that matter when you're actually building something.

Answering questions from documents.

You give the model a passage, ask a question, and it answers from the passage.

It cites which sentence supported each claim. If the answer isn't there, it says "I don't know" instead of making something up.

This is the hardest thing for a small model to do well, and it's the backbone of any useful assistant.
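To make the behavior concrete, here's a sketch of what a grounded-QA prompt looks like. The format is mine for illustration, not the actual training template:

```python
# Illustrative grounded-QA prompt: the passage travels with the question,
# and the instructions demand sentence-level citations or an explicit
# refusal. This format is hypothetical, not the real training data.
def build_prompt(passage: str, question: str) -> str:
    """Force answers to come from the passage, with citations."""
    return (
        "Answer ONLY from the passage. After each claim, cite the "
        "supporting sentence as [S1], [S2], ... If the passage does not "
        "contain the answer, reply exactly: I don't know.\n\n"
        f"Passage:\n{passage}\n\nQuestion: {question}\nAnswer:"
    )

passage = "[S1] The plant flowers in June. [S2] It prefers full sun."
prompt = build_prompt(passage, "When does the plant flower?")
print("I don't know" in prompt)  # → True
```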

Calling tools correctly on the first try.

Modern AI doesn't just talk. It acts. It calls functions. It queries databases. It books meetings.

The problem with small models is they fumble the function call. Wrong name. Missing parameter. Malformed output.

A 15% error rate kills an agent. I'm training the model to get the call right 99% of the time, on consumer hardware, on the first attempt.
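What "getting the call right" means in practice can be sketched with the kind of validator an agent loop runs before executing anything. The tool name and parameters here are hypothetical:

```python
import json

# Hypothetical tool schema and a pre-execution check that catches the
# three classic small-model failures: malformed output, wrong tool name,
# missing required parameter.
TOOLS = {"book_meeting": {"required": {"date", "attendee"}}}

def validate_call(raw: str) -> bool:
    """Return True only if the model's call is executable as emitted."""
    try:
        call = json.loads(raw)             # malformed output
    except json.JSONDecodeError:
        return False
    tool = TOOLS.get(call.get("name"))     # wrong name
    if tool is None:
        return False
    args = call.get("arguments", {})
    return tool["required"] <= set(args)   # missing parameter

good = '{"name": "book_meeting", "arguments": {"date": "2026-05-01", "attendee": "Ana"}}'
bad = '{"name": "book_meting", "arguments": {}}'
print(validate_call(good), validate_call(bad))  # → True False
```

Every `False` from a check like this is a dead agent turn, which is why the first-try success rate matters more than average-case quality.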

Following instructions without drift.

Small models ramble. You ask for three bullet points, you get six paragraphs with a sonnet at the end.

You ask for JSON, you get JSON wrapped in prose that breaks your parser.

I'm training tight instruction-following so the model does what you asked, at the length you asked, in the format you asked.

These three capabilities together are what make a model feel reliable. Not bigger. Reliable.

What it's not

It's not a facts database. I'm not teaching the model who the Prime Minister is or what the HST rate is in Ontario.

Facts change. Models should not be trusted to remember them.

Facts belong in a retrieval layer. Documents the model reads at the moment you ask. Not baked into the weights.


This matters for a practical reason.

A model that claims to 'know' tax law is a liability.

A model that reads current tax law and cites it is a tool. I'm shipping the tool.

It's also not a dumbed-down model trying to pretend it's GPT-5.

At 4 billion parameters, it will never hold GPT-5's general knowledge.

If you ask it to write an essay about Ottoman naval history, it will be worse than what you get from the frontier labs.

That's fine. It's not for that.

It's for running your app's assistant, your agent loop, your customer service automation. The places where you need the same answer every time, not a creative one.

The deliverables, named

To be concrete:

  • simpledirect-flash-1 - a 4-billion-parameter model, Apache 2.0 license. Designed to run on the current Mac Mini series, any recent MacBook with 8GB or more of memory, consumer NVIDIA GPUs, and cheaply in the cloud.
  • The training dataset - released openly on Hugging Face the day the model goes out. Anyone can retrain their own model on the same data. Target: end of June.
  • Four deployment recipes - one model, four tested ways to run it: Apple Silicon, consumer NVIDIA, colocated hardware, and cloud. Published as an MIT-licensed companion repo. The weights are the model. The recipes are how you actually use it without spending a week on YAML.

Target model release: September.

A larger companion model may follow later if the 4B delivers what I'm aiming for. That decision comes after the first training run, not before.

This is the current plan. It might change.

The space moves faster than any monthly plan can track. Qwen 3.5, the base model I'm starting from, was released on March 2. Seven weeks ago.

It replaced Qwen 3 in my own plan almost immediately. Between now and September, something better might show up.

A new base model. A new training technique. A reason to ship a different size, or a different architecture, or something I haven't thought of.

If that happens, I'm going to pivot and write a post explaining why.

The worst thing I can do right now is pretend the plan is locked when the ground under it is still moving. So treat everything above as my best call as of today, not a signed contract.

I'd rather publish the plan openly and own the change than hide it and pretend I always knew.

Things that are unlikely to change: small over large, laptops over data centers, open weights over API keys, published evaluations at every compression level.

Those are the bet. The specific implementation is subject to new evidence between now and release.

Where we are today

Research is done. Base model chosen. Training method locked. Infrastructure picked.

Training hasn't started. First smoke tests are next week.

First full training run early May. First evaluation results mid-May. Dataset released end of June. Model released in September.

I'll publish an update at each milestone.

You'll see the loss curves. You'll see the evaluation numbers on every quantization level I ship.

A model that scores well at full precision but collapses when you compress it to run on a laptop is a model that lied to you. You'll see the failures. You'll see what didn't work.
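The shape of that per-quantization report is simple. These numbers are made up, purely to illustrate the format:

```python
# Hypothetical benchmark scores (illustrative only, not real results):
# accuracy at each compression level, plus the delta from full precision.
scores = {"fp16": 88.0, "q8": 87.4, "q4": 85.1}

baseline = scores["fp16"]
report = {level: round(s - baseline, 1) for level, s in scores.items()}
print(report)  # → {'fp16': 0.0, 'q8': -0.6, 'q4': -2.9}
```

The delta column is the honest part: it's the price you pay for fitting on a laptop, stated in points rather than hidden.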

Why you might care

If you've been following my posts on building in public, this is the big one for 2026.

Every tweet about local AI, every post about what actually runs on consumer hardware, every "you don't need OpenAI for this" thread.

This is the thing those posts were pointing at. I'm putting a real model with my name on it into the world and letting you watch the whole thing.

Most fine-tuned models on Hugging Face publish their full-precision benchmark scores and stay quiet about what happens when you compress them for real use. I'm not going to do that.

If the compressed version is three points worse on the function-calling benchmark, you'll see the three points.

More posts in this series: how I picked the training framework, what this actually costs, and what breaks along the way.
