
Why I chose Unsloth (before training a single token)

George Pu

Honest note up front: I have not yet fine-tuned anything with Unsloth. I have not run a single training job.

What I did is spend three weeks researching fine-tuning frameworks before writing a line of training code — and at the end of that research, I picked Unsloth and committed to it.

This post is about why. I'm writing it now, before I start, for two reasons.

First, so that if this decision ages badly I have to own it publicly.

Second, so that whoever is doing this research next has a starting point I didn't have.

The question I was actually answering

When you want to fine-tune a language model, you don't just open a text editor and start typing. You pick a framework.

The framework handles the math: loading the model, applying the training update, managing memory, saving checkpoints. Every serious fine-tuning project runs on one.

Here's the catch. There are five or six viable frameworks.

They all work. They all have followers online who will tell you their favorite is obviously the right choice.

None of them is obviously the right choice.

The differences show up in real ways: training speed, memory usage, which models they support on day one when a new release drops, which ones break subtly under compression, how much time you'll spend debugging framework bugs instead of improving your model.

I'm training a 4-billion-parameter model and a 9-billion one. The wrong framework could double my compute bill or burn two weeks on debugging.

I wanted to pick once, deliberately, before I had any sunk cost.

The candidates

Five serious options.

Unsloth (58,000 GitHub stars).

Built specifically to make fine-tuning on a single GPU fast and memory-efficient.

The headline claim: trains roughly twice as fast as standard alternatives and uses about 70% less GPU memory.

That claim has been verified by independent benchmarks. It's real.

LLaMA-Factory (55,000 stars).

A more general-purpose framework. Supports more models than Unsloth.

Has an academic paper (ACL 2024) and a large community. Comes with a web UI. Slower than Unsloth on the specific models I care about, but more flexible.

Axolotl (8,000 stars).

A config-driven wrapper around the standard Hugging Face training stack.

Popular with teams who run many training variations by editing YAML. Less efficient than Unsloth.

torchtune (4,000 stars).

Meta's official library. Clean code, PyTorch-native, Meta engineers use it internally.

Fewer features, smaller community, less battle-tested on the models I plan to use.

Hugging Face TRL (18,000 stars).

The library Unsloth, LLaMA-Factory, and Axolotl all build on top of. Raw, powerful, verbose. Using it directly is reasonable but asks more of the engineer.

I read the source code of all five. I read the recent issues.

I ran about fifty deep-research queries through Perplexity on 2026 fine-tuning best practices. I talked to people who'd shipped fine-tuned models on each framework.

Why Unsloth won

Three things decided it.

The speed and memory claims hold up.

On the hardware I'm running — an NVIDIA L40S with 48 gigabytes of memory — my 9B model, trained at full 16-bit precision, fits comfortably in memory under Unsloth.

On some other frameworks I'd have to drop to a lower-precision mode to stay within memory.

That matters because my use case involves structured output (the model emits JSON function calls), and every piece of research I read is consistent: training at reduced precision introduces subtle noise that degrades exactly that kind of structured output.

Being able to stay at 16-bit without running out of memory is specifically valuable for what I'm building.
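The back-of-envelope arithmetic behind that claim, counting weights only (optimizer state, gradients, and activations come on top, which is where adapter-style training saves most of its memory), looks like this:

```python
def weight_memory_gb(n_params_billion, bytes_per_param):
    """Back-of-envelope: memory for the model weights alone, in GiB.

    Ignores optimizer state, gradients, and activations, so the real
    footprint during training is higher. Purely illustrative arithmetic.
    """
    return n_params_billion * 1e9 * bytes_per_param / 1024**3

# A 9B model at 16-bit precision (2 bytes/param) vs 4-bit (0.5 bytes/param):
print(round(weight_memory_gb(9, 2), 1))    # 16.8 -> ~17 GiB of 48 GiB just for weights
print(round(weight_memory_gb(9, 0.5), 1))  # 4.2  -> what 4-bit loading buys you
```

About 17 GiB of weights out of 48 GiB leaves real headroom for gradients and activations at 16-bit, which is the whole point: no forced drop to lower precision.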

Day-one Qwen 3.5 support.

Qwen 3.5 came out on March 2. Unsloth had official support, optimized model weights, and working examples before the month was out.

That's not an accident — the Unsloth maintainers make fast support for new open-weight model releases a habit.

When I chose Qwen 3.5 as my base model, it was already supported with a pre-tuned recipe. I wasn't betting on future work getting done.

The license and the ecosystem.

Unsloth's core library is Apache 2.0, which matches how I want to release the trained model.

Their Studio product — a visual interface on top — is AGPL, which is restrictive, but I'm not using the Studio; I'm using the library.

Active maintainers. Issues get answered. Pull requests get merged. This is not a zombie project.

Add it up: fastest for my hardware, supports my base model day-one, right license, active community. There wasn't a single category where Unsloth wasn't the top or near-top pick.

What I'm worried about

If I told you I picked this tool and everything's going to be smooth, you wouldn't believe me, and you'd be right. Here's what I'm actually worried about.

The lm_head merge bug.

There's an open issue in the Unsloth repository — number 4098, filed in February — that describes a specific failure: if you include the model's final output layer in your training target, the framework silently misclassifies it, and when you try to merge your training results back into the base model, it crashes.

It's a known bug with a known workaround (don't include that layer). I've written it into my engineering runbook. But it's the kind of thing that, if you don't know about it, can cost you a week.
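The runbook entry is simple enough to encode as a guard. A minimal sketch, assuming the q_proj/k_proj-style module names used by Llama- and Qwen-family models (check your model's actual layer names before copying this); the guard just refuses any config that includes the output layer:

```python
# Hypothetical LoRA target-module list for a Qwen-style transformer block.
# Workaround for Unsloth issue #4098: never include "lm_head" here, or the
# merge of the trained adapter back into the base model can crash.
TARGET_MODULES = [
    "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
    "gate_proj", "up_proj", "down_proj",      # MLP projections
    # "lm_head",  # deliberately excluded -- see issue #4098
]

def check_targets(modules):
    """Fail fast before launching a run if lm_head slipped into the config."""
    if "lm_head" in modules:
        raise ValueError("lm_head in target_modules triggers the merge bug")
    return modules
```

One `raise` at config time is a lot cheaper than a crashed merge after a full training run.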

Chat template drift.

After you finish training, you merge the training results with the base model and save the combined result.

During that process, it's easy to lose or corrupt the chat template — the piece of config that tells the model how conversations are formatted.

If you lose it, the model still loads, still runs, but responds in subtly broken ways that are hard to diagnose.

This is not Unsloth-specific; it affects every framework.

It's also the single most common production failure in fine-tuning projects. I'm building a sanity-check script that runs after every merge to catch it immediately.
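The shape of that sanity check, assuming the Hugging Face tokenizer interface (a `chat_template` attribute and an `apply_chat_template` method); this is a sketch of the idea, not my final script:

```python
def check_chat_template(tokenizer):
    """Post-merge sanity check: fail loudly if the chat template was lost.

    `tokenizer` is anything with a `.chat_template` attribute and an
    `apply_chat_template(messages, tokenize=False)` method, as in the
    Hugging Face tokenizer API (names assumed from that API).
    """
    template = getattr(tokenizer, "chat_template", None)
    if not template:
        raise RuntimeError("chat_template missing after merge")
    # Round-trip a trivial conversation and make sure the user turn survives.
    probe = [{"role": "user", "content": "ping"}]
    rendered = tokenizer.apply_chat_template(probe, tokenize=False)
    if "ping" not in rendered:
        raise RuntimeError("chat_template no longer renders user content")
    return True
```

Running this immediately after every merge turns a "subtly broken responses" bug into a loud failure at the exact step that caused it.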

Compression collapse.

My model has to run on laptops, which means compressing it from the training format (16-bit) down to a smaller format (4-bit or 5-bit).

Compression introduces error.

Most of the time the error is small and the model still works. Sometimes — especially for models that emit structured output — the compressed model performs substantially worse than the uncompressed one.

I'm going to publish evaluation scores at every compression level.

If the 4-bit version scores three percentage points worse than the 8-bit version, you'll see that. Nobody should ship a model to laptops and only publish the data-center numbers.
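The comparison I plan to publish boils down to something like this. The quantization names and the regression budget below are illustrative, not final:

```python
def quant_regressions(scores, reference="bf16", max_drop_pts=2.0):
    """Compare eval scores (percentages) across quantization levels
    against the full-precision reference; return the levels whose drop
    exceeds the budget, with the drop in percentage points.

    Level names and the 2-point budget are placeholders for illustration.
    """
    base = scores[reference]
    return {
        level: round(base - score, 2)
        for level, score in scores.items()
        if level != reference and base - score > max_drop_pts
    }

# Hypothetical eval scores at each compression level:
scores = {"bf16": 71.0, "q8_0": 70.6, "q5_k_m": 69.8, "q4_k_m": 67.5}
print(quant_regressions(scores))  # {'q4_k_m': 3.5} -> 4-bit regresses past budget
```

In that hypothetical, the 4-bit build would ship with a visible 3.5-point asterisk next to it, not a silent downgrade.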

The AGPL Studio asterisk.

Unsloth's free-to-use library is Apache 2.0. Their Studio product is AGPL, which is restrictive and viral.

Some founders have seen the AGPL mention on the Unsloth homepage and ruled out the whole project.

That's a mistake. The Studio is a separate product from the library. The library is Apache. I'm using the library and have read the license carefully.

Catastrophic forgetting.

This is a risk inherent to fine-tuning, not to Unsloth specifically.

When you train a model to be great at a new task, it can get worse at things it was originally good at.

Recent research puts the potential drop in earlier capabilities as high as 43% when the training recipe is careless.

I'm using a single-stage multi-task training recipe specifically to mitigate this, and I'll evaluate on general-purpose benchmarks as well as my target capabilities.
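A single-stage multi-task mix can be sketched as weighted sampling across task pools, so general-capability data is interleaved with the target task rather than trained in a separate later stage. The dataset names and weights below are placeholders, not my actual recipe:

```python
import random

def build_mix(datasets, weights, n_examples, seed=0):
    """Draw a single-stage training mix: each example is sampled from one
    task's pool according to fixed weights, interleaving everything in one run.

    `datasets` maps task name -> list of examples; `weights` maps task
    name -> sampling weight. Both are illustrative here.
    """
    rng = random.Random(seed)  # fixed seed so the mix is reproducible
    names = list(datasets)
    mix = []
    for _ in range(n_examples):
        name = rng.choices(names, weights=[weights[n] for n in names])[0]
        mix.append(rng.choice(datasets[name]))
    return mix

# Hypothetical 70/30 split between the target task and general-ability data:
pools = {"function_calling": ["fc_1", "fc_2"], "general": ["gen_1", "gen_2"]}
mix = build_mix(pools, {"function_calling": 0.7, "general": 0.3}, n_examples=1000)
```

The point of the interleaving is that the model never gets a long stretch of training where the original capabilities are absent from the signal.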

If the fine-tuned model gets meaningfully dumber at general knowledge, I have a mitigation ready.

What I considered and rejected

A few tempting options that I decided against.

Rolling my own training loop with raw TRL.

More control, more flexibility, more engineering time. Not worth it for a first real training run. The abstractions Unsloth provides are the ones I need.

A managed fine-tuning service (Together, Replicate, Modal).

Convenient. Expensive. Requires training data to leave my infrastructure. Doesn't fit the story I'm telling. No thanks.

LLaMA-Factory for the flexibility.

Seriously considered. The web UI is useful for rapid iteration. Broader model support matters if I end up needing to try base models Unsloth doesn't cover.

For this run, Unsloth's speed and specific Qwen 3.5 tuning won. If I ever need to run comparison experiments across many base models, I'll revisit.

What happens next

Side note: cost is the other question people ask, and it deserves its own post. Short version: fine-tuning is credit-card money, not venture money. Full breakdown in What fine-tuning actually costs (it's not what you think).

This is where the post stops being about research and starts being about work.

Next week: first end-to-end smoke test. The 4-billion-parameter model, trained on a thousand-example subset of my data, on a single L40S GPU, for about an hour. The goal isn't a good model.

The goal is to verify the pipeline works — that I can load the base model, train it, merge the result, convert it to the laptop-friendly format, and have it still emit coherent text at the other end. If any step breaks, I find out now, not in May.

Early May: the real first training run. Full dataset, the 4B model, measured on all my target evaluations. The first actual signal on whether the approach works.

Mid-May through June: iteration. What the evaluations show, I respond to. If the model is weak on function calling, I add data. If it's drifting on French, I adjust the mix. This is the part that takes taste and time, not compute.

Late June: the full 9B training run, plus the dataset release on Hugging Face.

September: the model release. Both sizes, four deployment recipes, model card, evaluation results at every compression level, training logs you'd need to reproduce it.

If Unsloth was the wrong choice, I'll know by early May. I'll write that post too.

Either way, I'd rather decide deliberately now and publish my reasoning than look back in September and say "I picked it because a Twitter thread recommended it."

This is the current plan. It might change.

Everything above is my best call as of April 20, 2026. The fine-tuning landscape moves fast.

Qwen 3.5, the base model I'm building on, was released seven weeks ago.

Unsloth itself ships significant updates on a weekly cadence.

A framework I ranked second today could be first in two months.

A bug I'm planning around could get fixed next week.

A new compression format could change which quantization level we ship.

If any of that happens and it changes my answer, I'll write the follow-up post.

Hiding a pivot is worse than making one.

This is how I'm going to run the whole project.

Decide in public. Publish the reasoning. Show the work. Own the call, even when the call changes.

More soon.
