When we launched Ghost Narrator, we used Fish Speech for voice cloning. It worked. The voice quality was past the threshold of "nobody notices." We were happy with it.
Then we had a licensing problem.
Fish Speech's license is restrictive for commercial use.
We were publishing 200 narrated blog posts a month on a model we couldn't fully commercialize. That needed to change.
So we tested alternatives. The obvious move was Mistral's new TTS model — newer, from a well-funded lab, getting attention on Twitter. We tried it.
The license was restrictive too.
Then we tested Qwen TTS.
Apache 2.0. Fully free. No restrictions on commercial use.
And here's the part that shocked us: the voice quality is significantly better.
Not marginally. Not "if you listen carefully." Significantly.
Qwen published this model three months before Mistral published theirs.
Older model. More permissive license. Better output.
Every assumption we had was wrong.
- We assumed newer meant better.
- We assumed more restrictive meant more advanced.
- We assumed the well-funded Western lab would outperform.
None of that held up.
What Changed
The swap was straightforward.
Ghost Narrator's architecture is modular — the TTS model is one component in the pipeline.
We pulled Fish Speech out, dropped Qwen TTS in.
Same pipeline, same workflow, same hardware.
The narration scripts still get generated by a local LLM. The voice cloning still runs locally. Nothing leaves our machines.
The only difference is the model producing the audio.
If you cloned Ghost Narrator before the swap, updating only means changing the TTS model. The rest of the pipeline stays as it is, because the architecture doesn't care which TTS model you run.
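To make "the architecture doesn't care" concrete, here is a minimal sketch of what a swappable TTS component can look like. This is not Ghost Narrator's actual code. Every name in it (TTSBackend, StubBackend, BACKENDS, narrate, and the "fish-speech" / "qwen-tts" keys) is a hypothetical stand-in, and the stub returns tagged bytes instead of calling a real model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Protocol


class TTSBackend(Protocol):
    """Hypothetical interface: anything that turns a script into audio bytes."""

    def synthesize(self, script: str, voice_sample: str) -> bytes: ...


@dataclass
class StubBackend:
    """Stand-in for a real model wrapper. Returns tagged bytes, not audio."""

    name: str

    def synthesize(self, script: str, voice_sample: str) -> bytes:
        return f"[{self.name}|{voice_sample}] {script}".encode("utf-8")


# Registry keyed by a config string. The keys are illustrative labels only;
# a real entry would construct a wrapper around the actual model.
BACKENDS: Dict[str, Callable[[], TTSBackend]] = {
    "fish-speech": lambda: StubBackend("fish-speech"),
    "qwen-tts": lambda: StubBackend("qwen-tts"),
}


def narrate(script: str, voice_sample: str, model: str) -> bytes:
    """Run the (assumed) final pipeline step with whichever backend is configured."""
    if model not in BACKENDS:
        raise ValueError(f"unknown TTS backend: {model}")
    backend = BACKENDS[model]()
    return backend.synthesize(script, voice_sample)


if __name__ == "__main__":
    # Swapping models is a one-string change; the calling code is identical.
    print(narrate("Hello from the blog.", "founder.wav", "fish-speech"))
    print(narrate("Hello from the blog.", "founder.wav", "qwen-tts"))
```

The point of the sketch is the shape, not the details: the pipeline talks to one small interface, so replacing the model means adding one registry entry and changing one config value.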
Listen to It
Every article on Founder Reality is now narrated using Qwen TTS. Hit play on any recent post. That's what it sounds like.
What This Means
A year ago, the best text-to-speech required a paid API. Per-word pricing. Someone else's servers. Rate limits. ElevenLabs was the default answer and nobody questioned it.
Today, the best TTS model we've tested is free, open-source, runs locally, and was released before the restrictively licensed alternative we compared it against.
The gap between paid and free didn't just close. Free pulled ahead.
I think this is the beginning of a pattern that kills most AI SaaS companies within 18 months.
Here's what's happening: open-source models are catching up to commercial ones on a delay.
That delay used to be two years. Then it was one year.
Now it's negative: in our TTS testing, an older open-source model is outperforming a newer, restrictively licensed one.
If you're building a business that wraps an API around a model you don't own, your moat is the delay between open-source and commercial quality.
That moat is evaporating. In TTS, it's already gone.
ElevenLabs is a great product. I genuinely think they have the best voice quality in the industry right now. But "right now" is doing a lot of heavy lifting in that sentence.
Qwen TTS wasn't even trying to compete with them. It just quietly exists, Apache 2.0, and it's already better than the models that launched after it with restrictive licenses.
This is why we built Ghost Narrator the way we did.
Modular. Local. Model-agnostic.
We didn't bet on Fish Speech being the best forever. We bet on the architecture being right and the models being swappable.
That bet paid off in three months instead of three years.
The companies that survive in AI won't be the ones with the best model today.
They'll be the ones who own the pipeline and swap models as the open-source floor rises. Everyone else is paying rent on something that's about to be free.
We're going to keep documenting every swap. Every time a paid tool gets replaced by something we run locally, we'll write about it here.
Not because we're anti-commercial software. Because the math is changing faster than people realize, and I'd rather show it than argue about it.
Ghost Narrator is open-source: github.com/getsimpledirect/workos-mvp
Follow along: @TheGeorgePu

