Why CBLRE Matters More Than the Model

Yesterday we released CBLRE — the Canadian Bilingual Legal and Regulatory Evaluation.

The day before, we released flash-1-mini, a 4-billion-parameter bilingual Canadian legal AI model.

Most of the launch coverage has focused on the model.

That's the wrong artifact to focus on.

The model is the proof. CBLRE is the moat. Here's why.

The gap nobody had filled

Before yesterday, no standard public benchmark existed for Canadian bilingual legal AI evaluation.

That sentence is bigger than it sounds.

US legal benchmarks like LegalBench are common-law and English-only — they can't evaluate Quebec civil law reasoning and they don't test French.

General multilingual benchmarks like Belebele and Global-MMLU test French comprehension and knowledge, but not legal French: not the technical vocabulary of Quebec's Civil Code, federal francophone case law, or Canadian regulatory frameworks.

They carry no Canadian legal content at all.

So a Canadian institution evaluating AI for legal or regulatory work has had three options: apply US benchmarks that don't match Canadian law, apply multilingual benchmarks that don't test legal correctness, or build evaluation infrastructure from scratch.

Most haven't built.

Which means AI procurement for Canadian legal workflows has been decided on vibes — which vendor demos well, which has the logo the board recognizes.

There's no shared scorecard.

CBLRE is an attempt to be that scorecard.

What CBLRE actually does

CBLRE is a public test set of 129 expert-reviewed items across six tracks:

common law doctrine,
Quebec civil law under the Civil Code of Québec,
constitutional and Charter reasoning,
privacy compliance,
citation integrity,
and safety calibration (how the model behaves on adversarial prompts trying to elicit hallucinated statutes).

It's bilingual, and one design choice matters more than the rest.

The privacy track is built as matched English-French pairs, so you can compute a parity ratio directly — French accuracy divided by English accuracy.

A ratio of 1.00 means the model performs identically in both languages on the same legal questions; 0.70 means it's meaningfully worse in French.
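To make the arithmetic concrete, here is a minimal sketch of the parity ratio in Python. The `PairResult` structure and the `parity_ratio` function are illustrative names assumed for this example, not the published CBLRE scoring code; the only thing taken from the benchmark is the definition itself, French accuracy divided by English accuracy over the same matched items.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PairResult:
    """Outcome for one matched English-French item (hypothetical structure)."""
    pair_id: str
    correct_en: bool
    correct_fr: bool


def parity_ratio(results: Sequence[PairResult]) -> float:
    """Return French accuracy divided by English accuracy over matched pairs."""
    if not results:
        raise ValueError("no matched pairs supplied")
    n = len(results)
    acc_en = sum(r.correct_en for r in results) / n
    acc_fr = sum(r.correct_fr for r in results) / n
    if acc_en == 0:
        raise ValueError("English accuracy is zero; parity ratio is undefined")
    return acc_fr / acc_en


# Made-up results: all 10 English items correct, 7 of 10 French items correct.
example = [PairResult(f"pair-{i}", True, i < 7) for i in range(10)]
print(f"parity ratio: {parity_ratio(example):.2f}")  # parity ratio: 0.70
```

Because both accuracies are computed over the same legal questions, a low ratio points at the language rather than at a harder set of items.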

Most AI procurement happens in English first, and a model that demos well in English can quietly collapse in French.

CBLRE catches that.

No other public Canadian benchmark measures it.

Three methodology choices set CBLRE apart from standard practice.

The dataset card publishes no model scores — the instrument is separated from the rankings on purpose, so the eval can't be defined to make a particular model look good.

Scoring is deterministic: the same items and the same code produce the same scores every time, so anyone can regenerate any number we report, and the judge is a string-matching routine rather than an LLM grading another LLM.

And three items whose gold answers were found to be incorrect during expert review were removed, with the removal documented in the card for transparency.

This is the academic posture, not the marketing posture.

It costs short-term comparison wins. It earns long-term credibility.
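For readers who want to see what deterministic, string-matching grading looks like in practice, here is a minimal sketch. It is not the CBLRE scoring code: the `normalize` and `score` functions, the item IDs, and the normalization rules (Unicode NFKC, case folding, whitespace collapsing) are all assumptions made for illustration. The point is only that there is no randomness and no model in the loop, so the same inputs always give the same number.

```python
import unicodedata
from typing import Mapping


def normalize(answer: str) -> str:
    """Canonicalize an answer so trivially different strings compare equal."""
    text = unicodedata.normalize("NFKC", answer)
    return " ".join(text.strip().casefold().split())


def score(predictions: Mapping[str, str], gold: Mapping[str, str]) -> float:
    """Fraction of gold items whose prediction matches after normalization.

    An item with no prediction counts as incorrect.
    """
    if not gold:
        raise ValueError("gold answer set is empty")
    correct = sum(
        normalize(predictions.get(item_id, "")) == normalize(answer)
        for item_id, answer in gold.items()
    )
    return correct / len(gold)


# Hypothetical multiple-choice items and model outputs.
gold = {"q1": "B", "q2": "C", "q3": "A", "q4": "D"}
predictions = {"q1": " b ", "q2": "C", "q3": "D"}  # q4 was never answered

print(score(predictions, gold))  # 0.5
```

Anyone with the items and the code can rerun this and check a reported score, which is the property an LLM judge cannot offer.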

Why the discipline matters

There are three common ways to publish a benchmark that looks rigorous and isn't.

Hide the test set — for "national security," say — so the scores can't be independently reproduced. You have to trust the lab.

Publish the test set but keep the scoring opaque, with the lab judging its own outputs and reporting the percentage.

The judge is unfalsifiable.

Design the benchmark alongside the model, so the eval is gerrymandered to the model's training distribution.

The score is high and the information is meaningless.

CBLRE does none of these.

Public items, public scoring code, a methodology published separately and dated before any model scoring, and deterministic grading.

That's the cost of being trustworthy. It's also the moat.

Why the benchmark outlasts the model

Specialist Canadian legal AI is a category that will have a handful of competitors next year and many more the year after.

Base models keep improving, fine-tuning keeps getting cheaper, and the specific weights of flash-1-mini are not what compound.

Evaluation standards are different.

If CBLRE becomes the standard Canadian institutions cite in procurement RFPs, it outlasts every individual model.

Vendors compete on it.

Buyers score on it.

The instrument becomes the substrate everything else stands on.

That's what GLUE and SuperGLUE did for language understanding, what MLPerf did for hardware, what MMLU did for general knowledge.

It's also why each eventually saturated and needed a successor — which is a feature of a versioned standard, not a bug.

The instrument matters more than any individual run on it.

Canada hasn't had one for legal AI. It does now.

What it enables

A Canadian law firm can cite CBLRE in a procurement RFP and require vendors to report scores against the published items, computed with the published code — a vendor-neutral basis for comparing proposals.

A provincial government can require CBLRE evaluation as part of diligence on an AI system bound for regulated workflows; because the instrument is open and the scoring reproducible, that diligence is easier to defend if it is challenged.

These aren't hypotheticals — they're the use cases the eval was built for.

Researchers studying the bilingual capability gap can use the matched English-French privacy pairs as their measurement instrument; the dataset is OGL-Canada licensed for permissive academic use.

What CBLRE is not

It's version 1.0, and it's small.

With 129 items, CBLRE can separate models with real capability differences, but it lacks the statistical power to resolve subtle distinctions between strong models.
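A rough way to see the power limit is to put a confidence interval around a score on 129 items. The sketch below uses the standard Wilson score interval for a binomial proportion; the function name and the example score of 103 correct are made up for illustration and are not results from any model. It also treats the items as independent, which is a simplifying assumption.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical score: 103 of 129 items correct (about 80%).
low, high = wilson_interval(103, 129)
print(f"95% interval: {low:.3f} to {high:.3f}")
```

For that hypothetical score the interval runs from roughly 72% to 86%, so two strong models a few points apart cannot be told apart with confidence on v1.0 alone.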

The items were AI-assisted drafts corrected by expert review, which introduces author bias; broader subject-matter validation continues as the bank grows.

And the Quebec French register isn't yet validated by native Quebec French legal raters — CBLRE measures whether the model gets the law right in French, not yet whether the French reads like a Quebec lawyer wrote it.

These limitations are documented in the card.

Versioning is what makes them auditable over time: v1.0 numbers cannot be compared to v2.0 numbers, by design.

Each release is a fixed reference point for the period in which it was the canonical version.

The roadmap is the strategy

v1.0 covers common law doctrine, Quebec civil law, constitutional and Charter reasoning, privacy compliance, citation integrity, and safety calibration.

It doesn't yet cover tax and benefits, employment and labour law, immigration, securities and financial compliance, or healthcare privacy.

Those are the next tracks — built and validated as qualified subject-matter reviewers are secured for each domain.

The reproducibility discipline holds at every version.

Structured-analysis items come later: Charter analyses, Oakes test reasoning, regulatory interpretation problems — scored by rubric, not multiple choice.

MCQ discriminates well between weak and strong models but ceilings out on settled doctrine; rubric-scored analysis carries more signal about real legal reasoning and breaks that ceiling.

Versioned releases with DOIs are the academic credentialing move — by v2.0 or v3.0, CBLRE should be citable in peer-reviewed legal-AI research.

That's when it goes from "useful benchmark" to "canonical reference."

Who built it

Ayush Naik on our team built CBLRE — the methodology, the item structure, the expert-review process, the scoring code, the dataset card. It's his work.

This is the substrate moat in its cleanest form: a small team in Toronto, Apache 2.0 on the evaluation and OGL-Canada on the source materials, no venture capital, no vendor lock-in.

An instrument owned by no one in particular and usable by everyone.

flash-1-mini shows you can build a capable Canadian-context specialist on consumer hardware.

CBLRE shows how to measure whether anyone else's claim holds up.

We'll keep shipping models on this stack — flash-1 (9B) in July, flash-1-pro (27B) in September, and a free Government-of-Canada demo variant via Hugging Face Spaces after that.

But the models are downstream.

CBLRE is the upstream artifact, and the next version ships when the expert-review process concludes; we'll publish that timing once the validation work is done.

The standard is the thing. The model is the proof.

If you're training a Canadian-context AI model, score it against CBLRE and publish your numbers.

If you're buying an AI system for Canadian legal or regulatory work, ask your vendor for CBLRE scores against the published item set.

If you're evaluating AI policy for Canadian institutions, CBLRE gives you a measurement framework that didn't exist before yesterday.

That's the substrate. We built it. It's yours.