Evidence

Early evidence from source-heavy coding work.

Controlled runs show Kairn reducing token use, keeping long sessions focused, and improving source routing. The detailed split is available for deeper review.

These results measure context governance during coding work, not raw search quality.

Token use

Tokens saved after quality gates pass.

These rows compare successful runs against successful runs. Bars show percent reduction; labels show tokens saved.
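As a rough illustration of the pass/pass rule, the sketch below keeps only rows where both the baseline and the Kairn run passed, then computes tokens saved and percent reduction per row. The `Row` type, its field names, and the sample numbers are hypothetical and exist only to show the arithmetic; they are not Kairn's data format or measured results.

```python
from dataclasses import dataclass


@dataclass
class Row:
    """One benchmark row (hypothetical shape, for illustration only)."""
    task: str
    baseline_passed: bool
    kairn_passed: bool
    baseline_tokens: int
    kairn_tokens: int


def clean_rows(rows):
    """Keep only pass/pass rows: both sides completed the task."""
    return [r for r in rows if r.baseline_passed and r.kairn_passed]


def tokens_saved(row):
    """Absolute token savings for one row."""
    return row.baseline_tokens - row.kairn_tokens


def percent_reduction(row):
    """Savings as a percentage of the baseline token count."""
    return 100.0 * tokens_saved(row) / row.baseline_tokens


# Made-up rows, not measured results.
rows = [
    Row("task-a", True, True, 1000, 600),
    Row("task-b", False, True, 1200, 500),  # baseline failed: excluded
    Row("task-c", True, True, 800, 700),
]

for row in clean_rows(rows):
    print(row.task, tokens_saved(row), f"{percent_reduction(row):.1f}%")
```

Under this rule, a row where the baseline failed never counts as a token saving, however large the difference in token use.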

Evidence sets

headline sets plus latest check

Median reduction

42.7%

clean rows shown

Official tasks passed

20/25

Kairn across official campaigns

Endurance suites

long-session evidence

Latest fresh check

Fresh source-routing check

3 fresh pass/pass rows saved 48k tokens in aggregate: 123k baseline to 75k with Kairn. This is repeat-1 evidence, so it is tracked separately from the headline examples.

38.7%

aggregate reduction
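A minimal sketch of how an aggregate reduction of this kind can be reproduced: total tokens saved divided by total baseline tokens. It uses the exact saved-token count reported for this check (47,641) and the baseline as rounded on this page (123k), so the result is approximate. The function and constant names are illustrative assumptions.

```python
# Figures quoted on this page. The baseline is shown rounded to the nearest
# thousand, so the computed percentage is approximate.
TOKENS_SAVED = 47_641
BASELINE_TOKENS_ROUNDED = 123_000


def aggregate_reduction(saved, baseline):
    """Percent reduction over pooled totals, not a mean of per-row percentages."""
    return 100.0 * saved / baseline


reduction = aggregate_reduction(TOKENS_SAVED, BASELINE_TOKENS_ROUNDED)
print(f"{reduction:.1f}%")
```

Pooling totals weights each row by its size, so one small row with a large percentage saving cannot dominate the figure.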

Fresh official slice

Fresh SWE-Pro governance slice

Kairn passed 9/9 valid official evaluations in the latest branch check, with 0 harmful assists and 0 token regressions. Clean pass/pass rows still saved tokens, but the stronger signal was source and scope discipline.

9.3%

clean median

Click endurance

29k

saved

42.7%

less token use

91k baseline to 62k with Kairn.

10-turn debugging session.

Node SemVer endurance

11k

saved

16.4%

less token use

70k baseline to 58k with Kairn.

Fresh long-session JavaScript package test.

OpenLibrary SWE-Pro

12k

saved

54.2%

less token use

22k baseline to 10k with Kairn.

Official evaluator row.
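The sketch below assumes the headline median and the reported range are taken over the three rows shown here (Click, Node SemVer, and OpenLibrary). It uses only the percentages printed on this page; the dictionary layout is an illustrative assumption.

```python
import statistics

# Percent reductions as printed for the three headline rows.
headline_reductions = {
    "Click endurance": 42.7,
    "Node SemVer endurance": 16.4,
    "OpenLibrary SWE-Pro": 54.2,
}

values = list(headline_reductions.values())

median_reduction = statistics.median(values)  # middle value of three rows
lowest, highest = min(values), max(values)    # bounds of the reported range

print(f"median {median_reduction}%, range {lowest}% to {highest}%")
```

With only three rows, the median is simply the middle row, so it is best read as a summary of these examples and not as a population estimate.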

Coverage

More than a token chart.

Kairn has been tested across clean token-savings rows, longer sessions, MCP delivery, official evaluator rows, and governor ablations.

Token reduction

On clean rows, Kairn reduced token use by 16.4% to 54.2% after both sides completed the task.

measured

Long sessions

Endurance runs show Kairn keeping source focus across repeated debugging and verification turns.

repeat-tested

Editor path

MCP-first runs recorded real tool calls and useful source guidance, so Kairn is not limited to terminal hooks.

MCP

Official evaluator

Across 25 official SWE-Pro rows from evidence campaigns, Kairn passed 20/25 versus 15/25 for the baseline.

SWE-Pro

Fresh official slice

In the latest 10-row SWE-Pro governance slice, Kairn passed 9/9 valid official evaluations with zero harmful assists.

fresh

Quality lift

Several controlled rows passed with Kairn when the baseline missed quality, showing the value of better source focus.

rescue

Decision layer

Ablation showed the governor improves safety by shrinking or suppressing its guidance, or by asking for evidence, instead of over-assisting.

governor

Latest fresh check

A post-launch source-routing check saved 47,641 tokens across 3 fresh pass/pass rows; repeat-3 is still needed before treating it as headline evidence.

fresh

Quality

Better source focus can improve outcomes.

Some controlled rows passed with Kairn after the baseline missed quality. Those rows are shown separately from token-savings rows.

official evaluator pass rate

Official SWE-Pro campaigns

25 official rows across evidence campaigns; clean savings and quality/source-focus rows are separated.

Codex alone

15/25

tasks passed

60%

pass rate

Codex + Kairn

20/25

tasks passed

80%

pass rate
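The pass rates above follow directly from the task counts. A minimal sketch of the arithmetic, using only the counts on this page (the function and variable names are illustrative):

```python
TOTAL_ROWS = 25
BASELINE_PASSED = 15  # Codex alone
KAIRN_PASSED = 20     # Codex + Kairn


def pass_rate(passed, total):
    """Pass rate as a percentage of all official rows."""
    return 100.0 * passed / total


baseline_rate = pass_rate(BASELINE_PASSED, TOTAL_ROWS)
kairn_rate = pass_rate(KAIRN_PASSED, TOTAL_ROWS)

# Difference in percentage points and in whole tasks.
lift_points = kairn_rate - baseline_rate
extra_tasks = KAIRN_PASSED - BASELINE_PASSED

print(f"{baseline_rate:.0f}% -> {kairn_rate:.0f}% "
      f"(+{lift_points:.0f} points, +{extra_tasks} tasks)")
```

The lift is stated in percentage points: 5 additional tasks out of 25, not a 20% relative improvement.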

Interpretation

What this does and does not prove.

Kairn is strongest when the task would otherwise cause repeated source searching.

Kairn's silence is a feature: it avoids adding context when context is not useful.

The strongest savings rows count only after quality gates pass.

The Codex session is the most-tested path; MCP is the portable editor path.

Details

Want the benchmark split?

The detailed page keeps modes, MCP runs, official evaluator rows, and caveats visible for technical review.