kairn

Evidence

Early evidence from source-heavy coding work.

Controlled runs show Kairn reducing token use, keeping long sessions focused, and improving source routing. A detailed breakdown by mode is available for deeper review.

Token use

Tokens saved, counted only after quality gates pass.

Each row compares a successful baseline run against a successful Kairn run. Bars show percent reduction; labels show tokens saved.
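As a minimal sketch of how such a row could be computed, the snippet below keeps only rows where both runs passed, then derives tokens saved and percent reduction. The `Row` type, its field names, and the example values are hypothetical illustrations, not Kairn's actual benchmark schema or measured results.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Row:
    # Hypothetical record layout for one benchmark comparison.
    name: str
    baseline_tokens: int
    kairn_tokens: int
    baseline_passed: bool
    kairn_passed: bool


def clean_rows(rows):
    """Keep only rows where both the baseline and the Kairn run passed."""
    return [r for r in rows if r.baseline_passed and r.kairn_passed]


def tokens_saved(row):
    """Absolute savings: baseline tokens minus Kairn tokens."""
    return row.baseline_tokens - row.kairn_tokens


def percent_reduction(row):
    """Savings as a percentage of the baseline token count."""
    return 100.0 * tokens_saved(row) / row.baseline_tokens


# Hypothetical values for illustration only.
rows = [
    Row("example-a", 100_000, 60_000, True, True),
    Row("example-b", 80_000, 70_000, True, True),
    Row("example-c", 50_000, 20_000, False, True),  # baseline failed: excluded
]

clean = clean_rows(rows)
for row in clean:
    print(f"{row.name}: {tokens_saved(row)} saved, {percent_reduction(row):.1f}% less")
print(f"median reduction: {median(percent_reduction(r) for r in clean):.1f}%")
```

Filtering before computing savings means a row where one side failed never contributes to the reduction figures, however large its token difference.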

Validated slices

8

fresh, endurance, MCP, official

Median reduction

42.7%

clean rows shown

Official tasks passed

5/5

small SWE-Pro pilot

Endurance suites

4

long-session evidence

01

Click endurance

29k

saved

42.7%

less token use

91k baseline to 62k with Kairn.

10-turn debugging session.

02

Node SemVer endurance

11k

saved

16.4%

less token use

70k baseline to 58k with Kairn.

Fresh long-session JavaScript package test.

03

OpenLibrary SWE-Pro

12k

saved

54.2%

less token use

22k baseline to 10k with Kairn.

Official evaluator row.

Coverage

More than a token chart.

Kairn has been tested across clean token-savings rows, longer sessions, MCP delivery, official evaluator rows, and governor ablations.

Token reduction

On clean rows, Kairn reduced token use by 16.4% to 54.2%, counted only when both the baseline and Kairn runs completed the task.

measured

Long sessions

Endurance runs show Kairn keeping source focus across repeated debugging and verification turns.

repeat-tested

Editor path

MCP-first runs recorded real tool calls and useful source guidance, so Kairn is not limited to terminal hooks.

MCP

Official evaluator

In a small SWE-Pro pilot, Kairn rows passed 5/5 tasks, compared with 2/5 for the baseline.

SWE-Pro

Quality lift

Several controlled rows passed with Kairn where the baseline failed the quality gate, showing the value of better source focus.

rescue

Decision layer

Ablation runs showed that the governor improves safety by shrinking or suppressing its guidance, or by asking for evidence, instead of over-assisting.

governor

Quality

Better source focus can improve outcomes.

Some controlled rows passed with Kairn where the baseline failed the quality gate. Those rows are shown separately from token-savings rows.

official evaluator pass rate

Official SWE-Pro pilot

Small generated-patch official evaluator slice; includes quality-rescue rows.

Codex alone

2/5

tasks passed

40%

pass rate

Codex + Kairn

5/5

tasks passed

100%

pass rate

Interpretation

What this does and does not prove.

Kairn is strongest when the task would otherwise cause repeated source searching.

Kairn's silence is a feature: it avoids adding context when context is not useful.

The strongest savings rows count only after quality gates pass.

The Codex session path is the most tested; MCP is the portable editor path.

Details

Want the benchmark split?

The detailed page keeps modes, MCP runs, official evaluator rows, and caveats visible for technical review.
