Evidence details

The receipts behind the public numbers.

This page keeps the benchmark split visible: clean pass/pass savings, fresh checks, endurance and MCP runs, official evaluator rows, and results that are useful but not counted as headline savings.

How to read this evidence

Clean savings

Counted only when both baseline and Kairn pass the quality gate.

Quality/source focus

Reported separately when Kairn passes and baseline misses quality.

MCP evidence

Requires actual MCP calls plus useful source guidance, not just a configured server.

Fresh repeat-1 checks

Useful new signal, but repeat-3 is needed before it becomes a headline benchmark.
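
As a rough illustration of the reading rules above, the sketch below sorts a result row into a bucket and checks whether it could be promoted to a headline benchmark. The names `Row`, `bucket`, `headline_eligible`, and `counts_as_mcp_evidence` are hypothetical and not part of Kairn. The "not counted" bucket for rows where Kairn itself misses the quality gate is an assumption, since this page does not say how those rows are handled.

```python
from dataclasses import dataclass


@dataclass
class Row:
    """Hypothetical result row; field names are illustrative only."""
    baseline_passed: bool  # baseline met the quality gate
    kairn_passed: bool     # Kairn met the quality gate
    repeats: int           # 1 for a repeat-1 check, 3 for repeat-3


def bucket(row: Row) -> str:
    # Clean savings require both sides to pass the quality gate.
    if row.baseline_passed and row.kairn_passed:
        return "clean savings"
    # Kairn passes, baseline misses: reported separately.
    if row.kairn_passed:
        return "quality/source focus"
    # Assumption: rows where Kairn misses quality are not counted at all.
    return "not counted"


def headline_eligible(row: Row) -> bool:
    # A fresh repeat-1 check needs repeat-3 before it becomes a headline.
    return bucket(row) == "clean savings" and row.repeats >= 3


def counts_as_mcp_evidence(mcp_calls: int, useful_source_guidance: bool) -> bool:
    # A configured server alone is not enough: calls must actually happen
    # and the returned source guidance must be useful.
    return mcp_calls > 0 and useful_source_guidance
```

Under these assumptions, a pass/pass repeat-1 row lands in the clean savings bucket but is not yet headline eligible.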

Kairn is not measured as a basic code RAG layer. RAG finds relevant candidates; Kairn is measured on what should actually reach the agent during the task and on how much context is worth spending.

Clean pass/pass savings

Rows where both baseline and Kairn completed the task successfully, so token reduction can be counted directly.

Workflow	Type	Baseline	Kairn	Reduction	Quality	Status	Notes
Click endurance	Long session repeat-3	90,733	61,528	32.2%	pass/pass	clean savings	10-turn debugging session with zero scope violations.
Node SemVer endurance	Long session repeat-3	69,513	58,142	16.4%	pass/pass	clean savings	Fresh JavaScript package endurance row outside the earlier Python-heavy set.
OpenLibrary SWE-Pro	Official evaluator official row	21,932	10,048	54.2%	pass/pass	clean official savings	Official evaluator pass/pass row; counted separately from quality-rescue rows.
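
The Reduction column appears to be the token difference as a share of the baseline. This page does not state the formula, so treat that as an assumption; it does reproduce the Node SemVer and OpenLibrary rows above. A minimal sketch, with `reduction_pct` as a hypothetical helper name:

```python
def reduction_pct(baseline_tokens: int, kairn_tokens: int) -> float:
    """Percent reduction relative to baseline, rounded to one decimal.

    Assumed formula: (baseline - kairn) / baseline * 100.
    A negative result would mean Kairn used more tokens than baseline.
    """
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    saved = baseline_tokens - kairn_tokens
    return round(100 * saved / baseline_tokens, 1)


# Token counts taken from the table above.
print(reduction_pct(69_513, 58_142))  # Node SemVer endurance -> 16.4
print(reduction_pct(21_932, 10_048))  # OpenLibrary SWE-Pro   -> 54.2
```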

Latest fresh checks

Newer post-launch checks that are useful signal, but not promoted to headline claims until repeated.

Workflow	Type	Baseline	Kairn	Reduction	Quality	Status	Notes
Fresh source-routing check	Source routing repeat-1 fresh check	122,980	75,339	38.7%	3/3 pass/pass	fresh repeat-1	Active-valid assistance with scope discipline passing. Needs repeat-3 before it becomes a headline benchmark.
Fresh SWE-Pro governance slice	Official evaluator fresh official slice	4/8 known local passes	9/9 valid official passes	9.3% clean median	10/10 local pass	fresh official	Post-governor branch check. Strongest signal is source/scope discipline; one row excluded for evaluator image availability.

Endurance and MCP evidence

Long-session and MCP rows where Kairn had to stay useful beyond the first turn or through actual MCP calls.

Workflow	Type	Baseline	Kairn	Reduction	Quality	Status	Notes
httpcore MCP endurance	MCP endurance repeat-3	recorded	recorded	18.5% median	100%	clean MCP	Actual MCP calls, useful returned files, and zero scope violations.
Requests MCP post-fix	MCP endurance repeat-3	recorded	recorded	28.1% median	3/3 pass	clean MCP	Useful MCP file returns in all three runs; zero scope violations.
Click MCP-first	MCP endurance repeat-3	recorded	recorded	52.6% median	pass/pass	clean MCP	MCP-first variant of the Click endurance suite.
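
The MCP rows above report a median rather than a single run. One plausible reading, not confirmed by this page, is the median of the per-run reductions across the three repeats. The sketch below uses made-up per-run values that do not correspond to any row here, and `median_reduction` is a hypothetical helper name.

```python
from statistics import median


def median_reduction(per_run_reductions: list[float]) -> float:
    """Median of per-run percent reductions for a repeat-3 row.

    Assumption: a row is only summarized once at least three runs exist.
    """
    if len(per_run_reductions) < 3:
        raise ValueError("repeat-3 needs at least three runs")
    return median(per_run_reductions)


# Hypothetical per-run reductions, for illustration only.
hypothetical_runs = [12.0, 20.0, 31.0]
print(median_reduction(hypothetical_runs))  # 20.0
```

A median keeps one unusually good or bad run from dominating the reported figure.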

Official evaluator evidence

Rows run through the official evaluator path. Clean savings are counted only when both sides pass.

Workflow	Type	Baseline	Kairn	Reduction	Quality	Status	Notes
SWE-Pro official campaigns	Official evaluator 25 official rows	15/25 passed	20/25 passed	6 clean rows	official evaluator	quality/source-focus + clean savings	Clean rows saved 63,393 observed tokens with a 22.1% median reduction; rescue rows are reported separately.
SWE-Pro new expansion	Official evaluator 20 new rows	13/20 passed	15/20 passed	5 clean rows	official evaluator	fresh official expansion	New expansion rows saved 51,509 observed tokens across clean pass/pass savings rows.
SWE-Pro governance slice clean rows	Official evaluator 3 clean pass/pass rows	174,620	154,627	11.4%	official pass/pass	clean official savings	Latest slice clean rows had positive savings; source/scope discipline was the stronger result.

What does not count as clean savings

These rows are useful for product learning, but they are not used as headline savings claims.

Workflow	Type	Baseline	Kairn	Reduction	Quality	Status	Notes
Fresh live canary	Live canary repeat-1	missed quality	passed	supporting only	quality rescue	not clean savings	Useful because Kairn improved the outcome, but not a pass/pass savings claim.
SWE-Pro token regressions	Official evaluator tracked regressions	official pass	official pass	4 regressions	pass/pass	not clean savings	Some official pass/pass rows used more tokens with Kairn; these are tracked as optimization targets.
Suppressed or passive rows	Governor behavior varies	varies	silent or tiny	supporting only	varies	not active savings	Correct silence is product evidence, but it is reported separately from active token wins.

Claims to keep straight

Safe	Kairn can reduce token use on source-rescue, debug, MCP, and endurance workflows.
Safe	Kairn may stay silent when confidence is low; that is intentional suppression.
Safe	Codex CLI/session is the most-tested path; MCP is the portable editor path.
Careful	Savings vary by task and model; official SWE-Pro rows separate clean savings from quality/source-focus evidence.
Avoid	Do not claim universal 20-70% savings or public works-anywhere reliability yet.
