Vibgrate AI Context

Token-cost benchmarks: AI sessions with and without the code map

Every question you ask an AI assistant about your codebase starts with the assistant re-reading that codebase, and the re-reading accounts for most of the token bill. Vibgrate AI Context replaces it with answers from a pre-built code map. This page publishes measurements of what that replacement is worth: the same tasks and the same model, run with and without the map, compared at equal verified success. The repo sizes where the map is not worth it are included.

How the numbers are measured

Same model, twice

Each coding task runs twice with the same model on two fresh copies of the repo: once with generic file tools (list, search, read, write), once with Vibgrate AI Context (vg serve) for discovery.

Only discovery differs

Both runs can read and write files. The only difference is how the model finds the right code: walking and grepping the tree, or asking the pre-built code map.

Success is verified, not assumed

After each run, a deterministic verifier executes the edited code in the language’s real runtime and checks its behaviour. A run that fails the verifier earns no credit, however few tokens it used.
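To make the idea of a deterministic verifier concrete, here is a minimal sketch in Python. It is not Vibgrate's verifier: the function name `verify`, its parameters, and the choice to compare standard output against an expected string are all assumptions made for illustration. It runs a command in the language's real runtime and passes only if the process exits cleanly with the expected output.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import List


def verify(command: List[str], cwd: Path, expected_stdout: str,
           timeout_s: float = 30.0) -> bool:
    """Hypothetical verifier: run `command` in `cwd` and check its behaviour.

    Passes only if the process exits with status 0 and prints exactly the
    expected output (ignoring leading/trailing whitespace). A timeout or a
    failure to launch counts as a failed run.
    """
    try:
        proc = subprocess.run(
            command,
            cwd=cwd,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return proc.returncode == 0 and proc.stdout.strip() == expected_stdout.strip()


# Usage with a throwaway "edited repo" containing one Python file.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / "main.py").write_text("print(2 + 3)\n")
    passed = verify([sys.executable, "main.py"], repo, expected_stdout="5")
    failed = verify([sys.executable, "main.py"], repo, expected_stdout="6")
```

Because the check depends only on the exit status and output of the executed code, the same edited repo always gets the same verdict, which is the property the paragraph above relies on.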

Savings only count at equal success

Reduction is computed only across tasks both runs solved. The headline is “fewer tokens at the same outcome” — never “cheaper because it gave up”. Tiers where the map saves nothing are published too.
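As a rough illustration of the equal-success rule, the sketch below computes a reduction over only those tasks that both runs solved. The record layout (`RunResult`, `passed`, `tokens`) and the function name are hypothetical, and the token counts are invented example values, not measurements. The published reports may aggregate differently, for example per task rather than over totals.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class RunResult:
    """One run of one task (hypothetical record layout)."""
    passed: bool  # did the verifier accept the edited code?
    tokens: int   # model-reported usage summed across every step


def reduction_at_equal_success(
    baseline: Dict[str, RunResult],
    with_map: Dict[str, RunResult],
) -> Optional[float]:
    """Fractional token reduction over tasks that BOTH runs solved.

    Returns None when no task was solved by both runs, so a set with no
    comparable tasks never reports a saving.
    """
    both_solved = [
        task for task in baseline
        if task in with_map
        and baseline[task].passed
        and with_map[task].passed
    ]
    if not both_solved:
        return None
    base_total = sum(baseline[t].tokens for t in both_solved)
    map_total = sum(with_map[t].tokens for t in both_solved)
    if base_total == 0:
        return None
    return 1.0 - map_total / base_total


# Invented example values, for illustration only.
baseline = {
    "a": RunResult(passed=True, tokens=1000),
    "b": RunResult(passed=True, tokens=2000),
    "c": RunResult(passed=False, tokens=500),
}
with_map = {
    "a": RunResult(passed=True, tokens=400),
    "b": RunResult(passed=False, tokens=100),  # cheap but failed: excluded
    "c": RunResult(passed=True, tokens=300),
}
reduction = reduction_at_equal_success(baseline, with_map)
```

In this example only task `a` is solved by both runs, so the reduction is computed from 1000 versus 400 tokens. Task `b` used very few tokens with the map but failed verification, so it contributes nothing to the figure.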

The task set spans 13 repositories in 9 languages (JavaScript, TypeScript, Python, Go, Ruby, Java, PHP, Bash, C) across three size tiers, from 3-file fixtures to a ~1,000-file platform. Tokens are the model’s own reported usage summed across every step. Reports are produced per CLI release against the exact shipped build and reviewed before publication — regressions and negative tiers included.

First measured report on the way

Token-savings reports are generated automatically with each CLI release and published after human review. The first published report will appear here, per repository and per language, pinned to the release it measured, whatever the numbers say.

These reports sit alongside the CLI release benchmarks (scan correctness, code-graph extraction, retrieval accuracy, performance) and follow the same rule: measured on a pinned corpus, published per release, reviewed by a human before anything appears here.