Vibgrate CLI

How the benchmark numbers are measured

A benchmark you can't reproduce is an advertisement. This page defines exactly what each published number means and how it is produced, so you can check our work.

Two released builds, one machine

Every comparison runs two arms: the previous release, installed from the npm registry, and the new release, measured from the exact package archive that was published, never a rebuild. Both arms run interleaved on the same machine in the same session, so hardware and load differences affect both arms about equally and largely cancel out of the comparison.
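As an illustration of the interleaving described above, here is a minimal Python sketch. It is not the harness itself: `run_interleaved`, `median_per_arm`, the arm names, and the idea of passing each arm as a callable are assumptions made for this example.

```python
import statistics
import time
from typing import Callable, Dict, List


def run_interleaved(
    arms: Dict[str, Callable[[], None]],
    repetitions: int,
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, List[float]]:
    """Time every arm once per round, alternating arms within each round.

    Hypothetical helper: the real harness may order and time runs differently.
    """
    samples: Dict[str, List[float]] = {name: [] for name in arms}
    for _ in range(repetitions):
        # One round runs every arm once, so slow drift in machine load is
        # spread across both arms instead of landing on only one of them.
        for name, run in arms.items():
            start = clock()
            run()
            samples[name].append(clock() - start)
    return samples


def median_per_arm(samples: Dict[str, List[float]]) -> Dict[str, float]:
    """Summarise each arm by the median of its timing samples."""
    return {name: statistics.median(values) for name, values in samples.items()}
```

In use, each callable would start one of the two installed builds (for example through `subprocess.run`), and the two medians would then be compared.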

A pinned, versioned corpus

All accuracy numbers are scored against a corpus whose ground truth is known by construction: manifest fixtures across 15 package ecosystems where the manifest content is the expected answer, and deterministic adversarial codebases across every supported language (name collisions, deep nesting, unusual identifiers, very large files). The corpus carries a version, printed with every table. Numbers are only compared at equal corpus versions; when the corpus changes, the report says so.
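The rule that numbers are only compared at equal corpus versions could be enforced with a guard like the one below. This is a sketch under assumptions: the `Report` shape, its field names, and the `compare` function are hypothetical and do not describe the actual report format.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Report:
    corpus_version: str        # hypothetical field name
    metrics: Dict[str, float]  # metric name -> measured value


def compare(previous: Report, new: Report) -> Optional[Dict[str, float]]:
    """Return per-metric deltas, or None when the corpus versions differ."""
    if previous.corpus_version != new.corpus_version:
        # Different corpora are not comparable, so refuse instead of guessing.
        return None
    shared = previous.metrics.keys() & new.metrics.keys()
    return {name: new.metrics[name] - previous.metrics[name] for name in sorted(shared)}
```

Returning `None` rather than a partial result keeps a cross-version comparison from ever being mistaken for a real delta.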

What each family measures

Code-graph extraction — per language: definitions and call edges extracted, the share of source files parsed, and false self-reference edges (lower is better).
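One way to count false self-reference edges is sketched below. It assumes, for illustration only, that a call edge is a `(caller, callee)` pair and that a self-edge is "false" when the ground truth does not contain it (genuine recursion would be in the ground truth). The function name and types are hypothetical.

```python
from typing import Iterable, Set, Tuple

Edge = Tuple[str, str]  # (caller, callee); an assumed representation


def count_false_self_edges(extracted: Iterable[Edge], ground_truth: Set[Edge]) -> int:
    """Count extracted edges that point from a symbol to itself without ground-truth support."""
    return sum(
        1
        for caller, callee in set(extracted)  # de-duplicate before counting
        if caller == callee and (caller, callee) not in ground_truth
    )
```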

Retrieval — against ground-truth lookups: how often the first answer is the right one (locate top-1), symbol resolution, path connectivity, latency percentiles, and transport errors.
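To make two of these retrieval metrics concrete, the sketch below computes a top-1 rate and a latency percentile. The function names are hypothetical, and the nearest-rank percentile definition is an assumption; this page does not say which percentile method the harness uses.

```python
import math
from typing import Sequence


def top1_accuracy(ranked_answers: Sequence[Sequence[str]], expected: Sequence[str]) -> float:
    """Share of lookups whose first returned answer equals the ground truth."""
    if len(ranked_answers) != len(expected):
        raise ValueError("one expected answer is needed per lookup")
    if not expected:
        raise ValueError("no lookups to score")
    hits = sum(
        1
        for answers, truth in zip(ranked_answers, expected)
        if answers and answers[0] == truth  # an empty answer list is a miss
    )
    return hits / len(expected)


def percentile(latencies_ms: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]
```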

Supply chain — scan correctness against authored manifests across 15 ecosystems: dependency detection, version fidelity, invented or duplicated rows (defects, counted), and software bill of materials determinism and coverage. Two exports of the same input must match byte for byte.
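The byte-for-byte requirement can be illustrated with a toy exporter. Everything here is an assumption for the example: `export_sbom` emits a made-up JSON layout, not a real SBOM format, and it is not the Vibgrate exporter. The point is only that a stable ordering and a fixed serialisation make two exports of the same input identical.

```python
import json
from typing import Dict, List


def export_sbom(components: List[Dict[str, str]]) -> bytes:
    """Serialise components in a stable order (toy format, for illustration only)."""
    ordered = sorted(components, key=lambda c: (c["name"], c["version"]))
    # sort_keys and fixed separators remove ordering and whitespace variation.
    return json.dumps(ordered, sort_keys=True, separators=(",", ":")).encode("utf-8")


def exports_match(first: bytes, second: bytes) -> bool:
    """Determinism check: two exports of the same input must be identical bytes."""
    return first == second
```

A timestamp or random identifier embedded in the output would make this check fail, which is the kind of defect the determinism metric is meant to surface.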

Performance — startup time (median of interleaved repetitions) and the published package size.

Honesty rules

These are enforced by the report format itself, not by editorial choice:

— A regression is never omitted. Every table ships with a regressions section; when it is empty, the page says how many metrics were compared, so "no regressions" is a counted claim.

— Timing changes inside the measured noise floor are labelled "no significant change" and are never presented as improvements.

— A metric that didn't run is shown as absent with the reason. Absent is never displayed as zero, and zero is never displayed as absent.

— Numbers are rendered from the report by deterministic code. No language model writes, rounds, or summarizes a benchmark number anywhere in the pipeline.
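The rules above lend themselves to small deterministic rendering functions. The sketch below is illustrative only: the `Metric` shape, the function names, the output wording other than "no significant change", and the way the noise floor is passed in are all assumptions, not the actual report renderer.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Metric:
    name: str
    value: Optional[float]               # None means the metric did not run
    absent_reason: Optional[str] = None  # hypothetical field


def render_value(metric: Metric) -> str:
    """Keep 'absent' and 'zero' visibly distinct."""
    if metric.value is None:
        reason = metric.absent_reason or "reason not recorded"
        return f"absent ({reason})"
    return f"{metric.value:g}"


def label_timing_change(previous_ms: float, new_ms: float, noise_floor_ms: float) -> str:
    """Never report a change inside the noise floor as an improvement."""
    delta = new_ms - previous_ms
    if abs(delta) <= noise_floor_ms:
        return "no significant change"
    return "faster" if delta < 0 else "slower"


def regressions_section(regressions: List[str], compared: int) -> str:
    """An empty section still states how many metrics were compared."""
    if not regressions:
        return f"No regressions across {compared} compared metrics."
    return "Regressions: " + ", ".join(regressions)
```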

Profiles and cadence

Each release runs the fast per-release profile (small and medium corpus tiers, sampled request sets) so results publish with the release. A weekly run covers the full corpus, including the largest repositories. Every table is labelled with the profile it came from.

Reproducing and auditing

The harness, the corpus generators, and the fixtures are part of the open Vibgrate CLI codebase, and every published page links back to a report that records the environment, the corpus version, the request-set seed, and the checksums of both package archives. Reports are reviewed by a person before publication, through the same process as release notes.
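When auditing a report, a reader might check a downloaded package archive against the recorded checksum along these lines. This is a sketch that assumes the checksum is a hex-encoded SHA-256 digest; the page does not name the hash algorithm, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded_checksum(path: Path, recorded_hex: str) -> bool:
    """Compare the archive's digest with the value recorded in the report (assumed SHA-256)."""
    return sha256_of_file(path) == recorded_hex.strip().lower()
```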

Questions about the method, or a result you can't reproduce? Tell us — a benchmark defect is a product defect.