Methodology
How scores are computed today, what's being measured, and where the current approach stops short.
Status: static parameters
Today every score is derived from static signals — file existence and content-length checks on the cloned tree. No agent is actually run. Per-model weights are illustrative, not yet derived from measured agent success. This is enough to produce meaningfully different rankings and to show how the UX of per-model scoring feels, but it should not be read as a benchmark.
The plan to replace illustrative weights with measured ones is the v0.3.0 milestone on the roadmap (tasks/0.3.0/01-benchmark-harness.md). Until then, treat the numbers as a directional signal, not a verdict.
Score formula
per-model score = Σ(signal.pass × model.weight[signal]) / Σ(model.weight) × 100
overall = mean(per-model scores)
improvement = (1 - pass) × weight / Σweight × 100 (points unlocked by fully closing a gap)
signal.pass is a float in [0, 1] — partial credit is allowed (e.g. a thin README gets 0.3, a long one gets 1.0).
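The formula above can be sketched in a few lines. This is an illustrative translation of the stated math, not the tool's actual code; the function and parameter names are made up for the sketch.

```python
def per_model_score(passes: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted pass rate scaled to 0-100. passes maps signal -> [0, 1]."""
    total_weight = sum(weights.values())
    earned = sum(passes.get(sig, 0.0) * w for sig, w in weights.items())
    return earned / total_weight * 100

def overall(per_model_scores: list[float]) -> float:
    """Overall score is the unweighted mean of the per-model scores."""
    return sum(per_model_scores) / len(per_model_scores)

def improvement(sig: str, passes: dict[str, float], weights: dict[str, float]) -> float:
    """Points unlocked for one model by fully closing the gap on one signal."""
    total_weight = sum(weights.values())
    return (1 - passes.get(sig, 0.0)) * weights[sig] / total_weight * 100
```

With two equally weighted signals, one fully passing and one at 0.5, the per-model score is 75 and fully closing the half-passed gap unlocks 25 points.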
Signals (12)
- AGENTS.md / CLAUDE.md (agents_md): Presence of an agent-oriented instructions file, with substantive content. Improve: Add an AGENTS.md covering project goals, layout, setup commands, and conventions. Aim for 800+ chars of real guidance (not boilerplate).
- README (readme): A non-trivial README so the agent can learn the project quickly. Improve: Expand your README to cover what the project does, how to install it, the common commands, and the high-level layout.
- Test suite (tests): Detectable tests, since agents rely on feedback loops. Improve: Add a tests/ (or test/, __tests__/, spec/) directory with runnable tests. Document how to run them in AGENTS.md.
- CI configuration (ci): A defined pipeline the agent can reason about or emulate locally. Improve: Add a CI workflow (e.g. .github/workflows/ci.yml or .gitlab-ci.yml) that runs tests and the linter on every PR.
- Linter / formatter config (linter): Agents get immediate feedback on style rather than ambiguous drift. Improve: Configure a linter/formatter (ESLint+Prettier, Biome, Ruff, rustfmt+clippy, golangci-lint) and commit the config.
- Dependency manifest (deps_manifest): A machine-readable dependency list so the agent can reproduce the environment. Improve: Commit a proper manifest (package.json, pyproject.toml, Cargo.toml, go.mod, etc.) plus a lockfile.
- Reproducible dev env (dev_env): One-command setup the agent can run (Makefile / devcontainer / Nix / Docker). Improve: Add a Makefile, devcontainer, or Dockerfile so the agent can set up the project in one command.
- Type configuration (type_config): Static types help agents reason about call sites without running code. Improve: Add a type config (tsconfig.json for JS/TS, mypy.ini or pyrightconfig.json for Python). Rust and Go are typed by default.
- License file (license): Clarity on what an agent is allowed to do with the code. Improve: Add a LICENSE (or COPYING) file (MIT, Apache-2.0, BSD, GPL, etc.) at the repo root.
- CONTRIBUTING guide (contributing): An explicit contribution workflow an agent can follow. Improve: Add CONTRIBUTING.md describing branch naming, commit style, test commands, and the PR process.
- Pre-commit / git hooks (pre_commit): Catches problems locally before the agent wastes a CI cycle. Improve: Set up pre-commit (.pre-commit-config.yaml), husky, or lefthook to run format and lint on every commit.
- Manageable size (size): Very large repos strain an agent's context window. Improve: If possible, split into smaller modules or carve out a focused entry path. Document where to start in AGENTS.md.
Models & weight profiles (4)
- Claude Code: Weights AGENTS.md and tests heavily — Claude Code leans on an instructions file and a fast feedback loop.
Weights
agents_md 1.00 readme 0.70 tests 1.00 ci 0.50 linter 0.60 deps_manifest 0.70 dev_env 0.90 type_config 0.60 license 0.30 contributing 0.40 pre_commit 0.40 size 0.50
- Cursor: Weights type config and a detailed README highly — Cursor's inline edits benefit from static types and skim-readable docs.
Weights
agents_md 0.60 readme 1.00 tests 0.70 ci 0.40 linter 0.80 deps_manifest 0.80 dev_env 0.50 type_config 1.00 license 0.30 contributing 0.30 pre_commit 0.30 size 0.40
- Devin: Weights CI and reproducible envs highly — Devin runs in a sandboxed VM and needs end-to-end automation.
Weights
agents_md 0.60 readme 0.70 tests 0.90 ci 1.00 linter 0.50 deps_manifest 0.90 dev_env 1.00 type_config 0.50 license 0.30 contributing 0.50 pre_commit 0.50 size 0.60
- GPT-5 Codex: A balanced profile as a reference point.
Weights
agents_md 0.70 readme 0.80 tests 0.80 ci 0.70 linter 0.60 deps_manifest 0.70 dev_env 0.70 type_config 0.70 license 0.30 contributing 0.40 pre_commit 0.40 size 0.50
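To see how the profiles diverge, here is a worked example using the Claude Code and Cursor weights listed above on a hypothetical repo that has types, a good README, a linter, a manifest, a license, and a manageable size, but no AGENTS.md, tests, CI, dev env, CONTRIBUTING, or hooks. The scoring function just restates the formula from the Score formula section.

```python
claude_code = {"agents_md": 1.0, "readme": 0.7, "tests": 1.0, "ci": 0.5,
               "linter": 0.6, "deps_manifest": 0.7, "dev_env": 0.9,
               "type_config": 0.6, "license": 0.3, "contributing": 0.4,
               "pre_commit": 0.4, "size": 0.5}
cursor = {"agents_md": 0.6, "readme": 1.0, "tests": 0.7, "ci": 0.4,
          "linter": 0.8, "deps_manifest": 0.8, "dev_env": 0.5,
          "type_config": 1.0, "license": 0.3, "contributing": 0.3,
          "pre_commit": 0.3, "size": 0.4}

# Hypothetical repo: these six signals fully pass, the other six are absent.
passes = {"readme": 1.0, "type_config": 1.0, "linter": 1.0,
          "deps_manifest": 1.0, "license": 1.0, "size": 1.0}

def score(passes: dict[str, float], weights: dict[str, float]) -> float:
    return sum(passes.get(s, 0.0) * w for s, w in weights.items()) \
        / sum(weights.values()) * 100
```

Under these weights the repo scores roughly 61 for Cursor but only about 45 for Claude Code — the same tree, ranked meaningfully differently per model, which is the point of per-model profiles.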
What isn't measured yet
- Whether tests actually pass (we only detect their presence).
- Whether the linter actually runs cleanly.
- Whether the dev-env artifact (Makefile, Dockerfile) works end-to-end.
- Commit-history signals (churn, commit frequency, contributor count). We clone with --depth 1 --single-branch, which fetches the whole working tree at the HEAD of the default branch but no history. Closing this gap is planned as v0.7.0 on the roadmap.
- How agents actually perform on the repo: that's the v0.3.0 benchmark harness.