Methodology

How scores are computed today, what's being measured, and where the current approach stops short.

Status: static parameters

Today every score is derived from static signals — file existence and content-length checks on the cloned tree. No agent is actually run. Per-model weights are illustrative, not yet derived from measured agent success. This is enough to produce meaningfully different rankings and to show how the UX of per-model scoring feels, but it should not be read as a benchmark.

The plan to replace illustrative weights with measured ones is the v0.3.0 milestone on the roadmap (tasks/0.3.0/01-benchmark-harness.md). Until then, treat the numbers as a directional signal, not a verdict.
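
To make "static signals" concrete: each check boils down to a file-existence and content-length test on the cloned tree, returning a pass value in [0, 1]. The sketch below is illustrative only; the function name, the README filename list, and the 200/800-character thresholds are assumptions, not the actual implementation.

```python
from pathlib import Path


def readme_signal(repo_root: str) -> float:
    """Illustrative static check for the `readme` signal.

    Returns a pass value in [0, 1]: 0.0 if no README exists,
    partial credit for a thin one, 1.0 for a substantive one.
    The thresholds here are assumptions for the sketch.
    """
    for name in ("README.md", "README.rst", "README"):
        path = Path(repo_root) / name
        if path.is_file():
            length = len(path.read_text(errors="replace"))
            if length >= 800:   # substantive README: full credit
                return 1.0
            if length >= 200:   # thin README: partial credit
                return 0.3
            return 0.1          # stub README: token credit
    return 0.0                  # no README at all
```

Because no agent is run, a check like this can only attest that an artifact exists and is non-trivial, not that it is accurate or useful; that limitation is what the v0.3.0 harness is meant to close.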

Score formula

per-model score = Σ(signal.pass × model.weight[signal]) / Σ(model.weight) × 100
overall         = mean(per-model scores)
improvement     = (1 - signal.pass) × model.weight[signal] / Σ(model.weight) × 100   (points unlocked by closing a signal's gap)

signal.pass is a float in [0, 1] — partial credit is allowed (e.g. a thin README gets 0.3, a long one gets 1.0).
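
The three formulas translate directly into code. This is a minimal sketch, assuming signal passes and model weights arrive as plain dicts keyed by signal id (the data shapes are assumptions, not the tool's actual internals):

```python
def per_model_score(passes: dict, weights: dict) -> float:
    """per-model score = Σ(signal.pass × model.weight[signal]) / Σ(model.weight) × 100"""
    weighted = sum(passes[signal] * w for signal, w in weights.items())
    return weighted / sum(weights.values()) * 100


def overall(model_scores: list) -> float:
    """overall = mean(per-model scores)"""
    return sum(model_scores) / len(model_scores)


def improvement(signal: str, passes: dict, weights: dict) -> float:
    """Points unlocked by closing a signal's remaining gap."""
    return (1 - passes[signal]) * weights[signal] / sum(weights.values()) * 100
```

For example, with two equally weighted signals where one fully passes and one fully fails, the model scores 50 and closing the failing signal's gap unlocks the remaining 50 points.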

Signals (12)

  • AGENTS.md / CLAUDE.md
    agents_md
    Presence of an agent-oriented instructions file, with substantive content.
    Improve: Add an AGENTS.md covering project goals, layout, setup commands, and conventions. Aim for 800+ chars of real guidance (not boilerplate).
  • README
    readme
    Non-trivial README so the agent can learn the project quickly.
    Improve: Expand your README to cover what the project does, how to install, the common commands, and the high-level layout.
  • Test suite
    tests
    Detectable tests — agents rely on feedback loops.
    Improve: Add a tests/ (or test/, __tests__/, spec/) directory with runnable tests. Document how to run them in AGENTS.md.
  • CI configuration
    ci
    Defined pipeline the agent can reason about / emulate locally.
    Improve: Add a CI workflow (e.g. .github/workflows/ci.yml or .gitlab-ci.yml) that runs tests + linter on every PR.
  • Linter / formatter config
    linter
    Agents get immediate feedback on style rather than ambiguous drift.
    Improve: Configure a linter/formatter (ESLint+Prettier, Biome, Ruff, rustfmt+clippy, golangci-lint) and commit the config.
  • Dependency manifest
    deps_manifest
    Machine-readable dependency list so the agent can reproduce the env.
    Improve: Commit a proper manifest (package.json, pyproject.toml, Cargo.toml, go.mod, etc.) plus a lockfile.
  • Reproducible dev env
    dev_env
    One-command setup the agent can run (Makefile / devcontainer / Nix / Docker).
    Improve: Add a Makefile or devcontainer or Dockerfile so the agent can set up the project in one command.
  • Type configuration
    type_config
    Static types help agents reason about call sites without running code.
    Improve: Add a type config (tsconfig.json for JS/TS, mypy.ini or pyrightconfig.json for Python). Rust/Go are typed by default.
  • License file
    license
    Clarity on what an agent is allowed to do with the code.
    Improve: Add a LICENSE (or COPYING) file — MIT, Apache-2.0, BSD, GPL, etc. — at the repo root.
  • CONTRIBUTING guide
    contributing
    Explicit contribution workflow an agent can follow.
    Improve: Add CONTRIBUTING.md describing branch naming, commit style, test commands, and the PR process.
  • Pre-commit / git hooks
    pre_commit
    Catches problems locally before the agent wastes a CI cycle.
    Improve: Set up pre-commit (.pre-commit-config.yaml), husky, or lefthook to run format+lint on every commit.
  • Manageable size
    size
    Very large repos strain an agent's context window.
    Improve: If possible, split into smaller modules or carve out a focused entry path. Document where to start in AGENTS.md.

Models & weight profiles (4)

  • Claude Code
    Weights AGENTS.md and tests heavily — Claude Code leans on an instructions file and a fast feedback loop.
    Weights
    agents_md        1.00
    readme           0.70
    tests            1.00
    ci               0.50
    linter           0.60
    deps_manifest    0.70
    dev_env          0.90
    type_config      0.60
    license          0.30
    contributing     0.40
    pre_commit       0.40
    size             0.50
  • Cursor
    Weights type config and a detailed README highly — Cursor's inline edits benefit from static types and skim-readable docs.
    Weights
    agents_md        0.60
    readme           1.00
    tests            0.70
    ci               0.40
    linter           0.80
    deps_manifest    0.80
    dev_env          0.50
    type_config      1.00
    license          0.30
    contributing     0.30
    pre_commit       0.30
    size             0.40
  • Devin
    Weights CI and reproducible envs highly — Devin runs in a sandboxed VM and needs end-to-end automation.
    Weights
    agents_md        0.60
    readme           0.70
    tests            0.90
    ci               1.00
    linter           0.50
    deps_manifest    0.90
    dev_env          1.00
    type_config      0.50
    license          0.30
    contributing     0.50
    pre_commit       0.50
    size             0.60
  • GPT-5 Codex
    Balanced profile as a reference point.
    Weights
    agents_md        0.70
    readme           0.80
    tests            0.80
    ci               0.70
    linter           0.60
    deps_manifest    0.70
    dev_env          0.70
    type_config      0.70
    license          0.30
    contributing     0.40
    pre_commit       0.40
    size             0.50
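
To see how the profiles produce different rankings for the same repo, the sketch below scores one hypothetical repo under the Claude Code and Devin weights from the tables above. The weights are copied from this page; the pass values are invented for illustration (a repo with a strong AGENTS.md and tests but no CI, dev env, CONTRIBUTING, or hooks).

```python
def per_model_score(passes: dict, weights: dict) -> float:
    """Same formula as the Score formula section."""
    weighted = sum(passes[s] * w for s, w in weights.items())
    return weighted / sum(weights.values()) * 100


# Weights copied from the profile tables above.
CLAUDE_CODE = {"agents_md": 1.0, "readme": 0.7, "tests": 1.0, "ci": 0.5,
               "linter": 0.6, "deps_manifest": 0.7, "dev_env": 0.9,
               "type_config": 0.6, "license": 0.3, "contributing": 0.4,
               "pre_commit": 0.4, "size": 0.5}
DEVIN = {"agents_md": 0.6, "readme": 0.7, "tests": 0.9, "ci": 1.0,
         "linter": 0.5, "deps_manifest": 0.9, "dev_env": 1.0,
         "type_config": 0.5, "license": 0.3, "contributing": 0.5,
         "pre_commit": 0.5, "size": 0.6}

# Hypothetical repo: good docs and tests, no automation.
passes = {"agents_md": 1.0, "readme": 1.0, "tests": 1.0, "ci": 0.0,
          "linter": 1.0, "deps_manifest": 1.0, "dev_env": 0.0,
          "type_config": 1.0, "license": 1.0, "contributing": 0.0,
          "pre_commit": 0.0, "size": 1.0}

print(round(per_model_score(passes, CLAUDE_CODE), 1))  # → 71.1
print(round(per_model_score(passes, DEVIN), 1))        # → 62.5
```

The same repo scores roughly nine points lower under Devin's profile because the missing CI and dev-env signals carry full weight there, which is exactly the per-model differentiation the scoring is designed to surface.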

What isn't measured yet

  • Whether tests actually pass (we only detect their presence).
  • Whether the linter actually runs cleanly.
  • Whether the dev-env artifact (Makefile, Dockerfile) works end-to-end.
  • Commit-history signals — churn, commit frequency, contributor count. We use --depth 1 --single-branch, which fetches the whole working tree at HEAD of the default branch but no history. Closing this gap is planned as v0.7.0 on the roadmap.
  • How agents actually perform on the repo — that's the v0.3.0 benchmark harness.