jasonswett / llm-test-design-review · ★ Featured · worked examples

Test Design Review

Review the supplied tests for design quality against the guidelines catalogued below, and produce a precise, actionable review. Tests are executable specifications; the review judges how well each test reads as a specification — does it describe a scenario…

agent claude-code · model claude-sonnet-4-6 · snapshot python312-uv · eval programmatic · 8 steps · v1.0.0

Deploy Test Design Review to jetty.io

One-click installs this runbook into a collection on your Jetty account. You can run it from the Spot dashboard, schedule it, or pipe inputs in via the API.

Run time: 2–4 mins

Headline output: review.md

Runs on Jetty's managed sandbox. No setup. Free for your first 10 runs.

Worked examples · 3

Real runs, real outputs.

review · request-spec · eval ✓

RSpec request spec

A request spec with a vague name, .last, a redundant assertion, a mock-based assertion, and described_class — 6 findings.

review.md · findings.json · summary.md · validation_report.json

claude-sonnet-4-6 · 3 min

review · system-spec · eval ✓

RSpec system spec

A feature spec with have_current_path, a forward reference, instance_variable_set, a cargo-culted wait: 3, and a code-token describe — 8 findings.

review.md · findings.json · summary.md · validation_report.json

claude-sonnet-4-6 · 3 min

review · model-spec · eval ✓

RSpec model spec

A model spec that asserts on the caching mechanism (Rails.cache.read) and couples to a JSON internal — the means-vs-ends smells. 3 findings.

review.md · findings.json · summary.md · validation_report.json

claude-sonnet-4-6 · 2 min

The shape of the run

8 steps · start to finish.

Step 1

Environment Setup

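mkdir -p "{{results_dir}}"
# nullglob drops patterns that match nothing instead of leaving them as literal strings.
shopt -s nullglob
# Expand the globs into an array rather than through `ls`: with nullglob on and no
# matches, a bare `ls` would list the current directory and hide the error below.
# *.rb already covers *_spec.rb and *_test.rb, so they are not listed separately.
FILES=(/app/assets/*.rb /app/assets/*.rs /app/assets/*.py /app/assets/*.diff /app/assets/*.patch)
echo "Files to review:"; printf '%s\n' "${FILES[@]}"
[ "${#FILES[@]}" -gt 0 ] || { echo "ERROR: no code files found in /app/assets"; exit 1; }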

Read each file in full before reviewing.

Guidelines Catalog (the review criteria)

Review every test against each guideline. Each has an id (use it in findings.json), a name, and the rule.

| id | Guideline | The rule |
| --- | --- | --- |
| `specification-format` | Specification format | A test name should answer "in scenario X, what should happen?", not be vague ("it works correctly", "it handles errors"). |
| `behavior-not-implementation` | Test behavior, not implementation | Assert on observable behavior, not internal/implementation details or hard-coded incidental values. |
| `describe-the-essence` | Describe the essence | `describe`/`context` strings should capture the scenario's meaning (`"rerunning only failed tests"`), not a code token (`"scope=failed"`). |
| `avoid-arbitrariness` | Avoid arbitrariness | Don't retrieve records with `.first`/`.last` (order-dependent, fragile). Use explicit `change`/`where` queries instead. |
| `essential-not-incidental` | Assert essentials, not incidentals | Only assert what matters. Drop redundant assertions implied by others (e.g. `be_successful` next to a body assertion). |
| `one-level-of-abstraction` | Don't mix levels of abstraction | A block should operate at one level; push incidental setup/details out of the essential flow. |
| `avoid-forward-reference` | Avoid forward reference | Don't reference a `let`/variable before it's defined; order definitions before use, or inline the value. |
| `no-have-current-path` | Don't use `have_current_path` | Too coupled to the URL/implementation. Assert on what the user sees on the page. |
| `observable-not-method-calls` | Assert observable outcomes, not method calls | Avoid mock assertions like `expect(x).to have_received(:foo)` (tests means). Assert the real end result. Stub only true externals. |
| `test-ends-not-means` | Test ends, not means | For caching/perf, assert the observable difference (e.g. zero extra DB queries), not the mechanism (`Rails.cache.read`). |
| `high-level-of-abstraction` | Maintain a high level of abstraction | Hide dense incidental details behind a well-named helper (defined after the test); show the essence. |
| `no-private-method-hacks` | Don't hack to test private methods | Never `send`/`public_send` to reach private methods; make the method public instead. |
| `no-tight-coupling` | Don't tightly couple to implementation | Don't set up state via internal shapes (`json_output: {...}`) when a behavioral input (`exit_code: 0`) expresses the scenario. |
| `arrange-act-assert` | Use Arrange / Act / Assert | Structure the test into clear arrange, act, and assert phases. |
| `no-speculative-coding` | No speculative coding | Scrutinize cargo-culted choices (e.g. an unexplained `wait: 3`); remove what isn't justified. |
| `no-instance-variable-set` | Never `instance_variable_set` | If it seems necessary, that signals poor design; find it and suggest a specific refactor. |
| `no-described-class` | Don't use `described_class` | It adds obscurity; use the actual class name. |

The full bad/good examples for each guideline live in the source skill (linked in the frontmatter origin). Apply the rule above; cite the id in findings.

Step 2

Review

For each file, read it fully, then walk the Guidelines Catalog top to bottom. For every violation, capture: the guideline id, the file, the line_start/line_end, the exact offending_code (quoted), a…

Step 3

(reserved)

Step 4

Write `findings.json`

One object per violation:
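The schema itself is not reproduced on this page. As a rough sketch only, the following assumes a finding carries just the fields Step 2 names (guideline id, file, line_start/line_end, offending_code); the real schema may require more. The `Finding` class, the `guideline` key name, and `write_findings` are hypothetical names chosen for illustration.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class Finding:
    # Assumption: only the fields named in Step 2; the real schema may have more.
    guideline: str       # a catalog id, e.g. "avoid-arbitrariness" (key name is hypothetical)
    file: str
    line_start: int
    line_end: int
    offending_code: str  # quoted verbatim from the reviewed file


def write_findings(findings: list[Finding], results_dir: str) -> Path:
    """Serialize findings as a JSON array; an empty list is written as []."""
    out = Path(results_dir) / "findings.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps([asdict(f) for f in findings], indent=2) + "\n")
    return out
```

Writing an empty array when there are no findings keeps the file valid JSON, which matches the note under Required outputs that findings.json may be `[]` for clean tests.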

Step 5

Write `review.md`

Group findings by guideline (matching the source skill's "group by guideline"). For each guideline with ≥1 finding, write a section: the guideline name, then each finding as the offending code…
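The grouping described here could be sketched roughly as follows. The section layout (a `##` heading per guideline id, one bullet per finding) and the `render_review` name are assumptions made for illustration, since the exact format is truncated on this page; the dictionary keys are likewise assumed to mirror the fields named in Step 2.

```python
from collections import defaultdict


def render_review(findings: list[dict]) -> str:
    """Render one section per guideline that has at least one finding."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for finding in findings:
        grouped[finding["guideline"]].append(finding)

    lines: list[str] = []
    # Plain dicts keep insertion order, so sections appear in first-seen order.
    for guideline, items in grouped.items():
        lines.append(f"## {guideline}")
        for item in items:
            location = f'{item["file"]}:{item["line_start"]}-{item["line_end"]}'
            lines.append(f"- {location}: `{item['offending_code']}`")
        lines.append("")
    return "\n".join(lines)
```

Guidelines with no findings produce no section at all, which is one way to satisfy the "≥1 finding" condition.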

Step 6

Evaluate & Validate

Assign the review one status, then write validation_report.json.
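One possible shape for the validation pass is sketched below. Only `overall_passed` is named on this page; the stage names, the `validate` function, the required-field list, and the `guideline` key are assumptions. The "grouped" check is a crude stand-in that only confirms each guideline id with a finding is mentioned in review.md.

```python
import json

# The 17 ids from the Guidelines Catalog.
KNOWN_IDS = {
    "specification-format", "behavior-not-implementation", "describe-the-essence",
    "avoid-arbitrariness", "essential-not-incidental", "one-level-of-abstraction",
    "avoid-forward-reference", "no-have-current-path", "observable-not-method-calls",
    "test-ends-not-means", "high-level-of-abstraction", "no-private-method-hacks",
    "no-tight-coupling", "arrange-act-assert", "no-speculative-coding",
    "no-instance-variable-set", "no-described-class",
}
# Assumption: the fields named in Step 2; the real schema may require more.
REQUIRED = ("guideline", "file", "line_start", "line_end", "offending_code")


def validate(findings_text: str, review_text: str) -> dict:
    """Return a stage-by-stage report (stage names are hypothetical)."""
    stages = {"valid_json": False, "required_fields": False,
              "known_ids": False, "review_grouped": False}
    try:
        findings = json.loads(findings_text)
    except json.JSONDecodeError:
        findings = None
    if isinstance(findings, list) and all(isinstance(f, dict) for f in findings):
        stages["valid_json"] = True
        stages["required_fields"] = all(all(k in f for k in REQUIRED) for f in findings)
        ids = [f.get("guideline") for f in findings]
        stages["known_ids"] = all(isinstance(i, str) and i in KNOWN_IDS for i in ids)
        # Crude proxy for "grouped": every id with a finding appears in review.md.
        stages["review_grouped"] = all(isinstance(i, str) and i in review_text for i in ids)
    return {"stages": stages, "overall_passed": all(stages.values())}
```

An empty findings array passes every stage, so a clean test file yields `overall_passed: true` under these assumptions.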

Step 7

Iterate (max 3 rounds)

If validation fails (invalid JSON, a finding missing fields, an unknown guideline id, or review.md not grouped), fix it and re-validate. Max 3 rounds; then surface the remaining issue in summary.md.
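The bounded loop might look like the sketch below. It assumes a "round" means one fix followed by one re-validation; `validate_with_retries` and its `validate`/`fix` callables are hypothetical placeholders for whatever the agent actually does.

```python
from typing import Callable


def validate_with_retries(
    validate: Callable[[], bool],
    fix: Callable[[], None],
    max_rounds: int = 3,
) -> tuple[bool, int]:
    """Validate once, then fix and re-validate for at most `max_rounds` rounds.

    Returns (passed, rounds_used).
    """
    if validate():
        return True, 0
    for round_number in range(1, max_rounds + 1):
        fix()
        if validate():
            return True, round_number
    # Still failing: the caller surfaces the remaining issue in summary.md.
    return False, max_rounds
```

Capping the loop keeps a persistently failing run from cycling indefinitely; the unresolved issue is reported rather than retried.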

Step 8

Write Executive Summary

Write {{results_dir}}/summary.md:
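The summary template is not shown on this page. As an illustration only, the per-guideline counts and files-reviewed figure mentioned under Required outputs could be computed along these lines; the `render_summary` name, the heading, and the line wording are assumptions, and the overall assessment is left to the reviewer.

```python
from collections import Counter


def render_summary(findings: list[dict], files_reviewed: list[str]) -> str:
    """Build a short summary: file count, total, and per-guideline counts."""
    per_guideline = Counter(f["guideline"] for f in findings)
    lines = [
        "# Summary",
        "",
        f"Files reviewed: {len(files_reviewed)}",
        f"Total violations: {len(findings)}",
        "",
    ]
    # most_common() orders guidelines from most to fewest violations.
    for guideline, count in per_guideline.most_common():
        lines.append(f"- {guideline}: {count}")
    return "\n".join(lines) + "\n"
```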

Inputs

Code under review · file · required

The test file(s) or a diff to review. If a diff is given, only changed lines are reviewed.

Focus guideline · select

specification-format · behavior-not-implementation · describe-the-essence · avoid-arbitrariness · essential-not-incidental · one-level-of-abstraction · avoid-forward-reference

Optionally restrict the review to one guideline. Empty = all guidelines.

Dependencies

*(none)* · optional · —

Required outputs

{{results_dir}}/review.md
The human-readable review: findings grouped by guideline, each with the offending code, why it violates the guideline, and a suggested fix. The headline deliverable.
{{results_dir}}/findings.json
Structured findings — one object per violation (see schema in Step 4). May be `[]` when the tests are clean.
{{results_dir}}/summary.md
Executive summary: per-guideline violation counts, files reviewed, overall assessment.
{{results_dir}}/validation_report.json
Stage-by-stage validation with `overall_passed`. See Step 6.

Origin

source: github.com
title: test-design-review
attr: high
