garrytan / gstack-benchmark-models★ Featured · worked examples

Cross-Model Benchmark

Run the same prompt through two or more models side by side and answer "which model is actually best for this task?" with data instead of vibes. For each model, measure latency, tokens (prompt + completion), and cost, capture the output, and — when the judge…

agent claude-codemodel claude-sonnet-4-6snapshot python312-uveval programmatic8 stepsv1.0.0

Deploy Cross-Model Benchmark to your jetty.io

One-click installs this runbook into a collection on your Jetty account. You can run it from the Spot dashboard, schedule it, or pipe inputs in via the API.

Deploy on jetty.io →View source

Run time2–4 mins

Headline outputbenchmark.md

Runs on Jetty's managed sandbox. No setup. Free for your first 10 runs.

Worked examples · 3

Real runs, real outputs.

benchmarkcodingeval ✓

Coding prompt — merge intervals

Three models implement merge_intervals(). gpt-4o fastest + cheapest; claude and gpt tie at quality 10; gemini-2.5-flash verbose and slower.

benchmark.mdresults.jsonsummary.mdvalidation_report.json

claude-sonnet-4-6 · 3 minView run →Source

benchmarkstructured-extractioneval ✓

Extraction prompt — notes → JSON

Three models extract action items from meeting notes into JSON. gpt-4o fastest (1.52s) + cheapest; claude and gpt tie at quality 10.

benchmark.mdresults.jsonsummary.mdvalidation_report.json

claude-sonnet-4-6 · 2 minView run →Source

benchmarkwritingeval ✓

Writing prompt — launch blurb

The differentiating run: on a 100-word launch brief, gpt-4o was fastest + cheapest but scored only 3/10 (missed the no-hype constraint) while claude scored 9. Speed is…

benchmark.mdresults.jsonsummary.mdvalidation_report.json

claude-sonnet-4-6 · 2 minView run →Source

The shape of the run

8 steps · start to finish.

1
Step 1
Environment Setup & Provider Preflight
▶
```
python -m pip install --quiet "litellm>=1.40"
mkdir -p "{{results_dir}}"
echo "Provider keys present:"
for k in ANTHROPIC_API_KEY OPENAI_API_KEY GEMINI_API_KEY OPENROUTER_API_KEY; do
  [ -n "${!k}" ] && echo "  $k: SET" || echo "  $k: (absent)"
done
```
Preflight (the dry-run, from the source skill): map each requested model to its provider and check the key is present. Models whose provider key is absent are reported as skipped: no_key and excluded — they do not abort the run. If zero models have a key, STOP and write a clear error (a benchmark needs at least one authed provider).

Provider inference: claude*/anthropic/* → ANTHROPIC_API_KEY; gpt*/o1*/openai/* → OPENAI_API_KEY; gemini* → GEMINI_API_KEY; openrouter/* → OPENROUTER_API_KEY.
2
Step 2
Resolve the Prompt
▶
Use the first .txt/.md in /app/assets/; if none, use {{prompt}}. If both are empty, STOP with an error. Record the resolved prompt source in summary.md.
3
Step 3
Run the Benchmark
▶
Run the same prompt through every authed model, measuring latency, tokens, and cost. Errors are caught per-model and recorded, never raised.
4
Step 4
Judge Quality (when `{{judge}}` is `true`)
▶
For each successful output, ask the judge model to score it 0–10 on correctness, completeness, and clarity, returning strict JSON. Skip judging for models that errored.
5
Step 5
Recommend & Write Outputs
▶
ok = [r for r in results if r.get("ok")] def pick(seq, key, best=min): seq = [r for r in seq if r.get(key) is not None] return best(seq, key=lambda r: r[key]) if seq else None
6
Step 6
Evaluate & Validate
▶
Status · Criteria PASS · At least 2 models ran successfully; each successful model has latency + tokens recorded; benchmark.md has the table and a recommendation; (judge on) each successful model has…
7
Step 7
Iterate (max 3 rounds)
▶
If a model errored with a fixable cause, fix and retry only that model:
8
Step 8
Write Executive Summary
▶
Write {{results_dir}}/summary.md:

Inputs

Promptfilerequired

The prompt to benchmark (uploaded file). Or pass it inline via the Prompt parameter.

Modelstext

default: claude-sonnet-4-6,gpt-4o,gemini/gemini-2.0-flash

Comma-separated litellm model ids. Models with no provider key are skipped cleanly.

Judgeselect

truefalse

Score each output 0–10 with an LLM judge (adds a little cost).

System prompttext

Optional system prompt sent identically to every model.

Secrets

ANTHROPIC_API_KEY | OPENAI_API_KEY | GEMINI_API_KEY | OPENROUTER_API_KEY · required
At least one provider key. Jetty trial keys cover Anthropic/OpenAI/Gemini on the platform.

Dependencies

litellm · required · Python package
At least one provider key · required · Credential

Required outputs

{{results_dir}}/benchmark.md
The comparison report: a per-model table (latency / tokens / cost / quality) and the fastest / cheapest / highest-quality / best-overall recommendation. The headline deliverable.
{{results_dir}}/results.json
Full structured results — one object per model with metrics, the raw output, judge score/reason, and any error.
{{results_dir}}/summary.md
Executive summary: models run vs skipped, the recommendation, and the cost of the benchmark itself.
{{results_dir}}/validation_report.json
Stage-by-stage validation with `overall_passed`. See Step 6.

Origin

source: github.com
title: benchmark-models (gstack)
attr: high

Original →

Cross-Model Benchmark

Deploy Cross-Model Benchmark to your jetty.io

Real runs, real outputs.

Coding prompt — merge intervals

Extraction prompt — notes → JSON

Writing prompt — launch blurb

8 steps · start to finish.

Environment Setup & Provider Preflight

Resolve the Prompt

Run the Benchmark

Judge Quality (when `{{judge}}` is `true`)

Recommend & Write Outputs

Evaluate & Validate

Iterate (max 3 rounds)

Write Executive Summary