jettyio / ambient-scribe-quality-gate

"Ambient Scribe Quality Gate"

Run the eval an ambient AI medical-scribe team actually cares about, and make the hill-climb visible. The agent generates a panel of synthetic clinical encounters (a "clinic day"), and for each one it: synthesizes a realistic, messy doctor–patient transcript…

agent claude-codemodel claude-sonnet-4-6snapshot python312-uveval rubric10 stepsv1.1.0

Deploy "Ambient Scribe Quality Gate" to your jetty.io

One-click installs this runbook into a collection on your Jetty account. You can run it from the Spot dashboard, schedule it, or pipe inputs in via the API.

Deploy on jetty.io →View source

The shape of the run

10 steps · start to finish.

Step 1

Environment Setup

▶

pip install -q litellm
mkdir -p {{results_dir}}/transcripts {{results_dir}}/notes {{results_dir}}/scorecards
# Shared-panel mode (for an apples-to-apples model bake-off): if a panel was uploaded
# (case_sheets.json + encounter_*.md transcripts), reuse it verbatim and SKIP generation so
# every scribe grades IDENTICAL transcripts. Otherwise a fresh panel is generated below.
PANEL="$(find / -name 'case_sheets.json' -not -path '*/results/*' 2>/dev/null | head -1)"
if [ -n "$PANEL" ]; then
  cp "$PANEL" {{results_dir}}/case_sheets.json
  find / -name 'encounter_*.md' -not -path '*/results/*' 2>/dev/null | while read -r f; do cp "$f" {{results_dir}}/transcripts/ 2>/dev/null || true; done
  echo "SHARED PANEL DETECTED ($PANEL): $(ls {{results_dir}}/transcripts 2>/dev/null | wc -l) transcripts copied — generation will be SKIPPED."
else
  echo "No uploaded panel found — a fresh synthetic panel will be generated."
fi
if [ -z "$ANTHROPIC_API_KEY" ] && [ -z "$OPENAI_API_KEY" ]; then
  echo "ERROR: need ANTHROPIC_API_KEY (preferred) or OPENAI_API_KEY"; exit 1
fi
python - <<'PY'
import os
has_anthropic = bool(os.environ.get("ANTHROPIC_API_KEY"))
has_openai = bool(os.environ.get("OPENAI_API_KEY"))
print(f"Anthropic key: {'SET' if has_anthropic else 'missing'} | OpenAI key: {'SET' if has_openai else 'missing'}")
print("litellm import:", end=" ")
import litellm  # noqa
print("OK")
PY

Notes on model selection (adapt in the scripts below if a key is missing):

If only OPENAI_API_KEY is set, use openai/gpt-... strings for scribe + judge.
Keep judge ≠ scribe. A self-judging model rubber-stamps its own note (a known weak-model-self-validation failure mode) and the hill-climb will look like it passes when it hasn't moved. Cross-vendor (Anthropic scribe + OpenAI judge, or vice-versa) is strongest.

2
Step 2
The Shared LLM Helper
▶
Write {{results_dir}}/_llm.py once; every later step imports it. It wraps litellm.completion, asks for JSON when needed, and retries on transient errors.
3
Step 3
Generate the Synthetic Case Panel (hidden ground truth)
▶
Write {{results_dir}}/case_sheets.json. Vary specialty + complexity, and embed exactly one trap per case. Fully fictional — no real people, no real identifiers.
4
Step 4
Synthesize the Ambient Transcripts
▶
For each case, write {{results_dir}}/transcripts/encounter_NN.md — a realistic, messy doctor–patient dialogue. The transcript must contain every ground-truth fact (so a perfect note is achievable)…
5
Step 5
The Baseline Scribe Prompt + the Hill-Climb Fix Library
▶
The scribe starts thin on purpose. The fix library is the menu the climb draws from — each entry targets one rubric criterion. Write {{results_dir}}/scribe_prompt_final.md at the end with the prompt…
6
Step 6
The Judge (rubric + hard hallucination floor)
▶
The judge sees the transcript + the hidden case sheet + the note and returns structured scores plus extracted hallucinated claims and missed facts. The hard floor is then applied deterministically in…
7
Step 7
Run the Panel — Scribe → Judge → Hill-Climb (the centerpiece)
▶
For each encounter: write a note from the baseline prompt, judge it, then while it fails and rounds remain, append the fix for the weakest criterion, regenerate, re-judge — logging every iteration…
8
Step 8
Aggregate Leaderboard
▶
Write {{results_dir}}/leaderboard.md: per-criterion means at baseline (iter 1) vs final, the panel pass rate, total hallucination incidents caught, and the average iterations-to-pass.
9
Step 9
Write Executive Summary
▶
Write {{results_dir}}/summary.md: the climb story in plain language — where the panel started, which criterion was the biggest lever (usually groundedness — the hallucination floor does the work)…
10
Step 10
Write Validation Report
▶
Write {{results_dir}}/validation_report.json:

Parameters

Results directory

default: `{{results_dir}}`

`/app/results`

Panel size

default: `{{panel_size}}`

`8`

Scribe model

default: `{{scribe_model}}`

`anthropic/claude-sonnet-4-6`

Judge model

default: `{{judge_model}}`

`anthropic/claude-opus-4-8`

Max iterations

default: `{{max_iterations}}`

`3`

Pass threshold

default: `{{pass_threshold}}`

`4.0`

Dependencies

ANTHROPIC_API_KEY · optional · Yes
litellm (pip) · optional · Yes
OPENAI_API_KEY · optional · No

Required outputs

{{results_dir}}/case_sheets.json
The N hidden ground-truth case sheets (synthetic facts + the trap)
{{results_dir}}/transcripts/encounter_NN.md
The messy ambient transcript for each encounter
{{results_dir}}/notes/encounter_NN.md
The final (best-iteration) SOAP note for each encounter
{{results_dir}}/scorecards/encounter_NN.json
Per-encounter rubric scores + **iteration_log** (the climb)
{{results_dir}}/leaderboard.md
Aggregate per-criterion means, pass rate, hallucination incidents across the panel
{{results_dir}}/summary.md
Executive summary: the climb story, weakest→strongest, the winning scribe prompt
{{results_dir}}/scribe_prompt_final.md
The hill-climbed scribe system prompt (the artifact a real team would ship)
{{results_dir}}/validation_report.json
Structured results + `overall_passed`

"Ambient Scribe Quality Gate"

Deploy "Ambient Scribe Quality Gate" to your jetty.io

10 steps · start to finish.

Environment Setup

The Shared LLM Helper

Generate the Synthetic Case Panel (hidden ground truth)

Synthesize the Ambient Transcripts

The Baseline Scribe Prompt + the Hill-Climb Fix Library

The Judge (rubric + hard hallucination floor)

Run the Panel — Scribe → Judge → Hill-Climb (the centerpiece)

Aggregate Leaderboard

Write Executive Summary

Write Validation Report