"Ambient Scribe Quality Gate"
Run the eval an ambient AI medical-scribe team actually cares about, and make the hill-climb visible. The agent generates a panel of synthetic clinical encounters (a "clinic day"), and for each one it: synthesizes a realistic, messy doctor–patient transcript…
10 steps · start to finish.
- 1Step 1
Environment Setup
▶pip install -q litellm mkdir -p {{results_dir}}/transcripts {{results_dir}}/notes {{results_dir}}/scorecards # Shared-panel mode (for an apples-to-apples model bake-off): if a panel was uploaded # (case_sheets.json + encounter_*.md transcripts), reuse it verbatim and SKIP generation so # every scribe grades IDENTICAL transcripts. Otherwise a fresh panel is generated below. PANEL="$(find / -name 'case_sheets.json' -not -path '*/results/*' 2>/dev/null | head -1)" if [ -n "$PANEL" ]; then cp "$PANEL" {{results_dir}}/case_sheets.json find / -name 'encounter_*.md' -not -path '*/results/*' 2>/dev/null | while read -r f; do cp "$f" {{results_dir}}/transcripts/ 2>/dev/null || true; done echo "SHARED PANEL DETECTED ($PANEL): $(ls {{results_dir}}/transcripts 2>/dev/null | wc -l) transcripts copied — generation will be SKIPPED." else echo "No uploaded panel found — a fresh synthetic panel will be generated." fi if [ -z "$ANTHROPIC_API_KEY" ] && [ -z "$OPENAI_API_KEY" ]; then echo "ERROR: need ANTHROPIC_API_KEY (preferred) or OPENAI_API_KEY"; exit 1 fi python - <<'PY' import os has_anthropic = bool(os.environ.get("ANTHROPIC_API_KEY")) has_openai = bool(os.environ.get("OPENAI_API_KEY")) print(f"Anthropic key: {'SET' if has_anthropic else 'missing'} | OpenAI key: {'SET' if has_openai else 'missing'}") print("litellm import:", end=" ") import litellm # noqa print("OK") PYNotes on model selection (adapt in the scripts below if a key is missing):
- If only
OPENAI_API_KEYis set, useopenai/gpt-...strings for scribe + judge. - Keep judge ≠ scribe. A self-judging model rubber-stamps its own note (a known weak-model-self-validation failure mode) and the hill-climb will look like it passes when it hasn't moved. Cross-vendor (Anthropic scribe + OpenAI judge, or vice-versa) is strongest.
- If only
- 2Step 2
The Shared LLM Helper
▶Write {{results_dir}}/_llm.py once; every later step imports it. It wraps litellm.completion, asks for JSON when needed, and retries on transient errors.
- 3Step 3
Generate the Synthetic Case Panel (hidden ground truth)
▶Write {{results_dir}}/case_sheets.json. Vary specialty + complexity, and embed exactly one trap per case. Fully fictional — no real people, no real identifiers.
- 4Step 4
Synthesize the Ambient Transcripts
▶For each case, write {{results_dir}}/transcripts/encounter_NN.md — a realistic, messy doctor–patient dialogue. The transcript must contain every ground-truth fact (so a perfect note is achievable)…
- 5Step 5
The Baseline Scribe Prompt + the Hill-Climb Fix Library
▶The scribe starts thin on purpose. The fix library is the menu the climb draws from — each entry targets one rubric criterion. Write {{results_dir}}/scribe_prompt_final.md at the end with the prompt…
- 6Step 6
The Judge (rubric + hard hallucination floor)
▶The judge sees the transcript + the hidden case sheet + the note and returns structured scores plus extracted hallucinated claims and missed facts. The hard floor is then applied deterministically in…
- 7Step 7
Run the Panel — Scribe → Judge → Hill-Climb (the centerpiece)
▶For each encounter: write a note from the baseline prompt, judge it, then while it fails and rounds remain, append the fix for the weakest criterion, regenerate, re-judge — logging every iteration…
- 8Step 8
Aggregate Leaderboard
▶Write {{results_dir}}/leaderboard.md: per-criterion means at baseline (iter 1) vs final, the panel pass rate, total hallucination incidents caught, and the average iterations-to-pass.
- 9Step 9
Write Executive Summary
▶Write {{results_dir}}/summary.md: the climb story in plain language — where the panel started, which criterion was the biggest lever (usually groundedness — the hallucination floor does the work)…
- 10Step 10
Write Validation Report
▶Write {{results_dir}}/validation_report.json: