← All runbooks
jettyio / ambient-scribe-quality-gate

"Ambient Scribe Quality Gate"

Run the eval an ambient AI medical-scribe team actually cares about, and make the hill-climb visible. The agent generates a panel of synthetic clinical encounters (a "clinic day"), and for each one it: synthesizes a realistic, messy doctor–patient transcript…

agent claude-codemodel claude-sonnet-4-6snapshot python312-uveval rubric10 stepsv1.1.0

Deploy "Ambient Scribe Quality Gate" to your jetty.io

One-click installs this runbook into a collection on your Jetty account. You can run it from the Spot dashboard, schedule it, or pipe inputs in via the API.

The shape of the run

10 steps · start to finish.

  1. 1
    Step 1

    Environment Setup

    pip install -q litellm
    mkdir -p {{results_dir}}/transcripts {{results_dir}}/notes {{results_dir}}/scorecards
    # Shared-panel mode (for an apples-to-apples model bake-off): if a panel was uploaded
    # (case_sheets.json + encounter_*.md transcripts), reuse it verbatim and SKIP generation so
    # every scribe grades IDENTICAL transcripts. Otherwise a fresh panel is generated below.
    PANEL="$(find / -name 'case_sheets.json' -not -path '*/results/*' 2>/dev/null | head -1)"
    if [ -n "$PANEL" ]; then
      cp "$PANEL" {{results_dir}}/case_sheets.json
      find / -name 'encounter_*.md' -not -path '*/results/*' 2>/dev/null | while read -r f; do cp "$f" {{results_dir}}/transcripts/ 2>/dev/null || true; done
      echo "SHARED PANEL DETECTED ($PANEL): $(ls {{results_dir}}/transcripts 2>/dev/null | wc -l) transcripts copied — generation will be SKIPPED."
    else
      echo "No uploaded panel found — a fresh synthetic panel will be generated."
    fi
    if [ -z "$ANTHROPIC_API_KEY" ] && [ -z "$OPENAI_API_KEY" ]; then
      echo "ERROR: need ANTHROPIC_API_KEY (preferred) or OPENAI_API_KEY"; exit 1
    fi
    python - <<'PY'
    import os
    has_anthropic = bool(os.environ.get("ANTHROPIC_API_KEY"))
    has_openai = bool(os.environ.get("OPENAI_API_KEY"))
    print(f"Anthropic key: {'SET' if has_anthropic else 'missing'} | OpenAI key: {'SET' if has_openai else 'missing'}")
    print("litellm import:", end=" ")
    import litellm  # noqa
    print("OK")
    PY
    

    Notes on model selection (adapt in the scripts below if a key is missing):

    • If only OPENAI_API_KEY is set, use openai/gpt-... strings for scribe + judge.
    • Keep judge ≠ scribe. A self-judging model rubber-stamps its own note (a known weak-model-self-validation failure mode) and the hill-climb will look like it passes when it hasn't moved. Cross-vendor (Anthropic scribe + OpenAI judge, or vice-versa) is strongest.

  2. 2
    Step 2

    The Shared LLM Helper

    Write {{results_dir}}/_llm.py once; every later step imports it. It wraps litellm.completion, asks for JSON when needed, and retries on transient errors.

  3. 3
    Step 3

    Generate the Synthetic Case Panel (hidden ground truth)

    Write {{results_dir}}/case_sheets.json. Vary specialty + complexity, and embed exactly one trap per case. Fully fictional — no real people, no real identifiers.

  4. 4
    Step 4

    Synthesize the Ambient Transcripts

    For each case, write {{results_dir}}/transcripts/encounter_NN.md — a realistic, messy doctor–patient dialogue. The transcript must contain every ground-truth fact (so a perfect note is achievable)…

  5. 5
    Step 5

    The Baseline Scribe Prompt + the Hill-Climb Fix Library

    The scribe starts thin on purpose. The fix library is the menu the climb draws from — each entry targets one rubric criterion. Write {{results_dir}}/scribe_prompt_final.md at the end with the prompt…

  6. 6
    Step 6

    The Judge (rubric + hard hallucination floor)

    The judge sees the transcript + the hidden case sheet + the note and returns structured scores plus extracted hallucinated claims and missed facts. The hard floor is then applied deterministically in…

  7. 7
    Step 7

    Run the Panel — Scribe → Judge → Hill-Climb (the centerpiece)

    For each encounter: write a note from the baseline prompt, judge it, then while it fails and rounds remain, append the fix for the weakest criterion, regenerate, re-judge — logging every iteration…

  8. 8
    Step 8

    Aggregate Leaderboard

    Write {{results_dir}}/leaderboard.md: per-criterion means at baseline (iter 1) vs final, the panel pass rate, total hallucination incidents caught, and the average iterations-to-pass.

  9. 9
    Step 9

    Write Executive Summary

    Write {{results_dir}}/summary.md: the climb story in plain language — where the panel started, which criterion was the biggest lever (usually groundedness — the hallucination floor does the work)…

  10. 10
    Step 10

    Write Validation Report

    Write {{results_dir}}/validation_report.json: