jettyio / langfuse-trace-optimizer

Langfuse Trace Optimizer

Analyze a project's Langfuse trace data the way a cost-and-quality engineer would, and hand back a recommendations.md report that an engineering team can act on this week. The agent connects to Langfuse, pulls aggregate metrics and individual traces, finds…

agent: claude-code · model: anthropic/claude-sonnet-4.6 · snapshot: python312-uv · eval: rubric · 11 steps · v1.0.0

Deploy Langfuse Trace Optimizer to your jetty.io account

One click installs this runbook into a collection on your Jetty account. You can run it from the Spot dashboard, schedule it, or pipe inputs in via the API.


The shape of the run

11 steps · start to finish.

Step 1

Environment Setup & Connectivity Check

Install deps, write the creds to {{results_dir}}/.env for reproducibility, and prove the connection before doing anything else. If you cannot connect, STOP — this is a live-data task and every downstream step depends on it.

# Fail fast if any credential is missing, before anything is written to disk.
[ -n "$LANGFUSE_PUBLIC_KEY" ] && [ -n "$LANGFUSE_SECRET_KEY" ] && [ -n "$LANGFUSE_HOST" ] \
  || { echo "ERROR: missing Langfuse credentials" >&2; exit 1; }
echo "Langfuse host: $LANGFUSE_HOST"

mkdir -p "{{results_dir}}/data" "{{results_dir}}/figures"
pip install -q langfuse litellm pandas matplotlib tabulate

# Persist creds for reproducibility (never echo the secret values into logs).
cat > "{{results_dir}}/.env" <<EOF
LANGFUSE_PUBLIC_KEY=${LANGFUSE_PUBLIC_KEY}
LANGFUSE_SECRET_KEY=${LANGFUSE_SECRET_KEY}
LANGFUSE_HOST=${LANGFUSE_HOST}
EOF
# The file holds a secret key, so restrict it to the current user.
chmod 600 "{{results_dir}}/.env"

# Connectivity gate: STOP IMMEDIATELY IF THIS FAILS.
# The Python check runs through a heredoc so the whole step is one shell script.
python - <<'PY'
from langfuse import get_client

# get_client() reads the LANGFUSE_* variables from the environment.
langfuse = get_client()
assert langfuse.auth_check(), "Langfuse auth_check() failed: bad keys or host"
print("Langfuse connection OK")
PY

If auth_check() raises or returns false, do not continue: report the connection failure in summary.md and exit. Do not fabricate analysis against data you could not read.
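
The stop-and-report rule above can be sketched as a small gate function. This is a minimal sketch, not the runbook's own code: `connectivity_gate` and its `check` parameter are hypothetical names, and the wording written to summary.md is illustrative. In a live run, `check` would presumably be `get_client().auth_check`.

```python
from pathlib import Path
from typing import Callable


def connectivity_gate(check: Callable[[], bool], results_dir: str) -> bool:
    """Run the auth check; on failure, record it in summary.md and return False."""
    try:
        ok = bool(check())
        reason = "" if ok else "auth_check() returned false"
    except Exception as exc:  # the check may raise on bad keys or an unreachable host
        ok, reason = False, f"auth_check() raised {type(exc).__name__}: {exc}"

    if not ok:
        out = Path(results_dir)
        out.mkdir(parents=True, exist_ok=True)
        # Report the failure only; no analysis is written for data that was never read.
        (out / "summary.md").write_text(
            "# Summary\n\nLangfuse connection failed; no analysis was run.\n\n"
            f"Reason: {reason}\n"
        )
    return ok
```

A caller would exit as soon as the gate returns `False`, for example `if not connectivity_gate(get_client().auth_check, results_dir): sys.exit(1)`.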

Step 2

Fetch Live Model Pricing (provider-filtered)

Accurate cost analysis needs current pricing. Fetch it from LiteLLM and filter by the litellm_provider field — not by matching model-name substrings, so new families (gpt-4.1, claude-4, gemini-2) are…
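
Filtering on the provider field rather than on model-name substrings might look like the sketch below. It assumes the pricing map has the shape of LiteLLM's price table (a dict of model name to an entry with `litellm_provider`, `input_cost_per_token` and `output_cost_per_token`); in a live run that dict would presumably be `litellm.model_cost`. The function name `filter_pricing`, the output keys, and the sample prices are hypothetical.

```python
from typing import Any


def filter_pricing(model_cost: dict[str, dict[str, Any]], providers: set[str]) -> dict[str, dict[str, Any]]:
    """Keep per-token-priced models whose litellm_provider is in `providers`."""
    selected: dict[str, dict[str, Any]] = {}
    for model, info in model_cost.items():
        if not isinstance(info, dict) or info.get("litellm_provider") not in providers:
            continue
        in_cost = info.get("input_cost_per_token")
        out_cost = info.get("output_cost_per_token")
        if in_cost is None or out_cost is None:
            continue  # skip entries without per-token pricing
        selected[model] = {
            "provider": info["litellm_provider"],
            "input_per_1m_usd": round(in_cost * 1_000_000, 6),
            "output_per_1m_usd": round(out_cost * 1_000_000, 6),
        }
    return selected


# Illustrative stand-in for the real price table; these prices are made up.
sample = {
    "gpt-4.1": {"litellm_provider": "openai", "input_cost_per_token": 2e-06, "output_cost_per_token": 8e-06},
    "brand-new-family": {"litellm_provider": "anthropic", "input_cost_per_token": 3e-06, "output_cost_per_token": 1.5e-05},
    "embed-x": {"litellm_provider": "cohere", "input_cost_per_token": 1e-07, "output_cost_per_token": 0.0},
    "image-model": {"litellm_provider": "openai"},
}
pricing = filter_pricing(sample, {"openai", "anthropic"})
```

Because the match is on `litellm_provider`, a model whose name follows no known pattern (`brand-new-family` here) is still picked up as long as its provider is in the set.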

Step 3

Aggregate Metrics — Cost & Volume by Trace and Model

Use the Metrics API (cheap aggregates, not row-by-row listing) for the full window (Nd) and two sub-windows (3d and 7d). Write data/metrics_by_trace.csv and data/metrics_by_model.csv.
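
One way to prepare the three windows and one aggregate query per window is sketched below. The helper names are hypothetical, and the query field names (`view`, `dimensions`, `metrics`, `fromTimestamp`, `toTimestamp`) and measure names are assumptions about the Langfuse Metrics API schema that should be checked against the current API reference before use. The sketch only builds the request payload; it does not call Langfuse.

```python
import json
from datetime import datetime, timedelta, timezone


def window_bounds(now: datetime, window_days: int) -> dict[str, tuple[str, str]]:
    """ISO-8601 (from, to) pairs for the 3d and 7d sub-windows and the full Nd window."""
    to_ts = now.isoformat()
    return {
        f"{days}d": ((now - timedelta(days=days)).isoformat(), to_ts)
        for days in sorted({3, 7, window_days})
    }


def build_query(view: str, dimension: str, from_ts: str, to_ts: str) -> str:
    """Serialize one aggregate query. Field names are assumed, not confirmed by the source."""
    query = {
        "view": view,  # e.g. "traces" for per-trace-name, "observations" for per-model
        "dimensions": [{"field": dimension}],
        "metrics": [
            {"measure": "count", "aggregation": "count"},
            {"measure": "totalCost", "aggregation": "sum"},
            {"measure": "totalTokens", "aggregation": "sum"},
            {"measure": "latency", "aggregation": "p95"},
        ],
        "filters": [],
        "fromTimestamp": from_ts,
        "toTimestamp": to_ts,
    }
    return json.dumps(query)


bounds = window_bounds(datetime(2025, 1, 31, tzinfo=timezone.utc), window_days=30)
queries = {label: build_query("traces", "name", start, end) for label, (start, end) in bounds.items()}
```

Building all three payloads from one `now` value keeps the windows aligned to the same end timestamp, so the 3d and 7d numbers are true subsets of the full window.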

Step 4

Cost-Variance Deep Dive on the Top-N Expensive Traces

For the {{top_n}} most expensive trace names, list individual traces and compute the cost distribution. High variance is where the money leaks. Write data/cost_variance.json.
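
The cost distribution for one trace name can be computed with the standard library alone, as in this sketch. The outlier rule (cost above three times the median) is an assumption chosen for illustration, since the source does not say how outliers are flagged; `cost_distribution` and the sample trace IDs are hypothetical.

```python
import math
import statistics


def cost_distribution(costs_by_trace: dict[str, float], outlier_factor: float = 3.0) -> dict:
    """Summarize per-trace costs and flag traces far above the median."""
    if not costs_by_trace:
        raise ValueError("need at least one trace cost")
    ordered = sorted(costs_by_trace.values())
    median = statistics.median(ordered)
    # Nearest-rank p95: the smallest value with at least 95% of samples at or below it.
    p95 = ordered[max(0, math.ceil(0.95 * len(ordered)) - 1)]
    threshold = outlier_factor * median
    return {
        "n": len(ordered),
        "min": ordered[0],
        "mean": statistics.fmean(ordered),
        "median": median,
        "max": ordered[-1],
        "std": statistics.pstdev(ordered),
        "p95": p95,
        "outlier_threshold": threshold,
        "outliers": sorted(t for t, c in costs_by_trace.items() if c > threshold),
    }


# Made-up costs in USD for five runs of one trace name.
stats = cost_distribution({"t1": 0.01, "t2": 0.01, "t3": 0.02, "t4": 0.02, "t5": 0.50})
```

A median-based threshold is used here because a single runaway trace inflates the mean and standard deviation, which can hide the very outlier being looked for.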

Step 5

Root-Cause the Cost Drivers

For each expensive/outlier trace, pull its observations and test concrete hypotheses. Don't guess — read the actual generations.
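
As an illustration of what testing a concrete hypothesis can mean, the sketch below checks three example hypotheses against the generations of a single trace. The hypotheses themselves, the thresholds, the function name, and the dict keys (`input`, `input_tokens`, `output_tokens`) are all assumptions made for this example; the source does not list which hypotheses the runbook tests.

```python
from collections import Counter


def check_hypotheses(generations: list[dict], input_token_limit: int = 8000) -> dict[str, bool]:
    """Evaluate three illustrative cost hypotheses against one trace's generations."""
    prompts = Counter(g["input"] for g in generations)
    total_in = sum(g["input_tokens"] for g in generations)
    total_out = sum(g["output_tokens"] for g in generations)
    return {
        # The same prompt sent more than once suggests a retry loop or a missing cache.
        "duplicate_prompts": any(n > 1 for n in prompts.values()),
        # A single very large input suggests unbounded context or history stuffing.
        "oversized_context": any(g["input_tokens"] > input_token_limit for g in generations),
        # Input outweighing output by 20x or more suggests prompt bloat, not long answers.
        "input_heavy": total_out > 0 and total_in / total_out >= 20,
    }


# Made-up generations for one expensive trace.
expensive = [
    {"input": "summarize doc 17", "input_tokens": 9000, "output_tokens": 100},
    {"input": "summarize doc 17", "input_tokens": 9000, "output_tokens": 100},
]
findings = check_hypotheses(expensive)
```

Each flag maps to evidence that can be quoted in the report (the repeated prompt, the token counts), which keeps the root cause grounded in what the generations actually contain.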

Step 6

Failure-Mode Detection

Find the errors and inconsistencies that the cost view hides. Use the Metrics API with a level filter, then catalog the patterns.
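
Cataloging patterns usually means grouping error messages that differ only in volatile details. A minimal sketch follows, assuming each observation is a dict with hypothetical `level` and `status_message` keys and that failures carry the level `ERROR`; the normalization rules are illustrative.

```python
import re
from collections import Counter


def normalize_message(message: str) -> str:
    """Collapse volatile details (long hex ids, numbers) so similar errors group together."""
    message = re.sub(r"\b[0-9a-f]{8,}\b", "<id>", message.lower())
    message = re.sub(r"\d+", "<n>", message)
    return message.strip()


def catalog_failures(observations: list[dict]) -> list[tuple[str, int]]:
    """Count ERROR-level observations by normalized status message, most frequent first."""
    counts = Counter(
        normalize_message(o.get("status_message") or "unknown")
        for o in observations
        if o.get("level") == "ERROR"
    )
    return counts.most_common()


# Made-up observations.
patterns = catalog_failures([
    {"level": "ERROR", "status_message": "Timeout after 30s"},
    {"level": "ERROR", "status_message": "Timeout after 45s"},
    {"level": "DEFAULT", "status_message": None},
    {"level": "ERROR", "status_message": "Rate limit 429"},
])
```

Here the two timeouts collapse into one pattern with a count of two, which is the form a failure table in the report would need.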

Step 7

Qualitative Trace Assessment (manual inspection)

Aggregates miss the things that actually embarrass a team. Pull {{sample_size}} traces across percentiles (cheapest / median / most expensive / slowest / errored) and read them. Write…
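
Choosing which traces to read can be sketched as follows. This is an illustration under assumptions: the trace dicts use hypothetical keys (`id`, `cost`, `latency`, `errored`), and the selection order (named anchor points first, then evenly spaced cost percentiles) is one reasonable reading of "across percentiles", not the runbook's documented method.

```python
def pick_samples(traces: list[dict], sample_size: int) -> list[dict]:
    """Pick traces spread across the cost range, plus the slowest and an errored one."""
    if not traces:
        return []
    by_cost = sorted(traces, key=lambda t: t["cost"])
    picks: dict[str, dict] = {}

    def add(trace: dict, reason: str) -> None:
        # Ignore duplicates and stop adding once the sample is full.
        if len(picks) < sample_size and trace["id"] not in picks:
            picks[trace["id"]] = {**trace, "reason": reason}

    # Anchor points first, so a small sample still covers each category once.
    add(by_cost[0], "cheapest")
    add(by_cost[len(by_cost) // 2], "median")
    add(by_cost[-1], "most expensive")
    add(max(traces, key=lambda t: t["latency"]), "slowest")
    for t in traces:
        if t["errored"]:
            add(t, "errored")
            break
    # Fill the remainder with evenly spaced positions in the cost ordering.
    for i in range(sample_size):
        idx = round(i * (len(by_cost) - 1) / max(1, sample_size - 1))
        add(by_cost[idx], "percentile fill")
    return list(picks.values())


# Made-up traces: t3 is the slowest, t5 is errored.
traces = [
    {"id": f"t{i}", "cost": float(i), "latency": 100.0 if i == 3 else 1.0, "errored": i == 5}
    for i in range(10)
]
samples = pick_samples(traces, sample_size=5)
```

Recording a `reason` on each pick makes it clear in the written samples why a given trace was chosen for reading.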

Step 8

Rank Recommendations (evidence + WHY + code + measurement)

Turn findings into at least {{min_recommendations}} ranked recommendations. Each one MUST have all five parts — a recommendation without a measurement plan is an opinion, not an engineering action:

Step 9

Factor in Prior Analysis History (if provided)

If {{analysis_history}} is non-empty, it lists past recommendations and the PRs that acted on them. Use it to make this report a follow-up, not a repeat:

Step 10

Write `recommendations.md` (the report) + `summary.md`

Assemble {{results_dir}}/recommendations.md with this structure:

Step 11

Self-Validation Report

Write {{results_dir}}/validation_report.json:

Parameters

Results directory · `{{results_dir}}` · default: `/app/results`
Analysis window (days) · `{{window_days}}` · default: `30`
Top-N expensive traces · `{{top_n}}` · default: `5`
Qualitative sample size · `{{sample_size}}` · default: `15`
Min recommendations · `{{min_recommendations}}` · default: `3`
Analysis history · `{{analysis_history}}` · *(optional)*

Dependencies

LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST · required · Secret
langfuse (pip) · required · Runtime
litellm (pip) · required · Runtime
pandas (pip) · required · Runtime
matplotlib (pip) · optional · Runtime

Required outputs

{{results_dir}}/recommendations.md
The headline report: executive summary, cost analysis tables, qualitative findings, ≥3 ranked recommendations (each with evidence trace IDs, WHY, code example, measurement plan), and an implementation roadmap.
{{results_dir}}/data/metrics_by_trace.csv
Per-trace-name aggregates (count, total cost, total tokens, p95 latency) for the analysis window.
{{results_dir}}/data/metrics_by_model.csv
Per-model aggregates at the observation level (count, cost, tokens).
{{results_dir}}/data/cost_variance.json
Per-expensive-trace cost distribution (min/mean/median/max/std/p95) and flagged outliers.
{{results_dir}}/data/qualitative_samples.json
The sampled traces inspected manually, with trace IDs, input/output excerpts, and the abnormality tagged on each.
{{results_dir}}/data/model_pricing.json
Current model pricing fetched from LiteLLM (provider-filtered).
{{results_dir}}/summary.md
One-screen executive summary: window, traces analyzed, total cost, top 3 findings, total projected monthly savings.
{{results_dir}}/validation_report.json
Stage-by-stage self-validation with `overall_passed`. See Step 11.
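
A self-check over the outputs listed above could be sketched as below. Only the `overall_passed` key is confirmed by the source; the `stages`, `output` and `passed` keys, the function name, and the "exists and is non-empty" criterion are assumptions, and the real Step 11 schema may differ.

```python
import json
from pathlib import Path

# Paths relative to the results directory, taken from the required outputs list.
REQUIRED_OUTPUTS = [
    "recommendations.md",
    "data/metrics_by_trace.csv",
    "data/metrics_by_model.csv",
    "data/cost_variance.json",
    "data/qualitative_samples.json",
    "data/model_pricing.json",
    "summary.md",
]


def write_validation_report(results_dir: str) -> dict:
    """Check that each required output exists and is non-empty, then write the report."""
    root = Path(results_dir)
    root.mkdir(parents=True, exist_ok=True)
    stages = []
    for rel in REQUIRED_OUTPUTS:
        path = root / rel
        passed = path.is_file() and path.stat().st_size > 0
        stages.append({"output": rel, "passed": passed})
    # The run passes only if every individual check passed.
    report = {"stages": stages, "overall_passed": all(s["passed"] for s in stages)}
    (root / "validation_report.json").write_text(json.dumps(report, indent=2))
    return report
```

Calling `write_validation_report("/app/results")` at the end of a run would leave a report whose `overall_passed` is false whenever any listed file is missing or empty.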

Origin

source: github.com
title: Langfuse Optimizer — Terminal Bench 2.0 task
attr: high

