June 10, 2026

Using Fable 5 and Langfuse to run auto improvement

We used Claude Fable 5 with Langfuse Datasets, Prompt Management, Experiments, and trace comments to run an auto-improvement loop on a classification benchmark.

Annabell Schäfer

Claude Fable 5 had just come out and was being promoted for loop-shaped work. We wanted to see what it looked like to give an agent a real auto-improvement workbench: a benchmark in Langfuse Datasets, a versioned prompt in Prompt Management, an experiment runner via the Langfuse SDK, and a place to attach qualitative error analysis with trace comments.

This is not really a model review. Fable turned out to be perfectly capable of running the loop. The more interesting question was what the loop actually learned, and what the dataset needed to look like for the result to transfer.

[Image: Lance Martin's post about designing loops with Fable 5]

We picked a contained classification task with one of the cleanest target functions in AI engineering: exact-match accuracy against gold labels. The agent's job was simple: keep improving the prompt on the train split until it reached 95% train accuracy or had used 15 runs, then run the held-out test set once.
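
The metric and the stopping rule are small enough to write down. The following is a minimal sketch, not the code used in the experiment: the function names `exact_match_accuracy` and `should_stop` are hypothetical, and the comparison is a strict string match with no normalization, which is an assumption.

```python
from typing import Sequence


def exact_match_accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of rows where the predicted label equals the gold label exactly."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty split")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


def should_stop(train_accuracy: float, runs_done: int,
                target: float = 0.95, max_runs: int = 15) -> bool:
    """Stop once the train target is reached or the run budget is used up."""
    return train_accuracy >= target or runs_done >= max_runs
```

With the thresholds from this post, a run at 97% train accuracy stops the loop, and so does a fifteenth run at any accuracy.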

What we gave the agent

The setup was deliberately minimal:

We used gpt-4o-mini on purpose. A stronger model likely would have done better out of the box, but for a narrow task like this, the interesting question was whether prompt iteration on a cheaper model could get us far enough to matter.

The starting prompt was as bare as it gets: "Classify this paper with a label" plus the flat list of allowed labels.
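
As an illustration only, a bare prompt like this and a matching output constraint could be built from a label list as sketched below. The label names, the prompt layout, and the helper names `build_base_prompt` and `label_schema` are all assumptions; the post states only the opening sentence of the prompt and that output used a strict JSON schema.

```python
from typing import Dict, List


def build_base_prompt(labels: List[str]) -> str:
    """Bare instruction plus a flat list of allowed labels (layout is assumed)."""
    lines = ["Classify this paper with a label", "", "Allowed labels:"]
    lines += [f"- {label}" for label in labels]
    return "\n".join(lines)


def label_schema(labels: List[str]) -> Dict:
    """JSON Schema that only admits one of the allowed labels."""
    return {
        "type": "object",
        "properties": {"label": {"type": "string", "enum": list(labels)}},
        "required": ["label"],
        "additionalProperties": False,
    }


# Hypothetical labels, for illustration only.
example_labels = ["label_a", "label_b", "label_c"]
example_prompt = build_base_prompt(example_labels)
example_schema = label_schema(example_labels)
```

Constraining the output to an enum keeps exact-match scoring clean, because the model cannot return a label outside the allowed list.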

The auto-improving loop, powered by Fable

The loop looked like this:

  1. Run the train dataset with the current prompt.
  2. Score every row as correct or incorrect.
  3. Write a short qualitative annotation on every error and attach it as a comment to the trace.
  4. Form a hypothesis and publish a new prompt version.
  5. Repeat until the train target is hit or the run budget is used up.
  6. Run the held-out test split once at the end.
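
The steps above can be sketched as a plain Python loop. This is a hedged outline rather than the agent's actual implementation: `classify`, `comment`, and `revise` are hypothetical callables standing in for the model call, the trace comment, and the agent's prompt revision, and no Langfuse SDK calls are shown.

```python
from typing import Callable, Dict, List, Tuple

Items = Dict[str, Tuple[str, str]]  # item id -> (paper text, gold label)


def run_split(prompt: str, items: Items,
              classify: Callable[[str, str], str]) -> Tuple[float, List[dict]]:
    """Steps 1-2: run every item and score it as correct or incorrect."""
    if not items:
        raise ValueError("split must not be empty")
    errors = []
    for item_id, (text, gold) in items.items():
        predicted = classify(prompt, text)
        if predicted != gold:
            errors.append({"id": item_id, "gold": gold, "predicted": predicted})
    return 1 - len(errors) / len(items), errors


def improve(prompt: str, train: Items, test: Items,
            classify: Callable[[str, str], str],
            comment: Callable[[int, dict], None],
            revise: Callable[[str, List[dict]], str],
            target: float = 0.95, max_runs: int = 15):
    """Iterate on the train split, then touch the test split exactly once."""
    history = []
    for run in range(1, max_runs + 1):
        accuracy, errors = run_split(prompt, train, classify)
        history.append((run, accuracy))
        for error in errors:
            comment(run, error)            # step 3: annotate each error
        if accuracy >= target or run == max_runs:
            break                          # step 5: target hit or budget spent
        prompt = revise(prompt, errors)    # step 4: new prompt version
    test_accuracy, _ = run_split(prompt, test, classify)  # step 6
    return prompt, history, test_accuracy
```

Passing the three callables in keeps the loop independent of any particular model or tracing backend, so it can be exercised with stubs before wiring in real services.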

[Figure: Diagram of the autonomous training loop: start from a base prompt, run the train split, score rows, comment on errors, revise the prompt, and then run the held-out test split once]

This is where Langfuse mattered. The agent had a stable place to read the benchmark, fetch and update the prompt, compare experiment runs, and keep the qualitative audit trail next to the quantitative results.

What happened

The mechanics worked immediately.

| Run | Prompt strategy | Train accuracy |
|-----|-----------------|----------------|
| 1 | v1 - flat label list | 78.0% |
| 2 | v2 - definitions + decision rules | 90.5% |
| 3 | v3 - sharper boundary rules | 90.0% |
| 4 | v4 - concrete precedent list from prior failures | 97.0% |

Fable hit the stopping condition in four runs. On the train split, this looked like a clear success.

Then came the held-out test set:

| Prompt | Train | Test | Gap (pp) |
|--------|-------|------|----------|
| v2 - general definitions | 90.5% | 84.0% | 6.5 |
| v4 - train-derived precedents | 97.0% | 82.0% | 15.0 |

The first useful lesson was that the "best" prompt on train was not the best prompt on test. The more the loop optimized against the train split, the less meaningful that train number became.
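
The gap column is train accuracy minus test accuracy, in percentage points. A short sketch using the v2 and v4 figures reported here makes the disagreement between the two rankings explicit; the variable and function names are illustrative.

```python
# (train %, test %) as reported for round 1.
results = {
    "v2": (90.5, 84.0),
    "v4": (97.0, 82.0),
}


def generalization_gap(train_pct: float, test_pct: float) -> float:
    """Train minus test accuracy, in percentage points."""
    return round(train_pct - test_pct, 1)


gaps = {name: generalization_gap(*scores) for name, scores in results.items()}
best_on_train = max(results, key=lambda name: results[name][0])
best_on_test = max(results, key=lambda name: results[name][1])

print(gaps)                         # {'v2': 6.5, 'v4': 15.0}
print(best_on_train, best_on_test)  # v4 v2
```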

So we restarted from v2 and constrained the loop: every change had to be a general principle backed by a class of errors, not a single-paper precedent.
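
One way to make "backed by a class of errors" checkable is to group errors by confusion pair and only act on pairs that recur. This is a sketch of that idea, not the constraint as the agent actually received it; the `min_support` threshold and the error record shape are assumptions.

```python
from collections import Counter
from typing import Dict, List, Tuple


def error_classes(errors: List[dict], min_support: int = 2) -> Dict[Tuple[str, str], int]:
    """Count (gold, predicted) confusion pairs and keep the recurring ones.

    A pair seen only once is treated as a single-paper precedent and dropped.
    """
    counts = Counter((e["gold"], e["predicted"]) for e in errors)
    return {pair: n for pair, n in counts.items() if n >= min_support}
```

A prompt revision would then have to cite at least one surviving pair, which rules out rules written for a single paper.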

| Prompt | Train | Test | Gap (pp) |
|--------|-------|------|----------|
| v2 - general definitions, round 1 | 90.5% | 84.0% | 6.5 |
| v4 - precedent list, round 1 | 97.0% | 82.0% | 15.0 |
| v9 - general principles, round 2 | 94.0% | 81.0% | 13.0 |

That second round is the real result. Even after forcing the loop to behave more "correctly," held-out accuracy did not improve: v9 scored 81.0% on test, below the 84.0% that v2 had already reached.

And by the end, 11 test errors were shared across every prompt variant.
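
Finding the errors that survive every prompt variant is a set intersection over per-prompt error ids. A minimal sketch, assuming each test run yields the ids of its misclassified items; the function name and the ids in the usage example are hypothetical.

```python
from typing import Dict, Iterable, Set


def shared_errors(errors_by_prompt: Dict[str, Iterable[str]]) -> Set[str]:
    """Item ids that were misclassified under every prompt variant."""
    if not errors_by_prompt:
        return set()
    return set.intersection(*(set(ids) for ids in errors_by_prompt.values()))


# Hypothetical item ids, for illustration only.
example = {
    "v2": {"item-3", "item-7", "item-9"},
    "v4": {"item-7", "item-9", "item-12"},
    "v9": {"item-9", "item-7", "item-20"},
}
print(sorted(shared_errors(example)))  # ['item-7', 'item-9']
```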

That is what changed the interpretation of the whole experiment. The question stopped being "how do we make the loop smarter?" and became "what would the dataset have needed to contain for this loop to actually learn something transferable?"

Why this Langfuse pattern is useful

Even though the conclusion was mostly about evaluation design, this still felt like a strong example of using Fable 5 and Langfuse together for auto improvement.

  • Datasets gave the agent a clear benchmark to optimize against.
  • Prompt Management gave it a safe way to read, version, and publish prompt revisions.
  • Experiments turned each iteration into a comparable run instead of an anecdote.
  • Trace comments made the error analysis visible and inspectable instead of burying it in logs.

The loop itself was not the problem. The infrastructure was enough for the agent to run the work autonomously and keep the whole process auditable.

What we learned

1. Auto improvement needs a validation split. Train and test alone are not enough. If you select prompt versions on train accuracy, you will overfit, even if the metric itself is perfectly clean.

2. The strongest result was not the 97% train score. It was that a more disciplined second round bought zero measurable test improvement. That tells you the loop had already extracted most of the prompt-level gain available in this setup.

3. Shared errors are the signal to pay attention to. Once the same test errors survive every prompt variant, the next bottleneck is probably the dataset, label definitions, or task model, not the prompt.

4. That does not make auto improvement useless. It means auto improvement is only as good as the benchmark and selection setup you give it. In this experiment, the loop surfaced the next constraint quickly, which is useful in its own right.
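
The first lesson can be made concrete with a three-way split and a selection step that only looks at validation accuracy. This is a sketch under assumptions: the split fractions, the seed, and the function names are illustrative, and the accuracies in the usage note are hypothetical, not results from this experiment.

```python
import random
from typing import Dict, Iterable, List, Tuple


def three_way_split(item_ids: Iterable[str], val_fraction: float = 0.2,
                    test_fraction: float = 0.2,
                    seed: int = 0) -> Tuple[List[str], List[str], List[str]]:
    """Deterministically split item ids into train, validation, and test."""
    ids = sorted(item_ids)              # fixed order before the seeded shuffle
    random.Random(seed).shuffle(ids)
    n_test = int(len(ids) * test_fraction)
    n_val = int(len(ids) * val_fraction)
    test = ids[:n_test]
    val = ids[n_test:n_test + n_val]
    train = ids[n_test + n_val:]
    return train, val, test


def select_prompt(validation_accuracy: Dict[str, float]) -> str:
    """Pick the prompt version with the best validation accuracy."""
    if not validation_accuracy:
        raise ValueError("no prompt versions to select from")
    return max(validation_accuracy, key=validation_accuracy.get)
```

For example, `select_prompt({"v2": 0.86, "v4": 0.83})` returns `"v2"` even if v4 scored higher on train. The test split stays untouched until the selected version is run on it once.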

What we would change next

If we ran this again, we would change the dataset design before changing the loop:

  • add a true validation split for selection
  • add more repeated examples around known class boundaries
  • sharpen label definitions for ambiguous categories
  • decide whether some items really need an "unsure" or multi-label escape hatch

That is the part of the setup that determines whether another round of auto improvement teaches you something new.


All experiments: gpt-4o-mini, temperature 0, strict JSON schema output. Optimizer agent: Claude Fable 5 in Claude Code with goal mode. Across both rounds: 9 train runs on the 200-item train split plus 3 test runs on the 100-item test split. Full prompt version history and per-run error annotations live in Langfuse.

