Evaluation workflow
Langfuse evaluation follows a continuous improvement loop with four stages:
- Instrument -> Capture real behavior with Observability.
- Annotate -> Turn traces into reusable evaluation assets with annotations and datasets.
- Deploy -> Validate changes with experiments and CI/CD checks before they ship.
- Monitor -> Track production quality with online evaluators and Score Analytics.
Work through the loop continuously: collect traces, identify failure modes in them, validate changes against those failure modes, and use production monitoring to find the next examples to annotate.
Instrument
Collect the evidence you need to evaluate your application. Start with traces that show what users asked, what your system did, and where the output came from.
- Capture traces and observations for user interactions, model calls, tool calls, retrieval steps, and final outputs.
- Use the observability data model to pick the right evaluation unit: trace, observation, generation, session, or dataset run.
- Add tags, metadata, users, sessions, environments, and releases so you can slice behavior later.
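The value of attaching tags, environments, and releases shows up when you filter traces later. The sketch below is a minimal illustration using only the Python standard library. `TraceRecord` and `slice_traces` are hypothetical names, and the record is a simplified stand-in, not the Langfuse data model or SDK.

```python
from dataclasses import dataclass, field


@dataclass
class TraceRecord:
    """Hypothetical, simplified stand-in for an exported trace (not the Langfuse schema)."""
    trace_id: str
    user_id: str
    environment: str
    release: str
    tags: list[str] = field(default_factory=list)
    metadata: dict[str, str] = field(default_factory=dict)


def slice_traces(traces, *, environment=None, release=None, tag=None):
    """Return the traces that match every filter that was provided."""
    selected = []
    for trace in traces:
        if environment is not None and trace.environment != environment:
            continue
        if release is not None and trace.release != release:
            continue
        if tag is not None and tag not in trace.tags:
            continue
        selected.append(trace)
    return selected


# Illustrative data only.
traces = [
    TraceRecord("t1", "u1", "production", "v1.2", tags=["rag"]),
    TraceRecord("t2", "u2", "production", "v1.3", tags=["rag", "billing"]),
    TraceRecord("t3", "u1", "staging", "v1.3", tags=["agent"]),
]

prod_rag = slice_traces(traces, environment="production", tag="rag")
print([t.trace_id for t in prod_rag])
```

Without the environment and tag fields, the only way to separate production RAG traffic from staging agent runs would be to read each trace by hand.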
Annotate
Turn raw traces into reusable evaluation assets. Start with human review, name the failure modes, then convert the useful examples into datasets and score definitions.
- Use Annotation Queues, manual scores via UI, and TEXT scores for collaborative open coding: review representative traces and note the first thing that went wrong.
- Group notes into failure modes (axial coding), then turn stable error categories into structured labels and evaluation criteria for LLM-as-a-Judge and custom evaluators.
- Add the most important examples to a dataset. Aim for roughly 100 diverse, high-quality test cases when possible.
For a detailed walkthrough, use the error analysis guide to open-code traces, cluster failure categories, label examples, and decide which failures should become evaluators.
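As a rough illustration of the step from open coding to axial coding, the sketch below groups free-text review notes into named failure modes and counts them. All notes, category names, and the `min_count` cutoff are made up for the example. In practice the mapping from note to failure mode comes from human judgment, not a lookup table.

```python
from collections import Counter

# Hypothetical open-coding notes: (trace_id, first thing that went wrong).
open_codes = [
    ("t1", "cited a document that was not retrieved"),
    ("t2", "ignored the user's date range"),
    ("t3", "cited a document that was not retrieved"),
    ("t4", "answered in the wrong language"),
    ("t5", "ignored the user's date range"),
    ("t6", "cited a document that was not retrieved"),
]

# Axial coding: a reviewer assigns each note to a named failure mode.
# This mapping is illustrative and stands in for that manual step.
note_to_failure_mode = {
    "cited a document that was not retrieved": "hallucinated_citation",
    "ignored the user's date range": "constraint_ignored",
    "answered in the wrong language": "wrong_language",
}

failure_counts = Counter(note_to_failure_mode[note] for _, note in open_codes)

# Failure modes seen at least `min_count` times are candidates for evaluators.
min_count = 2
evaluator_candidates = [
    mode for mode, count in failure_counts.most_common() if count >= min_count
]
print(evaluator_candidates)
```

Counting by failure mode gives a simple way to decide which categories are stable enough to become structured labels or evaluators, and which are still one-off observations.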
Deploy
Ship tested changes with measurable impact. Use experiments to confirm that a prompt, model, retrieval configuration, agent implementation, or evaluator variant improves quality without introducing regressions.
- Active iteration: Use experiments via UI for prompt and model changes, or experiments via SDK for application and agent logic.
- Benchmarking: Compare candidates on the same dataset and score outputs with LLM-as-a-Judge, Scores via API/SDK, or Scores via UI.
- Releasing / deploying: Run regression checks manually during development or automatically in CI/CD before merge or deploy.
Experiments are useful in three common deployment workflows:
| Workflow | Use experiments to |
|---|---|
| Active iteration | Hill climb on a prompt, model, or agent implementation while engineering a change. |
| Benchmarking | Compare multiple implementations on the same dataset and scoring criteria. |
| Releasing / deploying | Run regression checks before merge or deploy so quality drops do not ship. |
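The regression-check workflow can be sketched in a few lines: run a baseline and a candidate on the same dataset, score both, and block the change if the candidate falls below the baseline. The example below is a standalone illustration in plain Python. The dataset, the lookup-table "tasks", `exact_match`, `run_experiment`, and `passes_regression_check` are all hypothetical and do not use the Langfuse SDK.

```python
from statistics import mean

# Hypothetical dataset items: an input plus the expected output.
dataset = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
    {"input": "3 * 3", "expected": "9"},
]


def exact_match(output, expected):
    """Score 1.0 when the output equals the expected answer, else 0.0."""
    return 1.0 if output.strip() == expected else 0.0


def run_experiment(task, items, scorer):
    """Run `task` on every item and return the mean score."""
    return mean(scorer(task(item["input"]), item["expected"]) for item in items)


def passes_regression_check(baseline_score, candidate_score, tolerance=0.0):
    """Pass when the candidate is no worse than the baseline minus `tolerance`."""
    return candidate_score >= baseline_score - tolerance


# Stand-ins for two application versions; real tasks would call your app.
baseline_answers = {"2 + 2": "4", "capital of France": "Paris", "3 * 3": "6"}
candidate_answers = {"2 + 2": "4", "capital of France": "Paris", "3 * 3": "9"}

baseline_score = run_experiment(baseline_answers.__getitem__, dataset, exact_match)
candidate_score = run_experiment(candidate_answers.__getitem__, dataset, exact_match)

print(passes_regression_check(baseline_score, candidate_score))
```

In a CI job, a failed check would typically end the run with a non-zero exit code so the merge or deploy is blocked. The `tolerance` parameter is one way to absorb noise from non-deterministic scorers such as LLM-as-a-Judge.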
Monitor
Monitor production with the failure modes you already annotated. Use online evaluators to score live behavior, catch quality issues after deployment, and discover the next examples to review.
- Turn annotated failure modes into online evaluators with LLM-as-a-Judge or custom scores via API/SDK.
- Use Score Analytics and custom dashboards to track distributions, trends, evaluator agreement, and regressions; when monitors reveal new patterns, send them back to annotation.
Start with a small number of high-signal LLM-as-a-Judge or custom evaluators derived from known failure modes, then expand coverage as new failures appear.
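One simple way to turn evaluator scores into a monitoring signal is a rolling average with a threshold, where traces that arrive while quality is low are sent back for annotation. The sketch below assumes scores between 0 and 1 from a single evaluator. `RollingScoreMonitor`, the window size, and the threshold are hypothetical choices for illustration, not Langfuse features.

```python
from collections import deque


class RollingScoreMonitor:
    """Hypothetical helper that tracks the mean of the last `window` scores."""

    def __init__(self, window, threshold):
        self.scores = deque(maxlen=window)
        self.threshold = threshold

    def add(self, score):
        """Record a score; return True when a full window averages below the threshold."""
        self.scores.append(score)
        window_full = len(self.scores) == self.scores.maxlen
        return window_full and sum(self.scores) / len(self.scores) < self.threshold


monitor = RollingScoreMonitor(window=3, threshold=0.7)

# Illustrative (trace_id, evaluator score) pairs in arrival order.
live_scores = [("t1", 1.0), ("t2", 0.9), ("t3", 0.4), ("t4", 0.3), ("t5", 1.0)]

# Traces that arrive while the rolling mean is below the threshold go to review.
review_queue = [trace_id for trace_id, score in live_scores if monitor.add(score)]
print(review_queue)
```

A rolling window reacts to sustained drops and ignores a single bad score, which keeps the review queue focused on patterns instead of isolated outliers.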
Which Langfuse feature should I use?
| If you want to... | Use this Langfuse feature |
|---|---|
| Capture application behavior | Observability, traces and observations |
| Segment traces for later review | Tags, metadata, users, sessions, environments, releases |
| Review examples manually | Annotation Queues, Scores via UI |
| Open Coding: capture open-ended notes | TEXT scores, Annotation Queues |
| Axial Coding: derive failure modes | Stable error categories, evaluation criteria |
| Create reusable test cases | Datasets |
| Compare changes before shipping | Experiments via UI, Experiments via SDK |
| Gate pull requests or deploys | CI/CD experiments |
| Monitor production quality | LLM-as-a-Judge, Scores via API/SDK |
| Analyze evaluator results | Score Analytics, custom dashboards |