
Evaluation workflow

Langfuse evaluation follows a continuous improvement loop with four jobs:

  1. Instrument -> Capture real behavior with Observability.
  2. Annotate -> Turn traces into reusable evaluation assets with annotations and datasets.
  3. Deploy -> Validate changes with experiments and CI/CD checks before they ship.
  4. Monitor -> Track production quality with online evaluators and Score Analytics.

Work through the loop continuously: collect traces, turn them into failure modes, validate changes against those failure modes, and use production monitoring to find the next examples to annotate.

The Continuous Evaluation/Iteration Loop

Instrument

Collect the evidence you need to evaluate your application. Start with traces that show what users asked, what your system did, and where the output came from.
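As a rough illustration of what a trace should carry before it is useful for evaluation, the sketch below models one with a plain dataclass. `TraceRecord` and `is_reviewable` are hypothetical names used only here; this is not the Langfuse trace schema or SDK.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class TraceRecord:
    """Hypothetical in-memory record, not the Langfuse trace schema."""
    user_input: str                                    # what the user asked
    output: str                                        # what the system answered
    steps: list[str] = field(default_factory=list)     # what the system did
    sources: list[str] = field(default_factory=list)   # where the output came from
    metadata: dict[str, Any] = field(default_factory=dict)


def is_reviewable(trace: TraceRecord) -> bool:
    """True when the trace has a non-empty input, a non-empty output, and at least one step."""
    return bool(trace.user_input.strip()) and bool(trace.output.strip()) and bool(trace.steps)
```

A trace that fails this kind of check usually points to missing instrumentation rather than to a quality problem in the application.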

Annotate

Turn raw traces into reusable evaluation assets. Start with human review, name the failure modes, then convert the useful examples into datasets and score definitions.

For a detailed walkthrough, use the error analysis guide to open-code traces, cluster failure categories, label examples, and decide which failures should become evaluators.
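One simple way to decide which failures should become evaluators is to count how often each labeled category occurs. The sketch below assumes review labels have been exported as `(trace_id, category)` pairs; the data, the category names, and the `min_count` threshold are all made up for illustration.

```python
from collections import Counter

# Hypothetical review labels: (trace_id, failure category, or None when the trace passed).
labels = [
    ("t1", "hallucinated_source"),
    ("t2", None),
    ("t3", "wrong_tone"),
    ("t4", "hallucinated_source"),
    ("t5", "hallucinated_source"),
    ("t6", "wrong_tone"),
    ("t7", "missed_tool_call"),
]


def evaluator_candidates(labels, min_count=2):
    """Return (category, count) pairs seen at least min_count times, most frequent first."""
    counts = Counter(category for _, category in labels if category is not None)
    return [(category, n) for category, n in counts.most_common() if n >= min_count]


print(evaluator_candidates(labels))
# [('hallucinated_source', 3), ('wrong_tone', 2)]
```

Categories below the threshold can stay as open-ended notes until more examples accumulate.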

Deploy

Ship tested changes with measurable impact. Use experiments to confirm that a prompt, model, retrieval configuration, agent implementation, or evaluator variant improves quality without introducing regressions.

Experiments are useful in three common deployment workflows:

| Workflow | Use experiments to |
| --- | --- |
| Active iteration | Hill climb on a prompt, model, or agent implementation while engineering a change. |
| Benchmarking | Compare multiple implementations on the same dataset and scoring criteria. |
| Releasing / deploying | Run regression checks before merge or deploy so quality drops do not ship. |
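A regression check of the kind described for releasing can be sketched as a comparison of per-item scores from two experiment runs on the same dataset. The function below is a minimal illustration, not a Langfuse API: it assumes scores are already available as dictionaries keyed by dataset item id with values between 0 and 1, and `max_drop` is an arbitrary tolerance.

```python
from statistics import mean


def regression_check(baseline, candidate, max_drop=0.02):
    """Compare per-item scores keyed by dataset item id.

    Returns (passed, report). The check passes when the candidate's mean
    score is no more than max_drop below the baseline's mean score.
    """
    shared = sorted(set(baseline) & set(candidate))
    if not shared:
        raise ValueError("no shared dataset items to compare")
    base_avg = mean(baseline[i] for i in shared)
    cand_avg = mean(candidate[i] for i in shared)
    # Items that got worse are listed even when the overall check passes.
    regressed = [i for i in shared if candidate[i] < baseline[i]]
    passed = cand_avg >= base_avg - max_drop
    return passed, {"baseline": base_avg, "candidate": cand_avg, "regressed_items": regressed}


passed, report = regression_check(
    {"a": 1.0, "b": 0.5, "c": 1.0},
    {"a": 1.0, "b": 1.0, "c": 0.5},
)
print(passed, report["regressed_items"])
# True ['c']
```

In a CI job, the script would exit with a non-zero status when `passed` is false so the merge or deploy is blocked. Reporting the regressed items separately keeps an unchanged average from hiding an item that got worse.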

Monitor

Monitor production with the failure modes you already annotated. Use online evaluators to score live behavior, catch quality issues after deployment, and discover the next examples to review.

Start with a small number of high-signal LLM-as-a-Judge or custom evaluators derived from known failure modes, then expand coverage as new failures appear.
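A custom evaluator derived from a known failure mode can be very small. The sketch below targets a hypothetical "missing citation" failure mode with a plain function; the function name, the substring match, and the returned dictionary shape are assumptions for illustration, not the Langfuse score format.

```python
def cites_source(output: str, sources: list[str]) -> dict:
    """Score 1.0 when the output mentions at least one retrieved source, else 0.0."""
    cited = [s for s in sources if s in output]
    return {
        "name": "cites_source",
        "value": 1.0 if cited else 0.0,
        "comment": f"cited {len(cited)} of {len(sources)} sources",
    }


result = cites_source("See docs/a.md for details.", ["docs/a.md", "docs/b.md"])
print(result["value"], result["comment"])
# 1.0 cited 1 of 2 sources
```

A deterministic check like this is cheap enough to run on every trace, which leaves LLM-as-a-Judge evaluators for failure modes that need judgment.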

Which Langfuse feature should I use?

| If you want to... | Use this Langfuse feature |
| --- | --- |
| Capture application behavior | Observability, traces and observations |
| Segment traces for later review | Tags, metadata, users, sessions, environments, releases |
| Review examples manually | Annotation Queues, Scores via UI |
| Open Coding: capture open-ended notes | TEXT scores, Annotation Queues |
| Axial Coding: derive failure modes | Stable error categories, evaluation criteria |
| Create reusable test cases | Datasets |
| Compare changes before shipping | Experiments via UI, Experiments via SDK |
| Gate pull requests or deploys | CI/CD experiments |
| Monitor production quality | LLM-as-a-Judge, Scores via API/SDK |
| Analyze evaluator results | Score Analytics, custom dashboards |
