LLM Evaluation in Production Starts With Explicit Failure Modes

Evaluation is most useful when it reflects the failures a system can actually produce in production: missing context, wrong retrieval, incorrect tool use, unstable outputs, and unhelpful responses.

  • LLM
  • Evaluation
  • Production AI
  • Quality

Problem

Teams often begin evaluation by asking for a benchmark. The harder and more useful question is: what exactly can this system do wrong in production, and how will we detect it before users do?

Start with failures, not metrics

Useful evaluation plans map directly to system behavior:

  • retrieval misses relevant context
  • the answer cites the wrong part of the source
  • the system over-answers beyond available evidence
  • tool outputs are technically correct but operationally unhelpful
  • small configuration changes create large quality swings
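Failure modes like these can be made machine-checkable. The sketch below encodes three of them as named checks over a single interaction record; every field name, type, and threshold here is a hypothetical illustration, not the API of any specific evaluation framework.

```python
# Minimal sketch: explicit failure modes as named checks over one record.
# All field names are hypothetical, chosen only to illustrate the idea.
from dataclasses import dataclass

@dataclass
class Interaction:
    query: str
    retrieved_ids: set     # document ids the retriever returned
    relevant_ids: set      # gold ids for this query
    cited_ids: set         # ids the answer actually cited
    answer_claims: int     # claims made in the answer
    supported_claims: int  # claims backed by retrieved evidence

def check_retrieval_miss(x: Interaction) -> bool:
    # Fires when retrieval returned none of the relevant documents.
    return len(x.retrieved_ids & x.relevant_ids) == 0

def check_wrong_citation(x: Interaction) -> bool:
    # Fires when the answer cites a source that was not relevant.
    return bool(x.cited_ids - x.relevant_ids)

def check_over_answering(x: Interaction) -> bool:
    # Fires when the answer claims more than the evidence supports.
    return x.answer_claims > x.supported_claims

CHECKS = {
    "retrieval_miss": check_retrieval_miss,
    "wrong_citation": check_wrong_citation,
    "over_answering": check_over_answering,
}

def failure_modes(x: Interaction) -> list:
    # Return the names of every check that fires for this interaction.
    return [name for name, check in CHECKS.items() if check(x)]
```

The payoff is that a regression report names the failure mode directly ("wrong_citation rate up 8%") instead of reporting a blended quality score that hides which behavior changed.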

Design implications

Once those failure modes are explicit, the evaluation strategy becomes more practical.

  • Build datasets that represent the real queries and constraints the team cares about.
  • Separate retrieval tests from answer tests.
  • Use human review selectively for the decisions that matter most.
  • Keep a small number of operational metrics visible after launch.
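Separating retrieval tests from answer tests, in particular, can be as simple as scoring the two stages independently and reporting them side by side. A minimal sketch, with illustrative names and a made-up record (not any team's real schema):

```python
# Sketch: score retrieval and answer quality separately, so a regression
# can be attributed to the right stage. Names and data are illustrative.

def retrieval_recall(retrieved_ids, relevant_ids):
    # Fraction of gold documents the retriever actually returned.
    if not relevant_ids:
        return 1.0
    return len(set(retrieved_ids) & set(relevant_ids)) / len(relevant_ids)

def answer_groundedness(answer_claims, supported_claims):
    # Answer-side score against the evidence that was retrieved, so the
    # answer is not penalized for a retrieval miss (and vice versa).
    return supported_claims / answer_claims if answer_claims else 1.0

record = {"retrieved": ["d1", "d3"], "relevant": ["d1", "d2"],
          "claims": 4, "supported": 3}

report = {
    "retrieval_recall": retrieval_recall(record["retrieved"], record["relevant"]),
    "answer_groundedness": answer_groundedness(record["claims"], record["supported"]),
}
```

Kept as two numbers rather than one blended score, this also gives a natural pair of operational metrics to watch after launch.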

Tradeoffs

Broad benchmark coverage can be useful, but it usually cannot replace domain-specific checks. In production environments, a smaller evaluation set tied to real failure modes is often more valuable than a larger generic suite.

Production lesson

Evaluation is not a reporting layer added after the system exists. It shapes architecture, rollout confidence, and the speed at which a team can safely improve an LLM workflow.

Related projects

Case studies where these tradeoffs showed up in practice.

Project · Legal Tech · Public Sector AI

PGDF

OSIRIS Legal-Fiscal AI Workflows

Data Scientist · May 2023 - May 2024

AI delivery for PGDF legal-fiscal operations, spanning production APIs, supervised and semi-supervised models, active learning, and early LLM exploration for document-heavy institutional workflows.

Primary impact

Brought governed ML workflows and production APIs into legal-fiscal operations, while designing active-learning paths for longer-term model adaptation.

  • FastAPI
  • Active Learning
  • MLflow
  • DVC
  • LLM

Key outcomes

  • Production APIs connected model outputs to PGDF internal systems
  • Active-learning loop designed to reduce model drift over time

Next step

Want the delivery context behind this line of thinking?

The project pages show where these technical decisions had to work inside real institutions, teams, and operational constraints.