Problem
Teams often begin evaluation by asking for a benchmark. The harder and more useful question is: what exactly can this system do wrong in production, and how will we detect it before users do?
Start with failures, not metrics
Useful evaluation plans map directly to system behavior:
- retrieval misses relevant context
- the answer cites the wrong part of the source
- the system over-answers beyond available evidence
- tool outputs are technically correct but operationally unhelpful
- small configuration changes create large quality swings
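One lightweight way to make the failure modes above operational is to tag each evaluation case with the failures it is designed to catch, then check coverage. A minimal sketch in Python; the enum values mirror the list above, but the `EvalCase` fields and the example case are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass
from enum import Enum


class FailureMode(Enum):
    RETRIEVAL_MISS = "retrieval misses relevant context"
    WRONG_CITATION = "answer cites the wrong part of the source"
    OVER_ANSWERING = "system over-answers beyond available evidence"
    UNHELPFUL_TOOL_OUTPUT = "tool output correct but operationally unhelpful"
    CONFIG_SENSITIVITY = "small config change causes large quality swing"


@dataclass
class EvalCase:
    query: str
    expected_doc_ids: set[str]   # documents retrieval must surface
    targets: list[FailureMode]   # failure modes this case probes

cases = [
    EvalCase(
        query="What is our refund window for annual plans?",  # hypothetical
        expected_doc_ids={"policy-refunds"},
        targets=[FailureMode.RETRIEVAL_MISS, FailureMode.WRONG_CITATION],
    ),
]

# Coverage check: which failure modes have no dedicated test case yet?
covered = {mode for case in cases for mode in case.targets}
uncovered = set(FailureMode) - covered
```

A coverage gap like `uncovered` being non-empty is a signal to write more cases, not to widen a generic benchmark.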
Design implications
Once those failure modes are explicit, the evaluation strategy follows directly from them:
- Build datasets that represent the real queries and constraints the team cares about.
- Separate retrieval tests from answer tests.
- Use human review selectively for the decisions that matter most.
- Keep a small number of operational metrics visible after launch.
Tradeoffs
Broad benchmark coverage can be useful, but it usually cannot replace domain-specific checks. In production environments, a smaller evaluation set tied to real failure modes is often more valuable than a larger generic suite.
Production lesson
Evaluation is not a reporting layer added after the system exists. It shapes architecture, rollout confidence, and the speed at which a team can safely improve an LLM workflow.