Problem
Teams often begin evaluation by asking for a benchmark. The harder and more useful question is: what exactly can this system do wrong in production, and how will we detect it before users do?
Start with failures, not metrics
Useful evaluation plans map directly to system behavior:
- retrieval misses relevant context
- the answer cites the wrong part of the source
- the system over-answers beyond available evidence
- tool outputs are technically correct but operationally unhelpful
- small configuration changes create large quality swings
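One lightweight way to make the failure modes above operational is to tag each evaluation case with the failures it is designed to catch, then check coverage. A minimal sketch in Python; the enum values mirror the list above, but the `EvalCase` fields and the example case are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass
from enum import Enum


class FailureMode(Enum):
    RETRIEVAL_MISS = "retrieval misses relevant context"
    WRONG_CITATION = "answer cites the wrong part of the source"
    OVER_ANSWERING = "system over-answers beyond available evidence"
    UNHELPFUL_TOOL_OUTPUT = "tool output correct but operationally unhelpful"
    CONFIG_SENSITIVITY = "small config change causes large quality swing"


@dataclass
class EvalCase:
    query: str
    expected_doc_ids: set[str]   # documents retrieval must surface
    targets: list[FailureMode]   # failure modes this case probes

cases = [
    EvalCase(
        query="What is our refund window for annual plans?",  # hypothetical
        expected_doc_ids={"policy-refunds"},
        targets=[FailureMode.RETRIEVAL_MISS, FailureMode.WRONG_CITATION],
    ),
]

# Coverage check: which failure modes have no dedicated test case yet?
covered = {mode for case in cases for mode in case.targets}
uncovered = set(FailureMode) - covered
```

A coverage gap like `uncovered` being non-empty is a signal to write more cases, not to widen a generic benchmark.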
Design implications
Once those failure modes are explicit, the evaluation strategy follows directly from them:
- Build datasets that represent the real queries and constraints the team cares about.
- Separate retrieval tests from answer tests.
- Use human review selectively for the decisions that matter most.
- Keep a small number of operational metrics visible after launch.
Tradeoffs
Broad benchmark coverage can be useful, but it usually cannot replace domain-specific checks. In production environments, a smaller evaluation set tied to real failure modes is often more valuable than a larger generic suite.
Production lesson
Evaluation is not a reporting layer added after the system exists. It shapes architecture, rollout confidence, and the speed at which a team can safely improve an LLM workflow.