Reliability is the New Credibility in AI Systems

Demos impress stakeholders. Reliability keeps them.


60-Second Summary

The AI credibility gap is not about capability. Current models can handle most enterprise tasks. The gap is between a polished demo and what happens when 500 users hit the system with messy, unpredictable inputs every day. Hallucination, retrieval misses, tool failures, and context drift are not edge cases. They are Tuesday. Reliability in AI systems is an engineering practice: structured outputs, evaluation pipelines, fallback chains, tracing everything, and treating cost controls as safety mechanisms. Teams that invest here ship systems that last. Teams that skip it ship demos.

The Demo That Went Too Well

We once ran a contract analysis demo that went perfectly. The model identified key clauses, summarized obligations, flagged risks. The client was ready to sign. We knew something they did not: we had hand-picked those three documents. The model had seen similar structures in its training data. When we later tested on the client’s actual corpus — scanned PDFs with OCR artifacts, inconsistent formatting, clauses split across pages — accuracy dropped from 92% to 61%.

We could have shipped the demo version. We did not. Because we had already learned what happens when you do: three months of escalations, eroded trust, and a project that gets quietly shelved.

The hard truth: a system that works 92% of the time on curated inputs and 61% on real inputs is a 61% system. Your users will not encounter curated inputs.

Five Failure Modes That Kill Production Systems

These are not theoretical. We have logs for each one.

1. Hallucination

The model invents facts, cites non-existent clauses, or fabricates numbers. It does this confidently. In enterprise contexts — legal, financial, compliance — a single hallucinated data point can trigger a bad business decision. What makes this insidious: the output is well-formed and reads correctly. Only someone who checks the source material will catch it.

2. Retrieval Failure

Your RAG system retrieves the wrong chunks. The model then reasons faithfully over wrong evidence. The output cites real documents — just the wrong sections. We had a case where a query about liability caps consistently retrieved indemnification clauses because the embedding similarity was high. Technically related. Practically wrong. The fix was not better embeddings. It was adding metadata-based filtering and relevance score thresholds below which results are discarded rather than passed to the model.
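That fix can be sketched in a few lines. This is an illustrative sketch, not our production code: the `Chunk` fields, the metadata key, and the 0.75 cutoff are assumptions standing in for whatever your retriever actually returns.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    score: float      # retriever's relevance score, 0..1
    doc_type: str     # metadata attached at indexing time

def filter_chunks(chunks, required_doc_type, min_score=0.75):
    """Keep only chunks that pass the metadata filter AND clear the
    relevance threshold; everything else is discarded rather than
    passed to the model."""
    return [
        c for c in chunks
        if c.doc_type == required_doc_type and c.score >= min_score
    ]

chunks = [
    Chunk("Liability is capped at 12 months of fees.", 0.81, "liability"),
    Chunk("Vendor shall indemnify Customer against...", 0.84, "indemnification"),
    Chunk("Liability obligations survive termination.", 0.62, "liability"),
]
kept = filter_chunks(chunks, required_doc_type="liability")
# Only the first chunk survives: right metadata AND score above threshold.
```

Note that the high-scoring indemnification chunk — the exact failure described above — is excluded by metadata, not by score.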

3. Context Drift

In multi-turn conversations, the model gradually forgets instructions from earlier turns. By turn 20, the persona has shifted, the output format has changed, and constraints set in the system prompt are being ignored. This happens because new content pushes old content further from the attention mechanism’s focus. It is a fundamental property of transformer architecture, not a bug you can prompt away.
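You cannot prompt it away, but you can engineer around it. One common mitigation — a sketch under the assumption of an OpenAI-style message list, not a cure — is to periodically re-inject the system instructions so constraints stay close to the most recent turns:

```python
def build_messages(system_prompt, history, reinject_every=10):
    """Assemble the message list, re-appending the system instructions
    every N turns so they never drift far from the end of the context.
    (Illustrative mitigation; message shape is an assumption.)"""
    messages = [{"role": "system", "content": system_prompt}]
    for i, turn in enumerate(history, start=1):
        messages.append(turn)
        if i % reinject_every == 0:
            messages.append({"role": "system", "content": system_prompt})
    return messages
```

This trades a few extra tokens per request for constraints that hold at turn 20 roughly as well as they did at turn 2.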

4. Tool Execution Failure

The agent calls the right tool with wrong parameters. Or calls the wrong tool entirely. Or the tool succeeds but returns data in an unexpected format. Without per-tool error handling, these failures cascade: the model receives garbage, incorporates it into its reasoning, and produces an output that looks fine but is built on a broken foundation.
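Per-tool error handling can be as simple as a wrapper that converts every failure — exception or unexpected output shape — into a structured error the model can see. A minimal sketch; `lookup_clause` and the expected keys are hypothetical:

```python
def safe_tool_call(tool_fn, args, expected_keys):
    """Wrap a tool so failures surface as structured errors instead of
    garbage flowing into the model's reasoning."""
    try:
        result = tool_fn(**args)
    except Exception as exc:
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}
    if not isinstance(result, dict) or not expected_keys <= result.keys():
        # The tool "succeeded" but returned an unexpected format.
        return {"ok": False, "error": "unexpected tool output format"}
    return {"ok": True, "data": result}

def lookup_clause(clause_id):  # hypothetical tool
    return {"clause_id": clause_id, "text": "Liability is capped at..."}

result = safe_tool_call(lookup_clause, {"clause_id": "9.2"},
                        {"clause_id", "text"})
```

The point is that the model receives `{"ok": False, "error": ...}` and can recover or escalate, instead of incorporating garbage into its reasoning.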

5. Silent Degradation

This is the worst one. The system does not crash. It does not throw errors. It just gradually gets worse. Retrieval scores drift down as new documents are added. Model behavior shifts after a provider update. Prompt performance degrades on a new category of inputs nobody tested. Without continuous monitoring, you only discover this when users start complaining — or worse, stop using the system.

The Reliability Stack
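The stack is the set of practices named in the summary — structured outputs, evaluation pipelines, fallback chains, tracing, cost controls. As one concrete piece, here is a minimal sketch of a fallback chain (retry → alternate model → cached response → human); the model callables are assumptions, not a real provider API:

```python
def answer_with_fallbacks(query, primary, alternate, cache, max_retries=2):
    """Walk the fallback chain: retry the primary model, then try an
    alternate model, then a cached response, then escalate to a human.
    (Sketch; `primary` and `alternate` stand in for real model calls.)"""
    for _ in range(max_retries):
        try:
            return {"source": "primary", "answer": primary(query)}
        except Exception:
            continue
    try:
        return {"source": "alternate", "answer": alternate(query)}
    except Exception:
        pass
    if query in cache:
        return {"source": "cache", "answer": cache[query]}
    return {"source": "human", "answer": None}  # escalate, never guess
```

The chain guarantees a defined outcome for every request: a degraded answer is acceptable, an undefined failure is not.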

Two Failures That Shaped Our Reliability Practice

Failure 1: The well-cited wrong answer. Our contract analysis system retrieved an indemnification clause when asked about liability caps. The model cited the clause correctly — section, page number, exact quote. The answer was wrong, but it looked authoritative. We caught it only because a lawyer in the pilot group knew the contract. The fix: we added retrieval relevance score thresholds. If no retrieved chunk scores above 0.75, the system responds with “I don’t have enough confidence to answer this” instead of guessing. This reduced our answer rate by 8% but eliminated an entire class of plausible-sounding wrong answers.
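The abstain logic is small; the discipline is in accepting the lower answer rate. A sketch mirroring the 0.75 cutoff above (the chunk dicts and `generate` callable are illustrative assumptions):

```python
NO_ANSWER = "I don't have enough confidence to answer this."

def answer_or_abstain(chunks, generate, min_score=0.75):
    """If no retrieved chunk clears the relevance threshold, abstain
    instead of letting the model reason over weak evidence."""
    confident = [c for c in chunks if c["score"] >= min_score]
    if not confident:
        return NO_ANSWER
    return generate(confident)
```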

Failure 2: The Friday afternoon model update. A model provider pushed a minor version update. Our prompt outputs shifted subtly — the JSON field ordering changed, confidence scores skewed higher, and one edge-case behavior we relied on disappeared. We had no regression tests. It took us four days to identify the root cause. Now every prompt has a test suite of 20+ cases that runs daily, even when we have not changed anything, because providers change things without telling you.
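A regression suite in that spirit can be mostly deterministic checks over fixed cases. A sketch, assuming JSON outputs; `call_model`, the case inputs, and the field names are hypothetical:

```python
import json

CASES = [
    {"input": "Summarize clause 9.2",
     "must_have_fields": ["summary", "confidence"]},
    {"input": "What is the liability cap?",
     "must_have_fields": ["answer", "citations"]},
]

def check_case(call_model, case):
    """Deterministic check: output must parse as JSON and contain the
    required fields. Catches exactly the drift described above."""
    raw = call_model(case["input"])
    try:
        out = json.loads(raw)
    except json.JSONDecodeError:
        return False  # format drifted
    return all(f in out for f in case["must_have_fields"])

def run_suite(call_model):
    """Return failing inputs; a non-empty list blocks deployment
    (or pages someone when the scheduled run fails)."""
    return [c["input"] for c in CASES if not check_case(call_model, c)]
```

Running this on a schedule, not just on change, is what catches the provider-side updates nobody announces.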

Decisions and Trade-offs

| Decision | What We Chose | The Cost |
| --- | --- | --- |
| Output format | Structured JSON with schema validation on every response | More upfront schema design work. But near-zero integration failures. |
| Evaluation | LLM-as-judge for semantic quality + deterministic checks for format/citation | LLM-as-judge has variance (~5–10%). We average over 3 runs. Slower but more reliable. |
| Confidence threshold | Human escalation below 0.7 confidence | ~12% of queries go to humans initially. Decreases as we tune. But zero high-visibility failures. |
| Tracing | Full trace on every call (prompt, response, latency, tokens, cost) | Storage cost of ~$200/month at our scale. Debugging time dropped from days to minutes. |
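The schema validation row deserves a concrete shape. Production pipelines typically use JSON Schema or Pydantic; this stdlib-only sketch shows the idea, with hypothetical field names:

```python
import json

# Hypothetical schema: field name -> required Python type.
SCHEMA = {"answer": str, "confidence": float, "citations": list}

def validate_output(raw):
    """Validate a model response against the schema before any
    downstream use; fail loudly here, not three services later."""
    data = json.loads(raw)  # raises on malformed JSON
    for field, ftype in SCHEMA.items():
        if not isinstance(data.get(field), ftype):
            raise ValueError(f"field {field!r} missing or not {ftype.__name__}")
    return data
```

Rejecting at this boundary is what turns "more upfront schema design work" into "near-zero integration failures."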

Your Reliability Checklist

  • Every LLM output is validated against a defined schema before downstream use
  • Retrieval results have minimum relevance score thresholds — below threshold means “I don’t know”
  • Every prompt template has a test suite of 20+ cases that runs on change AND on schedule
  • Full traces exist for every production call: prompt, response, latency, tokens, cost, retrieval scores
  • A fallback chain exists: retry → alternate model → cached response → human
  • Cost limits per request and per session with circuit breakers on anomalies
  • Latency is tracked at p95/p99 with alerts, not just averages

Key Takeaways

  • Demo performance is not production performance. Test on real data, at real scale, with real users. Nothing else counts.
  • The most dangerous failure mode is the well-cited wrong answer. It looks authoritative. It is wrong. Retrieval score thresholds are your defense.
  • Treat prompts like code: version them, test them, run regression suites, block deployment on failure.
  • Observability is not optional infrastructure. It is how you detect the 90% of issues that do not throw errors.
  • Cost controls are safety mechanisms. A token budget prevents runaway behavior, not just runaway bills.

