Every AI feature looks impressive in a demo. The problem starts the day after launch, when a model confidently returns a policy number that doesn't exist, cites a paragraph that isn't in the source, or hallucinates a refund amount that isn't in the ledger. Hallucinations are not a prompt-engineering bug; they're a systems problem. You don't solve them with a better prompt — you solve them with a stack of defences, each of which catches a different failure mode. This post walks through the stack we ship on production AI features: grounding, structured output with schema validation, confidence scoring, explicit abstention, human review, and the evals that keep the whole thing honest.
Why prompt engineering alone doesn't fix this
A reasonable first instinct is to tell the model, in the system prompt, not to hallucinate. It sort of works. It doesn't work enough to ship. Models generate plausible tokens; that's the job. If you ask a question the model can't answer from its context and don't give it an explicit out, it will fill the gap with something plausible-sounding. The fix is architectural: constrain what the model can say, give it a way to say 'I don't know', verify its answer before it reaches the user, and keep humans in the loop where the cost of being wrong is high.
Layer 1: grounding via RAG with citations
Retrieval-augmented generation changes the task from 'recall an answer' to 'produce an answer from these specific facts'. That shift alone eliminates a large class of hallucinations. The catch is that retrieval has to be good, and the model has to actually use what was retrieved. Two patterns that became table-stakes in 2025: source-anchored chunks and inline citations.
- Chunk with source anchors. Every chunk carries a stable ID (document_id, page, offset) that flows through retrieval into the model's context and out into the response.
- Instruct the model to emit citations inline. 'For every factual claim, include the source ID in the format [doc:page]. If no source supports the claim, say so explicitly.'
- Validate citations post-hoc. Parse the response, check that every cited ID appears in the retrieved set, and that the cited chunk actually contains the claim. A separate lightweight verifier model works well here.
- Treat unsupported claims as failures. A response with a claim that has no valid citation should either be regenerated or surfaced to the user with a visible uncertainty marker, not passed through silently.
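The post-hoc check from the third bullet can be sketched as a pure function. The `[doc:page]` marker format and the `Chunk` shape are assumptions for illustration; the semantic half of the check — whether the cited chunk actually supports the claim — is left to the separate verifier model:

```typescript
// A chunk as it comes out of retrieval, carrying its stable anchor.
interface Chunk {
  id: string;
  page: number;
  text: string;
}

// Scan a response for inline [doc:page] citations and return every one
// that does not point at a chunk in the retrieved set. An empty result
// means all citations reference real sources; whether each cited chunk
// supports its claim is a second, semantic check.
function findHallucinatedCitations(response: string, retrieved: Chunk[]): string[] {
  const known = new Set(retrieved.map((c) => `${c.id}:${c.page}`));
  const bad: string[] = [];
  for (const m of response.matchAll(/\[([\w-]+):(\d+)\]/g)) {
    const key = `${m[1]}:${m[2]}`;
    if (!known.has(key)) bad.push(key);
  }
  return bad;
}
```

Any response that yields a non-empty list goes back for regeneration or out with a visible uncertainty marker, per the fourth bullet.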
Layer 2: structured output with schema validation
Schema-enforced output is the single biggest reliability win on any AI feature that produces something a downstream system consumes. You get valid JSON, you get required fields, you get enum values from a closed set. Important caveat: schema enforcement guarantees shape, not truth. A model can emit a perfectly valid JSON object that contains fabricated data. The schema stops bad parsing; it doesn't stop bad facts. You need both.
```typescript
import { z } from "zod";
import Anthropic from "@anthropic-ai/sdk";

// Schema first — typed fields, closed enums, explicit "I don't know"
const AnswerSchema = z.object({
  answer: z.string().min(1),
  confidence: z.enum(["high", "medium", "low", "unknown"]),
  citations: z.array(
    z.object({
      claim: z.string(),
      source_id: z.string(),
      quote: z.string(),
    }),
  ),
  // Explicit escape hatch — the model is allowed to say it can't answer
  abstain_reason: z.string().nullable(),
});

type Answer = z.infer<typeof AnswerSchema>;

// Chunk, GROUNDED_SYSTEM_PROMPT, formatContext, extractText, and
// GenerationError are app-level helpers defined elsewhere.
export async function answer(question: string, chunks: Chunk[]): Promise<Answer> {
  const anthropic = new Anthropic();
  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 2048,
    system: GROUNDED_SYSTEM_PROMPT,
    messages: [
      { role: "user", content: formatContext(question, chunks) },
      // Prefill the first tokens of JSON so the model commits to the shape
      { role: "assistant", content: "{" },
    ],
  });

  const raw = "{" + extractText(response);
  const parsed = AnswerSchema.safeParse(JSON.parse(raw));
  if (!parsed.success) throw new GenerationError("Schema validation failed", parsed.error);

  // Post-hoc citation check — every citation must reference a real chunk
  const chunkIds = new Set(chunks.map((c) => c.id));
  for (const cite of parsed.data.citations) {
    if (!chunkIds.has(cite.source_id)) {
      throw new GenerationError("Hallucinated citation", { cite });
    }
  }
  return parsed.data;
}
```

A few things are doing work in that snippet. The schema includes a confidence enum and a nullable abstain_reason — the model has explicit permission to say it can't answer. The Zod parse is a hard boundary; invalid output throws, it doesn't get patched. The post-hoc citation check catches the most common hallucination mode on RAG systems: a citation to a source ID that was never retrieved.
Layer 3: confidence scoring and abstention
Asking a model for a confidence score looks easy and is subtly dangerous. Neural networks are systematically overconfident: they'll rate a wrong answer 'high' confidence and a correct answer 'low' confidence often enough that naive thresholding fails. The fix is to use confidence as one signal among several, not as the sole gate.
- Log model-reported confidence but don't route on it alone. Combine with retrieval scores, citation-coverage checks, and domain-specific validators.
- Calibrate thresholds against real review data. A 'high' confidence rating on your workload might correspond to 72% accuracy or 94% accuracy depending on the task. Measure, don't guess.
- Build explicit abstention into the schema. A model that can return 'I don't know' is a model that will return 'I don't know' — write the prompt to prefer abstention over fabrication.
- Sample auto-approved outputs for audit. Silent failure is the failure mode that kills trust. Even a 1% manual-review sample of auto-approved cases surfaces regressions before users report them.
Don't show the model's confidence score to human reviewers. It primes them — reviewers agree with high-confidence answers more often than the underlying accuracy justifies. Review blind; log the score for analysis later.
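Put together, the routing logic is a handful of independent checks. The sketch below shows the shape; the threshold numbers are placeholders, to be calibrated against your own review data as described above:

```typescript
type Confidence = "high" | "medium" | "low" | "unknown";

// Independent signals collected per response. Field names are
// illustrative; retrievalScore is assumed normalized to 0..1.
interface Signals {
  confidence: Confidence;
  retrievalScore: number;   // top-chunk similarity, 0..1
  citationCoverage: number; // fraction of claims with valid citations
  abstained: boolean;       // abstain_reason was non-null
}

// Any one signal firing is a reason to escalate to human review.
function shouldEscalate(s: Signals): boolean {
  if (s.abstained) return true;
  if (s.confidence === "low" || s.confidence === "unknown") return true;
  if (s.retrievalScore < 0.55) return true; // placeholder threshold
  if (s.citationCoverage < 1.0) return true; // any unsupported claim escalates
  return false;
}
```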
Layer 4: human in the loop, where it matters
Human review is expensive. The design problem is getting the right cases to humans and the rest through automatically. Route by impact, not by confidence alone. A high-confidence answer that updates a billing record deserves review; a low-confidence answer that suggests a blog title doesn't. Tie the review tier to the blast radius of being wrong.
- Tier cases by impact — read-only suggestions, reversible actions, irreversible actions — and attach a review policy to each tier: automate the bottom tier, review the top tier unconditionally, escalate the middle tier on signal.
- Escalate on any of: abstain_reason non-null, confidence below threshold, citation check failure, domain validator failure. These are independent signals and any one firing is a reason to escalate.
- Give reviewers the full context — retrieved chunks, model reasoning, citations. Reviewing a raw answer without the context the model had is slower and less accurate.
- Add a 'manual only' switch at the feature level for incident response. When a regression lands, route 100% to humans while you fix it, not 0% to users.
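A minimal version of that routing, with the tier names from above and the incident-response switch folded in; the function shape is illustrative, not a prescribed API:

```typescript
// Blast-radius tiers, highest impact last.
type Tier = "read_only" | "reversible" | "irreversible";

// Impact decides first, model signals second. The manual-only switch
// overrides everything during incident response.
function reviewRequired(tier: Tier, signalFired: boolean, manualOnly: boolean): boolean {
  if (manualOnly) return true;              // incident: route 100% to humans
  if (tier === "irreversible") return true; // always reviewed, even at high confidence
  if (tier === "reversible") return signalFired; // escalate on any firing signal
  return false;                             // read-only suggestions ship automatically
}
```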
Layer 5: evals built from production traces
Evals written on day one age badly. Real user queries don't look like the examples an engineer dreams up. The eval suite that matters is the one you harvest from production: the questions users actually ask, paired with the answers your best reviewers would give. Build that dataset from the start — log everything, review samples, curate a golden set — and you'll have a pre-deployment gate that catches regressions before they ship.
A practical eval ratio we ship on client projects: 60% real production queries with human-curated answers, 30% adversarial inputs designed to trigger known hallucination modes, 10% regression cases from past incidents. Run this on every prompt or model change before promotion.
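As a sketch, the 60/30/10 split over curated case pools looks like this; the `EvalCase` shape and pool names are illustrative, and the pools are assumed pre-shuffled:

```typescript
interface EvalCase {
  input: string;
  expected: string;
  source: "production" | "adversarial" | "regression";
}

// Draw a fixed-size suite at roughly 60/30/10. The regression share
// takes the remainder so the counts always sum to the requested size.
function buildEvalSet(
  production: EvalCase[],
  adversarial: EvalCase[],
  regression: EvalCase[],
  size: number,
): EvalCase[] {
  const nProd = Math.round(size * 0.6);
  const nAdv = Math.round(size * 0.3);
  const nReg = size - nProd - nAdv;
  return [
    ...production.slice(0, nProd),
    ...adversarial.slice(0, nAdv),
    ...regression.slice(0, nReg),
  ];
}
```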
The production checklist
A concrete list of what's in place before we sign off on an AI feature going live:
- Retrieval returns source-anchored chunks with stable IDs, and the system prompt requires inline citations for every factual claim.
- Output schema is Zod-enforced with a confidence enum and a nullable abstain_reason; parse failures throw, they don't fall back to freeform.
- Post-hoc verifier validates every citation against the retrieved set and rejects responses with unsupported claims.
- Escalation rules route low-confidence, abstention, or validator-failure cases to human review, with full context attached.
- Review UI hides confidence scores from reviewers to prevent anchoring bias.
- Budget and rate caps are set at both the request and session level; a single runaway doesn't produce a six-figure bill.
- Traces are persisted for every request — inputs, retrieved chunks, model outputs, reviewer decisions — and sampled into the eval suite weekly.
- A 'manual only' switch exists per feature and has been tested on staging. Incident response doesn't require a code deploy.
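The budget-cap item deserves a concrete shape. A minimal in-memory guard, with illustrative cap values; a production version would back this with a shared store so the caps hold across instances:

```typescript
// Tracks spend per session and enforces two caps: a per-request
// ceiling and a cumulative per-session ceiling.
class BudgetGuard {
  private spentBySession = new Map<string, number>();

  constructor(
    private readonly perRequestCapUsd: number,
    private readonly perSessionCapUsd: number,
  ) {}

  // Check before calling the model, using an estimated cost.
  allow(sessionId: string, estimatedCostUsd: number): boolean {
    if (estimatedCostUsd > this.perRequestCapUsd) return false;
    const spent = this.spentBySession.get(sessionId) ?? 0;
    return spent + estimatedCostUsd <= this.perSessionCapUsd;
  }

  // Record actual cost after the call completes.
  record(sessionId: string, costUsd: number): void {
    const spent = this.spentBySession.get(sessionId) ?? 0;
    this.spentBySession.set(sessionId, spent + costUsd);
  }
}
```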
What we still wouldn't automate
There are a few product surfaces we still recommend against shipping unattended, regardless of how tight the stack is: anything that writes to a financial ledger, sends legally binding communications, makes medical or diagnostic claims, or takes irreversible actions on user data. Not because the model will be wrong every time — it won't — but because the failure cost when it is wrong outweighs the automation saving. Keep the AI as a drafting step and a reviewer as the authority. That's not a compromise; that's good product design.
Key takeaways
- Hallucinations are a systems problem, not a prompt problem. Fix them with layered defences, not a better prompt.
- Ground via RAG with source-anchored chunks and inline citations; verify every citation against the retrieved set before shipping a response.
- Structured output with Zod validation stops bad parsing but doesn't stop bad facts — you need the grounding and the schema and the verifier.
- Let the model say 'I don't know'. Build abstention into the schema; route abstentions to humans. A model that never abstains is a model that always invents.
- Eval on real production traces, not synthetic examples. Harvest from the start or regressions will catch you in production.