
AI agents in production: the architectural patterns that survived 2025

ReAct, planner/executor, and multi-agent systems — which agent patterns actually made it to production in 2025, which ones died, and what we ship for clients today.

The 2024–2025 agent hype cycle produced more YouTube demos than production deployments. Gartner's widely cited number — over 40% of agentic AI projects will be cancelled by the end of 2027 — lines up with what we see on client engagements. Most agent projects fail for the same three reasons: the architecture was wrong for the workload, observability was bolted on after the first incident, or the team picked a framework before they understood the problem. This post is about the patterns that actually made it through 2025 into production, the failure modes we stopped tolerating, and the decision framework we use before touching a line of code.

The three patterns that matter

Ignore the zoo of named frameworks. Underneath every agent in production today is one of three architectural shapes — or a composition of them. Pick the simplest one that fits the workload and you'll save yourself months of debugging.

| Pattern | Best for | Latency profile | Failure mode to watch |
| --- | --- | --- | --- |
| ReAct loop | Dynamic, open-ended tasks with unknown depth | Linear in turns — often slow | Infinite loops, tool thrashing |
| Planner/executor | Predictable workflows with a known task shape | Parallelisable — up to 3–4× faster than ReAct | Bad plans cascade; replanner is critical |
| Orchestrator/worker | Multi-domain tasks with distinct sub-specialties | Latency depends on slowest worker | Orchestrator misrouting breaks everything |
| Swarm / peer-to-peer | Research papers, mostly | Unpredictable | Debuggability collapses past 3 agents |

Peer-to-peer swarms remain mostly academic. Every production multi-agent system we've shipped or audited uses hub-and-spoke — one orchestrator, N specialised workers, a shared state store. The pattern is boring. That's the point.

ReAct is still the default — and that's fine

ReAct (reason, act, observe, repeat) is the thinnest possible agent loop: the model reasons about what to do next, calls a tool, reads the result, and decides again. It works well when the task is genuinely open-ended and the depth is hard to predict. It fails quietly when the model gets into a tool-thrashing loop — same tool, slightly different arguments, ten turns in a row — burning tokens and wallclock time.

A minimal production-grade tool loop

// Small but battle-tested agent loop
// Guards: max turns, tool timeouts, repeat detection, structured stop condition
async function runAgent(task: string, tools: ToolRegistry) {
  const messages: Message[] = [{ role: "user", content: task }];
  const MAX_TURNS = 12;
  const seen = new Set<string>();

  for (let turn = 0; turn < MAX_TURNS; turn++) {
    const response = await model.create({
      model: "claude-sonnet-4-6",
      messages,
      tools: tools.definitions,
      stop_sequences: ["<done/>"],
    });

    messages.push({ role: "assistant", content: response.content });

    if (response.stop_reason === "end_turn") return response;

    const toolCall = response.content.find((b) => b.type === "tool_use");
    if (!toolCall) return response;

    // Repeat detection: same tool + same args as an earlier turn means we bail
    const sig = `${toolCall.name}:${JSON.stringify(toolCall.input)}`;
    if (seen.has(sig)) throw new AgentError("Tool thrashing detected", { sig });
    seen.add(sig);

    // withTimeout: helper that rejects if the tool call exceeds the deadline
    const result = await withTimeout(
      tools.run(toolCall.name, toolCall.input),
      15_000, // 15s per tool call
    );

    messages.push({
      role: "user",
      content: [{ type: "tool_result", tool_use_id: toolCall.id, content: result }],
    });
  }

  throw new AgentError("Max turns exceeded", { messages });
}

Always enforce a turn cap, tool timeouts, and repeat detection before shipping. Half the ReAct incidents we've debugged come from missing one of these three guardrails — a single runaway loop can burn a thousand dollars of tokens before ops notices.

Planner/executor: the pattern that wins when the workload is predictable

When the task shape is known — invoice processing, customer onboarding, multi-step data enrichment — planner/executor beats ReAct on almost every axis. A stronger reasoning model produces a DAG of subtasks up front; smaller, cheaper models execute each subtask in parallel; a replanner corrects course when a step fails. Public benchmarks show up to 92% task completion with a 3.6× speedup over sequential ReAct on structured workflows. Our own client numbers track close to that for well-scoped ETL-style tasks.

  • Use your best reasoning model for planning (Opus-tier) and your cheapest for execution (Haiku-tier). The economics shift dramatically in your favour — the planner runs once per task, the executors run many times.
  • Make the plan structured. JSON with typed steps and explicit dependencies beats freeform prose every time. You'll want to serialise, inspect, and partially replay plans.
  • Build a replanner into the loop, not as an afterthought. When step 4 of 7 fails, you don't retry from step 1 — you hand the failure back to the planner and let it amend.
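To make the "structured plan" point concrete, here is a minimal sketch of a typed plan and a DAG executor. The `PlanStep` shape, `executePlan`, and `runStep` are illustrative names, not a specific framework's API; a real system would add retries and a replanner hook around the failure path.

```typescript
// Hypothetical plan shape: typed steps with explicit dependencies.
type PlanStep = {
  id: string;
  action: string;                 // tool or prompt template to run
  input: Record<string, unknown>;
  dependsOn: string[];            // ids of steps that must finish first
};

// Execute a plan as a DAG: run every step whose dependencies are satisfied,
// in parallel, until all steps complete or the plan is unsatisfiable.
async function executePlan(
  steps: PlanStep[],
  runStep: (step: PlanStep) => Promise<unknown>,
): Promise<Map<string, unknown>> {
  const results = new Map<string, unknown>();
  const pending = new Set(steps.map((s) => s.id));

  while (pending.size > 0) {
    const ready = steps.filter(
      (s) => pending.has(s.id) && s.dependsOn.every((d) => results.has(d)),
    );
    if (ready.length === 0) throw new Error("Cyclic or unsatisfiable plan");

    // All ready steps run concurrently: this is where the speedup comes from.
    const settled = await Promise.all(
      ready.map(async (s) => [s.id, await runStep(s)] as const),
    );
    for (const [id, out] of settled) {
      results.set(id, out);
      pending.delete(id);
    }
  }
  return results;
}
```

Because the plan is plain data, you can serialise it, diff it against a replanned version, and replay it from any completed step.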

Multi-agent: the orchestrator is the product

Multi-agent systems look like the future in demos and eat your lunch in production. The orchestrator — the agent that decomposes the task and routes to workers — is the single most load-bearing component. If it hallucinates a subtask or misroutes to the wrong specialist, every downstream worker is doing well-engineered work on the wrong problem. Treat the orchestrator as its own product: stricter schema, narrower tool surface, more eval coverage than anything else in the system.

Before building a multi-agent system, try hard to solve the problem with one agent and better tools. The tax on debugging, evaluation, and state management goes up non-linearly with agent count. We've refactored four client systems from 'five agents' to 'one agent with five tools' and every one got faster, cheaper, and easier to operate.

State and memory — the part nobody wants to design

Most agent failures in production are state failures, not model failures. Teams default to 'stuff it all in context' or 'dump it in a vector store' and discover six months later that the agent behaves inconsistently because retrieval conflicts are silently corrupting its reasoning. The 2026 consensus is to stratify memory by type and retrieval pattern.

  • Working state — the current task, intermediate results, pending tool calls. Keep it in a fast key-value store (Redis, Postgres with a JSONB column). This is per-session and shouldn't touch a vector index.
  • Episodic memory — what the agent did and what happened. Store as structured JSON with a stable schema, not as embeddings of freeform prose. You'll want to query it later.
  • Semantic memory — facts about the user, the domain, long-lived preferences. Hybrid retrieval here: structured lookup first (user_id → preferences row), vector fallback for the genuinely unstructured bits.
  • Procedural memory — templates of successful traces. When a task resembles one the agent has completed before, retrieve the structure (not the data) and use it as a planning scaffold.
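The semantic-memory bullet above can be sketched as a two-stage lookup. The `StructuredStore` and `VectorStore` interfaces and the `recall` function are assumptions for illustration, not a particular vendor's API:

```typescript
// Hypothetical hybrid recall: structured lookup first, vector fallback.
type Preference = { userId: string; key: string; value: string };

interface StructuredStore {
  get(userId: string, key: string): Preference | undefined;
}
interface VectorStore {
  search(query: string, k: number): string[];
}

function recall(
  userId: string,
  key: string,
  query: string,
  db: StructuredStore,
  vectors: VectorStore,
): string[] {
  // Structured lookup wins whenever the fact has a stable shape...
  const hit = db.get(userId, key);
  if (hit) return [hit.value];
  // ...and the vector index only handles the genuinely unstructured remainder.
  return vectors.search(query, 3);
}
```

The ordering matters: a row lookup is deterministic and cheap, so the vector index never gets a chance to return a near-miss for a fact you already store exactly.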

For multi-agent systems, one source of truth with per-agent read caches handles roughly 80% of the state-management need. Event-source every mutation — it costs almost nothing and gives you audit trails, replay capability, and a recovery mechanism the first time an agent dies mid-workflow.
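Event-sourcing every mutation is less work than it sounds. A minimal in-memory sketch, with hypothetical event types for illustration (a production log would live in Postgres or a stream, not an array):

```typescript
// Minimal event-sourced state: every mutation is an appended event, and
// current state is a fold over the log. Replay and audit come for free.
type StateEvent =
  | { type: "task_started"; taskId: string; at: number }
  | { type: "step_completed"; taskId: string; step: string; at: number };

class EventLog {
  private events: StateEvent[] = [];

  append(e: StateEvent): void {
    this.events.push(e);
  }

  // Rebuild the working state of one task by replaying its events,
  // e.g. to recover after an agent dies mid-workflow.
  replay(taskId: string): { started: boolean; completedSteps: string[] } {
    const state = { started: false, completedSteps: [] as string[] };
    for (const e of this.events) {
      if (e.taskId !== taskId) continue;
      if (e.type === "task_started") state.started = true;
      if (e.type === "step_completed") state.completedSteps.push(e.step);
    }
    return state;
  }
}
```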

Observability — non-negotiable

Agents are non-deterministic by construction. The same input can produce different tool sequences, different retrieved documents, and different final answers on different runs. Without tracing, you will not be able to debug them. Industry surveys in early 2026 put detailed tracing adoption at around 62% of teams running agents in production — which is still lower than it should be.

  • Trace every tool call, every model turn, every retrieval. Keep the full span tree; don't aggregate early.
  • Tail-based sampling, not head-based. Keep every failed, expensive, or anomalous trace in full; sample the happy path aggressively.
  • Log the inputs, the outputs, and the model's intermediate reasoning separately. You will need all three when something goes sideways.
  • Build eval suites from real traces. The highest-signal test set is one you harvest from production, not one you wrote on day one.
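The tail-based sampling rule above fits in a few lines. A sketch of the decision function, with illustrative thresholds (the `TraceSummary` shape and the dollar/latency cutoffs are assumptions, not a standard):

```typescript
// Tail-based sampling: decide AFTER the trace is complete, keeping every
// failed, expensive, or anomalously slow trace and sampling the rest.
type TraceSummary = { ok: boolean; costUsd: number; durationMs: number };

function keepTrace(t: TraceSummary, sampleRate = 0.05): boolean {
  if (!t.ok) return true;                 // keep all failures in full
  if (t.costUsd > 1.0) return true;       // keep expensive traces
  if (t.durationMs > 30_000) return true; // keep anomalously slow traces
  return Math.random() < sampleRate;      // sample the happy path
}
```

Head-based sampling can't do this, because the keep/drop decision happens before you know whether the trace will be one of the interesting ones.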

Failure modes we now design against

A short list of things that went wrong in 2024–2025 and are now table-stakes to prevent:

  1. Tool thrashing — the model calls the same tool with minor argument variations until it times out. Detect repeats, escalate, abort.
  2. Context rot — agents in long-running sessions accumulate stale context and start making decisions on outdated state. Compress or reset at boundaries.
  3. Silent orchestrator drift — the router quietly starts mis-routing a small percentage of tasks to the wrong worker. Eval the orchestrator separately from the workers.
  4. Cost blow-ups — a single edge case triggers a thousand-turn loop at full-tier pricing. Hard budget caps per session, alerted on.
  5. Replay unavailability — incident hits, team tries to reproduce, realises they didn't persist enough trace data to replay. Persist everything; compress later.
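The hard budget cap from point 4 is the cheapest of these defences to build. A minimal sketch, with `BudgetGuard` and its alert hook as illustrative names; a real version would persist spend across process restarts:

```typescript
// Hypothetical per-session budget guard: a hard cap on spend, checked after
// every model call is metered, with an alert hook fired when the cap is hit.
class BudgetGuard {
  private spentUsd = 0;

  constructor(
    private capUsd: number,
    private onBreach: (spentUsd: number) => void,
  ) {}

  // Call once per model turn with that turn's metered cost.
  charge(usd: number): void {
    this.spentUsd += usd;
    if (this.spentUsd >= this.capUsd) {
      this.onBreach(this.spentUsd); // page ops before aborting the session
      throw new Error(`Session budget exceeded: $${this.spentUsd.toFixed(2)}`);
    }
  }
}
```

Wire `charge` into the agent loop right after each model response, so a thousand-turn edge case dies at the cap instead of at the invoice.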

What we ship on a greenfield project today

For a new agent project starting today, our default is: a single ReAct loop against Claude Sonnet 4.6 with a narrow, well-typed tool surface; a planner/executor layer added only once the task shape is predictable and the scale demands parallelism; Postgres for working state, a structured episodic log, and a separate vector index for semantic memory if the workload calls for it; LangSmith or equivalent for traces from day one, with tail-based sampling. No multi-agent orchestration until a single agent can reliably close at least 80% of tasks end-to-end. The simpler system ships faster, costs less to operate, and tells you what the harder system should actually look like when you eventually need it.

The best signal that you're ready for a multi-agent system isn't that your single agent is hitting a ceiling — it's that you have a clear, well-separated taxonomy of sub-tasks, independent evals for each, and a routing layer you trust. If you don't have those three things yet, one agent with better tools will outperform five agents with a router on most metrics that matter.

Key takeaways

  • Most production agents are ReAct loops, planner/executors, or hub-and-spoke orchestrator/worker systems. Everything else is either academic or a composition of these three.
  • Guardrails are the difference between a demo and a system. Max turns, tool timeouts, repeat detection, budget caps — non-negotiable before shipping.
  • Planner/executor beats ReAct when the task shape is predictable; multi-agent beats both when there's a clear taxonomy of sub-specialties. Don't jump a tier early.
  • Memory is a design problem, not a vendor choice. Stratify by type; query structured state structurally; use vectors only where unstructured retrieval genuinely helps.
  • Observability and evals on production traces are what separate teams that improve their agents from teams that keep shipping the same bugs.
#ai-agents #agent-architecture #langgraph #tool-use #observability #production