The Problem We Kept Seeing
Here is a pattern we watched play out three times before we stopped repeating it. A team picks a business problem. They build a prototype: one API call to an LLM, maybe a retrieval layer, wrapped in a chat UI. The demo lands. Leadership funds a pilot. Then reality shows up.
The LLM returns different formats on different runs. The context window fills up mid-conversation and reasoning degrades. A tool call silently fails and the model confabulates a response. Nobody notices for two days because there is no tracing. The team spends 80% of their time firefighting issues that are not about the model at all. They are about everything around the model.
This is where teams fool themselves: they think the model is the system. The model is one component. The architecture is the system.
What AI-Native Actually Requires
AI-native does not mean “rewrite everything.” It means the architecture accounts for the specific properties of LLM-based components: they are non-deterministic, expensive per call, variable in latency, and capable of plausible-sounding failure. Five layers need to be explicit.
1. Agent Orchestration
We model every agent workflow as a directed graph with typed state. Not because we love abstractions, but because we need to test individual nodes, inspect state transitions, and add human checkpoints without rewriting the chain. We chose LangGraph for this. Each node is a function, each edge is a conditional, and the state schema is defined upfront. When an agent goes wrong, we can tell you exactly which node, with what inputs, produced what output.
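To make this concrete, here is a minimal sketch of the pattern using LangGraph's StateGraph. The state schema, node names, and routing heuristic are illustrative, not our production graph:

```python
from typing import TypedDict

from langgraph.graph import StateGraph, END


class AgentState(TypedDict):
    """Typed state shared by every node; defined upfront so transitions are inspectable."""
    question: str
    documents: list[str]
    answer: str
    needs_retrieval: bool


def classify(state: AgentState) -> dict:
    # Decide whether this question needs retrieval (toy heuristic for the sketch).
    return {"needs_retrieval": "contract" in state["question"].lower()}


def retrieve(state: AgentState) -> dict:
    # In production this node would query the vector store.
    return {"documents": ["...retrieved passages..."]}


def answer(state: AgentState) -> dict:
    # Call the model with only the state it needs; returns a partial state update.
    return {"answer": f"answer grounded in {len(state['documents'])} documents"}


graph = StateGraph(AgentState)
graph.add_node("classify", classify)
graph.add_node("retrieve", retrieve)
graph.add_node("answer", answer)
graph.set_entry_point("classify")

# Each edge is a conditional over state, not over raw prompt text.
graph.add_conditional_edges(
    "classify",
    lambda s: "retrieve" if s["needs_retrieval"] else "answer",
    {"retrieve": "retrieve", "answer": "answer"},
)
graph.add_edge("retrieve", "answer")
graph.add_edge("answer", END)

app = graph.compile()
```

Because each node is a plain function over typed state, you can unit-test `classify` without the graph, the model, or the retrieval layer.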
What surprised us: the orchestration layer catches more bugs than the model layer. Bad state transitions cause more production incidents than bad model outputs.
2. Memory Architecture
Early on, we burned a week debugging degraded responses. It turned out we were stuffing the entire conversation history into the context window. By turn 30, the model was reasoning over a wall of text in which important instructions from turn 2 were buried under noise.
Our fix was three-tier memory:
| Tier | What | Latency | Cost |
| --- | --- | --- | --- |
| Working | Current context window | 0ms | High (every token billed) |
| Session | Summarized history, key facts extracted per turn | 5-50ms | Medium (summarization call) |
| Long-term | Vector store + knowledge graph | 50-300ms | Low per query |
The working memory should contain only what the model needs right now. Everything else gets summarized or retrieved on demand. This is not optimization. It is the difference between a system that works at turn 5 and one that works at turn 50.
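As a sketch of how the tiers compose at request time (the tokenizer and the `retrieve` callback are hypothetical stand-ins for the real tokenizer and the vector-store query):

```python
MAX_WORKING_TOKENS = 4_000  # illustrative budget, not our production number


def count_tokens(text: str) -> int:
    # Stand-in tokenizer; use the model's real tokenizer in practice.
    return len(text.split())


def build_context(system_prompt: str, session_summary: str,
                  recent_turns: list[str], query: str, retrieve) -> str:
    """Assemble the working tier from the other two.

    Working: current instructions, the session summary, and the newest turns.
    Session: `session_summary` is maintained by a per-turn summarization call.
    Long-term: `retrieve` queries the vector store / knowledge graph on demand.
    """
    parts = [system_prompt, f"Conversation so far: {session_summary}"]
    parts += recent_turns[-4:]  # only the newest turns stay verbatim
    for fact in retrieve(query, k=3):
        parts.append(f"Relevant fact: {fact}")

    context = "\n\n".join(parts + [query])
    # Over budget? Drop verbatim turns first; never the instructions or summary.
    while count_tokens(context) > MAX_WORKING_TOKENS and len(parts) > 2:
        parts.pop(2)
        context = "\n\n".join(parts + [query])
    return context
```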
3. Tool Governance
Agents call tools. Tools have side effects. Without governance, an agent can send an email, execute a query, or update a record with no approval, no audit log, and no cost boundary. We classify every tool by risk: low-risk tools (read-only queries, summarization) auto-approve. Medium-risk tools (creating drafts, running analysis) log and notify. High-risk tools (sending communications, modifying data) require explicit human approval before execution.
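The classification can live in a single gate that every tool call passes through. A sketch, with the tool names and the `audit_log`/`notify`/`request_approval` callbacks as placeholders:

```python
from enum import Enum


class Risk(Enum):
    LOW = "low"        # read-only queries, summarization: auto-approve
    MEDIUM = "medium"  # creating drafts, running analysis: log and notify
    HIGH = "high"      # sending communications, modifying data: human approval


TOOL_RISK = {
    "search_documents": Risk.LOW,
    "draft_summary": Risk.MEDIUM,
    "send_email": Risk.HIGH,
}


def execute_tool(name: str, params: dict, tools: dict,
                 audit_log, notify, request_approval):
    """Every tool call goes through this gate; nothing executes unclassified."""
    risk = TOOL_RISK[name]  # unknown tools raise KeyError instead of running
    audit_log(name, params, risk)  # every call is logged, whatever the tier
    if risk is Risk.MEDIUM:
        notify(name, params)
    if risk is Risk.HIGH and not request_approval(name, params):
        raise PermissionError(f"{name} requires human approval before execution")
    return tools[name](**params)
```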
4. Observability
Every LLM call gets a trace: full prompt, full response, latency, token count, cost, model version. Every retrieval gets a trace: query, results, relevance scores. Every tool call gets a trace: name, parameters, result, execution time. This is non-negotiable. We learned this the hard way when we spent three days debugging a production issue that would have taken 15 minutes with proper traces.
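In practice this can be a wrapper that no raw model call is allowed to bypass. A sketch, assuming a hypothetical `client.complete` interface that reports token usage:

```python
import time
import uuid
from dataclasses import dataclass


@dataclass
class LLMTrace:
    """One record per LLM call: everything needed to replay the request."""
    trace_id: str
    prompt: str
    model_version: str
    response: str = ""
    latency_ms: float = 0.0
    prompt_tokens: int = 0
    completion_tokens: int = 0
    cost_usd: float = 0.0


def traced_completion(client, model: str, prompt: str,
                      price_per_1k_tokens: float, sink) -> str:
    """Wrap the completion call so every request emits a full trace."""
    trace = LLMTrace(trace_id=str(uuid.uuid4()), prompt=prompt, model_version=model)
    start = time.monotonic()
    response = client.complete(model=model, prompt=prompt)  # hypothetical client
    trace.latency_ms = (time.monotonic() - start) * 1000
    trace.response = response.text
    trace.prompt_tokens = response.usage.prompt_tokens
    trace.completion_tokens = response.usage.completion_tokens
    trace.cost_usd = (trace.prompt_tokens + trace.completion_tokens) / 1000 * price_per_1k_tokens
    sink(trace)  # ship to whatever tracing backend you run
    return trace.response
```

Retrieval and tool traces follow the same shape: one structured record per event, written before anything else is done with the result.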
5. Cost Controls
An agent loop with no cost ceiling can burn through $50 in minutes. We set hard limits: max tokens per request, max cost per agent execution, max cost per user per day. These are not about being cheap. They are circuit breakers that prevent the system from entering uncontrolled states.
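A circuit breaker for this is small enough that there is no excuse to skip it. A sketch with illustrative limits, not our production numbers:

```python
class CostCeilingExceeded(RuntimeError):
    """Raised when a request would push spend past a hard ceiling."""


class CostBreaker:
    def __init__(self, max_tokens_per_request: int = 8_000,
                 max_cost_per_execution: float = 1.00,
                 max_cost_per_user_day: float = 10.00):
        self.max_tokens_per_request = max_tokens_per_request
        self.max_cost_per_execution = max_cost_per_execution
        self.max_cost_per_user_day = max_cost_per_user_day
        self.execution_cost = 0.0
        self.user_day_cost: dict[str, float] = {}

    def check(self, user_id: str, tokens: int, est_cost: float) -> None:
        """Call before every LLM request; raises instead of letting the loop run on."""
        if tokens > self.max_tokens_per_request:
            raise CostCeilingExceeded("per-request token limit exceeded")
        if self.execution_cost + est_cost > self.max_cost_per_execution:
            raise CostCeilingExceeded("agent execution hit its cost ceiling")
        if self.user_day_cost.get(user_id, 0.0) + est_cost > self.max_cost_per_user_day:
            raise CostCeilingExceeded("user hit the daily cost ceiling")

    def record(self, user_id: str, cost: float) -> None:
        """Call after every LLM response with the actual spend."""
        self.execution_cost += cost
        self.user_day_cost[user_id] = self.user_day_cost.get(user_id, 0.0) + cost
```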
System Flow: How the Layers Interact
A request enters at the orchestration layer and walks the state graph. Each node assembles its working context from the memory tiers, pulling summarized history and retrieved knowledge only as needed. Any tool call passes through the governance gate before it executes. Every model call, retrieval, and tool invocation emits a trace, and the cost circuit breaker sits across the whole loop, halting execution the moment a ceiling is hit.
Where It Breaks: Two Failures That Changed Our Approach
Failure 1: The invisible context overflow. A contract analysis agent worked perfectly for simple documents. On complex 40-page master agreements, it silently degraded. The model was still producing well-formatted answers, but the reasoning was wrong. It was pulling from the wrong sections because the early context (containing critical definitions) got pushed out by later content. We had no monitoring on context utilization. By the time a user reported it, it had affected three weeks of outputs. The fix was not just better chunking. It was adding context utilization tracking as a standard metric and establishing hard limits with fallback to retrieval when context pressure exceeded 70%.
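The fallback itself is a few lines once utilization is measured. A sketch, where `count_tokens`, `emit_metric`, and `summarize_and_retrieve` stand in for the real tokenizer, metrics client, and fallback path:

```python
def emit_metric(name: str, value: float) -> None:
    # Stand-in for the real metrics client.
    print(f"metric {name}={value:.2f}")


def assemble_prompt(turns: list[str], window_tokens: int,
                    count_tokens, summarize_and_retrieve) -> str:
    """Track context utilization on every request; fall back above 70% pressure."""
    used = sum(count_tokens(t) for t in turns)
    pressure = used / window_tokens
    emit_metric("context_utilization", pressure)
    if pressure > 0.70:
        # Hard limit: stop appending raw history; summarize and retrieve instead,
        # so early instructions are never silently pushed out of the window.
        return summarize_and_retrieve(turns)
    return "\n\n".join(turns)
```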
Failure 2: The tool call nobody audited. An agent designed to draft contract summaries started invoking a downstream API to pre-populate a form. This was technically within its tool permissions, but nobody had considered that the API would create draft records in a production system. We discovered 200+ orphaned draft records before we caught it. The fix was tool classification by side effect: read-only, draft/staging, and production-write. Each class gets a different approval threshold.
Decisions We Made and Why
| Decision | What We Chose | What We Gave Up |
| --- | --- | --- |
| Orchestration | LangGraph explicit state graphs | A simpler chain-based approach. More upfront work, but every node is independently testable. |
| Memory | Three-tier with turn-by-turn summarization | Simplicity. We now maintain a summarization prompt that itself needs testing and tuning. |
| Governance | Built into infrastructure from sprint 1 | Velocity in weeks 1-2. But retrofitting governance at month 3 would have cost us a month. |
| Model routing | Small models for classification/routing, large models for reasoning (see the sketch below) | Consistency (two models can disagree). But cost dropped 70% and latency dropped 40%. |
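The routing decision in the last row fits in a few lines. A sketch, where `client.complete` is a hypothetical text-in/text-out call and the model identifiers are placeholders:

```python
SMALL_MODEL = "small-classifier"  # placeholder identifiers, not real model names
LARGE_MODEL = "large-reasoner"


def route_and_answer(client, query: str) -> str:
    """Classify with the small model; pay for the large model only when reasoning is needed."""
    label = client.complete(
        model=SMALL_MODEL,
        prompt=f"Answer with exactly SIMPLE or COMPLEX. Query: {query}",
    ).strip().upper()
    target = LARGE_MODEL if label == "COMPLEX" else SMALL_MODEL
    return client.complete(model=target, prompt=query)
```

The disagreement risk noted in the table comes from exactly this structure: the small model's label decides which model the user actually hears from.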
Greenfield vs. Brownfield
Starting fresh: Define the agent workflow graph first. Build the eval harness second. Set up tracing third. Then write features. We always set up cost monitoring before the first LLM call goes to production. Always.
Existing system: Pick one high-value workflow. Build the AI path as a shadow system running in parallel. Compare outputs for two weeks before routing real traffic. Instrument everything. Migrate the workflow only after the shadow system demonstrably matches or exceeds the baseline.
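The shadow pattern reduces to: serve the baseline, run the AI path on the side, log the diff. A sketch, with `baseline_path`, `ai_path`, and `log_comparison` as hypothetical stand-ins:

```python
def handle_request(request, baseline_path, ai_path, log_comparison):
    """Users always get the baseline during the comparison window."""
    baseline_result = baseline_path(request)
    try:
        shadow_result = ai_path(request)
        log_comparison(request_id=request.id, baseline=baseline_result,
                       shadow=shadow_result,
                       match=(shadow_result == baseline_result))
    except Exception as exc:
        # A shadow failure is data, never a user-facing outage.
        log_comparison(request_id=request.id, baseline=baseline_result,
                       shadow=None, error=str(exc))
    return baseline_result
```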
Your Monday Morning Checklist
- Agent workflows are modeled as explicit state machines or DAGs, not ad-hoc prompt chains
- Memory is tiered: context window holds only what is needed now; history is summarized; knowledge is retrieved
- Every LLM call is traced with full prompt, response, latency, tokens, and cost
- Tools are classified by side-effect risk with matching approval thresholds
- There is a hard cost ceiling per agent execution that triggers a circuit breaker
- An evaluation suite runs before every deployment and blocks on regression
- You can reconstruct the full reasoning path of any production request from logs