According to LangChain's State of Agent Engineering report, 57% of respondents now have AI agents running in production. That number would have seemed ambitious two years ago. Today it just sounds like Tuesday.
But buried in the same data is a number that should get more attention: 32% of engineering teams cite output quality as their top barrier to production AI. Not cost. Not latency. Not the models themselves. The quality of what the agents actually produce when real users start throwing real inputs at them.
The Demo-to-Production Gap Is Real
There is a specific kind of pain that comes from watching an agent nail every eval you throw at it, shipping it, and then having it start hallucinating tool parameters on day three. It is not a model problem. The model is the same. What changed is everything around it.
Production inputs are messier than eval inputs. Context windows fill up in ways you did not anticipate. Tool calls chain together and surface edge cases that never appeared in your test suite. A user phrases something slightly differently than your prompt was tuned for, and the whole thing quietly goes sideways.
Quality in Production Means Something Specific
In production, quality failures tend to fall into four categories:
- Inconsistency across diverse inputs: the agent handles the clean case well but degrades on anything outside the distribution it was tuned on
- No graceful degradation when tools fail: instead of recovering, the agent loops, hallucinates a result, or returns something confidently wrong
- Hallucinated tool parameters: the agent invents arguments for a function call rather than acknowledging it doesn't have the information it needs
- Context loss on long-running tasks: multi-step tasks that work fine at step three start losing coherence by step eight as the context window fills
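Hallucinated tool parameters are the easiest of these to catch mechanically. Below is a minimal sketch, assuming each tool declares its argument names and types up front. `ToolSpec`, `validate_tool_call`, and the `issue_refund` tool are hypothetical names used for illustration, not part of any framework's API.

```python
from dataclasses import dataclass, field


@dataclass
class ToolSpec:
    """Hypothetical tool description: argument names mapped to expected Python types."""
    name: str
    required: dict[str, type]
    optional: dict[str, type] = field(default_factory=dict)


def validate_tool_call(spec: ToolSpec, args: dict) -> list[str]:
    """Return a list of problems with the proposed arguments; empty means acceptable."""
    problems = []
    allowed = {**spec.required, **spec.optional}
    # Arguments the model left out.
    for key in spec.required:
        if key not in args:
            problems.append(f"missing required argument: {key}")
    # Arguments the model invented, or supplied with the wrong type.
    for key, value in args.items():
        if key not in allowed:
            problems.append(f"unknown argument: {key}")
        elif not isinstance(value, allowed[key]):
            problems.append(f"wrong type for {key}: expected {allowed[key].__name__}")
    return problems


# Hypothetical tool, and a model-proposed call with a bad type and an invented field.
refund = ToolSpec(name="issue_refund", required={"order_id": str, "amount_cents": int})
proposed = {"order_id": "A-1001", "amount_cents": "twenty", "reason_code": 7}
print(validate_tool_call(refund, proposed))
```

A non-empty result can be fed back to the model as a correction prompt or escalated, instead of letting the malformed call reach the tool.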
Rate Limits Are Also a Production Reality
Datadog's State of AI Engineering found that in February 2026, roughly 5% of all LLM call spans reported errors, and 60% of those errors were rate limits. That works out to roughly 3% of all LLM calls failing for reasons that have nothing to do with prompt quality. These are infrastructure problems dressed up as AI problems.
Production readiness for agents is not just about getting the outputs right. It is about building systems that degrade gracefully when the infrastructure underneath them hiccups. Retry logic, fallback routing, and rate limit awareness are table stakes.
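As a rough illustration of retry logic combined with fallback routing, here is a minimal sketch. It assumes each provider is a plain callable and that rate limits surface as a `RateLimitError`. Both the exception class and the `call_with_retry` helper are stand-ins invented for this example, so map them onto whatever your client library actually raises.

```python
import random
import time


class RateLimitError(Exception):
    """Stand-in for whatever exception your provider client raises on a rate limit."""


def call_with_retry(providers, prompt, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Try each provider in order, backing off exponentially on rate limits."""
    last_error = None
    for provider in providers:
        for attempt in range(max_attempts):
            try:
                return provider(prompt)
            except RateLimitError as err:
                last_error = err
                if attempt < max_attempts - 1:
                    # Exponential backoff with jitter so retries from many workers do not align.
                    delay = base_delay * (2 ** attempt)
                    sleep(delay + random.uniform(0, delay / 2))
        # Still rate limited after max_attempts: fall through to the next provider.
    raise RuntimeError("all providers exhausted") from last_error


# Hypothetical providers: the primary is always rate limited, the fallback answers.
def primary(prompt):
    raise RateLimitError("rate limited by primary")


def fallback(prompt):
    return f"fallback answer to: {prompt}"


# sleep is stubbed out here so the example runs instantly.
print(call_with_retry([primary, fallback], "summarise this ticket", sleep=lambda s: None))
```

Only rate-limit errors are retried in this sketch. Other exceptions propagate immediately, which is usually what you want for malformed requests that will fail on every attempt.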
The Framework Moment
Framework adoption for agent development nearly doubled year-over-year — from around 9% of organisations in early 2025 to roughly 18% by early 2026. LangChain, LangGraph, Pydantic AI, and Vercel AI SDK are all gaining ground.
The Core Insight
The quality problem in production AI is not a model problem. It is a systems problem. Production quality is determined by how well your surrounding system handles the inputs the model wasn't trained on, the tools that don't behave as expected, and the edge cases your evals never surfaced.
What Teams Doing This Well Are Actually Doing
- Building evaluation sets from real production traffic, not synthetic examples — so evals reflect the actual distribution of inputs the agent will see
- Using structured outputs with validation at every tool call boundary — if the model can't produce a valid structured response, treat it as a failure and handle it explicitly
- Adding human-in-the-loop checkpoints for high-stakes decisions — as a circuit breaker while you build confidence in the agent's behaviour
- Shipping in shadow mode before full rollout — running the agent on real traffic, logging outputs, but not acting on them until quality is verified
- Treating context management as a first-class engineering concern — designing tasks so the information the agent needs is available when it needs it
- Building explicit failure modes into tool definitions — so when a tool call can't be completed, the agent returns a structured 'I cannot do this' rather than inventing an answer
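Of the practices above, shadow mode is the one teams most often ask how to wire up. A minimal sketch follows, assuming the existing production path and the candidate agent are both plain callables. `handle_request` and the handlers passed to it are hypothetical, and a real deployment would likely run the shadow call asynchronously so it adds no latency.

```python
import json
import logging

logger = logging.getLogger("shadow")


def handle_request(request, production_handler, shadow_agent, log=logger.info):
    """Serve the production response; run the candidate agent on the same input and only log it."""
    response = production_handler(request)
    try:
        shadow_output = shadow_agent(request)
        record = {
            "request": request,
            "production": response,
            "shadow": shadow_output,
            "match": shadow_output == response,
        }
    except Exception as err:
        # A shadow failure must never affect the user-facing path.
        record = {"request": request, "production": response, "shadow_error": repr(err)}
    # default=str keeps logging from failing on values json cannot serialise.
    log(json.dumps(record, default=str))
    return response


# Hypothetical handlers: the user still gets the production answer.
records = []
answer = handle_request(
    "refund order A-1001",
    production_handler=lambda r: "escalate to human",
    shadow_agent=lambda r: "issue refund",
    log=records.append,
)
print(answer)
```

The logged records double as raw material for an evaluation set built from real traffic, since each one pairs a production input with what the agent would have done.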
The technology is ready enough. The engineering practices are catching up. The gap is closeable — but only if you take it seriously as an engineering problem rather than a model problem.