Everyone Is Building AI Agents. Most Will Never Work in Production.

Kaushal Malhotra | May 5, 2026

AI Agents, Agentic AI, Production, LLM, Startups, Engineering, GenAI

The Agent Gold Rush

Open your LinkedIn feed on any given day in 2026 and you will see the same thing. Someone has built an AI agent. It books meetings, writes code, researches competitors, sends emails, files expense reports, and makes coffee — or so the demo suggests.

Agents are everywhere. Every AI lab is releasing one. Every startup is building one. Every investor is funding one.

And yet, if you talk to the engineers actually trying to deploy these systems into production — into real businesses with real data and real consequences — you hear a very different story.

Most agents do not work. Not in any meaningful, reliable, production-grade sense.

This is not a controversial opinion. Anthropic's own research and studies from Carnegie Mellon have found that AI agents make too many mistakes for businesses to rely on them in any process involving real stakes. The hype and the reality are separated by a gap so wide you could park a data center in it.

I have been building AI systems for years. I have seen this pattern before — with machine learning, with blockchain, with every technology that arrived with enough momentum to outrun honest evaluation. Agents are the latest chapter.

So let me give you the honest engineer's take. What agents are genuinely good for. Where they fall apart. And how to think about them if you are actually trying to ship something that works.

First — What an AI Agent Actually Is

The word agent has been stretched so far it barely means anything anymore. Let me be precise.

An AI agent is a system where a language model does not just respond to a single prompt — it takes a sequence of actions, makes decisions across multiple steps, calls external tools, and works toward a goal with some degree of autonomy. It plans. It executes. It observes the results of its actions and adjusts.

That is genuinely different from a chatbot or a single LLM API call. And that difference is what makes agents both powerful and dangerous to deploy carelessly.

The power: an agent can complete tasks that would require many manual steps, operating faster and more consistently than a human doing the same work.

The danger: every step in an agent's chain is a point of failure. Errors do not just happen — they compound. A wrong assumption at step two becomes a catastrophic action at step seven. And unlike a human, an agent has no common sense to catch itself before things go off the rails.

Where Agents Actually Work

Agents are not useless. In the right context, with the right constraints, they are genuinely transformative. Here is where I have seen them work reliably in production.

Narrow, well-defined workflows

The more constrained the task, the better agents perform. An agent that monitors a specific data source, detects anomalies against a defined threshold, and fires a specific alert — that works. An agent that browses the entire internet and decides what is important — that does not.

The lesson: shrink the action space. An agent with five possible actions is far more reliable than an agent with fifty. Narrow the scope until the agent cannot surprise you.
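One way to shrink the action space is an explicit allowlist: the model can only name an action, and anything not on the list is refused before it runs. This is a minimal sketch, not a definitive implementation. The handler names (`fetch_metric`, `send_alert`) and the `dispatch` function are hypothetical and exist only for illustration.

```python
from typing import Callable, Dict

# Hypothetical action handlers; a real system would call actual services here.
def fetch_metric(name: str) -> str:
    return f"metric:{name}"

def send_alert(message: str) -> str:
    return f"alert:{message}"

# The complete action space. If it is not in this dict, the agent cannot do it.
ALLOWED_ACTIONS: Dict[str, Callable[[str], str]] = {
    "fetch_metric": fetch_metric,
    "send_alert": send_alert,
}

def dispatch(action: str, argument: str) -> str:
    """Run an action only if it is on the allowlist; refuse anything else."""
    handler = ALLOWED_ACTIONS.get(action)
    if handler is None:
        raise ValueError(f"action not permitted: {action!r}")
    return handler(argument)
```

With this shape, adding a capability is a deliberate code change rather than something the model can talk its way into.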

Low-stakes, reversible actions

Agents work best when their mistakes are cheap to catch and easy to reverse. Drafting an email for human review — great use of an agent. Actually sending the email autonomously — high risk. Generating a summary of a document — reliable. Making a financial transaction based on that summary — a different matter entirely.

The lesson: put humans in the loop at every high-stakes decision point. Autonomy should be earned incrementally, not assumed from day one.
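A human-in-the-loop gate can be as small as one check before execution. The sketch below assumes a hand-maintained set of high-stakes action kinds (`HIGH_STAKES`) and an `approve` callback standing in for whatever review step you use. Both are illustrative assumptions, not a prescribed design.

```python
from dataclasses import dataclass
from typing import Callable

# Assumed classification of which action kinds need a human sign-off.
HIGH_STAKES = {"send_email", "make_payment"}

@dataclass
class ProposedAction:
    kind: str
    payload: str

def execute(action: ProposedAction,
            approve: Callable[[ProposedAction], bool]) -> str:
    """Run low-stakes actions directly; hold high-stakes ones until approved."""
    if action.kind in HIGH_STAKES and not approve(action):
        return "held_for_review"
    # Placeholder for the real side effect.
    return f"executed:{action.kind}"
```

Drafting an email passes straight through. Sending one only happens after `approve` returns True.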

Internal tools with clean data

Agents perform significantly better when they operate on structured, clean, predictable data inside controlled environments. An internal agent that queries your company database, formats a report, and posts it to Slack — that is a solved problem. An agent that scrapes the open web, reconciles contradictory sources, and produces a reliable answer — that is still largely unsolved at the reliability level businesses need.

The lesson: the messier the environment, the less you should trust autonomous action. Match the agent's autonomy to the cleanliness of its operating context.

Where Agents Consistently Fail in Production

Now for the part that the demos never show you.

Error compounding

In a demo, the agent takes five steps and completes the task perfectly. In production, real inputs are messier. The agent misinterprets something at step two. It carries that misinterpretation forward. By step six it has taken an action based on a fundamentally wrong assumption — and nothing in the system caught it because each individual step looked plausible.

This is the most insidious failure mode in agentic systems. It is not a crash. It is a quiet drift toward the wrong outcome that only becomes visible when the damage is already done.
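A rough back-of-the-envelope model shows why compounding hurts. Assume, purely for illustration, that each step succeeds independently with the same probability. Real agents are not this tidy, and the 95 percent figure below is a made-up number, but the shape of the curve is the point.

```python
def chain_success_rate(per_step_success: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step_success ** steps

if __name__ == "__main__":
    # Hypothetical per-step reliability of 95 percent.
    for steps in (1, 5, 10, 20):
        print(steps, round(chain_success_rate(0.95, steps), 3))
```

Under these assumptions, a step that is right 95 percent of the time gives a task that finishes correctly about 77 percent of the time at five steps, about 60 percent at ten, and about 36 percent at twenty.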

Prompt injection and adversarial inputs

Agents that interact with the outside world — reading emails, browsing websites, processing documents from untrusted sources — are vulnerable to prompt injection. A malicious actor embeds instructions in a webpage the agent visits. The agent follows them. Your system has been hijacked without a single line of your code being touched.

This is not a theoretical risk. It is an active, documented attack vector. And most agent implementations in the wild have no meaningful defence against it.
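One partial mitigation is to track whether the agent has read untrusted content and, once it has, refuse side-effecting actions for the rest of the session unless a human steps in. This sketch does not detect injection and is not a complete defence. It only limits what a hijacked agent can do. The class name, the action names, and the `trusted` flag are all assumptions for illustration.

```python
class AgentSession:
    # Assumed set of actions that change the outside world.
    SIDE_EFFECT_ACTIONS = {"send_email", "write_file", "make_payment"}

    def __init__(self) -> None:
        self.tainted = False

    def ingest(self, content: str, trusted: bool) -> str:
        """Record that untrusted content has entered the context."""
        if not trusted:
            self.tainted = True
        return content

    def may_run(self, action: str) -> bool:
        """Block side effects once the session has seen untrusted input."""
        return not (self.tainted and action in self.SIDE_EFFECT_ACTIONS)
```

Read-only actions such as summarizing still work after the session is tainted. Anything that sends, writes, or pays does not.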

Non-determinism at scale

A language model is not a deterministic system. Its output is sampled, so giving it the same input twice may produce meaningfully different outputs. In a single-turn chatbot, that is usually fine. In a multi-step agent, that variability compounds across every step. What worked reliably in testing starts producing surprising outputs in production, not because anything changed in your code, but because the model itself is probabilistic.

Testing agents is harder than testing traditional software for exactly this reason. You cannot simply write a test suite and call the system verified. You need evaluation frameworks, output monitoring, and human review pipelines — infrastructure most teams have not built.
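Because a single passing run proves little, a basic evaluation harness runs the same task many times and reports a pass rate. The sketch below is a minimal version of that idea. `make_flaky_agent` is a hypothetical stand-in for a real agent, using a seeded random generator so the example is repeatable.

```python
import random
from typing import Callable

def pass_rate(run_agent: Callable[[str], str],
              check: Callable[[str], bool],
              task: str,
              trials: int) -> float:
    """Run the same task repeatedly and return the fraction of passing runs."""
    passes = sum(1 for _ in range(trials) if check(run_agent(task)))
    return passes / trials

def make_flaky_agent(seed: int, failure_rate: float) -> Callable[[str], str]:
    """Hypothetical stand-in for a non-deterministic agent."""
    rng = random.Random(seed)
    def agent(task: str) -> str:
        return "WRONG" if rng.random() < failure_rate else f"done:{task}"
    return agent
```

In practice you would track this number per task type over time and treat a drop as a regression, the same way you would treat a failing test.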

No graceful degradation

Traditional software fails loudly. An exception is thrown. An error code is returned. A log entry is written. You know something went wrong.

Agents fail quietly. They produce a confident, well-structured, completely wrong output. They complete the task — just not the right task. They take an action — just not the intended one. Building systems that detect this kind of failure before it causes real harm requires a level of observability most agent implementations simply do not have.
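One way to make quiet failures loud is to run deterministic checks on the agent's output against facts you already know, and raise when they disagree. This is a sketch under assumptions: the report fields (`row_count`, `total`, `summary`), the `amount` column, and the tolerance are all hypothetical.

```python
from typing import Any, Dict, List

class OutputCheckFailed(Exception):
    """Raised when an agent result fails a deterministic sanity check."""

def check_report(report: Dict[str, Any],
                 source_rows: List[Dict[str, float]]) -> Dict[str, Any]:
    """Compare the agent's report against the source data it was built from."""
    missing = {"row_count", "total", "summary"} - set(report)
    if missing:
        raise OutputCheckFailed(f"missing fields: {sorted(missing)}")
    if report["row_count"] != len(source_rows):
        raise OutputCheckFailed("row count does not match source")
    expected_total = sum(row["amount"] for row in source_rows)
    if abs(report["total"] - expected_total) > 0.01:
        raise OutputCheckFailed("total does not match source")
    if not str(report["summary"]).strip():
        raise OutputCheckFailed("summary is empty")
    return report
```

Checks like these will not catch every wrong answer, but they turn a class of confident mistakes into ordinary exceptions that your existing alerting can see.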

The Right Way to Think About Agents in 2026

Here is the framework I use when a client asks about building an agentic system.

Start with the task, not the technology. What specific workflow are you trying to automate? What are the inputs, the steps, and the desired outputs? What happens when it goes wrong and who is affected? Answer those questions first. The architecture follows from the answers — not the other way around.

Use a chain before you use an agent. In most cases, a structured sequence of LLM calls with deterministic logic between them — what engineers call a chain — delivers 80 percent of the value of a full agent with 20 percent of the complexity and risk. Build the chain first. Introduce real autonomy only where the chain cannot go.
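To make the chain idea concrete, here is a minimal sketch: two LLM calls with plain Python deciding what happens between them. The `llm` parameter is any function that takes a prompt and returns text, so no particular vendor API is assumed. The ticket-triage task, the labels, and the prompts are hypothetical.

```python
from typing import Callable

LLM = Callable[[str], str]  # prompt in, text out; supply your own client

def triage_ticket(ticket: str, llm: LLM) -> str:
    # Step 1: one LLM call with a narrow job.
    prompt = f"Classify this ticket as 'billing' or 'technical':\n{ticket}"
    label = llm(prompt).strip().lower()

    # Deterministic logic between calls: the code decides the route, not the model.
    if label not in {"billing", "technical"}:
        return "escalate_to_human"

    # Step 2: a second LLM call, reached only on a known path.
    draft = llm(f"Draft a short reply to this {label} ticket:\n{ticket}")
    return f"[{label}] {draft}"
```

The model never chooses the next step. It fills in two well-defined blanks, and anything unexpected falls through to a human.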

Design for failure from day one. Assume your agent will make mistakes. Build catch mechanisms. Add human review at high-stakes steps. Log every action with enough context to reconstruct what happened and why. Treat observability as a core feature, not a nice-to-have.
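Logging every action with enough context to reconstruct it can start as one structured record per step. The sketch below appends JSON lines to an in-memory list. The field names are assumptions, and a real system would write to durable storage instead.

```python
import json
import time
from typing import Any, Dict, List

def log_action(sink: List[str], run_id: str, step: int, action: str,
               inputs: Dict[str, Any], output: Any,
               rationale: str) -> Dict[str, Any]:
    """Append one structured record describing what the agent did and why."""
    record = {
        "timestamp": time.time(),
        "run_id": run_id,
        "step": step,
        "action": action,
        "inputs": inputs,
        "output": output,
        "rationale": rationale,  # the model's stated reason, stored verbatim
    }
    # default=str keeps logging from failing on non-JSON-serializable values.
    sink.append(json.dumps(record, sort_keys=True, default=str))
    return record
```

Filtering the records by `run_id` and sorting by `step` is then enough to replay a run and find where it went wrong.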

Earn autonomy incrementally. Start with a human approving every action. Once you have confidence in a specific action type, automate it. Expand autonomy step by step as the system earns trust through real-world performance — not through demo performance.
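Earned autonomy can be tracked per action type. In this sketch, an action type becomes autonomous only after a streak of consecutive human approvals, and a single rejection resets the streak. The class name and the default threshold of 20 are arbitrary assumptions. Pick numbers that match your own risk tolerance.

```python
from collections import defaultdict
from typing import DefaultDict

class AutonomyLedger:
    def __init__(self, approvals_required: int = 20) -> None:
        self.approvals_required = approvals_required
        self.streak: DefaultDict[str, int] = defaultdict(int)

    def record_review(self, action_type: str, approved: bool) -> None:
        """Count consecutive approvals; any rejection resets the streak."""
        if approved:
            self.streak[action_type] += 1
        else:
            self.streak[action_type] = 0

    def is_autonomous(self, action_type: str) -> bool:
        """True once this action type has earned enough consecutive approvals."""
        return self.streak.get(action_type, 0) >= self.approvals_required
```

Trust is granted per action type, so an agent can file routine reports on its own while every payment still waits for a person.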

The Agents That Will Survive

The agent gold rush will produce a lot of wreckage. Systems that looked great in demos and collapsed in production. Startups that raised money on the strength of a prototype that never became a real product. Businesses that automated critical workflows and discovered the hard way that their agent was confidently wrong three percent of the time, which at scale is a very big number: at, say, 100,000 runs a month, that is 3,000 wrong outcomes.

The agents that will survive and create real value are the ones built by engineers who understand these failure modes and design for them from the start. Narrow scope. Clean data. Human oversight at critical junctures. Robust observability. Incremental autonomy earned through demonstrated reliability.

That is not as exciting as the demos. But it is what actually ships. And in this industry, shipping is everything.

If you are thinking about building an agentic system and want to do it in a way that actually works in production — that is exactly the kind of problem we solve at Will of Dawn Labs.

— Kaushal Malhotra
Founder, Will of Dawn Labs
willodawn.com/contact
