aiyan.io
// 2026-02-15 · #design

Don't Prompt Your Agent for Reliability — Engineer It

When your users are non-technical business analysts, your agent doesn’t get second chances. It can’t hallucinate a table schema and ask the user to verify; it can’t silently drop half a request — it has to work.

I’ve spent the past year at work building a data engineering agent for exactly this audience, and I’ve rearchitected the entire thing three times. Each rewrite was due to real failures and limitations, and each one made the system dramatically simpler.

This post walks through the full evolution.

Act 1: The Rigid State Machine

My biggest concern early on was LLM non-determinism. Enterprise software demands reliability, security, and predictability. An agent that sometimes inspected your data connections, other times hallucinated an entire table schema you never had, and only occasionally built a grounded data flow wasn’t going to cut it.

So I built a rigid state machine to put the multi-agent system on rails. An intent classifier extracted what the user wanted: NEW_REQUEST, APPROVAL, QUESTION, UNKNOWN. Then a state machine router used that intent plus workspace state to dispatch to the right sub-agent. And each sub-agent (schema generation, data source discovery, data flow creation) had its own prompt and handled its slice of the workflow.
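Here’s a rough sketch of what that routing looked like. The intent labels are the real ones, but the toy classifier, the workspace dict, and the stub sub-agents below are illustrative stand-ins for the actual LLM calls:

Python
from enum import Enum, auto

class Intent(Enum):
    NEW_REQUEST = auto()
    APPROVAL = auto()
    QUESTION = auto()
    UNKNOWN = auto()

# Stub sub-agents; in the real system each one is its own prompted LLM.
def schema_agent(msg: str) -> str: return f"[schema generation] {msg}"
def source_agent(msg: str) -> str: return f"[data source discovery] {msg}"
def flow_agent(msg: str) -> str:   return f"[data flow creation] {msg}"

def classify_intent(msg: str) -> Intent:
    """Stand-in for the intent-classifier LLM call."""
    if msg.strip().endswith("?"):
        return Intent.QUESTION
    if msg.lower().startswith(("yes", "approve", "looks good")):
        return Intent.APPROVAL
    return Intent.NEW_REQUEST if msg.strip() else Intent.UNKNOWN

def route(msg: str, workspace: dict) -> str:
    """Dispatch on intent plus workspace state: the fixed rails."""
    intent = classify_intent(msg)
    if intent is Intent.NEW_REQUEST:
        if workspace.get("schema") is None:
            return schema_agent(msg)
        return source_agent(msg)
    if intent is Intent.APPROVAL and workspace.get("schema") is not None:
        return flow_agent(msg)
    if intent is Intent.QUESTION:
        return source_agent(msg)  # Q&A mode bolted onto a sub-agent's prompt
    return "I couldn't understand that request."  # the canned UNKNOWN reply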

This was like turning a helicopter that could fly anywhere into a train with fixed tracks, rail signals, and precise boarding times. For clean, happy-path scenarios, this worked well.

The rigid state machine architecture

However, the cracks soon showed. How does this system handle a message that does multiple things at once?

Please get rid of the order ID column and instead use the payment ID from 2026_mkt_transactions table. Also could you explain to me the formula you’re using to calculate revenue?

Two edit requests and a question in one message. You could classify into a list of intents, but processing one might invalidate the state for the next.

And the problems compounded. UNKNOWN intent always triggered the same canned “I couldn’t understand…” response. Handling questions meant bolting a Q&A mode onto every sub-agent’s prompt. The decision tree became a maintenance nightmare — nested conditionals everywhere, and each agent had its own voice that I was desperately trying to keep consistent.

Trying to impose deterministic routing on an extraordinarily flexible language model was the wrong approach.

Act 2: The Orchestrator

Soon, something clicked. Real-world agent usage is messy: users are ambiguous, unsure of direction, or implying multiple side effects in a single message. A large language model was trained precisely to dissect this kind of noise.

Instead of fighting non-determinism, why not embrace it?

The core problem with intent classification is that it’s a lossy compression of highly nuanced messages into structured labels. But the end goal was never identifying user intent; it was invoking the correct system processes. That’s exactly what LLM tool-calling solves.

I replaced the entire intent classifier and state machine router with a single orchestrator LLM and wrapped the sub-agents as tools. The control flow collapsed into a simple loop: at each turn, the orchestrator either responds in prose or calls tools:

Python
for _ in range(MAX_TOOL_CALLS):
    result = llm.call()
    if not result.tool_calls:        # plain prose reply: show it and stop
        print(result.content)
        break
    run_tools(result.tool_calls)     # otherwise execute the requested tools and feed
                                     # their results back to the LLM (dispatch elided)

Tools could also reject calls if prerequisites weren’t met, which made sequencing easy to enforce.
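For instance, a tool can check shared state and refuse to run until earlier steps have produced what it needs. A minimal sketch, where the workspace dict and the schema-before-sources ordering are illustrative assumptions:

Python
workspace: dict = {"schema": None, "sources": []}  # stand-in for shared internal state

def tool_determine_sources(request: str) -> str:
    """Reject the call if its prerequisite (a schema) doesn't exist yet."""
    if workspace["schema"] is None:
        return "Rejected: no schema exists yet. Call tool_generate_schema first."
    workspace["sources"] = ["2026_mkt_transactions"]  # placeholder result
    return "Data sources selected: 2026_mkt_transactions."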

The orchestrator architecture

The improvements were substantial: hard-coded responses were gone, the data engineering agent spoke with a single voice, and the orchestrator became its sole representative to the user.

Best of all, multi-intent messages simply worked. The orchestrator called tools in sequence, so that same example from before now resolved into:

  1. tool_generate_schema("drop order ID, add payment ID")
  2. tool_determine_sources("2026_mkt_transactions")
  3. Final reply: “Done! I dropped order ID, added payment ID, and included 2026_mkt_transactions. The revenue formula is…”

One early challenge was that the orchestrator sometimes generated objects (like schemas) on its own instead of using the tools. I fixed this by hiding the output objects behind the tool boundary: the orchestrator never saw schemas or sources directly, only success/failure strings. With the objects thrown to the other side of the wall, the orchestrator was compelled to use the tools rather than spin up its own versions.

This trick was surprisingly effective because it didn’t ask the LLM to behave differently. It changed what the LLM could see, serving not as a behavioral fix but as an environmental one.
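In code, the trick looks roughly like this: the tool keeps its output in state the orchestrator never sees and returns only a short status string. The _hidden_state dict and placeholder schema below are illustrative, not the production implementation:

Python
_hidden_state: dict = {}  # objects live here, on the far side of the wall

def tool_generate_schema(request: str) -> str:
    """Build the schema internally; return only a status string to the orchestrator."""
    schema = {"columns": ["payment_id", "amount"]}  # placeholder for the sub-agent's output
    _hidden_state["schema"] = schema
    return f"Schema updated ({len(schema['columns'])} columns)."  # no schema JSON crosses the boundary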

But this was a blunt instrument. The orchestrator couldn’t fabricate data — good — but it also couldn’t explain it, inspect it, or reason about it.

The data source tool suffered the most. If a user wanted to change which sources were used, the orchestrator had to compress that request into a single string parameter. No matter how small the change, the tool re-ran its entire internal process — data fetching, LLM calls, scoring heuristics. After all that, the orchestrator just got back the final answer.

When a user asks, “Why were these sources chosen? What others were considered?”, the orchestrator literally doesn’t know.

This was turning into a game of telephone.

Act 3: The General-Purpose Agent

The core issue: a lack of direct control and transparency. We had to flatten the chain of command.

I replaced the orchestrator, schema LLM, and data source LLM with a single agent that acts instead of orchestrates. No more delegation to sub-agents for routine tasks. Instead, the agent gets small, fast, atomic tools — tool_list_data_sources and tool_inspect_data_source — and does the thinking itself.
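The atomic tools themselves are tiny. A sketch of what they might look like, with CATALOG standing in for the real metadata store:

Python
# CATALOG stands in for the real metadata store; contents are illustrative.
CATALOG = {
    "src_01": {"name": "2026_mkt_transactions", "columns": ["payment_id", "amount"]},
    "src_02": {"name": "orders", "columns": ["order_id", "customer_id"]},
}

def tool_list_data_sources() -> list[dict]:
    """Cheap and fast: just IDs and names, enough for the agent to browse."""
    return [{"id": sid, "name": meta["name"]} for sid, meta in CATALOG.items()]

def tool_inspect_data_source(source_id: str) -> dict:
    """Full metadata for one source; the agent does the reasoning itself."""
    return CATALOG.get(source_id, {"error": f"unknown source id: {source_id}"})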

The general-purpose agent architecture

Where did the schema tool go? We don’t need it. Modern models handle schema design and JSON generation reliably enough that isolating that work wasn’t worth the delegation cost.

The one exception is data flow generation — complex enough and self-contained enough to justify a sub-LLM. But the key difference is how the main agent interfaces with it:

Python
def tool_generate_data_flow(
    task_description: str,
    schema: Schema,
    data_source_ids: list[str]  # <- magic
): ...

The agent must supply the schema and a list of data source IDs. We cache metadata in internal state and expose short reference IDs to the agent.

The IDs act as “keys” to this step of the workflow, and the agent, being an optimizer, instinctively seeks to obtain those keys. The failure mode where the LLM hallucinates metadata is eliminated by design, not by prompting.

Here, the agent sees all the context it needs. It can inspect sources, browse metadata, reason freely. But it can’t invoke data flow generation without supplying concrete artifact IDs that only exist if it actually did the prerequisite work.

The constraint isn’t “you can’t see this” anymore. It’s “you can’t use this until you’ve earned the inputs”. The agent’s creativity is fully unconstrained — it can explore, retry, combine sources however it wants. But the dependency chain is non-negotiable.

You can’t unlock the door without the key, and you can’t get the key without finding it first.
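Putting the pieces together, here is a sketch of how the lock works; KNOWN_SOURCE_IDS stands in for the cached metadata registry, and the schema is typed as a plain dict for the sketch:

Python
KNOWN_SOURCE_IDS = {"src_01", "src_02"}  # minted as the agent lists and inspects sources

def tool_generate_data_flow(task_description: str,
                            schema: dict,
                            data_source_ids: list[str]) -> str:
    """The lock: only IDs produced by earlier inspection calls will open it."""
    unknown = [sid for sid in data_source_ids if sid not in KNOWN_SOURCE_IDS]
    if unknown:
        return f"Rejected: unknown source IDs {unknown}. Inspect real sources first."
    # hand off to the data flow sub-LLM with fully grounded inputs (elided)
    return "Data flow created from " + ", ".join(data_source_ids) + "."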

This architecture unlocked capabilities that were previously impossible. Because the agent hoards all context, it can answer “Explain to me what my data looks like” by inspecting sources directly and summarizing. It can respond to “What are the highest-value data flows I could build?” by synthesizing ideas from everything it’s discovered.

I didn’t build these features; they simply emerged from an agent that can see everything at once.

The independence of schema and sources also opened up new interaction patterns. A user saying “I need to replicate data to this destination table” triggers the agent to find sources matching the schema. But “Combine these two source tables into one aggregate view” works in the opposite direction — the agent produces a schema based on the sources. Previously, the schema agent might produce a schema requiring columns that the source agent later discovered didn’t exist. Now the agent sees everything at once, so it never generates a schema it can’t fulfill.

The codebase shrank considerably, I stopped dreading edge cases, and the agent finally felt like an intelligent assistant. Building data flows is its main capability, but far from its only one.

What Survived

Each rewrite followed the same pattern: simplify. A state machine with three specialized agents became an orchestrator with delegated tools, which became one agent with lightweight tools. Three LLMs collapsed into one, and complex routing gave way to simple tool calls.

If I could start over and take away just two lessons, they would be these.

1. There is a delegation tax

Every time you hand off to a sub-agent, you pay a cost in context loss, whether that loss comes from the narrow interface you communicate through or from the sub-agent’s scoped understanding of the overall workflow.

The default should be one agent doing more with simple tools. Delegate when complexity genuinely demands it and you deliberately want to limit context.

2. Make bad behavior structurally impossible

In Act 2, hiding objects behind tool boundaries stopped hallucination but killed transparency. In Act 3, artifact-based dependencies preserved full context while still making invalid sequences impossible.

The pattern: Don’t prompt the agent to follow the rules; build an environment where breaking them isn’t an option. Build tools that produce physical artifacts (a token, file, or capability handle) that subsequent tools require as input.

The agent can freely explore, retry, and improvise, but it cannot disobey the laws of physics that you write.