aiyan.io
// 2026-02-15 · #design

Don't Prompt Your Agent for Reliability — Engineer It

We’ve spent the past year building a data engineering agent. Our users are business analysts who don’t care about the internals of an LLM agent and just want to build correct and efficient data pipelines.

When users rely on the agent’s expertise in areas unfamiliar to them, correctness is critical — no hallucinations of tables that don’t exist and no silent forgetting of half a request.

Chasing that level of reliability, I’ve rearchitected the entire agent three times, with each overhaul stemming from real failures and limitations.

This post walks through the full evolution, and how subtle environment engineering can outperform pages of prompt engineering.

Act 1: The Rigid State Machine

Our biggest concern early on was the LLM non-determinism and correctness risk I raised above.

A software engineer faced with a chaotic system tries to bring order. So I built a rigid state machine to put the multi-agent system on rails.

An intent classifier extracted what the user wanted: NEW_REQUEST, APPROVAL, QUESTION, UNKNOWN. Then a state machine router used that intent plus the current workspace state to dispatch to the right sub-agent. And each sub-agent (schema generation, data source discovery, data flow creation) had its own prompt and handled its slice of the workflow.
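A minimal sketch of that routing layer, using the intent labels above (the dispatch conditions and sub-agent names here are illustrative stand-ins, not the production logic):

```python
from enum import Enum, auto

class Intent(Enum):
    NEW_REQUEST = auto()
    APPROVAL = auto()
    QUESTION = auto()
    UNKNOWN = auto()

def route(intent: Intent, workspace: dict) -> str:
    # Dispatch on the classified intent plus the current workspace state.
    if intent is Intent.NEW_REQUEST and "schema" not in workspace:
        return "schema_generation_agent"
    if intent is Intent.NEW_REQUEST:
        return "data_source_discovery_agent"
    if intent is Intent.APPROVAL and "sources" in workspace:
        return "data_flow_creation_agent"
    if intent is Intent.QUESTION:
        return "qa_mode"
    return "canned_fallback"  # UNKNOWN always got the same canned response
```

Every new capability meant another branch in this tree, which is exactly where the maintenance pain came from.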

This was like turning a helicopter that could fly anywhere into a train with fixed tracks, rail signals, and precise boarding times. (This was not necessarily a good thing.) But for clean, happy-path scenarios, this worked well.

The rigid state machine architecture

Of course, very soon the pains showed up. How does this system handle a message that does multiple things at once?

[User]: Please get rid of the order ID column and instead use the payment ID from 2026_mkt_transactions table. Also could you explain to me the formula you’re using to calculate revenue?

Two edit requests and a question in one message — ouch. You could classify into a list of intents, but processing one might invalidate the state for the next.

And it got worse from there. UNKNOWN intent always triggered the same canned “I couldn’t understand…” response. Handling questions meant bolting a Q&A mode onto every sub-agent’s prompt. The decision tree became a maintenance nightmare — nested conditionals everywhere, and each agent had its own voice that I was desperately trying to keep consistent.

Trying to impose deterministic routing on a language model whose adaptability is its strength was the wrong approach.

Act 2: The Orchestrator

Real world agent usage is messy: users are ambiguous, unsure of direction, or implying multiple side effects in a single message.

The core problem with intent classification was that it was a lossy compression of highly nuanced messages into structured labels.

But then I took a step back: did we actually need to distill abstract intent into buckets? The end-goal was never identifying user intent — it was invoking the correct system processes. That’s exactly what LLM tool-calling solves.

Embrace non-determinism instead of fighting it.

Thus, I replaced the entire intent classifier and state machine router with a single orchestrator LLM and wrapped the sub-agents as tools. The control flow collapsed into a simple loop: at each turn, the orchestrator either responded in prose or called tools:

Python

for _ in range(MAX_TOOL_CALLS):
    result = llm.call()
    if not result.tool_calls:
        # No tool calls: the orchestrator answered in prose
        print(result.content)
        break
    run_tools(result.tool_calls)  # execute the calls, append results to context

Tools could also reject calls if prerequisites weren’t met, which made sequencing easy to enforce.
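A prerequisite check can live inside the tool itself, so sequencing never depends on the router. A hypothetical sketch, where `WORKSPACE` stands in for whatever state store the real tools share:

```python
WORKSPACE: dict = {}  # shared state that all tools read and write

def tool_determine_sources(query: str) -> str:
    # Reject the call outright if the prerequisite step hasn't happened yet.
    if "schema" not in WORKSPACE:
        return "ERROR: no schema yet. Call tool_generate_schema first."
    WORKSPACE["sources"] = ["2026_mkt_transactions"]  # placeholder result
    return "OK: 1 candidate source selected."
```

The orchestrator sees the error string on its next turn and self-corrects by calling the missing tool first, with no state machine in sight.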

The orchestrator architecture

The improvements were substantial: hard-coded responses disappeared, the data engineering agent gained a single voice, and the orchestrator became the one coordinator.

Best of all, multi-intent messages simply worked: the orchestrator called tools in sequence, so the same example from before resolved into:

  1. tool_generate_schema("drop order ID, add payment ID")
  2. tool_determine_sources("2026_mkt_transactions")
  3. Final reply: “Done! I dropped order ID, added payment ID, and included 2026_mkt_transactions. The revenue formula is…”

One early challenge, though, was that the orchestrator sometimes generated objects (like schemas) on its own instead of using the tools. I fixed this by hiding the output objects behind the tool boundary — the orchestrator never saw schemas or sources directly, only success/failure strings. By throwing objects to the other side of the wall, the orchestrator was compelled to use the tools rather than spinning its own versions.

This trick was surprisingly effective because it didn’t ask the LLM to behave differently. It changed what the LLM could see, serving not as a behavioral fix but an environmental one.

But this was a blunt instrument. The orchestrator couldn’t fabricate data — good — but it also couldn’t explain it, inspect it, or reason about it.

The data source tool suffered from this the most. If a user wanted to change which sources were used, the orchestrator had to compress that request into a single string parameter. No matter how small the change, the tool re-ran its entire internal process — data fetching, LLM calls, scoring heuristics. After all that, the orchestrator got back just the final answer.

When a user asks, “Why were these sources chosen? What others were considered?”, the orchestrator literally doesn’t know.

This was turning into a game of telephone.

Act 3: The General-Purpose Agent

The core issue: a lack of direct control and transparency. We had to flatten the chain of command.

I replaced the orchestrator, schema LLM, and data source LLM with a single agent that acts instead of orchestrates. No more delegation to sub-agents for routine tasks. Instead, the agent gets small, fast, atomic tools — tool_list_data_sources and tool_inspect_data_source — and does the thinking itself.
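These atomic tools are thin wrappers over metadata lookups rather than sub-agents. Sketched here against an in-memory catalog; the real backing store and metadata fields are assumptions:

```python
CATALOG = {  # illustrative in-memory stand-in for the metadata store
    "src_001": {"name": "2026_mkt_transactions", "columns": ["payment_id", "amount"]},
}

def tool_list_data_sources() -> str:
    # Cheap and fast: just enumerate IDs and names.
    return "\n".join(f"{sid}: {meta['name']}" for sid, meta in CATALOG.items())

def tool_inspect_data_source(source_id: str) -> str:
    meta = CATALOG.get(source_id)
    if meta is None:
        return f"ERROR: unknown source ID {source_id!r}."
    return f"{meta['name']} columns: {', '.join(meta['columns'])}"
```

Because each call is small and side-effect free, the agent can chain as many as it needs while keeping every intermediate result in its own context.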

The general-purpose agent architecture

Where did the schema tool go? We don’t need it. Modern models handle this and JSON generation reliably enough that isolating it wasn’t worth the delegation cost.

By now, I’m noticing a pattern. When in doubt, cut hard, inflexible logic and take advantage of the multi-billion parameter model.

The one exception is data flow generation — context-heavy and self-contained enough to justify a sub-LLM. But the key difference is how the main agent interfaces with it:

Python
def tool_generate_data_flow(
    task_description: str,
    schema: Schema,
    data_source_ids: list[str]  # <- the key
): ...

The agent must supply the schema and a list of data source IDs. We cache metadata in internal state and expose short reference IDs to the agent.

The IDs act as “keys” to this step of the workflow, and the agent, being an optimizer, instinctively seeks to obtain those keys. The failure mode where the LLM hallucinates metadata is eliminated by design, not by prompting.
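Enforcement is a dictionary lookup, not a prompt instruction. A sketch, where `_SOURCE_CACHE` is a hypothetical internal store keyed by the short reference IDs:

```python
_SOURCE_CACHE = {  # populated as the agent inspects sources; keyed by short IDs
    "src_001": {"name": "2026_mkt_transactions"},
}

def tool_generate_data_flow(task_description: str,
                            schema: dict,
                            data_source_ids: list[str]) -> str:
    # An ID the agent never obtained simply fails to resolve, so hallucinated
    # metadata cannot enter the pipeline.
    unknown = [sid for sid in data_source_ids if sid not in _SOURCE_CACHE]
    if unknown:
        return f"ERROR: unknown source IDs {unknown}. Inspect sources first."
    sources = [_SOURCE_CACHE[sid] for sid in data_source_ids]
    return f"OK: data flow built over {len(sources)} source(s)."
```

The check costs one line, yet it converts a behavioral rule ("don't make up sources") into a structural fact of the environment.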

Here, the agent sees all the context it needs. It can inspect sources, browse metadata, reason freely. But it can’t invoke data flow generation without supplying concrete artifact IDs that only exist if it actually did the prerequisite work.

You can’t unlock the door without the key, and you can’t get the key without finding it first. The agent’s creativity is fully unconstrained: it can explore, retry, combine sources however it wants. But the dependency chain remains.

This architecture unlocked capabilities that were previously impossible. Because the agent hoards all context, it can answer “Explain to me what my data looks like” by inspecting sources directly and summarizing. It can respond to “What are the highest-value data flows I could build?” by synthesizing ideas from everything it’s discovered.

We didn’t build these features — they simply emerged from an agent that can see everything at once.

The independence of schema and sources also opened up new interaction patterns. A user saying “I need to replicate data to this destination table” triggers the agent to find sources matching the schema. But “Combine these two source tables into one aggregate view” works in the opposite direction — the agent produces a schema based on the sources. Previously, the schema agent might produce a schema requiring columns that the source agent later discovered didn’t exist. Now the agent sees everything at once, so it never generates a schema it can’t fulfill.

The code shrank, we stopped dreading edge cases, and the agent finally felt like an intelligent assistant. Building data flows is its main capability but far from its only one.

What Survived

Each rewrite followed the same pattern: simplify. State machine with three specialized agents became an orchestrator with delegated tools, which became one agent with lightweight tools. Three LLMs collapsed into one, and complex routing became simple tool calls.

If I could start over, I’d take away two lessons.

1. There is a delegation tax

Every time you hand off to a sub-agent, you pay a cost in context loss, whether in the narrow interface you communicate through or in the sub-agent's scoped understanding of the overall workflow.

See how far you can get with one agent with contained tools. Delegate when complexity genuinely demands it and you deliberately want to limit context.

2. Make bad behavior structurally impossible

In Act 2, hiding objects behind tool boundaries stopped hallucination but killed transparency. In Act 3, artifact-based dependencies preserved full context while still making invalid sequences impossible.

The pattern: Don’t prompt the agent to follow the rules; build an environment where breaking them isn’t an option. Build tools that produce physical artifacts (a token, file, or capability handle) that subsequent tools require as input.

The agent can freely explore, retry, and improvise, but it cannot disobey the laws of physics that you write.