aiyan.io
// 2026-02-15 · #design

Don't Prompt Your Agent for Reliability — Engineer It

We’ve spent the past year building a data engineering agent. Our users are business analysts who don’t care about the internals of an LLM agent and just want to build correct and efficient data pipelines.

When users rely on the agent’s expertise in areas unfamiliar to them, correctness is critical. Who is there to course-correct if the agent hallucinates tables or confidently proposes a data flow analyzing the wrong metric?

Chasing higher levels of reliability, I’ve rearchitected the entire agent three times, with each overhaul stemming from real failures and limitations.

This post walks through the full evolution, and how subtle environment engineering can outperform pages of prompt engineering.

Act 1: The Rigid State Machine

Our biggest concern early on was LLM non-determinism and the correctness issues that came with it, which were especially prevalent in the older open-source models.

A software engineer faced with a chaotic system tries to bring order. So I built a rigid state machine to put the multi-agent system on rails.

An intent classifier extracted what the user wanted: NEW_REQUEST, APPROVAL, QUESTION, UNKNOWN. Then a state machine router used that intent plus the current workspace state to dispatch to the right sub-agent. And each sub-agent (schema generation, data source discovery, data flow creation) had its own prompt and handled its slice of the workflow.

This was like turning a helicopter that could fly anywhere into a train with fixed tracks, rail signals, and precise boarding times. (This was not necessarily a good thing.) But for clean, happy-path scenarios, this worked well.

[Figure: The rigid state machine architecture]
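To make the routing concrete, here is a minimal sketch of that kind of dispatch table. The four intent labels come from the description above; the workspace states, the keyword-based `classify_intent` stand-in (the real classifier was an LLM), the route table, and the `canned_fallback` name are all assumptions made for illustration.

```python
from enum import Enum


class Intent(Enum):
    NEW_REQUEST = "new_request"
    APPROVAL = "approval"
    QUESTION = "question"
    UNKNOWN = "unknown"


class WorkspaceState(Enum):  # hypothetical states, for illustration only
    EMPTY = "empty"
    SCHEMA_PROPOSED = "schema_proposed"
    SOURCES_PROPOSED = "sources_proposed"


def classify_intent(message: str) -> Intent:
    """Keyword stand-in for the LLM classifier: exactly one label per message."""
    text = message.lower()
    if "?" in text:
        return Intent.QUESTION
    if text.startswith(("yes", "looks good", "approve")):
        return Intent.APPROVAL
    if any(word in text for word in ("build", "create", "use", "get rid of")):
        return Intent.NEW_REQUEST
    return Intent.UNKNOWN


# (state, intent) -> sub-agent name. Every pair not listed is a dead end.
ROUTES = {
    (WorkspaceState.EMPTY, Intent.NEW_REQUEST): "schema_generation",
    (WorkspaceState.SCHEMA_PROPOSED, Intent.NEW_REQUEST): "schema_generation",
    (WorkspaceState.SCHEMA_PROPOSED, Intent.APPROVAL): "data_source_discovery",
    (WorkspaceState.SOURCES_PROPOSED, Intent.APPROVAL): "data_flow_creation",
}


def route(state: WorkspaceState, message: str) -> str:
    """Pick the sub-agent for this message, or fall back to a canned reply."""
    intent = classify_intent(message)
    return ROUTES.get((state, intent), "canned_fallback")
```

Every new state or intent multiplies the number of pairs the table has to cover, and anything that is not covered lands on the same fallback.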

Of course, very soon the pains showed up. How does this system handle a message that does multiple things at once?

[User]: Please get rid of the order ID column and instead use the payment ID from 2026_mkt_transactions table. Also could you explain to me the formula you’re using to calculate revenue?

Two edit requests and a question in one message — ouch. You could classify into a list of intents, but processing one might invalidate the state for the next.

And it got worse from there. UNKNOWN intent always triggered the same canned “I couldn’t understand…” response. Handling questions meant bolting a Q&A mode onto every sub-agent’s prompt. The decision tree was a maintenance nightmare.

Trying to impose deterministic routing on a language model whose adaptability is its strength was the wrong approach.

Act 2: The Orchestrator

Real-world agent usage is messy: users are ambiguous, unsure of direction, or imply multiple side effects in a single message.

The core problem with intent classification was that it was a lossy compression of highly nuanced messages into structured labels.

But then I took a step back: did we actually need to distill abstract intent into buckets? The end-goal was never identifying user intent — it was invoking the correct system processes. That’s exactly what LLM tool-calling solves.

Embrace non-determinism instead of fighting it.

Thus, I replaced the entire intent classifier and state machine router with a single orchestrator LLM and wrapped the sub-agents as tools. The control flow collapsed into a simple loop: at each turn, the orchestrator either responded in prose or called tools:

Python
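for _ in range(MAX_TOOL_CALLS):
    result = llm.call()
    if not result.tool_calls:
        # No tool calls means the orchestrator answered in prose: end the turn.
        print(result.content)
        break
    # Otherwise the requested tools run and their results feed the next call.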

Tools could also reject calls if prerequisites weren’t met, which made sequencing easy to enforce.
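A fuller, runnable version of that loop might look like the sketch below. Only the loop shape, `MAX_TOOL_CALLS`, and the two tool names come from this post; the `Workspace` flags, the `ToolCall` and `ModelReply` types, the error strings, and the `scripted_model` stand-in for the LLM are assumptions. The point is that a rejected call is just another tool result the model can read and recover from.

```python
from dataclasses import dataclass, field

MAX_TOOL_CALLS = 8  # illustrative budget


@dataclass
class ToolCall:
    name: str
    argument: str


@dataclass
class ModelReply:
    content: str = ""
    tool_calls: list[ToolCall] = field(default_factory=list)


@dataclass
class Workspace:
    has_schema: bool = False
    has_sources: bool = False


def tool_generate_schema(ws: Workspace, request: str) -> str:
    ws.has_schema = True
    return "success: schema updated"


def tool_determine_sources(ws: Workspace, request: str) -> str:
    # Reject the call when the prerequisite step has not happened yet.
    if not ws.has_schema:
        return "error: generate a schema before choosing sources"
    ws.has_sources = True
    return "success: sources updated"


TOOLS = {
    "tool_generate_schema": tool_generate_schema,
    "tool_determine_sources": tool_determine_sources,
}


def run_turn(model, ws: Workspace, user_message: str) -> str:
    """Loop until the model answers in prose or the tool budget runs out."""
    transcript = [f"user: {user_message}"]
    for _ in range(MAX_TOOL_CALLS):
        reply = model(transcript)
        if not reply.tool_calls:
            return reply.content
        for call in reply.tool_calls:
            outcome = TOOLS[call.name](ws, call.argument)
            transcript.append(f"{call.name} -> {outcome}")
    return "Stopped: tool-call budget exhausted."


def scripted_model(transcript: list[str]) -> ModelReply:
    """Deterministic stand-in for the orchestrator LLM."""
    sources = ToolCall("tool_determine_sources", "2026_mkt_transactions")
    if len(transcript) == 1:
        return ModelReply(tool_calls=[sources])  # tries sources too early
    if transcript[-1].startswith("tool_determine_sources -> error"):
        return ModelReply(tool_calls=[ToolCall("tool_generate_schema", "add payment ID")])
    if not any("sources updated" in line for line in transcript):
        return ModelReply(tool_calls=[sources])  # retries after the schema exists
    return ModelReply(content="Done.")
```

Here the scripted model asks for sources first, reads the rejection, generates the schema, retries, and only then replies in prose.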

[Figure: The orchestrator architecture]

The improvements were substantial: the rewrite eliminated hard-coded responses, gave the data engineering agent a single voice, and made the orchestrator the sole coordinator.

Best of all, multi-intent messages simply worked, because the orchestrator could call tools in sequence. The same example from before resolved into:

  1. tool_generate_schema("drop order ID, add payment ID")
  2. tool_determine_sources("2026_mkt_transactions")
  3. Final reply: “Done! I dropped order ID, added payment ID, and included 2026_mkt_transactions. The revenue formula is…”

One early challenge, though, was that the orchestrator LLM sometimes generated objects (like schemas) on its own instead of using the tools. I fixed this by hiding the output objects behind the tool boundary: the orchestrator never saw schemas or sources directly, only success/failure strings. With the objects on the other side of the wall, the orchestrator was compelled to use the tools rather than spin up its own versions.

This trick was surprisingly effective because it didn’t ask the LLM to behave differently. It changed what the LLM could see, which made it an environmental fix rather than a behavioral one.
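As a rough sketch of that boundary: the tool writes the generated object into state the orchestrator never reads, and only a status string comes back. `HiddenWorkspace`, `build_schema`, and the column names are hypothetical placeholders, not the real implementation.

```python
from dataclasses import dataclass, field


@dataclass
class HiddenWorkspace:
    """Holds generated objects; the orchestrator never reads these fields."""
    schema: dict[str, str] = field(default_factory=dict)


def build_schema(request: str) -> dict[str, str]:
    """Placeholder for the schema sub-agent (an LLM call in the real system)."""
    if not request.strip():
        raise ValueError("empty request")
    return {"payment_id": "string", "revenue": "decimal"}


def tool_generate_schema(ws: HiddenWorkspace, request: str) -> str:
    try:
        ws.schema = build_schema(request)
    except ValueError as exc:
        # Failures are surfaced as text, never as objects.
        return f"failure: {exc}"
    # Only a status string crosses the boundary back to the orchestrator.
    return "success: schema saved"
```

Because the return value carries no column names, the orchestrator has nothing to copy or improvise from; the only way to change the schema is to call the tool again.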

But black-boxing was a blunt instrument. The orchestrator couldn’t fabricate data — good — but it also couldn’t explain, inspect, or reason about it.

The data source tool suffered from this the most. If a user wanted to change which sources were used, the orchestrator had to compress that request into a single string parameter. No matter how small the change, the tool re-ran its entire internal process — data fetching, LLM calls, scoring heuristics. After all that, the orchestrator got back just the final answer.

When a user asked, “Why were these sources chosen? What others were considered?”, the orchestrator literally didn’t know.

This was turning into a game of telephone.

Act 3: The General-Purpose Agent

The core issue: a lack of direct control and transparency. We had to flatten the chain of command.

I replaced the orchestrator, schema LLM, and data source LLM with a single agent that acts instead of orchestrating. No more delegation to sub-agents for routine tasks. Instead, the agent gets small, fast, atomic tools (tool_list_data_sources and tool_inspect_data_source) and does the thinking itself.

[Figure: The general-purpose agent architecture]
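For a sense of what "small, fast, atomic" means here, the two tools could be as thin as the sketch below. The tool names are from this post; the in-memory `CATALOG`, the `DataSource` fields, the IDs, and the return shapes are assumptions standing in for the real metadata store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    source_id: str
    table: str
    columns: tuple[str, ...]


# Hypothetical in-memory catalog standing in for the real metadata store.
CATALOG = {
    "src_1": DataSource("src_1", "2026_mkt_transactions", ("payment_id", "amount")),
    "src_2": DataSource("src_2", "orders", ("order_id", "customer_id")),
}


def tool_list_data_sources() -> list[dict[str, str]]:
    """Cheap overview: IDs and table names only."""
    return [{"id": s.source_id, "table": s.table} for s in CATALOG.values()]


def tool_inspect_data_source(source_id: str) -> dict[str, object]:
    """Detail on one source; unknown IDs return an error instead of raising."""
    source = CATALOG.get(source_id)
    if source is None:
        return {"error": f"unknown data source id: {source_id}"}
    return {
        "id": source.source_id,
        "table": source.table,
        "columns": list(source.columns),
    }
```

Neither tool calls an LLM or makes a decision. Each one answers a single question, so the agent can call them repeatedly while it reasons.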

Where did the schema tool go? We don’t need it. Modern models handle schema design and JSON generation reliably enough that isolating them wasn’t worth the delegation cost.

Same instinct as before: when the model can handle it natively, cut the scaffolding.

The one exception is data flow generation — context-heavy and self-contained enough to justify a sub-LLM. But the key difference is how the main agent interfaces with it:

Python
def tool_generate_data_flow(
    task_description: str,
    schema: Schema,  # the schema the agent has settled on
    data_source_ids: list[str],  # <- the key: reference IDs from earlier tool calls
): ...

The agent must supply the schema and a list of data source IDs. We cache metadata in internal state and expose short reference IDs to the agent.

The IDs act as “keys” to this step of the workflow, and the agent, being an optimizer, instinctively seeks to obtain those keys. The failure mode where the LLM hallucinates metadata is eliminated by design, not by prompting.

In this new design, the agent sees all the context it needs. It can inspect sources, browse metadata, reason freely. But it can’t invoke data flow generation without supplying concrete artifact IDs that only exist if it actually did the prerequisite work.

You can’t unlock the door without the key, and you can’t get the key without finding it first. The agent’s creativity is fully unconstrained: it can explore, retry, combine sources however it wants, but the dependency chain remains.
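The lock-and-key idea can be sketched in a few lines. This is an illustration under stated assumptions, not the production code: the `Session` class, the `ds_N` ID format, the `table` argument to the inspect tool, the plain `dict` standing in for `Schema`, and the returned strings are all hypothetical.

```python
import itertools


class Session:
    """Internal state: cached source metadata keyed by short reference IDs."""

    def __init__(self) -> None:
        self._next_id = itertools.count(1)
        self._inspected: dict[str, dict[str, object]] = {}

    def tool_inspect_data_source(self, table: str) -> str:
        """Cache metadata for a table and hand the agent a short reference ID."""
        ref_id = f"ds_{next(self._next_id)}"
        # Placeholder metadata; the real system would fetch it from the platform.
        self._inspected[ref_id] = {"table": table}
        return ref_id

    def tool_generate_data_flow(
        self,
        task_description: str,
        schema: dict[str, str],
        data_source_ids: list[str],
    ) -> str:
        """Refuse to run unless every ID was issued by an earlier inspection."""
        unknown = [i for i in data_source_ids if i not in self._inspected]
        if not data_source_ids or unknown:
            return f"error: inspect sources first; unknown ids: {unknown}"
        tables = [self._inspected[i]["table"] for i in data_source_ids]
        return f"ok: data flow planned over {tables}"
```

An invented ID such as `ds_9` fails the lookup, so the only way through is to call the inspection tool and pass back what it returned.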

From this architecture emerged capabilities we didn’t explicitly engineer. Because the agent hoards all context, it can answer “Explain to me what my data looks like” by inspecting sources directly and summarizing. It can respond to “What are the highest-value data flows I could build?” by synthesizing ideas from everything it’s discovered.

The independence of schema and sources also opened up new interaction patterns. A user saying “I need to replicate data to this destination table” triggers the agent to find sources matching the schema. But “Combine these two source tables into one aggregate view” works in the opposite direction — the agent produces a schema based on the sources. Previously, the schema agent might produce a schema requiring columns that the source agent later discovered didn’t exist. Now the agent sees everything at once, so it never generates a schema it can’t fulfill.

What Survived

Each rewrite followed the same impulse to simplify, give the agent more direct access, and stop trying to be smarter than the model. If I had to distill what survived all three rewrites:

1. There is a delegation tax

Every time you hand off to a sub-agent, you’re paying a cost in context loss. Black-boxing is very dangerous as it constrains understanding both ways:

  1. The caller, not knowing inner workings, struggles to debug internal errors, is oblivious to latency costs, and has to compress its context into the tool’s interface.
  2. The callee (tool or sub-agent) loses sight of the overarching workflow and has only its given arguments to work from.

So, see how far you can get with one agent with atomic tools. Delegate when complexity genuinely demands it and you deliberately want to limit context.

2. Bad behavior arises from bad platforms

Giving an LLM rigid rules to follow often means it can’t extrapolate beyond the rigid situations you’re testing.

We got better results the more effort we spent building a more robust, precise, and consistent environment, rather than writing pages of skills and system prompts.

In practice, this came down to a few things:

  • Unambiguous, faithful, and non-overlapping tools: make sure each tool does exactly what it says, and don’t give the LLM multiple ways to accomplish the same thing.
  • Tools and skills that paint a story: just the tool names themselves do a ton of heavy lifting. The LLM sees a wrench, cutter, and drain snake, and realizes: “I can do plumbing”.
  • Consistent ID types: if you’re going to have an asset_id and a job_id, ask yourself “Is an LLM going to think a job is an asset? Will it ever use the wrong IDs for the wrong tools?”
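One way to act on the last bullet is to make the ID kind visible in the ID itself and have each tool check it. The prefix convention, the `require_kind` helper, and `tool_get_job_status` below are hypothetical; they only illustrate the idea.

```python
def require_kind(identifier: str, kind: str) -> str:
    """Accept IDs shaped like '<kind>_<digits>' and reject everything else."""
    prefix, _, rest = identifier.partition("_")
    if prefix != kind or not rest.isdigit():
        raise ValueError(f"expected a {kind} id like '{kind}_12', got '{identifier}'")
    return identifier


def tool_get_job_status(job_id: str) -> str:
    """Hypothetical tool: only job IDs are accepted, with a readable error otherwise."""
    try:
        require_kind(job_id, "job")
    except ValueError as exc:
        return f"error: {exc}"
    return f"{job_id}: running"  # placeholder status
```

If the model passes `asset_7` to a job tool, the error message states exactly which kind of ID was expected, which gives it something concrete to correct on the next call.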

You’ll notice all the points above are things a human developer would benefit from as well.

Poor environments, platforms, or APIs mean poorer agent performance. Ask yourself: how much additional context would I need to give a junior engineer before they could start developing on my platform?

If the answer is “a lot”, then consider ways to simplify and streamline your APIs. Start with the platform, and your agent will follow.