We’ve spent the past year building a data engineering agent. Our users are business analysts who don’t care about the internals of an LLM agent and just want to build correct and efficient data pipelines.

When users rely on the agent’s expertise in areas unfamiliar to them, correctness is critical. Who is there to course-correct if the agent hallucinates tables or confidently proposes a data flow analyzing the wrong metric?

Chasing higher levels of reliability, I’ve rearchitected the entire agent three times, with each overhaul stemming from real failures and limitations.

This post walks through the full evolution and shows how subtle environment engineering can outperform pages of prompt engineering.

Act 1: The Rigid State Machine

Our biggest early concern was LLM non-determinism and the correctness issues that came with it, which were especially prevalent in the older open-source models.

A software engineer faced with a chaotic system tries to bring order, so I built a rigid state machine to put the multi-agent system on rails.

First, an intent classifier extracted what the user wanted: NEW_REQUEST, APPROVAL, QUESTION, UNKNOWN. Then, a state machine router used that intent plus the current workspace state to dispatch to the right sub-agent. And each sub-agent (schema generation, data source discovery, data flow creation) had its own prompt and handled its slice of the workflow.
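To make the shape of that router concrete, here is a minimal sketch. The four intent labels come from the design above; the workspace states, the route table, and the "fallback_reply" handler are hypothetical stand-ins, not our production code.

```python
from enum import Enum, auto


class Intent(Enum):
    NEW_REQUEST = auto()
    APPROVAL = auto()
    QUESTION = auto()
    UNKNOWN = auto()


class WorkspaceState(Enum):
    # Hypothetical states, for illustration only.
    EMPTY = auto()
    SCHEMA_PROPOSED = auto()
    SOURCES_PROPOSED = auto()


# Every (intent, state) pair needs an explicit entry, so the table
# grows with every new intent or state.
ROUTES = {
    (Intent.NEW_REQUEST, WorkspaceState.EMPTY): "schema_generation",
    (Intent.APPROVAL, WorkspaceState.SCHEMA_PROPOSED): "data_source_discovery",
    (Intent.APPROVAL, WorkspaceState.SOURCES_PROPOSED): "data_flow_creation",
}


def route(intent: Intent, state: WorkspaceState) -> str:
    """Return the name of the sub-agent to dispatch to."""
    # Anything not covered by the table falls through to a canned reply.
    return ROUTES.get((intent, state), "fallback_reply")
```

Note that route() takes exactly one intent and one state, so a message that carries several intents has no natural place in this table.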

This was like turning a helicopter that could fly anywhere into a train with fixed tracks, rail signals, and precise boarding times. (This was not necessarily a good thing.) But for clean, happy-path scenarios, this worked well.

Of course, very soon the pains showed up. How does this system handle a message that does multiple things at once?

[User]: Please get rid of the order ID column and instead use the payment ID from 2026_mkt_transactions table. Also could you explain to me the formula you’re using to calculate revenue?

Two edit requests and a question in one message… ouch. I could classify into a list of intents, but processing one might invalidate the state for the next.

And it got worse from there. UNKNOWN intent was useless to act on, handling questions meant adding a Q&A mode onto every sub-agent’s prompt, and the decision tree was a maintenance nightmare.

Trying to impose deterministic routing on a language model whose adaptability is its strength was the wrong approach.

Act 2: The Orchestrator

Real-world agent usage is messy: users are ambiguous, are unsure of direction, or imply multiple side effects in a single message.

The core problem with intent classification was that it was a lossy compression of highly nuanced messages into structured labels.

But then I took a step back: did we actually need to distill abstract intent into buckets? The end-goal was never identifying user intent but rather invoking the correct system processes! LLM tool-calling aims to solve this exact problem.

Embrace non-determinism instead of fighting it.

Thus, I replaced the entire intent classifier and state machine router with a single orchestrator LLM and wrapped the sub-agents as tools. The control flow collapsed into a simple loop: at each turn, the orchestrator either responded in prose or called tools:

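for _ in range(MAX_TOOL_CALLS):
    result = llm.call()  # one orchestrator turn
    if not result.tool_calls:
        # No tool calls: the model answered in prose, so the turn ends.
        print(result.content)
        break
    # Otherwise the requested tools run and their results feed the next call.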

Tools could also reject calls if prerequisites weren’t met, which made sequencing easy to enforce.
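Here is a fuller, runnable sketch of that loop with one prerequisite check. Everything except the tool names is an assumption made for illustration: the shared workspace dict, the LLMResult and ToolCall shapes, and scripted_llm, which stands in for a real model so the example runs offline.

```python
from dataclasses import dataclass, field

MAX_TOOL_CALLS = 5


@dataclass
class ToolCall:
    name: str
    argument: str


@dataclass
class LLMResult:
    content: str = ""
    tool_calls: list[ToolCall] = field(default_factory=list)


workspace = {"schema": None, "sources": None}  # hypothetical shared state


def tool_generate_schema(request: str) -> str:
    workspace["schema"] = f"schema for: {request}"
    return "success: schema updated"


def tool_determine_sources(request: str) -> str:
    # Reject the call when its prerequisite is missing.
    if workspace["schema"] is None:
        return "rejected: generate a schema before choosing sources"
    workspace["sources"] = [request]
    return "success: sources updated"


TOOLS = {
    "tool_generate_schema": tool_generate_schema,
    "tool_determine_sources": tool_determine_sources,
}


def run_turn(llm_call, messages: list[str]) -> str:
    for _ in range(MAX_TOOL_CALLS):
        result = llm_call(messages)
        if not result.tool_calls:
            return result.content  # prose reply ends the turn
        for call in result.tool_calls:
            outcome = TOOLS[call.name](call.argument)
            messages.append(f"{call.name} -> {outcome}")
    return "Stopped: tool call budget exhausted."


def scripted_llm(messages: list[str]) -> LLMResult:
    # Stand-in for a real model: tries sources first, recovers after rejection.
    last = messages[-1]
    if len(messages) == 1 or last.startswith("tool_generate_schema"):
        call = ToolCall("tool_determine_sources", "2026_mkt_transactions")
        return LLMResult(tool_calls=[call])
    if last.startswith("tool_determine_sources -> rejected"):
        return LLMResult(tool_calls=[ToolCall("tool_generate_schema", "add payment ID")])
    return LLMResult(content="Done!")


history = ["user: use the payment ID from 2026_mkt_transactions"]
reply = run_turn(scripted_llm, history)
```

The rejection string is the only sequencing mechanism here. The loop itself has no notion of order, and the model recovers by reading the rejection and calling the missing tool first.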

The improvements were substantial: the rewrite eliminated hard-coded responses, gave the data engineering agent a single voice, and designated it as the coordinator.

Best of all, multi-intent messages simply worked. The orchestrator called tools in sequence, so the earlier example resolved into:

tool_generate_schema("drop order ID, add payment ID")
tool_determine_sources("2026_mkt_transactions")
Final reply: “Done! I dropped order ID, added payment ID, and included 2026_mkt_transactions. The revenue formula is…”

One early challenge, though, was that the orchestrator LLM sometimes generated objects (like schemas) on its own instead of using the tools. I addressed this by hiding the output objects behind the tool boundary, ensuring the orchestrator never saw schemas or sources directly, only success/failure strings.

This trick was surprisingly effective because it didn’t ask the LLM to behave differently. It changed what the LLM could see, serving not as a behavioral fix but an environmental one.

However, black-boxing was a blunt instrument. The orchestrator couldn’t fabricate data — good — but it also couldn’t explain, inspect, or reason about it.

The data source tool suffered from this the most. If a user wanted to change which sources were used, the orchestrator had to compress that request into a single string parameter. No matter how small the change, the tool re-ran its entire internal process: data fetching, LLM calls, scoring heuristics. After all that, the orchestrator got back just the final answer.

When a user asked, “Why were these sources chosen? What others were considered?”, the orchestrator literally didn’t know.

This was turning into a game of telephone.

Act 3: The General-Purpose Agent

The core issue: a lack of direct control and transparency. We had to flatten the chain of command.

I replaced the orchestrator, schema LLM, and data source LLM with a single agent that acts instead of orchestrating. No more delegation to sub-agents for routine tasks. Instead, the agent gets small, fast, atomic tools (tool_list_data_sources and tool_inspect_data_source) and does the thinking itself.

Where did the schema tool go? We don’t need it. Modern models handle schema design and JSON generation reliably enough that isolating them wasn’t worth the delegation cost.

Same instinct as before: when the model can handle it natively, cut the scaffolding.

The one exception is data flow generation — context-heavy and self-contained enough to justify a sub-LLM. But the key difference is how the main agent interfaces with it:

def tool_generate_data_flow(
    task_description: str,
    schema: Schema,
    data_source_ids: list[str]  # <- the key
): ...

The agent must supply the schema and a list of data source IDs. We cache metadata in internal state and expose short reference IDs to the agent.

The IDs act as “keys” to this step of the workflow, and the agent, being an optimizer, instinctively seeks to obtain those keys. The failure mode where the LLM hallucinates metadata is eliminated by design, not by prompting.

In this new design, the agent sees all the context it needs. It can inspect sources, browse metadata, reason freely. But it can’t invoke data flow generation without supplying concrete artifact IDs that only exist if it actually did the prerequisite work.

You can’t unlock the door without the key, and you can’t get the key without finding it first. The agent’s creativity is fully unconstrained: it can explore, retry, combine sources however it wants, but the dependency chain remains.
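The lock-and-key idea can be sketched in a few lines. The tool names and the tool_generate_data_flow signature match the ones above. The in-memory catalog, the "src_N" ID format, and the rejection messages are assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Schema:
    columns: list[str]


# Hypothetical catalog standing in for real data source metadata.
_CATALOG = {
    "2026_mkt_transactions": {"columns": ["payment_id", "amount"]},
    "2026_mkt_orders": {"columns": ["order_id", "customer_id"]},
}
_inspected: dict[str, dict] = {}  # short reference ID -> cached metadata
_next_id = count(1)


def tool_list_data_sources() -> list[str]:
    return sorted(_CATALOG)


def tool_inspect_data_source(name: str) -> dict:
    if name not in _CATALOG:
        return {"error": f"unknown data source: {name}"}
    # Mint a short reference ID; this is the "key" the agent must hold later.
    ref_id = f"src_{next(_next_id)}"
    _inspected[ref_id] = {"name": name, **_CATALOG[name]}
    return {"data_source_id": ref_id, **_CATALOG[name]}


def tool_generate_data_flow(
    task_description: str,
    schema: Schema,
    data_source_ids: list[str],
) -> str:
    if not data_source_ids:
        return "rejected: at least one data source ID is required"
    unknown = [i for i in data_source_ids if i not in _inspected]
    if unknown:
        return f"rejected: unknown data source IDs {unknown}; inspect sources first"
    names = [_inspected[i]["name"] for i in data_source_ids]
    return f"success: data flow for '{task_description}' over {names}"


schema = Schema(columns=["payment_id", "revenue"])
guessed = tool_generate_data_flow("revenue by payment", schema, ["src_99"])
key = tool_inspect_data_source("2026_mkt_transactions")["data_source_id"]
earned = tool_generate_data_flow("revenue by payment", schema, [key])
```

A made-up ID like "src_99" fails the lookup, while an ID returned by an actual inspection passes. The check lives in the tool, so no prompt instruction is needed to enforce the order.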

From this architecture emerged capabilities we didn’t explicitly engineer. Because the agent hoards all context, it can answer “Explain to me what my data looks like” by inspecting sources directly and summarizing. It can respond to “What are the highest-value data flows I could build?” by synthesizing ideas from everything it’s discovered.

What Survived

Each rewrite followed the same impulse: simplify, give the agent more direct access, and stop trying to be smarter than the model. If I had to distill what survived all three rewrites:

1. There is a delegation tax

Every time we hand off to a sub-agent, we pay a cost in context loss. Black-boxing is very dangerous as it constrains understanding both ways:

The caller, not knowing inner workings, struggles to debug internal errors, is oblivious to latency costs, and has to compress its context into the tool’s interface.
The callee (tool or sub-agent) loses understanding of the overarching workflow and has only its given arguments to work from.

Delegation is only helpful when complexity genuinely demands it or we deliberately want to limit context.

2. Bad behavior arises from bad platforms

Giving an LLM rigid rules to follow often means it can’t extrapolate beyond the rigid situations we’re testing.

We got better results from building a more robust, precise, and consistent environment than from writing pages of prompts.

In practice, this came down to a few things:

Clear, faithful, and single-purpose tools: making sure each tool does exactly what it says, and avoiding giving the LLM redundant paths to choose from.
Tools and skills that paint a story: tool names alone do a ton of heavy lifting. The LLM sees a wrench, cutter, and drain snake, and realizes: “I can do plumbing”.
Consistent ID types: having an asset_id and a job_id can be confusing, and we need to ask, “Is an LLM going to think a job is an asset? Will it ever use the wrong IDs for the wrong tools?”
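One possible way to act on the ID point is to make each ID carry its type in a prefix and have tools reject the wrong kind with a message the model can act on. This is a sketch under that assumption, not a description of our platform: the "asset_" and "job_" prefixes, the parse helpers, and tool_get_job_status are all hypothetical.

```python
from typing import NewType

AssetId = NewType("AssetId", str)
JobId = NewType("JobId", str)


def parse_asset_id(raw: str) -> AssetId:
    if not raw.startswith("asset_"):
        raise ValueError(f"expected an asset ID like 'asset_123', got {raw!r}")
    return AssetId(raw)


def parse_job_id(raw: str) -> JobId:
    if not raw.startswith("job_"):
        raise ValueError(f"expected a job ID like 'job_123', got {raw!r}")
    return JobId(raw)


def tool_get_job_status(job_id: str) -> str:
    """Hypothetical tool: accepts only job IDs and explains any mismatch."""
    try:
        parsed = parse_job_id(job_id)
    except ValueError as exc:
        return f"rejected: {exc}"
    return f"success: {parsed} is running"


wrong_kind = tool_get_job_status("asset_7")
right_kind = tool_get_job_status("job_7")
```

The prefix makes a mix-up visible both to the model reading its own context and to the tool validating its input, so a wrong ID produces a specific correction instead of a silent lookup failure.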

Notice all the points above are things a human developer would benefit from as well. Models are improving faster than platforms can keep up, so investing in the platform not only ensures a better moat but also grants more power to the agents built upon it.