If your team is deciding between LangGraph, CrewAI, and rolling your own orchestration for your next LLM agent, there’s another option that gets less attention than it deserves: not building the orchestrator at all.

Instead, think about shipping kits—collections of MCP tools and skills—that equip existing agents like Claude Code and Codex with the context and actions they need to operate inside your platform.

As I’ve documented over the last eight months, orchestrating an agent is expensive in ways that aren’t obvious upfront. A LangGraph prototype takes a day, but making that prototype production-ready is the ninety-ninety problem: the final 10% somehow eats as much time as the first 90%.

We had to figure out chat persistence, context management, system prompt tuning that never quite ended, and UI synchronization. On top of all this sat testing harnesses that only added to the maintenance burden.

By the time we reached GA, we were maintaining whole frameworks of our own. And that was before counting the tools, platform APIs, and SDKs that actually delivered the business value to our users.

Orchestration has a half-life

Early in the life of our agent, we implemented task management utilities—WriteTodo and GetTodo—so the model could track long-term goals across subtasks. We paired them with system prompts documenting the pattern and system reminders appended after tool calls to refocus the agent’s attention.
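For readers who haven’t built this kind of scaffolding, here is a minimal sketch of the pattern in Python. It is not our production code: `TodoStore`, `with_reminder`, and the `<system-reminder>` tag format are hypothetical names chosen for illustration, and the harness that exposes these functions to the model as tools is assumed.

```python
from dataclasses import dataclass, field


@dataclass
class TodoStore:
    """In-memory todo list the model reads and writes through two tools."""

    items: list[dict] = field(default_factory=list)

    def write_todo(self, todos: list[dict]) -> str:
        # WriteTodo: the model replaces the whole list on every call.
        self.items = [
            {"task": t["task"], "done": bool(t.get("done", False))} for t in todos
        ]
        return f"Saved {len(self.items)} todos."

    def get_todo(self) -> list[dict]:
        # GetTodo: return copies so callers cannot mutate stored state.
        return [dict(t) for t in self.items]


def with_reminder(tool_result: str, store: TodoStore) -> str:
    """Append a reminder listing open todos to a tool result."""
    open_tasks = [t["task"] for t in store.items if not t["done"]]
    if not open_tasks:
        return tool_result
    reminder = (
        "<system-reminder>Open todos: " + "; ".join(open_tasks) + "</system-reminder>"
    )
    return tool_result + "\n" + reminder
```

Even a toy version like this implies prompt text describing when to call each tool, plus tests for the reminder wording, and that is the part that ends up costing days.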

But after upgrading to a different LLM, the task management utilities were suddenly unnecessary for our workflows, as the model was sharp enough to remember its own steps. The system reminders and todo-tracking just forced unnecessary ceremony.

I begrudgingly tore down the scaffolding around system reminders and task management, and the days spent tuning and testing went with it.

Anthropic themselves ran into the same pattern:

As just one example, in prior work we found that Claude Sonnet 4.5 would wrap up tasks prematurely as it sensed its context limit approaching—a behavior sometimes called “context anxiety.” We addressed this by adding context resets to the harness. But when we used the same harness on Claude Opus 4.5, we found that the behavior was gone. The resets had become dead weight.

— Anthropic, “Building Managed Agents”

It was an uncomfortable position to sit in. Do we keep investing in abstractions that move our evaluations forward, knowing the next model release might render them obsolete? It’s a bet no agent team should be forced to take.

The frontier labs are already solving this

While we were busy patching over our model’s weaknesses, Anthropic was solving those same weaknesses properly, inside Claude Code.

When tool outputs get too large to fit in context, Claude writes them to a file, samples what it needs with a few reads, and executes a script to extract the exact data. When there are too many tools to load without drowning the agent, the ToolSearch meta-tool defers tool loading until a task actually requires them. And when context would otherwise need to be stuffed upfront, skills let it pull in relevant information on demand.
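To make two of those behaviors concrete, here is a rough Python sketch of what a team would otherwise have to build and maintain itself. It is not how Claude Code implements them: `offload_if_large`, `read_slice`, `search_tools`, the 2,000-character budget, and the keyword scoring are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path

MAX_INLINE_CHARS = 2_000  # assumed budget for a single inline tool result


def offload_if_large(output: str, limit: int = MAX_INLINE_CHARS) -> str:
    """Return small outputs inline; write large ones to a file and return a pointer."""
    if len(output) <= limit:
        return output
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="utf-8"
    ) as f:
        f.write(output)
        path = f.name
    preview = output[:200]
    return f"[output too large: {len(output)} chars saved to {path}]\nPreview:\n{preview}"


def read_slice(path: str, start_line: int, num_lines: int) -> str:
    """Sample a window of lines from an offloaded file (start_line is 0-indexed)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return "\n".join(lines[start_line:start_line + num_lines])


def search_tools(catalog: dict[str, str], query: str, top_k: int = 3) -> list[str]:
    """Deferred loading: return only the tool names that match the query."""
    words = query.lower().split()
    scored = []
    for name, description in catalog.items():
        text = f"{name} {description}".lower()
        score = sum(1 for w in words if w in text)
        if score:
            scored.append((score, name))
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_k]]
```

Each function is small, but the thresholds and ranking rules are exactly the kind of model-specific tuning that goes stale after an upgrade.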

Anthropic and other providers do this work continuously, and when models update, the orchestration updates with them. Now compare that to the position we were in: a small team spending scarce engineering cycles optimizing a loop for a specific model, quietly hoping none of our assumptions would shift underneath us. And what happens when we need to switch — voluntarily, because a different model starts outperforming ours on the tasks we care about, or involuntarily, because the political climate suddenly disqualifies a provider?

Framework vendors advertise “model-agnostic” abstractions, and in theory that sounds like a clean solution. In practice, different models behave differently enough that the abstraction leaks. Prompts that excel on one model are unsatisfactory on another, tool-use behaviors shift, and reasoning styles diverge in ways that the config can’t fully account for. Octomind pulled LangChain from production after a year and put it plainly: “designing abstractions that will stand the test of time is incredibly difficult.”

That left us with two uncomfortable options: maintain the custom harness ourselves at escalating cost, or wait on a framework provider whose roadmap we didn’t control and whose priorities weren’t ours. Neither our team nor a framework vendor’s looks like Anthropic’s, and matching Anthropic was never our job.

Meeting the LLM in the middle

So we stopped competing with frontier labs on the orchestration loop and asked a different question: what does a frontier agent actually need from us in order to operate inside our platform?

Really, it comes down to two things. The first is niche, domain-specific context the model wasn’t trained on — the shape of our data, the conventions of our product, the assumptions a seasoned user would carry into any task without thinking about them. The second is a way to act — to retrieve information, mutate state, and move things through our system on the user’s behalf.

The first ships as Agent Skills. The second ships as an MCP server. Together, that’s a kit.
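As a rough picture of the skills half, the sketch below writes one skill to disk in Python. It assumes the common Agent Skills layout of one folder per skill containing a `SKILL.md` with `name` and `description` frontmatter; the `write_skill` helper and the example skill are hypothetical, not part of any SDK.

```python
import json
from pathlib import Path


def write_skill(root: Path, name: str, description: str, body: str) -> Path:
    """Create <root>/<name>/SKILL.md with YAML frontmatter and markdown instructions."""
    skill_dir = root / name
    skill_dir.mkdir(parents=True, exist_ok=True)
    content = (
        "---\n"
        f"name: {name}\n"
        # A JSON string is also a valid YAML double-quoted scalar, so colons are safe.
        f"description: {json.dumps(description)}\n"
        "---\n\n"
        f"{body.strip()}\n"
    )
    path = skill_dir / "SKILL.md"
    path.write_text(content, encoding="utf-8")
    return path
```

A call such as `write_skill(Path("skills"), "query-warehouse", "Use when the user asks about warehouse tables.", "# Steps\n1. List the schemas first.")` would produce `skills/query-warehouse/SKILL.md`. The description is the part the agent uses to decide whether to load the skill, so that is where the domain knowledge has to be stated plainly.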

I won’t go into depth on the two here, as there are plenty of resources online on them (the Databricks AI Dev Kit is a good example). The natural objection: skills and tools are the 2026 fad, and what happens when the fad passes?

Improving the resources we hand the agent also improves our entire tech stack. Good skills come from clear, concise documentation, which is exactly what our human developers and our platform’s consumers already need. Good tools come from APIs with sharp contracts and expressive parameters, which every downstream client — human or agent — benefits from. If the agent interface fad dies tomorrow, what’s left is better docs and better APIs—improvements worth making regardless.
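Here is one way to read “sharp contracts and expressive parameters” in practice, sketched in Python. The tool definition uses the `name`, `description`, and `inputSchema` fields that MCP tools are described with; the `list_tables` tool itself, its parameters, and the tiny `validate_args` checker are hypothetical, and a real server would lean on a full JSON Schema validator instead.

```python
LIST_TABLES_TOOL = {
    "name": "list_tables",
    "description": "List table names in one schema, sorted alphabetically.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "schema": {
                "type": "string",
                "description": "Schema to inspect, e.g. 'sales'.",
            },
            "limit": {
                "type": "integer",
                "minimum": 1,
                "maximum": 100,
                "default": 20,
                "description": "Maximum number of tables to return.",
            },
        },
        "required": ["schema"],
        "additionalProperties": False,
    },
}

_TYPES = {"string": str, "integer": int}


def validate_args(tool: dict, args: dict) -> dict:
    """Check args against the small schema subset used above and fill defaults."""
    schema = tool["inputSchema"]
    props = schema["properties"]
    unknown = sorted(set(args) - set(props))
    if unknown:
        raise ValueError(f"Unknown parameter(s): {', '.join(unknown)}")
    missing = [name for name in schema.get("required", []) if name not in args]
    if missing:
        raise ValueError(f"Missing required parameter(s): {', '.join(missing)}")
    clean = {}
    for name, spec in props.items():
        if name not in args:
            if "default" in spec:
                clean[name] = spec["default"]
            continue
        value = args[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(value, _TYPES[spec["type"]]) or isinstance(value, bool):
            raise ValueError(f"{name} must be of type {spec['type']}")
        if "minimum" in spec and value < spec["minimum"]:
            raise ValueError(f"{name} must be >= {spec['minimum']}")
        if "maximum" in spec and value > spec["maximum"]:
            raise ValueError(f"{name} must be <= {spec['maximum']}")
        clean[name] = value
    return clean
```

The error messages matter as much as the schema: an agent that receives “limit must be <= 100” can correct itself on the next call, and so can a human reading the same API response.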

Your platform is the moat

Months into building our own harness, we realized the orchestration layer was never our value center. IBM’s data platform was. The agent was just a new entrypoint for smoother and smarter access to existing platform capabilities.

Software patterns are what frontier labs are training their models to absorb. What actually differentiates our platform (the way we shape our customers’ data, the compliance boundaries of our domain, the conventions our users have built up over years of working with our product) isn’t on the internet for a model to learn from. We want our offerings to be empowered by the next model release, not made redundant by it.

By piggybacking on existing agent clients, we could stop defending the orchestration layer where we had no competitive advantage, and instead pour that same effort back into the domain where we genuinely do.

When owning the harness is the right call

There are teams for which this argument doesn’t apply, and they’re worth naming.

If you’re training your own model, or competing as a model provider, you need tight coupling between the model and the loop it runs inside — there, the orchestrator is the product.

Less obvious is the case where you have hard latency or cost requirements that demand loop-level optimization, or highly specialized approval gates that can’t tolerate LLM judgment in the hot path.

But even in those cases, “strict requirements” doesn’t automatically translate to “custom harness.” MCP elicitation lets a server request approval or clarification from the user in the middle of a tool call. MCP tasks let long-running operations report status back asynchronously, so the agent can react when work completes. Not every client supports these yet, but they will, and the shape of what you can accomplish without owning the loop keeps expanding.
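A small Python sketch of the two ideas, under stated assumptions: the `elicit` callback stands in for whatever elicitation call your MCP SDK provides, the `accept` action follows the elicitation response actions as I understand the spec, and `drop_table`, `start_task`, and the status strings are hypothetical names for illustration.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical stand-in for the round trip to the user through the client.
Elicit = Callable[[str], Awaitable[dict]]


async def drop_table(table: str, elicit: Elicit) -> str:
    """Destructive tool that pauses mid-call to ask the user for approval."""
    reply = await elicit(f"Drop table {table!r}? This cannot be undone.")
    if reply.get("action") != "accept":
        return f"Skipped: user did not approve dropping {table}."
    # The real platform call would go here.
    return f"Dropped {table}."


async def start_task(work: Awaitable[str]) -> dict:
    """Start long-running work and return a status record immediately."""
    record = {"status": "working", "result": None}

    async def runner() -> None:
        try:
            record["result"] = await work
            record["status"] = "completed"
        except Exception as exc:
            record["result"] = str(exc)
            record["status"] = "failed"

    # Keep the handle so the caller can await completion or poll the status.
    record["handle"] = asyncio.create_task(runner())
    return record
```

The approval gate lives in the tool, in deterministic code, so the LLM never gets to decide whether the destructive step runs. That covers a good share of what teams usually cite as a reason to own the loop.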

With features like these landing steadily and conferring orchestration benefits for free, the more durable default is extending existing agents rather than building in parallel to them.

Where the hours compound

Now we direct our engineering efforts into what’s only ours to build: APIs, SDKs, and platform capabilities that no outside team can sharpen for us.

And what strikes me, looking back on the months we spent otherwise, is how differently the same hour of work behaves depending on which side of this line it falls on. An hour spent on the orchestrator bought us something that decayed as the model improved, whereas an hour spent on the platform underneath buys something whose value the next model only deepens.