The Orchestrator Isn't Your Moat
If your team is choosing among LangGraph, CrewAI, and a hand-rolled orchestrator for its next LLM agent, let me offer another option worth sitting with: drop the orchestrator altogether.
Instead, think about shipping kits—collections of MCP tools and skills—that arm existing agents like Claude Code and Codex with the context and actions they need to operate inside your platform.
As I’ve documented over the last 8 months, orchestrating an agent is expensive in ways that aren’t obvious upfront. A LangGraph prototype will take you a day. Making that prototype production-ready is the ninety-ninety problem: the final 10% somehow eats as much time as the first 90%.
There’s chat persistence to figure out, context management to get right, system prompt tuning that never quite ends, and UI synchronization to keep the agent and the interface in agreement about what’s happening. On top of all of it sit custom evaluation harnesses, because without evaluation you’re flying blind.
By the time you reach GA, you’re maintaining a whole framework of your own. And that’s before counting the tools, platform APIs, and SDKs that actually deliver the business value to your users.
Orchestration has a half-life
Early in the life of our agent, we implemented task management utilities (WriteTodo and GetTodo) so the model could track long-term goals across subtasks. We paired them with system prompts documenting the pattern and system reminders appended after tool calls to refocus the agent’s attention.
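For readers who haven't built this kind of scaffolding, here is a minimal sketch of the pattern. Everything in it is illustrative rather than our production code: the `TodoStore` class, the `with_reminder` helper, and the `<system-reminder>` tag format are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class TodoStore:
    """In-memory task list the model reads and writes through two tools."""

    items: list[dict] = field(default_factory=list)

    def write_todo(self, todos: list[dict]) -> str:
        # WriteTodo: the model replaces the whole list on every call.
        self.items = [
            {"task": t["task"], "done": bool(t.get("done", False))} for t in todos
        ]
        return f"Saved {len(self.items)} todos."

    def get_todo(self) -> list[dict]:
        # GetTodo: return copies so callers cannot mutate the store.
        return [dict(item) for item in self.items]


def with_reminder(tool_result: str, store: TodoStore) -> str:
    """Append a reminder listing open tasks to a tool result (hypothetical format)."""
    open_tasks = [item["task"] for item in store.items if not item["done"]]
    if not open_tasks:
        return tool_result
    reminder = (
        "<system-reminder>Open todos: " + "; ".join(open_tasks) + "</system-reminder>"
    )
    return tool_result + "\n" + reminder
```

The code itself is small. The cost sits around it: the prompt text that teaches the model when to call the tools, and the evaluation runs needed to confirm the reminders help rather than distract.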
Then we upgraded to a different model.
Suddenly, the task management utilities were totally unnecessary for our workflows, as the LLM was sharp enough to remember its own steps. The system reminders just injected noise and the todo-tracking forced unnecessary ceremony.
I begrudgingly tore down the scaffolding around system reminders and task management, and the days spent tuning and testing went with it.
Anthropic themselves ran into the same pattern:
As just one example, in prior work we found that Claude Sonnet 4.5 would wrap up tasks prematurely as it sensed its context limit approaching—a behavior sometimes called “context anxiety.” We addressed this by adding context resets to the harness. But when we used the same harness on Claude Opus 4.5, we found that the behavior was gone. The resets had become dead weight.
— Anthropic, “Building Managed Agents”
It was an uncomfortable position to sit in. Do we keep investing in abstractions that move our evaluations forward, knowing the next model release might render them obsolete? It’s a bet no agent team should be forced to take.
The big players are solving your problems
While we were busy patching over our model’s weaknesses, Anthropic was solving those same weaknesses properly, inside Claude Code.
When tool outputs get too large to fit in context, Claude writes them to a file, samples what it needs with a few reads, and executes a script to extract the exact data. When there are too many tools to load without drowning the agent, the ToolSearch meta-tool defers tool loading until a task actually requires them. And when context would otherwise need to be stuffed upfront, skills let it pull in relevant information on demand.
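To make the first of those behaviors concrete, here is a rough sketch of the general idea of spilling oversized tool output to disk and reading back only a slice. It is not Claude Code's actual implementation; the function names, the character budget, and the pointer message format are all assumptions for illustration.

```python
import tempfile
from pathlib import Path

MAX_INLINE_CHARS = 2_000  # hypothetical budget for what goes into context


def fit_tool_output(output: str, spill_dir: Path, limit: int = MAX_INLINE_CHARS) -> str:
    """Return small outputs as-is; write large ones to disk and return a pointer."""
    if len(output) <= limit:
        return output
    spill_dir.mkdir(parents=True, exist_ok=True)
    with tempfile.NamedTemporaryFile(
        "w", dir=spill_dir, suffix=".txt", delete=False, encoding="utf-8"
    ) as handle:
        handle.write(output)
        path = Path(handle.name)
    # Only a short preview and the file path reach the model's context.
    preview = output[:200]
    return f"[output of {len(output)} chars saved to {path}]\nPreview:\n{preview}"


def read_slice(path: Path, start_line: int, num_lines: int) -> str:
    """Sample a window of lines from a spilled file instead of loading it all."""
    with path.open(encoding="utf-8") as handle:
        lines = handle.readlines()
    return "".join(lines[start_line:start_line + num_lines])
```

Even this toy version raises questions a harness owner has to answer and keep answering: what the budget should be for a given model, how much preview is enough, and when spilled files get cleaned up.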
Anthropic and other providers do this work continuously, and when models update, the orchestration updates with them. Now compare that to the position we were in: a small team spending scarce engineering cycles optimizing a loop for a specific model, quietly hoping none of our assumptions would shift underneath us. And what happens when we need to switch — voluntarily, because a different model starts outperforming ours on the tasks we care about, or involuntarily, because the political climate suddenly disqualifies a provider?
Framework vendors advertise “model-agnostic” abstractions, and in theory that sounds like a clean solution. In practice, different models behave differently enough that the abstraction leaks. Prompts that excel on one model are unsatisfactory on another, tool-use behaviors shift, and reasoning styles diverge in ways that the config can’t fully account for. Octomind pulled LangChain from production after a year and put it plainly: “designing abstractions that will stand the test of time is incredibly difficult.”
You’re left with two bad options: maintain the custom harness yourself at escalating cost, or wait on a framework provider whose roadmap you don’t control and whose priorities aren’t yours. Neither your team nor the framework vendor’s looks like Anthropic’s, and neither should be trying to.
Meeting the LLM in the middle
So we stop competing with frontier labs on the orchestration loop, and we ask a different question: what does a frontier agent actually need from us in order to operate inside our platform?
Really, it comes down to two things. The first is niche, domain-specific context the model wasn’t trained on — the shape of your data, the conventions of your product, the assumptions a seasoned user would carry into any task without thinking about them. The second is a way to act — to retrieve information, mutate state, and move things through your system on the user’s behalf.
The first ships as Agent Skills. The second ships as an MCP server. Together, that’s a kit.
I won’t go into depth on either here, as there is a plethora of resources online (the Databricks AI Dev Kit is a good example). Instead, I’ll anticipate the natural objection: skills and tools are the 2026 fad, and what happens when the fad passes?
The beauty of improving these resources offered to the agent is that doing so improves your entire tech stack. Good skills come from clear, concise documentation, which is exactly what your human developers and your platform’s consumers already need. Good tools come from APIs with sharp contracts and expressive parameters, which every downstream client — human or agent — benefits from. If the agent interface fad dies tomorrow, what you’re left with is better docs and better APIs!
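As a small illustration of what a sharp contract can look like, here is a sketch of a tool definition with explicit types, bounds, and enumerated values, plus a deliberately tiny validator. The `name`, `description`, and `inputSchema` fields mirror the shape of an MCP tool definition, which uses JSON Schema for inputs. The `query_table` tool, its parameters, and the `validate_args` helper are hypothetical, and a real service would more likely lean on a full JSON Schema library than on this hand-rolled subset.

```python
QUERY_TABLE_TOOL = {
    "name": "query_table",  # hypothetical tool, for illustration only
    "description": "Return up to `limit` rows from one table.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "table": {"type": "string", "description": "Qualified name, e.g. sales.orders"},
            "limit": {"type": "integer", "minimum": 1, "maximum": 500},
            "order": {"type": "string", "enum": ["asc", "desc"]},
        },
        "required": ["table", "limit"],
    },
}

_PY_TYPES = {"string": str, "integer": int}


def validate_args(tool: dict, args: dict) -> list[str]:
    """Return human-readable contract violations; an empty list means valid."""
    schema = tool["inputSchema"]
    props = schema["properties"]
    errors = [
        f"missing required argument: {name}"
        for name in schema["required"]
        if name not in args
    ]
    for name, value in args.items():
        spec = props.get(name)
        if spec is None:
            errors.append(f"unknown argument: {name}")
            continue
        expected = _PY_TYPES[spec["type"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            errors.append(f"{name}: expected {spec['type']}")
            continue
        if "enum" in spec and value not in spec["enum"]:
            errors.append(f"{name}: must be one of {', '.join(spec['enum'])}")
        if "minimum" in spec and value < spec["minimum"]:
            errors.append(f"{name}: must be >= {spec['minimum']}")
        if "maximum" in spec and value > spec["maximum"]:
            errors.append(f"{name}: must be <= {spec['maximum']}")
    return errors
```

The error strings matter as much as the schema. An agent that receives "limit: must be >= 1" can correct itself on the next call, and so can a human reading the same message in a log.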
Your platform is the moat
What we actually realized, months into building our own harness, was that the orchestration layer was never our value center. IBM’s data platform was. The agent was just a new entrypoint for smoother and smarter access to existing platform capabilities.
Software patterns are what frontier labs are training their models to absorb. What actually differentiates a platform (the shape of your customers’ data, the compliance boundaries of your domain, the conventions your users have built up over years of working with your product) isn’t on the internet for a model to learn from. We want our offerings to be empowered by the next model release, not made redundant by it.
By piggybacking on existing agent clients, we could stop defending the orchestration layer where we had no competitive advantage, and instead pour that same effort back into the domain where we genuinely do.
When owning the harness is the right call
To be honest, there are teams to which none of this advice applies, and I want to name them.
If you’re training your own model, or competing as a model provider, you need tight coupling between the model and the loop it runs inside of — that’s obvious.
Less obvious is the case where you have hard latency floors or cost ceilings that demand loop-level optimization, or deterministic compliance gates that can’t tolerate LLM judgment in the hot path.
But even in those cases, “strict requirements” doesn’t automatically translate to “custom harness.” MCP elicitation lets a server request approval or clarification from the user in the middle of a tool call. MCP tasks let long-running operations report status back asynchronously, so the agent can react when work completes. Not every client supports these yet, but they will, and the shape of what you can accomplish without owning the loop keeps expanding.
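Here is a plain-Python sketch of the elicitation flow, written without any SDK so the shape is visible. The `elicit` callback stands in for the client round trip and is an assumption of the example, as is the `drop_table` tool. The three response actions (accept, decline, cancel) follow the MCP specification's elicitation feature as I understand it.

```python
from typing import Callable

# Stand-in for the server-to-client elicitation round trip.
# Hypothetical signature: (message, requested_schema) -> response dict.
Elicit = Callable[[str, dict], dict]


def drop_table(table: str, elicit: Elicit) -> str:
    """Destructive tool that pauses mid-call to ask the user for confirmation."""
    response = elicit(
        f"Permanently drop table {table}?",
        {
            "type": "object",
            "properties": {"confirm": {"type": "boolean"}},
            "required": ["confirm"],
        },
    )
    # Anything other than an explicit accept is treated as a refusal.
    if response.get("action") != "accept":
        return f"Skipped: user did not approve dropping {table}."
    if not response.get("content", {}).get("confirm", False):
        return f"Skipped: user answered no to dropping {table}."
    return f"Dropped {table}."
```

The approval gate lives inside the tool, on the server you own, so it holds no matter which agent client is driving the loop.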
With features like these landing steadily and conferring orchestration benefits for free, the default should be extending existing agents rather than building in parallel to them.
Rising with the tide
Now we direct our engineering efforts into what’s only ours to build: APIs, SDKs, platform capabilities that no outside team can sharpen for us. And what strikes me, looking back on the months we spent otherwise, is how differently the same hour of work behaves depending on which side of this line it falls on. An hour spent on the orchestrator bought us something that decayed as the model improved, whereas an hour spent on the platform underneath buys something whose value the next model only deepens.
A litmus test I use now, before writing any piece of orchestration: Would a ten-point jump in the next model’s reasoning benchmark make this code redundant? If the answer is probably yes, I look for a way to express the need as a tool contract or a skill instead, where the same intelligence jump makes the work better rather than obsolete.
When you own the orchestrator, a model update is a kind of invoice: prompts to rewrite and behaviors to re-tune, a slow accumulation of small debts that must be paid each time the frontier moves. When you don’t, the same upgrade comes as a gift. You put your effort into streamlining the programmatic interfaces to your platform and, voilà, the latest LLM agent just works.