
The 2026 stack: providers, models, and choosing what's load-bearing

A working architect's read on the Anthropic / OpenAI / Microsoft landscape in 2026. Where each one is strong, where each one is overhyped, and how to think about the choice when it's actually yours to make.

The 2026 model landscape looks different than the 2024 one. Three things changed: model quality converged in the bands most enterprises care about, the system around the model started to matter more than the model itself, and the buyer realized they don't have to pick one.

This brief is about how to think about the stack when you're the one choosing.

The shape of the field

Three vendors do the heavy lifting in enterprise agent work right now:

  • Anthropic: the Claude family (Opus, Sonnet, Haiku, most current generation: 4.x), the Claude Agent SDK, MCP, and the new Managed Agents service for long-horizon work.
  • OpenAI: the GPT family, Assistants and Realtime APIs, deep integration with Microsoft and a sprawling third-party tooling ecosystem.
  • Microsoft: the Copilot family, Copilot Studio, Foundry, the Frontier Suite, the IQ stack covered in the Work IQ brief, and a posture that increasingly says "we'll route to whichever model fits" rather than "we are an OpenAI distributor."

There are good models from Google, Meta, Mistral, and others. They show up in the enterprise selectively. The center of gravity in 2026 is the three above, and Microsoft is increasingly the routing layer between the other two.

What converged and what didn't

The thing that makes 2026 model choice less fraught than 2024 is that the floor of acceptable quality is now high. For most general-purpose enterprise tasks (drafting, summarizing, classifying, retrieval-augmented Q&A) the top tier from each vendor is good enough that quality is rarely the differentiator. You can run a real workload against any of them and it will work.

The ceiling still matters in three areas:

  • Long-horizon agentic work. Sustained reasoning over hours and dozens of tool calls is still differentiated. Anthropic's Cowork-style sessions and OpenAI's deep-research-class agents are not commodity.
  • Code. Specialized coding models have pulled ahead of general ones for software work specifically. This is the fastest-moving frontier.
  • Cost-quality at scale. When you're running millions of calls, the difference between a $3-per-million-tokens model and a $30 one is everything. The cheap-and-good band is where new entrants compete hardest (worked numbers follow this list).
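
To make the scale point concrete, a back-of-envelope sketch; every number in it is an illustrative assumption, not a quote from any price list:

```python
# Back-of-envelope cost comparison; all figures are illustrative assumptions.
calls_per_month = 10_000_000
tokens_per_call = 2_000                                # prompt + completion, rough average
tokens_per_month = calls_per_month * tokens_per_call   # 20 billion tokens

cheap_rate = 3 / 1_000_000      # $3 per million tokens
premium_rate = 30 / 1_000_000   # $30 per million tokens

print(f"cheap:   ${tokens_per_month * cheap_rate:,.0f}/month")    # $60,000
print(f"premium: ${tokens_per_month * premium_rate:,.0f}/month")  # $600,000
```

At that volume the model tier is a half-million-dollar-a-month decision before quality even enters the conversation.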

For the rest, choose on operational fit, not benchmark scores.

How to choose in practice

A working frame I use:

1. Where does the work need to live?

If the answer is inside Microsoft 365, you start with Microsoft. Cowork, Copilot Studio, Foundry: these are the path of least resistance and they inherit the governance story for free. The model under the hood may be OpenAI or Anthropic; you don't always pick.

If the answer is "our own product or internal tooling," you have a real choice. Pick on:

  • Provider stability (track record, region, data residency).
  • SDK and tool-use ergonomics for what you're building.
  • Cost at the volume you actually expect to hit.
  • Whether MCP is supported natively or you're going to bolt it on (see the sketch after this list).
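
For a feel of what "supported natively" buys you, here's a minimal tool server using the official MCP Python SDK's FastMCP helper. The server name and tool logic are placeholders; the point is that the definition lives in the protocol, not in any one provider's schema:

```python
# Minimal MCP server: one tool, defined once, usable by any MCP-capable client.
# Requires the official Python SDK: pip install mcp
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("order-lookup")  # hypothetical internal service name

@mcp.tool()
def get_order_status(order_id: str) -> str:
    """Return the current status of an order by ID."""
    # Placeholder logic; a real server would call your order system here.
    return f"Order {order_id}: shipped"

if __name__ == "__main__":
    mcp.run()  # speaks MCP over stdio by default
```

Any MCP-capable client can discover and call this tool without provider-specific glue, which is exactly the portability you're buying.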

2. What's the budget shape?

Three bands matter:

  • High volume, simple tasks. Classification, extraction, simple summarization. Cheap-and-good models: Claude Haiku, OpenAI Mini-class, equivalents. The quality difference is marginal; the cost difference is 10x.
  • Medium volume, knowledge work. Drafting, structured analysis, multi-step reasoning. Mid-tier models: Sonnet-class. The sweet spot for most enterprise work.
  • Low volume, hard problems. Complex agentic work, long-horizon planning, sensitive analysis. Top-tier models: Opus-class. Spend where it matters.

Most architectures should mix: a tier-1 router on a cheap model handing off to a Sonnet-class model for the actual work, with Opus-class reserved for the hard cases. Single-model deployments are usually a sign that nobody asked the cost question.
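
A minimal sketch of that shape; the model names and the difficulty score are stand-ins for whatever triage signal you actually have (a heuristic, or a tier-1 classifier call):

```python
# Tiered routing: cheap model triages, mid-tier does the work, top tier is the escape hatch.
# Model names are illustrative placeholders, not a recommendation.
from dataclasses import dataclass

TIERS = [
    (0.3, "cheap-model"),  # classification, extraction, simple summaries
    (0.8, "mid-model"),    # drafting, structured analysis
    (1.0, "top-model"),    # long-horizon, high-stakes work
]

@dataclass
class Request:
    prompt: str
    difficulty: float  # 0.0-1.0, from a heuristic or a tier-1 classifier call

def route(req: Request) -> str:
    """Pick the cheapest tier whose ceiling covers the request."""
    for ceiling, model in TIERS:
        if req.difficulty <= ceiling:
            return model
    return TIERS[-1][1]

print(route(Request("Classify this ticket", difficulty=0.1)))  # cheap-model
print(route(Request("Draft the quarterly analysis", 0.6)))     # mid-model
print(route(Request("Plan the migration end to end", 0.95)))   # top-model
```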

3. What's the failure cost?

If a wrong answer means a slightly worse search result, choose for cost and latency. If a wrong answer means a regulatory breach or a customer-impacting error, choose for the strongest model you can afford and add a maker-checker pair on top of it. Match the model investment to the consequence.
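The maker-checker pair is simple to sketch. This is a shape, not an implementation: call_model() stands in for your provider client, and the checker should ideally be a different model than the maker:

```python
# Maker-checker: one model drafts, a second (ideally different) model verifies.
# call_model() is a placeholder for whatever client SDK you use.

def call_model(model: str, prompt: str) -> str:
    raise NotImplementedError  # wire up your provider client here

def maker_checker(task: str, maker: str = "mid-model", checker: str = "top-model") -> str:
    draft = call_model(maker, task)
    verdict = call_model(
        checker,
        f"Task:\n{task}\n\nDraft answer:\n{draft}\n\n"
        "Reply APPROVE if the draft is correct and safe to send, "
        "otherwise reply REJECT with a one-line reason.",
    )
    if verdict.strip().upper().startswith("APPROVE"):
        return draft
    # Route rejections to a human queue or a bounded retry, never a silent infinite loop.
    raise ValueError(f"Checker rejected draft: {verdict}")
```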

4. What's the multi-vendor story?

In 2026, the assumption that you'll be on one provider for the next five years is wrong. Build assuming you'll route to multiple. That means:

  • Abstraction at the right layer: tool definitions through MCP, prompts as data, model selection as configuration (sketched after this list).
  • Don't lean on provider-specific features (custom training formats, proprietary tool schemas) unless the value is overwhelming.
  • Plan for the day Microsoft routes Cowork to a model you didn't pick. Because they will, and you'll want your governance attached to the action, not to the model.
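
"Model selection as configuration" can be as small as this: a config the code never hardcodes around. A sketch, with made-up provider and model names:

```python
# Illustrative config: which provider/model backs each workload tier.
# Swapping providers becomes a config change plus an eval run, not a code change.
import yaml  # pip install pyyaml

MODELS_YAML = """
tiers:
  triage:   {provider: provider-a, model: cheap-model}
  standard: {provider: provider-b, model: mid-model}
  critical: {provider: provider-a, model: top-model}
"""

def model_for(tier: str) -> tuple[str, str]:
    cfg = yaml.safe_load(MODELS_YAML)  # in production, read models.yaml from disk
    entry = cfg["tiers"][tier]
    return entry["provider"], entry["model"]

print(model_for("standard"))  # ('provider-b', 'mid-model')
```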

What's overhyped and what's underhyped right now

Overhyped. Benchmark deltas of a few points between top-tier models. They almost never show up in real workloads. The teams that obsess over benchmarks are usually the ones that haven't shipped anything yet.

Underhyped. The operational difference between providers: observability, debugging tools, audit log quality, region availability, support response. This is what determines whether you can run a workload at scale, and it doesn't show up on leaderboards.

Overhyped. "AI replacement" framing. Not because it never happens but because the framing distorts what's actually shipping (compression of specific tasks within roles, not deletion of roles).

Underhyped. MCP and the protocol-first stack. Quietly the most important infrastructure decision of the year.

Overhyped. Custom-trained foundation models for most enterprises. Outside of regulated verticals with mountains of unique data, fine-tuning a frontier model rarely beats good prompting plus retrieval.

Underhyped. Evaluations. Almost everyone ships agents without an eval suite that resembles the production workload. Three months in, when something starts drifting, the teams without evals are flying blind.

In your M365 environment

If you're doing anything serious with Cowork, Copilot Studio, or any of the IQ stack:

  • Treat Microsoft as your routing layer, not as your provider commitment. Frontier Suite Cowork can run on Anthropic models today. That's a feature; design for it.
  • Stand up your own eval harness early. Even a small one: twenty representative tasks scored weekly. The day Microsoft swaps the model under your Cowork sessions, you want to be the team that can quantify what changed instead of guessing (a minimal harness is sketched after this list).
  • Maintain a parallel "outside M365" agent path for the workloads that don't need to live inside the tenant. Custom code, MCP-first, your choice of provider. This is your hedge, and it doubles as skills development for the team.
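
A minimal version of that harness, assuming a placeholder run_agent() entry point and crude substring grading; real suites grade with rubrics or a judge model, but even this catches a silent model swap:

```python
# Minimal weekly eval harness: fixed tasks, scored pass/fail, results logged over time.
# run_agent() is a placeholder for your actual agent entry point.
import csv
import datetime

TASKS = [
    # (task prompt, substring a correct answer must contain) -- illustrative
    ("Summarize ticket #123 in one sentence.", "refund"),
    ("Extract the invoice total from: ... Total: $4,200 ...", "4,200"),
]

def run_agent(prompt: str) -> str:
    raise NotImplementedError  # call your Cowork or custom agent here

def run_suite(path: str = "eval_log.csv") -> float:
    passed = 0
    rows = []
    for prompt, must_contain in TASKS:
        output = run_agent(prompt)
        ok = must_contain.lower() in output.lower()
        passed += ok
        rows.append([datetime.date.today().isoformat(), prompt[:40], ok])
    with open(path, "a", newline="") as f:
        csv.writer(f).writerows(rows)  # append so you can trend pass rates week over week
    return passed / len(TASKS)
```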

The 2026 stack rewards architects who hold opinions loosely about specific models and tightly about the shape of the system around them. The model is the part that will change. The shape is the part you have to get right.


Sources: Anthropic Release Notes · Microsoft Frontier Suite announcement · Anthropic: Building agents with the Claude Agent SDK · Insight to Execution: Foundry IQ, Work IQ · Best agentic AI platforms 2026