009. The Missing Middle
Why the AI retail layer hasn't arrived — and what's actually blocking it
Recently, this publication examined what OpenAI’s advertising pivot revealed about AI market structure. The picture that emerged was a split: wholesale infrastructure thriving, retail applications struggling. API providers and cloud platforms capture most enterprise AI spend; consumer-facing subscriptions face brutal conversion economics, with advertising now emerging as a revenue fallback.
But that analysis left a question open. If the retail layer is struggling, what would a viable retail model actually look like?
Small Language Models — SLMs — offer a possible answer. These are models small enough to run on company hardware without cloud dependency, trained or fine-tuned for specific tasks rather than general capability. The economics look different from both the API metering that makes wholesale AI expensive at scale and the subscription model that consumers resist. A task-specific SLM costs more upfront to develop but little to run once deployed. Data stays inside the organisation. Traditional software economics — the kind that sustained the industry for decades — could apply.
The production pathway exists too. Frontier LLMs turn out to be extraordinarily useful for creating smaller models: generating training data, providing targets for distillation, evaluating fine-tuned outputs. Microsoft’s Phi-4 at 14 billion parameters matches GPT-4 on reasoning benchmarks — a model roughly one-hundredth the size achieving task-specific parity. DeepSeek’s distilled models demonstrate the same pattern at industrial scale. The factory works.
And the demand is real. Enterprise security teams block nearly 60% of AI-related transactions over data exposure concerns. SLMs running on-premises sidestep that objection entirely. Regulated industries — healthcare, finance, government — are adopting early, precisely because keeping data in-house matters more than frontier capability for most bounded tasks.
So: viable economics, working production methods, demonstrated demand, and a capability threshold that’s clearly been crossed. Why hasn’t the SLM retail layer emerged?
The answer isn’t capability. The models work. It isn’t cost — inference hardware has become cheap enough that a $200 Raspberry Pi setup can run useful local models. It isn’t even the security and governance concerns that slow LLM adoption, since SLMs sidestep the worst of those.
The answer is coordination.
A contract review SLM, a compliance checking SLM, and a document summarisation SLM don’t naturally work together. Nothing coordinates them. The middleware that would let enterprises deploy portfolios of specialist models — routing queries to the right specialist, checking outputs between steps, managing context across models with limited windows, handing off gracefully when a specialist fails — doesn’t exist yet.
This article examines that gap: what SLM orchestration would require, why current tools don’t provide it, and what the absence means for when — and whether — the AI retail layer arrives.
Good enough for the job
The case against small models used to be simple: they weren’t good enough. Frontier capability required frontier scale. If you wanted reliability on complex tasks, you paid for the large model or accepted worse results.
That case has eroded faster than most observers expected.
Microsoft’s Phi-4, released in late 2024, showed that a 14-billion-parameter model could match GPT-4 on mathematical reasoning benchmarks. Not approach it. Match it. The model is roughly one-hundredth the size of frontier offerings, runs on hardware that companies already own, and achieves parity on the specific capability it was tuned for.
This isn’t an isolated result. DeepSeek’s distillation work showed that smaller models trained on outputs from larger ones could inherit much of their capability at a fraction of the cost to run. The pattern has repeated across labs: Meta’s Llama variants, Alibaba’s Qwen family, Microsoft’s continuing Phi series. Each generation closes the gap between “small enough to deploy anywhere” and “capable enough to be useful.”
The key insight is task-specificity. A general-purpose model needs to handle everything from poetry to protein folding. A model fine-tuned for contract clause extraction needs to handle contract clause extraction. The capability bar for the second task is lower, and small models can clear it.
Apple understood this early. Their on-device Foundation Models — roughly 3 billion parameters — ship on hundreds of millions of devices. These aren’t toy demonstrations. They power writing assistance, notification summaries, and contextual suggestions across the Apple ecosystem. The models use an “Adapters” architecture that allows task-specific tuning without retraining the base model, letting Apple deploy what amounts to a portfolio of specialists sharing common infrastructure.
This is proof at consumer scale. If Apple can ship useful AI capability on a phone without cloud dependency, the technical barrier to company deployment has clearly fallen.
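For readers who want to picture the pattern, a rough open-source equivalent of the shared-base-plus-adapters design looks like the sketch below. It uses the Hugging Face transformers and peft libraries as a stand-in for Apple’s proprietary stack; the base model and adapter paths are placeholders, not anything Apple ships.

```python
# Sketch of the shared-base-plus-adapters pattern using open-source tooling.
# The base model and adapter paths are illustrative placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "microsoft/phi-2"  # a small open model standing in for the ~3B base

tokenizer = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE)

# One frozen base, with small task-specific adapters layered on top.
model = PeftModel.from_pretrained(base, "adapters/summarise", adapter_name="summarise")
model.load_adapter("adapters/writing-assist", adapter_name="write")

# Swapping specialists is a cheap adapter switch, not a full model reload.
model.set_adapter("summarise")   # handle a notification-summary request
model.set_adapter("write")       # handle a writing-assistance request
```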
The hardware story reinforces this. Raspberry Pi’s AI HAT+ 2, released in January 2026, runs compressed versions of Llama and other open models for under $200 in total hardware cost. Enterprise-grade inference hardware from vendors like Nvidia costs more but delivers proportionally more — and crucially, it’s hardware that many companies already have for other purposes. The marginal cost of running SLMs on infrastructure a company already owns is close to trivial.
None of this means small models match frontier capability across the board. They don’t. For open-ended reasoning, novel problem-solving, and tasks requiring broad world knowledge, larger models retain clear advantages. But most company workflows aren’t open-ended. They’re bounded, repetitive, and specific — exactly the territory where task-tuned SLMs compete effectively.
The capability threshold has been crossed. The question is why crossing it hasn’t been enough.
The coordination problem
Picture a company deploying SLMs the way the economics suggest it should: a portfolio of specialists, each tuned for a specific task. One handles contract review. Another checks regulatory compliance. A third summarises documents for executive briefing. Each model is small, fast, and good at its job.
Now picture what happens when a real workflow touches all three.
A supplier contract arrives. The contract review model extracts key terms — payment schedules, liability clauses, termination conditions. But it has no way to flag whether those terms create compliance exposure. The compliance model could check, but it doesn’t know what the contract model found unless something passes the information along. The summarisation model could produce an executive brief, but only if it receives both the original document and the analysis from the first two models — and its context window may not fit all of that.
This is the coordination problem. Individual SLMs work. Connecting them into coherent workflows doesn’t — not without infrastructure that doesn’t yet exist.
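To make the gap concrete, here is a hedged sketch of the glue code a team would have to write today to chain those three specialists. The run_slm helper is hypothetical, a stand-in for whatever local inference call an organisation uses, and the comments mark where the missing coordination layer would need to step in.

```python
# Hand-rolled glue code for the contract workflow described above.
# run_slm() is a hypothetical local-inference helper, not a real API.

def run_slm(model: str, prompt: str) -> str:
    """Hypothetical wrapper around local inference for a named specialist."""
    raise NotImplementedError("stand-in for a local inference call")

def process_supplier_contract(document: str) -> str:
    terms = run_slm("contract-review", f"Extract key terms:\n{document}")
    # Gap 1: no confidence score, so nothing says whether extraction succeeded.

    exposure = run_slm("compliance-check", f"Assess compliance exposure:\n{terms}")
    # Gap 2: nothing validates `terms` before the hand-off, and nothing reroutes
    # if the compliance specialist produces an unusable answer.

    brief = run_slm(
        "summarise",
        f"Document:\n{document}\n\nTerms:\n{terms}\n\nExposure:\n{exposure}",
    )
    # Gap 3: the combined prompt can easily exceed the summariser's context window.
    return brief
```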
What current tools provide
The market hasn’t ignored multi-model deployment entirely. A category called “AI Gateways” has emerged, with well-funded startups (Nexos.ai, Portkey, Vellum) and established vendors (Kong, F5) offering solutions. These tools solve real problems:
Unified access: One API endpoint connecting to hundreds of models across providers
Cost control: Tracking spend, setting budgets, routing to cheaper models where appropriate
Governance: Audit logs, access controls, compliance features
Basic failover: If one provider goes down, route to another
This is valuable infrastructure. It’s also the wrong level of abstraction for SLM coordination.
AI Gateways treat models as interchangeable endpoints to be managed. The problem they solve is “how do we govern access to AI across the organisation?” The problem SLM portfolios create is “how do we make specialist models work together on complex tasks?”
These are different problems. Governance is about controlling access. Coordination is about enabling collaboration.
What coordination would actually require
The gap becomes clearer when we list what SLM orchestration needs:
Confidence-aware routing. Not every query suits every model. A coordination layer needs to assess which specialist should handle a given input — and recognise when no specialist fits, passing the task to a larger model or a human. Current gateways route by cost, latency, or provider availability. They don’t route by task fit or model confidence.
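A minimal sketch of that routing, under the assumption that each specialist can expose a task-fit score from a lightweight classifier or its own calibration; none of the names below belong to an existing framework.

```python
# Sketch of confidence-aware routing: score task fit per specialist and escalate
# when no specialist clears the threshold. Illustrative only.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Specialist:
    name: str
    fit: Callable[[str], float]    # task-fit score in [0, 1], an assumed interface
    run: Callable[[str], str]

def escalate(query: str) -> str:
    # Placeholder: hand off to a larger model or a human review queue.
    return "ESCALATED"

def route(query: str, specialists: list[Specialist], threshold: float = 0.7) -> str:
    best = max(specialists, key=lambda s: s.fit(query))
    if best.fit(query) < threshold:
        return escalate(query)     # no specialist fits well enough
    return best.run(query)
```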
Checking outputs between steps. When the contract model passes output to the compliance model, something should check whether that output is usable. Did the extraction succeed? Is the confidence high enough to proceed? Traditional middleware assumes deterministic components: you send a message, you get a known response. SLMs don’t work that way. Their outputs vary. Orchestration needs to handle that variance.
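A sketch of such a gate, assuming the extraction specialist can be made to emit JSON with a confidence field; that contract is an assumption, not an existing standard.

```python
# Sketch of a hand-off gate between steps: parse, check required fields, check
# confidence, and refuse to pass bad output downstream. Field names and the
# confidence convention are assumptions about what a specialist could emit.
import json

REQUIRED_FIELDS = {"payment_terms", "liability", "termination", "confidence"}

def gate_extraction(raw_output: str, min_confidence: float = 0.8) -> dict:
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise ValueError("extraction step returned non-JSON output") from exc
    missing = REQUIRED_FIELDS - parsed.keys()
    if missing:
        raise ValueError(f"extraction step is missing fields: {sorted(missing)}")
    if parsed["confidence"] < min_confidence:
        raise ValueError("extraction confidence too low to pass downstream")
    return parsed
```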
Managing context across models. Most SLMs under 10 billion parameters struggle with context windows beyond 4,000–8,000 tokens. A workflow spanning multiple documents quickly exceeds that limit. Orchestration needs strategies for chunking, summarising, or retrieval that let limited-context models handle document-heavy work.
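One common strategy is to chunk the material, summarise each chunk, then summarise the summaries, as sketched below. A production system would count tokens with the model’s own tokeniser rather than splitting on words; this is an approximation.

```python
# Sketch of a chunk-then-summarise strategy for a specialist limited to a few
# thousand tokens. Splitting on whitespace is a rough approximation of token count.
from typing import Callable

def chunk(text: str, max_words: int = 2500) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def summarise_long_document(text: str, summarise: Callable[[str], str]) -> str:
    partial_summaries = [summarise(piece) for piece in chunk(text)]   # map step
    return summarise("\n\n".join(partial_summaries))                  # reduce step
```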
Graceful fallback. What happens when a specialist fails? Not crashes — models rarely crash — but produces low-confidence output, or output that doesn’t meet validation criteria. The workflow needs fallback paths: try a different specialist, pass to a larger model, flag for human review. Current agent frameworks assume a capable generalist at the centre, not a team of narrow specialists where any member might need backup.
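Sketched below, one shape that could take: ordered tiers, each with its own acceptance check, ending in human review. The tiers and checks are illustrative, not drawn from any shipping framework.

```python
# Sketch of graceful fallback: try the cheap specialist first, accept its output
# only if it passes a check, and move up the chain until something does.
from typing import Callable

RunFn = Callable[[str], str]
AcceptFn = Callable[[str], bool]

def flag_for_human_review(query: str) -> str:
    # Placeholder: queue the task for a person and return a sentinel value.
    return "NEEDS_HUMAN_REVIEW"

def run_with_fallback(query: str, tiers: list[tuple[RunFn, AcceptFn]]) -> str:
    """tiers: ordered (run, accept) pairs, e.g. a specialist, then a larger model."""
    for run, accept in tiers:
        output = run(query)
        if accept(output):
            return output
    return flag_for_human_review(query)
```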
Watching systems that vary. Traditional monitoring asks “did the service respond?” AI monitoring needs to ask “did the model produce something useful?” This requires evaluation frameworks, confidence scoring, output quality measures — infrastructure that’s nascent for individual models and nearly nonexistent for multi-model workflows.
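A sketch of per-step telemetry that records usefulness alongside latency. What counts as a pass would come from whatever evaluation a team trusts, which is precisely the nascent part; the validate contract below is an assumption.

```python
# Sketch of workflow telemetry that records whether each step's output was usable,
# not just whether the model responded. The validate() contract (passed, confidence)
# is assumed; real evaluation criteria are the hard part.
from dataclasses import dataclass
from typing import Callable
import time

@dataclass
class StepRecord:
    model: str
    latency_s: float
    confidence: float
    passed: bool

def run_and_record(
    model: str,
    run: Callable[[str], str],
    validate: Callable[[str], tuple[bool, float]],
    query: str,
    log: list[StepRecord],
) -> str:
    start = time.monotonic()
    output = run(query)
    passed, confidence = validate(output)
    log.append(StepRecord(model, time.monotonic() - start, confidence, passed))
    return output
```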
What exists instead
The honest assessment: nothing currently addresses this full stack.
Workflow platforms like Vellum and LangGraph offer graph-based orchestration — you can wire models together and build control flow. But they’re designed around the “agentic LLM” pattern: a powerful generalist model that calls tools and manages subtasks. That’s different from coordinating a swarm of cheap specialists.
Companies deploying multiple SLMs today build custom coordination logic. This works for organisations with ML engineering capability, but it doesn’t scale. Each company reinvents the same patterns, makes the same mistakes, and maintains one-off infrastructure that can’t benefit from shared tooling or community knowledge.
The missing piece isn’t any single capability. It’s the integrated layer that treats “portfolio of specialists” as the primary pattern rather than an edge case.
When the retail layer arrives
The AI retail layer will emerge when someone solves the coordination problem. Not before.
This isn’t a prediction about timing. It’s a structural observation. The economics work. The capability is there. The demand exists. What’s missing is the infrastructure that lets companies deploy portfolios of specialist models without building custom orchestration from scratch.
Whoever builds that infrastructure captures significant value. Not because orchestration middleware is glamorous — it isn’t — but because it’s the bottleneck. Every company that wants to move from “we have some AI pilots” to “AI is embedded in our workflows” will need to solve the coordination problem. Most won’t want to solve it themselves.
This creates a different investment thesis than the current market reflects. Capital has flowed to model training, to inference infrastructure, to AI Gateways that solve governance. Less attention has gone to the orchestration layer — the middleware that would make SLM portfolios actually usable. That gap represents opportunity for builders and risk for companies betting on tools that solve yesterday’s problem.
What to watch
Several signals would indicate the coordination gap is closing:
New tooling. Watch for frameworks explicitly designed around multi-SLM workflows rather than single-model agents. The shift will be visible in product positioning: tools that talk about “specialist portfolios” or “model teams” rather than “AI agents with tools.”
How companies describe their AI. When organisations start describing their AI architecture as “multiple small models coordinated by X” rather than “we use GPT-4 for everything,” the pattern is landing. Case studies from regulated industries — healthcare, finance, government — will likely surface first, since they have the strongest reasons to keep data in-house.
Cloud provider positioning. Microsoft, Google, and Amazon have invested heavily in frontier model APIs. Watch whether they begin offering SLM orchestration as a service, or whether they defend API revenue by downplaying the small-model pattern. Their positioning will signal where they see the market heading.
Startup activity. The AI Gateway category emerged rapidly once the governance problem was recognised. If orchestration follows the same pattern, expect funded startups within 12–18 months explicitly targeting “SLM workflow coordination” or similar positioning.
What’s happening in China. China’s SLM deployment — particularly in robotics and manufacturing — is further along than Western coverage suggests. Watch for orchestration patterns emerging from industrial use cases where multiple embedded models must coordinate in real time.
The honest uncertainty
None of this is guaranteed to play out as described.
The coordination gap could close faster than expected if existing workflow tools adapt. It could persist longer if companies decide single large models are simpler than specialist portfolios. Frontier model prices could drop enough that the SLM economic advantage narrows. Regulatory requirements could shift the calculus in either direction.
What seems stable is the structural logic: the retail layer needs coordination infrastructure, that infrastructure doesn’t exist yet, and its absence explains more about current adoption patterns than capability gaps or cost barriers do.
That’s a testable claim. Watch with us.
Process Note
This article was developed through a multi-session partnership with Claude (Anthropic). Primary research was conducted by one Claude session, with findings checked against both ChatGPT (OpenAI) and Gemini (Google) to reduce single-source analytical bias. A discrepancy between sources — regarding whether “AI Gateway” tools address the orchestration gap — prompted targeted follow-up research, which was itself verified through an independent Claude session.
The evidence cited is drawn from public sources including company announcements, analyst reports, product documentation, and technical benchmarks. Market projections carry the usual caveats about analyst variance.
Ruv works extensively with Claude and has an interest in AI adoption patterns succeeding. This creates potential bias when discussing market structure. Readers should weigh claims accordingly.
The authors wish to acknowledge the work of Mike D, MrComputerScience, who writes Pithy Cyborg | AI News Made Simple. His recent reportage on Raspberry Pi running AI rekindled some thoughts about the role of SLMs in enterprise retail. Any errors are the authors’ own.
Version: V1.0 (Final, 22 January 2026)
Attribution: Ruv Draba and Claude (Anthropic), Reciprocal Inquiry
License: CC BY-SA 4.0 — Free to share, adapt, and cross-post with attribution; adaptations must use same license.
Disclaimer: Ruv receives no compensation from Anthropic. Anthropic takes no position on this work.
Reciprocal Inquiry explores human-AI partnership for analysis and publication. For more, visit Reciprocal Inquiry on Substack.


