Episode 4: GPT-5, The $100B Gap, and Why Inference-Time Compute Just Went Mainstream
Companion to Episode 4 of our podcast, Inference Time Tactics
Listen and Subscribe here: Inference Time Tactics
TL;DR
A viral thread called the AI stack a house of cards. Grok summarized it clearly: the numbers are illustrative, but the dynamic is real. Venture capital has been subsidizing AI user economics. That can’t last. The gap between cost-to-serve and user revenue is massive.
GPT-5 made inference-time routing visible: more prompts were handled by smaller, cheaper models. Users pushed back.
Inference-Time Compute (ITC) is emerging as the structured, programmable response to cost, latency, and reliability pressure.
Our alpha preview launches soon (leaderboard + strategy explorer). A full ITC framework to follow.
The News Peg: Routing Went Public
In April, Sam Altman noted that small prompt additions like “please” and “thank you” added tens of millions in annual cost. With GPT-5, the most notable change wasn’t performance — it was architecture.
More requests were clearly being routed to smaller, faster, cheaper models: a shift in stack design to manage cost and latency. Users noticed and pushed back. OpenAI responded by bringing back 4o and the model picker. Routing had gone public.
The story isn’t outrage. It’s architecture. Prompts are now triaged: some go to lightweight models, others to deeper reasoning paths. The trade-offs are visible.
A year ago, inference-time compute (or test-time compute) was a niche topic in research circles. Now it’s showing up in mainstream reporting on GPT-5, cloud infra strategy, and slowing model progress.
Why Now? The Economic Reality
AI products feel magical because VC is eating the cost. A now-viral thread broke down the gap: users pay ~$200/year. The app pays OpenAI ~$500. OpenAI pays AWS ~$1,000. AWS buys $10K GPUs. VC dollars fund the gaps.
Grok’s summary: the numbers are illustrative, but the dynamic is real. OpenAI and others are subsidizing massive amounts of user compute — backed by multi-billion-dollar revenue projections and massive capex.
That doesn't collapse tomorrow. But as capital tightens and demand keeps climbing, the gap compounds into a $100B-scale problem.
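To see why that scale is plausible, here is a back-of-the-envelope version of the thread's chain. The figures are the same illustrative per-user numbers quoted above, and the 100M-user multiplier is an assumption for illustration, not a measured user count.

# Back-of-the-envelope math using the thread's illustrative per-user numbers.
# None of these figures are audited; they only show how per-user subsidies
# compound into a system-level gap.

user_pays_app   = 200    # $/user/year
app_pays_openai = 500
openai_pays_aws = 1000

subsidy_per_user = (app_pays_openai - user_pays_app) + (openai_pays_aws - app_pays_openai)
print(subsidy_per_user)                  # 800 $/user/year covered by outside capital

# At hundreds of millions of active users (assumed here as 100M), the per-user
# gap lands in the tens-of-billions range: the "$100B-scale" risk in question.
print(subsidy_per_user * 100_000_000)    # 80,000,000,000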
Inference Time Compute Gains Traction
Inference-Time Compute is a runtime layer that enables structured decisions: routing requests across models, allocating more or less “thinking,” caching repeat answers, enforcing latency and cost budgets. It turns opaque model behavior into governed, testable, repeatable systems.
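As a concrete illustration, here is a minimal sketch of what such a runtime layer can look like. The model names, prices, the score_complexity heuristic, and the budget defaults are all hypothetical; a production router (GPT-5's included) learns these decisions rather than hard-coding them.

# Minimal sketch of an inference-time routing layer (illustrative only).
import hashlib

MODELS = {
    "small": {"cost_per_1k_tokens": 0.0005, "typical_latency_s": 0.4},
    "large": {"cost_per_1k_tokens": 0.0150, "typical_latency_s": 2.5},
}

_cache = {}  # prompt hash -> cached answer


def score_complexity(prompt: str) -> float:
    """Toy heuristic: longer or more reasoning-heavy prompts get more 'thinking'."""
    signals = ["step by step", "prove", "analyze", "code", "plan"]
    base = min(len(prompt) / 2000, 1.0)
    boost = 0.25 * sum(1 for s in signals if s in prompt.lower())
    return min(base + boost, 1.0)


def route(prompt: str, max_cost: float = 0.01, max_latency_s: float = 3.0) -> dict:
    """Pick a model under explicit cost and latency budgets, with caching."""
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in _cache:  # repeat answers are free
        return {"model": "cache", "answer": _cache[key], "cost": 0.0}

    choice = "large" if score_complexity(prompt) > 0.5 else "small"

    # Enforce budgets: fall back to the cheaper path if the pick would bust them.
    est_cost = MODELS[choice]["cost_per_1k_tokens"] * len(prompt) / 1000
    if est_cost > max_cost or MODELS[choice]["typical_latency_s"] > max_latency_s:
        choice = "small"
        est_cost = MODELS[choice]["cost_per_1k_tokens"] * len(prompt) / 1000

    answer = f"[{choice} model response]"  # stand-in for a real model call
    _cache[key] = answer
    return {"model": choice, "answer": answer, "cost": est_cost}


if __name__ == "__main__":
    print(route("What's the capital of France?"))                               # -> small
    print(route("Prove step by step that the halting problem is undecidable."))  # -> large

The point of the sketch is not the heuristic itself but that every decision is explicit: which model, why, at what estimated cost, under which budget. That is what makes the behavior governed, testable, and repeatable.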
In Episode 4, we explore GPT-5’s router as the tell: OpenAI is shifting toward learned routing to balance cost, latency, and user experience. Others are following. The next big optimization isn't just bigger models — it’s smarter inference.
Beyond Chat: Agents, Infra, and Placement
As agents become the primary interface for users, they generate more model calls than any human could. Inference becomes the dominant form of compute.
That changes infra design: compute gets placed closer to users for latency, or centralized for heavy reasoning. And it forces a new constraint: inference must be observable, composable, and policy-controlled.
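For instance, "policy-controlled" can be as simple as a declarative object that the routing layer must respect and log on every call. The field names below are illustrative assumptions, not any particular vendor's API.

# Hypothetical example of a declarative inference policy.
from dataclasses import dataclass, field


@dataclass
class InferencePolicy:
    """Per-route constraints an inference layer can enforce and log."""
    max_latency_ms: int = 3000          # hard latency budget
    max_cost_usd: float = 0.01          # per-request spend ceiling
    allowed_models: list = field(default_factory=lambda: ["small", "large"])
    placement: str = "edge"             # near users for latency, or "central" for heavy reasoning
    log_decisions: bool = True          # every routing choice stays observable


checkout_policy = InferencePolicy(max_latency_ms=800, allowed_models=["small"])
research_policy = InferencePolicy(max_latency_ms=30000, max_cost_usd=0.50, placement="central")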
Programmable inference isn't optional at scale. It’s how AI remains viable as usage explodes.
What We’re Releasing
Inference is infrastructure. We started NeuroMetric to help teams more easily test, build, and deploy inference-time compute strategies. Our public alpha goes live soon.
It includes:
CRMArenaPro leaderboard with strategy comparisons.
A visual Strategy Explorer to test LLM-as-Judge (a minimal sketch of the pattern follows this list).
Much more to follow.
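For readers unfamiliar with the term, LLM-as-Judge simply means asking a model to grade or compare candidate answers so that strategies can be scored automatically. Below is a rough sketch of the pattern; the call_model stub and the prompt wording are placeholders, not NeuroMetric's actual implementation.

# Rough sketch of the LLM-as-Judge pattern.
def call_model(prompt: str) -> str:
    """Placeholder for a real LLM API call; wire in your own client here."""
    return "A"  # pretend the judge preferred candidate A


def judge(task: str, answer_a: str, answer_b: str) -> str:
    """Ask a (usually stronger) model which candidate answer better solves the task."""
    prompt = (
        "You are grading two answers to the same task.\n"
        f"Task: {task}\n\n"
        f"Answer A:\n{answer_a}\n\n"
        f"Answer B:\n{answer_b}\n\n"
        "Reply with exactly 'A' or 'B' for the better answer."
    )
    verdict = call_model(prompt).strip().upper()
    return verdict if verdict in ("A", "B") else "A"  # default on malformed output


# Comparing two inference strategies on the same task:
winner = judge(
    "Summarize the refund policy in one sentence.",
    answer_a="Refunds within 30 days with receipt.",
    answer_b="Our policy is customer-first.",
)
print(f"Judge preferred answer {winner}")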
Further Reading & Listening
Alpha preview coming soon: neurometric.ai