Beyond Vibe Testing: Smarter Eval for Agentic AI
In this episode, we break down why enterprise AI agents fail in the wild—and how smarter eval frameworks with ITC tactics can close the performance gap.
In Episode 5 of Inference Time Tactics, Rob May, Calvin Cooper, and Byron Galbraith unpack Salesforce’s CRMArena-Pro benchmark and what it reveals about the reliability gap in agentic AI systems.
Benchmarks look impressive on paper, but they are weak predictors of production performance. Performance drops sharply on multi-turn tasks, outputs vary from run to run, and “vibe testing” can’t scale in enterprise settings.
This episode dives into:
Why CRMArena-Pro shows the limits of today’s benchmarks in real enterprise settings
The stochastic nature of LLMs and why reliability—not raw capability—is the gating factor for adoption
How inference-time tactics reduce variance and unlock more stable workflows
Latency, rate limits, and cost as structural barriers to scaling agentic systems
What Neurometric is building: the ITC Test Engine and a drag-and-drop interface for rapid visualization and experimentation
Listen & subscribe: