Discussion about this post

JP

Qwen has moved fast since this list went out. The 3.5 release dropped with a 4B and 9B that genuinely trade blows with models two to three times their size. The 4B especially is punching well above its weight class for local inference. Surprised it isn't getting more attention. Wrote up the full breakdown with benchmarks and how to get it running: https://reading.sh/your-laptop-is-an-ai-server-now-370bad238461?sk=1cf7a4391e614720ecbd6e9bc3f076a2

Pawel Jozefiak

This list is useful, but the real question is: which SLM for which task? I ran experiments across Claude's model range: Haiku crushes structured tasks, email, and code execution. Opus only wins on multi-step reasoning.

The benchmark that matters isn't aggregate scores. It's task-specific performance vs cost. Some of these smaller models beat larger ones on narrow domains while costing 15x less.
