I Read 200 Test Time Compute Paper Abstracts. Here Is What I Learned.
The problems and opportunities of the third scaling law
It started with an internal debate here at Neurometric about the terminology we should use. The terms “test time compute,” “inference time compute,” “test time scaling,” “inference time scaling,” and a few other variations are used mostly interchangeably. We decided it was important to standardize our marketing language, so I started reviewing test time compute research papers to gauge how often each term was used. But it turned into something bigger than that.
If you aren’t familiar with the concept, model builders divide the world into two pieces: training time and test time. Everything that happens when you run a model is “test time,” even though those of us using models for inference don’t think of it as a test when we run them in production systems.
It’s important because people are calling this the “third scaling law.” Below is a screenshot from NVIDIA’s most recent developer conference, where Jensen Huang discussed why this matters. For those of you who are new to test time compute: that white flow chart he is showing could be any one of dozens to hundreds of options for making these models perform better, depending on the specifics of the situation.
Skimming a few papers was interesting enough that I went down a rabbit hole and ended up reading the abstracts of over 200 papers. I also read in full the 15 papers I thought were the most interesting of the bunch. I came away with a lot of takeaways about the test time compute space.
In 2023 and prior, there wasn’t a ton of action, and what existed focused mainly on establishing that test time compute was a thing at all and on comparing it to training. In 2024, papers were largely surveys of the landscape and studies of the most popular approaches, like Best of N and Chain of Thought. In 2025, papers have focused on new algorithms, expanding the definition of TTC, and managing computation budgets in TTC deployments.
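Since Best of N comes up constantly in these papers, it helps to see how simple the core idea is. Below is a minimal sketch in Python; the `generate` and `score` functions are placeholders I made up for illustration, standing in for a real model call and a real verifier or reward model:

```python
import random

# Placeholder for a real model call; any sampling-capable
# generate(prompt) -> str function would slot in here.
def generate(prompt: str, temperature: float = 0.8) -> str:
    return f"candidate answer {random.randint(0, 9)}"

# Placeholder for a verifier or reward model that rates a candidate.
def score(prompt: str, answer: str) -> float:
    return random.random()

def best_of_n(prompt: str, n: int = 8) -> str:
    """Sample n candidates and keep the one the scorer likes best.
    The extra n-1 generations are the test time compute being spent."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda answer: score(prompt, answer))

print(best_of_n("What is 17 * 23?"))
```

The design point is that nearly all of the intelligence lives in the scorer: swap in a better verifier and the same loop produces better answers.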
Here are some highlights of what I learned:

The most interesting paper in the whole batch, in my opinion, was Inference Time Computations for LLM Reasoning and Planning: A Benchmark and Insights. In the authors’ words: “Our findings indicate that simply scaling inference-time computation has limitations, as no single inference-time technique consistently performs well across all reasoning and planning tasks.” This is important because it means there won’t be a one-size-fits-all inference algorithm; rather, users will need to manage many of them depending on what types of tasks they need AI to do. I will note this also led us to a good discussion about what this means in the context of The Bitter Lesson.
The most surprising takeaway from it all is that inference time compute doesn’t always move the needle. The gains from adding inference time compute can be large or non-existent, depending on many factors: which model you are using, the problem difficulty, the compute or token budget, the test time compute tactics you choose, and other variables. One paper was particularly interesting in showing use cases where a 1B parameter LLM with some test time compute can beat a 405B parameter LLM. We have verified this in our own work benchmarking these tactics (stay tuned for a report).
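How does a small model punch above its weight? The tactics vary by paper, but self-consistency (sample many chain of thought answers and keep the majority vote) is a common one. A minimal sketch, again with the model call stubbed out as a made-up placeholder:

```python
import random
from collections import Counter

# Placeholder for a small model sampling a chain of thought and
# returning its final answer; imagine a 1B-parameter LLM here.
def sample_answer(prompt: str) -> str:
    return random.choice(["391", "391", "391", "381", "401"])

def self_consistency(prompt: str, n: int = 16) -> str:
    """Sample n answers and return the majority vote."""
    votes = Counter(sample_answer(prompt) for _ in range(n))
    return votes.most_common(1)[0][0]

print(self_consistency("What is 17 * 23?"))  # usually prints "391"
```

Each individual sample is noisy, but the vote concentrates on the answer the model reaches most reliably, which is where the small-model wins come from.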
2025 has seen an increase in papers that address how much compute budget to allow. While some early papers suggested that thinking longer was generally better, newer research shows that these tactics sometimes waste tokens and think too long. A good example is Learning To Stop Overthinking at Test Time.
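To make the budget problem concrete, here is one simple stopping heuristic, sketched under my own assumptions rather than taken from that paper: keep voting only until one answer clearly dominates, and bank the rest of the budget.

```python
import random
from collections import Counter

# Placeholder sampler standing in for a reasoning model call.
def sample_answer(prompt: str) -> str:
    return random.choice(["391", "391", "391", "381"])

def adaptive_vote(prompt: str, max_samples: int = 32,
                  min_samples: int = 4, threshold: float = 0.75) -> str:
    """Majority vote that exits early once one answer dominates,
    instead of always spending the full sample budget."""
    votes = Counter()
    for i in range(1, max_samples + 1):
        votes[sample_answer(prompt)] += 1
        answer, count = votes.most_common(1)[0]
        if i >= min_samples and count / i >= threshold:
            return answer  # confident early exit saves the remaining budget
    return votes.most_common(1)[0][0]

print(adaptive_vote("What is 17 * 23?"))
```

On easy prompts this often stops after a handful of samples; on hard ones it spends the full budget, which is exactly the adaptive behavior the budget papers are after.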
Going back to the benchmark finding above, new algorithms appear all the time that perform well in specific situations against specific benchmarks. Choosing which algorithm to run has recently become an active area of investigation in the research community.
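In its simplest form, algorithm selection is just a router in front of your tactics. The sketch below uses a naive keyword router and made-up tactic stubs; real research in this area looks at learned selectors, but the shape of the problem is the same:

```python
from typing import Callable

# Illustrative stand-ins for real tactics (assumed names, not a library API).
def best_of_n(prompt: str) -> str:
    return f"[best-of-n] {prompt}"

def self_consistency(prompt: str) -> str:
    return f"[self-consistency vote] {prompt}"

def greedy(prompt: str) -> str:
    return f"[single greedy pass] {prompt}"

# A naive keyword router; learned selectors would replace this table.
ROUTES: dict[str, Callable[[str], str]] = {
    "prove": self_consistency,   # math-style tasks benefit from voting
    "implement": best_of_n,      # code candidates can be checked by tests
}

def route(prompt: str) -> str:
    for keyword, tactic in ROUTES.items():
        if keyword in prompt.lower():
            return tactic(prompt)
    return greedy(prompt)  # default: don't spend extra compute

print(route("Implement a stack in Go"))
print(route("Prove that the sum of two even numbers is even"))
print(route("Write a haiku about GPUs"))
```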
The vast majority of the papers focused on LLMs and language-based reasoning tasks, including a lot of coding and mathematics, but interesting papers cropped up in other areas as well.
The reasons for using test time compute are mixed. Some people are interested in saving money by making smaller models perform as well as larger, more expensive ones. Others want to expand the capabilities of the best models. It will be interesting to see whether this split leads to market segmentation in the popular algorithms, or whether the same algorithms primarily drive both use cases.
There is a lively debate about whether inference time compute algorithms should target token space or latent space. It’s one of the threads most worth watching.
Finally, there are two things that no paper pointed out, but that I’ve been thinking about since skimming all of these.
First: can the tech community surpass the foundation model labs on test time compute research? If you are a developer with a new training idea, it costs tens of millions to hundreds of millions of dollars to test it out. If you have a new test time compute algorithm, it costs dollars to hundreds of dollars. Thousands of people working on these algorithms and tactics could significantly outperform the model labs.
Second: this space is still vast and mostly unexplored, and I believe we have barely scratched the surface of what is possible at inference time. It is ripe for exploration and innovation.
I made a chronological list (ordered by first publication date, not by later updates) of the best 150 or so papers. I’ve shared it here in case you want to review them. I will try to keep it updated through the summer with interesting new papers.
Neurometric is making it easy to apply these inference time compute tactics to your models. If you want to follow along as we get closer to launching, you can follow us on Bluesky or subscribe to our email list at the bottom of our homepage.