The illusion of peer review [Part 1/2]
Top-tier ML conferences, but what tier are their standards?
A few months back I started a Substack to, frankly, rant a bit about the lack of scientific rigour in machine learning “research papers”. I wrote about the GSM-Symbolic paper by Iman Mirzadeh and his colleagues at Apple, supervised by Mehrdad Farajtabar. Unsurprisingly, the paper got a lot of attention on social media, with many prominent “AI critics” treating it as proof that LLMs cannot reason.

This “proof”, however, is deeply flawed. Setting aside the fact that the authors, conveniently, never define what reasoning (or pattern matching) is, here’s a quick recap of the other key technical issues with this paper:
Over-interprets and over-sensationalises expected statistical variations.
Completely ignores alternative explanations: lack of reasoning is one plausible explanation, but so is a distribution mismatch between the original GSM8K and the GSM-Symbolic version they introduce.
The paper’s own data suggests this mismatch: the example template given in the paper (Figure 1) does not even reproduce the actual question from the original dataset.
Have a look at the ICLR Blog Post for full details and analysis.
What did the review process do?
Despite these significant issues, the paper passed peer review and was accepted at ICLR, a top-tier machine learning conference. During the review process, I attempted to engage the authors in discussion via a public comment, but they did not respond. The Area Chair (AC) acknowledged the concerns around statistical rigour, which were raised both by me and by an anonymous reviewer (Yt9o, who, despite this concern, still gave the paper a score of 8/10). In their official meta review, the AC noted:
Statistical Rigor (Desi R. Ivanova, Yt9o):
Concern: The paper lacks statistical evaluations, particularly in assessing performance variations.
Response: The authors added statistical significance results based on one-sample t-tests in Appendix A.3 (Figure 10), acknowledging the complexity of statistical analysis and committing to further investigation.
Looking at the revised version of the paper, I can confirm that the authors did add a one-sample t-test to one of the charts in the Appendix. Unfortunately, both the choice of test and the way it was conducted appear to be wrong.
First, a one-sample t-test is not appropriate in this context, since the evaluation involves two datasets (GSM8K and GSM-Symbolic). The authors have access to their raw data, so they could have performed a proper two-sample comparison. More critically, for a t-test to be valid, the sample mean and sample variance must be independent (footnote 4 on page 18 of the revised paper is incomplete). That assumption does not hold here, because the outputs are Bernoulli variables (Binomial when aggregated), whose variance is directly determined by the mean, as the short derivation below makes explicit.
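To make that dependence explicit (my own restatement, in generic notation rather than anything taken from the paper): if a model answers each of the $n$ questions in a dataset independently with probability $p$, then each outcome is $X_i \sim \mathrm{Bernoulli}(p)$ and the reported accuracy is the sample mean $\bar{X} = \tfrac{1}{n}\sum_{i=1}^{n} X_i$, with

$$\mathbb{E}[\bar{X}] = p, \qquad \operatorname{Var}(\bar{X}) = \frac{p(1-p)}{n}.$$

The variance is a deterministic function of the mean, so the independence of sample mean and sample variance that underpins the t-distribution for normal data cannot hold.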
They could have used a two-sample z-test, which is what I did in my initial quick analysis, complete with a publicly available spreadsheet showing the calculations! All of this was freely accessible to the authors during the entire review period; they could have used it directly, or reached out for assistance when revising their paper (my identity was known, given that I posted the comment publicly).
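For concreteness, here is a minimal sketch of the kind of two-sample (two-proportion) z-test I have in mind, assuming scipy is available; the counts of correct answers below are made-up placeholders, not the paper’s numbers or the values in my spreadsheet.

```python
# Minimal two-proportion z-test sketch; the counts are hypothetical, not the paper's.
from math import sqrt
from scipy.stats import norm

def two_proportion_ztest(correct_a: int, n_a: int, correct_b: int, n_b: int):
    """Two-sided z-test for a difference between two independent proportions."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    # Pool the two samples under the null hypothesis of equal accuracy.
    p_pool = (correct_a + correct_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * norm.sf(abs(z))  # two-sided p-value
    return z, p_value

# Hypothetical example: 950/1319 correct on GSM8K vs 910/1319 on a GSM-Symbolic variant.
z, p = two_proportion_ztest(950, 1319, 910, 1319)
print(f"z = {z:.2f}, two-sided p = {p:.3f}")
```

The spreadsheet version is just this arithmetic written out by hand; statsmodels’ proportions_ztest gives an equivalent calculation if you prefer a library routine.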
Worse still, the authors provide no description of how their statistical test was conducted. They state the null hypothesis as "50 different performance results on GSM-Symbolic differ from the original GSM8K score" (Appendix A.3, page 18). The null hypothesis, as the name suggests, posits that there is no difference, so they must be describing the alternative. Given that they perform a one-sample test, it is also unclear whether the GSM8K or the GSM-Symbolic score is treated as fixed; this matters a great deal, because it determines which standard error is used. Nor is it stated whether the test was one- or two-sided.
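To see why the choice matters, here is my reconstruction of the one-sample setup; the notation is mine, since the paper does not spell it out. Writing $\bar{y}_1, \dots, \bar{y}_{50}$ for the accuracies on the 50 GSM-Symbolic instances and $\mu_0$ for the GSM8K score treated as a fixed constant, the one-sample t-statistic is

$$ t = \frac{\frac{1}{50}\sum_{j=1}^{50} \bar{y}_j - \mu_0}{s_y / \sqrt{50}}, $$

where $s_y$ is the standard deviation of the 50 instance-level accuracies. The denominator contains no term for the uncertainty in $\mu_0$, even though the GSM8K score is itself an estimate from a finite set of questions; treating the other score as fixed instead would change the standard error, and with it the p-values.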
Having performed a test that is methodologically invalid on multiple levels, the authors claim that "for an overwhelming majority of models (except Llama3-8B and GPT-4o), the results are statistically significant" (Appendix A.3, page 18). This directly contradicts my own analysis (see Section 4.2.2 of the published ICLR blog post, or the publicly available spreadsheet accompanying my initial quick analysis), in which only three models showed a significant decrease in performance (Gemma-7b, Mistral-7b-instruct-v0.1, Phi-2) and one showed a significant increase (Llama3-8b).
So let’s summarise. After the peer review:
The over-sensationalised, exaggerated claims remained largely unchanged from the pre-print.
The revision that added the statistical analysis was not a meaningful correction: it was a superficial gesture towards rigour at best, and scientific misconduct or severe negligence at worst.
The “commitment to further investigation” (as per the meta review) does not appear to have happened.
In short, the peer-review process did worse than nothing — it highlighted key issues but nevertheless provided academic credibility to work that should probably not have been published in this form.
Why am I writing about this… again?
Honestly, I am asking myself the same question. I think the main reason is that I still believe scientists should have principles, and that the role of academia is to uphold them (through peer review, or through commentary like this). Maybe, as a junior academic, I’m just still a bit too naive.
The concrete reason that prompted me to look at the revision of the GSM-Symbolic paper is a new pre-print from the same lab at Apple, again led by Mehrdad Farajtabar: The Illusion of Thinking. Once again, it has serious methodological flaws, which have been discussed by, e.g., Rohit on X, Lisan al Gaib on X, and myself at a recent reading group (slides here). This new work follows exactly the same pattern: flashy, exaggerated claims, weak analysis, social media hype, AI critics getting excited¹ … So the question is: will this paper also pass peer review despite its flaws?
But perhaps the more fundamental question is whether peer review in machine learning still matters at all. If it doesn’t, we might as well abolish the system entirely and save ourselves some time and effort. If it does, how do we fix it? Because right now, it is clearly failing to serve its most basic purpose.
This post got a bit too long already, so I’ll opine on those questions in a follow-up.
¹ I’m deliberately not linking to their commentary as I don’t believe in amplifying sensationalist takes that lack scientific substance.