The goal of peer review is (or rather should be) to separate good science from bad science.1 I’d like to distinguish between two types of bad science (there are surely more):
Clearly flawed papers: ones with theoretical mistakes, clear methodological flaws or poorly executed empirical analysis
Misleading or overstated papers: ones with a discrepancy between what they claim to do, show or prove, and what they actually do, show or prove
The paper discussed in the previous blog post falls into both of these categories: its empirical analysis is poorly executed, and that analysis doesn’t really allow us to distinguish between LLMs doing “pattern matching” and LLMs doing “reasoning”.
Many published papers fall into the second category, though of course to varying degrees. I’m guilty of it too: many of my own papers make claims along the lines of “this Bayesian method we propose is useful for such and such important real-world applications”, but the empirical evaluation is limited to toy scenarios and models. Another common example is the overstatement of “novelty”. Many works present incremental changes as novel breakthroughs. Incremental work is not necessarily a problem; it is the backbone of scientific progress. As long as it’s good science, such work shouldn’t be rejected for being incremental (but that’s a separate discussion).
The reason why many papers fall into the second category is that the current system actively pushes them there. Getting a “weak reject due to limited novelty” is a classic. So what naturally follows are flashy claims and overstatements to signal novelty, practical relevance, or alignment with whatever happens to be trending in ML currently. As a community, we don’t write conference papers to communicate ideas in the clearest or most honest way. We write them in the way that will most likely get them accepted.
Catching papers that fall into the first category is becoming harder and harder, and we all know why. The biggest reason is that far too many papers are being submitted. Add to that the very short reviewing window and the fact that reviewing brings little to no benefit to the reviewers. So we end up with the majority of reviews being low-effort, low-quality and generally superficial.
Possibly in response to this, the community has introduced various check-boxing exercises to give the appearance that “good science” is being promoted. These include things like:
Impact statements, where an overworked grad student is expected to reflect on the broader implications of their work, possibly after an all-nighter, likely minutes before the deadline, nervously checking whether the section counts toward the page limit or not.
Checklists, where authors self-report on questions like “Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?”. I wonder how many people have answered “No” to that.
I genuinely wonder what impact statements and checklists have actually achieved. If anyone reading this has any hard stats, please share them!
So what can we do about it?
Here’s a radical idea that could kill three birds with one stone (the submission volume, the reproducibility crisis and the peer review bottleneck): for any paper an author or a team of authors wants to submit, they must first reproduce an existing paper in the same field that is currently under review.
For example, if a team is planning to submit to NeurIPS (typically due in early May), they would need to replicate work currently under review at ICML (whose review period usually runs from mid-February to mid-March). In this way, replication acts as peer review. The first iteration of this new peer review process would involve replicating work from a previous conference.
Of course, various details would need to be worked out:
Authorship and publication format: If the original paper is accepted, should it and the reproduction be published together under joint authorship? Or should the reproduction be published in a separate track? In the latter case, we may need to impose new “citation rules” so that the original paper and its reproduction get cited together. If the original paper is rejected, the reproduction should still be published in Findings, Refutations and Critiques or similar tracks within the conference proceedings.
Matching process: How exactly should papers be assigned for reproduction? Authors should probably have some level of choice, while conflicts of interest are avoided as they are in reviewer assignment today. I think the process should be completely open (not double- or single-blind). A toy sketch of what such an assignment could look like follows this list.
Feasibility: How do we deal with various barriers, e.g. around compute and closed models? A possible solution could be for the authors of the original paper to provide the compute and model access needed for reproduction.
Reviewing the reproductions: How are we going to ensure the quality of the reproductions? Hopefully, the fact that these will be published should provide sufficient motivation.
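To make the matching question a bit more concrete, here is a minimal, purely hypothetical sketch (in Python) of one way an assignment step could work: submitting teams bid on a few papers currently under review, and a greedy pass gives each team one paper while skipping conflicts of interest and capping how many teams reproduce the same paper. The bid format, the cap, and all identifiers below are my own illustrative assumptions, not part of the proposal.

```python
# Hypothetical sketch of a reproduction-assignment step. Everything here
# (bidding format, cap, identifiers) is illustrative, not a specification.

from collections import defaultdict

MAX_REPRODUCTIONS_PER_PAPER = 2  # assumed cap on reproductions per paper


def assign_reproductions(bids, conflicts):
    """bids: {team_id: [paper_id, ...]} in order of preference.
    conflicts: {team_id: {paper_id, ...}} papers the team must not be assigned.
    Returns {team_id: paper_id} for every team that could be matched."""
    load = defaultdict(int)  # how many teams already reproduce each paper
    assignment = {}
    for team, preferences in bids.items():
        for paper in preferences:
            if paper in conflicts.get(team, set()):
                continue  # conflict of interest: skip this paper
            if load[paper] >= MAX_REPRODUCTIONS_PER_PAPER:
                continue  # paper already has enough reproduction teams
            assignment[team] = paper
            load[paper] += 1
            break
    return assignment


# Example: two teams bidding on three papers under review.
bids = {"team_a": ["paper_2", "paper_1"], "team_b": ["paper_2", "paper_3"]}
conflicts = {"team_a": {"paper_2"}}  # e.g. shared institution with paper_2's authors
print(assign_reproductions(bids, conflicts))
# {'team_a': 'paper_1', 'team_b': 'paper_2'}
```

A real system would need fairer tie-breaking and a fallback for teams whose bids are all conflicted or full, but even this toy version shows that the mechanics are closer to today's reviewer bidding than to something exotic.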
This setup would come with many other benefits: a deeper understanding of existing work, a reliable foundation for follow-up research, and, crucially, a shift in incentives toward clearer, more honest writing and a slower, more thoughtful pace of publishing, the kind that actually promotes good science.
I don’t believe the role of peer review should be to judge what’s impactful or novel (those things tend to be subjective and often only clear in hindsight).
Doesn't this approach disadvantage early-career scientists?