
Reviewing Agents

LLM-powered scientific paper review and error detection

Python 3.12+ · Code style: ruff

This repository supports:

  • To Err Is Human: Systematic Quantification of Errors in Published AI Papers via LLM Analysis

    We developed an LLM-based Paper Correctness Checker that identifies objective mistakes (in formulas, derivations, figures, and tables) in papers published at top AI venues. Our analysis shows that mistakes per paper have increased over time, from 3.8 in NeurIPS 2021 to 5.9 in NeurIPS 2025 (+55%). Human experts confirmed 83.2% precision on 316 reviewed mistakes, and the checker can also propose correct fixes for 75.8% of identified issues.

  • Agents4Science: LLM reviewers for the Agents4Science Conference

    Agents4Science 2025 was the first conference where AI systems served as both primary authors and reviewers of research papers. Organized by TogetherAI and Stanford University, it received 315 submissions, with 48 papers accepted after combined AI and human peer review.


Installation

uv sync

Configuration

Set up API keys in .env (copy from env.dev):

cp env.dev .env

Configure modules in config.yaml.
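The schema of config.yaml is defined by the repository itself; as a rough, hypothetical sketch of what a per-module toggle might look like (every key below is an assumption, not the documented schema):

# Hypothetical sketch only; see config.yaml in the repository for the real schema.
modules:
  simple_reviewer:
    enabled: true
    model: gpt-4o        # assumed key: which LLM backs the module
  reference_check_light:
    enabled: false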


Modules

Module                   Description
-----------------------  --------------------------------------------
SimpleReviewer           General paper reviewing
LLMCorrectnessDetector   Methodological correctness evaluation
LLMCriticalityVerifier   Verifies criticality of correctness findings
LLMFormatDetector        Format compliance checking
JailbreakingChecker      Detects adversarial instructions in papers
ReferenceCheckLight      Reference hallucination detection
ReferenceCheckHeavy      Full reference + author verification
ArxivTaxonomyClassifier  arXiv category classification
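The Quick Start below runs three of these modules by instantiating the class and passing raw PDF bytes. Assuming the remaining modules follow the same pattern, a minimal sketch for ReferenceCheckLight might look like this (the method name check_references is a guess by analogy, not a documented API):

from reviewing_agents.modules import ReferenceCheckLight

# Load the paper as raw PDF bytes, as in the Quick Start examples below.
with open("paper.pdf", "rb") as f:
    pdf_bytes = f.read()

# Hypothetical entry point; check the module's docstring for the real method name.
checker = ReferenceCheckLight()
report = checker.check_references(pdf_bytes)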

Quick Start

Agents4Science reviewers can be used to review papers:

from reviewing_agents.modules import SimpleReviewer

# Load the paper as raw PDF bytes.
with open("paper.pdf", "rb") as f:
    pdf_bytes = f.read()

# Run the general-purpose reviewer over the PDF.
reviewer = SimpleReviewer()
result = reviewer.review_paper(pdf_bytes)

LLMCorrectnessDetector and LLMCriticalityVerifier, the modules used in our To Err Is Human paper, can be chained to evaluate the correctness of a paper:

from reviewing_agents.modules import LLMCorrectnessDetector, LLMCriticalityVerifier

# Load the paper as raw PDF bytes.
with open("paper.pdf", "rb") as f:
    pdf_bytes = f.read()

# Stage 1: flag potential objective mistakes in the paper.
detector = LLMCorrectnessDetector()
correctness = detector.check_correctness(pdf_bytes)

# Stage 2: verify how critical the flagged issues actually are.
verifier = LLMCriticalityVerifier()
findings = {
    "score": correctness.score,
    "reasoning": correctness.reasoning,
    "key_issues": correctness.key_issues,
}
verified = verifier.verify_criticality(pdf_bytes, findings)
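To screen more than one paper, the same two-stage pipeline can be looped over a directory of PDFs. A minimal sketch, assuming the papers live in a local papers/ directory (the directory name is an example, not part of the library):

from pathlib import Path

from reviewing_agents.modules import LLMCorrectnessDetector, LLMCriticalityVerifier

detector = LLMCorrectnessDetector()
verifier = LLMCriticalityVerifier()

# Run the detector + verifier pipeline over every PDF in papers/.
for pdf_path in sorted(Path("papers").glob("*.pdf")):
    pdf_bytes = pdf_path.read_bytes()
    correctness = detector.check_correctness(pdf_bytes)
    findings = {
        "score": correctness.score,
        "reasoning": correctness.reasoning,
        "key_issues": correctness.key_issues,
    }
    verified = verifier.verify_criticality(pdf_bytes, findings)
    print(pdf_path.name, correctness.score)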

Citation

@article{bianchi2025toerr,
  title={To Err Is Human: Systematic Quantification of Errors in Published AI Papers via LLM Analysis},
  author={Bianchi, Federico and Kwon, Yongchan and Izzo, Zachary and Zhang, Linjun and Zou, James},
  journal={arXiv preprint arXiv:2512.05925},
  year={2025}
}

@article{bianchi2025agents4science,
  title={Exploring the use of AI authors and reviewers at Agents4Science},
  author={Bianchi, Federico and Queen, Owen and Thakkar, Nitya and Sun, Eric and Zou, James},
  journal={arXiv preprint arXiv:2511.15534},
  year={2025}
}
