The official implementation of "Critique to Verify: Accurate and Honest Test-Time Scaling with RL-Trained Verifiers"

yangzhch6/Mirror-Critique

Critique to Verify: Accurate and Honest Test-Time Scaling with RL-Trained Verifiers

We train a verifier with RLVR on self-synthesized critique data to improve the accuracy and honesty of test-time scaling.

[Figure: overview]

✨Installation

pip install -e ./verl
pip install packaging
pip install ninja
pip install flash-attn --no-build-isolation
pip install -e .

📃Prepare Data

Download from 'yangzhch6/Mirror-Critique' on Hugging Face:

huggingface-cli download yangzhch6/Mirror-Critique --local-dir ./data

We already provide the trajectories (with redundancy filtering) from the training procedure of the Zero-RL Solver in ./data/to_critique/Qwen2.5-Math-{1.5/7}B-L-openr1-f3/to_critique.parquet. The test-time outputs of the Zero-RL Solver are in ./data/rlvr-critique/Qwen2.5-Math-{1.5/7}B-L-openr1-f3/test_n16_full.parquet.
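To check that the download landed where the scripts expect it, a small helper like the one below may be handy. It is only a sketch: `solver_data_paths` and `describe_parquet` are hypothetical names that are not part of this repo, the `{1.5/7}` placeholder is assumed to expand to `1.5` or `7`, and no column names are assumed, so the script only prints the shape and dtypes it finds. Reading parquet assumes `pandas` with a parquet engine such as `pyarrow` is installed.

```python
from pathlib import Path


def solver_data_paths(size: str, data_dir: str = "./data") -> dict:
    """Expand the {1.5/7} placeholder in the paths above for one solver size.

    Hypothetical helper for illustration; not part of the repository.
    """
    if size not in ("1.5", "7"):
        raise ValueError("size must be '1.5' or '7'")
    run = f"Qwen2.5-Math-{size}B-L-openr1-f3"
    root = Path(data_dir)
    return {
        "to_critique": root / "to_critique" / run / "to_critique.parquet",
        "test_n16": root / "rlvr-critique" / run / "test_n16_full.parquet",
    }


def describe_parquet(path):
    """Load one parquet file and print its shape and column dtypes."""
    import pandas as pd  # imported lazily; needs pandas plus a parquet engine

    df = pd.read_parquet(path)
    print(path, df.shape)
    print(df.dtypes)
    return df


if __name__ == "__main__":
    for name, path in solver_data_paths("7").items():
        if path.exists():
            describe_parquet(path)
        else:
            print(f"missing {name}: {path}")
```

Running it after the download step should list both files for the chosen size, or report which one is missing.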

🔧Train Mirror-Verifier

Gen Critique

bash ./experiments/gen_critique/Qwen2.5-7B-Instruct-to-Qwen2.5-Math-{1.5/7}B.sh

SFT Cold Start

bash ./experiments/sft_critique/Qwern2.5-Math-{1.5/7}B-L.sh

RLVR Train Verifier

bash ./experiments/rlvr-verify/Qwen2.5-Math-{1.5/7}B-L-sft-ckpt-balance-bsz1k.sh

Evaluation

The performance of test-time scaling can be evaluated with:

python ./test-time-eval.py
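For readers new to test-time scaling, the sketch below contrasts plain majority voting over sampled solutions with voting weighted by verifier scores. It is a generic illustration, not the logic of test-time-eval.py: the function names, the score range in [0, 1], and the toy answers are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def majority_vote(answers):
    """Return the most frequent final answer among the sampled solutions."""
    return Counter(answers).most_common(1)[0][0]


def verifier_weighted_vote(answers, scores):
    """Return the answer with the largest summed verifier score.

    `scores[i]` is assumed to be the verifier's confidence in [0, 1]
    that solution i is correct (a hypothetical interface).
    """
    if len(answers) != len(scores):
        raise ValueError("answers and scores must have the same length")
    totals = defaultdict(float)
    for answer, score in zip(answers, scores):
        totals[answer] += score
    return max(totals, key=totals.get)


if __name__ == "__main__":
    # Toy data: three samples agree on "12", but the verifier trusts "15" more.
    answers = ["12", "12", "15", "12", "15"]
    scores = [0.1, 0.2, 0.9, 0.1, 0.8]
    print("majority vote:", majority_vote(answers))
    print("verifier-weighted vote:", verifier_weighted_vote(answers, scores))
```

In the toy data the two rules disagree, which is the situation where a well-calibrated verifier can help over majority voting alone.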

Huggingface Models

| Model | Huggingface | Base Model |
| --- | --- | --- |
| Zero-Solver-Qwen2.5-Math-1.5B-L | https://huggingface.co/yangzhch6/Zero-Solver-Qwen2.5-Math-1.5B-L | Qwen2.5-Math-1.5B |
| Zero-Solver-Qwen2.5-Math-7B-L | https://huggingface.co/yangzhch6/Zero-Solver-Qwen2.5-Math-7B-L | Qwen2.5-Math-7B |
| Mirror-Verifier-1.5B | https://huggingface.co/yangzhch6/Mirror-Verifier-1.5B | Qwen2.5-Math-1.5B |
| Mirror-Verifier-7B | https://huggingface.co/yangzhch6/Mirror-Verifier-7B | Qwen2.5-Math-7B |

🌻Acknowledgement

This repo builds upon veRL and deepscaler, and uses vLLM for inference. We use Math-Verify for math reasoning evaluation. We thank the open-source community for the datasets and backbones, including OpenR1-Math-220k, Qwen2.5-Math, and the DeepSeek-R1 model.

📬 Contact

For questions, feedback, or collaboration opportunities, feel free to reach out:

Citation

If you find our model or code useful, please kindly cite our paper:

@misc{yang2025critiqueverifyaccuratehonest,
      title={Critique to Verify: Accurate and Honest Test-Time Scaling with RL-Trained Verifiers}, 
      author={Zhicheng Yang and Zhijiang Guo and Yinya Huang and Yongxin Wang and Yiwei Wang and Xiaodan Liang and Jing Tang},
      year={2025},
      eprint={2509.23152},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2509.23152}, 
}
