A domain-agnostic framework enabling VLM self-improvement through competitive visual games
Although reinforcement learning (RL) can effectively enhance the reasoning capabilities of vision–language models (VLMs), current methods remain heavily dependent on labor-intensive datasets that require extensive manual construction and verification, leading to extremely high training costs and consequently constraining the practical deployment of VLMs.
To address this challenge, we propose Vision-Zero, a domain-agnostic framework enabling VLM self-improvement through competitive visual games generated from arbitrary image pairs.
🎮 Strategic Self-Play Framework
Vision-Zero trains VLMs through "Who Is the Spy"-style games, in which models take on multiple roles and engage in strategic reasoning and actions. Through interactive gameplay, models autonomously generate their own training data without human annotation.
🖼️ Gameplay from Arbitrary Images
Unlike existing gamified frameworks, Vision-Zero can generate games from arbitrary images, thereby enhancing the model's reasoning ability across diverse domains and showing strong generalization to different tasks. We demonstrate this versatility using three distinct types of image datasets: CLEVR-based synthetic scenes, charts, and real-world images.
📈 Sustainable Performance Gain
We introduce Iterative Self-Play Policy Optimization (Iterative-SPO), a novel training algorithm that alternates between Self-Play and reinforcement learning with verifiable rewards (RLVR), mitigating the performance plateau often seen in self-play-only training and achieving sustained long-term improvements.
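At a high level, each training round alternates the two phases. The sketch below is a conceptual illustration only; the phase scripts it calls are hypothetical and are not entry points of this repo:

# Conceptual sketch of Iterative-SPO (hypothetical scripts, illustration only)
for round in $(seq 1 5); do
  bash self_play_phase.sh   # play "Who Is the Spy" games and train on game-derived rewards
  bash rlvr_phase.sh        # switch to RL with verifiable rewards (RLVR)
done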
🏆 Achievement: Despite using label-free data, Vision-Zero achieves state-of-the-art performance on reasoning, chart question answering, and vision-centric understanding tasks, surpassing other annotation-based methods.
| Component | Status | Description |
|---|---|---|
| 🤖 Models | ✅ Available | Trained models built on Qwen2.5-VL-7B, InternVL3-8B, and InternVL3-14B |
| 📊 CLEVR Dataset | ✅ Available | Complete CLEVR-based training dataset |
| 🛠️ Training Code | ✅ Available | Full open-source training pipeline |
| 📈 Chart Dataset | 🚧 Coming Soon | Chart-based dataset for enhanced reasoning |
| 🌍 Real-World Dataset | 🚧 Coming Soon | Real-world image dataset for diverse scenarios |
# 1. Clone the repository
git clone https://github.com/your-repo/vision-zero.git
cd vision-zero
# 2. Set up environment
conda create -n vision-zero python=3.10
conda activate vision-zero
bash setup.sh
# 3. Download a trained model
# Choose from available models in the table below
# 4. Start training or inference
bash run_scripts/run_grpo_vision_zero.sh

| Model Family | Size | Dataset | HuggingFace Link |
|---|---|---|---|
| Qwen2.5-VL | 7B | CLEVR | |
| Qwen2.5-VL | 7B | Chart | |
| Qwen2.5-VL | 7B | Real-World | |
| InternVL3 | 8B | CLEVR | |
| InternVL3 | 14B | CLEVR | |
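Quick-start step 3 refers to these checkpoints. Assuming a checkpoint is hosted on the Hugging Face Hub (see the links in the table above), it can be fetched with the Hugging Face CLI; the repo id below is a placeholder, not an actual model id:

# Download a trained checkpoint from the Hugging Face Hub (repo id is a placeholder)
pip install -U "huggingface_hub[cli]"
huggingface-cli download <org>/<vision-zero-model> --local-dir checkpoints/vision-zero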
| Dataset Type | Description | Link |
|---|---|---|
| CLEVR-based | Synthetic scenes for logical reasoning | |
📢 Acknowledgment: This repo is based on vlm-r1. Thanks for their contribution!
- Python 3.10+
- CUDA-compatible GPU (recommended)
- Conda or similar environment manager
# Create and activate environment
conda create -n vision-zero python=3.10
conda activate vision-zero
# Install dependencies
bash setup.sh

Download one of the available datasets or prepare your own:
- CLEVR-based: Available now ✅
- Chart-based: Coming soon 🚧
- Real-World: Coming soon 🚧
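For example, assuming the CLEVR-based dataset is hosted on the Hugging Face Hub (see the dataset table above for the link), it could be fetched along these lines; the dataset repo id below is a placeholder:

# Download the CLEVR-based dataset (repo id is a placeholder)
huggingface-cli download --repo-type dataset <org>/<vision-zero-clevr> --local-dir data/clevr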
Configure your training setup in run_scripts/run_grpo_vision_zero.sh:
# Configuration variables
IMAGES_DIR=$IMAGES_DIR # Path to your images
SCENES_DIR=$SCENES_DIR # Path to scene descriptions
MODEL=$MODEL # Base model to fine-tune
OUTPUT_BASE_DIR=$OUTPUT_DIR # Output directory for checkpoints
RUN_NAME="your_run_name" # Experiment name

Launch the training process with customizable hyperparameters:
bash run_scripts/run_grpo_vision_zero.sh

💡 Tip: All hyperparameters can be modified directly in the script file.
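As a concrete illustration, the configuration block at the top of run_scripts/run_grpo_vision_zero.sh might be filled in as follows for a CLEVR run (all paths and names below are illustrative placeholders):

# Example configuration (illustrative values only)
IMAGES_DIR=/data/clevr/images              # Path to your images
SCENES_DIR=/data/clevr/scenes              # Path to scene descriptions
MODEL=Qwen/Qwen2.5-VL-7B-Instruct          # Base model to fine-tune
OUTPUT_BASE_DIR=./checkpoints              # Output directory for checkpoints
RUN_NAME="qwen2.5vl-7b-clevr-vision-zero"  # Experiment name

bash run_scripts/run_grpo_vision_zero.sh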
Evaluate your trained model on out-of-distribution tasks using VLMEvalKit:
# After training completes and checkpoint is saved
# Use VLMEvalKit for comprehensive evaluation

We rely on VLMEvalKit for evaluation on out-of-distribution tasks, which provides a robust performance assessment across various benchmarks.
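As a rough sketch, an evaluation run with VLMEvalKit might look like the following, assuming the trained checkpoint has been registered as a custom model in VLMEvalKit's configuration (the model name and benchmark choices below are examples, not fixed settings):

# Install VLMEvalKit (see its repository for setup details)
git clone https://github.com/open-compass/VLMEvalKit.git
cd VLMEvalKit && pip install -e .

# Evaluate the registered model on example out-of-distribution benchmarks
python run.py --data MMStar MMMU_DEV_VAL --model <your-registered-model> --verbose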
If you find Vision-Zero useful in your research, please consider citing our paper:
@misc{wang2025visionzeroscalablevlmselfimprovement,
title={Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play},
author={Qinsi Wang and Bo Liu and Tianyi Zhou and Jing Shi and Yueqian Lin and Yiran Chen and Hai Helen Li and Kun Wan and Wentian Zhao},
year={2025},
eprint={2509.25541},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2509.25541}
}

🌟 Star this repo if you find it helpful!
Made with ❤️ by the Vision-Zero team
