Qixiang Chen, Cheng Zhang, Chi-Wing Fu, Jingwen Ye, Jianfei Cai
Paper | Project Page (coming soon) | Benchmark | Dataset | Models (coming soon)
Recent multimodal large language models (MLLMs) show great potential in natural image understanding, yet they perform well mainly when reasoning about in-view content within the image frame. This paper presents the first study on out-of-view (OOV) understanding, i.e., the ability to reason about objects, activities, and scenes beyond the visible frame of a perspective view. Our technical contributions are threefold. First, we design OpenView, a four-stage pipeline that leverages panoramic imagery to massively generate multiple-choice VQA, enabling context-rich, spatially grounded VQA synthesis with free-view framing. Second, we curate OpenView-Dataset, a high-quality synthetic dataset built from diverse real-world panoramas to empower MLLMs through supervised fine-tuning. Third, we build OpenView-Bench, a benchmark that jointly measures choice and rationale accuracy for interpretable and diagnosable evaluation. Experimental results show that, despite a large gap to human performance in OOV VQA answer selection, multiple MLLMs consistently improve once empowered by OpenView, rising from 48.6% to 64.1% on average.
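The key idea behind free-view framing is that a perspective crop shows the model only part of the scene, while the surrounding panorama grounds the out-of-view context. The sketch below illustrates one common way to extract such a crop from an equirectangular panorama with NumPy; the function name, coordinate conventions, and defaults are our own assumptions and are not OpenView's actual implementation.

```python
import numpy as np

def perspective_crop(pano: np.ndarray, yaw_deg: float, pitch_deg: float,
                     fov_deg: float = 90.0, out_hw: tuple = (512, 512)) -> np.ndarray:
    """Sample a pinhole-camera view from an equirectangular panorama (H, W, 3).

    yaw/pitch select the viewing direction; everything outside this crop is
    "out-of-view" context that only the full panorama contains.
    """
    H, W = pano.shape[:2]
    h, w = out_hw
    yaw, pitch, fov = np.radians([yaw_deg, pitch_deg, fov_deg])

    # Pixel grid on the virtual image plane at focal length f.
    f = 0.5 * w / np.tan(0.5 * fov)
    xs = np.arange(w) - 0.5 * (w - 1)
    ys = np.arange(h) - 0.5 * (h - 1)
    x, y = np.meshgrid(xs, ys)
    dirs = np.stack([x, -y, np.full_like(x, f)], axis=-1)  # rays: x right, y up, z forward
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)

    # Rotate rays: first pitch (around x-axis), then yaw (around y-axis).
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    dirs = dirs @ (Ry @ Rx).T

    # Map ray directions to panorama (longitude, latitude) pixel coordinates.
    lon = np.arctan2(dirs[..., 0], dirs[..., 2])            # [-pi, pi] across width
    lat = np.arcsin(np.clip(dirs[..., 1], -1.0, 1.0))       # [-pi/2, pi/2] across height
    u = ((lon / np.pi + 1.0) * 0.5 * (W - 1)).astype(int)
    v = ((0.5 - lat / np.pi) * (H - 1)).astype(int)
    return pano[np.clip(v, 0, H - 1), np.clip(u, 0, W - 1)]  # nearest-neighbor sampling
```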
- OpenView pipeline implementation
- Full supervised fine-tuning and evaluation code
To obtain the OpenView-Dataset and OpenView-Bench, download the annotations following the instructions in the Annotation Download Guide.
To process the annotated data for OpenView-Dataset, please refer to the Data Preparation Guide; a hypothetical loading example is sketched below.
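For orientation only, the snippet below shows how a processed multiple-choice VQA annotation file might be loaded and turned into a prompt. The file path and field names here are hypothetical; please follow the Annotation Download Guide and Data Preparation Guide for the actual layout.

```python
import json

# Hypothetical path and schema -- the real ones are defined in the guides above.
with open("openview_dataset/annotations.json") as f:
    samples = json.load(f)

sample = samples[0]
# Assumed fields: "image" (perspective crop), "question", "options", "answer", "rationale".
prompt = sample["question"] + "\n" + "\n".join(
    f"{letter}. {text}" for letter, text in zip("ABCD", sample["options"])
)
print(sample["image"])
print(prompt)
print("GT:", sample["answer"])
```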
We thank the open-source community for their contributions.
- vLLM: Easy and efficient inference engine for MLLMs.
- LLaMA-Factory: Easy fine-tuning framework for MLLMs.
- Qwen-VL-Series-Finetune: Qwen-VL series finetuning codebase.

