Ming Zhong1*,
Yuanlei Wang3*,
Liuzhou Zhang2,
Arctanx An2,
Renrui Zhang4,
Hao Liang2,
Ming Lu2,
Ying Shen3✉️,
Wentao Zhang2✉️
1Zhejiang University,
2Peking University,
3Sun Yat-sen University,
4CUHK
- Overview
- Project Structure
- Key Features
- HVCU-Bench Dataset
- Installation
- Quick Start
- Performance
- Data Generation Pipeline
- License
- Acknowledgments
- Contact
- Citation
VCU-Bridge is a comprehensive framework for evaluating and improving Hierarchical Visual Connotation Understanding in Multimodal Large Language Models (MLLMs). Unlike traditional benchmarks that test perception and reasoning in isolation, VCU-Bridge explicitly models the critical semantic bridge that connects low-level visual details to high-level abstract interpretations.
Overview of HVCU-Bench. We evaluate MLLMs across 3 task families spanning 15 diverse aspects. Our benchmark employs hierarchical decomposition: each question is systematically broken down into sub-questions across three levels (Lperc, Lbridge, Lconn), with validation ensuring logical coherence. During evaluation, models progress from low to high levels, constructing inter-level reasoning chains that emulate human visual comprehension.
- Novel Framework: First to formalize hierarchical visual connotation understanding as a three-level progressive process with explicit inter-level associations
- HVCU-Bench: A benchmark with 1,050 samples (3,150 QA pairs) covering Affective Reasoning, Aesthetic Appreciation, and Implication Understanding
- MCTS Pipeline: Monte Carlo Tree Search-driven data generation for creating high-quality hierarchical training data
- Proven Results: Instruction-tuned models show a +6.17% improvement on HVCU-Bench
VCU-Bridge/
├── Data/                                    # Benchmark data and images (download required)
│   ├── *.json                               # Task annotation files
│   └── Image/                               # Image directories by task
├── Evaluation/                              # Evaluation framework
│   ├── run_evaluation.py                    # Main evaluation script
│   ├── calculate_metrics.py                 # Metrics calculation
│   ├── base_evaluator.py                    # Base evaluator classes
│   └── answer_parser.py                     # Answer parsing utilities
├── Instruction_Data_Generation_Pipeline/    # MCTS data generation
│   ├── main.py                              # Generation entry point
│   ├── config.py                            # Configuration management
│   ├── image_processor.py                   # Image processing utilities
│   ├── configs/                             # Configuration files
│   ├── prompts/                             # Prompt templates
│   ├── utils/                               # Core utilities
│   ├── orchestrator/                        # MCTS orchestration
│   ├── services/                            # API clients and services
│   ├── tree/                                # MCTS tree implementation
│   ├── batch/                               # Batch processing
│   ├── formatter/                           # Data formatting
│   └── nodes/                               # MCTS node implementation
├── .env.example                             # Environment variables template
├── environment.yml                          # Conda environment file
├── pyproject.toml                           # Project configuration
├── setup.sh                                 # Setup script
└── README.md                                # This file
- Evaluation Module: Provides a comprehensive evaluation framework for testing MLLMs on HVCU-Bench, with support for multiple evaluators (OpenAI API and local models via vLLM)
- Instruction_Data_Generation_Pipeline Module: Implements MCTS-driven hierarchical data generation with quality filtering and diversity checking
VCU-Bridge models visual understanding as a progressive three-level process:
Level 1 - Foundational Perception (Lperc)
- Objective, low-level visual facts
- Direct observation of objects, attributes, and visual primitives
- Example: "The image shows dark clouds and a person with a hunched posture"
Level 2 - Semantic Bridge (Lbridge)
- Explanatory statements linking perception to meaning
- Causal reasoning about visual evidence
- Example: "The dark weather and body language create an atmosphere of isolation"
Level 3 - Abstract Connotation (Lconn)
- Subjective, high-level interpretations
- Aesthetic, affective, or symbolic meanings
- Example: "The scene conveys a sense of melancholy and loneliness"
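To make the three levels concrete, a single sample can be pictured as one image paired with a QA chain across the levels. The layout below is a hypothetical illustration built from the examples above, not the actual HVCU-Bench annotation schema:

```python
# Hypothetical representation of one hierarchical sample (illustrative schema,
# not the real format shipped in Data/*.json).
sample = {
    "id": "affective_0001",
    "image": "Data/Image/Affective-Reasoning/affective_0001.jpg",
    "levels": {
        "L_perc": {
            "question": "What low-level visual facts are present?",
            "answer": "The image shows dark clouds and a person with a hunched posture.",
        },
        "L_bridge": {
            "question": "How does this visual evidence shape the scene's meaning?",
            "answer": "The dark weather and body language create an atmosphere of isolation.",
        },
        "L_conn": {
            "question": "What abstract connotation does the scene carry?",
            "answer": "The scene conveys a sense of melancholy and loneliness.",
        },
    },
}
```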
- Monte Carlo Tree Search for exploring hierarchical reasoning paths
- UCB (Upper Confidence Bound) selection balancing exploration and exploitation
- Progressive validation ensuring logical coherence across levels
- Quality filtering with multi-dimensional evaluation criteria
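The UCB selection rule mentioned above is the standard UCB1 formula; a minimal sketch follows (the pipeline's own implementation lives in the tree/ and orchestrator/ modules and may differ):

```python
import math

def ucb_score(value_sum: float, visits: int, parent_visits: int, c: float = 2.0) -> float:
    """UCB1: exploitation (mean value) plus an exploration bonus for rarely visited nodes."""
    if visits == 0:
        return float("inf")  # unvisited candidates are explored first
    return value_sum / visits + c * math.sqrt(math.log(parent_visits) / visits)
```

With c = 2.0 (the documented mcts.exploration_constant default), rarely visited branches keep receiving attention even when their current average score is low.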
- Affective Reasoning (300 samples): 6 emotional categories (joy, affection, wonder, anger, fear, sadness)
- Aesthetic Appreciation (350 samples): 4 design aspects (color, composition, font, graphics)
- Implication Understanding (400 samples): 5 rhetorical devices (metaphor, symbolism, contrast, exaggeration, dislocation)
- Independent Mode: Test each level in isolation
- Context Mode: Evaluate with hierarchical dependencies
- Diagnostic Metrics: Per-level accuracy, full-chain accuracy, and error attribution
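Conceptually, the two modes differ only in what the model sees at each level. Below is a minimal sketch of prompt assembly, reusing the hypothetical sample layout shown earlier (the framework's actual prompts live in Evaluation/ and will differ):

```python
def build_prompt(sample: dict, level: str, context_mode: bool = False) -> str:
    """Ask the question for one level; in context mode, prepend lower-level QA as context."""
    order = ["L_perc", "L_bridge", "L_conn"]
    parts = []
    if context_mode:
        for lower in order[: order.index(level)]:
            qa = sample["levels"][lower]
            parts.append(f"Q: {qa['question']}\nA: {qa['answer']}")
    parts.append(f"Q: {sample['levels'][level]['question']}")
    return "\n\n".join(parts)
```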
HVCU-Bench contains 1,050 samples (3,150 QA pairs) across three task families:
- Affective Reasoning (300 samples): 6 emotional categories – joy, affection, wonder, anger, fear, sadness
- Aesthetic Appreciation (350 samples): 4 design aspects – color, composition, font, graphics
- Implication Understanding (400 samples): 5 rhetorical devices – metaphor, symbolism, contrast, exaggeration, dislocation
- Total Samples: 1,050 images
- Total QA Pairs: 3,150 (3 levels Γ 1,050 samples)
- Task Distribution:
- Affective Reasoning: 28.6% (300 samples)
- Aesthetic Appreciation: 33.3% (350 samples)
- Implication Understanding: 38.1% (400 samples)
The benchmark data is organized in the Data/ directory:
- JSON annotation files: Task-specific annotation files containing QA pairs
  - `Aesthetic-Appreciation.json`, `Affective-Reasoning.json`, `Implication-Understanding.json`
- Image directory: Images organized by task family in `Data/Image/`
  - Each task has its own subdirectory containing all relevant images
  - Image filenames correspond to sample IDs in the JSON files
  - Supported formats: JPG, PNG, WEBP
The complete benchmark data (including both JSON annotation files and images) can be downloaded from:
- Hugging Face: Chime316/HVCU-Bench
After downloading, place the data in the Data/ directory with the following structure:
Data/
├── Aesthetic-Appreciation.json
├── Affective-Reasoning.json
├── Implication-Understanding.json
└── Image/
    ├── Aesthetic-Appreciation/
    ├── Affective-Reasoning/
    └── Implication-Understanding/
Note: Both the JSON annotation files and the Image/ directory are required for evaluation.
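As a sanity check after downloading, a small loader along these lines can verify that annotations and images line up. The id field and list-of-records layout are assumptions about the JSON files; check the downloaded annotations for the actual keys:

```python
import json
from pathlib import Path

DATA_DIR = Path("Data")

def load_task(task_name: str) -> list:
    """Load one task's annotations and attach the matching image path to each record."""
    records = json.loads((DATA_DIR / f"{task_name}.json").read_text(encoding="utf-8"))
    image_dir = DATA_DIR / "Image" / task_name
    for record in records:
        # Image filenames correspond to sample IDs; extensions may be .jpg/.png/.webp.
        matches = sorted(image_dir.glob(f"{record['id']}.*"))
        record["image_path"] = str(matches[0]) if matches else None
    return records

samples = load_task("Affective-Reasoning")
print(len(samples), "samples loaded")
```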
- Python 3.10+
- Conda (recommended)
- Clone the repository
git clone https://github.com/ZI-MA/VCU-Bridge.git
cd VCU-Bridge
- Create environment
# Option A: Automated setup
bash setup.sh
# Option B: Manual setup
conda env create -f environment.yml
conda activate vcu-bridge
pip install -e .
- Download benchmark data
Download the benchmark data from Hugging Face and place it in the Data/ directory. See the HVCU-Bench Dataset section for detailed download instructions.
- Configure environment variables
# Copy the example environment file
cp .env.example .env
# Edit .env with your API credentials
# For OpenAI/GPT models:
# OPENAI_API_KEY=your_api_key_here
# OPENAI_API_BASE=your_custom_base_url # Optional
# For Google Gemini models:
# GEMINI_PROJECT_ID=your_project_id
# GEMINI_LOCATION=us-central1
# GEMINI_SERVICE_ACCOUNT_FILE=path/to/service_account.json
# Optional: Customize data directories
# DATA_DIR=Data
# RESULTS_DIR=Result
# IMAGE_DIR=Data/Image
# Evaluate GPT-4o on Implication Understanding (Independent Mode)
python -m Evaluation.run_evaluation openai \
--input Implication-Understanding.json \
--model gpt-4o
# Evaluate with context (hierarchical dependencies)
python -m Evaluation.run_evaluation openai \
--input Implication-Understanding.json \
--model gpt-4o \
--context_mode
# Evaluate local model with vLLM
python -m Evaluation.run_evaluation local \
--input Aesthetic-Appreciation.json \
--model Qwen/Qwen3-VL-8B-Instruct
# Parallel evaluation for faster processing
python -m Evaluation.run_evaluation openai \
--input Affective-Reasoning.json \
--parallel 4
# Compute accuracy metrics from evaluation results
python -m Evaluation.calculate_metrics Result/Implication-Understanding/Indep/gpt_4o.json
# Output includes:
# - Per-Level Accuracy (Level 1, 2, 3)
# - Full-Chain Accuracy
# - Overall Score
# - Error Attribution Analysis (for reference only)
# Run complete evaluation suite
for task in Aesthetic-Appreciation Affective-Reasoning Implication-Understanding; do
python -m Evaluation.run_evaluation openai --input ${task}.json --model gpt-4o
python -m Evaluation.calculate_metrics Result/${task}/Indep/gpt_4o.json
done
The evaluation framework supports multiple evaluator types:
- OpenAI Evaluator: Uses OpenAI API (GPT-4o, GPT-4o-mini, etc.)
- Local Evaluator: Uses local models via vLLM or compatible APIs
| Argument | Type | Description |
|---|---|---|
| `evaluator` | str | Required. Evaluator type: `openai` or `local` |
| `--input` | str | Required. Input JSON filename in the `Data/` directory |
| `--model` | str | Model name (default: `gpt-4o` for `openai`, `Qwen/Qwen3-VL-8B-Instruct` for `local`) |
| `--context_mode` | flag | Enable context mode (with hierarchical dependencies) |
| `--parallel` | int | Number of parallel workers (default: 1) |
| `--temperature` | float | Sampling temperature (default: 0.0) |
| `--base_url` | str | Custom API base URL |
- Per-Level Accuracy (Acci): Correctness at each individual level
  - Accperc: Level 1 (Foundational Perception) accuracy
  - Accbridge: Level 2 (Semantic Bridge) accuracy
  - Accconn: Level 3 (Abstract Connotation) accuracy
- Full-Chain Accuracy (Accfull): Simultaneous correctness across all three levels
  - Most stringent metric, requiring correct reasoning at all levels
- Overall Score: Mean of Accfull across all tasks
  - Single aggregate performance indicator
- Error Attribution: Identifies the first level where reasoning fails
  - Diagnostic metric for understanding failure patterns
Results are saved in JSON format. Use calculate_metrics.py to compute aggregated statistics from result files.
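For intuition, the metrics above boil down to simple aggregation over per-sample, per-level correctness. A minimal sketch, assuming each result record carries one boolean per level (the actual result schema written by run_evaluation.py may differ):

```python
def aggregate(results: list) -> dict:
    """Aggregate per-sample correctness flags into the benchmark's diagnostic metrics."""
    levels = ("perc", "bridge", "conn")
    n = len(results)
    metrics = {f"acc_{lvl}": sum(r[lvl] for r in results) / n for lvl in levels}
    # Full-chain accuracy: all three levels must be correct for the same sample.
    metrics["acc_full"] = sum(all(r[lvl] for lvl in levels) for r in results) / n
    # Error attribution: first level in the chain where a sample fails (None = fully correct).
    metrics["first_error"] = [
        next((lvl for lvl in levels if not r[lvl]), None) for r in results
    ]
    return metrics

# Example: one fully correct sample and one that fails at the bridge level.
print(aggregate([
    {"perc": True, "bridge": True, "conn": True},
    {"perc": True, "bridge": False, "conn": False},
]))
```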
Overall Performance Table (Independent Setting) - Best results in bold, second-best underlined:
(Column groups: IU = Implication Understanding, AA = Aesthetic Appreciation, AR = Affective Reasoning; each group reports Accperc, Accbridge, Accconn, Accfull.)

| Model | Size | IU Accperc | IU Accbridge | IU Accconn | IU Accfull | AA Accperc | AA Accbridge | AA Accconn | AA Accfull | AR Accperc | AR Accbridge | AR Accconn | AR Accfull | Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Basic Reference** | | | | | | | | | | | | | | |
| Human | - | 99.25 | 96.00 | 86.50 | 86.00 | 99.14 | 92.29 | 90.29 | 88.86 | 99.33 | 93.33 | 88.67 | 86.67 | 87.18 |
| GPT-4o | - | 95.50 | 85.25 | 62.75 | 53.25 | 95.43 | 78.29 | 68.00 | 53.14 | 91.33 | 83.67 | 64.33 | 50.33 | 52.24 |
| **Open-Source MLLMs** | | | | | | | | | | | | | | |
| Qwen3-VL-Instruct | 4B | 86.75 | 82.75 | 58.00 | 43.25 | 90.57 | 70.86 | 60.00 | 41.14 | 90.33 | 82.67 | 56.67 | 39.33 | 41.24 |
| Qwen3-VL-Instruct | 8B | 93.50 | 89.50 | 59.50 | 50.75 | 91.71 | 73.43 | 63.43 | 44.00 | 94.33 | 84.67 | 60.00 | 48.00 | 47.58 |
| LLaVA-1.6 | 7B | 81.75 | 58.00 | 40.25 | 18.75 | 79.14 | 36.86 | 33.14 | 9.43 | 92.00 | 58.00 | 19.33 | 12.00 | 13.39 |
| LLaVA-1.6 | 13B | 84.75 | 79.00 | 55.00 | 39.50 | 84.86 | 55.14 | 50.57 | 26.29 | 94.33 | 77.33 | 29.00 | 21.33 | 29.04 |
| Deepseek-VL2-tiny | MoE 1B/3B | 88.25 | 62.25 | 49.75 | 29.25 | 89.71 | 45.14 | 41.14 | 19.71 | 93.33 | 65.33 | 29.00 | 19.00 | 22.65 |
| Deepseek-VL2 | MoE 4.5B/27B | 93.75 | 83.00 | 60.75 | 49.50 | 95.14 | 58.00 | 38.00 | 23.71 | 96.33 | 81.33 | 46.00 | 36.67 | 36.63 |
| Gemma3 | 4B | 76.50 | 72.00 | 49.75 | 30.75 | 68.86 | 62.86 | 68.00 | 29.14 | 87.00 | 76.00 | 51.00 | 36.00 | 31.96 |
| Gemma3 | 12B | 87.50 | 85.25 | 60.50 | 47.50 | 82.86 | 70.29 | 68.00 | 38.86 | 90.67 | 86.33 | 58.00 | 46.33 | 44.23 |
| InternVL3.5 | 4B | 82.50 | 83.75 | 58.50 | 42.00 | 82.86 | 64.57 | 40.00 | 23.43 | 91.00 | 81.67 | 60.67 | 47.67 | 37.70 |
| InternVL3.5 | 8B | 82.00 | 85.25 | 55.75 | 41.75 | 84.00 | 68.00 | 60.57 | 36.29 | 86.00 | 83.67 | 55.67 | 42.00 | 40.01 |
| Phi-4-Multimodal-Instruct | 6B | 90.25 | 56.50 | 42.75 | 32.25 | 90.29 | 42.57 | 23.14 | 15.14 | 90.00 | 85.00 | 45.33 | 33.67 | 27.02 |
| Phi-3.5-Vision-Instruct | 4B | 84.25 | 83.25 | 61.25 | 44.75 | 88.29 | 61.14 | 53.43 | 33.14 | 91.33 | 82.00 | 54.33 | 41.33 | 39.74 |
Overall Performance Table (Context Setting) - Results when models are provided with hierarchical context from previous levels:
(Column groups: IU = Implication Understanding, AA = Aesthetic Appreciation, AR = Affective Reasoning; each group reports Accperc, Accbridge, Accconn, Accfull.)

| Model | Size | IU Accperc | IU Accbridge | IU Accconn | IU Accfull | AA Accperc | AA Accbridge | AA Accconn | AA Accfull | AR Accperc | AR Accbridge | AR Accconn | AR Accfull | Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Basic Reference** | | | | | | | | | | | | | | |
| GPT-4o | - | 95.50 | 89.75 | 76.50 | 65.00 | 95.43 | 82.29 | 87.71 | 72.86 | 91.33 | 86.00 | 80.67 | 66.67 | 68.18 |
| **Open-Source MLLMs** | | | | | | | | | | | | | | |
| Qwen3-VL-Instruct | 4B | 86.75 | 85.50 | 70.75 | 54.50 | 90.57 | 72.86 | 74.00 | 53.14 | 90.33 | 86.00 | 74.33 | 57.67 | 55.10 |
| Qwen3-VL-Instruct | 8B | 93.50 | 90.00 | 74.75 | 62.75 | 91.71 | 74.00 | 82.57 | 59.43 | 94.33 | 89.00 | 76.00 | 64.67 | 62.28 |
| LLaVA-1.6 | 7B | 81.75 | 68.00 | 54.50 | 32.75 | 79.14 | 43.71 | 50.00 | 18.29 | 92.00 | 65.00 | 30.00 | 18.67 | 23.24 |
| LLaVA-1.6 | 13B | 84.75 | 80.25 | 63.00 | 44.75 | 84.86 | 57.71 | 52.00 | 29.14 | 94.33 | 78.00 | 37.67 | 27.33 | 33.74 |
| Deepseek-VL2-tiny | MoE 1B/3B | 88.25 | 65.25 | 55.75 | 34.25 | 89.71 | 47.71 | 60.57 | 27.43 | 93.33 | 68.33 | 33.00 | 23.00 | 28.23 |
| Deepseek-VL2 | MoE 4.5B/27B | 93.75 | 84.25 | 67.50 | 53.75 | 95.14 | 59.43 | 54.00 | 33.43 | 96.33 | 83.67 | 62.00 | 52.00 | 46.39 |
| Gemma3 | 4B | 76.50 | 78.25 | 63.50 | 40.75 | 68.86 | 65.14 | 82.57 | 35.43 | 87.00 | 75.00 | 50.00 | 33.00 | 36.39 |
| Gemma3 | 12B | 87.50 | 88.00 | 74.50 | 57.00 | 82.86 | 72.57 | 84.86 | 50.00 | 90.67 | 86.00 | 74.33 | 59.67 | 55.56 |
| InternVL3.5 | 4B | 82.50 | 80.75 | 66.75 | 46.00 | 82.86 | 65.43 | 64.00 | 36.86 | 91.00 | 82.33 | 79.00 | 60.33 | 47.73 |
| InternVL3.5 | 8B | 82.00 | 84.75 | 66.00 | 47.25 | 84.00 | 70.57 | 71.14 | 45.43 | 86.00 | 81.67 | 68.00 | 50.33 | 47.67 |
| Phi-4-Multimodal-Instruct | 6B | 90.25 | 84.75 | 68.00 | 54.50 | 90.29 | 61.71 | 50.86 | 32.86 | 90.00 | 86.33 | 59.67 | 45.00 | 44.12 |
| Phi-3.5-Vision-Instruct | 4B | 84.25 | 84.00 | 71.50 | 51.25 | 88.29 | 63.43 | 64.00 | 40.00 | 91.33 | 81.33 | 64.67 | 49.33 | 46.86 |
- Gap between Humans and MLLMs: Humans achieve an 87.18% overall score, while GPT-4o and Qwen3-VL-8B-Instruct trail by 34.94% and 39.60% respectively, demonstrating that current MLLMs lack a stable semantic bridge from concrete evidence to abstract meaning.
- Universal Performance Degradation: Most models, regardless of scale or architecture, exhibit a sharp, cascading decline from perception → bridge → connotation. On Implication Understanding, GPT-4o drops by 32.75% and Qwen3-VL-8B-Instruct by 34.00% from Accperc to Accconn.
- Model Scale and Architecture Analysis: While increasing model scale generally improves performance, it does not resolve the fundamental challenges. LLaVA-1.6-13B significantly outperforms its 7B counterpart (29.04% vs 13.39%), yet performance at Lconn remains weak. Different architectures show distinct profiles: Qwen3-VL exhibits more balanced performance, while LLaVA-1.6 shows a particularly steep decline after Lperc.
- Hierarchical Dependencies: Providing hierarchical context yields substantial performance gains. GPT-4o demonstrates a +15.94% overall improvement and Qwen3-VL-8B-Instruct +14.70%, confirming that lower levels provide critical grounding for higher-level connotative reasoning.
Our MCTS-driven pipeline generates high-quality hierarchical training data through iterative tree search:
- Selection: UCB (Upper Confidence Bound) algorithm balances exploration and exploitation
- Expansion: Generate candidate QA pairs at each level
- Evaluation: Multi-dimensional quality assessment (logical coherence, difficulty progression, image-text alignment)
- Backpropagation: Update node statistics to guide future exploration
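Putting the four phases together, one search iteration can be sketched as follows. Node attributes and helper functions here are illustrative assumptions; the actual logic lives in the orchestrator/, tree/, and nodes/ modules:

```python
import math

def mcts_iteration(root, generate_candidates, evaluate_quality, max_depth=3, c=2.0):
    """One Selection -> Expansion -> Evaluation -> Backpropagation pass (illustrative only)."""
    # 1. Selection: descend by UCB until reaching an expandable node below the maximum depth.
    node = root
    while node.children and node.depth < max_depth:
        node = max(
            node.children,
            key=lambda ch: (ch.value / ch.visits if ch.visits else float("inf"))
            + c * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1)),
        )
    # 2. Expansion: propose candidate QA pairs for the next hierarchy level.
    for qa in generate_candidates(node):
        node.add_child(qa)
    # 3. Evaluation: score candidates on coherence, difficulty progression, and image-text alignment.
    reward = max((evaluate_quality(child) for child in node.children), default=0.0)
    # 4. Backpropagation: update visit counts and value estimates along the path back to the root.
    while node is not None:
        node.visits += 1
        node.value += reward
        node = node.parent
    return root
```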
# Configure data generation settings
cd Instruction_Data_Generation_Pipeline
cp configs/config.example.json configs/config.json
# Edit config.json with your settings
# Run MCTS-based generation
python -m Instruction_Data_Generation_Pipeline.main \
--config configs/config.json
# Generated data will be saved as JSONL files with complete reasoning chains
Key parameters in config.json:
- MCTS Parameters:
  - `mcts.max_iterations`: Maximum MCTS iterations (default: 5)
  - `mcts.max_depth`: Number of hierarchy levels (default: 3)
  - `mcts.exploration_constant`: UCB exploration parameter (default: 2.0)
- Tree Structure:
  - `tree.levels`: Per-level node capacity limits
  - `tree.max_children`: Maximum children per node
  - `tree.max_total_nodes`: Total node limit
- Quality Control:
  - `tree.quality.thresholds`: Quality score thresholds (high/medium)
  - `tree.quality.acceptance_threshold`: Minimum score for admission (default: 0.65)
- Parallel Processing:
  - `parallel.images`: Number of parallel image workers (default: 10)
  - `parallel.nodes`: Number of parallel node expansions (default: 5)
  - `client.max_concurrency`: API concurrency limit (default: 50)
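As an illustration, a config.json covering the parameters above could be generated as follows. The nested layout is inferred from the dotted key names, and values marked as assumed are placeholders; copy configs/config.example.json for the authoritative template:

```python
import json

# Nested layout mirroring the dotted keys documented above.
config = {
    "mcts": {"max_iterations": 5, "max_depth": 3, "exploration_constant": 2.0},
    "tree": {
        "levels": {"1": 3, "2": 3, "3": 3},   # assumed per-level node capacities
        "max_children": 4,                    # assumed limit
        "max_total_nodes": 60,                # assumed limit
        "quality": {
            "thresholds": {"high": 0.85, "medium": 0.70},  # assumed scores
            "acceptance_threshold": 0.65,
        },
    },
    "parallel": {"images": 10, "nodes": 5},
    "client": {"max_concurrency": 50},
}

with open("configs/config.json", "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2)
```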
- Resume Mechanism: Automatically detects and skips completed images based on tree output
- Tree State Persistence: Saves and loads MCTS tree states for checkpointing
- Batch Processing: Parallel candidate generation and evaluation
- Quality Filtering: Multi-dimensional evaluation with logical coherence checks
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
We thank the following projects and datasets for providing valuable resources:
- II-Bench, EEmo-Bench, and LayerD, whose data and benchmarks informed our benchmark construction
- OpenAI, Google, and the open-source community for providing excellent MLLMs and APIs
For questions, issues, or collaboration opportunities:
- Email: chime@zju.edu.cn
- GitHub Issues: Open an issue
- Project Page: https://vcu-bridge.github.io/
If you find this work useful in your research, please cite:
@misc{zhong2025vcubridgehierarchicalvisualconnotation,
title={VCU-Bridge: Hierarchical Visual Connotation Understanding via Semantic Bridging},
author={Ming Zhong and Yuanlei Wang and Liuzhou Zhang and Arctanx An and Renrui Zhang and Hao Liang and Ming Lu and Ying Shen and Wentao Zhang},
year={2025},
eprint={2511.18121},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.18121},
}