Ming Zhong1*,
Yuanlei Wang3*,
Liuzhou Zhang2,
Arctanx An2,
Renrui Zhang4,
Hao Liang2,
Ming Lu2,
Ying Shen3✉️,
Wentao Zhang2✉️
1Zhejiang University,
2Peking University,
3Sun Yat-sen University,
4CUHK
- Overview
- Project Structure
- Key Features
- HVCU-Bench Dataset
- Installation
- Quick Start
- Performance
- Data Generation Pipeline
- License
- Acknowledgments
- Contact
- Citation
VCU-Bridge is a comprehensive framework for evaluating and improving Hierarchical Visual Connotation Understanding in Multimodal Large Language Models (MLLMs). Unlike traditional benchmarks that test perception and reasoning in isolation, VCU-Bridge explicitly models the critical semantic bridge that connects low-level visual details to high-level abstract interpretations.
Overview of HVCU-Bench. We evaluate MLLMs across 3 task families spanning 15 diverse aspects. Our benchmark employs hierarchical decomposition: each question is systematically broken down into sub-questions across three levels (Lperc, Lbridge, Lconn), with validation ensuring logical coherence. During evaluation, models progress from low to high levels, constructing inter-level reasoning chains that emulate human visual comprehension.
- Novel Framework: First to formalize hierarchical visual connotation understanding as a three-level progressive process with explicit inter-level associations
- HVCU-Bench: A benchmark with 1,050 samples (3,150 QA pairs) covering Affective Reasoning, Aesthetic Appreciation, and Implication Understanding
- MCTS Pipeline: Monte Carlo Tree Search-driven data generation for creating high-quality hierarchical training data
- Proven Results: Instruction-tuned models show a +6.17% improvement on HVCU-Bench
VCU-Bridge/
├── Data/                                    # Benchmark data and images (download required)
│   ├── *.json                               # Task annotation files
│   └── Image/                               # Image directories by task
├── Evaluation/                              # Evaluation framework
│   ├── run_evaluation.py                    # Main evaluation script
│   ├── calculate_metrics.py                 # Metrics calculation
│   ├── base_evaluator.py                    # Base evaluator classes
│   └── answer_parser.py                     # Answer parsing utilities
├── Instruction_Data_Generation_Pipeline/    # MCTS data generation
│   ├── main.py                              # Generation entry point
│   ├── config.py                            # Configuration management
│   ├── image_processor.py                   # Image processing utilities
│   ├── configs/                             # Configuration files
│   ├── prompts/                             # Prompt templates
│   ├── utils/                               # Core utilities
│   ├── orchestrator/                        # MCTS orchestration
│   ├── services/                            # API clients and services
│   ├── tree/                                # MCTS tree implementation
│   ├── batch/                               # Batch processing
│   ├── formatter/                           # Data formatting
│   └── nodes/                               # MCTS node implementation
├── .env.example                             # Environment variables template
├── environment.yml                          # Conda environment file
├── pyproject.toml                           # Project configuration
├── setup.sh                                 # Setup script
└── README.md                                # This file
- Evaluation Module: Provides a comprehensive evaluation framework for testing MLLMs on HVCU-Bench, with support for multiple evaluators (OpenAI API and local models via vLLM)
- Instruction_Data_Generation_Pipeline Module: Implements MCTS-driven hierarchical data generation with quality filtering and diversity checking
VCU-Bridge models visual understanding as a progressive three-level process:
Level 1 - Foundational Perception (Lperc)
- Objective, low-level visual facts
- Direct observation of objects, attributes, and visual primitives
- Example: "The image shows dark clouds and a person with a hunched posture"
Level 2 - Semantic Bridge (Lbridge)
- Explanatory statements linking perception to meaning
- Causal reasoning about visual evidence
- Example: "The dark weather and body language create an atmosphere of isolation"
Level 3 - Abstract Connotation (Lconn)
- Subjective, high-level interpretations
- Aesthetic, affective, or symbolic meanings
- Example: "The scene conveys a sense of melancholy and loneliness"
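To make the three levels concrete, a single sample can be pictured as one image paired with a QA chain across the levels. The layout below is a hypothetical illustration built from the examples above, not the actual HVCU-Bench annotation schema:

```python
# Hypothetical representation of one hierarchical sample (illustrative schema,
# not the real format shipped in Data/*.json).
sample = {
    "id": "affective_0001",
    "image": "Data/Image/Affective-Reasoning/affective_0001.jpg",
    "levels": {
        "L_perc": {
            "question": "What low-level visual facts are present?",
            "answer": "The image shows dark clouds and a person with a hunched posture.",
        },
        "L_bridge": {
            "question": "How does this visual evidence shape the scene's meaning?",
            "answer": "The dark weather and body language create an atmosphere of isolation.",
        },
        "L_conn": {
            "question": "What abstract connotation does the scene carry?",
            "answer": "The scene conveys a sense of melancholy and loneliness.",
        },
    },
}
```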
- Monte Carlo Tree Search for exploring hierarchical reasoning paths
- UCB (Upper Confidence Bound) selection balancing exploration and exploitation
- Progressive validation ensuring logical coherence across levels
- Quality filtering with multi-dimensional evaluation criteria
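The UCB selection rule mentioned above is the standard UCB1 formula; a minimal sketch follows (the pipeline's own implementation lives in the tree/ and orchestrator/ modules and may differ):

```python
import math

def ucb_score(value_sum: float, visits: int, parent_visits: int, c: float = 2.0) -> float:
    """UCB1: exploitation (mean value) plus an exploration bonus for rarely visited nodes."""
    if visits == 0:
        return float("inf")  # unvisited candidates are explored first
    return value_sum / visits + c * math.sqrt(math.log(parent_visits) / visits)
```

With c = 2.0 (the documented mcts.exploration_constant default), rarely visited branches keep receiving attention even when their current average score is low.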
- Affective Reasoning (300 samples): 6 emotional categories (joy, affection, wonder, anger, fear, sadness)
- Aesthetic Appreciation (350 samples): 4 design aspects (color, composition, font, graphics)
- Implication Understanding (400 samples): 5 rhetorical devices (metaphor, symbolism, contrast, exaggeration, dislocation)
- Independent Mode: Test each level in isolation
- Context Mode: Evaluate with hierarchical dependencies
- Diagnostic Metrics: Per-level accuracy, full-chain accuracy, and error attribution
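Conceptually, the two modes differ only in what the model sees at each level. Below is a minimal sketch of prompt assembly, reusing the hypothetical sample layout shown earlier (the framework's actual prompts live in Evaluation/ and will differ):

```python
def build_prompt(sample: dict, level: str, context_mode: bool = False) -> str:
    """Ask the question for one level; in context mode, prepend lower-level QA as context."""
    order = ["L_perc", "L_bridge", "L_conn"]
    parts = []
    if context_mode:
        for lower in order[: order.index(level)]:
            qa = sample["levels"][lower]
            parts.append(f"Q: {qa['question']}\nA: {qa['answer']}")
    parts.append(f"Q: {sample['levels'][level]['question']}")
    return "\n\n".join(parts)
```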
HVCU-Bench contains 1,050 samples (3,150 QA pairs) across three task families:
- Affective Reasoning (300 samples): 6 emotional categories – joy, affection, wonder, anger, fear, sadness
- Aesthetic Appreciation (350 samples): 4 design aspects – color, composition, font, graphics
- Implication Understanding (400 samples): 5 rhetorical devices – metaphor, symbolism, contrast, exaggeration, dislocation
- Total Samples: 1,050 images
- Total QA Pairs: 3,150 (3 levels Γ 1,050 samples)
- Task Distribution:
- Affective Reasoning: 28.6% (300 samples)
- Aesthetic Appreciation: 33.3% (350 samples)
- Implication Understanding: 38.1% (400 samples)
The benchmark data is organized in the Data/ directory:
- JSON annotation files: Task-specific annotation files containing QA pairs
  - `Aesthetic-Appreciation.json`, `Affective-Reasoning.json`, `Implication-Understanding.json`
- Image directory: Images organized by task family in `Data/Image/`
  - Each task has its own subdirectory containing all relevant images
  - Image filenames correspond to sample IDs in the JSON files
  - Supported formats: JPG, PNG, WEBP
The complete benchmark data (including both JSON annotation files and images) can be downloaded from:
- Hugging Face: Chime316/HVCU-Bench
After downloading, place the data in the Data/ directory with the following structure:
Data/
├── Aesthetic-Appreciation.json
├── Affective-Reasoning.json
├── Implication-Understanding.json
└── Image/
    ├── Aesthetic-Appreciation/
    ├── Affective-Reasoning/
    └── Implication-Understanding/
Note: Both the JSON annotation files and the Image/ directory are required for evaluation.
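As a sanity check after downloading, a small loader along these lines can verify that annotations and images line up. The id field and list-of-records layout are assumptions about the JSON files; check the downloaded annotations for the actual keys:

```python
import json
from pathlib import Path

DATA_DIR = Path("Data")

def load_task(task_name: str) -> list:
    """Load one task's annotations and attach the matching image path to each record."""
    records = json.loads((DATA_DIR / f"{task_name}.json").read_text(encoding="utf-8"))
    image_dir = DATA_DIR / "Image" / task_name
    for record in records:
        # Image filenames correspond to sample IDs; extensions may be .jpg/.png/.webp.
        matches = sorted(image_dir.glob(f"{record['id']}.*"))
        record["image_path"] = str(matches[0]) if matches else None
    return records

samples = load_task("Affective-Reasoning")
print(len(samples), "samples loaded")
```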
- Python 3.10+
- Conda (recommended)
- Clone the repository
git clone https://github.com/ZI-MA/VCU-Bridge.git
cd VCU-Bridge
- Create environment
# Option A: Automated setup
bash setup.sh
# Option B: Manual setup
conda env create -f environment.yml
conda activate vcu-bridge
pip install -e .
- Download benchmark data
Download the benchmark data from Hugging Face and place it in the Data/ directory. See the HVCU-Bench Dataset section for detailed download instructions.
- Configure environment variables
# Copy the example environment file
cp .env.example .env
# Edit .env with your API credentials
# For OpenAI/GPT models:
# OPENAI_API_KEY=your_api_key_here
# OPENAI_API_BASE=your_custom_base_url # Optional
# For Google Gemini models:
# GEMINI_PROJECT_ID=your_project_id
# GEMINI_LOCATION=us-central1
# GEMINI_SERVICE_ACCOUNT_FILE=path/to/service_account.json
# Optional: Customize data directories
# DATA_DIR=Data
# RESULTS_DIR=Result
# IMAGE_DIR=Data/Image
# Evaluate GPT-4o on Implication Understanding (Independent Mode)
python -m Evaluation.run_evaluation openai \
--input Implication-Understanding.json \
--model gpt-4o
# Evaluate with context (hierarchical dependencies)
python -m Evaluation.run_evaluation openai \
--input Implication-Understanding.json \
--model gpt-4o \
--context_mode
# Evaluate local model with vLLM
python -m Evaluation.run_evaluation local \
--input Aesthetic-Appreciation.json \
--model Qwen/Qwen3-VL-8B-Instruct
# Parallel evaluation for faster processing
python -m Evaluation.run_evaluation openai \
--input Affective-Reasoning.json \
--parallel 4
# Compute accuracy metrics from evaluation results
python -m Evaluation.calculate_metrics Result/Implication-Understanding/Indep/gpt_4o.json
# Output includes:
# - Per-Level Accuracy (Level 1, 2, 3)
# - Full-Chain Accuracy
# - Overall Score
# - Error Attribution Analysis (for reference only)
# Run complete evaluation suite
for task in Aesthetic-Appreciation Affective-Reasoning Implication-Understanding; do
python -m Evaluation.run_evaluation openai --input ${task}.json --model gpt-4o
python -m Evaluation.calculate_metrics Result/${task}/Indep/gpt_4o.json
done
The evaluation framework supports multiple evaluator types:
- OpenAI Evaluator: Uses OpenAI API (GPT-4o, GPT-4o-mini, etc.)
- Local Evaluator: Uses local models via vLLM or compatible APIs
| Argument | Type | Description |
|---|---|---|
| `evaluator` | str | Required. Evaluator type: `openai` or `local` |
| `--input` | str | Required. Input JSON filename in the `Data/` directory |
| `--model` | str | Model name (default: `gpt-4o` for `openai`, `Qwen/Qwen3-VL-8B-Instruct` for `local`) |
| `--context_mode` | flag | Enable context mode (with hierarchical dependencies) |
| `--parallel` | int | Number of parallel workers (default: 1) |
| `--temperature` | float | Sampling temperature (default: 0.0) |
| `--base_url` | str | Custom API base URL |
- Per-Level Accuracy (Acci): Correctness at each individual level
  - Accperc: Level 1 (Foundational Perception) accuracy
  - Accbridge: Level 2 (Semantic Bridge) accuracy
  - Accconn: Level 3 (Abstract Connotation) accuracy
- Full-Chain Accuracy (Accfull): Simultaneous correctness across all three levels
  - Most stringent metric, requiring correct reasoning at all levels
- Overall Score: Mean of Accfull across all tasks
  - Single aggregate performance indicator
- Error Attribution: Identifies the first level where reasoning fails
  - Diagnostic metric for understanding failure patterns
Results are saved in JSON format. Use calculate_metrics.py to compute aggregated statistics from result files.
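For intuition, the metrics above boil down to simple aggregation over per-sample, per-level correctness. A minimal sketch, assuming each result record carries one boolean per level (the actual result schema written by run_evaluation.py may differ):

```python
def aggregate(results: list) -> dict:
    """Aggregate per-sample correctness flags into the benchmark's diagnostic metrics."""
    levels = ("perc", "bridge", "conn")
    n = len(results)
    metrics = {f"acc_{lvl}": sum(r[lvl] for r in results) / n for lvl in levels}
    # Full-chain accuracy: all three levels must be correct for the same sample.
    metrics["acc_full"] = sum(all(r[lvl] for lvl in levels) for r in results) / n
    # Error attribution: first level in the chain where a sample fails (None = fully correct).
    metrics["first_error"] = [
        next((lvl for lvl in levels if not r[lvl]), None) for r in results
    ]
    return metrics

# Example: one fully correct sample and one that fails at the bridge level.
print(aggregate([
    {"perc": True, "bridge": True, "conn": True},
    {"perc": True, "bridge": False, "conn": False},
]))
```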
Overall Performance Table (Independent Setting) - Best results in bold, second-best underlined:
(Column groups: IU = Implication Understanding, AA = Aesthetic Appreciation, AR = Affective Reasoning; each group reports Accperc, Accbridge, Accconn, Accfull.)

| Model | Size | IU Accperc | IU Accbridge | IU Accconn | IU Accfull | AA Accperc | AA Accbridge | AA Accconn | AA Accfull | AR Accperc | AR Accbridge | AR Accconn | AR Accfull | Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Basic Reference** | | | | | | | | | | | | | | |
| Human | - | 99.25 | 96.00 | 86.50 | 86.00 | 99.14 | 92.29 | 90.29 | 88.86 | 99.33 | 93.33 | 88.67 | 86.67 | 87.18 |
| GPT-4o | - | 95.50 | 85.25 | 62.75 | 53.25 | 95.43 | 78.29 | 68.00 | 53.14 | 91.33 | 83.67 | 64.33 | 50.33 | 52.24 |
| **Open-Source MLLMs** | | | | | | | | | | | | | | |
| Qwen3-VL-Instruct | 4B | 86.75 | 82.75 | 58.00 | 43.25 | 90.57 | 70.86 | 60.00 | 41.14 | 90.33 | 82.67 | 56.67 | 39.33 | 41.24 |
| Qwen3-VL-Instruct | 8B | 93.50 | 89.50 | 59.50 | 50.75 | 91.71 | 73.43 | 63.43 | 44.00 | 94.33 | 84.67 | 60.00 | 48.00 | 47.58 |
| LLaVA-1.6 | 7B | 81.75 | 58.00 | 40.25 | 18.75 | 79.14 | 36.86 | 33.14 | 9.43 | 92.00 | 58.00 | 19.33 | 12.00 | 13.39 |
| LLaVA-1.6 | 13B | 84.75 | 79.00 | 55.00 | 39.50 | 84.86 | 55.14 | 50.57 | 26.29 | 94.33 | 77.33 | 29.00 | 21.33 | 29.04 |
| Deepseek-VL2-tiny | MoE 1B/3B | 88.25 | 62.25 | 49.75 | 29.25 | 89.71 | 45.14 | 41.14 | 19.71 | 93.33 | 65.33 | 29.00 | 19.00 | 22.65 |
| Deepseek-VL2 | MoE 4.5B/27B | 93.75 | 83.00 | 60.75 | 49.50 | 95.14 | 58.00 | 38.00 | 23.71 | 96.33 | 81.33 | 46.00 | 36.67 | 36.63 |
| Gemma3 | 4B | 76.50 | 72.00 | 49.75 | 30.75 | 68.86 | 62.86 | 68.00 | 29.14 | 87.00 | 76.00 | 51.00 | 36.00 | 31.96 |
| Gemma3 | 12B | 87.50 | 85.25 | 60.50 | 47.50 | 82.86 | 70.29 | 68.00 | 38.86 | 90.67 | 86.33 | 58.00 | 46.33 | 44.23 |
| InternVL3.5 | 4B | 82.50 | 83.75 | 58.50 | 42.00 | 82.86 | 64.57 | 40.00 | 23.43 | 91.00 | 81.67 | 60.67 | 47.67 | 37.70 |
| InternVL3.5 | 8B | 82.00 | 85.25 | 55.75 | 41.75 | 84.00 | 68.00 | 60.57 | 36.29 | 86.00 | 83.67 | 55.67 | 42.00 | 40.01 |
| Phi-4-Multimodal-Instruct | 6B | 90.25 | 56.50 | 42.75 | 32.25 | 90.29 | 42.57 | 23.14 | 15.14 | 90.00 | 85.00 | 45.33 | 33.67 | 27.02 |
| Phi-3.5-Vision-Instruct | 4B | 84.25 | 83.25 | 61.25 | 44.75 | 88.29 | 61.14 | 53.43 | 33.14 | 91.33 | 82.00 | 54.33 | 41.33 | 39.74 |
Overall Performance Table (Context Setting) - Results when models are provided with hierarchical context from previous levels:
(Column groups: IU = Implication Understanding, AA = Aesthetic Appreciation, AR = Affective Reasoning; each group reports Accperc, Accbridge, Accconn, Accfull.)

| Model | Size | IU Accperc | IU Accbridge | IU Accconn | IU Accfull | AA Accperc | AA Accbridge | AA Accconn | AA Accfull | AR Accperc | AR Accbridge | AR Accconn | AR Accfull | Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Basic Reference** | | | | | | | | | | | | | | |
| GPT-4o | - | 95.50 | 89.75 | 76.50 | 65.00 | 95.43 | 82.29 | 87.71 | 72.86 | 91.33 | 86.00 | 80.67 | 66.67 | 68.18 |
| **Open-Source MLLMs** | | | | | | | | | | | | | | |
| Qwen3-VL-Instruct | 4B | 86.75 | 85.50 | 70.75 | 54.50 | 90.57 | 72.86 | 74.00 | 53.14 | 90.33 | 86.00 | 74.33 | 57.67 | 55.10 |
| Qwen3-VL-Instruct | 8B | 93.50 | 90.00 | 74.75 | 62.75 | 91.71 | 74.00 | 82.57 | 59.43 | 94.33 | 89.00 | 76.00 | 64.67 | 62.28 |
| LLaVA-1.6 | 7B | 81.75 | 68.00 | 54.50 | 32.75 | 79.14 | 43.71 | 50.00 | 18.29 | 92.00 | 65.00 | 30.00 | 18.67 | 23.24 |
| LLaVA-1.6 | 13B | 84.75 | 80.25 | 63.00 | 44.75 | 84.86 | 57.71 | 52.00 | 29.14 | 94.33 | 78.00 | 37.67 | 27.33 | 33.74 |
| Deepseek-VL2-tiny | MoE 1B/3B | 88.25 | 65.25 | 55.75 | 34.25 | 89.71 | 47.71 | 60.57 | 27.43 | 93.33 | 68.33 | 33.00 | 23.00 | 28.23 |
| Deepseek-VL2 | MoE 4.5B/27B | 93.75 | 84.25 | 67.50 | 53.75 | 95.14 | 59.43 | 54.00 | 33.43 | 96.33 | 83.67 | 62.00 | 52.00 | 46.39 |
| Gemma3 | 4B | 76.50 | 78.25 | 63.50 | 40.75 | 68.86 | 65.14 | 82.57 | 35.43 | 87.00 | 75.00 | 50.00 | 33.00 | 36.39 |
| Gemma3 | 12B | 87.50 | 88.00 | 74.50 | 57.00 | 82.86 | 72.57 | 84.86 | 50.00 | 90.67 | 86.00 | 74.33 | 59.67 | 55.56 |
| InternVL3.5 | 4B | 82.50 | 80.75 | 66.75 | 46.00 | 82.86 | 65.43 | 64.00 | 36.86 | 91.00 | 82.33 | 79.00 | 60.33 | 47.73 |
| InternVL3.5 | 8B | 82.00 | 84.75 | 66.00 | 47.25 | 84.00 | 70.57 | 71.14 | 45.43 | 86.00 | 81.67 | 68.00 | 50.33 | 47.67 |
| Phi-4-Multimodal-Instruct | 6B | 90.25 | 84.75 | 68.00 | 54.50 | 90.29 | 61.71 | 50.86 | 32.86 | 90.00 | 86.33 | 59.67 | 45.00 | 44.12 |
| Phi-3.5-Vision-Instruct | 4B | 84.25 | 84.00 | 71.50 | 51.25 | 88.29 | 63.43 | 64.00 | 40.00 | 91.33 | 81.33 | 64.67 | 49.33 | 46.86 |
- Gap between Humans and MLLMs: Humans achieve an 87.18% overall score, while GPT-4o and Qwen3-VL-8B-Instruct trail by 34.94% and 39.60% respectively, demonstrating that current MLLMs lack a stable semantic bridge from concrete evidence to abstract meaning.
- Universal Performance Degradation: Most models, regardless of scale or architecture, exhibit a sharp, cascading decline from perception → bridge → connotation. On Implication Understanding, GPT-4o drops by 32.75% and Qwen3-VL-8B-Instruct by 34.00% from Accperc to Accconn.
- Model Scale and Architecture Analysis: While increasing model scale generally improves performance, it does not resolve the fundamental challenges. LLaVA-1.6-13B significantly outperforms its 7B counterpart (29.04% vs 13.39%), yet performance at Lconn remains weak. Different architectures show distinct profiles: Qwen3-VL exhibits more balanced performance, while LLaVA-1.6 shows a particularly steep decline after Lperc.
- Hierarchical Dependencies: Providing hierarchical context yields substantial performance gains. GPT-4o demonstrates a +15.94% overall improvement and Qwen3-VL-8B-Instruct +14.70%, confirming that lower levels provide critical grounding for higher-level connotative reasoning.
Our MCTS-driven pipeline generates high-quality hierarchical training data through iterative tree search:
- Selection: UCB (Upper Confidence Bound) algorithm balances exploration and exploitation
- Expansion: Generate candidate QA pairs at each level
- Evaluation: Multi-dimensional quality assessment (logical coherence, difficulty progression, image-text alignment)
- Backpropagation: Update node statistics to guide future exploration
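Putting the four phases together, one search iteration can be sketched as follows. Node attributes and helper functions here are illustrative assumptions; the actual logic lives in the orchestrator/, tree/, and nodes/ modules:

```python
import math

def mcts_iteration(root, generate_candidates, evaluate_quality, max_depth=3, c=2.0):
    """One Selection -> Expansion -> Evaluation -> Backpropagation pass (illustrative only)."""
    # 1. Selection: descend by UCB until reaching an expandable node below the maximum depth.
    node = root
    while node.children and node.depth < max_depth:
        node = max(
            node.children,
            key=lambda ch: (ch.value / ch.visits if ch.visits else float("inf"))
            + c * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1)),
        )
    # 2. Expansion: propose candidate QA pairs for the next hierarchy level.
    for qa in generate_candidates(node):
        node.add_child(qa)
    # 3. Evaluation: score candidates on coherence, difficulty progression, and image-text alignment.
    reward = max((evaluate_quality(child) for child in node.children), default=0.0)
    # 4. Backpropagation: update visit counts and value estimates along the path back to the root.
    while node is not None:
        node.visits += 1
        node.value += reward
        node = node.parent
    return root
```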
# Configure data generation settings
cd Instruction_Data_Generation_Pipeline
cp configs/config.example.json configs/config.json
# Edit config.json with your settings
# Run MCTS-based generation
python -m Instruction_Data_Generation_Pipeline.main \
--config configs/config.json
# Generated data will be saved as JSONL files with complete reasoning chains
Key parameters in config.json:
- MCTS Parameters:
  - `mcts.max_iterations`: Maximum MCTS iterations (default: 5)
  - `mcts.max_depth`: Number of hierarchy levels (default: 3)
  - `mcts.exploration_constant`: UCB exploration parameter (default: 2.0)
- Tree Structure:
  - `tree.levels`: Per-level node capacity limits
  - `tree.max_children`: Maximum children per node
  - `tree.max_total_nodes`: Total node limit
- Quality Control:
  - `tree.quality.thresholds`: Quality score thresholds (high/medium)
  - `tree.quality.acceptance_threshold`: Minimum score for admission (default: 0.65)
- Parallel Processing:
  - `parallel.images`: Number of parallel image workers (default: 10)
  - `parallel.nodes`: Number of parallel node expansions (default: 5)
  - `client.max_concurrency`: API concurrency limit (default: 50)
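As an illustration, a config.json covering the parameters above could be generated as follows. The nested layout is inferred from the dotted key names, and values marked as assumed are placeholders; copy configs/config.example.json for the authoritative template:

```python
import json

# Nested layout mirroring the dotted keys documented above.
config = {
    "mcts": {"max_iterations": 5, "max_depth": 3, "exploration_constant": 2.0},
    "tree": {
        "levels": {"1": 3, "2": 3, "3": 3},   # assumed per-level node capacities
        "max_children": 4,                    # assumed limit
        "max_total_nodes": 60,                # assumed limit
        "quality": {
            "thresholds": {"high": 0.85, "medium": 0.70},  # assumed scores
            "acceptance_threshold": 0.65,
        },
    },
    "parallel": {"images": 10, "nodes": 5},
    "client": {"max_concurrency": 50},
}

with open("configs/config.json", "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2)
```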
- Resume Mechanism: Automatically detects and skips completed images based on tree output
- Tree State Persistence: Saves and loads MCTS tree states for checkpointing
- Batch Processing: Parallel candidate generation and evaluation
- Quality Filtering: Multi-dimensional evaluation with logical coherence checks
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
We thank the following projects and datasets for providing valuable resources:
- II-Bench, EEmo-Bench, and LayerD, whose data and benchmarks informed our benchmark construction
- OpenAI, Google, and the open-source community for providing excellent MLLMs and APIs
For questions, issues, or collaboration opportunities:
- Email: chime@zju.edu.cn
- GitHub Issues: Open an issue
- Project Page: https://vcu-bridge.github.io/
If you find this work useful in your research, please cite:
@misc{zhong2025vcubridgehierarchicalvisualconnotation,
title={VCU-Bridge: Hierarchical Visual Connotation Understanding via Semantic Bridging},
author={Ming Zhong and Yuanlei Wang and Liuzhou Zhang and Arctanx An and Renrui Zhang and Hao Liang and Ming Lu and Ying Shen and Wentao Zhang},
year={2025},
eprint={2511.18121},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.18121},
}