VCU-Bridge: Hierarchical Visual Connotation Understanding via Semantic Bridging

Ming Zhong¹*, Yuanlei Wang³*, Liuzhou Zhang², Arctanx An², Renrui Zhang⁴, Hao Liang², Ming Lu², Ying Shen³✉️, Wentao Zhang²✉️
¹Zhejiang University, ²Peking University, ³Sun Yat-sen University, ⁴CUHK

Paper Project Page Dataset License


πŸ“‹ Table of Contents


🎯 Overview

VCU-Bridge is a comprehensive framework for evaluating and improving Hierarchical Visual Connotation Understanding in Multimodal Large Language Models (MLLMs). Unlike traditional benchmarks that test perception and reasoning in isolation, VCU-Bridge explicitly models the critical semantic bridge that connects low-level visual details to high-level abstract interpretations.

HVCU-Bench Overview

Overview of HVCU-Bench. We evaluate MLLMs across 3 task families spanning 15 diverse aspects. Our benchmark employs hierarchical decomposition: each question is systematically broken down into sub-questions across three levels (L_perc, L_bridge, L_conn), with validation ensuring logical coherence. During evaluation, models progress from low to high levels, constructing inter-level reasoning chains that emulate human visual comprehension.

Core Contributions

  • πŸ“ Novel Framework: First to formalize hierarchical visual connotation understanding as a three-level progressive process with explicit inter-level associations
  • πŸ“Š HVCU-Bench: A benchmark with 1,050 samples (3,150 QA pairs) covering Affective Reasoning, Aesthetic Appreciation, and Implication Understanding
  • 🌲 MCTS Pipeline: Monte Carlo Tree Search-driven data generation for creating high-quality hierarchical training data
  • βœ… Proven Results: Instruction-tuned models show +6.17% improvement on HVCU-Bench

πŸ“ Project Structure

VCU-Bridge/
β”œβ”€β”€ Data/                                # Benchmark data and images (download required)
β”‚   β”œβ”€β”€ *.json                           # Task annotation files
β”‚   └── Image/                           # Image directories by task
β”œβ”€β”€ Evaluation/                          # Evaluation framework
β”‚   β”œβ”€β”€ run_evaluation.py                # Main evaluation script
β”‚   β”œβ”€β”€ calculate_metrics.py             # Metrics calculation
β”‚   β”œβ”€β”€ base_evaluator.py                # Base evaluator classes
β”‚   └── answer_parser.py                 # Answer parsing utilities
β”œβ”€β”€ Instruction_Data_Generation_Pipeline/  # MCTS data generation
β”‚   β”œβ”€β”€ main.py                          # Generation entry point
β”‚   β”œβ”€β”€ config.py                        # Configuration management
β”‚   β”œβ”€β”€ image_processor.py               # Image processing utilities
β”‚   β”œβ”€β”€ configs/                         # Configuration files
β”‚   β”œβ”€β”€ prompts/                         # Prompt templates
β”‚   └── utils/                           # Core utilities
β”‚       β”œβ”€β”€ orchestrator/                # MCTS orchestration
β”‚       β”œβ”€β”€ services/                    # API clients and services
β”‚       β”œβ”€β”€ tree/                        # MCTS tree implementation
β”‚       β”œβ”€β”€ batch/                       # Batch processing
β”‚       β”œβ”€β”€ formatter/                   # Data formatting
β”‚       └── nodes/                       # MCTS node implementation
β”œβ”€β”€ .env.example                          # Environment variables template
β”œβ”€β”€ environment.yml                      # Conda environment file
β”œβ”€β”€ pyproject.toml                       # Project configuration
β”œβ”€β”€ setup.sh                             # Setup script
└── README.md                            # This file

Core Modules

  • Evaluation Module: Provides a comprehensive evaluation framework for testing MLLMs on HVCU-Bench, with support for multiple evaluators (the OpenAI API and local models via vLLM)
  • Instruction_Data_Generation_Pipeline Module: Implements MCTS-driven hierarchical data generation with quality filtering and diversity checking

✨ Key Features

🎯 Hierarchical Evaluation Framework

VCU-Bridge models visual understanding as a progressive three-level process:

Level 1 - Foundational Perception (L_perc)

  • Objective, low-level visual facts
  • Direct observation of objects, attributes, and visual primitives
  • Example: "The image shows dark clouds and a person with a hunched posture"

Level 2 - Semantic Bridge (L_bridge)

  • Explanatory statements linking perception to meaning
  • Causal reasoning about visual evidence
  • Example: "The dark weather and body language create an atmosphere of isolation"

Level 3 - Abstract Connotation (L_conn)

  • Subjective, high-level interpretations
  • Aesthetic, affective, or symbolic meanings
  • Example: "The scene conveys a sense of melancholy and loneliness"

πŸ”¬ MCTS-Driven Generation

  • Monte Carlo Tree Search for exploring hierarchical reasoning paths
  • UCB (Upper Confidence Bound) selection balancing exploration and exploitation
  • Progressive validation ensuring logical coherence across levels
  • Quality filtering with multi-dimensional evaluation criteria
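
As a rough illustration of the selection step (not the pipeline's exact implementation), UCB scores each candidate node by its average quality plus an exploration bonus; the constant c here mirrors the documented mcts.exploration_constant default of 2.0.

import math

def ucb_score(total_value: float, visits: int, parent_visits: int, c: float = 2.0) -> float:
    """Generic UCB1 score: average node quality plus an exploration bonus."""
    if visits == 0:
        return float("inf")  # unvisited candidates are explored first
    exploitation = total_value / visits
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return exploitation + exploration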

πŸ“Š Comprehensive Tasks

  • Affective Reasoning (300 samples): 6 emotional categories (joy, affection, wonder, anger, fear, sadness)
  • Aesthetic Appreciation (350 samples): 4 design aspects (color, composition, font, graphics)
  • Implication Understanding (400 samples): 5 rhetorical devices (metaphor, symbolism, contrast, exaggeration, dislocation)

πŸŽ›οΈ Flexible Evaluation

  • Independent Mode: Test each level in isolation
  • Context Mode: Evaluate with hierarchical dependencies
  • Diagnostic Metrics: Per-level accuracy, full-chain accuracy, and error attribution
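
To make the two modes concrete, the sketch below shows how a context-mode query could prepend lower-level questions and answers as grounding, while the independent mode asks each question in isolation. The actual prompt templates live in the Evaluation module and will differ in wording.

def build_query(question: str, prior_qa: list[tuple[str, str]], context_mode: bool) -> str:
    """Illustrative only; the real prompt construction is handled by the Evaluation module."""
    if not context_mode or not prior_qa:
        # Independent Mode: each level is answered with no knowledge of the others
        return question
    # Context Mode: lower-level QA pairs are supplied as hierarchical grounding
    context = "\n".join(f"Q: {q}\nA: {a}" for q, a in prior_qa)
    return f"Context from lower levels:\n{context}\n\nQuestion: {question}"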

πŸ“Š HVCU-Bench Dataset

HVCU-Bench contains 1,050 samples (3,150 QA pairs) across three task families:

  • Affective Reasoning (300 samples): 6 emotional categories β€” joy, affection, wonder, anger, fear, sadness
  • Aesthetic Appreciation (350 samples): 4 design aspects β€” color, composition, font, graphics
  • Implication Understanding (400 samples): 5 rhetorical devices β€” metaphor, symbolism, contrast, exaggeration, dislocation

Dataset Statistics

  • Total Samples: 1,050 images
  • Total QA Pairs: 3,150 (3 levels Γ— 1,050 samples)
  • Task Distribution:
    • Affective Reasoning: 28.6% (300 samples)
    • Aesthetic Appreciation: 33.3% (350 samples)
    • Implication Understanding: 38.1% (400 samples)

Dataset Organization

The benchmark data is organized in the Data/ directory:

  • JSON annotation files: Task-specific annotation files containing QA pairs
    • Aesthetic-Appreciation.json
    • Affective-Reasoning.json
    • Implication-Understanding.json
  • Image directory: Images organized by task family in Data/Image/
    • Each task has its own subdirectory containing all relevant images
    • Image filenames correspond to sample IDs in the JSON files
    • Supported formats: JPG, PNG, WEBP
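
Because image filenames follow the sample IDs, resolving a sample's image amounts to checking the task subdirectory for each supported extension. A small sketch, assuming each annotation file is a JSON list of samples and that the sample-ID field is named id (both assumptions):

import json
from pathlib import Path

def find_image(task: str, sample_id: str, image_root: Path = Path("Data/Image")) -> Path | None:
    """Return the image path for a sample, trying each supported format in turn."""
    for ext in (".jpg", ".png", ".webp"):
        candidate = image_root / task / f"{sample_id}{ext}"
        if candidate.exists():
            return candidate
    return None

# Example: walk one task's annotations (the "id" key is hypothetical).
annotations = json.loads(Path("Data/Implication-Understanding.json").read_text(encoding="utf-8"))
for item in annotations:
    image_path = find_image("Implication-Understanding", item["id"])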

Download Dataset

The complete benchmark data (including both JSON annotation files and images) can be downloaded from Hugging Face (see the Dataset badge above).

After downloading, place the data in the Data/ directory with the following structure:

Data/
β”œβ”€β”€ Aesthetic-Appreciation.json
β”œβ”€β”€ Affective-Reasoning.json
β”œβ”€β”€ Implication-Understanding.json
└── Image/
    β”œβ”€β”€ Aesthetic-Appreciation/
    β”œβ”€β”€ Affective-Reasoning/
    └── Implication-Understanding/

Note: Both JSON annotation files and Image directory are required for evaluation.
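
A quick way to confirm the layout before running an evaluation (a convenience sketch, not part of the repository):

from pathlib import Path

data_dir = Path("Data")
tasks = ["Aesthetic-Appreciation", "Affective-Reasoning", "Implication-Understanding"]
for task in tasks:
    assert (data_dir / f"{task}.json").is_file(), f"missing {task}.json"
    assert (data_dir / "Image" / task).is_dir(), f"missing Image/{task}/"
print("Data/ layout looks complete.")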


πŸ› οΈ Installation

Prerequisites

  • Python 3.10+
  • Conda (recommended)

Setup

  1. Clone the repository
git clone https://github.com/ZI-MA/VCU-Bridge.git
cd VCU-Bridge
  2. Create environment
# Option A: Automated setup
bash setup.sh

# Option B: Manual setup
conda env create -f environment.yml
conda activate vcu-bridge
pip install -e .
  3. Download benchmark data

Download the benchmark data from Hugging Face and place it in the Data/ directory. See the HVCU-Bench Dataset section for detailed download instructions.

  4. Configure environment variables
# Copy the example environment file
cp .env.example .env

# Edit .env with your API credentials
# For OpenAI/GPT models:
# OPENAI_API_KEY=your_api_key_here
# OPENAI_API_BASE=your_custom_base_url  # Optional

# For Google Gemini models:
# GEMINI_PROJECT_ID=your_project_id
# GEMINI_LOCATION=us-central1
# GEMINI_SERVICE_ACCOUNT_FILE=path/to/service_account.json

# Optional: Customize data directories
# DATA_DIR=Data
# RESULTS_DIR=Result
# IMAGE_DIR=Data/Image
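
The evaluation scripts are expected to read these variables from the environment (or the .env file). A minimal pre-flight check, assuming the variable names shown in .env.example above:

import os

# Variable names taken from .env.example above; extend the list for Gemini credentials if needed.
required = ["OPENAI_API_KEY"]
missing = [name for name in required if not os.getenv(name)]
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
print("Environment looks ready for OpenAI-based evaluation.")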

πŸš€ Quick Start

Evaluate a Model

# Evaluate GPT-4o on Implication Understanding (Independent Mode)
python -m Evaluation.run_evaluation openai \
  --input Implication-Understanding.json \
  --model gpt-4o

# Evaluate with context (hierarchical dependencies)
python -m Evaluation.run_evaluation openai \
  --input Implication-Understanding.json \
  --model gpt-4o \
  --context_mode

# Evaluate local model with vLLM
python -m Evaluation.run_evaluation local \
  --input Aesthetic-Appreciation.json \
  --model Qwen/Qwen3-VL-8B-Instruct

# Parallel evaluation for faster processing
python -m Evaluation.run_evaluation openai \
  --input Affective-Reasoning.json \
  --parallel 4

Calculate Metrics

# Compute accuracy metrics from evaluation results
python -m Evaluation.calculate_metrics Result/Implication-Understanding/Indep/gpt_4o.json

# Output includes:
# - Per-Level Accuracy (Level 1, 2, 3)
# - Full-Chain Accuracy
# - Overall Score
# - Error Attribution Analysis (for reference only)

Evaluate All Benchmarks

# Run complete evaluation suite
for task in Aesthetic-Appreciation Affective-Reasoning Implication-Understanding; do
  python -m Evaluation.run_evaluation openai --input ${task}.json --model gpt-4o
  python -m Evaluation.calculate_metrics Result/${task}/Indep/gpt_4o.json
done

Evaluation Framework

The evaluation framework supports multiple evaluator types:

  • OpenAI Evaluator: Uses OpenAI API (GPT-4o, GPT-4o-mini, etc.)
  • Local Evaluator: Uses local models via vLLM or compatible APIs

Command-Line Options

| Argument | Type | Description |
|---|---|---|
| evaluator | str | Required. Evaluator type: openai or local |
| --input | str | Required. Input JSON filename in the Data/ directory |
| --model | str | Model name (default: gpt-4o for openai, Qwen/Qwen3-VL-8B-Instruct for local) |
| --context_mode | flag | Enable context mode (with hierarchical dependencies) |
| --parallel | int | Number of parallel workers (default: 1) |
| --temperature | float | Sampling temperature (default: 0.0) |
| --base_url | str | Custom API base URL |

Evaluation Metrics

  • Per-Level Accuracy (Acc_i): Correctness at each individual level
    • Acc_perc: Level 1 (Foundational Perception) accuracy
    • Acc_bridge: Level 2 (Semantic Bridge) accuracy
    • Acc_conn: Level 3 (Abstract Connotation) accuracy
  • Full-Chain Accuracy (Acc_full): Simultaneous correctness across all three levels; the most stringent metric, requiring correct reasoning at every level
  • Overall Score: Mean of Acc_full across all tasks; a single aggregate performance indicator
  • Error Attribution: Identifies the first level where reasoning fails; a diagnostic metric for understanding failure patterns
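
The sketch below shows how these metrics relate to one another given per-level correctness flags for each sample; it is illustrative only and does not reproduce calculate_metrics.py or its result schema.

from typing import Dict, List

def chain_metrics(records: List[Dict[str, bool]]) -> Dict[str, object]:
    """Illustrative metric computation from per-level correctness flags."""
    n = len(records)
    levels = ("perc", "bridge", "conn")
    per_level = {lvl: sum(r[lvl] for r in records) / n for lvl in levels}
    full_chain = sum(all(r[lvl] for lvl in levels) for r in records) / n
    # Error attribution: count the first level at which each failed chain breaks.
    first_error = {lvl: 0 for lvl in levels}
    for r in records:
        for lvl in levels:
            if not r[lvl]:
                first_error[lvl] += 1
                break
    return {"per_level": per_level, "full_chain": full_chain, "first_error": first_error}

metrics = chain_metrics([
    {"perc": True, "bridge": True, "conn": False},
    {"perc": True, "bridge": True, "conn": True},
])
# per_level: perc 1.0, bridge 1.0, conn 0.5; full_chain 0.5; first_error attributes the broken chain to "conn"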

Result Analysis

Results are saved in JSON format. Use calculate_metrics.py to compute aggregated statistics from result files.


πŸ† Performance

Main Results on HVCU-Bench

Overall Performance Table. Column groups: IU = Implication Understanding, AA = Aesthetic Appreciation, AR = Affective Reasoning.

| Model | Size | IU Acc_perc | IU Acc_bridge | IU Acc_conn | IU Acc_full | AA Acc_perc | AA Acc_bridge | AA Acc_conn | AA Acc_full | AR Acc_perc | AR Acc_bridge | AR Acc_conn | AR Acc_full | Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Basic Reference | | | | | | | | | | | | | | |
| Human | - | 99.25 | 96.00 | 86.50 | 86.00 | 99.14 | 92.29 | 90.29 | 88.86 | 99.33 | 93.33 | 88.67 | 86.67 | 87.18 |
| GPT-4o | - | 95.50 | 85.25 | 62.75 | 53.25 | 95.43 | 78.29 | 68.00 | 53.14 | 91.33 | 83.67 | 64.33 | 50.33 | 52.24 |
| Open-Source MLLMs | | | | | | | | | | | | | | |
| Qwen3-VL-Instruct | 4B | 86.75 | 82.75 | 58.00 | 43.25 | 90.57 | 70.86 | 60.00 | 41.14 | 90.33 | 82.67 | 56.67 | 39.33 | 41.24 |
| Qwen3-VL-Instruct | 8B | 93.50 | 89.50 | 59.50 | 50.75 | 91.71 | 73.43 | 63.43 | 44.00 | 94.33 | 84.67 | 60.00 | 48.00 | 47.58 |
| LLaVA-1.6 | 7B | 81.75 | 58.00 | 40.25 | 18.75 | 79.14 | 36.86 | 33.14 | 9.43 | 92.00 | 58.00 | 19.33 | 12.00 | 13.39 |
| LLaVA-1.6 | 13B | 84.75 | 79.00 | 55.00 | 39.50 | 84.86 | 55.14 | 50.57 | 26.29 | 94.33 | 77.33 | 29.00 | 21.33 | 29.04 |
| Deepseek-VL2-tiny | MoE 1B/3B | 88.25 | 62.25 | 49.75 | 29.25 | 89.71 | 45.14 | 41.14 | 19.71 | 93.33 | 65.33 | 29.00 | 19.00 | 22.65 |
| Deepseek-VL2 | MoE 4.5B/27B | 93.75 | 83.00 | 60.75 | 49.50 | 95.14 | 58.00 | 38.00 | 23.71 | 96.33 | 81.33 | 46.00 | 36.67 | 36.63 |
| Gemma3 | 4B | 76.50 | 72.00 | 49.75 | 30.75 | 68.86 | 62.86 | 68.00 | 29.14 | 87.00 | 76.00 | 51.00 | 36.00 | 31.96 |
| Gemma3 | 12B | 87.50 | 85.25 | 60.50 | 47.50 | 82.86 | 70.29 | 68.00 | 38.86 | 90.67 | 86.33 | 58.00 | 46.33 | 44.23 |
| InternVL3.5 | 4B | 82.50 | 83.75 | 58.50 | 42.00 | 82.86 | 64.57 | 40.00 | 23.43 | 91.00 | 81.67 | 60.67 | 47.67 | 37.70 |
| InternVL3.5 | 8B | 82.00 | 85.25 | 55.75 | 41.75 | 84.00 | 68.00 | 60.57 | 36.29 | 86.00 | 83.67 | 55.67 | 42.00 | 40.01 |
| Phi-4-Multimodal-Instruct | 6B | 90.25 | 56.50 | 42.75 | 32.25 | 90.29 | 42.57 | 23.14 | 15.14 | 90.00 | 85.00 | 45.33 | 33.67 | 27.02 |
| Phi-3.5-Vision-Instruct | 4B | 84.25 | 83.25 | 61.25 | 44.75 | 88.29 | 61.14 | 53.43 | 33.14 | 91.33 | 82.00 | 54.33 | 41.33 | 39.74 |

Results with Context Mode

Overall Performance Table (Context Setting). Results when models are provided with hierarchical context from previous levels. Column groups as above: IU = Implication Understanding, AA = Aesthetic Appreciation, AR = Affective Reasoning.

| Model | Size | IU Acc_perc | IU Acc_bridge | IU Acc_conn | IU Acc_full | AA Acc_perc | AA Acc_bridge | AA Acc_conn | AA Acc_full | AR Acc_perc | AR Acc_bridge | AR Acc_conn | AR Acc_full | Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Basic Reference | | | | | | | | | | | | | | |
| GPT-4o | - | 95.50 | 89.75 | 76.50 | 65.00 | 95.43 | 82.29 | 87.71 | 72.86 | 91.33 | 86.00 | 80.67 | 66.67 | 68.18 |
| Open-Source MLLMs | | | | | | | | | | | | | | |
| Qwen3-VL-Instruct | 4B | 86.75 | 85.50 | 70.75 | 54.50 | 90.57 | 72.86 | 74.00 | 53.14 | 90.33 | 86.00 | 74.33 | 57.67 | 55.10 |
| Qwen3-VL-Instruct | 8B | 93.50 | 90.00 | 74.75 | 62.75 | 91.71 | 74.00 | 82.57 | 59.43 | 94.33 | 89.00 | 76.00 | 64.67 | 62.28 |
| LLaVA-1.6 | 7B | 81.75 | 68.00 | 54.50 | 32.75 | 79.14 | 43.71 | 50.00 | 18.29 | 92.00 | 65.00 | 30.00 | 18.67 | 23.24 |
| LLaVA-1.6 | 13B | 84.75 | 80.25 | 63.00 | 44.75 | 84.86 | 57.71 | 52.00 | 29.14 | 94.33 | 78.00 | 37.67 | 27.33 | 33.74 |
| Deepseek-VL2-tiny | MoE 1B/3B | 88.25 | 65.25 | 55.75 | 34.25 | 89.71 | 47.71 | 60.57 | 27.43 | 93.33 | 68.33 | 33.00 | 23.00 | 28.23 |
| Deepseek-VL2 | MoE 4.5B/27B | 93.75 | 84.25 | 67.50 | 53.75 | 95.14 | 59.43 | 54.00 | 33.43 | 96.33 | 83.67 | 62.00 | 52.00 | 46.39 |
| Gemma3 | 4B | 76.50 | 78.25 | 63.50 | 40.75 | 68.86 | 65.14 | 82.57 | 35.43 | 87.00 | 75.00 | 50.00 | 33.00 | 36.39 |
| Gemma3 | 12B | 87.50 | 88.00 | 74.50 | 57.00 | 82.86 | 72.57 | 84.86 | 50.00 | 90.67 | 86.00 | 74.33 | 59.67 | 55.56 |
| InternVL3.5 | 4B | 82.50 | 80.75 | 66.75 | 46.00 | 82.86 | 65.43 | 64.00 | 36.86 | 91.00 | 82.33 | 79.00 | 60.33 | 47.73 |
| InternVL3.5 | 8B | 82.00 | 84.75 | 66.00 | 47.25 | 84.00 | 70.57 | 71.14 | 45.43 | 86.00 | 81.67 | 68.00 | 50.33 | 47.67 |
| Phi-4-Multimodal-Instruct | 6B | 90.25 | 84.75 | 68.00 | 54.50 | 90.29 | 61.71 | 50.86 | 32.86 | 90.00 | 86.33 | 59.67 | 45.00 | 44.12 |
| Phi-3.5-Vision-Instruct | 4B | 84.25 | 84.00 | 71.50 | 51.25 | 88.29 | 63.43 | 64.00 | 40.00 | 91.33 | 81.33 | 64.67 | 49.33 | 46.86 |

Key Findings

  1. πŸ‘₯ Gap between Humans and MLLMs: Humans achieve 87.18% overall score, while GPT-4o and Qwen3-VL-8B-Instruct fall behind by -34.94% and -39.60% respectively, demonstrating that current MLLMs lack a stable semantic bridge from concrete evidence to abstract meaning.

  2. πŸ“‰ Universal Performance Degradation: Most models, regardless of scale or architecture, exhibit a sharp, cascading decline from perception β†’ bridge β†’ connotation. GPT-4o experiences -32.75% degradation, Qwen3-VL-8B-Instruct degrades by -34.00% on Implication Understanding.

  3. πŸ“Š Model Scale and Architecture Analysis: While increasing model scale generally improves performance, it does not resolve the fundamental challenges. LLaVA-1.6-13B significantly outperforms its 7B counterpart (29.04% vs 13.39%), yet performance at Lconn remains weak. Different architectures show distinct profiles: Qwen3-VL exhibits more balanced performance, while LLaVA-1.6 shows a particularly steep decline after Lperc.

  4. πŸ”— Hierarchical Dependencies: Providing hierarchical context yields substantial performance gains. GPT-4o demonstrates +15.94% overall improvement, while Qwen3-VL-8B-Instruct achieves +14.70%, confirming that lower levels provide critical grounding for higher-level connotative reasoning.


🌲 Data Generation Pipeline

Our MCTS-driven pipeline generates high-quality hierarchical training data through iterative tree search:

Key Components

  1. Selection: UCB (Upper Confidence Bound) algorithm balances exploration and exploitation
  2. Expansion: Generate candidate QA pairs at each level
  3. Evaluation: Multi-dimensional quality assessment (logical coherence, difficulty progression, image-text alignment)
  4. Backpropagation: Update node statistics to guide future exploration
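
A self-contained toy version of this loop is sketched below. The actual implementation lives under utils/orchestrator, utils/tree, and utils/nodes, and expands real QA candidates scored by an MLLM rather than assigning random scores.

import math
import random
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    depth: int = 0
    parent: Optional["Node"] = None
    value: float = 0.0
    visits: int = 0
    children: list["Node"] = field(default_factory=list)

def ucb(node: Node, c: float = 2.0) -> float:
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(math.log(node.parent.visits) / node.visits)

def run_mcts(root: Node, max_iterations: int = 5, max_depth: int = 3, max_children: int = 2) -> Node:
    root.visits = 1  # so the exploration term is defined for the first children
    for _ in range(max_iterations):
        # 1. Selection: descend by UCB score while children exist
        node = root
        while node.children:
            node = max(node.children, key=ucb)
        # 2. Expansion: in the real pipeline, candidate QA pairs for the next level
        if node.depth < max_depth:
            node.children = [Node(depth=node.depth + 1, parent=node) for _ in range(max_children)]
            node = random.choice(node.children)
        # 3. Evaluation: stand-in for the multi-dimensional quality score
        score = random.random()
        # 4. Backpropagation: update visit counts and values up to the root
        while node is not None:
            node.visits += 1
            node.value += score
            node = node.parent
    return root

tree = run_mcts(Node())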

Generate Training Data

# Configure data generation settings
cd Instruction_Data_Generation_Pipeline
cp configs/config.example.json configs/config.json
# Edit config.json with your settings

# Run MCTS-based generation
python -m Instruction_Data_Generation_Pipeline.main \
  --config configs/config.json

# Generated data will be saved as JSONL files with complete reasoning chains

Configuration Options

Key parameters in config.json:

  • MCTS Parameters:
    • mcts.max_iterations: Maximum MCTS iterations (default: 5)
    • mcts.max_depth: Number of hierarchy levels (default: 3)
    • mcts.exploration_constant: UCB exploration parameter (default: 2.0)
  • Tree Structure:
    • tree.levels: Per-level node capacity limits
    • tree.max_children: Maximum children per node
    • tree.max_total_nodes: Total node limit
  • Quality Control:
    • tree.quality.thresholds: Quality score thresholds (high/medium)
    • tree.quality.acceptance_threshold: Minimum score for admission (default: 0.65)
  • Parallel Processing:
    • parallel.images: Number of parallel image workers (default: 10)
    • parallel.nodes: Number of parallel node expansions (default: 5)
    • client.max_concurrency: API concurrency limit (default: 50)
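
For orientation, the documented parameters could be arranged roughly as follows, written here as a Python dict; configs/config.example.json is the authoritative template, and whether its keys are nested or dotted there is not guaranteed.

# Values shown are the documented defaults; None marks fields whose defaults are not documented here.
example_config = {
    "mcts": {"max_iterations": 5, "max_depth": 3, "exploration_constant": 2.0},
    "tree": {
        "levels": {},             # per-level node capacity limits
        "max_children": None,     # maximum children per node
        "max_total_nodes": None,  # total node limit
        "quality": {
            "thresholds": {"high": None, "medium": None},
            "acceptance_threshold": 0.65,
        },
    },
    "parallel": {"images": 10, "nodes": 5},
    "client": {"max_concurrency": 50},
}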

Advanced Features

  • Resume Mechanism: Automatically detects and skips completed images based on tree output
  • Tree State Persistence: Saves and loads MCTS tree states for checkpointing
  • Batch Processing: Parallel candidate generation and evaluation
  • Quality Filtering: Multi-dimensional evaluation with logical coherence checks

πŸ“œ License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.


πŸ™ Acknowledgments

We thank the following projects and organizations:

  • II-Bench, EEmo-Bench, and LayerD for providing valuable resources for our benchmark
  • OpenAI, Google, and the open-source community for providing excellent MLLMs and APIs

πŸ“§ Contact

For questions, issues, or collaboration opportunities, please open an issue on this repository or reach out to the corresponding authors, Ying Shen and Wentao Zhang.


πŸ“„ Citation

If you find this work useful in your research, please cite:

@misc{zhong2025vcubridgehierarchicalvisualconnotation,
      title={VCU-Bridge: Hierarchical Visual Connotation Understanding via Semantic Bridging}, 
      author={Ming Zhong and Yuanlei Wang and Liuzhou Zhang and Arctanx An and Renrui Zhang and Hao Liang and Ming Lu and Ying Shen and Wentao Zhang},
      year={2025},
      eprint={2511.18121},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.18121}, 
}

If you find VCU-Bridge useful, please give us a star🌟!
