†: joint first author (equal contribution); *: corresponding author
- [2025-05-16] We are thrilled to announce that our paper HiBench: Benchmarking LLMs Capability on Hierarchical Structure Reasoning has been accepted to KDD 2025! 🎉
- [2025-04-06] We are excited to announce that the HiBench dataset is now available on 🤗 Hugging Face! 🎉 Explore it to benchmark your models' hierarchical reasoning capabilities (a minimal loading sketch follows the news). 🚀
- [2025-03-02] We are thrilled to announce that our paper HiBench: Benchmarking LLMs Capability on Hierarchical Structure Reasoning is now available on arXiv! 📄 Dive into the details of our benchmark and findings. 🌟
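As a quick start, the dataset announced above can be pulled with the 🤗 `datasets` library. This is a minimal sketch: the repository id `HiBench/HiBench` and the `train` split name are assumptions, so check the dataset page on Hugging Face for the actual names.

```python
# Minimal loading sketch for the HiBench dataset on Hugging Face.
# NOTE: the repository id and split name below are placeholders;
# substitute the actual ones from the dataset page.
from datasets import load_dataset

dataset = load_dataset("HiBench/HiBench")  # hypothetical repo id
print(dataset)                             # inspect the available splits
print(dataset["train"][0])                 # peek at one problem (assumed split name)
```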
Welcome to 👋HiBench, the First Comprehensive Hierarchical Structure Understanding Benchmark for LLMs.
- To the best of our knowledge, it is the first benchmark specifically designed to evaluate the hierarchical reasoning abilities of LLMs, encompassing tasks of varying scales and complexities for comprehensive evaluation.
- We evaluate 20 LLMs and reveal that even the most advanced models struggle with hierarchical structure reasoning, offering new insights into this capability.
- We propose a synthetic hierarchical dataset for task-specific fine-tuning, which enhances LLMs' hierarchical reasoning ability and surpasses GPT-4 by 6.53% across all tasks (a minimal fine-tuning sketch follows this list).
- Please cite and star this repository if you find it helpful; it encourages us a lot 🥰.
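For readers who want to replicate the recipe in spirit, the sketch below fine-tunes a small causal LM on synthetic tree questions with 🤗 transformers. It is a minimal sketch under stated assumptions: the data generator, the base model (`Qwen/Qwen2-0.5B`), and every hyperparameter are placeholders of ours, not the paper's exact setup.

```python
# Sketch of task-specific fine-tuning on synthetic hierarchical data.
# Everything here is illustrative: the data generator, the base model,
# and all hyperparameters are assumptions, not the paper's exact setup.
import random

from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

def synth_example(n_nodes: int = 7) -> dict:
    """One synthetic tree-depth Q&A pair (a toy stand-in for the
    paper's synthetic hierarchical dataset)."""
    parents = {i: random.randrange(i) for i in range(1, n_nodes)}  # child -> parent, root = 0
    node = random.randrange(1, n_nodes)
    depth, cur = 0, node
    while cur != 0:                          # walk up to the root, counting edges
        cur, depth = parents[cur], depth + 1
    edges = ", ".join(f"{p}->{c}" for c, p in parents.items())
    return {"text": f"Edges: {edges}. Depth of node {node}? Answer: {depth}"}

train_data = Dataset.from_list([synth_example() for _ in range(1_000)])

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")  # assumed base model
if tok.pad_token is None:
    tok.pad_token = tok.eos_token            # causal LMs often ship without a pad token
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")

train_data = train_data.map(
    lambda batch: tok(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="hibench-sft",
                           per_device_train_batch_size=8,
                           num_train_epochs=1),
    train_dataset=train_data,
    # mlm=False -> labels are the input ids, shifted inside the model (causal LM)
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```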
HiBench includes tasks ranging from basic to advanced: 7 fundamental hierarchical understanding tasks at the Fundamental level; 5 JSON structure understanding tasks and 3 formula structure understanding tasks at the Intermediate level; and 2 code structure understanding tasks and 3 scientific paper understanding tasks at the Advanced level, totaling 20 tasks covering 15,852 problems.
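To make the task format concrete, here is a toy illustration of the kind of question a Fundamental-level task poses. It is an invented example, not an actual HiBench item.

```python
# Toy illustration of a fundamental hierarchical-structure question
# (an invented example, not taken from the HiBench dataset).
tree = {"root": ["a", "b"], "a": ["c", "d"], "b": [], "c": [], "d": []}

def depth(node: str, root: str = "root") -> int:
    """Number of edges from `root` down to `node`, found by DFS."""
    if node == root:
        return 0
    for child in tree[root]:
        try:
            return 1 + depth(node, child)
        except ValueError:
            continue                        # `node` is not in this subtree
    raise ValueError(f"{node} not under {root}")

print(depth("d"))                                   # 2
print([n for n, kids in tree.items() if not kids])  # leaves: ['b', 'c', 'd']
```

An LLM sees only the serialized edges in the prompt and must perform this traversal implicitly, which is the kind of reasoning the Fundamental-level tasks probe.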

Figure 2: Task definition in HiBench. HiBench contains 3 levels of evaluation, 5 types of tasks, and 20 subtasks.
Figure 3: Performance comparison of the best models from different families on multiple hierarchical tasks.
This repository has completed evaluations of the Qwen, Llama, GPT, and ChatGLM model families.
More LLMs are currently being evaluated to extend our experiments, and we will add more benchmark datasets for evaluating LLMs' hierarchical understanding. Updates will be rolled out frequently.
- Test Phi, InternLM, Yi, Baichuan, and Mistral on HiBench.
- Check all datasets and add binary datasets.
```bash
conda create -n HiBench python=3.11
conda activate HiBench
pip install -r requirements.txt
python ./launch.py
```
☑️ About the Developers:
- All developers of HiBench are PhD/MPhil students at The Hong Kong Polytechnic University 🇭🇰.
- HiBench's codebase was designed by Zhuohang Jiang.
- The Fundamental Tasks of HiBench are mainly undertaken by Pangjing Wu and Ziran Liang.
- The Code Programming Task and JSON Task are mainly undertaken by Qi Chen and Ye Jia.
- The Paper Task is mainly undertaken by Xu Yuan.
- The Formula Task is mainly undertaken by Jiancheng Tu.
If you find our work valuable and it has contributed to your research or projects, we kindly request that you cite our paper. Your recognition is a driving force for our continuous improvement and innovation 🤗.
```bibtex
@inproceedings{jiang2025hibench,
  title={HiBench: Benchmarking LLMs Capability on Hierarchical Structure Reasoning},
  author={Jiang, Zhuohang and Wu, Pangjing and Liang, Ziran and Chen, Peter Q and Yuan, Xu and Jia, Ye and Tu, Jiancheng and Li, Chen and Ng, Peter HF and Li, Qing},
  booktitle={Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2},
  pages={5505--5515},
  year={2025}
}
```


