†: joint first author (equal contribution); *: corresponding author
- [2025-05-16] We are thrilled to announce that our paper HiBench: Benchmarking LLMs Capability on Hierarchical Structure Reasoning has been accepted to KDD 2025! 🎉
- [2025-04-06] We are excited to announce that the HiBench dataset is now available on 🤗 Hugging Face! 🎉 Explore it to benchmark your models' hierarchical reasoning capabilities (a minimal loading sketch follows the news). 🚀
- [2025-03-02] We are thrilled to announce that our paper HiBench: Benchmarking LLMs Capability on Hierarchical Structure Reasoning is now available on arXiv! 📄 Dive into the details of our benchmark and findings. 🌟
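As a quick start, the dataset announced above can be pulled with the 🤗 `datasets` library. This is a minimal sketch: the repository id `HiBench/HiBench` and the `train` split name are assumptions, so check the dataset page on Hugging Face for the actual names.

```python
# Minimal loading sketch for the HiBench dataset on Hugging Face.
# NOTE: the repository id and split name below are placeholders;
# substitute the actual ones from the dataset page.
from datasets import load_dataset

dataset = load_dataset("HiBench/HiBench")  # hypothetical repo id
print(dataset)                             # inspect the available splits
print(dataset["train"][0])                 # peek at one problem (assumed split name)
```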
Welcome to 👋HiBench, the First Comprehensive Hierarchical Structure Understanding Benchmark for LLMs.
- To the best of our knowledge, it is the first benchmark specifically designed to evaluate the hierarchical reasoning abilities of LLMs, encompassing tasks of varying scales and complexities for comprehensive evaluation.
- We evaluate 20 LLMs and reveal that even the most advanced models struggle with hierarchical structure reasoning, offering new insights into this capability.
- We propose a synthetic hierarchical dataset for task-specific fine-tuning, which enhances LLMs' hierarchical reasoning ability and surpasses GPT-4 by 6.53% across all tasks (a minimal fine-tuning sketch follows this list).
- Please cite and star this repository if you find it helpful; it encourages us a lot 🥰.
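For readers who want to replicate the recipe in spirit, the sketch below fine-tunes a small causal LM on synthetic tree questions with 🤗 transformers. It is a minimal sketch under stated assumptions: the data generator, the base model (`Qwen/Qwen2-0.5B`), and every hyperparameter are placeholders of ours, not the paper's exact setup.

```python
# Sketch of task-specific fine-tuning on synthetic hierarchical data.
# Everything here is illustrative: the data generator, the base model,
# and all hyperparameters are assumptions, not the paper's exact setup.
import random

from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

def synth_example(n_nodes: int = 7) -> dict:
    """One synthetic tree-depth Q&A pair (a toy stand-in for the
    paper's synthetic hierarchical dataset)."""
    parents = {i: random.randrange(i) for i in range(1, n_nodes)}  # child -> parent, root = 0
    node = random.randrange(1, n_nodes)
    depth, cur = 0, node
    while cur != 0:                          # walk up to the root, counting edges
        cur, depth = parents[cur], depth + 1
    edges = ", ".join(f"{p}->{c}" for c, p in parents.items())
    return {"text": f"Edges: {edges}. Depth of node {node}? Answer: {depth}"}

train_data = Dataset.from_list([synth_example() for _ in range(1_000)])

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")  # assumed base model
if tok.pad_token is None:
    tok.pad_token = tok.eos_token            # causal LMs often ship without a pad token
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")

train_data = train_data.map(
    lambda batch: tok(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="hibench-sft",
                           per_device_train_batch_size=8,
                           num_train_epochs=1),
    train_dataset=train_data,
    # mlm=False -> labels are the input ids, shifted inside the model (causal LM)
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```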
HiBench includes tasks ranging from basic to advanced: 7 fundamental hierarchical understanding tasks at the Fundamental level; 5 JSON structure understanding tasks and 3 formula structure understanding tasks at the Intermediate level; and 2 code structure understanding tasks and 3 scientific paper understanding tasks at the Advanced level, totaling 20 tasks covering 15,852 problems.
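To make the task format concrete, here is a toy illustration of the kind of question a Fundamental-level task poses. It is an invented example, not an actual HiBench item.

```python
# Toy illustration of a fundamental hierarchical-structure question
# (an invented example, not taken from the HiBench dataset).
tree = {"root": ["a", "b"], "a": ["c", "d"], "b": [], "c": [], "d": []}

def depth(node: str, root: str = "root") -> int:
    """Number of edges from `root` down to `node`, found by DFS."""
    if node == root:
        return 0
    for child in tree[root]:
        try:
            return 1 + depth(node, child)
        except ValueError:
            continue                        # `node` is not in this subtree
    raise ValueError(f"{node} not under {root}")

print(depth("d"))                                   # 2
print([n for n, kids in tree.items() if not kids])  # leaves: ['b', 'c', 'd']
```

An LLM sees only the serialized edges in the prompt and must perform this traversal implicitly, which is the kind of reasoning the Fundamental-level tasks probe.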

Figure 2: Task definition in HiBench. HiBench contains 3 levels of evaluation, 5 types of tasks, and 20 subtasks.
Figure 3: Performance comparison of the best models from different families on multiple hierarchical tasks.
This repository has completed evaluations of the Qwen, Llama, GPT, and ChatGLM model families.
More LLMs are currently being evaluated to extend our experiments, and we will add more benchmark datasets for evaluating LLMs' hierarchical understanding. Updates will be rolled out frequently.
- Test Phi, InternLM, Yi, Baichuan, and Mistral on HiBench.
- Check all datasets and add binary datasets.
```bash
conda create -n HiBench python=3.11
conda activate HiBench
pip install -r requirements.txt
python ./launch.py
```
☑️ About the Developers:
- All developers of HiBench are PhD/MPhil students at The Hong Kong Polytechnic University 🇭🇰.
- HiBench's codebase was designed by Zhuohang Jiang.
- The Fundamental Tasks of HiBench are mainly undertaken by Pangjing Wu and Ziran Liang.
- The Code Programming Task and JSON Task are mainly undertaken by Qi Chen and Ye Jia.
- The Paper Task is mainly undertaken by Xu Yuan.
- The Formula Task is mainly undertaken by Jiancheng Tu.
If you find our work valuable and it has contributed to your research or projects, we kindly request that you cite our paper. Your recognition is a driving force for our continuous improvement and innovation 🤗.
```bibtex
@inproceedings{jiang2025hibench,
  title={HiBench: Benchmarking LLMs Capability on Hierarchical Structure Reasoning},
  author={Jiang, Zhuohang and Wu, Pangjing and Liang, Ziran and Chen, Peter Q and Yuan, Xu and Jia, Ye and Tu, Jiancheng and Li, Chen and Ng, Peter HF and Li, Qing},
  booktitle={Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2},
  pages={5505--5515},
  year={2025}
}
```


