
[KDD'25]👋HiBench: Benchmarking LLMs Capability on Hierarchical Structure Reasoning

Zhuohang Jiang†, Pangjing Wu†, Ziran Liang†, Qi Chen†, Xu Yuan†, Ye Jia†, Jiancheng Tu†, Chen Li, Peter H.F. Ng, Qing Li*

†: joint first author & equal contribution; *: corresponding author


News

  • [2025-05-16] We are thrilled to announce that our paper HiBench: Benchmarking LLMs Capability on Hierarchical Structure Reasoning has been accepted to KDD 2025! 🎉
  • [2025-04-06] We are excited to announce that the HiBench dataset is now available on 🤗 Hugging Face! 🎉 Explore and use it to benchmark your models' hierarchical reasoning capabilities (a minimal loading sketch follows this list). 🚀
  • [2025-03-02] We are thrilled to announce that our paper HiBench: Benchmarking LLMs Capability on Hierarchical Structure Reasoning is now available on arXiv! 📄 Dive into the details of our benchmark and findings. 🌟
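
For convenience, the snippet below is a minimal sketch of loading the dataset with the Hugging Face datasets library. The dataset id used here is a placeholder (this README does not state the exact repository id), so replace it with the actual HiBench dataset id from Hugging Face.

from datasets import load_dataset

# Minimal sketch, assuming the standard Hugging Face `datasets` API.
# DATASET_ID is a placeholder; replace it with the actual HiBench dataset id on Hugging Face.
DATASET_ID = "HiBench/HiBench"
ds = load_dataset(DATASET_ID)  # downloads and caches the benchmark splits
print(ds)                      # inspect the available splits and columns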

Overview

Welcome to 👋HiBench, the first comprehensive Hierarchical Structure Understanding Benchmark for LLMs.

  • To the best of our knowledge, HiBench is the first benchmark specifically designed to evaluate the hierarchical reasoning abilities of LLMs, encompassing tasks of varying scales and complexities for comprehensive evaluation.
  • We evaluate 20 LLMs and reveal that even the most advanced LLMs struggle with hierarchical reasoning, offering new insights into this capability.
  • We propose a synthetic hierarchical dataset for task-specific fine-tuning, which enhances LLMs' hierarchical reasoning ability, surpassing GPT-4 by 6.53% across all tasks.
  • Please cite and star if you find this work helpful. It encourages us a lot 🥰.

Figure 1: Overview of the paradigm for HiBench.

Task Definition

HiBench includes tasks ranging from basic to advanced levels: 7 fundamental hierarchical understanding tasks at the Fundamental level; 5 JSON structure understanding tasks and 3 formula structure understanding tasks at the Intermediate level; and 2 code structure understanding tasks and 3 scientific paper understanding tasks at the Advanced level, totaling 20 tasks covering 15,852 problems.

Figure 2: Task definition in HiBench. HiBench contains 3 levels of evaluation, 5 types of tasks, and 20 subtasks.


Figure 3: Performance comparison of the best models from different families on multiple hierarchical tasks.

Features under development

This repository has completed evaluation of the Qwen, Llama, GPT, and ChatGLM model families.

More LLMs are currently being evaluated to extend our experiments, and we will add more benchmark datasets to evaluate LLMs' hierarchical understanding ability. Updates will be rolled out frequently.

  • Test Phi, InternLM, Yi, Baichuan, and Mistral LLMs on HiBench.

  • Check all datasets and add binary datasets

Quick Start

Install HiBench

conda create -n HiBench python=3.11
conda activate HiBench
pip install -r requirements.txt

Run HiBench Evaluation

python ./launch.py
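
As an illustration only, and not the repository's actual interface, the sketch below shows the shape of a HiBench-style fundamental task: a small hierarchy is serialized into a prompt, the model under test is asked a structural question, and its answer is checked against the ground truth. The toy hierarchy, question wording, and query_your_llm call are hypothetical.

# Illustrative sketch of a HiBench-style task instance (hypothetical; not the launch.py API).
# A toy hierarchy given as child -> parent edges.
edges = {"B": "A", "C": "A", "D": "B", "E": "B"}

def depth(node: str) -> int:
    """Number of edges from `node` up to the root."""
    d = 0
    while node in edges:
        node = edges[node]
        d += 1
    return d

question = (
    "Given the parent relations "
    + ", ".join(f"{child} is a child of {parent}" for child, parent in edges.items())
    + ", what is the depth of node E if the root has depth 0? Answer with a single integer."
)
ground_truth = str(depth("E"))  # "2"

# model_answer = query_your_llm(question)  # hypothetical call to the model under test
model_answer = "2"
print("correct" if model_answer.strip() == ground_truth else "incorrect")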

Results


Figure 4: Performance of different LLMs in HiBench.


About

☑️About the Developers:

  • All developers of HiBench are PhD/MPhil students at The Hong Kong Polytechnic University 🇭🇰.
  • HiBench's codebase was designed by Zhuohang Jiang.
  • The Fundamental Tasks of HiBench were mainly undertaken by Pangjing Wu and Ziran Liang.
  • The Code Programming Task and JSON Task were mainly undertaken by Qi Chen and Ye Jia.
  • The Paper Task was mainly undertaken by Xu Yuan.
  • The Formula Task was mainly undertaken by Jiancheng Tu.

Citation

If you find our work valuable and it has contributed to your research or projects, we kindly request that you cite our paper. Your recognition is a driving force for our continuous improvement and innovation🤗.

@inproceedings{jiang2025hibench,
  title={HiBench: Benchmarking LLMs Capability on Hierarchical Structure Reasoning},
  author={Jiang, Zhuohang and Wu, Pangjing and Liang, Ziran and Chen, Peter Q and Yuan, Xu and Jia, Ye and Tu, Jiancheng and Li, Chen and Ng, Peter HF and Li, Qing},
  booktitle={Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2},
  pages={5505--5515},
  year={2025}
}

