We present the Benchmark of IR tasks with Complex Objectives (BIRCO) to evaluate the ability of Information Retrieval (IR) models to follow multi-faceted task objectives. We study the performance of various embedding, distilled and fine-tuned IR models on BIRCO, and find them lacking. We provide a unified framework for investigating the performance of large language models (LLMs) on these tasks. The proposed framework consists of 3 modular components: task-objective awareness; chain-of-thought reasoning; and task decomposition. We investigate the effects of these factors on LLM performance, and identify a simple baseline model which matches or outperforms existing approaches and more complex alternatives. No approach achieves satisfactory performance on all benchmark tasks, suggesting that stronger models and new retrieval protocols are necessary to address complex user needs.
We recommend creating a new conda environment for the following steps:

```bash
conda create -n birco python=3.10
conda activate birco
```

To start, please run:

```bash
bash setup.sh
```

Running `setup.sh` will create a new conda environment named `birco`, download the needed packages and datasets, and unzip the needed files.
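After `setup.sh` finishes, a quick sanity check (assuming the default repository layout, run from the repository root) is:

```python
import os
import sys

# Quick sanity check after setup.sh: confirm the Python version of the birco
# env and that the expected top-level directories were unpacked.
print("python:", sys.version.split()[0])  # should be 3.10.x inside the birco env
for d in ("datasets", "generation"):
    print(f"{d}/:", "found" if os.path.isdir(d) else "MISSING")
```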
The datasets, with test and dev sets, are in `datasets/`. We provide information about the test sets of the five datasets in BIRCO:
- DORIS-MAE: 60 queries that are complex research questions from computer scientists. Each query communicates specific requirements that relevant research papers must satisfy. Each query has a candidate pool of approximately 110 passages.
- ArguAna: 100 queries, each with a candidate pool of around 50 passages. Queries and passages are both complex one-paragraph arguments about current affairs. The objective is to find matching counter-arguments.
- Clinical-Trial: 100 queries that are paragraph-length patient case-reports. Each query has a candidate pool comprising 30-110 passages that are paragraph-length descriptions of clinical trials. The objective is to find the most suitable clinical trial for a patient.
- WhatsThatBook (WTB): 100 queries, with each query describing a book in an ambiguous manner. Each query has a pool of 50 passages, which are book descriptions.
- RELIC: 100 queries, which are excerpts from scholars analyzing classic English-language literature. Each query is a literary analysis from which a quotation has been removed; the passages are candidate sentences from the novel under discussion. The objective is to match each literary analysis with its missing quotation.
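To look at the raw data after running `setup.sh`, a minimal way to inspect the layout under `datasets/` (without assuming any particular file naming or format) is:

```python
import os

# Walk the datasets/ directory and show what each benchmark dataset contains.
# File names and formats are whatever setup.sh unpacks; this only lists them.
root = "datasets"
for name in sorted(os.listdir(root)):
    path = os.path.join(root, name)
    if os.path.isdir(path):
        contents = sorted(os.listdir(path))
        print(f"{name}: {len(contents)} entries, e.g. {contents[:3]}")
```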
The prompts for each dataset and method are in `generation/generation_data/prompts/`, and sample generated outputs are stored in `generation/generation_data/output/`.
To generate full results from GPT:

- Set up your API key and OpenAI model:
  - Open the `gpt_generation.py` file.
  - Locate the line `openai.api_key = None` and replace `None` with your OpenAI API key.
  - Our default model is `"gpt-4-0613"`; change this according to your requirements.
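  `gpt_generation.py` handles the prompting and threading itself; the sketch below only illustrates the legacy `openai`-python (pre-1.0) chat-completion call implied by the `openai.api_key` line and the `"gpt-4-0613"` default. The helper name `query_gpt` is illustrative, not part of the repository.

  ```python
  import openai  # legacy (<1.0) interface, matching the `openai.api_key = None` line

  openai.api_key = "sk-..."  # your OpenAI API key
  MODEL = "gpt-4-0613"       # default model; change as needed

  def query_gpt(prompt: str) -> str:
      # One chat-completion call per prompt; the repository script's actual
      # prompting and threading logic may differ from this sketch.
      response = openai.ChatCompletion.create(
          model=MODEL,
          messages=[{"role": "user", "content": prompt}],
          temperature=0,
      )
      return response["choices"][0]["message"]["content"]
  ```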
- Configure the Shell Script:
  - Open the `run_generation.sh` script in a text editor.
  - Set the following variables according to your requirements:
    - `thread_num`: number of threads, usually `20`.
    - `prompt_path`: path to the prompt file; this should be a pickle that stores a dictionary mapping id strings to prompt strings.
    - `name`: prefix of the name for the output file, e.g., `"arguana_test_set_reason_O"`.
    - `start`: starting index for processing, usually `0`.
    - `end`: ending index for processing; this depends on the length of the prompt file.
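  To pick sensible values for `prompt_path` and `end`, it can help to inspect the prompt file first. The sketch below assumes only what is stated above (a pickled dict mapping id strings to prompt strings); the file name shown is a hypothetical example.

  ```python
  import pickle

  # The prompt file is expected to be a pickled dict: {id string: prompt string}.
  # The path below is a hypothetical example; point it at your actual prompt file.
  prompt_path = "generation/generation_data/prompts/arguana_test_set_reason_O.pkl"
  with open(prompt_path, "rb") as f:
      prompts = pickle.load(f)

  print(len(prompts), "prompts")  # a reasonable value for the `end` variable
  first_id = next(iter(prompts))
  print(first_id, "->", prompts[first_id][:200])
  ```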
- Run the Generation Script:
  - Execute the `run_generation.sh` script to start the generation process.
  - We recommend using `nohup` to run the generation process:

    ```bash
    cd generation
    nohup bash run_generation.sh > generation_log/<LOG_FILE> 2>&1 &
    ```

    Remember to replace `<LOG_FILE>` with the name of the log file.
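While the generation runs, you can check progress by counting the files written so far. This sketch assumes only that outputs land under `generation/generation_data/output/` as described above; their naming and format are whatever `gpt_generation.py` produces.

```python
import os

# Count and preview the generation outputs written so far.
out_dir = "generation/generation_data/output"
files = sorted(os.listdir(out_dir))
print(len(files), "output files")
print(files[:5])
```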
After generation, to get the ranking results, run the script for the corresponding scoring-based method:

```bash
bash run_rank_score.sh    # For score+Ogpt
bash run_rank_reason.sh   # For reason+Ogpt
bash run_rank_subtask.sh  # For subtask+Ogpt
```

The results will be saved in `rankings/`.
To evaluate various models and methods on the datasets:
- Configure the Shell Script:
  - Open the `get_scores.sh` script in a text editor.
  - Set the following variables according to your requirements:
    - `model_name`: the model name(s).
    - `dataset_name`: the name(s) of the dataset(s) to evaluate.
    - `set_option`: the evaluation option for the datasets.
  - The first loop is for models that already have ranking results; the second loop is for models without ranking results. Put the model names in the corresponding loop for evaluation.
- Run the following:

  ```bash
  nohup bash get_scores.sh > evaluation_log/<LOG_FILE> 2>&1 &
  ```

  The results will be printed in the log file; remember to replace `<LOG_FILE>` with the name of the log file.
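Since the scores are only printed to the log, a quick way to find them afterwards is to print the tail of the log file. The path below uses the `<LOG_FILE>` placeholder; substitute your actual log name.

```python
# Print the last lines of the evaluation log, where get_scores.sh's output ends up.
log_path = "evaluation_log/<LOG_FILE>"  # replace <LOG_FILE> with your log file name
with open(log_path) as f:
    lines = f.readlines()
for line in lines[-20:]:
    print(line.rstrip())
```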