Source code for the EMNLP 2024 (Findings) paper Multi-Target Cross-Lingual Summarization: a novel task and a language-neutral approach.
Create a virtual environment and install the requirements:
conda create -n mtxlsum python=3.10
conda activate mtxlsum
pip install torch==2.1.1 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
If you wish to run the evaluation, you will also need to install the multilingual ROUGE package available at https://github.com/csebuetnlp/xl-sum/tree/master/multilingual_rouge_scoring.
- Download the data from the official repository and extract it to the original_data directory.
- Run the following command to cluster the test data (similarly for validation):
mkdir -p data
mkdir -p data/test
python aggregate.py --data_dir original_data/test --output_dir data/test --langs arabic chinese_simplified english french portuguese russian spanish
This will create a data directory with multiple JSONL files. Each line corresponds to a cluster of documents and has the following format:
{
"num_docs": int,
"url0": str,
"lang0": str,
"text0": str,
"summary0": str,
"url1": str,
"lang1": str,
"text1": str,
"summary1": str,
...
}
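Because each cluster is stored as a flat record with numbered keys, it can be convenient to regroup the fields per document before further processing. The following is a minimal sketch, assuming the indices run from 0 to num_docs - 1 as the format above suggests; the helper names iter_cluster_docs and read_clusters are hypothetical and not part of this repository.

```python
import json
from typing import Iterator


def iter_cluster_docs(cluster: dict) -> Iterator[dict]:
    """Yield one dict per document stored in a flat cluster record.

    Assumes keys url{i}, lang{i}, text{i}, summary{i} exist for
    i = 0 .. num_docs - 1 (an assumption based on the format above).
    """
    for i in range(cluster["num_docs"]):
        yield {
            "url": cluster[f"url{i}"],
            "lang": cluster[f"lang{i}"],
            "text": cluster[f"text{i}"],
            "summary": cluster[f"summary{i}"],
        }


def read_clusters(path: str) -> Iterator[dict]:
    """Yield one cluster (a dict) per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Tiny in-memory cluster with made-up values, for illustration only.
example = {
    "num_docs": 2,
    "url0": "https://example.org/a", "lang0": "english",
    "text0": "Some text.", "summary0": "A summary.",
    "url1": "https://example.org/b", "lang1": "spanish",
    "text1": "Un texto.", "summary1": "Un resumen.",
}
for doc in iter_cluster_docs(example):
    print(doc["lang"], doc["url"])
```

To process a real file, pass the path of one of the JSONL files written to data/test to read_clusters and feed each yielded cluster to iter_cluster_docs.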
For example, to generate summaries for all English documents with NeutralRR, using all the languages in each cluster as targets, run the following command:
python generate.py --source_lang=en --split=test --method=rerank --search_mode=dijkstra --num_candidates=8 --temperature=1.0 --top_k=50 --num_sampling_beams=5 --output=predictions_en.jsonl
This will create a predictions_en.jsonl file where each line has the following format:
{
"source_url": str,
"summary_english": str,
"summary_spanish": str,
...
}
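To inspect the predictions, the per-language summaries can be pulled out of each record by their key prefix. This is a minimal sketch, assuming every target-language field is named summary_<language> as in the format above; the helper name summaries_by_language is hypothetical and not part of this repository.

```python
import json

PREFIX = "summary_"


def summaries_by_language(record: dict) -> dict:
    """Map each target language to its summary for one prediction record."""
    return {
        key[len(PREFIX):]: value
        for key, value in record.items()
        if key.startswith(PREFIX)
    }


# One made-up line, shaped like a line of predictions_en.jsonl.
line = json.dumps({
    "source_url": "https://example.org/a",
    "summary_english": "A summary.",
    "summary_spanish": "Un resumen.",
})
record = json.loads(line)
per_language = summaries_by_language(record)
print(sorted(per_language))
```

Applying the same function to every line of predictions_en.jsonl gives, for each source document, a dictionary keyed by target language.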
For other methods and options, run python generate.py --help.
To evaluate the generated summaries, run the following command:
python evaluate.py --predictions=./predictions_en.jsonl --source_lang=en --split=test --output=predictions_en_eval.json
This will create a JSON file with the results of the evaluation for each target language according to several metrics.
@inproceedings{pernes-etal-2024-multi,
title = "Multi-Target Cross-Lingual Summarization: a novel task and a language-neutral approach",
author = "Pernes, Diogo and
Correia, Gon{\c{c}}alo M. and
Mendes, Afonso",
editor = "Al-Onaizan, Yaser and
Bansal, Mohit and
Chen, Yun-Nung",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.findings-emnlp.755",
pages = "12908--12924",
abstract = "Cross-lingual summarization aims to bridge language barriers by summarizing documents in different languages. However, ensuring semantic coherence across languages is an overlooked challenge and can be critical in several contexts. To fill this gap, we introduce multi-target cross-lingual summarization as the task of summarizing a document into multiple target languages while ensuring that the produced summaries are semantically similar. We propose a principled re-ranking approach to this problem and a multi-criteria evaluation protocol to assess semantic coherence across target languages, marking a first step that will hopefully stimulate further research on this problem.",
}