Multi-Target Cross-Lingual Summarization

Source code for the EMNLP 2024 (Findings) paper Multi-Target Cross-Lingual Summarization: a novel task and a language-neutral approach.

Setup

Create a virtual environment and install the requirements:

conda create -n mtxlsum python=3.10
conda activate mtxlsum
pip install torch==2.1.1 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt

If you wish to run the evaluation, you will also need to install the multilingual ROUGE package available at https://github.com/csebuetnlp/xl-sum/tree/master/multilingual_rouge_scoring.

Data preparation

Download the data from the official repository and extract it to the original_data directory.
Run the following command to cluster the test data (similarly for validation):

mkdir -p data
mkdir -p data/test
python aggregate.py --data_dir original_data/test --output_dir data/test --langs arabic chinese_simplified english french portuguese russian spanish

This will create a data directory with multiple JSONL files. Each line corresponds to a cluster of documents and has the following format:

{
    "num_docs": int,
    "url0": str,
    "lang0": str,
    "text0": str,
    "summary0": str,
    "url1": str,
    "lang1": str,
    "text1": str,
    "summary1": str,
    ...
}

Generation

E.g., to generate summaries for all English documents using NeutralRR using all the languages in each cluster as targets, run the following command:

python generate.py --source_lang=en --split=test --method=rerank --search_mode=dijkstra --num_candidates=8 --temperature=1.0 --top_k=50 --num_sampling_beams=5 --output=predictions_en.jsonl

This will create a predictions_en.jsonl file where each line has the following format:

{
    "source_url": str,
    "summary_english": str,
    "summary_spanish": str,
    ...
}

For other methods and options, run python generate.py --help.

Evaluation

To evaluate the generated summaries, run the following command:

python evaluate.py --predictions=./predictions_en.jsonl --source_lang=en --split=test --output=predictions_en_eval.json

This will create a JSON file with the results of the evaluation for each target language according to several metrics.

Citation

@inproceedings{pernes-etal-2024-multi,
    title = "Multi-Target Cross-Lingual Summarization: a novel task and a language-neutral approach",
    author = "Pernes, Diogo  and
      Correia, Gon{\c{c}}alo M.  and
      Mendes, Afonso",
    editor = "Al-Onaizan, Yaser  and
      Bansal, Mohit  and
      Chen, Yun-Nung",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-emnlp.755",
    pages = "12908--12924",
    abstract = "Cross-lingual summarization aims to bridge language barriers by summarizing documents in different languages. However, ensuring semantic coherence across languages is an overlooked challenge and can be critical in several contexts. To fill this gap, we introduce multi-target cross-lingual summarization as the task of summarizing a document into multiple target languages while ensuring that the produced summaries are semantically similar. We propose a principled re-ranking approach to this problem and a multi-criteria evaluation protocol to assess semantic coherence across target languages, marking a first step that will hopefully stimulate further research on this problem.",
}

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
models		models
.gitignore		.gitignore
README.md		README.md
aggregate.py		aggregate.py
dataloading.py		dataloading.py
evaluate.py		evaluate.py
generate.py		generate.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Multi-Target Cross-Lingual Summarization

Setup

Data preparation

Generation

Evaluation

Citation

About

Uh oh!

Releases

Packages

Languages

Priberam/MTXLSum

Folders and files

Latest commit

History

Repository files navigation

Multi-Target Cross-Lingual Summarization

Setup

Data preparation

Generation

Evaluation

Citation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages