LLM-Based Field Mapping for Materials Datasets
We study field-level schema mapping for materials-science databases using large language models (LLMs). Given a raw field name from sources such as Materials Project, OQMD, Alexandria, JARVIS-DFT, and AFLOW, the task is to predict a single target column from the OPTIMADE-compatible LeMaterial schema. We formulate this as multiclass text classification and compare two adaptation modes, (i) in-context learning (ICL) with Llama-3.1-8B and Mistral-7B, and (ii) parameter-efficient finetuning (PEFT, QLoRA) of Llama-3.1-8B and Phi-3-Mini, under both in-domain and out-of-domain (OMat24) evaluation. We further study a back-translation augmentation strategy that generates synthetic raw field names. Our main findings are: Mistral-7B is a strong ICL baseline, but Llama-3.1-8B with QLoRA clearly dominates after finetuning; small models like Phi-3-Mini underfit even with PEFT; and back-translation is beneficial only when combined with finetuning, substantially improving OOD macro-F1.
conda create -n llmap python=3.12 -y
conda activate llmap
pip install -r requirements.txt
huggingface-cli login
Models used (you must accept licenses on Hugging Face):
meta-llama/Meta-Llama-3.1-8B-Instruct
mistralai/Mistral-7B-Instruct-v0.3
microsoft/Phi-3-mini-4k-instruct
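To confirm that all three licenses have been accepted before downloading, a quick check against the Hub can help. The snippet below is a minimal sketch, not one of the repository scripts, and assumes you have already run huggingface-cli login:

# check_access.py -- sanity-check that gated model licenses were accepted
# (illustrative sketch; not part of the repository scripts)
from huggingface_hub import HfApi

MODELS = [
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "mistralai/Mistral-7B-Instruct-v0.3",
    "microsoft/Phi-3-mini-4k-instruct",
]

api = HfApi()
for repo_id in MODELS:
    try:
        api.model_info(repo_id)   # raises if the repo is gated and access was not granted
        print(f"OK       {repo_id}")
    except Exception as err:      # e.g. GatedRepoError, RepositoryNotFoundError
        print(f"BLOCKED  {repo_id}: {err}")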
All required preprocessed splits are already included:
data/stp2_train.jsonl
data/stp2_val.jsonl
data/stp2_test_id.jsonl
data/stp2_train_bt_aug.jsonl # includes back-translated synthetic samples
Each example follows this JSONL format:
{"raw_field": "...", "label": "..."}
These files are sufficient to reproduce all reported results out-of-the-box.
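As a quick sanity check, the splits can be loaded and summarized with a few lines of Python. This is an illustrative sketch only, using the file paths listed above; the label set and counts are whatever the files contain:

# inspect_splits.py -- load the provided JSONL splits and print basic statistics
import json
from collections import Counter
from pathlib import Path

for split in ["stp2_train", "stp2_val", "stp2_test_id", "stp2_train_bt_aug"]:
    path = Path("data") / f"{split}.jsonl"
    examples = [json.loads(line) for line in path.open()]
    labels = Counter(ex["label"] for ex in examples)
    print(f"{split:20s} {len(examples):5d} examples, {len(labels):3d} target columns")
    print("  most frequent labels:", labels.most_common(3))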
Run ICL with frozen model weights:
python icl_eval.py \
--model_name <hf_model_id> \
--train_path data/stp2_train.jsonl \
--val_path data/stp2_val.jsonl \
--test_path data/stp2_test_id.jsonl \
--k_shot <k>
Reproduce the full ICL curve for Llama-3.1-8B:
for k in {0..10}; do
python icl_eval.py \
--model_name meta-llama/Meta-Llama-3.1-8B-Instruct \
--train_path data/stp2_train.jsonl \
--val_path data/stp2_val.jsonl \
--test_path data/stp2_test_id.jsonl \
--k_shot $k
done
Use the same loop for Mistral-7B or Phi-3-Mini.
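For intuition, the sketch below shows one way a k-shot classification prompt for this task could be assembled from the training split. It is a hypothetical illustration: the actual prompt template lives in icl_eval.py and may differ, and the query field e_above_hull is just an example input.

# prompt_sketch.py -- hypothetical k-shot prompt construction for field mapping
import json
import random

def build_prompt(train_examples, raw_field, k=5, seed=0):
    """Assemble k labeled demonstrations followed by the query field."""
    shots = random.Random(seed).sample(train_examples, k) if k > 0 else []
    lines = ["Map each raw field name to a LeMaterial schema column.", ""]
    for ex in shots:
        lines.append(f"Raw field: {ex['raw_field']}\nColumn: {ex['label']}")
    lines.append(f"Raw field: {raw_field}\nColumn:")
    return "\n\n".join(lines)

train = [json.loads(line) for line in open("data/stp2_train.jsonl")]
print(build_prompt(train, "e_above_hull", k=3))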
ICL logs are written under:
logs_icl/
Run full QLoRA fine-tuning + automatic hyperparameter search:
python peft_eval.py \
--model_name <hf_model_id> \
--train_path data/stp2_train.jsonl \
--val_path data/stp2_val.jsonl \
--test_path data/stp2_test_id.jsonl \
--output_root runs/<tag> \
--log_root logs_peft/<tag>
This script sweeps over the following hyperparameters (a configuration sketch follows the list):
- Learning rates
- LoRA ranks
- Epoch counts
- Batch sizes
A summary file will be written to:
runs/<tag>/summary_<model_name>_sequential_search.json
Select the run with the highest validation macro-F1 to match the reported results.
To reproduce only the best configuration, edit the candidate lists inside peft_eval.py.
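Macro-F1, the selection and reporting metric, is the unweighted mean of per-class F1 scores. A minimal example with made-up labels (the repo scripts may compute it internally):

# macro_f1_sketch.py -- macro-F1 with scikit-learn; labels below are made up
from sklearn.metrics import f1_score

y_true = ["band_gap", "energy", "band_gap", "volume", "energy"]
y_pred = ["band_gap", "energy", "energy",   "volume", "energy"]

print(f1_score(y_true, y_pred, average="macro"))  # unweighted mean of per-class F1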
The back-translation augmented training file is already included:
data/stp2_train_bt_aug.jsonl
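To see how much the back-translated file adds on top of the original training split, a small comparison sketch can be used. It is not part of the repo scripts and assumes synthetic examples introduce raw field names absent from the original file:

# compare_aug.py -- compare the original and back-translation-augmented training files
import json

def load(path):
    return [json.loads(line) for line in open(path)]

orig = load("data/stp2_train.jsonl")
aug = load("data/stp2_train_bt_aug.jsonl")

orig_fields = {ex["raw_field"] for ex in orig}
synthetic = [ex for ex in aug if ex["raw_field"] not in orig_fields]
print(f"{len(orig)} original examples, {len(aug)} after augmentation, "
      f"{len(synthetic)} with new (synthetic) raw field names")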
Reproduce Llama-3.1-8B PEFT with augmentation:
python peft_eval.py \
--model_name meta-llama/Meta-Llama-3.1-8B-Instruct \
--train_path data/stp2_train_bt_aug.jsonl \
--val_path data/stp2_val.jsonl \
--test_path data/stp2_test_id.jsonl \
--output_root runs/llama_bt \
--log_root logs_peft/llama_bt
Repeat with the Phi-3-Mini model to reproduce its augmented results.
- A GPU is required for QLoRA on the 8B models; ≥ 24 GB of VRAM is recommended.
- Optional: OOD evaluation can be added via --ood_path <jsonl>, though no OOD file is included in this repo (a sketch of the expected file format follows below).
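Any OOD file just needs to follow the same JSONL schema as the provided splits. The sketch below is hypothetical: the output path, the example field names, and the elided labels are all placeholders, not values from the paper:

# make_ood.py -- write an OOD evaluation file in the same JSONL format
# (hypothetical; field names and output path are placeholders)
import json

ood_pairs = [
    ("energy_total", "..."),   # (raw field from the OOD source, LeMaterial target column)
    ("forces", "..."),
]

with open("data/stp2_test_ood.jsonl", "w") as f:
    for raw_field, label in ood_pairs:
        f.write(json.dumps({"raw_field": raw_field, "label": label}) + "\n")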