Here we explain how to use the repo.
First, clone the repo:

```bash
git clone https://github.com/codebyzeb/bytespantokenization.git
```

Install the uv environment manager:

```bash
# install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
# or simply update it
uv self update
```

Then, install the dependencies. Because of flash-attention we use a two-step installation process (see the uv docs):

```bash
uv sync
uv sync --extra flash # if you also want flash-attention
```

To lint and format your code, you can use:

```bash
make lint
make format
```

We define commands in the /commands folder. You can run any available command as follows:
```bash
uv run cli.py <command> <subcommand> <args> <options>
```

For example, to create a simple byte-level tokenizer:

```bash
# Note: no args or options needed here
uv run cli.py tokenizers create-bytelevel
```
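For intuition, a byte-level tokenizer simply treats each UTF-8 byte of the text as a token. A rough sketch of the idea (ignoring any special tokens or ID offsets the actual tokenizer may add):

```python
# Rough intuition only, not the repo's implementation: byte-level tokenization
# maps text to its raw UTF-8 bytes, so the base vocabulary never exceeds 256 symbols.
text = "héllo"
byte_ids = list(text.encode("utf-8"))
print(byte_ids)  # [104, 195, 169, 108, 108, 111] -- 'é' becomes two bytes
print(bytes(byte_ids).decode("utf-8"))  # 'héllo'
```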
Here we describe how to run the experiments.

Download the preprocessed data from the Hugging Face Hub:

```bash
uv run cli.py data download-bytelevel
```

This saves the data to ./data/finewebedu/bytelevel. Similarly, to download common-corpus:
```bash
uv run cli.py data download-bytelevel --repo-id common-corpus
```

Once you have the data, you can train the bytelevel models as follows:

```bash
uv run scripts/train.py \
dataset=finewebedu \
tok_name=bytelevel \
model=fw57M-tied \
hydra.run.dir=outputs/finewebedu/bytelevel

uv run scripts/train.py \
dataset=common-corpus \
tok_name=bytelevel \
model=fw57M-tied \
hydra.run.dir=outputs/common-corpus/bytelevel
```

The models will be saved locally to outputs/. The training script can also take other arguments; see launch_slurm_llm_trainer.wilkes3 for how our models were trained on a CSD3 cluster. The model can then be uploaded, e.g.:
```bash
uv run cli.py upload model outputs/finewebedu/bytelevel
```

Note that you may need to adjust the configurations in commands/configs.py to match your Hugging Face credentials.
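Once uploaded, the checkpoint can be pulled back from the Hub with transformers in the usual way. A minimal sketch, assuming the upload produces a standard transformers checkpoint (the repo ID below is a placeholder for whatever your configuration points to):

```python
from transformers import AutoModelForCausalLM

# Placeholder repo ID -- substitute the repository your upload command pushed to.
repo_id = "your-hf-username/fw57M-bytelevel"
model = AutoModelForCausalLM.from_pretrained(repo_id)
print(sum(p.numel() for p in model.parameters()))  # sanity-check the parameter count
```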
Predictions can be extracted from the model as follows:
```bash
uv run cli.py extract get-llm-predictions fw57M
```

This automatically downloads the bytelevel model and the first subset of finewebedu, extracts predictions, and uploads these as a new subset of finewebedu. To do the same for common-corpus:
```bash
uv run cli.py extract get-llm-predictions fw57M-multi common-corpus
```
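For intuition, the per-byte cues behind these predictions can be derived from a causal LM's output distribution: Surprisal is the negative log-probability of the byte that actually follows, and Entropy is the entropy of the predicted next-byte distribution. A simplified sketch (not the repo's extraction code; `logits` and `input_ids` are placeholders for one sequence):

```python
import torch
import torch.nn.functional as F

def byte_cues(logits: torch.Tensor, input_ids: torch.Tensor):
    """logits: (seq_len, vocab_size) from a causal byte-level LM; input_ids: (seq_len,)."""
    log_probs = F.log_softmax(logits, dim=-1)
    # Entropy of the predicted next-byte distribution at each position.
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)
    # Surprisal of the byte that actually came next: -log p(x_{t+1} | x_{<=t}).
    surprisal = -log_probs[:-1].gather(1, input_ids[1:].unsqueeze(1)).squeeze(1)
    return entropy, surprisal
```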
To create a ByteSpan tokenizer with the global constraint:

```bash
uv run cli.py tokenizers create-thresholdtokenizer Entropy
```

Entropy can be replaced with Surprisal or other cues (the same applies below). To create a ByteSpan tokenizer with the monotonic constraint:
```bash
uv run cli.py tokenizers create-bytespantokenizer Entropy
```

To create a ByteSpan tokenizer with the combined constraint:
```bash
uv run cli.py tokenizers create-bytespantokenizer Entropy --threshold-percentile=30
```

This sets the threshold to be the 30th percentile value in the data. For either the monotonic constraint or the combined constraint, the --proportion-bytespan flag can be set to only use ByteSpan to learn a fixed portion of the vocabulary, seeding BPE for the rest:
```bash
uv run cli.py tokenizers create-bytespantokenizer Entropy --threshold-percentile=30 --proportion-bytespan=50
```
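To make the three constraints concrete, here is a simplified sketch of the boundary rules they correspond to, operating on precomputed per-byte cue values (illustrative only, not the repo's implementation):

```python
from typing import List

def global_boundaries(cues: List[float], threshold: float) -> List[int]:
    """Global constraint: boundary wherever the cue exceeds a fixed (percentile) threshold."""
    return [i for i, c in enumerate(cues) if c > threshold]

def monotonic_boundaries(cues: List[float]) -> List[int]:
    """Monotonic constraint: boundary wherever the cue rises relative to the previous byte."""
    return [i for i in range(1, len(cues)) if cues[i] > cues[i - 1]]

def combined_boundaries(cues: List[float], threshold: float) -> List[int]:
    """Combined constraint: a rise that also clears the threshold."""
    return [i for i in range(1, len(cues)) if cues[i] > cues[i - 1] and cues[i] > threshold]
```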
All commands can be adjusted to train a multilingual tokenizer on common-corpus, e.g.:

```bash
uv run cli.py tokenizers create-bytespantokenizer Entropy common-corpus --threshold-percentile=30 --proportion-bytespan=50
```

We implement our own trainer for producing a BPE-style tokenizer. It can be trained without needing the LM predictions:
```bash
uv run cli.py tokenizers create-frequencytokenizer frequency
```
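For background, a BPE-style trainer repeatedly merges the most frequent adjacent pair of symbols. A minimal sketch of a single merge step (generic BPE, not the repo's frequency trainer):

```python
from collections import Counter

def most_frequent_pair(sequences):
    """Count adjacent symbol pairs across all sequences and return the most frequent one."""
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(seq, pair):
    """Replace every occurrence of `pair` in `seq` with a single merged symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged
```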
To create a BPE-style tokenizer that supports WordPiece inference, simply train a ByteSpan tokenizer with --proportion-bytespan=0:

```bash
uv run cli.py tokenizers create-bytespantokenizer Entropy --proportion-bytespan=0
```

To run our analysis pipeline, you must first install morphscore. Use our forked version, which fixes a couple of bugs in the pipeline:
```bash
git clone https://github.com/codebyzeb/morphscore.git
```

Then run the analysis commands:

```bash
uv run cli.py analysis get-tokenizer-statistics-fineweb
uv run cli.py analysis get-tokenizer-statistics-common-corpus
```

This automatically downloads and analyzes all tokenizers from our repository, saving results to tokenizer_stats_fineweb.csv for the English tokenizers and tokenizer_stats_common.csv for the multilingual tokenizers.
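The resulting CSVs can then be inspected with pandas, for example (the available columns depend on which statistics the pipeline computes):

```python
import pandas as pd

# Load the English-tokenizer statistics and show a quick summary.
stats = pd.read_csv("tokenizer_stats_fineweb.csv")
print(stats.head())
```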
Our model training script requires the data to be pre-tokenized. For a particular tokenizer, run the following:
```bash
uv run cli.py data finewebedu-tokenize --subfolder=fw57M_Surprisal_bytespanP1-0T30_64000
```

This will tokenize all of finewebedu with the chosen tokenizer and upload the IDs as a new subset of our copy of finewebedu. This data must be re-downloaded to prepare it for training:
```bash
uv run cli.py data finewebedu-download fw57M_Surprisal_bytespanP1-0T30_64000 --num-train-rows=16000000
```

For 50k steps, 16,000,000 rows is more than enough (less than one epoch).
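As a rough sanity check of that claim (purely illustrative: the actual rows consumed per step depend on the batch size and sequence packing set in the training config, which are not shown here):

```python
# Illustrative numbers only -- the assumed rows per step is a placeholder.
steps = 50_000
rows_per_step = 256          # hypothetical global batch size in rows
rows_consumed = steps * rows_per_step
print(rows_consumed, rows_consumed < 16_000_000)  # 12800000 True -> under one epoch
```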
The training script can then be run, similarly to the above:

```bash
uv run scripts/train.py \
dataset=finewebedu \
tok_name=fw57M_Surprisal_bytespanP1-0T30_64000 \
model=fw57M-tied \
hydra.run.dir=outputs/finewebedu/fw57M_Surprisal_bytespanP1-0T30_64000
```