Here we explain how to use the repo.
First, clone the repo:

```bash
git clone https://github.com/codebyzeb/bytespantokenization.git
```

Install the uv environment manager:

```bash
# install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
# or simply update it
uv self update
```

Then, install the dependencies. Because of flash-attention we use a two-step installation process (see the uv docs):

```bash
uv sync
uv sync --extra flash # if you also want flash-attention
```

To lint and format your code, you can use:

```bash
make lint
make format
```

We define commands in the /commands folder. You can run any available command as follows:
```bash
uv run cli.py <command> <subcommand> <args> <options>
```

For example, to create a simple byte-level tokenizer:

```bash
# Note: no args or options needed here
uv run cli.py tokenizers create-bytelevel
```
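For intuition, a byte-level tokenizer simply treats each UTF-8 byte of the text as a token. A rough sketch of the idea (ignoring any special tokens or ID offsets the actual tokenizer may add):

```python
# Rough intuition only, not the repo's implementation: byte-level tokenization
# maps text to its raw UTF-8 bytes, so the base vocabulary never exceeds 256 symbols.
text = "héllo"
byte_ids = list(text.encode("utf-8"))
print(byte_ids)  # [104, 195, 169, 108, 108, 111] -- 'é' becomes two bytes
print(bytes(byte_ids).decode("utf-8"))  # 'héllo'
```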
Here we describe how to run the experiments.

Download the preprocessed data from the Hugging Face Hub:

```bash
uv run cli.py data download-bytelevel
```

This saves the data to ./data/finewebedu/bytelevel. Similarly, to download common-corpus:
```bash
uv run cli.py data download-bytelevel --repo-id common-corpus
```

Once you have the data, you can train the bytelevel models as follows:

```bash
uv run scripts/train.py \
dataset=finewebedu \
tok_name=bytelevel \
model=fw57M-tied \
hydra.run.dir=outputs/finewebedu/bytelevel

uv run scripts/train.py \
dataset=common-corpus \
tok_name=bytelevel \
model=fw57M-tied \
hydra.run.dir=outputs/common-corpus/bytelevel
```

The models will be saved locally to outputs/. The training script can also take other arguments; see launch_slurm_llm_trainer.wilkes3 for how our models were trained on a CSD3 cluster. The model can then be uploaded, e.g.:
```bash
uv run cli.py upload model outputs/finewebedu/bytelevel
```

Note that you may need to adjust the configurations in commands/configs.py to match your Hugging Face credentials.
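Once uploaded, the checkpoint can be pulled back from the Hub with transformers in the usual way. A minimal sketch, assuming the upload produces a standard transformers checkpoint (the repo ID below is a placeholder for whatever your configuration points to):

```python
from transformers import AutoModelForCausalLM

# Placeholder repo ID -- substitute the repository your upload command pushed to.
repo_id = "your-hf-username/fw57M-bytelevel"
model = AutoModelForCausalLM.from_pretrained(repo_id)
print(sum(p.numel() for p in model.parameters()))  # sanity-check the parameter count
```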
Predictions can be extracted from the model as follows:
```bash
uv run cli.py extract get-llm-predictions fw57M
```

This automatically downloads the bytelevel model and the first subset of finewebedu, extracts predictions, and uploads these as a new subset of finewebedu. To do the same for common-corpus:
```bash
uv run cli.py extract get-llm-predictions fw57M-multi common-corpus
```
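For intuition, the per-byte cues behind these predictions can be derived from a causal LM's output distribution: Surprisal is the negative log-probability of the byte that actually follows, and Entropy is the entropy of the predicted next-byte distribution. A simplified sketch (not the repo's extraction code; `logits` and `input_ids` are placeholders for one sequence):

```python
import torch
import torch.nn.functional as F

def byte_cues(logits: torch.Tensor, input_ids: torch.Tensor):
    """logits: (seq_len, vocab_size) from a causal byte-level LM; input_ids: (seq_len,)."""
    log_probs = F.log_softmax(logits, dim=-1)
    # Entropy of the predicted next-byte distribution at each position.
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)
    # Surprisal of the byte that actually came next: -log p(x_{t+1} | x_{<=t}).
    surprisal = -log_probs[:-1].gather(1, input_ids[1:].unsqueeze(1)).squeeze(1)
    return entropy, surprisal
```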
To create a ByteSpan tokenizer with the global constraint:

```bash
uv run cli.py tokenizers create-thresholdtokenizer Entropy
```

Entropy can be replaced with Surprisal or other cues (the same applies below). To create a ByteSpan tokenizer with the monotonic constraint:
```bash
uv run cli.py tokenizers create-bytespantokenizer Entropy
```

To create a ByteSpan tokenizer with the combined constraint:
```bash
uv run cli.py tokenizers create-bytespantokenizer Entropy --threshold-percentile=30
```

This sets the threshold to be the 30th percentile value in the data. For either the monotonic constraint or the combined constraint, the --proportion-bytespan flag can be set to only use ByteSpan to learn a fixed portion of the vocabulary, seeding BPE for the rest:
```bash
uv run cli.py tokenizers create-bytespantokenizer Entropy --threshold-percentile=30 --proportion-bytespan=50
```
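To make the three constraints concrete, here is a simplified sketch of the boundary rules they correspond to, operating on precomputed per-byte cue values (illustrative only, not the repo's implementation):

```python
from typing import List

def global_boundaries(cues: List[float], threshold: float) -> List[int]:
    """Global constraint: boundary wherever the cue exceeds a fixed (percentile) threshold."""
    return [i for i, c in enumerate(cues) if c > threshold]

def monotonic_boundaries(cues: List[float]) -> List[int]:
    """Monotonic constraint: boundary wherever the cue rises relative to the previous byte."""
    return [i for i in range(1, len(cues)) if cues[i] > cues[i - 1]]

def combined_boundaries(cues: List[float], threshold: float) -> List[int]:
    """Combined constraint: a rise that also clears the threshold."""
    return [i for i in range(1, len(cues)) if cues[i] > cues[i - 1] and cues[i] > threshold]
```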
All commands can be adjusted to train a multilingual tokenizer on common-corpus, e.g.:

```bash
uv run cli.py tokenizers create-bytespantokenizer Entropy common-corpus --threshold-percentile=30 --proportion-bytespan=50
```

We implement our own trainer for producing a BPE-style tokenizer. It can be trained without needing the LM predictions:
```bash
uv run cli.py tokenizers create-frequencytokenizer frequency
```
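For background, a BPE-style trainer repeatedly merges the most frequent adjacent pair of symbols. A minimal sketch of a single merge step (generic BPE, not the repo's frequency trainer):

```python
from collections import Counter

def most_frequent_pair(sequences):
    """Count adjacent symbol pairs across all sequences and return the most frequent one."""
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(seq, pair):
    """Replace every occurrence of `pair` in `seq` with a single merged symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged
```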
To create a BPE-style tokenizer that supports WordPiece inference, simply train a ByteSpan tokenizer with --proportion-bytespan=0:

```bash
uv run cli.py tokenizers create-bytespantokenizer Entropy --proportion-bytespan=0
```

To run our analysis pipeline, you must first install morphscore. Use our forked version, which fixes a couple of bugs in the pipeline:
```bash
git clone https://github.com/codebyzeb/morphscore.git
```

Then run the analysis commands:

```bash
uv run cli.py analysis get-tokenizer-statistics-fineweb
uv run cli.py analysis get-tokenizer-statistics-common-corpus
```

This automatically downloads and analyzes all tokenizers from our repository, saving results to tokenizer_stats_fineweb.csv for the English tokenizers and tokenizer_stats_common.csv for the multilingual tokenizers.
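The resulting CSVs can then be inspected with pandas, for example (the available columns depend on which statistics the pipeline computes):

```python
import pandas as pd

# Load the English-tokenizer statistics and show a quick summary.
stats = pd.read_csv("tokenizer_stats_fineweb.csv")
print(stats.head())
```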
Our model training script requires the data to be pre-tokenized. For a particular tokenizer, run the following:
```bash
uv run cli.py data finewebedu-tokenize --subfolder=fw57M_Surprisal_bytespanP1-0T30_64000
```

This will tokenize all of finewebedu with the chosen tokenizer and upload the IDs as a new subset of our copy of finewebedu. This data must be re-downloaded to prepare it for training:
```bash
uv run cli.py data finewebedu-download fw57M_Surprisal_bytespanP1-0T30_64000 --num-train-rows=16000000
```

For 50k steps, 16,000,000 rows is more than enough (less than one epoch).
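As a rough sanity check of that claim (purely illustrative: the actual rows consumed per step depend on the batch size and sequence packing set in the training config, which are not shown here):

```python
# Illustrative numbers only -- the assumed rows per step is a placeholder.
steps = 50_000
rows_per_step = 256          # hypothetical global batch size in rows
rows_consumed = steps * rows_per_step
print(rows_consumed, rows_consumed < 16_000_000)  # 12800000 True -> under one epoch
```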
The training script can then be run, similarly to the above:

```bash
uv run scripts/train.py \
dataset=finewebedu \
tok_name=fw57M_Surprisal_bytespanP1-0T30_64000 \
model=fw57M-tied \
hydra.run.dir=outputs/finewebedu/fw57M_Surprisal_bytespanP1-0T30_64000
```