A minimal, educational text-to-speech (TTS) system developed for the
Speech Synthesis and Voice Cloning course during the Independent Study Period 2025 (ISP'25) at Skoltech.
The model components and a training example are provided in the following demonstration notebooks:
- `inference.ipynb`: a demo of TTS inference using the pre-trained models
- `training.ipynb`: code for fine-tuning the pre-trained model on custom data
The model architecture takes inspiration from FastPitch and Matcha-TTS and introduces a few modifications and simplifications. Its modules are:
- Transformer-based `TextEncoder` with ALiBi embeddings
- `Aligner` between text and mel spectrograms with CUDA-supported Monotonic Alignment Search
- Flow Matching and Transformer-based `TemporalAdaptor` for modeling the distribution of token duration, pitch, and energy
- Transformer-based `MelDecoder` with ALiBi embeddings
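
Both the text encoder and the mel decoder rely on ALiBi (Attention with Linear Biases), which replaces learned positional embeddings with a head-specific linear penalty on the attention logits. Below is a minimal sketch of the standard ALiBi bias in PyTorch, using the symmetric distance variant common for non-causal encoders; it is an illustration of the technique, not this repository's exact code:

```python
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """Standard ALiBi bias: a per-head linear penalty on attention logits.

    Returns a (num_heads, seq_len, seq_len) tensor to be added to the
    pre-softmax attention scores; no positional embeddings are required.
    """
    # Head slopes form a geometric sequence 2^(-8/num_heads), ..., 2^(-8),
    # following the ALiBi paper (Press et al., 2021).
    slopes = torch.tensor(
        [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]
    )
    # Absolute distance |i - j| between query position i and key position j
    # (the symmetric form suitable for bidirectional attention).
    pos = torch.arange(seq_len)
    distance = (pos[None, :] - pos[:, None]).abs().float()  # (seq_len, seq_len)
    # Broadcast: (heads, 1, 1) * (1, T, T) -> (heads, T, T).
    return -slopes[:, None, None] * distance[None, :, :]

# Hypothetical usage inside an attention layer:
# scores = q @ k.transpose(-2, -1) / d_head**0.5 + alibi_bias(n_heads, T)
```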
The dataset for training the models should have the following structure:
```
DATASET_ROOT
    wavs
        audio_1.wav
        audio_2.wav
        ...
        audio_N.wav
    meta.csv
```
The metadata file should have the following structure:
```
wavs/audio_1.wav|This is the sample text.
wavs/audio_2.wav|The second audio св+язано с +этим т+екстом.
...
wavs/audio_N.wav|нижний текст.
```
In other words, each line of the metadata file contains the path to an audio file (relative to the dataset root) and its transcript, separated by "|".
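
For reference, here is a minimal sketch of reading such a metadata file; the function name and return type are illustrative, not the repository's actual loader:

```python
from pathlib import Path

def read_metadata(dataset_root: str) -> list[tuple[Path, str]]:
    """Parse meta.csv into (absolute audio path, transcript) pairs."""
    root = Path(dataset_root)
    pairs = []
    with open(root / "meta.csv", encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            # Split only on the first "|" so transcripts may contain the character.
            rel_path, text = line.split("|", maxsplit=1)
            audio_path = root / rel_path
            # Sanity check: every listed audio file should exist on disk.
            assert audio_path.is_file(), f"missing audio: {audio_path}"
            pairs.append((audio_path, text))
    return pairs
```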
Prepared for academic and non-commercial use.
Inspired by open-source projects and educational resources in speech synthesis research.