Training: Quickstart

Introduction

This document is a quickstart guide to training an 🐸STT model using your own speech data. For more in-depth training documentation, you should refer to Advanced Training Topics.

Training a model using your own audio can lead to better transcriptions compared to an off-the-shelf 🐸STT model. If your speech data differs significantly from the data we used in training, training your own model (or fine-tuning one of ours) may lead to large improvements in transcription quality. You can read about how speech characteristics interact with transcription accuracy here.

Dockerfile Setup

We suggest you use our Docker image as a base for training. You can download and run the image in a container:

$ docker pull ghcr.io/coqui-ai/stt-train
$ docker run -it ghcr.io/coqui-ai/stt-train:latest

Alternatively you can build it from source using Dockerfile.train, and run the locally built version in a container:

$ git clone --recurse-submodules https://github.com/coqui-ai/STT
$ cd STT
$ docker build -f Dockerfile.train . -t stt-train:latest
$ docker run -it stt-train:latest

You can read more about working with Dockerfiles in the official documentation.

Manual Setup

If you don’t want to use our Dockerfile template, you will need to manually install STT in order to train a model.

Prerequisites

  • Python 3.6, 3.7 or 3.8

  • Mac or Linux environment (training on Windows is not currently supported)

  • CUDA 10.0 and CuDNN v7.6

Download

Clone the STT repo from GitHub:

$ git clone https://github.com/coqui-ai/STT

Installation

Installing STT and its dependencies is much easier with a virtual environment.

Set up Virtual Environment

We recommend Python’s built-in venv module to manage your Python environment.

Set up your Python virtual environment and name it coqui-stt-train-venv:

$ python3 -m venv coqui-stt-train-venv

Activate the virtual environment:

$ source coqui-stt-train-venv/bin/activate

Setup with a conda virtual environment (Anaconda, Miniconda, or Mamba) is not guaranteed to work. Nevertheless, we’re happy to review pull requests which fix any incompatibilities you encounter.

Install Dependencies and STT

Now that we have cloned the STT repo from GitHub and set up a virtual environment with venv, we can install STT and its dependencies. We recommend Python’s built-in pip module for installation:

$ cd STT
$ python -m pip install --upgrade pip wheel setuptools
$ python -m pip install --upgrade -e .

If you have an NVIDIA GPU, it is highly recommended to install TensorFlow with GPU support. Training will be significantly faster than using the CPU.

$ python -m pip uninstall tensorflow
$ python -m pip install 'tensorflow-gpu==1.15.4'

Please ensure you have the required prerequisites and a working CUDA installation with the versions listed above.

Verify Install

To verify that your installation was successful, run:

$ ./bin/run-ldc93s1.sh

This script will train a model on a single audio file. If the script exits successfully, your STT training setup is ready. Congratulations!

Training on your own Data

Whether you used our Dockerfile template or you set up your own environment, the central STT training module is python -m coqui_stt_training.train. For a list of command line options, use the --help flag:

$ cd STT
$ python -m coqui_stt_training.train --help

Training Data

There are two kinds of data needed to train an STT model:

  1. audio clips

  2. text transcripts

Data Format

Audio data is expected to be stored as WAV files, sampled at 16kHz, mono-channel. There are no hard requirements on the length of individual audio files, but in our experience, training is most successful when WAV files range from 5 to 20 seconds in length. Your training data should match as closely as possible the kind of speech you expect at deployment. You can read more about the significant characteristics of speech with regard to STT here.
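
As a quick sanity check, Python’s standard-library wave module can confirm that a clip matches the expected format. This is only an illustrative sketch: the helper name check_wav is ours, and the 16-bit sample-width check is an assumption about typical PCM WAV data rather than a requirement stated above.

```python
import wave

def check_wav(path):
    """Return True if the file is a mono, 16 kHz WAV (assuming 16-bit PCM)."""
    with wave.open(path, "rb") as w:
        return (
            w.getnchannels() == 1          # mono
            and w.getsampwidth() == 2      # 16-bit samples (assumed)
            and w.getframerate() == 16000  # 16 kHz sample rate
        )
```

Running this over your clips before training can save you from debugging confusing errors later.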

Text transcripts should be formatted exactly as the transcripts you expect your model to produce at deployment. If you want your model to produce capital letters, your transcripts should include capital letters. If you want your model to produce punctuation, your transcripts should include punctuation. Keep in mind that the more characters you include in your transcripts, the more difficult the task becomes for your model. STT models learn from experience: if a character appears very rarely in the training data, the model will have a hard time learning it (e.g. the “ï” in “naïve”).
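
If you decide to simplify the task, a transcript cleanup pass might look like the following. This is a minimal sketch under our own assumptions: normalize_transcript is an illustrative helper, and whether to drop case and punctuation depends entirely on what you want your model to produce.

```python
import string

def normalize_transcript(text):
    """Lowercase and strip punctuation, shrinking the model's output alphabet."""
    text = text.lower()
    # Remove all ASCII punctuation characters
    return text.translate(str.maketrans("", "", string.punctuation)).strip()
```

For example, normalize_transcript("Hello, World!") yields "hello world". Apply the same normalization to every split (train, dev, test) so the model is evaluated on the alphabet it was trained on.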

CSV file format

The audio and transcripts used in training are specified via CSV files. You should supply CSV files for training (train.csv), validation (dev.csv), and testing (test.csv). The CSV files should contain three columns:

  1. wav_filename - the path to a WAV file on your machine

  2. wav_filesize - the number of bytes in the WAV file

  3. transcript - the text transcript of the WAV file
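
A manifest with these three columns can be generated with Python’s standard csv module. The sketch below is illustrative: write_manifest is our own name, it assumes you already have a list of (wav_path, transcript) pairs, and it fills in wav_filesize with os.path.getsize.

```python
import csv
import os

def write_manifest(entries, out_path):
    """Write (wav_path, transcript) pairs as an STT training CSV."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        for wav_path, transcript in entries:
            # wav_filesize is the size of the WAV file in bytes
            writer.writerow([wav_path, os.path.getsize(wav_path), transcript])
```

Passing newline="" to open() lets the csv module handle line endings itself, which keeps the file portable across platforms.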

Alternatively, if you don’t have pre-defined splits for training, validation and testing, you can use the --auto_input_dataset flag to automatically split a single CSV into subsets and generate an alphabet automatically:

$ python -m coqui_stt_training.train --auto_input_dataset samples.csv

Start Training

After you’ve successfully installed STT and have access to data, you can start a training run:

$ cd STT
$ python -m coqui_stt_training.train --train_files train.csv --dev_files dev.csv --test_files test.csv

Next Steps

You will want to customize the training settings to work better with your data and your hardware. You should review the command-line training flags, and experiment with different settings.

For more in-depth training documentation, you should refer to the Advanced Training Topics section.