Home | Previous - About Coqui STT | Next - The alphabet.txt file

Formatting your training data for Coqui STT

Contents

🐸STT expects audio files to be WAV format, mono-channel, and with a 16kHz sampling rate.

For training, testing, and development, you need to feed 🐸STT CSV files which contain three columns: wav_filename,wav_filesize,transcript. The wav_filesize (i.e. number of bytes) is used to group together audio of similar lengths for efficient batching.

Collecting data

This PlayBook is focused on training a speech recognition model, rather than on collecting the data that is required for an accurate model. However, a good model starts with data.

  • Ensure that your voice clips are 10-20 seconds in length. If they are longer or shorter than this, your model will be less accurate.

  • Ensure that every character in your transcription of a voice clip is in your alphabet.txt file

  • Ensure that your voice clips exhibit the same sort of diversity you expect to encounter in your runtime audio. This means a diversity of accents, genders, background noise and so on.

  • Ensure that your voice clips are created using similar microphones to that which you expect in your runtime audio. For example, if you expect to deploy your model on Android mobile phones, ensure that your training data is generated from Android mobile phones.

  • Ensure that the phrasing on which your voice clips are generated covers the phrases you expect to encounter in your runtime audio.

Punctuation and numbers

If you are collecting data that will be used to train a speech model, then you should remove punctuation marks such as dashes, tick marks, quote marks and so on. These will often be confused, and can hinder training an accurate model.

Numbers should be written in full (ie as a cardinal) - that is, as eight rather than 8.

Preparing your data for training

Data from Common Voice

If you are using data from Common Voice for training a model, you will need to prepare it as outlined in the 🐸STT documentation.

In this example we will prepare the Indonesian dataset for training, but you can use any language from Common Voice that you prefer. We’ve chosen Indonesian as it has the same orthographic alphabet as English, which means we don’t have to use a different alphabet.txt file for training; we can use the default.


This example assumes you have already [set up a Docker environment for training. If you have not yet set up your Docker environment, we suggest you pause here and do this first.

First, download the dataset from Common Voice, and extract the archive into your stt-data directory. This makes it available to your Docker container through a bind mount. Start your 🐸STT Docker container with the stt-data directory as a bind mount (this is covered in the environment section).

Your CV corpus data should be available from within the Docker container.

root@3de3afbe5d6f:/STT# ls  stt-data/cv-corpus-6.1-2020-12-11/id/
clips    invalidated.tsv  reported.tsv  train.tsv
dev.tsv  other.tsv        test.tsv      validated.tsv

The ghcr.io/coqui-ai/stt-train Docker image does not come with sox, which is a package used for processing Common Voice data. We need to install sox first.

root@4b39be3b0ffc:/STT# apt-get -y update && apt-get install -y sox

Next, we will run the Common Voice importer that ships with 🐸STT.

root@3de3afbe5d6f:/STT# bin/import_cv2.py stt-data/cv-corpus-6.1-2020-12-11/id

This will process all the CV data into the clips directory, and it can now be used for training.

Importers

🐸STT ships with several scripts which act as importers - preparing a corpus of data for training by 🐸STT.

If you want to create importers for a new language, or a new corpus, you will need to fork the 🐸STT repository, then add support for the new language and/or corpus by creating an importer for that language/corpus.

The existing importer scripts are a good starting point for creating your own importers.

They are located in the bin directory of the 🐸STT repo:

root@3de3afbe5d6f:/STT# ls | grep import
import_aidatatang.py
import_aishell.py
import_ccpmf.py
import_cv.py
import_cv2.py
import_fisher.py
import_freestmandarin.py
import_gram_vaani.py
import_ldc93s1.py
import_librivox.py
import_lingua_libre.py
import_m-ailabs.py
import_magicdata.py
import_primewords.py
import_slr57.py
import_swb.py
import_swc.py
import_ted.py
import_timit.py
import_ts.py
import_tuda.py
import_vctk.py
import_voxforge.py

The importer scripts ensure that the .wav files and corresponding transcriptions are in the .csv format expected by 🐸STT.


Home | Previous - About Coqui STT | Next - The alphabet.txt file