Common Voice training data
This document gives some information about using Common Voice data with STT. If you’re in need of training data, the Common Voice corpus is a good place to start.
Common Voice consists of voice data that was donated through Mozilla’s Common Voice initiative. You can download the data sets for various languages here.
After you download and extract a data set for one language, you’ll find the following contents:
.tsv files, containing metadata such as text transcripts
.mp3 audio files, located in the clips directory
🐸STT cannot directly work with Common Voice data, so you should run our importer script bin/import_cv2.py to format the data correctly:
bin/import_cv2.py --filter_alphabet path/to/some/alphabet.txt /path/to/extracted/common-voice/archive
Providing a filter alphabet is optional. When one is given, the importer excludes every audio file whose transcript contains a character that is not in the specified alphabet. Running the importer with -h will show you additional options.
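The filtering rule described above can be illustrated with a minimal Python sketch. This is not the importer's actual code: the function name transcript_allowed, the example alphabet, and the sample transcripts are all hypothetical, and the alphabet is built in memory rather than read from an alphabet file.

```python
def transcript_allowed(transcript, alphabet):
    """Return True if every character of the transcript is in the alphabet."""
    return all(ch in alphabet for ch in transcript)


# Hypothetical alphabet: lowercase ASCII letters, space, and apostrophe.
alphabet = set("abcdefghijklmnopqrstuvwxyz '")

# Made-up transcripts; the second contains characters outside the alphabet.
samples = ["hello world", "déjà vu", "it's fine"]

# Keep only the transcripts that pass the filter.
kept = [s for s in samples if transcript_allowed(s, alphabet)]
print(kept)
```

A transcript is dropped as soon as it contains a single character outside the alphabet, which is why a narrow alphabet can remove a large share of a data set.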
The importer will create a new WAV file for every MP3 file in the clips directory. The importer will also create the following CSV files:
clips/train.csv
clips/dev.csv
clips/test.csv
The CSV files contain the following fields:
wav_filename - path to the audio file, may be absolute or relative. Our importer produces relative paths
wav_filesize - sample size given in bytes, used for sorting the data before training. Expects an integer
transcript - transcription target for the sample
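As a rough illustration of the three fields above, the following Python sketch parses a CSV with these columns, converts wav_filesize to an integer, and sorts the rows by size. The helper name load_samples, the file names, the sizes, and the ascending sort order are assumptions made for the example; it reads from an in-memory string rather than a real importer output file.

```python
import csv
import io

EXPECTED_FIELDS = ["wav_filename", "wav_filesize", "transcript"]


def load_samples(csv_file):
    """Read rows, check the expected columns exist, and sort by file size."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_FIELDS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    # wav_filesize is expected to be an integer number of bytes.
    rows = [{**row, "wav_filesize": int(row["wav_filesize"])} for row in reader]
    return sorted(rows, key=lambda row: row["wav_filesize"])


# Made-up rows standing in for a file such as clips/train.csv.
example = io.StringIO(
    "wav_filename,wav_filesize,transcript\n"
    "b.wav,204800,bonjour tout le monde\n"
    "a.wav,102400,bonjour\n"
)

samples = load_samples(example)
print([s["wav_filename"] for s in samples])
```

To try it on real data, replace the in-memory example with an opened file, for instance open("clips/train.csv", newline="", encoding="utf-8").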
To use Common Voice data for training, validation and testing, you should pass the CSV filenames via --train_files, --dev_files, and --test_files, respectively.
For example, if you downloaded, extracted, and imported the French language data from Common Voice, you will have a new local directory named fr. You can train STT with this new French data as follows:
$ python -m coqui_stt_training.train \
--train_files fr/clips/train.csv \
--dev_files fr/clips/dev.csv \
--test_files fr/clips/test.csv
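If you import several languages, the same command can be assembled programmatically. The Python sketch below only builds the argument list shown above for a given language directory; the helper name build_train_command is hypothetical, and the sketch assumes the importer wrote its CSV files to the clips subdirectory as described earlier.

```python
from pathlib import PurePosixPath


def build_train_command(data_dir):
    """Build the training command for one imported Common Voice directory."""
    clips = PurePosixPath(data_dir) / "clips"
    return [
        "python", "-m", "coqui_stt_training.train",
        "--train_files", str(clips / "train.csv"),
        "--dev_files", str(clips / "dev.csv"),
        "--test_files", str(clips / "test.csv"),
    ]


cmd = build_train_command("fr")
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) to start training from a script.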