Home | Previous - Training your model | Next - Deploying your model

Testing and evaluating your trained model


This section of the PlayBook covers testing your trained model and setup before deployment. If you need to test the 🐸STT source code itself, please consult the source code tests.

Let’s say that you’ve already trained an acoustic model and a language model (a scorer). Congratulations! But before you deploy your setup, you will need to evaluate how well it will work in practice - on your intended use case.

We’re talking here about a setup rather than a trained model on purpose, because multiple factors influence how well a setup performs in real life, and you need to keep all of them in mind. The acoustic model and language model work with each other to turn speech into text, and there are many ways (for example, different decoding hyperparameter settings) in which you can combine those two models.

Gathering training information

When you ran train.py in the training section, training finished by printing a set of WER and CER metrics. The output would have looked similar to this:

Testing model on stt-data/cv-corpus-6.1-2020-12-11/id/clips/test.csv
Test epoch | Steps: 1844 | Elapsed Time: 0:51:11
Test on stt-data/cv-corpus-6.1-2020-12-11/id/clips/test.csv - WER: 1.000000, CER: 0.824103, loss: 104.989326
Best WER:
WER: 1.000000, CER: 0.873786, loss: 317.729767
 - wav: file://stt-data/cv-corpus-6.1-2020-12-11/id/clips/common_voice_id_23819387.wav
 - src: "kami percaya bahwa perdamaian dari koeksistensi dua sistem sosial yang berbeda sepenuhnya bisa terwujud"
 - res: "aaaaaaaaaaaaa"
WER: 1.000000, CER: 0.851485, loss: 295.564240
 - wav: file://stt-data/cv-corpus-6.1-2020-12-11/id/clips/common_voice_id_19748999.wav
 - src: "jika anda mencari informasi tentang pergerakan esperanto di indonesia silakan kunjungi halaman webnya"
 - res: "aaaaaaaaaaaaaaa"
WER: 1.000000, CER: 0.875000, loss: 283.844696
 - wav: file://stt-data/cv-corpus-6.1-2020-12-11/id/clips/common_voice_id_23819383.wav
 - src: "indah memiliki standar hidup yang tinggi tidak heran dia dikenal sebagai orang yang perfeksionis"
 - res: "aaaaaaaaaaaaaa"
WER: 1.000000, CER: 0.818182, loss: 276.511597
 - wav: file://stt-data/cv-corpus-6.1-2020-12-11/id/clips/common_voice_id_24015532.wav
 - src: "selain itu bahasa gaul juga menciptakan kosakata baru yang terbentuk melalui kaidah kaidah tertentu"
 - res: "aaaaaaaaaaaaaaaaaa"
WER: 1.000000, CER: 0.820000, loss: 269.262909
 - wav: file://stt-data/cv-corpus-6.1-2020-12-11/id/clips/common_voice_id_24015257.wav
 - src: "berbagai bahasa daerah dan bahasa asing menjadi bahasa serapan dan kemudian menjadi bahasa indonesia"
 - res: "aaaaaaaaaaaaaaaaaa"
Median WER:
WER: 1.000000, CER: 0.800000, loss: 97.870811
 - wav: file://stt-data/cv-corpus-6.1-2020-12-11/id/clips/common_voice_id_20954705.wav
 - src: "pemandangan dari hotel sangat indah"
 - res: "aaaaaaa"
WER: 1.000000, CER: 0.941176, loss: 97.848030
 - wav: file://stt-data/cv-corpus-6.1-2020-12-11/id/clips/common_voice_id_20387916.wav
 - src: "hari ini hujan turun rintik rintik"
 - res: "aaaaaaaa"
WER: 1.000000, CER: 0.800000, loss: 97.800034
 - wav: file://stt-data/cv-corpus-6.1-2020-12-11/id/clips/common_voice_id_20879262.wav
 - src: "berapa biaya sewa untuk ruangan ini"
 - res: "aaaaaaaaa"
WER: 1.000000, CER: 0.705882, loss: 97.773476
 - wav: file://stt-data/cv-corpus-6.1-2020-12-11/id/clips/common_voice_id_19611909.wav
 - src: "saya bukan gay tapi pacar saya gay"
 - res: "aaaaaaaaaaa"
WER: 1.000000, CER: 0.806452, loss: 97.725914
 - wav: file://stt-data/cv-corpus-6.1-2020-12-11/id/clips/common_voice_id_24018261.wav
 - src: "selamat datang di san fransisco"
 - res: "aaaaaaaaaaa"
Worst WER:
WER: 1.000000, CER: 0.800000, loss: 25.830986
 - wav: file://stt-data/cv-corpus-6.1-2020-12-11/id/clips/common_voice_id_22546523.wav
 - src: "tidak"
 - res: "aaaa"
WER: 1.000000, CER: 1.333333, loss: 25.499653
 - wav: file://stt-data/cv-corpus-6.1-2020-12-11/id/clips/common_voice_id_22185104.wav
 - src: "nol"
 - res: "aaaa"
WER: 1.000000, CER: 0.800000, loss: 23.874924
 - wav: file://stt-data/cv-corpus-6.1-2020-12-11/id/clips/common_voice_id_22546522.wav
 - src: "empat"
 - res: "aaaa"
WER: 1.000000, CER: 0.750000, loss: 22.441967
 - wav: file://stt-data/cv-corpus-6.1-2020-12-11/id/clips/common_voice_id_22528020.wav
 - src: "tiga"
 - res: "aaaa"
WER: 1.000000, CER: 0.750000, loss: 21.356133
 - wav: file://stt-data/cv-corpus-6.1-2020-12-11/id/clips/common_voice_id_22412536.wav
 - src: "lima"
 - res: "aaaa"

Note: the WER and CER in this example output are both poor because a custom scorer for the language hasn’t been built yet.
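If you saved the training output to a file, the overall figures can be pulled out of it programmatically. The following is a minimal sketch, assuming the summary line has exactly the format shown in the sample output above; the function name parse_test_summary and the regular expression are illustrative and not part of 🐸STT, and other versions may print the line differently.

```python
import re

# Pattern for the "Test on ..." summary line, modelled on the sample output
# above. This is an assumption about the log format, not a 🐸STT API.
SUMMARY_RE = re.compile(
    r"^Test on (?P<path>\S+) - WER: (?P<wer>[\d.]+), "
    r"CER: (?P<cer>[\d.]+), loss: (?P<loss>[\d.]+)"
)


def parse_test_summary(log_text):
    """Return one dict per 'Test on ...' summary line found in log_text."""
    results = []
    for line in log_text.splitlines():
        match = SUMMARY_RE.match(line.strip())
        if match:
            results.append({
                "path": match.group("path"),
                "wer": float(match.group("wer")),
                "cer": float(match.group("cer")),
                "loss": float(match.group("loss")),
            })
    return results


# The first three lines of the sample output above.
sample_log = """Testing model on stt-data/cv-corpus-6.1-2020-12-11/id/clips/test.csv
Test epoch | Steps: 1844 | Elapsed Time: 0:51:11
Test on stt-data/cv-corpus-6.1-2020-12-11/id/clips/test.csv - WER: 1.000000, CER: 0.824103, loss: 104.989326
"""

summary = parse_test_summary(sample_log)
print(summary)
```

Only the overall summary line is matched; the per-clip entries under Best WER, Median WER and Worst WER start with "WER:" rather than "Test on" and are ignored by this sketch.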

If you didn’t keep the training output, you can re-run just the testing part of training with the following command, as long as you stored checkpoints while training:

root@9d052f0c3dcf:/STT# python3 train.py \
    --test_files stt-data/cv-corpus-6.1-2020-12-11/id/clips/test.csv \
    --checkpoint_dir stt-data/checkpoints

When you pass only the --test_files and --checkpoint_dir parameters, train.py re-runs testing. Note that this command will fail if you don’t have checkpoints stored.

Word Error Rate, Character Error Rate, loss and model performance

During acoustic model training with TensorFlow, you hopefully saw the training and validation loss go down over time. At the end of training, 🐸STT would have printed two scores for your model: the Word Error Rate (WER) and the Character Error Rate (CER).

The WER measures how many words 🐸STT transcribed incorrectly, and is generally a measure of how well the language model (scorer) is operating. The CER measures how many characters 🐸STT transcribed incorrectly, and is generally a measure of how well the acoustic model is operating, along with the alphabet file. Both are error rates, so lower is better.
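To make the two metrics concrete, here is a minimal sketch of how they are conventionally computed: the edit (Levenshtein) distance between the reference and the transcript, divided by the length of the reference, counted in words for WER and in characters for CER. The function names are illustrative and this is not 🐸STT's own implementation, although it reproduces the figures for the short clips in the sample output above.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(src, res):
    """Word error rate; assumes src contains at least one word."""
    ref_words = src.split()
    return edit_distance(ref_words, res.split()) / len(ref_words)


def cer(src, res):
    """Character error rate; assumes src is not empty."""
    return edit_distance(src, res) / len(src)


# src/res pairs taken from the "Worst WER" entries in the sample output.
print(wer("tiga", "aaaa"))            # 1.0
print(cer("tiga", "aaaa"))            # 0.75
print(round(cer("nol", "aaaa"), 6))   # 1.333333
```

Because the denominator is the length of the reference, either rate can exceed 1 when the transcript is longer than the reference, as with the clip "nol" above.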

WER and CER are the typical scores reported for speech recognition models, but their usefulness will vary a lot, depending on the use case of your setup. You should not take the WER as the “be-all” metric for the performance of your setup.

Often, the data in your test .csv file will be different to the data your model will be asked to perform inference on when it is deployed. It is the performance of your setup at runtime - in a real life context - that is most important.

Acoustic model and language model working together

Remember, the acoustic model and language model work together to produce your transcript. You might have an acoustic model that seems to perform abysmally on its own, but once you combine it with the right language model, you may see near-perfect accuracy. How is this possible?

The acoustic model is where the majority of training time is spent. The job of the acoustic model is to use the 🐸STT algorithm, a sequence-to-sequence algorithm, to learn which acoustic signals correspond to which letters (as specified in the alphabet.txt file). How well it does this is measured by the character error rate (CER).

In many languages, though, words that sound the same are spelled differently. These are called homophones. For example, the words their, they're and there are all pronounced similarly in English, but are spelled differently.

The language model seeks to overcome this challenge. The language model, packaged as a scorer, predicts which words are likely to follow each other in a sequence. This is also known in linguistics as n-gram modelling. For example, the words nugget, wings and salad are more likely to occur after the word chicken than after, say, ticket, even though the words chicken and ticket have similar sounds.
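The n-gram idea can be illustrated with a toy bigram model. The sketch below is for illustration only: the five-sentence corpus is made up, the function name bigram_prob is hypothetical, and no smoothing is applied, whereas a real scorer is built from a far larger text corpus.

```python
from collections import Counter

# Tiny made-up corpus, for illustration only.
corpus = [
    "i ordered a chicken salad",
    "she likes chicken wings",
    "he ate a chicken nugget",
    "i bought a ticket",
    "the ticket was expensive",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))  # adjacent word pairs


def bigram_prob(prev_word, word):
    """Estimate P(word | prev_word) from raw counts, without smoothing."""
    if unigrams[prev_word] == 0:
        return 0.0
    return bigrams[(prev_word, word)] / unigrams[prev_word]


# "salad" follows "chicken" in the corpus but never follows "ticket".
print(bigram_prob("chicken", "salad"))
print(bigram_prob("ticket", "salad"))
```

When the acoustic model is unsure whether it heard chicken or ticket, a following word such as salad tips the decision towards chicken, because that word pair is far more probable under the language model.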

The acoustic model and the language model work together to provide better overall accuracy.


In general, if you have a low CER (your acoustic model is detecting characters accurately) but a high WER (words are not being detected accurately), this indicates that you should retrain your language model (scorer).

Conversely, if you have a high CER, and a low WER, this indicates that your acoustic model may require fine-tuning.
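The rule of thumb above can be written down as a small helper. This is a sketch only: the function name suggest_next_step and both thresholds are hypothetical values chosen for illustration, not figures from 🐸STT, and what counts as "high" or "low" depends on your language, data and use case.

```python
def suggest_next_step(cer, wer, cer_threshold=0.10, wer_threshold=0.20):
    """Map a CER/WER pair to the rule of thumb described above.

    The default thresholds are arbitrary placeholders; choose values
    that make sense for your own setup.
    """
    cer_high = cer > cer_threshold
    wer_high = wer > wer_threshold
    if not cer_high and wer_high:
        return "retrain the language model (scorer)"
    if cer_high and not wer_high:
        return "fine-tune the acoustic model"
    # The rule of thumb does not single out a component in other cases.
    return "no single component indicated"


print(suggest_next_step(cer=0.05, wer=0.45))
print(suggest_next_step(cer=0.30, wer=0.10))
```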

Fine tuning and transfer learning

Fine tuning and transfer learning are two processes used to improve the accuracy of an acoustic model. In fine tuning, training continues from the checkpoints of another model, using the same alphabet.txt file. In transfer learning, the alphabet layer is removed from the neural network, which allows a model for one language to be trained starting from a model trained on another language. In general, this works best between languages that have a similar vocabulary and/or structure. For example, English and French will work better together than English and Hindi, because English and French are more similar to each other.

For more information on fine tuning in 🐸STT, please consult the documentation.

For more information on transfer learning in 🐸STT, please consult the documentation.
