The alphabet.txt file¶
This tiny text file is easy to overlook, but it is very important. The exact same alphabet must be used to train the both acoustic model and the language model. This alphabet.txt is the glue that holds the language model and the acoustic model together.
What is alphabet.txt ?¶
Let’s take a look at the English alphabet.txt which was used to train the release 🐸STT models. If you were to ask a native English speaker to write down the alphabet, this
alphabet.txt isn’t what they would write. The
alphabet.txt file contains all characters used in a language which are necessary for writing. Looking at the English alphabet file, the first character is the space
" ". We need spaces to separate words when writing. Following the space, we find all the familiar letters of the alphabet which children learn in school. Finally, we find the apostrophe “’”. The apostrophe is needed for writing contractions, which are very common in English. The apostrophe can distinguish words like “we’re” and “were”, which have different prounuciations. Not all languages need spaces, and not all languages need apostrophes. Creating the alphabet for a new language takes some research. Two people creating the same alphabet file may disagree, and no one is objectively right. The best alphabet will depend on the target application and the available training data. You may notice that the
alphabet.txt file released with 🐸STT for English does not contain any characters with accents, even though they do occur sometimes in English. The off-the-shelf 🐸STT model cannot produce words like “naïvely” or “résumé”, and this was a design decision. We could make an alphabet that contains every possible character for every possible loan-word into English, but then we would need training data for all those new characters.
How does the Glue work?¶
alphabet.txt helps 🐸STT make a lookup table, and at run-time that lookup table is used instead of characters themselves. For the English example, the 🐸STT acoustic model doesn’t have any idea what the letter ‘a’ is, but it does know what index ‘1’ is. The
alphabet.txt file tells us that the index ‘1’ for the acoustic model corresponds to the letter ‘a’, so we can make sense of the output. If the indeces for the acoustic model and language model don’t match, then the acoustic model might hear an ‘a’, but the language model interprets it instead as ‘b’. This mis-match is sneaky, and if the alphabets used for the acoustic model and language are similar, but slightly off, this is a hard problem to diagnose. If you used different
alphabet.txt files, you may not get any run-time error messages, but the output transcriptions will make no sense.
How to diagnose mis-matched alphabets?¶
If you think you used different alphabets to create a language model and an acoustic model, try decoding without the scorer. If you can decode the audio without a scorer and the output is reasonable, but when you decode the same audio with a scorer, and the output is not reasonable, then you could have mis-matched alphabets. Usually the easiest way to fix this is to re-compile the scorer with the correct alphabet.