About Coqui STT
What does Coqui STT do?
🐸STT is a tool for automatically transcribing spoken audio. It takes digital audio as input and returns one or more "most likely" text transcripts of that audio.
🐸STT currently uses an implementation of the DeepSpeech algorithm developed by Baidu and presented in this research paper.
🐸STT can be used for two key activities related to Speech-to-Text: training and inference. Speech-to-Text inference, the process of converting spoken audio to written text, relies on a trained model. With appropriate hardware (a GPU), 🐸STT can train a model on a set of voice data, known as a corpus. Inference, or recognition, can then be performed using the trained model. 🐸STT also includes several pre-trained models.
This Playbook is focused on helping you train your own model.
How does Coqui STT work?
🐸STT takes a stream of audio as input and converts it into a sequence of characters in the designated alphabet. This conversion happens in two basic steps. First, the audio is converted into a sequence of probabilities over the characters in the alphabet. Second, this sequence of probabilities is converted into a sequence of characters.
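To make the idea of "probabilities in, characters out" concrete, here is a minimal sketch. It is not how 🐸STT itself decodes: the probability table is written by hand, the four-symbol alphabet and its blank symbol are hypothetical, and a simple greedy rule (pick the most likely symbol per audio frame, merge repeats, drop blanks) stands in for the real decoder.

```python
# Hypothetical toy alphabet. Index 0 is a "blank" symbol meaning "no character here".
ALPHABET = ["_", "a", "c", "t"]
BLANK = 0


def greedy_decode(probs):
    """Turn a list of per-frame probability rows into a string.

    Each row holds one probability per symbol in ALPHABET.
    """
    chars = []
    previous = BLANK
    for row in probs:
        # Most likely symbol for this audio frame.
        idx = max(range(len(row)), key=row.__getitem__)
        # Keep it only if it is not blank and not a repeat of the previous frame.
        if idx != BLANK and idx != previous:
            chars.append(ALPHABET[idx])
        previous = idx
    return "".join(chars)


# Stand-in for the output of step one: one row of probabilities per audio frame.
probs = [
    [0.1, 0.1, 0.7, 0.1],  # "c"
    [0.1, 0.1, 0.7, 0.1],  # "c" again (same sound, next frame)
    [0.6, 0.2, 0.1, 0.1],  # blank
    [0.1, 0.7, 0.1, 0.1],  # "a"
    [0.1, 0.1, 0.1, 0.7],  # "t"
    [0.1, 0.1, 0.1, 0.7],  # "t" again
]

transcript = greedy_decode(probs)
print(transcript)  # cat
```

The blank symbol is what lets a decoder like this tell a held sound (repeated frames, merged into one character) apart from a genuinely doubled letter (two frames separated by a blank).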
The first step is made possible by a Deep Neural Network, and the second step is made possible by an N-gram language model. The neural network is trained on audio and corresponding text transcripts, and the N-gram language model is trained on a text corpus (which is often different from the text transcripts of the audio). The neural model is trained to predict the text from speech, and the language model is trained to predict text from preceding text. At a very high level, you can think of the first part (the acoustic model) as a phonetic transcriber, and the second part (the language model) as a spelling and grammar checker.
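The "spelling and grammar checker" role of the language model can be sketched with a toy example. Everything here is an assumption made for illustration: the three-sentence text corpus, the two candidate transcripts and their acoustic probabilities, the word-level bigram model with add-one smoothing, and the function names. 🐸STT's actual language model and decoder are more sophisticated than this.

```python
import math
from collections import Counter


def train_bigram_lm(corpus):
    """Count word bigrams and their contexts in a list of sentences."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in corpus:
        words = ["<s>"] + sentence.split()
        contexts.update(words[:-1])            # words that are followed by something
        bigrams.update(zip(words, words[1:]))  # (previous word, word) pairs
        vocab.update(words[1:])
    return contexts, bigrams, len(vocab)


def lm_log_prob(sentence, contexts, bigrams, vocab_size):
    """Log-probability of a sentence under the bigram model (add-one smoothing)."""
    words = ["<s>"] + sentence.split()
    total = 0.0
    for prev, word in zip(words, words[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size))
    return total


# Hypothetical text corpus for the language model (separate from any audio).
corpus = [
    "i can recognize speech",
    "speech is easy to recognize",
    "we recognize speech",
]
contexts, bigrams, vocab_size = train_bigram_lm(corpus)

# Hypothetical acoustic model output: similar-sounding candidates with probabilities.
candidates = [
    ("i can wreck a nice beach", 0.55),
    ("i can recognize speech", 0.45),
]

# Acoustic score alone prefers the first candidate.
acoustic_best = max(candidates, key=lambda c: c[1])[0]


def combined_score(candidate):
    """Acoustic log-probability plus language model log-probability."""
    text, acoustic_prob = candidate
    return math.log(acoustic_prob) + lm_log_prob(text, contexts, bigrams, vocab_size)


# Adding the language model score favors the word sequence seen in the corpus.
best = max(candidates, key=combined_score)[0]
print(acoustic_best)
print(best)
```

In this toy setup the acoustic scores slightly favor the wrong transcript, and the language model tips the balance toward the word sequence that is more plausible as text.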
How is Coqui STT implemented?