Deployment / Inference

You might call the act of transcribing audio with a trained model either “deployment” or “inference”. In this document we use “deployment”, but we consider the terms interchangeable.

Introduction

Deployment is the process of feeding audio (speech) into a trained 🐸STT model and receiving text (transcription) as output. In practice you probably want to use two models for deployment: an audio model and a text model. The audio model (a.k.a. the acoustic model) is a deep neural network which converts audio into text. The text model (a.k.a. the language model / scorer) returns the likelihood of a string of text. If the acoustic model makes spelling or grammatical mistakes, the language model can help correct them.
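To build intuition for how the two models cooperate, here is a toy sketch in Python. The candidate transcripts, their probabilities, and the `rescore` helper with its `alpha` weight are all made up for illustration. This is not the 🐸STT decoder, which is considerably more involved; it only shows how a text score can overturn a mistake the audio score alone would make.

```python
import math

# Hypothetical candidate transcripts with made-up acoustic log-probabilities.
# On audio evidence alone, the wrong spelling "red" is slightly preferred.
ACOUSTIC_LOGPROB = {
    "i red a book": math.log(0.40),
    "i read a book": math.log(0.35),
    "eye read a book": math.log(0.25),
}

# Toy "language model": made-up log-likelihoods of each string of text.
LM_LOGPROB = {
    "i red a book": math.log(0.01),
    "i read a book": math.log(0.90),
    "eye read a book": math.log(0.09),
}


def rescore(acoustic, lm, alpha=1.0):
    """Rank candidates by acoustic score plus alpha-weighted text score."""
    combined = {text: score + alpha * lm[text] for text, score in acoustic.items()}
    return sorted(combined, key=combined.get, reverse=True)


best_acoustic_only = max(ACOUSTIC_LOGPROB, key=ACOUSTIC_LOGPROB.get)
best_with_lm = rescore(ACOUSTIC_LOGPROB, LM_LOGPROB)[0]

print(best_acoustic_only)  # i red a book
print(best_with_lm)        # i read a book
```

With `alpha=0.0` the text score is ignored and the ranking falls back to the acoustic scores alone.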

You can deploy 🐸STT models either via a command-line client or a language binding.

Download trained Coqui STT models

You can find pre-trained models ready for deployment on the Coqui Model Zoo. You can also use the 🐸STT Model Manager to download and try out the latest models:

# Create a virtual environment
$ python3 -m venv venv-stt
$ source venv-stt/bin/activate

# Install 🐸STT model manager
$ python -m pip install -U pip
$ python -m pip install coqui-stt-model-manager

# Run the model manager. A browser tab will open and you can then download and test models from the Model Zoo.
$ stt-model-manager

Every official 🐸STT release provides several model files. Acoustic models use the .tflite extension, and language models use the .scorer extension. You can read more about language models with regard to the decoding process and how scorers are generated.

How will a model perform on my data?

How well a 🐸STT model transcribes your audio will depend on a lot of things. The general rule is the following: the more similar your data is to the data used to train the model, the better the model will transcribe your data. The more your data differs from the data used to train the model, the worse the model will perform on your data. This general rule applies to both the acoustic model and the language model. There are many dimensions upon which data can differ, but here are the most important ones:

  • Language

  • Accent

  • Speaking style

  • Speaking topic

  • Speaker demographics

If you take a 🐸STT model trained on English and pass Spanish into it, you should expect the model to perform horribly. Imagine you have a friend who only speaks English and you ask her to make Spanish subtitles for a Spanish film: you wouldn’t expect to get good subtitles. This is an extreme example, but it helps to form an intuition for what to expect from 🐸STT models. Imagine that the 🐸STT models are like people who speak a certain language with a certain accent, and then think about what would happen if you asked that person to transcribe your audio.

An acoustic model (i.e. a .tflite file) has “learned” how to transcribe a certain language, and the model probably understands some accents better than others. In addition to languages and accents, acoustic models are sensitive to the style of speech, the topic of speech, and the demographics of the person speaking. The language model (.scorer) has been trained on text alone. As such, the language model is sensitive to how well the topic and style of speech match those of the text used in training. The 🐸STT release notes include detailed information on the data used to train the models. If the data used for training the off-the-shelf models does not align with your intended use case, it may be necessary to adapt or train new models in order to improve transcription on your data.
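One way to check how well an off-the-shelf model fits your data is to transcribe a few files for which you already have correct transcripts and compute the word error rate (WER). The sketch below is a generic WER implementation, not a 🐸STT tool. The function name `word_error_rate` is our own, and the sketch assumes both transcripts have already been normalized in the same way (casing, punctuation).

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat on the mat", "the cat sat"))  # 0.5
```

A lower WER on a sample of your own audio suggests the model is a closer match to your data; a high WER is a hint that a custom scorer or fine tuning may be worth the effort.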

Training your own language model is often a good way to improve transcription on your audio. The process and tools used to generate a language model are described in How to Train a Language Model and general information can be found in CTC beam search decoder. Generating a scorer from a constrained topic dataset is a quick process and can bring significant accuracy improvements if your audio is from a specific topic.

Acoustic model training is described in Training: Quickstart. Fine tuning an off-the-shelf acoustic model to your own data can be a good way to improve performance. See the fine tuning and transfer learning sections for more information.

Model compatibility

🐸STT models are versioned to mitigate incompatibilities with clients and language bindings. If you get an error saying your model file version is too old for the client, you should either (1) upgrade to a newer model, (2) re-export your model from the checkpoint using a newer version of the code, or (3) downgrade your client if you need to use the old model and can’t re-export it.

Using the Python package

Pre-built binaries for deploying a trained model can be installed with pip. It is highly recommended that you use Python 3.6 or higher in a virtual environment. Both pip and venv are included in normal Python 3 installations.

When you create a new Python virtual environment, you create a directory containing a python binary and everything needed to run 🐸STT. For the purpose of this documentation, we will use $HOME/coqui-stt-venv, but you can use whatever directory you like.

Let’s make the virtual environment:

$ python3 -m venv $HOME/coqui-stt-venv/

After this command completes, your new environment is ready to be activated. Each time you work with 🐸STT, you need to activate your virtual environment, as such:

$ source $HOME/coqui-stt-venv/bin/activate

After your environment has been activated, you can use pip to install stt, as such:

(coqui-stt-venv)$ python -m pip install -U pip && python -m pip install stt

After installation has finished, you can call stt from the command-line.

The following command assumes you downloaded the pre-trained models.

(coqui-stt-venv)$ stt --model model.tflite --scorer huge-vocabulary.scorer --audio my_audio_file.wav

See the Python client for an example of how to use the package programmatically.
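As a rough sketch of programmatic use, the snippet below loads a model, optionally attaches a scorer, and transcribes one WAV file. It assumes that the `stt` package exposes a `Model` class with `enableExternalScorer`, `sampleRate`, and `stt` methods, and that `stt` accepts a NumPy array of 16-bit samples; please check the API reference for your installed version. The helper `load_wav_bytes` is our own, and the file names in the usage note are the placeholders used elsewhere on this page.

```python
import wave


def load_wav_bytes(path, expected_rate):
    """Return the raw PCM bytes of a mono 16-bit WAV file at the expected rate."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected mono 16-bit PCM audio")
        if wav.getframerate() != expected_rate:
            raise ValueError(
                "expected %d Hz audio, got %d Hz" % (expected_rate, wav.getframerate())
            )
        return wav.readframes(wav.getnframes())


def transcribe(model_path, audio_path, scorer_path=None):
    """Transcribe one file. Assumes the stt Python API described in the lead-in."""
    # Imported lazily so that load_wav_bytes can be used without these packages.
    import numpy as np
    from stt import Model

    model = Model(model_path)
    if scorer_path is not None:
        model.enableExternalScorer(scorer_path)

    pcm = load_wav_bytes(audio_path, model.sampleRate())
    return model.stt(np.frombuffer(pcm, dtype=np.int16))
```

With the pre-trained models downloaded, a call such as `transcribe("model.tflite", "my_audio_file.wav", "huge-vocabulary.scorer")` would return the transcript as a string. Unlike the command-line client, this sketch does not resample audio, so it rejects files whose sample rate differs from the model's.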

GPUs will soon be supported: If you have a supported NVIDIA GPU on Linux, you can install the GPU specific package as follows:

(coqui-stt-venv)$ python -m pip install -U pip && python -m pip install stt-gpu

See the release notes to find which GPUs are supported. Please ensure you have the required CUDA dependency.

Using the Node.JS / Electron.JS package

Note that 🐸STT currently only provides packages for CPU deployment with Python 3.5 or higher on Linux. We’re working to get the rest of our usually supported packages back up and running as soon as possible.

You can download the JS bindings using npm:

npm install stt

Special thanks to Huan, Google Developers Expert in Machine Learning (ML GDE), for providing the STT project name on npmjs.org.

Please note that as of now, we support:

  • Node.JS versions 4 to 13

  • Electron.JS versions 1.6 to 7.1

TypeScript support is also provided.

If you’re using Linux and have a supported NVIDIA GPU, you can install the GPU specific package as follows:

npm install stt-gpu

See the release notes to find which GPUs are supported. Please ensure you have the required CUDA dependency.

See the TypeScript client for an example of how to use the bindings programmatically.

Using the Android AAR libstt package

A pre-built libstt Android AAR package for Android 7.0+ can be downloaded from GitHub Releases. To use it in your Android application, first modify your app’s build.gradle file to add a local directory as a repository. In the repositories section, add the following definition:

repositories {
    flatDir {
        dirs 'libs'
    }
}

Then, create a libs directory inside your app’s folder, and place the libstt AAR file there. Finally, add the following dependency declaration in your app’s build.gradle file:

dependencies {
    implementation fileTree(dir: 'libs', include: ['*.aar'])
}

This will link all .aar files in the libs directory you just created, including libstt.

Using the command-line client

The pre-built binaries for the stt command-line (compiled C++) client are available in the native_client.tar.xz archive for your desired platform. You can download the archive from our releases page.

Assuming you have downloaded the pre-trained models, you can use the client as such:

./stt --model model.tflite --scorer huge-vocabulary.scorer --audio audio_input.wav

See the help output with ./stt -h for more details.
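If you want to drive the command-line client from a script, for instance to transcribe many files, a thin wrapper can assemble the same arguments shown above. This is a sketch under assumptions: the helper names `build_stt_command` and `run_stt` are our own, only the `--model`, `--scorer`, and `--audio` flags from the command above are used, and we assume the client writes the transcript to standard output and accepts being run without a scorer.

```python
import subprocess


def build_stt_command(model, audio, scorer=None, binary="./stt"):
    """Assemble the argument list for the stt command-line client."""
    command = [binary, "--model", model]
    if scorer is not None:
        command += ["--scorer", scorer]
    command += ["--audio", audio]
    return command


def run_stt(model, audio, scorer=None, binary="./stt"):
    """Run the client and return whatever it printed on standard output."""
    result = subprocess.run(
        build_stt_command(model, audio, scorer, binary),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output as text (works on Python 3.6)
        check=True,               # raise CalledProcessError on a non-zero exit
    )
    return result.stdout.strip()


print(build_stt_command("model.tflite", "audio_input.wav", "huge-vocabulary.scorer"))
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.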

Installing bindings from source

If pre-built binaries aren’t available for your system, you’ll need to build and install them from source. Follow the native client build and installation instructions.

Dockerfile for building from source

We provide Dockerfile.build to automatically build libstt.so, the C++ native client, Python bindings, and KenLM.

Before building, make sure that git submodules have been initialised:

git submodule sync
git submodule update --init

Then build with:

docker build . -f Dockerfile.build -t stt-image

You can then use stt inside the Docker container:

docker run -it stt-image bash

Runtime Dependencies

Running stt may require runtime dependencies. Please refer to your system’s documentation on how to install these dependencies.

  • sox - The Python and Node.JS clients use SoX to resample files to 16 kHz

  • libgomp1 - libsox (statically linked into the clients) depends on OpenMP

  • libstdc++ - Standard C++ Library implementation

  • libpthread - Reported dependency on Linux. On Ubuntu, libpthread is part of the libpthread-stubs0-dev package

  • Redistributable Visual C++ 2015 Update 3 (64-bit) - Reported dependency on Windows. Please download from Microsoft
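Because SoX is only needed when a file has to be resampled, it can be useful to check up front whether your audio is already at 16 kHz and whether a sox executable is on your PATH. The sketch below uses only the Python standard library; the helper names `describe_wav` and `needs_resampling` are our own, and the 16000 Hz default mirrors the rate mentioned in the list above rather than being read from a model.

```python
import shutil
import wave


def describe_wav(path):
    """Return the basic format parameters of a WAV file."""
    with wave.open(path, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": wav.getframerate(),
        }


def needs_resampling(path, target_rate=16000):
    """True if the file's sample rate differs from the target rate."""
    return describe_wav(path)["sample_rate_hz"] != target_rate


# shutil.which returns None when no executable named "sox" is on the PATH.
sox_available = shutil.which("sox") is not None
print("sox found on PATH:", sox_available)
```

If `needs_resampling` returns True for your files and sox is not found, installing SoX or converting the audio to 16 kHz beforehand are the two obvious options.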

CUDA Dependency

The GPU-capable builds (Python, Node.JS, C++, etc.) depend on CUDA 10.1 and cuDNN v7.6.