Python¶

Model¶

class Model(model_path)[source]¶

Class holding a Coqui STT model

Parameters: model_path (str) – Path to model file to load

beamWidth()[source]¶

Get the beam width value used by the model. If setBeamWidth() was not called before, returns the default value loaded from the model file.

Returns: Beam width value used by the model.
Type: int

setBeamWidth(beam_width)[source]¶

Set beam width value used by the model.

Parameters: beam_width (int) – The beam width used by the model. A larger beam width generally yields better results at the cost of longer decoding time.
Returns: Zero on success, non-zero on failure.
Type: int

sampleRate()[source]¶

Return the sample rate expected by the model.

Returns: Sample rate.
Type: int

enableExternalScorer(scorer_path)[source]¶

Enable decoding using an external scorer.

Parameters: scorer_path (str) – The path to the external scorer file.
Throws: RuntimeError on error

disableExternalScorer()[source]¶

Disable decoding using an external scorer.

Returns: Zero on success, non-zero on failure.

addHotWord(word, boost)[source]¶

Add a word and its boost for decoding.

Words that don’t occur in the scorer (e.g. proper nouns) or strings that contain spaces won’t be taken into account.

Parameters

word (str) – the hot-word
boost (float) – A positive value increases, and a negative value reduces, the chance of the word occurring in a transcription. An excessive positive boost might cause the word following the hot-word to be split into individual letters.

Throws

RuntimeError on error
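
A minimal sketch of registering several hot-words in one call. The helper name `apply_hot_words` and the `boosts` mapping are hypothetical; only `addHotWord` comes from this reference. Entries containing whitespace are skipped up front, since the note above says they would not be taken into account.

```python
def apply_hot_words(model, boosts):
    """Register each entry of `boosts` (word -> boost) on `model`.

    Returns the list of entries that were skipped because they contain
    whitespace, which the decoder would not take into account.
    """
    skipped = []
    for word, boost in boosts.items():
        if any(ch.isspace() for ch in word):
            skipped.append(word)
            continue
        # addHotWord raises RuntimeError on error; let it propagate.
        model.addHotWord(word, float(boost))
    return skipped
```

Returning the skipped entries lets the caller report them instead of silently losing them.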

eraseHotWord(word)[source]¶

Remove entry for word from hot-words dict.

Parameters: word (str) – the hot-word
Throws: RuntimeError on error

clearHotWords()[source]¶

Remove all entries from hot-words dict.

Throws: RuntimeError on error

setScorerAlphaBeta(alpha, beta)[source]¶

Set hyperparameters alpha and beta of the external scorer.

Parameters

alpha (float) – The alpha hyperparameter of the decoder. Language model weight.
beta (float) – The beta hyperparameter of the decoder. Word insertion weight.

Returns

Zero on success, non-zero on failure.

Type

int
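
The scorer methods report failure in two different ways: `enableExternalScorer` raises, while `setScorerAlphaBeta` returns a status code. The sketch below, with a hypothetical helper name `configure_scorer`, turns both into exceptions so a caller has a single failure path to handle.

```python
def configure_scorer(model, scorer_path, alpha, beta):
    """Enable an external scorer, then set its alpha and beta weights."""
    # enableExternalScorer signals failure by raising RuntimeError.
    model.enableExternalScorer(scorer_path)
    # setScorerAlphaBeta signals failure through a non-zero return value.
    status = model.setScorerAlphaBeta(alpha, beta)
    if status != 0:
        raise RuntimeError(f"setScorerAlphaBeta failed with status {status}")
```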

stt(audio_buffer)[source]¶

Use the Coqui STT model to perform Speech-To-Text.

Parameters: audio_buffer (numpy.int16 array) – A 16-bit, mono raw audio signal at the appropriate sample rate (matching what the model was trained on).
Returns: The STT result.
Type: str
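
As an illustration of preparing `audio_buffer`, the sketch below reads a WAV file with the standard `wave` module and checks it against `sampleRate()` before calling `stt()`. The helper names are hypothetical, NumPy is assumed to be installed, and the host is assumed to be little-endian (WAV samples are stored little-endian).

```python
import wave

import numpy as np


def load_wav_int16(path, expected_rate):
    """Read a 16-bit mono WAV file into a numpy.int16 array."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1:
            raise ValueError("expected mono audio")
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit samples")
        if wav.getframerate() != expected_rate:
            raise ValueError(
                f"expected {expected_rate} Hz, got {wav.getframerate()} Hz"
            )
        frames = wav.readframes(wav.getnframes())
    # Reinterpret the raw bytes as signed 16-bit samples.
    return np.frombuffer(frames, dtype=np.int16)


def transcribe_wav(model, path):
    """Transcribe a WAV file whose format matches what the model expects."""
    audio = load_wav_int16(path, model.sampleRate())
    return model.stt(audio)
```

Rejecting a mismatched file is a deliberate choice here; resampling is outside the scope of this sketch.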

sttWithMetadata(audio_buffer, num_results=1)[source]¶

Use the Coqui STT model to perform Speech-To-Text and return results including metadata.

Parameters

audio_buffer (numpy.int16 array) – A 16-bit, mono raw audio signal at the appropriate sample rate (matching what the model was trained on).
num_results (int) – Maximum number of candidate transcripts to return. Returned list might be smaller than this.

Returns

Metadata object containing multiple candidate transcripts. Each transcript has per-token metadata including timing information.

Type

Metadata()
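
A sketch of turning the returned Metadata object into plain `(text, confidence)` pairs. This reference lists `transcripts`, `tokens`, `text` and `confidence` as methods, but some builds of the bindings may expose them as plain attributes; that is an assumption, so the hypothetical `_value` helper accepts either form. The explicit sort is also a precaution, since the ordering of candidates is not stated here.

```python
def _value(obj, name):
    """Read `name` from `obj`, calling it first if it is a method."""
    attr = getattr(obj, name)
    return attr() if callable(attr) else attr


def ranked_candidates(metadata):
    """Return (text, confidence) pairs, highest confidence first."""
    results = []
    for transcript in _value(metadata, "transcripts"):
        # Each token holds one character; join them to rebuild the text.
        text = "".join(
            _value(token, "text") for token in _value(transcript, "tokens")
        )
        results.append((text, _value(transcript, "confidence")))
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results
```

A typical call would be `ranked_candidates(model.sttWithMetadata(audio, 3))`.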

createStream()[source]¶

Create a new streaming inference state. Audio can then be fed to the returned Stream object with feedAudioContent(), and the final result obtained with finishStream().

Returns: Stream object representing the newly created stream
Type: Stream()
Throws: RuntimeError on error

Stream¶

class Stream(native_stream)[source]¶

Class wrapping an STT stream. The constructor cannot be called directly; use Model.createStream() instead.

feedAudioContent(audio_buffer)[source]¶

Feed audio samples to an ongoing streaming inference.

Parameters: audio_buffer (numpy.int16 array) – A 16-bit, mono raw audio signal at the appropriate sample rate (matching what the model was trained on).
Throws: RuntimeError if the stream object is not valid

intermediateDecode()[source]¶

Compute the intermediate decoding of an ongoing streaming inference.

Returns: The STT intermediate result.
Type: str
Throws: RuntimeError if the stream object is not valid

intermediateDecodeWithMetadata(num_results=1)[source]¶

Compute the intermediate decoding of an ongoing streaming inference and return results including metadata.

Parameters: num_results (int) – Maximum number of candidate transcripts to return. Returned list might be smaller than this.
Returns: Metadata object containing multiple candidate transcripts. Each transcript has per-token metadata including timing information.
Type: Metadata()
Throws: RuntimeError if the stream object is not valid

intermediateDecodeFlushBuffers()[source]¶

EXPERIMENTAL: Compute the intermediate decoding of an ongoing streaming inference, flushing buffers first. This ensures that all audio that has been streamed so far is included in the result, but is more expensive than intermediateDecode() because buffers are processed through the acoustic model.

Returns: The STT intermediate result.
Type: str
Throws: RuntimeError if the stream object is not valid

intermediateDecodeWithMetadataFlushBuffers(num_results=1)[source]¶

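EXPERIMENTAL: Compute the intermediate decoding of an ongoing streaming inference, flushing buffers first, and return results including metadata. This ensures that all audio that has been streamed so far is included in the result, but is more expensive than intermediateDecodeWithMetadata() because buffers are processed through the acoustic model.

Parameters: num_results (int) – Maximum number of candidate transcripts to return. Returned list might be smaller than this.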
Returns: Metadata object containing multiple candidate transcripts. Each transcript has per-token metadata including timing information.
Type: Metadata()
Throws: RuntimeError if the stream object is not valid

finishStream()[source]¶

Compute the final decoding of an ongoing streaming inference and return the result. Signals the end of an ongoing streaming inference. The underlying stream object must not be used after this method is called.

Returns: The STT result.
Type: str
Throws: RuntimeError if the stream object is not valid
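
The streaming methods above can be combined as in the following sketch, which feeds audio in fixed-size chunks and reports the intermediate transcript after each one. The function name, `chunk_size` and the `on_partial` callback are hypothetical; `audio` is assumed to be a sliceable buffer such as a numpy.int16 array.

```python
def stream_with_partials(model, audio, chunk_size, on_partial=None):
    """Feed `audio` to a new stream in chunks and return the final text.

    `on_partial`, if given, is called with the intermediate transcript
    after every chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    stream = model.createStream()
    for start in range(0, len(audio), chunk_size):
        stream.feedAudioContent(audio[start:start + chunk_size])
        if on_partial is not None:
            on_partial(stream.intermediateDecode())
    # finishStream ends the inference; the stream must not be reused.
    return stream.finishStream()
```

Passing `print` as `on_partial` is enough to watch the transcript grow as audio arrives.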

finishStreamWithMetadata(num_results=1)[source]¶

Compute the final decoding of an ongoing streaming inference and return results including metadata. Signals the end of an ongoing streaming inference. The underlying stream object must not be used after this method is called.

Parameters: num_results (int) – Maximum number of candidate transcripts to return. Returned list might be smaller than this.
Returns: Metadata object containing multiple candidate transcripts. Each transcript has per-token metadata including timing information.
Type: Metadata()
Throws: RuntimeError if the stream object is not valid

freeStream()[source]¶

Destroy a streaming state without decoding the computed logits. This can be used if you no longer need the result of an ongoing streaming inference.

Throws: RuntimeError if the stream object is not valid
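
One situation where freeStream() is useful is an audio source that fails partway through. The sketch below, with a hypothetical function name and a `chunks` iterable of audio buffers, discards the stream state on any error and otherwise finishes the stream normally.

```python
def decode_chunks(model, chunks):
    """Feed every buffer in `chunks` to a new stream and return the text.

    If feeding fails (or the iterable itself raises), the stream is freed
    without decoding and the original exception is re-raised.
    """
    stream = model.createStream()
    try:
        for chunk in chunks:
            stream.feedAudioContent(chunk)
    except BaseException:
        # The result is no longer needed, so discard the stream state.
        stream.freeStream()
        raise
    return stream.finishStream()
```

freeStream() is only reached before finishStream() has been called, so the stream is never touched after it has been finished.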

Metadata¶

class Metadata[source]¶

transcripts()[source]¶

List of candidate transcripts

Returns: A list of CandidateTranscript() objects
Type: list

CandidateTranscript¶

class CandidateTranscript[source]¶

Stores the entire CTC output as an array of character metadata objects

tokens()[source]¶

List of tokens

Returns: A list of TokenMetadata() elements
Type: list

confidence()[source]¶: Approximated confidence value for this transcription. This is roughly the sum of the acoustic model logit values for each timestep/character that contributed to the creation of this transcription.

TokenMetadata¶

class TokenMetadata[source]¶

Stores each individual character, along with its timing information

text()[source]¶: The text for this token

timestep()[source]¶: Position of the token in units of 20 ms

start_time()[source]¶: Position of the token in seconds
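
Because each token is a single character, word-level timing has to be assembled by the caller. The sketch below groups tokens into `(word, start_time)` pairs, assuming words are separated by tokens whose text is a single space. The helper names are hypothetical, and `_value` accepts `text` and `start_time` either as methods (as listed here) or as plain attributes.

```python
def _value(obj, name):
    """Read `name` from `obj`, calling it first if it is a method."""
    attr = getattr(obj, name)
    return attr() if callable(attr) else attr


def words_with_start_times(tokens):
    """Group per-character tokens into (word, start_time) pairs."""
    words, chars, start = [], [], None
    for token in tokens:
        text = _value(token, "text")
        if text == " ":
            # A space closes the word collected so far, if any.
            if chars:
                words.append(("".join(chars), start))
            chars, start = [], None
            continue
        if not chars:
            # The first character of a word fixes the word's start time.
            start = _value(token, "start_time")
        chars.append(text)
    if chars:
        words.append(("".join(chars), start))
    return words
```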