SpeechRT Model

The SpeechRT model classifies audio into "speech", "no-speech" or "silence". The "silence" class marks audio fragments with very low volume, as determined by the requested volume threshold. Compared to the Speech model, the SpeechRT model reacts to changes in the audio much faster. The Speech model remains our default speech detection model, as it offers the best tradeoff between detection latency and precision.

The receptive field of this model is 130ms.

Specification

Receptive Field: 130ms
Result Type: result ∈ ["speech", "no-speech", "silence"]

Time-series

The time-series result will be an iterable with elements that contain the following information:

{
  "timestamp": 0,
  "results": {
    "speech-rt": {
      "result": "speech",
      "confidence": 0.92
    }
  }
}
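
As a rough illustration of how such elements might be consumed, the sketch below walks a time-series and collects the timestamps classified as "speech". It assumes the time-series is available as plain Python dictionaries with the structure shown above; the function name `speech_timestamps`, the `min_confidence` parameter, and the sample timestamps 64 and 128 are hypothetical and not part of the library.

```python
def speech_timestamps(time_series, model="speech-rt", min_confidence=0.5):
    """Return the timestamps of elements classified as speech.

    Assumes each element has the shape shown above:
    {"timestamp": ..., "results": {model: {"result": ..., "confidence": ...}}}
    """
    hits = []
    for element in time_series:
        prediction = element["results"][model]
        if (prediction["result"] == "speech"
                and prediction["confidence"] >= min_confidence):
            hits.append(element["timestamp"])
    return hits


# Hypothetical sample data; the timestamps and confidences are made up.
example_series = [
    {"timestamp": 0, "results": {"speech-rt": {"result": "speech", "confidence": 0.92}}},
    {"timestamp": 64, "results": {"speech-rt": {"result": "no-speech", "confidence": 0.81}}},
    {"timestamp": 128, "results": {"speech-rt": {"result": "speech", "confidence": 0.40}}},
]

print(speech_timestamps(example_series))  # [0]
```

The third element is dropped because its confidence falls below the illustrative 0.5 cutoff.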

Time-series with raw values

If raw values were requested, they will be added to the time-series result:

{
  "timestamp": 0,
  "results": {
    "speech-rt": {
      "result": "speech",
      "confidence": 0.92
    }
  },
  "raw": {
    "speech-rt": {
      "no-speech": 0.08,
      "speech": 0.92
    }
  }
}
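
One possible use of the raw values is to apply a stricter decision threshold than the default classification. The sketch below assumes an element shaped like the one above, held as a plain Python dictionary; the helper name `is_confident_speech` and the 0.9 threshold are illustrative assumptions, not part of the library.

```python
def is_confident_speech(element, threshold=0.9, model="speech-rt"):
    """Return True if the raw "speech" score reaches the given threshold.

    Raises KeyError if the element carries no raw values for the model,
    which would be the case when raw values were not requested.
    """
    raw = element.get("raw", {}).get(model)
    if raw is None:
        raise KeyError(f"no raw values for model {model!r}; were they requested?")
    return raw["speech"] >= threshold


# Element copied from the example above.
element = {
    "timestamp": 0,
    "results": {"speech-rt": {"result": "speech", "confidence": 0.92}},
    "raw": {"speech-rt": {"no-speech": 0.08, "speech": 0.92}},
}

print(is_confident_speech(element))                  # True
print(is_confident_speech(element, threshold=0.95))  # False
```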

Summary

If a summary is requested, the following will be returned:

{
  "speech": {
    "speech_fraction": 0.30,
    "no-speech_fraction": 0.70,
    "silence_fraction": 0.0
  }
}

where x_fraction is the fraction (between 0 and 1) of the input's duration during which class x was identified.
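
To make the meaning of these fractions concrete, the sketch below derives comparable values from a time-series. It assumes the elements are equally spaced in time, so that each one represents the same share of the input; this is an illustration of the arithmetic only and not necessarily how the library computes its summary. The function name `class_fractions` is hypothetical.

```python
from collections import Counter

CLASSES = ("speech", "no-speech", "silence")


def class_fractions(time_series, model="speech-rt"):
    """Fraction of elements assigned to each class.

    Assumes equally spaced elements, so counting elements is
    equivalent to measuring time.
    """
    counts = Counter(e["results"][model]["result"] for e in time_series)
    total = sum(counts.values())
    if total == 0:
        return {f"{name}_fraction": 0.0 for name in CLASSES}
    return {f"{name}_fraction": counts[name] / total for name in CLASSES}


# Hypothetical series: 3 "speech" elements followed by 7 "no-speech" elements.
labels = ["speech"] * 3 + ["no-speech"] * 7
series = [
    {"timestamp": i, "results": {"speech-rt": {"result": label, "confidence": 0.9}}}
    for i, label in enumerate(labels)
]

fractions = class_fractions(series)
print(fractions)
```

With this sample input, 3 of 10 elements are "speech", which corresponds to a speech_fraction of 0.30 as in the summary above.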

Transitions

If transitions are requested, a time-series with the following transition elements will be returned:

{
  "timestamp_start": 0,
  "timestamp_end": 1500,
  "result": "speech",
  "confidence": 0.96
},
{
  "timestamp_start": 1500,
  "timestamp_end": 6000,
  "result": "no-speech",
  "confidence": 0.86
}

The result above means that the first 1500ms of the audio snippet contained speech, and between 1500ms and 6000ms of the audio no speech was detected.
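
Transition elements lend themselves to simple duration bookkeeping. The sketch below sums the time spent in each class, assuming the transitions are available as a list of plain Python dictionaries with the fields shown above and that timestamps are in milliseconds; the function name `duration_per_class` is hypothetical.

```python
def duration_per_class(transitions):
    """Total duration in milliseconds spent in each class."""
    totals = {}
    for transition in transitions:
        length = transition["timestamp_end"] - transition["timestamp_start"]
        totals[transition["result"]] = totals.get(transition["result"], 0) + length
    return totals


# Transition elements copied from the example above.
transitions = [
    {"timestamp_start": 0, "timestamp_end": 1500, "result": "speech", "confidence": 0.96},
    {"timestamp_start": 1500, "timestamp_end": 6000, "result": "no-speech", "confidence": 0.86},
]

print(duration_per_class(transitions))  # {'speech': 1500, 'no-speech': 4500}
```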