Language Model
The Language Model classifies speech according to the language that is being spoken in a given segment.
The Language Model is currently available in Early Access. If you wish to try it out, contact Support at support@oto.ai.
The following languages are currently supported:
- English (`en`)
- French (`fr`)
- German (`de`)
- Italian (`it`)
- Spanish (`es`)
As this model is only applicable to audio segments that contain speech, it is combined with the SpeechRT and Volume
models to increase the reliability of the results. If the confidence in the classification result is not satisfactory,
the label `unknown` is provided.
The receptive field of this model is 2107 milliseconds.
Specification
Receptive Field | Result Type |
---|---|
2107 ms | result ∈ ["de", "en", "es", "fr", "it", "unknown", "no_speech", "silence"] |
Time-series
The time-series result will be an iterable with elements that contain the following information:
{
"timestamp": 0,
"results":{
"language": {
"result": "en",
"confidence": 0.92
}
}
}
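As a minimal sketch of how such a time-series might be consumed, the Python snippet below keeps only the elements that carry a language label with sufficient confidence. The helper name `confident_languages`, the `min_confidence` threshold of 0.8, and the sample timestamps and values are illustrative assumptions, not part of the SDK.

```python
# Language codes listed as supported in this document.
LANGUAGE_CODES = {"de", "en", "es", "fr", "it"}


def confident_languages(time_series, min_confidence=0.8):
    """Yield (timestamp, language) for elements with a confident language label.

    Elements labelled "unknown", "no_speech" or "silence" are skipped.
    The default threshold is an arbitrary choice for illustration.
    """
    for element in time_series:
        language = element["results"]["language"]
        if (
            language["result"] in LANGUAGE_CODES
            and language["confidence"] >= min_confidence
        ):
            yield element["timestamp"], language["result"]


# Hypothetical sample data shaped like the element shown above.
sample = [
    {"timestamp": 0, "results": {"language": {"result": "en", "confidence": 0.92}}},
    {"timestamp": 1024, "results": {"language": {"result": "unknown", "confidence": 0.41}}},
    {"timestamp": 2048, "results": {"language": {"result": "fr", "confidence": 0.65}}},
]

print(list(confident_languages(sample)))  # [(0, 'en')]
```

Lowering the threshold, for example `confident_languages(sample, min_confidence=0.6)`, would also let the French element through.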
Time-series with raw values
If raw values were requested, they will be added to the time-series result:
{
"timestamp": 0,
"results":{
"language": {
"result": "en",
"confidence": 0.92
}
},
"raw": {
"language": {
"de": 0.01,
"en": 0.92,
"es": 0.01,
"fr": 0.02,
"it": 0.02
}
}
}
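One possible use of the raw values is to check how clearly the top language stands out from the runner-up. The sketch below ranks the per-language scores of a single element; the helper name `rank_languages` is hypothetical, and the element is the example shown above.

```python
def rank_languages(element):
    """Return (language, score) pairs from the raw values, highest score first."""
    raw = element["raw"]["language"]
    return sorted(raw.items(), key=lambda item: item[1], reverse=True)


# Element copied from the example above.
element = {
    "timestamp": 0,
    "results": {"language": {"result": "en", "confidence": 0.92}},
    "raw": {
        "language": {"de": 0.01, "en": 0.92, "es": 0.01, "fr": 0.02, "it": 0.02}
    },
}

ranking = rank_languages(element)
best, runner_up = ranking[0], ranking[1]
# Gap between the best and second-best score; a small gap suggests ambiguity.
margin = best[1] - runner_up[1]

print(best[0], round(margin, 2))  # en 0.9
```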
Summary
If a summary is requested, the following will be returned:
{
"language": {
"de_fraction": 0.4747,
"en_fraction": 0.0303,
"es_fraction": 0.0,
"fr_fraction": 0.0202,
"it_fraction": 0.0,
"no_speech_fraction": 0.3535,
"unknown_fraction": 0.1212,
"silence_fraction": 0.0
}
}
where x_fraction represents the fraction of the input duration (a value between 0 and 1) during which class x was identified.
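As an illustration of how the summary might be interpreted, the sketch below finds the dominant language and its share of the time in which any supported language was identified, ignoring the `no_speech`, `unknown` and `silence` fractions. The helper name `dominant_language` is an assumption for this example; the input is the summary shown above.

```python
# Language codes listed as supported in this document.
LANGUAGE_CODES = {"de", "en", "es", "fr", "it"}
SUFFIX = "_fraction"


def dominant_language(summary):
    """Return (language, share) where share is relative to identified-language time.

    Returns (None, 0.0) if no supported language was identified at all.
    """
    languages = {}
    for key, value in summary["language"].items():
        label = key[: -len(SUFFIX)] if key.endswith(SUFFIX) else key
        if label in LANGUAGE_CODES:
            languages[label] = value

    total = sum(languages.values())
    if total == 0:
        return None, 0.0

    best = max(languages, key=languages.get)
    return best, languages[best] / total


# Summary copied from the example above.
summary = {
    "language": {
        "de_fraction": 0.4747,
        "en_fraction": 0.0303,
        "es_fraction": 0.0,
        "fr_fraction": 0.0202,
        "it_fraction": 0.0,
        "no_speech_fraction": 0.3535,
        "unknown_fraction": 0.1212,
        "silence_fraction": 0.0,
    }
}

language, share = dominant_language(summary)
print(language, round(share, 3))  # de 0.904
```

In this example German accounts for about 47% of the whole input but roughly 90% of the time in which a supported language was identified.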
Transitions
If transitions are requested, a time-series of transition elements like the ones shown below will be returned:
{
"timestamp_start": 0,
"timestamp_end": 1500,
"result": "no_speech",
"confidence": 0.96
},
{
"timestamp_start": 1500,
"timestamp_end": 4000,
"result": "en",
"confidence": 0.88
}
The example above means that the first 1500ms of the audio snippet contained no speech, and between 1500ms and 4000ms DeepTone™ detected that English was being spoken.
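The transition elements can be aggregated into a total duration per label. The following is a minimal sketch under the assumption that the transitions are available as a list of dictionaries shaped like the example above; the helper name `duration_per_label` is hypothetical.

```python
from collections import defaultdict


def duration_per_label(transitions):
    """Sum the duration in milliseconds of all transitions for each label."""
    totals = defaultdict(int)
    for transition in transitions:
        duration = transition["timestamp_end"] - transition["timestamp_start"]
        totals[transition["result"]] += duration
    return dict(totals)


# Transitions copied from the example above.
transitions = [
    {"timestamp_start": 0, "timestamp_end": 1500, "result": "no_speech", "confidence": 0.96},
    {"timestamp_start": 1500, "timestamp_end": 4000, "result": "en", "confidence": 0.88},
]

print(duration_per_label(transitions))  # {'no_speech': 1500, 'en': 2500}
```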