UnderageSpeaker Model
The UnderageSpeaker model classifies speech as "child", "adult" or "unknown". It is combined with the SpeechRT and Volume models to increase the reliability of the results. If the model cannot reliably classify the speech as "child" or "adult", the speech is classified as "unknown". Segments in which no speech is detected are reported as "no_speech", and silent segments as "silence". Note that a "child" is defined as a pre-adolescent speaker.
The receptive field of this model is 570.5 milliseconds.
Specification
Receptive Field | Result Type |
---|---|
570.5 ms | result ∈ ["child", "adult", "unknown", "no_speech", "silence"] |
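The result type in the table is a closed set of labels, so client code can mirror it and reject unexpected values when parsing responses. The following is a minimal Python sketch; `UnderageResult`, `ALLOWED_RESULTS` and `is_valid_result` are hypothetical names, not part of the API:

```python
from typing import Literal, get_args

# Result labels listed in the specification table (hypothetical client-side type).
UnderageResult = Literal["child", "adult", "unknown", "no_speech", "silence"]

# Derive the allowed set from the Literal so the two definitions cannot drift apart.
ALLOWED_RESULTS = frozenset(get_args(UnderageResult))


def is_valid_result(label: str) -> bool:
    """Return True if `label` is one of the documented UnderageSpeaker results."""
    return label in ALLOWED_RESULTS
```

For example, `is_valid_result("no_speech")` is `True`, while an undocumented label such as `"teen"` is rejected.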
Time-series
The time-series result is an iterable whose elements have the following structure:
{
"timestamp": 0,
"results":{
"underage-speaker": {
"result": "adult",
"confidence": 0.7447
}
}
}
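To consume the time-series, a client might walk the elements and keep only confident "child"/"adult" predictions. The following is a minimal Python sketch, assuming the response has already been parsed into a list of dicts shaped like the element above; `speaker_segments` and its `min_confidence` threshold are hypothetical and not part of the API:

```python
from typing import Any, Iterable


def speaker_segments(
    elements: Iterable[dict[str, Any]],
    min_confidence: float = 0.5,  # hypothetical threshold, not an API default
) -> list[tuple[int, str, float]]:
    """Collect (timestamp, result, confidence) for confident 'child'/'adult' elements.

    Assumes element["results"]["underage-speaker"] holds "result" and "confidence".
    """
    segments = []
    for element in elements:
        prediction = element["results"]["underage-speaker"]
        if prediction["result"] in ("child", "adult") and prediction["confidence"] >= min_confidence:
            segments.append((element["timestamp"], prediction["result"], prediction["confidence"]))
    return segments


example = [
    {"timestamp": 0, "results": {"underage-speaker": {"result": "adult", "confidence": 0.7447}}},
]
print(speaker_segments(example))  # [(0, 'adult', 0.7447)]
```

Raising `min_confidence` trades coverage for reliability; with a threshold of 0.8 the example element would be dropped.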
Time-series with raw values
If raw values are requested, each time-series element additionally contains a "raw" entry:
{
"timestamp": 0,
"results":{
"underage-speaker": {
"result": "adult",
"confidence": 0.7447
}
},
"raw": {
"underage-speaker": {
"child": 0.1276,
"adult": 0.8724
}
}
}
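The raw scores can be pulled out per timestamp for further analysis. The following minimal Python sketch assumes the raw values are keyed by "underage-speaker", as in the model's other responses, and contain "child" and "adult" scores; `raw_scores` is a hypothetical helper, not part of the API:

```python
from typing import Any, Iterable


def raw_scores(elements: Iterable[dict[str, Any]]) -> list[tuple[int, float, float]]:
    """Return (timestamp, child, adult) raw scores for each time-series element.

    Elements without a "raw" entry (raw values not requested) are skipped.
    """
    scores = []
    for element in elements:
        raw = element.get("raw", {}).get("underage-speaker")
        if raw is None:
            continue
        scores.append((element["timestamp"], raw["child"], raw["adult"]))
    return scores
```

Note that the raw scores and the reported "confidence" are separate fields; in the example above they differ (0.8724 vs. 0.7447), so the sketch does not try to derive one from the other.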
Summary
If a summary is requested, the following is returned:
{
"underage-speaker": {
"adult_fraction": 0.4274,
"child_fraction": 0.4359,
"unknown_fraction": 0.0256,
"no_speech_fraction": 0.1111,
"silence_fraction": 0.0
}
}
where x_fraction is the fraction (between 0 and 1) of the input's duration during which class x was identified. In the example above, the five fractions sum to 1.
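To illustrate how the fractions relate to per-element results, the summary can be approximated from a time-series. This Python sketch assumes that every time-series element covers an equally long time step, which is an assumption rather than documented behaviour; the service's own summary may be computed differently. `summary_fractions` is a hypothetical helper:

```python
from collections import Counter
from typing import Any, Iterable

CLASSES = ("adult", "child", "unknown", "no_speech", "silence")


def summary_fractions(elements: Iterable[dict[str, Any]]) -> dict[str, float]:
    """Approximate the summary as the share of elements carrying each result.

    Assumes all elements cover equally long time steps.
    """
    counts = Counter(e["results"]["underage-speaker"]["result"] for e in elements)
    total = sum(counts.values())
    if total == 0:
        return {f"{c}_fraction": 0.0 for c in CLASSES}
    return {f"{c}_fraction": counts[c] / total for c in CLASSES}
```

For four elements labeled "adult", "adult", "child" and "no_speech", this yields 0.5 for adult, 0.25 for child and no_speech, and 0.0 for the remaining classes.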
Transitions
If transitions are requested, a time-series of transition elements like the ones below is returned:
[
{
"timestamp_start": 0,
"timestamp_end": 128,
"result": "adult",
"confidence": 0.7359
},
{
"timestamp_start": 128,
"timestamp_end": 320,
"result": "no_speech",
"confidence": 0.5283
}
]
The example above means that the first 128 ms of the audio contained speech by one or more adult speakers, and that no speech was detected between 128 ms and 320 ms.
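The same reading can be computed from the transition elements by summing each segment's length per result. A minimal Python sketch, assuming the transitions have been parsed into a list of dicts with millisecond timestamps; `durations_by_class` is a hypothetical helper:

```python
from collections import defaultdict
from typing import Any, Iterable


def durations_by_class(transitions: Iterable[dict[str, Any]]) -> dict[str, int]:
    """Sum the milliseconds covered by each result across transition elements."""
    totals: dict[str, int] = defaultdict(int)
    for t in transitions:
        totals[t["result"]] += t["timestamp_end"] - t["timestamp_start"]
    return dict(totals)


transitions = [
    {"timestamp_start": 0, "timestamp_end": 128, "result": "adult", "confidence": 0.7359},
    {"timestamp_start": 128, "timestamp_end": 320, "result": "no_speech", "confidence": 0.5283},
]
print(durations_by_class(transitions))  # {'adult': 128, 'no_speech': 192}
```

Assuming the transitions cover the whole input, dividing each total by the overall duration (here 320 ms) gives per-class fractions comparable to those in the summary.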