AudioEvent Model
The AudioEvent model can classify audio into one of the following categories:
- `nat_human_sounds`: Natural human sounds like breathing, sighing, sneezing or coughing
- `neg_human_sounds`: Negative human sounds like crying, screaming, yelling
- `pos_human_sounds`: Positive human sounds like laughing, giggling or snickering
- `speech`: Speech segments
- `music`: Music segments
- `noise`: Any other noise such as office background noise
This model is combined with the Volume model to increase the reliability of the results.
Note that the list of human-sound examples above is not exhaustive; other types of sounds
produced by humans can also be classified into one of the categories.
If, for your specific use case, detecting human sounds is not relevant, we recommend
using the Speech model to detect `speech`/`music`/`no_speech`, as its reliability is
higher for that specific domain.
The receptive field of this model is 1082 milliseconds.
Specification
| Receptive Field | Result Type |
|---|---|
| 1082.0 ms | result ∈ ["nat_human_sounds", "neg_human_sounds", "pos_human_sounds", "speech", "music", "noise", "silence"] |
Time-series
The time-series result will be an iterable with elements that contain the following information:
```json
{
    "timestamp": 0,
    "results": {
        "audio-event": {
            "result": "pos_human_sounds",
            "confidence": 0.7447
        }
    }
}
```
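As a minimal sketch, assuming the time-series is exposed as a plain list of dictionaries in the format above (the exact object depends on the SDK you are using), iterating over it could look like this:

```python
# Minimal sketch: walking a time-series result. `time_series` is a
# hypothetical list of dicts in the format shown above.
time_series = [
    {
        "timestamp": 0,
        "results": {
            "audio-event": {"result": "pos_human_sounds", "confidence": 0.7447}
        },
    },
]

for frame in time_series:
    event = frame["results"]["audio-event"]
    print(f'{frame["timestamp"]} ms: {event["result"]} ({event["confidence"]:.2f})')
```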
Time-series with raw values
If the raw values were requested, they will be added to the time-series results.
Note that for this particular model the raw values don't have to add up to one:
each class is scored independently, so classes can overlap.
This follows from the nature of what the model tries to predict; for example,
someone can be speaking while laughing, in which case the model would attribute
a high value to both `speech` and `pos_human_sounds`.
```json
{
    "timestamp": 0,
    "results": {
        "audio-event": {
            "result": "nat_human_sounds",
            "confidence": 0.7946
        }
    },
    "raw": {
        "audio-event": {
            "music": 0.0103,
            "nat_human_sounds": 0.7946,
            "neg_human_sounds": 0.065,
            "noise": 0.0482,
            "pos_human_sounds": 0.0606,
            "speech": 0.6212
        }
    }
}
```
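Because the raw scores are independent per class, co-occurring events can be detected by thresholding each class separately. The sketch below uses the raw values from the example above; the 0.5 cut-off is an assumption for illustration, not an SDK-defined value:

```python
# Minimal sketch: detecting overlapping events (e.g. speech while a
# natural human sound occurs) by thresholding each independent raw score.
raw_scores = {
    "music": 0.0103,
    "nat_human_sounds": 0.7946,
    "neg_human_sounds": 0.065,
    "noise": 0.0482,
    "pos_human_sounds": 0.0606,
    "speech": 0.6212,
}

THRESHOLD = 0.5  # hypothetical cut-off; tune for your use case
active = [cls for cls, score in raw_scores.items() if score >= THRESHOLD]
print(active)  # ['nat_human_sounds', 'speech'] -> two overlapping events
```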
Summary
If a summary is requested, the following will be returned:
```json
{
    "audio-event": {
        "music_fraction": 0.0,
        "nat_human_sounds_fraction": 0.7097,
        "neg_human_sounds_fraction": 0.0,
        "noise_fraction": 0.0,
        "pos_human_sounds_fraction": 0.0,
        "speech_fraction": 0.0,
        "silence_fraction": 0.2903
    }
}
```
where `x_fraction` represents the fraction of the input duration during which class `x` was identified.
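To illustrate how such fractions relate to the time-series, here is a minimal sketch that derives them by counting per-frame results. It assumes equally spaced frames with hypothetical labels; in practice the SDK computes the summary for you:

```python
# Minimal sketch: deriving summary fractions from hypothetical per-frame
# labels. Assumes equally spaced frames; the SDK computes this internally.
from collections import Counter

results = ["nat_human_sounds"] * 22 + ["silence"] * 9  # hypothetical labels
counts = Counter(results)
fractions = {f"{cls}_fraction": round(n / len(results), 4)
             for cls, n in counts.items()}
print(fractions)
# {'nat_human_sounds_fraction': 0.7097, 'silence_fraction': 0.2903}
```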
Transitions
If transitions are requested, a time-series with transition elements like the ones shown below will be returned:
```json
[
    {
        "timestamp_start": 0,
        "timestamp_end": 128,
        "result": "nat_human_sounds",
        "confidence": 0.7359
    },
    {
        "timestamp_start": 128,
        "timestamp_end": 320,
        "result": "no_speech",
        "confidence": 0.5283
    }
]
```
The example above means that the first 128ms of the audio snippet contained human sounds that could be classified as natural (such as coughing or sneezing), and between 128ms and 320ms no speech was detected.
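For intuition, a transition list like the one above can be thought of as a per-frame time-series with consecutive identical results merged. The sketch below illustrates that idea; the 64 ms frame spacing and the averaging of confidences are illustrative assumptions, not the SDK's actual procedure:

```python
# Minimal sketch: collapsing a per-frame time-series into transition
# segments by merging consecutive frames that share the same result.
frames = [  # (timestamp_ms, result, confidence) - hypothetical values
    (0, "nat_human_sounds", 0.74),
    (64, "nat_human_sounds", 0.73),
    (128, "no_speech", 0.52),
    (192, "no_speech", 0.53),
    (256, "no_speech", 0.53),
]
FRAME_MS = 64  # assumed frame spacing

transitions = []
for ts, result, conf in frames:
    if transitions and transitions[-1]["result"] == result:
        # Extend the current segment.
        transitions[-1]["timestamp_end"] = ts + FRAME_MS
        transitions[-1]["confidences"].append(conf)
    else:
        # Start a new segment at this frame.
        transitions.append({
            "timestamp_start": ts,
            "timestamp_end": ts + FRAME_MS,
            "result": result,
            "confidences": [conf],
        })

# Replace the per-frame confidences with their mean.
for seg in transitions:
    confs = seg.pop("confidences")
    seg["confidence"] = round(sum(confs) / len(confs), 4)

print(transitions)
```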