AudioEvent Model

The AudioEvent model can classify audio into one of the following categories:

  • nat_human_sounds - Natural human sounds like breathing, sighing, sneezing or coughing
  • neg_human_sounds - Negative human sounds like crying, screaming, yelling
  • pos_human_sounds - Positive human sounds like laughing, giggling or snickering
  • speech - Speech segments
  • music - Music segments
  • noise - Any other noise such as office background noise

This model is combined with the Volume model to increase the reliability of the results.

Note that the human-sounds example lists above are not exhaustive; other types of sounds produced by humans can also be classified into one of these categories. If detecting human sounds is not relevant for your specific use case, we recommend using the Speech model to detect speech/music/no_speech, as its reliability is higher for that specific domain.

The receptive field of this model is 1082 milliseconds.

Specification

Receptive Field: 1082.0 ms
Result Type: result ∈ ["nat_human_sounds", "neg_human_sounds", "pos_human_sounds", "speech", "music", "noise", "silence"]

Time-series

The time-series result will be an iterable with elements that contain the following information:

{
  "timestamp": 0,
  "results": {
    "audio-event": {
      "result": "pos_human_sounds",
      "confidence": 0.7447
    }
  }
}
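A minimal sketch of iterating over such time-series elements in Python. The `timeseries` list below mirrors the documented element shape; the variable names are illustrative and not part of any SDK.

```python
# Each time-series element carries a timestamp plus the AudioEvent
# classification for that window.
timeseries = [
    {"timestamp": 0,
     "results": {"audio-event": {"result": "pos_human_sounds",
                                 "confidence": 0.7447}}},
    {"timestamp": 1082,
     "results": {"audio-event": {"result": "speech",
                                 "confidence": 0.9011}}},
]

for element in timeseries:
    event = element["results"]["audio-event"]
    # e.g. "0 pos_human_sounds 0.7447"
    print(element["timestamp"], event["result"], event["confidence"])
```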

Time-series with raw values

If raw values are requested, they are added to the time-series results. Note that for this particular model the raw values do not have to add up to one: each class is predicted independently, so classes can overlap. This follows from the nature of what the model predicts; for example, someone may speak while laughing, in which case the model assigns a high value to both speech and pos_human_sounds.

{
  "timestamp": 0,
  "results": {
    "audio-event": {
      "result": "nat_human_sounds",
      "confidence": 0.7946
    }
  },
  "raw": {
    "audio-event": {
      "music": 0.0103,
      "nat_human_sounds": 0.7946,
      "neg_human_sounds": 0.065,
      "noise": 0.0482,
      "pos_human_sounds": 0.0606,
      "speech": 0.6212
    }
  }
}
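Because the raw scores are independent rather than a softmax, more than one class can be active in the same window. A simple way to surface co-occurring classes is to threshold the raw scores; the 0.5 threshold below is an illustrative choice, not a documented default.

```python
# Raw scores from the example above: note that they sum to more than 1.
raw = {
    "music": 0.0103,
    "nat_human_sounds": 0.7946,
    "neg_human_sounds": 0.065,
    "noise": 0.0482,
    "pos_human_sounds": 0.0606,
    "speech": 0.6212,
}

# Classes whose independent score clears the (illustrative) threshold.
active = sorted(cls for cls, score in raw.items() if score >= 0.5)

# Both nat_human_sounds and speech are active here, e.g. someone
# speaking while coughing.
print(active)  # ['nat_human_sounds', 'speech']
```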

Summary

In case a summary is requested, the following will be returned:

{
  "audio-event": {
    "music_fraction": 0.0,
    "nat_human_sounds_fraction": 0.7097,
    "neg_human_sounds_fraction": 0.0,
    "noise_fraction": 0.0,
    "pos_human_sounds_fraction": 0.0,
    "speech_fraction": 0.0,
    "silence_fraction": 0.2903
  }
}

where x_fraction represents the fraction of the input's duration for which class x was identified (a value between 0 and 1).
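A short sketch of reading the summary. The `summary` dictionary mirrors the documented shape; picking the dominant class via the largest fraction is an illustrative pattern, not a documented API call. In this example the fractions cover the whole input, so they sum to one.

```python
# Summary fractions from the example above.
summary = {
    "audio-event": {
        "music_fraction": 0.0,
        "nat_human_sounds_fraction": 0.7097,
        "neg_human_sounds_fraction": 0.0,
        "noise_fraction": 0.0,
        "pos_human_sounds_fraction": 0.0,
        "speech_fraction": 0.0,
        "silence_fraction": 0.2903,
    }
}

fractions = summary["audio-event"]

# The key with the largest fraction is the class that dominated the input.
dominant = max(fractions, key=fractions.get)
print(dominant.removesuffix("_fraction"))  # nat_human_sounds
```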

Transitions

In case transitions are requested, a time-series with transition elements, as shown below, will be returned.

[
  {
    "timestamp_start": 0,
    "timestamp_end": 128,
    "result": "nat_human_sounds",
    "confidence": 0.7359
  },
  {
    "timestamp_start": 128,
    "timestamp_end": 320,
    "result": "no_speech",
    "confidence": 0.5283
  }
]

The example above means that the first 128 ms of the audio snippet contained human sounds that could be classified as natural (such as coughing or sneezing), and that between 128 ms and 320 ms no speech was detected.
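Since each transition element carries its start and end timestamps, per-class durations fall out directly. A minimal sketch, assuming the documented field names; the `transitions` list itself is illustrative.

```python
# Transition elements from the example above (timestamps in ms).
transitions = [
    {"timestamp_start": 0, "timestamp_end": 128,
     "result": "nat_human_sounds", "confidence": 0.7359},
    {"timestamp_start": 128, "timestamp_end": 320,
     "result": "no_speech", "confidence": 0.5283},
]

# Accumulate total duration (ms) per detected class.
durations = {}
for seg in transitions:
    length = seg["timestamp_end"] - seg["timestamp_start"]
    durations[seg["result"]] = durations.get(seg["result"], 0) + length

print(durations)  # {'nat_human_sounds': 128, 'no_speech': 192}
```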