The AudioEvent model can classify audio into one of the following categories:
nat_human_sounds - Natural human sounds like breathing, sighing, sneezing or coughing
neg_human_sounds - Negative human sounds like crying, screaming or yelling
pos_human_sounds - Positive human sounds like laughing, giggling or snickering
speech - Speech segments
music - Music segments
noise - Any other noise such as office background noise
This model is combined with the Volume model to increase the reliability of the results.
Note that the list of human sounds above is not exhaustive; other types of sounds
produced by humans can also be classified into one of these categories.
If detecting human sounds is not relevant for your specific use case, we recommend
using the Speech model to detect no_speech, as its reliability is higher for that
specific domain.
The receptive field of this model is 1082 milliseconds.
|Receptive Field|Result Type|
|---|---|
|1082.0 ms|result ∈ ["nat_human_sounds", "neg_human_sounds", "pos_human_sounds", "speech", "music", "noise", "silence"]|
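To get a feel for what a receptive field of 1082 ms means in terms of audio data, the short sketch below converts it to a number of samples. The 16 kHz sample rate is an assumption chosen for illustration only, and the helper `receptive_field_samples` is a hypothetical name, not part of the model's API.

```python
def receptive_field_samples(receptive_field_ms: float, sample_rate_hz: int) -> int:
    """Convert a receptive field given in milliseconds to a whole number of samples."""
    return round(receptive_field_ms * sample_rate_hz / 1000)


# 16 kHz is an assumed sample rate, used here only as an example.
samples = receptive_field_samples(1082.0, 16_000)
print(samples)
```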
The time-series result will be an iterable with elements that contain the following information:
Time-series with raw values
If the raw values were requested, they will be added to the time-series results.
Note that for this particular model these raw values don't have to add up to one, which
means that each class is independent and there can be overlap.
This follows from what the model tries to predict: for example, someone can be
speaking and laughing at the same time, in which case the model would attribute
a high value to both speech and pos_human_sounds.
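Because the raw values are independent per class, one way to consume them is to treat each value as its own yes/no decision rather than picking a single winner. The sketch below assumes the raw values are available as a plain mapping from class name to score; the mapping layout, the scores, the 0.5 threshold, and the helper `active_classes` are all illustrative assumptions rather than the documented output format.

```python
from typing import Dict, List


def active_classes(raw: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return every class whose raw value reaches the threshold (classes may overlap)."""
    return sorted(name for name, score in raw.items() if score >= threshold)


# Made-up scores for someone speaking while laughing; note they sum to more than one.
raw_values = {
    "nat_human_sounds": 0.10,
    "neg_human_sounds": 0.02,
    "pos_human_sounds": 0.84,
    "speech": 0.91,
    "music": 0.03,
    "noise": 0.05,
}

print(active_classes(raw_values))
```

A threshold-based reading like this keeps overlapping events visible, whereas taking only the highest value would hide the laughter in the example.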
If a summary is requested, it reports an x_fraction value for each class,
where x_fraction represents the fraction of the input's duration during which class x was identified.
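The idea behind the x_fraction values can be sketched as a simple count over the time-series. The snippet assumes the time-series has already been reduced to a list of class labels, one per evenly spaced step; that input shape and the `summarize` helper are assumptions for illustration, not the library's own implementation.

```python
from collections import Counter
from typing import Dict, Sequence

CLASSES = (
    "nat_human_sounds", "neg_human_sounds", "pos_human_sounds",
    "speech", "music", "noise", "silence",
)


def summarize(results: Sequence[str]) -> Dict[str, float]:
    """Compute, for each class, the share of evenly spaced steps assigned to it."""
    if not results:
        raise ValueError("cannot summarize an empty result series")
    counts = Counter(results)
    return {f"{label}_fraction": counts[label] / len(results) for label in CLASSES}


# Four equally long steps: half speech, a quarter music, a quarter noise.
summary = summarize(["speech", "speech", "music", "noise"])
print(summary["speech_fraction"], summary["music_fraction"])
```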
If transitions are requested, a time-series of transition elements is returned.
For example, such a result might indicate that the first 128 ms of the audio snippet contained human sounds that could be classified as natural (such as coughing or sneezing), and that no speech was detected between 128 ms and 320 ms.
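Conceptually, transitions are the time-series with consecutive identical results merged into segments. The sketch below shows that merging under several assumptions: the input is a list of `(timestamp_ms, result)` pairs, each step covers 64 ms, the second segment is labelled "silence", and the output keys `timestamp_start`, `timestamp_end` and `result` are hypothetical names rather than the documented field names.

```python
from typing import Dict, List, Sequence, Tuple, Union

Transition = Dict[str, Union[int, str]]


def to_transitions(series: Sequence[Tuple[int, str]], step_ms: int) -> List[Transition]:
    """Merge consecutive steps that share the same result into single segments."""
    transitions: List[Transition] = []
    for timestamp, result in series:
        if transitions and transitions[-1]["result"] == result:
            # Same class as the previous step: extend the current segment.
            transitions[-1]["timestamp_end"] = timestamp + step_ms
        else:
            # Class changed (or first step): open a new segment.
            transitions.append({
                "timestamp_start": timestamp,
                "timestamp_end": timestamp + step_ms,
                "result": result,
            })
    return transitions


# Assumed 64 ms steps: two steps of natural human sounds, then three of silence.
series = [
    (0, "nat_human_sounds"), (64, "nat_human_sounds"),
    (128, "silence"), (192, "silence"), (256, "silence"),
]
transitions = to_transitions(series, step_ms=64)
print(transitions)
```

With this input, the two resulting segments span 0 to 128 ms and 128 to 320 ms, matching the boundaries in the description above.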