Models Overview
The DeepTone™ SDK comes with a set of models, each optimized to give insights into a specific voice property.
Example
The Gender model, for instance, predicts the gender of the voice in a given audio snippet.
Integrated speech and silence detection
Finding voice properties only works if there is a voice to classify in the given audio snippet. This is why all models that give insights into voice are combined with our Volume and Speech models.
Based on the output of the Volume model, the audio can be classified as Silence.
If audio is detected, the Speech model will categorize it into one of three categories:
- Speech
- Music
- Other
Currently, we consider everything that is neither speech nor music to be "other". We will be able to categorize audio into a wider range of categories with our HumanSounds model (coming soon).
While the Speech model is our default speech-detection model, we also provide a real-time model, SpeechRT (in early access). It offers reduced detection latency at the cost of detecting only two classes (speech/no-speech).
Model output
Each of our classifier models will return an iterable that contains a time-series of predictions. Each element of the time-series contains the following information:
- Timestamp - in milliseconds
- Confidence - float ∈ [0,1]
- Model result - depends on the model that is used
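As a sketch of how these elements might be consumed, the snippet below models one prediction and filters the time-series by confidence. The `Prediction` class and the helper are illustrative assumptions, not SDK types.

```python
from dataclasses import dataclass

# Illustrative shape of one time-series element; the field names follow
# the list above, but this concrete class is an assumption, not an SDK type.
@dataclass
class Prediction:
    timestamp: int      # milliseconds from the start of the audio
    confidence: float   # in [0, 1]
    result: str         # model-dependent label, e.g. "speech" or "music"

def high_confidence_labels(predictions, min_confidence=0.8):
    """Keep only the labels the model is reasonably confident about."""
    return [p.result for p in predictions if p.confidence >= min_confidence]
```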
Optionally, a user can also request summary and transitions components in addition to the time-series result when processing a file. Additionally, raw model results can be requested, in which case they will be appended to the time-series.
More information about those can be found in the File Processing section.
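As a rough illustration, a file-processing call requesting all optional components might look like the sketch below. The parameter names (`include_summary`, `include_transitions`, `include_raw`) and the `process_file` signature are assumptions; the File Processing section documents the actual API.

```python
def process_with_optional_components(engine, path):
    """Sketch of requesting the optional components for one file.

    `engine` stands for an initialized DeepTone instance; the keyword
    names below are assumptions, not the documented API.
    """
    return engine.process_file(
        path,
        models=["speech"],
        include_summary=True,      # aggregate statistics for the whole file
        include_transitions=True,  # segments where the predicted class changes
        include_raw=True,          # raw outputs appended to the time-series
    )
```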
Receptive field
Each model has a different receptive field: the minimum amount of audio it needs to process before it produces reliable results. A model can output results as often as desired (at minimum every 64 ms, and only at multiples of 64 ms), but the initial results might not be accurate, since there was not yet enough data to fill the receptive field. Also note that if you request your data to be processed by multiple models at the same time, each model might start producing accurate results at a different point in time, depending on its receptive field.
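One simple way to handle this is to discard predictions whose timestamp falls before the model's receptive field, as in the sketch below. The helper is not an SDK function; the millisecond values come from the model table at the end of this page.

```python
# Receptive fields in milliseconds, taken from the model table below.
RECEPTIVE_FIELD_MS = {"speech": 1082, "speechrt": 130, "gender": 1851,
                      "arousal": 2107, "emotions": 2107}

def reliable_predictions(predictions, model_name):
    """Drop predictions made before the model saw enough audio.

    `predictions` is assumed to be the time-series described above, with
    elements exposing a `timestamp` in milliseconds; this helper itself
    is not part of the SDK.
    """
    cutoff = RECEPTIVE_FIELD_MS[model_name]
    return [p for p in predictions if p.timestamp >= cutoff]
```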
Output period
All models can output predictions at most once every 64 ms of audio data.
Detailed information on how to configure the properties of a model can be found in the Usage section. The following models are currently included in the DeepTone™ SDK:
Model | Description | Output | Receptive Field |
---|---|---|---|
Speech | Speech classifier that can detect music, general noise, or human speech with high precision. | speech, music, other, silence | 1082 ms |
SpeechRT | Speech classifier that detects human speech in low-latency scenarios. | speech, no-speech, silence | 130 ms |
Gender | Speaker gender classifier | male, female, unknown, no_speech, silence | 1851 ms |
Arousal | Speaker arousal classifier | low, neutral, high, no_speech, silence | 2107 ms |
Emotions | Speaker emotion classifier | happy, irritated, neutral, tired, no_speech, silence | 2107 ms |
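To tie the table together, the sketch below runs several models over one file in a single pass; each model's output only becomes reliable after its own receptive field. The `process_file` signature is an assumption (see the Usage section for the real configuration API).

```python
def analyse_recording(engine, path):
    """Sketch: request several models at once; names follow the table above.

    Per their receptive fields, Speech becomes reliable after ~1082 ms,
    Gender after ~1851 ms, and Arousal after ~2107 ms of audio.
    """
    return engine.process_file(path, models=["speech", "gender", "arousal"])
```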