Models Overview

The DeepTone™ Cloud API comes with a set of different models. Each model is optimized to give insights on a certain voice property.

Available models

The following models are available:

Speech: Speech classifier that can detect music, general noise, or human speech
SpeechRT: Speech classifier that can detect human speech, music or general noise for low latency scenarios.
SpeakerMap: Speaker labelling (speaker_1, speaker_2, etc.) for speaker separation purposes (Alpha Access)
Gender: Speaker gender classifier
Arousal: Speaker arousal classifier
Emotions: Speaker emotion classifier
Language: Speaker language classifier (Early Access)
UnderageSpeaker: Underage Speaker classifier
AudioEvent: Audio Event classifier

Integrated speech- and silence-detector

Finding voice properties only works if there is a voice to classify in a given audio snippet. This is why all the models that give insights on voice are combined with our Volume and SpeechRT models.

Based on the output of the Volume model, the audio can be classified as silence. If audio is detected, the SpeechRT model assigns it to one of three categories:

Speech
Music
Other

Currently, we consider everything that is neither speech nor music to be "other". We will be able to categorize audio into a wider range of categories with our HumanSounds model (coming soon ...).
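The decision order described above can be sketched as follows. This is only an illustration, not the API's implementation: the function names, the `volume_is_silent` flag, and the score dictionary are hypothetical stand-ins for the outputs of the Volume and SpeechRT models. Mapping music and other to `no_speech` for voice models is an assumption based on the output labels listed in the model table.

```python
from typing import Mapping

# Category names taken from the list above.
SPEECH_RT_LABELS = ("speech", "music", "other")


def gate_label(volume_is_silent: bool, speech_rt_scores: Mapping[str, float]) -> str:
    """Return 'silence', or the highest-scoring SpeechRT category.

    Both arguments are hypothetical stand-ins for the Volume and
    SpeechRT model outputs; the real API may expose them differently.
    """
    if volume_is_silent:
        # The Volume model decides first: no audio means nothing to classify.
        return "silence"
    # Otherwise pick the SpeechRT category with the highest score.
    return max(SPEECH_RT_LABELS, key=lambda label: speech_rt_scores.get(label, 0.0))


def voice_model_result(gate: str, voice_prediction: str) -> str:
    """Combine the gate with a voice model's own prediction (assumed mapping)."""
    if gate == "silence":
        return "silence"
    if gate != "speech":
        # Music and other audio contain no voice to classify.
        return "no_speech"
    return voice_prediction
```

For example, `voice_model_result(gate_label(False, {"speech": 0.8, "music": 0.1}), "female")` yields the voice model's own label, while a silent or music-only snippet never reaches the voice model.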

Speech vs SpeechRT model

Our latest SpeechRT model outperforms the Speech model in terms of speech classification in common use cases. The SpeechRT model is therefore the default model used for speech detection within other models.

In choosing between Speech and SpeechRT, it is advisable to experiment with both and choose the one that performs better on your data.

If you have low-latency requirements, the SpeechRT model may be preferable because its shorter receptive field lets it react to changes faster. Where the decision between speech, music, and noise is difficult (e.g. overlapping sources), the longer context available to the Speech model may yield better results. If experimentation is not possible, choose the SpeechRT model.

Model performance

We are continuously improving our models, so the advice here is subject to change with new model updates.

Model output

Each of our classifier models will return an iterable that contains a time-series of predictions. Each element of the time-series contains the following information:

Timestamp - in milliseconds
Confidence - float ∈ [0,1]
Model result - depends on the model that is used
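As a rough illustration of the three fields above, one element of the time-series could be modelled like this. The class name `Prediction`, its field names, and the helper `confident_only` are assumptions made for this sketch; the actual response format may differ.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Prediction:
    """One element of a model's time-series (hypothetical field names)."""

    timestamp_ms: int  # position in the audio, in milliseconds
    confidence: float  # expected to lie in [0, 1]
    result: str        # model-dependent label, e.g. "speech" or "female"

    def __post_init__(self) -> None:
        # Reject values outside the documented confidence range.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")


def confident_only(series: Iterable[Prediction], threshold: float) -> List[Prediction]:
    """Keep only the predictions whose confidence reaches the threshold."""
    return [p for p in series if p.confidence >= threshold]
```

Filtering by confidence is one simple way to consume such a series; the threshold to use depends on the model and on your data.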

Optionally, a user can also request summary and transitions components in addition to the time-series result when processing a file. Raw model results can also be requested, in which case they are appended to the time-series. More information about these options can be found in the File Processing section.

Receptive field

Each model has a different receptive field: the minimum amount of audio it needs to process before it produces reliable results. You can request output as often as every 64 ms, or at any multiple of 64 ms, but the first results may be inaccurate because the model has not yet seen enough data to fill its receptive field. If you process data with multiple models at the same time, each may start producing accurate results at a different point in time, depending on its receptive field.

The receptive field also affects the reaction time of a model. Models with a longer receptive field generally take longer to become certain that the classification of an audio snippet has changed, so models with shorter receptive fields generally react to changes faster.
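To make the warm-up effect concrete, the sketch below estimates the first output time at which a model has seen at least its full receptive field. It assumes that outputs are emitted at multiples of the output period counted from the start of the audio, and that a result is based on full context once the processed audio is at least as long as the receptive field. The receptive-field values are the ones listed in the model table in this document; the function names are hypothetical.

```python
import math
from typing import Iterable

OUTPUT_PERIOD_MS = 64  # smallest output period stated in this document

# Receptive fields in milliseconds, as listed in the model table.
RECEPTIVE_FIELD_MS = {
    "Speech": 1082,
    "SpeechRT": 146,
    "SpeakerMap": 2107,
    "Gender": 3130,
    "Arousal": 2107,
    "Emotions": 2107,
    "Language": 2107,
    "UnderageSpeaker": 570.5,
    "AudioEvent": 1082,
}


def first_full_context_ms(model: str, output_period_ms: int = OUTPUT_PERIOD_MS) -> int:
    """First output time (ms) at which the receptive field is fully covered."""
    steps = math.ceil(RECEPTIVE_FIELD_MS[model] / output_period_ms)
    return steps * output_period_ms


def all_models_ready_ms(models: Iterable[str], output_period_ms: int = OUTPUT_PERIOD_MS) -> int:
    """Time (ms) after which every requested model has full context."""
    return max(first_full_context_ms(m, output_period_ms) for m in models)
```

Under these assumptions, `first_full_context_ms("SpeechRT")` is 192 and `first_full_context_ms("Gender")` is 3136, so a request that combines both models only has full context for every model from 3136 ms onward.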

Output period

All models can output a prediction at most once per 64 ms of audio data.
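Because output periods are restricted to 64 ms and its multiples, a client may want to check or adjust a desired period before sending a request. The helpers below are a minimal sketch of that check; their names are hypothetical and they are not part of the API.

```python
import math

MIN_OUTPUT_PERIOD_MS = 64  # smallest output period stated in this document


def is_valid_output_period(period_ms: int) -> bool:
    """True if the period is a positive multiple of 64 ms."""
    return period_ms >= MIN_OUTPUT_PERIOD_MS and period_ms % MIN_OUTPUT_PERIOD_MS == 0


def next_valid_output_period(desired_ms: float) -> int:
    """Round a desired period up to the next multiple of 64 ms."""
    steps = max(1, math.ceil(desired_ms / MIN_OUTPUT_PERIOD_MS))
    return steps * MIN_OUTPUT_PERIOD_MS
```

For instance, a desired period of 1000 ms is not a multiple of 64, so `next_valid_output_period(1000)` rounds it up to 1024.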

Detailed information on how to configure the properties of a model can be found in the Usage section. The following models are currently available in the DeepTone™ Cloud API:

Model	Description	Output	Receptive field
Speech	Speech classifier that can detect music, general noise, or human speech	speech | music | other | silence	1082 ms
SpeechRT	Speech classifier that can detect human speech, music or general noise for low latency scenarios	speech | music | other | silence	146 ms
SpeakerMap	Speaker labelling - speaker_1, speaker_2, etc. for speaker separation purposes (Alpha Access)	speaker_1 | speaker_2 | ... | unknown | no_speech | silence	2107 ms
Gender	Speaker gender classifier	male | female | unknown | no_speech | silence	3130 ms
Arousal	Speaker arousal classifier	low | neutral | high | no_speech | silence	2107 ms
Emotions	Speaker emotion classifier	happy | irritated | neutral | tired | no_speech | silence	2107 ms
Language	Speaker language classifier (Early Access)	de | en | es | fr | it | unknown | no_speech | silence	2107 ms
UnderageSpeaker	UnderageSpeaker classifier that can detect adult or child speakers	adult | child | unknown | no_speech | silence	570.5 ms
AudioEvent	AudioEvent classifier that can detect various human sounds	neg_human_sounds | nat_human_sounds | pos_human_sounds | noise | music | speech | silence	1082 ms
