Speaker Detection

Speaker Detection Suite

Speaker detection and identification is a complex process that involves many aspects of audio processing and covers a variety of use cases.

Eventually DeepTone™ will provide a complete suite of models for multi-faceted speaker detection to handle tasks such as

  • detecting known and unknown speakers for speaker separation;
  • identifying snippets by speakers known during calibration;
  • detecting more complex behaviour such as speaker overlap, interruptions, and more.

The first model from the Speaker Detection suite is the SpeakerMap model.

The SpeakerMap model

The SpeakerMap model is the basis for speaker identification tasks. With SpeakerMap, you can identify the speech segments spoken by each speaker throughout an audio file; neither prior knowledge of the number of speakers nor manually splitting the audio into channels is needed. For example usage, see the Speaker Detection Recipes.

The SpeakerMap model classifies audio with a speaker label (speaker_1, speaker_2, etc.), or with unknown if the speaker could not be identified. When analysing a segment, the result contains the majority speaker, speaker_top, as well as the number of speakers identified in that segment. Speaker labels are assigned in order of appearance; for example, speaker_1 always designates the first speaker identified in the file or stream.

Because it only makes sense to apply this model to speech audio, it is combined with a Speech Detection model to increase the reliability of the results.

The receptive field of this model is 2107 milliseconds.

Specification

Receptive Field    Result Type
2107 ms            speaker_top ∈ ["speaker_1", "speaker_2", ..., "speaker_N", "unknown", "no_speech"]
                   speaker_count ∈ [0, 1, 2, ...]
Tip

The Transitions result type is particularly useful for speaker separation.

Time-series

The time-series result will be an iterable with elements that contain the following information:

{
  "timestamp": 0,
  "results": {
    "speaker-map": {
      "speaker_top": "speaker_2",
      "speaker_count": 1
    }
  }
}

For each audio segment, speaker_top identifies the majority (or only) speaker in that segment. With speaker_count, you can find out whether more than one speaker was identified for the segment.
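For instance, overlapping speech can be flagged by scanning the time-series for segments where speaker_count is greater than one. Below is a minimal Python sketch, assuming the time-series has already been obtained as an iterable of dictionaries in the shape shown above; the time_series variable and the find_overlap_segments helper are illustrative, not part of the SDK.

def find_overlap_segments(time_series):
    """Return the timestamps of segments where more than one speaker spoke."""
    overlaps = []
    for element in time_series:
        # Each element follows the time-series shape documented above.
        result = element["results"]["speaker-map"]
        if result["speaker_count"] > 1:
            overlaps.append(element["timestamp"])
    return overlaps

# Illustrative input mirroring the example structure above.
time_series = [
    {"timestamp": 0, "results": {"speaker-map": {"speaker_top": "speaker_2", "speaker_count": 1}}},
    {"timestamp": 1000, "results": {"speaker-map": {"speaker_top": "speaker_1", "speaker_count": 2}}},
]
print(find_overlap_segments(time_series))  # [1000]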

Time-series with raw values

If raw values were requested, they will be added to the time-series result. Below is an example where two speakers were detected in a segment:

{
  "timestamp": 0,
  "results": {
    "speaker-map": {
      "speaker_top": "speaker_2",
      "speaker_count": 2
    }
  },
  "raw": {
    "speaker-map": {
      "speaker_1_fraction": 0.30,
      "speaker_2_fraction": 0.50,
      "unknown_fraction": 0.10,
      "no_speech_fraction": 0.10,
      "speaker_count": 2
    }
  }
}

where speaker_X_fraction represents the fraction of time during which speaker_X spoke between this timestamp and the next one. The total number of detected speakers is available in speaker_count.
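As an illustration, the raw values of a segment can be ranked to see who dominated it. A minimal Python sketch, assuming raw holds the "speaker-map" raw values from the example above (the variable name is illustrative):

raw = {
    "speaker_1_fraction": 0.30,
    "speaker_2_fraction": 0.50,
    "unknown_fraction": 0.10,
    "no_speech_fraction": 0.10,
    "speaker_count": 2,
}

# Keep only the per-speaker fractions (speaker_count and the unknown/no_speech
# entries are filtered out) and print them in descending order.
speaker_fractions = {
    key: value
    for key, value in raw.items()
    if key.startswith("speaker_") and key.endswith("_fraction")
}
for label, fraction in sorted(speaker_fractions.items(), key=lambda item: -item[1]):
    print(f"{label}: {fraction:.0%}")
# speaker_2_fraction: 50%
# speaker_1_fraction: 30%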

Summary

If a summary is requested, the following will be returned for the whole audio input:

{
  "speaker-map": {
    "speaker_1_fraction": 0.30,
    "speaker_2_fraction": 0.50,
    "speaker_3_fraction": 0.10,
    "unknown_fraction": 0.00,
    "no_speech_fraction": 0.10,
    "speaker_count": 3
  }
}

where speaker_X_fraction represents the fraction of time during which speaker_X spoke in the processed file. The total number of detected speakers is available in speaker_count, and the fractions of time with unidentifiable speakers or no speech are available as unknown_fraction and no_speech_fraction respectively.
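A common use of the summary is computing each speaker's share of the identified speech time, ignoring unknown and non-speech audio. A minimal Python sketch, assuming summary holds the dictionary shown above (the variable name is illustrative):

summary = {
    "speaker-map": {
        "speaker_1_fraction": 0.30,
        "speaker_2_fraction": 0.50,
        "speaker_3_fraction": 0.10,
        "unknown_fraction": 0.00,
        "no_speech_fraction": 0.10,
        "speaker_count": 3,
    }
}

result = summary["speaker-map"]
# Speaker labels run from speaker_1 up to speaker_N, where N is speaker_count.
speakers = [f"speaker_{i}" for i in range(1, result["speaker_count"] + 1)]
identified_total = sum(result[f"{s}_fraction"] for s in speakers)
for s in speakers:
    share = result[f"{s}_fraction"] / identified_total
    print(f"{s} spoke {share:.0%} of the identified speech time")
# speaker_1 spoke 33% of the identified speech time
# speaker_2 spoke 56% of the identified speech time
# speaker_3 spoke 11% of the identified speech time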

Transitions

If transitions are requested, a time-series with transition elements like the ones shown below will be returned:

{
  "timestamp_start": 0,
  "timestamp_end": 1500,
  "result": "speaker_1"
},
{
  "timestamp_start": 1500,
  "timestamp_end": 4000,
  "result": "speaker_2"
}

The example above means that the first 1500 ms of the audio snippet contained speech from the first identified speaker (speaker_1), and that between 1500 ms and 4000 ms DeepTone™ detected speech from a second speaker (speaker_2).
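As the tip above suggests, this output maps directly onto speaker separation: grouping the transition elements by speaker yields, for each speaker, the time windows they occupy in the recording. A minimal Python sketch, assuming transitions holds the elements shown above (the variable name is illustrative):

from collections import defaultdict

transitions = [
    {"timestamp_start": 0, "timestamp_end": 1500, "result": "speaker_1"},
    {"timestamp_start": 1500, "timestamp_end": 4000, "result": "speaker_2"},
]

# Collect (start, end) windows in milliseconds, keyed by speaker label.
windows = defaultdict(list)
for element in transitions:
    windows[element["result"]].append(
        (element["timestamp_start"], element["timestamp_end"])
    )

print(dict(windows))
# {'speaker_1': [(0, 1500)], 'speaker_2': [(1500, 4000)]}

The resulting per-speaker windows can then be passed to any audio library to slice the original recording into per-speaker clips.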