SpeakerMap Model

The SpeakerMap model is the basis for speaker identification tasks. With SpeakerMap, you can identify the speech segments spoken by each speaker throughout an audio file. You do not need to know the number of speakers in advance, and you do not need to split the audio into channels manually. For example usage with the SDK, see the Speaker Detection Recipes page under the Usage ➜ Recipes section in the sidebar.

The SpeakerMap classifies audio with a speaker label (speaker_1, speaker_2, etc.) or with unknown if the speaker could not be identified. When analysing a segment, the result contains the majority speaker, speaker_top, as well as the number of speakers identified in that segment. Speaker labels are assigned in order of first appearance: for example, speaker_1 always designates the first speaker identified in the file or stream.

Because it only makes sense to apply this model to speech audio, it is combined with a Speech Detection model to increase the reliability of the results.

The receptive field of this model is 2107 milliseconds.

Specification

Receptive Field: 2107 ms
Result Type: speaker_top ∈ ["speaker_1", "speaker_2", ..., "speaker_N", "unknown", "no_speech"]
             speaker_count ∈ [0, 1, 2, ...]

Tip

The Transitions result type is particularly useful for speaker separation.

Time-series

The time-series result will be an iterable with elements that contain the following information:

{
    "timestamp": 0,
    "results": {
        "speaker-map": {
            "speaker_top": "speaker_2",
            "speaker_count": 1
        }
    }
}

For each audio segment, speaker_top identifies the majority (or only) speaker in that segment. Use speaker_count to check whether more than one speaker was identified in the segment.
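As a minimal sketch of how speaker_count can be used, the snippet below collects the timestamps of segments in which more than one speaker was identified. It works on plain dictionaries shaped like the element above. The function name overlapping_timestamps, the sample data, and the 1024 ms spacing between timestamps are assumptions made for illustration and are not part of the SDK.

```python
from typing import Iterable, List


def overlapping_timestamps(time_series: Iterable[dict]) -> List[int]:
    """Return the timestamps of segments with more than one identified speaker.

    Hypothetical helper: it only assumes elements shaped like the
    time-series example in this section.
    """
    return [
        element["timestamp"]
        for element in time_series
        if element["results"]["speaker-map"]["speaker_count"] > 1
    ]


# Hand-written sample data in the documented shape (not real model output).
sample_time_series = [
    {"timestamp": 0,
     "results": {"speaker-map": {"speaker_top": "speaker_2", "speaker_count": 1}}},
    {"timestamp": 1024,
     "results": {"speaker-map": {"speaker_top": "speaker_1", "speaker_count": 2}}},
]

print(overlapping_timestamps(sample_time_series))  # prints [1024]
```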

Time-series with raw values

If raw values were requested, they will be added to the time-series result. Below is an example where two speakers were detected in a segment:

{
    "timestamp": 0,
    "results": {
        "speaker-map": {
            "speaker_top": "speaker_2",
            "speaker_count": 2
        }
    },
    "raw": {
        "speaker-map": {
            "speaker_1_fraction": 0.30,
            "speaker_2_fraction": 0.50,
            "unknown_fraction": 0.10,
            "no_speech_fraction": 0.10,
            "speaker_count": 2
        }
    }
}

where speaker_X_fraction represents the fraction of time (a value between 0 and 1) during which speaker_X spoke between this timestamp and the following one. The number of speakers detected in the segment is available in speaker_count.
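The raw values can be inspected per speaker. The sketch below extracts only the speaker_X_fraction entries from a raw "speaker-map" dictionary and picks the speaker with the largest share. The helper name speaker_fractions is hypothetical, and the assumption that speaker keys always follow the speaker_<number>_fraction pattern is inferred from the example above rather than confirmed by the SDK.

```python
import re
from typing import Dict

# Assumed key pattern, inferred from the documented example.
SPEAKER_KEY = re.compile(r"^(speaker_\d+)_fraction$")


def speaker_fractions(raw_speaker_map: dict) -> Dict[str, float]:
    """Map each speaker label to its fraction, skipping unknown and no_speech."""
    fractions = {}
    for key, value in raw_speaker_map.items():
        match = SPEAKER_KEY.match(key)
        if match:
            fractions[match.group(1)] = value
    return fractions


# The raw values from the example above.
raw_speaker_map = {
    "speaker_1_fraction": 0.30,
    "speaker_2_fraction": 0.50,
    "unknown_fraction": 0.10,
    "no_speech_fraction": 0.10,
    "speaker_count": 2,
}

fractions = speaker_fractions(raw_speaker_map)
dominant = max(fractions, key=fractions.get)
print(dominant, fractions[dominant])  # prints: speaker_2 0.5
```

In this example the speaker with the largest fraction matches the reported speaker_top.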

Summary

If a summary is requested, the following is returned for the whole audio input:

{
    "speaker-map": {
        "speaker_1_fraction": 0.30,
        "speaker_2_fraction": 0.50,
        "speaker_3_fraction": 0.10,
        "unknown_fraction": 0.00,
        "no_speech_fraction": 0.10,
        "speaker_count": 3
    }
}

where speaker_X_fraction represents the fraction of time (a value between 0 and 1) during which speaker_X spoke in the processed file. The total number of detected speakers is available in speaker_count, and the fractions of time with unidentifiable speakers or no speech are available as unknown_fraction and no_speech_fraction respectively.
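Because the summary reports fractions of the whole input, they can be turned into approximate durations when the total length of the audio is known. The sketch below does this for a summary dictionary shaped like the example above. The helper name fractions_to_durations and the 60000 ms total duration are assumptions for illustration, since the summary itself does not contain the audio length.

```python
from typing import Dict

FRACTION_SUFFIX = "_fraction"


def fractions_to_durations(summary_speaker_map: dict,
                           total_duration_ms: float) -> Dict[str, float]:
    """Convert every *_fraction entry into a duration in milliseconds.

    Hypothetical helper: the total duration must be supplied by the caller.
    """
    return {
        key[: -len(FRACTION_SUFFIX)]: value * total_duration_ms
        for key, value in summary_speaker_map.items()
        if key.endswith(FRACTION_SUFFIX)
    }


# The "speaker-map" part of the summary example above.
summary_speaker_map = {
    "speaker_1_fraction": 0.30,
    "speaker_2_fraction": 0.50,
    "speaker_3_fraction": 0.10,
    "unknown_fraction": 0.00,
    "no_speech_fraction": 0.10,
    "speaker_count": 3,
}

# Assumed total length of the processed audio: 60 seconds.
durations_ms = fractions_to_durations(summary_speaker_map, total_duration_ms=60000)
for label, duration in durations_ms.items():
    print(f"{label}: {duration:.0f} ms")
```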

Transitions

If transitions are requested, a time-series of transition elements like the ones shown below is returned:

{
    "timestamp_start": 0,
    "timestamp_end": 1500,
    "result": "speaker_1"
},
{
    "timestamp_start": 1500,
    "timestamp_end": 4000,
    "result": "speaker_2"
}

The example above means that the first 1500 ms of the audio snippet contained speech from the first identified speaker, speaker_1, and that between 1500 ms and 4000 ms DeepTone™ detected speech from a second speaker, speaker_2.
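For speaker separation, the transitions can be grouped by label to obtain the time spans of each speaker. The following is a minimal sketch that operates on a list of dictionaries shaped like the elements above. The helper name segments_by_speaker is hypothetical, and the snippet does not call the SDK.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def segments_by_speaker(transitions: Iterable[dict]) -> Dict[str, List[Tuple[int, int]]]:
    """Group (start, end) spans in milliseconds by their result label."""
    segments = defaultdict(list)
    for transition in transitions:
        span = (transition["timestamp_start"], transition["timestamp_end"])
        segments[transition["result"]].append(span)
    return dict(segments)


# The two transition elements from the example above.
transitions = [
    {"timestamp_start": 0, "timestamp_end": 1500, "result": "speaker_1"},
    {"timestamp_start": 1500, "timestamp_end": 4000, "result": "speaker_2"},
]

segments = segments_by_speaker(transitions)

# Total speaking time per label, in milliseconds.
totals_ms = {
    label: sum(end - start for start, end in spans)
    for label, spans in segments.items()
}
print(totals_ms)  # prints {'speaker_1': 1500, 'speaker_2': 2500}
```

The resulting spans can then be used to cut the audio into per-speaker pieces with an audio library of your choice.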