SpeakerMap Model
The SpeakerMap is the basis for speaker identification tasks. With SpeakerMap, you can identify the speech segments spoken by each speaker throughout an audio file, without prior knowledge of the number of speakers and without manually splitting the audio into channels.
For example usage with the SDK, see the Speaker Detection Recipes page under the Usage ➜ Recipes section in the sidebar.
The SpeakerMap model classifies audio with a speaker label (speaker_1, speaker_2, etc.) or with unknown if the speaker could not be identified.
For each analysed segment, the result contains the majority speaker, speaker_top, as well as the number of speakers identified in that segment.
Speaker labels are assigned in order of appearance: for example, speaker_1 always designates the first speaker identified in the file or stream.
Because it only makes sense to apply this model to speech audio, it is combined with a Speech Detection model to increase the reliability of the results.
The receptive field of this model is 2107 milliseconds.
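As a rough sketch, requesting SpeakerMap output might look like the following. The import path, class, method, and parameter names here are assumptions for illustration only; the Speaker Detection Recipes page shows the authoritative SDK usage.

```python
# Hypothetical sketch only: the import path, class, method, and
# parameter names below are assumptions for illustration, not the
# documented SDK surface.
from deeptone import Deeptone  # assumed import

engine = Deeptone(license_key="YOUR_LICENSE_KEY")  # assumed constructor

# Assumed call requesting SpeakerMap results for a local file,
# including the summary and raw values discussed below.
output = engine.process_file(
    filename="meeting.wav",
    models=[engine.models.SpeakerMap],
    output_period=1000,
    include_summary=True,
    include_raw_values=True,
)
```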
Specification
Receptive Field | Result Type |
---|---|
2107 ms | speaker_top ∈ ["speaker_1", "speaker_2", ..., "speaker_N", "unknown", "no_speech"]<br>speaker_count ∈ [0, 1, 2, ...] |
The Transitions result type is particularly useful for speaker separation.
Time-series
The time-series result will be an iterable with elements that contain the following information:
{
  "timestamp": 0,
  "results": {
    "speaker-map": {
      "speaker_top": "speaker_2",
      "speaker_count": 1
    }
  }
}
For each audio segment, speaker_top identifies the majority (or only) speaker in that segment. With speaker_count, you can find out whether more than one speaker was identified for the segment.
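For illustration, here is a minimal sketch of scanning the time-series for segments with overlapping speech; the sample data simply mirrors the element format shown above.

```python
# Sample time-series in the documented format; in practice this would
# come from the SDK or a parsed JSON response.
timeseries = [
    {"timestamp": 0, "results": {"speaker-map": {"speaker_top": "speaker_2", "speaker_count": 1}}},
    {"timestamp": 1000, "results": {"speaker-map": {"speaker_top": "speaker_1", "speaker_count": 2}}},
]

# Flag segments where more than one speaker was identified.
for element in timeseries:
    result = element["results"]["speaker-map"]
    if result["speaker_count"] > 1:
        print(f'{element["timestamp"]} ms: overlap, majority speaker is {result["speaker_top"]}')
```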
Time-series with raw values
If raw values were requested, they will be added to the time-series result. Below is an example where two speakers were detected in a segment:
{
  "timestamp": 0,
  "results": {
    "speaker-map": {
      "speaker_top": "speaker_2",
      "speaker_count": 2
    }
  },
  "raw": {
    "speaker-map": {
      "speaker_1_fraction": 0.30,
      "speaker_2_fraction": 0.50,
      "unknown_fraction": 0.10,
      "no_speech_fraction": 0.10,
      "speaker_count": 2
    }
  }
}
where speaker_X_fraction represents the fraction of time during which speaker_X spoke between this timestamp and the next. The total number of detected speakers is available in speaker_count.
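As an illustration of working with raw values, the sketch below picks out the dominant speaker's share of a segment from the documented fraction fields; the raw dictionary is just the example above.

```python
# The raw values from the example above; keys follow the documented
# speaker_X_fraction naming.
raw = {
    "speaker_1_fraction": 0.30,
    "speaker_2_fraction": 0.50,
    "unknown_fraction": 0.10,
    "no_speech_fraction": 0.10,
    "speaker_count": 2,
}

# Keep only the per-speaker fraction fields and find the dominant one.
speaker_fractions = {
    key: value
    for key, value in raw.items()
    if key.startswith("speaker_") and key.endswith("_fraction")
}
label, fraction = max(speaker_fractions.items(), key=lambda kv: kv[1])
print(f'{label[: -len("_fraction")]} held {fraction:.0%} of this segment')
```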
Summary
If a summary is requested, the following is returned for the whole audio input:
{
  "speaker-map": {
    "speaker_1_fraction": 0.30,
    "speaker_2_fraction": 0.50,
    "speaker_3_fraction": 0.10,
    "unknown_fraction": 0.00,
    "no_speech_fraction": 0.10,
    "speaker_count": 3
  }
}
where speaker_X_fraction represents the fraction of time during which speaker_X spoke in the processed file. The total number of detected speakers is available in speaker_count, and the fraction of time with unidentifiable speakers or no speech is available as unknown_fraction and no_speech_fraction respectively.
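For example, the summary fractions can be converted into absolute talk time once the file duration is known; the duration below is an assumed input, not part of the SpeakerMap output.

```python
# Summary in the documented format.
summary = {
    "speaker-map": {
        "speaker_1_fraction": 0.30,
        "speaker_2_fraction": 0.50,
        "speaker_3_fraction": 0.10,
        "unknown_fraction": 0.00,
        "no_speech_fraction": 0.10,
        "speaker_count": 3,
    }
}

duration_seconds = 600  # assumed file length; not part of the SpeakerMap output

for key, fraction in summary["speaker-map"].items():
    if key.endswith("_fraction"):
        label = key[: -len("_fraction")]
        print(f"{label}: {fraction:.0%} of the file ({fraction * duration_seconds:.0f} s)")
```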
Transitions
If transitions are requested, a time-series of transition elements like the ones shown below is returned:
[
  {
    "timestamp_start": 0,
    "timestamp_end": 1500,
    "result": "speaker_1"
  },
  {
    "timestamp_start": 1500,
    "timestamp_end": 4000,
    "result": "speaker_2"
  }
]
The example above means that the first 1500 ms of the audio snippet contained speech from the first identified speaker (speaker_1), and that between 1500 ms and 4000 ms DeepTone™ detected speech from a second speaker (speaker_2).
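A transition list like this maps directly onto per-speaker talk time. The sketch below simply totals the duration of each label from the documented transition format.

```python
from collections import defaultdict

# Transitions in the documented format (timestamps in milliseconds).
transitions = [
    {"timestamp_start": 0, "timestamp_end": 1500, "result": "speaker_1"},
    {"timestamp_start": 1500, "timestamp_end": 4000, "result": "speaker_2"},
]

# Total how long each speaker label was active.
talk_time_ms = defaultdict(int)
for transition in transitions:
    talk_time_ms[transition["result"]] += (
        transition["timestamp_end"] - transition["timestamp_start"]
    )

for label, duration in talk_time_ms.items():
    print(f"{label}: {duration} ms")
```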