Identifying the same speaker across different audio samples can be done with the DeepTone™ SDK. Check out the Speaker Detection Recipes to get familiar with the Speaker Map model.
In this section we will look at the following use case:
- Creating voice signatures of known speakers and detecting the same speakers in new, unseen audio data (example 1)

You will need:
- DeepTone™ with a license key and models
- Audio files to process:
  - An audio file that contains only one speaker, very little silence, and is not too long (<10 seconds)
  - An audio file that contains multiple speakers, including the same speaker as the first file
To be able to recognise a certain speaker in a file, we first need to create a voice signature for that speaker. For this example we need an audio file that contains only one speaker, contains very little silence, and is not too long. Something like this example file: Teo
Let's use this short snippet to create a voice signature:
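A minimal sketch of what such a snippet could look like. Apart from `process_file`, which this guide names explicitly, the class name `Deeptone`, the import path, and the `models` argument with the value `"speaker-map"` are assumptions; consult the SDK reference for the exact API:

```python
# Sketch only: parameter names below (other than process_file) are assumptions.
# The SDK call is guarded so the snippet can be read and run standalone.

def build_signature_request(audio_path):
    """Collect the (hypothetical) arguments for creating a voice signature."""
    return {
        "filename": audio_path,
        "models": ["speaker-map"],  # assumed identifier for the Speaker Map model
    }

try:
    from deeptone import Deeptone  # assumed import path

    engine = Deeptone(license_key="YOUR_LICENSE_KEY")
    result = engine.process_file(**build_signature_request("teo.wav"))
except ImportError:
    result = None  # SDK not installed; the sketch only shows the call shape
```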
We can now use this created voice signature to check if the speaker "teo" is talking in a new audio file. Let's use this sample audio: Two Speakers.
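A sketch of re-processing with a known signature. The keyword used to pass voice signatures (here `speakers`) is a placeholder invented for illustration, not the SDK's documented name; check the SDK reference for the real parameter:

```python
# Hypothetical sketch: "speakers" is an assumed parameter name.

def build_detection_request(audio_path, known_signatures):
    """Arguments for processing a file with known voice signatures."""
    return {
        "filename": audio_path,
        "models": ["speaker-map"],   # assumed model identifier
        "speakers": known_signatures,  # assumed parameter name
    }

try:
    from deeptone import Deeptone  # assumed import path

    engine = Deeptone(license_key="YOUR_LICENSE_KEY")
    teo_signature = ...  # the voice signature created in the previous step
    result = engine.process_file(
        **build_detection_request("two_speakers.wav", {"teo": teo_signature})
    )
except ImportError:
    result = None  # SDK not installed; the sketch only shows the call shape
```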
The summary will look something like this:
We see that the model found the speaker `teo` and a new speaker `speaker_2`. When the corresponding option is set, the output will contain the updated voice signatures, including one for the newly found speaker.
Let's give the new speaker a name:
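Assuming the updated voice signatures come back as a plain mapping from speaker label to signature data (the exact return type is SDK-specific), renaming the new speaker is just a dictionary key change:

```python
# "signatures" stands in for the updated voice signatures returned by the
# previous run; the dict shape here is an assumption for illustration.
signatures = {
    "teo": {"signature": "..."},
    "speaker_2": {"signature": "..."},
}

# Give the automatically labelled speaker a proper name.
signatures["lourenco"] = signatures.pop("speaker_2")

print(sorted(signatures))  # → ['lourenco', 'teo']
```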
Now we can process the same file again with the updated voice signatures and check the summary:
The summary now looks something like this:
We can see that the fractions for "teo" and "lourenco" changed slightly from before. We can inspect and compare the transitions from both outputs using Audacity:
After executing the script, you will find the exported label files in your working directory. You can open the audio file with Audacity and then import these files as labels, via File -> Import -> Labels ... .
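If your script does not already export them, transitions can be written in Audacity's label-track format, which is simply one tab-separated line of start time, end time, and label per segment (the transitions below are illustrative values, not real model output):

```python
def write_audacity_labels(transitions, path):
    """Write (start_seconds, end_seconds, label) tuples as an Audacity
    label track: one tab-separated line per segment."""
    with open(path, "w") as f:
        for start, end, label in transitions:
            f.write(f"{start:.3f}\t{end:.3f}\t{label}\n")

# Illustrative transitions, not real model output.
transitions = [
    (0.0, 4.2, "teo"),
    (4.2, 9.8, "lourenco"),
]
write_audacity_labels(transitions, "speaker_labels.txt")
```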
When providing both voice signatures to the `process_file` function, the predictions improve compared to providing the voice signature of only one of the speakers. This is because the model knows about both speakers from the beginning and is therefore quicker to identify the speaker "lourenco" when he starts talking.
Looking at the transitions of the results, we can see the difference.

Without providing voice signatures:

When providing the voice signatures of both speakers:

The model is faster in detecting speaker lourenco's turn.