Voice Signatures Recipes

Overview

Identifying the same speaker across different audio samples can be done using the DeepTone™ SDK. Checkout the Speaker Detection Recipes to get familiar with the Speaker Map model.

In this Section we will look at the following use cases:

  • Create voice signatures of known speakers and detect the same speakers in new unseen audio data - example 1

Pre-requisites

  • DeepTone™ with license key and models
  • Audio Files to process:
    • An audio file that only contains one speaker, very little silence and is not too long (<10 seconds)
    • An audio file that contains multiple speakers, including the same speaker as the first file

Recognise speaker - Example 1

To be able to recognise a certain speaker in a file, we need to create a voice signature of the speaker. For this example we need a audio file that only contains one speaker, contains very little silence and is not too long. Something like this example file: Teo

Let's use this short snippet to create a voice signature:

from deeptone import Deeptone
# Initialise Deeptone
engine = Deeptone(license_key="...")
voice_signatures = engine.create_voice_signatures(speaker_samples=[
{"speaker_id": "teo", "audio": "teo.wav"},
])
# The signatures dictionary will contain the key "teo" with the voice signature as value

We can now use this created voice signature to check if the speaker "teo" is talking in a new audio file. Let's use this sample audio: Two Speakers.

output = engine.process_file(
filename="audio_sample_speaker_map.wav",
models=[engine.models.SpeakerMap],
output_period=1024,
include_summary=True,
include_transitions=True,
voice_signatures=voice_signatures,
include_voice_signatures=True
)
# Print the summary to see if the speaker "teo" is talking in the analysed file
print(output["channels"]["0"]["summary"]["speaker-map"])

The summary will look something like this:

{
"no_speech_fraction": 0.1106,
"teo_fraction": 0.4555,
"speaker_2_fraction": 0.4338,
"speaker_count": 2
}

We see that we found the speaker teo and a new speaker speaker_2. When setting include_voice_signatures to True the output will contain the updated voice signatures including the newly found speaker_2. Let's give the new speaker a name:

updated_voice_signatures = output["channels"]["0"]["voice_signatures"]
# Rename newly found speaker
updated_voice_signatures["lourenco"] = updated_voice_signatures.pop("speaker_2")

Now we can process the same file again with the updated voice signatures and check the summary:

output_with_both_voice_signatures = engine.process_file(
filename="audio_sample_speaker_map.wav",
models=[engine.models.SpeakerMap],
output_period=1024,
include_summary=True,
include_transitions=True,
voice_signatures=updated_voice_signatures
)
# Print the summary to check if both speakers are now found
print(output_with_both_voice_signatures["channels"]["0"]["summary"]["speaker-map"])

The summary now looks something like this:

{
"no_speech_fraction": 0.1106,
"teo_fraction": 0.4338,
"lourenco_fraction": 0.4555,
"speaker_count": 2
}

We can see that the fractions for "teo" and "lourenco" changed slightly from before. We can inspect and compare the transitions from both outputs using audacity:

# Save the transitions to a .txt file
with open(f"output_labels.txt", "w") as f:
for transition in output["channels"]["0"]["transitions"]["speaker-map"]:
f.write(f"{transition['timestamp_start'] / 1000.0}\t{transition['timestamp_end'] / 1000.0}\t{transition['result']}\n")
# Save the transitions to a .txt file
with open(f"output_with_both_signatures_labels.txt", "w") as f:
for transition in output_with_both_voice_signatures["channels"]["0"]["transitions"]["speaker-map"]:
f.write(f"{transition['timestamp_start'] / 1000.0}\t{transition['timestamp_end'] / 1000.0}\t{transition['result']}\n")

After executing the script, you will find a output_labels.txt and output_with_both_signatures_labels.txt in your working directory. You can open the audio file with Audacity and then import the files as labels, from File -> Import -> Labels ... .

When providing both voice signatures to the process_file function, the predictions are improved compared to when only providing the voice signature of one of the speakers. This is because the model knows about both of the speakers from the beginning and is therefore quicker in identifying the speaker "lourenco" when he start talking.

Looking at the transitions of the results we can see the difference: Without providing voice signatures:

Results without voice signaturesWhen providing the voice signatures of both speakers:Results with voice signaturesThe model is faster in detecting speaker lourenco's turn.