Create Voice Signatures

DeepTone™'s Voice Signatures functionality lets you extract one or more voice signatures of known speakers from your audio files.

Use cases

  • Create voice signatures of known speakers, then detect those speakers in new, unseen audio data. Check out this example.

Limitations

Audio used to create voice signatures must meet these requirements:

  • The provided audio needs to be in a Supported Format
  • Minimum audio input length is 1 second
  • Maximum audio input length per request is 10 seconds
  • Minimum total effective speech length (excluding silence and other non-speech frames) for creating a voice signature is 1 second. Ideally, the input audio should contain as little silence and as few non-speech frames as possible
  • Maximum total audio input length allowed for creating a voice signature is 10 seconds
  • Minimum audio signal-to-noise ratio (SNR) is 0 dB. The less noise the audio contains, the better the voice signature will work
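The two length limits above can be checked before calling the API. The following is a minimal sketch, not part of DeepTone: `check_audio_length` and its constants are hypothetical names, and it assumes the audio is a one-dimensional sequence of samples at 16 kHz (the rate the API expects for numpy input). It checks only the total length; effective speech length and SNR are not measured here.

```python
from typing import Sized

SAMPLE_RATE_HZ = 16_000  # sampling rate expected for numpy input
MIN_SECONDS = 1.0        # minimum audio input length
MAX_SECONDS = 10.0       # maximum audio input length


def check_audio_length(samples: Sized, sample_rate: int = SAMPLE_RATE_HZ) -> float:
    """Return the duration in seconds, or raise ValueError if it is out of range.

    `samples` is any one-dimensional sequence of audio samples,
    for example a numpy array or a list.
    """
    duration = len(samples) / sample_rate
    if duration < MIN_SECONDS:
        raise ValueError(f"audio is {duration:.2f}s, shorter than {MIN_SECONDS}s")
    if duration > MAX_SECONDS:
        raise ValueError(f"audio is {duration:.2f}s, longer than {MAX_SECONDS}s")
    return duration
```

Running such a check up front gives a clear error message before any audio is sent for processing.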

API

create_voice_signatures

The create_voice_signatures function allows you to create voice signatures from a list of speaker_samples.

Arguments

  • speaker_samples (List[Dict[str, str]]): A list of speaker samples. A speaker sample is a Python dictionary containing:
    • speaker_id (str) : The unique identifier for a known speaker
    • audio (str or numpy array): A filepath to an audio file or a numpy array containing the audio data with a sampling rate of 16kHz.
  • voice_signatures (Dict[str, VoiceSignature]): A preexisting set of voice signatures. This argument can be used to gradually build up voice signatures over many function calls.
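As a rough sketch of the gradual build-up that voice_signatures enables, the hypothetical helper below feeds batches of speaker samples to an engine one call at a time. `build_signatures` is not part of DeepTone. It assumes that each call returns the updated set of signatures, including those passed in, which the documentation does not state explicitly.

```python
from typing import Any, Dict, Iterable, List, Optional


def build_signatures(
    engine: Any,
    batches: Iterable[List[Dict[str, Any]]],
) -> Optional[Dict[str, Any]]:
    """Call create_voice_signatures once per batch, carrying the result forward.

    `engine` is expected to be an initialised Deeptone instance.
    Returns None if `batches` is empty.
    """
    signatures = None
    for batch in batches:
        if signatures is None:
            # First call: there are no preexisting signatures to pass in.
            signatures = engine.create_voice_signatures(speaker_samples=batch)
        else:
            # Later calls: pass the signatures built so far (assumed to be
            # returned again, updated, by the call).
            signatures = engine.create_voice_signatures(
                speaker_samples=batch,
                voice_signatures=signatures,
            )
    return signatures
```

This keeps each request within the per-request audio limit while still accumulating signatures across many samples.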

Output

The function returns a set of voice signatures in the form of a Python dictionary. The keys of the dictionary are the provided speaker_ids and each value is a VoiceSignature. A VoiceSignature is a dictionary with a version and data. The created voice signatures can be passed as an argument when processing files.
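Because voice signatures are meant to be reused in later calls, it can be handy to store them between runs. The sketch below assumes that a VoiceSignature is a plain dictionary with an integer version and a base64 string in data, as in the examples on this page. The helper names `signatures_to_json` and `signatures_from_json` are hypothetical and not part of DeepTone.

```python
import json
from typing import Any, Dict


def signatures_to_json(signatures: Dict[str, Dict[str, Any]]) -> str:
    """Serialise a voice signatures dictionary to a JSON string."""
    return json.dumps(signatures, indent=2, sort_keys=True)


def signatures_from_json(text: str) -> Dict[str, Dict[str, Any]]:
    """Parse a JSON string and check that each entry has a version and data."""
    signatures = json.loads(text)
    for speaker_id, signature in signatures.items():
        missing = {"version", "data"} - set(signature)
        if missing:
            raise ValueError(f"signature for {speaker_id!r} lacks {sorted(missing)}")
    return signatures
```

The resulting string can be written to a file or a database and parsed again before the signatures are passed back to the engine.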

Example

from deeptone import Deeptone

# Initialise Deeptone
engine = Deeptone(license_key="...")
signatures = engine.create_voice_signatures(speaker_samples=[
    {"speaker_id": "pedro", "audio": "pedro.wav"},
    {"speaker_id": "mariyana", "audio": mariyana_data},  # mariyana_data is a numpy array
])

The returned signatures object will look like this:

{
    "pedro": {
        "version": 1,
        "data": "ZkxhQwAAACIQABAAAAUJABtAA+gA8AB+W8FZndQvQAyjv..."  # base64-encoded voice signature
    },
    "mariyana": {
        "version": 1,
        "data": "T1RPIGlzIGdyZWF0ISBJdCdzIHRydWUu..."  # base64-encoded voice signature
    },
}

create_voice_signatures_from_timestamps

The create_voice_signatures_from_timestamps function allows you to extract voice signatures of multiple known speakers from a single audio source. You do this by providing speaker_segments that define when in the audio each known speaker is talking.

Arguments

  • audio (str or numpy array): A filepath to an audio file or a numpy array containing the audio data with a sampling rate of 16kHz.
  • speaker_segments (List[Dict[str, str/int]]): A list of speaker segments. A speaker segment is a Python dictionary containing:
    • speaker_id (str) : The unique identifier for a known speaker
    • timestamp_start (int) : The timestamp in the audio, in milliseconds, when the speaker starts talking
    • timestamp_end (int) : The timestamp in the audio, in milliseconds, when the speaker's turn is finished
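Speaker turns are often annotated in seconds, while the examples on this page give timestamps in milliseconds. The sketch below shows one way to build speaker_segments from seconds and to total the annotated time per speaker. `make_segment` and `annotated_ms_per_speaker` are hypothetical helpers, not part of DeepTone, and the totals count annotated time, not effective speech.

```python
from collections import defaultdict
from typing import Dict, List, Union

Segment = Dict[str, Union[str, int]]


def make_segment(speaker_id: str, start_s: float, end_s: float) -> Segment:
    """Build one speaker segment, converting seconds to integer milliseconds."""
    if start_s < 0 or end_s <= start_s:
        raise ValueError("expected 0 <= start_s < end_s")
    return {
        "speaker_id": speaker_id,
        "timestamp_start": round(start_s * 1000),
        "timestamp_end": round(end_s * 1000),
    }


def annotated_ms_per_speaker(segments: List[Segment]) -> Dict[str, int]:
    """Sum the annotated duration, in milliseconds, for each speaker."""
    totals: Dict[str, int] = defaultdict(int)
    for segment in segments:
        totals[segment["speaker_id"]] += (
            segment["timestamp_end"] - segment["timestamp_start"]
        )
    return dict(totals)
```

Comparing the per-speaker totals with the minimum speech length is a quick way to spot speakers whose segments are too short to yield a signature.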

Example

from deeptone import Deeptone

# Initialise Deeptone
engine = Deeptone(license_key="...")
signatures = engine.create_voice_signatures_from_timestamps(
    audio="sample_audio.wav",
    speaker_segments=[
        # Defining a turn in the given audio file where speaker "pedro" is talking between 1000ms and 5500ms
        {"speaker_id": "pedro", "timestamp_start": 1000, "timestamp_end": 5500},
        # Defining a turn in the given audio file where speaker "pedro" is talking between 6500ms and 7500ms
        {"speaker_id": "pedro", "timestamp_start": 6500, "timestamp_end": 7500},
        # Defining a turn in the given audio file where speaker "mariyana" is talking between 10000ms and 15000ms
        {"speaker_id": "mariyana", "timestamp_start": 10000, "timestamp_end": 15000},
    ],
)

The returned signatures object will look like this:

{
    "pedro": {
        "version": 1,
        "data": "ZkxhQwAAACIQABAAAAUJABtAA+gA8AB+W8FZndQvQAyjv..."  # base64-encoded voice signature
    },
    "mariyana": {
        "version": 1,
        "data": "T1RPIGlzIGdyZWF0ISBJdCdzIHRydWUu..."  # base64-encoded voice signature
    },
}

merge_voice_signatures

The merge_voice_signatures function allows you to merge multiple voice signatures objects into one.

Arguments

  • voice_signatures_list (List[Dict[str, VoiceSignature]]): A list of voice signatures objects

Example

from deeptone import Deeptone

# Initialise Deeptone
engine = Deeptone(license_key="...")
signatures_1 = engine.create_voice_signatures_from_timestamps(
    audio="sample_audio.wav",
    speaker_segments=[
        # Defining a turn in the given audio file where speaker "pedro" is talking between 1000ms and 5500ms
        {"speaker_id": "pedro", "timestamp_start": 1000, "timestamp_end": 5500},
        # Defining a turn in the given audio file where speaker "pedro" is talking between 6500ms and 7500ms
        {"speaker_id": "pedro", "timestamp_start": 6500, "timestamp_end": 7500},
        # Defining a turn in the given audio file where speaker "mariyana" is talking between 10000ms and 15000ms
        {"speaker_id": "mariyana", "timestamp_start": 10000, "timestamp_end": 15000},
    ],
)
signatures_2 = engine.create_voice_signatures(speaker_samples=[
    {"speaker_id": "pedro", "audio": "pedro.wav"},
    {"speaker_id": "ana", "audio": "ana.wav"},
])

combined_signatures = engine.merge_voice_signatures([signatures_1, signatures_2])

The returned combined_signatures object will look like this:

{
    "pedro": {
        "version": 1,
        "data": "U3VwZXJkdXBlcm1lZ2FzdGFydm9pY2VzaWduYXR1cmVzMQ..."  # base64-encoded voice signature
    },
    "mariyana": {
        "version": 1,
        "data": "T1RPIGlzIGdyZWF0ISBJdCdzIHRydWUu..."  # base64-encoded voice signature
    },
    "ana": {
        "version": 1,
        "data": "RmFudGFzdGljYWxseWl0YXRzdGljYWxiZW5lZml0aWFjb2ZmZW..."  # base64-encoded voice signature
    },
}

The voice signature for speaker "pedro" will be more robust because it contains information from three different audio segments in which "pedro" is talking.
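Before merging, it can be useful to see which speakers appear in more than one signatures object, since those are the ones whose signatures get combined. The helper below is a hypothetical sketch that only inspects dictionary keys. It is not part of DeepTone and does not reproduce the merge itself.

```python
from collections import Counter
from typing import Dict, List


def speakers_in_multiple_sets(signature_sets: List[Dict[str, dict]]) -> List[str]:
    """Return the sorted speaker ids that occur in more than one signatures object."""
    counts = Counter(
        speaker_id
        for signatures in signature_sets
        for speaker_id in signatures
    )
    return sorted(speaker_id for speaker_id, n in counts.items() if n > 1)
```

For the two objects in the example above, the helper would report only "pedro", the one speaker present in both.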