# Create Voice Signatures
DeepTone™'s Voice Signatures functionality allows you to extract one or multiple voice signatures of known speakers from your audio files.
## Use cases
- Create voice signatures of known speakers and detect those speakers in new, unseen audio data. Check out this example.
## Limitations
The audio provided for creating voice signatures must meet the following requirements (a minimal pre-check sketch follows the list):
- The provided audio needs to be in a Supported Format
- Minimum audio input length is 1 second
- Maximum audio input length per request is 10 seconds
- Minimum total effective speech length (excluding silence and other non-speech frames) for creating a voice signature is 1 second. Optimally, the input audio contains as little silence and as few non-speech frames as possible
- Maximum total audio input length allowed for creating a voice signature is 10 seconds
- Minimum audio signal-to-noise ratio (SNR) is 0 dB. The less noise in the audio, the better the voice signature will work
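The length requirements can be verified up front. Below is a minimal pre-check sketch; it is not part of the DeepTone SDK and assumes clips live on disk in a format the `soundfile` library can read:

```python
import soundfile as sf

# Hypothetical pre-check, not part of DeepTone: reject clips whose length
# falls outside the documented 1-10 second window.
def check_clip_length(path, min_s=1.0, max_s=10.0):
    info = sf.info(path)                      # reads only the file header
    duration = info.frames / info.samplerate  # length in seconds
    if not (min_s <= duration <= max_s):
        raise ValueError(f"{path}: {duration:.2f}s is outside [{min_s}, {max_s}]s")
    return duration

check_clip_length("pedro.wav")
```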
## API
### `create_voice_signatures`

The `create_voice_signatures` function allows you to create voice signatures from a list of `speaker_samples`.
#### Arguments
- `speaker_samples` (List[Dict[str, str]]): A list of speaker samples. A speaker sample is a Python dictionary containing:
  - `speaker_id` (str): The unique identifier for a known speaker
  - `audio` (str or numpy array): A filepath to an audio file or a numpy array containing the audio data with a sampling rate of 16 kHz.
- `voice_signatures` (Dict[str, VoiceSignature]): A preexisting set of voice signatures. This argument can be used to gradually build up voice signatures over many function calls (see the sketch after this list).
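As a sketch of that incremental pattern (the filenames here are placeholders, and it assumes the returned dictionary contains both the preexisting and the newly created entries):

```python
# Build up a signature set across several calls. Filenames are placeholders.
signatures = engine.create_voice_signatures(speaker_samples=[
    {"speaker_id": "pedro", "audio": "pedro_1.wav"},
])
signatures = engine.create_voice_signatures(
    speaker_samples=[{"speaker_id": "mariyana", "audio": "mariyana_1.wav"}],
    voice_signatures=signatures,  # carry over the signatures created so far
)
```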
#### Output
The output of the function is a set of voice signatures in the form of a Python dictionary. The keys of the dictionary are the provided `speaker_id`s and the values are `VoiceSignature` objects. A `VoiceSignature` is a dictionary with a `version` and `data`. The created voice signatures can be passed as an argument when processing files.
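Since each `VoiceSignature` holds only JSON-friendly values (an integer `version` and a base64-encoded `data` string), one way to keep signatures between runs is plain JSON, for example:

```python
import json

# Save the signature set for later runs...
with open("signatures.json", "w") as f:
    json.dump(signatures, f)

# ...and load it back before processing new files.
with open("signatures.json") as f:
    signatures = json.load(f)
```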
#### Example
```python
from deeptone import Deeptone

# Initialise Deeptone
engine = Deeptone(license_key="...")

signatures = engine.create_voice_signatures(speaker_samples=[
    {"speaker_id": "pedro", "audio": "pedro.wav"},
    {"speaker_id": "mariyana", "audio": mariyana_data},  # mariyana_data is a numpy array
])
```
The returned `signatures` object will look like this:
```python
{
    "pedro": {
        "version": 1,
        "data": "ZkxhQwAAACIQABAAAAUJABtAA+gA8AB+W8FZndQvQAyjv..."  # base64 encoded voice signature
    },
    "mariyana": {
        "version": 1,
        "data": "T1RPIGlzIGdyZWF0ISBJdCdzIHRydWUu..."  # base64 encoded voice signature
    },
}
```
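Individual signatures can then be looked up by `speaker_id` like any other dictionary entry:

```python
pedro_signature = signatures["pedro"]
print(pedro_signature["version"])  # -> 1
```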
### `create_voice_signatures_from_timestamps`

The `create_voice_signatures_from_timestamps` function allows you to extract voice signatures of multiple known speakers from a single audio source. This is done by providing `speaker_segments` that define at what time during the file a certain known speaker is talking.
#### Arguments
- `audio` (str or numpy array): A filepath to an audio file or a numpy array containing the audio data with a sampling rate of 16 kHz.
- `speaker_segments` (List[Dict[str, str/int]]): A list of speaker segments. A speaker segment is a Python dictionary containing:
  - `speaker_id` (str): The unique identifier for a known speaker
  - `timestamp_start` (int): The timestamp (in milliseconds) at which the speaker starts talking
  - `timestamp_end` (int): The timestamp (in milliseconds) at which the speaker's turn finishes
#### Example
```python
from deeptone import Deeptone

# Initialise Deeptone
engine = Deeptone(license_key="...")

signatures = engine.create_voice_signatures_from_timestamps(
    audio="sample_audio.wav",
    speaker_segments=[
        # Speaker "pedro" is talking between 1000 ms and 5500 ms
        {"speaker_id": "pedro", "timestamp_start": 1000, "timestamp_end": 5500},
        # Speaker "pedro" is talking again between 6500 ms and 7500 ms
        {"speaker_id": "pedro", "timestamp_start": 6500, "timestamp_end": 7500},
        # Speaker "mariyana" is talking between 10000 ms and 15000 ms
        {"speaker_id": "mariyana", "timestamp_start": 10000, "timestamp_end": 15000},
    ],
)
```
The returned `signatures` object will look like this:
```python
{
    "pedro": {
        "version": 1,
        "data": "ZkxhQwAAACIQABAAAAUJABtAA+gA8AB+W8FZndQvQAyjv..."  # base64 encoded voice signature
    },
    "mariyana": {
        "version": 1,
        "data": "T1RPIGlzIGdyZWF0ISBJdCdzIHRydWUu..."  # base64 encoded voice signature
    },
}
```
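Note that the timestamps are integers in milliseconds. If an upstream step (for example diarization or manual annotation) reports turns in seconds, a small conversion helper such as this hypothetical one can produce the expected `speaker_segments` format:

```python
# Hypothetical helper, not part of DeepTone: convert (speaker, start, end)
# turns given in seconds into millisecond-based speaker segments.
def turns_to_segments(turns):
    return [
        {
            "speaker_id": speaker,
            "timestamp_start": int(start_s * 1000),
            "timestamp_end": int(end_s * 1000),
        }
        for speaker, start_s, end_s in turns
    ]

segments = turns_to_segments([("pedro", 1.0, 5.5), ("pedro", 6.5, 7.5)])
```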
### `merge_voice_signatures`

The `merge_voice_signatures` function allows you to merge multiple voice signature objects.
#### Arguments
- `voice_signatures_list` (List[Dict[str, VoiceSignature]]): A list of voice signature objects
#### Example
```python
from deeptone import Deeptone

# Initialise Deeptone
engine = Deeptone(license_key="...")

signatures_1 = engine.create_voice_signatures_from_timestamps(
    audio="sample_audio.wav",
    speaker_segments=[
        # Speaker "pedro" is talking between 1000 ms and 5500 ms
        {"speaker_id": "pedro", "timestamp_start": 1000, "timestamp_end": 5500},
        # Speaker "pedro" is talking again between 6500 ms and 7500 ms
        {"speaker_id": "pedro", "timestamp_start": 6500, "timestamp_end": 7500},
        # Speaker "mariyana" is talking between 10000 ms and 15000 ms
        {"speaker_id": "mariyana", "timestamp_start": 10000, "timestamp_end": 15000},
    ],
)

signatures_2 = engine.create_voice_signatures(speaker_samples=[
    {"speaker_id": "pedro", "audio": "pedro.wav"},
    {"speaker_id": "ana", "audio": "ana.wav"},
])

combined_signatures = engine.merge_voice_signatures([signatures_1, signatures_2])
```
The returned `combined_signatures` object will look like this:
```python
{
    "pedro": {
        "version": 1,
        "data": "U3VwZXJkdXBlcm1lZ2FzdGFydm9pY2VzaWduYXR1cmVzMQ..."  # base64 encoded voice signature
    },
    "mariyana": {
        "version": 1,
        "data": "T1RPIGlzIGdyZWF0ISBJdCdzIHRydWUu..."  # base64 encoded voice signature
    },
    "ana": {
        "version": 1,
        "data": "RmFudGFzdGljYWxseWl0YXRzdGljYWxiZW5lZml0aWFjb2ZmZW..."  # base64 encoded voice signature
    },
}
```
The voice signature for speaker "pedro" will be more robust, because it combines information from three different audio segments in which "pedro" is talking.
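In other words, the merge behaves as a union over `speaker_id`s, with duplicate identifiers combined into a single, richer signature. A quick sanity check of the example above:

```python
# Each unique speaker_id appears exactly once in the merged set;
# "pedro" occurred in both inputs and was combined into one signature.
assert set(combined_signatures) == {"pedro", "mariyana", "ana"}
```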