API Reference

class deeptone.Deeptone(license_key: str, prediction_engine=None)

Entry point for the Deeptone SDK. Once this class is initialized, it provides access to the Deeptone Deep Learning models, which allow you to extract insights from your audio files.

Three processing modes are supported:

  • File Processing: This mode allows you to provide a file to Deeptone, which returns a time series analysis, alongside a summary and a list of transitions for the entire file.

  • Audio Bytes Processing: This mode allows you to provide audio bytes to Deeptone. The output will be the same as in the File Processing case.

  • Stream Processing: This mode allows you to provide a real-time audio stream, resulting in a continuous analysis that periodically generates insights as the stream progresses.

Performance Considerations:

Initialization of the Deep Learning models that power Deeptone is a time-consuming operation, so creating an instance of this class is costly. We therefore recommend that instances be long-lived.

Thread Safety Considerations:

Instances of Deeptone are thread-safe. However, the actual inference process is done within a critical section, meaning that performance might be limited when using a single instance across multiple threads. If performance is a critical requirement, you should ensure each thread has its own Deeptone instance (usage of an instance pool is recommended).

Raises

LicenseManagerError – When the License Key is invalid or cannot be validated
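
A minimal initialization sketch (the license key value is a placeholder); the instance is created once and reused, in line with the performance guidance above:

from deeptone import Deeptone

# Create the instance once and keep it alive: loading the underlying
# Deep Learning models is the expensive step.
deeptone = Deeptone(license_key="YOUR_LICENSE_KEY")  # placeholder key

# The instance now exposes the processing methods documented below.
print(deeptone.get_available_models())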

create_voice_signatures(speaker_samples: List[Dict[str, Union[str, numpy.ndarray]]], voice_signatures: Optional[Dict[str, Dict[str, Union[int, str]]]] = None) → Dict[str, Dict[str, Union[int, str]]]

Creates voice signatures from speaker samples of known speakers. The voice signatures are returned in a dictionary mapping the speaker identifier (Name/Id/…) to a voice signature embedding.

The returned voice signatures can be used to identify the speaker more reliably by passing them to the processing functions (process_file for example) via the voice_signatures argument.

process_file(
    "example.wav",
    [models.SpeakerMap],
    voice_signatures=signatures,
    include_voice_signatures=True  # optional argument to return the updated signatures
)
Parameters
  • speaker_samples (List[Dict[str, Union[str, np.ndarray]]]) – speaker samples of known speakers. Should be a list of dictionaries, each containing a speaker_id string and an audio key whose value is either a path to an audio file or a numpy array containing the audio data (float64, float32 or int16) with a sample rate of 16kHz

  • voice_signatures (Dict[str, VoiceSignature]) – Optionally, a preexisting set of voice signatures can be passed to the function. The set of voice signatures will be updated with the new signatures and returned.

Returns

A dictionary mapping the speaker identifier to the speaker signature

Return type

dict[str, VoiceSignature]

Example

>>> signatures = deeptone.create_voice_signatures(speaker_samples=[
    {"speaker_id": "pedro", "audio": "pedro.wav"},
    {"speaker_id": "mariyana", "audio": mariyana_data} # mariyana_data is a numpy array
])
>>> signatures
{
    "pedro": {
        "version": 1,
        "data": "ZkxhQwAAACIQABAAAAUJABtAA+gA8AB+W8FZndQvQAyjv..."  # base64 encoded voice signature
    },
    "mariyana": {
        "version": 1,
        "data": "ZkxhQwAAACIQABAAAAUJABtAA+gA8AB+W8FZndQvQAyjv..."  # base64 encoded voice signature
    },
}
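
The voice_signatures argument can be used to extend a set of signatures produced earlier; a short sketch (the speaker name and file path are placeholders):

# Add a new speaker to the signatures created in the example above;
# "anna" and "anna.wav" are placeholders for illustration.
updated_signatures = deeptone.create_voice_signatures(
    speaker_samples=[{"speaker_id": "anna", "audio": "anna.wav"}],
    voice_signatures=signatures,
)
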
create_voice_signatures_from_timestamps(audio: Union[str, numpy.ndarray], speaker_segments: Iterable[Dict[str, Union[int, str]]], existing_signatures: Optional[Dict[str, Dict[str, Union[int, str]]]] = None) → Dict[str, Dict[str, Union[int, str]]]

Creates voice signatures from a single audio sample. The voice signatures for one or more known speakers can be created by providing start and end timestamps for known speaker turns.

The voice signatures are returned in a dictionary mapping the speaker identifier (Name/Id/…) to a voice signature embedding. The known speakers’ identities should be passed as a list of dictionaries containing the speaker identifier, the start timestamp and the end timestamp of the speaker’s turn. Example of the speaker_segments argument:

speaker_segments = [
    # Defining a turn in the given audio file where speaker "pedro" is talking between 1000ms and 5500ms
    {"speaker_id": "pedro", "timestamp_start": 1000, "timestamp_end": 5500},
    # Defining a turn in the given audio file where speaker "pedro" is talking between 6500ms and 7500ms
    {"speaker_id": "pedro", "timestamp_start": 6500, "timestamp_end": 7500},
    # Defining a turn in the given audio file where speaker "mariyana" is talking between 10000ms and 15000ms
   {"speaker_id": "mariyana", "timestamp_start": 10000, "timestamp_end": 15000},
]
Example of a returned voice signature object:
voice_signatures = {
    "pedro": {
        "version": 1,
        "data": "ZkxhQwAAACIQABAAAAUJABtAA+gA8AB+W8FZndQvQAyjv..."  # base64 encoded voice signature
    },
    "mariyana": {
        "version": 1,
        "data": "ZkxhQwAAACIQABAAAAUJABtAA+gA8AB+W8FZndQvQAyjv..."  # base64 encoded voice signature
    },
}
Parameters
  • audio (Union[str, np.ndarray]) – audio sample containing one or more known speakers. Can be an audio filepath or a numpy array containing the audio data (float64, float32 or int16 data) with a sample rate of 16kHz

  • speaker_segments (Iterable[Dict[str, Union[str, int]]]) – a list of dicts containing the speaker identifier, the start timestamp and the end timestamp of the speaker’s turn.

  • existing_signatures (Dict[str, VoiceSignature]) – Optionally, a preexisting set of voice signatures can be passed to the function. The set of voice signatures will be updated with the new signatures and returned.

Returns

A dictionary mapping the speaker identifier to the speaker signature

Return type

Dict[str, VoiceSignature]
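
A sketch of a full call using the speaker_segments list shown above (the audio file path is a placeholder):

# Build signatures for "pedro" and "mariyana" from a single recording.
signatures = deeptone.create_voice_signatures_from_timestamps(
    audio="meeting.wav",  # placeholder path; a 16kHz numpy array also works
    speaker_segments=speaker_segments,
)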

get_available_models() → set

Retrieve the names of all available models

Returns

The names of the available models

Return type

set
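
A small sketch; the exact set of names depends on your installation and license:

# List the model names available with the current license.
available_models = deeptone.get_available_models()
print(available_models)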

is_model_available(model_name: str) → bool

Check if a model with the given name is available

Parameters

model_name (str) – Model name to validate

Returns

True if the model name provided is available

Return type

bool
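
A small sketch; "gender" is used as an illustrative model name (it appears in the output examples below) and may differ in your installation:

# Guard an analysis call behind an availability check.
if deeptone.is_model_available("gender"):
    print("The gender model can be requested")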

merge_voice_signatures(voice_signatures_list: List[Dict[str, VoiceSignature]]) → Dict[str, VoiceSignature]

Merges two or more voice signatures. This function can be used to merge multiple outputs of the create_voice_signatures or create_voice_signatures_from_timestamps functions.

Parameters

voice_signatures_list (List[Dict[str, VoiceSignature]]) – sequence of voice signature dictionaries to merge

Returns

A dictionary mapping the speaker identifier to the speaker signature

Return type

Dict[str, VoiceSignature]
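
A sketch, assuming the function is called with a single list of signature dictionaries as described by the voice_signatures_list parameter:

# Combine signature sets built from two separate recordings.
# signatures_a and signatures_b are outputs of create_voice_signatures.
merged_signatures = deeptone.merge_voice_signatures([signatures_a, signatures_b])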

process_audio_bytes(data: numpy.ndarray, models: list, output_period: int, include_summary: bool = False, include_transitions: bool = False, include_raw_values: bool = False, rate_in: Optional[int] = None, use_chunking: bool = False, volume_threshold: float = 0.005, voice_signatures: Optional[Dict[str, Any]] = None, include_voice_signatures=False) → dict

Analyse audio data with the list of requested models.

This method can be used to generate timestamped predictions directly from audio bytes provided as a numpy array, rather than an audio file.

Parameters
  • data (np.ndarray) – Data to analyse

  • models (list) – List of models to use for the audio analysis

  • output_period (int) – How often (in milliseconds) the output of the model should be returned

  • include_summary (bool, optional) – Should the summary be included

  • include_transitions (bool, optional) – Should the file transitions be returned

  • include_raw_values (bool, optional) – Should raw model outputs be included

  • rate_in (int, optional) – Sample rate of the original audio (in Hz). Should only be specified if the rate differs from the recommended one (16000).

  • use_chunking (bool, optional) – Should data be chunked before making predictions. Chunking is only recommended in case of very large data arrays, to avoid memory issues.

  • volume_threshold (float, optional) – Threshold below which input data will be considered as no sound. Should be a number between 0 and 1, where 0 will treat all data as sound and 1 will treat all data as no sound.

  • voice_signatures (Dict[str, VoiceSignature]) – Dictionary containing voice signatures of known speakers. The voice signatures can be created using the create_voice_signatures function.

  • include_voice_signatures (bool) – Should the updated voice signatures be included

Returns

A dictionary containing timestamped results and summary/transitions/raw values, if applicable.

If include_summary is set to True, the output will contain a summary for the entire data array.

If include_transitions is set to True, the transitions output groups the raw model output (1 prediction every 64 ms) into phases where the predicted classification remains the same.

If include_raw_values is set to True, all possible classes with their respective probabilities will be returned for each model in addition to the most likely one.

Example

{
    "time_series": [
        {
            "timestamp" : 100,
            "results": {
                "gender": {
                    "result": "female",
                    "confidence": 0.6255
                },
                "another_model": {
                    "result: <>,
                    "confidence": <confidence>
                },
            }
        },
        {
            "timestamp" : 105,
            "results:
            {
                "gender": {...},
                "another_model": {...}
            }
        }
    ]
}

Return type

dict
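
A usage sketch; models.Gender is an assumed model constant (only models.SpeakerMap appears elsewhere in this reference), so substitute whichever models your installation provides:

import numpy as np

# Analyse five seconds of synthetic 16kHz audio and request a summary.
# In practice, `data` would hold real audio samples.
data = np.zeros(5 * 16000, dtype=np.float32)
result = deeptone.process_audio_bytes(
    data=data,
    models=[models.Gender],  # assumed constant; see get_available_models()
    output_period=1024,
    include_summary=True,
)
print(result["time_series"])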

process_audio_chunk(data: numpy.ndarray, models: list, include_raw_values: bool = False, volume_threshold: float = 0.005, context_samples: int = 0) → dict

Analyse an audio chunk with the list of requested models.

This method should be used when a single prediction is needed for the whole chunk. For reliable predictions, the duration of the audio should be at least the size of the receptive field of the requested model (approximately 2s for most models). For more info on receptive fields, see Models.

Parameters
  • data (np.ndarray) – Data to analyse, representing audio data sampled at 16kHz

  • models (list) – List of models to use for the audio analysis

  • include_raw_values (bool, optional) – Should raw model outputs be included

  • volume_threshold (float) – Threshold below which input data will be considered as no sound. Should be a number between 0 and 1, where 0 will treat all data as sound and 1 will treat all data as no sound.

  • context_samples (int) – Number of samples that are used as context (receptive field), the predictions for which should be removed from the final result. Defaults to 0, so that nothing is removed.

Returns

A dictionary with the results from each model.

Refer to Models for details on the outputs for each individual model.

Example

{
    "results": {
        "gender": {
            "result": "female",
            "confidence": 0.6255
        },
        "arousal": {
            "result": "high",
            "confidence": 0.9431
        }
    },
    "raw": {
        "gender": {
            "female": 0.8,
            "male": 0.2,
        },
        "arousal": {
            "high": 0.9245,
            "neutral": 0.0245,
            "low": 0.01
        },
    }
}

Return type

dict

Raises

ModelNotFoundError – if any of the models are invalid
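
A usage sketch, analysing a chunk roughly the size of the receptive field; models.Gender is an assumed model constant:

import numpy as np

# A ~2 second chunk of 16kHz audio (the approximate receptive field of
# most models); real samples would replace the zeros here.
chunk = np.zeros(2 * 16000, dtype=np.float32)
result = deeptone.process_audio_chunk(
    data=chunk,
    models=[models.Gender],  # assumed constant; see get_available_models()
)
print(result["results"])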

process_file(filename: str, models: list, output_period: int, channel: Optional[int] = None, include_summary: bool = False, include_transitions: bool = False, include_raw_values: bool = False, use_chunking: bool = False, volume_threshold: float = 0.005, voice_signatures: Optional[Dict[str, Dict[str, Union[int, str]]]] = None, include_voice_signatures=False) → dict

Analyse a WAV File with the list of requested models.

Parameters
  • filename (str) – Path to the file to analyse

  • models (list) – List of models to use for the audio analysis

  • output_period (int) – How often (in milliseconds) the output of the models should be returned. The provided value must be a positive multiple of 64.

  • channel (int, optional) – The channel to analyse. If no channel is provided, all channels will be analysed

  • include_summary (bool, optional) – Should the file summary be returned

  • include_transitions (bool, optional) – Should the file transitions be returned

  • include_raw_values (bool, optional) – Should raw model outputs be included

  • use_chunking (bool, optional) – Should data be chunked before making predictions. Use this if the file being analyzed is large, to avoid issues with high memory consumption

  • volume_threshold (float, optional) – Threshold below which input data will be considered as no sound. Should be a number between 0 and 1, where 0 will treat all data as sound and 1 will treat all data as no sound. Defaults to 0.005, which should exclude very quiet fragments from analysis.

  • voice_signatures (Dict[str, VoiceSignature]) – Dictionary containing voice signatures of known speakers. The voice signatures can be created using the create_voice_signatures function.

  • include_voice_signatures (bool) – Should the updated voice signatures be included

Returns

The results of the analysis for the requested channels.

In each channel, a Time Series will be returned, containing the aggregated results for the specific time window.

If include_summary is set to True, the output will contain a summary for the entire file.

If include_transitions is set to True, the transitions output groups the raw model output (1 prediction every 64 ms) into phases where the predicted classification remains the same.

If include_raw_values is set to True, all possible classes with their respective probabilities will be returned for each model in addition to the most likely one.

Refer to Models for details on the outputs for each individual model.

Example

{
  "channels": {
    "0": {
      "time_series": [
        {
          "timestamp" : 0,
            "results": {
              "gender": {
                "result": "female",
                "confidence": 0.6255,
              },
              "arousal": {
                "result": "high",
                "confidence": 0.9245,
              },
            },
            "raw": {
              "gender": {
                "female": 0.8,
                "male": 0.2,
              },
              "arousal": {
                "high": 0.9245,
                "neutral": 0.0245,
                "low": 0.01
              },
            }
        },
      ],
      "summary": {
        "gender": {
          "high_fraction": 0.8451,
          "low_fraction": 0.1124,
          "neutral_fraction": 0.0425,
        },
        "arousal": {
          "high_fraction": 0.9451,
          "low_fraction": 0.0124,
          "neutral_fraction": 0.0425,
        }
      },
      "transitions": {
        "gender": [
          {
           "timestamp_start": 0,
           "timestamp_end": 1500,
           "result": "female",
           "confidence": 0.96
           },
          {
           "timestamp_start": 1500,
           "timestamp_end": 3420,
           "result": "male",
           "confidence": 0.87
          },
          ...
          {
           "timestamp_start": 8560,
           "timestamp_end": 10000,
           "result": "female",
           "confidence": 0.89
          }
        ],
        "arousal": [
          {
           "timestamp_start": 0,
           "timestamp_end": 2500,
           "result": "high",
           "confidence": 0.92
           },
          {
           "timestamp_start": 2500,
           "timestamp_end": 3420,
           "result": "low",
           "confidence": 0.85
          },
          ...
          {
           "timestamp_start": 7560,
           "timestamp_end": 10000,
           "result": "neutral",
           "confidence": 0.87
          }
        ]
      }
    }
  }
}

Return type

dict
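
A usage sketch; models.Gender and models.Arousal are assumed model constants matching the output example above, and example.wav is a placeholder path:

# Analyse a WAV file, returning one aggregated result every 960 ms
# (the output_period must be a positive multiple of 64).
result = deeptone.process_file(
    filename="example.wav",
    models=[models.Gender, models.Arousal],  # assumed constants
    output_period=960,
    include_summary=True,
    include_transitions=True,
)
for entry in result["channels"]["0"]["time_series"]:
    print(entry["timestamp"], entry["results"])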

process_stream(input_generator, models: list, output_period: int, include_raw_values: bool = False, volume_threshold: float = 0.005)

Analyse a real-time audio stream with the list of requested models.

Parameters
  • input_generator (generator) – Generator that yields byte arrays representing audio data sampled at 16kHz

  • models (list) – List of models to use for the audio analysis

  • output_period (int) – How often (in milliseconds) the returned generator should yield results. The provided value must be a positive multiple of 64.

  • include_raw_values (bool, optional) – Should raw model outputs be included

  • volume_threshold (float) – Threshold below which input data will be considered as no sound. Should be a number between 0 and 1, where 0 will treat all data as sound and 1 will treat all data as no sound.

Returns

A generator that yields aggregated results for every output_period milliseconds of audio data received by the input_generator.

Refer to Models for details on the outputs for each individual model.

Example

{
    "timestamp": 0,
    "results": {
        "gender": {
            "result": "female",
            "confidence": 0.6255
        },
        "arousal": {
            "result": "high",
            "confidence": 0.9431
        }
    },
    "raw": {
        "gender": {
            "female": 0.8,
            "male": 0.2,
        },
        "arousal": {
            "high": 0.9245,
            "neutral": 0.0245,
            "low": 0.01
        },
    }
}

Return type

generator

Raises
  • ModelNotFoundError – if any of the models are invalid

  • ValueError – if the output_period provided is invalid
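
A streaming sketch with a toy generator; models.Gender is an assumed model constant, and a real application would yield microphone or network audio instead of silence:

import numpy as np

def audio_chunks():
    # Toy input: ten one-second blocks of silence as 16kHz int16 bytes.
    for _ in range(10):
        yield np.zeros(16000, dtype=np.int16).tobytes()

# Yields one aggregated result per 1024 ms of received audio.
for result in deeptone.process_stream(
    input_generator=audio_chunks(),
    models=[models.Gender],  # assumed constant; see get_available_models()
    output_period=1024,
):
    print(result["timestamp"], result["results"])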

resample_to_16khz(data: numpy.ndarray, rate_in: int) → numpy.ndarray

Resample a numpy array to the sampling rate that is accepted by Deeptone (16kHz). If the provided input data is an integer type, the data will be scaled and converted to float32 before resampling.

Parameters
  • data (np.ndarray) – Data to resample

  • rate_in (int) – Sample rate of the input audio data (in Hz).

Returns

Resampled 16kHz data

Return type

np.ndarray
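
A small sketch resampling 44.1kHz audio before analysis:

import numpy as np

# One second of 44.1kHz int16 audio; integer input is scaled and
# converted to float32 internally before resampling.
audio_44k = np.zeros(44100, dtype=np.int16)
audio_16k = deeptone.resample_to_16khz(data=audio_44k, rate_in=44100)
print(audio_16k.shape)  # roughly 16000 samples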