File Processing

DeepTone™'s File Processing functionality allows you to extract insights from your audio files.

Working with stereo files

DeepTone™ processes each audio channel separately. If you provide a stereo file, you can specify a single channel to be processed; otherwise, all channels will be processed and reported separately.
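To illustrate, here is a minimal sketch (not SDK code) of iterating over per-channel results when no channel is specified; the dict mirrors the output format documented under Available Outputs below, with made-up values:

```python
# Illustrative output structure when all channels of a stereo file are
# processed: one entry per channel under "channels" (values made up).
output = {
    "channels": {
        "0": {"time_series": [{"timestamp": 0, "results": {"gender": {"result": "male", "confidence": 0.92}}}]},
        "1": {"time_series": [{"timestamp": 0, "results": {"gender": {"result": "female", "confidence": 0.88}}}]},
    }
}

# Iterate every analysed channel rather than hard-coding channel "0".
for channel, data in output["channels"].items():
    for entry in data["time_series"]:
        res = entry["results"]["gender"]
        print(f'channel {channel} @ {entry["timestamp"]}ms: '
              f'{res["result"]} (confidence {res["confidence"]})')
```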

Sample data

You can download this sample audio file (a woman speaking) to use with the examples below. For a code sample, go to Example Usage.

Configuration options and outputs

The available configuration options and output types depend on the SDK language.

For a code sample, go to Example Usage. For the detailed output specification, go to Output Specification.

Available configuration options

The process_file function accepts the following arguments:

  • filename - the path to the file to be analysed (see Supported Audio Formats)
  • models - the list of model names to use for the audio analysis. See all available models here.
  • output_period - how often (in milliseconds, multiple of 64) the output of the models should be returned
  • channel - optionally a channel to analyse, otherwise all channels will be analysed
  • include_summary - optionally, whether the output should contain a summary of the analysis, defaults to False
  • include_transitions - optionally, whether the output should contain the transitions of the analysis, defaults to False
  • include_raw_values - optionally, whether the result should contain the raw model outputs, defaults to False
  • use_chunking - optionally, whether the data should be chunked before analysis (recommended for large files to avoid memory issues)
  • volume_threshold - optionally, a volume threshold different from the default (higher values will result in more of the data being treated as silence)
  • voice_signatures - optionally (only applies when using the SpeakerMap model) the voice signatures that are used to identify known speakers
  • include_voice_signatures - optionally (only applies when using the SpeakerMap model) if the result should contain the voice signatures of the found speakers. If a voice_signatures object was provided as well, it will update the voice_signatures object with the voice signatures of new speakers. See Create Voice Signatures to find out more about voice signatures
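As a sketch of how voice_signatures and include_voice_signatures can work together, here is an illustrative (non-SDK) snippet that keeps a local store of known signatures and merges in any new speakers returned by a SpeakerMap run, so the store can be passed back as voice_signatures on the next call. All values below are placeholders:

```python
# Illustrative: a local store of known voice signatures, in the structure
# documented on this page (placeholder data, not real signatures).
known_signatures = {
    "speaker_1": {"version": 1, "data": "..."},
}

# Pretend this mapping came back from a run with include_voice_signatures=True.
returned_signatures = {
    "speaker_1": {"version": 1, "data": "..."},
    "speaker_2": {"version": 1, "data": "..."},
}

# Add newly found speakers without overwriting existing entries; the updated
# store can then be supplied as voice_signatures= on the next call.
for speaker, signature in returned_signatures.items():
    known_signatures.setdefault(speaker, signature)

print(sorted(known_signatures))  # ['speaker_1', 'speaker_2']
```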

Available Outputs

There are several possible output types, depending on the parameters that you pass to the process_file function:

  • a plain time series - the default output type, always returned
  • a plain time series with raw model outputs - raw values are appended when include_raw_values=True
  • a summary - appended to the results when include_summary=True
  • a simplified time series - appended to the results when include_transitions=True
  • voice signatures of speakers - appended to the results when include_voice_signatures=True (only applies when using the SpeakerMap model)

For a code sample, go to Example Usage. For the detailed output specification, go to Output Specification.

See below for examples of each of these output types:

  • plain time series (according to the specified output_period):
{
  "channels": {
    "0": {
      "time_series": [
        {
          "timestamp": 0,
          "results": {
            "gender": {
              "result": "male",
              "confidence": 0.92
            }
          }
        },
        {
          "timestamp": 1024,
          "results": {
            "gender": {
              "result": "male",
              "confidence": 0.86
            }
          }
        },
        {
          "timestamp": 2048,
          "results": {
            "gender": {
              "result": "male",
              "confidence": 0.85
            }
          }
        },
        ...
        {
          "timestamp": 29696,
          "results": {
            "gender": {
              "result": "silence",
              "confidence": 1.0
            }
          }
        }
      ]
    }
  }
}
  • plain time series with additional raw outputs:
{
  "channels": {
    "0": {
      "time_series": [
        {
          "timestamp": 0,
          "results": {
            "gender": {
              "result": "male",
              "confidence": 0.92
            }
          },
          "raw": {
            "gender": {
              "male": 0.92,
              "female": 0.08
            }
          }
        },
        {
          "timestamp": 1024,
          "results": {
            "gender": {
              "result": "male",
              "confidence": 0.86
            }
          },
          "raw": {
            "gender": {
              "male": 0.86,
              "female": 0.14
            }
          }
        },
        {
          "timestamp": 2048,
          "results": {
            "gender": {
              "result": "male",
              "confidence": 0.85
            }
          },
          "raw": {
            "gender": {
              "male": 0.85,
              "female": 0.15
            }
          }
        },
        ...
        {
          "timestamp": 29696,
          "results": {
            "gender": {
              "result": "silence",
              "confidence": 1.0
            }
          },
          "raw": {
            "gender": {
              "male": 0.12,
              "female": 0.88
            }
          }
        }
      ]
    }
  }
}
  • summary (showing fraction of each class across the entire file):
{
  "channels": {
    "0": {
      "time_series": [ ... ],
      "summary": {
        "gender": {
          "male_fraction": 0.7451,
          "female_fraction": 0.1024,
          "no_speech_fraction": 0.112,
          "unknown_fraction": 0.0405,
          "silence_fraction": 0.0
        }
      }
    }
  }
}
  • simplified time series (indicating transition points between alternating results):
{
  "channels": {
    "0": {
      "time_series": [ ... ],
      "transitions": {
        "gender": [
          {
            "timestamp_start": 0,
            "timestamp_end": 1024,
            "result": "female",
            "confidence": 0.96
          },
          {
            "timestamp_start": 1024,
            "timestamp_end": 3072,
            "result": "male",
            "confidence": 0.87
          },
          ...
          {
            "timestamp_start": 8192,
            "timestamp_end": 12288,
            "result": "female",
            "confidence": 0.89
          }
        ]
      }
    }
  }
}
  • voice signatures of the detected speakers:
{
  "channels": {
    "0": {
      "time_series": [ ... ],
      "summary": { ... },
      "transitions": { ... },
      "voice_signatures": {
        "speaker_1": {
          "version": 1,
          "data": "ZkxhQwAAACIQABAAAAUJABtAA+gA8AB+W8FZndQvQAyjv..."
        },
        "speaker_2": {
          "version": 1,
          "data": "T1RPIGlzIGdyZWF0ISBJdCdzIHRydWUu..."
        }
      }
    }
  }
}
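To make the relationship between the plain time series and the summary concrete: each fraction is the share of timestamps assigned to that class across the file. A small illustrative sketch (not SDK code, values made up):

```python
from collections import Counter

# Illustrative time-series entries in the format shown above (values made up).
time_series = [
    {"timestamp": 0, "results": {"gender": {"result": "male", "confidence": 0.92}}},
    {"timestamp": 1024, "results": {"gender": {"result": "male", "confidence": 0.86}}},
    {"timestamp": 2048, "results": {"gender": {"result": "female", "confidence": 0.85}}},
    {"timestamp": 3072, "results": {"gender": {"result": "silence", "confidence": 1.0}}},
]

# Count how often each class appears, then turn counts into fractions,
# mirroring what the summary output reports for the whole file.
counts = Counter(entry["results"]["gender"]["result"] for entry in time_series)
summary = {f"{label}_fraction": round(n / len(time_series), 4)
           for label, n in counts.items()}
print(summary)  # {'male_fraction': 0.5, 'female_fraction': 0.25, 'silence_fraction': 0.25}
```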

Example Usage

You can use the process_file method to process your audio files.

from deeptone import Deeptone
from deeptone.deeptone import GENDER_MALE, GENDER_FEMALE, GENDER_UNKNOWN, GENDER_NO_SPEECH, GENDER_SILENCE

# Initialise Deeptone
engine = Deeptone(license_key="...")

output = engine.process_file(
    filename="PATH_TO_AUDIO_FILE",
    models=[engine.models.Gender],
    output_period=1024,
    channel=0,
    use_chunking=False,
    include_summary=True,
    include_transitions=True,
    include_raw_values=True,
    volume_threshold=0.005,
)

The returned object contains the time series with an analysis of the file broken down by the provided output period:

# Inspect the result
print(output)

print("Time series:")
for ts_result in output["channels"]["0"]["time_series"]:
    ts = ts_result["timestamp"]
    res = ts_result["results"]["gender"]
    print(f'Timestamp: {ts}ms\tresult: {res["result"]}\t'
          f'confidence: {res["confidence"]}')

print("\nRaw model outputs:")
for ts_result in output["channels"]["0"]["time_series"]:
    ts = ts_result["timestamp"]
    raw = ts_result["raw"]["gender"]
    print(f'Timestamp: {ts}ms\traw results: {GENDER_MALE}: '
          f'{raw[GENDER_MALE]}, {GENDER_FEMALE}: {raw[GENDER_FEMALE]}')

summary = output["channels"]["0"]["summary"]["gender"]
male = summary[f"{GENDER_MALE}_fraction"] * 100
female = summary[f"{GENDER_FEMALE}_fraction"] * 100
no_speech = summary[f"{GENDER_NO_SPEECH}_fraction"] * 100
unknown = summary[f"{GENDER_UNKNOWN}_fraction"] * 100
silence = summary[f"{GENDER_SILENCE}_fraction"] * 100
print(f'\nSummary: male: {male}%, female: {female}%, no_speech: {no_speech}%, '
      f'unknown: {unknown}%, silence: {silence}%')

print("\nTransitions:")
for ts_result in output["channels"]["0"]["transitions"]["gender"]:
    ts = ts_result["timestamp_start"]
    print(f'Timestamp: {ts}ms\tresult: {ts_result["result"]}\t'
          f'confidence: {ts_result["confidence"]}')

The output of the script would be something like:

Time series:
Timestamp: 0ms result: no_speech confidence: 0.6293
Timestamp: 1024ms result: female confidence: 0.9002
Timestamp: 2048ms result: female confidence: 0.4725
Timestamp: 3072ms result: female confidence: 0.4679
....
Raw model outputs:
Timestamp: 0ms raw results: male: 0.1791, female: 0.8209
Timestamp: 1024ms raw results: male: 0.0499, female: 0.9501
Timestamp: 2048ms raw results: male: 0.2638, female: 0.7362
Timestamp: 3072ms raw results: male: 0.266, female: 0.734
Summary: male: 0.0%, female: 67.74%, no_speech: 9.68%, unknown: 6.45%, silence: 16.13%
Transitions:
Timestamp: 0ms result: silence confidence: 1.0
Timestamp: 320ms result: no_speech confidence: 0.7723
Timestamp: 704ms result: silence confidence: 1.0
Timestamp: 768ms result: female confidence: 0.9002
Timestamp: 1408ms result: silence confidence: 1.0
Timestamp: 1472ms result: female confidence: 0.7137
Timestamp: 2880ms result: unknown confidence: 0.0771
Timestamp: 3136ms result: female confidence: 0.3961
Timestamp: 3712ms result: silence confidence: 1.0
Timestamp: 3776ms result: female confidence: 0.7101
Timestamp: 3840ms result: silence confidence: 1.0

Raw output:

{
  "channels": {
    "0": {
      "time_series": [
        {"timestamp": 0, "results": {"gender": {"result": "no_speech", "confidence": 0.6293}}, "raw": {"gender": {"male": 0.1791, "female": 0.8209}}},
        {"timestamp": 1024, "results": {"gender": {"result": "female", "confidence": 0.9002}}, "raw": {"gender": {"male": 0.0499, "female": 0.9501}}},
        {"timestamp": 2048, "results": {"gender": {"result": "female", "confidence": 0.4725}}, "raw": {"gender": {"male": 0.2638, "female": 0.7362}}},
        {"timestamp": 3072, "results": {"gender": {"result": "female", "confidence": 0.4679}}, "raw": {"gender": {"male": 0.266, "female": 0.734}}}
      ],
      "summary": {
        "gender": {"male_fraction": 0, "female_fraction": 0.6774, "no_speech_fraction": 0.0968, "unknown_fraction": 0.0645, "silence_fraction": 0.1613}
      },
      "transitions": {
        "gender": [
          {"timestamp_start": 0, "timestamp_end": 320, "result": "silence", "confidence": 1.0},
          {"timestamp_start": 320, "timestamp_end": 704, "result": "no_speech", "confidence": 0.7723},
          {"timestamp_start": 704, "timestamp_end": 768, "result": "silence", "confidence": 1.0},
          {"timestamp_start": 768, "timestamp_end": 1408, "result": "female", "confidence": 0.9002},
          {"timestamp_start": 1408, "timestamp_end": 1472, "result": "silence", "confidence": 1.0},
          {"timestamp_start": 1472, "timestamp_end": 2880, "result": "female", "confidence": 0.7137},
          {"timestamp_start": 2880, "timestamp_end": 3136, "result": "unknown", "confidence": 0.0771},
          {"timestamp_start": 3136, "timestamp_end": 3712, "result": "female", "confidence": 0.3961},
          {"timestamp_start": 3712, "timestamp_end": 3776, "result": "silence", "confidence": 1.0},
          {"timestamp_start": 3776, "timestamp_end": 3840, "result": "female", "confidence": 0.7101},
          {"timestamp_start": 3840, "timestamp_end": 3968, "result": "silence", "confidence": 1.0}
        ]
      }
    }
  }
}

Further examples

For more example usage of the summary and transitions, head to the Speech detection recipes and the Arousal detection recipes sections. For example usage of `raw` output to implement custom speech thresholds, head to Example 3 in Speech model recipes.
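As a taste of the custom-threshold idea mentioned above, here is a hypothetical sketch that re-decides each timestamp from the raw gender scores with its own threshold. The 0.7 value is an arbitrary illustration, not an SDK default, and the data is made up in the raw-output format shown earlier:

```python
# Hypothetical custom thresholding over raw model outputs. A class is only
# accepted when its raw score clears CUSTOM_THRESHOLD; otherwise we fall
# back to "unknown" instead of trusting the pre-computed result field.
CUSTOM_THRESHOLD = 0.7  # arbitrary illustrative value, not an SDK default

raw_entries = [
    {"timestamp": 0, "raw": {"gender": {"male": 0.1791, "female": 0.8209}}},
    {"timestamp": 1024, "raw": {"gender": {"male": 0.0499, "female": 0.9501}}},
    {"timestamp": 2048, "raw": {"gender": {"male": 0.2638, "female": 0.7362}}},
    {"timestamp": 3072, "raw": {"gender": {"male": 0.4, "female": 0.6}}},
]

decisions = []
for entry in raw_entries:
    scores = entry["raw"]["gender"]
    # Pick the highest-scoring class, then apply the custom threshold.
    label, score = max(scores.items(), key=lambda kv: kv[1])
    decisions.append(label if score >= CUSTOM_THRESHOLD else "unknown")

print(decisions)  # ['female', 'female', 'female', 'unknown']
```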