File Processing

DeepTone™'s File Processing functionality allows you to extract insights from your audio files.

Working with stereo files

DeepTone™ processes each audio channel separately. If you provide a stereo file, you can specify a single channel to be processed; otherwise, all channels will be processed separately.
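
For instance, using the process_file call documented under Example Usage below, you could restrict the analysis of a stereo file to its first channel (the file path here is a placeholder):

from deeptone import Deeptone

engine = Deeptone(license_key="...")
# Analyse only the first channel (0) of a stereo file; omitting the
# channel argument processes every channel separately.
output = engine.process_file(
    filename="PATH_TO_STEREO_FILE",
    models=[engine.models.Gender],
    output_period=1024,
    channel=0,
)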

Sample data

For the examples below, you can download this sample audio file of a woman speaking. For code samples, go to Example Usage.

Supported formats

Currently, only WAV files are supported. Ideally, the files should be 16-bit PCM with a sample rate of 16 kHz. If a file with a different sample rate is provided, it will be up- or down-sampled accordingly. Be aware, though, that files with sample rates lower than recommended may yield degraded analysis results.

If you're not sure whether your audio files meet these criteria, you can verify them with the CLI tool SoX:

sox --i PATH_TO_YOUR_AUDIO_FILE

The output will look similar to:

Input File : PATH_TO_YOUR_AUDIO_FILE
Channels : 1
Sample Rate : 16000
Precision : 16-bit
Duration : 00:00:03.99 = 63840 samples ~ 299.25 CDDA sectors
File Size : 128k
Bit Rate : 256k
Sample Encoding: 16-bit Signed Integer PCM

SoX can also convert files that don't match these criteria, using the following command:

sox PATH_TO_YOUR_AUDIO_FILE -b 16 PATH_TO_OUTPUT_FILE rate 16k
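
If you prefer to run this check from Python, the standard library's wave module can read the same header fields; a minimal sketch (verification only, no conversion):

import wave

# Inspect the WAV header with the standard library as an alternative
# to `sox --i`; note that wave only reads PCM WAV files and raises
# wave.Error for anything else.
with wave.open("PATH_TO_YOUR_AUDIO_FILE", "rb") as f:
    assert f.getsampwidth() == 2, "expected 16-bit samples"
    assert f.getframerate() == 16000, "expected a 16 kHz sample rate"
    print(f"channels: {f.getnchannels()}, "
          f"duration: {f.getnframes() / f.getframerate():.2f}s")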

Configuration options and outputs

The available configuration options and output types depend on the SDK language.

For code samples, go to Example Usage. For the detailed output specification, go to Output Specification.

Available configuration options

The following arguments can be passed to the process_file function:

  • filename - the path to the file to be analysed
  • models - the list of model names to use for the audio analysis
  • output_period - how often (in milliseconds, as a multiple of 64) the models' outputs should be returned
  • channel - optionally, the channel to analyse; otherwise, all channels will be analysed
  • include_summary - optionally, whether the output should contain a summary of the analysis; defaults to False
  • include_transitions - optionally, whether the output should contain the transitions of the analysis; defaults to False
  • include_raw_values - optionally, whether the result should contain the raw model outputs; defaults to False
  • use_chunking - optionally, whether the data should be chunked before analysis (recommended for large files to avoid memory issues)
  • volume_threshold - optionally, a volume threshold to use instead of the default (higher values will result in more of the audio being treated as silence)

Available outputs

The output always contains a plain time series; depending on the parameters you pass to the process_file function, up to three additions are included (a sketch of handling the optional parts follows the list):

  • a plain time series - the default output type, always returned
  • a plain time series with raw model outputs - raw values are appended when include_raw_values=True
  • a summary - appended to the results when include_summary=True
  • a simplified time series - appended to the results when include_transitions=True
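
Because the summary, the transitions, and the raw values only appear when requested, downstream code should check for their presence before using them. A minimal sketch, assuming output is the dictionary returned by process_file:

# `output` is the dictionary returned by process_file; the optional
# sections are only present when the corresponding flag was True.
channel = output["channels"]["0"]

if "summary" in channel:
    print("Summary:", channel["summary"]["gender"])
if "transitions" in channel:
    print("Transitions:", channel["transitions"]["gender"])

# Raw values, when requested, sit alongside "results" in each entry.
for entry in channel["time_series"]:
    if "raw" in entry:
        print(entry["timestamp"], entry["raw"]["gender"])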

For code samples, go to Example Usage. For the detailed output specification, go to Output Specification.

See below for examples of each of these outputs:

  • plain time series (according to the specified output_period):
{
  "channels": {
    "0": {
      "time_series": [
        {
          "timestamp": 0,
          "results": {
            "gender": {
              "result": "male",
              "confidence": 0.92
            }
          }
        },
        {
          "timestamp": 1024,
          "results": {
            "gender": {
              "result": "male",
              "confidence": 0.86
            }
          }
        },
        {
          "timestamp": 2048,
          "results": {
            "gender": {
              "result": "male",
              "confidence": 0.85
            }
          }
        },
        ...
        {
          "timestamp": 29696,
          "results": {
            "gender": {
              "result": "silence",
              "confidence": 1.0
            }
          }
        }
      ]
    }
  }
}
  • plain time series with additional raw outputs:
{
  "channels": {
    "0": {
      "time_series": [
        {
          "timestamp": 0,
          "results": {
            "gender": {
              "result": "male",
              "confidence": 0.92
            }
          },
          "raw": {
            "gender": {
              "male": 0.92,
              "female": 0.08
            }
          }
        },
        {
          "timestamp": 1024,
          "results": {
            "gender": {
              "result": "male",
              "confidence": 0.86
            }
          },
          "raw": {
            "gender": {
              "male": 0.86,
              "female": 0.14
            }
          }
        },
        {
          "timestamp": 2048,
          "results": {
            "gender": {
              "result": "male",
              "confidence": 0.85
            }
          },
          "raw": {
            "gender": {
              "male": 0.85,
              "female": 0.15
            }
          }
        },
        ...
        {
          "timestamp": 29696,
          "results": {
            "gender": {
              "result": "silence",
              "confidence": 1.0
            }
          },
          "raw": {
            "gender": {
              "male": 0.12,
              "female": 0.88
            }
          }
        }
      ]
    }
  }
}
  • summary (showing fraction of each class across the entire file):
{
  "channels": {
    "0": {
      "time_series": [ ... ],
      "summary": {
        "gender": {
          "male_fraction": 0.7451,
          "female_fraction": 0.1024,
          "no_speech_fraction": 0.112,
          "unknown_fraction": 0.0405,
          "silence_fraction": 0.0
        }
      }
    }
  }
}
  • simplified time series (indicating transition points between alternating results):
{
  "channels": {
    "0": {
      "time_series": [ ... ],
      "transitions": {
        "gender": [
          {
            "timestamp_start": 0,
            "timestamp_end": 1024,
            "result": "female",
            "confidence": 0.96
          },
          {
            "timestamp_start": 1024,
            "timestamp_end": 3072,
            "result": "male",
            "confidence": 0.87
          },
          ...
          {
            "timestamp_start": 8192,
            "timestamp_end": 12288,
            "result": "female",
            "confidence": 0.89
          }
        ]
      }
    }
  }
}
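
One common use of the simplified time series is to total how long each result was active across the file. A minimal sketch, assuming the transitions structure shown above and a result object named output:

from collections import defaultdict

# Sum the duration (in milliseconds) of each result class from the
# transitions; `output` is the dictionary returned by process_file.
durations = defaultdict(int)
for t in output["channels"]["0"]["transitions"]["gender"]:
    durations[t["result"]] += t["timestamp_end"] - t["timestamp_start"]

for result, ms in sorted(durations.items(), key=lambda kv: -kv[1]):
    print(f"{result}: {ms}ms")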

Example Usage

You can use the process_file method to process your audio files.

from deeptone import Deeptone
from deeptone.deeptone import (
    GENDER_MALE,
    GENDER_FEMALE,
    GENDER_UNKNOWN,
    GENDER_NO_SPEECH,
    GENDER_SILENCE,
)

# Initialise Deeptone
engine = Deeptone(license_key="...")

output = engine.process_file(
    filename="PATH_TO_AUDIO_FILE",
    models=[engine.models.Gender],
    output_period=1024,
    channel=0,
    use_chunking=False,
    include_summary=True,
    include_transitions=True,
    include_raw_values=True,
    volume_threshold=0.005,
)

The returned object contains the time series with an analysis of the file broken down by the provided output period:

# Inspect the result
print(output)

print("Time series:")
for ts_result in output["channels"]["0"]["time_series"]:
    ts = ts_result["timestamp"]
    res = ts_result["results"]["gender"]
    print(f'Timestamp: {ts}ms\tresult: {res["result"]}\t'
          f'confidence: {res["confidence"]}')

print("\nRaw model outputs:")
for ts_result in output["channels"]["0"]["time_series"]:
    ts = ts_result["timestamp"]
    raw = ts_result["raw"]["gender"]
    print(f'Timestamp: {ts}ms\traw results: {GENDER_MALE}: '
          f'{raw[GENDER_MALE]}, {GENDER_FEMALE}: {raw[GENDER_FEMALE]}')

summary = output["channels"]["0"]["summary"]["gender"]
male = summary[f"{GENDER_MALE}_fraction"] * 100
female = summary[f"{GENDER_FEMALE}_fraction"] * 100
no_speech = summary[f"{GENDER_NO_SPEECH}_fraction"] * 100
unknown = summary[f"{GENDER_UNKNOWN}_fraction"] * 100
silence = summary[f"{GENDER_SILENCE}_fraction"] * 100
print(f'\nSummary: male: {male}%, female: {female}%, no_speech: {no_speech}%, '
      f'unknown: {unknown}%, silence: {silence}%')

print("\nTransitions:")
for ts_result in output["channels"]["0"]["transitions"]["gender"]:
    ts = ts_result["timestamp_start"]
    print(f'Timestamp: {ts}ms\tresult: {ts_result["result"]}\t'
          f'confidence: {ts_result["confidence"]}')

The output of the script would be something like:

Time series:
Timestamp: 0ms result: no_speech confidence: 0.6293
Timestamp: 1024ms result: female confidence: 0.9002
Timestamp: 2048ms result: female confidence: 0.4725
Timestamp: 3072ms result: female confidence: 0.4679
...
Raw model outputs:
Timestamp: 0ms raw results: male: 0.1791, female: 0.8209
Timestamp: 1024ms raw results: male: 0.0499, female: 0.9501
Timestamp: 2048ms raw results: male: 0.2638, female: 0.7362
Timestamp: 3072ms raw results: male: 0.266, female: 0.734
Summary: male: 0.0%, female: 67.74%, no_speech: 9.68%, unknown: 6.45%, silence: 16.13%
Transitions:
Timestamp: 0ms result: silence confidence: 1.0
Timestamp: 320ms result: no_speech confidence: 0.7723
Timestamp: 704ms result: silence confidence: 1.0
Timestamp: 768ms result: female confidence: 0.9002
Timestamp: 1408ms result: silence confidence: 1.0
Timestamp: 1472ms result: female confidence: 0.7137
Timestamp: 2880ms result: unknown confidence: 0.0771
Timestamp: 3136ms result: female confidence: 0.3961
Timestamp: 3712ms result: silence confidence: 1.0
Timestamp: 3776ms result: female confidence: 0.7101
Timestamp: 3840ms result: silence confidence: 1.0

Raw output:

{
  "channels": {
    "0": {
      "time_series": [
        { "timestamp": 0, "results": { "gender": { "result": "no_speech", "confidence": 0.6293 }}, "raw": { "gender": { "male": 0.1791, "female": 0.8209 }}},
        { "timestamp": 1024, "results": { "gender": { "result": "female", "confidence": 0.9002 }}, "raw": { "gender": { "male": 0.0499, "female": 0.9501 }}},
        { "timestamp": 2048, "results": { "gender": { "result": "female", "confidence": 0.4725 }}, "raw": { "gender": { "male": 0.2638, "female": 0.7362 }}},
        { "timestamp": 3072, "results": { "gender": { "result": "female", "confidence": 0.4679 }}, "raw": { "gender": { "male": 0.266, "female": 0.734 }}}
      ],
      "summary": {
        "gender": { "male_fraction": 0.0, "female_fraction": 0.6774, "no_speech_fraction": 0.0968, "unknown_fraction": 0.0645, "silence_fraction": 0.1613 }
      },
      "transitions": {
        "gender": [
          { "timestamp_start": 0, "timestamp_end": 320, "result": "silence", "confidence": 1.0 },
          { "timestamp_start": 320, "timestamp_end": 704, "result": "no_speech", "confidence": 0.7723 },
          { "timestamp_start": 704, "timestamp_end": 768, "result": "silence", "confidence": 1.0 },
          { "timestamp_start": 768, "timestamp_end": 1408, "result": "female", "confidence": 0.9002 },
          { "timestamp_start": 1408, "timestamp_end": 1472, "result": "silence", "confidence": 1.0 },
          { "timestamp_start": 1472, "timestamp_end": 2880, "result": "female", "confidence": 0.7137 },
          { "timestamp_start": 2880, "timestamp_end": 3136, "result": "unknown", "confidence": 0.0771 },
          { "timestamp_start": 3136, "timestamp_end": 3712, "result": "female", "confidence": 0.3961 },
          { "timestamp_start": 3712, "timestamp_end": 3776, "result": "silence", "confidence": 1.0 },
          { "timestamp_start": 3776, "timestamp_end": 3840, "result": "female", "confidence": 0.7101 },
          { "timestamp_start": 3840, "timestamp_end": 3968, "result": "silence", "confidence": 1.0 }
        ]
      }
    }
  }
}
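
Since the result is indexed like a plain dictionary in the examples above, it can also be serialised for later inspection, for instance with the standard json module (analysis.json is an arbitrary file name):

import json

# Persist the analysis; `output` is the result of the process_file
# call above, and "analysis.json" is an arbitrary destination.
with open("analysis.json", "w") as f:
    json.dump(output, f, indent=2)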

Further examples

For more examples using the summary and the transitions, head to the Speech detection recipes and the Arousal detection recipes sections. For an example of using the `raw` output to implement custom speech thresholds, head to Example 3 in the Speech model recipes.