File Processing
DeepTone™'s File Processing functionality allows you to extract insights from your audio files.
Working with stereo files
DeepTone™ processes each audio channel separately. If you provide a stereo file, you can specify a single channel to be processed; otherwise, all channels will be processed independently.
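For instance, here is a minimal sketch using the process_file call documented below (the file path is a placeholder, and channel keys other than "0" are assumed to follow the same string-index pattern):
from deeptone import Deeptone

engine = Deeptone(license_key="...")

# No `channel` argument is given, so every channel of the stereo file
# is analysed separately.
output = engine.process_file(
    filename="PATH_TO_STEREO_FILE",  # placeholder path
    models=[engine.models.Gender],
    output_period=1024,
)

# Results are keyed by channel index ("0", "1", ...), one entry per channel.
for channel_id, channel_result in output["channels"].items():
    print(f"Channel {channel_id}: {len(channel_result['time_series'])} output frames")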
Sample data
You can download this sample audio file of a woman speaking to use in the examples below. For a code sample, go to Example Usage.
Configuration options and outputs
The available configuration options and output types depend on the SDK language. For a code sample, go to Example Usage. For the detailed output specification, go to Output Specification.
- Python
- Swift
Available configuration options
There are several possible arguments which can be passed to the `process_file` function:
- `filename` - the path to the file to be analysed. See Supported Audio Formats.
- `models` - the list of model names to use for the audio analysis. See all available models here.
- `output_period` - how often (in milliseconds, a multiple of 64) the output of the models should be returned.
- `channel` - optionally, a specific channel to analyse; otherwise all channels will be analysed.
- `include_summary` - optionally, whether the output should contain a summary of the analysis; defaults to False.
- `include_transitions` - optionally, whether the output should contain the transitions of the analysis; defaults to False.
- `include_raw_values` - optionally, whether the result should contain raw model outputs; defaults to False.
- `use_chunking` - optionally, whether the data should be chunked before the analysis (recommended for large files to avoid memory issues).
- `volume_threshold` - optionally, a volume level other than the default to be considered (higher values will result in more of the data being treated as silence).
- `voice_signatures` - optionally (only applies when using the SpeakerMap model), the voice signatures that are used to identify known speakers.
- `include_voice_signatures` - optionally (only applies when using the SpeakerMap model), whether the result should contain the voice signatures of the found speakers. If a `voice_signatures` object was provided as well, it will be updated with the voice signatures of new speakers. See Create Voice Signatures to find out more about voice signatures.
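To illustrate the SpeakerMap-specific options, here is a sketch (not taken from the official examples; the `engine.models.SpeakerMap` attribute name, the speaker labels and the signature data are assumptions or placeholders):
from deeptone import Deeptone

engine = Deeptone(license_key="...")

# Voice signatures of speakers we already know, in the format shown in the
# Create Voice Signatures section (labels and data strings are placeholders).
known_speakers = {
    "speaker_1": {"version": 1, "data": "..."},
}

output = engine.process_file(
    filename="PATH_TO_AUDIO_FILE",      # placeholder path
    models=[engine.models.SpeakerMap],  # assumed model attribute name
    output_period=1024,
    use_chunking=True,                  # recommended for large files
    voice_signatures=known_speakers,    # identify known speakers
    include_voice_signatures=True,      # also return signatures of newly found speakers
)

# Per the description above, `known_speakers` is updated with the signatures of
# new speakers, and each channel result contains a `voice_signatures` entry.
print(output["channels"]["0"]["voice_signatures"].keys())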
Available Outputs
There are five possible output types, depending on the parameters that you pass to the `process_file` function:
- a plain time series - the default output type, always returned
- a plain time series with raw model outputs - raw values are appended when `include_raw_values=True`
- a summary - appended to the results when `include_summary=True`
- a simplified time series - appended to the results when `include_transitions=True`
- voice signatures of speakers - appended to the results when `include_voice_signatures=True` (only applies when using the SpeakerMap model)
For a code sample, go to Example Usage. For the detailed output specification, go to Output Specification.
See below for examples of each of these outputs:
- plain time series (according to the specified `output_period`):
{
"channels": {
"0": {
"time_series": [
{
"timestamp": 0,
"results": {
"gender": {
"result": "male",
"confidence": 0.92
}
}
},
{
"timestamp": 1024,
"results": {
"gender": {
"result": "male",
"confidence": 0.86
}
}
},
{
"timestamp": 2048,
"results": {
"gender": {
"result": "male",
"confidence": 0.85
}
}
},
...
{
"timestamp": 29696,
"results":{
"gender": {
"result": "silence",
"confidence": 1.0
}
}
}
]
}
}
}
- plain time series with additional raw outputs:
{
"channels": {
"0": {
"time_series": [
{
"timestamp": 0,
"results": {
"gender": {
"result": "male",
"confidence": 0.92
}
},
"raw": {
"gender": {
"male": 0.92,
"female": 0.08
}
}
},
{
"timestamp": 1024,
"results": {
"gender": {
"result": "male",
"confidence": 0.86
}
},
"raw": {
"gender": {
"male": 0.86,
"female": 0.14
}
}
},
{
"timestamp": 2048,
"results":{
"gender": {
"result": "male",
"confidence": 0.85
}
},
"raw": {
"gender": {
"male": 0.85,
"female": 0.15
}
}
},
...
{
"timestamp": 29696,
"results": {
"gender": {
"result": "silence",
"confidence": 1.0
}
},
"raw": {
"gender": {
"male": 0.12,
"female": 0.88
}
}
}
]
}
}
}
- summary (showing the fraction of each class across the entire file):
{
"channels": {
"0": {
"time_series": [ ... ],
"summary": {
"gender": {
"male_fraction": 0.7451,
"female_fraction": 0.1024,
"other_fraction": 0.112,
"unknown_fraction": 0.0405,
"silence_fraction": 0.0,
},
}
}
}
}
- simplified time series (indicating transition points between alternating results):
{
"channels": {
"0": {
"time_series": [ ... ],
"transitions": {
"gender": [
{
"timestamp_start": 0,
"timestamp_end": 1024,
"result": "female",
"confidence": 0.96
},
{
"timestamp_start": 1024,
"timestamp_end": 3072,
"result": "male",
"confidence": 0.87
},
...
{
"timestamp_start": 8192,
"timestamp_end": 12288,
"result": "female",
"confidence": 0.89
}
]
}
}
}
}
- voice signatures of the detected speakers:
{
"channels": {
"0": {
"time_series": [ ... ],
"summary": { ... },
"transitions": { ... },
"voice_signatures": {
"speaker_1": {
"version": 1,
"data": "ZkxhQwAAACIQABAAAAUJABtAA+gA8AB+W8FZndQvQAyjv..."
},
"speaker_2": {
"version": 1,
"data": "T1RPIGlzIGdyZWF0ISBJdCdzIHRydWUu..."
}
}
}
}
}
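Since the summary, transitions and voice signatures are only appended when the corresponding flags are set, a defensive sketch along the following lines (assuming the result layout shown above, with the Gender model requested) can inspect whichever parts are present:
# `output` is the dictionary returned by process_file (structure as shown above).
for channel_id, channel_result in output["channels"].items():
    # Always present: the plain time series.
    print(f"Channel {channel_id}: {len(channel_result['time_series'])} frames")

    # Only present when include_summary=True.
    if "summary" in channel_result:
        print("Summary:", channel_result["summary"]["gender"])

    # Only present when include_transitions=True.
    if "transitions" in channel_result:
        print("Transitions:", len(channel_result["transitions"]["gender"]), "segments")

    # Only present when include_voice_signatures=True and the SpeakerMap model is used.
    if "voice_signatures" in channel_result:
        print("Speakers found:", list(channel_result["voice_signatures"]))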
Coming soon...
Example Usage
- Python
- Swift
You can use the `process_file` method to process your audio files.
from deeptone import Deeptone
from deeptone.deeptone import GENDER_MALE, GENDER_FEMALE, GENDER_UNKNOWN, GENDER_NO_SPEECH, GENDER_SILENCE
# Initialise Deeptone
engine = Deeptone(license_key="...")
output = engine.process_file(
    filename="PATH_TO_AUDIO_FILE",
    models=[engine.models.Gender],
    output_period=1024,
    channel=0,
    use_chunking=False,
    include_summary=True,
    include_transitions=True,
    include_raw_values=True,
    volume_threshold=0.005
)
The returned object contains the time series with an analysis of the file broken down by the provided output period:
# Inspect the result
print(output)

print("Time series:")
for ts_result in output["channels"]["0"]["time_series"]:
    ts = ts_result["timestamp"]
    res = ts_result["results"]["gender"]
    print(f'Timestamp: {ts}ms\tresult: {res["result"]}\t'
          f'confidence: {res["confidence"]}')

print("\nRaw model outputs:")
for ts_result in output["channels"]["0"]["time_series"]:
    ts = ts_result["timestamp"]
    raw = ts_result["raw"]["gender"]
    print(f'Timestamp: {ts}ms\traw results: {GENDER_MALE}: '
          f'{raw[GENDER_MALE]}, {GENDER_FEMALE}: {raw[GENDER_FEMALE]}')

summary = output["channels"]["0"]["summary"]["gender"]
male = summary[f"{GENDER_MALE}_fraction"] * 100
female = summary[f"{GENDER_FEMALE}_fraction"] * 100
no_speech = summary[f"{GENDER_NO_SPEECH}_fraction"] * 100
unknown = summary[f"{GENDER_UNKNOWN}_fraction"] * 100
silence = summary[f"{GENDER_SILENCE}_fraction"] * 100
print(f'\nSummary: male: {male}%, female: {female}%, no_speech: {no_speech}%, unknown: {unknown}%, silence: {silence}%')

print("\nTransitions:")
for ts_result in output["channels"]["0"]["transitions"]["gender"]:
    ts = ts_result["timestamp_start"]
    print(f'Timestamp: {ts}ms\tresult: {ts_result["result"]}\t'
          f'confidence: {ts_result["confidence"]}')
The output of the script would be something like:
Time series:
Timestamp: 0ms result: no_speech confidence: 0.6293
Timestamp: 1024ms result: female confidence: 0.9002
Timestamp: 2048ms result: female confidence: 0.4725
Timestamp: 3072ms result: female confidence: 0.4679
....
Raw model outputs:
Timestamp: 0ms raw results: male: 0.1791, female: 0.8209
Timestamp: 1024ms raw results: male: 0.0499, female: 0.9501
Timestamp: 2048ms raw results: male: 0.2638, female: 0.7362
Timestamp: 3072ms raw results: male: 0.266, female: 0.734
Summary: male: 0.0%, female: 67.74%, no_speech: 9.68%, unknown: 6.45%, silence: 16.13%
Transitions:
Timestamp: 0ms result: silence confidence: 1.0
Timestamp: 320ms result: no_speech confidence: 0.7723
Timestamp: 704ms result: silence confidence: 1.0
Timestamp: 768ms result: female confidence: 0.9002
Timestamp: 1408ms result: silence confidence: 1.0
Timestamp: 1472ms result: female confidence: 0.7137
Timestamp: 2880ms result: unknown confidence: 0.0771
Timestamp: 3136ms result: female confidence: 0.3961
Timestamp: 3712ms result: silence confidence: 1.0
Timestamp: 3776ms result: female confidence: 0.7101
Timestamp: 3840ms result: silence confidence: 1.0
Raw output:
{
"channels": {
"0": {
"time_series": [
{ "timestamp" : 0, "results":{ "gender": { "result": "no_speech", "confidence": 0.6293, }}, "raw": {"gender": {"male": 0.1791, "female": 0.8209}}},
{ "timestamp" : 1024, "results":{ "gender": { "result": "female", "confidence": 0.9002, }}, "raw": {"gender": {"male": 0.0499, "female": 0.9501}}},
{ "timestamp" : 2048, "results":{ "gender": { "result": "female", "confidence": 0.4725, }}, "raw": {"gender": {"male": 0.2638, "female": 0.7362}}},
{ "timestamp" : 3072, "results":{ "gender": { "result": "female", "confidence": 0.4679, }}, "raw": {"gender": {"male": 0.266, "female": 0.734}}},
],
"summary": {
"gender": { "male_fraction": 0, "female_fraction": 0.6774, "no_speech": 0.0968, "unknown_fraction": 0.0645, "silence_fraction": 0.1613 },
},
"transitions": {
"gender": [
{"timestamp_start": 0, "timestamp_end": 320, "result": "silence", "confidence": 1.0},
{"timestamp_start": 320, "timestamp_end": 704, "result": "no_speech", "confidence": 0.7723},
{"timestamp_start": 704, "timestamp_end": 768, "result": "silence", "confidence": 1.0},
{"timestamp_start": 768, "timestamp_end": 1408, "result": "female", "confidence": 0.9002},
{"timestamp_start": 1408, "timestamp_end": 1472, "result": "silence", "confidence": 1.0},
{"timestamp_start": 1472, "timestamp_end": 2880, "result": "female", "confidence": 0.7137},
{"timestamp_start": 2880, "timestamp_end": 3136, "result": "unknown", "confidence": 0.0771},
{"timestamp_start": 3136, "timestamp_end": 3712, "result": "female", "confidence": 0.3961},
{"timestamp_start": 3712, "timestamp_end": 3776, "result": "silence", "confidence": 1.0},
{"timestamp_start": 3776, "timestamp_end": 3840, "result": "female", "confidence": 0.7101},
{"timestamp_start": 3840, "timestamp_end": 3968, "result": "silence", "confidence": 1.0}
]
}
}
}
}
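As a further sketch building on the `transitions` structure above (not part of the SDK itself), the transition segments can be used to estimate how much of the analysed audio contains speech:
# Sum the duration of all transition segments classified as speech.
transitions = output["channels"]["0"]["transitions"]["gender"]
speech_ms = sum(
    segment["timestamp_end"] - segment["timestamp_start"]
    for segment in transitions
    if segment["result"] in ("male", "female")
)
total_ms = transitions[-1]["timestamp_end"]
print(f"Speech detected in {speech_ms} ms of {total_ms} ms analysed")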
Further examples
For more example usage of the `summary` and `transitions` outputs, head to the Speech detection recipes and the Arousal detection recipes sections. For example usage of the `raw` output to implement custom speech thresholds, head to Example 3 in the Speech model recipes.
Coming soon...