File Processing

DeepTone™'s File Processing functionality allows you to extract insights from your audio files.

Working with stereo files

DeepTone™ processes each audio channel separately. If you provide a stereo file, you can specify a single channel to be processed; otherwise, all channels will be processed separately.
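
For instance, using the process_file call documented under Example Usage below, you could restrict the analysis of a stereo file to its first channel (the file path here is a placeholder):

from deeptone import Deeptone

engine = Deeptone(license_key="...")
# Analyse only the first channel (0) of a stereo file; omitting the
# channel argument processes every channel separately.
output = engine.process_file(
    filename="PATH_TO_STEREO_FILE",
    models=[engine.models.Gender],
    output_period=1024,
    channel=0,
)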

Sample data

For the examples below, you can download this sample audio file of a woman speaking. For code samples, go to Example Usage.

Supported formats

Currently, only WAV files are supported. Ideally, the files should be 16-bit PCM with a sample rate of 16 kHz. If a file with a different sample rate is provided, it will be up- or down-sampled accordingly. Be aware, though, that files with sample rates lower than recommended may yield degraded analysis results.

If you're not sure whether your audio files meet these criteria, you can verify them with the CLI tool SoX:

sox --i PATH_TO_YOUR_AUDIO_FILE

The output will look similar to:

Input File : PATH_TO_YOUR_AUDIO_FILE
Channels : 1
Sample Rate : 16000
Precision : 16-bit
Duration : 00:00:03.99 = 63840 samples ~ 299.25 CDDA sectors
File Size : 128k
Bit Rate : 256k
Sample Encoding: 16-bit Signed Integer PCM

SoX can also convert files that don't match these criteria, using the following command:

sox PATH_TO_YOUR_AUDIO_FILE -b 16 PATH_TO_OUTPUT_FILE rate 16k
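
If you prefer to run this check from Python, the standard library's wave module can read the same header fields; a minimal sketch (verification only, no conversion):

import wave

# Inspect the WAV header with the standard library as an alternative
# to `sox --i`; note that wave only reads PCM WAV files and raises
# wave.Error for anything else.
with wave.open("PATH_TO_YOUR_AUDIO_FILE", "rb") as f:
    assert f.getsampwidth() == 2, "expected 16-bit samples"
    assert f.getframerate() == 16000, "expected a 16 kHz sample rate"
    print(f"channels: {f.getnchannels()}, "
          f"duration: {f.getnframes() / f.getframerate():.2f}s")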

Configuration options and outputs

The available configuration options and output types depend on the SDK language.

For code samples, go to Example Usage. For the detailed output specification, go to Output Specification.

Available configuration options

The following arguments can be passed to the process_file function:

  • filename - the path to the file to be analysed
  • models - the list of model names to use for the audio analysis
  • output_period - how often (in milliseconds, as a multiple of 64) the models' outputs should be returned
  • channel - optionally, the channel to analyse; otherwise, all channels will be analysed
  • include_summary - optionally, whether the output should contain a summary of the analysis; defaults to False
  • include_transitions - optionally, whether the output should contain the transitions of the analysis; defaults to False
  • include_raw_values - optionally, whether the result should contain the raw model outputs; defaults to False
  • use_chunking - optionally, whether the data should be chunked before analysis (recommended for large files to avoid memory issues)
  • volume_threshold - optionally, a volume threshold to use instead of the default (higher values will result in more of the audio being treated as silence)

Available outputs

The output always contains a plain time series; depending on the parameters you pass to the process_file function, up to three additions are included (a sketch of handling the optional parts follows the list):

  • a plain time series - the default output type, always returned
  • a plain time series with raw model outputs - raw values are appended when include_raw_values=True
  • a summary - appended to the results when include_summary=True
  • a simplified time series - appended to the results when include_transitions=True
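
Because the summary, the transitions, and the raw values only appear when requested, downstream code should check for their presence before using them. A minimal sketch, assuming output is the dictionary returned by process_file:

# `output` is the dictionary returned by process_file; the optional
# sections are only present when the corresponding flag was True.
channel = output["channels"]["0"]

if "summary" in channel:
    print("Summary:", channel["summary"]["gender"])
if "transitions" in channel:
    print("Transitions:", channel["transitions"]["gender"])

# Raw values, when requested, sit alongside "results" in each entry.
for entry in channel["time_series"]:
    if "raw" in entry:
        print(entry["timestamp"], entry["raw"]["gender"])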

For code samples, go to Example Usage. For the detailed output specification, go to Output Specification.

See below for examples of each of these outputs:

  • plain time series (according to the specified output_period):
{
  "channels": {
    "0": {
      "time_series": [
        {
          "timestamp": 0,
          "results": {
            "gender": {
              "result": "male",
              "confidence": 0.92
            }
          }
        },
        {
          "timestamp": 1024,
          "results": {
            "gender": {
              "result": "male",
              "confidence": 0.86
            }
          }
        },
        {
          "timestamp": 2048,
          "results": {
            "gender": {
              "result": "male",
              "confidence": 0.85
            }
          }
        },
        ...
        {
          "timestamp": 29696,
          "results": {
            "gender": {
              "result": "silence",
              "confidence": 1.0
            }
          }
        }
      ]
    }
  }
}
  • plain time series with additional raw outputs:
{
  "channels": {
    "0": {
      "time_series": [
        {
          "timestamp": 0,
          "results": {
            "gender": {
              "result": "male",
              "confidence": 0.92
            }
          },
          "raw": {
            "gender": {
              "male": 0.92,
              "female": 0.08
            }
          }
        },
        {
          "timestamp": 1024,
          "results": {
            "gender": {
              "result": "male",
              "confidence": 0.86
            }
          },
          "raw": {
            "gender": {
              "male": 0.86,
              "female": 0.14
            }
          }
        },
        {
          "timestamp": 2048,
          "results": {
            "gender": {
              "result": "male",
              "confidence": 0.85
            }
          },
          "raw": {
            "gender": {
              "male": 0.85,
              "female": 0.15
            }
          }
        },
        ...
        {
          "timestamp": 29696,
          "results": {
            "gender": {
              "result": "silence",
              "confidence": 1.0
            }
          },
          "raw": {
            "gender": {
              "male": 0.12,
              "female": 0.88
            }
          }
        }
      ]
    }
  }
}
  • summary (showing fraction of each class across the entire file):
{
  "channels": {
    "0": {
      "time_series": [ ... ],
      "summary": {
        "gender": {
          "male_fraction": 0.7451,
          "female_fraction": 0.1024,
          "no_speech_fraction": 0.112,
          "unknown_fraction": 0.0405,
          "silence_fraction": 0.0
        }
      }
    }
  }
}
  • simplified time series (indicating transition points between alternating results):
{
  "channels": {
    "0": {
      "time_series": [ ... ],
      "transitions": {
        "gender": [
          {
            "timestamp_start": 0,
            "timestamp_end": 1024,
            "result": "female",
            "confidence": 0.96
          },
          {
            "timestamp_start": 1024,
            "timestamp_end": 3072,
            "result": "male",
            "confidence": 0.87
          },
          ...
          {
            "timestamp_start": 8192,
            "timestamp_end": 12288,
            "result": "female",
            "confidence": 0.89
          }
        ]
      }
    }
  }
}
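
One common use of the simplified time series is to total how long each result was active across the file. A minimal sketch, assuming the transitions structure shown above and a result object named output:

from collections import defaultdict

# Sum the duration (in milliseconds) of each result class from the
# transitions; `output` is the dictionary returned by process_file.
durations = defaultdict(int)
for t in output["channels"]["0"]["transitions"]["gender"]:
    durations[t["result"]] += t["timestamp_end"] - t["timestamp_start"]

for result, ms in sorted(durations.items(), key=lambda kv: -kv[1]):
    print(f"{result}: {ms}ms")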

Example Usage

You can use the process_file method to process your audio files.

from deeptone import Deeptone
from deeptone.deeptone import (
    GENDER_MALE,
    GENDER_FEMALE,
    GENDER_UNKNOWN,
    GENDER_NO_SPEECH,
    GENDER_SILENCE,
)

# Initialise Deeptone
engine = Deeptone(license_key="...")

output = engine.process_file(
    filename="PATH_TO_AUDIO_FILE",
    models=[engine.models.Gender],
    output_period=1024,
    channel=0,
    use_chunking=False,
    include_summary=True,
    include_transitions=True,
    include_raw_values=True,
    volume_threshold=0.005,
)

The returned object contains the time series with an analysis of the file broken down by the provided output period:

# Inspect the result
print(output)

print("Time series:")
for ts_result in output["channels"]["0"]["time_series"]:
    ts = ts_result["timestamp"]
    res = ts_result["results"]["gender"]
    print(f'Timestamp: {ts}ms\tresult: {res["result"]}\t'
          f'confidence: {res["confidence"]}')

print("\nRaw model outputs:")
for ts_result in output["channels"]["0"]["time_series"]:
    ts = ts_result["timestamp"]
    raw = ts_result["raw"]["gender"]
    print(f'Timestamp: {ts}ms\traw results: {GENDER_MALE}: '
          f'{raw[GENDER_MALE]}, {GENDER_FEMALE}: {raw[GENDER_FEMALE]}')

summary = output["channels"]["0"]["summary"]["gender"]
male = summary[f"{GENDER_MALE}_fraction"] * 100
female = summary[f"{GENDER_FEMALE}_fraction"] * 100
no_speech = summary[f"{GENDER_NO_SPEECH}_fraction"] * 100
unknown = summary[f"{GENDER_UNKNOWN}_fraction"] * 100
silence = summary[f"{GENDER_SILENCE}_fraction"] * 100
print(f'\nSummary: male: {male}%, female: {female}%, no_speech: {no_speech}%, '
      f'unknown: {unknown}%, silence: {silence}%')

print("\nTransitions:")
for ts_result in output["channels"]["0"]["transitions"]["gender"]:
    ts = ts_result["timestamp_start"]
    print(f'Timestamp: {ts}ms\tresult: {ts_result["result"]}\t'
          f'confidence: {ts_result["confidence"]}')

The output of the script would be something like:

Time series:
Timestamp: 0ms result: no_speech confidence: 0.6293
Timestamp: 1024ms result: female confidence: 0.9002
Timestamp: 2048ms result: female confidence: 0.4725
Timestamp: 3072ms result: female confidence: 0.4679
...
Raw model outputs:
Timestamp: 0ms raw results: male: 0.1791, female: 0.8209
Timestamp: 1024ms raw results: male: 0.0499, female: 0.9501
Timestamp: 2048ms raw results: male: 0.2638, female: 0.7362
Timestamp: 3072ms raw results: male: 0.266, female: 0.734
Summary: male: 0.0%, female: 67.74%, no_speech: 9.68%, unknown: 6.45%, silence: 16.13%
Transitions:
Timestamp: 0ms result: silence confidence: 1.0
Timestamp: 320ms result: no_speech confidence: 0.7723
Timestamp: 704ms result: silence confidence: 1.0
Timestamp: 768ms result: female confidence: 0.9002
Timestamp: 1408ms result: silence confidence: 1.0
Timestamp: 1472ms result: female confidence: 0.7137
Timestamp: 2880ms result: unknown confidence: 0.0771
Timestamp: 3136ms result: female confidence: 0.3961
Timestamp: 3712ms result: silence confidence: 1.0
Timestamp: 3776ms result: female confidence: 0.7101
Timestamp: 3840ms result: silence confidence: 1.0

Raw output:

{
  "channels": {
    "0": {
      "time_series": [
        { "timestamp": 0, "results": { "gender": { "result": "no_speech", "confidence": 0.6293 }}, "raw": { "gender": { "male": 0.1791, "female": 0.8209 }}},
        { "timestamp": 1024, "results": { "gender": { "result": "female", "confidence": 0.9002 }}, "raw": { "gender": { "male": 0.0499, "female": 0.9501 }}},
        { "timestamp": 2048, "results": { "gender": { "result": "female", "confidence": 0.4725 }}, "raw": { "gender": { "male": 0.2638, "female": 0.7362 }}},
        { "timestamp": 3072, "results": { "gender": { "result": "female", "confidence": 0.4679 }}, "raw": { "gender": { "male": 0.266, "female": 0.734 }}}
      ],
      "summary": {
        "gender": { "male_fraction": 0.0, "female_fraction": 0.6774, "no_speech_fraction": 0.0968, "unknown_fraction": 0.0645, "silence_fraction": 0.1613 }
      },
      "transitions": {
        "gender": [
          { "timestamp_start": 0, "timestamp_end": 320, "result": "silence", "confidence": 1.0 },
          { "timestamp_start": 320, "timestamp_end": 704, "result": "no_speech", "confidence": 0.7723 },
          { "timestamp_start": 704, "timestamp_end": 768, "result": "silence", "confidence": 1.0 },
          { "timestamp_start": 768, "timestamp_end": 1408, "result": "female", "confidence": 0.9002 },
          { "timestamp_start": 1408, "timestamp_end": 1472, "result": "silence", "confidence": 1.0 },
          { "timestamp_start": 1472, "timestamp_end": 2880, "result": "female", "confidence": 0.7137 },
          { "timestamp_start": 2880, "timestamp_end": 3136, "result": "unknown", "confidence": 0.0771 },
          { "timestamp_start": 3136, "timestamp_end": 3712, "result": "female", "confidence": 0.3961 },
          { "timestamp_start": 3712, "timestamp_end": 3776, "result": "silence", "confidence": 1.0 },
          { "timestamp_start": 3776, "timestamp_end": 3840, "result": "female", "confidence": 0.7101 },
          { "timestamp_start": 3840, "timestamp_end": 3968, "result": "silence", "confidence": 1.0 }
        ]
      }
    }
  }
}
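
Since the result is indexed like a plain dictionary in the examples above, it can also be serialised for later inspection, for instance with the standard json module (analysis.json is an arbitrary file name):

import json

# Persist the analysis; `output` is the result of the process_file
# call above, and "analysis.json" is an arbitrary destination.
with open("analysis.json", "w") as f:
    json.dump(output, f, indent=2)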

Further examples

For more examples using the summary and the transitions, head to the Speech detection recipes and the Arousal detection recipes sections. For an example of using the `raw` output to implement custom speech thresholds, head to Example 3 in the Speech model recipes.