DeepTone™'s File Processing functionality allows you to extract insights from your audio files.
Working with stereo files
DeepTone™ processes each audio channel separately. If you provide a stereo file, you can specify a single channel to be processed; otherwise, all channels will be processed separately.
Currently, only WAV files are supported. Ideally, the files should be 16-bit PCM with a sample rate of 16 kHz. If a different sample rate is provided, the file will be up- or down-sampled accordingly. Be aware, though, that using files with sample rates lower than recommended may degrade the analysis results.
If you're not sure whether your audio files meet these criteria, you can verify them with the CLI tool SoX:
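For instance, `soxi` (installed alongside SoX) prints a file's header information; the filename below is illustrative:

```shell
soxi your_audio.wav
```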
The result will be something similar to:
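As an illustration, for a hypothetical 10-second mono 16 kHz file the output might read:

```
Input File     : 'your_audio.wav'
Channels       : 1
Sample Rate    : 16000
Precision      : 16-bit
Duration       : 00:00:10.00 = 160000 samples ~ 750 CDDA sectors
File Size      : 320k
Bit Rate       : 256k
Sample Encoding: 16-bit Signed Integer PCM
```

Here you can check the `Channels`, `Sample Rate` and `Sample Encoding` fields against the criteria above.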
SoX can also convert your files if they don't match these criteria, using the following command:
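For example, to resample a file to 16-bit PCM at 16 kHz (the filenames are illustrative):

```shell
sox input.wav -r 16000 -b 16 -e signed-integer output.wav
```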
Configuration options and outputs
The available configuration options and output types vary depending on the SDK language.
Available configuration options
There are several possible arguments which can be passed to the `process_file` method:

- `filename` - the path to the file to be analysed
- `models` - the list of model names to use for the audio analysis
- `output_period` - how often (in milliseconds, a multiple of 64) the output of the models should be returned
- `channel` - optionally, a specific channel to analyse; otherwise all channels will be analysed
- `include_summary` - optionally, whether the output should contain a summary of the analysis; defaults to False
- `include_transitions` - optionally, whether the output should contain the transitions of the analysis; defaults to False
- `include_raw_values` - optionally, whether the result should contain raw model outputs; defaults to False
- `use_chunking` - optionally, whether the data should be chunked before analysis (recommended for large files to avoid memory issues)
- `volume_threshold` - optionally, a volume threshold different from the default (higher values will result in more of the data being treated as silence)
There are four possible output types, depending on the parameters that you pass to the `process_file` method:

- a plain time series - the default output type, always returned
- a plain time series with raw model outputs - raw values are appended when `include_raw_values` is set to `True`
- a summary - appended to the results when `include_summary` is set to `True`
- a simplified time series - appended to the results when `include_transitions` is set to `True`
See below for examples of each of the four outputs:

- plain time series (according to the specified `output_period`):
- plain time series with additional raw outputs:
- summary (showing the fraction of each class across the entire file):
- simplified time series (indicating transition points between alternating results):
You can use the `process_file` method to process your audio files.
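A minimal script might look like the following sketch. Note that the `deeptone` import, the licence-key constructor and the `"speech"` model identifier are illustrative assumptions, not verified parts of the SDK's API:

```python
# Illustrative sketch: package, class and model names are assumptions.
from deeptone import Deeptone

engine = Deeptone(license_key="YOUR_LICENSE_KEY")

result = engine.process_file(
    filename="recording.wav",  # 16-bit PCM WAV, ideally 16 kHz
    models=["speech"],         # assumed model identifier
    output_period=1024,        # one output every 1024 ms (multiple of 64)
    include_summary=True,      # also return a per-file summary
    use_chunking=True,         # recommended for large files
)
```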
The returned object contains the time series with an analysis of the file broken down by the provided output period:
The output of the script would be something like:
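For instance, a hypothetical result for a mono file could look like the following; the structure and field names here are illustrative assumptions, not the SDK's exact schema:

```
{
  "channels": {
    "0": {
      "time_series": [
        { "timestamp": 0,    "result": "speech",  "confidence": 0.94 },
        { "timestamp": 1024, "result": "speech",  "confidence": 0.91 },
        { "timestamp": 2048, "result": "silence", "confidence": 0.88 }
      ]
    }
  }
}
```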
For more example usage of the `transitions` output, head to the Speech detection recipes and the Arousal detection recipes sections. For example usage of the `raw` output to implement custom speech thresholds, head to Example 3 in Speech model recipes.