File Processing
DeepTone™'s File Processing functionality allows you to extract insights from your audio files.
Processing Methods
There are two ways to provide your audio files to the DeepTone™ Cloud API.
URL Method (Production Method)
With the URL method, the URL of the audio file you would like to process is sent in the JSON body of the POST request. The DeepTone™ Cloud API will then download the file and process it.
This is the recommended approach for production. The reasons for this are:
- The file size limits are larger
- The files are never stored on OTO's infrastructure
- It allows the DeepTone™ Cloud API to scale optimally which results in a higher throughput
Direct File Upload Method (Testing only)
For testing purposes, the content of local audio files can be uploaded directly with the POST request.
For a code sample that shows both methods, go to Example Usage.
Working with stereo files
DeepTone™ processes each audio channel separately. If you provide a stereo file, you can specify a channel to be processed; otherwise, all channels will be processed separately.
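Once you have a processing result, each channel's output sits under its own key in the JSON, indexed by channel number as a string. A minimal sketch (the helper name is ours, assuming the result shape shown in the output examples below) that groups the per-channel time series:

```python
def results_by_channel(result: dict) -> dict:
    # The API keys each channel's output by its index as a string
    # ("0", "1", ...); return {channel_index: time_series} pairs.
    return {
        int(channel): payload["time_series"]
        for channel, payload in result["channels"].items()
    }
```

For a mono file there will be a single entry under key 0; for a stereo file processed without the channel parameter, keys 0 and 1 each carry their own series.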
Sample data
You can download this sample audio file of a woman speaking for the examples below. For a code sample, go to Example Usage.
File download request validation
When executing file processing, if a file URL is provided, the requests to download the file will be signed using ES512, and can be verified using this Public Key:
Public Key ID:
49FM328PDqqaoRLDZ5ohiXbrKyV2nYf
Public Key:
MIGbMBAGByqGSM49AgEGBSuBBAAjA4GGAAQBllfqEc2DPRDTZz4rfJzBOg1kTsc01CXR8dlYxiK8N0ffbOwqJM49oXD/gLWGl3DnVdhfoGfqtz3buIiaHZSruYwAeiIWAPMj8PFUJpH2igfSujpME2i+z19feAv2t6MkgBQRIDBGPDPWlpghzJPLRcSCk4zYHB9zgeeDLjmXJEOiY5g=
The requests contain an HTTP header X-DeepTone-Request-Verification, which represents a JWT with the following information:
Header:
- alg: Algorithm used for signing (should match ES512)
- kid: The ID of the Key Pair used to sign this request. This ID can be used to find which Public Key from the provided set matches the one used to sign this request.
Claims:
- organization_id: The ID of your Organization. Should match your Organization's attributed ID
- project_id: The ID of the Project that initiated the call. Should match your Project's attributed ID
- job_id: The Job ID related to this request. Should match the Job ID returned when the file processing request was made
- url: The URL of the file to be downloaded. Should match the URL of the file that was requested to be processed
- iat: Timestamp representing when this request was made
- jti: Unique download request identifier
- exp: Timestamp representing when this JWT expires
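As a sketch of how the header and claims above can be inspected, the snippet below decodes the JWT segments using only the standard library and checks the listed fields. This is an illustration, not full verification — the ES512 signature itself should be checked with a proper JWT library (e.g. PyJWT) against the Public Key above — and the helper names are our own:

```python
import base64
import json

def _b64url_decode(segment: str) -> bytes:
    # JWT segments are base64url-encoded without padding; restore it.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))

def inspect_request_token(token: str):
    # Split the JWT from the X-DeepTone-Request-Verification header
    # into its decoded header and claims (the signature is not checked here).
    header_b64, claims_b64, _signature = token.split(".")
    header = json.loads(_b64url_decode(header_b64))
    claims = json.loads(_b64url_decode(claims_b64))
    return header, claims

def claims_look_valid(header, claims, *, expected_kid, expected_job_id, now):
    # Field checks only; the ES512 signature must still be verified with a
    # JWT library using the Public Key published above.
    return (
        header.get("alg") == "ES512"
        and header.get("kid") == expected_kid
        and claims.get("job_id") == expected_job_id
        and claims.get("exp", 0) > now
    )
```

In practice you would also compare organization_id, project_id and url against your own records before trusting the download request.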
Supported formats
Currently, processing WAV, FLAC and MP3 files is supported. Ideally, the files should be WAV 16-bit PCM with a sample rate of 16 kHz. If a different format is provided, the file will be converted accordingly. Please be aware that using files with lossy formats (e.g. MP3) or with sample rates lower than recommended may lead to a deterioration of the results.
If you're not sure whether your audio files meet these criteria, you can verify them with the CLI tool SoX by doing the following:
sox --i PATH_TO_YOUR_AUDIO_FILE
The result will be something similar to:
Input File : PATH_TO_YOUR_AUDIO_FILE
Channels : 1
Sample Rate : 16000
Precision : 16-bit
Duration : 00:00:03.99 = 63840 samples ~ 299.25 CDDA sectors
File Size : 128k
Bit Rate : 256k
Sample Encoding: 16-bit Signed Integer PCM
SoX also allows you to convert files that don't match these criteria by using the following command:
sox PATH_TO_YOUR_AUDIO_FILE -b 16 PATH_TO_OUTPUT_FILE rate 16k
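If you prefer to check WAV files from Python instead of SoX, a minimal sketch using the standard-library wave module (which handles PCM WAV files only; the helper name is ours) reads the same header fields SoX reports:

```python
import wave

def describe_wav(path: str) -> dict:
    # Read the WAV header fields with the standard-library wave module.
    with wave.open(path, "rb") as wav:
        info = {
            "channels": wav.getnchannels(),
            "sample_rate": wav.getframerate(),
            "bit_depth": wav.getsampwidth() * 8,
            "duration_s": wav.getnframes() / wav.getframerate(),
        }
    # The recommended input is 16-bit PCM at a 16 kHz sample rate.
    info["recommended"] = info["sample_rate"] == 16000 and info["bit_depth"] == 16
    return info
```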
Limitations
File size
Currently, file processing with the Cloud API is limited as follows:
- 15MB max file size for direct file upload
- 1 hour combined across all channels when providing a URL for the file (i.e. mono -> 1h, stereo -> 30 minutes, etc.)
Usage examples that fit these limits can be found below. If you would like to process larger files with the Cloud API, we provide methods to do so on our Troubleshooting page.
These constraints also apply to on-premise deployments of the DeepTone™ API; there, however, the limits are configurable.
Result size
The size of the JSON result is currently limited to 30MB for both the direct file upload and URL methods. To reduce the result size you can:
- Increase the output_period query parameter
- Create separate processing jobs for each model
If these measures are not an option for you, let us know at support@oto.ai.
Configuration options and outputs
There are different configuration parameters and types of outputs which can be requested.
For code sample go to Example Usage. For detailed output specification go to Output Specification.
Available configuration parameters
There are several possible parameters which can be passed to a POST request to the /jobs endpoint:
- models - the list of model names to use for the audio analysis
- output_period - how often (in milliseconds, a multiple of 64) the output of the models should be returned
- channel - optionally, a specific channel to analyse; otherwise all channels will be analysed
- include_summary - optionally, whether the output should contain a summary of the analysis; defaults to False
- include_transitions - optionally, whether the output should contain transitions of the analysis; defaults to False
- include_raw_values - optionally, whether the result should contain raw model outputs; defaults to False
- volume_threshold - optionally, a volume level different from the default (higher values will result in more of the data being treated as silence)
- callback - optionally, a callback URL that will be invoked once the results are ready. More info about this option here.
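As an illustration, the query string for the /jobs endpoint can be assembled from these parameters. A sketch (the helper name is ours; the multiple-of-64 check mirrors the output_period rule above):

```python
from urllib.parse import urlencode

def build_jobs_query(models, output_period, channel=None, **options):
    # output_period must be a multiple of 64 ms.
    if output_period % 64 != 0:
        raise ValueError("output_period must be a multiple of 64 ms")
    # Model names are joined with commas, as in "models=speech,arousal".
    params = {"models": ",".join(models), "output_period": output_period}
    if channel is not None:
        params["channel"] = channel
    # Remaining options, e.g. include_summary="true", volume_threshold=0.001
    params.update(options)
    return urlencode(params)
```

Note that urlencode percent-encodes the comma between model names, which is an equivalent, valid form of the same query string.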
Available Outputs
There are four possible output types, depending on the parameters that are set to true on the request:
- a plain time series - the default output type, always returned
- a plain time series with raw model outputs - raw values are appended when include_raw_values=true
- a summary - appended to the results when include_summary=true
- a simplified time series - appended to the results when include_transitions=true
See below for examples of each of these outputs:
- plain time series (according to the specified output_period):
{
"channels": {
"0": {
"time_series": [
{
"timestamp": 0,
"results": {
"gender": {
"result": "male",
"confidence": 0.92
}
}
},
{
"timestamp": 1024,
"results": {
"gender": {
"result": "male",
"confidence": 0.86
}
}
},
{
"timestamp": 2048,
"results": {
"gender": {
"result": "male",
"confidence": 0.85
}
}
},
...
{
"timestamp": 29696,
"results":{
"gender": {
"result": "silence",
"confidence": 1.0
}
}
}
]
}
}
}
- plain time series with additional raw outputs:
{
"channels": {
"0": {
"time_series": [
{
"timestamp": 0,
"results": {
"gender": {
"result": "male",
"confidence": 0.92
}
},
"raw": {
"gender": {
"male": 0.92,
"female": 0.08
}
}
},
{
"timestamp": 1024,
"results": {
"gender": {
"result": "male",
"confidence": 0.86
}
},
"raw": {
"gender": {
"male": 0.86,
"female": 0.14
}
}
},
{
"timestamp": 2048,
"results":{
"gender": {
"result": "male",
"confidence": 0.85
}
},
"raw": {
"gender": {
"male": 0.85,
"female": 0.15
}
}
},
...
{
"timestamp": 29696,
"results": {
"gender": {
"result": "silence",
"confidence": 1.0
}
},
"raw": {
"gender": {
"male": 0.12,
"female": 0.88
}
}
}
]
}
}
}
- summary (showing fraction of each class across the entire file):
{
"channels": {
"0": {
"time_series": [ ... ],
"summary": {
"gender": {
"male_fraction": 0.7451,
"female_fraction": 0.1024,
"no_speech_fraction": 0.112,
"unknown_fraction": 0.0405,
"silence_fraction": 0.0
}
}
}
}
}
- simplified time series (indicating transition points between alternating results):
{
"channels": {
"0": {
"time_series": [ ... ],
"transitions": {
"gender": [
{
"timestamp_start": 0,
"timestamp_end": 1024,
"result": "female",
"confidence": 0.96
},
{
"timestamp_start": 1024,
"timestamp_end": 3072,
"result": "male",
"confidence": 0.87
},
...
{
"timestamp_start": 8192,
"timestamp_end": 12288,
"result": "female",
"confidence": 0.89
}
]
}
}
}
}
Callbacks
The callback parameter expects a valid URL that will be invoked when your job finishes processing.
Once the results are ready, the API will invoke this endpoint using POST, with a body that matches the one returned by a GET request to the /file-processing/jobs/{jobId} endpoint.
If the invocation of the callback is successful, the API expects a 2XX status code. In case of an unsuccessful invocation (5XX status code), the API will retry invoking the endpoint up to 3 times. Because of this retry mechanism, the callback endpoint might receive multiple notifications, which should be handled by the user. Any other status codes will not trigger the retry mechanism. The callback endpoint provided should respond within 10 seconds when invoked, otherwise the request will time out.
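A minimal callback receiver sketch using only the Python standard library: it answers quickly with a 2XX and deduplicates retried notifications by job ID. The handler name, port, and in-memory dedupe set are our own choices; a real deployment would use persistent storage for seen job IDs:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class CallbackHandler(BaseHTTPRequestHandler):
    seen_jobs = set()  # retries can deliver the same notification twice

    def do_POST(self):
        # The body matches the GET /file-processing/jobs/{jobId} response.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        job_id = payload.get("id")
        if job_id not in CallbackHandler.seen_jobs:
            CallbackHandler.seen_jobs.add(job_id)
            # ...handle the finished job here, e.g. fetch its results...
        # Respond with a 2XX within 10 seconds so the API does not retry.
        self.send_response(200)
        self.end_headers()

    def log_message(self, fmt, *args):
        pass  # keep the example quiet

# To run: HTTPServer(("", 8080), CallbackHandler).serve_forever()
```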
Example Usage
- Shell + Curl
- Python
To process a file that is available at this URL: https://docs.oto.ai/api/audio/audio_sample.wav, use the following curl commands:
Create job:
curl --request POST \
--url 'https://api.oto.ai/file-processing/jobs?models=speech,arousal&output_period=4096&channel=0&include_summary=true&include_transitions=true&include_raw_values=true&volume_threshold=0.0' \
--header 'content-type: application/json' \
--header 'x-api-key: REPLACE_KEY_VALUE' \
--data '{"url":"https://docs.oto.ai/api/audio/audio_sample.wav"}'
Get results:
curl --request GET \
--url https://api.oto.ai/file-processing/jobs/REPLACE_JOB_ID/results \
--header 'x-api-key: REPLACE_KEY_VALUE'
To process a local file, use the following curl commands:
Create job:
curl --request POST \
--url 'https://api.oto.ai/file-processing/jobs?models=speech,arousal&output_period=4096&channel=0&include_summary=true&include_transitions=true&include_raw_values=true&volume_threshold=0.0' \
--header 'content-type: audio/wav' \
--header 'x-api-key: REPLACE_KEY_VALUE' \
--data-binary @path/to/local/file.wav
Get results:
curl --request GET \
--url https://api.oto.ai/file-processing/jobs/REPLACE_JOB_ID/results \
--header 'x-api-key: REPLACE_KEY_VALUE'
To submit a processing job for a file that is available at the URL https://docs.oto.ai/api/audio/audio_sample.wav, you can use the following code:
import http.client
import time
import json
# REPLACE with your key
API_KEY = "REPLACE_ME_WITH_YOUR_API_KEY"
conn = http.client.HTTPSConnection("api.oto.ai")
# REPLACE with your URL
payload = '{"url":"https://docs.oto.ai/api/audio/audio_sample.wav"}'
post_headers = {
'content-type': "application/json",
'x-api-key': API_KEY
}
conn.request("POST", "/file-processing/jobs?models=emotions&output_period=512&include_summary=true&include_transitions=true&include_raw_values=true&volume_threshold=0.001", payload, post_headers)
res = conn.getresponse()
data = res.read()
job_id = json.loads(data)["id"]
To submit a processing job for a local file, use the following Python code:
# REPLACE with your local wav file
local_audio_file = './sample_audio.wav'
with open(local_audio_file, 'rb') as f:
data = f.read()
post_headers = {
'content-type': "audio/wav",
'x-api-key': API_KEY
}
conn.request("POST", "/file-processing/jobs?models=gender&output_period=4096&include_summary=true&include_transitions=true&include_raw_values=true&volume_threshold=0.001", data, post_headers)
res = conn.getresponse()
data = res.read()
job_id = json.loads(data)["id"]
After the job is submitted we can periodically check its status and query the results once it's done:
get_headers = { 'x-api-key': API_KEY }
state = "new"
result = {}
while not (state == "done" or state == "error"):
time.sleep(1)
conn.request("GET", f"/file-processing/jobs/{job_id}", headers=get_headers)
res = conn.getresponse()
data = res.read()
result = json.loads(data)
state = result["state"]
print("Job configuration:", result["config"])
print("Job state:", state)
if state == "error":
print("There was an error: ", result["error_description"])
if state == "done":
conn.request("GET", f"/file-processing/jobs/{job_id}/results", headers=get_headers)
res = conn.getresponse()
data = res.read()
result = json.loads(data)["result"]
print(result)
Now, if the job was successful and the results were retrieved, they can be inspected. The returned object contains a time series with an analysis of the file, broken down by the provided output period:
print("Time series:")
for ts_result in result["channels"]["0"]["time_series"]:
ts = ts_result["timestamp"]
res = ts_result["results"]["gender"]
print(f'Timestamp: {ts}ms\tresult: {res["result"]}\t'
f'confidence: {res["confidence"]}')
print("\nRaw model outputs:")
for ts_result in result["channels"]["0"]["time_series"]:
ts = ts_result["timestamp"]
raw = ts_result["raw"]["gender"]
print(f'Timestamp: {ts}ms\traw results: male: '
f'{raw["male"]}, female: {raw["female"]}')
summary = result["channels"]["0"]["summary"]["gender"]
male = summary["male_fraction"] * 100
female = summary["female_fraction"] * 100
no_speech = summary["no_speech_fraction"] * 100
unknown = summary["unknown_fraction"] * 100
silence = summary["silence_fraction"] * 100
print(f'\nSummary: male: {male}%, female: {female}%, no_speech: {no_speech}%, unknown: {unknown}%, silence: {silence}%')
print("\nTransitions:")
for ts_result in result["channels"]["0"]["transitions"]["gender"]:
ts = ts_result["timestamp_start"]
print(f'Timestamp: {ts}ms\tresult: {ts_result["result"]}\t'
f'confidence: {ts_result["confidence"]}')
The output of the script would be something like:
Time series:
Timestamp: 0ms result: no_speech confidence: 0.6293
Timestamp: 1024ms result: female confidence: 0.9002
Timestamp: 2048ms result: female confidence: 0.4725
Timestamp: 3072ms result: female confidence: 0.4679
....
Raw model outputs:
Timestamp: 0ms raw results: male: 0.1791, female: 0.8209
Timestamp: 1024ms raw results: male: 0.0499, female: 0.9501
Timestamp: 2048ms raw results: male: 0.2638, female: 0.7362
Timestamp: 3072ms raw results: male: 0.266, female: 0.734
Summary: male: 0.0%, female: 67.74%, no_speech: 9.68%, unknown: 6.45%, silence: 16.13%
Transitions:
Timestamp: 0ms result: silence confidence: 1.0
Timestamp: 320ms result: no_speech confidence: 0.7723
Timestamp: 704ms result: silence confidence: 1.0
Timestamp: 768ms result: female confidence: 0.9002
Timestamp: 1408ms result: silence confidence: 1.0
Timestamp: 1472ms result: female confidence: 0.7137
Timestamp: 2880ms result: unknown confidence: 0.0771
Timestamp: 3136ms result: female confidence: 0.3961
Timestamp: 3712ms result: silence confidence: 1.0
Timestamp: 3776ms result: female confidence: 0.7101
Timestamp: 3840ms result: silence confidence: 1.0
Raw output:
{
"channels": {
"0": {
"time_series": [
{ "timestamp": 0, "results": { "gender": { "result": "no_speech", "confidence": 0.6293 }}, "raw": { "gender": { "male": 0.1791, "female": 0.8209 }}},
{ "timestamp": 1024, "results": { "gender": { "result": "female", "confidence": 0.9002 }}, "raw": { "gender": { "male": 0.0499, "female": 0.9501 }}},
{ "timestamp": 2048, "results": { "gender": { "result": "female", "confidence": 0.4725 }}, "raw": { "gender": { "male": 0.2638, "female": 0.7362 }}},
{ "timestamp": 3072, "results": { "gender": { "result": "female", "confidence": 0.4679 }}, "raw": { "gender": { "male": 0.266, "female": 0.734 }}}
],
"summary": {
"gender": { "male_fraction": 0, "female_fraction": 0.6774, "no_speech_fraction": 0.0968, "unknown_fraction": 0.0645, "silence_fraction": 0.1613 }
},
"transitions": {
"gender": [
{"timestamp_start": 0, "timestamp_end": 320, "result": "silence", "confidence": 1.0},
{"timestamp_start": 320, "timestamp_end": 704, "result": "no_speech", "confidence": 0.7723},
{"timestamp_start": 704, "timestamp_end": 768, "result": "silence", "confidence": 1.0},
{"timestamp_start": 768, "timestamp_end": 1408, "result": "female", "confidence": 0.9002},
{"timestamp_start": 1408, "timestamp_end": 1472, "result": "silence", "confidence": 1.0},
{"timestamp_start": 1472, "timestamp_end": 2880, "result": "female", "confidence": 0.7137},
{"timestamp_start": 2880, "timestamp_end": 3136, "result": "unknown", "confidence": 0.0771},
{"timestamp_start": 3136, "timestamp_end": 3712, "result": "female", "confidence": 0.3961},
{"timestamp_start": 3712, "timestamp_end": 3776, "result": "silence", "confidence": 1.0},
{"timestamp_start": 3776, "timestamp_end": 3840, "result": "female", "confidence": 0.7101},
{"timestamp_start": 3840, "timestamp_end": 3968, "result": "silence", "confidence": 1.0}
]
}
}
}
}
You can download the complete Python example script here.