Output specification

Overview

The output of the DeepTone processing functions is JSON. When processing a stream, results are returned one time step at a time. When processing a file, all time steps are returned together with some additional file-level summaries. In both cases, raw model outputs can optionally be included in the result.

In this section we present the different components of the DeepTone output:

Single time step, single model

{
"timestamp":1024,
"results":{
"gender":{
"result": "no_speech",
"confidence": 0.7088
}
}
}

Single time step, multiple models

{
"timestamp":2048,
"results":{
"arousal":{
"result": "no_speech",
"confidence": 0.7597
},
"gender":{
"result": "no_speech",
"confidence": 0.7597
}
}
}

Single time step, single model, raw values included

{
"timestamp":1024,
"results":{
"gender":{
"result": "no_speech",
"confidence": 0.7088
}
},
"raw": {
"gender":{
"male": 0.1211,
"female": 0.8789
}
}
}
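
As a minimal sketch of consuming one such time step in Python, the snippet below parses the last example above and pairs each model's result with its highest-probability raw class when raw values are present. The helper name `describe_step` is hypothetical and not part of DeepTone.

```python
import json

# One time step with raw values, copied from the example above.
step_json = """
{
  "timestamp": 1024,
  "results": {"gender": {"result": "no_speech", "confidence": 0.7088}},
  "raw": {"gender": {"male": 0.1211, "female": 0.8789}}
}
"""


def describe_step(step):
    """Return {model: (result, confidence, top_raw_class)} for one time step.

    top_raw_class is None when the step carries no "raw" entry for the model.
    """
    raw = step.get("raw", {})
    described = {}
    for model, output in step["results"].items():
        probs = raw.get(model)
        top = max(probs, key=probs.get) if probs else None
        described[model] = (output["result"], output["confidence"], top)
    return described


step = json.loads(step_json)
print(describe_step(step))
```

In this example the reported result (`no_speech`) differs from the top raw class (`female`), so code that reads both should not assume they match.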

File-processing output

There are several additional components in the file processing output compared to the streaming output.

The additional functionalities listed below correspond to additional keys in the output:

  • processing multiple audio channels - channels
  • processing and delivering results for all time steps - time_series
  • calculating file-level summary - summary
  • calculating file-level transitions - transitions
  • including raw model outputs - raw_outputs

These differences are reflected in the keys of the JSON output described in the following sections.

You can also head straight to a complete example.

channels

The high-level structure of the file-processing output is shown below. The top-level key channels holds the data for each channel: 0, 1 or both.

For each channel, the output contains time_series, and optionally summary and transitions.

{
"channels":{
"0":{
"time_series":[],
"summary":{},
"transitions":{}
}
}
}
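
A short Python sketch of walking this structure follows. It assumes the output has already been loaded into a dictionary; the helper name `channel_sections` is hypothetical. Because channel identifiers are JSON object keys, they arrive as strings (`"0"`, not `0`).

```python
import json

# The skeleton shown above, loaded as a Python dictionary.
file_output = json.loads("""
{"channels": {"0": {"time_series": [], "summary": {}, "transitions": {}}}}
""")


def channel_sections(output):
    """Report, per channel, the number of time steps and which optional keys exist."""
    report = {}
    for channel_id, channel in output["channels"].items():
        report[channel_id] = {
            "steps": len(channel.get("time_series", [])),
            # summary and transitions are optional, so test for their presence.
            "optional": [k for k in ("summary", "transitions") if k in channel],
        }
    return report


print(channel_sections(file_output))
```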

time_series

The time_series key holds an array with one entry per time step. Each entry holds the model results under the results key, as defined in the sections above, and optionally the raw model outputs under raw.

The format is:

{
"channels":{
<CHANNEL_IDENTIFIER>:{
"time_series":[
{
"timestamp":<START_TS_OF_STEP>,
"results": {
<MODEL_NAME>:{
"result":<MODEL_RESULT>,
"confidence":<MODEL_CONFIDENCE>
}
}
},
{
"timestamp":<START_TS_OF_STEP>,
"results": {
<MODEL_NAME>:{
"result":<MODEL_RESULT>,
"confidence":<MODEL_CONFIDENCE>
}
}
},
...
],
"summary":{...},
"transitions":{...}
}
}
}

An example with real values from the Arousal model:

{
"channels":{
"0":{
"time_series":[
{
"timestamp":0,
"results": {
"arousal":{
"result":"no_speech",
"confidence":0.748
}
}
},
{
"timestamp":4096,
"results": {
"arousal":{
"result":"no_speech",
"confidence":0.7105
}
}
},
{
"timestamp":8192,
"results":{
"arousal":{
"result":"high",
"confidence":0.9926
}
}
},
...
],
"summary":{...},
"transitions":{...}
}
}
}
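
To illustrate how the time_series array might be read, the Python sketch below flattens the Arousal example above into rows of (timestamp, result, confidence). The function name `model_series` is an assumption made for illustration.

```python
# The three time steps from the Arousal example above.
output = {
    "channels": {
        "0": {
            "time_series": [
                {"timestamp": 0,
                 "results": {"arousal": {"result": "no_speech", "confidence": 0.748}}},
                {"timestamp": 4096,
                 "results": {"arousal": {"result": "no_speech", "confidence": 0.7105}}},
                {"timestamp": 8192,
                 "results": {"arousal": {"result": "high", "confidence": 0.9926}}},
            ]
        }
    }
}


def model_series(output, channel_id, model):
    """Return [(timestamp, result, confidence), ...] for one model on one channel."""
    rows = []
    for step in output["channels"][channel_id]["time_series"]:
        res = step["results"].get(model)
        if res is not None:  # skip steps that lack this model
            rows.append((step["timestamp"], res["result"], res["confidence"]))
    return rows


for row in model_series(output, "0", "arousal"):
    print(row)
```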

time_series with multiple models

If multiple models are used, this is reflected in the results object for each time step, as explained in the section on single time steps with multiple models.

For example, with speech and arousal, expect a key corresponding to each model's name in the results object for a single time step:

{
"channels":{
"0":{
"time_series":[
{
"timestamp":0,
"results":{
"speech":{
"result":"speech",
"confidence":0.748
},
"arousal":{
"result":"neutral",
"confidence":0.658
}
}
},
...
],
"summary":{...},
"transitions":{...}
}
}
}
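
Because every model appears under its own key in the same results object, results can be combined per time step. The following Python sketch keeps the arousal result only for steps that the speech model labelled as speech; the helper name `arousal_during_speech` is hypothetical.

```python
# The single time step from the example above.
time_series = [
    {
        "timestamp": 0,
        "results": {
            "speech": {"result": "speech", "confidence": 0.748},
            "arousal": {"result": "neutral", "confidence": 0.658},
        },
    }
]


def arousal_during_speech(time_series):
    """Return [(timestamp, arousal_result)] for steps whose speech result is "speech"."""
    rows = []
    for step in time_series:
        results = step["results"]
        is_speech = results.get("speech", {}).get("result") == "speech"
        if is_speech and "arousal" in results:
            rows.append((step["timestamp"], results["arousal"]["result"]))
    return rows


print(arousal_during_speech(time_series))
```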

time_series with raw model outputs

When you request raw values, an additional raw key will be included for each time step of the results. It will hold a dictionary of raw model outputs for every requested model, with all class names as keys and their respective probabilities as values.

The format is:

{
"channels":{
<CHANNEL_IDENTIFIER>:{
"time_series":[
{
"timestamp":<START_TS_OF_STEP>,
"results":{
<MODEL_NAME>:{
"result":<MODEL_RESULT>,
"confidence":<MODEL_CONFIDENCE>
}
},
"raw": {
<MODEL_NAME>:{
"<CLASS1>": <CLASS1_PROBABILITY>,
"<CLASS2>": <CLASS2_PROBABILITY>,
"<CLASS3>": <CLASS3_PROBABILITY>
}
}
},
{
"timestamp":<START_TS_OF_STEP>,
"results":{
<MODEL_NAME>:{
"result":<MODEL_RESULT>,
"confidence":<MODEL_CONFIDENCE>
}
},
"raw": {
<MODEL_NAME>:{
"<CLASS1>": <CLASS1_PROBABILITY>,
"<CLASS2>": <CLASS2_PROBABILITY>,
"<CLASS3>": <CLASS3_PROBABILITY>
}
}
},
...
],
"summary":{...},
"transitions":{...}
}
}
}

An example with real values from the Arousal model:

{
"channels":{
"0":{
"time_series":[
{
"timestamp":0,
"results":{
"arousal":{
"result":"no_speech",
"confidence":0.748
}
},
"raw": {
"arousal": {
"high": 0.1143,
"low": 0.0034,
"neutral": 0.8823
}
}
},
{
"timestamp":4096,
"results":{
"arousal":{
"result":"no_speech",
"confidence":0.7105
}
},
"raw": {
"arousal": {
"high": 0.0078,
"low": 0.2143,
"neutral": 0.7779
}
}
},
{
"timestamp":8192,
"results":{
"arousal":{
"result":"high",
"confidence":0.9926
}
},
"raw": {
"arousal": {
"high": 0.9926,
"low": 0.0071,
"neutral": 0.0003
}
}
},
...
],
"summary":{...},
"transitions":{...}
}
}
}
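
The example above shows that a reported result is not always one of the raw class names: `no_speech` does not appear among `high`, `low` and `neutral`. The Python sketch below makes that check explicit and also tests whether the raw values sum to roughly one, as they do in this example. The helper name `check_raw` and the tolerance are assumptions.

```python
# The three time steps from the Arousal example above.
steps = [
    {"timestamp": 0,
     "results": {"arousal": {"result": "no_speech", "confidence": 0.748}},
     "raw": {"arousal": {"high": 0.1143, "low": 0.0034, "neutral": 0.8823}}},
    {"timestamp": 4096,
     "results": {"arousal": {"result": "no_speech", "confidence": 0.7105}},
     "raw": {"arousal": {"high": 0.0078, "low": 0.2143, "neutral": 0.7779}}},
    {"timestamp": 8192,
     "results": {"arousal": {"result": "high", "confidence": 0.9926}},
     "raw": {"arousal": {"high": 0.9926, "low": 0.0071, "neutral": 0.0003}}},
]


def check_raw(step, model, tol=1e-3):
    """Compare the reported result of one model with its raw class probabilities."""
    probs = step["raw"][model]
    result = step["results"][model]["result"]
    return {
        "sums_to_one": abs(sum(probs.values()) - 1.0) <= tol,
        "result_is_raw_class": result in probs,
    }


for step in steps:
    print(step["timestamp"], check_raw(step, "arousal"))
```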

summary

The summary key holds file-level calculations that summarise the prevalence of each class for the corresponding model.

{
"channels":{
<CHANNEL_IDENTIFIER>:{
"time_series":[],
"summary":{
<MODEL_NAME>:{
"<CLASS1>_fraction":0.5416,
"<CLASS2>_fraction":0.2609,
"<CLASS3>_fraction":0.1975
}
},
"transitions":{...}
}
}
}

An example with both the arousal and speech models:

{
"channels":{
"0":{
"time_series":[],
"summary":{
"speech":{
"music_fraction":0.5416,
"other_fraction":0.2609,
"speech_fraction":0.1975,
"silence_fraction": 0.0
},
"arousal":{
"high_fraction":0.1793,
"low_fraction":0.0071,
"neutral_fraction":0.0111,
"no_speech_fraction":0.8025,
"silence_fraction": 0.0
}
},
"transitions":{...}
}
}
}
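
As a sketch of how the summary might be consumed, the Python snippet below picks the most prevalent class per model from the example above. The helper name `dominant_classes` is hypothetical, and the only thing it relies on is the `_fraction` suffix visible in the example keys.

```python
# The summary object from the example above.
summary = {
    "speech": {
        "music_fraction": 0.5416,
        "other_fraction": 0.2609,
        "speech_fraction": 0.1975,
        "silence_fraction": 0.0,
    },
    "arousal": {
        "high_fraction": 0.1793,
        "low_fraction": 0.0071,
        "neutral_fraction": 0.0111,
        "no_speech_fraction": 0.8025,
        "silence_fraction": 0.0,
    },
}


def dominant_classes(summary):
    """Return {model: (class_name, fraction)} for the most prevalent class."""
    dominant = {}
    for model, fractions in summary.items():
        key = max(fractions, key=fractions.get)
        # Strip the trailing "_fraction" to recover the class name.
        dominant[model] = (key.rsplit("_fraction", 1)[0], fractions[key])
    return dominant


print(dominant_classes(summary))
```

In this example the fractions of each model add up to one, which can serve as a quick sanity check on a parsed summary.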

transitions

The transitions key holds a summary of the time series. It indicates the start time and end time of each segment of audio, effectively collapsing consecutive time steps with the same model output. For more details on interpretation see the description of transitions in the corresponding model section.

Below is the output format, with an array of transitions for each model. Each element in the array represents an uninterrupted segment of the audio classified with the same value, e.g. music from timestamp_start to timestamp_end.

{
"channels":{
<CHANNEL_IDENTIFIER>:{
"time_series":[],
"summary":{},
"transitions":{
<MODEL_NAME>:[
{
"timestamp_start":<START_TS_OF_SEGMENT>,
"timestamp_end":<END_TS_OF_SEGMENT>,
"result":<MODEL_RESULT_FOR_SEGMENT>,
"confidence":<MODEL_CONFIDENCE>
},
{
"timestamp_start":<START_TS_OF_SEGMENT>,
"timestamp_end":<END_TS_OF_SEGMENT>,
"result":<MODEL_RESULT_FOR_SEGMENT>,
"confidence":<MODEL_CONFIDENCE>
},
...
]
}
}
}
}

In the following example with real data, you can see a music segment followed by a speech segment.

{
"channels":{
"0":{
"time_series":[],
"summary":{},
"transitions":{
"speech":[
{
"timestamp_start":0,
"timestamp_end":6144,
"result":"music",
"confidence":0.7751
},
{
"timestamp_start":6144,
"timestamp_end":8128,
"result":"speech",
"confidence":0.5445
},
...
]
}
}
}
}
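
The idea of collapsing time steps can be sketched in Python as follows. This is an illustration of the concept only, not the algorithm DeepTone uses: the segment end (one step length after the last time step of the segment) and the segment confidence (mean of the step confidences) are assumptions, as is the function name `collapse`.

```python
from itertools import groupby


def collapse(steps, model, step_length):
    """Collapse consecutive time steps with the same result into segments.

    Assumptions (not taken from the DeepTone documentation): steps are evenly
    spaced by step_length, a segment ends one step_length after its last time
    step, and its confidence is the mean confidence of its steps.
    """
    segments = []
    for result, group in groupby(steps, key=lambda s: s["results"][model]["result"]):
        group = list(group)
        confidences = [s["results"][model]["confidence"] for s in group]
        segments.append({
            "timestamp_start": group[0]["timestamp"],
            "timestamp_end": group[-1]["timestamp"] + step_length,
            "result": result,
            "confidence": sum(confidences) / len(confidences),
        })
    return segments


# Time steps taken from the Arousal time_series example earlier on this page.
steps = [
    {"timestamp": 0, "results": {"arousal": {"result": "no_speech", "confidence": 0.748}}},
    {"timestamp": 4096, "results": {"arousal": {"result": "no_speech", "confidence": 0.7105}}},
    {"timestamp": 8192, "results": {"arousal": {"result": "high", "confidence": 0.9926}}},
]

for segment in collapse(steps, "arousal", step_length=4096):
    print(segment)
```

Here the two leading `no_speech` steps merge into one segment from 0 to 8192, followed by a `high` segment starting at 8192.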

Complete example

Below you can find a complete example with the speech and arousal models. You can paste it into https://jsonformatter.curiousconcept.com/ to explore it more easily.

The commands used to generate it are:

Create job:

curl --request POST \
--url 'https://api.oto.ai/file-processing/jobs?models=speech,arousal&output_period=1024&channel=0&include_summary=true&include_transitions=true&include_raw_values=true&volume_threshold=0.0' \
--header 'content-type: application/json' \
--header 'x-api-key: REPLACE_KEY_VALUE' \
--data '{"url":"https://docs.oto.ai/api/audio/librispeech-84-121123-0001_female.wav"}'

Get results:

curl --request GET \
--url https://api.oto.ai/file-processing/jobs/REPLACE_JOB_ID/results \
--header 'x-api-key: REPLACE_KEY_VALUE'
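
For readers working in Python, the "Create job" request above could be assembled with the standard library roughly as sketched below. The helper name `build_create_job_request` is hypothetical; the URL, query parameters, headers and body are copied from the curl command. The request is only built, not sent, because the response to the create call is not shown on this page.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://api.oto.ai/file-processing/jobs"


def build_create_job_request(audio_url, api_key, **params):
    """Build (but do not send) the POST request that creates a processing job."""
    # safe="," keeps the model list readable, as in the curl command.
    query = urlencode(params, safe=",")
    body = json.dumps({"url": audio_url}).encode("utf-8")
    return Request(
        f"{BASE_URL}?{query}",
        data=body,
        headers={"content-type": "application/json", "x-api-key": api_key},
        method="POST",
    )


request = build_create_job_request(
    "https://docs.oto.ai/api/audio/librispeech-84-121123-0001_female.wav",
    "REPLACE_KEY_VALUE",
    models="speech,arousal",
    output_period=1024,
    channel=0,
    include_summary="true",
    include_transitions="true",
    include_raw_values="true",
    volume_threshold="0.0",
)
print(request.full_url)
# To send it: urllib.request.urlopen(request), with a real API key.
```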

Complete output:

{
"job_id": "402ce304dd0d4ebf8de8e7808af0e2bd",
"result": {
"channels": {
"0": {
"summary": {
"speech": {
"music_fraction": 0.0,
"other_fraction": 0.1774,
"speech_fraction": 0.8226,
"silence_fraction": 0.0
},
"arousal": {
"low_fraction": 0.0806,
"high_fraction": 0.371,
"neutral_fraction": 0.371,
"silence_fraction": 0.0,
"no_speech_fraction": 0.1774
}
},
"time_series": [
{
"raw": {
"speech": {
"music": 0.0974,
"other": 0.5136,
"speech": 0.389
},
"arousal": {
"low": 0.3482,
"high": 0.3185,
"neutral": 0.3333
}
},
"results": {
"speech": {
"result": "other",
"confidence": 0.5136
},
"arousal": {
"result": "no_speech",
"confidence": 0.5136
}
},
"timestamp": 0.0
},
{
"raw": {
"speech": {
"music": 0.0016,
"other": 0.0048,
"speech": 0.9936
},
"arousal": {
"low": 0.3148,
"high": 0.3729,
"neutral": 0.3124
}
},
"results": {
"speech": {
"result": "speech",
"confidence": 0.9936
},
"arousal": {
"result": "high",
"confidence": 0.3729
}
},
"timestamp": 1024.0
},
{
"raw": {
"speech": {
"music": 0.0077,
"other": 0.0106,
"speech": 0.9817
},
"arousal": {
"low": 0.2809,
"high": 0.356,
"neutral": 0.3631
}
},
"results": {
"speech": {
"result": "speech",
"confidence": 0.9817
},
"arousal": {
"result": "neutral",
"confidence": 0.3631
}
},
"timestamp": 2048.0
},
{
"raw": {
"speech": {
"music": 0.0014,
"other": 0.0023,
"speech": 0.9963
},
"arousal": {
"low": 0.3196,
"high": 0.1526,
"neutral": 0.5278
}
},
"results": {
"speech": {
"result": "speech",
"confidence": 0.9963
},
"arousal": {
"result": "neutral",
"confidence": 0.5278
}
},
"timestamp": 3072.0
}
],
"transitions": {
"speech": [
{
"result": "other",
"confidence": 0.7159,
"timestamp_end": 448.0,
"timestamp_start": 0.0
},
{
"result": "speech",
"confidence": 0.624,
"timestamp_end": 512.0,
"timestamp_start": 448.0
},
{
"result": "other",
"confidence": 0.6728,
"timestamp_end": 768.0,
"timestamp_start": 512.0
},
{
"result": "speech",
"confidence": 0.986,
"timestamp_end": 3968.0,
"timestamp_start": 768.0
}
],
"arousal": [
{
"result": "no_speech",
"confidence": 0.7159,
"timestamp_end": 448.0,
"timestamp_start": 0.0
},
{
"result": "low",
"confidence": 0.3795,
"timestamp_end": 512.0,
"timestamp_start": 448.0
},
{
"result": "no_speech",
"confidence": 0.6728,
"timestamp_end": 768.0,
"timestamp_start": 512.0
},
{
"result": "high",
"confidence": 0.4479,
"timestamp_end": 1216.0,
"timestamp_start": 768.0
},
{
"result": "neutral",
"confidence": 0.3965,
"timestamp_end": 1280.0,
"timestamp_start": 1216.0
},
{
"result": "low",
"confidence": 0.4107,
"timestamp_end": 1536.0,
"timestamp_start": 1280.0
},
{
"result": "high",
"confidence": 0.4204,
"timestamp_end": 2560.0,
"timestamp_start": 1536.0
},
{
"result": "neutral",
"confidence": 0.4904,
"timestamp_end": 3968.0,
"timestamp_start": 2560.0
}
]
}
}
}
}
}