Output specification
Overview
The DeepTone processing functions return JSON. When processing a stream, the results are returned one time step at a time. When processing a file, all time steps are returned, together with some additional file-level summaries. In both cases, raw model outputs can optionally be included in the result.
In this section we present the different components of the DeepTone output:
- JSON output for a single time step and single model - real-time case
- JSON output for a single time step and multiple models - real-time case
- JSON output for a single time step and single model returning raw values - real-time case
- JSON output for multiple time steps - file-processing case
Single time step, single model
{
"timestamp":1024,
"results":{
"gender":{
"result": "no_speech",
"confidence": 0.7088
}
}
}
Single time step, multiple models
{
"timestamp":2048,
"results":{
"arousal":{
"result": "no_speech",
"confidence": 0.7597
},
"gender":{
"result": "no_speech",
"confidence": 0.7597
}
}
}
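As a minimal sketch of how such a time step could be consumed, the snippet below parses the multi-model example above and flattens it into one row per model. The helper name `flatten_step` is hypothetical and not part of DeepTone; only the key names (`timestamp`, `results`, `result`, `confidence`) come from the examples in this section.

```python
import json

# The multi-model time step shown above, as it might arrive from a stream.
step_json = """
{
  "timestamp": 2048,
  "results": {
    "arousal": {"result": "no_speech", "confidence": 0.7597},
    "gender": {"result": "no_speech", "confidence": 0.7597}
  }
}
"""


def flatten_step(step):
    """Return one (timestamp, model, result, confidence) tuple per model."""
    return [
        (step["timestamp"], model, output["result"], output["confidence"])
        for model, output in step["results"].items()
    ]


rows = flatten_step(json.loads(step_json))
for row in rows:
    print(row)
```

Iterating over `results` rather than hard-coding model names keeps the same code usable whether one or several models were requested.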
Single time step, single model, raw values included
{
"timestamp":1024,
"results":{
"gender":{
"result": "no_speech",
"confidence": 0.7088
}
},
"raw": {
"gender":{
"male": 0.1211,
"female": 0.8789
}
}
}
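The sketch below reads the raw values from the example above and picks the class with the highest raw probability. The helper `top_raw_class` is a hypothetical name used for illustration. Note that in this example the reported result (`no_speech`) is not one of the raw classes, so the strongest raw class and the reported result need not agree.

```python
# The time step with raw values shown above.
step = {
    "timestamp": 1024,
    "results": {"gender": {"result": "no_speech", "confidence": 0.7088}},
    "raw": {"gender": {"male": 0.1211, "female": 0.8789}},
}


def top_raw_class(step, model):
    """Return (class_name, probability) for the largest raw output.

    Returns None when the step carries no raw values for the model,
    since the "raw" key is only present when raw values were requested.
    """
    raw = step.get("raw", {}).get(model)
    if not raw:
        return None
    name = max(raw, key=raw.get)
    return name, raw[name]


reported = step["results"]["gender"]["result"]
strongest = top_raw_class(step, "gender")
print(reported, strongest)
```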
File-processing output
There are several additional components in the file processing output compared to the streaming output.
The additional functionalities listed below correspond to additional keys in the output:
- processing multiple audio channels - channels
- processing and delivering results for all time steps - time_series
- calculating file-level summary - summary
- calculating file-level transitions - transitions
- including raw model outputs - raw_outputs
These differences are reflected in the keys of the JSON output described in the following sections.
You can also head straight to a complete example.
channels
When processing files, the high-level structure of the output is shown below. The top-level key channels holds the data for each channel - 0, 1 or both.
For each channel, the output contains time_series, and optionally summary and transitions.
{
"channels":{
"0":{
"time_series":[],
"summary":{},
"transitions":{}
}
}
}
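Because `summary` and `transitions` are optional, code that walks the output should not assume they are present. The following sketch, with the hypothetical helper `describe_channels`, reports what each channel contains.

```python
# The high-level structure shown above.
output = {
    "channels": {
        "0": {
            "time_series": [],
            "summary": {},
            "transitions": {},
        }
    }
}


def describe_channels(output):
    """Map each channel id to its step count and which optional keys exist."""
    report = {}
    for channel_id, data in output.get("channels", {}).items():
        report[channel_id] = {
            "steps": len(data.get("time_series", [])),
            "has_summary": "summary" in data,
            "has_transitions": "transitions" in data,
        }
    return report


print(describe_channels(output))
```

Channel identifiers are JSON object keys, so they arrive as strings (`"0"`, `"1"`) rather than integers.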
time_series
The time_series key holds an array with one entry per time step. Each entry has the format defined in the sections above: a timestamp, the model outputs under the results key and, optionally, the raw model outputs under raw.
The format is:
{
"channels":{
<CHANNEL_IDENTIFIER>:{
"time_series":[
{
"timestamp":<START_TS_OF_STEP>,
"results": {
<MODEL_NAME>:{
"result":<MODEL_RESULT>,
"confidence":<MODEL_CONFIDENCE>
}
}
},
{
"timestamp":<START_TS_OF_STEP>,
"results": {
<MODEL_NAME>:{
"result":<MODEL_RESULT>,
"confidence":<MODEL_CONFIDENCE>
}
}
},
...
],
"summary":{...},
"transitions":{...}
}
}
}
An example with real values from the Arousal model:
{
"channels":{
"0":{
"time_series":[
{
"timestamp":0,
"results": {
"arousal":{
"result":"no_speech",
"confidence":0.748
}
}
},
{
"timestamp":4096,
"results": {
"arousal":{
"result":"no_speech",
"confidence":0.7105
}
}
},
{
"timestamp":8192,
"results":{
"arousal":{
"result":"high",
"confidence":0.9926
}
}
},
...
],
"summary":{...},
"transitions":{...}
}
}
}
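A common task is to find the time steps where a model reported a particular class. The sketch below does this for the Arousal example above. The function name `timestamps_with_result` is hypothetical, and the elided entries (`...`) of the example are left out.

```python
# The three time steps from the Arousal example above.
output = {
    "channels": {
        "0": {
            "time_series": [
                {"timestamp": 0,
                 "results": {"arousal": {"result": "no_speech", "confidence": 0.748}}},
                {"timestamp": 4096,
                 "results": {"arousal": {"result": "no_speech", "confidence": 0.7105}}},
                {"timestamp": 8192,
                 "results": {"arousal": {"result": "high", "confidence": 0.9926}}},
            ]
        }
    }
}


def timestamps_with_result(output, channel, model, wanted):
    """Return the timestamps of all steps where `model` reported `wanted`."""
    steps = output["channels"][channel]["time_series"]
    return [
        step["timestamp"]
        for step in steps
        if step["results"].get(model, {}).get("result") == wanted
    ]


high_steps = timestamps_with_result(output, "0", "arousal", "high")
quiet_steps = timestamps_with_result(output, "0", "arousal", "no_speech")
print(high_steps, quiet_steps)
```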
time_series with multiple models
When multiple models are used, the results object for each time step contains one entry per model, as described in the section on single time steps with multiple models.
For example, with speech and arousal, expect a key corresponding to each model's name in the results object for a single time step:
{
"channels":{
"0":{
"time_series":[
{
"timestamp":0,
"results":{
"speech":{
"result":"speech",
"confidence":0.748
},
"arousal":{
"result":"neutral",
"confidence":0.658
}
}
},
...
],
"summary":{...},
"transitions":{...}
}
}
}
time_series with raw model outputs
When you request raw values, an additional raw key will be included for each time step of the results. It will hold a dictionary of raw model outputs for every requested model, with all class names as keys and their respective probabilities as values.
The format is:
{
"channels":{
<CHANNEL_IDENTIFIER>:{
"time_series":[
{
"timestamp":<START_TS_OF_STEP>,
"results":{
<MODEL_NAME>:{
"result":<MODEL_RESULT>,
"confidence":<MODEL_CONFIDENCE>
}
},
"raw": {
<MODEL_NAME>:{
"<CLASS1>": <CLASS1_PROBABILITY>,
"<CLASS2>": <CLASS2_PROBABILITY>,
"<CLASS3>": <CLASS3_PROBABILITY>
}
}
},
{
"timestamp":<START_TS_OF_STEP>,
"results":{
<MODEL_NAME>:{
"result":<MODEL_RESULT>,
"confidence":<MODEL_CONFIDENCE>
}
},
"raw": {
<MODEL_NAME>:{
"<CLASS1>": <CLASS1_PROBABILITY>,
"<CLASS2>": <CLASS2_PROBABILITY>,
"<CLASS3>": <CLASS3_PROBABILITY>
}
}
},
...
],
"summary":{...},
"transitions":{...}
}
}
}
An example with real values from the Arousal model:
{
"channels":{
"0":{
"time_series":[
{
"timestamp":0,
"results":{
"arousal":{
"result":"no_speech",
"confidence":0.748
}
},
"raw": {
"arousal": {
"high": 0.1143,
"low": 0.0034,
"neutral": 0.8823
}
}
},
{
"timestamp":4096,
"results":{
"arousal":{
"result":"no_speech",
"confidence":0.7105
}
},
"raw": {
"arousal": {
"high": 0.0078,
"low": 0.2143,
"neutral": 0.7779
}
}
},
{
"timestamp":8192,
"results":{
"arousal":{
"result":"high",
"confidence":0.9926
}
},
"raw": {
"arousal": {
"high": 0.9926,
"low": 0.0071,
"neutral": 0.0003
}
}
},
...
],
"summary":{...},
"transitions":{...}
}
}
}
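For plotting or further analysis it can be handy to turn the per-step raw dictionaries into one list of probabilities per class. The sketch below does so for the raw values of the example above. `raw_series` is a hypothetical helper, and it assumes every step reports the same set of classes for the model.

```python
# Raw values from the three time steps of the example above.
time_series = [
    {"timestamp": 0,
     "raw": {"arousal": {"high": 0.1143, "low": 0.0034, "neutral": 0.8823}}},
    {"timestamp": 4096,
     "raw": {"arousal": {"high": 0.0078, "low": 0.2143, "neutral": 0.7779}}},
    {"timestamp": 8192,
     "raw": {"arousal": {"high": 0.9926, "low": 0.0071, "neutral": 0.0003}}},
]


def raw_series(time_series, model):
    """Return {class_name: [probability per step]} for the given model.

    Steps without raw values for the model are skipped.
    """
    series = {}
    for step in time_series:
        for name, probability in step.get("raw", {}).get(model, {}).items():
            series.setdefault(name, []).append(probability)
    return series


series = raw_series(time_series, "arousal")
# Sum of the class probabilities at each step.
step_sums = [sum(values) for values in zip(*series.values())]
print(series["high"], step_sums)
```

In this example the class probabilities of each step add up to 1 (within rounding), which is what one would expect if the raw values are a probability distribution over the classes.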
summary
The summary key holds file-level calculations which summarise the prevalence of each class for a corresponding model.
{
"channels":{
<CHANNEL_IDENTIFIER>:{
"time_series":[],
"summary":{
<MODEL_NAME>:{
"<CLASS1>_fraction":0.5416,
"<CLASS2>_fraction":0.2609,
"<CLASS3>_fraction":0.1975
}
},
"transitions":{...}
}
}
}
An example with both the arousal and speech models:
{
"channels":{
"0":{
"time_series":[],
"summary":{
"speech":{
"music_fraction":0.5416,
"other_fraction":0.2609,
"speech_fraction":0.1975,
"silence_fraction": 0.0
},
"arousal":{
"high_fraction":0.1793,
"low_fraction":0.0071,
"neutral_fraction":0.0111,
"no_speech_fraction":0.8025,
"silence_fraction": 0.0
}
},
"transitions":{...}
}
}
}
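The sketch below reads the summary from the example above, finds the most prevalent class for each model and checks that the fractions add up to 1. The helper names are hypothetical. That the fractions always sum to 1 is an assumption based on the examples in this section, not a documented guarantee.

```python
# The summary object from the example above.
summary = {
    "speech": {
        "music_fraction": 0.5416,
        "other_fraction": 0.2609,
        "speech_fraction": 0.1975,
        "silence_fraction": 0.0,
    },
    "arousal": {
        "high_fraction": 0.1793,
        "low_fraction": 0.0071,
        "neutral_fraction": 0.0111,
        "no_speech_fraction": 0.8025,
        "silence_fraction": 0.0,
    },
}

SUFFIX = "_fraction"


def dominant_class(model_summary):
    """Return (class_name, fraction) of the most prevalent class."""
    key = max(model_summary, key=model_summary.get)
    name = key[: -len(SUFFIX)] if key.endswith(SUFFIX) else key
    return name, model_summary[key]


def fractions_sum_to_one(model_summary, tolerance=1e-3):
    """Check that the fractions add up to 1, allowing for rounding."""
    return abs(sum(model_summary.values()) - 1.0) <= tolerance


for model, model_summary in summary.items():
    print(model, dominant_class(model_summary), fractions_sum_to_one(model_summary))
```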
transitions
The transitions key holds a summary of the time series. It gives the start and end time of each segment of audio, effectively collapsing consecutive time steps with the same model output. For more details on interpretation, see the description of transitions in the corresponding model section.
Below is the output format, with an array of transitions for each model. Each element in the array represents an uninterrupted segment of the audio classified with the same value - e.g. music from timestamp_start to timestamp_end.
{
"channels":{
<CHANNEL_IDENTIFIER>:{
"time_series":[],
"summary":{},
"transitions":{
<MODEL_NAME>:[
{
"timestamp_start":<START_TS_OF_SEGMENT>,
"timestamp_end":<END_TS_OF_SEGMENT>,
"result":<MODEL_RESULT_FOR_SEGMENT>,
"confidence":<MODEL_CONFIDENCE>
},
{
"timestamp_start":<START_TS_OF_SEGMENT>,
"timestamp_end":<END_TS_OF_SEGMENT>,
"result":<MODEL_RESULT_FOR_SEGMENT>,
"confidence":<MODEL_CONFIDENCE>
},
...
]
}
}
}
}
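To illustrate the idea of collapsing time steps into segments, here is a minimal run-length sketch. It is not the algorithm the service uses: the function name `collapse`, the `period` argument used to close each segment, and the averaging of confidences are all assumptions made for illustration. The transitions in the complete example further down change at timestamps that are not multiples of the output period, so the service presumably works on a finer grid than the returned time series.

```python
def collapse(time_series, model, period):
    """Merge consecutive steps with the same result into segments.

    `period` is the assumed spacing between time steps; each segment ends
    one period after the start of its last step. The segment confidence is
    the mean of the merged step confidences (an assumption).
    """
    segments = []
    for step in time_series:
        output = step["results"][model]
        start = step["timestamp"]
        if segments and segments[-1]["result"] == output["result"]:
            # Same class as the previous step: extend the open segment.
            segments[-1]["timestamp_end"] = start + period
            segments[-1]["_confidences"].append(output["confidence"])
        else:
            # Class changed: open a new segment.
            segments.append({
                "timestamp_start": start,
                "timestamp_end": start + period,
                "result": output["result"],
                "_confidences": [output["confidence"]],
            })
    for segment in segments:
        confidences = segment.pop("_confidences")
        segment["confidence"] = sum(confidences) / len(confidences)
    return segments


# Time steps taken from the Arousal time_series example earlier in this page.
steps = [
    {"timestamp": 0,
     "results": {"arousal": {"result": "no_speech", "confidence": 0.748}}},
    {"timestamp": 4096,
     "results": {"arousal": {"result": "no_speech", "confidence": 0.7105}}},
    {"timestamp": 8192,
     "results": {"arousal": {"result": "high", "confidence": 0.9926}}},
]

segments = collapse(steps, "arousal", period=4096)
for segment in segments:
    print(segment)
```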
In the example with real data, you can see a music segment, followed by a speech segment.
{
"channels":{
"0":{
"time_series":[],
"summary":{},
"transitions":{
"speech":[
{
"timestamp_start":0,
"timestamp_end":6144,
"result":"music",
"confidence":0.7751
},
{
"timestamp_start":6144,
"timestamp_end":8128,
"result":"speech",
"confidence":0.5445
},
...
]
}
}
}
}
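Since each transition carries a start and an end timestamp, the total time spent in each class can be derived directly from the segments. The sketch below does this for the two segments of the example above. `time_per_result` is a hypothetical helper, and the totals are in the same units as the timestamps.

```python
from collections import defaultdict

# The two speech-model segments from the example above.
transitions = {
    "speech": [
        {"timestamp_start": 0, "timestamp_end": 6144,
         "result": "music", "confidence": 0.7751},
        {"timestamp_start": 6144, "timestamp_end": 8128,
         "result": "speech", "confidence": 0.5445},
    ]
}


def time_per_result(segments):
    """Sum the duration of all segments, grouped by their result."""
    totals = defaultdict(int)
    for segment in segments:
        duration = segment["timestamp_end"] - segment["timestamp_start"]
        totals[segment["result"]] += duration
    return dict(totals)


durations = time_per_result(transitions["speech"])
print(durations)
```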
Complete example
Below you can find a complete example with speech and arousal models. You can paste it in https://jsonformatter.curiousconcept.com/ to explore it more easily.
The commands used to generate it are:
Create job:
curl --request POST \
--url 'https://api.oto.ai/file-processing/jobs?models=speech,arousal&output_period=1024&channel=0&include_summary=true&include_transitions=true&include_raw_values=true&volume_threshold=0.0' \
--header 'content-type: application/json' \
--header 'x-api-key: REPLACE_KEY_VALUE' \
--data '{"url":"https://docs.oto.ai/api/audio/librispeech-84-121123-0001_female.wav"}'
Get results:
curl --request GET \
--url https://api.oto.ai/file-processing/jobs/REPLACE_JOB_ID/results \
--header 'x-api-key: REPLACE_KEY_VALUE'
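For readers who prefer Python over curl, the sketch below builds the same "create job" request with the standard library. The endpoint, query parameters, headers and body are copied from the curl command above. The function name `build_create_job_request` is hypothetical, and the request is only constructed here, not sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

API_BASE = "https://api.oto.ai/file-processing/jobs"


def build_create_job_request(api_key, audio_url):
    """Build (but do not send) the POST request that creates a job."""
    params = {
        "models": "speech,arousal",
        "output_period": 1024,
        "channel": 0,
        "include_summary": "true",
        "include_transitions": "true",
        "include_raw_values": "true",
        "volume_threshold": 0.0,
    }
    # Keep the comma in the model list unescaped, as in the curl command.
    url = f"{API_BASE}?{urlencode(params, safe=',')}"
    body = json.dumps({"url": audio_url}).encode("utf-8")
    headers = {"content-type": "application/json", "x-api-key": api_key}
    return Request(url, data=body, headers=headers, method="POST")


request = build_create_job_request(
    "REPLACE_KEY_VALUE",
    "https://docs.oto.ai/api/audio/librispeech-84-121123-0001_female.wav",
)
print(request.get_method(), request.full_url)
```

The request could then be sent with `urllib.request.urlopen(request)`. The shape of the response to the create call is not shown in this section, so handling it is left out.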
Complete output:
{
"job_id": "402ce304dd0d4ebf8de8e7808af0e2bd",
"result": {
"channels": {
"0": {
"summary": {
"speech": {
"music_fraction": 0.0,
"other_fraction": 0.1774,
"speech_fraction": 0.8226,
"silence_fraction": 0.0
},
"arousal": {
"low_fraction": 0.0806,
"high_fraction": 0.371,
"neutral_fraction": 0.371,
"silence_fraction": 0.0,
"no_speech_fraction": 0.1774
}
},
"time_series": [
{
"raw": {
"speech": {
"music": 0.0974,
"other": 0.5136,
"speech": 0.389
},
"arousal": {
"low": 0.3482,
"high": 0.3185,
"neutral": 0.3333
}
},
"results": {
"speech": {
"result": "other",
"confidence": 0.5136
},
"arousal": {
"result": "no_speech",
"confidence": 0.5136
}
},
"timestamp": 0.0
},
{
"raw": {
"speech": {
"music": 0.0016,
"other": 0.0048,
"speech": 0.9936
},
"arousal": {
"low": 0.3148,
"high": 0.3729,
"neutral": 0.3124
}
},
"results": {
"speech": {
"result": "speech",
"confidence": 0.9936
},
"arousal": {
"result": "high",
"confidence": 0.3729
}
},
"timestamp": 1024.0
},
{
"raw": {
"speech": {
"music": 0.0077,
"other": 0.0106,
"speech": 0.9817
},
"arousal": {
"low": 0.2809,
"high": 0.356,
"neutral": 0.3631
}
},
"results": {
"speech": {
"result": "speech",
"confidence": 0.9817
},
"arousal": {
"result": "neutral",
"confidence": 0.3631
}
},
"timestamp": 2048.0
},
{
"raw": {
"speech": {
"music": 0.0014,
"other": 0.0023,
"speech": 0.9963
},
"arousal": {
"low": 0.3196,
"high": 0.1526,
"neutral": 0.5278
}
},
"results": {
"speech": {
"result": "speech",
"confidence": 0.9963
},
"arousal": {
"result": "neutral",
"confidence": 0.5278
}
},
"timestamp": 3072.0
}
],
"transitions": {
"speech": [
{
"result": "other",
"confidence": 0.7159,
"timestamp_end": 448.0,
"timestamp_start": 0.0
},
{
"result": "speech",
"confidence": 0.624,
"timestamp_end": 512.0,
"timestamp_start": 448.0
},
{
"result": "other",
"confidence": 0.6728,
"timestamp_end": 768.0,
"timestamp_start": 512.0
},
{
"result": "speech",
"confidence": 0.986,
"timestamp_end": 3968.0,
"timestamp_start": 768.0
}
],
"arousal": [
{
"result": "no_speech",
"confidence": 0.7159,
"timestamp_end": 448.0,
"timestamp_start": 0.0
},
{
"result": "low",
"confidence": 0.3795,
"timestamp_end": 512.0,
"timestamp_start": 448.0
},
{
"result": "no_speech",
"confidence": 0.6728,
"timestamp_end": 768.0,
"timestamp_start": 512.0
},
{
"result": "high",
"confidence": 0.4479,
"timestamp_end": 1216.0,
"timestamp_start": 768.0
},
{
"result": "neutral",
"confidence": 0.3965,
"timestamp_end": 1280.0,
"timestamp_start": 1216.0
},
{
"result": "low",
"confidence": 0.4107,
"timestamp_end": 1536.0,
"timestamp_start": 1280.0
},
{
"result": "high",
"confidence": 0.4204,
"timestamp_end": 2560.0,
"timestamp_start": 1536.0
},
{
"result": "neutral",
"confidence": 0.4904,
"timestamp_end": 3968.0,
"timestamp_start": 2560.0
}
]
}
}
}
}
}
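Note that the complete output above wraps the channels in an envelope with job_id and result keys, whereas the earlier sections show channels at the top level. The sketch below, with the hypothetical helper `channels_of`, accepts either shape. The payload is an abbreviated copy of the complete output.

```python
# Abbreviated copy of the complete output above.
payload = {
    "job_id": "402ce304dd0d4ebf8de8e7808af0e2bd",
    "result": {
        "channels": {
            "0": {
                "summary": {
                    "speech": {
                        "music_fraction": 0.0,
                        "other_fraction": 0.1774,
                        "speech_fraction": 0.8226,
                        "silence_fraction": 0.0,
                    }
                },
                "time_series": [],
                "transitions": {},
            }
        }
    },
}


def channels_of(payload):
    """Return the channels mapping, with or without the job envelope."""
    result = payload.get("result")
    if isinstance(result, dict) and "channels" in result:
        return result["channels"]
    return payload["channels"]


channels = channels_of(payload)
print(list(channels), channels["0"]["summary"]["speech"]["speech_fraction"])
```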