
Real Time Processing

The DeepTone™ Cloud API can be used to process a real time audio stream.

Supported Formats

All data is sent to the streaming endpoint as binary-type WebSocket messages whose payloads are the raw audio data; any other payload will cause an error to be returned and the connection to be closed. Because the protocol is full-duplex, you can stream audio to DeepTone in real time and still receive DeepTone responses while uploading data. Currently, only 16-bit, little-endian, signed PCM (WAV) data is supported by the stream endpoint. If your audio has a different sample rate, you need to specify sample_rate in the request parameters; the data will be re-sampled, but the results may be less accurate.
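Before streaming a file, it can help to confirm that its encoding matches what the endpoint expects. Below is a minimal sketch using Python's standard wave module; the 16 kHz rate is taken from the sox examples later on this page, not from a stated requirement, so treat the default as an assumption:

```python
import wave

def check_wav_format(path, expected_rate=16000):
    """Check that a WAV file is uncompressed 16-bit PCM at the expected rate."""
    with wave.open(path, "rb") as w:
        return (
            w.getcomptype() == "NONE"        # uncompressed PCM
            and w.getsampwidth() == 2        # 16-bit samples
            and w.getframerate() == expected_rate
        )
```

If the check fails, a tool such as sox can re-encode the file before streaming.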

Configuration and Outputs

There are different configuration parameters and types of outputs which can be requested.

For code samples, see Example Usage. For the detailed output specification, see Output specification.

Available configuration parameters

The following parameters can be added to requests to the /stream websocket endpoint:

  • models - the list of model names to use for the audio analysis
  • output_period - how often (in milliseconds, as a multiple of 64) the output of the models should be returned
  • include_raw_values - optional; whether the result should also contain the raw model outputs
  • volume_threshold - optional; overrides the default volume level below which audio is treated as silence (higher values will result in more of the data being treated as silence)
  • sample_rate - the sample rate of the audio being sent (see Supported Formats above)

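As a sketch, these parameters are passed in the query string of the /stream URL. The host is taken from the examples later on this page; whether models accepts a comma-separated list for multiple models is an assumption:

```python
from urllib.parse import urlencode

def build_stream_url(models, output_period=None, include_raw_values=None,
                     volume_threshold=None, sample_rate=None):
    """Assemble the wss:// URL for the /stream endpoint from the documented parameters."""
    # Assumption: multiple models are joined with commas.
    params = {"models": ",".join(models)}
    if output_period is not None:
        if output_period % 64 != 0:
            raise ValueError("output_period must be a multiple of 64 ms")
        params["output_period"] = output_period
    if include_raw_values is not None:
        params["include_raw_values"] = "true" if include_raw_values else "false"
    if volume_threshold is not None:
        params["volume_threshold"] = volume_threshold
    if sample_rate is not None:
        params["sample_rate"] = sample_rate
    return "wss://api.oto.ai/stream?" + urlencode(params)
```

For example, build_stream_url(["speech-rt"], volume_threshold=0.001) reproduces the URL used in the websocat commands below.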

Available Outputs

You will get one output response from the DeepTone™ Cloud API per output_period milliseconds of the provided input, representing timestamped results from the requested models.

{"timestamp" : 0, "results": {"gender": {"result": "female", "confidence": 0.6418}}}
{"timestamp" : 1024, "results": {"gender": {"result": "male", "confidence": 0.9012}}}
{"timestamp" : 2048, "results": {"gender": {"result": "male", "confidence": 0.7698}}}
{"timestamp" : 3072, "results": {"gender": {"result": "silence", "confidence": 1.0}}}
{"timestamp" : 4096, "results": {"gender": {"result": "female", "confidence": 0.9780}}}
{"timestamp" : 5120, "results": {"gender": {"result": "female", "confidence": 0.8991}}}

If include_raw_values is set to true, each response object will also include the raw model outputs:

{
  "timestamp": 1024,
  "results": {
    "gender": {
      "result": "male",
      "confidence": 0.7088
    }
  },
  "raw": {
    "gender": {
      "male": 0.1211,
      "female": 0.8789
    }
  }
}
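The raw section exposes per-class model outputs alongside the aggregated result. As a sketch, they can be read like this; note that in the example above they sum to 1, which suggests (but is not stated here) that they behave like class scores:

```python
import json

# The example response above, with raw per-class outputs included.
raw_response = """{
  "timestamp": 1024,
  "results": {"gender": {"result": "male", "confidence": 0.7088}},
  "raw": {"gender": {"male": 0.1211, "female": 0.8789}}
}"""

data = json.loads(raw_response)
raw_gender = data["raw"]["gender"]   # per-class raw outputs, e.g. {"male": ..., "female": ...}
```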

Example Usage

Streaming from a microphone

To stream your microphone input to the DeepTone™ Cloud API and get real-time results you can use sox together with websocat.

The following command requests that the microphone input stream be processed by the SpeechRT model with a volume_threshold of 0.001:

sox -q -d -t raw -b 16 -r 16000 -e signed -c 1 - | websocat "wss://api.oto.ai/stream?models=speech-rt&volume_threshold=0.001" -b -H "X-Api-Key: YOUR_API_KEY"

Streaming from a file

To stream from a file to the DeepTone™ Cloud API and get real-time results you can use sox together with pv and websocat.

The following command requests that your file be processed by the SpeechRT model with a volume_threshold of 0.001:

sox <YOUR_INPUT_FILE> -q -t raw -b 16 -r 16000 -e signed -c 1 - | pv -L 256k | websocat "wss://api.oto.ai/stream?models=speech-rt&volume_threshold=0.001" -b -H "X-Api-Key: YOUR_API_KEY" -n
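The pv rate limit controls how fast the file is pushed to the API. For reference, the real-time byte rate of the audio stream produced by the sox command above works out as follows (whether the endpoint requires real-time pacing is not stated here; pv -L 256k simply caps the transfer well above this rate):

```python
# Real-time byte rate of the raw audio stream produced by the sox command above.
sample_rate = 16000   # Hz, from sox -r 16000
sample_width = 2      # bytes per sample, from sox -b 16
channels = 1          # from sox -c 1

byte_rate = sample_rate * sample_width * channels   # bytes per second of audio
```

This gives 32,000 bytes per second, so the example's 256k limit streams the file several times faster than real time.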