Real Time Processing

The Deeptone SDK can be used to process a real-time audio stream.

Sample Data

For the examples below, you can download this sample audio file of a woman speaking, taken from the LibriSpeech ASR corpus. For a code sample, go to Example Usage.

Supported Formats

The data fed in via the input_generator should be 16-bit PCM with a sample rate of 16 kHz.
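Before streaming a file, it can help to confirm that it matches the format above. The following is a minimal sketch using only Python's standard wave module; the function name check_wav_format and the constants are illustrative and not part of the Deeptone SDK.

```python
import wave

EXPECTED_RATE_HZ = 16000   # 16 kHz, as required above
EXPECTED_SAMPLE_WIDTH = 2  # bytes per sample, i.e. 16-bit PCM


def check_wav_format(filepath):
    """Return a list of format problems; an empty list means the file matches.

    Hypothetical helper for illustration, not a Deeptone API.
    """
    problems = []
    with wave.open(filepath, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        if rate != EXPECTED_RATE_HZ:
            problems.append(f"sample rate is {rate} Hz, expected {EXPECTED_RATE_HZ} Hz")
        if width != EXPECTED_SAMPLE_WIDTH:
            problems.append(f"sample width is {width * 8} bit, expected 16 bit")
    return problems
```

An empty list means the sample rate and bit depth match; any other result lists what would need to be converted before the audio is passed to the SDK.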

Configuration options and outputs

There are different configuration options and types of outputs which can be used depending on the SDK language.

For a code sample, go to Example Usage. For the detailed output specification, go to Output specification.

Available configuration options

The process_stream function accepts the following arguments:

  • input_generator - a generator that yields byte arrays of audio data in the supported format (16-bit PCM at 16 kHz)
  • models - the list of model names to use for the audio analysis
  • output_period - how often, in milliseconds, the model outputs should be returned; must be a multiple of 64
  • include_raw_values - optional; whether the result should also contain the raw model outputs
  • volume_threshold - optional; overrides the default volume level below which audio is treated as silence (higher values result in more of the data being treated as silence)
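The multiple-of-64 constraint on output_period, together with the 16 kHz, 16-bit input format, fixes how much audio each output covers. The sketch below works through that arithmetic. Both helper names are hypothetical, and the byte count assumes mono audio, which the text above does not state explicitly.

```python
SAMPLE_RATE_HZ = 16000  # from the supported format
BYTES_PER_SAMPLE = 2    # 16-bit PCM
PERIOD_STEP_MS = 64     # output_period must be a multiple of this


def nearest_valid_output_period(requested_ms):
    """Round a requested period to the nearest positive multiple of 64 ms."""
    steps = max(1, round(requested_ms / PERIOD_STEP_MS))
    return steps * PERIOD_STEP_MS


def bytes_per_output_period(output_period_ms):
    """Bytes of mono 16-bit, 16 kHz audio covered by one output period."""
    if output_period_ms <= 0 or output_period_ms % PERIOD_STEP_MS != 0:
        raise ValueError("output_period must be a positive multiple of 64 ms")
    samples = SAMPLE_RATE_HZ * output_period_ms // 1000
    return samples * BYTES_PER_SAMPLE


print(nearest_valid_output_period(1000))  # 1024
print(bytes_per_output_period(1024))      # 32768
```

Under these assumptions, one 64 ms step is exactly 1024 samples, so an output_period of 1024 ms covers 16384 samples.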


Available Outputs

A generator is returned that yields one output per output_period milliseconds of the provided input. Each output holds the timestamped results from the requested models:

{"timestamp": 0, "results": {"gender": {"result": "female", "confidence": 0.6418}}}
{"timestamp": 1024, "results": {"gender": {"result": "male", "confidence": 0.9012}}}
{"timestamp": 2048, "results": {"gender": {"result": "male", "confidence": 0.7698}}}
{"timestamp": 3072, "results": {"gender": {"result": "silence", "confidence": 1.0}}}
{"timestamp": 4096, "results": {"gender": {"result": "female", "confidence": 0.9780}}}
{"timestamp": 5120, "results": {"gender": {"result": "female", "confidence": 0.8991}}}
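Per-period outputs like these are often easier to read once consecutive periods with the same result are merged into segments. The following is a sketch of one way to do that. The merge_segments helper is hypothetical, and it assumes that each timestamp marks the start of its period and that every output is a dictionary with "timestamp" and "results" keys.

```python
def merge_segments(results, model="gender", output_period=1024):
    """Merge consecutive outputs that share a result into (start, end) segments.

    Hypothetical helper; assumes each timestamp marks the start of a period
    that lasts output_period milliseconds.
    """
    segments = []
    for item in results:
        label = item["results"][model]["result"]
        start = item["timestamp"]
        end = start + output_period
        last = segments[-1] if segments else None
        if last and last["result"] == label and last["end"] == start:
            last["end"] = end  # extend the current segment
        else:
            segments.append({"result": label, "start": start, "end": end})
    return segments


sample = [
    {"timestamp": 0, "results": {"gender": {"result": "female", "confidence": 0.6418}}},
    {"timestamp": 1024, "results": {"gender": {"result": "male", "confidence": 0.9012}}},
    {"timestamp": 2048, "results": {"gender": {"result": "male", "confidence": 0.7698}}},
    {"timestamp": 3072, "results": {"gender": {"result": "silence", "confidence": 1.0}}},
]
for segment in merge_segments(sample):
    print(segment)
```

For the four sample outputs, this produces three segments: female from 0 to 1024 ms, male from 1024 to 3072 ms, and silence from 3072 to 4096 ms.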

Example Usage

You can use the process_stream method to process a stream of audio. You will need to provide a valid generator that yields audio bytes. Below you will find two different examples, where we:

  • open an audio file and stream bytes from that file, or
  • stream bytes using a microphone as the input source

1. Streaming bytes from an audio file

from deeptone import Deeptone
from scipy.io import wavfile


def input_generator(filepath, chunk_size=1024):
    """Yield the audio in chunks of chunk_size samples."""
    print(f"Opening file {filepath}")
    rate, data = wavfile.read(filepath)
    print(f"Detected sample rate: {rate}")
    index = 0
    while index < len(data):
        yield data[index: min(len(data), index + chunk_size)]
        index += chunk_size


# Initialise Deeptone
engine = Deeptone(license_key="...")
audio_generator = input_generator("PATH_TO_AUDIO_FILE")
output = engine.process_stream(
    input_generator=audio_generator,
    models=[engine.models.Gender],
    output_period=1024,
    volume_threshold=0.005
)
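The example above yields slices of the array returned by scipy, while the list of arguments describes input_generator as yielding byte arrays. As an alternative, the following sketch yields raw bytes objects using only the standard wave module. The name wav_bytes_generator is illustrative, and whether process_stream treats both kinds of input identically is an assumption to verify against the SDK reference.

```python
import wave


def wav_bytes_generator(filepath, chunk_frames=1024):
    """Yield raw PCM bytes from a WAV file, chunk_frames frames at a time.

    Hypothetical helper, not part of the Deeptone SDK. For mono 16-bit
    audio, each full chunk is chunk_frames * 2 bytes long.
    """
    with wave.open(filepath, "rb") as wav:
        while True:
            chunk = wav.readframes(chunk_frames)
            if not chunk:
                break  # end of file reached
            yield chunk
```

The last chunk may be shorter than the others when the file length is not a multiple of chunk_frames.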

2. Streaming bytes from a microphone

You can find even more detailed recipes on using a microphone in the Gender model recipes section.

from collections import deque
from deeptone import Deeptone
import pyaudio

# Initialise an audio stream
data_buffer = deque()
CHUNK_SIZE = 1024  # samples per buffer; each 16-bit sample is 2 bytes


def writer_callback(in_data, frame_count, time_info, status):
    # Called by PyAudio for every captured buffer; store the raw bytes
    data_buffer.extend(in_data)
    return in_data, pyaudio.paContinue


pa = pyaudio.PyAudio()
stream = pa.open(
    format=pyaudio.paInt16,
    channels=1,
    rate=16000,
    input=True,
    frames_per_buffer=CHUNK_SIZE,
    stream_callback=writer_callback,
)
stream.start_stream()


def input_generator(buffer):
    # Yield CHUNK_SIZE samples (2 bytes each) whenever enough data is buffered
    while stream.is_active():
        while len(buffer) >= CHUNK_SIZE * 2:
            samples_read = [buffer.popleft() for _ in range(CHUNK_SIZE * 2)]
            yield bytes(samples_read)


# Initialise Deeptone
engine = Deeptone(license_key="...")
audio_generator = input_generator(data_buffer)
output = engine.process_stream(
    input_generator=audio_generator,
    models=[engine.models.Gender],
    output_period=1024,
    volume_threshold=0.005
)
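The generator in the microphone example polls the buffer in a tight loop while it waits for new audio. The sketch below is one possible variation that sleeps briefly when the buffer is short. The name chunked_bytes and its parameters are illustrative, and is_active stands for any callable such as stream.is_active.

```python
import time
from collections import deque


def chunked_bytes(buffer, is_active, chunk_bytes=2048, idle_sleep=0.01):
    """Yield fixed-size byte chunks from a deque of byte values.

    Hypothetical helper. Sleeps while the source is active but the buffer
    is short, and stops once the source is inactive and fewer than
    chunk_bytes values remain (that remainder is left in the buffer).
    """
    while True:
        if len(buffer) >= chunk_bytes:
            yield bytes(buffer.popleft() for _ in range(chunk_bytes))
        elif is_active():
            time.sleep(idle_sleep)  # wait for the callback to add data
        else:
            break


# Usage with a pre-filled buffer and a source that has already stopped
demo_buffer = deque(b"\x01" * 5)
print(list(chunked_bytes(demo_buffer, lambda: False, chunk_bytes=2)))
```

Unlike the generator above, this version keeps yielding full chunks that are still buffered after the stream stops. Whether that is desirable depends on the application.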

In both examples, process_stream returns a generator that yields a result for every output_period milliseconds of input:

# Inspect the result
for ts_result in output:
    ts = ts_result["timestamp"]
    res = ts_result["results"]["gender"]
    print(f'Timestamp: {ts}ms\tresult: {res["result"]}'
          f' with confidence {res["confidence"]}')

The output of the script would look something like this:

Timestamp: 0ms	result: female with confidence 0.6418
Timestamp: 1024ms	result: female with confidence 0.8682
Timestamp: 2048ms	result: silence with confidence 1.0
Timestamp: 3072ms	result: female with confidence 0.6606

Raw output:

{"timestamp": 0, "results": {"gender": {"result": "female", "confidence": 0.6418}}}
{"timestamp": 1024, "results": {"gender": {"result": "female", "confidence": 0.8682}}}
{"timestamp": 2048, "results": {"gender": {"result": "silence", "confidence": 1.0}}}
{"timestamp": 3072, "results": {"gender": {"result": "female", "confidence": 0.6606}}}

Further examples

You can find more detailed recipes for real-time processing of microphone input in the Gender model recipes section. For example usage of `raw` output to implement custom speech thresholds, head to Example 3 in Speech detection recipes.