Real Time Processing
The DeepTone SDK can be used to process a real-time audio stream.
Sample Data
For the examples below, you can download this sample audio file, a recording of a woman speaking, from the LibriSpeech ASR corpus. For a code sample, go to Example Usage.
Supported Formats
The data fed via the input_generator should be 16-bit PCM with a sample rate of 16 kHz.
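As a quick sanity check before streaming a file, you can verify its format; the following is a minimal sketch using Python's standard wave module, where the path is a placeholder and the mono check reflects the single-channel audio used in the examples on this page:
import wave

# "PATH_TO_AUDIO_FILE" is a placeholder, as in the examples below
with wave.open("PATH_TO_AUDIO_FILE", "rb") as wav:
    assert wav.getsampwidth() == 2, "expected 16-bit PCM (2 bytes per sample)"
    assert wav.getframerate() == 16000, "expected a 16 kHz sample rate"
    assert wav.getnchannels() == 1, "expected mono audio, as used in the examples below"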
Configuration options and outputs
The available configuration options and output types depend on the SDK language.
For a code sample, go to Example Usage. For the detailed output specification, go to Output specification.
- Python
- Swift
Available configuration options
There are several arguments that can be passed to the process_stream function:
- input_generator - a generator that yields byte arrays representing properly sampled audio data
- models - the list of model names to use for the audio analysis
- output_period - how often (in milliseconds, a multiple of 64) the output of the models should be returned
- include_raw_values - optionally, whether the result should contain raw model outputs
- volume_threshold - optionally, a volume level to use instead of the default (higher values result in more of the data being treated as silence)
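To illustrate, a call that exercises all of these options might look like the following sketch, assuming an engine and audio generator set up as in Example Usage below:
output = engine.process_stream(
    input_generator=audio_generator,  # yields 16-bit PCM bytes at 16 kHz
    models=[engine.models.Gender],    # the models to run on the audio
    output_period=1024,               # one result per 1024 ms (a multiple of 64)
    include_raw_values=True,          # also include raw model outputs
    volume_threshold=0.005,           # tune which audio is treated as silence
)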
Available Outputs
A generator is returned that yields one output per output_period milliseconds of the provided input, representing timestamped results from the requested models.
{"timestamp" : 0, {"results": "gender": {"result": "female", "confidence": 0.6418}}}
{"timestamp" : 1024, {"results": "gender": {"result": "male", "confidence": 0.9012}}}
{"timestamp" : 2048, {"results": "gender": {"result": "male", "confidence": 0.7698}}}
{"timestamp" : 3072, {"results": "gender": {"result": "silence", "confidence": 1.0}}}
{"timestamp" : 4096, {"results": "gender": {"result": "female", "confidence": 0.9780}}}
{"timestamp" : 5120, {"results": "gender": {"result": "female", "confidence": 0.8991}}}
Coming soon...
Example Usage
- Python
- Swift
You can use the process_stream
method to process a stream of audio. You will need to provide
a valid generator that yields audio bytes. Below you will find two different examples, where we:
- open an audio file and stream bytes from that file, or
- stream bytes using a microphone as the input source
1. Streaming bytes from an audio file
from deeptone import Deeptone
from scipy.io import wavfile

def input_generator(filepath, chunk_size=1024):
    print(f"Opening file {filepath}")
    rate, data = wavfile.read(filepath)
    print(f"Detected sample rate: {rate}")
    index = 0
    while index < len(data):
        yield data[index: min(len(data), index + chunk_size)]
        index += chunk_size
    return

# Initialise Deeptone
engine = Deeptone(license_key="...")

audio_generator = input_generator("PATH_TO_AUDIO_FILE")

output = engine.process_stream(
    input_generator=audio_generator,
    models=[engine.models.Gender],
    output_period=1024,
    volume_threshold=0.005
)
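The sample file linked above is already 16-bit PCM at 16 kHz. If your file has a different sample rate, it must be resampled before streaming; one possible approach, sketched here as a hypothetical to_16_khz helper using SciPy (which is not part of the DeepTone SDK), is:
import numpy as np
from scipy.signal import resample_poly

def to_16_khz(data, rate):
    # Resample an int16 signal to 16 kHz, keeping 16-bit PCM
    if rate == 16000:
        return data
    resampled = resample_poly(data, up=16000, down=rate)
    return resampled.astype(np.int16)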
2. Streaming bytes from a microphone
You can find even more detailed recipes on using a microphone in the Gender model recipes section.
from collections import deque
from deeptone import Deeptone
import pyaudio

# Initialise an audio stream
data_buffer = deque()
CHUNK_SIZE = 1024

def writer_callback(in_data, frame_count, time_info, status):
    data_buffer.extend(in_data)
    return in_data, pyaudio.paContinue

pa = pyaudio.PyAudio()
stream = pa.open(
    format=pyaudio.paInt16,
    channels=1,
    rate=16000,
    input=True,
    frames_per_buffer=CHUNK_SIZE,
    stream_callback=writer_callback,
)
stream.start_stream()

def input_generator(buffer):
    while stream.is_active():
        # 16-bit samples are 2 bytes each, so read CHUNK_SIZE * 2 bytes per chunk
        while len(buffer) >= CHUNK_SIZE * 2:
            samples_read = [buffer.popleft() for x in range(CHUNK_SIZE * 2)]
            yield bytes(samples_read)

# Initialise Deeptone
engine = Deeptone(license_key="...")
audio_generator = input_generator(data_buffer)
output = engine.process_stream(
    input_generator=audio_generator,
    models=[engine.models.Gender],
    output_period=1024,
    volume_threshold=0.005
)
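Once you are done consuming results, stop the recording and release the audio device with the standard PyAudio teardown calls:
# Stop recording and release the audio device
stream.stop_stream()
stream.close()
pa.terminate()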
In either case, the returned object is a generator that will yield results for every output_period
milliseconds:
# Inspect the result
for ts_result in output:
    ts = ts_result["timestamp"]
    res = ts_result["results"]["gender"]
    print(f'Timestamp: {ts}ms\tresult: {res["result"]}'
          f' with confidence {res["confidence"]}')
The output of the script would be something like:
Timestamp: 0ms    result: female with confidence 0.6418
Timestamp: 1024ms    result: female with confidence 0.8682
Timestamp: 2048ms    result: silence with confidence 1.0
Timestamp: 3072ms    result: female with confidence 0.6606
Raw output:
{ "timestamp" : 0, {"results": "gender": { "result": "female", "confidence": 0.6418, } } }
{ "timestamp" : 1024, {"results": "gender": { "result": "female", "confidence": 0.8682, } } }
{ "timestamp" : 2048, {"results": "gender": { "result": "silence", "confidence": 1.0, } } }
{ "timestamp" : 3072, {"results": "gender": { "result": "female", "confidence": 0.6606, } } }
Further examples
You can find more detailed recipes for real-time processing of microphone input in the Gender model recipes section. For example usage of `raw` output to implement custom speech thresholds, head to Example 3 in Speech detection recipes.
Coming soon...