Audio Bytes Processing

DeepTone™'s Audio Bytes Processing functionality lets you extract insights directly from audio bytes provided as numpy arrays. It can be configured similarly to File Processing and produces output structured in the same way (see the example below).

Example Usage

You can use the process_audio_bytes method to process audio bytes directly. In the example below we read the bytes from an audio file, but you can provide them from any other source. Make sure the provided audio data is in one of the Supported Audio Formats, and remember to set the correct sampling rate of the provided audio data with the rate_in argument.
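If your samples do not come from a file, any numpy array in a supported format will do. The sketch below synthesises one second of int16 mono audio as a stand-in for file data; the tone, rate, and variable names are arbitrary choices for illustration, not part of the DeepTone API.

```python
import numpy as np

# One second of a 440 Hz sine tone at 16 kHz, int16 mono -- a stand-in
# for samples obtained from any non-file source (microphone, network, ...).
rate_in = 16000
t = np.arange(rate_in) / rate_in
audio_bytes = (np.sin(2 * np.pi * 440 * t) * 32767).astype(np.int16)

print(audio_bytes.dtype, audio_bytes.shape)  # int16 (16000,)
```

An array prepared like this can then be passed as `data` together with the matching `rate_in`.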

from scipy.io import wavfile

from deeptone import Deeptone
from deeptone.deeptone import GENDER_MALE, GENDER_FEMALE, GENDER_UNKNOWN, GENDER_NO_SPEECH, GENDER_SILENCE

# Read in some audio bytes
rate_in, audio_bytes = wavfile.read("PATH_TO_AUDIO_FILE")

# Initialise Deeptone
engine = Deeptone(license_key="...")

output = engine.process_audio_bytes(
    data=audio_bytes,
    models=[engine.models.Gender],
    output_period=1024,
    rate_in=rate_in,
    include_summary=True,
    include_transitions=True,
    include_raw_values=True,
    volume_threshold=0.005,
)

The returned object contains the time series with an analysis of the file broken down by the provided output period:

# Inspect the result
print(output)

print("Time series:")
for ts_result in output["time_series"]:
    ts = ts_result["timestamp"]
    res = ts_result["results"]["gender"]
    print(f'Timestamp: {ts}ms\tresult: {res["result"]}\t'
          f'confidence: {res["confidence"]}')

print("\nRaw model outputs:")
for ts_result in output["time_series"]:
    ts = ts_result["timestamp"]
    raw = ts_result["raw"]["gender"]
    print(f'Timestamp: {ts}ms\traw results: {GENDER_MALE}: '
          f'{raw[GENDER_MALE]}, {GENDER_FEMALE}: {raw[GENDER_FEMALE]}')

summary = output["summary"]["gender"]
male = summary[f"{GENDER_MALE}_fraction"] * 100
female = summary[f"{GENDER_FEMALE}_fraction"] * 100
no_speech = summary[f"{GENDER_NO_SPEECH}_fraction"] * 100
unknown = summary[f"{GENDER_UNKNOWN}_fraction"] * 100
silence = summary[f"{GENDER_SILENCE}_fraction"] * 100
print(f'\nSummary: male: {male}%, female: {female}%, no_speech: {no_speech}%, unknown: {unknown}%, silence: {silence}%')

print("\nTransitions:")
for ts_result in output["transitions"]["gender"]:
    ts = ts_result["timestamp_start"]
    print(f'Timestamp: {ts}ms\tresult: {ts_result["result"]}\t'
          f'confidence: {ts_result["confidence"]}')

The output of the script would be something like:

Time series:
Timestamp: 0ms result: no_speech confidence: 0.6293
Timestamp: 1024ms result: female confidence: 0.9002
Timestamp: 2048ms result: female confidence: 0.4725
Timestamp: 3072ms result: female confidence: 0.4679
....

Raw model outputs:
Timestamp: 0ms raw results: male: 0.1791, female: 0.8209
Timestamp: 1024ms raw results: male: 0.0499, female: 0.9501
Timestamp: 2048ms raw results: male: 0.2638, female: 0.7362
Timestamp: 3072ms raw results: male: 0.266, female: 0.734

Summary: male: 0.0%, female: 67.74%, no_speech: 9.68%, unknown: 6.45%, silence: 16.13%

Transitions:
Timestamp: 0ms result: silence confidence: 1.0
Timestamp: 320ms result: no_speech confidence: 0.7723
Timestamp: 704ms result: silence confidence: 1.0
Timestamp: 768ms result: female confidence: 0.9002
Timestamp: 1408ms result: silence confidence: 1.0
Timestamp: 1472ms result: female confidence: 0.7137
Timestamp: 2880ms result: unknown confidence: 0.0771
Timestamp: 3136ms result: female confidence: 0.3961
Timestamp: 3712ms result: silence confidence: 1.0
Timestamp: 3776ms result: female confidence: 0.7101
Timestamp: 3840ms result: silence confidence: 1.0
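Conceptually, the transitions list is a run-length view of the model's frame-level decisions: consecutive frames that share a result are merged into one span from the first frame's start to the last frame's end. The sketch below illustrates the idea only; the function, frame step, and labels are made up and are not DeepTone API code.

```python
# Merge consecutive identical frame labels into (start_ms, end_ms, label)
# spans -- the same idea behind the "transitions" output.
def merge_frames(labels, step_ms):
    spans = []
    for i, label in enumerate(labels):
        if spans and spans[-1][2] == label:
            spans[-1][1] += step_ms          # extend the current span
        else:
            spans.append([i * step_ms, (i + 1) * step_ms, label])
    return [tuple(s) for s in spans]

frames = ["silence", "silence", "female", "female", "female", "silence"]
print(merge_frames(frames, 64))
# [(0, 128, 'silence'), (128, 320, 'female'), (320, 384, 'silence')]
```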

Raw output:

{
  "channels": {
    "0": {
      "time_series": [
        { "timestamp": 0, "results": { "gender": { "result": "no_speech", "confidence": 0.6293 } }, "raw": { "gender": { "male": 0.1791, "female": 0.8209 } } },
        { "timestamp": 1024, "results": { "gender": { "result": "female", "confidence": 0.9002 } }, "raw": { "gender": { "male": 0.0499, "female": 0.9501 } } },
        { "timestamp": 2048, "results": { "gender": { "result": "female", "confidence": 0.4725 } }, "raw": { "gender": { "male": 0.2638, "female": 0.7362 } } },
        { "timestamp": 3072, "results": { "gender": { "result": "female", "confidence": 0.4679 } }, "raw": { "gender": { "male": 0.266, "female": 0.734 } } }
      ],
      "summary": {
        "gender": { "male_fraction": 0, "female_fraction": 0.6774, "no_speech_fraction": 0.0968, "unknown_fraction": 0.0645, "silence_fraction": 0.1613 }
      },
      "transitions": {
        "gender": [
          { "timestamp_start": 0, "timestamp_end": 320, "result": "silence", "confidence": 1.0 },
          { "timestamp_start": 320, "timestamp_end": 704, "result": "no_speech", "confidence": 0.7723 },
          { "timestamp_start": 704, "timestamp_end": 768, "result": "silence", "confidence": 1.0 },
          { "timestamp_start": 768, "timestamp_end": 1408, "result": "female", "confidence": 0.9002 },
          { "timestamp_start": 1408, "timestamp_end": 1472, "result": "silence", "confidence": 1.0 },
          { "timestamp_start": 1472, "timestamp_end": 2880, "result": "female", "confidence": 0.7137 },
          { "timestamp_start": 2880, "timestamp_end": 3136, "result": "unknown", "confidence": 0.0771 },
          { "timestamp_start": 3136, "timestamp_end": 3712, "result": "female", "confidence": 0.3961 },
          { "timestamp_start": 3712, "timestamp_end": 3776, "result": "silence", "confidence": 1.0 },
          { "timestamp_start": 3776, "timestamp_end": 3840, "result": "female", "confidence": 0.7101 },
          { "timestamp_start": 3840, "timestamp_end": 3968, "result": "silence", "confidence": 1.0 }
        ]
      }
    }
  }
}
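As the raw output above shows, results are nested per channel under the "channels" key, so multi-channel audio yields one entry per channel. A sketch of walking that structure; the dict below is a trimmed, hypothetical stand-in for an actual result, not real DeepTone output.

```python
# Trimmed, hypothetical stand-in for a per-channel result structure.
output = {
    "channels": {
        "0": {
            "summary": {
                "gender": {"male_fraction": 0.0, "female_fraction": 0.6774}
            }
        }
    }
}

# Walk each channel and pull a figure out of its summary.
lines = []
for channel, data in output["channels"].items():
    female = data["summary"]["gender"]["female_fraction"] * 100
    lines.append(f"channel {channel}: female {female:.2f}%")

print("\n".join(lines))  # channel 0: female 67.74%
```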

Further examples

For more example usage of the summary and transitions, head to the Speech detection recipes and the Arousal detection recipes sections. For example usage of `raw` output to implement custom speech thresholds, head to Example 3 in Speech model recipes.
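As a taste of the custom-threshold idea, here is a minimal hypothetical sketch (not the recipe itself, and not DeepTone API code) that re-labels a frame as unknown when the stronger raw gender probability falls below a chosen cutoff; the function name and cutoff value are made up for illustration.

```python
# Re-label a frame as "unknown" when the dominant raw probability is
# below a chosen cutoff. The cutoff of 0.75 is an arbitrary example.
def classify(raw, cutoff=0.75):
    label = max(raw, key=raw.get)
    return label if raw[label] >= cutoff else "unknown"

print(classify({"male": 0.1791, "female": 0.8209}))  # female
print(classify({"male": 0.2638, "female": 0.7362}))  # unknown
```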