Audio Bytes Processing

DeepTone™'s Audio Bytes Processing functionality lets you extract insights directly from audio bytes provided as numpy arrays. It can be configured similarly to File Processing and produces output structured in the same way (see the example below).

Example Usage

You can use the process_audio_bytes method to process audio bytes directly. In the example below we read the bytes from an audio file, but you can provide them from any other source. Make sure the provided audio data is in one of the Supported Audio Formats, and remember to set the correct sampling rate of the provided audio data with the rate_in argument.
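If your samples do not come from a file, any numpy array in a supported format will do. The sketch below synthesises one second of int16 mono audio as a stand-in for file data; the tone, rate, and variable names are arbitrary choices for illustration, not part of the DeepTone API.

```python
import numpy as np

# One second of a 440 Hz sine tone at 16 kHz, int16 mono -- a stand-in
# for samples obtained from any non-file source (microphone, network, ...).
rate_in = 16000
t = np.arange(rate_in) / rate_in
audio_bytes = (np.sin(2 * np.pi * 440 * t) * 32767).astype(np.int16)

print(audio_bytes.dtype, audio_bytes.shape)  # int16 (16000,)
```

An array prepared like this can then be passed as `data` together with the matching `rate_in`.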

from scipy.io import wavfile

from deeptone import Deeptone
from deeptone.deeptone import GENDER_MALE, GENDER_FEMALE, GENDER_UNKNOWN, GENDER_NO_SPEECH, GENDER_SILENCE

# Read in some audio bytes
rate_in, audio_bytes = wavfile.read("PATH_TO_AUDIO_FILE")

# Initialise Deeptone
engine = Deeptone(license_key="...")

output = engine.process_audio_bytes(
    data=audio_bytes,
    models=[engine.models.Gender],
    output_period=1024,
    rate_in=rate_in,
    include_summary=True,
    include_transitions=True,
    include_raw_values=True,
    volume_threshold=0.005,
)

The returned object contains the time series with an analysis of the file broken down by the provided output period:

# Inspect the result
print(output)

print("Time series:")
for ts_result in output["time_series"]:
    ts = ts_result["timestamp"]
    res = ts_result["results"]["gender"]
    print(f'Timestamp: {ts}ms\tresult: {res["result"]}\t'
          f'confidence: {res["confidence"]}')

print("\nRaw model outputs:")
for ts_result in output["time_series"]:
    ts = ts_result["timestamp"]
    raw = ts_result["raw"]["gender"]
    print(f'Timestamp: {ts}ms\traw results: {GENDER_MALE}: '
          f'{raw[GENDER_MALE]}, {GENDER_FEMALE}: {raw[GENDER_FEMALE]}')

summary = output["summary"]["gender"]
male = summary[f"{GENDER_MALE}_fraction"] * 100
female = summary[f"{GENDER_FEMALE}_fraction"] * 100
no_speech = summary[f"{GENDER_NO_SPEECH}_fraction"] * 100
unknown = summary[f"{GENDER_UNKNOWN}_fraction"] * 100
silence = summary[f"{GENDER_SILENCE}_fraction"] * 100
print(f'\nSummary: male: {male}%, female: {female}%, no_speech: {no_speech}%, unknown: {unknown}%, silence: {silence}%')

print("\nTransitions:")
for ts_result in output["transitions"]["gender"]:
    ts = ts_result["timestamp_start"]
    print(f'Timestamp: {ts}ms\tresult: {ts_result["result"]}\t'
          f'confidence: {ts_result["confidence"]}')

The output of the script would be something like:

Time series:
Timestamp: 0ms result: no_speech confidence: 0.6293
Timestamp: 1024ms result: female confidence: 0.9002
Timestamp: 2048ms result: female confidence: 0.4725
Timestamp: 3072ms result: female confidence: 0.4679
....

Raw model outputs:
Timestamp: 0ms raw results: male: 0.1791, female: 0.8209
Timestamp: 1024ms raw results: male: 0.0499, female: 0.9501
Timestamp: 2048ms raw results: male: 0.2638, female: 0.7362
Timestamp: 3072ms raw results: male: 0.266, female: 0.734

Summary: male: 0.0%, female: 67.74%, no_speech: 9.68%, unknown: 6.45%, silence: 16.13%

Transitions:
Timestamp: 0ms result: silence confidence: 1.0
Timestamp: 320ms result: no_speech confidence: 0.7723
Timestamp: 704ms result: silence confidence: 1.0
Timestamp: 768ms result: female confidence: 0.9002
Timestamp: 1408ms result: silence confidence: 1.0
Timestamp: 1472ms result: female confidence: 0.7137
Timestamp: 2880ms result: unknown confidence: 0.0771
Timestamp: 3136ms result: female confidence: 0.3961
Timestamp: 3712ms result: silence confidence: 1.0
Timestamp: 3776ms result: female confidence: 0.7101
Timestamp: 3840ms result: silence confidence: 1.0
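Conceptually, the transitions list is a run-length view of the model's frame-level decisions: consecutive frames that share a result are merged into one span from the first frame's start to the last frame's end. The sketch below illustrates the idea only; the function, frame step, and labels are made up and are not DeepTone API code.

```python
# Merge consecutive identical frame labels into (start_ms, end_ms, label)
# spans -- the same idea behind the "transitions" output.
def merge_frames(labels, step_ms):
    spans = []
    for i, label in enumerate(labels):
        if spans and spans[-1][2] == label:
            spans[-1][1] += step_ms          # extend the current span
        else:
            spans.append([i * step_ms, (i + 1) * step_ms, label])
    return [tuple(s) for s in spans]

frames = ["silence", "silence", "female", "female", "female", "silence"]
print(merge_frames(frames, 64))
# [(0, 128, 'silence'), (128, 320, 'female'), (320, 384, 'silence')]
```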

Raw output:

{
  "channels": {
    "0": {
      "time_series": [
        { "timestamp": 0, "results": { "gender": { "result": "no_speech", "confidence": 0.6293 } }, "raw": { "gender": { "male": 0.1791, "female": 0.8209 } } },
        { "timestamp": 1024, "results": { "gender": { "result": "female", "confidence": 0.9002 } }, "raw": { "gender": { "male": 0.0499, "female": 0.9501 } } },
        { "timestamp": 2048, "results": { "gender": { "result": "female", "confidence": 0.4725 } }, "raw": { "gender": { "male": 0.2638, "female": 0.7362 } } },
        { "timestamp": 3072, "results": { "gender": { "result": "female", "confidence": 0.4679 } }, "raw": { "gender": { "male": 0.266, "female": 0.734 } } }
      ],
      "summary": {
        "gender": { "male_fraction": 0, "female_fraction": 0.6774, "no_speech_fraction": 0.0968, "unknown_fraction": 0.0645, "silence_fraction": 0.1613 }
      },
      "transitions": {
        "gender": [
          { "timestamp_start": 0, "timestamp_end": 320, "result": "silence", "confidence": 1.0 },
          { "timestamp_start": 320, "timestamp_end": 704, "result": "no_speech", "confidence": 0.7723 },
          { "timestamp_start": 704, "timestamp_end": 768, "result": "silence", "confidence": 1.0 },
          { "timestamp_start": 768, "timestamp_end": 1408, "result": "female", "confidence": 0.9002 },
          { "timestamp_start": 1408, "timestamp_end": 1472, "result": "silence", "confidence": 1.0 },
          { "timestamp_start": 1472, "timestamp_end": 2880, "result": "female", "confidence": 0.7137 },
          { "timestamp_start": 2880, "timestamp_end": 3136, "result": "unknown", "confidence": 0.0771 },
          { "timestamp_start": 3136, "timestamp_end": 3712, "result": "female", "confidence": 0.3961 },
          { "timestamp_start": 3712, "timestamp_end": 3776, "result": "silence", "confidence": 1.0 },
          { "timestamp_start": 3776, "timestamp_end": 3840, "result": "female", "confidence": 0.7101 },
          { "timestamp_start": 3840, "timestamp_end": 3968, "result": "silence", "confidence": 1.0 }
        ]
      }
    }
  }
}
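As the raw output above shows, results are nested per channel under the "channels" key, so multi-channel audio yields one entry per channel. A sketch of walking that structure; the dict below is a trimmed, hypothetical stand-in for an actual result, not real DeepTone output.

```python
# Trimmed, hypothetical stand-in for a per-channel result structure.
output = {
    "channels": {
        "0": {
            "summary": {
                "gender": {"male_fraction": 0.0, "female_fraction": 0.6774}
            }
        }
    }
}

# Walk each channel and pull a figure out of its summary.
lines = []
for channel, data in output["channels"].items():
    female = data["summary"]["gender"]["female_fraction"] * 100
    lines.append(f"channel {channel}: female {female:.2f}%")

print("\n".join(lines))  # channel 0: female 67.74%
```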

Further examples

For more example usage of the summary and transitions, head to the Speech detection recipes and the Arousal detection recipes sections. For example usage of `raw` output to implement custom speech thresholds, head to Example 3 in Speech model recipes.
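As a taste of the custom-threshold idea, here is a minimal hypothetical sketch (not the recipe itself, and not DeepTone API code) that re-labels a frame as unknown when the stronger raw gender probability falls below a chosen cutoff; the function name and cutoff value are made up for illustration.

```python
# Re-label a frame as "unknown" when the dominant raw probability is
# below a chosen cutoff. The cutoff of 0.75 is an arbitrary example.
def classify(raw, cutoff=0.75):
    label = max(raw, key=raw.get)
    return label if raw[label] >= cutoff else "unknown"

print(classify({"male": 0.1791, "female": 0.8209}))  # female
print(classify({"male": 0.2638, "female": 0.7362}))  # unknown
```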