Arousal Model Recipes

Overview

The Arousal model classifies the speech in an audio snippet as high, neutral or low arousal.

High arousal can be linked to intense emotions such as happiness, anger, irritation or excitement.
Low arousal is often associated with tired, depressed or unengaged speech.

The examples here combine the Arousal model with the Speech model to find high energy speech segments in a vlog:

basic analysis of audio files with built-in summarization options - example 1
custom summarization options for audio file analysis - example 2

Pre-requisites

DeepTone with license key and models
audio file(s) you want to process

Sample data

You can download this sample audio file with our CTO talking about OTO for the examples below.

Default summaries - Example 1

Remember to add a valid license key before running the example.

These examples use the summary and transitions outputs, which are calculated optionally when processing a file.

The summary output gives the fraction of the audio that falls into each class. In the case below we are interested in the high arousal part of the speech and ignore the audio where no speech was detected. The idea is to find how engaged the speaker was while they were speaking.

from deeptone import Deeptone
from deeptone.deeptone import AROUSAL_HIGH, AROUSAL_NO_SPEECH

# Set the required constants
VALID_LICENSE_KEY = None
FILE_TO_PROCESS = None

assert None not in (VALID_LICENSE_KEY, FILE_TO_PROCESS), "Set the required constants"


# Initialise Deeptone
engine = Deeptone(license_key=VALID_LICENSE_KEY)

output = engine.process_file(
    filename=FILE_TO_PROCESS,
    models=[engine.models.Speech, engine.models.Arousal],
    output_period=1024,
    channel=0,
    use_chunking=True,
    include_summary=True,
    include_transitions=True,
    volume_threshold=0
)

arousal_summary = output["channels"]["0"]["summary"]["arousal"]
# Share of high arousal among the parts of the audio that contain speech
high_part = arousal_summary[f"{AROUSAL_HIGH}_fraction"] / (
    1 - arousal_summary[f"{AROUSAL_NO_SPEECH}_fraction"]
)
percent_high = round(high_part * 100, 2)

print(f"You were excited {percent_high}% of the time you were speaking")
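
The calculation above divides by the speech share of the file, which is zero when no speech was detected at all. A minimal sketch of a guarded helper is shown here. The function name `speaking_fraction` and the sample summary values are hypothetical and only illustrate the arithmetic; the `<label>_fraction` key layout is taken from the example above.

```python
def speaking_fraction(summary, label, no_speech_label="no_speech"):
    """Return the share of `label` among the parts of the audio that contain speech.

    `summary` is assumed to map "<label>_fraction" keys to values between 0 and 1,
    as in the arousal summary used above.
    """
    speech_share = 1 - summary[f"{no_speech_label}_fraction"]
    if speech_share <= 0:
        # No speech detected, so the ratio is undefined; report zero instead
        return 0.0
    return summary[f"{label}_fraction"] / speech_share


# Hypothetical summary values, not real model output
example_summary = {
    "high_fraction": 0.25,
    "neutral_fraction": 0.125,
    "low_fraction": 0.125,
    "no_speech_fraction": 0.5,
}

high_part = speaking_fraction(example_summary, "high")
print(f"You were excited {round(high_part * 100, 2)}% of the time you were speaking")
```

With these numbers, half of the file contains speech and a quarter of the file is high arousal, so the speaker was excited for half of their speaking time.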

Custom summaries - Example 2

The built-in summary and transitions outputs are a convenient way to collect high-level information from an audio file. They operate on the most granular level of the output, which is 64 ms in most models. As a result, even very small pauses between speech will be reflected in the output. Furthermore, the classification for a particular snippet is always the class with the highest likelihood, but that likelihood may still be low in some difficult-to-classify cases. Depending on your use case, you may need a more custom summarization.
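
One possible custom step, sketched here under assumptions, is to bridge very short pauses before summarizing so that a brief silence does not split an otherwise continuous stretch of speech. The helper `bridge_short_pauses` and its `max_pause_steps` parameter are hypothetical and not part of DeepTone; the sketch works on a plain list of per-timestep labels.

```python
def bridge_short_pauses(labels, pause_label="no_speech", max_pause_steps=2):
    """Relabel short pause runs that sit between two identical labels.

    A run of `pause_label` is replaced by its neighbouring label only when
    it is at most `max_pause_steps` long and both neighbours agree.
    """
    smoothed = list(labels)
    n = len(smoothed)
    i = 0
    while i < n:
        if smoothed[i] != pause_label:
            i += 1
            continue
        # Find the end of the current pause run, which covers indices [i, j)
        j = i
        while j < n and smoothed[j] == pause_label:
            j += 1
        has_neighbours = i > 0 and j < n
        if (
            has_neighbours
            and j - i <= max_pause_steps
            and smoothed[i - 1] == smoothed[j]
        ):
            for k in range(i, j):
                smoothed[k] = smoothed[i - 1]
        i = j
    return smoothed


# Hypothetical per-timestep labels, not real model output
labels = [
    "high", "no_speech", "high",
    "no_speech", "no_speech", "no_speech",
    "low", "no_speech", "neutral",
]
print(bridge_short_pauses(labels))
```

Only the single pause between the two "high" steps is bridged here. The three-step pause is longer than the limit, and the last pause sits between two different labels, so both are left unchanged.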

In this second example, instead of analysing all timesteps with high and low voice arousal, we concentrate on the ones with a confidence higher than 0.9. This may be useful if you are processing longer files and are interested only in the most extreme voice expressions. You can rework the code sample to get the starting and ending timestamps of the high confidence snippets and in that way build a simple key moments detector.

from deeptone import Deeptone
from deeptone.deeptone import AROUSAL_HIGH, AROUSAL_NO_SPEECH

# Set the required constants
VALID_LICENSE_KEY = None
FILE_TO_PROCESS = None

assert None not in (VALID_LICENSE_KEY, FILE_TO_PROCESS), "Set the required constants"


# Initialise Deeptone
engine = Deeptone(license_key=VALID_LICENSE_KEY)

output = engine.process_file(
    filename=FILE_TO_PROCESS,
    models=[engine.models.Speech, engine.models.Arousal],
    output_period=1024,
    channel=0,
    use_chunking=True,
    include_summary=True,
    include_transitions=True,
    volume_threshold=0
)

arousal_summary = output["channels"]["0"]["summary"]["arousal"]
# Share of high arousal among the parts of the audio that contain speech
high_part = arousal_summary[f"{AROUSAL_HIGH}_fraction"] / (
    1 - arousal_summary[f"{AROUSAL_NO_SPEECH}_fraction"]
)
percent_high = round(high_part * 100, 2)

print(f"You were excited {percent_high}% of the time you were speaking")

# Count how often the Speech model switched to "music"
speech_transitions = output["channels"]["0"]["transitions"]["speech"]
music_transitions = [t for t in speech_transitions if t["result"] == "music"]
print(f"You had {len(music_transitions)} music transitions")

# Custom high confidence points count

high_confidence_points = {}
speech_points = 0
for time_series in output["channels"]["0"]["time_series"]:
    arousal = time_series["results"]["arousal"]
    arousal_level = arousal["result"]
    # Count only the timesteps classified with confidence above 0.9
    if arousal["confidence"] > 0.9:
        high_confidence_points[arousal_level] = (
            high_confidence_points.get(arousal_level, 0) + 1
        )
    if arousal_level != "no_speech":
        speech_points += 1

if speech_points == 0:
    print("No speech was detected in the file")
else:
    # .get avoids a KeyError when a class never reached the confidence threshold
    high_share = round(high_confidence_points.get("high", 0) / speech_points * 100, 2)
    low_share = round(high_confidence_points.get("low", 0) / speech_points * 100, 2)
    print(f"You were excited {high_share}% of the time speaking with high confidence!")
    print(f"You were not engaged {low_share}% of the time speaking with high confidence!")
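
The key moments detector mentioned before this example could be sketched as follows. This is an illustration under assumptions, not a DeepTone feature: the helper `high_confidence_segments` is hypothetical, each time series entry is assumed to carry a `timestamp` field in milliseconds marking the start of its window, and `step_ms` is assumed to match the `output_period` used when processing the file. The sample entries are made up so the sketch runs without a license key.

```python
def high_confidence_segments(time_series, label="high", threshold=0.9, step_ms=1024):
    """Return (start_ms, end_ms) pairs for runs of consecutive timesteps
    classified as `label` with confidence above `threshold`."""
    segments = []
    start = None
    end = None
    for step in time_series:
        arousal = step["results"]["arousal"]
        is_hit = arousal["result"] == label and arousal["confidence"] > threshold
        if is_hit:
            if start is None:
                start = step["timestamp"]
            # Assumption: each entry covers step_ms milliseconds from its timestamp
            end = step["timestamp"] + step_ms
        elif start is not None:
            # The run has ended, so store it and wait for the next one
            segments.append((start, end))
            start = None
    if start is not None:
        segments.append((start, end))
    return segments


def make_step(timestamp, result, confidence):
    """Build a hypothetical time series entry for the demonstration."""
    return {
        "timestamp": timestamp,
        "results": {"arousal": {"result": result, "confidence": confidence}},
    }


# Hypothetical entries, not real model output
sample_series = [
    make_step(0, "neutral", 0.80),
    make_step(1024, "high", 0.95),
    make_step(2048, "high", 0.97),
    make_step(3072, "high", 0.60),
    make_step(4096, "no_speech", 0.99),
    make_step(5120, "high", 0.92),
]

for start_ms, end_ms in high_confidence_segments(sample_series):
    print(f"Key moment from {start_ms} ms to {end_ms} ms")
```

To use it on real output, pass `output["channels"]["0"]["time_series"]` instead of `sample_series`, after checking that the field names match the output of your DeepTone version.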
