UnderageSpeaker Model Recipes
Overview
The UnderageSpeaker model classifies a speaker as an adult or a child. It also recognizes silence (moments with no speech). If the analysis is inconclusive, the result is classified as unknown.
In the examples below, we demonstrate:
- using the UnderageSpeaker model to find the percentage of the speech time during which an adult or a child is speaking (example 1)
- using the model to detect if the audio contains a child speaker, and to determine their energy and emotion (example 2)
These can be especially useful for content moderation, detection of bullying or other similar use-cases.
Pre-requisites
- DeepTone with license key
- Audio File(s) you want to process
Sample data
You can download this multichannel audio sample for the following examples.
Determine the percentage that an adult/child speaks - Example 1
Remember to add a valid license key before running the example.
In this example, you can use the summary-level output, which is optionally calculated when processing a file, to calculate the percentage of detected speech that comes from an adult and from a child.
We are analysing each channel independently, and we know there is only one speaker per channel.
With this approach, we can determine the age range of a speaker with high confidence.
from deeptone import Deeptone
from deeptone.deeptone import UNDERAGE_NO_SPEECH, UNDERAGE_CHILD, UNDERAGE_ADULT

# Set the required constants
VALID_LICENSE_KEY = None  # replace with your license key
FILE_TO_PROCESS = None  # replace with the path to your audio file
OUTPUT_PERIOD_MS = 1024

# Initialise Deeptone
engine = Deeptone(license_key=VALID_LICENSE_KEY)

# Process the file and keep the per-channel results
output = engine.process_file(
    filename=FILE_TO_PROCESS,
    models=[engine.models.UnderageSpeaker],
    output_period=OUTPUT_PERIOD_MS,
    include_summary=True,
    include_raw_values=True,
)["channels"]

for channel in output:
    underage_speaker_summary = output[channel]["summary"]["underage-speaker"]

    # Fraction of the channel that contains any speech
    speech_fraction = 1 - underage_speaker_summary[f"{UNDERAGE_NO_SPEECH}_fraction"]
    if speech_fraction <= 0:
        # Avoid dividing by zero on a channel that is entirely silent
        print(f"Channel {channel}: \tno speech detected")
        continue

    # Normalise by speech time so that silence does not dilute the percentages
    child_percentage = underage_speaker_summary[f"{UNDERAGE_CHILD}_fraction"] / speech_fraction
    adult_percentage = underage_speaker_summary[f"{UNDERAGE_ADULT}_fraction"] / speech_fraction

    print(f'Channel {channel}: \t{round(child_percentage * 100, 2)}% of the time a child was speaking and {round(adult_percentage * 100, 2)}% of the time an adult was speaking')
After executing the script using our example file, you should see the following output:
...
Channel 0: 8.73% of the time a child was speaking and 0.0% of the time an adult was speaking
Channel 1: 1.5% of the time a child was speaking and 95.52% of the time an adult was speaking
From this result we can conclude that channel 0 has an underage speaker and channel 1 has an adult speaker. We can also see that the adult spoke during the majority of the conversation.
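The normalisation used in the script can be checked in isolation. The following is a minimal sketch and not part of the DeepTone SDK: `speech_shares` is a hypothetical helper, and the label strings ("child", "adult", "no_speech", "unknown") and the sample fractions are assumptions made up for illustration. In the script above, the keys are built from the SDK constants instead.

```python
def speech_shares(summary, labels=("child", "adult"), no_speech="no_speech"):
    """Return each label's share of the time that contains speech.

    `summary` maps "<label>_fraction" to a fraction of the whole recording.
    The key names are assumptions for this sketch, not confirmed SDK values.
    """
    speech_fraction = 1.0 - summary[f"{no_speech}_fraction"]
    if speech_fraction <= 0:
        # Only silence was detected, so report a zero share for every label.
        return {label: 0.0 for label in labels}
    return {label: summary[f"{label}_fraction"] / speech_fraction for label in labels}


# Made-up fractions for one channel; they sum to 1 over the whole recording.
example_summary = {
    "child_fraction": 0.10,
    "adult_fraction": 0.30,
    "unknown_fraction": 0.10,
    "no_speech_fraction": 0.50,
}

shares = speech_shares(example_summary)
for label, share in shares.items():
    print(f"{label}: {share * 100:.1f}% of the speech time")
```

Because windows classified as unknown still count as speech time in this calculation, the child and adult shares do not have to add up to 100%.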
Detect underage speakers and visualise the energy and emotions - Example 2
In this example, we combine the Arousal, Emotions and UnderageSpeaker models to recognize when a child is speaking and to predict their energy and emotions.
from deeptone import Deeptone
from deeptone.deeptone import UNDERAGE_CHILD

# Set the required constants
VALID_LICENSE_KEY = None
FILE_TO_PROCESS = None
OUTPUT_PERIOD_MS = 512

# Initialise Deeptone
engine = Deeptone(license_key=VALID_LICENSE_KEY)

output = engine.process_file(
    filename=FILE_TO_PROCESS,
    models=[engine.models.Arousal, engine.models.Emotions, engine.models.UnderageSpeaker],
    output_period=OUTPUT_PERIOD_MS,
    include_summary=True,
    include_raw_values=True,
)["channels"]

for i in output:
    channel = output[i]
    for ts_result in channel["time_series"]:
        ts = ts_result["timestamp"]
        result_underage = ts_result["results"]["underage-speaker"]
        result_arousal = ts_result["results"]["arousal"]
        result_emotion = ts_result["results"]["emotions"]
        if result_underage["result"] == UNDERAGE_CHILD:
            print(
                f'A child is speaking at {ts}ms \tenergy: {result_arousal["result"]} \temotion: {result_emotion["result"]} \t(Channel {i})'
            )
After executing the script using our audio data, you will get the results below:
...
A child is speaking at 18432ms energy: low emotion: tired (Channel 0)
A child is speaking at 18944ms energy: high emotion: happy (Channel 0)
A child is speaking at 19456ms energy: high emotion: happy (Channel 0)
A child is speaking at 19968ms energy: high emotion: happy (Channel 0)
A child is speaking at 30208ms energy: neutral emotion: neutral (Channel 0)
A child is speaking at 30720ms energy: high emotion: happy (Channel 0)
A child is speaking at 31232ms energy: high emotion: happy (Channel 0)
You can see that a child spoke between roughly the 18th and 20th seconds and again between the 30th and 31st seconds, and was happy and high in energy most of the time. No irritation was detected in the child's speech.
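To turn per-window results like those printed above into time ranges, consecutive timestamps can be merged into segments and the emotion labels tallied. This is a minimal sketch that reuses the timestamps and emotions from the printed output. `merge_windows` is a hypothetical helper, not an SDK function, and it assumes that each timestamp marks the start of a window lasting one output period (512 ms here), which should be checked against the SDK documentation.

```python
from collections import Counter

OUTPUT_PERIOD_MS = 512

# (timestamp in ms, emotion) pairs copied from the printed output above.
child_windows = [
    (18432, "tired"),
    (18944, "happy"),
    (19456, "happy"),
    (19968, "happy"),
    (30208, "neutral"),
    (30720, "happy"),
    (31232, "happy"),
]


def merge_windows(timestamps, period_ms):
    """Merge back-to-back windows into (start_ms, end_ms) segments.

    Assumes each timestamp is the start of a window of length period_ms.
    """
    segments = []
    for ts in sorted(timestamps):
        if segments and ts == segments[-1][1]:
            # This window starts where the previous segment ends: extend it.
            segments[-1] = (segments[-1][0], ts + period_ms)
        else:
            segments.append((ts, ts + period_ms))
    return segments


segments = merge_windows([ts for ts, _ in child_windows], OUTPUT_PERIOD_MS)
emotion_counts = Counter(emotion for _, emotion in child_windows)

for start, end in segments:
    print(f"Child speech from {start / 1000:.2f}s to {end / 1000:.2f}s")
print(emotion_counts.most_common())
```

Under that assumption, the sample data gives two segments, 18.43 s to 20.48 s and 30.21 s to 31.74 s, with "happy" predicted in five of the seven windows.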