speakerbox package#

Submodules#

speakerbox.examples module#

class speakerbox.examples.IteratedModelEvalScores(dataset_size: str, equalized_data: bool, mean_audio_per_person_train: float, std_audio_per_person_train: float, mean_audio_per_person_test: float, std_audio_per_person_test: float, mean_audio_per_person_valid: float, std_audio_per_person_valid: float, mean_accuracy: float, std_accuracy: float, mean_precision: float, std_precision: float, mean_recall: float, std_recall: float, mean_duration: float, std_duration: float)[source]#

Bases: DataClassJsonMixin

dataset_size: str#
equalized_data: bool#
mean_accuracy: float#
mean_audio_per_person_test: float#
mean_audio_per_person_train: float#
mean_audio_per_person_valid: float#
mean_duration: float#
mean_precision: float#
mean_recall: float#
std_accuracy: float#
std_audio_per_person_test: float#
std_audio_per_person_train: float#
std_audio_per_person_valid: float#
std_duration: float#
std_precision: float#
std_recall: float#
class speakerbox.examples.ModelEvalScores(accuracy: float, precision: float, recall: float, duration: float)[source]#

Bases: DataClassJsonMixin

accuracy: float#
duration: float#
precision: float#
recall: float#
speakerbox.examples.download_preprocessed_example_data() Path[source]#

Download and unpack the example preprocessed dataset from Google Drive.

Stored to the “example-speakerbox-dataset” directory.

Returns:
Path

The path to the directory with all of the unzipped data.
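
A minimal usage sketch (the destination directory name comes from the description above):

from speakerbox.examples import download_preprocessed_example_data

# Download and unpack the example dataset, returning the path to the
# "example-speakerbox-dataset" directory.
example_dataset_dir = download_preprocessed_example_data()
print(example_dataset_dir)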

speakerbox.examples.train_and_eval_all_example_models(example_dataset_dir: str | Path, n_iterations: int = 5, seed: int = 182318512, equalize_data_within_splits: bool = False) DataFrame[source]#

Train and evaluate a model multiple times for each of the dataset sizes.

This was used to investigate the diminishing returns of adding more data to the model.

Parameters:
example_dataset_dir: Union[str, Path]

Path to the downloaded and unzipped example dataset.

n_iterations: int

The number of train and evaluation iterations to run for this model before averaging the results. Default: 5

seed: int

A random seed to set global random state.

equalize_data_within_splits: bool

Should the data splits be equalized to the smallest number of examples for any speaker in that split. Default: False (allow different numbers of examples per label)

Returns:
pd.DataFrame

A DataFrame of results for all the models tested.

See also

train_and_eval_example_model

The function used to train and eval a single model dataset size.
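
A hedged sketch of running the full dataset-size sweep; a reduced n_iterations is used here only to keep runtime short:

from speakerbox.examples import (
    download_preprocessed_example_data,
    train_and_eval_all_example_models,
)

example_dataset_dir = download_preprocessed_example_data()

# One row per dataset size ("15-minutes", "30-minutes", "60-minutes")
# with mean / std metrics over the iterations.
results = train_and_eval_all_example_models(
    example_dataset_dir,
    n_iterations=2,
    equalize_data_within_splits=True,
)
print(results)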

speakerbox.examples.train_and_eval_example_model(example_dataset_dir: str | Path, dataset_size_str: Literal['15-minutes', '30-minutes', '60-minutes'], n_iterations: int = 5, seed: int = 182318512, equalize_data_within_splits: bool = False) IteratedModelEvalScores[source]#

Train and evaluate a model multiple times for one of the dataset sizes.

This was used to investigate the diminishing returns of adding more data to the model.

Parameters:
example_dataset_dir: Union[str, Path]

Path to the downloaded and unzipped example dataset.

dataset_size_str: Literal[“15-minutes”, “30-minutes”, “60-minutes”]

The dataset size to choose from. This will load (and potentially subset) the packaged data.

n_iterations: int

The number of train and evaluation iterations to run for this model before averaging the results. Default: 5

seed: int

A random seed to set global random state.

equalize_data_within_splits: bool

Should the data splits be equalized to the smallest number of examples for any speaker in that split. Default: False (allow different numbers of examples per label)

Returns:
IteratedModelEvalScores

The average accuracy, precision, recall, and duration over the training and evaluation iterations.
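
A minimal sketch for a single dataset size (assuming the example dataset has already been downloaded and unzipped to the example-speakerbox-dataset directory):

from speakerbox.examples import train_and_eval_example_model

scores = train_and_eval_example_model(
    "example-speakerbox-dataset",
    dataset_size_str="15-minutes",
    n_iterations=2,
)

# Averaged metrics over the iterations.
print(scores.mean_accuracy, scores.std_accuracy)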

speakerbox.main module#

speakerbox.main.apply(audio: str | Path, model: str, mode: Literal['diarize', 'naive'] = 'diarize', min_chunk_duration: float = 0.5, max_chunk_duration: float = 2.0, confidence_threshold: float = 0.85) Annotation[source]#

Iteratively apply the model across chunks of an audio file.

Parameters:
audio: Union[str, Path]

The audio filepath.

model: str

The path to the trained audio-classification model.

mode: Literal[“diarize”, “naive”]

Which mode to use for processing. “diarize” will diarize the audio prior to generating chunks to classify. “naive” will iteratively process chunks. “naive” is assumed to be faster but have worse performance. Default: “diarize”

min_chunk_duration: float

The minimum size in seconds a chunk of audio is allowed to be for it to be run through the classification pipeline. Default: 0.5 seconds

max_chunk_duration: float

The maximum size in seconds a chunk of audio is allowed to be for it to be run through the classification pipeline. Default: 2 seconds

confidence_threshold: float

A value to act as a lower bound to the reported confidence of the model prediction. Any classification that has a confidence lower than this value will be ignored and not added as a segment. Default: 0.85 (fairly strict / must have high confidence in prediction)

Returns:
Annotation

A pyannote.core Annotation with all labeled segments.
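
A minimal sketch of applying a trained model to a recording (meeting.wav and the model directory name are placeholders):

from speakerbox.main import apply

annotation = apply(
    "meeting.wav",
    model="trained-speakerbox",
    mode="diarize",
    confidence_threshold=0.9,
)

# Iterate over the labeled segments of the returned pyannote.core Annotation.
for segment, _, label in annotation.itertracks(yield_label=True):
    print(f"{label}: {segment.start:.1f}s - {segment.end:.1f}s")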

speakerbox.main.eval_model(validation_dataset: Dataset, model_name: str = 'trained-speakerbox') Tuple[float, float, float, float][source]#

Evaluate a trained model.

This will store two files in the model directory: a Markdown file with the accuracy, precision, and recall, and a PNG of the generated top-one confusion matrix.

Parameters:
validation_dataset: Dataset

The dataset to validate the model against.

model_name: str

A name for the model. This will also create a directory with the same name to store the produced model in. Default: “trained-speakerbox”

Returns:
accuracy: float

The model accuracy as returned by sklearn.metrics.accuracy_score.

precision: float

The model (weighted) precision as returned by sklearn.metrics.precision_score.

recall: float

The model (weighted) recall as returned by sklearn.metrics.recall_score.

loss: float

The model log loss as returned by sklearn.metrics.log_loss.
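
A hedged sketch, assuming the prepared DatasetDict was previously saved with datasets' save_to_disk and that its validation split is named "valid" (as produced by prepare_dataset); the paths are placeholders:

from datasets import load_from_disk

from speakerbox.main import eval_model

dataset_dict = load_from_disk("prepared-speakerbox-dataset")

# Writes the metrics Markdown file and confusion matrix PNG into the
# model directory and returns the scores.
accuracy, precision, recall, loss = eval_model(
    dataset_dict["valid"],
    model_name="trained-speakerbox",
)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")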

speakerbox.main.train(dataset: DatasetDict, model_name: str = 'trained-speakerbox', model_base: str = 'superb/wav2vec2-base-superb-sid', max_duration: float = 2.0, seed: int | None = None, use_cpu: bool = False, trainer_arguments_kws: Dict[str, Any] = {'eval_accumulation_steps': 40, 'evaluation_strategy': 'epoch', 'gradient_accumulation_steps': 1, 'gradient_checkpointing': True, 'learning_rate': 3e-05, 'load_best_model_at_end': True, 'logging_steps': 10, 'metric_for_best_model': 'accuracy', 'num_train_epochs': 5, 'per_device_eval_batch_size': 8, 'per_device_train_batch_size': 8, 'save_strategy': 'epoch', 'warmup_ratio': 0.1}) Path[source]#

Train a speaker classification model.

Parameters:
dataset: DatasetDict

The datasets to use for training, testing, and validation. Should only contain the columns/features: “label” and “audio”. The values in the “audio” column should be paths to the audio files.

model_name: str

A name for the model. This will also create a directory with the same name to store the produced model in. Default: “trained-speakerbox”

model_base: str

The model base to use before fine tuning.

max_duration: float

The maximum duration to use for each audio clip. Any clips longer than this will be trimmed. Default: 2.0

seed: Optional[int]

Seed to pass to torch, numpy, and Python RNGs. Default: None (do not set a seed)

use_cpu: bool

Should the model be trained using the CPU. This also sets no_cuda=True on the TrainingArguments. Default: False (use GPU if available)

trainer_arguments_kws: Dict[str, Any]

Any additional keyword arguments to be passed to the HuggingFace TrainingArguments object. Default: DEFAULT_TRAINER_ARGUMENTS_ARGS

Returns:
model_storage_path: Path

The path to the directory where the model is stored.
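
A short sketch of the end of the training pipeline (directory names are placeholders for hand-labeled diarization output; see speakerbox.preprocess below for the expansion and preparation steps):

from speakerbox.main import train
from speakerbox.preprocess import (
    expand_labeled_diarized_audio_dir_to_dataset,
    prepare_dataset,
)

# Expand hand-labeled diarization output and build train/test/valid splits.
df = expand_labeled_diarized_audio_dir_to_dataset(
    ["labeled-meeting-1/", "labeled-meeting-2/"],
)
dataset_dict, _ = prepare_dataset(df, seed=182318512)

# Fine tune the wav2vec2 speaker-id base model on the prepared splits.
model_storage_path = train(dataset_dict, model_name="trained-speakerbox")
print(model_storage_path)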

speakerbox.preprocess module#

speakerbox.preprocess.diarize_and_split_audio(audio_file: str | Path, storage_dir: str | Path | None = None, min_audio_chunk_duration: float = 0.5, diarization_pipeline: Pipeline | None = None, seed: int | None = None, hf_token: str | None = None) Path[source]#

Diarize a single audio file and split it into smaller chunks stored in directories named with the unlabeled speaker annotation.

Parameters:
audio_file: Union[str, Path]

The audio file to diarize and split.

storage_dir: Optional[Union[str, Path]]

A specific directory to store the produced chunks to. Default: None (use the audio file name to create a new directory)

min_audio_chunk_duration: float

The minimum audio duration to allow through after chunking. Default: 0.5 seconds

diarization_pipeline: Optional[Pipeline]

A preloaded PyAnnote Pipeline. Default: None (load default)

seed: Optional[int]

Seed to pass to torch, numpy, and Python RNGs. Default: None (do not set a seed)

hf_token: Optional[str]

Huggingface user access token to download the diarization model. Can also be set with the HUGGINGFACE_TOKEN environment variable. https://hf.co/settings/tokens

Returns:
storage_dir: Path

The path to where all the chunked audio was stored.

See also

expand_labeled_diarized_audio_dir_to_dataset

After labeling the audio in the produced diarized audio directory, expand the labeled data into a dataset ready for training.

Notes

Prior to using this function you need to accept user conditions: https://hf.co/pyannote/speaker-diarization and https://hf.co/pyannote/segmentation

The output directory structure of the produced chunks will follow the pattern:

{storage_dir}/
├── SPEAKER_00
│   ├── {start_time_millis}-{end_time_millis}.wav
│   └── {start_time_millis}-{end_time_millis}.wav
├── SPEAKER_01
│   ├── {start_time_millis}-{end_time_millis}.wav
│   └── {start_time_millis}-{end_time_millis}.wav
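
A minimal sketch (meeting.wav is a placeholder; the pyannote user conditions must have been accepted and the HUGGINGFACE_TOKEN environment variable set, or hf_token passed explicitly):

from speakerbox.preprocess import diarize_and_split_audio

# Diarize and chunk the recording into per-speaker directories.
chunk_dir = diarize_and_split_audio("meeting.wav")

# Each SPEAKER_XX subdirectory should now be relabeled by hand before
# expanding into a training dataset.
print(sorted(p.name for p in chunk_dir.iterdir()))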
speakerbox.preprocess.expand_gecko_annotations_to_dataset(annotations_and_audios: List[GeckoAnnotationAndAudio], audio_output_dir: str | Path = 'chunked-audio/', overwrite: bool = False, min_audio_chunk_duration: float = 0.5, max_audio_chunk_duration: float = 2.0) DataFrame[source]#

Expand a list of annotation and audio files into a full dataset to be used for training and testing a speaker classification model.

Parameters:
annotations_and_audios: List[GeckoAnnotationAndAudio]

A list of annotation files and their matching audio files, to be expanded into rows of speaker label, audio file path, and start and end times.

audio_output_dir: Union[str, Path]

A directory path to store the chunked audio files in. Default: “chunked-audio” directory in the current working directory.

overwrite: bool

When writing out an audio chunk, should existing files be overwritten. Default: False (do not overwrite)

min_audio_chunk_duration: float

The minimum audio duration to allow through after chunking. Default: 0.5 seconds

max_audio_chunk_duration: float

Length of the maximum audio duration to split larger audio files into. Default: 2.0 seconds

Returns:
dataset: pd.DataFrame

The expanded dataset with columns: conversation_id, label, audio, duration

Raises:
NotADirectoryError

A file exists at the specified destination.

FileExistsError

A file exists at the target chunk audio location but overwrite is False.

Notes

Generated and attached conversation ids are pulled from the annotation file name.
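
A minimal sketch (file names are placeholders for a Gecko-exported annotation file and its matching recording):

from pathlib import Path

from speakerbox.preprocess import expand_gecko_annotations_to_dataset
from speakerbox.types import GeckoAnnotationAndAudio

pairs = [
    GeckoAnnotationAndAudio(
        annotation_file=Path("meeting-1-annotations.json"),
        audio_file=Path("meeting-1.wav"),
    ),
]

# Chunk the audio by annotation and collect the labeled rows.
df = expand_gecko_annotations_to_dataset(pairs, audio_output_dir="chunked-audio/")
print(df[["conversation_id", "label", "duration"]].head())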

speakerbox.preprocess.expand_labeled_diarized_audio_dir_to_dataset(labeled_diarized_audio_dir: str | Path | List[str] | List[Path] | List[str | Path], audio_output_dir: str | Path = 'chunked-audio/', overwrite: bool = False, min_audio_chunk_duration: float = 0.5, max_audio_chunk_duration: float = 2.0) DataFrame[source]#

Expand the provided labeled diarized audio into a dataset ready for training.

Parameters:
labeled_diarized_audio_dir: Union[Union[str, Path], List[Union[str, Path]]]

A path or list of paths to diarization results directories. These directories should no longer have the default “SPEAKER_00”, “SPEAKER_01” labels but instead expert-annotated labels.

audio_output_dir: Union[str, Path]

A directory path to store the chunked audio files in. Default: “chunked-audio” directory in the current working directory.

overwrite: bool

When writing out an audio chunk, should existing files be overwritten. Default: False (do not overwrite)

min_audio_chunk_duration: float

The minimum audio duration to allow through after chunking. Default: 0.5 seconds

max_audio_chunk_duration: float

Length of the maximum audio duration to split larger audio files into. Default: 2.0 seconds

Returns:
dataset: pd.DataFrame

The expanded dataset with columns: conversation_id, label, audio, duration

Raises:
NotADirectoryError

A file exists at the specified destination.

FileExistsError

A file exists at the target chunk audio location but overwrite is False.

See also

diarize_and_split_audio

Function to diarize an audio file and split into annotation directories.

Notes

The provided labeled diarized audio directory(s) should have the following structure:

{labeled_diarized_audio_dir}/
├── label
│   ├── 1.wav
│   └── 2.wav
├── second_label
│   ├── 1.wav
│   └── 2.wav

Generated and attached conversation ids are pulled from the labeled diarized audio directory names.
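
A minimal sketch (the directory names are placeholders for diarization output whose SPEAKER_XX folders have been renamed to real speaker labels):

from speakerbox.preprocess import expand_labeled_diarized_audio_dir_to_dataset

df = expand_labeled_diarized_audio_dir_to_dataset(
    ["labeled-meeting-1/", "labeled-meeting-2/"],
    audio_output_dir="chunked-audio/",
    overwrite=True,
)

# How much audio was collected per speaker label.
print(df.groupby("label").duration.sum())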

speakerbox.preprocess.prepare_dataset(dataset: DataFrame, test_and_valid_size: float = 0.4, equalize_data_within_splits: bool = False, n_iterations: int = 100, seed: int | None = None) Tuple[DatasetDict, DataFrame][source]#

Prepare a dataset for training a new speakerbox / audio-classification model.

This function attempts to randomly create train, test, and validation splits from the provided dataframe that meet the following two conditions:

1. There is data holdout by conversation_id. I.e. if the dataset contains data from nine unique conversation ids, the training, test, and validation sets should all have different conversation ids (train has 0, 1, 2, 3; test has 4, 5, 6; validation has 7, 8).

2. There is data stratification by label. I.e. if the dataset contains nine unique labels, the training, test, and validation sets should each have all nine labels present (train, test, and validation all have labels 0-8).

Parameters:
dataset: pd.DataFrame

An expanded dataset with columns: conversation_id, label, audio, duration

test_and_valid_size: float

How much of the dataset to use for the combined test and validation sets as a percent (i.e. 0.4 = 40% of the dataset). The test and validation sets will further split this in half (i.e. 0.4 = 40% which means 20% of the total data for testing and 20% of the total data for validation).

equalize_data_within_splits: bool

After finding valid train, test, and validation splits, should the data within each split be reduced to have an equal number of data for each label. Default: False (Do not equalize labels within splits)

n_iterations: int

The number of iterations to attempt to find viable train, test, and validation sets. Default: 100

seed: Optional[int]

Seed to pass to torch, numpy, and Python RNGs. Default: None (do not set a seed)

Returns:
dataset: DatasetDict

The prepared dataset split into train, test, and validation splits.

value_counts: pd.DataFrame

A value count table where each row is a different label and each column is the count of that label in the matching train, test, or validation set.

Raises:
ValueError

Could not find train, test, and validation sets that meet the holdout and stratification criteria after n_iterations attempts. It is recommended to annotate more data.

See also

expand_labeled_diarized_audio_dir_to_dataset

Function to move from a directory of diarized audio (or multiple) into a dataset to provide to this function.

expand_gecko_annotations_to_dataset

Function to move from a gecko annotation file (or multiple) into a dataset to provide to this function.
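
A hedged sketch showing how the resulting splits can be inspected before training (the expansion directories are placeholders; a seed is set for reproducible splits):

from speakerbox.preprocess import (
    expand_labeled_diarized_audio_dir_to_dataset,
    prepare_dataset,
)

df = expand_labeled_diarized_audio_dir_to_dataset(
    ["labeled-meeting-1/", "labeled-meeting-2/", "labeled-meeting-3/"],
)
dataset_dict, value_counts = prepare_dataset(
    df,
    test_and_valid_size=0.4,
    equalize_data_within_splits=True,
    seed=182318512,
)

# Per-label counts across the train, test, and validation splits.
print(value_counts)
print(dataset_dict)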

speakerbox.types module#

class speakerbox.types.AnnotatedAudio(conversation_id: str, label: str, audio: str, duration: float)[source]#

Bases: object

audio: str#
conversation_id: str#
duration: float#
classmethod from_dict(kvs: dict | list | str | int | float | bool | None, *, infer_missing=False) A#
classmethod from_json(s: str | bytes | bytearray, *, parse_float=None, parse_int=None, parse_constant=None, infer_missing=False, **kw) A#
label: str#
classmethod schema(*, infer_missing: bool = False, only=None, exclude=(), many: bool = False, context=None, load_only=(), dump_only=(), partial: bool = False, unknown=None) SchemaF[A]#
to_dict(encode_json=False) Dict[str, dict | list | str | int | float | bool | None]#
to_json(*, skipkeys: bool = False, ensure_ascii: bool = True, check_circular: bool = True, allow_nan: bool = True, indent: int | str | None = None, separators: Tuple[str, str] | None = None, default: Callable | None = None, sort_keys: bool = False, **kw) str#
class speakerbox.types.GeckoAnnotationAndAudio(annotation_file, audio_file)[source]#

Bases: NamedTuple

Create new instance of GeckoAnnotationAndAudio(annotation_file, audio_file)

annotation_file: Path#

Alias for field number 0

audio_file: Path#

Alias for field number 1

speakerbox.utils module#

speakerbox.utils.set_global_seed(seed: int) None[source]#

Set the global RNG seed for torch, numpy, and Python.
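
Typical usage is a single call before preparing data or training:

from speakerbox.utils import set_global_seed

# Seed torch, numpy, and Python's random module for reproducibility.
set_global_seed(182318512)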

Module contents#

Top-level package for speakerbox.

speakerbox.apply(audio: str | Path, model: str, mode: Literal['diarize', 'naive'] = 'diarize', min_chunk_duration: float = 0.5, max_chunk_duration: float = 2.0, confidence_threshold: float = 0.85) Annotation[source]#

Iteratively apply the model across chunks of an audio file.

Parameters:
audio: Union[str, Path]

The audio filepath.

model: str

The path to the trained audio-classification model.

mode: Literal[“diarize”, “naive”]

Which mode to use for processing. “diarize” will diarize the audio prior to generating chunks to classify. “naive” will iteratively process chunks. “naive” is assumed to be faster but have worse performance. Default: “diarize”

min_chunk_duration: float

The minimum size in seconds a chunk of audio is allowed to be for it to be run through the classification pipeline. Default: 0.5 seconds

max_chunk_duration: float

The maximum size in seconds a chunk of audio is allowed to be for it to be run through the classification pipeline. Default: 2 seconds

confidence_threshold: float

A value to act as a lower bound to the reported confidence of the model prediction. Any classification that has a confidence lower than this value will be ignored and not added as a segment. Default: 0.85 (fairly strict / must have high confidence in prediction)

Returns:
Annotation

A pyannote.core Annotation with all labeled segments.

speakerbox.eval_model(validation_dataset: Dataset, model_name: str = 'trained-speakerbox') Tuple[float, float, float, float][source]#

Evaluate a trained model.

This will store two files in the model directory: a Markdown file with the accuracy, precision, and recall, and a PNG of the generated top-one confusion matrix.

Parameters:
validation_dataset: Dataset

The dataset to validate the model against.

model_name: str

A name for the model. This will also create a directory with the same name to store the produced model in. Default: “trained-speakerbox”

Returns:
accuracy: float

The model accuracy as returned by sklearn.metrics.accuracy_score.

precision: float

The model (weighted) precision as returned by sklearn.metrics.precision_score.

recall: float

The model (weighted) recall as returned by sklearn.metrics.recall_score.

loss: float

The model log loss as returned by sklearn.metrics.log_loss.

speakerbox.train(dataset: DatasetDict, model_name: str = 'trained-speakerbox', model_base: str = 'superb/wav2vec2-base-superb-sid', max_duration: float = 2.0, seed: int | None = None, use_cpu: bool = False, trainer_arguments_kws: Dict[str, Any] = {'eval_accumulation_steps': 40, 'evaluation_strategy': 'epoch', 'gradient_accumulation_steps': 1, 'gradient_checkpointing': True, 'learning_rate': 3e-05, 'load_best_model_at_end': True, 'logging_steps': 10, 'metric_for_best_model': 'accuracy', 'num_train_epochs': 5, 'per_device_eval_batch_size': 8, 'per_device_train_batch_size': 8, 'save_strategy': 'epoch', 'warmup_ratio': 0.1}) Path[source]#

Train a speaker classification model.

Parameters:
dataset: DatasetDict

The datasets to use for training, testing, and validation. Should only contain the columns/features: “label” and “audio”. The values in the “audio” column should be paths to the audio files.

model_name: str

A name for the model. This will also create a directory with the same name to store the produced model in. Default: “trained-speakerbox”

model_base: str

The model base to use before fine tuning.

max_duration: float

The maximum duration to use for each audio clip. Any clips longer than this will be trimmed. Default: 2.0

seed: Optional[int]

Seed to pass to torch, numpy, and Python RNGs. Default: None (do not set a seed)

use_cpu: bool

Should the model be trained using the CPU. This also sets no_cuda=True on the TrainingArguments. Default: False (use GPU if available)

trainer_arguments_kws: Dict[str, Any]

Any additional keyword arguments to be passed to the HuggingFace TrainingArguments object. Default: DEFAULT_TRAINER_ARGUMENTS_ARGS

Returns:
model_storage_path: Path

The path to the directory where the model is stored.
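
A hedged end-to-end sketch using the top-level exports together with speakerbox.preprocess (all file and directory names are placeholders, and the "valid" split name follows prepare_dataset's output):

from speakerbox import apply, eval_model, train
from speakerbox.preprocess import (
    diarize_and_split_audio,
    expand_labeled_diarized_audio_dir_to_dataset,
    prepare_dataset,
)

# 1. Diarize recordings, then hand-label the produced SPEAKER_XX directories.
diarize_and_split_audio("meeting-1.wav")
diarize_and_split_audio("meeting-2.wav")

# 2. After relabeling, expand into a dataset and create splits.
df = expand_labeled_diarized_audio_dir_to_dataset(["meeting-1/", "meeting-2/"])
dataset_dict, _ = prepare_dataset(df, seed=182318512)

# 3. Train, evaluate, and apply the model.
model_path = train(dataset_dict, model_name="trained-speakerbox")
eval_model(dataset_dict["valid"], model_name="trained-speakerbox")
annotation = apply("new-meeting.wav", model=str(model_path))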