speakerbox package#
Submodules#
speakerbox.examples module#
- class speakerbox.examples.IteratedModelEvalScores(dataset_size: str, equalized_data: bool, mean_audio_per_person_train: float, std_audio_per_person_train: float, mean_audio_per_person_test: float, std_audio_per_person_test: float, mean_audio_per_person_valid: float, std_audio_per_person_valid: float, mean_accuracy: float, std_accuracy: float, mean_precision: float, std_precision: float, mean_recall: float, std_recall: float, mean_duration: float, std_duration: float)[source]#
Bases:
DataClassJsonMixin
- dataset_size: str#
- equalized_data: bool#
- mean_accuracy: float#
- mean_audio_per_person_test: float#
- mean_audio_per_person_train: float#
- mean_audio_per_person_valid: float#
- mean_duration: float#
- mean_precision: float#
- mean_recall: float#
- std_accuracy: float#
- std_audio_per_person_test: float#
- std_audio_per_person_train: float#
- std_audio_per_person_valid: float#
- std_duration: float#
- std_precision: float#
- std_recall: float#
- class speakerbox.examples.ModelEvalScores(accuracy: float, precision: float, recall: float, duration: float)[source]#
Bases:
DataClassJsonMixin
- accuracy: float#
- duration: float#
- precision: float#
- recall: float#
- speakerbox.examples.download_preprocessed_example_data() Path [source]#
Install the example preprocessed dataset from Google Drive.
Stored to the "example-speakerbox-dataset" directory.
- Returns:
- Path
The path to the directory with all of the unzipped data.
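A minimal usage sketch; the dataset is stored relative to the current working directory:

```python
from speakerbox.examples import download_preprocessed_example_data

# Download and unzip the preprocessed example dataset; returns the
# path to the "example-speakerbox-dataset" directory.
dataset_dir = download_preprocessed_example_data()
print(dataset_dir)
```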
- speakerbox.examples.train_and_eval_all_example_models(example_dataset_dir: str | Path, n_iterations: int = 5, seed: int = 182318512, equalize_data_within_splits: bool = False) DataFrame [source]#
Train and evaluate a model multiple times for each of the dataset sizes.
This was used to investigate the diminishing return of adding more data to the model.
- Parameters:
- example_dataset_dir: Union[str, Path]
Path to the downloaded and unzipped example dataset.
- n_iterations: int
The number of training and evaluation iterations to run for this model before averaging the results. Default: 5
- seed: int
A random seed to set global random state.
- equalize_data_within_splits: bool
Whether the data splits should be equalized to the smallest number of examples for any speaker in that split. Default: False (allow different numbers of examples per label)
- Returns:
- pd.DataFrame
A DataFrame of results for all the models tested.
See also
train_and_eval_example_model
The function used to train and evaluate a single model for one dataset size.
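A usage sketch combining the download helper with the full sweep; runtimes depend heavily on available hardware:

```python
from speakerbox.examples import (
    download_preprocessed_example_data,
    train_and_eval_all_example_models,
)

# Train and evaluate each dataset size five times and collect the
# averaged evaluation scores into a single DataFrame.
dataset_dir = download_preprocessed_example_data()
results = train_and_eval_all_example_models(dataset_dir, n_iterations=5)
print(results)
```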
- speakerbox.examples.train_and_eval_example_model(example_dataset_dir: str | Path, dataset_size_str: Literal['15-minutes', '30-minutes', '60-minutes'], n_iterations: int = 5, seed: int = 182318512, equalize_data_within_splits: bool = False) IteratedModelEvalScores [source]#
Train and evaluate a model multiple times for one of the dataset sizes.
This was used to investigate the diminishing return of adding more data to the model.
- Parameters:
- example_dataset_dir: Union[str, Path]
Path to the downloaded and unzipped example dataset.
- dataset_size_str: Literal[“15-minutes”, “30-minutes”, “60-minutes”]
The dataset size to choose from. This will load (and potentially subset) the packaged data.
- n_iterations: int
The number of training and evaluation iterations to run for this model before averaging the results. Default: 5
- seed: int
A random seed to set global random state.
- equalize_data_within_splits: bool
Whether the data splits should be equalized to the smallest number of examples for any speaker in that split. Default: False (allow different numbers of examples per label)
- Returns:
- IteratedModelEvalScores
The average accuracy, precision, recall, and duration over the training and evaluation iterations.
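A usage sketch for a single dataset size; the dataset directory path assumes a prior call to download_preprocessed_example_data:

```python
from speakerbox.examples import train_and_eval_example_model

# Train and evaluate only the "15-minutes" subset, three iterations.
scores = train_and_eval_example_model(
    "example-speakerbox-dataset",
    dataset_size_str="15-minutes",
    n_iterations=3,
)
print(scores.mean_accuracy, scores.std_accuracy)
```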
speakerbox.main module#
- speakerbox.main.apply(audio: str | Path, model: str, mode: Literal['diarize', 'naive'] = 'diarize', min_chunk_duration: float = 0.5, max_chunk_duration: float = 2.0, confidence_threshold: float = 0.85) Annotation [source]#
Iteratively apply the model across chunks of an audio file.
- Parameters:
- audio: Union[str, Path]
The audio filepath.
- model: str
The path to the trained audio-classification model.
- mode: Literal[“diarize”, “naive”]
Which mode to use for processing. “diarize” will diarize the audio prior to generating chunks to classify. “naive” will iteratively process chunks. “naive” is assumed to be faster but have worse performance. Default: “diarize”
- min_chunk_duration: float
The minimum size in seconds a chunk of audio is allowed to be for it to be run through the classification pipeline. Default: 0.5 seconds
- max_chunk_duration: float
The maximum size in seconds a chunk of audio is allowed to be for it to be run through the classification pipeline. Default: 2.0 seconds
- confidence_threshold: float
A value to act as a lower bound on the reported confidence of the model prediction. Any classification with a confidence lower than this value will be ignored and not added as a segment. Default: 0.85 (fairly strict / must have high confidence in prediction)
- Returns:
- Annotation
A pyannote.core Annotation with all labeled segments.
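A usage sketch; "meeting.wav" and "trained-speakerbox" are placeholder paths:

```python
from speakerbox.main import apply

# Classify speakers across the recording, keeping only predictions
# at or above the confidence threshold.
annotation = apply(
    "meeting.wav",
    "trained-speakerbox",
    mode="diarize",
    confidence_threshold=0.85,
)

# pyannote.core Annotations iterate as (segment, track, label) triples.
for segment, _, label in annotation.itertracks(yield_label=True):
    print(f"{label}: {segment.start:.1f}s-{segment.end:.1f}s")
```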
- speakerbox.main.eval_model(validation_dataset: Dataset, model_name: str = 'trained-speakerbox') Tuple[float, float, float, float] [source]#
Evaluate a trained model.
This will store two files in the model directory: a markdown file with the accuracy, precision, and recall, and a PNG of the generated top-one confusion matrix.
- Parameters:
- validation_dataset: Dataset
The dataset to validate the model against.
- model_name: str
A name for the model. This will also create a directory with the same name to store the produced model in. Default: “trained-speakerbox”
- Returns:
- accuracy: float
The model accuracy as returned by sklearn.metrics.accuracy_score.
- precision: float
The model (weighted) precision as returned by sklearn.metrics.precision_score.
- recall: float
The model (weighted) recall as returned by sklearn.metrics.recall_score.
- loss: float
The model log loss as returned by sklearn.metrics.log_loss.
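A usage sketch; here `dataset` is assumed to be the DatasetDict produced by speakerbox.preprocess.prepare_dataset ("valid" as the validation split name is an assumption), with a model already trained under "trained-speakerbox":

```python
from speakerbox.main import eval_model

# Writes the metrics markdown file and confusion matrix PNG into the
# model directory, then returns the four scores.
accuracy, precision, recall, loss = eval_model(
    dataset["valid"],
    model_name="trained-speakerbox",
)
print(f"accuracy={accuracy:.3f}, loss={loss:.3f}")
```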
- speakerbox.main.train(dataset: DatasetDict, model_name: str = 'trained-speakerbox', model_base: str = 'superb/wav2vec2-base-superb-sid', max_duration: float = 2.0, seed: int | None = None, use_cpu: bool = False, trainer_arguments_kws: Dict[str, Any] = {'eval_accumulation_steps': 40, 'evaluation_strategy': 'epoch', 'gradient_accumulation_steps': 1, 'gradient_checkpointing': True, 'learning_rate': 3e-05, 'load_best_model_at_end': True, 'logging_steps': 10, 'metric_for_best_model': 'accuracy', 'num_train_epochs': 5, 'per_device_eval_batch_size': 8, 'per_device_train_batch_size': 8, 'save_strategy': 'epoch', 'warmup_ratio': 0.1}) Path [source]#
Train a speaker classification model.
- Parameters:
- dataset: DatasetDict
The datasets to use for training, testing, and validation. Should only contain the columns/features: “label” and “audio”. The values in the “audio” column should be paths to the audio files.
- model_name: str
A name for the model. This will also create a directory with the same name to store the produced model in. Default: “trained-speakerbox”
- model_base: str
The model base to use before fine tuning.
- max_duration: float
The maximum duration to use for each audio clip. Any clips longer than this will be trimmed. Default: 2.0
- seed: Optional[int]
Seed to pass to torch, numpy, and Python RNGs. Default: None (do not set a seed)
- use_cpu: bool
Whether the model should be trained on the CPU. This also sets no_cuda=True on TrainingArguments. Default: False (use GPU if available)
- trainer_arguments_kws: Dict[str, Any]
Any additional keyword arguments to be passed to the Hugging Face TrainingArguments object. Default: DEFAULT_TRAINER_ARGUMENTS_ARGS
- Returns:
- model_storage_path: Path
The path to the directory where the model is stored.
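A usage sketch of the end-to-end flow under assumed placeholder paths ("dia-annotated" is a relabeled diarization output directory):

```python
from speakerbox.main import train
from speakerbox.preprocess import (
    expand_labeled_diarized_audio_dir_to_dataset,
    prepare_dataset,
)

# Expand relabeled diarization output into a dataset, build the
# train/test/valid splits, then fine-tune the base model.
df = expand_labeled_diarized_audio_dir_to_dataset(["dia-annotated"])
dataset, _ = prepare_dataset(df, seed=182318512)
model_storage_path = train(dataset, model_name="trained-speakerbox")
print(model_storage_path)
```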
speakerbox.preprocess module#
- speakerbox.preprocess.diarize_and_split_audio(audio_file: str | Path, storage_dir: str | Path | None = None, min_audio_chunk_duration: float = 0.5, diarization_pipeline: Pipeline | None = None, seed: int | None = None, hf_token: str | None = None) Path [source]#
Diarize a single audio file and split it into smaller chunks stored in directories named with the unlabeled speaker annotations.
- Parameters:
- audio_file: Union[str, Path]
The audio file to diarize and split.
- storage_dir: Optional[Union[str, Path]]
A specific directory to store the produced chunks to. Default: None (use the audio file name to create a new directory)
- min_audio_chunk_duration: float
Length of the minimum audio duration to allow through after chunking. Default: 0.5 seconds
- diarization_pipeline: Optional[Pipeline]
A preloaded PyAnnote Pipeline. Default: None (load default)
- seed: Optional[int]
Seed to pass to torch, numpy, and Python RNGs. Default: None (do not set a seed)
- hf_token: Optional[str]
Huggingface user access token to download the diarization model. Can also be set with the HUGGINGFACE_TOKEN environment variable. https://hf.co/settings/tokens
- Returns:
- storage_dir: Path
The path to where all the chunked audio was stored.
See also
expand_labeled_diarized_audio_dir_to_dataset
After labeling the audio in the produced diarized audio directory, expand the labeled data into a dataset ready for training.
Notes
Prior to using this function you need to accept user conditions: https://hf.co/pyannote/speaker-diarization and https://hf.co/pyannote/segmentation
The output directory structure of the produced chunks will follow the pattern:
{storage_dir}/
├── SPEAKER_00
│   ├── {start_time_millis}-{end_time_millis}.wav
│   └── {start_time_millis}-{end_time_millis}.wav
├── SPEAKER_01
│   ├── {start_time_millis}-{end_time_millis}.wav
│   └── {start_time_millis}-{end_time_millis}.wav
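A minimal usage sketch; "meeting.wav" is a placeholder recording and the token value is illustrative only:

```python
from speakerbox.preprocess import diarize_and_split_audio

# The Hugging Face token can also be provided via the
# HUGGINGFACE_TOKEN environment variable.
storage_dir = diarize_and_split_audio("meeting.wav", hf_token="hf_...")
print(storage_dir)
```

After manually relabeling the produced SPEAKER_* directories, pass the storage directory to expand_labeled_diarized_audio_dir_to_dataset.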
- speakerbox.preprocess.expand_gecko_annotations_to_dataset(annotations_and_audios: List[GeckoAnnotationAndAudio], audio_output_dir: str | Path = 'chunked-audio/', overwrite: bool = False, min_audio_chunk_duration: float = 0.5, max_audio_chunk_duration: float = 2.0) DataFrame [source]#
Expand a list of annotation and audio files into a full dataset to be used for training and testing a speaker classification model.
- Parameters:
- annotations_and_audios: List[GeckoAnnotationAndAudio]
A list of annotation and their matching audio files to expand into a speaker, audio file path, start and end times.
- audio_output_dir: Union[str, Path]
A directory path to store the chunked audio files in. Default: “chunked-audio” directory in the current working directory.
- overwrite: bool
When writing out an audio chunk, whether existing files should be overwritten. Default: False (do not overwrite)
- min_audio_chunk_duration: float
Length of the minimum audio duration to allow through after chunking. Default: 0.5 seconds
- max_audio_chunk_duration: float
Length of the maximum audio duration to split larger audio files into. Default: 2.0 seconds
- Returns:
- dataset: pd.DataFrame
The expanded dataset with columns: conversation_id, label, audio, duration
- Raises:
- NotADirectoryError
A file exists at the specified destination.
- FileExistsError
A file exists at the target chunk audio location but overwrite is False.
Notes
Generated and attached conversation ids are pulled from the annotation file name.
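A usage sketch with placeholder paths; note that importing GeckoAnnotationAndAudio from speakerbox.types and its field names (annotation_file, audio_file) are assumptions:

```python
from speakerbox.preprocess import expand_gecko_annotations_to_dataset
from speakerbox.types import GeckoAnnotationAndAudio

# Pair each Gecko annotation JSON with its source audio file.
pairs = [
    GeckoAnnotationAndAudio(
        annotation_file="annotations/meeting-1.json",
        audio_file="audio/meeting-1.wav",
    ),
]
dataset_df = expand_gecko_annotations_to_dataset(pairs)
```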
- speakerbox.preprocess.expand_labeled_diarized_audio_dir_to_dataset(labeled_diarized_audio_dir: str | Path | List[str] | List[Path] | List[str | Path], audio_output_dir: str | Path = 'chunked-audio/', overwrite: bool = False, min_audio_chunk_duration: float = 0.5, max_audio_chunk_duration: float = 2.0) DataFrame [source]#
Expand the provided labeled diarized audio into a dataset ready for training.
- Parameters:
- labeled_diarized_audio_dir: Union[Union[str, Path], List[Union[str, Path]]]
A path or list of paths to diarization results directories. These directories should no longer use the default "SPEAKER_00", "SPEAKER_01", etc. labeling but instead expert-annotated labels.
- audio_output_dir: Union[str, Path]
A directory path to store the chunked audio files in. Default: “chunked-audio” directory in the current working directory.
- overwrite: bool
When writing out an audio chunk, whether existing files should be overwritten. Default: False (do not overwrite)
- min_audio_chunk_duration: float
Length of the minimum audio duration to allow through after chunking. Default: 0.5 seconds
- max_audio_chunk_duration: float
Length of the maximum audio duration to split larger audio files into. Default: 2.0 seconds
- Returns:
- dataset: pd.DataFrame
The expanded dataset with columns: conversation_id, label, audio, duration
- Raises:
- NotADirectoryError
A file exists at the specified destination.
- FileExistsError
A file exists at the target chunk audio location but overwrite is False.
See also
diarize_and_split_audio
Function to diarize an audio file and split into annotation directories.
Notes
The provided labeled diarized audio directory(s) should have the following structure:
{labeled_diarized_audio_dir}/
├── label
│   ├── 1.wav
│   └── 2.wav
├── second_label
│   ├── 1.wav
│   └── 2.wav
Generated and attached conversation ids are pulled from the labeled diarized audio directory names.
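A usage sketch; the listed directories are placeholders for relabeled diarization results:

```python
from speakerbox.preprocess import expand_labeled_diarized_audio_dir_to_dataset

# Chunk the labeled audio and collect it into a training-ready table.
dataset_df = expand_labeled_diarized_audio_dir_to_dataset(
    ["dia-meeting-1", "dia-meeting-2"],
    audio_output_dir="chunked-audio/",
)
print(dataset_df[["conversation_id", "label", "duration"]].head())
```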
- speakerbox.preprocess.prepare_dataset(dataset: DataFrame, test_and_valid_size: float = 0.4, equalize_data_within_splits: bool = False, n_iterations: int = 100, seed: int | None = None) Tuple[DatasetDict, DataFrame] [source]#
Prepare a dataset for training a new speakerbox / audio-classification model.
This function attempts to randomly create train, test, and validation splits from the provided dataframe that meet the following two conditions:
1. There is data holdout by conversation_id. I.e. if the dataset contains data from nine unique conversation ids, the training, test, and validation sets should all have different conversation ids (train has 0, 1, 2, 3; test has 4, 5, 6; validation has 7, 8).
2. There is data stratification by label. I.e. if the dataset contains nine unique labels, the training, test, and validation sets should each have all nine labels present (train, test, and validation all have labels 0-8).
- Parameters:
- dataset: pd.DataFrame
An expanded dataset with columns: conversation_id, label, audio, duration
- test_and_valid_size: float
How much of the dataset to use for the combined test and validation sets, as a fraction (i.e. 0.4 = 40% of the dataset). This portion is then split in half between test and validation (i.e. 0.4 means 20% of the total data for testing and 20% for validation).
- equalize_data_within_splits: bool
After finding valid train, test, and validation splits, whether the data within each split should be reduced to an equal number of examples for each label. Default: False (do not equalize labels within splits)
- n_iterations: int
The number of iterations to attempt to find viable train, test, and validation sets. Default: 100
- seed: Optional[int]
Seed to pass to torch, numpy, and Python RNGs. Default: None (do not set a seed)
- Returns:
- dataset: DatasetDict
The prepared dataset split into train, test, and validation splits.
- value_counts: pd.DataFrame
A value count table where each row is a different label and each column is the count of that label in the matching train, test, or validation set.
- Raises:
- ValueError
Could not find train, test, and validation sets that meet the holdout and stratification criteria after n_iterations attempts. It is recommended to annotate more data.
See also
expand_labeled_diarized_audio_dir_to_dataset
Function to move from a directory of diarized audio (or multiple) into a dataset to provide to this function.
expand_gecko_annotations_to_dataset
Function to move from a gecko annotation file (or multiple) into a dataset to provide to this function.
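A usage sketch with a placeholder input directory; a fixed seed makes the split search reproducible:

```python
from speakerbox.preprocess import (
    expand_labeled_diarized_audio_dir_to_dataset,
    prepare_dataset,
)

# Build conversation-holdout, label-stratified splits from the
# expanded dataset.
df = expand_labeled_diarized_audio_dir_to_dataset(["dia-meeting-1"])
dataset, value_counts = prepare_dataset(
    df,
    test_and_valid_size=0.4,
    seed=182318512,
)
# Inspect per-label counts in each split before training.
print(value_counts)
```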
speakerbox.types module#
- class speakerbox.types.AnnotatedAudio(conversation_id: str, label: str, audio: str, duration: float)[source]#
Bases:
object
- audio: str#
- conversation_id: str#
- duration: float#
- classmethod from_dict(kvs: dict | list | str | int | float | bool | None, *, infer_missing=False) A #
- classmethod from_json(s: str | bytes | bytearray, *, parse_float=None, parse_int=None, parse_constant=None, infer_missing=False, **kw) A #
- label: str#
- classmethod schema(*, infer_missing: bool = False, only=None, exclude=(), many: bool = False, context=None, load_only=(), dump_only=(), partial: bool = False, unknown=None) SchemaF[A] #
- to_dict(encode_json=False) Dict[str, dict | list | str | int | float | bool | None] #
- to_json(*, skipkeys: bool = False, ensure_ascii: bool = True, check_circular: bool = True, allow_nan: bool = True, indent: int | str | None = None, separators: Tuple[str, str] | None = None, default: Callable | None = None, sort_keys: bool = False, **kw) str #
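A minimal sketch of the JSON round-trip helpers attached to AnnotatedAudio (field values are placeholders):

```python
from speakerbox.types import AnnotatedAudio

aa = AnnotatedAudio(
    conversation_id="meeting-1",
    label="speaker-a",
    audio="chunked-audio/meeting-1/0-2000.wav",
    duration=2.0,
)

# Serialize to a JSON string and rebuild an equal instance from it.
restored = AnnotatedAudio.from_json(aa.to_json())
assert restored == aa
```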
speakerbox.utils module#
Module contents#
Top-level package for speakerbox.
- speakerbox.apply(audio: str | Path, model: str, mode: Literal['diarize', 'naive'] = 'diarize', min_chunk_duration: float = 0.5, max_chunk_duration: float = 2.0, confidence_threshold: float = 0.85) Annotation [source]#
Iteratively apply the model across chunks of an audio file.
- Parameters:
- audio: Union[str, Path]
The audio filepath.
- model: str
The path to the trained audio-classification model.
- mode: Literal[“diarize”, “naive”]
Which mode to use for processing. “diarize” will diarize the audio prior to generating chunks to classify. “naive” will iteratively process chunks. “naive” is assumed to be faster but have worse performance. Default: “diarize”
- min_chunk_duration: float
The minimum size in seconds a chunk of audio is allowed to be for it to be run through the classification pipeline. Default: 0.5 seconds
- max_chunk_duration: float
The maximum size in seconds a chunk of audio is allowed to be for it to be run through the classification pipeline. Default: 2.0 seconds
- confidence_threshold: float
A value to act as a lower bound on the reported confidence of the model prediction. Any classification with a confidence lower than this value will be ignored and not added as a segment. Default: 0.85 (fairly strict / must have high confidence in prediction)
- Returns:
- Annotation
A pyannote.core Annotation with all labeled segments.
- speakerbox.eval_model(validation_dataset: Dataset, model_name: str = 'trained-speakerbox') Tuple[float, float, float, float] [source]#
Evaluate a trained model.
This will store two files in the model directory: a markdown file with the accuracy, precision, and recall, and a PNG of the generated top-one confusion matrix.
- Parameters:
- validation_dataset: Dataset
The dataset to validate the model against.
- model_name: str
A name for the model. This will also create a directory with the same name to store the produced model in. Default: “trained-speakerbox”
- Returns:
- accuracy: float
The model accuracy as returned by sklearn.metrics.accuracy_score.
- precision: float
The model (weighted) precision as returned by sklearn.metrics.precision_score.
- recall: float
The model (weighted) recall as returned by sklearn.metrics.recall_score.
- loss: float
The model log loss as returned by sklearn.metrics.log_loss.
- speakerbox.train(dataset: DatasetDict, model_name: str = 'trained-speakerbox', model_base: str = 'superb/wav2vec2-base-superb-sid', max_duration: float = 2.0, seed: int | None = None, use_cpu: bool = False, trainer_arguments_kws: Dict[str, Any] = {'eval_accumulation_steps': 40, 'evaluation_strategy': 'epoch', 'gradient_accumulation_steps': 1, 'gradient_checkpointing': True, 'learning_rate': 3e-05, 'load_best_model_at_end': True, 'logging_steps': 10, 'metric_for_best_model': 'accuracy', 'num_train_epochs': 5, 'per_device_eval_batch_size': 8, 'per_device_train_batch_size': 8, 'save_strategy': 'epoch', 'warmup_ratio': 0.1}) Path [source]#
Train a speaker classification model.
- Parameters:
- dataset: DatasetDict
The datasets to use for training, testing, and validation. Should only contain the columns/features: “label” and “audio”. The values in the “audio” column should be paths to the audio files.
- model_name: str
A name for the model. This will also create a directory with the same name to store the produced model in. Default: “trained-speakerbox”
- model_base: str
The model base to use before fine tuning.
- max_duration: float
The maximum duration to use for each audio clip. Any clips longer than this will be trimmed. Default: 2.0
- seed: Optional[int]
Seed to pass to torch, numpy, and Python RNGs. Default: None (do not set a seed)
- use_cpu: bool
Whether the model should be trained on the CPU. This also sets no_cuda=True on TrainingArguments. Default: False (use GPU if available)
- trainer_arguments_kws: Dict[str, Any]
Any additional keyword arguments to be passed to the Hugging Face TrainingArguments object. Default: DEFAULT_TRAINER_ARGUMENTS_ARGS
- Returns:
- model_storage_path: Path
The path to the directory where the model is stored.