cdp_backend.utils package#

Subpackages#

Submodules#

cdp_backend.utils.constants_utils module#

cdp_backend.utils.constants_utils.get_all_class_attr_values(cls: Type) List[Any][source]#

Get all class attribute values of the provided class. Intended for retrieving all constant values defined on a class.

Parameters:
cls: Type

The class to get the class attribute values for.

Returns:
class_attr_values: List[Any]:

The class attribute values.
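As an illustration of the behavior described above, here is a minimal sketch of collecting constant values from a class. The helper name `sketch_get_all_class_attr_values` and the `EventStatus` class are hypothetical, and the filtering rule (skip dunder names and callables) is an assumption, not necessarily what the library does.

```python
from typing import Any, List, Type


def sketch_get_all_class_attr_values(cls: Type) -> List[Any]:
    """Collect values of the non-dunder, non-callable attributes defined on cls."""
    return [
        value
        for name, value in vars(cls).items()
        # Assumption: constants are plain attributes without dunder names.
        if not name.startswith("__") and not callable(value)
    ]


class EventStatus:  # hypothetical constants class, for illustration only
    SCHEDULED = "scheduled"
    CANCELLED = "cancelled"


values = sketch_get_all_class_attr_values(EventStatus)
```

Because `vars` preserves definition order, the values come back in the order the constants were declared.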

cdp_backend.utils.file_utils module#

cdp_backend.utils.file_utils.append_to_stem(path: Path, addition: str) Path[source]#

Create a path with a string appended to the path stem.

Parameters:
path: Path

The path to alter

addition: str

The string to be appended to the path stem

Returns:
path: Path

The new path with the stem addition
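The stem manipulation can be sketched with pathlib alone. The `sketch_*` names are hypothetical, and the sketch assumes that only the final suffix is preserved; it does not touch the filesystem.

```python
from pathlib import Path


def sketch_with_stem(path: Path, stem: str) -> Path:
    """Return a path with the same parent and suffix but a new stem."""
    return path.with_name(stem + path.suffix)


def sketch_append_to_stem(path: Path, addition: str) -> Path:
    """Return a path whose stem has `addition` appended to it."""
    return sketch_with_stem(path, path.stem + addition)


new_path = sketch_append_to_stem(Path("videos/meeting.mp4"), "-clipped")
```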

cdp_backend.utils.file_utils.clip_and_reformat_video(video_filepath: Path, start_time: str | None, end_time: str | None, output_path: Path = None, output_format: str = 'mp4') Path[source]#

Clip a video file to a specific time range and convert to requested output format.

Parameters:
video_filepath: Path

The filepath of the video to clip.

start_time: Optional[str]

The start time of the clip in HH:MM:SS.

end_time: Optional[str]

The end time of the clip in HH:MM:SS.

output_path: Path

The output path to place the clip at.

output_format: str

The output format. Default: “mp4”

Returns:
Path:

The path where the new file was stored to.
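For orientation, the following sketch builds an ffmpeg command line equivalent to clipping by time range. It is an assumption about how such a clip could be expressed with the ffmpeg CLI (`-ss` and `-to` as output options); the library itself may invoke ffmpeg differently, and `sketch_clip_command` is a hypothetical helper that only constructs the argument list without running it.

```python
from pathlib import Path
from typing import List, Optional


def sketch_clip_command(
    video_filepath: Path,
    start_time: Optional[str],
    end_time: Optional[str],
    output_path: Path,
) -> List[str]:
    """Build an ffmpeg argument list; omitted times leave that end untrimmed."""
    cmd = ["ffmpeg", "-i", str(video_filepath)]
    if start_time is not None:
        cmd += ["-ss", start_time]  # clip start, HH:MM:SS
    if end_time is not None:
        cmd += ["-to", end_time]  # clip end, HH:MM:SS
    cmd.append(str(output_path))
    return cmd


cmd = sketch_clip_command(Path("meeting.mp4"), "00:10:00", "00:15:00", Path("clip.mp4"))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` on a machine with ffmpeg installed.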

cdp_backend.utils.file_utils.convert_video_to_mp4(video_filepath: Path, start_time: str | None, end_time: str | None, output_path: Path = None) Path[source]#

Converts a video to an equivalent MP4 file.

Parameters:
video_filepath: Path

The filepath of the video to convert.

start_time: Optional[str]

The start time to trim the video in HH:MM:SS.

end_time: Optional[str]

The end time to trim the video in HH:MM:SS.

output_path: Path

The output path to place the clip at.

Returns:
output_path: Path

The filepath of the converted MP4 video.

cdp_backend.utils.file_utils.download_video_from_session_id(credentials_file: str, session_id: str, dest: str | Path | None = None) str | Path[source]#

Using the provided session_id, pull the associated video and place it at the destination.

Parameters:
credentials_file: str

The path to the Google Service Account credentials JSON file used to initialize the file store connection.

session_id: str

The id of the session to retrieve the video for.

dest: Optional[Union[str, Path]]

A destination to store the file to. This is passed directly to the resource_copy function.

Returns:
Union[str, Path]

The destination path.

See also

cdp_backend.utils.file_utils.resource_copy

The function that downloads the video from remote host.

cdp_backend.utils.file_utils.find_proper_resize_ratio(height: int, width: int) float[source]#

Return the proper ratio to resize a thumbnail greater than 960 x 540 pixels.

Parameters:
height: int

The height, in pixels, of the thumbnail to be resized.

width: int

The width, in pixels, of the thumbnail to be resized.

Returns:
final_ratio: float

The ratio by which the thumbnail will be resized. If the ratio is less than 1, the thumbnail is too large and should be resized by a factor of final_ratio. If the ratio is greater than or equal to 1, the thumbnail is not too large and should not be resized.
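One way to compute a ratio with the properties described above is to take the smaller of the two per-dimension ratios. This is a sketch under that assumption; `sketch_resize_ratio` and the constant names are hypothetical, and the library may handle the not-too-large case differently.

```python
MAX_THUMBNAIL_WIDTH = 960  # pixels, from the size limit stated above
MAX_THUMBNAIL_HEIGHT = 540  # pixels


def sketch_resize_ratio(height: int, width: int) -> float:
    """Return the largest scale factor that fits the image within 960 x 540."""
    return min(MAX_THUMBNAIL_HEIGHT / height, MAX_THUMBNAIL_WIDTH / width)


ratio = sketch_resize_ratio(height=1080, width=1920)
```

A 1920 x 1080 frame yields a ratio below 1 and would be scaled down, while a 960 x 540 frame yields exactly 1 and would be left alone.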

cdp_backend.utils.file_utils.generate_file_storage_name(file_uri: str, suffix: str) str[source]#

Generate a filename using the hash of the file contents and some provided suffix.

Parameters:
file_uri: str

The URI to the file to hash.

suffix: str

The suffix to append to the hash as a part of the filename.

Returns:
dst: str

The name of the file as it should be on Google Cloud Storage.

cdp_backend.utils.file_utils.get_hover_thumbnail(video_path: str, session_content_hash: str, num_frames: int = 10, duration: float = 6.0) str[source]#

Produce a gif hover thumbnail from an mp4 video file.

Parameters:
video_path: str

The URL of the video from which the thumbnail will be produced

session_content_hash: str

The video content hash. This will be used in the produced image file’s name

num_frames: int

Determines the number of frames in the thumbnail

duration: float

Runtime of the produced GIF. Default: 6.0 seconds

Returns:
str: cover_name

The name of the thumbnail file: Always session_content_hash + “-hover-thumbnail.gif”

cdp_backend.utils.file_utils.get_media_type(uri: str) str | None[source]#

Get the IANA media type for the provided URI. If one could not be found, return None.

Parameters:
uri: str

The URI to get the IANA media type for.

Returns:
mtype: Optional[str]:

The found matching IANA media type.
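A lookup of this kind can be sketched with the standard library's `mimetypes` module, which guesses a media type from the file extension in a path or URL. Whether the library relies on `mimetypes` alone is an assumption; `sketch_get_media_type` is a hypothetical name.

```python
import mimetypes
from typing import Optional


def sketch_get_media_type(uri: str) -> Optional[str]:
    """Guess the media type from the URI's extension; None if unknown."""
    mtype, _encoding = mimetypes.guess_type(uri)
    return mtype


video_type = sketch_get_media_type("https://example.com/video.mp4")
unknown_type = sketch_get_media_type("no-extension")
```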

cdp_backend.utils.file_utils.get_static_thumbnail(video_path: str, session_content_hash: str, seconds: int = 30) str[source]#

Produce a PNG thumbnail image from a video file.

Parameters:
video_path: str

The URL of the video from which the thumbnail will be produced

session_content_hash: str

The video content hash. This will be used in the produced image file’s name

seconds: int

The number of seconds into the video at which a frame is selected to produce the thumbnail. Default: 30 seconds

Returns:
str: cover_name

The name of the thumbnail file: Always session_content_hash + “-static-thumbnail.png”

cdp_backend.utils.file_utils.hash_file_contents(uri: str, buffer_size: int = 65536) str[source]#

Return the SHA256 hash of a file’s content.

Parameters:
uri: str

The uri for the file to hash.

buffer_size: int

The number of bytes to read at a time. Default: 2^16 (64KB)

Returns:
hash: str

The SHA256 hash for the file contents.
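Buffered hashing, as implied by the buffer_size parameter, can be sketched as follows. This version only handles local files and assumes the hash is returned as a hexadecimal string; `sketch_hash_file_contents` is a hypothetical name, and the real function accepts a URI.

```python
import hashlib
import tempfile
from pathlib import Path


def sketch_hash_file_contents(path: Path, buffer_size: int = 2**16) -> str:
    """Return the SHA256 hex digest of a local file, read in chunks."""
    hasher = hashlib.sha256()
    with open(path, "rb") as open_file:
        while True:
            chunk = open_file.read(buffer_size)
            if not chunk:
                break
            hasher.update(chunk)
    return hasher.hexdigest()


# Hash a temporary file larger than one buffer to exercise the loop.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample = Path(tmp_dir) / "sample.bin"
    sample.write_bytes(b"x" * 200_000)
    digest = sketch_hash_file_contents(sample)
```

Reading in fixed-size chunks keeps memory use constant regardless of file size, which matters for long meeting videos.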

cdp_backend.utils.file_utils.parse_doc_file(document_raw: bytes) str[source]#

Extract text from a .doc matter file.

Parameters:
document_raw: bytes

The raw document.

Returns:
str:

A str of all text in the .doc file.

cdp_backend.utils.file_utils.parse_document(document_uri: str) str[source]#

Extract text from a .doc, .docx, or .ppt matter file.

Parameters:
document_uri: str

The matter file uri.

Returns:
str:

A string of all text in the matter file.

cdp_backend.utils.file_utils.parse_docx_file(zip_archive_bytes: bytes) str[source]#

Extract text from a .docx matter file.

Parameters:
zip_archive_bytes: bytes

The raw document to be parsed. Word docx files are zip archives.

Returns:
str:

A str of all text in the .docx file.
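Because a .docx file is a zip archive, its text can be reached with the standard library. The sketch below reads `word/document.xml` and joins the text runs of each paragraph. It illustrates the idea only: `sketch_parse_docx` is a hypothetical name, the joining rule is an assumption, and the library may use a different parser.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used by elements in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def sketch_parse_docx(zip_archive_bytes: bytes) -> str:
    """Extract paragraph text from the main document part of a .docx archive."""
    with zipfile.ZipFile(io.BytesIO(zip_archive_bytes)) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    # Assumption: paragraphs are separated by a single space.
    return " ".join(p for p in paragraphs if p)


# Build a tiny in-memory archive containing only the part this sketch reads.
document_xml = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p><w:r><w:t>Council</w:t></w:r><w:r><w:t> Bill</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>120001</w:t></w:r></w:p>"
    "</w:body></w:document>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", document_xml)

text = sketch_parse_docx(buffer.getvalue())
```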

cdp_backend.utils.file_utils.parse_pdf_file(document_raw: bytes) str[source]#

Extract text from a .pdf matter file.

Parameters:
document_raw: bytes

The raw document.

Returns:
str:

A str of all text in the .pdf file.

cdp_backend.utils.file_utils.parse_pptx_file(document_raw: bytes) str[source]#

Extract text from a .pptx matter file.

Parameters:
document_raw: bytes

The raw document.

Returns:
str:

A str of all text in the .pptx file.

cdp_backend.utils.file_utils.remove_duplicate_space(parsed_text: str) str[source]#

Remove all duplicate whitespace characters and replace with a single space.

Parameters:
parsed_text: str

The parsed text from the document.

Returns:
str:

A string with no more than one consecutive space.
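Collapsing whitespace in this way is commonly done with a single regular expression substitution. The sketch below assumes that any run of whitespace, including newlines and tabs, becomes one space; `sketch_remove_duplicate_space` is a hypothetical name.

```python
import re


def sketch_remove_duplicate_space(parsed_text: str) -> str:
    """Replace every run of whitespace characters with a single space."""
    return re.sub(r"\s+", " ", parsed_text)


cleaned = sketch_remove_duplicate_space("Agenda\n\n  item\t 1")
```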

cdp_backend.utils.file_utils.rename_append_to_stem(path: Path, addition: str) Path[source]#

Rename a file with a string appended to the path stem.

Parameters:
path: Path

The path to be renamed

addition: str

The string to be appended to the path stem

Returns:
path: Path

The new path of the renamed file

cdp_backend.utils.file_utils.rename_with_stem(path: Path, stem: str) Path[source]#

Rename a file, replacing the path stem with a new stem.

Parameters:
path: Path

The path to be renamed

stem: str

The string to become the new stem

Returns:
path: Path

The new path of the renamed file

cdp_backend.utils.file_utils.resource_copy(uri: str, dst: str | Path | None = None, copy_suffix: bool = False, overwrite: bool = False) str[source]#

Copy a resource (local or remote) to a local destination on the machine.

Parameters:
uri: str

The uri for the resource to copy.

dst: Optional[Union[str, Path]]

A specific destination where the copy should be placed. If None is provided, the resource is stored in the current working directory.

copy_suffix: bool

Whether to copy the file suffix or not. Default: False (do not copy with suffix)

overwrite: bool

Boolean value indicating whether or not to overwrite a local resource with the same name if it already exists.

Returns:
saved_path: str

The path of where the resource ended up getting copied to.

cdp_backend.utils.file_utils.should_copy_video(video_filepath: Path, output_format: str = 'mp4') bool[source]#

Check if the video should be copied using ffmpeg StreamCopy codec or if it should be re-encoded as h264.

A video will be copied if and only if the following conditions are met:

- The video at video_filepath has a .mp4 extension
- The desired output format is mp4
- The video at video_filepath has a video stream with a codec of h264

Parameters:
video_filepath: Path

The filepath of the video under scrutiny.

output_format: str

The desired output format of the video at video_filepath.

Returns:
bool:

True if the video should be copied, False if it should be re-encoded.
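The three conditions can be written as a single boolean expression. In this sketch the codec is passed in as an argument, which is an assumption made to keep the example self-contained; the real function takes only the filepath and output format and presumably probes the file for its codec. `sketch_should_copy_video` and `video_codec` are hypothetical names.

```python
from pathlib import Path


def sketch_should_copy_video(
    video_filepath: Path,
    video_codec: str,  # hypothetical: codec name of the video stream
    output_format: str = "mp4",
) -> bool:
    """Return True only when all three stream-copy conditions hold."""
    return (
        video_filepath.suffix.lower() == ".mp4"
        and output_format == "mp4"
        and video_codec == "h264"
    )


can_copy = sketch_should_copy_video(Path("meeting.mp4"), "h264")
must_reencode = not sketch_should_copy_video(Path("meeting.webm"), "vp9")
```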

cdp_backend.utils.file_utils.split_audio(video_read_path: str, audio_save_path: str, overwrite: bool = False) tuple[str, str, str][source]#

Split and store the audio from a video file using ffmpeg.

Parameters:
video_read_path: str

Path to the video to split the audio from.

audio_save_path: str

Path to where the audio should be stored.

overwrite: bool

Whether to overwrite existing files or not. Default: False (do not overwrite)

Returns:
resolved_audio_save_path: str

Path to where the split audio file was saved.

ffmpeg_stdout_path: str

Path to the ffmpeg stdout log file.

ffmpeg_stderr_path: str

Path to the ffmpeg stderr log file.

cdp_backend.utils.file_utils.vimeo_copy(uri: str, dst: Path, overwrite: bool = False) str[source]#

Copy a video from Vimeo to a local destination on the machine for analysis.

Parameters:
uri: str

The url of the Vimeo video to copy.

dst: Path

The location of the file to download.

overwrite: bool

Boolean value indicating whether or not to overwrite a local video with the same name if it already exists.

Returns:
dst: str

The location of the downloaded file.

cdp_backend.utils.file_utils.with_stem(path: Path, stem: str) Path[source]#

Create a path with a new stem.

Parameters:
path: Path

The path to alter

stem: str

The string to be the new stem of the path

Returns:
path: Path

The new path with the replaced stem

cdp_backend.utils.file_utils.youtube_copy(uri: str, dst: Path, overwrite: bool = False) str[source]#

Copy a video from YouTube to a local destination on the machine.

Parameters:
uri: str

The url of the YouTube video to copy.

dst: Path

The location of the file to download.

overwrite: bool

Boolean value indicating whether or not to overwrite a local video with the same name if it already exists.

Returns:
dst: str

The location of the downloaded file.

cdp_backend.utils.string_utils module#

cdp_backend.utils.string_utils.clean_text(text: str, clean_stop_words: bool = False, clean_emojis: bool = False) str[source]#

Clean text of common characters and extra formatting.

Parameters:
text: str

The raw text to clean.

clean_stop_words: bool

Should English stop words be removed from the raw text or not. Default: False (do not remove stop words)

clean_emojis: bool

Should emojis, emoticons, pictograms, and other characters be removed. Default: False (do not remove pictograms)

Returns:
cleaned_text: str

The cleaned text.

cdp_backend.utils.string_utils.convert_gcs_json_url_to_gsutil_form(url: str) str[source]#

Convert a GCS JSON API url to its corresponding gsutil uri.

Parameters:
url: str

The url in GCS JSON API form.

Returns:
gsutil_url: str

The url in gsutil form. Returns empty string if the input url doesn’t match the form.
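As a rough illustration, GCS JSON API URLs carry the bucket and object name in a `/b/<bucket>/o/<object>` path segment, which maps onto `gs://<bucket>/<object>`. The pattern below is an assumption about the accepted form and may be looser or stricter than the library's own check; `sketch_gcs_json_to_gsutil` is a hypothetical name, and the bucket and object in the usage are made up.

```python
import re
from urllib.parse import unquote

# Assumed shape: .../b/<bucket>/o/<object>[?query]
_GCS_JSON_PATTERN = re.compile(r"/b/(?P<bucket>[^/]+)/o/(?P<name>[^?]+)")


def sketch_gcs_json_to_gsutil(url: str) -> str:
    """Return the gs:// form of a JSON API URL, or "" if it does not match."""
    match = _GCS_JSON_PATTERN.search(url)
    if match is None:
        return ""
    # Object names are percent-encoded in JSON API URLs.
    return f"gs://{match.group('bucket')}/{unquote(match.group('name'))}"


gsutil_uri = sketch_gcs_json_to_gsutil(
    "https://storage.googleapis.com/download/storage/v1/b/example-bucket/o/video.mp4?alt=media"
)
```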

cdp_backend.utils.string_utils.remove_emojis(text: str) str[source]#

Remove emojis from the provided text. Adapted with minor changes from this Stack Overflow answer: https://stackoverflow.com/a/58356570.

Module contents#

Utilities package for cdp_backend.