cdp_data package

Submodules

cdp_data.constants module

cdp_data.datasets module

cdp_data.datasets.convert_transcript_to_dataframe(transcript: str | Path | Transcript) → DataFrame

Create a DataFrame containing only the sentence data from the provided transcript.

Parameters:
transcript: Union[str, Path, Transcript]

The transcript to pull all sentences from.

Returns:
pd.DataFrame:

The sentences of the transcript.
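
A minimal usage sketch (the transcript path is illustrative):

>>> from cdp_data import datasets
>>> sentences = datasets.convert_transcript_to_dataframe("transcript.json")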

cdp_data.datasets.get_session_dataset(infrastructure_slug: str, start_datetime: str | datetime | None = None, end_datetime: str | datetime | None = None, sample: int | float | None = None, replace_py_objects: bool = False, store_full_metadata: bool = False, store_transcript: bool = False, transcript_selection: str = 'created', store_transcript_as_csv: bool = False, store_video: bool = False, store_audio: bool = False, cache_dir: str | Path | None = None, raise_on_error: bool = True, tqdm_kws: Dict[str, Any] | None = None) → DataFrame

Get a dataset of sessions from a CDP infrastructure.

Parameters:
infrastructure_slug: str

The CDP infrastructure to connect to and pull sessions for.

start_datetime: Optional[Union[str, datetime]]

An optional datetime that the session dataset will start at. Default: None (no datetime beginning bound on the dataset)

end_datetime: Optional[Union[str, datetime]]

An optional datetime that the session dataset will end at. Default: None (no datetime end bound on the dataset)

sample: Optional[Union[int, float]]

An optional sample of the dataset to return. If an int, the number of rows to return. If a float, the percentage of rows to return. Default: None (return all rows)

replace_py_objects: bool

Replace any non-standard Python types with standard ones so the returned data is ready for storage. See 'See Also' for more details. Default: False (keep Python objects in the DataFrame)

store_full_metadata: bool

Should a JSON file of the full event metadata be stored to disk and a path to the stored JSON file be added to the returned DataFrame? Default: False (do not request extra data and store to disk). Currently not implemented.

store_transcript: bool

Should a session transcript be requested and stored to disk, with a path to the stored transcript JSON file added to the returned DataFrame? Default: False (do not request extra data and do not store the transcript)

transcript_selection: str

How should the single transcript be selected? Default: "created" (return the most recently created transcript per session)

store_transcript_as_csv: bool

Additionally convert and store all transcripts as CSVs. Does nothing if store_transcript is False. Default: False (do not additionally convert and store CSVs)

store_video: bool

Should the session video be requested and stored to disk, with a path to the stored video file added to the returned DataFrame? Note: the video is stored without a file extension; however, the video will always be either mp4 or webm. Default: False (do not request and store the video)

store_audio: bool

Should the session audio be requested and stored to disk, with a path to the stored audio file added to the returned DataFrame? Default: False (do not request and store the audio)

cache_dir: Optional[Union[str, Path]]

An optional directory path to cache the dataset. Directory is created if it does not exist. Default: “./cdp-datasets”

raise_on_error: bool

Should any failure to pull files result in an error or be ignored? Default: True (raise on any failure)

tqdm_kws: Dict[str, Any]

A dictionary with extra keyword arguments to provide to tqdm progress bars. Must not include the desc keyword argument.

Returns:
dataset: pd.DataFrame

The dataset with all additions requested.

See also

replace_dataframe_cols_with_storage_replacements

The function used to clean the data of non-standard Python types.

Notes

All file additions (transcript, full event metadata, video, audio, etc.) are cached to disk to avoid repeated downloads. If you use the same cache directory across multiple runs, no new data will be downloaded; the existing files will be used. Caching is done simply by checking file existence, not by content hash comparison.

Datasets are cached with the following structure:

{cache-dir}/
└── {infrastructure_slug}
    ├── event-{event-id-0}
    │   ├── metadata.json
    │   └── session-{session-id-0}
    │       ├── audio.wav
    │       ├── transcript.json
    │       └── video
    ├── event-{event-id-1}
    │   ├── metadata.json
    │   └── session-{session-id-0}
    │       ├── audio.wav
    │       ├── transcript.json
    │       └── video
    └── event-{event-id-2}
        ├── metadata.json
        ├── session-{session-id-0}
        │   ├── audio.wav
        │   ├── transcript.json
        │   └── video
        └── session-{session-id-1}
            ├── audio.wav
            ├── transcript.json
            └── video

To clean a whole dataset, or specific events or sessions, simply delete the associated directory.
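
An illustrative sketch of a typical call (parameter values are examples, not defaults):

>>> from cdp_data import datasets, CDPInstances
>>> ds = datasets.get_session_dataset(
...     infrastructure_slug=CDPInstances.Seattle,
...     start_datetime="2021-01-01",
...     end_datetime="2021-02-01",
...     store_transcript=True,
... )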

cdp_data.datasets.get_vote_dataset(infrastructure_slug: str, start_datetime: str | datetime | None = None, end_datetime: str | datetime | None = None, replace_py_objects: bool = False, tqdm_kws: Dict[str, Any] | None = None) → DataFrame

Get a dataset of votes from a CDP infrastructure.

Parameters:
infrastructure_slug: str

The CDP infrastructure to connect to and pull votes for.

start_datetime: Optional[Union[str, datetime]]

An optional datetime that the vote dataset will start at. Default: None (no datetime beginning bound on the dataset)

end_datetime: Optional[Union[str, datetime]]

An optional datetime that the vote dataset will end at. Default: None (no datetime end bound on the dataset)

replace_py_objects: bool

Replace any non-standard Python types with standard ones so the returned data is ready for storage. See 'See Also' for more details. Default: False (keep Python objects in the DataFrame)

tqdm_kws: Dict[str, Any]

A dictionary with extra keyword arguments to provide to tqdm progress bars. Must not include the desc keyword argument.

Returns:
pd.DataFrame

The dataset requested.

See also

replace_dataframe_cols_with_storage_replacements

The function used to clean the data of non-standard Python types.
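
An illustrative sketch (datetime values are examples):

>>> from cdp_data import datasets, CDPInstances
>>> votes = datasets.get_vote_dataset(
...     infrastructure_slug=CDPInstances.Seattle,
...     start_datetime="2021-01-01",
...     replace_py_objects=True,
... )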

cdp_data.datasets.replace_dataframe_cols_with_storage_replacements(df: DataFrame) → DataFrame

Run various replacement functions against the dataframe to get the data to a point where it can be stored to disk.

Parameters:
df: pd.DataFrame

The data to fix.

Returns:
pd.DataFrame

The updated DataFrame.

See also

replace_db_model_cols_with_id_cols

Function to replace database model column values with their ids.

replace_pathlib_path_cols_with_str_path_cols

Function to replace pathlib Path column values with normal Python strings.
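
A minimal sketch, assuming ds is a DataFrame returned by get_session_dataset:

>>> from cdp_data import datasets
>>> cleaned = datasets.replace_dataframe_cols_with_storage_replacements(ds)
>>> cleaned.to_csv("sessions.csv", index=False)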

cdp_data.datasets.replace_db_model_cols_with_id_cols(df: DataFrame) → DataFrame

Replace all database model column values with the model ID.

Example: an event column containing event models will be replaced by an event_id column containing just the event ID.

Parameters:
df: pd.DataFrame

The data to replace database models with just ids.

Returns:
pd.DataFrame

The updated DataFrame.
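
A minimal sketch, assuming ds is a DataFrame with database model columns (for example, an event column):

>>> from cdp_data import datasets
>>> ds_with_ids = datasets.replace_db_model_cols_with_id_cols(ds)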

cdp_data.datasets.replace_pathlib_path_cols_with_str_path_cols(df: DataFrame) → DataFrame

Replace all pathlib Path column values with string column values.

Example: a transcript_path column containing pathlib Paths will be replaced with a column of normal Python strings.

Parameters:
df: pd.DataFrame

The data to replace paths in.

Returns:
pd.DataFrame

The updated DataFrame.

cdp_data.datasets.save_dataset(df: DataFrame, dest: str | Path) → Path

Helper function to store a dataset to disk, replacing non-standard Python types with storage-ready replacements.

Parameters:
df: pd.DataFrame

The DataFrame to store.

dest: Union[str, Path]

The path to store the data. Must end in “.csv” or “.parquet”.

Returns:
Path:

The path to the stored data.

See also

replace_dataframe_cols_with_storage_replacements

The function used to replace column values.
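
A minimal sketch, assuming ds is a DataFrame returned by one of the dataset functions above (the destination filename is illustrative):

>>> from cdp_data import datasets
>>> stored_path = datasets.save_dataset(ds, "seattle-sessions.parquet")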

cdp_data.instances module

class cdp_data.instances.CDPInstances

Bases: object

Container for CDP instance infrastructure slugs.

Examples

>>> from cdp_data import datasets, CDPInstances
>>> ds = datasets.get_session_dataset(
...     infrastructure_slug=CDPInstances.Seattle
... )
Alameda = 'cdp-alameda-d3dabe54'
Albuquerque = 'cdp-albuquerque-1d29496e'
Atlanta = 'cdp-atlanta-37e7dd70'
Boston = 'cdp-boston-c384047b'
Charlotte = 'cdp-charlotte-98a7c348'
Denver = 'cdp-denver-962aefef'
KingCounty = 'cdp-king-county-b656c71b'
LongBeach = 'cdp-long-beach-49323fe9'
Louisville = 'cdp-louisville-6fd32a38'
Milwaukee = 'cdp-milwaukee-9f60e352'
Missoula = 'missoula-council-data-proj'
MountainView = 'cdp-mountain-view-7c8a47df'
Oakland = 'cdp-oakland-ba81c097'
Portland = 'cdp-portland-d2bbda97'
Richmond = 'cdp-richmond-a3d06941'
SanJose = 'cdp-san-jose-5d9db455'
Seattle = 'cdp-seattle-21723dcf'

cdp_data.keywords module

cdp_data.keywords.compute_ngram_usage_history(infrastructure_slug: str | List[str], ngram_size: int = 1, strict: bool = False, start_datetime: str | datetime | None = None, end_datetime: str | datetime | None = None, cache_dir: str | Path | None = None, raise_on_error: bool = True, tqdm_kws: Dict[str, Any] | None = None) → DataFrame

Pull the minimal data needed for a session dataset for the provided infrastructure and start and end datetimes, then compute the ngram usage history DataFrame.

Parameters:
infrastructure_slug: Union[str, List[str]]

The CDP infrastructure(s) to connect to and pull sessions for.

ngram_size: int

The ngram size to use for counting and calculating usage. Default: 1 (unigrams)

strict: bool

Should all ngrams be left unstemmed for a more strict usage history? Default: False (stem and clean all grams in the dataset)

start_datetime: Optional[Union[str, datetime]]

An optional datetime that the session dataset will start at. Default: None (no datetime beginning bound on the dataset)

end_datetime: Optional[Union[str, datetime]]

An optional datetime that the session dataset will end at. Default: None (no datetime end bound on the dataset)

cache_dir: Optional[Union[str, Path]]

An optional directory path to cache the dataset. Directory is created if it does not exist. Default: “./cdp-datasets”

raise_on_error: bool

Should any failure to pull files result in an error or be ignored? Default: True (raise on any failure)

tqdm_kws: Dict[str, Any]

A dictionary with extra keyword arguments to provide to tqdm progress bars. Must not include the desc keyword argument.

Returns:
gram_usage: pd.DataFrame

A pandas DataFrame of all ngrams found in the data (stemmed and cleaned, or unstemmed and uncleaned), with their counts for each session and their percentage of use, computed as each ngram's use for a day over the sum of all other ngrams used that day.

See also

cdp_data.datasets.get_session_dataset

Function to pull or load a cached session dataset.

cdp_data.plots.plot_ngram_usage_histories

Plot ngram usage history data.

Notes

This function calculates the count and percentage of each ngram used for a day over the sum of all other ngrams used in that day's discussion(s). This is close to, but not exactly the same as, Google's Ngram Viewer: https://books.google.com/ngrams

This function will pull a new session dataset and cache transcripts to the local disk in the provided (or default) cache directory.

It is recommended to cache this dataset after computation because it may take a while to produce, depending on machine resources and availability.
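
An illustrative sketch (datetime values are examples):

>>> from cdp_data import keywords, CDPInstances
>>> gram_usage = keywords.compute_ngram_usage_history(
...     infrastructure_slug=CDPInstances.Seattle,
...     ngram_size=1,
...     start_datetime="2021-01-01",
...     end_datetime="2021-04-01",
... )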

cdp_data.keywords.compute_query_semantic_similarity_history(query: str | List[str], infrastructure_slug: str | List[str], start_datetime: str | datetime | None = None, end_datetime: str | datetime | None = None, cache_dir: str | Path | None = None, embedding_model: str = 'msmarco-distilbert-base-v4', raise_on_error: bool = True, tqdm_kws: Dict[str, Any] | None = None) → DataFrame

Compute the semantic similarity of a query against every sentence of every meeting. The max, min, and mean semantic similarity of each meeting will be returned.

Parameters:
query: Union[str, List[str]]

The query(ies) to compare each sentence against.

infrastructure_slug: Union[str, List[str]]

The CDP infrastructure(s) to connect to and pull sessions for.

start_datetime: Optional[Union[str, datetime]]

The earliest possible datetime for sessions to be retrieved for. If provided as a string, the datetime should be in ISO format.

end_datetime: Optional[Union[str, datetime]]

The latest possible datetime for sessions to be retrieved for. If provided as a string, the datetime should be in ISO format.

cache_dir: Optional[Union[str, Path]]

An optional directory path to cache the dataset. Directory is created if it does not exist. Default: “./cdp-datasets”

embedding_model: str

The sentence transformers model to use for embedding the query and each sentence. Default: "msmarco-distilbert-base-v4". All embedding models are available here: https://www.sbert.net/docs/pretrained-models/msmarco-v3.html. Select any of the "Models tuned for cosine-similarity".

raise_on_error: bool

Should any failure to pull files result in an error or be ignored? Default: True (raise on any failure)

tqdm_kws: Dict[str, Any]

A dictionary with extra keyword arguments to provide to tqdm progress bars. Must not include the desc keyword argument.

Returns:
pd.DataFrame

The min, max, and mean semantic similarity for each event as compared to the query for the events within the datetime range.

Notes

This function requires additional dependencies. Install extra requirements with: pip install cdp-data[transformers].
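
An illustrative sketch (the query text and datetimes are examples):

>>> from cdp_data import keywords, CDPInstances
>>> similarity = keywords.compute_query_semantic_similarity_history(
...     query="housing affordability",
...     infrastructure_slug=CDPInstances.Seattle,
...     start_datetime="2021-01-01",
...     end_datetime="2021-04-01",
... )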

cdp_data.keywords.fill_history_data_with_zeros(data: DataFrame, ngram_col: str, dt_col: str) → DataFrame

A utility function to fill ngram history data with zeros for all missing dates.

Parameters:
data: pd.DataFrame

The ngram history data to fill dates for.

ngram_col: str

The column name in which the ngram is stored.

dt_col: str

The column name in which the datetime is stored.

Returns:
data: pd.DataFrame

A DataFrame containing the original data, with any missing dates filled in and their values set to zero.

See also

cdp_data.plots.prepare_ngram_history_plotting_data

Subsets to plotting only columns and ensures values are sorted and grouped.
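
A minimal sketch, assuming gram_usage was returned by compute_ngram_usage_history; the column names passed here are assumptions for illustration, not documented defaults:

>>> from cdp_data import keywords
>>> filled = keywords.fill_history_data_with_zeros(
...     data=gram_usage,
...     ngram_col="gram",  # assumed column name
...     dt_col="session_datetime",  # assumed column name
... )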

cdp_data.keywords.get_ngram_relevancy_history(ngram: str, strict: bool = False, start_datetime: str | datetime | None = None, end_datetime: str | datetime | None = None, infrastructure_slug: str | None = None) → DataFrame

Pull an n-gram’s relevancy history from a CDP database.

Parameters:
ngram: str

The unigram, bigram, or trigram to retrieve history for.

strict: bool

Should the provided ngram be used for a strict "unstemmed_gram" query or not? Default: False (stem and clean the ngram before querying)

start_datetime: Optional[Union[str, datetime]]

The earliest possible datetime for ngram history to be retrieved for. If provided as a string, the datetime should be in ISO format.

end_datetime: Optional[Union[str, datetime]]

The latest possible datetime for ngram history to be retrieved for. If provided as a string, the datetime should be in ISO format.

infrastructure_slug: Optional[str]

The optional CDP infrastructure slug to connect to. Default: None (you are managing the database connection yourself)

Returns:
ngram_history: pd.DataFrame

A pandas DataFrame of all IndexedEventGrams that match the provided ngram query (stemmed or unstemmed).

See also

cdp_data.keywords.compute_ngram_usage_history

Compute usage history for all ngrams in a specific CDP session dataset. Useful for comparing how much of the discussion consists of specific ngrams.

Notes

This function pulls the TF-IDF (or other future indexed values) score for the provided ngram over time. This is a measure of relevancy to a document, and is not the same as Google's Ngram Viewer, which shows what percentage of literature used the term.
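
An illustrative sketch (the ngram and datetime are examples):

>>> from cdp_data import keywords, CDPInstances
>>> history = keywords.get_ngram_relevancy_history(
...     ngram="housing",
...     start_datetime="2021-01-01",
...     infrastructure_slug=CDPInstances.Seattle,
... )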

cdp_data.plots module

Module contents

Top-level package for cdp_data.

class cdp_data.CDPInstances — alias of cdp_data.instances.CDPInstances (documented above).