cdp_data package

Submodules

cdp_data.constants module

cdp_data.datasets module

cdp_data.datasets.convert_transcript_to_dataframe(transcript: str | Path | Transcript) → DataFrame

Create a DataFrame containing only the sentence data from the provided transcript.

Parameters:
transcript: Union[str, Path, Transcript]

The transcript to pull all sentences from.

Returns:
pd.DataFrame:

The sentences of the transcript.
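
A minimal usage sketch (the transcript path is illustrative):

>>> from cdp_data import datasets
>>> sentences = datasets.convert_transcript_to_dataframe("transcript.json")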

cdp_data.datasets.get_session_dataset(infrastructure_slug: str, start_datetime: str | datetime | None = None, end_datetime: str | datetime | None = None, sample: int | float | None = None, replace_py_objects: bool = False, store_full_metadata: bool = False, store_transcript: bool = False, transcript_selection: str = 'created', store_transcript_as_csv: bool = False, store_video: bool = False, store_audio: bool = False, cache_dir: str | Path | None = None, raise_on_error: bool = True, tqdm_kws: Dict[str, Any] | None = None) → DataFrame

Get a dataset of sessions from a CDP infrastructure.

Parameters:
infrastructure_slug: str

The CDP infrastructure to connect to and pull sessions for.

start_datetime: Optional[Union[str, datetime]]

An optional datetime that the session dataset will start at. Default: None (no datetime beginning bound on the dataset)

end_datetime: Optional[Union[str, datetime]]

An optional datetime that the session dataset will end at. Default: None (no datetime end bound on the dataset)

sample: Optional[Union[int, float]]

An optional sample of the dataset to return. If an int, the number of rows to return. If a float, the percentage of rows to return. Default: None (return all rows)

replace_py_objects: bool

Replace any non-standard Python types with standard ones so the returned data is ready for storage. See 'See Also' for more details. Default: False (keep Python objects in the DataFrame)

store_full_metadata: bool

Should a JSON file of the full event metadata be stored to disk and a path to the stored JSON file be added to the returned DataFrame? Default: False (do not request extra data and store to disk). Currently not implemented.

store_transcript: bool

Should a session transcript be requested and stored to disk, with a path to the stored transcript JSON file added to the returned DataFrame? Default: False (do not request extra data and do not store the transcript)

transcript_selection: str

How should the single transcript be selected? Default: "created" (return the most recently created transcript per session)

store_transcript_as_csv: bool

Additionally convert and store all transcripts as CSVs. Does nothing if store_transcript is False. Default: False (do not additionally convert and store CSVs)

store_video: bool

Should the session video be requested and stored to disk, with a path to the stored video file added to the returned DataFrame? Note: the video is stored without a file extension; however, the video will always be either mp4 or webm. Default: False (do not request and store the video)

store_audio: bool

Should the session audio be requested and stored to disk, with a path to the stored audio file added to the returned DataFrame? Default: False (do not request and store the audio)

cache_dir: Optional[Union[str, Path]]

An optional directory path to cache the dataset. Directory is created if it does not exist. Default: “./cdp-datasets”

raise_on_error: bool

Should any failure to pull files result in an error or be ignored? Default: True (raise on any failure)

tqdm_kws: Dict[str, Any]

A dictionary with extra keyword arguments to provide to tqdm progress bars. Must not include the desc keyword argument.

Returns:
dataset: pd.DataFrame

The dataset with all additions requested.

See also

replace_dataframe_cols_with_storage_replacements

The function used to clean the data of non-standard Python types.

Notes

All file additions (transcript, full event metadata, video, audio, etc.) are cached to disk to avoid repeated downloads. If you use the same cache directory across multiple runs, no new data will be downloaded; the existing files will be used. Caching is done simply by checking file existence, not by content hash comparison.

Datasets are cached with the following structure:

{cache-dir}/
└── {infrastructure_slug}
    ├── event-{event-id-0}
    │   ├── metadata.json
    │   └── session-{session-id-0}
    │       ├── audio.wav
    │       ├── transcript.json
    │       └── video
    ├── event-{event-id-1}
    │   ├── metadata.json
    │   └── session-{session-id-0}
    │       ├── audio.wav
    │       ├── transcript.json
    │       └── video
    └── event-{event-id-2}
        ├── metadata.json
        ├── session-{session-id-0}
        │   ├── audio.wav
        │   ├── transcript.json
        │   └── video
        └── session-{session-id-1}
            ├── audio.wav
            ├── transcript.json
            └── video

To clean a whole dataset, or specific events or sessions, simply delete the associated directory.
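
An illustrative sketch of a typical call (parameter values are examples, not defaults):

>>> from cdp_data import datasets, CDPInstances
>>> ds = datasets.get_session_dataset(
...     infrastructure_slug=CDPInstances.Seattle,
...     start_datetime="2021-01-01",
...     end_datetime="2021-02-01",
...     store_transcript=True,
... )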

cdp_data.datasets.get_vote_dataset(infrastructure_slug: str, start_datetime: str | datetime | None = None, end_datetime: str | datetime | None = None, replace_py_objects: bool = False, tqdm_kws: Dict[str, Any] | None = None) → DataFrame

Get a dataset of votes from a CDP infrastructure.

Parameters:
infrastructure_slug: str

The CDP infrastructure to connect to and pull votes for.

start_datetime: Optional[Union[str, datetime]]

An optional datetime that the vote dataset will start at. Default: None (no datetime beginning bound on the dataset)

end_datetime: Optional[Union[str, datetime]]

An optional datetime that the vote dataset will end at. Default: None (no datetime end bound on the dataset)

replace_py_objects: bool

Replace any non-standard Python types with standard ones so the returned data is ready for storage. See 'See Also' for more details. Default: False (keep Python objects in the DataFrame)

tqdm_kws: Dict[str, Any]

A dictionary with extra keyword arguments to provide to tqdm progress bars. Must not include the desc keyword argument.

Returns:
pd.DataFrame

The dataset requested.

See also

replace_dataframe_cols_with_storage_replacements

The function used to clean the data of non-standard Python types.
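
An illustrative sketch (datetime values are examples):

>>> from cdp_data import datasets, CDPInstances
>>> votes = datasets.get_vote_dataset(
...     infrastructure_slug=CDPInstances.Seattle,
...     start_datetime="2021-01-01",
...     replace_py_objects=True,
... )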

cdp_data.datasets.replace_dataframe_cols_with_storage_replacements(df: DataFrame) → DataFrame

Run various replacement functions against the dataframe to get the data to a point where it can be stored to disk.

Parameters:
df: pd.DataFrame

The data to fix.

Returns:
pd.DataFrame

The updated DataFrame.

See also

replace_db_model_cols_with_id_cols

Function to replace database model column values with their ids.

replace_pathlib_path_cols_with_str_path_cols

Function to replace pathlib Path column values with normal Python strings.
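
A minimal sketch, assuming ds is a DataFrame returned by get_session_dataset:

>>> from cdp_data import datasets
>>> cleaned = datasets.replace_dataframe_cols_with_storage_replacements(ds)
>>> cleaned.to_csv("sessions.csv", index=False)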

cdp_data.datasets.replace_db_model_cols_with_id_cols(df: DataFrame) → DataFrame

Replace all database model column values with the model ID.

Example: an event column containing event models will be replaced by an event_id column containing just the event ID.

Parameters:
df: pd.DataFrame

The data to replace database models with just ids.

Returns:
pd.DataFrame

The updated DataFrame.
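
A minimal sketch, assuming ds is a DataFrame with database model columns (for example, an event column):

>>> from cdp_data import datasets
>>> ds_with_ids = datasets.replace_db_model_cols_with_id_cols(ds)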

cdp_data.datasets.replace_pathlib_path_cols_with_str_path_cols(df: DataFrame) → DataFrame

Replace all pathlib Path column values with string column values.

Example: a transcript_path column containing pathlib Paths will be replaced with a column of normal Python strings.

Parameters:
df: pd.DataFrame

The data to replace paths in.

Returns:
pd.DataFrame

The updated DataFrame.

cdp_data.datasets.save_dataset(df: DataFrame, dest: str | Path) → Path

Helper function to store a dataset to disk, replacing non-standard Python types with storage-ready replacements.

Parameters:
df: pd.DataFrame

The DataFrame to store.

dest: Union[str, Path]

The path to store the data. Must end in “.csv” or “.parquet”.

Returns:
Path:

The path to the stored data.

See also

replace_dataframe_cols_with_storage_replacements

The function used to replace column values.
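
A minimal sketch, assuming ds is a DataFrame returned by one of the dataset functions above (the destination filename is illustrative):

>>> from cdp_data import datasets
>>> stored_path = datasets.save_dataset(ds, "seattle-sessions.parquet")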

cdp_data.instances module

class cdp_data.instances.CDPInstances

Bases: object

Container for CDP instance infrastructure slugs.

Examples

>>> from cdp_data import datasets, CDPInstances
>>> ds = datasets.get_session_dataset(
...     infrastructure_slug=CDPInstances.Seattle
... )
Alameda = 'cdp-alameda-d3dabe54'
Albuquerque = 'cdp-albuquerque-1d29496e'
Atlanta = 'cdp-atlanta-37e7dd70'
Boston = 'cdp-boston-c384047b'
Charlotte = 'cdp-charlotte-98a7c348'
Denver = 'cdp-denver-962aefef'
KingCounty = 'cdp-king-county-b656c71b'
LongBeach = 'cdp-long-beach-49323fe9'
Louisville = 'cdp-louisville-6fd32a38'
Milwaukee = 'cdp-milwaukee-9f60e352'
Missoula = 'missoula-council-data-proj'
MountainView = 'cdp-mountain-view-7c8a47df'
Oakland = 'cdp-oakland-ba81c097'
Portland = 'cdp-portland-d2bbda97'
Richmond = 'cdp-richmond-a3d06941'
SanJose = 'cdp-san-jose-5d9db455'
Seattle = 'cdp-seattle-21723dcf'

cdp_data.keywords module

cdp_data.keywords.compute_ngram_usage_history(infrastructure_slug: str | List[str], ngram_size: int = 1, strict: bool = False, start_datetime: str | datetime | None = None, end_datetime: str | datetime | None = None, cache_dir: str | Path | None = None, raise_on_error: bool = True, tqdm_kws: Dict[str, Any] | None = None) → DataFrame

Pull the minimal data needed for a session dataset for the provided infrastructure and start and end datetimes, then compute the ngram usage history DataFrame.

Parameters:
infrastructure_slug: Union[str, List[str]]

The CDP infrastructure(s) to connect to and pull sessions for.

ngram_size: int

The ngram size to use for counting and calculating usage. Default: 1 (unigrams)

strict: bool

Should all ngrams be left unstemmed for a more strict usage history? Default: False (stem and clean all grams in the dataset)

start_datetime: Optional[Union[str, datetime]]

An optional datetime that the session dataset will start at. Default: None (no datetime beginning bound on the dataset)

end_datetime: Optional[Union[str, datetime]]

An optional datetime that the session dataset will end at. Default: None (no datetime end bound on the dataset)

cache_dir: Optional[Union[str, Path]]

An optional directory path to cache the dataset. Directory is created if it does not exist. Default: “./cdp-datasets”

raise_on_error: bool

Should any failure to pull files result in an error or be ignored? Default: True (raise on any failure)

tqdm_kws: Dict[str, Any]

A dictionary with extra keyword arguments to provide to tqdm progress bars. Must not include the desc keyword argument.

Returns:
gram_usage: pd.DataFrame

A pandas DataFrame of all ngrams found in the data (stemmed and cleaned, or unstemmed and uncleaned), with their counts for each session and their percentage of use, computed as each ngram's use for a day over the sum of all other ngrams used that day.

See also

cdp_data.datasets.get_session_dataset

Function to pull or load a cached session dataset.

cdp_data.plots.plot_ngram_usage_histories

Plot ngram usage history data.

Notes

This function calculates the count and percentage of each ngram used for a day over the sum of all other ngrams used in that day's discussion(s). This is close to, but not exactly the same as, Google's Ngram Viewer: https://books.google.com/ngrams

This function will pull a new session dataset and cache transcripts to the local disk in the provided (or default) cache directory.

It is recommended to cache this dataset after computation because it may take a while to produce, depending on machine resources and availability.
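
An illustrative sketch (datetime values are examples):

>>> from cdp_data import keywords, CDPInstances
>>> gram_usage = keywords.compute_ngram_usage_history(
...     infrastructure_slug=CDPInstances.Seattle,
...     ngram_size=1,
...     start_datetime="2021-01-01",
...     end_datetime="2021-04-01",
... )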

cdp_data.keywords.compute_query_semantic_similarity_history(query: str | List[str], infrastructure_slug: str | List[str], start_datetime: str | datetime | None = None, end_datetime: str | datetime | None = None, cache_dir: str | Path | None = None, embedding_model: str = 'msmarco-distilbert-base-v4', raise_on_error: bool = True, tqdm_kws: Dict[str, Any] | None = None) → DataFrame

Compute the semantic similarity of a query against every sentence of every meeting. The max, min, and mean semantic similarity of each meeting will be returned.

Parameters:
query: Union[str, List[str]]

The query(ies) to compare each sentence against.

infrastructure_slug: Union[str, List[str]]

The CDP infrastructure(s) to connect to and pull sessions for.

start_datetime: Optional[Union[str, datetime]]

The earliest possible datetime for sessions to be retrieved for. If provided as a string, the datetime should be in ISO format.

end_datetime: Optional[Union[str, datetime]]

The latest possible datetime for sessions to be retrieved for. If provided as a string, the datetime should be in ISO format.

cache_dir: Optional[Union[str, Path]]

An optional directory path to cache the dataset. Directory is created if it does not exist. Default: “./cdp-datasets”

embedding_model: str

The sentence transformers model to use for embedding the query and each sentence. Default: "msmarco-distilbert-base-v4". All embedding models are available here: https://www.sbert.net/docs/pretrained-models/msmarco-v3.html. Select any of the "Models tuned for cosine-similarity".

raise_on_error: bool

Should any failure to pull files result in an error or be ignored? Default: True (raise on any failure)

tqdm_kws: Dict[str, Any]

A dictionary with extra keyword arguments to provide to tqdm progress bars. Must not include the desc keyword argument.

Returns:
pd.DataFrame

The min, max, and mean semantic similarity for each event as compared to the query for the events within the datetime range.

Notes

This function requires additional dependencies. Install extra requirements with: pip install cdp-data[transformers].
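
An illustrative sketch (the query text and datetimes are examples):

>>> from cdp_data import keywords, CDPInstances
>>> similarity = keywords.compute_query_semantic_similarity_history(
...     query="housing affordability",
...     infrastructure_slug=CDPInstances.Seattle,
...     start_datetime="2021-01-01",
...     end_datetime="2021-04-01",
... )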

cdp_data.keywords.fill_history_data_with_zeros(data: DataFrame, ngram_col: str, dt_col: str) → DataFrame

A utility function to fill ngram history data with zeros for all missing dates.

Parameters:
data: pd.DataFrame

The ngram history data to fill dates for.

ngram_col: str

The column name in which the ngram is stored.

dt_col: str

The column name in which the datetime is stored.

Returns:
data: pd.DataFrame

A DataFrame containing the original data, with any missing dates filled in and their values set to zero.

See also

cdp_data.plots.prepare_ngram_history_plotting_data

Subsets to plotting only columns and ensures values are sorted and grouped.
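
A minimal sketch, assuming gram_usage was returned by compute_ngram_usage_history; the column names passed here are assumptions for illustration, not documented defaults:

>>> from cdp_data import keywords
>>> filled = keywords.fill_history_data_with_zeros(
...     data=gram_usage,
...     ngram_col="gram",  # assumed column name
...     dt_col="session_datetime",  # assumed column name
... )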

cdp_data.keywords.get_ngram_relevancy_history(ngram: str, strict: bool = False, start_datetime: str | datetime | None = None, end_datetime: str | datetime | None = None, infrastructure_slug: str | None = None) → DataFrame

Pull an n-gram’s relevancy history from a CDP database.

Parameters:
ngram: str

The unigram, bigram, or trigram to retrieve history for.

strict: bool

Should the provided ngram be used for a strict "unstemmed_gram" query or not? Default: False (stem and clean the ngram before querying)

start_datetime: Optional[Union[str, datetime]]

The earliest possible datetime for ngram history to be retrieved for. If provided as a string, the datetime should be in ISO format.

end_datetime: Optional[Union[str, datetime]]

The latest possible datetime for ngram history to be retrieved for. If provided as a string, the datetime should be in ISO format.

infrastructure_slug: Optional[str]

The optional CDP infrastructure slug to connect to. Default: None (you are managing the database connection yourself)

Returns:
ngram_history: pd.DataFrame

A pandas DataFrame of all IndexedEventGrams that match the provided ngram query (stemmed or unstemmed).

See also

cdp_data.keywords.compute_ngram_usage_history

Compute usage history for all ngrams in a specific CDP session dataset. Useful for comparing how much of the discussion consists of specific ngrams.

Notes

This function pulls the TF-IDF (or other future indexed values) score for the provided ngram over time. This is a measure of relevancy to a document, and is not the same as Google's Ngram Viewer, which shows what percentage of literature used the term.
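
An illustrative sketch (the ngram and datetime are examples):

>>> from cdp_data import keywords, CDPInstances
>>> history = keywords.get_ngram_relevancy_history(
...     ngram="housing",
...     start_datetime="2021-01-01",
...     infrastructure_slug=CDPInstances.Seattle,
... )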

cdp_data.plots module

Module contents

Top-level package for cdp_data.

class cdp_data.CDPInstances — alias of cdp_data.instances.CDPInstances (documented above).